# qrater-web-large-v1.0
A binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).
Built for filtering web search results at scale — drop it into any retrieval or RAG pipeline to keep only pages worth reading.
| Model | Params | Base | Speed | Val Acc | Val F1 |
|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~15 docs/s | 92.1% | 0.867 |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | 92.4% | 0.873 |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | 90.6% | 0.843 |
Speed measured on a single A100-80GB with vLLM classify mode, max 4096 tokens.
## What it does
Given a web page (as markdown or plain text), the model predicts:
- clean (label 1) — substantive, readable content suitable for AI consumption
- dirty (label 0) — noise, boilerplate, broken formatting, thin content
## Usage

### Transformers
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-large-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
### vLLM (recommended for throughput)
```python
from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-large-v1.0",
    task="classify",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
```
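The `probs` returned by the classify call can be thresholded to keep only confidently clean pages. A minimal filtering sketch, assuming `prob_clean` is the second entry of each pair (as above); the 0.5 cutoff is an illustrative default, not a recommendation from this card:

```python
def filter_clean(pages, probs, threshold=0.5):
    """Keep pages whose clean-class probability meets the threshold.

    pages: list of page texts
    probs: list of [prob_dirty, prob_clean] pairs, one per page
    """
    return [
        page
        for page, (prob_dirty, prob_clean) in zip(pages, probs)
        if prob_clean >= threshold
    ]

# Example with made-up probabilities:
pages = ["clean article text", "cookie banner noise"]
probs = [[0.08, 0.92], [0.81, 0.19]]
print(filter_clean(pages, probs))  # ['clean article text']
```

Raising the threshold trades recall for precision; a stricter cutoff keeps fewer, higher-confidence pages.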
## Training
- Base model: Qwen/Qwen3-Embedding-4B
- Training data: 10,000 labeled web pages
  - 4,128 samples from live web search results, labeled by Claude
  - 5,872 samples from Common Crawl, labeled by a 27B-parameter classifier
- Target distribution: ~30% clean / ~70% dirty
- Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- Hardware: 4x A100-80GB with gradient checkpointing
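The effective batch size of 64 presumably combines the per-device batch size, gradient accumulation, and the 4 GPUs; since the card only states the effective value, the split below is an assumed example, not the actual training configuration:

```python
# Hyperparameters as stated in the model card.
hparams = {
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "bf16": True,
    "gradient_checkpointing": True,
}

# Only the effective batch size (64) is given; this split is a guess.
num_gpus = 4
per_device_batch_size = 4        # assumed
gradient_accumulation_steps = 4  # assumed

effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 64
```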
## Label definition
A page is clean if:
- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Minimal boilerplate relative to content
A page is dirty if:
- Dominated by navigation, ads, cookie notices, or login walls
- Thin or auto-generated content with little substance
- Broken formatting or encoding issues that make content unusable
- Primarily lists of links, product listings, or search result pages
## Evaluation
Validation set (1,000 held-out samples, same distribution as training):
- Accuracy: 92.1%
- F1 (clean class): 0.867
Live web search results (99 pages across 10 diverse queries):
- 30% classified clean, roughly in line with the Claude-labeled baseline (~40%) and far more selective than a model trained only on Common Crawl labels (67% clean)
## Smaller models
This model serves as the teacher for the smaller qrater-web models, which are trained via temperature-scaled KL-divergence distillation from the soft probability outputs of this model.
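Temperature-scaled KL distillation can be sketched for the binary case as follows. This is an illustrative pure-Python version (a real training loop would use batched tensor ops), and the temperature value is an assumption; the T² scaling that keeps gradient magnitudes comparable across temperatures follows Hinton et al. (2015):

```python
import math

def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    s = ea + eb
    return ea / s, eb / s

def distill_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so the gradient magnitude is independent of T."""
    t = softmax2(teacher_logits[0] / temperature, teacher_logits[1] / temperature)
    s = softmax2(student_logits[0] / temperature, student_logits[1] / temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(t, s))
    return kl * temperature ** 2

# Matching logits give zero loss; mismatched logits give a positive loss.
print(distill_kl_loss((2.0, -1.0), (2.0, -1.0)))  # 0.0
```

A higher temperature softens both distributions, so the student learns more from the teacher's relative confidence than from the hard label alone.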
## Limitations
- English-only — trained exclusively on English web content
- Max input: 4,096 tokens — longer pages are truncated (the base model supports 40K but training used 4K)
- Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
- Binary classification — does not grade quality on a spectrum
## Citation
```bibtex
@misc{qrater2026,
  title={qrater-web-large-v1.0: Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-large-v1.0}
}
```
## License
Apache 2.0