# qrater-web-large-v1.0
A binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).
Built for filtering web search results at scale — drop it into any retrieval or RAG pipeline to keep only pages worth reading.
| Model | Params | Base | Speed | Val Acc | Val F1 |
|---|---|---|---|---|---|
| qrater-web-large-v1.0 | 4B | Qwen3-Embedding-4B | ~15 docs/s | 92.1% | 0.867 |
| qrater-web-base-v1.0 | 0.6B | Qwen3-Embedding-0.6B | ~16 docs/s | 92.4% | 0.873 |
| qrater-web-small-v1.0 | 210M | EuroBERT-210m | ~34 docs/s | 90.6% | 0.843 |
Speed measured on a single A100-80GB with vLLM classify mode, max 4096 tokens.
## What it does
Given a web page (as markdown or plain text), the model predicts:
- clean (label 1) — substantive, readable content suitable for AI consumption
- dirty (label 0) — noise, boilerplate, broken formatting, thin content
## Usage

### Transformers
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-large-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
# [{'label': 'clean', 'score': 0.97}]
```
### vLLM (recommended for throughput)
```python
from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-large-v1.0",
    task="classify",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
```
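The `probs` returned by the classify call can be thresholded to keep only confidently clean pages. A minimal filtering sketch, assuming `prob_clean` is the second entry of each pair (as above); the 0.5 cutoff is an illustrative default, not a recommendation from this card:

```python
def filter_clean(pages, probs, threshold=0.5):
    """Keep pages whose clean-class probability meets the threshold.

    pages: list of page texts
    probs: list of [prob_dirty, prob_clean] pairs, one per page
    """
    return [
        page
        for page, (prob_dirty, prob_clean) in zip(pages, probs)
        if prob_clean >= threshold
    ]

# Example with made-up probabilities:
pages = ["clean article text", "cookie banner noise"]
probs = [[0.08, 0.92], [0.81, 0.19]]
print(filter_clean(pages, probs))  # ['clean article text']
```

Raising the threshold trades recall for precision; a stricter cutoff keeps fewer, higher-confidence pages.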
## Training
- Base model: Qwen/Qwen3-Embedding-4B
- Training data: 10,000 labeled web pages
  - 4,128 samples from live web search results, labeled by Claude
  - 5,872 samples from Common Crawl, labeled by a 27B-parameter classifier
- Target distribution: ~30% clean / ~70% dirty
- Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
- Hardware: 4x A100-80GB with gradient checkpointing
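The effective batch size of 64 presumably combines the per-device batch size, gradient accumulation, and the 4 GPUs; since the card only states the effective value, the split below is an assumed example, not the actual training configuration:

```python
# Hyperparameters as stated in the model card.
hparams = {
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "bf16": True,
    "gradient_checkpointing": True,
}

# Only the effective batch size (64) is given; this split is a guess.
num_gpus = 4
per_device_batch_size = 4        # assumed
gradient_accumulation_steps = 4  # assumed

effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 64
```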
## Label definition
A page is clean if:
- It contains substantive, original content (articles, tutorials, documentation, research papers)
- The main content is intact and readable after markdown conversion
- Minimal boilerplate relative to content
A page is dirty if:
- Dominated by navigation, ads, cookie notices, or login walls
- Thin or auto-generated content with little substance
- Broken formatting or encoding issues that make content unusable
- Primarily lists of links, product listings, or search result pages
## Evaluation
Validation set (1,000 held-out samples, same distribution as training):
- Accuracy: 92.1%
- F1 (clean class): 0.867
Live web search results (99 pages across 10 diverse queries):
- 30% classified clean, roughly in line with the Claude-labeled baseline (~40%) and far more selective than a model trained only on Common Crawl labels (67% clean)
## Smaller models
This model serves as the teacher for the smaller qrater-web models, which are trained via temperature-scaled KL-divergence distillation from the soft probability outputs of this model.
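Temperature-scaled KL distillation can be sketched for the binary case as follows. This is an illustrative pure-Python version (a real training loop would use batched tensor ops), and the temperature value is an assumption; the T² scaling that keeps gradient magnitudes comparable across temperatures follows Hinton et al. (2015):

```python
import math

def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    s = ea + eb
    return ea / s, eb / s

def distill_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so the gradient magnitude is independent of T."""
    t = softmax2(teacher_logits[0] / temperature, teacher_logits[1] / temperature)
    s = softmax2(student_logits[0] / temperature, student_logits[1] / temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(t, s))
    return kl * temperature ** 2

# Matching logits give zero loss; mismatched logits give a positive loss.
print(distill_kl_loss((2.0, -1.0), (2.0, -1.0)))  # 0.0
```

A higher temperature softens both distributions, so the student learns more from the teacher's relative confidence than from the hard label alone.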
## Limitations
- English-only — trained exclusively on English web content
- Max input: 4,096 tokens — longer pages are truncated (the base model supports 40K but training used 4K)
- Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
- Binary classification — does not grade quality on a spectrum
## Citation
```bibtex
@misc{qrater2026,
  title={qrater-web-large-v1.0: Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-large-v1.0}
}
```
## License
Apache 2.0