
qrater-web-large-v1.0

A binary text classifier that distinguishes clean, usable web content from noisy web pages (boilerplate, ads, nav menus, cookie banners, login walls, paywalls, etc.).

Built for filtering web search results at scale — drop it into any retrieval or RAG pipeline to keep only pages worth reading.

Model                 | Params | Base                 | Speed      | Val Acc | Val F1
qrater-web-large-v1.0 | 4B     | Qwen3-Embedding-4B   | ~15 docs/s | 92.1%   | 0.867
qrater-web-base-v1.0  | 0.6B   | Qwen3-Embedding-0.6B | ~16 docs/s | 92.4%   | 0.873
qrater-web-small-v1.0 | 210M   | EuroBERT-210m        | ~34 docs/s | 90.6%   | 0.843

Speed measured on a single A100-80GB with vLLM classify mode, max 4096 tokens.
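
To put the throughput column in perspective, here is a back-of-the-envelope sketch that converts docs/s into wall-clock hours for a large batch. The function name `estimated_hours` and the 1M-page workload are illustrative assumptions, and real throughput will vary with page length and hardware.

```python
# Approximate throughput figures from the table above (docs per second).
DOCS_PER_SECOND = {
    "qrater-web-large-v1.0": 15,
    "qrater-web-base-v1.0": 16,
    "qrater-web-small-v1.0": 34,
}


def estimated_hours(num_docs: int, docs_per_second: float) -> float:
    """Wall-clock hours to classify num_docs on one GPU at a steady rate."""
    return num_docs / docs_per_second / 3600


# Hypothetical workload: one million pages on a single GPU.
for name, rate in DOCS_PER_SECOND.items():
    print(f"{name}: {estimated_hours(1_000_000, rate):.1f} h per 1M pages")
```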

What it does

Given a web page (as markdown or plain text), the model predicts:

  • clean (label 1) — substantive, readable content suitable for AI consumption
  • dirty (label 0) — noise, boilerplate, broken formatting, thin content

Usage

Transformers

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="chonkie-ai/qrater-web-large-v1.0",
    torch_dtype="bfloat16",
    device_map="auto",
)

result = pipe("# How DNS Works\n\nDNS resolution starts when...")
print(result)
# Example output (the score is illustrative): [{'label': 'clean', 'score': 0.97}]

vLLM (recommended for throughput)

from vllm import LLM

model = LLM(
    "chonkie-ai/qrater-web-large-v1.0",
    task="classify",
    dtype="bfloat16",
    max_model_len=4096,
)

outputs = model.classify(["your web page text here"])
probs = outputs[0].outputs.probs  # [prob_dirty, prob_clean]
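
Once the probabilities are available, filtering a batch is a simple threshold on the clean probability. The following is a minimal sketch, not part of the model's API: the helper `keep_clean` and the 0.5 cut-off are assumptions, and the example probabilities are made up. It expects each entry in the `[prob_dirty, prob_clean]` order noted above.

```python
from typing import List, Sequence, Tuple


def keep_clean(
    pages: Sequence[str],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not specified by the model card
) -> List[Tuple[str, float]]:
    """Return (page, prob_clean) pairs whose clean probability meets the threshold."""
    if len(pages) != len(probs):
        raise ValueError("pages and probs must have the same length")
    kept = []
    for page, (prob_dirty, prob_clean) in zip(pages, probs):
        if prob_clean >= threshold:
            kept.append((page, prob_clean))
    return kept


# Made-up probabilities standing in for real model outputs.
pages = ["# How DNS Works ...", "Accept cookies | Log in | Subscribe"]
probs = [[0.03, 0.97], [0.91, 0.09]]
print(keep_clean(pages, probs))
```

With vLLM, the `probs` argument could be built as `[o.outputs.probs for o in outputs]`. Raising the threshold trades recall for precision on the clean class.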

Training

  • Base model: Qwen/Qwen3-Embedding-4B
  • Training data: 10,000 labeled web pages
    • 4,128 samples from live web search results, labeled by Claude
    • 5,872 samples from Common Crawl, labeled by a 27B parameter classifier
    • Target distribution: ~30% clean / ~70% dirty
  • Hyperparameters: 3 epochs, lr=5e-5, effective batch size 64, bf16 + Flash Attention 2, weight decay 0.01, warmup ratio 0.1
  • Hardware: 4x A100-80GB with gradient checkpointing
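
The hyperparameters above imply a rough optimizer-step budget, which the sketch below works out. It rests on several assumptions the card does not confirm: that all 10,000 samples are used for training, that warmup steps are rounded up, and that the effective batch is split evenly across the four GPUs.

```python
import math

NUM_SAMPLES = 10_000   # from the card; assumes every sample is used for training
EPOCHS = 3
EFFECTIVE_BATCH = 64
NUM_GPUS = 4
WARMUP_RATIO = 0.1

# Optimizer steps per epoch; the last partial batch still counts as a step.
steps_per_epoch = math.ceil(NUM_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS

# Warmup length, rounded up (an assumed rounding convention).
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

# Samples each GPU contributes per optimizer step, i.e.
# per-device batch size times gradient accumulation steps.
per_gpu_samples_per_step = EFFECTIVE_BATCH // NUM_GPUS

print(steps_per_epoch, total_steps, warmup_steps, per_gpu_samples_per_step)
```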

Label definition

A page is clean if:

  • It contains substantive, original content (articles, tutorials, documentation, research papers)
  • Its main content is intact and readable after markdown conversion
  • It has minimal boilerplate relative to content

A page is dirty if:

  • It is dominated by navigation, ads, cookie notices, or login walls
  • It has thin or auto-generated content with little substance
  • It has broken formatting or encoding issues that make the content unusable
  • It consists primarily of lists of links, product listings, or search results

Evaluation

Validation set (1,000 held-out samples, same distribution as training):

  • Accuracy: 92.1%
  • F1 (clean class): 0.867
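
Because the data is roughly 30% clean and 70% dirty, accuracy and clean-class F1 can diverge, which is why both are reported. The sketch below shows how the two metrics are computed from binary labels. The function name and the toy labels are illustrative, and this is not the evaluation script used for the numbers above.

```python
from typing import Sequence, Tuple


def accuracy_and_clean_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Accuracy over both classes and F1 for the clean class (label 1)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels: 1 = clean, 0 = dirty.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
print(accuracy_and_clean_f1(y_true, y_pred))
```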

Live web search results (99 pages across 10 diverse queries):

  • 30% classified clean, close to the Claude-labeled baseline (~40%) and considerably more selective than a variant trained only on Common Crawl data (67% clean)

Smaller models

This model serves as the teacher for the smaller qrater-web models, which are trained via temperature-scaled KL-divergence distillation from the soft probability outputs of this model.
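
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of a temperature-scaled KL-divergence distillation loss for a single example. The temperature value, the KL direction (teacher to student), and the T² scaling follow common convention and are assumptions. The card does not specify them, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soften_probs(probs: Sequence[float], temperature: float, eps: float = 1e-12) -> List[float]:
    """Temperature-scale a probability vector by treating log-probs as logits."""
    logits = [math.log(max(p, eps)) for p in probs]
    return softmax_with_temperature(logits, temperature)


def distillation_kl(
    teacher_probs: Sequence[float],   # e.g. [prob_dirty, prob_clean] from the teacher
    student_logits: Sequence[float],  # raw student outputs for the same page
    temperature: float = 2.0,         # assumed value, not stated in the card
    eps: float = 1e-12,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = soften_probs(teacher_probs, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# A student that disagrees with the teacher incurs a positive loss.
print(distillation_kl([0.1, 0.9], [2.0, -2.0]))
```

A higher temperature flattens both distributions, so the student also learns from how confident the teacher is rather than only from its hard label.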

Limitations

  • English-only — trained exclusively on English web content
  • Max input: 4,096 tokens — longer pages are truncated (the base model supports 40K but training used 4K)
  • Optimized for informational content — may be less calibrated on creative writing, social media, or e-commerce pages
  • Binary classification — does not grade quality on a spectrum

Citation

@misc{qrater2026,
  title={qrater-web-large-v1.0: Web Content Quality Classifier},
  author={Bhavnick Minhas},
  year={2026},
  url={https://huggingface.co/chonkie-ai/qrater-web-large-v1.0}
}

License

Apache 2.0
