
MrBERT-legal Model Card

MrBERT-legal is a multilingual legal foundation model built on the ModernBERT architecture. It is obtained via domain adaptation: all weights are initialized from MrBERT and further trained for 10 epochs on a domain-specific legal corpus of 8B tokens (20.5% Spanish, 79.5% English).

Technical Description

Technical details of the MrBERT-legal model.

| Description | Value |
|---|---|
| Model Parameters | 308M |
| Tokenizer Type | SPM |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Context length | 8,192 |
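As a rough sanity check on these figures, the parameter count and precision together imply the approximate checkpoint size. A minimal sketch, assuming 2 bytes per bfloat16 parameter (the byte count is a property of the format, not a figure from this card):

```python
# Approximate checkpoint size implied by the table above:
# 308M parameters stored in bfloat16 (2 bytes each).
n_params = 308_000_000
bytes_per_param = 2  # bfloat16
size_gib = n_params * bytes_per_param / 1024**3
print(f"{size_gib:.2f} GiB")  # ~0.57 GiB
```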

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 3e-3 |
| Learning Rate Scheduler | Cosine |
| Warmup | 8,000,000,000 tokens |
| Optimizer | decoupled_stableadamw |
| Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-6) |
| Weight Decay | 1e-5 |
| Global Batch Size | 512 |
| Dropout | 0.1 |
| Activation Function | GeLU |
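The warmup above is specified in tokens rather than optimizer steps. Under the assumption that every sequence in a batch is packed to the full 8,192-token context (an assumption for illustration; the card states warmup only in tokens), it converts to steps as sketched below, with the cosine decay following:

```python
import math

# Convert the token-based warmup into optimizer steps, assuming every
# sequence in the global batch is packed to the full 8,192-token context
# (an assumption -- the card specifies warmup only in tokens).
tokens_per_step = 512 * 8192                       # global batch x context length
warmup_steps = 8_000_000_000 // tokens_per_step
print(warmup_steps)  # ~1907 steps

def lr_at(step, total_steps, peak_lr=3e-3):
    """Linear warmup to peak_lr, then cosine decay to zero.

    total_steps is illustrative; the card does not report it.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```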

How to use

```python
>>> from transformers import pipeline
>>> from pprint import pprint

>>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-legal')

>>> pprint(unmasker("La parte demandante presentó una<mask>para reclamar daños y perjuicios.", top_k=3))
[{'score': 0.67013019323349,
  'sequence': 'La parte demandante presentó una demanda para reclamar daños y '
              'perjuicios.',
  'token': 28122,
  'token_str': 'demanda'},
 {'score': 0.23289400339126587,
  'sequence': 'La parte demandante presentó una acción para reclamar daños y '
              'perjuicios.',
  'token': 39344,
  'token_str': 'acción'},
 {'score': 0.06652028113603592,
  'sequence': 'La parte demandante presentó una solicitud para reclamar daños '
              'y perjuicios.',
  'token': 97479,
  'token_str': 'solicitud'}]
>>> pprint(unmasker("The plaintiff filed a<mask>to claim damages.", top_k=3))
[{'score': 0.23366118967533112,
  'sequence': 'The plaintiff filed a notice to claim damages.',
  'token': 46125,
  'token_str': 'notice'},
 {'score': 0.20604448020458221,
  'sequence': 'The plaintiff filed a declaration to claim damages.',
  'token': 181894,
  'token_str': 'declaration'},
 {'score': 0.1953258514404297,
  'sequence': 'The plaintiff filed a statement to claim damages.',
  'token': 53164,
  'token_str': 'statement'}]
```
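Each pipeline call returns a list of candidate dicts sorted by score. A minimal sketch of selecting the top fill programmatically, with the values hard-coded from the Spanish example above so it runs without downloading the model:

```python
# Candidates as returned by the fill-mask pipeline for the Spanish example
# above (scores and strings copied from its output; no model download needed).
candidates = [
    {"score": 0.6701, "token_str": "demanda"},
    {"score": 0.2329, "token_str": "acción"},
    {"score": 0.0665, "token_str": "solicitud"},
]

best = max(candidates, key=lambda c: c["score"])
print(best["token_str"])  # -> demanda
```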

Evaluation

In addition to the MrBERT family, the following base foundation models were considered:

| Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
|---|---|---|---|
| xlm-roberta-base | 279M | 250K | Foundational RoBERTa model pretrained on CommonCrawl data covering 100 languages. |
| mRoBERTa | 283M | 256K | RoBERTa base model pretrained on 35 European languages with a larger vocabulary. |
| mmBERT | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
| legal-bert-base-uncased | 110M | 31K | BERT-base model pre-trained on legal corpora. |

The benchmarks used for comparison are:

  • MTEB: We select a subset of MTEB that evaluates legal tasks in Spanish and English.
  • LexBOE: A Spanish legal topic-classification dataset built from "Boletín Oficial del Estado" (BOE) articles, organized into 14 unified legal categories chosen to capture the thematic structure and linguistic characteristics of Spanish legal discourse. The reported metric is accuracy.
  • EurLEX ('en' split): A multi-label legal topic-classification dataset from the LexGLUE benchmark, built from EU legislative documents annotated with EuroVoc concepts, aiming to capture the characteristics of European legal language. The reported metric is accuracy.

Results

| Task Name | Task Type | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) | legal-bert-base-uncased (110M) | MrBERT-legal (308M) |
|---|---|---|---|---|---|---|
| LexBOE (ES) | Text Classification | 96.84 | 97.02 | 97.28 | 95.36 | 96.80 |
| small-spanish-legal-dataset (ES) | Retrieval | 42.58 | 40.78 | 46.92 | 19.79 | 38.75 |
| EURLEX (EN) | Text Classification | 97.43 | 97.40 | 97.41 | 97.42 | 97.33 |
| AILAStatutes (EN) | Retrieval | 14.31 | 13.90 | 12.28 | 13.49 | 16.33 |
| legal_summarization (EN) | Retrieval | 53.33 | 53.84 | 46.41 | 52.40 | 55.05 |
| LegalBench (EN) | Retrieval | 60.15 | 58.88 | 58.26 | 63.42 | 58.04 |
| NanoTouche2020 (EN) | Retrieval | 34.03 | 44.15 | 31.18 | 34.48 | 44.74 |
| Average (EN) | All Tasks | 51.85 | 53.63 | 49.11 | 52.24 | 54.30 |
| Average (EN + ES) | All Tasks | 56.95 | 58.00 | 55.68 | 53.77 | 58.15 |
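For the retrieval rows, encoder models are typically used by mean-pooling token embeddings into a sentence vector and ranking documents by cosine similarity. A minimal sketch with random stand-in embeddings; the pooling scheme and dimensions are illustrative assumptions, as the card does not specify how embeddings were extracted for these benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(token_embs, mask):
    """Average token embeddings over non-padding positions."""
    mask = mask[:, None].astype(token_embs.dtype)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cosine(a, b):
    """Cosine similarity between two pooled sentence vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in token embeddings: (seq_len, hidden) plus an attention mask.
# In practice these would come from the encoder's last hidden state.
query_embs = rng.normal(size=(5, 8))
doc_embs = rng.normal(size=(7, 8))
q = mean_pool(query_embs, np.ones(5, dtype=int))
d = mean_pool(doc_embs, np.ones(7, dtype=int))
print(round(cosine(q, d), 3))  # similarity in [-1, 1]
```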

Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright (c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been supported and funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – funded by the EU through NextGenerationEU, within the framework of the Modelos del Lenguaje project, as well as by the European Union – NextGenerationEU. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions through data contributions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners, CENID, HiTZ and CiTIUS, for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano, the "Instituto de Ingeniería del Conocimiento" and the "Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)" of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@article{tamayo2026mrbert,
  title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
  author={Tamayo, Daniel and Lacunza, I{\~n}aki and Rivera-Hidalgo, Paula and Da Dalt, Severino and Aula-Blasco, Javier and Gonzalez-Agirre, Aitor and Villegas, Marta},
  journal={arXiv preprint arXiv:2602.21379},
  year={2026}
}

License

Apache License, Version 2.0
