GLaRe is the research arm of GeezLab. We work on AI and Natural Language Processing for Eritrean languages, and support students and researchers who are underrepresented in the field.
(glare: to shine with a strong or dazzling light)
We care about languages that don't yet have the tools they deserve.
We study AI and NLP problems specific to Tigrinya, Tigre, and other Eritrean languages. Most of these languages are severely underserved by existing tools, and we want to change that.
Everything we build, we share. Datasets, models, code, and tools are published openly so others can use them, build on them, and push the work further.
We support students and researchers who are working on language technology but lack institutional support. Good ideas shouldn't go unsupported because of where they come from.
Work conducted or supported by GLaRe:
A family of language models trained on Eritrean languages. The goal is to have good base models that work well for Tigrinya, Tigre, and related languages out of the box.
Tigrinya Abusive Language Detection: a multi-task benchmark with 13.7K annotated YouTube comments covering abusiveness, sentiment, and topic classification in both Ge'ez script and Latin transliterations.
Tigrinya Question Answering Dataset and Models: the first human-annotated question answering benchmark dataset for Tigrinya, with 10.6K QA pairs from 290 news articles. The work received an Outstanding Paper Award at ACL.
A benchmark dataset and evaluation for Language Identification (LI) targeting five typologically and phylogenetically related low-resourced East African languages that use the Ge'ez script.
Monolingual pre-trained language models for Tigrinya (TiRoBERTa, TiBERT, TiELECTRA), trained to provide strong baseline representations for downstream tasks.
Tigrinya Language Modeling Dataset: a large-scale monolingual dataset collected from news articles, blogs, and books, featuring approximately 40 million tokens for training language models.
Rules and algorithms for converting numbers to written Tigrinya and back. Useful for text-to-speech, accessibility, and localization. Released as an open-source Python package.
GLOCR is a large-scale open-source dataset for Text Recognition and Optical Character Recognition (OCR) of the Tigrinya language, featuring over 660,000 image-label pairs.
A Tigrinya adaptation of the classic Google Analogy Test set, containing over 18K entries to empirically evaluate the semantic and syntactic qualities of word-embedding models.
A statistically generated bilingual lexicon between English and Tigrinya, built using parallel corpora without human supervision to aid research in low-resource environments.
Comprehensive word count compilations and stop-word lists curated for both Tigrinya and Tigre to support foundational NLP research.
A review of over 50 NLP studies on Tigrinya published between 2011 and 2025, covering machine translation, morphology, QA, speech, and more. Also maintained as an open bibliography on GitHub.
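Several of the resources above deal with text written either in Ge'ez script or in Latin transliteration. As a rough illustration of what separating the two can look like, here is a minimal sketch that labels a string by the script of its letters, using Unicode character names from Python's standard library. The function name `script_of` and its four labels are hypothetical choices made for this example; this is not the method used in any of the projects listed, and it does not identify which language a text is in.

```python
import unicodedata


def script_of(text: str) -> str:
    """Label text as 'geez', 'latin', 'mixed', or 'other' by its letters.

    Hypothetical helper for illustration only: it inspects Unicode
    character names and ignores digits, punctuation, and whitespace.
    """
    geez = 0
    latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces
        name = unicodedata.name(ch, "")
        if name.startswith("ETHIOPIC"):
            geez += 1
        elif name.startswith("LATIN"):
            latin += 1
    if geez and latin:
        return "mixed"
    if geez:
        return "geez"
    if latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    for sample in ["ሰላም", "selam", "ሰላም selam", "123 !?"]:
        print(repr(sample), "->", script_of(sample))
```

A check like this only separates scripts. Telling apart languages that share the Ge'ez script, such as Tigrinya and Tigre, needs a trained language identification model.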
If you are working on NLP or AI for underrepresented languages, we may be able to help. GLaRe provides research resources, mentorship, and funding to selected research projects.
Send a brief proposal to research@geezlab.com with your abstract and what kind of support you need.