GLaRe - GeezLab Research Program

What We Do

We care about languages that don't yet have the tools they deserve.

Fundamental Research

We study AI and NLP problems specific to Tigrinya, Tigre, and other Eritrean languages. Most of these languages are severely underserved by existing tools, and we want to change that.

Open Resources

Everything we build, we share. Datasets, models, code, and tools are published openly so others can use them, build on them, and push the work further.

Support Researchers

We support students and researchers who work on language technology without institutional backing. Good ideas shouldn't go unsupported because of where the people behind them come from.

Research Projects

Work conducted or supported by GLaRe:

Dahlak LMs (In Progress)

A family of language models trained on Eritrean languages. The goal is a set of base models that work well for Tigrinya, Tigre, and related languages out of the box.

Language Models, Tigrinya, Tigre

TiALD (NeurIPS 2025)

Tigrinya Abusive Language Detection: a multi-task benchmark with 13.7K annotated YouTube comments covering abusiveness, sentiment, and topic classification in both Ge'ez script and Latin transliterations.

Content Moderation, Multi-Task, Dataset

Read Paper → View Project →

TiQuAD (ACL 2023)

Tigrinya Question Answering Dataset and Models: the first human-annotated question answering benchmark dataset for Tigrinya, with 10.6K QA pairs drawn from 290 news articles. The paper received an Outstanding Paper Award at ACL.

Question Answering, Dataset, Tigrinya

Read Paper → View Project →

GeezSwitch (LREC 2022)

A benchmark dataset and evaluation for language identification (LI), targeting five typologically and phylogenetically related low-resource East African languages that use the Ge'ez script.

Language ID, Dataset, Ge'ez Script

Read Paper → View Project →

Tigrinya PLMs (WiNLP 2021)

Monolingual pre-trained language models for Tigrinya (TiRoBERTa, TiBERT, TiELECTRA), trained to provide strong baseline representations for downstream tasks.

Language Models, Pre-training, Tigrinya

View Paper → View Project →

TLMD Dataset

Tigrinya Language Modeling Dataset: a large-scale monolingual corpus collected from news articles, blogs, and books, with approximately 40 million tokens for training language models.

Corpus, Language Modeling, Tigrinya

Dataset →

Verbalizing Numbers (GitHub)

Rules and algorithms for converting numbers to written Tigrinya and back. Useful for text-to-speech, accessibility, and localization. Released as an open-source Python package.

Number Verbalization TTS

Read Paper → View Project →

GeezLab OCR Dataset

GLOCR is a large-scale open-source dataset for text recognition and optical character recognition (OCR) in Tigrinya, with over 660,000 image-label pairs.

OCR, Dataset, Tigrinya

Dataset → View Project →

Analogy Test Dataset

A Tigrinya adaptation of the classic Google Analogy Test Set, containing over 18K entries for empirically evaluating how well word-embedding models capture semantic and syntactic relations.

Evaluation, Word Embeddings, Tigrinya

View Project →
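
For readers unfamiliar with analogy-based evaluation, here is a minimal sketch of how a set like the Analogy Test Dataset is typically scored. It uses the common vector-offset method (pick the word closest to b - a + c), which is an assumption here and not necessarily the project's own evaluation script. The function names and the toy English embeddings are hypothetical placeholders; a real run would load Tigrinya embeddings and the dataset's own entries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the vocabulary word (excluding a, b, c) whose vector is
    closest in cosine similarity to emb[b] - emb[a] + emb[c].
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Share of correctly answered questions; out-of-vocabulary ones are skipped."""
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Toy embeddings (hypothetical, English placeholders for illustration only).
toy_embeddings = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.5, 0.0],
}

toy_questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "woman", "prince", "princess"),  # skipped: words not in vocabulary
]

print(solve_analogy(toy_embeddings, "man", "woman", "king"))  # queen
print(analogy_accuracy(toy_embeddings, toy_questions))        # 1.0
```

Skipping out-of-vocabulary questions is one convention among several; counting them as errors gives a stricter score, so reported numbers should state which was used.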

Tigrinya-BiLexicon (GitHub)

A statistically generated English-Tigrinya bilingual lexicon, built from parallel corpora without human supervision to aid research in low-resource settings.

Lexicon, Bilingual

View Project →

Word Frequencies (GitHub)

Comprehensive word-frequency lists and curated stop-word lists for both Tigrinya and Tigre, intended to support foundational NLP research.

Lexical Data, Tigrinya, Tigre

Tigrinya → Tigre →

Tigrinya Anthology (GitHub)

A review of over 50 NLP studies on Tigrinya published between 2011 and 2025, covering machine translation, morphology, QA, speech, and more. Also maintained as an open bibliography on GitHub.

Review, Anthology, Tigrinya

Read Paper → View Project →

Research in AI & NLP for Native Languages
