Last update: 2026-03-16

SEA-LION is a collection of Large Language Models (LLMs) and encoders which have been pretrained and fine-tuned for the Southeast Asia (SEA) region.

The SEA-LION-E5-Embedding-600M model is a Sentence Transformer optimised for 11 Southeast Asian languages: Burmese, Chinese, English, Filipino, Indonesian, Khmer, Lao, Malay, Tamil, Thai, and Vietnamese. It is fine-tuned from multilingual-e5-large, which builds on the XLM-RoBERTa architecture pretrained on 100 languages, and maps sentences and paragraphs to a 1024-dimensional dense vector space. The model is designed for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and Retrieval-Augmented Generation (RAG) workflows.

Citation

BibTeX:

@misc{limkonchotiwat2026seaembedding,
  title        = {SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia},
  author       = {Limkonchotiwat, Peerat and Ng, Raymond and Nutanong, Sarana and Ngui, Jian Gang},
  year         = {2026},
  eprint       = {2606.03027},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2606.03027}
}

Model Details

Model Description

The SEA-LION-E5-Embedding-600M model is a Sentence Transformer built on the Multilingual E5 text embedding model (multilingual-e5-large), which was itself initialised from xlm-roberta-large.

Model Type: Sentence Transformer
Base Architecture: E5 (Transformer Encoder)
Developed by: AI Products Pillar, AI Singapore
Funded by: Singapore NRF
Shared by: AI Products Pillar, AI Singapore
Context length: 512
Languages: Burmese, Chinese, English, Filipino, Indonesian, Khmer, Lao, Malay, Tamil, Thai, and Vietnamese
License: MIT
Finetuned from model: multilingual-e5-large

Model Sources

Documentation: Sentence Transformers Documentation
Repository: aisingapore/SEA-LION-E5-Embedding-600M

Uses

This model card details SEA-LION-E5-Embedding-600M, one of the variants available within this collection. If you are deploying our models for your specific use case, we would love to hear from you! Please feel free to contact us to share your experience or explore potential collaborations.

Model Variant	Model Repository	Suggested Applications & Use Cases
Fine-tuned Embedding Models	- aisingapore/SEA-LION-E5-Embedding-600M - aisingapore/SEA-LION-ModernBERT-Embedding-300M - aisingapore/SEA-LION-ModernBERT-Embedding-600M	- Retrieval-Augmented Generation (RAG) - Information retrieval and search - Similarity comparisons
Pre-trained Encoder Models	- aisingapore/SEA-LION-ModernBERT-300M - aisingapore/SEA-LION-ModernBERT-600M	- Fill mask - Text classification - Fine-tuning for downstream tasks (e.g., sentiment analysis, classification).
Pre-trained Model Checkpoints	- aisingapore/SEA-LION-ModernBERT-300M-checkpoints - aisingapore/SEA-LION-ModernBERT-600M-checkpoints	- Continued Pre-Training (CPT) - Fine-tuning for downstream tasks (e.g., sentiment analysis, classification).

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U "sentence-transformers>=3.0.0"

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "aisingapore/SEA-LION-E5-Embedding-600M",
)

sentences = [
    "The weather is lovely today.",
    "อากาศวันนี้ดีมาก",
    "Dia berkendara ke stadion.",
]
embeddings = model.encode(sentences, prompt_name="STS")
print(embeddings.shape)
# (3, 1024)

similarities = model.similarity(embeddings, embeddings)
print(similarities)
#tensor([[1.0000, 0.8594, 0.5534],
#        [0.8594, 1.0000, 0.6343],
#        [0.5534, 0.6343, 1.0000]])
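
In Sentence Transformers, `model.similarity` computes cosine similarity unless the model configuration selects a different function. Assuming this model keeps that default, the matrix above can be reproduced from raw embeddings with a few lines of NumPy. The sketch below uses toy 4-dimensional vectors (hypothetical values, not real model output) so that it runs without downloading the model.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy vectors standing in for real 1024-dimensional embeddings (assumed values).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
])

sims = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(np.round(sims, 4))
```

As in the output above, the diagonal is 1 because every vector is maximally similar to itself, and the matrix is symmetric.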

Training Details

Training Data

This model was tuned using a multi-stage training pipeline with the following datasets:

Contrastive Pre-training: 245 million text pairs (EN-EN and EN-SEA) to enhance cross-lingual alignment.
Fine-tuning: 13 million diverse text pairs (spanning EN-EN, CN-CN, EN-SEA, and SEA-SEA) to create the final fine-tuned model.

Language Pair	Percentage
EN-EN	20%
CN-CN	20%
EN-SEA	10%
SEA-SEA	50%

Training Procedure

Preprocessing

Starting from the multilingual-e5-large base, the model first underwent contrastive pre-training on 245 million text pairs to strengthen cross-lingual alignment, focusing on English-to-English (EN-EN) and English-to-Southeast Asian language (EN-SEA) pairs. It was then fine-tuned on 13 million text pairs spanning EN-EN, CN-CN, EN-SEA, and SEA-SEA to produce the final embedding model.
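
This card does not state the exact training objective. E5-style contrastive training is commonly implemented as an InfoNCE loss with in-batch negatives, so the sketch below illustrates that objective in NumPy under that assumption. The function name, the temperature of 0.05, and the random toy data are all hypothetical and are not taken from the SEA-LION training code.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, passages: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives.

    passages[i] is the positive for queries[i]; every other passage in the
    batch acts as a negative. The temperature is an assumed value.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = anchors + 0.01 * rng.normal(size=(4, 8))  # near-duplicates of the anchors
shuffled = aligned[::-1].copy()                     # every pair deliberately mismatched

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each text is closest to its own pair and high when pairs are mismatched, which is the pressure that pulls EN-SEA translations together in the embedding space.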

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model is evaluated across three primary benchmark suites to provide a comprehensive assessment of embedding quality across Southeast Asian, Chinese, and English contexts:

SEA-BED (Southeast Asia Embedding Benchmark): The primary testing suite, consisting of 169 datasets across 10 Southeast Asian languages (Burmese, Filipino, Indonesian, Khmer, Malay, Lao, Tamil, Tetum, Thai, and Vietnamese). Notably, 71% of these datasets are native-authored or human-curated to preserve regional linguistic properties.
CMTEB (Chinese Massive Text Embedding Benchmark): A specialised subset of MTEB focused on Chinese language tasks, used to evaluate performance in one of the region's most widely used languages.
MTEB (Massive Text Embedding Benchmark): The industry-standard global benchmark used to gauge general-purpose English embedding performance across a wide array of tasks.

Factors

Evaluation factors are categorised by task type and linguistic diversity so that results reflect how the model handles each language and script as well as each kind of task:

Linguistic Coverage: Evaluation spans across 10+ languages, including complex Brahmic scripts (Burmese, Khmer, Lao, Tamil, Thai) and Latin-based SEA scripts (Indonesian, Filipino, Malay, Tetum, Vietnamese).
Task Modality:
- Retrieval/Reranking: Efficiency in finding relevant documents within a large corpus.
- Semantic Textual Similarity (STS): Precision in sentence-level semantic alignment.
- Clustering & Classification: Ability to group or categorise text based on latent semantic meaning.
- Summarisation & Bitext Mining: High-level semantic matching and cross-lingual alignment.
Architecture Efficiency: Performance is measured in the context of the XLM-RoBERTa Large architecture and its SentencePiece tokenizer to assess computational efficiency versus embedding quality.

Metrics

The following metrics are employed to assess the model's capabilities across different embedding tasks:

Task Name	Short Description	Metric
Classification	Learn a classifier over sentence embeddings to assign labels to individual sentences.	F1
Multi-label Classification	Predict multiple labels for each input text using a classifier trained on embeddings.	F1
Pair Classification	Predict a binary relationship between two sentences based on their embedding similarity.	Average Precision (AP)
Semantic Textual Similarity	Measure similarity between sentence pairs using continuous scores derived from distance metrics computed over their embeddings.	Cosine Similarity
Clustering	Group embedded texts into clusters based on semantic similarity, using k-means with k set to the number of unique labels.	V-measure
Bitext Mining	Identify translation pairs across two languages by retrieving the closest match for each sentence in a source set.	F1
Retrieval	Retrieve relevant documents for a given query by computing embedding similarity between the query and candidate texts.	NDCG@10
Instruction Retrieval	Extend traditional retrieval by incorporating detailed instructions into queries, pairing each query with a corresponding detailed instruction that outlines the criteria for determining document relevance.	nNDCG@5
Reranking	Reorder a set of candidate documents based on embedding similarity to a query to improve relevance ranking.	Mean Average Precision (MAP)
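
To make the retrieval metric concrete, here is a minimal sketch of NDCG@10 in plain Python. It assumes the linear-gain form of DCG and that the input list holds the relevance labels of all judged documents in ranked order. Benchmark toolkits may differ in details such as gain function and tie handling, so treat this as an illustration and not as the SEA-BED implementation.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

perfect = ndcg_at_k([1, 1, 0, 0])   # both relevant documents ranked first
shifted = ndcg_at_k([0, 1, 1, 0])   # relevant documents pushed down one place
print(perfect, round(shifted, 4))
```

A perfect ranking scores 1.0, and the score drops as relevant documents move further down the list.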

Results

Performance comparison of embedding models on SEA-BED (https://leaderboard.sea-lion.ai/embedding/SEA). Captured on 13/03/2026 02:50pm.

Environmental Impact

Carbon emissions were estimated using the fact sheet from TRG Datacenters.

Hardware Type: Nvidia H200 140GB GPUs
Hours used: 896 hrs
Cloud Provider: SMC H200
Compute Region: Singapore
Carbon Emitted: approx. 252.13 kg CO2e
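
An estimate of this kind is typically energy used multiplied by a grid emission factor. The sketch below shows that arithmetic. Only the 896 hours comes from this card. The per-GPU power draw and the emission factor are illustrative assumptions, and the card does not confirm that these are the values or the method behind the reported figure.

```python
# "Hours used" from the card; assumed here to mean GPU-hours.
gpu_hours = 896

# ASSUMPTION: roughly 700 W drawn per GPU (not stated in the card).
gpu_power_kw = 0.7

# ASSUMPTION: illustrative grid emission factor in kg CO2e per kWh
# (not stated in the card).
grid_kg_co2e_per_kwh = 0.402

energy_kwh = gpu_hours * gpu_power_kw
emissions_kg = energy_kwh * grid_kg_co2e_per_kwh
print(f"{energy_kwh:.1f} kWh -> {emissions_kg:.2f} kg CO2e")
```

With these assumed inputs the result is of the same magnitude as the reported figure, but different power or grid assumptions would shift it proportionally.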

Technical Specifications

Model Architecture and Objective

SEA-LION-E5-Embedding-600M is an encoder-only model based on XLM-R Large with E5-style contrastive pre-training and mean pooling.

Parameter	SEA-LION-E5-Embedding-600M
d_model	1024
num_heads	16
Vocabulary	250,000 (SentencePiece)
Sequence Length	512
Pooling Mode	Mean tokens (with attention mask)

Compute Infrastructure

Hardware

Hardware Type: Nvidia H200 140GB GPUs
Cloud Provider: SMC H200

Software

Python: 3.13.10
Sentence Transformers: 5.2.0.dev0
Transformers: 4.55.4
PyTorch: 2.9.1+cu128
Accelerate: 1.12.0
Datasets: 4.4.2
Tokenizers: 0.21.4

Glossary

E5: "EmbEddings from bidirEctional Encoder rEpresentations" – a weakly-supervised contrastive pre-training method for text embeddings.
SEA-BED: Southeast Asia Embedding Benchmark – a comprehensive evaluation suite for embedding models on SEA languages.
Asymmetric Retrieval: Retrieval tasks where query and document formulations differ; E5 uses prefixes to handle this.
Mean Pooling: Aggregating token embeddings by averaging (weighted by attention mask) to produce a fixed-size sentence representation.
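
The mean pooling entry above can be sketched in NumPy as follows. This is a minimal illustration of masked averaging, not the library's internal code; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padding token's large values do not affect the result, which is why the attention mask matters when sentences in a batch have different lengths.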

More Information

While this model supports masked language modeling, it is primarily optimised via contrastive fine-tuning for downstream tasks such as sequence classification, token classification, or question answering. Please note that these weights have not been specifically aligned for safety; therefore, developers should implement their own safety evaluations and security measures. The authors disclaim all liability for any claims, damages, or other liabilities arising from the use of the released code or weights.

For more info, please contact us at sealion@aisingapore.org

Model Card Contact

sealion@aisingapore.org

Team

Ahmed Dabeer, Ahn Jeongmi, Antonyrex Sajeban, Chan Hok Teng Adwin, Cheng Zi Yi Nicholas, Choa Hsueh Mei Esther, Heng Jonathan, Huang Yuli, Jann Railey Estrada Montalan, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Liew Rachel, Limkonchotiwat Peerat, Muhammad Ridzuan Bin Mokhtar, Nagarajan Karthik, Ng Boon Cheong Raymond, Ngee Chia Tai, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Tat-Wee David, Ong Zhi Hao, Pereira Mark, Poon Joseph, Rengarajan Hamsawardhini, Siow Wei Kang Bryan, Susanto Yosephine, Sutaveephamochanon Anocha, Tan Choon Meng, Tan Chor Phin Evelyn, Tan Siao Wei Jessica, Tan Yixian, Tee Jun Yun, Teng Kok Wai Walter, Teo Eng Sipp Leslie, Tjhi William, Wu Donghang, Yeo Yeow Tong, Yong Xianbin, Zhang Haoyang, Zhang Zhou

Acknowledgement

This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.
