Paper page - Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages


Papers
arxiv:2602.02287

Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Published on Feb 2 · Submitted by Isaac Chung on Feb 3

Abstract

AI-generated summary

Controlled cross-lingual evaluation reveals instability in LLM assessment methods when targeting morphologically rich languages, indicating unreliable zero-shot judge transfer for discourse-level tasks.

Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
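
As a concrete illustration of the stability probe described above, the sketch below (independent of the released evaluation framework) derives a model ranking from judge scores per language and computes pairwise rank correlations between languages; the model names, score values, and the use of SciPy's spearmanr and kendalltau are illustrative assumptions, not numbers from the paper.

```python
# Illustrative sketch only: does an LLM judge rank the same models consistently
# across languages? All scores below are made-up placeholders, not paper results.
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-language judge scores (e.g., mean coherence on a 1-5 scale)
# for the same models evaluated under identical generation settings.
judge_scores = {
    "et": {"model_a": 4.1, "model_b": 3.8, "model_c": 3.2},
    "fi": {"model_a": 3.0, "model_b": 4.2, "model_c": 3.9},
    "hu": {"model_a": 3.5, "model_b": 3.4, "model_c": 4.0},
}

models = sorted(judge_scores["et"])  # fixed model order shared by all languages

def score_vector(lang: str) -> list[float]:
    """Return the judge scores for all models, in a fixed order, for one language."""
    return [judge_scores[lang][m] for m in models]

# Pairwise rank correlations between languages: values near 1 mean the judge
# preserves model rankings across languages; values near 0 or negative reflect
# the rank inversions the paper reports for pragmatic criteria.
langs = list(judge_scores)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        rho, _ = spearmanr(score_vector(a), score_vector(b))
        tau, _ = kendalltau(score_vector(a), score_vector(b))
        print(f"{a} vs {b}: Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```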

Community

Isaac Chung (paper author, paper submitter)

Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences.
This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
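
As a rough sketch of the controlled-generation setup described in the abstract, the snippet below holds every generation parameter fixed and varies only the target language; the OpenAI client usage, model name, decoding settings, and prompt wording are assumptions for illustration, not the authors' released protocol.

```python
# Sketch of controlled generation: every field except the target language is
# held fixed, so downstream score differences can be attributed to evaluation
# behaviour rather than generation settings. Model name, decoding parameters,
# and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

FIXED_PARAMS = {"model": "gpt-4o-mini", "temperature": 0.7, "seed": 42, "max_tokens": 600}

PROMPT_TEMPLATE = (
    "Write a customer-support dialogue in {language} between an agent and a "
    "customer about a delayed delivery. Use 6 to 8 turns."
)

def generate_dialogue(language: str) -> str:
    """Generate one synthetic dialogue with all parameters fixed except the language."""
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(language=language)}],
        **FIXED_PARAMS,
    )
    return response.choices[0].message.content

dialogues = {lang: generate_dialogue(lang) for lang in ["Estonian", "Finnish", "Hungarian"]}
```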

Librarian Bot (Bot)

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation (2026): https://huggingface.co/papers/2601.19350
* From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text (2026): https://huggingface.co/papers/2601.19913
* Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages (2026): https://huggingface.co/papers/2601.06341
* Benchmarking Concept-Spilling Across Languages in LLMs (2026): https://huggingface.co/papers/2601.12549
* OLA: Output Language Alignment in Code-Switched LLM Interactions (2026): https://huggingface.co/papers/2601.03589
* One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages? (2026): https://huggingface.co/papers/2601.10205
* Can LLMs Solve My Grandma's Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles (2025): https://huggingface.co/papers/2512.20324

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.02287 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.02287 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.