Xing SUN (孙星)
About Me
I am a Principal Researcher and Team Manager at Tencent Youtu Lab. I received my Ph.D. from The University of Hong Kong in 2016.
My research focuses on three core pillars: Multimodal Large Language Models (MLLMs), Agents, and Retrieval-Augmented Generation (RAG). My team aims to bridge the gap between foundation models and real-world applications through robust, open-source tools and benchmarks.
Open Source Projects
We are actively building the TencentCloudADP ecosystem.
MLLM
The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis.
Benchmark Video
Open-source Vision-Language models including training recipes and inference code (e.g., Youtu-VL-4B).
MLLM Training
The first-ever open-source interactive omni-multimodal LLM.
MLLM Training
Agent
Lightweight, high-performance Large Language Models (2B parameters) for edge deployment.
LLM HuggingFace
A flexible framework for building autonomous LLM agents, supporting complex tool calling, planning, and memory management.
Framework Python
A desktop efficiency assistant powered by local LLMs (Ollama) and Youtu-Agent to automate daily workflows.
Application Local LLM
RAG
Advanced RAG system leveraging Knowledge Graphs to enhance retrieval accuracy and structured reasoning.
RAG Graph
High-performance document parsing tools designed to convert raw files (PDF, DOCX) into clean, RAG-ready data.
Data Processing
Optimized embedding models tailored for semantic search and dense retrieval tasks.
Model Retrieval
Selected Publications
Full list available on Google Scholar.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
CVPR 2025 Highlight
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
NeurIPS 2025 Spotlight
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
NeurIPS 2025 Spotlight
VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model
NeurIPS 2025
RAR: Reversing Visual Attention Re-Sinking for Unlocking Potential in Multimodal Large Language Models
ICLR 2026
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
NeurIPS 2025
Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
ICLR 2025
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
ICLR 2026
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
ICLR 2026
RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents
ACL 2025
Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
ACL 2025
RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
ACL 2025
TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup
NeurIPS 2025 Spotlight
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
COLING 2025
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
NeurIPS 2025
Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning
ICLR 2026
Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
ICLR 2026
Attend to the Active: Structure-Aware Dynamic Attention in LLMs for Compositional Instruction Following
ICLR 2026
Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving
AAAI 2026
Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
EMNLP 2025
HRVDA: High-Resolution Visual Document Assistant
CVPR 2024
© 2026 Xing SUN. Last updated: Feb 2026.