Xing SUN (孙星)
About Me
I am a Principal Researcher and Team Manager at Tencent Youtu Lab. I received my Ph.D. from The University of Hong Kong in 2016.
My research focuses on three core pillars: Multimodal Large Language Models (MLLMs), Agents, and Retrieval-Augmented Generation (RAG). My team aims to bridge the gap between foundation models and real-world applications through robust, open-source tools and benchmarks.
Open Source Projects
We are actively building the TencentCloudADP ecosystem.
MLLM
The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis.
Benchmark Video
Open-source Vision-Language models including training recipes and inference code (e.g., Youtu-VL-4B).
MLLM Training
The first-ever open-source interactive omni-multimodal LLM.
MLLM Training
Agent
Lightweight, high-performance Large Language Models (2B parameters) for edge deployment.
LLM HuggingFace
A flexible framework for building autonomous LLM agents, supporting complex tool calling, planning, and memory management.
Framework Python
A desktop efficiency assistant powered by local LLMs (Ollama) and Youtu-Agent to automate daily workflows.
Application Local LLM
RAG
Advanced RAG system leveraging Knowledge Graphs to enhance retrieval accuracy and structured reasoning.
RAG Graph
High-performance document parsing tools designed to convert raw files (PDF, DOCX) into clean, RAG-ready data.
Data Processing
Optimized embedding models tailored for semantic search and dense retrieval tasks.
Model Retrieval
Selected Publications
Full list available on Google Scholar.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
CVPR 2025 Highlight
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
NeurIPS 2025 Spotlight
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
NeurIPS 2025 Spotlight
VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model
NeurIPS 2025
RAR: Reversing Visual Attention Re-Sinking for Unlocking Potential in Multimodal Large Language Models
ICLR 2026
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
NeurIPS 2025
Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
ICLR 2025
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
ICLR 2026
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
ICLR 2026
RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents
ACL 2025
Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
ACL 2025
RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
ACL 2025
TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup
NeurIPS 2025 Spotlight
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
COLING 2025
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
NeurIPS 2025
Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning
ICLR 2026
Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
ICLR 2026
Attend to the Active: Structure-Aware Dynamic Attention in LLMs for Compositional Instruction Following
ICLR 2026
Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving
AAAI 2026
Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
EMNLP 2025
HRVDA: High-Resolution Visual Document Assistant
CVPR 2024
© 2026 Xing SUN. Last updated: Feb 2026.