Abstract
One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65\% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85\% and 15.61\%, respectively.
Community
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
Cool title. The shift toward one-to-many temporal grounding is interesting because most current methods seem to prioritize a single best-fit segment. It feels like a natural step forward for more nuanced video understanding.
I'm curious how the model handles the ambiguity of multiple potential groundings. Does it rely on a ranking system to prioritize these segments, or does it try to output a multimodal distribution of all possible matches?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/fa494212-c436-4162-9392-b4d5b4c709b0
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding (2026)
- EvoGround: Self-Evolving Video Agents for Video Temporal Grounding (2026)
- SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding (2026)
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding (2026)
- Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding (2026)
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation (2026)
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.06294 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 2
insomnia7/omtg_bench
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper