Rethinking Chain-of-Thought Reasoning for Videos
Abstract
Efficient video reasoning can be achieved using concise chains of thought and reduced visual tokens without manual annotations or supervised fine-tuning.
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Efficient Reasoning via Thought-Training and Thought-Free Inference (2025)
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning (2025)
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding (2025)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models (2025)
- DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching (2025)
- ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better (2025)
- ORION: Teaching Language Models to Reason Efficiently in the Language of Thought (2025)