OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
Abstract
OmniShotCut formulates shot boundary detection as structured relational prediction using a shot query-based dense video Transformer, addressing limitations of existing methods through synthetic transition generation and a comprehensive benchmark.
Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. Although SBD has been widely studied, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction: a shot query-based dense video Transformer jointly estimates shot ranges together with intra-shot and inter-shot relations. To avoid imprecise manual labeling, we adopt a fully synthetic transition pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
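To make the idea of synthetic transitions with exact labels concrete, here is a minimal sketch of one transition family, a linear dissolve. It is an illustration under stated assumptions, not the authors' pipeline: frames are simplified to flat lists of pixel intensities, and the names `Segment`, `blend`, and `synth_dissolve` are hypothetical. Because the transition is generated rather than annotated, its start and end frames are known exactly, and `length` acts as one parameterized variant.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a frame is a flat list of pixel intensities in [0, 1].
Frame = List[float]


@dataclass
class Segment:
    start: int  # first frame index, inclusive
    end: int    # last frame index, inclusive
    label: str  # "shot" or a transition family such as "dissolve"


def blend(a: Frame, b: Frame, alpha: float) -> Frame:
    """Per-pixel linear mix: alpha=0 gives a, alpha=1 gives b."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def synth_dissolve(
    clip_a: List[Frame], clip_b: List[Frame], length: int
) -> Tuple[List[Frame], List[Segment]]:
    """Cross-fade the last `length` frames of clip_a into the first
    `length` frames of clip_b and return the frames plus exact labels."""
    if not 0 < length <= min(len(clip_a), len(clip_b)):
        raise ValueError("length must be in 1..min(len(clip_a), len(clip_b))")

    head = clip_a[:-length]   # untouched part of the first shot
    tail = clip_b[length:]    # untouched part of the second shot
    offset = len(clip_a) - length

    mixed = []
    for i in range(length):
        # alpha stays strictly between 0 and 1, so every mixed frame
        # contains content from both shots.
        alpha = (i + 1) / (length + 1)
        mixed.append(blend(clip_a[offset + i], clip_b[i], alpha))

    frames = head + mixed + tail

    segments = []
    if head:
        segments.append(Segment(0, len(head) - 1, "shot"))
    segments.append(Segment(len(head), len(head) + length - 1, "dissolve"))
    if tail:
        segments.append(Segment(len(head) + length, len(frames) - 1, "shot"))
    return frames, segments
```

For example, joining two 5-frame clips with `length=3` yields 7 frames labeled as a shot on frames 0 to 1, a dissolve on frames 2 to 4, and a shot on frames 5 to 6. Other families such as fades or wipes could follow the same pattern by swapping the `blend` step.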
Community
OmniShotCut is a sensitive, more informative SoTA method for the Shot Boundary Detection task.
Using a Shot-Query-based Video Transformer, it detects shot changes in videos from diverse sources (anime, vlogs, games, shorts, sports, screen recordings, etc.) and recognizes both Sudden Jumps and Transitions (dissolve, fade, wipe, etc.).
Get this paper in your agent:
hf papers read 2604.24762
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash