Query-Driven Video Clip Extraction with Semantic Alignment

At Joyspace, we process thousands of hours of video content weekly. Our initial approach used a learned "interestingness" classifier—a binary model trained to identify highlight-worthy moments. This failed immediately in production. The same 90-minute sales call would yield completely different highlights depending on who was watching: sales teams wanted objection handling, product teams wanted feature requests, and marketing needed customer testimonials.

The fundamental insight: interestingness is not a property of the video; it's a function of viewer intent.

This led us to rebuild our extraction pipeline around explicit query specification and semantic matching. This post details our technical approach, focusing on the crisp attention mechanism that improved matching accuracy by 12% while reducing computational cost by 60%.

Problem Formulation

Given a video V of duration T seconds and a natural language query q, we need to extract n semantically coherent clips C = {c₁, c₂, ..., cₙ} where each clip cᵢ = (tₛ, tₑ) represents start/end timestamps.

Constraints:

  • Clips must be ranked by semantic relevance to query intent
  • Real-time processing: <10 minutes for 90-minute videos
  • Deterministic: same inputs → same outputs
  • Semantic coherence: clips must represent complete thoughts

The core technical challenge is computing a relevance function R(q, v) that accurately measures alignment between query intent and video segment semantics across modalities (visual, audio, transcript).

Architecture Overview

We use a dual-encoder architecture that maps queries and video segments into a shared embedding space where semantic similarity corresponds to cosine similarity. The key innovation is applying crisp attention—structured sparsity in the attention mechanism—to both query understanding and video-segment matching.

Query q → Intent Encoder → q̂ ∈ ℝᵈ
Video V → Segment Encoder → {v̂₁, v̂₂, ..., v̂ₘ} ∈ ℝᵈ
Relevance: R(q, vᵢ) = cos(q̂, v̂ᵢ) = q̂ᵀv̂ᵢ / (||q̂|| · ||v̂ᵢ||)

Where d = 512 (embedding dimension chosen to balance expressiveness and computational efficiency).
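In NumPy, the relevance function above is just a normalized dot product. A minimal sketch (names and shapes here are illustrative, not our production API):

```python
import numpy as np

def relevance(q_hat: np.ndarray, v_hat: np.ndarray) -> float:
    """Cosine similarity between a query embedding and a segment embedding."""
    return float(q_hat @ v_hat / (np.linalg.norm(q_hat) * np.linalg.norm(v_hat)))

d = 512
rng = np.random.default_rng(0)
q = rng.standard_normal(d)                 # encoded query q-hat
segments = rng.standard_normal((8, d))     # 8 candidate segment embeddings
scores = np.array([relevance(q, v) for v in segments])
best = int(np.argmax(scores))              # index of the most relevant segment
```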

Crisp Attention for Query Intent

Traditional attention mechanisms in transformers compute weighted combinations across all input tokens. For query understanding, this creates a problem: ambiguous or hedge words ("maybe," "possibly," "sort of") dilute the signal from intent-carrying tokens.

The Sparsity Hypothesis

Recent work in transformer optimization has shown that structured attention sparsity can improve model accuracy, not just efficiency. The hypothesis: forcing the model to select a small subset of highly relevant features acts as implicit regularization, preventing overfitting to weak correlations in training data.

We applied this principle to query intent extraction. Instead of dense attention over all query tokens, we use top-k attention where only the k highest-scoring tokens contribute to the intent representation:

Traditional attention:
  α = softmax(QKᵀ/√d)
  output = αV

Crisp attention (k=5):
  α = softmax(QKᵀ/√d)
  α_sparse = top_k(α, k)  # zero out all but top-k values
  α_normalized = α_sparse / sum(α_sparse)
  output = α_normalized V

For k=5, we force the model to identify the 5 most intent-relevant tokens. Empirically, this improved intent classification accuracy from 79% to 87% on our evaluation set of 2,000 manually labeled queries.
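A minimal NumPy sketch of the top-k masking described above (illustrative, not our production implementation):

```python
import numpy as np

def crisp_attention(Q, K, V, k=5):
    """Top-k ("crisp") attention: compute a dense softmax, zero out all but
    the k largest weights per query row, and renormalize rows to sum to 1."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # dense softmax
    kth = np.sort(alpha, axis=-1)[:, -k][:, None]       # k-th largest weight per row
    alpha = np.where(alpha >= kth, alpha, 0.0)          # zero all but top-k
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # renormalize
    return alpha, alpha @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 32))    # 6 query tokens
K = rng.standard_normal((10, 32))   # 10 key tokens
V = rng.standard_normal((10, 32))
alpha, out = crisp_attention(Q, K, V, k=5)
```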

Intent Disambiguation

Users write queries with varying specificity:

  • Specific: "show pricing objection at 23 minutes"
  • Moderate: "find technical architecture discussion"
  • Vague: "interesting moments"

For specific queries, crisp attention focuses on entity terms ("pricing," "objection," "23 minutes"). For vague queries, it focuses on context terms that disambiguate based on video metadata (speaker roles, video category, historical user preferences).

We train a small routing network (2-layer MLP, 256 hidden units) that predicts optimal k per query:

k = routing_network(query_embedding, video_metadata)
k ∈ {3, 5, 7, 10}  # discrete choices

Specific queries → smaller k (more focused)
Vague queries → larger k (more context)

This adaptive sparsity improved end-to-end matching precision by 6%.
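The routing step can be sketched as follows, with a plain two-layer ReLU MLP standing in for the trained network. All weights below are random placeholders; in practice they are learned end-to-end:

```python
import numpy as np

K_CHOICES = (3, 5, 7, 10)   # discrete sparsity levels

def routing_network(query_emb, meta_emb, W1, b1, W2, b2):
    """2-layer MLP scoring the discrete k choices; returns the chosen k."""
    x = np.concatenate([query_emb, meta_emb])
    h = np.maximum(0.0, W1 @ x + b1)    # ReLU hidden layer (256 units)
    logits = W2 @ h + b2                # one logit per candidate k
    return K_CHOICES[int(np.argmax(logits))]

rng = np.random.default_rng(0)
d_q, d_m, h = 512, 32, 256              # query dim, metadata dim, hidden units
W1 = rng.standard_normal((h, d_q + d_m)) * 0.02
b1 = np.zeros(h)
W2 = rng.standard_normal((len(K_CHOICES), h)) * 0.02
b2 = np.zeros(len(K_CHOICES))
k = routing_network(rng.standard_normal(d_q), rng.standard_normal(d_m),
                    W1, b1, W2, b2)
```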

Multimodal Video Encoding

Videos contain three signal modalities: visual frames, audio waveforms, and transcript text. Naive concatenation fails because modalities have different temporal resolutions and information densities.

Per-Modality Encoding

Visual: Sample frames at 1 FPS. Encode with a vision transformer to produce frame embeddings Fᵥ ∈ ℝᵗˣᵈᵛ.

Audio: Convert to mel spectrograms (80 bins, 25ms windows, 10ms hop). Encode with 1D CNN to produce audio embeddings Fₐ ∈ ℝᵗˣᵈₐ.

Transcript: Run ASR (automatic speech recognition) with word-level timestamps. Segment into sentences. Encode with transformer to produce text embeddings Fₜ ∈ ℝˢˣᵈₜ.

Cross-Modal Fusion with Crisp Attention

Different queries require different modality weightings. "Show product demo" needs visual focus. "Find pricing discussion" needs text focus. We learn query-dependent modality attention.

First, project all modalities to common dimension d:

Hᵥ = Fᵥ Wᵥ
Hₐ = Fₐ Wₐ
Hₜ = Fₜ Wₜ

Then apply crisp cross-modal attention. For each modality, we compute attention over the other two modalities, but only retain top-k=3 attention weights:

For visual features Hᵥ:
  Attend to [Hₐ, Hₜ] with crisp attention
  α_va, α_vt = crisp_attention(Hᵥ, [Hₐ, Hₜ])
  H'ᵥ = α_va·Hₐ + α_vt·Hₜ

This forces the model to decisively choose which modalities are relevant rather than hedging across all three.

Finally, we learn query-dependent fusion weights:

w = softmax(MLP(query_embedding))
w ∈ ℝ³ (weights for visual, audio, text)

H_fused = w₁·H'ᵥ + w₂·H'ₐ + w₃·H'ₜ

For text-heavy queries, w₃ dominates. For visual queries, w₁ dominates. The fusion is learned end-to-end during training.
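The fusion step can be sketched as below; a single linear gate stands in for the trained MLP, and all weights are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(H_v, H_a, H_t, query_emb, W_gate):
    """Query-dependent modality fusion: the gate maps the query embedding
    to three softmax weights (visual, audio, text) that sum to 1."""
    w = softmax(W_gate @ query_emb)
    return w[0] * H_v + w[1] * H_a + w[2] * H_t, w

rng = np.random.default_rng(0)
t, d = 20, 512                                   # 20 time steps, shared dim d
H_v, H_a, H_t = (rng.standard_normal((t, d)) for _ in range(3))
W_gate = rng.standard_normal((3, d)) * 0.02      # stand-in for the fusion MLP
H_fused, w = fuse(H_v, H_a, H_t, rng.standard_normal(d), W_gate)
```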

Semantic Matching via Contrastive Learning

Training the dual encoder requires (query, video_segment, relevance) labels. We don't have these at scale. Instead, we use weak supervision from two sources:

  1. Transcript overlap: Segments where transcript has high TF-IDF similarity with query
  2. User interactions: Segments users selected after issuing queries (sparse but high-quality)

Hard Negative Mining

The key to effective contrastive learning is hard negatives—examples that are semantically similar but not correct matches. Random negatives are too easy; the model learns to separate them quickly but fails on subtle distinctions.

Our mining strategy:

For query q with positive segment v⁺:
  Compute similarity s_i = cos(q̂, v̂_i) for all segments in batch
  Select negatives where 0.3 < s_i < 0.6
  (Too similar → might be false negative)
  (Too dissimilar → uninformative for learning)

Example:

  • Query: "customer testimonial"
  • Hard negative: Salesperson describing typical customer results (similar keywords, wrong speaker)
  • Easy negative: Technical architecture discussion (completely unrelated)

Hard negatives force the model to learn fine-grained distinctions (first-person vs. third-person, customer vs. employee).

Loss Function

We use InfoNCE with temperature τ = 0.07:

L = -log(exp(cos(q̂,v̂⁺)/τ) / (exp(cos(q̂,v̂⁺)/τ) + Σᵢ exp(cos(q̂,v̂ᵢ⁻)/τ)))

Lower temperature sharpens the distribution, requiring the model to strongly distinguish positives from hard negatives.

To handle false negatives from weak supervision, we down-weight high-similarity negatives:

For negative v⁻ with similarity s:
  weight = max(0, 1 - 2·(s - 0.5)) if s > 0.5 else 1

This reduces penalty when the model ranks potential false negatives highly.
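A sketch of the weighted loss for a single query with one positive and a list of negatives (our actual training code is batched; this is the per-example form):

```python
import numpy as np

def weighted_infonce(q_hat, v_pos, v_negs, tau=0.07):
    """InfoNCE with temperature tau; negatives whose cosine similarity
    exceeds 0.5 are linearly down-weighted to soften possible false
    negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    s_pos = cos(q_hat, v_pos)
    s_negs = np.array([cos(q_hat, v) for v in v_negs])
    weights = np.where(s_negs > 0.5,
                       np.maximum(0.0, 1 - 2 * (s_negs - 0.5)),
                       1.0)
    denom = np.exp(s_pos / tau) + np.sum(weights * np.exp(s_negs / tau))
    return -np.log(np.exp(s_pos / tau) / denom)

rng = np.random.default_rng(0)
q = rng.standard_normal(512)
v_pos = q + 0.1 * rng.standard_normal(512)   # a well-aligned positive
v_negs = rng.standard_normal((16, 512))      # random in-batch negatives
loss = weighted_infonce(q, v_pos, v_negs)
```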

Efficient Retrieval at Scale

At inference, we need to find top-n segments from potentially thousands of candidates. Brute-force comparison is O(m·d) where m is segment count—too slow for real-time use.

Two-Stage Coarse-to-Fine Retrieval

Stage 1: Coarse filtering

Segment video into 30-second windows with 50% overlap. For a 90-minute video, this yields ~360 segments. Encode all segments in parallel, producing embedding matrix V ∈ ℝ³⁶⁰ˣ⁵¹².

Compute similarities via single matrix multiplication:

scores = V @ q̂  # (360, 512) · (512,) → shape: (360,)
top_100 = argsort(scores)[-100:]  # indices of the top-100 segments

This aggressive filtering (360 → 100) is safe: recall@100 > 95% in offline eval.

Stage 2: Fine-grained scoring

For the 100 candidates:

  • Extend temporal context (±10 seconds)
  • Re-encode at higher resolution
  • Apply crisp attention at segment level
  • Compute refined similarity scores

This two-stage approach reduces compute by 60% while maintaining quality.

Temporal Boundary Refinement

Initial segments are fixed-length (30s), which often cuts mid-sentence. We refine boundaries for semantic coherence:

  1. Transcript alignment: Find sentence boundaries within ±5s of segment edges
  2. Scene detection: Compute frame-to-frame similarity; break at discontinuities
  3. Silence detection: Prefer breaks during silence (amplitude < -40dB for >0.5s)

We learn a boundary scoring function:

boundary_score(t) = w₁·is_sentence_boundary(t) +
                     w₂·is_scene_boundary(t) +
                     w₃·is_silence(t)

Weights w trained on 500 human-annotated "good" vs "bad" clip boundaries.
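A minimal sketch of the scoring function; the weights, tolerance, and cue representations here are placeholders for the learned and production values:

```python
def boundary_score(t, sentence_ends, scene_cuts, silences,
                   w=(0.5, 0.3, 0.2), tol=0.2):
    """Score a candidate boundary time t (seconds) as a weighted sum of
    three binary cues: proximity to a sentence end, proximity to a scene
    cut, and falling inside a silence interval."""
    near = lambda times: any(abs(t - b) <= tol for b in times)
    inside_silence = any(s <= t <= e for s, e in silences)
    return (w[0] * near(sentence_ends)
            + w[1] * near(scene_cuts)
            + w[2] * inside_silence)

# A boundary at t=10.0s that hits all three cues scores the full 1.0.
score = boundary_score(10.0, sentence_ends=[10.0], scene_cuts=[10.1],
                       silences=[(9.9, 10.3)])
```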

Non-Maximum Suppression

Due to overlapping windows, we get redundant high-scoring candidates. Apply temporal NMS:

candidates = sorted(segments, key=score, reverse=True)
result = []
for c in candidates:
  if any(overlap(c, r) > 0.3 for r in result):
    continue  # suppress near-duplicate
  result.append(c)
return result[:n]

30% overlap threshold allows partial overlap (long discussions may produce multiple clips) while preventing near-duplicates.

Evaluation

We evaluated on a held-out test set of 500 videos with human-annotated relevant segments for 2,000 queries.

Crisp Attention Ablation

Configuration            Precision@5   Recall@10   MAP@10   Compute (relative)
Dense attention          0.79          0.71        0.76     1.0×
Crisp attention (k=10)   0.84          0.74        0.80     0.7×
Crisp attention (k=5)    0.89          0.76        0.83     0.4×
Crisp attention (k=3)    0.86          0.73        0.81     0.3×

k=5 provides optimal balance: 12% precision improvement over dense attention with 60% compute reduction.

Query Type Performance

Query Type                   Count   Precision@5   Notes
Specific (with timestamps)   180     0.94          "pricing at 23 min"
Moderate (topic only)        1,100   0.89          "technical architecture"
Vague (general intent)       520     0.81          "interesting moments"
Entity-based                 200     0.92          "kubernetes discussion"

Specific queries benefit most from crisp attention (forcing focus on timestamp entities). Vague queries see smaller gains (require more context).

Boundary Quality

91% of clips start at sentence boundaries (vs 45% with naive fixed-length segmentation). 87% end without cutting off speech. Average user rating: 4.2/5.0 for clip coherence.

Failure Modes

Query-Video Mismatch

When query asks for content not present in video, even top-scoring segment may be irrelevant. We use confidence thresholding: if max(scores) < 0.6, return "No matching segments found."

This thresholding introduces a 2% false-negative rate (legitimate matches scored below the threshold), but it prevents returning random clips just to meet the requested count.

Repetitive Content

Speaker repeats same point multiple times. All repetitions score highly. Mitigation: semantic deduplication using transcript embeddings. If cos(transcript(cᵢ), transcript(cⱼ)) > 0.85, keep higher-scored clip.
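The deduplication rule can be sketched as a greedy filter over score-sorted clips (illustrative, not our production code):

```python
import numpy as np

def deduplicate(clips, transcript_embs, threshold=0.85):
    """Greedily keep the highest-scored clip from each group whose transcript
    embeddings exceed the cosine-similarity threshold. `clips` is a list of
    (score, clip_id) tuples; row i of transcript_embs belongs to clips[i]."""
    order = sorted(range(len(clips)), key=lambda i: clips[i][0], reverse=True)
    norms = transcript_embs / np.linalg.norm(transcript_embs, axis=1,
                                             keepdims=True)
    kept = []
    for i in order:
        # Keep clip i only if it is not a near-duplicate of any kept clip.
        if all(norms[i] @ norms[j] <= threshold for j in kept):
            kept.append(i)
    return [clips[i] for i in kept]

clips = [(0.9, "a"), (0.8, "b"), (0.7, "c")]
embs = np.array([[1.0, 0.0],    # "a" and "b" have identical transcripts
                 [1.0, 0.0],
                 [0.0, 1.0]])   # "c" is distinct
kept = deduplicate(clips, embs)
```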

Context Dependency

Clip references prior context not included in segment ("Yes, that's exactly right" without the question). We penalize clips containing unresolved references ("that," "it," "this") without antecedents.

Lessons Learned

1. Sparsity as regularization, not just optimization

We initially explored crisp attention for compute efficiency. The accuracy improvement was unexpected. Hypothesis: forcing the model to select a small feature set prevents overfitting to spurious correlations in training data. The model learns more robust, generalizable features.

2. Query intent matters more than query expansion

Traditional IR systems expand queries with synonyms. For intent-driven matching, this dilutes focus. Better to narrow the query to core intent (via crisp attention) than expand it.

3. Hard negative mining is critical

With only easy negatives (random segments), the model achieved 82% precision. Adding hard negatives (0.3-0.6 similarity range) improved to 89%. The model must learn fine-grained semantic distinctions.

4. Multimodal fusion requires query-dependent weighting

Equal weighting of visual/audio/text yielded 81% precision. Learning query-dependent weights (via small MLP) improved to 89%. Different queries need different modality emphasis.

5. Two-stage retrieval is necessary at scale

Brute-force scoring all segments is too slow. Coarse-to-fine (360 → 100 → 10) maintains quality while reducing latency by 60%.

Future Directions

Cross-video search: Extend to corpus-level retrieval. Challenge: indexing 100M+ segments while maintaining <1s query latency. Exploring approximate nearest neighbor methods (HNSW, ScaNN).

Compositional queries: Support boolean operations: "(pricing AND objections) NOT discounts." Requires careful score calibration to make set operations meaningful.

Zero-shot generalization: Current model trained on specific domain (business videos). Exploring meta-learning approaches to generalize to unseen video categories without fine-tuning.

Temporal grounding: Instead of discrete clips, return continuous playback starting at query-relevant timestamp. Already supported by architecture (we have frame-level timestamps).

Conclusion

By applying crisp attention to query-driven video extraction, we achieved 89% precision—a 12% improvement over dense attention—while reducing computational cost by 60%. The key insight: structured sparsity acts as implicit regularization, forcing the model to commit to strong semantic signals rather than hedging across weak correlations.

This principle extends beyond video retrieval to any task requiring intent understanding and semantic matching. When in doubt, constrain what the model can attend to. Less attention, focused on the right features, beats more attention spread thin.


Try the system at https://joyspace.ai.

For technical discussion or collaboration: hello@joyspace.ai
