Behavioural capture stack
What is actually read from the video?
- Inputs
- Any modern video: venue AV, GoPro, iPhone, webcam. No wearables. No audience opt-in beyond the venue's standard consent.
- Computation
- 468 facial landmarks per detected face per frame via MediaPipe FaceMesh, body keypoints via MediaPipe Pose, voice prosody via Whisper-derived features. Face embeddings are never persisted; only the derived per-frame scoring vector survives the analysis pass.
- Output
- Per-frame analysis records: face slot, smile/surprise/neutral probabilities, an attending boolean (head pitch within ±25° and yaw within ±30° of the camera axis, which stands in for the stage), and an optional bounding box.
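The attending flag can be sketched as a plain threshold test on head pose. This is a minimal illustration rather than the capture stack's actual code: the `HeadPose` type, the function name, and the choice to treat the limits as inclusive are assumptions; only the ±25° and ±30° limits come from the description above.

```python
from dataclasses import dataclass

# Limits taken from the text; angles are in degrees relative to the camera axis.
PITCH_LIMIT_DEG = 25.0
YAW_LIMIT_DEG = 30.0


@dataclass
class HeadPose:
    """Hypothetical container for one face's head pose in one frame."""
    pitch_deg: float
    yaw_deg: float


def is_attending(pose: HeadPose,
                 pitch_limit: float = PITCH_LIMIT_DEG,
                 yaw_limit: float = YAW_LIMIT_DEG) -> bool:
    """Return True when both angles fall inside the limits.

    Inclusive bounds are an assumption; the text does not say how
    exact boundary values are treated.
    """
    return abs(pose.pitch_deg) <= pitch_limit and abs(pose.yaw_deg) <= yaw_limit
```

A face looking 10° down and 20° to the side counts as attending under this sketch; one turned 31° away does not.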
Synchrony Score
Did the audience react together or one face at a time?
- Inputs
- Per-frame face arrays across the session. Faces are matched from frame to frame by their stable slot index; identity is not tracked across the room.
- Computation
- For each rolling window (default 30 seconds): build the smile-probability time series for every attending face that contributed at least 3 frames; compute Pearson r for every pair of those faces; report the mean pairwise r per window and for the session overall.
- Output
- Per-window r in [-1, 1] plus a normalized 0..1 score (r + 1) / 2 for UI surfacing.
- Caveats
- Requires at least two attending faces per window. Windows that fall short are reported with zero pairs rather than imputed.
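The per-window computation above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the input shape (a dict of slot index to a dict of frame index to smile probability, attending frames only), the function names, aligning each pair on the frames both faces share, and skipping pairs with a constant series (where Pearson r is undefined) are all choices made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a flat series
    return sxy / sqrt(sxx * syy)


def window_synchrony(series_by_slot, min_frames=3):
    """Mean pairwise r for one window.

    series_by_slot: {slot_index: {frame_index: smile_probability}} (assumed shape).
    """
    eligible = {slot: frames for slot, frames in series_by_slot.items()
                if len(frames) >= min_frames}
    rs = []
    for a, b in combinations(sorted(eligible), 2):
        # Assumption: correlate only over frames where both faces were present.
        shared = sorted(eligible[a].keys() & eligible[b].keys())
        if len(shared) < min_frames:
            continue
        r = pearson_r([eligible[a][t] for t in shared],
                      [eligible[b][t] for t in shared])
        if r is not None:
            rs.append(r)
    if not rs:
        # Reported as zero pairs, not imputed.
        return {"pairs": 0, "meanR": None, "score": None}
    mean_r = sum(rs) / len(rs)
    return {"pairs": len(rs), "meanR": mean_r, "score": (mean_r + 1) / 2}
```

A window with a single eligible face returns `pairs: 0` and no score, which matches the caveat above.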
Held-Attention Index
What share of seconds did the room actually face the stage?
- Inputs
- Per-frame face arrays. The "attending" boolean is computed in the capture stack from head pitch and yaw relative to the camera (a stage-fixed proxy for "facing the speaker").
- Computation
- Aggregate faces by second of session. For each second, compute the attention fraction: attending faces divided by total faces. Mark the second as "held" when the fraction meets the threshold (default 0.6). Report the held rate, the mean attention fraction, and the longest consecutive run of held seconds.
- Output
- Per-second attentionFraction time series, headline heldRate, longestSustainedSec.
- References
- Standard head-pose attention proxy (per-frame pitch/yaw thresholding).
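The aggregation above can be sketched in a few lines. The input shape (a sequence of timestamp and attending-flag pairs), the function name, and the decision that seconds with no detected faces are skipped and break a run are assumptions for the example; the 0.6 default and the output field names come from the text.

```python
from collections import defaultdict


def held_attention(frames, threshold=0.6):
    """frames: iterable of (timestamp_sec, [attending booleans]) (assumed shape)."""
    attending = defaultdict(int)
    total = defaultdict(int)
    for timestamp_sec, faces in frames:
        second = int(timestamp_sec)  # bucket frames by whole second of session
        attending[second] += sum(1 for flag in faces if flag)
        total[second] += len(faces)

    # Seconds with no faces are left out (assumption: no data, no fraction).
    fractions = {s: attending[s] / total[s] for s in sorted(total) if total[s] > 0}
    if not fractions:
        return {"attentionFraction": {}, "heldRate": None,
                "meanAttention": None, "longestSustainedSec": 0}

    held_count = longest = run = 0
    previous_held = None
    for second, fraction in fractions.items():
        if fraction >= threshold:
            held_count += 1
            # Extend the run only if the previous held second is adjacent.
            run = run + 1 if previous_held == second - 1 else 1
            previous_held = second
            longest = max(longest, run)
        else:
            run = 0
            previous_held = None

    return {
        "attentionFraction": fractions,
        "heldRate": held_count / len(fractions),
        "meanAttention": sum(fractions.values()) / len(fractions),
        "longestSustainedSec": longest,
    }
```

Here `heldRate` is taken over seconds that had at least one detected face; dividing by total session length instead would be an equally plausible reading of the text.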
Applause Spectrogram
Where did the audible reactions land and how long did they hold?
- Inputs
- The AudioPeak stream produced by the capture stack: timestamp, RMS amplitude, duration above threshold, heuristic type (laugh / applause / ovation / reaction).
- Computation
- Count peaks by type. Sort by duration in descending order and keep the top N. Bin peaks into 5-second buckets, tracking weighted duration and peak amplitude per bin. Report the 95th-percentile amplitude across the session.
- Output
- countsByType, sustainedRanking, amplitudeP95, and a per-bin spectrogram payload for the heat-bar UI.
- References
- Affectiva audience response methodology (acoustic envelope segmentation).
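A rough sketch of the peak summary follows. Several details are assumptions because the text does not pin them down: the `AudioPeak` field names, weighting duration by amplitude (the text says only "weighted duration"), assigning each peak to the bin containing its start timestamp, and using the nearest-rank method for the 95th percentile.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil


@dataclass
class AudioPeak:
    """Hypothetical record mirroring the stream fields named in the text."""
    timestamp_sec: float
    amplitude: float      # RMS amplitude
    duration_sec: float   # time above threshold
    kind: str             # laugh / applause / ovation / reaction


def summarize_peaks(peaks, top_n=5, bin_sec=5.0):
    counts = Counter(p.kind for p in peaks)
    ranking = sorted(peaks, key=lambda p: p.duration_sec, reverse=True)[:top_n]

    bins = {}
    for p in peaks:
        index = int(p.timestamp_sec // bin_sec)  # bin by start time (assumption)
        entry = bins.setdefault(index, {"startSec": index * bin_sec,
                                        "weightedDuration": 0.0,
                                        "peakAmplitude": 0.0})
        # Assumption: duration is weighted by the peak's amplitude.
        entry["weightedDuration"] += p.duration_sec * p.amplitude
        entry["peakAmplitude"] = max(entry["peakAmplitude"], p.amplitude)

    amplitudes = sorted(p.amplitude for p in peaks)
    # Nearest-rank percentile (assumption); None when there are no peaks.
    p95 = amplitudes[max(0, ceil(0.95 * len(amplitudes)) - 1)] if amplitudes else None

    return {
        "countsByType": dict(counts),
        "sustainedRanking": ranking,
        "amplitudeP95": p95,
        "spectrogram": [bins[i] for i in sorted(bins)],
    }
```

With only a handful of peaks, the nearest-rank percentile is simply the loudest peak, so `amplitudeP95` becomes informative only on sessions with many reactions.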
Talk-over-Talk Delta
Compared to the prior session, what actually moved?
- Inputs
- Two complete sessions for the same speaker: SessionSummary plus the underlying engagement buckets.
- Computation
- Compute deltas on overallScore, meanAttention, and reactionTotals (per type). Compute Cohen's d using the pooled standard deviation across bucket engagementScores. Classify by magnitude: |d| ≥ 0.5 is meaningful, 0.2 ≤ |d| < 0.5 is a trend, and anything smaller is noise.
- Output
- Per-metric deltas, Cohen's d, and a significance band (meaningful, trend, or noise) the manager can act on.
- Caveats
- Requires at least two buckets per session. Speakers with one short session cannot anchor a baseline; the panel renders an explicit "no prior" state in that case.
- References
- Cohen's d effect size convention (Cohen, 1988).
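The effect-size step can be sketched with the standard pooled-standard-deviation form of Cohen's d. The function names, the plain lists of bucket scores as input, and the sign convention (current session minus prior session) are assumptions for the example; the band thresholds come from the text.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(prior_scores, current_scores):
    """Cohen's d between two sessions' bucket engagement scores.

    Returns None when either session has fewer than two buckets or when
    the pooled variance is zero, since d is undefined in both cases.
    """
    n1, n2 = len(prior_scores), len(current_scores)
    if n1 < 2 or n2 < 2:
        return None
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = ((n1 - 1) * variance(prior_scores)
                  + (n2 - 1) * variance(current_scores)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return None
    # Assumption: a positive d means the current session scored higher.
    return (mean(current_scores) - mean(prior_scores)) / sqrt(pooled_var)


def classify_effect(d):
    """Map |d| onto the bands described in the text."""
    magnitude = abs(d)
    if magnitude >= 0.5:
        return "meaningful"
    if magnitude >= 0.2:
        return "trend"
    return "noise"
```

For instance, prior buckets of 50, 60, 70 against current buckets of 60, 70, 80 share a pooled standard deviation of 10 and differ by 10 in mean, so d is 1.0 and the change lands in the meaningful band.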