Behavioural capture stack

What is actually read from the video?

Inputs: Any modern video: venue AV, GoPro, iPhone, webcam. No wearables. No audience opt-in beyond the venue's standard consent.
Computation: 468 facial landmarks per detected face per frame via MediaPipe FaceMesh, body keypoints via MediaPipe Pose, voice prosody via Whisper-derived features. Face embeddings are never persisted; only the derived per-frame scoring vector survives the analysis pass.
Output: Per-frame analysis records: face slot, smile/surprise/neutral probabilities, attending boolean (head pitch ±25°, yaw ±30° of stage), optional bounding box.
References: MediaPipe FaceMesh — 468 landmarks (Google Research, 2019)

Synchrony Score

Did the audience react together or one face at a time?

Inputs: Per-frame face arrays across the session. Faces are matched by stable slot index within frame; identity is not tracked across the room.
Computation: For each rolling window (default 30 seconds): compute the time-series of smile probability for every attending face that contributed at least 3 frames; compute Pearson r for every pair; report mean pairwise r per window and overall.
Output: Per-window r in [-1, 1] plus a normalized 0..1 score (r + 1) / 2 for UI surfacing.
Caveats: Requires at least two attending faces per window. Windows that do not meet the bar are reported as zero pairs, not imputed.
References: Inter-subject facial response synchrony (PMC11109022, 2024)

What share of seconds did the room actually face the stage?

Inputs: Per-frame face arrays. The "attending" boolean is computed in the capture stack from head pitch and yaw relative to the camera (a stage-fixed proxy for "facing the speaker").
Computation: Aggregate faces by second-of-session. For each second compute attending / total. Mark the second as "held" when the fraction meets the threshold (default 0.6). Report the held rate, the mean attention fraction, and the longest consecutive sustained run.
Output: Per-second attentionFraction time series, headline heldRate, longestSustainedSec.
References: Standard head-pose attention proxy (per-frame pitch/yaw thresholding).

Where did the audible reactions land and how long did they hold?

Inputs: The AudioPeak stream produced by the capture stack: timestamp, RMS amplitude, duration above threshold, heuristic type (laugh / applause / ovation / reaction).
Computation: Count peaks by type. Sort by duration desc and keep the top N. Bin peaks into 5-second buckets, tracking weighted duration and peak amplitude per bin. Report the 95th-percentile amplitude across the session.
Output: countsByType, sustainedRanking, amplitudeP95, and a per-bin spectrogram payload for the heat-bar UI.
References: Affectiva audience response methodology (acoustic envelope segmentation).

Compared to the prior session, what actually moved?

Inputs: Two complete sessions for the same speaker: SessionSummary plus the underlying engagement buckets.
Computation: Compute deltas on overallScore, meanAttention, and reactionTotals (per type). Compute Cohen's d using pooled standard deviation across bucket engagementScores. Classify: |d| ≥ 0.5 = meaningful, ≥ 0.2 = trend, otherwise noise.
Output: Per-metric deltas, Cohen's d, and a significance band the manager can act on.
Caveats: Requires at least two buckets per session. Speakers with one short session cannot anchor a baseline; the panel renders an explicit "no prior" state in that case.
References: Cohen's d effect size convention (Cohen, 1988).