The Tool That Always Answers Is Hiding Something
Run a forty-second clip of a dim, half-occluded face through most behavioural AI tools and you will get a clean, confident number back. That number looks identical to the one you would get from a well-lit, full-frame, two-minute clip. The tool does not tell you the difference. It cannot, because it was built to always answer.
We think that is the central reliability failure in this field, and we built GRW Project to behave differently. A system that produces a score for every input, regardless of input quality, is not more capable than one that sometimes declines. It is just less honest. It has moved the uncertainty out of the report and into your decision, where you can no longer see it.
Our position is straightforward. The measurement is only as good as the evidence underneath it, and the evidence is not always there. When it is not, the trustworthy move is to say so. So we did the engineering work to let the engine say it, clearly, as part of the output.
Where the Evidence Actually Comes From
GRW Project does not read emotions off a face. It measures geometry. Google's MediaPipe FaceMesh (Kartynnik et al., 2019) tracks 468 facial landmarks per frame, and from the changing distances and angles between those points we compute Action Unit proxies grounded in the Facial Action Coding System (Ekman and Friesen, 1978): AU6 (cheek raiser) plus AU12 (lip corner puller) for a Duchenne smile, AU4 (brow lowerer) for a furrowed brow, and blink dynamics in the spirit of Soukupová and Čech (2016). These are the raw signals behind Composure Index, Presence Quotient, and the rest.
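To make "measuring geometry" concrete, here is a minimal sketch of one such signal: the eye aspect ratio (EAR) from Soukupová and Čech (2016), computed from six eye-contour points. The function name and the sample coordinates are ours, chosen for illustration. This is not GRW Project's implementation, and the mapping from FaceMesh landmark indices to these six points is deliberately left out.

```python
import math


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """Eye aspect ratio from six (x, y) eye-contour points.

    p1 and p4 are the horizontal eye corners; p2 and p3 sit on the upper
    lid, p6 and p5 on the lower lid directly beneath them.
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("degenerate eye: the two corners coincide")
    return vertical / (2.0 * horizontal)


# Illustrative coordinates only: a wide-open eye and a nearly closed one.
open_eye = eye_aspect_ratio((0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1))
closed_eye = eye_aspect_ratio((0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1))
print(round(open_eye, 3), round(closed_eye, 3))
```

The ratio falls sharply as the lids close, which is what makes it usable as a blink signal. It is also a ratio of landmark distances, so any drift in those six points feeds straight into the value.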
Every one of those computations depends on the mesh being reliably anchored to a real face, frame after frame. If the landmarks drift because the face is half in shadow, or jump because a hand crosses the mouth, the AU proxies built on top of them inherit that noise directly.
This is the part most marketing skips. A behavioural score is a tower of derivations sitting on landmark stability. When the base is shaky, the tower does not collapse visibly. It produces a plausible-looking number that means far less than it appears to. The only defence is to measure the base before you trust the tower.
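One way to "measure the base" is to quantify how much the landmarks move between consecutive frames. The sketch below is a deliberately crude stability measure under our own assumptions: the function name is hypothetical, landmarks are plain (x, y) tuples in normalized coordinates, and real head motion is not separated from tracking noise, which a production system would need to do.

```python
import math
from statistics import mean


def mean_frame_jitter(frames):
    """Average per-landmark displacement between consecutive frames.

    frames: a list of frames, each a list of (x, y) landmark positions
    given in the same order from frame to frame.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure movement")
    steps = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("frames must contain the same number of landmarks")
        # Mean distance each landmark moved between these two frames.
        steps.append(mean(math.dist(a, b) for a, b in zip(prev, curr)))
    return mean(steps)


# Illustrative data: a mesh that holds still and one that jumps.
steady = [[(0.5, 0.5), (0.6, 0.5)]] * 3
jumpy = [[(0.5, 0.5)], [(0.5, 0.6)], [(0.5, 0.5)]]
print(mean_frame_jitter(steady), mean_frame_jitter(jumpy))
```

A high value on a subject who is sitting still suggests the mesh is losing its anchor, and anything computed from those landmarks deserves less trust.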
The Quality Gate: Four Checks Before Any Output
Before the engine produces a single score, the clip passes through a quality gate that checks four things. Clip duration: is there enough footage to establish a baseline and observe change, or are we extrapolating from a handful of frames? Face-detection rate: across the clip, what fraction of frames yielded a confidently located mesh rather than a guess or a dropout? Lighting: is illumination sufficient and even enough for landmark positions to be trusted, or is half the face in shadow? Occlusion: is the face consistently visible, or is it blocked by a hand, a microphone, hair, or a turn away from the lens?
These are not abstract thresholds. They are the specific conditions under which the 468-landmark mesh either holds its anchor or stops being reliable. A clip that is brief but still long enough to establish a baseline, well-lit, unobstructed, and high in detection rate passes cleanly. A long but poorly-lit clip with frequent occlusion does not, no matter how interesting the footage looks to a human eye.
The gate runs first, by design. It is cheaper and far more honest to refuse a measurement than to publish one and bury a caveat in a footnote nobody reads. The gate is where epistemic honesty becomes mechanical rather than aspirational.
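The four checks can be sketched as a function that returns every reason a clip fails, with an empty list meaning the clip passes. All names and threshold values here are hypothetical placeholders chosen for illustration; they are not GRW Project's actual field names or cut-offs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClipQuality:
    duration_s: float      # length of the clip in seconds
    detection_rate: float  # fraction of frames with a confidently located mesh
    lighting: float        # 0..1 score for sufficient, even illumination
    occlusion_rate: float  # fraction of frames where the face is blocked


# Hypothetical thresholds, for illustration only.
MIN_DURATION_S = 30.0
MIN_DETECTION_RATE = 0.90
MIN_LIGHTING = 0.40
MAX_OCCLUSION_RATE = 0.15


def quality_gate(q: ClipQuality) -> List[str]:
    """Return the reasons a clip fails the gate; an empty list means it passes."""
    failures = []
    if q.duration_s < MIN_DURATION_S:
        failures.append("too short")
    if q.detection_rate < MIN_DETECTION_RATE:
        failures.append("too few confident frames")
    if q.lighting < MIN_LIGHTING:
        failures.append("too dark")
    if q.occlusion_rate > MAX_OCCLUSION_RATE:
        failures.append("too occluded")
    return failures


good = ClipQuality(duration_s=120.0, detection_rate=0.98, lighting=0.8, occlusion_rate=0.02)
bad = ClipQuality(duration_s=40.0, detection_rate=0.55, lighting=0.2, occlusion_rate=0.4)
print(quality_gate(good))
print(quality_gate(bad))
```

Collecting all failures, rather than stopping at the first, is a design choice in this sketch: it lets the person re-recording fix everything in one pass.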
High, Medium, Low, and ABSTAIN
Every score that clears the gate ships with a confidence level, and the level is not decoration. High confidence means duration, detection rate, lighting, and visibility were all strong and the signal was consistent enough that we stand firmly behind the read. Medium means the evidence was good but bounded: usable footage with some compromise, a score worth acting on with a human in the loop. Low means we computed a result under marginal conditions, and you should treat it as a prompt to look closer, not a verdict.
Below Low sits the state we are most proud of: ABSTAIN. When the evidence is too thin to support an honest read, the engine declines to score rather than guessing. It does not invent a number to fill the cell. It returns the reason it abstained (too short, too dark, too occluded, too few confident frames) so you know exactly what to fix and can re-record.
Abstention is not a failure mode. It is a designed output, ranked alongside the confidence levels, and on a thin clip it is the most accurate thing the engine can tell you.
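Treating ABSTAIN as a first-class output can be sketched as a result type in which the score is simply absent when the engine declines, and the reasons travel with the result. The type names, the single `evidence` value, and the 0.8 and 0.5 cut-offs are all assumptions made for this illustration, not a description of how GRW Project grades confidence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"
    ABSTAIN = "ABSTAIN"


@dataclass(frozen=True)
class Reading:
    metric: str
    confidence: Confidence
    score: Optional[float]          # None when the engine abstains
    reasons: Tuple[str, ...] = ()   # why it abstained, if it did


def make_reading(metric: str, raw_score: float, evidence: float,
                 gate_failures: Sequence[str]) -> Reading:
    """Attach a confidence level to a score, or abstain with reasons.

    evidence is a hypothetical 0..1 summary of how strong and consistent
    the underlying signal was; the cut-offs below are placeholders.
    """
    if gate_failures:
        # No number is invented to fill the cell.
        return Reading(metric, Confidence.ABSTAIN, None, tuple(gate_failures))
    if evidence >= 0.8:
        level = Confidence.HIGH
    elif evidence >= 0.5:
        level = Confidence.MEDIUM
    else:
        level = Confidence.LOW
    return Reading(metric, level, raw_score)


committed = make_reading("Presence Quotient", 71.0, evidence=0.86, gate_failures=[])
declined = make_reading("Presence Quotient", 71.0, evidence=0.86,
                        gate_failures=["too dark", "too occluded"])
print(committed.confidence.value, committed.score)
print(declined.confidence.value, declined.score, declined.reasons)
```

Because the confidence level and the score live in the same object, a consumer cannot read one without having the other in hand.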
Why Skeptics Should Treat Abstention as the Feature
If you are evaluating behavioural intelligence and you are skeptical, you are exactly the buyer we built this for. The right test is not whether a tool produces impressive numbers on a good clip. Every serious tool does. The test is what it does on a bad one. A tool that abstains is telling you it has a defensible boundary between signal and noise, and that it knows where that boundary sits.
That boundary is also why our confidence tags are worth reading. A High-confidence Presence Quotient and a Low-confidence one are different objects, and conflating them is how organisations end up acting on noise and blaming the data. We separate them on the face of every report so the strength of the evidence travels with the number instead of getting lost.
We would rather hand you fewer scores you can stand behind than a full dashboard you have to second-guess. A tool that knows when to abstain has earned the right to be believed when it commits. That is the whole argument, and we are happy to be measured by it.