Technical Specification

How Axiom verifies capture integrity

A deep dive into our physics-based verification approach: how we use synchronized sensor data and rolling shutter constraints to distinguish authentic captures from synthetic or recaptured content.

Overview

Axiom is a capture-to-verdict verification system that evaluates whether video was recorded through a physically consistent capture process. The core technique is physics-constrained consistency checking between video-derived motion cues and device sensor telemetry.

The verification system operates on the principle that authentic video exhibits measurable consistency between visual motion and inertial measurements, a property that synthetic or recaptured content cannot easily replicate. The system comprises two core layers:

Hardware-rooted integrity. Cryptographic signatures from the device secure element bind sensor data to specific hardware. Non-exportable keys prevent forgery of the authentication chain.

Physics verification. Rolling shutter timing creates dense temporal constraints. IMU data must match optical flow in ways that synthetic video cannot satisfy without precise knowledge of the physical capture conditions.

Note: The approach requires controlled capture through our SDK. This is not a detector for arbitrary uploaded video. It's verification infrastructure for workflows where capture can be mandated.

Threat Model

The protocol defines three adversary capability tiers, each requiring different defensive mechanisms.

Tier 1: Consumer-grade manipulation

Adversaries with access to video editing tools and face manipulation software, but lacking ability to generate physically consistent sensor data. Against such adversaries, cryptographic signature verification provides adequate protection. Any post-capture modification invalidates the authentication chain.

Tier 2: Analog hole attacks

Adversaries with high-fidelity generative models who may attempt screen recapture: displaying synthetic content on a physical screen and filming it with an authenticated device. The physics verification layer addresses this by detecting inconsistencies between captured video and sensor readings produced during filming.

Tier 3: Laboratory-grade attacks (in development)

Adversaries with reference displays, robotic motion platforms, and precise synchronization capabilities. Defending against such adversaries requires active challenge-response protocols that introduce unpredictable elements into the capture process.

Trust Assumptions

The security model assumes that hardware security modules resist key extraction and that platform attestation mechanisms reflect the actual device security state. Platform attestation provides a probabilistic signal rather than a cryptographic guarantee; sophisticated attackers may circumvent software-based integrity checks.

The protocol does not assume integrity of the application runtime. Verification occurs server-side against signed transcripts, so client-side compromise affects only the ability to produce valid captures, not the ability to forge them.

Capture Pipeline

The capture system records video frames alongside synchronized inertial measurements. Gyroscope and accelerometer sample at approximately 200 Hz while video frames arrive at 30 Hz. Each sensor reading includes a timestamp from the device monotonic clock, enabling precise temporal alignment during verification.
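Because the IMU samples at roughly 200 Hz while frames arrive at 30 Hz, verification must locate the sensor readings nearest each frame timestamp. A minimal sketch of that lookup, using the monotonic timestamps described above (the function name and list-based representation are illustrative, not the SDK's actual API):

```python
from bisect import bisect_left

def nearest_sample(timestamps: list[float], t: float) -> int:
    """Index of the IMU sample closest in time to t.

    Assumes timestamps are monotonic, as produced by the device
    monotonic clock described in the capture pipeline.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose whichever neighbor is closer in time.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1
```

In practice a verifier would interpolate between neighboring samples rather than snap to one, but the alignment step is the same.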

Rolling Shutter Model

Consumer smartphone cameras employ rolling shutter sensors that expose each row sequentially rather than simultaneously. This creates dense temporal sampling within each frame. For a frame captured at time t_frame with readout duration τ and height H, the exposure time of row y is:

t(y) = t_{\text{frame}} - \tau/2 + (y \cdot \tau)/H

Typical readout durations range from 10 to 33 milliseconds. This temporal spread means that during camera rotation, different rows observe the scene at different orientations, producing characteristic geometric distortions that correlate with the gyroscope signal.
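The row timing model above translates directly into code. A minimal sketch (parameter names are illustrative):

```python
def row_exposure_time(t_frame: float, tau: float, height: int, y: int) -> float:
    """Exposure time of row y under the rolling shutter model
    t(y) = t_frame - tau/2 + (y * tau) / height,
    where tau is the sensor readout duration and height is the
    frame height in rows. The center row is exposed at t_frame.
    """
    return t_frame - tau / 2.0 + (y * tau) / height
```

With a 20 ms readout on a 1000-row frame, the top and bottom rows are exposed 20 ms apart, which is why each frame samples the gyroscope trajectory at many distinct times rather than one.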

Cryptographic Binding

Direct signing of individual sensor samples is infeasible due to HSM latency. The protocol accumulates samples into batches and constructs a Merkle tree over each batch. The tree root is signed using a key held in the device secure element, producing an authenticated commitment to the batch contents.

The complete transcript chains successive commitments together, incorporating frame hashes, batch roots, timing metadata, and calibration parameters. Let C_i denote the commitment at step i, H_f the frame hash, R_b the batch Merkle root, H_t the timing-metadata hash, and H_p the calibration-parameter hash:

C_i = H(C_{i-1} \| H_f \| R_b \| H_t \| H_p)

This construction ensures any modification to any component invalidates all subsequent commitments.
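The batch-commitment scheme can be sketched in a few lines. This is an illustration of the construction, not the production implementation; SHA-256 and the duplicate-last-leaf convention for odd levels are assumptions here:

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Hash the concatenation of byte strings (SHA-256 assumed)."""
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle root over a batch of sensor samples.

    Odd-sized levels duplicate their last node (one common convention).
    """
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def chain_commitment(prev: bytes, frame_hash: bytes, batch_root: bytes,
                     timing_hash: bytes, calib_hash: bytes) -> bytes:
    """C_i = H(C_{i-1} || H_f || R_b || H_t || H_p)"""
    return h(prev, frame_hash, batch_root, timing_hash, calib_hash)
```

Signing only the batch root (rather than every sample) is what keeps the HSM off the hot path, while the chained commitment propagates any tampering forward through the transcript.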

Physics Verification

The verification procedure compares observed visual motion against motion predicted from inertial measurements. Optical flow between consecutive frames is estimated using a neural network that produces dense correspondences with uncertainty values. Depth is estimated monocularly to enable decomposition of the flow field.

Kinematic Decomposition

Optical flow u(p) at image point p decomposes into rotational and translational contributions. The rotational component depends only on the angular velocity ω measured by the gyroscope and is independent of scene depth, while the translational component scales with the inverse depth ρ(p):

\mathbf{u}(p) = \mathbf{u}_\omega(p, \omega) + \rho(p) \cdot \mathbf{u}_v(p, \mathbf{v})

The rotational flow component can be computed directly from gyroscope readings and camera intrinsics. For authentic video, subtracting predicted rotation from observed flow yields a residual consistent with translational motion and depth structure.
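Under the small-angle approximation, the rotational flow at a point in normalized image coordinates has a standard closed form. A sketch, assuming a z-forward, x-right, y-down camera frame (sign conventions differ between platforms, so treat the signs as illustrative):

```python
def rotational_flow(x: float, y: float,
                    omega: tuple[float, float, float],
                    dt: float) -> tuple[float, float]:
    """Predicted rotational flow (displacement over dt) at normalized
    image point (x, y), given gyroscope angular velocity omega (rad/s).

    Small-angle model: the rotational field depends on omega and the
    point position only, never on scene depth.
    """
    wx, wy, wz = omega
    u = (x * y * wx - (1.0 + x * x) * wy + y * wz) * dt
    v = ((1.0 + y * y) * wx - x * y * wy - x * wz) * dt
    return u, v
```

Evaluating this at each pixel (after mapping through the camera intrinsics) gives the rotation-predicted field that is subtracted from observed flow before residual analysis.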

Residual Analysis

The physics residual quantifies discrepancy between observed and predicted motion. Let û denote observed flow and ũ the flow predicted from IMU integration:

r(p) = \| \hat{u}(p) - \tilde{u}(p) \|

Aggregating this residual over frames produces a consistency score. Authentic video produces low, stable residuals. Recaptured or synthetic video exhibits elevated residuals due to absence of genuine sensor correlation.
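A minimal sketch of the per-frame aggregation, using a simple mean (the production system would weight by flow uncertainty and aggregate across frames; the names here are illustrative):

```python
import math

Flow = tuple[float, float]

def frame_residual(observed: list[Flow], predicted: list[Flow]) -> float:
    """Mean per-point residual ||u_hat(p) - u_tilde(p)|| over one frame.

    observed: flow estimated from the video (u_hat).
    predicted: flow integrated from the IMU (u_tilde).
    """
    total = 0.0
    for (ou, ov), (pu, pv) in zip(observed, predicted):
        total += math.hypot(ou - pu, ov - pv)
    return total / len(observed)
```

Authentic capture should keep this statistic low and stable across frames; recapture or synthesis pushes it up because the sensor stream was not produced by the motion that generated the pixels.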

Decision Output

The product output is a three-state decision object with explicit abstention. Instead of forcing a binary claim, the system returns one of three outcomes:

Verified. The capture is physically consistent with a real camera moving in a real optical scene under the observability conditions the system can validate. Routes to fast-track automation.

Integrity violation. A hard constraint is violated strongly enough that spoofing, recapture, or injection is the most plausible explanation. Routes to review queue.

Inconclusive. The system cannot decide without unacceptable risk of false accusation. Triggers guided recapture with specific remediation instructions.

Design intent: Avoid false accusations by reserving "violation" for high-confidence cases. Inconclusive is a safety valve that triggers operational action (recapture) rather than implying wrongdoing.

The decision object includes reason codes, confidence scores, and remediation instructions. This enables workflow integration where the verification result drives routing logic rather than requiring human interpretation of probability scores.
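The three-state decision object and its routing behavior can be sketched as follows. Field names, reason-code strings, and route labels are hypothetical, chosen to mirror the outcomes described above rather than the actual API:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    VERIFIED = "verified"
    INTEGRITY_VIOLATION = "integrity_violation"
    INCONCLUSIVE = "inconclusive"

@dataclass
class Decision:
    verdict: Verdict
    confidence: float                          # 0..1
    reason_codes: list[str] = field(default_factory=list)
    remediation: list[str] = field(default_factory=list)

def route(decision: Decision) -> str:
    """Map a verification decision to a workflow action, so routing
    logic consumes the verdict directly instead of a raw score."""
    if decision.verdict is Verdict.VERIFIED:
        return "fast_track"
    if decision.verdict is Verdict.INTEGRITY_VIOLATION:
        return "review_queue"
    return "guided_recapture"
```

The point of the explicit enum is that integrators branch on the verdict, while reason codes and remediation strings carry the detail (e.g. which constraint failed, or what to change before recapturing).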