Sharingan

Real-time gaze following

Sharingan is a gaze-following model built on a Vision Transformer (ViT-12) encoder with a conditional DPT decoder. Unlike gaze-estimation tools that output an eye direction in degrees, Sharingan predicts the gaze target directly in the scene: the pixel location that a person is looking at.

The model takes a full scene image and one or more head crops, and outputs a 64×64 gaze heatmap and an in/out-of-frame probability for each person. In this web implementation, up to three people share a single ViT forward pass (N=3 batch), so the backbone runs once regardless of how many faces are in the frame.
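
A sketch of one inference pass in TypeScript, with assumed tensor names, shapes, and a 224×224 input resolution (the real model's input/output names may differ):

import * as ort from 'onnxruntime-web';

async function runGaze(
  session: ort.InferenceSession,
  scene: Float32Array,  // 1×3×224×224 normalised scene image (assumed resolution)
  heads: Float32Array,  // N×3×224×224 head crops, N ≤ 3
  n: number,
) {
  const outputs = await session.run({
    scene: new ort.Tensor('float32', scene, [1, 3, 224, 224]),
    heads: new ort.Tensor('float32', heads, [n, 3, 224, 224]),
  });
  // Assumed output names: 'heatmap' (N×64×64 logits), 'inout' (N×1 probability).
  return {
    heatmaps: outputs['heatmap'].data as Float32Array,
    inout: outputs['inout'].data as Float32Array,
  };
}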

Face detection uses MediaPipe BlazeFace (short-range model). The ONNX model runs entirely in the browser via ONNX Runtime Web with WebGPU acceleration. A discrete or integrated GPU is required; WASM fallback is not supported due to speed constraints.
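
Session setup is standard ONNX Runtime Web; a minimal sketch, with 'sharingan.onnx' as a placeholder path (depending on the onnxruntime-web version, the dedicated 'onnxruntime-web/webgpu' bundle may be needed):

import * as ort from 'onnxruntime-web';

// Fail fast if the browser has no WebGPU; this tool does not fall back to WASM.
if (!('gpu' in navigator)) {
  throw new Error('WebGPU is not available in this browser');
}

// Requesting the WebGPU execution provider only makes session creation throw
// instead of silently degrading to a slower backend.
const session = await ort.InferenceSession.create('sharingan.onnx', {
  executionProviders: ['webgpu'],
});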

The tool supports both live webcam and uploaded video sources. In video mode the file is processed frame-by-frame, so every frame receives a full inference pass.
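
One way to achieve this is to seek through the file instead of playing it. A sketch, assuming a known frame rate (browsers do not expose it, so the 30 fps default is a placeholder):

async function processVideo(video: HTMLVideoElement, fps = 30) {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d')!;
  const nFrames = Math.floor(video.duration * fps);
  for (let frame = 0; frame < nFrames; frame++) {
    // Seek to the middle of the frame interval and wait for the decoder.
    video.currentTime = (frame + 0.5) / fps;
    await new Promise<void>((resolve) =>
      video.addEventListener('seeked', () => resolve(), { once: true }),
    );
    ctx.drawImage(video, 0, 0);
    // ...run face detection and gaze inference on the canvas pixels here...
  }
}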


Output columns

Each recorded row corresponds to one person in one frame. Beyond the basic face and gaze coordinates, the CSV includes gaze vector metrics, heatmap uncertainty metrics, and pairwise Joint Visual Attention (JVA) scores; a TypeScript view of the same schema follows the table.

Column              Description
frame               Frame index (0-based)
timestamp           Video timestamp in seconds
face_index          Person index within the frame (0–2)
face_x1             Face bounding box, left edge (pixels)
face_y1             Face bounding box, top edge (pixels)
face_x2             Face bounding box, right edge (pixels)
face_y2             Face bounding box, bottom edge (pixels)
gaze_x              Predicted gaze target, x (pixels)
gaze_y              Predicted gaze target, y (pixels)
inout_prob          Probability that the gaze target is in-frame (0–1)

Gaze vector
gaze_vx             Unit gaze direction, x component (face center → gaze target)
gaze_vy             Unit gaze direction, y component
gaze_angle_deg      Gaze angle in degrees (atan2; 0° = right, 90° = down)
gaze_dist_px        Pixel distance from face center to predicted gaze target

Heatmap metrics
hm_peak             Peak probability of the softmax heatmap; higher means a more confident gaze target
hm_entropy          Normalised Shannon entropy of the heatmap (0 = perfectly certain, 1 = uniform)
hm_spread_px        Weighted spatial standard deviation of the heatmap in pixels (RMS of x and y)

Joint Visual Attention (JVA)
jva_partner_idx     Face index of the best JVA partner in the frame (−1 if only one person detected)
jva_gaze_dist_px    Pixel distance between this person's gaze target and their partner's gaze target
jva_dir_sim         Cosine similarity of the two gaze direction vectors (−1 = opposite, +1 = identical)
jva_hm_overlap      Bhattacharyya coefficient between the two softmax heatmaps (0 = no overlap, 1 = identical)
jva_score           Combined JVA score averaging gaze proximity, direction similarity, and heatmap overlap (0–1)

tag                 User-defined label from the recording bar
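
The same schema as a TypeScript type, with field names matching the CSV headers:

interface GazeRow {
  frame: number;            // 0-based frame index
  timestamp: number;        // seconds
  face_index: number;       // 0-2
  face_x1: number;
  face_y1: number;
  face_x2: number;
  face_y2: number;
  gaze_x: number;
  gaze_y: number;
  inout_prob: number;       // 0-1
  gaze_vx: number;
  gaze_vy: number;
  gaze_angle_deg: number;
  gaze_dist_px: number;
  hm_peak: number;
  hm_entropy: number;       // 0-1
  hm_spread_px: number;
  jva_partner_idx: number;  // -1 if no partner
  jva_gaze_dist_px: number;
  jva_dir_sim: number;      // -1..+1
  jva_hm_overlap: number;   // 0-1
  jva_score: number;        // 0-1
  tag: string;
}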

Metrics reference
Gaze vector

Derived from the straight line between the face-bounding-box center and the argmax of the predicted heatmap. gaze_angle_deg follows the standard atan2 convention in image coordinates: 0° points right and, because y grows downward, angles increase clockwise (90° = down). gaze_dist_px measures how far the predicted gaze target lies from the person's own position in the frame.
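
A sketch of how the four columns are derived, with the face box and gaze target both in scene pixels:

function gazeVector(
  face: { x1: number; y1: number; x2: number; y2: number },
  gaze: { x: number; y: number },
) {
  const cx = (face.x1 + face.x2) / 2;  // face center
  const cy = (face.y1 + face.y2) / 2;
  const dx = gaze.x - cx;
  const dy = gaze.y - cy;
  const dist = Math.hypot(dx, dy);
  return {
    gaze_vx: dist > 0 ? dx / dist : 0,  // unit direction, face center → target
    gaze_vy: dist > 0 ? dy / dist : 0,
    // atan2 with y pointing down: 0° = right, 90° = down, clockwise-positive.
    gaze_angle_deg: (Math.atan2(dy, dx) * 180) / Math.PI,
    gaze_dist_px: dist,
  };
}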

Heatmap uncertainty

The raw 64×64 model output is converted to a softmax probability distribution before computing metrics. hm_peak and hm_entropy measure prediction confidence from complementary angles — a high-entropy, low-peak heatmap indicates the model is uncertain about where the person is looking. hm_spread_px captures the spatial extent of the probable gaze region.
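
A sketch of the three metrics over one person's 64×64 output; the scaleX/scaleY parameters (heatmap cell → scene pixels) are an assumption about how the spread is converted to pixels:

function heatmapMetrics(logits: Float32Array, size = 64, scaleX = 1, scaleY = 1) {
  // Softmax over all cells (max-subtracted for numerical stability).
  const max = logits.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = Array.from(logits, (v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const p = exps.map((v) => v / sum);

  let peak = 0, entropy = 0, mx = 0, my = 0;
  p.forEach((pi, i) => {
    peak = Math.max(peak, pi);
    if (pi > 0) entropy -= pi * Math.log(pi);
    mx += pi * (i % size);           // probability-weighted mean x (cells)
    my += pi * Math.floor(i / size); // probability-weighted mean y (cells)
  });
  let vx = 0, vy = 0;
  p.forEach((pi, i) => {
    vx += pi * ((i % size) - mx) ** 2;
    vy += pi * (Math.floor(i / size) - my) ** 2;
  });
  return {
    hm_peak: peak,
    hm_entropy: entropy / Math.log(size * size),  // 0 = certain, 1 = uniform
    // RMS of the x and y standard deviations, converted to scene pixels.
    hm_spread_px: Math.sqrt((vx * scaleX ** 2 + vy * scaleY ** 2) / 2),
  };
}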

Joint Visual Attention

JVA metrics are computed pairwise: each person is matched to the partner whose jva_score is highest. The score combines three cues — spatial proximity of gaze targets (falls to 0 at ~25% of frame diagonal), cosine similarity of gaze directions, and the Bhattacharyya coefficient between heatmaps. A jva_score near 1 indicates two people are very likely attending to the same location.
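
A sketch of the combined score under those three cues; the equal weighting and the clamping of negative direction similarity to 0 are assumptions:

function jvaScore(
  a: { gx: number; gy: number; vx: number; vy: number; hm: number[] },
  b: { gx: number; gy: number; vx: number; vy: number; hm: number[] },
  frameDiag: number,
) {
  // Proximity cue: 1 at the same pixel, 0 at or beyond 25% of the diagonal.
  const dist = Math.hypot(a.gx - b.gx, a.gy - b.gy);
  const proximity = Math.max(0, 1 - dist / (0.25 * frameDiag));
  // Direction cue: cosine similarity of the unit gaze vectors, clamped to [0, 1].
  const dirSim = Math.max(0, a.vx * b.vx + a.vy * b.vy);
  // Overlap cue: Bhattacharyya coefficient Σ√(p·q) over the softmax heatmaps.
  const overlap = a.hm.reduce((s, pi, i) => s + Math.sqrt(pi * b.hm[i]), 0);
  return (proximity + dirSim + overlap) / 3;
}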

Source Code