Sharingan
Real-time gaze following
RUN
Sharingan is a gaze-following model built on a Vision Transformer (ViT-12) encoder with a conditional DPT decoder. Unlike gaze estimation tools that predict where someone's eyes are pointing in degrees, Sharingan predicts the gaze target directly in the scene — the pixel location that a person is looking at.
The model takes a full scene image and one or more head crops, then outputs a 64×64 gaze heatmap per person and an in/out-of-frame probability. In this web implementation, up to 3 people share a single ViT forward pass (N=3 batch), so the backbone runs once regardless of how many people are in the frame.
Face detection uses MediaPipe BlazeFace (short-range model). The ONNX model runs entirely in the browser via ONNX Runtime Web with WebGPU acceleration. A discrete or integrated GPU is required; WASM fallback is not supported due to speed constraints.
The tool supports both live webcam and uploaded video sources. In video mode the file is processed frame-by-frame so every frame receives a full inference pass.
Output columns
Each recorded row corresponds to one person in one frame. Beyond the basic face and gaze coordinates, the CSV includes gaze vector metrics, heatmap uncertainty metrics, and pairwise Joint Visual Attention (JVA) scores.
| Column | Description |
|---|---|
frame | Frame index (0-based) |
timestamp | Video timestamp in seconds |
face_index | Person index within the frame (0–2) |
face_x1 | Face bounding box — left edge (pixels) |
face_y1 | Face bounding box — top edge (pixels) |
face_x2 | Face bounding box — right edge (pixels) |
face_y2 | Face bounding box — bottom edge (pixels) |
gaze_x | Predicted gaze target — x (pixels) |
gaze_y | Predicted gaze target — y (pixels) |
inout_prob | Probability that gaze target is in-frame (0–1) |
| Gaze vector | |
gaze_vx | Unit gaze direction — x component (face center → gaze target) |
gaze_vy | Unit gaze direction — y component |
gaze_angle_deg | Gaze angle in degrees (atan2; 0° = right, 90° = down) |
gaze_dist_px | Pixel distance from face center to predicted gaze target |
| Heatmap metrics | |
hm_peak | Peak probability of the softmax heatmap — higher means more confident gaze target |
hm_entropy | Normalised Shannon entropy of the heatmap (0 = perfectly certain, 1 = uniform) |
hm_spread_px | Weighted spatial standard deviation of the heatmap in pixels (RMS of x and y) |
| Joint Visual Attention (JVA) | |
jva_partner_idx | Face index of the best JVA partner in the frame (−1 if only one person detected) |
jva_gaze_dist_px | Pixel distance between this person's gaze target and their partner's gaze target |
jva_dir_sim | Cosine similarity of the two gaze direction vectors (−1 = opposite, +1 = identical) |
jva_hm_overlap | Bhattacharyya coefficient between the two softmax heatmaps (0 = no overlap, 1 = identical) |
jva_score | Combined JVA score averaging gaze proximity, direction similarity, and heatmap overlap (0–1) |
tag | User-defined label from the recording bar |
Metrics reference
Gaze vector
Derived from the straight line between the face-bounding-box center and the argmax of the
predicted heatmap. gaze_angle_deg follows standard atan2 convention
(0° = rightward, angles increase clockwise). gaze_dist_px is a proxy for
how far the person is looking relative to their own position in the frame.
Heatmap uncertainty
The raw 64×64 model output is converted to a softmax probability distribution before
computing metrics. hm_peak and hm_entropy measure prediction
confidence from complementary angles — a high-entropy, low-peak heatmap indicates the
model is uncertain about where the person is looking. hm_spread_px captures
the spatial extent of the probable gaze region.
Joint Visual Attention
JVA metrics are computed pairwise: each person is matched to the partner whose
jva_score is highest. The score combines three cues — spatial proximity of
gaze targets (falls to 0 at ~25% of frame diagonal), cosine similarity of gaze directions,
and the Bhattacharyya coefficient between heatmaps. A jva_score near 1
indicates two people are very likely attending to the same location.