# Video Inference JSON Output Format

**File Structure**

The JSON output file contains several lists of equal length. The index of each list corresponds to a specific frame of the video being inferred upon.

```json
{
    "frame_offsets": [...],
    "time_offsets": [...],
    "fine_tuned_model_id": [...],
    "clip": [...],
    "gaze": [...],
    ...
}
```

The `frame_offsets` list contains the video frame numbers that were extracted and inferred on by each model. The list always starts at `0`. For example, if the input video has 24 frames per second and we specify an `infer_fps` of 4 frames per second, then the frame indices selected for inference (`frame_offsets`) will be `[0, 6, 12, 18, 24, 30, ...]`.

It is best practice to choose an `infer_fps` that is a factor of the video frame rate. If the video frame rate is not a perfect multiple of `infer_fps`, the `frame_offsets` will be approximations. Choose the minimum `infer_fps` that works for your application, since higher values result in greater cost and slower results. The system will not return output if `infer_fps` is greater than the video frame rate.

The `time_offsets` list indicates the time in the video playback at which each frame occurs. Each entry is in seconds, rounded to 4 decimal places.
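The relationship between the two offset lists can be sketched as follows. This is a minimal illustration of the sampling rules described above, not the service's actual implementation:

```python
def sample_frames(video_fps: int, infer_fps: int, total_frames: int):
    """Sketch of how frame_offsets and time_offsets relate.

    The step is exact only when infer_fps evenly divides video_fps;
    otherwise the offsets are approximations, as noted above.
    """
    step = round(video_fps / infer_fps)
    frame_offsets = list(range(0, total_frames, step))
    time_offsets = [round(f / video_fps, 4) for f in frame_offsets]
    return frame_offsets, time_offsets
```

For the 24 fps / `infer_fps=4` example above, the first 36 frames yield frame offsets `[0, 6, 12, 18, 24, 30]` and time offsets `[0.0, 0.25, 0.5, 0.75, 1.0, 1.25]`.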

The remaining lists contain inference data. Each element of a list is either a dictionary or `null` (loaded as `None` in Python), the latter indicating that the frame was not successfully inferred by the model.
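Because the lists are of equal length, they can be zipped together when consuming the file, skipping the entries for frames that failed. A minimal sketch, where `"gaze"` stands in for whichever model key your output contains:

```python
def iter_inferred_frames(results: dict, model_key: str):
    """Yield (frame_offset, time_offset, inference) triples for frames
    the model successfully inferred, skipping null (None) entries."""
    for frame, t, inference in zip(
        results["frame_offsets"], results["time_offsets"], results[model_key]
    ):
        if inference is not None:
            yield frame, t, inference
```

After loading the file with `json.load`, pass the resulting dictionary and the model key of interest to this helper.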

In the next section, we elaborate on the results returned for different model types.

**Object Detection**

The example below illustrates one element of an object detection model's inference output list.

```json
{
    "time": 0.06994929000006778, 
    "image": {"width": 480, "height": 360}, 
    "predictions": [
        {
            "x": 360.0, 
            "y": 114.0, 
            "width": 36.0, 
            "height": 104.0, 
            "confidence": 0.6243005394935608, 
            "class": "zebra", 
            "class_id": 1
        }
    ]
}
```

The `time` field is the inference computation time and can usually be ignored.

The `image` field gives the pixel dimensions of the input frame.

The `predictions` list contains one entry per detected object, including its bounding box, confidence, and class.
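A common convention in this style of output is that `x`/`y` give the box center rather than a corner. Assuming that holds here (verify against your own results), corner coordinates can be recovered with a small helper:

```python
def to_corners(pred: dict) -> tuple[float, float, float, float]:
    """Convert a prediction's center-based box to corner coordinates.

    Assumes x/y is the box center; returns (x_min, y_min, x_max, y_max).
    """
    half_w, half_h = pred["width"] / 2, pred["height"] / 2
    return (pred["x"] - half_w, pred["y"] - half_h,
            pred["x"] + half_w, pred["y"] + half_h)
```

Applied to the zebra example above, this gives `(342.0, 62.0, 378.0, 166.0)`.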

**Gaze Detection**

The example below illustrates one element of the gaze detection model's inference output list.

```json
{
    "predictions": [
        {
            "face": {
                "x": 236,
                "y": 208,
                "width": 94,
                "height": 94,
                "confidence": 0.9232424,
                "class": "face",
                "class_confidence": null,
                "class_id": 0,
                "tracker_id": null
            },
            "landmarks": [
                {
                    "x": 207,
                    "y": 183
                },
                ... (6 landmarks)
            ],
            "yaw": 0.82342350129345,
            "pitch": 0.23152452412
        },
        ...
    ],
    "time": 0.025234234,
    "time_face_det": null,
    "time_gaze_det": null
}
```
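The output does not state the units of `yaw` and `pitch`; the magnitudes in the example suggest radians. Assuming radians (an assumption, not something confirmed above), converting to degrees is straightforward:

```python
import math

def gaze_angles_degrees(prediction: dict) -> tuple[float, float]:
    """Convert one gaze prediction's yaw/pitch to degrees.

    Assumes the raw values are radians, which the output format does
    not confirm -- verify against your own results before relying on it.
    """
    return math.degrees(prediction["yaw"]), math.degrees(prediction["pitch"])
```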

**Classification**

Coming soon!

**Instance Segmentation**

Coming soon!
