Video Inference JSON Output Format

The output of the video inference process is a JSON file. This page explains its format.

File Structure

The JSON output file contains several lists of equal length. The index of each list corresponds to a specific frame of the video being inferred upon.

    {
        "frame_offsets": [...],
        "time_offsets": [...],
        "fine_tuned_model_id": [...],
        "clip": [...],
        "gaze": [...]
    }

The frame_offsets list contains the video frame numbers that were extracted and inferred on by each model. The list always starts at 0. For example, if the input video has 24 frames per second and we specify an infer_fps of 4 frames per second, then the frame indices selected for inference (frame_offsets) will be [0, 6, 12, 18, 24, 30, ...].

It is best practice to select an infer_fps that is a factor of the video frame rate. If the video frame rate is not an exact multiple of infer_fps, the frame offsets will be approximations. Choose the minimum infer_fps that works for your application, since higher values result in greater cost and slower results. The system will not return output if infer_fps is greater than the video frame rate.
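The sampling rule above can be sketched as a small helper. This is an illustrative reconstruction, not part of the inference API; the function name and signature are hypothetical.

```python
# Hypothetical sketch of how frame indices are selected for inference.
# When video_fps is not an exact multiple of infer_fps, indices are
# rounded, matching the "approximation" behavior described above.
def frame_offsets(video_fps: float, infer_fps: float, num_frames: int) -> list[int]:
    """Return the frame indices sampled at infer_fps from a video_fps video."""
    step = video_fps / infer_fps  # e.g. 24 / 4 = 6
    offsets = []
    i = 0
    while round(i * step) < num_frames:
        offsets.append(round(i * step))
        i += 1
    return offsets

print(frame_offsets(24, 4, 40))  # [0, 6, 12, 18, 24, 30, 36]
```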

The time_offsets list indicates the time in the video playback at which each frame occurs. Each entry is in seconds, rounded to 4 decimal places.
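Assuming each time offset is simply the frame offset divided by the video frame rate, the list can be reconstructed as follows (a sketch, not the service's implementation):

```python
# Sketch: derive time_offsets (seconds, rounded to 4 decimal places)
# from frame offsets, assuming a known video frame rate of 24 fps.
video_fps = 24
frame_offsets = [0, 6, 12, 18, 24, 30]
time_offsets = [round(f / video_fps, 4) for f in frame_offsets]
print(time_offsets)  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]
```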

The remaining lists contain inference data. Each element of a list is either a dictionary of results or null, in case that particular frame was not successfully inferred on by the model.
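Because all lists share the same length and indexing, a consumer can zip them together and skip the null entries. A minimal sketch (the key "gaze" and the filename are illustrative):

```python
import json

# Sketch: walk one model's results alongside the frame and time
# offsets, skipping frames that were not successfully inferred
# (JSON null entries become None in Python).
def inferred_frames(results: dict, model_key: str):
    """Yield (frame_offset, time_offset, prediction) for non-null entries."""
    for frame, t, pred in zip(results["frame_offsets"],
                              results["time_offsets"],
                              results[model_key]):
        if pred is not None:
            yield frame, t, pred

# Usage (file name is hypothetical):
# results = json.load(open("video_inference_output.json"))
# for frame, t, pred in inferred_frames(results, "gaze"):
#     print(frame, t, pred)
```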

In the next section, we elaborate on the results returned for different model types.

Object Detection

The example below illustrates one element of an object detection model's inference output list.

    {
        "time": 0.06994929000006778,
        "image": {"width": 480, "height": 360},
        "predictions": [
            {
                "x": 360.0,
                "y": 114.0,
                "width": 36.0,
                "height": 104.0,
                "confidence": 0.6243005394935608,
                "class": "zebra",
                "class_id": 1
            }
        ]
    }

The time field is the inference computation time and can usually be ignored.

The image field shows the dimensions of the input frame in pixels.

The predictions list contains one dictionary per detected object, with its bounding box, confidence, and predicted class.

Gaze Detection

The example below illustrates one element of the gaze detection model's inference output list.

    {
        "predictions": [
            {
                "face": {
                    "x": 236,
                    "y": 208,
                    "width": 94,
                    "height": 94,
                    "confidence": 0.9232424,
                    "class": "face",
                    "class_confidence": null,
                    "class_id": 0,
                    "tracker_id": null
                },
                "landmarks": [
                    {
                        "x": 207,
                        "y": 183
                    },
                    ... (6 landmarks)
                ],
                "yaw": 0.82342350129345,
                "pitch": 0.23152452412
            }
        ],
        "time": 0.025234234,
        "time_face_det": null,
        "time_gaze_det": null
    }
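The yaw and pitch values describe the gaze direction for each detected face. A minimal sketch of reading them out, assuming the angles are reported in radians (an assumption; verify against your deployment):

```python
import math

# Sketch: convert one gaze prediction's yaw/pitch to degrees.
# Radians are assumed here, not confirmed by the format description.
def gaze_angles_deg(prediction: dict) -> tuple[float, float]:
    """Return (yaw, pitch) in degrees for a single gaze prediction."""
    return (math.degrees(prediction["yaw"]),
            math.degrees(prediction["pitch"]))

pred = {"yaw": 0.82342350129345, "pitch": 0.23152452412}
yaw_deg, pitch_deg = gaze_angles_deg(pred)
```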


CLIP

<coming soon!>

Instance Segmentation

<coming soon!>
