Run CLIP on frames in a video.

CLIP is a zero-shot classification model that you can use for tasks such as:

  1. Classifying images;

  2. Clustering images;

  3. Comparing the similarity between a text prompt and an image;

  4. Comparing the similarity between two images.

The Roboflow Video Inference API supports two modes for CLIP. It can return the raw CLIP embedding for each frame in your video (512 or 768 dimensions, depending on the model you select), or it can compare each frame against a text or image vector and return a cosine similarity score per frame.
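
To make the per-frame similarity score concrete, here is a minimal sketch of how cosine similarity between frame embeddings and a query embedding could be computed locally. It is an illustration, not the API's implementation: the helper names `cosine_similarity` and `score_frames` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real 512- or 768-dimensional CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def score_frames(frame_embeddings, query_embedding):
    """Return one similarity score per frame, in frame order."""
    return [cosine_similarity(f, query_embedding) for f in frame_embeddings]


# Toy vectors standing in for real CLIP embeddings (assumption for illustration).
frames = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
query = [1.0, 0.0, 0.0]

scores = score_frames(frames, query)

# Index of the frame whose embedding is closest to the query.
best_frame = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_frame)
```

A score near 1 means the frame and the query point in nearly the same direction in embedding space, while a score near 0 means they are unrelated.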

Use CLIP with the Video Inference API


First, install the Roboflow Python package:

pip install roboflow

Next, create a new Python file and add the following code:

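from roboflow import Roboflow, CLIPModel

# Authenticate with your Roboflow API key
rf = Roboflow(api_key="API_KEY")
model = CLIPModel()

# Submit the video for processing; returns a job ID, a signed URL, and an expiry time
job_id, signed_url, expire_time = model.predict_video(
    "VIDEO_PATH",  # placeholder: path to your video file
)

# Block until the job finishes, then fetch the results
results = model.poll_until_video_results(job_id)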


Above, replace:

  • API_KEY: with your Roboflow API key.

  • PROJECT_NAME: with your Roboflow project ID.

  • MODEL_ID: with your Roboflow model ID.

Learn how to retrieve your API key.

Learn how to retrieve a model ID.
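
Once a job has finished, you may want to keep the results on disk so you do not need to poll again. The sketch below assumes that `results` is JSON-serializable, which this guide does not confirm. The helpers `save_results` and `load_results` are hypothetical, and the dictionary is a made-up placeholder rather than the real response format.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write results to a JSON file and return the path that was written."""
    out = Path(path)
    out.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out


def load_results(path):
    """Read results back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up placeholder; substitute the value returned by poll_until_video_results.
placeholder = {"example_key": [0.1, 0.2, 0.3]}

# A temporary directory keeps this example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(placeholder, Path(tmp) / "clip_results.json")
    reloaded = load_results(saved)

print(reloaded == placeholder)
```

In practice you would write to a permanent path rather than a temporary directory.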
