CLIP

CLIP is a machine learning model capable of generating embeddings for images and text. These embeddings can be used for zero-shot classification, semantic image search, and many other use cases. There are three routes available for using CLIP on the Roboflow Inference Server:

  • embed_image: Used for calculating image embeddings

  • embed_text: Used for calculating text embeddings

  • compare: Used to calculate and then compare embeddings of text and images

Embed Image

Embedding an image is like compressing the information in that image down to a more manageable size. When we embed an image, we take an input composed of tens of thousands of pixels and refine it down to just a few hundred numbers, called an embedding. On their own, these embeddings are not particularly meaningful to the human eye, but when compared to other embeddings they can prove very useful.

To generate an image embedding using CLIP and the Roboflow Inference Server:

import requests

# Define request payload
infer_clip_payload = {
    # Images can be provided as URLs or as base64 encoded strings
    "image": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
}

# Define inference server url (localhost:9001, infer.roboflow.com, etc.)
base_url = "https://infer.roboflow.com"

# Define your Roboflow API Key
api_key = "<YOUR API KEY HERE>"

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()['embeddings']

print(embeddings)
[[-0.4853120744228363, ... ]]
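
Images can also be provided as base64 encoded strings instead of URLs. A minimal sketch of encoding a local file, assuming the server accepts a payload with "type": "base64" and using image.jpg as a placeholder path:

import base64

# Read a local image and base64 encode it (image.jpg is a placeholder path)
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

infer_clip_payload = {
    "image": {
        "type": "base64",
        "value": image_b64,
    },
}

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)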

Multiple images can be embedded with one request:

# Define request payload
infer_clip_payload = {
    # Images can be provided as URLs or as base64 encoded strings
    "image": [
        {
            "type": "url",
            "value": "https://i.imgur.com/Q6lDy8B.jpg",
        },
        {
            "type": "url",
            "value": "https://i.imgur.com/Q6lDy8B.jpg",
        }
    ],
}

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)
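
The response contains one embedding per input image, presumably in request order; a brief sketch of reading them back, assuming the same embeddings key as the single-image response:

embeddings = res.json()['embeddings']

# One embedding per input image
print(len(embeddings))  # 2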

Embed Text

CLIP can generate embeddings for text just like it can for images.

# Define request payload
infer_clip_payload = {
    "text": "the quick brown fox jumped over the lazy dog",
}

res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()['embeddings']

print(embeddings)
[[0.56842650744228363, ... ]]

Multiple text blocks can be batched into a single request:

# Define request payload
infer_clip_payload = {
    "text": [
        "the quick brown fox jumped over the lazy dog",
        "how vexingly quick daft zebras jump"
    ]
}

res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json=infer_clip_payload,
)

Compare

The true value of CLIP is realized when embeddings are compared. The comparison is the cosine similarity between two embeddings, which can be thought of as a similarity score. If two embeddings have a cosine similarity near 1, they are similar.

When performing a compare, you define a subject and one or more prompts. Since you can compare any combination of text and images, you must also define the subject type and the prompt type.

# Define request payload
infer_clip_payload = {
    "prompt": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
    "prompt_type": "image",
    "subject": "A very cute raccoon",
    "subject_type": "text",
}

res = requests.post(
    f"{base_url}/clip/compare?api_key={api_key}",
    json=infer_clip_payload,
)

similarity = res.json()['similarity']

print(similarity)
[0.30969720949239016]
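
A comparable score can be computed locally from embeddings returned by the embed endpoints; a minimal sketch using numpy, where image_embeddings and text_embeddings are illustrative names for the lists returned by embed_image and embed_text above:

import numpy as np

# First embedding from each of the embed_image and embed_text responses
image_vec = np.array(image_embeddings[0])
text_vec = np.array(text_embeddings[0])

# Cosine similarity: dot product of the L2-normalized vectors
similarity = np.dot(image_vec, text_vec) / (
    np.linalg.norm(image_vec) * np.linalg.norm(text_vec)
)

print(similarity)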

Multiple prompts (up to eight in a single request) can be passed as a list:

infer_clip_payload = {
    "subject": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
    "subject_type": "image",
    "prompt": [
        "A very cute raccoon",
        "A large dog",
        "A black cat",
    ],
    "prompt_type": "text",
}

res = requests.post(
    f"{base_url}/clip/compare?api_key={api_key}",
    json=infer_clip_payload,
)

similarity = res.json()['similarity']

print(similarity)
[0.80559720949239016, 0.20329720949239016, 0.505559720949239016]
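
Picking the prompt with the highest similarity amounts to zero-shot classification; a small follow-up sketch using the similarity list and payload from the request above:

# Index of the best-matching prompt
best = max(range(len(similarity)), key=lambda i: similarity[i])

print(infer_clip_payload["prompt"][best])
A very cute raccoon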
