Perception Encoder

Use Meta's Perception Encoder to compute image and text embeddings on a Dedicated Deployment or self-hosted Inference

Perception Encoder is Meta's vision-language embedding model. It maps images and text into a shared embedding space for similarity search, zero-shot classification, and retrieval.

Perception Encoder is not available on the Serverless Hosted API. Run it on a Dedicated Deployment or self-hosted Inference.

We support three Perception Encoder endpoints:

  • /perception_encoder/embed_image — embed an image

  • /perception_encoder/embed_text — embed a string

  • /perception_encoder/compare — compute similarity between an image and a list of text prompts

Code sample

Install dependencies:

pip install requests opencv-python

The sample below sends an image to /perception_encoder/embed_image and prints the embedding shape. Set URL to your Dedicated Deployment URL or a local Inference server. Pass your Roboflow API Key via the API_KEY environment variable.

import base64
import os
import urllib.request

import cv2
import requests

URL = "https://your-deployment.roboflow.cloud"

image_url = "https://media.roboflow.com/notebooks/examples/dog.jpeg"
image_path = "dog.jpeg"
urllib.request.urlretrieve(image_url, image_path)

image = cv2.imread(image_path)
_, buffer = cv2.imencode(".jpg", image)
image_base64 = base64.b64encode(buffer).decode("utf-8")

response = requests.post(
    f"{URL}/perception_encoder/embed_image",
    json={
        "api_key": os.getenv("API_KEY"),
        "image": {"type": "base64", "value": image_base64},
    },
)
result = response.json()
embedding = result["embeddings"][0]
print(f"Embedding length: {len(embedding)}")
print(f"First values: {embedding[:5]}")

The code above prints the embedding shape to the terminal:

Set URL to match your deployment target:

Last updated

Was this helpful?