CLIP

CLIP is a machine learning model that generates embeddings for images and text. These embeddings can be used for zero-shot classification, semantic image search, and many other use cases. There are three routes available for using CLIP on the Roboflow Inference Server:

  • embed_image: Used for calculating image embeddings

  • embed_text: Used for calculating text embeddings

  • compare: Used to calculate and then compare embeddings of text and images

Embed Image

Embedding an image is like compressing the information in that image down to a more manageable size. When we embed an image, we take an input made up of tens of thousands of pixels and refine it down to just a few hundred numbers, called an embedding. These embeddings are not particularly meaningful to the human eye on their own, but when compared against other embeddings they prove very useful.

To generate an image embedding using CLIP and the Roboflow Inference Server:

import requests

# Define request payload
infer_clip_payload = {
    # Images can be provided as URLs or as base64 encoded strings
    "image": {
        # "type" can also be "base64"
        "type": "url",
        # "value" can also be a base64 encoded string of image data
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
}

# Define the inference server URL (localhost:9001, infer.roboflow.com, etc.)
base_url = "https://infer.roboflow.com"

# Define your Roboflow API key
api_key = "<YOUR API KEY HERE>"

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()['embeddings']

print(embeddings)
[[-0.4853120744228363, ... ]]

Multiple images can be embedded with one request:

# Define request payload
infer_clip_payload = {
    # Images can be provided as URLs or as base64 encoded strings
    "image": [
        {
            "type": "url",
            "value": "https://i.imgur.com/Q6lDy8B.jpg",
        },
        {
            "type": "url",
            "value": "https://i.imgur.com/Q6lDy8B.jpg",
        }
    ],
}

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)
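
As the comments above note, images can also be sent as base64-encoded strings instead of URLs, which is useful for local files. A minimal sketch, assuming a hypothetical local file named image.jpg:

import base64

# Read a local image and base64-encode it
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

infer_clip_payload = {
    "image": {
        "type": "base64",
        "value": image_b64,
    },
}

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)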

Embed Text

CLIP can generate embeddings for text just like it can for images.

# Define request payload
infer_clip_payload = {
    "text": "the quick brown fox jumped over the lazy dog",
}

res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()['embeddings']

print(embeddings)
[[0.56842650744228363, ... ]]

Multiple text blocks can be batched into a single request:

# Define request payload
infer_clip_payload = {
    "text": [
        "the quick brown fox jumped over the lazy dog",
        "how vexingly quick daft zebras jump"
    ]
}

res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json=infer_clip_payload,
)
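
As with images, the response should contain one embedding per input string. A quick check, assuming the same response shape as the single-input case:

embeddings = res.json()['embeddings']
print(len(embeddings))  # 2 - one embedding vector per input string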

Compare

The true value of CLIP is realized when embeddings are compared. The comparison computes the cosine similarity between two embeddings, which can be thought of as a similarity score: if two embeddings have a cosine similarity near 1, they are similar.
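
For intuition, the score this endpoint returns can be reproduced locally from two embeddings produced by the embed routes above. A minimal sketch using numpy (not part of the hosted API):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare an image embedding against a text embedding
# cosine_similarity(image_embedding[0], text_embedding[0])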

When performing a comparison, you define a prompt and one or more subjects. Since you can compare any combination of text and images, you must also define the prompt type and the subject type.

# Define request payload
infer_clip_payload = {
    "prompt": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
    "prompt_type": "image",
    "subject": "A very cute raccoon",
    "subject_type": "text",
}

res = requests.post(
    f"{base_url}/clip/compare?api_key={api_key}",
    json=infer_clip_payload,
)

similarity = res.json()['similarity']

print(similarity)
[0.30969720949239016]

Multiple prompts (up to eight in a single request) can be passed as a list:

infer_clip_payload = {
    "subject": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
    "subject_type": "image",
    "prompt": [
        "A very cute raccoon",
        "A large dog",
        "A black cat",
    ],
    "prompt_type": "text",
}

res = requests.post(
    f"{base_url}/clip/compare?api_key={api_key}",
    json=infer_clip_payload,
)

similarity = res.json()['similarity']

print(similarity)
[0.80559720949239016, 0.20329720949239016, 0.505559720949239016]
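
Because compare returns one score per prompt, zero-shot classification (mentioned at the top of this page) falls out naturally: pick the prompt with the highest similarity. A minimal sketch, continuing from the request above:

# Zero-shot classification: the prompt with the highest similarity wins
best = max(range(len(similarity)), key=similarity.__getitem__)
print(infer_clip_payload["prompt"][best])  # "A very cute raccoon"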