Multimodal Model Workflow

You can use multimodal Vision Language Models (VLMs) like GPT-4o, Claude, Gemini, and Florence-2 in Roboflow Workflows.

There are four general-purpose multimodal models that you can use for a wide range of tasks. These are:

  • GPT-4o

  • Claude

  • Gemini

  • Florence-2

These models can be used for tasks such as:

  • Single- and multi-label image classification

  • Zero-shot object detection

  • Image caption generation

  • And more

To use a multimodal model in Workflows, you need to:

  1. Add the model.

  2. Choose a task type.

  3. Use a built-in connector that converts the results of the model into a format understood by other Workflows blocks.

Let's walk through each of these steps. A complete definition that combines all three appears at the end of this page.

Add a Multimodal Model

To use Claude, Gemini, or GPT-4o in Workflows, you need to add the block that corresponds with the model that you want to use.

For this guide, let’s walk through an example that uses Claude.

Click “Add Block”, then search for Claude:

A configuration panel will appear in which you can configure the Claude block.

If you use a multimodal model that calls an external API (e.g., GPT-4o), you will need to set your model API key.
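
For example, if you run a deployed Workflow from Python with the inference-sdk, you can keep the key out of the Workflow itself and supply it at request time. The sketch below assumes a Serverless deployment; the workspace name, workflow ID, and the anthropic_api_key parameter name are placeholders for your own values.

```python
# Minimal sketch: running a hosted Workflow that contains a Claude block with
# the Python inference-sdk. The workspace name, workflow ID, and the
# "anthropic_api_key" parameter name are placeholders for your own values.
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",  # your deployment endpoint
    api_key="YOUR_ROBOFLOW_API_KEY",
)

result = client.run_workflow(
    workspace_name="your-workspace",
    workflow_id="your-workflow-id",
    images={"image": "path/to/image.jpg"},
    # If your Claude block reads its key from a Workflow input rather than a
    # value saved in the block, pass the key as a parameter at request time.
    parameters={"anthropic_api_key": "YOUR_ANTHROPIC_API_KEY"},
)
print(result)
```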

You can use Claude (and Gemini and GPT-4o) for several tasks, including:

  • Open prompt: Directly passes your prompt to the multimodal model.

  • Text recognition (OCR): Reads characters in an image.

  • Structured output generation: Returns data in a specified format.

  • Single-label and multi-label classification: Returns one or more labels that represent the contents of an image.

  • Visual question answering: Answers a specific question about the contents of an image.

  • Captioning: Returns an image caption.

  • Unprompted object detection: Returns bounding boxes that correspond with the location of objects in an image.

You can choose from these tasks using the Task Type dropdown:

Once you have chosen a task type, the output from the block will be automatically added to your Workflow outputs.

Here is an example configuration for object detection:
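
Under the hood, every block you add in the editor becomes a step in the Workflow's JSON definition. As a rough sketch, a Claude block set up for unprompted object detection might look like the following Python dict; the block type identifier and field names here are illustrative and may differ between inference versions.

```python
# Illustrative sketch of a Claude step in a Workflow definition. The block
# type identifier and field names may differ between inference versions.
claude_step = {
    "type": "roboflow_core/anthropic_claude@v1",
    "name": "claude",
    "images": "$inputs.image",
    "task_type": "object-detection",         # the Task Type chosen above
    "classes": ["car", "person"],            # the objects you want detected
    "api_key": "$inputs.anthropic_api_key",  # key supplied as a Workflow input
}
```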

Add a Connector

If you want to use the output from a multimodal model in other blocks, you will need to add a connector.

You can use connectors to process:

  • Classifications, and

  • Bounding boxes.

For example, you can add a connector to retrieve bounding box values from zero-shot object detection supported by Claude 3.

This connector processes the boxes so that you can use them in visualization blocks such as Bounding Box Visualization and Label Visualization.

When you select a VLM connector, configure it to use:

  1. Your input image

  2. The output from your multimodal block

  3. The classes from your multimodal model

  4. The name of the multimodal model you are using (in this example, anthropic-claude)

  5. The task type you selected when you set up your multimodal model

Here is an example configuration of the VLM as Detector block:
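
Continuing the illustrative sketch from the Claude step above, those five settings map onto the connector's fields roughly as follows.

```python
# Illustrative sketch of the VLM as Detector connector step. The numbered
# comments match the five settings listed above.
vlm_as_detector_step = {
    "type": "roboflow_core/vlm_as_detector@v1",
    "name": "parser",
    "image": "$inputs.image",              # 1. your input image
    "vlm_output": "$steps.claude.output",  # 2. the multimodal block's output
    "classes": "$steps.claude.classes",    # 3. the classes from the model
    "model_type": "anthropic-claude",      # 4. the multimodal model name
    "task_type": "object-detection",       # 5. the task type you selected
}
```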

You can then use the connector output with other blocks.

For example, you can use the connector output to display bounding boxes with a Bounding Box Visualization block. The Bounding Box Visualization block should be configured with your input image and the results from the VLM as Detector block:
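
In the same illustrative format, a sketch of that configuration:

```python
# Illustrative sketch of a Bounding Box Visualization step that draws the
# connector's boxes onto the input image.
visualization_step = {
    "type": "roboflow_core/bounding_box_visualization@v1",
    "name": "bbox_vis",
    "image": "$inputs.image",
    "predictions": "$steps.parser.predictions",
}
```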


Here is an example workflow that uses a multimodal model for object detection.
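
The sketch below combines the illustrative steps from this guide into one complete definition. If your version of the inference-sdk supports it, you could pass a definition like this to run_workflow via its specification argument, along with the image and API key parameters shown earlier; all identifiers remain illustrative rather than exact.

```python
# Illustrative end-to-end Workflow definition: Claude performs unprompted
# object detection, the VLM as Detector connector parses its output, and a
# Bounding Box Visualization block draws the results.
workflow_definition = {
    "version": "1.0",
    "inputs": [
        {"type": "WorkflowImage", "name": "image"},
        {"type": "WorkflowParameter", "name": "anthropic_api_key"},
    ],
    "steps": [
        {
            "type": "roboflow_core/anthropic_claude@v1",
            "name": "claude",
            "images": "$inputs.image",
            "task_type": "object-detection",
            "classes": ["car", "person"],
            "api_key": "$inputs.anthropic_api_key",
        },
        {
            "type": "roboflow_core/vlm_as_detector@v1",
            "name": "parser",
            "image": "$inputs.image",
            "vlm_output": "$steps.claude.output",
            "classes": "$steps.claude.classes",
            "model_type": "anthropic-claude",
            "task_type": "object-detection",
        },
        {
            "type": "roboflow_core/bounding_box_visualization@v1",
            "name": "bbox_vis",
            "image": "$inputs.image",
            "predictions": "$steps.parser.predictions",
        },
    ],
    "outputs": [
        {"type": "JsonField", "name": "predictions", "selector": "$steps.parser.predictions"},
        {"type": "JsonField", "name": "visualization", "selector": "$steps.bbox_vis.image"},
    ],
}
```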