Foundation Models

Foundation models are large, pre-trained models that you can use on their own or as one component of a larger vision project to solve a computer vision problem.

You can deploy the following foundation models on your own hardware with Inference:

  • Gaze (L2CS-Net): Detect the direction in which someone is looking.

  • CLIP: Classify images and compare the similarity of images and text.

  • DocTR: Read text in images using optical character recognition (OCR).

  • Grounding DINO: Detect objects in images using text prompts.

  • Segment Anything (SAM): Segment objects in images.

  • CogVLM: Ask questions about images using a large multimodal model (LMM).
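
As an illustration of what "compare the similarity of images and text" means for CLIP, the sketch below ranks text labels against an image by cosine similarity between embedding vectors. It is a minimal sketch only and does not call Inference or CLIP: the embeddings are made-up three-dimensional values (real CLIP embeddings have hundreds of dimensions), and the helper names `cosine_similarity` and `best_label` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def best_label(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(image_embedding, text_embeddings[label]),
    )


# Made-up embeddings for illustration; in practice these would come from a
# CLIP model (one vector per image, one vector per text prompt).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}

print(best_label(image_embedding, text_embeddings))
```

Classification with CLIP follows the same pattern: embed the image once, embed one text prompt per candidate class, and choose the prompt with the highest similarity.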

To learn how to deploy these foundation models, refer to the Roboflow Inference documentation.

