Serverless Hosted API V2
Run Workflows and Model Inference on GPU-accelerated infrastructure in the Roboflow cloud.
Serverless Hosted API V2 is similar to the Serverless Hosted API, but provides a GPU-accelerated inference endpoint for model inference. This makes it possible to run GPU-only models such as Florence2 and SAM2, and usually reduces the compute latency of Workflow and Inference requests.
To use the Serverless Hosted API V2 for Workflows, select it as the deployment option in the Workflow editor, like so:
This sets the inference backend to use serverless GPU inference. Note that the Serverless Hosted API V2 does not currently support (1) the Stream API (making it unsuitable for video inference) or (2) Dynamic Python Blocks. If your Workflows rely on these features, we recommend checking out Dedicated Deployments.
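Workflows can also be triggered against the serverless endpoint programmatically. Below is a minimal sketch using the `inference-sdk` Python package; the workspace name, workflow ID, API key, and image path are placeholders:

```python
# Minimal sketch: running a Workflow against the Serverless Hosted API V2.
# Assumes the `inference-sdk` package is installed (pip install inference-sdk).
# The workspace name, workflow ID, API key, and image path are placeholders.
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",  # Serverless Hosted API V2 endpoint
    api_key="YOUR_ROBOFLOW_API_KEY",
)

result = client.run_workflow(
    workspace_name="your-workspace",
    workflow_id="your-workflow-id",
    images={"image": "path/to/image.jpg"},  # image input defined in the Workflow
)
print(result)
```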
The Serverless Hosted API V2 has a single endpoint for all models and Workflows:
https://serverless.roboflow.com
This is in contrast to the V1 API, which uses several different endpoints depending on the type of model being inferred (a short client-side example follows the table below).
Also note that Semantic Segmentation models are not currently supported in V2.
| Model type | Serverless Hosted API V2 | Hosted API V1 |
| --- | --- | --- |
| Object detection, Keypoint detection | https://serverless.roboflow.com | https://detect.roboflow.com |
| Instance Segmentation | https://serverless.roboflow.com | https://outline.roboflow.com |
| Classification | https://serverless.roboflow.com | https://classify.roboflow.com |
| Semantic Segmentation | Currently not supported | https://segment.roboflow.com |
| Foundation models, e.g. CLIP, OCR, YOLO-World | https://serverless.roboflow.com | https://infer.roboflow.com |
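Because all supported model types share the same endpoint, switching from a V1 endpoint usually only requires changing the API URL. A minimal sketch with the `inference-sdk` Python client is shown below; the model ID, API key, and image path are placeholders:

```python
# Minimal sketch: running model inference against the single V2 endpoint.
# The model ID, API key, and image path are placeholders; supported model
# types (detection, instance segmentation, classification, foundation
# models) all use the same URL.
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",
    api_key="YOUR_ROBOFLOW_API_KEY",
)

result = client.infer("path/to/image.jpg", model_id="your-model/1")
print(result)
```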
The end-to-end latency of requests sent to the Serverless Hosted API V2 depends on several factors:

- Model architecture, which affects execution time
- Size and resolution of the input images, which affect upload time and model inference time during execution
- Network latency and bandwidth, which affect request upload time and response download time
- Service subscription and usage by other users at any given time, which can result in queueing latency
The table below shows some representative benchmarks performed on the Serverless Hosted API V2 and the Hosted API V1. The results report the end-to-end latency (E2E) as well as the execution time (Exec) for each API. These numbers are for information only; we encourage users to perform their own benchmarks using our inference benchmark tools or their own custom benchmarks.
| Model | Serverless Hosted API V2 E2E | Serverless Hosted API V2 Exec | Hosted API V1 E2E | Hosted API V1 Exec |
| --- | --- | --- | --- | --- |
| yolov8x-640 | 401 ms | 29 ms | 4084 ms | 821 ms |
| yolov8m-640 | 757 ms | 21 ms | 572 ms | 265 ms |
| yolov8n-640 | 384 ms | 17 ms | 312 ms | 63 ms |
| yolov8x-1280 | 483 ms | 97 ms | 6431 ms | 3032 ms |
| yolov8m-1280 | 416 ms | 52 ms | 1841 ms | 1006 ms |
| yolov8n-1280 | 428 ms | 35 ms | 464 ms | 157 ms |
We encourage users to run their own benchmarks for their model inferences and Workflows to get real metrics for their specific use cases.
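For a quick custom benchmark, end-to-end latency can be measured directly from the client side. The sketch below times repeated requests with the `inference-sdk` client; the model ID, API key, image path, and request count are placeholders:

```python
# Minimal sketch of a client-side latency benchmark against the
# Serverless Hosted API V2. Model ID, API key, and image path are
# placeholders; adjust the request count to suit your use case.
import time

from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",
    api_key="YOUR_ROBOFLOW_API_KEY",
)

latencies = []
for _ in range(20):
    start = time.perf_counter()
    client.infer("path/to/image.jpg", model_id="your-model/1")
    latencies.append((time.perf_counter() - start) * 1000)  # end-to-end latency in ms

latencies.sort()
print(f"median E2E latency: {latencies[len(latencies) // 2]:.0f} ms")
```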