NVIDIA GPU

As an additional Enterprise deployment, we offer an accelerated inference solution that you can deploy to your GPU devices.

Installation Requirements

These deployment options require a Roboflow Enterprise License.

To deploy the Enterprise GPU inference server, you must first install the NVIDIA drivers and nvidia-container-runtime, which allow Docker to pass your GPU through to the inference server. To check whether your system already has nvidia-container-runtime, or whether your installation succeeded, run the following command:

docker run --gpus all -it ubuntu nvidia-smi

If your installation was successful, you will see your GPU device from within the container:

Tue Nov  9 16:04:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    56W / 149W |    504MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
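To script this check rather than read the table by eye, one option is to count the devices the container reports. The sketch below assumes `nvidia-smi -L`, which prints one line per device in the form `GPU 0: Tesla K80 (UUID: ...)`. The function name `count_gpus` is a hypothetical helper, not part of any Roboflow tooling.

```shell
# count_gpus: hypothetical helper that reads `nvidia-smi -L` output on stdin
# and prints how many GPU device lines it contains.
count_gpus() {
  # grep -c exits non-zero when the count is 0; `|| true` keeps the
  # function usable in scripts that run with `set -e`.
  grep -c -E '^GPU [0-9]+:' || true
}

# Usage (requires Docker with GPU passthrough configured):
#   docker run --rm --gpus all ubuntu nvidia-smi -L | count_gpus
```

A result of 0 suggests that the driver or nvidia-container-runtime is not yet visible to Docker.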

The last thing you need before building your GPU TRT container is your project information: your Roboflow API Key, Model ID and Model Version. If you don't have this information, see Locate Your Project Information. Once located, save those three values for later.

Enterprise GPU TRT

The Enterprise GPU TRT deployment compiles your model on device, optimizing for the hardware that you have available. There are currently three deployment options for the GPU TRT container: deploying to an EC2 instance via AWS, deploying to WSL2 via Windows, and deploying to Anaconda via Windows.

Amazon EC2 Deployments

Select AMI and Launch EC2 Instance

To run the TRT GPU container on an EC2 instance, you first need to select the proper AMI. The AMI is chosen in the launch configuration, so select it before you launch the instance. We are going to use the NVIDIA GPU-Optimized AMI, which comes with Ubuntu 20.04, Docker and other requirements pre-installed.

Log in to EC2 Instance via SSH

With your EC2 instance running, we can log in to it using SSH and our Amazon key pair. Amazon provides documentation for connecting to its instances. If you have your key pair ready and know your EC2 instance's public DNS name, use the command below to log in to the instance. The default instance-user-name for this instance should be ubuntu.

ssh -i /path/key-pair-name.pem instance-user-name@instance-public-dns-name

Start TRT GPU Docker Container

Once you are logged in to the EC2 instance via SSH, start the Docker container with the following command:

sudo docker run --gpus all -p 9001:9001 --network="host" roboflow/inference-server-trt:latest

Compile Engine and Run Inference

Run inference on your model by posting a base64-encoded image to the server. If the model has not yet been compiled and cached, the first request compiles it before running inference:

base64 your_img.jpg | curl -d @- "http://0.0.0.0:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"
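The same request can be wrapped in a small shell function so that the model ID, version and API key saved earlier are set once. This is a sketch only: the function names `build_inference_url` and `infer_image` and the `ROBOFLOW_*` variable names are hypothetical, and the URL shape is copied from the command above.

```shell
# Hypothetical variables holding the project information saved earlier.
ROBOFLOW_MODEL="${ROBOFLOW_MODEL:-your-model}"
ROBOFLOW_VERSION="${ROBOFLOW_VERSION:-1}"
ROBOFLOW_API_KEY="${ROBOFLOW_API_KEY:-your-api-key}"

# build_inference_url HOST: print the endpoint for the configured model.
build_inference_url() {
  printf 'http://%s:9001/%s/%s?api_key=%s\n' \
    "$1" "$ROBOFLOW_MODEL" "$ROBOFLOW_VERSION" "$ROBOFLOW_API_KEY"
}

# infer_image FILE [HOST]: base64-encode FILE and post it to the server.
# HOST defaults to 0.0.0.0, as in the command above.
infer_image() {
  base64 "$1" | curl -s -d @- "$(build_inference_url "${2:-0.0.0.0}")"
}

# Usage (requires the container from the previous step to be running):
#   infer_image your_img.jpg
```

Keeping the URL construction in its own function makes it easy to print and inspect the endpoint before sending any image data.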

Anaconda Deployments

Set Up Your Anaconda Environment

To run the TRT container on Anaconda or Miniconda, we need to create a conda environment and install Docker. To create the environment, run the commands below in the Anaconda terminal.

conda create -n TRT python=3.8
conda activate TRT
pip install pybase64

Install Docker in Anaconda Environment

You can download and run Docker via Docker Desktop, or you can install Docker from conda-forge. The command below installs Docker using conda, Anaconda's package manager.

conda install -c conda-forge docker

Run Docker Container Inside of Anaconda Environment

If you have installed Docker Desktop, make sure it is running so that you can access the container. If you did not install Docker Desktop, you should be able to access the Docker daemon through the conda-forge install from the previous step.

Once your Anaconda environment can access Docker, start the Docker container with the following command:

docker run --gpus all -p 9001:9001 roboflow/inference-server-trt:latest

Compile Engine and Run Inference

Open another Anaconda terminal and navigate to a directory that contains the data you want to run inference on. Run inference on your model by posting a base64-encoded image to the server. If the model has not yet been compiled and cached, the first request compiles it before running inference:

pybase64 encode your_img.jpg | curl -d @- "http://localhost:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"

Windows Subsystem Deployments

Download Ubuntu via the Microsoft Store

If you want to run the TRT container on Windows, another option besides Anaconda is WSL2. Windows Subsystem for Linux allows anyone with Windows 10 or later to run Ubuntu alongside Windows through a terminal interface. You can download Ubuntu 20.04.5 from the Microsoft Store for free. After installation, start WSL2 by typing Ubuntu into your Windows search bar and launching the application.

Install Docker on WSL2 (Optional)

Ubuntu 20.04.5 LTS should come with Docker installed, but in case it doesn't, below are some useful commands for installing Docker in Ubuntu. These packages come from Docker's own apt repository, which must be added first as described in the full documentation: Install Docker Engine on Ubuntu. As with the Anaconda installation, you may also run Docker Desktop instead of installing Docker inside WSL2.

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin

Run Docker Container Inside of WSL2

Once Docker is installed in your WSL2 environment, we can run the TRT container. The command below starts the container, which will accept inference requests on port 9001.

sudo docker run --gpus all -p 9001:9001 roboflow/inference-server-trt:latest

Compile Engine and Run Inference

Now that the GPU TRT container is running in Docker, open another Ubuntu terminal, which we will use to send inference data to the Docker container. Use ls and cd to navigate to the location of an image you want to run inference on, then use the command below.

If this is your first inference, your model will take some time to compile. Subsequent inferences will be faster.

base64 your_img.jpg | curl -d @- "http://0.0.0.0:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"

Extended Functionality

Caching Your Model

In certain cases, you might want to cache your model locally so that each time the server starts it does not need to communicate with the external Roboflow servers to download your model.

To cache your model offline, first create a docker volume:

docker volume create roboflow 

Then, start the server with your docker volume mounted on the /cache directory:

sudo docker run --gpus all -p 9001:9001 --network="host" --mount source=roboflow,target=/cache roboflow/inference-server-trt:latest

Multi-GPU Support with Docker Compose

We have developed a repository for quickly accessing examples on how to use the Roboflow TRT Docker container. To get started, run the git clone command below to download our docker compose template.

git clone https://github.com/roboflow/trt-demos.git

For this example, we have configured Docker Compose to run 8 GPUs with a load balancer. If you need to run fewer than 8 GPUs, see Configuring your Docker Compose Files below.

Building the Load Balancer

To build the load balancer Docker container, use the command below. If you want more information on the load balancer we are using, you can find that information here.

docker build . -t lb

Spinning up Docker Compose

Make sure that the names of the services in the docker-compose.yaml file are correctly reflected in the conf/roboflow-nginx.conf file, then run Docker Compose.

docker-compose up

Docker should now be spinning up multiple GPU containers that all share a volume and a port with the load balancer. This way, the load balancer can manage the throughput of each container for optimal speed. If you are running in Docker Desktop, a successful boot up should look something like this.

Running Inference

Now that your GPU containers and the load balancer are running, you can interact with the load balancer, which routes each of your requests to one of the GPU containers in order to maintain optimal throughput.

You can test the load balancer by opening up a new terminal and using one of the curl commands below.

# Amazon EC2 Deployments
base64 your_img.jpg | curl -d @- "http://0.0.0.0:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"

# Anaconda Deployments
pip install pybase64 
pybase64 encode your_img.jpg | curl -d @- "http://localhost:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"

# Windows Subsystem Linux
base64 your_img.jpg | curl -d @- "http://0.0.0.0:9001/[YOUR MODEL]/[YOUR VERSION]?api_key=[YOUR API KEY]"

Configuring your Docker Compose Files

To run fewer than the default 8 GPUs, you will need to change two files in this repo. The first is docker-compose.yaml, which defines a set of services such as Roboflow-GPU-1, Roboflow-GPU-2, etc. These services run the Docker containers, and each attaches to one GPU.

If we want to run only 3 GPUs, we can delete all of the services except Roboflow-GPU-1, Roboflow-GPU-2 and Roboflow-GPU-3. To delete a service, remove the line containing the service name and all the lines below it until the next service name.

docker-compose.yaml
version: "3"
services:  
  Roboflow-GPU-1:
    image: roboflow/inference-server-trt:latest
    restart: always
    volumes:
      - shared-volume:/cache
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  Roboflow-GPU-2:
    image: roboflow/inference-server-trt:latest
    restart: always
    volumes:
      - shared-volume:/cache
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]
  # <Continued>

The next file we need to edit is roboflow-nginx.conf, which is found inside the conf folder.

Continuing our example of going from 8 to 3 GPUs, we need to remove some of the server lines in the upstream myapp1 block. Specifically, lines 17 through 21 of the file (the entries for Roboflow-GPU-4 through Roboflow-GPU-8) are no longer needed because they exceed our target number.

roboflow-nginx.conf
user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}

http {
    upstream myapp1 {
        server Roboflow-GPU-1:9001;
        server Roboflow-GPU-2:9001;
        server Roboflow-GPU-3:9001;
        server Roboflow-GPU-4:9001;
        server Roboflow-GPU-5:9001;
        server Roboflow-GPU-6:9001;
        server Roboflow-GPU-7:9001;
        server Roboflow-GPU-8:9001;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://myapp1;
        }
    }
}

After you have changed these two files, you should be able to continue with the docker-compose tutorial by building the load balancer.
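Rather than deleting entries by hand, the repetitive parts of both files can be generated for a chosen GPU count. The functions below are a hypothetical sketch, not part of the trt-demos repository: `gen_gpu_services` prints service blocks in the shape shown in the docker-compose.yaml excerpt, and `gen_upstream` prints the matching upstream myapp1 block. The rest of each file is assumed to stay as it is in the repository.

```shell
# gen_gpu_services N: print N service blocks named Roboflow-GPU-1..N,
# each pinned to one GPU (device_ids are zero-based, as in the excerpt).
gen_gpu_services() {
  i=1
  while [ "$i" -le "$1" ]; do
    cat <<EOF
  Roboflow-GPU-$i:
    image: roboflow/inference-server-trt:latest
    restart: always
    volumes:
      - shared-volume:/cache
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['$((i - 1))']
            capabilities: [gpu]
EOF
    i=$((i + 1))
  done
}

# gen_upstream N: print the nginx upstream block listing N GPU services.
gen_upstream() {
  echo '    upstream myapp1 {'
  i=1
  while [ "$i" -le "$1" ]; do
    echo "        server Roboflow-GPU-$i:9001;"
    i=$((i + 1))
  done
  echo '    }'
}

# Usage: print both fragments for 3 GPUs, then paste them into the files.
#   gen_gpu_services 3
#   gen_upstream 3
```

Generating both fragments from the same count keeps the service names in docker-compose.yaml and roboflow-nginx.conf in step, which is the consistency the load balancer depends on.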

Using Multi-Stream with the TRT Container

In some cases, you may have multiple camera streams that you want to process in parallel in the same TRT container on the same GPU. To spin up multiple model services in your TRT container, specify --env NUM_WORKERS=[desired num_workers] when starting the container. On an NVIDIA V100, we found that 2-4 workers provided optimal latency.

Exposing GPU Device ID in the TRT Container

In certain cases, you may want your TRT container to run on a specific GPU or vGPU. You can do so by specifying CUDA_VISIBLE_DEVICES=[DESIRED GPU OR MIG ID].
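As a sketch of how the worker count and the device ID might be combined, the hypothetical helper below assembles a `docker run` command from flags shown on this page. Passing `CUDA_VISIBLE_DEVICES` through `--env` is an assumption based on how `NUM_WORKERS` is set above. The helper only prints the command so that you can inspect it before running it.

```shell
# trt_run_cmd WORKERS DEVICE: print a docker run command for the TRT container.
# WORKERS sets NUM_WORKERS; DEVICE sets CUDA_VISIBLE_DEVICES (assumed to be
# passed with --env, like NUM_WORKERS). An empty argument omits that flag.
trt_run_cmd() {
  cmd='docker run --gpus all -p 9001:9001'
  if [ -n "${1:-}" ]; then
    cmd="$cmd --env NUM_WORKERS=$1"
  fi
  if [ -n "${2:-}" ]; then
    cmd="$cmd --env CUDA_VISIBLE_DEVICES=$2"
  fi
  echo "$cmd roboflow/inference-server-trt:latest"
}

# Example: 2 workers on GPU 0.
trt_run_cmd 2 0
```

Printing the command instead of executing it keeps the helper safe to run on a machine without Docker. Prefix the printed command with sudo if your Docker setup requires it.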

Troubleshooting

First, check that your request contains the proper model and version parameters:

Locating Your Project Information

Check that the latest container is pulled:

sudo docker pull roboflow/inference-server-trt:latest

If using a cache volume, clear it:

sudo docker volume rm roboflow

sudo docker volume create roboflow

Re-check NVIDIA docker GPU drivers:

docker run --gpus all -it ubuntu nvidia-smi

Relaunch! If deployment errors persist, copy the server logs and send them to your Roboflow rep and we will jump in to help.
