Our Hosted API is suitable for most use-cases; it uses battle-tested infrastructure and seamlessly autoscales up and down to handle even the most intense use-cases. But, because it is hosted remotely, there are some scenarios where it's not ideal: notably, in situations where bandwidth is constrained or where production data cannot extend beyond your local network or corporate firewall. In those cases, an on-premise deployment is needed.
The Roboflow Inference server is a drop-in replacement for the Hosted Inference API that can be deployed on your own hardware. We have optimized it to get maximum performance from the NVIDIA Jetson line of edge-AI devices by tailoring the drivers, libraries, and binaries specifically to its CPU and GPU architectures.
Support for more edge devices is coming soon; if you have specific needs, please get in touch.
On-device inference is a Roboflow Pro feature. If you are on the Starter Plan, you should develop against the Hosted API. It is API compatible, uses the same trained models, and returns the same predictions as on-device inference. Switching over when you're ready to go to production is a one-line code change (replacing the Hosted API's hostname with
localhost:9001 or your Jetson's local IP address).
The inference API is available as a Docker container optimized and configured for the NVIDIA Jetson line of devices. You should use the latest version of NVIDIA's JetPack, which comes ready to run this container. To install, simply pull the container:
sudo docker pull roboflow/inference-server:jetson
Then run it (while passing through access to the Jetson's GPU and native networking stack for speed):
sudo docker run --net=host --gpus all roboflow/inference-server:jetson
You can now use your Jetson as a drop-in replacement for the Hosted Inference API (see those docs for example code snippets in several programming languages). If you're running your application directly on the Jetson, use the sample code from the Hosted API but replace the Hosted API's base URL with
http://localhost:9001 in the API call. For example,
base64 YOUR_IMAGE.jpg | curl -d @- "http://localhost:9001/xx-your-model--1?access_token=YOUR_KEY"
You can also run in a client-server context and send images to the Jetson for inference from another machine on your network; simply replace
localhost with the Jetson's local IP address.
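As a minimal sketch of the host swap described above, the helper below builds the request URL so that only the host changes between local and client-server use. The function name `inference_url` and the address `192.168.1.50` are assumptions for illustration; the URL layout simply mirrors the curl example above.

```shell
# Hypothetical helper (not part of Roboflow's tooling): build the inference
# URL from a host, a model path, and an access token.
inference_url() {
  local host="$1" model="$2" key="$3"
  printf 'http://%s:9001/%s?access_token=%s\n' "$host" "$model" "$key"
}

# On the Jetson itself:
#   base64 YOUR_IMAGE.jpg | curl -d @- "$(inference_url localhost xx-your-model--1 YOUR_KEY)"
# From another machine (192.168.1.50 is an assumed example address):
#   base64 YOUR_IMAGE.jpg | curl -d @- "$(inference_url 192.168.1.50 xx-your-model--1 YOUR_KEY)"
```

Keeping the host in one place means the rest of your application code does not need to change when you move between the two setups.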
Note: The first call to the model will take a few seconds to download your model weights and initialize them on the GPU; subsequent predictions will be much quicker.
In our local tests, we saw a sustained throughput of:
4 frames per second on the Jetson Nano 2GB (with swap memory)
6 frames per second on the Jetson Nano 4GB
10 frames per second on the Jetson Xavier NX (single instance)
15 frames per second on the Jetson Xavier NX (2 instance cluster; see below)
These results were obtained with a 416x416 model while operating in a client-server context (so there is some minor network latency involved).
Note: If your application is also running on the Jetson itself, you will incur less network latency but will also be sharing compute and memory resources so your results may vary.
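To see how your own setup compares, you could time a batch of requests yourself. The sketch below is one rough way to do that, not the method used for the figures above; the function names `fps` and `benchmark` and the default of 50 requests are assumptions. It sends one untimed warm-up request first so that the initial model download and GPU initialization are excluded.

```shell
# Hypothetical helper: frames per second from a request count and elapsed seconds.
fps() {
  awk -v n="$1" -v s="$2" 'BEGIN { if (s > 0) printf "%.1f\n", n / s; else print "n/a" }'
}

# Hypothetical benchmark loop: send the same image N times, one after another,
# and report the throughput. Usage: benchmark URL IMAGE [N]
benchmark() {
  local url="$1" image="$2" n="${3:-50}" start end i
  # Warm-up request; not included in the timing.
  base64 "$image" | curl -s -o /dev/null -d @- "$url"
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$n" ]; do
    base64 "$image" | curl -s -o /dev/null -d @- "$url"
    i=$((i + 1))
  done
  end=$(date +%s)
  fps "$n" "$((end - start))"
}

# Example:
#   benchmark "http://localhost:9001/xx-your-model--1?access_token=YOUR_KEY" YOUR_IMAGE.jpg 50
```

Because the requests are sent sequentially and timed with one-second resolution, this only gives a coarse single-client estimate; use a larger N for a steadier number.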
The weights for your model are currently downloaded each time the container runs. Full offline mode support (for autonomous and air-gapped devices) is coming soon; please reach out if you'd like to be an early tester.
The Jetson Nano 2GB requires a
swapfile to be created or it will run out of memory and crash while trying to initialize your model. Do this before you
docker run the inference server container to add 8GB of swap memory:
sudo fallocate -l 8.0G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
To persist these changes across reboots, add the following line to the end of /etc/fstab:
/swapfile none swap 0 0
Then reboot your device.
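The persistence step above can be made safe to re-run. The sketch below appends the swap entry to a file only if that exact line is not already present. The function name `ensure_swap_entry` is hypothetical, and the target file is assumed to be /etc/fstab, the standard location for such entries on Ubuntu-based systems.

```shell
# Hypothetical helper: append the swap entry only when it is missing.
# Usage (as root): ensure_swap_entry /etc/fstab
ensure_swap_entry() {
  local file="$1" entry='/swapfile none swap 0 0'
  # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string.
  if ! grep -qxF "$entry" "$file"; then
    printf '%s\n' "$entry" >> "$file"
  fi
}
```

After the reboot, `swapon --show` or `free -h` should list the swap space if the entry took effect.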
Because the swapfile lives on your micro SD card, it's important to ensure you have a card with high throughput. You may also want to disable your X Server and run in headless mode (see below).
The inference server is configured with a cluster mode that will let it run multiple instances of itself in parallel and automatically do load balancing of requests between them. This will not improve the latency but it will let your device process multiple images at one time by letting it utilize more CPU cores.
The number of instances you can run is limited by the input size of your model (determined by your
Resize preprocessing step or defaulting to 416x416 if you did not select a size), the amount of memory your device has, and the amount of memory needed for other services on the device (like your application code).
A Xavier NX's 8GB of memory can safely fit two instances of most models in memory while still leaving room for your program to run.
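The memory budget described above comes down to simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the function name `max_instances` is hypothetical, and the per-instance and reserved figures in the example are illustrative placeholders, not measurements. You would need to measure your own model's footprint (for example by comparing `free -m` before and after starting one instance).

```shell
# Hypothetical helper: how many instances fit, given total memory, memory
# reserved for the OS and your application, and memory per instance (all in MB).
max_instances() {
  local total_mb="$1" reserved_mb="$2" per_instance_mb="$3"
  local free_mb=$((total_mb - reserved_mb))
  if [ "$free_mb" -le 0 ] || [ "$per_instance_mb" -le 0 ]; then
    echo 0
  else
    # Integer division rounds down, which is the safe direction here.
    echo $((free_mb / per_instance_mb))
  fi
}

# Illustrative numbers only: 8192 MB total, 2048 MB reserved, 3000 MB per instance.
#   max_instances 8192 2048 3000
```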
Tip: To ensure the maximum amount of memory is available, be sure to shut down your X Server with
sudo service gdm stop or, if you want this to be the default mode for the system,
sudo systemctl set-default multi-user.target.
To start a two instance cluster, add
--env INSTANCES=2 to the
docker run command:
sudo docker run --net=host --gpus all --env INSTANCES=2 roboflow/inference-server:jetson
In our tests, the Jetson Xavier NX's throughput went from ~10 frames per second with a single instance to 15 fps with two instances. We were only able to run a single instance on the Jetson Nano.
We got best results with power mode
1 (15 watt, 4 CPU cores) for 2-instance cluster mode on the Xavier NX. You can enable this mode with:
sudo nvpmodel -m 1
The Roboflow Inference Server takes up 5GB of disk space. We recommend using a fast SD Card (at least U3, V30). Alternatively, some Jetson-powered embedded devices feature integrated eMMC flash memory which should be even more performant. The default JetPack install consumes ~15GB so it's preferable to have an SD Card or Flash Memory capacity of 32GB or higher (but it is possible to run on 16GB by removing unnecessary packages).
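Before pulling the container, it can help to confirm that enough space is free. The sketch below checks the free space on the filesystem holding a given path. The function name `has_free_gb` is hypothetical, and checking `/` assumes Docker's image storage lives on the root filesystem, which may not match your device.

```shell
# Hypothetical helper: succeed if the filesystem containing PATH has at least
# GB gigabytes available. Usage: has_free_gb PATH GB
has_free_gb() {
  local path="$1" need_gb="$2" avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example: warn if less than the 5GB needed by the inference server is free.
#   has_free_gb / 5 || echo "Less than 5GB free; consider removing unused packages."
```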
Our Docker image contains all of the CUDA and cuDNN packages required to run our models; this means CUDA and cuDNN are not needed on the host so long as NVIDIA Container Runtime and the NVIDIA graphics drivers remain installed.
You can free up about 7GB of space by running the following commands to remove CUDA, cuDNN, and some other large extraneous packages:
sudo apt purge cuda-tools-10-2 libcudnn8 cuda-documentation-10-2 cuda-samples-10-2 nvidia-l4t-graphics-demos ubuntu-wallpapers-bionic libreoffice* chromium-browser* thunderbird fonts-noto-cjk
sudo apt autoremove