NVIDIA Jetson (On Device)
Deploy your Roboflow Train model to your NVIDIA Jetson with GPU acceleration.
Our Hosted API is suitable for most use cases; it runs on battle-tested infrastructure and seamlessly autoscales up and down to handle even the most demanding workloads. But because it is hosted remotely, there are some scenarios where it's not ideal: notably, when bandwidth is constrained, when production data cannot leave your local network or corporate firewall, or when you need realtime inference speeds on the edge. In those cases, an on-premise deployment is needed.
The Roboflow Inference Server is a drop-in replacement for the Hosted Inference API that can be deployed on your own hardware. We have optimized it to get maximum performance from the NVIDIA Jetson line of edge-AI devices by tailoring the drivers, libraries, and binaries specifically to their CPU and GPU architectures.
The inference API is available as a Docker container optimized and configured for the NVIDIA Jetson line of devices. You should use the latest stable version of NVIDIA's JetPack (last tested on version 4.6), which comes ready to run this container. To install, simply pull the container:
sudo docker pull roboflow/inference-server:jetson
Then run it (while passing through access to the Jetson's GPU and native networking stack for speed):
sudo docker run --net=host --gpus all roboflow/inference-server:jetson
You can now use your Jetson as a drop-in replacement for the Hosted Inference API (see those docs for example code snippets in several programming languages). If you're running your application directly on the Jetson, use the sample code from the Hosted API but replace the hosted endpoint with http://localhost:9001 in the API call. For example,
base64 YOUR_IMAGE.jpg | curl -d @- \
"http://localhost:9001/YOUR_MODEL_ID/YOUR_VERSION?api_key=YOUR_API_KEY"
You can also run in a client-server context and send images to the Jetson for inference from another machine on your network; simply replace localhost with the Jetson's local IP address.
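As an illustrative sketch in Python (the model ID, version, API key, and IP address below are placeholders you must substitute), a client-server inference call might look like this, using only the standard library:

```python
import base64
import urllib.request

def build_inference_url(host, model_id, version, api_key, port=9001):
    """Construct the inference server endpoint URL."""
    return f"http://{host}:{port}/{model_id}/{version}?api_key={api_key}"

def infer(image_path, host, model_id, version, api_key):
    """POST a base64-encoded image to the inference server and return the raw response."""
    with open(image_path, "rb") as f:
        payload = base64.b64encode(f.read())
    req = urllib.request.Request(
        build_inference_url(host, model_id, version, api_key),
        data=payload,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    # Placeholder values -- substitute your Jetson's IP, model ID, version, and API key.
    print(infer("YOUR_IMAGE.jpg", "192.168.1.50", "your-model", "1", "YOUR_API_KEY"))
```

This mirrors the curl example above; for production use you would add error handling and parse the JSON response.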
Note: The first call to the model will take a few seconds to download your model weights and initialize them on the GPU; subsequent predictions will be much quicker.
On our local tests, we saw a sustained throughput of
- 4 frames per second on the Jetson Nano 2GB (with swap memory)
- 6 frames per second on the Jetson Nano 4GB
- 10 frames per second on the Jetson Xavier NX (single instance)
- 15 frames per second on the Jetson Xavier NX (2 instance cluster; see below)
These results were obtained while operating in a client-server context (so some minor network latency is involved) with a 416x416 model.
Note: If your application is also running on the Jetson itself, you will incur less network latency but will also be sharing compute and memory resources so your results may vary.
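If you want to benchmark your own setup, sustained throughput is simply frames processed divided by wall-clock time. A minimal sketch (where infer_fn is a stand-in for your actual inference call):

```python
import time

def measure_fps(infer_fn, frames):
    """Time sequential calls to infer_fn over frames and return sustained frames per second."""
    start = time.perf_counter()
    for frame in frames:
        infer_fn(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

Run it over a few hundred frames after the first (slow, weight-downloading) request to get a representative number.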
The weights for your model are downloaded each time the container runs. Full offline mode support (for autonomous and air-gapped devices) is available for enterprise deployments.
The Jetson Nano 2GB requires a swapfile to be created or it will run out of memory and crash while trying to initialize your model. Do this before you docker run the inference server container to add 8GB of swap memory:
sudo fallocate -l 8.0G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
To persist these changes across reboots, add the following line to the end of /etc/fstab:
/swapfile none swap 0 0
Then reboot your device.
Because the swapfile lives on your micro SD card, it's important to ensure you have a card with high throughput. You may also want to disable your X Server and run in headless mode (see below).
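To confirm the swapfile is active after reboot, you can read SwapTotal from /proc/meminfo (a minimal sketch; Linux only):

```python
def parse_swap_total_kb(meminfo_text):
    """Extract SwapTotal (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:     8388604 kB"
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        kb = parse_swap_total_kb(f.read())
    print(f"Swap configured: {kb / 1024 / 1024:.1f} GB")
```

Equivalently, `swapon --show` or `free -h` at the shell will report the same information.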
The inference server offers a cluster mode that lets it run multiple instances of itself in parallel and automatically load-balance requests between them. This will not improve latency, but it lets your device process multiple images at once by utilizing more CPU cores.
The number of instances you can run is limited by the input size of your model (determined by your Resize preprocessing step, or defaulting to 416x416 if you did not select a size), the amount of memory your device has, and the amount of memory needed for other services on the device (like your application code).
A Xavier NX's 8GB of memory can safely fit two instances of most models in memory while still leaving room for your program to run.
Tip: To ensure the maximum amount of memory is available, be sure to shut down your X Server with sudo service gdm stop or, if you want headless mode to be the default for the system, sudo systemctl set-default multi-user.target.
To start a two-instance cluster, add --env INSTANCES=2 to the docker run command:
sudo docker run --net=host --gpus all --env INSTANCES=2 roboflow/inference-server:jetson
In our tests, the Jetson Xavier NX's throughput went from ~10 frames per second with a single instance to 15 fps with two instances. We were only able to run a single instance on the Jetson Nano.
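Cluster mode only pays off if requests arrive in parallel. A minimal sketch of dispatching frames concurrently from a client (infer_fn is a stand-in for your actual HTTP inference call):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_batch(infer_fn, frames, workers=2):
    """Send frames to the inference server concurrently and return results in order.

    infer_fn: callable that sends one frame to the server and returns its result.
    workers: match this to the INSTANCES count configured on the Jetson.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(infer_fn, frames))
```

Threads are sufficient here because each worker spends its time waiting on network I/O rather than computing.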
We got the best results with power mode 1 (15 watts, 4 CPU cores) for 2-instance cluster mode on the Xavier NX. You can enable this mode with:
sudo nvpmodel -m 1
The Roboflow Inference Server takes up 5GB of disk space. We recommend using a fast SD Card (at least U3, V30). Alternatively, some Jetson-powered embedded devices feature integrated eMMC flash memory which should be even more performant. The default JetPack install consumes ~15GB so it's preferable to have an SD Card or Flash Memory capacity of 32GB or higher (but it is possible to run on 16GB by removing unnecessary packages).
Our Docker image contains all of the needed CUDA and CuDNN packages required to run our models; this means CUDA and CuDNN are not needed on the host so long as NVIDIA Container Runtime and the NVIDIA graphics drivers remain installed.
You can remove these packages and free up about 7GB of space by running the following commands to remove CUDA, CuDNN, and some other large extraneous packages:
sudo apt purge cuda-tools-10-2 libcudnn8 cuda-documentation-10-2 cuda-samples-10-2 nvidia-l4t-graphics-demos ubuntu-wallpapers-bionic libreoffice* chromium-browser* thunderbird fonts-noto-cjk
sudo apt autoremove