# Troubleshooting

This page lists known issues, limitations, and workarounds for Batch Processing. If you encounter a problem not listed here, please report it through our [support channels](https://github.com/roboflow/inference/issues).

## Known Limitations

* Certain Workflow blocks that require access to environment variables or local storage (such as File Sink and Environment Secret Store) are disabled and will not execute.
* The service only works with Workflows that define a **single** input image parameter.

## Technical Details

* Data is stored in Data Staging with a **7-day expiry**.
* Each batch processing job contains multiple stages (typically `processing` and `export`). Each stage creates an output batch. We recommend using `export` stage outputs, as they are compressed for efficient transfer.
* A running job in the `processing` stage can be aborted from either the UI or the CLI.
* An aborted or failed job can be restarted (see the CLI sketch after this list).
* The service automatically shards data and processes it in parallel:
  * The number of machines scales automatically based on data volume (throughput can reach 500k–1M images/hour for certain workloads).
  * Each machine runs multiple workers that process chunks of data. The worker count is configurable and should be tuned to balance speed against cost.
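
The abort and restart operations above are also available from the command line. A minimal sketch, assuming the `inference rf-cloud batch-processing` command group from the Inference CLI; the job ID is a placeholder, and exact subcommand and flag names may vary by CLI version, so confirm with `inference rf-cloud batch-processing --help`:

```bash
# Inspect a job's stages and status (job ID is a placeholder)
inference rf-cloud batch-processing show-job-details --job-id my-job

# Abort a job that is still in the `processing` stage
inference rf-cloud batch-processing abort-job --job-id my-job

# Restart an aborted or failed job
inference rf-cloud batch-processing restart-job --job-id my-job
```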

## Job Timed Out

### Issue

Batch jobs terminate prematurely if the **Processing Timeout Hours** setting is too low relative to the job's size or complexity.

<figure><img src="https://media.roboflow.com/inference/batch-processing/batch-processing-timeout.png" alt=""><figcaption><p>Processing Timeout setting in the UI</p></figcaption></figure>

### Details

The timeout setting (UI) or `--max-runtime-seconds` (CLI) defines the **maximum cumulative machine runtime across all parallel workers**.

* **Total compute time:** If the limit is 2 hours and the job spawns 2 machines, each can run for a maximum of 1 hour (2 machines x 1 hour = 2 hours total).
* **Divided per chunk:** Jobs are split into processing chunks to enable parallelism. The timeout is divided across chunks, so a short timeout combined with many chunks may leave too little time per chunk (see the sketch after this list).
* **Machine type matters:** Running complex Workflows on CPU increases processing time significantly. Use GPU where appropriate.
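
To make the division concrete, here is a back-of-the-envelope sketch of the budget arithmetic described above. The machine and chunk counts are illustrative assumptions, not values the scheduler guarantees:

```bash
# Illustrative arithmetic only, not a CLI command:
# a 2-hour cumulative budget spread over 2 machines and 8 chunks per machine
MAX_RUNTIME_SECONDS=7200   # Processing Timeout of 2 hours
MACHINES=2
CHUNKS_PER_MACHINE=8

PER_MACHINE=$((MAX_RUNTIME_SECONDS / MACHINES))    # 3600 s per machine
PER_CHUNK=$((PER_MACHINE / CHUNKS_PER_MACHINE))    # 450 s per chunk
echo "per machine: ${PER_MACHINE}s, per chunk: ${PER_CHUNK}s"
```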

### Recommendations

* Start with a generous timeout (e.g., 4–6 hours) for large datasets or multi-stage Workflows; a CLI example follows this list.
* Monitor actual job runtimes to inform future timeout settings.
* Consider reducing chunk count or using video frame sub-sampling for faster processing.
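
For jobs submitted from the command line, the same budget is set with `--max-runtime-seconds`. A minimal sketch, assuming the `process-images-with-workflow` subcommand; the workflow ID and images directory are placeholders, and flag names other than `--max-runtime-seconds` should be verified against `--help`:

```bash
# Submit an image job with a 6-hour cumulative runtime budget
# (6 hours x 3600 s/hour = 21600 seconds across all machines)
inference rf-cloud batch-processing process-images-with-workflow \
  --workflow-id my-workflow \
  --images-dir ./images \
  --max-runtime-seconds 21600
```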

## Workflow with SAHI Runs Too Long

### Issue

Jobs using SAHI — particularly with high-resolution inputs and instance segmentation — may take much longer than expected.

### Causes and Recommendations

**Excessive number of slices:** SAHI splits images into smaller slices for detection. With default settings and high-resolution inputs, this can mean dozens or hundreds of inferences per image.

* Check the Image Slicer block configuration. Reduce the slice count, or downscale inputs using a Resize Image block earlier in the Workflow.

**Consider larger model input size instead of SAHI:** Training a model with larger input dimensions can eliminate the need for SAHI entirely. Test on a small sample first.

**Instance segmentation bottleneck:** When SAHI is used with instance segmentation, the Detections Stitch block (especially with NMS) can become a major bottleneck — stitching a single frame can take tens of seconds.

**Video jobs with SAHI:** Use FPS sub-sampling to skip frames:

* In the UI, use the **Video FPS sub-sampling** dropdown.
* In the CLI, use the `--max-video-fps` flag (see the sketch after the figure below).

<figure><img src="https://media.roboflow.com/inference/batch-processing/limiting-video-fps.png" alt=""><figcaption><p>FPS sub-sampling setting in the UI</p></figcaption></figure>

## Out of Memory (OOM) Errors

### Issue

Jobs fail due to OOM errors when the Workflow consumes more RAM or VRAM than available.

### Common Causes

* **SAHI + Instance Segmentation:** This combination is extremely memory-intensive. SAHI multiplies inference calls, and instance segmentation generates large outputs (masks, scores), often leading to crashes.
* **Too many workers per machine:** Running multiple workers per machine optimizes cost and speed for lightweight Workflows, but with heavy Workflows (multiple large models, complex post-processing) the combined memory footprint can exceed what the machine provides.

### Recommendations

* Use fewer workers per machine (e.g., 1 or 2) for Workflows with large models, SAHI, or high-resolution inputs.
* Lower the **Workers Per Machine** value under Advanced Options in the UI, or set it from the CLI (see the sketch after the figure below).
* Switch from CPU to GPU machines if your model needs higher memory throughput.
* Test your Workflow on a small dataset before running large batches.
* Reduce input resolution or simplify the Workflow by removing unneeded blocks.

<figure><img src="https://media.roboflow.com/inference/batch-processing/workers-number-adjustment.png" alt=""><figcaption><p>Workers per machine setting in the UI</p></figcaption></figure>
