
# Datasources

Datasources let you continuously mirror images and metadata from cloud storage into your Roboflow asset library. Once mirrored, images are searchable by semantics, custom metadata, tags, or image similarity, and can be added to any Project for labeling and training.

Datasources currently support mirroring from AWS S3 and S3-compatible storage buckets. Support for Azure Blob Storage and Google Cloud Storage is coming soon.

## How Bucket Mirroring Works

When you configure a Datasource, Roboflow crawls your S3 bucket and imports all matching image files into your Workspace's [Asset Library](/workspaces/asset-library.md).

* Supported image formats: JPEG, PNG, BMP, WebP, AVIF
* Files already present in your workspace (matched by their S3 location and hash) are not re-imported, reducing egress costs
* If a `.json` sidecar file exists alongside an image with the same base name, its metadata is imported; nested keys are flattened using dot notation (e.g., `capture.temperature`) — see [Metadata Sidecars](#metadata-sidecars)
* Files that disappear from the bucket can optionally be removed from your workspace (see [Removing Orphaned Files](#removing-orphaned-files))
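The sidecar pairing rule above can be sketched in a few lines of Python. This is an illustration only, not Roboflow's implementation: the extension list and the case-insensitive extension check are assumptions, and `pair_images_with_sidecars` is a hypothetical helper name.

```python
import posixpath

# Assumed file extensions for the supported formats (JPEG, PNG, BMP, WebP, AVIF).
# The exact list accepted by the crawler is not documented on this page.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp", ".avif"}


def pair_images_with_sidecars(keys):
    """Map each image key to its `.json` sidecar key, or None if there is none."""
    key_set = set(keys)
    pairs = {}
    for key in keys:
        base, ext = posixpath.splitext(key)
        if ext.lower() not in IMAGE_EXTENSIONS:
            continue  # sidecars and unsupported files are not images
        sidecar = base + ".json"  # same folder, same base name
        pairs[key] = sidecar if sidecar in key_set else None
    return pairs


keys = [
    "images/photo_001.jpg",
    "images/photo_001.json",
    "images/photo_002.png",
    "images/notes.txt",
]
print(pair_images_with_sidecars(keys))
```

Here `photo_001.jpg` is paired with `photo_001.json`, `photo_002.png` has no sidecar, and `notes.txt` is ignored.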

## Mirror Your Bucket to Roboflow

### Prerequisites

1. An S3 bucket containing your image data
2. Access for Roboflow to read your bucket — choose one:

   **Option A: IAM credentials (Access Key + Secret)**

   Create an IAM user with the AWS managed policy `AmazonS3ReadOnlyAccess` attached, or a custom policy granting `s3:ListBucket` on the bucket and `s3:GetObject` on its objects (`HeadObject` requests are authorized by `s3:GetObject`; there is no separate `s3:HeadObject` IAM action). Provide the Access Key ID and Secret Access Key to Roboflow.

   **Option B: Bucket policy (grant Roboflow's AWS role)**

   No IAM credentials needed. Instead, add a bucket policy that grants Roboflow's AWS role read access directly.
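As a rough illustration of both options, the Python sketch below builds a least-privilege read-only policy document. The bucket name and role ARN are placeholders (the ARN of Roboflow's AWS role is not given on this page), and the helper name `read_only_policy` is hypothetical. It grants `s3:ListBucket` on the bucket and `s3:GetObject` on its objects; S3 authorizes `HeadObject` requests through `s3:GetObject`.

```python
import json

BUCKET = "my-bucket"  # hypothetical bucket name
# Placeholder only: this is NOT Roboflow's real role ARN.
ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"


def read_only_policy(bucket, principal_arn=None):
    """Build a read-only policy; pass principal_arn to get a bucket policy."""
    statements = [
        # Listing applies to the bucket itself.
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": f"arn:aws:s3:::{bucket}"},
        # Reading applies to the objects inside the bucket.
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": f"arn:aws:s3:::{bucket}/*"},
    ]
    if principal_arn is not None:
        # Bucket policies (Option B) must name who is granted access.
        for statement in statements:
            statement["Principal"] = {"AWS": principal_arn}
    return {"Version": "2012-10-17", "Statement": statements}


iam_policy = read_only_policy(BUCKET)               # Option A: attach to the IAM user
bucket_policy = read_only_policy(BUCKET, ROLE_ARN)  # Option B: attach to the bucket
print(json.dumps(bucket_policy, indent=2))
```

The only difference between the two documents is the `Principal` element, which identity-based policies omit.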

### Configure a Datasource for Bucket Mirroring

[Create a new Datasource](https://app.roboflow.com/settings/datasources) from your workspace settings.

### Filtering with Glob Patterns

By default, all supported image files in the bucket are imported. You can restrict which files are imported using glob patterns, either specified directly or via a `.txt` file stored in the bucket.

You may also provide an explicit whitelist of file paths instead of glob patterns.

### Pattern Semantics

* `*` matches any characters except `/` (single directory level)
* `**` matches any characters including `/` (multiple directory levels)

### Examples

**Match by prefix:**

```
harvest**
```

Matches: `harvest`, `harvest2024`, `harvest/sun/file.jpg`, `harvest-data.png`\
Does not match: `Harvest`, `my-harvest`

**Match everything in a folder:**

```
/harvest/sun/**
```

Matches: `/harvest/sun/file.txt`, `/harvest/sun/subfolder/image.jpg`, `/harvest/sun/deep/nested/path/data.png`\
Does not match: `/harvest/moon/file.txt`, `/other/sun/file.txt`

**Match by suffix within a subtree:**

```
/planting/**/*crops.png
```

Matches: `/planting/wheat-crops.png`, `/planting/subfolder/rice-crops.png`\
Does not match: `/planting/wheat.png`, `/other/wheat-crops.png`

**Match at a specific directory level with a name pattern:**

```
/*/a/**/*weed*2025-10-27.png
```

Matches: `/farm/a/field/weed-2025-10-27.png`, `/garden/a/plot/seaweed-data-2025-10-27.png`\
Does not match: `/farm/b/field/weed-2025-10-27.png`

**Exact path:**

```
/exact/path/to/file.jpg
```

Matches only that specific file.

**Literal wildcards in filenames:**\
Wrap the pattern in quotes to treat `*` as a literal character:

```
"/path/to/file*.jpg"
```
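To check a pattern before saving it, the semantics above can be approximated locally. The sketch below translates a pattern into a regular expression; it is not Roboflow's matcher. It assumes that `**/` may match zero directories (as the `/planting/**/*crops.png` example implies), that matching is case-sensitive and anchored to the whole key, and that a quoted pattern is fully literal. `glob_to_regex` and `matches` are hypothetical names.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob pattern into a compiled regex (local approximation)."""
    # A pattern wrapped in double quotes is treated as a literal path.
    if len(pattern) >= 2 and pattern[0] == pattern[-1] == '"':
        return re.compile(re.escape(pattern[1:-1]))
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more directory levels
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")        # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # any characters except "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, key):
    """Return True if the whole key matches the pattern."""
    return glob_to_regex(pattern).fullmatch(key) is not None


print(matches("harvest**", "harvest/sun/file.jpg"))                   # True
print(matches("harvest**", "my-harvest"))                             # False
print(matches("/planting/**/*crops.png", "/planting/wheat-crops.png"))  # True
print(matches('"/path/to/file*.jpg"', "/path/to/file1.jpg"))          # False
```

Running the documented examples through `matches` is a quick way to confirm that a pattern selects the files you expect before a sync imports them.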

### Removing Orphaned Files

When `removeOrphanedSourcesFromWorkspace` is enabled, files that are no longer present in your S3 bucket (or no longer matched by your glob patterns) are removed from your Roboflow workspace, provided they are not referenced by any Project or another Datasource configuration.

This also applies when you delete a Datasource. If orphan removal is enabled and the bucket has been mirrored at least once, the cleanup worker may remove images that originated from that bucket and are not used by any Project or other Datasource. The delete confirmation dialog warns you about this and requires explicit acknowledgement before proceeding. To keep the images, disable orphan removal in the Datasource's mirror configuration before deleting it.

### File Naming

The `namingStrategy` setting controls how imported files are named and displayed in Roboflow:

| Strategy   | Description                                                                                 |
| ---------- | ------------------------------------------------------------------------------------------- |
| `fullPath` | Uses the full S3 key path as the filename (default)                                         |
| `fileName` | Uses only the filename portion of the S3 key                                                |
| `eTag`     | Uses the S3 object ETag                                                                     |
| `metadata` | Uses a value from the image's metadata, specified by `namingStrategyMetadataKey` (required) |
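The table can be read as a simple mapping from an S3 object to a display name. The sketch below is a hedged illustration: `display_name` and its parameters are hypothetical, and how Roboflow handles a missing metadata key is not documented here, so the sketch simply raises an error.

```python
def display_name(strategy, key, etag=None, metadata=None, metadata_key=None):
    """Derive a file name from an S3 object under the given naming strategy."""
    if strategy == "fullPath":
        return key                     # full S3 key path
    if strategy == "fileName":
        return key.rsplit("/", 1)[-1]  # last path segment only
    if strategy == "eTag":
        if etag is None:
            raise ValueError("eTag strategy needs the object's ETag")
        return etag
    if strategy == "metadata":
        if not metadata_key:
            raise ValueError("namingStrategyMetadataKey is required")
        # Assumption: a missing key is an error (raises KeyError here).
        return str((metadata or {})[metadata_key])
    raise ValueError(f"unknown naming strategy: {strategy}")


key = "harvest/sun/photo_001.jpg"
print(display_name("fullPath", key))  # harvest/sun/photo_001.jpg
print(display_name("fileName", key))  # photo_001.jpg
print(display_name("metadata", key, metadata={"camera_id": "cam001"},
                   metadata_key="camera_id"))  # cam001
```

Note that `fileName` can produce the same name for objects in different folders, whereas `fullPath` keeps names unique within a bucket.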

### Image Updates

When an image in S3 is modified, Roboflow can update the copy in your workspace:

* `updateImageWhenNewer` (default: `true`) — re-imports the image when the S3 object is newer than the stored version
* `updateImageStrategy` — controls how the update is applied; currently `overwrite` (replaces the existing image) is supported

### Metadata Sidecars

Attach metadata to images by placing a `.json` sidecar file in the bucket alongside each image, using the same base name:

```
my-bucket/
  images/
    photo_001.jpg
    photo_001.json      # metadata for photo_001.jpg
    photo_002.jpg
    photo_002.json      # metadata for photo_002.jpg
```

The sidecar file contains key-value pairs:

```json
{
    "camera_id": "cam001",
    "location": "warehouse-3",
    "capture": { "temperature": 72.5, "humidity": 45 }
}
```

Nested objects are flattened using dot notation. The example above produces:

| Key                   | Value           |
| --------------------- | --------------- |
| `camera_id`           | `"cam001"`      |
| `location`            | `"warehouse-3"` |
| `capture.temperature` | `72.5`          |
| `capture.humidity`    | `45`            |

Sidecar file constraints:

* Maximum file size: 256 KB
* Must be valid JSON
* Keys whose value is `null` are filtered out
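The flattening and the constraints above can be reproduced locally to preview what a sidecar will produce. This is a minimal sketch, not Roboflow's parser: it assumes 256 KB means 256 × 1024 bytes, that the top-level JSON value must be an object, and that arrays are kept as plain values. `flatten` and `parse_sidecar` are hypothetical names.

```python
import json

MAX_SIDECAR_BYTES = 256 * 1024  # assumption: KB means 1024 bytes


def flatten(obj, prefix=""):
    """Flatten nested objects into dot-notation keys, dropping null values."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if value is None:
            continue  # null values are filtered out
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def parse_sidecar(raw):
    """Validate raw sidecar bytes and return the flattened metadata."""
    if len(raw) > MAX_SIDECAR_BYTES:
        raise ValueError("sidecar exceeds 256 KB")
    data = json.loads(raw)  # raises ValueError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("sidecar must contain a JSON object")
    return flatten(data)


raw = b'''{
    "camera_id": "cam001",
    "location": "warehouse-3",
    "capture": { "temperature": 72.5, "humidity": 45 },
    "operator": null
}'''
print(parse_sidecar(raw))
```

The result contains `camera_id`, `location`, `capture.temperature`, and `capture.humidity`; the `operator` key is dropped because its value is `null`.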

### Metadata Sync Strategies

When an image's metadata sidecar `.json` file is updated in S3, two settings control how the update is applied:

* `updateMetadataWhenNewer` (default: `true`) — re-syncs metadata when the sidecar file is newer than the stored version
* `updateMetadataStrategy` — controls how the synced metadata interacts with metadata you have set manually via the UI or API:

| Strategy                    | Behavior                                                                     |
| --------------------------- | ---------------------------------------------------------------------------- |
| `mergeBucketWins` (default) | Merges both sources; on key conflicts, the bucket value wins                 |
| `mergeUserWins`             | Merges both sources; on key conflicts, the user-set value wins               |
| `overwrite`                 | Bucket metadata completely replaces all existing metadata                    |
| `untilFirstChange`          | Syncs from bucket until a user manually edits any metadata field, then stops |
| `append`                    | Only adds new keys from the bucket; never overwrites existing keys           |
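One way to reason about the differences between the strategies is to model them as a merge function. The sketch below is an interpretation of the table, not Roboflow's code: `sync_metadata` is a hypothetical name, `user_keys` stands for the keys a user has set manually, and `untilFirstChange` is assumed to merge with the bucket winning until the first manual edit.

```python
def sync_metadata(strategy, existing, bucket, user_keys=frozenset()):
    """Return the metadata that results from syncing `bucket` into `existing`.

    existing:  metadata currently stored for the image
    bucket:    flattened metadata from the updated sidecar
    user_keys: keys set manually via the UI or API (assumed to be tracked)
    """
    if strategy == "mergeBucketWins":
        return {**existing, **bucket}
    if strategy == "mergeUserWins":
        user_values = {k: v for k, v in existing.items() if k in user_keys}
        return {**existing, **bucket, **user_values}
    if strategy == "overwrite":
        return dict(bucket)
    if strategy == "untilFirstChange":
        # Assumption: once any manual edit exists, syncing stops entirely.
        return dict(existing) if user_keys else {**existing, **bucket}
    if strategy == "append":
        return {**bucket, **existing}  # existing keys are never overwritten
    raise ValueError(f"unknown strategy: {strategy}")


existing = {"location": "dock-1", "reviewed": True, "camera_id": "cam001"}
bucket = {"location": "warehouse-3", "camera_id": "cam002", "shift": "night"}
user_keys = {"location", "reviewed"}

for name in ("mergeBucketWins", "mergeUserWins", "overwrite",
             "untilFirstChange", "append"):
    print(name, sync_metadata(name, existing, bucket, user_keys))
```

In this example, `mergeUserWins` keeps the manually set `location` but still updates `camera_id` from the bucket, while `append` leaves both untouched and only adds `shift`.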

## Triggering a Mirror

Datasources mirror on a recurring schedule. You can also trigger a mirror manually at any time from the [Datasources list](https://app.roboflow.com/settings/datasources) by clicking the play button next to a Datasource.

A manual trigger is subject to two guards:

* **In-progress**: If a sync is already running, you cannot start another one until it finishes.
* **Cooldown**: After a successful sync, manual re-triggers are blocked for one hour. The button tooltip shows how many minutes remain. If the previous sync imported nothing (zero files enqueued, or all files failed), the cooldown is skipped so you can retry immediately.

Scheduled (cron) syncs are not affected by the manual cooldown.
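The two guards on manual triggers can be summarized as a small decision function. This is a sketch of the rules as described above, not the product's code: `manual_trigger_check` and its parameters are hypothetical, and rounding the remaining time up to whole minutes is an assumption about what the tooltip shows.

```python
import math
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=1)


def manual_trigger_check(now, sync_running, last_sync_finished,
                         files_enqueued, files_failed):
    """Return (allowed, minutes_remaining) for a manual trigger."""
    if sync_running:
        return (False, None)  # in-progress guard: wait for the sync to finish
    if last_sync_finished is None:
        return (True, 0)      # never synced, so there is nothing to cool down from
    imported_nothing = files_enqueued == 0 or files_failed >= files_enqueued
    if imported_nothing:
        return (True, 0)      # cooldown is skipped so you can retry immediately
    remaining = last_sync_finished + COOLDOWN - now
    if remaining <= timedelta(0):
        return (True, 0)
    return (False, math.ceil(remaining.total_seconds() / 60))


finished = datetime(2025, 1, 1, 12, 0)
print(manual_trigger_check(datetime(2025, 1, 1, 12, 20), False, finished, 10, 0))
print(manual_trigger_check(datetime(2025, 1, 1, 12, 20), False, finished, 0, 0))
```

The first call is blocked with 40 minutes remaining; the second is allowed because the previous sync enqueued no files.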

## Viewing Synced Assets

Each Datasource row in the [Datasources list](https://app.roboflow.com/settings/datasources) has an eye icon that opens the [Asset Library](/workspaces/asset-library.md) filtered to images from that specific Datasource. The icon is disabled until the Datasource has completed at least one sync.

To view all images synced from any Datasource, click "View Datasource Assets" at the bottom of the Datasources list. This link appears once at least one Datasource has run.

Both links navigate to the Asset Library with a pre-filled tag filter so you can browse, search, and manage only the bucket-synced subset of your Workspace images.

## Using S3-Compatible Storage

Datasources work with S3-compatible storage providers that implement the required S3 API operations, such as Wasabi, Backblaze, Cloudflare R2, DigitalOcean Spaces, and others. The same glob pattern filtering and metadata sidecar behavior applies. You must provide the provider's `endpoint` URL in the Datasource configuration and set the region to `auto` or the appropriate value for that provider.

