Hugging Face model and dataset repositories are perfect for publishing your final product. However, production ML generates a constant stream of intermediate files: checkpoints, optimizer state, processed shards, logs, traces, and more. These files change frequently, arrive from many jobs at once, and rarely need version control.
Storage buckets are built for exactly this purpose: mutable, S3-like object storage that lives on the Hub, can be scripted from Python, and can be managed with the hf CLI. Buckets are also backed by Xet, which makes them particularly efficient for ML artifacts that share content across files.
Why we built buckets
Git quickly starts to feel like the wrong abstraction when you’re dealing with:
- A training cluster that writes checkpoints and optimizer state throughout the run
- A data pipeline that iterates over the raw dataset
- An agent that stores traces, memory, and a shared knowledge graph
In every case, the storage need is the same: write fast, overwrite when necessary, sync directories, delete old files, and keep moving.
A bucket is a non-versioned storage container on the Hub. It lives under a user or organization namespace, carries standard Hugging Face permissions, can be private or public, has a page you can open in the browser, and can be addressed programmatically with a handle like hf://buckets/username/my-training-bucket.
Why is Xet important?
Buckets are built on Xet, Hugging Face’s chunk-based storage backend, and that matters more than you might think.
Instead of treating files as monolithic blobs, Xet splits their content into chunks and deduplicates across those chunks. Uploading a processed dataset that closely resembles the raw one? Many of its chunks already exist. Saving consecutive checkpoints where large parts of the model are frozen? Same story. The bucket skips bytes it already has, which means less bandwidth, faster transfers, and more efficient storage.
This is a natural fit for ML workloads. Training pipelines routinely produce families of related artifacts (raw and processed data, successive checkpoints, agent traces, derived summaries), and Xet is designed to exploit that overlap.
For enterprise customers, billing is based on deduplicated storage, so shared chunks directly reduce your billed footprint. Deduplication helps with both speed and cost.
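To make the mechanics concrete, here is a minimal sketch of chunk-level deduplication. This is not Xet’s actual algorithm (Xet uses content-defined chunking, and the filenames are placeholders); it only shows why two similar files share most of their storage:

import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks, for illustration only

def chunk_hashes(data):
    # Hash each chunk so identical content maps to the same key
    return {
        hashlib.sha256(data[i : i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }

raw = open("raw_dataset.bin", "rb").read()              # placeholder file
processed = open("processed_dataset.bin", "rb").read()  # placeholder file

raw_chunks = chunk_hashes(raw)
proc_chunks = chunk_hashes(processed)
new_chunks = proc_chunks - raw_chunks  # only these bytes would need uploading
print(f"upload {len(new_chunks)} of {len(proc_chunks)} chunks; the rest already exist")

Fixed-size boundaries stop matching as soon as bytes are inserted mid-file, which is exactly why Xet derives chunk boundaries from the content itself.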
Prewarming: bringing data closer to compute
Buckets live on the Hub, which means global storage by default. But some workloads, such as distributed training or large pipelines, can’t afford to fetch data from just anywhere: storage location directly impacts throughput.
Prewarming lets you move hot data closer to the cloud provider and region where your compute runs. Rather than pulling data across regions on every read, you declare where you want it, and the bucket makes sure it’s there when the job starts. This is especially useful for training clusters that need fast access to large datasets or checkpoints, and for multi-region setups where different parts of the pipeline run in different clouds.
We are partnering initially with AWS and GCP, with more cloud providers to come.
Getting started
With the hf CLI, you can get your bucket up and running in less than two minutes. First, install and log in.
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
Create a bucket for your project.
hf bucket create my-training-bucket --private
Suppose your training job writes checkpoints locally to ./checkpoints. Sync that directory to your bucket.
hf bucket sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints
For large transfers, you may want to see what would happen before anything moves. --dry-run prints the plan without transferring anything.
hf bucket sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --dry-run
You can also save your plan to a file for review and apply it later.
hf bucket sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --plan sync-plan.jsonl
hf bucket sync --apply sync-plan.jsonl
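Because the plan file is JSON Lines, you can also review it programmatically before applying. A minimal sketch; the per-entry schema isn’t documented above, so this just parses and prints each planned operation:

import json

# Inspect the sync plan before applying it; each line is one JSON object
# describing a planned operation, so we simply print the parsed entries.
with open("sync-plan.jsonl") as f:
    for line in f:
        print(json.loads(line))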
Once the sync completes, inspect your bucket from the CLI.
hf bucket ls username/my-training-bucket -h
Or browse directly on the hub: https://huggingface.co/buckets/username/my-training-bucket.
That’s the whole loop: create a bucket, sync your working data into it, inspect it when needed, and reach for a versioned repository once something is worth publishing. For one-off operations, hf bucket cp copies individual files and hf bucket rm cleans up old objects.
Use buckets from Python
All of the above also works from Python via huggingface_hub (available since v1.5.0). The API follows the same pattern: create, sync, inspect.
from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

create_bucket("my-training-bucket", private=True, exists_ok=True)

sync_bucket(
    "./checkpoints",
    "hf://buckets/username/my-training-bucket/checkpoints",
)

for item in list_bucket_tree(
    "username/my-training-bucket", prefix="checkpoints", recursive=True
):
    print(item.path, item.size)
This makes it easy to integrate buckets into training scripts, data pipelines, or services that manage artifacts programmatically. The Python client also supports batch uploads, selective downloads, deletions, and moves when you need more control.
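For instance, a training job can push checkpoints as it goes. Here is a minimal sketch built around sync_bucket from the example above; train_step and save_checkpoint are placeholder stubs for your own training code:

from huggingface_hub import sync_bucket

def train_step():
    ...  # placeholder: forward/backward pass and optimizer step

def save_checkpoint(path):
    ...  # placeholder: write model and optimizer state under `path`

SYNC_EVERY = 1_000
TOTAL_STEPS = 10_000

for step in range(1, TOTAL_STEPS + 1):
    train_step()
    if step % SYNC_EVERY == 0:
        save_checkpoint("./checkpoints")
        # Overwrites are fine: buckets are mutable, and Xet only
        # transfers the chunks that actually changed.
        sync_bucket("./checkpoints", "hf://buckets/username/my-training-bucket/checkpoints")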
Bucket support is also available in JavaScript via @huggingface/hub (v2.10.5 and later), so you can also integrate buckets into Node.js services and web applications.
File system integration
Buckets also work through huggingface_hub’s fsspec-compatible file system, HfFileSystem. This means you can list, read, write, and glob the contents of your bucket using standard file system operations. Additionally, any library that supports fsspec can access buckets directly.
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

fs.ls("buckets/username/my-training-bucket/checkpoints", detail=False)
fs.glob("buckets/username/my-training-bucket/**/*.parquet")

with fs.open("buckets/username/my-training-bucket/config.yaml", "r") as f:
    print(f.read())
Because fsspec is a standard Python interface for remote file systems, libraries such as pandas, Polars, and Dask can read from and write to your bucket using hf:// paths without any additional configuration.
import pandas as pd

df = pd.read_csv("hf://buckets/username/my-training-bucket/results.csv")
df.to_csv("hf://buckets/username/my-training-bucket/summary.csv")
This makes it easy to connect your bucket to your existing data workflows without changing how your code reads or writes files.
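As one more example, here is a sketch with Polars, assuming the bucket holds Parquet shards under a data/ prefix (the path is illustrative) and that hf:// bucket paths resolve through fsspec the same way:

import polars as pl

# Read Parquet shards straight from the bucket via the fsspec-backed hf:// scheme
df = pl.read_parquet("hf://buckets/username/my-training-bucket/data/*.parquet")
print(df.head())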
From bucket to versioned repository
Buckets are the fast, mutable place for artifacts that are still in motion. Once something becomes a stable artifact, it typically belongs in a versioned model or dataset repository.
Our roadmap includes direct transfers in both directions between buckets and repositories: promote final checkpoint weights to a model repository, or commit processed shards to a dataset repository once the pipeline completes. The working layer and the publishing layer stay separate but fit into one continuous, Hub-native workflow.
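Until that lands, you can already bridge the two layers with the existing repo API. A minimal sketch: fetch the final checkpoint out of the bucket with HfFileSystem (fsspec’s standard get), then publish it with create_repo and upload_folder; the repo id and bucket path are illustrative:

from huggingface_hub import HfFileSystem, create_repo, upload_folder

fs = HfFileSystem()

# 1. Pull the final checkpoint from the bucket to a local directory.
fs.get(
    "buckets/username/my-training-bucket/checkpoints/final",
    "./final-checkpoint",
    recursive=True,
)

# 2. Publish it as a versioned model repository.
create_repo("username/my-model", repo_type="model", exist_ok=True)
upload_folder(
    repo_id="username/my-model",
    folder_path="./final-checkpoint",
    commit_message="Promote final checkpoint from training bucket",
)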
Trusted by launch partners
Before opening Buckets to everyone, we ran a private beta with a small group of launch partners.
Many thanks to Jasper, Arcee, IBM, and PixAI for testing early versions, uncovering bugs, and sharing feedback that directly shaped this feature.
Conclusion and resources
Storage buckets bring the missing storage layer to the Hub: a Hub-native home for checkpoints, processed data, agent traces, logs, and everything else that is useful before it becomes final. They are the mutable, high-throughput side of ML.
Because buckets are built on Xet, they are not just easier than forcing everything through Git; they are also efficient for the kinds of related artifacts AI systems generate all the time. That means faster transfers, better deduplication, and, on Enterprise plans, billing that benefits from your deduplicated footprint.
If you’re already on the Hub, buckets let you keep more of your workflow in one place: familiar S3-style storage, a better fit for AI artifacts, and a clear path to final publication on the Hub.
Buckets are included in your existing Hub storage plan. Free accounts come with storage, and PRO and Enterprise plans offer higher limits. Please see the storage page for more details.
Read more and try it yourself.

