Hugging Face TRL is now officially integrated with RapidFire AI to accelerate fine-tuning and post-training experimentation. TRL users can discover, install, and run RapidFire AI as the fastest way to customize LLMs, comparing multiple fine-tuning/post-training configurations without major code changes or extra GPU requirements.
Why this matters
When fine-tuning or post-training LLMs, teams often lack the time or budget to compare multiple configurations, even though doing so can significantly improve evaluation metrics. RapidFire AI lets you launch multiple TRL configurations simultaneously, even on a single GPU, and compare them in near real time through a new adaptive chunk-based scheduling and execution scheme. In the internal benchmarks referenced on the TRL page, this yields roughly 16-24x higher experimentation throughput than comparing configurations one by one, helping you reach better metrics faster.
RapidFire AI establishes live three-way communication between the IDE, metrics dashboard, and multi-GPU execution backend.
What you get right out of the box
Drop-in TRL wrappers — Use RFSFTConfig, RFDPOConfig, and RFGRPOConfig as near-zero-code replacements for the SFT/DPO/GRPO configurations in TRL.
Adaptive chunk-based concurrent training — RapidFire AI shards datasets into a specified number of chunks and cycles configurations at chunk boundaries, enabling early apples-to-apples comparisons and maximizing GPU utilization.
Interactive Control Operations (IC Ops) — From the dashboard you can stop, resume, delete, and clone-modify running configurations (optionally with a warm start), letting you avoid wasting resources on lower-performing configurations and double down on higher-performing ones. No job restarts, no juggling separate GPUs or clusters, no resource bloat.

Clone a promising configuration with modified hyperparameters and optionally warm-start it from the parent's weights, all from the live dashboard.
Multi-GPU orchestration — The RapidFire AI scheduler automatically places and orchestrates configurations over chunks of data across available GPUs through an efficient shared-memory mechanism. Focus on models and evaluation metrics, not plumbing.
MLflow-based dashboard — Get real-time metrics, logs, and IC Ops in one place as soon as you start an experiment. Support for other dashboards such as Trackio, W&B, and TensorBoard is coming soon.
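To make the clone-modify IC Op concrete, here is a minimal conceptual sketch in plain Python. The function and field names (`clone_config`, `hyperparams`, `init_checkpoint`) are illustrative, not the RapidFire AI API; it only shows the semantics: a clone copies the parent's settings, applies overrides, and can optionally warm-start from the parent's current checkpoint.

```python
import copy

def clone_config(parent, overrides, warm_start=True):
    """Illustrative clone-modify: copy a run's config, apply hyperparameter
    overrides, and optionally warm-start from the parent's checkpoint."""
    child = copy.deepcopy(parent)
    child["hyperparams"].update(overrides)
    # Warm start: resume from the parent's latest checkpoint instead of
    # re-initializing weights from scratch.
    child["init_checkpoint"] = parent["checkpoint"] if warm_start else None
    return child

parent = {"hyperparams": {"learning_rate": 1e-3, "lora_r": 8},
          "checkpoint": "ckpt-chunk-2"}
# Lower the learning rate on a promising run without touching the parent.
child = clone_config(parent, {"learning_rate": 1e-4})
```

The deep copy matters: the parent run keeps training untouched while the clone diverges with its own hyperparameters.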
How it works
RapidFire AI randomly splits the dataset into "chunks" and cycles LLM configurations through the GPUs at chunk boundaries, giving you incremental signal on evaluation metrics across all configurations sooner. Automatic checkpointing via an efficient shared-memory-based adapter/model spill-and-load mechanism keeps training smooth, stable, and consistent. Use IC Ops to adapt in flight: stop underperformers early, clone promising configurations with adjusted knobs, and optionally warm-start from parent weights.
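The cycling order can be sketched with a few lines of plain Python (an illustration of the scheduling idea only, not RapidFire AI's actual scheduler, which also adapts to IC Ops and multiple GPUs): at each chunk boundary, every live configuration trains on that chunk before any configuration moves on, so all configurations produce comparable metrics early.

```python
def chunk_schedule(num_configs, num_chunks):
    """Illustrative round-robin order of (config, chunk) training steps:
    all configs finish chunk 0 before any config sees chunk 1."""
    order = []
    for chunk in range(num_chunks):
        for config in range(num_configs):
            order.append((config, chunk))
    return order

# With 2 configs and 4 chunks, both configs finish chunk 0 after just
# 2 steps, so they can be compared long before either sees the full dataset.
print(chunk_schedule(2, 4)[:2])  # [(0, 0), (1, 0)]
```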

Sequential vs. task-parallel vs. RapidFire AI: the adaptive scheduler maximizes GPU utilization across multiple configurations and GPUs. The bottom row shows IC Ops stopping, cloning, and modifying running configurations.
Quick start
Install RapidFire AI and be up and running in less than a minute.
```shell
pip install rapidfireai
huggingface-cli login --token YOUR_TOKEN
pip uninstall -y hf-xet
rapidfireai init
rapidfireai start
```
Then open the dashboard at http://localhost:3000 to monitor and control all experiments.
Supported TRL trainers
SFT with RFSFTConfig
DPO with RFDPOConfig
GRPO with RFGRPOConfig
These are designed as drop-in replacements, so you keep the TRL mental model while gaining far more concurrency and control over your fine-tuning/post-training workflows.
Minimal TRL SFT example
Train multiple configurations simultaneously, even on a single GPU:
```python
from rapidfireai import Experiment
from rapidfireai.automl import List, RFGridSearch, RFModelConfig, RFLoraConfig, RFSFTConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
train_dataset = dataset["train"].select(range(128)).shuffle(seed=42)

def formatting_function(row):
    return {
        "prompt": [
            {"role": "system", "content": "You are a friendly customer support assistant."},
            {"role": "user", "content": row["instruction"]},
        ],
        "completion": [{"role": "assistant", "content": row["response"]}],
    }

train_dataset = train_dataset.map(formatting_function)

config_set = List([
    RFModelConfig(
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=RFLoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]),
        training_args=RFSFTConfig(learning_rate=1e-3, max_steps=128, fp16=True),
        formatting_func=formatting_function,
    ),
    RFModelConfig(
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=RFLoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"]),
        training_args=RFSFTConfig(learning_rate=1e-4, max_steps=128, fp16=True),
        formatting_func=formatting_function,
    ),
])

experiment = Experiment(experiment_name="sft-comparison")
config_group = RFGridSearch(configs=config_set, trainer_type="SFT")

def create_model(model_config):
    model = AutoModelForCausalLM.from_pretrained(
        model_config["model_name"], device_map="auto", torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_config["model_name"])
    return (model, tokenizer)

experiment.run_fit(config_group, create_model, train_dataset, num_chunks=4, seed=42)
experiment.end()
```
What happens when you run this?
Suppose you run the above on a machine with two GPUs. Rather than training sequentially (config 1 → wait → config 2 → wait), both configurations train at the same time.
| Approach | Time to comparison decision | GPU utilization |
|---|---|---|
| Sequential (traditional) | ~15 minutes | ~60% |
| RapidFire AI (concurrent) | ~5 minutes | 95%+ |
Once both configurations have finished their first chunk of data, you can make a comparative decision roughly three times faster on the same resources, instead of waiting for each configuration to process the entire dataset in sequence. Open http://localhost:3000 to monitor live metrics and use IC Ops to stop, clone, or adjust runs in real time based on what you see.
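A back-of-the-envelope model shows why chunk cycling shortens the time to a first like-for-like comparison. All numbers below are illustrative assumptions (equal cost per chunk, two configs, one GPU), not RapidFire AI measurements:

```python
# Toy cost model: two configs, one GPU, equal cost per chunk.
full_run_minutes = 10          # assumed time for one config on the full dataset
num_chunks = 4
chunk_minutes = full_run_minutes / num_chunks

# Sequential: config A must finish its full run before config B starts,
# so the first apples-to-apples signal (both configs evaluated on the
# same data) arrives only after A's full run plus B's first chunk.
sequential_first_compare = full_run_minutes + chunk_minutes   # 12.5 min

# Chunk-cycled: A and B each train on chunk 0 back to back.
chunked_first_compare = 2 * chunk_minutes                     # 5.0 min

print(sequential_first_compare / chunked_first_compare)       # 2.5
```

Under these assumptions the first comparison arrives 2.5x earlier, and the gap widens with more configurations, since in the sequential case every earlier config's full run stands between you and the next data point.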
Benchmark: Real-world speedup
Below, we measured the time to reach a comparable overall best training loss (across all configurations tried) when switching from sequential execution to RapidFire AI's hyperparallelized experimentation.
| Scenario | Sequential time | RapidFire AI time | Speedup |
|---|---|---|---|
| 4 configs, 1 GPU | 120 minutes | 7.5 minutes | 16× |
| 8 configs, 1 GPU | 240 minutes | 12 minutes | 20× |
| 4 configs, 2 GPUs | 60 minutes | 4 minutes | 15× |
Benchmarked on NVIDIA A100 40GB GPUs with TinyLlama-1.1B and Llama-3.2-1B models.
Get started now
🚀 Try it out: Interactive Colab notebook — no setup required, runs in your browser
📚 Complete documentation: oss-docs.rapidfire.ai — complete guide, examples, API reference
💻 GitHub: RapidFireAI/rapidfireai — open source, production ready
📦 Install via PyPI: pypi.org/project/rapidfireai — pip install rapidfireai
💬 Join the community: Discord — get help, share results, request features
RapidFire AI was built because the common practice of trying one configuration at a time wastes both time and GPU cycles. This official integration lets all TRL users fine-tune and post-train smarter, iterate faster, and ship better models.
Try out the integration and let us know how much faster your experimentation loops become, and what you'd like to see next. We're just getting started, and your feedback will shape our future direction.

