We gave Claude the ability to fine-tune language models using a new tool called Hugging Face Skills. Beyond writing training scripts, Claude actually submits jobs to cloud GPUs, monitors progress, and pushes completed models to the Hugging Face Hub. This tutorial explains how it works and how you can use it yourself.
Claude Code can use “skills”, which are packaged instructions, scripts, and domain knowledge for performing specialized tasks. The hf-llm-trainer skill teaches Claude everything it needs to know about training: which GPU to choose for a given model size, how to configure Hub authentication, when to use LoRA versus full fine-tuning, and how to handle the many other decisions required for a successful training run.
With this skill, you can tell Claude things like:
Fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset
And Claude will:
- Validate the dataset format
- Select the appropriate hardware (t4-small for a 0.6B model)
- Generate a training script with Trackio monitoring
- Submit the job to Hugging Face Jobs
- Report the job ID and estimated cost
- Check progress on request
- Help debug if something goes wrong
The model trains on a Hugging Face GPU while you do other things. Once training completes, your fine-tuned model appears on the Hub, ready to use.
This is not a toy demo. The skill supports the same training methods used in production: supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards. You can train models from 0.5B to 70B parameters, convert them to GGUF for local deployment, and run multi-stage pipelines that combine different techniques.
Setup and installation
Before you begin, you’ll need the following:
- A Hugging Face account on a Pro or Team plan (Jobs requires a paid plan)
- A write access token from huggingface.co/settings/tokens
- A coding agent such as Claude Code, OpenAI Codex, or Google’s Gemini CLI
The Hugging Face skill works with Claude Code, Codex, and Gemini CLI. Cursor, Windsurf, and Continue integration is in progress.
Claude Code
Register the repository as a plugin marketplace:

/plugin marketplace add huggingface/skills

Then install a skill:

/plugin install <skill-name>@huggingface-skills
For example:
/plugin install hf-llm-trainer@huggingface-skills
Codex
Codex discovers skills through the AGENTS.md file. You can check that the instructions are loaded with:

codex --ask-for-approval never "Summarize the current instructions."

For more information, see the Codex AGENTS guide.
Gemini CLI
This repository contains gemini-extension.json for integrating with Gemini CLI.
Install from a local checkout:

gemini extensions install . --consent
Or install from the GitHub URL:

gemini extensions install https://github.com/huggingface/skills.git --consent

For more information, see the Gemini CLI extensions documentation.
Connect to Hugging Face
You must authenticate with your Hugging Face account using a write access token so that the job can create model repositories.
Set the token:

hf auth login

Or export it as an environment variable:

export HF_TOKEN=hf_your_write_access_token_here
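Before submitting anything, it can be worth sanity-checking that the token is actually visible to your environment. A minimal sketch (the helper name is ours; it only checks that the token exists and carries the standard "hf_" prefix, not that it has write scope):

```python
import os

def get_write_token() -> str:
    """Read the Hugging Face token from the environment and sanity-check it.

    Hugging Face user tokens start with the "hf_" prefix; this does not
    verify the token is valid or has write scope, only that it is present.
    """
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN is missing or does not look like a Hugging Face token")
    return token
```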
Note: if you use the Hugging Face MCP server, configure it with a write token, passed via the HF_TOKEN environment variable or an Authorization: Bearer HTTP header.
First training run
Let’s walk through a complete example: fine-tune a small model to see the full workflow, then explore more advanced features.
Ask the coding agent to fine-tune a model
Start with a simple, clear instruction for fine-tuning your specific model:
Fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset.
The coding agent analyzes the request and prepares the training configuration. For a 0.6B model on this dataset, it chooses t4-small, the cheapest available GPU that is sufficient for this model size.
The open-r1/codeforces-cots dataset contains Codeforces problems and solutions, which makes it a great dataset for tuning models to solve hard coding problems.
Confirm before submitting
Before the coding agent submits anything, you will see the proposed configuration:
Fine-tune Qwen/Qwen3-0.6B on open-r1/codeforces-cots using SFT.
Configuration:
- Hardware: t4-small (~$0.75/hour)
- Estimated time: ~20 minutes
- Estimated cost: ~$0.30
- Output: username/qwen-codeforces-cots-sft
The model will be pushed to the Hub automatically. Should I submit?
This is your chance to adjust things: change the output repository name, choose different hardware, or ask Claude to change the training parameters. Once you approve, the agent submits the job.
For example, you can ask the agent for a test run first:
Run a simple test with 100 examples.
Track your progress
After submission, the agent reports the job details:
✅ Job submitted successfully.
Job ID: abc123xyz
Monitor: https://huggingface.co/jobs/username/abc123xyz
Estimated time: ~20 minutes
Estimated cost: ~$0.30
View real-time metrics: https://huggingface.co/spaces/username/trackio
This skill includes Trackio integration, so you can watch your training loss decrease in real time. The job runs asynchronously, so you can close the terminal and come back later. If you need an update:
How is my training going?
The agent then retrieves the logs and summarizes the progress.
Use the model
Once training is complete, the model is on the Hub:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("username/qwen-codeforces-cots-sft")
tokenizer = AutoTokenizer.from_pretrained("username/qwen-codeforces-cots-sft")
That’s the complete loop. You explained what you wanted in plain English, and the agent handled GPU selection, script generation, authentication, job submission, and pushing the result to the Hub. The whole thing cost about 30 cents.
Training methods
This skill supports three training approaches. Understanding when to use each will give you better results.
Supervised fine-tuning (SFT)
Most projects start with SFT. You provide demonstration data (examples of inputs and desired outputs), and training adjusts the model to match those patterns.
Use SFT when you have high-quality examples of the behavior you want: customer support conversations, code generation pairs, domain-specific Q&A, anything that demonstrates what good output looks like.
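Such demonstrations are usually stored as chat-style records. A minimal sketch of the conversational "messages" layout that most SFT tooling expects (the example content and the validator helper are our own):

```python
# A minimal conversational SFT record: a list of role/content turns.
example = {
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Account > Reset password."},
    ]
}

def looks_like_sft_record(record: dict) -> bool:
    """Cheap structural check: a non-empty list of dicts with role and content."""
    turns = record.get("messages")
    return isinstance(turns, list) and len(turns) > 0 and all(
        isinstance(t, dict) and {"role", "content"} <= set(t) for t in turns
    )
```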
Fine-tune a 7B model on my-org/support-conversations for 3 epochs.
The agent validates the dataset, selects the hardware (a10g-large with LoRA for a 7B model), and configures training with checkpointing and monitoring.
For models larger than 3B parameters, the agent automatically uses LoRA (low-rank adaptation) to reduce memory requirements. This lets you train 7B or 13B models on a single GPU while retaining most of the quality of full fine-tuning.
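To see why LoRA shrinks memory requirements, compare trainable parameter counts. For a weight matrix of shape d_out × d_in, LoRA trains only two low-rank factors instead of the full matrix; a rough stdlib sketch (the 4096 × 4096 layer shape and rank 16 are illustrative, not the skill's actual defaults):

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    # Full fine-tuning updates every entry of the weight matrix.
    return d_in * d_out

def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA trains two low-rank factors: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

# A 4096 x 4096 attention projection at rank 16:
full = full_trainable_params(4096, 4096)      # 16,777,216 parameters
lora = lora_trainable_params(4096, 4096, 16)  # 131,072 parameters (~0.8%)
```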
Direct Preference Optimization (DPO)
DPO trains on preference pairs: responses where one is “chosen” and the other is “rejected”. This aligns the model’s output with human preferences, usually after an initial SFT stage.
Use DPO when you have human-annotated preferences or automated comparisons. DPO optimizes for the preferred response directly, without needing a separate reward model.
Run DPO on my-org/preference-data to align the SFT model we just trained. The dataset has a “chosen” column and a “rejected” column.
DPO is sensitive to dataset format. It needs columns named exactly chosen and rejected, plus a prompt column for the input. The agent validates this first and shows how to map columns if the dataset uses different names.
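Concretely, a DPO-ready record might look like this (the content is made up for illustration):

```python
# One preference pair in the column layout DPO training expects.
dpo_example = {
    "prompt": "Summarize this support ticket in one sentence.",
    "chosen": "Customer cannot log in after yesterday's password reset.",
    "rejected": "The customer sent a message about something.",
}

# The exact column names matter; a quick structural check:
assert {"prompt", "chosen", "rejected"} <= set(dpo_example)
```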
Group Relative Policy Optimization (GRPO)
GRPO is a reinforcement learning technique that has proven effective for verifiable tasks: solving math problems, writing code, or any task with programmatic success criteria.
Train a math reasoning model with GRPO on the openai/gsm8k dataset, starting from Qwen3-0.6B.
The model generates responses, receives rewards based on correctness, and learns from the results. It is more complex than SFT or DPO, but the configuration looks similar.
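What “rewards based on correctness” means in practice is a small scoring function. For GSM8K-style math answers, a minimal sketch (the function name and the last-number matching rule are our own; real reward functions are usually stricter about answer extraction):

```python
import re

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the last number in the completion equals the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0
```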
Hardware and costs
The agent chooses hardware based on model size, but understanding the tradeoffs helps you make better decisions.
Model size and GPU mapping
For small models (under 1B parameters), t4-small works well. These models train quickly; expect to pay $1-2 for a complete run. This is perfect for learning and experimenting.
For small-to-mid models (1-3B), step up to t4-medium or a10g-small. Training takes several hours and costs between $5 and $15.
For medium models (3-7B), you need a10g-large or a100-large with LoRA. Full fine-tuning is not practical at this size, but LoRA makes these models quite trainable. Budget $15 to $40 for a production run.
For large models (above 7B), full fine-tuning with these single-GPU jobs is not a good fit.
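These figures translate into a quick back-of-the-envelope estimate: hourly rate times runtime. A sketch using the t4-small rate quoted earlier (the rate table below is illustrative, taken from figures in this post, not an official price list):

```python
# Approximate hourly rates, taken from the figures quoted in this post.
RATES_USD_PER_HOUR = {
    "t4-small": 0.75,
}

def estimated_cost(flavor: str, minutes: float) -> float:
    """Rough job cost: hourly rate times runtime, rounded to cents."""
    return round(RATES_USD_PER_HOUR[flavor] * minutes / 60, 2)

# The ~20-minute Qwen3-0.6B run from earlier:
cost = estimated_cost("t4-small", 20)  # about $0.25
```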
Demo versus production runs
When testing your workflow, start small.
Run a quick SFT test of Qwen3-0.6B using 100 examples from my-org/support-conversations.
The coding agent configures a minimal training run, enough to validate that the pipeline works without incurring real cost.
For a production run, be explicit:
Run a production SFT of Qwen3-0.6B on the full my-org/support-conversations dataset: 3 epochs, a cosine learning rate schedule, and checkpoints every 500 steps.
Always run a demo before committing to a multi-hour production job. A $0.50 demo that catches a formatting error saves you a $30 failed run.
Validating the dataset
Dataset format is the most common cause of training failures. The agent can validate the dataset before spending GPU time.
Check if my-org/conversation-data works with SFT training.
The agent runs a quick check on CPU (a fraction of a cent) and reports:
Dataset validation for my-org/conversation-data:
SFT: ✓ READY (conversational 'messages' column found)
DPO: ✗ INCOMPATIBLE ('chosen' and 'rejected' columns are missing)
If a dataset needs to be transformed, the agent shows you how:
My DPO dataset uses “good_response” and “bad_response” instead of “chosen” and “rejected”. How can I fix this?
The agent provides mapping code that you can incorporate directly into your training scripts.
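As a sketch of what that mapping code might look like (a pure-Python version operating on one record; with the datasets library you would typically reach for Dataset.rename_column instead):

```python
# Map nonstandard preference column names onto the ones DPO expects.
COLUMN_MAP = {"good_response": "chosen", "bad_response": "rejected"}

def remap_columns(record: dict) -> dict:
    """Rename known columns, leaving everything else untouched."""
    return {COLUMN_MAP.get(key, key): value for key, value in record.items()}

row = {"prompt": "Hi", "good_response": "Hello!", "bad_response": "Hey."}
fixed = remap_columns(row)  # {'prompt': 'Hi', 'chosen': 'Hello!', 'rejected': 'Hey.'}
```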
Monitoring training
Real-time monitoring allows you to catch problems early. This skill configures Trackio by default. After submitting your job, you can monitor metrics at:
https://huggingface.co/spaces/username/trackio
This shows training loss, learning rate, and validation metrics. In a healthy run, you will see the loss decrease steadily.
Ask your agent about the status at any time.
What is the status of my training job?

Job abc123xyz is running (45 minutes elapsed)
Current step: 850/1200
Training loss: 1.23 (down from 2.41 at start)
Learning rate: 1.2e-5
Estimated completion: ~20 minutes
If something goes wrong, the agent can help diagnose it. Out of memory? It suggests reducing the batch size or upgrading the hardware. Dataset error? It identifies the mismatch. Timeout? It recommends a longer job duration or a faster training configuration.
Conversion to GGUF
After training, you can run the model locally. The GGUF format works with llama.cpp and compatible tools such as LM Studio and Ollama.
Convert the fine-tuned model to GGUF using Q4_K_M quantization. Push to username/my-model-gguf.
The agent submits a conversion job that merges the LoRA adapters, converts the model to GGUF, applies quantization, and pushes the result to the Hub.
Then use it locally:

llama-server -hf <username>/<model>:<quant>

For example:

llama-server -hf unsloth/Qwen3-1.7B-GGUF:Q4_K_M
What’s next
We showed that coding agents such as Claude Code, Codex, and Gemini CLI can handle the entire model fine-tuning lifecycle: data validation, hardware selection, script generation, job submission, progress monitoring, and output conversion. What used to require specialized expertise is now conversational.
Things to try:
- Fine-tune a model on your own dataset
- Align a model to your preferences with SFT → DPO
- Train a reasoning model with GRPO on math or code
- Convert a model to GGUF and run it with Ollama
The skills are open source. You can extend them, customize them to fit your workflow, or use them as a starting point for other training scenarios.
Resources