nanoVLM is the simplest way to get started with training your own Vision Language Model (VLM) using pure PyTorch. It is a lightweight toolkit that lets you launch a VLM training run on a free-tier Colab notebook.

Inspired by Andrej Karpathy's nanoGPT, we provide a similar project for the vision domain.

At its core, nanoVLM is a toolkit that helps you build and train a model that can understand both images and text, and generate text based on them. The beauty of nanoVLM lies in its simplicity: the entire codebase is intentionally minimal and readable, perfect for beginners and anyone who wants to peek under the hood of VLMs without being overwhelmed.

This blog post covers the core ideas behind the project and provides a simple way to interact with the repository. We not only go into the details of the project, but also encapsulate everything so that you can get started right away.
TL;DR

You can follow these steps to start training your own Vision Language Model with the nanoVLM toolkit:
```
git clone https://github.com/huggingface/nanovlm.git
cd nanovlm
python train.py
```
Here is a Colab notebook that helps you kick off a training run without any local setup!
What is a vision language model?
As the name suggests, a Vision Language Model (VLM) is a multimodal model that handles two modalities: vision and text. These models typically take images and text as input and generate text as output.

Generating text (output) conditioned on an understanding of images and text (input) is a powerful paradigm. It enables a wide range of applications, from image captioning and object detection to answering questions about visual content (as shown in the table below). One thing to note is that nanoVLM focuses solely on Visual Question Answering as the training objective.
| Task | Example output for an image of two cats |
| --- | --- |
| Captioning | Two cats lying on a bed with remote controls near them |
| Object detection | The objects detected in the image |
| Semantic segmentation | The segmented objects in the image |
| Visual question answering | 2 (for a question about the number of cats) |
If you want to learn more about VLMs, we highly recommend reading our recent blog post on the topic: Vision Language Models (Better, Faster, Stronger).
Working with the repository
“Talk is cheap. Show me the code.” – Linus Torvalds

In this section, we will guide you through the codebase. It helps to keep the repository open in another tab for reference while following along.

Below is the folder structure of the repository. Helper files have been removed for brevity.
```
.
├── data
│   ├── collators.py
│   ├── datasets.py
│   └── processors.py
├── generate.py
├── models
│   ├── config.py
│   ├── language_model.py
│   ├── modality_projector.py
│   ├── vision_language_model.py
│   └── vision_transformer.py
└── train.py
```
Architecture
nanoVLM is modeled after two well-known and widely used architectures. The vision backbone (models/vision_transformer.py) is a standard vision transformer, more specifically Google's SigLIP vision encoder. Our language backbone follows the Llama 3 architecture.

The vision and text modalities are aligned using a modality projection module. This module takes the image embeddings produced by the vision backbone as input and converts them into embeddings compatible with the text embeddings from the embedding layer of the language model. These embeddings are then concatenated and fed into the language decoder. The modality projection module consists of a pixel shuffle operation followed by a linear layer.
Model Architecture (Source: Author)
Pixel shuffle reduces the number of image tokens. This lowers the computational cost and speeds up training, which matters particularly for transformer-based language decoders that are sensitive to input length. The diagram below illustrates the concept.
Pixel Shuffle Visualization (Source: Author)
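To make this concrete, here is a minimal sketch of a pixel-shuffle-plus-linear modality projector in plain PyTorch. It is not the exact code from models/modality_projector.py; the class name, argument names, and the assumption of a square patch grid are all illustrative.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Illustrative sketch: pixel shuffle followed by a linear projection."""

    def __init__(self, vit_dim: int, lm_dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # After shuffling, scale**2 neighbouring patch embeddings are merged into one token.
        self.proj = nn.Linear(vit_dim * scale**2, lm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vit_dim); num_patches is assumed to form a square grid.
        b, n, d = x.shape
        side = int(n**0.5)
        s = self.scale
        x = x.view(b, side, side, d)
        # Group each s x s neighbourhood of patches and stack it along the feature dimension.
        x = x.view(b, side // s, s, side // s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // s) ** 2, d * s * s)
        return self.proj(x)  # (batch, fewer image tokens, lm_dim)
```

For instance, a 14×14 grid of 768-dimensional SigLIP patch embeddings with scale 2 would come out as 49 tokens projected to the language model's hidden size, which is roughly how the token count shrinks before the embeddings are concatenated with the text embeddings.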
All the files are extremely lightweight and well documented. We highly recommend checking them out individually for a better understanding of the implementation details (models/xxx.py).

During training, we use the following pre-trained backbone weights:

- Vision backbone: google/siglip-base-patch16-224
- Language backbone: HuggingFaceTB/SmolLM2-135M
You can also swap the backbones for other variants of SigLIP/SigLIP 2 (for the vision backbone) and SmolLM2 (for the language backbone), as sketched below.
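As a hedged example of what such a swap could look like: the sketch below assumes the VLM config exposes the backbone checkpoint names as plain string fields; the field names used here are illustrative and may not match the actual attributes in models/config.py.

```python
from models.config import VLMConfig

# Illustrative only: check models/config.py for the real attribute names.
cfg = VLMConfig(
    vit_model_type="google/siglip2-base-patch16-256",  # a SigLIP 2 vision backbone
    lm_model_type="HuggingFaceTB/SmolLM2-360M",        # a larger SmolLM2 language backbone
)
```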
Train your own VLM
Now that you are familiar with the architecture, let's shift gears and talk about how to train your own Vision Language Model using train.py.
You can kick off the training with:
python train.py
This script is a one-stop shop for the entire training pipeline, including:

- Dataset loading and preprocessing
- Model initialization
- Optimization and logging
Configuration

Before anything else, the script loads two configuration classes from models/config.py (a short sketch follows the list):

- TrainConfig: configuration parameters useful for training, such as learning rates, checkpoint paths, and more.
- VLMConfig: configuration parameters used to initialize the VLM, such as hidden dimensions, number of attention heads, and more.
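A minimal sketch of what loading these could look like, assuming they are simple dataclass-style configs that can be instantiated with their defaults (the exact file may differ):

```python
from models.config import TrainConfig, VLMConfig

train_cfg = TrainConfig()   # learning rates, checkpoint paths, batch size, ...
vlm_cfg = VLMConfig()       # hidden dimensions, number of attention heads, ...
```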
Loading data
At the heart of the data pipeline is the get_dataloaders function, which:

- Loads the datasets via Hugging Face's load_dataset API.
- Combines and shuffles multiple datasets (if provided).
- Applies a train/val split via indexing.
- Wraps them in custom datasets (VQADataset, MMStarDataset) and collators (VQACollator, MMStarCollator).

A useful flag here is data_cutoff_idx, which is handy for debugging on a small subset of the data.
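As a hedged sketch of how this might be called from the training script (the exact signature and return values of get_dataloaders, and the cutoff value, are assumptions for illustration):

```python
# Illustrative: cap the dataset for a quick debug run before building the loaders.
train_cfg.data_cutoff_idx = 1024

# get_dataloaders lives in the training script; its real signature may differ.
train_loader, val_loader, test_loader = get_dataloaders(train_cfg, vlm_cfg)
```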
Initializing the model
The model is built via the VisionLanguageModel class. If you are resuming from a checkpoint, it is as simple as:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained(model_path)
```

Otherwise, you get a freshly initialized model with optionally preloaded backbones for both vision and language.
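A hedged sketch of both paths; the constructor arguments for the fresh model (in particular the load_backbone flag) are assumptions based on the description above, not the verified API:

```python
from models.vision_language_model import VisionLanguageModel

if resuming_from_checkpoint:
    model = VisionLanguageModel.from_pretrained(model_path)
else:
    # Fresh model; optionally preload the pre-trained vision/language backbones
    # (the flag name is illustrative).
    model = VisionLanguageModel(vlm_cfg, load_backbone=True)
```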
Optimizer setup: 2 LRs
Because the modality projector (MP) is newly initialized while the backbones are pre-trained, the optimizer is split into two parameter groups, each with its own learning rate:

- A higher learning rate for the MP
- A smaller learning rate for the encoder/decoder stack (the vision and language backbones)

This balance ensures that the MP learns quickly while preserving the knowledge already present in the vision and language backbones.
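A minimal sketch of such a two-group setup; the attribute names (model.MP, model.vision_encoder, model.decoder) and the learning-rate values are illustrative assumptions, not the exact ones from train.py:

```python
import torch

param_groups = [
    # Newly initialized projector: higher learning rate.
    {"params": model.MP.parameters(), "lr": 2e-3},
    # Pre-trained backbones: lower learning rate.
    {"params": list(model.vision_encoder.parameters())
             + list(model.decoder.parameters()), "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups)
```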
Training loop
This part is pretty standard, but thoughtfully structured:
- Mixed precision is used via torch.autocast to improve performance.
- A cosine learning rate schedule with linear warmup is implemented via get_lr.
- Token throughput (tokens/second) is logged per batch for performance monitoring.

Every 250 steps (configurable), the model is evaluated on the validation and MMStar test datasets. If accuracy improves, the model is checkpointed.
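A condensed sketch of such a training step (not the literal loop from train.py): the signature of get_lr and the model's loss interface are assumed from the description above and labeled as such in the comments.

```python
import time
import torch

for step, batch in enumerate(train_loader):
    # Cosine schedule with linear warmup; get_lr is assumed here to map
    # (step, peak_lr, total_steps) -> the current learning rate for a group.
    optimizer.param_groups[0]["lr"] = get_lr(step, peak_lr_mp, total_steps)
    optimizer.param_groups[1]["lr"] = get_lr(step, peak_lr_backbone, total_steps)

    start = time.time()
    # Mixed precision forward pass via torch.autocast.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Illustrative interface: the model is assumed to return a loss when targets are given.
        loss = model(batch["input_ids"], batch["images"], targets=batch["labels"])

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Token throughput, logged per batch.
    tokens_per_second = batch["input_ids"].numel() / (time.time() - start)
```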
Logging and monitoring
When log_wandb is enabled, training statistics such as batch_loss, val_loss, accuracy, and tokens_per_second are logged to Weights & Biases for real-time tracking.

Runs are automatically named using metadata such as sample size, batch size, epoch count, learning rate, and the date, all handled by the get_run_name helper.
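For instance, a logging call of roughly this shape; the keys mirror the statistics listed above, while the project name and get_run_name's signature are illustrative assumptions:

```python
import wandb

run = wandb.init(project="nanovlm", name=get_run_name(train_cfg))
wandb.log({
    "batch_loss": batch_loss,
    "val_loss": val_loss,
    "accuracy": accuracy,
    "tokens_per_second": tokens_per_second,
})
```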
Push to the Hub
To push the trained model to the Hub so that others can find and test it, first save it locally with:
model.save_pretrained(save_path)
Then it can easily be pushed with:

model.push_to_hub("hub/id")
Perform inference on a pre-trained model
Using nanoVLM as a toolkit, we trained a model and published it to the Hub. We used google/siglip-base-patch16-224 and HuggingFaceTB/SmolLM2-135M as the backbones. The model was trained for ~6 hours on a single H100 GPU on approximately 1.7M samples of the Cauldron dataset.

This model is not intended to compete with SOTA models, but rather to make the components and training process of VLMs easier to understand.
Let's use the generate.py script to run inference on the trained model. You can launch the generation script with the following command:
python generate.py
This uses the default arguments and runs the query "What is this?" on the image assets/image.png.

You can also use the script on your own image and prompt:

python generate.py --image path/to/image.png --prompt "Your prompt here"

If you want to look at the heart of the script, it is just these lines:
```python
model = VisionLanguageModel.from_pretrained(source).to(device)
model.eval()

tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

template = f"Question: {args.prompt} Answer:"
encoded = tokenizer.batch_encode_plus([template], return_tensors="pt")
tokens = encoded["input_ids"].to(device)

img = Image.open(args.image).convert("RGB")
img_t = image_processor(img).unsqueeze(0).to(device)

print("\nInput:\n ", args.prompt, "\n\nOutputs:")
for i in range(args.generations):
    gen = model.generate(tokens, img_t, max_new_tokens=args.max_new_tokens)
    out = tokenizer.batch_decode(gen, skip_special_tokens=True)[0]
    print(f">> Generation {i+1}: {out}")
```
We create the model and set it to eval mode. We then initialize the tokenizer, which tokenizes the text prompt, and the image processor, which preprocesses the image. The next step is to process the inputs and run model.generate to produce the output text. Finally, we decode the outputs using batch_decode.
| Image | Prompt | Generation |
| --- | --- | --- |
| (two cats on a bed) | What is this? | In the picture I can see a pink bed sheet. I can see two cats lying on the bed sheet. |
| (a woman doing yoga) | What is the woman doing? | Here in the middle she is performing yoga. |
If you would rather run inference on the trained model through a UI, here is the Hugging Face Space where you can interact with the model:
Conclusion
In this blog post, we walked through what a VLM is, explored the architectural choices that power nanoVLM, and went over the training and inference workflows in detail.

By keeping the codebase lightweight and readable, nanoVLM aims to serve both as a learning tool and as a foundation you can build upon. Whether you want to understand how multimodal inputs are aligned or you want to train a VLM on your own dataset, this repository gives you a head start.

If you try it out, have questions, or want to build on it, we would love to hear from you. Happy hacking!
References

- GitHub – huggingface/nanoVLM: The simplest and fastest repository for training/fine-tuning small VLMs
- Vision Language Models (Better, Faster, Stronger)
- Vision Language Models Explained
- A Dive into Vision-Language Models
- SmolVLM: Redefining small and efficient multimodal models