Data, model, a beefy GPU setup: everything is ready. Press “run”… and wait. And wait a little longer. Your GPU barely breaks a sweat while your wallet gets lighter by the hour.
Sound familiar? We have been there. After some detective work on our nanoVLM project, we discovered that the real culprit was not our model or our hardware. It was our incredibly wasteful data pipeline.
Here’s what we found:
Idle GPU: our model was literally sitting around waiting for data to be padded and batched.
In this post, we will build an efficient pipeline in five stages. Each stage adds to or removes something from the previous one, and we comment on what works and what does not.
(Stage 0) Preparation
To make the data preparation easier to follow, we created a separate repository laser-focused on the data pipeline alone. We hope this is much easier to digest than reading code woven into the nanoVLM repository. Plus, it should help you bootstrap other data pipelines!
Repository: https://github.com/arig23498/mmdp
All you need to do is clone the repository. It contains the final data pipeline, but it is structured to showcase each step along the way.
$ git clone https://github.com/arig23498/mmdp.git
(Stage 1) Visualization of the dataset
Before we can optimize anything, we need to understand what we are working with. Our multimodal dataset contains images, text prompts, and responses.
$ uv run 01_check_dataset.py
Being familiar with your training data is essential. The script above shows a random sample every time you run it. You can copy the snippet into a notebook and run it a few times to get a feel for the data.
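If you want to poke at the data outside the script, a minimal sketch might look like the following. The dataset id, config, and column names here are placeholders; swap in whatever 01_check_dataset.py actually loads.

import random
from datasets import load_dataset

# Placeholder dataset id, config, and column names; match them to 01_check_dataset.py.
ds = load_dataset("HuggingFaceM4/the_cauldron", "vqav2", split="train")
sample = ds[random.randrange(len(ds))]

print(sample["images"][0])        # the image(s) attached to this sample
for turn in sample["texts"]:      # each turn holds a prompt and its response
    print("USER:", turn["user"])
    print("ASSISTANT:", turn["assistant"])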
(Stage 2) Naive Padding
Our first training attempts used the obvious (and very common) approach:
Tokenize every sample, find the longest sequence in each batch, and pad everything else to match.
$ uv run 02_naive_pad_dataloader.py
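In collate-function form, the naive approach looks roughly like this. It is a sketch rather than the repo's exact code: it assumes each sample is already a list of token ids and that 0 is the pad id.

import torch

def naive_pad_collate(batch, pad_token_id=0):
    # batch: list of token-id lists of varying length (assumed already tokenized)
    max_len = max(len(ids) for ids in batch)    # longest sequence in this batch
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = [pad_token_id] * (max_len - len(ids))
        input_ids.append(ids + padding)
        attention_mask.append([1] * len(ids) + [0] * len(padding))
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }

Every short sample pays for the longest one in its batch, which is exactly the grey you are about to see.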
The outcome was painful. Look at this visualization:
See all that grey? That is padding: tokens the GPU processes that contribute absolutely nothing while you pay for the compute time. We were wasting about 60% of every batch on empty tokens.
(Stage 3) Constrained Padding
The next move was simple: set a global maximum length and stick to it. If a sample is too long, just drop it.
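As a sketch (again assuming pre-tokenized samples, pad id 0, and an illustrative cap of 512 tokens), the change from the naive version is small: a hard cap plus a filter.

import torch

def constrained_pad_collate(batch, max_length=512, pad_token_id=0):
    # Drop samples longer than the global cap, then pad the rest to max_length.
    kept = [ids for ids in batch if len(ids) <= max_length]
    input_ids = [ids + [pad_token_id] * (max_length - len(ids)) for ids in kept]
    attention_mask = [[1] * len(ids) + [0] * (max_length - len(ids)) for ids in kept]
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }

The filter is what occasionally makes a sample disappear from a batch.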
You may have noticed that there is one less sample in the batch; that is the filtering at work. This helped, but we were still padding everything to the same fixed length regardless of the actual content. Better than before, but still wasteful.
(Stage 4) Packing smarter with knapsacks
Now it is time to completely rethink how we build batches. Padding is the enemy, so we need a strategy that minimizes it while maximizing the amount of real data that fits into each batch. Enter the knapsack problem, a computer science classic that is a perfect fit here.
Imagine packing a backpack for a hike. It can only hold so much weight, and you want to fit as many useful items as possible. In our case:
The backpack is a training batch with a maximum token budget (max_length). Each item is a sequence (a tokenized prompt-response pair), and its weight is its number of tokens. Our goal is to pack as many sequences as possible into each batch without exceeding the token limit, minimizing wasted space.
To test the idea, we start with a toy dataset: a list of numbers between 1 and 25, each representing a sequence length. This lets us experiment without the complexity of images or text.
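Concretely, the toy data can be as simple as a list of random integers; the sample count of 64 below is arbitrary.

import random

random.seed(0)
# Each integer stands in for the token length of one tokenized sample.
toy_lengths = [random.randint(1, 25) for _ in range(64)]
print(toy_lengths[:10])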
Switching to an iterable dataset
Most PyTorch datasets are map-style (you access them with dataset[i]). Dynamic batching, however, needs something more flexible, so we built an iterable-style dataset by subclassing torch.utils.data.IterableDataset. This lets us generate batches on the fly and handle tricks like sharding the data across multiple workers.
def _get_data_range(self):
    # Figure out which slice of the data this worker should iterate over.
    worker_info = get_worker_info()
    if worker_info is None:
        # Single-process data loading: use the full range.
        return self.start, self.end
    per_worker = int(math.ceil((self.end - self.start) / worker_info.num_workers))
    worker_id = worker_info.id
    iter_start = self.start + worker_id * per_worker
    iter_end = min(iter_start + per_worker, self.end)
    return iter_start, iter_end
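To put the worker logic in context, here is a minimal, self-contained iterable dataset over the toy lengths. The class name, constructor arguments, and DataLoader settings are illustrative rather than the repo's exact API.

import math
import random
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ToyLengthDataset(IterableDataset):
    # Yields toy "sequence lengths" between 1 and 25, sharded across workers.
    def __init__(self, start, end, seed=0):
        self.start, self.end, self.seed = start, end, seed

    def _get_data_range(self):
        # Same worker-sharding logic as above.
        worker_info = get_worker_info()
        if worker_info is None:
            return self.start, self.end
        per_worker = int(math.ceil((self.end - self.start) / worker_info.num_workers))
        iter_start = self.start + worker_info.id * per_worker
        return iter_start, min(iter_start + per_worker, self.end)

    def __iter__(self):
        # Every worker builds the same length list, then walks only its own shard.
        rng = random.Random(self.seed)
        lengths = [rng.randint(1, 25) for _ in range(self.end)]
        iter_start, iter_end = self._get_data_range()
        for idx in range(iter_start, iter_end):
            yield torch.tensor(lengths[idx])

if __name__ == "__main__":
    loader = DataLoader(ToyLengthDataset(0, 64), batch_size=None, num_workers=2)
    print(list(loader)[:5])

With num_workers=2, each worker yields only its own shard, so no item is duplicated.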
The magic of the producer-consumer pattern
Packing sequences can be slow, especially when sorting or shuffling is involved. To keep things moving, we use a producer-consumer pattern built on a Python queue.
def _producer(self, data_iter, queue, stop_signal):
    # Pack batches in a background thread and push them onto the queue.
    if self.strategy == "greedy":
        for pack in self._greedy_packing(data_iter):
            queue.put(pack)
    elif self.strategy == "binpack":
        while True:
            # Take a buffer of items, bin-pack it, and queue the resulting packs.
            buffer = list(itertools.islice(data_iter, self.buffer_size))
            if not buffer:
                break
            knapsacks = self._bin_packing(buffer)
            for pack in knapsacks:
                queue.put(pack)
    # Signal the consumer that there is nothing left to read.
    queue.put(stop_signal)
The producer thread packs batches and pushes them onto the queue, while the main thread pulls them off as needed. This overlap keeps the pipeline flowing smoothly.
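On the consumer side, __iter__ essentially starts the producer in a background thread and drains the queue until it sees the stop signal. Here is a hedged, standalone sketch of that pattern (the function name and queue_size default are made up); inside the dataset class, __iter__ does the same thing with self._producer.

import queue
import threading

def iterate_packs(producer_fn, data_iter, queue_size=8):
    # producer_fn follows the _producer signature above: (data_iter, queue, stop_signal).
    pack_queue = queue.Queue(maxsize=queue_size)   # bounded so the producer cannot run too far ahead
    stop_signal = object()                         # unique sentinel marking the end of the stream

    thread = threading.Thread(
        target=producer_fn, args=(data_iter, pack_queue, stop_signal), daemon=True
    )
    thread.start()

    while True:
        pack = pack_queue.get()    # blocks until the producer has something ready
        if pack is stop_signal:
            break
        yield pack
    thread.join()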
Greedy packing
First, try a simple greedy packing strategy.
def _greedy_packing(self, iterator):
    # Fill one pack at a time, in the order the data arrives.
    pack, pack_sum = [], 0
    for item in iterator:
        if item > self.max_length:
            # This item can never fit into a single pack, so skip it.
            continue
        if pack_sum + item <= self.max_length:
            pack.append(item)
            pack_sum += item
        else:
            yield pack
            pack = [item]
            pack_sum = item
    if pack:
        yield pack
This walks through the data in order, adding items to the current pack until it is full, then starting a new one. It is fast, but not perfect. Here is what the packs look like:
=== Strategy: greedy ===
[tensor(1), tensor(2), tensor(3), tensor(4), tensor(5), tensor(6), tensor(7), tensor(8), tensor(9), tensor(10), tensor(12), tensor(13)]
[tensor(14), tensor(15), tensor(16), tensor(17), tensor(18), tensor(19)]
[tensor(24), ...]
Notice how the later packs become sparse. Greedy packing never looks back, so earlier gaps stay unfilled.
Bin packing for a tighter fit
Let's try a smarter approach: bin packing (specifically, first-fit decreasing):
def _bin_packing(self, buffer: list[int]):
    # First-fit decreasing: longest items first, each into the first pack with room.
    buffer = sorted(buffer, reverse=True)
    knapsacks = []
    for item in buffer:
        for pack in knapsacks:
            if sum(pack) + item <= self.max_length:
                pack.append(item)
                break
        else:  # no existing pack has room, so open a new one
            knapsacks.append([item])
    return knapsacks
This sorts the sequences by length (longest first) and tries to fit each one into the first pack that has room. If none has room, it starts a new pack. The result?
=== Strategy: binpack ===
[tensor(24), tensor(23), tensor(22), tensor(22), tensor(21), tensor(10), tensor(1)]
[tensor(15), tensor(14), tensor(13), tensor(12), tensor(11), tensor(8), tensor(7), tensor(6), tensor(5), tensor(4), ...]
These packs are much tighter, with far less wasted space. It's like playing Tetris with your data and snapping the pieces together.
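One way to put a number on "tighter" is to measure how much of each padded batch would still be padding. The helper below is a hypothetical utility, not part of the repo, and the numbers are purely illustrative.

def padding_waste(packs, max_length):
    # Each pack gets padded up to max_length, so waste = padding / total padded tokens.
    used = sum(sum(pack) for pack in packs)
    return 1.0 - used / (len(packs) * max_length)

# Illustrative numbers only: two packs under a token budget of 20.
print(f"{padding_waste([[9, 7, 3], [8, 6]], max_length=20):.0%} of the batch is padding")

Running it over the greedy packs and the bin-packed packs makes the difference concrete.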
(Stage 5) Knapsacks for multimodal data
Now let's apply knapsack packing to the real multimodal dataset.
We are back to images, prompts, and responses, and we need to pack them efficiently while respecting both a token limit and an image budget. The image budget keeps the number of images balanced across packs; we want to avoid the case where one GPU has to process far more images than another.
Our new ConstantLengthDataset class handles the heavy lifting. Here is how it compares to Stage 4:
Concept | Stage 4 (toy data) | Stage 5 (multimodal data) | Function(s)
Item | Integer (sequence length) | Full sample (image, prompt, response) | -
Packing strategy | Greedy or bin packing | Greedy packing with token and image constraints | _balanced_greedy_knapsack
Producer-consumer | Producer queues packs of integers | Same pattern, but packs of full samples | _producer, __iter__
Filtering | Skip integers > max_length | Skip samples that are too long or have too many images | _make_base_iterator()
Batching | Group integers | Concatenate and align tokens/images | _pack_one_group
Output | Integers | Dict with input_ids, labels, attention_mask, images | __iter__
ConstantLengthDataset does it all:
It reads the samples (images and text), removes samples that are either too long or have too many images, packs the rest into batches using a greedy knapsack strategy that balances token counts and image counts, and pads the final batches to a fixed length, with far less padding than before.
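The balancing idea can be sketched as follows. This is not the repo's _balanced_greedy_knapsack verbatim, just an assumed shape of it: each sample carries a token count and an image count, and a pack is closed as soon as either budget would overflow. The field names are made up for the sketch.

def balanced_greedy_knapsack(samples, max_tokens, max_images):
    # samples: iterable of dicts with "num_tokens" and "num_images" fields (assumed names).
    # Greedily fill a pack while both the token budget and the image budget hold.
    pack, tokens, images = [], 0, 0
    for sample in samples:
        n_tok, n_img = sample["num_tokens"], sample["num_images"]
        if n_tok > max_tokens or n_img > max_images:
            continue  # mirrors the filtering step: this sample can never fit
        if tokens + n_tok <= max_tokens and images + n_img <= max_images:
            pack.append(sample)
            tokens, images = tokens + n_tok, images + n_img
        else:
            yield pack
            pack, tokens, images = [sample], n_tok, n_img
    if pack:
        yield pack

Each yielded pack is then concatenated and padded up to the token budget, which is where the small amount of remaining grey in the plot comes from.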
Here is the result:
Look at that! The grey (padding) is minimal, and the batches are dense with useful data. It's like a well-packed suitcase that zips up without anyone having to sit on it.
The image may look unintuitive at first glance, so let's put the two visualizations side by side with the padding stripped away.
Knapsack vs. constrained padding
Notice that the knapsack batches are more evenly filled. We also no longer lose samples from a batch due to filtering.
Conclusion
What started as a simple “why is training so slow?” investigation turned into a complete rethink of how we process multimodal data.
We went from building a data pipeline from scratch all the way to a balanced knapsack strategy inspired by the post-training data recipes in NVIDIA's frontier vision-language models paper.
Important Lessons:
Padding everything to the longest sequence is an easy first approach, but a wasteful one.
Treat batching as a packing problem.
Consider all your constraints (text length, image memory, and so on) up front, and visualize your batches to validate your approach.
Want to dig deeper? Check out the mmdp repository (https://github.com/arig23498/mmdp) and the nanoVLM project.
Happy training (and may your GPU stay busy)!