The tooling for image generation datasets is well established: img2dataset is a foundational tool for preparing large datasets, and it is supplemented by various community guides, scripts, and UIs that cover smaller-scale initiatives.
Our ambition is to create equally established tooling for video generation datasets: scripts suited to creating small-scale open video datasets, while leveraging video2dataset for large-scale use cases.
“If I have seen further it is by standing on the shoulders of Giants.”
In this post, we outline the tooling we are developing to make it easier for the community to create their own datasets for fine-tuning video generation models. If you can’t wait to get started, check out the codebase.
Table of Contents
Tooling
Video generation is usually conditioned on natural-language text prompts such as “cat walking on grass, realistic style.” Beyond the prompt, there are several qualitative aspects of a video that matter for controllability and filtering:
- Watermark presence
- Motion
- Aesthetics
- NSFW content presence
Video generation models are only as good as the data they are trained on, so these aspects become crucial when curating a dataset for training or fine-tuning.
Our three-stage pipeline is inspired by works such as Stable Video Diffusion and LTX-Video and their data pipelines.
Stage 1 (acquisition)
Like video2dataset, we chose yt-dlp for downloading videos.
We provide scripts to detect scene changes and split long videos into short clips.
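The split step can be sketched as follows: given scene boundaries from a detector (e.g., PySceneDetect), build ffmpeg stream-copy commands that cut the long video into clips. The helper and output naming below are our own illustration, not the repository's actual API:

```python
def clip_commands(video_path, scene_boundaries, out_pattern="clip_{:03d}.mp4"):
    """Turn (start, end) pairs in seconds into ffmpeg stream-copy commands.

    Each command cuts one clip without re-encoding ("-c copy"), which is
    fast and lossless; run the commands with subprocess.run if desired.
    """
    commands = []
    for i, (start, end) in enumerate(scene_boundaries):
        commands.append([
            "ffmpeg", "-y",
            "-ss", str(start), "-to", str(end),  # clip window in seconds
            "-i", video_path,
            "-c", "copy",                        # no re-encode
            out_pattern.format(i),
        ])
    return commands

# Example: a long video with two detected scenes.
cmds = clip_commands("long_video.mp4", [(0.0, 12.5), (12.5, 30.0)])
```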
Stage 2 (pre-processing/filtering)
We apply filters either:
- on extracted frames
- on the whole video (e.g., predicting motion scores with OpenCV)
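OpenCV's dense optical flow (e.g., `cv2.calcOpticalFlowFarneback`) is one way to score motion. As an illustration of the idea only, here is a numpy sketch that uses mean absolute frame difference as a simple motion proxy:

```python
import numpy as np

def motion_score(frames):
    """Mean absolute pixel difference between consecutive frames.

    frames: array of shape (T, H, W) or (T, H, W, C), values in [0, 255].
    Returns 0.0 for a static clip; larger values indicate more motion.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if len(frames) < 2:
        return 0.0
    diffs = np.abs(np.diff(frames, axis=0))  # per-pixel change between frames
    return float(diffs.mean())

# A static clip scores 0; a clip whose pixels change scores higher.
static = np.zeros((4, 8, 8))
moving = np.stack([np.full((8, 8), 10.0 * t) for t in range(4)])
```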
Stage 3 (processing)
We run Florence-2 (microsoft/Florence-2-large) on the extracted frames with several Florence-2 tasks. This provides a variety of captions, object detections, and OCR results that can be used for filtering in different ways.
Other captioners can be plugged in here as well. You can also caption the entire video (with a model such as Qwen2.5-VL) instead of captioning individual frames.
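Florence-2 selects its behavior via task-prompt tokens such as `<CAPTION>` and `<OCR>`. A minimal sketch of collecting per-frame outputs into a flat record for later filtering (the record layout and helper are our own illustration, not the repository's schema):

```python
# Florence-2 task-prompt tokens (passed as the text input to the model).
FLORENCE2_TASKS = [
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OCR>",
    "<OCR_WITH_REGION>",
    "<OD>",  # object detection
]

def build_record(frame_id, outputs):
    """Collect one frame's Florence-2 outputs into a flat, filterable dict.

    outputs: dict mapping a task token to the model's parsed answer;
    missing tasks are recorded as None.
    """
    record = {"frame_id": frame_id}
    for task in FLORENCE2_TASKS:
        key = task.strip("<>").lower()  # "<OCR_WITH_REGION>" -> "ocr_with_region"
        record[key] = outputs.get(task)
    return record

rec = build_record(
    "clip_000/frame_0.png",
    {"<CAPTION>": "A toy car with rats inside it.", "<OCR>": ""},
)
```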
Filtering Examples
For the dataset behind the finetrainers/crush-smol-v0 model, we selected the Qwen2-VL captions and then filtered with pwatermark < 0.1 and aesthetic > 5.5. This highly restrictive filtering left 47 videos out of a total of 1,493.
Let’s take a look at some example frames and their pwatermark scores:
The two frames with text get pwatermark scores of 0.69 and 0.61.
“A toy car with a bunch of mice” scores 0.60, and 0.17 once the toy car is crushed.
All of these sample frames would be filtered out by pwatermark < 0.1. pwatermark is effective at detecting text and watermarks, but the score does not indicate whether the text is an overlay or something incidental like a toy car’s license plate. Our filtering required every frame’s score to be below the threshold; averaging across frames, combined with a higher threshold of approximately 0.2–0.3, would be a better strategy for pwatermark.
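To make the difference between the two strategies concrete, here is a minimal sketch (the helper and the example score lists are illustrative, not taken from the dataset):

```python
def passes_pwatermark(frame_scores, threshold=0.25, strategy="mean"):
    """Keep a clip if its aggregated pwatermark score is under the threshold.

    "max" reproduces the strict "every frame must pass" rule;
    "mean" tolerates a single noisy frame (e.g., an incidental license plate).
    """
    if strategy == "max":
        agg = max(frame_scores)
    else:
        agg = sum(frame_scores) / len(frame_scores)
    return agg < threshold

# A clip with one noisy frame:
noisy = [0.05, 0.30, 0.04]
# A clip with a persistent text overlay:
overlay = [0.69, 0.61, 0.65]
```

With averaging at 0.25, the clip with one noisy frame survives while the persistent overlay is still rejected; the strict max rule at 0.1 rejects both.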
Let’s take a look at some example frames and their aesthetic scores:
The pink castle initially scores 5.50, dropping to 4.44 when crushed.
The action figure scores 4.99, dropping to 4.87 and then 4.84 as it is crushed.
The glass shards score a low 4.04.
Our filtering required every frame’s aesthetic score to exceed the threshold. In this case, using only the first frame’s aesthetic score might be a more effective strategy.
Reviewing finetrainers/crush-smol, many of the objects being crushed are round or rectangular and colorful, consistent with what we see in the example frames. Aesthetic scores are useful, but an extreme threshold such as > 5.5 can introduce biases and exclude good data. The score may be more effective as a filter against low-quality content, with a minimum threshold of about 4.25–4.5.
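Using the example scores above, a first-frame strategy with a minimum threshold in the 4.25–4.5 range keeps the castle and action-figure clips while still rejecting the low-scoring glass shards, whereas the strict all-frames > 5.5 rule rejects everything. The helper below is our own sketch, not the repository's code:

```python
def keep_clip(frame_scores, min_aesthetic=4.25, strategy="first"):
    """Aesthetic filter: higher is better, so clips must clear a floor.

    "all" requires every frame to exceed the floor (the strict rule);
    "first" judges the clip by its first frame only.
    """
    if strategy == "all":
        return min(frame_scores) > min_aesthetic
    return frame_scores[0] > min_aesthetic

castle = [5.50, 4.44]        # pink castle, before and after crushing
figure = [4.99, 4.87, 4.84]  # action figure frames
shards = [4.04]              # crushed glass shards
```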
OCR/Caption
Here we provide a visual example of the different Florence-2 captions and OCR outputs.
- Caption: A toy car with a lot of rats inside it.
- Detailed caption: The image shows a blue toy car with three white mice sitting behind it, driving on a road with a green wall in the background.
- OCR and region labels: shown with bounding boxes in the original figure.
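These OCR outputs can double as a text/watermark filter: for example, drop a clip if any sampled frame contains recognized text. A minimal sketch (the helper is our own illustration):

```python
def has_text(ocr_results):
    """True if any frame's OCR string contains non-whitespace text.

    ocr_results: list of OCR strings, one per sampled frame.
    """
    return any(r.strip() for r in ocr_results)

clean_clip = ["", "  ", ""]            # no recognized text
texty_clip = ["", "SUBSCRIBE", ""]     # one frame with an overlay
```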
Using this tooling 👨‍🍳
We used this tooling to create various datasets for generating cool video effects, similar to Pika Effects.
We then used these datasets to fine-tune the CogVideoX-5B model with finetrainers. Below is an example output from finetrainers/crush-smol-v0.
Your turn
We hope this tooling gives you a head start for creating small, high-quality video datasets for your own custom applications. Keep an eye on the repository as we continue to add more useful filters. Your contributions are welcome, too!
Thanks to Pedro Cuenca for his extensive reviews of this post.