The tooling for image generation datasets is well established: img2dataset is a foundational tool for preparing large datasets, and it is supplemented by various community guides, scripts, and UIs that cover smaller-scale initiatives.
Our ambition is to create equally established tooling for video generation datasets: scripts suited to creating small-scale open video datasets, while leveraging video2dataset for large-scale use cases.
“If I have seen further it is by standing on the shoulders of Giants.”
In this post, we outline the tooling we are developing to make it easier for the community to create their own datasets for fine-tuning video generation models. If you can’t wait to get started, check out the codebase.
Table of Contents
Tooling
Video generation is usually conditioned on natural-language text prompts such as “cat walking on grass, realistic style.” Beyond the prompt, there are several qualitative aspects of a video that matter for controllability and filtering:
- Watermark presence
- Motion
- Aesthetics
- NSFW content presence
Video generation models are only as good as the data they are trained on, so these aspects become crucial when curating a dataset for training or fine-tuning.
Our three-stage pipeline is inspired by works such as Stable Video Diffusion and LTX-Video and their data pipelines.
Stage 1 (acquisition)
Like video2dataset, we chose yt-dlp for downloading videos.
We provide scripts to detect scene changes and split long videos into short clips.
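The split step can be sketched as follows: given scene boundaries from a detector (e.g., PySceneDetect), build ffmpeg stream-copy commands that cut the long video into clips. The helper and output naming below are our own illustration, not the repository's actual API:

```python
def clip_commands(video_path, scene_boundaries, out_pattern="clip_{:03d}.mp4"):
    """Turn (start, end) pairs in seconds into ffmpeg stream-copy commands.

    Each command cuts one clip without re-encoding ("-c copy"), which is
    fast and lossless; run the commands with subprocess.run if desired.
    """
    commands = []
    for i, (start, end) in enumerate(scene_boundaries):
        commands.append([
            "ffmpeg", "-y",
            "-ss", str(start), "-to", str(end),  # clip window in seconds
            "-i", video_path,
            "-c", "copy",                        # no re-encode
            out_pattern.format(i),
        ])
    return commands

# Example: a long video with two detected scenes.
cmds = clip_commands("long_video.mp4", [(0.0, 12.5), (12.5, 30.0)])
```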
Stage 2 (pre-processing/filtering)
We apply filters either:
- on extracted frames
- on the whole video (e.g., predicting motion scores with OpenCV)
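OpenCV's dense optical flow (e.g., `cv2.calcOpticalFlowFarneback`) is one way to score motion. As an illustration of the idea only, here is a numpy sketch that uses mean absolute frame difference as a simple motion proxy:

```python
import numpy as np

def motion_score(frames):
    """Mean absolute pixel difference between consecutive frames.

    frames: array of shape (T, H, W) or (T, H, W, C), values in [0, 255].
    Returns 0.0 for a static clip; larger values indicate more motion.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if len(frames) < 2:
        return 0.0
    diffs = np.abs(np.diff(frames, axis=0))  # per-pixel change between frames
    return float(diffs.mean())

# A static clip scores 0; a clip whose pixels change scores higher.
static = np.zeros((4, 8, 8))
moving = np.stack([np.full((8, 8), 10.0 * t) for t in range(4)])
```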
Stage 3 (processing)
We run Florence-2 (microsoft/Florence-2-large) on the extracted frames with several Florence-2 tasks. This provides a variety of captions, object detections, and OCR results that can be used for filtering in different ways.
Other captioners can be plugged in here as well. You can also caption the entire video (with a model such as Qwen2.5-VL) instead of captioning individual frames.
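Florence-2 selects its behavior via task-prompt tokens such as `<CAPTION>` and `<OCR>`. A minimal sketch of collecting per-frame outputs into a flat record for later filtering (the record layout and helper are our own illustration, not the repository's schema):

```python
# Florence-2 task-prompt tokens (passed as the text input to the model).
FLORENCE2_TASKS = [
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OCR>",
    "<OCR_WITH_REGION>",
    "<OD>",  # object detection
]

def build_record(frame_id, outputs):
    """Collect one frame's Florence-2 outputs into a flat, filterable dict.

    outputs: dict mapping a task token to the model's parsed answer;
    missing tasks are recorded as None.
    """
    record = {"frame_id": frame_id}
    for task in FLORENCE2_TASKS:
        key = task.strip("<>").lower()  # "<OCR_WITH_REGION>" -> "ocr_with_region"
        record[key] = outputs.get(task)
    return record

rec = build_record(
    "clip_000/frame_0.png",
    {"<CAPTION>": "A toy car with rats inside it.", "<OCR>": ""},
)
```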
Filtering Examples
For the dataset behind the finetrainers/crush-smol-v0 model, we selected the Qwen2-VL captions and then filtered with pwatermark < 0.1 and aesthetic > 5.5. This highly restrictive filtering left 47 videos out of a total of 1,493.
Let’s take a look at some example frames and their pwatermark scores:
The two frames with text get pwatermark scores of 0.69 and 0.61.
“A toy car with a bunch of mice” scores 0.60, and 0.17 once the toy car is crushed.
All of these sample frames would be filtered out by pwatermark < 0.1. pwatermark is effective at detecting text and watermarks, but the score does not indicate whether the text is an overlay or something incidental like a toy car’s license plate. Our filtering required every frame’s score to be below the threshold; averaging across frames, combined with a higher threshold of approximately 0.2–0.3, would be a better strategy for pwatermark.
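To make the difference between the two strategies concrete, here is a minimal sketch (the helper and the example score lists are illustrative, not taken from the dataset):

```python
def passes_pwatermark(frame_scores, threshold=0.25, strategy="mean"):
    """Keep a clip if its aggregated pwatermark score is under the threshold.

    "max" reproduces the strict "every frame must pass" rule;
    "mean" tolerates a single noisy frame (e.g., an incidental license plate).
    """
    if strategy == "max":
        agg = max(frame_scores)
    else:
        agg = sum(frame_scores) / len(frame_scores)
    return agg < threshold

# A clip with one noisy frame:
noisy = [0.05, 0.30, 0.04]
# A clip with a persistent text overlay:
overlay = [0.69, 0.61, 0.65]
```

With averaging at 0.25, the clip with one noisy frame survives while the persistent overlay is still rejected; the strict max rule at 0.1 rejects both.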
Let’s take a look at some example frames and their aesthetic scores:
The pink castle initially scores 5.50, dropping to 4.44 when crushed.
The action figure scores 4.99, dropping to 4.87 and then 4.84 as it is crushed.
The glass shards score a low 4.04.
Our filtering required every frame’s aesthetic score to exceed the threshold. In this case, using only the first frame’s aesthetic score might be a more effective strategy.
Reviewing finetrainers/crush-smol, many of the objects being crushed are round or rectangular and colorful, consistent with what we see in the example frames. Aesthetic scores are useful, but an extreme threshold such as > 5.5 can introduce biases and exclude good data. The score may be more effective as a filter against low-quality content, with a minimum threshold of about 4.25–4.5.
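Using the example scores above, a first-frame strategy with a minimum threshold in the 4.25–4.5 range keeps the castle and action-figure clips while still rejecting the low-scoring glass shards, whereas the strict all-frames > 5.5 rule rejects everything. The helper below is our own sketch, not the repository's code:

```python
def keep_clip(frame_scores, min_aesthetic=4.25, strategy="first"):
    """Aesthetic filter: higher is better, so clips must clear a floor.

    "all" requires every frame to exceed the floor (the strict rule);
    "first" judges the clip by its first frame only.
    """
    if strategy == "all":
        return min(frame_scores) > min_aesthetic
    return frame_scores[0] > min_aesthetic

castle = [5.50, 4.44]        # pink castle, before and after crushing
figure = [4.99, 4.87, 4.84]  # action figure frames
shards = [4.04]              # crushed glass shards
```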
OCR/Caption
Here we provide a visual example of the different Florence-2 captions and OCR outputs.
- Caption: A toy car with a lot of rats inside it.
- Detailed caption: The image shows a blue toy car with three white mice sitting behind it, driving on a road with a green wall in the background.
- OCR and region labels: shown with bounding boxes in the original figure.
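These OCR outputs can double as a text/watermark filter: for example, drop a clip if any sampled frame contains recognized text. A minimal sketch (the helper is our own illustration):

```python
def has_text(ocr_results):
    """True if any frame's OCR string contains non-whitespace text.

    ocr_results: list of OCR strings, one per sampled frame.
    """
    return any(r.strip() for r in ocr_results)

clean_clip = ["", "  ", ""]            # no recognized text
texty_clip = ["", "SUBSCRIBE", ""]     # one frame with an overlay
```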
Using this tooling 👨‍🍳
We used this tooling to create various datasets for generating cool video effects, similar to Pika Effects.
We then used these datasets to fine-tune the CogVideoX-5B model with finetrainers. Below is an example output from finetrainers/crush-smol-v0.
Your turn
We hope this tooling gives you a head start for creating small, high-quality video datasets for your own custom applications. Keep an eye on the repository as we continue to add more useful filters. Your contributions are welcome, too!
Thanks to Pedro Cuenca for his extensive reviews of this post.