Dataset recording, VLA fine-tuning, and on-device optimization

By versatileai | March 6, 2026

Authors: Enzo Ruedas, Tess Boivin

Recent advances in large language models have enabled the transition from text-only inference to multimodal systems: first the integration of visual perception in vision-language models (VLMs) and, more recently, the generation of robot actions in vision-language-action (VLA) models. Deploying these models on embedded robotic platforms remains challenging, however, due to severe compute, memory, and power constraints, as well as real-time control requirements.

In a synchronous control pipeline, the arm sits idle waiting for commands while the VLA performs inference, which produces oscillatory behavior and delayed corrections. Asynchronous inference addresses this by decoupling action generation from execution, allowing smooth, continuous operation. To be effective, however, the end-to-end inference latency must be less than the time it takes to execute a chunk of actions. This time constraint sets an upper bound on the model's acceptable inference latency.

Deploying VLA models on embedded platforms is not merely a model compression problem but a systems engineering problem that requires architectural decomposition, delay-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential for translating recent advances in multimodal foundation models into practical, deployable embedded robotic systems.

This guide introduces NXP's practical best practices for reliably recording robot datasets and fine-tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance achieved on the NXP i.MX95 after optimization.

🎥 Recording datasets: What actually matters?

High-quality, consistent data beats large volumes of messy data. In this section, we translate hard-earned lessons into concrete checklists.

In our case, we recorded a dataset for the task “Put a tea bag into a mug”.

1) Consistency comes first.

  • Fixed cameras: Use fixed mounts to avoid pose drift. If one or more cameras move during recording or evaluation, whether from robot vibration or from the operator resetting the environment, accuracy can degrade significantly.
  • Controlled lighting: Record in an environment where you control the lighting as much as possible, with a fixed light source and away from sunlight, which changes over the day.
  • Strong contrast: Avoid "white-on-white" scenes unless low contrast is your intended deployment domain. Maximize the contrast between the arm, the objects, and the background.
  • Fixed calibration: Back up your robot and teleoperator calibrations so you don't have to re-record earlier episodes if your code crashes.
  • Don't cheat: Don't use information the model won't have access to at inference time. During data recording, operators are tempted to rely on direct visual observation of the scene, but this introduces information that is not present in the dataset. Data collection should be limited to the same camera inputs available to the policy at runtime.
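The consistency rules above can be enforced mechanically. Below is a hypothetical pre-recording sanity check (not part of any official tooling; the metadata schema and expected values are our assumptions, with resolution and frame rate taken from the setup described later in this guide) that rejects an episode whose camera set, resolution, or frame rate drifted:

```python
# Hypothetical consistency gate for newly recorded episodes.
# EXPECTED mirrors the setup used in this guide (3 cameras, 640x480 @ 30 fps);
# adapt it to your own rig.
EXPECTED = {"cameras": {"top", "gripper", "left"}, "resolution": (640, 480), "fps": 30}

def check_episode(meta: dict) -> list[str]:
    """Return a list of consistency problems; an empty list means the episode is usable."""
    problems = []
    if set(meta["cameras"]) != EXPECTED["cameras"]:
        problems.append(f"camera set changed: {sorted(meta['cameras'])}")
    if tuple(meta["resolution"]) != EXPECTED["resolution"]:
        problems.append(f"resolution changed: {meta['resolution']}")
    if meta["fps"] != EXPECTED["fps"]:
        problems.append(f"fps changed: {meta['fps']}")
    return problems
```

Running this after every episode catches a bumped camera or a changed capture setting immediately, instead of after hours of recording.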

2) Use a gripper camera (highly recommended)

Moving from a scene-only view to a mixed perspective improves overall accuracy, but latency grows with each added camera, so a suitable compromise must be chosen. In our case, three cameras struck the right balance.

  • Top: global perspective of the entire scene.
  • Gripper: closest view, for precise grasping and alignment.
  • Left: complements the top view with height and depth cues.

We highly recommend using a gripper-mounted camera. It consistently improves the success rate of fine-grained manipulation tasks by providing a close, task-centered perspective. Importantly, it is also the camera that most effectively enforces correct data-collection practice, since it lets the operator rely solely on the robot's perception rather than on direct observation of the scene.

When installing a gripper camera, we recommend securing the cables with Velcro or strain relief guides to prevent them from obstructing the view or coming loose during operation.

3) Improved grasping


Simple hardware adjustments, such as adding heat-shrink tubing to the gripper claws, increase friction and reduce slippage during episodes, which improves task success rates (fewer "nearly successful" episodes) and stabilizes policy learning.

4) Diversity and dataset splits


When recording a dataset, you must:

Vary the distribution of episodes: Divide your workspace into clusters of starting locations and record at least 10 episodes per cluster. Add variety by changing the position and rotation of objects.

For example, we divided the reachable workspace of the robot arm into 11 clusters, each with dimensions of 10 × 10 cm.

Distinguish between training and validation sets: The policy can easily overfit the training set, so make sure the validation set is not known to the model.

For example, we removed cluster 6 from the training set.

Record as many movements as possible: small VLA models generalize poorly to unseen motions, so record episodes that cover a wide range of degrees of freedom.

For example, we grabbed the tea bag in both horizontal and vertical orientations.

Anticipate failure: In some cases, a policy may not reach the object on the first attempt and must "go back". We found that devoting roughly 20% of all episodes to such recovery cases improves the model's overall success rate.

For example, approximately 20% of our training set consists of recovery episodes.

This reflects best practices across VLA papers and community guides. Here are three examples of data diversity within the same cluster.

Starting position 1 | Starting position 2 | Recovery episode

Starting positions 1 and 2 correspond to different object placements within the same cluster. In a recovery episode, by contrast, the robot does not start from its home position: it is already near the mug and must grasp the tea bag directly from that location.
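The cluster-based recording and split strategy above can be sketched in a few lines. The cluster count, held-out cluster, and 10 × 10 cm cell size follow this guide; the episode representation itself is a simplified assumption:

```python
import random

# Sketch of the cluster-based train/validation split described above.
# 11 clusters of 10 x 10 cm, with cluster 6 held out for validation,
# as in this guide; the episode dicts are an illustrative assumption.
NUM_CLUSTERS = 11
VALIDATION_CLUSTERS = {6}

def split_episodes(episodes):
    """episodes: list of dicts with a 'cluster' key. Returns (train, val)."""
    train = [e for e in episodes if e["cluster"] not in VALIDATION_CLUSTERS]
    val = [e for e in episodes if e["cluster"] in VALIDATION_CLUSTERS]
    return train, val

def sample_start_position(cluster_origin, size_cm=10.0):
    """Draw a random object start position inside one 10 x 10 cm cluster."""
    x0, y0 = cluster_origin
    return (x0 + random.uniform(0, size_cm), y0 + random.uniform(0, size_cm))
```

Keeping the split at the cluster level (rather than shuffling individual episodes) is what guarantees the validation positions are genuinely unseen during training.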

🎛️ Fine-tuning VLA


What we actually did:

  • Task: "Grab the tea bag and place it in the mug."
  • Dataset: 120 episodes: 10 clusters × (10 different tea bag starting positions + 2 recovery episodes)
  • 3 cameras (640 × 480 px, 30 fps): top, gripper, left
  • Cluster n°6 removed for validation
  • Batch size: 8
  • Training: selected the checkpoint with the lowest validation loss over 200k steps

Across both training and validation sets, the best tradeoff between accuracy, generalization, and motion smoothness for ACT (100 actions per chunk) was found between 100k and 160k training steps. For SmolVLA (50 actions per chunk), the tradeoff appears after more training steps. We found that continuing training slightly beyond the point where the model begins to overfit tends to improve overall accuracy.

Rule of thumb: select the final checkpoint by evaluating task success on both training and validation sets, not by training loss.
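That rule of thumb is easy to automate once rollout success rates are logged per checkpoint. A minimal sketch (the checkpoint records and scoring rule are our assumptions, not the authors' exact procedure):

```python
# Pick the checkpoint with the best combined rollout success on train and
# validation episodes, breaking ties toward the higher validation score.
# The dict layout is a hypothetical logging format.
def select_checkpoint(checkpoints):
    """checkpoints: list of dicts with 'step', 'train_success', 'val_success'."""
    return max(
        checkpoints,
        key=lambda c: (c["train_success"] + c["val_success"], c["val_success"]),
    )
```

With scores like those discussed above, this would prefer a mid-training checkpoint that generalizes well over a later one that only fits the training clusters.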

⚡ Optimized for NXP i.MX95

The i.MX95 integrates six Arm Cortex-A55 cores, Cortex-M7/M33 cores, a Mali GPU, a new ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi-camera support and rich I/O. (nxp.com)

1) Divide and conquer

Instead of running the model as one monolithic graph, we decompose the VLA into logical stages: encoder, decoder, and action expert. Each component can then be optimized, scheduled, and deployed independently.

In practice, SmolVLA is divided into the following subblocks:

  • Vision encoder: processes RGB camera frames and produces visual embeddings.
  • LLM backbone: generates action tokens from the visual and textual embeddings.
  • Action expert: applies flow matching to iteratively denoise action samples and outputs the final control commands.

This separation allows block-by-block optimization: you can measure the impact of quantization on each block and choose the best tradeoff between latency and accuracy. Separating the action expert from the VLM also makes it possible to run the two at different rates.
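The structure of that decomposition can be sketched as follows. The stage functions below are stand-ins for the real vision / LLM / action-expert graphs (which we don't have access to); the point is that each stage is a separate callable that can be profiled, quantized, or scheduled independently:

```python
import time

# Stand-ins for the three SmolVLA sub-blocks; in a real deployment each would
# wrap its own compiled/quantized graph.
def vision_encoder(frames):
    return {"emb": frames}                    # stand-in for visual embeddings

def llm_backbone(emb, prompt):
    return {"tokens": (emb["emb"], prompt)}   # stand-in for action tokens

def action_expert(tokens, steps=10):
    return ["action"] * steps                 # stand-in for denoised actions

def timed(label, fn, *args, timings=None):
    """Run one stage and record its wall-clock latency."""
    t0 = time.perf_counter()
    out = fn(*args)
    timings[label] = time.perf_counter() - t0
    return out

def run_pipeline(frames, prompt):
    timings = {}
    emb = timed("vision", vision_encoder, frames, timings=timings)
    tok = timed("llm", llm_backbone, emb, prompt, timings=timings)
    act = timed("action_expert", action_expert, tok["tokens"], timings=timings)
    return act, timings
```

Per-stage timings are exactly what you need to decide which block to quantize, which to keep at higher precision, and which to run less often.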

2) Quantization

To optimize i.MX95 inference, we evaluated several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill has limited impact on accuracy, while quantizing the action expert's denoising flow significantly degrades performance. This behavior is expected: quantization errors accumulate across the iterative denoising steps.

We therefore kept that block at higher precision to maintain stability, and for the other blocks we explored configurations ranging from 8-bit mixed precision to 4-bit quantization, depending on the layer.

Additionally, we applied in-house optimizations to various blocks. The result, referred to in the table below as the optimized model, reflects these combined changes.

3) Asynchronous reasoning: control-aware scheduling

In a synchronous control loop, the pipeline works as follows:

1. Capture observations
2. Run full model inference
3. Execute the generated actions

During step (2), the robot remains idle. If the inference latency is non-negligible, this causes:

  • Idle gaps in motion
  • Oscillatory corrections based on stale observations
  • A drop in effective control frequency
  • Degraded recovery behavior

With asynchronous inference, the generation of actions occurs in parallel with their execution.

The robot executes the current chunk of actions. The next chunk is calculated at the same time.

This increases the effective control frequency, reduces observation staleness, and improves recovery behavior.
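The scheme can be sketched with a worker thread that computes the next chunk while the control loop drains the current one. The chunk size and control rate below follow this guide's setup (100 actions per chunk at 60 FPS, giving a budget of roughly 1.67 s per chunk); the inference stub and its timing are illustrative:

```python
import queue
import threading
import time

CHUNK_SIZE, CONTROL_FPS = 100, 60
BUDGET_S = CHUNK_SIZE / CONTROL_FPS          # ~1.67 s to compute the next chunk

def infer_chunk(obs):
    """Stand-in for the VLA forward pass."""
    time.sleep(0.05)                         # pretend inference takes 50 ms
    return [f"action_{i}" for i in range(CHUNK_SIZE)]

chunks = queue.Queue(maxsize=1)              # hand-off between worker and control loop

def inference_worker(obs, n_chunks):
    for _ in range(n_chunks):
        start = time.perf_counter()
        chunk = infer_chunk(obs)
        assert time.perf_counter() - start < BUDGET_S, "inference missed its budget"
        chunks.put(chunk)                    # control loop picks this up

t = threading.Thread(target=inference_worker, args=("obs", 2))
t.start()
executed = []
for _ in range(2):                           # control loop: execute chunks as they arrive
    executed.extend(chunks.get())
t.join()
```

If `infer_chunk` ever exceeds the budget, the action queue runs dry and the arm stalls, which is exactly the synchronous failure mode the asynchronous design avoids.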

For embedded platforms such as the i.MX95, asynchronous inference is essential, but it is only effective if the inference latency stays below the action-horizon budget: $T_{\text{inference}} < T_{\text{execution}}$, where $T_{\text{execution}}$ is the time needed to execute one chunk of actions.

| Parameter | Synchronous inference | Asynchronous inference |
| Actions per chunk | 100 | 100 |
| FPS | 60 | 60 |
| Chunk size threshold | N/A | 0.2 |
| Aggregate function | N/A | weighted_average |

(Figures: evolution of the action queue under each inference mode.)

📊 What you can achieve with i.MX95


Setup

  • Task: "Grab the tea bag and put it in the mug."
  • Test set (20 episodes): 2 random positions in each cluster
  • Validation set (10 episodes): all 10 positions of cluster n°6

| Platform (CPU) | Policy | Format | Inference latency | Accuracy, test set (20) | Accuracy, validation set (10) | Global accuracy (30) |
| i.MX 95 | ACT | ONNX FP32 | 2.86 s | 1.00 | 0.90 | 0.96 |
| i.MX 95 | ACT | Optimized | 0.32 s | 1.00 | 0.60 | 0.89 |
| i.MX 95 | SmolVLA | ONNX FP32 | 29.1 s | 0.50 | 0.40 | 0.47 |

⏩ Next steps

Our immediate goal is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-device inference latency of 6.15 seconds.

The next phase will focus on more detailed optimization of the NPU. In parallel, we aim to move from single-task setups to longer-term, more complex scenarios. To that end, we will introduce the following:

  • A simulation environment for scalable data generation and benchmarking
  • Reinforcement learning (RL) for policy refinement
  • Sim-to-real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single, validated operational task to a reproducible methodology for deploying VLA policies in embedded robotic systems.

✅ Reusable checklist

  • Recording
  • Training
  • Deployment to i.MX95

📚 Resources and inspiration
