SOTA OCR with Core ML and dots.ocr

By versatileai · October 3, 2025

Every year our hardware gets a little more powerful, and our models get a little smarter per parameter. In 2025, running competitive models on-device is more viable than ever. dots.ocr is a recent 3B-parameter OCR model that outperforms Gemini 2.5 Pro on OmniDocBench, making OCR a no-compromise on-device use case. Running models on-device naturally appeals to developers: no API keys to juggle, zero cost, no networking required. However, running these models on-device means staying mindful of constrained compute and power budgets.

Enter the Neural Engine, Apple's custom AI accelerator, shipped with every Apple device since 2017. The accelerator is designed to be highly capable while sipping battery power. In some of our tests, we found the Neural Engine to be up to 12 times more power-efficient than the CPU and 4 times more power-efficient than the GPU.

(Figure: energy use per compute unit.)

All of this sounds very appealing, but unfortunately the Neural Engine is only accessible through Core ML, Apple's closed-source ML framework. Moreover, simply converting a model from PyTorch to Core ML can present challenges, and the process can be daunting for developers unfamiliar with its quirks and sharp edges. Fortunately, Apple also offers MLX, a more modern and flexible ML framework that targets the GPU (rather than the Neural Engine) and can be used in conjunction with Core ML.


This three-part series walks through how to run dots.ocr on-device using a combination of Core ML and MLX. The process should be applicable to many other models, and we hope it highlights the ideas and tools available to developers looking to ship models of their own.

To follow along, clone the repo. You'll need uv and hf installed to run the setup command:

./bootstrap.sh

If you'd like to skip ahead and use the converted model, you can download it here.

Conversion

Converting from PyTorch to Core ML is a two-step process:

1. Capture the PyTorch execution graph (via torch.jit.trace, or the more modern torch.export).
2. Compile the captured graph into a .mlpackage using coremltools.

There are a few knobs we can adjust in step 2, but most of our control lies in step 1 and in the graph that gets fed to coremltools.
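As a rough sketch of what those two steps look like in code (the module variable, input name, and shape below are illustrative assumptions; the repo's convert.py is the actual reference):

import torch
import coremltools as ct

# Step 1: capture the execution graph with torch.jit.trace.
# `vision_encoder` stands in for the simplified dots.ocr vision module.
example = torch.rand(1, 3, 448, 448)  # hypothetical input shape
traced = torch.jit.trace(vision_encoder.eval(), example)

# Step 2: compile the captured graph into an .mlpackage with coremltools.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="pixel_values", shape=example.shape)],  # name is an assumption
    minimum_deployment_target=ct.target.iOS18,  # the SDPA op requires iOS 18+
)
mlmodel.save("DotsOCR.mlpackage")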

Following the old programming mantra, make it work, make it right, make it fast, we'll focus on getting things right first: making the conversion work on GPU, in float32, with static shapes. Once that works, we can dial down the precision and attempt the move to the Neural Engine.

dots.ocr

dots.ocr consists of two key components: a 1.2B-parameter vision encoder, trained from scratch and based on the NaViT architecture, and a Qwen2.5-1.5B LM backbone. We run the vision encoder with Core ML and the LM backbone with MLX.

Step 0: Understand and simplify the model

Before converting a model, it's best to understand its structure and behavior. Looking at the original vision modeling file here, we can see that the vision encoder is similar to the QwenVL family. Like most vision encoders, the dots vision encoder works on patches, 14×14 in this case. It can process batches of videos and images, which gives us an opportunity to simplify: we process a single image at a time. This is a frequent scenario for on-device apps, and we retain the essential functionality by processing multiple images iteratively when needed.

When starting a conversion, it's best to begin with a minimal viable model. This means removing the bells and whistles that aren't strictly necessary for the model to work. In our case, dots ships several attention implementations for both the vision encoder and the LM backbone. Core ML has plenty of infrastructure around attention, centered on the scaled_dot_product_attention op introduced in iOS 18. We can simplify the model by removing all the other attention implementations and focusing on plain SDPA for now (not the memory-efficient variant).
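For reference, plain SDPA boils down to a single call; a minimal sketch (tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def sdpa_attention(q, k, v):
    # q, k, v: (batch, num_heads, seq_len, head_dim). Processing one image
    # at a time means no variable-length sequences to mask, so no attn_mask.
    return F.scaled_dot_product_attention(q, k, v)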

Once you've done this, you'll see a scary warning message when loading the model:

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

The model doesn't need sliding-window attention to function, so we can happily move on.

Step 1: A simple harness

torch.jit.trace remains the most mature way to convert a model to Core ML. We typically wrap it in a simple harness that lets us change the compute units and precision used.
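For illustration, the two knobs might be wired up like this, reusing `traced` and `example` from the earlier sketch (the flag-to-enum mapping here is an assumption, not the repo's actual code):

import argparse
import coremltools as ct

# Map CLI strings onto coremltools enums: the two knobs the harness varies.
COMPUTE_UNITS = {
    "cpu_only": ct.ComputeUnit.CPU_ONLY,
    "cpu_and_gpu": ct.ComputeUnit.CPU_AND_GPU,
    "all": ct.ComputeUnit.ALL,
}
PRECISIONS = {
    "float32": ct.precision.FLOAT32,
    "float16": ct.precision.FLOAT16,
}

parser = argparse.ArgumentParser()
parser.add_argument("--precision", choices=list(PRECISIONS), default="float32")
parser.add_argument("--compute_units", choices=list(COMPUTE_UNITS), default="cpu_and_gpu")
args = parser.parse_args()

mlmodel = ct.convert(
    traced,  # graph captured with torch.jit.trace in step 1
    inputs=[ct.TensorType(name="pixel_values", shape=example.shape)],
    convert_to="mlprogram",  # compute_precision applies to ML programs
    compute_precision=PRECISIONS[args.precision],
    compute_units=COMPUTE_UNITS[args.compute_units],
)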

You can check out the initial harness here. If you run it on the original modeling code:

uv run convert.py --precision float32 --compute_units cpu_and_gpu

you'll run into the first of (many) problems.

Step 2: Bug hunting

It's rare for a model to convert on the first try. In many cases you'll need to keep massaging the execution graph until you arrive at a final converted model.

The first problem is the following error:

ERROR - converting 'outer' op (located at: 'vision_tower/rotary_pos_emb/192'): In op "matmul", when x and y are both non-const, their dtypes need to match, but got x as int32 and y as fp32

Luckily, this error gives us quite a bit to go on. We can look at the VisionRotaryEmbedding layer and find the following code:

def forward(self, seqlen: int) -> torch.Tensor:
    seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
    freqs = torch.outer(seq, self.inv_freq)
    return freqs

torch.arange has a dtype argument, but coremltools ignores it and always outputs int32. To fix the issue, we simply add a cast after the arange; see the commit here.
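The fix has roughly this shape (a sketch; the linked commit is the authoritative change):

def forward(self, seqlen: int) -> torch.Tensor:
    # Create the range without a float dtype, then cast explicitly:
    # the cast is a real op in the traced graph, so coremltools honors it
    # even though its arange always produces int32.
    seq = torch.arange(seqlen, device=self.inv_freq.device)
    seq = seq.to(self.inv_freq.dtype)
    freqs = torch.outer(seq, self.inv_freq)
    return freqs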

With that fixed, running the conversion again surfaces the next issue, in repeat_interleave:

ERROR - converting 'repeat_interleave' op (located at: 'vision_tower/204'): Cannot add const [None]

This error is less informative, but there's only one call to repeat_interleave in the vision encoder:

cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(
    dim=0,
    dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
)

cu_seqlens is used to mask variable-length sequences in flash_attention_2. It is derived from the grid_thw tensor, which holds the temporal, height, and width dimensions. Since we only process a single image, we can simply delete this call; see the commit here.
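To make the single-image case concrete, here's what grid_thw reduces to for one still image (a sketch with a hypothetical image size; the 14×14 patch size is from above):

import torch

# One 448x448 image: temporal depth 1, and 448 / 14 = 32 patches per side.
grid_thw = torch.tensor([[1, 32, 32]])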

Onwards! This time we get a more inscrutable error:

ERROR - converting '_internal_op_tensor_inplace_fill_' op (located at: 'vision_tower/0/attn/301_internal_tensor_assign_1'): _internal_op_tensor_inplace_fill does not support dynamic indexing

This, too, comes from the masking logic used for variable-length sequences. We only handle a single image (not a video or a batch), so we don't need attention masking at all, and the mask can be a no-op. To prepare for the Neural Engine conversion, we switch from a boolean mask to an all-zero float mask, because the Neural Engine doesn't support boolean tensors; see the commit here.
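To see why the all-zero float mask is safe: SDPA treats a float mask as an additive bias on the attention scores, so zeros change nothing. A minimal sketch (shapes are illustrative):

import torch
import torch.nn.functional as F

seq_len, heads, dim = 1024, 8, 64  # hypothetical sizes
q = k = v = torch.rand(1, heads, seq_len, dim)
# All-zero additive float mask: a no-op for attention, and it avoids
# boolean tensors, which the Neural Engine doesn't support.
mask = torch.zeros(1, 1, seq_len, seq_len)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)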

Once all this is done, the model converts to Core ML successfully! However, running the model produces the following error:

error: "mps.reshape" op result shape is not compatible with the input shape

The offending reshape could have been in any of several places! Fortunately, an earlier warning message helps us track down the issue:

TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
for t, h, w in grid_thw:

Most ML compilers dislike dynamic control flow. Fortunately, since we only process a single image, we can remove the loop and handle a single (h, w) pair; see the commit here.
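The change amounts to replacing the loop with a single unpacking, roughly like this (a sketch; the linked commit is the authoritative change):

# Before: dynamic control flow, which tracing freezes at trace time.
# for t, h, w in grid_thw:
#     ...
# After: we only ever process one image, so take the single row.
t, h, w = grid_thw[0]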

And there we have it! Running the conversion again, the model converts successfully and matches the precision of the original PyTorch model:

Max difference: 0.006000518798828125, mean difference: 1.100682402466191e-05

Step 3: Benchmark

Now that the model works, let's assess its size and performance. The good news: the model works. The bad news: it's over 5 GB! That's unacceptable for on-device deployment. To benchmark compute time, we can use Xcode's built-in tooling by invoking:

open dotsocr_float32.mlpackage

This launches Xcode's inspector for the model. After clicking + to create a performance report and running it on all compute devices, you should see something like this:

(Figure: GPU performance report.)

More than a second for a single forward pass of the vision encoder! We have our work cut out for us.
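You can also time a forward pass from Python via coremltools (a sketch; the file name matches the command above, while the input name and shape are assumptions):

import time
import numpy as np
import coremltools as ct

model = ct.models.MLModel("dotsocr_float32.mlpackage")
x = np.random.rand(1, 3, 448, 448).astype(np.float32)  # hypothetical shape
start = time.perf_counter()
model.predict({"pixel_values": x})  # input name is an assumption
print(f"forward pass: {time.perf_counter() - start:.2f} s")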

Part 2 of this series tackles the Core ML and MLX integration to run the complete model, and Part 3 dives deep into the optimizations needed to run this model on the Neural Engine, such as quantization and dynamic shapes.
