In Part 1, we walked through converting the Qwen2 text backbone of dots.ocr to MLX so the text model runs natively on Apple Silicon. That milestone gave us the ability to handle prompts and generate text. But OCR isn't just about text — it's about understanding images and turning them into structured text. That's where the vision tower comes in.
In this second installment, we dive into the vision side of the model: building the tower in MLX, wiring in its layers, and preparing it to splice seamlessly with the text backbone.
Revisiting the Architecture
DotsOCR is a vision-language model. The vision tower processes image patches and transforms them into embeddings. A projector then maps those vision embeddings into the same dimensional space as the text backbone. Finally, the model stitches the two modalities together by splicing vision embeddings into the token stream, bracketed by special tokens like <|vision_start|> and <|vision_end|>.
Our job was to replicate this process in MLX, ensuring that patch embedding, transformer layers, and the projector all load and align with the text dimension of Qwen2.
Extracting and Preparing Weights
The Hugging Face repository contains both text and vision weights, but MLX needs a clean set for the vision tower. We exported only the relevant tensors — patch embedding, transformer blocks, norms, and the projector — into a dedicated .npz file.
The naming convention of these weights is critical. Keys must map cleanly into our MLX implementation, for example:
vision.patch_embed.proj.weight
vision.blocks.0.attn.qkv.weight
vision.blocks.0.mlp.fc1.weight
vision.blocks.0.mlp.fc2.weight
vision.blocks.0.mlp.fc3.weight
vision.blocks.0.norm1.weight
…
projector.weight
projector.bias
With this set extracted, we had a portable file to load directly into MLX.
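How you pull those tensors out depends on how the checkpoint is stored. As a rough sketch, assuming a single safetensors shard and the key prefixes shown above (a real checkpoint may be sharded, and the upstream keys may carry a different prefix that needs renaming):

    # Sketch: pull the vision-tower and projector tensors out of the checkpoint
    # and save them as a standalone .npz. Assumes one safetensors shard and the
    # "vision." / "projector." prefixes; adjust if the upstream naming differs.
    import numpy as np
    from safetensors.numpy import load_file

    def export_vision_npz(checkpoint_path, out_path="dots_vision_mlx.npz"):
        all_tensors = load_file(checkpoint_path)
        vision_tensors = {
            key: value
            for key, value in all_tensors.items()
            if key.startswith(("vision.", "projector."))
        }
        np.savez(out_path, **vision_tensors)
        print(f"[export] wrote {len(vision_tensors)} tensors to {out_path}")

    export_vision_npz("dots_ocr/model.safetensors")  # illustrative path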
Building the Vision Tower in MLX
With weights ready, the next step was implementing the model. The MLX version mirrors the PyTorch architecture closely: a patch embedding layer, a stack of transformer blocks, normalization, and finally the optional projector.
Here's a simplified outline:
import mlx.nn as nn
import mlx.core as mx

class DotsVisionMLX(nn.Module):
    def __init__(self, d, depth, heads, proj_out=None):
        super().__init__()
        self.patch_embed = PatchEmbedMLX(...)
        # A plain Python list keeps the weight keys flat: blocks.0., blocks.1., ...
        self.blocks = [VisionBlockMLX(d, heads) for _ in range(depth)]
        self.norm = nn.RMSNorm(d)
        self.projector = nn.Linear(d, proj_out) if proj_out else None

    # MLX modules are invoked through __call__ rather than a forward() method.
    def __call__(self, x, grid_thw):
        x = self.patch_embed(x)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        if self.projector is not None:
            x = self.projector(x)
        return x

The supporting classes handle details like splitting query/key/value heads, applying attention, and projecting intermediate representations through feedforward MLPs.
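To give a flavor of those supporting classes, here's a sketch that matches the key names listed earlier (attn.qkv, mlp.fc1/fc2/fc3, norm1/norm2). It assumes an attn.proj output projection and a placeholder MLP width, and it leaves out the rotary position embeddings and the attention variant the real model uses:

    class VisionAttentionMLX(nn.Module):
        # Fused qkv projection -> multi-head attention -> output projection.
        # The "proj" output layer is an assumption about the weight layout.
        def __init__(self, d, heads):
            super().__init__()
            self.heads = heads
            self.head_dim = d // heads
            self.qkv = nn.Linear(d, 3 * d)
            self.proj = nn.Linear(d, d)

        def __call__(self, x):
            B, L, D = x.shape
            q, k, v = mx.split(self.qkv(x), 3, axis=-1)
            q = q.reshape(B, L, self.heads, self.head_dim).transpose(0, 2, 1, 3)
            k = k.reshape(B, L, self.heads, self.head_dim).transpose(0, 2, 1, 3)
            v = v.reshape(B, L, self.heads, self.head_dim).transpose(0, 2, 1, 3)
            out = mx.fast.scaled_dot_product_attention(
                q, k, v, scale=self.head_dim ** -0.5
            )
            return self.proj(out.transpose(0, 2, 1, 3).reshape(B, L, D))

    class VisionMLPMLX(nn.Module):
        # Gated MLP matching the fc1/fc2/fc3 keys: fc2(silu(fc1(x)) * fc3(x)).
        def __init__(self, d, hidden):
            super().__init__()
            self.fc1 = nn.Linear(d, hidden)
            self.fc3 = nn.Linear(d, hidden)
            self.fc2 = nn.Linear(hidden, d)

        def __call__(self, x):
            return self.fc2(nn.silu(self.fc1(x)) * self.fc3(x))

    class VisionBlockMLX(nn.Module):
        # Pre-norm residual block: attention, then the gated MLP.
        def __init__(self, d, heads, mlp_hidden=None):
            super().__init__()
            self.norm1 = nn.RMSNorm(d)
            self.attn = VisionAttentionMLX(d, heads)
            self.norm2 = nn.RMSNorm(d)
            self.mlp = VisionMLPMLX(d, mlp_hidden or 4 * d)  # placeholder expansion

        def __call__(self, x):
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
            return x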
Wiring in the Projector
The projector layer deserves special attention. It transforms the vision output (dimension D_v) into the text model's hidden dimension (D_text). Without this, the embeddings wouldn't match and the splice would fail.
We confirmed the projector weights loaded correctly by printing their shape:
[loader] Loaded projector: (1536, 1536)
This confirmed alignment with Qwen2's text hidden size of 1536.
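The check itself is small. A sketch, assuming the .npz exported earlier; the constructor arguments below are placeholders rather than the real config values:

    # Sketch: load the exported .npz into the MLX model and report the projector
    # shape. Hyperparameters are placeholders; the real values come from the
    # Hugging Face config.
    import mlx.core as mx

    weights = mx.load("dots_vision_mlx.npz")  # dict of key -> mx.array
    print("[loader] Loaded projector:", tuple(weights["projector.weight"].shape))

    model = DotsVisionMLX(d=1536, depth=24, heads=12, proj_out=1536)  # placeholders

    def to_model_key(key):
        # "vision.blocks.0..." -> "blocks.0..."; "projector.weight" stays as-is.
        return key.removeprefix("vision.")

    # strict=False tolerates any keys the simplified module tree doesn't model yet.
    model.load_weights([(to_model_key(k), v) for k, v in weights.items()], strict=False)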
Sanity Checking the Output
To validate our implementation, we passed a test image through the MLX tower and inspected the output embeddings:
MLX vision out shape: (1, 5184, 1536)
This shape was exactly what we expected: one batch, thousands of patches, each projected into the 1536-dimensional text space. That was the green light that our mapping worked.
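For reference, the check is just a forward pass. In sketch form, with a hypothetical preprocess_image helper standing in for the real image processor:

    # Sketch: push one page image through the tower and inspect the output shape.
    # preprocess_image is a hypothetical helper that flattens the page into
    # patches and a (t, h, w) grid; the real pipeline uses the model's processor.
    from PIL import Image

    image = Image.open("test_page.png")
    patches, grid_thw = preprocess_image(image)
    vision_out = model(mx.array(patches)[None], grid_thw)
    print("MLX vision out shape:", vision_out.shape)  # e.g. (1, 5184, 1536)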
Preparing for the Splice
Once the embeddings are produced, they need to be slotted into the text model's prompt. The tokenizer defines <|vision_start|> and <|vision_end|> markers, with placeholder pads in between. At runtime, these pads are replaced by the projected embeddings.
The splice must respect sequence length: the combined embeddings should be the same length as the original padded prompt. Any mismatch leads to assertion errors during inference. This required careful handling of indices like pad_start and pad_end, along with debug logging to confirm counts.
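A minimal sketch of that splice, scanning for pad positions rather than tracking pad_start and pad_end explicitly, and assuming the tokenizer exposes the id of its image-pad token:

    # Sketch: replace the image-pad slots in the embedded prompt with the
    # projected vision embeddings. The image_pad_id lookup is an assumption
    # about the tokenizer; sequence length is preserved by construction.
    import mlx.core as mx

    def splice_vision_embeddings(input_ids, text_embeds, vision_embeds, image_pad_id):
        # input_ids: (seq_len,), text_embeds: (seq_len, d), vision_embeds: (n_patches, d)
        ids = input_ids.tolist()
        n_pads = sum(1 for t in ids if t == image_pad_id)
        assert n_pads == vision_embeds.shape[0], (
            f"{n_pads} image pads in prompt vs {vision_embeds.shape[0]} vision embeddings"
        )
        rows, v = [], 0
        for i, tok in enumerate(ids):
            if tok == image_pad_id:
                rows.append(vision_embeds[v])
                v += 1
            else:
                rows.append(text_embeds[i])
        return mx.stack(rows)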
Where We Leave Off
At this stage, the vision tower is alive inside MLX. We've successfully extracted, rebuilt, and validated the core of the vision model, and confirmed projector alignment. The next big step will be integrating everything: splicing embeddings into text prompts and driving full OCR runs.
That's the focus of Part 3, where we'll refine prompt strategies, manage large documents, and begin to evaluate quality against the Hugging Face implementation.
Closing Thoughts
Converting vision models is never as straightforward as text backbones. The devil is in the details — especially patching, projection, and embedding alignment. But once the tower is rebuilt, the path to true OCR becomes much clearer.
In Part 3, we'll move from raw embeddings to coherent OCR output. Stay tuned — the pieces are now in place.