Introduction
Large language models are evolving rapidly, and Apple's MLX framework gives Mac users a powerful way to run them natively on Apple Silicon. With its optimized GPU support and unified memory design, MLX can unlock performance that feels closer to running on dedicated accelerators.
In this series, I'll walk through how I converted the Hugging Face model rednote-hilab/dots.ocr into MLX format on my Mac Studio M3 Ultra with 512 GB unified memory. dots.ocr is a vision-language model (VLM). In this first part, we'll focus on converting the Qwen2 language backbone, getting the text side running natively in MLX. In Part 2, I'll extend this into full OCR by adding the vision tower.
Why MLX?
Apple designed MLX to run AI models with tight integration to Apple Silicon's architecture. That means:
- GPU acceleration without custom CUDA installs.
- Unified memory shared by the CPU and GPU, so tensors don't have to be copied between devices.
- Lightweight Python APIs that feel familiar if you've worked with PyTorch or NumPy.
For Mac developers, this translates into less friction and more speed when experimenting with state-of-the-art models.
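If you've used NumPy or PyTorch, the MLX Python API feels immediately familiar. Here's a tiny illustrative snippet (it assumes MLX is installed via pip install mlx and is not part of the conversion workflow itself):

```python
# A small taste of the MLX array API: NumPy-like syntax, lazy evaluation.
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])   # arrays live in unified memory
b = mx.ones([3])
c = (a + b) * 2                 # builds a lazy computation graph
mx.eval(c)                      # forces evaluation (on the GPU by default)
print(c)                        # array([4, 6, 8], dtype=float32)
```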
Setting Up The Environment
First, set up a Python environment that can handle the MLX conversion:
```bash
brew install python@3.12
/opt/homebrew/bin/python3.12 -m venv ~/venvs/mlx-dots-py312
source ~/venvs/mlx-dots-py312/bin/activate
python -m pip install --upgrade pip
python -m pip install torch safetensors transformers==4.51.0
```
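A quick import check confirms the environment is ready before moving on (optional; not part of the original steps):

```python
# Optional: confirm the key packages are importable in the new environment.
import torch, transformers, safetensors

print("torch", torch.__version__)
print("transformers", transformers.__version__)   # should be 4.51.0
print("safetensors", safetensors.__version__)
```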
Downloading the Model
I started by pulling down the Hugging Face model repository locally:
huggingface-cli download rednote-hilab/dots.ocr --local-dir ~/models/DotsOCR --local-dir-use-symlinks True
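If you prefer to script the download in Python, huggingface_hub's snapshot_download does the same job as the CLI call above. A minimal sketch, targeting the same local path:

```python
# Download the model repo into ~/models/DotsOCR using huggingface_hub.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path("~/models/DotsOCR").expanduser()
snapshot_download(repo_id="rednote-hilab/dots.ocr", local_dir=str(local_dir))
print("Downloaded to", local_dir)
```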
Then I made a copy dedicated to the text-only conversion:
rsync -a ~/models/DotsOCR/ ~/models/DotsOCR_textonly/
Preparing the Config
MLX recognizes models like LLaMA, Mistral, and Qwen, but it doesn't know about the custom dots_ocr type. Since the backbone is Qwen2, I patched the config so MLX would treat it as such:
```bash
python - <<'PY'
import json, pathlib

p = pathlib.Path('~/models/DotsOCR_textonly/config.json').expanduser()
cfg = json.loads(p.read_text())

# Re-label the model as a plain Qwen2 causal LM
cfg['model_type'] = 'qwen2'
cfg['architectures'] = ['Qwen2ForCausalLM']

# Drop the vision-specific fields
for k in ['vision_config', 'image_token_id', 'video_token_id']:
    cfg.pop(k, None)

p.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
print("Patched", p)
PY
```
This small edit made the model compatible with MLX's conversion tool.
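As a quick check (my addition, not part of the original flow), transformers should now parse the patched config as a plain Qwen2 model:

```python
# Confirm the patched config is recognized as Qwen2.
from pathlib import Path
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(str(Path("~/models/DotsOCR_textonly").expanduser()))
print(cfg.model_type)     # expect: qwen2
print(cfg.architectures)  # expect: ['Qwen2ForCausalLM']
```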
Stripping Vision Weights
The original shards contained both language and vision weights. Since we're only targeting the text backbone here, I filtered out all vision-related tensors and merged the remaining Qwen2 weights into a single shard.
First we create a backup folder and move the original safetensors shards into it:
```bash
mkdir -p ~/models/DotsOCR_textonly/original_safetensors
mv ~/models/DotsOCR_textonly/model-*.safetensors ~/models/DotsOCR_textonly/original_safetensors/ 2>/dev/null || true
```
Now strip the vision tensors from the originals. This step reads the shards from original_safetensors/ and writes stripped copies into text_only_weights/:
```bash
python - <<'PY'
from safetensors import safe_open
from safetensors.torch import save_file
from pathlib import Path

root = Path('~/models/DotsOCR_textonly').expanduser()
orig = root / 'original_safetensors'
dst = root / 'text_only_weights'
dst.mkdir(parents=True, exist_ok=True)

def keep(k: str) -> bool:
    # Keep a tensor unless its name matches one of the vision-related patterns.
    k = k.lower()
    drop = (
        'vision_tower', 'vision.', '.vision', 'visual.', 'image_proj',
        'mm_projector', 'pixel', 'patch_embed', 'vision_proj',
        'visiongrid', 'visionnorm'
    )
    return not any(d in k for d in drop)

any_written = False
for shard in sorted(orig.glob('model-*.safetensors')):
    with safe_open(shard, framework="pt") as f:
        keys = [k for k in f.keys() if keep(k)]
        tensors = {k: f.get_tensor(k) for k in keys}
        out = (dst / shard.name).as_posix()
        if tensors:
            save_file(tensors, out)
            print(f"[STRIP] {shard.name}: kept {len(tensors)} tensors → {out}")
            any_written = True
        else:
            print(f"[STRIP] {shard.name}: kept 0 tensors (vision-only shard)")

print("Done. Output dir:", dst if any_written else "No text tensors written.")
PY
```
We see that shard 1 kept 339 tensors and the vision-only shard kept 0 tensors.
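If you want to eyeball the result before merging, a short script can confirm that nothing vision-like survived the filter (a quick check using the same naming patterns as the strip script):

```python
# Spot-check the stripped shards: count tensors and flag anything vision-like.
from pathlib import Path
from safetensors import safe_open

dst = Path("~/models/DotsOCR_textonly/text_only_weights").expanduser()
suspicious = ("vision", "visual", "image_proj", "mm_projector", "pixel", "patch_embed")

for shard in sorted(dst.glob("model-*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        keys = list(f.keys())
    leftovers = [k for k in keys if any(s in k.lower() for s in suspicious)]
    print(f"{shard.name}: {len(keys)} tensors, {len(leftovers)} vision-like keys")
```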
Now we merge all of the kept tensors into a single shard and write a new index:
```bash
python - <<'PY'
from safetensors import safe_open
from safetensors.torch import save_file
from pathlib import Path
import json, torch

root = Path('~/models/DotsOCR_textonly').expanduser()
src_dir = root / 'text_only_weights'
out_name = 'model-00001-of-00001.safetensors'
out_path = root / out_name

# 1) Collect all kept tensors from text_only_weights/
all_tensors = {}
for sf in sorted(src_dir.glob('model-*.safetensors')):
    with safe_open(sf, framework="pt") as f:
        for k in f.keys():
            if k in all_tensors:
                raise RuntimeError(f"Duplicate tensor key across shards: {k}")
            all_tensors[k] = f.get_tensor(k)

if not all_tensors:
    raise SystemExit("No text tensors found to merge. Did the strip step produce anything?")

# 2) Save a single merged shard
save_file(all_tensors, out_path.as_posix())
print(f"[MERGE] Wrote merged shard: {out_path} with {len(all_tensors)} tensors")

# 3) Compute total_size for index metadata
total_size = 0
for t in all_tensors.values():
    total_size += t.element_size() * t.numel()

index = {
    "metadata": {"total_size": total_size},
    "weight_map": {k: out_name for k in all_tensors.keys()},
    "format": "safetensors"
}

# 4) Write fresh index referencing exactly one shard
idx_path = root / "model.safetensors.index.json"
idx_path.write_text(json.dumps(index, indent=2))
print(f"[INDEX] Wrote {idx_path.name} (total_size={total_size})")

# 5) Clean any old root shards (we'll keep only the merged one)
for old in root.glob('model-*.safetensors'):
    if old.name != out_name:
        old.unlink()
        print("[CLEAN] removed old shard", old.name)

print("[DONE] Single-shard layout ready.")
PY
```
Now ~/models/DotsOCR_textonly/ should have:
- model-00001-of-00001.safetensors: our merged text-only shard
- An updated model.safetensors.index.json pointing to the new shard
Let's do a quick check:
ls -lh ~/models/DotsOCR_textonly
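Beyond the directory listing, you can also verify that the new index and the merged shard agree with each other (a sketch that only reads metadata and keys):

```python
# Cross-check model.safetensors.index.json against the merged shard's contents.
import json
from pathlib import Path
from safetensors import safe_open

root = Path("~/models/DotsOCR_textonly").expanduser()
index = json.loads((root / "model.safetensors.index.json").read_text())
mapped = set(index["weight_map"].keys())

with safe_open(root / "model-00001-of-00001.safetensors", framework="pt") as f:
    stored = set(f.keys())

print("keys in index:", len(mapped))
print("keys in shard:", len(stored))
print("match:", mapped == stored)
```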
Conversion to MLX
With the patched config and text-only weights, conversion is straightforward:
First, make sure we have the latest mlx-lm:
python -m pip install --upgrade mlx-lm
Now let's run the MLX conversion:
python -m mlx_lm convert --hf-path ~/models/DotsOCR_textonly --mlx-path ~/mlx-checkpoints/dotsocr-text -q
The -q flag enables quantization, which reduces memory usage while preserving performance. With 512 GB of unified memory on my M3 Ultra, it isn't strictly necessary, but it keeps the converted checkpoint small.
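If you'd rather drive the conversion from Python instead of the CLI, recent mlx-lm releases also expose a convert helper. A minimal sketch, assuming that API:

```python
# Run the same conversion via the mlx-lm Python API (equivalent to the CLI call above).
from pathlib import Path
from mlx_lm import convert

convert(
    hf_path=str(Path("~/models/DotsOCR_textonly").expanduser()),
    mlx_path=str(Path("~/mlx-checkpoints/dotsocr-text").expanduser()),
    quantize=True,  # same effect as passing -q on the command line
)
```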
Sanity Check
To verify everything worked, I generated text with the converted model:
```bash
python -m mlx_lm generate \
  --model ~/mlx-checkpoints/dotsocr-text \
  --prompt "You are a helpful assistant.\nUser: Say hello in one short sentence.\nAssistant:" \
  --max-tokens 64 \
  --temp 0.7 \
  --top-p 0.9
```
The model responds with a coherent continuation, confirming that the Qwen2 text backbone of dots.ocr is alive and running in MLX.
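You can also load the converted checkpoint and generate directly from Python using mlx-lm's load and generate helpers; a short sketch using the same checkpoint path:

```python
# Load the converted MLX checkpoint and generate a short completion.
from pathlib import Path
from mlx_lm import load, generate

model, tokenizer = load(str(Path("~/mlx-checkpoints/dotsocr-text").expanduser()))
prompt = "You are a helpful assistant.\nUser: Say hello in one short sentence.\nAssistant:"
text = generate(model, tokenizer, prompt=prompt, max_tokens=64, verbose=True)
print(text)
```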
What We Have
At this point, we've successfully:
- Converted dots.ocr into a text-only MLX model.
- Verified it runs natively and efficiently on Apple Silicon.
What's missing? The vision tower. Without it, the model won't yet handle images for OCR. That's exactly what Part 2 will address.
Coming Up in Part 2
In the next post, we'll bring the vision tower into MLX by:
- Porting DotsVisionTransformer into MLX.
- Converting the vision weights from the original shards.
- Connecting vision embeddings into the Qwen2 text stack.
- Running end-to-end OCR on real images.
With the text backbone already working, the next step is to unlock full multimodal OCR performance on Apple Silicon.
Stay tuned for Part 2.
✍️ Written and tested on a Mac Studio M3 Ultra with 512 GB unified memory.