How I translated 1,200 pages into 11 languages with Ollama for €0

FitMesh Sync lives in eleven markets. Each one wants to read the same thing in its own language: the landing page, the supported-device pages, the articles. Translating all of it through a commercial API was the obvious route — and the one I dropped. This is the system I built instead: it runs entirely locally on my MacBook, the translation-API cost is zero, and at build time it produces 1,243 indexable pages.

The problem: multilingual SEO on a zero budget

For an indie app, "translating the site" doesn't mean swapping the strings in the interface. It means generating real pages, one per language, with the text inside the HTML at build time — not a label loaded over JavaScript in the browser. That's the difference between a page Google indexes and one that stays invisible.

With eleven locales, every page I add turns into eleven pages to generate and keep current, and the text behind them isn't trivial: landing copy, device pages, blog posts. The convenient route, DeepL API or GPT-4o, works beautifully, but it is metered per character or per token, and that price gets paid again on every regeneration of the site. For a project I maintain alone, the constraint was sharp: marginal cost per page equal to zero, because otherwise I stop regenerating and the content goes stale.

From here on, every technical decision comes out of that constraint.

Picking the model: why qwen2.5:7b

Ollama is a runtime that runs language models locally and exposes a CLI and an HTTP API: no call leaves the machine. On top of it I run qwen2.5:7b, Alibaba's 7-billion-parameter model, quantized Q4_K_M: 4.7 GB on disk, running in memory on an Apple Silicon MacBook with no dedicated GPU.

It isn't the best model in absolute terms. It's the best one for this constraint. The alternatives I weighed:

gpt-4o — better translation quality, but paid. Across 1,243 pages, multiplied by every copy update, the bill isn't symbolic.
DeepL API — the highest quality of the group on European languages, but paid and character-capped on the free tier. Same problem as gpt-4o: excellent until you regenerate often.
gemma (2B / 7B) — runs locally like qwen, but in my tests it slipped on technical vocabulary: sensor names, units, standards. It tended to "translate" terms that should stay in English.
qwen2.5:7b — the best balance of quality, technical-vocabulary robustness, and speed among the models that run on my laptop. It wins because it's the only option that satisfies the "€0 per page" constraint without falling apart on health-tech content.

The rule I drew from it: when the constraint is zero marginal cost, the question isn't "which is the best model" but "which is the best model among the ones that run for free on my machine".

The pipeline architecture

The single source of truth is the English content in JSON. Everything else is derived. The shape, for a landing page:

post_data/<slug>-lp.json     EN source (~38 strings)
        │
        ▼
translate_landing.py         batches of 18 strings → Ollama → JSON
        │   qwen2.5:7b · Q4_K_M · 4.7 GB · local, Apple Silicon
        ▼
LandingPage                  typed TypeScript object
        │
        ▼
lib/landing/data.ts          11 locales per slug
        │
        ▼
Next.js generateStaticParams → 11 SSG pages per slug

The key point: English is the input, typed TypeScript is the output. The Python script sits in the middle and knows nothing about Next.js — it emits a LandingPage object that the site consumes as if I'd hand-written it. The pages are static (SSG): the translated text is already in the HTML at deploy, exactly what a search engine needs to read it.
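As a rough illustration of that last hop, here is one way a Python script could serialize translated strings into a TypeScript module. The constant name LANDING_PAGES, the import path and the "title" field are assumptions made up for this sketch; the real LandingPage type isn't shown in this article.

```python
import json

def render_ts_module(pages_by_slug):
    """Render {slug: {locale: page_dict}} as the source text of a TypeScript module.

    JSON is a valid TypeScript object literal, so json.dumps does the heavy lifting.
    """
    body = json.dumps(pages_by_slug, ensure_ascii=False, indent=2, sort_keys=True)
    return (
        "// Generated file, do not edit by hand.\n"
        'import type { LandingPage } from "./types";  // hypothetical path\n\n'
        "export const LANDING_PAGES: Record<string, Record<string, LandingPage>> = "
        f"{body};\n"
    )
```

Because the output is ordinary source code, it can be committed and diffed like any hand-written file, and the TypeScript compiler catches a missing field before the site builds.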

The core: batches, retries, fallback

The model doesn't translate one string at a time — that would be glacial — nor all of them at once, because a 7B loses coherence on long inputs. The compromise is the batch of 18 strings: enough context for the translations to stay consistent with each other, short enough that the model doesn't forget the rules halfway through.

# translate_landing.py — translates an EN landing into N locales.
# In Docker:  docker compose run --rm translate python -u translate_landing.py

import json, subprocess

BATCH_SIZE = 18          # context/speed tradeoff on a 7B
MAX_RETRIES = 3
MODEL = "qwen2.5:7b"

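def ollama(prompt):
    """Send one prompt to the local model through the Ollama CLI and return its reply."""
    r = subprocess.run(
        ["ollama", "run", MODEL],
        input=prompt.encode(),
        capture_output=True,
    )
    # No returncode check: a failed run leaves stdout empty, json.loads then
    # raises in the caller, and that counts as one failed attempt.
    return r.stdout.decode().strip()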

def translate_batch(strings, target_lang):
    """Returns the translated list, or None if the model yields no valid JSON."""
    prompt = build_prompt(strings, target_lang)
    for _ in range(MAX_RETRIES):
        try:
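            out = json.loads(ollama(prompt))
            # accept only a list of strings with one entry per input string
            if (isinstance(out, list) and len(out) == len(strings)
                    and all(isinstance(s, str) for s in out)):
                return out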
        except json.JSONDecodeError:
            pass        # malformed JSON → retry
    return None         # retries exhausted → caller keeps the English

def translate_landing(strings, target_lang):
    result = []
    for i in range(0, len(strings), BATCH_SIZE):
        batch = strings[i:i + BATCH_SIZE]
        out = translate_batch(batch, target_lang)
        if out is None:
            result.extend(batch)               # fallback: keep the original EN
            print("!", end="", flush=True)      # one "!" = one untranslated batch
        else:
            result.extend(out)
            print(".", end="", flush=True)      # one "." = one good batch
    print()
    return result

Three details that look minor and aren't:

Falling back to English. If after 3 attempts the model doesn't return a JSON array of the right length, I invent nothing: I keep the original English string. A page with a few English sentences beats a broken page or a hallucinated translation.
print(".", end="", flush=True) on every batch. Because end="" suppresses the newline, nothing would trigger a flush on its own: without flush=True the dots sit in the stdout buffer and you see nothing until the line ends or the script finishes. With the flush, every . is a batch that went through and every ! is a fallback: I read a run's quality at a glance, while it runs.
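Python's -u flag in Docker. When stdout isn't a terminal, which is the usual case for container output, Python switches from line buffering to block buffering. python -u forces unbuffered mode for the whole process: any print that lacks an explicit flush still shows up immediately instead of arriving in one block at the end, so real-time monitoring doesn't depend on remembering flush=True everywhere.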
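The ollama() helper above shells out to the CLI. Since Ollama also exposes an HTTP API, the same call can be sketched over HTTP with only the standard library, which makes it easy to add a timeout. This assumes the default local endpoint http://localhost:11434 and the /api/generate route with streaming disabled; the names OLLAMA_URL, build_request and ollama_http are mine, not part of the original script.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumption)
MODEL = "qwen2.5:7b"

def build_request(prompt, model=MODEL):
    """Build the POST request; stream=False asks for one JSON reply instead of chunks."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ollama_http(prompt, timeout=120):
    """Return the model's text reply, or "" on network or decoding errors."""
    try:
        with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8")).get("response", "").strip()
    except (OSError, ValueError):
        # An empty string fails json.loads in the caller, which triggers a retry.
        return ""
```

Returning an empty string on failure keeps the retry logic in translate_batch unchanged: a timeout is handled exactly like malformed JSON.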

The prompt: what not to translate

Half the quality is in the prompt, and the most important part of the prompt is the list of things not to translate. "Health Connect" isn't a phrase, it's an API name. "SpO2" is a standard. If the model translates them, the text is instantly wrong for whoever reads it.

# Technical names, standards and brands: leave them in English.
BRANDS = [
    "Health Connect", "Google Fit", "Samsung Health", "Apple Health",
    "Wear OS", "HRV", "SpO2", "VO2 Max", "Garmin", "Fitbit",
    "FitMesh Sync", "Bluetooth", "GPX", "BPM",
]

def build_prompt(strings, target_lang):
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(strings))
    keep = ", ".join(BRANDS)
    return f"""You are a technical translator for a health and fitness app.
Translate the following strings from English into {target_lang}.

Rules:
- Do NOT translate these technical names and brands: {keep}
- Keep the tone concise and technical, not promotional.
- Do NOT use em-dashes (—) or typographic quotes: use plain hyphens and straight quotes (").
- Preserve placeholders like {{count}} or %s exactly as they are.

Reply ONLY with a JSON array of {len(strings)} strings, in the same order.
No text before or after.

Strings:
{numbered}"""

The rule about em-dashes and typographic quotes looks like nitpicking. It isn't. Models love "pretty" typographic characters (— instead of -, “ ” instead of "), which then break the JSON when they replace the straight quotes around a string, and upset the alignment in the .ts files and the snippets in meta tags. Saying it explicitly in the prompt costs one line and removed most of an entire class of downstream errors.
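The prompt asks for brand names and placeholders to be left alone, but nothing in the script shown here verifies that the model complied. A minimal check, run on each source/translation pair before accepting a batch, might look like this. The helper name and the placeholder regex are assumptions; BRANDS is shortened for the example.

```python
import re

BRANDS = ["Health Connect", "SpO2", "FitMesh Sync"]   # shortened for the example
PLACEHOLDER = re.compile(r"\{\w+\}|%[sd]")            # matches {count}, %s, %d

def preserved(source, translated, brands=BRANDS):
    """True if every brand and placeholder found in source also appears in translated."""
    for brand in brands:
        if brand in source and brand not in translated:
            return False
    # Compare placeholders as sorted lists: word order may change across languages.
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(translated))
```

A batch that fails the check can be treated like malformed JSON: retry, then fall back to English.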

The real numbers

No vanity metrics. This is what the pipeline produces at build time:

47 blog posts × 11 locales = 517 URLs
12 landing pages × 11 locales = 132 URLs
24 device pages × 11 locales = 264 URLs (generated by translate_provider_models.py, same pipeline)
Subtotal from the three generators: 913 URLs
+ ~330 URLs from the home, section pages, categories and legal pages, one version per locale
Indexable total: 1,243 URLs

Time to translate one landing page (~38 strings × 10 locales, the eleventh being the English source): about 8 minutes on the laptop. Translation-API cost: €0, because the model runs locally and Ollama is free. The only cost is the laptop's power and the minutes it takes.
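To put the 8 minutes in perspective, the figures above can be turned into a rough per-call number. This is back-of-the-envelope arithmetic that assumes no retries and time spread evenly across batches, which won't hold exactly (the last batch of each locale is much shorter).

```python
import math

STRINGS = 38            # approximate strings per landing page
BATCH_SIZE = 18
TARGET_LOCALES = 10     # eleven locales minus the English source
TOTAL_SECONDS = 8 * 60  # "about 8 minutes"

batches_per_locale = math.ceil(STRINGS / BATCH_SIZE)      # 18 + 18 + 2 strings
model_calls = batches_per_locale * TARGET_LOCALES          # assuming no retries
seconds_per_call = TOTAL_SECONDS / model_calls

print(batches_per_locale, model_calls, seconds_per_call)   # 3 30 16.0
```

Roughly 16 seconds per model call, in other words, with every retry adding another call on top.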

Honest limits

qwen2.5:7b is not GPT-4, and pretending otherwise would be dishonest.

Health-tech quality, not literary. For concise technical copy — feature names, device descriptions, instructions — the result holds. For prose with nuance or rhythm, it doesn't. Luckily, FitMesh Sync's content sits in the first category.
Japanese and Korean fall back more. Across runs, JA and KO collect more !s than the European languages: the model produces malformed JSON more often, and more batches end up on the English fallback. Quality on those languages is the weakest of the set.
Typographic characters are still an open front. Even with the explicit rule in the prompt, an em-dash slips through now and then. For now I accept it; systematic sanitization is on the list.
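The sanitization mentioned in the last point could start as a simple character map applied to every translated string. A minimal sketch, assuming only dashes and curly quotes need normalizing; the mapping and the function name are illustrative, not what the pipeline runs today.

```python
# Map typographic characters to plain ASCII equivalents (illustrative subset).
TYPOGRAPHIC = str.maketrans({
    "\u2014": "-",   # em-dash
    "\u2013": "-",   # en-dash
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / typographic apostrophe
})

def sanitize(text):
    """Replace typographic dashes and quotes with their plain ASCII forms."""
    return text.translate(TYPOGRAPHIC)

print(sanitize("Sync \u2014 \u201cfast\u201d"))   # Sync - "fast"
```

Applying it to the parsed strings, after json.loads, keeps it away from the JSON delimiters themselves.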

What I'd do differently

Three things, in order of return on investment:

A bigger model for JA/KO. For the two problem languages I'd use qwen3:14b: slower, but almost certainly fewer fallbacks where quality is weakest. The European languages stay on the 7B, which is more than enough there.
A translation cache. Today a copy update re-translates everything. With a hash of the source string as the key, I'd re-translate only what changed — updates would go from minutes to seconds.
Writing output straight into the Docker volume. Right now I track progress over stdout, handy for live monitoring but fragile as transport. Writing the translated files directly into the mounted volume removes a step and a breakage point.
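The cache from the second point can be sketched in a few lines. The key hashes the model name, the target language and the source string, so changing any of the three invalidates the entry. The file name translation_cache.json and the translate_fn callback are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

def cache_key(model, target_lang, source):
    """Stable key: same model + language + source string gives the same hex digest."""
    raw = "\x1f".join((model, target_lang, source)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def translate_cached(strings, target_lang, translate_fn, model="qwen2.5:7b",
                     path=Path("translation_cache.json")):
    """Translate only strings missing from the cache; translate_fn(list, lang) -> list."""
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    keys = [cache_key(model, target_lang, s) for s in strings]
    missing = [s for s, k in zip(strings, keys) if k not in cache]
    if missing:
        for s, t in zip(missing, translate_fn(missing, target_lang)):
            if t != s:   # skip English fallbacks so they are retried on the next run
                cache[cache_key(model, target_lang, s)] = t
        path.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")
    return [cache.get(k, s) for s, k in zip(strings, keys)]
```

Passing translate_landing as translate_fn would keep the batching and fallback behavior as it is. One side effect of skipping unchanged strings: a string that legitimately stays identical, such as a bare brand name, is sent to the model again on every run.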

Translation is only half the work

Having 1,243 translated, static pages doesn't mean having 1,243 pages that rank. Translating the text is the visible half. The other half is telling search engines how to hold eleven versions of the same page together without cannibalizing each other: correct hreflang, per-locale canonical, a multilingual sitemap, and structured data that describes each page in its own language. That's a problem of its own, with its own traps, and I'll pick it up in a dedicated article.

A 7B model on your laptop doesn't beat GPT-4. But 1,243 pages at zero cost, reproducible and versioned in the repo, beat the 1,243 pages you never shipped because the API cost too much.