Translation

Playto offers multiple ways to translate game text — from classic OCR to AI vision, with local and cloud options.

Text Reading

The default translation mode. Playto captures a region of your screen and reads the text using Windows Runtime (WinRT) OCR, then sends it to a local LLM for translation.

Pipeline: Screen capture → WinRT OCR text recognition → text stability check → LLM translation → overlay display.

Text stability: Playto waits for the text to settle (150 ms for local models, 500 ms for cloud APIs) before translating. If the new text differs from the previous capture by less than the threshold (25% edit distance), Playto skips re-translation to avoid overlay flicker.
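The stability check above can be sketched as follows. This is a minimal illustration of the idea, not Playto's actual implementation; it approximates normalized edit distance with Python's `difflib`, and the constant names are hypothetical.

```python
import difflib

STABILITY_MS = {"local": 150, "api": 500}  # settle times from the docs
CHANGE_THRESHOLD = 0.25                    # 25% edit-distance threshold

def change_ratio(old: str, new: str) -> float:
    """Fraction of characters that differ between two captures.

    difflib's ratio() measures similarity, so 1 - ratio approximates
    a normalized edit distance."""
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

def should_retranslate(previous: str, current: str) -> bool:
    """Re-translate only when the text changed enough (anti-flicker)."""
    return change_ratio(previous, current) >= CHANGE_THRESHOLD
```

With this scheme, identical or nearly identical captures are dropped, so the overlay only redraws when the on-screen text has genuinely changed.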

Supported OCR languages: English, Japanese, Chinese (Simplified), Korean, German, French, Spanish, Portuguese, Russian, Italian, Polish, Vietnamese.

Image Recognition (Standard)

Instead of OCR, Playto sends the screen image directly to a Vision Language Model. The AI reads the image and translates in one pass — no OCR step needed.

Best for: Stylized fonts, text on complex backgrounds, handwritten-style text, and UI elements that OCR can't handle.

Image processing: Screenshots are resized to max 512px (preserving aspect ratio, min 128px height) to optimize inference speed while keeping text readable.
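The resize policy can be expressed as a small dimension calculation. This is a sketch of one plausible reading of the rule (cap the longer side at 512 px, but avoid shrinking the height below 128 px, and never upscale); the exact resampling behavior is internal to Playto.

```python
def target_size(width: int, height: int,
                max_side: int = 512, min_height: int = 128) -> tuple[int, int]:
    """Compute resized dimensions: longest side capped at max_side,
    aspect ratio preserved, height kept readable."""
    scale = min(1.0, max_side / max(width, height))
    if height * scale < min_height:
        # Shrink less (but never upscale) so text stays readable.
        scale = min(1.0, min_height / height)
    return round(width * scale), round(height * scale)
```

For example, a 1920x1080 capture would come out as 512x288, while a small 400x100 region would be left at its original size.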

Two variants:

  • OCR-only — extracts text from the image, then translates separately (more accurate for simple text)
  • Direct — reads and translates in one pass (faster, handles context better)

Streaming: Standard users get real-time streaming — the overlay updates as the model generates translation tokens.
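Conceptually, streaming means redrawing the overlay once per generated token rather than waiting for the full translation. A minimal sketch (the `render` callback stands in for whatever updates Playto's overlay):

```python
from typing import Callable, Iterable

def stream_to_overlay(tokens: Iterable[str],
                      render: Callable[[str], None]) -> str:
    """Append each generated token and redraw incrementally, so the
    translation appears while the model is still generating."""
    text = ""
    for token in tokens:
        text += token
        render(text)  # e.g. update the on-screen overlay label
    return text
```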

Cloud AI

If you don't have a dedicated GPU, you can use cloud API providers for translation. Add your API key in Settings and Playto will route requests to the cloud.

Supported providers:

  • Google Gemini — gemini-2.0-flash (fast, free tier available)
  • OpenAI — gpt-4o-mini (high quality)
  • OpenRouter — access multiple models through one API
  • Custom endpoint — any OpenAI-compatible API

Rate limiting: Playto respects API rate limits (e.g., ~15 RPM for Gemini free tier) with automatic cooldown between requests.
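A requests-per-minute cap with a cooldown can be implemented as below. This is an illustrative sketch, not Playto's internal limiter; at 15 RPM it spaces calls roughly 4 seconds apart.

```python
import time

class RateLimiter:
    """Enforce a simple requests-per-minute cap with a cooldown
    between consecutive API calls (e.g. 15 RPM => 4 s apart)."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_call = 0.0

    def wait(self) -> None:
        """Block until at least min_interval has passed since last call."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

Call `limiter.wait()` before each cloud request; the first call returns immediately and subsequent calls sleep only as long as needed.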

Batch Translation & Paragraph Merge

Batch translation: When multiple text blocks are captured at once, Playto sends them to the LLM in a single request using a separator format. The model returns a structured JSON response with corrections and translations for each block.
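The request/response shape might look like the following sketch. Both the separator string and the JSON schema here are hypothetical stand-ins; Playto's actual wire format is internal.

```python
import json

SEPARATOR = "\n---\n"  # hypothetical block separator

def build_batch_prompt(blocks: list[str]) -> str:
    """Join captured text blocks into one request, asking for JSON back."""
    joined = SEPARATOR.join(blocks)
    return (
        "Translate each block below to English. Respond with JSON: "
        '{"blocks": [{"source": ..., "translation": ...}, ...]}\n\n'
        + joined
    )

def parse_batch_response(raw: str) -> list[str]:
    """Extract the per-block translations from the model's JSON reply."""
    data = json.loads(raw)
    return [item["translation"] for item in data["blocks"]]
```

Sending all blocks in one request trades a slightly larger prompt for far fewer round-trips, which matters most when a cloud provider is rate-limited.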

Paragraph merge: Multi-line text captured from the screen can be merged into a single paragraph before translation. This produces more natural translations for dialogue that spans multiple lines. The option is forced on in Cloud AI mode to reduce the number of API calls.

Prompt Patterns

Playto offers three prompt patterns that control how captured text is sent to the AI for translation:

  • Auto — Playto decides the best pattern based on the text structure. Best for most games.
  • Subtitle — Treats the captured text as a continuous subtitle. Lines are merged into a single paragraph before translation. Best for dialogue-heavy games and cutscenes.
  • Per Line — Each line is translated independently. Best for menus, item lists, and UI text where lines are unrelated.

Configure in Settings > Capture > Prompt Pattern. You can also set per-game overrides in Game Packs.
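To give a feel for how Auto might decide, here is a purely illustrative heuristic. Playto's real selection logic is internal and likely more sophisticated; this sketch just shows the kind of structural signal it could use.

```python
def choose_pattern(lines: list[str]) -> str:
    """Illustrative guess at an Auto-style heuristic (NOT Playto's actual
    logic): sentence-like lines suggest dialogue to merge as a subtitle,
    short unpunctuated lines suggest unrelated UI text."""
    if len(lines) <= 1:
        return "subtitle"
    sentence_like = sum(
        1 for line in lines
        if line.rstrip().endswith((".", "!", "?", "\u2026"))
    )
    # If most lines end like sentences, treat the capture as dialogue.
    return "subtitle" if sentence_like >= len(lines) / 2 else "per-line"
```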

Conversation Context

Playto keeps recent translation history as context for the AI. This allows the model to produce more consistent and accurate translations by understanding the ongoing conversation or story.

How it works: Previous translations are included in the prompt, so the AI can reference character names, pronouns, and story context from earlier dialogue. This is especially helpful for games with long conversations.

Context window: The number of previous translations kept as context depends on the model's context size. Larger models can maintain more context.
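Keeping "as much recent history as fits" can be sketched as below. This uses a character budget as a stand-in for real token counting, and the function name is hypothetical.

```python
def trim_context(history: list[tuple[str, str]],
                 budget_chars: int) -> list[tuple[str, str]]:
    """Keep the most recent (source, translation) pairs that fit the
    context budget, dropping the oldest entries first."""
    kept: list[tuple[str, str]] = []
    used = 0
    for src, dst in reversed(history):  # newest first
        cost = len(src) + len(dst)
        if used + cost > budget_chars:
            break
        kept.append((src, dst))
        used += cost
    return list(reversed(kept))  # restore chronological order
```

A larger context size simply raises the budget, so more of the conversation survives into each prompt.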

Local LLM Setup

Playto runs a llama.cpp server locally for translation inference. Models are downloaded through the app with one click.

Configuration options:

  • GPU Layers (NGL) — how many model layers to offload to GPU. More = faster, but uses more VRAM.
  • Context size — how much text the model can process at once.
  • Thread count — CPU threads for non-GPU layers.
  • Flash attention — enabled by default for faster inference.

A separate server can run on port 8081 for Image Recognition, allowing Text Reading and Image Recognition models to run simultaneously.
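For reference, the options above map roughly onto llama.cpp's `llama-server` command line. The sketch below builds such a command as an argument list; flag names follow upstream llama.cpp, but check your build (some versions change flag spellings), and the values shown are only examples.

```python
def build_server_cmd(model_path: str, *, ngl: int = 99, ctx: int = 4096,
                     threads: int = 8, port: int = 8080) -> list[str]:
    """Map the settings above onto llama.cpp llama-server flags
    (flag names per upstream llama.cpp; values are examples)."""
    return [
        "llama-server",
        "-m", model_path,   # model file (GGUF)
        "-ngl", str(ngl),   # GPU layers to offload
        "-c", str(ctx),     # context size
        "-t", str(threads), # CPU threads for non-GPU layers
        "--flash-attn",     # flash attention
        "--port", str(port),  # e.g. 8081 for the Image Recognition server
    ]
```

The list could be passed to `subprocess.Popen` to launch the server; running a second instance with `port=8081` matches the separate Image Recognition server described above.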