Why are we still paying frontier-token prices to read a screenshot?

Last week I unboxed an AYN Thor handheld and went looking for the canonical "set this up like a pro" YouTube video. There isn't one, because the Thor is new, dual-screen, rooted Android, and the existing handheld content is mostly ROG Ally and Steam Deck material. I was not going to sit through ten loosely related videos to learn which launcher to install and how to get Steam onto the top screen.

So I asked Claude. Claude is a fine reasoning agent and a perfectly competent research analyst, but you don't need a frontier model to summarize a 12-minute YouTube clip about a launcher menu. You need a transcript, a few captioned frames, and a model that can write a paragraph. I already have a homelab with an RTX 5090, so I built two small tools and two Claude skills that send the routine work there and reserve Claude's frontier reasoning for the part of the job that actually benefits from it.

The result is the setup in this post: two CLIs (thor-ask for screen and image narration, thor-vid for video transcription and scene description), a local vision model on the 5090, and Claude skills that nudge the assistant toward the local path whenever the work is token-heavy and the answer doesn't need GPT-class judgment.

The Routing Decision

Before any tooling, you need a clean way to decide which model gets which job. I think about it as three buckets:

  1. Frontier reasoning: architecture, refactors, novel analysis.
  2. Bulk narration: OCR, screenshots, video, frame captioning.
  3. Bulk research: cross-document summaries, release-note distillation.

The routing rule. Anything frontier-shaped (bucket one) goes to the paid API: Claude / GPT and friends, billed per token, reserved for the hard work. Anything narration- or research-shaped (buckets two and three) goes to the homelab GPU: the RTX 5090 running vLLM and faster-whisper behind OpenAI-compatible endpoints, where the only ongoing cost is wattage, not tokens.

Buckets two and three are where local hardware wins. A single 5090 with 32 GB of VRAM can run a 30B-class model with plenty of context, and the only ongoing cost is the wattage. Once a model lives behind an OpenAI-compatible URL, anything that speaks the chat-completions protocol can call it. That includes Claude, when you give it a tool that points at the right endpoint.

The Hardware Behind the Endpoints

The homelab side is intentionally small. One node (bruiser) holds an RTX 5090 and serves two endpoints inside the cluster:

homelab inference endpoints
# image / video frame description (Qwen2.5-VL-7B-Instruct-AWQ)
http://172.16.100.117:31800/v1

# audio transcription (faster-whisper, OpenAI-compatible)
http://172.16.100.117:31900/v1

The vision model is small enough to share the GPU with my agent-routing model through time-slicing, so the same 5090 also handles tool-calling work for PentAGI and a few other agents. There is nothing exotic in the stack: vLLM, faster-whisper-server, ArgoCD, a NodePort. The interesting part is what you do with these endpoints once they exist.
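
Before building anything on top of these, it's worth a ten-second check that both endpoints answer. A minimal sketch, assuming both servers expose the standard OpenAI-style /v1/models listing (vLLM does; faster-whisper-server's OpenAI-compatible mode should as well):

sanity check, both endpoints
import json
import urllib.request

ENDPOINTS = {
    "vision":  "http://172.16.100.117:31800/v1",
    "whisper": "http://172.16.100.117:31900/v1",
}

for name, base in ENDPOINTS.items():
    # OpenAI-compatible servers list their loaded models at GET /v1/models
    with urllib.request.urlopen(f"{base}/models", timeout=10) as resp:
        models = [m["id"] for m in json.load(resp)["data"]]
    print(f"{name}: {models}")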

thor-ask: Read the Screen, Read the Image

The first tool is thor-ask. It started as a Thor-specific helper for reading the handheld's screen over ADB, then I generalized the underlying call so it works on any image.

The Thor variant is a one-liner. It pulls a screenshot off the device and sends it to the local vision model with a prompt biased toward OCR and game identification:

thor-ask, screen capture path
$ thor-ask "what app is on screen and what's the menu state?"
[capture 412KB in 88ms; model=Qwen/Qwen2.5-VL-7B-Instruct-AWQ]
[llm 1.4s]

ES-DE frontend on the primary display, "Systems" view focused on the
"Nintendo - GameCube" tile. Top-right shows time 14:32 and battery 78%.
The bottom hint bar reads "A: Select   B: Back   Start: Menu".

Behind that is a tiny Python script. The capture step is adb exec-out screencap -p, the inference step is a base64-encoded image inside an OpenAI chat/completions request. Total source is about 100 lines, no dependencies outside the standard library:

thor-ask: the call to vLLM
import base64
import json
import urllib.request

def ask(base_url, model, prompt, png_bytes):
    b64 = base64.b64encode(png_bytes).decode()
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 1024,
        "temperature": 0.1,
    }
    # standard-library POST to the OpenAI-compatible chat/completions endpoint
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

The same function, called with a file path instead of a screencap, works for every screenshot, photo, and PDF page I throw at it. When Claude needs to read a long screenshot or describe an image, the right behavior is to call thor-ask --file path/to/image.png "describe this in detail" and get a paragraph back in under two seconds, not to consume vision tokens against a frontier API for the same job.
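
For completeness, the glue around ask() is just as small. A sketch of the two entry points, reusing the ask() function above, with the argument handling simplified relative to the real script: the screen path shells out to adb exec-out screencap -p, the --file path just reads bytes from disk.

thor-ask: entry points (sketch)
import subprocess
import sys
from pathlib import Path

BASE_URL = "http://172.16.100.117:31800/v1"
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ"

def capture_screen() -> bytes:
    # exec-out streams the raw PNG to stdout, no temp file on the device
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout

def main() -> None:
    if "--file" in sys.argv:
        png_bytes = Path(sys.argv[sys.argv.index("--file") + 1]).read_bytes()
    else:
        png_bytes = capture_screen()
    prompt = sys.argv[-1]          # the last argument is the question
    print(ask(BASE_URL, MODEL, prompt, png_bytes))

if __name__ == "__main__":
    main()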

thor-vid: Watch the Video So You Don't Have To

The second tool is thor-vid. This is the one that actually saved me from the AYN Thor YouTube spiral.

You hand it a URL or a local file. It runs yt-dlp to pull the audio and the video, sends the audio to Whisper for a timestamped transcript, samples frames at a configurable cadence, and asks the vision model to describe each frame in a single sentence. The output is a merged markdown report with the transcript and the scene captions interleaved on the same timeline.
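
The two inference calls in the middle of that pipeline are both plain HTTP. A sketch of roughly what they look like, assuming yt-dlp and ffmpeg have already produced audio.wav and a frames/ directory, and using requests for the multipart upload; the whisper model id and the frame-filename convention here are illustrative, not the tool's exact names.

thor-vid: transcription and frame captions (sketch)
import base64
import re
from pathlib import Path

import requests

WHISPER_URL = "http://172.16.100.117:31900/v1/audio/transcriptions"
VL_URL = "http://172.16.100.117:31800/v1/chat/completions"
VL_MODEL = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ"

def transcribe(audio_path: str) -> list[dict]:
    # one HTTP call; verbose_json comes back with timestamped segments
    with open(audio_path, "rb") as f:
        resp = requests.post(
            WHISPER_URL,
            files={"file": f},
            data={"model": "large-v3", "response_format": "verbose_json"},
            timeout=600,
        )
    resp.raise_for_status()
    return resp.json()["segments"]          # each: {"start", "end", "text"}

def caption_frame(frame: Path) -> dict:
    b64 = base64.b64encode(frame.read_bytes()).decode()
    body = {
        "model": VL_MODEL,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Describe this frame in one sentence."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
        "max_tokens": 128,
        "temperature": 0.1,
    }
    resp = requests.post(VL_URL, json=body, timeout=120)
    resp.raise_for_status()
    # frame_00070.jpg -> sampled at 70 seconds (illustrative naming)
    offset = int(re.search(r"(\d+)", frame.stem).group(1))
    return {"t": offset,
            "caption": resp.json()["choices"][0]["message"]["content"]}

captions = [caption_frame(p) for p in sorted(Path("frames").glob("*.jpg"))]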

thor-vid, full pipeline
$ thor-vid "https://www.youtube.com/watch?v=<ayn-thor-setup-video>" \
    --frames-per-min 6
[yt-dlp ok: 12m04s, 1080p]
[whisper 38.2s -> 184 segments]
[frame sample every 10s -> 73 frames]
[vl 41.7s -> 73 captions]

wrote ~/.agents/scratch/thor-vid/ayn-thor-setup/summary.md

The merged file looks like a screenplay. Every transcript chunk sits next to a one-line description of what's actually on screen, which means a model summarizing the file gets the visual context it needs without watching anything. Claude reads the markdown the same way it would read a blog post, and it can pull the answer out of a twelve-minute video in seconds.

summary.md, excerpt
[00:01:12 - 00:01:25]
host: "First thing I do on a new Thor is replace the launcher with ES-DE,
       because the stock one doesn't deal with the bottom screen well."
scene: Stock AYN launcher visible on top screen, Settings app open on bottom
       screen, finger tapping "Apps".

[00:01:26 - 00:01:39]
host: "Long-press the home button, hit 'Set as default home', pick ES-DE."
scene: Android home selector dialog with "ES-DE" and "AYN Launcher" options,
       cursor over "ES-DE".

The runtime, on a single 5090, comes in at a small fraction of the source video's length: the 12-minute clip above took roughly two minutes, dominated by the frame-caption pass. Multiply that by the ten videos I would have watched to set up the Thor and the trade is obvious. The frontier-token equivalent of the same work, with vision-token pricing on every frame, would have been comically expensive.

URL or file (.mp4 / .mov / YouTube) -> yt-dlp + ffmpeg -> audio.wav (16 kHz mono) and frames/*.jpg (~4-8 per minute) -> Whisper and Qwen2.5-VL -> transcript.json and scenes.json -> merge -> summary.md
One CLI, one node, one report. yt-dlp pulls the media, ffmpeg splits out the audio and frames, Whisper and the VL model run in parallel, and a small script merges their outputs by timestamp.
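
The merge is the least glamorous part: walk the transcript segments in order and attach whichever frame captions fall inside each segment's time window. A simplified sketch, using the segment and caption shapes from the earlier sketch (the real report also coalesces adjacent segments into the larger chunks shown above):

thor-vid: merge by timestamp (sketch)
def fmt(t: float) -> str:
    s = int(t)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

def merge(segments: list[dict], captions: list[dict]) -> str:
    lines = []
    for seg in segments:
        lines.append(f"[{fmt(seg['start'])} - {fmt(seg['end'])}]")
        lines.append(f'host: "{seg["text"].strip()}"')
        # frames sampled inside this segment's window become the scene line
        hits = [c["caption"] for c in captions
                if seg["start"] <= c["t"] < seg["end"]]
        if hits:
            lines.append("scene: " + " ".join(hits))
        lines.append("")
    return "\n".join(lines)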

Skills, or How You Stop Reminding Claude

Tools are only half the problem. The other half is convincing Claude to use them when the work calls for it, instead of reaching for its built-in vision capability or burning context on a long YouTube transcript that some other tool already produced for free.

Claude Code's skills system solves this. A skill is a small markdown file with YAML frontmatter that tells Claude what the skill does and when to invoke it. The frontmatter is matched against the user's intent, so a well-written description means Claude reaches for the skill on the right kinds of prompts without you having to remember to say "use thor-vid".

The skill that wraps thor-vid is twenty lines:

~/.claude/skills/thor-vid/SKILL.md
---
name: thor-vid
description: >
  Transcribe and frame-caption a video using the homelab Whisper + VL stack
  on bruiser. Use whenever the user shares a YouTube URL or a local video file
  and wants a summary, transcript, scene description, or "what does this video
  say". Always prefer this over watching the video or transcribing with a
  frontier model. Triggers: youtube link, video file, .mp4, .mov, "summarize
  this video", "what's in this video", "transcribe".
---

# thor-vid skill

When the user gives you a video URL or a local video file:

1. Run `thor-vid <url-or-path>` and wait for completion.
2. Read `~/.agents/scratch/thor-vid/<slug>/summary.md`.
3. Answer the user's actual question from that file. Do not paste the
   whole transcript unless they explicitly ask for it.

Use `--frames-per-min` to control caption density. Default is 4. For dense
tutorial content, bump to 6 or 8.

Two things in there matter beyond the obvious "run this CLI" instruction. The description is written for Claude's matcher, not for a human reader, which is why it lists trigger phrases and explicitly tells Claude to prefer the local path over the frontier path. And the numbered steps close the loop: Claude runs the CLI, reads the report, and answers the question rather than dumping the transcript back to the user.

The thor-ask skill is shaped the same way:

~/.claude/skills/thor-ask/SKILL.md (excerpt)
---
name: thor-ask
description: >
  Describe images, screenshots, and PDF pages using the local vision model.
  Use whenever the user shares an image file, asks "what's on this screen",
  or needs OCR. Prefer this over reading the image directly. Triggers: png,
  jpg, screenshot, "read this image", "what does this say", "OCR".
---

One real behavior change came out of this. Before the skill existed, I would say "look at this screenshot" and Claude would happily process the image with its own vision capability, which is fine but burns input tokens against a paid quota. Now Claude defers to thor-ask for routine reads and only falls back to its native vision when the prompt actually rewards a smarter model — diagrams that need careful structural reasoning, code in screenshots that needs to be modified, that kind of thing.

Wiring the AYN Thor

The Thor was the test that proved the rest of the setup was worth keeping. The device is unusual: dual screens, rooted Android 13, ADB-only access (SSH is closed), an unfamiliar mix of GameNative, ES-DE, and a stock launcher that fights the second screen. The community wisdom is scattered across YouTube videos, forum posts, and the occasional README.

I told Claude what I had, pointed it at three setup videos, and let the skills do the rest:

the actual conversation
me:    here are three videos about setting up the AYN Thor. Use thor-vid
       on each, then write me a short setup plan. I want to land Steam
       (GameNative) on the primary display and ES-DE as the home app.

claude: [runs thor-vid on each URL]
       [reads three summary.md files]

       Plan:
       1. Set ES-DE as home: cmd package set-home-activity ...
       2. GameNative is the practical "Steam" client on rooted ARM Android.
          Force it onto the primary display with: cmd activity start-activity
          --display 0 -n app.gamenative/.MainActivityAliasDefault
       3. CocoonShell still runs on display 4 — leave it alone, it doesn't
          fight ES-DE.
       4. ROMs go under /storage/<volume>/Emulation/roms.

       Want me to write the steam-on-top wrapper to /data/local/tmp/ now?

Total work on my side: three URLs, one paragraph of intent. Total time on the homelab side: about six minutes for transcription and captioning, parallelizable across the three videos. Total frontier tokens spent on the routine narration: zero.

The Thor topology I ended up with — Display 0 as the primary "Built-in Screen", Display 4 as the secondary touch panel, GameNative on display 0, ES-DE as home — came directly out of those summaries combined with a couple of follow-up questions Claude could answer from the merged transcripts.

What This Pattern Buys You

The architecture is small enough that the value gets lost if you describe it as "local inference". The real shift is the routing rule: send the bulk, narration-shaped work to your own GPU, and reserve the frontier model for the work that benefits from a frontier model.

In practice that buys you a few things: routine screenshots, videos, and long transcripts stop drawing down a paid quota, answers come back in seconds from hardware you already own, and the frontier model's context stays free for the work that actually rewards it.

What's Worth Doing Yourself

If you want to try this without recreating my whole homelab, the order I would build in is:

  1. Stand up one OpenAI-compatible vision endpoint. A single GPU, vLLM, a Qwen2.5-VL or Phi-3.5-vision model. Confirm you can hit /v1/chat/completions with a base64 image.
  2. Wrap it in a tiny CLI. Take a file or a URL, return a paragraph. Resist the temptation to make this a service. CLIs compose with skills, prompts, and shell loops in a way services don't.
  3. Add Whisper if you watch video. faster-whisper-server is a single container, runs on the same GPU, and the OpenAI-compatible mode means your CLI is one HTTP call.
  4. Write the skill last. The CLI has to be reliable before the skill is worth installing — a flaky tool that Claude reaches for automatically is worse than no tool at all.

The full source for thor-ask and thor-vid lives in kvncrw/thor-tools, and the skills directory layout follows the standard Claude Code skill format. Nothing here requires a special integration — it's all OpenAI-compatible HTTP and a few markdown files.

Related Posts

The same routing argument shows up in Turning Honeypot Noise into PentAGI Investigations, where the same RTX 5090 handles routine defensive recon while paid models stay reserved for human-initiated work. The shared idea: agents need to spend tokens like an operator spends money, and the only way that happens is by giving them a cheaper path that's good enough for the routine case.