Tech: boring stack, ruthless latency.
For a real-time desktop AI app, the hard problem isn't the model — it's the audio pipeline and the latency budget that decides whether the product feels like magic or like a bad phone line. Pick boring tech for the shell, the backend, the billing, and the auth. Spend all your complexity where it actually matters: the path from a microphone sample to a translated voice in someone's ear.
why this matters for you
- context: Babelio's defensibility is integration + speed, not models. Deepgram, Gemini, and ElevenLabs are commodity APIs — every competitor can call them. What is not commodity is the audio routing, the per-OS capture, the model router, and a sub-700ms end-to-end pipeline that holds under load.
- risk: The worst Babelio failure mode is not "the model picked the wrong word". It's 1.5 seconds of dead air on a Zoom call. Mis-budget your latency or pick exotic infra and you'll spend six months fighting the stack instead of shipping.
What this lesson does / does not do.
Does
- Explain the boring-tech bias and where it does and does not apply.
- Draw the end-to-end pipeline from microphone to virtual output.
- Give you a frame-by-frame latency budget with hard caps per component.
- Sketch the 100 → 10K → 100K user scale path, with the inflection points marked.
Does not
- Write a single line of Rust for you.
- Pick your IDE, your test runner, or your branch strategy.
- Replace the detailed spec in research/tech.md.
- Cover GTM (Lesson 06) or fundraising (Lesson 07).
Spend weirdness where it pays.
Every startup has a complexity budget. Choosing exotic tech in places that don't matter burns the budget before you reach the part that actually differentiates you.
Boring tech means choosing the option that has been in production for a decade, that hires can pick up in a week, and that does not surprise you on a Tuesday. Postgres instead of a new database. A managed PaaS instead of self-managed Kubernetes. A signed installer instead of a custom auto-updater. The boring choice is rarely the best on any single axis — it is the best on the axis that matters most for a small team: predictability.
The corollary: weirdness is a finite resource, and you should spend it on the part of the product that is the product. For an AI-first desktop app, weirdness belongs in the audio path, the model router, and the per-OS capture code — not in the billing system, not in the auth stack, not in the queue between two HTTP services.
in your startup
- shell: Tauri 2 (Rust core + system webview), not a custom Electron fork. ~3–10 MB installer vs 80–150 MB, audio code in-process with no IPC tax, native webview = lower idle RAM. The audio pipeline is the hot path — boring shell, hot core.
- backend: FastAPI + Postgres 16 on Fly.io, two regions. Sessions, quotas, prompt versions, billing webhooks. ~$30/mo MVP, ~$80/mo with monitoring. No microservices, no message bus, no event store.
- billing/auth: Stripe for subscriptions + metered usage. Clerk for auth with a native desktop SDK. Both are boring in 2026 and free you to fight the audio pipeline, not OAuth.
- weirdness budget: Spend it in three places: (1) macOS CoreAudio Process Tap FFI, (2) Windows WASAPI per-process loopback, (3) the model router + fallback graph (sketched just after this list). Everywhere else, take the boring win.
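To make the third weirdness item concrete, here is a minimal Python sketch of a router with a fallback graph. The provider names, the stage caps, and the `call_provider` stub are illustrative placeholders, not real SDK calls; the shape is what matters: each stage has an ordered list of providers, and a provider that breaches its cap is treated as a failed provider.

```python
import asyncio

# Hypothetical fallback graph per stage; names are labels, not real SDKs.
FALLBACK_GRAPH = {
    "stt": ["deepgram_nova3", "whisper_turbo_selfhost"],
    "mt":  ["gemini_2_5_flash", "gpt_4o_mini"],
    "tts": ["elevenlabs_flash_v2_5", "cartesia_sonic3"],
}

STAGE_CAP_MS = {"stt": 300, "mt": 250, "tts": 120}  # illustrative p95 caps

async def call_provider(name: str, payload: bytes) -> bytes:
    """Placeholder for a real streaming client, swapped per provider."""
    await asyncio.sleep(0)  # stand-in for the network call
    return payload

async def route(stage: str, payload: bytes) -> bytes:
    cap_s = STAGE_CAP_MS[stage] / 1000
    for provider in FALLBACK_GRAPH[stage]:
        try:
            # Enforce the stage's hard cap: a slow provider is a failed one.
            return await asyncio.wait_for(call_provider(provider, payload), cap_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # fall through to the next node in the graph
    raise RuntimeError(f"all providers failed for stage {stage!r}")
```

The same structure covers the STT auto-failover and the TTS fallback named in the latency budget below.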
Latency is a frame-by-frame contract.
A latency target is not "fast". It is a budget assigned to each component, with hard caps and known headroom, written down before any code is shipped.
A real-time system has one rule: every component owes the pipeline a fixed number of milliseconds, and when it overspends, the whole product breaks. The budget is not allocated by goodwill — it is computed top-down from the perceived-latency threshold (~700ms for voice to still feel live) and divided across the stages. Each stage has a p50 target and a p95 hard cap. If a component cannot meet its cap, you do not relax the budget. You change the component or split the work.
The most common failure is treating latency as a property of the system rather than a sum of properties of the parts. The cure is mechanical: write the budget on one page, instrument every stage with structured timing, alert on p95 drift, and refuse any feature that does not respect the contract.
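A minimal sketch of what "instrument every stage, alert on p95 drift" means in practice, in Python for brevity (the real hot path lives in Rust). Only the STT cap of 300 ms comes from the budget below; the other caps, the window size, and the alert hook are illustrative assumptions.

```python
import time
from collections import defaultdict, deque

# Per-stage p95 caps; only "stt" = 300 ms is from the budget, rest assumed.
P95_CAP_MS = {"vad": 150, "stt": 300, "mt": 250, "tts": 120, "routing": 50}
WINDOW = 200  # samples per stage for the rolling p95 estimate

samples = defaultdict(lambda: deque(maxlen=WINDOW))

def record(stage: str, elapsed_ms: float) -> None:
    buf = samples[stage]
    buf.append(elapsed_ms)
    ordered = sorted(buf)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    if p95 > P95_CAP_MS[stage]:
        alert(stage, p95)  # page someone / trigger failover

def alert(stage: str, p95: float) -> None:
    print(f"ALERT: {stage} p95={p95:.0f} ms over cap {P95_CAP_MS[stage]} ms")

class timed:
    """Context manager: `with timed('stt'): ...` records the stage's wall time."""
    def __init__(self, stage: str):
        self.stage = stage
    def __enter__(self):
        self.t0 = time.perf_counter()
    def __exit__(self, *exc):
        record(self.stage, (time.perf_counter() - self.t0) * 1000)
```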
in your startup
- VAD: ~100 ms — Silero VAD v5, ONNX, 2 MB, CPU. Trims silence before STT (30–50% cost reduction), gives clean endpoints.
- STT: ~250 ms to first partial — Deepgram Nova-3 over WebSocket. 30+ langs, code-switching, keyterm prompting. Cap is p95 = 300 ms; over that → auto-failover.
- MT: ~200 ms to first token — Gemini 2.5 Flash, streaming. Cheap (~$0.10/M in, $0.40/M out). Fallback: GPT-4o-mini for the "quality" tier.
- TTS: ~75 ms TTFA — ElevenLabs Flash v2.5 over WebSocket, 32 langs. Fallback: Cartesia Sonic-3 (~40 ms) during ElevenLabs quota limits or outages.
- routing: ~25 ms — Tokio broadcast channel, audio mixer (AVAudioEngine on macOS), virtual output via BlackHole / VB-CABLE.
- total: ≈ 650 ms end-to-end against the ~700 ms threshold, 50 ms of headroom (the arithmetic is spelled out in the sketch below). Anything that eats the headroom (extra hop, sync proxy, unstreamed call) is rejected at design time.
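The arithmetic behind the total, spelled out. Every number comes from the list above; the point is that the budget is a checkable artifact, not a slogan.

```python
# p50 budget per stage, straight from the list above.
BUDGET_MS = {"vad": 100, "stt": 250, "mt": 200, "tts": 75, "routing": 25}
THRESHOLD_MS = 700  # perceived-latency threshold for "still feels live"

total = sum(BUDGET_MS.values())   # 100 + 250 + 200 + 75 + 25 = 650
headroom = THRESHOLD_MS - total   # 50 ms left for everything unplanned
assert total <= THRESHOLD_MS, "budget overspent at design time"
print(f"end-to-end p50: {total} ms, headroom: {headroom} ms")
```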
OS audio is three different problems wearing one name.
Capturing audio sounds like one thing. In practice it is three orthogonal problems — per-OS API, per-process scoping, and virtual output routing — and each has its own version floor, entitlement, and trap.
Desktop audio on macOS and Windows is not symmetric. The APIs differ, the privacy models differ, the version floors differ, and the path for shipping a virtual output device differs by an order of magnitude in effort. Treating "audio capture" as one item on the spec is how teams lose two months. Treating it as three problems with three owners, three test matrices, and three fallbacks is how you ship.
The non-obvious rule: do not invent your own kernel layer in the MVP. Use the official per-process tap APIs that Apple and Microsoft shipped in the last two years; for the virtual output side, ride on existing signed drivers (BlackHole, VB-CABLE) until you have revenue, then ship your own. The order is API → routing → driver, not the other way around.
in your startup
- macOS: CoreAudio Process Tap (`CATapDescription` + `AudioHardwareCreateProcessTap`) via Rust FFI. Version floor: macOS 14.4+. Requires `NSAudioCaptureUsageDescription`. macOS 13 fallback: ScreenCaptureKit audio (system-wide only).
- Windows: WASAPI per-process loopback via `ActivateAudioInterfaceAsync` with `VIRTUAL_AUDIO_DEVICE_PROCESS_LOOPBACK`. Version floor: Win 10 build 20348+ / Win 11. No driver install needed for capture.
- virtual output: MVP is a signed BlackHole 2ch install step on macOS, VB-CABLE on Windows. Both free, stable, widely used. Long-term (6–12 mo): your own signed CoreAudio Server Plugin + APO/AVStream driver. Defer until $1M ARR.
- entitlements: macOS notarized + hardened runtime + audio/microphone entitlements only. Windows EV code signing for SmartScreen reputation. Without these, the installer is an "are you sure?" wall. (A version-floor dispatch sketch follows this list.)
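As a rough illustration of how the version floors above turn into a dispatch decision, here is a Python sketch. The real check would live in the Rust core; the backend names are labels for implementations, not real modules.

```python
import platform

def pick_capture_backend() -> str:
    """Choose a capture backend from the OS and its version floor."""
    system = platform.system()
    if system == "Darwin":
        parts = platform.mac_ver()[0].split(".")
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        if (major, minor) >= (14, 4):
            return "coreaudio_process_tap"   # per-process tap, macOS 14.4+
        return "screencapturekit_audio"      # macOS 13 fallback, system-wide only
    if system == "Windows":
        build = int(platform.version().split(".")[2])  # e.g. "10.0.22621"
        if build >= 20348:
            return "wasapi_process_loopback"  # per-process, no driver install
        raise RuntimeError("Windows build too old for per-process loopback")
    raise RuntimeError(f"unsupported OS: {system}")
```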
Don't ship your own kernel extension in the MVP
macOS kexts are functionally dead; Apple wants Server Plugins. Windows kernel drivers need WHQL signing, $300+ EV cert, and weeks of approval. Use BlackHole / VB-CABLE until product-market fit.
Don't run audio through the webview
Web Audio API in a Tauri webview adds an IPC hop and 20–40ms of jitter. Keep audio entirely in Rust; the webview only renders the overlay and settings.
Mark the inflection, don't build it.
Premature scale architecture is the most expensive mistake an early-stage team makes. The cure is not to ignore scale — it is to mark the inflection points and refuse to build past the next one.
Every system has natural breakpoints — user counts, request rates, or cost curves where the cheap approach stops working and a different architecture becomes ROI-positive. A founder's job is to know where those points sit, write them down, and stay one step ahead — not three. Kubernetes at a hundred users is masochism; Postgres at a million users is debt. Both are failures of the same skill: matching architecture to the actual load curve.
The trick is leaving doors unlocked but unopened. Pick boring components that can be swapped without rewriting business logic. Keep stateless services stateless. Keep your data model normal. When the inflection comes, the change is contained, not a rewrite.
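One hedged example of an unlocked-but-unopened door, using the STT swap from the scale list below; the class and method names are illustrative, not an existing codebase.

```python
from typing import Protocol

class SttProvider(Protocol):
    """Business logic depends on this interface, never on a vendor SDK."""
    async def transcribe_stream(self, pcm_chunk: bytes) -> str: ...

class DeepgramStt:
    async def transcribe_stream(self, pcm_chunk: bytes) -> str:
        ...  # WebSocket call to the hosted API would go here

class SelfHostedWhisperStt:
    async def transcribe_stream(self, pcm_chunk: bytes) -> str:
        ...  # gRPC call to the Triton fleet would go here

def make_stt(tier: str) -> SttProvider:
    # When the 10K-user inflection arrives, the swap is one adapter
    # and one line of config, not a pipeline rewrite.
    return SelfHostedWhisperStt() if tier == "free" else DeepgramStt()
```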
in your startup
- 100 users: No infra changes. Single Fly machine, single Postgres, all AI third-party. Cost ~$30/mo. Goal: validate latency + retention, not headroom.
- 10K users: Add a per-user token-bucket rate limiter (Redis). Enable on-device Whisper-turbo for the free tier — kills ~60% of STT cost. Add a Postgres read replica. Crossover math: Deepgram ~$0.0043/min vs self-hosted Whisper-turbo on a $0.30/hr L4 at ~$0.0001/min, breakeven at ~12K concurrent minutes/day (worked through in the sketch after this list).
- 100K users: Self-hosted STT fleet (Whisper-turbo on Triton + L4 GPUs, ~50 streams/GPU). Multi-region API (US-east, EU, AP). CDN for auto-update artifacts. Postgres → managed (Neon or Crunchy) with PITR.
- implication: Until you hit 10K paying users, do not self-host inference. Until you hit 100K, do not go multi-region. The map is on the wall; the construction crew stays home.
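The crossover arithmetic from the 10K-user row, spelled out. The per-minute figures come from the list above; the ~$50/day fixed overhead is an assumption reverse-engineered to make the stated ~12K minutes/day breakeven come out, so treat it as illustrative.

```python
HOSTED_PER_MIN = 0.0043            # Deepgram Nova-3, $ per audio-minute
GPU_PER_HOUR, STREAMS_PER_GPU = 0.30, 50   # L4 spot price, streams per GPU

# $0.30/hr / 60 min / 50 streams ≈ $0.0001 per audio-minute self-hosted.
self_host_per_min = GPU_PER_HOUR / 60 / STREAMS_PER_GPU

def daily_cost(minutes_per_day: float) -> tuple[float, float]:
    hosted = minutes_per_day * HOSTED_PER_MIN
    # Assumed fixed overhead (ops time, idle GPU headroom) — not from the text.
    fixed_overhead = 50.0
    self_hosted = minutes_per_day * self_host_per_min + fixed_overhead
    return hosted, self_hosted

hosted, self_hosted = daily_cost(12_000)
print(f"at 12K min/day: hosted ${hosted:.0f} vs self-host ${self_hosted:.0f}")
# → roughly $52 vs $51: the curves cross near 12K concurrent minutes/day.
```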
Checklist for this week.
Six concrete actions. By Friday you should have a one-page latency budget pinned above your monitor and a written commitment to not touch Kubernetes until 10K users.
"Latency is the product."