Tech: boring stack, ruthless latency.
For a real-time desktop AI app, the hard problem isn't the model — it's the audio pipeline and the latency budget that decides whether the product feels like magic or like a bad phone line. Pick boring tech for the shell, the backend, the billing, and the auth. Spend all your complexity where it actually matters: the path from a microphone sample to a translated voice in someone's ear.
why this matters for you
- context: Babelio's defensibility is integration + speed, not models. Deepgram, Gemini, and ElevenLabs are commodity APIs — every competitor can call them. What is not commodity is the audio routing, the per-OS capture, the model router, and a sub-700ms end-to-end pipeline that holds under load.
- risk: The worst Babelio failure mode is not "the model picked the wrong word". It's 1.5 seconds of dead air on a Zoom call. Mis-budget your latency or pick exotic infra and you'll spend six months fighting the stack instead of shipping.
What this lesson does / does not do.
Does
- Explain the boring-tech bias and where it does and does not apply.
- Draw the end-to-end pipeline from microphone to virtual output.
- Give you a frame-by-frame latency budget with hard caps per component.
- Sketch the 100 → 10K → 100K user scale path, with the inflection points marked.
Does not
- Write a single line of Rust for you.
- Pick your IDE, your test runner, or your branch strategy.
- Replace the detailed spec in research/tech.md.
- Cover GTM (Lesson 06) or fundraising (Lesson 07).
Spend weirdness where it pays.
Every startup has a complexity budget. Choosing exotic tech in places that don't matter burns the budget before you reach the part that actually differentiates you.
Boring tech means choosing the option that has been in production for a decade, that hires can pick up in a week, and that does not surprise you on a Tuesday. Postgres instead of a new database. A managed PaaS instead of self-managed Kubernetes. A signed installer instead of a custom auto-updater. The boring choice is rarely the best on any single axis — it is the best on the axis that matters most for a small team: predictability.
The corollary: weirdness is a finite resource, and you should spend it on the part of the product that is the product. For an AI-first desktop app, weirdness belongs in the audio path, the model router, and the per-OS capture code — not in the billing system, not in the auth stack, not in the queue between two HTTP services.
in your startup
- shell: Tauri 2 (Rust core + system webview), not a custom Electron fork. ~3–10 MB installer vs 80–150 MB, audio code in-process with no IPC tax, native webview = lower idle RAM. The audio pipeline is the hot path — boring shell, hot core.
- backend: FastAPI + Postgres 16 on Fly.io, two regions. Sessions, quotas, prompt versions, billing webhooks. ~$30/mo MVP, ~$80/mo with monitoring. No microservices, no message bus, no event store.
- billing/auth: Stripe for subscriptions + metered usage. Clerk for auth with a native desktop SDK. Both are boring in 2026 and free you to fight the audio pipeline, not OAuth.
- weirdness budget: Spend it in three places: (1) macOS CoreAudio Process Tap FFI, (2) Windows WASAPI per-process loopback, (3) the model router + fallback graph (sketched just after this list). Everywhere else, take the boring win.
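To make the third weirdness item concrete, here is a minimal Python sketch of a router with a fallback graph. The provider names, the stage caps, and the `call_provider` stub are illustrative placeholders, not real SDK calls; the shape is what matters: each stage has an ordered list of providers, and a provider that breaches its cap is treated as a failed provider.

```python
import asyncio

# Hypothetical fallback graph per stage; names are labels, not real SDKs.
FALLBACK_GRAPH = {
    "stt": ["deepgram_nova3", "whisper_turbo_selfhost"],
    "mt":  ["gemini_2_5_flash", "gpt_4o_mini"],
    "tts": ["elevenlabs_flash_v2_5", "cartesia_sonic3"],
}

STAGE_CAP_MS = {"stt": 300, "mt": 250, "tts": 120}  # illustrative p95 caps

async def call_provider(name: str, payload: bytes) -> bytes:
    """Placeholder for a real streaming client, swapped per provider."""
    await asyncio.sleep(0)  # stand-in for the network call
    return payload

async def route(stage: str, payload: bytes) -> bytes:
    cap_s = STAGE_CAP_MS[stage] / 1000
    for provider in FALLBACK_GRAPH[stage]:
        try:
            # Enforce the stage's hard cap: a slow provider is a failed one.
            return await asyncio.wait_for(call_provider(provider, payload), cap_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # fall through to the next node in the graph
    raise RuntimeError(f"all providers failed for stage {stage!r}")
```

The same structure covers the STT auto-failover and the TTS fallback named in the latency budget below.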
Latency is a frame-by-frame contract.
A latency target is not "fast". It is a budget assigned to each component, with hard caps and known headroom, written down before any code is shipped.
A real-time system has one rule: every component owes the pipeline a fixed number of milliseconds, and when it overspends, the whole product breaks. The budget is not allocated by goodwill — it is computed top-down from the perceived-latency threshold (~700ms for voice to still feel live) and divided across the stages. Each stage has a p50 target and a p95 hard cap. If a component cannot meet its cap, you do not relax the budget. You change the component or split the work.
The most common failure is treating latency as a property of the system rather than a sum of properties of the parts. The cure is mechanical: write the budget on one page, instrument every stage with structured timing, alert on p95 drift, and refuse any feature that does not respect the contract.
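A minimal sketch of what "instrument every stage, alert on p95 drift" means in practice, in Python for brevity (the real hot path lives in Rust). Only the STT cap of 300 ms comes from the budget below; the other caps, the window size, and the alert hook are illustrative assumptions.

```python
import time
from collections import defaultdict, deque

# Per-stage p95 caps; only "stt" = 300 ms is from the budget, rest assumed.
P95_CAP_MS = {"vad": 150, "stt": 300, "mt": 250, "tts": 120, "routing": 50}
WINDOW = 200  # samples per stage for the rolling p95 estimate

samples = defaultdict(lambda: deque(maxlen=WINDOW))

def record(stage: str, elapsed_ms: float) -> None:
    buf = samples[stage]
    buf.append(elapsed_ms)
    ordered = sorted(buf)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    if p95 > P95_CAP_MS[stage]:
        alert(stage, p95)  # page someone / trigger failover

def alert(stage: str, p95: float) -> None:
    print(f"ALERT: {stage} p95={p95:.0f} ms over cap {P95_CAP_MS[stage]} ms")

class timed:
    """Context manager: `with timed('stt'): ...` records the stage's wall time."""
    def __init__(self, stage: str):
        self.stage = stage
    def __enter__(self):
        self.t0 = time.perf_counter()
    def __exit__(self, *exc):
        record(self.stage, (time.perf_counter() - self.t0) * 1000)
```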
in your startup
- VAD: ~100 ms — Silero VAD v5, ONNX, 2 MB, CPU. Trims silence before STT (30–50% cost reduction), gives clean endpoints.
- STT: ~250 ms to first partial — Deepgram Nova-3 over WebSocket. 30+ langs, code-switching, keyterm prompting. Cap is p95 = 300 ms; over that → auto-failover.
- MT: ~200 ms to first token — Gemini 2.5 Flash, streaming. Cheap (~$0.10/M in, $0.40/M out). Fallback: GPT-4o-mini for the "quality" tier.
- TTS: ~75 ms TTFA — ElevenLabs Flash v2.5 over WebSocket, 32 langs. Fallback: Cartesia Sonic-3 (~40 ms) during ElevenLabs quota limits or outages.
- routing: ~25 ms — Tokio broadcast channel, audio mixer (AVAudioEngine on macOS), virtual output via BlackHole / VB-CABLE.
- total: ≈ 650 ms end-to-end against the ~700 ms threshold, 50 ms of headroom (the arithmetic is spelled out in the sketch below). Anything that eats the headroom (extra hop, sync proxy, unstreamed call) is rejected at design time.
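The arithmetic behind the total, spelled out. Every number comes from the list above; the point is that the budget is a checkable artifact, not a slogan.

```python
# p50 budget per stage, straight from the list above.
BUDGET_MS = {"vad": 100, "stt": 250, "mt": 200, "tts": 75, "routing": 25}
THRESHOLD_MS = 700  # perceived-latency threshold for "still feels live"

total = sum(BUDGET_MS.values())   # 100 + 250 + 200 + 75 + 25 = 650
headroom = THRESHOLD_MS - total   # 50 ms left for everything unplanned
assert total <= THRESHOLD_MS, "budget overspent at design time"
print(f"end-to-end p50: {total} ms, headroom: {headroom} ms")
```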
OS audio is three different problems wearing one name.
Capturing audio sounds like one thing. In practice it is three orthogonal problems — per-OS API, per-process scoping, and virtual output routing — and each has its own version floor, entitlement, and trap.
Desktop audio on macOS and Windows is not symmetric. The APIs differ, the privacy models differ, the version floors differ, and the path for shipping a virtual output device differs by an order of magnitude in effort. Treating "audio capture" as one item on the spec is how teams lose two months. Treating it as three problems with three owners, three test matrices, and three fallbacks is how you ship.
The non-obvious rule: do not invent your own kernel layer in the MVP. Use the official per-process tap APIs that Apple and Microsoft shipped in the last two years; for the virtual output side, ride on existing signed drivers (BlackHole, VB-CABLE) until you have revenue, then ship your own. The order is API → routing → driver, not the other way around.
in your startup
- macOS: CoreAudio Process Tap (`CATapDescription` + `AudioHardwareCreateProcessTap`) via Rust FFI. Version floor: macOS 14.4+. Requires `NSAudioCaptureUsageDescription`. macOS 13 fallback: ScreenCaptureKit audio (system-wide only).
- Windows: WASAPI per-process loopback via `ActivateAudioInterfaceAsync` with `VIRTUAL_AUDIO_DEVICE_PROCESS_LOOPBACK`. Version floor: Win 10 build 20348+ / Win 11. No driver install needed for capture.
- virtual output: MVP is a signed BlackHole 2ch install step on macOS, VB-CABLE on Windows. Both free, stable, widely used. Long-term (6–12 mo): your own signed CoreAudio Server Plugin + APO/AVStream driver. Defer until $1M ARR.
- entitlements: macOS notarized + hardened runtime + audio/microphone entitlements only. Windows EV code signing for SmartScreen reputation. Without these, the installer is an "are you sure?" wall. (A version-floor dispatch sketch follows this list.)
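As a rough illustration of how the version floors above turn into a dispatch decision, here is a Python sketch. The real check would live in the Rust core; the backend names are labels for implementations, not real modules.

```python
import platform

def pick_capture_backend() -> str:
    """Choose a capture backend from the OS and its version floor."""
    system = platform.system()
    if system == "Darwin":
        parts = platform.mac_ver()[0].split(".")
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        if (major, minor) >= (14, 4):
            return "coreaudio_process_tap"   # per-process tap, macOS 14.4+
        return "screencapturekit_audio"      # macOS 13 fallback, system-wide only
    if system == "Windows":
        build = int(platform.version().split(".")[2])  # e.g. "10.0.22621"
        if build >= 20348:
            return "wasapi_process_loopback"  # per-process, no driver install
        raise RuntimeError("Windows build too old for per-process loopback")
    raise RuntimeError(f"unsupported OS: {system}")
```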
Don't ship your own kernel extension in the MVP
macOS kexts are functionally dead; Apple wants Server Plugins. Windows kernel drivers need WHQL signing, $300+ EV cert, and weeks of approval. Use BlackHole / VB-CABLE until product-market fit.
Don't run audio through the webview
Web Audio API in a Tauri webview adds an IPC hop and 20–40ms of jitter. Keep audio entirely in Rust; the webview only renders the overlay and settings.
Mark the inflection, don't build it.
Premature scale architecture is the most expensive mistake an early-stage team makes. The cure is not to ignore scale — it is to mark the inflection points and refuse to build past the next one.
Every system has natural breakpoints — user counts, request rates, or cost curves where the cheap approach stops working and a different architecture becomes ROI-positive. A founder's job is to know where those points sit, write them down, and stay one step ahead — not three. Kubernetes at a hundred users is masochism; Postgres at a million users is debt. Both are failures of the same skill: matching architecture to the actual load curve.
The trick is leaving doors unlocked but unopened. Pick boring components that can be swapped without rewriting business logic. Keep stateless services stateless. Keep your data model normal. When the inflection comes, the change is contained, not a rewrite.
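One hedged example of an unlocked-but-unopened door, using the STT swap from the scale list below; the class and method names are illustrative, not an existing codebase.

```python
from typing import Protocol

class SttProvider(Protocol):
    """Business logic depends on this interface, never on a vendor SDK."""
    async def transcribe_stream(self, pcm_chunk: bytes) -> str: ...

class DeepgramStt:
    async def transcribe_stream(self, pcm_chunk: bytes) -> str:
        ...  # WebSocket call to the hosted API would go here

class SelfHostedWhisperStt:
    async def transcribe_stream(self, pcm_chunk: bytes) -> str:
        ...  # gRPC call to the Triton fleet would go here

def make_stt(tier: str) -> SttProvider:
    # When the 10K-user inflection arrives, the swap is one adapter
    # and one line of config, not a pipeline rewrite.
    return SelfHostedWhisperStt() if tier == "free" else DeepgramStt()
```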
in your startup
- 100 users: No infra changes. Single Fly machine, single Postgres, all AI third-party. Cost ~$30/mo. Goal: validate latency + retention, not headroom.
- 10K users: Add a per-user token-bucket rate limiter (Redis). Enable on-device Whisper-turbo for the free tier — kills ~60% of STT cost. Add a Postgres read replica. Crossover math: Deepgram ~$0.0043/min vs self-hosted Whisper-turbo on a $0.30/hr L4 at ~$0.0001/min, breakeven at ~12K concurrent minutes/day (worked through in the sketch after this list).
- 100K users: Self-hosted STT fleet (Whisper-turbo on Triton + L4 GPUs, ~50 streams/GPU). Multi-region API (US-east, EU, AP). CDN for auto-update artifacts. Postgres → managed (Neon or Crunchy) with PITR.
- implication: Until you hit 10K paying users, do not self-host inference. Until you hit 100K, do not go multi-region. The map is on the wall; the construction crew stays home.
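The crossover arithmetic from the 10K-user row, spelled out. The per-minute figures come from the list above; the ~$50/day fixed overhead is an assumption reverse-engineered to make the stated ~12K minutes/day breakeven come out, so treat it as illustrative.

```python
HOSTED_PER_MIN = 0.0043            # Deepgram Nova-3, $ per audio-minute
GPU_PER_HOUR, STREAMS_PER_GPU = 0.30, 50   # L4 spot price, streams per GPU

# $0.30/hr / 60 min / 50 streams ≈ $0.0001 per audio-minute self-hosted.
self_host_per_min = GPU_PER_HOUR / 60 / STREAMS_PER_GPU

def daily_cost(minutes_per_day: float) -> tuple[float, float]:
    hosted = minutes_per_day * HOSTED_PER_MIN
    # Assumed fixed overhead (ops time, idle GPU headroom) — not from the text.
    fixed_overhead = 50.0
    self_hosted = minutes_per_day * self_host_per_min + fixed_overhead
    return hosted, self_hosted

hosted, self_hosted = daily_cost(12_000)
print(f"at 12K min/day: hosted ${hosted:.0f} vs self-host ${self_hosted:.0f}")
# → roughly $52 vs $51: the curves cross near 12K concurrent minutes/day.
```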
Checklist for this week.
Six concrete actions. By Friday you should have a one-page latency budget pinned above your monitor and a written commitment to not touch Kubernetes until 10K users.
"Latency is the product."