AI Thesis: AI as the foundation, not the feature.
"AI-powered" is a marketing word. "AI-impossible" is a product fact. Your job in this lesson is to state, in one paragraph, the thing your product needs that did not exist 18 months ago — and then to be honest about which parts of that stack are commodity and which parts are yours.
why this matters for you
- honest: Babelio's moat is not model quality. You ride the same Deepgram / Gemini / ElevenLabs APIs every other team can buy. Saying otherwise on a pitch deck is the fastest way to lose a sophisticated investor.
- real moat: The defensible asset is OS-level audio integration, per-app routing + glossary corpus, and consumer distribution — the latency-tuned UX layer on top of commodity weights. Say so plainly; build accordingly.
What this lesson does / does not do.
Does
- Define "AI-impossible" as a sharper bar than "AI-improved".
- Explain the data flywheel — usage to data to model to product.
- Pick a primary model + fallback per stage with cost numbers.
- Name AI-specific risks and the mitigations that ship in v1.
Does not
- Pretend you will train a foundation model. You will not.
- Promise a research moat that doesn't exist on commodity APIs.
- Replace the audio-routing engineering work (that's the real moat).
- Cover go-to-market — that was Lesson 06.
The product fact that pre-dates the marketing.
"AI-powered" is the laziest phrase on the internet. The useful question is sharper: without the AI capability of the last 18 months, is your product impossible — or merely improved?
An AI-improved product gets a faster autocomplete, a better summary, a smarter search ranker. It would still exist without the model — just slightly worse. An AI-impossible product cannot exist at all if you remove the model: the core experience requires a capability that did not have a price-per-call eighteen months ago. Investors fund the second category. The first is feature-list dressing.
The test is simple. Delete the AI from your description. Can you still write a one-sentence product? If yes, you are AI-improved. If the sentence collapses into nonsense, you are AI-impossible — and your job is to identify which capability curve crossed which threshold in which quarter. That date is your "why now".
in your startup
- the bar: Babelio dubs any speaker from any app into your language under ~700 ms end-to-end, with original-speaker mute and identity-preserving voice. Strip the AI and you have an audio router with no audio — the product disappears.
- three curves: 2025 was the year all three latency floors fell at once: STT <300 ms (Deepgram Nova-3, Whisper-turbo), MT <200 ms (Gemini 2.5 Flash, GPT-4o), TTS <200 ms TTFB (ElevenLabs Flash v2.5, Cartesia Sonic). The budget arithmetic is sketched after this list.
- why not earlier: Pre-2024 cascaded pipelines hit 2–4 s end-to-end — that feels like a delayed echo, not interpretation. Phrase-based MT mangled mid-sentence code-switching. TTS was studio-grade or fast, never both.
- your one-liner: "Sub-800 ms streaming dub of arbitrary conversational speech from any desktop app is AI-impossible before mid-2025." That is the thesis. Put it on slide 4 of the deck and nowhere else.
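To see why the dates matter, add the three floors up. A minimal sketch using the stage ceilings quoted above; nothing here is measured, it is just the budget arithmetic the thesis rests on.

```typescript
// Why the thesis dates to 2025: the three stage ceilings quoted above
// only just fit inside a conversational latency budget.
const stageCeilingMs = {
  stt: 300, // Deepgram Nova-3 / Whisper-turbo streaming
  mt: 200,  // Gemini 2.5 Flash class, short conversational segments
  tts: 200, // ElevenLabs Flash v2.5 / Cartesia Sonic time-to-first-byte
};

const endToEnd = Object.values(stageCeilingMs).reduce((a, b) => a + b, 0);
console.log(`worst-case cascade: ${endToEnd} ms`); // 700 ms, inside the sub-800 ms bar
```

Any one stage at its 2023 level (say, 1 s STT) blows the whole budget, which is why the thesis names a quarter, not a vibe.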
Usage to data to model to product.
A flywheel is not a metaphor for "more users = more better". It is a specific four-arrow loop where each turn lowers the cost of the next turn and a competitor without the loop cannot catch up.
The classical loop is: product creates usage, usage emits a signal, the signal improves the model, the improved model differentiates the product. The honest version names what kind of signal: clicks for ranking, corrections for translation, completions for code, outcomes for diagnosis. Without naming the signal you have a slogan, not a flywheel.
The second honest test: does the signal accumulate where you sit, or where the foundation-model vendor sits? If OpenAI sees the corrections before you do, your flywheel powers their factory. If the corrections live in your stack as glossaries, prompt overrides, routing rules, eval sets — then the asset is yours, and a competitor with the same APIs starts from zero.
in your startup
- arrow 1: Usage → data. Opt-in transcript donation + the "wrong word" hotkey emit per-pair, per-app correction events. One user fixes "Aragorn"; another fixes "API". You capture both with app context. One possible event shape is sketched after this list.
- arrow 2: Data → model. Corrections become per-language-pair idiom-tuning eval sets, per-app glossaries, and style preferences (gaming slang vs enterprise register). Routed as prompts and post-edit rules, not weight updates.
- arrow 3: Model → product. Per-context routing improves: gaming Discord uses slang-tuned MT prompts, Zoom uses a formal register, Twitch uses streamer-name preservation. Hallucination rate drops monotonically per app.
- where it lives: The corpus is yours, not Deepgram's or Gemini's. Foundation weights are commodity; the routing config + glossary corpus is the proprietary asset. Distribution-bound, not data-bound — but compounding nonetheless.
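One way the arrow-1 events could be shaped. A minimal sketch; every field and type name here is an illustrative assumption, not a schema Babelio ships.

```typescript
// Sketch of an arrow-1 correction event and its arrow-2 use.
// All names are illustrative assumptions, not a fixed schema.
interface CorrectionEvent {
  langPair: string;       // e.g. "ja->en"
  app: string;            // foreground app at capture time, e.g. "Discord"
  sourceText: string;     // what the MT stage saw
  machineOutput: string;  // what Babelio dubbed
  userFix: string;        // what the "wrong word" hotkey recorded
  kind: "glossary" | "register" | "name"; // how arrow 2 routes it
}

// Arrow 2 in miniature: a correction becomes a per-app glossary pair,
// applied as a prompt / post-edit rule rather than a weight update.
function toGlossaryEntry(e: CorrectionEvent): [string, string] {
  return [e.machineOutput, e.userFix]; // e.g. ["Alagorn", "Aragorn"]
}
```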
Cheap by default, smart on demand.
A serious AI product is not one model behind one prompt. It is a pipeline of stages with a primary and a fallback per stage, routed by signal: confidence, cost ceiling, latency tail, user tier.
The cheap-to-smart pattern works because most requests are easy and a small percentage need the heavyweight. Route by default to the cheapest model that meets the latency and quality bar; reroute the segment to a more expensive model when a confidence signal trips a threshold. This pattern is how you keep gross margin alive while still delivering the best result on the long tail.
Two architectural rules. First, every model goes behind an adapter — never call a vendor SDK directly from product code, or you lock yourself into their roadmap and their price list. Second, every stage has a local-only fallback, even if degraded. If your only option is "cloud or nothing", a price hike or an outage is a company-ending event.
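Both rules in miniature. A sketch that assumes a streaming STT stage; the interface shape is invented for illustration, and no real vendor SDK signature appears.

```typescript
// Rule 1: every model behind an adapter. Product code sees only this
// interface; vendor SDKs live inside the adapter implementations.
// The shape is an assumption for illustration.
interface SttAdapter {
  transcribe(chunk: Float32Array): Promise<{ text: string; confidence: number }>;
}

// Rule 2: every stage carries a local-only fallback, even if degraded.
class SttStage {
  constructor(
    private primary: SttAdapter, // e.g. a Deepgram-backed adapter
    private local: SttAdapter,   // e.g. a WhisperKit-backed adapter
  ) {}

  async run(chunk: Float32Array): Promise<{ text: string; confidence: number }> {
    try {
      return await this.primary.transcribe(chunk);
    } catch {
      // Outage or price-driven cutoff: degrade, don't die.
      return this.local.transcribe(chunk);
    }
  }
}
```

Swapping Deepgram for GPT-4o-transcribe is then one new adapter, not a product rewrite.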
in your startup
- default path: VAD: Silero, run locally. STT: Deepgram Nova-3 multilingual streaming ($0.0092/min). MT: Gemini 2.5 Flash ($0.30 in / $2.50 out per 1M tokens). TTS: Cartesia Sonic (~$0.03/min). The cheapest viable path at <700 ms end-to-end; it is restated as config after this list.
- smart reroute: On low MT logprob or a detected idiom token, re-route the segment to Claude Haiku 4.5 ($1 in / $5 out per 1M tokens). Fast enough to stay in budget; smart enough to fix the long-tail miss.
- premium tier: Paid users get ElevenLabs Flash v2.5 ($0.05 per 1k chars) for voice-cloned dub and a tighter latency SLA. Free tier stays on Cartesia.
- offline / privacy: WhisperKit (Whisper-large-v3-turbo) on the Apple Neural Engine for local STT — 0.46 s latency, 2.2% WER. Piper / Kokoro for local TTS. MT stays cloud unless full-offline mode (degraded NLLB-200).
- cost truth: A heavy user (30 hr audio/mo) on the pure-cloud Cartesia path runs ≈ $55/mo in COGS. Local STT on Apple Silicon drops that to ~$41/mo. Pricing must be usage-tiered ($9.99 / 10 hr, $24.99 / 40 hr) — flat $9.99 unlimited dies.
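The same table as data. Prices are the per-unit figures quoted above; the logprob threshold and the config shape are assumptions to tune against your own eval set.

```typescript
// The default / reroute / premium table above, as one config object.
// Prices are the quoted per-unit figures; the threshold is an assumed
// placeholder, to be calibrated on a per-language-pair eval set.
const mtRouting = {
  default: { model: "gemini-2.5-flash", usdPer1MTokIn: 0.3, usdPer1MTokOut: 2.5 },
  smart:   { model: "claude-haiku-4.5", usdPer1MTokIn: 1.0, usdPer1MTokOut: 5.0 },
  rerouteWhen: (avgLogprob: number, hasIdiomToken: boolean) =>
    avgLogprob < -1.2 || hasIdiomToken, // -1.2 is an assumed threshold
};

const ttsRouting = {
  free:    { model: "cartesia-sonic", usdPerMin: 0.03 },
  premium: { model: "elevenlabs-flash-v2.5", usdPer1kChars: 0.05, voiceClone: true },
};
```

A one-page file like this is also the artefact the lesson asks for: routing decisions become a diff, not a meeting.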
Failure modes name themselves first.
An AI product fails in patterned, predictable ways. The mature founder enumerates the failure modes before the user does, and ships the mitigation in v1 — not "on the roadmap".
Four risk families come up in every conversational-AI product: vendor lock-in (your supply chain), cost spikes (your unit economics), hallucination (your trust budget), and privacy (your regulatory surface). Each has a known structural mitigation. Each must be visible in the architecture diagram, not buried in the FAQ.
The discipline is to convert each risk into a concrete shipping artefact: an adapter interface, a per-user quota, a confidence gate, a retention policy. If a risk has no artefact, it is a wish. Investors and serious customers can tell the difference in under a minute.
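One way to make "no artefact is a wish" structural: a sketch in which the artefact field is mandatory, so an unmitigated risk does not typecheck. All names are illustrative.

```typescript
// "If a risk has no artefact, it is a wish": encode that as a type.
// The artefact field is required, so a risk without a shipping
// mitigation is a compile error, not a roadmap item.
type Risk = {
  family: "lock-in" | "cost-spike" | "hallucination" | "privacy";
  artefactInV1: string; // the concrete thing that ships, per the list below
};

const registry: Risk[] = [
  { family: "lock-in",       artefactInV1: "multi-provider adapter interface" },
  { family: "cost-spike",    artefactInV1: "per-user daily quota + idle pause" },
  { family: "hallucination", artefactInV1: "confidence gate before TTS" },
  { family: "privacy",       artefactInV1: "zero-retention default + per-app allowlist" },
];
```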
in your startup
- lock-in: Multi-provider adapter pattern for STT / MT / TTS in v1. Deepgram tripling its price = swap to GPT-4o-transcribe in a day. No vendor SDK reaches product code directly.
- cost spike: Hard per-user daily cap tied to plan tier. Idle detection auto-pauses Babelio after 10 min with no change in foreground-app audio. Overnight YouTube autoplay can't blow a $200 STT bill.
- hallucination: Confidence floor — segments below threshold are suppressed, never dubbed. Aggressive Silero VAD gate before STT. Optional source-language ghost track at -18 dB so the user always has a reality check. This floor and the daily cap are sketched after this list.
- privacy: No audio persisted by default. Zero-retention contracts with Deepgram / ElevenLabs / Cartesia. Local-STT default on Apple Silicon. Per-app allowlist UI (Slack huddles default off). Voice-clone requires an explicit consent phrase — no passive capture.
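Two of those artefacts sketched. The floor value and the cap minutes are placeholder assumptions, not tuned numbers.

```typescript
// Hallucination artefact: the confidence floor. Cost artefact: the
// per-tier daily cap. Both values below are illustrative assumptions.
const CONFIDENCE_FLOOR = 0.55;                 // assumed; tune on an eval set
const DAILY_CAP_MIN = { free: 60, paid: 480 }; // assumed per-tier minutes

function shouldDub(seg: { text: string; confidence: number }): boolean {
  // Below the floor we suppress, never dub: a silent gap beats a
  // confidently wrong sentence in the user's ear.
  return seg.confidence >= CONFIDENCE_FLOOR;
}

function withinQuota(usedMinToday: number, tier: "free" | "paid"): boolean {
  return usedMinToday < DAILY_CAP_MIN[tier];
}
```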
Don't claim a model-quality moat in your pitch deck
You use the same Deepgram, Gemini, Claude and ElevenLabs APIs every YC batch does. Sophisticated investors read "AI moat" as either dishonest or naïve. Lead with OS integration depth, latency-tuned UX, and the data-flywheel corpus — those are real and yours.
Don't promise on-device translation parity
WhisperKit is great for STT. Local MT (NLLB-200) is meaningfully worse than Gemini Flash on conversational, idiomatic, code-switched speech. Be honest: "private mode is degraded by design" beats a "fully on-device" claim that gets caught lying to a customer exactly once.
Checklist for this week.
Five concrete actions, one per section above:
- Write the one-paragraph AI thesis and run the delete-the-AI test on it.
- Date the capability: name which curve crossed which threshold in which quarter.
- Name your flywheel signal and check that it accumulates in your stack, not the vendor's.
- Write the one-page routing table: primary + fallback per stage, with cost numbers.
- Convert each of the four risk families into a named v1 artefact.
By Friday you should be able to read the thesis out loud without flinching, and the routing table should fit on one page.
«AI is the floor, not the ceiling.»