AI Thesis: AI as the foundation, not the feature.
"AI-powered" is a marketing word. "AI-impossible" is a product fact. Your job in this lesson is to state, in one paragraph, the thing your product needs that did not exist 18 months ago — and then to be honest about which parts of that stack are commodity and which parts are yours.
why this matters for you
- honest: Babelio's moat is not model quality. You ride the same Deepgram / Gemini / ElevenLabs APIs every other team can buy. Saying otherwise on a pitch deck is the fastest way to lose a sophisticated investor.
- real moat: The defensible asset is OS-level audio integration, per-app routing + glossary corpus, and consumer distribution — the latency-tuned UX layer on top of commodity weights. Say so plainly; build accordingly.
What this lesson does / does not do.
Does
- Define "AI-impossible" as a sharper bar than "AI-improved".
- Explain the data flywheel — usage to data to model to product.
- Pick a primary model + fallback per stage with cost numbers.
- Name AI-specific risks and the mitigations that ship in v1.
Does not
- Pretend you will train a foundation model. You will not.
- Promise a research moat that doesn't exist on commodity APIs.
- Replace the audio-routing engineering work (that's the real moat).
- Cover go-to-market — that was Lesson 06.
The product fact that pre-dates the marketing.
"AI-powered" is the laziest phrase on the internet. The useful question is sharper: without the AI capability of the last 18 months, is your product impossible — or merely improved?
An AI-improved product gets a faster autocomplete, a better summary, a smarter search ranker. It would still exist without the model — just slightly worse. An AI-impossible product cannot exist at all if you remove the model: the core experience requires a capability that did not have a price-per-call eighteen months ago. Investors fund the second category. The first is feature-list dressing.
The test is simple. Delete the AI from your description. Can you still write a one-sentence product? If yes, you are AI-improved. If the sentence collapses into nonsense, you are AI-impossible — and your job is to identify which capability curve crossed which threshold in which quarter. That date is your "why now".
in your startup
- the bar: Babelio dubs any speaker from any app into your language under ~700 ms end-to-end, with original-speaker mute and identity-preserving voice. Strip the AI and you have an audio router with no audio — the product disappears.
- three curves: 2025 was the year all three latency floors fell at once: STT <300 ms (Deepgram Nova-3, Whisper-turbo), MT <200 ms (Gemini 2.5 Flash, GPT-4o), TTS <200 ms TTFB (ElevenLabs Flash v2.5, Cartesia Sonic). The budget arithmetic is sketched after this list.
- why not earlier: Pre-2024 cascaded pipelines hit 2–4 s end-to-end — that feels like a delayed echo, not interpretation. Phrase-based MT mangled mid-sentence code-switching. TTS was studio-grade or fast, never both.
- your one-liner: "Sub-800 ms streaming dub of arbitrary conversational speech from any desktop app is AI-impossible before mid-2025." That is the thesis. Put it on slide 4 of the deck and nowhere else.
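To see why the dates matter, add the three floors up. A minimal sketch using the stage ceilings quoted above; nothing here is measured, it is just the budget arithmetic the thesis rests on.

```typescript
// Why the thesis dates to 2025: the three stage ceilings quoted above
// only just fit inside a conversational latency budget.
const stageCeilingMs = {
  stt: 300, // Deepgram Nova-3 / Whisper-turbo streaming
  mt: 200,  // Gemini 2.5 Flash class, short conversational segments
  tts: 200, // ElevenLabs Flash v2.5 / Cartesia Sonic time-to-first-byte
};

const endToEnd = Object.values(stageCeilingMs).reduce((a, b) => a + b, 0);
console.log(`worst-case cascade: ${endToEnd} ms`); // 700 ms, inside the sub-800 ms bar
```

Any one stage at its 2023 level (say, 1 s STT) blows the whole budget, which is why the thesis names a quarter, not a vibe.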
Usage to data to model to product.
A flywheel is not a metaphor for "more users = more better". It is a specific four-arrow loop where each turn lowers the cost of the next turn and a competitor without the loop cannot catch up.
The classical loop is: product creates usage, usage emits a signal, the signal improves the model, the improved model differentiates the product. The honest version names what kind of signal: clicks for ranking, corrections for translation, completions for code, outcomes for diagnosis. Without naming the signal you have a slogan, not a flywheel.
The second honest test: does the signal accumulate where you sit, or where the foundation-model vendor sits? If OpenAI sees the corrections before you do, your flywheel powers their factory. If the corrections live in your stack as glossaries, prompt overrides, routing rules, eval sets — then the asset is yours, and a competitor with the same APIs starts from zero.
in your startup
- arrow 1: Usage → data. Opt-in transcript donation + the "wrong word" hotkey emit per-pair, per-app correction events. One user fixes "Aragorn"; another fixes "API". You capture both with app context. One possible event shape is sketched after this list.
- arrow 2: Data → model. Corrections become per-language-pair idiom-tuning eval sets, per-app glossaries, and style preferences (gaming slang vs enterprise register). Routed as prompts and post-edit rules, not weight updates.
- arrow 3: Model → product. Per-context routing improves: gaming Discord uses slang-tuned MT prompts, Zoom uses a formal register, Twitch uses streamer-name preservation. Hallucination rate drops monotonically per app.
- where it lives: The corpus is yours, not Deepgram's or Gemini's. Foundation weights are commodity; the routing config + glossary corpus is the proprietary asset. Distribution-bound, not data-bound — but compounding nonetheless.
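One way the arrow-1 events could be shaped. A minimal sketch; every field and type name here is an illustrative assumption, not a schema Babelio ships.

```typescript
// Sketch of an arrow-1 correction event and its arrow-2 use.
// All names are illustrative assumptions, not a fixed schema.
interface CorrectionEvent {
  langPair: string;       // e.g. "ja->en"
  app: string;            // foreground app at capture time, e.g. "Discord"
  sourceText: string;     // what the MT stage saw
  machineOutput: string;  // what Babelio dubbed
  userFix: string;        // what the "wrong word" hotkey recorded
  kind: "glossary" | "register" | "name"; // how arrow 2 routes it
}

// Arrow 2 in miniature: a correction becomes a per-app glossary pair,
// applied as a prompt / post-edit rule rather than a weight update.
function toGlossaryEntry(e: CorrectionEvent): [string, string] {
  return [e.machineOutput, e.userFix]; // e.g. ["Alagorn", "Aragorn"]
}
```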
Cheap by default, smart on demand.
A serious AI product is not one model behind one prompt. It is a pipeline of stages with a primary and a fallback per stage, routed by signal: confidence, cost ceiling, latency tail, user tier.
The cheap-to-smart pattern works because most requests are easy and a small percentage need the heavyweight. Route by default to the cheapest model that meets the latency and quality bar; reroute the segment to a more expensive model when a confidence signal trips a threshold. This pattern is how you keep gross margin alive while still delivering the best result on the long tail.
Two architectural rules. First, every model goes behind an adapter — never call a vendor SDK directly from product code, or you lock yourself into their roadmap and their price list. Second, every stage has a local-only fallback, even if degraded. If your only option is "cloud or nothing", a price hike or an outage is a company-ending event.
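Both rules in miniature. A sketch that assumes a streaming STT stage; the interface shape is invented for illustration, and no real vendor SDK signature appears.

```typescript
// Rule 1: every model behind an adapter. Product code sees only this
// interface; vendor SDKs live inside the adapter implementations.
// The shape is an assumption for illustration.
interface SttAdapter {
  transcribe(chunk: Float32Array): Promise<{ text: string; confidence: number }>;
}

// Rule 2: every stage carries a local-only fallback, even if degraded.
class SttStage {
  constructor(
    private primary: SttAdapter, // e.g. a Deepgram-backed adapter
    private local: SttAdapter,   // e.g. a WhisperKit-backed adapter
  ) {}

  async run(chunk: Float32Array): Promise<{ text: string; confidence: number }> {
    try {
      return await this.primary.transcribe(chunk);
    } catch {
      // Outage or price-driven cutoff: degrade, don't die.
      return this.local.transcribe(chunk);
    }
  }
}
```

Swapping Deepgram for GPT-4o-transcribe is then one new adapter, not a product rewrite.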
in your startup
- default path: VAD: Silero, run locally. STT: Deepgram Nova-3 multilingual streaming ($0.0092/min). MT: Gemini 2.5 Flash ($0.30 in / $2.50 out per 1M tokens). TTS: Cartesia Sonic (~$0.03/min). The cheapest viable path at <700 ms end-to-end; it is restated as config after this list.
- smart reroute: On low MT logprob or a detected idiom token, re-route the segment to Claude Haiku 4.5 ($1 in / $5 out per 1M tokens). Fast enough to stay in budget; smart enough to fix the long-tail miss.
- premium tier: Paid users get ElevenLabs Flash v2.5 ($0.05 per 1k chars) for voice-cloned dub and a tighter latency SLA. Free tier stays on Cartesia.
- offline / privacy: WhisperKit (Whisper-large-v3-turbo) on the Apple Neural Engine for local STT — 0.46 s latency, 2.2% WER. Piper / Kokoro for local TTS. MT stays cloud unless full-offline mode (degraded NLLB-200).
- cost truth: A heavy user (30 hr audio/mo) on the pure-cloud Cartesia path runs ≈ $55/mo in COGS. Local STT on Apple Silicon drops that to ~$41/mo. Pricing must be usage-tiered ($9.99 / 10 hr, $24.99 / 40 hr) — flat $9.99 unlimited dies.
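The same table as data. Prices are the per-unit figures quoted above; the logprob threshold and the config shape are assumptions to tune against your own eval set.

```typescript
// The default / reroute / premium table above, as one config object.
// Prices are the quoted per-unit figures; the threshold is an assumed
// placeholder, to be calibrated on a per-language-pair eval set.
const mtRouting = {
  default: { model: "gemini-2.5-flash", usdPer1MTokIn: 0.3, usdPer1MTokOut: 2.5 },
  smart:   { model: "claude-haiku-4.5", usdPer1MTokIn: 1.0, usdPer1MTokOut: 5.0 },
  rerouteWhen: (avgLogprob: number, hasIdiomToken: boolean) =>
    avgLogprob < -1.2 || hasIdiomToken, // -1.2 is an assumed threshold
};

const ttsRouting = {
  free:    { model: "cartesia-sonic", usdPerMin: 0.03 },
  premium: { model: "elevenlabs-flash-v2.5", usdPer1kChars: 0.05, voiceClone: true },
};
```

A one-page file like this is also the artefact the lesson asks for: routing decisions become a diff, not a meeting.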
Failure modes name themselves first.
An AI product fails in patterned, predictable ways. The mature founder enumerates the failure modes before the user does, and ships the mitigation in v1 — not "on the roadmap".
Four risk families come up in every conversational-AI product: vendor lock-in (your supply chain), cost spikes (your unit economics), hallucination (your trust budget), and privacy (your regulatory surface). Each has a known structural mitigation. Each must be visible in the architecture diagram, not buried in the FAQ.
The discipline is to convert each risk into a concrete shipping artefact: an adapter interface, a per-user quota, a confidence gate, a retention policy. If a risk has no artefact, it is a wish. Investors and serious customers can tell the difference in under a minute.
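One way to make "no artefact is a wish" structural: a sketch in which the artefact field is mandatory, so an unmitigated risk does not typecheck. All names are illustrative.

```typescript
// "If a risk has no artefact, it is a wish": encode that as a type.
// The artefact field is required, so a risk without a shipping
// mitigation is a compile error, not a roadmap item.
type Risk = {
  family: "lock-in" | "cost-spike" | "hallucination" | "privacy";
  artefactInV1: string; // the concrete thing that ships, per the list below
};

const registry: Risk[] = [
  { family: "lock-in",       artefactInV1: "multi-provider adapter interface" },
  { family: "cost-spike",    artefactInV1: "per-user daily quota + idle pause" },
  { family: "hallucination", artefactInV1: "confidence gate before TTS" },
  { family: "privacy",       artefactInV1: "zero-retention default + per-app allowlist" },
];
```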
in your startup
- lock-in: Multi-provider adapter pattern for STT / MT / TTS in v1. Deepgram tripling its price = swap to GPT-4o-transcribe in a day. No vendor SDK reaches product code directly.
- cost spike: Hard per-user daily cap tied to plan tier. Idle detection auto-pauses Babelio after 10 min with no change in foreground-app audio. Overnight YouTube autoplay can't blow a $200 STT bill.
- hallucination: Confidence floor — segments below threshold are suppressed, never dubbed. Aggressive Silero VAD gate before STT. Optional source-language ghost track at -18 dB so the user always has a reality check. This floor and the daily cap are sketched after this list.
- privacy: No audio persisted by default. Zero-retention contracts with Deepgram / ElevenLabs / Cartesia. Local-STT default on Apple Silicon. Per-app allowlist UI (Slack huddles default off). Voice-clone requires an explicit consent phrase — no passive capture.
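Two of those artefacts sketched. The floor value and the cap minutes are placeholder assumptions, not tuned numbers.

```typescript
// Hallucination artefact: the confidence floor. Cost artefact: the
// per-tier daily cap. Both values below are illustrative assumptions.
const CONFIDENCE_FLOOR = 0.55;                 // assumed; tune on an eval set
const DAILY_CAP_MIN = { free: 60, paid: 480 }; // assumed per-tier minutes

function shouldDub(seg: { text: string; confidence: number }): boolean {
  // Below the floor we suppress, never dub: a silent gap beats a
  // confidently wrong sentence in the user's ear.
  return seg.confidence >= CONFIDENCE_FLOOR;
}

function withinQuota(usedMinToday: number, tier: "free" | "paid"): boolean {
  return usedMinToday < DAILY_CAP_MIN[tier];
}
```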
Don't claim a model-quality moat in your pitch deck
You use the same Deepgram, Gemini, Claude and ElevenLabs APIs every YC batch does. Sophisticated investors read "AI moat" as either dishonest or naïve. Lead with OS integration depth, latency-tuned UX, and the data-flywheel corpus — those are real and yours.
Don't promise on-device translation parity
WhisperKit is great for STT. Local MT (NLLB-200) is meaningfully worse than Gemini Flash on conversational, idiomatic, code-switched speech. Be honest: "private mode is degraded by design" beats a "fully on-device" claim that gets caught lying to a customer exactly once.
Checklist for this week.
Five concrete actions, one per section above:
- Write the one-paragraph AI thesis and run the delete-the-AI test on it.
- Date the capability: name which curve crossed which threshold in which quarter.
- Name your flywheel signal and check that it accumulates in your stack, not the vendor's.
- Write the one-page routing table: primary + fallback per stage, with cost numbers.
- Convert each of the four risk families into a named v1 artefact.
By Friday you should be able to read the thesis out loud without flinching, and the routing table should fit on one page.
«AI is the floor, not the ceiling.»