Frontier Systems and the Future of Voice AI · Mati Staniszewski

2026-06-09 · A faithful, transcript-grounded reading by PodLens

Original episode: https://youtu.be/vfF011ko89o?si=tTB8c62w2U6F1IBt

Voice AI · ElevenLabs · Speech Synthesis · Multimodal Interaction · Frontier Systems

What This Episode Is About

Mati Staniszewski, co-founder and CEO of ElevenLabs, joins host Anj to discuss the history of voice AI and audio generation, the evolution of system architectures, commercialization strategy, and frontier trends. The conversation opens with ElevenLabs' early, Discord-based, community-driven product-led growth (PLG) model, then traces the company's path from its original goal of multilingual AI dubbing to its decision to master high-quality monolingual text-to-speech (TTS) first. Mati Staniszewski weighs cascaded against fused (end-to-end) architectures in real-world enterprise scenarios, focusing on the trade-offs among reliability, control, and latency, and describes how ElevenLabs grew rapidly to over $430 million ARR. He also covers content safety, watermarking, on-device deployment, and partnerships with governments (such as Ukraine's Diia app), and sketches a vision of ElevenLabs as the conversational interaction infrastructure for enterprises.

Timeline Topic Map

[00:07-01:32] Anj introduces Mati Staniszewski, recalling how they met three years ago through Nat Friedman, who joined as an angel investor.
[01:32-02:36] Mati Staniszewski reveals that ElevenLabs operated entirely on Discord in its early days, and discusses the evolution of community interaction tools with Anj, the former head of the Discord platform.
[02:36-04:59] Exploring the phenomenon of gaming communities as the birthplace of AI innovation, analyzing how ElevenLabs drew inspiration from Midjourney's PLG model and established a voice marketplace.
[04:59-07:23] Tracing the original inspiration of ElevenLabs: solving the terrible experience of single-voice, single-language dubbed films in Poland, which led to the decision to develop high-quality multilingual AI dubbing.
[07:23-09:25] Explaining the three core models of the dubbing pipeline (speech-to-text, machine translation, and text-to-speech), and the strategic choice to prioritize voice restoration and monolingual TTS based on user feedback.
[09:25-12:44] Technical details of the Cascaded Workflow: combining the context prediction capabilities of Large Language Models (LLMs) with innovative voice feature extraction to achieve natural emotional expression.
[12:44-14:45] Early R&D exploration: seeking inspiration from open-source projects (such as Tortoise-TTS, developed by James Betker) and academic papers, while confronting Tortoise-TTS's practical limitations of slow speed and instability.
[14:45-17:37] Early compute budgets and patent strategies: training early models with a $100,000-level compute budget, and why they decided to forgo patent applications based on legal advice.
[17:37-22:03] ElevenLabs' technical evolution from 2022 to 2026: from monolingual TTS to multilingual cloning, then to AI video localization (such as Javier Milei's UN speech) and real-time voice agents.
[22:03-23:41] Exploring the next stage of voice AI evolution: how to give systems semantic-level vocal expressiveness and emotional understanding.
[23:41-26:39] The debate between cascaded and fused architectural paths: analyzing the security and control advantages of cascaded systems over fused systems under high enterprise reliability requirements.
[26:39-28:50] Breakthroughs in emotional control: ElevenLabs' heavy investment in data labeling, enabling bidirectional transfer and controllable adjustment of tones (happy, sad, anxious) within a cascaded system.
[28:50-31:17] Reliability requirements of cascaded systems: complex orchestration involving authentication, multi-tool calling, and multi-factor verification in enterprise scenarios (such as airline rebooking).
[31:17-35:13] Industry coopetition and ecosystem: Anj emphasizes the open and collaborative attitude of leaders in systems development, sharing the deep collaboration and mutual investment between ElevenLabs and Sesame (led by Brendan).
[35:13-39:30] Commercial growth miracle: ElevenLabs' extraordinary journey of scaling from $1 million ARR to $430 million ARR in 36 months, while consistently keeping the company organized as autonomous small teams of fewer than 10 people.
[39:30-42:35] Commercial pricing logic: rejecting pricing based on compute costs, and insisting on reverse-engineering pricing and packaging based on the actual value delivered to customers (capturing 1/10 of the value).
[42:35-44:33] Security and anti-counterfeiting technology: preventing voice cloning abuse, promoting digital watermarking and public detection tools; emphasizing that voice biometrics (Voice Authentication) is insecure for financial-grade authentication.
[44:33-46:32] Technical bottlenecks over the next five years: how to handle specific nomenclature across massive heterogeneous scenarios and personalize interaction speed and tonal preferences.
[46:32-48:23] Training difficulties of cascaded vs. fused architectures: cascaded systems require pre-training emotional parameters, while fused systems face extremely high barriers in merging text and audio tokens and are limited by the capabilities of open-source base models.
[48:23-51:39] Five-year vision and social responsibility: striving to become one of the 3-5 major conversational cloud platforms globally, and sharing philanthropic projects aimed at reconstructing voices for patients with speech loss, such as ALS.
[51:39-54:51] Sovereign-level deployment and partnership with Ukraine: collaborating with the Ukrainian government in a wartime environment to integrate voice services into the Diia app, providing flat and agile digital government services.
[54:51-57:15] International competition and security defense: defending against distillation attacks from other regions, and establishing core barriers through localized dialect variations and global service quality.
[57:15-1:01:45] Film and TV studios' attitudes toward AI voice: analyzing the shift from end-to-end voice generation to middle-to-middle collaborative tools, resolving the conflict between artistic fidelity and financial royalty distribution (IP Royalty).
[1:01:45-1:06:16] Breakthroughs in on-device deployment: revealing ElevenLabs' latest progress in successfully running models on local devices, and exploring the balance between experience and privacy in hybrid cloud-edge architectures.

Core Insights List

Community-driven and PLG models are the best paths for AI startups to gather user feedback and discover unpredictable use cases. ElevenLabs maintained a tight closed loop with creators and developers through Discord in its early days, a model that helped them rapidly validate quality and foster unexpected application scenarios. [03:37-04:03] | Type: Insight
Full AI dubbing depends on three models working together (transcription, translation, and text-to-speech), and until the technology matures, the scope has to be narrowed strategically around user pain points. ElevenLabs initially set out to solve multilingual dubbing directly, but research showed that the technology of the time could only produce a crude "Frankenstein" version. They therefore narrowed their R&D focus to monolingual TTS, the lowest common denominator. [07:23-09:25] | Type: Fact
The breakthrough naturalness of text-to-speech (TTS) comes from combining context-awareness with de-parameterized voice feature extraction. ElevenLabs broke away from the traditional practice of predicting voices by hard-coding parameters like gender, accent, and age, and instead introduced the context prediction mechanism of Large Language Models, allowing the model to autonomously extract voice features. [11:12-12:16] | Type: Insight
Compute constraints and rapid technological iteration make applying for patents in the early stages of AI meaningless. ElevenLabs possessed only tens of thousands of dollars in compute in its early days. Faced with high patent application fees, they decided to forgo applying, realizing that rapid technological shifts would quickly render patents obsolete and that defensive patents could not stop rapid iteration. [15:57-17:24] | Type: Insight
In high-reliability enterprise scenarios, Cascaded Architecture remains a better choice than Fused Architecture for the next few years. Although a cascaded system has higher latency than a fused one, it is far more auditable, makes it easier to place security guardrails around multi-step authentication and tool calling, and is better suited to controlled adjustment of emotional parameters. [23:41-26:01] | Type: Prediction | Limitation: Mati Staniszewski mentioned that if the sole goal is ultra-low latency, or for companion-like scenarios with no action-execution requirements, fused systems would be more appropriate, and hybrid cloud-edge or dynamic switching might emerge in the future.
AI product pricing should be completely decoupled from compute costs and reverse-engineered based on the value created for customers. A reasonable pricing model should aim to capture one-tenth of the total economic value created for the customer. [42:06-42:35] | Type: Insight
Voice-based authentication cannot serve as a secure means of identity verification. Now that AI voice cloning is cheap and widely available, voice-based account authentication at traditional financial institutions is no longer secure, and the industry must rapidly transition to other authentication solutions. [43:41-44:01] | Type: Insight
The best form for AI in cultural and creative fields is the "middle-to-middle" collaborative tool, rather than "end-to-end" direct generation. Film and TV studios resist AI mainly because end-to-end generation easily collapses into low-quality content (AI slop). The keys to adoption are the fine-grained directorial control that middle-to-middle tools offer (such as controlling the tone and speed of individual sentences) and a settled mechanism for sharing revenue. [59:31-1:01:11] | Type: Insight
Future voice interactions will be dominated by a few Conversational Cloud Platforms globally. Just as the current cloud computing market has three or four major cloud providers, future interactions between enterprises and users will converge on 3 to 5 platforms focused on conversation orchestration and knowledge integration. [49:45-50:39] | Type: Prediction
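
The value-based pricing rule from the list above (capture roughly one-tenth of the value created for the customer, independent of compute cost) can be illustrated with a short arithmetic sketch. The function name, parameter names, and the dollar figure are hypothetical and chosen only for illustration; the episode gives the one-tenth ratio but no formula.

```python
def value_based_price(customer_value: float, capture_share: float = 0.10) -> float:
    """Derive a price from the value delivered to the customer.

    Compute cost is deliberately not an input: the price is
    reverse-engineered from customer value alone (an illustrative
    reading of the rule, not ElevenLabs' actual pricing model).
    """
    if customer_value < 0:
        raise ValueError("customer_value must be non-negative")
    if not 0 < capture_share <= 1:
        raise ValueError("capture_share must be in (0, 1]")
    return customer_value * capture_share


# Hypothetical example: a deployment judged to create $500,000 of annual value.
annual_value = 500_000.0
price = value_based_price(annual_value)
print(f"Price at a one-tenth capture share: ${price:,.0f}")
```

Under this reading the customer keeps the remaining nine-tenths of the value, which is what separates the approach from cost-plus pricing.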

Internal Tensions and Self-Corrections

[01:52] vs [02:20]: To avoid the hierarchical reporting disease of large corporations, the founding team was allergic to meetings and insisted on running the company on Discord. However, after a few months of actual operation, they compromised with reality due to difficulties in organizing information flow and migrated to Slack, which is easier for threaded discussions.
[07:23] vs [09:25]: The original founding intention was to tackle fully automatic, end-to-end multilingual dubbing. However, after deep research and user surveys, they found that the technology at the time was only sufficient to piece together a crude "Frankenstein"-style result. Consequently, they decisively narrowed their scope to first master monolingual TTS, the underlying lowest common denominator.

Plain English Retelling

Imagine watching a foreign movie in Poland, where whether it's a macho male lead or a gentle female lead, the entire film is read in a monotonous, emotionless middle-aged male voice. This sounds like a disaster, but for ElevenLabs co-founder Mati Staniszewski, this was precisely the starting point for them to reshape the audio world with AI.

To conquer the monster of "perfect dubbing," you need to train three little monsters at once: speech recognition (understanding), machine translation (translating correctly), and speech synthesis (reading naturally). In 2022, before the LLM boom, forcing these three together would only yield a stuttering, emotionless "Frankenstein" dub. Mati Staniszewski and his partner Piotr showed sharp product intuition: rather than overextend themselves, they narrowed their scope and focused entirely on the core "lowest common denominator," text-to-speech (TTS). They realized that the context prediction capabilities of Large Language Models (LLMs) could be brought into speech synthesis, so that the AI does not just read words but chooses its tone the way a human actor would, based on the context of the situation (such as happiness, sadness, or a conversational setting).

ElevenLabs' rise was accompanied by an unusually open "coopetition" mindset. Instead of viewing peer startups (like Sesame) as mortal enemies, they supported each other through angel investments and technical exchanges, and built their real defenses on continuous model iteration, meticulous data labeling, and "Cascaded Workflows" tailored for large enterprises. Why not use the currently popular, one-step end-to-end fused model? Because for major clients like airlines and banks, an AI that hallucinates or talks nonsense is absolutely unacceptable. Although the cascaded system is slightly slower, it acts like an honest programmer: every step (transcribing, thinking, synthesizing) is crystal clear, allowing for safety guardrails, on-demand external database calls, and two-factor authentication.
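
The cascaded flow described above (transcribing, thinking, synthesizing, with a guardrail and an audit trail between steps) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ElevenLabs' implementation: every function and class name here is hypothetical, the three stages are toy stand-ins for real speech-to-text, LLM, and text-to-speech models, and "audio" is just UTF-8 bytes so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One conversational turn in a hypothetical cascaded voice agent."""
    audio_in: bytes
    verified: bool = False  # has the caller passed authentication?
    trace: List[str] = field(default_factory=list)  # per-stage audit log


def transcribe(audio: bytes) -> str:
    # Stand-in for a speech-to-text model (assumption: audio is UTF-8 text).
    return audio.decode("utf-8")


def decide(text: str, verified: bool) -> str:
    # Stand-in for the LLM / tool-calling step, with a simple guardrail:
    # an action such as rebooking is refused until the caller is verified.
    wants_rebooking = "rebook" in text.lower()
    if wants_rebooking and not verified:
        return "I need to verify your identity before changing a booking."
    if wants_rebooking:
        return "Your flight has been rebooked."
    return "How can I help you today?"


def synthesize(text: str) -> bytes:
    # Stand-in for a text-to-speech model.
    return text.encode("utf-8")


def run_turn(turn: Turn) -> bytes:
    # Each stage's intermediate text is recorded, which is what makes
    # a cascaded system easy to audit compared with a fused model.
    text = transcribe(turn.audio_in)
    turn.trace.append(f"stt: {text}")
    reply = decide(text, turn.verified)
    turn.trace.append(f"llm: {reply}")
    audio_out = synthesize(reply)
    turn.trace.append(f"tts: {len(audio_out)} bytes")
    return audio_out


turn = Turn(audio_in=b"Please rebook my flight")
run_turn(turn)
for entry in turn.trace:
    print(entry)
```

Because the text between stages is inspectable, the authentication check sits in ordinary code rather than inside a model. A fused model trades that visibility for lower latency, which matches the trade-off described in the episode.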

Commercially, ElevenLabs' explosive growth is also a classic product lesson. Instead of calculating how much electricity it costs to run a model once, they look at how much business growth the voice brings to the customer and then capture only one-tenth of that value (value-based reverse pricing). Looking ahead, ElevenLabs is working to fit complex voice models onto local devices while striving to become a "conversational cloud platform" that connects enterprise knowledge bases and interaction channels. Whether it is a philanthropic project to restore voices for ALS patients or helping the Ukrainian government build agile digital governance in the Diia app amid war, voice AI is turning from a fun audio toy into true sovereign-level and civilization-level cognitive infrastructure.

Recommended Segments for Deep Listening

[01:52-02:20] Listen to Mati Staniszewski recall the early anecdotes of forcing the company to run on Discord to escape the meeting and email bombardment of traditional large corporations. You can hear the genuine struggle of two tech founders facing team scaling and information overload.
[11:12-12:16] Mati Staniszewski breaks down how ElevenLabs abandoned traditional "hard-coded voice parameters" (such as gender, age, accent) and instead allowed the model to autonomously learn emotional features using LLM context. This segment is highly dense with information, showcasing the technical trajectory of audio generation breaking through at the mechanism level.
[32:54-34:29] Host Anj emotionally shares the inside story of the collaboration between Mati Staniszewski and Brendan, founder of peer company Sesame, who chose to share information and invest in each other despite competitive pressures. You can feel the warmth of a rare "frontier exploration community" amidst Silicon Valley's wild Darwinism.
[51:39-53:39] Mati Staniszewski recounts his journey to Kyiv to work with the Ukrainian government team under wartime emergency conditions without red tape, deploying voice services in the Diia app in a flat, agile manner. His tone reveals a sense of mission that transcends technology and business itself.

Resonances with past episodes

Corroborates → Frontier Systems Compute and the Context Loop War · Anjney Midha
Both explain from the dimensions of application performance and underlying mechanisms the deep reason why AI cannot achieve fully automated generation in creative fields: because aesthetics and creativity lack clear validation metrics, models easily fall into mediocrity or produce low-quality slop without human intervention. Therefore, they must adopt the form of collaborative tools that retain human directorial control.
This [59:31-1:01:11] The best application form of AI in cultural and creative fields is "middle-to-middle" collaborative tools, rather than "end-to-end" direct generation. Direct generation easily leads to low-quality content collapse (AI Slop), whereas fine-grained directorial control is the key to implementation.
Related [38:39-39:35] The pace of progress in Reinforcement Learning (RL) at the frontier is directly proportional to the ease of verification in the domain. In hard-to-verify fields like aesthetics and creative writing, AI struggles to self-improve and easily falls into mediocrity and hallucination.
Complements → Frontier Systems Compute and the Context Loop War · Anjney Midha
The former provides a micro-level practical path for the latter: the tight user interaction closed-loop established through community- and product-led growth models is precisely the concrete means for startups to macroscopically acquire and monopolize the "contextual feedback loop" to capture ultimate value.
This [03:37-04:03] Community-driven and product-led growth (PLG) models are the best paths for AI startups to gather user feedback and discover unpredictable use cases, such as maintaining a tight closed loop with creators and developers through Discord communities.
Related [24:51-27:48] Ultimate value capture in the AI industry depends on sovereign or exclusive control over specific contexts and environments; companies with unique and protected contextual feedback loops will win, driven by the compute flywheel.
Isomorphism ← The Discipline of Value Delivery per Gigawatt · Amin Vahdat
The two are highly isomorphic in value measurement, both advocating for breaking away from traditional 'resource- and cost-oriented' thinking (such as compute costs, gigawatts, FLOPs) and shifting toward 'actual value ultimately delivered to the user' as the core standard for measuring system efficiency and commercial success.
This [42:06-42:35] AI product pricing should be completely decoupled from compute costs and reverse-engineered based on the value created for customers. A reasonable pricing model should aim to capture one-tenth of the total economic value created for the customer.
Related [04:55-05:08] The true measure of compute capacity is the actual value delivered per dollar or user activity (daily active users), rather than simply gigawatts or hardware FLOPs.
Corroboration ← Product Building and Career Evolution in the AI Era · Nikhyl Singhal
Both point out that in a rapidly evolving technological and market environment, high-frequency iteration speed is a startup's most core defensive barrier, and static initial states or legal patent protection will quickly fail in the face of high-speed iteration.
This [15:57-17:24] Compute constraints and rapid technological iteration make applying for patents in the early stages of AI meaningless. ElevenLabs initially had only tens of thousands of dollars in compute. When faced with high patent application fees, they abandoned the application because they realized rapid technological turnover would quickly make patents obsolete, and defensive patents could not stop rapid iteration.
Related [13:54] Product Iteration Speed determines a product's success or failure more than its initial state, constituting a core advantage for startups against large companies.

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.