Embedded TTS: Tivoization Trap & Python Tax

This article explores two critical challenges when deploying text-to-speech models on embedded systems. These are the Tivoization Trap (GPL licensing issues caused by the eSpeak-NG dependency) and the Python Tax (runtime performance and resource costs of Python-based implementations).

Picture this: you start playing around with TTS (Text-To-Speech) through open-source models (like Kokoro, an incredible 82M-parameter model that punches way above its weight) and Python, because it is convenient to work on your laptop. It all works quite well, and at the end of the day you are ready to deploy on a Raspberry Pi, right? Right?

Well, not so fast. Getting it to work is only half of the job, or, more accurately... ten percent. You quickly realize the shortcuts you took have led you to a bloated runtime environment and, worse... compliance issues.

The first one is pretty obvious. Python is not great at real-time performance, which makes it a poor fit for latency-sensitive applications like TTS. When companies start replacing actual humans with machines, the number one UX killer is a machine that does not respond in a timely manner.

The second issue is not one you would expect from open-weight models: compliance. You are probably thinking it does not make sense for the model to lack a commercially-permissive license, considering the state of the industry and the fact that basically all of these models ship under one.

And you are right. To see why you will still run into compliance issues, you first need to understand how the model works.

How Kokoro-82M Works

Firstly, Kokoro-82M is not an end-to-end model, meaning it cannot go straight from text to audio. Instead, it requires some extra components to make it work.

Kokoro-82M is not an end-to-end TTS solution

The process from text to voice mainly involves converting the text into phonemes (a step known as G2P, Grapheme-to-Phoneme), which Kokoro-82M then turns into a mel-spectrogram, an intermediate format that a vocoder converts into sound. Let us go slowly over these terms:

  • Phoneme - The unit of pronunciation that words are converted into. This step is needed because the same word may be read in different ways depending on the surrounding context. An easy example is the word read, which can be pronounced in two different ways even though it is written the same. That is easy for us to figure out, since we have learned the language, but a TTS model needs a helping hand to resolve the ambiguity.

This means the model does not need to worry about the ambiguity of the word read, as it receives as input either of the phoneme sequences /ɹ iː d/ or /ɹ ɛ d/, for present and past tense, respectively. Yes, this is one of the reasons the model is so small, but honestly that is a good compromise: the model only has to pronounce the phonemes as well as possible, adding the appropriate pauses and pitch, resulting in better-sounding audio at the end.

  • Mel-Spectrogram - As with every spectrogram, it is a representation of the frequency content of a signal over time, computed via Short-Time Fourier Transforms (STFT).

The mel part comes from a mel filterbank, a series of overlapping triangular filters spaced to match human hearing, which has higher resolution at low frequencies (roughly linear below 1 kHz) and lower resolution at high frequencies (logarithmic behavior).

What a Mel-Spectrogram Looks Like
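To make the mel filterbank concrete, here is a minimal NumPy sketch of how the triangular filters are laid out: evenly spaced on the mel scale, so they end up narrow at low frequencies and wide at high ones. The parameter values (80 mel bands, 1024-point FFT, 22.05 kHz) are illustrative defaults, not Kokoro-82M's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050):
    # Filter centers are spaced evenly on the mel scale, not in Hz
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        # Rising edge of the triangle, then the falling edge
        for j in range(left, center):
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

fb = mel_filterbank()
# A mel-spectrogram is just this matrix applied to the STFT power spectrum:
power = np.abs(np.random.randn(513, 10)) ** 2   # fake |STFT|^2, 10 frames
mel_spec = fb @ power                           # shape: (80 mel bands, 10 frames)
```

Because the centers are equidistant in mel rather than in Hz, the low-frequency filters cover only a bin or two while the high-frequency ones span dozens, which is exactly the "higher resolution at low frequencies" behavior described above.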

Why Kokoro-82M Is a Production Nightmare

Now that we have an overview of how Kokoro-82M works and the components it needs around it, the reason it can become a nightmare starts to come to light: the external dependencies the model needs to work. The conversion into phonemes is usually carried out via eSpeak-NG, which is licensed under GPLv3.

For hobbyists it is not a big deal, but for a medical device or robotics company that relies on TTS to communicate with its users, it becomes an issue. Shipping GPLv3-licensed software in a product can require releasing the source of the entire derived work, and GPLv3's anti-tivoization terms additionally require providing the installation information needed to run modified versions on locked-down devices. This is also known as GPL poison, and it closes our circle on the legal compliance issue for Kokoro-82M.

Why Not Replace eSpeak-NG?

eSpeak-NG is basically the phonemization back-end of most open-source TTS models, so it became an obvious choice for Kokoro-82M. An easy question to ask is: if eSpeak-NG is the "tivoization" enemy, why not get rid of it? Unfortunately, it is not so easy.

Phonemizers like eSpeak-NG each have their own phoneme dialect, which Kokoro-82M was trained on. This means that if you want to replace it, you need to mimic it exactly, producing the same phonemes Kokoro-82M expects. There are already some GPL-free drop-in replacements for eSpeak-NG, but they are not as good in terms of:

  • Performance - eSpeak-NG is written in C, so it is blazing fast, making it extremely appealing for real-time applications; the replacements struggle to match it.

  • Quality - Tools like OpenPhonemizer are not there yet in terms of text normalization, such as converting $10 to ten dollars, Aug 20 to August twentieth and 6'1 to six foot one. Currency, dates and units remain an issue.
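To illustrate why text normalization is harder than it looks, here is a toy normalizer covering just two of the patterns above. Every name here is made up for this sketch; a production front-end needs many more rules (ordinals for dates, units, ranges, locales), which is precisely where the GPL-free replacements fall short today.

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]

def _num_words(n: int) -> str:
    # Spell out integers below 100; larger numbers are left as digits in this sketch
    if n < 20:
        return _ONES[n]
    if n < 100:
        return _TENS[n // 10] + ("-" + _ONES[n % 10] if n % 10 else "")
    return str(n)

def normalize(text: str) -> str:
    # "$10" -> "ten dollars"
    text = re.sub(r"\$(\d+)",
                  lambda m: _num_words(int(m.group(1))) + " dollars", text)
    # "6'1" -> "six foot one"
    text = re.sub(r"(\d+)'(\d+)",
                  lambda m: _num_words(int(m.group(1))) + " foot "
                            + _num_words(int(m.group(2))), text)
    return text

print(normalize("He is 6'1 and paid $10"))
# -> He is six foot one and paid ten dollars
```

Even this tiny version has edge cases (an apostrophe in "it's 1" would match the feet-and-inches rule), which hints at why full normalization is still an open gap in eSpeak-NG replacements.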

The Technical Trap of Relying on Python

Although not as messy to untangle, relying on Python to provide the TTS functionality works well on a personal computer but fails miserably on embedded systems, for a multitude of reasons, namely:

  • Worse Real-Time Performance - With Python, you get slower, less responsive audio generation, making the device feel like it is lagging behind the user rather than providing a fluid experience.

  • Pay More For The Same (or Less) - Nothing is worse than having to buy beefier hardware because of software. On small embedded devices, the roughly 300 MB of RAM that Python plus PyTorch need will eat around 10 to 20% of your total RAM budget before the robot has said a single word.

  • Dependency Hell - Bringing the production environment up to date on every release becomes a mess. Python and the packages needed to make TTS work are an anchor slowing you down, especially if you are trying to build a Buildroot or Yocto image.
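You can measure the RAM point above on your own target instead of taking the numbers on faith. A minimal sketch, assuming a Unix-like system (the resource module is unavailable on Windows); note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS:

```python
import resource
import sys

def peak_rss_mb() -> float:
    # Peak resident set size of this process, converted to megabytes
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

baseline = peak_rss_mb()
print(f"bare interpreter: {baseline:.1f} MB")

# On a machine with PyTorch installed, uncomment to see the jump:
# import torch
# print(f"after importing torch: {peak_rss_mb():.1f} MB")
```

Running this on the actual board, before and after importing your TTS stack, tells you exactly how much of the RAM budget the Python runtime is costing you.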

Conclusion: Pay Now or Pay Later

Even if it moves your project along faster at first, if you are building a professional TTS application at the edge, tackling compliance and real-time performance is paramount, and the longer you put it off, the worse it will get.


Building in public. Follow my journey at InvisiblePuzzle, where I document how I'm building B2B automation tools while working full-time.

Tags: TTS, Text-to-Speech, Embedded Systems, Python, Machine Learning, GPL, Open Source, Kokoro, Real-time Systems, IoT, Raspberry Pi, eSpeak-NG, Licensing

