Impressive pocket-sized TTS
A little text-to-speech that could.
For quite some time now, I have been searching for a low-profile, low-power, almost real-time TTS, and it looks like I have found it.
Pocket TTS is a lightweight text-to-speech (TTS) model designed to run efficiently on CPUs. Most importantly, it can handle infinitely long text inputs, unlike similar or even bigger models that can handle a minute or two at most.
To install it, run
pip install pocket-tts
If you want to train your own voice model, you will need to go to Hugging Face, log in, and request access to PocketTTS. You will also need to generate a Hugging Face access token; just make sure it has READ permissions and is not a fine-grained token.
py -m pip install -U huggingface_hub
hf auth login
When prompted, paste the token into the terminal.
As a test, I found some samples of Scarlett Johansson's voice (I suspect they are AI-generated too) and trained PocketTTS on them. It took about a second and produced a safetensors file that’s 1.5 MB in size. That’s insane. Generation ran at 4.5x real time, so that’s good.
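For anyone who wants to check the speed figure on their own machine, here is a minimal sketch of how I would compute it: divide the length of the generated audio by the wall-clock time the generation took. The function names are my own, and the numbers in the example are made up for illustration, not measurements.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def realtime_speedup(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of wall-clock generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Illustrative numbers only: 9 s of audio generated in 2 s of wall-clock time.
print(f"{realtime_speedup(9.0, 2.0):.1f}x real time")  # prints "4.5x real time"
```

In practice you would time the generation yourself (for example with time.perf_counter() around the call) and pass wav_duration_seconds("test.wav") as the first argument.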
Training sample
Generated voice:
The trained model and demo code can be found on the BarnLabs GitHub.
To test it out, you don’t need to write any code; just run
pocket-tts generate --text "This is my cloned scarlet voice speaking now." --voice .\scarlet_voice.safetensors --output-path test.wav
in your terminal.
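If you later want to generate many clips from a script, one option is to wrap that same command with Python's subprocess module. This is only a sketch: it uses nothing but the flags from the command above, the helper names are hypothetical, and it assumes pocket-tts is on your PATH.

```python
import subprocess


def build_generate_command(text, voice_path, output_path):
    """Build the argument list for the `pocket-tts generate` command shown above."""
    return [
        "pocket-tts", "generate",
        "--text", text,
        "--voice", voice_path,
        "--output-path", output_path,
    ]


def generate(text, voice_path, output_path):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_generate_command(text, voice_path, output_path), check=True)


# Build (but do not run) the same command as in the post.
cmd = build_generate_command(
    "This is my cloned scarlet voice speaking now.",
    r".\scarlet_voice.safetensors",
    "test.wav",
)
print(cmd)
# To actually synthesize the file, call generate(...) with the same arguments.
```

Passing the arguments as a list avoids shell quoting problems, which matters once the text contains quotes or apostrophes.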
I’m getting closer and closer to building Her.


