Impressive pocket-sized TTS

A little text-to-speech that could.

Feb 27, 2026

For quite some time now, I have been searching for a low-profile, low-power, almost real-time TTS, and it looks like I have found it.

Pocket TTS is a lightweight text-to-speech (TTS) model designed to run efficiently on CPUs. Most importantly, it can handle infinitely long text inputs, unlike similar or bigger models that can handle a minute or two at most.

You can install it with pip:

pip install pocket-tts

If you want to train your own voice model, you will need to go to Hugging Face, log in, and request access to Pocket TTS. You will also need to generate a Hugging Face access token; just make sure it has READ permissions and is not a fine-grained token.

py -m pip install -U huggingface_hub

hf auth login

When prompted, paste the token into the terminal.

As a test, I found some samples of Scarlett Johansson’s voice (I suspect they are AI-generated too) and trained Pocket TTS on them. It took about a second and produced a safetensors file that’s 1.5 MB in size. That’s insane. Generation ran at 4.5x real time, so that’s good.
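To make the "4.5x real time" figure concrete: it is simply the length of the generated audio divided by the time it took to generate. Here is a minimal sketch of how you could measure it yourself using only the Python standard library. The function names are made up for illustration, and it assumes the output is a plain PCM WAV file, which is what the `wave` module can read.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def realtime_factor(audio_seconds, elapsed_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds
```

For example, 45 seconds of audio generated in 10 seconds gives `realtime_factor(45.0, 10.0) == 4.5`. To get the elapsed time, wrap the generation step in two `time.perf_counter()` calls and subtract.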

Training sample

Generated voice:

The trained model and demo code can be found on the BarnLabs GitHub.

To test it out, you don’t need to write any code; just run

pocket-tts generate --text "This is my cloned scarlet voice speaking now." --voice .\scarlet_voice.safetensors --output-path test.wav

in your terminal.
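If you later want to run the same command from a script, say to generate many lines in a loop, a thin wrapper around the CLI is enough. This is only a sketch: the function names are made up for illustration, it uses nothing beyond the three flags from the command above, and it assumes `pocket-tts` is on your PATH.

```python
import subprocess


def build_generate_command(text, voice_path, output_path):
    """Build the argument list for `pocket-tts generate` (flags as in the post)."""
    return [
        "pocket-tts", "generate",
        "--text", text,
        "--voice", voice_path,
        "--output-path", output_path,
    ]


def generate(text, voice_path, output_path):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    command = build_generate_command(text, voice_path, output_path)
    subprocess.run(command, check=True)
```

Calling `generate("Hello there.", "scarlet_voice.safetensors", "hello.wav")` should then behave like the terminal command. Passing the arguments as a list rather than one string means the text never goes through the shell, so quotes and spaces inside it need no escaping.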

I’m getting closer and closer to building Her.

Barn Lab
