Local AI TTS Guide

Read your live stream chat aloud using free, private, self-hosted AI voices — no API keys required

Overview

Social Stream Ninja supports fully local AI text-to-speech — your chat messages are converted to voice entirely on your own computer, with no data sent to external servers and no API key required.

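There are two approaches. Path 1 uses the TTS engines bundled with Social Stream Ninja (Kokoro, Piper, Kitten, and eSpeak-NG), which run in the browser with no installation. Path 2 runs a separate TTS server on your own machine: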

Path 2 — Self-Hosted Server (Docker Required)

Run a local TTS server on your machine and point Social Stream Ninja at it. Gives you more voice options, voice cloning, and server-side control.

  • Kokoro-FastAPI
  • openedai-speech (Piper)
  • kokoro-web

Uses Social Stream's built-in OpenAI-compatible endpoint support.

Start with Path 1. The built-in Kokoro TTS rivals cloud services in quality, runs entirely in your browser, and works with OBS out of the box. Only move to Path 2 if you need more control or different models.

Where to Click in SSN

In the extension popup, open the TTS provider selector and choose Custom / Local TTS Endpoint. That shows the OpenAI-compatible local endpoint fields and the link back to this guide.

About screenshots: the SSN field map above shows the local endpoint fields. Third-party server UIs change by project version, so their current screenshots and UI details are linked from each project repo near the relevant setup step.

Self-Hosted Flow

SSN treats a local/self-hosted TTS server like an OpenAI-compatible speech endpoint. The core flow is:

chat text -> SSN TTS request -> local endpoint or SSN bridge -> TTS server -> audio response -> SSN playback

Request Shape

For ttsprovider=customtts, localtts, or openai, SSN sends a JSON POST to the configured endpoint:

POST /v1/audio/speech
{
  "model": "tts-1",
  "input": "Chat message text",
  "voice": "af_bella",
  "response_format": "mp3",
  "speed": 1.0
}
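To test a server outside SSN, the same request can be reproduced from a script. The sketch below uses only the Python standard library and mirrors the JSON body shown above. The function names and the output filename are illustrative, not part of SSN, and the endpoint you pass in is whichever local server you are running.

```python
import json
import urllib.request


def build_speech_request(endpoint, text, voice="af_bella", model="tts-1",
                         response_format="mp3", speed=1.0):
    """Build the JSON POST described above (no network access happens here)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_speech(endpoint, text, out_path="test.mp3", **options):
    """Send the request and save whatever audio bytes come back."""
    request = build_speech_request(endpoint, text, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Calling fetch_speech("http://localhost:8880/v1/audio/speech", "Hello chat") against a running server should leave a playable file at test.mp3 if the server returns binary audio.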

CORS and the Bridge

If the request comes from Chrome or an OBS browser source, the local server must allow browser CORS requests. If it does not, run the SSN local TTS bridge and point SSN at http://127.0.0.1:8124/v1/audio/speech. The bridge adds browser-safe CORS headers and can translate GPT-SoVITS or F5-TTS wrapper requests.
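Before reaching for the bridge, it can help to check whether the server already answers with CORS headers. The following is a rough diagnostic sketch, not SSN code: it sends a preflight-style OPTIONS request and looks only at Access-Control-Allow-Origin. The origin value is a placeholder, and a real browser also checks the allowed methods and headers, so treat a True result as a hint rather than proof.

```python
import urllib.error
import urllib.request


def cors_allows(headers, origin):
    """Return True if the response headers permit requests from `origin`."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin", "")
    return allowed == "*" or allowed == origin


def probe_cors(endpoint, origin="http://localhost"):
    """Send a preflight-style OPTIONS request and inspect the reply headers."""
    request = urllib.request.Request(
        endpoint,
        method="OPTIONS",
        headers={
            "Origin": origin,
            "Access-Control-Request-Method": "POST",
            "Access-Control-Request-Headers": "content-type",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return cors_allows(dict(response.headers.items()), origin)
    except urllib.error.HTTPError as error:
        # Some servers reject OPTIONS with an error status but still set headers.
        return cors_allows(dict(error.headers.items()), origin)
```

If probe_cors returns False for your endpoint, or the connection itself fails, pointing SSN at the bridge address is the simpler fix.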

Supported Audio Responses

Response | SSN support | Notes
Binary audio | Yes | Best option. Return audio/mpeg, audio/wav, audio/ogg, audio/aac, or another browser-playable audio type.
JSON with audio URL | Yes | SSN checks url, audio_url, output_url, nested data.url, and the first data[] item.
JSON with base64 audio | Yes | SSN checks audio, audio_data, audioContent, b64_json, nested data fields, and data URLs.
Raw PCM | Only if wrapped | Return PCM as a WAV file or base64 WAV. A browser audio element cannot reliably play raw PCM bytes directly.

Recommended formats: use mp3 for small files and broad browser support, wav for local cloning servers and bridge testing, and opus only when the server and browser both support it.
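If your engine only produces raw PCM, wrapping it in a WAV container is a small step. This is a minimal sketch using Python's standard wave module. The defaults (24 kHz, mono, 16-bit) are assumptions, so set them to whatever your engine actually outputs.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The returned bytes can be sent back with an audio/wav content type, or base64-encoded into a JSON response.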

Streaming Audio

SSN does not currently do progressive playback for custom/local TTS endpoints. It waits for the response blob or JSON audio payload, then plays it. Some upstream servers expose streaming endpoints, but SSN's current OpenAI-compatible path buffers before playback.

Practical result: keep chat TTS snippets short. Streaming support would need a separate playback path using streamed WAV/MP3 chunks, MediaSource, WebCodecs, or a server-side mixer.
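One way to keep snippets short is to split long text before it reaches the TTS server, for example inside a wrapper or proxy you control. The sketch below is only an illustration of that idea. The 200-character limit is an arbitrary assumption, and nothing here reflects how SSN itself handles long messages.

```python
def split_for_tts(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking on spaces."""
    chunks, current = [], ""
    for word in text.split():
        # A single very long word is hard-cut so no chunk exceeds the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized as its own short request, which keeps the wait before playback small.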

Path 1 — Built-in TTS (Zero Setup)

These engines are bundled inside Social Stream Ninja and require no installation. They run in the browser using WebAssembly (WASM) or ONNX Runtime.

Provider | Quality | CPU Use | GPU/WebGPU | URL Parameter
Kokoro TTS | ⭐⭐⭐⭐⭐ Excellent | Medium | Faster with GPU | ?ttsprovider=kokoro
Piper TTS | ⭐⭐⭐⭐ Very Good | Low | CPU only | ?ttsprovider=piper
Kitten TTS | ⭐⭐⭐ Good | Very Low | CPU only | ?ttsprovider=kitten
eSpeak-NG | ⭐⭐ Robotic | Minimal | CPU only | ?ttsprovider=espeak

How to Enable

Add &ttsprovider= and &speech= to your Social Stream dock.html URL:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=kokoro

Kokoro TTS Options

Kokoro has 26 built-in voices. Specify one with &voicekokoro=:

  • English female: af_bella, af_sarah, af_nicole, af_sky
  • English male: am_adam, am_michael
  • British female: bf_emma, bf_isabella
  • British male: bm_george, bm_lewis

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=kokoro&voicekokoro=af_bella&kokorospeed=1.1

Piper TTS Options

Specify a voice model with &pipervoice=:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=piper&pipervoice=en_US-hfc_female-medium

Kitten TTS Options

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=kitten&kittenvoice=expr-voice-4-f

eSpeak-NG Options

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=espeak&espeakvoice=en&espeakspeed=175

First-time load: Kokoro and Piper download their model files on first use (~50–200 MB). This happens automatically in the background, and later loads come from the browser cache.

OBS capture: All built-in TTS providers play audio directly through the browser. In OBS, add your dock.html as a Browser Source and enable "Control audio via OBS"; no virtual cables are needed. See the OBS section below.

Browser and Desktop App Notes

The Chrome extension, OBS browser source, and standalone Social Stream Ninja desktop app all use the same dock.html URL parameters for TTS.

Surface | Local TTS behavior | Audio capture
Chrome extension / OBS browser source | Browser fetch requires CORS from the local server, unless you use the SSN bridge. | Use OBS Browser Source with "Control audio via OBS".
Standalone desktop app | Uses the same provider settings. The app's local file windows are less CORS constrained, but the bridge is still the safest path for servers that reject browser-style requests. | Capture desktop/app audio, or route the app to a virtual audio cable.
Built-in Kokoro in desktop app | The app can use its local ninjafy.tts path for Kokoro instead of relying only on browser model loading. | Audio plays from the app, so use desktop/app audio capture.

Packaging note: when shipping the standalone app, make sure the current social_stream files, thirdparty/ assets, and docs/ guide are included in the app's fallback resource bundle.

Path 2 — Self-Hosted TTS Server

If you want more voice options, voice cloning, or a dedicated server you can reuse across tools, you can run a local TTS server. Social Stream Ninja connects to it using its built-in OpenAI-compatible TTS endpoint support — no API key needed for local servers.

Requirements: Docker Desktop must be installed and running. Docker is free for personal use.

Three recommended options:

Server | Model | GPU | Disk | Default Port
Kokoro-FastAPI (recommended) | Kokoro 82M | Optional | ~2 GB | 8880
openedai-speech (Piper, lightweight) | Piper TTS | CPU only | <1 GB | 8000
kokoro-web | Kokoro 82M | Optional | ~2 GB | 3000

Which Package Fits?

Package | Best benefit | Tradeoff
Built-in Kokoro | Best first choice: no server, strong quality, private, works in browser and desktop app. | No voice cloning.
Kokoro-FastAPI | OpenAI-compatible server, easy Docker setup, CPU or GPU, many Kokoro voices. | No real voice cloning; voice blending and custom voice features depend on the server build.
openedai-speech | Light OpenAI-compatible endpoint; Piper is CPU-friendly and XTTS adds cloning around a 4 GB VRAM target. | The repo says it is mostly obsolete, so treat it as useful but not future-proof.
Chatterbox servers | Voice cloning, web UI options, OpenAI-compatible APIs, long-text tooling. | CUDA/GPU support is smoother than CPU for some builds; setup varies by server fork.
GPT-SoVITS | Strong cloning/control with short references and transcript support. | Not OpenAI-compatible by default; use the SSN bridge mode.
F5-TTS | Natural zero-shot cloning with prompt WAV + transcript. | Official project is not a simple OpenAI endpoint; use a wrapper or bridge mode.
Qwen3-TTS | Modern cloning and voice-design features, including smaller 0.6B/1.7B models. | Library/demo first; needs a wrapper for SSN.
MisoTTS | High-end prompted speech generation. | Not a 6 GB VRAM local target; use remote/custom hosting if needed.

How Voice Cloning Works

Voice cloning is not a separate SSN mode. It is a feature inside some local TTS servers. SSN sends the chat text to a local endpoint; the server chooses the cloned voice from a saved reference audio file, a voice profile, or bridge configuration.

Typical Flow

  1. Record a clean reference clip, usually 3 to 30 seconds of one speaker with little background noise.
  2. Some engines also require the exact transcript of that reference clip.
  3. The local server converts the reference into a speaker prompt, embedding, or voice profile.
  4. SSN sends live chat text to the endpoint using ttsprovider=customtts.
  5. The server returns a playable audio file, usually WAV or MP3, and SSN plays it in the dock/browser source.

Use consented voices only. Voice cloning can sound like a real person, so only use voices you own, have permission to use, or have clearly licensed for this purpose.
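A quick sanity check on the reference clip can save a failed cloning attempt. The sketch below reads a WAV file with Python's standard wave module and compares its length with the 3 to 30 second guideline from step 1. The function names are illustrative, and individual engines may have their own stricter limits.

```python
import wave


def clip_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path, min_seconds=3.0, max_seconds=30.0):
    """Return (ok, seconds) for the 3 to 30 second guideline."""
    seconds = clip_seconds(path)
    return min_seconds <= seconds <= max_seconds, seconds
```

This only checks length. It says nothing about background noise or whether the clip contains a single speaker.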

For 6 GB VRAM or less, target small zero-shot cloning models and OpenAI-compatible servers first. Bigger models can still work through the same SSN endpoint if the user hosts them elsewhere.

Option | Voice cloning | 6 GB VRAM fit | API path for SSN
Qwen3-TTS 0.6B Base | 3-second reference audio | Likely | Use an OpenAI-compatible wrapper, then ttsprovider=customtts
XTTS-v2 / openedai-speech | Short WAV reference voices | Yes, about 4 GB reported by openedai-speech | /v1/audio/speech
Chatterbox Turbo / Server | Reference-audio cloning | Likely if using Turbo / small chunks | OpenAI-compatible server builds, or the bridge
GPT-SoVITS | 5-second zero-shot, 1-minute few-shot | Likely with fp16 / lightweight install | Use scripts/local-tts-bridge.cjs --mode gptsovits
F5-TTS | Prompt WAV + transcript | Maybe; depends on build and vocoder | Use an OpenAI-compatible wrapper, or --mode f5 for F5-TTS server wrappers
MisoTTS 8B | Prompted audio context | No; project recommends 24 GB VRAM | Remote/custom endpoint only

Best SSN target shape: accept POST /v1/audio/speech with { model, input, voice, response_format, speed } and return a playable audio file. That covers OpenAI, Coqui/XTTS, Kokoro wrappers, Qwen wrappers, and most proxy services.
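For engines that ship only as a library or demo, a thin wrapper with this target shape is usually enough. The sketch below is a minimal, standard-library-only example of such a wrapper, not an official SSN or vendor server. The synthesize function is a placeholder that returns one second of silence, the port is arbitrary, and the response is always WAV regardless of response_format.

```python
import io
import json
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer


def synthesize(text, voice, speed):
    """Placeholder engine: one second of silent WAV. Swap in your real model."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(24000)
        wav_file.writeframes(b"\x00\x00" * 24000)
    return buffer.getvalue()


def handle_speech(raw_body):
    """Turn a request body into (status, content_type, response_bytes)."""
    try:
        body = json.loads(raw_body or b"{}")
        text = str(body["input"])
        voice = str(body.get("voice", "default"))
        speed = float(body.get("speed", 1.0))
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, "application/json", b'{"error": "expected JSON with an input field"}'
    return 200, "audio/wav", synthesize(text, voice, speed)


class SpeechHandler(BaseHTTPRequestHandler):
    def _send(self, status, content_type, payload):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        # Permissive CORS so a browser or OBS browser source can call this.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.end_headers()
        self.wfile.write(payload)

    def do_OPTIONS(self):
        self._send(204, "text/plain", b"")

    def do_POST(self):
        if self.path != "/v1/audio/speech":
            self._send(404, "application/json", b'{"error": "not found"}')
            return
        length = int(self.headers.get("Content-Length", 0))
        self._send(*handle_speech(self.rfile.read(length)))


def run_server(host="127.0.0.1", port=8130):
    """Serve until interrupted; the port number is an arbitrary choice."""
    HTTPServer((host, port), SpeechHandler).serve_forever()
```

After calling run_server(), SSN could be pointed at http://127.0.0.1:8130/v1/audio/speech with ttsprovider=customtts and openaiformat=wav.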

Computer Requirements

These are practical starting points, not hard guarantees. Model version, quantization, text length, Docker image, and background apps can change memory use.

Option | Minimum practical computer | Good target | Notes
System TTS / eSpeak | Any modern PC | Any PC | Fast, low quality, no cloning.
Built-in Kitten | Low-end CPU, 4 GB RAM | Modern laptop CPU, 8 GB RAM | Small ONNX model, fast startup.
Built-in Piper | Modern CPU, 4-8 GB RAM | Modern CPU, 8 GB RAM | Good low-resource neural voice option.
Built-in Kokoro | Modern CPU, 8 GB RAM | WebGPU-capable GPU or fast CPU, 8-16 GB RAM | Best zero-setup quality. First load downloads model assets.
Kokoro-FastAPI | CPU Docker host, 8 GB RAM | NVIDIA GPU optional, 8-16 GB RAM | Good local server when browser model loading is not ideal.
openedai-speech Piper | CPU, 4-8 GB RAM | CPU, 8 GB RAM | Light OpenAI-compatible server.
openedai-speech XTTS | NVIDIA GPU around 4 GB VRAM, 8-16 GB RAM | 6 GB+ NVIDIA GPU, 16 GB RAM | Voice cloning path; CPU is possible but slow.
Chatterbox servers | CPU can work for some builds but is slow | 6 GB+ NVIDIA GPU, 16 GB RAM | Use GPU when cloning or processing long text.
GPT-SoVITS / F5-TTS / Qwen3-TTS | CPU testing only, slow | 6 GB+ NVIDIA GPU for smaller/optimized models, 16 GB RAM | Wrapper choice and model size matter. Expect more setup.
MisoTTS 8B | Not recommended locally at 6 GB VRAM | 24 GB VRAM or remote host | The repo recommends high-VRAM GPUs for interactive use.

Tested Server Notes

These are the self-hosted voice-cloning targets checked for SSN compatibility. The local endpoint path was tested against both dock.html and featured.html.

SSN accepts direct binary audio responses, JSON responses with base64 audio, and JSON responses with an audio URL. Current custom/local playback buffers the returned audio before playing it; progressive streaming playback is not supported yet.
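When writing or debugging a wrapper, it can help to see those three response shapes handled in one place. The sketch below is an illustration based on the field names listed in the Supported Audio Responses table. It is not SSN's actual parsing code, and the order in which fields are checked is an assumption.

```python
import base64
import json

URL_KEYS = ("url", "audio_url", "output_url")
BASE64_KEYS = ("audio", "audio_data", "audioContent", "b64_json")


def decode_audio_string(value):
    """Decode base64 audio, accepting an optional data: URL prefix."""
    if value.startswith("data:"):
        _, _, value = value.partition(",")
    return base64.b64decode(value)


def extract_audio(content_type, body):
    """Return ("bytes", data) or ("url", address) for a TTS response."""
    if content_type.split(";")[0].strip().lower() != "application/json":
        return "bytes", body  # binary audio: use the body as-is
    payload = json.loads(body)
    candidates = [payload]
    data = payload.get("data") if isinstance(payload, dict) else None
    if isinstance(data, dict):
        candidates.append(data)
    elif isinstance(data, list) and data:
        candidates.append(data[0])
    for candidate in candidates:
        if not isinstance(candidate, dict):
            continue
        for key in URL_KEYS:
            value = candidate.get(key)
            if isinstance(value, str) and value:
                if value.startswith("data:"):
                    return "bytes", decode_audio_string(value)
                return "url", value
        for key in BASE64_KEYS:
            value = candidate.get(key)
            if isinstance(value, str) and value:
                return "bytes", decode_audio_string(value)
    raise ValueError("no audio found in JSON response")
```

If your server's JSON does not match any of these fields, returning plain binary audio is the simplest fix.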

Server | SSN path | Notes
openedai-speech | Direct or bridge | OpenAI-compatible /v1/audio/speech. Piper mode was tested with real CPU synthesis from dock.html and featured.html, direct and through the bridge. If running from source on Windows, make sure the venv Scripts folder is on PATH so piper.exe and ffmpeg.exe can be found.
chatterbox-tts-api | Direct or bridge | OpenAI-compatible /v1/audio/speech. Uses configured reference audio for cloning. API shape was tested direct and through the bridge.
Chatterbox-TTS-Server | Direct or bridge | OpenAI-compatible endpoint and Web UI. Tested with real CPU synthesis using Emily.wav from dock.html and featured.html, direct and through the bridge.
GPT-SoVITS | Bridge mode | Run SSN bridge with --mode gptsovits; target server is /tts, not OpenAI-compatible.
F5-TTS_server | Bridge mode | Run SSN bridge with --mode f5; target server uses GET /synthesize_speech/.
F5-TTS official | Needs wrapper | CLI, Gradio, and socket server first. Use an OpenAI-compatible wrapper or the F5 bridge mode against a wrapper.
Qwen3-TTS | Needs wrapper | Library and Gradio demo first. Good candidate for a small OpenAI-compatible wrapper around generate_voice_clone.
MisoTTS | Remote/custom only | Voice cloning is supported, but the 8B model is not a 6 GB VRAM target and has no local REST endpoint in the repo.

Kokoro-FastAPI Setup

Kokoro-FastAPI runs the Kokoro 82M model as a local server with an OpenAI-compatible API. It works on CPU (no GPU required) and has excellent voice quality.

Install with Docker

Open a terminal (Command Prompt, PowerShell, or Terminal) and run one of the following:

CPU (works on any computer):

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.2

GPU (NVIDIA only — faster synthesis):

docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.2.0post4

First run: Docker will download the image (~1.5–2 GB). This only happens once. After that, the server starts in a few seconds.

Verify It's Running

Open your browser and go to http://localhost:8880/web/ — you should see a web UI where you can test voices.

Available Voices

67+ voices available. A few highlights:

  • American female: af_bella, af_sarah, af_nicole, af_sky, af_heart
  • American male: am_adam, am_michael
  • British female: bf_emma, bf_isabella
  • British male: bm_george, bm_lewis

Browse and test all voices at http://localhost:8880/web/ once the server is running.

SSN URL

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=openai&openaiendpoint=http://localhost:8880/v1/audio/speech&voiceopenai=af_bella

Keep the Server Running

To keep Kokoro-FastAPI running automatically in the background, use Docker's restart flag:

docker run -d --restart unless-stopped -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.2

It will now start automatically with Docker Desktop on every reboot.

openedai-speech Setup (Lightweight Piper)

openedai-speech is the lightest option — a CPU-only Piper TTS server under 1 GB. Good for older or less powerful computers.

Install with Docker Compose

  1. Clone the repository and use its docker-compose.min.yml, or skip Compose and run the command in step 2 directly.
  2. Run the minimal Piper-only image:

docker run -d --restart unless-stopped -p 8000:8000 ghcr.io/matatonic/openedai-speech-min

Windows Source Install Note

If you run openedai-speech from a local checkout instead of Docker, add its virtual environment scripts folder to PATH before starting the server. Without this, requests can return HTTP 500 because the server cannot find piper.exe or ffmpeg.exe.

cd openedai-speech
$env:Path = "$PWD\.venv\Scripts;$env:Path"
.\.venv\Scripts\python.exe speech.py --xtts_device none -H 127.0.0.1 -P 8000

Available Voices

openedai-speech uses OpenAI-style voice names mapped to Piper voices:

alloy, echo, fable, onyx, nova, shimmer

SSN URL

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=openai&openaiendpoint=http://localhost:8000/v1/audio/speech&voiceopenai=nova

Local TTS Bridge

If a local TTS server does not allow browser CORS requests, run SSN's Node bridge locally and point SSN at the bridge instead. The bridge returns the third-party audio response unchanged. The standalone starter folder is local-tts-bridge/; see the bridge README for the code and launch options.

OpenAI-Compatible Proxy

$env:SSN_TTS_TARGET="http://127.0.0.1:8000/v1/audio/speech"
npm run local-tts-bridge

Or run it from the bridge folder:

cd local-tts-bridge
node server.cjs

Then point SSN at the bridge:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://127.0.0.1:8124/v1/audio/speech&voiceopenai=nova

GPT-SoVITS Proxy Mode

GPT-SoVITS uses its own /tts JSON shape, so the bridge can translate SSN's OpenAI-compatible request into the GPT-SoVITS request body.

$env:SSN_TTS_REF_AUDIO_PATH="C:\voices\speaker.wav"
$env:SSN_TTS_REF_TEXT="Reference audio transcript here."
$env:SSN_TTS_TARGET="http://127.0.0.1:9880/tts"
npm run local-tts-bridge -- --mode gptsovits

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://127.0.0.1:8124/v1/audio/speech&openaiformat=wav

F5-TTS Server Proxy Mode

Some F5-TTS server wrappers expose /synthesize_speech/?text=...&voice=... instead of an OpenAI-compatible endpoint. The bridge can translate SSN's request into that query format.

$env:SSN_TTS_TARGET="http://127.0.0.1:7860/synthesize_speech/"
npm run local-tts-bridge -- --mode f5

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://127.0.0.1:8124/v1/audio/speech&voiceopenai=default_en&openaiformat=wav

Bridge endpoint: http://127.0.0.1:8124/v1/audio/speech. Change the port with SSN_TTS_BRIDGE_PORT=8125 if needed.
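The F5 translation described in this section amounts to moving two fields from a JSON body into a query string. The sketch below shows that mapping in Python for illustration only. The real bridge is a Node script and may handle more fields, and the function name and fallback voice here are assumptions.

```python
import json
from urllib.parse import urlencode


def to_f5_url(target, openai_body, default_voice="default_en"):
    """Map an OpenAI-style speech request onto a /synthesize_speech/ GET URL."""
    request = json.loads(openai_body)
    query = urlencode({
        "text": request["input"],
        "voice": request.get("voice") or default_voice,
    })
    separator = "&" if "?" in target else "?"
    return f"{target}{separator}{query}"
```

A wrapper that expects other parameter names would need its own mapping along the same lines.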

Connecting to Social Stream Ninja

All self-hosted servers above use the same connection method — Social Stream's built-in OpenAI TTS endpoint support with a custom local URL.

URL Parameters

Parameter | Value | Description
ttsprovider | customtts or openai | Use the OpenAI-compatible TTS path. Use customtts for local/self-hosted endpoints.
openaiendpoint | http://localhost:8880/v1/audio/speech | Your local server URL (change port as needed)
speech | en-US | Enables TTS for English
voiceopenai | af_bella | Voice name (depends on server)
openaiformat | mp3 | Audio format: mp3, wav, opus, flac
openaispeed | 1.0 | Speaking speed (0.5–2.0)

Endpoint aliases: customttsendpoint and localttsendpoint also work. customttsvoice, localttsvoice, customttsmodel, localttsmodel, customttsformat, and localttsformat are accepted aliases for the OpenAI-style fields.
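If you generate or audit dock.html URLs in a script, it can be handy to fold the aliases back into one set of names. The sketch below is a hypothetical helper, not SSN's own parsing. The model aliases are left out because the table does not name their canonical parameter, and "first value in the URL wins" is an assumption about precedence.

```python
from urllib.parse import parse_qs, urlsplit

# Alias -> canonical parameter, taken from the table and alias note.
ALIASES = {
    "customttsendpoint": "openaiendpoint",
    "localttsendpoint": "openaiendpoint",
    "customttsvoice": "voiceopenai",
    "localttsvoice": "voiceopenai",
    "customttsformat": "openaiformat",
    "localttsformat": "openaiformat",
}


def normalize_tts_params(url):
    """Return the URL's query parameters with aliases mapped to canonical names."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    result = {}
    for name, values in query.items():
        canonical = ALIASES.get(name, name)
        # Assumption: if both an alias and its canonical name appear, keep the first.
        result.setdefault(canonical, values[-1])
    return result
```

This makes it easy to compare two URLs that use different spellings for the same settings.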

Full Example URLs

Kokoro-FastAPI:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://localhost:8880/v1/audio/speech&voiceopenai=af_bella&openaispeed=1.1

openedai-speech:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://localhost:8000/v1/audio/speech&voiceopenai=nova

kokoro-web:

dock.html?session=YOUR_SESSION&speech=en-US&ttsprovider=customtts&openaiendpoint=http://localhost:3000/api/v1/audio/speech&voiceopenai=af_bella
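These URLs follow one pattern, so they can be generated rather than typed by hand. The sketch below builds them with Python's urllib.parse. The helper name is illustrative, and ":" and "/" are left unescaped only so the output matches the readable style used in the examples above.

```python
from urllib.parse import urlencode


def build_dock_url(session, endpoint, voice, provider="customtts",
                   speech="en-US", **extra):
    """Assemble a dock.html URL for an OpenAI-compatible local endpoint."""
    params = {
        "session": session,
        "speech": speech,
        "ttsprovider": provider,
        "openaiendpoint": endpoint,
        "voiceopenai": voice,
    }
    params.update(extra)  # e.g. openaispeed=1.1 or openaiformat="wav"
    # Keep ":" and "/" unescaped so the endpoint stays readable.
    return "dock.html?" + urlencode(params, safe=":/")
```

For example, build_dock_url("YOUR_SESSION", "http://localhost:8880/v1/audio/speech", "af_bella", openaispeed=1.1) reproduces the Kokoro-FastAPI URL above.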

Additional TTS Options

These work with any TTS provider, including local servers:

Parameter | Example | Description
simpletts | &simpletts | Skip "says" and read the message only
simpletts2 | &simpletts2 | Skip usernames entirely
volume | &volume=0.8 | Volume level (0.0–1.0)
skipmessages | &skipmessages=3 | Read every 3rd message only
ttscommand | &ttscommand=!say | Only read messages starting with !say
readevents | &readevents | Also read subscriptions, donations, etc.

No API key needed. When using a local server (non-openai.com URL), Social Stream Ninja sends the request without an Authorization header. You do not need to configure a key.
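To make the skipmessages and ttscommand options concrete, here is a small model of the filtering they describe. It is a guess at the behavior from the descriptions in the table, not SSN's implementation. In particular, stripping the command prefix and counting only messages that pass the command filter are assumptions.

```python
class TtsFilter:
    """Toy model of the skipmessages and ttscommand options."""

    def __init__(self, skipmessages=1, ttscommand=None):
        self.every = max(1, int(skipmessages))
        self.command = ttscommand
        self.seen = 0

    def text_to_speak(self, message):
        """Return the text to read aloud, or None to stay silent."""
        if self.command:
            if not message.startswith(self.command):
                return None
            # Assumption: the command prefix itself is not read aloud.
            message = message[len(self.command):].strip()
        self.seen += 1
        if self.seen % self.every != 0:
            return None
        return message or None
```

With skipmessages=3, only the third, sixth, ninth (and so on) messages are returned for speech in this model.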

Built-in Browser Options Worth Supporting

SSN already supports OS/browser speechSynthesis, built-in Kokoro, Piper, Kitten, and eSpeak. The most useful future browser-side additions would be an audio output device picker where setSinkId is available, more Piper voice choices, and a dedicated progressive streaming playback path for servers that can stream audio chunks.

Getting Audio into OBS

How you capture TTS audio in OBS depends on how you're running Social Stream Ninja.

Method 1 — OBS Browser Source (Recommended)

This is the simplest method and works for all TTS providers (built-in and self-hosted server).

  1. In OBS, add a new Browser Source.
  2. Set the URL to your dock.html URL with TTS parameters.
  3. Check "Control audio via OBS" in the browser source settings.
  4. Click OK. TTS audio will now appear as an OBS audio source you can adjust or route.
  5. Click the browser source once in the preview to allow browser audio autoplay.

Why this works: Built-in TTS and self-hosted server TTS both play audio through the browser's audio context (not OS speech synthesis). OBS can capture browser audio directly when "Control audio via OBS" is checked.

Method 2 — SSN Desktop App + Desktop Audio

If you're using the Social Stream Ninja standalone desktop app (not an OBS browser source):

  1. TTS audio plays through your system speakers/headphones from the app.
  2. In OBS, add an Audio Output Capture source (or use the global Desktop Audio device) to capture it.
  3. If you want TTS isolated from other desktop audio, use a virtual audio cable:
     • Windows: VB-Audio Virtual Cable (free)
     • Set CABLE Input as the output for the SSN app in Windows Sound settings
     • Capture CABLE Output in OBS with Audio Input Capture

Windows Audio Routing Links

Windows 10 Per-App Route

  1. Open Sound Settings > App volume and device preferences.
  2. Find the browser or SSN app in the app list.
  3. Set Output to CABLE Input (VB-Audio Virtual Cable).
  4. In OBS, add Audio Input Capture and choose CABLE Output.

Windows 11 Per-App Route

  1. Open Settings > System > Sound > Volume Mixer.
  2. Find the browser or SSN app.
  3. Set Output device to CABLE Input (VB-Audio Virtual Cable).
  4. In OBS, add Audio Input Capture and choose CABLE Output.

Audio Router Software

Audio Router can route one app to a virtual cable, but it is older software. Prefer the Windows per-app route when it works.

  1. Install Audio Router.
  2. Route the browser or SSN app to CABLE Input.
  3. In OBS, capture CABLE Output.

Voicemeeter Advanced Route

Voicemeeter is best when you need to hear TTS locally, route it to OBS, and keep it separate from music/game audio.

  1. Install Voicemeeter and set it as the Windows default output.
  2. Set Hardware Out to your speakers/headphones.
  3. Route the virtual output into OBS as an Audio Input Capture source.

System TTS (?speech=en-US without a provider) uses OS speech synthesis, which cannot be captured by an OBS browser source. Use one of the providers above (kokoro, piper, etc.) instead.

Comparison Table

Option | Setup | Quality | Private | OBS (Browser Source) | GPU Needed | Cost
Built-in Kokoro | None | ⭐⭐⭐⭐⭐ | Yes | Yes | No (faster with) | Free
Built-in Piper | None | ⭐⭐⭐⭐ | Yes | Yes | No | Free
Built-in Kitten | None | ⭐⭐⭐ | Yes | Yes | No | Free
Built-in eSpeak | None | ⭐⭐ | Yes | Yes | No | Free
Kokoro-FastAPI | Docker | ⭐⭐⭐⭐⭐ | Yes | Yes | No (optional) | Free
openedai-speech | Docker | ⭐⭐⭐⭐ | Yes | Yes | No | Free
ElevenLabs | API Key | ⭐⭐⭐⭐⭐ | No | Yes | No | Paid tiers
System TTS | None | ⭐⭐ | Yes | No* | No | Free

* System TTS requires virtual audio cable routing for OBS capture.

Troubleshooting

No audio playing

Kokoro / Piper takes a long time on first load

The model files are being downloaded (~50–200 MB). This only happens once — subsequent loads use the cached version. Wait for the first message before testing.

Local server not responding (self-hosted setup)
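A first check is whether anything is listening on the server's port at all. The sketch below tries a plain TCP connection using Python's socket module. The default port is Kokoro-FastAPI's 8880 from the table earlier in this guide, so use 8000 for openedai-speech, 3000 for kokoro-web, or 8124 for the bridge. The function name is illustrative.

```python
import socket


def port_is_open(host="127.0.0.1", port=8880, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, the server or its Docker container is probably not running, or it is bound to a different port. If it returns True but SSN still gets no audio, check the endpoint path and CORS next.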

CORS error in browser console (extension mode)

If you're using the Social Stream browser extension (not the standalone app), your local server must allow cross-origin requests from the extension.

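Most FastAPI-based servers (Kokoro-FastAPI, openedai-speech) allow all origins by default. If you still see a CORS error, run the SSN local TTS bridge and point SSN at http://127.0.0.1:8124/v1/audio/speech instead of the server; the bridge adds the CORS headers the browser needs.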

"http://localhost" blocked in OBS browser source

OBS browser sources can have trouble reaching localhost servers. Try:

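Audio plays but OBS doesn't capture it

Make sure the Browser Source has "Control audio via OBS" checked, and click the source once in the preview so the browser allows audio autoplay. If you are using System TTS (?speech=en-US without a provider), switch to one of the built-in or self-hosted providers, because OS speech synthesis is not captured by a browser source.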

Wrong voice / voice not found

Voice names are case-sensitive and must match what the server supports. Visit http://localhost:8880/web/ (Kokoro-FastAPI) to browse and test available voices.

Docker image not found

Image tags change with new releases. If the tag in this guide no longer works, check the project's GitHub page for the latest version tag.

More TTS options: For cloud-based premium TTS (ElevenLabs, Google Cloud, Speechify) and full URL parameter reference, see the TTS Voice Guide.