Question 1

Why does text sound unnatural in text-to-speech?

Accepted Answer

TTS engines struggle with several kinds of input: abbreviations (some engines read 'Dr.' as 'Doctor', others read it literally or skip it), Markdown formatting ('**bold**' may be read as 'asterisk asterisk bold asterisk asterisk'), numbers written as digits rather than words ('1,234' vs 'one thousand two hundred and thirty-four'), acronyms (should 'NASA' be spelled out or said as a word?), URLs and email addresses (often read character by character), and punctuation that affects pacing (em dashes, ellipses, semicolons). Formatting text for TTS removes these ambiguities.
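A minimal sketch of the first two fixes, stripping inline Markdown and expanding abbreviations, might look like the following. The function names and the abbreviation map are hypothetical and only illustrate the idea; a real script would need a longer map and more Markdown patterns.

```python
import re

# Hypothetical abbreviation map (an assumption for illustration); extend as needed.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "vs.": "versus",
    "e.g.": "for example",
}

def strip_markdown(text: str) -> str:
    """Remove common inline Markdown markers so they are not read aloud."""
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)             # **bold**
    text = re.sub(r"\*(.+?)\*", r"\1", text)                 # *italic*
    text = re.sub(r"`(.+?)`", r"\1", text)                   # `code`
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)   # # headings
    return text

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its spoken form."""
    for short, full in ABBREVIATIONS.items():
        # Only match when the abbreviation does not continue a longer word.
        text = re.sub(r"(?<!\w)" + re.escape(short), full, text)
    return text

def clean_for_tts(text: str) -> str:
    return expand_abbreviations(strip_markdown(text))

print(clean_for_tts("**Dr.** Smith said *hello*."))
```

Markdown is stripped before abbreviations are expanded so that markers wrapped around an abbreviation do not hide it from the lookup.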

Question 2

How do I format numbers for text-to-speech?

Accepted Answer

For most TTS engines: spell out numbers in running prose ('three hundred' not '300'), use numerals for measurements and statistics where the spoken form is unambiguous ('5 km', '$20'), write out ordinals ('third' not '3rd'), write percentages as words ('fifteen percent' not '15%'), and write currency clearly ('twenty dollars' or '$20'; avoid '$20.00', which may be read as 'twenty dollars and zero cents'). SSML (Speech Synthesis Markup Language) lets you control explicitly how a number is read, for example with <say-as interpret-as='cardinal'>.
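As a rough illustration of spelling out numbers and percentages, here is a small sketch. It assumes whole numbers below one million and the British-style 'and' used in the example earlier on this page; number_to_words and spell_out_percentages are hypothetical helper names, not part of any TTS library.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0..999,999 in British style ('... hundred and ...')."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only supports 0..999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
        return words + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest == 0:
        return words
    # 'one thousand and five', but 'one thousand two hundred and ...'
    joiner = " and " if rest < 100 else " "
    return words + joiner + number_to_words(rest)

def spell_out_percentages(text: str) -> str:
    """Turn '15%' into 'fifteen percent' (whole-number percentages only)."""
    return re.sub(r"(\d+)%",
                  lambda m: number_to_words(int(m.group(1))) + " percent",
                  text)

print(number_to_words(1234))
print(spell_out_percentages("15% of users"))
```

Decimals, ordinals and currency are left out to keep the sketch short; they would need their own rules.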

Question 3

What is SSML and should I use it?

Accepted Answer

SSML (Speech Synthesis Markup Language) is an XML-based markup language for controlling TTS output: pauses (<break>), emphasis (<emphasis>), pronunciation (<phoneme>), speed (<prosody>) and more. Most professional TTS APIs support it: Amazon Polly, Google Cloud TTS, Azure Cognitive Services, ElevenLabs. Support for individual tags varies by engine and voice, so check each provider's documentation. If you're building audio content at scale, SSML gives you far more control than plain text. For quick conversions, plain text formatting is faster.
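To make the markup concrete, the sketch below assembles a short SSML document in Python and checks that it is well-formed XML. The build_ssml helper and the sample sentences are assumptions for illustration. The tags used (break, emphasis, prosody, say-as) come from the SSML standard, but not every engine or voice accepts all of them, and some engines expect extra attributes on the speak element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def build_ssml(intro: str, key_point: str, count: int) -> str:
    """Assemble a small SSML document (hypothetical helper for illustration)."""
    return (
        "<speak>"
        f"{escape(intro)}"                      # escape &, < and > in plain text
        '<break time="500ms"/>'                 # half-second pause
        f'<emphasis level="strong">{escape(key_point)}</emphasis> '
        '<prosody rate="slow">We shipped '      # slower speaking rate
        f'<say-as interpret-as="cardinal">{count}</say-as> units.'
        "</prosody>"
        "</speak>"
    )

ssml = build_ssml("Sales & marketing update.", "Revenue doubled.", 300)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(ssml)
print(ssml)
```

Escaping the plain-text parts matters because a stray '&' or '<' in a script makes the whole SSML document invalid.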

Question 4

Which text-to-speech engine produces the most natural output?

Accepted Answer

As of 2024, ElevenLabs produces the most realistic, expressive voices, which makes it a strong choice for podcasts, narration and voiceovers. OpenAI TTS (tts-1-hd) is close, fast and very affordable. Google Cloud WaveNet and Neural2 voices are excellent for large volume at low cost. Amazon Polly Neural voices are reliable and well-documented. Microsoft Azure Neural voices have strong multilingual support. All of them improve significantly with clean, well-formatted input: garbage in, garbage out applies to TTS too.

Input	Output (spoken form)	Why
**bold text**	bold text	Markdown asterisks read aloud as "asterisk"
Dr. Smith said...	Doctor Smith said...	Abbreviations read unexpectedly without expansion
$1,234.56	one thousand two hundred...	Currency digits spoken differently by each engine
15%	fifteen percent	Percent symbol may be skipped or mis-spoken
https://example.com	example dot com	URLs read character-by-character without formatting
• Item one\n• Item two	Item one. Item two.	Bullet points add silence or read as punctuation
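The URL and bullet-point rows in the table could be handled along these lines. This is a sketch under simplifying assumptions: speak_urls keeps only the host name and drops any path, and flatten_bullets treats every non-empty line as one sentence. Both function names are hypothetical.

```python
import re

def speak_urls(text: str) -> str:
    """Rewrite URLs into a spoken form such as 'example dot com'.

    Assumption: only the host name is kept; scheme, 'www.' and path are dropped.
    """
    pattern = r"https?://(?:www\.)?([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)(?:/\S*)?"
    return re.sub(pattern, lambda m: m.group(1).replace(".", " dot "), text)

def flatten_bullets(text: str) -> str:
    """Turn a bulleted or numbered list into plain sentences."""
    sentences = []
    for line in text.splitlines():
        # Strip a leading bullet marker or list number.
        item = re.sub(r"^\s*(?:[•*-]|\d+\.)\s+", "", line).strip()
        if not item:
            continue
        if item[-1] not in ".!?:":
            item += "."          # close each item so the engine pauses
        sentences.append(item)
    return " ".join(sentences)

print(speak_urls("Visit https://example.com today"))
print(flatten_bullets("• Item one\n• Item two"))
```

Ending every list item with a full stop gives the engine a consistent pause between items instead of an unpredictable silence.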
