The two halves of the Web Speech API
The Web Speech API has two halves: speech synthesis (text → audio, what our converter uses) and speech recognition (microphone → text). They were standardized at different speeds and have very different support stories. This page is about synthesis only.
The synthesis half, exposed as window.speechSynthesis with SpeechSynthesisUtterance, is available in all major modern browsers. The catch is that "available" does not mean "behaves the same." Each browser bridges to a different speech engine on each operating system, and each browser interprets the autoplay rules differently.
What "supported" actually means here
For our converter to be useful, four things have to work in your browser:
- Voice enumeration: speechSynthesis.getVoices() returns at least one voice.
- Speech playback: speak() produces audible output.
- Controls: pause(), resume(), and cancel() work as documented.
- Events: onstart, onend, and onerror fire reliably so the play/pause UI updates.
Each of those can break independently. The table and notes below are organized that way.
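As a rough illustration, three of the four requirements above can be probed from script before any audio plays. This is a minimal sketch, not the converter's actual code: checkSpeechSupport is a hypothetical helper name, and the window object is passed in as a parameter so the function can also be exercised against a stub. Audible playback cannot be confirmed this way; it only shows up once speak() is called after a user gesture.

```javascript
// Rough capability probe. `win` is normally `window`.
// `checkSpeechSupport` is a hypothetical helper name used for illustration.
function checkSpeechSupport(win) {
  const synth = win && win.speechSynthesis;
  const hasApi =
    Boolean(synth) && typeof win.SpeechSynthesisUtterance === "function";
  if (!hasApi) {
    return { api: false, voices: false, controls: false, events: false };
  }
  const utteranceProto = win.SpeechSynthesisUtterance.prototype;
  return {
    api: true,
    // Can be false on the first call where the voice list loads asynchronously.
    voices: synth.getVoices().length > 0,
    controls: ["speak", "pause", "resume", "cancel"].every(
      (name) => typeof synth[name] === "function"
    ),
    // Only confirms the handler properties exist, not that the events fire.
    events: ["onstart", "onend", "onerror"].every(
      (name) => name in utteranceProto
    ),
  };
}
```

In a page this would be called as checkSpeechSupport(window). A false value for voices right after load is not conclusive, for the reasons covered under "Voices that load asynchronously".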
By browser engine
| Engine | Voice list | Playback | Pause / resume | Notes |
|---|---|---|---|---|
| Chromium (Chrome, Edge, Brave, Opera, Vivaldi, Arc) on desktop | OS voices plus, on some builds, Google's network voices | Reliable after a user gesture | Works | Most consistent overall. Long utterances (~200+ words) sometimes silently stall; splitting helps. |
| Safari on macOS | OS voices only; Premium/Enhanced voices appear once downloaded | Reliable | Works; resume may restart the current sentence on some macOS versions | Best voice quality on Apple Silicon Macs once Premium voices are installed. |
| Firefox on desktop | OS voices on Windows and macOS; on Linux, depends on speech-dispatcher | Reliable | Works | Slightly slower to populate the initial voice list; wait a few hundred milliseconds. |
| Safari on iOS / iPadOS (also Chrome, Edge, Firefox on iOS) | iOS system voices; controlled centrally under Settings → Accessibility | Requires a tap on the page first | Pause works; cancel() sometimes has a short delay | All iOS browsers use WebKit, so the list is identical across them. |
| Chrome / Samsung Internet on Android | Depends on selected Android TTS engine (Google, Samsung, third-party) | Reliable after first interaction | Works | Switching the system TTS engine changes the voice list immediately. |
| Firefox on Android | Limited; reads from the Android speech service | Reliable | Works | Smaller voice list than Chrome on the same device. |
Autoplay rules: why nothing happens until you click
Modern browsers block audio that starts without a user gesture. For speech synthesis this means the first speak() call in a page works reliably only if it happens in response to a click, tap, or key press. After that first gesture, audio is "unlocked" for the page and subsequent calls work even from timers or callbacks.
This is why the converter's "Play" button is intentional: the click is the gesture that unlocks audio. Visiting the page and waiting passively is not enough; you must press Play once per page load.
iOS is the strictest about this. Safari on iPhone treats every new tab as a fresh permission boundary, and a long silence between pressing Play and the audio starting is sometimes the OS waking the speech engine for the first time in the session.
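The gesture requirement can be satisfied by making the first speak() call directly inside the click handler. The sketch below assumes a caller-supplied button element and text getter; bindPlayButton is a hypothetical name, and the window object is passed in so the wiring can be tried against a stub.

```javascript
// Wire a button so the first speak() call happens inside the click handler,
// i.e. inside the user gesture. `bindPlayButton` is a hypothetical helper;
// `button`, `getText`, and `win` are supplied by the caller.
function bindPlayButton(button, getText, win) {
  button.addEventListener("click", () => {
    const text = getText().trim();
    if (!text) return; // nothing to read
    win.speechSynthesis.cancel(); // drop anything still queued
    const utterance = new win.SpeechSynthesisUtterance(text);
    // Called synchronously within the handler, so the gesture still counts.
    win.speechSynthesis.speak(utterance);
  });
}
```

In a page, the call might look like bindPlayButton(playButton, () => textarea.value, window), where playButton and textarea are whatever elements the page already has. Deferring speak() behind a long asynchronous step inside the handler risks losing the gesture on stricter browsers.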
Voices that load asynchronously
The voice list is populated by the OS, and on Chromium-based browsers it is initially empty when the page first runs. The browser fires a voiceschanged event when the list is ready. Our converter already handles this (it polls and re-renders the dropdown when voices arrive), which is why, on a slow first paint, the dropdown briefly shows "Loading voices…" before filling out.
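One way to wait for the list is to combine the voiceschanged event with a short polling fallback. The following is a sketch under assumptions, not the converter's actual implementation: whenVoicesReady, the 3-second timeout, and the 250 ms poll interval are all illustrative choices.

```javascript
// Call `onReady(voices)` once the voice list is non-empty, or with an empty
// array after `timeoutMs`. `synth` is normally window.speechSynthesis.
// `whenVoicesReady` and its default timings are hypothetical.
function whenVoicesReady(synth, onReady, timeoutMs = 3000, pollMs = 250) {
  const initial = synth.getVoices();
  if (initial.length > 0) {
    onReady(initial); // already populated, nothing to wait for
    return;
  }
  let done = false;
  let waited = 0;
  const finish = (voices) => {
    if (done) return; // deliver exactly once
    done = true;
    clearInterval(timer);
    if (typeof synth.removeEventListener === "function") {
      synth.removeEventListener("voiceschanged", check);
    }
    onReady(voices);
  };
  const check = () => {
    const voices = synth.getVoices();
    if (voices.length > 0) finish(voices);
  };
  // Polling covers engines where the event arrives late or not at all.
  const timer = setInterval(() => {
    waited += pollMs;
    check();
    if (!done && waited >= timeoutMs) finish([]);
  }, pollMs);
  if (typeof synth.addEventListener === "function") {
    synth.addEventListener("voiceschanged", check);
  }
}
```

A caller would pass a function that fills the dropdown, for example whenVoicesReady(window.speechSynthesis, renderVoices), where renderVoices is whatever rendering function the page defines. An empty array in the callback means the timeout elapsed with no voices.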
If the list never populates: see the install system voices page for OS-level checks, then try a different browser as a fast bisect. If voices appear in Safari but not Chrome on the same Mac, the problem is browser-side; if they appear nowhere, the problem is in the OS speech engine.
Length limits
The spec does not impose a maximum utterance length, but in practice each engine has a soft ceiling. Symptoms of going over it are: speech that cuts off mid-sentence, onend firing while there is still text to read, or the page hanging without an error. As a working rule of thumb, keep individual utterances under about 200 to 300 words. Our converter's 5,000-character limit is a UX choice, not an API one. For long texts, paste them into the converter one passage at a time.
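For code that calls the API directly, the rule of thumb above can be applied by splitting text into word-limited chunks and queueing one utterance per chunk. This is a sketch only: splitIntoChunks, speakInChunks, and the 200-word default are illustrative, and the sentence detection is deliberately crude (a word ending in ., !, or ? counts as a sentence end).

```javascript
// Split text into chunks of at most `maxWords` words, cutting at the last
// sentence end inside the window when there is one. Hypothetical helper.
function splitIntoChunks(text, maxWords = 200) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = [];
  let lastBreak = 0; // length of `current` at the most recent sentence end
  for (const word of words) {
    current.push(word);
    if (/[.!?]["')\]]*$/.test(word)) lastBreak = current.length;
    if (current.length >= maxWords) {
      // Prefer a sentence boundary; fall back to a hard cut at the limit.
      const cut = lastBreak > 0 ? lastBreak : current.length;
      chunks.push(current.splice(0, cut).join(" "));
      lastBreak = 0;
    }
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}

// Queue one utterance per chunk. `win` is normally `window`.
function speakInChunks(text, win, maxWords = 200) {
  const chunks = splitIntoChunks(text, maxWords);
  for (const chunk of chunks) {
    // speak() appends to the utterance queue, so chunks play in order.
    win.speechSynthesis.speak(new win.SpeechSynthesisUtterance(chunk));
  }
  return chunks.length;
}
```

Splitting on whitespace normalizes line breaks to single spaces, which is harmless for speech but worth knowing if the chunks are also displayed.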
Event reliability
Three events drive any TTS UI: start, end, and error. In Chromium and Firefox they fire as expected. In Safari, two oddities are worth knowing:
- Calling cancel() can fire onerror with an error type of "canceled" instead of onend. Our converter treats "canceled" as a non-error so the UI does not flash a misleading message.
- When the system speech engine is busy with a separate app (a screen reader, dictation), speak() can return silently without ever firing start. Reload the page or wait for the other app to finish.
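The first oddity can be absorbed in the error handler by treating cancel-induced errors as a normal stop. In the sketch below, attachUtteranceHandlers and the ui object (with onPlaying, onStopped, and onFailed methods) are hypothetical. Besides "canceled", it also accepts "interrupted", the error code the Web Speech API specification defines for an utterance cut off by cancel() after it has started; that addition is an assumption about what a UI would want, not something the converter is documented to do.

```javascript
// Attach handlers that treat a cancel-induced error as a normal stop.
// `ui` is a hypothetical object exposing onPlaying(), onStopped(), onFailed().
function attachUtteranceHandlers(utterance, ui) {
  utterance.onstart = () => ui.onPlaying();
  utterance.onend = () => ui.onStopped();
  utterance.onerror = (event) => {
    // Both codes result from a cancel() call, so they are not real failures.
    if (event.error === "canceled" || event.error === "interrupted") {
      ui.onStopped();
    } else {
      ui.onFailed(event.error);
    }
  };
}
```

Routing both paths to the same "stopped" state keeps the play/pause button consistent whether the browser reports a cancel through onend or through onerror.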
Things that are not portable
Several niceties exist on paper but should not be relied on for a public web app:
- SSML. Most browser engines accept some SSML-like markup, but support varies dramatically. For predictable speech, write plain text and shape it with punctuation; see Writing for TTS.
- Word boundary events (onboundary). Chromium fires them per word, Safari fires them per sentence, Firefox is somewhere in between. Building a karaoke-style word highlighter on top of them works in one browser and breaks in the next.
- Captured audio. The API does not expose the synthesized audio as a stream, only as playback. There is no portable way to record it from JavaScript. To capture the output you must use the OS recorder.
- Background tabs. Browsers throttle background tab work. If you switch tabs mid-speech, expect Safari to pause and Chromium to keep going at half rate.
A short troubleshooting checklist
- Did you click Play? The first audio in a tab needs a user gesture.
- Is the page muted? Right-click the tab; if "Unmute site" is offered, take it.
- Is another app speaking? Some OSs serialize speech engine access; pause your screen reader or close it.
- Did you switch tabs? Bring the converter tab to the foreground and press Play again.
- Is the voice list empty? See Install system voices.
- Are you on a corporate network? Some VPNs and content filters intercept Google's network voices on Chromium. The "(Local)"-tagged voices in the dropdown will still work.
When the browser is not the right tool
If you need consistent voice quality across every visitor, a guaranteed downloadable audio file, or licensed-for-commercial-use voices, a cloud TTS service is the appropriate tool. The browser TTS vs cloud TTS comparison walks through the decision in more detail. For everything else (quick listening, proofreading, accessibility, language practice), browser TTS is more than enough, and the converter is one tab away.