Last reviewed on May 13, 2026

The two halves of the Web Speech API

The Web Speech API has two halves: speech synthesis (text → audio, what our converter uses) and speech recognition (microphone → text). They were standardized at different speeds and have very different support stories. This page is about synthesis only.

The synthesis half, exposed as window.speechSynthesis with SpeechSynthesisUtterance, is available in all major modern browsers. The catch is that "available" does not mean "behaves the same." Each browser bridges to a different speech engine on each operating system, and each browser interprets the autoplay rules differently.

What "supported" actually means here

For our converter to be useful, four things have to work in your browser:

  1. Voice enumeration: speechSynthesis.getVoices() returns at least one voice.
  2. Speech playback: speak() produces audible output.
  3. Controls: pause(), resume(), and cancel() work as documented.
  4. Events: onstart, onend, and onerror fire reliably so the play/pause UI updates.

Each of those can break independently. The table and notes below are organized that way.
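
As a rough illustration, the statically checkable parts of that list can be probed before the UI is shown. The sketch below is not the converter's actual code: checkSynthesisSupport, SynthLike, and SupportReport are hypothetical names, and the environment is passed in as a parameter (in a page you would pass window) so the function can also be exercised outside a browser.

```typescript
// Minimal structural types so the sketch compiles without DOM typings.
interface SynthLike {
  getVoices?: () => unknown[];
  speak?: unknown;
  pause?: unknown;
  resume?: unknown;
  cancel?: unknown;
}

interface SupportReport {
  apiPresent: boolean;  // speechSynthesis and SpeechSynthesisUtterance both exist
  hasVoices: boolean;   // getVoices() returns at least one voice right now
  hasControls: boolean; // speak/pause/resume/cancel are all functions
}

// Hypothetical helper: reports what can be detected without playing audio.
function checkSynthesisSupport(env: {
  speechSynthesis?: SynthLike;
  SpeechSynthesisUtterance?: unknown;
}): SupportReport {
  const synth = env.speechSynthesis;
  const apiPresent =
    synth !== undefined && typeof env.SpeechSynthesisUtterance === "function";
  // The second condition repeats the undefined check so the compiler narrows `synth`.
  if (!apiPresent || synth === undefined) {
    return { apiPresent: false, hasVoices: false, hasControls: false };
  }
  const isFn = (value: unknown): boolean => typeof value === "function";
  const voices = typeof synth.getVoices === "function" ? synth.getVoices() : [];
  return {
    apiPresent: true,
    hasVoices: voices.length > 0,
    hasControls:
      isFn(synth.speak) && isFn(synth.pause) && isFn(synth.resume) && isFn(synth.cancel),
  };
}
```

Audible playback and event delivery cannot be detected this way; they only show up once an utterance is actually spoken. An empty voice list at this point may also just mean the list has not loaded yet.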

By browser engine

| Engine | Voice list | Playback | Pause / resume | Notes |
| --- | --- | --- | --- | --- |
| Chromium (Chrome, Edge, Brave, Opera, Vivaldi, Arc) on desktop | OS voices plus, on some builds, Google's network voices | Reliable after a user gesture | Works | Most consistent overall. Long utterances (~200+ words) sometimes silently stall; splitting helps. |
| Safari on macOS | OS voices only; Premium/Enhanced voices appear once downloaded | Reliable | Works; resume may restart the current sentence on some macOS versions | Best voice quality on Apple Silicon Macs once Premium voices are installed. |
| Firefox on desktop | OS voices on Windows and macOS; on Linux, depends on speech-dispatcher | Reliable | Works | Slightly slower to populate the initial voice list; wait a few hundred milliseconds. |
| Safari on iOS / iPadOS (also Chrome, Edge, Firefox on iOS) | iOS system voices; controlled centrally under Settings → Accessibility | Requires a tap on the page first | Pause works; cancel() sometimes has a short delay | All iOS browsers use WebKit, so the list is identical across them. |
| Chrome / Samsung Internet on Android | Depends on selected Android TTS engine (Google, Samsung, third-party) | Reliable after first interaction | Works | Switching the system TTS engine changes the voice list immediately. |
| Firefox on Android | Limited; reads from the Android speech service | Reliable | Works | Smaller voice list than Chrome on the same device. |

Autoplay rules: why nothing happens until you click

Modern browsers block audio that starts without a user gesture. For speech synthesis this means the first speak() call on a page works reliably only if it happens in response to a click, tap, or key press. After that first gesture, audio is "unlocked" for the page and subsequent calls work even from timers or callbacks.

This is why the converter's "Play" button is intentional: the click is the gesture that unlocks audio. Visiting the page and waiting passively is not enough; you must press Play once per page load.
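
A minimal sketch of that pattern follows, assuming a button element and a text source. wirePlayButton, SpeakerLike, and ClickTargetLike are hypothetical names, not the converter's code. The browser objects are passed in as parameters: in a page, button would be the Play button, synth would be window.speechSynthesis, and makeUtterance would be (text) => new SpeechSynthesisUtterance(text).

```typescript
// Structural stand-ins for the browser objects, so the sketch runs anywhere.
interface SpeakerLike<U> {
  speak(utterance: U): void;
}

interface ClickTargetLike {
  addEventListener(type: string, listener: () => void): void;
}

// Hypothetical wiring helper: speech starts only from inside the click handler.
function wirePlayButton<U>(
  button: ClickTargetLike,
  synth: SpeakerLike<U>,
  makeUtterance: (text: string) => U,
  getText: () => string,
): void {
  button.addEventListener("click", () => {
    // speak() is called synchronously within the handler, so the browser
    // can attribute the audio to the user's gesture.
    synth.speak(makeUtterance(getText()));
  });
}
```

The point of the design is that nothing is spoken when the page loads; the only path to speak() runs through the click. Deferring the call behind a long asynchronous step may fall outside what a browser counts as gesture-initiated, so keeping it directly in the handler is the cautious choice.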

iOS is the strictest about this. Safari on iPhone treats every new tab as a fresh permission boundary, and a long silence between pressing Play and the audio starting is sometimes the OS waking the speech engine for the first time in the session.

Voices that load asynchronously

The voice list is populated by the OS, and on Chromium-based browsers it is initially empty when the page first runs. The browser fires a voiceschanged event when the list is ready. Our converter already handles this (it polls and re-renders the dropdown when voices arrive), but the delay explains why, on a slow first paint, the dropdown briefly shows "Loading voices…" before filling out.
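
One way to wait for the list is sketched below. It relies on the voiceschanged event alone rather than polling, so it is an illustration of the idea and not a copy of what the converter does. onVoicesReady and VoiceSourceLike are hypothetical names; in a page, synth would be window.speechSynthesis and render would repopulate the dropdown.

```typescript
// Structural stand-in for window.speechSynthesis.
interface VoiceSourceLike<V> {
  getVoices(): V[];
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

// Hypothetical helper: calls `render` once, as soon as a non-empty list exists.
function onVoicesReady<V>(
  synth: VoiceSourceLike<V>,
  render: (voices: V[]) => void,
): void {
  const initial = synth.getVoices();
  if (initial.length > 0) {
    // Some engines return the list synchronously; no need to wait.
    render(initial);
    return;
  }
  const handle = (): void => {
    const voices = synth.getVoices();
    if (voices.length === 0) return; // still empty: keep listening
    synth.removeEventListener("voiceschanged", handle);
    render(voices);
  };
  synth.addEventListener("voiceschanged", handle);
}
```

A production version would probably add a timeout or a polling fallback for engines where the event is late or never fires.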

If the list never populates: see the install system voices page for OS-level checks, then try a different browser as a fast bisect. If voices appear in Safari but not Chrome on the same Mac, the problem is browser-side; if they appear nowhere, the problem is in the OS speech engine.

Length limits

The spec does not impose a maximum utterance length, but in practice each engine has a soft ceiling. Symptoms of going over it are: speech that cuts off mid-sentence, onend firing while there is still text to read, or the page hanging without an error. As a working rule of thumb, keep individual utterances under about 200 to 300 words. Our converter's 5,000-character limit is a UX choice, not an API one; for long texts, paste them into the converter one passage at a time.
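
For anyone scripting the API directly, the rule of thumb can be applied by splitting text at sentence boundaries before speaking it. The sketch below is one simple way to do that; splitIntoPassages is a hypothetical helper, the 200-word default is taken from the rule of thumb above, and the sentence detection (a word ending in ".", "!", or "?") is a deliberately naive assumption that will misjudge abbreviations.

```typescript
// Hypothetical helper: groups whole sentences into passages of at most
// `maxWords` words. A single sentence longer than `maxWords` is kept whole.
function splitIntoPassages(text: string, maxWords: number = 200): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);

  // Step 1: collect words into sentences. A sentence ends at a word that
  // finishes with . ! or ? (optionally followed by closing quotes/brackets).
  const sentences: string[][] = [];
  let sentence: string[] = [];
  for (const word of words) {
    sentence.push(word);
    if (/[.!?]["')\]]*$/.test(word)) {
      sentences.push(sentence);
      sentence = [];
    }
  }
  if (sentence.length > 0) sentences.push(sentence); // trailing fragment

  // Step 2: pack sentences into passages without exceeding the word budget.
  const passages: string[] = [];
  let current: string[] = [];
  for (const s of sentences) {
    if (current.length > 0 && current.length + s.length > maxWords) {
      passages.push(current.join(" "));
      current = [];
    }
    current = current.concat(s);
  }
  if (current.length > 0) passages.push(current.join(" "));
  return passages;
}
```

Each returned passage can then be wrapped in its own SpeechSynthesisUtterance and passed to speak(); utterances passed to speak() are queued and played in order. Note that the helper collapses runs of whitespace into single spaces.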

Event reliability

Three events drive any TTS UI: start, end, and error. In Chromium and Firefox they fire as expected. In Safari, two oddities are worth knowing:

  • Calling cancel() can fire onerror with an error type of "canceled" instead of onend. Our converter treats "canceled" as a non-error so the UI does not flash a misleading message.
  • When the system speech engine is busy from a separate app (a screen reader, dictation), speak() can return silently without ever firing start. Restart the page or wait for the other app to finish.
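
The first oddity can be handled by classifying how an utterance ended before updating the UI. The sketch below is illustrative, not the converter's code: classifyEnd and EndState are hypothetical names. The "canceled" code comes from the text above; treating "interrupted" the same way is an additional assumption.

```typescript
// Error codes that signal a deliberate stop rather than a failure.
// "canceled" is described above; including "interrupted" is an assumption.
const BENIGN_ERROR_CODES: string[] = ["canceled", "interrupted"];

type EndState = "finished" | "stopped" | "failed";

// Hypothetical helper: maps an end or error event to a UI state.
function classifyEnd(eventType: "end" | "error", errorCode?: string): EndState {
  if (eventType === "end") return "finished";
  if (errorCode !== undefined && BENIGN_ERROR_CODES.indexOf(errorCode) !== -1) {
    return "stopped"; // user-initiated stop: reset the UI quietly
  }
  return "failed"; // anything else deserves a visible message
}
```

In a page, the onerror handler would pass the event's error property as errorCode, and only the "failed" state would show an error message.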

Things that are not portable

Several niceties exist on paper but should not be relied on for a public web app:

  • SSML. Most browser engines accept some SSML-like markup, but support varies dramatically. For predictable speech, write plain text and shape it with punctuation; see Writing for TTS.
  • Word boundary events (onboundary). Chromium fires them per word, Safari fires them per sentence, Firefox is somewhere in between. Building a karaoke-style word highlighter on top of them works in one browser and breaks in the next.
  • Captured audio. The API does not expose the synthesized audio as a stream, only as playback. There is no portable way to record it from JavaScript. To capture the output you must use the OS recorder.
  • Background tabs. Browsers throttle background tab work. If you switch tabs mid-speech, expect Safari to pause and Chromium to keep going at half rate.

A short troubleshooting checklist

  1. Did you click Play? The first audio in a tab needs a user gesture.
  2. Is the page muted? Right-click the tab; if "Unmute site" is offered, take it.
  3. Is another app speaking? Some operating systems serialize speech engine access, so pause your screen reader or close it.
  4. Did you switch tabs? Bring the converter tab to the foreground and press Play again.
  5. Is the voice list empty? See Install system voices.
  6. Are you on a corporate network? Some VPNs and content filters intercept Google's network voices on Chromium. The "(Local)"-tagged voices in the dropdown will still work.
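
For item 6, each voice object exposes a localService flag that indicates whether the voice is synthesized on the device. The sketch below shows how a dropdown could put local voices first and tag them. It is an assumption that the converter derives its "(Local)" tag this way; localVoicesFirst, voiceLabel, and VoiceLike are hypothetical names.

```typescript
// The subset of SpeechSynthesisVoice fields this sketch reads.
interface VoiceLike {
  name: string;
  lang: string;
  localService: boolean;
}

// Hypothetical helper: on-device voices first, original order otherwise kept.
function localVoicesFirst<V extends VoiceLike>(voices: V[]): V[] {
  const local = voices.filter((v) => v.localService);
  const remote = voices.filter((v) => !v.localService);
  return local.concat(remote);
}

// Hypothetical label format; the real dropdown text may differ.
function voiceLabel(voice: VoiceLike): string {
  const tag = voice.localService ? " (Local)" : "";
  return `${voice.name} [${voice.lang}]${tag}`;
}
```

In a page, the input would be the array returned by speechSynthesis.getVoices().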

When the browser is not the right tool

If you need consistent voice quality across every visitor, a guaranteed downloadable audio file, or licensed-for-commercial-use voices, a cloud TTS service is the appropriate tool. The browser TTS vs cloud TTS comparison walks through the decision in more detail. For everything else (quick listening, proofreading, accessibility, language practice), browser TTS is more than enough, and the converter is one tab away.