Taming the Web Speech API

Andrea Giammarchi
9 min read · Jan 31, 2023



Unfortunately the Web Speech API is not fully cross browser, and not super aligned across the browsers that do implement it: it's very cool but kinda unusable as is. The goal of this post is to explain how to work around its gotchas.

What works and what doesn’t 🙈 🙉 🙊

Let's start by clearing up what may work for you but not for me, and what surely doesn't work across browsers:

  • on (my) Linux, Chromium exposes the API but no sound is ever heard. The mic is fine though!
  • on (my) Linux, MS Edge can speak but it crashes when attempting to use the mic.
  • the previous points mean that Chrome could both speak and listen on Desktop, but I hadn't tried the non OSS version (yet) … update: I did and it worked well!
  • Firefox doesn't officially support the SpeechRecognition API, even though KaiOS is 90% based on speech recognition (without it you'd need to type inputs at 3 clicks per button for each char, every time) … I have no idea why Firefox wouldn't acknowledge how important this API is, but at least "it can speak" …
  • GNOME Web doesn’t seem to expose the API at all.
  • Chrome on Android apparently works best!
  • Safari on Mobile requires workarounds due to events shenanigans (explained here).
  • Safari Mobile WebView triggers an error right away, without even asking to activate the microphone. You need to check all demos within Safari itself, not as a WebView from PWAs or native apps, otherwise nothing will work.
  • Safari on Mobile won't allow the Speech Recognition API once installed as a PWA.
  • Brave browser refuses to implement the SpeechRecognition API because without internet there’s no recognition!

There will be demos to test (forget about Firefox), but as a quick check: if you can allow the mic and see events besides an error on this page, you're good to go!
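
Programmatically, a minimal feature-detection sketch could look like this; it's just an illustration, not one of the demo files:

// quick feature detection: nothing vendor specific is assumed here
const hasRecognition =
  'SpeechRecognition' in globalThis || 'webkitSpeechRecognition' in globalThis;
const hasSynthesis = 'speechSynthesis' in globalThis;

if (!hasRecognition)
  console.warn('SpeechRecognition is not available in this browser');
if (!hasSynthesis)
  console.warn('speechSynthesis is not available in this browser');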

The SpeechRecognition

Before showing code and live examples, let's make this clear: I haven't (yet) explored the .continuous functionality, because knowing any tab could listen forever freaks me out, but there is a chance that for the last demo it could be a much more suitable solution.

Update: the continuous feature is completely useless on iOS and hard to deal with properly on Chrome. It’s a forever increasing list of results in Chrome or a forever increasing single text result on iOS.

I am letting you explore/experiment with it only after understanding all the gotchas we need to take care of, so let's start!

In a nutshell, as the bare-bones sketch after this list also shows, this API allows your page to:

  • access the mic for a finite time or for as long as needed (continuous)
  • provide a result once the mic has decided nobody is speaking anymore (except iOS because … iOS)
  • return possible alternatives to what's been understood as a word or phrase
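
Here is a bare-bones sketch of that surface; the maxAlternatives value and the console logging are arbitrary choices of mine, not something the demos below rely on:

// minimal SpeechRecognition usage: one shot, up to 3 alternatives
const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
const sr = new SR;
sr.maxAlternatives = 3;
sr.addEventListener('result', ({results}) => {
  for (const result of results) {
    if (result.isFinal) {
      // each alternative carries a transcript and a confidence
      for (const {transcript, confidence} of result)
        console.log(transcript, confidence);
    }
  }
});
// needs a user gesture and mic permission to actually do anything
sr.start();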

This API also exposes some events, which are logically ordered as follows: start, audiostart, soundstart, speechstart, speechend, soundend, audioend, result, and finally end. There are two extra events, error and nomatch, where the former happens, for example, if we purposely switch off the mic or if the recognition fails big time.

The Events

Using minimal code that hides boring operations behind the scenes, this is a basic demo page that shows all events as they occur:

import {$} from './$.js';

// normalize SpeechRecognition
let {SpeechRecognition} = globalThis;
if (!SpeechRecognition)
  SpeechRecognition = webkitSpeechRecognition;

// create a div and show the event name
const logEvent = ({type}) => {
  const div = document.createElement('div');
  div.textContent = type;
  $('#content').appendChild(div);
};

// activate the listening
$('#mic').on('click', ({currentTarget}) => {
  // avoid clicks while listening
  currentTarget.disabled = true;

  // log passed time
  logEvent({type: 0});
  const time = new Date;
  const i = setInterval(node => {
    node.textContent = ((new Date - time) / 1000).toFixed(1);
  }, 100, $('#content').lastChild);

  // start listening to all events *and*
  // avoid iOS listening forever (force a stop after 10 seconds)
  setTimeout(
    $(new SpeechRecognition)
      .on('start', logEvent)
      .on('audiostart', logEvent)
      .on('soundstart', logEvent)
      .on('speechstart', logEvent)
      .on('speechend', logEvent)
      .on('soundend', logEvent)
      .on('audioend', logEvent)
      .on('result', logEvent)
      .on('end', event => {
        logEvent(event);
        // cleanup and stop listening
        clearInterval(i);
        event.currentTarget.stop();
        currentTarget.disabled = false;
      })
      // extra events
      .on('error', logEvent)
      .on('nomatch', logEvent)
      .start()
      // forward the stop
      .stop,
    10000
  );
});

Basically we can click the top mic button, say something, and see all events logged on the page in order … except for iOS, where only start and audiostart are logged, then we can wait seconds or forever before getting back a result.

There are many bugs and discussions online around this inconsistent iOS behavior compared to Chromium based alternatives, and the solution is to orchestrate a best-effort estimate of when a person might have stopped talking: in my example, a delay of 750 milliseconds without partial results.

The interimResults

Precisely because Safari Mobile won't ever react in a humanly reasonable time to provide an answer, the trick I've used is a mix that works fine on Chromium based browsers and slightly slower on iOS, but at least it still works!

interimResults suggests the browser should dispatch result events literally as we speak, so that having no results within some milliseconds is a fair hint that the user has stopped talking.

Here is the interimResults counter example that works in all capable browsers; see it with your phone!

import {$} from './$.js';

// normalize SpeechRecognition
let {SpeechRecognition} = globalThis;
if (!SpeechRecognition)
  SpeechRecognition = webkitSpeechRecognition;

// create a div and show some text
const log = text => {
  const div = document.createElement('div');
  div.textContent = text;
  $('#content').appendChild(div);
};

// activate the listening
$('#mic').on('click', ({currentTarget}) => {
  // avoid clicks while listening
  currentTarget.disabled = true;

  // log passed time
  log(0);
  const time = new Date;
  const i = setInterval(node => {
    node.textContent = ((new Date - time) / 1000).toFixed(1);
  }, 100, $('#content').lastChild);

  // start listening with interimResults
  const sr = new SpeechRecognition;
  sr.interimResults = true;
  let t = 0, ended = false;
  $(sr)
    // works both on Chrome and Safari
    .on('result', event => {
      // prevent multiple showResult calls
      clearTimeout(t);
      // but if audioend fired already
      if (ended)
        // show results right away (if any final is present)
        showResult(event);
      // otherwise wait 750ms (or more, or less)
      else
        t = setTimeout(showResult, 750, event);
    })
    // works on Chrome, maybe on Safari too
    .on('audioend', () => {
      ended = true;
    })
    .start();

  // stop listening (collects the final result)
  // and show the result. This could get called
  // multiple times.
  function showResult({results}) {
    ended = true; // speed up iOS
    sr.stop();
    for (const result of results) {
      // consider only the final result
      if (result.isFinal) {
        // loop the first alternative returned
        for (const {transcript} of result) {
          // clean up and show result + enable button
          clearInterval(i);
          console.log(result);
          log('You said: ' + transcript);
          currentTarget.disabled = false;
          return;
        }
      }
    }
  }
});

Now, if we check the previous link we can click the button, say something, and read what the mic understood.

Simplifying the dance

The reason I've published that page is to show how to orchestrate a reliable result for both Chrome and Safari, but that's also quite some boilerplate … and since we now know what it takes to have reliable results, how about creating a tiny helper?

const {assign} = Object;
const interimResults = {interimResults: true};
const once = {once: true};

let {SpeechRecognition} = globalThis;
if (!SpeechRecognition)
  SpeechRecognition = webkitSpeechRecognition;

export default (options = void 0) => new Promise((resolve, reject) => {
  let t = 0, ended = false;
  const stop = event => {
    if (event) reject(event);
    clearTimeout(t);
    ended = true;
    sr.stop();
  };
  const result = ({results}) => {
    stop();
    for (const result of results) {
      if (result.isFinal) {
        for (const {transcript} of result) {
          resolve(transcript);
          return;
        }
      }
    }
  };
  const sr = assign(new SpeechRecognition, options, interimResults);
  sr.addEventListener('error', stop, once);
  sr.addEventListener('nomatch', stop, once);
  sr.addEventListener('audioend', () => stop(), once);
  sr.addEventListener('result', event => {
    if (ended)
      result(event);
    else {
      clearTimeout(t);
      t = setTimeout(result, 750, event);
    }
  });
  sr.start();
});

Well done me, now I have a Promise based solution that listens and resolves once the user has stopped talking 🥳

import {$} from './$.js';
import listen from './listen.js';

// create a div and show some text
const log = text => {
  const div = document.createElement('div');
  div.textContent = text;
  $('#content').appendChild(div);
};

// activate the listening
$('#mic').on('click', ({currentTarget}) => {
  // avoid clicks while listening
  currentTarget.disabled = true;

  // log passed time
  log(0);
  const time = new Date;
  const i = setInterval(node => {
    node.textContent = ((new Date - time) / 1000).toFixed(1);
  }, 100, $('#content').lastChild);

  // listen to something
  listen().then(
    transcript => {
      clearInterval(i);
      log('You said: ' + transcript);
      currentTarget.disabled = false;
    },
    console.error
  );
});

We gotta agree this page now looks way cleaner than before … right? And this is also live to test here: just click the mic and say something!

OK Goo … er … Web!

So far so good: we have an extremely simplified way to listen once and await results, but how could we orchestrate something like "OK Google", except we want the page to react when we say "OK Web", and also stop listening when we say exactly "Stop listening"? Cherry on top, we want the very same page to talk to us too!

import {$} from './$.js';
import listen from './listen.js';

// create a div and show some text
const log = text => {
  const div = document.createElement('div');
  div.textContent = text;
  $('#content').appendChild(div);
};

// say something in the default language
const say = text => {
  const ssu = new SpeechSynthesisUtterance(text);
  // cancel any previous text before starting this one
  speechSynthesis.cancel();
  speechSynthesis.speak(ssu);
};

// activate the listening
$('#mic').on('click', ({currentTarget}) => {
  currentTarget.disabled = true;
  const check = transcript => {
    switch (transcript.toLowerCase()) {
      case 'stop listening':
        currentTarget.disabled = false;
        say('just stopped');
        log('Just stopped 👍');
        break;
      case 'ok web':
      case 'okay web':
        say('I am ready');
        log('I am ready 🤖');
        // intentional fall-through: keep listening after reacting
      default:
        listen().then(check);
        break;
    }
  };
  // grant SpeechSynthesisUtterance usage
  say('');
  // listen and check
  listen().then(check);
});

And that’s what this page does indeed!

P.S. this is the previously mentioned demo that might make more sense with continuous listening, which would avoid the mic icon, or its light, flashing for each sentence we say … however, I've found this approach simple to reason about, and quite convenient too.
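
For completeness, this is roughly what the continuous variant could look like; it's a hedged sketch I haven't shipped in the demos, and as noted earlier the ever-growing results list behaves differently across browsers:

// continuous listening: results keep accumulating while the mic is on
const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
const sr = new SR;
sr.continuous = true;
sr.interimResults = true;
sr.addEventListener('result', ({results}) => {
  // only look at the latest result, as the list keeps growing
  const last = results[results.length - 1];
  if (last.isFinal)
    console.log(last[0].transcript);
});
sr.start();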

A common privilege gotcha!

As with most Web APIs, we need a trusted user action to enable functionality such as "listen" or "speak", and while I'd argue any unrelated user event would do, so this trust model doesn't really work for me, it's very important to grant these privileges "all at once" in a "listen and speak" kind of application.

This is the reason why, within the button click, I also say('') nothing: that's good enough to grant speaking rights alongside the listening rights, except the former doesn't trigger, or require, any confirmation the way the mic prompt does.
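
To make the pattern explicit, here is the bare idea isolated from the demo code, using plain DOM APIs and the same #mic button as the demos:

// unlock both speaking and listening from the same trusted click
document.querySelector('#mic').addEventListener('click', () => {
  // an empty utterance is enough to grant speaking rights, silently
  speechSynthesis.cancel();
  speechSynthesis.speak(new SpeechSynthesisUtterance(''));
  // starting the recognition triggers the one-off mic permission prompt
  const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  new SR().start();
});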

About Voices

We now know the basics around listening and speaking, but we could dig further and personalize our experience using the voice we like the most.

We are talking about SpeechSynthesis.getVoices() here, quite possibly the ugliest API I’ve ever witnessed on the Web:

  • it’s not synchronous
  • it doesn’t return a Promise
  • it triggers the retrieval of its values only after a voiceschanged listener has been attached and speechSynthesis.getVoices() has been called
  • on top of that, Chromium might never invoke that callback, so a hard time-stopper is needed:

export default (timeout = 3000) => new Promise($ => {
  // the listener must be attached before trying to access voices
  speechSynthesis.addEventListener(
    'voiceschanged',
    () => { $(speechSynthesis.getVoices()) },
    {once: true}
  );
  // kinda triggers the voices retrieval
  const voices = speechSynthesis.getVoices();
  // if already populated, just resolve with it
  if (voices.length) $(voices);
  // otherwise resolve with an empty list after the timeout
  setTimeout($, timeout, []);
});

There we go: we can now safely import whenVoices from './voices.js' and use whenVoices().then(...) to populate a select, inform the user there are other options, or automatically pick the best one.
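
As a usage sketch, assuming a #voices select exists on the page (it's not part of the demos above):

import whenVoices from './voices.js';

whenVoices().then(voices => {
  const select = document.querySelector('#voices');
  for (const voice of voices) {
    const option = document.createElement('option');
    option.textContent = `${voice.name} (${voice.lang})`;
    select.appendChild(option);
  }
  // speak a sample with whatever voice has been picked (or the default)
  select.addEventListener('change', () => {
    const ssu = new SpeechSynthesisUtterance('Hello there');
    ssu.voice = voices[select.selectedIndex] || null;
    speechSynthesis.cancel();
    speechSynthesis.speak(ssu);
  });
});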

In Summary

All files and demos revealed in this post are Open Source on GitHub, and there is also a root demo I am playing with, something I haven't polished yet with these utilities, which I created only afterwards, both to write this post and to help me out with the madness. The PoC is called Talk2GPT, and the idea is that we can click the mic and ask something to ChatGPT, which will then be read out loud, besides being shown on the page.
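
Just to give an idea of that flow, and nothing more, a rough sketch could combine the listen helper with a backend call; the /ask endpoint here is purely hypothetical, the real PoC lives in the repository:

import listen from './listen.js';

// click the mic, ask something, then read the answer out loud and show it
document.querySelector('#mic').addEventListener('click', async () => {
  const question = await listen();
  // hypothetical endpoint that forwards the question to ChatGPT
  const response = await fetch('/ask', {
    method: 'POST',
    headers: {'content-type': 'application/json'},
    body: JSON.stringify({question})
  });
  const {answer} = await response.json();
  speechSynthesis.cancel();
  speechSynthesis.speak(new SpeechSynthesisUtterance(answer));
  document.querySelector('#content').textContent = answer;
});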

I hope these tricks and workarounds will help you too in using this super nice API more, and I am looking forward to Mozilla implementing it as well, as Web apps you can talk to could already be the norm 👋

