fun, javascript, 2023

Exploring the Potential of the Web Speech API in Karaoke / Ana Rodrigues

Talk description

Isn't it frustrating when the song you want isn't available at karaoke? Let's see if we can solve this. We will look at the current state of the Web Speech API and what's coming next, and have some fun!

Session Summary

Every karaoke bar plays the same one Rasmus song, so Ana Rodrigues built her own karaoke app to fix it, and along the way gives one of the most honest walkthroughs of the Web Speech API you'll see. She demos Twinkle Twinkle Little Star live: speaking the lyrics works, and singing them spectacularly doesn't. She then explains why: your audio is quietly sent to a vendor server, Firefox declines to ship the API on privacy grounds, recognition cuts out after about sixty seconds, and two tabs can't share it. The whole app is an HTML audio element and a pile of if/else, and the talk closes with the Tim Holman pitch that useless side projects are how you actually learn.


A fan site for The Rasmus (1m00s)

Ana Rodrigues opens with the talk's secret subject: her lifelong love of Finnish band The Rasmus, the fan site she built for them in 2004 (learning to code through View Source), and the fact that every karaoke bar she's been to only ever has one of their songs, In the Shadows. The talk is the story of building a karaoke app to fix that.

"I learned how to code via the View Source to learn how to build a fan site for The Rasmus"

"and even had a very successful forum at the time, but we don't have archives of that"

"whenever I go karaoke, they only have one song from The Rasmus — rude"

What is the Web Speech API (3m07s)

The Web Speech API splits into two parts: Speech Recognition and Speech Synthesis. It was originally designed for accessible form input, continuous dictation, and control rather than karaoke, but Ana wanted to gamify lyrics by matching what you sing to the words on screen, and the API was the free, browser-native answer.

"what if we could match what we're saying to the lyrics? Because I know I would win — I know all the lyrics by heart"

"the Web Speech API splits into two: the speech recognition and the speech synthesis"

"one of the core ideas for it was not for karaoke but to enable developers to use the speech recognition as an accessible tool for inputs for forms, continuous dictation, and control"
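
As a minimal sketch of that split, the two halves can be feature-detected separately. The helper name `detectSpeechSupport` is hypothetical and not from the talk; the sketch assumes recognition may only be exposed under the `webkit` prefix in some browsers.

```javascript
// Hypothetical helper: report which half of the Web Speech API a global object exposes.
function detectSpeechSupport(globalObject) {
  // Some browsers only expose recognition under the webkit prefix.
  const Recognition =
    globalObject.SpeechRecognition || globalObject.webkitSpeechRecognition;
  return {
    recognition: typeof Recognition === "function",
    synthesis:
      "speechSynthesis" in globalObject &&
      typeof globalObject.SpeechSynthesisUtterance === "function",
  };
}

// Browser-only usage; skipped where there is no window (for example under Node).
if (typeof window !== "undefined") {
  console.log(detectSpeechSupport(window));
}
```

Passing the global object in, rather than reading `window` directly, keeps the check easy to exercise outside a browser.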

Why your audio goes to a server (4m09s)

A Safari permission prompt ("the speech data from this app will be sent to Apple to process your request") reveals the catch. Most browser implementations of speech recognition send your audio to a vendor-side server. That is why it doesn't work offline, why Firefox doesn't ship it (citing privacy concerns), and why every browser that implements it is backed by a large company with the necessary infrastructure.

"the audio is sent to a web service for recognition processing, so it won't work offline"

"Firefox is not one of them — there are reasons for it. There was a very, very interesting thread which talks about their position and their concerns about privacy and implementation"

"browser vendors that are owned by massive corporations can have an easier time doing this — they have access to all the necessary infrastructure"
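
Because recognition depends on a remote service, an app may want to tell the user why it failed. The sketch below is not from the talk: `explainRecognitionError` and its messages are hypothetical, while the three error codes are ones the Web Speech API defines for its `error` event.

```javascript
// Error codes are defined by the Web Speech API; the messages are illustrative only.
const ERROR_HINTS = new Map([
  ["network", "Recognition needs a network connection; check that you are online."],
  ["not-allowed", "Microphone or speech input was blocked by the browser or the user."],
  ["service-not-allowed", "The browser refused to use its speech recognition service."],
]);

// Hypothetical helper: turn a recognition error code into a user-facing message.
function explainRecognitionError(code) {
  return ERROR_HINTS.get(code) ?? `Speech recognition failed (${code}).`;
}
```

In a browser this could be attached with `recognition.addEventListener("error", (event) => console.warn(explainRecognitionError(event.error)))`, where `recognition` is a `SpeechRecognition` instance.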

A karaoke demo, live (6m16s)

She demos Twinkle Twinkle Little Star (Rasmus lyrics are copyrighted, nursery rhymes aren't), first by speaking the lyrics on time (which works), then by singing them (which doesn't). The lyrics highlight as the song plays; matched words turn green, partial matches orange.

"the good thing is that karaoke is not for good singers — we don't want show-offs"

"my husband insisted that you should give credit if you get one word right — so I put the orange one. He was my emotional support, so I granted this wish"

"if you connect words together, it might not work out"
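
The green and orange scoring can be sketched as a word-by-word comparison. This is an assumption about how the matching might work, not the app's actual code: `toWords` and `scoreLine` are hypothetical names, and the rule used here is that every word right is a full match and at least one word right is a partial match.

```javascript
// Lower-case the text, strip punctuation, and split it into words.
function toWords(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, "")
    .split(/\s+/)
    .filter(Boolean);
}

// Returns "match" (green), "partial" (orange), or "miss" (no credit).
function scoreLine(transcript, lyric) {
  const spoken = new Set(toWords(transcript));
  const expected = toWords(lyric);
  const hits = expected.filter((word) => spoken.has(word)).length;
  if (expected.length > 0 && hits === expected.length) return "match";
  return hits > 0 ? "partial" : "miss";
}
```

Under this rule, a transcript that runs words together (for example "littlestar") matches nothing, which is consistent with the connected-words problem quoted above. In a browser, the transcript would typically come from a `result` event, via `event.results[i][0].transcript`.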

How the karaoke is built (11m35s)

Plain HTML, vanilla JavaScript, CSS, no libraries — and one requirement: HTTPS only. Each lyric has a start and end time. On the audio element's timeupdate event you check whether the current playback time falls inside any lyric range; if so, that lyric is the current one and gets the highlight class. When the user speaks, you compare the recognised transcript against the active lyric.

"this is when I thought this was gonna be straightforward and I'll just copy-paste the code from MDN and it'll work"

"the first line will be 'twinkle twinkle little star', which starts at five seconds and ends at 11.2 seconds"

"in the end of the day, it's a lot of if/else"
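
The timing lookup described above can be sketched as follows. Only the first line's range (5 to 11.2 seconds) is quoted in the talk; the second line's timings, the `data-lyric` markup, and the function name `findCurrentLyric` are assumptions made for illustration.

```javascript
// First range is quoted in the talk; the second is made up for illustration.
const lyrics = [
  { text: "Twinkle twinkle little star", start: 5, end: 11.2 },
  { text: "How I wonder what you are", start: 11.2, end: 17 },
];

// Return the lyric whose [start, end) range contains the playback time, or null.
function findCurrentLyric(lines, currentTime) {
  return (
    lines.find((line) => currentTime >= line.start && currentTime < line.end) ?? null
  );
}

// Browser-only wiring; skipped where there is no DOM (for example under Node).
if (typeof document !== "undefined") {
  const audio = document.querySelector("audio");
  // Hypothetical markup: one element per lyric, in the same order as the array.
  const lineElements = document.querySelectorAll("[data-lyric]");
  if (audio) {
    audio.addEventListener("timeupdate", () => {
      const current = findCurrentLyric(lyrics, audio.currentTime);
      lineElements.forEach((element, index) => {
        element.classList.toggle("highlight", lyrics[index] === current);
      });
    });
  }
}
```

Treating each range as half-open means a time exactly on a boundary, such as 11.2 seconds, belongs to the next line rather than to both.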

The 60-second auto-stop and other quirks (12m40s)

The browser stops the recognition after about a minute to prevent always-on microphones. The hack is to restart it on its end event, which on mobile makes the start/stop notification sound play repeatedly. The Web Speech API also doesn't let two tabs share the recognition — open it in another tab and both stop.

"the speech recognition actually stops after a while — and on mobile it has like sound notifications"

"the browsers don't want you to leave your laptop on for all day and leave — they don't want to process that data constantly"

"if I have a browser running but with another tab, it immediately stops both of them — they just like, 'no, we're not doing that for free, sorry, buy it'"
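
The restart hack can be sketched like this. It assumes `recognition` is a `SpeechRecognition` instance (or anything with the same `start`, `stop`, and event-listener methods); `keepListening` is a hypothetical name, and setting `continuous` is an assumption rather than something the talk confirms.

```javascript
// Hypothetical helper: restart recognition whenever the browser ends it.
// Returns a function that stops listening for good.
function keepListening(recognition) {
  let wanted = true;
  const restart = () => {
    // Only restart while the caller still wants recognition running.
    if (wanted) recognition.start();
  };
  recognition.continuous = true; // assumption: keep listening across pauses
  recognition.addEventListener("end", restart);
  recognition.start();
  return function stopListening() {
    wanted = false;
    recognition.removeEventListener("end", restart);
    recognition.stop();
  };
}
```

Each restart goes through the normal start path, so on mobile the notification sound described above would still play every time. The stop function matters for the same reason: without it, the `end` handler would keep the microphone session alive indefinitely.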

Verdict: potential, but not for production (15m17s)

The Web Speech API has potential but you won't see it in karaoke bars or your live transcription tool any time soon — Shazam, Google's voice-to-text and other paid services are still meaningfully better. It works for short voice notes, not for singing, and Tony Edwards confirmed it doesn't work for rap.

"there is potential — you're not gonna see it rolled out in bars or in your browser anytime soon though"

"Tony Edwards did a fantastic talk called Beats Rhymes and Unit Tests — he wanted to see if the Web Speech API could help him jot down his rhymes"

"the Web Captioner project... has actually been sunset last week"

Useless side projects are valuable (18m23s)

The closing pitch borrows from Tim Holman's 2018 FFconf talk: any idea you have has value. Building this karaoke app gave her a working demo, two new APIs in her vocabulary, CSS animation experience, content for a conference talk, and the experience of reading and understanding API specifications. Useless side projects pay off in skills.

"any idea you have has value"

"only recently I gave myself permission to build unproductive things that are not for my job"

"side projects don't need to be monetised in order to be valid — they don't need to become npm packages or open-source packages"
