Use the Google Cloud Speech API to transcribe a podcast

As I was listening to the December 21 episode of the CPPCast, together with TWiML&AI my two most favourite podcasts, I couldn’t help but be a little bewildered by the number of times the guest used the word “like” during their interview.

Most of these were examples of speech disfluency, or filler words, but I have to admit that they detracted somewhat from an otherwise interesting discourse.

During another CPPCast episode which I recently listened to, the hosts coincidentally discussed the idea of making available transcriptions of the casts.

These two occurrences, namely the abundance of the “like” disfluency and the mention of transcription, connected in the back of my mind, and produced the idea of finding out how one could go about to use a publically available speech API to transcribe the podcast, and count the number of utterances of the word “like”.

Due to the golden age of information we find ourselves in, this was not that hard at all.

Selecting the API

After a short investigation of Microsoft’s offerings seemed to indicate that I would not be able to transcribe just under an hour of speech, I turned to Google.

The Google Cloud Speech API has specific support for the asynchronous transcription of speech recordings of up to 3 hours.

Setting up the project and service account

Make sure that you can access the Google Cloud Dashboard with your google account. I created a new project for this experiment called cppcast-speech-to-text.

Within that project, select APIs & Services dashboard from the menu on the left, and then enable the Speech API for that project by selecting the Enable APIs and Services link at the top.

Next, go to IAM & Admin and Service Accounts via the main menu, and create a service account for this project.

Remember to select the download JSON private key checkbox.

Transcode and upload the audio

For the Speech API, you will have to transcode the MP3 to FLAC, and you will have to upload the file to a Google Cloud Storage bucket.

I transcoded the MP3 to a 16kHz mono FLAC (preferred by the API) as follows:

ffmpeg -i cppcast-131.mp3 -y -vn -acodec flac -ar 16000 -ac 1 cppcast-131.flac

This turned my 39MB MP3 into a 61MB FLAC file.

Create a storage bucket via the cloud dashboard main menu’s StorageBrowser menus, and then upload the FLAC file to that bucket via the web interface.

Note down the BUCKETNAME and the FILENAME, you’ll need these later when starting the transcription job.

Transcribe!

I used the Asynchronous Speech Recognition API, as this is the only API supporting speech segments this long.

First startup the Google Cloud Shell by clicking on the boxed >_ icon at the top left. In this super convenient Debian Linux shell, gcloud is already installed, which is why I chose to use it.

Upload your service account JSON private key, and activate it by doing the following:

export GOOGLE_APPLICATION_CREDENTIALS=~/your-service-account-key.json

Using one of the installed editors, or just uploading, create a file called async-request.json in your home:

{
  "config": {
      "encoding": "FLAC",
      "sampleRateHertz": 16000,
      "language_code": "en-US"
  },
  "audio":{
    "uri":"gs://BUCKETNAME/FILENAME"
  }
}

You are now ready to make the request using curl, and the async-request.json file you created:

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data @async-request.json "https://speech.googleapis.com/v1/speech:longrunningrecognize"

You should see a response looking something like this:

{
  "name": "LONG_JOB_NUMBER"
}

Soon after this, you can start seeing how your job is progressing:

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/LONG_JOB_NUMBER"

The response will look like this while your request is being processed:

{
  "name": "LONG_JOB_NUMBER",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 29,
    "startTime": "2018-02-14T20:17:05.885941Z",
    "lastUpdateTime": "2018-02-14T20:22:26.830868Z"
  }
}

In my case, the 56 minute podcast was transcribed in just under 17 minutes.

When the job is done, the response to the above curl request will contain the transcribed text. It looks something like this:

{
  "name": "LONG_JOB_NUMBER",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2018-02-14T20:17:05.885941Z",
    "lastUpdateTime": "2018-02-14T20:35:16.404144Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "I said 130 want to see PP cast with guest Nicole mazzuca recorded September 14th 2017",
            "confidence": 0.8874592
          }
        ]
      },

// and so on for the whole recording

You can download the full transcription here.

Too many likes?

I wrote the following Python to tally up the total number of words, and the total number of “like” utterances.

import json

with open('/Users/cpbotha/Downloads/cppcast-131-text.json') as f:
    # results: a list of dicts, each with 'alternatives', which is a list of transcripts
    res = json.load(f)['response']['results']

num_like = 0
num_words = 0
for r in res:
    alts = r['alternatives']
    # ensure that we only have one alternative per result
    assert len(alts) == 1
    # break into lowercase words
    t = alts[0]['transcript'].strip().lower().split()
    # tally up total number of words
    num_words += len(t)
    # count the like utterances
    num_like += sum(1 for w in t if w == 'like')

In this 56 minute long episode of CPPCast, 7411 words were detected, 214 of which were the word “like”.

This is not quite as many as I imagined, but still comes down to 3.82 likes per minute, which is enough to be quite noticeable.

Conclusions

  • We should try to use “like” and other speech disfluencies far less often. Inserting a small pause makes more sense: The speaker and the listeners get a little break to process the ongoing speech, and the speech comes across as more measured.
  • All in all, it took me about 2 hours from idea to transcribed text. I find it wonderful that machine learning for speech-to-text has become so democratised.
  • After my transcription job was complete, I saw that it was possible to supply phrase hints to the API. I could have uploaded a list of words we expect to occur during this podcast, such as “CPPCast” and “C++”, and this would have been used by the API to further improve its transcription.