How To Convert A Live Stream Podcast To Text

This Python Vosk tutorial describes how to convert the speech in a live stream podcast to text. In a previous post we described how to convert a podcast or MP3 file to a JSON text file, so we will skip how to save the transcription results in a JSON file.
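As a quick refresher, collecting recognizer results and saving them as JSON can be sketched roughly like this. The filename and the list-of-results structure here are illustrative choices for this sketch, not necessarily what the earlier post used:

```python
import json

# each call to recognizer.Result() returns a JSON string; the two
# literals below are hand-made stand-ins for those strings
results = []
for raw in ['{"text": "hello world"}', '{"text": "goodbye"}']:
    results.append(json.loads(raw))

# write the accumulated results out as one JSON document
with open("transcript.json", "w") as f:
    json.dump(results, f, indent=2)
```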

This is part of a series of posts, so if you’re starting here you might want to read the first three posts in the series.

In the post that describes how to set up the environment, we created a Python virtual environment and a batch file to activate it. On my Windows 10 system, when I start a command line window I get a command prompt with the current directory set to C:\Users\xxx, where xxx is my Windows user ID. You can get to the Windows command line by searching for Command Prompt and running the application. The batch file we created was called NLP.bat, so if you enter that at the command prompt you should activate your Python virtual environment. We also created a batch file for IDLE, so go ahead and start your IDLE Python editor.

IDLE screenshot

Transcribe Live Stream To Text

Once you are in IDLE, you can copy and paste the following code into the IDLE editor. This script will convert a live stream podcast to text.

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import subprocess

# initialize variables
SetLogLevel(0)
sample_rate = 16000
model = Model(r"C:\Users\xxx\pyenv\NLP\model")
streamURL = "http://listen.noagendastream.com/noagenda"

# create recognizer
recognizer = KaldiRecognizer(model, sample_rate)

# create ffmpeg process to read the stream
process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                            streamURL,
                            '-ar', str(sample_rate), '-ac', '1', '-f', 's16le', '-'],
                            stdout=subprocess.PIPE)

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        print("end of stream")
        break
    if recognizer.AcceptWaveform(data):
        print(recognizer.Result())
#    else:
#        print(recognizer.PartialResult())

print(recognizer.FinalResult())
print("End Stream Transcription")

Explanation of Python Code

Let’s walk through the code and see what it does.

from vosk import Model, KaldiRecognizer, SetLogLevel
import subprocess

First, a few imports. The vosk module provides an easy-to-use Python API for transcribing audio to text using the Kaldi speech recognition toolkit. The subprocess module lets us run ffmpeg and read its output.

# initialize variables
SetLogLevel(0)
sample_rate = 16000
model = Model(r"C:\Users\xxx\pyenv\NLP\model")
streamURL = "http://listen.noagendastream.com/noagenda"

Next we define a few variables. The model variable is set to the path containing the Vosk language model you downloaded as part of the installation (see this post). The streamURL variable is the URL for your live stream. I’ve shown the URL for the NoAgenda live stream, which frequently has a podcast streaming; you can substitute the URL of any live streaming source.

recognizer = KaldiRecognizer(model, sample_rate)

In this line we create the recognizer object which will do all the real work.

process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                            streamURL,
                            '-ar', str(sample_rate), '-ac', '1', '-f', 's16le', '-'],
                            stdout=subprocess.PIPE)

The process variable holds a subprocess object running the ffmpeg program. ffmpeg is what actually reads the stream (notice that the streamURL variable is passed to ffmpeg as a parameter). The remaining arguments tell ffmpeg to resample the audio to our sample rate (-ar), mix it down to a single channel (-ac 1), and write raw signed 16-bit little-endian PCM (-f s16le) to stdout (-), which is the format the recognizer expects.
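If you want to reuse this invocation elsewhere, the argument list can be built by a small helper function. The name ffmpeg_args below is a hypothetical helper for this sketch, not part of the vosk or ffmpeg APIs:

```python
def ffmpeg_args(url, sample_rate):
    """Build the ffmpeg command line that decodes a stream to raw 16-bit mono PCM on stdout."""
    return ['ffmpeg', '-loglevel', 'quiet',
            '-i', url,                 # input: the live stream URL
            '-ar', str(sample_rate),   # resample to the recognizer's rate
            '-ac', '1',                # downmix to mono
            '-f', 's16le',             # raw signed 16-bit little-endian samples
            '-']                       # write to stdout

print(ffmpeg_args("http://listen.noagendastream.com/noagenda", 16000))
```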

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        print("end of stream")
        break
    if recognizer.AcceptWaveform(data):
        print(recognizer.Result())

Finally we get to the actual work of the script.

To process the stream we go into an infinite loop using the “while True” construct. The loop processes the stream until the process.stdout.read call returns a zero-length chunk of data. Each chunk is fed to the recognizer with the recognizer.AcceptWaveform(data) method. This method returns true when a complete utterance has been processed, which I think is defined by some amount of silence occurring between words. The documentation is sketchy so I might not be interpreting this correctly. Regardless, when this method returns true you can retrieve a chunk of JSON that describes each word spoken, including start and stop times in the audio, as well as the complete string of spoken words. This is accomplished with the recognizer.Result() method, and the result is printed to the screen. After the loop ends, recognizer.FinalResult() flushes and returns any speech still buffered in the recognizer.
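If you want to work with a result instead of just printing it, the string parses with the standard json module. The literal below is a hand-made stand-in for what recognizer.Result() returns; the real output includes the per-word timing entries shown here when word output is enabled, and always includes the "text" field:

```python
import json

# stand-in for one recognizer.Result() string
raw = '{"result": [{"word": "hello", "start": 0.5, "end": 0.9, "conf": 1.0}], "text": "hello"}'

chunk = json.loads(raw)
print(chunk["text"])                  # the full recognized phrase
for w in chunk.get("result", []):     # per-word details, if present
    print(w["word"], w["start"], w["end"])
```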

Conclusion

This post showed how to connect to a live stream using ffmpeg and convert the live stream audio to text in real time. If this is the first time you’ve found this series of posts, I’d recommend going back to the beginning, as described at the top of this post.
