This Python Vosk tutorial describes how to convert the speech in a live stream podcast to text. In a previous post we covered converting a podcast or mp3 file to a JSON text file, so here we will skip saving the transcription results to a JSON file.
This is part of a series of posts so if you’re starting here you might want to read the first three posts:
- Which Python Library To Use For Speech To Text Conversion
- How To Setup A Python Speech To Text Environment Using Vosk
- How To Convert Speech To Text Using Python and Vosk
In the post that describes how to set up the environment, we created a Python virtual environment and a batch file to activate it. On my Windows 10 system, starting a command line window gives me a prompt with the current directory set to c:\users\xxx, where xxx is my Windows user ID. You can get to the Windows command line by searching for "command prompt" and running the application. The batch file we created was called NLP.bat, so entering that at the command prompt should activate your Python virtual environment. We also created a batch file for IDLE, so go ahead and start your IDLE Python editor.
Transcribe Live Stream To Text
Once you are in IDLE, you can copy and paste the following code into the IDLE shell. This script converts a live stream podcast to text.
#!/usr/bin/env python3
from vosk import Model, KaldiRecognizer, SetLogLevel
import subprocess
# initialize variables
SetLogLevel(0)
sample_rate=16000
model = Model(r"C:\Users\xxx\pyenv\NLP\model")
streamURL = "http://listen.noagendastream.com/noagenda"
# create recognizer
recognizer = KaldiRecognizer(model, sample_rate)
# create ffmpeg process to read the stream
process = subprocess.Popen(
    ['ffmpeg', '-loglevel', 'quiet', '-i', streamURL,
     '-ar', str(sample_rate), '-ac', '1', '-f', 's16le', '-'],
    stdout=subprocess.PIPE)
while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        print("end of stream")
        break
    if recognizer.AcceptWaveform(data):
        print(recognizer.Result())
    # else:
    #     print(recognizer.PartialResult())

print(recognizer.FinalResult())
print("End Stream Transcription")
Explanation of Python Code
Let’s walk through the code and see what it does.
from vosk import Model, KaldiRecognizer, SetLogLevel
import subprocess
First, a few imports. The vosk module provides an easy-to-use Python API for transcribing audio to text, built on top of the Kaldi speech recognition toolkit.
# initialize variables
SetLogLevel(0)
sample_rate=16000
model = Model(r"C:\Users\xxx\pyenv\NLP\model")
streamURL = "http://listen.noagendastream.com/noagenda"
Next we define a few variables. The model variable is set to the path containing the Vosk language model you downloaded as part of the installation (see this post). The streamURL variable is the URL of your live stream. I've shown the URL for the NoAgenda live stream, which frequently has a podcast streaming; you can substitute the URL of any live streaming source.
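The model path shown is specific to my Windows setup; point it at whatever directory you unzipped the model into. On Linux or macOS, for example, it might look something like this (the path here is just an illustration):

# Point this at the directory containing the unzipped Vosk model files.
model = Model("/home/xxx/pyenv/NLP/model")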
recognizer = KaldiRecognizer(model, sample_rate)
In this line we create the recognizer object, which will do all the real work.
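One optional tweak: if you want the per-word timing information mentioned below to appear in the JSON results, recent versions of the vosk package let you request it explicitly. Add this line right after creating the recognizer:

# Optional: include per-word start/end times and confidence
# scores in each JSON result.
recognizer.SetWords(True)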
process = subprocess.Popen(
    ['ffmpeg', '-loglevel', 'quiet', '-i', streamURL,
     '-ar', str(sample_rate), '-ac', '1', '-f', 's16le', '-'],
    stdout=subprocess.PIPE)
The process variable represents a subprocess object that runs the ffmpeg program. ffmpeg is what actually reads the stream (notice that the streamURL variable is passed to ffmpeg as a parameter). The remaining arguments convert the audio into the raw form Vosk expects: -ar sets the sample rate to 16000 Hz, -ac 1 downmixes to a single mono channel, -f s16le produces raw 16-bit little-endian PCM, and the trailing - writes the output to stdout so our script can read it.
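Note that ffmpeg must be installed and on your PATH for subprocess.Popen to find it. As a small precaution, you could check for it before starting the subprocess; this check is my own addition, not part of the original script:

import shutil

# Fail early with a clear message if ffmpeg is missing, rather
# than getting a cryptic FileNotFoundError from Popen.
if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found - install it and add it to your PATH")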
while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        print("end of stream")
        break
    if recognizer.AcceptWaveform(data):
        print(recognizer.Result())
Finally we get to the actual work of the script.
To process the stream we go into an infinite loop using the "while True" construct. The loop reads the stream in 4000-byte chunks and exits when process.stdout.read() returns a zero-length chunk of data. Each chunk is passed to the recognizer.AcceptWaveform(data) method. This method returns true when a complete utterance has been processed, which I think is determined by some amount of silence occurring between words. The documentation is sketchy, so I might not be interpreting this correctly. Regardless, when the method returns true you can retrieve a chunk of JSON containing the complete string of spoken words (and, if you enabled SetWords(True) as shown earlier, the start and stop times for each word). This is accomplished with the recognizer.Result() method, and the result is printed to the screen.
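Since Result() returns a JSON string, you can also parse it instead of just printing it. Here is a minimal sketch of what that might look like inside the loop, in place of the print(recognizer.Result()) call (the field names come from Vosk's JSON output; the per-word list only appears if you enabled SetWords(True)):

import json

result = json.loads(recognizer.Result())

# The "text" field holds the full recognized utterance.
print(result["text"])

# With SetWords(True) enabled, each entry in "result" carries the
# word plus its start and end times (in seconds) and a confidence.
for word in result.get("result", []):
    print(word["word"], word["start"], word["end"], word["conf"])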
Wrapping Up
This post showed how to connect to a live stream using ffmpeg and convert the live stream audio to text in real time. If this is the first time you've found this series, I'd recommend starting with the earlier posts listed at the top of this page.