How To Use Python To Convert Speech To Text
This post begins my effort to implement speech to text conversion using Python. Speech to text is one component of a larger set of capabilities called Natural Language Processing, or NLP for short.
There are a number of Python libraries available for speech to text conversion. The libraries themselves don’t actually do the conversion; instead, they use a cloud based or local service to crunch the data.
Before we look at the choices, let’s talk about requirements. I have some specific goals in mind that will drive the decision on which Python tools to use.
- Use Python to access the NLP functionality.
- Available as open source or have a free non-commercial version (at least).
- Process large amounts of speech.
Goals For Converting Speech To Text
The goal is to create a text transcript of my favorite podcast, No Agenda (actually the best podcast in the universe). Because they produce 3 hours of audio twice a week, that’s a lot to transcribe, and as we will see it goes beyond the “free” options of all the major cloud services. My ultimate goal is to extract the meaning or relevant information being discussed on the podcast. That will require further analysis of the text to derive its meaning, but this is a topic for later posts. For now it is interesting to note that speech to text applications seem to divide into two categories:
- Knowledge Extraction – determine semantic intent from natural language (my goal)
- Conversational applications – speech bots, automated assistants, etc.
Python Libraries For Speech to Text Conversion
Let’s look at the Python libraries available to us for speech to text conversion.
| Python Library | Last Version (Year) | Service | Free? |
|---|---|---|---|
| Pocket Sphinx | 2018 | CMU Sphinx | Open Source |
| API.AI | 2017 | Google Dialog Flow | Try Free |
| assemblyai | 2018 | AssemblyAI | 3 hours per month free |
| vosk | 2021 | Vosk (offline) | Open Source |
| pywit | 2015 | wit.ai – Facebook cloud service | Free |
| speechrecognition | 2017 | Multiple services supported | |
| IBM Watson | | | Try Free |
Which Python Library To Use For Speech To Text
When you search the internet for “Python speech to text”, the vast majority of blog posts cover the “speechrecognition” Python library. This library provides a common interface to a number of the cloud based services shown above (IBM, Google, Bing, Facebook). It ships with a hardcoded default API key for the Google speech API, so you can try it out without having to create an account. That is easy for blog posts but not a good long term solution.
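To make the “try it without an account” point concrete, here is a minimal sketch of how a typical blog post uses the library. It assumes the package is installed (`pip install SpeechRecognition`) and that you have a short audio file on disk; the function name `transcribe_wav` and the file name in the usage comment are my own placeholders, not anything from the library.

```python
def transcribe_wav(path):
    """Transcribe a short WAV/AIFF/FLAC file via the Google speech API."""
    # Imported inside the function so this file still loads
    # when the package is not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # Read the whole file into memory as one audio chunk.
        audio = recognizer.record(source)

    try:
        # No key is passed, so the library falls back to its built-in default key.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # The service could not make out any speech.
        return ""
    except sr.RequestError as err:
        # Network problem, quota exceeded, or the default key was rejected.
        raise RuntimeError(f"Speech service request failed: {err}") from err


# Hypothetical usage:
# print(transcribe_wav("clip.wav"))
```

This is fine for a short clip, but it sends the whole recording in a single request, so I would not expect it to cope with a three hour episode.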
Anyway, the big three cloud services (IBM, Google, Bing) only have a “try free” option, which is really just a small credit towards use of the service, so trying them out is about all you can do before you start paying. If you are developing an enterprise scale app this might be a good way to go, but if you are looking for a long term “free” option this won’t work. Facebook is “free”, but without reading the license terms closely, is anything from Facebook actually “free”?
The “CMU Sphinx” option looks promising as an open source option, but according to their website it appears to be no longer supported, so that option is out. They now refer you to another site that features the “Vosk” server and Python library. The interesting thing about Vosk (and Sphinx) is that it runs offline, which means you can install it on your local machine and not rely on a cloud service. Vosk itself is built on the Kaldi project, which you could use directly, but this is not for the faint of heart.
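As a rough preview of what “offline” looks like in practice, here is a minimal sketch of a Vosk transcription loop. It assumes `pip install vosk`, a model directory you have downloaded and unpacked yourself, and a mono 16-bit PCM WAV file. The function names `join_results` and `transcribe_offline` and the paths in the usage comment are my own placeholders.

```python
import json
import wave


def join_results(json_results):
    """Join the "text" field of each Vosk JSON result into one transcript."""
    parts = (json.loads(result).get("text", "") for result in json_results)
    return " ".join(part for part in parts if part)


def transcribe_offline(wav_path, model_dir):
    """Transcribe a mono 16-bit PCM WAV file with a local Vosk model."""
    # Imported inside the function so this file still loads without vosk.
    from vosk import KaldiRecognizer, Model

    results = []
    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(Model(model_dir), wav.getframerate())
        while True:
            data = wav.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True once a full utterance is recognized.
            if recognizer.AcceptWaveform(data):
                results.append(recognizer.Result())
        # Flush whatever audio is still buffered at the end of the file.
        results.append(recognizer.FinalResult())
    return join_results(results)


# Hypothetical usage:
# print(transcribe_offline("episode.wav", "model"))
```

Because the audio is fed to the recognizer in small chunks, nothing leaves your machine and the length of the file is limited only by your patience.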
Finally, the “assemblyai” Python library supports the cloud based services from the AssemblyAI company. Their free option covers only 3 hours per month, but it does seem to be free forever instead of “try” free.
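To put that 3 hour allowance in perspective, here is a quick back-of-the-envelope calculation using the numbers from this post (3 hours per episode, 2 episodes per week). Averaging a month as 52/12 weeks is my own simplifying assumption.

```python
HOURS_PER_EPISODE = 3      # from the post: about 3 hours per show
EPISODES_PER_WEEK = 2      # from the post: twice a week
FREE_HOURS_PER_MONTH = 3   # free allowance quoted above


def monthly_audio_hours(hours_per_episode, episodes_per_week):
    """Average hours of audio per month, assuming 52 weeks spread over 12 months."""
    return hours_per_episode * episodes_per_week * 52 / 12


def free_tier_coverage(free_hours, needed_hours):
    """Fraction of the monthly audio that fits in the free allowance."""
    return min(1.0, free_hours / needed_hours)


needed = monthly_audio_hours(HOURS_PER_EPISODE, EPISODES_PER_WEEK)
coverage = free_tier_coverage(FREE_HOURS_PER_MONTH, needed)
print(f"{needed:.0f} hours/month needed, free tier covers {coverage:.0%}")
```

That works out to roughly 26 hours of audio a month, so the free allowance covers about one episode out of every nine.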
So given my requirements, it looks like I’m going to try Vosk first. I may also try AssemblyAI to compare performance, capabilities, etc. Stay tuned for more posts on this topic!
Next Post – How To Setup A Python Environment For Vosk