Results of Vosk Python Speech To Text Conversion

In this post we will look at some results from our Python scripts that use the Vosk speech recognition API to convert audio files to text files. Earlier posts provided a tutorial on how to use Python and Vosk.

For complete Python applications with a user interface, check out the related posts.

This is part of a series of posts, so if you’re starting here you might want to read the first three posts first.

In the third post we ran a series of Python scripts that read in an MP3 file, saved a portion of it as a WAV file, converted the WAV file to mono, and then converted the speech to text. You can run these scripts with any MP3 input file you want by changing the file names in the scripts. In this post let’s take a look at the output provided by the Vosk API.

Vosk Translation Results Using Kaldi Recognizer

The script in post #3 captures the results of the speech to text conversion and saves them as a JSON file.
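For readers who do not have post #3 open, here is a minimal sketch of how results like these can be captured. It is not the script from that post. The function names, the chunk size and the path arguments are my own placeholders, and it assumes the vosk package and a downloaded model are available. The Vosk calls themselves (Model, KaldiRecognizer, SetWords, AcceptWaveform, Result, FinalResult) are the ones the library provides.

```python
import json
import wave


def collect_results(recognizer, chunks):
    """Feed audio chunks to a recognizer and return the decoded result dicts."""
    results = []
    for chunk in chunks:
        # AcceptWaveform returns a truthy value once an utterance is complete.
        if recognizer.AcceptWaveform(chunk):
            results.append(json.loads(recognizer.Result()))
    # FinalResult flushes whatever audio is still buffered.
    results.append(json.loads(recognizer.FinalResult()))
    return results


def read_chunks(wav_file, frames_per_chunk=4000):
    """Yield raw PCM chunks from an open wave file until it is exhausted."""
    while True:
        data = wav_file.readframes(frames_per_chunk)
        if not data:
            break
        yield data


def transcribe(wav_path, model_path, json_path):
    """Hypothetical driver: all three paths are placeholders."""
    # Imported here so the helpers above work without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wav_file:
        recognizer = KaldiRecognizer(Model(model_path), wav_file.getframerate())
        recognizer.SetWords(True)  # ask for per-word conf/start/end entries
        results = collect_results(recognizer, read_chunks(wav_file))

    with open(json_path, "w") as out:
        json.dump(results, out, indent=2)
    return results
```

Calling transcribe with a mono WAV file, a model folder and an output file name would write a list of results shaped like the ones shown in this post.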

The following is the full output from the Vosk recognizer object. The output is essentially a list of results, one for each stretch of speech. Each one has a “result” key whose value is a list of word entries, and each entry has these key/value pairs:

  • conf – the confidence level given to the word, where 1.0 is the highest
  • start – the time in the audio file where the word starts, in seconds
  • end – the time in the audio file where the word ends, in seconds
  • word – the word recognized from the audio

Following the list of words is the “text” key/value pair that contains all of the words joined into a single string.

This is one “result” value returned by Vosk for the sample audio file described below.

  "result" : [{
      "conf" : 1.000000,
      "end" : 1.680000,
      "start" : 1.320000,
      "word" : "graph"
    }, {
      "conf" : 1.000000,
      "end" : 2.670000,
      "start" : 1.680000,
      "word" : "databases"
    }, {
      "conf" : 1.000000,
      "end" : 3.420000,
      "start" : 3.210000,
      "word" : "are"
    }, {
      "conf" : 1.000000,
      "end" : 3.930000,
      "start" : 3.450000,
      "word" : "based"
    }, {
      "conf" : 1.000000,
      "end" : 4.410000,
      "start" : 3.960000,
      "word" : "on"
    }, {
      "conf" : 1.000000,
      "end" : 4.710000,
      "start" : 4.590000,
      "word" : "the"
    }, {
      "conf" : 1.000000,
      "end" : 5.460000,
      "start" : 4.710000,
      "word" : "mathematical"
    }, {
      "conf" : 1.000000,
      "end" : 5.910000,
      "start" : 5.460000,
      "word" : "graph"
    }, {
      "conf" : 1.000000,
      "end" : 6.390000,
      "start" : 5.910000,
      "word" : "theory"
    }],
  "text" : "graph databases are based on the mathematical graph theory"
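As a rough illustration of how this structure can be used, the sketch below loads one result and pulls out the timing and confidence of each word. The sample is trimmed to the first two words of the output above, and the helper names and the 0.8 threshold are my own choices rather than anything from the original scripts.

```python
import json

# One recognizer result, trimmed to the first two words of the output above.
sample = """
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 1.680000,
      "start" : 1.320000,
      "word" : "graph"
    }, {
      "conf" : 1.000000,
      "end" : 2.670000,
      "start" : 1.680000,
      "word" : "databases"
    }],
  "text" : "graph databases"
}
"""


def word_timings(result):
    """Return (word, start, duration) for every word entry in one result."""
    return [
        (entry["word"], entry["start"], round(entry["end"] - entry["start"], 2))
        for entry in result.get("result", [])
    ]


def low_confidence_words(result, threshold=0.8):
    """Return the words whose confidence is below the (arbitrary) threshold."""
    return [
        entry["word"]
        for entry in result.get("result", [])
        if entry["conf"] < threshold
    ]


result = json.loads(sample)
for word, start, duration in word_timings(result):
    print(f"{start:6.2f}s  {duration:4.2f}s  {word}")
print("low confidence:", low_confidence_words(result))
```

A list like the one low_confidence_words returns is a reasonable place to start when proofreading a transcript, since those are the words the recognizer was least sure about.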

Sample Input WAV File

The first example uses an audio file from my “Introduction To Graph Databases” course. The audio file is pretty short.

Sample audio file in WAV format

Subset Of Translation Output

The script also produces a smaller JSON file that contains only the “text” output. Each text is a run of spoken words separated from the next by some amount of silence. They may or may not correspond directly to individual sentences. Here’s what we get for the audio file above.

    "graph databases are based on the mathematical graph theory",
    "in graph theory there are essentially two concepts we talk about vertices and edges",
    "so vertices are used to represent things really anything we care to be analyzing",
    "and the edges represent relationships or connections between them",
    "turns out that graphs are everywhere",
    "think about how you use social media social media is essentially a giant graph of all the people on that platform",
    "so off (if) i i myself would be representatives of urgency (represented by a vertice) and perhaps you are on the social media platform as well and you would be over to see (a vertice) or a vertex",
    "and how might we be connecting we like each other",
    "or we could connect to each other we could follow each other those would all be edges in the graph",
    "now using mathematical theory we can analyze data stored as a graph and we can use that to solve some rather interesting problems these are referred to as og rhythms (algorithms) and we'll be looking at that later in the course",
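One way to produce a text-only list like this from the full results is sketched below. It assumes the full output has already been loaded as a list of dictionaries. The function name and the sample data are illustrative only, and the empty entry stands in for a result that contains no recognized words.

```python
import json


def extract_texts(results):
    """Keep the non-empty "text" value from each recognizer result."""
    return [result["text"] for result in results if result.get("text")]


# Illustrative input: two results taken from this post plus one empty result.
results = [
    {
        "result": [{"conf": 1.0, "end": 1.68, "start": 1.32, "word": "graph"}],
        "text": "graph databases are based on the mathematical graph theory",
    },
    {"text": ""},
    {"text": "turns out that graphs are everywhere"},
]

texts = extract_texts(results)
print(json.dumps(texts, indent=4))
```

Dropping the empty entries keeps the smaller file limited to lines that actually contain speech.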

If you listen to the audio file and read the text above you will see that it is very accurate. It stumbled on a few words, and after each error I manually added the correct words in parentheses. It transcribed “if” as “off”, but when I listen to the audio it really sounds like I said “off”. The next two errors are my own fault: I was mistakenly saying “vertisee” (singular) instead of “vertex”. When I said “vertices” earlier it transcribed that correctly. Faced with the non-word “vertisee” it did the best it could. Finally, the one real miss was “algorithms”, which came out as “og rhythms”.
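If you have a hand-corrected version of a line, the misses can also be located programmatically. The sketch below is one possible approach using difflib from the Python standard library. It was not part of the original scripts, and the function name is my own. The corrected line is the last line of the output above with my manual correction applied.

```python
from difflib import SequenceMatcher


def word_differences(transcribed, corrected):
    """Return (transcribed words, corrected words) for each mismatched span."""
    a, b = transcribed.split(), corrected.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(a[a1:a2]), " ".join(b[b1:b2])))
    return differences


transcribed = "these are referred to as og rhythms and we'll be looking at that later in the course"
corrected = "these are referred to as algorithms and we'll be looking at that later in the course"

for heard, intended in word_differences(transcribed, corrected):
    print(f"transcribed: {heard!r}  corrected: {intended!r}")
```

Comparing word lists rather than characters keeps the output readable, because each difference is reported as whole words.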

Well, that’s it for the first proof of concept using the Vosk Python library to convert speech to text. I’m going to be exploring several options to improve the results in upcoming posts, but this is enough to get you started on your own effort to convert an audio file to a text file.

The Full App

If you want to go straight to the full solution, check out the complete Python application.
