Last modified: Apr 16, 2026

Transcribe Audio to Text with Python Speech Recognition

Converting speech to text is a powerful skill. Python makes it accessible. This guide will show you how.

You will learn to use speech recognition libraries. We will cover file handling and error management. Let's get started.

Why Use Python for Speech Recognition?

Python is a top choice for developers. It has simple syntax and powerful libraries. This is true for audio tasks too.

For a broader look at handling sound, see our Python Audio Processing Guide for Beginners. It covers foundational concepts.

Speech-to-text has many uses. It can create subtitles, take voice notes, or analyze interviews. Python scripts can automate these tasks.

Setting Up Your Python Environment

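First, ensure you have Python installed. Use a currently supported Python 3 release: recent versions of SpeechRecognition no longer support Python 3.7, so check the package's PyPI page for the current minimum. You will also need a package manager like pip.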

The main library we need is SpeechRecognition. Install it using pip in your terminal.


pip install SpeechRecognition

This library acts as a wrapper. It connects to various speech recognition engines. For most tasks, we will use Google's Web Speech API.

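You might also need pydub for handling different audio formats. Install it as well. Note that pydub relies on FFmpeg (or libav) to decode formats such as MP3 and M4A, so that tool must be installed separately and be available on your PATH.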


pip install pydub
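Before moving on, it can help to confirm that both packages are importable and that FFmpeg is on the PATH. The following is a minimal sketch using only the standard library. The helper name missing_packages is made up for this example. Note that the import name speech_recognition differs from the pip name SpeechRecognition.

```python
import importlib.util
import shutil


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names, not pip names: SpeechRecognition installs "speech_recognition"
    missing = missing_packages(["speech_recognition", "pydub"])
    print("Missing packages:", missing or "none")

    # pydub calls an external ffmpeg binary to decode non-WAV formats
    print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```

If a package is reported as missing, the usual cause is that pip installed it into a different Python environment than the one running the script.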

Basic Transcription from an Audio File

Let's start with a simple script. We will transcribe a WAV file. First, import the speech_recognition module.

Create a Recognizer object. This object will handle the recognition process.


import speech_recognition as sr

# Initialize the recognizer
recognizer = sr.Recognizer()

Next, load your audio file. Use the AudioFile class. Provide the path to your WAV file.

Open the file using a context manager. Then, record the audio into an AudioData object.


# Load the audio file
audio_file = sr.AudioFile('my_recording.wav')

with audio_file as source:
    # Record the entire audio file
    audio_data = recognizer.record(source)
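If loading fails or the transcript looks wrong, checking the file's basic properties is a reasonable first step. This sketch uses Python's built-in wave module (not SpeechRecognition itself) to report the channel count, sample rate, and duration. The names describe_wav and write_silence are illustrative, and the silent test file exists only so the snippet runs without a real recording.

```python
import os
import tempfile
import wave


def write_silence(path, seconds=1.0, sample_rate=16000):
    """Write a mono, 16-bit PCM WAV file containing only silence."""
    frame_count = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frame_count)


def describe_wav(path):
    """Return channel count, sample rate (Hz), and duration (s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration": frames / rate,
        }


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.gettempdir(), "silence_demo.wav")
    write_silence(demo_path, seconds=2.0)
    print(describe_wav(demo_path))
    os.remove(demo_path)
```

Pointing describe_wav at my_recording.wav gives a quick sanity check that the file is a readable PCM WAV and has the length you expect.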

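Now, use the recognize_google method. This sends the audio to Google's Web Speech API and returns the transcribed text. By default the library uses a built-in API key intended for personal use and testing, so do not rely on it for production workloads.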

Important: An internet connection is required for this step.


try:
    # Transcribe the audio data
    text = recognizer.recognize_google(audio_data)
    print("Transcription:", text)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand the audio.")
except sr.RequestError as e:
    print(f"Could not request results from Google service; {e}")
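A RequestError is often transient (a dropped connection or a temporary service problem), so retrying after a short pause is a common pattern. The sketch below is generic on purpose: call_with_retry is a hypothetical helper, not part of SpeechRecognition, and it takes the exception types to retry on as a parameter so that it runs on its own. The flaky function stands in for a network call.

```python
import time


def call_with_retry(func, retry_on, attempts=3, delay=1.0):
    """Call func(); on an exception listed in retry_on, wait and try again.

    The wait doubles after every failed attempt. The last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Stand-in for a network call that fails twice, then succeeds
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "hello world"

    print(call_with_retry(flaky, retry_on=(ConnectionError,), delay=0.01))
```

With the script above, the call might look like call_with_retry(lambda: recognizer.recognize_google(audio_data), retry_on=(sr.RequestError,)). UnknownValueError is left out deliberately, since resending the same unintelligible audio is unlikely to help.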

Here is what a successful run might look like.


Transcription: Hello world this is a test recording using Python

Handling Different Audio Formats

Not all audio is in WAV format. You might have MP3 or M4A files. The AudioFile class in SpeechRecognition reads only WAV (PCM), AIFF, and FLAC files.

You need to convert other formats first. The pydub library is perfect for this. It's a key tool in any Python Audio Libraries toolkit.

Here is how to convert an MP3 file to WAV before transcription.


import os

import speech_recognition as sr
from pydub import AudioSegment

# Convert MP3 to WAV (pydub needs FFmpeg for this step)
mp3_audio = AudioSegment.from_mp3("interview.mp3")
mp3_audio.export("temp_converted.wav", format="wav")

try:
    # Now transcribe the WAV file
    recognizer = sr.Recognizer()
    with sr.AudioFile("temp_converted.wav") as source:
        audio_data = recognizer.record(source)
    text = recognizer.recognize_google(audio_data)
    print(text)
finally:
    # Clean up the temporary file, even if transcription fails
    os.remove("temp_converted.wav")
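When a script handles a mix of files, a small check can decide which ones need the conversion step at all. The sketch below goes by file extension only, which is a heuristic rather than a real format check. The names NATIVE_EXTENSIONS, needs_conversion, and converted_path are made up for this example. The extension list reflects the formats AudioFile reads directly (WAV, AIFF, FLAC).

```python
from pathlib import Path

# Extensions of formats that sr.AudioFile can open directly (WAV, AIFF/AIFF-C, FLAC)
NATIVE_EXTENSIONS = {".wav", ".aiff", ".aif", ".aifc", ".flac"}


def needs_conversion(path):
    """Return True if the file extension suggests converting to WAV first."""
    return Path(path).suffix.lower() not in NATIVE_EXTENSIONS


def converted_path(path):
    """Suggest a WAV file name next to the original, e.g. talk.mp3 -> talk.wav."""
    return str(Path(path).with_suffix(".wav"))


if __name__ == "__main__":
    for name in ["interview.mp3", "memo.M4A", "lecture.wav", "song.flac"]:
        action = "convert" if needs_conversion(name) else "use as is"
        print(f"{name} -> {action}")
```

For formats other than MP3, pydub's AudioSegment.from_file is the general-purpose loader, so the conversion step does not need one branch per format.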

Working with Long Audio Files

Google's API has a time limit per request. For long files, you must split the audio. The record method can help.

The record method accepts a duration and an offset. Inside one with block, each call continues from where the previous one stopped, and offset is counted from that position, not from the start of the file. To process the file in consecutive chunks, pass only a duration.


import math

recognizer = sr.Recognizer()
full_text = []
chunk_length = 30  # Process 30-second chunks

with sr.AudioFile('long_lecture.wav') as source:
    # source.DURATION is the total length of the file in seconds;
    # round up so a shorter final chunk is not skipped
    total_chunks = math.ceil(source.DURATION / chunk_length)

    for chunk_number in range(1, total_chunks + 1):
        # Each call reads the next chunk_length seconds from the current
        # position in the file, so no offset is needed
        audio_chunk = recognizer.record(source, duration=chunk_length)
        try:
            chunk_text = recognizer.recognize_google(audio_chunk)
            full_text.append(chunk_text)
            print(f"Chunk {chunk_number} of {total_chunks} done.")
        except sr.UnknownValueError:
            full_text.append("[INAUDIBLE]")
        except sr.RequestError as e:
            full_text.append(f"[API ERROR: {e}]")

final_transcript = " ".join(full_text)
print("Full Transcript:", final_transcript)
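When splitting a recording, it is easy to drop the tail if the duration is not an exact multiple of the chunk length. The sketch below works through that arithmetic on its own, without any audio library. The names chunk_spans and format_timestamp are illustrative helpers. The timestamps can be used to label each chunk of the transcript.

```python
import math


def chunk_spans(total_seconds, chunk_length=30):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    count = math.ceil(total_seconds / chunk_length)
    return [(i * chunk_length, min((i + 1) * chunk_length, total_seconds))
            for i in range(count)]


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    whole = int(seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # A 75-second recording split into 30-second chunks
    for start, end in chunk_spans(75, 30):
        print(f"[{format_timestamp(start)} - {format_timestamp(end)}]")
```

Rounding the chunk count up with math.ceil is what keeps the final, shorter span (60 to 75 seconds in the demo) in the list.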

Improving Accuracy and Handling Noise

Real-world audio often has background noise. This can hurt transcription accuracy. The Recognizer object has methods to help.

Use the adjust_for_ambient_noise method. It listens to a part of the audio to calibrate.


recognizer = sr.Recognizer()

with sr.AudioFile('noisy_audio.wav') as source:
    # Let the recognizer adjust to the ambient noise for 1 second
    recognizer.adjust_for_ambient_noise(source, duration=1)
    audio_data = recognizer.record(source)
    text = recognizer.recognize_google(audio_data)
    print(text)

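Important: Call adjust_for_ambient_noise inside the with block, while the audio file is open as the source. The method reads from the same stream as record, so the portion it samples (here, the first second) is consumed and will not appear in the transcript. Keep the duration short, and use it when the recording starts with background noise before any speech. Also note that calibration sets the recognizer's energy_threshold, which the listen method uses to detect speech. The record method does not consult it, so for file transcription the main visible effect is that the sampled lead-in is skipped.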
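To get a feel for what calibration measures, the sketch below computes the root-mean-square (RMS) amplitude of the opening second of a WAV file using only the standard library. It illustrates the idea and is not the library's actual implementation. The names leading_rms and write_constant are hypothetical helpers, and only 16-bit PCM files are handled.

```python
import array
import math
import os
import sys
import tempfile
import wave


def leading_rms(path, seconds=1.0):
    """Return the RMS amplitude of the first `seconds` of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        frames = wav.readframes(int(seconds * wav.getframerate()))
    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def write_constant(path, value, seconds=1.0, sample_rate=8000):
    """Write a mono 16-bit WAV where every sample equals `value` (demo input only)."""
    samples = array.array("h", [value] * int(seconds * sample_rate))
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(samples.tobytes())


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.gettempdir(), "rms_demo.wav")
    write_constant(demo_path, 1000)
    print("Leading RMS:", leading_rms(demo_path))
    os.remove(demo_path)
```

Comparing this value for a quiet lead-in against a stretch of speech gives a rough sense of how far the voice sits above the noise floor in a given recording.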

Conclusion

Transcribing audio with Python is straightforward. The SpeechRecognition library does the heavy lifting. You learned to handle different files and improve accuracy.

Start with clean WAV files for best results. Use pydub for conversion. Split long files and handle errors gracefully.

This skill opens doors to automation and data analysis. You can build transcription services or analyze spoken content. Keep experimenting with different audio sources.