Transcribing Audio & Video Using OpenAI Whisper API
Manage large audio or video files with pydub and OpenAI
Transcribing audio has become an essential task in various fields, from creating subtitles for videos to converting meetings and interviews into text. OpenAI's Whisper API offers a powerful solution for this, providing high-accuracy speech-to-text capabilities. However, it's important to note that Whisper's transcription service is only accessible via the API and not through a graphical user interface (UI). This guide will walk you through using the Whisper API for transcribing audio, including handling file size restrictions by chunking the audio and aggregating the transcriptions.
Understanding the Whisper API
OpenAI's Whisper API is designed to convert speech to text with impressive accuracy. The API can handle various languages and accents, making it a versatile tool for global applications. However, the API comes with some limitations, particularly concerning the size of the audio files it can process. At the time of writing, OpenAI's documentation lists a 25 MB limit per uploaded file, which means longer recordings need to be split into smaller segments before transcription.
Restrictions and Limitations
The primary restriction of the Whisper API is its file size limit. Keeping each uploaded file comfortably below that limit helps ensure smooth processing and avoids rejected requests, timeouts, or errors during transcription. For longer recordings, you will need to divide the audio into smaller chunks, transcribe each chunk individually, and then combine the results in their original order.
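As a rough illustration of that first decision, the sketch below checks whether a file is large enough to need chunking. The 25 MB figure is the upload limit listed in OpenAI's documentation at the time of writing, and reading it as 25 * 1024 * 1024 bytes is an assumption. The helper names are hypothetical and not part of any library.

```python
import os

# Assumption: the documented 25 MB upload limit, read as 25 * 1024 * 1024 bytes.
ASSUMED_LIMIT_BYTES = 25 * 1024 * 1024


def exceeds_limit(size_bytes, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return True when a file of this size is too large for a single upload."""
    return size_bytes > limit_bytes


def needs_chunking(path, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Check a file on disk against the (assumed) upload limit."""
    return exceeds_limit(os.path.getsize(path), limit_bytes)
```

If needs_chunking returns False, the file can probably be sent as a single request and the chunking steps can be skipped.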
Preparing Your Audio Files
To transcribe a long audio file using the Whisper API, you need to break it into smaller, manageable segments. This can be done using Python, which provides libraries for audio processing and API interaction. Here's a step-by-step guide on how to do this.
Step-by-Step Guide to Transcribing Audio with Whisper API
1. Install Required Libraries
First, you need to install the necessary Python libraries. You can do this using pip:
pip install openai pydub
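The pydub library is used for audio processing, and openai is the official library to interact with OpenAI’s APIs. Note that pydub relies on ffmpeg being installed on your system to read and write compressed formats such as mp3.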
2. Chunking the Audio File
You can use the pydub library to split your audio file into smaller chunks. Here's a Python script to do that:
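A minimal sketch of such a script, assuming pydub and ffmpeg are installed. The function names, the output directory, the chunk_000.mp3 naming scheme, and the mp3 output format are illustrative choices rather than anything the API requires.

```python
import os


def chunk_bounds(total_ms, split_length_ms):
    """Return (start, end) offsets in milliseconds that cover the whole recording."""
    return [
        (start, min(start + split_length_ms, total_ms))
        for start in range(0, total_ms, split_length_ms)
    ]


def split_audio(input_path, output_dir, split_length_ms=60_000):
    """Split an audio file into chunks and return the chunk paths in order."""
    # Imported here so chunk_bounds can be used without pydub installed.
    from pydub import AudioSegment

    audio = AudioSegment.from_file(input_path)  # len(audio) is in milliseconds
    os.makedirs(output_dir, exist_ok=True)

    chunk_paths = []
    for index, (start, end) in enumerate(chunk_bounds(len(audio), split_length_ms)):
        # Zero-padded names keep the chunks in order when sorted alphabetically.
        chunk_path = os.path.join(output_dir, f"chunk_{index:03d}.mp3")
        audio[start:end].export(chunk_path, format="mp3")  # slicing is by milliseconds
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Calling split_audio("meeting.mp3", "chunks") would write chunks/chunk_000.mp3, chunks/chunk_001.mp3, and so on, where meeting.mp3 is a placeholder file name.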
This script divides the audio into 1-minute chunks. You can adjust the split_length_ms variable, which is expressed in milliseconds, to suit your needs.
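For instance, split_length_ms can be set directly from a duration in minutes, or estimated from the bitrate of the exported chunks so that each one stays under the upload limit. The sketch below shows both. The 25 MB limit (read as 25 * 1024 * 1024 bytes), the constant-bitrate assumption, and the helper name are assumptions made for illustration.

```python
# Assumption: the documented 25 MB upload limit, read as 25 * 1024 * 1024 bytes.
ASSUMED_LIMIT_BYTES = 25 * 1024 * 1024

# Simple case: pick a chunk length in minutes and convert it to milliseconds.
split_length_ms = 5 * 60 * 1000  # 5-minute chunks


def split_length_for_bitrate(bitrate_kbps, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Estimate the longest chunk (in ms) that fits the limit at a constant bitrate."""
    # Bits available divided by kilobits per second yields milliseconds directly.
    max_ms = (limit_bytes * 8) // bitrate_kbps
    # Round down to whole minutes when possible, which leaves some headroom.
    whole_minutes_ms = (max_ms // 60_000) * 60_000
    return whole_minutes_ms or max_ms
```

At 128 kbps this estimate comes out to 27-minute chunks. It ignores container overhead and variable bitrates, so shorter chunks are the safer choice in practice.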
3. Transcribing Each Chunk
Next, you need to transcribe each chunk using the Whisper API:
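A minimal sketch of this step, assuming version 1 or later of the openai Python package, the "whisper-1" model name, and an API key available in the OPENAI_API_KEY environment variable. The function names are hypothetical helpers.

```python
def make_client():
    """Create an OpenAI client; the API key is read from OPENAI_API_KEY."""
    # Imported here so the helpers below can be exercised without the package.
    from openai import OpenAI

    return OpenAI()


def transcribe_chunk(client, chunk_path, model="whisper-1"):
    """Send one audio chunk to the transcription endpoint and return its text."""
    with open(chunk_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text


def transcribe_chunks(client, chunk_paths, model="whisper-1"):
    """Transcribe the chunks one by one, preserving the order of chunk_paths."""
    return [transcribe_chunk(client, path, model=model) for path in chunk_paths]
```

Passing the client in as an argument means it is created only once, and it also makes the functions easy to try out with a stand-in object before spending API credits.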
4. Putting it all together
Finally, you can aggregate the transcriptions from each chunk into a single text file:
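One way to sketch the aggregation is shown below. It assumes the chunk file names sort into playback order (for example zero-padded names such as chunk_000.mp3) and takes the per-chunk transcription step as a function argument, so transcribe_one is a placeholder for whatever performs the API call.

```python
def aggregate_transcriptions(chunk_paths, transcribe_one, output_path):
    """Transcribe every chunk in order and write the combined text to one file."""
    # Sorting relies on the assumed zero-padded naming scheme of the chunks.
    texts = [transcribe_one(path) for path in sorted(chunk_paths)]

    # Drop empty results and join the rest with single spaces.
    combined = " ".join(text.strip() for text in texts if text.strip())

    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(combined + "\n")
    return combined
```

Because the audio is cut at fixed intervals, a word may occasionally be split across two chunks, so the text around chunk boundaries is worth a quick review.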
For more information on the Whisper API and its capabilities, check the official OpenAI documentation.
Update: How to transcribe video files
Transcribing video files is similar to transcribing audio files. You can extract the audio from the video file and then follow the same process as described above. Here's a Python script to extract audio from a video file using the moviepy library:
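A minimal sketch of such a script, assuming moviepy is installed (pip install moviepy). The import location of VideoFileClip differs between moviepy 1.x and 2.x, so the sketch tries both. The function names and the default .mp3 extension are illustrative choices.

```python
import os


def audio_path_for(video_path, extension=".mp3"):
    """Derive an audio file name from the video file name."""
    base, _ = os.path.splitext(video_path)
    return base + extension


def extract_audio(video_path, audio_path=None):
    """Write the audio track of a video to its own file and return that path."""
    try:
        from moviepy import VideoFileClip  # moviepy 2.x
    except ImportError:
        from moviepy.editor import VideoFileClip  # moviepy 1.x

    audio_path = audio_path or audio_path_for(video_path)
    clip = VideoFileClip(video_path)
    try:
        if clip.audio is None:
            raise ValueError(f"No audio track found in {video_path}")
        clip.audio.write_audiofile(audio_path)
    finally:
        clip.close()  # release the underlying ffmpeg reader
    return audio_path
```

The resulting audio file can then be chunked and transcribed exactly like any other recording.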