Unlocking Speech-to-Text: Harnessing the Power of the OpenAI Whisper API with FastAPI Integration
To a certain point, exploring AI usages has become like playing the game Happy Families :). Indeed, since August 2023, I have tackled several of AI's applied usages: Deploying NLP features, Annotation & Machine Learning Model Customization, Exploring ChatGPT Prompts, Delegating to ChatGPT Time-Consuming Product Owner Tasks… I must admit that the ultimate objective is to build a decision matrix that will give both a list of usage/solution combinations, each tempered by pros and cons, and a "big picture" of the AI ecosystem.
For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Exploring the Audio revolution with Whisper
This time, in the AI use family, I am picking the audio family. This post is therefore an opportunity for a quick POC exploring audio and AI, with the main objective to extract a transcription from an audio file and then expose it via an API. As always, the API is designed with FastAPI; this part remains remarkably similar from one post to another!
This post idea was inspired by the recent announcement of the collaboration between Spotify and OpenAI, which promises the possibility of listening to your favorite podcasts in your native language!
Some prominent podcasts available in English are now also available in German, Spanish and French with "Voice Cloning", which is quite impressive. The announcement is unambiguous: "Spotify's AI Voice Translation Pilot Means Your Favorite Podcasters Might Be Heard in Your Native Language". So, you can hear podcasters like Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett and their guests with AI-generated voice translations in other languages, including Spanish, French, and German.
I have checked out these two podcasts to get an idea. Well, it is very professional, astonishing, and extremely useful for a better understanding, but mostly performative and a bit deprived of cultural roots. It reminds me of this quote from Wittgenstein: "The limits of my language are the limits of my world". Precisely expressing the sensation I got is far beyond the purpose of this post, so I keep it for myself!
- [Traducido con IA] Yuval Noah Harari: Naturaleza humana, Inteligencia, Poder y Conspiraciones – Lex Fridman Podcast [Traducido con IA – Español] | Podcast on Spotify: https://open.spotify.com/episode/6S3TPytV81NWmkDArApKYl
- The Diary Of A CEO with Steven Bartlett [Traducido con IA – Español] | Podcast on Spotify: https://open.spotify.com/show/7oOabGYIDTpIENhYPNUFdP
ChatGPT to the rescue to get definitions
Right now, I am more of a "Doer" than a "Thinker". So, on a more basic level, I asked ChatGPT to specify what these operations are called; here is the prompt. That clarifies the battleground a bit.
Prompt_1
How do you call Text-to-Speech Conversion, where from a text you create an audio with a robotic voice, and how do you call Text-to-Speech Conversion where the voice of the user has been cloned?
You can check “chat_GPT_3_5_prompt_5” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
This first prompt also helped me schematically determine the several stages of a machine learning model pipeline, which I converted into a second prompt!
Prompt_2
Can you describe in different stages the process of audio cloning to produce a podcast like the one from Lex Fridman with Yuval Noah Harari that has been translated to Spanish? Number the steps and give them a summary title each time with a description for each stage, e.g. stage_1 :: Audio Transcription, stage_2 :: Transcription Translation… and so on
You can check “chat_GPT_3_5_prompt_11” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Well, anyway, this time I did not use the answer from ChatGPT, as I found it not synthetic enough and too analytic! Ungrateful Human 🙂 I have also forged a portmanteau word that is not standard industry terminology, "Cloned Voice Overing", so ChatGPT would not understand it anyway.
So, here is my rough Process Map in four main steps or stages; a sketch of it as Python function stubs follows the list.
- stage_1: Audio Transcription. This is the text extraction from the audio, and also the main focus of this post. You can even think of doing some NLP transformation at this stage.
- stage_2: Transcription Translation. The transcription is translated into a chosen language. Damn it, here you have to select a good translation tool.
- stage_3: Voice Cloning. The voices heard in the audio from the host (e.g., Lex Fridman) and the guests (e.g., Yuval Noah Harari) are trained and reproduced; or, cheaper, you can go with robotic voices for both, what a fraud then! Ditto, like stage_2, finding a great tool is key.
- stage_4: Cloned Voice Overing. The podcast is produced with the cloned or synthesized voices, used for voice-over narration while preserving all the inputs from the first production (sound effects, jingle…).
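To make the map concrete, here is a minimal Python skeleton of that pipeline. Everything in it is illustrative: the function names are mine, every body is a placeholder, and only stage_1 is actually implemented in this post.

```python
# Illustrative skeleton only: names are mine and every body is a placeholder.

def stage_1_audio_transcription(audio_path: str) -> str:
    """Extract the text from the audio file (the focus of this post)."""
    ...

def stage_2_transcription_translation(transcript: str, target_language: str) -> str:
    """Translate the transcription into the chosen language (translation tool to be picked)."""
    ...

def stage_3_voice_cloning(source_audio_path: str) -> object:
    """Train and reproduce the host's and guests' voices (or go robotic, what a fraud then!)."""
    ...

def stage_4_cloned_voice_overing(translated_transcript: str, cloned_voice: object) -> str:
    """Produce the final podcast with the cloned voices, preserving jingles and sound effects."""
    ...
```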
Like I said, for the moment I am modestly focused on stage_1.
Last thing to say! Even though I claim to be interested only in the how-to, I do ask a few questions about the purpose of all this. Without being a conspiracy theorist, this experience raises many questions.
Let's take the measure of Spotify's experiment. It is a major advance, but also a brilliant instrument for soft power and cultural domination. Naively, I often tend to forget that technology, content, and ideology are linked. It is always healthy to ask the question of the political and economic purpose behind the "so natural" adoption of AI.
Source: https://www.euronews.com/culture/2023/09/26/spotify-launches-ai-tool-for-translating-podcasts
Source: https://openai.com/research/whisper
POC with Whisper and attempts with FastAPI
It is mainly a POC, so there are progressive, iterated attempts that go from simple usage to integration into FastAPI. I also included some audio samples and, like I said, tests and prompts made to ChatGPT.
Below are all the files and directories with a quick description for this POC.
- 001_openai_whisper.py: minimum loading and usage of WHISPER (see the minimal sketch after this list)
- 002_openai_whisper.py: minimum loading and usage of WHISPER with languages (AR, ES, CN, RU, FR)
- 003_openai_whisper.py: output the WHISPER transcription into a text file
- 004_openai_whisper_panda.py: output the WHISPER transcription into a .csv file with pandas
- 005_openai_whisper.py: a few WHISPER attempts at language detection (AR, ES, CN, RU, FR)
- 006_openai_whisper_pytube.py: make a WHISPER transcription of a YOUTUBE video
- 006_openai_whisper_pytube_ffmpeg.py: make a WHISPER transcription of a YOUTUBE video leveraging FFMPEG
- 007_openai_whisper.py: same as 001_openai_whisper.py
- 008_openai_whisper_fastapi.py: integration of WHISPER into FASTAPI to provide a POC for an API (see the FastAPI sketch after this list)
- 009_openai_whisper_fastapi.py: POC with WHISPER and FASTAPI, managing audio and video uploads and extracting the transcription
- 010_request_files_fastapi.py: another way to manage file uploads in FASTAPI
- 011_faster_whisper.py: experiments with FASTER-WHISPER
- 012_openai_whisper.py: build a WHISPER transcription function
- 013_openai_whisper.py: WHISPER transcription in different formats (.json, .srt, .tsv, .txt, .vtt)
- README.md: the readme for the main GitHub directory
- audio_files_sources: some audio samples in different languages
- ffmpeg_python: experiments with ffmpeg-python
- output_srtfiles_writer: output directory for transcription
- prompts_chatgpt_samples.diff: some prompts related to the post and to WHISPER
- requirements.txt: the Python requirements for WHISPER
- tests_from_whisper: some tests (pytest) extracted from the original WHISPER project
- video_download_from_yt: output of the audio extraction from a YT video
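As a taste of what the first scripts (001 to 005 and 013) do, here is a minimal sketch of loading Whisper, transcribing a file, forcing a language, detecting the language, and writing the result as .srt. The audio path is a placeholder; adapt it to your own samples.

```python
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")  # tiny / base / small / medium / large

# Plain transcription: Whisper auto-detects the language by default
result = model.transcribe("audio_files_sources/sample.mp3")
print(result["text"])

# Force a language instead of letting Whisper guess (e.g., French)
result_fr = model.transcribe("audio_files_sources/sample.mp3", language="fr")

# Explicit language detection on the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio_files_sources/sample.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Write the transcription in one of the supported formats (.srt here);
# the third argument holds the subtitle layout options
writer = get_writer("srt", "output_srtfiles_writer")
writer(result, "audio_files_sources/sample.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})
```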
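In the same spirit, here is a minimal sketch of the kind of endpoint built in 008/009_openai_whisper_fastapi.py: upload a file, run Whisper on it, return the transcription as JSON. The temporary-file handling and the model size are my assumptions, not necessarily the exact code of the POC.

```python
import os
import shutil
import tempfile
from pathlib import Path

import whisper
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # load the model once, not per request

@app.post("/transcribe")
async def transcribe_audio(file: UploadFile = File(...)):
    # Whisper reads from disk (through ffmpeg), so persist the upload
    # to a temporary file, keeping the original extension.
    suffix = Path(file.filename).suffix
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        result = model.transcribe(tmp_path)
    finally:
        os.remove(tmp_path)
    return {"filename": file.filename, "language": result["language"], "text": result["text"]}

# Run with: uvicorn your_module_name:app --reload
```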
Audio Transcription with Whisper – Objectives
Let's get back to basics: using Whisper helps perform audio transcription in more than 70 languages. Good news for a journalist or for anyone working with audio; it is interesting to easily perform such operations: quickly get the full transcription of an interview to rework it, translate it, retrieve keywords, make a summary… etc.
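A detail worth knowing for the "translate" part of that wish list: Whisper can natively translate any supported language to English (and only to English) at transcription time, via its task parameter. A quick sketch, with a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

# task="transcribe" (the default) keeps the source language;
# task="translate" outputs English text, whatever the input language.
result = model.transcribe("audio_files_sources/sample_es.mp3", task="translate")
print(result["text"])  # English rendering of the Spanish audio
```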
For example, the site Slator points out 6 practical use cases for the new Whisper API. Well, I am really interested in these two.
Indexing Podcasts and Audio Content
With the rise of podcasts and audio content, the Whisper model can be used to transcribe and generate text-based versions of audio content. This can help improve accessibility for those with hearing impairments and also improve "search-ability" for podcast episodes, making them more discoverable.
Transcription Services
Transcription service providers can use OpenAI’s Whisper API to transcribe audio and video content in multiple languages accurately and efficiently. The API’s ability to transcribe the audio in near real-time and support multiple file formats allows for greater flexibility and faster turnaround times.
Source: Here Are Six Practical Use Cases for the New Whisper API. https://slator.com/six-practical-use-cases-for-new-whisper-api/
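Note that the Whisper API discussed by Slator is OpenAI's hosted service, distinct from the open-source model used in this POC. For the record, a minimal sketch with the OpenAI Python SDK (v1+); the file name is a placeholder and OPENAI_API_KEY must be set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "whisper-1" is the hosted model name; the audio file name is a placeholder
with open("podcast_episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```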
This trend was already predicted 3 years ago… by Deloitte. Well, that is normal: digital transformation and creative disruption are their business!
The ears have it: The rise of audiobooks and podcasting https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2020/rise-of-audiobooks-podcast-industry.html
By the way, there are plenty of alternatives to OpenAI Whisper, such as DeepSpeech, Flashlight, Kaldi, SpeechPy, Speechly, Botium Speech Processing… Unfortunately, I am running out of time to make a POC for each of them and list their pros and cons.
Audio Transcription with Whisper – Environment
Like always, it is better to create a “virtual environment”. You can use “venv” or “Anaconda”. I am using Anaconda.
You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://docs.python.org/3/library/venv.html
Audio Transcription with Whisper – Audio samples
Get some audio samples, preferably in different languages and with different accents if possible. Always remember that the quality of the dataset prefigures the quality of the result. Here are some resources.
- Category:Audio files of speeches from wikimedia: https://commons.wikimedia.org/wiki/Category:Audio_files_of_speeches
- MelNet – Audio Samples: https://audio-samples.github.io/
- Nice samples in different languages from the “Académie de Versailles”: https://audio-lingua.ac-versailles.fr/?lang=en
Audio Transcription with Whisper – Extra stuff
FFMPEG
Audio and video manipulation required my old acquaintance FFMPEG, the subject of some of my very first posts back in 2009, in French! Anyway, enough nostalgia: this time I focused on FFMPEG with Python through the library "ffmpeg-python".
You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://www.delftstack.com/howto/python/ffmpeg-python/
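As a taste of what the ffmpeg_python directory experiments with, here is a minimal sketch extracting the audio track of a video and down-mixing it to 16 kHz mono WAV, the sample rate Whisper uses internally; the file names are placeholders:

```python
import ffmpeg  # pip install ffmpeg-python (the ffmpeg binary must also be installed)

# Extract the audio track from a video and convert it to 16 kHz mono WAV
(
    ffmpeg
    .input("video_download_from_yt/sample_video.mp4")
    .output("audio_files_sources/sample_audio.wav", ac=1, ar=16000)
    .run(overwrite_output=True)
)
```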
PYTEST
I included these tests as goodies. It is just that building an AI strategy requires hardening quality like never before; and even if you do not write the tests yourself, they help you figure out which use cases you want to target with your API or with Whisper. Here are some tests grabbed directly from the official repository.
You can check the main readme for all the commands and the directory “tests_from_whisper” for the tests. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://github.com/openai/whisper/tree/main/tests
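For illustration, here is a minimal pytest in the spirit of those shipped with the Whisper repository; the model size and the audio path are my assumptions:

```python
# test_transcribe_smoke.py
import whisper

def test_transcribe_returns_non_empty_text():
    # "tiny" keeps the test fast; any sample from audio_files_sources will do
    model = whisper.load_model("tiny")
    result = model.transcribe("audio_files_sources/sample.mp3", fp16=False)
    assert isinstance(result["text"], str)
    assert result["text"].strip() != ""
```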
Videos to tackle this post
#1 Using Whisper & FastAPI: Unlocking Multilingual Transcription with #Whisper: Exploring Audio for a POC
#2 Using Whisper & FastAPI: Creating a Multilingual Audio API with #Whisper: POC Using #FastAPI
#3 Using Whisper & FastAPI: Leveraging Faster-Whisper for Multilingual NLP & Audio Exploration
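Since video #3 covers Faster-Whisper (explored in 011_faster_whisper.py), here is a minimal sketch of its API, which returns timestamped segments lazily; the model size and compute type are my assumptions:

```python
from faster_whisper import WhisperModel

# CTranslate2 reimplementation of Whisper: same models, faster inference
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio_files_sources/sample.mp3")
print("Detected language:", info.language, "probability:", info.language_probability)

for segment in segments:  # segments is a generator, decoded on the fly
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```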
More info
- faster-whisper: https://github.com/guillaumekln/faster-whisper
- whisper by openai.com: https://openai.com/research/whisper
- whisper on GitHub: https://github.com/openai/whisper
- I used OpenAI’s new tech to transcribe audio right on my laptop: https://www.theverge.com/2022/9/23/23367296/openai-whisper-transcription-speech-recognition-open-source
- FastAPI_Whisper: https://github.com/rosaldo/FastAPI_Whisper
- OpenAI Whisper using FastAPI: https://blindbox.mithrilsecurity.io/en/integration-task/docs/how-to-guides/openai-whisper/
- OpenAI Whisper Python Tutorial: Step-by-Step Guide: https://analyzingalpha.com/openai-whisper-python-tutorial
- Converting Speech to Text with the OpenAI Whisper API: https://www.datacamp.com/tutorial/converting-speech-to-text-with-the-openAI-whisper-API
- WhisperX: https://github.com/m-bain/whisperX
- How to Run OpenAI’s Whisper Speech Recognition Model: https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
- Show and tell for whisper on github.com: https://github.com/openai/whisper/discussions/categories/show-and-tell
- Whisper prompting guide: https://cookbook.openai.com/examples/whisper_prompting_guide
- How to use OpenAI’s Whisper for speech recognition: https://www.graphcore.ai/posts/how-to-use-openais-whisper-for-speech-recognition
- How to install and deploy Whisper, the best open-source alternative to Google Speech-to-Text (in French): https://nlpcloud.com/fr/how-to-install-and-deploy-whisper-the-best-open-source-alternative-to-google-speech-to-text.html
- Generating automatic video subtitles from any language with Whisper AutoCaption: https://blog.paperspace.com/automatic-video-subtitles-with-whisper-autocaption/
- Whisper-AutoCaption: https://github.com/gradient-ai/whisper-autocaption?ref=blog.paperspace.com
- How to: Use Whisper To Convert Speech to Text!: https://www.youtube.com/watch?v=Q7Rq_92kW9A
- Real-time Speech Recognition in 15 minutes with AssemblyAI: https://www.youtube.com/watch?v=5LJFK7eOC20
- Stable Diffusion XL, a latent text-to-image diffusion model capable of generating photo-realistic images given any text input: https://stablediffusionweb.com/
- Voice Cloning for Content Creators: https://marketplace.respeecher.com/
- Introduction to MoviePy: https://www.geeksforgeeks.org/introduction-to-moviepy/
- MoviePy (full documentation), a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects; see the gallery for some examples of use: https://pypi.org/project/moviepy/
- Aloud – dubbing for everyone: https://aloud.area120.google.com/
- ElevenLabs, the official Python API for ElevenLabs text-to-speech software; Eleven brings the most compelling, rich and lifelike voices to creators and developers in just a few lines of code: https://github.com/elevenlabs/elevenlabs-python
- FastAPI: 10 Overlooked Features You Should Be Using: https://medium.com/@kasperjuunge/10-overlooked-fastapi-features-you-should-be-using-9ca53eb4c15b
- For stage_3, I will have to choose among these projects, all on Voice Cloning: https://github.com/topics/voice-cloning