Unlocking Speech-to-Text: Harnessing the Power of the OpenAI Whisper API with FastAPI Integration
To a certain point, exploring AI usages has become like playing the game Happy Families :). Indeed, since August 2023, I have tackled several of AI's applied usages: Deploying NLP features, Annotation & Machine Learning Model Customization, Exploring ChatGPT Prompts, Delegating to ChatGPT Time-Consuming Product Owner Tasks… I must admit that the ultimate objective is to build a decision matrix that will give both a list of usage/solution combinations, each tempered by pros and cons, and a "big picture" of the AI ecosystem.
For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Exploring the Audio revolution with Whisper
This time, in the AI use family, I am picking the audio family. This post is therefore an opportunity for a quick POC exploring audio and AI, with the main objective to extract a transcription from an audio file and then expose it via an API. As always, the API is designed with FastAPI; this part remains remarkably similar from one post to another!
This post idea was inspired by the recent announcement of the collaboration between Spotify and OpenAI, which promises the possibility of listening to your favorite podcasts in your native language!
Some prominent podcasts available in English are now also available in German, Spanish and French with "Voice Cloning", which is quite impressive. The announcement is unambiguous: "Spotify's AI Voice Translation Pilot Means Your Favorite Podcasters Might Be Heard in Your Native Language". So, you can hear podcasters like Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett and their guests with AI-generated voice translations in other languages, including Spanish, French, and German.
I have checked out these two podcasts to get an idea. Well, it is very professional, astonishing, and extremely useful for a better understanding, but mostly performative and a bit deprived of cultural roots. It reminds me of this quote from Wittgenstein: "The limits of my language are the limits of my world". Precisely expressing the sensation I got is far beyond the purpose of this post, so I keep it for myself!
- [Traducido con IA] Yuval Noah Harari: Naturaleza humana, Inteligencia, Poder y Conspiraciones – Lex Fridman Podcast [Traducido con IA – Español] | Podcast on Spotify: https://open.spotify.com/episode/6S3TPytV81NWmkDArApKYl
- The Diary Of A CEO with Steven Bartlett [Traducido con IA – Español] | Podcast on Spotify: https://open.spotify.com/show/7oOabGYIDTpIENhYPNUFdP
ChatGPT to the rescue to get definitions
Right now, I am more of a "Doer" than a "Thinker". So, on a more basic level, I asked ChatGPT to specify what these operations are called; here is the prompt. That clarifies the battleground a bit.
Prompt_1
How do you call Text-to-Speech Conversion, where from a text you create an audio with a robotic voice, and how do you call Text-to-Speech Conversion where the voice of the user has been cloned?
You can check “chat_GPT_3_5_prompt_5” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
This first prompt also helped me schematically determine the several stages of a machine learning model pipeline, which I converted into a second prompt!
Prompt_2
Can you describe in different stages the process of audio cloning to produce a podcast like the one from Lex Fridman with Yuval Noah Harari that has been translated to Spanish? Number the steps and give them a summary title each time with a description for each stage, e.g. stage_1 :: Audio Transcription, stage_2 :: Transcription Translation… and so on
You can check “chat_GPT_3_5_prompt_11” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Well, anyway, this time I did not use the answer from ChatGPT, as I found it not synthetic enough and too analytic! Ungrateful Human 🙂 I have also forged a portmanteau word that is not standard industry terminology, "Cloned Voice Overing", so ChatGPT would not understand it anyway.
So, here is my rough Process Map in four main steps or stages; a sketch of it as Python function stubs follows the list.
- stage_1: Audio Transcription. This is the text extraction from the audio, and also the main focus of this post. You can even think of doing some NLP transformation at this stage.
- stage_2: Transcription Translation. The transcription is translated into a chosen language. Damn it, here you have to select a good translation tool.
- stage_3: Voice Cloning. The voices heard in the audio from the host (e.g., Lex Fridman) and the guests (e.g., Yuval Noah Harari) are trained and reproduced; or, cheaper, you can go with robotic voices for both, what a fraud then! Ditto, like stage_2, finding a great tool is key.
- stage_4: Cloned Voice Overing. The podcast is produced with the cloned or synthesized voices, used for voice-over narration while preserving all the inputs from the first production (sound effects, jingle…).
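To make the map concrete, here is a minimal Python skeleton of that pipeline. Everything in it is illustrative: the function names are mine, every body is a placeholder, and only stage_1 is actually implemented in this post.

```python
# Illustrative skeleton only: names are mine and every body is a placeholder.

def stage_1_audio_transcription(audio_path: str) -> str:
    """Extract the text from the audio file (the focus of this post)."""
    ...

def stage_2_transcription_translation(transcript: str, target_language: str) -> str:
    """Translate the transcription into the chosen language (translation tool to be picked)."""
    ...

def stage_3_voice_cloning(source_audio_path: str) -> object:
    """Train and reproduce the host's and guests' voices (or go robotic, what a fraud then!)."""
    ...

def stage_4_cloned_voice_overing(translated_transcript: str, cloned_voice: object) -> str:
    """Produce the final podcast with the cloned voices, preserving jingles and sound effects."""
    ...
```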
Like I said, for the moment I am modestly focused on stage_1.
Last thing to say! Even though I claim to be interested only in the how-to, I do ask a few questions about the purpose of all this. Without being a conspiracy theorist, this experience raises many questions.
Let's take the measure of Spotify's experiment. It is a major advance, but also a brilliant instrument for soft power and cultural domination. Naively, I often tend to forget that technology, content, and ideology are linked. It is always healthy to ask the question of the political and economic purpose behind the "so natural" adoption of AI.
Source: https://www.euronews.com/culture/2023/09/26/spotify-launches-ai-tool-for-translating-podcasts
Source: https://openai.com/research/whisper
POC with Whisper and attempts with FastAPI
It is mainly a POC, so there are progressive, iterated attempts that go from simple usage to integration into FastAPI. I also included some audio samples and, like I said, tests and prompts made to ChatGPT.
Below are all the files and directories with a quick description for this POC.
- 001_openai_whisper.py: minimum loading and usage of WHISPER (see the minimal sketch after this list)
- 002_openai_whisper.py: minimum loading and usage of WHISPER with languages (AR, ES, CN, RU, FR)
- 003_openai_whisper.py: output the WHISPER transcription into a text file
- 004_openai_whisper_panda.py: output the WHISPER transcription into a .csv file with pandas
- 005_openai_whisper.py: a few WHISPER attempts at language detection (AR, ES, CN, RU, FR)
- 006_openai_whisper_pytube.py: make a WHISPER transcription of a YOUTUBE video
- 006_openai_whisper_pytube_ffmpeg.py: make a WHISPER transcription of a YOUTUBE video leveraging FFMPEG
- 007_openai_whisper.py: same as 001_openai_whisper.py
- 008_openai_whisper_fastapi.py: integration of WHISPER into FASTAPI to provide a POC for an API (see the FastAPI sketch after this list)
- 009_openai_whisper_fastapi.py: POC with WHISPER and FASTAPI, managing audio and video uploads and extracting the transcription
- 010_request_files_fastapi.py: another way to manage file uploads in FASTAPI
- 011_faster_whisper.py: experiments with FASTER-WHISPER
- 012_openai_whisper.py: build a WHISPER transcription function
- 013_openai_whisper.py: WHISPER transcription in different formats (.json, .srt, .tsv, .txt, .vtt)
- README.md: the readme for the main GitHub directory
- audio_files_sources: some audio samples in different languages
- ffmpeg_python: experiments with ffmpeg-python
- output_srtfiles_writer: output directory for transcription
- prompts_chatgpt_samples.diff: some prompts related to the post and to WHISPER
- requirements.txt: the Python requirements for WHISPER
- tests_from_whisper: some tests (pytest) extracted from the original WHISPER project
- video_download_from_yt: output of the audio extraction from a YT video
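As a taste of what the first scripts (001 to 005 and 013) do, here is a minimal sketch of loading Whisper, transcribing a file, forcing a language, detecting the language, and writing the result as .srt. The audio path is a placeholder; adapt it to your own samples.

```python
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")  # tiny / base / small / medium / large

# Plain transcription: Whisper auto-detects the language by default
result = model.transcribe("audio_files_sources/sample.mp3")
print(result["text"])

# Force a language instead of letting Whisper guess (e.g., French)
result_fr = model.transcribe("audio_files_sources/sample.mp3", language="fr")

# Explicit language detection on the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio_files_sources/sample.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Write the transcription in one of the supported formats (.srt here);
# the third argument holds the subtitle layout options
writer = get_writer("srt", "output_srtfiles_writer")
writer(result, "audio_files_sources/sample.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})
```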
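In the same spirit, here is a minimal sketch of the kind of endpoint built in 008/009_openai_whisper_fastapi.py: upload a file, run Whisper on it, return the transcription as JSON. The temporary-file handling and the model size are my assumptions, not necessarily the exact code of the POC.

```python
import os
import shutil
import tempfile
from pathlib import Path

import whisper
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # load the model once, not per request

@app.post("/transcribe")
async def transcribe_audio(file: UploadFile = File(...)):
    # Whisper reads from disk (through ffmpeg), so persist the upload
    # to a temporary file, keeping the original extension.
    suffix = Path(file.filename).suffix
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        result = model.transcribe(tmp_path)
    finally:
        os.remove(tmp_path)
    return {"filename": file.filename, "language": result["language"], "text": result["text"]}

# Run with: uvicorn your_module_name:app --reload
```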
Audio Transcription with Whisper – Objectives
Let's get back to basics: using Whisper helps perform audio transcription in more than 70 languages. Good news for a journalist or for anyone working with audio; it is interesting to easily perform such operations: quickly get the full transcription of an interview to rework it, translate it, retrieve keywords, make a summary… etc.
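A detail worth knowing for the "translate" part of that wish list: Whisper can natively translate any supported language to English (and only to English) at transcription time, via its task parameter. A quick sketch, with a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

# task="transcribe" (the default) keeps the source language;
# task="translate" outputs English text, whatever the input language.
result = model.transcribe("audio_files_sources/sample_es.mp3", task="translate")
print(result["text"])  # English rendering of the Spanish audio
```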
For example, the site Slator points out 6 practical use cases for the new Whisper API. Well, I am really interested in these two.
Indexing Podcasts and Audio Content
With the rise of podcasts and audio content, the Whisper model can be used to transcribe and generate text-based versions of audio content. This can help improve accessibility for those with hearing impairments and also improve "search-ability" for podcast episodes, making them more discoverable.
Transcription Services
Transcription service providers can use OpenAI’s Whisper API to transcribe audio and video content in multiple languages accurately and efficiently. The API’s ability to transcribe the audio in near real-time and support multiple file formats allows for greater flexibility and faster turnaround times.
Source: Here Are Six Practical Use Cases for the New Whisper API. https://slator.com/six-practical-use-cases-for-new-whisper-api/
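Note that the Whisper API discussed by Slator is OpenAI's hosted service, distinct from the open-source model used in this POC. For the record, a minimal sketch with the OpenAI Python SDK (v1+); the file name is a placeholder and OPENAI_API_KEY must be set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "whisper-1" is the hosted model name; the audio file name is a placeholder
with open("podcast_episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```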
This trend was already predicted 3 years ago… by Deloitte. Well, that is normal: digital transformation and creative disruption are their business!
The ears have it: The rise of audiobooks and podcasting https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2020/rise-of-audiobooks-podcast-industry.html
By the way, there are plenty of alternatives to OpenAI Whisper, such as DeepSpeech, Flashlight, Kaldi, SpeechPy, Speechly, Botium Speech Processing… Unfortunately, I am running out of time to make a POC for each of them and list their pros and cons.
Audio Transcription with Whisper – Environment
Like always, it is better to create a “virtual environment”. You can use “venv” or “Anaconda”. I am using Anaconda.
You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://docs.python.org/3/library/venv.html
Audio Transcription with Whisper – Audio samples
Get some audio samples, preferably in different languages and with different accents if possible. Always remember that the quality of the dataset prefigures the quality of the result. Here are some resources.
- Category:Audio files of speeches from wikimedia: https://commons.wikimedia.org/wiki/Category:Audio_files_of_speeches
- MelNet – Audio Samples: https://audio-samples.github.io/
- Nice samples in different languages from the “Académie de Versailles”: https://audio-lingua.ac-versailles.fr/?lang=en
Audio Transcription with Whisper – Extra stuff
FFMPEG
Audio and video manipulation required my old acquaintance FFMPEG, the subject of some of my very first posts back in 2009, in French! Anyway, enough nostalgia: this time I focused on FFMPEG with Python through the library "ffmpeg-python".
You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://www.delftstack.com/howto/python/ffmpeg-python/
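As a taste of what the ffmpeg_python directory experiments with, here is a minimal sketch extracting the audio track of a video and down-mixing it to 16 kHz mono WAV, the sample rate Whisper uses internally; the file names are placeholders:

```python
import ffmpeg  # pip install ffmpeg-python (the ffmpeg binary must also be installed)

# Extract the audio track from a video and convert it to 16 kHz mono WAV
(
    ffmpeg
    .input("video_download_from_yt/sample_video.mp4")
    .output("audio_files_sources/sample_audio.wav", ac=1, ar=16000)
    .run(overwrite_output=True)
)
```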
PYTEST
I included these tests as goodies. It is just that building an AI strategy requires hardening quality like never before; and even if you do not write the tests yourself, they help you figure out which use cases you want to target with your API or with Whisper. Here are some tests grabbed directly from the official repository.
You can check the main readme for all the commands and the directory “tests_from_whisper” for the tests. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper
Source: https://github.com/openai/whisper/tree/main/tests
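For illustration, here is a minimal pytest in the spirit of those shipped with the Whisper repository; the model size and the audio path are my assumptions:

```python
# test_transcribe_smoke.py
import whisper

def test_transcribe_returns_non_empty_text():
    # "tiny" keeps the test fast; any sample from audio_files_sources will do
    model = whisper.load_model("tiny")
    result = model.transcribe("audio_files_sources/sample.mp3", fp16=False)
    assert isinstance(result["text"], str)
    assert result["text"].strip() != ""
```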
Videos to tackle this post
#1 Using Whisper & FastAPI: Unlocking Multilingual Transcription with #Whisper: Exploring Audio for a POC
#2 Using Whisper & FastAPI: Creating a Multilingual Audio API with #Whisper: POC Using #FastAPI
#3 Using Whisper & FastAPI: Leveraging Faster-Whisper for Multilingual NLP & Audio Exploration
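Since video #3 covers Faster-Whisper (explored in 011_faster_whisper.py), here is a minimal sketch of its API, which returns timestamped segments lazily; the model size and compute type are my assumptions:

```python
from faster_whisper import WhisperModel

# CTranslate2 reimplementation of Whisper: same models, faster inference
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio_files_sources/sample.mp3")
print("Detected language:", info.language, "probability:", info.language_probability)

for segment in segments:  # segments is a generator, decoded on the fly
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```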
More info
- faster-whisper: https://github.com/guillaumekln/faster-whisper
- whisper by openai.com: https://openai.com/research/whisper
- whisper on GitHub: https://github.com/openai/whisper
- I used OpenAI’s new tech to transcribe audio right on my laptop: https://www.theverge.com/2022/9/23/23367296/openai-whisper-transcription-speech-recognition-open-source
- FastAPI_Whisper: https://github.com/rosaldo/FastAPI_Whisper
- OpenAI Whisper using FastAPI: https://blindbox.mithrilsecurity.io/en/integration-task/docs/how-to-guides/openai-whisper/
- OpenAI Whisper Python Tutorial: Step-by-Step Guide: https://analyzingalpha.com/openai-whisper-python-tutorial
- Converting Speech to Text with the OpenAI Whisper API: https://www.datacamp.com/tutorial/converting-speech-to-text-with-the-openAI-whisper-API
- WhisperX: https://github.com/m-bain/whisperX
- How to Run OpenAI’s Whisper Speech Recognition Model: https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
- Show and tell for whisper on github.com: https://github.com/openai/whisper/discussions/categories/show-and-tell
- Whisper prompting guide: https://cookbook.openai.com/examples/whisper_prompting_guide
- How to use OpenAI’s Whisper for speech recognition: https://www.graphcore.ai/posts/how-to-use-openais-whisper-for-speech-recognition
- How to install and deploy Whisper, the best open-source alternative to Google Speech-to-Text (in French): https://nlpcloud.com/fr/how-to-install-and-deploy-whisper-the-best-open-source-alternative-to-google-speech-to-text.html
- Generating automatic video subtitles from any language with Whisper AutoCaption: https://blog.paperspace.com/automatic-video-subtitles-with-whisper-autocaption/
- Whisper-AutoCaption: https://github.com/gradient-ai/whisper-autocaption?ref=blog.paperspace.com
- How to: Use Whisper To Convert Speech to Text!: https://www.youtube.com/watch?v=Q7Rq_92kW9A
- Real-time Speech Recognition in 15 minutes with AssemblyAI: https://www.youtube.com/watch?v=5LJFK7eOC20
- Stable Diffusion XL, a latent text-to-image diffusion model capable of generating photo-realistic images given any text input: https://stablediffusionweb.com/
- Voice Cloning for Content Creators: https://marketplace.respeecher.com/
- Introduction to MoviePy: https://www.geeksforgeeks.org/introduction-to-moviepy/
- MoviePy (full documentation), a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects; see the gallery for some examples of use: https://pypi.org/project/moviepy/
- Aloud – dubbing for everyone: https://aloud.area120.google.com/
- ElevenLabs, the official Python API for ElevenLabs text-to-speech software; Eleven brings the most compelling, rich and lifelike voices to creators and developers in just a few lines of code: https://github.com/elevenlabs/elevenlabs-python
- FastAPI: 10 Overlooked Features You Should Be Using: https://medium.com/@kasperjuunge/10-overlooked-fastapi-features-you-should-be-using-9ca53eb4c15b
- For stage_3, I will have to choose among these projects, all on Voice Cloning: https://github.com/topics/voice-cloning