Unlocking Speech-to-Text: Harnessing the Power of the OpenAI Whisper API with FastAPI Integration

To a certain extent, exploring AI use cases has become like playing the game Happy Families :). Indeed, since August 2023, I have tackled several applied AI use cases: deploying NLP features, annotation and machine learning model customization, exploring ChatGPT prompts, delegating time-consuming Product Owner tasks to ChatGPT… I must admit that the ultimate objective is to build a decision matrix providing both a list of usage/solution combinations, each weighed by pros and cons, and a “big picture” of the AI ecosystem.

For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper

Exploring the Audio revolution with Whisper

This time, in the AI usage family, I am picking the audio card. This post is therefore an opportunity for a quick POC exploring audio and AI, with the main objective of extracting a transcription from an audio file and then exposing it via an API. As always, the API is built with FastAPI; this part remains remarkably similar from one post to the next!

This post was inspired by the recent announcement of a collaboration between Spotify and OpenAI, which makes it possible to listen to your favorite podcasts in your native language!

Some prominent English-language podcasts are now available in German, Spanish, and French thanks to “voice cloning”, which is quite impressive. The announcement is unambiguous: “Spotify’s AI Voice Translation Pilot Means Your Favorite Podcasters Might Be Heard in Your Native Language”. So, you can hear podcasters like Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett and their guests with AI-generated voice translations in other languages—including Spanish, French, and German.

I checked out two of these podcasts to see for myself. Well, it is very professional, astonishing, and extremely useful for comprehension, but also mostly performative and a bit rootless culturally. It reminds me of Wittgenstein’s quote: “The limits of my language mean the limits of my world”. Precisely expressing the sensation I got is far beyond the purpose of this post, so I will keep it to myself!

ChatGPT to the rescue to get definitions

Right now, I am more of a “Doer” than a “Thinker”. So, on a more basic level, I asked ChatGPT to clarify what these operations are called; here is the prompt. That clears up the battleground a bit.


How do you call Text-to-Speech Conversion, where from text you create an audio with a robotic voice, and how do you call Text-to-Speech Conversion where the voice of the user has been cloned?

You can check “chat_GPT_3_5_prompt_5” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper

This first prompt also helped me schematically determine the several stages of a machine learning pipeline, which I converted into a second prompt!


Can you describe into different stages the process of audio cloning to produce a podcast like the one from Lex Fridman with Yuval Noah Harari that have been translated to Spanish. Number the steps and give them a summary title each time with a description for each stage e.g stage_1 :: Audio Transcription, stage_2 :: Transcription Translation… and so on

You can check “chat_GPT_3_5_prompt_11” in the file “prompts_chatgpt_samples.diff”. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper

Well, anyway, this time I did not use ChatGPT’s answer, as I found it not synthetic enough and too analytical! Ungrateful human 🙂 I also coined a portmanteau phrase that is not standard industry terminology, “Cloned Voice Overing”, so ChatGPT will not understand it.

So, here is my rough Process Map in four main steps or stages.

  1. stage_1: Audio Transcription
    Description: This is the text extraction from the audio, and also the main focus of this post. At this stage, you can even think of applying some NLP transformations.
  2. stage_2: Transcription Translation
    Description: The transcription is translated into a chosen language. Damn it, here you have to select a good translation tool.
  3. stage_3: Voice Cloning
    Description: The voices heard in the audio, from the host (e.g., Lex Fridman) and the guests (e.g., Yuval Noah Harari), are trained and reproduced; or, more cheaply, you can go with robotic voices for both, but what a fraud then! Ditto, like stage_2, finding a great tool is key.
  4. stage_4: Cloned Voice Overing
    Description: The podcast is produced with the cloned or synthesized voices used for voice-over narration, while also preserving all the elements from the original production (sound effects, jingles…).
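
The four stages can be sketched as a plain Python pipeline. Every function below is a hypothetical placeholder for a real tool (Whisper for stage_1, a translation API, a voice cloning service, an audio mixer); only the wiring is real.

```python
# Hypothetical sketch of the four-stage process map. Each stub stands in
# for a real tool; replace the bodies with actual service calls.

def transcribe_audio(audio_path):          # stage_1: Audio Transcription
    return f"transcript of {audio_path}"

def translate_transcript(text, target):    # stage_2: Transcription Translation
    return f"[{target}] {text}"

def clone_voices(audio_path):              # stage_3: Voice Cloning
    return {"host": "voice_model_host", "guest": "voice_model_guest"}

def produce_voice_over(text, voices):      # stage_4: Cloned Voice Overing
    return f"podcast read by {voices['host']}: {text}"

def pipeline(audio_path, target_language):
    text = transcribe_audio(audio_path)
    translated = translate_transcript(text, target_language)
    voices = clone_voices(audio_path)
    return produce_voice_over(translated, voices)
```

The point of the sketch is that each stage only depends on the output of the previous one, so each tool can be swapped independently.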

Like I said, for the moment I am modestly focused on stage_1.

One last thing! Even though I claim to only be interested in the how-to, I do ask a few questions about the purpose of all this. Without being a conspiracy theorist, this experience raises many questions.

Let’s take the measure of Spotify’s experiment. It is a major advance, but also a brilliant instrument for soft power and cultural domination. Naively, I often tend to forget that technology, content, and ideology are linked. It is always healthy to ask what political and economic purposes lie behind the “so natural” adoption of AI.

Source: https://www.euronews.com/culture/2023/09/26/spotify-launches-ai-tool-for-translating-podcasts

Source: https://newsroom.spotify.com/2023-09-25/ai-voice-translation-pilot-lex-fridman-dax-shepard-steven-bartlett/

Source: https://openai.com/research/whisper

POC with Whisper and attempt with FastAPI

It is mainly a POC, so you will find progressive, iterative attempts that go from simple usage to integration into FastAPI. I also provide some audio samples and, like I said, tests and prompts made to ChatGPT.

Below are all the files and directories for this POC, each with a quick description.

  • 001_openai_whisper.py: minimum loading and usage of WHISPER
  • 002_openai_whisper.py: minimum loading and usage of WHISPER with languages (AR, ES, CN, RU, FR)
  • 003_openai_whisper.py: output the WHISPER transcription into a text file
  • 004_openai_whisper_panda.py: output the WHISPER transcription into a .csv file with pandas
  • 005_openai_whisper.py: WHISPER few attempts on languages detection (AR, ES, CN, RU, FR)
  • 006_openai_whisper_pytube.py: make a WHISPER transcription of a YOUTUBE video
  • 006_openai_whisper_pytube_ffmpeg.py: make a WHISPER transcription of a YOUTUBE video, leveraging FFMPEG
  • 007_openai_whisper.py: ditto to 001_openai_whisper.py
  • 008_openai_whisper_fastapi.py: integration of WHISPER into FASTAPI to provide a POC for an API
  • 009_openai_whisper_fastapi.py: POC with WHISPER and FASTAPI, managing audio and video uploads and extracting transcriptions
  • 010_request_files_fastapi.py: another way to manage file uploads in FASTAPI
  • 011_faster_whisper.py: experiments with FASTER-WHISPER
  • 012_openai_whisper.py: build WHISPER transcription function
  • 013_openai_whisper.py: WHISPER transcription in different formats (.json, .srt, .tsv, .txt, .vtt)
  • README.md: the readme for the main GitHub directory
  • audio_files_sources: some audio samples in different languages
  • ffmpeg_python: experiments with ffmpeg-python
  • output_srtfiles_writer: output directory for transcription
  • prompts_chatgpt_samples.diff: some prompts related to the post and to WHISPER
  • requirements.txt: the python requirements for WHISPER
  • tests_from_whisper: some tests (pytest) extracted from the original WHISPER project
  • video_download_from_yt: output of the audio extraction from a YT video

Audio Transcription with Whisper – Objectives

Let’s get back to basics: using Whisper helps make audio transcriptions in more than 70 languages. For a journalist, or anyone working with audio, it is interesting to easily perform such operations: quickly get a full transcription of an interview to rework, translate, retrieve keywords from, summarize… etc.
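
In code, the basic operation is short. The sketch below follows the openai-whisper package’s documented usage (load_model, transcribe); the file names and the helper for the output path are my own hypothetical additions, and the import is done lazily so the snippet reads without the package installed.

```python
# Minimal Whisper usage sketch (file names are hypothetical).
from pathlib import Path

def transcript_path(audio_path):
    """Where to write the extracted text, next to the audio file."""
    return Path(audio_path).with_suffix(".txt")

def transcribe(audio_path, language=None):
    import whisper  # lazy import: the heavy model is only needed here
    model = whisper.load_model("base")  # "tiny" .. "large": speed/accuracy trade-off
    # language=None lets Whisper auto-detect; pass e.g. "fr" to force French
    result = model.transcribe(audio_path, language=language)
    return result["text"]

# transcript_path("interview_fr.mp3") -> interview_fr.txt
```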

For example, the site Slator points out six practical use cases for the new Whisper API. Well, I am really interested in these two.

Indexing Podcasts and Audio Content
With the rise of podcasts and audio content, the Whisper model can be used to transcribe and generate text-based versions of audio content. This can help improve accessibility for those with hearing impairments and also improve “search-ability” for podcast episodes, making them more discoverable.
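
To make the “search-ability” point concrete: Whisper’s transcription result includes timestamped segments (a list of dicts with "start", "end" and "text" keys), which is enough to build a toy search index over an episode. The segment data below is made up for illustration.

```python
# Toy search index over Whisper-style timestamped segments:
# map each word to the start times of the segments where it occurs.
from collections import defaultdict

def index_segments(segments):
    index = defaultdict(list)
    for seg in segments:
        for word in seg["text"].lower().split():
            index[word.strip(".,!?")].append(seg["start"])
    return index

# Hypothetical segments, shaped like Whisper's result["segments"]
segments = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the podcast."},
    {"start": 4.2, "end": 9.8, "text": "Today we talk about Whisper."},
]
index = index_segments(segments)
# index["whisper"] -> [4.2]
```

A real index would add stemming and stop-word removal, but even this naive version makes an episode jumpable-to by keyword.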

Transcription Services
Transcription service providers can use OpenAI’s Whisper API to transcribe audio and video content in multiple languages accurately and efficiently. The API’s ability to transcribe the audio in near real-time and support multiple file formats allows for greater flexibility and faster turnaround times.
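
Regarding the multiple file formats: Whisper ships writers for .json, .srt, .tsv, .txt and .vtt (that is what 013_openai_whisper.py explores). The core of the subtitle formats is just timestamp formatting; here is a small standalone function in the spirit of what whisper.utils does, not a copy of it.

```python
# Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
def srt_timestamp(seconds):
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # hours
    m, ms = divmod(ms, 60_000)      # minutes
    s, ms = divmod(ms, 1_000)       # seconds, leaving milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# srt_timestamp(62.5) -> "00:01:02,500"
```

The .vtt format is nearly identical, with a dot instead of a comma before the milliseconds.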

Source: Here Are Six Practical Use Cases for the New Whisper API. https://slator.com/six-practical-use-cases-for-new-whisper-api/

This trend was already predicted three years ago… by Deloitte. Well, that is normal: digital transformation and creative disruption are their business!

The ears have it: The rise of audiobooks and podcasting https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2020/rise-of-audiobooks-podcast-industry.html

By the way, there are plenty of alternatives to OpenAI Whisper, such as DeepSpeech, Flashlight, Kaldi, SpeechPy, Speechly, Botium Speech Processing… Unfortunately, I am running out of time to make a POC for each of them and list their pros and cons.

Audio Transcription with Whisper – Environment

As always, it is better to create a “virtual environment”. You can use “venv” or “Anaconda”; I am using Anaconda.

You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper

Source: https://docs.python.org/3/library/venv.html

Audio Transcription with Whisper – Audio samples

Get some audio samples, preferably in different languages and, if possible, with different accents. Always remember that the quality of the dataset prefigures the quality of the result. Here are some resources.

Audio Transcription with Whisper – Extra stuff

Audio and video manipulation brought back my old acquaintance FFMPEG, the subject of some of my very first posts, in French, back in 2009! Anyway, enough nostalgia: this time I focused on using FFMPEG from Python through the library “ffmpeg-python”.
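
Under the hood, ffmpeg-python just builds an ffmpeg command line. Here is a plain sketch of the same idea without the library: constructing the command that extracts mono 16 kHz WAV audio (the sample rate Whisper resamples to anyway) from a video file. The paths are hypothetical, and actually running it assumes the ffmpeg binary is installed.

```python
# Build (not run) the ffmpeg command for extracting Whisper-friendly audio.
import subprocess

def extract_audio_cmd(video_path, wav_path):
    # -vn drops the video stream; -ac 1 -ar 16000 gives mono 16 kHz audio
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", wav_path]

# To actually run it (requires the ffmpeg binary on the PATH):
# subprocess.run(extract_audio_cmd("talk.mp4", "talk.wav"), check=True)
```

Building the command as a list of arguments (rather than one shell string) avoids quoting problems with file names containing spaces.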

You can check the main readme for all the commands. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper



I provide these tests as goodies. Building an AI strategy requires hardening quality like never before, and even if you do not write tests yourself, reading them helps you figure out which use cases you want to target with your API or with Whisper. Here are some tests grabbed directly from the official repository.

You can check the main readme for all the commands and the directory “tests_from_whisper” for the tests. See https://github.com/bflaven/ia_usages/tree/main/ai_openai_whisper

Source: https://github.com/openai/whisper/tree/main/tests

Videos to tackle this post

#1 Using Whisper & FastAPI: Unlocking Multilingual Transcription with #Whisper:
Exploring Audio for a POC

#2 Using Whisper & FastAPI: Creating a Multilingual Audio API with #Whisper: POC Using #FastAPI

#3 Using Whisper & FastAPI: Leveraging Faster-Whisper
for Multilingual NLP & Audio Exploration

More info