Crafting a Fluent Translation API: A Quick Journey into Text Translation with NLLB, HuggingFace, and FastAPI, Plus a Small Dive into RoBERTa Masked Language Modeling with Gradio

I have already mentioned many times on this blog the obstacles linked to managing a multilingual digital product. Indeed, whether you are building a mobile application, a website or even a CMS, multilingualism adds complexity. For AI, inevitably, the question arises with even more force. For instance, whatever the language, how is it possible to ensure the same quality for NLP features (NER, keyword extraction, summarization, etc.) or for audio extraction? And so, what is the best strategy to adopt to overcome this obstacle?

For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio

This post is about looking for the miracle translation tool to tackle the subject of multilingualism. It was of course also an opportunity to discover the inexhaustible resources of the HuggingFace site, as well as the possibility of using Gradio to dress up my AI productions, as Streamlit does.

A little clarification on languages: I am really looking for an LLM able to translate a wide variety of languages. I need to get rid of my Eurocentrism and include in my quest a fair number of different languages: African languages such as Hausa, Kiswahili, Mandenkan, Fulfulde and Wolof; Asian languages such as Chinese, Khmer or Vietnamese; Slavic languages such as Russian and Ukrainian; and of course Persian, Arabic and Romanian. In short, a real Babelian sample!

Meta itself became aware of the under-representation of African linguistic groups in its “No Language Left Behind” project, and is therefore launching a call for proposals to document certain African languages. So if you speak Congo, Kinyarwanda or Swazi, join the bandwagon.

Source: https://ai.meta.com/research/request-for-proposals/translation-support-for-african-languages/

Source: https://ai.meta.com/research/no-language-left-behind/

Source: https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/

A few words on Masked Language Modeling

There is an interesting write-up on Masked Language Modeling on the Keras site, so here is the definition taken from it.

Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be.

For an input that contains one or more mask tokens, the model will generate the most likely substitution for each.

Example:

Input: “I have watched this [MASK] and it was awesome.”
Output: “I have watched this movie and it was awesome.”
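To make this concrete, here is a minimal sketch using the HuggingFace fill-mask pipeline with xlm-roberta-base, one of the checkpoints used in the 004 POC below. Note that RoBERTa-family models use "<mask>" rather than "[MASK]", which is why the snippet reads the mask token from the tokenizer; the same call, with other sentences, reproduces the Oregon and Uganda probes shown further down.

from transformers import pipeline

# Assumption: xlm-roberta-base, one of the checkpoints used in the 004 POC.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# RoBERTa-family models use "<mask>" rather than "[MASK]"; read the
# actual mask token from the tokenizer instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Ask for the 5 most likely substitutions with their probabilities.
for prediction in fill_mask(f"I have watched this {mask} and it was awesome.", top_k=5):
    print(f"{prediction['token_str']}: {prediction['score']:.0%}")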

Like any beginner in AI, during my few tests on Masked Language Modeling with RoBERTa, I experienced some strange things! Sometimes the result is a little confusing, especially, once again, when the “system” has to guess a word in a context that is more “African” than “American”, I should say! Without being a specialist on the issue, I use the example below as an illustration of cognitive bias, which is also very much linked to cultural domination.

Indeed, RoBERTa answers with full confidence for Oregon (even if its top guess, Portland, is actually Oregon's largest city rather than its capital, Salem) but completely misses the capital of Uganda, for lack of appropriate training.

[MASK] is the capital of Oregon.
# The model proposes "Portland" with a probability close to one, a very confident answer, even though Oregon's actual capital is Salem, not Portland!
# Portland 94%
# This 2%
# Washington 1%
# Seattle 1%
# It 0%

[MASK] is the capital and largest city of Uganda.
# Everybody knows that the correct answer is "Kampala", yet "Kampala" does not even appear among the propositions!

# Juba 25% Juba is the capital and largest city of South Sudan. 
# Uganda 14%
# Beni 8% A city in the Democratic Republic of Congo
# Nairobi 6% Nairobi is the capital and largest city of Kenya.
# It 6%


Code: https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio/004_pyyush_maskedlanguagemodeling

Source:
https://keras.io/examples/nlp/masked_language_modeling/

Cut the NLP & Multilingualism’s Knot

One of the main impediments I wanted to solve boils down to this question: how to offer the same NLP features in an API in an unlimited number of languages?

Until now, in fact, the English language has predominated in AI, so models are mostly trained in this language. Two solutions are therefore available to us:

  1. Wait for the situation to improve and therefore for an open source LLM including your missing languages. Time is then your enemy and it is risky but, as we have seen, the NLLB initiative may solve the issue.
  2. Choose a “pivot” language such as English, French or Spanish. Then the process will be the following: translate the content, apply the selected AI models (NER, keyword extraction, summarization, etc.) to the newly translated text in English, then translate the result back! A smarter and more pragmatic solution; at least it is all in your hands! (See the sketch right after this list.)
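A minimal sketch of this pivot strategy, assuming the distilled 600M NLLB checkpoint and a stock English summarizer (sshleifer/distilbart-cnn-12-6); both model choices are mine for illustration, not a recommendation.

from transformers import pipeline

# Assumptions: the distilled 600M NLLB checkpoint (lighter than the 3.3B one)
# and an off-the-shelf English summarizer; both are stand-ins, not the only choices.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_any_language(text: str, src_lang: str) -> str:
    # Step 1: translate the source text into the pivot language (English).
    english = translator(text, src_lang=src_lang, tgt_lang="eng_Latn")[0]["translation_text"]
    # Step 2: apply the English-only NLP model.
    summary = summarizer(english, max_length=60, min_length=10, do_sample=False)[0]["summary_text"]
    # Step 3: translate the result back into the source language.
    return translator(summary, src_lang="eng_Latn", tgt_lang=src_lang)[0]["translation_text"]

# Usage, e.g.: summarize_any_language(long_french_text, src_lang="fra_Latn")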

To translate text from one language to another among 200 languages with PyTorch and Hugging Face Transformers, I will be using Facebook’s NLLB-200 3.3B model.

Source: https://huggingface.co/facebook/nllb-200-3.3B
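For reference, here is a minimal sketch of direct usage with transformers, close to the model card's example. I use the distilled 600M checkpoint as an assumption to keep it runnable on a laptop; the same code applies to the 3.3B checkpoint.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: the distilled 600M checkpoint for local tests; swap in
# "facebook/nllb-200-3.3B" if your hardware allows it.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("La vie est belle.", return_tensors="pt")
# NLLB needs the target language forced as the first generated token,
# identified by its FLORES-200 code ("eng_Latn" for English).
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=64,
    )
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])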

Below is a quick presentation of each GitHub directory in this overall exploration of translation AI mechanisms. Each directory contains a POC with the code.

  • 001a_giladd123_nllb_fastapi: an attempt using No Language Left Behind (aka NLLB, for instance nllb-200-distilled-600M) and FastAPI to offer an endpoint for translation; a minimal sketch of such an endpoint follows this list
  • 002_mlearning_ai: using langdetect and again some experiments with NLLB
  • 003_using_gradio: using Gradio, some concepts + again translation, adapted from https://huggingface.co/spaces/Geonmo/nllb-translation-demo; a Gradio sketch also follows this list
  • 004_pyyush_maskedlanguagemodeling: a use case with FastAPI and RoBERTa (xlm-roberta-base, xlm-roberta-large) for Masked Language Modeling
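
As announced in the list above, here is a minimal sketch of what the translation endpoint of 001a can look like; the route name, payload fields and model choice are my assumptions, not necessarily those of the POC.

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Assumption: the distilled 600M checkpoint, loaded once at startup.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

class TranslationRequest(BaseModel):
    text: str
    src_lang: str = "fra_Latn"  # FLORES-200 codes, as expected by NLLB
    tgt_lang: str = "eng_Latn"

@app.post("/translate")
def translate(req: TranslationRequest):
    result = translator(req.text, src_lang=req.src_lang, tgt_lang=req.tgt_lang)
    return {"translation": result[0]["translation_text"]}

Run it with uvicorn main:app, then POST a JSON body such as {"text": "La vie est belle.", "src_lang": "fra_Latn", "tgt_lang": "eng_Latn"} to /translate.

And, in the spirit of 003, a minimal Gradio sketch to dress up the same translator; the tiny language list is a deliberate assumption on my side, while the Geonmo demo handles all 200 codes.

import gradio as gr
from transformers import pipeline

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

def translate(text, src_lang, tgt_lang):
    return translator(text, src_lang=src_lang, tgt_lang=tgt_lang)[0]["translation_text"]

# A deliberately tiny language list for the demo; NLLB accepts 200 FLORES-200 codes.
LANGS = ["fra_Latn", "eng_Latn", "swh_Latn", "hau_Latn"]

demo = gr.Interface(
    fn=translate,
    inputs=[gr.Textbox(label="Text to translate"),
            gr.Dropdown(LANGS, value="fra_Latn", label="Source language"),
            gr.Dropdown(LANGS, value="eng_Latn", label="Target language")],
    outputs=gr.Textbox(label="Translation"),
)
demo.launch()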

Using langdetect

A very basic usage, but one that can help to automatically select the source language inside an API endpoint dedicated to translation, for instance 🙂
As the name obviously indicates, langdetect is a language detection library: it detects one or several languages from a text.
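
A minimal sketch of langdetect in action; note that it returns ISO 639-1 codes ('fr', 'en', …) while NLLB expects FLORES-200 codes ('fra_Latn', 'eng_Latn', …), so a small mapping is needed in between (the one below is my own tiny example).

from langdetect import DetectorFactory, detect, detect_langs

# langdetect is non-deterministic by default; fixing the seed makes
# repeated calls on the same text return the same result.
DetectorFactory.seed = 0

print(detect("Bonjour tout le monde"))        # -> 'fr' (ISO 639-1)
print(detect_langs("Bonjour tout le monde"))  # -> [fr:0.9999...]

# Assumption: a tiny hand-made mapping towards the FLORES-200 codes
# that NLLB expects; extend it with the languages you actually handle.
ISO_TO_FLORES = {"fr": "fra_Latn", "en": "eng_Latn", "sw": "swh_Latn"}
print(ISO_TO_FLORES[detect("Bonjour tout le monde")])  # -> 'fra_Latn'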

Code: https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio/002_mlearning_ai

Model for Classification

Just a few resources as a reminder to myself; another subject to explore with AI that should be a post in itself.

Conclusion
The translation problem is a finite problem, which is to say limited: the number of languages is not infinite. Not to mention the phenomenon of language disappearance, which will reduce their number and therefore the complexity of this translation issue! The end will come quickly; maybe translation capabilities addressing all languages will be available in 6 months, 1 year or 2 years. There will remain the thorny and much more costly question of customizing training corpora, to work on translation this time in a specific context, what I would call culture in some way. Just remember the Uganda versus Oregon example shown earlier.

More info