Crafting a Fluent Translation API: A Quick Journey into Text Translation with NLLB, HuggingFace, and FastAPI, Plus a Small Dive into RoBERTa Masked Language Modeling with Gradio

I have already mentioned many times on this blog the obstacles linked to managing a multilingual digital product. Indeed, whether you are building a mobile application, a website or even a CMS, multilingualism adds complexity. For AI, inevitably, the question arises with even more force. For instance, whatever the language, how is it possible to ensure the same quality for NLP features (NER, keyword extraction, summarization, etc.) or for audio extraction? And so, what is the best strategy to adopt to overcome this obstacle?

For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio

This post is about looking for the miracle translation tool to tackle the subject of multilingualism. It was of course also an opportunity to discover the inexhaustible resources of the HuggingFace site, as well as the possibility of using Gradio to dress up my AI productions, as Streamlit does.

A little clarification on languages: I am really looking for an LLM able to translate a wide variety of languages. I need to get rid of my Eurocentrism and include in my quest a fair number of different languages: African languages such as Hausa, Kiswahili, Mandenkan, Fulfulde and Wolof; Asian languages such as Chinese, Khmer or Vietnamese; Slavic languages such as Russian and Ukrainian; and of course Persian, Arabic and Romanian. In short, a real Babelian sample!

Meta itself became aware of the under-representation of African linguistic groups in its “No Language Left Behind” project, and is therefore launching a call for proposals to document certain African languages. So if you speak Congo, Kinyarwanda or Swazi, join the bandwagon.

Source: https://ai.meta.com/research/request-for-proposals/translation-support-for-african-languages/

Source: https://ai.meta.com/research/no-language-left-behind/

Source: https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/

A few words on Masked Language Modeling

There is an interesting write-up on Masked Language Modeling on the Keras site, so here is the definition taken from it.

Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be.

For an input that contains one or more mask tokens, the model will generate the most likely substitution for each.

Example:

Input: “I have watched this [MASK] and it was awesome.”
Output: “I have watched this movie and it was awesome.”
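To make this concrete, here is a minimal sketch using the HuggingFace fill-mask pipeline with xlm-roberta-base, one of the checkpoints used in the 004 POC below. Note that RoBERTa-family models use "<mask>" rather than "[MASK]", which is why the snippet reads the mask token from the tokenizer; the same call, with other sentences, reproduces the Oregon and Uganda probes shown further down.

from transformers import pipeline

# Assumption: xlm-roberta-base, one of the checkpoints used in the 004 POC.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# RoBERTa-family models use "<mask>" rather than "[MASK]"; read the
# actual mask token from the tokenizer instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Ask for the 5 most likely substitutions with their probabilities.
for prediction in fill_mask(f"I have watched this {mask} and it was awesome.", top_k=5):
    print(f"{prediction['token_str']}: {prediction['score']:.0%}")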

Like any beginner in AI, during my few tests on Masked Language Modeling with RoBERTa, I experienced some strange things! Sometimes the result is a little confusing, especially, once again, when the “system” has to guess a word in a context that is more “African” than “American”, I should say! Without being a specialist on the issue, I use the example below as an illustration of cognitive bias, which is also very much linked to cultural domination.

Indeed, RoBERTa answers with full confidence for Oregon (even if its top guess, Portland, is actually Oregon's largest city rather than its capital, Salem) but completely misses the capital of Uganda, for lack of appropriate training.

[MASK] is the capital of Oregon.
# The model proposes "Portland" with a probability close to one, a very confident answer, even though Oregon's actual capital is Salem, not Portland!
# Portland 94%
# This 2%
# Washington 1%
# Seattle 1%
# It 0%

[MASK] is the capital and largest city of Uganda.
# Everybody knows that the correct answer is "Kampala", yet "Kampala" does not even appear among the propositions!

# Juba 25% Juba is the capital and largest city of South Sudan. 
# Uganda 14%
# Beni 8% A city in the Democratic Republic of Congo
# Nairobi 6% Nairobi is the capital and largest city of Kenya.
# It 6%


Code: https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio/004_pyyush_maskedlanguagemodeling

Source:
https://keras.io/examples/nlp/masked_language_modeling/

Cut the NLP & Multilingualism’s Knot

One of the main impediments I wanted to solve boils down to this question: how to offer the same NLP features in an API in an unlimited number of languages?

Until now, in fact, the English language has predominated in AI, so models are mostly trained in this language. Two solutions are therefore available to us:

  1. Wait for the situation to improve and therefore for an open source LLM including your missing languages. Time is then your enemy and it is risky but, as we have seen, the NLLB initiative may solve the issue.
  2. Choose a “pivot” language such as English, French or Spanish. Then the process will be the following: translate the content, apply the selected AI models (NER, keyword extraction, summarization, etc.) to the newly translated text in English, then translate the result back! A smarter and more pragmatic solution; at least it is all in your hands! (See the sketch right after this list.)
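A minimal sketch of this pivot strategy, assuming the distilled 600M NLLB checkpoint and a stock English summarizer (sshleifer/distilbart-cnn-12-6); both model choices are mine for illustration, not a recommendation.

from transformers import pipeline

# Assumptions: the distilled 600M NLLB checkpoint (lighter than the 3.3B one)
# and an off-the-shelf English summarizer; both are stand-ins, not the only choices.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_any_language(text: str, src_lang: str) -> str:
    # Step 1: translate the source text into the pivot language (English).
    english = translator(text, src_lang=src_lang, tgt_lang="eng_Latn")[0]["translation_text"]
    # Step 2: apply the English-only NLP model.
    summary = summarizer(english, max_length=60, min_length=10, do_sample=False)[0]["summary_text"]
    # Step 3: translate the result back into the source language.
    return translator(summary, src_lang="eng_Latn", tgt_lang=src_lang)[0]["translation_text"]

# Usage, e.g.: summarize_any_language(long_french_text, src_lang="fra_Latn")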

To translate text from one language to another among 200 languages with PyTorch and Hugging Face Transformers, I will be using Facebook’s NLLB-200 3.3B model.

Source: https://huggingface.co/facebook/nllb-200-3.3B
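For reference, here is a minimal sketch of direct usage with transformers, close to the model card's example. I use the distilled 600M checkpoint as an assumption to keep it runnable on a laptop; the same code applies to the 3.3B checkpoint.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: the distilled 600M checkpoint for local tests; swap in
# "facebook/nllb-200-3.3B" if your hardware allows it.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("La vie est belle.", return_tensors="pt")
# NLLB needs the target language forced as the first generated token,
# identified by its FLORES-200 code ("eng_Latn" for English).
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=64,
    )
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])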

Below is a quick presentation of each GitHub directory in this overall exploration of translation AI mechanisms. Each directory contains a POC with the code.

  • 001a_giladd123_nllb_fastapi: an attempt using No Language Left Behind (aka NLLB, for instance nllb-200-distilled-600M) and FastAPI to offer an endpoint for translation; a minimal sketch of such an endpoint follows this list
  • 002_mlearning_ai: using langdetect and again some experiments with NLLB
  • 003_using_gradio: using Gradio, some concepts + again translation, adapted from https://huggingface.co/spaces/Geonmo/nllb-translation-demo; a Gradio sketch also follows this list
  • 004_pyyush_maskedlanguagemodeling: a use case with FastAPI and RoBERTa (xlm-roberta-base, xlm-roberta-large) for Masked Language Modeling
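
As announced in the list above, here is a minimal sketch of what the translation endpoint of 001a can look like; the route name, payload fields and model choice are my assumptions, not necessarily those of the POC.

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Assumption: the distilled 600M checkpoint, loaded once at startup.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

class TranslationRequest(BaseModel):
    text: str
    src_lang: str = "fra_Latn"  # FLORES-200 codes, as expected by NLLB
    tgt_lang: str = "eng_Latn"

@app.post("/translate")
def translate(req: TranslationRequest):
    result = translator(req.text, src_lang=req.src_lang, tgt_lang=req.tgt_lang)
    return {"translation": result[0]["translation_text"]}

Run it with uvicorn main:app, then POST a JSON body such as {"text": "La vie est belle.", "src_lang": "fra_Latn", "tgt_lang": "eng_Latn"} to /translate.

And, in the spirit of 003, a minimal Gradio sketch to dress up the same translator; the tiny language list is a deliberate assumption on my side, while the Geonmo demo handles all 200 codes.

import gradio as gr
from transformers import pipeline

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

def translate(text, src_lang, tgt_lang):
    return translator(text, src_lang=src_lang, tgt_lang=tgt_lang)[0]["translation_text"]

# A deliberately tiny language list for the demo; NLLB accepts 200 FLORES-200 codes.
LANGS = ["fra_Latn", "eng_Latn", "swh_Latn", "hau_Latn"]

demo = gr.Interface(
    fn=translate,
    inputs=[gr.Textbox(label="Text to translate"),
            gr.Dropdown(LANGS, value="fra_Latn", label="Source language"),
            gr.Dropdown(LANGS, value="eng_Latn", label="Target language")],
    outputs=gr.Textbox(label="Translation"),
)
demo.launch()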

Using langdetect

A very basic usage, but one that can help to automatically select the source language inside an API endpoint dedicated to translation, for instance 🙂
As the name obviously indicates, langdetect is a language detection library: it detects one or several languages from a text.
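
A minimal sketch of langdetect in action; note that it returns ISO 639-1 codes ('fr', 'en', …) while NLLB expects FLORES-200 codes ('fra_Latn', 'eng_Latn', …), so a small mapping is needed in between (the one below is my own tiny example).

from langdetect import DetectorFactory, detect, detect_langs

# langdetect is non-deterministic by default; fixing the seed makes
# repeated calls on the same text return the same result.
DetectorFactory.seed = 0

print(detect("Bonjour tout le monde"))        # -> 'fr' (ISO 639-1)
print(detect_langs("Bonjour tout le monde"))  # -> [fr:0.9999...]

# Assumption: a tiny hand-made mapping towards the FLORES-200 codes
# that NLLB expects; extend it with the languages you actually handle.
ISO_TO_FLORES = {"fr": "fra_Latn", "en": "eng_Latn", "sw": "swh_Latn"}
print(ISO_TO_FLORES[detect("Bonjour tout le monde")])  # -> 'fra_Latn'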

Code: https://github.com/bflaven/ia_usages/tree/main/ia_translation_gradio/002_mlearning_ai

Model for Classification

Just a few resources as a reminder to myself; another subject to explore with AI that should be a post in itself.

Conclusion
The translation problem is a finite problem, which is to say limited: the number of languages is not infinite. Not to mention the phenomenon of language disappearance, which will reduce their number and therefore the complexity of this translation issue! The end will come quickly; maybe translation capabilities addressing all languages will be available in 6 months, 1 year or 2 years. There will remain the thorny and much more costly question of customizing training corpora, to work on translation this time in a specific context, what I would call culture in some way. Just remember the Uganda versus Oregon example shown earlier.

More info