The importance of the labeling (or annotation) process inside an ML pipeline, plus an example of how to train a “custom” NER for Spacy

With AI/ML, I’m trying to go from “user” status to “maker” status. At the very beginning, as is often the case, I was more focused on uses, and the few attempts posted on this blog testify to this strong interest in the how-to.

For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ml_label_studio
But what if I want to go further than simple use? As in any learning process, becoming a “doer” requires an essential step: selecting tools.

This step often lets you both deepen your understanding and practice. At the same time, you should not hesitate to throw away a technology or a tool if it proves too time-consuming for a poor result!

I have focused my attention on a very basic question: how to build a custom NER with the help of Spacy? This question led me to a bigger one: in NLP, how to build a machine learning pipeline with Label Studio or Doccano (open source) instead of Prodigy (paid)?

Choosing the right tool answers the thorny question of ensuring data quality, which in turn may improve the accuracy of your model! Otherwise, you may fall under the “Shit in, Shit out” curse.

As always, to ensure quality and accuracy, you must get down to the nitty-gritty. So, if you can, look under the hood and get your hands dirty with code.

The importance of the Labeling process or annotating inside the ML pipeline.

Source: https://www.researchgate.net/figure/A-common-pipeline-for-annotating-data-and-building-models-with-machine-learning_fig1_359227212

So, for the sake of simplicity, the labeling process can be cut into 6 steps:

  • step_1. Unlabeled data
  • step_2. Labeler (loop)
  • step_3. Labeled data
  • step_4. Training (loop)
  • step_5. Model
  • step_6. Inference

“Simplest” Labeling Process with 6 Steps
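
To make the six steps concrete, here is a toy walk-through in plain Python. It is only a sketch under loud assumptions: the "labeler" is a hard-coded dictionary standing in for a human annotator, the "model" merely memorises phrases instead of learning anything, and every name (`label_texts`, `train`, `predict`, the PLAYER and TEAM labels) is hypothetical.

```python
# Toy version of the 6 steps. Hypothetical names throughout: a real
# pipeline would use an annotation tool and a trained statistical model.

# step_1. Unlabeled data
unlabeled = ["Sadio Mané plays for Senegal", "Senegal won the final"]


def label_texts(texts, human_knowledge):
    """step_2. Labeler: stand-in for a human marking spans in each text."""
    labeled = []
    for text in texts:
        spans = []
        for phrase, label in human_knowledge.items():
            start = text.find(phrase)
            if start != -1:
                spans.append((start, start + len(phrase), label))
        labeled.append((text, sorted(spans)))
    return labeled


def train(labeled):
    """step_4. Training: this toy only memorises phrase -> label pairs."""
    model = {}
    for text, spans in labeled:
        for start, end, label in spans:
            model[text[start:end]] = label
    return model


def predict(model, text):
    """step_6. Inference: look up every memorised phrase in a new text."""
    return sorted(
        (text.find(phrase), text.find(phrase) + len(phrase), label)
        for phrase, label in model.items()
        if phrase in text
    )


human_knowledge = {"Senegal": "TEAM", "Sadio Mané": "PLAYER"}
labeled = label_texts(unlabeled, human_knowledge)  # step_3. Labeled data
model = train(labeled)                             # step_5. Model
print(predict(model, "Sadio Mané scored for Senegal"))
```

In a real project, steps 2 and 4 are loops: you label a batch, train, look at the mistakes, then go back and label some more.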

Here is some quick terminology to roughly grasp the essentials of the labeling process.

  • Data: the interesting stuff to annotate – image, audio, video, text.
  • Task: a single piece of data that is ready to be predicted or annotated; it is part of a job and includes the metadata relevant for predicting or annotating.
  • Prediction: inference output from a machine learning model.
  • Annotation: human-created or… human-approved.

Source: https://www.youtube.com/watch?v=VBnhpMLpBB0
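
These four terms fit together in a simple data structure. The sketch below is my own illustration, not Label Studio's actual schema: the class names `Span` and `Task`, their fields and the `approve` helper are all hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Span:
    """One labeled region of a text (hypothetical structure)."""
    start: int
    end: int
    label: str


@dataclass
class Task:
    """A single piece of data, ready to be predicted or annotated."""
    data: dict                                        # raw item plus metadata
    predictions: list = field(default_factory=list)   # model output
    annotations: list = field(default_factory=list)   # human-created or approved

    def approve(self, index=0):
        """A human accepts a prediction as-is: it becomes an annotation."""
        self.annotations.append(self.predictions[index])


task = Task(data={"text": "Label Studio is open source", "source": "blog"})
task.predictions.append([Span(0, 12, "TOOL")])  # what a model suggested
task.approve()                                  # what a human validated
print(asdict(task))
```

The point of the distinction is that a prediction only becomes training material once a human has created or approved it.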

As I said, before discovering the labeling process, my original question was how to create a specific NER with the help of Spacy.

Indeed, the ability to recognize named entities is paramount, but most of the time you need a specific NER because you are working with domain-specific data, e.g. Finance, Medicine, Cooking, History, Biology, African Football, Law, Cinema, Television, Transport… In short, there are many domains where you have to deal with specific words that sound like gibberish to laymen!
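
For a domain like Cooking, training data for a custom NER usually comes down to texts plus character offsets for each entity. The sketch below uses the `(text, {"entities": [(start, end, label)]})` shape that many Spacy tutorials rely on. The labels, the sentence and the `check_entities` helper are hypothetical. The checks reflect my understanding that Spacy's NER expects non-overlapping spans and, as far as I know, also rejects spans that start or end with whitespace.

```python
# Hand-written example in the tuple shape used by many Spacy tutorials.
# TECHNIQUE and INGREDIENT are made-up labels for a cooking domain.
TRAIN_DATA = [
    (
        "Start the roux with beurre noisette.",
        {"entities": [(10, 14, "TECHNIQUE"), (20, 35, "INGREDIENT")]},
    ),
]


def check_entities(text, entities):
    """Return a list of problems found in (start, end, label) spans."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        span = text[start:end]
        if not 0 <= start < end <= len(text):
            problems.append(f"{label}: offsets {start}-{end} out of range")
        elif span != span.strip():
            problems.append(f"{label}: span {span!r} has surrounding whitespace")
        if start < previous_end:
            problems.append(f"{label}: overlaps the previous span")
        previous_end = max(previous_end, end)
    return problems


for text, annotation in TRAIN_DATA:
    for start, end, label in annotation["entities"]:
        print(label, "->", repr(text[start:end]))
    print("problems:", check_entities(text, annotation["entities"]))
```

Running this kind of check before training is cheap, and off-by-one offsets are exactly the sort of "Shit in" that quietly ruins a model.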

Methodology

I have to talk a little bit about methodology. First of all, use cases have always helped me enliven my learning process. So, I do look for explanations, but I also select use cases, mostly quick and reproducible ones. Still, use cases must be counterbalanced by the awareness that mere imitation is not sufficient to fully understand AI/ML! Reproducing a tutorial will never turn you into a specialist. Thank God, becoming a specialist is not my objective. As much as possible, I try to isolate a real process, like the labeling one, to master AI cases and minimize uncertainty when working on projects in real life.

The Custom NER example

In the field of NLP, creating a “custom NER” is a classic example, and most often the favored one. Indeed, it is the most intuitive process with the most striking results, a bit like writing your first Cypress test on your website homepage, creating a blog in 20 minutes with RoR or building your WP plugin in 5 minutes.

For any POC to have value, meaning to be instructive and profitable, two things are needed:

  1. Set goals (OKR in particular)
  2. Set a time limit (deadline); otherwise your POC turns into a WIP, which is as toxic as procrastination. In general, I try not to exceed a month and a half.

Always define objectives

So, here is a list of my objectives in terms of learning value:

  1. Create a custom NER for Spacy for a specific vocabulary or a specific language (African languages or poorly documented languages, e.g. Hmong). This exploration leads to discovering useful NLP formats such as JSON, JSON-MIN and CONLL2003 that facilitate the workflow.
  2. Discover a viable workflow for customising a NER inside Spacy and find a free alternative to the tool https://prodi.gy/. This exploration leads to annotation tools.
  3. Eventually extend the annotation tool to other types of media, e.g. video, audio, images… This exploration could lead to a customized ML model for YOLO image detection, for instance. This case will probably be the topic of a future post.
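
To give an idea of what the formats in objective 1 look like in practice, here is a minimal sketch that reads a JSON-MIN style export and turns it into Spacy-style tuples and into token/tag rows. Big assumption: the sample below mimics what I understand a Label Studio JSON-MIN export for NER to look like, where the key `label` is the name given to the labeling control in the project. Check your own export before relying on it. The function names and the LANGUAGE and COUNTRY labels are hypothetical, and the token/tag rows are a simplified two-column version, whereas real CoNLL-2003 files have four columns (token, POS, chunk, NER).

```python
import json
import re

# Assumed shape of a JSON-MIN export (verify against your own file).
SAMPLE_EXPORT = """
[
  {"id": 1,
   "text": "Hmong is spoken in Laos",
   "label": [
     {"start": 0, "end": 5, "text": "Hmong", "labels": ["LANGUAGE"]},
     {"start": 19, "end": 23, "text": "Laos", "labels": ["COUNTRY"]}
   ]}
]
"""


def json_min_to_spacy(tasks, control_name="label"):
    """Convert exported tasks to (text, {"entities": [...]}) tuples."""
    train_data = []
    for task in tasks:
        entities = []
        for region in task.get(control_name, []):
            for label in region["labels"]:
                entities.append((region["start"], region["end"], label))
        train_data.append((task["text"], {"entities": sorted(entities)}))
    return train_data


def to_bio(text, entities):
    """Whitespace-tokenize and tag each token with a BIO label."""
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            if match.start() >= start and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
        rows.append((match.group(), tag))
    return rows


train_data = json_min_to_spacy(json.loads(SAMPLE_EXPORT))
print(train_data)
for token, tag in to_bio(*[train_data[0][0], train_data[0][1]["entities"]]):
    print(token, tag)
```

Splitting on whitespace is the crudest possible tokenizer. It is enough to show the idea, but punctuation glued to a word would need a real tokenizer.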

Remember a few LEAN AGILE principles and mix them in

Just as a quick reminder, since I had read some material on Lean earlier, I found it interesting to mix a few rules from the Lean methodology into my exploration of the labeling process. These principles are useful reminders when you get lost in explanations or in useless code creation.

  • Go on site to see the problem: Go&See or “Go to the Gemba”
  • Gemba (also romanized Genba) is Japanese for “the actual place”; it is the word Japanese detectives use for a crime scene. Remain on the ground where you create and produce value. In business, Gemba refers to the place where value is created. Source: https://en.wikipedia.org/wiki/Gemba

  • The Plan-Do-Check-Act (PDCA) cycle
  • The Plan-Do-Check-Act cycle avoids investing in false leads. Lean favors countermeasures that are ingenious, economical and whose effect can be verified quickly. I cannot resist quoting these examples of waste, as I have definitely fallen into one of these traps once in a while.

    Types of waste: the 7 examples of waste

    1. Overproduction: Write code that is never used. Create a User Story sooner than necessary.
    2. Corrections & retouching: Investigation and correction of a defect; Refactoring exactly opposite to that done by another pair.
    3. Waiting: Waiting for the availability of the test environment. Waiting for build result; Workstation slowness; Waiting for functional information or decision while performing a task.
    4. Inventory: Maintenance of a backlog of undeveloped User Stories; Regularly review the post-its on an “Ideas for improvement” poster without ever implementing these ideas.
    5. Useless gestures: Code navigation; Restarting the development environment (IDE).
    6. Unnecessary steps: Two pairs performing overlapping tasks.
    7. Transportation: Carry over changes between different branches within a configuration management system; Passing information from one developer to another.

So, the purpose is to flesh out and bring to life a real procedure, process or pipeline for annotating data and building models with machine learning for any kind of content.

In most cases, I have replaced Prodigy with Label Studio or even Doccano. I would rather go for Label Studio as it is compatible with DAGSHub. Label Studio provides great documentation and many examples for any kind of annotation process, including sound, images, videos…

Set up the environments with Anaconda

# list the existing environments
conda info --envs

# create an Anaconda environment named tagging_entity_extraction
conda create --name tagging_entity_extraction python=3.9.13

# activate it, install the requirements, then leave it
# ("conda activate" replaces the older "source activate")
conda activate tagging_entity_extraction
pip install -r requirements_tagging_entity_extraction.txt
conda deactivate

# same thing for the ml_with_label_studio environment
conda activate ml_with_label_studio
pip install -r requirements_ml_with_label_studio.txt
conda deactivate

# to export the requirements (run each command inside the matching activated environment)
pip freeze > requirements_ml_with_label_studio.txt
pip freeze > requirements_tagging_entity_extraction.txt

# if needed, remove an environment
conda env remove -n [NAME_OF_THE_CONDA_ENVIRONMENT]
conda env remove -n chainlit_python
conda env remove -n ai_chatgpt_prompts

# update conda
conda update -n base -c defaults conda

Use cases

Some explanations can be found in the readme on https://github.com/bflaven/BlogArticlesExamples/tree/master/ml_label_studio

Videos to tackle this post

#1 Machine Learning for NLP: Labeling process to train a “custom” NER for Spacy with Label Studio

#2 Machine Learning for NLP: Export & convert data from Label Studio to Spacy training format to create a custom NER

#3 Machine Learning for NLP: Work with a dataset in Label Studio for a Pytorch Sentiment Analysis

#4 Machine Learning for NLP: Running & Connecting a ML Backend to Label Studio for a Pytorch Sentiment Analysis

More info