The importance of the labeling (or annotation) process inside an ML pipeline, plus an example of how to train a “custom” NER for spaCy
With AI/ML, I’m trying to go from “user” status to “maker” status. At the very beginning, as is often the case, I focused on usage, and a few earlier posts on this blog testify to this strong interest in how-tos.
- Improve a CMS’s photos library qualification with AI, Object Detection from images in Python with OpenCv and YOLO
https://flaven.fr/2021/03/improve-a-cmss-photos-library-qualification-with-ai-object-detection-from-images-in-python-with-opencv-and-yolo/
- Quick overview about using NLP for a CMS Customer Support (FAQs turn to a Chatbot) or CMS editorial features for Journalist (Keywords Extraction) using spaCy, Rake, TensorFlow, Pytorch
https://flaven.fr/2020/09/quick-overview-about-using-nlp-for-a-cms-customer-support-faqs-turn-to-a-chatbot-or-cms-editorial-features-for-journalist-keywords-extraction-using-spacy-rake-tensorflow-pytorch/
For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ml_label_studio
But what if I want to go further than simple use? As in any learning process, becoming a “doer” requires an essential step: selecting tools.
This step often helps both to deepen understanding and to practice. At the same time, you should not hesitate to throw away a technology or a tool if it proves too time-consuming for a poor result!
I have focused my attention on a very basic question: how to build a custom NER with the help of spaCy? This question led me to a bigger one: in NLP, how to build a machine learning pipeline with Label Studio or Doccano (open source) instead of Prodigy (paid)?
Choosing the right tool answers the thorny question of ensuring data quality, and good data is what gives your model a chance of being accurate. Otherwise, you may fall under the “Shit in, shit out” curse.
As always, to ensure quality and accuracy, you must get down to the nitty-gritty. So, if you can, look under the hood and get your hands dirty with code.
The importance of the Labeling process or annotating inside the ML pipeline.
So, for the sake of simplicity, the labeling process can be cut into 6 steps:
- step_1. Unlabeled data
- step_2. Labeler (loop)
- step_3. Labeled data
- step_4. Training (loop)
- step_5. Model
- step_6. Inference
“Simplest” Labeling Process with 6 Steps
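To make the six steps a bit more tangible, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the sentences, the `human_knowledge` dictionary standing in for a human labeler, and a "model" that only memorises the surface forms it saw during labeling. It is not how spaCy or Label Studio work internally; it only shows how data flows from one step to the next.

```python
# Step 1: unlabeled data (hypothetical example sentences).
unlabeled_data = [
    "Wolof is spoken in Senegal",
    "Bambara is spoken in Mali",
]

# Stand-in for what a human annotator knows (an assumption for this sketch).
human_knowledge = {
    "Wolof": "LANGUAGE",
    "Bambara": "LANGUAGE",
    "Senegal": "COUNTRY",
    "Mali": "COUNTRY",
}


def find_spans(text, lexicon):
    """Return sorted (start, end, label) spans for lexicon entries found in text."""
    spans = []
    for surface, label in lexicon.items():
        start = text.find(surface)  # first occurrence only, enough for a toy
        if start != -1:
            spans.append((start, start + len(surface), label))
    return sorted(spans)


def train(labeled_data):
    """Step 4: 'training' here only memorises the annotated surface forms."""
    model = {}
    for text, spans in labeled_data:
        for start, end, label in spans:
            model[text[start:end]] = label
    return model


# Steps 2 and 3: the labeler turns unlabeled data into labeled data.
labeled_data = [(text, find_spans(text, human_knowledge)) for text in unlabeled_data]

# Steps 4 and 5: training produces the model.
model = train(labeled_data)

# Step 6: inference on a sentence the model has never seen.
new_text = "People in Mali also speak Wolof"
predictions = find_spans(new_text, model)
for start, end, label in predictions:
    print(new_text[start:end], label)
```

In a real pipeline, steps 2 and 4 are loops: predictions are sent back to the labeler, corrected, and fed into the next round of training.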
Here is some quick terminology to roughly grasp the essentials of the labeling process.
- Data: the interesting stuff to annotate – image, audio, video, text.
- Task: a single piece of data that is ready to be predicted or annotated; part of a job; includes the metadata relevant for predicting or annotating.
- Prediction: inference output from a machine learning model.
- Annotation: human-created or… human-approved.
Source: https://www.youtube.com/watch?v=VBnhpMLpBB0
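The terminology above can be sketched as a few Python dataclasses. The class and field names (`Span`, `model_version`, `annotator`, `approved_from_prediction`, `accept`) are hypothetical and chosen for this illustration only; they are not the schema of Label Studio or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A labeled region of text, given as character offsets."""
    start: int
    end: int
    label: str


@dataclass
class Prediction:
    """Inference output from a machine learning model."""
    spans: list
    model_version: str


@dataclass
class Annotation:
    """Human-created, or a prediction that a human approved."""
    spans: list
    annotator: str
    approved_from_prediction: bool = False


@dataclass
class Task:
    """A single piece of data, plus metadata, ready to be predicted or annotated."""
    data: str
    meta: dict = field(default_factory=dict)
    predictions: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def accept(self, prediction, annotator):
        """A human approves a prediction, which turns it into an annotation."""
        annotation = Annotation(list(prediction.spans), annotator, True)
        self.annotations.append(annotation)
        return annotation


task = Task("Hmong is spoken in Laos", meta={"source": "example"})
prediction = Prediction([Span(0, 5, "LANGUAGE")], model_version="toy-0")
task.predictions.append(prediction)
annotation = task.accept(prediction, annotator="reviewer_1")
```

The point of the `accept` method is the "or… approved" part of the definition: a prediction only counts as an annotation once a human has signed off on it.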
As I said, before discovering the labeling process, my original question was how to create a specific NER with the help of spaCy.
Indeed, the ability to recognize named entities is paramount, but most of the time you may need a specific NER because you are working with domain-specific data, e.g. Finance, Medicine, Cooking, History, Biology, African Football, Law, Cinema, Television, Transport… In any case, there are many domains where you have to deal with specific words that sound like gibberish to laymen!
Methodology
I have to talk a little bit about methodology. First of all, use cases have always helped me to enliven my learning process. So, I do look for explanations, but I also select use cases, mostly quick and reproducible ones. Even so, use cases must be counterbalanced by the awareness that mere imitation is not sufficient to fully understand AI/ML! Reproducing a tutorial will never turn you into a specialist. Thank God, becoming a specialist is not my objective. As much as possible, I try to isolate a real process, like the labeling one, to master AI cases and minimize uncertainty when working on real-life projects.
The Custom NER example
In the field of NLP, creating a “custom NER” is a classic example, and most often the favored one. Indeed, it is the most intuitive process with the most striking results, a bit like writing your first Cypress test on your website homepage, creating a blog in 20 minutes with RoR or building your WP plugin in 5 minutes.
For any POC to have value, meaning instructive and profitable, two things are needed:
- Set goals (OKR in particular)
- Set a time limit (deadline); otherwise your POC turns into a permanent WIP, which is as toxic as procrastination. In general, I try not to exceed a month and a half.
Always define objectives
So, here is a list of my objectives in terms of learning value:
- Create a custom NER for spaCy for a specific vocabulary or a specific language (African languages or poorly documented languages, e.g. Hmong). This exploration leads to discovering useful NLP formats such as JSON, JSON-MIN and CONLL2003 that facilitate the workflow.
- Discover a viable workflow for customising a NER inside spaCy and find a free alternative to https://prodi.gy/. This exploration leads to annotation tools.
- Eventually extend the annotation tool to other types of media, e.g. video, audio, images… This exploration leads, for instance, to generating a customized ML model for YOLO image detection. This case will probably be the topic of a future post.
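To give an idea of what a format like CONLL2003 carries, here is a minimal sketch that turns character-offset entity spans into token-level BIO tags (B- for the first token of an entity, I- for the following ones, O for everything else). It is a simplification under stated assumptions: tokens are split on whitespace only, the sentence and labels are made up, and the output has two columns, whereas real CoNLL-2003 files have four (token, part-of-speech, chunk, entity tag). The helper names `span_of`, `spans_to_bio` and `to_conll` are hypothetical.

```python
import re


def span_of(text, surface, label):
    """Build a (start, end, label) span for the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def spans_to_bio(text, spans):
    """Return (token, tag) pairs using whitespace tokens and BIO tags."""
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to an entity if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


def to_conll(rows):
    """Write one 'token tag' pair per line (simplified two-column layout)."""
    return "\n".join(f"{token} {tag}" for token, tag in rows)


text = "Bambara is spoken in Burkina Faso"
spans = [
    span_of(text, "Bambara", "LANGUAGE"),
    span_of(text, "Burkina Faso", "COUNTRY"),
]
rows = spans_to_bio(text, spans)
conll = to_conll(rows)
print(conll)
```

Whitespace splitting breaks as soon as punctuation sticks to a word ("Mali," would not match a span covering "Mali"), which is one reason to rely on a real tokenizer when doing this for an actual project.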
Remember a few Lean/Agile principles and mix them in
Just as a quick reminder, since I had read a few things about Lean earlier, I found it interesting to mix some rules from the Lean methodology into my exploration of the labeling process. All these principles are useful reminders when you get lost in explanations or in writing useless code.
- Go on site to see the problem: Go&See or “Go to the Gemba”
- The Plan-Do-Check-Act (PDCA) cycle
- Overproduction: Write code that is never used. Create a User Story sooner than necessary.
- Corrections & retouching: Investigation and correction of a defect; Refactoring exactly opposite to that done by another pair.
- Waiting: Waiting for the availability of the test environment. Waiting for build result; Workstation slowness; Waiting for functional information or decision while performing a task.
- Inventory: Maintenance of a backlog of undeveloped User Stories; Regularly review the post-its on an “Ideas for improvement” poster without ever implementing these ideas.
- Useless gestures: Code navigation; Restarting the development environment (IDE).
- Unnecessary steps: Two pairs performing overlapping tasks.
- Transportation: Carry over changes between different branches within a configuration management system; Passing information from one developer to another.
Gemba is the Japanese equivalent of the crime scene. Remain on the ground where you create and produce value. In business, Gemba (also romanized as Genba) refers to the place where value is created. Source: https://en.wikipedia.org/wiki/Gemba
The Plan-Do-Check-Act cycle avoids investing in false leads. Lean favors countermeasures that are ingenious, economical and whose effect can be verified quickly. I cannot resist quoting these examples of waste, as I have definitely fallen into one of those traps once in a while.
Type of waste: the 7 examples of waste
So, the purpose is to flesh out and bring to life a real procedure, process or pipeline for annotating data and building models with machine learning, for any kind of content.
In most cases, I have replaced Prodigy with Label Studio or even Doccano. I would rather go for Label Studio as it is compatible with DAGSHub. Label Studio has great documentation and many examples for any kind of annotation process, including sounds, images, videos….
Set the environments with anaconda
# list the envs
conda info --envs

# ml_with_label_studio
source activate ml_with_label_studio
pip freeze > requirements_ml_with_label_studio.txt
conda deactivate

# tagging_entity_extraction
source activate tagging_entity_extraction
pip freeze > requirements_tagging_entity_extraction.txt
conda deactivate

# Create a anaconda environment named tagging_entity_extraction
conda create --name tagging_entity_extraction python=3.9.13
conda info --envs
source activate tagging_entity_extraction
conda deactivate

# if needed to remove an environment
conda env remove -n [NAME_OF_THE_CONDA_ENVIRONMENT]
conda env remove -n chainlit_python
conda env remove -n ai_chatgpt_prompts

# update conda
conda update -n base -c defaults conda

# to export requirements
pip freeze > requirements_ml_with_label_studio.txt
pip freeze > requirements_tagging_entity_extraction.txt

# to install
pip install -r requirements_ml_with_label_studio.txt
pip install -r requirements_tagging_entity_extraction.txt
Use cases
Some explanations can be found in the readme on https://github.com/bflaven/BlogArticlesExamples/tree/master/ml_label_studio
Videos to tackle this post
#1 Machine Learning for NLP: Labeling process to train a “custom” NER for Spacy with Label Studio
#2 Machine Learning for NLP: Export & convert data from Label Studio to Spacy training format to create a custom NER
#3 Machine Learning for NLP: Work with a dataset in Label Studio for a Pytorch Sentiment Analysis
#4 Machine Learning for NLP: Running & Connecting a ML Backend to Label Studio for a Pytorch Sentiment Analysis
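The conversion step named in video #2 can be sketched roughly as follows. This is not the code from the video or from the repository: it is a minimal illustration that assumes the Label Studio JSON export is a list of tasks, each with a `data` dictionary holding the text and an `annotations` list whose `result` items carry `start`, `end` and `labels` under `value`. The embedded `EXPORT` sample is made up, and the real export contains many more fields, so check the keys against your own file.

```python
import json

# Hypothetical one-task export; the real file would be read with json.load().
EXPORT = json.loads("""
[
  {
    "data": {"text": "Wolof is spoken in Senegal"},
    "annotations": [
      {
        "result": [
          {"type": "labels",
           "value": {"start": 0, "end": 5, "text": "Wolof",
                     "labels": ["LANGUAGE"]}},
          {"type": "labels",
           "value": {"start": 19, "end": 26, "text": "Senegal",
                     "labels": ["COUNTRY"]}}
        ]
      }
    ]
  }
]
""")


def to_spacy_examples(tasks, text_key="text"):
    """Convert tasks to (text, {"entities": [(start, end, label), ...]}) tuples."""
    examples = []
    for task in tasks:
        text = task["data"][text_key]
        entities = []
        # Only the first annotation of each task is used in this sketch.
        for annotation in task.get("annotations", [])[:1]:
            for item in annotation.get("result", []):
                if item.get("type") != "labels":
                    continue  # skip relations, choices and other result types
                value = item["value"]
                for label in value["labels"]:
                    entities.append((value["start"], value["end"], label))
        examples.append((text, {"entities": sorted(entities)}))
    return examples


examples = to_spacy_examples(EXPORT)
for text, annotations in examples:
    print(text, annotations)
```

The tuple layout is the classic spaCy training example shape. With spaCy v3, each tuple would then typically be turned into a `Doc` (for instance with `doc.char_span`) and stored in a `DocBin` saved as a `.spacy` file before training.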
More info
- Text Annotations using Label Studio and DAGSHub – (NLP – NER)
https://www.youtube.com/watch?v=CxfGJGK4mxQ
- TinySquirrelDataset
https://dagshub.com/DagsHub/TinySquirrelDataset
- RelaxML
https://dagshub.com/yonomitt/RelaxML
- jcharistech
https://dagshub.com/jcharistech
- spacy-ner
https://github.com/kriesbeck/spacy-ner/blob/master/NER%20in%20spaCy.ipynb
- Quickstart from spacy
https://spacy.io/usage/training#quickstart
- Using spaCy as a ML backend for Label Studio
https://www.youtube.com/watch?v=F19NT-21uT4
- Zero to One: Getting Started with Label Studio
https://labelstud.io/blog/zero-to-one-getting-started-with-label-studio/
- datasets from Label Studio
https://s3.amazonaws.com/labelstud.io/datasets/IMDB_train_unlabeled_100.csv
- Introduction to Machine Learning with Label Studio
https://labelstud.io/blog/introduction-to-machine-learning-with-label-studio/
- label-studio-ml-tutorial
https://github.com/heartexlabs/label-studio-ml-tutorial.git
- Max Tkachenko Youtube Channel
https://www.youtube.com/@makseqtka4
- What is the Label Studio ML backend?
https://github.com/heartexlabs/label-studio-ml-backend
- Integrate Label Studio into your machine learning pipeline
https://labelstud.io/guide/ml#Start-your-custom-ML-backend-with-Label-Studio
- Create the simplest Machine Learning backend
https://labelstud.io/tutorials/dummy_model.html
- PRODIGY v1.12: OpenAI integration, Prompt Engineering, Task Routers, Deployment Docs and more!
https://www.youtube.com/watch?v=-JiwLH9RG1E
- Graph Rewriting for NLP
https://grew.fr/
- Dependency parsing
https://grew.fr/grs/parsing/
- NER @ CLI: Custom-named entity recognition with spaCy in four lines
https://www.codecentric.de/wissens-hub/blog/ner-cli-custom-named-entity-recognition-with-spacy-in-four-lines
- Spark NLP -Training
https://sparknlp.org/docs/en/training
- CoNLL-U Viewer
https://universaldependencies.org/conllu_viewer.html
- pyconll is a low level wrapper around the CoNLL-U format
https://pyconll.readthedocs.io/en/stable/starting.html
- CoNLL-U Parser
https://github.com/EmilStenstrom/conllu
- What is CoNLL data format?
https://stackoverflow.com/questions/27416164/what-is-conll-data-format
- CoNLL-U File Viewer
https://www.sdbtools.net/info/prj/conllu/
- spacy-conll 3.4.0
https://pypi.org/project/spacy-conll/
- Applied Language Technology
https://www.youtube.com/@AppliedLanguageTechnology
- Parsing CoNLL-U annotations using Python
https://www.youtube.com/watch?v=lvJRFMvWtFI
- Introducing the CoNLL-U annotation schema
https://applied-language-technology.mooc.fi/html/notebooks/part_iii/06_text_linguistics.html#introducing-the-conll-u-annotation-schema
- Hmong Medical Corpus Blog: Discussing NLP approaches for resource-poor languages
https://hmcorpus.home.blog/
- nathanmwhite
https://github.com/nathanmwhite?tab=repositories
- Languages list
https://universaldependencies.org/#language-
- Many tree-banks for African languages: Bambara, Yoruba, Wolof, Amharic, Swahili….
https://github.com/UniversalDependencies/
- CoNLL-U Parser
https://pypi.org/project/conllu/
- spacy – Projects
https://spacy.io/usage/projects
- 🪐 spaCy Project: Part-of-speech Tagging & Dependency Parsing (Universal Dependencies)
https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud
- spaCy lookups data
https://github.com/explosion/spacy-lookups-data
- 🪐 Project Templates
https://github.com/explosion/projects/tree/v3
- spacy – blog
https://explosion.ai/blog
- spacy – 🪐 Project Templates
https://github.com/explosion/projects
- SpaCy for Digital Humanities with Python Tutorials. This playlist is a tutorial series on how to use spaCy in Python for the purposes of performing natural language processing (NLP) on texts.
https://www.youtube.com/playlist?list=PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo
- William Mattingly on github
https://github.com/wjbmattingly
- spacy-models
https://github.com/topics/spacy-models
- HuSpaCy is a spaCy library providing industrial-strength Hungarian language processing facilities through spaCy models
https://github.com/huspacy/huspacy
- Chinese models for SpaCy
https://github.com/howl-anderson/Chinese_models_for_SpaCy
- Experimental Finnish language model for spaCy
https://github.com/aajanki/spacy-fi
- Experimental Turkish language model for SpaCy
https://github.com/mehmetilker/spacy-tr
- Модель русского языка для библиотеки spaCy
https://github.com/buriy/spacy-ru
- [CA] Model pel processament del llenguatge natural en Català per a spaCy
https://github.com/ccoreilly/spacy-catala
- Serbian Language Pipeline for Spacy
https://github.com/BCDH/spacy-serbian-pipeline
- Ancient Greek models for spaCy
https://github.com/jmyerston/greCy
- Thai spaCy model
https://github.com/PyThaiNLP/thai_spacy_model
- Welsh-language Part-of-Speech Tagging Model
https://github.com/techiaith/model-tagiwr-spacy-cy
- 🪐 spaCy Project: Nepali Spacy Model
https://github.com/jangedoo/spacy_ne
- ukr-spacy
https://github.com/kurnosovv/ukr-spacy
- SpaCy_sv_master
https://github.com/alonsopg/spaCy_sv_master
- A simple docker image with python, spacy and language models preinstalled
https://github.com/lteacy/spacy_docker
- spacy create new language model with data from corpus
https://stackoverflow.com/questions/50152856/spacy-create-new-language-model-with-data-from-corpus/50215001#50215001
- Training a language model in spaCy v3
https://tech.bakkenbaeck.com/post/training-a-language-model-in-spacy-v3
- spacy – Training config
https://spacy.io/api/data-formats#config
- thinc.ai
https://thinc.ai/docs/usage-config
- How to Create a Config.cfg File in spaCy 3x for Named Entity Recognition (NER)
https://www.youtube.com/watch?v=l67PXnhu0ig
- Using spaCy 3.0 to build a custom NER model
https://towardsdatascience.com/using-spacy-3-0-to-build-a-custom-ner-model-c9256bea098
- pythonhumanities.com
https://pythonhumanities.com/
- The ‘__init__.py’ File: What Is It? How to Use It? (Complete Guide)
https://www.codingem.com/what-is-init-py-file-in-python/
- Natural Language Processing with spaCy — Steps and Examples
https://pub.towardsai.net/natural-language-processing-with-spacy-steps-and-examples-155618e84103
- polyglot
https://polyglot.readthedocs.io/en/latest/
- spaCy models
https://www.educba.com/spacy-models/
- spacy – Language Support
https://github.com/explosion/spaCy/discussions/categories/language-support
- spacy – Adding models for new languages master thread #3056
https://github.com/explosion/spaCy/discussions/3056
- spacy – tests for language english
https://github.com/explosion/spaCy/tree/master/spacy/tests/lang/en
- spacy – Linguistic Features
https://spacy.io/usage/linguistic-features
- Build a Custom NER model using spaCy 3.0
https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/
- Working with languages not yet supported by Spacy
https://support.prodi.gy/t/working-with-languages-not-yet-supported-by-spacy/206/19
- Using Custom spaCy Components in Rasa
https://rasa.com/blog/custom-spacy-components/
- Python Tutorials for Digital Humanities
https://www.youtube.com/@python-programming
- Named Entity Recognition: Concept, Tools and Tutorial
https://monkeylearn.com/blog/named-entity-recognition/
- Tag NLP on medium.com
https://medium.com/tag/nlp
- How to Create a Custom NER in Spacy 3.5
https://towardsdatascience.com/how-to-create-a-custom-ner-in-spacy-3-5-c9942aab3c91
- The New Economy of Data Labeling
https://medium.com/coderbyte/data-labeling-63b5bf796f24
- Learn how to prepare your dataset for fine-tuning
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/prepare-dataset
- NLP Project Part 2: How to Clean and Prepare Data for Analysis
https://www.dataquest.io/blog/how-to-clean-and-prepare-your-data-for-analysis/
- How To Prepare Your Data For Machine Learning in Python with Scikit-Learn
https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/
- Train Neural Network by loading your images |TensorFlow, CNN, Keras tutorial
https://www.youtube.com/watch?v=uqomO_BZ44g
- Machine Learning Mastery With Python
https://machinelearningmastery.com/machine-learning-with-python/
- Datasets for Natural Language Processing
https://machinelearningmastery.com/datasets-natural-language-processing/
- tm2tb is a term extraction module with a focus on bilingual data.
https://github.com/luismond/tm2tb
- Custom Named Entity Recognition Using spaCy
https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718
- Spacy
https://spacy.io/
- NLP_App
https://github.com/yami-s/NLP_App
- Named Entity Recognition | Relation Extraction | Label Studio | Part1
https://www.youtube.com/watch?v=qNcoLpxW8QE
- Labeling images for semantic segmentation using Label Studio
https://www.youtube.com/watch?v=UUP_omOSKuc
- Custom NER with spaCy v3 Tutorial | Free NER Data Annotation | Named Entity Recognition Tutorial
https://www.youtube.com/watch?v=p_7hJvl7P2A
- Label Studio API
https://labelstud.io/api
- Prodigy
https://prodi.gy
- Label Studio Docs
https://github.com/HumanSignal/label-studio
- Youtube Channel from Label Studio
https://www.youtube.com/@label-studio
- Training Custom NER models in SpaCy to auto-detect named entities [Complete Guide]
https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/
- Training Spacy NER models with doccano
https://medium.com/@justindavies/training-spacy-ner-models-with-doccano-8d8203e29bfa
- Labelbox
https://labelbox.com/
- Using spaCy as a ML backend for Label Studio
https://www.youtube.com/watch?v=F19NT-21uT4
- spaCy powered Label Studio ML backend
https://github.com/tim-smart/label-studio-spacy
- Spacy: Training Pipelines & Models
https://spacy.io/usage/training