Named Entity Recognition with weakly-labeled data

Description

Named entity recognition (NER) is an indispensable approach for analyzing chunks of text: it locates predefined entities, such as locations, companies, or person names, and classifies them.

We can use NER for many real-world problems, e.g. to

  • extract company names from news articles
  • identify persons mentioned in social media posts
  • extract product attributes from a search query on an e-commerce website

Over the last few years, several methods and models have been published that allow us to use NER out of the box or to fine-tune a model on a domain-specific dataset. However, even the best machine learning architecture needs high-quality training data to make accurate predictions.

Unfortunately, creating a high-quality training dataset is a manual process in which we have to assign an entity label to each word of the dataset. Labeling all the data manually is time-consuming and expensive.

In some cases, we only have weakly-labeled data. For example, we know that certain entities occur in a given text, but we do not know their exact positions.

If we want to use the weakly-labeled data for the model training, we have three options:

  • We could use the complete weakly-labeled dataset to train the NER model
  • We could manually label a few samples from the original dataset in a short period of time, which guarantees good data quality
  • We could use the weakly-labeled dataset to train a generative model

The first and second approaches require a model that predicts, for each word, whether it is part of a named entity and, if so, the associated class (e.g. organization or country). To train such a model, we could either use the weakly-labeled data or label a few samples manually. We expect that a model trained on a few well-labeled samples would outperform a model trained on the whole weakly-labeled dataset.
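
To train a word-level model on weakly-labeled data, the known entities first have to be projected onto the individual words. The following is a minimal sketch of one possible way to do this, assuming exact string matching on whitespace tokens and a BIO tagging scheme. The function name, the tagging scheme, and the class names CITY and COUNTRY are illustrative assumptions and not part of the project definition.

```python
from typing import Dict, List, Tuple

PUNCTUATION = ".,;:!?()[]\"'"


def project_weak_labels(text: str, entities: Dict[str, str]) -> List[Tuple[str, str]]:
    """Assign BIO tags to whitespace tokens by exact matching of known entities.

    `entities` maps an entity surface form to its class,
    e.g. {"Germany": "COUNTRY"} (hypothetical class names).
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)
    # Strip surrounding punctuation so that "Germany." still matches "Germany".
    normalized = [token.strip(PUNCTUATION) for token in tokens]

    for surface, label in entities.items():
        parts = surface.split()
        if not parts:
            continue  # ignore empty entity strings
        n = len(parts)
        for start in range(len(tokens) - n + 1):
            span_is_free = all(tag == "O" for tag in tags[start:start + n])
            if normalized[start:start + n] == parts and span_is_free:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
    return list(zip(tokens, tags))


# Shortened version of the example sentence used in this description.
tagged = project_weak_labels(
    "Berlin lies in northeastern Germany.",
    {"Berlin": "CITY", "Germany": "COUNTRY"},
)
for token, tag in tagged:
    print(token, tag)
```

Exact matching misses inflected or abbreviated mentions and can tag coincidental matches, which is one reason why labels obtained this way remain noisy.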

To avoid additional labeling, the idea behind the third approach is to frame the task as a sequence-to-sequence problem. We want to train a sequence-to-sequence model that learns to generate information based on a given context. In particular, we want to build a question-answering model that takes the text containing the named entities as input. The model will be trained to answer questions about specific entities. If the entity is part of the context, the model should return the entity value.

For example, if we pass the sentence “Berlin lies in northeastern Germany, [...]” into our model together with a question about countries, the model should answer “Germany”.
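
The question-answering framing can be sketched as a conversion of weakly-labeled samples into input/target text pairs. This is only an illustration under stated assumptions: the question templates, the "question: ... context: ..." input format, the class names, and the "none" target for classes that do not occur in the text are all hypothetical choices, not something this project prescribes.

```python
from typing import Dict, List

# Hypothetical question templates, one per entity class.
QUESTION_TEMPLATES: Dict[str, str] = {
    "COUNTRY": "Which countries are mentioned?",
    "CITY": "Which cities are mentioned?",
}


def build_qa_examples(
    text: str,
    entities_by_class: Dict[str, List[str]],
    no_answer: str = "none",
) -> List[Dict[str, str]]:
    """Create one (input, target) pair per entity class for a seq2seq model.

    Only the entity values are needed, not their positions in the text.
    """
    examples = []
    for label, question in QUESTION_TEMPLATES.items():
        answers = entities_by_class.get(label, [])
        # Assumption: classes without any entity get a fixed placeholder target.
        target = ", ".join(answers) if answers else no_answer
        examples.append(
            {"input": f"question: {question} context: {text}", "target": target}
        )
    return examples


# Shortened version of the example sentence; only the country is known here.
qa_examples = build_qa_examples(
    "Berlin lies in northeastern Germany.",
    {"COUNTRY": ["Germany"]},
)
for example in qa_examples:
    print(example["input"], "->", example["target"])
```

Pairs of this shape could then be used to fine-tune a generative model; how unanswerable questions and multiple entities per class should be encoded is a design decision left open here.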

The advantage of this approach is that we do not need specific information about the exact position of the entity in the given text.
However, we do not know which approach to prefer when we have weakly-labeled data. We want to find a good trade-off between the effort of manual labeling and model accuracy.

Goal

The goal of this project is to find the best-working approach for training a NER model with a weakly-labeled dataset. You will propose guidelines for when to use which method.
To this end, you will look into possible training datasets and train at least three different machine learning models for the task of NER:

  • a NER model on the small strongly-labeled dataset
  • a NER model on the large weakly-labeled dataset
  • a generative model on the large weakly-labeled dataset, framed as a question-answering problem

Afterwards, your mission is to compare the models and determine which approach is the most efficient. If you are the right agent for that mission, feel free to contact us.
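
For the model comparison, the generative model returns entity values without positions, so one possible common ground is to score every model on the set of (class, value) pairs it finds per text. The sketch below assumes this set-based precision, recall, and F1 as the metric; the metric choice, the function name, and the toy data are assumptions for illustration, not results or requirements of the project.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity class, entity value)


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Compute set-based precision, recall, and F1 for one text."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, made up purely to show how the function is called.
gold_entities = {("CITY", "Berlin"), ("COUNTRY", "Germany")}
predicted_entities = {("COUNTRY", "Germany"), ("CITY", "Spree")}

scores = precision_recall_f1(gold_entities, predicted_entities)
print(scores)  # (0.5, 0.5, 0.5)
```

Token-level models would first need their tagged spans converted into such pairs, and a stricter, position-aware metric may be preferable when only the two token-level models are compared.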

Profile / Required skills

  • Strong interest in NLP 
  • Familiarity with Python; initial experience with spaCy and Hugging Face Transformers is a big plus
  • Excellent verbal and written communication in English
  • You are currently pursuing a degree in computer science or a related field

Internship Duration

The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 8 weeks.

  • Weeks 1-2: Literature review
  • Weeks 3-4: Dataset creation
  • Weeks 5-6: Hands-on with Hugging Face and model training
  • Weeks 7-8: Model evaluation, internship debrief

Supervisors

Thomas Dehaene, Chapter Lead

Matthias Richter, Machine Learning Engineer (daily supervisor)