The NLP Chapter is ML6's specialist division for all things related to Natural Language Processing.
We tackle all relevant areas of the field: information extraction, speech recognition and sequence-to-sequence modelling, under the roof of a single internal powerhouse.
EXPLORE
NLP research is happening at dazzling speed. By exploring trending and relevant AI papers and topics, we keep up to date with the latest and greatest in our field. No transformer left behind!
BUILD
We are machine learning engineers at heart. We build and deploy tools, demos and boilerplates to bootstrap ML6 project work and showcase the value of Natural Language Processing in a wide variety of areas.
SHARE
We love to show and share our work. Head over to our use cases to see what we’re up to or have fun with one of our NLP-powered demos.
Come check us out on Huggingface and Github as well!
Terms & Conditions Summarizer 📝: apply state-of-the-art extractive and abstractive summarization to website Terms of Service to get a quick and concise view of the main points.
RE:Belle: building beautiful knowledge graphs with REBEL.
DOCRtor: simulate common OCR errors in Dutch texts and fine-tune the character-based ByT5 model to correct them.
Multi-Article Summarization
GPT2 Quantization using ONNXRuntime
Text Augmentation using large-scale LMs and prompt engineering
Gender debiasing of documents using simple CDA
Building an AI-driven content platform with Funke Mediengruppe
Accelerating Keypoint's Digital Real Estate Management Platform with AI
Domain transfer with GGPL: German Generative Pseudo Labeling
BERT is eating your cash: quantization and ONNXRuntime to save money
NLP - ML - Audio - Speech-to-Text - Transcription
Customer data is often at the heart of a project's start. But whereas text and image data are typically easy to collect and label, audio data is often scarcer and trickier to work with.
This usually results in a few rounds of "back and forth" at the start of a project, with questions such as:
In this internship, we aim to answer those questions to as specific a degree as possible. For this, you will focus on creating a speech transcription engine that can identify and accurately transcribe your colleagues in various contexts. In this process, you’ll identify key relationships between data quantity, quality, type and model accuracy. You’ll then package all of this into a demo which can take various forms for your fellow agents and the world to use!
This, dear ML6 Intern agent, is your mission.
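As a first building block, a pretrained speech recognition model can already get you surprisingly far. Below is a minimal sketch using the HuggingFace transformers ASR pipeline; the model choice and audio file name are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: transcribe a recording with a pretrained ASR model.
# Model name and audio path are illustrative; the internship would
# benchmark several models and fine-tune on colleague recordings.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

# The pipeline handles audio loading and decoding; the result is a dict
# containing the transcribed text.
result = asr("colleague_recording.wav")
print(result["text"])
```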
The goals of this internship are as follows:
The adoption of NLP models in real-world use cases has also meant the rise of niche domain-specific language models. Do you need a language model for Polish legal text? Sure thing. Or how about a language model for Swedish medical texts? Look no further. However, in many real-world situations, this raises the question of when to explore using a custom language model and when an "out-of-the-box" language model will suffice. In some domains, such as medicine, the answer is trivial, as most of the important words (e.g., names of diseases, Latin anatomical terms) are out-of-vocabulary for general-purpose language models; in other domains, such as legal texts or technical manuals, it is much less obvious.
The goal of this project is to estimate the expected effect of using a custom language model over an existing one, based on known heuristics. The current best approach is to compare the overlap of n-grams from your domain-specific texts with those from the texts the existing model was trained on, and to impose some (arbitrary) cut-off (e.g., if the n-gram overlap is under 30%, we explore using a custom language model). However, this method is neither very quantitative nor very rigorous.
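To make the heuristic concrete, here is a toy sketch of the n-gram overlap computation described above; whitespace tokenisation and the 30% cut-off are simplifying assumptions.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(domain_texts, reference_texts, n=2):
    """Fraction of the domain corpus's n-grams also found in the reference corpus."""
    domain, reference = set(), set()
    for text in domain_texts:
        domain |= ngrams(text.lower().split(), n)
    for text in reference_texts:
        reference |= ngrams(text.lower().split(), n)
    return len(domain & reference) / max(len(domain), 1)

# Toy example with the (arbitrary) 30% cut-off from above.
overlap = ngram_overlap(["the lessee shall indemnify the lessor"],
                        ["the cat sat on the mat"])
print(f"bigram overlap: {overlap:.0%}",
      "-> explore a custom LM" if overlap < 0.30 else "-> out-of-the-box LM")
```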
During this internship, you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 6 weeks.
Our internships and theses are linked to our chapters. A chapter is a cross-squad team of experts in a specific topic to enable knowledge building and sharing across projects. The chapters build knowledge by performing applied research and gathering learnings from projects. This internship falls under the NLP chapter.
Companies have a massive amount of information available within reach, only a few clicks away. The challenge with these large volumes of information is that we, as human beings, are not able to digest it properly. Search engines like Google help prioritise information based on your query and interests. However, we have noticed, with the trend towards open data, that not all data is indexed by Google, leaving multiple sources "undiscoverable".
One type of information is local, regional and federal political information. Vast amounts of reports, detailed research documents, … are available for mining, but it is difficult to valorize them, since the information is "stuck" in PDFs or, as mentioned, is not indexed by Google. More and more political decision documents are becoming openly available. Whether at the local municipal level or at the district level, these data and metadata have found their way into Linked Open Data platforms and databases.
In this project, we wish to solve this challenge. We want to enable many companies to reach that information with ease, based on their interests in the political chatter available about their company, line of work or sector.
The traditional approach to this problem is to have people spend hours combing through government statements and meeting notes to sift out nuggets of information pertaining to a certain context. A more modern approach, however, is to use advanced NLP techniques to do this automatically and at scale.
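As a flavour of what such a technique could look like, the sketch below uses off-the-shelf named entity recognition to surface documents that mention organisations a user follows. The spaCy model, the watchlist and the example documents are illustrative assumptions.

```python
# Hypothetical sketch: flag political documents that mention companies
# of interest, using off-the-shelf NER. In practice a Dutch/multilingual
# model and real document sources would be needed.
import spacy

nlp = spacy.load("en_core_web_sm")
watchlist = {"ML6", "Funke Mediengruppe"}  # companies a user follows

documents = [
    "The council approved a pilot project with ML6 on document mining.",
    "The meeting discussed road maintenance in the northern district.",
]

for text in documents:
    doc = nlp(text)
    organisations = {ent.text for ent in doc.ents if ent.label_ == "ORG"}
    if organisations & watchlist:
        print("Relevant:", text)
```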
We want to address this gap by creating an end-to-end application that:
In order to get companies:
Your mission, dear ML6 Intern agent, should you choose to accept it, is exactly this!
From a high-level technical perspective, the solution could look as follows:
Of course, things aren’t set in stone, and the finalisation of the functional design, as well as the translation into the technical design, is something that can happen in collaboration with senior engineers at ML6.
On a machine learning level:
So if you are a person with a broad set of interests in Machine Learning, Data Engineering and Software Engineering: you are the agent for the job 😎!
In recent times, the general trend towards automation has meant that use cases involving the processing of large amounts of data are increasingly being automated. The reasons for this are quite obvious: these are often repetitive, time-consuming tasks that are prone to human error and lend themselves well to automation. However, one such task remains: reading. Unfortunately, we can't automate reading itself, but we can make it faster by highlighting the key information in the text.
The goal of this project is to develop an algorithm that highlights the most important information in a document. Defining what counts as important, however, requires some creativity (e.g., it could mean sentences that summarise the text, sentences that are unexpected, etc.). Concretely, we see a major use case in legal texts, where we can also exploit the repetitive nature of such documents (e.g., rental contracts are 90% the same because they need a certain legal structure), but we are open to suggestions if another field interests you more where it could also have a big impact.
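One possible (and deliberately simple) notion of importance is embedding centrality: highlight the sentence whose embedding is closest, on average, to all others. The sketch below assumes the sentence-transformers library and a toy contract; TextRank or a supervised scorer would be equally valid starting points.

```python
# Toy sketch: score sentences by how central their embedding is to the
# document, and highlight the most central one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The tenant shall pay the rent monthly.",
    "The deposit amounts to two months of rent.",
    "The landlord may terminate the contract with three months notice.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
# Mean cosine similarity of each sentence to all sentences.
centrality = util.cos_sim(embeddings, embeddings).mean(dim=1)

best = centrality.argmax().item()
print("Highlight:", sentences[best])
```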
During this internship, you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 6 weeks.
In recent times, anomaly detection has become a major topic within the field of AI. It has particularly gained traction within the domain of computer vision, with use cases in defect detection, predictive maintenance, etc., and within the domain of structured data, with use cases in fraud detection, spam filtering, etc. However, the development of similar techniques within the domain of NLP remains understudied despite its potential.
The goal of this project is to leverage different NLP techniques to arrive at an algorithm that can accurately highlight anomalous words and/or sentences in a document, i.e. content you wouldn't expect to appear there. Such a system would likely exploit the repetitive nature of certain types of documents (e.g., rental contracts are 90% the same because they need a certain legal structure). If successful, such an algorithm could have very impactful use cases in fields such as law and insurance. The concrete approach would be to develop the system on legal documents, but we are open to suggestions if another field interests you more where it could also have a big impact.
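One simple baseline for such a system: embed each sentence of a new document and flag sentences that have no close match in a corpus of reference documents. The library, model and toy data below are assumptions, not a prescribed design.

```python
# Toy sketch: anomaly score = 1 minus similarity to the closest sentence
# in a set of reference contracts.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference_sentences = [
    "The tenant shall pay the rent monthly.",
    "The deposit amounts to two months of rent.",
]
new_sentences = [
    "The tenant shall pay the rent monthly.",
    "The tenant agrees to wash the landlord's car every Sunday.",  # anomalous
]

ref_emb = model.encode(reference_sentences, convert_to_tensor=True)
new_emb = model.encode(new_sentences, convert_to_tensor=True)

# Higher score = less similar to any reference sentence.
scores = 1 - util.cos_sim(new_emb, ref_emb).max(dim=1).values
for sentence, score in zip(new_sentences, scores):
    print(f"{score.item():.2f}  {sentence}")
```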
During this internship, you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 8 weeks.
With increasing globalisation and the rising number of people moving abroad, learning a new language in an affordable and fun way is becoming more and more important. Parlangi is an e-learning provider that connects native speakers and learners of a specific language. Both are invited to a video-call platform where they talk, improving the learner's language skills.
As of now, Parlangi provides a different conversation topic every 10 minutes to engage the speakers. Although this approach is relevant to keep the conversation going, it has a few potential drawbacks, as it can either be:
The goal of this project is to enhance the conversation topic suggestion feature and make it more dynamic. This can help provide a more fun and satisfying experience for the users of the platform.
To do so, we need to determine both the frequency and duration of silences in the conversation, as well as the speech frequency of the individual speakers. This information can be used to quantify the speakers' overall level of engagement and to suggest a topic switch at an appropriate time.
One way of approaching this problem is to apply speaker diarization techniques to the raw audio recording. Speaker diarization aims to answer the question "who spoke when". With that, it is feasible to detect both the 'speech' moments of the individual speakers and the 'silence' segments.
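As a minimal sketch of what this looks like in code, pyannote.audio is one of the open-source libraries that ships a ready-made diarization pipeline (the model name and audio path are illustrative; recent pyannote releases also require a HuggingFace access token):

```python
# Minimal diarization sketch: list "who spoke when"; the gaps between
# speech turns are the silence segments of interest.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("conversation.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```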
This internship isn't only a great way to leverage your skills in audio and edge computing, but also a chance to do good. Your internship can be rounded off with a blog post in which you share your learnings and how you helped Parlangi and its users improve their experience of learning a new language.
You can get a head start on this project, as some work has already been done. Many libraries already implement diarization pipelines for audio recordings, and ML6 has done some initial exploration of them. However, there is still much work to be done to put this tool into practice.
During this internship you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The estimated duration for this specific project is 6 to 8 weeks.
Our internships and theses are linked to our chapters. A chapter is a cross-squad team of experts in a specific topic to enable knowledge building and sharing across projects. The chapters build knowledge by performing applied research and gathering learnings from projects. This internship falls under the Speech/Audio working group which is part of the Natural Language Processing (NLP) chapter.
Thomas Dehaene: Chapter Lead
Lisa Becker: Machine Learning Engineer and Speech Working Group Lead (daily supervisor)
Named entity recognition (NER) is an indispensable approach for analyzing chunks of text and locating and classifying predefined entities, such as locations, companies or person names.
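As a minimal illustration, the sketch below extracts entities with an off-the-shelf spaCy model; the library and model choice are assumptions for illustration, not a prescribed stack.

```python
# Out-of-the-box NER: locate and classify entities in a sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Berlin.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple ORG", "Berlin GPE"
```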
We can use NER in many real-world problems, e.g.
Over the last few years, several methods and models have been published that allow us to use NER out of the box or to fine-tune a model on a domain-specific dataset. However, even the best machine learning architecture needs high-quality training data to make accurate predictions.
Unfortunately, creating a high-quality training dataset is a manual process in which we have to assign an entity label to each word. Labeling all the data by hand takes a lot of time and is not cheap.
In some cases, we only have weakly-labeled data. For example, we know that certain entities occur in a given text, but we do not know their exact positions.
If we want to use the weakly-labeled data for the model training, we have three options:
The first and second approaches require a model that can predict, for each word, whether it is a named entity and, if so, the associated class (e.g. organization or country). For this, we could use the weakly-labeled data or label a few samples manually. We expect that a few well-labeled samples would outperform a model trained on the whole weakly-labeled dataset.
To avoid additional labeling, the idea behind the third approach is to frame the challenge as a sequence-to-sequence problem. We therefore want to train a sequence-to-sequence model that learns to generate information based on a given context. In particular, we want to build a question-answering model that takes the text containing the named entities as input. The model is trained to answer questions about specific entities: if the entity is part of the context, the model should return the entity's value.
Assume we pass the following sentence into our model: "Berlin lies in northeastern Germany, [...]". Given a question about countries, the model should answer "Germany".
The advantage of this approach is that we do not need specific information about the exact position of the entity in the given text.
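A minimal sketch of this question-answering idea, using a standard HuggingFace extractive QA pipeline (the model and question template are illustrative assumptions):

```python
from transformers import pipeline

# Extractive QA: the model returns a span of the context as the answer.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

answer = qa(
    question="Which country is mentioned?",
    context="Berlin lies in northeastern Germany, on the banks of the Spree.",
)
print(answer["answer"])  # expected: "Germany"
```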
However, we do not yet know which approach to prefer when we have weakly-labeled data. We want to find a good trade-off between manual labeling effort and model accuracy.
The goal of this project is to find the best-working approach for training a NER model with a weakly-labeled dataset. You will propose advisory guidelines on when to use which method.
To that end, you will look into possible training datasets and train at least three different machine learning models for the task of NER:
Afterwards, your mission is to compare the models and answer the question of which approach is the most efficient. If you are the right agent for that mission, feel free to contact us.
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 8 weeks.
Thomas Dehaene: Chapter Lead
Matthias Richter: Machine Learning Engineer (daily supervisor)