The adoption of NLP models in real-world use cases has also meant the rise of niche, domain-specific language models. Do you need a language model for Polish legal text? Sure thing. Or how about a language model for Swedish medical texts? Look no further. In many real-world situations, however, this raises the question of when to invest in a custom language model and when an “out-of-the-box” language model will suffice. In some fields, such as medicine, the answer is easy: most of the important words (e.g., names of diseases, Latin anatomical terms) are out-of-vocabulary for general-purpose language models. In other domains, such as legal texts or technical manuals, it is much less obvious.
The goal of this project is to estimate, based on known heuristics, the effect that switching from an existing language model to a custom one is expected to have. The current best approach is to compare the overlap between the n-grams in your domain-specific texts and those in the texts that the existing solution was trained on, and to set some (arbitrary) cut-off (e.g., if the n-gram overlap is under 30%, we explore using a custom language model). However, this method is neither very quantitative nor very rigorous.
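The n-gram overlap heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and overlap is assumed to mean the fraction of distinct domain n-grams that also occur in the reference corpus. Only the 30% cut-off is taken from the text, and it remains arbitrary.

```python
from typing import Iterable, Set, Tuple


def ngram_set(texts: Iterable[str], n: int = 2) -> Set[Tuple[str, ...]]:
    """Collect the distinct word n-grams found in a collection of texts."""
    grams: Set[Tuple[str, ...]] = set()
    for text in texts:
        # Assumption: naive lowercased whitespace tokenization.
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields the n-grams;
        # texts shorter than n tokens contribute nothing.
        grams.update(zip(*(tokens[i:] for i in range(n))))
    return grams


def ngram_overlap(domain_texts: Iterable[str],
                  reference_texts: Iterable[str],
                  n: int = 2) -> float:
    """Fraction of distinct domain n-grams that also occur in the reference texts."""
    domain = ngram_set(domain_texts, n)
    if not domain:
        return 0.0  # no domain n-grams to compare
    reference = ngram_set(reference_texts, n)
    return len(domain & reference) / len(domain)


def needs_custom_model(domain_texts: Iterable[str],
                       reference_texts: Iterable[str],
                       n: int = 2,
                       threshold: float = 0.30) -> bool:
    """Apply the (arbitrary) cut-off: low overlap suggests exploring a custom model."""
    return ngram_overlap(domain_texts, reference_texts, n) < threshold


if __name__ == "__main__":
    domain = ["the court ruled on the appeal"]
    reference = ["the court ruled in favour"]
    print(ngram_overlap(domain, reference, n=2))
    print(needs_custom_model(domain, reference, n=2))
```

The result depends heavily on the choice of n, the tokenizer, and whether n-grams are counted by type or by frequency, which is part of why a fixed cut-off on this score is hard to defend.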
During this internship, you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 6 weeks.
Our internships and theses are linked to our chapters. A chapter is a cross-squad team of experts in a specific topic that enables knowledge building and sharing across projects. The chapters build knowledge by performing applied research and gathering learnings from projects. This internship falls under the NLP chapter.