Machine Learning has never been more disruptive, with a growing impact on real-world applications. This is especially true in one domain: long document processing.
Businesses in industries such as media, finance, law and education tend to accumulate a large quantity of medium-to-long documents. More often than not, neither the documents nor the databases containing them are well structured. Not being able to organize them and extract valuable information from them is a big opportunity cost.
And yet, from a technical standpoint, processing such long documents is very challenging. Indeed, NLP research is largely about benchmarking models on short text tasks. The state-of-the-art models coming out of it usually allow a maximum number of input tokens - i.e. roughly a maximum number of words in your input text.
So our team of experts at ML6 has to find creative solutions to tackle that problem. Here are a few examples of how they do just that:
In a nutshell: some models handle long sequences better, though they are still limited on input size and aren't necessarily suited to every task. We benchmarked some of those models to see which would be the most performant. You can find our results here.
More technically speaking: as said before, typical NLP models are quite limited in their input size. Traditional RNN/LSTM models, and now Transformer models, are trained with a fixed one. To be more specific, models like BERT with a full attention mechanism depend quadratically on input length (training one with a long maximum input size makes your memory explode). Typically, a Transformer model will have a maximum input size of 512 tokens. So for a traditional model, this blogpost (about ~800 words) could already be too long to process.
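To get a feel for that quadratic blow-up, here is a back-of-the-envelope sketch (our own illustration, not code from any of the models mentioned):

```python
# Full self-attention stores one score per token pair, so memory
# grows with the square of the input length.
def attention_scores(seq_len: int) -> int:
    """Number of pairwise scores in a full-attention layer."""
    return seq_len * seq_len

print(attention_scores(512))                            # 262144
print(attention_scores(4096) // attention_scores(512))  # 8x the tokens -> 64x the scores
```

Multiply that by the number of heads and layers, and by bytes per score, and it is easy to see why 512 tokens became the de facto ceiling.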
Now the smart people at the Allen Institute for AI and at Google Research saw that problem and implemented the Longformer model (see paper) and the Big Bird model (see paper), respectively. By combining random, window and global attention mechanisms (essentially sparse attention layers), these models are only linearly impacted by input length. They can outperform models like RoBERTa on longer-input tasks and support input lengths of up to 4096 tokens (BigBird).
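To see why sparse attention scales linearly, here is a toy mask combining the three mechanisms. This is our own heavily simplified sketch of the idea behind the papers, and the parameter names are ours:

```python
import random

def sparse_attention_mask(seq_len, window=2, n_global=1, n_random=1, seed=0):
    """Boolean mask: mask[i][j] is True where token i may attend to token j.
    Combines sliding-window, global and random attention (BigBird-style)."""
    rng = random.Random(seed)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # sliding window: neighbours within `window` positions
        for j in range(max(0, i - window), min(seq_len, i + window + 1)):
            mask[i][j] = True
        # global tokens (e.g. [CLS]) attend everywhere and are attended by all
        for g in range(n_global):
            mask[i][g] = mask[g][i] = True
        # a few random long-range connections
        for j in rng.sample(range(seq_len), n_random):
            mask[i][j] = True
    return mask

def connections_per_token(mask):
    return sum(row.count(True) for row in mask) / len(mask)

# The per-token count stays roughly constant as the sequence grows,
# so the total cost grows linearly instead of quadratically.
print(connections_per_token(sparse_attention_mask(128)))
print(connections_per_token(sparse_attention_mask(256)))
```

Doubling the sequence length roughly doubles the total number of attention scores here, whereas full attention would quadruple it.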
If you want more info about them, head to our quick-tip (code included).
Note that the problem here is to tackle long documents, not just long text. One approach we took for a large-scale project is to break documents down into clauses, which a language model can then tackle individually.
In this case, we relied on an object detection model. Essentially, we worked on scans of documents, and the model could visually locate the different clauses. This kind of method typically works very well to extract structured information, like tables or IDs. You can learn more about our solution in this video (which also includes more info about the context and what is done with the text of each clause).
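Once clauses (or any chunks) are isolated, each one fits within a standard model's input limit. Here is a minimal sketch of that per-chunk idea with hypothetical names; the actual project used an object detection model on scans, which we don't reproduce here:

```python
# Illustrative only: split a long text into chunks a 512-token model
# can accept, then run a (placeholder) model on each chunk separately.
def split_into_chunks(text: str, max_tokens: int = 512) -> list:
    """Greedily pack whitespace tokens into chunks under the model's limit."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def classify_clause(clause: str) -> str:
    """Placeholder for a per-clause model call (e.g. a BERT classifier)."""
    return "contact" if "@" in clause else "other"

doc = "Reach us at info@example.com for details. " * 300  # far beyond 512 tokens
labels = [classify_clause(chunk) for chunk in split_into_chunks(doc)]
```

The point is that no single model call ever sees more than 512 tokens, regardless of document length.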
To each problem its solution: this type of project usually requires a fairly custom approach. It can, for instance, use manual rules to extract specific sections of the text (e.g. find contact information in a large database of job offers).
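For the job-offer example, such a rule-based pass can be as simple as a couple of regular expressions (illustrative patterns of ours, not a production-grade extractor):

```python
import re

# Toy patterns: email addresses and phone-like digit sequences.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d \-]{7,}\d")

def extract_contacts(text: str) -> dict:
    """Pull contact information out of free text with manual rules."""
    return {"emails": EMAIL.findall(text),
            "phones": PHONE.findall(text)}

offer = "Apply via jobs@example.com or call +32 2 123 45 67."
print(extract_contacts(offer))
```

Rules like these are brittle compared to a learned model, but they are cheap, fast, and need no training data, which is often exactly what a custom project calls for.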
A less conventional approach is to use extractive text summarization: an extra model extracts the most relevant pieces of your text.
That means stacking a summarization model on top of the other modeling techniques you'll apply to its output, which multiplies the chances of issues. Not necessarily recommended.
Yet in specific use cases it can be a convenient tool (e.g. summarizing long meetings to then extract the relevant pieces of information). You can get a glimpse of what summarization models can do in this blogpost, and even try it yourself with this demo.
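For the curious, the core of extractive summarization fits in a few lines: score each sentence, keep the top ones verbatim. This is our own bare-bones frequency-based sketch, far simpler than the models in the blogpost:

```python
import re
from collections import Counter

def extract_summary(text: str, n_sentences: int = 2) -> str:
    """Keep the n highest-scoring sentences, scored by average word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freqs = Counter(w.lower() for w in re.findall(r"\w+", text))
    def score(sentence):
        words = re.findall(r"\w+", sentence)
        return sum(freqs[w.lower()] for w in words) / max(len(words), 1)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # keep the original sentence order so the summary still reads naturally
    return " ".join(s for s in sentences if s in top)
```

Because the output is made of original sentences, it can be fed straight into the downstream extraction models, token budget permitting.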
It is clear that being able to process and extract key information from a database of large documents is extremely valuable for many institutions. Data is the new oil; we now need ways of drilling for it.
On the other hand, processing those long documents is very challenging from a technical standpoint. We faced that challenge at ML6 and laid out a few ways of tackling it.