December 15, 2021

OCR Correction with ByT5

We have developed a Dutch OCR correction model based on the ByT5 architecture that identifies and corrects OCR mistakes. Optical Character Recognition (OCR) is widely used to convert scanned documents into digital text, but its output often contains errors that have to be corrected by hand. To automate this manual post-correction phase, we trained ByT5 on a large Dutch dataset in which we simulated OCR mistakes with the nlpaug library. ByT5 is a token-free model that operates directly on the raw bytes of the text, which makes it more resistant to noisy data than token-based models. Our implementation covers dataset loading, model training, and inference. The results highlight the advantages of ByT5 over token-based models for small to medium-sized sentences with high noise levels, which makes it a practical way to automate the post-processing phase and improve the accuracy of OCR output.
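
To make the two central ideas more concrete, here is a minimal sketch in plain Python. It is not the code from the blog post: the confusion table and the helper names (`simulate_ocr_noise`, `make_training_pair`, `to_byte_ids`, `from_byte_ids`) are hypothetical stand-ins. The first part imitates what an OCR augmenter such as the one in nlpaug does, by swapping characters for visually similar ones to create (noisy, clean) training pairs. The second part shows how a byte-level model sees text, assuming the ByT5 convention in which each UTF-8 byte is shifted by 3 to leave room for the pad, end-of-sequence, and unknown ids.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative only; this is not the mapping used by nlpaug.
OCR_CONFUSIONS = {
    "o": ["0"],
    "0": ["o"],
    "l": ["1", "i"],
    "i": ["1", "l"],
    "1": ["l", "i"],
    "s": ["5"],
    "5": ["s"],
    "e": ["c"],
    "b": ["6"],
    "g": ["9"],
}


def simulate_ocr_noise(text, error_rate=0.2, seed=None):
    """Replace confusable characters with a look-alike at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = OCR_CONFUSIONS.get(ch.lower())
        if candidates and rng.random() < error_rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def make_training_pair(clean, error_rate=0.2, seed=None):
    """Return (noisy input, clean target) for a sequence-to-sequence model."""
    return simulate_ocr_noise(clean, error_rate, seed), clean


# Assumed ByT5 convention: ids 0, 1, 2 are pad, end-of-sequence, unknown,
# so byte value b is represented by id b + 3.
BYTE_OFFSET = 3
EOS_ID = 1


def to_byte_ids(text):
    """Encode text as shifted UTF-8 byte ids followed by the EOS id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def from_byte_ids(ids):
    """Decode shifted byte ids back to text, skipping non-byte ids."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    noisy, clean = make_training_pair("Dit is een oude brief.", 0.3, seed=42)
    print("input :", noisy)
    print("target:", clean)
    print("ids   :", to_byte_ids(noisy)[:10], "...")
```

Because every substitution here changes a single byte, a noisy sentence differs from its clean version in only a few positions of the byte sequence. A subword tokenizer, by contrast, may split a corrupted word into entirely different pieces, which is one intuition for why byte-level models cope better with this kind of noise.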

The full blog post is available on our Medium channel.
