codensuch

Improving OCR and continued public alpha test

Added 2022-05-31 02:19:29 +0000 UTC

I've been working on improving OCR accuracy over the last couple of weeks. The new OCR AI is now up on the public alpha site for you to check out. It is the culmination of a few things:

1. New OCR LSTM model. The previous model was a modified version of Tesseract's best Japanese model. The new one is trained on ~30,000 lines of text from OPM manga. It took a while to train a model that is roughly on par with the default model. It handles things like kanji and punctuation much better. However, it's worse in other areas, such as hallucinating nonexistent characters due to bias in the OPM training data. Overall I'd say it's a toss-up, so I'm leaving it as my go-to model until I can train a better one.

2. Improving image processing for connected bubbles. One of the issues affecting OCR accuracy was connected speech bubbles. When connected bubbles were cropped out, they often clipped into each other, so text from the neighboring bubble got lumped into the OCR input. This also hurt translation accuracy. This has been fixed. Better yet, the overlapping text issue during typesetting was also fixed as a result of the image processing improvement.
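
To make the connected-bubble problem concrete, here is a rough sketch in plain Python. It is not the actual fix used on the site, which isn't described here. It assumes each bubble already has its own label in a segmentation mask (the `labels` grid, the `WHITE` background value, and both function names are made up for illustration). The idea is that a plain bounding-box crop drags in pixels from the neighboring bubble, while a masked crop blanks them out before OCR.

```python
from typing import List, Tuple

Grid = List[List[int]]
WHITE = 255  # assumed background value for a grayscale page


def bounding_box(labels: Grid, bubble_id: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, for one labeled bubble."""
    rows = [r for r, row in enumerate(labels) if bubble_id in row]
    cols = [c for row in labels for c, v in enumerate(row) if v == bubble_id]
    if not rows:
        raise ValueError(f"bubble {bubble_id} not found in label mask")
    return rows[0], min(cols), rows[-1], max(cols)


def crop_bubble(image: Grid, labels: Grid, bubble_id: int) -> Grid:
    """Crop one bubble and blank out every pixel that belongs to anything else."""
    top, left, bottom, right = bounding_box(labels, bubble_id)
    return [
        [
            image[r][c] if labels[r][c] == bubble_id else WHITE
            for c in range(left, right + 1)
        ]
        for r in range(top, bottom + 1)
    ]


# Two connected bubbles whose bounding boxes overlap (hypothetical toy data).
image = [
    [10, 10, 10, 255],
    [10, 10, 20, 20],
    [255, 20, 20, 20],
]
labels = [
    [1, 1, 1, 0],
    [1, 1, 2, 2],
    [0, 2, 2, 2],
]

crop_1 = crop_bubble(image, labels, 1)  # bubble 2's pixel in the corner is blanked
crop_2 = crop_bubble(image, labels, 2)  # bubble 1's pixel in the corner is blanked
```

In a real pipeline the label mask would come from whatever bubble detection step runs before OCR, and each masked crop would be what gets passed to the OCR model.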