From Print To Screen
Part 6 - OCR Part 1a - Contents & Metadata
by Ben | 24th April 2020
OCR (Optical Character Recognition) is, for those who don't know, the process of converting an image of text into actual text. Right from the beginning of this project I didn't just want to scan the magazines and put them up as PDF images - I wanted to be able to search the content, let search engines index it, and do all the other good stuff that having the text allows, including updating errata and fixing article errors. Reading PDFs on a computer or iPad is okay to a point, but on a phone, for example, it's a pretty lousy experience.
Back in 2006, at the dawn of mu:zines, the very first thing I did to decide whether getting a magazine issue into a web site was viable was to investigate the available OCR tools and run some tests to see how much time and effort would be involved. Back then I was using a Mac Powerbook G4, which was rather slow (I also had a regular PC available, but as I'm Mac-based, a Mac solution was preferable for me).
I explored the range of OCR solutions out there at the time, pretty much everything aside from large-scale commercially licensed systems, and I certainly can't recall all the systems I looked at. But the one that gave the best results at the time, and was widely regarded as "one of the better systems", was TextBridge Pro.
Now, like many OCR systems, this one had come from the PC/Windows world, and the Mac version was a bit clunky in terms of the interface, but of all the systems I tried, it gave the best output, and that was what I chose to initially work with.
Recognition was slow and tedious, and the output had many errors. As each article was read, I had to go through a proofing/fixing process which involved stepping through the words that TextBridge wasn't sure about, and either fixing them or ok-ing them, in a rather small window, before eventually saving the text file which I could then work on in my text editor of choice. From there, I would again have to do a lot of work to bring the text up to the required standard, often retyping whole paragraphs. It was, shall we say, "sub-optimal" - but it was the best I could do at the time.
Looking at my files, it looks like I did about 17 issues of MT using this method, but I think the tedium of doing it this way contributed to not making much progress back then.
Thankfully, today, things are *much* better in terms of the available tools and technology. When I came back to this project, I did the same thing as before, which was take a survey of available tools, run some tests, and see what would work best for the task at hand. Again, I tried pretty much everything out there available on the Mac platform, and also some PC solutions for comparison.
To keep the story short, ABBYY FineReader was the winner, by some margin. I'd nearly go so far as to say that without FineReader (referred to as "FR" from now on), this project would probably not be practical.
Gone was the tedious error-correcting phase. Recognition was better than anything else I tried (I ran the same test scans through different systems so I could compare the results on the same content). On good scans of simple articles (say, a modern Sound On Sound article printed on white paper) it's not unusual for the resultant text to have *no errors at all*. The interface was ok - not amazing, but a *lot* better than the old TextBridge, that's for sure!
So, OCR software chosen - let's get back to the task in hand. I will go into some specifics on how I use FineReader, and its strengths and weaknesses, as we go.
If you're following along you'll hopefully recall that in the previous blog entry, we had output the desired scanned images as full-size, high-quality JPEGs. In the first stage of the OCR process, we need to create an OCR document for the issue containing those pages, and we want to OCR out the contents page(s) of the magazine, so we can create the necessary article entries in the CMS (e.g. "On page 27-29, it's a Korg M1 review", and so on).
Ok, in FR, we create a new document, navigate to our scans folder, and import the scan images. FR has options to automatically pre-process and recognise documents, which I *don't* use for this task (although I do use them in other situations). Importing the large images is fairly slow (over a minute), and adding additional processing at this stage makes it even slower, and has some gotchas that I prefer to avoid (see panel below).
Once the images are imported, I'll go to the contents pages - typically a single page or a double-page spread (and sometimes there may be another mid-magazine contents page where they break the contents of that section down individually).
I'll manually draw a recognition area over the contents text, and export as raw text to *muzines*/processing/mt/mt_94_02_feb/00 contents.txt, and save the FR document as *muzines*/processing/mt/mt_94_02_feb/mt_94_02_feb.frdoc. FR documents are self-contained and have the images inside them, so the temporary scan files have served their purpose and can be deleted.
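As a rough illustration of that folder layout, here is a minimal Python sketch that builds the two paths for an issue. The function names, the `root` argument and the month-abbreviation table are my own assumptions for the example, inferred from the single "mt_94_02_feb" path above - this isn't a script used in the actual workflow.

```python
from pathlib import PurePosixPath

# Lowercase month abbreviations, as seen in "mt_94_02_feb" (assumed pattern).
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def issue_folder_name(magazine: str, year: int, month: int) -> str:
    """Build an issue name such as 'mt_94_02_feb' (hypothetical helper)."""
    return f"{magazine}_{year % 100:02d}_{month:02d}_{MONTHS[month - 1]}"


def issue_paths(root: str, magazine: str, year: int, month: int) -> dict:
    """Return the issue folder, contents text file and FR document paths."""
    name = issue_folder_name(magazine, year, month)
    folder = PurePosixPath(root) / "processing" / magazine / name
    return {
        "folder": folder,
        "contents": folder / "00 contents.txt",
        "frdoc": folder / f"{name}.frdoc",
    }


# "muzines" stands in for wherever the project folder actually lives.
paths = issue_paths("muzines", "mt", 1994, 2)
print(paths["contents"])
print(paths["frdoc"])
```

Keeping the issue name in both the folder and the FR document filename means a file is still identifiable if it gets moved out of its folder.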
The "00" in the filename is a naming convention I use to keep exported articles in magazine order - articles will end up named like this:
00 contents.txt
01 editorial.txt
02 shapeofthings.txt
03 ...
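To show why the two-digit prefix matters, here's a small Python sketch of the naming scheme listed above. The `article_filename` helper and the way it squashes a title into a slug are assumptions made for illustration (I name these files by hand in practice); the point is simply that zero-padded numbers sort correctly as plain text.

```python
import re


def article_filename(position: int, title: str) -> str:
    """Build a name like '02 shapeofthings.txt' (hypothetical helper).

    The position is zero-padded to two digits so that a plain
    alphabetical sort keeps the articles in magazine order.
    """
    slug = re.sub(r"[^a-z0-9]", "", title.lower())
    return f"{position:02d} {slug}.txt"


titles = ["Contents", "Editorial", "Shape Of Things"]
names = [article_filename(i, t) for i, t in enumerate(titles)]
print(names)

# Without padding, "10 ..." would sort before "2 ...";
# with padding, alphabetical order matches magazine order.
later = article_filename(10, "Korg M1")
print(sorted(names + [later]))
```

Two digits covers up to 100 articles per issue, which is comfortably more than a typical magazine contains.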
At this stage, I officially designate the issue as "In Processing", in that there is an FR doc of the issue (editorial pages only), and a contents text file, and the website has the page scans - so I go back to the CMS and mark the issue as such - synchronising the status change to be visible on the live site.
It takes probably ten minutes or so to create the FR document, read and output the contents file, and update the CMS. Most of that is the time it takes FR to import and save the document - these documents are typically around 1GB in size. It's pretty slow, but given that site donations aren't helping me get an iMac Pro any time soon, I have to settle for slow...
Ok - so for all the initial talk about OCR, we didn't actually *do* much OCR in this part. That's OK - we're not quite ready to do the bulk of the OCR work yet. To recap - we imported the scans into a FineReader document, saved it, OCR'd the contents page out to a text file, and marked the issue as "In Processing" on the website.
The next step is to edit and format the contents file we just saved, so we can create the necessary articles and meta-data in the CMS. So this is what we'll cover in the next part...