Book OCR issues [Archive] - The Explosives and Weapons Forum

pdv37

July 25th, 2006, 06:26 PM

Greetings. I've been experimenting with OCR of some of the books that have been posted on this forum, and ran into some issues.

Firstly, most books were scanned two pages at a time, so is there an easy way to divide them into separate pages using some type of software? If not, is there any way to do it manually? (The only thing I can think of is taking screenshots of the PDF files [or converting the PDF to images], dividing each picture in two with a graphics editing program, and then OCRing the halves separately.)

Then there's the issue of OCRing graphics on the page. While Abbyy is OK for OCRing drawings, it doesn't do a very good job with photographs; I would guess that is due to the compression Abbyy uses. So if anybody knows how to preserve the original quality of images during an OCR procedure, please let me know.

Abbyy also doesn't recognize text that is sideways (e.g. at 90 degrees)... or am I just missing some option or feature that does just that?

While I can still insert side text and photographs manually into a PDF document, I have no idea what the best way to separate the pages would be.

If anyone has any advice or experience with OCRing, it would be appreciated. I'm ready to spend time converting some of the books from this forum and sharing them too.

Anira

July 26th, 2006, 10:50 AM

Well, it's always good to have a copy of Adobe Acrobat Pro. I don't think this forum allows such information, so I will let you find it on your own.

Adobe's OCR is reasonably good at recognising blocks of text and columns. It also has features such as exporting all the images, or adding and deleting pages.

I bet you could get it done with some of the batch features of Adobe Photoshop. I personally have not done this exact task, but it shouldn't be too bad.

Ok, here is how.

First go to the history window and click the second tab, actions.
Click the "Create New Action" button. Then click on the record button and record the actions you want.
Go to File --> Automate --> Batch...
From there you can select the action you made and run it. I tried once and it did not work, but I am fairly sure it could be gotten working without a lot of effort.

Sparky

July 26th, 2006, 06:56 PM

I have had reasonably good success writing macros with "autohotkey" (http://www.autohotkey.com/). Writing macros this way is extremely flexible, although it did take a while to learn. It essentially provides the ability to "batch" anything in any program, as long as the task is repetitive enough and you have enough programming ingenuity.

megalomania

July 26th, 2006, 10:03 PM

The most convenient way to split book scans with two pages would be to use Adobe Photoshop CS2. Here are the detailed steps I use when splitting pages with Photoshop.

1. Make two folders containing COPIES of the original files. Name these folders “right side pages” and “left side pages” respectively. Do not edit your originals! One mistake and everything is lost with your originals. The purpose of naming your folders right side and left side should be obvious: in each folder you crop away everything except the right or the left page.

2. Open only 1 image from the “right side pages” folder; the first one is fine. Using the cropping tool, select the area of the page you want to crop. Since we are using a page from the “right side pages” folder, we want to keep the right half of the page. Highlight the area of the right side you want to KEEP. Leave a little extra for margins, as this will be important in later steps. Hit enter to crop the page.

3. If the page is cropped how you like it, click the “image” tab at the very top and select “image size” from the drop down list. Click it. You will now see the horizontal and vertical dimensions of your page. Write these down.

4. UNDO the cropping you just did (edit --> undo or ctrl+z) to get the original back. Click the “marquee tool,” which is the default selection tool that looks like a dotted square. In the tool options bar near the top of the screen there is a drop down box named “Style,” and it will be set to “normal” by default. Change this to “fixed size.” Enter the width and height dimensions you wrote down in step 3. You did write those down, right?

5. On one of the panels that get in your way on the right side of photoshop, and I assume you are using the default view by the way, one of them will have two tabs named “history” and “actions.” Click the “actions” tab. At the bottom of that panel are a few symbols that look like VCR recorder symbols, stop, record, play, a folder, a bit of curled paper, and a trash can. Click on the symbol that looks like curled paper, that is the “create new action” button. A box will pop up asking you to name your new action. You can name it something appropriate like “cropper” or just leave it be. Click the “record” button to begin recording our actions.

6. You are now recording every step you make. The first step of our action is to click anywhere on our opened image. The marquee tool should still be selected, and it should be set up for a fixed width and height. It does not matter where you click on the image at this point, so do not worry if the marquee you placed is not over the area you want.

7. Using the arrow keys on your keyboard, nudge the selected area into the position you want. This will create a new action “move selection” but it will not show up yet. At the top click the “edit” button and choose “crop” from the drop down list. You will now have two actions added, “move selection” and “crop.” Your image should now appear as you want it, cropped properly.

8. Click “file” and select “save” from the drop down list. A box will pop up asking you for quality parameters. I suggest the highest for now, or level 10 anyway. Click the “OK” button to save the file. This step is now saved in your action. Close the image by clicking on the “X.” This step will be added to your action as well.

9. Click the “stop” button on the actions palette to stop recording steps.

10. Click “file,” hold your mouse over “automate” from the drop down list, and click “batch” from the box that expands.

11. The box that pops up is the batch processing mode. This will allow you to perform the same steps you just recorded over an infinite number of image files. In a 1000 page book such a feature is a very handy time saver. The drop list named “Action” should already have your newly created action listed. If it does not, click the arrow next to the box and select the action you recorded from the list. In the section below there is a drop list box named “source,” which should be set to “folder.” Below that is a button named “Choose.” Click that button and select the folder you named “right side pages,” or whichever folder you got your page from.

12. Here is a sticking point I learned the hard way. I have made mistakes in the past that screwed up original scans irreparably, so I am very paranoid about editing any image I have not backed up and duplicated into other folders. I will always create a new folder with copies for every step of editing I do rather than overwrite files I have, even if those files are copies. In the drop down list named “Destination” you may either leave it at “none” to overwrite the files you will be editing, or you may select “folder” to create a destination for all of your edited files. I always choose the latter and make a subfolder in the original directory named something like “edit 1” or “cropped”. If you choose “folder,” click the button “Choose” to establish your files' destination.

13. Make sure you click the check box named “Override Action “Save As” Commands.” We already set up a save and close step in our macro. If you do not check this box, every single image will have the save box pop up, and you will have to manually click it each time. This is a waste of time and defeats the purpose of doing this automatically. When you click the box a pop up will ask if you are sure you know what you are doing. Just click “OK” to get rid of that.

14. Finally, click the “OK” button in the upper right corner of the window to start your batch process on its way of systematically applying your crop commands to every page of your book.
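The fixed-size marquee from steps 4 to 7 boils down to a little arithmetic: a rectangle of known width and height placed at some offset and kept inside the image. Below is a minimal Python sketch of that calculation only. The function name and the sample numbers are made up for illustration, and the (left, top, right, bottom) ordering is chosen to match the box convention used by Pillow's Image.crop in case you want to script the crop outside Photoshop.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Crop rectangle in pixels: (left, top, right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int


def fixed_size_box(img_w, img_h, crop_w, crop_h, x, y):
    """Place a crop_w x crop_h rectangle with its top-left corner near (x, y).

    The corner is pulled back inside the image if the rectangle would
    otherwise hang over an edge, much like nudging the marquee back.
    """
    if crop_w > img_w or crop_h > img_h:
        raise ValueError("crop size is larger than the image")
    left = min(max(x, 0), img_w - crop_w)
    top = min(max(y, 0), img_h - crop_h)
    return Box(left, top, left + crop_w, top + crop_h)


# Hypothetical two-page scan of 3400 x 2200 pixels, keeping the right page.
right_page = fixed_size_box(3400, 2200, 1600, 2100, x=1750, y=50)
```

The same box can be reused for every image in the folder, which is what the recorded action does when it is run as a batch.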

NOTICE: This is the general process by which you can crop every page in your book. There are a few caveats I have discovered along the way when doing this. First and foremost you must understand that what is cropped properly on page 1 may differ from what would be proper on page 100, which may in turn be different for page 200.

Why is this? Twists and turns of a page always make some sort of misalignment. This is why I recommended leaving a good bit of extra margin in step 2. This is exactly why I never overwrite any images. If you just mis-cropped 900 of your 1000 pages, you are screwed unless you have the originals.
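If you would rather script the "never touch the originals" habit than do it by hand, here is a minimal Python sketch using only the standard library. The function name and the default subfolder name are assumptions for illustration (the subfolder name borrows the “edit 1” example from step 12). It copies every file into a working subfolder and refuses to overwrite a copy that is already there.

```python
import shutil
from pathlib import Path


def make_working_copies(source_dir, subfolder="edit 1"):
    """Copy every file in source_dir into source_dir/subfolder.

    The originals are only read, never modified. A file that already
    exists in the working folder is skipped rather than overwritten.
    Returns the list of newly created copies.
    """
    source = Path(source_dir)
    dest = source / subfolder
    dest.mkdir(exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue  # skip subfolders, including the working folder itself
        target = dest / path.name
        if target.exists():
            continue  # keep the earlier working copy untouched
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Run it once per editing stage with a different subfolder name, and do all cropping on the copies.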

If the pages that you are cropping are very different from one another I recommend using a different approach to step 11. Instead of using the default “folder” for your source files, select “Open Files” instead. You will first have to open all of the pages you want to crop before doing this.

Trial and error is your friend here. You may find pages 1-32, 33-97, 98-414, and 415-1000 all require different cropping. More often than not the size of the cropped area remains the same; it is just that the marquee needs to be moved. If page 33 differs from the rest, open that image file, select the step in your action named “move selection,” and click the “record” button to add an additional step. Using your arrow keys, move the marquee to the new position you want. Click the “stop” button, and a new “move selection” step will be added. Make sure the new step was not added at the bottom as your last step, because you don’t want to crop, save, and close the image before the new move step takes place! You can drag the steps into position as needed.

If, for example, pages 33-97 are to receive this modified cropping, open all pages from 33 to 97 at once. I would recommend only opening 25-50 images at a time to avoid stressing out your computer with so many images open at once. Once you start the batch process, only the opened images will be edited.
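The bookkeeping for "different page runs need different marquee positions" and "only open 25-50 images at a time" can be sketched in a few lines of Python. The page runs below are the ones from the example above; the offsets are invented placeholder numbers, and both function names are hypothetical.

```python
from bisect import bisect_right

# First page of each run, taken from the example: 1-32, 33-97, 98-414, 415-1000.
RANGE_STARTS = [1, 33, 98, 415]
# Placeholder (x, y) marquee offsets for each run; real values come from trial and error.
OFFSETS = [(1750, 50), (1720, 60), (1765, 40), (1740, 55)]


def offset_for_page(page):
    """Return the marquee offset that applies to a given page number."""
    if page < RANGE_STARTS[0]:
        raise ValueError("page number is before the first run")
    return OFFSETS[bisect_right(RANGE_STARTS, page) - 1]


def batches(first, last, size=50):
    """Yield lists of page numbers from first to last, at most size per list."""
    for start in range(first, last + 1, size):
        yield list(range(start, min(start + size, last + 1)))


# Pages 33-97 handled 25 at a time.
for group in batches(33, 97, size=25):
    x, y = offset_for_page(group[0])
```

Each batch corresponds to one set of images you would open before starting the batch process with “Open Files” as the source.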

I hope I have explained this clearly enough. I do this a lot, so I know exactly what I am doing, and I may have glossed over something you may not know. It would probably be easier to show how I do this with video screen captures.

If you are getting your images from a PDF document, you may want to use a PDFtoJPG type of program, or anything that converts PDF pages into image files. Please be aware that doing such a thing degrades your images, and they may not OCR with any degree of accuracy. It is always a good idea to compress the original scans when making a PDF to save space, but in your case you don’t have the originals.

Abbyy Finereader does have the ability to detect two page documents and OCR them separately. Considering at least one of the pages will typically have some sort of skew, the skewed page will have more errors. This is why it is best to always use one book page per PDF page when OCR’ing.

You will have to rotate all pages as straight as possible (rotated as you would read a page) to do OCR. Abbyy may apply some deskewing correction, but only for a few degrees before it can’t read anything.

Considering there is a high probability your pdf-to-jpg-back-to-pdf books will have many OCR errors, and since you indicated you want to take the time to do it right, then manually correcting the OCR errors using the “trainer” in Abbyy would give the best results.

nbk2000

July 27th, 2006, 07:59 AM

Anira:

If you've got links to good (and relevant) high-end warez, feel free to post them, as we always need tools. :)

Abbyy has a Rotate Page function. You may have to edit your toolbar options to show it, though.

As long as you've scanned the text in so that it's at an exact (or nearly so) multiple of 90°, Abbyy will be able to rotate it quite cleanly. Otherwise, open the file in Photoshop and use the measure tool to draw a line exactly parallel to the bottom edge of a line of text in the page scan, select Image/Rotate/Arbitrary, and adjust appropriately.
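The angle the measure tool gives you is plain trigonometry on the two endpoints of the line you drew. Here is a minimal Python sketch of that calculation, assuming pixel coordinates with y increasing downward (the usual image convention). The function name is made up, and the sign convention is this sketch's own, not necessarily the one Photoshop displays.

```python
import math


def tilt_degrees(x1, y1, x2, y2):
    """Tilt of a text baseline drawn between two points, in degrees.

    Coordinates are image pixels with y growing downward. A positive
    result means the line drops toward the right, so the page needs a
    counter-clockwise rotation of that size to sit level. A negative
    result means it needs a clockwise rotation.
    """
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        raise ValueError("the two points are identical")
    if dx < 0:
        dx, dy = -dx, -dy  # treat the endpoints as ordered left to right
    return math.degrees(math.atan2(dy, dx))


# Baseline that drops 10 pixels over a 1000 pixel run: a little over half a degree.
angle = tilt_degrees(200, 1500, 1200, 1510)
```

A longer measured line gives a more reliable angle, since a one-pixel error at the endpoints matters less.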

Scanning in lossless uncompressed TIFF is really the only good way of scanning. JPG and GIF create artifacts that are not good for OCR'ing.

Abbyy can split two-page documents automatically, but it's bad at doing it. Better to manually adjust the split line to get it right. You can do it in PS too, but it's more hassle than doing it in the OCR proggy.
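For anyone scripting the split instead, a manually chosen split line turns into two crop rectangles. This is a minimal Python sketch of that step only; the function name, the optional overlap parameter, and the sample numbers are assumptions for illustration. The boxes use (left, top, right, bottom) pixel order.

```python
def split_spread(width, height, split_x, overlap=0):
    """Return (left_box, right_box) for a two-page scan split at split_x.

    overlap lets each half keep a few extra pixels past the split line,
    so text near the gutter is not clipped when the line is slightly off.
    """
    if not 0 < split_x < width:
        raise ValueError("split line must fall inside the image")
    if overlap < 0:
        raise ValueError("overlap cannot be negative")
    left_box = (0, 0, min(split_x + overlap, width), height)
    right_box = (max(split_x - overlap, 0), 0, width, height)
    return left_box, right_box


# Hypothetical 3400 x 2200 scan whose gutter sits a bit left of centre.
left_box, right_box = split_spread(3400, 2200, split_x=1650, overlap=40)
```

The split position usually needs adjusting per page or per run of pages, for the same reason the automatic split gets it wrong.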

Graphics are handled by setting the image save setting to the highest DPI and resolution needed for clarity.

I've found that, in most of my scanning, a 600DPI scan lets me save a picture at 150DPI/20% .JPG with no more artifacting than a 300DPI scan saved at 150DPI/80% .JPG, but at 1/5th or less of the file size of the higher-res image. :D Using a higher-resolution scan to start with lets you use a much more compressed image, with less artifacting, than a lower-resolution scan with less image compression.
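One plausible reason the higher-resolution scan holds up better is simply that more scanned pixels go into each saved pixel. Here is a minimal Python sketch of that arithmetic, assuming the scan is reduced by a whole-number factor and that each saved pixel is built from a square block of scanned pixels. Both function names are made up, and this only illustrates the pixel counts; it does not reproduce the file-size figures above.

```python
def downsample_stats(scan_dpi, output_dpi):
    """Return (factor per axis, scanned pixels per saved pixel)."""
    if output_dpi <= 0 or scan_dpi % output_dpi != 0:
        raise ValueError("scan_dpi must be a whole multiple of output_dpi")
    factor = scan_dpi // output_dpi
    return factor, factor * factor


def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size in inches at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# 600 DPI scan saved at 150 DPI: 16 scanned pixels feed each saved pixel.
# 300 DPI scan saved at 150 DPI: only 4 scanned pixels feed each saved pixel.
from_600 = downsample_stats(600, 150)
from_300 = downsample_stats(300, 150)
```

Either way the saved image has the same pixel dimensions, so any difference in file size comes from the JPG quality setting and how smooth the reduced image is.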

Really, most of the hassles in OCR stem from bad scanning technique more than anything else. If you spend the time to do a clean page scan (even if it takes several tries), the need for cleanup is almost nonexistent and OCR is practically perfect. :)

Best to remove the pages from the book if you can, so the pages lie perfectly flat and square on the scanner. Obviously you can't do this with borrowed books.