View Full Version : OCR'ing FTP


lider_revolucionario
August 8th, 2006, 12:07 PM
I was looking through the FTP and saw that a fair number of books have not been OCR'ed. So I was thinking of doing this to help those without broadband get access to these books. I've started OCR'ing PMJB vol. 1, and I'm at page 20 (it is a huge job!). If anyone is interested in helping, post here so people know what is already in progress.

Anira
August 9th, 2006, 01:53 AM
For some reason I am getting the impression you are typing it and not OCR'ing. OCR is quite quick for the first stage, and the result should then be proofread. The first step should take a computer only a few minutes.

lider_revolucionario
August 9th, 2006, 10:27 AM
The point is that some books are very hard to OCR. For example, in the book I'm working on, a large number of letters aren't recognised correctly, partly because of the small type and partly because of problems with the scan. Even ABBYY FineReader 8.0 can't handle this part. This generally happens with old books.

nbk2000
August 11th, 2006, 07:07 AM
With books that won't OCR, there's generally very little reason to bother with hand transcription of the contents. It's not like you're typing up the Bible or Mein Kampf. :)

A reasonable compromise would be to type up the Index or Table of Contents, making that searchable, and providing inline links to the relevant pages. That's what I did with the PMJB series.
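As a rough illustration of that compromise, here is a minimal Python sketch that turns a hand-typed table of contents into an HTML list linking to the scanned page images. The `page_{:04d}.png` naming scheme, the `build_index` helper, and the sample entries are all assumptions made for the example; they are not how the PMJB files are actually laid out.

```python
from html import escape


def build_index(entries, page_url="page_{:04d}.png"):
    """Build an HTML list from (title, page_number) pairs.

    page_url is a hypothetical naming pattern for the scanned page images;
    adjust it to match however the scans are really named.
    """
    items = []
    for title, page in entries:
        href = page_url.format(page)
        # Escape both the link target and the visible text so that
        # characters such as & or < in a title do not break the markup.
        items.append(
            f'<li><a href="{escape(href)}">{escape(title)}</a> (p. {page})</li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # Placeholder entries, typed in by hand from the book's contents page.
    toc = [("Example chapter A", 5), ("Example chapter B", 42)]
    print(build_index(toc))
```

The typed titles become searchable text, while the pages themselves stay as images, so only the contents page needs transcribing.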

megalomania
August 13th, 2006, 05:20 PM
Time is also on your side. With each passing year scanning technology improves, and OCR accuracy increases. It may be easier to simply redo a poor scan in a few years. I would rather spend my effort OCR'ing better quality, but un-OCR'd, books instead of manually transcribing the tough ones. Pick the low-hanging fruit first, and hope for better recognition in the future.

The Gamera OCR engine has a very nifty character identifier that lumps every occurrence of a letter into a single graphic grid. I wish Abbyy had something like this because it would make training user patterns much faster. Unfortunately, Gamera is some incomprehensible *nix-based app that lacks every other useful feature of Abbyy. Perhaps, in a few years, someone will steal the idea.
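To make the grid idea concrete, here is a small Python sketch of the general technique: collect every glyph the recogniser assigned to the same letter, then lay each group out in rows so a whole batch of misreads can be spotted at a glance. This is not Gamera's API; `group_glyphs`, `to_grid`, and the sample data are hypothetical names invented for the illustration.

```python
from collections import defaultdict


def group_glyphs(glyphs):
    """Group (label, glyph) pairs by the label the recogniser assigned.

    'glyph' can be anything that identifies a character image, for
    example an ID or a cropped bitmap; plain integers are used here.
    """
    groups = defaultdict(list)
    for label, glyph in glyphs:
        groups[label].append(glyph)
    return dict(groups)


def to_grid(items, columns=4):
    """Split a flat list into rows of at most 'columns' items."""
    return [items[i:i + columns] for i in range(0, len(items), columns)]


if __name__ == "__main__":
    # Made-up recogniser output: (assigned letter, glyph ID).
    recognised = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]
    for label, members in group_glyphs(recognised).items():
        print(label, to_grid(members, columns=2))
```

With all the glyphs for one letter side by side, the odd ones out are easy to pick and relabel in one pass, which is why this kind of view speeds up pattern training.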

Gamera is one example of a very useful technology that just isn’t ready for prime time consumer use with a handy Windows based GUI. Xerox scientists are hard at work improving heuristic recognition of text. Each new version of Abbyy, or OmniPage, improves recognition accuracy. Give it time…