originally posted by: Shadoefax
Great job, but I have to ask: why is the archive so large? If it consists of 129,000 pages and is 154 GB in size, that works out to about 1.2 MB per page. I realize that the pages are scanned images, but in what format? If they are in an uncompressed format (like .tif or .bmp), surely they can be re-saved as .jpg or .png and reduced in size ten- to 100-fold. It would be a lot easier to download 1.5 GB than 154.
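For anyone who wants to check the per-page figure, here is a rough sketch using only the two numbers quoted in the post (129,000 pages, 154 GB). Whether "GB" means decimal or binary gigabytes is an assumption, so both are shown; the function names are made up for illustration, and the ten- and 100-fold factors are the poster's guesses, not measured compression ratios.

```python
# Figures quoted in the post above.
PAGES = 129_000
ARCHIVE_GB = 154


def per_page_kb(total_gb: float, pages: int, binary: bool = False) -> float:
    """Average size per page in KB (or KiB when binary=True)."""
    base = 1024 if binary else 1000  # assumption: which "GB" is meant
    total_kb = total_gb * base * base
    return total_kb / pages


def shrunk_gb(total_gb: float, factor: float) -> float:
    """Archive size after a hypothetical size reduction by `factor`."""
    return total_gb / factor


if __name__ == "__main__":
    print(f"decimal GB: {per_page_kb(ARCHIVE_GB, PAGES):.0f} KB per page")
    print(f"binary GB:  {per_page_kb(ARCHIVE_GB, PAGES, binary=True):.0f} KiB per page")
    for factor in (10, 100):
        print(f"{factor}x smaller: {shrunk_gb(ARCHIVE_GB, factor):.2f} GB")
```

Either way the average lands a little above 1 MB per page, and a 100-fold reduction would bring the whole archive down to roughly 1.5 GB.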
originally posted by: pauljs75
Interesting job there. My past experience reading OCR'd books (Project Gutenberg) suggests the process is still anything but ideal. That text data is going to need some proofreading against the scanned documents if it's to be of much use.
Also, formatting can be funny at times: page breaks, or whatever else OCR tends to do, can be annoying in a flowing layout.