UFO eBooks - Text recognition, enhancement and compression

posted on Nov, 27 2015 @ 08:18 AM
This is a small and rather boring thread, but it lays part of the foundation for a couple of more interesting threads I hope to post here soon.

As some of you know, I have been concerned for many years about the amount of time and energy wasted within ufology reinventing the wheel. To reduce that wastage, during the last few years I've posted a number of searchable PDF collections of official documents (e.g. from the FBI, Canada, Australia and New Zealand), out-of-print UFO journals/magazines (e.g. Skeptics UFO Newsletter by Phil Klass) and ufological PhD dissertations - plus some audio material. I've also posted about ways of searching large collections of searchable PDF files - see my thread from 2011 entitled "FAST searching of major free online collections of UFO journals (or just browse/download them)".

Unfortunately, during the last few years I've only been able to post a small fraction of my ufological material, partly due to copyright issues and partly due to the time/resources needed to scan/upload material.

I've recently put quite a bit of time and effort into finding quicker ways of producing smaller searchable PDF files, ideally with enhanced text (or at least without making text very pixellated).

I thought it worth reporting back on my results, since I learnt a few things I really wish I'd been told a few years ago. It would have saved me a lot of time and effort. Hopefully, the results below will help save some of you a bit of time in the future.

Basically, I've been keen to reduce PDF file sizes to make it easier to exchange material online without materially sacrificing image/text quality.

One piece of software that was recommended to me a few years ago for this purpose was the Russian ScanKromsator software, which generally reduces the file size by a factor of 10 to 20 BUT needs individual tweaking for each PDF file, which can take about 30 minutes or longer each. Given the volume of material I wanted to process and the limits on my spare time (and also limited patience for tedious tasks...), this was simply too long. I also tried quite a few methods I've seen discussed online, but those options generally only reduce the file size by a factor of 2 or 3 (which is disappointingly low compared to the much more dramatic reduction achieved by ScanKromsator).

At the suggestion of French researcher Nab Lator a couple of months ago, I tried OmniPage Pro 18 on a 275Mb scan of a UFO book. That book had been reduced to a high quality copy of only 30Mb using ScanKromsator in about 30 minutes. Using OmniPage Pro 18, results were FAR quicker. I was able to produce (in a few seconds each) a number of variations. I started by producing just a lossless compressed 300dpi black and white version (about 12Mb) but the few photos in the book were useless. Doing the same book in grayscale made the photos useful and only increased the file size to 13Mb. Encouraged by that result, I made a 24bit colour version - which is only about 17Mb. As I say, I don't care about the file size difference. The quality looks fine when reading it on a screen, but when I zoom in (to, say, 800%) pixellation compared to the original file is obvious - which I found a little surprising given the use of a "lossless" setting.

While playing with lots of options on OmniPage Pro 18, I came across an option for "MRC" compression. This may be old news to a lot of you, but I was very surprised by the results of using this compression method. Basically, the same sample book referred to above was reduced to a searchable PDF file with a file size of just 5Mb - and it looked pretty good to me.
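For anyone who wants to compare the options side by side, here is a rough sketch in Python that turns the file sizes quoted above for my one sample book into reduction factors. The sizes are approximate and come from a single book, so treat the output as indicative only, not as a benchmark of the software.

```python
# File sizes in Mb as quoted in this post for the one sample book.
# These are approximate figures from a single book, not a benchmark.
ORIGINAL_MB = 275

results_mb = {
    "ScanKromsator": 30,
    "OmniPage black and white": 12,
    "OmniPage grayscale": 13,
    "OmniPage 24bit colour": 17,
    "OmniPage MRC": 5,
}


def reduction_factor(original, reduced):
    """How many times smaller the reduced file is than the original."""
    return original / reduced


for name, size in results_mb.items():
    factor = reduction_factor(ORIGINAL_MB, size)
    print(f"{name}: {size}Mb, about {factor:.1f} times smaller")
```

On these figures the MRC version is 55 times smaller than the original scan, against roughly 9 times for the ScanKromsator copy.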

Looking into MRC compression, I found that the FineReader software also has an option for this method of compression. With some further kind help from Nab Lator, I tried out a number of options in FineReader and was delighted to find that it has very easy options for creating a tool for batch processing directories full of PDF files. Forget about individual tweaking for each book - just let a computer run and it processes a large pile of material for you (although I found that it was best to set FineReader to process a batch lasting about a day or so, otherwise it seemed to become a bit unstable and slow down or crash).
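If you want to plan batches of about a day each before handing them to FineReader, something like the Python sketch below could do the grouping. It is only an illustration and not part of FineReader: the throughput figure (Mb processed per hour) is an assumption you would need to measure on your own machine, the file names in the example are made up, and the script only groups files by size. You would still point FineReader at each group yourself.

```python
from pathlib import Path


def pdf_sizes_mb(folder):
    """Return (file name, size in megabytes) for each PDF in a folder."""
    return [(p.name, p.stat().st_size / 1_000_000)
            for p in sorted(Path(folder).glob("*.pdf"))]


def plan_batches(files, mb_per_hour, hours_per_batch=24):
    """Group files into batches estimated to take at most hours_per_batch.

    mb_per_hour is an assumed processing rate, to be measured locally.
    A single file bigger than the whole budget gets a batch of its own.
    """
    budget_mb = mb_per_hour * hours_per_batch
    batches, current, used = [], [], 0.0
    for name, size in files:
        # Start a new batch when adding this file would exceed the budget.
        if current and used + size > budget_mb:
            batches.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file names and sizes, with an assumed rate of 25 Mb per hour.
example = [("a.pdf", 275), ("b.pdf", 300), ("c.pdf", 100), ("d.pdf", 900)]
for number, batch in enumerate(plan_batches(example, mb_per_hour=25), start=1):
    print(f"Batch {number}: {', '.join(batch)}")
```

For a real folder you would pass pdf_sizes_mb("your/folder") instead of the example list.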

FineReader also has a function called "PreciseScan" which smooths pixellated characters. I think the PreciseScan function makes a subtle but noticeable difference when comparing some sample pages of text.

Using the batch processing tool in FineReader 12 Corporate, I was able to:
(1) Dramatically reduce file sizes;
(2) At the same time, enhance the text;
(3) Get excellent text recognition results (considerably better than my previous results in Adobe Acrobat);
(4) Not have to tweak settings or interfere with each file being processed - I could just let my computer run.

The reduction in file sizes can be seen by comparing the sample list below:

Unprocessed original large file sizes:

Small searchable PDF file sizes after processing using FineReader 12 Corporate:

Here is a sample of an original scan (before PreciseScan enhancement):

Here is the same page after processing with the PreciseScan function engaged in FineReader 12 Corporate:

I'll also just paste below the settings I used in the batch processing tool in FineReader 12 Corporate, which only takes a few minutes to set up and can then be saved for use in the future (like a macro):

posted on Nov, 27 2015 @ 05:07 PM
S&F for you! Thanks much. This is very helpful to me.

posted on Dec, 12 2015 @ 11:41 PM
Nice process Isaac, the results look good compared to the original, very readable, slightly rounded edges.

I have a serious issue with related video quality. YouTube compression kills the detail in 1920 x 1080 hi def, rendering YouTube useless as a serious analysis tool, and becoming just a reference tool. Where can high def vids be posted and shared here without major compression? There seems to be a quality void because of the price of bandwidth.

posted on Apr, 6 2016 @ 03:15 PM
thank you for all of this omg !!!

