I'd like to make available online a fairly large and useful resource regarding the
Rendlesham Forest Incident, but am having a slight technical problem - outlined
below.
The 1980 Rendlesham Forest Incident is one of the more frequently discussed UFO events.
Many of the current online discussions of that event reinvent the wheel, covering ground already covered at length.
Some of the more detailed previous discussions were on a specialist forum at Rendlesham-Incident.co.uk. Members included various skeptics (including
Ian Ridpath) and some of the key witnesses (including John Burroughs). Unfortunately, that forum is no longer available online. (That discussion forum was started by someone when he was still in school and basically grew out of his interest in UFOs.)
The good news is that I have been able to obtain a copy of the archive of the posts to that forum from one of its old members ("Daniel"). The owner
of the forum positively encouraged members to download the archive before he pulled the plug.
I'd now like to make that archive available online. I think this is what the owner of that forum would have wanted (and I've been trying to contact
him - or anyone still in touch with him - via various Rendlesham discussion forums to double-check this is okay with him, but he doesn't appear to
have stayed in touch with any of the Rendlesham researchers or witnesses).
Anyway, the archive of posts was saved as a 0.5 GB collection of HTML files.
To make it easier to search, I have converted all the pages of posts into 10,000 PDF documents. That archive is rather large (just over 2 GB).
Upon doing some searches, it quickly became clear that many of the pages within the HTML archive are duplicated. There are up to 5 to 10 copies of many pages.
Apart from inflating the size of the archive that I'd like to share online on a free file hosting website (e.g. minus.com), this duplication also
makes reviewing search results more time consuming than it needs to be.
I've used two pieces of software to try to eliminate the duplication: "Duplicate Cleaner" and DupKiller.
The former of these made it easy to delete files of the same size, and this has reduced the number of PDF documents from 10,000 to just over 5,000.
Not a bad start. Unfortunately, a lot of duplication remains which was not eliminated using this method. It appears that, despite looking the same
and having the same text, some of the duplicates have *slightly* different file sizes, so they aren't found by the "same size" function of either
"Duplicate Cleaner" or DupKiller. Nor does the "same content" function of either piece of software seem to pick up the remaining large amount of
duplication.
I've uploaded a sample of about 200 PDF files from the 5,000-file archive, sorted by size, to the link below to illustrate the problem:
min.us...
That sample can be downloaded as a single zip file, or reviewed online by scrolling through the documents. There is an indication of the file sizes
at the bottom.
I sorted the collection by size and started manually deleting all but one file out of each set of several files of approximately the same size, but
found that I was starting to eliminate some non-duplicates.
Apart from sorting the collection by size and manually opening each PDF document before deleting files of approximately the same size, is there a
simple method or piece of software I could use?
If a method works on the sample at the link above, presumably I can run the same method on the larger collection.
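One approach that might work for the remaining duplicates (same text, slightly different bytes) is to compare the *extracted text* rather than the files themselves: pull the text out of each PDF, collapse whitespace and case so cosmetic differences don't matter, and group files whose normalized text is identical. A sketch only, with the text extraction step left as an assumption (Poppler's pdftotext command-line tool is one way to do it, if installed):

```python
import hashlib
from collections import defaultdict

def text_fingerprint(text):
    """Hash the text after collapsing whitespace and case, so layout
    differences (spacing, line wrapping) don't change the fingerprint."""
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def group_by_text(texts_by_path):
    """Given {path: extracted_text}, return groups of paths whose
    normalized text is identical (likely duplicates)."""
    groups = defaultdict(list)
    for path, text in texts_by_path.items():
        groups[text_fingerprint(text)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

# Building texts_by_path is the assumed step; with Poppler installed it
# could look roughly like:
#   subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
```

This would flag the near-duplicates the size-based tools miss, though any pair where the extracted text differs even slightly (e.g. different page headers) would still need a manual look.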
Any help would be appreciated.
All the best,
Isaac
edit on 1-5-2012 by IsaacKoi because: (no reason given)