
Deleting *nearly* duplicate files? (Large Rendlesham UFO resource)


posted on May, 1 2012 @ 10:17 AM
I'd like to make available online a fairly large and useful resource regarding the Rendlesham Forest Incident, but am having a slight technical problem - outlined below.

The 1980 Rendlesham Forest Incident is one of the more frequently discussed UFO events.

Many of the current online discussions of that event reinvent the wheel, covering ground covered previously at length.

Some of the more detailed previous discussions were on a specialist forum at Rendlesham-Incident.co.uk. Members included various skeptics (including Ian Ridpath) and some of the key witnesses (including John Burroughs). Unfortunately, that forum is no longer available online. (That discussion forum was started by someone while he was still at school, and basically grew out of his interest in UFOs...)

The good news is that I have been able to obtain a copy of the archive of the posts to that forum from one of its old members ("Daniel"). The owner of the forum positively encouraged members to download the archive before he pulled the plug.

I'd now like to make that archive available online. I think this is what the owner of that forum would have wanted (and I've been trying to contact him - or anyone still in touch with him - via various Rendlesham discussion forums to double-check this is okay with him, but he doesn't appear to have stayed in touch with any of the Rendlesham researchers or witnesses).

Anyway, the archive of posts was saved as a 0.5 GB collection of HTML files.

To make it easier to search, I have converted all the pages of posts into 10,000 PDF documents. That archive is rather large (just over 2 GB).

Upon doing some searches, it quickly became clear that many of the pages within the HTML archive are duplicated. There are up to 5 to 10 copies of many pages.

Apart from inflating the size of the archive that I'd like to share online on a free file hosting website (e.g. minus.com), this duplication also makes reviewing search results more time consuming than it needs to be.

I've used two pieces of software to try to eliminate the duplication: "Duplicate Cleaner" and DupKiller.

The former made it easy to delete files of the same size, and this has reduced the number of PDF documents from 10,000 to just over 5,000. Not a bad start. Unfortunately, a lot of duplication remains which was not eliminated using this method. It appears that, despite looking the same and having the same text, some of the duplicates have *slightly* different file sizes, so they aren't found by the "same size" function of either "Duplicate Cleaner" or DupKiller. Nor does the "same content" function of either piece of software seem to pick up the remaining large amount of duplication.

I've uploaded a sample of about 200 PDF files from the 5,000-file archive, sorted by size, to the link below to illustrate the problem:
min.us...

That sample can be downloaded as a single zip file, or reviewed online by scrolling through the documents. There is an indication of the file sizes at the bottom.

I sorted the collection by size and started manually deleting all but one file from each set of files of approximately the same size, but found that I was starting to eliminate some non-duplicates.

Apart from sorting the collection by size and manually opening each PDF document before manually deleting files of approximately the same size, is there a simple method or piece of software I could use?

If a method works on the sample at the link above, presumably I can run the same method on the larger collection.

Any help would be appreciated.

All the best,

Isaac





posted on May, 1 2012 @ 10:49 AM
reply to post by IsaacKoi
 


You really need something that can compare files of nearly the same size, then compare the text, and then come up with a percentage match. This probably isn't much help to you, but I think I'd write my own software to do the job.
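Something along the lines of the rough Python sketch below could be a starting point (untested, and the file names are made up): it reads the text of two files and reports a similarity percentage, which you could apply to pairs of files of nearly the same size.

from difflib import SequenceMatcher
from pathlib import Path

def percent_match(path_a, path_b):
    """Return a rough 0-100 similarity score between the text of two files."""
    text_a = Path(path_a).read_text(errors="ignore")
    text_b = Path(path_b).read_text(errors="ignore")
    return SequenceMatcher(None, text_a, text_b).ratio() * 100

# Flag a pair as probable duplicates above some threshold.
if percent_match("page_0001.html", "page_0002.html") > 95:  # made-up file names
    print("probable duplicates")

For the PDFs you'd need to extract the text first, but the same idea applies.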



posted on May, 1 2012 @ 10:58 AM
If you are a Linux or Mac user and able to use advanced command-line tools, this is pretty simple; with Windows you're going to find it a bit more difficult.

Try searching for duplicate file finders capable of comparing hash data, full text, word count, etc. Matching file sizes is only good for specific types of files; size-based de-duplication is not advisable for PDF docs.
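For example, a quick content-hash pass will catch any byte-for-byte copies regardless of file name. A rough Python sketch of the idea, assuming Python is installed (the folder name is a placeholder):

import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Group files by content hash; any group with more than one member
# is a set of byte-identical duplicates.
groups = defaultdict(list)
for pdf in Path("rendlesham_archive").glob("*.pdf"):  # placeholder folder name
    groups[sha256_of(pdf)].append(pdf)

for digest, files in groups.items():
    if len(files) > 1:
        print(digest[:12], [f.name for f in files])

Bear in mind this only finds exact byte-level copies; PDFs that merely look identical but were generated separately will usually hash differently.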

Here, I found a Windows-based diff tool capable of handling PDF docs, and it's free -
www.addictivetips.com... fference-with-diffpdf/



posted on May, 1 2012 @ 10:59 AM

Originally posted by PhoenixOD
You really need something that can compare files of nearly the same size, then compare the text, and then come up with a percentage match.


Sounds sensible to me.

Does anyone know of any such software?

I can't find such functions in the two pieces of software I've tried to use to eliminate duplicate files - but I may have missed something. (I'm a lawyer, not an IT expert...)



I think I'd write my own software to do the job.


I can do the job manually in a bearable amount of time (since the automated process of comparing identical file sizes has already reduced the number of pages from 10,000 to 5,000). I'd certainly be quicker doing it manually than I would learning to write a piece of software of this sort. But if the software exists already, it would avoid a fairly boring day or two of doing things manually!



posted on May, 1 2012 @ 11:02 AM

Originally posted by ecoparity
If you are a linux or Mac user


Nope. I'm afraid I've stuck with Windows (from DOS onwards...).



with Windows you're going to find it a bit more difficult.


I'm probably too stuck in my ways to change over to linux or a Mac now.

Thanks for your comments though. One option is that I find a tech-savvy Mac user and give him the archive to run the relevant command on.



posted on May, 1 2012 @ 11:05 AM
I edited my post but you replied since then.

Here's a free PDF diff tool for windows you can use to compare text:

www.addictivetips.com... fference-with-diffpdf/



posted on May, 1 2012 @ 11:10 AM

Originally posted by ecoparity
Here's a free PDF diff tool for windows you can use to compare text:

www.addictivetips.com... fference-with-diffpdf/


Many thanks. I'll try that out when I get home this evening.

Looking at the relevant webpage, I think that tool deals with a slightly different issue, in that it seems to open two PDF documents and then highlight the differences between them. In my case, I can sort by file size, open each file in turn, and tell at a glance whether two adjoining files in the list are the same document (because they have the same title at the top of the page and look the same) or two different documents (because they look completely different and have different titles) - so I don't need a detailed comparison of the two. However, I'll install the tool you've found and see if it can be used for the job I'd like it to do without having to manually open each file and then manually delete any duplicates. Thanks again.



posted on May, 1 2012 @ 11:12 AM
Here's another solution, but a bit more complex. It adds an Explorer extension that will calculate file hash values.

What you want to do is set it to generate a hash for the PDFs, then add that hash value as a display option in Explorer (so it shows up next to the files in one of the columns in an Explorer window).

Then you can sort by that column and you should see sets of PDFs with matching hash values. These are almost surely going to be duplicates of each other.

The reason pure size-based diffing doesn't work with PDFs is that unimportant elements like spacing can create different-sized files, and even the PDF compression can take two identical documents and produce two PDFs with two different file sizes.

I would use the hash values to locate and sort likely duplicates, and the diff tool to identify and eliminate them. With shell scripting you could automate this, which is why a Mac or Linux system would be ideal. Otherwise it's going to be pure manual labor, I'm afraid.

www.addictivetips.com... ws-tips/hashtab-calculate-compare-hash-checksum-values-from-file-properties/
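If you do want to automate it on Windows after all, one variation on the hash idea is to hash the extracted text rather than the raw bytes, so byte-level differences from compression or metadata no longer matter. A rough, untested sketch, assuming Python with the third-party pypdf package installed (the folder name is a placeholder):

import hashlib
from collections import defaultdict
from pathlib import Path

from pypdf import PdfReader  # third-party package: pip install pypdf

def text_fingerprint(pdf_path):
    """Hash the concatenated page text, ignoring byte-level differences."""
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

groups = defaultdict(list)
for pdf in Path("rendlesham_archive").glob("*.pdf"):  # placeholder folder name
    groups[text_fingerprint(pdf)].append(pdf)

# Keep the first file in each group and list the rest as deletion candidates.
for files in groups.values():
    for duplicate in files[1:]:
        print("candidate duplicate:", duplicate.name)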



posted on May, 1 2012 @ 11:17 AM
I don't know if you already have somewhere to host all this, but if not I could probably help you with some hosting and assist with the de-duping once the files are uploaded to a Linux-based web server.

Send me a pm if you want.



posted on May, 1 2012 @ 11:21 AM

Originally posted by ecoparity
Here's another solution but a bit more complex.


This one looks promising. I don't mind a bit of complexity. I'd rather spend time learning how to get my computer to do something new than spend the same amount of time (or considerably longer) sorting the list of files by file size, manually opening adjoining files (all 5,000 documents...) and manually deleting duplicates.



What you want to do is set it to generate a hash for the PDFs, then add that hash value as a display option in Explorer (so it shows up next to the files in one of the columns in an Explorer window).

Then you can sort by that column and you should see sets of PDFs with matching hash values. These are almost surely going to be duplicates of each other.


I see.


I'm more optimistic about this method than with the other tool, but will give both a go and report back. Thanks again.



posted on May, 1 2012 @ 11:25 AM

Originally posted by ecoparity
I don't know if you already have somewhere to host all this


Well, I have a draft website, but I don't want to be seen as profiting from work done by other people, so I had in mind just sharing this archive on a free file-storage website (such as minus.com) rather than posting it on my own website. (I've done the same with various other archives of posts and other material, when the copyright holder has agreed to my sharing the material.)



but if not I could probably help you with some hosting and be able to assist with the de-duping once the files are uploaded to a linux based web server.

Send me a pm if you want.


Thanks. I'll try the two methods of de-duping that you've kindly posted above and if I'm still having problems I'll send you a pm.

As for hosting, I'm happy sticking with minus.com or another of the free file-storage websites so that there is no hint that I'm trying to profit from work done by others.



posted on May, 1 2012 @ 03:04 PM
Problem solved.


Many thanks for the input in this thread, particularly by ecoparity.

The solution, found as a result of looking into ecoparity's helpful comments, actually turned out to be pretty simple.

After installing the hash-generating software, I set about looking into how to get Windows to display the hash information in a separate column. While doing so, I found that one of the columns that can be displayed is "Title", which actually gives the title within the PDF document. (As I mentioned above, the duplicates have the same title within the PDF document.) When that column is displayed, sorting by file size shows blocks with one title followed by blocks with the next title, etc., without needing to go through the tedious business of opening each one.

I can now tell at a glance from the listing of files which ones are duplicates and, without opening the PDF files, delete the duplicates very quickly and easily. I should be able to zip through the 5,000+ files in a matter of minutes.
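(For anyone who would rather script the same check than add the Explorer column, here is a minimal sketch of the idea, assuming Python with the pypdf package installed and with a placeholder folder name: it reads the Title field from each PDF's metadata and groups files that share it.)

from collections import defaultdict
from pathlib import Path

from pypdf import PdfReader  # third-party package: pip install pypdf

titles = defaultdict(list)
for pdf in Path("rendlesham_archive").glob("*.pdf"):  # placeholder folder name
    meta = PdfReader(pdf).metadata
    title = (meta.title if meta else None) or "(no title)"
    titles[title].append(pdf)

# Files sharing a title are likely to be copies of the same page of posts.
for title, files in sorted(titles.items()):
    if len(files) > 1:
        print(title, "->", [f.name for f in files])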

I like it when solutions turn out to be easy.

Many thanks for the input in this thread. I wouldn't have found this simple solution if it wasn't for the comments in this thread.

I hope to be able to upload the Rendlesham archive (free from duplicates...) in the next day or two. I'll post in this thread once I've done so, in case anyone was interested in the substance of the archive.





posted on May, 1 2012 @ 08:02 PM
Just thought I would say thanks for your efforts in keeping these archives alive and well. The Rendlesham Forest incident has always intrigued me the most, and I consider it to be in the top ten UFO incidents of all time.

Good work.



posted on May, 11 2012 @ 04:03 AM

Originally posted by IsaacKoi
Problem solved.


Many thanks for the input in this thread, particularly by ecoparity.


Well, it turned out that I was a bit optimistic with my comment above. The relevant method was useful and helped reduce the archive by about another 30%, but it still left quite a lot of duplication. I ended up going through the last few thousand posts manually...

Eliminating that duplication (which has involved a LOT more time and effort than I had hoped...) has reduced the size of the zipped collection of PDFs from 2 GB to under 50 MB!

Anyway, I've now uploaded the PDF archive at the link below and will write a thread in the Aliens & UFOs forum here on ATS about it (and other resources relating to Rendlesham, including a chronological list of relevant documentaries on YouTube and the official documents released by the Ministry of Defence):
minus.com...

I thought I'd post the link here first since you guys helped reduce the amount of work involved.

Each page of posts is a separate PDF document, so the PDF archive is largely unstructured (unless you instruct Windows to add a column containing the "Title", which then lets you sort by the title of each thread) - but I have still found the PDF archive very useful for quickly searching the collection of PDF pages using the free software outlined at the link below:
www.abovetopsecret.com...

I have also uploaded the original HTML archive supplied to me by "Daniel" to the link below (with a file size of about 0.4 GB):
minus.com...

Thanks again.


