Massive UFO disclosure in USA : A challenge for ATS

page: 17

posted on Aug, 6 2011 @ 06:00 PM

Originally posted by idealord
Writing parsers for inconsistent data is impossible or at best problematic. Plus, when you add characters that are meaningless in the middle of them - those >> characters aren't ASCII, they're who knows what - then it's just a mess. I was lucky to be able to separate city and state as often as I did.

I'm afraid at some point it'll come down to volunteers hand-fixing the data (which is why I kept the processed raw on the side).



Thanks Jeff.

I think I understand (most of your comments, anyway...). Of course, I'd like the automated process to deal with as much of the tidying up as possible, but I accept there will be limits. However, while you would probably be horrified by my approach, I think I could use Microsoft Word (which I'm more at home with than any script or code...) to automate some more of the tidying up, if I import your spreadsheet as a table in Word and then record a couple of basic macros. This would not solve all the issues, but I should be able to use the raw information in your final column to fill in quite a few of the gaps in the other columns of the spreadsheet you are kindly creating.

All the best,

Isaac




posted on Aug, 6 2011 @ 09:34 PM
I have an idea as to how UFOs could be refueling in our atmosphere. A ring of antimatter was discovered this week around the Earth by the PAMELA science instrument on board a Russian satellite in a near-polar orbit. (Polar orbits are typically used for stealth satellites - yes, those exist.)

Please, check out my blog for my idea and the reference to the article which contains a reference to the technical paper.
novelfindings.blogspot.com...

Please, let me know what you think in comments here or on the blog.



posted on Aug, 7 2011 @ 12:51 AM
reply to post by IsaacKoi
 


Not horrified at all! In fact, that's why I'm saving the Raw column. As I explained earlier, I have a second database that I haven't put up yet which just has the IDs and the unprocessed raw data, thinking I might come up with a better approach at a later date.

I'm up to 84801 records right now, FWIW... I'm starting to think before I get back to the Canadian data, I might get back to downloading the low-res docs. It's probably a good thing to preserve and with this data, I could make a huge PDF that we could slowly get better at organizing with a little tweaking.

I was thinking it'd be nice to have all the incidents page by page, browsable. So everything from the Long Island incident of 1963 would be next to each other.

About the Canadian stuff, is it true the database only lists 2000 pages? If so, my spidering approach is very limited. How did you guys confirm that?

Jeff



posted on Aug, 7 2011 @ 07:48 AM
All done!

parnasse.com...

Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my site zipped.

I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...
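Jeff's substitution step can be sketched in a few lines of Python (a sketch for illustration, not his actual code; the separator characters "…" and "»" are assumed from the raw-data sample quoted later in the thread):

```python
def pipe_delimit(line: str) -> str:
    """Turn one raw Footnote record into a pipe-delimited row."""
    # "|…" follows the record ID and "»" separates the remaining
    # fields (assumed from the raw sample; the real dump may vary).
    line = line.replace("|\u2026", "|")   # "|…" -> "|"
    line = line.replace("\u00bb", "|")    # "»"  -> "|"
    return line

record = "7199601|\u2026 1968 \u00bb July \u00bb Brooklyn, New York \u00bb Page 132"
fields = [f.strip() for f in pipe_delimit(record).split("|")]
print(fields)
```

Because the pipe never occurs in the data itself, the resulting file splits cleanly even though city names contain commas.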

Hope this is interesting to others!

Jeff



posted on Aug, 7 2011 @ 05:15 PM

Originally posted by idealord
All done!


Hi Jeff,

Don't be so quick to declare a complete victory, although I think you are pretty darn close in terms of extracting information...

Apart from the small matter of also downloading all the images - okay, not so small (although hopefully Xtraeme is fairly close to sorting this problem out), I have a few relatively minor issues with your draft spreadsheet.

Since Xtraeme very recently mentioned that he was working on extracting the metadata associated with each image, I presume your work basically solves that extraction issue. I'm not sure, though, that your explanation of your approach so far is detailed enough to save Xtraeme the effort of working out the extraction process for himself. (It's certainly not enough for me to duplicate your work but, hey, you'd have to write an explanation at least a book long - if not more - for me to understand the issues well enough to follow a more detailed technical explanation.) I wonder, however, if you could send Xtraeme a U2U to check whether he is still working on the extraction problem, or post in this thread any further technical details that would save him time and effort (if those details would not already be clear to him from your summary so far).

I'd simply like to ensure Xtraeme doesn't waste any time reinventing the wheel when completing his work...

Anyway, back to your spreadsheet:



parnasse.com...

Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my site zipped.

I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...


First of all - WOW! What an improvement over the first draft spreadsheet (and - in some respects, e.g. an indication of the number of pages held in relation to each incident - over any index to the Project Bluebook files I've ever seen before!).

I only have access to Excel 2002 tonight and can only see just over 65,000 rows. I'll open your spreadsheet in Excel 2007 on another machine tomorrow. I presume that in Excel 2007 I'll see one row for each image (i.e. about 129,000 rows).

I'm not sure how many people have seen the potential of your spreadsheet yet - particularly when combined with one or two other UFO databases and tools. I may have to create a thread or two using your spreadsheet in combination with those other databases just to show what can now be done. That raises the issue of how to credit your work when I refer to your spreadsheet here and elsewhere - would you prefer I give your ATS username or your real name (in which case perhaps you could let me know your surname here or in a U2U)?

A couple of (relatively minor) issues, which I can probably resolve myself:

(1) At the moment, some of the images appear to have multiple rows in the spreadsheet (although I don't know yet whether this is simply an effect of my using Excel 2002 tonight). If I sort by Column A, I can see the same image number (with the same content in the remaining columns) commonly appears 3 or 4 times. In terms of the statistical problems I'd like to try to address using the spreadsheet, I'd need to eliminate that duplication (assuming that this issue remains when I use Excel 2007 tomorrow).

(2) Column E (which states the page number within the relevant file on an incident) does not appear to sort properly (again, using Excel 2002 tonight). For example, between "Page 1" and "Page 2" appear "Page 11", "Page 12", etc.

Thanks for all your work,

Isaac



posted on Aug, 7 2011 @ 05:46 PM
Dear IsaacKoi,
Thanks very much for drawing my attention to this Footnote resource. I had no idea it existed (embarrassingly) and will enjoy going through it. There's so much detail, time and effort put into all these documents by government officials, which says something in itself.

The best thing about so much information being released is that there is bound to be information in there which they didn't mean to release, via human error.

It's a shame some sort of wiki project, with timelines and direct links to each document, can't be set up to make it all much more coherent and accessible.



posted on Aug, 7 2011 @ 08:34 PM
Just want to give a quick update. I've got some bugs that I'm still trying to work out of the system, but progress is good. I also spent a couple of hours yesterday trying to figure out how they were generating the hash, but so far haven't figured it out, so I went back to the macro approach. Hopefully by tomorrow or the next day I'll have it finalized (I typically let it run during the evening to find out when and if the program breaks). The app has a simple interface: just a basic Windows application with a few menu buttons to start, stop, pause, save, load, and keep track of where it is in the extraction. All the information about what it's doing is piped to a file that's then 'tail -f'ed to a console.


I've also got it to a state where I can fish out the JavaScript metadata on the fly, but as a backup I'm going to use Jeff's database to double-check that the data is correct. If the JavaScript call fails I'll use the CSV as the final determiner of what the name will be. And that's about it at the moment!
edit on 7-8-2011 by Xtraeme because: (no reason given)



posted on Aug, 8 2011 @ 12:26 AM
Hey Xtraeme, FWIW, my metadata is not what you get on the side panel of the fancy Flash Footnote viewer; it's just what appears in the search output. You've got the annotations and a lot of other cool data accessible there. I looked at the hash thing too. Did you decompile the Flash? I can try that if you haven't. Is there a hashing function in the JS? I haven't downloaded all of it yet. They don't seem to be intentionally hiding stuff; the hashing seems to be there to make you join Footnote before you can download.

As far as the redundant data goes, that's probably my fault. I had a wonky wifi connection yesterday and had to restart my spider about 5 times when I lost connectivity. It writes to the file constantly; I'm sure I didn't miss anything, but I did suspect it would produce a few extra rows. The area around the 120000+ range was particularly painful! As for the sorting, that's a typical sorting error because it's an alphabetical sort. Make sure you're sorting numerically - of course, you'll have to get rid of the word "Page" first. I can remove that easily with a search and replace, and then it should sort numerically.
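The alphabetical-versus-numeric point is easy to demonstrate; here's a minimal Python illustration (not tied to Excel, just showing why "Page 11" sorts before "Page 2" and how stripping the prefix fixes it):

```python
pages = ["Page 1", "Page 11", "Page 12", "Page 2"]

# A plain (alphabetical) sort compares character by character, so
# "Page 11" lands between "Page 1" and "Page 2".
alphabetical = sorted(pages)
print(alphabetical)

# Comparing the numeric part as an integer gives the intended order.
numeric = sorted(pages, key=lambda p: int(p.split()[-1]))
print(numeric)
```

Any spreadsheet that sorts the column as text will show the same symptom, which is why removing the "Page " prefix (or sorting on the number alone) fixes it.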

As far as the dupes go, I can run it again - heh - and produce new redundancies... sometime.

Do you guys want the 1024x1024 images?

Isaac, you can credit the database to my real name (which many people know already) - Jeff Harrington.

Jeff
edit on 8-8-2011 by idealord because: (no reason given)



posted on Aug, 8 2011 @ 06:49 AM

Originally posted by Xtraeme
Just want to give a quick update. I've got some bugs that I'm still trying to work out of the system, but progress is good.


Hi Xtraeme,

Thanks for the update. It sounds like things are going well.

As I've said before, let me know if there is anything you need help with. (As I've also said before, my own technical abilities are limited so my main support will be creating another thread and inviting people to help you with anything that is proving time-consuming - as I did in the recent Canadian UFO thread).

All the best,

Isaac



posted on Aug, 8 2011 @ 07:02 AM

Originally posted by idealord
As far as the sorting stuff, that's a typical sorting error because it's an alphabetical sort. Make sure you're sorting numerically - of course you'll have to get rid of the Page word. I can get rid of that easily with a search and replace and it should sort numerically.


Hi Jeff,

No problem, even I can use the find and replace command in Excel (or Word...) to find "Page " and replace with nothing.



As far as the dupes, I can run it again - heh - and produce new redundancies...
sometime.


I wouldn't expect you to run it again. I'm sure someone will be able to tell me a way to eliminate duplicate lines in Excel. (I'll post a brief query on a relevant part of ATS - I think there is a computer or technical sub-forum...).
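For anyone doing this outside Excel, dropping duplicate rows by image ID is a one-pass job; a sketch in Python, keeping the first occurrence of each ID (the rows and column layout here are invented for illustration):

```python
# Rows shaped like the spreadsheet: (image ID, page label).
rows = [
    ("11443600", "Page 7453"),
    ("7199601", "Page 132"),
    ("11443600", "Page 7453"),  # duplicate from a restarted spider run
]

seen = set()
unique = []
for row in rows:
    if row[0] not in seen:  # first field is the image ID
        seen.add(row[0])
        unique.append(row)

print(len(unique))
```

The same "keep the first occurrence" idea is what Excel's built-in duplicate removal does when you key it on the ID column.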



Do you guys want the 1024x1024 images?


Mmm. Let's wait a bit and see whether Xtraeme gets his method working to download the higher-definition images.



Isaac, you can credit the database to my real name (which many people know already) - Jeff Harrington.


Thanks Jeff.
edit on 8-8-2011 by IsaacKoi because: (no reason given)



posted on Aug, 8 2011 @ 07:08 AM

Originally posted by kkrattiger
Here is the link.
Project Blue Book Smithsonian Astrophysical Observatory report from LC 1958


Hi kkrattiger,

Thank you for the interesting link.

In relation to the question that you posted on my profile as to whether this would be a suitable document with which to start your own thread, I think that would be fine - particularly if you include your reasons for finding it interesting.

I'd suggest resizing the image of this document that you included in your previous post, since it did not fit on the screen without scrolling across.

All the best,

Isaac



posted on Aug, 8 2011 @ 09:30 AM

Originally posted by idealord
All done!

parnasse.com...
...

It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...


Hi Jeff,

I'm looking at your spreadsheet again now on a machine running Excel 2007 - but I still only see about 65,000 rows (65,536 to be precise, which - from memory - is the same number I saw last night on a machine running Excel 2002).

How many rows should I be seeing?

(I'm assuming there should be one row for each page in the Project Bluebook collection on Footnote.com, i.e. just over 129,000).

All the best,

Isaac



posted on Aug, 8 2011 @ 09:43 AM
Hey Isaac,

I just loaded it into Excel Starter and it shows 124,741 rows, so it looks like Excel 2010 can handle it. In that zip file there's the raw data; you should be able to load that into Excel and tell it to delimit on the pipe - | - character (just the vertical bar). It's basically a CSV file, but you really can't comma-delimit spreadsheets when you have commas in the data, obviously...
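For reference, telling a program to delimit on the pipe looks like this with Python's csv module (the sample line is made up in the shape of the processed data; a real file object would replace the StringIO):

```python
import csv
import io

# One line in the shape of the pipe-delimited data; note the comma
# inside the city field, which is why comma-delimiting would break.
sample = "7199601|1968|July|Brooklyn, New York|Page 132\n"

# delimiter='|' is the csv-module equivalent of telling Excel to
# split columns on the pipe character.
reader = csv.reader(io.StringIO(sample), delimiter="|")
rows = list(reader)
print(rows[0])
```

"Brooklyn, New York" survives as a single field because the comma is just data once the pipe is the delimiter.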

Let me know if that doesn't work for you. I can split it into pieces if you want, also...

Jeff



posted on Aug, 8 2011 @ 09:49 AM

Originally posted by idealord
I can split it into pieces if you want, also...


Hi Jeff,

Since I rarely use Excel, I don't know how to get Excel to delimit on a character.

If you could split it into 2 or 3 pieces instead, that could save me a fair bit of time.

All the best,

Isaac



posted on Aug, 8 2011 @ 10:02 AM
The problem is that you won't be able to globally sort them if you have 2 different spreadsheets, and that'll limit your understanding of the pagination... What are you going to combine them into?

Here's a free spreadsheet program - and it's the one I'm using since I don't buy Microsoft products.

people.gnome.org...

projects.gnome.org...

It's stable and you can import CSV files and save Excel files, etc.

Jeff



posted on Aug, 8 2011 @ 10:23 AM

Originally posted by idealord
Problem is you won't be able to globally sort them if you have 2 different spreadsheets and that'll limit your understanding of the pagination... what are you going to combine them into?


Good point.


I'll get hold of something that can handle the bigger spreadsheet and will then download your zip file again. I'll get back to you on this soon.

All the best,

Isaac



posted on Aug, 11 2011 @ 08:51 AM

Originally posted by idealord
All done!

parnasse.com...
...

It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...


Hi Jeff,

I've been looking into ways to open the file in some version of Excel.

I've opened the txt file (in Notepad) and seen the mass of data, and I've opened it in Excel using the pipe character to delimit the columns. The good news is that I can now see the larger number of rows of data. The bad news is that this method seems to result in one column for the record number and just one column for all the remaining fields. In other words, the month, year, location and page number are not in separate columns. Looking at the txt file (sample below), I can't see the pipe used as a delimiting character except after the record number:



Sample from txt file:

"11443600|… [BLANK] » [BLANK] » [BLANK] » Page 7453
7199601|… 1968 » July » Brooklyn, New York » Page 132
11446642|"


Alternatively, Excel 2007 can handle the larger number of rows - as long as the file is in the right format. I've watched a video on converting files from the old xls format (used in your zip file) to the new Excel 2007 xlsx format, but the data remains at about 65,000 rows. I wonder if others have the same problem or whether you could save the spreadsheet in the xlsx format instead.



I could download the software you mention, but I'm struggling to get to grips with Excel so I'd rather avoid trying to pick up another piece of software...

All the best,

Isaac

edit on 11-8-2011 by IsaacKoi because: (no reason given)



posted on Aug, 12 2011 @ 03:19 AM

Originally posted by IsaacKoi
I've watched the video below on converting files from the old xls format (used in your zip file) to the new Excel 2007 xlsx format, but the data remains at about 65,000 rows.

If the file is an "xls" then there's no way he can have more than 65,536 rows.

I think the best way would be to save the file as "tab-delimited text"; that format can be read by all spreadsheet and database programs (and it's very easy to use in scripts or home-made programs).
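Converting the pipe-delimited raw data to tab-delimited text is a one-line substitution per row; a minimal Python sketch (assuming the fields themselves contain no tabs, which is true of this data as far as the thread describes it):

```python
# One pipe-delimited row in the shape of the raw data.
line = "7199601|1968|July|Brooklyn, New York|Page 132"

# Swap the pipe for a tab; tab-delimited text loads in virtually
# any spreadsheet or database program.
tsv = line.replace("|", "\t")
print(tsv.split("\t"))
```

Running that over every line of the raw file yields a .txt/.tsv file that Excel, Gnumeric, or any import script can open directly.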



posted on Aug, 12 2011 @ 03:50 PM
reply to post by idealord
 


I hope you don't mind, but I converted your raw data to a real (xlsx format) Excel 2007 spreadsheet, with 124,741 records.

It's available here (I hope it works).



posted on Aug, 13 2011 @ 02:30 AM
Hey Jeff, I just wrote a quick Perl script to convert the bluebook-raw-data.txt file into a CSV (downloadable here):



#!/usr/bin/perl -w
use strict;

if (exists $ARGV[0]) {
    open(my $inf, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @contents = <$inf>;
    close($inf);

    # Output file name: the input name with .csv appended.
    open(my $ouf, '>', "$ARGV[0].csv") or die "Cannot open output: $!";
    foreach my $line (@contents) {
        my $x = $line;
        $x =~ s/,/_/g;                     # protect commas inside fields
        $x =~ s/\x20\xC2\xBB\x20/,/g;      # " » " separator -> comma
        $x =~ s/\x7C\xE2\x80\xA6\x20/,/g;  # "|… " after the ID -> comma
        if ($x =~ m/Page 1\s.*$/i) {       # keep only "Page 1" rows
            print $ouf $x;
        }
    }
    close($ouf);
}

(seems the code tag didn't format it very nicely)

For the purposes of the scraper I only need to know enough information to figure out what to name the directory, so the script drops all the rows that don't indicate "Page 1". Scanning to the bottom of the newly generated CSV seems to indicate that there are a total of 10,464 documents. However, I've noticed a few of the reports, like the Tom's River, New Jersey sighting, aren't in the list. I don't want to duplicate your work, so would you be opposed to sharing the script you used to harvest the data? I just want to make sure we're not missing any other files, especially since I'll be using this to ultimately determine the directory names.

Thanks Jeff. =)




