Originally posted by idealord
Don't be so quick to declare a complete victory, although I think you are pretty darn close in terms of extracting information...
Apart from the small matter of also downloading all the images - okay, not so small (although hopefully Xtraeme is fairly close to sorting this
problem out), I have a few relatively minor issues with your draft spreadsheet.
In terms of Xtraeme's approach to downloading the images themselves and the fact that he
very recently mentioned he was working on extracting the metadata
associated with each image, I presume that your work basically solves that extraction issue. I'm not sure that your explanation of your approach is
sufficient for Xtraeme to duplicate the extraction process and save himself some effort on this problem. (It's certainly not enough for me to duplicate your work
but, hey, you'd have to write an explanation that was the length of at least one book - if not more - for me to be able to understand the issues
sufficiently to follow a more detailed technical explanation). I wonder, however, if you could send Xtraeme a U2U to check if he is still working on
the extraction problem or post in this thread any more technical details that will save him any time and effort (if those details would not be
sufficiently clear to Xtraeme from your summary so far).
I'd simply like to ensure Xtraeme doesn't waste any time reinventing the wheel when completing his work...
Anyway, back to your spreadsheet:
Originally posted by idealord
Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my
I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet
program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read
loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as
pretty much anything now...
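For anyone wanting to try the same clean-up without a spreadsheet program, here is a minimal Python sketch of the "replace the bad binary characters with | symbols" step. It is only a guess at the approach: the definition of a "bad" character (anything outside printable ASCII), the collapsing of consecutive bad bytes into a single |, and the sample record are all assumptions, not details taken from the actual file.

```python
import re

# Assumption: a "bad binary character" is any byte outside printable ASCII.
# Runs of such bytes are collapsed into one "|" (also an assumption).
BAD_BYTES = re.compile(rb"[^\x20-\x7e]+")

def clean_record(raw: bytes) -> list[str]:
    """Replace each run of non-printable bytes with '|' and split into fields."""
    cleaned = BAD_BYTES.sub(b"|", raw)
    # After the substitution only printable ASCII remains, so decoding is safe.
    return [field.strip() for field in cleaned.decode("ascii").split("|")]

# Made-up record, purely for illustration.
sample = b"12345\x00\x01Dayton, Ohio\x02Page 3"
print(clean_record(sample))
```

The resulting lists could then be written out with the standard csv module using | as the delimiter and opened in any spreadsheet program.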
First of all - WOW! What an improvement over the first draft spreadsheet (and - in some respects, e.g. an indication of the number of pages held in
relation to each incident - over any index to the Project Bluebook files I've ever seen before!).
I only have access to Excel 2002 tonight and can only see just over 65,000 rows. I'll run your spreadsheet on Excel 2007 on another machine
tomorrow. I presume that on Excel 2007 I'll see one row for each image (i.e. about 129,000 images/rows).
I'm not sure how many people have seen the potential of your spreadsheet yet - particularly when combined with one or two other UFO databases and
tools. I may have to create a thread or two using your spreadsheet in combination with those other databases just to show what can now be done. That
raises the issue of how to credit your work when I refer to your spreadsheet here and elsewhere - would you prefer I give your ATS username or your
real name (in which case perhaps you could let me know your surname here or in a U2U)?
A couple of (relatively minor) issues, which I can probably resolve myself:
(1) At the moment, some of the images appear to have multiple rows in the spreadsheet (although I don't know yet whether this is simply an effect of
my using Excel 2002 tonight). If I sort by Column A, I can see the same image number (with the same content in the remaining columns) commonly
appears 3 or 4 times. In terms of the statistical problems I'd like to try to address using the spreadsheet, I'd need to eliminate that duplication
(assuming that this issue remains when I use Excel 2007 tomorrow).
(2) Column E (which states the page number within the relevant file on an incident) does not appear to sort properly (again, using Excel 2002
tonight). For example, "Page 11", "Page 12", etc. appear between "Page 1" and "Page 2".
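Both issues can probably be handled outside Excel as well. The sketch below is a rough illustration only: it assumes the rows have been exported as lists of strings, that duplicates are exact copies of a whole row, and that the page labels look like "Page 11". The sample rows and the function names are made up. The sort problem in (2) is what happens when labels are compared as text, so the fix is to sort on the number inside the label.

```python
import re

def drop_duplicate_rows(rows):
    """Keep the first copy of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def page_number(label):
    """Pull the integer out of a label such as 'Page 11' (0 if none found)."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0

# Made-up rows: image number in the first column, page label in the fifth.
rows = [
    ["img001", "a", "b", "c", "Page 1"],
    ["img002", "a", "b", "c", "Page 11"],
    ["img002", "a", "b", "c", "Page 11"],  # exact duplicate
    ["img003", "a", "b", "c", "Page 2"],
]

unique = drop_duplicate_rows(rows)
ordered = sorted(unique, key=lambda row: page_number(row[4]))
print([row[4] for row in ordered])
```

In Excel itself, the rough equivalent would be to add a helper column holding just the page number as a number and sort on that column instead of Column E.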
Thanks for all your work,