Hi all, I wanted to start a new thread on this, but am too new, so thought I'd post in this most recent thread.
I dislike the way the 'webbot' people charge for their reports. The reports are basically their analysis of their own data, and from what I
understand you can't get the raw data to perform your own analysis, even if you pay.
We know roughly how the system works: it trawls many forums/news sites gathering data, the data gets crunched by their software, and the guys add
their layer of interpretation on top. I know he dresses it up with "special sauce" descriptions and buzzwords like "Prolog" and "linguistic
analysis", but I don't buy that it's anything the rest of us couldn't do.
I've seen a number of people on ATS say they have programming skills, and so do I, so I think we should team up and make our own version. My
proposal would be an Open Source program, so people can see and judge how it really works, and if they think they have a better method they can fork
it or submit patches. The collated data should be open too, so that anyone with analysis skills can have a go at providing their own interpretation.
I already have a basic prototype I could provide to interested programmers after some clean-up (at this stage it's nowhere near end-user ready). My
prototype essentially works like this (a rough shell sketch follows the list):
1. A text file of URLs is loaded, and each URL is downloaded and dumped to a file.
2. HTML tags and other junk are removed from that file.
3. Another pass breaks it up into words, and as it goes it outputs the current word plus the current phrases (every 2-6 word window ending at the
current word). This is dumped to an even bigger file.
4. This file is sorted, unique phrases are extracted, and counts are assigned.
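To make that concrete, here is a rough shell sketch of the four steps (not my actual prototype code, which still needs cleaning up, but the same shape; file names like urls.txt are just placeholders):

#!/bin/bash
# Rough sketch of steps 1-4 above. Assumes urls.txt holds one URL per line.

# 1. Download each URL and append it to one raw dump.
while read -r url; do
    wget -q -O - "$url" >> raw.html
done < urls.txt

# 2. Strip HTML tags (crudely) and squeeze all whitespace to single spaces.
sed -e 's/<[^>]*>/ /g' raw.html | tr -s '[:space:]' ' ' > clean.txt

# 3. One word per line, then emit every 2-6 word phrase ending at each word.
tr ' ' '\n' < clean.txt | grep -v '^$' | awk '
{
    words[NR] = $0
    print $0
    for (n = 2; n <= 6 && NR >= n; n++) {
        phrase = words[NR - n + 1]
        for (i = NR - n + 2; i <= NR; i++)
            phrase = phrase " " words[i]
        print phrase
    }
}' > phrases.txt

# 4. Sort, count unique phrases, and list the most frequent first.
sort phrases.txt | uniq -c | sort -rn > report-$(date +%F).txt

In the real version step 3 is where the custom C code comes in, since doing it in awk on a big dump gets slow and memory-hungry.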
This leaves a final report of the most-used phrases found for that time period (I ran it daily during my little test). Once you have more than one
report, you can do some analysis by comparing them; see the sketch below. At first I thought that things like "us" vs "US" would be an issue, but if
you compare the percentage increase/decrease between reports, words such as "the" drop out, as they tend to make up a pretty consistent percentage
of the text once you get a big sample. A word like "us" will drop out too, and only jump out if people start using "US" as the country a lot for
some reason.
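For example, the comparison could be as simple as something like this (compare_reports is just a hypothetical helper I haven't written into the prototype yet; it reads two of the "count phrase" reports produced above):

compare_reports() {
    awk '
    FNR == 1 { file++ }
    {
        cnt = $1
        phrase = $2
        for (i = 3; i <= NF; i++) phrase = phrase " " $i
        if (file == 1) { t1 += cnt; c1[phrase] = cnt }
        else           { t2 += cnt; c2[phrase] = cnt }
    }
    END {
        for (p in c2) {
            prev = (p in c1) ? c1[p] / t1 : 0
            cur  = c2[p] / t2
            # change in share, in percentage points: steady words like
            # "the" land near zero, genuine spikes float to the top
            printf "%.4f %s\n", (cur - prev) * 100, p
        }
    }' "$1" "$2" | sort -rn
}

# e.g. compare_reports report-2009-07-22.txt report-2009-07-23.txt | head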
What I have coded so far is really just a quickish hack: bash script (with lots of tr/sed/grep/uniq/sort/wget/etc. action), plus some custom C
utilities for the more intensive stuff. So at this stage you would need to either be a Linux user, or have a decent MinGW system set up, to play
with the code.
So, anyone up for it? Capable programmers would be the first people needed to get the project up and running, but
statisticians/analysts/linguists would also be good to help guide the way the system works.