Hopefully, I can add some insight to the situation that will further inspire fear, anger, and paranoia in the average person previously unconcerned
and unaware of the conspiracy of blanket government spying that has been discussed here for years.
The Wired article is correct about the meta data, but it leaves out some important information -- meta data is more valuable than the actual content
of phone calls and email. Yes. It's true.
I've recently been thrown into the world of "big data analysis" and things like massive Hadoop clusters; volume, velocity, variety (3-Vs); Flume
feeds, pattern matching, extraction and on and on. For a look at one private sector company, funded by US taxpayer dollars through venture capital
arms of the nation's covert agencies, check out www.recordedfuture.com -- using big-data tools to try to predict the future.
Meta data is structured information. This means field values across different datasets will either be identical, or so similar as to need only
a minor adjustment to be normalized. Time. Location. Phone number. Phone owner. Length of call. Number frequency. All these and more are highly
structured pieces of data, and the more you have from different sources, the greater the possibility of deep and meaningful insights.
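
To make the "minor adjustment" point concrete, here is a rough sketch of normalizing call records from two sources into one layout. Everything in it is made up for illustration: the carrier names, the field names, the sample numbers, and the assumption that all numbers are US-style and all timestamps are already in UTC. It is not how any real system does it, just the general idea.

```python
from datetime import datetime, timezone

def normalize_number(raw):
    """Reduce a US-style phone number to its 10 digits (assumption: US numbers only)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    # Drop a leading country code "1" if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def normalize_time(raw, fmt):
    """Parse a timestamp in a known format and label it as UTC (assumption: source is UTC)."""
    return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc).isoformat()

# Two hypothetical sources describing the same kind of event in different layouts.
carrier_a = [{"caller": "(555) 010-1234", "start": "2013-06-07 14:05:00", "secs": 62}]
carrier_b = [{"from_no": "+1 555.010.1234", "ts": "07/06/2013 18:30", "duration_min": 3}]

def from_a(r):
    return {"number": normalize_number(r["caller"]),
            "time": normalize_time(r["start"], "%Y-%m-%d %H:%M:%S"),
            "seconds": r["secs"]}

def from_b(r):
    return {"number": normalize_number(r["from_no"]),
            "time": normalize_time(r["ts"], "%d/%m/%Y %H:%M"),
            "seconds": r["duration_min"] * 60}

# After normalization, records from both sources share one schema.
records = [from_a(r) for r in carrier_a] + [from_b(r) for r in carrier_b]

# Group by the normalized number, which is now a reliable join key.
by_number = {}
for rec in records:
    by_number.setdefault(rec["number"], []).append(rec)
```

A few lines of string cleanup per field are enough to line up both sources on the same number, which is the whole point: structured fields are cheap to normalize.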
Call content, email content, tweets, text messages, web page content, etc. are all unstructured data -- language. While we might have a billion
phone call transcripts in English, each will be formatted differently, use different syntax/slang, have varying levels of literacy, and so on.
Normalizing this unstructured data for big-data analysis requires expensive (in terms of processing power) and slow natural language processing.
Keyword and entity extraction can be performed fairly well, but even then, the all-important context or meaning will be missing.
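
A toy illustration of the missing-context problem, assuming a deliberately naive keyword extractor (the stopword list and the sample sentences are made up, and real NLP pipelines are far more sophisticated than this):

```python
import re

# Hypothetical stopword list; many simple extractors discard negations like "never".
STOPWORDS = {"i", "will", "never", "the", "at", "a", "to"}

def keywords(text):
    """Return the set of non-stopword tokens in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

first = "I will meet the source at the airport"
second = "I will never meet the source at the airport"

# Both sentences reduce to the same keywords even though they mean opposite things.
same_keywords = keywords(first) == keywords(second)
```

The extraction itself works, but the two sentences become indistinguishable, which is why content needs the slow, expensive processing and meta data does not.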
Pattern-matching meta data in trillion-record datasets is the big-data analyst's wet-dream. It's highly plausible that such a system, with all data
from everywhere, could identify that a new burner phone is being used by criminals or terrorists within 2-3 days using just the meta data. The same
pattern matching could also identify that a burner is being used by a journalist -- or -- using the journalist's known number, determine what
potentially damaging story a journalist is currently investigating. Once pattern-matching flags something, then the available unstructured data is
Oh, and don't presume that the typical methods of encrypting or protecting your privacy will thwart this. Even if you use Tor for browsing, and the
NSA/FBI takes an interest in what sites you visit -- if they have your ISP logs and those of upstream Internet providers (it has been reported that
they do), it only makes it about 5% harder to nail down where you went online. Seriously.
Of course, all of this relies on having all available data from all sources, constantly updated in real time. Missing pieces like
Qwest/CenturyLink are a real problem for bringing it all together. So if anything, it's not as bad as it could be... yet.