This article was originally published on CFO.com
I wonder if we are pushing the power of analytics too hard and too far.
Tools for handling and analyzing very large amounts of unstructured data are increasingly “mainstream” in both research and enterprise IT operational use. What they make possible is fascinating – and the potential to “mine” these huge pools of data (which I suppose we should call “lakes” or even “oceans” if they’re large enough) is significant.
One of the well-respected firms we use for insights into emerging technology fields — especially very early-stage trends in the startup and venture-backed communities — recently announced that it was turning its analytic tools on the broad mass of general media content. Its thesis is that analyzing what’s being talked about in the mass media (pretty much all of which is readily available online) will help it predict what’s going to happen in the future. Sort of the ultimate in crowdsourcing.
That worries me.
The first worry is the generally terrible signal-to-noise ratio in mass media content. Stories are not always complete or entirely accurate, even when they’re adequately fact-checked. They’re “sized to fit,” even if that means leaving relevant things out. It’s often difficult to tell the difference between reporting, analysis, and opinion, and there’s often a lot of “padding” required by editorial policy and legal review.
The second worry is the almost total lack of curation in mass media. Just because an item appears in a lot of places (unlike citation analytics, where frequency can be highly informative) doesn’t mean it’s important or relevant or even popular. Mostly it means that someone thought it was “sensational” and no one wanted to be left out of the sensation.
The third concern relates to provenance, which in this context is the weighted reliability of the source — both the publisher and the author. Media bias exists, and it is not always obvious, because you don’t get to see what doesn’t get “printed.” Consistent bias can be accounted for, but hidden bias is more difficult to identify. And while bias might actually be a useful predictive element in some contexts, I worry that its general impact on content is not well enough understood.
If you’re familiar with the Gartner “hype cycle,” you’ll recognize these concerns. Without knowing where you are in the lifecycle of an idea in the mass media, it’s difficult to know whether you’re seeing the early “hype” or the dismissal of an idea because its promise has yet to be realized. This happens a lot with mass media topics too, often very rapidly. What weight should be given to the persistence of ideas, rather than their breadth of appeal — or at least of appearance?
Mostly I wonder if we are pushing the power of analytics too hard and too far. Prediction works because the near future is roughly the same as the recent past in most places, most of the time. Where that’s not the case, prediction generally fails. There are techniques (weak signals, scenario analysis, etc.) that do better than predictive analytics in these circumstances, but some elements of the future are inherently unpredictable no matter how much data we have or how good our algorithms are at explaining it.
Do I expect that we will learn useful things from intelligent analysis of mass media content? Absolutely. Some aspects of the future are already here, but, to quote William Gibson, they’re not yet evenly distributed. Maybe smart analytics can find them and get them more evenly spread out faster than would otherwise occur. That would certainly be a valuable capability.
But let’s not get trapped into believing that we’ll get all the answers we’d like from this particular crowd, or that the answers we do get will all be reliable and usefully predictive. As Yogi Berra is famously (but possibly incorrectly and almost certainly not originally) supposed to have opined, “prediction is hard, especially about the future.”