This article was originally published on CFO.com.
As Big Data grows bigger, more complicated, and more difficult to deal with, the choice of which data get retained and which discarded becomes pivotal.
Some years ago, around 2000, when I was running a research program on the then rapidly growing digital economy, I had a team working on estimating how much of the total of human experience was being captured and stored digitally. There were about 6 billion people in the world at the time, with maybe 1 billion “online” in some fashion. This was before smartphones (as we now think of them) and tablets, so there were lots of documents, emails, texts and images, but far less video. With some reasonable assumptions about usage, some fairly comprehensive secondary research on stored data volumes and some modeling, we came up with a number: around 1 percent of human experience was being captured digitally.
Back then, therefore, 99 percent of human experience was either being lost or was confined to human memory (not a great long-term storage mechanism, for accuracy or speed of recall). Roll the calendar forward to 2014 and we are creating and storing much more information. I recently saw an estimate of more than 350 gigabytes a year for every online individual, plus another 350 gigabytes of associated data — log files, account information and so on. With 3 billion online users by 2015, that’s a lot of data — no wonder it’s called “Big”!
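As a rough back-of-the-envelope check on those figures, the sketch below multiplies the quoted per-user volumes by the projected user count. It assumes decimal units (1 gigabyte = 10^9 bytes) and that every online user matches the per-user estimate. Both are simplifications, and the variable names are purely illustrative.

```python
# Back-of-the-envelope total, using only the figures quoted in the text.
GB = 10**9   # assumption: decimal gigabytes
ZB = 10**21  # decimal zettabytes

personal_gb_per_user = 350      # data created per online individual per year
associated_gb_per_user = 350    # log files, account information and so on
online_users = 3_000_000_000    # projected online users by 2015

total_bytes = (personal_gb_per_user + associated_gb_per_user) * GB * online_users
total_zb = total_bytes / ZB

print(f"{total_zb:.1f} zettabytes per year")  # prints "2.1 zettabytes per year"
```

Under those assumptions the total comes to roughly 2 zettabytes a year, before counting any duplicate copies.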
And it’s growing every year as we collect more of each individual’s experience and add more connected users. Back in 2000, when we were doing the original research, we estimated that at that rate of growth, even if we used just one atom to store each “bit” of data and there were no duplicate copies, we’d run out of atoms sometime before 2020, and as far as I can tell, we’re still on track to do so. All those atoms cost money and can’t be used for anything else, so left alone, Big Data will eventually “eat the entire world” or at least the entire budget.
But the total of human digital experience isn’t the sum of each individual’s. There are masses of duplication: intersecting and overlapping viewpoints, and “shared” experiences where viewpoints may differ but the view is the same for everyone. As challenging as it may seem (and will be), we’re going to have to start reducing the amount of duplication in the stored data (or get more bits per atom).
And then there are all the things that don’t change much or at all from moment to moment (or day to day, or year to year). If the view is always the same, we can “edit” it out of the data and replace it with a “tag” that links each view to the first (that’s essentially how compression software works to reduce the size of large files). If we do this well, we can reduce the size of the stored data by more than 80 percent and still keep enough to recreate every scene as it actually happened from the viewpoint of everyone who was involved.
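The “tag” idea can be sketched in a few lines. What follows is a minimal illustration of deduplication, not a description of how any particular compression product works: each distinct view is stored once, and every repeat is replaced by a tag (here a SHA-256 hash, chosen as an assumption) that points back to the stored copy. The function names and the sample “views” are made up for the example.

```python
import hashlib

def deduplicate(views):
    """Store each distinct view once; replace repeats with a tag for the stored copy."""
    store = {}   # tag -> the single stored copy of that view
    tagged = []  # one tag per original view, in recording order
    for view in views:
        tag = hashlib.sha256(view).hexdigest()
        store.setdefault(tag, view)  # keep only the first copy seen
        tagged.append(tag)
    return store, tagged

def recreate(store, tagged):
    """Rebuild the full sequence of views from the tags."""
    return [store[tag] for tag in tagged]

# Illustrative data: a scene that barely changes from moment to moment.
views = [b"empty street"] * 8 + [b"car passes", b"empty street"]
store, tagged = deduplicate(views)

assert recreate(store, tagged) == views  # nothing was lost
print(len(views), "views recorded,", len(store), "stored")
```

The tags themselves take up space, so the saving only pays off when the views are much larger than the tags, which is the case for images and video.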
(OK, for the math purists among you, I know some things get larger when processed by “lossless” compression algorithms, but in the real world, where there’s a lot of “well behaved” and static data, the reduction percentage is a pretty good target.)
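For anyone who wants to see both halves of that caveat, here is a small sketch using Python’s standard `zlib` module. The sample data is invented for the illustration: one highly repetitive “static” input and one pseudo-random “noisy” input of the same length. The exact sizes depend on the zlib build, so none are quoted here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Illustrative inputs: repetitive data versus pseudo-random data of equal length.
static = b"the same view, again and again. " * 1000
noisy = bytes(rng.getrandbits(8) for _ in range(len(static)))

for name, data in [("static", static), ("noisy", noisy)]:
    packed = zlib.compress(data)
    assert zlib.decompress(packed) == data  # lossless: the original comes back exactly
    print(name, len(data), "->", len(packed), "bytes")
```

The repetitive input shrinks to a small fraction of its size, while the random input comes out slightly larger than it went in.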
So we can probably push out the day when we will run out of atoms to store our bits by a couple of decades (maybe), but sooner or later we are going to run out. At which point “curation” strategies will become really important. Just what should be kept? Who and what gets edited out and essentially forgotten?
If this seems unfair or unreasonable, remember that throughout history almost everything that happened has been forgotten. Only a tiny fraction of the total of human experience made it from generation to generation — especially before the invention of the printing press. Something had to be pretty important to get remembered — and even so, plenty of pretty important things weren’t. Many great ideas were lost and had to be rediscovered — and it’s probable that some remain lost to this day.
So curation will matter. So will who gets to be a curator.
And then there’s the time factor. The closer we get to recording all of human experience, the less time there is to go back and review what we recorded. Today, we can use the huge gaps in the total recorded and stored experience to watch what happened to others, real or imagined. But at close to 100 percent experience capture, there’ll be no time to do so. We will only ever be living forward. And if we can’t ever go back and review the past (because by doing so we will miss being part of the present), why bother to record everything in the first place?
Finally, there’s entropy — which you can think of as the propensity for organized things to self-randomize over time. The more bits we store, the more bits will be randomly flipping from one to zero or vice versa, unless we watch them to make sure they don’t. (We have to keep adding orderliness to the total system to counter the inclination to randomness.) But the more bits we store, the more time we need to check for errors and the less time we have to do so. At some point, we’re going to be doing damage just with the checking process, which is also part of the entropic environment. Eventually, if the curators don’t delete you, entropy will.
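“Watching” the bits usually means keeping a checksum alongside the data and recomputing it from time to time. The sketch below uses CRC-32 from Python’s standard `zlib` module as a stand-in; that choice is an assumption for illustration, and real storage systems use a variety of error-detecting and error-correcting codes. The record contents and the flipped bit position are invented.

```python
import zlib

def checksum(data: bytes) -> int:
    """CRC-32 of the data, used here as a simple integrity check."""
    return zlib.crc32(data)

record = bytearray(b"a small piece of recorded experience")
stored_checksum = checksum(bytes(record))  # saved when the record was written

record[5] ^= 0b00000100  # simulate a single bit flipping at rest

corrupted = checksum(bytes(record)) != stored_checksum
print("corruption detected:", corrupted)  # prints "corruption detected: True"
```

Note that the check has to read every byte of the record, which is exactly the cost that grows as the amount of stored data grows.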
The Big Data frenzy we are experiencing today is just the tip of the iceberg. It’s going to get bigger, more complicated and more difficult to deal with. Not a pretty picture, even with all the claimed benefits.
And always remember Sturgeon’s Law: 90 percent of everything is, in general, crud. Which specific 90 percent depends on your point of view. Better start training as a curator, so that you get to decide which points of view matter.