A colleague recently sent me an email with four questions that he suggested were of most concern to both data management companies and their customers:
- Big Data Tools – What’s working today? What’s next?
- Big Data Storage – Do organizations have a manageable and scalable storage strategy?
- Big Data Analytics – How are organizations using analytics to manage their large volume of data and put it to use?
- Big Data Accessibility – How are organizations leveraging this data and making it more accessible?
These are bad questions.
I should be clear that the questions are not bad on account of the general concerns they are meant to address. Questions about tools, scalable storage, the ways in which data are analyzed (and visualized), and the availability of information are central to an organization’s long-term information strategy. Each of these four questions addresses a central concern with very significant consequences for the extent to which available data can be leveraged to meet not only current informational requirements but also future capacity. These concerns are good and important. The questions, however, are still bad.
The reason these questions are bad (okay, maybe they’re not bad…maybe I just don’t like them) is that they are unclear about their terms and definitions. In the first place, they imply that there is a separation between something called ‘Big Data’ and the tools, storage, analytics (here used very loosely), and accessibility necessary to manage it. In actual fact, however, there is no such ‘thing’ as Big Data in the absence of each of those four things. Transactional systems (in the most general sense, which also includes sensors) produce a wide variety of data, and it is an interest in identifying patterns in this data that has always motivated empirical scientific research. In other words, it is data, and not ‘Big Data’ that is our primary concern.
The problem with data as objects is that, until recently, we have been radically limited in our ability to capture and store them. A transactional system may produce data, but how much can we capture? How much can we store? For how long? Until recently, technological limitations have severely constrained our ability to capture, store, and analyze the immense quantities of data that are generated, and have meant working with samples, and using inferential statistics to make probable judgements about a population. In the era of Big Data, these technological limitations are rapidly disappearing. As we increase our capacity to capture and store data, we increasingly have access to entire populations. A radical increase in available data, however, is not yet ‘Big Data.’ It doesn’t matter how much data you can store if you don’t also have the capacity to access it. Without massive processing power, sophisticated statistical techniques, and visualization aids, all of the data we collect is for naught, pure potentiality in need of actualization. It is only once we make population data meaningful in its entirety (not sampling from our population data) through the application of statistical techniques and sound judgement that we have something that can legitimately be called ‘Big Data.’ A datum is a thing given to experience. The collection and visualization of a population of data produces another thing given to experience, a meta-datum, perhaps.
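The contrast between sampling-plus-inference and working with a whole population can be made concrete. Here is a minimal sketch (the transaction values and all the numbers are hypothetical, invented for illustration): the traditional approach estimates a population mean from a sample with a margin of error, while the ‘Big Data’ approach, given sufficient capture, storage, and processing capacity, simply computes over the whole population.

```python
import random
import statistics

# Hypothetical "population": one million transaction values.
random.seed(42)
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Traditional approach: draw a sample and use inferential statistics
# to make a probable judgement about the population mean.
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)
margin = 1.96 * sample_sd / (len(sample) ** 0.5)  # ~95% confidence interval

# "Big Data" approach: with the capacity to capture, store, and process
# everything, compute the population mean directly -- no inference needed.
population_mean = statistics.mean(population)

print(f"sample estimate:  {sample_mean:.2f} +/- {margin:.2f}")
print(f"population value: {population_mean:.2f}")
```

The point of the sketch is not that inference becomes worthless, but that the question changes: with the full population in hand, there is no sampling error to estimate in the first place.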
In light of these brief reflections, I would like to propose the following (VERY) provisional definition of Big Data (which resonates strongly, I think, with much of the other literature I have read):
Big Data is the set of capabilities (capture, storage, analysis) necessary to make meaningful judgements about populations of data.
By way of closing, I think it is also important to distinguish between ‘Big Data’ on the one hand, and ‘Analytics’ on the other. Although the two are often used in conjunction with each other, it is important to note that using Big Data is not the same as doing analytics. Just as the defining characteristic of Big Data above is increased access (access to data populations instead of samples), so too is increased access the defining characteristic of analytics. In the past, the ability to make data-driven judgements meant either having some level of sophisticated statistical knowledge oneself, or else (more commonly) relying upon a small number of ‘data gurus,’ hired expressly because of their statistical expertise. In contrast to more traditional approaches to institutional intelligence, which involve data collection, cleaning, analysis, and reporting (all of which take time), analytics toolkits quickly perform these operations in real-time, and make use of visual dashboards that allow stakeholders to make timely and informed decisions without also having the skills and expertise necessary to generate these insights ‘from scratch.’
Where Big Data gives individuals access to all the data, Analytics makes Big Data available to all.
Big Data is REALLY REALLY exciting. Of course, there are some significant ethical issues that need to be addressed in this area, particularly as the data collected are coming from human actors, but from a methodological point of view, having direct access to populations of data is something akin to a holy grail. From a social scientific perspective, the ability to track and analyze actual behavior, instead of relying on self-reporting about behavior on surveys, can give us insight into human interactions that, until now, was completely impossible. Analytics, on the other hand, is something about which I am a little more ambivalent. There is definitely something to be said for encouraging data-driven decision-making, even by those with limited statistical expertise. But consider the stakeholder confronted by pretty dashboards that are primarily (if not exclusively) descriptive, without the statistical knowledge to ask even basic questions about significance (just because there appears to be a big difference between populations on a graph, it doesn’t necessarily mean that there is one), and with no knowledge about the ways in which data are being extracted, transformed, and loaded into proprietary data warehousing solutions. I wonder about the extent to which analytics do not, at least sometimes, just offer the possibility of a new kind of anecdotal evidence justified by appeal to the authority of data. Insights generated in this way are akin to undergraduate research papers that lean heavily upon Wikipedia because, if it’s on the internet, it’s got to be true.
If it’s data-driven, it’s got to be true.
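The worry about eyeballing dashboard differences can be made concrete with a quick significance check. This is an illustrative sketch, with invented numbers and a simple permutation test (chosen because it needs nothing beyond the standard library), not anything from an actual analytics toolkit: two small groups whose means may look visibly different on a bar chart, and a test of how often a gap that large would arise by chance alone.

```python
import random
import statistics

# Hypothetical data: two small groups whose bars might look different
# on a descriptive dashboard chart.
random.seed(7)
group_a = [random.gauss(50, 10) for _ in range(12)]
group_b = [random.gauss(55, 10) for _ in range(12)]

observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

# Permutation test: repeatedly shuffle the pooled values and count how
# often a random split produces a mean difference at least as large as
# the one observed.
pooled = group_a + group_b
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:12]) - statistics.mean(pooled[12:]))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.1f}")
print(f"permutation p-value: {p_value:.3f}")
```

A gap that looks large on a chart can still come with a p-value well above any conventional significance threshold; the visual impression alone settles nothing.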
I’m not really happy with this diagram. Definitely a work in progress, but hopefully it captures the gist of what I’m trying to sort out here.