Predictive analytics are not social science: A common misunderstanding with major consequences for higher education

This is the second in my series on common misunderstandings about predictive analytics that hinder their adoption in higher education. Last week I talked about the language of predictive analytics. This week, I want to comment on another common misconception: that predictive analytics (and educational data mining more generally) is a social science.

I began my college journey as a musician. I played jazz and classical guitar, and received several scholarships in support of a music degree. During my first semester in school, a friend introduced me to the music of Rage Against the Machine … and I had absolutely no idea what I was listening to. I had no real frame of reference, and no way to immediately make sense of it. Rage would become among my favorite bands of all time, but it took a while before I ‘got it.’

I eventually earned a master’s degree in sociology, and then worked as a market research consultant before returning to school for a doctorate in philosophy. During this time, I worked a part-time job supporting faculty and students in the use of instructional technology. One day I was asked by the school’s deputy CIO to complete a literature review of the then-nascent field of learning analytics. This marked my first exposure to the world of data science, and an experience that was not dissimilar from my introduction to Rage Against the Machine. As a social scientist, I had a pretty good sense of how statistics worked and what a model was. Machine learning was completely foreign, and it was easy to become confused when trying to understand it strictly in light of my previous training.

I use this example from my own life to illustrate what I think is a pretty common point of confusion at institutions that are just beginning to dabble in predictive analytics. Regardless of your background, it is pretty likely that you have taken a statistics course at some point in your academic career. Especially if you took this course through a sociology or psychology department, chances are that you will enter into conversations about predictive analytics with a set of assumptions that are heavily informed by this background.

But data science is not social science research.

In the social sciences, we go through a process that seeks to describe actual relationships in the world. We test hypotheses about concepts by operationalizing them in the form of variables that can be measured through things like survey instruments. The statistical correlations that we discover suggest causal relationships in the world as it is. At its most positivist extreme, the aim of social scientific research is to predict and control behavior through a complete explanation of the conditions that determine it.

In contrast, data scientists qua data scientists have zero interest in explaining how the world works. They are not interested in causal factors, and they are not interested in creating models of the universe. Instead, data scientists are interested in developing systems that are optimized to achieve particular outcomes. They use statistics, but they also use a wide variety of other algorithms. They are interested in classification rather than causation.

In early conversations with institutions about predictive analytics, many of the questions I hear are clearly informed by a social scientific lens. People want to ask which is the most important predictor of student success, as if the predictive modeling process were establishing correlations between specific factors and learner outcomes. In data science, we can talk about the extent to which particular variables account for variation in the model, but explaining variance in a predictive model is not the same as contributing to a causal model about actual student behavior.

Let’s take a hypothetical example. Consider a group of students about whom we know five things:

1. High school GPA
2. Course attendance
3. Time spent in the LMS/VLE
4. Zip code
5. Whether they ate pizza for lunch yesterday

If a social scientist were to approach the problem of identifying students at risk of dropping out, they might begin with a model (based on prior research) that sees success in university as a function of socioeconomic status, effort, past knowledge, and social engagement. They would then map each of these factors to available variables. Our pizza variable doesn’t clearly align with any part of our model, and so would likely be discarded.

The social scientist is interested in explanation, and so is deeply concerned with what data mean. The data scientist, on the other hand, doesn’t really care. If the goal is to produce a model that accurately classifies students into one of two groups (those who pass and those who fail), and including the pizza variable results in a model that classifies students more accurately, then there is no reason not to include it.
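
To make the data scientist's decision rule concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the student records are invented toy data, the nearest-centroid classifier is just a stand-in for whatever algorithm a real system would use, and the names (`fit_centroids`, `holdout_accuracy`, `keep_pizza`) are hypothetical. The only point is the rule itself: a feature is kept if it improves accuracy on students the model has not seen, whether or not anyone can explain why.

```python
from statistics import mean

# Invented toy records (NOT real data): high school GPA, whether the student
# ate pizza for lunch yesterday (1/0), and whether they passed (1/0).
TRAIN = [
    {"gpa": 3.6, "pizza": 1, "passed": 1},
    {"gpa": 3.0, "pizza": 1, "passed": 1},
    {"gpa": 2.6, "pizza": 1, "passed": 1},
    {"gpa": 3.2, "pizza": 0, "passed": 1},
    {"gpa": 2.4, "pizza": 0, "passed": 0},
    {"gpa": 2.8, "pizza": 0, "passed": 0},
    {"gpa": 3.1, "pizza": 0, "passed": 0},
    {"gpa": 2.1, "pizza": 1, "passed": 0},
]
HOLDOUT = [
    {"gpa": 2.7, "pizza": 1, "passed": 1},
    {"gpa": 3.0, "pizza": 0, "passed": 0},
    {"gpa": 3.5, "pizza": 1, "passed": 1},
    {"gpa": 2.2, "pizza": 0, "passed": 0},
]


def fit_centroids(rows, features):
    """Compute the mean of each selected feature, separately for each class."""
    centroids = {}
    for label in {r["passed"] for r in rows}:
        members = [r for r in rows if r["passed"] == label]
        centroids[label] = [mean(r[f] for r in members) for f in features]
    return centroids


def predict(centroids, row, features):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


def holdout_accuracy(train, holdout, features):
    """Fit on the training rows, then score on rows the model has not seen."""
    centroids = fit_centroids(train, features)
    hits = sum(predict(centroids, r, features) == r["passed"] for r in holdout)
    return hits / len(holdout)


# The decision rule: keep the pizza variable only if it classifies better.
base_acc = holdout_accuracy(TRAIN, HOLDOUT, ["gpa"])
pizza_acc = holdout_accuracy(TRAIN, HOLDOUT, ["gpa", "pizza"])
keep_pizza = pizza_acc > base_acc
print(base_acc, pizza_acc, keep_pizza)
```

On this made-up data the pizza variable happens to improve held-out accuracy, so it stays in. Nothing in the code asks what pizza means, and nothing in the result says that pizza causes anything.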

Viewed through the eyes of a social scientist, a predictive model of student success might look pretty strange. Most people, regardless of their social scientific training, are inclined to ask why certain features are more predictive than others, in order to incorporate this information as part of an explanatory model of student behavior. But this is a temptation that needs to be avoided.

Predictive and social scientific models may look similar, including similar variables that have similar levels of explanatory power, but they are doing very different things, and confusing prediction with explanation could result in policy decisions with unpredictable results. If the pizza variable turned out to be predictive, for instance, it would not follow that serving pizza at lunch would keep students enrolled.

In short, if we have a state that we want to predict (like a student failing a course), and we know what we would do if we were capable of predicting that state with a sufficiently high degree of accuracy, this is exactly the kind of situation that predictive analytics can help us with. When it comes to predictive modeling, what should concern us is the output, and not the model itself (except for methodological and ethical reasons). Beyond output and action, we should exercise extreme caution when trying to interpret the relative weight of the inputs to a predictive model. We should look for things like latent bias, to be sure, but avoid the strong temptation to use models as a way of describing the world, and be wary of vendors who build technologies that encourage institutions to use them in this way.


Also published on Medium.