Thursday, November 3, 2016

Looking back, predicting the future

Real-estate agents try to profit from the controversial candidates. 
Image source: fox13now.com


The morning after the Brexit vote in June, I woke up and went online to see the results. I wasn’t really worried and was frankly looking forward to getting this over with. For me, it was a question of common sense that our fellow Europeans should vote “remain”. Unless you live without any access to modern or traditional media (in which case you would not be reading this), you know that I was wrong.
What was surprising to me was how many professional forecasters and pollsters came to the wrong conclusion as well.
The explanation is short enough: nobody can look into the future. Therefore, an election forecast, which is basically predicting future behaviour, has to be at least slightly off in one direction or another.
But there is a complicated backstory to this answer, and it has to do with both how the data is collected and what happens to it afterwards. These two processes, data collection and data treatment, are usually not conducted by the same people. Data collection is managed by the "pollsters", professionals who collect the data for newspapers, election campaigns or universities. Predicting election outcomes based on these data is the work of the "forecasters". The daily work of a forecaster is to create a model of future voting behaviour based on polling data on the one hand and on what we know about past voting behaviour on the other.
Let us pretend you are trying to predict the outcome of the American presidential election, which is only a few days away. For simplicity, you will take the role of both the pollster and the forecaster. 
Conducting data collection for an election poll is hard work, and much more complicated than tweeting out a short questionnaire to a few (thousand) followers. Polls, much like typical social science questionnaires, are designed to capture what people think, how they act or – in the case of voting – how they intend to act based on their beliefs and the available options. But who are these “people”? In national polls, the researchers intend to access a representative selection of members from different groups – women, men, minorities, rural or urban citizens – who are likely to vote. There is the first difficulty: considering that you would need a meaningful sample of each group, how big will your final data set be, and what kind of data does it contain? And what exactly is a good measure of “likelihood to vote”?
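To get a feeling for the "how big" question, here is a rough back-of-the-envelope sketch. It uses the textbook sample-size formula for a proportion and assumes simple random sampling within each group, which real polls rarely achieve. The function name, the 95% confidence level and the number of groups are my own illustrative choices, not something a particular pollster uses.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96):
    """Respondents needed so that a proportion p is estimated within
    +/- margin at roughly 95% confidence (z = 1.96).

    Assumes simple random sampling; p = 0.5 is the most cautious choice.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Hypothetical design: 16 demographic groups, each measured to +/- 5 points.
groups = 16
per_group = required_sample_size(0.05)
print(per_group, "respondents per group,", groups * per_group, "in total")
```

Every additional characteristic you want to split by multiplies the number of groups, which is why finely sliced subgroups quickly become too expensive to poll on their own.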
Young and older citizens voted systematically differently in the 2016 Brexit referendum.
Image source: catholicherald.co.uk.
Do you have the resources to send your questionnaire to women of Polish origin with a college degree, religious, married, with children, living in bigger cities AND to women with the same characteristics, but living in the countryside? Which state are we talking about? What about those without children? And when is a group too small to be taken into account? What exactly are you planning to ask? Did you ask for voting intention or just general support of one candidate over another? Or rather one party over another? What about third-party candidates? At this point, you should say a prayer of thanks that you don’t have to do this in Belgium.
But there is more: Is it possible that you reached more voters for one specific candidate because they were easier to reach? Did you consider response biases, such as social desirability, that is, the willingness of respondents to confirm what you want to hear? Andrew Gelman explained in more detail why these factors are important for the forecasting result.
You might have successfully addressed some of these questions in one way or another and you obtained your data. Now, the forecaster takes over and the fun really begins. The process of predicting, or forecasting, election outcomes could be simplified to taking current data and past behaviour to predict the future. The first question is now: How do you decrease the biases in the sample? One way of doing so is to give more weight to some subsamples of data than to others. But what weight do you give to which respondents? For example, there might be more or less the same number of white, Christian, college-educated men as women in your polling data. However, men and women are not equally likely to actually leave the house and vote. You could attribute more weight to female voting intentions than male voting intentions, as more women than men have voted in past presidential elections.
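As a minimal sketch of what such weighting can look like, the snippet below scales each group's answers by an assumed turnout rate. All numbers are invented for illustration, and the two-group setup and the function names are my own simplification, not the procedure of any real forecaster.

```python
# Invented poll: equal numbers of women and men, with different levels of
# support for a hypothetical candidate A.
poll = {
    "women": {"respondents": 500, "support_a": 0.54},
    "men": {"respondents": 500, "support_a": 0.44},
}

# Assumed turnout rates. These are the forecaster's assumption, not poll data.
turnout = {"women": 0.63, "men": 0.59}


def unweighted_support(poll):
    """Candidate A's share if every respondent counts equally."""
    total = sum(group["respondents"] for group in poll.values())
    return sum(g["respondents"] * g["support_a"] for g in poll.values()) / total


def weighted_support(poll, turnout):
    """Candidate A's share among the respondents expected to actually vote."""
    expected = {name: g["respondents"] * turnout[name] for name, g in poll.items()}
    total = sum(expected.values())
    return sum(expected[name] * poll[name]["support_a"] for name in poll) / total


print(f"unweighted: {unweighted_support(poll):.3f}")
print(f"weighted:   {weighted_support(poll, turnout):.3f}")
```

Because the group assumed to turn out more often also leans more towards candidate A, the weighted estimate ends up slightly above the unweighted one. Change the turnout assumption and the estimate moves with it.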

Women and men participating in the US presidential elections.
Source: Center for American Women and Politics, Rutgers. 

But what if this year that isn’t true? It doesn’t look likely, but it is not impossible, so you might want to take movements specific to this election into account, such as Donald Trump's popularity amongst military veterans.
Assumptions like these are the basis of every model. And variations in these assumptions can be pretty powerful: The New York Times gave the same data set to four different forecasters and made a prediction themselves. The outcome varied between Clinton winning by four percentage points and Trump being one percentage point ahead. Based on the same data.
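To see how the same answers can lead to different headlines, here is a toy sketch with ten invented respondents and three hypothetical ways of deciding who counts as a likely voter. It illustrates the general mechanism only; it is not a reconstruction of what the forecasters in that experiment actually did.

```python
# Invented respondents: (preferred candidate, self-reported likelihood of voting).
respondents = [
    ("A", 0.9), ("A", 0.4), ("A", 0.8), ("A", 0.3), ("A", 0.6),
    ("B", 0.9), ("B", 0.95), ("B", 0.7), ("B", 0.2), ("B", 0.85),
]


def margin(weights):
    """Candidate A's lead over B in percentage points, one weight per respondent."""
    a = sum(w for (cand, _), w in zip(respondents, weights) if cand == "A")
    b = sum(w for (cand, _), w in zip(respondents, weights) if cand == "B")
    return 100 * (a - b) / (a + b)


# Three assumptions about who will actually vote.
everyone = [1.0 for _ in respondents]                         # count all answers
cutoff = [1.0 if p >= 0.5 else 0.0 for _, p in respondents]   # drop unlikely voters
probabilistic = [p for _, p in respondents]                   # weight by likelihood

for name, weights in [("everyone", everyone), ("cutoff", cutoff),
                      ("probabilistic", probabilistic)]:
    print(f"{name:>13}: A leads by {margin(weights):+.1f} points")
```

With these made-up numbers, the race is tied under the first assumption and candidate B is ahead under the other two, although nobody changed a single answer.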
In addition to these explicit assumptions, implicit biases can influence what forecasters predict. With traditional media strongly split into two camps, and social media acting as echo chambers that tell you what you already believe, you are likely to skew your analysis in a way that suits your own view of the world. To illustrate this, the website FiveThirtyEight created a tool that allows you to confirm both the Democratic and the Republican party’s positive and negative influence on the economy.
Based on your data and informed by your assumptions, you then have to go on to create a model of the future. Again, the forecaster has to choose what will be part of this prediction. The resulting model is fairly complicated, including a variety of sophisticated statistical techniques. To create such a forecasting model, you will have to consider, once again, a number of different choices, all of which will influence the resulting prediction. If you are interested in knowing, for example, how a third-party candidate changes the election forecast, this article by Nate Silver can provide you with some insight.
In total, we have three different steps where you can substantially influence your estimation of the 2016 US presidential election, simply by deciding on who and what you ask, how you generalise this to the entire population and what factors are important in your prediction model. Knowing all this, it doesn't seem surprising anymore that the tight pro-Brexit referendum vote was not unanimously predicted.  

So, dear forecaster, here you are, with your data, all your assumptions, weighting samples this way and that, creating models to predict the election outcome. And then the video happens. And then there are the emails again. What do you do? Start from the beginning? Recalculate your predictions? Will it even have any impact? It is your decision how to proceed, but my suggestion would be to watch this, or this, or this. And let it rest until 2020.
