Back in 2016 we were a team of scientists that used a novel methodology to successfully predict Brexit and Trump, both within a single percentage point margin of error. It was a combination of our wisdom of crowds survey and a network analysis of friendship links on Facebook & Twitter (methodology described below).
Four years later, we are once again making our prediction, this time for the 2020 US election.
Results from 2016
We will start with our survey soon, focused primarily on the key US swing states (PA, FL, OH, NC, MI, WI — all states where Trump won in 2016 and which delivered him the electoral college victory, but also GA, VA, IA, CO, AZ, NV, TX).
In the meantime, just like in 2016, we will track “the competition” both here and on our election blog: the polling aggregators (like FiveThirtyEight, Upshot, RCP, PollyVote, etc.), the prediction models (Cook Political, Sabato’s Crystal Ball), the betting markets (Iowa EM, PredictIt), and my personal favourite benchmark, the superforecasters. All of them were way off in their predictions in 2016. Clinton was given an average 89% chance of winning, and not a single polling aggregator or prediction model gave PA, FL or NC to Trump, all of which we correctly called in his favour (see map below).
In addition, we also correctly called all the other major states, including OH, VA, IA, AZ, NV, CO, NM. We were wrong only about WI and MI, because we did not place enough emphasis on these states in the survey.
Implications for 2020
Back in January the narrative was clear: the Trump team was running a data-savvy campaign, emulating its 2016 approach, except that this time, instead of social media, it was all about using text messages on WhatsApp and geofencing. On the other hand, the Democrats were said to be losing their digital edge, particularly on social media, and to be targeting the wrong voters, and they were increasingly criticized for becoming detached from the average American. Trump’s approval ratings had started to increase, and he was performing well in all the key swing states that he won back in 2016. All early signals were pointing to his victory.
Six months later, the COVID-19 outbreak and all of its consequences have seriously undermined that narrative. Now Biden is firmly in the lead across the majority of nation-wide polls, but with knowledge of past polling errors (particularly the ones in 2016), the uncertainty surrounding this election is even greater than it was in 2016. How come?
Can you trust the polls?
For one, trust in pollsters has been seriously eroded ever since Brexit and Trump. Polling errors are usually magnified during election times, and pollsters worldwide are still struggling to obtain representative samples. Furthermore, there is a prominent hypothesis about the so-called Shy Trump voters (or the silent majority), i.e. the people who conceal their true preference, either by saying they are undecided, by misrepresenting who they support (one study found Republicans and Independents to be twice as likely not to give their true opinion to a pollster), or simply by choosing not to respond to the poll at all.
However, according to this paper, there is little evidence of “shy voters” causing any substantial polling errors:
Generally, there is little evidence that voters lying about their vote intention (so-called ‘shy’ voters) is a substantial cause of polling error. Instead, polling errors have most commonly resulted from problems with representative samples and weighting, undecided voters breaking in one direction, and to a lesser extent late swings and turnout models.
This is where the main problem lies: non-response bias in standard polls. Or in simple terms: fewer and fewer people are responding to polls. A response rate is the number of people who agree to give information in a survey divided by the total number of people called.
According to Pew Research Center, a prominent pollster, and Harvard Business Review, response rates have declined from 36% in 1997 to as little as 9% in 2016. This means that in 1997, in order to get, say, 900 people in a survey, you had to call about 2,500 people. In 2016, to get the same sample size, you needed to call 10,000 people.
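The arithmetic behind those two figures can be checked with a few lines of Python. This is only a minimal sketch: the function name `calls_needed` is made up for this illustration, and it assumes every call has the same chance of producing a completed response.

```python
from fractions import Fraction
import math


def calls_needed(target_sample: int, response_rate_percent: int) -> int:
    """People you must call to expect `target_sample` completed responses.

    Hypothetical helper for illustration; assumes a uniform response rate.
    Fractions are used so the division is exact (no floating-point rounding).
    """
    if not 0 < response_rate_percent <= 100:
        raise ValueError("response rate must be between 0 and 100 percent")
    rate = Fraction(response_rate_percent, 100)
    return math.ceil(target_sample / rate)


print(calls_needed(900, 36))  # 1997 response rate -> 2500 calls
print(calls_needed(900, 9))   # 2016 response rate -> 10000 calls
```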
Random selection is crucial here (because the sample mean in random samples is very close to the population mean), and pollsters spend a lot of money and effort to achieve randomness even among those 9% who did respond. Whether the result can be truly random is an entirely different question. Such low response rates are almost certainly making the polls subject to non-response bias. This type of bias significantly reduces the accuracy of any telephone poll, making it more likely to favour one particular candidate because it only captures the opinion of particular groups, and not the entire population. Online polls, on the other hand, suffer from self-selection problems and are by definition non-random and hence biased towards particular voter groups (younger, urban populations, usually also better educated).
Following the above example, assume that after calling about 10,000 people in 2016 and only getting 900 (correctly stratified and supposedly randomized) respondents, the results were the following: 450 for Clinton, 400 for Trump, and 50 undecided (assuming, for simplicity, no other candidates). The poll would then report that Clinton is at 50%, Trump at 44.4%, and that 5.6% are undecided. It would conclude that, because the sampling was random (or because the model did a good job of reweighting the sample), the average of responses for each candidate in the sample is likely to be very close to the average in the population.
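As a rough sketch of how such a poll turns raw counts into headline numbers, the snippet below computes each share from the 450/400/50 split. It also computes the textbook 95% margin of error for a simple random sample. That formula is a standard statistical assumption added here for illustration, not something taken from any particular pollster's model, and the function names are made up.

```python
import math


def poll_shares(counts: dict) -> dict:
    """Convert raw response counts into percentage shares of the sample."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


def margin_of_error(share_percent: float, sample_size: int, z: float = 1.96) -> float:
    """Textbook 95% margin of error in percentage points.

    Valid only if the sample really is a simple random sample.
    """
    p = share_percent / 100
    return 100 * z * math.sqrt(p * (1 - p) / sample_size)


counts = {"Clinton": 450, "Trump": 400, "Undecided": 50}
shares = poll_shares(counts)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")

moe = margin_of_error(shares["Clinton"], sum(counts.values()))
print(f"Margin of error at 50% with n=900: +/- {moe:.1f} points")
```

The margin of error only describes sampling noise. It says nothing about who chose not to pick up the phone.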
But it’s not. The low response rate suggests that some of those who do intend to vote simply did not want to express their preferences. Among the roughly 9,100 non-respondents the majority are surely people who dislike politics and hence will not even bother to vote (turnout in the US is usually between 50 and 60%, meaning that almost half of the eligible voters simply don’t care about politics). However, among the rest there are certainly people who will in fact vote, but are unwilling to say this to the interviewer directly for a number of reasons (lack of trust being the main one). This was one of the reasons why we found that in 2016 pollsters systematically underestimated Trump by 4.3% on average across all 50 states.
This poses a serious problem for the polling industry, which can no longer rely on standard statistical methods to deliver accurate predictions of trends (as it used to).
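To see how unequal willingness to respond can skew a poll, consider a deliberately simplified two-candidate sketch. All numbers are hypothetical and chosen only for illustration: the electorate is split exactly 50/50, supporters of candidate A respond 10% of the time, and supporters of candidate B respond 8% of the time (an overall response rate of 9%).

```python
def observed_share(true_share_a: float,
                   response_rate_a: float,
                   response_rate_b: float) -> float:
    """Share of candidate A among poll respondents in a two-candidate race.

    Hypothetical illustration: each camp answers the phone at its own rate.
    """
    respondents_a = true_share_a * response_rate_a
    respondents_b = (1 - true_share_a) * response_rate_b
    return respondents_a / (respondents_a + respondents_b)


true_share = 0.50  # assumed true support for candidate A
polled = observed_share(true_share, response_rate_a=0.10, response_rate_b=0.08)

print(f"True share of A:   {100 * true_share:.1f}%")
print(f"Polled share of A: {100 * polled:.1f}%")
```

In this toy setup a dead heat shows up in the poll as a lead of roughly eleven points for A, even though no respondent lied and the sample size looks perfectly respectable.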
How can we fix this? Use an alternative method that does not depend on having a representative sample to predict voter (or consumer) preferences.
We just so happen to have one.
A new polling methodology
The logic of our approach is simple. Our survey asks the respondents not only who they intend to vote for, but also who they think will win, by what margin, as well as their view on who other people think will win. It is essentially a wisdom of crowds concept adjusted for the question on groupthink.
The wisdom of crowds is not a new thing; it has been tried before. But even pure wisdom of crowds is still not enough to deliver a correct prediction. The reason is that people can fall victim to group bias if their only sources of information are polls and like-minded friends. We used social network analysis to overcome this effect. Using Facebook and Twitter to see how you and your friends predict the election outcome (only if your friends also complete our survey, all of it completely anonymously), we were able to recognize whether you belong to groups where groupthink bias was strong. People living in bubbles (homogeneous, like-minded groups) tend to only see one version of the truth: their own. This means they’re likely to be bad forecasters. On the other hand, people living in more diverse, heterogeneous groups are exposed to both sides of the argument. This means they are more likely to be better forecasters, so we value their opinions more. By performing this network analysis of voter preferences we are able to eliminate groupthink bias from our forecasts and therefore eliminate the bias from polling.
Our solution to the aforementioned traditional issues with online polls is the very idea of combining the wisdom of crowds with a network analysis to remove the selection and non-response bias. Asking a respondent how people around them think means that we are including a group of people instead of an individual. So all we have to do is to correct for each group’s bias, which is easier than correcting for individual bias.
To summarize, when we do a survey on social media this is what it looks like:
- We poll people on social media to find the best “observers” who tell us who their friends and other people think will win.
- These observers then invite their friends to the survey, which enables us to see their preference pattern and measure their group bias (only if the friends complete the survey).
- We then place a weight on each individual’s predictions based on their group’s bias and draw patterns of behaviour.
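The weighting step can be illustrated with a toy sketch. This is not our production model: the `group_homogeneity` score, the `1 - homogeneity` weight and the sample numbers are all assumptions invented for this example, standing in for whatever the network analysis actually measures.

```python
from dataclasses import dataclass


@dataclass
class Response:
    predicted_share: float    # respondent's forecast of candidate A's vote share, in percent
    group_homogeneity: float  # hypothetical score: 0.0 = very diverse friends, 1.0 = total bubble


def diversity_weight(response: Response) -> float:
    """Assumed weighting rule: the more homogeneous the group, the lower the weight."""
    return 1.0 - response.group_homogeneity


def weighted_forecast(responses: list) -> float:
    """Weighted average of individual forecasts, using the diversity weights."""
    weights = [diversity_weight(r) for r in responses]
    total = sum(weights)
    if total == 0:
        raise ValueError("every respondent sits in a fully homogeneous group")
    return sum(w * r.predicted_share for w, r in zip(weights, responses)) / total


# Invented sample: two respondents in bubbles, two in diverse groups.
responses = [
    Response(predicted_share=58.0, group_homogeneity=0.9),
    Response(predicted_share=56.0, group_homogeneity=0.8),
    Response(predicted_share=49.0, group_homogeneity=0.3),
    Response(predicted_share=51.0, group_homogeneity=0.2),
]

simple_average = sum(r.predicted_share for r in responses) / len(responses)
adjusted = weighted_forecast(responses)
print(f"Unweighted crowd forecast: {simple_average:.1f}%")
print(f"Diversity-weighted forecast: {adjusted:.1f}%")
```

In this invented sample the forecasts from the two bubble groups pull the plain average up, while the weighted forecast leans towards the respondents who hear both sides of the argument.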
This methodology has enabled us to accurately predict not only election outcomes (like Brexit, Trump or Macron), but also consumer sentiment and demand, market trends, optimal pricing, and even the economic consequences of the COVID-19 pandemic.
How can you benefit?
So, how do you stay ahead of the curve? How can you reduce the uncertainty over the election outcome and better plan your investing or business strategy?