Poll Dancing - When The Data Scientist Shouldn't Lead

15/11/2016 11:47 | Updated 15 November 2016

It's fair to say that pollsters won't be looking back on 2016 with any fondness. Wrong on Brexit and spectacularly wrong on Trump, the polling industry has taken such a dent to its reputation that it will likely be a long time before it's fully trusted again. But just why did the pollsters get it so wrong when, in the past, they have got it so right? It's a complex question to answer, but let me try to explain...

Firstly, for many reasons, it's really hard to get a representative sample. The decline in use of landlines and the fact that people would seemingly rather take a poll about the colour of Donald Trump's hair than a legitimate poll are just two contributing factors.

Creating a representative sample given the complex factors that go into people's decision-making is a challenge! You can no longer assume a 35-year-old with a certain income and lifestyle is going to vote a certain way. The days when people just voted their party's full ticket like their parents did are gone! It's a time of individual choice, so the segments being used are too broad. For example, if you were to take a group of 1,000 35-year-old unmarried males who like sports, could you predict how they would vote? Probably not, because that segment contains straight people, LGBT people, Republicans who will vote Democrat and vice versa, and so on. So you need to segment into smaller and smaller categories for accurate results.
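To make that concrete, here's a minimal sketch - with entirely invented numbers - of how a broad segment's single predicted vote share can hide subgroups that behave nothing like the average:

```python
# Hypothetical subgroups hidden inside one broad segment
# ("35-year-old unmarried male sports fans"):
# (share of segment, probability of voting for candidate A)
subgroups = [
    (0.40, 0.80),  # e.g. lifelong party loyalists
    (0.35, 0.20),  # e.g. voters crossing party lines this year
    (0.25, 0.55),  # e.g. genuinely undecided leaners
]

# The broad segment's prediction is just a weighted mean...
broad_estimate = sum(share * prob for share, prob in subgroups)
print(f"Broad-segment prediction: {broad_estimate:.0%} for candidate A")

# ...but no subgroup actually behaves like that average.
for i, (share, prob) in enumerate(subgroups, 1):
    print(f"  subgroup {i}: {share:.0%} of segment, votes A with probability {prob:.0%}")
```

The headline number looks precise, yet if any one subgroup shifts even slightly, the broad estimate moves in ways the pollster can't explain without the finer segmentation.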

The problem is that it's almost impossible to poll micro-segments using traditional technology like telephone and mail. You'd have to do it on the internet in a way that attracts an unbiased sample. If you just create a poll on the internet, it's hard to control who joins. Friends of friends will join, and friends tend to think alike, so the results will be biased. To do it correctly, an organisation would have to build a panel of known people across many micro-segments who agree to participate. And this would be an enormously expensive endeavour.
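One standard correction for a skewed online sample is post-stratification: reweight respondents so each segment counts in proportion to its real share of the electorate, not its share of whoever happened to join the poll. A toy sketch, with invented segment names and shares:

```python
# segment -> (share of raw internet sample, true population share,
#             support for candidate A within that segment)
segments = {
    "young urban":  (0.60, 0.30, 0.70),  # over-represented online
    "older rural":  (0.10, 0.35, 0.25),  # under-represented online
    "suburban mid": (0.30, 0.35, 0.50),
}

# Naive estimate: just average whoever answered the internet poll.
raw = sum(sample * support for sample, _, support in segments.values())

# Post-stratified estimate: weight each segment by its true population share.
weighted = sum(pop * support for _, pop, support in segments.values())

print(f"Raw internet-poll estimate: {raw:.1%}")
print(f"Post-stratified estimate:   {weighted:.1%}")
```

Note the catch, which is the article's point: this only works if you already know each micro-segment's true population share and have enough respondents inside every micro-segment - exactly the expensive part.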

Using broad segments, polls and commentators predicted the outcome of the 2015 UK General Election would be too close to call and would result in a second hung Parliament, similar to the 2010 election. But the polls were eventually proven to have underestimated the Conservative vote, as the party won a surprise outright majority reminiscent of its 1992 victory. The British Polling Council began an inquiry into the substantial variance between opinion polls and the actual result, with George Osborne announcing that pollsters would face a 'big post-mortem.' Perhaps it was just external forces (e.g. late swing) and no fault of the polling companies, but more likely the miss was down to the difficulty of polling micro-segments with traditional technology.

Next, capturing sentiment is hard. How hard? Look at the spectacular failure of HP's acquisition of Autonomy: HP paid $11.7 billion and within 12 months wrote off $8.8 billion of the value. It's really very, very hard to understand sentiment using data science. Can we take a sample of the population, ask them how they will vote, capture characteristics about them such as income, race, etc., and then extend that assumption over a bigger population? Technically yes, but besides the fact that we would still have a macro- vs micro-segmentation problem, we wouldn't capture their level of passion or how fresh news affects their behaviour. Subjective questions in polls try to do that, but rarely with the level of detail needed.

Look at what happened with Brexit. Sixty-one percent of people older than 65 voted to leave, and the leave vote was much higher among people with lower education levels. Could the polls have predicted that those groups would mostly vote to leave? Yes, but did the polls capture the sheer frustration of those groups? No, they didn't, for a few reasons. One, people lie. Yep. How many beers did you tell your doctor you drink in a week? Two is the most common answer, but rarely an accurate one. Similarly, in election polls, if a person holds some pretty unpopular ideas, they are unlikely to share them with the pollster (unless they are super passionate). Because of inaccurate quantitative and subjective measures, the view will look like it belongs to a small, very vocal minority when, in fact, many more people feel the same way but won't admit it. Sentiment can be a very unstable foundation on which to build election predictions.
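The "people lie" effect - what pollsters call social desirability bias - can be sketched in a few lines. The support and honesty rates below are invented purely for illustration:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SUPPORT = 0.52   # hypothetical: actual share who will vote this way
HONESTY_RATE = 0.85   # hypothetical: share of supporters willing to say so
N = 100_000           # simulated respondents

# How many respondents truly support the unpopular-seeming option...
supporters = sum(1 for _ in range(N) if random.random() < TRUE_SUPPORT)

# ...and how many of those admit it to the pollster.
admitted = sum(1 for _ in range(supporters) if random.random() < HONESTY_RATE)

print(f"True support:   {supporters / N:.1%}")
print(f"Polled support: {admitted / N:.1%}")  # systematically too low
```

Even with a winning majority, the poll reports a losing minority - and no amount of extra sample size fixes it, because the error is systematic, not random.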

The final reason pollsters got it wrong is that they over-relied on data science. For a simple problem - such as forecasting sales that only vary by 10% monthly - data science can predict future sales with a high degree of accuracy. For a slightly more complex problem, such as finding hidden opiate abusers for an insurance company, a large amount of data is needed - in both volume and number of data elements - to produce a valuable but slightly hazier answer. For example, a result of this type of modelling could be "this person has a greater than 83% chance of being an opiate abuser." Now take a complex problem such as predicting an election. Collect the best, but imperfect, set of data available; add some historic data, assumptions and a hazy view of sentiment; mix in things that can't be modelled - such as the surprise FBI announcement of a new investigation - and the result will be directionally correct, but practically inaccurate.
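"Directionally correct, but practically inaccurate" can be made concrete with a toy simulation. Both numbers below are assumptions: when the unmodellable noise (late news, turnout surprises, shy voters) is about the same size as the polled margin itself, the leader loses far more often than the forecast's apparent precision suggests:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

POLL_MARGIN = 0.03   # hypothetical: leader is up 3 points in the polls
NOISE_SD    = 0.04   # hypothetical: combined systematic + sampling error
TRIALS      = 100_000

# In each simulated election, the true margin is the polled margin
# plus a random error of the same order of magnitude.
wins = sum(
    1 for _ in range(TRIALS)
    if POLL_MARGIN + random.gauss(0, NOISE_SD) > 0
)

print(f"Leader actually wins in {wins / TRIALS:.0%} of simulated elections")
```

Under these assumed numbers the polled leader still loses roughly a quarter of the time - a "tight race with Clinton winning" headline hides a very real chance of the opposite result.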

Look at the US election. Poll after poll showed a tight race with Clinton winning. As with Brexit, the polls showed an older generation longing for something that doesn't exist - "the good old days" - and focused on Trump voters with lower education levels. But they missed something: a hidden portion of educated, young and middle-aged people who voted for Trump. American political activist Van Jones called it "whitelash" - a word that captures an incredibly complex set of issues, but I think there's more to it. Despite the polls showing very high dissatisfaction with Congress among supporters of both candidates, data scientists may not have accounted for the ferocity of that sentiment for many Americans. They simply couldn't see it in the data. Without that information, how could they have predicted that people were so dissatisfied that they would see a corporation-owning billionaire playboy as the anti-establishment choice? Really hard to put that in a model.

This is a good lesson for all of us not to rely solely on data science for decision-making. Don't get me wrong; amazing things are being discovered using data science, and I think it will be highly and positively disruptive for many years to come. However, it is but one source of insight. Modelling must be done correctly and the results blended with experience, balanced by an understanding of human behaviour and emotion that can't be put in a model. Not yet, anyway.