Pop tarts, goats’ cheese and Manchester City
By Steve Jones
In the summer of 2011, Manchester City’s assistant manager David Platt decided to use data to address a troubling aspect of the team’s performance. With a number of tall, strong players in the side, Platt felt that they were not scoring as many goals from corners as they should.
There are two main techniques for crossing the ball directly into the penalty area from a corner: the in-swinger and the out-swinger. After consulting the club’s in-house data analysts, the team increased its reliance on in-swinging corners – passes directed towards, rather than away from, the goalkeeper. The tactical switch produced startling results. Manchester City scored 15 goals from corners over the course of the season, the highest return of all the Premier League clubs. Two thirds of those goals were from in-swingers.
The exercise provided a strong endorsement for data-driven decision making. But there was an additional factor to consider: manager Roberto Mancini’s initial scepticism regarding the inherent value of data. Mancini had in fact already consulted the club’s data analysts about the team’s use of corners two years earlier. The analysts had told him that the method he instinctively favoured, the out-swinger (directed away from the goalkeeper), was statistically less successful.
Mancini had ignored the evidence and stuck to his intuition, which told him that swinging the ball away from the goalkeeper gave the goalkeeper less opportunity to catch it, and onrushing attacking players more of an opportunity to head it. But while Mancini had identified a relationship between two variables, his instincts had blurred his ability to judge the strength of that relationship. In other words, there may have been a correlation between out-swinging corners and the number of goals scored, but the data proved there was a more direct, causal relationship between the in-swinger and the number of goals scored.
What does this case study teach us about making better business decisions? A US retailer recently discovered an intriguing link between two different variables. It found that when the weather gets cold, sales of cinnamon Pop Tarts rise by 500 per cent – not all Pop Tarts, just the cinnamon variety. Faced with this data snippet, the retailer had a decision to make. How many more cinnamon Pop Tarts should be stocked whenever the weather is forecast to be cold? Another retailer found that discounted goats’ cheese appeared to boost sales of red wine. Is discounting goats’ cheese the way to go when looking to move red-wine stock?
The answer to both questions depends on a problem right at the heart of Big Data analytics: understanding the difference between correlation and causation. People are brilliant at spotting correlations – it’s an evolutionary trait – but we’re much worse at understanding when the correlated variables are directly related, in the sense that one causes the other. Mistaking correlation for causation is dangerous to decision making. Betting your business on a correlation can bring disaster, because the impact you expect is unlikely to happen.
Consider recent research, which indicated a clear correlation between a country’s chocolate consumption and the number of Nobel Prizes per capita. Should countries encourage citizens to increase chocolate consumption in order to drive up the number of Nobel Prizes they win?
To use Big Data effectively, correlation should be considered only as a starting point. If two variables are related, how do we respond? Before pushing a “chocolate as education” policy, a government should first look at other factors, such as the relative educational levels and research budgets in those countries with high numbers of Nobel Prize winners – both variables with a more likely causal relationship than chocolate consumption.
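The point can be illustrated with a small simulation. The sketch below uses no real figures: it invents 500 hypothetical countries in which a single hidden factor, here called `research_budget`, drives both chocolate consumption and Nobel Prizes, while the two have no effect on each other. Every name and number in it is an assumption made purely for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing the straight-line effect of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Simulated, not real, data: the hidden factor feeds both variables.
rng = random.Random(0)
research_budget = [rng.gauss(0, 1) for _ in range(500)]
chocolate = [b + rng.gauss(0, 0.5) for b in research_budget]
nobel_prizes = [b + rng.gauss(0, 0.5) for b in research_budget]

# Correlation as it would appear to someone who only sees the two variables.
raw = pearson(chocolate, nobel_prizes)

# Correlation once the hidden factor has been accounted for.
adjusted = pearson(residuals(chocolate, research_budget),
                   residuals(nobel_prizes, research_budget))

print(f"raw correlation:      {raw:.2f}")
print(f"adjusted correlation: {adjusted:.2f}")
```

In this invented world the raw correlation should come out strong, while the adjusted one should sit close to zero: the apparent link between chocolate and prizes is entirely the work of the hidden factor. Real data is rarely this tidy, and the hidden factor is usually not known in advance.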
Similarly, our Pop Tart and goats’ cheese retailers need to test their hypotheses before they can be certain, for instance by trialling an ‘over-stock’ of cinnamon Pop Tarts in a few stores before accepting there is causation, or discounting goats’ cheese to see whether red-wine sales really do increase.
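A trial of this kind can be sketched as a randomised comparison. The example below is a minimal illustration on simulated data, not either retailer's actual method: it assumes 60 hypothetical stores, picks 30 at random to discount goats’ cheese, and assumes a true lift of 20 bottles of red wine a week in those stores. It then uses a permutation test, one of several reasonable choices, to ask how often a gap that large would appear if the store labels were shuffled at random.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treated, control, n_shuffles=2000, seed=1):
    """Return the observed gap in means and the share of random
    relabelings whose gap is at least as large (one-sided)."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the discount was assigned differently
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles


# Simulated trial: every figure here is an assumption for illustration.
N_STORES = 60
ASSUMED_LIFT = 20  # hypothetical extra bottles per week caused by the discount

rng = random.Random(42)
store_ids = list(range(N_STORES))
rng.shuffle(store_ids)
discount_stores = set(store_ids[:N_STORES // 2])  # random assignment

# Weekly red-wine sales: a noisy baseline plus the assumed lift where discounted.
wine_sales = {
    store: rng.gauss(100, 10) + (ASSUMED_LIFT if store in discount_stores else 0)
    for store in range(N_STORES)
}
treated = [wine_sales[s] for s in range(N_STORES) if s in discount_stores]
control = [wine_sales[s] for s in range(N_STORES) if s not in discount_stores]

observed_gap, p_value = permutation_p_value(treated, control)
print(f"observed gap in weekly sales: {observed_gap:.1f} bottles")
print(f"share of shuffles matching it: {p_value:.3f}")
```

Because the stores are chosen at random, other influences such as weather or local promotions should be spread roughly evenly across both groups, so a gap that random shuffling rarely reproduces is better evidence of causation than the original correlation. Setting `ASSUMED_LIFT` to 0 shows what the same test looks like when the discount does nothing.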
These may be simple cases of causation, but each has unintended consequences that a company needs to understand. Does the increase in cinnamon Pop Tarts mean a sales decrease in other products? Does the increase in red wine also mean less beer is sold, or more steak? The number of variables impacting the modern supply chain is huge and growing: the weather, social media, special offers, news on food safety, all of these impact consumer behaviour and therefore how much stock a retailer should buy. This is fundamentally a chaotic system. It is impossible to predict with 100 per cent certainty what will happen. But the better the model, the better the prediction. And the better the prediction, the better the response.
Data analysis is like an impressionist painting. The picture only starts to make sense when you stand back and look at all of the pieces working together. Up close it’s impossible to understand what is going on. What makes data scientists’ jobs more difficult is that conditions change continuously over time, and correlated variables shift the end result whenever they themselves change. While a human could handle the link between weather and Pop Tarts, it takes a complex algorithm to deal with the knock-on effect on the sales of other products, or to understand the level at which the Pop Tart trigger kicks in.
This helps explain why Manchester City’s corner routine is unlikely to achieve results over the long term. Making a simple tactical switch, from more out-swinging corners to more in-swinging corners, ignores the unique variables present not only in every match but in every potential goal-scoring situation.
Over time, awareness of Manchester City’s new tactic will grow, meaning that defences will be better prepared to deal with it. In addition, some defences will be naturally better equipped than others to cope with the danger of in-swinging corners. For example, the preferred standing position of the goalkeeper and the height of the defender nearest to the corner are both variables to be considered, not to mention wind speed and behavioural elements, such as the relation between defender concentration levels and the amount of elapsed game time.
Can data be used to predict future outcomes with greater certainty? Absolutely, but only provided businesses can resist the temptation of convenient patterns and learn to distinguish between correlation and true causality.