ProLearn’s Data Pundits Tackle World Cup Predictions!
By Aditi Bhat
The 2018 FIFA World Cup has been called one of the most surprising tournaments in FIFA history. Strong contenders such as Germany, Portugal and Argentina are already on their way home after unexpectedly poor showings in the early rounds.
The unpredictable nature of the World Cup has drawn many game analysts and big data enthusiasts to crunch numbers and try to predict the winning team. Immense amounts of structured match data saved over decades, along with countless variables, make predicting the winner a challenging task.
Getting the kicks with a data science challenge
Turning the artistic magic of soccer into science with data analytics techniques, the Manipal ProLearn Data Science team also took a shot at predicting the winner of the 2018 FIFA World Cup, being held in Russia. The students who tackled this project were Kavyashree, Chaitanya, Vanaja and Amandeep. They had quite a task on their hands with the sheer amount of data at their disposal, but it was a challenge they were more than ready to take on!
Did you know that game developer EA Sports has predicted France as the winner based on its state-of-the-art data analytics software?
Setting the parameters
To get started, the data science team built classification models on a dataset of historic football results that included ratings for each playing team in attack, midfield, defence, aggression, pressure, chance creation and building ability.
To build a model that would make good predictions for the World Cup, the students sourced datasets from Kaggle covering almost 40,000 international matches played between 1872 and 2018.
For the project, the team used variables such as whether a team was playing at home or as the visitor, the goals scored by the home team and the date of the match.
Getting over the hurdles
The team then merged the match data files with the team stats files in R to create a final dataset. Initial hurdles, such as discrepancies in team names (for example, USA versus United States), were ironed out. The team also used SQL to manipulate data frames in R. In the end, the team had 1,183 observations of international matches played since 2010 for which stats were available for both teams.
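The students did this step in R with SQL, and the article does not show their code. As a rough illustration only, here is a minimal Python sketch of the same idea: map variant team names to one spelling, then keep only the matches where both teams have stats. The alias table, field names and sample numbers are all made up for the example.

```python
# Hypothetical alias table: maps variant spellings to one canonical name.
TEAM_ALIASES = {"USA": "United States"}


def canonical_name(name):
    """Return the canonical team name, trimming stray whitespace."""
    cleaned = name.strip()
    return TEAM_ALIASES.get(cleaned, cleaned)


def merge_matches_with_stats(matches, stats):
    """Inner-join match rows with team stats; drop matches missing stats."""
    stats_by_team = {canonical_name(row["team"]): row for row in stats}
    merged = []
    for match in matches:
        home = canonical_name(match["home_team"])
        away = canonical_name(match["away_team"])
        # Keep the match only when stats exist for both teams.
        if home in stats_by_team and away in stats_by_team:
            merged.append({
                **match,
                "home_team": home,
                "away_team": away,
                "home_stats": stats_by_team[home],
                "away_stats": stats_by_team[away],
            })
    return merged


# Made-up sample rows; "Atlantis" has no stats, so its match is dropped.
matches = [
    {"home_team": "USA", "away_team": "Brazil", "home_score": 1, "away_score": 2},
    {"home_team": "Brazil", "away_team": "Atlantis", "home_score": 3, "away_score": 0},
]
stats = [
    {"team": "United States", "attack": 75},
    {"team": "Brazil", "attack": 86},
]
merged = merge_matches_with_stats(matches, stats)
```

Dropping matches without stats for both teams is the kind of filter that would shrink a large match history down to a smaller usable set, as happened here.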
Working around problems – Key data science skill
Improvisation is a key data science skill. Even the most information-rich data often needs some improvisation before it yields meaningful insights. Sharing one such experience, the data science team said, “We realized that ‘home’ and ‘away’ is a condition that doesn’t apply in the World Cup. So we decided to rearrange the teams based on their overall rank instead of home and away teams. We rearranged the data and created strong and weak teams, which included rearranging all of their stats as well. We also created new fields to be used as predictors which are the differences for each stat between the strong and the weak team.”
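The rearrangement the team describes can be sketched in a few lines of Python. This is an assumption-laden illustration, not the students' code: the field names, the three stats chosen and the sample numbers are hypothetical, and a lower rank number is assumed to mean a stronger team.

```python
# Hypothetical subset of the team stats used as predictors.
STAT_NAMES = ["attack", "midfield", "defence"]


def to_strong_weak(match):
    """Re-express a home/away match as strong/weak, with stat differences."""
    home, away = match["home"], match["away"]
    # Assumption: a lower rank number means a stronger team.
    # On equal ranks the home side is treated as the strong team.
    if home["rank"] <= away["rank"]:
        strong, weak = home, away
    else:
        strong, weak = away, home
    row = {
        "strong_team": strong["name"],
        "weak_team": weak["name"],
        # Goals follow their team so the outcome stays correct.
        "strong_goals": strong["goals"],
        "weak_goals": weak["goals"],
    }
    for stat in STAT_NAMES:
        row[f"{stat}_diff"] = strong[stat] - weak[stat]
    return row


# Made-up example: the away side is ranked higher, so it becomes "strong".
example = {
    "home": {"name": "Team A", "rank": 20, "goals": 1,
             "attack": 70, "midfield": 72, "defence": 71},
    "away": {"name": "Team B", "rank": 3, "goals": 2,
             "attack": 85, "midfield": 84, "defence": 80},
}
row = to_strong_weak(example)
```

Working with differences means the model sees one number per stat for each match instead of two, which is what lets it ignore who was nominally at home.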
Real-life data challenges like the FIFA World Cup create unique learning possibilities too. The students experimented and dribbled with variables and classes based on the rules of the game. For instance, draws and wins were merged into a single ‘win’ class, since both outcomes earn a team points in the World Cup.
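As a small illustration of that labelling rule, the sketch below folds draws into the win class from the stronger team's point of view and then counts the classes. The function name and scorelines are made up for the example.

```python
from collections import Counter


def outcome_label(strong_goals, weak_goals):
    """Label a match from the stronger team's side; a draw counts as a win."""
    return "win" if strong_goals >= weak_goals else "loss"


# Made-up scorelines: (strong team's goals, weak team's goals).
scorelines = [(2, 0), (1, 1), (0, 1), (3, 2), (0, 0)]
labels = [outcome_label(s, w) for s, w in scorelines]
class_counts = Counter(labels)
```

Merging draws into wins makes the win class larger relative to losses, so it is worth checking the class counts before training a classifier.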
Determining with data mining
The team’s initial attempts at a predictive model showed similar results for the decision tree and k-nearest neighbours (kNN) methods. The best initial decision tree used a maximum tree depth of 5, minimum case counts of 10 and 5 for parent and child nodes respectively, and a minimum improvement of 0.003. The resulting tree had 6 nodes, 4 of them terminal, and a depth of 2.
The model reached 78% accuracy, with 99.6% precision on wins but only 2.3% on losses. The students then fine-tuned the sampling selection, which significantly improved the precision of the algorithms.
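The article does not say exactly how the per-class figures were calculated, so the sketch below simply shows how overall accuracy and per-class precision and recall are derived from predictions. The labels and counts are invented and are not the team's results.

```python
def classification_report(actual, predicted, labels):
    """Compute overall accuracy plus precision and recall for each class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    report = {"accuracy": correct / len(actual)}
    for label in labels:
        true_pos = sum(
            a == label and p == label for a, p in zip(actual, predicted)
        )
        predicted_as = sum(p == label for p in predicted)
        actually = sum(a == label for a in actual)
        report[label] = {
            # Of everything predicted as this class, how much was right?
            "precision": true_pos / predicted_as if predicted_as else 0.0,
            # Of everything truly in this class, how much was found?
            "recall": true_pos / actually if actually else 0.0,
        }
    return report


# Invented data: 8 wins and 2 losses, with one loss predicted as a win.
actual = ["win"] * 8 + ["loss"] * 2
predicted = ["win"] * 8 + ["win", "loss"]
report = classification_report(actual, predicted, ["win", "loss"])
```

In this toy data the model is right 9 times out of 10 overall yet finds only half of the losses, which shows how a high accuracy can hide weak performance on the smaller class.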
The final decision tree was grown with a maximum depth of 10, minimum case counts of 10 and 5 for parent and child nodes respectively, and a minimum Gini improvement in purity of 0.003. Precision on losses rose sharply to 33.6%, while precision on wins was maintained and overall accuracy dropped by only about 5 points, to 73.1%.
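To make those growing limits concrete, here is a minimal sketch of how a Gini-based split check could work, assuming the textbook definition of Gini impurity. The class counts are made up, and tree-building tools may weight the improvement differently, so treat the numbers as illustrative.

```python
MIN_PARENT_CASES = 10    # minimum cases in a node before it may split
MIN_CHILD_CASES = 5      # minimum cases in each resulting child
MIN_IMPROVEMENT = 0.003  # minimum drop in Gini impurity


def gini(counts):
    """Gini impurity of a node, given its per-class case counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_improvement(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
    return gini(parent) - weighted


def split_allowed(parent, left, right):
    """Apply the minimum-cases and minimum-improvement rules to a split."""
    return (
        sum(parent) >= MIN_PARENT_CASES
        and sum(left) >= MIN_CHILD_CASES
        and sum(right) >= MIN_CHILD_CASES
        and gini_improvement(parent, left, right) >= MIN_IMPROVEMENT
    )


# Made-up counts as [wins, losses] for a parent node and its two children.
parent, left, right = [80, 20], [60, 5], [20, 15]
improvement = gini_improvement(parent, left, right)
```

A split that leaves the children no purer than the parent yields an improvement of zero and is rejected, which is how a small threshold like 0.003 keeps the tree from growing on noise.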
Final decision tree
Much to the students’ amusement, the magic algorithm churned out Spain as the champion, beating Brazil in the final showdown. They also found that Brazil would beat Germany in a hypothetical final.
The team found that the most important predictors were the differences in attack, chance creation crossing, passing, speed, overall pressure, defence and midfield.
In general, sports outcomes are difficult to predict, so an accuracy of around 80% is reasonably acceptable. The students described this assignment as the kick-off to a continuous learning process inspired by real-world challenges in the data science field.
Gone Chaitanya, Vanaja Nagalla, Amandeep Singh, Kavyashree Ramesh