The Apteco Datathon: 3. How can you win an Olympic medal?

20 Sep 2018  |  by Melanie Davis

In the third of our Apteco Datathon series we continue to deep dive into real-life datasets to analyse and interpret the numbers. In this blog we examine what it takes to be an Olympic athlete – and your best chances of winning a medal.

In our last blog we explored a dataset containing international football matches starting from 1872. Continuing this sporting theme, we created and explored a FastStats system containing 120 years of Olympic data. What can we learn from this data and could it show us how to improve our chances of standing on the podium with an Olympic medal around our neck?

The Dataset

Conveniently for us, a scraped dataset had already been put into a tidy CSV format and hosted on Kaggle. We combined this with a US government dataset containing various measurements of country development. With a bit of tweaking this gave us two tables: Athletes and Events. The fields in each table are:

  • Athletes
    • ID
    • Height
    • Weight
    • Sex
    • Name
  • Events
    • Athlete
    • Athlete age at the event
    • Competing country and NOC code (National Olympic Committee Code e.g. GBR)
    • Team (e.g. Denmark/Sweden or USA)
    • Host city
    • Sport and event
    • Medal
    • Event year

To attach the country data to the events table, we used a lookup table that maps each NOC code to its associated country. This isn’t always appropriate, as occasionally a mixed-nation team competes under a single country code, but this rarely happens. More often than not, the countries in such a team are also very similar in terms of development (note the Denmark/Sweden example). This gives us a total of 135,571 athletes, with 271,116 unique entries, and information for 205 countries.
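As a rough illustration of the NOC lookup described above, here is a minimal Python sketch. The field names (`noc`, `team`, `gdp_per_capita`) and the sample rows are assumptions made for the example, with placeholder numbers; they are not the real schema or data, and FastStats does this join through its own table relationships.

```python
# Hypothetical lookup from NOC code to country name.
noc_to_country = {"GBR": "United Kingdom", "DEN": "Denmark"}

# Hypothetical country development data (placeholder values, not real figures).
country_stats = {
    "United Kingdom": {"gdp_per_capita": 1.0},
    "Denmark": {"gdp_per_capita": 2.0},
}


def attach_country(events, noc_to_country, country_stats):
    """Return copies of the event rows with country data attached via the NOC code."""
    enriched = []
    for event in events:
        country = noc_to_country.get(event["noc"])  # None if the code is unknown
        stats = country_stats.get(country, {})      # empty if no development data
        enriched.append({**event, "country": country, **stats})
    return enriched


events = [
    {"athlete_id": 1, "noc": "GBR", "team": "Great Britain"},
    {"athlete_id": 2, "noc": "DEN", "team": "Denmark/Sweden"},  # mixed team, one code
    {"athlete_id": 3, "noc": "XXX", "team": "Unknown"},
]
enriched = attach_country(events, noc_to_country, country_stats)
```

Note how the mixed Denmark/Sweden team simply inherits Denmark's figures, which is the approximation discussed above.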

As there’s no data for the 2018 winter Olympics in PyeongChang, we’ll only be looking at the summer games.

Diving into the data

To first verify the integrity of the data, let's look at something we already know. How about identifying the most decorated Olympic athlete? We'll begin this by making an aggregation of events that have a bronze, silver or gold medal.



Applying this aggregation to a selection and keeping only the top n results gives us the data grid:


with Michael Phelps on top. That's reassuring.
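For readers without FastStats to hand, the same sanity check can be sketched in plain Python. The row layout (`name`, `medal`) and the sample rows are made up for illustration; only the idea of counting medal-winning entries per athlete and taking the top n comes from the text.

```python
from collections import Counter

MEDALS = {"Gold", "Silver", "Bronze"}


def top_medallists(entries, n=3):
    """Count medal-winning entries per athlete and return the n largest counts."""
    medal_counts = Counter(e["name"] for e in entries if e["medal"] in MEDALS)
    return medal_counts.most_common(n)


# Made-up sample rows; a medal of None means the athlete did not place.
entries = [
    {"name": "Athlete A", "medal": "Gold"},
    {"name": "Athlete A", "medal": "Silver"},
    {"name": "Athlete B", "medal": None},
    {"name": "Athlete A", "medal": "Bronze"},
    {"name": "Athlete B", "medal": "Gold"},
    {"name": "Athlete C", "medal": None},
]
top_two = top_medallists(entries, n=2)
```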

How the games have changed

Evidently the games are a totally different beast from 120 years ago, but how have they changed? Let’s start by looking at event entries over time on a chart.


From this we can see some dips in 1932, 1956, 1976, and 1980. The 1932 games were hosted in Los Angeles, and many competitors couldn’t afford to pay for the trip there. In 1956, these countries didn’t compete for the following reasons:

  • Egypt, Iraq, and Lebanon in response to the Suez Crisis.
  • Netherlands, Cambodia, Spain, and Switzerland in response to the USSR crushing the Hungarian Revolution.
  • People's Republic of China as a result of the Republic of China (Taiwan) being allowed to compete.

In 1976, 29 countries boycotted the games due to the IOC not banning New Zealand for the All Blacks tour of South Africa.

In 1980, 65 countries boycotted the games in Moscow in protest at the Soviet invasion of Afghanistan in 1979.
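The dips discussed above can also be found programmatically rather than by eye. The sketch below counts entries per games year and flags any year with fewer entries than the previous games. The `year` field name and the tiny sample counts are invented for the example and do not reflect the real numbers.

```python
from collections import Counter


def entries_per_year(entries):
    """Return {year: number of event entries}, ordered by year."""
    return dict(sorted(Counter(e["year"] for e in entries).items()))


def dips(counts):
    """Return the years whose entry count is lower than at the previous games."""
    years = list(counts)
    return [year for prev, year in zip(years, years[1:]) if counts[year] < counts[prev]]


# Made-up sample: 5 entries in 1928, 3 in 1932, 6 in 1936.
entries = [{"year": 1928}] * 5 + [{"year": 1932}] * 3 + [{"year": 1936}] * 6
counts = entries_per_year(entries)
dip_years = dips(counts)
```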

Let’s now have a look at the breakdown of events to see how the popularity of sports has changed.


In the modern Olympics there’s not been much variation, with athletics in purple being the most popular sport in terms of entries, followed by swimming in blue, and rowing in orange.

However, in 1932 we can see one green spike in popularity for an odd sport… Art competitions! This division of the Olympic Games was hosted from 1912 until 1948 but stopped due to “concerns about amateurism and professionalism”. The IOC also doesn’t recognise the medals awarded as official anymore, but this doesn’t make it any less interesting.

Something more personal

As I was looking into this dataset I realised I had a massive desire to win an Olympic medal. Luckily, with all this data in front of me I had an advantage over all those fools who blindly choose a sport to compete in for fun – I could statistically look at the data to define the sport in which achieving a medal is most probable.

I looked at the summer games since 2000, and using BMI and age created a cube with the mode event type as a statistic. Here's what comes out of this for men:


If you hover the mouse over a cube entry a popup will show with the relevant event. For example, I weigh 65kg and stand 1.8m tall, so my BMI is 65 / 1.8² ≈ 20.1, and my age is 22. This puts me on event ID 71, which is Athletics Men’s 1,500 metres. Maybe it’s time to start doing a few laps of the track.
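To make the cube idea concrete, here is a small Python sketch of the BMI calculation and of finding the most common (mode) event in each BMI/age cell. The band widths, field names and sample rows are assumptions for illustration; the actual FastStats cube may band the values differently.

```python
from collections import Counter, defaultdict


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def modal_event_cube(entries, bmi_width=2, age_width=4):
    """Map each (BMI band, age band) cell to its most common event."""
    cells = defaultdict(Counter)
    for e in entries:
        # Each band is labelled by its lower edge, e.g. BMI 20.1 falls in band 20.
        bmi_band = int(bmi(e["weight_kg"], e["height_m"]) // bmi_width) * bmi_width
        age_band = (e["age"] // age_width) * age_width
        cells[(bmi_band, age_band)][e["event"]] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in cells.items()}


# Made-up sample rows, not taken from the dataset.
entries = [
    {"weight_kg": 65, "height_m": 1.80, "age": 22, "event": "Athletics Men's 1,500 metres"},
    {"weight_kg": 67, "height_m": 1.82, "age": 21, "event": "Athletics Men's 1,500 metres"},
    {"weight_kg": 66, "height_m": 1.81, "age": 23, "event": "Athletics Men's 800 metres"},
    {"weight_kg": 100, "height_m": 1.85, "age": 27, "event": "Water Polo Men's Water Polo"},
]
cube = modal_event_cube(entries)
my_bmi = round(bmi(65, 1.8), 1)
```

Because each cell reports only the single most frequent event, sports with many entrants will tend to dominate the cube.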

Below is the same table for women and an index lookup table so you're able to see what would suit you best too.


Event ID lookup (events only):

  • Athletics Men's 1,500 metres
  • Athletics Men's 10,000 metres
  • Athletics Men's 4 x 400 metres Relay
  • Athletics Men's 5,000 metres
  • Athletics Men's 50 kilometres Walk
  • Athletics Men's 800 metres
  • Athletics Men's Marathon
  • Athletics Women's Discus Throw
  • Athletics Women's High Jump
  • Athletics Women's Javelin Throw
  • Athletics Women's Marathon
  • Athletics Women's Shot Put
  • Boxing Men's Featherweight
  • Boxing Men's Flyweight
  • Boxing Men's Light-Welterweight
  • Diving Men's Platform
  • Equestrianism Mixed Dressage, Individual
  • Equestrianism Mixed Dressage, Team
  • Equestrianism Mixed Jumping, Individual
  • Equestrianism Mixed Jumping, Team
  • Equestrianism Mixed Three-Day Event, Team
  • Equestrianism Mixed Three-Day Event, Team
  • Fencing Men's epee, Team
  • Fencing Women's Foil, Individual
  • Football Men's Football
  • Gymnastics Men's Parallel Bars
  • Gymnastics Men's Team All-Around
  • Gymnastics Women's Balance Beam
  • Gymnastics Women's Team All-Around
  • Gymnastics Women's Uneven Bars
  • Handball Men's Handball
  • Handball Women's Handball
  • Hockey Men's Hockey
  • Hockey Women's Hockey
  • Judo Men's Extra-Lightweight
  • Rhythmic Gymnastics Women's Group
  • Rowing Men's 17-Man Naval Rowing Boats
  • Rowing Men's Coxed Fours
  • Rowing Men's Coxed Pairs
  • Rowing Women's Coxed Eights
  • Rowing Women's Coxed Fours
  • Sailing Mixed Three Person Keelboat
  • Sailing Mixed Two Person Keelboat
  • Shooting Mixed Trap
  • Shooting Women's Sporting Pistol, 25 metres
  • Swimming Men's 1,500 metres Freestyle
  • Swimming Men's 200 metres Backstroke
  • Swimming Men's 200 metres Butterfly
  • Swimming Men's 4 x 100 metres Medley Relay
  • Swimming Men's 4 x 200 metres Freestyle Relay
  • Swimming Men's 400 metres Freestyle
  • Swimming Women's 4 x 100 metres Freestyle Relay
  • Swimming Women's 4 x 100 metres Medley Relay
  • Swimming Women's 400 metres Freestyle
  • Synchronized Swimming Women's Team
  • Volleyball Women's Volleyball
  • Water Polo Men's Water Polo
  • Weightlifting Men's All-Around Dumbbell Contest
  • Weightlifting Women's Super-Heavyweight
  • Wrestling Men's Bantamweight, Freestyle

While this is quite a fun piece of analysis, there are clearly limitations – not least that the mode will be heavily influenced by the sports in which there have been a lot of athletes, so we see a lot of the more popular sports in the list.

What makes an athlete?

This exercise also doesn't really give us any proper insight into how these variables affect the chance of getting a medal. To determine this we can use a modelling environment, or more precisely the Apteco Modelling profiling tool.


Both the base and analytic selections contain all summer event entries from 2000 onwards, and the analytic selection additionally filters out all non-medal entries. We looked at how weight, height, sex, sport, age and BMI affect the likelihood of obtaining a medal.
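The comparison of a base selection against an analytic selection can be approximated in a few lines of Python. This is only a rough analogue under stated assumptions, not the algorithm used by the Apteco profiling tool: for each category of a variable it computes the share of entries that won a medal, and an index of that share against the overall medal rate (100 means average). The field names and sample rows are invented.

```python
from collections import Counter


def profile(entries, variable):
    """Compare medal-winning entries (analytic) with all entries (base) per category."""
    base = Counter(e[variable] for e in entries)
    analytic = Counter(e[variable] for e in entries if e["medal"])
    overall_rate = sum(analytic.values()) / sum(base.values())
    rows = {}
    for category, base_count in base.items():
        rate = analytic[category] / base_count
        # Index of 100 = same medal rate as the whole base selection.
        index = 100 * rate / overall_rate if overall_rate else 0.0
        rows[category] = {"base": base_count, "medals": analytic[category],
                          "rate": rate, "index": index}
    return rows


# Made-up sample: 4 handball entries (2 medals) and 4 triathlon entries (0 medals).
entries = (
    [{"sport": "Handball", "medal": "Gold"}] * 2
    + [{"sport": "Handball", "medal": None}] * 2
    + [{"sport": "Triathlon", "medal": None}] * 4
)
by_sport = profile(entries, "sport")
```

Running the same function over each variable in turn and comparing how far the indexes spread from 100 gives a crude sense of which variable discriminates most.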

Using the profile tool it appears that the particular sport you choose has the greatest impact on your chances of standing on the podium. Team sports are the most likely route to success, with individual sports – such as triathlon and gymnastics – the least likely.




Weight and height were the next most influential variables, with taller and heavier athletes more likely to be medal winners. It's also nice to see that, out of our variables, sex is the least influential.

Unfortunately, the success of an athlete isn't always dictated by how much work they put in. This is evident when we look at how the success of an athlete is affected by the country they represent.



With this profile including GDP (USD per capita) and literacy (% of population), we can see that athletes in more developed countries are heavily favoured. This is likely to be related to access to greater resources than those in less-developed nations. Note that the GDP banding seems quite erratic because bin size was prioritized over range size.
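One plausible reading of "bin size was prioritized over range size" is equal-frequency banding, where every band holds roughly the same number of records and the band edges fall wherever the data puts them. The sketch below shows that technique in Python under that assumption; the GDP figures are invented and the function is not part of any Apteco product.

```python
def equal_frequency_bands(values, n_bands):
    """Split values into n_bands groups of near-equal size.

    Returns a list of (lowest value, highest value, count) per band.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bands)
    bands, start = [], 0
    for i in range(n_bands):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        chunk = ordered[start:end]
        if chunk:
            bands.append((chunk[0], chunk[-1], len(chunk)))
        start = end
    return bands


# Invented GDP-per-capita figures, heavily skewed towards low values.
gdp = [500, 800, 1200, 1500, 9000, 20000, 41000, 60000]
bands = equal_frequency_bands(gdp, n_bands=4)
```

Each band here contains two values, but the ranges run from a few hundred dollars wide to tens of thousands, which is the kind of erratic-looking banding described above.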


There isn’t enough space in this blog to detail everything I've discovered, and there’s plenty more to learn from this dataset. Considering how relatively small the dataset is with only 271,116 entries, the data is very rich with information. This made it very enjoyable to build the system and work with it. If I were to continue working with the data I would want to look into the outliers that were discovered in the profiling tool and have a look to see if certain countries are doing better than they are expected to. But then there’s always more to learn from data.

Melanie Davis

Group Marketing Manager

Since 2007 Mel has led the Apteco marketing team in driving the Apteco brand. She has been a B2B marketer in the data and marketing technology sectors for over 20 years. Her aim is to ensure that Apteco is a trusted and respected brand that is the first point of call for all data driven marketers.
