The one true triathlon distance


I have created a web-scraping tool to pull athlete result data from Ironman triathlon events. Using some of this data, I’d like to do some simple analysis of the relative distances and finishing times of the three disciplines in triathlon: swim, bike, and run. In the end, I would like to come up with a ‘natural’ distance for each discipline, where each is equally weighted in the event.

A brief look at finishing times

Event data for Ironman Canada 2019 (held in Mt. Tremblant), pulled with the associated Ironman data scraper, will be used for demonstration. We can see the general structure of the data below. The data pulled from the Ironman site is athlete-specific and quite granular, including all split times and ranks for each individual.

import pandas as pd
import time

event_folder = 'data/canada140/'
ad = pd.read_csv(event_folder+'data.csv')
ad.head()
link name bib division age state country profession points overall_rank ... gender_rank swim_time bike_time run_time total_time swim_racetime bike_racetime run_racetime t1_time t2_time
0 https://www.ironman.com/triathlon/events/ameri... Raynard Picard 139 35-39 39 CA MEX --- 5000 1.0 ... 1.0 3577.0 17389.0 11612.0 32966.0 3577.0 21193.0 32966.0 227.0 161.0
1 https://www.ironman.com/triathlon/events/ameri... James Curran 211 35-39 34 Munster IRL --- 4948 2.0 ... 2.0 3820.0 17904.0 11172.0 33224.0 3820.0 21907.0 33224.0 183.0 145.0
2 https://www.ironman.com/triathlon/events/ameri... Reid Foster 130 30-34 33 OK USA --- 5000 3.0 ... 3.0 3437.0 17605.0 11834.0 33242.0 3437.0 21238.0 33242.0 196.0 170.0
3 https://www.ironman.com/triathlon/events/ameri... Heather Wurtele 3 PRO 40 BC CAN --- 5000 4.0 ... 1.0 3198.0 18107.0 12084.0 33641.0 3198.0 21429.0 33641.0 124.0 128.0
4 https://www.ironman.com/triathlon/events/ameri... Karel Sumbal 145 40-44 42 SC CZE --- 5000 5.0 ... 4.0 3510.0 19025.0 11043.0 33913.0 3510.0 22730.0 33913.0 195.0 140.0

5 rows × 21 columns

print('There are {:d} athlete entries.'.format(len(ad.index)))
There are 1489 athlete entries.

Each athlete entry (row) includes a direct link to its results page, along with detailed race data. All time-based columns are in units of seconds. First, let’s look at the finishing times of all the athletes.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.distplot(ad.total_time.dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x18208617d0>

Notice the irregular distribution shape and the truncation at long times. This is at least partially due to the hard time limit of the event (17hrs or 61200s). Additionally, each discipline has its own cut-off time.

print('The average and median total times are ' + time.strftime('%Hhrs %Mmin %Ss', time.gmtime(np.mean(ad.total_time))) +
     ' and '+ time.strftime('%Hhrs %Mmin %Ss', time.gmtime(np.median(ad.total_time.dropna()))) +
     '. The standard deviation of the distribution is ' +time.strftime('%Hhrs %Mmin %Ss', time.gmtime(np.std(ad.total_time))) + '.') 
The average and median total times are 13hrs 21min 41s and 13hrs 23min 51s. The standard deviation of the distribution is 01hrs 46min 57s.
f, ax = plt.subplots(1,3,figsize=(12, 3))
plt.setp(ax, yticks=[])
sns.set(style="white", palette="muted", color_codes=True)
sns.despine(left=True)
sns.distplot(ad.swim_time.dropna(), color='b',ax=ax[0])
sns.distplot(ad.bike_time.dropna(), color='g',ax=ax[1])
sns.distplot(ad.run_time.dropna(), color='r',ax=ax[2])
<matplotlib.axes._subplots.AxesSubplot at 0x1a214a4210>


The event cut-off times for the swim, bike, and run are 2hrs 20min (8400s), 8hrs 10min (29400s), and 6hrs 30min (23400s), respectively. Some individual times may exceed these limits, but the athlete would not be permitted to continue to the next discipline.
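As a quick sanity check on the conversions above, the following minimal sketch turns the quoted cut-off times into seconds and compares their sum with the 17 hr event limit mentioned earlier. The helper name `to_seconds` is purely illustrative and not part of the scraper.

```python
def to_seconds(hours, minutes=0):
    """Convert an hours/minutes duration to seconds."""
    return hours * 3600 + minutes * 60

# Cut-off times quoted in the text.
cutoffs = {
    "swim": to_seconds(2, 20),
    "bike": to_seconds(8, 10),
    "run": to_seconds(6, 30),
}
event_limit = to_seconds(17)

print(cutoffs)
# The three discipline cut-offs add up to exactly the 17 hr event limit.
print(sum(cutoffs.values()) == event_limit)
```

The three cut-offs quoted here add up to exactly 17 hrs, so they leave no separate allowance for transitions.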

print([(np.mean(time), np.median(time.dropna()), np.std(time), np.std(time)/np.mean(time)) for time in [ad.swim_time,ad.bike_time,ad.run_time]])
[(4839.948738812042, 4765.0, 823.7251282290428, 0.17019294473586197), (24268.259509721047, 24194.0, 3001.649341671252, 0.12368622234606039), (18234.61800699301, 18085.5, 3351.7988083726937, 0.18381513706990038)]

An interesting feature of the plots above is that while the swim times produce the narrowest distribution, they are not the least variable with respect to the mean. The swim time standard deviation is 17.0% of the average swim time, compared to 12.4% for the bike, and 18.4% for the run. The run result, at least, is expected, since it is the last discipline of a gruelling event. With the body exhausted it is natural to expect significant variability in finishing times.

f, ax = plt.subplots(1,2,figsize=(14, 8))
sns.swarmplot(data=ad.iloc[:,12:15],size=2, color=".3", linewidth=0, ax=ax[0])
sns.boxplot(data=ad.iloc[:,12:15], whis="range", ax=ax[0])
sns.boxplot(x='total_time', y='country', data=ad, ax=ax[1])


A natural triathlon distance

The Ironman triathlon, sometimes referred to as a 140.6, consists of a 3.86km swim, 180.25km bike, and 42.20km run. Added together the event is 226.3km, or 140.6 miles (hence the name). As shown above, the bike portion of the race is heavily weighted, accounting on average for more than half the total race time. While one could contemplate and debate the merits of this with respect to the sport of triathlon, I am curious what a more evenly weighted race looks like. Presumably there would be less emphasis on the bike, likely being shifted to the swim - but to what extent?

The total event distance $d_{event}$ is simply the sum of the discipline distances: \(d_{event} = d_{swim} + d_{bike} + d_{run}\)

d_ims = 3.86; d_imb = 180.25; d_imr = 42.2; d_im = d_ims + d_imb + d_imr

Weighting schemes for discipline distances

For the purposes of this analysis we will hold the total event distance constant ($d_{event} = D_T = 226.3 km$). This leaves us to choose conditions for the weighting of each discipline. We will investigate a few weighting schemes now. In each case, the discipline distance is characterized by its weight, $W_i$: \(d_i = W_iD_{T}\) The weights must be normalized, therefore: \(\sum W_i = 1\)

The Ironman 140.6 uses relative distances of 1:46.7:10.9 for $d_{swim}:d_{bike}:d_{run}$, resulting in weights of: 0.017, 0.796, and 0.186.
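The ratios and weights quoted above follow directly from the three official distances. A minimal, self-contained sketch of that arithmetic (the variable names are illustrative and independent of the notebook's own variables):

```python
# Official Ironman distances in km, as given in the text.
distances_km = {"swim": 3.86, "bike": 180.25, "run": 42.20}

total_km = sum(distances_km.values())

# Normalized weights W_i = d_i / D_T, which sum to 1 by construction.
weights = {name: d / total_km for name, d in distances_km.items()}

# Distances expressed relative to the swim distance.
ratios = {name: d / distances_km["swim"] for name, d in distances_km.items()}

for name in distances_km:
    print(f"{name}: ratio {ratios[name]:.1f}, weight {weights[name]:.3f}")
```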

Equal weighting

A simple scheme is to weight all disciplines equally: \(W_i = \frac{1}{N} = \frac{1}{3}.\) The resulting distance for each discipline is $226.3/3 = 75.4 km$.

w_s = w_b = w_r = 1/3
print('The discipline distances are swim = {:.1f}km, bike = {:.1f}km, run = {:.1f}km.'.format(*[w*d_im for w in [w_s,w_b,w_r]]))
The discipline distances are swim = 75.4km, bike = 75.4km, run = 75.4km.

This is a trivially absurd example: considering that a ‘marathon’ swim is typically defined as $10 km$, the event would open with roughly eight consecutive marathon swims and be capped by nearly two marathon runs. Meanwhile, the bike distance is nothing a casual Sunday rider couldn’t handle. While this example is silly, it does suggest the next weighting scheme: time-weighted distances.
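To put numbers on that comparison, here is a small sketch of the arithmetic. The 10 km marathon swim comes from the text; the 42.195 km marathon run is the standard marathon distance, which I am adding as an assumption.

```python
D_T = 3.86 + 180.25 + 42.20   # total Ironman distance in km
equal_leg_km = D_T / 3        # each discipline under equal weighting

MARATHON_SWIM_KM = 10.0       # marathon swim distance used in the text
MARATHON_RUN_KM = 42.195      # standard marathon run distance (assumed here)

swims = equal_leg_km / MARATHON_SWIM_KM
runs = equal_leg_km / MARATHON_RUN_KM

print(f"Each leg: {equal_leg_km:.1f} km")
print(f"Marathon swims: {swims:.1f}, marathon runs: {runs:.1f}")
```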

Time weighting

Time (or speed) weighting may be the first ‘natural’ choice for determining discipline distances. The premise is that each discipline should take roughly the same amount of time to complete. On an individual basis this almost certainly will not hold true, but if instead we consider the entire field we can determine a weighting scheme that will satisfy the equal-time criterion.

Once again, we are using the Ironman Canada 2019 data as a test-case. The mean split swim/bike/run times in this event were:

print('{:.2f}, {:.2f}, {:.2f} hours, respectively.'.format(*[np.mean(time)/(60*60) for time in [ad.swim_time,ad.bike_time,ad.run_time]]))
1.34, 6.74, 5.07 hours, respectively.

The average split time for each discipline is given simply by $t_i = \frac{d_i}{v_i}$, where $v_i$ is the average speed for the split. Since $d_i = W_i D_{T}$, the split time can be written as: \(t_i = \frac{W_i}{v_i}D_T\) and the discipline weights written as: \(W_i = \frac{v_i t_i}{D_T}.\)

If we enforce the condition that split times must be equal, $t_i = t_{swim} = t_{bike} = t_{run} = T$, then we see that, since $T$ and $D_T$ are constant, the weights are proportional to the average speed.

We can express the normalized weights, $W_i$, in a slightly different manner: \(W_i = \frac{w_i}{\sum w_i},\) where $w_i$ are unnormalized and the sum in the denominator serves as the normalization constant. Using this notation it is simple to see that in the case of equal-time weighting, $w_i=v_i$, and so: \(W_i = \frac{1}{\sum v_i}v_i.\) Evidently, the normalization constant is $\frac{1}{\sum v_i}=\frac{T}{D_T} = A$.
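The derivation above can be checked numerically. The sketch below uses made-up speeds (not the event data) to confirm that weights proportional to speed give equal split times, and that the normalization constant $A$ equals $T/D_T$. The function name `equal_time_weights` is hypothetical.

```python
def equal_time_weights(speeds):
    """Normalized weights W_i = v_i / sum(v), i.e. proportional to speed."""
    total = sum(speeds)
    return [v / total for v in speeds]

# Hypothetical average speeds in m/s, chosen only for illustration.
speeds = [1.0, 8.0, 3.0]
D_T = 226.31e3  # total event distance in metres

W = equal_time_weights(speeds)
distances = [w * D_T for w in W]                       # d_i = W_i * D_T
split_times = [d / v for d, v in zip(distances, speeds)]  # t_i = d_i / v_i

A = 1 / sum(speeds)   # normalization constant, in s/m
T = split_times[0]    # common split time

print(split_times)    # all three entries agree up to rounding
print(A, T / D_T)     # both expressions for A agree up to rounding
```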

The average swim, bike, and run speeds (from Ironman Canada 2019) are:

v = np.array([(x*1000)/(np.mean(time)) for (x,time) in zip([d_ims,d_imb,d_imr],[ad.swim_time,ad.bike_time,ad.run_time])])
print('{:.2f}, {:.2f}, {:.2f} m/s, respectively.'.format(*v))
0.80, 7.43, 2.31 m/s, respectively.
A = 1/np.sum(v)
W = A*v
d = d_im*W
t_e = 3*(A*1000)*d_im
print('Normalization A: {:.3f} s/m'.format(A))
print('Weights:\n\tSwim: {:.3f}\n\tBike: {:.3f}\n\tRun: {:.3f}'.format(*W))
print('Distances: {} km total\n\tSwim: {:.2f} km\n\tBike: {:.2f} km\n\tRun: {:.2f} km'.format(d_im,*d))
print('Total average time: {:.1f} hrs'.format(t_e/(60*60)))
Normalization A: 0.095 s/m
Weights:
	Swim: 0.076
	Bike: 0.705
	Run: 0.220
Distances: 226.31 km total
	Swim: 17.13 km
	Bike: 159.49 km
	Run: 49.69 km
Total average time: 17.9 hrs

The resulting discipline distances for this time-weighted scheme are 17.1 km, 159.5 km, and 49.7 km, for the swim, bike, and run, respectively. Expressed as ratios this is 1:9.3:2.9. This leads to an average event time of 17.9 hrs. If we scale the event down to the current average finishing time with transitions excluded (roughly 13.1 hrs, compared with 13hrs 21min 41s including transitions), the distances become:

t_im = np.mean(ad.total_time-(ad.t1_time+ad.t2_time))
C = t_im/t_e
d_scaled = C*d
d_e = d_im*C
print('Distances: {:.2f} km total\n\tSwim: {:.2f} km\n\tBike: {:.2f} km\n\tRun: {:.2f} km'.format(d_e,*d_scaled))
Distances: 165.85 km total
	Swim: 12.55 km
	Bike: 116.88 km
	Run: 36.42 km

These results are starting to look realistic. The swim, though more than three times the current Ironman distance, is only slightly longer than an Olympic marathon swim (10 km). While this on its own isn’t an issue, organizers may (rightly) view an average of 4+ hrs in the water as a significant safety risk.
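As a rough check on the scaled distances, the sketch below divides each one by the corresponding mean speed printed earlier. Both sets of numbers are copied from the rounded output above, so the resulting times are approximate.

```python
# Scaled distances (km) and mean speeds (m/s), copied from the printed
# output above; both are rounded, so the results are approximate.
scaled_km = {"swim": 12.55, "bike": 116.88, "run": 36.42}
mean_speed = {"swim": 0.80, "bike": 7.43, "run": 2.31}

# Average time per discipline in hours: t_i = d_i / v_i.
split_hours = {
    name: scaled_km[name] * 1000 / mean_speed[name] / 3600
    for name in scaled_km
}
total_hours = sum(split_hours.values())

for name, hours in split_hours.items():
    print(f"{name}: {hours:.2f} hrs")
print(f"total: {total_hours:.2f} hrs")
```

Each discipline comes out at a little over 4.3 hrs, as the equal-time condition requires, which is where the 4+ hrs in the water comes from.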

Equal-opportunity weighting

The time-weighted approach discussed previously focuses on equalizing the time of each segment. On the surface, this may seem like the fairest option to specialists of each discipline, i.e. a strong swimmer would have the same advantage as a strong runner. However, since the distribution of paces is not the same in each discipline, this is not true. This is highlighted by the normalized discipline finishing times shown below:

sns.distplot(ad.swim_time.dropna()/np.mean(ad.swim_time.dropna()), color='b')
sns.distplot(ad.bike_time.dropna()/np.mean(ad.bike_time.dropna()), color='g')
sns.distplot(ad.run_time.dropna()/np.mean(ad.run_time.dropna()), color='r')
<matplotlib.axes._subplots.AxesSubplot at 0x1a21c72110>


Evidently, the bike is the most narrowly distributed segment, followed by the swim, then the run. This means a relatively strong biker will gain less time, in relative terms (i.e. time_gained/time_total), than a relatively strong swimmer or runner. The equal-opportunity weighting scheme will assign distances such that the time gained/lost to the field by a given athlete is consistent across all disciplines for athletes of the same relative skill level.
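To illustrate why equal split times do not imply equal opportunity, consider an athlete who is one standard deviation faster than the field in a single discipline. The sketch below uses the relative standard deviations reported earlier (17.0%, 12.4%, 18.4%) and a hypothetical 4 hr split for every discipline. It also assumes the relative spread stays the same when the distances change, which is my assumption and not something the data can confirm.

```python
# Relative standard deviations (sigma_i / t_i) reported earlier in the text.
relative_std = {"swim": 0.170, "bike": 0.124, "run": 0.184}

# Hypothetical equal split time for every discipline, in hours.
equal_split_hours = 4.0

# Time gained (minutes) by an athlete one standard deviation faster than
# the mean, assuming the relative spread is unchanged by the new distances.
gain_minutes = {
    name: rel * equal_split_hours * 60
    for name, rel in relative_std.items()
}

for name, gain in gain_minutes.items():
    print(f"{name}: {gain:.1f} min gained")
```

Under these assumptions the strong biker gains roughly 30 minutes, while an equally strong swimmer or runner gains more than 40.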

Clearly, the distributions above are not perfectly Gaussian, but they are reasonably close to normal: they are (essentially) single-mode and not significantly asymmetric. The mean and median differ by at most about 1.6% (swim). A reasonable proxy for the discipline variability is therefore the standard deviation of the distribution. Scaling the equal-time weight $w_i = v_i$ by the inverse of the relative standard deviation, $\frac{t_i}{\sigma_i}$, gives the unnormalized weights:

\[w_i = \frac{t_i}{\sigma_i}v_i = \frac{d_i}{\sigma_i}.\]

The term $\frac{\sigma_i}{t_i}$ is the standard deviation of the distribution relative to the mean:

print('Relative standard deviation\nSwim: {:.3f}\nBike: {:.3f}\nRun: {:.3f}'.format(*[np.std(time.dropna())/np.mean(time) for time in [ad.swim_time,ad.bike_time,ad.run_time]]))
Relative standard deviation
Swim: 0.170
Bike: 0.124
Run: 0.184
w = np.array([dist/np.std(time.dropna()) for dist,time in zip([d_ims,d_imb,d_imr],[ad.swim_time,ad.bike_time,ad.run_time])])
A = 1/(sum(w))
W = A*w
d = d_im*W
t_e = np.sum((d*1000)/v)
print('Weights:\n\tSwim: {:.3f}\n\tBike: {:.3f}\n\tRun: {:.3f}'.format(*W))
print('Distances: {} km total\n\tSwim: {:.2f} km\n\tBike: {:.2f} km\n\tRun: {:.2f} km'.format(d_im,*d))
print('Total average time: {:.1f} hrs'.format(t_e/(60*60)))
Weights:
	Swim: 0.061
	Bike: 0.777
	Run: 0.163
Distances: 226.31 km total
	Swim: 13.71 km
	Bike: 175.75 km
	Run: 36.85 km
Total average time: 15.8 hrs

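Surprisingly, relative to the standard Ironman distances, this treatment increases the swim distance (from 3.86 km to 13.71 km) at the expense of the run (42.20 km to 36.85 km) and, to a lesser extent, the bike (180.25 km to 175.75 km).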

Time-gained weight

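The strategy above can be treated more analytically. For an athlete moving at speed $v$ over a discipline distance $d_0$ that the field covers at average speed $v_0$, the time gained on the field is: \(t_g = t_0 - t = \frac{d_0}{v_0}-\frac{d_0}{v}\)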
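As a worked example of the right-hand expression above, the sketch below computes the time gained by a swimmer who is 10% faster than the field over the standard 3.86 km swim, using the rounded mean swim speed of 0.80 m/s. The function name `time_gained` and the 10% figure are illustrative assumptions.

```python
def time_gained(d0, v0, v):
    """Time gained on the field (s): d0/v0 - d0/v.

    d0 is the distance in metres, v0 the field's average speed and v the
    athlete's speed, both in m/s. Positive when the athlete is faster.
    """
    return d0 / v0 - d0 / v

d_swim = 3860.0           # Ironman swim distance in metres
v_field = 0.80            # rounded mean swim speed from the text, m/s
v_athlete = 1.10 * v_field  # hypothetical athlete, 10% faster than the field

gain = time_gained(d_swim, v_field, v_athlete)
print(f"Time gained: {gain:.0f} s ({gain / 60:.1f} min)")
```

An athlete slower than the field gets a negative value from the same function, i.e. time lost.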

w = np.array([v0*(1-(np.mean(time)/np.std(time.dropna()))) for v0,time in zip(v,[ad.swim_time,ad.bike_time,ad.run_time])])
A = 1/(sum(w))
W = A*w
d = d_im*W
t_e = np.sum((d*1000)/v)
print('Weights:\n\tSwim: {:.3f}\n\tBike: {:.3f}\n\tRun: {:.3f}'.format(*W))
print('Distances: {} km total\n\tSwim: {:.2f} km\n\tBike: {:.2f} km\n\tRun: {:.2f} km'.format(d_im,*d))
print('Total average time: {:.1f} hrs'.format(t_e/(60*60)))
Weights:
	Swim: 0.058
	Bike: 0.788
	Run: 0.154
Distances: 226.31 km total
	Swim: 13.18 km
	Bike: 178.31 km
	Run: 34.82 km
Total average time: 15.4 hrs

data = ad.swim_time.dropna()
kde = sns.kdeplot(data)
x,y = kde.get_lines()[0].get_data()
imax = np.argmax(y)
plt.vlines([x[imax],np.mean(data),np.median(data)],0,np.max(y), color=['r','k','g'])
<matplotlib.collections.LineCollection at 0x1a21c23c50>


sns.distplot((ad.t1_time+ad.t2_time).dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a21d8ac50>

