Saturday, January 26, 2013

Tidying up timing data for arithmetic

The Northern Lights 300 started yesterday.  Helen Hegener is handling web and social media for the race and put together some really nice Google spreadsheets for maintaining the data.  It's a frosty -45 this morning in Two Rivers, too cold to do much outdoor recreating, so I thought this might be a good opportunity to start playing around with some plots (charts) I've been thinking about.  Oftentimes graphics are the best way to bring out something interesting in the data.

However, when I look at the NL300 spreadsheets there are a few issues that need to be resolved before I can start plugging in the formulas to do the arithmetic:

  • times are represented as 12-hour numbers rather than as 24-hour numbers.  For example, 2:00 may mean either 2:00am or 2:00pm.  Ideally 2pm would be represented as 14:00, and the times need to be converted in the spreadsheet before arithmetic on them will give correct results
  • there are no dates - not a problem in short races, but when teams are out overnight we need the dates, or possibly offsets ("day 1", "day 2", etc.), to be able to do the arithmetic.  If someone leaves a checkpoint at 11pm and arrives at the next checkpoint at 5am the next morning, 5:00 - 23:00 is going to produce a useless result.
  • each checkpoint has its own spreadsheet, sorted by arrival time.  This is the correct way to show the data to fans but it means that it's more difficult to do arithmetic on runtimes between two checkpoints because mushers change rows between checkpoints
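To see those first two problems concretely, here's a minimal Python sketch.  The base date and the times are made up for illustration; the idea is that once each reading carries a day offset and an am/pm marker, subtraction works even across midnight:

```python
from datetime import datetime, timedelta

def to_timestamp(day, hhmm, ampm):
    """Turn a race-day offset plus a 12-hour clock reading into a
    full datetime, anchored to a base date (day 1 = Jan 25 here)."""
    t = datetime.strptime(f"{hhmm} {ampm}", "%I:%M %p")
    base = datetime(2013, 1, 25)  # hypothetical race start date
    return base + timedelta(days=day - 1, hours=t.hour, minutes=t.minute)

# Leave a checkpoint at 11:00pm on day 1, arrive at 5:00am on day 2:
depart = to_timestamp(1, "11:00", "pm")
arrive = to_timestamp(2, "5:00", "am")
print(arrive - depart)  # 6:00:00 -- the answer 5:00 - 23:00 can't give you
```

The same trick works in a spreadsheet: store full date-and-time values rather than bare clock readings, and the subtraction takes care of itself.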
Fortunately some of these are solvable with relatively little typing.  For example, on that last point, we've got something for each musher that stays the same across checkpoints: bib number.  Names are not reliable because a single typo can make the same person look like two different people (to illustrate, during a previous race someone named "Matt" was recorded as "Matt" at some checkpoints and "Mark" at others).
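A quick sketch of matching rows on bib number rather than name — the bibs, names, and times below are invented for illustration (including the "Matt"/"Mark" typo from the story above):

```python
# Two checkpoint tables in different row orders: (bib, name, arrival time).
cp1 = [(7, "Matt", "13:45"), (3, "Anna", "14:10")]
cp2 = [(3, "Anna", "19:02"), (7, "Mark", "18:30")]  # same musher, name typo

# Index the second checkpoint by bib number, then match rows from the first.
by_bib = {row[0]: row for row in cp2}
for bib, name, arrived in cp1:
    if bib in by_bib:
        print(bib, name, arrived, "->", by_bib[bib][2])
```

The name typo doesn't matter: bib 7 matches bib 7 no matter what the timekeepers wrote down.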

So, let's walk through this process:
  1. We assume you've got a spreadsheet containing the start times, which is pretty much by definition sorted by bib number already.  
  2. Create a new sheet or scratch area.  In Google spreadsheets, you'll see a plus sign in the lower left-hand corner of the page:

  3. Copy and paste the checkpoint data into the new sheet
  4. Delete the columns to the left of the bib number column.  You should get something that looks like this:

  5. Select all the rows containing mushers (i.e. exclude blank lines and header lines):

  6. Next, go to the Data menu and select "Sort range ..."

  7. You'll get a pop-up dialog box.  You want to sort by bib number.  If that's in column A, as it is here, click "Sort".  Otherwise change the sort-by column first, then click the "Sort" button

  8. And there it is:
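For what it's worth, steps 5 through 7 are just an ordinary sort on one column.  A tiny Python sketch of the same operation, with made-up rows of (bib, arrival time):

```python
# Rows pasted from a checkpoint sheet, currently in arrival order;
# re-sort them by bib number (column A in the spreadsheet).
rows = [
    [5, "11:42"],
    [2, "11:58"],
    [9, "12:15"],
]
rows.sort(key=lambda r: r[0])  # the equivalent of "Sort range" on column A
print(rows)  # [[2, '11:58'], [5, '11:42'], [9, '12:15']]
```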

Now you can go ahead and play with the numbers.  I like to copy the arrival times into a spreadsheet that contains a summary of arrival and departure times, and do the arithmetic to calculate runtimes, layovers, etc.  In this case we can't do it until we fix the times so that they're on a 24-hour clock, which I'll cover in a subsequent post.  We also need to account for missing data (mushers who haven't arrived at a checkpoint yet), and I'll cover that as well.
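Once the times are on a 24-hour clock with dates attached, the runtime arithmetic itself is simple; here's a sketch (timestamps invented) that also handles the missing-data case, where a musher hasn't reached the next checkpoint yet:

```python
from datetime import datetime

def runtime(depart, arrive):
    """Runtime between checkpoints, or None if either time is missing
    (e.g. the musher is still out on the trail)."""
    if depart is None or arrive is None:
        return None
    fmt = "%Y-%m-%d %H:%M"
    return datetime.strptime(arrive, fmt) - datetime.strptime(depart, fmt)

print(runtime("2013-01-25 23:00", "2013-01-26 05:00"))  # 6:00:00
print(runtime("2013-01-25 23:00", None))                # None - still on trail
```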

Note, also, that this procedure can be used to sort a table on any column, so this is a handy way to get a quick handle on, say, departure times, when the table is sorted by arrival times, etc.


  1. How did you get to step 1? The spreadsheets aren't downloadable from , and copy-and-paste doesn't look straightforward (I don't have a Mac spreadsheet, and I can't paste into a Google spreadsheet).

  2. Hi, Jack. There are several options. I did try copy-and-paste into a Google spreadsheet. Unfortunately that inserted a blank row after each row, and since my goal is to do as little work as possible, I pulled the URIs for the spreadsheets out of the source code, loaded those, and used the Table Capture extension for Chrome to copy each table to the clipboard, then pasted it into a Google spreadsheet. That produced happy results that I did not have to fiddle with any further, beyond cleaning up the data.