Friday, February 22, 2013

Using my spreadsheets for race data - Part 2

In my previous post I introduced some of the background and goals for using a spreadsheet (particularly  the Google Drive spreadsheet) for doing some record-keeping for distance dogsled race data.  In this post I'd like to go into some of the details of how I dealt with the calculations.

The columns represent a particular piece of data of interest.  I'm probably going to find myself slipping into calling them "fields" from time to time, so when you see that, know that I'd just referring to a particular column.  The columns/fields are:

  • Bib number
  • Musher name
  • Date and time in
  • Runtime from the previous checkpoint
  • Speed from the previous checkpoint
  • Speed ranking from the previous checkpoint (where "1" is the fastest, and higher numbers are slower)
  • Date and time out
  • Rest time
At the top of each column, in spreadsheet row 1, are the column headers explaining what's in each column.

I'll step through each column in the following sections.

Bib number


As I explained in my previous post, I'm using the bib number to tie data together across sheets.  I initially loaded my data from the Yukon Quest website (for example, here) using the mechanism described in this blog post.  But the Quest does something a little odd, and rather than having a separate bib number field in their own data, they put the bib number in parentheses next to the musher name.  Extracting it by hand is tedious and error prone, so I populated my bib number field by extracting it from the Quest's musher name field using the Google spreadsheet "regexextract" function.  A "regular expression," often abbreviated to "regex," is a way of describing patterns of text.  Regexextract pulls data out of a cell based on a regular expression.  Here's what I put into cell B1 (the "Bib" column in the "Hugh Neff" row) in the spreadsheet I'm using:

=regexextract(B2;"([0-9]+)")

What this says is that in cell B2 (the one containing "Hugh Neff (4)"), I'm looking for a number from 0 through 9 followed by zero or more numbers from 0 through 9, and I want to extract those into this cell.    I then copied that and pasted it into the other rows in this column (that is to say, cells A3 through A27), and the spreadsheet software automatically adjusted the cell number to match the current row.  So, when I pasted it into cell A4, it then adjusted the formula to examine cell B4.  The best thing is that you can select a range of cells to paste the formula into, and it just works - all the cell row numbers are adjusted.

As I mentioned, the reason I did this is because of how the Quest shows the bib number.  You may prefer to manually enter the bib number in its own column for your race.

Musher

This is just the name of the musher.  In this particular spreadsheet I have the bib number in parentheses following the musher name, but that's only because the Yukon Quest does it that way, as explained in the previous section.

Date/time in

This cell contains the date and time that the musher arrived in this checkpoint, in the form 
month/day/year hour:minutes:seconds
Here are several things you should know:
  • The cell must have this particular format applied.  Go to the "Format" menu, choose "Number," and then choose "Date time".  Apply the format to every cell in the column except for the one in row 1, which contains the column header (label)


  • Google Drive will try to guess what you mean and can be pretty smart about format conversions. If you type something like "February 3 18:00" it will convert it to "2/3/2013 18:00:00" for you, automatically adding the year, converting the date to numbers, and adding the seconds.   I've run into one situation in which it did not convert correctly.  This may not be intuitive to people who don't write software, but if you enter a date/time that has leading blanks (that is to say, instead of saying "2/3/2013 18:00:00" it says "     2/3/2013 18:00:00") it will think that you mean that this is just a string of text characters and should not be converted.  I ran into this with the Quest 300 data, where it looks like they tried to center numbers in the cell by adding a bunch of blanks in front.  Don't do that!

Runtime from previous checkpoint

This is where you start to see some serious labor savings.  We tell the spreadsheet to do the work for us, by finding the right values to do the arithmetic on, and then doing the arithmetic.  Here's an example of what we put in a "runtime" cell on the sheet for the Circle City checkpoint:

=if (C2="", "", C2-vlookup(A2, 'slavens'!$A$2:$G$27, 7, FALSE))

Let's step through this one:

As I mentioned earlier, we need to make sure  that we're catching errors where we can, so that we don't show confusing junk to the user.  At the start of the race I want to fill in the formulas, even though I don't know what the actual dates and times will be.  I do know where they are, though, so I put the whole thing inside a big "if" clause.  The "if" formula in Google spreadsheet has three parts, inside parentheses and separated by commas.  The first part is the one that's evaluated.  In this case we're checking to see if the C2 cell is empty.  The second part is what's done if the first part is true.  In this case we just keep it empty.  The third part is what we do is if the first part is not true - if the C2 cell is not empty.  

So here's what we do if we've got an arrival date and time: we need to subtract the time out of the previous checkpoint from the time into the current checkpoint to get the runtime.  The date and time into the current checkpoint is in row C, and we need to find the time out of the previous checkpoint to subtract from it.  We could use a fixed spreadsheet location, but because we want to be able to be more foolproof than not, allow ourselves some flexibility in musher ordering, and reduce the likelihood for error, we're going to search for that number.  To do that, we use the spreadsheet "vlookup" function.  It takes four arguments, separated by commas:
  • the first argument is the value we want to look for.  In this case, we're using bib numbers to identify mushers and the bib number is stored in row A, so we're going to look for the value in A2 for row 2, value in A3 for row 3, and so on.
  • the second argument is where to look, and this is a little tricky.  First, we want to look in the sheet for the Slavens dog drop, which I've labeled "slavens"  
  • Within the "slavens" sheet, we want to search the range of cells from A2 through G27.  However, as I've mentioned above, the spreadsheet tries to be helpful by adjusting rows and columns when you copy and paste.  If you have a formula in cell B3 that references cell A2, if you copy that formula into cell C5 the spreadsheet will change that reference to B4.  So, to tell it you don't want it to adjust the cell reference values you prefix each row and column with a '$' sign.  '$A$2' means "yes, I really mean A2."  
  • the third argument is the column from which I'd like to take the value.  I think this entire interface is a little awkward but it does give you some flexibility.  So, when you find the value from A2 (in this case) in the range of columns given in the second argument, take the value from the 7th column (wasn't that a movie?).  In this case, the 7th column on the previous spreadsheet contains the departure time.  And magic, it does the arithmetic and calculates the runtime for you.
  • The last parameter ("FALSE") tells vlookup that your data are not sorted.
  • Do make sure that the "Format" for the runtime cell is hours, using the Format menu:


So that's how we set up the spreadsheet to take care of the runtime arithmetic for us.  Once that's done, it's done.  

Speed

Speed is derived from runtime (speed is a function of time and distance).  I was pretty slovenly and put the distance between checkpoints directly into the formulas and then copied-and-pasted.  Here's what it looks like:

=iferror(round(58/(int(D2)*24+hour(D2)+minute(D2)/60),1))

This looks like a hairball but it actually isn't, and to be honest I'm not really sure that this is the best way to calculate the speed.  But, it works, given the limitation of the medium.  Again, we're going to tackle this from the outside ("iferror") in.

  • "iferror" is a function that allows us to catch errors and make sure that something that's clearly wrong won't be displayed.
  • "round" is a function that allows us to control the number of digits to the right of the decimal point (or precision).  When you're talking about dogsled speeds, something like "7.1" makes sense.  Something like "7.128475" really doesn't, so let's not show that.
  • Inside the "round" function is where we're actually doing the arithmetic - dividing the distance traveled by the time it took to travel it.
  • The rest of the string ("int (D2)*24+hour(D2)+minute(D2)/60)" converts the date and time into just plain hours.  I can write some more about this at a later time but I think it may be a distraction here, since my assumption is that the people who use these spreadsheets are going to be doing a lot of copying and pasting and only need to know details to the extent that it helps them adapt the spreadsheets to their own race.  So: use this formula, but change the miles ("58") to the mileage from the last checkpoint to this one as needed.

Speed rank

I decided to rank runtimes from the previous checkpoint because it's something people ask about, and because it's easy to do.  I put this in column F.  The formula is:
=iferror(rank(E2; $E$2:$E$28; 0))
Again, outside to inside we have
  • iferror, to catch mistakes like missing data
  • the "rank" function, which takes the following parameters/arguments
    • the first argument is the cell for which you'd like to find the ranking (speed, which is in column E)
    • the second is the range within which you're ranking the data (and again, you put '$' in front of the row and column numbers to make it's clear to the spreadsheet that these are absolute, and should not be adjusted if and when you paste this formula into another column)
    • and the third argument tells rank if you want to sort in ascending order (0) or descending order (1)
You may have noticed that in this case I'm using a semicolon (';') to separate arguments when previously I'd been using a comma (','). This is the kind of inconsistency that can creep into larger projects when they're done over time rather than in just one sitting. I can and should bring them into consistency at some point ...

Departure

Column G contains the last piece of manually entered data, the departure time from the checkpoint. Again, this needs to be in the Date time format.

Layover, or rest

The last column, H, contains the rest time at the checkpoint, and is simply the value you get from subtracting time in from time out:

=iferror(if (G2="","", G2-C2))

Wrapped, as always, in an "iferror".

So that's a pretty messy, incoherent overview of what's actually in the spreadsheet cells, and how the arithmetic is done.

In the next post, I'll tell you what you need to do to adapt these spreadsheets for your race.

No comments:

Post a Comment