Now that we've got the spreadsheet thing more-or-less licked, it's time to start actually dorking around with the data to see if we can learn anything. Initially I thought I'd look at one of this year's races, but with the
Quest starting in less than a week it seemed more fun to look at previous races. I'm still getting the hang of the
R statistical package, which provides both statistical analysis functions and some good (if slightly unattractive) plotting functionality.
If you'll recall, last year's top 5 teams, in order, were:
- Hugh Neff
- Allen Moore
- Lance Mackey
- Jake Berkowitz
- Brent Sass
So, to start with something simple that won't reveal much but might remind you what a squeaker last year was, I decided to take a look at positions coming into checkpoints: the first team to arrive is in position 1, the second in position 2, and so on. It also might show at which point in the race things started to firm up. Remember that Two Rivers was the first checkpoint, so there hadn't been a lot of trail for teams to move into position. Still, you might note that 1) two of the top 5 teams actually drift back a bit through the first few checkpoints, and 2) Brent came out hard but couldn't hang onto it.
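If you want to play along, here's roughly how a plot like that can be put together in base R. This is just a sketch: the file name and the column layout (team, checkpoint, position) are stand-ins for illustration, not the actual structure of my spreadsheet.

```r
# Sketch: arrival position at each checkpoint, one line per team.
# Assumes a CSV with columns team, checkpoint, position (names are hypothetical),
# with rows already in trail order so the checkpoints plot left to right.
arrivals <- read.csv("quest2011_arrivals.csv", stringsAsFactors = FALSE)
arrivals$checkpoint <- factor(arrivals$checkpoint, levels = unique(arrivals$checkpoint))

teams <- c("Hugh Neff", "Allen Moore", "Lance Mackey", "Jake Berkowitz", "Brent Sass")
plot(NULL, xlim = c(1, nlevels(arrivals$checkpoint)),
     ylim = rev(range(arrivals$position)),   # flip the axis so position 1 is at the top
     xaxt = "n", xlab = "", ylab = "Arrival position")
axis(1, at = seq_len(nlevels(arrivals$checkpoint)),
     labels = levels(arrivals$checkpoint), las = 2)
for (i in seq_along(teams)) {
  d <- arrivals[arrivals$team == teams[i], ]
  lines(as.integer(d$checkpoint), d$position, type = "o", col = i)
}
legend("bottomright", legend = teams, col = seq_along(teams), lty = 1)
```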
I may go back and look at previous years with the same question, but one thing I'm interested in examining over multiple years is whether or not there's something going on in the first half of the race that can have predictive value for the outcome. In the meantime, if you've got questions that you think either plots or statistics can answer, let me know.
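As a first stab at that predictive-value question, something as simple as a rank correlation between position at the halfway point and final finishing position would do. Again, this is only a sketch; the "results" table and its columns are placeholders, and Dawson is used here just because it's roughly the midpoint.

```r
# Does arrival position at Dawson (roughly halfway) predict the final finish?
# 'arrivals' is the hypothetical table from above; 'results' is another
# made-up table with columns team and final_position.
results <- read.csv("quest2011_results.csv", stringsAsFactors = FALSE)
halfway <- subset(arrivals, checkpoint == "Dawson")
both <- merge(halfway, results, by = "team")
cor(both$position, both$final_position, method = "spearman")
```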
Hi Melinda,
I have a spreadsheet for probably the last 10 years of the Quest, and I generally think of them as two different sets of data due to the change in direction of the race.
Looking at 2011 and comparing first arrivals into checkpoints with 2007 and 2009, that race was quicker until Slaven's. If you recall, it was very cold then, and I think that might have slowed things down. I also wonder, though: overall the race was ahead of its recent pace to that point, so the dogs may also have naturally slowed down due to the early pace.
(I will try to email a graph that shows this.)
I am really wondering if the best thing to do is to read my race data into a database and then use that to run queries and analyze the data.
Something to keep in mind, though: there are several races in the past where odd things happened. Trucking the teams from Braeburn to Carmacks one year, returning to Dawson for the finish another year... for those years only some of the data would be valid.
Right, great minds think alike and all that. I worked with last year's data because I already had it in a spreadsheet, but the more relevant question is what happened two years ago, because the race from east to west is different from the one from west to east. However, we know what happened that year after Dawson.
I've put up a spreadsheet for 2011, which I'm making available. I do think a database is the right way to store and access the data - if nothing else, if you're doing much with your spreadsheet beyond using it as a place to stash your tables, you're basically reinventing a relational database anyway. What I've been finding is that as you add complexity to Google spreadsheets, they grind to a halt. Yesterday I actually hung the 2012 Quest spreadsheet and had to quit out of it completely.
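For what that migration might look like, here's a rough sketch using RSQLite (just one possible backend); the database file, table, and column names are all invented for the example.

```r
# Rough sketch: move a spreadsheet export into SQLite and query it there.
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "quest.db")
dbWriteTable(con, "arrivals", arrivals, overwrite = TRUE)  # 'arrivals' from a CSV export
dbGetQuery(con, "SELECT checkpoint, team, position
                 FROM arrivals ORDER BY checkpoint, position")
dbDisconnect(con)
```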
For what it's worth, R (and probably other statistical and plotting packages) can use a database as a backend through an ODBC connector.
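In R that looks something like the following, using the RODBC package and assuming a data source name has already been configured on the machine ("quest" here is made up):

```r
# Pull checkpoint data straight from a database over ODBC.
library(RODBC)
ch <- odbcConnect("quest")  # "quest" is a hypothetical DSN
arrivals <- sqlQuery(ch, "SELECT team, checkpoint, position FROM arrivals")
odbcClose(ch)
```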
If you wanted to load your spreadsheets into a database and make those available, that would be fantastic. From what I've seen, there's a lot more interest in taking a closer look at the data than there was a few years ago.