Saturday, July 25, 2015

Doing data

I had big plans for the summer - writing some code, writing up some analyses, and getting the new Mushing Tech website up and running.  Well, with one thing and another, none of that has happened, but I thought it might be worthwhile to post something about what people who are interested in a closer look at distance mushing race data can do for themselves.

Very broadly speaking, looking at the data involves two separate but related steps: 1) acquiring the data, and 2) doing the analysis.

Data acquisition and cleaning is a laborious grind.  Right now race data tend to be in pretty shabby shape.  They are generally available only as web pages, with no two races using the same format (even races using my spreadsheets have tweaked them for their own use, which is great but does increase the effort needed to take the data apart).  That means scraping the data and converting it into a format that you can use with analytical tools.  Worse, both major races have some major database errors - Iditarod's archived races have badly broken checkpoint links, and the Yukon Quest's archived races have screwed up bib numbers.
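
To give a concrete (if simplified) picture of what scraping involves under the hood, here's a sketch using only Python's standard library that pulls the cells out of an HTML results table.  The table fragment is made up for illustration - real race pages are messier than this - but the basic approach is the same one the browser-extension scrapers use:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# A made-up fragment standing in for a race results page:
html = """
<table>
  <tr><th>Pos</th><th>Musher</th><th>Time</th></tr>
  <tr><td>1</td><td>A. Example</td><td>8d 13h 04m</td></tr>
  <tr><td>2</td><td>B. Sample</td><td>8d 18h 39m</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(html)
for row in scraper.rows:
    print(",".join(row))
```

Once the rows are in a list of lists like this, writing them out as CSV for a spreadsheet or an analysis tool is a one-liner with the csv module.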

I've been pulling race data into my own spreadsheets using two mechanisms.  For races underway, I manually enter data into live spreadsheets that I keep in Google Sheets.  Bit by bit, I've also been converting old race data using table scraping tools available as browser extensions (for example, DataMiner for Chrome, and the extremely fabulously wonderful TableTools2 for Firefox).  So, I do have a collection of race data available in spreadsheet form, here.  Please feel free to use them yourselves, copy them, and so on.  They are released under the BSD 3-clause license, so please respect that and provide attribution if you do use them.

Another source of data is the Trackleaders race tracking archive, and this is the really interesting one.  For each race and each musher being tracked, the speed/time plot source data is actually embedded in the web page, and you can pull it out if you know where to look.  I've got a Python program that does that and calculates rest schedules from it, here.  Please feel free to grab a copy and tweak it for your own use (again, it's released under the BSD 3-clause license - it's yours to use as you wish at no cost other than providing attribution).
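
If you'd rather roll your own than use my program, the general technique is straightforward: fetch the page, find the embedded data with a regular expression, and parse it.  The snippet below is only a sketch - the variable name speedData and the page fragment are invented for illustration, not Trackleaders' actual markup - so you'd need to adapt the pattern to what you actually find in the page source:

```python
import re
import json

# A made-up snippet standing in for page-embedded plot data; the real
# markup differs, so treat this as an illustration of the technique only.
page = """
<script>
var speedData = [[1425200400000, 7.2], [1425204000000, 6.8],
                 [1425207600000, 0.0], [1425211200000, 0.1],
                 [1425214800000, 8.1]];
</script>
"""

# Grab the JavaScript array literal and parse it as JSON.
match = re.search(r"var speedData = (\[\[.*?\]\]);", page, re.S)
points = json.loads(match.group(1))

# A crude rest detector: flag samples where reported speed is near zero.
REST_MPH = 0.5
resting = [(t, s) for t, s in points if s < REST_MPH]
print(f"{len(resting)} of {len(points)} samples look like rest")
```

From there, grouping consecutive low-speed samples into rest stops and summing their durations gives you a rest schedule, which is essentially what my program does.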

Okay, so that's the data side of things, but how do you learn how to look at data?  Many people I've talked with have questions they want to ask about the data but aren't sure how to go about answering them.

One way is to ask someone with a stats background, but a more Alaskan approach might be to learn to do it yourself.  There are a couple of possibilities at low or no cost.

Data science is a booming field right now and the educational resources are really impressive.  There are a lot of new books, many of which require minimal technical background.  O'Reilly has a remarkable collection in both print and ebook form.  For example, Cathy O'Neil's "Doing Data Science" is a terrific book that gives you an overview of the approaches you can take to looking at data, while Joel Grus's "Data Science From Scratch" provides a hands-on introduction to a variety of analytical techniques along with a grounding in Python programming (if you're already a Python programmer you'll probably want to use an existing data science module, like scikit-learn).

But, if you've got a computer and an internet connection, a better choice might be to take a MOOC (a "Massive Open Online Course").  Data science and introductory statistics courses are everywhere, and they provide an opportunity to learn new skills with as much or as little of a commitment as works for you.  You can take one and do all the exercises and take all the quizzes (and earn a certificate), or just watch the lecture videos - it's all up to you.

Coursera features courses prepared by faculty at major universities and the quality is extremely high.  As an example, take a look at the courses in the Data Science Specialization offered by Johns Hopkins University.  They offer everything from a very high-level overview to a class on regression models.  ("Getting and cleaning data" might be a good one for people interested in looking more closely at mushing data!)

Udacity is another popular MOOC site.  It's more oriented towards practitioners, and its classes tend to be less consistent in quality than Coursera's, but they are learn-at-your-own-pace with no deadlines for homework or quizzes, and some of the classes were developed by companies like AT&T and Google.  They also have a large data science category, with some excellent introductory classes like their Introduction to Descriptive Statistics, as well as classes that might give you some idea about how to approach looking at mushing data, like their "Data Analysis With R" class.

I'll also be making a few videos showing how to work with Google spreadsheets to look at data, dealing with questions like how to do arithmetic on dates and times and how to compute some simple summary statistics.
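
As a preview of the sort of thing those videos will cover, here's the same kind of date arithmetic done in Python rather than a spreadsheet.  The checkpoint in and out times are invented for illustration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical checkpoint (arrival, departure) times for one team;
# the format here is my own, not any race's official one.
times = [
    ("2015-02-07 14:10", "2015-02-07 18:40"),
    ("2015-02-08 03:25", "2015-02-08 09:55"),
]

fmt = "%Y-%m-%d %H:%M"
# Subtracting two datetimes gives a timedelta; convert to hours.
rests = [
    (datetime.strptime(out, fmt) - datetime.strptime(arr, fmt)).total_seconds() / 3600
    for arr, out in times
]
print("rest hours per checkpoint:", rests)
print("average rest:", mean(rests), "hours")
```

In Google Sheets the equivalent is subtracting one date-time cell from another and multiplying by 24 to get hours - the underlying arithmetic is the same.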

But summer passes quickly in interior Alaska.  Yukon Quest sign-ups are in a week, and we should be able to start training the dogs regularly in about a month.  It's starting to get dark at night and I think we're all looking forward to seeing the aurora again.  But, in the meantime, there's summer to enjoy, fish to catch, projects to finish, and friends to visit.  Have a great rest of your summer, and watch this space!

