Sunday, February 23, 2014

Why Iditarod tracker mileage sorting is messed up

People have noticed that the sort-by-mileage function on the Iditarod leaderboard was wrong (it's since been fixed), with 100-something miles sorting in front of smaller mileages.  For example, see this screenshot grabbed by Dawn Beckwell:



Here's what's going on (and this will be old news to a few people and not interesting at all to most):  Computers are binary calculators, with all data, from programs to stored files, taking the form of a string of 0s and 1s.  Both characters and numbers are stored as strings of 0s and 1s, and there are standardized encoding schemes for representing character data.  By far the most popular/successful is known as ASCII, or the American Standard Code for Information Interchange (nice collection of ASCII tables here).  So, while the number 1 is "00000001" in base 2 (base 2 because each bit can take one of only two values, 0 or 1), the character "1" is encoded in ASCII as 00110001.  That's right, 1 and '1' are not the same.  When you see a '1' on a screen, what's really behind it, the way the data are actually represented, is 00110001.  The number 00000001 has to be translated to the character 00110001 for printing or display.
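To make the difference between the number 1 and the character '1' concrete, here is a small Python sketch.  The variable names are just for illustration; `ord` and `format` are standard Python built-ins.

```python
# The number 1 and the character "1" have different bit patterns.
number_one = 1
char_one = "1"

# The number 1 written as eight binary digits.
number_bits = format(number_one, "08b")

# ord() gives the character's code (49 for "1"), shown here as eight bits.
char_bits = format(ord(char_one), "08b")

print(number_bits)  # 00000001
print(char_bits)    # 00110001

# Displaying a number means translating it into characters first.
print(str(number_one) == char_one)  # True
```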

Programs that do things to data, like sorting them, have no way to know what 00110001 means or how they should treat it unless they are told.  In a lot of web-oriented and application-oriented programming languages it's very easy to sort data (old school, we had to write our own sort functions), but the default is often that data are treated as characters.  The sort compares the first character of each string, moves on to the next character only to break a tie, and so on.  In that scheme, "1" is smaller than "7", so anything starting with "1" (like "101.5") sorts ahead of anything starting with "7" (like "75"), no matter how many digits follow.  To sort as numbers, in modern programming languages you just need to tell the sorting function to treat the data as numbers, not characters, so that instead of looking at the data character by character it understands that the value it needs to sort is 101.5.
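Here is a rough sketch of the same problem in Python.  The mileage values are made up for illustration, and this is not the leaderboard's actual code (I don't know what language or sort routine it uses); it just shows how a character-based sort and a numeric sort disagree on the same data.

```python
# Hypothetical mileages, stored as text the way they might arrive from a web page.
mileages = ["75", "101.5", "98", "123", "9.5"]

# Default sort on strings: compared character by character.
as_text = sorted(mileages)

# Telling the sort to treat each value as a number.
as_numbers = sorted(mileages, key=float)

print(as_text)     # ['101.5', '123', '75', '9.5', '98']
print(as_numbers)  # ['9.5', '75', '98', '101.5', '123']
```

In the first result the 100-something values land in front of 75 because the character "1" comes before "7".  In the second, `key=float` makes the sort compare 101.5 and 75 as numbers.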
