CodeKata Four: Data Munging

Dave's specification: http://blogs.pragprog.com/cgi-bin/pragdave.cgi/Practices/Kata/KataFour.rdoc

Mmm, parsing problems. This is easy stuff, using pretty standard UNIX tools. No fancy-schmancy Python needed here; /bin/sh and awk are good enough for this kind of thing.

  • Weather data parser: codekata/4/weather.sh
    • Time spent:
      • Probably 10-15 minutes doing a rough cut, then a few more minutes cleaning up the interface.
    • Comments:
      • This is trivial stuff that system administrators do fairly regularly, so it's probably an unfair exercise for me; parsing formatted data is my day-to-day.
  • Soccer data parser: codekata/4/soccer.sh
    • Time spent:
      • Because it was a big cut-and-paste of the original code with a couple of small changes, probably only about three minutes.
    • Comments:
      • Since the first table parser was basically identical to what I needed to do here, implementation was trivial.
  • Combined parser: codekata/4/both.sh
    • Time spent:
      • 10-15 minutes.
    • Comments:
      • This was actually harder than just reusing the original version, because of the tools I chose to use; sh and awk scripts don't lend themselves well to "good programming practices". ;-) A little creative substitution, and we were all set.
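
For a sense of scale, a single-purpose cut of the weather parser could look roughly like the sketch below. This is not the original weather.sh; the function name `weather`, the two-line header, and the column positions (day in column 1, maximum temperature in column 2, minimum in column 3) are all assumptions about the data file. sh handles the interface and awk does the heavy lifting.

```shell
#!/bin/sh
# Hypothetical single-purpose parser (not the original weather.sh).
# Assumed layout: two header lines, day in $1, max temp in $2, min temp in $3.
weather() {
    # sh handles the interface: exactly one readable file argument.
    if [ $# -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: weather FILE" >&2
        return 2
    fi
    # awk does the heavy lifting: find the day with the smallest spread.
    awk '
        NR <= 2 || NF < 3 { next }           # skip headers and short/blank lines
        {
            spread = ($2 + 0) - ($3 + 0)     # "+ 0" drops trailing marks such as "*"
            if (!seen || spread < min) { min = spread; day = $1; seen = 1 }
        }
        END { if (seen) print day, min }     # prints: day spread
    ' "$1"
}
```

The soccer version would then be the same function with different column numbers and a one-line header, which is why cut-and-paste gets you there in a few minutes.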

Answers to Dave's questions:

  1. Using sh and awk made it a little less convenient to merge the two later. Had I known in advance (or read ahead), I would probably have picked Perl, Python, or something a little more "all-in-one", rather than using a set of disconnected tools to do it (in this case, sh handling the interface, and awk handling the heavy lifting).
  2. The second program wasn't just inspired by the first: it was practically a cut-and-paste.
  3. The programs suffered from the additional refactoring here. I really just obfuscated things for very little value (the original scripts were 27-28 lines each, and that's with some thought given to user interface; the core awk script is only 18 lines long). That being said, adding parsers to this is pretty easy now, since we have a simple definition for the "business rules" of the data: header length, columns to compare, and column to output. Add a few extra things (field separators, line separators, etc.), and you'd...well, you'd have awk, which I think was my original point. ;-)
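
The "business rules" idea from answer 3 can be sketched as one awk program driven by three parameters. This is an illustration under assumptions, not the original both.sh: the function name `min_diff`, its argument order, and the column numbers in the usage note are hypothetical and depend on the exact layout of the data files.

```shell
#!/bin/sh
# Hypothetical sketch of a data-driven parser (not the original both.sh).
# usage: min_diff FILE HEADER_LINES COL_A COL_B COL_OUT
# Prints field COL_OUT of the row where |COL_A - COL_B| is smallest.
min_diff() {
    awk -v skip="$2" -v a="$3" -v b="$4" -v out="$5" '
        NR <= skip { next }                      # header length
        NF < a || NF < b || NF < out { next }    # blank or separator lines
        {
            d = ($a + 0) - ($b + 0)              # columns to compare
            if (d < 0) d = -d
            if (!found || d < best) { best = d; name = $out; found = 1 }
        }
        END { if (found) print name }            # column to output
    ' "$1"
}
```

Assuming a weather file with two header lines and a soccer table with one, the two parsers reduce to `min_diff weather.dat 2 2 3 1` and `min_diff football.dat 1 7 9 2`; the soccer column numbers assume the "-" between goals for and against counts as its own field.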

An interesting thing about this was that the combination of awk for processing, sh for interface, and the data definition provided by Dave gives us a pretty low-tech version of MVC. So even my old-school UNIX tools can fit into whatever modern best-practices model-of-choice gimmick-widget we're using today. :-)