Advent of Code observations

Let’s talk about something fun: the Advent of Code, where each day from the 1st to the 25th of December a new programming puzzle is presented for you to solve. A friend roped me into trying it this year, and I decided to do it in Python for learning purposes. Now halfway in, the difficulty level of the puzzles has increased, and I’m probably not doing all of them. Time to write down some reflections!

First, as one might expect, I have learned more Python this December than in several years of thinking ”I ought to learn a little Python at some point”. I’ve also started liking Python much more. Most of my professional Python experience has been about running other people’s research code — sometimes with relative ease, and sometimes with copious dependency management. Dependency problems don’t make any language seem appealing. Advent of Code, on the other hand, is exactly what makes programming fun: small self-contained problems with well-defined solutions, test cases already prepared, and absolutely no need to install a years-old version of scikit-learn. And there is a lot to like about Python: list comprehensions (they are just fabulous!), automatic unpacking of tuples (yum!), numpy and pandas.
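To show what I mean with a toy example of my own (not from any puzzle):

# A list comprehension: squares of the even numbers below ten
squares = [x * x for x in range(10) if x % 2 == 0]

# Automatic tuple unpacking: both coordinates assigned in one statement
point = (3, 7)
x, y = point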

”The events leading up to the Second World War do not include the Second World War”

Second, it’s fun to notice what my most common errors are. I am a committed and enthusiastic R user, and you can tell from my errors: for the first few puzzles, I tended to accidentally write functions that returned None, because I unthinkingly relied on the R convention that the last expression in a function is automatically returned. The second most common error, frustratingly, is IndentationError: unexpected indent, from having empty lines in my functions that are not indented. See, I like having whitespace between ”paragraphs” of code. While I’ve gotten over my gripes with some other common Python features, it still puzzles me that anyone would think it’s a good idea to demand that empty lines be correctly indented. Even more puzzling, the popular programmer’s text editor Atom, in its default configuration, deletes ”unnecessary” whitespace upon saving a Python source file.
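A contrived example of the first mistake (the function names here are made up for illustration):

# What my R habits produce: the last expression is evaluated but not returned
def add_one_broken(x):
    x + 1

# What Python actually wants: an explicit return
def add_one(x):
    return x + 1

add_one_broken(1)   # None
add_one(1)          # 2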

After those, as expected, I’ve been making off-by-one errors like I was querying the UCSC Genome Browser for the first time (see this beautiful Biostars post for an explanation of that somewhat niche joke). I knew that Python counts from 0 where R counts from 1, but the difficulties don’t stop there. You also need to think about slices and ranges.

This gives you the first two elements of a list in Python:

pylist = ["a", "b", "c", "d"]
pylist[0:2]   # ['a', 'b']

This gives you the first two elements of a list in R:

rlist <- list("a", "b", "c", "d")
rlist[1:2]   # a list containing "a" and "b"

That is, 1:2 in R includes the last element, 2, whereas 0:2 in Python doesn’t. This is well known, well documented, mentioned in every tutorial — I still get it wrong.

When we add ranges that go backwards, like in this function from Day 5, you can see where this poor R user needs to scratch his head (and write more tests):

def get_range(start, end):
    """ Return an inclusive range from start to end, in either direction """
    if start > end:
        return range(start, end - 1, -1)
    else:
        return range(start, end + 1)
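A couple of quick checks (my own, not part of the puzzle) confirm that it covers both directions:

list(get_range(2, 5))   # [2, 3, 4, 5]
list(get_range(5, 2))   # [5, 4, 3, 2]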

When I wrote about my common errors on Twitter, fellow quantitative geneticist Lorena Batista warned me about Python’s assignment, which works very differently from R’s. She was right. This has bitten me already. The way assignment works in Python, always passing around references to objects unless explicitly told to copy them, does not fit my intuitions about assignment at all. I feel uneasy about it and how it interacts with mutability — here we try to write proper functions that don’t have strange side effects, and then we clobber the parameters of the function instead. This will take some getting used to if I’m going to use Python more seriously.
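A toy example of the kind of thing that trips me up (the names are made up for illustration):

import copy

original = [1, 2, 3]
alias = original    # no copy is made; both names refer to the same list
alias.append(4)
original            # [1, 2, 3, 4], changed through the other name

independent = copy.copy(original)   # an explicit (shallow) copy
independent.append(5)
original            # still [1, 2, 3, 4]

In R, alias <- original would have behaved like a copy, and modifying alias would have left original alone.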

I don’t want to start any language quarrels, but even I can see how Python feels cleaner than R in certain ways. Maybe getting ranges to go backwards doesn’t feel as natural, but at least you can expect the standard library to use consistent case when naming things, whereas in R you will see model.matrix, pivot_longer, and stringsAsFactors in the same script (but the latter not for long; bye, default factor conversion, you will not be missed). On the other hand, Python suffers from confusion about where the relevant functions can be found: some live in modules, some live in the objects themselves as methods, and some are free-floating built-ins. In R, almost everything is free-floating, and if some package has objects with methods (like the SimParam class in AlphaSimR) you will remember, because it’s the exception.
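A sketch of what I mean, using numpy (the redundancy between the first two lines is the point):

import numpy as np

values = np.array([3, 1, 2])

np.sort(values)   # a function living in the module
values.sort()     # a method living in the object itself (sorts in place)
len(values)       # a free-floating built-in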

Parsing is half the problem

A nice insight from the Advent of Code is that parsing is more important than I thought. I don’t mean that parsing custom text file formats is annoying and time-consuming, even if it is. What I mean is that the second part of parsing, after you’ve solved the immediate problem of getting data out of a file, is data modelling.

Take for example Day 12, a graph-related problem. If you have the computational wherewithal to parse the map into an adjacency list, you are already well on your way to solving the problem.
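The parsing itself can be a handful of lines. Something like this sketch, assuming the puzzle’s input format of one dash-separated edge per line (like ”start-A”):

from collections import defaultdict

def parse_adjacency_list(lines):
    """ Build an adjacency list from edges like "start-A" """
    adjacent = defaultdict(list)
    for line in lines:
        left, right = line.strip().split("-")
        adjacent[left].append(right)
        adjacent[right].append(left)   # the caves connect in both directions
    return adjacent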

Or, take my favourite problem so far, Day 8: Seven Segment Search. This involved unscrambling some numbers from an imagined faulty display.

The data looks like this example, and it’s not super important what it means, except that the words are scrambled digits, and that each row represents one set of observations:

be cfbegad cbdgef fgaecd cgeb fdcge agebfd fecdb fabcd edb | fdgacbe cefdb cefbgd gcbe
edbfga begcd cbg gc gcadebf fbgde acbgfd abcde gfcbed gfec | fcgedb cgb dgebacf gc
fgaebd cg bdaec gdafb agbcfd gdcbef bgcad gfac gcb cdgabef | cg cg fdcagb cbg

It’s always ten words, followed by the delimiter, followed by another four words. This looks a little like a data frame, and I’m accustomed to thinking about tabular data structures. Therefore, unthinkingly, I used pandas to read this into a data frame, which I then sliced into two.

import pandas
import numpy as np

data = pandas.read_table("day8.txt", sep = " ", header = None)

# Columns 0-9 are the ten scrambled digit patterns, column 10 is the
# "|" delimiter, and columns 11-14 are the four output words
digits = data.iloc[:, :10]
output = data.iloc[:, 11:15]

That was good enough to easily solve the first part, which was about counting particular digits (i.e., words with particular numbers of characters in them):

def count_simple_digits(digits):
    """ Count 1, 4, 7, 8 in a column of digits """
    # These are the digits with a unique number of segments:
    # 1 uses two, 7 uses three, 4 uses four, and 8 uses all seven
    simple = [digit for digit in digits if len(digit) in [2, 4, 3, 7]]
    return len(simple)

# Sum the counts over the four output columns
sum([count_simple_digits(output[column]) for column in output])

Then comes part two, which was trickier but more rewarding. I was pretty proud of my matrix-based solution, but that’s not the point here; I’m skipping over the functions that contain the solution. The point is that when I came to apply them to the real data, I had to do it in a clunky and, I dare say, unpythonic way:

## Apply to actual data
sorted_digits = digits.apply(sort_digits, axis = 0)
sorted_outputs = output.apply(sort_digits, axis = 0)

decoded = []

for ix in range(digits.shape[0]):
    segment_sum = np.sum(get_segments_shared(sorted_digits.iloc[ix]), axis = 1).tolist()
    matched_digits = match_digits(segment_sum, segment_sums_normal)
    decoded.append(decode_output(sorted_outputs.iloc[ix].tolist(), sorted_digits.iloc[ix].tolist(), matched_digits))

Look at that indexing over rows, which is not a smooth operation on a data frame. My problem here is that I stored the data on the same set of digits across ten columns (and four more columns for the words after the delimiter), when it would have been much more natural to keep everything pertaining to the same observation together in one structure — anything that you can easily iterate over without an indexed for loop.
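In hindsight, something like this sketch (not the code I actually wrote) would have kept each observation together:

def parse_line(line):
    """ Split one line into its ten digit words and four output words """
    digits, output = line.strip().split(" | ")
    return digits.split(), output.split()

with open("day8.txt") as f:
    entries = [parse_line(line) for line in f]

for digits, output in entries:
    ...   # each observation arrives as one item; no row indexing needed

Note that this version even gets to use tuple unpacking and a list comprehension.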

The lesson, which applies to R code as well, is to not always reach for the tabular data structure just because it’s familiar and comes with a nice read_csv function, but to give more thought to data modelling.