Abstraction and Reasoning Challenge
Lessons from honestly not doing that well at it.
Introduction
In my Checkers post I said I wanted to keep myself on a weekly cadence, and since that post was a month ago, I’m off to a mixed start. I have come prepared with an excuse: I was working very aggressively on an entry in François Chollet’s Abstraction and Reasoning Challenge, followed by a brief recovery period.
Unfortunately, as a careful parsing of the leaderboard will show, I did not succeed. Initially this made me reluctant to do much of a write-up of it, but when I made myself say that out loud, it seemed like a pretty terrible reason. Post-mortems are always valuable, and even the best engineers have ideas that might not pan out. Besides, I’m trying to keep these posts focused on what I learn, not necessarily whether I craft some amazing and wondrous piece of software, and there’s no learning experience so good as failure.
The Competition
You can read about the competition here and also here and in this paper if you want, but the short version is that it consists of a bunch of matching pairs of “images” (or grids of numbers, if you like), where there is some fairly-easily-comprehensible rule explaining how the input maps to the output. Some simple examples (originally based on this helpful notebook):
For a human, these are all pretty easy to spot. (If it’s taking you a second - the first one repeats the top shape in the bottom grid, but only in the relative locations that the top shape was shaded. The second one is a pattern of repeating diagonal lines, but only three of the lines are shaded initially and you have to fill in the rest. The third one is actually just a simple mapping from one color to another - green becomes yellow and vice versa, blue becomes grey and vice versa, and so on.)
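Just to make “easy to spot” concrete, that third rule is small enough to write down by hand. Here’s a minimal sketch - the jagged int[][] grid representation and the helper name are mine for illustration, not anything from the competition code:

// Hypothetical helper: apply a fixed color-to-color mapping to every cell.
// ARC colors are encoded as small integers, so a dictionary lookup is enough.
public static int[][] ApplyColorMap(int[][] grid, Dictionary<int, int> colorMap)
{
    var result = new int[grid.Length][];
    for (int row = 0; row < grid.Length; row++)
    {
        result[row] = new int[grid[row].Length];
        for (int col = 0; col < grid[row].Length; col++)
        {
            // Colors missing from the map pass through unchanged.
            result[row][col] = colorMap.TryGetValue(grid[row][col], out var mapped)
                ? mapped
                : grid[row][col];
        }
    }
    return result;
}

Of course, writing the rule by hand is the easy part.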
However, the challenge is to build a system that, given 3-5 examples from some similar task, can correctly determine the input-to-output rule and apply it to some new input example(s).
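For what it’s worth, each task is distributed as a small JSON file with a “train” list of demonstration input/output pairs and a “test” list to predict, so the data model itself is tiny - something like this sketch, where the class and property names are mine and only the overall JSON shape comes from the dataset:

// Rough data model for a single ARC task: a few demonstration pairs plus the
// test input(s) whose outputs you have to produce. Grids are small 2D arrays
// of integers, where each integer stands for a color.
public class ArcPair
{
    public int[][] Input { get; set; }
    public int[][] Output { get; set; }  // this is what you predict for the test pairs
}

public class ArcTask
{
    public List<ArcPair> Train { get; set; }  // typically 3-5 demonstration pairs
    public List<ArcPair> Test { get; set; }
}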
The competition also came with 400 “training” examples (to look over and get a sense of the types of tasks), 400 “evaluation” examples (to check that the logic you wrote for the training examples could generalize, without - in theory - ever looking at them while developing), and ~100 hidden test examples that your code would run against once you submitted it. As of now, it looks like the grand-prize winner solved ~21 of the hidden tests.
On the other hand, if you take the time to look through the leaderboard, you will find that I solved… none of them. There are two or three big factors that contributed to this, which I will discuss, and the first concerns the competition itself.
How Challenging ARC Was
Honestly, my hat is off to Chollet here. When I first looked at the test examples I thought it would be hard, but not that hard, to identify a lot of the underlying concepts that were used in the tests.
The point of the challenge is that these tasks should be easy/intuitive for humans but hard for software. Originally I was skeptical about that framing - I tend to think a lot of human intuition isn’t actually that hard to codify in software. As an undergraduate, I also studied psychology and was pretty influenced by people like Dr. Paul Meehl, who did a lot of work showing that simple rules could often outperform even experts at making predictions in a wide range of areas (see this later review for some discussion), and I later read a lot by people like Eliezer Yudkowsky, who has written extensively about the practice of rational thought (in a similar vein to the more well-known Dr. Daniel Kahneman).
My big takeaway from both of them was that human thought is often less complex or impressive than we would like to believe. So, my instinctual response to Chollet’s framing of this problem as something that requires human-like reasoning skills beyond current machine learning approaches was, “Actually, I bet it doesn’t.”
It’ll probably be a while before we can have a really definitive answer to that, but I will say that, at the very least, it requires a lot of engineering. My initial attempt was able to solve about 1-in-4 of the “training” problems, but only about 1-in-8 of the “evaluation” ones, and a submission with most of that logic (I’ll explain why it’s “most” and not “all” below) didn’t solve anything in the hidden test set.
So I definitely think Chollet has built a set of problems that are hard to create a general solution for, even if you’re trying to, and even if your solution manages to generalize over a subset of them. (To note, even within the training data, I would look at one problem, write as much general logic for it as I could, and then run it against all the problems, and typically I would find I had solved anywhere from 5-20 of them - I was definitely trying to be as general as possible, just not succeeding.)
Time Pressure
I also definitely didn’t have enough time to do the problem justice, especially in light of the above. Part of this was self-inflicted and part was unavoidable. The competition ran from 2020-02-13 to 2020-05-26. Until the end of March, I was still employed by Google (and actually doing 20% work on Kaggle) and therefore ineligible to participate. (I did briefly wonder if that would still make me ineligible to claim any prize, but I decided it was an interesting enough problem that I wanted to try either way. In any event I didn’t have any material nonpublic information about the challenge, which is pretty obvious considering the outcome.)
Unfortunately I also didn’t even start working on it until May, when I could have started in April. Instead I focused on learning various front-end technologies. I don’t regret that in the long run - it’s more important for the projects I plan to spend the next year working on - but obviously spending ~8 weeks on this might have gotten me a lot further than ~3-4 did.
Conversion Challenges
On the Kaggle forums Chollet suggests getting started by taking some tasks and writing simple programs to solve them in whatever programming language you’re comfortable with. I wrote my initial solutions in C#, which is by far the language I’m best with. (To be clear, I would definitely have done this whether or not Chollet said to. I know people talk a lot about the importance of learning multiple languages, and obviously I’m putting effort into that myself, but I also think it’s underappreciated how much depth you can get in a single language over time, and how productive you can therefore be in it.)
Eventually, to submit it, I needed to rewrite everything in Python. (Or R, I guess, but I don’t know R. And from what little I’ve heard, I probably don’t want to.)
I know enough Python to get by, but not a lot. At this point I probably understand Javascript better, just by virtue of the projects I did in April. So, I didn’t exactly carefully rewrite equivalent logic in idiomatic Python - I just did the most minimal conversion possible, with the exact same class structure and so forth. (For non-programmers: this is like when somebody talks in a foreign language using only a word-for-word translation - it sort of works, but it’s not ideal.)
The resulting Python code proved unworkable, which is why I didn’t end up converting the entire C# solution before admitting defeat (although I know it wouldn’t have made that much of a difference). The biggest problem for me was that the overhead of calling a function in Python is just too high. In C# it’s basically nonexistent.
I’ll give a concrete example:
One of the most basic things you need to do in these problems is break the input and output grids up into “shapes”. A “shape”, in this context, is any contiguous set of pixels that share some characteristic. That could mean a few different things - “pixels that are the same color”, for example, or “pixels that are not [the background color]”. Also, we might say that “contiguous” includes diagonals, or that it doesn’t.
Breaking things into shapes is pretty straightforward - you basically use the same logic as a breadth-first search, except instead of going until you “find” something, you keep adding each adjacent space to the shape if it matches your “these two pixels belong to the same shape” rule (like, they have the same color, or they both aren’t the background color):
public List<Shape> GetShapes(Grid grid)
{
    var results = new List<Shape>();
    var handled = new HashSet<Location>();
    foreach (var loc in grid.AllLocations())
    {
        if (!handled.Contains(loc))
        {
            // Background pixels never start (or belong to) a shape.
            if (this.BackgroundColor.HasValue && grid[loc] == this.BackgroundColor.Value)
            {
                handled.Add(loc);
                continue;
            }

            // Flood-fill outward from this pixel, then mark everything we
            // found as handled so we don't extract the same shape twice.
            var shape = this.GetShape(grid, loc);
            results.Add(shape);
            foreach (var shapeLoc in shape)
            {
                handled.Add(shapeLoc);
            }
        }
    }
    return results;
}

private Shape GetShape(Grid grid, Location start)
{
    var visited = new HashSet<Location>();
    var toVisit = new Queue<Location>();
    toVisit.Enqueue(start);
    var results = new List<Location>();
    while (toVisit.Count > 0)
    {
        var current = toVisit.Dequeue();
        visited.Add(current);
        results.Add(current);
        foreach (var neighbor in GetNeighbors(current, grid))
        {
            if (visited.Contains(neighbor))
            {
                continue;
            }

            // The derived class decides what "belongs to the same shape" means.
            if (this.IsSameShape(grid, current, neighbor))
            {
                toVisit.Enqueue(neighbor);
            }

            // Mark the neighbor either way so we never re-check it.
            visited.Add(neighbor);
        }
    }
    return new Shape(results);
}
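As a side note, the GetNeighbors call above is the piece that decides whether “contiguous” includes diagonals. My actual version isn’t shown here, but a minimal sketch - assuming a Location value type with Row/Column members and a Grid.Contains bounds check, which are my names for this post, not necessarily the real ones - would look something like this:

private IEnumerable<Location> GetNeighbors(Location current, Grid grid)
{
    // Orthogonal neighbors only; add the four diagonal offsets as well if the
    // shape rule should treat diagonally-touching pixels as contiguous.
    var offsets = new (int Row, int Col)[] { (-1, 0), (1, 0), (0, -1), (0, 1) };
    foreach (var (dr, dc) in offsets)
    {
        var neighbor = new Location(current.Row + dr, current.Column + dc);
        if (grid.Contains(neighbor))  // skip anything outside the grid bounds
        {
            yield return neighbor;
        }
    }
}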
This is all part of an abstract base class - the “IsSameShape” function is defined in a derived class. That means that you can’t inline that logic (at least, not without copy-pasting all of this code into those classes and getting rid of inheritance entirely) and you’re going to call “IsSameShape” O(N) times where N is the number of pixels in the image you’re breaking down.
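To make that concrete, a derived class can be as small as the sketch below (the class names and exact signature are illustrative - the point is just that the membership test is a virtual call):

// Illustrative derived extractor: two adjacent pixels belong to the same
// shape if and only if they are exactly the same color.
public class SameColorShapeExtractor : ShapeExtractorBase  // base class name assumed
{
    protected override bool IsSameShape(Grid grid, Location a, Location b)
    {
        // Called once per (pixel, neighbor) pair during the flood fill above,
        // which is exactly where per-call overhead starts to hurt in Python.
        return grid[a] == grid[b];
    }
}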
In C#, this is almost completely inconsequential. In Python, it’s not. The C# version of this code runs something like 100x faster than the (very-literally-translated) Python version of it. Also, some parts of my solution were written to try a lot of combinations of stuff, like, trying every possible version of a “select some of the shapes” rule with every possible version of a “break this into shapes” rule and every possible version of a “see if the output can be scaled by some fixed amount” rule. Obviously this isn’t a very clever way to go about things, but it can work, assuming your program can run fast enough. Several places in my code have comments like this:
// TODO: Consider limiting to some fraction of middle spaces if needed for performance.
// Empirical note: So far it doesn't seem like it matters in the slightest.
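In other words, the outer loop was mostly dumb enumeration. A heavily simplified sketch of the idea - not my actual class structure, and I’m flattening each rule down to a Grid-to-Grid function, which glosses over the real intermediate shape representations:

// Illustrative brute force over whichever rule variants exist: try every
// combination and keep the first one that explains all the training pairs.
public Func<Grid, Grid> FindRule(
    List<(Grid Input, Grid Output)> trainingPairs,
    List<Func<Grid, Grid>> shapeRules,      // every "break this into shapes" variant
    List<Func<Grid, Grid>> selectionRules,  // every "select some of the shapes" variant
    List<Func<Grid, Grid>> scalingRules)    // every "scale the output" variant
{
    foreach (var breakRule in shapeRules)
    foreach (var selectRule in selectionRules)
    foreach (var scaleRule in scalingRules)
    {
        // Compose one candidate transformation out of this combination of rules.
        Func<Grid, Grid> candidate = input => scaleRule(selectRule(breakRule(input)));

        // Keep the candidate only if it reproduces every training output.
        // (Assumes Grid implements value equality.)
        var matchesAll = true;
        foreach (var pair in trainingPairs)
        {
            if (!candidate(pair.Input).Equals(pair.Output))
            {
                matchesAll = false;
                break;
            }
        }
        if (matchesAll)
        {
            return candidate;
        }
    }
    return null; // No combination explained the training examples.
}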
To note, there may be other performance issues as well - I don’t know Python well enough to say. I’m sure a good engineer with more Python experience than me could write equivalent code that performs as well. But I know I couldn’t do it in the time I gave myself. I spent about three weeks writing the C# version of this and about 48 hours porting it all to Python, and by the time I was done it was less than 72 hours before the end of the competition. I knew there was no way I was rebuilding everything, and I knew that the logic I had converted (which could solve about half as many of the training/evaluation cases as my final version) still couldn’t solve any of the test cases - so even if I had managed a rewrite, I would still have needed to do a lot more work before my solution was general enough. I also had basically been working 60-80 hours a week on this for several weeks, so I decided to cut my losses, took a day off, and slept about 12 hours straight.
As I alluded to above, I don’t think this means I made the wrong choice writing my solution in C# first, and I definitely don’t intend to suggest Chollet’s advice was bad: I also couldn’t have written as good a solution in Python in 3-4 weeks, and more generally I’d rather walk away with some idiomatic C# code that might end up being useful later.
(If you’re wondering how any of this code could possibly be useful later for anything other than another ARC competition, know that I originally wrote the shape-extraction logic so I could analyze maps for some rules-based Starcraft 2 AI - all code might end up being useful later.)
Sidenote - Why Was Python Required?
My wife asked me why Python was a requirement if it’s so much lower-performance, and I figured I would include a brief explanation of this in case somebody non-technical happens to be reading (although if you’re not a software engineer and you actually read this far, thank you for what I can only imagine is the gift of your lifelong friendship).
Kaggle, as a platform, supports Python and R. To the best of my knowledge most machine learning work is done in Python these days, mostly using either the TensorFlow or PyTorch libraries (made by Google and Facebook, respectively).
Most of what these libraries do is make it easy to write things like neural networks that mostly rely on matrix operations (i.e., multiplying giant sets of numbers together). The actual “multiply these billion numbers against each other” part is usually written in some other language, like C++, and often run on GPUs or other special-purpose hardware that specifically does matrix operations in parallel, really quickly. The actual Python code just sets up these operations, so it doesn’t need to do much heavy lifting.
That doesn’t work for this sort of problem, where you aren’t necessarily solving it just by effectively multiplying a 50,000x50,000 matrix against another one.
Another question you might be asking is, “Why are the popular ML libraries written in Python, if they actually have to use C++ code under the hood?” As far as I know, this is entirely a function of history and user preference - history, because at this point Python and ML are so intertwined that if you want to learn ML you’ll end up learning Python and feeding the cycle, and preference, because a lot of ML researchers like Python’s ease of use when they’re prototyping new ML methods. (The features of Python make it easy to write small chunks of code, compared to languages like C# where you have to spend a bunch of time defining data types and so forth. “Defining data types and so forth” makes it a lot easier to write a 50,000 line project in C# without getting lost or driving yourself insane. It’s sort of like how bureaucracy is really annoying if you’re doing a group project with five people, but absolutely essential if you’re running the federal government.)
What I Learned From All This
- Even if the ARC tasks don’t require human intelligence, they’re a lot closer to requiring it than I would have expected
- Python function overhead is wicked high
- Participating in cash-prize programming competitions is fun and motivating, but also easy to overdo
- When working on an aggressive deadline, the quality of my code rapidly declines as the deadline approaches (I think I mostly knew this, although it’s rare to have a real deadline for anything, even when working at a large company)
- Human intuitions about shapes can be partially reproduced using graph logic, but humans can do a way better job determining when to apply exceptions to those rules
Remaining Issues / Future Work
Some of the code I wrote is pretty well-organized, but clearly not general enough to solve the underlying problem. Other parts of it, especially the ones I did last, are pretty ghastly. I think I could easily spend a couple of weeks cleaning up what I have now. I’m not sure it’s a good use of my time, though.
I found this whole problem very addictive, not even counting the potential reward, and I’m pretty sure it’s stuck in my head now forever, so I think what I might actually do in the future is toy around with some really abstract conceptual approaches to general solutions and see if I can come up with anything. If there’s another ARC competition (and I suspect there might be), that might turn into a couple-of-months project for me, but if not I think it’s something I still might spend the odd day on once in a while.