Snapdragon, The Basic Assumption(s)
Part 2 of many
Introduction
In the last post I gave a quick overview of the structure of the Snapdragon project, an AI deckbuilder for Marvel Snap.
As I summarized at the time:
- We generate a bunch of random decks
- We play the decks against each other to determine which ones are the best
- The best decks reproduce
- Over time we should get better decks
This all seems theoretically sound, but Machine Learning algorithms can be tricky, so in the next post I’ll examine some assumptions that are being made here, how those assumptions might be incorrect, and how to test them.
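As a quick refresher, the loop those bullets describe looks something like the following sketch (hypothetical Python, not the project's actual code; `random_deck`, `count_wins`, and `reproduce` are stand-ins):

```python
import random

def evolve(generations=100, population_size=50):
    """A toy outline of the evolutionary loop described above."""
    population = [random_deck() for _ in range(population_size)]
    for _ in range(generations):
        # Pit every deck against the rest of the population.
        wins = [count_wins(deck, population) for deck in population]
        ranked = [deck for _, deck in
                  sorted(zip(wins, population), key=lambda pair: pair[0],
                         reverse=True)]
        survivors = ranked[:population_size // 2]   # the winningest decks reproduce
        children = [reproduce(*random.sample(survivors, 2))
                    for _ in range(population_size - len(survivors))]
        population = survivors + children           # next generation
    return population
```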
So, as promised, let's talk about assumptions. There are a few minor ones baked in there, but I want to start with the one big, overarching assumption.
The Main Assumption
The most significant assumption in the chain of logic above - which technically isn’t even explicitly stated, as I look at it now - is that the best decks will win.
Our entire process is to randomly play decks against each other and then reproduce the ones with the highest victory count. If those are the best decks, then the system will work[^1]. If there is a reason that's not true, or not reliably true, then the system will definitely not work.
But Like What Does “Best” Even Mean Though
Now, you may be asking - how can decks be “the best” if they aren’t “the decks that win the most”?
There are several potential issues here. Broadly, I would describe "the best" decks as the ones that win the most, assuming they are played properly by a human.[^2] But we're not running tests with humans. We're running tests with players whose behavior is specified by a Monte Carlo game search. So one obvious failure mode for the whole system would be if the Monte Carlo game search didn't play well.
As described last time, the Monte Carlo game search basically just, on each turn, simulates making each possible move and then playing the rest of the game out randomly, X times, and counts how often each move results in a win. The higher the value of X - the simulation count - the better it should play. (For example, if it only simulates each move once, it would be easy for random chance to promote a worse move. Less so if it simulates it 20 times.) Now, because all of this is probabilistic, we don't have the usual trait of software where we can just run a test with some sample inputs and see if it "does the right thing" or "doesn't break". However, one way to test the implementation would be to see if cranking up the simulation count causes it to perform better.
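In pseudocode-ish Python, that selection logic looks something like this (a minimal sketch with hypothetical stand-ins - `legal_moves`, `apply_move`, and `play_out_randomly` - not the project's actual API):

```python
def choose_move(state, simulation_count):
    """Pick the move whose random playouts win most often (a toy sketch)."""
    best_move, best_wins = None, -1
    for move in state.legal_moves():            # consider every possible move
        wins = 0
        for _ in range(simulation_count):       # X playouts per move
            playout = state.apply_move(move)    # hypothetical: returns a copy
            if playout.play_out_randomly():     # hypothetical: True if we won
                wins += 1
        if wins > best_wins:                    # keep the winningest move
            best_move, best_wins = move, wins
    return best_move
```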
So, I did that, in a couple of different ways. One was to simulate two different controllers with different simulation counts but identical decks, and see how often the one with the higher simulation count won. These are the results, playing 500 games each time:
- 5 sims vs 1 sim: 65.2% win
- 10 sims vs 5 sims: 61.2% win
- 20 sims vs 10 sims: 61.0% win
- 50 sims vs 20 sims: 61.6% win
- 100 sims vs 50 sims: 53.4% win
So, it really looks like higher simulation counts tend to beat lower simulation counts, although this starts to level off somewhat after we get to 50 simulations in this test.
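(For reference, the harness for this kind of test is conceptually tiny - something like the following sketch, reusing the hypothetical `choose_move` from above, with `play_game` as another stand-in:)

```python
def compare_sim_counts(high_x, low_x, deck, games=500):
    """Head-to-head test: identical decks, different simulation counts."""
    high_wins = 0
    for _ in range(games):
        # `play_game` is a hypothetical stand-in that runs one full game
        # and returns the winner; each side gets its own controller.
        winner = play_game(deck, deck,
                           lambda state: choose_move(state, high_x),
                           lambda state: choose_move(state, low_x))
        if winner == "A":
            high_wins += 1
    return high_wins / games   # e.g. 0.652 for 5 sims vs 1 sim
```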
To try to confirm this a bit, I also ran a test, this time with a couple of different decks but the same simulation count for each player, and instead compared the players' average scores, again across runs of 500 games:
- 1 sim: 29.172
- 5 sims: 33.474
- 10 sims: 35.164
- 20 sims: 36.516
- 50 sims: 37.200
- 100 sims: 37.845
- 200 sims: 37.961
- 400 sims: 38.022
So, in this case, it seems like the average scores jump up pretty quickly as we increase the simulation count, but mostly plateau somewhere in the 20-50 simulations range. Actually, it seems like the curve is very steep at the bottom - going from 1 sim to 10 sims per move gets us about +6 points, but going from 10 to 50 only gets us +2.
Because this also increases the amount of processing it takes to run games, I’ve mostly tended to run experiments at 10 sims or occasionally 20 sims, although these numbers show there might be some more value to squeeze out by increasing this further.
In any event, for now, we will provisionally conclude that the Monte Carlo game search “is working properly”. However, there’s another assumption hidden above.
What Was That About Humans Before?
I said above that "the best" decks are those that win the most assuming they are played properly by a human. In the previous section I showed some good evidence that the Monte Carlo game search logic is probably "playing properly", but it's not playing "like a human". From the previous post:
> There is one key issue with the current implementation, which is that the `MonteCarloSearchController` is given the full state of the game, including information that would normally not be revealed to players (what cards are still in their and their opponent's decks, what cards are in their opponent's hand, and what the unrevealed locations are). This means that the players are basically psychic. I haven't dealt with this yet because any solution will be fairly complicated, and at the moment players are still effectively on a level playing field, but there are some implications I will discuss later.
Humans don’t have this information, and it has some implications. If your opponent has a card that can counter one of your cards[^3], maybe you won’t play that card at all, or you’ll play it in a way that avoids the counter. This remains an unsolved problem - I can definitely think of approaches to deal with it (when simulating games, also replace all unknown cards/locations with random ones, or statistically plausible ones), but I also don’t know if it’s having an impact, and I don’t have any good idea as to how to tell if it is.
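For what it's worth, that "replace unknowns with random ones" idea is a known technique, sometimes called determinization. A minimal sketch of how it might slot in, with entirely hypothetical helpers:

```python
import random

def determinize(state, viewpoint):
    """Replace information hidden from `viewpoint` with random stand-ins.

    A hypothetical sketch: `cards_not_visible_to`, `hidden_zones`, and
    `fill_zone` are stand-ins, not the project's actual API.
    """
    sim = state.copy()
    unseen = sim.cards_not_visible_to(viewpoint)   # opponent hand, both decks, etc.
    random.shuffle(unseen)
    for zone in sim.hidden_zones(viewpoint):       # each spot holding a hidden card
        sim.fill_zone(zone, unseen.pop())          # deal a random plausible card
    return sim
```

Each simulated playout would then run against a freshly determinized copy of the state, so the controller can no longer "peek" - though, again, I don't yet know whether the peeking is actually distorting results.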
Anyway, there’s one more element to our “the best decks win” assumption that’s even bigger than this, so let’s address that.
How Reliable Are The Best Decks?
There’s a decent amount of randomness in Marvel Snap:
- Each game you get three random locations with different abilities
- You have a 12-card deck but typically end up drawing only 9 cards, so a quarter of your deck is effectively gone
- You draw cards in a random order, which might mean, for example, that you don’t get any 1-cost or 2-cost cards in the first two turns
All of this can add up - good decks usually have various combos that you might not be able to manifest, and some locations can more or less ruin entire strategies (or greatly benefit them). So even if one deck is much better than another, there would still be some chance it loses.
Initially I ran several experiments where I pinned specific cards in a population and tried to see how it would affect other, related cards. But those experiments didn’t really show any obvious, strong trends. Eventually I decided to test how random this was, by running the games for a generation, then running another set, and seeing how strong the correlation was between “wins in the first set” and “wins in the second set” for each deck.[^4]
- 10 games: R = 0.098
- 20 games: R = 0.222
- 40 games: R = 0.362
- 80 games: R = 0.462
- 160 games: R = 0.694
Given that my initial experiments were running with sets of 10 games, the results make a lot of sense. That’s just not enough games to get rid of random noise. Now, any positive correlation is still enough to expect some progress over time, but in this case that might mean a lot of generations before anything really useful happens.
Like the Monte Carlo simulation count, this again has an obvious performance trade-off - scaling the number of games directly scales the processing time - so as a compromise I’ve been running most of my experiments with either 40 or 80 games per deck per generation, which still produces much clearer results.
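The correlation check itself is simple - with numpy it's basically a one-liner (the win counts here are made up for illustration):

```python
import numpy as np

# Hypothetical bookkeeping: wins_a[i] and wins_b[i] are deck i's win counts
# from the first and second set of games, respectively (made-up values).
wins_a = np.array([6, 2, 9, 4, 7])
wins_b = np.array([5, 3, 8, 2, 7])

r = np.corrcoef(wins_a, wins_b)[0, 1]   # Pearson correlation coefficient
print(f"R = {r:.3f}")
```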
Have We Run Out Of Assumptions Yet?
There are definitely more assumptions baked into this system. Some of them are pretty theoretical, and hard to measure. Here are a couple of examples:
- Are the “best decks” still “best” in all situations?
Some strategies counter other strategies. Some strategies are very powerful, except that they have a simple counter. Whether a deck is “the best” is only ever true in terms of the environment that deck is in, which really means “the set of other decks it’s playing against”. Some of my experiments basically have one “population” that’s evolving, but it’s always playing against itself, so “better” just means “better at beating other, similar decks”. In other experiments I tried to avoid this by also having a second population that’s unconstrained, but this has a different issue - the unconstrained population will probably tend to evolve to be a counter of the other population, not just “some neutral independent strategy”. Ideally it would be good to simulate a large constellation of strategies, but because this would require a lot more processing time, I haven’t done it (yet).
- Can you really evolve “the best decks” using individual cards as genes?
The basic reproduction process for decks is to select two of them and randomly cross their genes, i.e. their individual cards. But the best-performing decks probably have useful combos in them. On a long enough timeline, that probably will mean that all of the cards in the combo will get more common, so the combo should show up more. But what if the combo works well, but the two cards individually are quite weak?[^5]
One way to address this would be to allow “genes” made up of larger chunks of cards. (Twelve cards divides evenly, helpfully, into chunks of 2, 3, or 4 cards.) This is on my to-do list and I think it’s promising - combos can be captured here in various ways - although I’ll have to do a comparison between populations with multi-card genes and populations with single-card genes to see if one works better than the other. There’s a rough sketch of what chunked crossover could look like below.
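(Hypothetical Python again; decks here are just lists of 12 card names, and duplicate-card handling is ignored:)

```python
import random

def crossover(parent_a, parent_b, chunk_size=3):
    """Cross two 12-card decks using multi-card chunks as genes.

    A hypothetical sketch, not the project's actual reproduction code.
    """
    child = []
    for i in range(0, 12, chunk_size):
        donor = random.choice([parent_a, parent_b])   # pick a parent per chunk
        child.extend(donor[i:i + chunk_size])         # inherit the whole chunk
    return child
```

With `chunk_size=1` this degenerates to the current single-card scheme, which should make the head-to-head comparison between the two gene types straightforward to set up.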
Awkward Conclusion
Hopefully this gives a good illustration of some of the logical challenges of working with these types of algorithms. I would say this generally applies to a lot of “AI” or “machine learning” systems[^6], or really anything that’s more “statistical” than “deterministic”.
Last time I was pretty sure what I was going to write about next, but today I’m not. So I’ll leave it here for now, and I guess we’ll both be surprised.
[^1]: Assuming none of our other assumptions are wrong, which we’ll definitely be exploring further.

[^2]: “Properly” is also maybe under-specified here. Let’s somewhat-circularly say “played in a way that is most likely to result in a win”.

[^3]: Like Shang-Chi if you’re about to play Devil Dinosaur.

[^4]: Using the Pearson correlation coefficient, which gives a value between -1 (perfectly inversely correlated) and 1 (perfectly positively correlated), with 0 being “no correlation”.

[^5]: I’m a big fan of Mister Negative, who can produce some really devastating combos with, say, Iron Man, but is potentially worse-than-useless if added to a random deck.

[^6]: I’m not always entirely sure if Monte Carlo game search should count as “AI”. Probably not, because it doesn’t “learn”. Genetic algorithms do, though.