I think running them against each other with a rules engine would be more interesting. Count up illegal moves and wins/unfinished games. I think llm grading is too unreliable.
A really interesting benchmark where the llms play multiplayer decks against each other using xMage as a rules engine, in this case, a $HORIZON token to the moon (Sideways).
1. Sideways Walking (100M Horizontal)
2. Sideways Pinching (Crab division only)
3. Sideways Bleating (Goat division)
4. Sideways Rattling (Skeleton division)
5. Sideways Hay Toss (Mixed division)
6. Sideways Swimming (Tide pool division)
7. Sideways Knitting (GrandMittens Invitational)
8. Sideways Stay (Meditation division)
OLYMPICS RECORDS.
1. 14.2 seconds, Holder: Pinchy
2. (60s) 120 pinches, Holder: Pinchy
3. (db) 110 dB, Holder: EIDOLONX
4. (rhythm) 9.7/10, Holder: Skeletorus
5. 12.3 m, Holder: Satochi Goat
6. (50m) 32.1 sec, Holder: Pinchy
7. (1hr) 100 m, Holder: GrandMittens
8. (6 hours), Holder: Satochi Goat
Economic boost:
$CRAB up 0.0001% (Sideways as Always.)
Providing them with medal count will improve their win rate against the baseline $HORIZON.
I know the author specifically did not use a rules engine in their simulation because of uncertainty on how it would affect it.
I do still wonder if adapting something like card forge for llm use would result in engaging gameplay with an llm.
https://github.com/Card-Forge/forge
I actually considered using card forge when I started this. I mostly didn't end up using it because of how much more work it would have been.
But also with a rules engine, you have to manually go through every step, and pass priority after every action.
I think it makes more sense to let an LLM play magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.
Also card forge would not let you goldfish a deck. You must have opponents.
Those things sound less like general problems with rules engines and more like deficiencies of card forge IMO.
MTG: Arena uses CLIPS as its rules engine (an s-expr expert system based on the Rete algorithm), which an acquaintance wrote a course for: https://ryjo.codes/tour-of-clips.html and even a declarative chat server: https://ryjo.codes/articles/a-simple-tcp-server-written-in-g...
> because of uncertainty on how it would affect it.
Have the LLM submit a proposed move and either advance the game state or reply "permission denied, try again". Probably also log the number of times it happens, since attempted violations seem like a valuable signal as well.
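A rough sketch of that loop, in case it helps. Everything here is made up for illustration: `GameState`, `apply_move`, and the `"play_land <card>"` / `"cast <card>"` move strings are toy stand-ins for a real rules engine, and the scripted proposer stands in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GameState:
    # Toy state only; a real engine tracks far more (mana, phases, the stack).
    lands_played_this_turn: int = 0
    hand: list = field(default_factory=lambda: ["Forest", "Llanowar Elves"])
    battlefield: list = field(default_factory=list)


def apply_move(state: GameState, move: str) -> Optional[str]:
    """Toy legality check: apply the move and return None, or return a refusal reason."""
    kind, _, card = move.partition(" ")
    if card not in state.hand:
        return f"{card!r} is not in hand"
    if kind == "play_land":
        if state.lands_played_this_turn >= 1:
            return "already played a land this turn"
        state.lands_played_this_turn += 1
    elif kind != "cast":  # no mana check here, this is only a sketch
        return f"unknown action {kind!r}"
    state.hand.remove(card)
    state.battlefield.append(card)
    return None


def take_action(state: GameState,
                propose: Callable[[GameState, Optional[str]], str],
                max_attempts: int = 3):
    """Ask for moves until one is accepted; return (accepted move or None, rejections)."""
    feedback = None
    rejections = 0
    for _ in range(max_attempts):
        move = propose(state, feedback)
        reason = apply_move(state, move)
        if reason is None:
            return move, rejections
        rejections += 1
        feedback = f"permission denied, try again: {reason}"
    return None, rejections


# Scripted proposer standing in for the model: one illegal attempt, then a legal one.
scripted = iter(["play_land Mountain", "play_land Forest"])
move, rejections = take_action(GameState(), lambda state, feedback: next(scripted))
print(move, rejections)
```

The rejection count comes back next to the accepted move, so it can be logged per turn as its own signal rather than folded into a single score.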
I love obscure benchmarks, and I feel like I can trust their results a lot more - after all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play Runescape)
[0] https://maxbittker.github.io/runebench/
I have a version of this where I have the LLMs play the duel decks "Elves vs. Goblins" against each other using xMage as a rules engine.
Unfortunately it gets really expensive to run even with some optimizations for the context.
I can only afford to play them with the deepseek models. They make serious blunders sometimes. This is not an easy "harness" to build and I don't have the time or disposable cash to work on it. I think a lot of work could still be done on improving it and testing better models.
It would make an amazing "arena" bench. There are plenty more duel decks that are well balanced against each other.
Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!
I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: This hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.
This project is cool though, props for making it!
Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards looking for a perfect hand. The scoring for the benchmark is mostly based on whether the LLM could complete legal turns, not good turns.
https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...
Gotta walk before you can run.
This is a really interesting benchmark, and a timely one given that a lot of existing benchmarks don't do a good job. The mechanics and edge cases seem notoriously difficult to parse, perhaps even for human players. Have you also been plugging these into newer reasoning models to see how giving them thinking time improves their win rate against the baseline?
Since the library tools are just an MCP server, I did some testing on ChatGPT and Claude where I don't have to pay for API credits.
With maximum thinking and web search to look up magic rules, I didn't ever see it make a mistake. It is probably better at following the rules than the average magic player (but not better at making the most strategic moves).
The benchmark was mostly to find out which is the cheapest model, at the lowest reasoning effort, that would provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.
To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
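To make the "run many in parallel and do statistics" option concrete, here is a minimal sketch. `simulate_turn` is a hypothetical stand-in (a seeded coin flip, not a model call), and the Wilson score interval is just one reasonable choice for putting error bars on a legal-turn rate; none of this is taken from the app.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_turn(seed: int) -> bool:
    """Hypothetical stand-in for one LLM simulation; True means the turn was legal."""
    return random.Random(seed).random() < 0.8


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one run")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def run_batch(n_runs: int, workers: int = 16) -> tuple[int, int]:
    """Run the simulations on a thread pool, which suits I/O-bound API calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate_turn, range(n_runs)))
    return sum(results), n_runs


legal, total = run_batch(200)
low, high = wilson_interval(legal, total)
print(f"{legal}/{total} legal turns, 95% CI [{low:.3f}, {high:.3f}]")
```

With real model calls the wall-clock time is roughly that of the slowest run in each wave of workers, and the interval shows how many runs are needed before two decks or two models can actually be told apart.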
I wrote a rules engine in Rust along with a reinforcement-learning system based on MCTS to play decks against each other. It can handle aggro decks well enough but complex combo decks like Amulet Titan are tough to get working without expert demos or reward hacking.
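For readers who haven't met MCTS: the selection step usually picks the child that maximizes the UCB1 score, which trades the observed win rate against how rarely a move has been tried. The sketch below is the textbook rule in Python, not the Rust implementation described above; the move names and counts are invented.

```python
import math


def uct_score(wins: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """UCB1 applied to trees: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try an unvisited move first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: dict, parent_visits: int) -> str:
    """children maps a move name to (wins, visits); return the move to explore next."""
    return max(children, key=lambda name: uct_score(*children[name], parent_visits))


# Invented statistics for three candidate moves at one node.
stats = {"attack": (6, 10), "hold": (1, 2), "cast": (0, 0)}
print(select_child(stats, parent_visits=12))
```

The exploration bonus shrinks as a move gets visited, which is part of why long combo lines are hard: each individual step looks unrewarding until the whole sequence has been found.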
You don't explain how scoring works, maybe it's obvious to MTG players? If you're using gpt 5.5, is there a possibility that it is biased in favour of models that think the way it does?
The scoring is just based on a simple prompt which is given the game state at the start and end of the turn, the log of tool calls, and the final turn summary. The prompt asks it to evaluate the quality of the simulation from 0 to 10, and to give a pass or fail on whether it is legal.
It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
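For anyone wanting to reproduce this kind of scoring, a sketch of how judge verdicts could be parsed and aggregated. The JSON reply shape (`quality`, `legal`) is an assumption made for illustration; the actual judge prompt and output format are not shown in this thread.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    quality: int  # 0 to 10, as described above
    legal: bool   # True for "pass", False for "fail"


def parse_verdict(reply: str) -> Verdict:
    """Parse one judge reply; the JSON shape here is hypothetical."""
    data = json.loads(reply)
    quality = int(data["quality"])
    if not 0 <= quality <= 10:
        raise ValueError(f"quality out of range: {quality}")
    if data["legal"] not in ("pass", "fail"):
        raise ValueError(f"unexpected legality value: {data['legal']!r}")
    return Verdict(quality=quality, legal=data["legal"] == "pass")


def summarize(verdicts: list) -> dict:
    """Aggregate per-turn verdicts into a legal-turn rate and a mean quality."""
    return {
        "legal_rate": sum(v.legal for v in verdicts) / len(verdicts),
        "mean_quality": mean(v.quality for v in verdicts),
    }


replies = ['{"quality": 8, "legal": "pass"}', '{"quality": 3, "legal": "fail"}']
summary = summarize([parse_verdict(r) for r in replies])
print(summary)
```

Rejecting malformed replies loudly, instead of defaulting them to a fail, keeps judge formatting problems from being counted as player errors.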
This benchmark ended up being more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking magic rules, but because it could not sequence the tool calls correctly.
For example: https://app.mtgautodeck.com/public/benchmarks/6349dda2-4069-...
and: https://app.mtgautodeck.com/public/benchmarks/dcc18bd8-339d-...
The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. Those two examples are not errors because the judge prompt ruled them illegal; rather, in both the model stopped the simulation itself, knowing that it had made a tool error.
The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.
Awesome! Does this use https://mage-bench.com/, or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.
No, I was not aware of that project when I made this.
I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get them to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while they mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.
To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?
Benchmarks like this are onto something. Next frontier of LLM benchmarking.
Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.
They should randomize games of judge tower and see who wins:
https://mtg.fandom.com/wiki/Judge_Tower
Heads up, most of the community migrated off Fandom a little while ago. https://mtg.wiki/page/Judge_Tower
Looking forward to this metric being Goodhart lawed.
Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.
Magic is complicated. I looked at doing something like this but the open-ended nature where one specific card will completely change the rules or require a series of followup events or modifications to the rules engine at hand is just tremendous.
Or that certain cards, when played together, make an infinite loop, and so cannot be played/insta-die.
There is also this Matt Parker video about MTG, in which he explores a specific three-card combination that produces an ungodly amount of creature tokens.
https://www.youtube.com/watch?v=x3dE-NJ1UDQ
You misspelled insta-win. Infinite turn combos are the best.
I was wondering how complicated it could really be, and it turns out that some people showed in 2019 that it's Turing-complete -- meaning that any conceivable computation can be simulated by an MTG game, indeed a game in which every move by every player is forced: https://arxiv.org/abs/1904.09828
IOW, it's as complicated as possible.
Someone made a video based on the paper, if you want to see the cards being used and a little more explanation: https://www.youtube.com/watch?v=pdmODVYPDLA