This looks really interesting. A good follow-up project would be to hook this up to a general card-playing framework, so it could easily be tested on a variety of imperfect-information games based on playing cards.
What tripped me up every time is that most board games have a lot of "if this happens, this specific rule applies" exceptions. Even relatively simple games (like Homeworlds) are pretty hard to nail down perfectly because of all the special cases.
Do you, or somebody else, have any recommendations on how to handle this?
[0] Dominion, Homeworlds and the battle part of Eclipse iirc.
In my 2-player Splendor rules engine, the following actions are possible:
1. Purchase a holding. (90 possible actions, one for each holding)
2. If you do not have 3 reserved cards, reserve a card and take a gold chip if possible. (93 possible actions, one for each holding and one for each deck of facedown cards)
3. If there are 4 chips of the same color in a pile, take 2 chips of that color. (5 possible actions)
4. Take 3 chips of different colors, or 2 chips of different colors if only 2 are available, or 1 chip if only 1 is available. (25 possible actions)
5. If after any action you have at least 11 chips, return 1 chip. (6 possible actions which are never legal at the same time as any other actions)
This still doesn't correctly implement the rules though. In the actual game, you'd be allowed to spend gold chips when you don't need to, which would make purchasing holdings contain extra decisions after you pick which holding to purchase about which chips you'd like to keep.
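A flat enumeration like the one above can be sketched in a few lines. This is my own illustration of the action space described in the list, not code from the actual engine; all names are made up, and the counts match the numbers given for each action type:

```python
from itertools import combinations

COLORS = ["white", "blue", "green", "red", "black"]

def enumerate_actions():
    """Enumerate the flat 2-player Splendor action space described above."""
    actions = []
    # 1. Purchase one of the 90 holdings.
    actions += [("purchase", card_id) for card_id in range(90)]
    # 2. Reserve one of the 90 holdings or one of the 3 facedown decks (93).
    actions += [("reserve_card", card_id) for card_id in range(90)]
    actions += [("reserve_deck", deck) for deck in range(3)]
    # 3. Take 2 chips of one color (5).
    actions += [("take_two", c) for c in COLORS]
    # 4. Take 1, 2, or 3 chips of different colors (5 + 10 + 10 = 25).
    for n in (1, 2, 3):
        actions += [("take_diff", combo) for combo in combinations(COLORS, n)]
    # 5. Return a single chip, gold included (6).
    actions += [("return_chip", c) for c in COLORS + ["gold"]]
    return actions

actions = enumerate_actions()
assert len(actions) == 90 + 93 + 5 + 25 + 6  # 219 total
```

Legality (e.g. "only if you have fewer than 3 reserved cards") then becomes a mask over this fixed list rather than part of the enumeration itself.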
That’s what I did when I applied RL to Dominion, because the complexity of the game depends heavily on the cards you include! See part 3 of https://ianwdavis.com/dominion.html
The key is to build a data-driven state machine, rather than writing logic with a bunch of 'if' statements.
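One way to read "data-driven state machine" (my own sketch, with hypothetical names, not anything from the comment above): the special cases live in a lookup table of (phase, trigger) pairs, so adding a rule means adding a row, not another nested 'if':

```python
# Hypothetical rules table: each special case is a data entry mapping
# (phase, trigger) to a handler that returns the next phase.
RULES = {
    ("action", "chips_over_limit"): lambda game: "discard",  # e.g. Splendor's chip limit
    ("action", "turn_done"):        lambda game: "scoring",
    ("discard", "chip_returned"):   lambda game: "scoring",
    ("scoring", "turn_done"):       lambda game: "end_turn",
}

def step(phase, trigger, game_state):
    handler = RULES.get((phase, trigger))
    if handler is None:
        raise ValueError(f"no rule for {(phase, trigger)}")
    return handler(game_state)
```

The engine loop then only ever calls `step`, and every "if this happens, this rule applies" case is a table row that can be inspected, tested, or even loaded from data files.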
Something I found amazing: inverting the control flow, so that the server asks players questions (each with a list of possible choices), simplifies the agent design tremendously. As I'm looking to retire to work on this project, I can generate the agent code and then hand-craft an AI. However, some AIs are soooo hard to even conceptualize.
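The inverted flow described above might look something like this toy sketch (my own illustration; the engine and agent names are made up). The engine knows the rules and pushes questions with enumerated legal choices, so the agent never has to reimplement legality:

```python
import random

class ToyEngine:
    """Toy engine that drives play by asking questions with legal choices."""
    def __init__(self):
        self.pending = [("take_chips", ["red", "green", "blue"])]
        self.log = []

    def next_question(self):
        return self.pending.pop() if self.pending else None

    def apply(self, question, answer):
        self.log.append((question, answer))

class RandomAgent:
    """An agent reduces to: given legal choices, pick one."""
    def choose(self, question, choices):
        return random.choice(choices)

def run_turn(engine, agent):
    # The engine asks; the agent only ever answers from the given list.
    while (q := engine.next_question()) is not None:
        question, choices = q
        engine.apply(question, agent.choose(question, choices))

engine = ToyEngine()
run_turn(engine, RandomAgent())
```

Since every answer comes from an enumerated list, even a trivial random agent is automatically rules-legal, which is exactly why generated agent code becomes feasible.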
Funny that most of the comments are about the name. What an excellent choice.
The first book Consider Phlebas isn't bad, but it isn't as well developed as the rest of the series IMO.
Funnily enough, you can see the exact same effect with principal game engineers or in computer hacking.
The practical question is: how much computation do you need to get useful results? AlphaGo Zero is impressive mathematics, but who is willing to spend $1 million a day for months to train it? IMPALA (another Google one) can learn almost all Atari games, but you need a head node with 256 TPU cores and 1000+ evaluation workers to replicate the timings from the paper.
Suppose you're a business that needs to play games. Most people seem to think that it's a matter of plugging in the settings from the paper, buying the same hardware, then clicking a button and waiting.
It's not. The specific settings matter a lot.
But my main point is that you'll get most of your performance pretty rapidly. The only reason to leave it running for so long is to get that last N%, which is nice for benchmarks but not for business.
DeepMind overspends. Actually, they don't; they're not paying anywhere close to the price of a 256 core TPU. (Many external companies aren't, either, and you can get a good deal by negotiating with the Cloud TPU team.)
But you don't need a 256 core TPU. Lots of times, these algorithms simply do not require the amount of compute that people throw at the problem.
On the other hand, you can also usually get access to that kind of compute. A 256 core TPU isn't beyond reach. I'm pretty sure I could create one right now. It's free, thanks to TFRC, and you yourself can apply (and be approved). I was. https://sites.research.google/trc/
It kills me that it's so hard to replicate these papers, which is most of the motivation for my comment here. Ultimately, you're right: "How much compute?" is a big unknown. But the lower bound is much lower than most people (and most researchers) realize.
"IMPALA with 1 learner takes only around 10 hours to reach the same performance that A3C approaches after 7.5 days," says the paper. But I can run A3C on a cheap CPU-only server, whereas getting that IMPALA timing means spending a lot of money. My biggest roadblock so far is that I need compute far exceeding what the papers claim.
The diagrams for IMPALA show good performance starting at 1e8 environment frames and excellent performance at 1e9 frames. By now I'm at 2.5e9 frames and performance is still bad. In my opinion, the reason is that the action sequences for Bomberland are quite long. To clear a path, you place a bomb, wait 5 ticks for it to become detonatable, detonate it, then wait 10 ticks for the fire to clear. With 7 possible actions per tick, the chance of randomly executing this 17-tick sequence is (1/7)^17 ≈ 4e-15. If I optimistically assume that any move is fine during the waiting ticks, I can get up to (1/7)·(5/7)^5·(1/7)·(5/7)^10 ≈ 1e-4. But that still means that at 1e8 env steps, I only have about 1000 successful executions to learn from.
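The two back-of-envelope numbers above can be reproduced directly:

```python
# Chance of a random policy executing the 17-tick bomb sequence.
# Strict case: all 17 ticks must be exactly right.
p_exact = (1 / 7) ** 17
# Loose case: only placing the bomb and detonating it are forced actions;
# during the 15 waiting ticks, 5 of the 7 actions are assumed harmless.
p_loose = (1 / 7) * (5 / 7) ** 5 * (1 / 7) * (5 / 7) ** 10

print(f"{p_exact:.1e}")  # 4.3e-15
print(f"{p_loose:.1e}")  # 1.3e-4
```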
I'm a dabbler in Go, and "somewhere below professional" at the game of poker. I've followed the advances in the latter for more than a decade, eagerly reading every paper the CPRG publishes. They use a LOT of compute power!
I know from experience that "the specific settings matter a lot." For several years, I made my living "implementing papers for hire". It's real work, no argument there. Sometimes the settings are the solution, and heck, sometimes the published algorithm is outright wrong, and you only discover that when trying to implement it.
But the second part of your point, that it's not simply achieving more performance by throwing more transistors at it, I don't have experience with, and I sorta don't believe you. :)
Your comment is quite well written, making me (irrationally?) predisposed to suspect you're correct on factual matters, or at least more of a domain expert than I. Can you cite sources, or simply elaborate more?
Is that true? I was unaware that PPO, SAC, DQN, IMPALA, MuZero/AlphaZero, etc. would all automatically Just Work™ for hidden-information games. Straight MCTS-inspired algorithms seem like they'd fail for reasons discussed in the paper, and while PPO/IMPALA work reasonably well in DoTA2/SC2, it's not obvious they'd converge to perfect play.
If I remember correctly, the DeepMind x UCL RL Lecture Series proves the underlying Bellman equation in this video: https://www.youtube.com/watch?v=zSOMeug_i_M
As for "hidden information" games, I thought the trick was to concatenate the current observation with all past observations and treat that as the new state, thereby making the problem an MDP.
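That trick can be sketched in a few lines (my own illustration, using a fixed window over the history, as in Atari frame stacking, rather than the full unbounded concatenation):

```python
from collections import deque

class HistoryWrapper:
    """Fixed-window approximation of 'concatenate all past observations'."""
    def __init__(self, window=4):
        # maxlen drops the oldest observation once the window is full.
        self.history = deque(maxlen=window)

    def observe(self, obs):
        self.history.append(obs)
        # This tuple, not the raw observation, is what the agent treats
        # as the state.
        return tuple(self.history)

w = HistoryWrapper(window=2)
w.observe("opponent_played_card")
state = w.observe("drew_card")  # ("opponent_played_card", "drew_card")
```

Whether this alone gets you to perfect play against other strategic players is exactly the question raised upthread; the wrapper only restores the Markov property with respect to your own observations.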
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
They also go back to referring to it as Stockfish for the rest of the paper.
An analogous situation in my mind would be if AMD released a new CPU and benchmarked it against an Intel CPU, only mentioning once, somewhere in the middle of the paper, that it was a Pentium 4.
> Grimes seemingly makes multiple, thinly veiled references to Musk in the song
https://www.independent.co.uk/arts-entertainment/music/news/...
I am pretty sure aimbots access the internals of the game rather than reading the video output to identify the silhouettes of enemies.
A bit of googling shows that there is a General Game Playing AI community with their own Game Description Language. I had never really encountered them before, and the DeepMind paper doesn't cite them either.
"SCP-29123 Player Of Games"