This looks really interesting. It would be a good project to hook this up to a general card-playing framework, making it easy to evaluate on a variety of imperfect-information games based on playing cards.
What tripped me up every time is that most board games have a lot of "if this happens, there is this specific rule that applies". Even relatively simple games (like Homeworlds) are pretty hard to nail down perfectly due to all the special cases.
Do you, or somebody else, have any recommendations on how to handle this?
[0] Dominion, Homeworlds and the battle part of Eclipse iirc.
In my 2-player Splendor rules engine, the following actions are possible:
1. Purchase a holding. (90 possible actions, one for each holding)
2. If you do not have 3 reserved cards, reserve a card and take a gold chip if possible. (93 possible actions, one for each holding and one for each deck of facedown cards)
3. If there are 4 chips of the same color in a pile, take 2 chips of that color. (5 possible actions)
4. Take 3 chips of different colors, or 2 chips of different colors if only 2 are available, or 1 chip if only 1 is available. (25 possible actions)
5. If after any action you have at least 11 chips, return 1 chip. (6 possible actions which are never legal at the same time as any other actions)
This still doesn't correctly implement the rules though. In the actual game, you'd be allowed to spend gold chips when you don't need to, which would make purchasing holdings contain extra decisions after you pick which holding to purchase about which chips you'd like to keep.
(At the same time that probably makes it a good choice for a game implementation)
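The action enumeration above can be sketched as a single function over a hypothetical state dict; the field names (`holdings`, `reserved`, `piles`, `my_chips`) and the omitted affordability check are placeholders for illustration, not the actual engine:

```python
from itertools import combinations

COLORS = ["white", "blue", "green", "red", "black"]

def legal_actions(state):
    """Enumerate legal actions for the player to move.

    `state` is a hypothetical dict with keys `holdings` (face-up card ids),
    `reserved` (this player's reserved cards), `piles` (chip counts per
    color), and `my_chips` (this player's chip counts per color).
    """
    # 5. Over the chip limit: returning a chip is the only legal action.
    if sum(state["my_chips"].values()) >= 11:
        return [("return", c) for c, n in state["my_chips"].items() if n > 0]
    actions = []
    # 1. Purchase a holding (affordability check omitted for brevity).
    actions += [("purchase", card) for card in state["holdings"]]
    # 2. Reserve a face-up card or the top of one of the three decks.
    if len(state["reserved"]) < 3:
        actions += [("reserve", card) for card in state["holdings"]]
        actions += [("reserve_deck", deck) for deck in range(3)]
    # 3. Take 2 chips of one color if at least 4 remain in that pile.
    actions += [("take2", c) for c in COLORS if state["piles"][c] >= 4]
    # 4. Take 3 chips of different colors (fewer if fewer are available).
    available = [c for c in COLORS if state["piles"][c] > 0]
    k = min(3, len(available))
    if k:
        actions += [("take_distinct", combo)
                    for combo in combinations(available, k)]
    return actions
```

With 12 face-up cards and full piles this yields 12 + 15 + 5 + 10 actions, matching the per-category counts above (the 90/93 figures come from enumerating every holding in the full deck).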
Thing is that for all my examples above I had a "good" reason to implement that specific game:
1. Dominion (shortly after it came out): to evaluate strategies to best my friends (obviously).
2. Eclipse: has a nice rock-paper-scissors type of ship combat, where you can counter every enemy build (if you have enough time and resources). Calculating the odds of winning would be interesting.
3. Homeworlds: seems to be a very fascinating game, but without any players to compete with [0] ... AI to the rescue ;-)
[0] I am aware of SDG where I could play online, but that site is in decay mode. Getting an account involved mailing the maintainer and those times I tried to start a game no players showed up.
That’s what I did when I applied RL to Dominion, because the complexity of the game depends heavily on the cards you include! See part 3 of https://ianwdavis.com/dominion.html
The key is to build a data-driven state machine, rather than writing logic with a bunch of 'if' statements.
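To make that concrete, here's a minimal sketch of what "data-driven" might mean for a deck-builder: card effects become table entries interpreted by one generic routine, so adding a card doesn't touch engine code. The effects mirror real Dominion cards, but the encoding and names are made up for illustration:

```python
# Each card is a list of primitive (opcode, amount) instructions; a single
# interpreter executes them, replacing per-card 'if' logic in the engine.
CARD_EFFECTS = {
    "Smithy":  [("draw", 3)],
    "Village": [("draw", 1), ("actions", 2)],
    "Market":  [("draw", 1), ("actions", 1), ("buys", 1), ("coins", 1)],
}

def play_card(player, card):
    """Apply a card's effect list to a player dict with keys
    `deck`, `hand`, `actions`, `buys`, `coins`."""
    for op, n in CARD_EFFECTS[card]:
        if op == "draw":
            for _ in range(n):
                if player["deck"]:
                    player["hand"].append(player["deck"].pop())
        elif op == "actions":
            player["actions"] += n
        elif op == "buys":
            player["buys"] += n
        elif op == "coins":
            player["coins"] += n
```

Cards whose effects require decisions ("discard a card, then...") would add opcodes that pause and ask the player, which is where the question-asking design below the surface of most engines comes in.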
Something I found amazing: inverting the flow control, so that the server asks players questions along with a list of possible choices, simplifies the agent design tremendously. As I'm looking to retire to work on this project, I can generate the agent code and then hand-craft an AI. However, some AIs are soooo hard to even conceptualize.
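A sketch of that inversion, with invented names: the engine owns the loop and always hands the agent an explicit list of legal choices, so an agent cannot even express an illegal move:

```python
import random

def random_agent(question, choices):
    """The simplest possible agent: answer any question with a uniformly
    random legal choice. Smarter agents only replace this one function."""
    return random.choice(choices)

def play_out(questions, agent):
    """Drive a game by asking the agent each pending question in turn.
    `questions` yields (question, legal_choices) pairs from the engine."""
    answers = []
    for question, choices in questions:
        answer = agent(question, choices)
        assert answer in choices  # illegal answers are impossible by design
        answers.append((question, answer))
    return answers
```

Because every agent shares the `(question, choices) -> choice` signature, a random baseline, a scripted bot, and a learned policy are interchangeable.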
Funny that most of the comments are about the name. What an excellent choice.
The first book Consider Phlebas isn't bad, but it isn't as well developed as the rest of the series IMO.
Funnily enough you can see the exact same effect in principal game-engineers or computer-hacking.
Edit: if not that's even more amusing
The practical question is: how much computation do you need to get useful results? AlphaGo Zero is impressive mathematics, but who is willing to spend $1M daily for months to train it? IMPALA (another Google one) can learn almost all Atari games, but you need a head node with 256 TPU cores and 1000+ evaluation workers to replicate the timings from the paper.
Suppose you're a business that needs to play games. Most people seem to think that it's a matter of plugging in the settings from the paper, buying the same hardware, then clicking a button and waiting.
It's not. The specific settings matter a lot.
But my main point is that you'll get most of your performance pretty rapidly. The only reason to leave it running for so long is to get that last N%, which is nice for benchmarks but not for business.
DeepMind overspends. Actually, they don't; they're not paying anywhere close to the list price of a 256-core TPU. (Many external companies aren't, either, and you can get a good deal by negotiating with the Cloud TPU team.)
But you don't need a 256-core TPU. Lots of times, these algorithms simply do not require the amount of compute that people throw at the problem.
On the other hand, you can also usually get access to that kind of compute. A 256-core TPU isn't beyond reach; I'm pretty sure I could create one right now. It's free, thanks to TFRC, and you yourself can apply (and be approved). I was. https://sites.research.google/trc/
It kills me that it's so hard to replicate these papers, which is most of the motivation for my comment here. Ultimately, you're right: "How much compute?" is a big unknown. But the lower bound is much lower than most people (and even most researchers) realize.
"IMPALA with 1 learner takes only around 10 hours to reach the same performance that A3C approaches after 7.5 days." says the paper, but I can run A3C on a cheap CPU-only server but to get that IMPALA timing, I need to spend a lot of money. But my biggest roadblock so far is that I need compute far exceeding what the papers claim.
The diagrams for IMPALA show good performance starting at 1e8 environment frames and excellent performance at 1e9 frames. By now, I'm at 2.5e9 frames and performance is still bad. In my opinion, the reason is that the action sequences for Bomberland are quite long. To clear a path, you place a bomb, wait 5 ticks for it to become detonatable, detonate it, then wait 10 ticks for the fire to clear. With 7 possible actions per tick, the chance of randomly executing this 17-tick sequence is (1/7)^17 ≈ 4e-15. If I optimistically assume that any move is fine while we wait, it rises to (1/7) * (5/7)^5 * (1/7) * (5/7)^10 ≈ 1e-4. But that still means that at 1e8 env steps, I only have 1000 successful executions to learn from.
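The arithmetic is easy to double-check:

```python
# Sanity-check the back-of-the-envelope numbers above.
# One exact 17-tick action sequence, 7 possible actions per tick:
p_exact = (1 / 7) ** 17
# Optimistic version: the place and detonate actions must be exact (1/7
# each), but any of 5 assumed-harmless actions is acceptable during the
# 5 + 10 waiting ticks:
p_loose = (1 / 7) * (5 / 7) ** 5 * (1 / 7) * (5 / 7) ** 10

print(f"p_exact ~ {p_exact:.0e}")  # on the order of 4e-15
print(f"p_loose ~ {p_loose:.0e}")  # on the order of 1e-4
```

The gap of eleven orders of magnitude between the two estimates is the whole argument for action masking or reward shaping in environments with long mandatory delays.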
I'm a dabbler in Go, and "somewhere below professional" at the game of poker. I've followed the advances in the latter for more than a decade, eagerly reading every paper the CPRG publishes. They use a LOT of compute power!
I know from experience that "the specific settings matter a lot." For several years, I made my living implementing papers for hire. It's real work, no argument there. Sometimes the settings are the solution, and heck, sometimes the published algorithm is outright wrong, and you only discover so when trying to implement it.
But the second part of your point, that it's not simply achieving more performance by throwing more transistors at it, I don't have experience with, and I sorta don't believe you. :)
Your comment is quite well written, making me (irrationally?) predisposed to suspect you're correct on factual matters, or at least more of a domain expert than I. Can you cite sources, or simply elaborate more?
Yes, and in the case of deep RL, the ability to get a "lucky" random initialization seems to (still) matter a lot.
I work on real-time control systems, which are essentially decision-making-under-uncertainty problems. A lot of RL research has become noise buoyed by large marketing budgets.
Is that true? I was unaware that PPO, SAC, DQN, IMPALA, MuZero/AlphaZero etc. would all automatically Just Work™ for hidden-information games. Straight MCTS-inspired algorithms seem like they'd fail for reasons discussed in the paper, and while PPO/IMPALA work reasonably well in Dota 2/SC2, it's not obvious they'd converge to perfect play.
If I remember correctly, the DeepMind x UCL RL Lecture Series proves the underlying Bellman equation in this video: https://www.youtube.com/watch?v=zSOMeug_i_M
As for "hidden information" games, I thought the trick was to concatenate the current state with all past states and treat that as the new state, thereby making it an MDP.
(History stacking may turn POMDPs into MDPs, but I don't know if they handle the specially adversarial nature of games like poker. That's quite different from stacking ALE frames.)
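A minimal sketch of that trick (observation stacking): wrap the environment so the agent sees the last k observations as one "state". The environment interface here is invented, and in practice you keep a bounded window rather than the full history:

```python
from collections import deque

class HistoryStack:
    """Wrap an environment so observations become k-tuples of recent frames.
    Assumes a hypothetical env with reset() -> obs and
    step(action) -> (obs, reward, done)."""

    def __init__(self, env, k=4):
        self.env, self.k = env, k
        self.frames = deque(maxlen=k)  # old frames fall off automatically

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):       # pad the window with the first frame
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.frames.append(obs)
        return tuple(self.frames), reward, done
```

For ALE this recovers velocities from pixel frames; for poker, the stacked "state" still omits the opponent's private cards, which is why the adversarial hidden-information case needs more than stacking.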
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
They also go back to referring to it as Stockfish for the rest of the paper.
An analogous situation in my mind would be if AMD released a new CPU and benchmarked it against an Intel CPU, only mentioning once, somewhere in the middle of the paper, that it was a Pentium 4.
I also suspect part of the reason they chose Stockfish 8 was as a basis of comparison with AlphaZero. Their baselines for Go and poker are also pretty weak so their emphasis is clearly on displaying generality and reduced domain specialized input, not supremacy.
A single algorithm that plays both perfect- and imperfect-information games is difficult to achieve. Standard depth-limited solvers and self-play RL result in highly exploitable agents. PoG appears to be very strong at chess, decently strong at Go, and decent at poker (Facebook AI's ReBeL, the strongest prior work in this area, performed better against Slumbot). What's unique about PoG is its ability to also play an imperfect-information game (Scotland Yard) that has many rounds and a relatively long horizon (although it still has scaling issues).
It really isn't though. Technical papers have conventions, and they follow them reasonably. You expect the methods description to be specific, the abstract not to be hyperbolic, and the conclusions to be balanced. The general discussion parts are just that, general.
In the methods area they discuss the exact versions and parameters used, and how they compared them.
In the conclusions:
In the perfect information games of chess and Go, PoG performs at the level of human experts or professionals, but can be significantly weaker than specialized algorithms for this class of games, like AlphaZero, when given the same resources.
It would perhaps have been interesting to include a more recent Stockfish, but it wouldn't really impact the paper. This is just a general effort to describe the present state of things. When they explicitly describe their evaluation process, they are sure to use the version number. They then _immediately_ drop the version number in subsequent usage, which is the cultural standard in research papers, so they don't concern themselves with minute details of every single thing they find themselves redescribing. Believe me, you don't want to read the verbose version of this paragraph.
> In chess, we evaluated PoG against Stockfish 8, level 20 [81] and AlphaZero. PoG(800, 1) was run in training for 3M training steps. During evaluation, Stockfish uses various search controls: number of threads, and time per search. We evaluate AlphaZero and PoG up to 60000 simulations. A tournament between all of the agents was played at 200 games per pair of agents (100 games as white, 100 games as black). Table 1a shows the relative Elo comparison obtained by this tournament, where a baseline of 0 is chosen for Stockfish(threads=1, time=0.1s).
This is just benchmark cherry picking and does not reflect real performance or comparison.
> Grimes seemingly makes multiple, thinly veiled references to Musk in the song
https://www.independent.co.uk/arts-entertainment/music/news/...
I am pretty sure aimbots access the internals of the game rather than reading the video output to identify the silhouette of the enemy.
A bit of googling shows that there is a General Game Playing AI community with their own Game Description Language. I never really encountered them before, and the DeepMind paper does not cite them, either.
"SCP-29123 Player Of Games"