undefined | Better HN

0 pointsgwern4y ago0 comments

I don't think you can prove that (forgive me if I don't sit through a 2h video). Those all are susceptible to the deadly triad, and AFAIK there are no convergence proofs of any kind for the big model-free DL algs, and it would've been big news if someone had proved that a real-world version of PPO/DQN/IMPALA does in fact converge in the limit. Sutton's book and earlier proofs only cover cases where you drop the nonlinear approximator or something.

(History stacking may turn POMDPs into MDPs, but I don't know if they handle the specially adversarial nature of games like poker. That's quite different from stacking ALE frames.)

0 comments

cygaril4y ago

Standard RL algorithms will converge to optimal play versus a fixed opponent, but will not find an optimal policy via self play.

One intuitive way to see this is that a sequence of improving pure policies A < B < C < etc. will converge to optimal play in a perfect information game like chess, but not necessarily in an imperfect information game like rock/paper/scissors where Rock < Paper < Scissors < Rock, etc

j / k navigate · click thread line to collapse

0 comments

cygaril4y ago

Standard RL algorithms will converge to optimal play versus a fixed opponent, but will not find an optimal policy via self play.

j / k navigate · click thread line to collapse