【DeepMind大作】34位作者：Mastering the Game of Stratego with Model-Free MARL

实验室官方助手

论文原文： https://arxiv.org/pdf/2206.15378.pdf

We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego1 from scratch, up to a human expert level. Stratego is one of the few iconic board games that Artificial Intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of 10535 nodes, i.e., 10175 times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold’em poker, which has a significantly smaller game tree (on the order of 10164 nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, with often hundreds of moves before a player wins, and situations in Stratego can not easily be broken down into manageably-sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularised Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of ‘cycling’ around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beats existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.

我们介绍了 DeepNash，一种能够从零开始学习玩不完美信息游戏 Stratego1 的自主智能体，直至达到人类专家的水平。 Stratego 是人工智能 (AI) 尚未掌握的少数标志性棋盘游戏之一。这个流行的游戏有一个巨大的游戏树，大约有 10535 个节点，比围棋大 10175 倍。它具有额外的复杂性，需要在不完全信息下进行决策，类似于德州扑克，它的游戏树要小得多（大约 10164 个节点）。 Stratego 中的决策是根据大量离散的行动做出的，行动和结果之间没有明显的联系。情节很长，在玩家获胜之前通常需要数百步棋，并且 Stratego 中的情况不能像扑克中那样轻易地分解为可管理大小的子问题。由于这些原因，Stratego 几十年来一直是 AI 领域的一项重大挑战，现有的 AI 方法几乎无法达到业余水平。 DeepNash 使用博弈论、无模型的深度强化学习方法，无需搜索，通过自我对弈来学习掌握 Stratego。正则化纳什动力学 (R-NaD) 算法是 DeepNash 的关键组成部分，通过直接修改底层多智能体学习动力学，收敛到近似纳什均衡，而不是围绕它“循环”。 DeepNash 在 Stratego 中击败了现有最先进的 AI 方法，并在 Gravon 游戏平台上获得了年度（2022 年）和历史前三名，与人类专家玩家竞争。

论文原文： https://arxiv.org/pdf/2206.15378.pdf

Document