Offline Reinforcement Learning (ORL) enables us to separately study the two interlinked processes of reinforcement learning: collecting informative experience and inferring optimal behaviour. The second step has been widely studied in the offline setting, but just as critical to data-efficient RL is the collection of informative data. The task-agnostic setting for data collection, where the task is not known a priori, is of particular interest because a single dataset can be collected once and reused to solve several downstream tasks as they arise. We investigate this setting via curiosity-based intrinsic motivation, a family of exploration methods which encourage the agent to explore those states or transitions it has not yet learned to model. With Explore2Offline, we propose to evaluate the quality of collected data by transferring it to downstream tasks and inferring policies with reward relabelling and standard offline RL algorithms. Using this scheme, we evaluate a wide variety of data collection strategies, including a new exploration agent, Intrinsic Model Predictive Control (IMPC), and demonstrate their performance on various tasks. We use this decoupled framework to strengthen intuitions about exploration and the data prerequisites for effective offline RL.
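To make the pipeline concrete, the sketch below illustrates the two ingredients named in the abstract: relabelling a task-agnostic dataset with a downstream task reward before offline RL, and a curiosity-style intrinsic reward based on the prediction error of a learned dynamics model. This is a minimal illustration under assumed interfaces (transitions stored as tuples, a `reward_fn` for the downstream task, a hypothetical `model.predict`), not the paper's implementation.

```python
import numpy as np

def relabel_dataset(transitions, reward_fn):
    """Relabel a task-agnostic dataset with a downstream task reward.

    Assumes `transitions` is an iterable of (obs, action, next_obs) tuples
    collected by the exploration agent, and `reward_fn` is the downstream
    task's reward function, evaluated post hoc on each stored transition.
    """
    relabelled = []
    for obs, action, next_obs in transitions:
        reward = reward_fn(obs, action, next_obs)
        relabelled.append((obs, action, reward, next_obs))
    return relabelled  # ready to feed into a standard offline RL algorithm

def intrinsic_reward(model, obs, action, next_obs):
    """Curiosity-style intrinsic reward: surprise of a learned dynamics model.

    `model.predict` is a hypothetical one-step predictor of the next
    observation; larger prediction error means the transition is less
    well modelled and therefore more worth exploring.
    """
    pred_next = model.predict(obs, action)
    return float(np.linalg.norm(pred_next - next_obs))
```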