Yes, the easiest solution is to design the trajectory reward function as a sum of state-action features along the trajectory, i.e. R(τ) = Σ_t w·φ(s_t, a_t). Because the return decomposes additively over time steps, the learned weights w can be used directly as a per-step reward r(s, a) = w·φ(s, a) in reinforcement learning, and summing that per-step reward over a trajectory recovers exactly the trajectory reward that was learned. The other simple option is to reward the RL agent only once, at the very end, when the trajectory is complete. That is of course a sparse-reward setting, and RL training becomes harder.
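To make the decomposition concrete, here is a minimal Python sketch of both options. Everything here is illustrative: `phi`, the weight values, and the toy trajectory are placeholder assumptions, not anything from the original answer.

```python
import numpy as np

def phi(state, action):
    # Hypothetical feature map: in practice you would design features
    # that capture the task-relevant aspects of (state, action).
    return np.array([state, action, state * action], dtype=float)

# Weights assumed to have been learned from trajectory-level data,
# e.g. by fitting R(tau) = w . sum_t phi(s_t, a_t)  (toy values).
w = np.array([0.5, -0.2, 1.0])

def step_reward(state, action):
    # Dense per-step reward r(s, a) = w . phi(s, a). Summed over a
    # trajectory this reproduces R(tau) exactly, so the learned
    # weights transfer directly to standard RL training.
    return float(w @ phi(state, action))

def terminal_reward(trajectory):
    # Sparse alternative: zero reward at every intermediate step,
    # then pay out the whole trajectory reward once at episode end.
    return sum(step_reward(s, a) for s, a in trajectory)

# Toy trajectory of (state, action) pairs.
traj = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
dense = [step_reward(s, a) for s, a in traj]
assert abs(sum(dense) - terminal_reward(traj)) < 1e-9  # same return
```

The assertion at the end spells out the point: both reward schemes assign the same total return to a trajectory; they differ only in when the agent receives the signal.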