Highlights

The RL² multi-reward system enables agents to learn faster in sparse-reward environments, provided the rewards are aligned with an optimal policy. By designing reward functions experimentally and assigning appropriate weights to each, our agents consistently learn to score goals in gameplay.
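
As a minimal sketch, the combined signal is just a weighted sum over the individual surrogate terms (the term names and numbers below are illustrative, not our tuned values):

    import numpy as np

    def combined_reward(reward_values, weights):
        """Weighted sum of the individual surrogate reward terms."""
        return float(np.dot(np.asarray(weights, dtype=np.float64),
                            np.asarray(reward_values, dtype=np.float64)))

    # Hypothetical example: three surrogate terms with hand-tuned weights.
    terms = [0.8, 1.0, -0.4]    # face-ball, touch, concede-risk signals
    weights = [0.05, 1.0, 0.5]
    print(combined_reward(terms, weights))  # 0.84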

Beyond the basic objectives, our surrogate reward functions lead agents to adopt human-like priors such as long-shot scoring, tactical tackles, and pressuring the opponent. We find clear evidence of these emergent behaviors in our environment, each of which creates a new pressure for the opposing team to adapt to.

[Clips: Dodge opponent; Catching up to save goal]

[Clips: With all rewards trained at once; Step-by-step multi-reward with weight decay]

Stage-wise multi-reward training

Training the agent with multiple reward signals tends to result in faster convergence to an optimal policy. However, breaking the training into three distinct stages leads to more stable learning overall.
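
One way to realize this, sketched below with hypothetical term names and numbers rather than our tuned values, is a per-stage weight schedule in which earlier shaping terms are decayed rather than dropped when the next stage begins:

    # Illustrative stage schedule; the names and numbers are placeholders.
    STAGE_WEIGHTS = {
        1: {"face_ball": 0.25, "vel_to_ball": 0.50, "touch_ball": 1.0},
        2: {"face_ball": 0.10, "vel_to_ball": 0.20, "touch_ball": 0.40,
            "ball_vel_to_goal": 1.0},
        3: {"face_ball": 0.05, "vel_to_ball": 0.10, "touch_ball": 0.20,
            "ball_vel_to_goal": 1.0, "possession_penalty": 0.50,
            "demo_bonus": 0.30},
    }

    def weights_for(stage):
        """Return the reward weights active in a given training stage."""
        return STAGE_WEIGHTS[stage]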

In the initial stage, when the policy entropy is high, the reward focuses solely on facing, touching, and moving toward the ball. This helps the agent develop fundamental control and direction.
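
A minimal sketch of such a stage-1 shaping term, assuming 3D field-coordinate vectors (the weights are illustrative placeholders):

    import numpy as np

    CAR_MAX_SPEED = 2300.0  # Rocket League's cap on car speed (uu/s)

    def stage1_reward(car_pos, car_forward, car_vel, ball_pos,
                      touched_ball, w_face=0.25, w_vel=0.5, w_touch=1.0):
        """Dense stage-1 shaping: face the ball, drive at it, touch it."""
        to_ball = ball_pos - car_pos
        to_ball = to_ball / (np.linalg.norm(to_ball) + 1e-8)   # unit vector
        face = float(np.dot(car_forward, to_ball))             # in [-1, 1]
        vel = float(np.dot(car_vel, to_ball)) / CAR_MAX_SPEED  # normalized
        touch = 1.0 if touched_ball else 0.0
        return w_face * face + w_vel * vel + w_touch * touch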

In stage two, the objective shifts toward goal-oriented behavior. The agent learns to shoot effectively by optimizing for both the ball's velocity and its trajectory toward the goal.
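
A sketch of this stage-2 objective, rewarding the component of the ball's velocity that points at the opponent goal (the function and argument names are illustrative; the speed cap is Rocket League's):

    import numpy as np

    BALL_MAX_SPEED = 6000.0  # Rocket League's cap on ball speed (uu/s)

    def ball_to_goal_reward(ball_pos, ball_vel, opp_goal_pos):
        """Reward the ball-velocity component aimed at the opponent goal."""
        to_goal = opp_goal_pos - ball_pos
        to_goal = to_goal / (np.linalg.norm(to_goal) + 1e-8)
        return float(np.dot(ball_vel, to_goal)) / BALL_MAX_SPEED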

In the final stage, the agent is trained to pressure the opponent based on player positioning and game-state dynamics. Penalties are introduced when the opponent gains ball possession, and we incorporate fun behaviors, such as rewarding the agent for demolishing an opponent.
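
A sketch of these event-based stage-3 terms (the flags and weights are hypothetical):

    def stage3_event_reward(opponent_has_ball, demolished_opponent,
                            w_possession=0.5, w_demo=0.3):
        """Event-based stage-3 terms: possession penalty plus demo bonus."""
        reward = 0.0
        if opponent_has_ball:        # opponent gained possession this step
            reward -= w_possession
        if demolished_opponent:      # the 'fun' demolition bonus
            reward += w_demo
        return reward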

Below are some clips showcasing the agent scoring goals.

Tackling the opponent

By penalizing the agent whenever the game state favors the opponent (based on the location and possession of the ball), the agents learn to play more aggressively.
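
One plausible shape for this penalty, growing as the ball sits deeper in our own half and doubling while the opponent holds possession (a sketch; only the goal-line distance is Rocket League's actual constant):

    HALF_FIELD_Y = 5120.0  # midfield to goal line in Rocket League (uu)

    def opponent_favor_penalty(ball_y, own_goal_y, opponent_has_ball, w=0.3):
        """Penalty scaled by how deep the ball sits in our own half,
        doubled while the opponent holds possession."""
        depth = max(0.0, 1.0 - abs(ball_y - own_goal_y) / HALF_FIELD_Y)
        scale = 2.0 if opponent_has_ball else 1.0
        return -w * scale * depth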

Reward hacking

Since an episode only terminates when a goal is scored or when no agent touches the ball for 30 seconds, there are instances in which both agents simply keep touching the ball to farm the touch reward for as long as possible.
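
For concreteness, here is a sketch of that termination rule as a stateful terminal condition (the class and argument names are ours, not any library's):

    class NoTouchOrGoalTerminal:
        """Episode ends on a goal, or after 30 s without a ball touch."""

        def __init__(self, timeout_seconds=30.0, steps_per_second=15.0):
            self.timeout_steps = int(timeout_seconds * steps_per_second)
            self.steps_since_touch = 0

        def reset(self):
            self.steps_since_touch = 0

        def is_terminal(self, goal_scored, any_agent_touched_ball):
            if goal_scored:
                return True
            if any_agent_touched_ball:
                self.steps_since_touch = 0
            else:
                self.steps_since_touch += 1
            return self.steps_since_touch >= self.timeout_steps

Capping or decaying the per-touch reward within an episode would be one way to close this loophole.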

Miscellaneous

Here are a few additional clips in which the agent tries to maximize reward through boost collection, goal saving, and random behavior when the model is trained further with lower reward weights for non-goal objectives.

BibTeX

@misc{ddrlSpring2025Project,
  title={RL2: Reinforcement Learning in Rocket League with a Multi-Reward System},
  author={Rahul Raman and Satyanarayana Chillale and Srivats Poddar and Anirudh Garg},
  year={2025},
  url={https://aiden-frost.github.io/RocketLeagueGym-Rewards/},
}