Adversarial Attacks in Imitation Learning (AIRL): Black-Box Attack Framework

Tags: UWaterloo · AI Safety · Python · PyTorch · OpenAI Gym
Figure: Learned reward heatmaps and transfer learning curves on tabular MDPs under adversarial conditions.

Description

  • Arezoo Alipanah

  • University of Waterloo

  • 2025

This project extends my research on IRL robustness to the Adversarial Inverse Reinforcement Learning (AIRL) setting. AIRL disentangles the reward function from the environment dynamics, which makes it particularly powerful for transfer learning and safe RL.
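The disentanglement can be sketched on a tabular MDP, the same setting as the figure above. This is a minimal illustration under assumed names (the class, fields, and method signatures are hypothetical, not the project's actual code): AIRL learns a state-only reward g alongside a shaping potential h, and the potential term absorbs dynamics-dependent structure so that g transfers across environments.

```python
import math


class TabularAIRLReward:
    """Minimal sketch of AIRL's disentangled reward on a tabular MDP
    (illustrative names only, not the project's implementation).

    f(s, s') = g[s] + gamma * h[s'] - h[s]

    g is the state-only reward estimate; h is a potential-based shaping
    term. Because potential-based shaping does not change the optimal
    policy, g can be recovered independently of the dynamics.
    """

    def __init__(self, n_states, gamma=0.99):
        self.gamma = gamma
        self.g = [0.0] * n_states  # state-only reward parameters
        self.h = [0.0] * n_states  # shaping-potential parameters

    def f(self, s, s_next):
        # Disentangled reward plus potential-based shaping term
        return self.g[s] + self.gamma * self.h[s_next] - self.h[s]

    def discriminator(self, s, s_next, log_pi):
        # D(s, a, s') = exp(f) / (exp(f) + pi(a|s)) = sigmoid(f - log pi(a|s))
        x = self.f(s, s_next) - log_pi
        return 1.0 / (1.0 + math.exp(-x))
```

The discriminator form is the standard AIRL objective: when training converges, f approaches the log-probability ratio of expert versus policy transitions, and the state-only g is the portable reward estimate.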

We developed a black-box adversarial attack framework targeting AIRL agents, designed to induce unsafe behavior without any access to internal model parameters. Custom OpenAI Gym environments were implemented to study train-test distributional divergence and policy vulnerability.

Results show that AIRL's state-only reward shaping recovers the ground-truth reward more faithfully, while also revealing important vulnerabilities in the transfer learning setting, motivating the need for certifiably robust IRL algorithms.