Adversarial Attacks in Imitation Learning (AIRL): Black-Box Attack Framework

Tags: UWaterloo · AI Safety · Python · PyTorch · OpenAI Gym
Figure: Learned reward heatmaps and transfer learning curves on tabular MDPs under adversarial conditions.

Description

  • Arezoo Alipanah

  • University of Waterloo

  • 2025

This project extends my research on IRL robustness to the Adversarial Inverse Reinforcement Learning (AIRL) setting. AIRL disentangles the reward function from the environment dynamics, which makes it particularly powerful for transfer learning and safe RL.
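The disentanglement can be sketched on a tabular MDP, the same setting as the figure above. This is a minimal illustration under assumed names (the class, fields, and method signatures are hypothetical, not the project's actual code): AIRL learns a state-only reward g alongside a shaping potential h, and the potential term absorbs dynamics-dependent structure so that g transfers across environments.

```python
import math


class TabularAIRLReward:
    """Minimal sketch of AIRL's disentangled reward on a tabular MDP
    (illustrative names only, not the project's implementation).

    f(s, s') = g[s] + gamma * h[s'] - h[s]

    g is the state-only reward estimate; h is a potential-based shaping
    term. Because potential-based shaping does not change the optimal
    policy, g can be recovered independently of the dynamics.
    """

    def __init__(self, n_states, gamma=0.99):
        self.gamma = gamma
        self.g = [0.0] * n_states  # state-only reward parameters
        self.h = [0.0] * n_states  # shaping-potential parameters

    def f(self, s, s_next):
        # Disentangled reward plus potential-based shaping term
        return self.g[s] + self.gamma * self.h[s_next] - self.h[s]

    def discriminator(self, s, s_next, log_pi):
        # D(s, a, s') = exp(f) / (exp(f) + pi(a|s)) = sigmoid(f - log pi(a|s))
        x = self.f(s, s_next) - log_pi
        return 1.0 / (1.0 + math.exp(-x))
```

The discriminator form is the standard AIRL objective: when training converges, f approaches the log-probability ratio of expert versus policy transitions, and the state-only g is the portable reward estimate.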

We developed a black-box adversarial attack framework targeting AIRL agents, designed to induce unsafe behavior without any access to internal model parameters. Custom OpenAI Gym environments were implemented to study train-test distributional divergence and policy vulnerability.

Results show that AIRL's state-only reward shaping recovers the ground-truth reward more faithfully, while also revealing important vulnerabilities in the transfer learning setting, motivating the need for certifiably robust IRL algorithms.