Generating Malicious Demonstration Policies to Exploit Vulnerabilities in IRL

Canadian AI 2025 · CPI Spotlight · Python · PyTorch · OpenAI Gym

Figure 1: The attack framework — a malicious policy injects poisoned demonstrations into the expert buffer before MaxEnt IRL training.

Description

  • Arezoo Alipanah & Prof. Yash Vardhan Pant

  • Canadian Conference on Artificial Intelligence (Canadian AI 2025)

  • 2025

  • Funded by CPI, UWaterloo — Spotlight Presentation

Inverse Reinforcement Learning (IRL) algorithms learn reward functions from expert demonstrations, but their robustness to adversarial inputs has not been well studied. In this work, we investigate how an attacker can craft malicious demonstration policies that exploit vulnerabilities in IRL agents, causing them to learn unsafe or incorrect reward functions.
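The mechanism that makes this possible is the IRL gradient itself: MaxEnt IRL (the learner in Figure 1) updates its reward so that the learner's expected feature counts match those of the demonstrations, so any trajectory injected into the expert buffer moves the learned reward directly. The sketch below is a minimal tabular illustration of that mechanism, assuming a linear reward on a toy random MDP; the names (`soft_policy`, `visitation`, `maxent_irl`) and the example MDP are illustrative, not from the paper.

```python
import numpy as np

# Minimal tabular MaxEnt IRL sketch, assuming a linear reward
# r(s) = theta @ phi[s] on a small random MDP. The toy MDP and all
# names below are illustrative, not taken from the paper.

rng = np.random.default_rng(0)
nS, nA, T = 5, 2, 10
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s'] transitions
phi = np.eye(nS)                                  # one-hot state features

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(x - m).sum(axis=axis))

def soft_policy(theta):
    """T steps of soft value iteration; returns a stationary Boltzmann policy."""
    r, V = phi @ theta, np.zeros(nS)
    for _ in range(T):
        Q = r[:, None] + P @ V                    # Q[s, a]
        V = logsumexp(Q, axis=1)                  # soft Bellman backup
    return np.exp(Q - V[:, None])                 # pi[s, a]

def visitation(pi, s0=0):
    """Expected state visitation frequencies over horizon T."""
    d = np.zeros(nS); d[s0] = 1.0
    mu = np.zeros(nS)
    for _ in range(T):
        mu += d
        d = np.einsum('s,sa,sax->x', d, pi, P)    # propagate one step
    return mu / T

def maxent_irl(demos, iters=200, lr=0.1):
    """Ascend the MaxEnt gradient: expert feature counts minus policy counts."""
    mu_expert = np.mean([phi[t].mean(axis=0) for t in demos], axis=0)
    theta = np.zeros(phi.shape[1])
    for _ in range(iters):
        mu_policy = visitation(soft_policy(theta)) @ phi
        theta += lr * (mu_expert - mu_policy)
    return theta

# Poisoned demonstrations shift mu_expert, and with it the learned reward:
clean = [[0, 1, 2, 3, 4]]
poisoned = clean + [[4, 4, 4, 4, 4]]              # attacker over-visits state 4
print(maxent_irl(clean))
print(maxent_irl(poisoned))
```

In this toy run, mixing in the single poisoned trajectory shifts the expert feature counts toward the attacker's preferred state, and the learned reward weight follows, which is exactly the lever the attack exploits.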

We develop a black-box adversarial attack framework that generates poisoned demonstrations to maximally destabilize the reward-learning process, and we implement custom OpenAI Gym environments to study train-test divergence and policy vulnerability.
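In the black-box setting, the attacker cannot inspect the learner's gradients; it can only roll out candidate demonstration policies and score how much the mixed expert buffer drifts. The sketch below shows that loop under stated assumptions: a toy chain environment written against the classic OpenAI Gym step API, random search over candidate policies, and feature-count shift as a cheap stand-in for the paper's destabilization objective. The environment, the search procedure, and all names here are illustrative, not the paper's implementation.

```python
import numpy as np
import gym
from gym import spaces

class ChainEnv(gym.Env):
    """Toy 5-state chain (classic Gym API): action 0 moves left, 1 moves right."""
    def __init__(self, n=5, horizon=10):
        self.n, self.horizon = n, horizon
        self.observation_space = spaces.Discrete(n)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, action):
        self.s = min(self.s + 1, self.n - 1) if action == 1 else max(self.s - 1, 0)
        self.t += 1
        reward = float(self.s == self.n - 1)       # goal at the right end
        return self.s, reward, self.t >= self.horizon, {}

def rollout(env, policy):
    """Collect one trajectory of visited states under a stochastic policy pi[s, a]."""
    s, done, traj = env.reset(), False, []
    while not done:
        a = np.random.choice(2, p=policy[s])
        s, _, done, _ = env.step(a)
        traj.append(s)
    return traj

def state_counts(trajs, n):
    """Empirical state-visitation frequencies over a set of trajectories."""
    mu = np.zeros(n)
    for traj in trajs:
        for s in traj:
            mu[s] += 1
    return mu / max(sum(len(t) for t in trajs), 1)

env = ChainEnv()
expert = np.tile([0.1, 0.9], (env.n, 1))           # expert mostly heads for the goal
clean = [rollout(env, expert) for _ in range(20)]
mu_clean = state_counts(clean, env.n)

# Black-box random search: score each candidate malicious policy by how far
# the poisoned buffer's feature counts drift from the clean ones (a proxy
# for the paper's destabilization objective).
best_score, best_policy = -np.inf, None
for _ in range(50):
    candidate = np.random.dirichlet(np.ones(2), size=env.n)
    poison = [rollout(env, candidate) for _ in range(5)]
    mixed = state_counts(clean + poison, env.n)
    score = np.abs(mixed - mu_clean).sum()
    if score > best_score:
        best_score, best_policy = score, candidate

print("proxy destabilization score of best malicious policy:", round(best_score, 3))
```

Here `best_policy` plays the role of the malicious demonstration policy from Figure 1: the attacker deploys it to generate the trajectories injected into the expert buffer before IRL training.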

This work was funded by the Cybersecurity and Privacy Institute (CPI) at UWaterloo and selected for a spotlight presentation at the annual CPI Symposium.