Generating Malicious Demonstration Policies to Exploit Vulnerabilities in IRL

Canadian AI 2025 · CPI Spotlight · Python · PyTorch · OpenAI Gym

Figure 1: The attack framework — a malicious policy injects poisoned demonstrations into the expert buffer before MaxEnt IRL training.

Description

  • Arezoo Alipanah & Prof. Yash Vardhan Pant

  • Canadian Conference on Artificial Intelligence (Canadian AI 2025)

  • 2025

  • Funded by CPI, UWaterloo — Spotlight Presentation

Inverse Reinforcement Learning (IRL) algorithms learn reward functions from expert demonstrations, but their robustness to adversarial inputs has not been well studied. In this work, we investigate how an attacker can craft malicious demonstration policies that exploit vulnerabilities in IRL agents, causing them to learn unsafe or incorrect reward functions.
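The mechanism that makes this possible is the IRL gradient itself: MaxEnt IRL (the learner in Figure 1) updates its reward so that the learner's expected feature counts match those of the demonstrations, so any trajectory injected into the expert buffer moves the learned reward directly. The sketch below is a minimal tabular illustration of that mechanism, assuming a linear reward on a toy random MDP; the names (`soft_policy`, `visitation`, `maxent_irl`) and the example MDP are illustrative, not from the paper.

```python
import numpy as np

# Minimal tabular MaxEnt IRL sketch, assuming a linear reward
# r(s) = theta @ phi[s] on a small random MDP. The toy MDP and all
# names below are illustrative, not taken from the paper.

rng = np.random.default_rng(0)
nS, nA, T = 5, 2, 10
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s'] transitions
phi = np.eye(nS)                                  # one-hot state features

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(x - m).sum(axis=axis))

def soft_policy(theta):
    """T steps of soft value iteration; returns a stationary Boltzmann policy."""
    r, V = phi @ theta, np.zeros(nS)
    for _ in range(T):
        Q = r[:, None] + P @ V                    # Q[s, a]
        V = logsumexp(Q, axis=1)                  # soft Bellman backup
    return np.exp(Q - V[:, None])                 # pi[s, a]

def visitation(pi, s0=0):
    """Expected state visitation frequencies over horizon T."""
    d = np.zeros(nS); d[s0] = 1.0
    mu = np.zeros(nS)
    for _ in range(T):
        mu += d
        d = np.einsum('s,sa,sax->x', d, pi, P)    # propagate one step
    return mu / T

def maxent_irl(demos, iters=200, lr=0.1):
    """Ascend the MaxEnt gradient: expert feature counts minus policy counts."""
    mu_expert = np.mean([phi[t].mean(axis=0) for t in demos], axis=0)
    theta = np.zeros(phi.shape[1])
    for _ in range(iters):
        mu_policy = visitation(soft_policy(theta)) @ phi
        theta += lr * (mu_expert - mu_policy)
    return theta

# Poisoned demonstrations shift mu_expert, and with it the learned reward:
clean = [[0, 1, 2, 3, 4]]
poisoned = clean + [[4, 4, 4, 4, 4]]              # attacker over-visits state 4
print(maxent_irl(clean))
print(maxent_irl(poisoned))
```

In this toy run, mixing in the single poisoned trajectory shifts the expert feature counts toward the attacker's preferred state, and the learned reward weight follows, which is exactly the lever the attack exploits.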

We develop a black-box adversarial attack framework that generates poisoned demonstrations to maximally destabilize the reward-learning process, and we implement custom OpenAI Gym environments to study train-test divergence and policy vulnerability.
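In the black-box setting, the attacker cannot inspect the learner's gradients; it can only roll out candidate demonstration policies and score how much the mixed expert buffer drifts. The sketch below shows that loop under stated assumptions: a toy chain environment written against the classic OpenAI Gym step API, random search over candidate policies, and feature-count shift as a cheap stand-in for the paper's destabilization objective. The environment, the search procedure, and all names here are illustrative, not the paper's implementation.

```python
import numpy as np
import gym
from gym import spaces

class ChainEnv(gym.Env):
    """Toy 5-state chain (classic Gym API): action 0 moves left, 1 moves right."""
    def __init__(self, n=5, horizon=10):
        self.n, self.horizon = n, horizon
        self.observation_space = spaces.Discrete(n)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, action):
        self.s = min(self.s + 1, self.n - 1) if action == 1 else max(self.s - 1, 0)
        self.t += 1
        reward = float(self.s == self.n - 1)       # goal at the right end
        return self.s, reward, self.t >= self.horizon, {}

def rollout(env, policy):
    """Collect one trajectory of visited states under a stochastic policy pi[s, a]."""
    s, done, traj = env.reset(), False, []
    while not done:
        a = np.random.choice(2, p=policy[s])
        s, _, done, _ = env.step(a)
        traj.append(s)
    return traj

def state_counts(trajs, n):
    """Empirical state-visitation frequencies over a set of trajectories."""
    mu = np.zeros(n)
    for traj in trajs:
        for s in traj:
            mu[s] += 1
    return mu / max(sum(len(t) for t in trajs), 1)

env = ChainEnv()
expert = np.tile([0.1, 0.9], (env.n, 1))           # expert mostly heads for the goal
clean = [rollout(env, expert) for _ in range(20)]
mu_clean = state_counts(clean, env.n)

# Black-box random search: score each candidate malicious policy by how far
# the poisoned buffer's feature counts drift from the clean ones (a proxy
# for the paper's destabilization objective).
best_score, best_policy = -np.inf, None
for _ in range(50):
    candidate = np.random.dirichlet(np.ones(2), size=env.n)
    poison = [rollout(env, candidate) for _ in range(5)]
    mixed = state_counts(clean + poison, env.n)
    score = np.abs(mixed - mu_clean).sum()
    if score > best_score:
        best_score, best_policy = score, candidate

print("proxy destabilization score of best malicious policy:", round(best_score, 3))
```

Here `best_policy` plays the role of the malicious demonstration policy from Figure 1: the attacker deploys it to generate the trajectories injected into the expert buffer before IRL training.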

This work was funded by the Cybersecurity and Privacy Institute (CPI) at UWaterloo and selected for a spotlight presentation at the annual CPI Symposium.