One-Pass Bandit Learning for RLHF and Function Approximation

Release Time: 2026-06-17

Speaker: Peng Zhao

Time: 15:00, June 18

Location: SIST 2-215

Host: Prof. Xin Liu

Abstract:

Bandit models are a key framework for designing algorithms for interactive decision-making. Stochastic linear bandits are well studied, but real-world complexities have motivated important extensions such as generalized linear bandits (GLB), which use nonlinear link functions, and heavy-tailed linear bandits (HvLB), which handle heavy-tailed noise. Although optimal regret bounds have been established for these settings, existing algorithms are computationally impractical: they require storing all data and making repeated passes over the full history. In this talk, I will introduce a one-pass method based on the Online Mirror Descent framework. Online Mirror Descent is a textbook-standard approach for regret minimization; here it is used instead as a statistical estimator. This approach achieves O(1) per-round computational cost while preserving optimal regret for GLB and HvLB. I will then discuss extensions to online RL theory: (i) RL with multinomial logit function approximation, and (ii) RLHF with on-policy active data collection.
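As a rough illustration of what "one-pass" means in this setting, the sketch below updates a logistic (generalized linear) parameter estimate from each observation exactly once and keeps no history. It is not the algorithm presented in the talk. It is a generic Newton-style online update, which can be read as a mirror descent step under a data-dependent quadratic regularizer, and it omits the projection and confidence-set construction that a bandit algorithm would need. The names `OnePassLogisticEstimator` and `simulate`, the step size, and the regularization value are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic link function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


class OnePassLogisticEstimator:
    """Hypothetical one-pass estimator: the state is (theta, h_inv) only."""

    def __init__(self, dim, reg=1.0, step=1.0):
        self.dim = dim
        self.step = step  # assumed step size, not taken from the talk
        self.theta = [0.0] * dim
        # Inverse of the curvature matrix H, initialised to (reg * I)^{-1}.
        self.h_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]

    def update(self, x, y):
        """Consume one (feature, binary reward) pair; nothing is stored."""
        d = self.dim
        p = sigmoid(dot(self.theta, x))
        w = max(p * (1.0 - p), 1e-6)  # local curvature of the logistic loss

        # Sherman-Morrison rank-one update of (H + w * x x^T)^{-1}.
        hx = [dot(row, x) for row in self.h_inv]
        denom = 1.0 + w * dot(x, hx)
        for i in range(d):
            for j in range(d):
                self.h_inv[i][j] -= w * hx[i] * hx[j] / denom

        # Preconditioned gradient step; the loss gradient is (p - y) * x.
        hx = [dot(row, x) for row in self.h_inv]
        for i in range(d):
            self.theta[i] -= self.step * (p - y) * hx[i]


def simulate(theta_star, rounds=5000, seed=0):
    """Stream synthetic Bernoulli rewards through the estimator once."""
    rng = random.Random(seed)
    est = OnePassLogisticEstimator(len(theta_star))
    for _ in range(rounds):
        x = [rng.uniform(-1.0, 1.0) for _ in theta_star]
        y = 1.0 if rng.random() < sigmoid(dot(theta_star, x)) else 0.0
        est.update(x, y)
    return est.theta
```

Calling `simulate([1.5, -1.0])` streams 5000 synthetic rounds through the estimator. Each update costs a fixed amount of work in the dimension (quadratic in `dim`) and does not grow with the number of rounds, which is the sense in which per-round cost is constant.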

Bio:

Peng Zhao is currently an Associate Professor and PhD supervisor at the School of Artificial Intelligence, Nanjing University. He is also a member of the Learning and Mining from Data Group (LAMDA). His research focuses on the theoretical foundations of machine learning, including online learning, stochastic optimization, and reinforcement learning theory. He has published more than 60 academic papers in leading journals and conferences, including JMLR, COLT, ICML, and NeurIPS. He serves as an Action Editor for the journal Machine Learning, a Young Editorial Board Member of Frontiers of Computer Science, and an Area Chair for conferences such as ICML and NeurIPS. He was selected for the CCF Outstanding Doctoral Dissertation Award Program and has received honors including the Nanjing University Xiaomi Young Scholar Award for Scientific and Technological Innovation and the Baidu Scholarship.
