About this item:

Author Notes:

hangwu@gatech.edu

We thank Dr. Adith Swaminathan for answering questions regarding their paper’s experimental procedure.

Subjects:

Research Funding:

This work has been supported in part by grants from the National Science Foundation NSF1651360, National Institutes of Health (NIH) UL1TR000454 and NIH R01CA163256, CDC HHSD2002015F62550B, the Children’s Healthcare of Atlanta, and Microsoft Research.

Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization

Journal Title:

Proc Mach Learn Res

Publisher:

Type of Work:

Article | Post-print: After Peer Review

Abstract:

Off-policy learning, the task of evaluating and improving policies using historical data collected from a logging policy, is important because on-policy evaluation is usually expensive and can have adverse impacts. One of the major challenges of off-policy learning is to derive counterfactual estimators that also have low variance and thus low generalization error. In this work, inspired by learning bounds for importance sampling problems, we present a new counterfactual learning principle for off-policy learning with bandit feedback. Our method regularizes the generalization error by minimizing the distribution divergence between the logging policy and the new policy, and removes the need to iterate through all training samples to compute the sample-variance regularizer used in prior work. With neural network policies, our end-to-end training algorithms using variational divergence minimization showed significant improvements over conventional baseline algorithms and are consistent with our theoretical results.
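
As an illustration of the kind of objective the abstract describes, the following is a hedged sketch in notation of our own choosing (the symbols \(\pi_0\), \(\pi_w\), \(\delta_i\), \(\lambda\), the direction of the divergence, and the specific f-divergence are illustrative assumptions, not taken from the paper). A divergence-regularized counterfactual risk can be written as

\[
\hat{R}(\pi_w) \;=\; \frac{1}{n}\sum_{i=1}^{n} \delta_i \,\frac{\pi_w(a_i \mid x_i)}{\pi_0(a_i \mid x_i)},
\qquad
\min_{w}\;\; \hat{R}(\pi_w) \;+\; \lambda\, D_f\!\left(\pi_w \,\middle\|\, \pi_0\right),
\]

where \(\pi_0\) is the logging policy, \(\pi_w\) the parameterized new policy, \(\delta_i\) the logged losses, and \(\lambda\) a regularization weight. The divergence term can be estimated from samples with the standard variational lower bound of an f-divergence,

\[
D_f\!\left(P \,\middle\|\, Q\right) \;\ge\; \sup_{T}\; \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}\big(T(x)\big)\big],
\]

maximized over a neural critic \(T\) jointly with minimizing the policy objective. Under these assumptions, the regularizer can be optimized stochastically end to end, in contrast to a sample-variance penalty that requires iterating over all training samples.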

Copyright information:

This is an Open Access work distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).