Abstract
We propose to estimate the KL divergence using a relaxed likelihood ratio estimation in a Reproducing Kernel Hilbert space. We show that, in the particular case of Mutual Information (MI) estimation, the dual of our ratio estimator for KL corresponds to a lower bound on the MI that is related to the so-called Donsker-Varadhan lower bound. In this dual form, MI is estimated by learning a witness function that discriminates between the joint density and the product of the marginals, together with an auxiliary scalar variable that enforces a normalization constraint on the likelihood ratio. By extending the function space to neural networks, we propose an efficient neural MI estimator and validate its performance on synthetic examples, showing an advantage over existing baselines. We demonstrate its strength in large-scale self-supervised representation learning through MI maximization.
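To make the dual form concrete, the sketch below shows one standard Donsker-Varadhan-related objective of the kind the abstract describes: a neural witness function T(x, y) trained jointly with an auxiliary scalar that normalizes the likelihood ratio. It is a minimal illustration only, not the paper's exact estimator; the bound form (a TUBA-style objective that recovers Donsker-Varadhan when the scalar is optimized out), the network `WitnessNet`, the helper `mi_lower_bound`, the parameter `log_eta`, and the toy data are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn

class WitnessNet(nn.Module):
    """Small MLP witness function T(x, y); architecture is illustrative only."""
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mi_lower_bound(T, x, y, log_eta):
    """Assumed DV-related lower bound with an auxiliary scalar eta that
    enforces normalization of the likelihood ratio:
        I(X;Y) >= E_p(x,y)[T] - (1/eta) E_p(x)p(y)[exp(T)] - log(eta) + 1,
    which reduces to the Donsker-Varadhan bound at the optimal eta.
    Samples from the product of marginals are approximated by shuffling y.
    """
    t_joint = T(x, y).mean()
    y_shuffled = y[torch.randperm(y.shape[0])]
    exp_t_marg = torch.exp(T(x, y_shuffled)).mean()
    return t_joint - torch.exp(-log_eta) * exp_t_marg - log_eta + 1.0

# Single training step: maximize the bound over the witness and the scalar.
dim_x = dim_y = 16                                  # hypothetical dimensions
T = WitnessNet(dim_x, dim_y)
log_eta = torch.zeros((), requires_grad=True)       # auxiliary scalar (log-parameterized)
opt = torch.optim.Adam(list(T.parameters()) + [log_eta], lr=1e-4)

x = torch.randn(256, dim_x)                         # placeholder batch from p(x, y)
y = x + 0.1 * torch.randn(256, dim_y)

opt.zero_grad()
loss = -mi_lower_bound(T, x, y, log_eta)            # minimize the negative bound
loss.backward()
opt.step()
```

In this sketch, maximizing the objective over both the witness network and the scalar plays the role of the normalization constraint mentioned in the abstract; the exact parameterization and optimization scheme used in the paper may differ.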