Around the Langevin algorithm: extensions and applications

Some preliminaries on Markov chains

  Markov chains are a class of stochastic processes commonly used to model many random systems in signal processing and control theory. A Markov chain is a sequence of random variables (X_k)_{k∈N} defined on a filtered probability space with filtration (F_k)_{k∈N}, such that the law of X_{n+1} conditioned on the filtration F_n is equal to the law of X_{n+1} conditioned on X_n, P-almost surely. Heuristically, a discrete-time stochastic process has the Markov property if the past and the future are independent given the present. For the theoretical aspects of Markov chains, we give three references, each including a very extensive bibliography: [MT09; Dou+18], and [LP17] for discrete-space Markov chains. Markov chains arise in the modelling of various situations such as financial time series and storage or queuing models. They can also be built by the practitioner in order to sample from a target probability distribution; in that case, they are called Markov chain Monte Carlo (MCMC) algorithms and have become increasingly popular over the last three decades.

  In this thesis, we study Markov chains with values in R^d endowed with the Borel sigma-algebra B(R^d). A homogeneous Markov chain (X_k)_{k∈N} is characterized by its kernel R : R^d × B(R^d) → [0, 1], which satisfies: 1) for all x ∈ R^d, R(x, ·) is a probability measure on B(R^d); 2) for all A ∈ B(R^d), x ↦ R(x, A) is a measurable function. For any probability measure µ on B(R^d) and A ∈ B(R^d), we write µR(A) = ∫_{R^d} R(x, A) µ(dx), and for any k ∈ N and x ∈ R^d, R^{k+1}(x, A) = ∫_{R^d} R(x, dy) R^k(y, A). For µ a probability measure on B(R^d) and f a µ-integrable function, µ(f) = ∫_{R^d} f(x) µ(dx). For x ∈ R^d and f integrable with respect to R(x, ·), we write Rf(x) = ∫_{R^d} R(x, dy) f(y). A question of major interest is whether the Markov chain (X_k)_{k∈N} with kernel R has a (unique) invariant probability measure π satisfying πR = π. A related concept, sometimes easier to check in practice, is the notion of reversibility.
A probability measure π is said to be reversible with respect to R if, for all A, B ∈ B(R^d), ∫_A π(dx) R(x, B) = ∫_B π(dx) R(x, A). Reversibility implies invariance. If π is invariant for R, the next step is the analysis of the convergence rate of µ_0 R^k to π as k → +∞, for any initial probability measure µ_0. In particular, MCMC methods consist in building an appropriate Markov chain, preferably easy to simulate, that has a unique invariant probability measure π: the target distribution. The hope is that, for any initial probability measure µ_0 and n ∈ N large enough, µ_0 R^n is approximately equal to π. In that case, (X_k)_{k≥n} are (correlated) samples approximately drawn from π. Let us illustrate these concepts through one of the simplest MCMC algorithms: the Random Walk Metropolis (RWM) algorithm with a Gaussian proposal. Let π be a target probability measure on B(R^d) with a density with respect to the Lebesgue measure, also denoted by π, such that π(x) > 0 for all x ∈ R^d.
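The RWM kernel just mentioned can be written down in a few lines. The following is a minimal sketch, not the thesis's own code: the function name `rwm`, its arguments, and the two-dimensional Gaussian toy target are illustrative assumptions. Because the Gaussian proposal is symmetric, the acceptance ratio reduces to π(y)/π(x), and the resulting kernel is reversible (hence π-invariant).

```python
import numpy as np

def rwm(log_target, x0, n_samples, step=0.5, rng=None):
    """Random Walk Metropolis with a Gaussian proposal: propose
    y = x + step * N(0, I) and accept with probability
    min(1, pi(y) / pi(x)); pi is then reversible for the kernel."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    logp = log_target(x)
    samples = np.empty((n_samples, x.size))
    for k in range(n_samples):
        y = x + step * rng.standard_normal(x.size)
        logp_y = log_target(y)
        # Symmetric proposal: the Metropolis ratio only involves pi.
        if np.log(rng.uniform()) < logp_y - logp:
            x, logp = y, logp_y
        samples[k] = x
    return samples

# Toy target: standard Gaussian on R^2, log pi(x) = -||x||^2 / 2 + const.
samples = rwm(lambda x: -0.5 * np.dot(x, x), x0=np.zeros(2),
              n_samples=20000, step=1.0, rng=0)
print(samples[10000:].mean(axis=0))  # close to (0, 0)
```

Discarding the first half of the chain plays the role of the burn-in n after which µ_0 R^n is approximately π.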

A short presentation of Bayesian statistics

   To give a concrete example of an application of MCMC algorithms, we briefly introduce Bayesian statistics. We refer for example to [Gel+14; Rob07; MR07] for many more resources and references on the subject. Note that the field of applications of MCMC is much wider than Bayesian statistics, but the primary goal of this thesis is to develop and extend MCMC algorithms, for which Bayesian statistics is a sufficiently vast subject to illustrate our results. Let D be some observed data. A common situation is when we observe N ∈ N* pairs (x_i, y_i) for i ∈ {1, . . . , N}, where x_i ∈ R^d are the covariates associated to the observation y_i, which can be continuous or discrete. Here, D = {(x_i, y_i)}_{i=1}^N.
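In this setting, Bayes' rule (standard, and not specific to this thesis) gives the posterior distribution that the MCMC algorithms target; its negative logarithm plays the role of the potential U of the Langevin-type algorithms discussed in the following sections. Here p(θ) denotes a generic prior density and p(y_i | x_i, θ) the likelihood of observation i:

```latex
\pi(\theta \mid \mathcal{D}) \;\propto\; p(\theta) \prod_{i=1}^{N} p(y_i \mid x_i, \theta),
\qquad
U(\theta) \;=\; -\log p(\theta) \;-\; \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta),
```

so that π(θ | D) ∝ exp(−U(θ)). The normalizing constant of this posterior is typically intractable, which is precisely why sampling methods that only require π up to a constant are attractive.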

Extensions of the unadjusted Langevin algorithm

   In Part I of this thesis, two limitations of the ULA algorithm defined in (1.10) are addressed. First, ULA is well defined and feasible only if the potential U is continuously differentiable on R^d: it cannot be directly applied to a distribution π restricted to a compact convex set. However, many statistical inference problems involve estimating parameters subject to constraints on the parameter space. The MYULA algorithm proposed in [DMP18] makes it possible to draw approximate samples from distributions with compact support by appropriately regularizing the potential U. In Chapter 3, we derive precise convergence bounds in total variation norm and Wasserstein distance for this algorithm. Second, when the potential U grows too fast at infinity, i.e. U(x) ≳ ‖x‖^{2+α} as ‖x‖ → +∞ with α > 0, ULA is unstable and may diverge with positive probability. Inspired by recent works on the discretization of SDEs with superlinear drift coefficients [HJK12; Sab13], we propose a new algorithm in Chapter 4, the tamed ULA, and provide convergence guarantees in V-total variation distance and 2-Wasserstein distance.
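The taming idea can be illustrated with a short sketch. This is not the thesis's implementation: the function name `tula`, the step size, and the quartic toy potential are illustrative assumptions, and the norm-based taming shown here is one common choice in the superlinear-drift literature.

```python
import numpy as np

def tula(grad_U, x0, n_iter, gamma=1e-2, rng=None):
    """Sketch of a tamed unadjusted Langevin step: the drift
    gamma * grad_U(x) is replaced by
    gamma * grad_U(x) / (1 + gamma * ||grad_U(x)||),
    which is bounded at every iteration; this prevents the explosion
    ULA can suffer when U grows faster than quadratically."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        g = grad_U(x)
        tamed_drift = gamma * g / (1.0 + gamma * np.linalg.norm(g))
        x = x - tamed_drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# Superlinear toy example: U(x) = ||x||^4 / 4, grad U(x) = ||x||^2 x,
# started away from the bulk of the target mass.
samples = tula(lambda x: np.dot(x, x) * x, x0=np.full(2, 3.0),
               n_iter=20000, rng=0)
```

For small ‖∇U(x)‖ the tamed drift is close to the ULA drift, so the discretization bias is comparable; the correction only bites where the gradient is large.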

Sampling from a distribution with compact support: MYULA

   Consider, in a Bayesian setting, a posterior distribution π with bounded support. Examples include truncated data problems, which arise naturally in failure and survival time studies [KM05], ordinal data models [JA06], constrained Lasso and ridge regressions [Cel+12], Latent Dirichlet Allocation [BNJ03], and non-negative matrix factorization [PBJ14]. Drawing samples from such constrained distributions is a challenging problem that has been investigated in many papers; see [GSL92], [PP14], [LS15], [BEL15], [Hsi+18]. All these works are based on efficient Markov chain Monte Carlo methods to approximate the posterior distribution; however, with the exception of [BEL15] and [Hsi+18], these methods are not theoretically well understood and do not provide any theoretical guarantees on the estimates delivered. A modification of ULA was proposed in [DMP18] to sample from a non-smooth log-concave probability distribution on R^d. This method, named MYULA, is mainly based on a regularized version of the target distribution π that enjoys a number of favourable properties useful for MCMC simulation. In this study, we analyse the complexity of this algorithm when applied to log-concave distributions constrained to a convex set, with a focus on the complexity as the dimension of the state space increases. More precisely, we establish explicit bounds in total variation norm and in Wasserstein distance of order 1 between the iterates of the Markov kernel defined by the algorithm and the target density π.
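The regularization at the heart of MYULA can be sketched as follows, under stated assumptions: for π(x) ∝ exp(−f(x)) restricted to a convex set K, the non-smooth indicator of K is replaced by its Moreau-Yosida envelope, whose gradient is (x − proj_K(x))/λ, so that plain ULA can be run on the smoothed potential. The function names, step size γ, and smoothing parameter λ below are illustrative choices, not the tuned values of Chapter 3.

```python
import numpy as np

def myula(grad_f, proj_K, x0, n_iter, gamma=1e-3, lam=1e-2, rng=None):
    """Sketch of MYULA for pi(x) proportional to exp(-f(x)) on a convex
    set K: the indicator of K is smoothed by its Moreau-Yosida envelope,
    whose gradient (x - proj_K(x)) / lam pulls iterates back toward K."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        drift = gamma * grad_f(x) + (gamma / lam) * (x - proj_K(x))
        x = x - drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# Toy example: standard Gaussian restricted to the unit ball,
# f(x) = ||x||^2 / 2 and proj_K(x) = x / max(1, ||x||).
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
samples = myula(lambda x: x, proj_ball, x0=np.zeros(2),
                n_iter=20000, rng=0)
inside = np.linalg.norm(samples, axis=1) <= 1.0
```

The iterates are not confined to K: they concentrate on K with excursions of order √λ outside it, which is the price of smoothing and is quantified by the distance between π and its regularized version π^λ.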

Sampling from a distribution with compact support: MYULA

   Consider, in a Bayesian framework, a posterior distribution π with bounded support. Examples include truncated data problems, which arise naturally in failure and survival time studies [KM05], ordinal data models [JA06], constrained Lasso and ridge regressions [Cel+12], Latent Dirichlet Allocation [BNJ03], and non-negative matrix factorization [PBJ14]. Drawing samples from such a constrained distribution is a difficult problem that has been the subject of many papers; see [GSL92], [PP14], [LS15], [BEL15], [Hsi+18]. All these works rely on efficient Markov chain Monte Carlo methods to estimate the posterior distribution; however, with the exception of [BEL15] and [Hsi+18], these methods are not theoretically well understood and provide no theoretical guarantees on the estimates delivered. A modification of ULA was proposed in [DMP18] to sample from a non-smooth log-concave probability distribution on R^d. This method, called MYULA, is mainly based on a regularized version of the target distribution π that enjoys a number of favourable properties useful for MCMC simulation. In this study, we analyse the complexity of this algorithm when applied to log-concave distributions constrained to a convex set, with emphasis on the complexity as the dimension of the state space increases. More precisely, we establish explicit bounds in total variation norm and in Wasserstein distance of order 1 between the iterates of the Markov chain and the target density π. Note that the Metropolis-adjusted Langevin algorithm (MALA) is an alternative to ULA for sampling from a distribution with compact support. However, no precise convergence bounds were available for MALA.
Following this work, [Dwi+18] provided precise convergence bounds for MALA, and a new analysis of the problem of sampling from a compactly supported distribution from this perspective would be an interesting line of research.
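For reference, the MALA scheme mentioned above uses a ULA move as a proposal and corrects it with a Metropolis-Hastings step, so the chain is exactly π-reversible. The following is a minimal sketch on a smooth unconstrained toy target; the function names and the Gaussian example are illustrative assumptions.

```python
import numpy as np

def mala(U, grad_U, x0, n_iter, gamma=1e-2, rng=None):
    """Metropolis-adjusted Langevin: propose a ULA step
    y = x - gamma * grad_U(x) + sqrt(2 * gamma) * Z and accept it with
    the Metropolis-Hastings ratio, which involves the (asymmetric)
    Gaussian proposal density in both directions."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)

    def log_q(y, x):  # log density (up to a constant) of proposing y from x
        return -np.sum((y - x + gamma * grad_U(x)) ** 2) / (4.0 * gamma)

    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        y = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        log_ratio = U(x) - U(y) + log_q(x, y) - log_q(y, x)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        samples[k] = x
    return samples

# Toy target: standard Gaussian, U(x) = ||x||^2 / 2, grad U(x) = x.
samples = mala(lambda x: 0.5 * np.dot(x, x), lambda x: x,
               x0=np.zeros(2), n_iter=20000, gamma=0.5, rng=0)
print(samples[10000:].mean(axis=0))  # close to (0, 0)
```

Unlike ULA, the accept/reject step removes the discretization bias, at the cost of possibly rejected moves; this is the trade-off analysed in [Dwi+18].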


Table of contents

1 Introduction 
1.1 Some preliminaries on Markov chains 
1.2 A short presentation of Bayesian statistics
1.3 The unadjusted Langevin algorithm and avatars 
1.4 Extensions of the unadjusted Langevin algorithm
1.5 Applications of the unadjusted Langevin algorithm
1.6 Stochastic Gradient Langevin Dynamics
2 Summary of the thesis 
2.1 Extensions of the unadjusted Langevin algorithm
2.2 Applications of the unadjusted Langevin algorithm
2.3 Stochastic Gradient Langevin Dynamics
I Extensions of the unadjusted Langevin algorithm 
3 Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo 
3.1 Introduction 
3.2 The Moreau-Yosida Unadjusted Langevin Algorithm (MYULA) 
3.3 Distance between π and πλ
3.4 Convergence analysis of MYULA 
3.5 Numerical experiments
3.6 Proofs
4 The Tamed Unadjusted Langevin Algorithm 
4.1 Introduction
4.2 Ergodicity and convergence analysis 
4.3 Numerical examples
4.4 Proofs
4.A Proof of Lemma 4.9
4.B Proof of Lemma 4.11
4.C Proof of Lemma 4.12
4.D Proof of Lemma 4.17
4.E Badly conditioned multivariate Gaussian variable
II Applications of the unadjusted Langevin algorithm 
5 Normalizing constants of log-concave densities 
5.1 Introduction
5.2 Theoretical analysis of the algorithm
5.3 Numerical experiments 
5.4 Mean squared error for locally Lipschitz functions 
5.5 Proofs
5.A Additional proofs of Section 5.2.1
5.B Additional proofs of Section 5.2.2
6 Diffusion approximations and control variates for MCMC 
6.1 Introduction
6.2 Langevin-based control variates for MCMC methods 
6.3 Asymptotic expansion for the asymptotic variance of MCMC algorithms
6.4 Numerical experiments 
6.5 The RWM and MALA algorithms
6.6 Proofs 
6.A Strong Law of Large Numbers and Central Limit Theorem for the control variates estimator
6.B Law of Large Numbers and Central Limit Theorem for a step size γn function of the number of samples n
6.C Additional proofs
6.D Numerical experiments – additional results
6.E Additional proofs on the diffusion approximation of RWM
III Stochastic Gradient Langevin Dynamics 
7 The promises and pitfalls of Stochastic Gradient Langevin Dynamics 
7.1 Introduction
7.2 Preliminaries
7.3 Results 
7.4 Numerical experiments 
7.A Proofs of Section 7.3.1
7.B Proofs of Section 7.3.2
7.C Means and covariance matrices of πLMC, πFP, πSGLD and πSGD in the Bayesian linear regression
Bibliography
List of Figures
List of Tables
