# Bayesian nonparametrics and the probabilistic approach to modelling

←

**Page content transcription**

If your browser does not render page correctly, please read the page content below

Bayesian nonparametrics and the probabilistic approach to modelling Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://mlg.eng.cam.ac.uk/zoubin Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman’s coalescent, Dirichlet diffusion trees, and Wishart processes. Key words: probabilistic modelling; Bayesian statistics; nonparametrics; machine learning. 1. Introduction Modelling is central to the sciences. Models allow one to make predictions, to understand phenomena, and to quantify, compare and falsify hypotheses. Modelling is also at the core of intelligence. Both artificial and biological systems that exhibit intelligence must be able to make predictions, anticipate outcomes of their actions, and update their ability to make predictions in light of new data. It is hard to imagine how a system could do this without building models of the environment that the system interacts with. It is thus not surprising that many theories in cognitive science are based around the idea of building internal models (Wolpert et al., 1995; Knill and Richards, 1996; Griffiths and Tenenbaum, 2006). A model is simply a compact representation of possible data one could observe. As such it may be more interpretable than the observed data itself, providing a useful representation of data. A model must be able to make forecasts of possible future data, otherwise it seems impossible to falsify a model in light of new data. Phil. Trans. R. Soc. A 1–27; doi: 10.1098/rspa.00000000 This journal is c 2011 The Royal Society

2 Bayesian nonparametric modelling I will use the term forecast in a very general way to refer to the process of making any claims about unobserved data on the basis of observed data; I will also use predict interchangeably with forecast. For all nontrivial phenomena, forecasts have to include some representation of the forecasting uncertainty. Deterministic forecasts (e.g. tomorrow’s high temperature will be 17 degrees C) are too brittle and therefore easy to falsify. Ideally, a model should also be adaptive. By adaptive I mean that the forecasts of the model should change depending on the data observed so far. Such adaptation, or learning, should hopefully have the effect of making the model’s forecasts be better aligned with actual data. For example, if the forecasts are in the form of probability distributions, adaptation should have the effect that the forecast probability assigned to what actually happens should increase after the model has seen more data, although this cannot be generally guaranteed. It is clear that all forecasts need to represent uncertainty. Observable data is generally corrupted by noise in the measurement process, and this noise needs to be incorporated in the forecast uncertainty. But uncertainty lurks at many other levels in any model. The amount of noise in the measurement process may itself be unknown. The model may have a number of parameters which are unknown. The structure of the model itself may be uncertain, or there may be multiple plausible competing models for the data. The forecasting system should ideally produce forecasts that incorporate all reasonable sources of uncertainty. As we will see, the Bayesian framework provides a natural and coherent approach for representing and manipulating all forms of uncertainty in modelling. 2. The Bayesian framework The fundamental idea in Bayesian modelling is to use the mathematics of probability theory to represent and manipulate all forms of uncertainty in the model. This is a surprisingly simple yet powerful idea. The good news is that there are only two rules of probability theory one needs to remember, the sum rule and the product rule.1 Consider a pair of random variables x and y taking on values in some spaces X and Y respectively. The sum rule states that if I know the joint probability of two random variables, x and y, I can obtain the marginal probability of x by summing over all possible values of y, X P (x) = P (x, y). y∈Y If y is continuous we simply replace the sum with an integral. If I have a model for the joint probability distribution of the high temperature in London and Cambridge I can obtain the marginal distribution for the high temperature in Cambridge summing out London’s temperature. The product rule states that the joint probability of x and y can be decomposed into the product of the marginal probability of x and the conditional 1 With apologies to probabilists and statisticians my presentation and notation will eschew formality in favour of didactic clarity.

Ghahramani 3 probability of y given x (or the other way around): P (x, y) = P (x)P (y|x) = P (y)P (x|y). Combining the sum and product rule and rearranging a bit we obtain Bayes rule as a corollary: P (x|y)P (y) P (x|y)P (y) P (y|x) = = X P (x) P (x|y 0 )P (y 0 ) y 0 ∈Y Let us now use these equations in a canonical modelling context. We will use D to represent the observable data. Often we talk about the data being a set consisting of a number of “data points” or measurements, D = {x1 , . . . , xN } but this is in no way a necessary assumption. For example, the data could be a single image, an observed graph, or an ordered sequence of measurements rather than a set. Our model will be indexed by m and we may want to consider multiple alternative models, e.g. m, m0 etc. Each model usually has a number of free parameters, which we will denote with θ which can be a vector if needed. First we need to make sure that the model m is well defined, in the sense that it can predict or forecast data. As previously discussed we use probability theory to represent the model forecasts. For any given setting of the model parameters, the model must be able to produce a forecast of the form P (D|θ, m). This probability of the data as a function of the parameters is the likelihood of the parameters. With the likelihood, for any given setting of the parameters we are in the position of making forecasts. However, the model m is not fully defined until we specify a “range” for the parameter values, θ. A linear regression model which states that the slope can take values between −1 and +1 is a very different model from one which specifies that the slope can take values between −100 and +100. In fact, to fully specify a model, we need to do a bit more than specify the range of parameters, we need to define a distribution over this range. Only then will our model m be able to make forecasts. We see this by writing the forecast probability of the model using the sum and product rules: Z Z sum prod P (D|m) = P (D, θ|m)dθ = P (D|θ, m)P (θ|m)dθ (2.1) The expression P (D|m) is variously called the marginal likelihood, model evidence, or integrated likelihood. The prior over the parameters, P (θ|m), plays the role of specifying the range as described above (for example it could be uniform on [−1, +1]) in the form of a distribution over the allowable values of the parameters. Without a prior our model is not well-defined: we cannot generate or forecast data until we know how to choose values for θ.2 Once the prior and likelihood are defined, and only then, is the model m fully specified in the sense that it can generate possible data sets. People often object to Bayesian methods on the basis that it forces one to define priors on the parameters. This, in my opinion, is completely misguided. 2 Of course, our model may have a fixed value of θ, e.g. 7.213, which corresponds to a delta function prior.

4 Bayesian nonparametric modelling All models make assumptions; without assumptions it is impossible to make any forecasts or predictions from observed data. The first stage of the Bayesian modelling framework is to explicitly state all assumptions using the language of probability theory. Specifying both the prior and likelihood is a necessary requirement so that the model is well defined. In fact, the distinction between prior and likelihood is arbitrary3 ; both are essential parts of the model. People also object to the use of priors on the parameters on the grounds that they don’t think of the parameters as being “random” variables. For example, if one is estimating the mass of a planet from astronomical data, the mass of the planet is not “random” in the colloquial sense of the word.4 This is a misunderstanding of the semantics of probabilities in Bayesian modelling. Probabilities are used to represent our uncertainty about unknown quantities. They are just as good at modelling uncertainty for repeatable experiments such as the outcome of a roll of a die, as they are for modelling uncertainty in the mass of a planet. Incidentally, both forms of uncertainty are fundamentally subjective; one’s uncertainty about a roll of a die depends on one’s knowledge of the exact initial conditions of the roll, in just the same way as the uncertainty about the mass of the planet depends on knowledge of the observational data on the planet’s orbit. Finally, people trained in the sciences are uncomfortable with the very notion of subjectivity in data analysis and modelling. This again is deeply misguided: all models involve assumptions, and all conclusions drawn from data analysis are conditional on assumptions. The probabilistic framework forces the scientist to be completely transparent about the assumptions, by expressing all assumptions as distributions on unknown quantities. These assumptions can then be easily contested, and the same data can be reanalysed under different modelling assumptions (and priors). The fact that conclusions could change depending on the assumptions is essential to good scientific practice. Fortunately, given enough data, the effect of the prior is generally overcome by the likelihood and posterior conclusions will converge (Doob, 1949; Le Cam, 1986; Van der Vaart, 2000). This is directly analogous to the progress of science, where of many possible hypotheses only the ones consistent with the data will survive. Bayesian modelling is subjective but not arbitrary: given a full specification of the model, and the data, there is only one way to reason about any quantity of interest. Modelling thus becomes a very simple procedure: • write down your assumptions (possible models, parameters, noise processes, etc), representing all forms of uncertainty using the language of probability theory 3 Consider for example a model which defines the probability of the data (i.e. the likelihood) using a student t-distribution. For a single data point, this can be equivalently expressed as a model with a Gaussian likelihood and a Gamma prior on the precision of that Gaussian. 4 I don’t like the term ’random variable’ because it suggest that there is some true source of randomness driving some process. To give another example, people would naturally and understandably flinch at the idea of calling something like the millionth digit of π or the number of people alive today a random variable. The notion that Bayesians consider true parameters to be “random variables” makes the framework less intuitive that it actually is. It is much more natural to use the term uncertain. Clearly I can be uncertain about the million’th digit of π even though its value is not random at all. My uncertainty about this digit will change after I look it up on the web, but presumably will not go to zero as there is always the chance of someone having made a mistake somewhere.

Ghahramani 5 • given the data, use probability theory to make inferences about any unknown quantities in your model, or to make predictions from the model. This process lends itself very naturally to a sequential processing of data. Your posterior after observing some data Dold , P (θ|Dold , m) is your prior before observing new data, Dnew : P (θ|Dnew , Dold , m) ∝ P (Dnew |Dold , θ, m)P (θ|Dold , m). The sum and product rule also tell us how to make predictions from a model. Consider predicting some unknown quantity x (e.g. the next data point) given observed data D and model m: Z P (x|D, m) = P (x|θ, D, m)P (θ|D, m)dθ. (2.2) This is a very intuitively satisfying expression which tells us that our predictions or forecasts are a weighted average of the forecasts from different parameter values, weighted by the posterior probability of each parameter value given the data observed so far. For parametric models, this simplifies since given the parameters forecasts are independent of the observed data: P (x|θ, D, m) = P (x|θ, m). We will revisit this point as we discuss parametric vs nonparametric models in Section 3. If we are considering a number of models m, m0 , etc, then by the sum and product rules, our forecasts are an average over models weighted by their posteriors. The probabilistic modelling framework also provides intuitive answers to problems in model comparison. Assuming a set of competing probabilistic models M, given some observed data we can evaluate the posterior probability of a particular model m, P (D|m)P (m) P (D|m)P (m) P (m|D) = =P 0 0 . P (D) m0 ∈M P (D|m )P (m ) Note the prominent role played by the marginal likelihood, P (D|m). Importantly, this marginal likelihood captures a preference for simpler models known as Bayesian Occam’s Razor (Jefferys and Berger, 1992; MacKay, 2003; Rasmussen and Ghahramani, 2001). Consider a set of nested models, for example different order polynomials (e.g. constant, linear, quadratic, cubic, etc.) used to fit some regression relationship (Figure 1). Clearly a higher order model such as the cubic polynomial is strictly more complex than a lower order model such as the linear polynomial. Model fitting procedures based on optimisation (such as maximum likelihood methods or penalised likelihood methods) need to take great care not to overfit the data by fitting the parameters of an overly complex model to a relatively small data set. Overfitting is not a problem for fully Bayesian methods, as there is no “fitting” of the model to the data. We only have the sum rule and the product rule to work with, there is no “optimise” rule in probability theory. A more complex model, say one that has more parameters, simply spreads its predictive probability mass over more possible data sets than a simpler model. If all models are specified as probability distributions over data sets, since probability distributions have to sum to one, all models have the same amount of probability mass to spread over possible data (Figure 2). Given a particular data set, it therefore becomes possible to reject both models that are too simple or too complex simply by using the rules of probability theory.

6 Bayesian nonparametric modelling M=0 M=1 M=2 M=3 Model Evidence 1 40 40 40 40 20 20 20 20 0.8 0 0 0 0 0.6 P(Y|M) −20 −20 −20 −20 0 5 10 0 5 10 0 5 10 0 5 10 M=4 M=5 M=6 M=7 0.4 40 40 40 40 20 20 20 20 0.2 0 0 0 0 0 −20 −20 −20 −20 0 1 2 3 4 5 6 7 0 5 10 0 5 10 0 5 10 0 5 10 M Figure 1. Marginal likelihoods, Occam’s razor, and overfitting: Consider modelling a function y = f (x) + describing the relationship between some input variable x, and some output or response variable y. (left) The red dots in the plots on the left side are a data set of eight (x, y) pairs of points. There are many possible f that could model this given data. Let’s consider polynomials of different order, ranging from constant (M = 0), linear (M = 1), quadratic (M = 2), etc to seventh order (M = 7). The blue curves depict maximum liklihood polynomials fit to the data under Gaussian noise assumptions (i.e. least squares fits). Clearly the M = 7 polynomial can fit the data perfectly, but it seems to be overfitting wildly, predicting that the function will shoot off up or down between neighbouring observed data points. In contrast, the constant polynomial may be underfitting, in the sense that it might not pick up some of the structure in the data. The green curves indicate 20 random samples from the Bayesian posterior of polynomials of different order given this data. A Gaussian prior was used for the coefficients, and an inverse gamma prior on the noise variance (these conjugate choices mean that the posterior can be analytically integrated). The samples show that there is considerable posterior uncertainty given the data, and also that the maximum likelihood estimate can be very different from the typical sample from the posterior. (right) The normalised model evidence or marginal likelihood for this model is plotted as a function of the model order, P (Y |M ), where the data set Y are the eight observed output y values. Note that given the data, model orders ranging from M = 0 to M = 3 have considerably higher marginal likelihood that other model orders, which seems plausible given the data. Higher order models, M > 3, have relatively much smaller marginal likelihood which is not visible on this scale. The decrease in marginal likelihood as a function of model order is a reflection of the automatic Occam’s razor which results from Bayesian marginalisation.

Ghahramani 7 too simple P(D|m) "just right" too complex D All possible data sets of size n Figure 2. An illustration of Occam’s Razor. Consider all possible data sets of some fixed size n. Competing probabilistic models correspond to alternative distributions over the data sets. Here we have illustrated three possible models which spread their probability mass in different ways over these possible data sets. A complex model (shown in blue) spreads its mass over many more possible data sets, while a simple model (shown in green) concentrates its mass on a smaller fraction of possible data . Since probabilities have to sum to one, the complex model spreads its mass at the cost of not being able to model simple data sets as well as a simple model—this normalisation is what results in an automatic Occam’s razor. Given any particular data set, here indicated by the dotted line, we can use the marginal likehood to reject both overly simple models, and overly complex models. This figure is inspired by a figure from MacKay (1991), and an actual realisation of this figure on a toy classification problem is discussed in Murray and Ghahramani (2005). This approach to Bayesian model comparison can be used to solve a vast range of problems in learning the structure of complex models. For example, it has been used to learn the number of clusters in a mixture model (Roberts et al., 1998; Ghahramani and Beal, 2000), finding relevant variables or features in a prediction problem (MacKay, 1994; Neal, 1998), discovering the number of states in a hidden Markov model (MacKay, 1997), and learning the dependency structure between variables in a probabilistic graphical model (Friedman and Koller, 2003). The above approach to model comparison relies on the ability to enumerate a set of models M to be compared. This is often reasonable in scientific settings where there are a number of competing hypotheses to explain some phenomenon. Bayesian Occam’s razor ensures that overly complex models are adequately penalised when doing model comparison. However, the complexity of real-world phenomena often requires us to consider complex models that are flexible enough to capture the structure in real data. Flexible models are not only more realistic, but will also generally result in more reasonable forecasts than simpler models. Bayesian nonparametrics provides a natural framework for defining flexible models. 3. Nonparametric models The best way to understand nonparametric models is by first reminding ourselves of what parametric models are. A parametric model has some finite set of

8 Bayesian nonparametric modelling parameters θ. Given these parameters, future predictions, x, are independent of the observed data: p(x|θ, D) = p(x|θ). The parameters therefore capture everything there is to know about the data that is relevant for predicting future data. We can think of all models as information channels from past data D to future predictions x. The parameters in a parametric model constitute a bottleneck in this information channel. The complexity of the model, and the capacity of the channel, is bounded, even if the amount of observed data becomes unbounded. Parametric models are therefore not generally very flexible. In contrast a nonparametric model assumes that the data distribution cannot be defined in terms of such a finite set of parameters. However, often we can think of nonparametric models as being defined in terms of an infinite dimensional θ. More formally, the infinite dimensional θ is often represented as a function. The term nonparametric is therefore a bit of a misnomer; it’s not that nonparametric models don’t have parameters, in fact they have infinitely many parameters. Because of this, nonparametric models cannot be explicitly represented in terms of their parameters. From the information channel viewpoint, we have removed the bottleneck. The amount of information that θ can capture about the data D grows as the amount of data grows. This makes nonparametric models more flexible than parametric models. 5 There is another way to view the difference between parametric and nonparametric models. Predictions from a parametric model are explicitly and compactly summarised through the parameters θ, P (x|θ). Nonparametric models, in contrast, cannot be summarised in this way. Because of this predictions from a nonparametric are necessarily memory-based, P (x|D); to make predictions we need to store or remember a growing amount of information about the training data, D. Nonparametric models are inextricably tied to the notion of exchangeability. A sequence is exchangeable if its joint distribution is invariant under arbitrary permutation of the indices. Consider modelling a data set {x1 , . . . , xN } under the assumption that the ordering of the elements is uninformative. This data set may be a collection of documents, images, or any other sort of object collected in a manner such that either the ordering is irrelevant or completely unknown.6 De Finetti’s theorem states that a sequence is exchangeable if and only if there exists some θ such that the elements xn are independently and identically distributed (iid) from some unknown distribution indexed by θ (Kallenberg, 2005). Importantly, θ may need to be infinite dimensional as it needs to index the space of probability measures (non-negative functions that normalise to one). 5 To the best of my knowledge, this information channel view of nonparametric modelling is not explicitly articulated in the literature, although to be sure, there are a number of strong links between Bayesian inference and information theory (Zellner, 1988; MacKay, 2003). These ideas could certainly be further formalised by explicitly attempting to measure or estimate the channel capacity of different models. This would help provide a further definition of the complexity of a model, measured in bits per data point, which in many ways is more satisfactory than definitions that rely on counting parameters. 6 There are generalisations of exchangeability to handle time series and arrays.

Ghahramani 9 The consequence of de Finetti’s theorem is that if we want to model exchangeable data in full generality, we need to consider putting distributions on unknown probability measures. Distributions on measures, functions, and other infinite dimensional objects, are thus central to Bayesian nonparametric modelling. Many of these distributions are infinite dimensional versions of their finite dimensional counterparts, and in fact a good strategy for deriving Bayesian nonparametric models is to start from a parametric model and “take the infinite limit” (Neal, 2000). Distributions on infinite dimensional objects are the main subject of study in stochastic process theory, and therefore much of the terminology used in Bayesian nonparametrics is borrowed from this field.7 Two of the classical building blocks for Bayesian nonparametric models are the Gaussian process and the Dirichlet process. I will give an overview of these models in sections 4 and 5, with an emphasis on their applications to general problems in machine learning and statistics, including regression, classification, clustering, and density estimation (these problems will also be described in those sections). I will also cover newer building blocks that can represent nonparametric distributions over sparse matrices, hierarchies, covariances, and time series (sections 6 to 9). In order to give a broad overview I will necessarily have to avoid going in much depth into any of the specific topics, providing instead pointers to the relevant literature. Admittedly, my overview is based on my personal view of the field, and is therefore biased towards areas of Bayesian nonparametrics to which my colleagues and I have contributed, and misses other important areas. 4. Modelling functions, classification and regression: Gaussian processes Gaussian processes (GPs) are a distribution over functions which can be used in numerous contexts where one’s model requires one to represent an unknown function (Rasmussen and Williams, 2006). One dimensional GPs indexed by time are familiar to many fields: Brownian motion, Wiener processes, Ornstein- Uhlenbeck processes, linear Gaussian state-space models, and many random walks, are all examples of GPs. For historical reasons, GPs are also sometimes associated with spatial statistics, for example modelling temperature as a function of spatial location. Within machine learning, GPs are often used for nonlinear regression and classification. Consider the following simple model for nonlinear regression yn = f (xn ) + n . Here, we wish to model the relationship between an input (or covariate) x and an output (or response variable) y. The subscript n indexes data points in our data set D, n is some additive noise, and crucially, f is the unknown regression function we wish to learn about from data. 7 The word “process” usually evokes the idea of something evolving temporally, but it’s important to note that stochastic processes can be indexed by time, space, or any other index space. Many of the uses of stochastic processes in Bayesian nonparametrics do not correspond to indexing by time.

10 Bayesian nonparametric modelling Assume we have a distribution over functions f and we evaluate the marginal distribution it assigns to the vector f = (f (x1 ), . . . f (xN )). If, for any choice of input points, (x1 , . . . , xN ), the marginal distribution over f is multivariate Gaussian, then the distribution over the function f is said to be a Gaussian process. In other words, a GP is an infinite dimensional generalisation of the multivariate Gaussian. Analogously to the Gaussian, a GP is parameterised by a mean function, and a covariance function. The application of GPs to regression is straightforward. Starting with a prior on functions p(f ) we condition on the data to obtain a posterior p(f |D). When the noise is assumed to be Gaussian, all computations reduce to operations on N dimensional Gaussians. The infinitely many other dimensions of f can be marginalised out analytically. Classification problems correspond to predicting categorical response or output variables, e.g. y ∈ {cat, dog}. GP regression can be modified to do classification simply by introducing a link function that maps the real values f (x) into probabilities over the classes. Computing p(f |D) exactly becomes intractable but many good methods exist for approximating the required integrals (section 10). 5. Density estimation and clustering: Dirichlet processes and Chinese restaurant processes We now consider two distinct problems—density estimation and clustering— and describe a close link between the two when approached from a Bayesian nonparametric modelling approach. Density estimation refers to the problem of inferring an unknown density p from data D = {x1 , . . . , xN }. Let us first consider a very simple situation where the data points belong to a discrete and finite space with K possible values, x ∈ X = {1, . . . , K}. Any distribution on X can be represented by a K-dimensional non-negative vector p that sums to one. To infer p from D we need a reasonable prior on finite distributions. The Dirichlet distribution is a natural choice which takes the form: K 1 Y αhk −1 P (p|α, h) = pk . Z k=1 Here Z is a normalising constant, h is the mean of p and α > 0 controls the dispersion around the mean. A very attractive property of the Dirichlet distribution is that it is conjugate in the sense that the posterior P (p|D, α, h) is also Dirichlet. Another nice property is that for all but pathological choices of the parameters it has good coverage in the sense that it puts some nonzero probability mass near all possible values of p. The Dirichlet process (DP) extends the Dirichlet distribution to countable or uncountably infinite spaces X . It has the property that for any finite partition of X = (X1 , . . . XK ) it marginalises to a K-dimensional Dirichlet distribution on the measure (i.e. mass) assigned to each element of the partition (Ferguson, 1973). We write that the probability measure G is drawn from a Dirichlet process prior as G ∼ DP(α, H) analogously to our notation for the Dirichlet distribution. An excellent introduction to the Dirichlet process is provided by Teh (2010).

Ghahramani 11 Conjugacy and good coverage suggest that the DP could be a very good general purpose nonparametric density estimator. Unfortunately, the distributions drawn from a DP prior are, with probability one, discrete so they don’t have a density. In fact, a draw from a DP prior can be represented in the following way: ∞ X G= πk δxk (5.1) k=1 where the sum is an infinite sum, the πk are masses that sum to one, δ is the Dirac-delta function, and the xk are locations for those probability masses drawn iid from the base measure H which controls the mean of G. To alleviate the problem of discreteness, the DP is often used as a prior on the parameters of a mixture model, with the whole model now called a Dirichlet process mixture (DPM) (Antoniak, 1974; Ferguson, 1983; Lo, 1984; Neal, 2000). The particular example of a Dirichlet process mixture of Gaussians, also known as an infinite Gaussian mixture model (Rasmussen, 2000) is widely used for both density estimation and clustering. Clustering refers to the problem of finding groupings of objects or data points such that similar objects belong to the same group and dissimilar objects belong to different groups. An example application of clustering is finding groups of similar celestial objects in a large astronomical survey. Posed abstractly in this way, clustering is not a well-defined problem (how many groups should we find? what does “similar” mean?). However, if we restrict ourselves to assuming that each cluster can be captured by some parameterised probability distribution over data points, such as a Gaussian, then clustering becomes a well-defined probabilistic inference problem. Bayesian nonparametric clustering using Dirichlet process mixtures is a natural extension of classical clustering algorithms such as the EM algorithm for finite mixture models or k- means (Duda and Hart, 1973). In a finite mixture model, the data is assumed to come from a distribution composed of K components: K X p(x|π, θ) = πk p(x|θk ) (5.2) k=1 with mixing weights π that sum to one, and parameters θk for each component. An infinite mixture model considers the limit of K → ∞, and has the property that it allows the number of observed clusters to grow with the number of observed data points. A DPM can be obtained as an infinite limit of a finite mixture model in many ways, but let us consider the following construction: xn ∼ p(x|θn ) θn ∼ G G ∼ DP(α, H). Because G is discrete, the values of θn will repeat, which results in a clustering of the data points. By equation (5.1), the πk correspond to the mixing weights of the infinitely many clusters, which can be compared to the finite counterpart (equation 5.2). The distribution over partitions of the data points induced by the DPM is known as a Chinese Restaurant Process (CRP; Aldous (1985)).

12 Bayesian nonparametric modelling The DPM, apart from being mathematically elegant, has some practical advantages over traditional clustering algorithms. There can be computational advantages to running a single instance of a DPM inference algorithm which automatically infers the number of clusters, rather than having to run multiple instances of an algorithm to compare different hypotheses on the number of clusters. Moreover, at test time, the DPM always allows for the possibility that a new test point (e.g. a new astronomical object) belongs to a cluster that was not observed in the training data. This makes DPM predictions somewhat more robust to outliers. 6. Discrete state time series: infinite hidden Markov models Many processes of interest produce sequential data for which models that assume exchangeability are inadequate. For example, natural language text, protein amino acid sequences, financial time series, and natural sounds all have sequential structure which is important to model. One of the most widely used models for times series is the hidden Markov model (HMM). An HMM assumes that a series of observations (x1 , . . . , xT ) was generated by a process which can be in one of K different discrete states at each point in time, st ∈ {1, . . . , K}. Moreover, in an HMM, st fully captures the state of the system at time t in the sense that given st the future evolution of the system does not depend on the state at times previous to t; this is the Markov property: P (st+1 |s1 , . . . , st ) = P (st+1 |st ). Finally, the observations are assumed to be drawn independently given the hidden states, through an emission process P (xt |st ). An example of an interesting application of HMMs in the physical and biological sciences is the modelling of single ion channel kinetics (Chung et al., 1990). Here, the measured time series of currents from an ion channel are modelled by assuming that at each point in time the channel macromolecule can be in one of many different conformational states, and that there is a transition matrix defining the probability of transitioning between each of these states. Learning an HMM involves inferring both parameters of the transition process P (st+1 |st ), which is in general a K × K transition matrix, and parameters of the emission process, P (xt |st ) (Rabiner and Juang, 1986). The problem of learning the structure of an HMM corresponds to inferring the number of hidden states, K, from data. Rather than doing model selection over varying K, we would like to develop a nonparametric approach to HMMs where the model has countably infinitely many hidden states at its disposal. This can be useful both when we don’t believe that any finite HMM can capture the data generating process well, and in situations where we believe that the data was actually generated from a finite-state process, but we simply don’t know how many states this process should have.8 8 There is a subtle but important distinction between assuming an infinite model, and assuming a finite but potentially unbounded model. The difference comes to light when we consider what we expect to happen in the limit of infinitely large data sets. In the former case, the posterior will have visited infinitely many distinct states, while in the latter case, the posterior should converge on some finite number of states.

Ghahramani 13 The key insight that allows one to develop a nonparametric HMM is that the finite HMM is a time series generalisation of finite mixture models. At each time step, the HMM assumes that the observation xt was generated by one of K mixture components, where st indicated the component (or cluster) used. The only difference between HMMs and mixture models is that the mixture indicators in an HMM depend on the ones at the previous time step. Using this insight, Beal et al. (2002) developed the infinite HMM model (iHMM). The basic idea was to consider a Bayesian HMM with countably infinitely many states. The main difficulty was to define a sensible prior on the parameters of the ∞ × ∞ transition matrix. In the finite K dimensional case, one would typically use independent symmetric Dirichlet prior distributions for each row of the transition matrix (where a k th row corresponds to the vector of all outgoing transition probabilities from state k). In the infinite limit, the independent Dirichlet prior doesn’t result in a sensible model, as under this prior, with probability one, the HMM will keep transitioning to new states rather than revisiting previous states. The solution developed in Beal et al (Beal et al., 2002) was to couple the rows by using a hierarchical Dirichlet process, a solution analogous to a reinforced urn process in probability theory (Fortini and Petrone, 2012). This work was followed up in the elegant paper by Teh et al (Teh et al., 2006) which further developed the hierarchical Dirichlet process and proposed an improved MCMC sampler for the iHMM. Since the original paper, there have been a number of conceptual and algorithmic developments of the infinite HMM. The beam sampler provides an efficient way of sampling the iHMM by using dynamic programming forward- backward style message passing (Van Gael et al., 2008a).9 Parallel and distributed implementations of the iHMM allow larger scale deployments (Bratieres et al., 2010). The block diagonal iHMM is an extension which groups the hidden states into clusters of states, effectively hierarchically partitioning the state-space (Stepleton et al., 2009). The infinite HMM can be extended to have a power-law structure on the hidden states by using the Pitman-Yor process (Van Gael and Ghahramani, 2011; Blunsom and Cohn, 2011) and has been successfully applied to diverse problems such as language modelling (Blunsom and Cohn, 2011), and speaker diarization (Fox et al., 2008). 7. Sparse matrices and overlapping clusters: Indian buffet processes One limitation of Dirichlet process mixtures, the infinite HMM, and clustering models in general, is that each data point is modelled as belonging to one of a set of mutually exclusive clusters. Although the nonparametric variants are flexible, in that they allow countably infinitely many clusters, they don’t allow a data point to belong to multiple clusters at the same time – they fundamentally define distributions over partitions of the data. We would like building blocks for our models that can allow overlapping cluster membership. For example, to understand a person’s network of friendships we really need models which can capture the idea that a person can belong 9 The beam sampler is available in the software package http://mloss.org/software/view/205/

14 Bayesian nonparametric modelling simultaneously to many possible social groupings based on workplace, family, housing location, high school, hobbies, etc. This type of hidden structure in data is sometimes called factorial structure (Hinton and Zemel, 1994; Saund, 1994; Ghahramani, 1995). The Indian buffet process (IBP; (Griffiths and Ghahramani, 2006, 2011)) is a probabilistic object which can be used to represent nonparametric factorial structure. There are a number of ways of understanding and deriving the IBP. Let’s start with a sparse binary matrix view of IBPs. Consider an N × K binary matrix, Z, where we can think of the rows of the matrix as corresponding to objects or data points, and the columns as features or clusters. An element of this matrix znk = 1 may denote that object n possesses hidden feature k, (or belongs to cluster k), and features (clusters) are not mutually exclusive in that an object can have multiple features. We wish to define a very simple distribution that is exchangeable over the objects (rows). A very simple choice is to assume that the columns are independent. We can therefore model each column through a single unknown parameter, θk , representing the frequency of feature k, p(znk = 1|θk ) = θk . Full specification for this model requires some choice for the prior distribution of θk . A natural choice is the beta distribution (the special case of a Dirichlet when there are only two outcomes) which happens to be conjugate to the Bernoulli likelihood, allowing θk to be integrated out. Like in the previous cases, we wish to consider models with infinitely many possible features or clusters, so we therefore have to examine the limit K → ∞. The beta distribution has two parameters α, β and the mean of the beta distribution is α/(α + β). For fixed α,β, and N in the limit K → ∞, the matrix Z will have infinitely many ones in it, which makes it computationally and statistically of limited interest. However, consider scaling the first parameter of the beta distribution and setting the second parameter to 1, i.e. using a beta(α/K,1) prior for each θk .10 In this case, the limit on the distribution of Z has a number of nice properties (a) the number of ones in each row is distributed as Poisson(α); (b) the total expected number of ones is αN ; (c) the number of nonzero columns grown as O(α log N ); and (d) the rows are exchangeable. This distribution is the Indian Buffet Process. Note that the distribution described above has infinitely many columns of zeros. Since the columns are all iid, once we sample a matrix, we can reorder its columns into a more usable representation such as the one shown in Figure 3. In many applications we are specifically interested in sparse binary matrices; for example to represent which factors are non-zero in sparse latent factor models (Knowles and Ghahramani, 2007), for binary matrix factorisation (Meeds et al., 2007), or to represent the adjacency matrix in a graph (Adams et al., 2010). However, it sometimes useful to view the IBP as a more abstract probabilistic object. Whereas the Chinese Restaurant Process is an infinitely exchangeable distribution over partitions of a set of objects, the IBP is an infinitely exchangeable distribution over subsets of a set of objects. Each object belongs to Poisson(α) subsets (or clusters), and as N increases, both the number of subsets grows (logarithmically) and the size of each subset grows (linearly) with N . Since its introduction, many interesting properties and extensions of the IBP have been uncovered. Since the IBP defines an exchangeable distribution, it has 10 It is not hard to reintroduce the second parameter to get the two-parameter IBP.

Ghahramani 15 Prior sample from IBP with α=10 0 10 20 30 objects (customers) 40 50 60 70 80 90 100 0 10 20 30 40 50 features (dishes) Figure 3. A sample from an IBP matrix, with columns reordered. Each row has on average 10 ones. Note the logarithmic growth of nonzero columns with rows. For the “restaurant” analogy where customers enter a buffet with infinitely many dishes you can refer to the original IBP papers. a de Finetti mixing distribution; for the IBP Thibaux and Jordan (Thibaux and Jordan, 2007) showed that this is the beta process (Hjort, 1990). An important extension of the IBP is the three-parameter model which exhibits a power law growth in the number of clusters (Teh and Görür, 2009). Nonparametric models which use the IBP to define sparse latent variables have been applied to a number of different problems, as reviewed in Griffiths and Ghahramani (2011). A very interesting application of the IBP is to the problem of network modelling: modelling the connections between objects or entities in social, biological and physical networks (Newman, 2010). While many models of networks are based on the idea of discovering communities or clusters of the nodes, the IBP allows each node to belong to multiple overlapping communities or clusters, a property which can be exploited to obtain improved predictive performance in tasks such as link prediction (Miller et al., 2009; Mørup et al., 2011; Palla et al., 2012). Figure 4 shows a useful visualisation of the relationship between a number of models. Here we can see that the Dirichlet process mixture (section 5), the infinite HMM (section 6) and the IBP (section 7) can all be related to each other. Combining the key features of all models results in the infinite factorial HMM, a Bayesian nonparametric time series model with factorial hidden structure (Van Gael et al., 2008b).

16 Bayesian nonparametric modelling factorial factorial model HMM finite HMM mixture IBP ifHMM factorial DPM iHMM time non-param. Figure 4. A diagram representing how some models relate to each other. We start from finite mixture models and consider three different ways of extending them. Orange arrows correspond to time series versions of static (iid) models. Blue arrows correspond to Bayesian nonparametric versions of finite parameteric models. Green arrows correspond to factorial (overlapping subset) versions of clustering (non-overlapping) models. 8. Hierarchies: Kingman’s coalescent, and Dirichlet and Pitman-Yor Diffusion Trees It is useful to distinguish between flat clustering models and hierarchical clustering models. In the former, data are partitioned into clusters, while in the latter, these clusters in turn are also partitioned into superclusters in a hierarchical manner. This results in a tree (or dendrogram) where at the top level we have a single root node corresponding to one coarse cluster with all the data points, while at the bottom we have leaves corresponding to fine clusters with a single data point in each cluster (Duda and Hart, 1973). Such hierarchical clustering models are useful for many reasons. Firstly, many natural phenomena are well-modelled through hierarchies. For example, the evolution of biological organisms results in phylogenetic trees which are well- modelled by a hierarchy, and real-world objects can often be decomposed into parts, subparts, etc. Second, hierarchies allow one to tie together parameters of complex models so as to improve generalisation in learning. For example, if one is building statistical models of patient outcome across a country, it might be natural to group together parameters of regions, cities, hospitals, and individual doctors, corresponding to multiple levels of a hierarchy. Third, hierarchies provide good abstractions for interpretability of complex models. For example, rather than trying to understand what the hundreds of states in an HMM each do, it would be useful to have an HMM which partitions the states in a coarse to fine hierarchy so that one can start out by interpreting the coarse states and gradually move down to the fine states.

Ghahramani 17 Note that while hierarchical clustering models are often described in terms of hierarchies over data points, in the above examples we have seen that hierarchies can be useful more generally over internal components of models, such as hidden states of an HMM, or parameters of a model. We have seen that the Dirichlet process mixture model (and the associated Chinese Restaurant Process) can be used to define Bayesian nonparametric models for flat clustering. Are there equivalent Bayesian nonparametric models which result in hierarchical clusterings? The answer is yes. Here we very briefly touch upon two frameworks for generating hierarchies which can be used in nonparametric models. For the nonparametric setting, the key requirement is that the models define an infinitely exchangeable distribution over the data points. For this to hold, the models must be projective in the sense that marginalising over the (N + 1)th point should result in a coherent model over N points (Orbanz, 2009). The first solution is given by Kingman’s coalescent (Kingman, 1982). This model has been widely used for modelling population genetics (Donnelly and Tavare, 1995) and more recently in the machine learning community for hierarchical clustering (Teh et al., 2008). The coalescent defines distributions over trees by starting with every data point in its own cluster and considering a process that merges clusters backwards in time until only one cluster remains. The key property of the coalescent is that for each pair of clusters the event that they merge is independent of the event that any other pair merges, and the time to this event is drawn from an exponential distribution with constant rate (which we can set to 1 without loss of generality). This process is well-defined in the limit of N → ∞ data points, and defines an infinitely exchangeable distribution over points. A second solution is given by the Dirichlet Diffusion Tree (DDT; Neal (2003)). Like Kingman’s coalescent, the DDT also defines a tree by considering a Markovian process evolving in time; however, the DDT starts at time t = 0 and evolves forward for 1 unit of time.11 We denote the evolution of the nth point in time via xn (t). The first data point starts at a location x1 (0) = 0 and follows a Brownian motion process, ending up at some point drawn marginally from a unit variance Gaussian, x1 (1) ∼ N (0, 1). The second data point exactly follows the path of the first data point until a divergence event occurs, after which point its path is independent of the path of the first point. The time to divergence is parameterised through a hazard function, an object commonly used in survival analysis. Subsequent data points follow the paths of previous data points, diverging according to a scaled form of the hazard function, and when reaching a branch point choosing a branch with probability proportional to the number of points that chose that branch before. This process defines an exchangeable distribution over data points, parameterised by the unknown tree. Using the DDT prior, the problem of hierarchical clustering becomes one of inferring the unknown tree given some data. 11 Note that both in the coalescent and in the DDT “time” is an indexing variable used to generate a tree structure, and need not correspond to a real notion of time.

18 Bayesian nonparametric modelling Both Kingman’s coalescent and the Dirichlet diffusion tree generate binary trees with probability one.12 A generalisation of the DDT that allows arbitrary branching of the trees is given by the Pitman-Yor diffusion tree (PYDT; (Knowles and Ghahramani, 2011). The process is generalised to allow, at each branch point, for the new data point either to follow the path of one of the previous points, or to create a new branch. Like the Pitman Yor process (Pitman and Yor, 1997) the PYDT has two parameters controlling its branching structure. Certain settings of these parameters result in the DDT, while other settings recover the distribution over trees induced by Kingman’s coalescent. The tree distribution induced by the PYDT is the multifurcating Gibbs fragmentation tree (McCullagh et al., 2008), the most general Markovian exchangeable distribution over trees. General distributions over hierarchies and other clustering structures can also be derived through the elegant theory of fragmentation-coagulation processes (Teh et al., 2011). 9. Covariance matrices: Generalised Wishart processes Often we are interested in modelling the relationship between a number of variables. One can express relationships between variables in many different ways, but perhaps the simplest representation of dependence is the covariance matrix, Σ, a symmetric positive (semi)definite matrix representing second order statistics of the variables. The covariance matrix plays a key role in parameterising many distributions, most notably the multivariate Gaussian but also extensions such as the multivariate t-distribution or more generally elliptical distributions (Muirhead, 1982). In certain applications, we wish to model how such a covariance matrix might depend on some other variables. For example, in econometrics and finance, one is often interested in modelling a time-varying covariance matrix Σ(t)—this is the key object of interest in the field of multivariate stochastic volatility (Asai et al., 2006). More generally, we would like to place distributions on covariance matrices that can depend on arbitrary variables, Σ(x), not just scalar time. Viewed as a function of x, we want to be able to define distributions on matrix-valued functions restricted to the space of symmetric-positive-definite (s.p.d.) matrix values. Is there a convenient and simple way to define such a stochastic process? Indeed there is, and the key insight comes from the observation that one can generate s.p.d. matrices by taking the outer products of random vectors. Consider for following construction, where we draw D-dimensional vectors independently from a multivariate Gaussian distribution ui ∼ N (0, V ) with covariance matrix V , and we define Xν Σ= ui u> i . i=1 Such a matrix Σ is Wishart distributed with ν degrees of freedom and is s.p.d. with probability one as long as ν ≥ D. The mean of Σ is proportional to V . 12 In the case of the DDT, binary trees result in all but pathological cases where the hazard function diverges before t = 1.

Ghahramani 19 We can generalise the Wishart distribution to a stochastic process indexed by x in any space X by replacing the elements of each ui with draws from a Gaussian process: ui (x) = (ui1 (x), ui2 (x), . . . , uiD (x)) where uid ∼ GP(0, K), where K is the covariance function or kernel of the GP Rasmussen and Williams (2006). The desired stochastic process is obtained by the same construction. ν X Σ(x) = ui (x)ui (x)> . i=1 A special case of such a construction where the Gaussian processes are assumed to be Brownian has been studied in probability theory and is known as a Wishart process (Bru, 1991); these processes have recently been applied to problems in econometrics (Philipov and Glickman, 2006; Gouriéroux et al., 2009). The more general case for arbitrary Gaussian processes is developed in (Wilson and Ghahramani, 2011) where applications to multivariate stochastic volatility are also explored. In this general framework, the model parameters controlling V , K and ν can be learned from data. 10. Inference So far we have focused on describing a general probabilistic approach to modelling and some nonparametric distributions which can be used in models of complex data sets. The three key ingredients in any approach to modelling are the data, the model, and an algorithm for doing probabilistic inference in the model. The problem of probabilistic inference corresponds to computing the conditional probability of some variables of interest given some observed variables, whilst marginalising out all other variables. Thus, both the computation of a model’s marginal likelihood (equation 2.1) and prediction (equation 2.2) are inference problems. Filling in missing data given observed data, computing the parameter posterior, or evaluating expectations of various quantities can all also be phrased as instances of the inference problem. Generally, the inference problem boils down to a problem of numerical integration or summation over large state-spaces. For most models of interest, especially nonparametric models, exact inference is computationally intractable, in the sense that all known algorithms for computing the exact probabilities of interest scale exponentially with some aspect of the problem, such as the number of data points or variables. In many problems, even approximating these exact probabilities to within some small error tolerance is intractable in the worst case.13 A wide variety of approximation methods have been developed to solve Bayesian inference problems. These can be roughly divided into stochastic approximations (which make extensive use of random numbers) and deterministic approximations. Some examples of widely used stochastic approximate inference methods include Markov chain Monte Carlo methods (for an excellent review 13 Most theoretical results on intractability focus on the worst case of a problem instance. The real-world inference problem may in fact be much easier to approximate.

You can also read