Schedule for: 24w5196 - Frontiers of Bayesian Inference and Data Science

Beginning on Sunday, September 1 and ending Friday, September 6, 2024

All times in Oaxaca, Mexico time, CST (UTC-6).

Sunday, September 1
14:00 - 23:59 Check-in begins (Front desk at your assigned hotel)
19:30 - 22:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
20:30 - 21:30 Informal gathering (Hotel Hacienda Los Laureles)
Monday, September 2
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 09:45 Jeffrey Miller: Compressive Bayesian non-negative matrix factorization for mutational signatures
Non-negative matrix factorization (NMF) is widely used in many applications. Inferring an appropriate number of factors for NMF is a challenging problem, and sparse Bayesian models have been proposed for it. However, inference in these models is difficult due to the complicated multimodal posterior distributions that arise. We introduce a novel methodology for overfitted Bayesian NMF models using “compressive hyperpriors” that force unneeded factors to epsilon while imposing mild shrinkage on needed factors. The basic idea is to use simple semi-conjugate priors but set the strength of the hyperprior in a data-dependent way in order to achieve compressivity. This yields an easy-to-implement Gibbs sampler with improved convergence and accuracy compared to state-of-the-art alternatives. We demonstrate the method in simulations and on real data from mutational signatures analysis in cancer genomics.
(Conference Room San Felipe)
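The compressive hyperprior construction is the paper's contribution; as background only, below is a minimal sketch of the kind of semi-conjugate Gibbs sweep the abstract alludes to, written for a standard gamma-Poisson Bayesian NMF (not necessarily the authors' exact model). The per-factor rate `b` is where a data-dependent, compressive choice would act; all names and hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(X, W, H, a=1.0, c=1.0, b=1.0, d=1.0):
    """One Gibbs sweep for gamma-Poisson NMF:
    X[i,j] ~ Poisson(sum_k W[i,k] H[k,j]), with semi-conjugate priors
    W[i,k] ~ Gamma(a, b) and H[k,j] ~ Gamma(c, d) (shape/rate).
    b may be a length-K vector to shrink factors individually."""
    I, J = X.shape
    K = W.shape[1]
    SW = np.zeros((I, K))  # latent counts allocated to (i, k)
    SH = np.zeros((K, J))  # latent counts allocated to (k, j)
    for i in range(I):
        for j in range(J):
            if X[i, j] == 0:
                continue
            p = W[i] * H[:, j]  # allocation probs proportional to W*H
            s = rng.multinomial(X[i, j], p / p.sum())
            SW[i] += s
            SH[:, j] += s
    # Conjugate gamma updates given the allocated counts
    W = rng.gamma(a + SW, 1.0 / (b + H.sum(axis=1)))
    H = rng.gamma(c + SH, (1.0 / (d + W.sum(axis=0)))[:, None])
    return W, H

# Toy run on synthetic counts with an overfitted K
I, J, K = 30, 40, 8
W = rng.gamma(1.0, 1.0, (I, K))
H = rng.gamma(1.0, 1.0, (K, J))
X = rng.poisson(W @ H)
for _ in range(100):
    W, H = gibbs_sweep(X, W, H, b=5.0)
```

Under a compressive choice, the rate for unneeded factors would be set large in a data-dependent way so that those factors collapse toward zero while needed factors feel only mild shrinkage; the paper's specific construction differs in detail.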
09:45 - 10:15 Round Table & Coffee Break (Conference Room San Felipe)
10:15 - 11:00 Tommaso Rigon: Distribution theory of Gibbs-type feature allocation
Feature allocation models are a generalization of clustering methodologies, accommodating the sharing of multiple attributes among subjects. Analogous to clustering, these allocations are characterized by the Exchangeable Feature Probability Functions (EFPFs). This paper aims to provide distributional insights for a fundamental category of feature allocation models with EFPFs in product-form, called Gibbs feature models. These priors play a similar role to Gibbs-type priors in the species sampling framework, balancing computational tractability and modeling flexibility. We establish several theoretical results, covering analyses of the predictive distributions and posterior laws of the underlying statistical processes. This methodological framework finds application in ecology, particularly in species richness estimation via accumulation curve analysis.
(Online - CMO)
11:00 - 11:30 Isabella Deutsch: Ancestor Hawkes Processes for Group Chat Messaging Patterns
The Hawkes process is a versatile point process that we modify to capture dynamics in message patterns in a group chat setting. Events from a Hawkes process are either immigrant or triggered events. While this underlying branching structure is commonly used for sampling and estimation, this fundamental quality of each event is not reflected in the parameter structure of a classic Hawkes process. We therefore develop the Ancestor Hawkes model, a type of Hawkes process model that allows for different influences for immigrant and triggered events. We showcase this model on group chat data, which we collected specifically for this line of work. This allows us to characterise chat participants according to their answering behaviour and explore how participants start new messaging threads or reply to existing ones.
(Online - CMO)
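For readers unfamiliar with the branching structure mentioned above: a univariate Hawkes process with exponential kernel has conditional intensity λ(t) = μ + Σ_{t_i < t} α exp(−β(t − t_i)), where μ drives immigrant events and the excitation sum drives triggered events. A minimal sketch of this classic model (the Ancestor Hawkes parameterization itself is the speaker's; parameter values here are illustrative):

```python
import numpy as np

def hawkes_intensity(t, events, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity of a univariate Hawkes process with
    exponential kernel: mu is the immigrant (background) rate; each
    past event adds alpha * exp(-beta * lag) of self-excitation."""
    events = np.asarray(events)
    lags = t - events[events < t]
    return mu + alpha * np.exp(-beta * lags).sum()

# e.g. intensity just after a burst of three messages
print(hawkes_intensity(2.0, [0.3, 1.1, 1.8]))
```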
11:30 - 12:00 Karla Vianey Palacios: Heavy-Tailed NGG-Mixture Models
Heavy tails are frequently encountered in practice, yet they are the Achilles heel of a variety of conventional random probability measures such as the Dirichlet process (DP). In this talk, we will focus on characterizing the tails of the normalized generalized gamma (NGG) process. We show that the right tail of an NGG is heavy if the base distribution is also heavy; the DP is the only exception. We will discuss two classes of heavy-tailed mixture models and evaluate their merits, presenting multivariate extensions and a predictor-dependent version to learn about the effect of covariates on a multivariate response with heavy-tailed marginals. Our results suggest that our method works well in different scenarios, as we demonstrate on a neuroscience dataset. Brain rhythm signals are fundamental to understanding how the human brain works. Broadly speaking, they consist of patterns of neural activity that are believed to be linked to certain behaviors, the intensity of alertness, and dream states. These signals are typically measured using an electroencephalogram (EEG), and there is evidence that alpha and beta rhythms have heavy tails. Therefore, the main goal of our analysis will be to learn about the marginal and joint distributions of these heavy-tailed oscillations, given various stimuli.
(Conference Room San Felipe)
12:00 - 12:30 Riccardo Corradin: Nesting compound random measures
Distributional heterogeneity in a partially exchangeable setting is typically modeled through vectors of dependent probability measures. Among the possible strategies, compound random measures have been shown to be a flexible modeling approach that can encompass many possible models within a single general formulation. We are particularly interested in constrained situations, where possible ties are induced across different dimensions of the vectors of dependent probability measures. We therefore propose a model that combines the properties of compound random measures with nested modeling strategies. We derive a posterior characterization of this model, which allows us to perform scalable inference in a conditional setting. When combined with suitable kernel functions, the proposed model is shown to produce accurate distributional estimates while simultaneously clustering together different dimensions of the underlying vectors of dependent probability measures.
(Conference Room San Felipe)
12:30 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
14:00 - 14:45 Tamara Broderick: Consistent Validation for Predictive Methods in Spatial Settings
Spatial prediction tasks are key to weather forecasting, studying air pollution, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We instead propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data.
(Online - CMO)
14:45 - 15:15 Dootika Vats: Moreau-Yosida Importance Sampling with MCMC
Often in modern Bayesian statistics, we are faced with non-differentiable posterior distributions or posteriors with large Lipschitz constants. These models typically arise in problems where parsimonious estimation is desirable. Due to the non-differentiability of the target distribution, otherwise effective gradient-based Markov chain Monte Carlo (MCMC) algorithms are ill-suited for this problem. Recently, proximal MCMC methods have been proposed that approximate the non-differentiable posterior with a smooth Moreau-Yosida envelope, and use gradient information from the smooth approximation. We leverage these smooth approximations to build an effective importance sampling proposal that guarantees finite variance of estimators. The proposed algorithm, we show, can yield high gains in statistical efficiency for such ill-behaved problems. Applications to Bayesian trend filtering will be discussed.
(Conference Room San Felipe)
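As background for the smoothing step: the Moreau-Yosida envelope of a convex f is f_λ(x) = min_y { f(y) + ||x − y||² / (2λ) }, which is differentiable with gradient (x − prox_{λf}(x)) / λ. A minimal sketch for the l1 penalty common in trend filtering, whose prox is soft-thresholding; the importance sampling construction in the talk is separate:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||.||_1: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def my_envelope_l1(x, lam):
    """Moreau-Yosida envelope of f(x) = ||x||_1 with parameter lam:
    f_lam(x) = min_y f(y) + ||x - y||^2 / (2 lam); smooth, Huber-like."""
    p = prox_l1(x, lam)
    return np.abs(p).sum() + np.sum((x - p) ** 2) / (2 * lam)

def my_envelope_grad(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam; (1/lam)-Lipschitz."""
    return (x - prox_l1(x, lam)) / lam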
15:15 - 15:45 Florian Maire: Towards variance reduction in MCMC with the occlusion process
There is a rich literature on strategies to reduce the variance of MCMC estimators: control variates, non-reversible dynamics, adaptivity, embedding, etc. In this work, we define and study a mechanism called occlusion, which leverages recent strategies developed in machine learning to approximate a given probability distribution known up to a normalizing constant. The idea is to occlude each state of an ergodic and reversible Markov chain with a certain probability and, upon occlusion, replace it with an independent draw from a certain distribution. This strategy is shown not to deteriorate the computational complexity or the convergence properties (LLN and CLT) of the initial MCMC estimator, and we show on several examples that it does reduce its asymptotic variance, sometimes dramatically.
(Online - CMO)
15:45 - 16:15 Round Table & Coffee Break (Conference Room San Felipe)
16:15 - 17:00 Yanxun Xu: Precision Medicine in HIV
The use of antiretroviral therapy (ART) has significantly reduced HIV-related mortality and morbidity, transforming HIV infection into a chronic disease whose care now focuses on treatment adherence, comorbidities including mental health, and other long-term outcomes. Combination ART with three or more drugs of different mechanisms or against different targets is recommended for all people living with HIV (PWH), who must continue on it indefinitely once started. Understanding the long-term effects of ART on health outcomes and personalizing treatment based on individuals’ characteristics are therefore crucial for optimizing PWH’s health outcomes and facilitating precision medicine in HIV. In this talk, I will present methods designed to learn and understand the impact of ART on the health outcomes of PWH, and explore the future of HIV care through innovative and individualized approaches.
(Online - CMO)
17:00 - 17:15 Noirrit Kiran Chandra: Functional connectivity across the human subcortical auditory system using an autoregressive matrix-Gaussian copula graphical model approach with partial correlations
The auditory system comprises multiple subcortical brain structures that process and refine incoming acoustic signals along the primary auditory pathway. Due to technical limitations of imaging small structures deep inside the brain, most of our knowledge of the subcortical auditory system is based on research in animal models using invasive methodologies. Advances in ultra-high field functional magnetic resonance imaging (fMRI) acquisition have enabled novel non-invasive investigations of the human auditory subcortex, including fundamental features of auditory representation such as tonotopy and periodotopy. However, functional connectivity across subcortical networks is still underexplored in humans, with ongoing development of related methods. Traditionally, functional connectivity is estimated from fMRI data with full correlation matrices. However, partial correlations reveal the relationship between two regions after removing the effects of all other regions, reflecting more direct connectivity. While most existing methods for learning conditional dependency structures based on partial correlations assume independent and identically distributed Gaussian data, fMRI data exhibit significant deviations from Gaussianity as well as high temporal autocorrelation. In this paper, we develop an autoregressive matrix-Gaussian copula graphical model approach to estimate the partial correlations, and thereby infer the functional connectivity patterns within the auditory system, while appropriately accounting for autocorrelations between successive fMRI scans. Our results are highly stable when splitting the data in halves according to the acquisition schemes and computing partial correlations separately for each half of the data, as well as across cross-validation folds. In contrast, full correlation-based analysis identified a rich network of interconnectivity that was not specific to adjacent nodes along the pathway. Overall, our results demonstrate that unique functional connectivity patterns along the auditory pathway are recoverable using novel connectivity approaches and that our connectivity methods are reliable across multiple acquisitions.
(Online - CMO)
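The partial correlations referred to above are, for Gaussian data, a simple transform of the precision (inverse covariance) matrix Ω: ρ_ij = −Ω_ij / sqrt(Ω_ii Ω_jj). A minimal sketch assuming plain i.i.d. Gaussian rows; the copula and autoregressive machinery in the talk is precisely what relaxes that assumption for fMRI data:

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix from data X (rows = observations):
    invert the sample covariance, rescale, and flip signs off the
    diagonal: rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)."""
    omega = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
print(partial_correlations(X).round(2))
```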
17:15 - 17:45 Farabi Raihan Shuvo: Bayesian Semiparametric Drift-diffusion Models for Cognitive Control Leveraging Stop-signal Tasks (Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Tuesday, September 3
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 09:45 Trevor Campbell: autoMALA: Locally adaptive Metropolis-adjusted Langevin algorithm
The Metropolis-adjusted Langevin Algorithm (MALA) is a widely used Markov chain Monte Carlo (MCMC) algorithm for Bayesian posterior inference. Like many MCMC algorithms, MALA has a “step size” parameter that must be tuned in order to obtain satisfactory performance. However, finding an adequate step size for an arbitrary target distribution can be a difficult task, and there may not even be a single step size that works well throughout the whole distribution. To resolve this issue we introduce autoMALA, a new Markov chain Monte Carlo algorithm based on MALA that automatically sets its step size at each iteration based on the local geometry of the target distribution. We prove that autoMALA has the correct invariant distribution, despite continual automatic adjustments of the step size. Our experiments demonstrate that autoMALA is competitive with related state-of-the-art MCMC methods, in terms of the number of log density evaluations per effective sample, and it outperforms state-of-the-art samplers on targets with varying geometries. Furthermore, we find that autoMALA tends to find step sizes comparable to optimally-tuned MALA when a fixed step size suffices for the whole domain.
(Conference Room San Felipe)
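For reference, one step of vanilla MALA with a fixed step size ε — the quantity autoMALA adapts locally — proposes x′ = x + (ε²/2)∇log π(x) + ε ξ with ξ standard normal, then applies a Metropolis-Hastings correction. A minimal sketch of that baseline, not the autoMALA adaptation itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala_step(x, log_pi, grad_log_pi, eps):
    """One Metropolis-adjusted Langevin step with fixed step size eps."""
    def log_q(xp, xc):  # log proposal density of xp given current xc
        mean = xc + 0.5 * eps**2 * grad_log_pi(xc)
        return -np.sum((xp - mean) ** 2) / (2 * eps**2)
    prop = x + 0.5 * eps**2 * grad_log_pi(x) + eps * rng.standard_normal(x.shape)
    log_alpha = log_pi(prop) - log_pi(x) + log_q(x, prop) - log_q(prop, x)
    return prop if np.log(rng.uniform()) < log_alpha else x

# e.g. sampling a standard normal target in two dimensions
x = np.zeros(2)
for _ in range(1000):
    x = mala_step(x, lambda z: -0.5 * z @ z, lambda z: -z, eps=0.9)
```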
09:45 - 10:15 Panayiota Touloupou: Scalable inference for epidemic models with individual level data
As individual-level epidemiological and pathogen genetic data become available in ever increasing quantities, the task of analyzing such data becomes more and more challenging. Inferences for this type of data are complicated by the fact that the data are usually incomplete, in the sense that the times of acquiring and clearing infection are not directly observed, making the evaluation of the model likelihood intractable. A solution to this problem can be given in the Bayesian framework, with unobserved data being imputed within Markov chain Monte Carlo (MCMC) algorithms at the cost of considerable extra computational effort. Motivated by this demand, we develop a novel method for updating individual-level infection states within MCMC algorithms that respects the dependence structure inherent within epidemic data. We apply our new methodology to an epidemic of Escherichia coli O157:H7 in feedlot cattle in which eight competing strains were identified using genetic typing methods. We show that surprisingly little genetic data is needed to produce a probabilistic reconstruction of the epidemic trajectories, despite some possibility of misclassification in the genetic typing. We believe that this complex model, capturing the interactions between strains, could not have been fitted using existing methodologies.
(Online - CMO)
10:15 - 10:45 Bernardo Flores López: Predictive coresets
Coresets are a family of methods, rooted in information geometry, that reduce the size of a dataset while retaining similar performance for a given learning algorithm. Traditionally this has been done by finding a set of sparse weights that minimize the KL divergence between the likelihood based on the original dataset and the one based on the weighted data. This has the disadvantage of being ill-defined for nonparametric models, where the likelihood is often intractable. We propose an alternative construction based on matching the predictive distributions over unseen data under a generalized posterior, which gives a robust estimator amenable to nonparametric priors. The performance of our method is evaluated on sRNA-Seq data, a good example of high-dimensional data where classical estimation algorithms fail to scale.
(Conference Room San Felipe)
10:45 - 11:15 Round Table & Coffee Break (Conference Room San Felipe)
11:15 - 11:45 Marta Catalano: Merging rate of opinions via optimal transport on random measures
The Bayesian approach to inference is based on a coherent probabilistic framework that naturally leads to principled uncertainty quantification and prediction. Via posterior distributions, Bayesian nonparametric models make inference on parameters belonging to infinite-dimensional spaces, such as the space of probability distributions. The development of Bayesian nonparametrics has been triggered by the Dirichlet process, a nonparametric prior that allows one to learn the law of the observations through closed-form expressions. Still, its learning mechanism is often too simplistic and many generalizations have been proposed to increase its flexibility, a popular one being the class of normalized completely random measures. Here we investigate a simple yet fundamental matter: will a different prior actually guarantee a different learning outcome? To this end, we develop a new distance between completely random measures based on optimal transport, which provides an original framework for quantifying the similarity between posterior distributions (merging of opinions). Our findings provide neat and interpretable insights on the impact of popular Bayesian nonparametric priors, avoiding the usual restrictive assumptions on the data-generating process. This is joint work with Hugo Lavenant.
(Online - CMO)
11:45 - 12:15 Sameer Deshpande: Scalable smoothing in high-dimensions with BART
Bayesian Additive Regression Trees (BART) is an easy-to-use and highly effective nonparametric regression model that approximates unknown functions with a sum of binary regression trees (i.e., piecewise-constant step functions). Consequently, BART is fundamentally limited in its ability to estimate smooth functions. Initial attempts to overcome this limitation replaced the constant output in each leaf of a tree with a realization of a Gaussian Process (GP). While these elaborations are conceptually elegant, most implementations thereof are computationally prohibitive, displaying a nearly-cubic per-iteration complexity. We propose a version of BART built with trees that output linear combinations of ridge functions; that is, our trees return linear combinations of compositions between affine transforms of the inputs and a (potentially non-linear) activation function. We develop a new MCMC sampler that updates trees in linear time. Our proposed model includes a random Fourier feature-inspired approximation to treed GPs as a special case. More generally, our proposed model can be viewed as an ensemble of local neural networks, which combines the representational flexibility of neural networks with the uncertainty quantification and computational tractability of BART.
(Conference Room San Felipe)
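The random Fourier feature approximation mentioned as a special case replaces an RBF kernel k(x, x′) = exp(−||x − x′||² / (2ℓ²)) by the inner product of randomized cosine features. A minimal sketch of the classic Rahimi-Recht construction, not the treed model itself; D and the lengthscale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, D=256, lengthscale=1.0):
    """Random Fourier features z(x) such that z(x) @ z(x') approximates
    the RBF kernel exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    n, d = X.shape
    W = rng.standard_normal((d, D)) / lengthscale  # spectral frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.standard_normal((100, 3))
Z = rff_features(X)
K_approx = Z @ Z.T  # approximates the 100 x 100 RBF Gram matrix
```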
12:15 - 12:45 Giovanni Rebaudo: Understanding partially exchangeable nonparametric priors for discrete structure
Species sampling models provide a general framework for random discrete distributions that are tailored for exchangeable data. However, they fall short when used for modeling heterogeneous data collected from related sources or distinct experimental conditions. To address this, partial exchangeability serves as the ideal probabilistic framework. While numerous models exist for partially exchangeable observations, a unifying framework analogous to species sampling models is currently missing. Thus, we introduce multivariate species sampling models, a general class of models characterized by their partially exchangeable partition probability function. They encompass existing nonparametric models for partially exchangeable data, highlighting their core distributional properties. Our results allow the study of the induced dependence structure and facilitate the development of new models. This is a joint work with Beatrice Franzolini, Antonio Lijoi, and Igor Pruenster.
(Online - CMO)
12:45 - 14:15 Lunch (Restaurant Hotel Hacienda Los Laureles)
14:15 - 15:00 François-Xavier Briol: Robust and Conjugate Gaussian Process Regression
To enable closed form conditioning, a common assumption in Gaussian process (GP) regression is independent and identically distributed Gaussian observation noise. This strong and simplistic assumption is often violated in practice, which leads to unreliable inferences and uncertainty quantification. Unfortunately, existing methods for robustifying GPs break closed-form conditioning, which makes them less attractive to practitioners and significantly more computationally expensive. In this paper, we demonstrate how to perform provably robust and conjugate Gaussian process (RCGP) regression at virtually no additional cost using generalised Bayesian inference. RCGP is particularly versatile as it enables exact conjugate closed form updates in all settings where standard GPs admit them. To demonstrate its strong empirical performance, we deploy RCGP for problems ranging from Bayesian optimisation to sparse variational Gaussian processes.
(Online - CMO)
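The closed-form conditioning that RCGP preserves is the standard conjugate GP update under i.i.d. Gaussian noise, with posterior mean K_*(K + σ²I)⁻¹y. A minimal sketch of that baseline update (RCGP modifies it via generalised Bayesian inference; details in the talk), with illustrative kernel and noise parameters:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(Xtr, y, Xte, sigma=0.1, ell=1.0):
    """Conjugate GP regression: posterior mean and covariance at Xte
    under y = f(x) + eps, eps ~ N(0, sigma^2), f ~ GP(0, rbf)."""
    K = rbf(Xtr, Xtr, ell) + sigma**2 * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    cov = rbf(Xte, Xte, ell) - v.T @ v
    return mean, cov

Xtr = np.linspace(0, 1, 20)[:, None]
y = np.sin(6 * Xtr[:, 0])
m, C = gp_posterior(Xtr, y, np.linspace(0, 1, 50)[:, None])
```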
15:00 - 15:30 Eli Weinstein: Nonparametrically-perturbed parametric Bayesian models: robustness, efficiency and approximations
Parametric Bayesian modeling offers a powerful and flexible toolbox for scientific data analysis. Yet it often faces a basic challenge: the model, however detailed, may still be wrong, and this can make inferences untrustworthy. In this project we study nonparametrically perturbed parametric (NPP) Bayesian models, in which a parametric Bayesian model is relaxed via a nonparametric distortion of its likelihood. In particular, we analyze the properties of NPP models when the target of inference is the true data distribution or some functional of it, as is often the case in causal inference. We show that NPP models offer the robustness of nonparametric models while retaining the data efficiency of parametric models, achieving fast convergence when the parametric model is close to the truth. We then develop a practical generalized Bayes procedure which inherits the key properties of NPP models at much lower computational cost. Overall, we argue that NPP models offer a robust, efficient and black-box approach to Bayesian inference in general and causal Bayesian inference in particular.
(Conference Room San Felipe)
15:30 - 16:00 Lorenzo Capello: Scalable Bayesian inference for Coalescent Models
The observed sequence variation at a locus informs about the evolutionary history of the sample and past population size dynamics. The Kingman coalescent (and its extensions) is commonly used in a generative model of molecular sequence variation to infer evolutionary parameters. However, it is well understood that inference under this model does not scale well with sample size. In the talk, we will discuss a few attempts to tackle this issue. The first is a lower-resolution coalescent model: here, we aim at scalable inference via a model with a drastically smaller state space. A second line of research pursues a different algorithm for inference: here, we leverage advances in approximate Bayesian inference from the last decade and customize them to this specific setting.
(Conference Room San Felipe)
16:00 - 16:30 Round Table & Coffee Break (Conference Room San Felipe)
16:30 - 17:00 Georgia Papadogeorgou: Spatial causal inference in the presence of unmeasured confounding and interference
We discuss concepts from causal inference and spatial statistics, presenting novel insights for causal inference in spatial data analysis, and establishing how tools from spatial statistics can be used to draw causal inferences. We introduce spatial causal graphs to highlight that spatial confounding and interference can be entangled, in that investigating the presence of one can lead to wrongful conclusions in the presence of the other. Moreover, we show that spatial dependence in the exposure variable can render standard analyses invalid, which can lead to erroneous conclusions. To remedy these issues, we propose a Bayesian parametric approach based on tools commonly-used in spatial statistics. This approach simultaneously accounts for interference and mitigates bias resulting from local and neighborhood unmeasured spatial confounding. From a Bayesian perspective, we show that incorporating an exposure model is necessary, and we theoretically prove that all model parameters are identifiable, even in the presence of unmeasured confounding.
(Online - CMO)
17:00 - 17:15 Falco Joannes Bargagli Stoffi: Confounder-Dependent Bayesian Mixture Model: Characterizing Heterogeneity of Causal Effects
Several epidemiological studies have provided evidence that long-term exposure to fine particulate matter (PM2.5) increases mortality. Furthermore, some population characteristics (e.g., age, race, and socioeconomic status) might play a crucial role in understanding vulnerability to air pollution. To inform policy, it is necessary to identify groups of the population that are more or less vulnerable to air pollution. In the causal inference literature, the Group Average Treatment Effect (GATE) is a distinctive facet of the conditional average treatment effect. This widely employed metric serves to characterize the heterogeneity of a treatment effect based on some population characteristics. In this paper, we introduce a novel Confounder-Dependent Bayesian Mixture Model (CDBMM) to characterize causal effect heterogeneity. More specifically, our method leverages the flexibility of the dependent Dirichlet process to model the distribution of the potential outcomes conditionally on the covariates and the treatment levels, thus enabling us to: (i) identify heterogeneous and mutually exclusive population groups defined by similar GATEs in a data-driven way, and (ii) estimate and characterize the causal effects within each of the identified groups. Through simulations, we demonstrate the effectiveness of our method in uncovering key insights about treatment effect heterogeneity. Applying our method to claims data from Medicare enrollees in Texas, we find six mutually exclusive groups in which the causal effects of PM2.5 on mortality are heterogeneous.
(Online - CMO)
17:15 - 17:45 Dafne Zorzetto: Bayesian Nonparametrics for Principal Stratification with Continuous Post-Treatment Variables
Principal stratification provides a causal inference framework that allows adjustment for confounded post-treatment variables when comparing treatments. While the literature has mainly focused on binary post-treatment variables, principal stratification with continuous post-treatment variables is gaining increasing attention, and several emerging challenges must be carefully considered. Characterizing the latent principal strata presents a significant challenge that directly impacts the selection of models and the estimation of the principal causal effect. This challenge is further complicated in observational studies where the treatment is not randomly assigned to the units. We develop a novel approach for principal stratification with continuous post-treatment variables leveraging a data-driven method. Our approach exploits Bayesian nonparametric priors for detecting the principal strata, defines novel principal causal effects, and provides a full quantification of the principal strata membership uncertainty. More specifically, we introduce the Confounders-Aware Shared-atoms BAyesian mixture (CASBAH), where a dependent Dirichlet process with shared atoms across treatment levels allows us to adjust for confounding bias and share information between treatment levels while estimating the principal strata membership. Through Monte Carlo simulations, we show that the proposed methodology has excellent performance in characterizing the latent principal strata and estimating the effects of treatment on post-treatment variables and outcomes. Our proposed method is applied to a case study where we estimate the causal effects of U.S. national air quality regulations on pollution levels and health outcomes.
(Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Wednesday, September 4
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 09:45 Yang Ni: Global-Local Dirichlet Processes for Identifying Pan-Cancer Subpopulations Using Both Shared and Cancer-Specific Data
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is common in cancer genomics where the molecular information is usually accompanied by cancer-specific clinical information. Existing grouped clustering methods only consider the shared variables, thereby ignoring valuable information from the cancer-specific variables. To allow for these cancer-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the “global-local” structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model, which leads to an efficient posterior inference algorithm. We illustrate our model with extensive simulations and a real pan-gastrointestinal cancer dataset. The cancer-specific clinical variables included carcinoembryonic antigen level, patients’ body mass index, and the number of cigarettes smoked per day. These important clinical variables refine the clusters of gene expression data and allow us to identify finer sub-clusters, which is not possible in their absence.
(Conference Room San Felipe)
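The stick-breaking representation used above to characterize the GLocal Dirichlet process is, for an ordinary DP(α, G0), π_k = v_k ∏_{j<k}(1 − v_j) with v_k ~ Beta(1, α). A minimal truncated sampler for the ordinary DP weights (the global-local extension is the talk's contribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, K):
    """Truncated stick-breaking weights of a DP(alpha, G0):
    pi_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

pi = dp_stick_breaking(alpha=2.0, K=50)
print(pi.sum())  # close to 1 for a generous truncation level
```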
09:45 - 10:15 Mario Beraha: Learning to (approximately) count via Bayesian nonparametrics
We study how to recover the frequency of a symbol in a large discrete data set, using only a compressed representation, or sketch, of those data obtained via random hashing. This is a classical problem in computer science, with various algorithms available, such as the count-min sketch. However, these algorithms often assume that the data are fixed, leading to overly conservative and potentially inaccurate estimates when dealing with randomly sampled data. In this paper, we consider the sketched data as a random sample from an unknown distribution, and then we introduce novel estimators that improve upon existing approaches. Our method combines Bayesian nonparametric and classical (frequentist) perspectives, addressing their unique limitations to provide a principled and practical solution. Additionally, we extend our method to address the related but distinct problem of cardinality recovery, which consists of estimating the total number of distinct objects in the data set. We validate our method on synthetic and real data, comparing its performance to state-of-the-art alternatives.
(Online - CMO)
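For reference, the count-min sketch mentioned above stores counts in a d × w array via d hash functions and answers a frequency query with the minimum over rows, giving an upper-biased estimate with high-probability error bounds. A minimal sketch of the classical (non-Bayesian) algorithm, with illustrative sizes:

```python
import numpy as np

class CountMin:
    """Count-min sketch: d rows of width w; query returns the minimum
    of the hashed counters, an upper bound on the true frequency."""
    def __init__(self, d=5, w=2048, seed=0):
        self.counts = np.zeros((d, w), dtype=np.int64)
        self.seeds = np.random.default_rng(seed).integers(0, 2**31, size=d)
        self.w = w

    def _cols(self, item):
        return [hash((int(s), item)) % self.w for s in self.seeds]

    def update(self, item, count=1):
        for r, c in enumerate(self._cols(item)):
            self.counts[r, c] += count

    def query(self, item):
        return min(self.counts[r, c] for r, c in enumerate(self._cols(item)))

cm = CountMin()
for token in ["a", "b", "a", "c", "a"]:
    cm.update(token)
print(cm.query("a"))  # >= 3; equals 3 unless hash collisions occur
```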
10:15 - 10:45 Round Table & Coffee Break (Conference Room San Felipe)
10:45 - 11:30 Bernardo Nipoti: Clustering multiple network data: a Bayesian nonparametric approach
A popular approach to the problem of clustering multiple network data makes use of distance metrics that measure the similarity among networks based on some of their global or local characteristics. In this context, we propose a novel Bayesian nonparametric approach to model undirected labeled graphs sharing the same set of vertices, which allows us to identify clusters of networks characterized by similar patterns in the connectivity of nodes. Our construction relies on the definition of a location-scale Dirichlet process mixture of centered Erdos-Renyi kernels. An efficient Markov chain Monte Carlo scheme is proposed to carry out posterior inference and provide a convenient clustering of the multiple network data, while the number of clusters in the population is not set a priori but inferred from the data. The performance of our approach is investigated by means of the analysis of synthetic data as well as a dataset on brain networks.
(Conference Room San Felipe)
11:30 - 12:00 Francesco Gaffi: An invariance-based approach to node clustering in dynamic networks
In network analysis, understanding the dynamics of evolving networks is often of paramount importance. We introduce and study a novel class of models to detect evolving communities underlying dynamic network data. The methods build upon the established literature on stochastic block models and extend it to accommodate temporal evolution. The cornerstone of our approach is the interplay of random partitions induced by hierarchical normalized completely random measures and the assumption of conditional partial exchangeability, a recently introduced modeling principle for capturing the dynamics of evolving partitions within a Bayesian framework. Our methodology effectively addresses the limitations inherent in traditional static community detection methods, and in contrast with other dynamic extensions of the classical stochastic block models, provides flexibility and built-in uncertainty quantification, while inducing a form of distributional invariance coherent with a time-evolving clustering scheme. Joint work with Beatrice Franzolini.
(Conference Room San Felipe)
12:00 - 12:30 Beatrice Franzolini: Conditional partial exchangeability: longitudinal and multi-view partitions
Standard clustering techniques assume a common configuration for all features and times in a dataset. However, when dealing with longitudinal or multi-view data, the number of clusters, the clusters' frequencies, and the clusters' shapes may need to vary across time or domains, requiring the estimation of a collection of clustering configurations to accurately capture data heterogeneity and time dynamics. Nonetheless, popular techniques for dynamic clustering fail to account for within-subject dependence across time points and views, ignoring subject identities. A similar problem is encountered in stochastic block models for longitudinal or multiplex network data. To overcome this limitation, we introduce a wide class of Bayesian mixture models that induce a collection of dependent random partitions, where dependence is introduced at the subject level. The core concept is conditional partial exchangeability, a novel probabilistic paradigm that ensures analytical and computational tractability while defining a flexible law for dependent random partitions of the same objects across time, space, or domains.
(Online - CMO)
12:30 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:30 - 19:00 Free Afternoon (Oaxaca)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Thursday, September 5
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 09:30 Francesca Panero: Modeling sparse networks with Bayesian Nonparametrics
The graphex is a statistical framework for modeling random graphs. It is particularly flexible, in that it allows us to describe dense and sparse networks, different degree distributions (power-law included), and positive clustering. After introducing the general graphex framework, I will explain how we model networks embedded in a latent space and use this model to help explain the structures underlying commuting patterns. I will also introduce a model for dynamic networks that describes communities varying over time.
(Conference Room San Felipe)
09:30 - 10:00 Deborah Sulem: Scalable Bayesian inference for Gaussian Graphical Models
Gaussian graphical models (GGMs) are widely used to analyse the dependence structure among variables. However, when the number of observed variables is large, the computational demands of estimating a high-dimensional graphical model have limited the scope of applications. In this work, we introduce a scalable, interpretable, and fully Bayesian method for estimating a high-dimensional GGM. Our method capitalises on a discrete spike-and-slab parametrisation of the prior distribution, which allows us to infer a sparse graphical model, and on an efficient block Gibbs sampler. Moreover, we propose an almost-parallel version of our sampling algorithm which exploits the relationship between the conditional dependence structure and a linear regression model. This strategy facilitates decomposing the high-dimensional estimation problem into sub-components, allowing the application of efficient inference methodology originally developed for linear regression.
(Online - CMO)
10:00 - 10:30 Felipe Medina: Speeding up Inference on genetic trees and graphs
Latent position graphical models (LPMs) are praised by practitioners for their desirable theoretical properties and are particularly easy to interpret. However, their inference is notably challenging, as the computational cost of existing methods scales with the square of the number of nodes. We consider an approximation of the likelihood function based on a discretization of the latent space: the level of noise introduced by the approximation can be arbitrarily reduced at the expense of computational efficiency. We establish several theoretical results that show how the likelihood error propagates to the invariant distribution of an MCMC algorithm designed to sample the posterior distribution of an LPM.
(Conference Room San Felipe)
10:30 - 11:00 Round Table & Coffee Break (Conference Room San Felipe)
11:00 - 11:30 Tianjian Zhou: On Bayesian Sequential Clinical Trial Designs
Clinical trials usually involve sequential patient entry. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for stopping the trial early. We review Bayesian sequential clinical trial designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. A pertinent question is whether Bayesian sequential designs need to be adjusted for the planning of interim analyses. We answer this question from three perspectives: a frequentist-oriented perspective, a calibrated Bayesian perspective, and a subjective Bayesian perspective. We also provide new insights into the likelihood principle, which is commonly tied to statistical inference and decision making in sequential clinical trials. Some theoretical results are derived, and numerical studies are conducted to illustrate and assess these designs.
(Online - CMO)
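A common posterior-probability design of the kind reviewed here: with a Beta(a, b) prior on a response rate θ, stop for efficacy at an interim look if P(θ > θ0 | data) exceeds a threshold. A minimal sketch with illustrative numbers (θ0, prior, and cutoff are not from the talk):

```python
from scipy.stats import beta

def stop_for_efficacy(responses, n, theta0=0.3, a=1.0, b=1.0, cutoff=0.95):
    """Interim rule: stop if the posterior probability that the response
    rate exceeds theta0 passes the cutoff, under a Beta(a, b) prior.
    Posterior is Beta(a + responses, b + n - responses)."""
    post_prob = 1.0 - beta.cdf(theta0, a + responses, b + n - responses)
    return post_prob > cutoff, post_prob

# e.g. 12 responses among 20 patients at an interim analysis
print(stop_for_efficacy(12, 20))
```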
11:30 - 12:00 Roberta de Vito: Bayesian multi-study techniques to learn reproducible signals across studies
Integrating multiple studies is crucial for extracting reproducible knowledge from epidemiological and biological data. This integration involves two key challenges: (1) capturing the information common to all studies and (2) capturing the study-specific sources in individual studies. The Bayesian Multi-study Factor model addresses these two challenges and handles multiple studies. We propose a novel sparse Bayesian Multi-study Factor model by adopting a latent factor regression approach. This generalization recovers the study-specific and common components while keeping track of observed variables, such as demographic information. We consider different priors (local and non-local) to detect the latent dimension, enhancing reconstruction of the factor cardinality. A user-defined prior dispersion for the regression coefficients accounts for population structure and other subject characteristics. We assess the characteristics of our method using different simulation settings and clarify the benefits of our model, which yields better accuracy and precision. We illustrate the advantages of our method through a nutritional epidemiology application that identifies study-specific and shared signals across studies while accounting for covariates such as smoking, alcohol, and age.
(Conference Room San Felipe)
12:00 - 12:30 Yunshan Duan: Self-supervised learning with Gaussian Processes
Self-supervised learning (SSL) is a machine learning paradigm where models learn the underlying structure of data without explicit supervision from labeled samples. The representations acquired from SSL have proven useful for many downstream tasks, including clustering and linear classification. To ensure smoothness of the representation space, most SSL methods rely on the ability to generate observations that are similar to a given instance. However, generating these pairs may be challenging for many types of data. Moreover, these methods lack consideration of uncertainty quantification and can perform poorly in out-of-sample prediction settings. To address these limitations, we propose Gaussian process self-supervised learning (GPSSL), a novel approach that utilizes Gaussian process (GP) models for representation learning. Gaussian process priors are imposed on the representations, and we obtain a generalized Bayesian posterior by minimizing a loss function that encourages useful representations. The covariance function inherent in GPs naturally pulls representations of similar units together, serving as an alternative to explicitly defined positive samples. We show that GPSSL is closely related to both kernel PCA and VICReg, a popular neural network-based SSL method, but unlike both allows for posterior uncertainties that can be propagated to downstream tasks. Experiments on various datasets, considering classification and regression tasks, demonstrate that GPSSL outperforms traditional methods in terms of accuracy, uncertainty quantification, and error control.
(Conference Room San Felipe)
12:30 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:30 - 14:15 Veronica Berrocal: Bayesian spatial modeling for applications in geophysical sciences
Perhaps more than in other sciences, geophysical sciences are often concerned with questions around predictions, both in space and in time. These predictions are in turn often used to inform decisions. As such, an appropriate quantification of uncertainty is fundamental, making the Bayesian paradigm a natural inferential framework for these disciplines. In this talk we will present and discuss three Bayesian spatial models developed with the goal of (i) understanding the impact of climate factors on crop yield and identifying areas in need of soil and water management; (ii) determining regions where to collect soil samples for carbon stock assessment; and (iii) identifying the main sources of particulate matter pollution in California.
(Online - CMO)
14:15 - 14:45 Francesco Denti: Of mice and music: finite-infinite nested priors for the segmentation of large-scale grouped data
Over the last few years, the Bayesian community has dedicated increased attention to mixture priors inducing nested random partitions. Models based on these priors allow the estimation of a two-layered partition over grouped data: across groups and across observations. We focus on nested models based on shared observational atoms, which permit observational clusters to spread across all the distributional groups. We introduce a novel finite-infinite nested model to overcome the high prior correlation between random mixing measures imposed by fully nonparametric common atoms models. This specification also enables the development of fast algorithms for posterior inference. Indeed, the tractability of the proposed prior grants the derivation of tailored mean-field variational inference algorithms, which scale up the applicability of Bayesian nested mixture models to large datasets. Such a computational strategy is highly efficient, and the accuracy of the posterior density estimate and the estimated partition is comparable with a standard Gibbs sampler algorithm. To showcase the applicability of the proposed framework, we illustrate how this prior can be embedded in more complex models motivated by real-world problems. In particular, we introduce and compare two models: one devised for clustering Spotify artists based on the characteristics of their songs and the other created to segment large mass-spectrometry imaging matrix data.
(Conference Room San Felipe)
14:45 - 15:15 Arman Oganisian: Bayesian Semiparametrics for Sequential Decision-Making with Incomplete Information: Applications in Acute Myeloid Leukemia
Chronic diseases are managed over time via a sequence of treatment decisions, each of which is made conditional on the patient's current disease state and history. Time-varying covariates are often used to tailor the treatment, but monitoring of these covariates is sporadic, leading to non-monotone missingness. In our motivating example, patients with pediatric AML enrolled in the AAML1031 trial move through a sequence of treatment courses, at each of which a decision is made whether to withhold scheduled anthracycline (ACT) chemotherapy. Since ACT is cardiotoxic, echocardiograms are sometimes, but not always, conducted ahead of each course to inform the withholding decision. We construct dynamic joint monitoring-treatment rules and are interested in estimating the effect of different rules on survival. Bayesian semiparametric transition models are used to model patients' transitions between treatment courses (recurrent states) and death (an absorbing state) in continuous time, conditional on past monitoring-treatment patterns. To regularize the models, we construct a class of autoregressive priors that smooth the transition models across both time and monitoring patterns. The regularized models are embedded in a Bayesian g-computation procedure that draws from the posterior distribution of the causal survival probability under various rules.
(Conference Room San Felipe)
15:15 - 15:45 Round Table & Coffee Break (Conference Room San Felipe)
15:45 - 16:15 Mauricio Tec: Eliciting Behavioral Bayesian Priors for Reinforcement Learning from Large Language Models (Conference Room San Felipe)
16:15 - 16:45 Gemma Moran: Identifiable deep generative models via sparse decoding
We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. As examples: in ratings data, each movie is only described by a few genres; in text data, each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller held-out reconstruction error than related methods.
(Conference Room San Felipe)
16:45 - 17:15 Jack Jewson: Differentially Private Statistical Inference through beta-Divergence One Posterior Sampling
Differential privacy guarantees allow the results of a statistical analysis involving sensitive data to be released without compromising the privacy of any individual taking part. Achieving such guarantees generally requires the injection of noise, either directly into parameter estimates or into the estimation process. Instead of artificially introducing perturbations, sampling from Bayesian posterior distributions has been shown to be a special case of the exponential mechanism, producing consistent and efficient private estimates without altering the data generative process. The application of current approaches has, however, been limited by their strong bounding assumptions, which do not hold for basic models such as simple linear regressors. To ameliorate this, we propose betaD-Bayes, a posterior sampling scheme from a generalised posterior targeting the minimisation of the beta-divergence between the model and the data generating process. This provides private estimation that is generally applicable without requiring changes to the underlying model and consistently learns the data generating parameter. We show that betaD-Bayes produces more precise estimation for the same privacy guarantees, and further facilitates differentially private estimation via posterior sampling for complex classifiers and continuous regression models such as neural networks for the first time.
(Online - CMO)
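For context, the generalised posterior sampled by betaD-Bayes exponentiates the negative beta-divergence loss. A standard form of that loss, up to terms constant in θ (conventions vary by a rescaling across papers, so take the constants here as an assumption):

```latex
% beta-divergence loss for one observation x under model density f_theta
\ell_\beta(x, \theta) = -\frac{1}{\beta}\, f_\theta(x)^{\beta}
  + \frac{1}{\beta + 1} \int f_\theta(y)^{\beta + 1}\, dy

% betaD-Bayes draws from the resulting generalised (Gibbs) posterior
\pi_\beta(\theta \mid x_{1:n}) \propto \pi(\theta)\,
  \exp\Big\{ -\sum_{i=1}^{n} \ell_\beta(x_i, \theta) \Big\}
```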
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Friday, September 6
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:30 Informal Interaction and Collaboration (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 13:30 Informal Interaction and Collaboration (Conference Room San Felipe)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)