# Statistics

## Scalable approximation of integrals using non-reversible methods: from Riemann to Lebesgue, and why you should care

How can we approximate intractable integrals? This is an old problem that remains a pain point in many disciplines (including mine, Bayesian inference, but also statistical mechanics, computational chemistry, combinatorics, etc.).

The vast majority of current work on this problem (HMC, SGLD, variational) is based on mimicking the field of optimization, in particular gradient-based methods, and as a consequence focusses on Riemann integrals. This severely limits the applicability of these methods, making them inadequate for the wide range of problems requiring the full expressivity of Lebesgue integrals, for example integrals over phylogenetic tree spaces or other mixed combinatorial-continuous problems arising in network models, record linkage and feature allocation.

I will describe novel perspectives on the problem of approximating Lebesgue integrals, coming from the nascent field of non-reversible Monte Carlo methods. In particular, I will present an adaptive, non-reversible Parallel Tempering (PT) allowing MCMC exploration of challenging problems such as single cell phylogenetic trees.

Analyzing the behaviour of PT algorithms under a novel asymptotic regime reveals a sharp divide in the behaviour and performance of reversible versus non-reversible PT schemes: the performance of the former eventually collapses as the number of parallel cores increases, whereas the latter benefit from arbitrarily many available parallel cores. These theoretical results are exploited to develop an adaptive scheme approximating the optimal annealing schedule.
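To make the reversible versus non-reversible distinction concrete, here is a minimal Python sketch of parallel tempering on a toy one-dimensional bimodal target. It is an illustration under stated assumptions, not the adaptive algorithm from the talk: the target, the Gaussian reference, the fixed equally spaced annealing schedule and all function names are hypothetical choices. The non-reversible variant is assumed to alternate deterministically between even and odd neighbouring pairs, while the reversible variant picks the parity at random.

```python
import math
import random


def log_target(x):
    # Toy bimodal target (assumption): equal mixture of N(-3, 0.5^2) and N(3, 0.5^2), unnormalised.
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_reference(x):
    # Easy-to-explore reference (assumption): N(0, 5^2), unnormalised.
    return -0.5 * (x / 5.0) ** 2


def log_annealed(x, beta):
    # Annealing path between reference (beta = 0) and target (beta = 1).
    return (1.0 - beta) * log_reference(x) + beta * log_target(x)


def accept(rng, log_ratio):
    # Metropolis acceptance test for a given log acceptance ratio.
    return rng.random() < math.exp(min(0.0, log_ratio))


def parallel_tempering(n_iters, betas, reversible=False, seed=0):
    rng = random.Random(seed)
    n = len(betas)
    states = [0.0] * n
    samples = []
    proposed = accepted = 0
    for it in range(n_iters):
        # Local exploration: one random-walk Metropolis step per chain.
        for i, beta in enumerate(betas):
            prop = states[i] + rng.gauss(0.0, 1.0)
            if accept(rng, log_annealed(prop, beta) - log_annealed(states[i], beta)):
                states[i] = prop
        # Communication: swap proposals between neighbouring chains.
        # Non-reversible: parity alternates deterministically (even, odd, even, ...).
        # Reversible: parity is drawn at random at each iteration.
        parity = rng.randrange(2) if reversible else it % 2
        for i in range(parity, n - 1, 2):
            log_ratio = (
                log_annealed(states[i + 1], betas[i])
                + log_annealed(states[i], betas[i + 1])
                - log_annealed(states[i], betas[i])
                - log_annealed(states[i + 1], betas[i + 1])
            )
            proposed += 1
            if accept(rng, log_ratio):
                states[i], states[i + 1] = states[i + 1], states[i]
                accepted += 1
        samples.append(states[-1])  # keep the chain at beta = 1 (the target)
    return samples, accepted / max(proposed, 1)


# Fixed, equally spaced schedule (assumption; the talk describes an adaptive one).
betas = [k / 7 for k in range(8)]
samples, rate = parallel_tempering(2000, betas)
print(f"swap acceptance rate: {rate:.2f}")
```

Passing `reversible=True` runs the random-parity variant on the same schedule, which gives a simple way to compare the two swap schemes on this toy problem.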

My group is also interested in making these advanced non-reversible Monte Carlo methods easily available to data scientists. To do so, we have designed a Bayesian modelling language to perform inference over arbitrary data types using non-reversible, highly parallel algorithms.

## Statistical and Data Science

Statistical science has a 200-year history of advances in theory and application. Data science is a relatively newly defined area of enquiry deriving from big data. The interplay between them, and their interactions with science, are a topic of ongoing discussion among statisticians. Some thoughts on this interplay and the role of the formal use of probability will be presented.

## Bayesian study design for nonlinear systems: an animal disease transmission experiment case study

Experimental design is a branch of statistics focused upon designing experimental studies in a way that maximizes the amount of salient information produced by the experiment. It is a topic which has been well studied in the context of linear systems. However, many physical, biological, economic, financial and engineering systems of interest are inherently non-linear in nature. Experimental design for non-linear models is complicated by the fact that the optimal design depends upon the parameters that we are using the experiment to estimate. A Bayesian, often simulation-based, framework is a natural setting for such design problems. We will illustrate the use of such a framework by considering the design of an animal disease transmission experiment where the underlying goal is to identify some characteristics of the disease dynamics (e.g. a vaccine effect, or the infectious period).

## Bi-cross-validation for factor analysis

Factor analysis is a core technique in applied statistics with implications for biology, education, finance, psychology and engineering. It represents a large matrix of data through a small number k of latent variables or factors. Despite more than 100 years of use, it remains challenging to choose k from the data. Ad hoc and subjective methods are popular, but they are subject to confirmation bias and do not scale to automatic uses. There are many recent tools in random matrix theory (RMT) that apply to the factor analysis setting, so long as the noise has constant variance. Real data usually involve heteroscedasticity, which foils those techniques. There are also tools in the econometrics literature, but those apply mostly to the strong factor setting, unlike RMT, which handles weaker factors. The best published method is parallel analysis, but that is only justified by simulations. We propose a bi-cross-validation approach that holds out some rows and some columns of the data matrix and predicts the held-out data via a factor analysis on the held-in data. We also use simulations to justify the method, though our simulations are designed using recent findings from RMT. The new approach outperforms previous methods that we found, as measured by recovery of a true underlying factor matrix.

This is joint work with Jingshu Wang of Stanford University.

**Biosketch**: Art Owen is a professor of statistics at Stanford University. He is best known for developing empirical likelihood and randomized quasi-Monte Carlo. Empirical likelihood is an inferential method that uses a data driven likelihood without requiring the user to specify a parametric family of distributions. It yields very powerful tests and is used in econometrics. Randomized quasi-Monte Carlo sampling is a quadrature method that can attain nearly O(n^-3) mean squared errors on smooth enough functions. It is useful in valuation of options and in computer graphics. His present research interests focus on large scale data matrices. Professor Owen's teaching is focused on doctoral applied courses including linear modeling, categorical data, and stochastic simulation (Monte Carlo).

## The long road to 0.075: a statistician’s perspective of the process for setting ozone standards

The presentation will take us along the road to the ozone standard for the United States, announced in March 2008 by the US Environmental Protection Agency, and then the new proposal in 2014. That agency is responsible for monitoring that nation’s air quality standards under the Clean Air Act of 1970. I will describe how I, a Canadian statistician, came to serve on the US Clean Air Scientific Advisory Committee (CASAC) for Ozone that recommended the standard and my perspectives on the process of developing it. I will introduce the rich cast of players involved including the Committee, the EPA staff, “blackhats,” “whitehats,” “gunslingers,” politicians and an unrevealed character waiting in the wings who appeared onstage only as the 2008 standards had been formulated. And we will encounter a couple of tricky statistical problems that arose along with approaches, developed by the speaker and his co-researchers, which could be used to address them. The first was about how a computational model based on things like meteorology could be combined with statistical models to infer a certain unmeasurable but hugely important ozone level, the “policy related background level” generated by things like lightning, below which the ozone standard could not go. The second was about estimating the actual human exposure to ozone that may differ considerably from measurements taken at fixed site monitoring locations. Above all, the talk will be a narrative about the interaction between science and public policy - in an environment that harbors a lot of stakeholders with varying but legitimate perspectives, a lot of uncertainty in spite of the great body of knowledge about ozone and above all, a lot of potential risk to human health and welfare.

## Projecting the Uncertainty of Sea Level Rise Using Climate Models and Statistical Downscaling

Most global climate models do not estimate sea level directly. A semi-empirical approach is to relate sea level change to temperature and then apply this relationship to climate model projections of temperature for different future scenarios. Another possibility is to estimate the relationship between global mean temperature in historical runs of a model and instead apply this relationship to future temperature projections. We compare these two methods to estimate global annual mean sea level and assess the resulting uncertainty. Of more practical importance is to estimate local sea level. We exemplify this by developing models for projected sea level rise in Vancouver and Washington State and illustrate different sources of uncertainty in the projections.

**BIO**: Peter Guttorp is a Professor of Statistics, Guest Professor at the Norwegian Computing Center, Project Leader for SARMA, the Nordic Network on Statistical Approaches to Regional Climate Models for Adaptation, Co-director of STATMOS, the Research Network on Statistical Methods for Atmospheric and Ocean Sciences, Adjunct Professor of Statistics at Simon Fraser University and member of the interdisciplinary faculties in Quantitative Ecology and Resource Management and Urban Design and Planning. He obtained a degree from the Stockholm School of Journalism in 1969, a B.S. in mathematics, mathematical statistics and musicology from Lund University, Sweden, in 1974, a Ph.D. in statistics from the University of California at Berkeley in 1980 and a Tech.D. h.c. from Lund University in 2009. He joined the University of Washington faculty in September 1980.

Dr. Guttorp’s research interests include uses of stochastic models in scientific applications in hydrology, atmospheric science, geophysics, environmental science, and hematology. He is a fellow of the American Statistical Association and an elected member of the International Statistical Institute. During 2004-2005 he was the Environmental Research Professor of the Swedish Institute of Graduate Engineers, and in 2014 he was one of the Chalmers Jubilee Professors.

## The Lasso: A Brief Review and a New Significance Test

Tibshirani will review the lasso method and show an example of its utility in cancer diagnosis via mass spectrometry. He will then consider testing the significance of the terms in a fitted regression, fit via the lasso. He will present a novel test statistic for this problem and show that it has a simple asymptotic null distribution. This work builds on the least angle regression approach for fitting the lasso, and the notion of degrees of freedom for adaptive models (Efron 1986) and for the lasso (Efron et al. 2004, Zou et al. 2007). He will give examples of this procedure, discuss extensions to generalized linear models and the Cox model, and describe an R language package for its computation.

This work is joint with Richard Lockhart (Simon Fraser University), Jonathan Taylor (Stanford) and Ryan Tibshirani (Carnegie Mellon).

## The Emerging Roles and Computational Challenges of Stochasticity in Biological Systems

In recent years it has become increasingly clear that stochasticity plays an important role in many biological processes. Examples include bistable genetic switches, noise enhanced robustness of oscillations, and fluctuation enhanced sensitivity or “stochastic focusing”. Numerous cellular systems rely on spatial stochastic noise for robust performance. We examine the need for stochastic models, report on the state of the art of algorithms and software for modeling and simulation of stochastic biochemical systems, and identify some computational challenges.

## Sparse Linear Models

In a statistical world faced with an explosion of data, regularization has become an important ingredient. In many problems, we have many more variables than observations, and the lasso penalty and its hybrids have become increasingly useful. This talk presents a general framework for fitting large scale regularization paths for a variety of problems. We describe the approach, and demonstrate it via examples using our R package GLMNET. We then outline a series of related problems using extensions of these ideas. This is joint work with Jerome Friedman, Rob Tibshirani and Noah Simon.
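As a rough illustration of what fitting a regularization path involves, the following Python sketch computes lasso solutions over a decreasing grid of penalties using cyclic coordinate descent with warm starts. It is a toy, standard-library-only sketch and not the GLMNET implementation: the function names, the penalty grid and the objective scaling (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 are assumptions made here, and the data are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(z, gamma):
    # Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0).
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, beta=None, n_sweeps=200, tol=1e-10):
    # Minimise (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent.
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (coordinate j removed).
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_value = soft_threshold(rho, lam) / col_sq[j]
            delta = new_value - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_value
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def lasso_path(X, y, n_lambdas=20, ratio=1e-3):
    # Solve along a log-spaced grid from lam_max (all coefficients zero) down to
    # ratio * lam_max, warm-starting each fit from the previous solution.
    n, p = len(X), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))
    lams = [lam_max * ratio ** (k / (n_lambdas - 1)) for k in range(n_lambdas)]
    path = []
    beta = [0.0] * p
    for lam in lams:
        beta = lasso_coordinate_descent(X, y, lam, beta)
        path.append((lam, list(beta)))
    return path


# Tiny made-up data set, for illustration only.
X = [[1.0, 0.5], [-1.0, 0.3], [0.5, -1.0], [-0.5, 0.2]]
y = [2.0, -1.8, 0.4, -0.6]
for lam, coefs in lasso_path(X, y, n_lambdas=5):
    print(round(lam, 4), [round(c, 4) for c in coefs])
```

Warm starts are the design choice worth noting: because neighbouring penalties tend to have similar solutions, each fit along the grid typically needs only a few sweeps.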

Trevor Hastie is noted for his many contributions to the statistician’s toolbox of flexible data analysis methods. Beginning with his PhD thesis, Trevor developed a nonparametric version of principal components analysis, terming the methodology principal curves and surfaces. During the years after his PhD, as a member of the AT&T Bell Laboratories statistics and data analysis research group, Trevor developed techniques for linear, generalized linear, and additive models and worked on the development of S, the precursor of R. Much of this work is contained in the well-known Statistical Models in S (co-edited with John Chambers, 1991). In the book Generalized Additive Models (1990) Trevor and co-author Rob Tibshirani modified techniques like multiple linear regression and logistic regression to allow for smooth modeling while avoiding the usual dimensionality problems. In 1994, Trevor left Bell Labs for Stanford University, to become Professor in Statistics and Biostatistics. Trevor has applied his skills to research in machine learning. His book The Elements of Statistical Learning (with Rob Tibshirani and Jerry Friedman, Springer 2001; second edition 2009) is famous for providing a readable account of flexible techniques for high dimensional data. This popular book expertly bridges the philosophical and research gap between computer scientists and statisticians.


## Visualising data with ggplot2

This tutorial will introduce you to the theory and practice of ggplot2. I'll introduce you to the rich theory that underlies ggplot2, and then we'll get our hands dirty making graphics to help understand data. I'll also point you towards resources where you can learn more, and highlight some of the other packages that work hand in hand with ggplot2 to make data analysis easy.

You will have the opportunity to practice what you learn, so please bring along your laptop, with the latest version of R installed. Make sure that your version of ggplot2 is up-to-date by running install.packages("ggplot2").

To get the most out of the course, I'd recommend that you're already comfortable with R: you know how to get your data into R, you've done some graphics (base or lattice) in the past, and you've written an R function.
