# Statistics Theory

## Depth Functions in Multivariate & Other Data Settings: Concepts, Perspectives, Tools, & Applications

Depth functions were developed to extend the univariate notions of median, quantiles, ranks, signs, and order statistics to the setting of multivariate data. Whereas a probability density function measures local probability weight, a depth function measures centrality. The contours of a multivariate depth function induce closely associated multivariate outlyingness, quantile, sign, and rank functions. Together, these functions comprise a powerful methodology for nonparametric multivariate data description, outlier detection, data analysis, and inference, including for example location and scatter estimation, tests of symmetry, and multivariate boxplots. However, because there is no natural order in dimensions higher than 1, notions such as median and quantile are not uniquely defined, which makes for a challenging conceptual arena. How to define the middle? The middle half? Interesting competing formulations of depth functions in the multivariate setting have evolved, and extensions have been developed to functional data in Hilbert space and, more recently, to multivariate functional data. A key question is how generally a notion of depth function can be productively defined. This talk provides a perspective on depth, outlyingness, quantile, and rank functions, through an overview coherently treating concepts, roles, key properties, interrelations, data settings, applications, open issues, and new potentials.
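As a concrete illustration of "depth measures centrality", the sketch below approximates one well-known formulation, the Tukey halfspace depth, for two-dimensional data. It is only an illustration under stated assumptions, not a method taken from the talk: the function name `halfspace_depth_2d`, the angle grid, and the toy data set are all hypothetical choices, and scanning a finite grid of directions gives an upper bound on the exact depth in general.

```python
import math

def halfspace_depth_2d(point, data, n_directions=360):
    """Approximate Tukey halfspace depth of `point` relative to 2-D `data`.

    For each direction u on a grid of angles, compute the fraction of data
    points in the closed halfplane {z : u . (z - point) >= 0}. The depth is
    the smallest such fraction over all directions scanned.
    """
    px, py = point
    n = len(data)
    depth = 1.0
    for k in range(n_directions):
        theta = 2.0 * math.pi * k / n_directions
        ux, uy = math.cos(theta), math.sin(theta)
        # Small tolerance so points lying on the boundary count as inside.
        count = sum(1 for (x, y) in data
                    if ux * (x - px) + uy * (y - py) >= -1e-12)
        depth = min(depth, count / n)
    return depth

# Hypothetical toy data: a centrally symmetric cloud of nine points.
data = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
        (2, 2), (-2, -2), (2, -2), (-2, 2)]

center = halfspace_depth_2d((0.0, 0.0), data)   # deepest point of the cloud
corner = halfspace_depth_2d((2.0, 2.0), data)   # a vertex of the convex hull
far = halfspace_depth_2d((10.0, 10.0), data)    # well outside the cloud

print(center, corner, far)
```

Depth decreases from the center outward, so ordering points by depth gives a center-outward ranking, and the region of points with depth above a threshold plays the role of a "middle" portion of the data.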

## Bayesian study design for nonlinear systems: an animal disease transmission experiment case study

Experimental design is a branch of statistics focused upon designing experimental studies in a way that maximizes the amount of salient information produced by the experiment. It is a topic which has been well studied in the context of linear systems. However, many physical, biological, economic, financial and engineering systems of interest are inherently non-linear in nature. Experimental design for non-linear models is complicated by the fact that the optimal design depends upon the parameters that we are using the experiment to estimate. A Bayesian, often simulation-based, framework is a natural setting for such design problems. We will illustrate the use of such a framework by considering the design of an animal disease transmission experiment where the underlying goal is to identify some characteristics of the disease dynamics (e.g. a vaccine effect, or the infectious period).

## The Lasso: A Brief Review and a New Significance Test

Tibshirani will review the lasso method and show an example of its utility in cancer diagnosis via mass spectrometry. He will then consider testing the significance of the terms in a regression fitted via the lasso. He will present a novel test statistic for this problem and show that it has a simple asymptotic null distribution. This work builds on the least angle regression approach for fitting the lasso, and the notion of degrees of freedom for adaptive models (Efron 1986) and for the lasso (Efron et al. 2004, Zou et al. 2007). He will give examples of this procedure, discuss extensions to generalized linear models and the Cox model, and describe an R language package for its computation.

This work is joint with Richard Lockhart (Simon Fraser University), Jonathan Taylor (Stanford) and Ryan Tibshirani (Carnegie Mellon).

## Sparse Linear Models

In a statistical world faced with an explosion of data, regularization has become an important ingredient. In many problems, we have many more variables than observations, and the lasso penalty and its hybrids have become increasingly useful. This talk presents a general framework for fitting large scale regularization paths for a variety of problems. We describe the approach, and demonstrate it via examples using our R package GLMNET. We then outline a series of related problems using extensions of these ideas. This is joint work with Jerome Friedman, Rob Tibshirani and Noah Simon.
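To make the idea of a regularization path concrete, here is a minimal sketch that fits the lasso over a decreasing grid of penalty values using cyclic coordinate descent with warm starts. This is one standard way to compute such paths and is not presented as the talk's or the package's actual implementation; the function names (`soft_threshold`, `lasso_cd`, `lasso_path`), the fixed number of sweeps, and the tiny orthogonal design are assumptions chosen so the result can be checked by hand. The objective assumed is (1/(2n))·||y − Xβ||² + λ·||β||₁.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, beta=None, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(n_sweeps):
        for j in range(p):
            col_ss = sum(X[i][j] ** 2 for i in range(n)) / n
            if col_ss == 0.0:
                continue
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ss
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def lasso_path(X, y, lambdas):
    """Fit from the largest penalty down, warm-starting each fit."""
    path, beta = [], None
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta)
        path.append((lam, beta))
    return path

# Hypothetical toy data: orthogonal columns, y = 2 * col1 + 0.5 * col2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.5, 1.5, -1.5, -2.5]
path = lasso_path(X, y, [2.0, 1.0, 0.25, 0.0])
for lam, beta in path:
    print(lam, beta)
```

With this orthogonal design the solution at each penalty is just the soft-thresholded least squares coefficients, so both coefficients are zero at λ = 2, only the first is active at λ = 1, and both are active at λ = 0.25. Starting each fit from the previous solution is what makes a fine grid of penalties cheap to compute.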

Trevor Hastie is noted for his many contributions to the statistician’s toolbox of flexible data analysis methods. Beginning with his PhD thesis, Trevor developed a nonparametric version of principal components analysis, terming the methodology principal curves and surfaces. During the years after his PhD, as a member of the AT&T Bell Laboratories statistics and data analysis research group, Trevor developed techniques for linear, generalized linear, and additive models and worked on the development of S, the precursor of R. Much of this work is contained in the well-known Statistical Models in S (co-edited with John Chambers, 1991). In the book Generalized Additive Models (1990), Trevor and co-author Rob Tibshirani modified techniques like multiple linear regression and logistic regression to allow for smooth modeling while avoiding the usual dimensionality problems. In 1994, Trevor left Bell Labs for Stanford University to become Professor in Statistics and Biostatistics. Trevor has also applied his skills to research in machine learning. His book The Elements of Statistical Learning (with Rob Tibshirani and Jerry Friedman, Springer 2001; second edition 2009) is famous for providing a readable account of flexible techniques for high-dimensional data. This popular book expertly bridges the philosophical and research gap between computer scientists and statisticians.


## Epidemiologic methods are useless. They can only give you answers

The first duty of any epidemiologist is to ask a relevant question. Learning and applying sophisticated epidemiologic methods is of little help if the methods are used to answer irrelevant questions. This talk will discuss the formulation of research questions in the presence of time-varying treatments and treatments with multiple versions, including pharmacological treatments and lifestyle exposures. Several examples will show that discrepancies between observational studies and randomized trials are often not due to confounding, but to the different questions asked.

**Brief Biography**

Miguel Hernán is Professor in the Departments of Epidemiology and Biostatistics at the Harvard School of Public Health (HSPH). His research is focused on the development and application of causal inference methods to guide policy and clinical interventions. He and his collaborators apply statistical methods to observational studies, under suitable conditions, to emulate hypothetical randomized experiments so that well-formulated causal questions can be investigated properly. His research has been applied to many areas, including the optimal use of antiretroviral therapy in patients infected with HIV and the assessment of interventions for kidney disease, cardiovascular disease, cancer, and central nervous system diseases. He is Associate Director of the HSPH Program on Causal Inference in Epidemiology and Allied Sciences, a member of the Affiliated Faculty of the Harvard-MIT Division of Health Sciences and Technology, and an Editor of the journal EPIDEMIOLOGY. He is the author of the upcoming, highly anticipated textbook "Causal Inference" (Chapman & Hall/CRC, 2013); drafts of selected chapters are available on his website.

## Pumps, Maps and Pea Soup: Spatio-temporal methods in environmental epidemiology


*Further information about the Constance van Eeden Invited Speaker Program*

This talk provides an introduction to epidemiological analysis where the distribution of health outcomes and related exposures are measured over both space and time. Developments in this field have been driven by public interest in the effects of environmental pollution, increased availability of data and increases in computing power. These factors, together with recent advances in the field of spatio-temporal statistics, have led to the development of models which can consider relationships between adverse health outcomes and environmental exposures over both time and space simultaneously.

Using illustrative examples, ranging from outbreaks of cholera in London in the 1850s and episodes of smog in the 1950s to present-day epidemiological studies, we discuss a variety of issues commonly associated with analyses of this type, including modelling auto-correlation, preferential sampling of exposures, and ecological bias. The precise choice of statistical model may depend on whether we are explicitly interested in the spatio-temporal pattern of disease incidence, e.g. disease mapping and cluster detection, or whether clustering is a nuisance quantity that we need to acknowledge, e.g. spatio-temporal regression. Throughout, we consider the practical implementation of models, with specific focus on inference within a Bayesian framework using computational methods such as Markov chain Monte Carlo and Integrated Nested Laplace Approximations.

The talk also serves as a precursor to a graduate level course on spatio-temporal methods in epidemiology. This course will cover the basic concepts of epidemiology, methods for temporal and spatial analysis and the practical application of such methods using commonly available computer packages. It will have an applied focus with both lectures and practical computer sessions in which participants will be guided through analyses of epidemiological data.


**BACKGROUND INFORMATION:** The Statistics Department, with the support of the Constance van Eeden Fund, is honoured to host Dr Gavin Shaddick during term 2 2012-13. Dr Shaddick, a Reader in Statistics in the Department of Mathematical Sciences at the University of Bath, has achieved international prominence for his contributions to the theory and application of Bayesian statistics to the areas of spatial epidemiology, environmental health risk and the modelling of spatio-temporal fields of environmental hazards.

Dr Shaddick will begin his visit to the Department by giving the 2012-13 van Eeden lecture. That lecture will inaugurate a one-term special topics graduate course in statistics, which the Department of Statistics is offering next term. It will be given by Dr Shaddick and Dr James Zidek (Statistics, UBC) on the subject of spatial epidemiology. This course, which is aimed primarily at a statistical audience, will provide an introduction to environmental epidemiology and spatio-temporal process modeling as it applies to the assessment of risk to human health and welfare due to random fields of hazards such as air pollution. Please see the course outline for more information.

## On Long-Run Covariance Matrix Estimation with the Truncated Flat Kernel

Despite its large-sample efficiency, the truncated flat (TF) kernel estimator of long-run covariance matrices is seldom used, because it is not guaranteed to be positive semidefinite and sometimes performs poorly in small samples compared to other familiar kernel estimators. This paper proposes simple modifications to the TF estimator that enforce positive definiteness without sacrificing large-sample efficiency and that make the estimator more reliable in small samples through better utilization of the bias-variance tradeoff. We study the large-sample properties of the modified TF estimators and verify their improved small-sample performance by Monte Carlo simulations.
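The positive semidefiniteness problem is easiest to see in the scalar case, where the long-run variance estimate is the lag-0 autocovariance plus twice the kernel-weighted sum of higher-lag autocovariances. The sketch below is a hedged illustration only: it shows the unmodified TF kernel next to the Bartlett kernel on a made-up alternating series, and it does not implement the modifications proposed in the paper. The function names, bandwidth values, and data are all hypothetical.

```python
def autocov(x, lag):
    """Sample autocovariance at the given lag (divisor n, mean removed)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n)) / n

def long_run_variance(x, bandwidth, kernel):
    """Kernel estimate: gamma_0 + 2 * sum_j kernel(j / bandwidth) * gamma_j."""
    total = autocov(x, 0)
    for j in range(1, len(x)):
        w = kernel(j / bandwidth)
        if w != 0.0:
            total += 2.0 * w * autocov(x, j)
    return total

def truncated_flat(u):
    """TF kernel: full weight inside the bandwidth, zero outside."""
    return 1.0 if abs(u) <= 1.0 else 0.0

def bartlett(u):
    """Bartlett kernel: weights decline linearly to zero at the bandwidth."""
    return max(0.0, 1.0 - abs(u))

# Hypothetical series with strong negative first-order autocorrelation.
x = [1.0, -1.0] * 10

tf_estimate = long_run_variance(x, bandwidth=1, kernel=truncated_flat)
bartlett_estimate = long_run_variance(x, bandwidth=2, kernel=bartlett)

print(tf_estimate)        # negative: not a valid variance
print(bartlett_estimate)  # positive
```

Here the TF estimate uses the lag-1 autocovariance at full weight and comes out negative, which is impossible for a variance, while the Bartlett kernel downweights the same term and stays positive. In the matrix case the analogous failure is an estimate with a negative eigenvalue.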

## Sequential Robust Design Strategies

The speaker introduces the formal notion of an approximately specified nonlinear regression model and investigates sequential design methodologies when the fitted model is possibly of an incorrect parametric form. He presents small-sample simulation studies which indicate that his new designs can be very successful, relative to some common competitors, in reducing mean squared error due to model misspecification and to heteroscedastic variation. His simulations also suggest that standard normal-theory inference procedures remain approximately valid under the sequential sampling schemes.