Probability and Statistics Key Ballroom 04
Hybrid Presentation

13 Nov 2021 09:00 AM - 12:15 PM (America/New_York)

Statistical methods play an essential role in an extremely wide range of human reasoning. From theorizing in the physical and social sciences to determining evidential standards in legal contexts, statistical methods are ubiquitous, and so are questions about their adequate application. As tools for making inferences that go beyond a given set of data, they are inherently a means of inductive, or ampliative, reasoning, and so it is unsurprising that philosophers have used statistical frameworks to further our understanding of these topics. Yet statistical methods are undergoing considerable debate, with important implications for standards of research across the social and biological sciences. In the last decade many published results in the medical and social sciences have been found not to replicate. This has sparked debates about the very nature of statistical inference and modeling. Combining perspectives from philosophy, statistics, psychology, and economics, our symposium focuses on these recent debates. It will be a topical session building on Deborah Mayo's *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018), and a 2019 Summer Seminar on Philosophy of Statistics co-directed by D. Mayo and A. Spanos, in which all presenters of our proposed session participated (https://summerseminarphilstat.com/).

Self-Correction and Statistical Misspecification
**Symposium Paper Abstracts**
09:00 AM - 09:30 AM (America/New_York)
2021/11/13 14:00:00 UTC - 2021/11/13 14:30:00 UTC

Current discussions of non-replication largely overlook the most serious source of the problem: violations of the probabilistic assumptions underlying the statistical models invoked in empirical inference with actual data. All approaches to statistical inference, including the frequentist, the Bayesian and the nonparametric, rest on a concept of a statistical model. What differentiates alternative approaches to inference is how effectively they can secure the reliability of inference by checking the adequacy of the assumed statistical model vis-à-vis the particular data.

Suppose, for example, that we can view the data as following a standard coin-tossing (Bernoulli) model: trials are independent, and the probability of heads (X=1) on each trial, q, is the same (that is, the trials are independent and identically distributed, IID). Then we can use the sample mean M (the observed proportion of heads) to test hypotheses about the unknown parameter q. We might infer evidence against a hypothesis, say, that q ≤ 0.5, whenever M exceeds some value and, thanks to assuming IID, give approximately correct assessments of the probability that the test would erroneously interpret the data (as evidence against, or no evidence against, hypotheses about q).
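The test described above can be sketched in a few lines of Python. This is only an illustration under assumed values: the sample size, the significance level, the alternative value q = 0.6, and the function names are hypothetical and do not come from the abstract. The sketch uses exact binomial probabilities, which are valid only if the IID assumptions hold.

```python
from math import comb


def upper_tail(k: int, n: int, q: float) -> float:
    """P(at least k heads in n IID Bernoulli(q) trials)."""
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k, n + 1))


def critical_count(n: int, q0: float, alpha: float) -> int:
    """Smallest k with P(heads >= k; q = q0) <= alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty sum, i.e. probability 0
        if upper_tail(k, n, q0) <= alpha:
            return k
    raise ValueError("alpha must be non-negative")


# Hypothetical design values, chosen only for illustration.
n, q0, alpha = 100, 0.5, 0.05
k_crit = critical_count(n, q0, alpha)

# Infer evidence against "q <= 0.5" whenever M = heads / n reaches the cutoff.
print(f"reject q <= {q0} when M >= {k_crit / n:.2f}")
# The upper tail increases with q, so the error probability over q <= 0.5
# is largest at q = 0.5.
print(f"max probability of erroneous rejection: {upper_tail(k_crit, n, q0):.4f}")
# Probability of detecting a hypothetical discrepancy q = 0.6.
print(f"probability of rejection when q = 0.6: {upper_tail(k_crit, n, 0.6):.4f}")
```

If the trials were in fact dependent or heterogeneous, these computed probabilities could differ substantially from the actual ones, which is the concern the abstract raises.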

But how can we assess whether these assumptions hold for our data? The inference about q (the *primary* inference) depends on the assumptions, but the tests of those assumptions, *misspecification (M-S) tests*, should not depend on the unknown q. It may seem surprising that, in an important class of cases, we may use the "same" data to test the primary hypothesis about q as we use to test the assumptions, provided we remodel the data appropriately. An example of a non-parametric M-S test for IID is the *runs test*. Each of the *n* outcomes is replaced by a + or −: it is + if the next outcome exceeds it, − if the next is smaller. Each sequence of pluses only, or minuses only, is a run. For instance, the sequence −, −, +, −, +, +, + has 4 runs. Just from assuming the validity of IID, we can calculate the probability of each number of runs, and compare the expected number of runs with the number observed. Unusually many or few runs are improbable under IID, thereby giving evidence that randomness (the IID assumptions) is violated. The error probabilities of the M-S test do not depend on the unknown q.
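A minimal sketch of the runs count described above, in Python. The data values and function names are hypothetical. The expected number of runs, (2n − 1)/3, and its variance, (16n − 29)/90, are the standard results for the runs up-and-down test under IID sampling from a continuous distribution (no ties). The Normal approximation is an added assumption and is rough for a sample this small; binary outcomes with many ties would call for a different variant of the test.

```python
from math import sqrt
from statistics import NormalDist


def up_down_signs(xs):
    """'+' if the next outcome exceeds the current one, '-' if it is smaller.

    Assumes no ties between neighbouring outcomes; tied pairs are skipped.
    """
    return ["+" if b > a else "-" for a, b in zip(xs, xs[1:]) if a != b]


def count_runs(signs):
    """Number of maximal blocks of identical signs."""
    if not signs:
        return 0
    return 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def runs_test(xs):
    """Return (observed runs, expected runs, z score, two-sided p-value)."""
    n = len(xs)
    observed = count_runs(up_down_signs(xs))
    expected = (2 * n - 1) / 3      # mean under IID, continuous data
    variance = (16 * n - 29) / 90   # variance under IID, continuous data
    z = (observed - expected) / sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # Normal approximation
    return observed, expected, z, p_value


# Hypothetical data whose sign pattern is -, -, +, -, +, +, + as in the text.
data = [5, 4, 3, 6, 2, 3, 4, 5]
print("".join(up_down_signs(data)))  # --+-+++
observed, expected, z, p_value = runs_test(data)
print(observed, expected)            # 4 runs observed, 5.0 expected
print(round(z, 3), round(p_value, 3))
```

Nothing in the calculation involves q, which illustrates why the error probabilities of this M-S test do not depend on the unknown parameter of the primary inference.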

While failing to detect a violation of the IID assumptions does not warrant inferring that they hold (no evidence against is not evidence for), putting together several M-S tests can show the adequacy of the model for inference. It suffices that the error probabilities reported for the primary inference be close to the actual ones. The value of error control in our inferential interpretation of tests (Mayo and Spanos 2011) is not merely to control error in the long run, but to ensure the test would probably not have found evidence of a discrepancy from a claim about q unless it is warranted.

Measuring Severity in Statistical Inference
**Symposium Paper Abstracts**
09:30 AM - 10:00 AM (America/New_York)
2021/11/13 14:30:00 UTC - 2021/11/13 15:00:00 UTC

Statistical methods, be they Bayesian, frequentist or fiducial, prescribe procedures through which probability calculations substantiate data-dependent claims about the world. In order for any method to work, the analyst must make simplifying assumptions about the state of the world, which are almost certainly going to be restrictive or inaccurate, yet she aims to do so in a principled manner. A statistical method is said to be principled if the claims it makes are equipped with some kind of quality guarantee, often assessed in terms of probabilities. Consistency, efficiency and power are just some of the many desirable properties that statistical methods may possess. Each is applicable to different contexts, and each captures, to a varying degree, the complex interaction between a good inherent design and the external generalizability of the method in question. The concept of severity (Mayo, 2018) is a principle of statistical hypothesis testing. Reflecting the Popperian notion of falsifiability, severity seeks to establish a stochastic version of modus tollens. It is an assessment of the test in relation to the claim it makes, and the data on which the claim is based. Specifically, a claim C passes a severe test with the data at hand if, with high probability, the test would have found flaws with C were they present, and yet it does not. In this talk, I discuss how the concept of severity can be extended beyond frequentist statistics and hypothesis testing, and be adapted to other contexts of statistical methodologies that follow both the frequentist and Bayesian traditions, such as classification, model selection, and model misspecification. In these areas of application, severity by analogy measures the extent to which the respective resulting inference is warranted, in relation to the body of evidence at hand. If the current available evidence leads a method to infer something about the world, then were it not the case, would the method still have inferred it?
I discuss how to formulate severity in these contexts, with examples to demonstrate its assessment and interpretation. A connection with the significance function (Fraser, 1991) and the confidence distribution (Xie & Singh, 2013) is drawn, highlighting a post-data aspect of the performance assessment and a fiducial spirit of the exercise. In conceptualizing severity and operationalizing its measurement for statistical tasks that are central to the quantitative sciences today, the hope is that severity can be instituted as a guiding principle which can be commonly referred to and practically assessed in a wide range of modern applications that call for evidence-based scientific decision making.
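For orientation, the frequentist baseline that the talk proposes to extend can be sketched for a simple case. The sketch assumes a one-sided test about a Normal mean μ with known standard deviation, where the severity of the claim "μ > μ1", given an observed sample mean, is the probability that the sample mean would have been no larger than the one observed if μ were equal to μ1. The function name and all numbers are hypothetical, and the sketch does not represent the extensions to classification, model selection, or misspecification discussed in the abstract.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater(x_bar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity of the claim 'mu > mu1' given an observed sample mean x_bar.

    Computed as P(sample mean <= x_bar; mu = mu1) under a Normal model with
    known sigma and n IID observations (all of which are assumptions here).
    """
    standard_error = sigma / sqrt(n)
    return NormalDist(mu=mu1, sigma=standard_error).cdf(x_bar)


# Hypothetical numbers: sigma = 10, n = 100, observed mean 152.
for mu1 in (150, 151, 152, 153):
    sev = severity_mu_greater(x_bar=152, mu1=mu1, sigma=10, n=100)
    print(f"SEV(mu > {mu1}) = {sev:.3f}")
```

Read as a function of μ1, the severity values trace out a curve over the parameter space, which is one way to see the connection to significance functions and confidence distributions mentioned above.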

Fraser, D. A. S. (1991). Statistical inference: Likelihood to significance. *Journal of the American Statistical Association*, *86*, 258–265.

Xie, M. G., & Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review. *International Statistical Review*, *81*(1), 3–39.

Psychometric Models: Statistics and Interpretation
**Symposium Paper Abstracts**
10:15 AM - 10:45 AM (America/New_York)
2021/11/13 15:15:00 UTC - 2021/11/13 15:45:00 UTC

This paper deals with measurement models in psychology, and their relation to problems of replication and confirmation. We argue that in order to adjudicate between different measurement models, the models need to be endowed with an interpretation that goes beyond their strict statistical content.

One reason that psychology might be particularly prone to problems with non-replication is that psychology is concerned with mental attributes (e.g., mental disorders) that are not precisely defined and not directly observed. So-called 'measurement models' are used to statistically identify the latent attribute from a set of observed variables. Statistical methods thus not only come into play to test hypotheses, but also to 'measure' the attributes that are objects of these hypotheses.

Psychological attributes are traditionally studied in a latent variable framework, in which the latent attribute is measured by a set of observed variables thought to be causally influenced by the latent attribute. For example, Lisa experiences difficulties in concentrating, fatigue and feelings of worthlessness because she suffers from depression. Because depression causes these symptoms, depression is reflected in whatever these symptoms share. A recent alternative psychometric theory is the psychological network theory. Here, observables correlate because they mutually reinforce each other. For example, because Lisa has sleeping problems, she finds it difficult to concentrate. As a result, she feels guilty and worries, which in turn hinder her sleep. 'Depression' here is not a latent common cause but refers to the resulting cluster of associated symptoms.

Although the theories offer radically different explanations for the correlations between observed variables, the network model and latent variable model are statistically proximate, and in some cases statistically equivalent. How should we think about comparing models that are statistically similar and yet represent such different theories about the data-generating mechanism? Interpreting these models purely statistically implies that an equivalent network model and latent variable model are just two different ways of expressing the same model. In contrast, a causal interpretation implies that these equivalent models make different predictions about interventions on observed variables. These diverging predictions not only help to probe the theories that are represented by the models, but also have important implications for clinical practice. For example, in a causal interpretation of psychometric models, the latent variable model implies that the treatment of depression should intervene on the latent variable since the symptoms are merely interchangeable indicators of the underlying disorder. The network model implies that treatment should intervene on symptoms that have an important role in the network.

We compare the implications of a causal versus statistical interpretation of psychometric models and argue that a causal interpretation helps to distinguish network models and latent variable models. Towards the end of the paper we consider whether this commits us to a realist position on psychometric models, or whether we can maintain a broadly empiricist attitude towards such models and settle on an ontology of "real patterns".

Is Algorithmic Fairness Possible?
**Symposium Paper Abstracts**
10:45 AM - 11:15 AM (America/New_York)
2021/11/13 15:45:00 UTC - 2021/11/13 16:15:00 UTC

Algorithms are increasingly used by public and private sector entities to streamline decisions about healthcare, welfare benefits, child abuse, public housing, neighborhoods to police, bail and sentencing. This paper focuses on the ongoing debate among computer scientists, legal scholars and moral philosophers about the fairness of algorithms used in the criminal justice system. In 2016 ProPublica analyzed the software COMPAS and showed that the *false positive* rate was higher for blacks than for whites, and the *false negative* rate was higher for whites than for blacks. In response, Northpointe, the company that designed COMPAS, showed that the *predictive error rate* was the same across groups. This was a disagreement about the right conception of fairness. According to ProPublica, similarly situated individuals, regardless of their race, should be equally subject to classification errors such as false positives and false negatives. Call this *classification parity*. COMPAS did not satisfy classification parity. According to Northpointe, however, this disparity is irrelevant because fairness only requires that among individuals who are predicted to be 'high risk' or 'low risk', the proportion of those who actually reoffend be the same across groups. Call this *predictive parity*. On this interpretation, COMPAS exhibits no racial bias. There exists a growing body of literature in computer science on algorithmic fairness, focusing on whether algorithms can satisfy more than one conception of fairness at the same time or whether there are inevitable tradeoffs. Alexandra Chouldechova's 2017 article 'Fair prediction with disparate impact: A study of bias in recidivism prediction instruments' demonstrated that, when the prevalence of recidivism is different across groups, it is impossible to have equality in false positive and false negative rates and also maintain predictive parity (see also Borsboom, Wichers and Romeijn, 2008). 
Jon Kleinberg et al.'s 2017 article 'Inherent trade-offs in the fair determination of risk scores' established the incompatibility of two slightly different conceptions of fairness, call them *predictive calibration* and *classification balance*. These results are often used to argue that the two criteria of fairness, classification and prediction, are incompatible. I explore two possible responses. First, the literature on algorithmic fairness assumes that differences in the prevalence of recidivism among different groups matter for making predictions about whether someone is going to reoffend or not, but the importance of prevalence or base rates for making predictions about *individual* behavior is questionable (*SIST*, section 5.6). If differences in prevalence are disregarded, the two conceptions of algorithmic fairness become trivially compatible. Second, even without discounting differences in prevalence, classification and predictive fairness are compatible as long as they are understood in terms of classification parity and predictive calibration. When an algorithm relies on a sufficiently large number of independent predictors, classification parity can be achieved together with predictive calibration, even when the prevalence of recidivism differs across groups. So, either the debate about algorithmic fairness rests on a faulty assumption or there is a plausible way to understand the two dimensions of fairness that renders them compatible.
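The arithmetic behind the incompatibility result attributed to Chouldechova above can be illustrated with a short Python sketch. It relies on an identity that follows from the definitions of the confusion-matrix rates: FPR = p/(1 − p) × (1 − PPV)/PPV × (1 − FNR), where p is the prevalence and PPV is the proportion of predicted 'high risk' individuals who actually reoffend. Here predictive parity is simplified to equal PPV across groups. All counts, rates, and function names are hypothetical and are not taken from COMPAS data.

```python
def rates_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Prevalence, PPV, FNR and FPR from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,  # share who actually reoffend
        "ppv": tp / (tp + fp),            # reoffenders among predicted high risk
        "fnr": fn / (tp + fn),            # reoffenders predicted low risk
        "fpr": fp / (fp + tn),            # non-reoffenders predicted high risk
    }


def implied_fpr(prevalence: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by a given prevalence, PPV and FNR."""
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)


# Check the identity on hypothetical counts.
r = rates_from_counts(tp=30, fp=20, fn=10, tn=40)
assert abs(implied_fpr(r["prevalence"], r["ppv"], r["fnr"]) - r["fpr"]) < 1e-12

# Two hypothetical groups with equal PPV and equal FNR but different prevalence:
# their false positive rates cannot be equal.
for group, prevalence in (("group A", 0.3), ("group B", 0.5)):
    fpr = implied_fpr(prevalence, ppv=0.6, fnr=0.3)
    print(f"{group}: prevalence={prevalence}, implied FPR={fpr:.3f}")
```

If the two prevalences are set equal, the implied false positive rates coincide, which is why disregarding differences in prevalence makes the two conceptions trivially compatible.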

Statistical Modeling, Mis-specification Testing, and Exploration
**Symposium Paper Abstracts**
11:15 AM - 11:45 AM (America/New_York)
2021/11/13 16:15:00 UTC - 2021/11/13 16:45:00 UTC

There is a debate as to whether idealizations and abstractions play a substantive role in science. What is meant by "substantive" is part of what is debated, but by non-substantive we mean pragmatic (e.g., having to do with mathematical tractability, or with heuristic and pedagogical roles) (Shech 2018). In the context of various case studies from physics, I have argued that idealizations facilitate a type of exploratory modeling that fosters understanding of theories, models, and phenomena (Shech 2015, 2017, Shech & Gelfert Forthcoming). Two senses of "understanding" have been identified: understanding-why some phenomenon occurs, and understanding-with, which has to do with understanding a scientific theory or model.

How does this debate play out in the process of statistical modeling? That is the question I propose to tackle in this paper. I will concentrate on mis-specification testing, primarily as it arises in Spanos & Mayo's error statistical approach (e.g., Spanos & Mayo 2015). The centerpiece of their approach is to insist on a distinction between a substantive model and a statistical model. They view the substantive inquiry as posing questions in the context of a highly idealized statistical model. Such a model is statistically adequate when it captures the systematic chance regularity patterns in the data and so represents the statistical systematic information. That is, the adequacy of the statistical model means that the data *could* have been generated by a stochastic process as described in the model. By contrast, adequacy of the substantive model would mean it adequately describes the portion of the world giving rise to the particular data. Inadequacy at the substantive level means that the theory model differs systematically from the actual data generating mechanism that gave rise to the phenomenon of interest; this can arise from false causal claims, missing variables, confounding factors, etc. Inadequacy at the statistical level means that one or more of the probabilistic model assumptions (having to do with statistical distribution, dependence, and homogeneity) depart from the statistical regularity being modeled. When the statistical model is found to be misspecified, the various ways of respecifying it, using things like lags and dummy variables, could well be deemed pragmatic. But the goal in adding them to the statistical model is to understand features of the actual and various *possible* data generating mechanisms.

Examining the links between statistical and substantive models strengthens and illuminates my arguments that to view idealizations and abstractions as playing only pragmatic, heuristic, and pedagogical roles in science is naïve. It overlooks the ways in which idealized statistical models provide links to connect parameters in the substantive model to parameters in the statistical model. This links actual data to substantive questions in exploring and understanding aspects of a phenomenon.
