It all starts with what some call “meta-issues.” These represent the conceptual foundation on which any statistical procedure rests. Without a solid conceptual foundation, all that follows will be ad hoc. Moreover, the conceptual foundation provides whatever links there may be between the empirical analyses undertaken and subject-matter theory or policy applications.
Conventional regression models
Conventional causal models are based on a quantitative theory of how the data were generated. Although there can be important differences in detail, the canonical account takes the form of a linear regression model such as

y_i = X_i β + ε_i,  (1)

where for each case i, the response y_i is a linear function of fixed predictors X_i (usually including a column of 1's for the intercept), with regression coefficients β, and a disturbance term ε_i ∼ NIID(0, σ²). For a given case, nature (1) sets the value of each predictor, (2) combines them in a linear fashion using the regression coefficients as weights, (3) adds the value of the intercept, and (4) adds a random disturbance from a normal distribution with a mean of zero and a given variance. The result is the value of y_i. Nature can repeat these operations a limitless number of times for a given case, with the random disturbances drawn independently of one another. The same formulation applies to all cases.
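As a concrete sketch of this data-generation story, the steps behind Equation 1 can be simulated directly. The sample size, coefficient values, and predictor distributions below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000                             # number of cases (illustrative)
beta = np.array([2.0, 0.5, -1.3])    # intercept and two slopes (illustrative)
sigma = 1.0                          # standard deviation of the disturbances

# Nature's fixed predictors: a column of 1's for the intercept plus two covariates.
X = np.column_stack([np.ones(n),
                     rng.uniform(0, 10, n),
                     rng.uniform(0, 10, n)])

# Steps (1)-(4): combine the predictors linearly with beta as weights
# (the intercept is absorbed in the column of 1's), then add a
# NIID(0, sigma^2) disturbance to each case.
epsilon = rng.normal(0.0, sigma, n)
y = X @ beta + epsilon
```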
When the response is categorical or a count, there are some differences in how nature generates the data. For example, if the response variable Y is binary,

p_i = 1 / (1 + e^(−X_i β)),  (2)

where p_i is the probability of some event defined by Y. Suppose that Y is coded "1" if a particular event occurs and "0" otherwise (e.g., a parolee is arrested or not). Nature combines the predictors as before, but now applies a logistic transformation to arrive at a value for p_i. (The cumulative normal is also sometimes used.) That probability leads to the equivalent of a coin flip, with the probability that the coin comes up "1" equal to p_i. The side on which that "coin" lands determines for case i whether the response is a "1" or a "0." As before, the process can be repeated independently a limitless number of times for each case.
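A parallel sketch, again with invented numbers, shows nature's two-stage process for a binary response: a logistic transformation of the linear combination, followed by the coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
beta = np.array([-1.0, 0.04])        # illustrative coefficients
# Intercept column plus one covariate (e.g., age in years).
X = np.column_stack([np.ones(n), rng.uniform(18, 70, n)])

# Logistic transformation of the linear combination (Equation 2).
p = 1.0 / (1.0 + np.exp(-(X @ beta)))

# The "coin flip": for case i, the response is 1 with probability p_i.
y = rng.binomial(1, p)
```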
The links to linear regression become clearer when Equation 2 is rewritten as

log[p_i / (1 − p_i)] = X_i β,  (3)

where p_i is, again, the probability of some binary response whose "logit" depends linearly on the predictors.c
For any of Equations 1, 2, or 3, a causal account can be overlaid by claiming that nature can manipulate the value of any given predictor independently of all other predictors. Conventional statistical inference can also be introduced because the sources of random variation are clearly specified and statistically tractable.
Forecasting would seem to follow naturally. With an estimate in hand, new values for X can be inserted to arrive at values of ŷ that may be used as forecasts. Conventional tests and confidence intervals can then be applied. There are, however, potential conceptual complications. If X is fixed, how does one explain the appearance of new predictor values X∗ whose outcomes one wants to forecast? For better or worse, such matters are typically ignored in practice.
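In practice, the sequence looks something like the following sketch, which fits Equation 1 by ordinary least squares and then inserts new predictor values X∗. The statsmodels calls and all numerical values are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated training data under the conventional regression model.
n = 500
X = sm.add_constant(rng.uniform(0, 10, (n, 2)))
y = X @ np.array([2.0, 0.5, -1.3]) + rng.normal(0, 1, n)

fit = sm.OLS(y, X).fit()             # estimates of beta

# New predictor values X* whose outcomes one wants to forecast.
X_star = sm.add_constant(rng.uniform(0, 10, (5, 2)), has_constant='add')

# Point forecasts with conventional confidence and prediction intervals.
forecasts = fit.get_prediction(X_star)
print(forecasts.summary_frame(alpha=0.05))
```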
Powerful critiques of conventional regression have appeared since the 1970s. They are easily summarized: the causal models popular in criminology, and in the social sciences more generally, are laced with far too many untestable assumptions of convenience. The modeling has gotten far out ahead of existing subject-matter knowledge.d
Interested readers should consult the writings of economists such as Leamer, LaLonde, Manski, Imbens, and Angrist, and statisticians such as Rubin, Holland, Breiman, and Freedman. I have written on this too [16].
The machine learning model
Machine learning can rest on a rather different model that demands far less of nature and of subject-matter knowledge. For a given case, nature generates data as a random realization from a joint probability distribution for some collection of variables. The variables may be quantitative or categorical. A limitless number of realizations can be independently produced from that joint distribution. The same applies to every case. That's it.
From nature's perspective, there are no predictors or response variables. It follows that there are no such things as omitted variables or disturbances. Often, however, researchers will use subject-matter considerations to designate one variable as a response Y and other variables as predictors X. It is then sometimes handy to denote the joint probability distribution as Pr(Y, X). One must be clear that the distinction between Y and X has absolutely nothing to do with how the data were generated. It has everything to do with what interests the researcher.
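The following minimal sketch draws realizations from one such joint distribution. The trivariate normal and its covariances are stand-ins chosen only for illustration; nature's actual joint distribution is left unspecified by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nature's model: each case is an independent realization from a joint
# probability distribution over three variables. Nothing in the model
# itself distinguishes a response from predictors.
mean = np.zeros(3)
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
data = rng.multivariate_normal(mean, cov, size=1000)

# The researcher, not nature, designates one variable as the response Y
# and the rest as predictors X, giving the working notation Pr(Y, X).
Y, X = data[:, 0], data[:, 1:]
```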
For a quantitative response variable in Pr(Y, X), researchers often want to characterize how the means of the response variable, denoted here by μ, may be related to X. That is, researchers are interested in the conditional distribution μ|X. They may even write down a regression-like expression

y_i = f(X_i) + ξ_i,  (4)

where f(X_i) is the unknown relationship in nature's joint probability distribution for which

μ_i = f(X_i).  (5)

It follows that the mean of ξ_i in the joint distribution equals zero.e Some notational and conceptual license is being taken here. The predictors are random variables and formally should be represented as such. But in this instance, the extra complexity is probably not worth the trouble.
Equations 4 and 5 constitute a theory of how the response is related to the predictors in Pr(Y, X). But any relationships between the response and the predictors are “merely” associations. There is no causal overlay. Equation 4 is not a causal model. Nor is it a representation of how the data were generated — we already have a model for that.
Generalizations to categorical response variables and their conditional distributions can be relatively straightforward. We denote a given outcome class by G_k, with k = 1, …, K classes (e.g., for K = 3: released on bail, released on recognizance, not released). For nature's joint probability distribution, there can be for any case i interest in the conditional probability of any outcome class: p_ki = f(X_i). There also can be interest in the conditional outcome class itself: g_ki = f(X_i).f
One can get from the conditional probability to the conditional class using the Bayes classifier: the class with the largest probability is the class assigned to a case. For example, if for a given individual under supervision the probability of failing on parole is .35 and the probability of succeeding on parole is .65, the assigned class for that individual is success. It is also possible with some estimation procedures to proceed directly to the outcome class. There is no need to estimate intervening probabilities.
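A minimal sketch of the Bayes classifier, using the parole example from the text plus two cases with made-up probabilities:

```python
import numpy as np

# Estimated conditional probabilities for three cases and K = 2 outcome
# classes. The first row is the example from the text: .35 fail, .65 succeed.
classes = np.array(["fail", "succeed"])
probs = np.array([[0.35, 0.65],
                  [0.80, 0.20],   # invented for illustration
                  [0.55, 0.45]])  # invented for illustration

# The Bayes classifier assigns each case the class with the largest probability.
assigned = classes[np.argmax(probs, axis=1)]
print(assigned)   # ['succeed' 'fail' 'fail']
```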
When the response variable is quantitative, forecasting can be undertaken with the conditional means for the response. If f(X) is known, predictor values are simply inserted. Then μ = f(X∗), where, as before, X∗ represents the predictor values for the cases whose response values are to be forecasted. The same basic rationale applies when the outcome is categorical, either through the predicted probability or directly. That is, p_k = f(X∗) and G_k = f(X∗).
The f(X) is usually unknown. An estimate, f̂(X), then replaces f(X) when forecasts are made. The forecasts become estimates too (e.g., μ becomes μ̂). In a machine learning context, there can be difficult complications for which satisfying solutions may not exist. Estimation is considered in more depth shortly.
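As one illustration of how an estimate stands in for the unknown f(X), the sketch below fits a random forest (any machine learning regression procedure would serve) to simulated training data and then forecasts for new cases X∗. The data-generating function and tuning values are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Training data: realizations from some joint distribution Pr(Y, X).
X = rng.uniform(0, 10, (1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 1000)

# f(X) is unknown; an estimate f_hat replaces it.
f_hat = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Forecasts for new cases are then estimates too: mu_hat = f_hat(X*).
X_star = rng.uniform(0, 10, (3, 2))
mu_hat = f_hat.predict(X_star)
print(mu_hat)
```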
Just like the conventional regression model, the joint probability distribution model can be wrong too. In particular, the assumption of independent realizations can be problematic for spatial or temporal data, although in principle, adjustments for such difficulties sometimes can be made. A natural question, therefore, is why have any model at all? Why not just treat the data as a population and describe its important features?
Under many circumstances, treating the data as all there is can be a fine approach. But if an important goal of the analysis is to apply the findings beyond the data on hand, the destination for those inferences needs to be clearly defined, and a mathematical road map to the destination provided. A proper model promises both. If there is no model, it is very difficult to generalize any findings in a credible manner.g
A credible model is critical for forecasting applications. The training data used to build a forecasting procedure and the subsequent forecasting data for which projections into the future are desired should be realizations of the same data generation process. If they are not, formal justification for any forecasts breaks down, and at an intuitive level the enterprise seems misguided. Why would one employ a realization from one data generation process to make forecasts about another data generation process?h
In summary, the joint probability distribution model is simple by conventional regression modeling standards. But it nevertheless provides an instructive way of thinking about the data on hand. It is also less restrictive and far more appropriate for an inductive approach to data analysis.