How Scientists Evaluate Their Models: A Holistic View
Abstract: This essay develops a holistic view of how scientists evaluate their models. The holistic view holds that the modelers usually evaluate the goodness-of-fit of their models with respect to the models’ target systems holistically. An evaluation is holistic when (a) it involves taking the model (and its target system) as an interconnected whole, wherein components of the whole interact with one another in producing certain outputs (or patterns, phenomena, etc.), and (b) the goodness-of-fit measure of the model with respect to its target system is achieved via comparing outputs (or patterns, phenomena, etc.) produced by these interacting components. This view is defended by close examination of scientific practice, focusing on maximum likelihood estimation (MLE), a common method of testing scientific models. Two specific scientific models are used as case-studies, a mathematical model in biology and a concrete model in hydraulics.
Over the last four decades or so, the philosophy of science has witnessed a major shift of focus from scientific theories to scientific models. The traditional problem of how scientific theories are linked to observations (e.g., how an observation supports a theory) has largely given way to the problem of how scientific models are related to their target systems. A number of views on how a scientific model is related to its target system, sometimes also called the model-world relationship problem, have been developed, such as the semantic view (Suppes 1957, 1962, 1967; van Fraassen 1970, 1972; Sneed 1971; Suppe 1977, 1989; Stegmüller 1976), the mediator view (Cartwright 1983; Morgan and Morrison 1999), the inferential view (Suárez 2004, 2015a, 2015b; Bueno and Colyvan 2011; Bueno and French 2012; Baron et al. 2016), the fictionalism view (Frigg 2009; Toon 2012; Frigg and Nguyen 2016; Salis 2016), and the similarity view (Giere 1988, 2004, 2010; Godfrey-Smith 2006; Weisberg 2013).
The holistic view to be developed in this essay is largely motivated by the similarity view mentioned above. While the similarity view is right to claim that a model is a good model with respect to explanation and/or prediction because it is similar to its target system in certain respects and to certain degrees, it is either ambiguous to state that something is similar to something (since everything is similar to everything in certain aspects and to certain degrees (Quine 1969; Goodman 1972)), or very difficult to specify how something is similar to something. In order to specify the degree to which a model is similar to its target system, Michael Weisberg has suggested a similarity measure which attempts to capture the idea that a good model is one that resembles neither everything nor nothing of its target system (Weisberg 2013). Hence, a good model must resemble something important of its target system.
Weisberg’s similarity view constitutes a valuable attempt to characterize how a model is related to its target system and, more importantly, how modelers judge the goodness-of-fit (i.e., the degree of similarity) of the model with respect to its target system. He interprets his view as distinct from the semantic view, because his similarity measure aims not to capture what the model-world relationship is per se, but to capture how modelers actually make similarity judgments about the model (Weisberg 2013, 155). The most salient feature of Weisberg’s similarity view is, therefore, its focus on practice rather than on semantics.
I think Weisberg’s practice-oriented approach towards answering the model-world relationship question is on the right track, and this essay can be viewed as a further step along that track. However, though I agree with Weisberg’s view in substance, I disagree with it in detail. In particular, I have argued elsewhere that Weisberg’s view fails to capture a key feature of how modelers make similarity judgments: modelers typically compare the model with its target system holistically (author anonymized, 20xx).
To motivate an alternative to Weisberg’s similarity view, this essay proposes a holistic view of how modelers make similarity judgments. The holistic view holds that modelers typically evaluate the goodness-of-fit of the model with respect to its target system holistically. That is, they make holistic evaluations about the model when comparing the model with its target. An evaluation is holistic when (a) it involves taking the model (and its target system) as an interconnected whole, wherein components of the whole interact with one another in producing certain outputs (or patterns, phenomena, etc.), and (b) the goodness-of-fit measure of the model with respect to its target system is achieved via comparing the outputs (or patterns, phenomena, etc.) produced by these interacting components. The notion of holistic evaluation, as we shall see below, is derived from looking closely at scientific modeling practice.
Before proceeding, it might be helpful to clarify the scope of this essay. First, since this essay centers on how scientists evaluate (or assess, test, verify, check, etc.) their predefined models, other aspects of scientific modeling such as how to explore, search for, build, select, and adjust models are not discussed. However, I think insights about how to evaluate a model must cast light on understanding these other aspects, though a thorough investigation of them must be left for another occasion. Second, it does not aim to provide an overarching account of scientific modeling that could make sense of all sorts of modeling practice. Rather, it aims to offer an account that could constitute at least part of the overarching account of scientific modeling—if there is such an account. Finally, this essay takes it as a background assumption that a modeler’s modeling goal would shape her modeling practice and thus affect her assessment of model adequacy (Levins 1966; Orzack 2005; Odenbaugh 2006; Weisberg 2006; Matthewson and Weisberg 2009; Parker 2009, 2010). Elaborating how a modeling goal would shape one’s modeling practice and affect one’s assessment of model adequacy, however, deserves another essay.
The essay proceeds as follows: to pave the way for examining the somewhat sophisticated MLE practice, Section 2 first considers a very simple estimation method commonly used in practice, namely the least squares estimation (LSE) method. This is done by considering the curve fitting problem using a biological example. After this, Section 3 elaborates the MLE practice using a mathematical model in biology. It turns out that the philosophical implications obtained by scrutinizing LSE reappear when examining MLE in Section 3. Section 4 examines a different kind of model, a concrete model, i.e., the San Francisco Bay model that was first introduced to the philosophical literature by Michael Weisberg (2013). After considering these two kinds of models, a general conclusion is reached: for many, if not all, scientific models, the modelers usually judge the goodness-of-fit of their models with respect to these models’ target systems holistically.
For various reasons, maximum likelihood estimation is a very common estimation method used in modeling practice (Shipley 2002, 71; also see Myung 2003). Thus insights about models and modeling can be gleaned by taking a closer look at MLE practice. Yet, to fully understand MLE practice, it is better to first consider a simpler estimation method, i.e., the least squares estimation (LSE) method, which is also widely used in modeling practice (e.g., in biology see Allman and Rhodes 2004, Symonds and Blomberg 2014; in psychology see Usher and McClelland 2001; in astronomy see Nievergelt 2000).
2.1. Presentation of the method
For simplicity, I will only consider linear least squares estimation in what follows. The following description of the LSE method largely follows Rice (2006), though with many modifications. To put it in a nutshell, LSE is mainly used in curve-fitting problems to estimate parameters by minimizing the sum of squared deviations of the predicted values from the observed values (Rice 2006, 542). The predicted values are also called the fitted values, which are given by the curve. To see how the estimation works, consider a simple example: a biologist is interested in whether a plant’s height ($y$) is related to the amount of fertilizer ($x$) the plant receives. Suppose she also believes that, if $y$ is related to $x$, then the relation must be linear. In this case $x$ is called the predictor variable and $y$ the response variable. She then proposes a model for the two variables:

$$y = \beta_0 + \beta_1 x \tag{1}$$

where $\beta_0$ is the intercept and $\beta_1$ the slope of the proposed line. Remember that her purpose is to choose parameters to minimize the sum of squared deviations of the predicted from the observed values, and $\beta_0$ and $\beta_1$ are the parameters chosen. The following function is the sum of squared deviations:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{2}$$

where $n$ denotes the number of observations and $i$ refers to the $i$th observation. Now suppose she takes a sample of $n$ plants in the species, observing values $y$ of the response variable and $x$ of the predictor variable. For the moment, she chooses values $b_0$ and $b_1$ as estimates of $\beta_0$ and $\beta_1$, which are supposed to best fit the sample data. Based on the data she collects, she plots a figure for the observed values of $x$ and $y$:

Figure 1. The observed data for the height of plants and the amount of fertilizer these plants receive. Note that the data in this example are only made for the purpose of illustration.

Next, she defines a fitted function:

$$\hat{y} = b_0 + b_1 x \tag{3}$$

where $\hat{y}$ is the fitted value (in the fitted line, namely the straight line in Figure 1). Now, for each value of the observed response variable $y_i$ and for each corresponding value of the predictor variable $x_i$, she has a fitted value $\hat{y}_i = b_0 + b_1 x_i$. As such, it is time to minimize the sum of squared deviations of each observed response from its fitted value. She expands equation (2) to get

$$S(b_0, b_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \tag{4}$$

It can be shown that $b_0$ and $b_1$ satisfy the following formulae (Rice 2006, 544):

$$b_1 = \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{5}$$

$$b_0 = \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{6}$$

where $\bar{x}$ refers to the mean of $x$, $\bar{y}$ to the mean of $y$, and $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimates $b_0$ and $b_1$ (Ibid., 542-547). Finally, after these calculations, she obtains estimates for $b_0$ and $b_1$, with $b_1 = 0.58$, and therefore the function for her model is:

$$\hat{y} = b_0 + 0.58x \tag{7}$$
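The calculation behind equations (5) and (6) can be sketched in a few lines of Python. This is only an illustration: the function names and the `fertilizer` and `height` values are hypothetical and are not the data behind Figure 1.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (b0, b1) that minimize the sum of squared deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator and denominator of equation (5), built from mean-centered values.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the predictor values must not all be equal")
    b1 = sxy / sxx
    # Equation (6): the intercept is fixed by the two means and the slope.
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_deviations(xs, ys, b0, b1):
    """Sum of squared deviations of the observed from the fitted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample (illustration only).
fertilizer = [1.0, 2.0, 3.0, 4.0, 5.0]
height = [2.1, 2.4, 3.2, 3.4, 4.1]

b0, b1 = least_squares_line(fertilizer, height)
best = sum_squared_deviations(fertilizer, height, b0, b1)
```

Any other pair of values, such as `(b0 + 0.1, b1)`, yields a larger sum of squared deviations than `best` on the same sample.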
2.2. Discussion: philosophical implications
Having described the estimation method, now let us consider what philosophical implications can be drawn.
The first point to note is that we compared two structures by comparing two curves. That is, we compared the fitted curve with the observed curve. This is typically done by minimizing the sum of squared deviations of the fitted values from the observed values. This process is holistic in essence because, when minimizing the sum of squared deviations, we did not care about the behavior of any particular point (e.g., $(x_i, y_i)$), but did care about the overall effect that is produced interactively by those points’ behaviors. More specifically, looking closely at equation (5), we can see that the value of $b_1$ was achieved by centering the variables $x_i$ and $y_i$ about their means. The remaining calculation was accomplished in terms of the mean-centered variables, not in terms of the variables themselves. The mean of a variable describes the central tendency of that variable, namely the overall behavior of that variable.
Besides, we did not consider each variable separately, but considered pairs of variables (i.e., $(x_i, y_i)$) collectively. For example, the numerator in (5) was obtained by summing up the products of each pair of the mean-centered values of the two variables (i.e., $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$). Therefore, we may say that the value of $b_1$ was achieved by considering the overall behavior of the variables $x$ and $y$. By the same token, looking closely at equation (6), it can be seen that the value of $b_0$ was achieved by considering the means of the variables (i.e., $b_0 = \bar{y} - \hat{\beta}_1 \bar{x}$), not by considering the values of the variables themselves. XY has summed up the situation as follows: “Equations 5 and 6 nonlinearly rely on the means of the variables to which all values make a contribution at the same time, hence any independent contribution to the mean is later involved in a network of nonlinearities” (personal communication). I understand this to mean that, when comparing the model with its target, we did so in an essentially holistic way.
One may point out that if there were an abnormal observation point $(x_s, y_s)$ that resides far away from the remaining points, then the resultant fitted curve would be affected by this point. This sometimes does occur, but rather than undermining my stance, it supports it: since what is of concern is the overall behavior of all observation points, if there exists one point (or a few points) that affects the overall behavior of all points, then such a point must be taken seriously. For example, we may need to look carefully into the unmodeled cause of the occurrence of this point (e.g., measurement errors or latent variables), and a systematic investigation of the cause may deepen a modeler’s understanding of the natural phenomenon in question. As such, we are concerned with the effect of this abnormal point on the overall system’s behavior (i.e., the set of points), rather than with the point itself.
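The effect of a single abnormal point on the whole fit can be illustrated with a small sketch. The data are hypothetical and chosen only so that the arithmetic is easy to follow; `fit_line` is a helper name introduced for this example.

```python
def fit_line(points):
    """Closed-form least squares fit; returns (intercept, slope)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical observations lying exactly on the line y = 2x.
regular_points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
# One hypothetical abnormal observation far away from the rest.
abnormal_point = (6.0, 30.0)

intercept_without, slope_without = fit_line(regular_points)
intercept_with, slope_with = fit_line(regular_points + [abnormal_point])
```

Adding the one abnormal point changes both estimates at once, because it shifts the two means and every mean-centered term that enters the calculation.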
The second detail to note is the way we choose the values for the parameters in LSE. As shown above, in LSE we tried to choose a best combination of values (i.e., $b_0$ and $b_1$) for the parameters that could minimize the discrepancy between the fitted and observed curves. In other words, we did not really care about the particular value of each parameter; instead, we cared about the best combination of values for the two parameters that, taken together, could minimize the distance between the fitted and observed curves. Therefore, the act of choosing one value for one parameter would affect the act of choosing values for another parameter. This has been shown by the dependence relationship between calculating values for $b_0$ and $b_1$ in equations (5) and (6). In short, the values for the parameters interact with one another in producing a best combination of values (we shall see in Section 4.2 that this is a kind of goodness-of-fit interaction among variables or parameters, which differs from another kind of interaction: causal interaction).
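The dependence can be made explicit with a short derivation. The notation $S(b_0, b_1)$ for the sum of squared deviations is assumed here for illustration. Holding $b_1$ at any value whatsoever and minimizing over $b_0$ alone gives:

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2, \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x}.
\end{align*}
```

So the best value for the intercept is a function of whatever value is chosen for the slope, and equation (6) is the special case in which the slope takes its least squares value.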
In sum, the discussion above has demonstrated that in modeling practice we routinely compare the model with its target system holistically. Perhaps this conclusion goes too far, because it is based on a single type of modeling practice, i.e., LSE. Yet, the following section involving another modeling method, i.e., MLE, will show that the conclusion is in fact very general.
3. Maximum Likelihood Estimation
This section is devoted to examining MLE in depth, with the focus on the fourth step where we estimate the parameters of the model. The following testing procedure follows Shipley (2002) and the relevant mathematical demonstration is also due to him. I will try my best to keep only the technical details necessary for developing the holistic view. In addition, since we are interested in what philosophical implications can be derived rather than in any particular scientific model itself, a simplified leaf gas-exchange model involving only two random variables and one error variable will be introduced in this section.
3.1. Presentation of the method
The following description of the MLE method is paraphrased from Shipley (2002, 103-135) with minor modifications. Our goal here is to use MLE to test the goodness-of-fit of a biological model. To do so, the first step is to hypothesize a causal structure, i.e., a causal model, for the variables involved. A simplified leaf gas-exchange model is hypothesized as follows:
Figure 2. A simplified path model relating two random variables, including SLM (specific leaf mass). This figure is modified from Shipley’s original figure (2002, 131).
The second step is to translate the causal model hypothesized above into an observational model, and the observational model is usually expressed as a set of structural equations (Shipley 2002, 104-107):
The next step is to obtain the predicted variance and covariance between each pair of variables in the model, and this usually can be done by using covariance algebra (Ibid., 107-110). The result is shown in Table 1:
Table 1. Predicted population variance and covariance for variables in the model
The fourth step is to estimate the free parameters (i.e., $\varepsilon_k$) in the model using maximum likelihood estimation. This is done by minimizing the difference between the observed covariance of the variables derived from the data and the covariance of the variables predicted by the model. The typical way to do this is to “choose values for the free parameters in our predicted covariance matrix that make it as numerically close as possible to the observed covariance matrix” (Ibid., 110). This process is roughly like searching for the lowest point in a landscape with many hills and valleys. Suppose a lady is blindfolded, and her task is to find the lowest point in the landscape without peeking:
She begins by taking an initial step in a direction based on her best guess. If she sees that she has moved down-slope then she continues in the same direction with a second step in the same direction. If not, she changes direction and tries again. She continues with this process until she finds herself at a position on the landscape in which every possible change in direction results in movement up-slope. She therefore knows that she is in a valley. Unfortunately, if the landscape is very complicated she may have found herself in a small depression rather than on the true valley floor. The only way to find out would be to start over at a different initial position and see whether she again ends up in the same place. (Ibid., 111)
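The search described in this passage can be sketched in a few lines. This is only an illustration of the idea, not the optimizer used by any particular statistical package; the function names, the step-halving rule, and the one-dimensional `landscape` with two valleys are all assumptions made for the example.

```python
def landscape(position):
    """Hypothetical terrain with a shallow valley near x = 0.96
    and a deeper one near x = -1.04."""
    x = position[0]
    return (x * x - 1.0) ** 2 + 0.3 * x


def descend(height_at, start, step=0.1, shrink=0.5, tol=1e-6):
    """Walk downhill from `start` until no step in any direction goes lower."""
    position = list(start)
    height = height_at(position)
    while step > tol:
        moved = False
        for axis in range(len(position)):
            for direction in (1.0, -1.0):
                candidate = position.copy()
                candidate[axis] += direction * step
                candidate_height = height_at(candidate)
                if candidate_height < height:
                    # Down-slope: accept the step and keep going from here.
                    position, height = candidate, candidate_height
                    moved = True
        if not moved:
            # Every direction is up-slope at this step size, so feel more finely.
            step *= shrink
    return position, height


def search_with_restarts(height_at, starts):
    """Start over from several initial positions and keep the lowest valley."""
    return min((descend(height_at, s) for s in starts), key=lambda r: r[1])


right_valley = descend(landscape, [2.0])    # ends in the small depression
left_valley = descend(landscape, [-2.0])    # ends on the true valley floor
lowest = search_with_restarts(landscape, [[2.0], [-2.0]])
```

Starting from the right, the walker stops in the shallower valley and has no way of knowing that a deeper one exists; only the restart from a different initial position reveals it.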
To fully understand this searching process in estimating the parameters of our model, we need to start with a couple of technical terms. For a random variable, $X$, if it is continuous, then we need the probability density function. The probability density function of a univariate normal random variable is denoted $f(X \mid \mu, \sigma)$, where $\mu$ refers to the mean of the observations of $X$, $\sigma$ to the standard deviation of $X$, and $\mu$ and $\sigma$ are population parameters that are fixed. This probability density function says that, given the values of the parameters $\mu$ and $\sigma$, the probability density of $X$ taking on a specific value is $f(X \mid \mu, \sigma)$. However, more often than not we do not know the values of the parameters in question (e.g., $\mu$ and $\sigma$), and it is relatively easier to obtain the values for $X$ by observation (or experiment). Because of this, our goal now shifts to estimating the values for the parameters given the observation. In doing so, we obtain a likelihood function $L(\mu, \sigma \mid X)$, which tells us, given the observation $X$, how likely that observation is under specific values of $\mu$ and $\sigma$; we then seek the values of $\mu$ and $\sigma$ that maximize the likelihood of having observed $X$.
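The shift from density to likelihood can be illustrated for the univariate normal case. The sketch below assumes independent observations and uses the well-known closed-form maximum likelihood estimates (the sample mean, and the standard deviation computed with divisor $n$); the sample values are hypothetical.

```python
import math


def normal_log_likelihood(mu, sigma, observations):
    """Log of L(mu, sigma | X) for independent normal observations."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(observations)
    squared = sum((x - mu) ** 2 for x in observations)
    return -n * math.log(sigma * math.sqrt(2.0 * math.pi)) - squared / (2.0 * sigma ** 2)


def normal_mle(observations):
    """Closed-form values of mu and sigma that maximize the likelihood."""
    n = len(observations)
    mu_hat = sum(observations) / n
    # The maximum likelihood estimate divides by n, not by n - 1.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in observations) / n)
    return mu_hat, sigma_hat


# Hypothetical observations of X (illustration only).
sample = [4.8, 5.1, 5.6, 4.9, 5.3]
mu_hat, sigma_hat = normal_mle(sample)
best_log_likelihood = normal_log_likelihood(mu_hat, sigma_hat, sample)
```

The data stay fixed throughout; what varies is the pair of parameter values, and any pair other than `(mu_hat, sigma_hat)` gives a lower log-likelihood on this sample.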
Now suppose we use the likelihood function $L(\mu, \Sigma \mid X)$ to estimate the values for our parameters under consideration, where $\Sigma$ refers to the population covariance matrix, of which I will say more in what follows. Since what we are interested in are not the exact values of the variables but their relationships, we often center the variables about their means. That is, we subtract the mean of a variable from each of its values, and the resultant mean-centered variable has a mean of zero. Since the data (i.e., $X$) are fixed and the means of the mean-centered variables are fixed (i.e., $\mu = 0$), the only parameters we have to estimate are those in the population covariance matrix $\Sigma$. In the simplified leaf gas-exchange model, the matrix $\Sigma$ is the one described in Table 1. The logic of estimating $\Sigma$ is, as said above, like searching for the lowest point in a landscape with many hills and valleys. More precisely, we first
[…] group all these free parameters together in a vector called $\theta$. Now, if we take a first guess at the values of these free parameters then we can calculate the predicted covariance matrix based on these initial values; let’s call the predicted covariance matrix that results from this guess $\Sigma_1(\theta)$ to emphasise that this matrix will change if we change our values in $\theta$. (Shipley 2002, 113)
Given this first matrix $\Sigma_1(\theta)$, we then calculate the value of the likelihood function $L(\mu, \Sigma_1(\theta) \mid X)$. After this, “we change our initial estimates of the free parameters and recalculate the predicted covariance matrix, $\Sigma_2(\theta)$” (Ibid., 113). Based on this new matrix, we calculate the new value of the likelihood function $L(\mu, \Sigma_2(\theta) \mid X)$. We repeat this process again and again until we cannot increase the value of the likelihood function. At the point where we cannot increase the value of the likelihood function anymore, we obtain the most likely values for those free parameters in the model.
Meanwhile, we can calculate the observed covariance matrix $S$ purely based on the data. With the predicted covariance matrix $\Sigma(\theta)$ and the observed covariance matrix $S$ at hand, the difference between the two matrices can be calculated using the maximum likelihood fitting function $F_{ML}$. Let us focus on the basic ideas of this function. Recall that our goal is to choose values for the free parameters in the predicted covariance matrix so as to make it as numerically close as possible to the observed covariance matrix. So far we have chosen the most likely values for the free parameters in the predicted covariance matrix $\Sigma(\theta)$ by the previous step, and the values of the free parameters that maximize the likelihood function $L(\mu, \Sigma \mid X)$ are also the values that minimize the maximum likelihood fitting function $F_{ML}$. Namely, these are the values that minimize the difference between the predicted covariance matrix and the observed covariance matrix.
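The essay does not print the fitting function itself. A minimal sketch, assuming the form that is standard in the structural equation literature, $F_{ML} = \ln|\Sigma(\theta)| + \mathrm{tr}(S\Sigma(\theta)^{-1}) - \ln|S| - p$ with $p$ the number of observed variables, shows how the function measures the distance between the two matrices. The helper names and the matrices are hypothetical, and the code handles only the two-variable case.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def trace_of_product2(a, b):
    """Trace of the matrix product a @ b for 2x2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))


def f_ml(observed, predicted):
    """Assumed standard ML fitting function for two observed variables."""
    p = 2  # number of observed variables
    return (math.log(det2(predicted))
            + trace_of_product2(observed, inverse2(predicted))
            - math.log(det2(observed))
            - p)


# Hypothetical observed covariance matrix S and two candidate predictions.
S = [[4.0, 1.8], [1.8, 2.5]]
near_guess = [[4.0, 1.5], [1.5, 2.5]]
far_guess = [[4.0, 0.5], [0.5, 2.5]]

distance_near = f_ml(S, near_guess)
distance_far = f_ml(S, far_guess)
distance_exact = f_ml(S, S)
```

The function is zero when the predicted matrix reproduces the observed one exactly, and it grows as the predicted covariances move away from the observed ones.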
The foregoing process is the fourth step, namely estimating the values of the free parameters in the model using the maximum likelihood estimation method. The fifth step is to calculate the probability of having observed the measured minimum difference (assuming that the predicted and observed covariances are identical except for random sampling variation). This step involves calculating the degrees of freedom, and testing the null hypothesis that there is no difference between the predicted and observed covariance matrices except for random sampling variation of $N$ independent observations. Suppose that the probability of having observed the measured minimum difference obtained is 0.36, which is higher than the significance level 0.05; that is, our null hypothesis is not rejected by the evidence.
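In the standard treatment of this test, which I assume is the one followed here, the test statistic and its degrees of freedom take the following form. The symbols $v$ (the number of observed variables) and $q$ (the number of free parameters) are introduced only for this sketch:

```latex
\begin{align*}
\chi^2 &= (N - 1)\, F_{ML}, \\
df &= \frac{v(v+1)}{2} - q.
\end{align*}
```

Here $v(v+1)/2$ counts the distinct variances and covariances in the observed matrix, so each additional free parameter uses up one degree of freedom, and fixing a parameter gives one back.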
3.2. Discussion: Philosophical implications
By carefully tracking the estimation procedure, the central implication we get from the foregoing discussion is that, as we have gleaned from LSE practice, the modelers compare the model with its target system holistically. Since we cannot directly compare the causal relationships—as represented in the graph—with their counterparts in the real system, we instead compared two covariance matrices, one derived from the hypothesized causal graph and one from the real system respectively. A covariance matrix, to put it metaphorically, is a shadow of the model or the target system, which portrays correlations between variables.
Because we started with the hypothesis that it is a causal model, and instead of creating a new model based on data we tested a model against data, the covariances in the matrix (Table 1) were hypothesized to measure not only the (statistical) correlations between variables but also the degree to which these variables are causally linked. Grouping these covariances together results in a covariance matrix, which describes a causal structure within which elements are causally linked to one another directly or indirectly (and of course there are always elements in the structure that are not causally linked, e.g., variables $\varepsilon_2$ in Figure 2). As a consequence, when comparing two covariance matrices, we are in fact comparing two (causal) structures, though we do this in an indirect way (as I said, we cannot compare the structure of the model with the structure of the real plant system directly).
Another way to see the holistic feature of the MLE practice is to notice the focus on the relationships between variables rather than on the variables themselves. As in the case of LSE where we did not care about any particular point’s behavior but about the overall effect that is produced interactively by all those points’ behaviors, in MLE we also did not care about the particular value (or behavior) of each variable. Remember that in MLE we centered each variable around its mean. What we really cared about were the degrees to which one variable would affect another variable, and these influences were expressed by the free parameters (recall the free parameters $b_j$ in our hypothesized model). By estimating the values of the free parameters between variables, we are measuring the degree to which one variable’s behaviors might be affecting another variable’s behaviors. By estimating the values of all the parameters at the same time given the data (recall the process of choosing values for the parameters described above), we are in fact measuring the whole causal structure; that is, we are measuring the degree to which our model fits the data.
Last but not least, let us take a closer look at the estimation of the free parameters. In LSE practice we tried to choose a best combination of values for the parameters that could minimize the discrepancy between the fitted and observed curves, and the same story holds in MLE practice. Recall the process of choosing values for the free parameters using the likelihood function $L(\mu, \Sigma \mid X)$. Our goal was to find a set of values for our free parameters that could maximize the likelihood function. The set of values that could maximize the likelihood function were grouped into a vector called $\theta$. Notice that we did not really care about any particular value in the vector; instead, we cared about the best combination of values for the vector that, taken together, could maximize the likelihood function. Therefore, the act of choosing one value for one free parameter would affect the act of choosing values for other free parameters in the vector. In other words, the free parameters in the vector interact with one another in producing a best combination that could maximize the likelihood function. For a vivid illustration of how the free parameters in the vector might interact with one another in producing a best combination, and, more generally, of how the process of choosing values for parameters is holistic in nature, see Section 3.3, where the computer program Tetrad is used to evaluate a scientific model based on a hypothetical data set.
In sum, so far we have seen from several angles that the modelers judge the goodness-of-fit of the model holistically. First, when comparing a model with its target we are comparing two structures through comparing their covariance matrices. Second, the way we compare two structures is via focusing on relationships among variables rather than on variables themselves. Third, the relationships among variables, expressed as free parameters, interact with one another in producing a best fit of the model to its target system.
Given the same philosophical implications obtained from the LSE and MLE methods which are the most common tools in modeling practice, the more general lesson is that in many cases of modeling practice, when evaluating a model against its target system we usually compare two structures holistically.
3.3. Testing a scientific model using the Tetrad program
Tetrad is a computer program developed by Clark Glymour, Richard Scheines, and Peter Spirtes (and many others) at Carnegie Mellon University, which is used to create, test, and search for causal and statistical models.
Now we use this program to estimate parameters of a causal model given a hypothetical data set. This model describes the causal effects of various factors on people’s behavior of donation (Cryder et al. 2013). This model has five variables: tangibility condition (how detailed the situation is described to the donors), imaginability (how concrete the situation is), sympathy (how much sympathy for the target), impact (how much impact will the donors’ money have on the target), and amount donated (actual amount of money donated by the donors). The hypothesized model is described below:
Figure 3. A causal model among five variables. For purpose of illustration, error variables are not shown in the figure. $b_i$ denotes free parameters to be estimated.
With the data at hand, we now test this model using the Tetrad program. The testing process involves finding the most likely values for the parameters in the model, and this in turn involves using the maximum likelihood estimation method, as described in Section 3. It can be expected that the model will fit the data very well, because the data are just generated by this model. The result of the testing is shown below:
Figure 4. Results of using the Tetrad program to estimate the free parameters in the causal model. Numbers in black show the values of the parameters, and numbers in green show the mean values of the variables.
The model’s degrees of freedom are 4, and the chi square test of this model assumes that the maximum likelihood fitting function over the measured variables has been minimized; that is, it assumes that we have found the most likely values for the parameters that can minimize the difference between the predicted covariance matrix and the observed covariance matrix. The value of the chi square test is 3.0327, and the probability of having observed the measured minimum difference (assuming that the predicted and observed covariances are identical except for random sampling variation) is 0.5524, which is well above the threshold value 0.05. Therefore, the hypothesized model is not to be rejected given the data.
Remember that our purpose for discussing this model using the Tetrad program is to show how the free parameters in a model might interact with one another in producing a best fit of the model to its target system. To show this, let me perform a little operation on the model: now we change the value of the parameter along the path linking Sympathy to AmountDonated from 0.8007 to 0.7200, that is, we fix the value of this parameter at 0.7200. The other parameters are still free parameters because they are not fixed. Then we test this modified model given the same data set we just used above. The result is shown below:
Figure 5. Results of using Tetrad to estimate the free parameters in the causal model by fixing the value of one parameter. The number in red is the value that has been fixed.
It can be seen from Figure 5 that, by changing the value of the parameter along the path linking Sympathy to AmountDonated from 0.8007 to 0.7200, the values of other parameters have also been changed. For example, the value of the parameter linking Impact to AmountDonated has been remarkably changed from 0.6371 to 0.7163, the value of the parameter linking TangibilityCondition to Sympathy has been changed from -0.5215 to -0.5021, etc.
Interestingly, given these changes, the modified model still fits its target system (i.e., the data set) to an extent. In this modified model, the chi square test also assumes that the maximum likelihood fitting function over the measured variables has been minimized. With the degrees of freedom being 5 (since we fix one parameter, there is one less parameter to estimate and thus one more degree of freedom), the result of the chi square test is 9.0028, and the probability of having observed the measured minimum difference is 0.1090 which is larger than the threshold value 0.05. Therefore, the modified model is also not to be rejected given the data.
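The two probabilities reported for the original and the modified model can be checked independently of Tetrad. The sketch below is not Tetrad's code; it uses the closed-form series for the upper tail of a chi-square distribution with integer degrees of freedom, and the function name `chi_square_sf` is introduced only for this example.

```python
import math


def chi_square_sf(x, df):
    """P(chi-square >= x) for a positive integer number of degrees of freedom."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times the sum of (x/2)^j / j! for j < df/2.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in x.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total


# Values reported in the text for the two Tetrad runs.
p_original = chi_square_sf(3.0327, 4)   # all parameters free
p_modified = chi_square_sf(9.0028, 5)   # one parameter fixed at 0.7200
```

Both values agree with the reported 0.5524 and 0.1090 to the precision given, and both lie above the 0.05 threshold.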
The moral of this discussion is twofold. First, the act of choosing values for one parameter does have an impact on choosing values for the other parameters in the model. Second, though one parameter in a model might be changed significantly (e.g., from 0.8007 to 0.7200), the model may still fit its target system by adjusting the other parameters in a way that produces a best combination of values of the parameters given the data, i.e., produces a best fit of the model to its target system.
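The same twofold moral can be reproduced in a toy setting. The sketch below is a least squares analogue, not the Tetrad computation: two correlated, mean-centered predictors, hypothetical data generated from slopes 0.8 and 0.6, and helper names introduced only for the example.

```python
def fit_two_slopes(x1, x2, y):
    """Free fit of y = b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def fit_b2_given_b1(x1, x2, y, fixed_b1):
    """Best b2 when b1 is held at a fixed value instead of being estimated."""
    s22 = sum(b * b for b in x2)
    return sum(b * (c - fixed_b1 * a) for a, b, c in zip(x1, x2, y)) / s22


# Hypothetical mean-centered predictors that are correlated with each other.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.0, -2.0, 0.0, 2.0, 1.0]
# Hypothetical responses generated without noise from b1 = 0.8 and b2 = 0.6.
y = [0.8 * a + 0.6 * b for a, b in zip(x1, x2)]

free_b1, free_b2 = fit_two_slopes(x1, x2, y)
# Fix b1 below its best value and let b2 adjust, as in the modified model.
refit_b2 = fit_b2_given_b1(x1, x2, y, fixed_b1=0.72)
```

With the first slope free, the fit recovers 0.8 and 0.6; with the first slope fixed at 0.72, the best value for the second slope moves to 0.664, compensating for the change.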
4. Generalizing the View: The San Francisco Bay Model
The foregoing discussion has shown that modelers usually compare their mathematical models with the target systems holistically, and that this holistic practice is insensitive to which particular estimation method is employed. Yet, one might be interested in exploring whether the philosophical implications obtained by scrutinizing mathematical models can be further generalized to other kinds of models, concrete models for example. If it turns out that the implications also hold in other kinds of models, then the holistic view gets extra support with respect to its viability as well as its scope. To this end, this section examines a concrete model, i.e., the San Francisco Bay model.
4.1. The San Francisco Bay model
In the 1950s there was a substantial debate over an ambitious plan: damming up the San Francisco Bay area so as to improve the water supply in the area (Weisberg 2013, 1). To settle this dispute, the Army Corps of Engineers was charged with investigating the influence of the proposed plan by building a massive hydraulic scale model of the Bay system (Ibid., 1-2).
Once the model was built, it was adjusted to accurately reproduce several measurements of the parameters such as tide, salinity, and velocities actually recorded in the Bay. This involved adjustment of the tidal apparatus, adjustment of the frictional resistance of the model bed, and so on (Huggins and Schultz 1967, 11).
Fortunately, the Bay model worked very well after adjustment:
Agreement between model and prototype for the verification survey of 21-22 September 1956, and for other field surveys, was excellent. Tidal elevations, ranges and phases observed in the prototype were accurately reproduced in the model. Good reproduction of current velocities in the vertical, as well as in the cross section, was obtained at each of the 11 control stations in deep water and at 85 supplementary stations. The salinity verification tests for the verification survey demonstrated that for a fresh-water inflow into the Bay system […], fluctuation of salinity with tidal action at the control points in the model was in agreement with the prototype. (Huggins and Schultz 1967, 11)
After the adjustment, it was time to evaluate the proposed plan. The evaluation showed that it would considerably reduce water-surface areas, reduce the velocities of currents in most of South San Francisco Bay, reduce the tidal discharge through the Golden Gate during the tidal cycle, and so on (Huggins and Schultz 1973, 19). Given these disastrous consequences, the Army Corps had good reasons to reject the proposed plan (Weisberg 2013, 9).
The next subsection will show that features in the Bay model interact with one another in producing outputs. Based on this, Section 4.3 will suggest that the modelers can only judge the Bay model with respect to its target system holistically.
4.2. Features interact with one another
As a first step, consider the idea that the Bay model, as a concrete object, is a structure in which features interact with one another in producing phenomena (or outputs). This is a plain fact for many modelers. For instance, Huggins and Schultz put it explicitly when describing the testing of the Bay model: “Among the problems to be considered were the conservation of water and the preservation of its quality; […] the intrusion of salinity into the Sacramento-San Joaquin Delta; the tides, currents and salinity of the Bay as they affect other problems […]. None of these problems can be studied separately, for each affects the others” (1973, 12; my emphasis).
Consider, for instance, the relationship between two key features in the model: tide and salinity. Generally, salinity varies along an estuary and increases rapidly close to the river mouth due to tidal influences. An estuary is
[…] the transition between a river and a sea. There are two main drivers: the river that discharges fresh water into the estuary and the sea that fills the estuary with salty water, on the rhythm of the tide. The salinity of the estuary water is the result of the balance between two opposing fluxes: a tide-driven saltwater flux that penetrates the estuary through mixing, and a freshwater flux that flushes the saltwater back. (Savenije 2005, Preface ix).
To illustrate this rhythm of the tide, consider the effect of the spring-neap tidal cycle on the vertical salinity structure of the James, York and Rappahannock Rivers, Virginia, U.S.A.:
Analysis of salinity data from the lower York and Rappahannock Rivers (Virginia, U.S.A.) for 1974 revealed that both of these estuaries oscillated between conditions of considerable vertical salinity stratification and homogeneity on a cycle that was closely correlated with the spring-neap tidal cycle, i.e. homogeneity was most highly developed about 4 days after sufficiently high spring tides while stratification was most highly developed during the intervening period. (Haas 1977, 485)
This short report shows not only that characteristics of salinity (such as stratification and homogeneity) are influenced by characteristics of the tide, but also that there is a phase connection (or synchronization) between the tidal cycle and salinity oscillations. The former is a causal relationship while the latter is a temporal relationship (or a statistical correlation). The phase connection among features was also emphasized by the Army Corps when verifying the Bay model: “These gages were installed in the prototype and placed in operation several months in advance of the date selected to collect the primary tidal current and salinity data required for model verification, since it was essential to obtain all data simultaneously for a given tide over at least one complete tidal cycle of 24.8 hours” (1963, 50; my emphasis). Moreover, the same story goes for tide and tidal currents (for details see Army Corps 1963, 20).
In short, features in the model bear not only causal relationships, but also temporal relationships. This implies that, when verifying the model, features of the model causally interact with each other in producing certain outputs (e.g., predictions, effects, phenomena, etc.), rather than individually or separately producing outputs. So although outputs of key features in the Bay model can be identified and measured separately, they are not produced separately.
Causal interaction among features leads us to notice a second kind of interaction, i.e., a goodness-of-fit interaction, wherein features interact with one another in producing the goodness-of-fit measure of the model with respect to its target system. That is, one feature’s contribution to the goodness-of-fit measure depends on other feature(s)’ contribution(s) to that measure. For example, a goodness-of-fit interaction is shown by the verification of salinity in the Bay model, where the measurement of salinity (as a measurement of one feature’s contribution to the goodness-of-fit value) depended on the measurement of other features in the way that other features must be kept constant. That is, when verifying tidal current and salinity, “it was essential to obtain all data simultaneously for a given tide over at least one complete tidal cycle of 24.8 hours” (Army Corps 1963, 50), and “salinity phenomena in the model were in agreement with those of the prototype for similar conditions of tide, ocean salinity, and fresh-water inflow” (Ibid., 54; my emphasis).
The causal interaction (CI for short) and the goodness-of-fit interaction (GFI for short) are two different things. To begin with, CI means that the causal effect of one factor depends on the causal effect of another factor, while GFI means that one feature’s contribution to the goodness-of-fit measure of a model depends on other features’ contributions to that measure. The interplay between tide and salinity in the Bay model described above is a good example of CI, because how salinity produces outputs depends on how tide varies along an estuary. On the other hand, the interplay between the measurement of salinity and the measurement of other features in the Bay system (e.g., tide) is a good example of GFI, because, as we often do in practice, we must keep the other features constant when measuring salinity. A good way to understand what GFI is is to first understand what GFI is not. Suppose there is a concrete model involving only two factors, such as salinity and tide. If, for example, we can measure salinity’s contribution to the goodness-of-fit measure of the model independently, that is, we can measure salinity’s contribution without considering tide (e.g., without requiring similar tidal conditions), then the goodness-of-fit measure of the model with respect to its target system can be achieved by simply adding the goodness-of-fit measures of salinity and tide together. In this case, there is no goodness-of-fit interaction between salinity and tide in the model. Second, CI typically concerns factors within an object (e.g., in a model or a target system), whereas GFI concerns two objects, because goodness-of-fit concerns the degree to which one object fits another object. Thus, they are two different concepts that should be distinguished clearly. Third, there is no simple relationship between the two. For example, when computing the fit of a straight line y = ax + b to a cloud of points, a and b will depend on each other to produce the best fit. Yet, though there exists a goodness-of-fit interaction between a and b, there is no causal interaction between them. However, CI and GFI sometimes do happen together. For instance, in the Bay model, CI among features comes with GFI, in that each feature’s contribution to the goodness-of-fit measure of the model depends on other features’ contributions to that measure.
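The straight-line case can be made concrete with a small numerical sketch. The five data points below are invented for illustration, and the helper names sse and best_intercept are hypothetical. The sketch shows two things under these assumptions: the best intercept b depends on which slope a is chosen, and the effect of changing b on the overall misfit depends on a, so the misfit cannot be split into one part due to a and another part due to b.

```python
# Illustrative data only; not taken from any study discussed in the text.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

def sse(a, b):
    """Sum of squared errors of the line y = a*x + b on the data."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def best_intercept(a):
    """Intercept minimizing sse(a, b) when the slope is held at a."""
    return mean_y - a * mean_x

# Ordinary least squares estimates of slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
a_hat = sxy / sxx
b_hat = best_intercept(a_hat)

# The best intercept shifts when a different slope is imposed.
b_for_other_slope = best_intercept(1.5)

# The same change in b has a different effect on the misfit at different slopes.
delta_at_a_hat = sse(a_hat, b_hat + 1.0) - sse(a_hat, b_hat)
delta_at_other = sse(1.5, b_hat + 1.0) - sse(1.5, b_hat)

print(a_hat, b_hat, b_for_other_slope)
print(delta_at_a_hat, delta_at_other)
```

Here raising b by one unit worsens the fit when a is at its least squares value but improves it when a is held at 1.5. This is a goodness-of-fit interaction between a and b in the sense defined above, although nothing causal connects the two parameters.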
To sum up, we have seen that features of a model may interact with one another to produce outputs. This causal interaction among features has led us to notice another kind of interaction, namely the goodness-of-fit interaction, wherein one feature’s contribution to a model’s goodness-of-fit measure may depend on other features’ contributions to that measure.
4.3. Verifying the model is a holistic matter
The last subsection has shown that the features in the Bay model interact with one another in producing phenomena (or outputs). This leads us to a further claim: due to the goodness-of-fit interaction among features discussed above, the modelers must judge the model with respect to its target holistically when verifying the model. To show this, let us go back to the verification scenario.
At first blush, it seems the verification of the Bay model was achieved by independently verifying each feature, as one might interpret the report quoted in Section 4.1. That is, it seems that by verifying that each feature in the model fits its counterpart in the target, modelers made the judgment that the model fits the target system. Underlying this seemingly plausible reasoning, however, there remains the problem of why the modelers were entitled to confirm the verification of the model by verifying only several individual features. Or, to put it slightly differently, in virtue of what does the fit of features guarantee the judgment about the fit of the model to the target? I believe that it is more than the fit of individual features themselves that makes sense of the reasoning that the model fits the target.
Looking more closely at the verification procedure, we obtain an answer to our question: the modelers did not really verify each feature independently. Rather, they routinely kept other features constant when verifying one feature, for the data for each feature were obtained against similar conditions of other features. This practice of keeping other variables constant when verifying one variable is justified by the common methodology deployed in scientific practice, e.g., in developmental biology scientists often keep the environmental factors constant when investigating the causal effect of a gene (or a set of genes) on the development of a phenotypic trait (e.g., eye colors). Therefore, the actual scenario is not that the modelers verified the model by verifying each feature independently, but rather that the verification of each feature involved other features.
Given the actual scenario portrayed above, we now understand why verifying the Bay model was a holistic matter. First, if we regard the verification of each feature as measuring that feature’s contribution to the goodness-of-fit measure of the model with respect to its target system, then we have seen how each feature’s contribution to the goodness-of-fit measure depended on other features’ contributions to that measure. Second, and more importantly, the fact that the modelers, when verifying the model, attempted to obtain all data about tidal current and salinity “simultaneously for a given tide over at least one complete tidal cycle of 24.8 hours” indicates that the modelers treated their model as a whole, as opposed to a piecemeal collection of unrelated items. A different way to understand this is to ask why the modelers tried to obtain all data “simultaneously for a given tide”. Why not simply collect data for each feature separately and then add them together? The reason is that, as implied by the common methodology mentioned above, obtaining data for one variable without keeping the other variables constant may introduce confusion about which variable is causally responsible for which effect. As an illustration, consider developmental biology again. If one were to design a series of experiments to investigate how a gene might affect the development of a phenotype without keeping the environmental factors constant (or without holding the environment similar across experiments), then it would be highly dubious whether the gene is really the difference-maker.
With this discussion in mind, it is not difficult to find the same salient characteristics as those found in the hypothesized causal structure about leaf gas-exchange. The first salient characteristic of testing the concrete object, as discussed above, is that the modelers treated their model holistically. The second salient characteristic, which has been implied by the first, is that the modelers knew that how one feature could produce effects must be constrained by how other features behaved. In other words, features interacted with one another in producing certain outputs. This causal interaction has also led us to notice another kind of interaction in the Bay model, i.e., the goodness-of-fit interaction, wherein one feature’s contribution to the goodness-of-fit measure of the model might depend on other features’ contributions to that measure.
In conclusion, we have seen that the philosophical implications obtained by using LSE and MLE to evaluate mathematical models also apply in the concrete model case, even though the evaluation of the concrete object did not involve MLE or anything of the like (note that the MLE method was developed in the late 1960s and 1970s by Jöreskog (1967, 1969, 1970) and Keesling (1973), long after the testing of the Bay model in the 1950s). Given this, the lesson drawn in the previous subsections can be further generalized: for many scientific models (mathematical, concrete, etc.), we treat them holistically when evaluating them.
This essay has proposed a holistic approach to scientific modeling by way of looking closely at scientific practice. This approach views one central aspect of scientific modeling, i.e., the evaluation of models, as a holistically featured practice, wherein modelers typically treat their models as holistic objects that have interacting components. Though it only focused on one aspect of scientific modeling, it might have substantial implications for other aspects of modeling. First, for example, it might be hypothesized that building, exploring or adjusting models is also a holistic matter—e.g., in the context of building a model, adding or removing a causal path between two variables may have an impact not only on the two variables involved but also on the overall causal structure. Second, one might claim that reapproaching the thorny model-world relationship problem through the holistic lens may yield unexpected insights, though how this can be done remains to be seen.
Allman, E. S., and Rhodes, J. A. (2004). Mathematical Models in Biology: An Introduction. Cambridge: Cambridge University Press.
Army Corps of Engineers. (1963). Technical Report on Barriers: A Part of the Comprehensive Survey of San Francisco Bay and Tributaries, California. Appendix H, Volume 1: Hydraulic Model Studies. San Francisco: Army Corps of Engineers.
Baron, S., Colyvan, M., and Ripley, D. (2016). “How Mathematics Can Make a Difference”. Philosophers’ Imprint (forthcoming).
Bekker, P. A., Merckens, A., and Wansbeek, T. J. (2014). Identification, Equivalent Models, and Computer Algebra: Statistical Modeling and Decision Science. Academic Press.
Bueno, O., and Colyvan, M. (2011). “An Inferential Conception of the Application of Mathematics”. Noûs, 45 (2): 345-374.
Bueno, O., and French, S. (2012). “Can Mathematics Explain Physical Phenomena?” British Journal for the Philosophy of Science, 63 (2012): 85-113.
Cartwright, N. (1983). How the Laws of Physics Lie. Oxford: Oxford University Press.
Cryder, C. E., Loewenstein, G., and Scheines, R. (2013). “The Donor is in the Details.” Organizational Behavior and Human Decision Processes, 120 (1), 15-23.
Frigg, R. (2009). “Models and Fiction.” Synthese, 172 (2): 25-68.
Frigg, R., and Nguyen, J. (2016). “The Fiction View of Models Reloaded.” The Monist, 99 (3): 225-242.
Giere, R. (1988). Explaining Science: A Cognitive Approach. Chicago: University of Chicago Press.
Giere, R. (2004). “How Models are Used to Represent Reality.” Philosophy of Science, 71 (5): 742-752.
Giere, R. (2010). “An Agent-Based Conception of Models and Scientific Representation.” Synthese, 172 (2): 269-281.
Godfrey-Smith, P. (2006). “The Strategy of Model-based Science.” Biology and Philosophy, 21 (5): 725-740.
Goodman, N. (1972). “Seven Strictures on Similarity.” In N. Goodman (Ed.), Problems and Projects, 437-446. Indianapolis: Bobbs-Merrill.
Greene, W. H. (2012). Econometric Analysis (7th Edition). Pearson.
Haas, L. W. (1977). “The Effect of the Spring-Neap Tidal Cycle on the Vertical Salinity Structure of the James, York and Rappahannock Rivers, Virginia, U.S.A”. Estuarine and Coastal Marine Science, 5: 485-496.
Huggins, E. M., and Schultz, E. A. (1967). “San Francisco Bay in A Warehouse”. Journal of the IEST, 10 (5): 9-16.
Huggins, E. M., and Schultz, E. A. (1973). “The San Francisco Bay and the Delta Model”. California Engineer, 51 (3): 11-23.
Jöreskog, K. G. (1967). “Some Contributions to Maximum Likelihood Factor Analysis.” Psychometrika, 32 (4): 443-482.
Jöreskog, K. G. (1969). “A General Approach to Confirmatory Maximum Likelihood Factor Analysis.” Psychometrika, 34 (2): 183-202. http://doi.org/10.1007/BF02289343.
Jöreskog, K. G. (1970). “A General Method for Analysis of Covariance Structures.” Biometrika, 57 (2): 239-251.
Keesling, J. W. (1973). Maximum Likelihood Approaches to Causal Flow Analysis. Chicago: University of Chicago.
Lee, S., and Hershberger, S. (1990). “A Simple Rule for Generating Equivalent Models in Covariance Structure Modeling.” Multivariate Behavioral Research, 25 (3): 313-334.
Levins, R. (1966). “The Strategy of Model Building in Population Biology.” American Scientist, 54 (4): 421-431.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., and Fabrigar, L. R. (1993). “The Problem of Equivalent Models in Applications of Covariance Structure Analysis.” Psychological Bulletin, 114 (1): 185-199.
Matthewson, J., and Weisberg, M. (2009). “The Structure of Tradeoffs in Model Building”. Synthese, 170 (1): 169-190.
Morgan, M. S., and Morrison, M. (1999). Models as Mediators: Perspectives on Natural and Social Science (Vol. 52). Cambridge: Cambridge University Press.
Myung, I. J. (2003). “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology, 47 (1): 90-100.
Nievergelt, Y. (2000). “A Tutorial History of Least Squares with Applications to Astronomy and Geodesy.” Journal of Computational and Applied Mathematics, 121 (1-2): 37-72.
Odenbaugh, J. (2006). “The Strategy of ‘The strategy of Model Building in Population Biology’.” Biology and Philosophy, 21 (5): 607-621.
Orzack, S. H. (2005). “What, If Anything, Is ‘The Strategy of Model Building in Population Biology?’ A Comment on Levins (1966) and Odenbaugh (2003).” Philosophy of Science, 72 (3): 479-485.
Parker, W. S. (2009). “Confirmation and Adequacy-for-Purpose in Climate Modelling.” Proceedings of the Aristotelian Society, Supplementary Volumes, 83: 233-249.
Parker, W. S. (2010). “Scientific Models and Adequacy-for-Purpose.” Modern Schoolman: A Quarterly Journal of Philosophy (Proceedings of the 2010 Henle Conference on Experimental & Theoretical Knowledge), 87 (3-4): 285-293.
Parker, W. S. (2015). “Getting (even more) Serious about Similarity.” Biology and Philosophy, 30 (2): 267-276.
Pollard, D., and Radchenko, P. (2006). “Nonlinear Least-Squares Estimation.” Journal of Multivariate Analysis, 97 (2): 548-562.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd Edition). Cambridge: Cambridge University Press.
Quine, W. V. O. (1969). “Natural Kinds.” In W. V. O. Quine (Ed.), Ontological Relativity and Other Essays, 114-138. New York: Columbia University Press.
Raykov, T., and Marcoulides, G. A. (2001). “Can There Be Infinitely Many Models Equivalent to a given Covariance Structure Model?” Structural Equation Modeling, 8 (1): 142-149.
Rice, J. A. (2006). Mathematical Statistics and Data Analysis (3rd Edition). Nelson Education.
Salis, F. (2016). “The Nature of Model-World Comparisons”. The Monist, 99 (3), 243-259.
Savenije, H. H. G. (2005). Salinity and Tides in Alluvial Estuaries. Elsevier Science.
Shipley, B. (2002). Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge: Cambridge University Press.
Sneed, J. D. (1971). The Logical Structure of Mathematical Physics. Dordrecht: Reidel.
Stegmüller, W. (1976). The Structure and Dynamics of Theories. New York: Springer-Verlag.
Stelzl, I. (1986). “Changing a Causal Hypothesis without Changing the Fit: Some Rules for Generating Equivalent Path Models.” Multivariate Behavioral Research, 21 (3): 309-331.
Suárez, M. (2004). “An Inferential Conception of Scientific Representation”. Philosophy of Science, 71 (5): 767-779.
Suárez, M. (2015a). “Representation in Science”. In P. Humphreys (Ed.), The Oxford Handbook of Philosophy of Science (forthcoming). Oxford: Oxford University Press.
Suárez, M. (2015b). “Deflationary Representation, Inference, and Practice”. Studies in History and Philosophy of Science, 49 (2015): 36-47.
Suppe, F. (1977). The Structure of Scientific Theories. Urbana, Illinois: University of Illinois Press.
Suppe, F. (1989). The Semantic Conception of Theories and Scientific Realism. Urbana, Illinois: University of Illinois Press.
Suppes, P. (1957). Introduction to Logic. New Jersey: D. Van Nostrand and Co.
Suppes, P. (1962). “Models of Data.” In E. Nagel, P. Suppes, and A. Tarski (Eds.), Logic, Methodology, and the Philosophy of Science, 252-261. California: Stanford University Press.
Suppes, P. (1967). “What is a Scientific Theory?” In S. Morgenbesser (Ed.), Philosophy of Science Today. New York: Meridian Books.
Symonds, M. R. E., and Blomberg, S. P. (2014). “A Primer on Phylogenetic Generalised Least Squares.” In L. Z. Garamszegi (Ed.), Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology: Concepts and Practice, 105-130. Berlin, Heidelberg: Springer Berlin Heidelberg.
Toon, A. (2012). Models as Make-Believe: Imagination, Fiction and Scientific Representation. Palgrave-Macmillan.
Usher, M., and McClelland, J. L. (2001). “The Time Course of Perceptual Choice: The Leaky, Competing Accumulator Model.” Psychological Review, 108 (3), 550-592.
van Fraassen, B. C. (1970). “On the Extension of Beth’s Semantics of Physical Theories.” Philosophy of Science, 37 (3): 325-339.
van Fraassen, B. C. (1972). “A Formal Approach to the Philosophy of Science.” In R. Colodny (Ed.), Paradigms and Paradoxes. Pittsburgh: University of Pittsburgh Press.
Weisberg, M. (2006). “Forty Years of ‘The Strategy’: Levins on Model Building and Idealization.” Biology and Philosophy, 21 (5): 623-645.
Weisberg, M. (2013). Simulation and Similarity: Using Models to Understand the World. New York: Oxford University Press.
 For example, in the following cases MLE is preferable to other methods: when the model contains latent variables, when the model relies on variables that contain measurement errors, etc. (Shipley 2002, 71). Latent variables are those that cannot be directly observed and measured (Ibid., 71).
 Though the testing procedures are different, the underlying idea of LSE is based on MLE. Moreover, when we assume that the measurement errors are independent and normally distributed with constant standard deviation, LSE is equivalent to MLE. For a discussion of the relationship between LSE and MLE, see Press et al. (1992, Chapter 15).
 For a discussion of nonlinear least squares estimation, see Pollard and Radchenko (2006).
 I thank XY for bringing this to my attention.
 This point is called an outlier in statistics, representing an observation point that is far from the bulk of data. This may be due to improperly calibrated equipment, recording and transcription errors, or equipment malfunctions, etc. (Rice 2006, 393-394).
 The original model involving more variables can be found in Shipley (2002, 130-135).
 For the details of the probability density function, see Shipley (2002, 110-114) or Rice (2006, 329-334).
 For details of the likelihood function see Shipley (2002, 113) or Greene (2012, 549-550).
 For details of this function, see Shipley (2002, 113-115).
 The degrees of freedom are given by the function v(v + 1)/2 - (p + q), where v is the number of variables, p is the number of free parameters, and q is the number of free variances of exogenous variables (including error variables). For details, see Shipley (2002, 114).
 The testing process involves using the maximum likelihood chi-squared statistic which “will asymptotically follow a central chi-squared distribution with the degrees of freedom given above” (Shipley 2002, 115). For details of this statistic see Shipley (2002, 114-115).
 The significance level indicates whether there is a relationship between various variables represented in the model under testing, or whether the result can be explained by the null hypothesis. An outcome is said to be statistically significant only if it can enable the rejection of the null hypothesis.
 This metaphor comes from Shipley (2002, 1-6).
 I thank XZ and XY for drawing my attention to the Tetrad program and its relevance to my arguments presented in Section 3.2.
 For illustration purposes, the data set is generated by the Tetrad program based on the hypothesized model. We have one sample and the sample size is 1,000. Though the data set does not come from the world, the following discussion will show that our conclusion does not depend on where the data set comes from.
 Following Michael Weisberg, I think features refer to two different categories of things, with one category denoting properties and patterns, and another denoting the underlying mechanisms that generate the properties and patterns in the first category (Weisberg 2013, 145-146). Since the verification of the San Francisco Bay model in the 1950s concentrated on properties such as salinity, velocities of currents, and tide, not on the underlying mechanisms that generated these properties, the term feature in this section narrowly denotes the first category.
 I thank XY for suggesting this concept to me, and also for giving me the idea about what the goodness-of-fit interaction consists in and in what sense it differs from causal interaction. Wendy Parker has a similar idea of “goodness-of-fit interaction” when evaluating Michael Weisberg’s similarity measure, though she does not fully flesh out this idea (2015, 274).
 This point can be best illustrated with the curve fitting example: when computing the fit of a straight line y = ax + b to a cloud of points, a and b will depend on each other to produce the best fit. I thank XY for giving me this example. This point has been shown in Section 2 with LSE.
 I thank XY for giving me these precise definitions.
 I thank XY for illustrating this difference.
 I thank XY for giving me this example.
 The relationship between CI and GFI may be more complicated than presented in this section. For example, there might be cases in which there is CI while there is no GFI. However, noticing that CI and GFI sometimes do happen together is sufficient for my current purposes (i.e., showing that in the Bay model there exist both CI and GFI), and addressing the more complicated issue must be left to another occasion. I thank XY and XZ for drawing my attention to the complicated relationship between causal and similarity interactions.
 In what follows by features I mean these three features: salinity, velocities of currents and tide.
 The same story goes for tide and tidal currents. For details see Army Corps (1963, 20).