One Ecosystem :
Review Article

Corresponding author: James B Grace (gracej@usgs.gov)
Academic editor: Stoyan Nedkov
Received: 24 Jan 2020  Accepted: 10 Mar 2020  Published: 13 Mar 2020
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Grace JB (2020) A 'Weight of Evidence' approach to evaluating structural equation models. One Ecosystem 5: e50452. https://doi.org/10.3897/oneeco.5.e50452

It is possible that model selection has been the most researched and most discussed topic in the history of both statistics and structural equation modeling (SEM). The reason for this is because selecting one model for interpretive use from amongst many possible models is both essential and difficult. The published protocols and advice for model evaluation and selection in SEM studies are complex and difficult to integrate with current approaches used in biology. Opposition to the use of pvalues and decision thresholds has been voiced by the statistics community, yet certain phases of model evaluation have been historically tied to reliance on pvalues. In this paper, I outline an approach to model evaluation, comparison and selection based on a weightofevidence paradigm. The details and proposed sequence of steps are illustrated using a realworld example. At the end of the paper, I briefly discuss the current state of knowledge and a possible direction for future studies.
structural equation modelling, path models, model evaluation, model selection, weightofevidence
Model selection is one of the more challenging aspects of structural equation modeling. The selection decision typically follows a multistep process of model evaluation that considers numerous possible models and various types of evidence. Traditionally, this process has depended strongly on the use of pvalues and the concept of dichotomous hypothesis tests. There have been numerous calls from the statistics community for a cessation of pvaluebased hypothesis testing, especially as of late (see commentary by
When evaluating models estimated using traditional methods, there are two instances where SEM investigations encounter pvalues:
The likelihoodratio Χ^{2} test statistic was originally proposed as the first and final judge of global model fit (see
Throughout the modern history of SEM, which can be thought of as the time since the LISREL synthesis in the early 1970s (
Within the field of ecology,
Shifting away from a strict reliance on dichotomous hypothesis testing towards model comparisons has implications for model selection in SEM studies. Results reported by SEM software includes pvaluebased indices. Experience tells us that pvalues are useful indicators of modeldata relations, but today's general advice is to not use them for dichotomous significance testing. In this paper, I illustrate an approach to model evaluation and selection that utilises pvalues as information, while basing model selection on a comparison of the total weight of evidence. I first discuss the types of evidence that might be considered, including the role for expert knowledge. Second, I propose a sequence of steps to follow. Third, I illustrate the proposed process using an ecological example. Finally, I briefly consider the challenges facing the quest for a single “best” index and possibilities for resolution of this issue.
There are numerous types of evidence to be weighed when evaluating and comparing models. First and foremost is the scientific knowledge of the investigative team.
As an explanatory method, SEM requires the scientist to play an active role in the model evaluation process. A priori scientific knowledge is essential for the construction of the initial models. However, there may be a tendency for those beginning to use SEM to imagine that model evaluation, based on the data, is a tightlyscripted process defined by the rules of statistics. Earlier treatments of SEM tend to reinforce this impression. This is perhaps true of my own writing (e.g. comparing the presentations in Grace 2006 to the current paper), but also the writing in more general treatments of SEM (
The goal of model selection is not simply to describe the relationships in data. In my view, the goal is to balance twin objectives. First is the narrow task of evaluating the modeldata interrelationships. We must address the specific question, “What do these data say about the hypothesis?” SEM philosophy, however, imagines a sequence of studies and a process of sequential learning (
The initial model construction process in SEM relies heavily on investigator knowledge. The reasoning process adopted during model construction (
A Need to Support and Defend Critical Assumptions – There is a certain minimum amount of a priori knowledge that is required in order to use SEM. SE models include socalled “untestable” assumptions, along with assumptions that can be falsified by an appropriate dataset. The most fundamental untestable assumptions are the directions of the arrows. Often, in ecological studies, the investigator is able to justify arrow directionality. More challenging is for investigators to justify things that are omitted from their models. Of course, models must limit their scope to the essential components. There are rules relating to what can and cannot be omitted. First, only include variables that are essential to the modeling objectives. Often scientists attempt to include all measured variables and produce models whose complexity exceed the capacity of the data sample size (too many parameters per sample). Second, omitting variables that have strong effects on two or more of the variables of primary interest can lead to confounding. When very strong confounding relationships are omitted, they have the potential to bias conclusions.
For links not included in the initial model, postestimation evaluations should reveal whether these assumptions need to be reconsidered. For direct links, these are straightforward to detect and include. However, we should not ignore the possibility that errors will be nonindependent. Correlated errors are definite indications of omitted confounders. Finding error correlations may spark a need to consider model modifications. It is wise to consider how one might interpret such findings based on a priori knowledge, as these represent factors being omitted from explicit inclusion in the initial model.
Investigator’s Opinion of the Strength of Theory versus Strength of Data – A point that is rarely discussed outside of the sphere of Bayesian statistics is how to weight a priori scientific knowledge against the information content of a data set. This omission exists despite the fact that, in many ecological studies, analyses are based on very limited samples. In cases where there is a large and representative dataset for use in an analysis, one should be prepared to consider the dataderived estimates as the final arbiter for drawing inferences. In some cases, our a priori knowledge may be stronger than the dataset available for modeling.
A broad view of the quantitative sciences must recognise that we aspire for our models to transition over time from assumptiontesting to assumptionbased. Numerous subdisciplines within ecology rely on assumptionbased models. Socalled ‘mechanistic models’ incorporate processes that operate on biological systems with enough regularity that the form of the model is accepted as given and data are used purely to estimate the parameters. Population models often fall into this group. For this model type, some of the processes may be of known functional form, while others may be of unknown form. When studying multispecies ecological communities, we often encounter mechanisms that are contingent on so many factors that relationship forms cannot be taken as given (e.g. effects of species additions or removals;
The preceding material is presented to make two important points relevant to model evaluation and selection within SEM. First, when alternative models are suggested, based on empirical results, we should avoid constructing alternative models for consideration that we know are false representations of the system. Perhaps a birth rate estimate is low and its 95% confidence interval includes a value of zero. Do we prune the model to adhere to the principle of parsimony? That would mean we might end up presenting a final model that, by omission, suggests that births are not a contributing factor for population size. Scientific logic would suggest that we should not prune in this case, but what are the consequences? I will address this question in the context of our illustrative example in the section below. In summary, the rule I would suggest is for the investigator to not include models in your comparison set that you, as a scientist, are not willing to defend.
Historically, the use of pvalues came into widespread use as a part of the machinery for nullhypothesis testing. Pvalues have traditionally been used for dichotomous decisionmaking and have been central to the practical application of statistical methods. Within a WOE paradigm, pvalues can be used as continuous quantitative indicators without their being used as strict cutoffs. The case for this type of use of pvalues has been recently articulated by
As mentioned in the Introduction, for SEM applications that utilise global estimation methods (e.g. LISREL, Mplus, AMOS or lavaan), model estimation returns an initial set of measures that quantify the overall correspondence between observed and modelimplied covariances. Immediate focus is directed to the Χ^{2} statistic, summarising modeldata fit and its pvalue. The pvalue, in this case, has a counterintuitive meaning in that it represents the probability that the observed data deviate from model expectations, which would imply the model structure is inappropriate for obtaining estimates from the data. When conducting SEM, there is an immediate decision that has to be made after the initial model is estimated, which is to decide whether there are one or more important links omitted. If there are, the reported parameter estimates are not to be trusted. This need to make sure the network is not missing important links has led to a history of reliance on dichotomous decisionmaking and this is perhaps unavoidable. The next section will discuss alternative indices for global fit assessment, but regardless of the fit index used, the question still arises, “Should we keep the original model or create a more complete one before we proceed?”
There are a number of factors that limit our ability to provide an omnibus model evaluation using the Χ^{2} statistic. For example, the continuous increase in statistical power with increasing N (sample size) of course means appreciable differences between data and model could be missed when N is small, but trivial differences could be flagged when N is large. Additionally, it is well documented that the Χ^{2} statistic can hide various types of misspecification simply because it is a summary statistic for the entire model (
Regarding approximate fit indices, it is probably true that some are, on average, better than others. However, simulation studies indicate that the capacity to detect misspecifications based on recommended thresholds depends on the particular misspecification (
Under the WOE approach, I will demonstrate in this paper approximate fit indices can be useful measures to report.
The second phase in evaluating models after first examining global fit measures is often to search for indications of what changes could be made to improve modeldata concordance. Specially designed for this purpose are socalled modification indices (MI). All globalestimationbased software packages, with which I am familiar, report this information upon request. The critical role of the investigator’s scientific judgement comes into sharp focus once one tries to make sense of the MI table provided for a model that is misspecified.
MI values are expressed in terms of the drop in the Χ^{2} statistic that would be observed if a link were added to the model. The categories of possible additions include, (a) regressions, (b) latent variable loadings, (c) error correlations and (d) variance constraints. The modification indices are not arrived at by actually fitting alternative models. Rather, they are approximations and therefore do not always correspond to the changes that will be observed.
Perhaps the best way to gain some intuition about the challenge MI values attempt to overcome is to look at the raw materials for computing evidence of misspecification, which are the residuals. In this case, the residuals are not those between predicted and observed individual data values, but instead, between the observed and modelimplied variancecovariance matrices. Requesting to see residuals in a standardised metric will illustrate where modelimplied and observed matrices are most discrepant. Because As the parts of a model are intercorrelated, there are many different model changes that might be implied. From a very practical standpoint, the investigator must realise that any change in the model can potentially resolve many of the listed modification possibilities. Therefore, one should decide on a single addition to the model before reestimation. It is essential that the chosen modifications make substantive sense, because they must explain to the reviewers the scientific basis for the modifications of their initial model (see Grace 2006 for an indepth discussion of this issue).
When working with models having latent variables with multiple indicators, MI tables sometimes return no usable advice, even though global model fit is poor. In this case, it becomes essential to consult the residual matrix unless theoretically predefined alternative specifications are available. While matrices of residuals provide a more undistilled source of information, they are commonly a fundamental source of evidence for selecting alternative models for consideration.
As with all other types of fit measures, informationbased measures have a long history of use in SEM. This is a complex topic that I will treat lightly because the jury is still out on whether universallyapplicable recommendations are even possible. Also Complicating things is also the sheer variety of information metrics that have been proposed for use. Fortunately, a recent review of past studies and set of simulation studies by
Two types of information measures have captured most of the recommendations, AICtype indices and BICtype indices. The Akaike Information Criterion (AIC) was proposed for use in 1974 (
In their book, Burnham and Anderson (2004) suggest models separated by more than 2 AIC units could be seen as distinct, while later (
In this paper, I do not wish to attempt to propose a definitive answer to the question of which information index is best nor consider the detailed studies and arguments association with that question. My intent is to show how the various types of evidence, associated with SEM, can be used to build up a set of candidate models for comparison and how information measures can be used to assist in the comparisons. For that purpose, I will rely on the following synoptic view, taken from various sources.
The relative performance of different information measures has been shown to vary with a number of factors, including (a) sample size, (b) the composition of candidate model sets, (c) the magnitude of effects to be detected and (d) heterogeneity in the data. I will try to summarise our current understanding (and my own experience) in a few summary statements:
In this paper, my focus is on globallyestimated models where the investigator must contend with multiple forms of evidence encountered within a sequential evaluation of overall model fit and individual links included in the model. However, many investigators use local estimation methods (e.g.
Pearl’s redescription of SEM in foundational terms, referred to as the Structural Causal Model (
In 2013, Shipley subsequently developed a version of his method based on AIC (
Most recently,
The use of the abovedescribed types of evidence is illustrated next. To assist in the process, I provide a sequence of steps for a weightofevidence approach.
1. Consideration of Sample Size  The first piece of evidence to consider is the number of samples, N. Sample size has a huge influence on model evaluation. One can anticipate a great deal about the value of various model fit indices based on sample size. A firstcut distinction is often made between studies depending on whether N is less than or greater than 200. Other sample size distinctions may be informative to the investigator based on their past experience, as well as other factors such as model complexity and model specification details.
2. Examination of Model Χ^{2} Statistic, Model Degrees of Freedom and associated Pvalue – Below a sample size of 200, the Χ^{2} statistic and associated pvalue are informative. That said, the criterion of p > 0.05 is known to be imprecise. When N < 200, a pvalue falling below the 0.05 threshold is a strong indication that one should consider alternative models that include additional links. Above the 0.05 criterion, increasing values of p provide increasing support for the presumption that the model under consideration is not leaving out links representing important processes. It is common to find support for adding links when p > 0.05. Based on personal experience, we might think of some much larger value, such as 0.50, as a point at which any omitted links are likely weak, but additional pieces of information are nearly always worth examining unless p is quite high. Above a sample size of 200, one can expect to need to use Approximate Fit Indices to convince reviewers that the model is not leaving out important links and thus suitable for interpretation.
3. Examination of Select Approximate Fit Indices – As mentioned above,
a. The Root Mean Square Error of Approximation (RMSEA) is appealing because it is accompanied by upper and lower values for a 90% confidence interval. The RMSEA differs from the Χ^{2} statistic as it measures departures from approximate/close fit instead of perfect fit. "Approximate" or "close" fit is described as the situation where the Χ^{2} < df, while perfect fit is where Χ^{2} = 0. The RMSEA is scaled as a "badness of fit" measure, where 0 means perfect fit. The index is known to be a poor decision criterion by itself. That said, when the lower confidence interval is 0, it is supposed to mean that approximate fit is within the range of support offered by the data. When the associated pvalue is less than 0.05, it suggests some type of misfit (due to omitted linkage).
b. The Comparative Fit Index (CFI) compares the departure from close fit (just described) to what we would find for a null model (all parameters = 0.0). Its values are scaled to range from 1.0 to 0. It is widely reported by investigators, especially when N > 200, because it is completely unaffected by sample size. It is not a reliable index for model selection by itself (though the unscaled version, RNI – the Relative Noncentrality Index) has been shown to perform as well as AIC by
c. The Standardized Root Mean Square Residual (SRMR) is computed as the absolute value of the mean of the standardised residual covariances. A value of 0 means perfect fit and a value > 0.10 is of concern. To interpret this measure appropriately, it is best to examine the matrix of standardised residual covariances since adequate mean fit may hide specific deviations representing important misspecifications.
4. Examination of Modification Indices and Covariance Residuals – The matrix of covariance residuals provides the raw materials used to compute overall model fit, as well as modification indices. Their examination is sometimes an important supplement to the MI table. The MI values themselves are expressed in terms of the drop in the Χ^{2} statistic that would be expected if a link were added to a model. The categories of possible additions include, (a) regressions, (b) latent variable loadings, (c) error correlations and (d) the relaxation of constraints (e.g. if any parameters have been given fixed values). As any given covariance residual can be resolved in several different ways, it is important to consider the scientific interpretability before examining the MI table. For example, in this paper we are not considering latent variable models. For this reason, we may want to request only a subset of the MI types, such as direct effects (~) and possibly error correlations (~~). The singledegreeoffreedom Χ^{2} criterion value of 3.84 is often used to provide information on the interpretation of MI values. The use of this information is considered in the next section.
5. Considering Alternative Models Containing Additional Linkages/Parameters – Unless overall model fit is very close, it is desirable to consider alternative models. Failing to find a credible alternative model is itself a form of support for the model under consideration. It is critical that all alternative models estimated be scientifically plausible. The entire foundation for multimodel comparisons is based on the premise that all models compared are scientifically defensible. One simple tip is to consider in advance the signs (positive or negative) of links added to a model. At the point of examining modification indices, it is possible to not only examine the magnitude of the MI, but also the sign of the expected parameter. Interpretation of added coefficients can be very dependent on whether the sign of the coefficient corresponds to the type of mechanisms suggested. Sometimes, the expected parameter change (“epc”) is of opposite sign to the mechanistic interpretation for adding a link and this may influence how the investigator proceeds. When considering MI values in terms of their magnitudes, the investigator should recognise that it is only profitable to make one addition to a model at a time. Making a single change to a model can lead to a completely different set of MI values for the remaining omitted links under consideration. It is common to focus on the suggested modifications with the largest MI values, though the interpretability of suggested changes is of paramount importance. The investigator will usually notice several possible modifications that are projected to have identical effects on model fit. I will elaborate the logic to employ in such a situation in the context of our empirical example.
6. Repeat of Steps 1 – 5 as needed – When a link or additional parameter is added to form a new model, the steps just described will need to be repeated. At this point in the process, we are ignoring the secondlevel question of whether the included links (freely estimated parameters) in a model are supported. The inclusion of weak or nonsupported effects in models has only a minor effect on overall model fit because they always reduce the raw discrepancy and only impact the model Χ^{2} statistic by increasing the number of parameters being estimated (K). As omitted links can have large effects on model fit and on estimated values for the other parameters, I always recommend addressing the question of whether important processes are omitted from the model before worrying about simplification.
7. Model Simplification – The model simplification process addresses the question of whether there is empirical support for all of the included linkages. The reported statistics for individual parameters now come into play. Historically, the pvalues, associated with individual parameters, have been used as guideposts, but not for deciding whether a parameter should be set to zero. It has long been the practice within SEM under global estimation to use pvalues as a continuous quantitative measure of the variability associated with a parameter estimate. Finding a parameter estimate with a pvalue at or above 0.05 raises the question as to what magnitude of change would be seen in global model fit if that parameter were set to zero (i.e. removing a link). The standard approach has been to use a singledegreeoffreedom Χ^{2} test to judge whether a link should be removed. Under a WOE paradigm, pvalues at or above 0.05 suggest the construction of alternative models to evaluate, using multimodel comparison.
8. Selection of Candidate Models for Comparison – The approach, described thus far, encourages a liberal consideration of alternative models, under the provision that they represent valid competing scientific explanations. A single model for scientific inference purposes is to be selected from the set. The suggestion of model averaging, proposed by
9. Model Comparison, Weighing of Evidence and Model Selection – Model selection, based on multimodel comparisons using information criteria, as championed by
To illustrate the ideas presented in this paper, I will rely on an example related to the biological control of invasive plants. The invasive plant Euphoria esula (leafy spurge) is considered a threat to the ecological and economic integrity of grasslands in the northcentral United States. The example, presented here, is derived from a study conducted at the Theodore Roosevelt National Park in North Dakota (
In this paper, I have used the results published by
Fig.
Link # 
Description of potential mechanisms and expected sign of effect 
1 
Change in stem density is expected to depend on initial density of stems. A positive parameter estimate would indicate positive density dependence, while a negative parameter estimate would indicate negative density dependence (negative effect). 
2 
Dependence of flea beetle density on plant stem density for BioA. (positive effect) 
3 
Site fidelity for BioA. (positive effect) 
4 
Effect of BioA on StemChg (expecting a negative effect, if control agent is effective) 
5 
Lag food effect on BioA. (positive effect) 
6 
Competitive effect of BioA on BioB. (negative effect) 
7 
Dependence of flea beetle density on plant stem density for BioB. (positive effect) 
8 
Site fidelity for BioB. (positive effect) 
9 
Effect of BioB on StemChg (expecting a negative effect, if control agent is effective) 
10 
Lag food effect on BioB. (positive effect) 
11 
Competitive effect of BioB on BioA. (negative effect) 
Structural equation model representing an initial hypothesis regarding the potential effects of two biocontrol flea beetles, BioA (Aphthona nigriscutus) and BioB (Aphthona lacertosa), on an invasive plant (Euphorbia esula). Stem and beetle densities were measured over two years and the change in stem density between years 1 and 2 computed (StemChg12).
For the illustration below, I simulated 10,000 replicates using the lavaan package simulateData function and then captured the covariance matrix as input for illustrations. The illustrations in this paper assume a sample size of 150 (original investigation was based on 165 samples). Code used for data simulation is in the supplementary materials (Suppl. material
In this section, I follow the Proposed Sequence described above to illustrate its application. In this example, I start with a single proposed a priori model and work from there. In other cases, we might have multiple candidate models from the beginning to evaluate, which would modify the sequence slightly.
Model 1 (Fig.
### load libraries library(lavaan) 

########## Simulation Study #1 ########## # Covariance matrix for input 

sim.cov < ' 

1.2472 

0.1492 
1.019 

0.8442 
0.178 
1.6417 

0.0358 
0.704 
0.0357 
1.5512 

0.2922 
0.233 
0.1217 
0.5276 
1.488 

0.5235 
0.149 
0.5335 
0.3277 
0.655 
1.029' 
# Convert matrix and name variables sim.cov.dat < getCov(sim.cov, names = c("BioA1", "BioB1", "BioA2", "BioB2", "StemChg12", "Stems1")) 

##### Scenario 1  Set N=150 ##### ### mod1 – initial model 

mod1 < ' # regressions BioA1 ~ Stems1 BioB1 ~ Stems1 BioA2 ~ BioA1 +Stems1 +BioB1 BioB2 ~ BioB1 +Stems1 +BioA1 StemChg12 ~ Stems1 +BioA2 +BioB2 ' 

# Estimate model ‘mod1’ using data matrix ‘sim.cov.dat’, N = 150 Mod1.fit < sem(mod1, sample.cov = sim.cov.dat, sample.nos = 150) 
Estimation of Model 1, using lavaan, returns the measures of overall model fit shown in Table
> summary(mod1.fit, fit.measures=T) 

lavaan 0.65 ended normally after 15 iterations 

Estimator 
ML 
Optimisation method 
NLMINB 
Number of free parameters 
16 
Number of observations 
150 
Model Test User Model: 

Test statistic 
9.125 
Degrees of freedom 
4 
Pvalue (Chisquare) 
0.058 
Model Test Baseline Model: 

Test statistic 
247.786 
Degrees of freedom 
15 
Pvalue 
0.000 
User Model versus Baseline Model: 

Comparative Fit Index (CFI) 
0.978 
TuckerLewis Index (TLI) 
0.917 
Loglikelihood and Information Criteria: 

Loglikelihood user model (H0) 
1274.742 
Loglikelihood unrestricted model (H1) 
1270.179 
Akaike (AIC) 
2581.483 
Bayesian (BIC) 
2629.654 
Samplesize adjusted Bayesian (BIC) 
2579.017 
Root Mean Square Error of Approximation: 

RMSEA 
0.092 
90 Percent confidence interval  lower 
0.000 
90 Percent confidence interval  upper 
0.173 
Pvalue RMSEA <= 0.05 
0.153 
Standardized Root Mean Square Residual: 

SRMR 
0.055 
1. Consideration of Sample Size – Since N = 150, all of the fit measures presented can be interpreted in a fairly straightforward fashion.
2. Examination of Model Χ^{2} Statistic, Model Degrees of Freedom and associated Pvalue –While the pvalue returned is above the criterion of 0.05, it can be anticipated at a sample size of 150 that we have sufficient statistical power to detect modest effect sizes. For this example, it is appropriate to assume that the true model contains a wide range of effect strengths, including some of scientific interest, but of modest size. With a pvalue of 0.058, it would be naïve to simply accept this model without looking any further.
3. Examination of Select Approximate Fit Indices –
a. RMSEA is below 0.10, the lower CI boundary is 0.0 and associated pvalue is 0.153. All these values suggest approximate fit.
b. CFI is estimated to be 0.978, which implies that we are probably not missing any large effects (assuming no interactions, as I do for this illustration).
c. SRMR is 0.055, well below the warning level of 0.10.
d. Summary: This evidence suggests that when modeldata discrepancies are average across the whole model, there is reasonable correspondence. However, this level of fit could be the result of averaging many equalsized minor misspecifications OR averaging many areas of the model with very close fit and a few important modeldata mismatches. The limitation of global fit measures is their inability to distinguish these two possibilities.
Examination of Modification Indices and Covariance Residuals – For now, I forego presenting a matrix of standardised residual covariances as their interpretation requires some experience. My typical process is to first examine modification indices and work with those, resorting to an examination of residuals, if it seems necessary. Seven modification suggestions are returned in this case (Table
> # Modification Indices > subset(modindices(mod1.fit), mi > 2, c("lhs", "op", "rhs", "mi", "epc")) 

lhs 
op 
rhs 
mi 
epc 
1 BioA1 
~~ 
BioB1 
7.762 
0.224 
2 BioA1 
~ 
BioB1 
7.762 
0.226 
3 BioA1 
~ 
BioA2 
7.762 
1.723 
4 BioA1 
~ 
BioB2 
7.762 
0.341 
5 BioB1 
~ 
BioA1 
7.762 
0.229 
6 BioB1 
~ 
BioA2 
7.762 
0.414 
7 BioB1 
~ 
BioB2 
7.762 
12.411 
Fig.
### Model 2 (parameter numbers correspond to those in Fig. 5) 
mod2 < ' # regressions BioA1 ~ b2*Stems1 BioB1 ~ b7*Stems1 BioA2 ~ b3*BioA1 +b5*Stems1 +b11*BioB1 BioB2 ~ b8*BioB1 +b10*Stems1 +b6*BioA1 StemChg12 ~ b1*Stems1 +b4*BioA2 +b9*BioB2 # error covariance BioA1 ~~ b12*BioB1' 
Steps 25: Examination of Global Fit Measures and Consideration of Additions to Model 2 – The Χ^{2} dropped from 9.125 to 1.155 (a decline of 8.03) and its pvalue rose from 0.058 to 0.764 (Table
> summary(mod2.fit, fit.measures=T) 

lavaan 0.63 ended normally after 16 iterations 

Optimisation method 
NLMINB 
Number of free parameters 
17 
Number of observations 
150 
Estimator 
ML 
Model Fit Test Statistic 
1.155 
Degrees of freedom 
3 
Pvalue (Chisquare) 
0.764 
Model test baseline model: 

Minimum Function Test Statistic 
247.786 
Degrees of freedom 
15 
Pvalue 
0.000 
User model versus baseline model: 

Comparative Fit Index (CFI) 
1.000 
TuckerLewis Index (TLI) 
1.040 
Loglikelihood and Information Criteria: 

Loglikelihood user model (H0) 
1270.757 
Loglikelihood unrestricted model (H1) 
1270.179 
Number of free parameters 
17 
Akaike (AIC) 
2575.513 
Bayesian (BIC) 
2626.694 
Samplesize adjusted Bayesian (BIC) 
2572.892 
Root Mean Square Error of Approximation: 

RMSEA 
0.000 
90 Percent Confidence Interval 
0.000 0.093 
Pvalue RMSEA <= 0.05 
0.848 
Standardized Root Mean Square Residual: 

SRMR 
0.013 
Step 7: Considering Simplification for Model 2 – We now turn to judging support for the links included. The information reported by lavaan that is most helpful at this point are the statistics associated with individual parameters. These are presented in Table
> # For examination of individual parameter support > parameterEstimates(mod2.fit) 

lhs 
op 
rhs 
label 
est 
se 
z 
pvalue 
ci.low 
ci.up 

1 
BioA1 
~ 
Stems1 
b2 
0.509 
0.080 
6.382 
0.000 
0.353 
0.665 
2 
BioB1 
~ 
Stems1 
b7 
0.145 
0.080 
1.801 
0.072 
0.013 
0.302 
3 
BioA2 
~ 
BioA1 
b3 
0.554 
0.085 
6.496 
0.000 
0.387 
0.721 
4 
BioA2 
~ 
Stems1 
b5 
0.256 
0.094 
2.718 
0.007 
0.071 
0.440 
5 
BioA2 
~ 
BioB1 
b11 
0.131 
0.085 
1.549 
0.121 
0.297 
0.035 
6 
BioB2 
~ 
BioB1 
b8 
0.662 
0.085 
7.834 
0.000 
0.497 
0.828 
7 
BioB2 
~ 
Stems1 
b10 
0.213 
0.094 
2.266 
0.023 
0.029 
0.397 
8 
BioB2 
~ 
BioA1 
b6 
0.018 
0.085 
0.217 
0.828 
0.149 
0.186 
9 
StemChg12 
~ 
Stems1 
b1 
0.643 
0.091 
7.077 
0.000 
0.821 
0.465 
10 
StemChg12 
~ 
BioA2 
b4 
0.139 
0.069 
2.005 
0.045 
0.003 
0.275 
11 
StemChg12 
~ 
BioB2 
b9 
0.208 
0.067 
3.078 
0.002 
0.340 
0.075 
12 
BioA1 
~~ 
BioB1 
b12 
0.224 
0.082 
2.717 
0.007 
0.385 
0.062 
13 
BioA1 
~~ 
BioA1 
0.974 
0.113 
8.660 
0.000 
0.754 
1.195 

14 
BioB1 
~~ 
BioB1 
0.991 
0.114 
8.660 
0.000 
0.767 
1.215 

15 
BioA2 
~~ 
BioA2 
1.008 
0.116 
8.660 
0.000 
0.780 
1.236 

16 
BioB2 
~~ 
BioB2 
1.008 
0.116 
8.660 
0.000 
0.780 
1.236 

17 
StemChg12 
~~ 
StemChg12 
0.968 
0.112 
8.660 
0.000 
0.749 
1.187 

18 
Stems1 
~~ 
Stems1 
1.022 
0.000 
NA 
NA 
1.022 
1.022 
A new model for evaluation is presented in Fig.
### Model 3 (parameter b4 now estimates an error correlation) 
Mod3 < ' # regressions BioA1 ~ b2*Stems1 BioB1 ~ b7*Stems1 BioA2 ~ b3*BioA1 +b5*Stems1 +b11*BioB1 BioB2 ~ b8*BioB1 +b10*Stems1 +b6*BioA1 StemChg12 ~ b1*Stems1 +b9*BioB2 # error covariance BioA1 ~~ b12*BioB1 StemChg12 ~~ b4*BioA2' 
Steps 25: Examination of Global Fit Measures and Consideration of Additions to Model 3 – The fit statistics for Model 3 are similar to those for Model 2, but with even closer fit (full results not presented to economise on space). The Χ^{2} is now 0.133 and pvalue = 0.988. All other fit statistics suggest there are no other important links missing from this model.
Step 7: Considering Simplification for Model 3 – We now turn to an examination of the parameterspecific statistics for Model 3 (Table
> # For examination of individual parameter support > parameterEstimates(mod3.fit) 

lhs 
op 
rhs 
label 
est 
se 
z 
pvalue 
ci.low 
ci.up 

1 
BioA1 
~ 
Stems1 
b2 
0.509 
0.080 
6.382 
0.000 
0.353 
0.665 
2 
BioB1 
~ 
Stems1 
b7 
0.145 
0.080 
1.801 
0.072 
0.013 
0.302 
3 
BioA2 
~ 
BioA1 
b3 
0.551 
0.084 
6.573 
0.000 
0.387 
0.716 
4 
BioA2 
~ 
Stems1 
b5 
0.257 
0.094 
2.747 
0.006 
0.074 
0.441 
5 
BioA2 
~ 
BioB1 
b11 
0.133 
0.084 
1.594 
0.111 
0.297 
0.031 
6 
BioB2 
~ 
BioB1 
b8 
0.662 
0.085 
7.834 
0.000 
0.497 
0.828 
7 
BioB2 
~ 
Stems1 
b10 
0.213 
0.094 
2.266 
0.023 
0.029 
0.397 
8 
BioB2 
~ 
BioA1 
b6 
0.018 
0.085 
0.217 
0.828 
0.149 
0.186 
9 
StemChg12 
~ 
Stems1 
b1 
0.565 
0.083 
6.786 
0.000 
0.729 
0.402 
10 
Stemchg12 
~ 
BioB2 
b9 
0.224 
0.067 
3.332 
0.001 
0.355 
0.092 
11 
BioA1 
~~ 
BioB1 
b12 
0.224 
0.082 
2.717 
0.007 
0.385 
0.062 
12 
BioA2 
~~ 
StemChg12 
b4 
0.181 
0.083 
2.184 
0.029 
0.019 
0.344 
Model 3B, Step 7: Considering Simplification for Model 3B – Model 3B was created by removing link 6 from Model 3. This is done by setting parameter b6 to zero (Table
### Model 3B (parameter b6 now set to zero) 
Mod3B < ' # regressions BioA1 ~ b2*Stems1 BioB1 ~ b7*Stems1 BioA2 ~ b3*BioA1 +b5*Stems1 +b11*BioB1 BioB2 ~ b8*BioB1 +b10*Stems1 +b6*BioA1 StemChg12 ~ b1*Stems1 +b9*BioB2 # error covariance BioA1 ~~ b12*BioB1 StemChg12 ~~ b4*BioA2 # set parameters to zero b6 == 0' 
> # For examination of individual parameter support > parameterEstimates(mod3B.fit) 

lhs 
op 
rhs 
label 
est 
se 
z 
pvalue 
ci.low 
ci.up 

1 
BioA1 
~ 
Stems1 
b2 
0.509 
0.080 
6.382 
0.000 
0.353 
0.665 
2 
BioB1 
~ 
Stems1 
b7 
0.145 
0.080 
1.801 
0.072 
0.013 
0.302 
3 
BioA2 
~ 
BioA1 
b3 
0.551 
0.084 
6.573 
0.000 
0.387 
0.716 
4 
BioA2 
~ 
Stems1 
b5 
0.257 
0.094 
2.747 
0.006 
0.074 
0.441 
5 
BioA2 
~ 
BioB1 
b11 
0.133 
0.084 
1.594 
0.111 
0.297 
0.031 
6 
BioB2 
~ 
BioB1 
b8 
0.658 
0.082 
7.993 
0.000 
0.497 
0.820 
7 
BioB2 
~ 
Stems1 
b10 
0.223 
0.082 
2.723 
0.006 
0.063 
0.384 
8 
BioB2 
~ 
BioA1 
b6 
0.000 
0.000 
NA 
NA 
0.000 
0.000 
9 
StemChg12 
~ 
Stems1 
b1 
0.565 
0.083 
6.786 
0.000 
0.729 
0.402 
10 
StemChg12 
~ 
BioB2 
b9 
0.224 
0.067 
3.332 
0.001 
0.355 
0.092 
11 
BioA1 
~~ 
BioB1 
b12 
0.224 
0.082 
2.717 
0.007 
0.385 
0.062 
12 
BioA2 
~~ 
StemChg12 
b4 
0.181 
0.083 
2.184 
0.029 
0.019 
0.344 
Model 3C, Step 7: Considering Simplification for Model 3C – We now direct our attention to Table
In this situation, we must decide which of the models that have been estimated are scientifically defensible. Model 1 was found to be obviously misspecified due to lack of empirical support and is not a contender for model selection. Model 2, while exhibiting a close modeldata fit, was found to lack theoretical support for the originally hypothesised effect of BioA2 on StemChg12. Model 3 was created to solve that problem. Subsequent models examined (Models 3BD) were all attempts to evaluate simpler versions of Model 3 and are all scientifically defensible. The appropriate model comparison set in this case includes all versions of Model 3.
A model comparison table is presented in Table
> ##### Multimodel Comparisons > library (AICcmodavg) > aictab(list(mod3.fit, mod3B.fit, mod3C.fit, mod3D.fit), + c("MOD3", "MOD3B","MOD3C","MOD3D")) 

Model selection based on AICc: 

K 
AICc 
Delta_AICc 
AICcWt 
Cum.Wt 
LL 

MOD3B 
16 
2576.63 
0.00 
0.39 
0.39 
1270.27 
MOD3C 
15 
2576.64 
0.01 
0.39 
0.77 
1271.53 
MOD3D 
14 
2579.03 
2.40 
0.12 
0.89 
1273.96 
MOD3 
17 
2579.13 
2.50 
0.11 
1.00 
1270.25 
Model Fit 

Estimator 
ML 

Model Fit Test Statistic 
0.180 

Degrees of freedom 
4 

Pvalue (Chisquare) 
0.996 

Comparative Fit Index (CFI) 
1.000 

RMSEA 
0.000 

90 Percent Confidence Interval 
0.000 0.000 

Pvalue RMSEA <= 0.05 
0.998 

SRMR 
0.006 

Regressions: 

Estimate 
Std.Err 
zvalue 
P(>z) 
Std.all 

BioA1 ~ 

Stems1 
(b2) 
0.509 
0.080 
6.382 
0.000 
0.462 
BioB1 ~ 

Stems1 
(b7) 
0.145 
0.080 
1.801 
0.072 
0.146 
BioA2 ~ 

BioA1 
(b3) 
0.551 
0.084 
6.573 
0.000 
0.481 
Stems1 
(b5) 
0.257 
0.094 
2.747 
0.006 
0.204 
BioB1 
(b11) 
0.133 
0.084 
1.594 
0.111 
0.105 
BioB2 ~ 

BioB1 
(b8) 
0.658 
0.082 
7.993 
0.000 
0.534 
Stems1 
(b10) 
0.223 
0.082 
2.723 
0.006 
0.182 
BioA1 
(b6) 
0.000 
0.000 

StemChg12 ~ 

Stems1 
(b1) 
0.565 
0.083 
6.786 
0.000 
0.470 
BioB2 
(b9) 
0.224 
0.067 
3.332 
0.001 
0.228 
Covariances: 

Estimate 
Std.Err 
zvalue 
P(>z) 
Std.all 

.BioA1 ~~ 

.BioB1 
(b12) 
0.224 
0.082 
2.717 
0.007 
0.227 
.BioA2 ~~ 

.StmChg12 
(b4) 
0.181 
0.083 
2.184 
0.029 
0.181 
RSquare: 

Estimate 

BioA1 
0.214 

BioB1 
0.021 

BioA2 
0.381 

BioB2 
0.346 

StemChg12 
0.328 
Additional summary results for the selected model are shown in Table
A question raised earlier in the paper was about the consequences of retaining a weaklysupported link in a model for the other model parameters. It is known that leaving out an important link can have a major impact on estimated parameter values for the included links. This sensitivity is illustrated by the fact that model Χ^{2} can drop abruptly when a single link is added (we observed a drop of over 8 points when we added link 12 to create Model 2). The elimination of link 6 from Model 3, however, only increased model Χ^{2} by a tiny amount (0.05), which is typical when parameters with little support are removed. More to the point in this paper is the question of what would happen if we were to set the dependence of beetle species B on plant stems (parameter b7) to zero? The results from such a change are presented in Table
Results if b6 in Model 3B were to be set to zero (compare to results in Table
Model Fit 

Estimator 
ML 

Model Fit Test Statistic 
3.390 

Degrees of freedom 
5 

Pvalue (Chisquare) 
0.640 

Comparative Fit Index (CFI) 
1.000 

RMSEA 
0.000 

90 Percent Confidence Interval 
0.000 0.092 

Pvalue RMSEA <= 0.05 
0.793 

SRMR 
0.050 

Regressions: 

Estimate 
Std.Err 
zvalue 
P(>z) 
Std.all 

BioA1 ~ 

Stems1 
(b2) 
0.541 
0.078 
6.974 
0.000 
0.485 
BioB1 ~ 

Stems1 
(b7) 
0.000 
NA 
0.000 

BioA2 ~ 

BioA1 
(b3) 
0.551 
0.084 
6.573 
0.000 
0.481 
Stems1 
(b5) 
0.257 
0.093 
2.769 
0.006 
0.201 
BioB1 
(b11) 
0.133 
0.083 
1.610 
0.107 
0.104 
BioB2 ~ 

BioB1 
(b8) 
0.658 
0.081 
8.079 
0.000 
0.541 
Stems1 
(b10) 
0.223 
0.081 
2.752 
0.006 
0.184 
BioA1 
(b6) 
0.000 
0.000 

StemChg12 ~ 

Stems1 
(b1) 
0.565 
0.082 
6.903 
0.000 
0.474 
BioB2 
(b9) 
0.224 
0.067 
3.343 
0.001 
0.227 
Covariances: 

Estimate 
Std.Err 
zvalue 
P(>z) 
Std.all 

.BioA1 ~~ 

.BioB1 
(b12) 
0.228 
0.083 
2.743 
0.006 
0.230 
.BioA2 ~~ 

.StmChg12 
(b4) 
0.181 
0.083 
2.184 
0.029 
0.181 
RSquare: 

Estimate 

BioA1 
0.235 

BioB1 
0.000 

BioA2 
0.397 

BioB2 
0.327 

StemChg12 
0.316 
In this paper, I describe a way to bring the necessary evaluation of pvalues into what is ultimately a model comparison process. The advice provided from mainstream SEM, which is very heavily influenced by the study of complex latent variable models, is both exhaustive and exhausting for the ecologist. This paper provides updated advice for practitioners who rely on global estimation software packages for their SEM analyses.
At the present time, the greatest challenge for future studies is to provide defensible advice for the use of information measures in model comparisons. Most researchers investigating this topic have sought to identify a single index for allpurpose use. The most visible discussions in the field of ecology have debated the use of AIC versus BIC. There are also a great many variants of AIC and BIC that have been developed and discussed. For those in ecology, the recent simulation study by
The presentation here would be incomplete without mentioning that the ultimate solution to selecting the best model involves the data itself and not the methods of analysis. The bigger and better the sample, the more confidence we may have in the conclusions. If our goal is to generalise beyond the current sample, there is no substitute for sound mechanistic knowledge and sequential learning across linked studies (
I thank Lori Randall, USGS, Maria FelipeLucia, Helmholtz Center for Environmental Research and Frank Pennekamp, University of Zurich, for helpful review comments and suggestions. This work was supported by the USGS Land Change Science and Ecosystems Programs. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
U.S. Geological Survey
No conflicts of interest.
This text file contains the R code used to develop the demonstrations included in Grace JB (2020) A 'weight of evidence' approach to evaluating structural equation models. One Ecosystem