One Ecosystem
Review Article
Corresponding author: James B Grace (gracej@usgs.gov)
Academic editor: Stoyan Nedkov
Received: 24 Jan 2020 | Accepted: 10 Mar 2020 | Published: 13 Mar 2020
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Grace JB (2020) A 'Weight of Evidence' approach to evaluating structural equation models. One Ecosystem 5: e50452. https://doi.org/10.3897/oneeco.5.e50452
Model selection may be the most researched and most discussed topic in the history of both statistics and structural equation modeling (SEM). The reason is that selecting one model for interpretive use from amongst many possible models is both essential and difficult. The published protocols and advice for model evaluation and selection in SEM studies are complex and difficult to integrate with current approaches used in biology. Opposition to the use of p-values and decision thresholds has been voiced by the statistics community, yet certain phases of model evaluation have historically relied on p-values. In this paper, I outline an approach to model evaluation, comparison and selection based on a weight-of-evidence paradigm. The details and proposed sequence of steps are illustrated using a real-world example. At the end of the paper, I briefly discuss the current state of knowledge and a possible direction for future studies.
structural equation modelling, path models, model evaluation, model selection, weight-of-evidence
Model selection is one of the more challenging aspects of structural equation modeling. The selection decision typically follows a multi-step process of model evaluation that considers numerous possible models and various types of evidence. Traditionally, this process has depended strongly on the use of p-values and the concept of dichotomous hypothesis tests. There have been numerous calls from the statistics community for a cessation of p-value-based hypothesis testing, especially as of late (see commentary by
When evaluating models estimated using traditional methods, there are two instances where SEM investigations encounter p-values:
The likelihood-ratio Χ2 test statistic was originally proposed as the first and final judge of global model fit (see
Throughout the modern history of SEM, which can be thought of as the time since the LISREL synthesis in the early 1970s (
Within the field of ecology,
Shifting away from a strict reliance on dichotomous hypothesis testing towards model comparisons has implications for model selection in SEM studies. Results reported by SEM software include p-value-based indices. Experience tells us that p-values are useful indicators of model-data relations, but today's general advice is not to use them for dichotomous significance testing. In this paper, I illustrate an approach to model evaluation and selection that utilises p-values as information, while basing model selection on a comparison of the total weight of evidence. I first discuss the types of evidence that might be considered, including the role for expert knowledge. Second, I propose a sequence of steps to follow. Third, I illustrate the proposed process using an ecological example. Finally, I briefly consider the challenges facing the quest for a single “best” index and possibilities for resolution of this issue.
There are numerous types of evidence to be weighed when evaluating and comparing models. First and foremost is the scientific knowledge of the investigative team.
As an explanatory method, SEM requires the scientist to play an active role in the model evaluation process. A priori scientific knowledge is essential for the construction of the initial models. However, there may be a tendency for those beginning to use SEM to imagine that model evaluation, based on the data, is a tightly-scripted process defined by the rules of statistics. Earlier treatments of SEM tend to reinforce this impression. This is perhaps true of my own writing (e.g. comparing the presentations in Grace 2006 to the current paper), but also the writing in more general treatments of SEM (
The goal of model selection is not simply to describe the relationships in data. In my view, the goal is to balance twin objectives. First is the narrow task of evaluating the model-data inter-relationships. We must address the specific question, “What do these data say about the hypothesis?” SEM philosophy, however, imagines a sequence of studies and a process of sequential learning (
The initial model construction process in SEM relies heavily on investigator knowledge. The reasoning process adopted during model construction (
A Need to Support and Defend Critical Assumptions – There is a certain minimum amount of a priori knowledge required in order to use SEM. SE models include so-called “untestable” assumptions, along with assumptions that can be falsified by an appropriate dataset. The most fundamental untestable assumptions are the directions of the arrows. Often, in ecological studies, the investigator is able to justify arrow directionality. It is more challenging for investigators to justify what is omitted from their models. Of course, models must limit their scope to the essential components. There are rules relating to what can and cannot be omitted. First, only include variables that are essential to the modeling objectives. Often, scientists attempt to include all measured variables and produce models whose complexity exceeds the capacity of the data sample size (too many parameters per sample). Second, omitting variables that have strong effects on two or more of the variables of primary interest can lead to confounding. When very strong confounding relationships are omitted, they have the potential to bias conclusions.
For links not included in the initial model, post-estimation evaluations should reveal whether these assumptions need to be reconsidered. For direct links, these are straightforward to detect and include. However, we should not ignore the possibility that errors will be non-independent. Correlated errors are definite indications of omitted confounders. Finding error correlations may spark a need to consider model modifications. It is wise to consider how one might interpret such findings based on a priori knowledge, as these represent factors being omitted from explicit inclusion in the initial model.
Investigator’s Opinion of the Strength of Theory versus Strength of Data – A point that is rarely discussed outside of the sphere of Bayesian statistics is how to weight a priori scientific knowledge against the information content of a data set. This omission exists despite the fact that, in many ecological studies, analyses are based on very limited samples. In cases where there is a large and representative dataset for use in an analysis, one should be prepared to consider the data-derived estimates as the final arbiter for drawing inferences. In some cases, our a priori knowledge may be stronger than the dataset available for modeling.
A broad view of the quantitative sciences must recognise that we aspire for our models to transition over time from assumption-testing to assumption-based. Numerous sub-disciplines within ecology rely on assumption-based models. So-called ‘mechanistic models’ incorporate processes that operate on biological systems with enough regularity that the form of the model is accepted as given and data are used purely to estimate the parameters. Population models often fall into this group. For this model type, some of the processes may be of known functional form, while others may be of unknown form. When studying multi-species ecological communities, we often encounter mechanisms that are contingent on so many factors that relationship forms cannot be taken as given (e.g. effects of species additions or removals;
The preceding material is presented to make two important points relevant to model evaluation and selection within SEM. First, when alternative models are suggested, based on empirical results, we should avoid constructing alternative models for consideration that we know are false representations of the system. Perhaps a birth rate estimate is low and its 95% confidence interval includes a value of zero. Do we prune the model to adhere to the principle of parsimony? That would mean we might end up presenting a final model that, by omission, suggests that births are not a contributing factor for population size. Scientific logic would suggest that we should not prune in this case, but what are the consequences? I will address this question in the context of our illustrative example in the section below. In summary, the rule I would suggest is this: do not include models in your comparison set that you, as a scientist, are not willing to defend.
Historically, p-values came into widespread use as part of the machinery for null-hypothesis testing. P-values have traditionally been used for dichotomous decision-making and have been central to the practical application of statistical methods. Within a weight-of-evidence (WOE) paradigm, p-values can be used as continuous quantitative indicators without being used as strict cutoffs. The case for this type of use of p-values has been recently articulated by
As mentioned in the Introduction, for SEM applications that utilise global estimation methods (e.g. LISREL, Mplus, AMOS or lavaan), model estimation returns an initial set of measures that quantify the overall correspondence between observed and model-implied covariances. Immediate focus is directed to the Χ2 statistic summarising model-data fit and its p-value. The p-value, in this case, has a counter-intuitive meaning: a small p-value indicates that the observed data deviate from model expectations more than would be expected by chance, which would imply the model structure is inappropriate for obtaining estimates from the data. When conducting SEM, an immediate decision has to be made after the initial model is estimated: are one or more important links omitted? If there are, the reported parameter estimates are not to be trusted. This need to make sure the network is not missing important links has led to a history of reliance on dichotomous decision-making and this is perhaps unavoidable. The next section will discuss alternative indices for global fit assessment but, regardless of the fit index used, the question still arises, “Should we keep the original model or create a more complete one before we proceed?”
There are a number of factors that limit our ability to provide an omnibus model evaluation using the Χ2 statistic. For example, the continuous increase in statistical power with increasing N (sample size) of course means appreciable differences between data and model could be missed when N is small, but trivial differences could be flagged when N is large. Additionally, it is well documented that the Χ2 statistic can hide various types of mis-specification simply because it is a summary statistic for the entire model (
Regarding approximate fit indices, it is probably true that some are, on average, better than others. However, simulation studies indicate that the capacity to detect mis-specifications based on recommended thresholds depends on the particular mis-specification (
Under the WOE approach, as I will demonstrate in this paper, approximate fit indices can be useful measures to report.
The second phase in evaluating models, after first examining global fit measures, is often to search for indications of what changes could be made to improve model-data concordance. Specially designed for this purpose are so-called modification indices (MI). All global-estimation-based software packages with which I am familiar report this information upon request. The critical role of the investigator's scientific judgement comes into sharp focus once one tries to make sense of the MI table provided for a model that is mis-specified.
MI values are expressed in terms of the drop in the Χ2 statistic that would be observed if a link were added to the model. The categories of possible additions include: (a) regressions, (b) latent variable loadings, (c) error correlations and (d) variance constraints. The modification indices are not arrived at by actually fitting alternative models. Rather, they are approximations and, therefore, do not always correspond exactly to the changes that will be observed.
Perhaps the best way to gain some intuition about the challenge MI values attempt to overcome is to look at the raw materials for computing evidence of mis-specification, which are the residuals. In this case, the residuals are not those between predicted and observed individual data values, but instead between the observed and model-implied variance-covariance matrices. Requesting the residuals in a standardised metric will show where the model-implied and observed matrices are most discrepant. Because the parts of a model are intercorrelated, there are many different model changes that might be implied. From a very practical standpoint, the investigator must realise that any single change to the model can potentially resolve many of the listed modification possibilities. Therefore, one should decide on a single addition to the model before re-estimation. It is essential that the chosen modifications make substantive sense, because the investigator must explain to reviewers the scientific basis for modifications of the initial model (see Grace 2006 for an in-depth discussion of this issue).
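For those working in lavaan, the sketch below shows how these raw materials can be requested. This is a minimal sketch, assuming a previously fitted lavaan model object named 'fit' (a placeholder name); the function calls themselves are standard lavaan.

# Standardised residual covariances: where do the observed and
# model-implied matrices disagree most? ('fit' is a fitted lavaan object)
resid(fit, type = "standardized")

# Modification indices, sorted so the largest suggested improvements come first
modindices(fit, sort. = TRUE)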
When working with models having latent variables with multiple indicators, MI tables sometimes return no usable advice, even though global model fit is poor. In this case, it becomes essential to consult the residual matrix unless theoretically predefined alternative specifications are available. Although matrices of residuals are a less distilled source of information, they are commonly a fundamental source of evidence for selecting alternative models for consideration.
As with all other types of fit measures, information-based measures have a long history of use in SEM. This is a complex topic that I will treat lightly because the jury is still out on whether universally-applicable recommendations are even possible. Complicating things further is the sheer variety of information metrics that have been proposed for use. Fortunately, a recent review of past studies and set of simulation studies by
Two types of information measures have captured most of the recommendations, AIC-type indices and BIC-type indices. The Akaike Information Criterion (AIC) was proposed for use in 1974 (
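For orientation, the generic forms of these indices can be written compactly. The small R function below illustrates the standard textbook formulas; it is not taken from any particular SEM package, whose exact computations may differ in detail.

# Generic information criteria from a maximised log-likelihood (logL),
# number of estimated parameters (K) and sample size (n)
info_criteria <- function(logL, K, n) {
  AIC  <- -2 * logL + 2 * K                       # Akaike Information Criterion
  AICc <- AIC + (2 * K * (K + 1)) / (n - K - 1)   # small-sample corrected AIC
  BIC  <- -2 * logL + K * log(n)                  # Bayesian Information Criterion
  c(AIC = AIC, AICc = AICc, BIC = BIC)
}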
In their book, Burnham and Anderson (2004) suggest models separated by more than 2 AIC units could be seen as distinct, while later (
In this paper, I do not wish to propose a definitive answer to the question of which information index is best, nor to consider the detailed studies and arguments associated with that question. My intent is to show how the various types of evidence associated with SEM can be used to build up a set of candidate models for comparison and how information measures can be used to assist in the comparisons. For that purpose, I will rely on the following synoptic view, taken from various sources.
The relative performance of different information measures has been shown to vary with a number of factors, including (a) sample size, (b) the composition of candidate model sets, (c) the magnitude of effects to be detected and (d) heterogeneity in the data. I will try to summarise our current understanding (and my own experience) in a few summary statements:
In this paper, my focus is on globally-estimated models where the investigator must contend with multiple forms of evidence encountered within a sequential evaluation of overall model fit and individual links included in the model. However, many investigators use local estimation methods (e.g.
Pearl’s redescription of SEM in foundational terms, referred to as the Structural Causal Model (
In 2013, Shipley subsequently developed a version of his method based on AIC (
Most recently,
The use of the above-described types of evidence is illustrated next. To assist in the process, I provide a sequence of steps for a weight-of-evidence approach.
1. Consideration of Sample Size - The first piece of evidence to consider is the number of samples, N. Sample size has a huge influence on model evaluation. One can anticipate a great deal about the value of various model fit indices based on sample size. A first-cut distinction is often made between studies depending on whether N is less than or greater than 200. Other sample size distinctions may be informative to the investigator based on their past experience, as well as other factors such as model complexity and model specification details.
2. Examination of Model Χ2 Statistic, Model Degrees of Freedom and associated P-value – Below a sample size of 200, the Χ2 statistic and associated p-value are informative. That said, the criterion of p > 0.05 is known to be imprecise. When N < 200, a p-value falling below the 0.05 threshold is a strong indication that one should consider alternative models that include additional links. Above the 0.05 criterion, increasing values of p provide increasing support for the presumption that the model under consideration is not leaving out links representing important processes. It is common to find support for adding links when p > 0.05. Based on personal experience, we might think of some much larger value, such as 0.50, as a point at which any omitted links are likely weak, but additional pieces of information are nearly always worth examining unless p is quite high. Above a sample size of 200, one can expect to need to use Approximate Fit Indices to convince reviewers that the model is not leaving out important links and thus suitable for interpretation.
3. Examination of Select Approximate Fit Indices – As mentioned above,
a. The Root Mean Square Error of Approximation (RMSEA) is appealing because it is accompanied by upper and lower values for a 90% confidence interval. The RMSEA differs from the Χ2 statistic in that it measures departure from approximate/close fit instead of perfect fit. "Approximate" or "close" fit is described as the situation where Χ2 < df, while perfect fit is where Χ2 = 0. The RMSEA is scaled as a "badness of fit" measure, where 0 means perfect fit. The index is known to be a poor decision criterion by itself. That said, when the lower bound of the confidence interval is 0, approximate fit is within the range of support offered by the data. When the associated p-value is less than 0.05, it suggests some type of misfit (due to omitted linkage).
b. The Comparative Fit Index (CFI) compares the departure from close fit (just described) to what we would find for a null model (all parameters = 0.0). Its values are scaled to range from 1.0 to 0. It is widely reported by investigators, especially when N > 200, because it is relatively insensitive to sample size. It is not a reliable index for model selection by itself (though the unscaled version, the Relative Noncentrality Index or RNI, has been shown to perform as well as AIC by
c. The Standardized Root Mean Square Residual (SRMR) is computed as the root mean square of the standardised residual covariances. A value of 0 means perfect fit and a value > 0.10 is of concern. To interpret this measure appropriately, it is best to examine the matrix of standardised residual covariances, since adequate mean fit may hide specific deviations representing important mis-specifications.
4. Examination of Modification Indices and Covariance Residuals – The matrix of covariance residuals provides the raw materials used to compute overall model fit, as well as modification indices. Their examination is sometimes an important supplement to the MI table. The MI values themselves are expressed in terms of the drop in the Χ2 statistic that would be expected if a link were added to a model. The categories of possible additions include: (a) regressions, (b) latent variable loadings, (c) error correlations and (d) the relaxation of constraints (e.g. if any parameters have been given fixed values). As any given covariance residual can be resolved in several different ways, it is important to consider scientific interpretability before examining the MI table. For example, in this paper we are not considering latent variable models. For this reason, we may want to request only a subset of the MI types, such as direct effects (~) and possibly error correlations (~~). The single-degree-of-freedom Χ2 criterion value of 3.84 is often used to provide information on the interpretation of MI values. The use of this information is considered in the next section.
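To make Steps 2-4 concrete for lavaan users, the following sketch extracts the global fit measures discussed above and then requests only the regression-type ('~') modification indices exceeding the single-degree-of-freedom criterion. This is a minimal sketch: the fitted-model object name 'fit' is a placeholder and the set of measures requested can be adjusted as needed.

# Steps 2-3: chi-square test and selected approximate fit indices
fitMeasures(fit, c("chisq", "df", "pvalue",
                   "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "rmsea.pvalue",
                   "cfi", "srmr"))

# Step 4: regression-type modification indices above the 3.84 criterion
subset(modindices(fit), op == "~" & mi > 3.84,
       c("lhs", "op", "rhs", "mi", "epc"))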
5. Considering Alternative Models Containing Additional Linkages/Parameters – Unless overall model fit is very close, it is desirable to consider alternative models. Failing to find a credible alternative model is itself a form of support for the model under consideration. It is critical that all alternative models estimated be scientifically plausible. The entire foundation for multimodel comparisons is based on the premise that all models compared are scientifically defensible. One simple tip is to consider in advance the signs (positive or negative) of links added to a model. At the point of examining modification indices, it is possible to not only examine the magnitude of the MI, but also the sign of the expected parameter. Interpretation of added coefficients can be very dependent on whether the sign of the coefficient corresponds to the type of mechanisms suggested. Sometimes, the expected parameter change (“epc”) is of opposite sign to the mechanistic interpretation for adding a link and this may influence how the investigator proceeds. When considering MI values in terms of their magnitudes, the investigator should recognise that it is only profitable to make one addition to a model at a time. Making a single change to a model can lead to a completely different set of MI values for the remaining omitted links under consideration. It is common to focus on the suggested modifications with the largest MI values, though the interpretability of suggested changes is of paramount importance. The investigator will usually notice several possible modifications that are projected to have identical effects on model fit. I will elaborate the logic to employ in such a situation in the context of our empirical example.
6. Repeat of Steps 1 – 5 as needed – When a link or additional parameter is added to form a new model, the steps just described will need to be repeated. At this point in the process, we are ignoring the second-level question of whether the included links (freely estimated parameters) in a model are supported. The inclusion of weak or unsupported effects has only a minor effect on overall model fit, because such parameters always reduce the raw discrepancy somewhat, at the cost of increasing the number of parameters being estimated (K) and reducing the model degrees of freedom. As omitted links can have large effects on model fit and on estimated values for the other parameters, I always recommend addressing the question of whether important processes are omitted from the model before worrying about simplification.
7. Model Simplification – The model simplification process addresses the question of whether there is empirical support for all of the included linkages. The reported statistics for individual parameters now come into play. Historically, the p-values associated with individual parameters have been used as guideposts, but not for deciding whether a parameter should be set to zero. It has long been the practice within SEM under global estimation to use p-values as a continuous quantitative measure of the variability associated with a parameter estimate. Finding a parameter estimate with a p-value at or above 0.05 raises the question of what magnitude of change would be seen in global model fit if that parameter were set to zero (i.e. removing a link). The standard approach has been to use a single-degree-of-freedom Χ2 test to judge whether a link should be removed. Under a WOE paradigm, p-values at or above 0.05 suggest the construction of alternative models to evaluate, using multi-model comparison.
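In lavaan, that single-degree-of-freedom comparison can be obtained directly from the two nested fitted models; the sketch below assumes fitted objects named 'fit.full' and 'fit.reduced' (placeholder names).

# Chi-square difference (likelihood-ratio) test between nested models
lavTestLRT(fit.full, fit.reduced)   # anova(fit.full, fit.reduced) is equivalent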
8. Selection of Candidate Models for Comparison – The approach, described thus far, encourages a liberal consideration of alternative models, under the provision that they represent valid competing scientific explanations. A single model for scientific inference purposes is to be selected from the set. The suggestion of model averaging, proposed by
9. Model Comparison, Weighing of Evidence and Model Selection – Model selection, based on multimodel comparisons using information criteria, as championed by
To illustrate the ideas presented in this paper, I will rely on an example related to the biological control of invasive plants. The invasive plant Euphorbia esula (leafy spurge) is considered a threat to the ecological and economic integrity of grasslands in the north-central United States. The example presented here is derived from a study conducted at the Theodore Roosevelt National Park in North Dakota (
In this paper, I have used the results published by
Fig.
Link #  Description of potential mechanisms and expected sign of effect
1   Change in stem density is expected to depend on initial density of stems. A positive parameter estimate would indicate positive density dependence, while a negative parameter estimate would indicate negative density dependence (negative effect).
2   Dependence of flea beetle density on plant stem density for BioA. (positive effect)
3   Site fidelity for BioA. (positive effect)
4   Effect of BioA on StemChg (expecting a negative effect, if control agent is effective)
5   Lag food effect on BioA. (positive effect)
6   Competitive effect of BioA on BioB. (negative effect)
7   Dependence of flea beetle density on plant stem density for BioB. (positive effect)
8   Site fidelity for BioB. (positive effect)
9   Effect of BioB on StemChg (expecting a negative effect, if control agent is effective)
10  Lag food effect on BioB. (positive effect)
11  Competitive effect of BioB on BioA. (negative effect)
Structural equation model representing an initial hypothesis regarding the potential effects of two biocontrol flea beetles, BioA (Aphthona nigriscutus) and BioB (Aphthona lacertosa), on an invasive plant (Euphorbia esula). Stem and beetle densities were measured over two years and the change in stem density between years 1 and 2 computed (StemChg12).
For the illustration below, I simulated 10,000 replicates using the lavaan package simulateData function and then captured the covariance matrix as input for illustrations. The illustrations in this paper assume a sample size of 150 (original investigation was based on 165 samples). Code used for data simulation is in the supplementary materials (Suppl. material
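The full simulation code is provided in the supplementary materials; the sketch below shows the general form of the step described above. It is illustrative only: 'gen.mod' stands in for the generating model (a lavaan model string with numeric coefficients) given in the supplement, and the call assumes the 10,000 replicates form one large simulated dataset.

# Simulate a large sample from the generating model and capture its covariances
library(lavaan)
big.sample <- simulateData(gen.mod, sample.nobs = 10000)  # gen.mod defined in the supplement
sim.cov.mat <- round(cov(big.sample), 4)                  # covariance matrix used as input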
In this section, I follow the Proposed Sequence described above to illustrate its application. In this example, I start with a single proposed a priori model and work from there. In other cases, we might have multiple candidate models from the beginning to evaluate, which would modify the sequence slightly.
Model 1 (Fig.
### load libraries
library(lavaan)

########## Simulation Study #1 ##########
# Covariance matrix for input (lower triangle)
sim.cov <- '
 1.2472
-0.1492  1.019
 0.8442 -0.178   1.6417
 0.0358  0.704   0.0357  1.5512
-0.2922 -0.233  -0.1217 -0.5276  1.488
 0.5235  0.149   0.5335  0.3277 -0.655  1.029'

# Convert to matrix and name variables
sim.cov.dat <- getCov(sim.cov, names = c("BioA1", "BioB1", "BioA2",
                                         "BioB2", "StemChg12", "Stems1"))

##### Scenario 1 - Set N = 150 #####
### mod1 - initial model
mod1 <- '
  # regressions
  BioA1 ~ Stems1
  BioB1 ~ Stems1
  BioA2 ~ BioA1 + Stems1 + BioB1
  BioB2 ~ BioB1 + Stems1 + BioA1
  StemChg12 ~ Stems1 + BioA2 + BioB2
'

# Estimate model 'mod1' using data matrix 'sim.cov.dat', N = 150
mod1.fit <- sem(mod1, sample.cov = sim.cov.dat, sample.nobs = 150)
Estimation of Model 1, using lavaan, returns the measures of overall model fit shown in Table
> summary(mod1.fit, fit.measures=T)

lavaan 0.6-5 ended normally after 15 iterations

  Estimator                                         ML
  Optimisation method                           NLMINB
  Number of free parameters                         16
  Number of observations                           150

Model Test User Model:
  Test statistic                                 9.125
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.058

Model Test Baseline Model:
  Test statistic                               247.786
  Degrees of freedom                                15
  P-value                                        0.000

User Model versus Baseline Model:
  Comparative Fit Index (CFI)                    0.978
  Tucker-Lewis Index (TLI)                       0.917

Loglikelihood and Information Criteria:
  Loglikelihood user model (H0)              -1274.742
  Loglikelihood unrestricted model (H1)      -1270.179
  Akaike (AIC)                                2581.483
  Bayesian (BIC)                              2629.654
  Sample-size adjusted Bayesian (BIC)         2579.017

Root Mean Square Error of Approximation:
  RMSEA                                          0.092
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.173
  P-value RMSEA <= 0.05                          0.153

Standardized Root Mean Square Residual:
  SRMR                                           0.055
1. Consideration of Sample Size – Since N = 150, all of the fit measures presented can be interpreted in a fairly straightforward fashion.
2. Examination of Model Χ2 Statistic, Model Degrees of Freedom and associated P-value – While the p-value returned is above the criterion of 0.05, it can be anticipated that, at a sample size of 150, we have sufficient statistical power to detect modest effect sizes. For this example, it is appropriate to assume that the true model contains a wide range of effect strengths, including some of scientific interest, but of modest size. With a p-value of 0.058, it would be naïve to simply accept this model without looking any further.
3. Examination of Select Approximate Fit Indices –
a. RMSEA is below 0.10, the lower CI boundary is 0.0 and associated p-value is 0.153. All these values suggest approximate fit.
b. CFI is estimated to be 0.978, which implies that we are probably not missing any large effects (assuming no interactions, as I do for this illustration).
c. SRMR is 0.055, well below the warning level of 0.10.
d. Summary: This evidence suggests that, when model-data discrepancies are averaged across the whole model, there is reasonable correspondence. However, this level of fit could be the result of averaging many equal-sized minor mis-specifications OR averaging many areas of the model with very close fit and a few important model-data mismatches. The limitation of global fit measures is their inability to distinguish between these two possibilities.
4. Examination of Modification Indices and Covariance Residuals – For now, I forego presenting a matrix of standardised residual covariances as their interpretation requires some experience. My typical process is to first examine modification indices and work with those, resorting to an examination of residuals if it seems necessary. Seven modification suggestions are returned in this case (Table
> # Modification Indices
> subset(modindices(mod1.fit), mi > 2, c("lhs", "op", "rhs", "mi", "epc"))

    lhs op   rhs    mi     epc
1 BioA1 ~~ BioB1 7.762  -0.224
2 BioA1  ~ BioB1 7.762  -0.226
3 BioA1  ~ BioA2 7.762   1.723
4 BioA1  ~ BioB2 7.762  -0.341
5 BioB1  ~ BioA1 7.762  -0.229
6 BioB1  ~ BioA2 7.762  -0.414
7 BioB1  ~ BioB2 7.762 -12.411
Fig.
### Model 2 (parameter numbers correspond to those in Fig. 5)
mod2 <- '
  # regressions
  BioA1 ~ b2*Stems1
  BioB1 ~ b7*Stems1
  BioA2 ~ b3*BioA1 + b5*Stems1 + b11*BioB1
  BioB2 ~ b8*BioB1 + b10*Stems1 + b6*BioA1
  StemChg12 ~ b1*Stems1 + b4*BioA2 + b9*BioB2
  # error covariance
  BioA1 ~~ b12*BioB1
'

# Estimate Model 2 (same data and sample size as Model 1)
mod2.fit <- sem(mod2, sample.cov = sim.cov.dat, sample.nobs = 150)
Steps 2-5: Examination of Global Fit Measures and Consideration of Additions to Model 2 – The Χ2 dropped from 9.125 to 1.155 (a decline of 8.03) and its p-value rose from 0.058 to 0.764 (Table
> summary(mod2.fit, fit.measures=T)

lavaan 0.6-3 ended normally after 16 iterations

  Optimisation method                           NLMINB
  Number of free parameters                         17
  Number of observations                           150
  Estimator                                         ML

  Model Fit Test Statistic                       1.155
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.764

Model test baseline model:
  Minimum Function Test Statistic              247.786
  Degrees of freedom                                15
  P-value                                        0.000

User model versus baseline model:
  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.040

Loglikelihood and Information Criteria:
  Loglikelihood user model (H0)              -1270.757
  Loglikelihood unrestricted model (H1)      -1270.179
  Number of free parameters                         17
  Akaike (AIC)                                2575.513
  Bayesian (BIC)                              2626.694
  Sample-size adjusted Bayesian (BIC)         2572.892

Root Mean Square Error of Approximation:
  RMSEA                                          0.000
  90 Percent Confidence Interval           0.000 0.093
  P-value RMSEA <= 0.05                          0.848

Standardized Root Mean Square Residual:
  SRMR                                           0.013
Step 7: Considering Simplification for Model 2 – We now turn to judging support for the links included. The information reported by lavaan that is most helpful at this point is the set of statistics associated with individual parameters. These are presented in Table
> # For examination of individual parameter support
> parameterEstimates(mod2.fit)

         lhs op       rhs label    est    se      z pvalue ci.low  ci.up
1      BioA1  ~    Stems1    b2  0.509 0.080  6.382  0.000  0.353  0.665
2      BioB1  ~    Stems1    b7  0.145 0.080  1.801  0.072 -0.013  0.302
3      BioA2  ~     BioA1    b3  0.554 0.085  6.496  0.000  0.387  0.721
4      BioA2  ~    Stems1    b5  0.256 0.094  2.718  0.007  0.071  0.440
5      BioA2  ~     BioB1   b11 -0.131 0.085 -1.549  0.121 -0.297  0.035
6      BioB2  ~     BioB1    b8  0.662 0.085  7.834  0.000  0.497  0.828
7      BioB2  ~    Stems1   b10  0.213 0.094  2.266  0.023  0.029  0.397
8      BioB2  ~     BioA1    b6  0.018 0.085  0.217  0.828 -0.149  0.186
9  StemChg12  ~    Stems1    b1 -0.643 0.091 -7.077  0.000 -0.821 -0.465
10 StemChg12  ~     BioA2    b4  0.139 0.069  2.005  0.045  0.003  0.275
11 StemChg12  ~     BioB2    b9 -0.208 0.067 -3.078  0.002 -0.340 -0.075
12     BioA1 ~~     BioB1   b12 -0.224 0.082 -2.717  0.007 -0.385 -0.062
13     BioA1 ~~     BioA1        0.974 0.113  8.660  0.000  0.754  1.195
14     BioB1 ~~     BioB1        0.991 0.114  8.660  0.000  0.767  1.215
15     BioA2 ~~     BioA2        1.008 0.116  8.660  0.000  0.780  1.236
16     BioB2 ~~     BioB2        1.008 0.116  8.660  0.000  0.780  1.236
17 StemChg12 ~~ StemChg12        0.968 0.112  8.660  0.000  0.749  1.187
18    Stems1 ~~    Stems1        1.022 0.000     NA     NA  1.022  1.022
A new model for evaluation is presented in Fig.
### Model 3 (parameter b4 now estimates an error correlation)
mod3 <- '
  # regressions
  BioA1 ~ b2*Stems1
  BioB1 ~ b7*Stems1
  BioA2 ~ b3*BioA1 + b5*Stems1 + b11*BioB1
  BioB2 ~ b8*BioB1 + b10*Stems1 + b6*BioA1
  StemChg12 ~ b1*Stems1 + b9*BioB2
  # error covariances
  BioA1 ~~ b12*BioB1
  StemChg12 ~~ b4*BioA2
'

# Estimate Model 3
mod3.fit <- sem(mod3, sample.cov = sim.cov.dat, sample.nobs = 150)
Steps 2-5: Examination of Global Fit Measures and Consideration of Additions to Model 3 – The fit statistics for Model 3 are similar to those for Model 2, but with even closer fit (full results not presented to economise on space). The Χ2 is now 0.133 and p-value = 0.988. All other fit statistics suggest there are no other important links missing from this model.
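Although the full output is not reproduced here, the key global fit values can be retrieved with a single call; the sketch below simply shows how, using the fitted Model 3 object.

# Compact summary of global fit for Model 3
fitMeasures(mod3.fit, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))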
Step 7: Considering Simplification for Model 3 – We now turn to an examination of the parameter-specific statistics for Model 3 (Table
> # For examination of individual parameter support
> parameterEstimates(mod3.fit)

         lhs op       rhs label    est    se      z pvalue ci.low  ci.up
1      BioA1  ~    Stems1    b2  0.509 0.080  6.382  0.000  0.353  0.665
2      BioB1  ~    Stems1    b7  0.145 0.080  1.801  0.072 -0.013  0.302
3      BioA2  ~     BioA1    b3  0.551 0.084  6.573  0.000  0.387  0.716
4      BioA2  ~    Stems1    b5  0.257 0.094  2.747  0.006  0.074  0.441
5      BioA2  ~     BioB1   b11 -0.133 0.084 -1.594  0.111 -0.297  0.031
6      BioB2  ~     BioB1    b8  0.662 0.085  7.834  0.000  0.497  0.828
7      BioB2  ~    Stems1   b10  0.213 0.094  2.266  0.023  0.029  0.397
8      BioB2  ~     BioA1    b6  0.018 0.085  0.217  0.828 -0.149  0.186
9  StemChg12  ~    Stems1    b1 -0.565 0.083 -6.786  0.000 -0.729 -0.402
10 StemChg12  ~     BioB2    b9 -0.224 0.067 -3.332  0.001 -0.355 -0.092
11     BioA1 ~~     BioB1   b12 -0.224 0.082 -2.717  0.007 -0.385 -0.062
12     BioA2 ~~ StemChg12    b4  0.181 0.083  2.184  0.029  0.019  0.344
Model 3B, Step 7: Considering Simplification for Model 3B – Model 3B was created by removing link 6 from Model 3. This is done by setting parameter b6 to zero (Table
### Model 3B (parameter b6 now set to zero)
mod3B <- '
  # regressions
  BioA1 ~ b2*Stems1
  BioB1 ~ b7*Stems1
  BioA2 ~ b3*BioA1 + b5*Stems1 + b11*BioB1
  BioB2 ~ b8*BioB1 + b10*Stems1 + b6*BioA1
  StemChg12 ~ b1*Stems1 + b9*BioB2
  # error covariances
  BioA1 ~~ b12*BioB1
  StemChg12 ~~ b4*BioA2
  # set parameters to zero
  b6 == 0
'

# Estimate Model 3B
mod3B.fit <- sem(mod3B, sample.cov = sim.cov.dat, sample.nobs = 150)
> # For examination of individual parameter support
> parameterEstimates(mod3B.fit)

         lhs op       rhs label    est    se      z pvalue ci.low  ci.up
1      BioA1  ~    Stems1    b2  0.509 0.080  6.382  0.000  0.353  0.665
2      BioB1  ~    Stems1    b7  0.145 0.080  1.801  0.072 -0.013  0.302
3      BioA2  ~     BioA1    b3  0.551 0.084  6.573  0.000  0.387  0.716
4      BioA2  ~    Stems1    b5  0.257 0.094  2.747  0.006  0.074  0.441
5      BioA2  ~     BioB1   b11 -0.133 0.084 -1.594  0.111 -0.297  0.031
6      BioB2  ~     BioB1    b8  0.658 0.082  7.993  0.000  0.497  0.820
7      BioB2  ~    Stems1   b10  0.223 0.082  2.723  0.006  0.063  0.384
8      BioB2  ~     BioA1    b6  0.000 0.000     NA     NA  0.000  0.000
9  StemChg12  ~    Stems1    b1 -0.565 0.083 -6.786  0.000 -0.729 -0.402
10 StemChg12  ~     BioB2    b9 -0.224 0.067 -3.332  0.001 -0.355 -0.092
11     BioA1 ~~     BioB1   b12 -0.224 0.082 -2.717  0.007 -0.385 -0.062
12     BioA2 ~~ StemChg12    b4  0.181 0.083  2.184  0.029  0.019  0.344
Model 3C, Step 7: Considering Simplification for Model 3C – We now direct our attention to Table
In this situation, we must decide which of the models that have been estimated are scientifically defensible. Model 1 was found to be clearly mis-specified (the data indicated an omitted linkage) and is not a contender for model selection. Model 2, while exhibiting close model-data fit, was found to lack theoretical support for the originally hypothesised effect of BioA2 on StemChg12. Model 3 was created to solve that problem. The subsequent models examined (Models 3B-D) were all attempts to evaluate simpler versions of Model 3 and all are scientifically defensible. The appropriate model comparison set in this case includes all versions of Model 3.
A model comparison table is presented in Table
> ##### Multimodel Comparisons
> library(AICcmodavg)
> aictab(list(mod3.fit, mod3B.fit, mod3C.fit, mod3D.fit),
+        c("MOD3", "MOD3B", "MOD3C", "MOD3D"))

Model selection based on AICc:

       K    AICc Delta_AICc AICcWt Cum.Wt       LL
MOD3B 16 2576.63       0.00   0.39   0.39 -1270.27
MOD3C 15 2576.64       0.01   0.39   0.77 -1271.53
MOD3D 14 2579.03       2.40   0.12   0.89 -1273.96
MOD3  17 2579.13       2.50   0.11   1.00 -1270.25
Model Fit

  Estimator                                         ML
  Model Fit Test Statistic                       0.180
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.996
  Comparative Fit Index (CFI)                    1.000
  RMSEA                                          0.000
  90 Percent Confidence Interval           0.000 0.000
  P-value RMSEA <= 0.05                          0.998
  SRMR                                           0.006

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)  Std.all
  BioA1 ~
    Stems1    (b2)     0.509    0.080    6.382    0.000    0.462
  BioB1 ~
    Stems1    (b7)     0.145    0.080    1.801    0.072    0.146
  BioA2 ~
    BioA1     (b3)     0.551    0.084    6.573    0.000    0.481
    Stems1    (b5)     0.257    0.094    2.747    0.006    0.204
    BioB1    (b11)    -0.133    0.084   -1.594    0.111   -0.105
  BioB2 ~
    BioB1     (b8)     0.658    0.082    7.993    0.000    0.534
    Stems1   (b10)     0.223    0.082    2.723    0.006    0.182
    BioA1     (b6)     0.000    0.000
  StemChg12 ~
    Stems1    (b1)    -0.565    0.083   -6.786    0.000   -0.470
    BioB2     (b9)    -0.224    0.067   -3.332    0.001   -0.228

Covariances:
                    Estimate  Std.Err  z-value  P(>|z|)  Std.all
  .BioA1 ~~
    .BioB1   (b12)    -0.224    0.082   -2.717    0.007   -0.227
  .BioA2 ~~
    .StmChg12 (b4)     0.181    0.083    2.184    0.029    0.181

R-Square:
                    Estimate
  BioA1                0.214
  BioB1                0.021
  BioA2                0.381
  BioB2                0.346
  StemChg12            0.328
Additional summary results for the selected model are shown in Table
A question raised earlier in the paper was about the consequences of retaining a weakly-supported link in a model for the other model parameters. It is known that leaving out an important link can have a major impact on estimated parameter values for the included links. This sensitivity is illustrated by the fact that model Χ2 can drop abruptly when a single link is added (we observed a drop of over 8 points when we added link 12 to create Model 2). The elimination of link 6 from Model 3, however, only increased model Χ2 by a tiny amount (0.05), which is typical when parameters with little support are removed. More to the point in this paper is the question of what would happen if we were to set the dependence of beetle species B on plant stems (parameter b7) to zero? The results from such a change are presented in Table
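One way such a variant could be specified, following the same constraint style used for Model 3B, is sketched below (the object names here are illustrative additions, not part of the original supplementary code).

# Hypothetical variant of Model 3B with b7 also constrained to zero
mod3B.b7zero <- paste(mod3B, '\n b7 == 0')
mod3B.b7zero.fit <- sem(mod3B.b7zero, sample.cov = sim.cov.dat, sample.nobs = 150)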
Results if b7 in Model 3B were to be set to zero (compare to results in Table
Model Fit

  Estimator                                         ML
  Model Fit Test Statistic                       3.390
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.640
  Comparative Fit Index (CFI)                    1.000
  RMSEA                                          0.000
  90 Percent Confidence Interval           0.000 0.092
  P-value RMSEA <= 0.05                          0.793
  SRMR                                           0.050

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)  Std.all
  BioA1 ~
    Stems1    (b2)     0.541    0.078    6.974    0.000    0.485
  BioB1 ~
    Stems1    (b7)     0.000       NA    0.000
  BioA2 ~
    BioA1     (b3)     0.551    0.084    6.573    0.000    0.481
    Stems1    (b5)     0.257    0.093    2.769    0.006    0.201
    BioB1    (b11)    -0.133    0.083   -1.610    0.107   -0.104
  BioB2 ~
    BioB1     (b8)     0.658    0.081    8.079    0.000    0.541
    Stems1   (b10)     0.223    0.081    2.752    0.006    0.184
    BioA1     (b6)     0.000    0.000
  StemChg12 ~
    Stems1    (b1)    -0.565    0.082   -6.903    0.000   -0.474
    BioB2     (b9)    -0.224    0.067   -3.343    0.001   -0.227

Covariances:
                    Estimate  Std.Err  z-value  P(>|z|)  Std.all
  .BioA1 ~~
    .BioB1   (b12)    -0.228    0.083   -2.743    0.006   -0.230
  .BioA2 ~~
    .StmChg12 (b4)     0.181    0.083    2.184    0.029    0.181

R-Square:
                    Estimate
  BioA1                0.235
  BioB1                0.000
  BioA2                0.397
  BioB2                0.327
  StemChg12            0.316
In this paper, I describe a way to bring the necessary evaluation of p-values into what is ultimately a model comparison process. The advice provided by mainstream SEM, which is heavily influenced by the study of complex latent variable models, is both exhaustive and exhausting for the ecologist. This paper provides updated advice for practitioners who rely on global estimation software packages for their SEM analyses.
At the present time, the greatest challenge for future studies is to provide defensible advice for the use of information measures in model comparisons. Most researchers investigating this topic have sought to identify a single index for all-purpose use. The most visible discussions in the field of ecology have debated the use of AIC versus BIC. There are also a great many variants of AIC and BIC that have been developed and discussed. For those in ecology, the recent simulation study by
The presentation here would be incomplete without mentioning that the ultimate solution to selecting the best model involves the data itself and not the methods of analysis. The bigger and better the sample, the more confidence we may have in the conclusions. If our goal is to generalise beyond the current sample, there is no substitute for sound mechanistic knowledge and sequential learning across linked studies (
I thank Lori Randall, USGS, Maria Felipe-Lucia, Helmholtz Center for Environmental Research and Frank Pennekamp, University of Zurich, for helpful review comments and suggestions. This work was supported by the USGS Land Change Science and Ecosystems Programs. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
U.S. Geological Survey
No conflicts of interest.
This text file contains the R code used to develop the demonstrations included in Grace JB (2020) A 'weight of evidence' approach to evaluating structural equation models. One Ecosystem