One Ecosystem
Review Article
Corresponding author: James Benjamin Grace (jimgrace001@gmail.com)
Academic editor: Alessandro Gimona
Received: 09 Aug 2021 | Accepted: 27 Jan 2022 | Published: 01 Feb 2022
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation: Grace JB (2022) General guidance for custom-built structural equation models. One Ecosystem 7: e72780. https://doi.org/10.3897/oneeco.7.e72780
Structural Equation Modelling (SEM) represents a quantitative methodology for specifying and evaluating causal network hypotheses. The application of SEM typically involves the use of specialised software packages that implement estimation procedures and automate model checking and the output of summary results. At times, the specifications an investigator needs to represent the relationships in their data are not supported by available SEM packages. In such cases, it may be desirable to develop and evaluate SE models “by hand”, using specialised regression tools. In this paper, I demonstrate a general approach to custom-built applications of SEM. The approach illustrated can be used for a wide array of specialised applications involving non-linear, multi-level and other custom specifications in SE models.
Keywords: structural equation modelling, custom specifications, Acadia National Park
Structural equation modelling has grown in popularity as a method for the quantitative analysis of natural systems over the past two decades (
Covariance methods, such as lavaan, provide tremendous flexibility with regard to the kinds of models that can be estimated. These include those with latent variables, error correlations, reciprocal interactions and causal loops. The greatest limitations associated with modelling covariances in lavaan include: (a) response types are limited to Gaussian and binary, (b) linkages must be linear equations and (c) incorporation of random effects is limited. The piecewiseSEM package permits a wide variety of response types and also the use of mixed models. It too runs into limitations, however, when dealing with non-linear linkages and more complex responses, such as multi-parameter zero-inflated response variables. In cases where lavaan and piecewiseSEM do not support particular specifications of interest, it is possible to develop SE models as collections of submodels built from a series of regressions. In this paper I illustrate basic procedures for such custom implementations.
For this illustration, I utilise data from a study of wetland biotic integrity, conducted at the Acadia National Park (NP) in Maine, USA (data from
The study area for the ecological example – Acadia National Park, Maine, USA. Upper panel is the view towards the Atlantic Ocean from the top of Cadillac Mountain, the highest point in the Park. Panel A is from an area where human disturbance of the watershed is minor, while Panel B is from an area with intense human disturbance (sewage treatment facility). Panel C is a map of the Park showing variation in human disturbance intensity (HDI) for the wetlands sampled. Photographs by the author. Map produced by Kathryn Miller, US National Park Service.
The ambition of the original study is represented by the metamodel shown in Fig.
1. Establish Causal Assumptions, Testable Implications and Alternative Structural Models
In SEM, we evaluate data expectations that follow from hypothesised model architectures using the principles of causal analysis. This can and should be addressed prior to considering specification details, as it directs the investigator’s focus to important causal assumptions that are separate from any statistical assumptions.
Initial causal diagram (on left), relevant causal assumptions and potential alternative structural models. U variables in the causal diagram refer to unspecified causes of variations in models. In structural models, which are informed by data, U variables are replaced by prediction error variables (epsilons) where appropriate.
In this case, the omission of arrows from Use to Flood, Use to Rich and Hyd to Rich suggests formal tests of our hypothesis that apply regardless of statistical details. In words, the hypothesis is that the effect of land use (human activity) on native richness (biological integrity) can be explained by associated changes in hydrology and water level stability. Once data are brought to bear, the conditional independence claims can be tested (which ultimately leads us to learn that more processes were at work than were implied by our initial hypothesis).
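The conditional independence claims implied by the omitted arrows can be listed mechanically. The following is a minimal sketch in Python, assuming that the initial hypothesis is the simple chain Use → Hyd → Flood → Rich described in the text. The function name `basis_set` is hypothetical, and the rule applied (condition each non-adjacent pair on the parents of both variables) is one common way of constructing such a list, not necessarily the procedure followed in the supplementary R code.

```python
from itertools import combinations

# Assumed initial causal diagram: each variable maps to its direct causes (parents).
PARENTS = {
    "Use": [],
    "Hyd": ["Use"],
    "Flood": ["Hyd"],
    "Rich": ["Flood"],
}


def basis_set(parents):
    """List one conditional independence claim per pair of non-adjacent variables.

    Each claim is (x, y, conditioning_set), where the conditioning set is the
    union of the parents of x and the parents of y.
    """
    claims = []
    for x, y in combinations(parents, 2):  # pairs follow the dict's insertion order
        adjacent = x in parents[y] or y in parents[x]
        if not adjacent:
            given = sorted((set(parents[x]) | set(parents[y])) - {x, y})
            claims.append((x, y, given))
    return claims


CLAIMS = basis_set(PARENTS)
for x, y, given in CLAIMS:
    print(f"{x} independent of {y} given {', '.join(given)}")
```

Under the assumed chain, the three claims correspond to the three omitted arrows named above. Each claim can then be examined by adding the omitted predictor to the relevant submodel and checking whether it receives empirical support.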
Also of importance in our causal diagram (Fig. 4) is the absence of arrows connecting the U variables. The U variables represent the unspecified additional causes creating variations in our variables. If we hypothesised an omitted confounding factor as part of the data generating process, we would include double-headed arrows connecting the U variables influenced by the omitted factor.
When developing SE models by hand, it is important for the investigator to contemplate the alternative models that could be discovered through the analyses. When using specialised SEM programmes and packages, this is not essential (though it can be valuable) because automated procedures will compare estimated models to hypothetical saturated models. Fig. 4 shows the nested set of possible structural models in the right panel. Here, arrows labelled D-F represent linkages that may be added to the initial model, while arrows labelled A-C represent linkages that might be dropped due to lack of empirical support.
2. Consider Data Characteristics
None of the variables included in this analysis is a classical Gaussian variable (Fig.
3. Deciding and Incorporating Custom Specifications
Here, we develop the overall model as a series of submodels, one submodel for each endogenous variable. The specification of submodels involves both an initial consideration of variable characteristics and an evaluation of model diagnostics. As a result, it is possible for final specifications to be of a different form from those initially considered. The R-code for proceeding from initial evaluation to final submodel selection is given in Suppl. material
Hyd – Proportion of Possible Hydrologic Alterations found at a Site: This variable is bounded by zero and the maximum number possible and can be construed to represent proportional counts. For this submodel, I utilised a GLM for Proportional Data as suggested by
Flood – Proportion of the Year that each Site was Flooded: This variable is of the same general form as Hyd, though with many more values. There is, however, an important difference. The Flood variable is based on a truncation of information from the underlying variable “water table level”. The actual water table level is not part of the dataset analysed here and thus constitutes a “latent cause”. As discussed in
Rich – Number of Native Species found at a Site: The Poisson distribution is often idealised as the appropriate expectation for species richness values. In many real-world situations, the variable can be modelled using log-linear models, though in this case, I chose to examine both Poisson and quasi-Poisson GLMs. Diagnostics from the quasi-Poisson model indicated only modest overdispersion (dispersion parameter value < 1.5). Additionally, a QQ plot of the residuals showed reasonably good fit (Fig. 6, Panel B). Therefore, I chose the Poisson model for interpretations.
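The dispersion check mentioned above can be illustrated with a short calculation. The sketch below, written in Python for illustration only, computes the Pearson dispersion statistic (the Pearson chi-square divided by the residual degrees of freedom), which is the quantity a quasi-Poisson fit reports as its dispersion parameter. The counts are hypothetical values, not the Acadia data, and an intercept-only model is assumed so that the fitted mean is simply the sample mean; the function name `pearson_dispersion` is also an assumption.

```python
def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi_sq = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi_sq / (len(counts) - n_params)


# Hypothetical richness counts, for illustration only (not the Acadia data).
toy_counts = [12, 15, 9, 20, 14, 11, 17, 13]

# Intercept-only Poisson model: the fitted mean for every site is the sample mean.
mean_count = sum(toy_counts) / len(toy_counts)
toy_fitted = [mean_count] * len(toy_counts)

dispersion = pearson_dispersion(toy_counts, toy_fitted, n_params=1)
print(f"dispersion = {dispersion:.2f}")
```

Values close to 1 are consistent with the Poisson assumption that the variance equals the mean, while values well above 1 indicate overdispersion.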
4. Checking for Omitted Links That Should be Added
Our approach with SEM is nearly always one of: (1) first check for omitted links that should be included and (2) only after additional links are included do we consider whether any links should be removed. It is easy, in this simple example case, to recognise fairly quickly the alternative models that could be considered (see Alternative Models in Fig. 4). Things will not always be this simple and, in other situations, I would adopt a more sophisticated approach to model checking (refer to
5. Model Pruning, Model Comparisons and Model Selection
The p-values reported from the submodel analyses provided very clear support for or against alternative submodels. Direct model comparisons were conducted where such comparisons seemed necessary (Suppl. 1). The selected submodels are given in Fig.
6. Development of Summary Quantities
The first step in summarising results is to assemble the submodels into the whole model (Fig. 7). In this illustration of methodology, I simply present the raw parameter estimates and standardised quantities, including both standardised parameter estimates and approximate R-squares. Standardised parameter estimates are quantities typically returned for the investigator by SEM software. In this case, however, these need to be computed by hand.
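For a linear submodel, computing a standardised coefficient by hand amounts to rescaling the raw slope by the ratio of the standard deviations of predictor and response. The following Python sketch illustrates that calculation for a single predictor. The data values and the function names `ols_slope` and `standardised_slope` are hypothetical, and the sketch does not reproduce the supplementary R code or the latent-linear method used for Hyd.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def standardised_slope(x, y):
    """Raw slope rescaled to standard-deviation units: b * sd(x) / sd(y)."""
    return ols_slope(x, y) * stdev(x) / stdev(y)


# Hypothetical values, for illustration only (not the Acadia data).
toy_x = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
toy_y = [0.9, 0.7, 0.8, 0.5, 0.3, 0.2]

raw_b = ols_slope(toy_x, toy_y)
std_b = standardised_slope(toy_x, toy_y)
print(f"raw slope = {raw_b:.3f}, standardised slope = {std_b:.3f}")
```

With a single predictor, the standardised slope equals the Pearson correlation between the two variables. With several predictors, the same rescaling is applied to each partial slope in turn.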
The R code and details for how standardised coefficients were computed are presented in Suppl. 1. For Hyd, which relied on a quasi-binomial specification, I used the latent-linear standardisation method, described in Grace et al. (2018). For Flood, I used a linear model and computed standardised coefficients by hand. For Rich, I created a log-linear approximation to the Poisson model and, to illustrate a slightly simpler approach than used for Flood, extracted standardised coefficients using the QuantPsyc R package (
In this ecological example, the primary hypothesis of a causal chain, whereby intensity of land use influences changes in hydrology that ultimately impact native plant richness, is supported by the results (Fig. 7). This causal chain provides an explanation for the previous observation that measures of biotic integrity are lower in wetlands whose watersheds have been substantially altered. Standardised coefficient values indicate that all of the direct effects in the model are moderately strong contributors to explaining the observed variation in the sample of wetlands studied. Support was also found for an unanticipated effect of land use on richness independent of variations in hydrology. At present, it is not clear what this effect might represent, leaving an important mechanism for further investigation.
It is worth considering the merits and demerits associated with detailed model specifications. Generalised linear models (GLMs), as employed here for count data, rely on link functions that attempt to match the distributions of the observations. This has two proposed benefits. First, assumptions about error distributions tend to be more closely adhered to compared to linear models. This is especially true at the extremes of the distributions. Second, predicted values will fall within the observed limits of measured variables, which will be helpful when the equations are used for forecasting. What often goes undiscussed in standard statistical presentations is the fact that GLMs are non-linear functions and, as a result, the coefficients returned can be quite difficult to interpret scientifically. For Poisson models, interpretational issues are minimal because the link function is simply the log of the expected counts (thus, coefficients represent log-linear relationships). Binomial models, however, are typically based on logit link functions, which can be challenging to interpret. Scientists are accustomed to interpreting coefficients as consistent slopes of response. However, binomial models produce coefficients that represent log odds ratios and the actual relationships between observations and predictors are non-linear outside of the middle range of values. This issue is sufficiently problematic that it is not uncommon for experienced investigators to use linear models when analysing binomial data (
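The contrast between the log and logit links can be made concrete with a small numerical sketch. The Python example below uses hypothetical coefficients (an intercept of 0 and a slope of 1, chosen only for illustration) to show that, under a log link, a one-unit increase in a predictor multiplies the expected count by the same factor everywhere, whereas under a logit link, the same increase changes the predicted probability by an amount that depends on where along the curve it occurs.

```python
import math


def inv_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 0.0, 1.0

# Log link: a one-unit increase in x multiplies the expected count by exp(b1),
# regardless of the starting value of x.
count_ratio = math.exp(b0 + b1 * 1.0) / math.exp(b0 + b1 * 0.0)


def prob_change(x_start):
    """Change in predicted probability for a one-unit increase from x_start."""
    return inv_logit(b0 + b1 * (x_start + 1.0)) - inv_logit(b0 + b1 * x_start)


mid_change = prob_change(-0.5)  # step centred on a probability of 0.5
tail_change = prob_change(3.0)  # step far out in the upper tail

print(f"count ratio = {count_ratio:.3f}")
print(f"probability change near the middle = {mid_change:.3f}")
print(f"probability change in the tail = {tail_change:.3f}")
```

The same logit coefficient therefore corresponds to a much larger change in probability near the middle of the range than in the tail, which is why a single coefficient cannot be read as a consistent slope on the probability scale.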
Ecologists are increasingly adopting powerful and sophisticated regression techniques for their analyses. By implementing SEM as a network of regressions, our modelling options are greatly expanded. What existing regression packages often do not return are the summary quantities we might need to interpret networks of relationships. For example, aside from generating comparable standardised coefficients, which are vital for interpreting SE models, we might also wish to compute indirect and total effects by multiplying path coefficients, which is easily accomplished for custom applications. This paper provides a general demonstration of how to develop custom-built SE models.
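The calculation of indirect and total effects from path coefficients can be sketched in a few lines. The Python example below assumes a model with the chain Use → Hyd → Flood → Rich plus a direct path from Use to Rich, mirroring the structure described in the text. The coefficient values are hypothetical and are not the estimates reported in Fig. 7, and the function names are assumptions. Multiplying coefficients along a path in this way presumes linear relationships expressed on a common (for example, standardised) scale and a model without causal loops.

```python
# Hypothetical standardised path coefficients, for illustration only; they are
# not the estimates reported in Fig. 7.
PATHS = {
    ("Use", "Hyd"): 0.5,
    ("Hyd", "Flood"): -0.4,
    ("Flood", "Rich"): 0.6,
    ("Use", "Rich"): -0.3,
}


def path_products(paths, source, target):
    """Return the coefficient product along every directed path source -> target."""
    if source == target:
        return [1.0]
    products = []
    for (cause, effect), coef in paths.items():
        if cause == source:
            products.extend(
                coef * rest for rest in path_products(paths, effect, target)
            )
    return products


def effects(paths, source, target):
    """Split the total effect into its direct and indirect components."""
    direct = paths.get((source, target), 0.0)
    total = sum(path_products(paths, source, target))
    return {"direct": direct, "indirect": total - direct, "total": total}


RESULT = effects(PATHS, "Use", "Rich")
print(RESULT)
```

Here the indirect effect is the product of the three coefficients along the chain and the total effect is the sum of the direct and indirect effects.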
It is important to note that, while custom-built modelling allows for a greater variety of statistical specifications, the classical approach of modelling covariance relationships (e.g. using ‘lavaan’) has its own unique strengths. Covariance modelling permits the inclusion of latent variables, error covariances and reciprocal interactions, all of which have special uses in SEM. Thus, custom-built SE models should be seen as an additional tool, but not a replacement for other existing methods.
I thank Darren Johnson of the US Geological Survey, Alessandro Gimona of The James Hutton Institute, Frank Pennekamp of the University of Zurich, Jackie Potts of The James Hutton Institute and an anonymous reviewer for helpful comments and suggestions. This work was supported by the USGS Ecosystems and Land Change Science Climate Research and Development Programs. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
US Geological Survey, Wetland and Aquatic Research Center, 700 Cajundome Blvd, Lafayette, Louisiana, USA
The author declares no conflicts of interest related to this publication.
This file contains the R code used to develop the demonstrations included in Grace JB (2021) General guidance for incorporating custom specifications in structural equation models. One Ecosystem.