Corresponding author: James Benjamin Grace (

Academic editor: Alessandro Gimona

Structural Equation Modelling (SEM) represents a quantitative methodology for specifying and evaluating causal network hypotheses. The application of SEM typically involves the use of specialised software packages that implement estimation procedures and automate model checking and the output of summary results. There are times when the specification details an investigator wishes to implement to represent their data relationships are not supported by available SEM packages. In such cases, it may be desirable to develop and evaluate SE models “by hand”, using specialised regression tools. In this paper, I demonstrate a general approach to custom-built applications of SEM. The approach illustrated can be used for a wide array of specialised applications of non-linear, multi-level and other custom specifications in SE models.

US Geological Survey, Wetland and Aquatic Research Center, 700 Cajundome Blvd, Lafayette, Louisiana, USA

The author declares no conflicts of interest related to this publication.

Structural equation modelling has grown in popularity as a method for quantitative analysis for natural systems in the past two decades (

Covariance methods, such as lavaan, provide tremendous flexibility with regard to the kinds of models that can be estimated. These include those with latent variables, error correlations, reciprocal interactions and causal loops. The greatest limitations associated with modelling covariances in lavaan include: (a) response types are limited to Gaussian and binary, (b) linkages must be linear equations and (c) incorporation of random effects is limited. The piecewiseSEM package permits a wide variety of response types and also the use of mixed models. It too runs into limitations, however, when dealing with non-linear linkages and more complex responses, such as multi-parameter zero-inflated response variables. In cases where lavaan and piecewiseSEM do not support particular specifications of interest, it is possible to develop SE models as collections of submodels built from a series of regressions. In this paper I illustrate basic procedures for such custom implementations.

Establish Causal Assumptions and Testable Implications

Consider Data Characteristics

Deciding on and Incorporating Custom Specifications

Checking for Omitted Links that Should be Added

Model Pruning, Model Comparisons and Model Selection

Development of Summary Quantities

For this illustration, I utilise data from a study of wetland biotic integrity, conducted at the Acadia National Park (NP) in Maine, USA (data from

The ambition of the original study is represented by the metamodel shown in Fig.

In SEM, we evaluate data expectations that follow from hypothesised model architectures using the principles of causal analysis. This can and should be addressed prior to considering specification details, as it directs the investigator’s focus to important causal assumptions that are separate from any statistical assumptions.

In this case, the omission of arrows from Use to Flood, Use to Rich and Hyd to Rich, suggest formal tests for our hypothesis that apply regardless of statistical details. In words, what is hypothesised is that the effect of land use (human activity) on native richness (biological integrity) can be explained by associated changes in hydrology and water level stability. Once data are brought to bear, the conditional independence claims can be tested (which ultimately leads us to learn that more processes were at work than were implied by our initial hypothesis).

Also of importance in our causal diagram (Fig. 4) is the absence of arrows connecting the U variables. The U variables represent the unspecified additional causes creating variations in our variables. If we hypothesised an omitted confounding factor as part of the data generating process, we would include double-headed arrows connecting the U variables influenced by the omitted factor.

When developing SE models by hand, it is important for the investigator to contemplate the alternative models that could be discovered through the analyses. When using specialised SEM programmes and packages, this is not essential (though can be valuable) because automated procedures will compare estimated models to hypothetical saturated models. Fig. 4 shows the nested set of possible structural models in the right panel. Here, arrows indicated by the letters D-F represent linkages that may be added to the initial model, while letters A-C indicate linkages that might be dropped due to lack of empirical support.

None of the variables included in this analysis is a classical Gaussian variable (Fig.

Here, we develop the overall model as a series of submodels, one submodel for each endogenous variable. The specification of submodels involves both an initial consideration of variable characteristics and an evaluation of model diagnostics. As a result, it is possible for final specifications to be of a different form from those initially considered. The R-code for proceeding from initial evaluation to final submodel selection is given in Suppl. material

Our approach with SEM is nearly always one of: (1) first check for omitted links that should be included and (2) only after additional links are included do we consider whether any links should be removed. It is easy, in this simple example case, to recognise fairly quickly the alternative models that could be considered (see Alternative Models in Fig. 4). Things will not always be this simple and, in other situations, I would adopt a more sophisticated approach to model checking (refer to

The p-values reported from the submodel analyses provided very clear support for or against alternative submodels. Direct model comparisons were conducted where such comparisons seemed necessary (Suppl. 1). The selected submodels are given in Fig.

The first step in summarising results is to assemble the submodels into the whole model (Fig. 7). In this illustration of methodology, I simply present the raw parameter estimates and standardised quantities, including both standardised parameter estimates and approximate R-squares. Standardised parameter estimates are quantities typically returned for the investigator by SEM software. In this case, however, these need to be computed by hand.

The R code and details for how standardised coefficients were computed are presented in Suppl. 1. For Hyd, which relied on a quasi-binomial specification, I used the latent-linear standardisation method, described in Grace et al. (2018). For Flood, I used a linear model and computed standardised coefficients by hand. For Rich, I created a log-linear approximation to the Poisson model and, to illustrate a slightly simpler approach than used for Flood, extracted standardised coefficients using the QuantPsyc R package (

In this ecological example, the primary hypothesis of a causal chain, whereby intensity of land use influences changes in hydrology that ultimately impact native plant richness, is supported by the results (Fig. 7). This causal chain provides an explanation for the previous observation that measures of biotic integrity are lower in wetlands whose watersheds have been substantially altered. Standardisd coefficient values indicate all of the direct effects in the model are moderately strong contributors to explaining the observed variation in the sample of wetlands studied. Support was also found for an unanticipated effect of land use on richness independent of variations in hydrology. At present, it is not clear what this effect might represent, leaving an important mechanism for further investigation.

It is worth considering the merits and demerits associated with detailed model specifications. Generalised linear models (GLMs), as employed here for count data, rely on link functions that attempt to match the distributions of the observations. This has two proposed benefits. First, assumptions about error distributions tend to be more closely adhered to compared to linear models. This is especially true at the extremes of the distributions. Second, predicted values will fall within the observed limits of measured variables, which will be helpful when the equations are used for forecasting. What is often undiscussed in standard statistical presentations is the fact that GLMs are non-linear functions and, as a result, the coefficients returned can be quite difficult to interpret scientifically. For Poisson models, interpretational issues are minimal because the link function is simply the log of the counts (thus, coefficients represent log-linear relationships). Binomial models, however, are typically based on logit link functions, which can be challenging to interpret. Scientists are accustomed to interpreting coefficients as consistent slopes of response. However, binomial models produce coefficients that represent log odds ratios and the actual relationships between observations and predictors are non-linear outside of the middle range of values. This issue is sufficiently problematic that it is not uncommon for experienced investigators to use linear models when analysing binomial data (

Ecologists are increasingly adopting powerful and sophisticated regression techniques for their analyses. By implementing SEM as a network of regressions, our modelling options are greatly expanded. What is often not returned by existing regression packages are the summary quantities we might need to interpret networks of relationships. For example, aside from generating comparable standardised coefficients, which are vital for interpreting SE models, we might also wish to compute indirect and total effects by multiplying path coefficients, which is easily accomplished for custom applications. This paper provides a general demonstration for how to develop custom-built SE models.

It is important to note that, while custom-build modelling allows for a greater variety of statistical specifications, the classical approach of modelling covariance relationships (e.g. using ‘lavaan’) has its own unique strengths. Covariance modelling permits the inclusion of latent variables, error covariances and reciprocal interactions, all of which have special uses in SEM. Thus, custom-built SE models should be seen as an additional tool, but not a replacement for other existing methods.

I thank Darren Johnson of the US Geological Survey, Alessandro Gimona of The James Hutton Institute, Frank Pennekamp of the University of Zurich, Jackie Potts of The James Hutton Institute and an anonymous reviewer for helpful comments and suggestions. This work was supported by the USGS Ecosystems and Land Change Science Climate Research and Development Programs. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

The author declares no conflicts of interest related to this publication.

The study area for the ecological example – Acadia National Park, Maine, USA. Upper panel is the view towards the Atlantic Ocean from the top of Cadillac Mountain, the highest point in the Park. Panel A is from an area where human disturbance of the watershed is minor, while Panel B is from an area with intense human disturbance (sewage treatment facility). Panel C is a map of the Park showing variation in human disturbance intensity (HDI) for the wetlands sampled. Photographs by the author. Map produced by Kathryn Miller, US National Park Service.

Metamodel representing the general hypothesis considered in the source publication (

SE model made up of four variables extracted for use in the illustration contained in this paper. The complete model, including the portion greyed out, was the structural equation model used in

Initial causal diagram (on left), relevant causal assumptions and potential alternative structural models. U variables in the causal diagram refer to unspecified causes of variations in models. In structural models, which are informed by data, U variables are replaced by prediction error variables (epsilons) where appropriate.

Histograms for the four variables used in the illustration.

QQ plots comparing residuals to model expected values.

Final model, R specifications used for the three subcomponent submodels, along with raw parameter estimates, standardised coefficients and adjusted R-squares.

General guidance for custom-built structural equation models

R code

This file contains the R code used to develop the demonstrations included in Grace JB (2021) General guidance for incorporating custom specifications in structural equation models. One Ecosystem.

File: oo_635925.R