One Ecosystem: Research Article
Corresponding author: Andrew M. Neill (anneill@tcd.ie)
Academic editor: Joachim Maes
Received: 26 Sep 2022 | Accepted: 13 Jan 2023 | Published: 02 Feb 2023
© 2023 Andrew Neill, Cathal O'Donoghue, Jane Stout
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Neill AM, O'Donoghue C, Stout JC (2023) Spatial analysis of cultural ecosystem services using data from social media: A guide to model selection for research and practice. One Ecosystem 8: e95685. https://doi.org/10.3897/oneeco.8.e95685
Experiences gained through in-person (in-situ) interactions with ecosystems provide cultural ecosystem services. These services are difficult to assess because they are non-material, vary spatially and have strong perceptual characteristics. Data obtained from social media can provide spatially-explicit information regarding some in-situ cultural ecosystem services by serving as a proxy for visitation. These data can identify environmental characteristics (natural, human and built capital) correlated with visitation and, therefore, the types of places used for in-situ environmental interactions. A range of spatial models of varying complexity can be applied in this way to provide information for ecosystem service assessments. To compare them, we applied four models (global regression, local regression, maximum entropy and the InVEST recreation model) to the same case-study area, County Galway, Ireland. A total of 6,752 photo-user-days (PUD), a visitation metric, were obtained from Flickr. Data describing natural, human and built capital were collected from national databases. Results showed a blend of capital types correlated with PUD, suggesting that local context, including biophysical traits and accessibility, is relevant for in-situ cultural ecosystem service flows. On average, distance to the coast and elevation were negatively correlated with PUD, while the presence of major roads and recreational sites, population density and habitat diversity were positively correlated. Evidence of local relationships, especially for town distance, was detected using geographic weighted regression. Predicted hotspots for visitation included urban areas in the east of the region and rural, coastal areas with major roads in the west.
We conclude by presenting a guide for researchers and practitioners developing cultural ecosystem service spatial models using data from social media that considers data coverage, landscape heterogeneity, computational resources, statistical expertise and environmental context.
cultural ecosystem services, visitation, social media, spatial modelling, geographic weighted regression, maximum entropy, InVEST, Ireland, ecosystem service assessment
Cultural ecosystem services are defined as “the non-material outputs of ecosystems that affect physical and mental states of people”, some of which require physical (in-situ) interactions between people and ecosystems (CICES v.5.1,
Assessing cultural ecosystem services is challenging for several reasons, one of which is that they vary spatially. As the mosaic of capital stocks varies across the landscape, so too does the basket of services those ecosystems provide to people (
Social media platforms contain information related to in-situ environmental experiences through uploaded content and associated metadata, such as GPS coordinates, descriptive text, titles and date of content creation (
GPS-tagged content uploaded to social media is an emergent source of spatial data used as a proxy for visitation occurrence and intensity because it records a "digital footprint" of places where people have visited (
Modelling the relationships between visitor occurrence and environmental characteristics (both biophysical and human and social attributes) is a common analysis applied to spatial data from social media (
We selected four different modelling approaches to spatial data from social media, based on their use within literature: 1) global regression, 2) local regression, 3) maximum entropy (MaxEnt) and 4) InVEST recreation model. Throughout this paper, we use the term global with reference to the study area in its entirety (not in the planetary sense) and corresponding models that summarise one average relationship (
Regression models applied at the global scale of a study area, such as generalised linear models (GLMs), summarise the average relationship between variables of interest and social media metrics. Examples include travel-cost estimations for tourism (
Few published studies consider more than one of the modelling approaches outlined above (exceptions include
Model selection for ecosystem service assessment should be co-informed by data availability, data-processing expertise, research questions and spatial extent of the area of interest (
County Galway, Ireland, was selected as the study area because of its heterogeneous landscape, socio-cultural heritage and high visitor numbers. In 2018, visitor numbers were estimated at 1 million domestic and 1.8 million international, contributing a total of €800 million in revenue (
Data were collected from the Flickr social media platform using the statistical programming language R v.4.1.2 (
The photo-user-days (PUD) metric developed by
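As a minimal sketch of how a PUD-style count could be derived, the snippet below assumes the common definition of one photo-user-day as one unique user uploading at least one photograph on a given day in a given cell. The record layout, the projected coordinates in metres and the `cell_size_m` grid spacing are hypothetical choices for illustration, not the workflow used in this study.

```python
from collections import defaultdict
from datetime import date


def photo_user_days(records, cell_size_m=2000.0):
    """Count unique (user, day) pairs per grid cell.

    records: iterable of (user_id, day, x_m, y_m) tuples, with coordinates
    in a projected system measured in metres (an assumption of this sketch).
    """
    seen = defaultdict(set)
    for user_id, day, x, y in records:
        # Index the cell by integer column and row.
        cell = (int(x // cell_size_m), int(y // cell_size_m))
        # A set keeps each user-day combination only once per cell.
        seen[cell].add((user_id, day))
    return {cell: len(pairs) for cell, pairs in seen.items()}


records = [
    ("u1", date(2018, 6, 1), 100.0, 100.0),
    ("u1", date(2018, 6, 1), 150.0, 900.0),  # same user, same day, same cell
    ("u1", date(2018, 6, 2), 120.0, 300.0),  # same user, new day
    ("u2", date(2018, 6, 1), 1900.0, 50.0),  # different user
    ("u2", date(2018, 6, 1), 2100.0, 50.0),  # neighbouring cell
]
pud = photo_user_days(records)
```

Five photographs collapse to four photo-user-days here, which is why PUD is less sensitive than raw photo counts to a single user uploading many images in one visit.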
Environmental variables were selected, based on natural, human and built capital attributes identified as factors contributing to cultural service flows (particularly recreation) in the UN System of Environmental-Economic Accounting framework (
List of variables and indicators capturing human, natural and built capital for each 2 km2 grid cell and their sources.
Predictor | Capital | Indicator | Data source and format |
---|---|---|---|
Elevation | Natural | Average elevation | Copernicus remote sensing DEM. 25 m raster data ( |
Slope | Natural | Average slope | Copernicus remote sensing DEM. 25 m raster data ( |
Rivers | Natural | Length of river | Environmental Protection Agency. Vector data of river bodies ( |
Freshwater cover | Natural | Area of freshwater (lakes, ponds, rivers) | Water framework directive. Vector data of lake segments ( |
Coastline | Natural | Distance to coastline | Ordnance Survey Ireland. Land mask of Ireland 250 k vector file ( |
Habitat diversity | Natural | Number of CORINE classification types | CORINE land-cover map 2012. Raster of land-cover types at 100 m ( |
Land cover | Natural | Area of four land-cover types (agriculture, wetlands, urban and forestry and natural areas). | CORINE land-cover map 2012. Raster of land-cover types at 100 m ( |
Land-cover diversity | Natural | Shannon’s diversity index | CORINE land-cover map 2012. Raster of land-cover types at 100 m ( |
Geological heritage | Natural | Presence of designated geological heritage | Geological Survey Ireland. Vector data of recommended heritage sites ( |
Protected status | Natural | Area covered under protected status | National Parks and Wildlife Service. Vector of protected areas. ( |
Town distance | Built | Distance to nearest town | CSO Census 2011. Boundaries of designated towns and cities (vector) ( |
Population | Human | Population density | CSO Census 2011. Grid of population density at 1 km2 ( |
Major roads | Built | Presence of a major road | Ordnance Survey Ireland. National road network vector file. ( |
Path density | Built | Density of path length | OpenStreetMap roads database (vector) ( |
Amenity and recreation sites | Built | Presence of a recreational site (examples include bike rental, sports trails, boating, angling, golf courses) | Fáilte Ireland activity database (point data) ( |
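The two diversity indicators in the table can be illustrated with a short sketch. It assumes that each grid cell is summarised as a list of land-cover class labels (one per raster pixel falling in the cell) and that Shannon's index uses the natural logarithm; the class names are hypothetical.

```python
import math
from collections import Counter


def habitat_richness(cover_classes):
    """Number of distinct land-cover classes present in a cell."""
    return len(set(cover_classes))


def shannon_index(cover_classes):
    """Shannon's diversity index H = -sum(p_i * ln(p_i)) over class proportions."""
    counts = Counter(cover_classes)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


uniform_cell = ["wetland"] * 4
mixed_cell = ["wetland", "wetland", "forest", "forest"]
```

A cell with a single class scores zero, while two equally abundant classes score ln(2), so the index rewards evenness as well as the number of classes.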
Four modelling exercises were applied in this study: 1) global regression (using both presence and count data), 2) local GWR (presence data), 3) MaxEnt and 4) InVEST (Table
Outline of four models and description of their basic structure, data requirements and output. PUD refers to photo-user-days metric.
Model family | Model structure | Description | Response (GPS-tagged content) | Predictors (natural, human and built capital) | Output |
---|---|---|---|---|---|
Global regression (1) | Non-spatial global logistic regression (a) | Generalised linear model (binomial family and logit link). | Binary. PUD presence | Environmental indicators at 2 km2 resolution | Average relationships between environmental attributes and likelihood of PUD occurrence. |
Global regression (1) | Global logistic spatially autocorrelated mixed-model (SAM) (b) | Generalised linear mixed-model (binomial family and logit link). Environmental predictor fixed effects and spatial random effect (long + lat). | Binary. PUD presence | Environmental indicators at 2 km2 resolution | Average relationships between environmental attributes and likelihood of PUD occurrence accounting for spatial heterogeneity. |
Global regression (1) | Global Poisson spatially autocorrelated mixed-model (SAM) (c) | Generalised linear mixed-model (Poisson family and log link). Environmental predictor fixed effects and spatial random effect (long + lat). | Count. PUD total | Environmental indicators at 2 km2 resolution | Average relationship between environmental attributes and PUD counts accounting for spatial heterogeneity. |
Local regression (2) | Logistic geographic weighted regression (GWR) | A series of logistic regressions (binomial family) computed across the landscape weighting data, based on proximity to regression point. | Binary. PUD presence | Environmental indicators at 2 km2 resolution | Local relationship between environmental attributes and PUD occurrence that vary spatially in magnitude, direction and significance. |
Maximum Entropy (3) | Maximum Entropy (MaxEnt) | Predictive model that uses machine-learning and a presence-only approach, based on observed occurrence to predict areas of high suitability. | Point coordinates. PUD occurrence | Environmental variables as 100 m rasters | 100 m resolution map of modelled suitability of visitation occurrence and variable contributions to model performance. |
InVEST (4) | InVEST recreation model | Self-contained model that queries an archived data set and calculates spatial statistics on user-supplied spatial data to compute an OLS regression. | Log Count. Annual PUD | Raw spatial data (vector or raster) analysed using pre-set options | Average relationship between environmental attributes and log annual PUD. |
A logistic GLM (model a) was computed in R using a binary response variable describing the presence of PUD records per 2 km2 grid cell according to the formula:
PUD presence ~ Environmental predictors, family = binomial (link = logit) (a)
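The structure of model (a) can be sketched with a single predictor. The snippet below is an illustration only, not the authors' R workflow: it fits a logistic regression by Newton-Raphson on invented presence data and compares its AIC with an intercept-only model. The `elevation` values (in hypothetical units) and `presence` records are assumptions made for the example.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson (one predictor)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(x, y, b0, b1):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll


def aic(ll, n_params):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * ll


# Invented data: presence becomes less likely as elevation increases.
elevation = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
presence = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

b0, b1 = fit_logistic(elevation, presence)
aic_full = aic(log_likelihood(elevation, presence, b0, b1), 2)

# Intercept-only model: the fitted probability is the overall presence rate.
rate = sum(presence) / len(presence)
b0_null = math.log(rate / (1.0 - rate))
aic_null = aic(log_likelihood(elevation, presence, b0_null, 0.0), 1)
```

A lower AIC for the model with the predictor indicates that the extra parameter is justified by the improvement in fit, which is the comparison an AIC-based selection procedure repeats across candidate variable sets.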
The grid size was selected to ensure a sufficient proportion of cells contained presence records and to capture the variation of environmental attributes. Model optimisation was conducted using stepwise model selection to minimise the Akaike Information Criterion (AIC). This is a standard model optimiser that balances performance and complexity to identify the most parsimonious model (
The output of this global model summarised average trends for the entire region. We hypothesised that these relationships may vary across the landscape due to local socio-environmental context and accessibility. Model performance across spatial scales can be assessed by testing for spatial autocorrelation of the residuals (
PUD presence ~ Environmental predictors + Matern(1|Lat + Long), family = binomial (link = logit) (b)
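The residual check that motivates adding a spatial random effect can be sketched with Moran's I, a widely used statistic for spatial autocorrelation. Whether this exact statistic was used in the study is an assumption of this example, as are the binary distance-threshold neighbourhood and the toy residuals.

```python
import math


def morans_i(values, coords, threshold):
    """Moran's I with binary weights: cells within `threshold` are neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    return (n / weight_sum) * (cross / sum(d * d for d in dev))


# Four cells along a transect, one unit apart.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
clustered = morans_i([1, 1, -1, -1], coords, threshold=1)
alternating = morans_i([1, -1, 1, -1], coords, threshold=1)
```

Positive values indicate that neighbouring residuals resemble each other (clustering), which violates the independence assumption of a non-spatial GLM; values near zero are consistent with no spatial structure.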
This procedure is a recommended approach for modelling spatial data in ecology where spatial autocorrelation is detected (
The PUD count per 2 km2 grid cell was used as the response variable in a third model. Count data are typically modelled using a Poisson GLM (
PUD Count ~ Environmental predictors + Matern(1|Lat + Long), family = poisson (link = log) (c)
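A common diagnostic for count models of this kind is the Pearson dispersion statistic, which should be close to 1 if the Poisson assumption of equal mean and variance holds. The sketch below uses an intercept-only fit and invented counts; both are assumptions for illustration rather than values from this study.

```python
def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi_square / (len(counts) - n_params)


# Invented PUD counts: mostly empty cells and one heavily visited cell.
counts = [0, 0, 0, 1, 0, 2, 0, 0, 45, 3]

# For an intercept-only Poisson model, the fitted value is the sample mean.
mean_count = sum(counts) / len(counts)
dispersion = pearson_dispersion(counts, [mean_count] * len(counts), n_params=1)
```

A value well above 1, as produced by the skewed counts here, signals overdispersion and suggests that standard errors from a plain Poisson model would be too small.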
A logistic GWR model was conducted to test for spatially varying relationships using the presence of PUD. GWR computes repeated regression analyses across the landscape and applies a distance-based weighting function so that data points closer to the regression point are weighted more compared to distant data points (see
A GWR model using the PUD count variable was trialled, but ultimately deemed inappropriate. Because a global Poisson GLM was not supported owing to overdispersion and heteroskedasticity, a Poisson GWR was also judged unsuitable: it simply runs a series of Poisson models with the same variables under different weighting schemes. Model structures that could address this (e.g. negative binomial, quasi-Poisson) are not currently available in combination with GWR techniques and lack consensus in the academic community, so we did not pursue them. They could be considered as they become available in the future.
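The distance-based weighting that GWR applies around each regression point can be sketched as follows. A Gaussian kernel with a fixed bandwidth is assumed here for illustration; the kernel and bandwidth used in this study are not restated, and the coordinates are hypothetical.

```python
import math


def gaussian_weights(regression_point, data_points, bandwidth):
    """Weight each observation by exp(-0.5 * (distance / bandwidth) ** 2)."""
    return [
        math.exp(-0.5 * (math.dist(regression_point, p) / bandwidth) ** 2)
        for p in data_points
    ]


# Observations at 0, 5 and 20 distance units from the regression point.
points = [(0.0, 0.0), (5.0, 0.0), (20.0, 0.0)]
weights = gaussian_weights((0.0, 0.0), points, bandwidth=5.0)
```

Each local regression is then fitted with these weights, so a smaller bandwidth yields more localised (and more variable) coefficient estimates, while a very large bandwidth approaches the global model.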
MaxEnt was the third model type tested. This model predicts areas of high probability for visitation, based on characteristics associated with sampled PUD occurrence. MaxEnt differs from the regression models outlined above because it adopts a “presence-only” approach. Sampling social media records cannot prove the absence of visitation and, therefore, a value of 0 PUD does not reflect a true and tested absence of visitation (
The final model considered was the InVEST Visitation, Recreation and Tourism v.3.10 ecosystem service model (
All 38 validation sites returned PUD counts that were plotted against official visitor numbers reported by Fáilte Ireland (Fig.
A total of 25,170 geotagged photographic records were retrieved (Suppl. material
The non-spatial, logistic GLM model of PUD presence contained 11 environmental variables (Table
Results of global regression models. PUD refers to the photo-user-days metric. SAM denotes a spatially autocorrelated mixed-model using a random spatial effect. Significance levels (Sig) are denoted as *** p < 0.001, ** p < 0.01, * p < 0.05.
Models: (a) Global logistic non-spatial (PUD Presence); (b) Global logistic SAM (PUD Presence); (c) Global Poisson SAM (PUD Count).
Predictor | Coefficient (a) | Standard error (a) | Sig (a) | Coefficient (b) | Standard error (b) | Sig (b) | Coefficient (c) | Standard error (c) | Sig (c) |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | -1.303 | -0.291 | *** | -1.767 | 0.582 | *** | -1.109 | 0.243 | *** |
Elevation | -0.016 | -0.002 | *** | -0.018 | 0.003 | *** | -0.009 | 0.002 | *** |
Slope | 0.278 | -0.042 | *** | 0.309 | 0.072 | *** | 0.129 | 0.025 | *** |
Coast distance | -0.029 | 0.005 | *** | -0.047 | 0.011 | *** | -0.037 | 0.006 | *** |
River length | 0.785 | -0.18 | *** | 0.943 | 0.353 | *** | 0.373 | 0.159 | *** |
Water cover | 0.022 | -0.004 | *** | 0.031 | 0.007 | *** | 0.012 | 0.002 | *** |
Path length | 0.082 | -0.025 | *** | 0.127 | 0.034 | *** | 0.103 | 0.011 | *** |
Major road | 0.674 | -0.138 | *** | 1.131 | 0.189 | *** | 0.767 | 0.08 | *** |
Population | 0.019 | 0.004 | *** | 0.022 | 0.001 | *** | 0 | 0 | |
Recreation site | 0.608 | -0.152 | *** | 0.666 | 0.229 | *** | 0.507 | 0.086 | *** |
Geological heritage | 0.555 | -0.161 | *** | 0.742 | 0.245 | *** | 0.29 | 0.089 | *** |
Habitat diversity | 0.118 | -0.053 | ** | 0.190 | 0.079 | ** | 0.152 | 0.031 | *** |
N | 1642 | | | 1642 | | | 1642 | | |
Log likelihood | -803 | | | -748 | | | -3076 | | |
AIC | 1629 | | | 1518 | | | 6178 | | |
Spatial autocorrelation of residuals detected: | Yes | | | No | | | No | | |
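Coefficients from the logistic models are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below applies this to two coefficients reported for model (a); treating elevation as measured in metres is an assumption of the example.

```python
import math


def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds of PUD presence for a given change in a predictor."""
    return math.exp(coefficient * change)


# Model (a): presence of a major road (binary indicator, coefficient 0.674).
major_road_or = odds_ratio(0.674)

# Model (a): elevation coefficient -0.016, scaled to a 100-unit increase
# (100 m if elevation is recorded in metres, which is assumed here).
elevation_or_per_100 = odds_ratio(-0.016, change=100)
```

Under these assumptions, a cell containing a major road has roughly twice the odds of containing a PUD record, holding the other predictors constant, while a 100-unit rise in elevation reduces the odds to about one fifth.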
ROC plots for global logistic regression (black) and local GWR logistic regression (blue), based on presence of PUD at 2 km2 pixel size.
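ROC curves of this kind are usually summarised by the area under the curve (AUC). As a minimal sketch, the AUC can be computed as the share of presence-absence cell pairs in which the presence cell receives the higher predicted probability; the scores and labels below are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    presences = [s for s, label in zip(scores, labels) if label == 1]
    absences = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > a else 0.5 if p == a else 0.0
        for p in presences
        for a in absences
    )
    return wins / (len(presences) * len(absences))


labels = [1, 1, 0, 0]
perfect = auc([0.9, 0.8, 0.3, 0.1], labels)
partial = auc([0.9, 0.2, 0.8, 0.1], labels)
```

An AUC of 0.5 corresponds to a model that ranks cells no better than chance, and 1.0 to perfect discrimination between cells with and without PUD records.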
A logistic SAM was constructed given the spatial autocorrelation detected in the logistic GLM (Table
Visitor intensity was modelled using a SAM of PUD counts (Table
The global logistic model displayed evidence of spatial autocorrelation that may mask local relationships. GWR was used to investigate this and the resulting coefficients were mapped to show variation in magnitude, direction and significance (Fig.
The MaxEnt model used 100 m rasters of environmental predictors and PUD occurrence coordinates to predict areas of visitation suitability (Fig.
Average suitability for Flickr-derived photo-user-day occurrence (100 replicates) at 100 m resolution generated by the maximum entropy model.
MaxEnt’s jackknife analysis, based on average AUC values, is shown in Fig.
The output of the InVEST recreation model reports an OLS regression of log transformed annual PUD values (retrieved from an archived dataset) on environmental predictors (Table
InVEST regression model output (ordinary least squares regression and log annual PUD response). Significance levels (Sig) are denoted as *** p < 0.001, ** p < 0.01, * p < 0.05.
Predictor | Coefficient | Standard error | t value | Sig |
---|---|---|---|---|
Intercept | 0.112 | 0.108 | 1.04 | |
Elevation | -0.00216 | 0.000340 | -6.35 | *** |
Slope | 0.0450 | 0.00599 | 7.51 | *** |
Population | 0.00115 | 0.000260 | 4.41 | *** |
Habitat diversity | 0.141 | 0.0337 | 4.17 | *** |
Agriculture | -0.00462 | 0.000958 | -4.83 | *** |
Forest and Natural Area | -0.00538 | 0.00109 | -4.92 | *** |
Wetlands | -0.00331 | 0.000932 | -3.55 | *** |
Coast distance | -0.00000356 | 0.000001 | -3.21 | ** |
Town distance | 0.00000871 | 0.000001 | 8.72 | *** |
Path Length | 0.0000516 | 0.000004 | 13.17 | *** |
Recreation distance | -0.0000158 | 0.000004 | -4.09 | *** |
Degrees of freedom | 1586 | | | |
Adjusted R2 | 0.487 | | | |
This study investigated potential in-situ cultural ecosystem service flows across a previously untested context using data from social media as a visitation metric and a spectrum of spatial models. The following discussion is split into three major themes: (1) PUD as a proxy for potential in-situ cultural ecosystem service flows in County Galway, (2) spatial model selection and use and (3) general comments about the use of social media data.
This is the first study to use a social media-derived PUD indicator in the Irish context and the positive correlation between PUD and official visitor counts is consistent with other validation studies (
A core set of environmental attributes representing a blend of natural, human and built capital was identified as correlated with PUD across all models: coastal distance, presence of major roads, population density, habitat diversity, elevation and presence of recreational sites. The finding that a blend of capital stocks was correlated with PUD mirrors results in other contexts (
The global, non-spatial logistic model was found to display spatial autocorrelation (violating the assumption of independence) and a higher AIC value compared to both spatial model alternatives (SAM and GWR). Therefore, models that account for the spatial nature of geo-tagged social media data should be used in such cases. Spatially varying local relationships were found using GWR. This is consistent with studies that found evidence of local relationships when modelling cultural ecosystem service flows (
The MaxEnt model was the only method used for prediction due to its presence-only approach (as opposed to testing for correlations in GWR and SAM models). Jackknife analysis showed that elevation, coastal distance, recreational sites and town distance were the variables of greatest predictive influence in the model. Other variables, such as river length, water cover and geological heritage, contributed little to model prediction despite showing significant correlations in regression analysis. This result can support the prioritisation of data collection when designing management interventions, as some variables appear to be more informative than others. Results in this case-study suggest that rural areas close to the coast, of moderate elevation and with a major road should be prioritised for targeted management interventions. These areas have the potential to experience high visitor volumes through in-situ cultural ecosystem service supply and the associated anthropogenic disturbance could contribute to negative ecological consequences and compromise long-term service flows.
The InVEST model presented some differences compared to the other models, such as the inclusion of land-cover variables and different significance levels for water cover and geological heritage. This is not unexpected given that InVEST is premised on a different response variable dataset (archived Flickr database), but it does provide a comparator to triangulate with other methods. Overall, the variables of greatest significance in user-led regression techniques (coast distance, elevation, recreational sites, major roads) were also identified as significant using InVEST, with less intensive processing of data required. Stepwise model optimisation and diagnostics are not provided by the default InVEST tool, so any changes must be led by the user manually inspecting model outputs, making desired changes and re-running the model in its entirety, which can be time-consuming. These characteristics may be limitations to the InVEST model depending on the research context and similar remarks have been stated in literature (
This is the first social media-based spatial modelling study in the Irish context and so comparison is limited. Previous research used a stated-preference methodology to elicit aesthetic preferences of rural landscapes in Ireland, based on a nationally-representative survey of 430 individuals (
The results demonstrate the applicability of social media to developing spatial models of visitation. The differences in model outputs identified have the potential to create differing interpretations when a single approach is deployed in isolation. Therefore, we find merit in investigating multiple modelling tools—even if, ultimately, one is favoured for final reporting. While this study presented four workflows for clarity and comprehension, there is flexibility to customise these approaches to user needs beyond what is presented here. Some observations regarding data availability, data processing, expertise, research question and spatial extent required are presented below to provide information for future cultural ecosystem service assessment using these methodologies.
Data from social media are widely applicable, but their geographic coverage is unequal. The InVEST model is not suited for use in data-scarce regions as it recommends at least 50% of cells contain PUD records (
Model resolution should also consider the size, the heterogeneity of environmental attributes and the suspected behaviour of the sampled population of the area of interest. For large areas with suspected variation in local socio-environmental context, the presence of local relationships may be hypothesised from study initiation (
The available environmental data also determine grid size and mapping outputs for all four model workflows discussed. Variable selection should be grounded in the hypothesis for the phenomenon of interest and data may be gathered from available regional or national data sources, on-the-ground sampling or remote sensing. Where appropriate data are available, indicator calculation is limited only by the users’ expertise with data processing with a range of basic spatial statistics tools available in most GIS software, for example, proximity, presence, count, density, mean, min, max values. These indicators should be designed so that they vary across the landscape in a meaningful way, based on the cell resolution selected and this necessitates an interplay between indicator design and model resolution. For example, the use of “river presence” was not appropriate for modelling Co. Galway because the landscape contains many river features and, at the 2 km2 scale, almost all cells contained an identical value that was not meaningful. Instead, river length was selected as the indicator. When using GWR, this should also hold true at the bandwidth scale to ensure sufficient variation around each regression point, in addition to local multicollinearity checks (
The balance between resolution and indicator selection was also apparent when creating raster files for use in MaxEnt. Input data to MaxEnt must be rasters of identical extent and resolution, matching the desired output resolution. This required adjusting indicator file formats (indicator values themselves were not recalculated in this process). A second adjustment was made to change some indicators that were informative at the 2 km2 resolution, but less informative at the 100 m resolution. For example, at the larger cell size, binary presence indicators (presence of geological heritage, presence of recreational sites) captured the effect of a feature throughout the surrounding area, whereas, when using smaller cell sizes, this landscape effect was lost. Instead, a proximity or density-based indicator may be more appropriate at fine resolution where the impact of a feature is hypothesised to extend beyond the cell size. Careful consideration is required to understand these artefacts of model specification, and a back-and-forth approach may be needed to identify the most appropriate indicators for a given resolution. These considerations are significantly constrained when using the InVEST model, which offers the user only seven pre-set options for calculating environmental indicators (two raster, five vector data types) and so limits customisation (
Beyond statistical details, there are practical considerations that may determine the analysis of data from social media in ecosystem service assessments, such as the availability of computational resources, time and expertise. The InVEST recreation model is designed to be accessible to any user who may be unfamiliar with advanced coding, statistics or GIS by providing a self-contained, comprehensive interface and interpretable outputs (
The MaxEnt model is supported through a self-contained programme (and compatible R package) with introductory online resources available (
The use of user-defined global regression techniques, for example, GLMs and SAMs, is the most customisable approach detailed in this study. The user should follow the usual statistical checks (outliers, dispersion, heteroskedasticity), as well as spatial autocorrelation checks. The user must prepare a data table containing the response variable and environmental indicators for each cell in the landscape, which often requires data wrangling, cleaning and manipulation using spatial statistics. Highly-advanced model structures may be theoretically possible, but inaccessible due to the expertise required to define, run and interpret them. In some cases, the desired model may not yet be computationally possible. For example, multiscale GWR is currently limited to Gaussian distributions, while GWR using Poisson and binomial model families can only use fixed bandwidths. GWR can be computed with several tools, including a number of R packages, Python and the stand-alone MGWR tool, although their respective optimising algorithms vary, leading to different outputs. GWR model runs can be computationally intensive and take long periods of time depending on the size of the dataset, optimisation criteria and use of Monte Carlo simulation (
These considerations (summarised in Table
General findings from deploying four spatial model types to the same area of interest for providing information for model selection.
| Global regression (1) | Local regression (2) | Maximum entropy (3) | InVEST recreation (4) |
---|---|---|---|---|
Area of interest and resolution | · Determined by landscape heterogeneity and phenomenon of interest · Data coverage determines resolution | · Variation at local scale required · Deployed where spatially varying relationships hypothesised · Data coverage determines resolution | · Useful in data-scarce regions with low observations · Permits fine resolution | · Resolution should ensure 50% cells contain > 0 annual PUD · Limited use in data-scarce regions |
Social media indicator | · User-determined (automated or manual collection) | · User-determined (automated or manual collection) | · User-determined (automated or manual collection) | · Flickr only 2005-2017 · Stable database |
Response variable | · Pre-processing and filtering possible · GPS inaccuracy can be buffered · Response variable user defined, such as occurrence, rate or count variables | · Pre-processing and filtering possible · GPS inaccuracy can be buffered · Response variable user defined, such as occurrence, rate or count variables | · Pre-processing and filtering possible · Point data required (GPS coordinates of occurrence) | · PUD variable fixed and calculated automatically · GPS assumed accurate · Pre-processing and filtering not possible |
Environmental predictor variables | · User-driven indicator calculation · May require spatial statistics and GIS · Standard procedure for variable inspection (outliers, collinearity, skewness) | · User-driven indicator calculation · May require spatial statistics and GIS · Variable inspection at global and local scale (outliers, collinearity, skewness) · Proximity variables may be unsuitable | · User-driven indicator calculation as raster files of identical extent · May require GIS to prepare · Limited variable inspection and diagnostics | · User supplied spatial data files · Indicators calculated by pre-set spatial statistics within model run · Output returns variables calculated · Edits require re-running entire model |
Computation | · Standard regression tools in any statistical software, for example, R · Model assumptions require manual checks (dispersion, outliers, normality, spatial autocorrelation) · If spatial autocorrelation detected, consider SAM or GWR models | · Specialised tools available, for example, MGWR software · Emerging packages (R and Python) · Bandwidth and kernel set by user · Prior to run, standard checks required (dispersion, outliers, normality) · Local model checks also required, for example, local VIF, Cook’s distance | · Standalone model software, open-access and freely available · Compatible R packages also available · Default output includes model gain, AUC and variable response curves · Optional jackknife analysis included · Optional replication and data partition | · Standalone model software, open-access and freely available · Some additional tools in Python API · Automatically runs OLS regression · No available option for model inspection beyond default regression table output |
Accessibility and transferability | · Basic knowledge of statistics and software of choice, for example, R · Flexible and customisable model structure that can be altered by the user with relative ease along a spectrum of complexity · Available to any user familiar with basic regression techniques | · Advanced spatial modelling approach with some customisation (bandwidth, optimiser criteria, GLM family) · Requires some specialist knowledge for preparation and interpretation · Advanced model runs may take time, especially Monte Carlo simulation · Evolving field of research with some limitations, for example, multiscale bandwidth models | · Available online supports · Software and machine learning permits complex modelling that otherwise may be unavailable · Advanced model runs may take time, especially when replication is used · Interpretation requires expertise with statistics · Optional scenario modelling included | · No coding knowledge required · User-friendly, stand-alone interface and outputs · Dedicated online support tools and tutorials available · Optional scenario modelling · Uncertain future as archive becomes outdated |
Summary | Widely applicable and customisable, but requires consideration when selecting resolution, scale, indicators and response variable calculation. May be limited depending on data availability and presence of local relationships | Useful for investigating spatially varying relationships at the landscape scale, but should only be used when justified. Requires knowledge of spatial statistics and mapping tools. Some limitations due to evolving research field and development of new statistical approaches | Applicable at high resolution and in data-scarce regions to predict areas of high suitability, but requires exact GPS coordinates and some degree of advanced statistical interpretation, computational resources and preparation of raster files | Useful for users to detect and visualise general trends with limited resources and expertise in regions that are sufficiently data-rich, where the use of 2005-2017 Flickr data is deemed sufficient |
The focus of this study was the application of social media-derived data in different spatial models, not social media-derived data itself. The caveats of social media as a data source have been well-described in literature, but its use continues to increase, especially in studies related to cultural ecosystem services (
Potential inaccuracy in the geo-tagged coordinates was accounted for by using a 200 m buffer when calculating the PUD variable, given the suspected technological accuracy in northern Europe (
Social media user-groups are a self-selecting, unrepresentative sample of the general population and so data collected carries with it the bias of the userbase (
The social media content used in this study was not screened or filtered beyond the removal of areas outside the area of interest. The volume of data collected (25,000 data points contributed by 1,866 users) and the widespread distribution of those points across the landscape suggest that the sampled data contain a diversity of information and were successfully applied as per other similar studies exploring visitation (
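As a minimal sketch of how a PUD count with a positional buffer could be computed, the Python below counts each unique combination of user and calendar day once per grid cell, and lets a photo contribute to every cell lying within the buffer distance of its coordinates. The function names, the tuple layout of the input and the rule for applying the 200 m buffer are illustrative assumptions, not the procedure used in this study; coordinates are assumed to be in a projected system with units of metres.

```python
import math
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (user_id, day, x_m, y_m) in projected metres.
Photo = Tuple[str, date, float, float]
Cell = Tuple[int, int]


def cells_within_buffer(x: float, y: float, cell_size_m: float,
                        buffer_m: float) -> Iterator[Cell]:
    """Yield grid cells whose square footprint lies within buffer_m of (x, y)."""
    i_min = math.floor((x - buffer_m) / cell_size_m)
    i_max = math.floor((x + buffer_m) / cell_size_m)
    j_min = math.floor((y - buffer_m) / cell_size_m)
    j_max = math.floor((y + buffer_m) / cell_size_m)
    for i in range(i_min, i_max + 1):
        for j in range(j_min, j_max + 1):
            # Nearest point of the cell rectangle to the photo location.
            nx = min(max(x, i * cell_size_m), (i + 1) * cell_size_m)
            ny = min(max(y, j * cell_size_m), (j + 1) * cell_size_m)
            if math.hypot(x - nx, y - ny) <= buffer_m:
                yield (i, j)


def photo_user_days(photos: Iterable[Photo], cell_size_m: float = 2000.0,
                    buffer_m: float = 200.0) -> Dict[Cell, int]:
    """Count unique (user, day) pairs per grid cell (assumed buffer rule)."""
    seen = defaultdict(set)
    for user_id, day, x, y in photos:
        for cell in cells_within_buffer(x, y, cell_size_m, buffer_m):
            # A set keeps repeat photos by one user on one day from inflating the count.
            seen[cell].add((user_id, day))
    return {cell: len(pairs) for cell, pairs in seen.items()}
```

Setting `buffer_m=0` reduces this to a plain point-in-cell count, which makes it easy to check how sensitive the resulting PUD surface is to the assumed positional error.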
Spatial models applied to data from social media revealed a blend of environmental characteristics related to visitation and potential in-situ cultural ecosystem service flows across County Galway, Ireland. These characteristics included coastal distance, elevation, major roads, recreational sites, urban distance and habitat diversity. All models are imperfect, but by discussing the workflow for each approach, we have articulated where and why different models may be useful. We hope that this exercise, which focused on applying spatial models to social media data to investigate cultural ecosystem services, can serve as a useful signpost for researchers and practitioners involved in ecosystem service assessments and natural capital accounting. Model selection in such exercises should consider the context of the area of interest, computational demands, data availability and structure, and research scope. Furthermore, for transparency and clarity, we encourage all researchers and practitioners to explain and justify their choice of model and variables in detail. The results presented are especially pertinent given the growing volume of available data from social media and the need for spatially-explicit models for natural capital accounting and ecosystem service assessments.
Thanks to L. Coleman for GIS data preparation advice and discussion and C. Farrell for cultural ecosystem service discussions prior to project commencement.
All of the authors were funded by Science Foundation Ireland (SFI) under Grant Number 16/RC/3889 for BiOrbic Bioeconomy, SFI Research Centre, which is co-funded under the European Regional Development Fund and by BiOrbic industry partners.
Trinity College Dublin, the University of Dublin
The authors have declared that no competing interests exist.
List of 38 sites across Ireland with official visitor statistics used to validate the PUD social media indicator.
Table showing InVEST parameters used for environmental indicator calculation within the model run. For full details of InVEST model configuration, see Sharp et al. (2018).
GPS location of 6,672 PUD occurrences at the 2 km grid cell size, based on data retrieved from Flickr API query.
Moran's I correlograms used to check for spatial autocorrelation of residuals of three models used: non-spatial logistic GLM, SAMs and GWR.
MaxEnt output files showing variable response and ROC plot, based on 100 bootstrap replicates using a 75:25 training:test partition.
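For readers who wish to reproduce a residual autocorrelation check of the kind summarised in the Moran's I correlograms, the following Python is a minimal sketch under stated assumptions: binary weights (1 for pairs of locations whose separation falls within a distance band, 0 otherwise), no row standardisation and no significance testing. The function names and the weighting scheme are illustrative choices and do not describe the software or settings used in this study.

```python
import math
from typing import List, Sequence, Tuple


def morans_i(values: Sequence[float], coords: Sequence[Tuple[float, float]],
             d_min: float, d_max: float) -> float:
    """Moran's I with binary weights for pairs separated by (d_min, d_max]."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    cross = 0.0                             # sum of w_ij * z_i * z_j
    s0 = 0.0                                # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if d_min < d <= d_max:
                s0 += 1.0
                cross += z[i] * z[j]
    denom = sum(v * v for v in z)
    if s0 == 0.0 or denom == 0.0:
        return float("nan")                 # undefined: no pairs or no variance
    return (n / s0) * cross / denom


def correlogram(values: Sequence[float], coords: Sequence[Tuple[float, float]],
                band_edges: Sequence[float]) -> List[float]:
    """Moran's I for each consecutive pair of distance-band edges."""
    return [morans_i(values, coords, lo, hi)
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]
```

Passing model residuals as `values` and grid-cell centroids as `coords` gives one Moran's I value per distance band; values near zero across bands would be consistent with little remaining spatial autocorrelation, although a formal assessment would also require a significance test.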