Adoption of Machine Learning Techniques in Ecology and Earth Science

Background. The natural sciences, such as ecology and earth science, study complex interactions between biotic and abiotic systems in order to understand and make predictions. Machine-learning-based methods have an advantage over traditional statistical methods in studying these systems because the former do not impose unrealistic assumptions (such as linearity), are capable of inferring missing data, and can reduce long-term expert annotation burden. Thus, a wider adoption of machine learning methods in ecology and earth science has the potential to greatly accelerate the pace and quality of science. Despite these advantages, the full potential of machine learning techniques in ecology and earth science has not been fully realized.


Introduction
Machine Learning (ML) is a discipline of computer science that develops dynamic algorithms capable of data-driven decisions, in contrast to models that follow static programming instructions. The very first mention of 'machine learning' in the literature occurred in 1930 and use of the term has been growing steadily since 1980 (Fig. 1). While discussion of ML is likely to recall scenes from popular science-fiction books and movies, there are many practical applications of ML in a wide variety of disciplines from medicine to finance. Part of what makes ML so broadly applicable is the diversity of ML algorithms capable of performing very well under messy, real-world conditions. Despite, and perhaps because of, this versatility, uptake of ML applications has lagged behind traditional statistical techniques in the natural sciences.
The advantage of ML over traditional statistical techniques, especially in earth science and ecology, is the ability to model highly dimensional and non-linear data with complex interactions and missing values (De'ath and Fabricius 2000, Recknagel 2001, Olden et al. 2008, Haupt et al. 2009b, Knudby et al. 2010a). Ecological data specifically are known to be non-linear and highly dimensional with intense interaction effects; yet, methods that assume linearity and are unable to cope with interaction effects are still being used (Olden et al. 2008, Knudby et al. 2010a) with some modification of the data to try to make the methods work (Knudby et al. 2010a, Pasini 2009, Džeroski 2001). Several comparative studies have already shown that ML techniques can outperform traditional statistical approaches in a wide variety of problems in earth science and ecology (Lek et al. 1996b, Levine et al. 1996, Manel et al. 1999, Segurado and Araújo 2004, Elith et al. 2006, Lawler et al. 2006, Prasad et al. 2006, Cutler et al. 2007, Olden et al. 2008, Zhao et al. 2011, Bhattacharya 2013); however, directly comparing the results of statistical techniques to ML techniques can be difficult and requires careful consideration (Fielding 2007).
The exact division between ML methods and traditional statistical techniques is not always clear and ML methods are not always better than traditional statistics. For example, a system may not be linear, but a linear approximation of that system may still yield the best predictor. The exact method(s) must be chosen based on the problem at hand. A meta approach that considers the results of multiple algorithms may be best. This manuscript will discuss four types of ML tasks and seven important limitations of ML methods. These tasks and limitations will be related to six types of ML techniques and their relative strengths and weaknesses in ecology and earth science will be discussed. Specific applications of ML in ecology and earth science will be briefly reviewed with the reasons ML methods are underutilized in natural sciences. Potential solutions will be proposed.

Background
The basic premise of ML is that a machine (i.e., algorithm or model) is able to make new predictions based on data. The basic technique behind all ML methods is an iterative combination of statistics and error minimization or reward maximization, applied and combined in varying degrees. Many ML algorithms iteratively check all or a very high number of possible outcomes to find the best result, with "best" defined by the user for the problem at hand. The potentially high number of iterations prohibits manual calculation and is a large part of why these methods are only now widely available to individual researchers.
Computing power has increased such that ML methods can be implemented with a desktop or even a laptop. Before the current availability of computing power, ecologists and earth scientists had to settle for statistical methods that assumed linearity (Knudby et al. 2010a) and limited, controlled experiments (Fielding 1999b). Both of these restrictions limit the scale of studies and accuracy of results. A similar acceleration has been observed for numerical modeling of natural systems, where model predictions have improved because increased computing power has allowed for the inclusion of more parameters and, more importantly, finer granularity (see Semtner 1995, Forget et al. 2015 for examples in oceanography).

Use of the phrase 'machine learning' in the Google Books Ngram Viewer:
This plot shows the use of the phrase 'machine learning' by decade as a percentage of total words in the Google English Corpus. http://books.google.com/ngrams
The first step in applying ML is teaching the algorithm using a training data set. The training data set is a collection of independent variables with the corresponding dependent variables. The machine uses the training data to "learn" how the independent variables (input) relate to the dependent variable (output). Later, when the algorithm is applied to new input data, it can apply that relationship and return a prediction. After the algorithm is trained, it needs to be tested to get a measure of how well it can make predictions from new data. This requires another data set with independent and dependent variables, but the dependent variables (target) are not provided to the learner. The algorithm predictions (output) are compared to the withheld data (target) to determine the quality of the predictions and thus the utility of the algorithm. This comparison is an important difference between ML and traditional statistical techniques that use p values for validation.

A Note on Terms: The interdisciplinary nature of ML and its application has resulted in a confusing collection of terms for similar concepts. Below are groups of functional synonyms describing the major concepts discussed in this manuscript.

• Observation, instance, data point: These terms are used to describe the data instances that can be thought of as rows in a spreadsheet.

• Explanatory variables, features, input, independent variables, x, regressors: These terms are used to describe the independent variables/input data that are used to make predictions.

• Outcomes, dependent variables, y, classes, output: These terms are used to describe the dependent variables/output that are the results of the algorithm or part of the training/test set. The outcomes in the test set that the algorithm is trying to predict are referred to as the "target".

• Outlier, novelty, deviation, exception, rare event, anomaly: These terms are used to describe data instances that are not well represented in the data set. They can be errors or true outliers.
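The training and testing workflow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the observations are invented, the "learner" is a deliberately trivial one-feature threshold rule, and the names `train_test_split`, `fit_threshold`, `predict`, and `accuracy` are hypothetical helpers defined for this example rather than functions from any particular library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the observations and withhold a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """'Learn' a cut-off on x: the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def predict(threshold, x):
    """Apply the learned relationship to a new input."""
    return 1 if x > threshold else 0

def accuracy(threshold, rows):
    """Compare predictions (output) with the withheld labels (target)."""
    hits = sum(predict(threshold, x) == y for x, y in rows)
    return hits / len(rows)

# Invented observations: (independent variable x, dependent variable y).
rows = [(1.0 + 0.1 * i, 0) for i in range(20)] + \
       [(5.0 + 0.1 * i, 1) for i in range(20)]

train, test = train_test_split(rows)
threshold = fit_threshold(train)           # training uses x and y
test_accuracy = accuracy(threshold, test)  # testing withholds y from the learner
print(f"threshold={threshold:.2f}, accuracy on withheld data={test_accuracy:.2f}")
```

The quality of the learner is judged by `test_accuracy`, which is computed on observations the learner never saw, rather than by a p value.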

Machine Learning Tasks
There are four different types of tasks that will be discussed in the context of available ML techniques and natural science problems. Most ML techniques can be used to perform multiple tasks and several tasks can be used in combination to address the same problem; therefore, it can be difficult to draw firm boundaries around categories of tasks and techniques. Many of the natural science problems discussed in the latter half of this paper have been addressed using all of the tasks and techniques discussed. The list below is not meant to be comprehensive. Only the tasks most relevant to the natural science applications are discussed here. Each ML technique mentioned here is more thoroughly discussed in its own section.
Task 1) Function Approximation. In this task, the machine infers a function from (x, y) data points, which are real numbers (Bishop 2006, Alpaydin 2014).
Task 2) Classification. Classification tasks in the natural sciences include the automated identification of species using recorded echolocation calls (Armitage and Ober 2010) and monitoring river water quality (Walley and Džeroski 1996), which have been performed using a Random Forest and a Bayesian Classifier, respectively. A linear discriminant analysis is an example of a traditional statistical method that can perform a classification task (Sokal and Rohlf 2011).
Task 3) Clustering. This task is similar to classification, but the machine is not given training data to learn what the classes are (Jain 2010). It is expected to infer the classes from the data. This task clusters data into groups such that objects in the same group are more similar than objects between groups. Each cluster is then an inferred class. Clusters have a situation-specific definition, thus there are several different clustering strategies available (e.g., hierarchical clustering, centroid clustering, etc.). This task is often used for data exploration and knowledge discovery before another ML technique is applied. The Support Vector Machine and Artificial Neural Network (in the form of a Self Organizing Map) are two types of ML techniques that can perform a clustering task (Du 2010, Ben-Hur et al. 2001).
Clustering can be used in the natural sciences to detect rare events (Omar et al. 2013), such as identifying a bird call in hours of streaming remote sensor data (Kasten et al. 2010).
Task 4) Rule Induction. This task extracts a set of formal rules from a set of observations, which can be used to make predictions about new data. Fuzzy Inference and some Tree-based ML techniques use rule induction to make predictions. Genetic Algorithms can be used to infer rules (Fidelis et al. 2000, Sastry et al. 2013). Rule induction is a three-step process of 1) feature construction, where the features are turned into binary features, 2) rule construction, where features are searched for the combination that is most predictive for a class, and 3) hypothesis construction, where sets of individual rules are combined (Fürnkranz et al. 2012a). Rule induction has been used in the natural sciences to predict microbial biomass and enzyme activity in soil (Tscherko et al. 2007) and develop biodegradation models for industrial chemicals (Gamberger et al. 1996). A good example of overlap between categories of tasks and techniques is the decision tree, which uses rule induction to perform a classification task.
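To make the clustering task (Task 3) concrete, the sketch below implements k-means, one common centroid-based strategy. It is offered as an illustration only: k-means is not singled out in the text above, the six two-dimensional points are invented, and the caller supplies the starting centroids (an assumption made here so that the example is deterministic).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its members, until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:      # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:               # leave an empty cluster's centroid in place
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Invented observations with no class labels supplied.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)     # inferred classes, one per observation
print(centroids)
```

No target values are given to the algorithm; the two groups are inferred solely from the distances between observations.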

Machine Learning Limitations
As with any technique, a working knowledge of the limitations of ML is necessary for proper application (Domingos 2012). Some limitations result from user misconceptions and some result from not recognizing problematic data. There are three major categories of mistakes that result from misconceptions of ML practitioners.
1) Demanding Perfection. Algorithms that perfectly model training data are not very useful. This is due to overfitting, which happens when an algorithm is so good at modeling the training data that it does not perform well in the "real world" (Fig. 2, Hawkins 2004). The prediction error given by the training data might be low, but the prediction error given by the test data, called the generalization error, is the measure of how well the algorithm will do in a real-world application. When a model is overfit, the prediction error is much lower on the training data than the test data. In general, as performance on the training data increases, performance on the test data will increase only to a point before decreasing (Fig. 3).
2) Favoring Complexity Over Simplicity. It is important for a problem to be addressed with just the right amount of complexity and this varies according to the nature of the problem and the data (e.g., Merow et al. 2014). A more complex algorithm will not necessarily outperform a simpler algorithm (Olden et al. 2008, Domingos 2012). This misconception is related to overfitting because one way to improve model performance is to make it more complex, but that results in an ungeneralizable model. In many circumstances, more data with a simpler algorithm is better than a more complex algorithm (Domingos 2012). An iterative approach is often best, transitioning from simple to complex techniques and comparing the results.
3) Including As Many Features As Possible. It can be difficult to know a priori which features are important predictors in a given problem, but including a large number of features in a model (i.e., a shotgun approach), especially features that are not relevant, can make a model a poor predictor (Keogh and Mueen 2011). This is because as the number of features increases, learners need a rapidly increasing amount of training data to become familiar with all combinations. As a result, algorithms with fewer features are often better predictors than algorithms using many features. This is often referred to as the Curse of Dimensionality.
2) Too many categories. This problem is related to the Curse of Dimensionality and it occurs when a category has a high number of distinct values (i.e., high cardinality, Moeyersoms and Martens 2015). High-cardinality categories (such as zip code or bank account number) can be very informative, but can also increase the number of dimensions and thus decrease performance. One way to cope with a high-cardinality category is to combine levels using domain knowledge. For example, in the category "Taxon", instead of having a value for every species, have a single value for every Genus or higher rank.
Another way to address high cardinality is through data preprocessing and transformations that reduce the number of levels in the category (Micci-Barreca 2001, Moeyersoms and Martens 2015).
3) Missing data. Different types of learners and problems have different levels of tolerance for missing data during training, testing, and prediction (see discussion of ML techniques below). There are several methods for applying ML techniques to data with missing values (Saar-Tsechansky and Provost 2007, Gantayat et al. 2014). Techniques for coping with missing data include imputation, removal of the instance, or segmentation of the model (Saar-Tsechansky and Provost 2007, Gantayat et al. 2014, Jerez et al. 2010). The segmentation approach involves removing the features that correspond to the missing data (Gantayat et al. 2014). In some cases it may be worthwhile to acquire the missing data through more testing, sampling, or experimentation.
4) Outliers. If the observations are real and not the result of human error, outliers can be an important source of insight. They only become a problem when they go unnoticed and models are applied to a data set as though outliers are not present. There are many methods for outlier detection that are recommended as a preprocessing step before ML (Escalante 2005).
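The overfitting behaviour described under "Demanding Perfection" can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not an analysis from the literature cited here: it uses a k-nearest-neighbour classifier (a learner not otherwise discussed in this manuscript) because setting k = 1 makes it memorize the training data exactly, and the one-dimensional data are invented, with two training labels deliberately flipped (x = 2 and x = 9) to mimic noise.

```python
def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, rows, k):
    """Fraction of rows whose predicted label differs from the recorded label."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in rows)
    return wrong / len(rows)

# Invented data: class 0 lies at low x, class 1 at high x.
# The training labels at x = 2 and x = 9 are flipped on purpose (noise).
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (7, 1), (8, 1), (9, 0), (10, 1), (11, 1)]
test = [(0.4, 0), (1.4, 0), (1.8, 0), (2.3, 0), (3.6, 0),
        (7.4, 1), (8.6, 1), (9.2, 1), (10.4, 1), (10.8, 1)]

results = {}
for k in (1, 3):
    results[k] = (error_rate(train, train, k), error_rate(train, test, k))
    print(f"k={k}: training error={results[k][0]}, test error={results[k][1]}")
# k=1: training error=0.0, test error=0.4
# k=3: training error=0.2, test error=0.0
```

The k = 1 learner models the training data perfectly, noise included, and pays for it with a high generalization error; the less flexible k = 3 learner makes mistakes on the training data but predicts the withheld data better.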

Tree-based Methods
Tree-based ML methods include decision trees, classification trees, and regression trees (Olden et al. 2008, Hsieh 2009, Kampichler et al. 2010). For these methods, a tree is built by iteratively splitting the data set based on a rule that results in the divided groups being more homogeneous than the group before (Fig. 4). The rules used to split the tree are identified by an exhaustive search algorithm and give insight into the workings of the modeled system. A single decision tree can give vastly different results depending on the training data and typically has low predictive power (Iverson et al. 2004, Olden et al. 2008, Breiman 2001a). Thus, several ensemble-tree methods have been developed to improve predictive power by combining the results of multiple trees, including boosted trees and bagged trees (Breiman 1996, De'ath 2007). A boosted tree results from a pool of trees created by iteratively fitting new trees to minimize the residual errors of the existing pool (De'ath 2007). The final boosted tree is a linear combination of all the trees (Elith et al. 2008). Bagging is a method that builds multiple trees on subsamples of the training data (bootstrap with replacement) and then averages the predictions from each tree to get the bagged predictions (Breiman 1996, Knudby et al. 2010a).
Random Forest is a relatively new tree-based method that fits a user-selected number of trees to a data set and then combines the predictions from all trees (Breiman 2001a). The Random Forest algorithm creates a tree for a subsample of the data set. At every decision, only a randomly selected subset of variables is used for the partitioning. The predicted class of an observation in the final tree is calculated by majority vote of the predictions for that observation in all trees, with ties split randomly.
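The ingredients named above (bootstrap subsamples, a random subset of variables at each split, an exhaustive search for the splitting rule, and a majority vote with random tie-breaking) can be sketched as follows. This is a strongly simplified illustration and not Breiman's algorithm: each "tree" here is a single split (a stump), the random variable subset has size one, the data are invented, and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, feature):
    """Exhaustively search thresholds on one feature for the split with the
    fewest misclassified rows. Rows are (features, label) pairs."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def majority_vote(votes, rng):
    """Most frequent label; ties are split randomly."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return rng.choice(winners)

def fit_forest(rows, n_trees, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap with replacement
        feature = rng.randrange(n_features)        # random variable subset (size 1)
        forest.append(fit_stump(sample, feature))
    return forest

def predict_forest(forest, x, seed=0):
    rng = random.Random(seed)
    return majority_vote([predict_stump(s, x) for s in forest], rng)

# Invented data: feature 0 separates the classes, feature 1 is noise.
rows = [((1.0, 5.0), 0), ((2.0, 4.0), 0), ((1.5, 6.0), 0),
        ((6.0, 5.5), 1), ((7.0, 4.5), 1), ((6.5, 5.0), 1)]
forest = fit_forest(rows, n_trees=15)
print(fit_stump(rows, 0))                 # a single split on the full data
print(predict_forest(forest, (1.2, 5.0)))
```

Using an odd number of trees for a two-class problem, as here, means the random tie-break is never needed; it is included only because the text above calls for it.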
Ensemble tree-based methods, especially Random Forest, have been demonstrated to outperform traditional statistical methods and other ML methods in earth science and ecology applications (Cutler et al. 2007, Kampichler et al. 2010, Knudby et al. 2010a). They can cope with small sample sizes, mixed data types, and missing data (Cutler et al. 2007, Olden et al. 2008). The single-tree methods are fast to calculate and the results are easy to interpret (Kampichler et al. 2010), but they are susceptible to overfitting (Olden et al. 2008) and frequently require "pruning" of terminal nodes that do not give enough additional accuracy to justify the increased complexity (Breiman et al. 1984, Garzón et al. 2006, Cutler et al. 2007, Olden et al. 2008, Džeroski 2009). The ensemble-tree methods can be computationally expensive (Cutler et al. 2007, Olden et al. 2008, Džeroski 2009), but resist overfitting (Breiman 2001a). Random Forest algorithms can provide measures of relative variable importance and data point similarity that can be useful in other analyses (Cutler et al. 2007), but can be clouded by correlations between independent variables (Olden et al. 2008). Implementing Random Forest is relatively straightforward. Only a few, easy-to-understand parameters need to be provided by the user (Kampichler et al. 2010), but the final Random Forest does not have a simple representation that characterizes the whole function (Cutler et al. 2007). Tree methods also do not give probabilities for results, which means that data are classified into categories, but the probability that the classification is correct is not given.
For a more detailed description of tree-based methods see Breiman et al. 1984, Breiman 2001b, Loh 2014, and chapter 8 in James et al. 2013.

Artificial Neural Networks
An Artificial Neural Network (ANN) is a ML approach inspired by the way neurological systems process information (Recknagel 2001, Olden et al. 2008, Boddy and Morris 1999, Hsieh 2009). There are many types of ANNs, but only a few are typically used in earth science and ecology, such as the multi-layer, feed-forward neural network, which will be the focus of this section (Pineda 1987, Kohonen 1989, Chon et al. 1996, Recknagel 2001, Lek and Guégan 2000). A multi-layer ANN has three parts: 1) the input layer, which receives the independent variables, 2) the output layer, where the results are found, and 3) the hidden layer, where the processing occurs (Fig. 5). Each layer is made up of several units (neurons). Each unit is connected to all the other units in the neighboring layer, but not the units in the same layer or in non-adjacent layers. Feed-forward ANNs allow data to flow in one direction, from input to output only. The number of units in the hidden layer can be changed by the user to optimize the trade-off between bias and variance (Camargo and Yoneyama 2001, Kon and Plaskota 2000). Too many units in this layer can lead to overfitting. Each connection between units has a weight. Training the ANN involves an iterative search for an optimal set of connection weights that produces an output with a small error relative to the target. After every iteration, the weights are adjusted to bring the output closer to the target using a back-propagation algorithm. Bayesian methods and Genetic Algorithms can also be used to find the optimal connection weights (Bishop 2006, Kotsiantis et al. 2006, Siddique and Tokhi 2001, Yen and Lu 2000). Performance can be sensitive to initial connection weights, which are typically chosen randomly in the beginning, and the number of hidden units, so multiple networks should be processed while varying these parameters (Olden et al. 2008).
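The structure and training loop described above can be sketched with a very small feed-forward network. The following is a minimal illustration under stated assumptions rather than a recommended implementation: the class name `TinyANN`, the sigmoid activation, the squared-error loss, the learning rate, the number of hidden units, and the four training examples are all choices made here for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyANN:
    """One hidden layer, one output unit, fully connected between layers."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Initial connection weights are random; the extra weight is a bias.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Data flow in one direction: input -> hidden -> output."""
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
                  for ws in self.w_hidden]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(self, x, target, lr):
        """One back-propagation update for the loss 0.5 * (output - target)^2."""
        hidden, output = self.forward(x)
        delta_out = (output - target) * output * (1 - output)
        # Hidden-layer error terms are computed with the current output weights.
        delta_hidden = [delta_out * self.w_out[j] * h * (1 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * h
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                ws[i] -= lr * delta_hidden[j] * v

def mse(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

# Invented training set: the target is 1 if either input is 1.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

net = TinyANN(n_in=2, n_hidden=3, seed=0)
loss_before = mse(net, data)
for _ in range(2000):                 # iterative search for better weights
    for x, t in data:
        net.train_step(x, t, lr=0.5)
loss_after = mse(net, data)
print(f"error before training: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because the result depends on the random starting weights and on `n_hidden`, rerunning the sketch with several values of `seed` and `n_hidden` mirrors the advice above to process multiple networks while varying these parameters.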
ANN can be a powerful modeling tool when the underlying relationships are unknown and the data are imprecise and noisy (Lek and Guégan 1999).Interpretation of the ANN can be difficult and neural networks are often referred to as a "black box" method (Lek and Guégan 1999, Olden et al. 2008, Wieland and Mirschel 2008, Kampichler et al. 2010).
ANNs can be more complicated to implement and are more computationally expensive than tree-based ML methods (Olden et al. 2008), but ANNs can accommodate a major gain in computational speed with a minor sacrifice in accuracy. For example, an ANN with one fourth the computational cost of a traditional satellite data retrieval algorithm (that uses an iterative method) has an accuracy nearly identical (+ 0.  2006). Because most real-world data cannot be separated with a straight line, significant additional processing may be required. If the classes overlap only slightly, a "buffer zone" or "soft margin" can be created around the hard decision boundary (Fig. 6b, Veropoulos

Genetic Algorithm
Genetic Algorithms (GA) are based on the process of evolution in natural systems in that a population of competing solutions evolves over time to converge on an optimal solution (Holland 1975, Goldberg and Holland 1988, Koza 1992, Haupt and Haupt 2004, Olden et al. 2008). Solutions are represented as "chromosomes" and model parameters are represented as "genes" on those chromosomes (Fig. 7). Training a GA has four steps: 1) random potential solutions are generated (chromosomes), 2) potential solutions are altered using "mutation" and "recombination", 3) solutions are evaluated to determine fitness (minimizing error), and 4) the best solutions cycle back to step 2 (Holland 1975, Mitchell 1998, Haefner 2005). Each cycle represents a "generation". Each chromosome is evaluated using a fitness function that scores its accuracy (Reeves and Rowe 2002).
Depending on the nature of the problem the GA is trying to solve, the chromosome can be strings of bits, real values, rules, or permutations of elements (Recknagel 2001).
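The four training steps can be sketched with bit-string chromosomes. This is a toy illustration under stated assumptions: the fitness function (the number of 1-bits in the chromosome) is a hypothetical stand-in for a real error measure, the population size, mutation rate, and number of generations are arbitrary, and letting the selected parents survive unchanged into the next generation is a design choice made here, not something prescribed by the sources cited above.

```python
import random

def fitness(chromosome):
    """Hypothetical objective: count of 1-bits (higher is better)."""
    return sum(chromosome)

def recombine(a, b, rng):
    """Single-point recombination of two parent chromosomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def evolve(n_genes=20, pop_size=10, generations=30, rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: generate random potential solutions (chromosomes).
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):          # each cycle is one generation
        # Step 3: evaluate fitness and rank the solutions.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Step 4: the best solutions are selected and cycle back to step 2.
        parents = population[: pop_size // 2]
        # Step 2: alter solutions by recombination and mutation.
        children = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng),
                   rate, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history

best, history = evolve()
print("best fitness per generation:", history)
print("final best chromosome:", best)
```

Because the parents are carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next.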
An advantage of GA is the removal of the often arbitrary process of choosing a model to apply to the data (Jeffers 1999). GAs can also be used to find the characteristics of other ML techniques, such as the weights and architectures of ANNs (Siddique and Tokhi 2001, Yen and Lu 2000). GAs have seen a rise in popularity due to development of the Genetic Algorithm for Rule-Set Prediction (GARP) used to predict species distributions (Stockwell and Noble 1992). GAs are very popular in hydrology (see Mulligan and Brown 1998 for description of how GA was used to find the Pareto Front) and meteorology (Haupt 2009).
GAs are able to cope with uneven sampling and small sample sizes (Olden et al. 2008).
GAs were developed with broad application in mind and can use a wide range of model structures and model-fitting approaches (Olden et al. 2008). As a result, a larger burden is placed on the user to select complicated model parameters with little guidance, and the fixed-length "chromosomes" can limit the potential range of solutions (Olden et al. 2008).
GAs are not best for all problems and many traditional statistical techniques can perform just as well or better (Olden et al. 2008).GARP, in particular, can be susceptible to overfitting (Lawler et al. 2006, Elith et al. 2008).
For a more detailed discussion of GA see Mitchell 1998 and Sastry et al. 2013.

Fuzzy Inference Systems
Fuzzy inference methods, such as Fuzzy Logic and Inductive Logic Programming, provide a practical approach to automating complex analysis and inference in a long workflow (Williams et al. 2009). Given a set of training examples, a fuzzy inference system will find a set of rules that can be used for prediction of new instances. The output is in terms of a natural language set of "if/then" rules (Wieland 2008). For example, a set of rules that predict when a child receives an allowance might be "if the room is clean and the behavior is polite, then the allowance is dispensed" (Fig. 8). The if/then rules are created through an algorithm that iteratively selects each class and refines the if-statement until only the selected class remains (e.g., Džeroski 2009). Two examples of these algorithms are FOIL and PROGOL (Muggleton 1995, Quinlan and Cameron-Jones 1995). The National Center for Atmospheric Research (NCAR) has developed three fuzzy inference algorithms to address a complex problem in meteorology (Williams et al. 2009) and these methods have been used to predict landslide susceptibility (Pradhan 2010).
Fuzzy inference systems perform a rule induction task. The resulting rules can be easy to understand and interpret, as long as the rule sets are not too large, but overfitting can be a problem (Kampichler et al. 2010). Because fuzzy inference systems use a larger pool of possible rules and are more expressive, they can be more computationally demanding.
For more information about fuzzy inference systems and rule induction see Fürnkranz et al. 2012b and Wang et al. 2007.
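As an illustration of how an "if/then" rule such as the allowance example might be evaluated with fuzzy logic, the sketch below assigns each condition a degree of truth between 0 and 1 and combines the two conditions with the minimum, a common choice for a fuzzy "and". It shows only the application of a rule, not its induction from data. The membership functions, their cut-off values (10 items on the floor, 4 rude remarks), and the function names are hypothetical and are not taken from the sources cited above.

```python
def clamp(value):
    """Restrict a degree of membership to the interval [0, 1]."""
    return max(0.0, min(1.0, value))

def clean_membership(items_on_floor):
    """Degree to which the room is 'clean': 1.0 with nothing on the floor,
    falling linearly to 0.0 at 10 or more items (hypothetical scale)."""
    return clamp(1.0 - items_on_floor / 10)

def polite_membership(rude_remarks):
    """Degree to which behavior is 'polite': 1.0 with no rude remarks,
    falling linearly to 0.0 at 4 or more (hypothetical scale)."""
    return clamp(1.0 - rude_remarks / 4)

def allowance_rule(items_on_floor, rude_remarks):
    """IF room is clean AND behavior is polite THEN allowance is dispensed.
    The fuzzy AND is taken as the minimum of the two degrees of truth."""
    return min(clean_membership(items_on_floor),
               polite_membership(rude_remarks))

for items, remarks in [(0, 0), (2, 1), (10, 0)]:
    print(items, remarks, allowance_rule(items, remarks))
```

A value near 1 means the rule fires strongly and a value near 0 means it barely applies; a crisp decision can be obtained by comparing the result with a chosen cut-off.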

Genetic Algorithm Schematic:
In this simplified schematic of a genetic algorithm, the five potential solutions, or "chromosomes", undergo mutation and recombination. Then the best performing solutions are selected for another iteration of mutation and recombination. This cycle is repeated until an optimal solution is found.

Bayesian Methods
Bayesian ML methods are based on Bayesian statistical inference, which started in the 18th century with the development of Bayes' theorem (Laplace 1986). These methods are based on expressing the true state of the world in terms of probabilities and then updating the probabilities as evidence is acquired (Bishop 2006). In most cases, it is important to know the probability that a new datum belongs to a given class, not just the inferred class.
The Bayesian approach can contribute to several ML and traditional statistical techniques, but this section will focus on Bayesian Classifiers. A Bayesian classifier calculates a probability density for each class (Fig. 9). The probability density is a curve showing, for any given value of the independent variable, the likelihood of being a member of that class (Fig. 9). The new datum is assigned to the class with the highest probability. The values of the independent variable that have an equal probability of being in either class are known as the decision boundary and this marks the dividing line between the classes. In the real world, it can be difficult to calculate these a priori probabilities and the user must often make a best guess.
A Bayesian classifier gives good results in most cases and requires fewer training data compared to other ML methods (Kotsiantis et al. 2006). It is useful when there are more than two distinct classes. The disadvantage is that it can be very hard to specify prior probabilities and results can be quite sensitive to the selected prior. This method does assume that variables are independent, which is not always true (e.g., Lorena et al. 2011). Some Bayesian classifiers have Gaussian assumptions which may not be reasonable for the problem at hand. Another issue is that if a specific feature never appears in a class, the resulting zero probability will complicate calculations; therefore, a small probability must often be added, even if the feature does not appear in the class.

Fuzzy Inference Rule Set:
This is a simplified example of an "if/then" rule set derived from data (in the table) using fuzzy inference. The inferred rules can be used to predict when an allowance will be dispensed.
For a more detailed discussion of Bayesian Classifiers and Bayesian Networks see section

This method would classify the object within the group with the highest probability of being correct. In this example, the white item would be classified as a member of the black group because the probability is higher (Black = 8/17 * 2/8 and Grey = 9/17 * 1/9).
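The arithmetic in the example above can be checked directly. In the sketch below, each class score is the prior (the share of all 17 items belonging to the class) multiplied by the likelihood (the share of that class found near the new item). The counts are those given in the example; reading the "2" and "1" as the numbers of nearby black and grey items is an assumption about the figure. The helper `smoothed_likelihood` is a hypothetical illustration of adding a small probability so that a feature never seen in a class does not produce a zero, here using add-one smoothing.

```python
from fractions import Fraction

class_counts = {"black": 8, "grey": 9}   # items per class (17 in total)
nearby_counts = {"black": 2, "grey": 1}  # assumed: items of each class near the new item

total = sum(class_counts.values())
scores = {}
for label, n in class_counts.items():
    prior = Fraction(n, total)                     # e.g. 8/17 for black
    likelihood = Fraction(nearby_counts[label], n) # e.g. 2/8 for black
    scores[label] = prior * likelihood

predicted = max(scores, key=scores.get)
print(scores)     # black: 2/17, grey: 1/17
print(predicted)  # black

def smoothed_likelihood(count, class_total, n_values, alpha=1):
    """Likelihood with `alpha` added to every count, so an unseen feature
    value (count == 0) still receives a small non-zero probability."""
    return Fraction(count + alpha, class_total + alpha * n_values)

print(smoothed_likelihood(0, 8, 2))  # 1/10 instead of 0
```

The black score (2/17) is twice the grey score (1/17), which is why the new item is assigned to the black group.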

Using ML in Ecology and Earth Science
For many researchers, machine learning is a relatively new paradigm that has only recently become accessible with the development of modern computing.While the adoption of ML methods in earth science and ecology has been slow, there are several published studies using ML in these disciplines (e.g., Park and Chon 2007).The following is a brief review of the different published applications of ML in earth science and ecology.

Habitat Modeling and Species Distribution
Understanding the habitat requirements of a species is important for understanding its ecology and managing its conservation.Habitat modelers are interested in using multiple data sets to make predictions and classifications about habitat characteristics and where taxa are likely to be located or engaging in a specific behavior (e.  1990, Brzeziecki et al. 1993, Guisan and Zimmermann 2000).

Species Identification
Identifying taxa can require specialized knowledge possessed by only a few experts, and the data sets requiring expert curation can be large (e.g., automated collection of images and sounds). Thus, the expert annotation step is a major bottleneck in biodiversity studies. In order to increase throughput, algorithms are trained on images, sounds, and other types of data labeled with taxon names. (For more information about automated taxon identification specifically, see Edwards et al. 1987 and MacLeod 2007.) The trained algorithms can then automatically annotate new data. This technique has been used to identify plankton, spiders, and shellfish larvae from images (Boddy and Morris 1999, Do et al. 1999, Sosik and Olson 2007, Goodwin et al. 2014). Bacterial taxa have been identified from gene sequences (Wang et al. 2007). Audio files of amphibian, bird, bat, insect, elephant, cetacean, and deer sounds have been classified to species (Parsons and Jones 2000, Jennings et al. 2008, Chesmore 2004, Acevedo et al. 2009, Armitage and Ober 2010, Kasten et al. 2010). Fish and algal species have been identified using acoustic (Simmonds et al. 1996) and optical characteristics (Balfoort et al. 1992, Boddy et al. 1994). ML has been used to differentiate between the radar signals of birds and abiotic objects (Rosa et al. 2015). In some cases, individuals of the same species can be distinguished even if the individuals themselves are unknown a priori (Reby et al. 1998, Fielding 1999b). Common tools include support vector machines (Fagerlund 2007, Sosik and Olson 2007, Acevedo et al. 2009, Armitage and Ober 2010, Goodwin et al. 2014, Rosa et al. 2015), Random Forest (Armitage and Ober 2010, Rosa et al. 2015), Bayesian classifiers (Fielding 1999a, Wang et al. 2007), genetic algorithms (Jeffers 1999), and neural networks (Balfoort et al. 1992, Boddy et al. 1994, Simmonds et al. 1996, Do et al. 1999, Parsons and Jones 2000, Jennings et al. 2008, Armitage and Ober 2010, Rosa et al. 2015).

Remote Sensing
Satellite images and other data gathered from sensors at great elevation (e.g., LIDAR) are an excellent way to gather large amounts of data about Earth over broad spatial scales. In order to be useful, these data must go through some minimum level of processing (Atkinson and Tatnall 1997) and are often classified into land cover or land use categories (Guisan and Zimmermann 2000). ML methods have been developed to automate these laborious processes (Lees and Ritman 1991, Fitzgerald and Lees 1992, Lees 1996, Atkinson and Tatnall 1997, Guisan and Zimmermann 2000, Ham et al. 2005, Pal 2005, Gislason et al. 2006, Lakshmanan 2009). ML methods can be used to infer geophysical parameters from remote sensing data, such as inferring the Leaf Area Index from Moderate Resolution Imaging Spectroradiometer data (Rumelhart et al. 1986, Hsieh 2009, Krasnopolsky 2009). Sometimes remote sensing data and the parameters inferred from them can require spatial interpolation in the vertical or horizontal dimension, which is often performed using ML methods (Krasnopolsky 2009, Li et al. 2011).

Resource Management
Making decisions about conservation and resource management can be very difficult because there are often not enough data for certainty and the consequences of being wrong can be disastrous. ML methods can provide a means of increasing certainty and improving results, especially techniques that incorporate Bayesian probabilities. Several algorithms have been applied to water (Maier and Dandy 2000, Haupt 2009), soil (Henderson et al. 2005, Tscherko et al. 2007), and biodiversity/wildlife management (Baran et al. 1996, Lek et al. 1996b, Lek et al. 1996a, Giske et al. 1998, Guégan et al. 1998, Spitz and Lek 1999, Chen et al. 2000, Vander Zanden et al. 2004, Jones et al. 2006, Sarkar et al. 2006, Worner and Gevrey 2006, Cutler et al. 2007, Quintero et al. 2014, Bland et al. 2014). ML methods have been used to model population dynamics, production, and biomass in terrestrial, aquatic, marine, and agricultural systems (Scardi 1996, Recknagel 1997, Scardi and Harding 1999, Recknagel et al. 2000, Schultz et al. 2000, Džeroski 2001, Recknagel et al. 2002, McKenna 2005, Muttil and Lee 2005). Some specific examples of ML applications in resource management and conservation include 1) inference of IUCN (International Union for Conservation of Nature) conservation status of Data Deficient species (Bland et al. 2014, Quintero et al. 2014), 2) predicting farmer risk preferences (Kastens and Featherstone 1996), 3) predicting the production and biomass of various animal populations (Brey et al. 1996) Furlanello et al. 2003, Cutler et al. 2007, Quintero et al. 2014).

Forecasting
Discovery of deterministic chaos in meteorological models (Lorenz 1963) led to reconsideration of the use of traditional statistical methods in forecasting (Pasini 2009).
Today, predictions about weather are often made using ML methods. The most common ML methods used in meteorological forecasting are genetic algorithms, which have been used to model rainy vs non-rainy days (Haupt 2009) and severe weather (Hsieh 2009).
Forecasting can be important for applications other than weather prediction. In atmospheric science, neural networks are able to find dynamics hidden in noise and successfully forecast important variables in the atmospheric boundary layer (Pasini 2009). The oceanography community makes extensive use of neural networks for forecasting sea level, waves, and sea surface temperature (Wu et al. 2006, Hsieh 2009). In addition to being directly used for forecasting, neural networks are commonly used for downscaling environmental and model output data sets used in making forecasts (Casaioli et al. 2003, Hsieh and Hsieh 2003, Marzban 2003).

Environmental Protection and Safety
Just as ML can help resource managers make important decisions with or without adequate data coverage, environmental protection and safety decisions can be aided with ML methods when data are sparse. ML has been used to classify environmental samples into inferred quality classes in situations where direct analyses are too costly (Džeroski 2001). The mutagenicity, carcinogenicity, and biodegradability of chemicals have been predicted based on structure without lengthy lab work (Džeroski 2001). Sources of air contaminants have been identified and characterized despite a lack of a priori knowledge about source location, emission rate, and time of release (Haupt et al. 2009a). ML can relate pollution exposure to human health outcomes (Džeroski 2001). Common ML methods for environmental protection include genetic algorithms (Haupt et al. 2009a), Bayesian classifiers (Walley et al. 1992), neural networks (Ruck et al. 1993, Walley and S. 1996, Walley et al. 2000), and fuzzy inference systems (Srinivasan et al. 1997, Džeroski et al. 1999, Džeroski 2001).

Climate Change Studies
One of the more pressing societal problems is the mitigation of and adaptation to climate change. Policy-makers require well-formed predictions in order to make decisions, but the complexity of the climate system, the interdisciplinary nature of the problem, and the data structures prevent the effective use of linear modeling techniques. ML is used to study important processes such as El Niño, the Quasi-Biennial Oscillation, the Madden-Julian Oscillation, and monsoon modes (Cavazos et al. 2002, Hsieh 2009, Krasnopolsky 2009, Pasini 2009), and to predict climate change itself (Casaioli et al. 2003, Hsieh and Hsieh 2003, Marzban 2003, Pasini 2009). Predictions about the greenhouse effect (Seginer et al. 1994) and environmental change (Guisan and Zimmermann 2000) have also been made using ML. A very common use of ML in climate science is downscaling and post-processing data from General Circulation Models (refs in Hsieh 2009, Pasini 2009). Ecological niche modeling and predictive vegetation mapping (as discussed above) can help predict adaptation to climate change (Wiley et al. 2003, Iverson et al. 2004). The most commonly used ML method in climate change studies is the neural network (Guisan and Zimmermann 2000, Pasini 2009).

Discussion
How can ML advance ecology and earth science?
The application of ML methods in ecology and earth science has already demonstrated the potential for increasing the quality and accelerating the pace of science. One of the more obvious ways ML does this is by coping with data gaps. The Earth is under-sampled, despite the hundreds of millions of dollars spent on earth and environmental science (e.g., Webb et al. 2010). Where possible, ML allows a researcher to use data that are plentiful or easy to collect to infer data that are scarce or hard to collect (e.g., Wiley et al. 2003, Edwards et al. 2005, Buddemeier et al. 2008). Conservation managers are particularly well positioned to take advantage of ML via SDMs in invasive species management, critical habitat identification, and reserve selection (Guisan et al. 2013). Depending on the ML method used, one can also learn more about how a system works, for example through the Random Forest Variable Importance analysis. ML methods let the data tell the story and work backwards to understand the system, while many numerical models impose a set of equations that may or may not be adequate. Another important way ML can fill in data gaps is through downscaling and performing spatial interpolation (Li et al. 2011). There will never be enough research funding to sample everything all of the time. ML can be a tractable method for addressing the data gaps that prevent scientific progress.
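As a hedged illustration of the variable importance analysis mentioned above, the following minimal sketch fits a Random Forest to synthetic data and ranks the predictors. The predictor names, the sample size, and the assumed rule that presence depends only on temperature are hypothetical and chosen purely for illustration; the scikit-learn library is used here as one possible implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical predictors for 500 sites (names are illustrative only).
feature_names = ["temperature", "rainfall", "elevation"]
X = rng.uniform(0.0, 1.0, size=(500, 3))

# Assumed ground truth: presence depends non-linearly on temperature alone.
y = (np.abs(X[:, 0] - 0.5) < 0.2).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1 across predictors.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda item: -item[1])
for name, score in ranked:
    print(f"{name:12s} {score:.3f}")
```

In this synthetic setting the ranking should place temperature first, because the other two predictors carry no information about the response. With real data, importance scores are a starting point for interpretation rather than proof of a causal relationship.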
ML can accelerate the pace of science by quickly performing complex classification tasks normally performed by a human. A bottleneck in many ecology and earth science workflows is the set of manual steps performed by an expert, usually a classification task such as identifying a species. Expert annotation can be even more time consuming when the expert must search through a large volume of data, like a sensor stream, for a desired signal (Kasten et al. 2010). Rather than having all of the data classified by an expert, the expert only needs to review enough data to train and test an algorithm. This bottleneck has been addressed for some types of taxon identification (Cornuet et al. 1996, Sosik and Olson 2007, Acevedo et al. 2009, Armitage and Ober 2010), finding relevant data in sensor streams (Kasten et al. 2010), and building a reference knowledgebase for image analysis (Huang and Jensen 1997). In addition to relieving a bottleneck, ML methods can sometimes perform tasks more consistently than experts, especially when there are many categories and the task continues over a long period of time (Culverhouse et al. 2003, Jennings et al. 2008). In these cases, ML methods can improve the quality of science by providing more quantitative and consistent data (Sutherland et al. 2004, Olden et al. 2008, Acevedo et al. 2009).
As discussed above, ML techniques can perform better than traditional statistical methods in some systems, but a direct comparison of performance between ML techniques and traditional statistical methods is difficult because there is no universal measure of performance and results can be very situation-specific (Fielding 2007). The true measure of the utility of a tool is how well it can make predictions from new data and how well it can be generalized to new situations. Highly significant p-values, R2 values, and accuracy measurements may not reflect this. A study comparing 33 classification methods (including ML and traditional statistics) with 32 data sets found no real difference in performance and suggested that choice of algorithm be driven by factors other than accuracy, such as the characteristics of the data set (Lim et al. 2000). If the accuracy is not significantly improved using ML, it may be better to use a traditional method that is more familiar and accepted by peers and managers. Best practice is to test multiple methods (including traditional statistics) while probing the trade-off between bias and accuracy and to choose the technique that best fits the problem. In many natural systems, where non-linear and interaction effects are common, an ML-based model may be more useful. Individual researchers need to select a method based on the specific problem and the data at hand.
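The practice of testing several methods on held-out data can be sketched as follows. This is a minimal example under stated assumptions, not a recommended protocol: the data are synthetic, with a pure interaction effect (the class depends on the sign of the product of two predictors), logistic regression stands in for a traditional method, and five-fold cross-validated accuracy is used as the only performance measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumption): the class depends only on the interaction x0 * x1.
X = rng.uniform(-1.0, 1.0, size=(600, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

candidates = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every candidate on data it was not trained on (5-fold cross-validation).
mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = float(scores.mean())
    print(f"{name:20s} mean held-out accuracy = {mean_accuracy[name]:.2f}")
```

On data of this kind the linear model has no main effect to exploit, so the tree-based method should score higher. On data that are close to linear the ranking may reverse or the difference may vanish, which is why the comparison is worth running for each problem.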

Why don't more people use ML?
Even though ML can outperform traditional statistics in some applications (Manel et al. 1999, Kampichler et al. 2000, Segurado and Araújo 2004, Elith et al. 2006, Peters et al. 2007, Pasini 2009, Armitage and Ober 2010, Knudby et al. 2010a, Li et al. 2011, Zhao et al. 2011, Bhattacharya 2013), the potential of ML methods in ecology and earth science has not been exhausted (Olden et al. 2008). The reasons for this are social and technical. New methods can be resisted by established scientists, which can delay widespread use (Azoulay et al. 2015). ML methods (as well as some more complex statistical models) can require a high degree of math skill to understand in detail, which means either a long familiarization phase or an acceptance of the algorithm as a "black box" (Kampichler et al. This combination of lack of formal education and important data restrictions can lead to naive applications of ML techniques, which can increase resistance to their adoption. Finally, communication between the ML research community and natural science research community is poor (Wagstaff 2012). The financial sector is applying ML, suggesting that communication is possible when the potential monetary reward is great enough. (The application of ML to the financial sector has had mixed results and uses only some of the same ML techniques discussed herein; Fletcher 2016.) There is too much reliance on abstract metrics in the ML research community and not enough consideration of whether or not a particular ML advance will result in a real-world impact (Wagstaff 2012). The small community of ecologists using ML to develop SDMs is not communicating the value of its research to decision-makers, and accounts of SDMs being used successfully in conservation are hidden in grey literature (Guisan et al. 2013). Communication and collaboration between the ML community, the ecology community, and the earth science community are poor.

Next Steps
How can the use of ML methods in ecology and earth science be encouraged? One barrier that has been partially lowered is the lack of tools and services to support the application of ML in these domains. Use of ML algorithms built with user infrastructure, such as GRASS-GIS (Garzón et al. 2006) and GARP (Stockwell and Noble 1992), is higher than use of algorithms without such infrastructure. ML capabilities in R and MATLAB have continued to make these methods more user-friendly. Programming skills have become more common in the natural sciences, but user interfaces are still very important for adoption of techniques.
Research scientists want to have a good understanding of the algorithms they use, which makes adoption of a new method a non-trivial investment. Reducing the cost of this investment for ML techniques is an important part of encouraging adoption. One way to do this is through a trusted collaborator who can simultaneously guide the research and transfer skills. These collaborators can be difficult to find, but many potential partners can be found in industry. A useful tool would be a publicly-available repository of annotated data sets to act as a sandbox for researchers wanting to learn and experiment with these methods, similar to Kaggle (https://www.kaggle.com/) but with natural science data. Random Forest is easier for a beginner to implement, gives easy to interpret results, and has high performance on ecology and earth science classification problems (Prasad et al. 2006, Kampichler et al. 2010); thus, Random Forest would be a good starting point for an ML novice. Students can be exposed to ML and command line programming through their graduate education, eliminating the need for a costly time investment during their research career. In addition, an improved statistical education for students would make them more aware of the limitations imposed by rigid models and thus more open to trying ML for some problems. An important part of promoting new techniques is recognizing the practical needs of researchers and working within those boundaries to facilitate change.
Finally, ML successes and impacts in ecology and earth science need to be more effectively communicated and the results from ML analyses need to be easily interpreted for decision-makers (Guisan et al. 2013). Research communities need to do a better job of communicating across domains about the impact of their results (Wagstaff 2012). For best communication between experts, collaborations should begin during and even before algorithm development to help properly define the problem being addressed, instead of developing an algorithm in isolation (Guisan et al. 2013). Once an algorithm has been successfully used in a decision-making process, the results need to be reported as a part of the published literature in addition to the grey literature.
Funding agencies can facilitate this process by specifically soliciting new collaborative projects (research projects, workshops, hack-a-thons, conference sessions) that apply ML methods to ecology and earth science in innovative ways, as well as initiatives to develop education materials for natural science students. Proper implementation of ML methods requires an understanding of both data science and the discipline, which can best be achieved through interdisciplinary collaboration.

Conclusions
ML methods offer a diverse array of techniques, now accessible to individual researchers, that are well suited to the complex data sets coming from ecology and earth science. These methods have the potential to improve the quality of scientific research by providing more accurate models and to accelerate progress in science by widening bottlenecks, filling data gaps, and enhancing understanding of how systems work. Application of these methods within the ecology and earth science domains needs to increase if society is to see the benefit. Adoption can be promoted through interdisciplinary collaboration, increased communication, increased formal and informal education, and financial support for ML research. Partnerships with companies interested in environmental issues can be an excellent source of knowledge transfer. A good introductory ML method is Random Forest, which is easy to implement and gives good results. However, ML methods have limitations and are not the answer to all problems. In some cases traditional statistical approaches are more appropriate (Meynard and Quinn 2007, Olden et al. 2008). ML methods should be used with discretion.
Figure 1.
Figure 2. Comparison of performance of two algorithms (grey lines) on hypothetical training (a) and test (b) data (black points). a: Training Data. Algorithm 1 (solid line) models the training data perfectly, with no error. Algorithm 2 (dashed line) is much more generalized and does not model the training data as well as Algorithm 1. b: Test Data. Algorithm 1 (solid line) modeled the training data perfectly, but has very high error on the test data. Algorithm 2 had a higher error on the training data, but models the test data with a reasonably low error. Algorithm 1 is an example of overfitting. Algorithm 2 is a much better real-world predictor.

Figure 3. As algorithm performance on training data increases, performance on test data increases only to a certain point (dashed line). Increases in performance on training data beyond this point result in overfitting.
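The behaviour described in the captions of Figures 2 and 3 can be reproduced with a small numerical sketch. The example below is an assumption-laden toy: the "true" process is a sine curve with Gaussian noise, and polynomials of increasing degree play the role of increasingly flexible algorithms. None of these choices come from the original figures.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def noisy_signal(x):
    # Assumed "true" process (a sine curve) plus observation noise.
    return np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

# Ten training points and fifty independent test points on the same interval.
x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_signal(x_train)
x_test = np.linspace(0.02, 0.98, 50)
y_test = noisy_signal(x_test)

def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# results[degree] = (training error, test error)
results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The degree-9 polynomial has as many coefficients as there are training points, so it reproduces the training data almost exactly, like Algorithm 1 in Figure 2. Its error on the test points cannot be that low because the test data contain fresh noise, and it is typically worse than that of a lower-degree fit.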

Figure 4. Decision and Classification Tree Schematic: Tree-based machine learning methods infer rules for splitting a data set into more homogeneous data sets until a specified number of terminal classes or maximum variance within the terminal classes is reached. The inferred splitting rules can give additional information about the system being studied.
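The idea of inferring a splitting rule can be sketched for a single predictor. The example assumes Gini impurity as the measure of homogeneity, which is one common choice and not necessarily the criterion behind Figure 4; the depth and habitat values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity with no split at all
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # candidate cut between neighbouring values
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical data: water depth (m) and the habitat class recorded at each site.
depth = [1, 2, 3, 10, 11, 12]
habitat = ["reef", "reef", "reef", "sand", "sand", "sand"]

threshold, impurity = best_split(depth, habitat)
print(f"rule: depth <= {threshold} (weighted impurity {impurity})")
```

A full tree repeats this search within each resulting subset until a stopping condition is met. The thresholds it finds are the inferred rules that can be read back as information about the system.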
1) to the traditional algorithm (Young 2009). Overfitting can be a problem (Kampichler et al. 2010). Many ANNs mimic standard statistical methods (A. Fielding pers. comm.), so a good practice while using ANNs is to also include a rigorous suite of validation tests and a general linear model for comparison (Özesmi et al. 2006). For a more detailed description of a multi-layer, feed-forward ANN with back propagation see section 4.1 in Kotsiantis et al. 2006 and section 5 in Bishop 2006. For more information on ANNs in general see Hagan et al. 2014.

Figure 5. Artificial Neural Network Schematic: A neural network is made up of three layers (input, hidden, output). Each layer contains interconnected units (neurons). Each connection has an assigned connection weight. The number of hidden units and the connection weights are iteratively improved to minimize the error between the output and the target.
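The iterative improvement of connection weights described in the Figure 5 caption can be sketched with a very small network. The example below is illustrative only: it assumes a fixed number of hidden units (only the weights are updated), a tanh hidden layer, a sigmoid output, mean squared error, plain gradient descent, and the XOR truth table as a stand-in data set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(n_hidden=4, learning_rate=0.5, n_steps=5000, seed=0):
    """Train a 2-input, n_hidden-unit, 1-output network; return the loss history."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    target = np.array([[0], [1], [1], [0]], dtype=float)

    # Connection weights (W) and biases (b) for the two layers of connections.
    W1 = rng.normal(0.0, 1.0, size=(2, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, size=(n_hidden, 1))
    b2 = np.zeros(1)

    losses = []
    for _ in range(n_steps):
        # Forward pass: input -> hidden -> output.
        hidden = np.tanh(X @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        error = output - target
        losses.append(float(np.mean(error ** 2)))

        # Backward pass: gradients of the mean squared error.
        grad_z2 = (2.0 / len(X)) * error * output * (1.0 - output)
        grad_W2 = hidden.T @ grad_z2
        grad_b2 = grad_z2.sum(axis=0)
        grad_z1 = (grad_z2 @ W2.T) * (1.0 - hidden ** 2)
        grad_W1 = X.T @ grad_z1
        grad_b1 = grad_z1.sum(axis=0)

        # Move every weight a small step against its gradient.
        W2 -= learning_rate * grad_W2
        b2 -= learning_rate * grad_b2
        W1 -= learning_rate * grad_W1
        b1 -= learning_rate * grad_b1
    return losses

losses = train_xor()
print(f"initial error = {losses[0]:.4f}, final error = {losses[-1]:.4f}")
```

The error between output and target should fall as training proceeds. How far it falls depends on the random starting weights, the learning rate, and the number of steps, which is one reason validation on separate data remains necessary.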
Figure 9. Bayesian Classifier Schematic: This diagram shows a simplified schematic of a Bayesian classifier working to assign a new datum (white triangle) to one of two classes (grey and black). a: Probability Density Plot: A Bayesian classifier calculates a probability density for each class (solid and dotted curve) across a range of values for the new datum (white triangle), which is classified according to which probability is highest at its value (black). The value for which the datum has an equal probability of being in both classes is called the decision boundary (black line). b: Data Plot: An object to be classified (white) can belong to one of two groups (grey or black). This method would classify the object within the group with the highest probability of being correct. In this example, the white item would be classified as a member of the black group because the probability is higher (Black = 8/17 * 2/8 and Grey = 9/17 * 1/9).
Regression and curve-fitting are types of function approximation. Artificial Neural Networks are one ML technique that performs function approximation (see discussion of ANN below). Natural science problems such as predicting the global riverine fish population (Guégan et al. 1998) and forecasting oceanographic conditions (Hsieh 2009) have been addressed with Artificial Neural Networks performing function approximation tasks. Tree-based methods can also be used for function approximation via regression (Loh 2014). Linear regression is an example of a traditional statistical method that performs a function approximation task (Sokal and Rohlf 2011).
Task 2) Classification. This process assigns a new observation to a category based on training data (Alpaydin 2014, Kotsiantis 2007). A common example of classification is the automated sorting of spam and non-spam email. ML techniques that are known to be good classifiers include Random Forest, Support Vector Machines, and Bayesian Classifiers, which will also output the probability that the observation belongs to the inferred class.
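The arithmetic quoted in the Figure 9 caption can be checked directly. In the sketch below the fractions are taken from the caption; reading 8/17 and 9/17 as class priors and 2/8 and 1/9 as the share of each class found near the new object is an interpretation assumed here, and the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the Figure 9 caption: 17 labelled objects, 8 black and 9 grey.
prior = {"black": Fraction(8, 17), "grey": Fraction(9, 17)}

# Assumed reading: share of each class that lies close to the new object.
likelihood = {"black": Fraction(2, 8), "grey": Fraction(1, 9)}

# Unnormalized score per class: prior times likelihood.
score = {name: prior[name] * likelihood[name] for name in prior}

# Normalizing turns the scores into probabilities that sum to 1.
total = sum(score.values())
posterior = {name: score[name] / total for name in score}

predicted = max(posterior, key=posterior.get)
for name in score:
    print(f"{name}: score = {score[name]}, probability = {posterior[name]}")
print("predicted class:", predicted)
```

The black score is 2/17 and the grey score is 1/17, so the object is assigned to the black group with a normalized probability of 2/3. This is the kind of class probability that a Bayesian classifier can report alongside its decision.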
(Ben-Hur et al. 2001) solution is to map the data onto a higher-dimensional feature space wherein a linear boundary can be found. An algorithm called a kernel function is used to translate data into the new feature space. Choosing the correct kernel function is important (Kotsiantis et al. 2006) and can slow the training process. SVMs are excellent binary classifiers when given labeled training data. Problems with more than two classes must be divided into multiple binary classification problems. When data are unlabeled, SVMs can be used for clustering, and this is called Support Vector Clustering (Ben-Hur et al. 2001). For a more detailed description of SVMs see section 6 in Kotsiantis et al. 2006 and chapter 9 in James et al. 2013. For a discussion of kernel functions see Genton 2001.
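The mapping idea can be illustrated without a full SVM. The sketch below is not an SVM implementation: it uses an explicit, hand-picked feature map (adding the squared value as a second coordinate), whereas a kernel function achieves a comparable effect implicitly. The data values and class names are hypothetical.

```python
# Hypothetical 1-D data: the "inner" class sits near zero, the "outer" class far from it.
x = [-3.0, -2.5, -2.0, -0.5, 0.0, 0.4, 2.0, 2.6, 3.0]
label = ["outer" if abs(v) > 1.0 else "inner" for v in x]

def separable_by_threshold(values, labels):
    """True if one cut-point along this axis puts each class entirely on one side."""
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1

def feature_map(v):
    """Map a 1-D value into a 2-D feature space (v, v squared)."""
    return (v, v * v)

# In the original space no single cut-point separates the classes.
separable_original = separable_by_threshold(x, label)

# In the mapped space, a straight boundary on the second coordinate does.
second_coordinate = [feature_map(v)[1] for v in x]
separable_mapped = separable_by_threshold(second_coordinate, label)

print("separable in original space:", separable_original)
print("separable in mapped space:  ", separable_mapped)
```

The outer class surrounds the inner class on the original axis, so no linear boundary exists there. After the mapping, the horizontal line where the second coordinate equals 1 separates the two classes.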
g., nesting Fielding 1999b, Cutler et al. 2007). The rule-sets developed are referred to as Species Distribution Models (SDM) and can use a wide variety of ML methods to make their predictions or none at all (Guisan and Thuiller 2005). Typically, an algorithm would be trained using a data set matching environmental variables to taxon abundance or presence/absence data. If the algorithm tests well, it can be given a suite of environmental variables from a different location to make predictions about what taxa are present. This technique has been used to identify current suitable habitat for specific taxa, model future species distributions including predicting invasive and rare species presence, and predict biodiversity of an area (Tan and Smeins 1996, Kampichler et , Wiley et al. 2003 al. 2007, Olden et al. 2008, Knudby et al. 2010a). Common tools include Random Forest (Cutler et al. 2007, Peters et al. 2007), classification and decision trees (Ribic and Ainley 1997, Bell 1999, Kobler and Adamic 2000, Vayssières et al. 2000, Debeljak et al. 2001, Miller and Franklin 2002), neural networks (Mastrorillo et al. 1997, Guégan et al. 1998, Fielding 1999a, Manel et al. 1999, Brosse et al. 2001, Thuiller 2003, Dedecker et al. 2004, Segurado and Araújo 2004, Özesmi et al. 2006), genetic algorithms (D'Angelo et al. 1995, Stockwell and Peters 1999, Stockwell 1999, McKay 2001, Peterson et al. 2002, Wiley et al. 2003, Termansen et al. 2006), support vector machines (Pouteau et al. 2012), and Bayesian classifiers (Fischer
Common tools for classifying remote sensing images include Random Forest (Knudby et al. 2010b, Duro et al. 2012), support vector machines (Durbha et al. 2007, Knudby et al. 2010b, Zhao et al. 2011, Duro et al. 2012, Mountrakis et al. 2011), neural networks (Rogan et al. 2008), genetic algorithms (Haupt 2009), and decision trees (Huang and Jensen 1997). Random Forest and support vector machines have been used for spatial interpolation of environmental variables (Li et al. 2011). Artificial neural networks have been used to infer geophysical parameters from remote sensing data (Hsieh 2009, Rumelhart et al. 1986, Krasnopolsky 2009).
, 4) examining the effect of urbanization on bird breeding (Lee et al. 2007), 5) predicting disease risk (Furlanello et al. 2003, Guo et al. 2005), and 6) modeling ecological niches (Drake et al. 2006). Being able to make these types of predictions and inferences can help focus conservation efforts for maximum impact (Knudby et al. 2010a, Guisan et al. 2013). Common ML methods for resource management include genetic algorithms (Haupt 2009), neural networks (Brey et al. 1996, Kastens and Featherstone 1996, Recknagel 1997, Giske et al. 1998, Guégan et al. 1998, Schultz et al. 2000, Lee et al. 2007), support vector machines (Guo et al. 2005, Drake et al. 2006), fuzzy inference systems (Tscherko et al. 2007), decision trees (Henderson et al. 2005, Jones et al. 2006), and Random Forest (
2010). ML methods are highly configurable; thus, it can be overwhelming for researchers to choose the proper test for the job (Kampichler et al. 2010). Many of them require programming skills (e.g., scikit-learn) that many ecologists lack (Olden et al. 2008); however, tools like MATLAB and R have developed more user-friendly interfaces and lowered the barrier to adoption for many users. Alternatively, many of the traditional statistical methods are fast to calculate and give easy-to-interpret metrics, like p-values (Olden et al. 2008, Kampichler et al. 2010). Traditional statistical methods are easier to find as a part of an off-the-shelf software package with a user interface and much of the complicated inner workings pleasantly hidden. Traditional statistical methods are part of a typical graduate and undergraduate education in the sciences whereas ML techniques are not. All of these make ML methods less attractive to practicing natural scientists than traditional statistical methods. Another barrier to using ML techniques is the need for adequate amounts of training and test data within the desired range of prediction. This places an important constraint on the application of ML to problems that have appropriate, annotated data sets available. For example, the Google image recognition algorithm was developed using 1.2 million annotated images (Simonite 2016). Rarely does a natural science domain have a quality, annotated data set that large. In addition, the validity of an ML model is restricted to the range represented by the training and test data. For example, a bird behavior model that was trained only on data collected during the summer will not be able to predict winter behavior. This is an important problem in the natural sciences, where the need for extrapolation is high (e.g., predicting climate change). The lack of high-quality data for model development has been cited as a major bottleneck in many fields of ML application (e.g., Bewley et al. 2015, Thessen and Patterson 2011). There are techniques available for developing a model when the data set is small (Corkill and Gormley 2016), but some methods (e.g., cross-validation, Bayesian) give a weaker estimate of model error (Guisan and Zimmermann 2000, Hsieh 2009). Traditional statistical methods are validated using p-values and tend to require much less data to develop a useful model. Thus, it can be harder to validate an ML model than a traditional statistical model.
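The restriction of a model to the range of its training data can be shown with a deliberately simple stand-in. The sketch below uses one-nearest-neighbour regression, chosen here only because it is short; the training values and the assumed true relationship (y = 2x) are hypothetical.

```python
# Hypothetical training data covering only the range 0 to 5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.0 * x for x in train_x]  # assumed true relationship: y = 2x

def predict_nearest(x_new):
    """1-nearest-neighbour regression: return y of the closest training x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return train_y[nearest]

inside = predict_nearest(3.2)    # within the training range, true value 6.4
outside = predict_nearest(10.0)  # far outside the training range, true value 20.0

print(f"x = 3.2  -> predicted {inside}, true {2.0 * 3.2}")
print(f"x = 10.0 -> predicted {outside}, true {2.0 * 10.0}")
```

Inside the sampled range the prediction is close to the true value. Outside it, the model can only return the value of the nearest training example, so the error grows with distance from the data. Tree-based regressors share this limitation because their predictions are averages of training targets.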