One Ecosystem :
R Package
Corresponding author: Arturo Sanchez-Porras (sp.arturo@gmail.com)
Academic editor: Benjamin Burkhard
Received: 08 Feb 2025 | Accepted: 23 Apr 2025 | Published: 02 May 2025
© 2025 Arturo Sanchez-Porras, Aline Romero-Natale, Otilio Acevedo-Sandoval, Edlin Guerra-Castro
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Sanchez-Porras A, Romero-Natale A, Acevedo-Sandoval O, Guerra-Castro E (2025) imanr: An R tool for the identification of Mexican native maize complexes. One Ecosystem 10: e149055. https://doi.org/10.3897/oneeco.10.e149055
The conservation of the genetic diversity of native maize in Mexico is a priority due to its cultural, agricultural and environmental importance. This study presents the development and evaluation of the imanr package, a computational tool based on Boosted Ensembles designed to automate the classification of racial complexes of native maize. Using a national database, a model was implemented that leverages morphological and geographical variables to provide precise and rapid classifications. The methodology included the optimisation of key parameters through cross-validation, achieving up to 90% in balanced accuracy and a Cohen's Kappa coefficient of 0.84. These results highlight the robustness of the model compared to traditional methods, which rely on subjective expert judgement and require extended evaluation times. The findings demonstrate that the package not only surpasses conventional methods in terms of efficiency, but also offers an accessible tool for conserving and monitoring native maize diversity, aligning with the recommendations of the Global Maize Project (PGMN). Moreover, its usability was enhanced by developing a graphical user interface, allowing non-specialised users to fully utilise its potential. imanr represents a significant advancement in native maize conservation science, contributing to the modernisation of identification processes and strengthening sustainable management strategies for this essential genetic resource. This model directly addresses the need for innovative tools to monitor and preserve maize diversity in Mexico and suggests a promising pathway for future applications in the agricultural sector.
agrobiodiversity management, Boosted Ensemble classification, genetic diversity monitoring, morphological and geographical data, native maize biodiversity
Maize (Zea mays L.) represents an invaluable biocultural resource, particularly in Mexico, where its genetic, historical and cultural diversity is deeply intertwined with national identity and food security. The modern maize consumed today is the result of the domestication of its ancestor, known as teosinte (Zea mays ssp. parviglumis), a process initiated by Mesoamerican peoples between 9,000 and 5,000 BCE (
The classification of Mexican maize has a long history that spans more than a century. By the end of the 19th and beginning of the 20th centuries, the first systematic attempts to classify maize globally emerged, notably the work of E. Lewis Sturtevant (
During the second half of the 20th century, the number of recognised races increased due to new studies and collections led by researchers, such as Hernández Xolocotzi and Arias Reyes (
Accurate identification of native maize landraces in Mexico is essential for the conservation of adaptive genetic diversity. Each landrace represents a unique set of traits shaped by long-term selection under specific environmental and cultural conditions. The loss of any variety implies the irreversible disappearance of locally-adapted genotypes and agronomic strategies that may become vital in future contexts of climate change and food insecurity (
From an agroecology perspective, native maize forms the structural foundation of the traditional milpa system, which varies significantly across regions. Each maize variety interacts with companion crops, local soil conditions and climate, introducing differences in the functioning and resilience of the agroecosystem. Understanding and preserving maize diversity is, therefore, a matter of conserving not only a crop, but also entire ecological and cultural systems (
Despite its importance, the identification of maize landraces still relies on time-consuming and expert-driven procedures. Classical approaches use dichotomous keys and reference collections, based on morphological descriptors first systematised by
In this context, the imanr package was developed to ease access to maize landrace identification through a computational approach grounded in machine-learning (ML) models, which have proven to be robust and efficient tools for classification and prediction in various areas of biology and ecology (
This article introduces the imanr package as an accessible and versatile tool for farmers, scientists and stakeholders interested in studying and preserving native maize in Mexico. It describes the underlying methodology, model performance and practical implementation in R, highlighting its ability to integrate qualitative and quantitative traits for identifying racial complexes. Beyond improving the precision and efficiency of native maize racial complex identification, this tool directly promotes strategies for conserving genetic diversity. Moreover, it reinforces the recognition of native maize as an invaluable biocultural heritage linked to traditional agricultural systems that sustain essential ecological and cultural practices for agroecosystem resilience.
Dataset: Global Maize Project (PGMN). The development of the imanr package was based on data collected by the Global Maize Project (PGMN), managed by the National Institute for Forestry, Agriculture and Livestock Research (INIFAP) and available through the website of the National Commission for Knowledge and Use of Biodiversity (
2010 Database: This dataset includes 22,932 records with geographic information, producer data, agricultural practices and detailed morphological measurements of ears and plants.
2017 Database: This dataset comprises 25,861 records containing only georeferenced information, race and racial complex classification.
Given its greater diversity of variables, the 2010 database was selected as the primary source of training data for this model.
Data Selection and Processing
A panel of three experts independently evaluated the available variables to identify those most relevant for ensuring quality and consistency in the classification of racial complexes. The following criteria were established:
Proportion of complete records: Variables with a high percentage of missing data, such as the number of ears per plant or grain moisture content, were excluded.
Direct relevance: Variables related to descriptors or cultivation practices that did not provide inherent information about maize characteristics were discarded.
Format consistency: Variables such as planting and flowering dates were eliminated due to inconsistencies in data entry.
The variables considered for inclusion were those with at least 35% complete records. Fig.
Proportion of data completeness by variable across racial complexes in the 2010 PGMN database. Variables are grouped by type into three panels: (a) geographic, (b) qualitative and (c) quantitative. Colour intensity reflects the percentage of complete records, with darker blue tones indicating higher data completeness. This visualisation provided information for the selection of variables for model development by highlighting missing data patterns.
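The completeness screen described above can be sketched in base R. This is a minimal illustration rather than the authors' code: the data frame `records` and its column names are hypothetical stand-ins for the PGMN table, and only the 35% threshold comes from the text.

```r
# Toy stand-in for the PGMN table; column names are hypothetical.
records <- data.frame(
  ear_length     = c(15.2, NA, 18.1, 14.0, 16.3),
  grain_colour   = c("white", "yellow", NA, "white", "blue"),
  grain_moisture = c(NA, NA, NA, 12.5, NA),
  stringsAsFactors = FALSE
)

# Proportion of non-missing records per variable.
completeness <- colMeans(!is.na(records))

# Keep variables with at least 35% complete records (threshold from the text).
min_complete <- 0.35
kept <- names(completeness)[completeness >= min_complete]

print(round(completeness, 2))
print(kept)
```

In this toy table, `grain_moisture` falls below the threshold and is dropped, mirroring the exclusion of sparsely recorded variables.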
The final set of selected variables included a mix of geographic, qualitative and quantitative data, as detailed in Table
Final set of variables selected for training and applying the classification models. Variables were chosen based on completeness, relevance to racial complex identification and consistency across records in the 2010 PGMN dataset.
Category | Variable
Identification | ID, primary race, racial complex.
Geographic | Latitude, longitude, altitude.
Qualitative | Grain colour, cob colour, stem colour, row arrangement, ear shape, grain type.
Quantitative | Ear length, ear diameter, ear diameter-to-length ratio, rows per ear.
To ensure compatibility with classification algorithms, qualitative variables were numerically encoded, while quantitative variables were normalised to prevent biases arising from differences in scale. Missing values were handled through mean imputation for quantitative variables and mode imputation for qualitative variables, ensuring consistency and minimising the impact of incomplete data.
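The preprocessing steps above can be sketched in base R as follows. The column names are hypothetical, and two details are assumptions not stated in the text: "numerically encoded" is taken to mean integer codes, and "normalised" is taken to mean z-score scaling.

```r
# Hypothetical sample; column names are illustrative, not the package's schema.
samples <- data.frame(
  ear_length   = c(15.2, NA, 18.1, 14.0),
  ear_diameter = c(4.1, 4.5, NA, 3.9),
  grain_colour = c("white", "yellow", NA, "white"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector (table() ignores NA by default).
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

num_cols <- c("ear_length", "ear_diameter")
cat_cols <- "grain_colour"

# Mean imputation for quantitative variables.
for (col in num_cols) {
  x <- samples[[col]]
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  samples[[col]] <- x
}

# Mode imputation, then integer encoding, for qualitative variables.
for (col in cat_cols) {
  x <- samples[[col]]
  x[is.na(x)] <- mode_value(x)
  samples[[col]] <- as.integer(factor(x))
}

# Assumed normalisation: zero mean and unit standard deviation.
samples[num_cols] <- lapply(samples[num_cols],
                            function(x) (x - mean(x)) / sd(x))
print(samples)
```

In a train/test workflow, the imputation values and scaling constants would normally be estimated on the training rows only and then reused for the test rows, so that no information leaks from the held-out data.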
Classification Algorithms Tested
Multiple approaches can be employed to address a classification problem such as the one presented in this study. In the development of the imanr package, we explored a range of supervised learning techniques, from simple probabilistic classifiers to advanced ensemble models. Each algorithm was selected to represent a distinct strategy in classification, providing a diverse spectrum of capabilities for handling data. The algorithms considered include:
Naive Bayes (NB): A simple probabilistic classifier that assumes independence amongst predictors. While this assumption rarely holds in real-world datasets, NB is computationally efficient and provides a useful performance baseline. It is known to struggle with correlated features or overlapping distributions (
K-Nearest Neighbours (KNN): An instance-based method that classifies samples based on the majority class amongst their nearest neighbours in the feature space. KNN is intuitive and adapts well to non-linear class boundaries, although it is sensitive to feature scaling, noise and high dimensionality (
Support Vector Machines (SVM): A powerful method for high-dimensional data, SVM finds optimal hyperplanes that maximise the margin between classes. Its performance depends on appropriate selection of kernels and regularisation parameters, especially in multiclass problems (
Random Forests (RF): A robust ensemble of decision trees built using bootstrap samples and random subsets of features. RF is well-known for reducing overfitting and handling multicollinearity and is particularly effective for high-dimensional and mixed-type data (
Boosted Ensembles (BE): An iterative ensemble method that sequentially builds trees, each one correcting the errors of its predecessor. Boosting achieves high accuracy and generalisation, making it suitable for complex data interactions (
Neural Networks (NN): Modelled after biological networks, NNs are capable of learning complex non-linear relationships. We implemented a multilayer perceptron (MLP) variant, which is flexible, but requires careful tuning and sufficient data to avoid overfitting (
Bagged Trees (BT): A variant of ensemble learning in which multiple decision trees are trained independently on different bootstrapped subsets of the data and their predictions are averaged. Unlike RF, BT does not randomly select features at each split, focusing solely on variance reduction (
Each algorithm possesses different advantages: NB is fast and interpretable, KNN is non-parametric and locally adaptive, SVM is effective in high-dimensional spaces, RF and BT offer ensemble robustness, BE maximises predictive performance through iterative correction and NN brings flexibility for capturing subtle patterns (
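To make the idea of sequential error correction in boosting concrete, the following base R sketch fits one-split trees (stumps) to the residuals of the current ensemble. It is a deliberately simplified toy, assuming a regression target with squared-error loss and a single predictor; it is not the multi-class xgboost model used in imanr, and all names and settings are illustrative.

```r
# Fit a one-split regression stump to residuals r on a single predictor x.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  xs <- sort(unique(x))
  # Midpoints between consecutive values serve as candidate split points.
  for (s in (head(xs, -1) + tail(xs, -1)) / 2) {
    left <- x <= s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)    # toy regression target

learn_rate <- 0.1                     # shrinkage applied to each new tree
n_trees <- 50
fit <- rep(mean(y), length(y))        # start from a constant prediction
mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  residual <- y - fit                 # errors of the current ensemble
  stump <- fit_stump(x, residual)     # the next tree targets those errors
  fit <- fit + learn_rate * predict_stump(stump, x)
  mse[m] <- mean((y - fit)^2)
}

print(round(mse[c(1, 10, 50)], 4))
```

The training error can only decrease or stay level from one round to the next, because each stump is a least-squares fit to the remaining residuals. Held-out error need not follow the same pattern, which is why the learning rate and tree depth are tuned by cross-validation.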
Model Training and Validation
The dataset was randomly split into 75% for training and 25% for testing. Although other proportions such as 70/30 are commonly used, the exact split is less critical in this case due to the large size of the dataset and the additional use of k-fold cross-validation within the training phase. Cross-validation (CV) provides more robust and generalisable estimates of model performance compared to single holdout validation sets (
To properly interpret the results of the classification experiments, it is essential to consider multiple performance metrics, especially given the multi-class and imbalanced nature of the dataset. Four complementary metrics were selected to evaluate and compare the tested models: Balanced Accuracy, F1-score, Cohen’s Kappa and the Area Under the Precision-Recall Curve (AUPRC).
The tuning process prioritised different performance metrics to account for the characteristics of the dataset and the classification goals. RF and NB were tuned using Balanced Accuracy to optimise the trade-off between sensitivity and specificity across classes. SVM, BE and NN were tuned using the Area Under the Precision-Recall Curve (AUPRC) to enhance the detection of minority classes, given their importance in this study. KNN was tuned using the F1-score, which combines precision and recall and is therefore well suited to evaluating how the model handles imbalanced data. Finally, BT was tuned using Cohen’s Kappa, which measures the agreement between predicted and actual classes while correcting for the agreement expected by chance. This tailored approach ensured that each model was optimised for its specific strengths and the requirements of the dataset.
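Three of these metrics can be computed directly from a confusion matrix, as the base R sketch below shows. The labels are invented for illustration, and macro-averaging across classes is an assumption, since the text does not state how per-class values were aggregated. AUPRC is omitted because it requires predicted class probabilities rather than hard labels.

```r
# Hypothetical true and predicted labels for three classes.
truth <- factor(c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"))
pred  <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A"),
                levels = levels(truth))

cm <- table(truth = truth, pred = pred)   # rows: truth, columns: prediction
n  <- sum(cm)
tp <- diag(cm)                            # correct predictions per class
fn <- rowSums(cm) - tp
fp <- colSums(cm) - tp
tn <- n - tp - fn - fp

sensitivity <- tp / (tp + fn)             # per-class recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

# Macro-averaged metrics: each class counts equally regardless of its size.
balanced_accuracy <- mean((sensitivity + specificity) / 2)
f1_macro <- mean(2 * precision * sensitivity / (precision + sensitivity))

# Cohen's Kappa: observed agreement corrected for chance agreement.
p_observed  <- sum(tp) / n
p_expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

print(round(c(balanced_accuracy = balanced_accuracy,
              f1_macro = f1_macro,
              cohen_kappa = cohen_kappa), 3))
```

Because the macro-averaged metrics weight every class equally, a model that ignores rare racial complexes is penalised even when its overall accuracy is high.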
After tuning, the final model was re-trained on the full 75% training set using the optimal hyperparameters. Its generalisation capacity was then evaluated on the reserved 25% test set, which had not been used during model selection. This process follows best practices in predictive modelling workflows by balancing thorough internal validation with independent performance verification.
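The data partitioning behind this workflow can be sketched with row indices alone. The record count, the seed and the number of folds (k = 5) are assumptions made for illustration; the text specifies only the 75/25 split and the use of k-fold cross-validation within the training set.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

n <- 200                      # hypothetical number of records
idx <- sample(n)              # random permutation of row indices

# 75% of rows for training, the remaining 25% held out for testing.
n_train   <- floor(0.75 * n)
train_idx <- idx[seq_len(n_train)]
test_idx  <- idx[(n_train + 1):n]

# Assign each training row to one of k folds (k = 5 is an assumption).
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n_train))

for (fold in seq_len(k)) {
  validation_rows <- train_idx[fold_id == fold]
  fitting_rows    <- train_idx[fold_id != fold]
  # A candidate model would be fitted on fitting_rows and scored on
  # validation_rows here; the test rows are never touched during tuning.
  cat("fold", fold, ":", length(fitting_rows), "fit /",
      length(validation_rows), "validate\n")
}
```

Once the best hyperparameters are chosen from the fold scores, the model is refitted on all of `train_idx` and evaluated once on `test_idx`.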
Implementation in R and Package Design
The imanr package was developed using the Boosted Ensemble (BE) algorithm as the primary model, based on its superior performance in the comparative analysis of metrics, as well as its inherent ability to handle complex, high-dimensional datasets with multiple features. This approach is well-known for its ability to iteratively enhance model performance by focusing on misclassified instances, resulting in superior accuracy, robustness and generalisation (
Regarding the package design, the xgboost (
External Validation and Reproducibility
External validation was conducted using individual maize samples collected during a project documenting native maize from the Otomi-Tepehua Region. The imanr package is publicly available on CRAN at https://cran.r-project.org/package=imanr, accompanied by detailed documentation and reproducible examples to ensure its accessibility and usability for the scientific community and other users interested in native maize classification. The development version of the package is open source and available on GitHub at https://github.com/rafa6174/imanr.
Development of an Interactive Interface
To enhance the accessibility of the imanr package for users without programming expertise, an interactive graphical interface was developed with Shiny and hosted on shinyapps.io. This platform, available at https://arturosp.shinyapps.io/imanrWeb/, allows users to upload morphological and geographical data, perform classifications and intuitively visualise results. The interface democratises the use of the package by eliminating the need for any knowledge of R, making the tool more inclusive and efficient for researchers, students and technicians in the agricultural field.
Model comparison and selection
After training and testing the models, their performance was evaluated using four metrics (i.e. Balanced Accuracy, F1-score, Cohen’s Kappa and AUPRC) that together provide an integral view of model performance, as relying on a single indicator can lead to misleading conclusions, especially in multi-class settings. Fig.
Quantitative performance metrics of seven supervised classification models tested for their ability to predict maize racial complexes. The metrics include Balanced Accuracy, F1-score, Cohen’s Kappa and the Area Under the Precision-Recall Curve (AUPRC), particularly relevant for imbalanced datasets.
Model | Balanced accuracy | F1-score | Cohen’s Kappa | AUPRC
BE | 0.904 | 0.846 | 0.838 | 0.914
RF | 0.893 | 0.837 | 0.832 | 0.916
BT | 0.887 | 0.813 | 0.813 | 0.887
KNN | 0.863 | 0.792 | 0.770 | 0.847
SVM | 0.858 | 0.771 | 0.693 | 0.558
NN | 0.853 | 0.747 | 0.752 | 0.799
NB | 0.535 | 0.173 | 0.062 | 0.458
Comparative performance of seven supervised learning algorithms evaluated for the classification of native maize racial complexes using the 2010 PGMN dataset. The bar plots display Balanced Accuracy, F1-Score, Cohen’s Kappa and AUPRC scores for each model, highlighting the superior performance of the Boosted Ensemble (BE) method.
RF and BT performed well, closely behind BE, with RF achieving a Balanced Accuracy of 0.893 and an AUPRC of 0.916. These results suggest that models based on multiple decision trees are particularly well suited to this classification problem. KNN and SVM also showed strong performance, particularly in Balanced Accuracy (0.863 and 0.858, respectively). NN performed moderately well with an AUPRC of 0.799, but lagged in Cohen’s Kappa (0.752). These results underscore the robustness of BE across metrics, making it the most reliable model for classifying native maize racial complexes.
Final Model Implementation
The BE model implemented with the xgboost package (
Results of hyperparameter tuning for the Boosted Ensemble (BE) classification model using the xgboost implementation. The table presents performance metrics (i.e. Balanced Accuracy, F1-score, Cohen’s Kappa and AUPRC) for different combinations of tree depth, learn rate and sample size. These results were used to determine the optimal configuration that balances model complexity, generalisation and computational efficiency.
tree_depth | learn_rate | sample_size | Balanced accuracy | F1-score | Cohen’s Kappa | AUPRC
1 | 0.0021 | 0.9412 | 0.6749 | 0.5828 | 0.5748 | 0.5827
1 | 0.0198 | 0.5119 | 0.8073 | 0.6817 | 0.7243 | 0.8005
5 | 0.0694 | 0.7472 | 0.9043 | 0.8456 | 0.8378 | 0.9137
9 | 0.0439 | 0.3298 | 0.9024 | 0.8434 | 0.8368 | 0.9126
10 | 0.0578 | 0.5160 | 0.9036 | 0.8451 | 0.8369 | 0.9138
Influence of hyperparameter tuning on the performance of the Boosted Ensemble (BE) classification model. The plots display the relationship between key hyperparameters (i.e. tree depth, learn rate and sample size) and model performance, measured through Area Under the Precision-Recall Curve (AUPRC). The results indicate that performance improves with deeper trees and moderate learning rates up to an optimal threshold, beyond which accuracy stabilises or slightly declines, suggesting diminishing returns and potential overfitting.
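One way to turn such tuning results into a single configuration is sketched below in base R, using the values reported in the tuning table. The selection rule and the tolerance are hypothetical, since the text does not state how the final configuration was picked: among configurations whose AUPRC lies within a small tolerance of the best, the shallowest trees are preferred.

```r
# Tuning results transcribed from the hyperparameter table.
tuning <- data.frame(
  tree_depth  = c(1, 1, 5, 9, 10),
  learn_rate  = c(0.0021, 0.0198, 0.0694, 0.0439, 0.0578),
  sample_size = c(0.9412, 0.5119, 0.7472, 0.3298, 0.5160),
  auprc       = c(0.5827, 0.8005, 0.9137, 0.9126, 0.9138)
)

# Hypothetical rule: keep configurations within `tolerance` of the best
# AUPRC, then prefer the one with the shallowest trees.
tolerance <- 0.002
near_best <- tuning[tuning$auprc >= max(tuning$auprc) - tolerance, ]
chosen    <- near_best[which.min(near_best$tree_depth), ]

print(chosen)
```

Under this rule the tree_depth = 5 row is selected. Its metrics match the BE values reported in the model comparison table, although that agreement does not confirm that the authors applied this particular rule.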
Table
The relationship between the hyperparameters and the model's performance is further illustrated in Fig.
User Application and Visualisation of Results
The imanr package was designed with accessibility in mind, offering two primary interfaces for end users: (1) a command-line function in R and (2) a graphical interface developed with Shiny. Below, we provide examples of how the results are obtained and presented.
In the R environment, users can classify individual or batch samples by passing a dataframe with morphological and geographical information. A ready-to-use template is available in the GitHub repository of the project, which can be edited in any spreadsheet or text editor and then loaded into R. For example:
> library(imanr)
>
> # Read the template already filled with data
> InputSample <- read.csv("/home/user/Downloads/DataTemplate.csv")
> InputSample
>
> # Find the racial complex
> EstimatedComplex <- find_racial_complex(InputSample)
> EstimatedComplex
[1] Ocho hileras
7 Levels: Chapalote Cónico Dentados tropicales Ocho hileras Sierra de Chihuahua ... Ocho hileras
The function returns the predicted racial complex as a factor vector. In this example, the sample is classified as belonging to the "Ocho hileras" complex.
For non-programmers, a web-based interactive application is available at: https://arturosp.shinyapps.io/imanrWeb/. Fig.
Screenshot of the imanr Shiny application for the classification of native maize racial complexes. The user interface allows for manual data entry through numeric fields and dropdown menus corresponding to morphological and geographical variables. Upon submission, the application displays the predicted racial complex along with a reference image and a summary of key traits associated with the classification.
The results obtained highlight the effectiveness of the proposed model, which represents an innovative tool for conserving the genetic diversity of native maize. The performance of the BE model, with an AUPRC exceeding 0.9, positions it as a robust solution for the identification of racial complexes, significantly surpassing the limitations of traditional methods, such as the subjectivity and variability inherent in manual morphometric analyses (
Previous studies have demonstrated that BE methods, such as Gradient Boosting Machines and XGBoost, are powerful classification techniques in conservation biology due to their ability to handle complex, high-dimensional datasets and their iterative improvement mechanism. For instance,
Despite the positive results, it is important to consider the model's limitations that may affect the current version of the package. First, the model is currently restricted to identifying racial complexes. This leaves open the possibility of expanding its capabilities to classify primary races. Such an enhancement would have a significant impact on census and genetic diversity monitoring programmes, aligning with global initiatives for the preservation of native crops (
Third, the model assumes discrete boundaries between racial complexes. However, maize landraces exist along continuous phenotypic and genetic gradients, especially in transitional regions (
The accessibility of the imanr package has been enhanced through the development of an interactive graphical interface built with Shiny. This platform enables researchers and professionals without advanced programming knowledge to use the tool effectively. The interface allows users to upload morphological and geographical data, perform classifications and visualise results intuitively. By providing a user-friendly and accessible solution, the Shiny app democratises the use of the imanr package, fostering its adoption by a broader audience, including researchers, students and technicians in the agricultural field. This development aligns with studies that emphasise the importance of intuitive design in maximising the impact of technological innovations (
This project represents a pioneering step in automating the classification of native maize racial complexes in Mexico. Its development supports broader efforts to modernise agrobiodiversity monitoring and reinforces the importance of integrating computational tools in conservation strategies.
Furthermore, the BE model not only demonstrated high predictive accuracy, but also proved to be computationally efficient, delivering reliable classifications in short time frames, even with large datasets. This represents a significant advantage over traditional identification methods, which often depend on expert judgement, prolonged evaluation times and elevated costs. By offering a scalable and accessible solution, the implementation of the BE model through the imanr package constitutes a substantial advancement in the modernisation of native maize racial complex classification.
This study introduces imanr, a BE-based package designed for the automatic classification of native maize racial complexes in Mexico. With a balanced accuracy of 90% and a Cohen's Kappa coefficient of 0.84, imanr stands out for its reliability, efficiency and ability to overcome the limitations of traditional methods, offering an innovative and reproducible approach to the morphological and geographical identification of native maize.
The tool represents a significant advancement in modernising the management and conservation of maize biodiversity, providing an accessible technological solution for researchers and specialists. While imanr is a robust model, future enhancements, such as integrating genomic data and optimising its graphical interface, could expand its scope and utility.
The imanr package is publicly available through the following repositories and platforms:
CRAN Repository: https://cran.r-project.org/package=imanr
Official stable version of the imanr package for installation and use.
GitHub Repository (Development Version): https://github.com/rafa6174/imanr
Shiny Web Application: https://arturosp.shinyapps.io/imanrWeb/
Interactive interface for classifying native maize racial complexes using imanr.
These resources provide access to the package’s code, documentation and interface.
The authors would like to acknowledge SECIHTI for financial support and CONABIO and the PGMN for institutional inspiration. This paper was greatly improved by the insightful comments from Malte Hinsch and an anonymous reviewer.