One Ecosystem : R Package
imanr: An R tool for the identification of Mexican native maize complexes
Arturo Sanchez-Porras, Aline Romero-Natale§, Otilio Arturo Acevedo-Sandoval§, Edlin Guerra-Castro
‡ Escuela Nacional de Estudios Superiores Unidad Mérida, Mérida, Mexico
§ Universidad Autónoma del Estado de Hidalgo, Pachuca Hidalgo, Mexico
Open Access

Abstract

The conservation of the genetic diversity of native maize in Mexico is a priority due to its cultural, agricultural and environmental importance. This study presents the development and evaluation of the imanr package, a computational tool based on Boosted Ensembles designed to automate the classification of racial complexes of native maize. Using a national database, a model was implemented that leverages morphological and geographical variables to provide precise and rapid classifications. The methodology included the optimisation of key parameters through cross-validation, achieving up to 90% in balanced accuracy and a Cohen's Kappa coefficient of 0.84. These results highlight the robustness of the model compared to traditional methods, which rely on subjective expert judgement and require extended evaluation times. The findings demonstrate that the package not only surpasses conventional methods in terms of efficiency, but also offers an accessible tool for conserving and monitoring native maize diversity, aligning with the recommendations of the Global Maize Project (PGMN). Moreover, its usability was enhanced by developing a graphical user interface, allowing non-specialised users to fully utilise its potential. imanr represents a significant advancement in native maize conservation science, contributing to the modernisation of identification processes and strengthening sustainable management strategies for this essential genetic resource. This model directly addresses the need for innovative tools to monitor and preserve maize diversity in Mexico and suggests a promising pathway for future applications in the agricultural sector.

Keywords

agrobiodiversity management, Boosted Ensemble classification, genetic diversity monitoring, morphological and geographical data, native maize biodiversity

Introduction

Maize (Zea mays L.) represents an invaluable biocultural resource, particularly in Mexico, where its genetic, historical and cultural diversity is deeply intertwined with national identity and food security. The modern maize consumed today is the result of the domestication of its ancestor, known as teosinte (Zea mays ssp. parviglumis), a process initiated by Mesoamerican peoples between 9,000 and 5,000 BCE (Caballero-García et al. 2019, Gouttefanjat 2020). Over millennia, this co-evolution of domestication and migration has enabled farmers across diverse Mexican regions to develop unique maize landraces, adapted to specific local conditions and fulfilling nutritional, flavour and resilience demands for various climates and soils (González González et al. 2023, Reza-Solis et al. 2024).

The classification of Mexican maize has a long history that spans more than a century. By the end of the 19th and beginning of the 20th centuries, the first systematic attempts to classify maize globally emerged, notably the work of E. Lewis Sturtevant (Curry 2021) who categorised known varieties into six types, based on kernel morphology. However, the concept of maize race was not introduced until Anderson and Cutler (1942) incorporated morphological, ecological and cultural attributes into a more comprehensive classification. The consolidation of this approach came with the work of Wellhausen et al. (1951), who identified 25 races and three subraces, thus establishing a foundational taxonomy that has served as a reference for subsequent classification efforts.

During the second half of the 20th century, the number of recognised races increased due to new studies and collections led by researchers, such as Hernández Xolocotzi and Arias Reyes (Cruz-León et al. 2013). In this period, research on the isozymatic diversity of maize enabled the grouping of Mexican maize races into seven racial complexes, based on shared morphological traits, ecological adaptation and genetic similarity (Curry 2021). In 2011, the National Commission for Knowledge and Use of Biodiversity (CONABIO) published the results of the Global Maize Project (PGMN), providing a comprehensive update of this classification framework (CONABIO 2011). This initiative identified 64 maize landraces in the country, of which 59 were classified as native, while undescribed landraces were also suggested (Sánchez-González 2011). The resulting data formed the scientific basis for the official delimitation of centres of origin and genetic diversity in Mexican maize, in compliance with the Biosafety Law for Genetically Modified Organisms.

Accurate identification of native maize landraces in Mexico is essential for the conservation of adaptive genetic diversity. Each landrace represents a unique set of traits shaped by long-term selection under specific environmental and cultural conditions. The loss of any variety implies the irreversible disappearance of locally-adapted genotypes and agronomic strategies that may become vital in future contexts of climate change and food insecurity (Caldu-Primo et al. 2017, González-Martínez et al. 2020). Furthermore, native maize is deeply ingrained in the biocultural heritage of indigenous peoples, who have relied on maize not only as a staple crop, but also as a cultural keystone species. As such, native maize varieties are considered part of the cultural ecosystem services that maintain social identity, ritual practices and food sovereignty (Barrasa García 2017, Alpuche-Álvarez et al. 2019).

From an agroecological perspective, native maize forms the structural foundation of the traditional milpa system, which varies significantly across regions. Each maize variety interacts with companion crops, local soil conditions and climate, introducing differences in the functioning and resilience of the agroecosystem. Understanding and preserving maize diversity is, therefore, not only a matter of conserving genetic diversity, but also of conserving entire ecological and cultural systems (Fonteyne et al. 2023, Romero-Natale et al. 2024).

Despite its importance, the identification of maize landraces still relies on time-consuming and expert-driven procedures. Classical approaches use dichotomous keys and reference collections, based on morphological descriptors first systematised by Wellhausen et al. (1951), which are often inaccessible or difficult to use for non-specialists. Although local communities maintain their own ethnobotanical classification systems, these are not integrated into formal conservation frameworks. Molecular identification methods are available (Arteaga et al. 2016, Caldu-Primo et al. 2017), yet their adoption is limited due to cost and infrastructure barriers.

In this context, the imanr package was developed to ease access to maize landrace identification through a computational approach grounded in machine-learning (ML) models, which have proven to be robust and efficient tools for classification and prediction in various areas of biology and ecology (Kuhn and Johnson 2013, Kuhn and Silge 2022, Pichler and Hartig 2023). Developed using a Boosted Ensemble model trained on the PGMN database, imanr provides automated, accurate and reproducible classification of racial complexes, based on simple inputs such as geographical coordinates and accessible morphological traits. The algorithm captures complex non-linear relationships amongst variables to produce a reliable estimation of racial complex identity (Nieves-Alvarez et al. 2024). This tool represents a step forward in modernising biodiversity monitoring, while reducing dependence on specialised personnel or costly molecular tools. Its availability as an open source R package and a ready-to-use web application furthers its potential contribution to both technological democratisation and biocultural conservation.

This article introduces the imanr package as an accessible and versatile tool for farmers, scientists and stakeholders interested in studying and preserving native maize in Mexico. It describes the underlying methodology, model performance and practical implementation in R, highlighting its ability to integrate qualitative and quantitative traits for identifying racial complexes. Beyond improving the precision and efficiency of native maize racial complex identification, this tool directly promotes strategies for conserving genetic diversity. Moreover, it reinforces the recognition of native maize as an invaluable biocultural heritage linked to traditional agricultural systems that sustain essential ecological and cultural practices for agroecosystem resilience.

Materials and Methods

Dataset: Global Maize Project (PGMN). The development of the imanr package was based on data collected by the Global Maize Project (PGMN), managed by the National Institute for Forestry, Agriculture and Livestock Research (INIFAP) and available through the website of the National Commission for Knowledge and Use of Biodiversity (CONABIO 2011). Two main datasets were used:

  • 2010 Database: This dataset includes 22,932 records with geographic information, producer data, agricultural practices and detailed morphological measurements of ears and plants.

  • 2017 Database: This dataset comprises 25,861 records containing only georeferenced information, race and racial complex classification.

Given the greater diversity of variables available, the 2010 database was selected as the primary source of training data for this model.

Data Selection and Processing

A panel of three experts independently evaluated the available variables to identify those most relevant for ensuring quality and consistency in the classification of racial complexes. The following criteria were established:

  1. Proportion of complete records: Variables with a high percentage of missing data, such as the number of ears per plant or grain moisture content, were excluded.

  2. Direct relevance: Variables related to descriptors or cultivation practices that did not provide inherent information about maize characteristics were discarded.

  3. Format consistency: Variables such as planting and flowering dates were eliminated due to inconsistencies in data entry.

The variables considered for inclusion were those with at least 35% complete records. Fig. 1 shows the completeness of data for each variable across racial complexes in the 2010 PGMN database, grouped into three panels: (a) geographic, (b) qualitative and (c) quantitative variables. The figure reveals that all geographic and qualitative variables have a high level of completeness for data across all racial complexes, while the completeness of quantitative variables varies considerably. Notably, variables such as ears per plant and kernel length show low completeness, supporting their exclusion from model training. This visual summary facilitated the identification of reliable variables for classification purposes.
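The completeness criterion can be illustrated with a short base R sketch. The data frame and its column names are invented for illustration and are not the PGMN variables; only the 35% threshold comes from the text.

```r
# Hypothetical toy records; NA marks a missing measurement
records <- data.frame(
  latitude       = c(19.4, 20.1, 18.9, 21.3, 19.8),
  ear_length     = c(15.2, NA, 14.8, 16.1, NA),
  ears_per_plant = c(NA, NA, NA, 2, NA)
)

# Proportion of complete (non-missing) records per variable
completeness <- colMeans(!is.na(records))

# Keep only variables with at least 35% complete records
min_complete <- 0.35
kept <- names(completeness)[completeness >= min_complete]
kept
```

In this toy case the sparsely recorded ears_per_plant column (20% complete) is dropped, while the other two columns are retained. The same calculation could be repeated within each racial complex to obtain a per-complex summary like the one shown in Fig. 1.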

Figure 1.

Proportion of data completeness by variable across racial complexes in the 2010 PGMN database. Variables are grouped by type into three panels: (a) geographic, (b) qualitative and (c) quantitative. Colour intensity reflects the percentage of complete records, with darker blue tones indicating higher data completeness. This visualisation provided information for the selection of variables for model development by highlighting missing data patterns.

The final set of selected variables included a mix of geographic, qualitative and quantitative data, as detailed in Table 1. These variables were chosen for their potential to provide robust and relevant information for classifying maize racial complexes.

Table 1.

Final set of variables selected for training and applying the classification models. Variables were chosen based on completeness, relevance to racial complex identification and consistency across records in the 2010 PGMN dataset.

Category        | Variable
Identification  | ID, primary race, racial complex.
Geographic      | Latitude, longitude, altitude.
Qualitative     | Grain colour, cob colour, stem colour, row arrangement, ear shape, grain type.
Quantitative    | Ear length, ear diameter, ear diameter-to-length ratio, rows per ear.

To ensure compatibility with classification algorithms, qualitative variables were numerically encoded, while quantitative variables were normalised to prevent biases arising from differences in scale. Missing values were handled through mean imputation for quantitative variables and mode imputation for qualitative variables, ensuring consistency and minimising the impact of incomplete data.
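The preprocessing steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used in imanr: the toy data and column names are hypothetical, z-score standardisation is assumed for "normalised" (the text does not specify the method) and integer codes are assumed for the numerical encoding. The recipes package of tidymodels offers comparable steps, such as step_impute_mean(), step_impute_mode() and step_normalize().

```r
# Hypothetical toy sample; variable names are illustrative only
maize <- data.frame(
  ear_length   = c(14.0, NA, 18.0, 16.0),
  grain_colour = c("white", "yellow", NA, "white"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector (table() ignores NA by default)
stat_mode <- function(v) names(which.max(table(v)))

# Mean imputation for the quantitative variable
maize$ear_length[is.na(maize$ear_length)] <- mean(maize$ear_length, na.rm = TRUE)

# Mode imputation for the qualitative variable
maize$grain_colour[is.na(maize$grain_colour)] <- stat_mode(maize$grain_colour)

# Normalise the quantitative variable (zero mean, unit standard deviation)
maize$ear_length_z <- as.numeric(scale(maize$ear_length))

# Integer encoding of the qualitative variable (levels sorted alphabetically)
maize$grain_colour_code <- as.integer(factor(maize$grain_colour))
```

In a train/test workflow, the imputation values and scaling parameters would normally be estimated on the training data only and then applied unchanged to the test data, so that no information leaks from the held-out samples.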

Classification Algorithms Tested

Multiple approaches can be employed to address a classification problem such as the one presented in this study. In the development of the imanr package, we explored a range of supervised learning techniques, from simple probabilistic classifiers to advanced ensemble models. Each algorithm was selected to represent a distinct strategy in classification, providing a diverse spectrum of capabilities for handling data. The algorithms considered include:

  1. Naive Bayes (NB): A simple probabilistic classifier that assumes independence amongst predictors. While this assumption rarely holds in real-world datasets, NB is computationally efficient and provides a useful performance baseline. It is known to struggle with correlated features or overlapping distributions (Majka 2017).

  2. K-Nearest Neighbours (KNN): An instance-based method that classifies samples based on the majority class amongst their nearest neighbours in the feature space. KNN is intuitive and adapts well to non-linear class boundaries, although it is sensitive to feature scaling, noise and high dimensionality (Schliep et al. 2004).

  3. Support Vector Machines (SVM): A powerful method for high-dimensional data, SVM finds optimal hyperplanes that maximise the margin between classes. Its performance depends on appropriate selection of kernels and regularisation parameters, especially in multiclass problems (Karatzoglou et al. 2004).

  4. Random Forests (RF): A robust ensemble of decision trees built using bootstrap samples and random subsets of features. RF is well-known for reducing overfitting and handling multicollinearity and is particularly effective for high-dimensional and mixed-type data (Wright et al. 2015).

  5. Boosted Ensembles (BE): An iterative ensemble method that sequentially builds trees, each one correcting the errors of its predecessor. Boosting achieves high accuracy and generalisation, making it suitable for complex data interactions (Chen et al. 2014). Based on preliminary evaluations, this method was selected as the core classifier in imanr.

  6. Neural Networks (NN): Modelled after biological networks, NNs are capable of learning complex non-linear relationships. We implemented a multilayer perceptron (MLP) variant, which is flexible, but requires careful tuning and sufficient data to avoid overfitting (Ripley and Venables 2009).

  7. Bagged Trees (BT): A variant of ensemble learning in which multiple decision trees are trained independently on different bootstrapped subsets of the data and their predictions are averaged. Unlike RF, BT does not randomly select features at each split, focusing solely on variance reduction (Therneau et al. 1999).

Each algorithm possesses different advantages: NB is fast and interpretable, KNN is non-parametric and locally adaptive, SVM is effective in high-dimensional spaces, RF and BT offer ensemble robustness, BE maximises predictive performance through iterative correction and NN brings flexibility for capturing subtle patterns (Thessen 2016, Ghatak 2017, Ramasubramanian and Singh 2019). All models were trained using the tidymodels framework (Kuhn and Wickham 2018), with consistent preprocessing and hyperparameter tuning via cross-validation. The evaluation employed metrics such as Balanced Accuracy, F1-score, Cohen’s Kappa and the Area Under the Precision-Recall Curve (AUPRC) to ensure robust comparison.

Model Training and Validation

The dataset was randomly split into 75% for training and 25% for testing. Although other proportions such as 70/30 are commonly used, the exact split is less critical in this case due to the large size of the dataset and the additional use of k-fold cross-validation within the training phase. Cross-validation (CV) provides more robust and generalisable estimates of model performance compared to single holdout validation sets (Ramasubramanian and Singh 2019, James et al. 2021). Specifically, five-fold CV was applied to tune the hyperparameters of each algorithm, ensuring that all training observations were used both for training and for internal validation at different stages of the procedure. No separate validation set was used, as five-fold cross-validation was performed on the training set to optimise hyperparameters before final model evaluation on the test set. This method ensures robust hyperparameter tuning, while maximising the amount of data available for final testing and performance evaluation (Ghatak 2017, Ramasubramanian and Singh 2019).
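The splitting scheme can be sketched in base R as shown below. The number of records and the seed are arbitrary assumptions; in a tidymodels workflow, rsample functions such as initial_split() and vfold_cv() serve the same purpose.

```r
set.seed(2024)  # arbitrary seed, only to make the sketch reproducible

n   <- 1000        # hypothetical number of records
idx <- sample(n)   # random permutation of the row indices

# 75% of rows for training, the remaining 25% held out for testing
n_train   <- floor(0.75 * n)
train_idx <- idx[seq_len(n_train)]
test_idx  <- idx[-seq_len(n_train)]

# Assign each training row to one of five folds of equal size
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n_train))

# In fold i, rows with fold == i validate a model fitted on the other four folds
for (i in seq_len(k)) {
  analysis_rows   <- train_idx[fold != i]
  assessment_rows <- train_idx[fold == i]
  # fit with candidate hyperparameters on analysis_rows,
  # score on assessment_rows, then average the k scores
}
```

The test rows never enter the loop, which is what allows them to provide an independent estimate of performance once the hyperparameters have been chosen.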

To properly interpret the results of the classification experiments, it is essential to consider multiple performance metrics, especially given the multi-class and imbalanced nature of the dataset. Four complementary metrics were selected to evaluate and compare the tested models:

  • Balanced accuracy measures the average recall across all classes, assigning equal importance to each regardless of their frequency. This metric is particularly suitable for datasets with class imbalance, as it prevents majority classes from dominating the evaluation (Thölke et al. 2023).
  • F1-score is the harmonic mean of precision and recall, balancing the trade-off between avoiding false positives and false negatives. It is useful when both types of error have consequences, as in agrobiodiversity classification tasks (Rainio et al. 2024).
  • Cohen's Kappa evaluates the agreement between predicted and true labels, adjusted for chance agreement based on class prevalence. It ranges from -1 to 1, with values above 0.80 considered to reflect "almost perfect agreement" (Rainio et al. 2024).
  • Area Under the Precision-Recall Curve (AUPRC) is particularly informative for imbalanced datasets, as it focuses on the model's ability to correctly identify minority classes (Boyd et al. 2013). In this study, we computed a macro-averaged AUPRC by treating each class as "positive" in a one-vs.-all setting.
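Three of the metrics above can be computed directly from a confusion matrix, as in the following base R sketch. The matrix is invented for illustration, and the definitions follow the descriptions in the list (balanced accuracy as mean per-class recall, macro-averaged F1); library implementations may average differently in the multi-class case. AUPRC is omitted because it requires predicted class probabilities rather than counts.

```r
# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm <- matrix(
  c(50,  5,  0,
     4, 30,  6,
     1,  2, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(truth = c("A", "B", "C"), predicted = c("A", "B", "C"))
)

n         <- sum(cm)
tp        <- diag(cm)            # correctly classified samples per class
recall    <- tp / rowSums(cm)    # share of each true class that was recovered
precision <- tp / colSums(cm)    # share of each predicted class that was correct

# Balanced accuracy as described in the text: mean recall over classes
balanced_accuracy <- mean(recall)

# Macro F1: harmonic mean of precision and recall per class, then averaged
f1_macro <- mean(2 * precision * recall / (precision + recall))

# Cohen's Kappa: observed agreement corrected for chance agreement
p_observed  <- sum(tp) / n
p_expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)
```

Because the rare class C counts as much as the frequent class A in the mean recall, balanced accuracy here is lower than the plain accuracy p_observed, which is the behaviour that makes it useful for imbalanced data.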

The tuning process prioritised different performance metrics to account for the characteristics of the dataset and the classification goals. Models like RF and NB were tuned using Balanced Accuracy to optimise the trade-off between sensitivity and specificity across classes. Conversely, models such as SVM, BE and NN were tuned using the Area Under the Precision-Recall Curve (AUPRC) to enhance performance in detecting minority classes, given its importance in this study. Additionally, KNN was tuned using the F1-score, as it effectively combines precision and recall, making it well-suited for evaluating the model’s handling of imbalanced data. Finally, BT was tuned using Cohen’s Kappa, which measures the agreement between predicted and actual classes while accounting for overall performance and class-specific errors. This tailored approach ensured that each model was optimised for its specific strengths and the requirements of the dataset.

After tuning, the final model was re-trained on the full 75% training set using the optimal hyperparameters. Its generalisation capacity was then evaluated on the reserved 25% test set, which had not been used during model selection. This process follows best practices in predictive modelling workflows by balancing thorough internal validation with independent performance verification.

Implementation in R and Package Design

The imanr package was developed using the Boosted Ensemble (BE) algorithm as the primary model, based on its superior performance in the comparative analysis of metrics, as well as its inherent ability to handle complex, high-dimensional datasets with multiple features. This approach is well-known for its ability to iteratively enhance model performance by focusing on misclassified instances, resulting in superior accuracy, robustness and generalisation (Chen and Guestrin 2016). The BE methodology combines predictions from multiple weak learners, typically decision trees, to produce a strong predictive model, making it particularly suitable for classifying native maize racial complexes, where data imbalance and feature interactions play a critical role (Natekin and Knoll 2013). Model parameters, such as the number of iterations, learning rate and maximum tree depth, were optimised through a five-fold cross-validation procedure to maximise accuracy while minimising overfitting and model variance (Probst et al. 2019).
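The idea of combining weak learners sequentially can be illustrated with a toy gradient-boosting loop in base R. This is a didactic sketch, not the xgboost algorithm used by imanr: it uses one simulated predictor, a binary outcome, decision stumps as weak learners and an arbitrary learning rate and number of rounds. All names and values are hypothetical.

```r
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- as.integer(x^2 > 1)   # class 1 lies on both sides; one stump cannot separate it

# Fit a decision stump to residuals r: choose the cut that minimises squared error
fit_stump <- function(x, r) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbouring values
  best <- NULL
  best_sse <- Inf
  for (cut in cuts) {
    left <- x <= cut
    m_left  <- mean(r[left])
    m_right <- mean(r[!left])
    sse <- sum((r[left] - m_left)^2) + sum((r[!left] - m_right)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = m_left, right = m_right)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

learn_rate <- 0.3
n_rounds   <- 50
score <- rep(0, n)   # running prediction on the log-odds scale

for (m in seq_len(n_rounds)) {
  p     <- 1 / (1 + exp(-score))   # current predicted probability of class 1
  resid <- y - p                   # negative gradient of the log-loss
  stump <- fit_stump(x, resid)     # weak learner fitted to the current errors
  score <- score + learn_rate * predict_stump(stump, x)
}

accuracy <- mean((score > 0) == (y == 1))
accuracy
```

Each round fits a stump to the residuals left by the rounds before it, so samples that are still poorly predicted have the largest influence on the next learner. The learning rate, the number of rounds and the depth of the trees are the kind of parameters that the text describes as being tuned by cross-validation.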

Regarding the package design, the xgboost (Chen et al. 2014) library was used to implement the model, while the tidymodels (Kuhn and Wickham 2018) framework was employed for preprocessing, metrics evaluation, training and hyperparameter tuning. The structure of imanr includes a main function that accepts morphological and geographical data as input and returns the classification of the racial complex. Additionally, the package incorporates an imputation function that can help the user in the case of missing data. This design makes imanr a versatile and accessible tool for users with different levels of programming and statistical analysis experience.

External Validation and Reproducibility

External validation was conducted using individual maize samples that were collected in a project to document the native maize from the Otomi-Tepehua Region. The imanr package is publicly available on CRAN at https://cran.r-project.org/package=imanr, accompanied by detailed documentation and reproducible examples to ensure its accessibility and usability for the scientific community and other users interested in native maize classification. The development version of the package is open source and available on GitHub at https://github.com/rafa6174/imanr.

Development of an Interactive Interface

To enhance the accessibility of the imanr package for users without programming expertise, an interactive graphical interface was developed with Shiny and hosted on shinyapps.io. This application, available at https://arturosp.shinyapps.io/imanrWeb/, allows users to upload morphological and geographical data, perform classifications and intuitively visualise results. The interface democratises the use of the package by eliminating the need for any knowledge of R, making the tool more inclusive and efficient for researchers, students and technicians in the agricultural field.

Results

Model comparison and selection

After training and testing the models, their performance was evaluated using four metrics (i.e. Balanced Accuracy, F1-score, Cohen’s Kappa and AUPRC). Together, these provide an integral view of model performance, as relying on a single indicator can lead to misleading conclusions, especially in multi-class settings. Fig. 2 and Table 2 highlight the BE model as the best performer, achieving the highest Balanced Accuracy (0.904), F1-score (0.846) and Cohen’s Kappa (0.838), with a high value for the AUPRC (0.914). In contrast, the NB model performed poorly across all metrics, with the lowest Balanced Accuracy (0.535) and F1-score (0.173), reflecting its limitations in handling high-dimensional and imbalanced data.

Table 2.

Quantitative performance metrics of seven supervised classification models tested for their ability to predict maize racial complexes. The metrics include Balanced Accuracy, F1-score, Cohen’s Kappa and the Area Under the Precision-Recall Curve (AUPRC), particularly relevant for imbalanced datasets.

Model | Balanced accuracy | F1-score | Cohen’s Kappa | AUPRC
BE    | 0.904             | 0.846    | 0.838         | 0.914
RF    | 0.893             | 0.837    | 0.832         | 0.916
BT    | 0.887             | 0.813    | 0.813         | 0.887
KNN   | 0.863             | 0.792    | 0.770         | 0.847
SVM   | 0.858             | 0.771    | 0.693         | 0.558
NN    | 0.853             | 0.747    | 0.752         | 0.799
NB    | 0.535             | 0.173    | 0.062         | 0.458

Figure 2.

Comparative performance of seven supervised learning algorithms evaluated for the classification of native maize racial complexes using the 2010 PGMN dataset. The bar plots display Balanced Accuracy, F1-Score, Cohen’s Kappa and AUPRC scores for each model, highlighting the superior performance of the Boosted Ensemble (BE) method.

RF and BT showed high performance, close behind BE, with RF achieving a Balanced Accuracy of 0.893 and an AUPRC of 0.916. These results suggest that models based on multiple decision trees are particularly well suited to this classification problem. KNN and SVM also showed strong performance, particularly in Balanced Accuracy (0.863 and 0.858, respectively). NN performed moderately well with an AUPRC of 0.799, but lagged in Cohen’s Kappa (0.752). These results underscore the robustness of BE across metrics, making it the most reliable model for classifying native maize racial complexes.
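The comparison can be reproduced from the values reported in Table 2 with a few lines of base R. The mean rank across metrics is only one possible way of summarising the table and is an assumption of this sketch, not a criterion stated by the authors.

```r
# Values transcribed from Table 2
table2 <- data.frame(
  model             = c("BE", "RF", "BT", "KNN", "SVM", "NN", "NB"),
  balanced_accuracy = c(0.904, 0.893, 0.887, 0.863, 0.858, 0.853, 0.535),
  f1_score          = c(0.846, 0.837, 0.813, 0.792, 0.771, 0.747, 0.173),
  cohen_kappa       = c(0.838, 0.832, 0.813, 0.770, 0.693, 0.752, 0.062),
  auprc             = c(0.914, 0.916, 0.887, 0.847, 0.558, 0.799, 0.458),
  stringsAsFactors  = FALSE
)

metrics <- c("balanced_accuracy", "f1_score", "cohen_kappa", "auprc")

# Best model for each metric
best_per_metric <- sapply(table2[metrics], function(v) table2$model[which.max(v)])
best_per_metric

# Mean rank across the four metrics (1 = best)
ranks <- sapply(table2[metrics], function(v) rank(-v))
table2$mean_rank <- rowMeans(ranks)
table2[order(table2$mean_rank), c("model", "mean_rank")]
```

BE leads on three of the four metrics and RF leads on AUPRC, which matches the description of the two tree-based ensembles as the closest competitors.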

Final Model Implementation

The BE model implemented with the xgboost package (Chen and Guestrin 2016) demonstrated the best performance amongst the evaluated methods for identifying native maize racial complexes. The detailed results of the model tuning are presented in Fig. 3 and Table 3, highlighting an Area Under the Precision-Recall Curve of 0.914 and a Balanced Accuracy of up to 90.4% with a tree_depth value of 10.

Table 3.

Results of hyperparameter tuning for the Boosted Ensemble (BE) classification model using the xgboost implementation. The table presents performance metrics (i.e. Balanced Accuracy, F1-score, Cohen’s Kappa and AUPRC) for different combinations of tree depth, learn rate and sample size. These results were used to determine the optimal configuration that balances model complexity, generalisation and computational efficiency.

tree_depth | learn_rate | sample_size | Balanced accuracy | F1-score | Cohen’s Kappa | AUPRC
1          | 0.0021     | 0.9412      | 0.6749            | 0.5828   | 0.5748        | 0.5827
1          | 0.0198     | 0.5119      | 0.8073            | 0.6817   | 0.7243        | 0.8005
5          | 0.0694     | 0.7472      | 0.9043            | 0.8456   | 0.8378        | 0.9137
9          | 0.0439     | 0.3298      | 0.9024            | 0.8434   | 0.8368        | 0.9126
10         | 0.0578     | 0.5160      | 0.9036            | 0.8451   | 0.8369        | 0.9138

Figure 3.

Influence of hyperparameter tuning on the performance of the Boosted Ensemble (BE) classification model. The plots display the relationship between key hyperparameters (i.e. tree depth, learn rate and sample size) and model performance, measured through Area Under the Precision-Recall Curve (AUPRC). The results indicate that performance improves with deeper trees and moderate learning rates up to an optimal threshold, beyond which accuracy stabilises or slightly declines, suggesting diminishing returns and potential overfitting.

Table 3 summarises model performance across varying values of key hyperparameters, including tree_depth, learn_rate and sample_size. For tree_depth, AUPRC and Balanced Accuracy show an upward trend as the depth increases from 1 to 5. Beyond this point, the performance plateaus, suggesting that deeper trees provide diminishing returns due to potential overfitting. Similarly, Fig. 3 illustrates the relationships between AUPRC and the hyperparameters, indicating that learn_rate, sample_size and tree_depth values of approximately 0.058, 0.516 and 10, respectively, are the optimal parameters for training the final model.

The relationship between the hyperparameters and the model's performance is further illustrated in Fig. 3. The graph reveals that increasing learn_rate up to approximately 0.06 results in significant improvements, after which the performance stabilises. Similarly, sample_size values from around 0.5 up to 0.75 achieve optimal AUPRC, with lower and higher values reducing performance. Lastly, the tree_depth parameter strongly impacts performance, with the most significant gains observed as depth increases to 5 and performance stabilising around 10.
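The selection of the final configuration can be traced from the values reported in Table 3. The sketch below assumes that the configuration is chosen by maximum AUPRC, the metric that the text reports as the tuning criterion for BE.

```r
# Values transcribed from Table 3
table3 <- data.frame(
  tree_depth        = c(1, 1, 5, 9, 10),
  learn_rate        = c(0.0021, 0.0198, 0.0694, 0.0439, 0.0578),
  sample_size       = c(0.9412, 0.5119, 0.7472, 0.3298, 0.5160),
  balanced_accuracy = c(0.6749, 0.8073, 0.9043, 0.9024, 0.9036),
  f1_score          = c(0.5828, 0.6817, 0.8456, 0.8434, 0.8451),
  cohen_kappa       = c(0.5748, 0.7243, 0.8378, 0.8368, 0.8369),
  auprc             = c(0.5827, 0.8005, 0.9137, 0.9126, 0.9138)
)

# Configuration with the highest AUPRC
best <- table3[which.max(table3$auprc), ]
best

# AUPRC shortfall of every configuration relative to the best one
table3$auprc_gap <- max(table3$auprc) - table3$auprc
table3[, c("tree_depth", "learn_rate", "sample_size", "auprc_gap")]
```

The row with tree_depth 10 is selected, but the row with tree_depth 5 trails it by only 0.0001 in AUPRC, which is consistent with the plateau described above for depths beyond 5.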

User Application and Visualisation of Results

The imanr package was designed with accessibility in mind, offering two primary interfaces for end users: (1) a command-line function in R and (2) a graphical interface developed with Shiny and hosted on shinyapps.io. Below, we provide examples of how the results are obtained and presented.

In the R environment, users can classify individual or batch samples by passing a dataframe with morphological and geographical information. A ready-to-use template is available in the GitHub repository of the project, which can be edited in any spreadsheet or text editor and then loaded into R. For example:

> library(imanr)
>
> # Read the template already filled with data
> InputSample <- read.csv("/home/user/Downloads/DataTemplate.csv")
> InputSample
>
> # Find the racial complex
> EstimatedComplex <- find_racial_complex(InputSample)
> EstimatedComplex
[1] Ocho hileras
7 Levels: Chapalote Cónico Dentados tropicales Ocho hileras Sierra de Chihuahua ... Ocho hileras

The function returns the predicted racial complex as a factor vector. In this example, the sample is classified as belonging to the "Ocho hileras" complex.

For non-programmers, a web-based interactive application is available at: https://arturosp.shinyapps.io/imanrWeb/. Fig. 4 presents a screenshot of the webapp interface, illustrating its intuitive design and output display. The app guides the user through the input process: numeric values are entered directly, while qualitative variables can be selected from predefined dropdown menus. Once the input is complete, pressing the "Run" button produces the predicted racial complex, which is displayed in the right-hand panel along with a representative image. To enhance interpretability of the results, the lower panel provides a summary of the typical characteristics associated with the predicted complex.

Figure 4.

Screenshot of the imanr Shiny application for the classification of native maize racial complexes. The user interface allows for manual data entry through numeric fields and dropdown menus corresponding to morphological and geographical variables. Upon submission, the application displays the predicted racial complex along with a reference image and a summary of key traits associated with the classification.

Discussion

The results obtained highlight the effectiveness of the proposed model, which represents an innovative tool for conserving the genetic diversity of native maize. The performance of the BE model, with an AUPRC exceeding 0.9, positions it as a robust solution for the identification of racial complexes, significantly surpassing the limitations of traditional methods, such as the subjectivity and variability inherent in manual morphometric analyses (Vega-Álvarez et al. 2022).

Previous studies have demonstrated that BE methods, such as Gradient Boosting Machines and XGBoost, are powerful classification techniques in conservation biology due to their ability to handle complex, high-dimensional datasets and their iterative improvement mechanism. For instance, Ghafarian et al. (2022) discuss the superior classification accuracy of BE methods in ecological modelling, highlighting their adaptability to imbalanced data. Similarly, Ren et al. (2020) emphasise the effectiveness of BE in pollution distribution modelling, where the iterative nature of the algorithm is particularly beneficial for preventing overfitting and creating an efficient routine. In another study, Wieland et al. (2021) explore the application of BE methods in habitat suitability modelling, demonstrating their ability to integrate diverse ecological variables, while maintaining robust predictive performance. These findings align with the results presented in this study, showcasing BE as a highly suitable approach for complex biological datasets, such as the classification of native maize racial complexes.

Despite the positive results, several limitations of the current version of the package should be considered. First, the model is currently restricted to identifying racial complexes. This leaves open the possibility of expanding its capabilities to classify primary races. Such an enhancement would have a significant impact on census and genetic diversity monitoring programmes, aligning with global initiatives for the preservation of native crops (González-Martínez et al. 2020). Second, the training data used in this study were obtained from the 2010 Global Maize Project (CONABIO 2011). While this dataset is the most comprehensive available for Mexican native maize, it inevitably reflects the state of landrace diversity at the time of collection. Landraces developed or identified after 2010 may therefore not be represented in the model. Future efforts could benefit from integrating more recent data to capture such dynamics.

Third, the model assumes discrete boundaries between racial complexes. However, maize landraces exist along continuous phenotypic and genetic gradients, especially in transitional regions (Arteaga et al. 2016, Pace et al. 2024). This racial mixture may result in classification ambiguities, where samples near the margins of two complexes are misclassified. Fourth, while the model achieved high performance and generalisation during cross-validation, there remains a risk of overfitting to the specific structure and distribution of the training data. This was mitigated by using cross-validation with careful hyperparameter tuning (Ghatak 2017, James et al. 2021), yet true external validation across independent and temporally updated datasets is still limited. Lastly, retraining the model with updated or region-specific datasets, while technically feasible, involves considerable computational cost. Users seeking to adapt imanr to localised datasets must consider the resources required to preprocess, tune and validate a new model instance.
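As an illustration of the cross-validation safeguard mentioned above, the following minimal sketch shows how samples might be divided into class-balanced folds so that each racial complex is represented in every held-out set. It is a generic example and not the procedure implemented in imanr: the function names, the number of folds and the class labels "A", "B" and "C" are all hypothetical.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Split sample indices into k folds with similar class proportions.

    Indices are dealt round-robin into the folds, class by class, so the
    per-class counts of any two folds differ by at most one.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0  # running counter so that small classes spread over folds
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def train_test_splits(labels, k):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_k_fold(labels, k)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            index
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for index in fold
        )
        yield train, test


# Hypothetical labels: three classes of unequal size (an imbalanced dataset).
example_labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 2
for train, test in train_test_splits(example_labels, k=3):
    print(len(train), "training samples;", len(test), "held-out samples")
```

In a tuning loop, each candidate hyperparameter setting would be fitted on every training split and scored on the corresponding held-out split, and the setting with the best average score would be retained. External validation on an independent dataset remains a separate step.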

The accessibility of the imanr package has been enhanced through the development of an interactive graphical interface using ShinyApps. This platform enables researchers and professionals without advanced programming knowledge to use the tool effectively. The interface allows users to upload morphological and geographical data, perform classifications and visualise results intuitively. By providing a user-friendly and accessible solution, the ShinyApp democratises the use of the imanr package, fostering its adoption by a broader audience, including researchers, students and technicians in the agricultural field. This development aligns with studies that emphasise the importance of intuitive design in maximising the impact of technological innovations (Menegidio et al. 2019, Nishizawa et al. 2020).

This project represents a pioneering step in automating the classification of native maize racial complexes in Mexico. Its development supports broader efforts to modernise agrobiodiversity monitoring and reinforces the importance of integrating computational tools in conservation strategies.

Furthermore, the BE model not only demonstrated high predictive accuracy, but also proved to be computationally efficient, delivering reliable classifications in short time frames, even with large datasets. This represents a significant advantage over traditional identification methods, which often depend on expert judgement, prolonged evaluation times and elevated costs. By offering a scalable and accessible solution, the implementation of the BE model through the imanr package constitutes a substantial advancement in the modernisation of native maize racial complex classification.

Conclusion

This study introduces imanr, a BE-based package designed for the automatic classification of native maize racial complexes in Mexico. With a balanced accuracy of 90% and a Cohen's Kappa coefficient of 0.83, imanr stands out for its reliability, efficiency and ability to overcome the limitations of traditional methods, offering an innovative and reproducible approach to the morphological and geographical identification of native maize.
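For readers less familiar with the two metrics reported above, the sketch below computes them from a confusion matrix. It assumes one common definition of balanced accuracy (the mean of per-class recall) and the standard form of Cohen's Kappa; the exact computation used for imanr may differ, and the two-class matrix is invented purely for illustration.

```python
def balanced_accuracy(confusion):
    """Mean per-class recall; rows are true classes, columns are predictions."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total:  # skip classes that never occur in the reference labels
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)


def cohen_kappa(confusion):
    """Agreement between predictions and reference, corrected for chance."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected if predictions were independent of the true class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix for two classes (not data from this study).
example = [
    [40, 10],  # true class 1: 40 correct, 10 misclassified
    [5, 45],   # true class 2: 5 misclassified, 45 correct
]
print(round(balanced_accuracy(example), 2), round(cohen_kappa(example), 2))
```

For this invented matrix the per-class recalls are 0.80 and 0.90, giving a balanced accuracy of 0.85, while the chance-corrected agreement is 0.70. Kappa is lower because half of the observed agreement would be expected by chance alone.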

The tool represents a significant advancement in modernising the management and conservation of maize biodiversity, providing an accessible technological solution for researchers and specialists. While imanr is a robust model, future enhancements, such as integrating genomic data and optimising its graphical interface, could expand its scope and utility.

Web location (URIs) and repository

The imanr package is publicly available through the following repositories and platforms:

These resources provide access to the package’s code, documentation and interface.

Acknowledgements

The authors would like to acknowledge SECIHTI for financial support, and CONABIO and the PGMN for institutional inspiration. This paper was greatly improved by the insightful comments from Malte Hinsch and an anonymous reviewer.

Conflicts of interest

The authors have declared that no competing interests exist.

References
