Glacier mass balance is typically estimated using a range of in situ measurements, remote sensing measurements, and physical and temperature index modelling techniques. With improved data collection and access to large datasets, data-driven techniques have recently gained prominence in modelling natural processes. The most common data-driven techniques used today are linear regression models and, to some extent, non-linear machine learning models such as artificial neural networks. However, the full range of machine learning capabilities has yet to be applied to glacier mass balance modelling. This study used monthly meteorological data from ERA5-Land to drive four machine learning models: random forest (ensemble tree type), gradient-boosted regressor (ensemble tree type), support vector machine (kernel type), and artificial neural network (neural type). We also use ordinary least squares linear regression as a baseline against which to compare the performance of the machine learning models. Further, we assess the data requirements of each model and the need for hyperparameter tuning. Finally, the importance of each meteorological variable in the mass balance estimation is assessed for each model using permutation importance. All machine learning models outperform the linear regression model. The neural network model showed a low bias, suggesting the possibility of enhanced results in the event of biased input data. However, the ensemble tree-based models, random forest and gradient-boosted regressor, outperformed all other models in terms of the evaluation metrics and interpretability of the meteorological variables. The gradient-boosted regression model depicted the best coefficient of determination value of

We can visualize glaciers as interactive climate-response systems, with their response described by changes in glacial mass over a given period (e.g.

With increasing data points available, a new set of data-driven techniques has gained prominence in various domains of Earth sciences. For example, weather prediction (for a review, see

Through this study, we assess the ability of ML models to estimate annual point mass balance. We use an example of each of the following classes of ML models: ensemble regression tree based, kernel based, neural network based, and linear models. Under the ensemble regression tree class, we chose one example each of a boosted and an unboosted model. Specifically, we compare the performance of the random forest (RF), gradient-boosted regressor (GBR), support vector machine (SVM), and artificial neural network (ANN) models against a linear regression (LR) model. We also assess the performance for varying dataset sizes, as real-world measurements are limited. Finally, to explain the role of the input features in each of the ML models, we use permutation importance described further in

understand the utility of ML models in the estimation of glacier mass balance using limited real-world datasets

identify specific use cases for different classes of ML models (ensemble tree based, kernel based, neural network based, and linear regression) pertaining to data availability, evaluation metrics, and explainability

investigate the ability of ML models to unravel the underlying physical processes

explain the relative importance of meteorological variables contributing to the mass balance estimation on a monthly basis over the year.

Machine learning (ML) modelling comprises a set of data-driven techniques. Here, we used a supervised learning framework for regression, where inputs take the form of monthly meteorological variables and targets take the form of point measurements of glacier mass balance. The actual point mass balance measurements are the target data vital to tuning the model parameters. We do this parameter tuning by designing a loss function quantifying the deviation between the actual mass balance measurements, i.e. the target data, and the point mass balance estimates, i.e. the model's output. We start with a random initialization of the model parameters and fine-tune them to minimize the loss function. For each of the ML models used in the study, we used the mean squared error (MSE) as the loss function. Further, we obtained the features of importance by assessing permutation importance. Figure
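The loss computation described above can be sketched minimally as follows; the arrays are hypothetical placeholder values, not measurements from the study:

```python
import numpy as np

# Measured point mass balances (targets) and model estimates, in m w.e.
# These values are illustrative placeholders only.
measured = np.array([-1.2, -0.4, 0.3, -0.8])
estimated = np.array([-1.0, -0.5, 0.1, -0.9])

# Mean squared error (MSE), the loss minimized during training.
mse = np.mean((measured - estimated) ** 2)  # ~0.025 for these values
```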

Flowchart of the methodology.

The RF model is an ensemble-based algorithm where the base learner used is a decision (regression or classification) tree
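A minimal sketch of this class of model, assuming scikit-learn's `RandomForestRegressor` and synthetic stand-in data (not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # stand-in meteorological features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)   # synthetic mass balance target

# Bagging: each regression tree is grown on a bootstrap sample of the data,
# and the forest's estimate is the average of the individual tree predictions.
rf = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
rf.fit(X, y)
```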

Like the RF model, the GBR model is an ensemble-based algorithm where aggregated base learners of decision (classification or regression) trees provide an estimate. However, it differs from the RF model because it uses boosting instead of bagging to construct ensembles. In boosting-based ensembles, base learners are typically weak learners, and the design of subsequent learners is such that the overall error reduces
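The boosting idea can be sketched with scikit-learn's `GradientBoostingRegressor` on synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                              # stand-in features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)   # synthetic target

# Boosting: shallow (weak) trees are added sequentially, each one fit to
# the residual error of the ensemble built so far.
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
gbr.fit(X, y)
```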

The SVM model is a powerful ML tool that relies on Cover's theorem. The theorem suggests that data that might not be linearly separable in a lower dimensional space can be linearly separable when transformed into a higher dimensional space. In the context of classification, the SVM model uses a kernel to transform the data into a higher dimensional space
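For regression, the kernel trick looks as follows in a minimal sketch (scikit-learn's `SVR` assumed, with synthetic data):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                              # stand-in features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)   # synthetic target

# The RBF kernel implicitly maps the inputs into a higher-dimensional space
# (Cover's theorem), where a linear fit with an epsilon-insensitive loss
# is performed.
svm = SVR(kernel="rbf", C=1.0)
svm.fit(X, y)
```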

The most crucial component in ML modelling is the availability of target data to train the model. The target data used for training should be representative of the entire population. Hence, we chose the Fluctuations of Glaciers (FoG) database

The second aspect is the input features used by the model to make predictions. The network of weather stations is sparse over much of the Alpine terrain; hence, reanalysis datasets are recommended

We then normalized the data points using min–max scaling to ensure that no feature dominates the model simply because of its numerical range. We split the dataset using a random split, where 70 % of the total dataset is used for training the model and 30 % is used for testing the model performance. The training split is used in a 3-fold cross-validation process for tuning the hyperparameters, as described further in Sect.
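The preprocessing steps above can be sketched with scikit-learn utilities on synthetic placeholder data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))                 # stand-in features
y = X[:, 0] + 0.1 * rng.normal(size=150)      # stand-in targets

# Min-max scaling maps every feature to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Random 70/30 split into training and testing subsets.
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.3, random_state=0)

# 3-fold cross-validation on the training split (used for hyperparameter tuning).
cv_scores = cross_val_score(RandomForestRegressor(random_state=0), X_tr, y_tr, cv=3)
```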

In typical ML workflows, we split the complete dataset (set of features and target data) into training, validation, and testing. We fit the model to the data using the training subset, tune the hyperparameters using the validation subset, and report the independent performance metrics using the testing subset. In our case, we used a 70 %–30 % split for training and testing.
We have considered a hyperparameter grid with all combinations of values that each hyperparameter can take (see Table

Grid of settings used for hyperparameter tuning of each of the models.

For the RF model, we tuned the number of trees. We left the maximum depth unrestricted, leading to tree expansion until all nodes were pure. We considered all features to obtain the best split, ensuring minimum bias. As computing the absolute error at each split is slow, we used the squared error as the splitting criterion, ensuring the minimization of the variance after each split. For the GBR model, we tuned the number of trees, the maximum depth of each tree (which affects the randomness in the choice of features in each tree), and the subsampling ratio (for stochastic gradient boosting). Larger values of maximum depth, such as the unrestricted depth of the RF model, are not used because the GBR model relies on weak learners to increase randomness. The SVM model hyperparameter fine-tuning involved kernel selection and the choice of the regularization parameter. Further, in the case of polynomial kernels, the degree of the polynomial was also tuned. For the NN model, we used a fully connected feedforward network where the hyperparameters of the number of layers and the number of neurons per layer were tuned. The ReLU activation function was used to incorporate non-linearity. We used the Adam
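Such an exhaustive search over a hyperparameter grid can be sketched with scikit-learn's `GridSearchCV`; the grid values below are hypothetical illustrations, not the study's actual grid:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))                        # stand-in features
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=120)   # stand-in targets

# Hypothetical grid for the GBR model: number of trees, maximum depth,
# and subsampling ratio (values chosen for illustration only).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5], "subsample": [0.7, 1.0]}

# Every combination is evaluated with 3-fold cross-validation on MSE.
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
best_params = search.best_params_
```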

The testing dataset evaluation metrics used to assess the models' performances are the coefficient of determination (
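These metrics can be computed as below; the arrays are hypothetical measured and estimated mass balances, for illustration only:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical measured and estimated point mass balances (m w.e.).
y_true = np.array([-1.0, -0.5, 0.2, -0.8])
y_pred = np.array([-0.9, -0.6, 0.1, -0.7])

r2 = r2_score(y_true, y_pred)                        # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
```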

ML models are heavily reliant on the availability of training data. To understand the effect of data availability on model performance, we perform an experiment varying the training set size. We split the original dataset into subsets of iteratively increasing sizes. We divide each subset into training and testing partitions using a
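This experiment can be sketched as a simple learning-curve loop (scikit-learn assumed; the data are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                              # stand-in features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=400)   # stand-in targets

test_rmse = {}
for n in (50, 100, 200, 400):  # iteratively larger subsets of the dataset
    # Each subset is divided into 70 % training and 30 % testing partitions.
    X_tr, X_te, y_tr, y_te = train_test_split(X[:n], y[:n],
                                              test_size=0.3, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    test_rmse[n] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
```

Testing error is generally expected to decrease as the training subset grows, flattening once the model has enough samples.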

The feature importance is represented using permutation importance described in
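A minimal sketch of permutation importance, assuming scikit-learn's implementation and synthetic data with a single informative feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # only feature 0 is informative

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature column in turn and record the drop in score;
# informative features produce the largest drop.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
importances = result.importances_mean
```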

This section describes the major outcomes of the study categorized as the role of dataset size for the effective training of each ML model (see Fig.

Training and testing performance of each of the models: random forest (RF), gradient-boosted regression (GBR), support vector machine (SVM), artificial neural network (ANN), and linear regression (LR) depicted using the performance metrics

Testing scatter plot depicting the performance for each of the models: random forest (RF), gradient-boosted regression (GBR), support vector machine (SVM), artificial neural network (ANN), and linear regression (LR).

Hyperparameter tuning for the

Percentage importance of all features summed over the accumulation and ablation season for the models: random forest (RF), gradient-boosted regression (GBR), support vector machine (SVM), artificial neural network (ANN), and linear regression (LR).

The number of samples required for training the ML models depends upon the complexity of the model. Thus, each of the models used in this study is variably sensitive to the number of training samples. We use the evaluation metrics of RMSE and the correlation coefficient to assess the training sample requirement of each of the models. Figure

It is interesting to note that the RF, GBR, and LR models see an increase in training MAE as opposed to a consistent decrease in testing MAE with increasing training samples. This indicates the tendency of these models to overfit the training samples in the case of smaller datasets. This is evident when observing the order of variation in the training and testing evaluation metrics for smaller datasets; e.g. GBR depicts a training MAE of

The best-performing RF model resulted in a testing RMSE value of

Feature importance analysis using permutation importance considering the 17 (10 % of all features) most essential features indicates that the RF model is highly influenced by downward solar radiation in January; net solar radiation in July; downward thermal radiation in June; temperature at 2 m in June; forecast albedo in February and December; snow depth in January and July; snow density and snowmelt in July; sensible heat flux in December, January, March, and May; latent heat flux in August; and surface pressure in June and July. Permutation importance for the RF model summed over the accumulation months had the highest importance scores for sensible heat flux, followed by downward solar radiation and forecast albedo. Each of these variables depicts a summed percentage importance between 6 % and 9 %. Snow depth and pressure are also important, with a summed percentage importance between 3 % and 6 %. For the ablation months, only pressure is observed to have a summed percentage importance greater than 6 %. Sensible heat flux, net solar radiation, latent heat flux, snow depth, forecast albedo, snow density, and temperature at 2 m display a summed percentage importance between 3 % and 6 %.

Tuning the maximum depth permitted for each weak learner tree was important in estimating the best model, and varying the number of weak learner trees during hyperparameter tuning improved performance in the case of smaller depths of the weak learners. Deeper tree structures did not significantly change the model's performance upon changing the number of trees. Stochastic gradient boosting (subsampling at 0.7) resulted in reduced performance. The hyperparameter combination of the best-performing GBR model is 100 trees with a maximum depth of five nodes (Fig.

The most important meteorological inputs for the GBR model are snowfall in July; downward solar radiation in January and December; forecast albedo in December, January, February, March, and May; sensible heat flux in January, March, May, November, and December; temperature at 2 m in June and August; snow depth in June; and surface pressure in August. Note the marked importance associated with ablation-related meteorological variables and months. Permutation importance expressed as a percentage and summed over the accumulation months assigns the most importance to forecast albedo, followed by sensible heat flux, with both variables depicting a summed percentage importance greater than 10 %. Among the other meteorological variables, downward solar radiation, net solar radiation, and snow depth in the accumulation months are also important. The ablation months depict higher summed importance values, with forecast albedo in these months prominent. Sensible heat flux, latent heat flux, surface pressure, snowfall, snow depth, and temperature at 2 m above the surface are also important.

The SVM model depicted large fluctuations in the validation score with changes in the hyperparameters. This is represented in Fig.

Sensible heat flux in March carries the highest permutation importance, as does the sensible heat flux in April, May, June, and December. Latent heat flux in August and October is important. Snowfall in October and snow density for the months of November, December, and January are important. The temperature at 2 m above the surface in June and July; downward solar radiation in December; and forecast albedo in August, October, and December are important. Summing the percentage importance over the accumulation and ablation months, we observe that sensible heat flux in the accumulation months is most important, followed by snow density and downward solar radiation. These three variables depict a summed percentage importance of more than 6 %. The temperature at 2 m a.g.l. and forecast albedo depict an importance between 3 % and 6 % for the accumulation months. For the ablation months, sensible heat flux continues to depict a summed percentage importance of more than 6 %. Latent heat flux, snow density, forecast albedo, and temperature at 2 m above the surface also depict a summed percentage importance between 3 % and 6 %.

The NN model performance is highly susceptible to hyperparameter selection. We varied the number of hidden layers in the network and the number of neurons in each hidden layer. Figure

The most important meteorological variables in terms of the percentage permutation importance for the NN model are the sensible heat flux for March, April, and May; latent heat flux in July; surface pressure in February; net solar radiation in May and September; downward solar radiation in December; and forecast albedo in July. The snow density in December and the snow depth in January, February, April, July, September, October, and December are important. We see that snow depth across the year dominates the important meteorological inputs for this model. Upon summing the percentage importance for the accumulation and ablation months, we observe that snow depth is the most important for both. Snow density, pressure, sensible heat flux, and downward solar radiation are also important in the accumulation months, with a summed percentage importance between 3 % and 6 %. For the ablation months, net solar radiation is also important. Snow density, forecast albedo, latent heat flux, and sensible heat flux are also important, with summed percentage importance values between 3 % and 6 %.

The testing RMSE values for the LR model are

Snow depth over most of the year is the most important feature for the model, with surface pressure also playing an important role. Other features do not show comparably high importance values. However, the relative importance varies across the months.

The performance
of each of the models was evaluated using an independent test dataset. The GBR model resulted in the best testing performance MAE, RMSE, and

The performance of all models is affected by the uncertainties associated with the input features and targets. Inherent errors exist in point mass balance estimates, as heterogeneity is not captured sufficiently by the available measurements

Further, the use of input meteorological reanalysis data can result in bias, especially in locations without sufficient ground stations

The testing performance improves as the number of training samples increases. For larger datasets, only marginal improvement is observed upon increasing the number of samples further. The reduction in the rate of improvement for all models suggests that all models have been successfully trained. However, the marginal improvements observed suggest that a potential improvement in model performance is possible when including more data samples. The RF and GBR models overfit the training samples in the case of smaller datasets. The NN model training and testing metrics depict improved performance with training size. The NN model had the most trainable parameters and hence is the most data intensive. A larger number of training samples is essential for models with a larger number of trainable parameters. The training performance of the LR model deteriorates with increasing training samples. While the graph (LR model of Fig.

Further, Fig.

Assuming a winter accumulation-type glacier, we expect the months of November to March to be dominated by accumulation processes and June to September to be dominated by ablation processes. The permutation importance (by percentage) of the features of each model was analysed month-wise based on a physical understanding of which season-specific features should be most important. Figure

Net solar radiation and albedo are important ablation components. Albedo over snow-covered regions is higher than that of exposed ice or firn. At higher elevations and in summer months, we expect lower albedo values. Thus, variations in albedo are significant. In the case of ERA5-Land, the forecast albedo variable represents both the direct and diffuse radiation incident on the surface, with values dependent on the land cover type. It is calculated using a weight applied to the albedo in the UV–visible and infrared spectral regions. The albedo of snow and ice land covers differs in the UV–visible and infrared spectral regions. This makes forecast albedo more important than broadband albedo, which depends only on the surface net solar radiation and the surface solar radiation downwards. The expected importance of the albedo is observed in the RF, GBR, NN, and SVM models. The LR model, in contrast, depicts a very low importance of albedo for the accumulation months. Thus, we see that the ML models represent the importance of the ablation features well. This is in agreement with the predominantly negative mass balance observed in in situ measurements.

We can observe that the importance associated with the meteorological variables is not dominated solely by total precipitation and temperature, as is the case with temperature index models. Thus, ML modelling can represent the contributions of a complete set of variables with less complexity and greater ease of use than physical models. This also emphasizes the requirement for ML models to use all meteorological variables of interest, as opposed to a subset of them. This is the case with studies such as

With the emergence of artificial intelligence techniques, a number of studies have employed deep learning algorithms for numerous applications. A majority of these studies use neural networks to incorporate non-linearity in the modelling of various Earth observation applications. However, a host of ML techniques exist which remain under-utilized. This is being studied in the ML community (e.g.

An aspect not considered in this study is a transfer learning approach to the ML modelling, where glacier mass balance datasets from other locations can be used to pre-train the neural network and generate an initialization of weights to be tuned by the dataset of the region of interest (see

In this study, we constructed four ML models to estimate point glacier mass balance for the RGI first-order region 11: Central Europe. We used the ERA5-Land reanalysis meteorological data to train the models against point measurements of glacier mass balance obtained from the FoG database. In addition to the NN model, which is being increasingly utilized for glacier mass balance estimation, we used other classes of ML models, such as ensemble tree-based models, RF and GBR, and the kernel-based model, SVM. We compared these ML models with an LR model commonly used for mass balance modelling. Care must be taken to tune the hyperparameters for the GBR, NN, and SVM models. We observe that for these models, hyperparameter tuning was beneficial for improving the estimates of glacier mass balance. For smaller datasets, ensemble models such as RF and GBR depict overfitting. The NN model requires more data samples for effective training. The SVM model can effectively be used in the case of a smaller number of data samples, which is characteristic of real-world datasets. The LR model is consistently unable to capture the complexity of the data and underperforms. For larger datasets, ensemble models such as RF and GBR perform slightly better in terms of

The data used for the study are the monthly mean ERA5-Land reanalysis product for input features (

The supplement related to this article is available online at:

RA, RB, and DC were involved in the design of the study. RA wrote the code for the study and produced the figures, tables, and first draft of the article using inputs from all authors. RB, DC, and SPA proofread and edited the article. RA performed the first level of analysis, which was augmented by inputs from RB, DC, and SPA.

The contact author has declared that none of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We acknowledge the contribution of the journal editors, particularly Emily Collier, for the thorough article handling. We thank Jordi Bolibar and the anonymous reviewer, whose detailed suggestions and inputs have substantially improved the quality of the article. We also acknowledge the engaging discussions with peers, most notably Aniket Chakraborty, who always lent a patient ear and sound suggestions to roadblocks along the way.

This paper was edited by Emily Collier and reviewed by Jordi Bolibar and one anonymous referee.