20th century global glacier mass change: an ensemble-based model reconstruction

Negative glacier mass balances in most of Earth’s glacierized regions contribute roughly one quarter to currently observed rates of sea-level rise, and have likely contributed an even larger fraction during the 20th century. The distant past and future of glaciers’ mass balances, and hence their contribution to sea-level rise, can only be calculated using numerical models. Since independent of complexity, models always rely on some form of parameterizations and a choice of boundary conditions, a need for optimization arises. In this work, a model for computing monthly mass balances of glaciers on the global 5 scale was forced with nine different data sets of near-surface air temperature and precipitation anomalies, as well as with their mean and median, leading to a total of eleven different forcing data sets. Five global parameters of the model’s mass balance equations were varied systematically, within physically plausible ranges, for each forcing data set. We then identified optimal parameter combinations by cross-validating the model results against in-situ mass balance observations, using three criteria: model bias, temporal correlation, and the ratio between the observed and modeled temporal standard deviation of specific mass 10 balances. The goal is to better constrain the glaciers’ 20th century sea-level budget contribution and its uncertainty. We find that the disagreement between the different ensemble members is often larger than the uncertainties obtained via cross-validation, particularly in times and places where few or no validation data are available, such as the first half of the 20th century. We show that the reason for this is that the availability of mass balance observations often coincides with less uncertainty in the forcing data, such that the cross-validation procedure does not capture the true out-of-sample uncertainty of the glacier 15 model. Therefore, ensemble spread is introduced as an additional estimate of reconstruction uncertainty, increasing the total uncertainty compared to the model uncertainty obtained in the cross validation. Our ensemble mean estimate indicates a sealevel contribution by global glaciers (excluding Antarctic periphery) for 1901 2018 of 76.2 ± 5.9 mm sea-level equivalent (SLE), or 0.65 ± 0.05 mm SLE yr−1.

fore, optimization of parameters and/or input data is a standard procedure in glacier modeling (Huss and Hock, 2015;Radić and Hock, 2011;Marzeion et al., 2012). Often, a single parameter is chosen to be minimized during the calibration (e.g., the model's RMSE with respect to observed in-situ mass balances). Rye et al. (2012) suggested multi-objective optimization for a (regional) glacier model, striving for 'Pareto optimality' (Marler and Arora, 2004), to constrain parameters more robustly. 70 Here, we apply a multi-objective optimization, concerning the five global parameters most relevant in the applied model, for each of nine meteorological forcing data sets (see Table 1), their mean and their median. Since the model is able to hindcast glacier evolution, the aim of this work is to (i) optimize the model parameters in order to obtain model setups that reproduce in-situ mass-balance observations as closely as possible, and (ii) to more robustly estimate model uncertainty, taking into account ensemble spread at times and in regions where observations are sparse. We use the model of Marzeion et al. (2012), 75 but introduce changes to the mass-balance calibration routine (see 2.1.2). Additionally, we incorporate newer boundary and initial conditions as well as reference data, against which the model is calibrated and evaluated. We show that the ensemble approach to the reconstruction produces more robust estimates of model uncertainty than taking into account results from a cross-validation alone.

Basic equations and parameters
In this section, those features of the mass-balance model that are relevant to the optimization procedure are described. A more thorough description is given in Marzeion et al. (2012).
The annual mass balance B(t) of each glacier is computed as: where B is the annual modeled mass balance for an individual glacier in year t, P solid i the amount of solid precipitation in month i, µ * a glacier-specific temperature sensitivity parameter, T terminus i the mean temperature in month i at the glacier's terminus elevation, T melt a global threshold temperature for snow and ice melt at the glacier surface, and β * a model bias correction parameter. Values for µ * and β * are obtained by assuming an equilibrium of the glacier in present-day geometry 90 during a 31-year period centered around year t * . In contrast to the initial publication of the model, we objectify the selection of t * : while Marzeion et al. (2012) argue that t * is a function of the regional climatological history, it also depends on the glacier's response time scale, as discussed in Roe et al. (2020, submitted), for which there is no reason to assume spatial coherence. This means that we now do not spatially interpolate t * as before, but introduce it as an additional global parameter. In section 2.1.2, we elaborate further on this point.

95
The inference of the glacier-specific parameters (µ * and β * ) is assessed in a leave-one-glacier-out cross-validation procedure to determine the out-of-sample uncertainty. While values for µ * can be computed for each individual glacier based on t * , those for β * are spatially interpolated from the ten closest glaciers with at least three years of available in-situ observations, using inverse distance weighting.

100
While one global parameter (T melt ) was introduced in Eq. 1, three other ones are associated with the calculation of the monthly solid precipitation P solid i (t): P solid i (t) = (a · P CRU clim i + P anom i (t)) · (1 + γ precip · (z mean − z CRU clim )) · f solid where a is a precipitation correction factor, P CRU clim i is the monthly climatological precipitation sum taken from the grid 105 point of the CRU CL 2.0 data set (New et al., 2002) closest to the respective glacier in month i, P anom i (t) is the monthly total precipitation anomaly deduced from the forcing data set, γ precip is a global precipitation lapse rate, z mean is the mean elevation of the glacier, z CRU clim is the elevation of the grid point in the CRU CL 2.0 data set, and f solid i (t) is the fraction of solid precipitation: where T prec solid is a global threshold temperature for solid precipitation, γ temp is an empirically derived, local temperature lapse rate, z max the maximum glacier elevation, and z terminus the elevation of the glaciers' terminus.
The four global parameters (T melt , a, γ precip , and T prec. solid ) introduced in Eq. 1 -3 are at the core of the model's mass balance computations and hence subject to the optimization presented here. Marzeion et al. (2012) used the CRU TS 3.0 data set to obtain T terminus i (t) and P anom i (t). Here, we include additional meteorological data sets as well as their mean and median values in the optimization (see Sect. 2.2.1).

120
The monthly mass balances are subsequently translated into volume, area and length changes by geometric scaling and relaxation (details in Marzeion et al., 2012). Initial values for the area of each individual glacier at the start of the model run (e.g., beginning of the 20th century) are found using an iterative approach that minimizes the difference in area between modeled glacier and the Randolph Glacier Inventory (RGI) record in the year of the respective observation. If this iterative procedure is not successful, the glacier is not included in the reconstruction. For these glaciers, a simple upscaling is applied in the compu-125 tation of regional and global results.
Note that since the CRU CL 2.0 data set used to obtain P CRU clim i and T CRU clim i does not cover Antarctica, we do not consider glaciers in the periphery of Antarctica and subantarctic glaciers here (labeled region 19 in RGI, 2017).

130
As explained above, we treat the parameter t * as a global one, opposed to a glacier-specific estimation in Marzeion et al. (2012).
In order to illustrate the reasoning, we need to discuss the mass-balance calibration for an individual glacier in the model in detail. The calibration is based on the idea of inferring a glacier's temperature sensitivity µ * by finding a climatologcal time period in the forcing data set (centered around t * ) which would result in a zero annual mass balance of the glacier in present-day geometry. Thus, for each center yeart of a 31-year period, we can calculate µ(t) by requiring: where B M is the mean modeled mass balance of the respective glacier for the years of available mass balance measurements, and B o the mean observed mass balance. Marzeion et al. (2012) chose t * to be thatt, for which |β(t)| was minimal. µ * was then calculated from equation 4 applied to t * , and β * taken as β(t * ). For glaciers without in-situ observations of mass balances, t * and β * were interpolated from the ten closest glaciers with observations, using an inverse distance weighting. Using this 145 5 https://doi.org/10.5194/tc-2020-320 Preprint. Discussion started: 21 November 2020 c Author(s) 2020. CC BY 4.0 License. method, Marzeion et al. (2012) were able to identify a suitable parameter set in the leave-one-out cross-validation procedure, applying CRU TS 3.0 as atmospheric boundary conditions. However, this is not generally the case for the meteorological data sets considered here, and there is a conceptual shortcoming in the spatial interpolation of t * , which we will illustrate for one exemplary model setup.

150
The upper panel of Fig. 1 shows the global average of β(t), weighted by the length of each glacier's in-situ mass-balance measurement time series (henceforth, all mentioned averages over different glaciers imply such a weighting), using CRU TS 4.03 as boundary condition, applying the optimal parameter set (see section 2.3).
The lower panel shows that the distribution of t * estimated directly is bi-modal, with frequent values either at the beginning 155 or end of the considered period, but the spatial interpolation leads to a more even distribution. This in turn means that, generally speaking, the spatial interpolation moves t * towards the mid of the considered time period, thereby increasing the value of β * for glaciers with an early t * , and decreasing it for those with a late t * (see upper panel of Fig. 1). It also shows that there are more glaciers with t * at the beginning of the 20th century than at the end of the 20th century or the beginning of the 21st century.

160
Furthermore, those glaciers with t * at the beginning of the 20th century tend to have a positive β * , implying that even with present-day geometry, those glaciers would have lost mass under climatic conditions of the early 20th century. The zerocrossing of the global average β(t) is thus found at a period when positively and negatively biased glaciers cancel each other.
Since moving the median of t * towards the mid of the of the modeled period generally goes along with an increase of the averaged model bias, using the spatial interpolation of t * tends to lead to a positively biased model setup, which then becomes 165 apparent in the leave-one-glacier-out cross-validation.
In order to avoid this effect, and taking into account that neighboring glaciers will have different response times, such that even if they experience a very similar evolution of climate anomalies we cannot expect a close spatial coherence of t * , we no longer spatially interpolate t * and treat it as a 5th global parameter instead. Note that µ * is still a glacier-specific parameter 170 following Eq. 4, and that β(t * ) is still interpolated from the ten closest glaciers in an inverse distance weighted manner. Also note that the leave-one-glacier-out cross-validation (Sect. 2.3) will reveal any potential new model errors introduced through this change.

175
We conducted the search for an optimal parameter set for the version 4.03 of the CRU TS data (corresponding to an update of Marzeion et al., 2012) and additionally eight reanalysis data sets, as well as the mean and the median of all the data sets (see Table 1). Data sets not extending back to 1901 were filled with CRU TS 4.03 data, exclusively for the purpose of initialization of glacier areas; the results are only shown (and evaluated) during time periods for which we have forcing data from the respective data set.

Glacier Data
The glacier model requires information about location, area, terminus and maximum elevation of each glacier at some point of time within the modeled time interval. The RGI provides these data. Its most recent version (RGI v6.0) was used in this work.
The RGI relies mostly on Landsat and other satellite imagery. Distinction of individual glaciers within glacier complexes was realized mostly by semi-automatic algorithms for detecting watershed divides (RGI, 2017).

190
To be able to cross-validate the modeled mass balances, we use in-situ observations of glaciers' mass balances collected by the World Glacier Monitoring Service (WGMS, 2018). We ignore any uncertainties of these observations (Cogley, 2009) and treat them as the 'true' annual mass balance of a glacier in the recorded year.

195
For the identification of a optimal parameter set, we applied a 'brute-force' approach, i.e. we varied each parameter other than t * (see below) using the following ranges, for each meteorological data set: This resulted in 900 model validation runs for each of the eleven forcing data sets (i.e., 9900 runs in total). For all forcing data sets except 20CRV3, all zero-crossings of the global mean β(t) were found before witht < 1920 (for 20CRV3, some 205 were found in 1962 and 1976). For each forcing data set, we selected the twenty best-performing parameter sets that showed a zero-crossing of the global mean β(t). We then performed another cross-validation with those parameter sets to fine-tune t * , applying the range 1901 to 1920, except for 20CRV3 where we applied the time ranges 1909 -1918, 1960 -1964, and 1974 -1978. Hence, we performed 400 additional cross-validation runs per data set.
From those cross-validations, three characteristic statistical measures of model performance were computed: model bias (i.e., mean model error) with respect to observations, the temporal correlation with observations, and the ratio of standard deviations of interannual variability between modeled and observed mass balances. We do not include the mean squared error (MSE) as a performance measure, since it is simply a (weighted) combination of the three performance measures: where σ M is the standard deviation of modeled mass balances, σ o the standard deviation of observed mass balances, R the Pearson correlation coefficient, M the mean of modeled mass balances, and O the mean of observed mass balances (thus, the last term corresponds to the squared bias).
From Eq. 6 it can be inferred that a minimum MSE occurs for a model setup in which the standard deviation ratio equals the 220 correlation coefficient. Hence, in a model setup that is not perfectly (positively) correlated with the observations (i.e., 0 < R < 1), a more realistic standard deviation ratio (e.g. 1 ≥ σ M σo > R) will result in a higher MSE. However, a correlation coefficient equal to one is generally not achievable in complex models such as the one used in this work. Therefore, minimizing the MSE will lead to preference of parameter sets that underestimate variance. This is problematic, since a correct representation of variance is indicative of correct model sensitivity to changes in the forcing. E.g., it is possible to imagine to apply a model setup 225 that yields a low bias and good correlation, but largely underestimates the interannual variation of mass balances. It is therefore beneficial to not only minimize the MSE, but rather to minimize the three statistical coefficients it comprises individually, in order to not trade a realistic model sensitivity for a smaller MSE.
All three performance measures were calculated for each validated glacier in a respective data set, and then averaged over 230 all these glaciers, weighted by the number of available mass balance observations per glacier.
Standard deviation ratios were brought to represent the deviation from an optimum value (i.e. one) by: To determine for each meteorological data set a model parameter set that, on average, shows the highest skill to represent 235 the behavior of observed glaciers, we normalize the performance measures introduced above such that the individual scores s range from 0 for the worst to 1 for the best validation result by the following equations: where i is the individual model setup the score is calculated for. These scores were then added up to identify the 'optimal' model setup as the one with the maximum overall score. If a model setup obtained the single best result for all three performance 240 measures individually, it would thus yield a score of three. Note that the three (or potentially other) performance measures might be weighted differently, based on the objective of the model application. However, as shown below, we do not find  Table 2 shows the values obtained for performance measures and optimal global parameters. We differentiate between the mean and median of the forcing data input used as individual boundary conditions (mean/median input) and the mean and median of the ensemble output values (mean/median output). For more than half of the validated meteorological data sets, the global mean bias of the optimal parameter set is smaller than 10 mm w.e. yr −1 , and the correlation is larger than 0.6, while the 250 amplitude of the interannual variability is estimated correctly within a small margin (ca. 5 %). RMSEs lie roughly between 700 and 800 mm w.e. yr −1 for most data sets. Only 20CRV3 shows a significantly higher RMSE, caused by some large outliers.
Note that the number of glaciers that cannot be initialized also depends on the meteorological data set used as boundary condition. CERA20C, e.g., not only performs the worst (obtaining an overall score of 1.38 using the optimal parameter set), but  Table 2). Model deviation distribution pairs do not significantly 275 9 https://doi.org/10.5194/tc-2020-320 Preprint. significantly different from those of other ensemble members are to a large degree produced by low-scored model setups, while the mean is only significantly different for CERA20C. The significantly high bias and low score of CERA20C indicate particular issues with this forcing data set and lead us to exclude it from the following ensemble calculations. In the subsequent section we will explore where these issues stem from and in doing so explain why the temporal and spatial constraints of the validation data hinder us to make assertions over which individual model setup is the most reliable one over the whole temporal 285 and spatial model domain.

Differences between ensemble members inconsistent with uncertainty estimates
The leave-one-glacier-out cross-validation procedure applied here is designed to estimate the uncertainty of model results for glaciers that have no in-situ mass balance observations, and for times where there are no in-situ observations. Therefore, in principle, the results of the individual ensemble members should agree within their respective uncertainty estimates. However,  calibration, lower temperatures at t * will lead to higher temperature sensitivities (see Eq. 4). Similarly, the greater increase of temperature will result in higher mass loss rates.

315
Concerning these issues with CERA20C, it is striking that in spite of its large positive specific mass-balance bias in the cross-validation, global mass change estimates obtained with it are strongly more negative than those of the other ensemble members. This underlines the fact that even though the cross-validation is crucial in the optimization process, we cannot entirely rely on it for assessing global and long-term reconstruction performance of individual data sets. Therefore, and because, as stated in the previous section, the best-performing data sets do produce statistically quite similar results for the validation 320 glaciers, we will only use estimates based on the ensemble -i.e., not individual members -in the following.
In both the well-observed regions (panel (a) in Fig. 3) and the global scale (panel (b)), the different model setups disagree stronger in the first half of the 20th century, reflecting that uncertainty in the atmospheric conditions during that time is also greater.

325
All in all, we find that the ensemble spread tends to be larger than uncertainty estimates obtained via the cross-validation, and that this is caused by the majority of glacier observations coming from places and times where the uncertainty of the state of the atmosphere is smaller than what can typically be expected in glacierized regions. Additionally, we assume that the individual glaciers' error estimates are uncorrelated with each other and random, for we do not have direct model error estimates for every glacier and can thus not account for correlations of individual glaciers' errors. However, the ensemble approach allows 330 to explore if, and to which degree, the cross-validation underestimates the true uncertainty of the reconstruction.

Combining model and ensemble uncertainty
To account for both the model error, as calculated in the cross-validation procedure (RMSE), and for and the ensemble spread, the total uncertainty of the ensemble average is calculated as follows. First, we calculate the model error of the ensemble average solely determined by the means of the cross-validation error: where i (t) is the model uncertainty computed in the cross-validation procedure for an individual ensemble member i for year t. Then we add the ensemble spread as a further uncertainty measure to the model error of the ensemble average: Figure 4 shows the temporal evolution of total uncertainty ( ensemble ) as well as the aggregated model uncertainty ( model ) and ensemble spread (σ ensemble ) of the ensemble mean mass change rate estimate. The total uncertainty of the ensemble mean estimate grows as we go back in time, with a sharp increase in the first twenty-five years. This is due to the increase in the model error of the ensemble average, especially in the first decade of the 20th century, which is produced by very high mass 345 losses for a few glaciers in some model setups during that period. The ensemble spread is also greater during the first half of the 20th century compared to later years, which can be attributed to less agreement between meteorological data sets in earlier years. Note that also the number of ensemble members shrinks going back in time, since not all reanalysis products provide data for the whole period.  Table 3 displays the regional and global mass loss rates for different reference periods. Mass change rates estimates for more recent periods are increasingly negative across most regions, reaching -1.00 ± 0.06 mm SLE yr −1 accumulated globally in the most recent period. The only time and region for which an increase in glacier mass is estimated are the Southern Andes in earlier years, although with a relatively high uncertainty due to ensemble spread (see Fig. 2). To explore the period of decelerated mass loss during the 20th century shown in Fig. 6, the periods 1901 to 1940 360 and 1941 to 1980 are shown in Table 3. For most regions, the mass loss change rate estimates are substantially less negative in the latter period; only New Zealand exhibits a significantly larger mass loss. Regarding the global estimate, most of the mass loss deceleration took place in Greenland and the North American continent (i.e. regions 1 to 5). Thus, after increasing mass loss rates until around 1930 (see Fig. 6), glaciers started to lose less mass until around 1980, possibly caused by atmospheric cooling induced by increasing aerosol concentrations (Ohmura, 2006;Ohmura et al., 2007;Wild, 2012). From then on, the 365 glaciers' contribution to sea-level rise accelerated again until the end of the modeled period. Figure 7 shows the drivers of this behavior: the global ensemble mean temperature (lower panel) and precipitation anomalies as well as total amount of solid precipitation (upper panel; see Eq. 2 and 3; all weighted by glacier area). From ca. 1980 on, heat available for ice and snow melt, i.e. the temperature anomaly, increased monotonously. While precipitation at the glacier locations tended to increase over time, the amount of solid precipitation at glacier locations decreases from roughly 1980, implying that not only ablation is 370 increased, but also accumulation is decreased. In contrast to that, the increase in precipitation between ca. 1930 to 1950 was accompanied by a similar increase in solid precipitation, indicating that the warm anomaly at the same period was too weak reduce accumulation.

Global Glacier Mass Loss
Concerning uncertainty estimates, Table 3 shows that most of the uncertainty stems from the regions Alaska, Arctic Canada 375 (North), and Greenland in the more recent periods. In the earliest period, the Russian Arctic region exhibits the highest uncertainty, which is more than double the value of the central regional estimate for that period, indicating that the large model error in the early 20th century (see Fig. 4) is mostly produced in this region, and in Greenland.  (Gardner et al., 2013) also involve GRACE data. Only the disagreeing values from Cogley (2009) do not involve gravimetry measurements. Part of these disagreements might be explained by the storage of meltwater in glacial lakes (Shugar et al., 2020), which (because of the 385 close proximity to the glaciers) cannot be separated from the ice mass in gravimetry data. GRACE will therefore observe lower mass change values than in-situ or geodetic observations. However, since these lower values are more correct concerning the glaciers' contribution to sea-level rise, the issue points to the larger problem of distinguishing between glacier mass change, and the corresponding sea-level change, which are not exactly equal. However, Shugar et al. (2020) also point out that glacial lake storage accounts for only about 1 % of glacier melt volume (excluding Greenland and Antarctica), which indicates that 390 this inconsistency is of limited relevance. Gardner et al. (2013) point to discrepancies between satellite-derived and in-situ estimates of glacier mass losses, alleging a negative bias in in-situ observations. Zemp et al. (2019) addressed this issue by combining glaciological and geodetic measurements. Although our model is calibrated solely using in-situ observations, its estimates are still close to Zemp et al. (2019), in which the uncertainty for the longer period is admittedly large (Table 4).

Discussion
Finally, estimates of the global glacier mass change contribution to sea-level rise, excluding Greenland and Antarctic periphery 395 and not given in Table 4, of Frederikse et al. (2020) agree well with ours for the more recent time intervals they specify (1957 -2018 and 1993 -2018), while our estimates lie at the very low end of the confidence interval given for the whole time interval they studied (1900 -2018). This is presumably due to the modeling approach that their estimates in early years rely on, which includes estimations of disappeared and missing glaciers that are not included in the RGI. The increase of global glacier mass loss estimates this causes declines throughout the 20th century (Parkes and Marzeion, 2018). 400 Regarding regional values, Table 3 shows that roughly two-thirds of our global mass loss estimate, during the most recent time, occurred in Greenland and the North American continent. A large amount of the global uncertainty originates from these regions as well. Comparing our regional mass change estimates for recent years to those in the literature (Ciracì et al., 2020;Wouters et al., 2019;Zemp et al., 2019), the most obvious discrepancy can be found in estimates for the Southern Andes, where our ensemble mean is substantially less negative and even positive in earlier periods, caused by the model setup forced 405 13 https://doi.org/10.5194/tc-2020-320 Preprint. Discussion started: 21 November 2020 c Author(s) 2020. CC BY 4.0 License.
with 20CRV3 reanalysis data. The opposite is true for the regions Arctic Canada (North) and Svalbard, where our estimate is more negative than those previously published. This might be caused by the relatively large portion of area draining into marine-terminating glaciers in those regions, since glacier-ocean interactions are not included in the model we applied and the calibration applying solely atmospheric forcing might thus be problematic. Finally, our regional estimate is significantly more negative for Greenland than for Alaska in the most recent period, while it is not so in Zemp et al. (2019). Thus, while we 410 find a good agreement of our global mass change estimates with previously published ones, there are significant differences in regional estimates.
Although the largest potential of reducing the global uncertainty, relevant to e.g. sea-level rise estimates, is in largely glaciated but less observed regions, reducing it in smaller regions (e.g. Southern Andes) could still be valuable concerning 415 hydrological changes and hence water availability.

Conclusions
A multi-objective optimization of a global glacier mass balance model forced with an ensemble of meteorological data sets was presented. We demonstrated that it is possible to find statistically well performing model setups of model parameters for each forcing data set, but that we cannot robustly identify which model setup is the most reliable when applied outside of 420 the temporal and spatial domain of validation data. However, one data set (CERA20C) can be identified as performing worse that the others. Disagreement between ensemble members is to a large degree attributable to differences in the forcing data in times and at locations where few validation data are available. The differences in the forcing data result in diverging glacier mass loss estimates, especially in the first half of the 20th century. Regionally, the largest ensemble disagreement is found regarding Greenland's peripheral glaciers. Although our estimates lie within the uncertainty range to most of the previously 425 published global estimates, they seem to agree less with those derived from GRACE data. Finally, all ensemble members agree that around the 1930s mass loss rates from glaciers were comparable to those of today. They were followed by a phase of deceleration roughly between 1940 and 1980, and have been accelerating since then.