Bayesian calibration of firn densification models

10 Firn densification modelling is key to understanding ice sheet mass balance, ice sheet surface elevation change, and the age difference between ice and the air in enclosed air bubbles. This has resulted in the development of many firn models, all relying to a certain degree on parameter calibration against observed data. We present a novel Bayesian calibration method for these parameters, and apply it to three existing firn models. Using an extensive dataset of firn cores from Greenland and Antarctica, we reach optimal parameter estimates applicable to both ice sheets. We then use these to simulate firn density and evaluate 15 against independent observations. Our simulations show a significant decrease (25 and 55%) in observation-model discrepancy for two models and a small increase (11%) for the third. As opposed to current methods, the Bayesian framework allows for robust uncertainty analysis related to parameter values. Based on our results, we review some inherent model assumptions and demonstrate how modeland parameter-related uncertainties potentially affect ice sheet mass balance assessments.


Introduction 20
On the Antarctic and Greenland ice sheets (AIS and GrIS), snow falling at the surface progressively compacts into ice, passing through an intermediary stage called firn. The process of firn densification depends on local conditions, primarily the temperature and the snow accumulation rate, and accurate modelling of densification is key to several applications in glaciology. Firstly, variability in firn densification affects altimetry measurements of ice sheet surface elevation changes.
Consequently, it is a large contributor of uncertainty in mass balance estimates that rely on a correct conversion from measured 25 volume changes to mass changes (Li and Zwally, 2011;Shepherd et al., 2012;McMillan et al., 2016). Errors in the firn related correction can lead to over-or underestimation of mass changes related to surface processes, and to misinterpreting elevation change signals as changes in mass balance and in ice flow dynamics. Secondly, firn models are used to estimate the partitioning of surface meltwater into runoff off the ice sheet, and refreezing within the firn column, which strongly influences mass loss rates . Model estimates of current and future surface mass balance of the AIS and GrIS would 30 thus benefit from an improved knowledge of the sensitivity of the densification process to climatic conditions. And finally, the densification rate determines the firn age at which air bubbles become trapped in the ice matrix. Knowing this age is crucial for precisely linking samples of past atmospheric composition, which are preserved in these bubbles, to paleo-temperature indicators, which come from the water isotopes in the ice (Buizert et al., 2014).
Firn densification has been the subject of numerous modelling studies over the last decades (e.g. Herron and Langway, 1980;5 Goujon et al., 2003;Helsen et al., 2008;Arthern et al., 2010;Ligtenberg et al., 2011;Simonsen et al., 2013;Morris and Wingham, 2014;Kuipers Munneke et al., 2015). However, there is no consensus on the precise formulation that such models should use. Most models adopt a two-stage densification process with the first stage characterising faster densification for firn with density less than a critical value, and then slower densification in the second stage. The firn-model intercomparison of Lundin et al. (2017) demonstrated that, even for idealised simulations, inter-model disagreements are large in both stages. Firn 10 compaction is driven by the pressure exerted by the overlying firn layers. Dry firn densification depends on numerous microphysical mechanisms acting at the scale of individual grains, such as grain-boundary sliding, vapour transport, dislocation creep and lattice diffusion (Maeno and Ebinuma, 1983;Alley, 1987;Wilkinson, 1988). Deriving formulations closely describing the densification of firn at the macroscale as a function of these mechanisms is challenging. Consequently, most models rely on simplified governing formulations that are calibrated to match with observations. The final model formulations 15 have usually been tuned to data either from AIS (Helsen et al., 2008;Arthern et al., 2010;Ligtenberg et al., 2011) or from GrIS (Simonsen et al., 2013;Morris and Wingham, 2014;Kuipers Munneke et al., 2015), consisting of drilled firn cores from which depth-density profiles are measured. However, the calibration of firn densification rates to firn depth-density profiles requires the assumption of a firn layer in steady state. To overcome this limitation, some models have been calibrated against other type of data such as strain rate measurements (Arthern et al., 2010;Morris and Wingham, 2014) or annual layering 20 detected by radar reflection (Simonsen et al., 2013), but such measurements remain scarce and do not extend to firn at great depths below the surface. Ultimately, firn model calibration is an inverse problem that relies on using observational data to infer parameter values.
In this study, we adopt a Bayesian approach in order to address firn model calibration. This provides a rigorous mathematical 25 framework for estimating distributions of the model parameters (Aster et al., 2005;Berliner et al., 2008). Bayesian inversion has been applied in several glaciological studies, and it has been demonstrated that this methodology improves our ability to constrain poorly known factors such as basal topography (Gudmundsson, 2006;Raymond and Gudmundsson, 2009;Brinkerhoff et al., 2016a), basal friction coefficients (Gudmundsson, 2006;Berliner et al., 2008;Raymond and Gudmundsson, 2009), ice viscosity (Berliner et al., 2008) and the role of the subglacial hydrology systems on ice dynamics (Brinkerhoff et 30 al., 2016b). In the Bayesian framework, model parameters are considered as random variables for which we seek an a posteriori probability distribution that captures the probability density over the entire parameter space. This distribution allows not only to identify the most likely parameter combination, but also allows us to set confidence limits on the range of values in each parameter that is statistically reasonable. This enables us to quantify uncertainty in model results, to challenge the assumptions https://doi.org/10.5194/tc-2019-274 Preprint. Discussion started: 3 January 2020 c Author(s) 2020. CC BY 4.0 License. inherent to the model itself and to assess correlation between different parameters. Calculations rely on Bayes' theorem (see Sect. 2.4 and Eq. (7)), but because of the high-dimensional parameter space and the non-linearity of firn models, solutions cannot be computed in closed form. As such, we apply rigorously designed Monte Carlo methods to approximate the target probability distributions efficiently. By exploiting the complementarity between the Bayesian framework and Monte Carlo techniques, we recalibrate three benchmark firn models and improve our understanding of their associated uncertainty. 5

Firn densification data
In order to calibrate three firn densification models, we use observations of firn depth-density profiles from 91 firn cores located in different climatic conditions on both the GrIS (27 cores) and the AIS (64 cores) (Fig. 1). Using cores from both ice sheets is important since we seek parameter sets that are generally-applicable and not location-specific. We only consider dry 10 densification since meltwater refreezing is poorly represented in firn models and wet-firn compaction is absent (Verjans et al., 2019). As such, we select cores from areas with low mean annual melt (<0.006 m w.e. yr -1 ) but spanning a broad range of annual average temperatures (-55 to -20℃) and accumulation rates (0.02 to 1.06 m w.e. yr -1 ). For each core, we use the depth- (2) where (m) increases downwards, is the density of firn (kg m -3 ) and is the density of ice (917 kg m -3 ). In Eq. (2), we consider porosity only below 15 m to avoid dependency between 15 and . We choose to use both 15 and 20 in order to account for first-and second-stage densification. We note that 48 cores are too shallow to reach and so cores which do reach this depth provide a stronger constraint to the Bayesian inference method. This is sensible because these deep cores carry information about both stages of the densification process.
We separate the dataset into calibration data (69 cores) and independent evaluation data (22 cores). The latter is selected semirandomly; we ensure that it includes a representative ratio of GrIS-AIS cores and that it covers all climatic conditions, including 25 an outlier of the dataset with high accumulation and temperature (see Supplementary Information). The resulting evaluation data has 8 GrIS and 14 AIS cores; 11 of the 22 cores extend to .

Climate model forcing
At the location of each core, we simulate firn densification under climatic forcing provided by the RACMO2.3p2 regional climate model (RACMO2 hereafter) at 5.5 km horizontal resolution for GrIS  and 27 km for AIS (van Wessem et al., 2018). Results of the calibration would depend on the particular climate model used for forcing. Each firn model simulation consists of a spin-up by repeating a reference climate until reaching a firn column in equilibrium, which is 5 followed by a transient period until the core-specific date of drilling. The reference climate is taken as the first 20-year period of RACMO2 forcing data (1960-1979 and 1979-1998 for GrIS and AIS respectively). The number of iterations over the reference period depends on the site-specific accumulation rate and mass of the firn column (mass from surface down to ).
We ensure that the entire firn column is refreshed during the spin-up but fix the minimum and maximum number of iterations to 10 (200 years spin-up) and 50 (1000 years spin-up). We note that at 33 sites, the core was drilled before the last year of the 10 reference climate and so the transient period is effectively a partial iteration of the spin-up period.
In addition to the climatic forcing, another surface boundary condition is the fresh snow density, 0 . It is taken as a constant site-specific value. Each value is taken in agreement with the shallow densities measured in the corresponding core of the dataset. We prefer this approach to the use of available parameterisations of 0 (Helsen et al., 2008;Kuipers Munneke et al., 2015) to avoid any error in the fresh snow parameterisation to affect the calibration process. 15

Firn densification models
We use the Community Firn Model (Stevens, 2018) as the framework of our study because it incorporates the formulations of all three densification models investigated: HL (Herron and Langway, 1980), Ar (Arthern et al., 2010) and LZ (Li and Zwally, 2011). The Robin hypothesis (Robin, 1958) constitutes the fundamental assumption of HL, Ar and LZ. It states that any 20 fractional decrease of the firn porosity, /( − ), is proportional to an increment in overburden stress. This translates into densification rates depending on a rate coefficient , assumed different for stage-1 and stage-2 densification: The formulations of the rate coefficients rely on calibration and thus differ between the three models investigated: LZ where ̇ is the accumulation rate (m w.e. yr -1 ), the temperature (K), the annual mean temperature, the gas constant, gravity and the water density (1000 kg m -3 ). All remaining terms are model-specific tuning parameters. For ̇, we use the mean accumulation rate over the lifetime of each specific firn layer because it better approximates the overburden stress than the annual mean (Li and Zwally, 2011). HL and Ar use Arrhenius relationships with activation energies ( terms) capturing temperature sensitivity and exponents characterising the exponential proportionality of the rate coefficients to the accumulation 10 rate. Originally, Herron and Langway (1980) inferred all values from calibration based on 17 firn cores, from which they inferred the values for the six free parameters (Table 1)  Nabarro-Herring creep for and grain-growth for . Still, they noted a mismatch with the activation energy fitting their data best. The 0 and 1 parameters were tuned to three measured time series of strain rates collected in relatively warm and 15 high accumulation locations of AIS. Here, we consider all five , , 0 , 1 , as free parameters (Table 1) but keep fixed because of its strong correlation with ; our use of monthly model time steps and depth-density profiles as calibration data is not suitable for differentiating effects of and . Equation (6) shows that LZ has eight free parameters (Table 1), all denoted by in this paper. In contrast to our approach to Ar, we do not add additional accumulation rate exponents to ̇ in Eq. (6) because the dependence of 0 and 1 on ̇ also involves the coefficients 12 and 22 in the definition of 0 and 1 . Li and 20 Zwally (2011) performed their calibration of Eq. (6) against firn cores only from the GrIS. Later, Li and Zwally (2015) proposed a new parameterisation for 0 and 1 , but calibrated for Antarctic firn. Since one of the goals of this study is to find a densification formulation applicable to firn in both GrIS and AIS, we choose to apply our calibration method only to Eq. (6) specified in Li and Zwally (2011). However, in our results' analysis (Sect. 3), we also consider the performance of the Li and Zwally (2015) model on the AIS cores of our dataset. 25

Bayesian calibration
In our approach, the free parameters of the firn models are identified as the quantities of interest and we define this parameter set as . Hereafter, 'original model values' refers to the values originally attributed by Herron and Langway (1980), Arthern et al. (2010) and Li and Zwally (2011) to their respective sets of free parameters . The calibration process relies on Bayes' theorem (Eq. (7)) which allows to update a prior probability distribution ( ) for based on observed data . 5 We use normal and weakly informative priors centred about the original model values so that the constraint of the prior on ( | ) is minor ( Table 1). The data consists of the observed 15 and values of the calibration data. The marginal likelihood, ( ), is a constant term independent of and does not influence the calibration. We use a normal likelihood function ( | ), which quantifies the match of the modelled values with the observed: 10 where 15 and 15 are vectors containing all modelled and observed values for the calibration data of 15 respectively, and similarly for and . We use diagonal covariance matrices 15 and with site-specific variances. The variances determine the spread allowed for the model outputs compared to the observed values and we calculate them taking 5% and 10% margins around 15 and measurements respectively. Allowing for such spread is necessary because multiple 15 causes may lead to model-observation discrepancy such as firn model errors, measurement uncertainties, potential errors in the climatic forcing and approximations in fixing 0 . This particular form of the likelihood function assumes independence between model errors in 15 and in , which is ensured by our calculation of only from 15 m depth to (Eq. (2)). It also assumes normally distributed model errors with respect to the observed values. Both these aspects were verified with preliminary assessments, along with our calculations for the covariance matrices 15 and , as discussed in the 20 Supplementary Information. The posterior distribution ( | ) gives a probability distribution over the parameter space of a given model conditioned on the calibration data. We note here that extreme parameter combinations in the LZ model can lead to negative densification rates. In such cases, we set the modelled values to 0, which leads to extremely low values for the likelihood and for the posterior probability of such parameter sets.

25
There is no analytical form of ( | ) and we must investigate the parameter space to generate an ensemble of approximating ( | ). Such an investigation is achieved efficiently using Markov Chain Monte Carlo methods. We apply the well-known Random Walk Metropolis (RWM) algorithm (Hastings, 1970) and summarize it in Fig. 2, on which we base the brief following description. A given model starts with a certain parameter set and simulates firn profiles at all the calibration sites. Its 15 and results are compared with observations and the general performance of the model using is quantified by the 30 likelihood. From there and with the prior distributions assumed, the posterior probability ( | ) is computed following Eq.
(7). At this point, the state of the chain is (Fig. 2a). The RWM then proposes a new * from a proposal distribution (Fig.   2b). For the latter, we use the symmetric multivariate normal (MVN) distribution which is centred about . This implies that the random choice of * depends only on the current state and on the proposal covariance in the MVN, , which is discussed below. Using the parameter combination * , the model simulates profiles at all calibration sites again (Fig. 2c) and ( * | ) is computed (Fig. 2d). From there, we either accept or reject the proposed * in the ensemble approximating ( | ). 5 The probability of accepting * depends on the ratio ( * | )⁄ ( | ) (Fig. 2e). The set saved in the ensemble (Fig. 2g) is * if accepted or if * was rejected. The saved set becomes the updated current status for the next iteration +1 (Fig. 2a) and the algorithm iterates this process. The RWM has the property that the chain will ultimately converge to a stationary distribution that represents the posterior ( | ). Thus, after a sufficiently high number of iterations of the algorithm, the ensemble of parameter sets is representative of ( | ). We verify adequate convergence using a number of tests, which are 10 shown in the Supplementary Information. The proposal variance must account for dependence between the different components of , i.e. the value of one free parameter can influence the value of another free parameter for the model to reach a good match with the observed data. can capture this dependence between parameters and, for optimality, it is updated every given number of iterations (100 in our study) using Eq. (9)

Results
We present the results of the calibration process after 30000 algorithm iterations and compare the MAP and original models' performances against the 22 evaluation cores. We also evaluate the uncertainty of the posterior distributions and compare 30 performances between the different MAP models. For both HL and Ar, the posterior distributions for the parameters demonstrate some strong disagreements with the original values (Figs. 3a, 3b). The 95% credible intervals for each parameter incorporate 95% of the marginal probability density in the posterior. Two original parameter values of HL ( , ) and three of Ar ( , , ) lie outside these intervals, and one of HL ( 0 * ) is at the lower edge of the interval (Figs. 3a, 3b). This indicates that our analysis provides strong evidence against these 5 original values. The strongest disagreements relate to the accumulation exponents of both models ( , , , ). In contrast, the original LZ values agree better with the posterior distribution and all lie within the 95% credible intervals (Fig. 3c).
We use the original models and the MAP estimates to simulate firn profiles at the evaluation sites and we compare results with the observed values. This is an effective way to assess possible improvements in parameter estimates reached through 10 our method since the evaluation sites were not used in the calibration process. The match between observations and the model is improved for MAP HL (Fig. 4a) and even more for MAP Ar (Fig. 4b), with the original Ar strongly underestimating values.
For LZ, the evaluation against the test set is inconclusive, with a worse performance of the MAP LZ model for (+22 % 15 in RMSE) and a slight improvement for 15 (-1 %) ( Table 2 and Fig. 4c). The relative agreement in parameter values between MAP LZ and the original LZ explains more moderate differences in RMSE. Comparing modelled and observed depthdensity profiles of evaluation data illustrates the differences in performance visually (e.g. Fig. 5). Profiles of the original models of HL and Ar frequently lie outside the credible intervals of their respective MAP models. In contrast, profiles of MAP LZ and of the original LZ tend to be close together. At the climatic outlier of our evaluation data (DML in Fig. 5), MAP LZ 20 performs slightly worse than the original LZ ( Fig. 5i) but improvements are reached for MAP HL and MAP Ar (Figs. 5c, 5f). This demonstrates benefits of this method even at the limits of the calibration range.

Compared to the original HL, MAP HL reaches improvements in
15 for 12 of the 22 evaluation cores and in for 5 of the 11 evaluation cores (Fig. 6a). Generally, MAP HL performs better at AIS sites and worse at GrIS sites. An analysis of the 25 improvement of MAP HL as a function of climatic variables (Fig. 6a) shows that the original HL gives better results in a narrow range of : from -30 to -25 ℃. As such, the better performance at the GrIS evaluation sites is likely due to the original HL being better suited for the particular temperature range corresponding to the conditions of the latter sites. In contrast, MAP HL seems more appropriate for covering a wider range of climatic conditions. For Ar, the original model shows better performance than MAP Ar at few evaluation sites (5 for 15 and 1 for ) which are only in AIS and confined to low-accumulation 30 conditions (Fig. 6b). This is counterintuitive given that Arthern et al. (2010) tuned the original Ar to measurements from high accumulation sites of AIS. Finally, the original LZ performs better than MAP LZ at most GrIS sites (Fig. 6c), which is unsurprising given that its original calibration was GrIS-specific. Again, this seems related to the original LZ performing https://doi.org/10.5194/tc-2019-274 Preprint. Discussion started: 3 January 2020 c Author(s) 2020. CC BY 4.0 License. significantly better in the same narrow range of temperatures as for HL. In total, MAP LZ performs better for 11 of the 22 15 and 4 of the 11 evaluation measurements.
As explained in Sect. 2.3, the original LZ model was developed for GrIS firn only (Li and Zwally, 2011) and later complemented by an AIS-specific model (Li and Zwally, 2015). Using both of these on the evaluation sites of their respective 5 calibration ice sheet, we construct an LZ dual model, which thus really consists of two different models. The RMSE for 15 of LZ dual is slightly lower (-9 %) than that of MAP LZ but significantly larger (+37 %) for . We note that the RMSE of LZ dual is strongly affected by the stage-2 densification performing very poorly at the climatic outlier of the evaluation data, with conditions that are outside of the calibration range of Li and Zwally (2015). 10 We also compare MAP results with the IMAU firn densification model (IMAU-FDM), which has been used frequently in recent mass balance assessments from altimetry Babonis et al., 2016;McMillan et al., 2016;Shepherd et al., 2018). IMAU-FDM was developed by adding two tuning parameters to both densification stages of Ar. All four extraparameters are different for AIS (Ligtenberg et al., 2011) and GrIS (Kuipers Munneke et al., 2015), thus also resulting in two separate models. On the evaluation data, its performance for 15 is slightly better than MAP Ar and MAP LZ but worse than 15 MAP HL , and its performance for is significantly worse than all three MAP models (Table 2).
To assess the uncertainty captured by the Bayesian posterior distributions, we compute results on the evaluation data with the 500 parameter sets randomly selected from each of the three posterior ensembles. For all three models, the average performance of their random sample is similar to the corresponding MAP performance, with a maximum RMSE change of 6% (Table 2). 20 This demonstrates a low uncertainty in the optimal parameter combinations identified by calibration. Furthermore, the best performing 95 th percentile of the random selection allows the construction of the uncertainty intervals shown in Figs 4, 5. Of the original models, LZ reaches the lowest RMSE values. MAP HL performs better in both 15 and than any model tested (Table 2). Comparing the performances of MAP models, MAP HL is followed by MAP Ar and then MAP LZ . This order is still valid for the 500-samples random selections, which account for uncertainty (Table 2). 25

Discussion and Implications
This calibration method is potentially applicable to models of similar complexity in a broad range of research fields. We exploit it here to investigate the parameter space of HL, Ar and LZ, and to re-estimate optimal parameter values conditioned on observed calibration data; no further complexity is introduced since the number of empirical parameters remains the same. We treat the accumulation exponents of Ar ( , ) as free parameters whereas Arthern et al. (2010) decided to fix their values to 1. 30 Analogous to and in HL, these exponents capture the mathematical relationship between densification rates and the accumulation rate, used as a proxy for load increase on any specific firn layer. No physical argument favours a linear https://doi.org/10.5194/tc-2019-274 Preprint. Discussion started: 3 January 2020 c Author(s) 2020. CC BY 4.0 License. proportionality between densification and load increase and any prescribed value for these exponents is a choice of the model designer. Unlike Arthern et al. (2010), Herron and Langway (1980) previously inferred = 1 and = 0.5. Our calibration data shows strong evidence against both these pairs of values; all four are outside the posterior 95% credible intervals (Fig. 3a,   3b). Our results of stage-1 exponents ( , ) smaller than 1 indicate a weaker increase in densification rates with pressure. In firn, the load is supported at the contact area between the grains, which increases on average due to grain rearrangement (in 5 stage-1) and grain growth. As such, firn strengthens in time and the actual stress on ice grains increases slower than the total load (Anderson and Benson, 1963). Morris and Wingham (2014) incorporated this by including a temperature-history function, causing slower densification of firn previously exposed to higher temperatures. This is consistent with both grain rearrangement and grain growth because these processes are enhanced at higher temperatures (Alley, 1987;Gow et al., 2004).
Lower values of the stage-2 exponents ( , ) illustrate the larger strength of high-density firn with larger contact areas between 10 grains. The same can be applied to the LZ model by investigating the posterior correlation between its free parameters. It shows a positive correlation coefficient (0.6) between the accumulation-related parameters of both stages; 12 and 22 . High values of 12 make 0 more sensitive to ̇ (Eq. (6)). However, 0 appears in the numerator of the 1 calculation (Eq. (6)) and higher values of 22 thus moderate the sensitivity of stage-2 densification to ̇. As such, positively correlated 12 and 22 provide further evidence that stage-1 densification rates are more sensitive to accumulation rates. The posterior correlations of 15 all three models are further discussed in the Supplementary Information. HL, Ar and LZ only use temperature and accumulation rates as input variables. Other models use additional variables hypothesised to affect densification rates. These include the temperature-history mentioned above (Morris and Wingham, 2014), firn grain size (Arthern et al., 2010), impurity content (Freitag et al., 2013) and a transition region between stage-1 and 20 stage-2 densification (Morris, 2018). Other models are explicitly based on micro-scale deformation mechanisms (Alley, 1987;Arthern and Wingham, 1998;Arnaud et al., 2000). These efforts undoubtedly contribute to progressing towards physically based models. A potential problem with such approaches is overfitting calibration data by adding parameters to model formulations while detailed firn data remain scarce. As long as more firn data is not available to appropriately constrain the role of each variable in model formulations, we favour the use of parsimonious models relying on few input variables. It is 25 noteworthy that MAP LZ , which relies on eight free parameters, performs worse on the evaluation data than MAP HL and MAP Ar with two fewer free parameters. This highlights that gains in model accuracy should rely not only on better calibration of parameters but also on a reconsideration of the governing densification equations. Additionally, firn core data invokes the assumption of a steady-state depth-density profile. As such, parameter calibration poorly captures seasonal climatic effects on densification. Comprehensive datasets of depth-density profiles (Koenig and Montgomery, 2019) are very valuable to model 30 development. Efforts in collecting and publishing strain rate measurements from the field (Hawley and Waddington, 2011;Medley et al., 2015;Morris et al., 2017), and possibly from laboratory experiments (Schleef and Löwe, 2013), can further benefit model calibration and the progress towards more representative equations.
As an example of consequences of our calibration, we investigate its effects on GrIS firn thickness change under the 2000-2017 climate. At all 27 GrIS sites of our dataset, we compute the cumulative anomaly in thickness change due to firn compaction during the latter period with respect to the reference 1960-1980 period (see Supplementary Information).
Altimetry-based mass balance surveys could interpret any change in firn thickness not captured by firn models as a net gain or loss of ice. The root mean square difference in compaction anomaly between MAP HL and the original HL is 1.5 cm over the all six models. These discrepancy figures are approximate, since different climatic sensitivities of models and variability in climatic changes will cause compensating effects in different areas. Still, they provide an order of magnitude for potential errors in mass balance assessments that are related to choices of model and of parameter values if firn densification is exposed to a realistic climatic shift.
In addition to different cumulative responses, the six models show different sensitivities in terms of monthly values of 15 compaction anomalies over the 2000-2017 period (Fig. 7). Ar shows the strongest sensitivity to climatic conditions diverging from these of the reference period; compaction responds strongly to the general increases in temperature and accumulation rate, especially in late summer. Due to its lower values for , and , MAP Ar exhibits less extreme compaction anomalies than the original Ar and thus less seasonal variability. In sharp contrast to Ar, HL-computed compaction rates remain relatively stable, due to low activation energy values that smooth out the seasonal variability. Firn core observations provide little 20 information and constraints on seasonal patterns of densification. However, it is noteworthy that MAP Ar and MAP LZ tend to show comparable short-scale sensitivities (insets in Fig. 7), despite structural differences in the models' governing equations.
This might indicate that these models fare relatively well in capturing seasonal fluctuations of densification rates and their sensitivity to climate shifts.

Conclusion 25
We have implemented a Bayesian calibration method to estimate optimal parameter combinations applicable to GrIS and AIS firn for three benchmark firn densification models (HL, Ar, LZ). An extensive dataset of 91 firn cores was separated into calibration and independent evaluation data. Two optimised models (MAP HL , MAP Ar ) showed significant improvement against the evaluation data, while MAP LZ reached results close, but slightly worse, to its original version and inferior to MAP HL and MAP Ar . When compared to other models of greater complexity, the MAP models showed comparable or even improved 30 performances, especially MAP HL . Furthermore, the Bayesian approach provides a robust way to evaluate the uncertainty related to parameter value choice, which is a major deficiency of current models. However, at most sites where we evaluated, all three models' uncertainty intervals do not cover observed values. As such, although model results can be improved by re-calibration methods, model tuning alone is insufficient to reach exact fidelity of firn densification models. The formulation of models' governing equations impacts the remaining errors with respect to observations, which highlights deficiencies in our understanding of dry firn densification. Developing a well-constrained physically detailed model is challenging given the number of mechanisms affecting densification rates and their dependency on microstructural properties of firn, which are 5 difficult to observe. Our study demonstrates that, despite these observational limitations, thorough calibration methods relying only on climatic variables can substantially improve firn model accuracy, and constrain uncertainties.      https://doi.org/10.5194/tc-2019-274 Preprint. Discussion started: 3 January 2020 c Author(s) 2020. CC BY 4.0 License. Figure 6. Improvements of the MAP models with respect to the original models for the evaluation data. The ratios indicate the ratios of cores for which an improvement is achieved by the corresponding MAP.