Inferring the basal sliding coefficient field for the Stokes ice sheet model under rheological uncertainty

We consider the problem of inferring the basal sliding coefficient field for an uncertain Stokes ice sheet forward model from synthetic surface velocity measurements. The uncertainty in the forward model stems from unknown (or uncertain) auxiliary parameters (e.g., rheology parameters). This inverse problem is posed within the Bayesian framework, which provides a systematic means of quantifying uncertainty in the solution. To account for the associated model uncertainty (error), we employ the Bayesian approximation error (BAE) approach to approximately premarginalize simultaneously over both the noise in measurements and uncertainty in the forward model. We also carry out approximative posterior uncertainty quantification based on a linearization of the parameter-to-observable map centered at the maximum a posteriori (MAP) basal sliding coefficient estimate, i.e., by taking the Laplace approximation. The MAP estimate is found by minimizing the negative log posterior using an inexact Newton conjugate gradient method. The gradient and Hessian actions to vectors are efficiently computed using adjoints. Sampling from the approximate covariance is made tractable by invoking a low-rank approximation of the data misfit component of the Hessian. We study the performance of the BAE approach in the context of three numerical examples in two and three dimensions. For each example, the basal sliding coefficient field is the parameter of primary interest which we seek to infer, and the rheology parameters (e.g., the flow rate factor or the Glen’s flow law exponent coefficient field) represent so-called nuisance (secondary uncertain) parameters. Our results indicate that accounting for model uncertainty stemming from the presence of nuisance parameters is crucial. Namely our findings suggest that using nominal values for these parameters, as is often done in practice, without taking into account the resulting modeling error, can lead to overconfident and heavily biased results. We also show that the BAE approach can be used to account for the additional model uncertainty at no additional cost at the online stage.

Abstract. We consider the problem of inferring the basal sliding coefficient field for an uncertain Stokes ice sheet forward model from synthetic surface velocity measurements. The uncertainty in the forward model stems from unknown (or uncertain) auxiliary parameters (e.g., rheology parameters). This inverse problem is posed within the Bayesian framework, which provides a systematic means of quantifying uncertainty in the solution. To account for the associated model uncertainty (error), we employ the Bayesian approximation error (BAE) approach to approximately premarginalize simultaneously over both the noise in measurements and uncertainty in the forward model. We also carry out approximative posterior uncertainty quantification based on a linearization of the parameter-to-observable map centered at the maximum a posteriori (MAP) basal sliding coefficient estimate, i.e., by taking the Laplace approximation. The MAP estimate is found by minimizing the negative log posterior using an inexact Newton conjugate gradient method. The gradient and Hessian actions to vectors are efficiently computed using adjoints. Sampling from the approximate covariance is made tractable by invoking a low-rank approximation of the data misfit component of the Hessian. We study the performance of the BAE approach in the context of three numerical examples in two and three dimensions. For each example, the basal sliding coefficient field is the parameter of primary interest which we seek to infer, and the rheology parameters (e.g., the flow rate factor or the Glen's flow law exponent coefficient field) represent so-called nuisance (secondary uncertain) parameters. Our results indicate that accounting for model uncertainty stemming from the pres-ence of nuisance parameters is crucial. Namely our findings suggest that using nominal values for these parameters, as is often done in practice, without taking into account the resulting modeling error, can lead to overconfident and heavily biased results. We also show that the BAE approach can be used to account for the additional model uncertainty at no additional cost at the online stage.

Introduction
Inferring the basal sliding coefficient field using both the linear and nonlinear Stokes ice sheet model from noisy surface velocity measurements has received considerable attention in recent years (see, for example, Truffer, 2004, Raymond and Gudmundsson, 2009, Pollard and DeConto, 2012, Isaac et al., 2015b, Morlighem et al., 2013, Zhao et al., 2018a, b, Giudici et al., 2014, Petra et al., 2012, and Isaac et al., 2015a. The standard approach to this problem invariably assumes that the other parameters of the ice, such as those controlling the rheology, are known precisely. This is particularly common, for example, in the case of the so-called flow rate factor and the Glen's flow law exponent, in which nominal values such as A = 10 −16 Pa −n a −1 and n = 3, respectively, are prescribed; we refer, e.g., to Isaac et al. (2015a), Petra et al. (2014), Gudmundsson (2009), Truffer (2004), Morlighem et al. (2013), Zhu et al. (2016), Zhao et al. (2018b), Giudici et al. (2014), and Pollard and De-Conto (2012). The inference problem is made significantly more challenging (both theoretically and numerically) by al-lowing the rheology parameters to be uncertain and spatially varying. One possible approach to solve the problem is to infer both the basal sliding coefficient and the rheology parameters. However, this considerably increases both the illposedness of the inverse problem and the associated computational costs. For most ice sheet inverse problems considered in the literature, the field of interest is the basal sliding parameter, which arguably presents the largest uncertainty in determining the ice flow rate.
It is well documented that, in practice, the rheology parameters of ice sheets are not known exactly (e.g., Bons et al., 2018;Marshall, 2005;Gillet-Chaulet et al., 2011Cuffey and Paterson, 2010;Brondex et al., 2019;Raymond and Gudmundsson, 2011). Compounding this issue is the fact that measured ice velocities can be heavily influenced by rheology parameters (Schlegel et al., 2015;Bulthuis et al., 2019). This fact was demonstrated in Petra et al. (2012), in which the authors used the same Stokes ice sheet model as in the current paper to reconstruct reasonable estimates of the Glen's flow law exponent from noisy surface velocity measurements, suggesting that the surface measurements are indeed sensitive to changes in the Glen's flow law exponent field. Despite these findings, it is standard in the literature to assume that rheology -among other -parameters of the ice are known a priori (see, for instance, Bons et al., 2018, Marshall, 2005, Gillet-Chaulet et al., 2011, Cuffey and Paterson, 2010, Brondex et al., 2019, and Van der Veen, 2013. In this paper, we treat the rheology parameters (specifically the Glen's flow law exponent and the flow rate factor fields) as auxiliary (nuisance) parameters, i.e., parameters which are not of primary interest. However, fixing these auxiliary parameters at incorrect, though possibly well-justified, values often induces so-called modeling errors. It is well understood, though, that the solutions to inverse problems are generally sensitive to modeling errors which -if not properly accounted for -can lead to inaccurate, nonphysical, and in some cases meaningless solutions of the inverse problem (Brynjarsdóttir and O'Hagan, 2014;Giudici et al., 2014;Somersalo, 2007, 2005). From a statistical viewpoint, fixing auxiliary parameters to nominal values suggests that these parameters are known exactly and hence neglects all associated uncertainties. This in turn often results in biased and overconfident estimates for the parameters of interest (see, for example, Kaipio and Somersalo, 2007, Kaipio and Kolehmainen, 2013, and Nicholson et al., 2018 and the references therein).
We carry out estimation of the basal sliding coefficient within the Bayesian framework (Kaipio and Somersalo, 2005;Stuart, 2010), which is particularly well suited to incorporating various sources and types of uncertainties, including those resulting from model errors (Tarantola, 2005;Somersalo, 2005, 2007). Moreover, to ensure the work here is readily transferable to inference problems in large-scale ice flow problems, such as those discussed in Isaac et al. (2015a), we make use of the computational framework proposed in Bui-Thanh et al. (2013) and Petra et al. (2014) for handling infinite-dimensional Bayesian inverse problems (Stuart, 2010). This approach, combined with adjoint-based means to compute the derivative information needed by the optimization solver, ensures mesh independence and computational efficiency.
To account for the uncertainty in the rheology parameters, we utilize the Bayesian approximation error (BAE) approach Somersalo, 2005, 2007;Kaipio and Kolehmainen, 2013), which, broadly speaking, lumps all modeling and measurement uncertainties into a single additive total error term. The total error can then be approximately marginalized over in a similar manner to how standard additive errors are dealt with (Kaipio and Kolehmainen, 2013). The BAE approach is particularly attractive computationally as (a) the approximate marginalization can be carried out prior to data acquisition, i.e., premarginalization, and (b) the equations to be solved in the adjoint-state approach maintain the same general form (Nicholson et al., 2018).
The BAE approach has been used in a variety of settings (see, for example, Kaipio and Kolehmainen, 2013, Arridge et al., 2006, Castello and Kaipio, 2019, and Lamien et al., 2019 among others, and the references therein). A particularly relevant, and recent, example is the application of the approach to the so-called Robin inverse problem encountered, for instance, in corrosion detection (Nicholson et al., 2018). There the parameter of interest is also a Robin-type boundary condition on an inaccessible part of the domain, while the nuisance parameter is the (electrical or thermal) conductivity of the domain.
To study the performance of the BAE approach, we formulate and solve three ice sheet flow model problems involving synthetic data. Our results suggest that simply setting rheology parameters to nominal values can result in severely misleading estimates of the basal sliding coefficient field and associated posterior uncertainty if the additional uncertainty in the rheology parameters is not accounted for. In comparison, we show that incorporating the additional modeling uncertainties using the BAE approach leads to sensible estimates of the basal sliding coefficient and reasonable posterior uncertainty at no additional online cost. We place particular emphasis on the feasibility of the posterior uncertainty estimates, in particular, on how well the true parameter is contained within the posterior distribution.

Contributions
In previous work, we addressed the problem of inferring the basal sliding coefficient field from surface velocity measurements in the context of ice sheet flow in a deterministic, moderate scale, synthetic observational data setting in Petra et al. (2012), in a Bayesian inference and infinite-dimensional setting in Petra et al. (2014), and more recently in a large-scale, real data setting in Isaac et al. (2015a). Here the goal is to extend this inversion framework to account for additional uncertainties in the ice sheet model. The main contributions of this paper are as follows. Firstly, we show that setting rheology parameters to values commonly found for ice sheet models in the literature can lead to erroneous posterior estimates of the basal sliding coefficient if the underlying uncertainty in the rheology parameters is not accounted for. Secondly, we show that this situation can be remedied by employing the BAE approach to premarginalize over rheology uncertainties. Thirdly, we show that this approach requires no additional computational resources or time at the online stage as all computations required for premarginalization are carried out prior to the acquisition of data.

Organization of paper
The paper is organized as follows. In Sect. 2, we outline the forward nonlinear Stokes flow equations for ice sheet problems, while in Sect. 3 we briefly review the Bayesian framework for inverse problems, the computation of the maximum a posteriori estimate, and the approximate posterior covariance. In Sect. 4, we show how to apply the BAE approach to premarginalize over auxiliary parameters. In Sects. 5 and 6, we formulate and solve three ice sheet inverse problems and study the performance of the proposed method. Finally, Sect. 7 provides concluding remarks.

Forward ice sheet flow model
In this section, we describe the forward ice sheet flow problem that is used for the inference of the basal sliding coefficient field under uncertain rheology. As in Petra et al. (2012Petra et al. ( , 2014 and Isaac et al. (2015a), we model the flow of ice as an isothermal, viscous, shear-thinning, incompressible fluid via the balance of mass and linear momentum (Hutter, 1983;Marshall, 2005;Paterson, 1994), namely where u denotes the velocity field, σ u the stress tensor, ρ the density of the ice, and g gravity. The stress, σ u , can be decomposed as σ u = τ u − Ip, where τ u is the deviatoric stress tensor, p the pressure, and I the identity tensor. The domain considered in this paper is = [0, L] d−1 × [0, H ], for d = 2 or d = 3. We employ the Glen's flow law (Glen, 1955) which relates the stress and strain rate tensors by where η is the effective viscosity, A is the flow rate factor, anḋ ε u = 1 2 (∇u + ∇u T ) andε II = 1 2 tr(ε 2 u ) are the strain rate tensor and its second invariant. Above, n = n(x) is the spatially varying Glen's flow law exponent, which satisfies n(x) ≥ 1 Figure 1. Schematic of a two-dimensional slab of ice (used in examples 1 and 2). The schematic can also be thought of as a cross section through the three-dimensional slab of ice used for Example 3. The blue circles show representative (random) measurement locations but do not necessarily coincide with the actual measurement locations used in the examples. θ is the slope of the ice slab.
for all x ∈ to ensure the ice is a shear-thinning fluid (Glen, 1955).
In line with Petra et al. (2012), the top boundary t is equipped with a traction-free boundary condition, all lateral boundaries p are equipped with periodic boundary conditions, and on the basal surface b we apply a no flow condition for the normal component of u along with a linear sliding law for the tangential components. That is, the boundary conditions are given by (1d) u| l = u| r and σ u n| l = σ u n| r on p , where β(x) is the log basal sliding coefficient field 1 , n is the outward normal unit vector, and T := I − nn T is the projection onto the tangential plane. Above, we generically used l and r to denote pairs of opposing boundaries on p on which periodic conditions are imposed. We note that β generally represents a combination of complex phenomena (see, for example, Schoof, 2005, and Perego et al., 2014. Furthermore, the methods and results discussed in the current paper do not rely on the particular top and lateral boundary conditions specified. As such, alternative boundary conditions could also be imposed on t and p . For a simple illustration of the problem setup (shown in two dimensions), see (1), as it is the starting point for both the finite element discretization of the forward model and the computation of the gradient and action of the Hessian required for the solution of the inverse problem using the adjoint-state method (see, for example, Isaac et al., 2015a). Multiplying the nonlinear Stokes system (1) with arbitrary test functionsũ and p and using integration by parts over (Gockenbach, 2006;Elman et al., 2005), the weak form of Eq. (1) is given by the following. Find (u, p) ∈ W = V × Q such that for all (ũ,p) ∈ W. In line with Elman et al. (2005) and Isaac et al. (2015b), we set V := {u ∈ (H 1 ( )) d : u| l = u| r , u · n| b = 0} and Q := (L 2 ( )) d , for d = 2 or d = 3.
-Discretization. To guarantee the inf-sup stability (wellposedness) of the discretized forward problem, we discretize the velocity and pressure using Taylor-Hood finite elements, i.e., quadratic elements for each velocity component and linear elements for pressure (see, for example, Elman et al., 2005). The basal sliding coefficient field is discretized using continuous linear Lagrange basis functions φ j (s) m j =1 , i.e., β h (s) = m j =1 β j φ j (s), where s ∈ b . In what follows, we denote by β = (β 1 , β 2 , . . ., β m ) ∈ R m the discrete basal sliding coefficient field.

Inferring the basal sliding coefficient field
In this section, we summarize the Bayesian inference framework, which will be used in combination with the Bayesian approximation error approach, to account for uncertainties in rheology parameters. To allow for the systematic incorporation of uncertainties, we consider the inverse problem in the Bayesian framework (Tarantola, 2005;Kaipio and Somersalo, 2005). In this framework, the solution of the underlying statistical inverse problem is given by the posterior probability density. For nonlinear inverse problems with expensive forward models and high-dimensional parameters (as is the case for ice sheet inverse problems), fully characterizing the posterior is typically not tractable. Consequently, we compute the Laplace approximation of the posterior, which requires only the maximum a posteriori (MAP) estimate, i.e., the basal sliding coefficient which maximizes the posterior density and the approximate posterior covariance.
We use Bayes' theorem to write the solution of the Bayesian inverse problem as the posterior measure, which describes the probability law of the parameter conditioned on measurements (Tarantola, 2005;Stuart, 2010). The Formulation of the posterior relies on both the prior density and the likelihood function, which we outline below. We note that initially we pose the problem in an infinite-dimensional setting, which is particularly well suited to large-scale problems (e.g., Bui-Thanh et al., 2012;Isaac et al., 2015a) as it ensures the discretization invariance and well-posedness of the Bayesian inverse problem (Stuart, 2010).

The prior
We postulate a Gaussian prior density on the (spatially varying) basal sliding coefficient, i.e., β ∼ N (β * , C β ), with covariance operator C β , and mean value β * ∈ E, where E is defined as the range of C 1 2 β (see, for example, Stuart, 2010 andBui-Thanh et al., 2013 for more details). To ensure the inverse problem is well posed in infinite dimensions, we use a squared inverse elliptic operator to define the prior covariance operator 2 (e.g., Flath et al., 2011;Bui-Thanh et al., 2013;Petra et al., 2014).
More specifically, we take C β = A −2 , where A is the second order elliptic differential operator defined by where the strictly positive parameters γ β (m 2 ) and δ β (adimensional) control the correlation length and the marginal variance. Specifically, the correlation length (defined as the distance for which the two points have a correlation coefficient of 0.1) is proportional to γ β /δ β (m), while the variance is proportional to δ −2 β γ β /δ β d−1 2 (see, for example, Khristenko et al., 2019 and the references therein). This choice of prior covariance operator is particularly well suited to large-scale problems, as discretization of A (using a finite element discretization) is sparse (see, for example, Lindgren et al., 2011 andOsborn et al., 2017). As discussed in Khristenko et al. (2019), Daon and Stadler (2018), and Roininen et al. (2014), suitable boundary conditions need to be stipulated to reduce boundary artifacts. In this work, we choose to equip A with periodic boundary conditions on ∂ b , which parallels the periodic boundary condition (1e) of the forward model. We note that the discrete representation of the prior covariance operator, denoted pr , is defined as follows (e.g., Bui-Thanh et al., 2013;Petra et al., 2014;Villa et al., 2021): Therefore, the discrete parameter β follows a Gaussian distribution N (β * , pr ), with prior mean β * ∈ R m and covariance pr . That is, the prior density of β is given by where · −1 pr denotes the −1 pr weighted l 2 inner product.

The data likelihood
We assume the velocity measurements, denoted d, are corrupted by additive noise and are related to the basal sliding coefficient through where F : L 2 ( ) → R q is called the parameter-toobservable map, and e ∈ R q denotes the noise in the measurements.
As is somewhat common in the literature (e.g., Raymond and Gudmundsson, 2009;Petra et al., 2012Petra et al., , 2014, we take the data to consist of (noisy) point-wise observations of each component of the velocity field on the top surface 3 . In discrete settings, we compute F(β) by first solving the Stokes equation (Eq. 1) and then applying a linear observation operator that extracts the velocity at the measurement locations. We assume the noise, e, is independent of the basal sliding coefficient, has zero mean, and is Gaussian, i.e., e ∼ N (0, e ). The likelihood is then of the following form (e.g., Tarantola, 2005;Kaipio and Somersalo, 2005):

The posterior
By applying Bayes' theorem, the posterior density of β is proportional to the product of the prior density (Eq. 5) and the data likelihood (Eq. 7). This is given by The corresponding MAP estimate is then defined as We note that the problem of finding the MAP estimate, defined in Eq. (9), reduces to a deterministic inverse problem.
To solve this problem we use an inexact Newton conjugate gradient (CG) method, as in Petra et al. (2012). To derive the required first (i.e., gradient) and second (i.e., Hessian) derivative information needed by Newton's method, we use an adjoint-based method and refer the reader to Petra et al. (2012) for the derivation and expressions of the required derivatives.

Quantifying posterior uncertainty
To (approximately) quantify the resulting uncertainty in the inferred basal sliding parameter, we invoke a local Gaussian approximation of the posterior (i.e., the Laplace approximation). That is, the solution to the Bayesian inverse problem is now given by a Gaussian distribution with mean β MAP and covariance po given by the inverse of the (Gauss-Newton) Hessian of the negative log-posterior, evaluated at the MAP estimate. More specifically, we make the approximation, β|d ∼ N (β MAP , po ), with β MAP given by Eq. (9), and where H(β) is the Gauss-Newton Hessian of the data misfit term (i.e., the negative log-likelihood), and F is the Jacobian matrix of the parameter-to-observable map, F (e.g., Bui-Thanh et al., 2013). The construction of the posterior covariance matrix (i.e., the inverse of the Hessian) is prohibitive for large-scale problems since its dimension is equal to the dimension of the parameter. To make operations with the posterior covariance matrix tractable, we exploit the fact that the eigenvalues of H(β MAP ) collapse to zero rapidly since the data contain limited information about the (infinite-dimensional) parameter field. Thus a low-rank approximation of the data misfit component of the Hessian H can be constructed as in Isaac et al. (2015a) by solving the generalized eigenvalue problem where r = diag(λ 1 , λ 2 , . . ., λ r ) ∈ R r×r is a diagonal matrix collecting the r largest generalized eigenvalues, λ i , and ∈ R m×r is the matrix collecting the corresponding −1 pr -orthonormal eigenvectors, v i . Above, the truncation index r is chosen such that the remaining eigenvalues, λ i , for i = r + 1, . . ., m, are sufficiently smaller than 1 -often chosen such that λ i ≤ c, for some 0.01 ≤ c ≤ 1 (e.g., Isaac et al., 2015a;Flath et al., 2011).

Premarginalization over auxiliary parameters and the Bayesian approximation error approach
The Bayesian approximation error (BAE) approach Somersalo, 2007, 2005;Kaipio and Kolehmainen, 2013) can be used to approximately premarginalize over auxiliary parameters. The BAE approach essentially combines all uncertainties, including those generated by fixing uncertain parameters, into a single additive total error term. The total error term can then be premarginalized over, i.e., marginalized over before the acquisition of data. We next outline the process. We denote by a the auxiliary parameters which in the current study are defined over the entire computational domain, , and are assumed to be Gaussian distributed with covariance operator C a = L −2 , where L is defined by and mean value a * . In line with the forward problem, L is equipped with periodic boundary conditions on the lateral boundaries of , while on the top and bottom boundaries we enforce Robin boundary conditions. Note that explicit knowledge of the distribution of a is not needed; we only require the ability to sample realizations of a. In what follows, we denote by a any (possibly more than one) discretized auxiliary (uncertain) parameter, such as the rheology parameters. Next, we letF(β, a) denote an accurate parameter-toobservable mapping so that the relationship between the parameters and the measured data is given by Then, with the aim of avoiding so-called joint inversion, i.e., estimating β and a simultaneously, we introduce the approximate parameter-to-observable mapping: F(β) =F(β, a * ).
That is, the approximate parameter-to-observable map is the accurate parameter-to-observable map but with the auxiliary parameters fixed to the associated mean value, i.e., a = a * . Fixing a to some other nominal value is also possible. The goal is then to carry out the estimation of β using only the approximate parameter-to-observable map, F(β), while taking into account the (statistics of) the discrepancy between the models. To this end, Eq. (14) is reformulated as where ε =F(β, a) − F(β) is known as the approximation error and ν = e + ε as the total error (e.g., Nicholson et al., 2018;Tarvainen et al., 2010). Next, the approximation error is approximated as a Gaussian with mean ε * and covariance ε , i.e., ε ∼ N (ε * , ε ). Though, formally, the approximation error depends on the parameters, i.e., ε = ε(β, a), a further approximation, termed the enhanced error model or the composite error model approximation, is often made, which approximates ε as independent of all parameters (Kaipio and Kolehmainen, 2013). This leads to the total errors being distributed as ν ∼ N (ν * , ν ) = N (ε * , e + ε ).
Use of the BAE approach results in an updated posterior density for β: which is obtained by explicit marginalization over ν (Kaipio and Kolehmainen, 2013). The updated MAP estimate is then This updated expression for the MAP estimate is only a slight modification of the original MAP estimate given in Eq. (9); thus reformulating the corresponding adjoint, incremental forward, and incremental adjoint equations is essentially trivial. Lastly, the updated posterior covariance matrix (under the Laplace approximation) is now given by

Computing the approximation error statistics
In the current paper, all parameters are taken to have Gaussian (prior) distributions, i.e., z ∼ N (z * , C z ), with z = (β, a). We also assume β and a are independent; thus specifying β * , a * , C β , and C a fully describes the prior density. Unlike the statistics of the parameters and the measurement noise, the mean (ε * ) and covariance ( ε ) of the approximation errors must in general be estimated based on (Monte Carlo) samples. That is, with N ∈ N the number of samples, ε ( ) =F(β ( ) , a ( ) ) − F(β ( ) ) for = 1, 2, . . ., N , where β ( ) and a ( ) are samples drawn from the joint prior density, and The samples, β ( ) and a ( ) , are generated efficiently as in Villa et al. (2021, Eq. 30).
It is worth noting that all sampling and computations of the approximation errors and the associated statistics can be carried out prior to the acquisition of any data and is thus often termed offline computations (Kaipio and Kolehmainen, 2013). Furthermore, though the computational cost per sample of ε in the current paper is two forward (nonlinear) Stokes solves, the sampling procedure is embarrassingly parallel, i.e., each sample can be carried out independently, and in practice, only a fairly small number of samples is required.
We conclude this section by giving several rules of thumb relating to the use of the BAE approach (for more details see Kaipio and Kolehmainen, 2013). Firstly, the total number of samples required to accurately construct the statistics of the approximation errors is generally (often substantially) less than N = 1000. Secondly, two measures have been developed to identify when neglecting the approximation errors can result in misleading, and potentially infeasible, results (Kaipio and Kolehmainen, 2013). Specifically, if either holds, then the approximation errors are said to dominate the noise, and ignoring them often gives erroneous results. Intuitively, if the approximation errors dominate the noise, then ignoring them often results in overconfidence in the approximate forward model, in turn leading to overly confident and biased posterior estimates.

Numerical examples
In this section, we outline three numerical examples to assess the applicability, performance, and robustness of the BAE approach to account for uncertain rheology parameters. Several additional examples are provided in the accompanying Supplement, as detailed in Sect. 6.4. In all cases, the parameter of interest is the basal sliding coefficient, β. Any other unknown/uncertain parameters are (approximately) premarginalized over using the BAE approach, as outlined in Sect. 4. The forward problems considered here are based on the models used in the Ice Sheet Model Intercomparison Project for Higher-Order ice sheet Models (ISMIP-HOM) benchmark study carried out in Pattyn et al. (2008). Accordingly, all problems are considered in box-like geometries, i.e., Furthermore, in all model problems we take the ice slab to be set on an incline plane with slope θ = 0.1 • , the density of the ice to be ρ = 910 kg m −d , and the gravitational acceleration constant to be g = 9.81 m s −2 . For all examples, we set the length at L = 10 4 m, while for examples 1 and 2 we set H = 250 m, and for Example 3, we set H = 10 3 m. In Fig. 1, we show a two-dimensional schematic of the model problem setup.
The true basal sliding coefficient fields used for each example are based on those in Petra et al. (2012). Specifically, letting ω = 2π/L, for examples 1 and 2 (posed in two dimensions) we set β(s) = 7 + sin(ωs), ∀s ∈ b , as shown in Fig. 7, while in Example 3 (posed in three dimensions) we set β(s) = 7 + 3 sin(ωs 1 ) sin(ωs 2 ), ∀s ∈ b , as shown in Fig. 13. For all numerical experiments, we use synthetic measurements; these are randomly placed noisy point-wise measurements of each component of the velocity on the top surface of the domain, i.e., at points on t . Examples 1 and 2 are carried out based on q = 80 measurement locations, while for Example 3 we use q = 100 measurement locations. These measurements are obtained by adding zero mean white noise to the solution of the forward problem. Thus the additive noise is of the form e ∼ N (0, e ) with covariance matrix e = δ 2 e I. We take δ e to satisfy δ e = ( 1 100 ) × (max(Bu(β true )) − min(Bu(β true ))), i.e., the noise level is 1 % of the range of the noiseless synthetic measurements. The precise noise level is problem specific; however, when using GPS techniques and InSAR velocity measurements, a 1 % noise level is realistic (see, for example, Martin and Monnier, 2014 and the references therein).
For all examples considered here, the prior mean for the basal sliding coefficient, β, is set at β * = 7. On the other hand, the prior covariance operator, C β , is identical for examples 1 and 2, while for Example 3 different controlling parameters are used (details are provided in Table 1). Along with the true basal sliding coefficient used in examples 1 and 2, we also show the prior distribution and three samples drawn from the prior in Fig. 7. In Fig. 12, we show four samples from the prior used in Example 3.

Example problems
We now give the specific details of each model problem and make apparent which parameters we treat as auxiliary parameters and subsequently premarginalize over. Key details about each model problem are summarized in Table 1.
Example 1: uncertain flow rate factor in the two-dimensional linear Stokes ice sheet model The first example is carried out assuming a linearized (Stokes) ice sheet model in two-dimensions. Specifically, we set n = 1 in Eq. (1), resulting in the effective viscosity being given by η(x) = 1 2 A(x) −1 . The flow rate factor, A, is taken to be unknown and spatially varying, as is often the case in reality. We represent the flow rate factor as A = A 0 exp(−na(x)), with A 0 = 2.140373×10 −7 Pa −1 a −1 , n = 1 is the Glen's flow law exponent, and the pre-factor, exp(−na(x)), takes the role of the auxiliary parameter, which will subsequently be premarginalized over using the BAE approach. The pre-factor accounts for several physical and computational phenomena, such as the Arrhenius relationship between A(x) and the ice temperature (e.g., Cuffey and Paterson, 2010;Zhu et al., 2016) and the use of the so- called enhancement factors (Cuffey and Paterson, 2010;Ma et al., 2010). The "exp" function is used to ensure the prefactor remains positive.
The prior distribution of the flow rate pre-factor is set by taking the prior mean to be a * = 0, while the controlling parameters of the prior covariance operator are given in Table 1. The true pre-factor and three other samples drawn from this prior distribution are shown in Fig. 2 for examples 1a and 1b. As outlined below, the computational meshes used for examples 1a and 1b are different. This leads to differences in the true pre-factor used for both examples. This in turn results in different synthetic data being used for the inversions; however, in both cases, the standard deviation of the noise is δ e ≈ 0.07. In both cases, the flow rate pre-factor is discretized using continuous quadratic Lagrange basis functions.
We use this example to also demonstrate that the proposed approach is independent of the discretization, a critical property to have when aiming to solve large-scale problems. This is done by considering identical problems on two different levels of discretization. Specifically, we consider the problem on two structured meshes having substantially different levels of discretization.
(a) In the first case, the computational mesh consists of 2000 triangular elements, which results in the discretized velocity and pressure having 8400 degrees of freedom (DOFs), and 1100 DOFs, while the basal sliding coefficient has 100 unknowns, and the flow rate prefactor has 4200 DOFs.
(b) In the second case, the mesh is refined, and it consists of 8000 triangular elements, leading to 32 800, and 4200 DOFs for the discretized velocity and pressure, respectively, while the dimensions of the basal sliding coefficient and the flow rate pre-factor are 200 and 16 400, respectively.
Example 2: uncertain Glen's flow law exponent in the two-dimensional nonlinear Stokes ice sheet model For the second example, we use the nonlinear Stokes problem (Eq. 1) as the governing equation. We take the Glen's flow law exponent, n(x), as an uncertain (and unknown) spatially varying auxiliary parameter, i.e., we set a(x) = n(x), and proceed to approximately premarginalize over it. The prior mean of the Glen's flow law exponent is set to a * = 3, while the parameters controlling the covariance operator, C a , are given in Table 1 and are chosen to ensure that the Glen's flow law exponents are in line with the literature. As noted, a shear-thinning rheology is generally used when modeling ice sheets, and we thus enforce 1 ≤ n(x). In the current paper, this is done by rejection sampling, which corresponds to constraining the function space in which n lies (see Dashti and Stuart, 2016, Eq. 10.10 for details), though other methods could also be used, such as reparameterizing the Glen's flow law exponent. We also use this example to study the effect of larger modeling errors (i.e., excessive errors). That is, we consider the case when the variance of the approximation errors is so large that essentially all information in the data is washed out. As we shall see, however, the resulting uncertainty estimates are still feasible. To induce larger uncertainties (and resulting approximation errors), we alter the prior covariance operator for the Glen's flow law exponent, n, to favor more highly oscillatory realizations. We can thus further divide Example 2 into two cases: (a) modest approximation errors and (b) excessive approximation errors.
The parameters used to control the covariance of the distributions on n are shown in Table 1. The true Glen's flow law exponents used to generate the data for examples 2a and 2b are drawn from the respective distributions, which, along with several other samples of the Glen's flow law exponent from each of the distributions, are shown in Fig. 3. In both cases, the Glen's flow law exponent is discretized using continuous linear Lagrange basis functions, while the computational mesh used is the same as that used in Example 1a. Finally, in Example 2a we have δ e ≈ 0.04, while in Example 2b we have δ e ≈ 0.05.
Example 3: uncertain flow rate factor in the three-dimensional nonlinear Stokes ice sheet model In this example, we consider a three-dimensional (d = 3), nonlinear analogue of Example 1. Specifically, we consider Eq. (1) in three dimensions, with the Glen's flow law exponent set to n = 3. Similar to Example 1, we suppose the flow rate factor is spatially heterogeneous, unknown, and parameterized as A = A 0 exp(−na(x)). Here the nominal value for the flow rate factor is set to A 0 = 10 −16 Pa −3 a −1 , the Glen's flow law exponent is set to n = 3, and the pre-factor,  exp(−na(x)), takes into account several physical and computational phenomena as described previously.
The mean value of the auxiliary parameter is set at a * = 0, while the parameters controlling the distribution of the prefactor are given in Table 1. These values for the prior covariance operator of a ensure the flow rate values are in line with those presented in the literature (see, for example, Table 3.4 in Cuffey and Paterson, 2010). In Fig. 4, we show the true flow rate pre-factor (a sample from the prior) along with three other samples from the associated prior density. Unlike Example 1, the flow rate pre-factor in this example is discretized using continuous linear Lagrange basis functions. The computational mesh used consists of 19 200 tetrahedral elements, leading to 81 600 DOFs for the velocity, 3600 DOFs for the pressure, 27 200 DOFs for the flow rate pre-factor, and 400 DOFs for the basal sliding coefficient.

Estimates and approximate posterior covariances
For each of the examples listed above, we compare the estimation results (MAP points and approximate posterior co-variances) for three different approaches. Within each example, for each of the approaches, the same prior distribution is used for the basal sliding coefficient; thus it is only the associated likelihoods that differ. In our analysis, we place particular emphasis on the feasibility of the posterior estimates, that is, whether or not the computed posterior distributions support the true basal sliding coefficient. The three different approaches considered are as follows.
(a) The accurate case (REF). In this case, any auxiliary parameters are set to their true values; i.e., we usẽ F(β, a true ) as the parameter-to-observable map. REF is computed as a benchmark/reference. The resulting likelihood for REF is We denote the corresponding MAP estimate and the posterior covariance matrix by β CEM MAP and CEM po , respectively.

(c) The Bayesian approximation error approach (BAE).
This approach also uses the approximate model, F(β), but accounts for the approximation errors using the BAE approach outlined in Sect. 4. As given in Eq. (17), the updated likelihood found using the BAE approach is with the MAP estimate and the posterior covariance matrix denoted by β BAE MAP and BAE po , respectively.

Results
Here we discuss and compare MAP estimates for the basal sliding coefficient and the respective approximate posterior covariance for each example. As alluded to previously, we pay particular attention to the feasibility of the posterior uncertainty estimates when comparing the results. We also examine the spectrum of the prior preconditioned misfit Hessians, which gives further insight into the uncertainty, and Table 1. Details for each of the examples considered. The first column (Ex.) refers to the example number; the second, third, and fourth columns give details of the forward model used, including which Stokes model is used, the aspect ratio, L H , and the definition of the auxiliary parameter; the fifth, sixth, and seventh columns give the discretization details, including the number of degrees of freedom for the velocities and pressure, the number of degrees of freedom of the unknown parameters, (β, a) DOFs, and the number of measurements, q; finally, the 8th through 12th columns give details on the prior distributions for the unknowns, including the parameters controlling the prior covariance operator for β, the controlling parameters for the prior covariance operator of a, and the prior mean, a * . Note that in examples 2a and 2b the prior for the auxiliary parameter is further constrained by enforcing 1 ≤ n(x), while for all examples the prior mean for the basal sliding coefficient is taken as β * = 7. Furthermore, the correlation length for β in examples 1 and 2 is approximately 4900 m, while in Example 3 the correlation length is approximately 3200 m. Finally, in the definition of the auxiliary parameter for Example 3, the Glen's flow law exponent is n = 3.

Model details
Discretization details Prior distribution details Ex. Stokes the sensitivity of each approach. To conclude the section, we give a brief comparison of the online computational costs (in terms of linearized Stokes partial differential equation, PDE, solves) for computing the MAP estimates.
To solve the optimization problems, we use an inexact Newton-CG method (see, for example, Petra et al., 2012). In all cases, we start the optimization procedure using the prior mean for the initial estimate of the basal sliding coefficient, while the prior covariance operator is used as the preconditioner. The optimization is carried out using a Gauss-Newton Hessian approximation for the first five iterations and then full Newton combined with an Armijo line search (Nocedal and Wright, 2006). Convergence is established when the gradient has decreased by a factor of 10 6 relative to the norm of the initial gradient.
The numerical results presented in this paper are obtained using hIPPYlib (an inverse problem Python library; Villa et al., 2018Villa et al., , 2021. The hIPPYlib library implements stateof-the-art scalable adjoint-based algorithms for PDE-based deterministic and Bayesian inverse problems. It builds on FEniCS (Dupont et al., 2003;Logg et al., 2012) for the discretization of the PDEs and on PETSc (Balay et al., 2009) for scalable and efficient linear algebra operations and solvers needed for the solution of the PDEs. In line with the finite element discretization used for the weak form of the forward problem (Eq. 2), in what follows, we use Taylor-Hood finite elements for the adjoint, incremental forward, and incremental adjoint equations, as in Petra et al. (2012).

Example 1
In this example, we consider the case of an uncertain flow rate factor in the two-dimensional linear Stokes ice sheet model and demonstrate the mesh independence of the approach. We begin by discussing the statistics of the ap-proximation errors which are induced by treating the unknown flow rate factor as a known constant, specifically, A = 2.140373 × 10 −7 Pa −1 a −1 . In Fig. 5, we show the marginal distribution of the approximation errors in the x component (top) and y component (bottom) for Example 1a (left) and Example 1b (right). The approximation errors are similar for the coarser mesh (Example 1a) and the finer mesh (Example 1b), both having fairly constant mean and variance in each component. For both examples, the mean of the approximation errors in the x component of the velocity measurements is non-zero, ε * ≈ 0.2, while the standard deviation of the approximation errors is substantially larger than the additive noise (δ e ≈ 0.07). That is, the approximation errors dominate the additive noise, as explained in Sect. 4.1, and it is likely (and is indeed the fact) that ignoring the approximation errors may lead to infeasible results.
To illustrate the convergence of the approximation errors, in the top row of Fig. 6, we show the spectrum of the approximation errors covariance matrices, ε , for examples 1a and 1b for increasing sample sizes (N = 62, 250, 500, and 1000). From the figure, it is evident that for N ≥ 250 samples, the spectra essentially coincide and have both converged, thus demonstrating the discretization independence of the approach and that approximately N ≥ 250 samples are likely sufficient to characterize the approximation error statistics. Note, however, that the results displayed here use N = 1000 samples.
To give further insight into the distribution of the approximation errors, we show the (Pearson's) correlation matrix of the approximation errors for Example 1a (left) and 1b (right) in the bottom row of Fig. 6. Firstly, we notice that the correlation matrices are highly structured (unlike the noise covariance matrix which is diagonal). It is also apparent that the correlation matrices are (visually) identical, further il- (c, d) Distribution of the approximation errors in the y direction velocity measurements for Example 1a (c) and 1b (d). The mean of the approximation errors, ε * , is indicated with a red line, while higher probability density is indicated by darker shading.
lustrating the discretization independence. The 2 × 2 block structure of the correlation matrices is to be expected since the measurement number indexing used corresponds to measuring the q = 80 velocity measurements in the x direction first, followed by the 80 velocity measurements in the y direction. The behavior within the diagonal blocks is also fairly intuitive as periodic boundary conditions are used, while the structure also illustrates that measurements (relatively) far away from each other are fairly uncorrelated. Comparing the diagonal blocks we see that the approximation errors in the x component of the velocity measurements are more highly correlated at greater distances than those of the y component. Finally, the off-diagonal blocks show a nontrivial correlation between the approximation errors in the x and y components.
In particular, this figure shows that the approximation errors have a similar structure to the main diagonal but reveals smaller (Pearson's) correlation coefficients (these range from about −0.5 to 0.5).
In the top row of Fig. 7, we show the marginal prior distributions and the resulting marginal posterior distributions. Also shown are the corresponding MAP estimates, the true basal sliding coefficient, and three draws from each of the distributions. Firstly, the accurate MAP estimate, β REF MAP , is in good agreement with the true basal sliding coefficient, while the accurate posterior is clearly feasible in the sense that the true basal sliding coefficient is well supported by the Laplace-approximated posterior. On the other hand, the MAP estimate found using the conventional error model, β CEM MAP , differs substantially from the true basal sliding coefficient over most of the domain. Furthermore, the posterior is essentially infeasible, with the actual coefficient having virtually no posterior density. Conversely, the MAP estimate found using the BAE approach, β BAE MAP , is in fairly good agreement with the true coefficient, and the Laplace-approximated posterior supports the truth well. We do see that the marginal posterior standard deviations found using the BAE approach are somewhat larger than those found using the accurate and conventional error approaches. This is typical, and to be expected, as the additional uncertainty in the flow rate prefactor manifests itself as extra posterior uncertainty.
In the bottom row of Fig. 7, we show the corresponding results for Example 1b. The results are fairly similar to Example 1a when using the accurate approach and when using the BAE approach 4 despite the substantial difference in the discretizations used. Lastly, the MAP estimate found using the conventional error model has changed drastically from Example 1a, though the posterior is equally as bad.

Example 2
In this example, we consider the case of an uncertain Glen's flow law exponent in the two-dimensional nonlinear Stokes ice sheet model and also demonstrate what happens when the approximation errors are, in some sense, too large. The approximation errors here are the result of treating the unknown, and spatially varying, Glen's flow law exponent as a fixed constant, i.e., setting n = n 0 = 3. To induce the larger approximation errors for Example 2b compared to Example 2a, we increase the uncertainty in the Glen's flow law exponent by altering the associated prior distribution (see Table 1). The difference in magnitude of the approximation errors is apparent in Fig. 8, where we show the marginal distribution of the approximation errors at the observation locations in the x direction (top) and y direction (bottom) velocities for examples 2a (left) and 2b (right). Note that the variance of the approximation errors for Example 2b is substantially larger than that of Example 2a. Considering that the standard deviation of the added noise for the small approximation error case (Example 2a) is δ e a ≈ 0.04 and for the large approximation error case (Example 2b) is δ e b ≈ 0.05, the approximation errors in both examples dominate the noise (see Sect. 4.1).
The top row of Fig. 9 shows the spectrum of the covariance matrices of the approximation errors, ε , for N = 125, 500, 1000, and 2000 samples for Example 2a and for N = 250, 1000, 2000, and 4000 samples for Example 2b. For Example 2a, it appears N ≈ 500 is enough samples, though for the results here we used N = 1000, while for Example 2b we require N ≈ 2000 samples. The fact that more samples are required to ensure convergence of the approximation errors in Example 2b follows naturally from the increased uncertainty. It is worth pointing out that Example 2b is used mainly to demonstrate how the BAE approach performs in the presence of too much modeling uncertainty; thus for the purposes of the current study we deem taking N = 2000 as tolerable.
In the bottom row of Fig. 9, we show the (Pearson's) correlation matrices of the approximation errors. The correlation matrices for this example share several of the characteristics seen in the corresponding correlation matrices in Example 1, specifically, the block structure and general behavior within the blocks. Comparing the correlation matrices for examples 2a and 2b, it appears the approximation errors in the x component for Example 2a are more highly correlated at greater distances towards the edges of the computational domain compared to the approximation errors in the x component of the velocity measurements for Example 2b.
In the top row of Fig. 10, we show the marginal prior and Laplace-approximated posterior distributions, as well as three draws from each of the distributions, the corresponding MAP estimates, and the true basal sliding coefficient for Example 2a. A couple of conclusions can be drawn from this figure. First, the accurate MAP estimate, β REF MAP , closely resembles the true basal sliding coefficient, and the truth is well supported by the accurate posterior distribution. Second, the Laplace-approximated posterior found using the conventional error approach is infeasible for most of the right half of the domain, with the MAP estimate, β CEM MAP , (severely) underestimating the true basal sliding coefficient. Third, the true basal sliding coefficient lies well within the bulk of the (Laplace-approximated) posterior for the BAE approach, with the MAP estimate, β BAE MAP , in fairly good agreement with the true basal sliding coefficient.
In the bottom row of Fig. 10, we show the corresponding results for Example 2b, in which the approximation errors are excessive. Under the Laplace approximation, the accurate posterior, found by using the true Glen's flow law exponent, remains an accurate representation of the truth as in Example 2a. The posterior found using the conventional error model approach has significantly deteriorated, however, with the true basal sliding coefficient even more markedly underestimated and the truth lying well outside the bulk of the Laplace-approximated posterior over almost all of the domain. Conversely, by taking into account the excessive modeling errors in Example 2b, the posterior found using the BAE approach is comparable to the prior, with the corresponding MAP estimate, β BAE MAP , being fairly similar to the prior mean. This demonstrates that when using the BAE approach, as the modeling errors become larger, the corresponding posterior density tends towards the prior, as should be hoped, to avoid overconfidence in biased results.

Example 3
In this example, we consider an uncertain flow rate factor in a larger scale, three-dimensional nonlinear Stokes ice sheet model. The approximation errors are the result of setting the unknown flow rate factor to A = 10 −16 Pa −3 a −1 . The spectrum for the approximation errors are shown in Fig. 11. The plot indicates that taking 500 < N ≤ 1000 samples would likely suffice to accurately characterize the approximation errors. For the results discussed here, we used N = 1000 samples. The average standard deviation of the approximation errors in the x component of the approximation errors is approximately 3.1, while for the y and z components, the average standard deviation of the approximation errors are 0.5 and 0.4, respectively. The standard deviation of the noise, on the other hand, is δ e ≈ 0.25. We thus can expect the resulting estimates found by disregarding the approximation errors to be unreasonable (see Sect. 4.1). and 14 that the reference posterior is completely feasible, and the corresponding MAP estimate is in good agreement with the true basal sliding coefficient. On the other hand, although the MAP estimate found using the conventional error model (CEM) shows similar qualitatively behavior, as seen in Fig. 13, when taking the corresponding posterior density into account, it is clear that the approach is essentially infeasible, with the truth lying well outside the bulk of the posterior across most of the domain (see Fig. 14). Finally, from Figs. 13 and 14 we see that, though not as good as the accurate case, the MAP estimate found using the BAE approach qualitatively remains similar to the truth. Furthermore, the truth is generally very well supported by the BAE posterior under the Laplace approximation.

Additional examples in the Supplement
To further demonstrate the flexibility and robustness of the proposed approach, we provide several additional numerical examples in the Supplement accompanying this paper. Specifically, we consider three additional cases that are variations of Example 1. Section S.1 of the Supplement demonstrates the robustness of the BAE approach in the case in which the true distribution of the auxiliary parameter is not known. The results show that as long as the true auxiliary parameter is well supported by the assumed distribution, the BAE approach provides posterior-consistent estimates of the basal sliding coefficient β. Section S.2 compares the CEM and BAE approaches for different assumptions of the mean and marginal variance of the prior distribution. The results show that the qualitative behavior of the two approaches is consistent to that observed for Example 1. Finally, Sect. S.3 compares the BAE approach and a so-called tempering approach, in which a heuristic criterion (such as the L-curve) is used to select an appropriate scaling of the prior or likelihood density. This tempering approach can be understood as varying the regularization parameter in a deterministic setup. This example demonstrates that the BAE approach provides a robust solution to the inverse problem without requiring multiple solutions of the inverse problem for different scaling parameters. In addition, in contrast to the tempering approach, the BAE approach does not rely on (possibly) unreliable heuristic methods to select the scaling parameter.

Spectra of the data misfit Hessians and computational costs
In this section, we compare the spectra of the data misfit Hessian and the computational cost of the three approaches (accurate, conventional error, and Bayesian approximation error) for each of the three examples. The dominant eigenvalues of the data misfit Hessian, H (see Eq. 10), evaluated at the corresponding MAP estimate, are shown in Fig. 15 for each example. Firstly, we observe that for all three approaches in all three examples, we only need to retain a relatively low (c, d) Distribution of the approximation errors in the y direction velocity measurements for Example 2a (c) and 2b (d). The mean of the approximation errors, ε * , is indicated with a red line, while higher probability density is indicated by darker shading.
number of eigenvalues and eigenvectors to compute a reasonable low-rank approximation of the Laplace posterior covariance matrix. Secondly, we see that the dominant spectrum resulting from using the BAE approach is often lower than that of the reference and conventional error approach cases. This is to be expected since we are accommodating the approximation errors, which naturally lead to an increase in uncertainty. Finally, the dominant spectrum of the misfit Hessian for examples 1 and 2, found using the conventional error approach, further illustrates the fact that ignoring the uncertainty in the auxiliary parameters can lead to overconfidence in erroneous estimates. The spectrum of the misfit Hessian for Example 3, found using the conventional error approach, seems to be somewhat anomalous in that the spectrum decays faster than that of the misfit Hessian found using the BAE approach. However, this is possibly explained by the fact that the respective misfit Hessians are evaluated at quite different MAP estimates. Figure 15 shows that the number of eigenvalues required to compute a reasonable low-rank approximation, in the sense of Eq. (12), is considerably lower for the BAE approach in most of the examples. This result suggests that computing the low-rank approximation is cheaper for the BAE approach compared to the other two approaches. With regard to the computational cost, we consider the number of (linearized) Stokes problem solves required for the optimization algorithm to converge as the unit of cost. As stated in Sect. 3, we use the inexact Newton-CG algorithm with Armijo line search to find the MAP point. At each iteration, inexact Newton-CG requires the following: (a) one (or more if required to satisfy the sufficient descent condition) evaluation of the log-likelihood, which involves solving the nonlinear Stokes equations; (b) one gradient evaluation, which involves solving an additional linearized Stokes problem, i.e., the adjoint equation; and (c) one Newton system solve using the conjugate gradient (CG) method, which at each CG iteration requires solving two linearized Stokes problems, i.e., the incremental forward and adjoint problems.
The total number of linearized Stokes solves required to compute the MAP estimate can then be calculated -per Gauss-  Newton iteration -as the sum of the number of iterations required to solve the nonlinear forward problem plus one adjoint solve (to calculate the gradient), along with one incremental forward solve and one incremental adjoint solve (to calculate the action of the Hessian) per CG iteration.
To ensure a sufficient decrease in the objective function at each (Gauss-)Newton iteration, the forward problem may be solved multiple times until the Armijo condition is satisfied, thus further increasing the number of linearized Stokes solves. The results shown in Table 2 indicate that in each of the examples considered in this paper, the BAE approach generally requires less than half the number of the linearized Stokes solves that are required for the REF case to converge to the MAP point. Furthermore, the conventional error approach re- quires (in some cases, significantly) more iterations, and thus linearized Stokes solves, than the accurate model. This is to be expected as the optimization is hampered by model mismatch. It is also worth noting that the CEM approach requires substantially more backtracking iterations compared to the REF and BAE approaches, which is in line with Nicholson et al. (2018). Furthermore, the number of CG iterations is significantly reduced for the BAE approach when compared to the CEM and REF case.

Conclusions
In this paper, we have considered the inference for the basal sliding coefficient field for ice sheet flow problems with uncertain rheology from surface velocity measurements. The rheology parameters of the ice, in particular the flow rate factor and the Glen's flow law exponent, are often uncertain and can, at best, only be estimated in practice. We considered examples in both two and three dimensions and used both the linear and nonlinear Stokes ice sheet model. In each of the cases considered, our goal was to infer the basal sliding coefficient only; as such the unknown rheology parameters were a priori fixed to nominal values and treated as auxiliary parameters. To account for the resulting modeling uncertainties (or errors), we employed the Bayesian approximation er- Table 2. The cost of solving for the MAP estimates, measured in number of linearized Stokes solves. The first column (Ex.) refers to the example number, and the second column (Est.) refers to which MAP estimate we are solving for, i.e., the reference MAP (REF), the MAP found using the conventional error model (CEM), or the MAP found using the BAE approach (BAE). The third column (#N) gives the number of (Gauss-)Newton iterations, while the fourth column (#CG) reports the total number of CG iterations. The fifth column (#back) reports the number of backtracks needed throughout the (Gauss-)Newton iterations, and the sixth column (#O) gives the total number of objective function evaluations. Finally, the last column (#Stokes) gives the total number of linearized Stokes solves (for forward, adjoint, incremental forward, and incremental adjoint problems). The (Gauss-)Newton iterations are terminated when the norm of the gradient is decreased by a factor of 10 6 , while the CG iterations are terminated in line with the Eisenstat-Walker condition (Eisenstat and Walker, 1996) (to avoid over-solving) and the Steihaug criteria (Steihaug, 1983) (to avoid negative curvature). The results illustrate that the use of the approximation error approach can be carried out at no additional online cost compared to the conventional error approach and reference case.  . The bottom row shows the cross section along line l 2 (y = 7.5 km) in the same order. In each plot, the mean of the distribution (blue line) is shown along with the true basal sliding parameter, β true (red line), three samples from the respective distributions (green lines), the marginal distribution (shaded) with darker shading indicating higher probability, and the ±2 (approximate) standard deviation intervals (dashed black line). Figure 15. Spectra of the prior-preconditioned Hessian of the data misfit computed using Eq. (11) for Example 1 (a), Example 2a (b), Example 2b (c), and Example 3 (d). The spectra for Example 1a (coarse mesh) are shown in the fainter colors, while the spectra for Example 1b (fine mesh) are shown in the richer colors. The horizontal dashed black line (at λ = 1) shows the reference value for the truncation of the spectrum of the prior-preconditioned Hessian of the data misfit. lustrate a limitation of the BAE approach, we included an example in which the modeling errors introduced were, in some sense, too large. This case led to a posterior density (found using the BAE approach) which showed very little reduction in variance compared to the prior, though it still contained the truth.
By avoiding the simultaneous estimation of the basal sliding coefficient and rheology parameters (which are spatially varying over the entire domain), the online computational overheads of the estimation problem are substantially reduced. To ensure the work carried out here is applicable to large-scale problems, i.e., scalable, we initially posed the problem in infinite dimensions and then employed the adjoint-state methodology to compute the MAP estimate.
In assessing the applicability and performance of the BAE approach, the current study only considers fairly limited domains, i.e., box-like geometries and idealized boundary conditions. A natural next step for future work is to apply the same framework to more realistic setups and to continentalscale ice flow problems.
Data availability. No external data or geometries/meshes were used in this paper. The observations for the inversion are synthetic measurements obtained via simulation. All the parameters used to obtain the data are reported in the paper.
Author contributions. All authors contributed equally to the design of numerical experiments, the discussion and interpretation of the numerical results, and the writing of the original and revised version of this paper. OB implemented the numerical methods and ran the numerical experiments with help from all co-authors.
Competing interests. The authors declare that they have no conflict of interest.