Process-based projections of the sea-level contribution from land ice components are often obtained from simulations using a complex chain of numerical models. Because of their importance in supporting the decision-making process for coastal risk assessment and adaptation, improving the interpretability of these projections is of great interest. To this end, we adopt the local attribution approach developed in the machine learning community known as “SHAP” (SHapley Additive exPlanations). We apply our methodology to a subset of the multi-model ensemble study of the future contribution of the Greenland ice sheet to sea level, taking into account different modelling choices related to (1) numerical implementation, (2) initial conditions, (3) modelling of ice-sheet processes, and (4) environmental forcing. This allows us to quantify the influence of particular modelling decisions, which is directly expressed in terms of sea-level change contribution. This type of diagnosis can be performed on any member of the ensemble, and we show in the Greenland case how the aggregation of the local attribution analyses can help guide future model development as well as scientific interpretation, particularly with regard to spatial model resolution and to retreat parametrisation.

Process-based projections of ice sheets' contributions to sea-level changes generally rely on numerical models that simulate the gravity-driven flow of ice under a given environmental (atmospheric and oceanic) forcing derived from atmosphere–ocean general circulation model (AOGCM) output. To cover the large spectrum of uncertainties that impact the outcomes of these numerical models, a popular approach is to perform common sets of numerical experiments by considering a range of forcing conditions (e.g. Barthel et al., 2020), various initial conditions, and/or model design (i.e. different choices in the modelling assumptions including different ice-sheet model (ISM) formulations, different input parameters' values, etc.) within a multi-model ensemble (MME) approach. This results in an ensemble of realisations, named ensemble members. Recent MME studies have analysed, within the Ice Sheet Model Intercomparison Project for CMIP6 (ISMIP6), the future evolution of the ice sheets of Greenland (Goelzer et al., 2018, 2020) and Antarctica (Seroussi et al., 2020).

Providing such projections using numerical models is challenging because the considered physical processes are highly complex and may involve non-linear feedbacks operating on a wide variety of timescales. Due to the importance of these projections in supporting coastal adaptation (Kopp et al., 2019), improving their interpretability is of high interest.

When dealing with interpretability, the key is generally not only to deliver modelling results but also to explain why the numerical model delivered some particular results given the set of chosen modelling assumptions (Molnar, 2022). Commonly used approaches to improve interpretability usually focus on measuring the importance of modelling assumptions for prediction (e.g. Lundberg et al., 2020). Two main approaches exist, either global or local. In the global approach, the objective is to explore the sensitivity over the whole range of variation in the considered modelling assumption, i.e. to assess the variable importance across the whole MME dataset. This can be done by quantifying the MME spread and by identifying its origin (see, among others, Murphy et al., 2004; Hawkins and Sutton, 2009; Northrop and Chandler, 2014). For this objective, popular statistical approaches generally rely on variance decomposition (analysis of variance, ANOVA); see, for example, Yip et al. (2011) for an introduction. To complement these global methods, we adopt in this study a second approach named “local” because it aims at measuring the importance of the input variables locally at the level of individual observations (and not globally across all observations unlike the first approach). This means that the local approach focuses on how particular modelling assumptions (i.e. value of a given model parameter, a given ISM formulation, etc.) influence the considered prediction. This is the local attribution approach adopted by the machine learning community (e.g. Murdoch et al., 2019) and named “situational” in the statistical literature (Achen, 1982). As described by Štrumbelj and Kononenko (2014), if the measure of local importance is positive, then the considered modelling assumption has a positive contribution (increases the prediction for this particular instance); if it is negative, it has a negative contribution (decreases the prediction); and if it is 0, it has no contribution.

A possible local attribution approach can follow a “one-factor-at-a-time” procedure, which consists of analysing the effect of varying one model input factor at a time while keeping all others fixed (see an example performed by Edwards et al., 2021). Though simple and efficient, this approach presents several shortcomings (dependence on the chosen base case, dependence on the magnitude of variations, failure when the model is non-linear, etc.; see an in-depth analysis by Štrumbelj and Kononenko, 2014). A more generic approach has emerged in the domain of explainable machine learning (Murdoch et al., 2019), named SHapley Additive exPlanations (SHAP; Lundberg and Lee, 2017). SHAP has successfully been used in many domains of application, such as finance (Bussmann et al., 2021), medicine (Jothi and Husain, 2021), land-use change modelling (Batunacun et al., 2021), mapping of tropospheric ozone (Betancourt et al., 2022), or digital soil mapping (Padarian et al., 2020).
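One of the cited shortcomings of the one-factor-at-a-time procedure, its dependence on the chosen base case, can be made concrete with a minimal Python sketch; the two-input model with an interaction term and the numbers below are purely hypothetical illustrations, not part of the study.

```python
# Hypothetical two-input model with an interaction term: the one-factor-at-a-time
# effect of x1 depends entirely on where x2 is held fixed.
def model(x1, x2):
    return x1 * x2

def oat_effect(f, base, var_index, delta):
    """Change in f when one input is perturbed by delta, all others held at base."""
    perturbed = list(base)
    perturbed[var_index] += delta
    return f(*perturbed) - f(*base)

# Same model, same perturbation of x1, but two different base cases:
effect_at_zero = oat_effect(model, (1.0, 0.0), 0, 1.0)  # reports no effect of x1
effect_at_one = oat_effect(model, (1.0, 1.0), 0, 1.0)   # reports a unit effect
```

The same perturbation of the same input yields two different sensitivities depending only on the base case, which is precisely the failure mode discussed by Štrumbelj and Kononenko (2014).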

SHAP builds on the Shapley values that were originally developed in cooperative game theory for “fairly” distributing the total gains to the players, assuming that they all collaborate (Shapley, 1953). Making the analogy between a particular prediction and the total gains, SHAP allows breaking down any prediction as an exact sum of the modelling assumptions' contribution with easily interpretable properties (see a formal definition in Sect. 3); each contribution then reflects the influence of the considered modelling assumptions for the particular prediction.
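The Shapley decomposition can be illustrated with a brute-force computation on a toy cooperative game; the payoff numbers below are hypothetical, but the efficiency property (contributions summing exactly to the full-coalition gain) holds by construction.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, players):
    """Exact Shapley values for a cooperative game defined by value(coalition)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        contrib = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                contrib += weight * (value(s | {i}) - value(s))
        phi[i] = contrib
    return phi

# Toy two-player game with hypothetical payoffs; the "gain" of the full
# coalition (5.0) is split exactly between the players (efficiency property).
payoff = {frozenset(): 0.0, frozenset("a"): 1.0,
          frozenset("b"): 2.0, frozenset("ab"): 5.0}
phi = shapley_values(lambda s: payoff[s], ["a", "b"])
# phi["a"] == 2.0, phi["b"] == 3.0, and phi["a"] + phi["b"] == 5.0
```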

In this study, our objective is to compute measures of local importance for each considered modelling assumption using SHAP applied to an MME of sea-level projections. Applying SHAP in this context, however, faces several difficulties. First, it is not the prediction provided by the modelling chain (used to generate the MME) that is decomposed by SHAP but a machine-learning-based proxy (named the ML model) that relates the modelling assumptions (termed “inputs” in the following) to the equivalent sea-level changes (denoted sl). Validating the use of this proxy is one key prerequisite of the approach. Second, building the ML model relies on the analysis of the available MME results, which are limited (typically up to 50–100 ensemble members) due to the large computational cost of the modelling chain. This results in MMEs that are incomplete and unbalanced; i.e. several combinations of modelling assumptions are missing in the MME while some are more frequent than others. Statistically, this incomplete and unbalanced design might result in statistical dependence among the input variables (related to the modelling assumptions). Overlooking this dependence structure might mislead the interpretation of the inputs' individual influence; see an extensive discussion by Do and Razavi (2020). To overcome these difficulties, we propose a SHAP-based procedure combined with cross-validation (Hastie et al., 2009) and appropriate techniques for modelling the dependence (Aas et al., 2021; Redelmeier et al., 2020). Through aggregation of the SHAP-based local explanations, we further show how they can be helpful both for improving the scientific interpretation and for guiding future model developments. The proposed procedure is applied to sea-level projections for the Greenland ice sheet (Goelzer et al., 2020) by considering the time evolution of sea-level contributions.

The paper is organised as follows. We first describe the sea-level projections used as an application case and the corresponding design of numerical experiments (Sect. 2). In Sect. 3, we provide further details on the statistical methods used to estimate the local explanations. In Sect. 4, we apply the methods and present some approaches for combining the local explanations to obtain a global understanding of the MME results across time.

To test our approach, we define a case study based on the MME study carried out by Goelzer et al. (2020) in the framework of the ISMIP6 initiative. In the following, we only provide a brief summary of the GrIS MME dataset, and the interested reader is invited to refer to Goelzer et al. (2020) and references therein for further details.

To compute the annual time evolution of sea-level contributions from the
Greenland ice sheet (GrIS) up to 2100, the modelling chain combines different
models: (1) a number of AOGCMs that produce climate projections according to
given greenhouse gas forcing scenarios, (2) a regional climate model (RCM)
that locally downscales the AOGCM forcing to the GrIS surface, and (3) a range
of ISMs (initialised to reproduce the present-day state of the GrIS as
best as possible from a given initial year to the end of 2014) that produce
projections of ice mass changes and sea-level contributions. Given bed
topography across the ice–ocean margin around Greenland, the ISMs are forced
by surface mass balance (denoted SMB) anomalies from the atmospheric
RCM-derived forcing and by an empirically derived parametrisation that
relates changes in meltwater runoff from the RCM and ocean temperature
changes from the AOGCMs to the retreat of tidewater glaciers (Slater et
al., 2020). The parameter that controls retreat is denoted

As the primary objective of this work is to evaluate the relevance of the
“SHAP” approach, we focus on a subset of the original GrIS MME study based
on one AOGCM, namely MIROC5 (Model for Interdisciplinary Research on Climate – version 5) forced under the most impactful climate scenario
Representative Concentration Pathway 8.5 (RCP8.5) because a sufficient number of MME results are available to validate
our approach. For this case, a total of 55 numerical experiments were
extracted to analyse the time evolution of sea-level changes with respect to
2015 (Fig. 1); each of these results is associated with different modelling
choices represented by different ISMs that are described in Appendix A,
Table A1. In addition, for the selected AOGCM, we are able to analyse the
sensitivity to the parameter

The analysis is focused on nine main modelling assumptions related to different aspects of the modelling chain (Table 1), namely numerical implementation, initial conditions, modelling of ice-sheet processes, and environmental forcing. Only the modelling assumptions that are commonly shared by all models described by Goelzer et al. (2020) in their Appendix A were considered, i.e. without an empty entry in Table A1 in this paper. Note that some preliminary groupings of categories were carried out to ensure a minimum of variation across the experiments with at least two experiments associated with a given category (specified in the last column of Table 1), which is needed to properly conduct the performance analysis of the ML model (see further details in Sect. 3.2).

Modelling assumptions considered in the MIROC5 RCP8.5-forced GrIS MME.

Number of MIROC5 RCP8.5-forced GrIS MME members with respect to the different modelling assumptions described in Table 1.

In the following, we refer to the choices made for each of these modelling assumptions as “inputs”. One input setting defines an experiment of the MME.
Formally, the inputs are treated either as continuous variables (for

Let us consider sl

It is important to note that Eq. (1) does not aim to linearise

In order to quantify

Schematic overview of the different steps of the procedure.

The objective of this section is to assess the validity of replacing

On the one hand, the local performance indicator is chosen to be the
absolute error

Finally, it should be noted that no matter how much effort is put into increasing the ML predictive capability, a perfect match to the true model
is rarely achievable, in particular due to difficulties in approximating the
mathematical relationship between the inputs and sl or due to the absence of
input variables that are important with respect to the sl prediction error.
Thus, a residual degree of prediction error may still remain. This has
implications for the interpretation of low
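As an illustration of how such performance criteria can be computed from cross-validated predictions, the sketch below uses hypothetical sl values; the normalisation chosen here for the mean relative absolute error (by each member's own value) is an assumption for illustration and not necessarily the one used in this study.

```python
import numpy as np

# Hypothetical cross-validated predictions for a small ensemble (sl in cm).
y_true = np.array([19.1, 15.3, 12.0, 10.8, 8.5])
y_pred = np.array([17.8, 15.4, 11.6, 10.9, 8.7])

# Local criterion: the absolute error per ensemble member, used to flag
# members whose proxy prediction is too inaccurate for local attribution.
abs_err = np.abs(y_true - y_pred)

# Global criterion: a mean relative absolute error (in %); the normalisation
# by |y_true| is an illustrative assumption.
mrae = 100.0 * np.mean(abs_err / np.abs(y_true))
```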

We follow the approach developed by Lundberg and Lee (2017), who proposed defining

Formally, consider a cooperative game with

In this setting, the Shapley values can then be interpreted as the
contribution of the considered input to the difference between the
prediction

In practice, the computation of the Shapley value may be demanding because
Eq. (2) implies covering all subsets
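When enumerating all subsets is too costly, the Shapley values can be approximated by averaging marginal contributions over randomly sampled orderings of the players, in the spirit of Štrumbelj and Kononenko (2014); the symmetric toy game below is purely illustrative.

```python
import random

def shapley_mc(value, players, n_samples=2000, seed=0):
    """Monte Carlo Shapley estimate: average marginal contribution of each
    player over randomly sampled orderings."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = players[:]
        rng.shuffle(order)
        coalition = set()
        prev = value(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value(frozenset(coalition))
            phi[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: total / n_samples for p, total in phi.items()}

# Symmetric toy game v(S) = |S|^2 with three players: by symmetry, each exact
# Shapley value is v(full)/3 = 3, and the estimates converge to that value.
est = shapley_mc(lambda s: len(s) ** 2, ["a", "b", "c"])
```

Efficiency is preserved exactly by every sampled ordering (the marginals telescope to the full-coalition value), so the estimates always sum to v(full) even before convergence.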

In the case considered in this study, there exists some dependence among the
inputs. A commonly encountered example is when the values for the minimum
and maximum grid sizes are correlated. Additional examples are provided in
Sect. 4.1. In this case, the interpretation of the SHAP decomposition
provided by the kernel SHAP method might give wrong answers (Aas et al.,
2021) because it relies on the independence assumption for calculating the
conditional probability

Conditional inference trees belong to the class of decision trees that use a
two-stage recursive partitioning algorithm, namely (1) partitioning of the
observations by univariate splits in a recursive way and (2) fitting a constant
model in each cell of the resulting partition (for the regression problem).
Different splitting procedures exist; here we use the one proposed by Hothorn et al. (2006), which selects input variables with a significance test rather than by maximising an information measure (such as the Gini coefficient; Breiman, 1984). In this approach, the
stopping criterion is based on

To identify the dependence structure, we proceed as follows. We first take the first input variable as the response and fit a CTREE model by viewing the remaining input variables as the predictor variables. If the resulting tree model includes at least one of the predictor variables, there is some dependence with the considered response (i.e. the first variable in this example); otherwise, the resulting tree model is empty. This procedure is repeated with each of the input variables as the response in turn. As a result, the procedure identifies the non-empty tree model or models that represent the dependence structure between some input variables.
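This loop can be sketched in Python; note that the study relies on conditional inference trees with significance-test-based stopping (as implemented in R's partykit), for which a cost-complexity-pruned CART from scikit-learn is only a rough stand-in, and the synthetic design below is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical encoded MME design: x2 tracks x0 (as correlated minimum and
# maximum grid sizes would), while x1 is independent of both.
n = 200
x0 = rng.integers(0, 4, n).astype(float)
x1 = rng.integers(0, 3, n).astype(float)
x2 = x0 + rng.normal(0.0, 0.1, n)
X = np.column_stack([x0, x1, x2])

# Take each input in turn as the response, fit a tree on the remaining inputs,
# and flag a dependence when the fitted tree is non-empty. Cost-complexity
# pruning stands in for the significance-test stopping rule of CTREE.
dependent = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    tree = DecisionTreeRegressor(max_depth=3, ccp_alpha=0.1, random_state=0)
    tree.fit(others, X[:, j])
    if tree.tree_.node_count > 1:  # at least one split => dependence detected
        dependent.append(j)
# The correlated pair (x0, x2) is flagged as mutually dependent.
```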

In this section, we apply the procedure described in Sect. 2 (schematically depicted in Fig. 3) to the MIROC5 RCP8.5-forced GrIS MME. We first analyse the dependence between the different modelling assumptions (Sect. 4.1). Then, we train and build ML models and select the best-performing ones by following Steps 1–2 of the procedure (Sect. 4.2). On this basis, we apply the local attribution approach to measure the local importance and summarise the results to provide different levels (detailed in Sect. 3.1) of information on sensitivity (Steps 3–4, Sect. 4.3).

Tree models representing the dependence between the different modelling assumptions (indicated at the bottom of each tree). The bottom nodes (leaf nodes) provide the proportion of experiments given the modelling choices defined along the branches of the tree model. Each colour corresponds to a different category of the considered modelling assumption. For instance, the left tree in the middle row provides the relation between the choice in the numerical method with the type of initialisation and the minimum grid size. The blue (red) colour is related to the finite element FE (finite difference FD) category.

We first analyse the statistical dependence among the modelling assumptions
(inputs) by applying the CTREE approach described in Sect. 3.4 (using a split
criterion threshold of 95 % and Bonferroni-adjusted

Using the results of the MIROC5 RCP8.5-forced GrIS MME, we train a series of
ML models to predict sl across time. The following ML models with corresponding
hyperparameters (see Appendix B for details) are considered:

9 RF regression models with hyperparameters ns

30 XGB models with hyperparameters maximum depth

1 LIN model.
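The training-and-selection step can be sketched as follows on synthetic data; gradient boosting from scikit-learn stands in for XGB here, the hyperparameter grids of the study are not reproduced, and all numbers are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)

# Hypothetical encoded inputs (9 modelling assumptions) and sl responses for a
# 55-member ensemble; values are purely illustrative.
X = rng.normal(size=(55, 9))
y = 10.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.3, 55)

candidates = {
    "LIN": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "XGB-like": GradientBoostingRegressor(random_state=0),  # stand-in for XGB
}

# Leave-one-out cross-validation gives, for every ensemble member, the absolute
# error of each candidate model, from which the best model can be retained.
cv_pred = {name: cross_val_predict(m, X, y, cv=len(y)) for name, m in candidates.items()}
abs_err = {name: np.abs(pred - y) for name, pred in cv_pred.items()}
best_per_member = [min(abs_err, key=lambda name: abs_err[name][i]) for i in range(len(y))]
```

Selecting the best model per member (rather than one global winner) mirrors the local performance criterion described in Sect. 3.2.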

As explained in Sect. 3.2, satisfying the global performance criterion does
not necessarily ensure that the ML model gives an accurate approximation of
all sl predictions. For some cases, the discrepancies can be too large to
properly analyse the local explanations. This is illustrated with Fig. 5b, which shows the comparison between the true sl value and the corresponding
ML-based prediction for 2100. For instance, we note that the predictions for
the largest sl value largely depart from the

In total, LIN, XGB, and RF models retained 3.4 %, 24.6 %,
and 72 % respectively of the total number of experiments (on average over time). After
applying this procedure, the MRAE criterion (shown in blue in Fig. 5a) reaches
values below 10 % on average over time (with a maximum value not larger
than 15 % for the year 2040). Note that the MRAE curve after this selection is not
necessarily the lowest one because the selection procedure implies
minimising not only MRAE but also the local performance

In this section, we first compute the measures of local importance for each experiment in the MIROC5 RCP8.5-forced GrIS MME for a given prediction time (here 2100); such a type of diagnostic (Level 1 of the procedure) helps to understand and quantify the impact of particular assumptions made by the modellers (Sect. 4.3.1). Then, we analyse in Sect. 4.3.2 how the influence of each modelling assumption evolves as a function of the considered input value (Level 2 of the procedure). This analysis allows us to deepen our understanding of the model structure for a given prediction time. Finally, Sect. 4.3.3 summarises all results over time (Level 3 of the procedure) to provide a global insight (i.e. across all MME members) into the sensitivity of sl to the modelling assumptions.

We first illustrate the application of SHAP to a selected set of ML-based
sl predictions for 2100. Figure 6 provides the SHAP-based decomposition of the
ML-based prediction (horizontal blue bar) into the positive (green bar) or
negative (red bar) contribution (

Diagnostic of particular ML-based sl predictions using SHAP for the year
2100 considering six different settings of the modelling choices (indicated
on the vertical axis). The horizontal blue bar corresponds to the ML-based
sl prediction (the difference with the true value is indicated by the error
term

The analysis of Fig. 6 illustrates how the SHAP-based approach can be used
to diagnose the MME results.

Case (a) corresponds to the largest sl value (of 19.08 cm) that is predicted by
the ML model at 17.79 cm (with a prediction error

Case (b) (Fig. 6b) corresponds to the second-largest sl value (of 15.32 cm)
that is predicted by the ML model at 15.36 cm (with a prediction error

Case (c) has the same setting as Case (b) except for a larger minimum grid
size (here of 16 km). This results in a lower influence of the minimum grid
size (

Case (d) corresponds to an sl value close to the one in Case (c) and illustrates
that, despite the differences with Case (c) (i.e. initial SMB,
initialisation type, and minimum resolution), the contribution of the largest
contributors to sl, i.e. ice flow type, initial year, and

The comparison between Cases (b) to (d) also points out that, for relatively close predicted values, the modelling choices contribute equivalently to the prediction despite some minor differences in the setting of the modelling assumptions.

Cases (e) and (f) illustrate however that, when the dissimilarity in the settings is larger, the modelling choices contribute differently to the prediction although the predicted values are very close (here close to the ensemble mean of 10.8 cm). In Case (e), all modelling assumptions contribute roughly equally to sl, whereas in Case (f) it is mainly the ice flow type and the type of dataset for bed topography that contribute.

We explore in Figs. 7 and 8 how the magnitude of the modelling
assumption's contribution to sl, as well as the direction, changes depending on
the value of the considered input by applying the SHAP dependence plot
proposed by Lundberg et al. (2020). To judge the significance of the
contribution, we compare the results to the range defined by

We first analyse the continuous variables. Figure 7a confirms the large
influence of

Application of SHAP to all members of the
MIROC5 RCP8.5-forced GrIS MME for the year 2100. Each panel provides

Though a trend in the (initial year –

Figure 7c and d give insights into the influence of the spatial resolution by
showing a zone of low-to-moderate influence defined for a minimum and a
maximum grid size

Application of SHAP to all members of the
MIROC5 RCP8.5-forced GrIS MME for the year 2100. Each panel provides the
boxplots of

Focusing on the categorical input variables, Fig. 8 further indicates that the most impactful modelling assumptions for sl are the ice flow choice, either of SIA or HYB type with a positive or negative contribution, and the B dataset for bed topography: the corresponding boxplots in Fig. 8b and e are
well outside the

The analysis of Sect. 4.3.2 is now performed for all members of the
MIROC5 RCP8.5-forced GrIS MME for any prediction time. As indicated in Sect. 3.1, to be able to compare the influence between the different predictions
across time, we analyse in Fig. 9 the statistics of the absolute value of
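A minimal sketch of this Level-3 aggregation, assuming the per-member SHAP contributions have already been computed for each prediction time (the array below is filled with random placeholders, not study results):

```python
import numpy as np

# Hypothetical SHAP contributions phi[t, m, j] for each prediction time t,
# ensemble member m, and modelling assumption j (random placeholders).
rng = np.random.default_rng(2)
n_times, n_members, n_inputs = 86, 55, 9   # 2015-2100, 55 members, 9 inputs
phi = rng.normal(size=(n_times, n_members, n_inputs))

# Level-3 summary: for each time and input, the distribution across members of
# the absolute contribution |phi| (median and interquartile range, as a
# boxplot over time would display).
abs_phi = np.abs(phi)
median_abs = np.median(abs_phi, axis=1)             # shape (n_times, n_inputs)
q25, q75 = np.percentile(abs_phi, [25, 75], axis=1)
```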

Considering initial conditions, Fig. 9a and c show that it is the initialisation type that has the largest impact in the medium term (before 2050/2060), whereas after this date, it is the choice of the initial year that has the most impact. Conversely, in the long term (after 2050/2060), the influence of the initialisation type reduces to a negligible level (compared to the prediction error). Figure 9b shows that the influence of the initial SMB is low (even negligible) regardless of the considered prediction time, with the exception of some particular cases outlined by black dots lying outside the boundaries of the whiskers (these cases are illustrated in Fig. 6a, e).

Considering numerical implementation, the choice of the numerical method has here a small (even negligible) impact on sl values (Fig. 9d) especially in the medium/long term (after 2050). We note also that the moderate influence of the minimum and maximum grid size remains quasi-constant over time (Fig. 9e, f), hence suggesting that the grid size's influence is time-invariant; i.e. all modelled processes are affected by the spatial resolution in a similar way, independently of the prediction time.

Finally, considering ice-sheet processes and environmental forcing, an
important influence of

Statistics of

Improving the interpretability of sea-level projections is a matter of high interest given their importance to support decision-making for coastal risk management and adaptation. To this end, we adopt the local attribution approach developed in the machine learning community to provide results about the role of various modelling choices in generating inter-model differences in the MME. These results are intended for different potential users.

First, the diagnostics illustrated in Fig. 6 (and all provided by Rohmer, 2022, for MIROC5 RCP8.5-forced GrIS MME in 2100) help the individual modellers involved in the modelling exercise to understand and quantify the impact of their particular assumptions. Figure 6b–d illustrate situations where the SHAP approach allows such critical analysis, including checking that the same modelling assumptions have a similar impact on close sl values.

Second, aggregating all diagnostic results (Level 2 and 3 of the proposed
approach) provides guidance to the modelling group involved in the
definition of experimental protocols for MMEs (such as ISMIP6; Nowicki et
al., 2016, 2020). Some key aspects are identified and deserve to be taken
into account in future model developments and modelling exercises.

Our results confirm the need for simulations that are sufficiently spatially resolved: sl results are largely affected by overly coarse grids (here with a minimum and maximum grid size larger than 5 and 16 km respectively) regardless of the prediction time.

The influence of the modelling assumptions depends on the considered
prediction time: in the short/medium term (before 2050), initialisation and
ice flow type primarily contribute to sl, whereas in the long term, the initial
year and

Some modelling choices have little impact on the sl values (on average across the considered MME results), in particular choosing a finite element or finite difference numerical scheme or the dataset for bed topography.

Additional computer experiments are worth conducting to better explore given
parts of the parameter space with a view to confirming the identified trends
(Figs. 7 and 8), in particular for a minimum grid size ranging from 3 to 4 km
and for

Robustness analysis of the local importance analysis for the largest simulated sl value in 2100 (Case (a) in Fig. 6). The horizontal coloured bars correspond to the quantified contributions by including all input variables (results of Fig. 6a). The endpoints of the thick and thin horizontal black error bars are the minimum/maximum and the percentiles at 25 % and 75 % respectively computed when iteratively excluding one of the nine input variables.

These results were obtained by overcoming two major difficulties. The first one is related to the incomplete and unbalanced design of the numerical experiments (Sect. 4.1). Here, applying more commonly used statistical methods, namely the linear regression model or the ANOVA-based approach, would hardly be feasible. On the one hand, Sect. 4.2 clearly shows that the mathematical relationship between sl and the inputs is not necessarily linear, and more advanced regression techniques need to be used (like RF or XGB models). On the other hand, the considered design of experiments is incomplete and unbalanced (as shown in Sect. 2), which complicates the application of ANOVA. Ideally a full factorial design should be used to properly apply ANOVA: in our case, the design should then contain 3200 experiments, i.e. far more than the available number of experiments. Some solutions have been proposed in the literature (see, for example, Evin et al., 2019, and references therein), and an avenue for future work could focus on the comparison of ANOVA with our approach. The second difficulty is related to the presence of statistical dependencies (as outlined in Sect. 4.1), which makes the interpretation of the individual effects less straightforward (a problem related to multicollinearity in the statistical community, e.g. Shrestha, 2020) and might even lead to wrong conclusions regarding uncertainty partitioning (see discussion by Do and Razavi, 2020). Here the SHAP–CTREE combined approach developed by Redelmeier et al. (2020) helps alleviate this problem by explicitly incorporating the dependence in the computation of the Shapley values (Sect. 3.4; see also Aas et al., 2021, for an extensive study of this problem). 
In light of the different algorithms available in the literature (Aas et al., 2021; Frye et al., 2020), an interesting line of future research could focus on a more systematic analysis of the inputs' dependence, which could serve as a strong basis for defining clear recommendations on how to treat it in the context of MMEs.

However, it should be underlined that the high performance of our approach strongly depends on two key prerequisites. First, the high predictive capability of the ML model should be carefully checked and confirmed, as done in the GrIS case (Sect. 4.2). For this purpose, several aspects need further investigation in future work: (1) instead of selecting one single ML model, a combination of models could be proposed following, for example, the “super-learner” method of van der Laan et al. (2007) or the model class reliance approach of Fisher et al. (2019); (2) finding the optimal hyperparameter settings could benefit from more advanced search algorithms for optimisation (Probst et al., 2019).

The second prerequisite is the careful selection of which input variables to
include in the analysis. The set of quantified contributions is guaranteed, by construction (see Sect. 3.3), to add up exactly to the total sl projection. This has the practical advantage of easing the interpretation and
communication of the results. However, this also means that the quantified
contributions are themselves dependent on the choice of the input variables.
One advantage of the SHAP approach is that variables whose influence is
negligible will be assigned a low contribution, but this does not address
the issue of the impact of some missing input variables that are important
for the sl prediction, i.e. the influence of some “hidden factors”. The
proposed cross-validation error partly addresses this problem since high
cross-validation error reflects any difficulties in approximating the
mathematical relationship between sl and the inputs, which include the
afore-mentioned problem. To provide additional discussion, we conducted a
robustness analysis by re-running the local attribution approach (and ML
model fitting and selection) for the largest simulated sl value in 2100 (Case
(a) in Fig. 6), at each iteration, with one of the nine input variables being
removed in turn. Figure 10 provides the changes in the quantified
contributions represented by a horizontal black error bar. The comparison
with the width of the horizontal coloured bar (representing the value of the
original analysis including all nine input variables) confirms the high
robustness of the large
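The leave-one-input-out robustness loop can be sketched as follows; the `attribution` function here is a hypothetical placeholder for the full ML-fitting, selection, and SHAP pipeline, and the perturbations it returns are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_inputs = 9
# Contributions quantified with all nine inputs included (illustrative values).
original = rng.normal(size=n_inputs)

def attribution(excluded):
    """Hypothetical placeholder for re-running ML fitting, selection, and SHAP
    with one input removed; it returns slightly perturbed contributions."""
    perturbed = original + rng.normal(0.0, 0.05, n_inputs)
    perturbed[excluded] = np.nan  # the removed input has no contribution
    return perturbed

runs = np.array([attribution(j) for j in range(n_inputs)])
# Error-bar endpoints as in Fig. 10: min/max and 25 %/75 % percentiles over
# the nine re-runs, ignoring the run in which each input was itself removed.
lo, hi = np.nanmin(runs, axis=0), np.nanmax(runs, axis=0)
q25, q75 = np.nanpercentile(runs, [25, 75], axis=0)
```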

In this study, we described the use of the machine-learning-based SHapley Additive exPlanations (SHAP) approach to quantify the importance of modelling assumptions in sea-level projections produced in an MME study. The proposed approach was applied to a subset of the GrIS ensemble that is characterised by a limited number of experiments (50–100), an unbalanced design, and the presence of dependence between the inputs. Our results have shown the added value of the proposed approach to inform us about the influence of the modelling assumptions at multiple levels: (Level 1) locally for particular instances of the modelling assumptions, (Level 2) on the model structure at a given prediction time, and (Level 3) globally over time. These results are intended for different potential users, namely the ice-sheet modelling community (individual modellers or modelling groups in charge of the design of experiments) but also adaptation practitioners, who take decisions based on sea-level projections that rely on models such as those simulating Greenland ice mass losses. By enabling a better interpretation of the uncertainty range in projections, the analyses described in this study can build trust in these projections and thereby support accelerated coastal adaptation. This study illustrates that performing such diagnoses rigorously requires advanced mathematical techniques.

This study should however be seen as a first assessment of the potential of the SHAP-based approach, and in order to bring it to a fully operational level, we recognise that several aspects deserve further improvement. First, a common pitfall of any new tool is its misuse and over-trust in the results (as highlighted by Kaur et al., 2020). Future steps should thus concentrate on broadening the range of application cases (in particular by varying the AOGCM and the RCP choice) with increased cooperation between the different communities, namely ice-sheet modellers, ML researchers, human–computer interaction researchers, and socio-economic scientists.

Second, the global effects of the modelling assumptions deserve intensified investigation. In addition to methodological work exploring advanced procedures such as SAGE (Shapley Additive Global importancE; Covert et al., 2020) or the variance-based approaches used in the uncertainty quantification community (e.g. Iooss and Prieur, 2019), the key will be the development of robust protocols for designing balanced and complete numerical experiments. This only partially resolved problem (see, for example, the discussion by Aschwanden et al., 2021) could benefit from increased inter-disciplinary cooperation as well.

Model characteristics used in the MIROC5 RCP8.5-forced GrIS MME considered in the study (adapted from Goelzer et al., 2020, their Appendix A).

The modelling assumptions outlined in bold, namely velocity type, surface/thickness, and geothermal heat flux (GHF), were not considered in the analysis because they are not commonly shared across the different models. The reader is referred to Goelzer et al. (2020) for the definition of the abbreviations used for these three model characteristics.

Let us first denote by sl the sea-level contribution to be predicted from the set of modelling assumptions x = (x1, …, xd), where d is the number of modelling assumptions considered.

The linear regression (LIN) model is given by

sl = β0 + β1 x1 + … + βd xd + ε,

where the βj are the regression coefficients (estimated by least squares) and ε is a residual error term.
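As a concrete sketch of fitting the LIN model (made-up numbers, a single input, and ordinary least squares; this is an illustration, not the study's fit):

```python
# Minimal sketch: ordinary least squares for a single modelling assumption x
# predicting the sea-level contribution sl. Closed-form solution for the
# simple (one-predictor) case.

def ols_fit(x, sl):
    n = len(x)
    mx = sum(x) / n
    my = sum(sl) / n
    # Slope: covariance of (x, sl) divided by variance of x
    beta1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, sl)) / \
            sum((xi - mx) ** 2 for xi in x)
    beta0 = my - beta1 * mx   # intercept
    return beta0, beta1

x = [1.0, 2.0, 3.0, 4.0]      # made-up input values
sl = [2.1, 3.9, 6.1, 7.9]     # made-up responses, roughly sl = 2 * x
beta0, beta1 = ols_fit(x, sl)
```

Because the fitted coefficients multiply the inputs directly, the LIN emulator is the case where SHAP attributions have a simple closed form.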

The random forest (RF) regression model is a non-parametric technique based
on a combination (ensemble) of tree predictors (using regression trees;
Breiman et al., 1984). Each tree in the ensemble (forest) is built based on
the principle of recursive partitioning, which aims at finding an optimal
partitioning of the input parameters' space by dividing it into disjoint
subregions within which the predicted value is the mean of the training
observations falling into that subregion.

The RF model, as introduced by Breiman (2001), aggregates the different
regression trees as follows: (1) each tree is grown on a random bootstrap
sample of the training data, with a randomly selected subset of the input
parameters considered at each split, and (2) the final RF prediction is
obtained by averaging the predictions of the individual trees.

The RF hyperparameters considered in the study are ns (the number of trees in the ensemble) and the number of input parameters randomly selected at each split.
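The RF principle (bootstrap resampling of the training data plus averaging of tree predictions) can be sketched in a few lines. This toy example uses depth-1 trees (stumps) and made-up one-dimensional data, not the study's configuration:

```python
import random

# Sketch of the RF principle (not the study's implementation): each tree is a
# depth-1 regression stump fitted on a bootstrap resample of the data, and
# the ensemble prediction is the average over all stumps.
random.seed(0)

def fit_stump(x, y):
    """Best single split on one input, minimising squared error."""
    best = None
    for s in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    if best is None:                       # degenerate bootstrap sample
        m = sum(y) / len(y)
        return lambda xi: m
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def fit_forest(x, y, n_trees=25):
    stumps, n = [], len(x)
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda xi: sum(t(xi) for t in stumps) / len(stumps)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.1, 0.2, 1.8, 2.0, 2.1]   # made-up step-like response
forest = fit_forest(x, y)
```

Averaging over bootstrap-trained trees reduces the variance of the individual trees, which is what makes the RF a robust emulator for small ensembles.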

Extreme gradient boosting (Friedman, 2001) is a tree ensemble method like the RF model but differs in how trees are built (gradient boosting builds one tree at a time) and in how tree-based results are combined (gradient boosting combines results during the fitting process rather than by averaging at the end).

Formally, let us denote by fm the regression tree built at iteration m of the algorithm. At each iteration, the tree is fitted by minimising an objective function composed of two terms.

The first term measures the discrepancy between the observations and the current predictions (the training loss), while the second term penalises the complexity of the trees (regularisation). The main hyperparameters considered in the study are the following:

the maximum depth of the tree models, which corresponds to the number of nodes from the root down to the furthest leaf node (this hyperparameter controls the complexity of the tree model);

the learning rate, which is a scaling factor applied to each tree when it is added to the current approximation (a low rate value means that the trained model is more robust to overfitting but slower to compute);

the maximum number of iterations of the algorithm.
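The boosting scheme described above (one tree at a time, each scaled by the learning rate) can be sketched as follows. This is a minimal illustration with made-up data and a squared-error loss, not the configuration used in the study:

```python
# Sketch of gradient boosting with squared-error loss: each iteration fits a
# depth-1 tree to the current residuals (the negative gradient of the loss)
# and adds it to the model scaled by the learning rate.

def fit_stump(x, r):
    """Best single split minimising squared error on the residuals r."""
    best = None
    for s in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def boost(x, y, learning_rate=0.3, n_iter=50):
    pred0 = sum(y) / len(y)                 # initial constant model
    trees = []
    pred = [pred0] * len(x)
    for _ in range(n_iter):                 # maximum number of iterations
        r = [yi - pi for yi, pi in zip(y, pred)]   # residuals
        t = fit_stump(x, r)
        trees.append(t)
        # A low learning rate means smaller steps: more robust, slower to fit.
        pred = [pi + learning_rate * t(xi) for pi, xi in zip(pred, x)]
    return lambda xi: pred0 + learning_rate * sum(t(xi) for t in trees)

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 2.0, 2.0]   # made-up step response
model = boost(x, y)
```

In contrast to the RF, the trees here are not interchangeable: each one corrects the errors left by its predecessors, and the learning rate controls how aggressively it does so.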

The sea-level dataset is the one compiled by Edwards et al. (2021).

JR designed the concept, set up the methods, and undertook the statistical analyses. JR and HG defined the protocol of experiments. JR, RT, GLC, HG, and GD analysed and interpreted the results. JR wrote the manuscript draft. JR, RT, GLC, HG, and GD reviewed and edited the manuscript.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

For the ISMIP6 results used in this study, we thank the Climate and Cryosphere (CliC) effort, which provided support for ISMIP6 through sponsoring of workshops, hosting the ISMIP6 website and wiki, and promoting ISMIP6. We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP5. We thank the climate modelling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the CMIP data and providing access, the University at Buffalo for ISMIP6 data distribution and upload, and the multiple funding agencies who support CMIP5 and ESGF. We thank the ISMIP6 steering committee, the ISMIP6 model selection group, and the ISMIP6 dataset preparation group for their continuous engagement in defining ISMIP6. This is ISMIP6 contribution no. 27. Some resources were provided by Sigma2 – the National Infrastructure for High Performance Computing and Data Storage in Norway through projects NN8006K, NN8085K, NS8006K, NS8085K, NS9560K, NS9252K, and NS5011K.

This publication was supported by PROTECT. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 869304, PROTECT contribution number 48. In addition, HG has received funding from the Research Council of Norway under projects 270061, 295046, and 324639.

This paper was edited by Ginny Catania and reviewed by two anonymous referees.