Articles | Volume 16, issue 11
https://doi.org/10.5194/tc-16-4637-2022
© Author(s) 2022. This work is distributed under the Creative Commons Attribution 4.0 License.
Improving interpretation of sea-level projections through a machine-learning-based local explanation approach
- Final revised paper (published on 04 Nov 2022)
- Preprint (discussion started on 16 Jun 2022)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2022-435', Anonymous Referee #1, 27 Jul 2022
- AC1: 'Reply on RC1', Jeremy Rohmer, 12 Sep 2022
- RC2: 'Comment on egusphere-2022-435', Anonymous Referee #2, 29 Jul 2022
- AC2: 'Reply on RC2', Jeremy Rohmer, 12 Sep 2022
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Publish subject to minor revisions (review by editor) (20 Sep 2022) by Ginny Catania
AR by Jeremy Rohmer on behalf of the Authors (28 Sep 2022)
ED: Publish subject to technical corrections (30 Sep 2022) by Ginny Catania
AR by Jeremy Rohmer on behalf of the Authors (07 Oct 2022)
Post-review adjustments
AA – Author's adjustment | EA – Editor approval
AA by Jeremy Rohmer on behalf of the Authors (27 Oct 2022)
EA: Adjustments approved (28 Oct 2022) by Ginny Catania
The manuscript "Improving interpretation of sea-level projections through a machine-learning-based local explanation approach" provides a novel approach to analyzing multi-model sea level projections using machine learning. The method put forward in this study has two parts: (1) using ML models as surrogates for the projection ensemble, and (2) explaining the role of various modeling assumptions in generating the differences between sea level projections. The method is novel for our community and provides a potentially powerful way to analyze MIP output in a way that can guide potential future development (though this part needs more elucidation). I have some questions about particular choices in the methods, and in general the writing would benefit from clarification. However, overall I find the contribution to be interesting and appropriate for The Cryosphere.
Major questions:
1. Within the context of an uncalibrated multi-model ensemble, the explanation approach adopted here explains why individual models deviate from the median projection. However, it does not say how an individual model deviates from the "truth" (since the models are not compared against reality or some benchmark). Thus, the authors need to be clear about this fact in their claim that this method can help in model development. How can it help model development? If modelers know that factor X is causing their model to deviate from the ensemble median in some way, how will this facilitate development or adjustment of factor X? The authors need to be more specific about this claim, which is essentially the motivation for using this method.
2. As explained in the manuscript, the SHAP method is always guaranteed to produce a set of quantified contributions (mu) that add up to exactly the total projection. This has some nice mathematical properties, but it also means that the quantified contributions are themselves dependent on the choice of which factors (x) to include in the analysis. For complex ice sheet models, this includes many, many more differences than the six included in this analysis. This leads me to three questions:
(a) The criterion you used to select which factors to include was those "without empty entry in Table A1 and with a sufficient number of variation across the models." The first makes sense - you need labels (though perhaps some discussion would be in order about whether it is possible to add labels by talking to the modelers responsible for these simulations). The second is unclear to me - it seems that you left out some potentially very important factors with complete labels, like initialization year and the bed topography used. Ultimately this selection, which would seem to be very important to the overall results, seemed quite ad hoc.
(b) Ultimately, I think much of the issue with (a) could potentially be solved by adding one more factor, which is the model itself (e.g., AWI-ISSM, NCAR-CISM, etc.). There are many modeling choices that differ between those models (e.g., how certain processes are parameterized, etc.) which are not described by the six existing categories. Including the model itself as a category would determine how large these inter-model differences are. Hopefully, they would be small contributors to differences, but if they aren't small, then this indicates that more effort is required to label these differences in a meaningful way. Right now, these hidden factors are likely being lumped together with other factors where there are inter-model differences.
(c) Ultimately, there needs to be much more discussion of the sensitivity of the results to the choice of factors to include in the analysis. This could be done with a leave-one-out validation exercise to see how the results change if one of the "known" factors is converted to a "hidden" factor. This would help other researchers hoping to do a similar analysis decide how to make this choice of factors (or how much effort to put into labelling unlabelled factors) at the beginning of the exercise.
3. One aspect of the procedure that I found confusing is the procedure for training and then selecting the ML models. First, it is unclear why an ML surrogate is needed. The manuscript says "our knowledge on the mathematical relationship f(.) is only partial and based on the n MME results." I'm not exactly clear on why this is a problem, so more explanation would be appreciated. Furthermore, you then fit a range of ML models and, for each MME result, you pick a different ML model fit depending on which does a better job at reproducing that particular result. It is unclear why you do it this way as opposed to just picking the ML model which best fits all the results. In particular, this seems to conflict with your prior statement that you need knowledge of "the mathematical relationship f(.)" in order to perform the explanation part of the procedure. Overall, this whole aspect of the procedure is quite confusing and could do with clearer explanation of its purpose in the overall study.
4. Similar to point #3, the introduction of the ML model quality metrics (Q, RAE, MAE) is confusing. It is unclear exactly how each of these metrics is used to perform model selection. Is MAE even used? If not, then it's not clear why it is introduced.
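To make the additivity property in point 2 concrete: the efficiency guarantee of Shapley values can be checked with a brute-force computation on a toy model. This is only an illustrative sketch with made-up factors and a made-up baseline (not the authors' implementation, which uses the SHAP library on fitted ML surrogates); it shows that the contributions always sum to the deviation of a prediction from the baseline prediction, whatever set of factors is chosen.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for prediction f(x) against a baseline input.
    Features absent from a coalition are set to their baseline value."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                # coalition weight |S|! (d - |S| - 1)! / d!
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                x_S = [x[j] if j in S else baseline[j] for j in range(d)]
                x_Si = [x[j] if (j in S or j == i) else baseline[j] for j in range(d)]
                phi[i] += w * (f(x_Si) - f(x_S))
    return phi

# toy nonlinear "projection" model with three hypothetical labelled factors
f = lambda z: 2.0 * z[0] + z[1] * z[2]
x = [1.0, 2.0, 3.0]       # one ensemble member's factor settings (invented)
base = [0.0, 0.0, 0.0]    # reference settings (invented)
phi = shapley_values(f, x, base)

# efficiency: contributions sum exactly to f(x) - f(baseline)
assert abs(sum(phi) - (f(x) - f(base))) < 1e-9
```

The point of the exercise: if a seventh factor were added (or one of the six removed), the individual phi values would redistribute while the sum stayed fixed, which is exactly why the choice of factors matters.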
Other points:
Line 10-12: confusing sentence
Line 24: can be->are
Line 28: ice-sheet->ice sheets
Line 33: the key to what?
Line 33: it's not clear that this method can tell us why the numerical model delivered some particular result (in the absolute sense), but rather why it differs from other model results
Line 35: "this" view? which?
Line 39: how particular modelling
Line 43: positive with respect to model sensitivity?
Line 60: local importance of what?
Line 80: Is this a test case or just the point of the study?
Line 92: more discussion could be useful about why GCM choice is not included. I can see that this would be done to reduce complexity, but ultimately GCM choice is important to the full ensemble of results from ISMIP. At a minimum, better explanation of why this difference is omitted would be helpful.
Line 105: explain further about "preliminary groupings"
Line 127: the f(.) notation is generally very confusing to me, not sure what "." means in this context
Line 151: why are these particular ML models used? Also, I think it's a stretch to call a linear regression an ML model (I get that it's a simple benchmark, but most people would just call it a statistical model)
Line 157-158: not sure what this sentence means
Line 163: explain what you mean by "strong theoretical basis"
Line 181: the objective of this section
Line 181: ...by h_theta through the use of...
Section 3.3: I generally really liked this explanation of Shapley approach
Line 237: from the global mean prediction to what?
Line 235: The definition of the Shapley values guarantees that the sum...
Line 242: where is equation (5)?
Section 3.4: it would help here to give an example of the sort of dependence between inputs you are thinking of. I know you talk about this later in the results, but its hard to understand what you mean by dependence here without a concrete example.
Line 266: the procedure describe in...
Line 269: sensitivity of projected SLR to modeling assumptions at difference levels
Line 295-298: The results described here are confusing, because from Figure 5a, it seems like the best performing models are all RF.
Line 304: I could not find anything outlined in orange (maybe I'm not looking hard enough...)
Line 319: the use of "perform" here is confusing
Line 324: sensitivity of what?
Figure 6: Can you add a label to each panel indicating which model experiment it is? Also, it seems like it would make sense to color bars which are "negligible" (in terms of being less than the error) grey or something similar to indicate that their importance is below the threshold that you have defined as being interpretable. This would help reduce the amount of information the reader has to absorb when looking at this very interesting but dense figure.
Line 359: large absolute value of kappa
Line 365: the results are non-existent
Fig 7 and Line 371: I'm not sure the smooth fitting is defensible. There is clearly overfitting happening here, and you even talk about (for example) mu=+3 cm in Figure 7b, when such values are clearly only a product of the smoothing in an area of parameter space with no simulation.
Figure 7b: the dip in mu between 2 and 4 km resolution makes me wonder whether the modeling of input dependence has completely removed this effect. My guess is that the three results with 3 km resolution are all a single type of model which exhibits idiosyncratic behavior for other reasons besides resolution. This gets to point 2b above, but it is also why I think it is inadvisable to fit a smooth curve to these plots.
Line 383-384: Confusing sentence
Figure 9: this figure needs better explanation as I am not sure what the box plots represent compared to the red dots, etc.
Line 410: which provides narratives about the role of various modeling choices in generating inter-model differences in the GrIS MME
Line 419: affected by too coarse grids (here coarser than 5 km)
Line 424: could not have
Line 431: comparison of ANOVA with
Line 435: it would be useful to quantify the extent to which this problem is alleviated or not - several prior comments relate to the possibility that input dependence is still present in results
Line 436: for the high performance of our approach is
Line 443: In this study, we described the use
Line 444: assumptions in sea-level projections
Line 455: global effects of what?
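As an illustration of the leave-one-out exercise suggested in point 2(c), here is a minimal sketch using a synthetic design matrix and an ordinary least-squares fit as a stand-in for the authors' ML surrogates. All names, sizes, and coefficients are hypothetical; the idea is simply to measure how much explained variance is lost when each "known" factor is converted to a "hidden" one.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical design: n ensemble members x p labelled factors
# (stand-ins only; the real factors in Table A1 are largely categorical)
n, p = 40, 6
X = rng.normal(size=(n, p))
true_coef = np.array([3.0, 1.5, 0.0, 0.5, -2.0, 0.2])
y = X @ true_coef + 0.1 * rng.normal(size=n)

def fit_r2(X, y):
    """R^2 of an OLS surrogate (toy stand-in for the ML model fits)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

full = fit_r2(X, y)
for j in range(p):
    # convert factor j from "known" to "hidden" by dropping it from the design
    drop = fit_r2(np.delete(X, j, axis=1), y)
    print(f"factor {j}: R^2 loss when hidden = {full - drop:.3f}")
```

A factor whose removal barely changes the fit is one the explanation can safely omit; a large loss flags a factor whose SHAP contributions would otherwise be absorbed by the remaining (correlated) factors.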