Antarctic ice sheet model comparison with uncurated geological constraints shows that higher spatial resolution improves deglacial reconstructions

Halberstadt, Anna Ruth W.; Balco, Greg

doi:10.5194/tc-20-931-2026

Articles | Volume 20, issue 2

https://doi.org/10.5194/tc-20-931-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/tc-20-931-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 20, issue 2

Research article

|

04 Feb 2026

Research article |

| 04 Feb 2026

Antarctic ice sheet model comparison with uncurated geological constraints shows that higher spatial resolution improves deglacial reconstructions

Anna Ruth W. Halberstadt and Greg Balco

Download

Final revised paper (published on 04 Feb 2026)
Supplement to the final revised paper
Preprint (discussion started on 15 May 2025)
Supplement to the preprint

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-2008', Anonymous Referee #1, 03 Jul 2025

The authors present a new automated workflow for model-data comparison using the Penn State University ice sheet model and the ICE-D surface exposure age database. The paper goes in depth about different approaches and metrics for comparing model simulations to spatially and temporally sparse data, with a thoughtful treatment of uncertainties (model, analytical, and geological). This work is impressive and well suited for publication in The Cryosphere after some revision.

My largest concern with this manuscript is also the easiest to fix: the original sources of the data shown in figures need to be cited, not just the ICE-D database. The way this database is currently used means that Balco (2020) gets cited instead of the original data sources. For instance, the six sites shown in Fig 16 a–f span four publications, but only two of those publications are cited, and those not in relation to the data. Surely there is a way to easily compile the necessary citations when pulling data from ICE-D.

Because the paper is very long and dense, it would benefit from some significant revision for clarity. For instance, as you’ll see from my specific comments below, I was confused for a long time about the different metrics being described, how they were used, and how they did or did not interact with each other. It became more clear in the Results and Discussion sections, but it made the Methods section very difficult to get through. I suggest adding a more clear and thorough roadmap of that section before getting into the details.

I have also pointed out some places in the detailed comments below where aspects of the methodology seem arbitrary or not well supported by the data, or where such complete automation could be undesirable. Obviously it would be absurd to rely on careful manual curation of an ever-growing dataset forever, but the pitfalls of automation need to be clearly addressed. The manuscript does a fairly good job at this in various places, but I think it would benefit from a dedicated sub-section of the Discussion that goes further into these considerations beyond the comparison in Fig 11.

Specific comments:
In general, locations of sample sites shown in figures need to be marked on a map somewhere. For instance, site KRING in Fig 8 is not shown on any location map.

While automation and the use of uncurated data certainly has obvious benefits, the cosmogenic nuclide record is often ambiguous and frequently relies on expert assessment of individual samples and their geologic context. I worry that automating the model-data comparison process will come with a loss of expert judgment. For instance, the exposure-age record at Diamond Hill is quite complex and uses multiple nuclides (Be-10 and C-14) and sample types (erratics and bedrock), but Fig 16a is missing a key in-situ C-14-saturated sample and shows a greatly simplified version of the history that does not agree with the preferred ice history suggested by flow-band modeling (Hillebrand et al., 2021). I’m not insisting that the preferred history must be correct but instead pointing out how much nuance is lost in the automated approach.

L88–100. You could use the modern ice surface elevation at each site as a constraint, rather than a complete spatial map. This is worth considering adding to your analysis, especially given that many of the later figures in the paper show poor fit to present day ice thickness at the end of the model runs and this is essentially a free data point. And if one reason to compare model results with paleo-constraints is to calibrate a model for future projections, getting the modern state right is very important.

L102: Clarify that this ~10–20km resolution only holds for paleo simulations. Plenty of models use much higher resolution than this for shorter-term simulations.

L 163: Is “marine ice shelf instabilities” intentional phrasing?

L 167–170: All of these data sets (with the possible exception of Liu et al., 2009) are pretty seriously outdated at this point. Has any attempt been made to update these? Is there evidence that the model is insensitive to the choice of these datasets?

Section 2.2 needs more information:
What data sets are used to define present day ice thickness and bed topography?
How does the basal friction inversion work?
Is there also an optimized ice stiffness field?
How is the present-day temperature state achieved and what geothermal flux product is used?
What spatial resolution? Has any mesh convergence testing been done?
What sub-shelf melt and calving schemes are used?

L 182: “The model basal inversion slipperiness input is downscaled at a correspondingly high resolution for each nested domain.” Does this just mean it is interpolated from the coarse resolution, or is the inversion done on the nested domain or some other high resolution continental domain? For most of the outlet glaciers, the coarse resolution inversion is probably meaningless, since even Byrd Glacier is at best going to be a few grid cells wide at best.

“Nested simulations have been shown to be resolution-independent.” Numerically speaking, this cannot be true in general. The figure referenced from DeConto et al (2021) shows that the model is more or less converged with respect to resolution at 10km grid spacing for that particular domain and scenario. This will not necessarily hold for other applications.

Is there some relaxation period used to let the model adjust to the nesting resolution? Going to higher resolution usually means the model needs time to adjust, which can take thousands of years.

Fig 16 caption references Fig 0, which does not exist.

L261: Maybe this is made clear later on, but from this text it sounds like only ice thickness change (and not the actual value of ice thickness or surface elevation) is evaluated here. I understand that it’s often very tricky to get models to match absolute surface elevation, but you could have an excellent model agreement to observed ice thickness change and still be extremely far off in terms of the ice surface elevation. For many cases, this would make it seem like the model is doing a good job of explaining the data, when in fact it is biased extremely high or low. Update: Okay, I see that you also have this exceedance metric later on. It would be good to mention it here.

L330: What do you define as LGM age here? Depending on the production rate scaling scheme, even using 30ka would prevent C-14-saturated samples from being included. This also brings up another symptom of automation, which is that pre-exposed erratics can still provide a constraint on LGM surface elevation even if they do not provide a good constraint on the timing, but those are discounted here.

L335: What data product do you use to define the modern ice surface? Also, do the sites have to be strictly within a model cell? Due to the low model resolution, I can imagine cases where the sample sites are in a cell that is always ice free, but could still provide constraints on an adjacent cell that contains ice. This could happen even at the 2km resolution for some of the smaller TAM outlets.
What are the numbers of lumped sites for the different resolutions?

L375: Needs some example citations

L 400: It is starting to occur to me that there should be a summary table or list at the beginning of section 3 that briefly covers all the different metrics used.

S 3.1.6 Surely there is a more rigorous way to estimate the minimum geologic error? It seems strange to use this relatively sophisticated Monte Carlo analysis and then eyeball the dividing line for the tail of the distribution. Perhaps using the 95th percentile value as the cutoff? I know it’s unlikely to change the number much, but this would be preferable in terms of reducing interpretation bias and leaving the workflow flexible as new samples are added to the database.
Also, is there a strong justification for making this a uniform value, rather than varying it spatially by sector, for instance?

L423: needs references for “ Previous work has relied on continental-scale ice sheet models at 20-40km”. In general, all of section 3 would benefit from more references.

Fig 4 legend is missing the grey curve for the first 2km entry. The box on the continental map showing the location of b–e is hard to see. I suggest making it a bright color that contrasts better with the ice thickness map (or simply removing the ice thickness from this plot and just showing the grounding line or continental outline).

L472–474: Possibly as (or maybe more) important than the ice-flow perturbation is the formation of blue ice areas and wind scoops on the down-wind side. See figure 4 in Bintanja (1999), for instance.

L476: I’m not quite sure what is meant by “due to parameter variation” here. There are many factors aside from parameter uncertainty that play into this: forcing uncertainty, initial condition uncertainty, structural uncertainty, etc. You could probably just delete that phrase, as it’s fairly obvious why it is hard to match the modern state at the end of a simulation lasting tens of thousands of years. I would also change “may not” to “is very unlikely to” or even “will not”.

L487: Does the float-scoring metric really account for time? If so, it seems that time is double counted when also applying the time-offset metric.

L500: I don’t understand this sentence: “The timing of model thinning should be generally insensitive to model-data alignment issues.” Can you reword for clarity? By “alignment”, are you referring to the elevation misfit? It could also be interpreted as general alignment along space and/or time dimensions, which makes the sentence rather ambiguous.

L501: The parenthetical “horizontal” here is a bit strange, since horizontal by definition refers to space, not time. Specify that this is along the horizontal axis in the age-elevation space shown in Fig 6, for instance.

L506: There could also be random errors in the forcing datasets accounting for this. Not all forcing uncertainty is going to be systematic. Conversely, model resolution issues could certainly impart a systematic bias. For instance, low resolution model simulations tend to exhibit marine-ice-sheet-instability-style collapse more readily than high resolution simulations. If the model is not converged with respect to resolution (as I would bet is the case for 40km simulations), there would be a systematic bias across the entire West Antarctic Ice Sheet during a deglaciation event.

Based on Figure 6, it looks like you apply a dozen or so equally spaced offset values and then score them to find the best of those user-defined values. Presumably there is a straightforward way to solve this as a minimization problem, which would be more accurate and robust and make the workflow more automated. Also, panel b would make more sense if the axes were flipped, since the offset is the independent variable here.

S3.2.3: Are these metrics weighted equally? Can you do some kind of L-curve analysis to determine the optimal weights?

L529: If I understand this correctly, using “the closest time of exposure of the data point” means that the interpretation of the exposure age is dependent on the model prediction, which doesn’t seem appropriate here. It also suggests that any age within the uncertainty bounds is equally likely, which is not the case.

L~565: By this point, I’m fairly confused. What was the point of the metrics in 3.2.3 if you instead use this misfit metric in equation 1? I think this section needs to start with a more thorough roadmap that explains which metrics are needed and what they are used for.

3.3.2 title: Presumably you want to score the model, not the samples?

L583: Why not use a similar 10kyr value for this case? The way it is done here makes the two youngest samples in Fig 7 have significantly less weight than the oldest sample, even though the model fits them all very poorly. Also, it seems like the choice of the 10% ice-thickness-change vertical window could have a very large impact on scoring. If that window was just a very tiny bit wider, both the oldest and the second-youngest samples in Fig 7 would be interpreted by the algorithm as having relatively small ∆t values. How was this 10% window determined? Instead of using a window of uniform height for each site, what if the algorithm could solve for the necessary window height and then account for that in the misfit calculation?
Also, should this reference Fig 7 instead of Fig 5?

Fig 8: The top panel shows a misfit of 0 even though it underestimates modern ice thickness by 50m (out of only 350m total thickness change) and is still thinning at a significant rate at the end of the simulation. See my comment above about including the modern ice thickness as a constraint.
This misfit cap of 50 seems quite arbitrary as well. How was this determined? Can you show at least two other sites to show that this value is representative?
This figure doesn’t necessarily support the cap of 50. For instance, the difference between the misfits of 46 and 55 is not all that significant: they both thin too little, too late, and the difference is just a matter of degree. And more importantly, the run with misfit 55 thins at about the right time and is in fact significantly better than the run with misfit 120, which bears almost no resemblance to the data whatsoever. So I think the limit of 50 is actually a bit too restrictive here. One could also argue that the runs with misfits of 46 and 55 both explain the data better in some senses than the run with misfit of 34, since that thins thousands of years too early (assuming that the assertion in L 487 that the float misfit score accounts for timing of thinning is true).

L607: How sensitive are the conclusions to this spatial weighting scheme? It seems like you could defensibly define this in multiple ways — for instance by using ice-flow catchment boundaries (either time-dependent or modern) — and those could possibly give different answers without a clear way to choose between them.

Eq 5: I’m still a bit lost at this point. Reading from the top of section 3 from top to here, I don’t understand how all the different metrics come together. You first define a float-scoring (i.e., thickness change) metric and a best-time-offset metric in S3.2.3. Then in 3.3.1 you define this sample misfit metric that appears to be more or less independent of either of those. That metric is used to calculate site misfit, which is then used to calculate model misfit. So where do the float-scoring and best-time-offset metrics come in? As I’ve mentioned above, this section would greatly benefit from a very clear roadmap at the beginning that gives a very clear summary of the procedure before you dive into the details. The text at the beginning of Section 3 seems to attempt this, but doesn’t really help clarify the approach for me.

L 628–630: need example references

L705: Technically, saturated C-14 concentrations provide robust evidence for lack of LGM ice cover. The scaling factor of 10 seems arbitrary as well. How sensitive are results to this assumption?

L731: Here the float misfit score has come back. Is this in fact the same as the misfit defined in Eq 5? Why is the time-misfit score not mentioned here?

L788: “wrong more consistently” is rather ambiguous phrasing. Reword for clarity. Something like “exhibit a more consistent time-offset bias”. Refer to Fig 13 here to help illustrate.

Fig 13: Median and interquartile range might be more meaningful than standard deviation for distributions like these.

L835: Are the 20- and 10-km results shown anywhere? If not, a figure should be added.

L898: delayed and more rapid compared with data, or with other model runs?

L896–904: My guess is that this is due to the power-law (Weertman-type) basal friction parameterization used in the PSU model, which uses a relatively large exponent and thus assumes a relatively hard ice-sheet bed. A more-plastic bed rheology would likely be more appropriate for much of the Antarctic Ice Sheet, though the appropriate rheology and thus the appropriate friction parameterization will vary widely in space. The power-law exponent has not been varied as a parameter for any study with this model that I am aware of, but it has a large impact on time-evolving behavior in other models (Parizek et al., 2013; Gillet-Chaulet et al., 2016; Nias et al., 2018; Joughin et al., 2019; Brondex et al., 2019; Hillebrand et al., 2022; Schwans et al, 2023). It cannot be determine with a snap-shot inversion approach, but instead requires calibration in time-evolving simulations, or a transient inversion. I think using a more-plastic bed rheology might tend to lower the maximum thickness (leading to better exceedance scores) and also lead to steadier, less abrupt thinning (potentially improving float misfit scores if I understood the previous text correctly). Obviously you don’t need to attempt this here, but it’s worth mentioning that this key assumption has never been tested with this model.

L980: One explanation that occurs to me (although there are many many possible and complementary explanations) is that deposits in currently ice-free valleys alongside TAM outlet glaciers can be hundreds of meters lower in elevation than deposits of the same age that are on the glacier valley walls because the glacier margins extended several km into these valleys at the LGM (see, eg., the wide range of elevations of the mapped Britannia I limit at Lake Wellman in King et al., 2020). Those two populations would both be compared against very similar (or identical) modern-day surface elevations, so they record very different ice thickness changes despite recording the same event at almost the same location. Without some calculation that accounts for the very different elevation of these valley-floor deposits (e.g., using some estimated surface slope to project them back to the glacier centerline or nearest model grid-cell), it’s unlikely that they will be used accurately in this analysis. Recorded rates of thinning also vary widely (up to maybe a factor of 2) between valley-floor and nearby valley-wall samples. An automated approach will likely always miss the difference between these types of samples. To be fair, however, most expert curation would has also missed that distinction.
There’s also the issue that the small-scale meteorology of the glacier margins is a strong control on location of deposits, and is not going to be represented at all in the model. For instance, the presence of algae-hosting melt ponds at the LGM and even the deposition of erratics in valleys requires marginal ablation areas.
L991: Resolution is just one of many important model choices that have a bearing on this, so it’s not necessarily resolution that is limiting accuracy at this point. The remaining discrepancies are more likely due to all the other sources of uncertainty (model structure, unrepresented physics, parameters that likely need to vary spatially, bed topography, forcing, etc). I would guess that you’re not going to see much further improvement at higher resolution at this point, although it would be interesting to test that for a few of these ensemble members. The nested domains are also still driven by the low-resolution simulations at the boundaries, so they cannot completely decouple from the inaccurate low-resolution ice dynamics occurring outside the high-resolution region. This might not be a big issue at the interior sites, but in the TAM it is probably important, since the grounding-line position in the Ross Embayment is likely to be resolution dependent.

The sign convention on time in the figures in the appendix is reversed relative to the main text. Please point this out in captions.

References:
Balco, Greg. "A prototype transparent-middle-layer data management and analysis infrastructure for cosmogenic-nuclide exposure dating." Geochronology 2.2 (2020): 169-175.
Bintanja, Richard. "On the glaciological, meteorological, and climatological significance of Antarctic blue ice areas." Reviews of Geophysics 37.3 (1999): 337-359.
Brondex, J., Gillet-Chaulet, F., and Gagliardini, O.: Sensitivity of centennial mass loss projections of the Amundsen basin to the friction law, The Cryosphere, 13, 177–195, https://doi.org/10.5194/tc-13-177-2019, 2019.
DeConto, Robert M., et al. "The Paris Climate Agreement and future sea-level rise from Antarctica."Nature 593.7857 (2021): 83-89.
Gillet-Chaulet, F., Durand, G., Gagliardini, O., Mosbeux, C., Mouginot, J., Rémy, F., and Ritz, C.: Assimilation of surface velocities acquired between 1996 and 2010 to constrain the form of the basal friction law under Pine Island Glacier, Geophys. Res. Lett., 43, 10311–10321, https://doi.org/10.1002/2016GL069937, 2016.
Hillebrand, Trevor R., et al. "Holocene thinning of Darwin and Hatherton glaciers, Antarctica, and implications for grounding-line retreat in the Ross Sea." The Cryosphere 15.7 (2021): 3329-3354
Hillebrand, Trevor R., et al. "The contribution of Humboldt Glacier, northern Greenland, to sea-level rise through 2100 constrained by recent observations of speedup and retreat." The Cryosphere 16.11 (2022): 4679-4700.
Joughin, I., Smith, B. E., and Schoof, C. G.: Regularized Coulomb Friction Laws for Ice Sheet Sliding: Application to Pine Island Glacier, Antarctica, Geophys. Res. Lett., 46, 4764–4771, https://doi.org/10.1029/2019GL082526, 2019.
King, Courtney, et al. "Delayed maximum and recession of an East Antarctic outlet glacier." Geology 48.6 (2020): 630-634.
Nias, I. J., Cornford, S. L., and Payne, A. J.: New Mass-Conserving Bedrock Topography for Pine Island Glacier Impacts Simulated Decadal Rates of Mass Loss, Geophys. Res. Lett., 45, 3173–3181, https://doi.org/10.1002/2017GL076493, 2018.
Parizek, B. R., et al. "Dynamic (in) stability of Thwaites Glacier, West Antarctica." Journal of Geophysical Research: Earth Surface 118.2 (2013): 638-655.
Schwans, Emily, et al. "Model insights into bed control on retreat of Thwaites Glacier, West Antarctica." Journal of Glaciology 69.277 (2023): 1241-1259.

Citation: https://doi.org/10.5194/egusphere-2025-2008-RC1
- AC1: 'Reply on RC1', Anna Ruth Halberstadt, 24 Oct 2025
  
  Please see attachment.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2008-AC1
RC2:
'Comment on egusphere-2025-2008', Anonymous Referee #2, 05 Sep 2025

This study by Halberstadt & Balco presents a series of new metrics for paleo data-model comparisons, and applies them to an ensemble of ice sheet model simulations to show that (i) higher resolution nested domains (in their case, 2 km) can improve data-model mismatches, and (ii) there is no "best metric" for data-model comparison, as each metric captures different aspects of ice sheet model behaviour. This often results in a tradeoff between matching the magnitude and the timing of thinning. Furthermore, they show, using their metrics, that the use of an uncurated dataset for model scoring is as constraining as the use of curated data, meaning that it is much easier to include new geologic data as they are published (and also has no subjective/interpretive bias), further improving our ability to constrain ice sheet model ensembles.
This study is relevant to the paleo ice sheet modelling community, both due to the new misfit metrics presented, and to further show within a non-idealised context that higher resolution is needed for ice sheet models to better match constraints of ice sheet geometry and time evolution. The small number of ensemble members and explored parameters is not a problem, as it is justified by their different goals from previous data-model comparisons: not necessarily try to obtain statistically significant distributions of different metrics for AIS behaviour, but rather showcase their new tools and that high-resolution simulations improve data-model mismatches.
Overall, the findings are well worth being published, but the manuscript needs a more thorough, careful revision of the text before I can recommend it for publication. There is a lot of redundant information spread across different sections which read more like a repeat of the text rather than a helpful recap. In other instances, the explanation of the scoring metric feels a bit convoluted, or the justification behind certain ways to circumvent a problem is not super clear. Several figures are not cited anywhere in the text (and I suspect one is missing entirely), and some subsections could be reordered (with slight rephrasing/rearrangement) for a more fluid read. Below I offer some more general suggestions, as well as line-by-line and figure-specific comments.
General comments
Section 2.3:

- How often is information carried to the nested domains, i.e., what is the coupling time step? Which variables are passed on?

- How is the downscaling done at the beginning of the high-resolution simulations? Is it just a linear interpolation from the coarser domain?
Section 3:

- Section 3.1.1 could be made shorter, and perhaps part (if not all of it) should go to the introduction, which is more fitting for this kind of information. As it stands, it feels like a massive break in reading flow to provide adequate background, which should have been presented earlier. For example, there's overlap on the discussion of coarse model resolution being unable to resolve glaciers where data is collected, which feels more like redundant information than a helpful recap.

- Section 3.2.1: the first paragraph could be more concise, it replicates a lot of information already repeated 2-3 times...
- The description of the metrics could be made clearer:

- In "float scoring", does it mean that the changes in ice thickness are being compared, rather than in ice surface elevation? Fig. 5 shows "elevation above modern ice", but I it's not completely clear how the modelled change in thickness is translated to surface elevation (is it modern+offset? if so, please explicitly state that)

- In "best-time-offset metric", it is not stated at any point that the curve you are trying to minimise is the best-fitting curve obtained through the float scoring metric. This is only made clear way further down in the discussion (L865-866). This is important to be explicitly stated here as well.

- ice thickness exceedance: is that simply how much the modelled ice thinned beyond what the cosmo data shows for sites where no Holocene (re)thickening happened?
- I honestly struggle with the fact that the way each scoring is minimised is discussed before the scoring formula is presented. After reading the entire manuscript a second time, I can better see the reason, but only because I had already read the latter sections. Please consider if there is an easier way to introduce all concepts needed in this section, e.g., first introducing the different misfit metrics (3.3.1), then how they are minimised (float scoring, best-time-offset; 3.2.3), then what happens once they fall out of this range (sect 3.3.2). Naturally some of the discussion text around limitations and justifications for the scoring in section 3.3 (e.g., L562-567, Sect. 3.3.2) could be kept where it is, as it is about what was discussed in current Sect 3.2.
- Section 3.5: part of the information in L641-650 feels more repetition than recapping (as in Sect 3.1.1). I suggest the half-life discussion here and in Sect. 3.1.1 to be put together, maybe in the introduction. This way this section can be made much more concise. The paragraph in L651-663 could also be significantly shortened, considering it is about types of data that were not used..
Section 4:

- The explanation behind variations in OCFACMULT and CSHELF (two ice-shelf related parameters) having a much stronger impact in the model-data fit is quite brief and in my opinion could be better highlighted, as it is an important result. Similarly, I miss a discussion on why TLAPSEPRECIP does not affect the results as much. I would expect this factor to have a stronger impact, considering it directly affects how the ice surface is resolved in the grid cells where the data is present, and potentially by how much the ice sheet interior thickens (or not) during the LGM.
Line by Line comments
L22: I believe it was supposed to be "simulations" in plural
L32: It should be more specific what "proximal geologic records" means. It comes up rather early in the manuscript, without providing enough background. Considering this article should be of interest to a wider audience than paleo ice sheet modellers that compare their models with geological constraints, it's worth giving better background to what that refers.
L133: there's an extra ")"
L168-170: How is the oxygen isotope curve translated to temperature anomalies? Similarly, are the time slices of Liu et al. linearly interpolated within the model, is it a step-change, or how exactly is it prescribed? I appreciate a lot of technical information on the model might deviate from the manuscript's focus, but it is good to provide a good explanation on how the climate is prescribed to help the discussion about the model limitations leveraged by the time offsets.
L172-174: I would assume that the way the atmospheric forcing is prescribed is the same as in Sect 2.1 (see comment above), but that cannot be the case for the ocean. Please make it clearer how the initial conditions were obtained.
L178: It would be helpful to mention Fig. 1a here for the reader to be promptly pointed to where these domains are
L205: It should be made clearer that CSHELF is needed because in paleo simulations the grounding line will likely advance beyond the areas where information can be obtained by inversion/nudging. This is a problem that paleo ice sheet modellers are familiar with, but other ice sheet modellers might not be.
L230: Please clarify which type of interpolation was used. Was it just a simple linear interpolation?
L245-246: It would be useful to state the resulting ranges of this approach so it can be more easily compared with the other methodologies described.
L264-269: I am confused here. Is amplitude of thinning the thickness difference or magnitude of thinning rates? Are the "thinning patterns" the magnitude of thinning rates or the shape of the thickness-evolution curve? Is the dataset for both metrics totally different, or is just the same (Ice-D), but processed in a different way to allow computing the desired metric?
L344-349: I am not sure I follow the procedure described. I don't think a "steepest descent scheme" is a familiar term to all target readers of this manuscript, and hence more details should be added.
L382-387: When comparing the text with Fig. 2d, the smaller error bars obtained from the MC approach make 1 or 2 accepted samples to lie outside of the derived interval. Is that correct? Does that mean that they are then also discarded? It is not clear to me how the obtained error/uncertainty is used.
L391: I assume this a different vertical window than discussed before in L245-246. Is there another term or an adjective that could be used for either to avoid confusion?
L401-402: Is that the age spread in the 'MC cloud' at each sample elevation? It is not quite clear how the number "500" was obtained, as the way it reads it feels rather subjective.
L437-442: At which time step/slice is that search done? From the explanation of the reasoning behind the search, it could either be at every step (since the base of the hillside would move as the ice retreats) or at present (the 'lowest base possible').
L455-456: Wouldn't a site, by definition, not span different model grid cells? I am confused what this sentence is trying to say.
L485-486: Do I understand correctly that the thickness-evolution curve is added on top of the sample-derived modern ice surface, and then offsets are applied until the minimum misfit is found? This needs to be clarified.
L575-579: Were there any sensitivity tests performed to show to which extent this happens, and to support the choice for 10kyr? I would assume so, but it would be good to explicitly say/show the numbers.
L605-608: Please provide the formula for S_w, since it is explicitly used in Eq. (5).
L615: I assume "all site misfits" is the n in Eq. (5)? Please make it clear.
L674-676: does distinguishable/indistinguishable refers to whether their nuclide concentration lying below the steady-state value considering the standard deviation? I honestly appreciate that there is a lot of care not to use overly technical jargon, but in some situations it might obscure the real meaning of a sentence.
L699-700: Wouldn't these events be less likely to deposit erratics at all? Why "post-LGM-age erratics" only?
L702: Can you be more specific regarding how Slgm is calculated between 0 and 1?
L708: I am not sure I follow the argument, please rephrase/expand.
L712-715: Is the "present reference" in the modelled thickness above present based on the model's state at the end of the simulation, or based on the present-day dataset? I would assume this is a similar problem than the "float scoring" tried to circumvent? Or are elevation and thickness changes being used interchangeably here? Or is "thickness" not the full thickness of the ice sheet, but the difference in thickness/elevation? This part is getting me confused!!
L751-756: This information has already been given previously, so the paragraph could be condensed to 1-2 sentences for conciseness.
L763: This writing is odd, I'd suggest something like "Fig. 11 also shows the impact of their tier designations on our scoring".
L776-784: Most of this information was already given in previous sections, so this paragraph could be significantly shortened.
L785-788: I like the way that the technical part is summarised in less technical terms: "domains are less wrong" and "are wrong more consistently". I think, however, that you do not need to explain it straight away within parentheses what it means, because the following paragraphs do pretty much that, using almost the exact same words. Recapping information is good, but when sentences are so close to each other it feels too repetitive...
L806: The mention of panels d,e in Fig. 13 comes before panels a,b,c. I'd recommend reordering the figure panels (i.e., panels d,e should be a,b and panels a,b,c should be c,d,e)
L806-813: Unless I misunderstood the point here, I am not sure I fully agree with it. This might be true if the different sites are relatively close to each other, but, in line with the discussions in Section 5.2, there's no guarantee that asynchronous changes in the atmosphere and the ocean did not happen around the ice sheet, meaning that different regions would have experienced an earlier/later deglaciation, or that certain basins would have responded faster/slower to a given forcing.
L829: It would be easier to follow what a best-time-offset outlier is if it was first defined in the previous paragraph (suggestion, around L 822), so you wouldn't need to add this "i.e." parenthesis. Also, please try to be consistent with best-time-offset or best-offset, as it is not clear whether they were supposed to be the same or not (if not, please make sure they are both clearly defined).
L882: simulations -> simulation
L928: The use of "modern model timestep" here can be ambiguous, I'd suggest something like "modern time slice" or just "modern time".
L930-931: This information could come in the introduction already, as it is not quite dependent of the results, and yet might appease the skeptical reader sooner rather than too late.
L961-964: Although the argument makes a lot of sense, I can only see the mentioned trend on East Antarctica (EAIS+TAM), as opposed to being "most prominent". It is perhaps because of the clustering of points, but this trend is not visible at all for WAIS or Weddell. There are both too-early and too-late points with just as low low exceedance scores in both regions. If the clustering does not allow for this relationship to be properly observed in the figure, then the axis needs to be expanded.
L969-971: Some punctuation and rephrasing is needed for clarity. Suggestion: "model member 7-2-35 overthickens (...), simulating a delayed timing of thinning compared to the 5-0.3-35 model, which shows a smaller amplitude change and an earlier onset of thinning".
L978: Unclear what a "a lingering disconnect" refers to, especially because, without being sure what that means, the following sentence seems to contradict what was just said.

Figures
I miss a figure that shows the name and location of sites that are mentioned in the manuscript, as this information does not exist anywhere. I even suspect it was accidentally left out, considering Fig. 16 mentions this information exists in "Fig. 0".
It would also be useful to have a figure showing how wide the range of modelled deglacial behaviours actually is (e.g., changes in grounded ice volume through time), so it can be compared with the other examples cited by the authors.
Fig 2: Caption of panel (c) says 'red', but at least to my eyes it is orange.
Fig. 4: The difference in ice surface elevation between the 2 and 20km domains before the main thinning event (panel a) is quite striking. Considering the 2km domain uses the 20km as initial conditions, is there a "relaxation period" at some point between 30 ka and ~19ka, a previous thinning event, or why is the discrepancy between the two curves so large?
Fig. 6: Is the zero-offset curve in Fig. 6a the same as the black line in Fig. 5a, as the caption implies? They look very different to me before 15 ka. Please double check/clarify.
Fig. 8: The axes labels are too small. Please enlarge
Fig. 13: The meaning of the red highlight on the bars in panels d,e is not described in the caption. Do these bars represent the points shown in panel (a)? Also, is there a more helpful name to refer to than "best offset"? It is not clear whether that is the "smallest needed offset", or something else. It would be helpful to more clearly relate to the scoring metric that was used.
Fig. 16: Caption refers to "Fig. 0", and I believe this is a figure that is missing in the manuscript because I'd suggest adding a figure that shows the site names discussed in different figures. Also, the grounding-line contour is not visible.
Figures 3, 8, and 9 are never referenced in the text. Fig. 10 is only referenced in the caption of Fig. 16.

Citation: https://doi.org/10.5194/egusphere-2025-2008-RC2
- AC2: 'Reply on RC2', Anna Ruth Halberstadt, 24 Oct 2025
  
  Please see attachment.
  
  Citation: https://doi.org/10.5194/egusphere-2025-2008-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Publish subject to revisions (further review by editor and referees) (15 Nov 2025) by Florence Colleoni

AR by Anna Ruth Halberstadt on behalf of the Authors (15 Nov 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (01 Dec 2025) by Florence Colleoni

RR by Anonymous Referee #2 (11 Dec 2025)

Suggestions for revision or reasons for rejection

I have gone through the revised manuscript and must say it is a massive improvement from the first version in all manners: it is better structured, flows much better while reading, and the key concepts and metrics are much better described. I particularly liked the idea of adding a table that describes the different metrics, even if there's room for improvement (see suggestions below). In general, I find this work a significant step forwards in term of data-model comparisons at paleo scales in ice sheet modelling, and only offer relatively minor suggestions, mostly editorial and some others to clarify key missing points.

Once these are taken care of, I am happy to recommend this manuscript for publication in The Cryosphere. I look forward to seeing it published!

L30: "Because the time period [...] is too short", as opposed to "are".

L156: I still find it confusing that this is called a "secondary set of cosmogenic nuclide model constraints (on the maximum ice thickness during the last glacial cycle)", considering it is essentially the same Ice-D dataset. It would be much clearer if the two datasets were explicitly split into something like "thinning constraints" and "maximum thickness constraints" (instead of calling one of them 'secondary'), stating that both were based on the Ice-D dataset. I know it is largely a naming issue, and perhaps a pet peeve, but I found it much easier to understand what was being done once I decided to call them like that myself.

L203-206: It is not yet clear to me what "linearly downscaling the continental simulation across the preceding 2,000 years" means. Do you average the model results between 32 and 30ka, and this is what you (bi?)linearly interpolate to the regional domain? This needs to be more clearly described.

L244-247: I understand that it is nice and important to acknowledge previous studies that tried something similar, but I feel like this was already done in the introduction (for this specific study and others), and the justification for the choice of parameters, parameter space, and number of simulations is already given at the start of this section. This paragraph feels a bit redundant and unnecessary, so I'd be tempted to remove it.

L273: "this vertical window is ±47m".

Table 1, "Role in the 'float scoring approach'" for Mmodel: is it supposed to be "The 'model misfit' model score is computed by summing all 'float misfit' site scores"? This table is really helpful, but still needs more clear descriptions. Some variable names are the same for different variables, which adds to the confusion. Is there a way to improve the variable naming so each reference to another metric can be clearly related to its respective cell? For example, 'float misfit' and 'exceedance' are mentioned in the description cells, but never appear as variable names.

L485: There's an extra '('.

Figure 7: Panel (b) seems to be reversed, as the figure caption and panel (a) treat the applied negative offset as making the modelled deglaciation happen later (and thus shifted towards the left). The caption states "-2 kyr" as the offset, whereas in the figure it is +2 kyr. It also refers to a red dot, whereas the dot is (I believe) plotted in blue.

L647: The mention of Fig. 16 comes before many other figures (I believe before Fig. 10). I understand that it is convenient to have it as such, but please double check if that is in accordance with editorial guidelines, or if that figure needs to be put further up/moved to the supplement/appendix.

L677: correct C14 to 14C at the end of the line.

L715: Is H_model computed as max(H_model) because it is the maximum thickness change over the simulation time? If that is correct, please make it explicit.

Fig 12b Please change the x-axis direction so it is in the same order as the other figures, for consistency.

L811-813: I still do not agree with "A numerical simulation where the modeled timing of thinning has to be shifted by a different amount at each site to minimize model/data mismatch is
a poor representation of the geologic record;". It is not reasonable to believe that offsets in time should be the same across the entire ice sheet, as the climate evolution is not homogenous or synchronous over the entire continent. I do, however, agree with the follow-up argument (L814-816), where it is explicitly stated "A simulation where model/data mismatch at each site can be minimized by shifting the modeled timing of thinning by the same value everywhere **within a catchment** is a good representation...". I think the same reservation (i.e., "within a catchment") applies to the former, and would suggest the authors adding it.

Fig 13a,b: As in Fig. 7b, please double check the offset signals (and axes directions) so all figures are consistent with each other.

L875-877: which lingering challenges? Please state them (or at least the main ones), otherwise this sentence feels too generic.

L885: There's a typo in the in-text citation style.

L885-886: This statement is fairly unspecific and general: "a symptom of systematic difficulties with model reconstructions of deglacial ice sheet behavior". This point is much better addressed in L996-1004, so I'd suggest either referencing it (as done before in the manuscript for other concepts/arguments), or expanding on this statement.

L900: I just noticed that "behaviour" here has the British spelling, whereas it has the American spelling throughout the rest of the manuscript. According to the TC guidelines, both styles are allowed but they need to be consistent throughout the manuscript.

L924-926: I find it hard to take this statement at face value considering that the x and y axis ranges in Fig. 14 are different in all panels. In absolute values, the exceedance scores are much better for EAIS and TAM, whereas the float score varies quite a lot in all plots. The distribution of all points in the plot would change if the axes limits were kept the same for all of them, so I wonder if it would make sense to normalise the scores across regions? Is there any better way to compare the data than how is presented here?

Fig. 15: Please double check whether the sign of the offset agrees with the intended sign after checking Figs. 7 and 13.

L974: Is it worth adding that despite persisting for longer, they do match the thinning curve better (i.e., lower float scores)?

Fig. 16g: Please add to the caption what is being plotted in greyscale. Is that ice thickness or basal topography? If so, at what time?

Hide

RR by Anonymous Referee #1 (20 Dec 2025)

Suggestions for revision or reasons for rejection

I thank the authors for carefully addressing my comments on the manuscript. These revisions have made the paper much easier to read, which I think most readers will appreciate for a paper of this length. I have a few further comments and suggestions, but I consider the paper pretty much ready for publication.

All line numbers refer to the version with tracked changes.
L 213: Are ocean temperatures still taken from a single ocean depth of around 300–400m, as was previously the case with this model?
L236: As in my previous review, this statement is incorrect, or is at least ambiguous. DeConto et al. (2021) demonstrated convergence with respect to resolution below about 10km for a nested domain of Thwaites Glacier.
Section 2.4: Are you using a full-factorial ensemble design? It could be useful to have a small table with the parameters and their values.

Section 4: In the author’s response to reviewers, there is an explanation that really helped clear up my confusion during the first round of review: “We don’t combine these metrics together into one total score, because our metrics are aimed towards answering different questions about model data fit. The ‘optimal’ weights for combining metrics into one total score would differ based on the user’s interest.” I might just be missing it, but I don’t see a statement like this in the revised text. I think including something like this in the first few paragraphs of section 4 would help readers understand the rest of that section.

L560: sentence needs revising: “such as snow buildup patterns such as wind scoops”
L680: Model resolution could lead to systematic offsets as well; for example, coarse resolution models are often more susceptible to MISI-style retreat than high-resolution models. Conversely, forcing uncertainty can be random rather than systematic.
Fig 12b: Does this include just the 40 km runs?

Hide

ED: Publish subject to minor revisions (review by editor) (19 Jan 2026) by Florence Colleoni

AR by Anna Ruth Halberstadt on behalf of the Authors (19 Jan 2026) Author's response Author's tracked changes Manuscript

ED: Publish subject to technical corrections (20 Jan 2026) by Florence Colleoni

AR by Anna Ruth Halberstadt on behalf of the Authors (21 Jan 2026) Manuscript

Short summary

We developed a new framework for testing how well computer models of the Antarctic ice sheet match geological measurements of past ice thinning. By using more data and higher-spatial-resolution modeling, we improve how well models capture complex regions. Our approach also makes it easier to include new data as they become available. We describe multiple metrics for comparing models and data. This can help scientists better understand how the ice sheet changed in the past.