Review of “Ice Anatomy: A Benchmark Dataset and Methodology for Automatic Ice Boundary Extraction from Radio-Echo Sounding Data”
This manuscript presents a benchmark dataset, “IceAnatomy,” designed to support and standardize the development and evaluation of deep learning models for extracting ice surface and bottom boundaries from radio-echo sounding (RES) radargrams. The dataset includes over 45,000 km of RES observations from multiple institutions and systems across diverse glaciological settings, along with baseline models and standardized train-test splits. Overall, the work addresses a pressing need in the cryosphere and remote sensing communities for reproducible, large-scale datasets that can accelerate progress in automated ice thickness estimation.
I commend the authors for their thorough and careful revisions, which have significantly improved the clarity, completeness, and overall quality of the manuscript since the previous review round.
Below, I provide detailed comments regarding the strengths and areas where the manuscript could be improved.
• In the sentence “It is the first…” (line 45), the term “human-annotated labels” could be clarified further. Does this refer to fully manual annotations, or to semi-automated labels subsequently verified or corrected by humans? Given the importance of label quality in training and benchmarking deep learning models, this distinction is relevant for understanding the dataset’s reliability.
• Line 51: Please rewrite this sentence for clarity.
• Lines 52–75: The authors suggest that near real-time identification of the ice bottom boundary during RES data acquisition could allow for dynamic adjustments of flight plans to focus on areas of high interest. While this is an interesting idea, I wonder how often knowledge of the ice bottom alone, without broader context (e.g., basal conditions, surface conditions, prior survey goals), would justify altering flight plans during a campaign. Some clarification or examples from field experience would strengthen this claim and help the reader better understand its practical relevance.
• Line 57: “This would represent a step toward a comprehensive, quantitative, and standardized approach for interpreting radargrams, ultimately leading to fully automated products that could significantly benefit the cryospheric research community.” Please clarify that “interpreting radargrams” here refers only to the ice surface and ice bottom boundaries.
• Lines 75–77: The statement regarding the limitations of existing ice bottom labels is broad and lacks sufficient specificity. To strengthen this important critique, the authors should clearly separate and elaborate on each claimed issue, such as inaccuracies, automatic generation methods, data unavailability, lack of transparency, and missing radargrams, and provide concrete examples or citations of datasets where these problems have been documented. Without this clarification, the claim risks appearing vague and unsubstantiated, which weakens the justification for the need and novelty of the IceAnatomy dataset and its claimed advantages over existing resources.
• Line 77: In support of the statement regarding the limitations of existing ice bottom labels (e.g., inaccuracy, lack of transparency, or missing radargrams), the authors cite several references. However, Dong et al. (2022) use synthetic radargrams, which may not be directly relevant to a critique of real RES datasets or their associated manual/automatic annotations. I recommend revisiting this citation and ensuring that each reference clearly supports the specific issue being discussed. This would improve the precision of the argument and strengthen the manuscript’s positioning.
• Lines 102–108: In the list of references for works that track internal ice and snow layers, the citation Moqadam and Eisen (2024) is included alongside algorithm-focused studies. However, this is a review article rather than a method paper, so it may be better to distinguish it from the rest. Consider adding a sentence such as “For an overview of methods used in this domain, see Moqadam and Eisen (2024)” instead. This would clarify the nature of the citation and improve the precision of the literature summary.
• Line 141: The statement “As the glaciers are temperate, i.e., most of the ice is close to or at the pressure melting point, they contain a relatively high proportion of water” would benefit from a supporting reference. Please consider citing glaciological studies or datasets that characterize the thermal regime and water content of these specific glaciers to substantiate this claim.
• Line 143: The authors state that the glacier characteristics “pose a significant challenge to machine learning systems.” This is an important and plausible point, and I appreciate the intuition behind it, but it would be helpful to clarify whether it is based on prior research, quantitative comparisons in the current study, or anecdotal experience. If other studies have demonstrated lower model performance on temperate glaciers or on radargrams from deep/steep troughs, please cite them. Otherwise, consider softening the language or providing some evidence from the dataset or the baseline results presented in this paper.
• Line 154: The reference to Rignot et al. (2011) for ice velocity maps is valid, but more recent and higher-resolution velocity datasets are now available. I recommend updating or complementing this citation with a more recent source to ensure the comparison reflects the current state of ice velocity mapping.
• Lines 203–213:
◦ The authors provide a commendably detailed description of the annotation process, including the use of a single interpreter for consistency, cross-profile validation, and comparison with control points. This level of detail strengthens confidence in the dataset’s quality; well done.
◦ Since the authors mention using ReflexW software for zooming and clarifying radargram features, it would be helpful to include a formal citation or reference for this commercial software to guide readers unfamiliar with it.
◦ The description suggests that the labeling involved some degree of software-assisted (semi-automatic) annotation rather than purely manual picking. For clarity, please specify whether the labels were created fully manually, semi-automatically with manual corrections, or a combination thereof. This clarification is important for users evaluating the dataset and its annotations.
• Lines 268–270: The description of the U-Net-based model for ice boundary extraction is clear and well supported by relevant citations. However, I suggest including an additional recent relevant work in this context: Moqadam et al. (2024), which presents a closely related U-Net-based deep learning approach for ice boundary extraction. Also, as the cited version of this work is a preprint, please update the citation to the published version to ensure readers have access to the finalized paper.
• Line 341: The definition “depth resolution is the time it takes for the wave to pass through the physical equivalent of a pixel in the radargram” is not scientifically accurate. Depth resolution refers to the minimum vertical distance between two subsurface reflectors that can be distinguished as separate features in the radargram. It is a spatial (distance) parameter, not a temporal one, and depends on the radar wave velocity and the system’s temporal (time) resolution; see the standard relations sketched after this list. I recommend revising this sentence for clarity and accuracy.
• Line 487: Tone of self-evaluation. The sentences claiming the work is “a significant step” and “an important advancement” would benefit from more objective framing or clearer support from the results. While the impact of the work is indeed notable, I suggest moderating this language, in line with scientific conventions, unless further evidence is provided to substantiate such claims in comparison to existing datasets or methods.
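For reference, a minimal sketch of the standard relations (using the usual two-way travel-time convention, not the manuscript’s notation): the depth extent of one pixel and the bandwidth-limited depth resolution are

    \Delta z_{\text{pixel}} = \frac{v\,\Delta t}{2}, \qquad \Delta z_{\text{res}} \approx \frac{v}{2B},

where $v$ is the radar wave speed in ice (approximately $1.68 \times 10^{8}$ m/s), $\Delta t$ is the time-sampling interval, and $B$ is the system bandwidth.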
The manuscript presents a significant advancement in the study of glacial structures through the use of RES data, introducing a standardized benchmark dataset that will greatly benefit the community. By providing over 45,000 km of annotated radargrams, this work sets a foundation for future studies aiming to automate and improve ice boundary delineation using deep learning models. The potential of this dataset to facilitate robust comparisons of model performance across various settings is particularly valuable, given the diverse geographic conditions represented.
Despite the merits, there are specific areas that require attention before this manuscript can be recommended for publication:
1. The choice to standardize all radargrams to a height of 1024 pixels requires further justification, especially given the reduction in resolution this causes, which could potentially affect the precision of the derived ice boundaries. The manuscript should provide a more detailed rationale for this choice (possibly linked to computational efficiency), considering the capability of U-Net-like architectures to process inputs of varying shape (see the first sketch after this list).
2. The decision not to reserve an exclusive flight for the AWI testing subset, due to significant variability among the radargrams, is questionable. The inherent variability could, in fact, provide a rigorous real-world test scenario, which is crucial for assessing the robustness and adaptability of the model to new and varied environments (which should be the eventual goal of any benchmark dataset and of the models developed on it). A reevaluation of the testing subset choice is recommended to potentially enhance the findings.
3. The omni model shows reduced performance in the FAU and AWI domains, which the authors attribute to domain shifts. Consideration of alternative approaches, such as weighting samples by inverse domain frequency or uniformly sampling training examples across domains, could potentially mitigate this issue (see the sampling sketch after this list). An exploration of these methods would be valuable for enhancing model generalization.
4. The proposed U-Net uses two heads to separately predict the ice surface and bottom. Why is this preferable to a straightforward approach with a single head predicting both simultaneously? A softmax can still be applied afterwards in a column-wise manner to extract the boundaries (see the sketch after this list), so that should not be a limitation.
5. The authors write in Section 5.1: "Depending on the chosen method, the metrics used to assess the quality of the predictions differ," which is not really true: zone predictions are easily convertible to boundaries and vice versa (see the conversion sketch after this list), so there is no obstacle to providing the whole set of metrics for every method.
6. The manuscript claims that confusion matrix-based metrics would yield poor scores if predictions are, e.g., consistently off by a pixel. However, this statement is misleading, as these metrics are typically applied to zone predictions, not boundary delineations. A correction or further explanation is needed to resolve this confusion.
7. In Appendix A, it is stated that the authors used dropout layers inside the ResBlocks. Was this regular dropout? If not, it should be specified. If yes, I would suggest also trying something like spatial dropout (see the sketch after this list), as many practitioners have found it more helpful in convolutional networks.
8. Figures 6, 7, and similar graphics are challenging to interpret. I would suggest simply plotting four curves: the two ground-truth boundaries (surface and bottom) and the two predictions overlaid on top (e.g., dashed); see the plotting sketch after this list.
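Regarding point 1: a minimal sketch (a hypothetical stand-in, not the authors' model) illustrating that a fully convolutional U-Net-style network accepts any input height divisible by its total downsampling factor, so a fixed 1024-pixel height is not strictly required by the architecture itself.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Hypothetical fully convolutional stand-in for the paper's U-Net."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                                  nn.ReLU(),
                                  nn.MaxPool2d(2))      # halves height/width
        self.up = nn.Sequential(nn.Upsample(scale_factor=2),
                                nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, x):
        return self.up(self.down(x))

net = TinyFCN()
for h in (512, 1024, 2048):  # any height divisible by the downsampling factor
    out = net(torch.randn(1, 1, h, 256))
    print(h, tuple(out.shape))  # output height matches input height
```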
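Regarding point 3: a minimal sketch of the suggested mitigation, assuming each training example carries a domain label (the domain names below are hypothetical). Samples are weighted inversely to their domain's frequency so that all domains are drawn roughly uniformly during training.

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler, DataLoader

domain_labels = ["AWI", "FAU", "AWI", "AWI", "FAU"]  # hypothetical per-sample labels
counts = Counter(domain_labels)
weights = [1.0 / counts[d] for d in domain_labels]   # inverse domain frequency
sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)
# loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)
```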
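Regarding point 4: a minimal sketch of the single-head alternative raised above, assuming one output map per boundary. A softmax over the depth axis of each column yields a per-trace distribution over boundary positions, from which a hard or sub-pixel position can be read off.

```python
import torch

logits = torch.randn(1, 2, 1024, 256)        # (batch, {surface, bottom}, depth, trace)
probs = torch.softmax(logits, dim=2)         # column-wise softmax over depth
boundary_rows = probs.argmax(dim=2)          # (1, 2, 256): one row index per trace
# or a sub-pixel "soft-argmax" expectation over depth:
depth_idx = torch.arange(1024).view(1, 1, -1, 1).float()
soft_rows = (probs * depth_idx).sum(dim=2)   # (1, 2, 256)
```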
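Regarding point 5: a minimal sketch of the claimed equivalence, assuming a per-column zone mask (1 = ice, 0 = non-ice) with exactly one contiguous ice band per trace.

```python
import numpy as np

def zones_to_boundaries(mask):
    """mask: (depth, traces) array of {0, 1}; returns per-trace boundary rows."""
    surface = mask.argmax(axis=0)                            # first ice pixel
    bottom = mask.shape[0] - 1 - mask[::-1].argmax(axis=0)   # last ice pixel
    return surface, bottom

def boundaries_to_zones(surface, bottom, depth):
    """Rebuild the zone mask from per-trace surface/bottom rows."""
    rows = np.arange(depth)[:, None]
    return ((rows >= surface) & (rows <= bottom)).astype(np.uint8)

mask = np.zeros((1024, 256), dtype=np.uint8)
mask[100:800, :] = 1                                         # toy ice zone
s, b = zones_to_boundaries(mask)
assert np.array_equal(boundaries_to_zones(s, b, 1024), mask) # round-trip holds
```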
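Regarding point 7: a minimal sketch of the distinction. Regular dropout zeroes individual activations, whereas spatial dropout zeroes entire feature maps, which practitioners often find more effective between convolutional layers.

```python
import torch.nn as nn

block_regular = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                              nn.ReLU(),
                              nn.Dropout(p=0.2))    # element-wise dropout
block_spatial = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                              nn.ReLU(),
                              nn.Dropout2d(p=0.2))  # drops whole channels
```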
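Regarding point 8: a minimal sketch of the suggested figure style, assuming per-trace boundary rows are available for ground truth and prediction (the arrays below are hypothetical placeholders, not the paper's data).

```python
import numpy as np
import matplotlib.pyplot as plt

rg = np.random.rand(1024, 256)                        # placeholder radargram
x = np.arange(rg.shape[1])
gt_surf, gt_bot = np.full(256, 120), np.full(256, 780)  # toy boundary rows
pred_surf, pred_bot = gt_surf + 3, gt_bot - 5

plt.imshow(rg, cmap="gray", aspect="auto")
plt.plot(x, gt_surf, label="surface (GT)")
plt.plot(x, gt_bot, label="bottom (GT)")
plt.plot(x, pred_surf, "--", label="surface (pred)")  # dashed predictions
plt.plot(x, pred_bot, "--", label="bottom (pred)")
plt.legend()
plt.show()
```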
Overall, the paper is nicely written. The authors have also shared the dataset and software publicly, which significantly enhances the reproducibility of the study and trust in the results presented.