Towards ice thickness inversion: an evaluation of global DEMs by ICESat-2 in the glacierized Tibetan Plateau

Accurate estimates of regional ice thickness, which are generally produced by ice-thickness inversion models, are 10 crucial for assessments of available freshwater resources and sea level rise. Digital elevation model (DEM) derived surface topography of glaciers is a primary data source for such models. However, the scarce in-situ measurements of glacier surface elevation limit the evaluation of DEM uncertainty, and hence its influence on ice-thickness modelling over the glacierized area of the Tibetan Plateau (TP). Here, we examine the performance over the glacierized TP of six widely used and mainly globalscale DEMs: AW3D30 (30 m), SRTM-GL1 (30 m), NASADEM (30 m), TanDEM-X (90 m), SRTM v4.1 (90 m) and MERIT 15 (90 m) by using ICESat-2 laser altimetry data while considering the effects of glacier dynamics, terrain, and DEM misregistration. The results reveal NASADEM as the best performer, with a small mean error (ME) of –1.0 and a root mean squared error (RMSE) of 12.6 m. A systematic vertical offset existed in AW3D30 (–35.3 ME and 34.9 m RMSE), although it had a similar relative accuracy to NASADEM (~13 m STD). TanDEM-X also performs well (–0.1 ME and 15.1 m RMSE), but suffers from serious errors and outliers on steep slopes. SRTM-based DEMs (SRTM-GL1, SRTM v4.1, and MERIT) (all 20 ~36 m RMSE) had an inferior performance to NASADEM. However, their errors were reduced in the ablation zone when glacier variations were excluded. Errors in the six DEMs increased from the south-facing to the north-facing aspect and become larger with increasing slope. Misregistration of DEMs relative to ICESat-2 footprint in most glacier areas is small (less than one pixel). An intercomparison of four ice-thickness models: GlabTop2, Open Global Glacier Model (OGGM), Huss-Farinotti (HF), Ice Thickness Inversion Based on Velocity (ITIBOV), show that GlabTop2 is sensitive to the accuracy of both elevation 25 and slope, while OGGM and HF are less sensitive to DEM quality, and ITIBOV is the most sensitive to slope accuracy. Considering the inconsistency of DEMs acquisition dates, NASADEM would be a best choice for ice-thickness estimates over the TP, followed by AW3D30, and TanDEM-X (if steep and high elevation terrain can be avoided). Our assessment figures out the performances of mainly global DEMs over the glacierized TP. This study not only avails the glacier thickness estimation with ice thickness inversion models, but also offered references for other cryosphere studies using DEM. 30 https://doi.org/10.5194/tc-2021-197 Preprint. Discussion started: 13 July 2021 c © Author(s) 2021. CC BY 4.0 License.


Introduction
The Tibetan Plateau (TP), which includes the Pamir, Hindu Kush, Karakoram, Himalaya, and Tibet regions, covers an area of ~3 million km2 and has a mean elevation of more than 4000 m a.s.l. (Fig. 1). It accounts for more than 82% of the Earth's land surface area above 4000 m a.s.l. (Fielding et al. 1994), and is often referred to as the Third Pole of Earth or the Asian Water Tower (Yao et al. 2012) due to its high elevation and abundant water resources in the form of glaciers, snow, permafrost, lakes, 35 and rivers. The TP has a glacierized area of ~8.3×104 km2 (RGI Consortium, 2017) with an ice volume of ~6.2×103 km3 (Farinotti et al. 2019), mainly distributed in the Karakoram and Himalaya regions.
Ice thickness is a crucial parameter for assessing the contribution of glaciers to global sea level rise (Kraaijenbrink et al. 2017), quantifying regional water availability (Huss and Hock 2018;Immerzeel et al. 2020), and evaluating cryosphere-related hazards (Linsbauer et al. 2016;Zheng et al. 2021). In the TP, owing to the lack of in-situ ice thickness measurements (Welty 40 et al. 2020), regional glacier thickness is mainly estimated by ice-thickness inversion models (ITIMs) using open access digital elevation models (DEMs) (Farinotti et al. 2009;Farinotti et al. 2019;Frey et al. 2014). The DEM is a fundamental part of most regional ITIMs (Farinotti et al. 2017), and is often used to determine center flow lines , shear stress (Frey et al. 2014;Wu et al. 2020), apparent mass balance (Farinotti et al. 2009), and for ice-thickness interpolation (Huss and Farinotti 2012). In addition to its use in ITIMs, the DEM has been an essential model input for a wide range of TP glaciology 45 studies, such as glacier inventory (Bhambri et al. 2011;Ke et al. 2016;Mölg et al. 2018), glacier mass change (Brun et al. 2017;Shean et al. 2020;Zhou et al. 2018), glacier related disasters (Allen et al. 2019;Kä ä b et al. 2018;) and projections of glacier or glacial lake evolution (Kaser et al. 2010;Kraaijenbrink et al. 2017;Zheng et al. 2021).
The uncertainty in the DEMs can lead to different ITIM outcomes Fujita et al. 2017;Furian et al. 2021;Kä ä b 2005), especially for those ITIMs in which the DEM is a crucial input. For example, the sensitivity of the glacier bed 50 topography (GlabTop) model to slope increases for shallower slopes , and slope overestimated by low contrast of snow cover can often lead to geometric distortion and missing data (Reuter et al. 2007;Takaku et al. 2020).
Estimates of the accuracy of DEMs in different terrains and physiognomy, and for different vegetation coverage and land use 65 have been conducted outside the TP using Global Navigation Satellite Systems (GNSS) measurements or high-resolution DEMs (Gonzá lez-Moradas and Viveen 2020; Grohmann 2018;Hawker et al. 2019;Uuemaa et al. 2020). The performance of specific DEMs varied in these studies, indicating that the local terrain and land cover influenced the DEM accuracy. In the TP, glaciers are distributed across different climatic zones and have a wide range of elevations with rugged and complicated terrain (Fielding et al. 1994; Thompson et al. 2018). GNSS measurements are not accessible for most glaciers, and public access high-70 resolution DEM is also a limitation due to its long temporal coverage (Shean 2017). The assessment of DEM accuracy in specific regions with limited GNSS measurements and high-resolution DEM is not enough to determine the performance of global DEMs across the whole glacierized TP. Liu et al. (2019) evaluated the performance of seven public freely-accessed DEMs over the TP with sparse ICESat altimetry data and suggested that AW3D30 has a high degree of accuracy. However, ICESat data with a footprint of 70 m (larger than 75 the resolution of their estimated DEMs) could result in intra-pixel errors in steep slopes (Uuemaa et al. 2020). Besides, glacial regions were not considered in their studies, due to the variations of glaciers over time. Misregistration among DEMs, which may lead to evaluation bias (Han et al. 2021;Hugonnet et al. 2021;Van Niel et al. 2008), was also neglected. Bearing these issues in mind, and considering the limitations of optics sensors in rugged terrain and the glacier accumulation area , it is clear that a further assessment of the performance of AW3D30 is required. Recently, TanDEM-X (released in 80 2017) and NASADEM (released in 2020) have been reported to have large improvements in accuracy relative to previous DEM products for various land-cover types (Wessel et al. 2018), floodplain sites (Hawker et al. 2019), slightly undulating terrain (Altunel 2019), and mountain environments (Gdulová et al. 2020). Nonetheless, their performance over the rugged and glacierized TP remains unclear.
The purpose of this study is to evaluate the optimal DEM to use for regional ice thickness estimation over the TP. We first 85 evaluated the performance of six widely used DEMs: AW3D30, SRTM-GL1, NASADEM, TanDEM-X, SRTM v4.1, and MERIT which are derived from different sensors and have different resolutions, against ICESat-2 data which has been proven to have a high vertical accuracy and resolution. The elevation differences between these DEMs and the ICESat-2 are systematically analyzed with regard to aspect, slope, elevation, and glacier zones. The influence on the accuracy assessment of glacier elevation changes, terrain and misregistration among DEMs is then quantified. Finally, we compare the performance 90 of ice thickness estimates derived using the six DEMs against in-situ measurements of ice thickness using Ground Penetrating Radar (GPR). The influence of DEM uncertainties on the model outcomes is also analyzed.

95
The numbered labels refer to glaciers used as examples in Fig.12. b) Location of Ground Penetrating Radar (GPR) profiles over the Chhota Shigri Glacier which is used as an example. c) Relative location of six beams when Advanced Topographic Laser Altimeter System (ATLAS) has backward orientation. Distance between RGTs is 28.8 km. d) Percentage of ICESat-2 data in different months from October 2018 to November 2020. The boundary of the TP is derived from SRTM above 2500 m a.s.l (Zhang et al. 2013).

ICESat-2 elevation data referenced
ICESat-2, a follow-on mission to the Ice, Cloud, and land Elevation Satellite (ICESat), was launched on 15 September 2018, with the goal of acquiring Earth's geolocated surface elevation that referenced to the WGS 84 ellipsoid at the photon level.
ICESat-2 ATLAS (Advanced Topographic Laser Altimeter System) emits a pulse every 0.7 m along the track covering a horizontal circular area with ~17 m diameter and 0.5 m in vertical extent. We used the ICESat-2 Level-3A land-ice ATL06 105 product. ATL06 heights are median-based heights derived from a linear-fit model over each segment corrected for first-photon bias. The segment has a length of 40 m centered on reference points at 20-m intervals along the track. The ATL06 product has better than 5 cm height accuracy and better than 13 cm surface measurement precision in the Antarctic  product also contains land background points. The RGI6.0 glacier inventory (RGI Consortium, 2017) was used to extract points falling on the glaciers (Fig. 2). 110 ICESat-2 ATL06 data covering the TP from October 2018 to November 2020 was downloaded from https://earthdata.nasa.gov/ (Fig. 1). There are 2436 files containing about 100 GB of data in total. The fields: Location (latitude, longitude), surface elevation (h_li), elevation uncertainty (h_li_sigma) and quality (atl06_quality_summary) were used. By combining the quality field (atl06_quality_summary=0)  with the glacier inventory, a total of 3.5 million points out of 0.16 billion records over the TP were selected (Fig. 1). The slope, aspect and elevation value of the cell center of 115 the DEMs were extracted for the ICESat-2 footprints. thickness inversion models. The wi1, wi2, wi3 and wi4 denote the weight of each modeled ice thickness, i from 1 to 6 are six different DEMs, 120 and the number 1−4 are the four ice thickness inversion models.

DEMs evaluated (AW3D30, TanDEM-X, NASADEM, SRTM)
Six global-scale DEMs were selected for evaluating their influences on ITIMs, based on popularity, data source, resolution and sensor type (optics or SAR) (Table 1) 2) TanDEM-X 90 m DEM (hereafter TanDEM-X) is a product derived from the first bistatic X band SAR mission of the world 130 which took place from 2014 to 2016 (Bachmann et al. 2021). It is a pixel-reduced product of the global TanDEM-X DEM with a pixel spacing of 0.4 arcseconds (12 m). The official reported absolute vertical and horizontal accuracy is better than 10 m at the 90% confidence level. It is noted that the current release is a non-edited version: areas with outliers, noise and voids remain. The original data was collected during different seasons and years, and the influence of ablation and accumulation of glaciers should also be noted. Data was acquired from https://download.geoservice.dlr.de/TDM90/. 135 3) NASADEM is a new product released in 2020, which is derived by reprocessing the original SRTM signal data using updated interferometric unwrapping algorithms and auxiliary data, such as ICESat, to reduce voids and improve vertical accuracy (Crippen et al. 2016). Remnant voids are filled mainly by Global Digital Elevation Model (GDEM) v3 data.
This data was downloaded from https://search.earthdata.nasa.gov/. The first open-access ice-thickness database of global glaciers also adopted SRTM-GL1 as its DEM source (Farinotti et al. 2019). Voids were primarily filled by ASTER GDEM2. SRTM v4.1 and MERIT were selected to compare with TanDEM-X, and simultaneously estimate the influence of DEM resolution on ITIMs. SRTM v4.1, with a spatial resolution of 90 m, is produced by the method proposed by Reuter et al. (2007), including merging tiles, filling small holes iteratively and interpolating across the holes using a range of methods, according to the size of hole, and the land 145 type surrounding it (https://cgiarcsi.community/data/srtm-90m-digital-elevation-database-v4-1/). SRTM v4.1 was also used to compare against the performance of SRTM-GL1 to estimate the influence of resolution. MERIT is also widely used with a spatial resolution of 90 m. It was developed by removing absolute bias, stripe noise, speckle noise, and tree height bias from the existing spaceborne DEMs (SRTM3 v2.1 and AW3D30 v1) using multiple satellite datasets and filtering techniques (Yamazaki et al. 2017). Its accuracy was significantly improved, especially in flat regions (Yamazaki 150 et al. 2017). The overall accuracy is similar to TanDEM-X in floodplain sites (Hawker et al. 2019), but lower in short vegetation. The dataset was downloaded from http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_DEM/.

Ice thickness inversion models
Tiles of six DEMs (AW3D30, TanDEM-X, NASADEM, SRTM-GL1, SRTM v4.1, and MERIT) were used to form a mosaic of terrain data covering the whole TP. Four ice-thickness inversion models (GlabTop2, HF, OGGM, ITBOV) were used. The 160 Chhota Shigri Glacier located in western Himalaya for which the GPR data were available ( Fig.1) was selected as an example to evaluate the influence of DEM uncertainty on the ITIMs. Full details of the ITIMs are given below: GlabTop (Glacier bed topography) is based on the theory that glacier thickness is mainly determined by the slope of the terrain Linsbauer et al. 2009;Paul and Linsbauer 2012). It is assumed that the glacier is an ideal plastic fluid, with bottom slip being ignored. Based on the empirical relationship between mean shear stress along the centerlines and the 165 range of glacier elevation (Haeberli and Hoelzle 1995) (Eq. 1), the actual basal shear stress τ can be determined.
where ΔH is the elevation range of glacier. The ice thickness h can then be determined from Eq. (2) sin h fg where f is the shape factor, ρ is glacier density (850±6 0 kg/m 3 ) (Huss 2013), g is the acceleration due to gravity (9.8 m/s 2 ) and α is the slope. Glabtop2 is an automated method for calculating ice thickness, similar to GlabTop, but avoiding digitizing the branch lines. For details refer to Frey et al. (2014). 170 HF (Huss-Fainotti) model is based on the mass balance principle which relates the surface mass balance of the glacier (b) to the ice flux and variation in the glacier thickness. Given the ice flux, ice thickness can be calculated according to Glen's ice flow law (Farinotti et al. 2009a;Huss and Farinotti 2012).
where h is the mean elevation of band thickness, q is the ice flux, fsl is the basal slip correction factor, n=3 is the exponent of flow law, ρ is glacier density (850±6 0 kg/m 3 ) (Huss 2013), g is the acceleration due to gravity (9.8 m/s 2 ), f is the valley 175 shape factor (0.8) (Cuffey and Paterson 2010) and A is the Glen flow rate factor (3.24×10 24 Pa -3 s -1 ) (Cuffey and Paterson 2010;Gantayat et al. 2014).

This method defines a new variable
ice flux q can be obtained by estimating b , which is determined from experience (Huss and Farinotti 2012). Ice thickness in each elevation band can then be determined by substituting into Equation (3). Finally, h is extrapolated, in combination with the slope, to obtain the distributed ice thickness, according to the parameters in Huss and Farinotti (2012).
The Open Global Glacier Model (OGGM) is based on the same concept as HF, but has two main differences . Firstly, the method described in Kienholz et al. (2014) is used to automatically obtain the middle streamlines and 185 watershed division. Secondly, the apparent mass balance data are reconstructed from the local climatic dataset from variables such as precipitation and temperature.
The Ice Thickness Inversion Based On Velocity (ITIBOV) model obtains the ice thickness by combining the surface velocity field with the Glen ice flow law (Gantayat et al. 2014;Glen 1955;McNabb et al. 2012 where h is ice thickness, us is glacier surface velocity, and k is the contribution ratio of basal slip velocity relative to the us. 190 We used the ITSLIVE dataset (Gardner 2019 ) as the us input. We assumed that basal slip only occurred during the warm seasons, and k was calculated by dividing the annual glacier velocity by winter glacier velocity (Wu et al. 2020). Data from the Global Land Ice Velocity Extraction from Landsat 8 (GoLIVE) dataset with a date separation length of less than 96 days are used to estimate the monthly velocity , allowing the winter velocity (December, January and February) and annual mean velocity to be calculated. Basal factor k was calculated as 0.80 (Fig. S1). Some shared 195 parameters, such as creep factor, shape factor and basal creep factor are the same in all four models.
It is possible that an ensemble of the output from different models can improve the modeled thickness (Farinotti et al. 2017;Farinotti et al. 2021). Therefore, after calculating the ice thickness from four models using different DEMs, we calculated an ensemble ice thickness using the same DEM but with different models. The weighting given to each model is iteratively calculated to achieve a minimal mean absolute error (Fig. 2). 200

Accuracy assessment
The error in the DEMs is considered to be the difference between the DEM elevation and the ICESat-2 measurement. To remove the influence of outliers, elevation differences outside four standard deviations were removed. Mean error (ME), mean absolute error (MAE), median error, root mean square error (RMSE), standard deviation (STD), and normalized median absolute deviation (NMAD) were calculated for the error assessments. NMAD and ME were used to assess the disturbance 205 from extreme errors (Gdulová et al. 2020).  Figure 3 shows a comparison of elevation from the six DEMs with the ICESat-2 data. The four standard deviations (that is 4 std) filter on the differences between ICESat-2 and DEMs used to filter out extreme outliers excluded less than 1% of the data. 210

Accuracy of DEMs
Overall, non-irregular deviation existed among these DEMs and ICESat-2 elevation after filtering. The ICESat-2 vs DEMs values are distributed tightly around the fit line with a slope coefficient close to 1, with no obvious differences among the R2.
NASADEM and TanDEM-X performed the best in terms of intercept and fit RMSE, with very little difference to the ICESat-2 data. For the other four DEMs, there are obvious systematic shifts which can be inferred from the high R2 values, but high intercept values. 215 gives the fit results for data within 4 std of the mean. 'Outlier' denotes the proportion of outliers relative to the total records. 'R2', 'RMSE', and 'Intercept' are fit results when the slope coefficient is set to 1. Elevation range was cut to 3500−6500 m, the range in which most 220 elevations values are located, to show clearly the effect of using different multiples of the std from the mean.
The difference statistics for the six DEMs are presented in Figure 4. Statistically, Median and ME differed little, and extreme values did not influence the ME much after the 4 std filter was applied. STD was slightly larger than NMAD, especially for TanDEM  The spatial distribution of the ME and STD are shown in Figure 5. AW3D30, SRTMGL-1, SRTM v4.1 and MERIT all have large negative ME over the TP, while the ME of NASADEM and TanDEM-X are mostly within the range -10 to 10 m. For these two DEMs, the ME in the Himalaya is more negative than that in southeast Tibet, and it is slightly positive in western 240 Kunlun and the Karakoram mountains. It is worth noting that in the Himalaya and southeast Tibet, the ME of NASADEM is more negative than that of TanDEM-X. Overall, STD along the Hindu Kush-Himalaya and southeast Tibet was larger than in other regions. Thereinto, STD in southeast Tibet was relatively larger (>12 m). Specifically, the STD of AW3D30 and NASADEM was minimum and spatially relevant. TanDEM-X performs well in overall statistics (Fig. 4b), but it is not stable.
Spatially, it performed worse than SRTM-GL1 in terms of STD (Fig.5b). The STD and ME of SRTM v4.1 and MERIT are 245 almost the same in space (Fig.5b), corresponding to their similar overall STD (both ~18 m) and ME (both ~32 m) values ( Fig.4b).

Differences between DEMs and ICESat-2 in aspect, slope and elevation
The influence of terrain factors on the differences between the DEM and ICEsat-2 elevations for the six DEMs are presented in Figure 6. The influence of aspect is most apparent for SRTM-GL1, with a median value of about -25 m in the south aspect 255 which increased in magnitude gradually towards to the north aspect (-35 m). A similar pattern, but with a smaller amplitude (±5 m) is apparent for the NASADEM and TanDEM-X (Fig.6a).
For NASADEM and TanDEM-X, the differences plotted against slope are distributed around zero, with mean median values of about -1.6 and -1.0 m, respectively (Fig.6b). For the other DEMs, the differences with slope are mainly less than zero, with a median range of -30 to -48 m and a mean upper quartile of ~20 m. The median differences of the 30-m DEMs generally 260 increased along the slope. However, for the 90-m DEMs, the difference increased with slope at first, but then decreased on steep slopes. For all DEMs, the spreads of differences become larger as the slope becomes steeper. This increase is most obvious for TanDEM-X and SRTM v4.1, with rates of 0.71 m/degree (r=0.96, p<0.01) and 0.60 m/degree (r=0.83, p<0.01).
On slopes of less than 20 o , TanDEM-X has the best quality with a mean median value of 0.2±0.39 m and 5.8±2 .2 m, but increased disparity on steeper slopes. Overall, relative to the other DEMs, NASADEM behaves best against slope in terms of 265 spread and median value. MERIT shows a slight advantage over SRTM v4.1 with a reduced spread for steep slopes.
The differences for all DEMs generally increased with elevation, with fluctuations at very high elevations (Fig.6c). For NASADEM and TanDEM-X, the differences are negative at lower elevations and slightly positive at higher elevations; NASADEM varied from -70 to 20 m over the full elevation range, and from -10 to 10 m over the range 4500−6500 m, where measurements are concentrated; TanDEM-X varied from -15 to 20 m over the full elevation range, and from around -5 to 5 270 m between 4500 and 6500 m. For the other four DEMs, the differences all remained below zero for the full range of elevations.

Differences between DEMs and ICESat-2 in different glacier zones
Differences in different glacier zones were also estimated and are shown in Figure 7a-d. We divided it into four sub-zones using the maximal, median and minimum elevation from the RGI glacier inventory (Fig. 7e). Here we consider Zone 1 to be 280 the ablation area and Zone 4 the accumulation area. Zone 2 and Zone 3 are transition areas. Crests of the probability distribution of differences located in the negative axis range in Fig. 7a move to the right in Fig. 7b−d. Correspondingly, ME, MAE and RMSE all decrease when moving from Zone 1 (ablation area) to Zone 2 (transition area) ( Fig.7 and Table S1). Spatially, areas in the glacier terminus are subject to more melting (Brun et al. 2017)   In terms of STD, NASADEM performed best in all glacier zones, except for Zone 1, with values ranging from 9.2 to 16.5 m (Table S1). AW3D30 had the best performance of all DEMs in Zone 1, and was the next best performer overall, with a STD varying from 12.0 to 20.7 m. The STD of TanDEM-X was better than that of SRTM-GL1 in Zone 1 and Zone 2, but worse in Zone 3 and Zone 4 where it suffered from discrepancies. MERIT showed slight improvements in STD at ~2% relative to 300 SRTM v4.1.

Comparisons of ice thickness modelled by DEMs
The models are not adjusted independently according to the difference between the output and GPR results. Therefore, the results are not indicators of the performance of the models but rather references for examining the influence of different DEMs on specific ITIMs. The effect of the DEMs on the model outcomes are presented in Figure 8 and are quite obvious. Mean ice 305 thickness differs, according to the DEM used, by up to 88%, 4%, 47% and 7% for GlabTop2, HF, ITIBOV and OGGM, respectively. The deepest ice thickness differs by up to 55%, 25%, 16% and 18% for GlabTop2, HF, ITIBOV and OGGM, respectively.
The mean ice thicknesses from GlabTop2 and ITIBOV using the 90-m DEMs (that is TanDEM-X, SRTM v4.1 and MERIT) are ~30 m less than those obtained from using 30-m DEMs (that is AW3D, SRTM-GL1 and NASADEM) (Fig. 8). GlabTop2 310 and OGGM using AW3D30, and HF and ITIBOV using NASADEM output the maximal mean thickness. GlabTop2, ITIBOV and OGGM using TanDEM-X, and HF using SRTM-GL1 output the minimum mean thickness.
The influence of different DEMs on ITIMs can also be identified when making a comparison with the GPR results ( Fig. 8 and Table 2). If median error is used as the criteria, GlabTop2 using NASADEM, HF using AW3D30, ITIBOV using NASADEM and OGGM using NASADEM achieved the relatively best simulation (Fig. 9). If RMSE was used, GlabTop2 using 315 NASADEM, HF using SRTM-GL1, ITIBOV using NASADEM and OGGM using TanDEM-X performed best (Table 2).
In different glacier zones, each DEM-model combination has its merits and weakness ( Table 2). Totals of 9, 3, 4, 3 and 1 output achieved the minimum RMSE in profiles by different ITIMs using AW3D30, SRTM-GL1, NASADEM, TanDEM-X and MERIT, respectively. Overall, NASADEM, as input to GlabTop2 and ITBOV, performs best, with RMSE values of 75.4 and 61.3 m, respectively; SRTM-GL1 performed the best in HF with an RMSE of 50 m; TanDEM-X performed the best in 320 OGGM with an RMSE of 52.8 m ( Table 2).

325
Following the procedure of Farinotti et al. (2017), results from the four models are further composed to achieve the minimum MAE between the modelled and GPR thicknesses (Fig. 8e). The weights for each model in ten experiments are shown in S2. After composition, the mean thickness using different DEMs ranged from 77 (acquired based on TanDEM-X) to 91 m (acquired based on NASADEM). The thickness error of the results based on NASADEM is best (median value 2.3 m), followed 330 by TanDEM-X (median value -7.5 m). The minimum RMSE is for AW3D30 (45.9 m), followed by NASADEM with a RMSE of 47.7 m.

Factors related to the differences of DEMs: glacier elevation change, terrain
The identified extreme outliers (Fig. 3) are mostly located in the glacier terminus, high elevation and high slope regions ( Fig.   10a−b). Extreme glacier melt, such as in southeastern Tibet, and surges, as observed in the Karakoram, can also lead to dramatic https://doi.org/10.5194/tc-2021-197 Preprint. Discussion started: 13 July 2021 c Author(s) 2021. CC BY 4.0 License. elevation changes, resulting in large difference (Fig. 10c). This glacier elevation change effect is also reflected in the spatial 360 distribution of difference (Fig. 5), elevation bins (Fig. 6c) and glacier zones (Fig. 7). The differences at lower elevations are negative, and generally increase with elevation, consistent with the fact that glaciers melt at lower elevations and accumulate at higher elevations (Cuffey and Paterson 2010). The differences in the NASADEM and TanDEM-X data with elevation comply with these features. NASADEM was acquired in 2000 and TanDEM-X was acquired in 2010−2015, and the value of NASADEM is more negative than TanDEM-X, as would be expected. By making a comparison between SRTM-GL1 and 365 NASADEM from the same original data, we conclude that the negative differences of the other four DEMs through the elevation bins may be related to absolute vertical shift. MERIT shows less improvement over SRTM v4.1 in glacierized terrain than in the flat regions in terms of both absolute and relative accuracy (Yamazaki et al. 2017). The relatively more negative and larger values of ME and STD along the Hindu Kush-Himalaya, southern Tibet (  375 Table 3 Comparisons of differences between four SRTM based DEMs and ICESat-2 elevation over glacier zones before and after adjustment. ICESat-2 data acquired in February are used to calcuate the differences. Glacier zones are defined according to Fig. 8e.  and Zone 4. This may be related to the slight elevation change in the accumulation region (Brun et al. 2017;Shean et al. 2020), and high uncertainty due to steeper slopes and higher elevations (Fig. 6b−c)   Figure 11. Elevation profiles of the original and adjusted six DEMs and ICESat-2 along the ICESat-2 tracks on four selected glaciers across the study region. Locations of glaciers in a−d are labelled 1−4 in Fig. 1, respectively.

390
ICESat-2 data covering the period October 2018 to October 2020 repeat every 91 days. Therefore, variations of ICESat-2 elevation data caused by glacier fluctuations will have influenced the error statistics (Fig. 12a). Precipitation on the TP mainly occurs in June−August (Maussion et al. 2014). Hence, after precipitation accumulation on glaciers in spring and summer, the elevation have increased, and the mean difference decreased. With little accumulation, the glacier melt and sublimate in autumn 395 and winter , meaning decreased elevation, and an increase in the mean difference. However, the magnitude of these changes is much smaller, at a level of less than 3 m (Fig. 12), compared with the large ME, MAE and RMSE magnitude of most of the DEMs (with the exceptions of TanDEM-X and NASADEM) (Fig. 4). When taking all points from different seasons into consideration, the seasonal effects could also partly cancel each other out. If only the ICESat-2 data from February was used (Table 3), NASADEM and TanDEM-X still perform better than others. Therefore, we conclude that the fluctuations 400 of ICESat-2 data have no influence on the performance of the DEMs. Additionally, the atmospheric forward scattering and https://doi.org/10.5194/tc-2021-197 Preprint. Discussion started: 13 July 2021 c Author(s) 2021. CC BY 4.0 License.
subsurface scattering of ICESat-2 photons, which are not quantified in ATL06, may also lead to a biased estimate of elevation . The elevation differences have a strong dependence on terrain factors (Fig. 7a−b). The differences with aspect show contrasting features to the distribution of measurements in different aspects (Fig. 7d). The largest errors are concentrated in the north aspect, as was also reported in previous studies (Gorokhovich and Voustianiouk 2006;Shortridge and Messina 2011), in which they were attributed to the orientation of the sensor (Gdulová et al. 2020;Shortridge and Messina 2011). However, here, the data from different sensors all show this aspect dependence, and we infer that it may be related to the accordant 415 distribution of data in different slopes with aspect. There are many more measurements with steeper slopes in the north aspect, and less measurements with flatter slopes in the south aspect (Fig. 13). The error and spread become larger with steeper slopes (Fig. 7b), as also reported by Liu et al. (2019) and Uuemaa et al. (2020), maybe due to geometric deformation or shadow (Liu et al. 2019) . Therefore, the error variation with aspect tends to be related to steeper slopes (Gdulová et al. 2020;Gorokhovich and Voustianiouk 2006). 420 Almost half the points in the 55 o −90 o slope region are identified as extreme outliers (Fig. 10a). Differences also show large discrepancies for all DEMs in the steeply sloping regions where voids and large errors are frequent (Falorni 2005). Steep slopes 425 combined with low resolution led to variations in the spread of differences in Fig. 6b. Spreads of differences were larger on steep slopes for the 90-m DEMs than those of the 30-m DEMs. Intra-pixel variation aggravates this effect in steeply sloping regions (Uuemaa et al. 2020), lower resolution or reduced pixel DEMs smooth the terrain details and lead to inaccurate elevation compared with the 20-m footprint of ICESat-2 points. The spread and the number of outliers gradually increased with the slope, especially for the TanDEM-X case (Fig. 7b). Using the terrain in the rugged Shishapangma region (Fig. 12b) as an example, we can see that the elevation from TanDEM-X suffers from serious errors along the ridge at high elevations with the output almost blurred. Even so, TanDEM-X still has overall accuracy advantages over SRTM v4.1 and MERIT, indicating the high quality of TanDEM-X in low relief regions (Fig. 7b).

Influence of misregistration
Misregistration among DEMs, which has been ignored in previous research (Gonzá lez-Moradas and Viveen 2020; Liu et al. 435 2019), is an important error source when looking at DEM differences (Hugonnet et al. 2021;Van Niel et al. 2008). This study intends to give direct insights about the quality of uncorrected DEM products, so the misregistration problem was not tackled before the evaluations were carried out. However, the influence of misregistration was evaluated. According to the sinusoidal relationship between aspect and error differences between two DEMs (Van Niel et al. 2008), using the co-registration method in Nuth and Kä ä b (2011) and ICESat-2 points outside the glaciers, offset pixels at the 1 o ×1 o grid scale were estimated (Fig.  440   14). Misregistration was found to be worst in eastern Kunlun Mountains and Qilian Mountains where only a small number of glaciers developed (Fig. 1) and was slight in the south of the TP. All the DEMs mismatch by less than one pixel, except for AW3D30, which had the worst misregistration in the north and inner TP (Fig. 14). Considering that only the cell center value was used, sub-pixel shift may have little influence. A comparison of errors before and after correcting sub-pixel misregistration confirms this conclusion (Fig. 15b). The probability distribution of difference before and after co-registration was almost the 450 same, as were ME, MAE, STD and RMSE (Table S2). However, examples of pixel misregistration strongly affected the probability distribution, except for AW3D30, SRTM-GL1, SRTM v4.1 and MERIT (Fig. 15a); STD changed by 1−3 m, while ME, MAE and RMSE changed by less than 1.2 m ( Table S2). The probability distribution symmetry varies. Hence, we supposed that the symmetry variations of difference compensate the effect of offset. Since glaciers are distributed among the mountains with different aspects, if we shift the DEM in the x-and y-directions, the increased differences would be 455 compensated for by decreased elevation differences (Fig. 15c). Therefore, the large errors remaining in these four DEMs should be due to systematic deviations (Han et al. 2021), rather than the influence of misregistration.

Influence of DEMs on ice thickness estimated by ITIMs
Different DEMs resulted in differences in ITIM ice thickness with a range of 32−65% (Fig. 10). Generally, the outcome with GlabTop2 and ITBOV using 30-m DEMs is better than with the 90-m DEMs, with error differences of 52% and 45%, 465 respectively. This is different from the conclusion that improving only DEM resolution, without calibrating the shape factor, did not benefit the model result in GlabTop2 (Ramsankaran et al. 2018). But when we used a calibrated shape factor of 0.66 as suggested by Ramsankaran et al. (2018), the model results from the 30-m DEMs were still better than those from the 90-m DEMs (Fig. 16a). With GlabTop2, elevation data was used to determine not only the slope, but also the shear stress (Frey et al. 2014). A +5 o error in slope caused more than a -34.1 % difference in the output for slopes of less than 20 o . Additionally, 470 relative elevation errors had an enormous impact (Fig. 16 c). For glaciers with an elevation range of less than 400 m, which accounted for 41% of the total number and 5% of the total area, +10, +30, and +50 m errors in elevation range caused more than +2%, +6% and +10% differences in output. Such errors in elevation range had greater influence (Fig. 19b), especially for small glaciers, which have smaller elevation ranges. These two errors propagate and lead to a much larger overall error (Table   3). Thus, GlatbTop2 with NASADEM and AW3D30 as the best quantity input achieved the best RMSE in comparison with 475 GPR measurements. In contrast to the other ITIMs, the ITIBOV model directly estimated the ice thickness at each grid cell according to cell velocity information without interpolation. The slope sensitivity of ITIBOV is higher than that of GlabTop2, with a +5 o error in slope causing more than a -71.4% difference in the output for slopes of less than 20 o (Fig. 16b). The flatter the slope, the more sensitive ITIBOV is to the slope error (Fig. 16b). Although along-and across-track slope data are provided in the ICESat-2 ATL06 product, they are incompatible with the slope estimated from DEMs due to their different data formats 480 and algorithms used (Burrough and McDonell 1998;Smith et al. 2019). Moreover, the surface terrain of glaciers changes with time due to accumulation, melting and motion (Dehecq et al. 2018;Shean et al. 2020). Nevertheless, the accuracy of the DEMs estimated here could also provide some information about slope accuracy. The better relative accuracy of NASADEM and AW3D30 means that ITIBOV with these DEMs as input led to the relatively best outcomes (Table 2).
For HF and OGGM, the modelled results did not show large differences when 30-m DEMs were replaced with 90-m 485 resolution DEMs (Fig. 9): means that high spatial resolution improved the outcome little (Pelto et al. 2020). For the HF model, elevation data was used for convergence calculation of apparent mass balance and mean slope in elevation bins (Farinotti et al. 2009;Farinotti et al. 2019), whereas, for OGGM, it is used to extract flowlines, shear stress at flowlines and mass balance at an elevation . The relative accuracy of DEMs was more vital than absolute accuracy for these two models. Although NASADEM and TanDEM-X had the large advantage of absolute accuracy, the output of HF and OGGM 490 using these two DEMs did not have much advantage over that using the other DEMs (Fig. 9). The STD of RMSE values for HF and OGGM using six DEMs are 7.0 and 6.1 m, respectively ( Table 2). STD of mean ice thickness by HF and OGGM using six DEMs are 1.1 and 1.9 m ( Table 2). The HF and OGGM models are not very sensitive to the DEM absolute accuracy. The performance of AW3D30 in OGGM and SRTM-GL1 in HF is even slightly better than NASADEM in these two models ( 2). Specifically, the better performance of SRTM-GL1 should be attributed to the calculation of slope. Though the model has 495 high sensitivity to the slope (Fig. 16a), the mean slope in each elevation band was used, defined as a tangent of the width and elevation difference in the elevation bin (Huss and Farinotti 2012). Figure 16. Sensitivity test of shape factor, slope and elevation on ice-thickness inversion models. (a) Difference between modelled thickness and GPR measurements when a calibrated shape factor of 0.66 was used in GlabTop2; (b) Percentage difference of modelled ice thickness 500 from GlabTop2, HF, ITIBOV and OGGM when there is +5 o slope error; (c) Percentage difference of modelled ice thickness from GlabTop2 when the elevation range error is +10, +30 and +50 m for different elevation ranges.
The different models have various levels of robustness to the quality of the input DEMs. When the models are comprehensively utilized, the influence of the input DEM manifests itself ( Fig. 9 and Table 2). The RMSE of ITIMs from 30-505 m DEMs was 16.8% less than that from 90-m DEMs. Models using AW3D30 and NASADEM, equipped with higher resolution and better relative accuracy, achieved the best outcomes. However, it should be noted that the large misregistration in AW3D30 in the northern TP may lead to the mismatch between terrain and glacier outlines. This will lead to an https://doi.org/10.5194/tc-2021-197 Preprint. Discussion started: 13 July 2021 c Author(s) 2021. CC BY 4.0 License. overestimation of slope and a consequential underestimation of ice thickness (Huss and Farinotti 2012), due to the mountain terrain being relatively steeper than the glaciers. 510

Conclusions
In the present study, six DEMs (that is AW3D30, SRTM-GL1, NASADEM, TanDEM-X, SRTM-GL1 and MERIT) from different sensors with different spatial resolutions were evaluated using ICESat-2 data. The influence of glacier dynamics, terrain and misregistration on the DEM accuracy were analyzed. Out of the three 30-m DEMs, NASADEM was the best performer with a small ME of -1.0 and a RMSE of 12.6 m. Out of the three 90-m DEMs, TanDEM-X performed best with an 515 ME of -0.1 and a RMSE of 15.1 m. The quality of TanDEM-X was stable and unprecedented on shallow slopes, but suffered from serious problems on steep slopes, especially along the steep ridges. For AW3D30, a systematic vertical and horizontal offset existed on glacier terrain, however, it still has a similar relative accuracy to NASADEM. SRTM-based DEMs (that is SRTM-GL1, SRTM v4.1 and MERIT) (~36 m RMSE) were inferior to NASADEM, although, when glacier variations were excluded, all of their errors were reduced in the ablation zone. MERIT shows little improvement over SRTM v4.1 in glacierized 520 terrain. The influence of glacier elevation change on the elevation difference is larger for DEMs acquired earlier, at low elevations and in the ablation region. However, this does not influence the conclusion that NASADEM performed the best, followed by TanDEM-X. For all the DEMs, the errors increased from the south-aspect slope to north-aspect slope, controlled by the increasing error with slope. Misregistration errors in the glacier-rich region are mostly within one pixel, benefiting from the 20 m footprint of ICESat-2, relative to the 30-or 90-m resolution DEMs, and only have a small influence on the evaluation. 525 The effect of DEM accuracy on ice-thickness inversion models depends on the model properties. Generally, a higher resolution DEM was helpful for better model outcomes due to the intra-pixel influence. For the widely used GlabTop2 model, which is sensitive to the accuracy of elevation and slope, using NASADEM, with the highest absolute accuracy, as the input, facilitated the best outcome. Although the OGGM and HF models are less sensitive to the quality of DEM, the use of NASADEM or AW3D30, both with a high relative accuracy, was still favorable. Among the four ice-thickness inversion 530 models, ITIBOV was the most sensitive to slope accuracy. Ice-thickness inversion models using AW3D30 or NASADEM as input gave the best outcomes. These two DEMs also perform the best when four ice-thickness inversion results were aggregated by the minimum MAE optimization method to form an ensemble.
Considering the influence of inconsistency in data acquisition time on generating glacier terrain, we suggest that NASADEM is the best choice for ice-thickness inversion models over the TP. AW3D30 could be a good substitute if its systematic shift 535 was corrected. TanDEM-X is an appropriate alternative for glaciological research focusing on the flat glacier terminus, but it requires further improvement for use in steep terrain or for ice-thickness inversion.