A new automatic approach for extracting glacier centerlines

Glacier centerlines are crucial input for many glaciological applications. From the morphological perspective, we 10 proposed a new automatic method to derive glacier centerlines, which is based on the Euclidean allocation and the terrain characteristics of glacier surface. In the algorithm, all glaciers are logically classified as three types including simple glacier, simple compound glacier and complex glacier, with corresponding process ranges from simple to complex. The process for extracting centerlines of glaciers introduces auxiliary reference lines, and follows the setting of not passing through bare rock. The program of automatic extraction of glacier centerlines was implemented in Python and only required glacier boundary and 15 digital elevation model (DEM) as input. Application of this method to 48571 glaciers in the second Chinese glacier inventory automatically yielded the corresponding glacier centerlines with an average computing time of 20.96 s, a success rate of 100% and a comprehensive accuracy of 94.34%. A comparison of the longest length of glaciers to the corresponding glaciers in the Randolph Glacier Inventory v6.0 revealed that our results were more superior. Meanwhile, our final product provided more information about glacier length, such as the average length, the largest length, the lengths in the accumulation and ablation 20 regions of each glacier.

there are 168331 glaciers (including ice caps) in the world, with a total area of 726258 km 2 apart from ice sheets. Glaciers 25 move towards lower altitude by gravity, which is the most obvious distinction between glacier and other natural ice bodies.
The glacier flowlines are the motion trajectories of a glacier, and the main flowline is the longest flow trajectory of glacier ice.
Due to the differences in the speed and moving direction of any point at the surface or inside the glacier, the calculation of the main flowline of glaciers requires a coherent velocity field data, which is difficult to obtain on the global or regional scale (McNabb et al., 2017). Alternatively, some concepts such as the glacier axis and the glacier centerlines were proposed (Le Bris 30 and Paul, 2013;Machguth and Huss, 2014;Kienholz et al., 2014).
As an important model parameter, glacier centerline can be used to determine the change of glacier length (Leclercq et al., 2012a;Nuth et al., 2013), analyze the velocity field (Heid and Kääb, 2012;Melkonian et al., 2017), estimate the glacier ice volume (Li et al., 2012;Linsbauer et al., 2012), and develop one-dimensional glacier model (Oerlemans, 1997;Sugiyama et al., 35 2007). Meanwhile, the length of the longest glacier centerline is one of the key determinants of glacier geometry and an important parameter of glacier inventory (Leclercq et al., 2012b;Paul et al., 2009). The length and area of glacier can be also used to estimate the large-scale glacier ice volume (Zhang and Han, 2016;. The length change at the terminus of a glacier can directly reflect the state of motion, e.g., glacier recession, glacier advance or surging (Gao et al., 2019). Winsvold et al. (2014) analyzed the changes of glacier area and length in Norway, using glacier inventories derived from 40 Landsat TM/ETM+ images and digital topographic maps. Herla et al. (2017) explored the relationship between the geometry and length of glaciers in the Austrian Alps based on a third-order linear glacier length model. Leclercq et al. (2012) reconstructed annual averaged surface temperatures in the past 400 years on hemispherical and global scale from glacier length fluctuations (Leclercq et al., 2014). These studies indicated that both the extraction of contemporary glacier length and the reconstruction of historical glacier length require more accurate automatic extraction methods of glacier flowlines.

45
In order to obtain the length of glaciers, some automatic or semi-automatic methods were proposed in recent years. Schiefer et al. (2008) extracted the longest flow path on the ice surface based on a hydrological model, which was generally 10% to 15% larger than the glacier length. Le Bris et al. (2013) accomplished the automatic extraction of flow lines from the highest https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License. point to the terminus of a glacier based on the concept of glacier axis, with a verification accuracy of 85%. Unfortunately, the 50 branches of glacier centerlines have not been extracted and the length is not necessarily the maximum for huge or complex glaciers (Paul et al., 2009). Machguth et al. (2014) proposed an extraction method of glacier length based on the slope and width of glacier with a success rate of 95-98%, however the branches of glacier centerlines could not be extracted either. Kienholzs et al. (2014) applied the grid-least-cost route approach to the automatic extraction of glacier flow lines, having an automation degree of 87.8% with additional manual intervention. Yao et al. (2015) proposed the semi-automatic method of 55 extraction glacier centerlines based on Euclidean allocation theory, which required the expertise and experiences for composite valley glaciers and ice caps. The aims of this study are to design an algorithm to: (i) automatically generate centerlines for the main body of each glacier and its branches; (ii) automatically calculate the longest length, average length, the length of accumulation region, and the length of ablation region of each glacier, along with corresponding polylines; and (iii) improve the degree of automation as much as possible on the premise of ensuring the accuracy of glacier centerlines.

2 Input data and test region
The glacier dataset used in this study is the Second Chinese Glacier Inventory (SCGI) released by National Tibetan Plateau Data Center (http://westdc.westgis.ac.cn/data), which has been approved by some organizations (e.g., WGMS, GLIMS, NSIDC, etc.) and adopted in the Randolph Glacier Inventory (RGI) v6.0 (Guo et al., 2017). According to the SCGI (Fig.1), there were 48571 glaciers in China, with a total area of 51766.08 km 2 , accounting for 7.1% of the glacier area in the world except for the Antarctic and Greenland ice sheets . Due to the lack of automatic method to calculate glacier's length, there was no length property in the SCGI, and some subsequent studies haven't made great breakthroughs (Yang et al., 2016;Ji et al., 2017).
The SCGI was produced based on Landsat TM/ETM+ images and ASTER images in the period of 2004-2011 andSRTM v4.1 70 with a spatial resolution of 90 m . In this study, we selected SRTM1 DEM v3.0 (http://www2.jpl.nasa.gov/srtm, last accessed on March 2, 2013, with a spatial resolution of 30 m) (Farr et al., 2007) in consideration of its free access and higher data quality, which was used to identify division points on the glacier outlines, extract ridge lines in the coverage region https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License. of glaciers, and generate the glacier centerlines. Additionally, we extracted glacier data in China from the RGI v6.0 provided by GLIMS (http://www.glims.org/RGI/). There are 38053 glaciers matching the graphic position of the SCGI. The field of Lmax 75 of RGI v6.0 provides the length of the longest flowlines on the glacier surface, which was calculated with the algorithm proposed by Machguth et al. (2014). For verifying the validity and accuracy of glacier centerlines, we compared the extracted longest length of glaciers with the value of Lmax in the RGI v6.0.

3 Principles and algorithm of glacier centerline extraction
In order to implement the automatic extraction of glacier centerlines, we have designed a new set of algorithms. Relevant parameters and processing procedures are introduced as follows.

Model parameters
The code was written in Python and partially invoked the site package of arcpy. The calculation of the glacier centerlines relies 85 on two basic inputs: (i) glacier in the form of polygon with a unique value field and a projection coordinate system (unit: m), (ii) DEM data having the spatial resolution and acquisition time close to the images used for glacier inventory. We defined 11 adjustable parameters named Pi (i=1,…,11) (see Table 1), which were achieved by classifying glacier polygon through a set of reasonable rules. The purpose is to improve the degree of automation and the accuracy. Three key parameters are described as: -P3: the threshold of flow accumulation, to control the generation of auxiliary lines.

90
-P6: the step size of searching the local highest points, to control the extraction of extremely high points.
-P8: The grid cell size of Euclidean allocation, to improve the algorithm efficiency.
In the algorithm, the number of the local highest points is affected by the perimeter of the glacier (PG). We took the given area (A) and the perimeter (P) of the equilateral triangle corresponding to A as the grading threshold (Eq.1). According to the area 95 (AG) and the perimeter (PG) of each glacier, all glaciers were divided into five levels (Eq.2), which represented the five levels of glacier polygon with difference in PG. The built-in parameters were set according to the different levels (Table 1). P4, P5 and P9 were controlled in proportion to the side length of the equilateral triangle corresponding to P. The proportional coefficient was T (Eq.3). According to the actual situation of the repeated programing test, the empirical value of each parameter was given in Table 1.

110
In this paper, glaciers were divided into three categories: simple glacier (extremely high point: single, auxiliary line: no, the area of bare rock: no), simple compound glacier (extremely high point: several, auxiliary line: no, the area of bare rock: no), and complex glacier (extremely high point: several, auxiliary line: yes, the area of bare rock: yes). Following the principle from simple to complex, the algorithm was composed of six main steps: data preprocessing, extraction of auxiliary lines, identification of division points, reconstruction of feature lines, extraction of centerlines and the calculation of glacier length.

115
The flow chart of the algorithm is illustrated in Fig.2.
The automatic extraction of glacier centerlines obeys the following rules: (i) the elevation of the local highest points must be higher than the equilibrium line altitude (ELA), (ii) a glacier has only one exit, which is the lowest point of the outer boundary of the glacier outline (Gpo); (iii) the auxiliary line only acts on the accumulation region of glacier; (iv) the Gpo, auxiliary lines, 120 and bare rock region are also used to restrict the flow direction of the glacier centerlines. The data preprocessing includes four parts: (i) checking the input data, (ii) pre-processing the glacier outlines, (iii) fine-tuning the built-in parameters, and (iv) calculating the ELA of glaciers. First, the polyline of the outer boundary of the glacier (Gpl), the polygon of the Gpo, the boundary of the bare rock in glacier (Gbr) were obtained by splitting the glacier outlines in the 130 importing module. These temporary data would be used as the input parameters of other modules in subsequent process.
Secondly, the module exported the number of closed lines in glacier outlines, AG and PG, which were used to determine the number of bare rocks on the glacier surface, the type and level of glaciers. Thirdly, according to the parameter adjusting rules at the level of glaciers, 11 built-in parameters were fine-tuned. Finally, the median elevation (Zmin) of each glacier aided by its DEM was computed, which was then used to estimate the ELA of each glacier. The schematic diagram of processing glacier 135 outlines is shown in Fig.3.

Extraction of auxiliary lines
For making glacier centerlines more reasonable, we introduced the auxiliary lines that represent the internal ridgelines of The automatically extracted ridgelines were often disconnected, and it was necessary to remove independent existence or unreasonable ridgelines using the auxiliary data such as DEM, ELA and Gpo by ergodic algorithms. Firstly, the ridgelines of the glacier surface (As) were obtained by clipping the ridge lines using Gpo. The set of all possible starting points of auxiliary 155 https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License. lines was gained by intersecting As with Gpl. Then, the ridgeline clusters connected to each starting point were achieved and marked by traversing the point set. The number of auxiliary lines was initially determined. Lastly, the longest length of each auxiliary line was calculated by adopting the critical path algorithm. The final auxiliary lines (Al) were obtained by screening all auxiliary lines using the three parameters of P4, P5 and P11.

160
The division points include the lowest point (Pmin) and the local highest point (Pmax). The ordered point set (h) was obtained after converting Gpl from a polyline to a point set and extracting the elevation for the point set. The method for obtaining Pmin was relatively simple, as showed in Eq. (4).

=
(ℎ 1 , ℎ 2 , ⋯ , ℎ ) In comparison, the extraction of Pmax was more complicated. It was necessary to ensure the extraction of all possible branches 165 of the centerlines and avoid the redundancy of branches. The algorithm could be divided into four steps: (i) obtaining the local highest point set (M″) by filtering h (Eq.5, Eq.6) according to P6, (ii) removing the elements (Eq.7) at an altitude lower than ELA from M'', (iii) removing the elements (Eq.8) of adjacent distance less than P9 from M', and (vi) checking, deleting or adding some local highest points (Eq.9) using the auxiliary lines to ensure that there was at least one local highest point among adjacent auxiliary lines.

Calculation of glacier length
The polyline's code of the end of Ggcf was determined by Pmin after Ggcf being converted from multipart to singlepart and unified coded. Then all branches of glacier centerlines and glacier length were achieved using algorithm (Fig.7) The length of each branch of glacier centerlines was counted. The average length (Eq.17) of all branches was called the average 220 length of a glacier (GLmean). The longest length (Eq.18) of all branches was called the longest length of a glacier (GLmax). In addition, the part above ELA in GLmax was regarded as the accumulation region length (GLacc) of a glacier, and the part of GLmax with altitude lower than ELA was regarded as the ablation region length (GLabl) of a glacier. Finally, the corresponding vector data were generated and some attributes including the corresponding polyline code, glacier code, the value of glacier length were added.

Methods of quality analysis
Here, we used SCGI as the test data to run the designed program, including 48571 glaciers. The extraction results of some 230 typical examples of glaciers (from simple to complex) are presented in Fig.8. The accuracy of glacier centerlines was evaluated based on a random verification method in this study. All glaciers (total quantity: NG) corresponding to the samples were obtained and arranged in ascending order of the area. Specifically, 100 random integers were generated in the set of [0, NG).
Glaciers with corresponding serial number were exported as samples. After the visual inspection, the accuracy evaluation was conducted based on the following statistical analysis. 235 https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License.

Figure 8: The centerlines for some typical glaciers.
Firstly, 100 glaciers were randomly selected from the glacier dataset as samples to obtain a verification accuracy (R1) (Eq.19).
Secondly, each level of glaciers was separately taken as the total (NT), and 100 glaciers were randomly selected. There were 5 samples for 5 levels, which were used to calculate a verification accuracy (R2) (Eq.20) by taking the number proportion of 240 each glacier level as the weight. Then, 100 glaciers with the largest, middle and smallest areas were selected separately as samples. The verification accuracy (R3) (Eq.21) was derived using 1:2:1 as the allocation proportion of weight. Finally, the

Sample selection and assessment criteria
Visual inspection in combination with satellite images and topographic maps is the most direct evaluation method for extraction 250 results. Using 48571 glaciers in China as the test data, nine samples of 900 glaciers were selected for three verifications according to the evaluation method defined in section 4.1. The samples used for verification and relative information are given in Table 2. Considering the possible defaults of the input data, we set some standards of accuracy evaluation (

Statistics of different samples
According to the standards in Table 3, the selected samples were conducted with visual investigation. The results of nine samples were displayed in Fig.9. The statistical results showed that the accuracy of verification-2 was the highest (95.25%), 265 the accuracy of verification-3 was the second (94.76%) and the accuracy of verification-1 was 93%. The comprehensive accuracy of glacier centerlines was 94.34%, of which category-I and category-II accounted for 86.06% and 8.28%, respectively.
Meanwhile, we summarized the frequency of each type in each sample basing on 2nd-level categories. As seen in Fig.10, the problems of centerlines of small glaciers were mainly caused by the inaccurate selection of division points due to the insufficient accuracy of DEM (code: 22) and incorrect calculation results of some slope glaciers (code: 33). The problems of 270 centerlines of large glaciers were mainly concentrated in some types coded in 31 and 32, which needed to be repartitioned and recalculated. In addition, a few problems were found in samples: the upper outlines of glacier were across the ridgeline; a small number of glaciers were not correctly segmented; the altitude in glaciers' DEM was abnormal. It implied that the reasonable glacier outlines and accurate DEM data were the prerequisite for extracting glacier centerlines and calculating glacier length.

280
In the RGI v6.0, 38053 glaciers in the SCGI were adopted and accounted for 78.35% of the total glaciers in China, by checking the GLIMS_ID in both glacier datasets. As mentioned above, the Lmax of the longest glacier length was contained in the RGI v6.0. In order to further verify the accuracy of glacier length calculated by this method, we calculated the difference (DL) between GLmax and Lmax, and then arranged them in ascending order to generate the distribution diagram of sequence-DL (Fig.11).
If DL was negative, the GLmax of glaciers with the corresponding serial number was smaller than Lmax and vice versa. Overall, 285 there were only a small part of glaciers with extremely large |DL| at both ends (Fig.11). After visual inspection, GLmax was more consistent with the actual conditions of glaciers. were calculated (Fig.12). The size of three pixels for DEM was used as the statistical tolerance, which means glaciers within the tolerance range were regarded as consistent extraction results. Statistically, there were 22017 glaciers with equal tolerance, 925 glaciers with negative DL and 15111 glaciers with positive DL. In terms of numerical comparison, GLmax obtained by our method was slightly larger than Lmax in RGI v6.0.

Analysis of abnormal DL
Combining the designed algorithm with visual inspection, the preliminary analysis showed that the local abnormal DEM, 300 inaccurate glacier outlines and some glacier types (such as ice cap, slope glacier, etc.) were the main causes of abnormal DL (Fig.13). Slope glacier is typical multi-origin and multi-exit glacier with almost the same number of local highest points and local lowest points, which often exist in pairs ( Fig.13-a). If the local highest point did not match the local lowest point, a value of positive DL would occur ( Fig.13-a, blue polyline). Local abnormalities in DEM generally resulted in a shorter GLmax (negative DL), as showed in Fig.13-b. Some key local highest points could not be identified because of the inaccurate outlines, 305 resulting in a large negative DL (Fig.13-c). For non-single glacier, this algorithm could only identify a lowest point, and all branches of glacier centerlines converge to this point, which would increase the length of most branches and make GLmax to be too large or even wrong (Fig.13-d).

310
The small or abnormal Lmax of some glaciers was also the main reason of abnormal DL. An abnormal example is shown in  Table 4. 325 Table 4 The statistics of assigning tasks and results of execution in tests. The results of the tests showed that the program took an average of 20.96 s to extract a glacier, and it spent 86.26 s or even longer for some complex glaciers. Among the first eight processing tasks, T4 took the least time. The main reason was that the https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License. assigned glaciers in this task were mostly small and complex glaciers were less, except for the higher machine configuration.
T7 took the longest time, and the cause was the lower machine configuration. The results of all tasks were merged to obtain 330 the centerline dataset of the SCGI. It contained seven vector files (56 items) and nine logs, which took up about 912 MB in the storage.

Influence of glacier outline quality and DEM
The extraction method of glacier centerlines belongs to geometric graphic algorithm and depends on glacier outlines. Natively, comparing with the previous studies, our method has similar problems: (i) the delayed shunt and early convergence of the 335 branches and (ii) the centerlines of same glacier in different periods, which is not geometrically comparable for some glaciers in drastic changes of outlines. The extraction results also showed that the branches of some glacier centerlines did have delayed diversion or early convergence, while the impact on the simulation of glacier's main flowline was limited. Considering that the results of extracting glacier centerlines change with the changes of glacier outlines, the measurement of the length change of glaciers in different periods will be the focus of our future work. We may further design a new algorithm to automatically 340 supplement, extend, delete or modify the benchmarking glacier centerlines, so as to measure the changes of centerlines and length of glaciers in different periods.
Bare rock region refers to the non-glacial component that is within the outer boundary of the glacier outlines but is not covered by snow or ice. It can be divided into two types: one is the exposed rock protruding on the glacier surface; the other is the cliff 345 generally existing between the upper part of the glacier and the firn basin. The snow or ice on the upper part of the glacier enters the firn basin through the cliffs. And the snow or ice on the cliffs are also important sources of replenishment for firn basin. So the cliffs are theoretically considered to be part of the glacier. However, the cliffs may be similar to the bare rock area during the ablation season, and the cliffs are often accompanied by the presence of image shadows, which will easily cause misjudgments of glacier outlines in interpretation.

350
Determining the ownership of bare rock regions in Gfl will improve the quality of glacier centerlines. In this study, all bare https://doi.org/10.5194/tc-2020-294 Preprint. Discussion started: 26 October 2020 c Author(s) 2020. CC BY 4.0 License.
rock regions were considered to be the first category, and such cases were handled accordingly. The first category was divided into two types: (i) the bare rock area on the upper part of the glacier being equivalent to the ice divide and (ii) the bare rock area near the end of the glacier. The attribution of most bare rock areas in the upper part of the glacier can be determined by 355 the intersection point of Al, Gpl with Gbr. Only a few bare rock areas still exist alone, Eq. (12) was required to determine the segments of the Gfl to which they belong. Some bare rock areas located in the ablation area were allowed to exist alone in the Gfl, and the probability of their existence was extremely low.
The determination of glacier's ELA is difficult. Some scholars believed that each glacier has its own ELA (Sagredo et al., 360 2016;Cui and Wang, 2013), but other scholars argued that the ELA of all glaciers in a certain region is the same (Sagredo et al., 2014;Jiang et al., 2018). The measurement of ELA requires continuous and long-term observation data, so it is very difficult to determine the ELA of the glaciers in large-scale. In this study, the ELA used to distinguish between the accumulation area and the ablation area of the glacier was estimated by calculating the Zmin. For some glaciers (such as calving glaciers), the Zmin is above the actual ELA, which has been reasonably explained by scholars (Braithwaite and Raper, 2009). And it was 365 considered that this overestimation is unlikely to affect the automatic calculation of glacier length (Machguth and Huss, 2014).

Some other factors influencing centerline of glaciers
Automatic extraction of glacier centerlines was basically carried out during the processing of polylines, so the processing algorithm of polylines in the program occupied a considerable part of codes. Among them, several common problems of disconnected polylines are shown in Fig.15. The following four types are important, which have a great influence on the 370 accuracy and extraction automation of glacier centerlines.
(i) During the post-processing of the auxiliary lines, due to the inaccuracy of ice divide or the problems of DEM, the ridgelines in the edge of the ice divide of some glaciers start at the Gpl and end up with the Gpl or in parallel along the Gpl, which are unreasonable. In response to this problem, the algorithm set corresponding rules for screening in the processing of auxiliary 375 lines, reducing the impact of such problems as much as possible.
(ii) The visually closed vector polyline is not completely closed. Its start and end are at the same point, which is equal to a natural division point. Unless the natural division point of Gpl completely coincides with a certain division point, the number 380 of polyline records in the Gpl after division will be one more than we expected. Therefore, it is necessary to identify the natural division point during processing and merge the two disconnected polyline records.
(iii) The algorithm of Euclidean allocation is accomplished based on raster operation, which is equivalent to the equidistant scatter operation with the interval of P8 on the glacier surface. For some glaciers with horizontal or vertical distribution of the 385 Gpl, the extraction will continue after the centerlines overlaps with the Gpl. We only need to design the corresponding functions to detect and delete this redundancy of the disconnected polylines.
(iv) In the process of calling the module of Euclidean allocation to generate the centerlines, there is a slight probability that pixels with strictly equal distances will appear. The central axis will generate a regular rectangle based on the raster pixel 390 corresponding to the central point, which will affect the calculation of the GLmax. In the algorithm, a function to identify and deal with such problems was added after the Euclidean allocation, then the polylines on one side of the diagonal of a rectangle were randomly retained.

Conclusions
An automatic method for extracting glacier centerlines based on Euclidean allocation in two-dimensional space was designed 395 and implemented in this study. The automatic extraction method only needs the glacier outlines and the corresponding DEM to automatically generate the graphic data of Ggcf, the graphic and numerical data of different glacier lengths (GLmax, GLmean, GLacc, and GLabl). The standardized and automatic extraction of glacier centerlines requires no manual intervention. Meanwhile, we used SCGI as the test data to run the program. The success rate of extracting glacier centerlines was very close to 100% and the comprehensive extraction accuracy reached 94.34%, which reflected the robustness and simplicity of our method.

400
The automatic extraction algorithm proposed has three advantages: (i) introducing the auxiliary reference lines which ensure the validity of the upper glacier centerlines; (ii) success in automatically obtain the longest length of each glacier and the branches of glacier centerlines; (iii) providing more information of glacier lengths (e.g., GLmax, GLmean, GLacc, and GLabl).
Compared with the Lmax in RGI v6.0, the GLmax value calculated by our algorithm is in better agreement with the actual length 405 of the glacier. Our future work will focus on improving the time efficiency of the algorithm, providing the updated datasets of glacier centerlines with higher-quality, and identifying the abnormal glacier phenology such as glacier surging rapidly.

Code availability
The code used to support the findings of this study are available from the corresponding author upon request.

410
The datasets including the SCGI, RGI v6.0 and SRTM1 DEM v3.0 used in this study are freely available. The database of glacier centerlines in the SCGI produced in this study are available from the corresponding author upon request.