Comparison of in-situ snow depth measurements and impacts on validation of unpiloted aerial system lidar over a mixed-use temperate forest landscape

The accuracy and consistency of snow depth measurements depend on the measuring device and the conditions of the site and snowpack in which it is being used. This study compares collocated snow depth measurements from a magnaprobe automatic snow depth probe and a Federal snow tube, then uses these measurements to validate snow depth maps from an unpiloted aerial system (UAS) with an integrated Light Detection and Ranging (lidar) sensor. We conducted three snow depth sampling campaigns from December 2020 to February 2021 that included 39 open field, coniferous, mixed, and deciduous forest sampling sites in Durham, New Hampshire, United States. Average snow depths were between 9 and 15 cm. For all sampling campaigns and land cover types, the magnaprobe snow depth measurements were consistently deeper than the snow tube measurements. There was a 12% average difference between the magnaprobe (14.9 cm) and snow tube (13.2 cm) average snow depths, with a greater difference in the forest than the field. The lidar estimates of snow depth were 3.6 cm and 1.9 cm shallower on average than the magnaprobe and snow tube, respectively. While the magnaprobe had a better correlation with the UAS lidar, the root mean square errors were higher for the magnaprobe than the snow tube, likely due to overprobing by the magnaprobe into leaf litter. Even though the differences between the in-situ sampling methods resulted in modest performance differences when used to validate the UAS lidar snow depths in this study, measuring vegetation height, leaf litter, and soil frost with in-situ snow depths from multiple sampling techniques helped to account for the errors of in-situ snow depth for robust validation of the UAS snow depth maps.

Short Summary. This study compares snow depth measurements from two manual instruments and an airborne platform in a field and forest. The manual instruments' snow depths differed by 1 to 3 cm. The airborne measurements, which do not penetrate the leaf litter, were consistently shallower than either manual instrument. When combining airborne snow depth maps with manual density measurements, corrections may be required to create unbiased maps of snow properties.

scale (CCi HS-6 Electronic Scale, 2 gram resolution). Snow mass was the total mass minus the empty tube mass. Snow density was determined from the snow mass and sampled volume.

depth precision analyses indicated that snow depth differences of 1 cm or less could be detected in a 1x1 m area using nine samples for most of the study area (Jacobs et al. 2021). At each location, a 1x1 m square polyvinyl chloride (PVC) grid was placed on the snow surface with one vertex located coincident with a stake. The orientation of two adjacent sides of the grid was recorded. Nine magnaprobe depth measurements were made at an approximately even spacing within the 1x1 m grid. Immediately after the magnaprobe measurements, snow tube snow depth measurements were made at the same nine locations by positioning the snow tube directly over each magnaprobe sampling location. At a tenth location within each 1x1 m grid, the snow tube was used to make a SWE measurement. For the 24 February 2021 campaign, after the magnaprobe measurements were completed for the two northern transects, the instrument was transferred to a new operator who made measurements on the southernmost transect (Transect 1). The QA/QC process identified notable errors for observations from that transect, so Transect 1 data for that date were removed from the analysis.

Moultrie Wingscapes Birdcam Pro Field Cameras were used to capture images of the snowpack relative to a 1.5 meter marked PVC pole following the method used in NASA's 2020 SnowEx field camera campaign in Grand Mesa, CO (personal communication, 16 November 2020). Three cameras were used; one was in the open field, one was in the coniferous forest, and one was in the deciduous forest (Fig. 1). Each camera was mounted approximately 0.85 m above the ground and placed approximately 5.5 m from its respective PVC pole. Each camera's field of view included the entirety of the PVC pole, some of the ground surface below the pole, and some open area above the pole. Each PVC pole was spray-painted red and was marked with 1 cm and 10 cm increments. The cameras captured images of the poles every 15 minutes for the duration of the study period. Snow depth was derived by manual inspection of the photos and recorded to the nearest cm.

Daily soil frost depth data were collected at field and forest locations at the Thompson Farm Research Observatory using CRREL-Gandahl style frost tubes (Gandahl 1957). The frost tubes have flexible, polyethylene inner tubing filled with methylene blue dye whose color change is easy to differentiate when extruded from ice (Gandahl 1957; Rickard and Brown 1972; Sharratt and McCool 2005). The outer tubing consists of PVC pipe installed to between 0.4 and 0.5 m below the soil surface (Ricard et al. 1976; Sharratt and McCool 2005).

Leaf litter depth was measured on 2 April 2021 after the spring snowmelt. The leaf litter depth was measured at each snow depth sample location. Sampling was conducted using a PVC collar (a round ring) that is 8 cm in depth and 10 cm in diameter (Kaspari and Yanoviak 2008). The collar was placed in the leaf litter and was pushed down until it was through the leaf litter layer. If sticks or larger stones were in the way, they were either carefully removed or the collar was moved slightly to an adjacent location. Measurements were taken using a wooden ruler at four cardinal points in the collar. The four measurements were recorded and averaged, and the final litter depth value was recorded to the nearest cm.

Magnaprobe penetration depth measurements were also made when snow was not present to capture the probe's penetration into the leaf litter. Directly following the 2 April 2021 leaf litter sampling using the collar, 20 magnaprobe leaf litter depth measurements were made at each of the 39 snow depth sampling locations. Measurements were taken within a 1.5 m radius of the stake. When using the magnaprobe, the weight of the probe was the only force applied on the ground to minimize penetration into the duff layer and underlying soil. The probe was gently rested on the ground rather than being forced into the ground. The 20 measurements were recorded and averaged to obtain a magnaprobe litter depth at each location.

Lidar Sampling
UAS snow-on lidar surveys were conducted at Thompson Farm prior to in-situ sampling on each of the campaign dates. A snow-off baseline survey was conducted on 2 April 2021 following snowmelt. The sensor payload consisted of the Velodyne VLP-16 laser scanner and the Applanix APX-15 Inertial Navigation System (INS; GPS+IMU). The VLP-16 is a lightweight (~830 g), low-power (~8 W) sensor, which makes it well suited to UAS deployment. The sensor incorporates 16 rotating IR lasers that are arranged and oriented on the payload to provide a 30° along-track field of view with a cross-track field of view limited only by the range of the sensor (approximately 100 m). At an altitude of 65 m, the range of the sensor produces an effective cross-track field of view of approximately 98°, although this varies depending on the characteristics of the target surface. Each laser operates at a wavelength of 903 nm. This wavelength is well suited to snow because it is outside of the first major electromagnetic absorption feature of snow (centered at 1030 nm). A reduction in signal strength would be observed over snow cover for lidar sensors that operate at wavelengths coinciding with strong electromagnetic absorption. The VLP-16 has two return modes, single-return and dual-return, which record the strongest return or the strongest and the last return, respectively. In dual-return mode, the VLP-16 collects ~300,000 distance measurements per second with a reported uncertainty of 3 cm at a range of 100 m.

For these acquisition missions, the VLP-16 was hard-mounted to a DJI Matrice 600 to maintain constant lever arm offsets between the INS GPS antenna, the lidar sensor, and the INS board. As opposed to a gimbal-mounted system, this hard-mounted configuration achieves a more tightly coupled system, resulting in improved point cloud geolocation accuracy. The lidar sensor was set to dual-return mode to improve ground detection in the forested areas of our field site.

We buffer on all sides using catalog options in lidR to ensure returns near tile edges were classified. The progressive morphological filter (PMF) was parameterized using a set of window sizes of 1, 3, 5, and 9 m, and elevation thresholds of 0.2, 1.5, 3, and 7 m, which were determined by varying value sets and assessing digital terrain models (DTMs) to determine the parameter sets that produced a visually smooth surface over a dense grid (Muir et al. 2017). Following ground classification for each tile, returns within the 15 m tile buffers were removed, and all resulting 100 m square ground-classified tiles were merged. The result of the PMF is that non-ground returns (i.e., trees, shrubs, and noise) were filtered out of the point cloud data sets, so that only returns from ground surfaces remained. The two data sets, non-ground returns and ground returns from the original point clouds, were coded according to LAS specifications and merged. The ground returns were extracted for the 1x1 m square sampling sites, corresponding to the alignment and orientation of the respective PVC grids. The lidar snow depth was calculated as the difference between the mean snow-on and mean snow-off elevations within each sampling grid.
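The final step above, differencing mean ground elevations within a sampling grid, can be sketched as follows. This is a minimal illustration rather than the processing chain used in the study: the function name and the elevation values are hypothetical, and the ground returns are assumed to have already been classified and clipped to the 1x1 m grid.

```python
from statistics import fmean


def lidar_snow_depth_cm(snow_on_z_m, snow_off_z_m):
    """Mean snow-on minus mean snow-off ground elevation, converted to cm.

    Both arguments are elevations (m) of ground-classified returns that
    fall inside one 1x1 m sampling grid (assumed to be clipped already).
    """
    if not snow_on_z_m or not snow_off_z_m:
        raise ValueError("each survey needs at least one ground return in the grid")
    return (fmean(snow_on_z_m) - fmean(snow_off_z_m)) * 100.0


# Made-up elevations for one grid cell (not data from the study).
snow_on = [30.42, 30.45, 30.40, 30.44]
snow_off = [30.30, 31.30 - 1.00 + 0.01, 30.29]
depth_cm = lidar_snow_depth_cm(snow_on, snow_off)
```

Because the two surveys are differenced as means, the number of ground returns in the snow-on and snow-off point clouds does not need to match.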

Statistical Approach
The magnaprobe, snow tube, and lidar snow depth measurements were summarized and compared for the field and forested areas by sampling campaign date following Willmott (1982). Each comparison was conducted using the individual grid cell measurements (N = 9 at each grid cell) and the grid cell average depths. Sample statistics that were calculated and compared for each of these datasets included the mean and standard deviation, the bias, the mean absolute error (MAE), and the root mean square error (RMSE). A line of best fit was generated for each to provide the corresponding slopes, intercepts, and R² values. As described by Willmott (1982), the MAE of the compared data sets is given as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - Y_i\right| \tag{1}$$

where X and Y are two of the magnaprobe, snow tube, and lidar snow depths and N is the number of samples. The root mean square error (RMSE) is the square root of the average squared difference between the compared data sets, given as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2} \tag{2}$$
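The comparison statistics listed above can be computed in a few lines. The sketch below is illustrative only: the function name is hypothetical, the bias is taken as the mean of X minus Y, and the line of best fit is assumed to be an ordinary least-squares regression of Y on X, none of which is specified in that level of detail in the text.

```python
import math


def comparison_stats(x, y):
    """Bias, MAE, RMSE, and least-squares fit of y on x for paired depths (cm)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    diffs = [xi - yi for xi, yi in zip(x, y)]
    bias = sum(diffs) / n                               # mean of X - Y
    mae = sum(abs(d) for d in diffs) / n                # mean absolute error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)     # root mean square error
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary to fit a line")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)                         # squared Pearson correlation
    return {"bias": bias, "mae": mae, "rmse": rmse,
            "slope": slope, "intercept": intercept, "r2": r2}


# Made-up paired depths (cm), e.g. magnaprobe (x) and snow tube (y).
stats = comparison_stats([10, 12, 14, 16], [9, 12, 12, 15])
```

For the made-up values shown, the bias and MAE are both 1.0 cm because no difference changes sign, while the RMSE is larger because squaring weights the 2 cm difference more heavily.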

Magnaprobe vs. Snow Tube
The full experiment yielded 936 individual pairs of snow depth measurements from the snow tube and the magnaprobe (Fig. 2a). Overall, there was moderate agreement (R² = 0.55) between the two datasets for all three sampling campaigns (Table 2). The snow depths measured by the magnaprobe (14.9 cm average snow depth) were deeper than those measured by the snow tube (13.2 cm average snow depth), with an overall bias of 1.7 cm. The magnaprobe snow depth was at least 0.5 cm deeper than the snow tube in 74% of the 936 measurement pairs. Only 6.3% of the pairs had snow tube snow depths exceeding magnaprobe snow depths by 0.5 cm or more. In 7.4% of the pairs, the magnaprobe snow depth was more than 5.0 cm deeper than the snow tube snow depth. In eight pairs of measurements, the magnaprobe snow depth was more than double the snow tube snow depth.

The overall agreement between the snow tube and magnaprobe was better when the nine measurements within a single 1x1 m grid cell were averaged at each of the sampling locations (Fig. 2b and Table 2). There is a notable improvement in grid cell statistics, and the correlation is stronger (overall R² = 0.76), with slopes closer to one, intercepts closer to zero, and the RMSE values reduced to 2.5 cm or less. Although averaging has no impact on the overall bias, the range of differences among pairs narrowed. Boxplots show that there is a consistent difference (magnaprobe minus snow tube) that is typically constrained to less than 3 cm, but that a limited number of outliers were observed (Fig. 3b). The magnaprobe snow depth was at least 0.5 cm deeper than the snow tube in almost all grid cells (86.7%), but only three grid cells had differences greater than 5 cm. There were no instances in which there was a doubling of snow depth.
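The observation that grid averaging leaves the bias unchanged while reducing the RMSE follows from the arithmetic whenever every grid cell holds the same number of measurements. The sketch below illustrates this with made-up differences (magnaprobe minus snow tube, in cm) and three measurements per cell rather than nine, to keep it short; the function name is hypothetical.

```python
import math
from statistics import fmean


def bias_and_rmse(diffs):
    """Mean difference and root mean square difference of a list of values."""
    return fmean(diffs), math.sqrt(fmean([d * d for d in diffs]))


# Made-up magnaprobe-minus-snow-tube differences (cm) in two grid cells.
cells = [[4.0, -1.0, 3.0], [0.0, 2.0, 1.0]]

# Statistics on the individual measurement pairs.
point_bias, point_rmse = bias_and_rmse([d for cell in cells for d in cell])

# Statistics on the grid-cell averages.
grid_bias, grid_rmse = bias_and_rmse([fmean(cell) for cell in cells])
```

Here both biases equal 1.5 cm, while the RMSE drops from about 2.3 cm for individual pairs to about 1.6 cm for grid averages, because the within-cell scatter partly cancels when the measurements are averaged.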

Magnaprobe vs. Snow Tube by Land type
The magnaprobe and snow tube snow depths differ by land type, with the field having deeper snow and more spatial variability than the forest land types (Fig. 4). Among the three forest types, the deepest snow was in the deciduous-dominated forest, with mixed and coniferous forest having similar snow depths. The mean difference between the magnaprobe and snow tube snow depths is a modest 1.3 cm in the field and 1.9 cm in the forest, with differences of 1.9, 2.0, and 1.9 cm in the deciduous, mixed, and coniferous land types, respectively. Based on t-test results, the magnaprobe measured significantly deeper snow depth compared to the snow tube in both the field and the forest. The t-test results identified significant differences between snow depths from the two probing techniques regardless of whether individual locations (p-value < 0.001) or grid cell average snow depths (p-value = 0.02) were used. Based on Welch's adjusted ANOVA test, there are no significant differences in overprobing among forest land types (p-value = 0.24). The RMSE values between the magnaprobe and snow tube snow depths are 3.0 cm and 2.5 cm for the forest and field sampling sites, respectively (2.3 cm and 2.0 cm when grid average values are used). Thus, the sampling method has a different impact in the field than in the forest, and the RMSE and bias values provide an indicator of the different errors associated with in-situ measurements based on land type when used for model or remote sensing validation.
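The text does not state which form of t-test was applied. Because the two instruments were used at the same locations, one plausible choice is a paired test on the differences; the sketch below computes that statistic under this assumption, using made-up depths and a hypothetical function name. It returns only the t statistic and degrees of freedom; a p-value would come from the t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import fmean, stdev


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b.

    Assumes a paired design (same locations measured by both instruments),
    which is an assumption of this sketch rather than a stated method.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("a and b must be paired and contain at least two values")
    diffs = [ai - bi for ai, bi in zip(a, b)]
    sd = stdev(diffs)                      # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = fmean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up depths (cm): magnaprobe first, snow tube second.
t_value, dof = paired_t_statistic([15, 14, 16, 13], [13, 13, 14, 12])
```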

Impacts of Leaf Litter on Magnaprobe vs. Snow Tube Depth
The range of leaf litter depths measured in the forest using the collar was typically 3 to 7 cm, with an average leaf litter depth of 3.9 cm (Fig. 5). The snow-off magnaprobe litter depth measurements in the forest had an average value of 5.8 cm and were significantly larger than the depths measured using the collar (p-value < 0.001). The litter depths in the forest, regardless of measurement technique, exceeded the differences between the magnaprobe and snow tube snow depths in the forest, which were 2.5, 1.7, and 1.4 cm on 18 December, 4 February, and 24 February, respectively.

Lidar and In-Situ Snow Depth Comparison
While the previous sections identified significant differences between the magnaprobe and snow tube snow depth measurements, the average differences, 1.3 and 1.9 cm in the field and forest, respectively, are relatively modest. One of the motivations for this study was to understand the impact of those differences on the validation of emerging high resolution snow depth datasets such as those from UAS SfM or lidar observations. Here, we briefly examine the lidar snow depth performance relative to both in-situ sampling techniques and land type (Table 3 and Fig. 6), then discuss the impact of different sampling techniques on that evaluation.

https://doi.org/10.5194/tc-2022-7 Preprint. Discussion started: 3 February 2022. © Author(s) 2022. CC BY 4.0 License.

The lidar-derived snow depths for each of the 1x1 m grid cells were extracted as described in Section 2.2. For both magnaprobe and snow tube measurements, the agreement with lidar is markedly better in the field than the forest (Fig. 6). Overall, the lidar estimates of snow depth are typically shallower than the in-situ observations (Table 3). This is particularly evident for the 24 February 2021 forest lidar snow depths. The lidar also has larger cell-to-cell variability than the in-situ measurements, as quantified by the standard deviation, particularly in the forest. This large variability in the forest combined with the relatively small range of snow depths even across sampling dates makes it nearly impossible to identify relatively shallow or deep snow depths within the forest. The very low correlation values for both in-situ validation approaches reflect the low signal-to-noise ratio. In contrast, there is fairly strong evidence in the field that snow depth differences that exceed 3 cm are discernible.

Fig. 7 shows that the differences between the lidar and in-situ observations, regardless of method, are considerably larger than the differences between the two in-situ sampling methods. The magnaprobe's potential to overprobe through leaf litter and duff layers to a greater extent than the snow tube impacts the quantification of performance. Overprobing negatively impacts the bias, MAE, RMSE, and linear regression intercept metrics. The RMSE values are slightly higher for the magnaprobe than the snow tube, and to a large extent this reflects the higher bias when using the magnaprobe as compared to the snow tube. In contrast, the snow tube's RMSE is largely due to the snow tube's high site-to-site differences rather than an overall bias. Thus, for individual locations, the magnaprobe is more consistent in its agreement with the lidar. This is also reflected in the higher R² value.
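The distinction drawn above between RMSE that comes from bias and RMSE that comes from site-to-site scatter rests on a standard identity: the squared RMSE equals the squared bias plus the population variance of the differences. The sketch below checks that identity on made-up depths; the function name and values are illustrative and are not taken from the study.

```python
import math
from statistics import fmean, pvariance


def rmse_decomposition(insitu_cm, lidar_cm):
    """Return (bias, variance of differences, RMSE) for lidar minus in situ."""
    diffs = [l - s for l, s in zip(lidar_cm, insitu_cm)]
    bias = fmean(diffs)
    variance = pvariance(diffs)            # population variance (divides by N)
    rmse = math.sqrt(fmean([d * d for d in diffs]))
    return bias, variance, rmse


# Made-up depths (cm) at four sites.
bias, variance, rmse = rmse_decomposition([14, 12, 15, 13], [10, 9, 12, 9])

# Identity: RMSE^2 == bias^2 + variance of the differences.
identity_holds = math.isclose(rmse ** 2, bias ** 2 + variance)
```

In this made-up case nearly all of the squared error (12.25 of 12.5 cm²) comes from the bias term, which is the situation described for the magnaprobe; a comparison dominated by the variance term would correspond to the snow tube case.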
4 Discussion

Uncertainty and impacts from overprobing
This study quantifies the differences between snow depth measurements made with a magnaprobe and with a Federal snow tube sampler. The differences seem to be primarily associated with greater overprobing by the magnaprobe into vegetation/organic layers and thawed soils. The result was that magnaprobe snow depth measurements were observed to be higher than snow tube measurements, with a greater difference in the forest

Overprobing also impacts SWE estimates. Given the efficiency of making snow depth measurements, a snow survey will often make numerous snow depth measurements per snow density measurement then combine the measurements to obtain SWE (Elder et al. 1998; López-Moreno et al. 2013). In some cases, only snow depth is measured and bulk density is derived from empirical relationships. In either case, any biases in snow depth will be transferred to the SWE estimates. Based on leaf litter measurements and the differences between the lidar snow depth estimates and the in-situ measurements, it appears that both instruments overprobe to some extent. In fact, a typical application of the snow tube will overprobe by design to extract the snow core and a "plug of soil". However, because the operator removes any vegetation and soil prior to recording measurements, snow tube measurements can readily correct for the overprobing. The errors incurred by combining magnaprobe measurements with snow tube density values to determine SWE likely equal or exceed those from the 1.9 cm depth differences observed in this study.

suggest that snow depth measurements from field cameras may have better agreement with lidar-based snow depths. An added advantage of field cameras is that the snowpack would not be impacted through destructive measurements and foot tracks to measurement locations.
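The transfer of a depth bias into SWE, discussed above, can be illustrated with the usual relation SWE = depth × bulk density / water density. The sketch below uses the average depths reported in this study (14.9 cm and 13.2 cm) with a hypothetical bulk density of 250 kg m⁻³, which is an assumed value for illustration and not a measurement from the study.

```python
def swe_mm(depth_cm, density_kg_m3):
    """Snow water equivalent (mm) from depth (cm) and bulk density (kg m^-3).

    depth_m * density / 1000 kg m^-3 gives metres of water, so with depth
    in cm and the result in mm the conversion reduces to a division by 100.
    """
    return depth_cm * density_kg_m3 / 100.0


ASSUMED_DENSITY = 250.0  # kg m^-3; hypothetical, for illustration only

swe_tube = swe_mm(13.2, ASSUMED_DENSITY)    # snow tube average depth
swe_probe = swe_mm(14.9, ASSUMED_DENSITY)   # magnaprobe average depth

# Relative SWE difference; this ratio does not depend on the density chosen.
relative_bias = (swe_probe - swe_tube) / swe_tube
```

With these inputs the two estimates are 33.0 mm and about 37.3 mm, a relative difference of roughly 13% that equals the relative depth difference, since the same density multiplies both depths.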

Future perspectives
While airborne-based lidar and SfM photogrammetry have been widely used to generate spatially distributed snow depth maps at scales between ground measurements and satellite or regional snow products (Deems et

2) Measurements of leaf litter and soil frost may help to account for the overprobing errors, particularly when using a magnaprobe.

3) To cross-check on-ground snow depth measurements, the use of multiple sampling techniques is highly recommended (rather than a single method) because the measurement errors vary by sampling method and surface conditions (e.g., low vegetation, leaf litter, and soils), particularly in shallow snowpacks.

As UAS lidar or optical systems are increasingly used in snow research, it is prudent to recognize that snow depth maps produced by these remote sensing products are likely to be modestly shallower than coincident in situ observations. The differences among measurement techniques in the present study reflect the current study area, surface conditions for a single season, and the operation of the instruments by this project team. Further studies to minimize the errors from in-situ sampling in various snow environments with different vegetation and soil conditions are needed to accurately validate UAS snow depth maps and to provide guidance on best practices for using these maps in combination with in situ measurements to represent differences in snow depth and SWE over