Accuracy and Inter-Analyst Agreement of Visually Estimated Sea Ice Concentrations in Canadian Ice Service Ice Charts

This study compares the accuracy of visually estimated ice concentrations by eight analysts at the Canadian Ice Service against three standards: i) ice concentrations calculated from automated image segmentation, ii) ice concentrations calculated from automated image segmentation that were validated by the analysts, and iii) the modal ice concentration estimate by the group. A total of 76 pre-defined areas in RADARSAT-1/RADARSAT-2 imagery are used in this study. Analysts overestimate ice concentrations when compared to all three standards, most notably for low ice concentrations (1/10 to 3/10). The spread of ice concentration estimates is highest for middle concentrations (5/10, 6/10) and smallest for 9/10. The overestimation at low concentrations and the high variability at middle concentrations introduce uncertainty into the ice concentration distribution in ice charts, which may have downstream implications for numerical modeling and sea ice climatology. Inter-analyst agreement is also measured to determine which classifier's ice concentration estimates (analyst or automated image segmentation) disagreed the most. It was found that one of the eight analysts disagreed the most, followed second by the automated segmentation algorithm. This suggests high agreement in ice concentration estimates between analysts at the Canadian Ice Service. The high agreement, but consistent overestimation, results in an overall accuracy of ice concentration estimates in polygons of 39%, 95% CI [34%, 43%], for an exact match with the ice concentration calculated from segmentation, and 84%, 95% CI [80%, 87%], for a match within +/- one ice concentration category. Only images with high contrast between ice and open water and well-defined floes are used: true accuracy is expected to be lower than what is found here.


Introduction
Sea ice charts are routinely made by national Ice Services to provide accurate and timely information about sea ice conditions for the safety of the public, e.g. for Arctic shipping. Ice analysts and forecasters generate these charts by using various data sources, including remotely sensed imagery, to quantify various sea ice characteristics, including sea ice concentration. The main objective of this study was to determine the probability that a given ice concentration in a Canadian Ice Service ice chart polygon reflects the ice concentration found in the corresponding SAR image used by analysts to create the ice chart.
To achieve this, we assessed: (i) how accurate analysts and forecasters are at visually estimating ice concentration when compared to calculated ice concentration from image segmentation under best-case scenarios (i.e. the image segmentation adequately resembles the visual segmentation done by an analyst or forecaster); and (ii) how consistent analysts and forecasters are with one another in visually estimating ice concentration from SAR imagery.
The paper is structured as follows. Section 2 describes the data and standards for ice chart creation. Section 3 describes the methodology for generating the sample polygons used in this study, calculating total ice concentration from image segmentation, and capturing analyst estimates of ice concentration. Section 4 describes the producer's and user's accuracy, as well as the two skill scores used in this study. Section 5 provides the results of the comparison between visually estimated ice concentrations and calculated ice concentrations using the skill scores. Section 6 compares visually estimated ice concentrations against the modal ice concentration value. Section 7 describes the accuracy of visually estimated ice concentrations in polygons. The paper concludes with a discussion in the final section.

Ice Charting
Section 2 describes elements of ice charting. Section 2.1 describes the remote sensing data that is the primary input source for generating ice charts. Section 2.2 briefly describes the types of ice charts created at the Canadian Ice Service. Section 2.3 gives an overview of the egg code, which is the international standard used for ice charting and the method for communicating ice concentrations in an ice chart.

Remote Sensing Data
Sea ice is routinely monitored from satellites due to their ability to observe large spatial areas. Passive microwave and synthetic aperture radar (SAR) sensors are preferred over optical imagery because of their ability to see through clouds and their lack of reliance on solar illumination. The Canadian Ice Service (CIS) relied on RADARSAT-1, a SAR sensor, for ice charting until its decommissioning in 2013, and currently relies predominantly on RADARSAT-2.
RADARSAT-2 can send and receive combinations of horizontal and vertical polarizations, which refer to the orientation of the electromagnetic waves sent and received by the sensor. The main polarizations used for sea ice monitoring with RADARSAT-2 are (1) horizontal transmit and horizontal receive (HH), and (2) horizontal transmit and vertical receive (HV).
The HV band has been shown to be less sensitive to the incidence angle of the satellite, which is useful for sea ice monitoring during the melt season. The combination of both HH and HV channels together has been shown to better distinguish between sea ice concentrations than either channel alone (Karvonen, 2014). RADARSAT-1, on the other hand, could only receive HH or VV. Only HH was used for sea ice monitoring in this study, although dual-pol (HH/HV) imagery is used operationally.

Chart Types
A number of different types of charts are generated at the Canadian Ice Service (e.g. regionals, dailies, image analyses, concentration, stage of development, etc.), which vary due to the chart's purpose, relevant time, or underlying data sources. Image analysis charts are created by visually interpreting specific satellite images. These charts are constrained to the geographic extent and resolution of the corresponding satellite images. Daily ice charts combine different sources of information, introducing variability between the ice chart and the satellite image.

Egg Code
The egg code is a World Meteorological Organization international standard for coding ice information (see Figure 1). Each polygon in an ice chart is assigned an egg code with corresponding values. The egg code contains information on the ice concentration (C), stage of development (S), and the predominant form (F) of ice, within an oval shape. The top value in the egg code is the total ice concentration, which includes all stages of development of ice. The total concentration value, Ct, is the code found on the first line of the egg code. Total ice concentration is expressed in categories, where the ice concentration as a percentage is rounded to the nearest tenth. A concentration of less than 1/10 of sea ice is denoted open water, which by definition is not the absolute absence of ice but ice of less than 1/10 concentration. Partial concentration is used when more than one ice type is present within the delineated polygon; the secondary concentration values (Ca, Cb, Cc) are found on the second line of the egg code. No partial concentration is reported when only one ice type is found. In our study, we only considered total concentration rather than partial concentrations.
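The concentration fields of the egg code described above can be sketched as a small data structure. This is an illustrative subset only, with hypothetical field names, not the WMO encoding itself:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EggCode:
    """Minimal subset of an egg code's concentration fields (in tenths).
    Field names are illustrative, not the WMO encoding itself."""
    ct: int                      # total concentration, first line of the egg
    ca: Optional[int] = None     # partial concentrations, second line,
    cb: Optional[int] = None     # reported only when more than one
    cc: Optional[int] = None     # ice type is present

    def is_open_water(self) -> bool:
        # "Open water" means less than 1/10 ice, not necessarily no ice at all.
        return self.ct < 1

egg = EggCode(ct=7, ca=5, cb=2)
print(egg.is_open_water())   # → False
```

The study below uses only the total concentration Ct, so the partial fields stay unset in a single-ice-type polygon.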

Ice Concentration Estimates
In this paper we consider three standards against which analysts' visually estimated ice concentrations are compared, owing to the absence of absolute ground truth. The standards are: i) ice concentrations derived from automated segmentation; ii) ice concentrations derived from automated segmentation that have been validated by analysts; and iii) the mode of the ice concentration estimates given by analysts. This section describes the methodology for creating the sample polygons used in this study, calculating ice concentrations from automated segmentation, and capturing visual estimates of ice concentration from participating ice analysts.

Sample Polygons
RADARSAT-1 and RADARSAT-2 images were randomly selected from the Canadian Ice Service image archive. Each image was manually reviewed to find areas of clear contrast between water and ice to optimize segmentation capability and reduce potential ambiguity in visual analysis. A former operational analyst delineated potential polygons for the sample of selected cases used in this study. The polygon sizes were compared to polygon sizes from the published operational daily charts and image analyses that used the same RADARSAT images; for each sample polygon, the published polygon with the greatest intersecting overlap was used for comparison. Figure 2 shows the difference between polygon sizes. Polygon sizes were not normally distributed. Under a Wilcoxon-Mann-Whitney rank test, polygons from image analyses and daily charts are not significantly different in their sizes (p = 0.226). On the other hand, polygons generated for this study are significantly smaller than polygons from image analysis charts (p = 0.002), although there is overlap in the size range. Polygons generated for this study are also smaller than polygons from daily charts (p = 0.071), with less overlap in the size range than for the image analyses.
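The polygon-size comparison above uses the Wilcoxon-Mann-Whitney rank test. A minimal sketch of the statistic with a two-sided normal-approximation p-value follows; the sample lists are placeholders, not the study's polygon areas, and no tie or continuity correction is applied:

```python
import math
from itertools import chain

def mann_whitney_u(a, b):
    """Mann-Whitney U statistic with a two-sided normal-approximation
    p-value (mid-ranks for ties; no tie or continuity correction)."""
    pooled = sorted(chain(((x, 0) for x in a), ((x, 1) for x in b)))
    vals = [v for v, _ in pooled]
    ranks = {}
    i = 0
    while i < len(vals):
        j = i
        while j < len(vals) and vals[j] == vals[i]:
            j += 1
        for t in range(i, j):
            ranks[t] = (i + j + 1) / 2   # mid-rank of 1-based positions i+1..j
        i = j
    r_a = sum(ranks[t] for t, (_, g) in enumerate(pooled) if g == 0)
    n1, n2 = len(a), len(b)
    u = r_a - n1 * (n1 + 1) / 2          # U for sample a
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

# Placeholder polygon areas for two chart types (arbitrary units)
u, p = mann_whitney_u([12, 15, 20, 22, 31], [40, 48, 55, 60, 75])
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would be used; the hand-rolled version is only meant to show what the test compares (rank sums, not means).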

Ice Concentration Estimates Using Automated Image Segmentation
The University of Waterloo MAp Guided Ice Classification (MAGIC) system was used to classify RADARSAT-2 pixels as ice or open water (Clausi et al., 2010). MAGIC uses an Iterative Region Growing using Semantics (IRGS) framework to classify pixels into categories. The RADARSAT-2 images analyzed in this study were run through the MAGIC algorithm using default parameters. As an input to MAGIC, we specified only two classes in the polygon to force MAGIC to segment the pixels into ice or open water only. After running MAGIC we performed a visual inspection (and when necessary, we manually assigned the classification of ice or open water) to ensure that the resulting ice concentration was calculated correctly. Figure 3 shows an example of the output from MAGIC.
Only the HH band is used for both segmentation and visual interpretation in this study. This is done to ensure that differences in ice concentration estimates between individuals were restricted to only interpretation of the segmentation, rather than interpretations of the multiple polarizations normally available.
The total ice concentration C was calculated as the ratio C = N/T, where N is the number of ice pixels and T is the total number of pixels in the polygon. Resulting values were binned into categories to reflect the ice concentration categories used in operational ice charts (Table 1).
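The ratio C = N/T and the binning into tenths can be sketched as follows; the 100-pixel mask is an illustrative placeholder, not study data:

```python
def total_concentration(ice_mask):
    """Fraction of pixels classified as ice in a polygon.
    `ice_mask` is a flat list of booleans (True = ice pixel)."""
    n = sum(ice_mask)          # N: number of ice pixels
    t = len(ice_mask)          # T: total number of pixels in the polygon
    return n / t

def to_category(c):
    """Bin a fractional concentration to the nearest-tenth chart category."""
    return round(c * 10)

mask = [True] * 37 + [False] * 63   # 37% ice cover
print(to_category(total_concentration(mask)))   # → 4
```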

Ice Concentration Estimates From Operational Analysts
A total of eight analysts and forecasters were given a customized user interface designed for this study (Figures 4, 5). (For the remainder of this paper, the term analyst will be used to indicate analyst or forecaster.) Two sample polygons were used as a test run to ensure that analysts were familiar with the user interface before the assessment. After completing the test run, analysts completed the following sequence for each polygon presented: (i) Input an estimated ice concentration value for the delineated area. Options were restricted to only the values found in the standard ice-chart egg code.
(ii) The analyst was presented with the segmentation results from MAGIC only after submitting the value in the previous step. The analysts were able to toggle back and forth between the original image and the segmentation results.
(iii) The analyst was then asked if they agreed or disagreed with the segmentation results.

(iv) If the analyst input "Disagree," they were then asked to state if they felt the segmentation algorithm over-estimated or under-estimated the ice.
(v) Analysts were asked to input any additional comments. This allowed for comments to explain why they felt their estimate differed from the segmentation results.
The analyst repeated these five steps for all polygons in random order until the entire set was completed. A total of 76 polygons were analyzed by eight analysts, resulting in 608 cases. Each polygon had a total of eight responses, as all analysts completed the same set of polygons. We chose to present the segmentation results only after analysts had input an estimated ice concentration value, in order to prevent potentially biasing results. We considered the cases where the analyst stated "agree" in step (iii) as valid; cases where the analyst stated "disagree" in step (iii) were considered invalid. This allowed us to subset the ice concentrations from MAGIC to only those cases where analysts found the segmentation valid.

(Figure 4 caption: A screenshot of the user interface that the analysts were presented with. Each polygon was pre-defined so all analysts estimated ice concentrations from the same polygons. The outlined polygon was presented to the analyst, who could zoom in/out and pan the image. They were then asked to input an ice concentration value.)

(Figure 5 caption: Segmentation results from MAGIC. Analysts were able to toggle the segmentation results on/off. Answers were locked after input so that they could not be changed after the segmentation results became visible.)

Accuracy and Agreement Skill Scores
This section describes the two skill scores and measures used in this study to determine accuracy and agreement. We use an error matrix, producer's accuracy, user's accuracy, and the kappa statistic for assessing the accuracy of classifiers in remotely sensed imagery (Lillesand et al., 2015). The same statistical framework is known in verification as the multi-categorical contingency table, with its calibration-refinement factorization, likelihood-base rate factorization, and Heidke skill score (Murphy and Winkler, 1987; Wilks, 2011; Joliffe and Stephenson, 2012). We employ these measures to assess the accuracy of the ice concentration estimates provided by the analysts compared to the ice concentrations calculated from the automated image segmentation. In addition to the kappa statistic, we use Krippendorff's alpha to measure agreement between analysts. This measure is often used in counseling, survey research, and communication studies to measure inter-rater reliability, that is, the level of agreement between multiple judges (Hallgren, 2012). Whereas the kappa statistic is restricted to comparing two judges, Krippendorff's alpha can compare multiple individuals. In our study, it was used to measure the agreement between individual analysts and to identify the level of disagreement between the analysts and MAGIC. In the context of this paper, we refer to it as inter-rater agreement rather than inter-rater reliability, so as not to be confused with reliability in the verification context.

Multi-categorical Contingency Table, Producer's Accuracy, and User's Accuracy
The multi-categorical contingency table was used to compare the analysts' visual estimation of ice concentration against the ice concentration calculated from the automated segmentation. First, for each ice concentration category m reported by MAGIC, we define the producer's accuracy p(x|m) as the probability that the analyst estimates the same value as MAGIC. In the Murphy and Winkler (1987) verification framework, the producer's accuracy is the conditional probability of the analyst estimating a specific ice category, given that MAGIC reported the same category. The producer's accuracy is computed by dividing the number of correctly estimated ice concentrations by the column total for each category (the column totals correspond to the marginal probabilities of the MAGIC estimates). Symmetrically, for each ice concentration category x estimated by the analyst, the user's accuracy p(m|x) is the probability that MAGIC will report the same value as the analyst. In the Murphy and Winkler (1987) verification framework, the user's accuracy is the conditional probability of MAGIC estimating a specific ice category, given that the analyst estimated the same category. The user's accuracy is computed by dividing the number of correctly estimated ice concentrations by the row total for each category (the row totals correspond to the marginal probabilities of the analysts' estimates).

The joint probability p(x, m), of MAGIC estimating the m sea-ice category and the analyst estimating the x sea-ice category, informs on the accuracy (i.e. the agreement between analyst and MAGIC). Best accuracy is achieved when all probability mass lies along the diagonal, that is, p(x, m) > 0 only for x = m and p(x, m) = 0 for x ≠ m (off the diagonal). The joint probability p(x, m) is related to both the producer's and user's accuracy as

p(x, m) = p(x|m) p(m) = p(m|x) p(x).
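A small numeric sketch of these quantities from an error matrix; the 3×3 counts are invented for illustration (the study used the full set of chart categories):

```python
# Illustrative error matrix: rows = analyst estimate x, columns = MAGIC category m
counts = [[20, 5, 0],
          [4, 30, 6],
          [1, 8, 26]]
k = len(counts)
total = sum(sum(row) for row in counts)
col_tot = [sum(counts[i][j] for i in range(k)) for j in range(k)]
row_tot = [sum(row) for row in counts]

producers = [counts[j][j] / col_tot[j] for j in range(k)]  # p(x|m): diagonal / column total
users = [counts[i][i] / row_tot[i] for i in range(k)]      # p(m|x): diagonal / row total

# Consistency with the factorization p(x, m) = p(x|m) p(m) = p(m|x) p(x)
joint_00 = counts[0][0] / total
assert abs(joint_00 - producers[0] * (col_tot[0] / total)) < 1e-12
assert abs(joint_00 - users[0] * (row_tot[0] / total)) < 1e-12
```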

Kappa Statistic
The kappa statistic is a skill score which measures how well an analyst performs compared to chance. In forecast verification the kappa statistic is known as the Heidke Skill Score, the skill score constructed from the percent correct against random chance (Wilks, 2011; Joliffe and Stephenson, 2012). The kappa statistic can be calculated for any contingency table to measure the level of agreement between the analysts and the segmentation algorithm. This measure takes into account the possibility of chance agreement between analysts and MAGIC when determining the agreement found between them.
The kappa statistic, κ, is calculated as

κ = (p_0 − p_e) / (1 − p_e),

where p_0 is the agreement between the analyst and the segmentation results, p_e is the agreement that a random estimation is expected to achieve, and 1 is the value attained by p_0 when there is perfect agreement.

The observed agreement p_0 (known also as percent correct) is the sum of the diagonal joint probabilities over the k ice concentration categories,

p_0 = Σ_{i=1}^{k} p(i, i).

The expected chance agreement, p_e, for the k ice concentration categories is

p_e = Σ_{i=1}^{k} p^h_i p^v_i,

where p^h_i is the marginal probability for the i-th row, and p^v_i is the marginal probability for the i-th column. The product of the marginal probabilities p^h_i p^v_j gives the joint probability of categories i and j occurring at the same time by chance, under the assumption of independence between the two classifiers.
A weighted kappa can be used to apply a penalty to disagreements which increases with distance from the diagonal, unlike the unweighted case above, where all disagreements are penalized equally. In this study, we used a linearly weighted kappa: for each ice concentration category away from the diagonal, the penalty increases by one.
A κ value of 1 indicates complete agreement between the estimated ice concentration and the calculated ice concentration from image segmentation. A κ value close to 0 indicates agreement close to that expected by chance. A negative value is theoretically possible, which indicates that an analyst is worse than random chance. However, negative values are rare, and 0 is often used as a lower bound. Landis and Koch (1977) suggested values greater than 0.8 represent strong agreement, values between 0.6 and 0.8 represent moderate agreement, values between 0.4 and 0.6 represent mild agreement, and values below 0.4 as poor agreement.
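Both the unweighted and the linearly weighted kappa can be computed from the contingency table in a few lines. The sketch below assumes counts with rows as analyst estimates and columns as MAGIC categories; it is illustrative, not the study's code:

```python
def kappa(counts, weighted=False):
    """Cohen's kappa from a k x k contingency table of counts.
    With weighted=True, a linear weight |i - j| penalizes disagreements
    by one extra unit per category away from the diagonal."""
    k = len(counts)
    n = sum(sum(row) for row in counts)
    row = [sum(counts[i]) / n for i in range(k)]                    # p^h_i
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]  # p^v_j
    d_o = d_e = 0.0   # observed and chance-expected (weighted) disagreement
    for i in range(k):
        for j in range(k):
            w = abs(i - j) if weighted else (0 if i == j else 1)
            d_o += w * counts[i][j] / n
            d_e += w * row[i] * col[j]
    return 1 - d_o / d_e
```

For the unweighted case this reduces algebraically to (p_0 − p_e) / (1 − p_e), since d_o = 1 − p_0 and d_e = 1 − p_e.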

Krippendorff's Alpha
Krippendorff's alpha is a skill score that measures the level of agreement between multiple analysts. Krippendorff's alpha ranges from 0 to 1, where 1 indicates perfect agreement. Krippendorff suggests that α = 0.8 indicates good skill, although he offers α = 0.667 as a tentatively acceptable lower limit. Krippendorff's alpha, α, is calculated as

α = 1 − d_o / d_e,

where d_o is the observed disagreement, and d_e is the expected disagreement (when there is no reliability).
The observed disagreement, d_o, is

d_o = (1 / n) Σ_c Σ_k o_ck δ²_ck,

and the expected disagreement, d_e, is

d_e = (1 / (n(n − 1))) Σ_c Σ_k n_c n_k δ²_ck.

The value n is the total number of pairable values c and k. The values o_ck, n_c, n_k, and n are all frequencies in the coincidence matrix defined in Table 3, which is built from the reliability matrix (Table 2). For this study, the ice concentration categories were treated as ordinal data, which is applicable as the ice concentration categories can be treated as ranks. That is, the lowest rank has the least amount of sea ice and the highest rank has the most. For ordinal data, the metric difference, δ²_ck, is

δ²_ck = ( Σ_{g=c}^{k} n_g − (n_c + n_k) / 2 )².
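A compact sketch of the computation for complete data (every rater rates every unit), using the simpler interval metric δ²(c, k) = (c − k)² instead of the ordinal metric used in the study; ratings below are illustrative:

```python
from collections import Counter

def krippendorff_alpha(ratings):
    """Krippendorff's alpha for complete data with the interval metric
    delta^2 = (c - k)^2. `ratings` is a list of units (polygons), each a
    list of the m raters' values. A sketch; the study used the ordinal
    metric over the same coincidence-matrix construction."""
    m = len(ratings[0])
    o = Counter()                      # coincidence matrix o_ck
    for unit in ratings:
        for i in range(m):
            for j in range(m):
                if i != j:             # all ordered pairs within the unit
                    o[(unit[i], unit[j])] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v * (c - k) ** 2 for (c, k), v in o.items()) / n
    d_e = sum(n_c[c] * n_c[k] * (c - k) ** 2
              for c in n_c for k in n_c) / (n * (n - 1))
    return 1 - d_o / d_e
```

Perfect agreement puts all coincidence mass on the diagonal, so d_o = 0 and α = 1.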

Comparison of Estimates from Generated Polygons against MAGIC
This part of the study addresses the first objective: comparing analyst-estimated ice concentrations with ice concentrations derived from image segmentation using MAGIC.

Segmentation is not necessarily more accurate than ice charts; in fact, ice charts are generally considered the more accurate of the two. We therefore compared the visually estimated ice concentrations against all segmentation results first.
Figure 7 shows the combined contingency table of analyst responses against the segmentation results. Perfect agreement between analyst estimation and the segmentation results lies along the diagonal; entries below (above) the diagonal show over- (under-) estimation by analysts. The analysts tend to over-estimate the ice category with respect to MAGIC. Figure 8 shows the individual contingency tables of responses for each analyst that participated in this exercise. Figure 9 compares the two marginal distributions in Figure 7. Over-estimation of the low ice concentration categories (i.e. 2/10 to 4/10) results in fewer polygons with low ice concentrations, and also increases the number of polygons with high ice concentrations (9/10 to 10/10).
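The contingency tables above are built by tallying paired (analyst, MAGIC) categories; mass below the diagonal then measures analyst over-estimation. A sketch with hypothetical paired estimates in tenths:

```python
def contingency(analyst, magic, k=11):
    """k x k table of paired categories (0..10 tenths):
    rows = analyst estimate, columns = MAGIC category."""
    table = [[0] * k for _ in range(k)]
    for a, m in zip(analyst, magic):
        table[a][m] += 1
    return table

# Hypothetical paired estimates; analyst above MAGIC in two cases
t = contingency([3, 5, 9, 9], [2, 5, 8, 9])
over = sum(t[i][j] for i in range(11) for j in range(11) if i > j)
print(over)   # → 2  (below-diagonal mass: analyst over-estimation)
```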
We then subset the data to only those responses where the analyst stated the segmentation was valid. In the cases where some analysts accepted the segmentation results while others did not, we only considered the responses where it was deemed valid. Figure 10 shows the combined responses from all participants in this study (individual responses are shown in Figure 11).
As expected, removing the cases where analysts reported the segmentation results were invalid reduced the bias and narrowed the spread of the estimates.

The kappa statistic was calculated to measure the level of agreement between analysts and MAGIC (Figure 13). Both an unweighted and a weighted kappa were used. The unweighted kappa penalized all disagreements equally, while the weighted kappa applied greater penalties for larger differences in ice concentration estimates (i.e. farther from the diagonal).
The values for the weighted kappa were higher than for the unweighted kappa, indicating that the spread of ice concentration estimates with respect to the diagonal was small. The weighted kappa assessed for all 76 polygons and all analysts as a group was positive, and no kappa value was negative even when the responses were subset to only the MAGIC results that the analysts found acceptable.

Inter-Rater Reliability Between Analysts and MAGIC
For this part of the analysis, we employed Krippendorff's alpha to measure the inter-rater agreement between all eight analysts and MAGIC. The use of Krippendorff's alpha allowed us to assess how much analyst responses differed from MAGIC (and among themselves). This was important since we used MAGIC as a reference standard against which to compare analyst estimates in the previous section (Figure 15).

To determine whether analysts disagreed more with MAGIC or with each other, we sequentially removed each participant (analysts and MAGIC) from the group and recalculated Krippendorff's alpha for the remaining participants. The participant whose removal caused the largest increase in Krippendorff's alpha, that is, the participant whose estimates disagreed the most with the remainder of the group, was removed first. This process was repeated until only the two analysts whose estimates best agreed with one another remained. The results are shown in Figure 14. Sequentially removing each participant from the group to maximize the increase in α suggests an order by which participants have the greatest disagreement with the rest of the group. It also identifies which individuals could potentially benefit from additional training to ensure consistency among analysts when visually estimating ice concentrations. Figure 14 shows that one analyst disagreed with all participants the most; MAGIC had the second largest disagreement. This illustrates that most analysts have high agreement, which leads to high inter-rater reliability in ice concentration estimates. Finally, a Krippendorff value of α = 0.814, 95% CI [0.799, 0.829], measuring the inter-rater agreement between analysts and MAGIC, was found when the polygons were subset to only the estimates where analysts validated the segmentation results (Figure 16). The high α value (compared to 0.762 and 0.779) indicates agreement between analysts and MAGIC is strongest when only validated MAGIC estimates were included. Even when estimates were subset to only validated polygons (compared to when they were not), one analyst disagreed the most with the group's estimates, and MAGIC's estimates had the second highest disagreement.
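The sequential-removal procedure above is a greedy loop over raters. In the sketch below a simple inverse within-polygon variance score stands in for Krippendorff's alpha (any scalar agreement measure fits the same loop); the rater names and estimates are hypothetical, chosen so that an outlier rater and then the algorithm are removed first:

```python
def agreement(ratings):
    """Stand-in scalar agreement score (higher = more agreement):
    inverse of summed within-polygon variance across raters.
    The study used Krippendorff's alpha here instead."""
    raters = list(ratings)
    n_items = len(ratings[raters[0]])
    total = 0.0
    for i in range(n_items):
        vals = [ratings[r][i] for r in raters]
        mu = sum(vals) / len(vals)
        total += sum((v - mu) ** 2 for v in vals) / len(vals)
    return 1.0 / (1.0 + total)

def elimination_order(ratings):
    """Repeatedly drop the rater whose removal most increases agreement,
    until two remain; returns raters from most to least discordant."""
    pool = dict(ratings)
    order = []
    while len(pool) > 2:
        scores = {r: agreement({q: v for q, v in pool.items() if q != r})
                  for r in pool}
        worst = max(scores, key=scores.get)  # removing this rater helps most
        order.append(worst)
        del pool[worst]
    return order

estimates = {                   # hypothetical estimates in tenths
    "A1": [5, 6, 7, 8],
    "A2": [5, 6, 7, 8],
    "A3": [5, 6, 8, 8],
    "MAGIC": [4, 5, 6, 7],      # algorithm reads slightly lower
    "A4": [9, 9, 2, 3],         # discordant rater
}
print(elimination_order(estimates))   # → ['A4', 'MAGIC', 'A3']
```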

Comparison of Estimates Between Analysts Only Using the Group Modal Response
For this section of the analysis, the mode, or most commonly reported sea ice concentration by analysts, was used as the standard against which all other estimates were compared. This removed the dependence on segmentation results completely, isolating the assessment to only the spread and variability of estimates between analysts.

A polygon was assigned a single modal value if there was only one mode in the ice concentration estimates. If there was more than one mode, both were considered valid, and spread was determined using the closest modal value. For example, if a polygon had the modes 5/10 and 6/10, then an ice concentration estimate of 4/10 was considered a 1/10 under-estimation away from 5/10. In the event that there were two modes with a gap, such as 4/10 and 6/10, the midpoint between the modes was used (e.g. 5/10).
For three modes, the middle modal value was used.
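The tie-breaking rules above can be sketched as a small helper (hypothetical, not the study's implementation; estimates are in tenths):

```python
from statistics import multimode

def spread_from_mode(estimates, value):
    """Signed distance (in tenths) from `value` to the group's modal
    value, following the tie rules: closest mode for adjacent modes,
    midpoint for a gapped pair, middle mode for three modes."""
    modes = sorted(multimode(estimates))
    if len(modes) == 2 and modes[1] - modes[0] > 1:
        ref = (modes[0] + modes[1]) // 2       # midpoint of a gapped pair
        return value - ref
    if len(modes) == 3:
        return value - modes[1]                # middle of three modes
    # single mode, or adjacent modes: distance to the closest mode
    return min((value - m for m in modes), key=abs)

print(spread_from_mode([5, 5, 6, 6, 4, 7], 4))   # → -1
```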
A contingency table was produced to compare the spread and variability of all responses against the mode (Figure 17). The contingency table shows that, even in the absence of segmentation, analysts tend to over-estimate ice concentrations compared to the modal estimate by the group for low concentrations (1/10 to 3/10). The largest spread in ice concentration estimates is found at 5/10 ice concentration. The spread of the analysts' estimates away from the modal value, for all ice concentration categories, follows a normal distribution when all responses are collapsed. Figure 18 shows the difference in the marginal distributions in Figure 17.

Accuracy of Ice Concentrations in Canadian Ice Service Ice Chart Polygons
This section focuses on the accuracy of ice concentrations in polygons from the perspective of chart users, such as the shipping industry, modellers, climatologists, and other researchers. For these interpretations, we assume that the results of this exercise extend to visual estimation of ice concentration in all Canadian Ice Service ice charts. Recall that the main research question of this study was to determine how reliable the ice concentration estimate in an ice chart polygon is, that is, how often the ice concentration given in the ice chart matches the ice concentration in the SAR image used to create the chart. This is assessed using the producer's and user's accuracies (refer to Section 4 for more detail).

We first determined the producer's accuracy in order to assess the accuracy of the ice concentration estimates in the charts.
The producer's accuracy gives the probability that an analyst will assign a polygon the correct (SAR) ice concentration category. Figure 19 shows the producer's accuracy, derived from Figure 10. The producer's accuracy is low overall. For example, the producer's accuracy for 8/10 shows that analysts correctly label a given polygon as 8/10 (according to MAGIC) 39% of the time. Analysts have an overall accuracy of 38%, 95% CI [33%, 43%], in estimating ice concentration to the exact tenth; this increases to 84%, 95% CI [80%, 87%], when the condition for accuracy is relaxed to +/- one tenth.
We then evaluated the user's accuracy (Figure 20), also derived from Figure 10. The user's accuracy gives the probability that a polygon assigned a given ice concentration by an analyst actually has that ice concentration in the SAR image (as calculated by MAGIC). (Exact values of the user's accuracy can be found in Table A1 and Table A2.)
Most ice concentration categories have similar producer's accuracy scores but varying user's accuracy scores. The large confidence interval range is due to the small sample sizes (sample sizes ranged between n = 29 and n = 78 for the ice concentration categories shown); a tighter confidence interval would require a greater number of polygons in this study or more analyst participation. The size of the confidence interval range is not due to the variability of estimates by analysts. Figure 20 also shows that the over-estimation of ice concentration by analysts results in higher user's accuracy for lower ice concentration categories. That is, since analysts tend to over-estimate ice concentration, the hits (and hit rate) for the overestimated ice categories are also increased, while low concentrations are more likely to be accurately estimated by analysts.

Discussion and Conclusions
In this study we analyzed the distribution of ice concentrations estimated visually by analysts and forecasters using SAR imagery. Visually estimated ice concentrations were compared against three different standards: automatically calculated ice concentrations, automatically calculated ice concentrations that were validated by analysts, and the mode of the visually estimated ice concentrations. In all three cases, ice concentrations were over-estimated for low ice concentration categories (1/10 to 3/10) and had high variability for middle ice concentration categories (5/10, 6/10). In general, the ice concentration estimates were consistent between analysts, and the analysts' estimates were overall in agreement with the automated segmentation estimates (as shown by Figures 7 and 10, and the high values of the kappa statistics in Figure 13).
The analysts' ice concentration estimates, compared to the automated segmentation estimates, exhibit an over-estimation for all ice concentration categories evaluated. This result was obtained not only when considering all polygons, but also when considering solely the polygons for which the automated segmentation was validated by the analysts (compare Figures 7 and 10), suggesting that analysts routinely over-estimate ice concentration values. When analyzing consistency between analysts, it was found that seven out of eight analysts strongly agreed with one another on the ice concentration estimates; one analyst was found in strong disagreement with the others, and the automated segmentation algorithm ranked second in terms of disagreement (Figure 14). This was also the case when subsetting the polygons to only those that analysts determined were valid (Figure 16).
Despite the small spread between analysts' estimates, ice concentration estimates from individual analysts can vary by as much as four ice concentration categories away from the modal value. Moreover, low to intermediate ice concentrations (2/10, 3/10, 4/10) were slightly overestimated by the analysts when compared to the modal value (Figure 17).
Finally, the accuracy of the analysts' ice concentration estimates against SAR images was assessed using the producer's and user's accuracies. The probability that a given polygon in an ice chart was assigned the correct (SAR) ice concentration category was 38%, increasing to 84% when the condition was relaxed to within one ice concentration category. It should be noted that the images used for this study were selectively picked to be areas with well-defined floes with high contrast against the black (water) background. SAR image quality varies from image to image, and even within a single image.
Likewise, the structure of sea ice in Canadian waters can vary greatly, with brash and rubble ice along the East Coast and well-defined floes in the Beaufort Sea. Ice without well-defined shape may not be captured due to the resolution of the sensor.
This study quantifies the accuracy of sea ice concentration estimates under the best-case scenario of well-defined floes in very clear SAR images. It is expected that accuracy would decrease under brash ice conditions and/or poor image quality.
It is possible to infer some preliminary comments about the use of segmentation for automatically classifying ice concentrations. Overall, analysts found the segmentation results were good but consistently missed strips and patches, or new ice. Perhaps this was due to the selection of images towards the summer months; sea ice floes in the summer can be covered by surface melt and/or characterized by new ice growth. Both of these could have impacted the ability of the algorithm to capture all of the ice, leading to consistent under-estimation of sea ice concentration by the segmentation algorithm. Using both HH and HV polarizations would likely have yielded a better segmentation of the sea ice in the images, but HV was not used in this study.
Another possible factor that may contribute to the accuracy of ice concentration estimates is the size of the polygons. A large polygon requires an analyst to zoom out to view a larger geographic area. The size and shape of the floes inside the polygon may also have an impact on the accuracy of ice concentration estimates. The sample polygons used in this study were smaller than polygons found in corresponding image analysis charts and daily charts (not shown), which indicates that analysts may have had to zoom in to the image more than they would for regular ice charting. It is unknown if the size of the polygon affects the estimation of ice concentration within the polygon.
Another area of interest not investigated in this study is the potential for variation in how analysts define a polygon. In this study, analysts were presented with pre-defined polygons to isolate the variability in responses to their ice concentration estimates only. If the size of a polygon impacts an ice concentration estimate, then how an analyst defines a polygon may also impact their ice concentration estimate.
During this study, the CIS Operations analysts and forecasters who participated voiced interest in using the tool developed here as a training tool for new analysts. The tool could be used to measure whether, over time, analysts' variability converges towards a common value with training. There was also interest in using this tool as an ISO metric; quantifying the level of agreement between individual analysts provides a reliability measure of the spread of estimates of the group.
The differences between skilled human interpretation of ice concentration and automated algorithms need to be better understood before automated ice classification schemes can be widely adopted in operational ice services. With respect to estimating ice concentration in this study, this means understanding why the analysts disagreed with the automated output (181 times out of 608 responses; refer to Figure 6).
The variability of ice concentration estimates can impact end users as the resulting distribution of ice concentrations changes.

(Table A2 caption: Probability of a polygon being assigned correctly for each ice concentration category (user's accuracy, or p(m|x)). p̂ refers to the point estimate of the likelihood found in our sample data; lower and upper refer to the lower and upper 95% confidence intervals, calculated using the Sison-Glaz multinomial confidence interval method. For any ice concentration category where the p̂ values in a row do not sum to 1.0, the remainder was found outside of ± three ice concentration categories from the category denoted in the first column.)