Rater reliability has been one of the core considerations in human-mediated language assessment. It is also important for training machine learning algorithms in automated language assessment, since human ratings serve as the labels that models learn to predict from automatically extracted linguistic features; reliable human ratings are therefore a prerequisite for accurate model predictions. This study meta-analyses inter-rater reliability coefficients in human-mediated assessment, using second language English prosody (i.e., stress, intonation, and rhythm) as an example. Prosodic features have been found to correlate significantly with comprehensibility and communicative success in second language English speech. However, existing prosody assessments vary greatly in construct operationalisation, rater background, and the rating scales used. This meta-analysis aims to understand how these variations might influence rater reliability. A Bayesian meta-analysis was adopted because it can incorporate prior knowledge, quantify evidence for a true null effect, model uncertainty directly, and allow intuitive comparison of model fit. A total of 441 reliability estimates were extracted from the screened articles (n = 107); this paper focuses on inter-rater reliability as indexed by Cronbach's alpha (k = 127). The overall inter-rater reliability was 0.92, with a 95% credible interval from 0.87 to 0.96. The between-study heterogeneity (τ = 0.65) indicates substantial variation among studies. Inter-rater reliability was higher when prosody was assessed at the global level rather than at the level of specific features, and when rating scales were accompanied by specific descriptors rather than labels only at the endpoints. This meta-analysis calls for further improvement in prosody assessment through clearer construct definitions and refined rating scales, and has implications for automated prosody assessment.
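To make the modelling approach concrete, below is a minimal sketch of how a Bayesian random-effects meta-analysis of Cronbach's alpha coefficients could be set up in PyMC. The per-study data, the Bonett transformation of alpha, the priors, and all variable names are illustrative assumptions for this sketch, not the specification used in the study.

```python
# Minimal sketch: Bayesian random-effects meta-analysis of Cronbach's alpha.
# Data, transformation, and priors are illustrative assumptions only.
import numpy as np
import pymc as pm

# Hypothetical per-study inputs: alpha estimates, number of raters, sample sizes
alphas = np.array([0.88, 0.93, 0.81, 0.95, 0.90])
k_raters = np.array([2, 3, 2, 4, 2])
n_speakers = np.array([40, 60, 30, 80, 50])

# Bonett-style transformation: y = ln(1 - alpha), with approximate sampling variance
y = np.log(1.0 - alphas)
v = 2.0 * k_raters / ((k_raters - 1.0) * (n_speakers - 2.0))

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=2.0)            # pooled effect on transformed scale
    tau = pm.HalfNormal("tau", sigma=1.0)               # between-study heterogeneity
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(y))  # study-level true effects
    pm.Normal("y_obs", mu=theta, sigma=np.sqrt(v), observed=y)  # observed transformed alphas

    # Back-transform the pooled estimate to the alpha scale for reporting
    pm.Deterministic("pooled_alpha", 1.0 - pm.math.exp(mu))

    idata = pm.sample(2000, tune=2000, target_accept=0.95)
```

In a sketch like this, the posterior for `pooled_alpha` corresponds to the pooled reliability and credible interval reported in the abstract, while `tau` captures between-study heterogeneity; moderators such as global versus specific assessment or scale descriptor type could be added as study-level predictors of `theta`.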
Zoom link: us02web.zoom.us/j/86206272874?pwd=QzZSZCtCWnNIeUh0cHAvZzNXWWYwZz09