Friday 22 July 2016

John Bateman On The Need For Inter-Rater Reliability In SFL Research

Low interrater reliability scores can be indicative of several things, most of which relate to problems that should be considered. One cause of low reliability would be that the coders don't fully understand, or agree on, the way categories should be applied; another cause is that the categories are intrinsically poorly defined, and so agreement is unlikely.
Thus, when categories are meant to be applicable according to some theory (e.g., SFL), checking whether coders can actually apply the categories is not a bad step. The paper:
O'Donnell, M.; Zappavigna, M. & Whitelaw, C. (2008) 'A survey of process type classification over difficult cases'
In:
Jones, C. & Ventola, E. (Eds.) From Language to Multimodality: new developments in the study of ideational meaning, Equinox Publishing Ltd., 47-64.
provides some sobering data on how reliably some categories in transitivity are being applied.
I too would recommend the O'Donnell et al. paper. Another relevant paper is
Laura Gwilliams and Lise Fontaine. 2015. Indeterminacy in process type classification. Functional Linguistics, 2:8, pages 1-19.

Blogger Comments:

[1] Lack of theoretical understanding is clearly the major reason why Systemic Functional linguists disagree on analyses, as demonstrated by any public discussion in which instances are analysed (e.g. on Sysfling, Sysfunc, or the Systemic Functional Linguistics Interest Group).

[2] To be clear, it is not so much that "categories are intrinsically poorly defined" but that, on the SFL model, language itself is said to be an indeterminate system. For the types of indeterminacy, and the reasons for it, see the views of Halliday & Matthiessen here.

[3] Neither of these papers includes, in its experimental design, the most fundamental principle of grammatical analysis: taking a trinocular perspective.  Halliday & Matthiessen (1999: 504):
A stratified semiotic defines three perspectives, which (following the most familiar metaphor) we refer to as ‘from above’, ‘from roundabout’, and ‘from below’: looking at a given stratum from above means treating it as the expression of some content, looking at it from below means treating it as the content of some expression, while looking at it from roundabout means treating it in the context of (i.e. in relation to other features of) its own stratum.
Halliday & Matthiessen (2004: 31):
We cannot expect to understand the grammar just by looking at it from its own level; we also look into it ‘from above’ and ‘from below’, taking a trinocular perspective. But since the view from these different angles is often conflicting, the description will inevitably be a form of compromise.
In contrast, O'Donnell, in O'Donnell et al. (2008: 63), demonstrates no knowledge of this fundamental principle, framing the problem in terms of a lack of explicit coding criteria:
Both our analysis of individual clauses (Section 3) and of the grouping of coders (Section 4) show that the divide between using conceptual vs. syntactic criteria is widespread throughout the community as a whole, and each individual chooses which path they follow. This is, we believe, the result of the lack of explicit coding criteria in general, and argue that what the community needs is explicitly stated sets of criteria for coding practices, and perhaps distinct criteria descriptions for particular applications.
Gwilliams & Fontaine (2015: 17) are similarly oblivious, recommending a more "delicate" two-level grammatical analysis, one semantic and one syntactic:
Although the motivation for a single-level analysis of experiential meaning is desirable, it does not appear that a one-dimensional classification is always sufficient to account for both syntactic and semantic realisation. If a representative analysis is to be maintained within the SFL framework, it appears that a more delicate analysis of the experiential meta-function is required, in order to provide the individual with all the relevant tools to conduct a fully representative analysis. Specifically the option to annotate syntactic and semantic interpretations separately would alleviate problems associated with the lack of correspondence between these levels.
Lack of theoretical understanding is the pervasive problem in the SFL community, whether it be in analysing language, or in analysing analyses of language — or, indeed, in workbooks designed to teach the theory (evidence here and here).

Inter-rater reliability is merely a statistical measure of the degree of agreement among raters.  In a community where lack of theoretical understanding is demonstrably widespread, at all levels, agreement is not a measure of theoretical competence.
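
To make the point concrete, here is a minimal sketch using Cohen's kappa, one common agreement statistic for two raters. It is not necessarily the measure used in either paper cited above. The clause labels, the two coders and the "reference" analysis are all invented for illustration; the reference simply stands in for whatever analysis the theory would actually warrant.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, estimated from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def accuracy(rater, reference):
    """Proportion of items on which a rater matches the reference analysis."""
    return sum(r == g for r, g in zip(rater, reference)) / len(reference)


# Hypothetical data: six clauses coded for process type.
reference = ["material", "mental", "relational", "mental", "material", "relational"]
# Both coders share the same misunderstanding: they code mental clauses as material.
coder_1 = ["material", "material", "relational", "material", "material", "relational"]
coder_2 = list(coder_1)

kappa = cohen_kappa(coder_1, coder_2)
acc = accuracy(coder_1, reference)
print(f"kappa between coders: {kappa:.2f}")   # 1.00
print(f"accuracy vs reference: {acc:.2f}")    # 0.67
```

Under these assumptions the two coders agree perfectly with each other (kappa of 1.00) while matching the reference analysis on only four of six clauses. The statistic compares the coders with one another and says nothing about whether either of them has applied the theory correctly.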

The distinction between interpersonal agreement and experiential consistency is comically enshrined in this drawing by B. Kliban: