A note about the assessment of annotations

The goal of the M3D system is to offer the research community a framework that allows co-speech gesture to be annotated in a multidimensional manner. Though this approach is largely rooted in the leading approaches developed by David McNeill and Adam Kendon, it is important to consider that there is a wide variety of theoretical approaches to the study of gesture. Additionally, some of the aspects of gesture that need to be coded can be quite subjective in nature, in that the labeler must interpret what the speaker means to represent, say, or do with their gesture. For these reasons, it is virtually impossible to claim that there is a single, absolutely correct annotation for any given gesture.

Let’s consider a few things that illustrate this point:

Researchers may adhere to specific theoretical approaches that guide their labeling


Gesture studies is a highly interdisciplinary research area, and researchers may employ specific, previously established definitions or theoretical approaches when assessing aspects of gesture. Take gesture referentiality as an example. M3D allows labelers to label a referential gesture as having degrees of iconicity, metaphoricity, or deixis (as opposed to forcing annotators to decide whether a single gesture is “iconic”, “metaphoric”, or “deictic” in nature). However, researchers may vary in how they define “deixis”. Some may define deixis very conservatively (e.g., pointing to a locus in space to indicate the location of an object or entity), while others may take a more inclusive definition (e.g., any use of spatial location or illustration of motion; for more discussion, see Gullberg, 1998, p. 94). The objective of M3D is not to set standard definitions for such concepts in the field, but rather simply to offer a tool that allows researchers to apply their definitions in a non-mutually-exclusive way.
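As an illustration only (the field names here are hypothetical and not part of the M3D specification), a non-mutually-exclusive coding of referential dimensions can be thought of as a set-valued label attached to each gesture rather than a single category:

```python
# Hypothetical set-valued annotation: referential dimensions can co-occur
# rather than being forced into one mutually exclusive category.
gesture_annotation = {
    "gesture_id": "g042",
    "referential_dimensions": {"deictic", "iconic"},  # metaphoricity not marked here
}
```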

Labelers need to infer a speaker’s intentions, and decide when the communication of those intentions begins or ends


Based on a video, labelers must assess things like potential referential or pragmatic meanings, or decide how to segment the stream of movement into gesture phases (e.g., determining at which frame a stroke begins or ends, or whether a particular movement should be labeled as a stroke or a preparation). As labelers cannot read the speaker’s mind, they must make their best interpretation of where the gesture occurs and what sort of meaning is being communicated.

What this means for the M3D Training Program 

Keep in mind as you go through the training program that there is no single “correct” annotation. Specifically, when looking at the examples or doing the tasks, you may not agree with the interpretation that was given. If you would annotate something differently from the way it has been annotated in the M3D training program, that does not automatically make your annotation “wrong”. The important thing is that you annotate according to your own perception, and that you are able to justify WHY you would annotate it in a different way.

What this means as I label my own material with M3D 

We believe that any multimodal labeling should ideally take place in a team environment, where multiple labelers collaboratively label the data. The entire team should decide which conceptual definitions they will follow and how these are operationalized in the data (and, importantly, be transparent about reporting such decisions in any publications). As teams work through the data, there should be routine meetings to discuss complicated cases, as well as regular checks on inter-annotator reliability. In this way, a database that has been reliably coded following clear definitions should be replicable.


In addition to manual labeling, kinematic data may also help remove some of the subjectivity (e.g., using motion-tracking data to better assess gesture phase boundaries). For this, the EnvisionBox website is a valuable resource.
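As a rough sketch of how motion-tracking data could support phase segmentation, the snippet below flags low-velocity frames in a wrist-position trace as candidate holds or phase boundaries. This is a generic velocity-threshold heuristic offered for illustration, not a procedure prescribed by M3D or EnvisionBox, and the threshold value is an assumption:

```python
import numpy as np

def low_velocity_frames(wrist_xy, fps, speed_threshold=0.15):
    """Flag frames where wrist speed drops below a threshold.

    wrist_xy: (n_frames, 2) array of tracked wrist positions
    fps: video frame rate in frames per second
    speed_threshold: speed (position units per second) below which a frame
        is treated as a candidate hold/rest, i.e., a possible phase boundary
    """
    # Frame-to-frame displacement converted to speed per second
    speed = np.linalg.norm(np.diff(wrist_xy, axis=0), axis=1) * fps
    return np.where(speed < speed_threshold)[0]

# Example with synthetic data: a movement followed by a near-still hold
positions = np.vstack([np.linspace([0, 0], [1, 1], 30),
                       np.full((10, 2), [1, 1])])
print(low_velocity_frames(positions, fps=25))
```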

How can I check reliability? 

ELAN has a great built-in function to assess reliability. Please check out the ELAN manual for more information. 


For assessing the reliability of multidimensional annotations, for example to assess agreement in the annotation of referential dimensions, we recommend the Measuring Agreement on Set-Valued Items (MASI) technique. MASI offers a distance metric between set-valued labels, which can then be used in typical statistical measures of agreement, such as Krippendorff’s alpha. For more information about MASI, see Passonneau (2006) and Artstein & Poesio (2008). For its application in R, have a look at this R script. To run the script, it is necessary to have both R and Python installed on the computer, and the data should look like this.
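To give a feel for how MASI and Krippendorff’s alpha combine, here is a minimal sketch using Python’s NLTK library. It is an illustrative alternative to the R script mentioned above, not that script itself; the coder, item, and label names are made up, and MASI expects set-valued labels to be passed as frozensets:

```python
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record is (coder, item, set of labels assigned to that item).
data = [
    ("coder1", "gesture_01", frozenset({"deictic"})),
    ("coder2", "gesture_01", frozenset({"deictic", "iconic"})),
    ("coder1", "gesture_02", frozenset({"metaphoric"})),
    ("coder2", "gesture_02", frozenset({"metaphoric"})),
]

# Krippendorff's alpha computed with MASI as the distance between label sets
task = AnnotationTask(data=data, distance=masi_distance)
print("Krippendorff's alpha (MASI):", task.alpha())
```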

Useful Resources

ELAN manual, reliability section: https://www.mpi.nl/tools/elan/docs/manual/index.html#Sec_Calculate_inter-annotator_reliability.html 


EnvisionBox Website (Platform which contains a series of modules that will help you process and analyze multimodal data streams - particularly useful for motion tracking techniques, database building, and statistical analyses!): https://envisionbox.org/  

Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.

Gullberg, M. (1998). Gesture as a communication strategy in second language discourse: A study of learners of French and Swedish. Lund University Press.

Passonneau, R. (2006). Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC ’06). http://www.lrec-conf.org/proceedings/lrec2006/pdf/636_pdf.pdf