Challenge: Identifying Gesture Units

How are gestures grouped into larger Gesture Units?

Learning Objectives 

In this challenge, we will:

- learn what the prosodic dimension of gesture is and what manual Gesture Units (G-Units) are
- learn how to identify Gesture Units and separate them from periods of rest
- practice recognizing and annotating Gesture Units in ELAN

Explanatory Video: Identifying Gesture Units 

Let's watch the video to learn more about Gesture Units! We recommend watching it in full screen to see the gestural movements in the examples better. 

Transcript of the video


In this video we will learn about Gesture Units, or G-Units. We’ll start by defining what we mean by the prosodic dimension of gesture, and then define what manual Gesture Units are. Then, we will explain how to identify and annotate them. Finally, we’ll explore a few examples together. This is an important first step when using M3D, as the G-Unit annotations serve as the foundation for all of the other annotations you will create.

 

The Prosodic Dimension of Gesture

 

In M3D, Gesture Units belong to the prosodic dimension of gesture. We use the term “prosody” in a general sense to mean raw organizational structure: that is, how events can be grouped together at various levels, and how certain parts of those groupings may be more prominent than others. Prosody in gesture can be seen as homologous to prosody in speech. In both domains, larger chunks can be broken down into smaller phrases: in gesture, larger gesture units can be broken down into smaller gesture phrases, while in speech, larger intonational phrases can be broken down into smaller intermediate phrases. That is, gesture has a prosodic structure, just as speech does.

 

What are Gesture Units?

 

The Gesture Unit annotation captures the largest chunk or unit of gestural movement. Gesture Units are considered to be the central units of analysis of manual gesture action. They consist of groupings of movements that are assembled and perceived to be part of an individual phrase or unit. From an annotation perspective, the manual Gesture Unit is defined as the span of time from when the hands leave rest to when they return to rest. We can think of Gesture Units as roughly corresponding to the spans of time when the speaker is using the hands to communicate. Essentially, this means that at this stage of annotation, we are roughly detecting the temporal regions in which communicative manual movements are occurring, and separating those from the regions in which they are not.

 

Since a G-Unit is largely identified by having a “rest” at both ends, it is important to understand exactly what we mean by a rest. We consider a speaker to be at rest when their hands are in a rest position with minimal movement, indicating that they are not actively gesturing. Rest can be, for example, full relaxation, such as when the hands are still and down by the speaker’s sides, on the arms of a chair when sitting, or folded in the lap or in front of the body. But we can also identify rest occurring in other locations; for example, some people may rest with their arms slightly bent and their hands held up in front of them.

 

We call these other positions “partial rest positions” because, even though the arms are not totally “relaxed”, the lack of movement and intention in these positions suggests that the speaker is in some sort of a “home base”, or a place to put their hands when not actively using them to communicate.

 

How do we Identify G-Units?

 

It is important to note that when identifying G-Units, we should be looking for moments of rest which include both a pause in movement and a return to a fully relaxed or partial rest position. When only one of these two is present, Gesture Unit identification may become a little bit trickier. For example, if a speaker stops moving their hands but does not actually return to a rest position, we would call this a hold phase. When holding, the hand usually stays tense, giving us the perception that the speaker has not finished gesturing and is only putting the gesturing on pause. In such cases, we would not treat this ‘hold’ event as the end of a Gesture Unit.

 

Alternatively, the hands may return to a rest position but immediately leave again to continue gesturing. Even though the hands physically return to a rest position, it appears that the speaker has not actually finished gesturing, and thus this return should not be coded as the end of a Gesture Unit. Importantly, labelers should trust their perception of the grouping of movement. Labelers should ask themselves whether they perceive a region of gestural movement as two units with a “rest” in between them, or as a single one, and annotate accordingly. Even with practice, some cases may remain quite tricky. Later we will look at some particularly unclear cases and provide a few tips which can help you make a decision.

 

How do we annotate G-Units?

 

Before starting to annotate the Gesture Units in a video sample, it is important to watch the entire video at full speed, with the sound on, to get a feel for the speaker’s gestural style. Pay attention to how the speaker gestures, and ask yourself questions like “Where are their hands when they seem to not be gesturing?”, “Do they produce long spans of gestural movement, or are they generally short?”, and “How complex are their movements when they are gesturing?” By thinking about these sorts of things, you will sensitize yourself to the speaker’s style of gesture, and this can be very useful when annotating.

 

When you start annotating Gesture Units, you want to make sure that you are not biased by the speech, since it is well established that speech can influence the perception of accompanying gestures. This means the audio should be turned off when annotating Gesture Units. You may also find it useful to play the video at various speeds, which can make it easier to identify Gesture Units.

 

Examples

 

Here is an example of a single G-Unit [see example]. We see that at first both hands are at rest, in a central (partial rest) position. Both of the speaker’s hands leave this rest position to do some gestural movement before returning again to rest, which happens to be in that same partial rest position.

 

Here is another example, but this time there is a sort of pause in movement [see example]. This pause may seem to indicate a rest, leading us to annotate this as two separate Gesture Units. However, the hands remain rather tense and held in position. These cues suggest that this is a ‘hold’, or part of a gestural communicative act, which the speaker has not yet finished. Thus we would consider this as a single G-Unit.

 

As we have already mentioned, sometimes a gestural excursion may return to a rest position but immediately leave again to execute more gestural movements, without pausing. While a labeler’s perception of the groupings is generally quite clear, such cases may be quite tricky, and labelers may not be able to decide whether it is one Gesture Unit or two. In such cases, we recommend checking to see if there is a perceivable pause when playing the video at full speed. If there is still any doubt, we recommend using frame-by-frame analysis to see if the hands remain still for at least 300 milliseconds. This number corresponds to the average duration of stillness at which people perceive pauses in movement, and it is useful as a support for inter-annotator reliability, giving multiple annotators a concrete criterion for coming to the same conclusion in ambiguous cases.
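
If it helps to make the 300 millisecond criterion concrete for your own material, here is a minimal Python sketch (not part of the M3D materials) that converts the threshold into a number of video frames; the frame rates below are only examples, so substitute the actual frame rate of your video.

```python
import math

def stillness_frames(threshold_ms: float = 300.0, fps: float = 25.0) -> int:
    """Number of consecutive still frames needed to meet the stillness threshold."""
    frame_duration_ms = 1000.0 / fps          # duration of a single frame in ms
    return math.ceil(threshold_ms / frame_duration_ms)

# At 25 fps a frame lasts 40 ms, so 300 ms of stillness spans at least 8 frames;
# at 30 fps it spans at least 9 frames.
print(stillness_frames(fps=25.0))   # -> 8
print(stillness_frames(fps=30.0))   # -> 9
```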

 

Let’s have a look at a more complex Gesture Unit with lots of hand movements. Pay attention to the left hand, and remember that we are looking for moments where the hand returns to a resting state [see example]. In this example, we see that the speaker’s left hand indeed returns to a central rest position where the hand relaxes, but we don't perceive a pause, as the hand immediately starts to move into a new gesture. The guidelines above would lead labelers to converge on annotating this somewhat ambiguous example as a single Gesture Unit. 

 

Wrap-up


In this video, we’ve explored what manual Gesture Units are and how we annotate them. Remember: Gesture Units refer to spans of time from when the hands leave a rest position to when they return to a rest position. Also, rest positions may involve total relaxation, with the hands by the sides for example, as well as partial relaxation, where the hands and arms may not be fully relaxed, yet are not actively gesturing either; the position seems to be a little bit less intentional. Such differences can be more readily captured if we understand a speaker’s individual gestural style, so it's important to familiarize oneself with each speaker’s style before annotating. Thanks for watching!


Task 1: Recognize the Gesture Units 

Practice identifying the areas in which manual movement occurs and separating them from the areas in which it does not. Remember, Gesture Units can consist of multiple gestures. Click on the "Let's go" button to start the task. The videos within the task might take a minute to load. If you have trouble accessing the task, please click here.


Duration: about 7 minutes 

Task 2: Select the best Gesture Unit annotation

Click on the "Let's go" button to start the task. The videos within the task might take a minute to load. If you have trouble accessing the task, please click here.


Duration: about 10 minutes 


This task requires basic knowledge of ELAN 

Annotation Tutorial: Let's annotate together!

Watch this annotation tutorial to find out how to annotate Gesture Units in ELAN and to prepare for the final task of this challenge.

Task 3: Now it's your turn!

This task requires knowledge of ELAN 

Click on the button below to download the .mp4 and the .wav file, and import them together with the M3D template into ELAN (if you haven't downloaded ELAN yet, see the link below).

Try annotating Gesture Units on your own. After that, check the solutions and compare your annotations.
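
If you are comfortable with a little scripting, a sketch like the following can also help with the comparison by listing the time spans you annotated. It assumes the third-party pympi-ling package and uses hypothetical file and tier names, so adjust them to match your own file and the tier names in the M3D template.

```python
# Optional aid: list Gesture Unit annotations from an ELAN .eaf file so you can
# line your spans up against the solution file side by side.
# Requires: pip install pympi-ling
import pympi

eaf = pympi.Elan.Eaf("my_annotations.eaf")   # hypothetical file name
print(eaf.get_tier_names())                  # locate the Gesture Unit tier

# "GestureUnit" is a hypothetical tier name; use the one from the M3D template.
for start_ms, end_ms, value in eaf.get_annotation_data_for_tier("GestureUnit"):
    print(f"{start_ms}-{end_ms} ms: {value}")
```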

Useful Resources

Download ELAN: https://archive.mpi.nl/tla/elan/download

M3D Resources (Template, Manual, M3D-TED corpus): https://osf.io/ankdx/