LINK:
http://wiki.nuigroup.com/Gesture_recognition
Background
Touchlib does a fine job of picking out contacts within the input surface. At the moment, there is no formal way of defining how those blob contacts are translated into intended user actions. Some of this requires application assistance to provide context, but some of it is down to pattern matching the appearance, movement and loss of individual blobs or combined blobs.
What I'm attempting to describe here, and would like others to contribute to, is:
- A standard library of gestures suitable for the majority of applications
- A library of code that supports the defined gestures and generates events for the application layer
Describing gestures in an XML dialect means that individual applications can tell the Gesture Engine which gestures they support. Custom gestures can also be supported.
By loosely coupling gesture recognition to the application, we can allow people to build different types of input device and plug them all into the same applications where appropriate.
In the early stages of development, we are all doing our own thing with minimal overlap. Over time we will realise the benefits of various approaches, and by using some standardised interfaces, we can mix and match to take advantage of the tools that work best for our applications. Hard coded interfaces or internal gesture recognition will tie you down and potentially make your application obsolete as things move on.
I'd really appreciate some feedback on this - this is just my take on how to move this forward a little at this stage.
Gesture Definition Markup Language
GDML is a proposed XML dialect that describes how events on the input surface are built up to create distinct gestures.
Tap Gesture
<gdml>
  <gesture name="tap">
    <comment>
      A 'tap' is considered to be equivalent to the single click event in a
      normal windows/mouse environment.
    </comment>
    <sequence>
      <acquisition type="blob"/>
      <update>
        <range max="10" />
        <size maxDiameter="10" />
        <duration max="250" />
      </update>
      <loss>
        <event name="tap">
          <var name="x" />
          <var name="y" />
          <var name="size" />
        </event>
      </loss>
    </sequence>
  </gesture>
</gdml>
The gesture element defines the start of a gesture, and in this case gives it the name 'tap'.
The sequence element defines the start of a sequence of events that will be tracked. This gesture is considered valid whilst the events sequence remains valid.
The acquisition element defines that an acquisition event should be seen (fingerDown in Touchlib). This tag is designed to be extensible to input events other than blob, such as fiducial markers, real world objects or perhaps facial recognition for input systems that are able to distinguish such features.
The update element defines the allowed parameters for the object once acquired. If the defined parameters become invalid during tracking of the gesture, the gesture is no longer valid.
The range element validates that the current X and Y coordinates of the object are within the specified distance of the original X and Y coordinates. range should ultimately support other validations, such as 'min'.
The size element validates that the object diameter is within the specified range. Again, min and other validations of size could be defined. Size allows you to distinguish between finger and palm sized touch events for example.
The duration element defines that the object should only exist for the specified time period (milliseconds). If the touch remains longer than this period, it's not a 'tap', but perhaps a 'move' or 'hold' gesture.
The loss element defines what should occur when the object is lost from the input device.
The event element defines that the gesture library should generate a 'tap' event to the application layer, providing the x, y, and size variables.
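As a rough illustration of how an engine might consume this definition, here is a minimal Python sketch that reads the three update constraints from the tap GDML and checks a tracked blob against them. Everything outside the GDML itself is an assumption made for the example: the names `load_tap_limits` and `match_tap`, the `(t_ms, x, y, diameter)` sample format, and the reading of range as a straight-line distance from the point of acquisition.

```python
import math
import xml.etree.ElementTree as ET

TAP_GDML = """
<gdml>
  <gesture name="tap">
    <sequence>
      <acquisition type="blob"/>
      <update>
        <range max="10" />
        <size maxDiameter="10" />
        <duration max="250" />
      </update>
      <loss>
        <event name="tap">
          <var name="x" />
          <var name="y" />
          <var name="size" />
        </event>
      </loss>
    </sequence>
  </gesture>
</gdml>
"""


def load_tap_limits(gdml_text):
    """Read the three <update> constraints of the 'tap' gesture."""
    root = ET.fromstring(gdml_text)
    update = root.find("./gesture[@name='tap']/sequence/update")
    return {
        "range_max": float(update.find("range").get("max")),
        "size_max": float(update.find("size").get("maxDiameter")),
        "duration_max": float(update.find("duration").get("max")),
    }


def match_tap(samples, limits):
    """Return a tap event dict, or None if any constraint was violated.

    samples: (t_ms, x, y, diameter) tuples from acquisition up to the last
    update before loss. This track format is hypothetical.
    """
    t0, x0, y0, _ = samples[0]
    for t, x, y, diameter in samples:
        # Assumption: range is the Euclidean distance from the origin.
        if math.hypot(x - x0, y - y0) > limits["range_max"]:
            return None
        if diameter > limits["size_max"]:
            return None
        if t - t0 > limits["duration_max"]:
            return None
    # On loss, report the last known position and size.
    _, x, y, diameter = samples[-1]
    return {"name": "tap", "x": x, "y": y, "size": diameter}
```

A track that stays within 10 units of its origin, under 10 units in diameter and under 250 ms yields a tap event. Any other track returns None, which leaves it free to be matched as a 'move' or 'hold' instead.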
Double Tap Gesture
<gesture name="doubleTap">
  <comment>
    A 'doubleTap' gesture is equivalent to the double click event in a normal
    windows/mouse environment.
  </comment>
  <sequence>
    <gestureRef id="A" name="tap" />
    <duration max="250" />
    <gestureRef id="B" name="tap">
      <onEvent name="acquisition">
        <range objects="A,B" max="10" />
      </onEvent>
      <onEvent name="tap">
        <range objects="A,B" max="10" />
        <event name="doubleTap">
          <var name="x" />
          <var name="y" />
          <var name="size" />
        </event>
      </onEvent>
    </gestureRef>
  </sequence>
</gesture>
This example shows how more complex gestures can be built from simple gestures. A double tap gesture is, in effect, two single taps with a short gap between them. The taps should be within a defined range of each other, so that they are not confused with taps in different regions of the display.
Note that the gesture is not considered invalid if a tap is generated in another area of the display. GestureLib will discard it and another tap within the permitted range will complete the sequence.
In the case of double tap, an initial tap gesture is captured. A timer is then evaluated, such that the gesture is no longer valid if the specified duration expires. However, if a second tap is initiated, it is checked to make sure that it is within range of the first. range is provided with references to the objects that need comparing (allowing for other more complex gestures to validate subcomponents of the gesture). This is done at the point of acquisition of the second object.
Once the second tap is complete and its 'tap' event has been raised, range is validated again, and a 'doubleTap' event is generated to inform the application of the gesture.
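The double tap sequence described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `match_double_tap` and the tap record fields (`down_ms`, `up_ms`, `x`, `y`, `size`) are hypothetical, the 250 ms timer is assumed to run from the loss of the first tap to the acquisition of the second, and a tap that arrives after the timer has expired is assumed to start a new sequence.

```python
import math


def match_double_tap(taps, max_gap_ms=250, max_range=10):
    """Scan completed taps in time order and return the first doubleTap.

    taps: list of dicts with 'down_ms', 'up_ms', 'x', 'y', 'size'
    (a hypothetical record of each 'tap' event). Returns None if no
    pair of taps completes the sequence.
    """
    first = None
    for tap in taps:
        if first is None:
            first = tap
            continue
        gap = tap["down_ms"] - first["up_ms"]
        if gap > max_gap_ms:
            # Timer expired: the sequence is invalid, so start again here.
            first = tap
            continue
        dist = math.hypot(tap["x"] - first["x"], tap["y"] - first["y"])
        if dist <= max_range:
            return {"name": "doubleTap", "x": tap["x"], "y": tap["y"],
                    "size": tap["size"]}
        # Out of range: discard this tap and keep waiting for another one.
    return None
```

Note how a tap elsewhere on the display is simply skipped rather than invalidating the sequence, which matches the behaviour described for GestureLib above.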
Move Gesture
<gesture name="move">
  <comment>
    A 'move' is considered to be a sustained finger down incorporating
    movement away from the point of origin (with potential return during
    the transition).
  </comment>
  <sequence>
    <acquisition type="blob" />
    <update>
      <range min="5" />
      <event name="move">
        <var name="x" />
        <var name="y" />
        <var name="size" />
      </event>
    </update>
    <loss>
      <event name="moveComplete">
        <var name="x" />
        <var name="y" />
        <var name="size" />
      </event>
    </loss>
  </sequence>
</gesture>
Zoom Gesture
<gesture name="zoom">
  <comment>
    A 'zoom' is considered to be two objects that move towards or away from
    each other in the same plane.
  </comment>
  <sequence>
    <compound>
      <gestureRef id="A" name="move" />
      <gestureRef id="B" name="move" />
    </compound>
    <onEvent name="move">
      <plane objects="A,B" maxVariance="5" />
      <event name="zoom">
        <var name="plane.distance" />
        <var name="plane.centre" />
      </event>
    </onEvent>
    <onEvent name="moveComplete">
      <plane objects="A,B" maxVariance="5" />
      <event name="zoomComplete">
        <var name="plane.distance" />
        <var name="plane.centre" />
      </event>
    </onEvent>
  </sequence>
</gesture>
A zoom gesture is a compound of two move gestures.
The compound element defines that the events occur in parallel rather than series.
The plane element calculates the line between the two objects, and checks that the angle of that line has not varied from its initial value by more than the maximum variance (so you can distinguish between a zoom and a rotate, for example).
'move' events from either object are translated into zoom events to the application.
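One possible reading of the plane check, sketched in Python. The helper names `line_angle_deg` and `plane_check` are hypothetical, maxVariance is assumed to be in degrees, and plane.centre is assumed to be the midpoint of the two objects. None of this is fixed by the GDML proposal itself.

```python
import math


def line_angle_deg(a, b):
    """Angle of the line through points a and b, folded into [0, 180)."""
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0])) % 180.0


def plane_check(a0, b0, a, b, max_variance=5.0):
    """Return the zoom variables, or None if the line A-B has turned too far.

    a0, b0: (x, y) positions of objects A and B at acquisition.
    a, b:   their current (x, y) positions.
    """
    delta = abs(line_angle_deg(a, b) - line_angle_deg(a0, b0))
    # A line has no direction, so 179 degrees and 1 degree are 2 degrees apart.
    delta = min(delta, 180.0 - delta)
    if delta > max_variance:
        return None
    return {
        "plane.distance": math.hypot(b[0] - a[0], b[1] - a[1]),
        "plane.centre": ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2),
    }
```

Two fingers spreading apart along the same line pass the check and report a growing plane.distance. Two fingers circling each other change the angle of the line, fail the check, and are left for the rotate gesture to claim.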
Rotate Gesture
<gesture name="rotate">
  <comment>
    A 'rotate' is considered to be two objects moving around a central axis
  </comment>
  <sequence>
    <compound>
      <gestureRef id="A" name="move" />
      <gestureRef id="B" name="move" />
    </compound>
    <onEvent name="move">
      <axis objects="A,B" range="5" />
      <event name="rotate">
        <var name="axis.avgX" />
        <var name="axis.avgY" />
        <var name="axis.angleMax" />
      </event>
    </onEvent>
    <onEvent name="moveComplete">
      <axis objects="A,B" range="5" />
      <event name="rotateComplete">
        <var name="axis.avgX" />
        <var name="axis.avgY" />
        <var name="axis.angleMax" />
      </event>
    </onEvent>
  </sequence>
</gesture>
The axis element calculates the midpoint between two objects and compares current position against the initial.
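A minimal Python sketch of how the axis check might work. The class name `AxisTracker` is hypothetical, and two details are my own guesses rather than part of the definition: range is taken as the maximum distance the midpoint may drift from its initial position, and axis.angleMax is taken as the largest signed rotation (in degrees) seen since acquisition.

```python
import math


class AxisTracker:
    """Tracks the midpoint and rotation of two objects A and B."""

    def __init__(self, a0, b0, max_range=5.0):
        self.mid0 = self._mid(a0, b0)
        self.angle0 = self._angle(a0, b0)
        self.max_range = max_range
        self.angle_max = 0.0

    @staticmethod
    def _mid(a, b):
        return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

    @staticmethod
    def _angle(a, b):
        return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))

    def update(self, a, b):
        """Return the rotate variables, or None if the axis has drifted."""
        mid = self._mid(a, b)
        drift = math.hypot(mid[0] - self.mid0[0], mid[1] - self.mid0[1])
        if drift > self.max_range:
            return None
        # Signed turn since acquisition, wrapped into [-180, 180).
        turn = (self._angle(a, b) - self.angle0 + 180.0) % 360.0 - 180.0
        if abs(turn) > abs(self.angle_max):
            self.angle_max = turn
        return {"axis.avgX": mid[0], "axis.avgY": mid[1],
                "axis.angleMax": self.angle_max}
```

Each 'move' event from either object would call `update` with the current positions of both. A None result means the pair is no longer rotating about a fixed axis.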
GestureLib - A Gesture Recognition Engine
GestureLib does not currently exist!
The purpose of GestureLib is to provide an interface between Touchlib (or any other blob/object tracking software) and the application layer. GestureLib analyses object events generated by Touchlib, and generates gesture-related events for the application to process.
GestureLib reads gesture definitions written in GDML, and then applies pattern matching against those definitions to determine which gestures are in progress.
Why GestureLib?
My feeling is that this functionality should be separated from Touchlib, a) for the sake of clarity, and b) because it's quite likely that working solutions for a high performance multi-touch environment will require distributed processing, i.e. one system doing blob tracking, another doing gesture recognition, and a further system for the application. If you can get all of your components within the same machine, then excellent, but modularity gives a great deal of flexibility and scalability.
Proposed Processing
When an object is acquired, GestureLib sends an event to the application layer providing the basic details of the acquired object, such as coordinates and size. The application can then provide context to GestureLib about the gestures that are allowed in this context.
For example, take a photo light table type application. This will have a background canvas (which might support zoom and pan/move gestures), and image objects arranged on the canvas. When the user touches a single photo, the application can inform GestureLib that the applicable gestures for this object are 'tap', 'move' and 'zoom'.
GestureLib now starts tracking further incoming events knowing that for this particular object, only three gestures are possible. Based on the allowable parameters for the defined gestures, GestureLib is then able to determine over time which unique gesture is valid. For example if a finger appears, it could be a tap, move or potentially a zoom if another finger appears. If the finger is quickly released, only a tap gesture is possible (assuming that a move must contain a minimum time or distance parameter). If the finger moves outside the permitted range for a tap, tap can be excluded, and matching continues with only move or zoom. Zoom is invalid until another finger appears, but would have an internal timeout that means the introduction of another finger later in the sequence can be treated as a separate gesture (perhaps another user, or the same user interacting with another part of the application).
Again, the application can be continually advised of touch events so that it can continue to provide context, without needing to do the math to figure out the exact gesture.
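The elimination process described above could look something like the following Python sketch. Since GestureLib does not exist yet, the function name `narrow_candidates` and its parameters are purely hypothetical. The thresholds are borrowed from the tap and move definitions earlier on this page, and the internal timeout for zoom is left out to keep the example short.

```python
TAP_RANGE = 10      # from <range max="10" /> in the tap gesture
TAP_MAX_MS = 250    # from <duration max="250" /> in the tap gesture
MOVE_MIN = 5        # from <range min="5" /> in the move gesture


def narrow_candidates(allowed, distance, elapsed_ms, contacts, released):
    """Return the subset of `allowed` gestures that is still possible.

    allowed:    gesture names the application permits for this object
    distance:   how far the first contact has moved from its origin
    elapsed_ms: time since the first contact was acquired
    contacts:   number of contacts seen so far in this sequence
    released:   True once the first contact has been lost
    """
    candidates = set(allowed)
    # A tap must stay near its origin and end quickly.
    if distance > TAP_RANGE or elapsed_ms > TAP_MAX_MS:
        candidates.discard("tap")
    if released:
        # A zoom needs a second contact, and none arrived in time.
        if contacts < 2:
            candidates.discard("zoom")
        # A move must have travelled the minimum distance.
        if distance < MOVE_MIN:
            candidates.discard("move")
    return candidates
```

For the photo example, the application would pass `{"tap", "move", "zoom"}` as `allowed`. A quick release with little movement narrows the set to just 'tap', whereas dragging beyond the tap range leaves 'move' and 'zoom' in play.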