Entries in the input techniques category

Voice Code

Almost every science fiction movie features some form of speech recognition, with computers, robots and even sliding doors accurately understanding and responding to human speech under any conditions. Although speech recognition technology hasn’t progressed quite as fast as SF writers imagined, it is increasingly being used in everyday products and services: a handful of mobile phones can already respond to basic voice commands without any prior training on samples of your voice, and Interactive Voice Response (IVR) systems will pick up your call, listen to what you want, and tell you where the nearest restaurant is or when the next train leaves.

The reason such applications have been (relatively) successful is that they overcome the inaccuracies of speech recognition and the ambiguity of human speech by limiting themselves to a particular domain (e.g. requests about movie showtimes) and exploiting knowledge specific to that domain (e.g. users are likely to say the name of a movie currently showing). By carefully following this principle, researchers from the National Research Council of Canada have created VoiceCode, an application that allows computer programmers to dictate program code to their computer, with virtually no need to touch the keyboard.

The main issue with dictating program code is that programming languages were never meant to be spoken; they were designed to have a clear, simple structure that a computer can easily parse. A simple example is the naming of functions and variables: in program code it is not uncommon to see abbreviations such as nInMsgs for a variable that represents the number of incoming messages. Without VoiceCode, one would have to spell out the abbreviation, something like “n-capital-i-n-capital-m-s-g-s”. VoiceCode, however, can learn and adapt to the ways that programmers abbreviate such names, and match a full spoken phrase with its likely abbreviations. But such heuristics are bound to fail once in a while, so what happens then? VoiceCode again follows a basic principle of interaction design: in case of error, give the user a quick way to recover. Whether the error was made during voice recognition or in translating natural language to program code, it can be easily corrected by selecting one of the alternative interpretations presented in a popup window.
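
To make the abbreviation idea concrete, here is a minimal sketch in Python of how a spoken phrase might be matched against identifiers already present in the code. It is not VoiceCode's actual algorithm, which learns from the programmer's habits. The filler-word list, the function names and the rule used (each identifier part must start with the same letter as the corresponding spoken word and have its letters appear in that word in order) are all assumptions made for illustration.

```python
import re

# Hypothetical list of filler words dropped from the spoken phrase.
FILLER_WORDS = {"of", "the", "a", "an", "for", "to"}


def split_identifier(name):
    """Split camelCase or snake_case into lowercase parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def is_abbreviation(part, word):
    """True if `part` shares the first letter of `word` and its letters
    occur in `word` in the same order (a subsequence test)."""
    if not part or not word or part[0] != word[0]:
        return False
    remaining = iter(word)
    # `ch in remaining` consumes the iterator, which enforces the ordering.
    return all(ch in remaining for ch in part)


def matches(phrase, identifier):
    """True if every part of `identifier` abbreviates the matching word."""
    words = [w for w in phrase.lower().split() if w not in FILLER_WORDS]
    parts = split_identifier(identifier)
    return len(words) == len(parts) and all(
        is_abbreviation(p, w) for p, w in zip(parts, words)
    )


def candidates(phrase, known_identifiers):
    """Return the known identifiers that the spoken phrase could refer to."""
    return [name for name in known_identifiers if matches(phrase, name)]


# Made-up identifiers standing in for the symbols of a real program.
SYMBOLS = ["nInMsgs", "numOutMsgs", "numIncomingMessages", "n_in_msgs"]
print(candidates("number of incoming messages", SYMBOLS))
# -> ['nInMsgs', 'numIncomingMessages', 'n_in_msgs']
```

Note that the phrase matches more than one identifier. That is exactly the situation in which a list of alternatives for the user to pick from becomes useful.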

It’s impossible to describe here all the features that make this piece of software a great case study in voice recognition, so I highly recommend watching the demo video that was presented at the CHI 2006 conference. However, as impressive as this video may be, don’t rush to throw away your keyboards just yet. You may escape RSI, but the VoiceCode authors are quick to point out that a similar condition, voice strain, may be linked to the continuous use of speech recognition products. One can only wonder whether such problems may also arise with more futuristic input technologies, such as brain wave reading.

Pinching thin air

Multi-touch is all the rage these days. Presenters resize, zoom and rotate photographs and maps, all with a simple movement of two hands or fingers across the display surface. Regardless of how likely you are to actually want to do these things, it is a compelling interaction technique, because it is natural and direct. However, touch screens have their downsides and limitations, and it is uncertain whether they will ever displace the keyboard and the mouse from their spot on our desktops.

An alternative route to gestural input is computer vision, as used in the EyeToy for the Sony PlayStation 2. However, accurate recognition of complex gestures involving fingers poses a challenge for these systems.

Computer scientist Andrew Wilson has now found a way of achieving vision-based gestural input that is much simpler than previous approaches. Using a standard webcam pointed down at your keyboard, his software can recognise when you put your thumb and forefinger together, and then lets you move, zoom and rotate the content thus “grabbed”. In addition to the two-point manipulations that multi-touch allows, you can also pinch with only one hand and twist it to rotate, or move it up towards the camera to zoom in.

Andrew Wilson demonstrating vision-based pinching interface

I highly recommend watching the video. (Try to ignore Robert Scoble’s musings in between Wilson’s explanations. Credit to him, though, for shooting this as part of his tour of Microsoft Research.)

The way it works is simple but ingenious. Instead of trying to recognise complex shapes of the hand, Wilson’s solution uses a simple heuristic: while your hands are in the picture without any fingers touching, the background is one single continuous region; when you put your thumb and forefinger together, you are “pinching off” a piece of the background, creating a new region in the image. Regions touching the edge of the image are ignored, which avoids interpreting shapes that you may inadvertently create by cutting across the corner of the image with your arm. To allow the one-handed manipulations, the software further has to find the approximate ellipse formed between the fingers, and react to changes in its orientation and size.
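
The heuristic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Wilson's implementation: it assumes the camera image has already been segmented into a binary mask (hand versus background), uses a plain flood fill with 4-connectivity to find background regions, and reports only the area and centroid of each enclosed region. The function names and the tiny text "frames" are made up for the example.

```python
from collections import deque


def to_mask(rows):
    """Turn rows of text into a boolean mask: '#' is hand, anything else background."""
    return [[ch == "#" for ch in row] for row in rows]


def find_pinch_regions(mask):
    """Return area and centroid of every background region that does not
    touch the image edge, i.e. a region "pinched off" by the hand."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r0 in range(height):
        for c0 in range(width):
            if mask[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one connected background region (4-connectivity).
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            touches_edge = False
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                if r in (0, height - 1) or c in (0, width - 1):
                    touches_edge = True
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and not mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Regions touching the edge are the ordinary background: ignore them.
            if not touches_edge:
                area = len(pixels)
                centroid = (
                    sum(r for r, _ in pixels) / area,
                    sum(c for _, c in pixels) / area,
                )
                regions.append({"area": area, "centroid": centroid})
    return regions


# Two made-up frames: fingers apart, then thumb and forefinger touching.
OPEN_HAND = [
    ".......",
    ".##.##.",
    ".#...#.",
    ".#...#.",
    ".......",
]
PINCH = [
    ".......",
    ".#####.",
    ".#...#.",
    ".#####.",
    ".......",
]

print(find_pinch_regions(to_mask(OPEN_HAND)))  # -> []
print(find_pinch_regions(to_mask(PINCH)))
# -> [{'area': 3, 'centroid': (2.0, 3.0)}]
```

In this sketch, tracking the centroid from frame to frame would give movement, and changes in area could stand in for the change in size used for zooming. Estimating the orientation of the enclosed region, as needed for one-handed rotation, would require fitting an ellipse to it and is left out here.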

The solution has some limitations. Although the interaction seems simple and natural, the hands and fingers have to be held in a particular way and within a certain area. The appearance of the background also matters, since the hands must be distinguishable from it. And although the technique can be used to control the mouse cursor, the interaction for this is not as natural as for direct manipulation. There are also no empirical results from user testing yet, so there may be further usability issues lurking. Despite its limitations, however, this interface looks promising in that it may allow ad-hoc gestural interaction to complement our keyboard and mouse, without requiring expensive new hardware.