How Kinect tracks people

Tuesday, 09 November 2010

The Kinect hardware is impressive, but what about the software? Body tracking is a longstanding problem in computer vision - has it been solved at last?

When you consider Kinect, Microsoft's whole-body input device, you tend to focus on the hardware. It is certainly impressive - a standard video camera sits alongside an infra-red projector and an infra-red camera, which work together to provide a depth map of the 3D scene.


However, the software deserves a mention because its role is to segment the depth map into objects and then track those objects. More specifically, it tracks a person in real time without them having to wear sensors. This is a very difficult task - one that has been extensively studied as part of AI and computer vision.

It has now been revealed that the key software was developed by Microsoft Research Cambridge's vision research group. The old way of approaching the problem is to construct an avatar and attempt to find a match for it in the data provided by the camera. Tracking is then a matter of updating the match by moving the avatar as the data changes.

This was the basis of the first Kinect software, and it didn't work well enough for a commercial product. After a minute or so it tended to lose the track and was then unable to recover it. It also only worked for people who were the same size and shape as the system's developer - because that was the size and shape of the avatar used for matching.


The new approach developed by the vision research team makes use of machine learning. They trained a learning system to recognise body parts. The trained system then labels body parts in the incoming data stream pixel by pixel, running on the GPU. The per-pixel classifications are pooled across pixels to produce hypothetical 3D body joint positions, which are passed to a skeletal tracking algorithm.
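The pooling step can be illustrated with a small sketch. It is not the Kinect implementation: the function name, the input format, the confidence threshold and the use of a simple confidence-weighted centroid are all assumptions made for illustration. It assumes that each depth pixel has already been given a body-part label, a confidence and a 3D position, and it shows how those per-pixel results might be combined into one joint hypothesis per body part.

```python
from collections import defaultdict


def pool_joint_proposals(pixels, min_confidence=0.5):
    """Pool per-pixel body-part labels into one 3D proposal per part.

    pixels: iterable of (part_label, confidence, (x, y, z)) tuples, one per
    classified depth pixel (hypothetical input format).
    Returns {part_label: (x, y, z)}, the confidence-weighted centroid of the
    pixels assigned to each part. A weighted mean is a simplification chosen
    for this sketch, not the published method.
    """
    # Per label: weighted sums of x, y, z and the total weight.
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for label, conf, (x, y, z) in pixels:
        if conf < min_confidence:
            continue  # ignore pixels the classifier is unsure about
        acc = sums[label]
        acc[0] += conf * x
        acc[1] += conf * y
        acc[2] += conf * z
        acc[3] += conf
    return {
        label: (wx / w, wy / w, wz / w)
        for label, (wx, wy, wz, w) in sums.items()
        if w > 0
    }


if __name__ == "__main__":
    frame = [
        ("left_hand", 1.0, (0.0, 0.0, 2.0)),
        ("left_hand", 1.0, (2.0, 0.0, 2.0)),
        ("head", 0.75, (1.0, 3.0, 2.5)),
        ("head", 0.25, (9.0, 9.0, 9.0)),  # low confidence, filtered out
    ]
    print(pool_joint_proposals(frame))
```

Because each frame is pooled from scratch, nothing here depends on the previous frame, which is the property that lets such a system start from any pose.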

Explaining this recently, one of the researchers said:

"We train the system with a vast and highly varied training set of synthetic images to ensure the system works for all ages, body shapes & sizes, clothing and hair styles. Secondly, the recognition does not rely on any temporal information, and this ensures that the system can initialize from arbitrary poses and prevents catastrophic loss of track, enabling extended gameplay for the first time."

Who says that AI never delivers on its promises? It is also clear that this sort of approach has many more applications than just a game input device.
