[singlepic id=12 w=320 h=240 float=center]


I thought it would be interesting to be able to control your computer, specifically the media player, via the webcam. The basic idea was that you could hold up the palm of your hand to stop the music, or point to the right to skip to the next track, and so on.

This was my attempt at achieving that.

At this point it is unfinished, and I don’t plan to go back to it any time soon, mainly because after playing around with it I realised that although the concept is interesting, it just isn’t as quick and easy to use as keyboard shortcuts!

Ultimately, however, it was fun to work on and as usual I learned a lot from the experience.


Basic Approach

I wanted to develop my own algorithms for this project because my goal was as much about practice and learning as it was about a finished project.

The first and most obvious problem is that getting a computer to understand what it “sees” and act on it accordingly is not trivial.

It seemed to me that a neural network would be a good solution for this application but in order to implement one we would first need to interpret the data from the webcam and break it down into a format that the neural net could more easily process.

I suppose in theory you could just plug the output from the web cam straight into a behemoth of a neural network and be done with it, but if it were that easy our cars would be driving us to work in the morning.

Each hand action that I wanted to interpret would be a different shape, so if I could find a reliable way to identify the hands in the frame and then map their shape, that data could be fed into a neural network, which should be able to identify which action was required.

The basic idea was:

– Monitor feed from webcam

– Isolate hands from the background and head

– “Map” their shape

– Pass shape into neural network

– Neural network identifies command

– Carry out required Action

Let’s look at how I tackled each of those.

Monitor Feed:

Well, this was the easiest step. Java provides some libraries for accessing the webcam, the Java Media Framework (JMF). Interestingly, these are platform-specific libraries. At the time I was writing the code I couldn’t find a reliable set of platform-independent libraries, so I continued with a Windows-only solution. Platform dependence would become an issue again at a later stage, but I’ll get to that further on.

The JMF allows you to grab a frame from the output of the webcam, and a single frame is the unit this article works with. The video stream is taken frame by frame, and successive algorithms are applied to each frame. For the purposes of this article each frame is an RGB bitmap.

Isolate Hands from the Background and Head

This is the most complex and involved step in the whole process. Taking an image from a webcam and identifying certain objects within it is certainly not straightforward.

I mentioned earlier I wanted to implement my own algorithms rather than using standard or existing algorithms. I made this choice largely because I felt that going in “blind” would force me to think much harder about the problem and how to solve it. One can never have too much practice in developing algorithms! I knew that this would almost certainly (the “almost” exists only for my ego) result in inferior performance to using established algorithms, but I don’t regret the choice. It was much more fun figuring it out for myself (or failing!).

The first step seemed kind of obvious: I would need a way to isolate the edges within the image, which would give a much “cleaner” image to work with. As it turns out, this wasn’t a great first step; I actually came back later and added an algorithm that removes the background from the image before it is processed. I’ll go into more detail about that later.

The algorithm I implemented was very simple and, looking back on it, it is not, strictly speaking, an edge detection algorithm. In actuality it is a “difference detection” algorithm.

Essentially it will identify areas that differ substantially from their immediate surroundings. Fortunately that is a pretty good approximation of an edge so the algorithm worked pretty well in this regard.

The algorithm:

(The algorithm makes use of several classes but I will include only the relevant code here)


This approach worked pretty well. The use of a threshold allowed me to adjust the algorithm on the fly and eliminate noise whilst maintaining a solid edge. This was useful because different lighting conditions naturally required different thresholds.
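To make the idea concrete, here is a minimal sketch of threshold-based difference detection in Java. It is not the original code, just an illustration of the same idea. The class and method names are hypothetical, and both the comparison against the mean of the eight surrounding pixels and the use of average channel brightness are my assumptions.

```java
import java.awt.image.BufferedImage;

/** Illustrative sketch only: marks pixels that differ sharply from their neighbours. */
final class DifferenceDetector {

    /** Average of the red, green and blue channels of a packed RGB pixel. */
    static int brightness(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    /**
     * Returns a mask where true means "this pixel differs from the mean of its
     * eight neighbours by more than threshold". Border pixels are left false.
     * The neighbourhood and the brightness measure are assumptions.
     */
    static boolean[][] detect(BufferedImage frame, int threshold) {
        int w = frame.getWidth();
        int h = frame.getHeight();
        boolean[][] mask = new boolean[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        if (dx != 0 || dy != 0) {
                            sum += brightness(frame.getRGB(x + dx, y + dy));
                        }
                    }
                }
                int neighbourMean = sum / 8;
                int diff = Math.abs(brightness(frame.getRGB(x, y)) - neighbourMean);
                mask[y][x] = diff > threshold;
            }
        }
        return mask;
    }
}
```

Raising the threshold passed to `detect` suppresses more noise at the cost of fainter edges, which is the trade-off that had to be tuned for each lighting condition.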

A noisy image:[singlepic id=16 w=320 h=240 float=center]

Much cleaner:[singlepic id=17 w=320 h=240 float=center]

Although that approach picked out the edges fairly well, it still resulted in a noisy image, so I implemented a quick colour matching filter too.

My original intent was actually to give a greater weighting to skin tones by running colour matching first, to eliminate some of the “non skin” objects in frame. This didn’t work well at all, for several reasons: firstly, the image quality from the webcam is pretty poor, but mostly the colour distribution isn’t what I expected.

I think that due to the way our brains process what we see, namely the special significance that people and particularly faces hold, we perceive a greater variation in colour than there is. Ultimately I ended up with a butchered frame.

At that point however it became obvious that if I bumped up the threshold and ran the colour shading before the edge detection it would have the effect of passing a higher contrast image into the edge shader and result in less noise. The results were good enough that I could move on.

[singlepic id=15 w=320 h=240 float=center]

Mapping the Shape

The next challenge was to identify sub regions within the edges and format the spatial data from each isolated area into a form that a neural network could process. A neural network is a good fit for this because the input for each gesture will be similar each time but not exactly the same; it will vary mostly due to the user’s distance from the webcam at the time, and normal inconsistencies in movement and position.

I decided that a 2D Boolean array would work well; the neural network would be presented with the array, where a true value would equal a “set” sub area.

To do this I split each frame into smaller squares and then considered the pixels within each. Bear in mind at this point the processed frame is a monochrome image, i.e. contains only black or white pixels. White pixels are areas of interest whereas black pixels are areas that have been identified as “not part of a hand”.

The algorithm basically counts all the black pixels (not hands) within a given square and if that count is below a threshold, which takes into account the noisy image from the web-cam, then that square is identified as not containing an edge. Essentially that square is within a hand.
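The square-counting step described above can be sketched roughly as follows. This is an illustration under assumptions rather than the original code: the monochrome frame is represented as a `boolean[][]` (true for white, false for black), and the names `GridMapper`, `toGrid`, `cellSize` and `maxBlack` are hypothetical.

```java
/** Illustrative sketch only: reduces a monochrome frame to a coarse grid of squares. */
final class GridMapper {

    /**
     * @param white    monochrome frame, true = white (area of interest), false = black
     * @param cellSize side length of each square in pixels (hypothetical parameter)
     * @param maxBlack a square is "set" when it contains fewer black pixels than this
     * @return grid where true marks a square judged to be inside a hand
     */
    static boolean[][] toGrid(boolean[][] white, int cellSize, int maxBlack) {
        if (white.length == 0) {
            return new boolean[0][0];
        }
        int rows = white.length / cellSize;
        int cols = white[0].length / cellSize; // partial squares at the edges are ignored
        boolean[][] grid = new boolean[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int black = 0;
                for (int y = r * cellSize; y < (r + 1) * cellSize; y++) {
                    for (int x = c * cellSize; x < (c + 1) * cellSize; x++) {
                        if (!white[y][x]) {
                            black++;
                        }
                    }
                }
                // A few stray black pixels are tolerated as webcam noise.
                grid[r][c] = black < maxBlack;
            }
        }
        return grid;
    }
}
```

The `maxBlack` value plays the role of the noise threshold: set it too low and noisy squares inside a hand are rejected, too high and squares containing real edges slip through.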

Once again this is more of a meta identification, as all I have really done at this stage is ramp up the contrast and pick out areas of significant variance. This has the effect of identifying edges and boundaries, and an edge or boundary is by definition not part of a hand. Everything within this boundary is a hand though or, as you can plainly see in the screenshots, a head. More on that later.

With a little tweaking of the various algorithmic parameters the output starts to approach something useable, with two noticeable caveats.

The first is that whilst areas of interest have been identified, there is currently no computational way of differentiating between different elements in the frame (right hand vs. left hand, or head, for example).

There is a simple solution to this, if not a robust one.

I decided that if there was no direct path from one subset (collection of squares) to another, that is to say, they are separated by an edge, then they are different elements.

To determine this I ran a recursive depth-first algorithm on the data returned from the “sub square” algorithm. As a side note, this is one of the places where not going with out-of-the-box solutions did shoot me in the foot, in a way. I spent a not inconsiderable amount of time getting to grips with a recursive algorithm that was capable of identifying these distinct sub regions. Had I searched for an algorithm, I would have found and implemented a depth-first search in very short order; as it was, working from scratch took me rather longer.

Here are a few examples of some of the botched attempts.

[nggallery id=5]
And this shows it working; each distinct element has a unique identifier.
[singlepic id=23 w=320 h=240 float=center]
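For anyone wanting to try the region identification, here is a small sketch of the depth-first labelling idea. It is not the original code: it swaps the recursion for an explicit stack (to avoid deep call chains on large regions), it assumes squares are connected only horizontally and vertically, and the name `RegionLabeller` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch only: gives each connected group of set squares its own id. */
final class RegionLabeller {

    /**
     * @param grid true marks a "set" square (inside a potential hand)
     * @return same-sized array where 0 means "not set" and 1, 2, 3... identify regions
     */
    static int[][] label(boolean[][] grid) {
        int rows = grid.length;
        int cols = rows == 0 ? 0 : grid[0].length;
        int[][] labels = new int[rows][cols];
        int next = 0;
        Deque<int[]> stack = new ArrayDeque<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!grid[r][c] || labels[r][c] != 0) {
                    continue; // not set, or already part of an earlier region
                }
                next++;
                labels[r][c] = next;
                stack.push(new int[] {r, c});
                // Depth-first walk over everything reachable from this square.
                while (!stack.isEmpty()) {
                    int[] cell = stack.pop();
                    int[][] neighbours = {
                        {cell[0] - 1, cell[1]}, {cell[0] + 1, cell[1]},
                        {cell[0], cell[1] - 1}, {cell[0], cell[1] + 1}
                    };
                    for (int[] n : neighbours) {
                        boolean inside = n[0] >= 0 && n[0] < rows && n[1] >= 0 && n[1] < cols;
                        if (inside && grid[n[0]][n[1]] && labels[n[0]][n[1]] == 0) {
                            labels[n[0]][n[1]] = next;
                            stack.push(n);
                        }
                    }
                }
            }
        }
        return labels;
    }
}
```

Two squares end up with the same id only if a path of set squares joins them, which matches the "separated by an edge means different elements" rule.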

Once I had the algorithm sorted and was able to identify individual elements, it was time to deal with the elephant in the room.

As you can see, my head has been there from the very beginning. It gets in the way and needs to be removed from the frame.

I experimented with some rather crude methods to differentiate hands from head:

[singlepic id=18 w=320 h=240 float=center]

However, they relied on the position and size of the head, which isn’t terribly robust. Fortunately, while I was working on that I realised something that I should have realised from the start.

When sat at a computer, the background rarely changes and the position of the head rarely changes, so by comparing the current frame to a rolling “reference frame” I was able to identify areas of recent movement.

This resulted in an image like this:

[singlepic id=21 w=320 h=240 float=center]
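As a rough illustration of the rolling reference frame, the sketch below keeps a slowly adapting copy of the scene and flags pixels that differ from it. How the reference is actually updated isn't described here, so the exponential moving average is purely my assumption, as are the greyscale `int[][]` frames and the names `MotionDetector`, `alpha` and `threshold`.

```java
/** Illustrative sketch only: flags recent movement against a rolling reference frame. */
final class MotionDetector {
    private final float[][] reference; // rolling estimate of the static scene
    private final float alpha;         // how quickly the reference adapts, between 0 and 1
    private final int threshold;       // minimum brightness change that counts as movement

    MotionDetector(int[][] firstFrame, float alpha, int threshold) {
        this.alpha = alpha;
        this.threshold = threshold;
        reference = new float[firstFrame.length][];
        for (int y = 0; y < firstFrame.length; y++) {
            reference[y] = new float[firstFrame[y].length];
            for (int x = 0; x < firstFrame[y].length; x++) {
                reference[y][x] = firstFrame[y][x];
            }
        }
    }

    /**
     * Compares a greyscale frame (same size as the first) with the reference,
     * then blends the frame into the reference (assumed moving-average update).
     */
    boolean[][] update(int[][] frame) {
        boolean[][] moved = new boolean[frame.length][];
        for (int y = 0; y < frame.length; y++) {
            moved[y] = new boolean[frame[y].length];
            for (int x = 0; x < frame[y].length; x++) {
                moved[y][x] = Math.abs(frame[y][x] - reference[y][x]) > threshold;
                reference[y][x] += alpha * (frame[y][x] - reference[y][x]);
            }
        }
        return moved;
    }
}
```

With this kind of update, anything that stays still for long enough (the head, the background) fades into the reference, while a hand that has just been raised stands out until it too has been absorbed.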

When all the processes are combined the result is something like this.

[singlepic id=20 w=320 h=240 float=center]

The result is actually pretty good. By finding the bounds of the “potential hand” I can generate a 2D Boolean array and feed that into the neural network.

From that point it’s just a case of training the network and then interacting with the media player.
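Finding the bounds and cutting out the array might look something like the sketch below. It assumes the regions are available as an `int[][]` of identifiers, one per square, with the hand's id already known; the class name `RegionCropper` is hypothetical and this is not the original code.

```java
/** Illustrative sketch only: cuts one identified region out as a 2D Boolean array. */
final class RegionCropper {

    /**
     * @param labels region id for every square (0 or other ids for everything else)
     * @param id     the region to extract, e.g. the "potential hand"
     * @return the region cropped to its bounding box, or an empty array if id is absent
     */
    static boolean[][] crop(int[][] labels, int id) {
        int top = Integer.MAX_VALUE, left = Integer.MAX_VALUE, bottom = -1, right = -1;
        for (int r = 0; r < labels.length; r++) {
            for (int c = 0; c < labels[r].length; c++) {
                if (labels[r][c] == id) {
                    top = Math.min(top, r);
                    bottom = Math.max(bottom, r);
                    left = Math.min(left, c);
                    right = Math.max(right, c);
                }
            }
        }
        if (bottom < 0) {
            return new boolean[0][0]; // region not present in this frame
        }
        boolean[][] shape = new boolean[bottom - top + 1][right - left + 1];
        for (int r = top; r <= bottom; r++) {
            for (int c = left; c <= right; c++) {
                shape[r - top][c - left] = labels[r][c] == id;
            }
        }
        return shape;
    }
}
```

The cropped array changes size with the hand's distance from the camera, so it would presumably still need scaling to a fixed size before a network with a fixed number of inputs could use it; that step is not shown.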


Screenshot Gallery:

[nggallery id=10]