It seems in every field there is a vast array of tools and tricks used by people who don’t really understand them. This is okay in some situations, but often taking the time to delve a bit deeper offers benefits not found by “just using it” (including the simple joy of figuring out how things work, of course). In the field of computer vision one such tool is optical flow, a method for computing vector fields describing motion (magnitude as well as direction) in video streams. Often when you see explanations of optical flow they’re purely mathematical or, worse, simply show which functions to call in whatever computer vision library you happen to be using (such as OpenCV); neither offers great insight into why it works. The point of this post is to walk the reader through the math (gently!) and then to describe why it works the way it works.
I mentioned optical flow computes vectors signifying motion or flow within a sequence of video frames. The motion could be a dog walking across the field of view, a car driving by and so on. By “flow” I’m referring to the pixels of varying intensity and/or color flowing across the image due to said motion. Take this animated GIF:
The first thing you’ll see is the silver car lurching forward in the two-image sequence. We can say the pixels representing the silver car flow “forward” (or at an angle towards the viewer); it is the task of optical flow to accurately represent this flow in a form we can use.
“In a form we can use”, in many instances, means in a form that can be leveraged to predict where the object in motion will likely end up in the next frame. Optical flow does this by providing us with a vector for each pixel or super-pixel. I simply annotated the car image to make my point below:
Here you can see vectors assigned to pixels or pixel groups signifying the direction of motion of the object being analyzed (which in this instance is the silver car). Optical flow, if implemented correctly and used under the right conditions, provides this for us.
Optical flow rests on a number of assumptions, the most important of which is the brightness constancy constraint. The constraint essentially claims that as a pixel moves from point 1 to point 2 in the scene it maintains the same brightness intensity (in gray-scale images). This is crucial to many of the building blocks of optical flow. Mathematically this can be expressed as:

I(x, y, t) = I(x + Δx, y + Δy, t + Δt)

Where the function I() returns the intensity of the pixel given parameters x, y and t (x-coordinate, y-coordinate and time-coordinate respectively). This essentially means that as the point undergoes a displacement (Δx, Δy) over a time Δt, its position changes but its intensity does not.
Typically the mathematical statements of optical flow then leverage a Taylor series approximation of I(x + Δx, y + Δy, t + Δt) so as to express it as a sum of first-order partial derivatives (with the higher-order terms dropped out):

I(x + Δx, y + Δy, t + Δt) ≈ I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt
Doing this affords us a realization: since brightness constancy says the two sides are equal, the sum

(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt

is equal to zero (recall the pixel intensity does not change, according to optical flow). This gives us a nifty constraint we can place on pixels: at a pixel, the change in the image’s intensity with respect to x (the gradient in the x direction) times the displacement in x, plus the change in the image’s intensity with respect to y (the gradient in the y direction) times the displacement in y, plus the change in that pixel’s intensity over the time it took to move, effectively sums to zero. This is a deep realization within optical flow, so let’s devote a bit of time to understanding what it means.
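Before moving on, we can check this constraint numerically. The sketch below assumes a toy gray-scale image whose intensity is a linear ramp, I = 2x + 3y, translating one pixel per frame in the x direction; the finite-difference gradients then satisfy the constraint exactly (the image and its motion are hypothetical, chosen purely for illustration):

```python
# Sanity check of the brightness constancy constraint on a toy image:
# a linear ramp I = 2x + 3y translating one pixel per frame in x.

def frame(t, w=8, h=8, vx=1, vy=0):
    # intensity of the ramp after it has moved (vx*t, vy*t) pixels
    return [[2 * (x - vx * t) + 3 * (y - vy * t) for x in range(w)]
            for y in range(h)]

I0, I1 = frame(0), frame(1)

# finite-difference gradients at an interior pixel (x=3, y=3)
x, y = 3, 3
Ix = I0[y][x + 1] - I0[y][x]   # change in intensity with respect to x
Iy = I0[y + 1][x] - I0[y][x]   # change in intensity with respect to y
It = I1[y][x] - I0[y][x]       # change in intensity with respect to t

# with the true displacement (Δx, Δy) = (1, 0) over Δt = 1,
# the constraint sums to zero
residual = Ix * 1 + Iy * 0 + It
print(residual)  # → 0
```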
Understanding the constraint
Let’s state it again:

(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt = 0
And now let’s think a bit about what this means by examining how a human notices motion. Take the following image with a vertical gradient:
Imagine we were to stretch this gradient ten miles to the left and ten miles to the right, so that we had a long, thin rectangle filled with this particular pattern. Now imagine we moved it, slowly, to the left. Because one cannot see any unique color changes in the rectangle as it slides in this direction (in particular, perpendicular to the gradient), it would be very difficult for a human to note the rectangle was moving at all (recall that we do not see the ends of the rectangle).
Conversely, if we were to stretch the gradient vertically ten miles up and ten miles down and began to move it downwards, over time we’d see the rectangle growing darker as the darker and darker portions of the gradient passed before our eyes.
This tells us something important about human vision: if something is moving but its color and brightness don’t change, we have no way of knowing it’s moving — our eyes play a trick on us! But if there is a change in color and intensity in front of us, we can extrapolate something quite interesting: if we never divert our eyes from one point in front of us, know the rate at which the object darkens over a certain unit distance, and keep track of how long it takes the single point we’re observing to darken, we can determine the velocity of the moving object (because the change in brightness equates to distance traveled). And this is precisely what optical flow is doing.
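The intuition above can be sketched in one dimension. The example below assumes a hypothetical gradient with a slope of 5 intensity units per pixel sliding right at 2 pixels per frame; watching a single fixed point, the velocity falls out as the temporal darkening rate divided by the spatial darkening rate:

```python
# Recovering velocity from brightness change at a single fixed point,
# assuming a 1-D gradient of known slope sliding right at a known speed.
slope, v_true = 5.0, 2.0

def intensity(x, t):
    # brightness seen at position x at time t as the gradient slides by
    return slope * (x - v_true * t)

x = 10.0
Ix = intensity(x + 1, 0) - intensity(x, 0)  # darkening per unit distance
It = intensity(x, 1) - intensity(x, 0)      # darkening per unit time at x
v_estimated = -It / Ix                      # recovered velocity
print(v_estimated)  # → 2.0
```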
Before we get too far ahead of ourselves, let’s return to the formula we worked out previously, namely:

(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt = 0
Let’s fiddle with the statement a bit. By subtracting the time term (∂I/∂t)Δt from both sides, we come up with the following equality:

(∂I/∂x)Δx + (∂I/∂y)Δy = −(∂I/∂t)Δt
We can phrase this relationship in words as the following: the change in intensity due to the displacement of the point <x,y> in the x and y directions is equal and opposite to the change in intensity at <x,y> over time. This is simply a fancier way of expressing what we already did above.
To round out the equation, let’s then divide both sides by Δt, resulting in:

(∂I/∂x)(Δx/Δt) + (∂I/∂y)(Δy/Δt) = −∂I/∂t
The change in x over time is the velocity in the x-direction; the change in y over time is the velocity in the y-direction, and so we can finish our statement as such:

(∂I/∂x)V_x + (∂I/∂y)V_y = −∂I/∂t
We now have an equality that can be used to determine the velocity of a pixel in the x-direction and y-direction, given that we can compute the change in the image intensity with respect to the x-direction, the y-direction and the “t-direction” (time). However, we also immediately see we have one equation with two unknowns, meaning it cannot be solved as-is. There are many ways in which we can solve this particular problem (called the aperture problem); one of the most famous was developed by Lucas and Kanade, known as the Lucas-Kanade method.
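To see the underdetermination concretely, consider a window cut from a pure linear ramp (a hypothetical I = 2x + 3y): every pixel carries the identical gradient, so even stacking many copies of the constraint gives a singular system, and only the flow component along the gradient is recoverable:

```python
# Illustration of the aperture problem: in a window where every pixel has
# the same gradient, the 2x2 normal-equation matrix A^T A is singular.
Ix, Iy = 2.0, 3.0  # identical gradient at every pixel of the ramp window
n = 25             # 5x5 window

# A^T A for n copies of the same constraint row [Ix, Iy]
a, b, c = n * Ix * Ix, n * Ix * Iy, n * Iy * Iy
det = a * c - b * b
print(det)  # → 0.0, so (Vx, Vy) cannot be uniquely solved
```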
Lucas-Kanade works by constraining our problem to a window of pixels as opposed to looking at a single pixel. This means we can develop a system of linear equations which can then be used to approximate (using a least-squares approximation) the average velocity of the pixels within the window. The linear system is constructed in this manner.
Let I_x = ∂I/∂x, I_y = ∂I/∂y and I_t = ∂I/∂t, so that for each pixel q_i in the window our constraint above can be stated as

I_x(q_i)V_x + I_y(q_i)V_y = −I_t(q_i)
We can then break things out in Av = b form:

A = [ I_x(q_1)  I_y(q_1) ]        v = [ V_x ]        b = [ −I_t(q_1) ]
    [ I_x(q_2)  I_y(q_2) ]            [ V_y ]            [ −I_t(q_2) ]
    [    ⋮         ⋮     ]                               [     ⋮     ]
    [ I_x(q_n)  I_y(q_n) ]                               [ −I_t(q_n) ]
And solve for v using the standard least-squares approximation, v = (AᵀA)⁻¹Aᵀb.
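The whole solve can be sketched in a few lines of plain Python. The example below assumes a synthetic quadratic intensity surface (chosen so the gradient varies across the window and AᵀA is invertible) translating at one pixel per frame in both x and y; central differences make this toy case exact, so the window solve recovers the true motion:

```python
# Minimal Lucas-Kanade solve over a 5x5 window on a synthetic image
# translating at (1, 1) pixels per frame.

def f(x, y):
    # intensity pattern with a spatially varying gradient (a pure ramp
    # would make A^T A singular -- the aperture problem)
    return x * x + x * y + y * y

def frame(t, vx=1.0, vy=1.0):
    # the pattern after translating (vx*t, vy*t)
    return lambda x, y: f(x - vx * t, y - vy * t)

prev, cur, nxt = frame(-1), frame(0), frame(1)

# accumulate the 2x2 normal equations (A^T A) v = A^T b over the window
sxx = sxy = syy = sxt = syt = 0.0
for y in range(10, 15):
    for x in range(10, 15):
        Ix = (cur(x + 1, y) - cur(x - 1, y)) / 2.0  # spatial gradients
        Iy = (cur(x, y + 1) - cur(x, y - 1)) / 2.0
        It = (nxt(x, y) - prev(x, y)) / 2.0         # temporal gradient
        sxx += Ix * Ix; sxy += Ix * Iy; syy += Iy * Iy
        sxt += Ix * It; syt += Iy * It

# invert the 2x2 system by hand: v = (A^T A)^-1 A^T b, with b = -I_t
det = sxx * syy - sxy * sxy
vx = (-syy * sxt + sxy * syt) / det
vy = (sxy * sxt - sxx * syt) / det
print(vx, vy)  # → 1.0 1.0, the true motion
```

In practice you would not hand-roll this: OpenCV’s Lucas-Kanade implementation adds pyramids, windowing weights, and iteration, but the 2×2 solve at its core is exactly the one above.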
If you made it this far, congrats. Here’s some Kenny Loggins.