Video Stabilization
 


Image stabilization in still cameras is usually used to counter blur caused by movement of the camera during the short instant when the shutter is open. It works remarkably well for stills, but less well for video. First, there's only so much movement that the camera can compensate for. During still photography the stabilization is effective down to shutter speeds of about ten divided by the focal length in millimeters, expressed in seconds, so for a 27 mm lens it will be effective down to 10/27 seconds, or about 1/2.7 s. Most video clips are longer than this. Second, it isn't always obvious whether the movement is intended or not.

Therefore, when shooting video, one should bring a tripod (even a small one helps), or go grab a steadicam rig. But what do you do when the tripod is too small, or left back home, and a steadicam is right out because you're down to one kidney after buying your latest and greatest camera?

You just have to fix it in post. And that's what this article is about.

Table of Contents

1. The Basic Idea

2. The Result

3. Realizing the Idea

3.1. Tracking Reference Points

3.2. Solver

3.3. Damping

3.4. Applying the Transformation

3.5. Crop

 1. The Basic Idea

The basic idea is to pick one or more reference points and then translate and rotate each frame so that the reference points end up in the right positions.

For example, if we choose a reference point that in frame 1 is at [100, 100], and we find the point at [110, 105] in frame 2, then frame 2 should be shifted up and to the left by [-10, -5]. We would then go to frame 3 and make sure that the reference point ends up at [100, 100].

With only a single reference point we are unable to correct for rotation, so we'll allow several reference points, and even points that are constrained to lie anywhere on a line instead of having a single position. We'll also allow reference points to move. The only requirement we have for these points is that for each frame we can compute an error - the amount the point deviates from its expected location. A solver will then be used to find the combination of translation and rotation that minimizes the RMS of the errors.
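As a minimal sketch of what "computing an error" can mean for the two kinds of reference points mentioned above, the following Python snippet measures the deviation for a fixed point and for a point constrained to a line, and combines several errors into an RMS value. The function names are my own, chosen for illustration, and points are assumed to be plain (x, y) tuples.

```python
import math

def fixed_point_error(found, target):
    """Distance between where the point was found and where it should be."""
    return math.hypot(found[0] - target[0], found[1] - target[1])

def line_error(found, a, b):
    """Perpendicular distance from `found` to the infinite line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    cross = dx * (found[1] - a[1]) - dy * (found[0] - a[0])
    return abs(cross) / math.hypot(dx, dy)

def rms(errors):
    """Root mean square of a list of per-point errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# The example from the text: the point belongs at [100, 100]
# but is found at [110, 105], so the frame must be shifted by:
shift = (100 - 110, 100 - 105)  # (-10, -5)
```

A point on a line contributes no error as long as it slides along the line, which is what makes such a constraint useful for things like a horizon or the edge of a building.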

 2. The Result

Let me start off by showing the results. The first example is a hand-held clip of Hötorget, Stockholm. The left half of the image is the raw footage, the right half is the stabilized clip. The stabilizer is given four reference points and corrects for both shifts and rotation.

The second example is a small Christmas Tree shot against the background of people moving around on Drottninggatan. The single reference point is set in the tree, and the stabilizer only corrects for shifts, not rotations. As you can see, it isn't possible to get a completely steady image due to the background moving in relation to the tree thanks to the parallax effect. If I had used a tripod this would not have been an issue.

The third example is me walking with the camera in Tyresta. The reference point is set straight ahead, and as you can see the stabilizer doesn't correct for rotation (it needs more than one reference point for that).

 3. Realizing the Idea

With all theoretical pieces in place we're set to get something working.

 3.1. Tracking Reference Points

The tracking code is very simple: We know where the reference point is in frame N, and we want to know where it is in frame N + 1. For this we have frame N and frame N + 1. We know that the reference point in frame N is at P. We define a sample radius and a search radius. Then we cut out a square with edge length 2 * sampleRadius centered on P. We try to match this square with frame N + 1, starting at P and looking at points closer to P than the specified search radius. When we're done, we'll know the new position. The new position will become the new P when we track the point from N + 1 to frame N + 2.

The above algorithm works best if the camera is moving, such as in the third example above. If the camera is still, the square should always be taken from the first frame in which the reference point is defined, at the position the point has in that frame. The search should however still start at P, just as before. This is because as we track the point from frame to frame, small errors accumulate and the point may drift. When the camera moves this is inevitable - the drift is less than the shakes so we have a net gain - but when the camera is still we can get much better results by using the same reference patch.
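The tracking step can be sketched in a few lines of Python. This is an illustration under several assumptions of mine, not the actual tracker: frames are grayscale images stored as lists of rows, the match quality is a sum of squared differences (the text does not say which measure is used), and all function and parameter names are hypothetical.

```python
import random

def patch_cost(frame_a, pa, frame_b, pb, sample_radius):
    """Sum of squared differences between two square patches of edge
    length 2 * sample_radius, centered on pa in frame_a and pb in frame_b.
    The caller must make sure both patches lie inside their frames."""
    cost = 0
    for dy in range(-sample_radius, sample_radius):
        for dx in range(-sample_radius, sample_radius):
            d = frame_a[pa[1] + dy][pa[0] + dx] - frame_b[pb[1] + dy][pb[0] + dx]
            cost += d * d
    return cost

def track(ref_frame, ref_pos, next_frame, start, sample_radius, search_radius):
    """Return the position in next_frame, within search_radius of `start`,
    whose patch best matches the patch around ref_pos in ref_frame."""
    height, width = len(next_frame), len(next_frame[0])
    best_pos, best_cost = None, None
    for oy in range(-search_radius, search_radius + 1):
        for ox in range(-search_radius, search_radius + 1):
            if ox * ox + oy * oy > search_radius * search_radius:
                continue  # outside the circular search area
            x, y = start[0] + ox, start[1] + oy
            # Skip candidates whose patch would fall outside the frame.
            if not (sample_radius <= x <= width - sample_radius
                    and sample_radius <= y <= height - sample_radius):
                continue
            cost = patch_cost(ref_frame, ref_pos, next_frame, (x, y), sample_radius)
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (x, y), cost
    return best_pos

# Synthetic demo: a random 20x20 frame whose content moves 2 px right, 1 px down.
rng = random.Random(0)
frame_n = [[rng.randint(0, 255) for _ in range(20)] for _ in range(20)]
frame_n1 = [[frame_n[y - 1][x - 2] if x >= 2 and y >= 1 else 0
             for x in range(20)] for y in range(20)]
new_pos = track(frame_n, (10, 10), frame_n1, (10, 10),
                sample_radius=3, search_radius=4)  # new_pos is (12, 11)
```

Keeping the reference patch (ref_frame, ref_pos) separate from the search start covers both modes described above: for a moving camera, pass the previous frame and the previous position as the reference; for a still camera, keep passing the first frame and the original position while only `start` is updated.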

 3.2. Solver

At this point we have a collection of reference points. We define a transformation consisting of three parameters [xt, yt, r], corresponding to x- and y-translations followed by a rotation of angle r around the center of the translated image[1]. In the first frame the transformation is [0, 0, 0] - all reference points are exactly where they should be and no correction is needed. For frame N then, we start with the transformation used for frame N - 1.

For each point then, we take the position where it was found in frame N, and transform it according to the current transformation and see where it ends up. We then compute the distance from that point to the nearest point where we think it should be - for fixed points this is simply the distance to the fixed point, for points constrained to be on a line, it is the perpendicular distance to that line, and so on. That distance, squared, is the error for that particular point.

We then look at the transformations around the current transformation: we adjust the x and y translations by a quarter-pixel, and the rotation by one rotation unit[2]. We repeat the error calculation for each of these neighboring transformations and pick the one with the least error. If its error is less than the error of the current best transformation, we switch to it and repeat the process. If every change to the transformation resulted in a worse error, we stop.
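The search described above is a simple hill climb, and a rough Python sketch could look like the following. Several things here are assumptions of mine rather than details from the text: only fixed reference points are handled, the neighbors are all 26 combinations of stepping each parameter by -1, 0 or +1 steps, the rotation is taken about the image center after it has been moved by the translation, and the names `transform`, `total_error` and `solve` are hypothetical.

```python
import itertools
import math

def transform(point, params, center):
    """Translate by (xt, yt) and rotate by r radians about the translated center."""
    xt, yt, r = params
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(r), math.sin(r)
    return (center[0] + xt + c * x - s * y, center[1] + yt + s * x + c * y)

def total_error(found, targets, params, center):
    """Sum of squared distances between transformed points and their targets."""
    total = 0.0
    for p, q in zip(found, targets):
        x, y = transform(p, params, center)
        total += (x - q[0]) ** 2 + (y - q[1]) ** 2
    return total

def solve(found, targets, center, image_width, start=(0, 0, 0)):
    """Hill climb over integer step counts; returns (xt, yt, r)."""
    steps = (0.25, 0.25, math.atan(1 / (image_width * 8)))

    def params_of(state):
        return tuple(n * s for n, s in zip(state, steps))

    state = start
    best = total_error(found, targets, params_of(state), center)
    while True:
        candidates = []
        for delta in itertools.product((-1, 0, 1), repeat=3):
            if delta == (0, 0, 0):
                continue
            cand = tuple(n + d for n, d in zip(state, delta))
            candidates.append((total_error(found, targets, params_of(cand), center), cand))
        err, cand = min(candidates)
        if err >= best:
            return params_of(state)  # no neighbor is better: stop
        state, best = cand, err

# Demo: four points that should sit at `targets` were all found 10 px to the
# right and 5 px down, so the expected correction is a shift of (-10, -5).
targets = [(210, 135), (410, 135), (210, 335), (410, 335)]
found = [(x + 10, y + 5) for x, y in targets]
result = solve(found, targets, center=(320, 240), image_width=640)
```

The state is kept as integer step counts so that repeated quarter-pixel adjustments cannot accumulate rounding errors. For frame N, `start` would be the step counts found for frame N - 1.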

 3.3. Damping

If we just use the transformation found in the previous step with a clip filmed with a moving camera, we risk that the movie ends up unnaturally stiff. For that reason we keep the last L transformation parameters in a buffer. We then add the new transformation to the buffer at position 0 and shift all other L - 1 transformations. The final parameter values we use are then a weighted average of the parameters in the buffer, where each set of parameters is given a weight of d^i, where d is a damping factor and i is the index of the parameter set in the buffer. For example, with d = 0.8, the parameter set at position 0 has a weight of 1.0, the set at position 1 a weight of 0.8, the set at position 2 a weight of 0.64, and so on.
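The damping buffer can be sketched as below. This is only an illustration: the class name `Damper` is hypothetical, and I assume the weights are normalized by their sum so that the result is a true weighted average even before the buffer has filled up.

```python
from collections import deque

class Damper:
    """Keeps the last `length` parameter sets and returns their weighted
    average, where the set at buffer index i gets the weight d ** i."""

    def __init__(self, length, d):
        self.buffer = deque(maxlen=length)
        self.d = d

    def push(self, params):
        # Index 0 is the newest entry; the oldest drops off the far end.
        self.buffer.appendleft(params)
        weights = [self.d ** i for i in range(len(self.buffer))]
        total = sum(weights)
        return tuple(
            sum(w * p[k] for w, p in zip(weights, self.buffer)) / total
            for k in range(len(params))
        )

# Demo with L = 3 and d = 0.5, feeding in x-translations of 4, then 1.
damper = Damper(length=3, d=0.5)
first = damper.push((4.0, 0.0, 0.0))   # only one entry: (4.0, 0.0, 0.0)
second = damper.push((1.0, 0.0, 0.0))  # (1*1.0 + 0.5*4.0) / 1.5 = 2.0 in x
```

A larger d or a longer buffer lets more of the older transformations through, which makes the output follow intended camera movement more loosely.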

 3.4. Applying the Transformation

Once we have the optimal transformation we apply it. This is a simple linear transformation of an image and I'll restrict myself to some implementation notes.

  • Allow for subpixel translations. Frequently the translation will not end up on a pixel boundary. If you only allow translations by integer pixel values the image will appear to vibrate as image noise causes the reference points to shift small amounts.

  • Use interpolation when rotating the image. For the same reason as above, the image looks like it is vibrating otherwise.

  • I use bilinear interpolation for subpixel translations and the rotation. It is not as good as, for example, bicubic, but given the quality of the raw footage I don't think I'm missing much.
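To make the notes above concrete, here is a small Python sketch of bilinear sampling and of applying a transformation [xt, yt, r] by inverse mapping, meaning that for every output pixel we compute where in the source frame it comes from. It assumes a grayscale image stored as a list of rows, treats everything outside the frame as black, and rotates about the translated image center; the function names are hypothetical and the code favors clarity over speed.

```python
import math

def sample_bilinear(image, x, y):
    """Sample the image at a fractional position; outside the frame is black (0)."""
    height, width = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def pixel(px, py):
        if 0 <= px < width and 0 <= py < height:
            return image[py][px]
        return 0

    # Blend horizontally along the two neighboring rows, then vertically.
    top = (1 - fx) * pixel(x0, y0) + fx * pixel(x0 + 1, y0)
    bottom = (1 - fx) * pixel(x0, y0 + 1) + fx * pixel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom

def stabilize_frame(image, params):
    """Apply translation (xt, yt) and rotation r (radians) to a frame."""
    xt, yt, r = params
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    c, s = math.cos(r), math.sin(r)
    out = []
    for oy in range(height):
        row = []
        for ox in range(width):
            # Invert "output = center + translation + R * (source - center)".
            dx, dy = ox - cx - xt, oy - cy - yt
            sx = cx + c * dx + s * dy
            sy = cy - s * dx + c * dy
            row.append(sample_bilinear(image, sx, sy))
        out.append(row)
    return out

# A half-pixel shift to the right blends neighboring pixels.
shifted = stabilize_frame([[0, 10, 20]], (0.5, 0.0, 0.0))  # [[0.0, 5.0, 15.0]]
```

Because every output pixel is looked up in the source, there are no holes in the result, and the same sampling routine handles both the subpixel translation and the rotation.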

 3.5. Crop

Since we shift the image around, the output from the stabilizer will have black borders. These are removed by tracking the maximum transformation values. From those it is then trivial to figure out a sub-rectangle that is always visible. The stabilized footage can then be cropped to this.
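One possible way to compute such a crop rectangle is sketched below. The translation part follows directly from the largest shifts in each direction. For rotation I use a deliberately conservative margin of my own choosing (the largest rotation angle times half the image diagonal), which may crop slightly more than strictly necessary. The function name and the (left, top, right, bottom) convention, with right and bottom exclusive, are assumptions.

```python
import math

def crop_rectangle(width, height, transforms):
    """Return (left, top, right, bottom) of a region that stays inside the
    source frame for every (xt, yt, r) in `transforms`."""
    max_xt = max(0.0, max(t[0] for t in transforms))
    min_xt = min(0.0, min(t[0] for t in transforms))
    max_yt = max(0.0, max(t[1] for t in transforms))
    min_yt = min(0.0, min(t[1] for t in transforms))
    max_r = max(abs(t[2]) for t in transforms)

    # A rotation by r moves a point at distance rho from the center by at
    # most rho * |r|, and no point is farther away than half the diagonal.
    margin = max_r * math.hypot(width, height) / 2

    left = math.ceil(max_xt + margin)
    top = math.ceil(max_yt + margin)
    right = math.floor(width + min_xt - margin)
    bottom = math.floor(height + min_yt - margin)
    return left, top, right, bottom

# Frames shifted by up to 10 px left, 5 px up, 4.5 px right and 2 px down.
crop = crop_rectangle(640, 480, [(0, 0, 0), (-10, -5, 0), (4.5, 2, 0)])
# crop is (5, 2, 630, 475)
```

Since the crop depends on the extremes over the whole clip, it can only be computed once all frames have been solved, so cropping ends up as a second pass.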

Footnotes

[1]

If we only have a single reference point the rotation parameter is skipped, as the problem otherwise becomes under-constrained: we can always get an error of zero using translation alone, and any rotation about the reference point can then be applied without adding any error.

[2]

I have defined the rotation unit to be tan^-1(1 / (imageWidth * 8)), that is, the arctangent of 1 / (imageWidth * 8). This means that a change by one unit will result in the sides of the image moving up or down by 1/16th of a pixel.
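The 1/16th of a pixel figure can be checked with a few lines of Python: the left and right sides of the image are half an image width from the center, so rotating by one unit moves them by (imageWidth / 2) * tan(unit). The function names are mine, used only for this check.

```python
import math

def rotation_unit(image_width):
    """One rotation unit in radians, as defined in this footnote."""
    return math.atan(1 / (image_width * 8))

def side_displacement(image_width):
    """Vertical movement of the image sides for a rotation of one unit."""
    return (image_width / 2) * math.tan(rotation_unit(image_width))

displacement = side_displacement(1920)  # approximately 0.0625 = 1/16 pixel
```

The result does not depend on the image width, which is the point of defining the unit this way: one step of rotation is always a similarly small change in pixel terms as one step of translation.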