Reduced Search for Macroblock Motion Partitioning in MPEG-4 Encoders

Earl Levine, earl (at) stanfordalumni.org

Summary

The MPEG-4 part 10 video encoder (also known as H.264) allows a large number of partitions of 16 x 16 pixel macroblocks for specifying motion compensation. This paper is an initial investigation of feasibility of reducing the complexity of an encoder by sometimes reducing the number of partitionings searched. Results using modified reference encoder software suggest these types of techniques enable a useful tradeoff between complexity and rate-distortion performance.

Motivation

The MPEG-4 part 10 video encoder allows a 16 x 16 pixel macroblock to be partitioned various ways for purposes of motion compensation. Each partition may have its own motion vector. The largest partition options are:

16 x 16 (that is, a single motion vector for the whole macroblock)
16 x 8 (two motion vectors)
8 x 16 (two motion vectors)
8 x 8 (four motion vectors)

An 8 x 8 partition may also be further split into subpartions of size 8 x 4, 4 x 8, and 4 x 4.

Presumably the extra bits needed to transmit more motion vector information could be offset in some macroblocks by the reduced residual gained from specifying more exactly the border of a region of motion. For example, in a simple scene of an large static diamond-shaped object moving uniformly in front of a static background, being able to specify a very good approximation of the diagonal boundaries of the foreground object could allow reducing the residual to near zero.

Each possible of partitioning tried takes significant computation in the encoder. The goal of this work is to investigate one technique for reducing the number of partitionings searched with the goal of reducing complexity with acceptible loss of quality in the encoding.

Algorithm

In a typical encoder, for each partitioning searched there will be a subsearch over possible sets of motion vectors for that macroblock and that partitioning. Usually some kind of cost measure would be computed for comparison against other possible partitionings, with the least-cost choice being selected for the encoding. For example the cost measure could be a sum-absolute-difference metric or a rate-distortion-aware metric.

Intuition suggests that if a motion boundary crosses through a macroblock the cost difference between a fine partitioning and coarser partitionings would be significant. Therefore if the search begins with the smaller subpartitionings (4 x 4, 4 x 8, 8 x 4, and 8 x 8) and the optimal-cost partitioning so far is much better than the coarsest partitioning (that is, 8 x 8) then it seems likely that searching even larger partitionings (8 x 16, 16 x 8, and 16 x 16) may be a waste of time. With this in mind, for each macroblock the algorithm tried is this:

Compute the cost of all partitionings using 8 x 8 or smaller. Don't compute the costs of the of the larger partitionings (16 x 8, 8 x 16, or 16 x 16) yet.
Compute a decision metric which is the sum, over all four 8 x 8 subblocks of the macroblock, of the difference between the cost of that subblock's largest (i.e. 8 x 8) subpartitioning and the cost of the optimal subpartitioning. Note that this metric is nonnegative, and could be 0 if 8 x 8 is optimal for all subblocks.
Compare this metric to a threshold. If the metric is larger than the threshold, skip searching and comparing to the larger partitionings, otherwise go ahead and search them as well.

The value of the threshold will be varied here to see its effect on the results. The threshold is one way to control the tradeoff between complexity and encoder quality.

This algorithm is just a simple scheme based on intuition and would be expected to be beaten by an algorithm based on an analysis of the cost data over a sample set. However the goal is to check whether this intuition is roughly right using experimental results.

Experimental Results

The official reference model software (see references) was modified to include the algortihm described above and also to allow collecting some extra desired runtime statistics. The gprof profiling tool was used to observe CPU time on a 1 GHz Celeron PC running Red Hat Linux 7.3. The video data used for the results here is simply the first 45 frames of the "foreman" sequence (see references).

The specific verion of reference software used was JM 7.3 and the default encoder configuration was used except for disabling rate-distortion decision for partitioning search. This simplified the confusion of the cost comparisons in that software.

In order to get an rough idea of the values of thresholds to try, simple statistics were collected on the value of the decision metric for cases in which the optimal choice was either a large partition size or smaller subpartitions. For those two cases respectively the mean of the metric was found to be 45.7 and 145.1 for the data used.

Threshold Execution time in seconds % execution time savings % of macroblocks with
reduced search average loss per macroblock
in internal cost measure SNR Y, dB SNR U, dB SNR V, dB bitrate (kb/s)

infinite (full search) 155.74 0 0 0 35.67 39.01 40.63 89.21

120 152.03 2.4 16.5 2.62 35.67 38.99 40.67 89.69

100 152.87 1.8 24.6 5.85 35.63 39.00 40.62 89.42

80 148.07 4.9 33.25 8.96 35.64 38.98 40.67 90.58

70 147.54 5.3 37.4 11.2 35.59 38.98 40.66 91.71

60 147.27 5.4 41.6 14.1 35.60 38.98 40.69 90.83

50 147.06 5.6 46.3 17.2 35.59 38.98 40.66 91.71

40 146.50 5.9 50.4 20.2 35.60 38.95 40.68 93.69

30 144.42 7.3 54.8 23.9 35.62 38.95 40.66 92.87

20 142.91 8.2 60 27.9 35.62 38.94 40.67 95.44

The overall trend that emerges from this data is that the reduced search decision can be made affimatively for a significant portion of macroblocks while having a less significant impact on encoding quality. For example, when the threshold is set to 50, reduced search may be performed on about half of macroblocks while only increasing bitrate by about 3 percent.

Conclusions and Future Directions

The results suggest that there is value to a technique that sometimes avoids searching the full space of partitionings. A significant number of macroblocks may have less than a full search with negligible loss of quality. (See the column "% of macroblocks with reduced search".) These techniques certainly provide a relatively simple way to continuously trade off the number of partitionings searched for rate-distortion performance. One attractive feature is that negligible complexity is added by reduced search techniques using decisions based on costs that are computed anyway.

The reference model software does not reveal great runtime gains from reducing the number of partitionings searched. This is because much of the execution time (about 45% typically) is occupied by precomputation for motion search (i.e. subsampling). In a system where that task is accelerated by special-purpose computation hardware, the portion of execution taken by search over motion partitioning would be more significant.

The algorithm used here for deciding whether to avoid further searching is quite simple and could certainly be improved upon. A good next step would be to gather statistics over a larger set of video sequences and use that data to determine optimal parameters of a predictor, for example to predict the loss incurred by skipping the full search.

There is no reason to assume that the correct order to search is the one used here or that only a single decision should be made. For example, one might first try 16 x 16, 8 x 16, and 16 x 8 partitionings, then based on those costs make a decision whether to try 8 x 8. If the decision is made to continue, after determining the cost of the 8 x 8 partitioning another decision might be made about whether to continue to the smaller partitionings and so on.

References

ITU-T Rec. H.264 / ISO/IEC 11496-10, "Advanced Video Coding", Final Committee Draft, Document JVT-G050, March 2003

Iain E G Richardson, "H.264 / MPEG-4 Part 10 White Paper: Prediction of Inter Macroblocks in P-slices", http://www.vcodex.fsnet.co.uk/h264_interpred.pdf

Karsten Suhring, editor, H.264/AVC reference software, http://bs.hhi.de/~suehring/tml/

TML Project Video Sequences, http://kbs.cs.tu-berlin.de/~stewe/vceg/sequences.htm

Threshold	Execution time in seconds	% execution time savings	% of macroblocks with reduced search	average loss per macroblock in internal cost measure	SNR Y, dB	SNR U, dB	SNR V, dB	bitrate (kb/s)
infinite (full search)	155.74	0	0	0	35.67	39.01	40.63	89.21
120	152.03	2.4	16.5	2.62	35.67	38.99	40.67	89.69
100	152.87	1.8	24.6	5.85	35.63	39.00	40.62	89.42
80	148.07	4.9	33.25	8.96	35.64	38.98	40.67	90.58
70	147.54	5.3	37.4	11.2	35.59	38.98	40.66	91.71
60	147.27	5.4	41.6	14.1	35.60	38.98	40.69	90.83
50	147.06	5.6	46.3	17.2	35.59	38.98	40.66	91.71
40	146.50	5.9	50.4	20.2	35.60	38.95	40.68	93.69
30	144.42	7.3	54.8	23.9	35.62	38.95	40.66	92.87
20	142.91	8.2	60	27.9	35.62	38.94	40.67	95.44