3D Mastering Methods

For the purpose of this discussion we define a mastering method as the mechanism used for representing a 3D scene in the video stream that will be compressed, stored, and/or transmitted. Mastering standards are typically used in this process.

As alluded to earlier, a 3D mastering standard called “3D Master” is being defined by SMPTE. The high-resolution 3D master file is used to generate other files appropriate for various channels; for example, theater release, media (DVD, Blu-ray Disc) release, and broadcast (e.g., satellite, terrestrial broadcast, cable TV, IPTV, and/or Internet distribution). The 3D Master comprises two uncompressed files (left- and right-eye files), each of which has the same file size as a 2D video stream. Formatting and encoding procedures have been developed to be used in conjunction with already-established techniques to deliver 3D programming to the home over a number of distribution channels.

In addition to normal video encoding, 3D mastering/transmission requires additional encoding/compression, particularly when attempting to use legacy delivery channels. Additional encoding schemes for CSV include the following [6]: (i) spatial compression and (ii) temporal multiplexing.

Frame Mastering for Conventional Stereo Video (CSV)

CSV is the most well-developed and the simplest 3D video representation. This approach deals only with (color) pixels of the video frames captured by the two cameras. The video signals are intended to be directly displayed using a 3D display system. Figure 3.5 shows an example of a stereo image pair: the same scene is visible from slightly different viewpoints. The 3D display system ensures that a viewer sees only the left view with the left eye and the right view with the right eye to create a 3D depth impression. Compared to the other 3D video formats, the algorithms associated with CSV are the least complex.

A straightforward way to utilize existing video codecs (and infrastructure) for stereo video transmission is to apply one of the interleaving approaches illustrated in Fig. 3.6. A practical challenge is that there is no de facto industry standard, so a downstream decoder has no agreed way of knowing what kind of interleaving the encoder used. However, there is an industry movement toward an over/under approach (also called top/bottom spatial compression).

Figure 3.5. A stereo image pair. (Note: The difference between the left-eye and right-eye views is greatly exaggerated in this and the figures that follow for pedagogical purposes.)

Figure 3.6. Stereo interleaving formats: (a) time-multiplexed frames; (b) spatially multiplexed side-by-side; and (c) spatially multiplexed over/under.

Spatial Compression. When an operator seeks to deliver 3D content over a standard video distribution infrastructure, spatial compression is a common solution. Spatial compression allows the operator to deliver a stereo 3D signal (now called frame-compatible) within a standard 2D HD video signal, using the same amount of channel bandwidth. Clearly, this entails a loss of resolution (for both the left and the right eye). The approach is to pack two images into a single frame of video; the receiving device (e.g., set-top box) will, in turn, display the content in such a manner that a 3D effect is perceived (these images cannot be viewed on a standard 2D TV monitor). There are a number of ways of combining two frames; the two most common are the side-by-side combination and the over/under combination (Fig. 3.6). The two images are reformatted at the compression/mastering point to fit into the standard frame. The combined frame is then compressed by standard methods and delivered to a 3D-compatible TV, where it is reformatted/rendered for 3D viewing.

The question is how to take two frames, a left frame and a right frame, and reformat them to fit side-by-side or over/under in a single standard HD frame. Sampling is involved but, as noted, with some loss of resolution (50%, to be exact). One approach is to take alternate columns of pixels from each image and pack the remaining columns in the side-by-side format. Another approach is to take alternate rows of pixels from each image and pack the remaining rows in the over/under format (Fig. 3.7).
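The column/row decimation and packing just described can be sketched with NumPy; the 1080p dimensions and array-based packing here are illustrative assumptions, not part of any mastering standard:

```python
import numpy as np

def pack_side_by_side(left, right):
    """Keep alternate pixel columns of each view, then place the two
    half-width images next to each other in one full-size frame."""
    half_l = left[:, ::2]    # every other column of the left view
    half_r = right[:, ::2]   # every other column of the right view
    return np.concatenate([half_l, half_r], axis=1)

def pack_over_under(left, right):
    """Keep alternate pixel rows of each view, then stack the two
    half-height images top/bottom in one full-size frame."""
    half_l = left[::2, :]    # every other row of the left view
    half_r = right[::2, :]
    return np.concatenate([half_l, half_r], axis=0)

# A 1080p stereo pair becomes one frame-compatible 1080p frame
left = np.zeros((1080, 1920, 3), dtype=np.uint8)
right = np.ones((1080, 1920, 3), dtype=np.uint8)
assert pack_side_by_side(left, right).shape == (1080, 1920, 3)
assert pack_over_under(left, right).shape == (1080, 1920, 3)
```

Either way, each eye retains only half of its original samples, which is the 50% resolution loss noted above.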

Studies have shown that the eye is less sensitive to loss of resolution along a diagonal direction in an image than in the horizontal or vertical direction. This allows the development of encoders that optimize subjective quality by sampling
each image in a diagonal direction. Other encoding schemes are also being developed that attempt to retain as much of the perceived/real resolution as possible. One approach that has been studied for 3D is quincunx filtering. A quincunx is a geometric pattern composed of five coplanar points, four of them forming a square (or rectangle) and a fifth point at its center, like a checkerboard. Quincunx filter banks are 2D two-channel nonseparable filter banks that have been shown to be an effective tool for image coding applications. In such applications, it is desirable for the filter banks to have perfect reconstruction, linear phase, high coding gain, good frequency selectivity, and certain vanishing-moment properties
[7–12]. Almost all hardware devices for digital image acquisition and output use square pixel grids. For this reason, and for ease of computation, all current image compression algorithms (with the exception of mosaic image compression for single-sensor cameras) operate on square pixel grids. The optimal sampling scheme in the two-dimensional image space is claimed to be the hexagonal lattice; unfortunately, a hexagonal lattice is not straightforward to implement in hardware or software. A compromise, therefore, is to use the quincunx lattice, a sublattice of the square lattice, as illustrated in Fig. 3.7. The quincunx lattice has a diamond tessellation that is closer to the optimal hexagonal tessellation than that of the square lattice, and it can be easily generated by down-sampling conventional digital images without any hardware change. Because of this, the quincunx lattice is widely adopted by single-sensor digital cameras to sample the green channel; also, quincunx partitioning of an image

Figure 3.7. Selection of pixels in the (a) side-by-side, (b) over/under, and (c) quincunx approaches. (Note: Either the black or the white dots can comprise the lattice.)

was recently studied as a means of multiple-description coding [13]. When using quincunx filtering, the higher-quality sampled images are encoded and packaged in a standard video frame (either with the side-by-side or over/under arrangement). The encoded and reformatted images are compressed and distributed to the home using traditional means (cable, satellite, terrestrial broadcast, and so on).
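A minimal sketch of quincunx lattice decimation follows (lattice selection only; a real quincunx filter bank also applies nonseparable filtering before down-sampling, which is omitted here):

```python
import numpy as np

def quincunx_sample(img):
    """Keep the pixels lying on a quincunx (checkerboard) lattice and
    pack them into a half-width image, one row at a time."""
    h, w = img.shape
    rows, cols = np.indices((h, w))
    mask = (rows + cols) % 2 == 0        # the "black squares" of the board
    return img[mask].reshape(h, w // 2)  # each row retains w/2 samples

# Toy 8x8 grayscale image with values 0..63
img = np.arange(8 * 8).reshape(8, 8)
half = quincunx_sample(img)
assert half.shape == (8, 4)
# Row 0 keeps columns 0, 2, 4, 6; row 1 keeps columns 1, 3, 5, 7; and so on
assert list(half[0]) == [0, 2, 4, 6]
assert list(half[1]) == [9, 11, 13, 15]
```

The packed half-width result can then be placed in a side-by-side or over/under frame exactly as in the simpler decimation schemes.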

Temporal Multiplexing. Temporal (time) multiplexing doubles the frame rate to 120 Hz to allow sequential presentation of the left-eye and right-eye images within the normal 60-Hz time frame. This approach retains full resolution for each eye but requires a doubling of bandwidth and storage capacity. In some cases spatial compression is combined with time multiplexing; however, this is more typical of an in-home format than a transmit/broadcast format. For example, Mitsubishi’s 3D DLP TV accepts quincunx-sampled (spatially compressed) images clocked at 120 Hz as input.
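At its core, temporal multiplexing is frame interleaving; a minimal sketch, with frames represented by placeholder labels:

```python
def temporal_multiplex(left_frames, right_frames):
    """Interleave left- and right-eye sequences frame by frame:
    two 60-Hz streams become one 120-Hz stream, full resolution kept."""
    out = []
    for l, r in zip(left_frames, right_frames):
        out.append(l)  # presented to the left eye
        out.append(r)  # presented to the right eye
    return out

# Three stereo frame pairs become six sequential frames at double rate
seq = temporal_multiplex(["L0", "L1", "L2"], ["R0", "R1", "R2"])
assert seq == ["L0", "R0", "L1", "R1", "L2", "R2"]
```

The output sequence has twice as many frames as either input stream, which is exactly why bandwidth and storage double.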

Compression for Conventional Stereo Video (CSV)

Typically, the compression algorithms separately encode and decode the multiple video signals, as shown in Fig. 3.8a; this is also called simulcast. The drawback is that the amount of data increases compared to 2D video; however, the resolution can be reduced as needed to mitigate this requirement. Table 3.1 summarizes the available methods.

It turns out that the MPEG-2 standard includes a Multi-View Profile (MVP) that allows efficiency to be increased by combining temporal and inter-view prediction, as illustrated in Fig. 3.8b. H.264/AVC was enhanced a few years ago with a stereo Supplemental Enhancement Information (SEI) message that can also be used to implement such prediction (Fig. 3.8b). Although not designed for stereo-view video coding, the H.264 coding tools can be arranged to take advantage of the correlations between the pair of views of a stereo-view video, and provide very reliable and efficient compression performance as well as stereo/mono-view scalability [14].

For more than two views, the approach can be extended to Multi-view Video Coding (MVC), as illustrated in Fig. 3.9 [15]; MVC uses inter-view prediction by referring to pictures from neighboring views. MVC has been standardized by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG. MVC enables efficient encoding of sequences captured simultaneously from multiple cameras into a single video stream. MVC is currently the most efficient approach for coding stereo and multi-view content; for two views, the performance achieved by the H.264/AVC stereo SEI message and MVC is similar [16]. MVC is also expected to become a new MPEG video coding standard for the realization of future video applications such as 3D Video (3DV) and Free Viewpoint Video (FVV). The MVC group in the JVT has chosen the

Figure 3.8. Stereo video coding with combined temporal/inter-view prediction: (a) traditional MPEG-2/MPEG-4 applied to 3DTV; (b) MPEG-2 Multi-View Profile and H.264/AVC SEI message.

TABLE 3.1 Compression Methods

H.264/AVC-based MVC method as the MVC reference model, since this method showed better coding efficiency than H.264/AVC simulcast coding and the other methods that were submitted in response to the call for proposals made by the MPEG [15, 17–20].

Some new approaches are also emerging that promise to improve efficiency, especially for bandwidth-limited environments. One new approach draws on binocular suppression theory, deliberately using disparate image quality in the left- and right-eye views. Viewer tests have shown that (within reason), if one of the images of a stereo pair is degraded, the perceived overall quality of the stereo video will be dominated by the higher-quality image [16, 21, 22]. This concept
is illustrated in Fig. 3.10. Applying this concept, one could code the right-eye image at less than the full resolution of the left eye; for example, downsampling it to half or quarter resolution (Fig. 3.11). Some call this asymmetrical

Figure 3.9. Multi-view video coding with combined temporal/inter-view prediction.

Figure 3.10. Use of binocular suppression theory for more efficient coding.

quality. Studies have shown that asymmetrical coding with cross-switching at scene cuts (namely, alternating the eye that gets the blurrier image) is a viable method for bandwidth savings [23]. In principle this should provide comparable overall subjective stereo video quality while reducing the bitrate: if one were to adopt this approach, 3D video functionality could be added at an overhead of, say, 25%–30% over the 2D video by coding the right view at quarter resolution.
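The quarter-resolution figure can be sanity-checked with simple arithmetic, assuming “quarter resolution” means halving each dimension; the actual bitrate overhead depends on codec efficiency, so the cited 25%–30% range is consistent with this raw-sample estimate:

```python
# Luma samples per frame for a full-resolution 1080p view
full_view = 1920 * 1080                   # left eye, full resolution
quarter_view = (1920 // 2) * (1080 // 2)  # right eye at "quarter resolution"

# Raw-sample overhead of adding the degraded right view
overhead = quarter_view / full_view
assert overhead == 0.25                   # 25% extra raw data before coding
```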

