Other methods have been discussed in the industry, known generally as 2D in conjunction with metadata (2D + M). The basic concept here is to transmit 2D images and to capture the stereoscopic data from the “other eye” image in the form of an additional package, the metadata; the metadata is transmitted as part of the video stream (Fig. 3.12). This approach is consistent with MPEG multiplexing; therefore, to a degree, it is compatible with embedded systems. The requirement to transmit the metadata increases the bandwidth needed in the channel: the added bandwidth ranges from 60% to 80%, depending on quality goals and the techniques used. As implied, a set-top box employed in a traditional 2D environment would be able to use the 2D content, ignoring the metadata, and properly display the 2D image; in a 3D environment the set-top box would be able to render the 3D signal.
Some variations of this scheme have already appeared. One approach is to capture a delta file that represents the difference between the left and right images.
A delta file is usually smaller than the raw file because of intrinsic redundancies. The delta file is then transmitted as metadata (a simple sketch of this idea is given below). Companies such as Panasonic and TDVision use this approach, which can also be applied to stored media. For example, Panasonic has advanced (and the Blu-ray Disc Association is studying) the use of metadata to achieve a full-resolution 3D Blu-ray Disc standard. A resolution of 1920 × 1080p at 24 fps per eye is achievable. This standard would make Blu-ray Disc a high-quality 3D content (storage) system. The goal was to agree on the standard by early 2010 and have 3D Blu-ray Disc players emerge by the end-of-year shopping season of 2010. Another approach entails transmitting the 2D image in conjunction with a depth map of each scene.
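The sketch below illustrates the delta-file variant just described. It assumes the left and right frames are available as 8-bit arrays, and the function names are purely illustrative rather than any vendor's actual interface.

```python
# Sketch of the delta-file variant of 2D + M: the left view is sent as
# ordinary 2D video, and only the (typically highly redundant) left/right
# difference travels as metadata. Names are illustrative, not a vendor API.
import numpy as np

def make_delta(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Encode the difference between the eye views, offset so it fits in
    an unsigned 8-bit range (clipping stands in for a real residual coder)."""
    diff = right.astype(np.int16) - left.astype(np.int16)
    return np.clip(diff + 128, 0, 255).astype(np.uint8)

def apply_delta(left: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the right view from 2D video + metadata."""
    right = left.astype(np.int16) + (delta.astype(np.int16) - 128)
    return np.clip(right, 0, 255).astype(np.uint8)

# A legacy 2D set-top box simply ignores the delta and displays `left`;
# a 3D-capable receiver applies it to recover the second eye view.
```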
Video Plus Depth (V + D)
As noted above, many 3DTV proposals rely on the basic concept of “stereoscopic” video, that is, the capture, transmission, and display of two separate video streams (one for the left eye and one for the right eye). More recently, specific proposals have been made for a flexible joint transmission of monoscopic color video and associated per-pixel depth information [24, 25]. The V + D representation is the next notch up in complexity.
From this data representation, one or more “virtual” views of the 3D scene can then be generated in real time at the receiver side by means of Depth-Image-Based Rendering (DIBR) techniques [26]. A system such as this provides important features, including backward compatibility with today’s 2D digital TV, scalability in terms of receiver complexity, and easy adaptability to a wide range of different 2D and 3D displays. DIBR is the process of synthesizing “virtual” views of a scene from still or moving color images and associated per-pixel depth information. Conceptually, this novel view generation can be understood as the following two-step process: first, the original image points are re-projected into the 3D world, utilizing the respective depth data; thereafter, these 3D space points are projected into the image plane of a “virtual” camera located at the required viewing position. The concatenation of re-projection (2D to 3D) and subsequent projection (3D to 2D) is usually called 3D image warping in the Computer Graphics (CG) literature and will be derived mathematically in the following paragraph. The signal processing and data transmission chain of this kind of 3DTV concept is illustrated in Fig. 3.13; it consists of four different functional building blocks: (i) 3D content creation, (ii) 3D video coding, (iii) transmission, and (iv) “virtual” view generation and 3D display.
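The following is a minimal sketch of the two-step warping process just described, written with NumPy. It assumes a simple pinhole camera with intrinsic matrix K and a virtual camera displaced purely horizontally by tx; occlusion ordering and hole filling, which a real DIBR renderer must handle, are omitted, and all names are illustrative.

```python
# Minimal sketch of DIBR 3D image warping: (1) re-project each pixel into the
# 3D world using its depth, (2) project the 3D points into the image plane of
# a "virtual" camera displaced horizontally by tx. A pinhole camera with
# intrinsics K is assumed; occlusion ordering and hole filling are omitted.
import numpy as np

def warp_to_virtual_view(color, depth, K, tx):
    """color: HxWx3 image; depth: HxW metric depth; K: 3x3 intrinsic matrix;
    tx: horizontal displacement of the virtual camera (same units as depth)."""
    h, w = depth.shape
    z = np.maximum(depth, 1e-6)                 # guard against zero depth
    K_inv = np.linalg.inv(K)
    virtual = np.zeros_like(color)

    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])   # homogeneous pixel coords
    points = (K_inv @ pix) * z.ravel()          # step 1: 2D -> 3D (re-projection)
    points[0] -= tx                             # express points in the virtual camera frame
    proj = K @ points                           # step 2: 3D -> 2D (projection)
    u2 = np.round(proj[0] / proj[2]).astype(int)
    v2 = np.round(proj[1] / proj[2]).astype(int)

    ok = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    virtual[v2[ok], u2[ok]] = color[v.ravel()[ok], u.ravel()[ok]]
    return virtual                              # disocclusion holes remain unfilled
```

The holes left where previously occluded background becomes visible are typically filled by interpolation or inpainting in a practical renderer.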
As can be seen in Fig. 3.14, a video signal and a per-pixel depth map are captured and eventually transmitted to the viewer. The per-pixel depth data can be considered a monochromatic luminance signal with a restricted range spanning the interval [Znear, Zfar], representing, respectively, the minimum and maximum distance of the corresponding 3D point from the camera. The depth range is quantized with 8 bits, with the closest point having the value 255 and the most distant point having the value 0. Effectively, the depth map is specified as a grayscale image; these values can be fed into the luminance channel of a video signal, with the chrominance set to a constant value. In summary, this representation uses a regular video stream enriched with so-called depth maps providing a Z-value for each pixel. Note that V + D enjoys backward compatibility because a 2D receiver will display only the V portion of the V + D signal. Studies by the European ATTEST (Advanced Three Dimensional Television System Technologies) project indicate that depth data can be compressed very efficiently and still be of good quality; namely, the depth data need only around 20% of the bitrate that would otherwise be needed to encode the color video (the qualitative results were confirmed by means of subjective testing). This approach can be placed in the category of Depth-Enhanced Stereo (DES).
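As a concrete illustration of the 8-bit depth-map convention described above, the sketch below maps metric depth in [Znear, Zfar] to grayscale values, with the nearest point at 255 and the farthest at 0. A simple linear mapping is assumed here; practical systems sometimes quantize 1/Z instead, so this is illustrative only.

```python
# Illustration of the 8-bit depth-map convention: the nearest point (Znear)
# maps to 255 and the farthest (Zfar) to 0. A linear mapping is assumed here;
# practical systems sometimes quantize 1/Z instead.
import numpy as np

def depth_to_gray(z, z_near, z_far):
    """Quantize metric depth z (HxW array) into an 8-bit grayscale map."""
    g = 255.0 * (z_far - z) / (z_far - z_near)
    return np.clip(np.round(g), 0, 255).astype(np.uint8)

def gray_to_depth(g, z_near, z_far):
    """Recover approximate metric depth from the 8-bit map at the receiver."""
    return z_far - (g.astype(np.float64) / 255.0) * (z_far - z_near)

# The 8-bit map can be carried in the luminance channel of a regular video
# signal, with the chrominance components held at a constant value.
```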
A stereo pair can be rendered from the V + D information by 3D warping at the decoder. A general warping algorithm takes a layer and deforms it in many ways: for example, twisting it along any axis, bending it around itself, or displacing it arbitrarily with a displacement map. The generation of the stereo pair from a V + D signal at the decoder is illustrated in Fig. 3.15. This reconstruction affords extended functionality compared to CSV because the stereo image can be adjusted and customized after transmission. Note that, in principle, more than two views can be generated at the decoder, thus enabling support of multi-view displays (and, within reason, head-motion parallax viewing).
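To make the decoder-side rendering concrete, the following sketch generates a left/right pair from a color image and its metric depth (e.g., recovered with a mapping such as gray_to_depth above) by shifting pixels horizontally according to disparity. A parallel camera setup is assumed, and the focal length and baseline values are illustrative; because they are applied at the receiver, they can be tuned to adjust the stereo effect after transmission, which is the extended functionality noted above.

```python
# Sketch of decoder-side stereo-pair generation from V + D, assuming a
# parallel camera setup in which disparity = f * b / Z. The focal length f
# (in pixels) and baseline b are illustrative and, since they are applied at
# the receiver, can be tuned to customize the stereo effect after transmission.
import numpy as np

def render_eye_view(color, depth_z, f, b, sign):
    """Shift each pixel horizontally by half its disparity (sign = +1 or -1)."""
    h, w, _ = color.shape
    out = np.zeros_like(color)
    disparity = f * b / np.maximum(depth_z, 1e-6)             # in pixels
    for y in range(h):
        shift = sign * np.round(disparity[y] / 2).astype(int)
        x_new = np.clip(np.arange(w) + shift, 0, w - 1)
        out[y, x_new] = color[y]
    return out                                                # holes need in-painting

def render_stereo_pair(color, depth_z, f=1000.0, b=0.06):
    """depth_z is metric depth, e.g., recovered with gray_to_depth above."""
    return (render_eye_view(color, depth_z, f, b, +1),
            render_eye_view(color, depth_z, f, b, -1))
```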
V + D enjoys backward compatibility, compression efficiency, extended functionality, and the ability to use existing coding algorithms. It is only necessary to specify high-level syntax that allows a decoder to interpret two incoming video streams correctly as color and depth. The specifications “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information” and “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data” enable 3D video based on V + D to be deployed in a standardized fashion by broadcasters interested in adopting this method.
It should be noted, however, that the advantages of V + D over CSV come at the cost of increased complexity for both sender and receiver. At the receiver side, view synthesis has to be performed after decoding to generate the second view of the stereo pair. At the sender (capture) side, the depth data have to be generated before encoding can take place. This is usually done by depth/disparity estimation from a captured stereo pair; these algorithms are complex and still error-prone. Thus, in the near future, V + D might be more suitable for applications with playback functionality, where depth estimation can be performed offline on powerful machines, for example in a production studio or a home 3D editing suite,
enabling viewing of downloaded 3D video clips and 3DTV broadcasting [16].
Multi-View Video Plus Depth (MV + D)
There are some advanced 3D video applications that are not properly supported by any existing standards and for which work by the ITU-R or ISO/MPEG is needed. Two such applications are given below:
- wide-range multi-view autostereoscopic displays (say, nine or more views);
- FVV, free viewpoint video (an environment where the user can choose his/her own viewpoint).
These 3D video applications require a 3D video format that allows rendering a continuum and/or a large number of output views at the decoder. There really are no available alternatives: MVC, discussed above, does not support a continuum and becomes inefficient for a large number of views; and we noted that, while V + D could in principle generate more than two views at the decoder, in practice it supports only a limited continuum around the original view (artifacts increase significantly with the distance of the virtual viewpoint). In response, MPEG started an activity to develop a new 3D video standard that would support these requirements.
The MV + D concept is illustrated in Fig. 3.16. MV + D involves a number of complex processing steps: (i) depth has to be estimated for the N views at the capture point, and then (ii) N color and N depth video streams have to be encoded and transmitted. At the receiver, the data have to be decoded and the virtual views have to be rendered (reconstructed).
As implied just above, MV + D can be used to support multi-view autostereoscopic displays in a relatively efficient manner. Consider a display that supports nine views (V1–V9) simultaneously (e.g., a lenticular display manufactured by Philips; Fig. 3.17). From any given position, a viewer sees only a stereo pair of these views. Transmitting all nine display views directly (e.g., by using MVC) would be taxing from a bandwidth perspective; in this illustrative example, only three original views (V1, V5, and V9), along with the corresponding depth maps D1, D5, and D9, are in the decoded stream, and the remaining views can be synthesized from these decoded data by using DIBR techniques.
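The following schematic sketch shows how such a nine-view display could be fed from the three decoded view-plus-depth pairs, reusing a DIBR warp such as the warp_to_virtual_view sketch given earlier. The nearest decoded view is warped to each missing position; blending of two reference views and hole filling, which a practical renderer would add, are omitted, and the names and spacing parameter are assumptions.

```python
# Schematic of feeding a nine-view display from the three decoded pairs
# (V1, D1), (V5, D5), (V9, D9). Reuses the warp_to_virtual_view sketch from
# the DIBR discussion above; blending of two references and hole filling,
# which a practical renderer would add, are omitted. Parameters are assumed.
import numpy as np

def synthesize_nine_views(decoded, K, view_spacing):
    """decoded: dict {1: (V1, D1), 5: (V5, D5), 9: (V9, D9)};
    view_spacing: camera baseline between adjacent display views."""
    views = {}
    for n in range(1, 10):
        if n in decoded:
            views[n] = decoded[n][0]                 # transmitted views pass through
            continue
        ref = 1 if n < 5 else 5 if n < 9 else 9      # nearest decoded view to the left
        color, depth = decoded[ref]
        tx = (n - ref) * view_spacing                # displacement of the virtual camera
        views[n] = warp_to_virtual_view(color, depth, K, tx)
    return views                                     # V1..V9 for the lenticular display
```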
Layered Depth Video (LDV)
LDV is a derivative of, and also an alternative to, MV + D. LDV is believed to be more efficient than MV + D because less information has to be transmitted; however, additional error-prone vision processing tasks are required that operate on partially unreliable depth data. These efficiency assessments remain to be fully validated as of press time.
LDV uses (i) one color video with an associated depth map and (ii) a background layer with an associated depth map; the background layer includes image content that is covered by foreground objects in the main layer. This is illustrated in Figs 3.18 and 3.19. The occlusion information is constructed by warping two or more neighboring V + D views from the MV + D representation onto a defined center view. The LDV stream or substreams can then be encoded by a suitable LDV coding profile.
Note that LDV can be generated from MV + D by warping the main layer image onto the other contributing input images (e.g., an additional left and right view). By subtraction, it is then determined which parts of the other contributing input images are covered in the main layer image; these parts are then assigned as residual images and transmitted, while the rest is omitted [16].
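A conceptual sketch of this residual extraction is given below: the main (center) layer is warped onto a contributing side view, and side-view pixels that the warp fails to predict are kept as the residual. It reuses the illustrative warp_to_virtual_view function from the DIBR sketch above, and the threshold is an assumption; as noted, real systems must cope with partially unreliable depth data.

```python
# Conceptual sketch of LDV residual extraction from MV + D: the main (center)
# layer is warped onto a side view, and side-view pixels that the warp fails
# to predict are kept as the residual (occlusion) layer. Reuses the
# warp_to_virtual_view sketch above; the threshold is illustrative, and real
# systems must cope with partially unreliable depth data.
import numpy as np

def extract_residual(main_color, main_depth, side_color, K, tx, threshold=20):
    predicted = warp_to_virtual_view(main_color, main_depth, K, tx)
    error = np.abs(predicted.astype(np.int16) - side_color.astype(np.int16)).sum(axis=2)
    explained_by_main = error < threshold       # already represented in the main layer
    residual = side_color.copy()
    residual[explained_by_main] = 0             # omit those parts; keep only occlusion data
    return residual                             # only the residual layer is transmitted
```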
Figure 3.18 is based on a recent presentation at the 3D Media Workshop, Heinrich Hertz Institut (HHI) Berlin, October 15–16, 2009 [27, 28]. LDV provides a single view with depth and occlusion information. The goal is to achieve automatic acquisition of 3DTV content, especially to obtain depth and occlusion information from video and to extrapolate a new view without error.
Table 3.2, composed from technical details in Ref. [29], provides a summary of the issues associated with the various representation methods.