Acquiring the 3D structure of the bone is the first step in registration. The following sections discuss the different features of the acquisition system, stereo vision, and the available calibration methods. Implementations of the systems in this work will also be addressed.

The Image Acquisition System

The image acquisition system comprises the cameras, the video capture card, and the illumination mechanism. Figure 3.1 shows the system in the context of the complete setup. The illumination mechanism is application dependent, and will be discussed accordingly. The next sections summarize some of the most important features of the acquisition system.

Camera Characteristics

Two Charged Coupled Devices (CCD) cameras are used for the acquisition of the 3D structure of the vertebra. A CCD camera has the advantage of delivering a fully digital image, with a frame rate of up to 30 per second. However, several other characteristics influence the quality of the delivered image. The following sections describe some of these characteristics.

Resolution and Capture Rate

The resolution of a CCD camera is determined by the number of Charged Coupled Devices used in it. A CCD camera may have a resolution of up to 2048x2048 pixels, as is the case of the GEN 7 by Princeton Scientific Instruments. The CCD camera used in the experimentation is a 640x480 Creative Sharevision, using a 514x491 CCD grid.

The capture rate depends on the scanning frequency of the camera. In general, CCD cameras are manufactured to reach up to 30 frames/second, which corresponds to the maximum human-vision capture frequency. The Creative camera delivers 30 frames/second.

The Lens

The camera lens, or apparatus of lenses, focuses the light on the CCD array that converts it to a digital image. In CCD cameras, the lens is a major contributor to the degradation of the fidelity of the image, because it suffers from several defects that distort the image on the CCD array. Lens aberration will be addressed more thoroughly in the sections 3.3.2.

Finally, the lens needs to be focused every time the range of the target object is changed; moreover, for the same image, it has a finite depth of focus, where objects outside a certain range will be blurred.

Light Balancing

Several mechanisms exist for controlling the amount of light that enters the camera, and the way it is interpreted. In general, these mechanisms are automated to allow a coherent amount of light into the camera. For instance, in the camera used for this work, electronic shutters are automatically controlled to have speeds from 1/60s to 1/15000s, enabling it to function in a wide range of illumination conditions.

An important mechanism when dealing with colors is white balancing, where the camera has to be calibrated in front of a white board to maintain a good translation of colors under fixed illumination conditions. In the Creative camera, white tracking is done automatically for color temperatures ranging between 2800°K and 6000°K, where the color temperature of light is a measurement of how close it is to white light. Although this should mean that the camera should faithfully reproduce color under illumination conditions ranging from the incandescent lamp (2820-3000°K) to sunlight (»5000°K), the camera fails to do so consistently. This is largely due to the sensitivity of CCD cameras to infrared light.

Noise

As is the case with any sensing devices, internal noise is always present in the final output. The Creative camera insures a 46dB signal to noise ratio, which is satisfactory for the purpose of this work.

Other characteristics

Other characteristics may be present in CCD cameras, such as output impedance and internal synchronization. These however have negligible effects on the proposed 3D acquisition.

Video capture Card

Interfacing a digital camera with the computer often needs a video capture card that is mainly characterized by its capture rate and maximum resolution support. In addition, the video card controls the captured image in a number of ways, the most important of which are described next for the Thruvu® card used in the experimentation.

Resolution and Frame Rate

The resolution of the captured image can be customized through the card settings. Moreover, the frame rate can be varied between 1 and 30 frames/second.

Brightness, Contrast, Hue and Saturation

The brightness and contrast of the image can be adjusted from the card settings, as well as the hue and saturation of the colors. Although this may easily be accomplished using software, when specified in the card settings, it is done by hardware and is therefore much faster.

Stereo Systems

Binocular stereo is a simple and flexible way to obtain 3D information about a scene. However, it is still under investigation and no standard stereo configuration has yet been proposed. The following paragraphs present the most commonly used stereo configurations and discuss the advantages and drawbacks of each.

Parallel Configuration

where P_r-P_l is the called the disparity of the image, and 2h the distance between the camera centers, usually referred to as the baseline.

Clearly, depth is proportional to the disparity and the focal length. Therefore, and since the focal length is fixed, a larger disparity sensitivity will translate into more accurate range recovery. Disparity sensitivity is limited by the image resolution, and can only be increased if the distance between the cameras, or the baseline, is increased. In its turn however, a wide baseline constrains the overlap between the two images, and limits the useful proportion of the image, affecting again the useful resolution and decreasing the sensitivity. This makes the parallel configuration unappealing in most applications.

Converging Configuration

In a converging camera configuration, the overlap between the two images can be maximized without affecting the baseline. On the other hand, converging stereo causes more pronounced obstruction and more involved mathematical processing. Whereas increased obstructions have to be compromised, the mathematical problem can be worked out nearly without any loss of accuracy. The next sections describe the processing needed for dealing with a converging stereo system.

Parallax and Camera Rotation

The most important subtlety when dealing with converging stereo is the difference between parallax and non-parallax motion. The Figure 3.3 illustrates this difference, where in the upper images the camera rotates about its center but does not translate. The images are thus related by a plane projective transformation, and no parallax distortion is present. In the upper left and lower images the camera rotates about its center and translates. The images are no longer related by a plane projective transformation, and motion parallax is evident (Fisher 1997).

When converting a convergent stereo system to a parallel one, the parallax should not be removed, in order to conserve depth information. This is usually done by either reverting one of the cameras to a parallel configuration by the proper rotation, or by incorporating the rotation in the disparity equation. An example of image rectification can be found in Kang et al. (1994), and an example of adjusting the depth equation can be found in Young (1994), where the depth equation 3.1 now becomes:

Z₀ is called the depth of focus, it is the distance from the straight line joining the two camera centers and the point of intersection of the two optical axis (Figure 3.16). The depth of focus is obtained by calibration. Although this is the most widely used depth recovery equation, it is an approximate result that holds as long as the disparity is small. A more general calculation of depth can be obtained using the following equation from Kanatani (1993):

where b is the baseline vector, m and m' are the point vectors joining the camera centers O and O' to the image point P (Figure 3.4), and q is defined as:

Epipolarity

The difference between the epipoles of each image should be accounted for before any further processing can be carried out, since it facilitates the correspondence of left and right pairs. Figure 3.7 shows the epipoles (ep₁ and ep₂) of a point M on the image planes P^F₁and P^F₁, where C₁ & C₂are the camera centers.

Correspondence

The importance of correspondence between points on each image has already been emphasized and the classical approaches to solve the correspondence problem have been summarized in chapter 2, their efficiency is now addressed.

Classical Correlation Method

The main directive behind correlation methods used for stereo correspondence is to match a certain correlation criterion for each pair of points. For this, a window is associated around the position of each point on the second image, where candidates for a matching pixel are evaluated. A large number of correlation functions are currently in use, these are compared in Aschhwander et al. (1992), as well as several windowing mechanisms, such as the adaptive windowing scheme described by (Kanade 1994). More sophisticated schemes that increase robustness by accounting for occlusions can be found, such as the one in Lan (1995).

Nevertheless, all correlation methods are computationally demanding if an accurate match is needed. Specifically, all points should be tested if the most of the two images is to be used, and on each of these points, a large number of operations has to be performed, depending on the correlation method. In general efficiency is around O(w₁w₂nm) with a considerably large overhead, where m and n represent the image size and w₁w₂ the window size. For algorithms that make use of the epipolarity condition, the efficiency is O(w₁nm), in which case w₁ will be is close to n/2 for wide baseline systems.

In the case of the spine bone, classical correlation methods have proved to be unsuitable due to the following two characteristics of the vertebra:

Other methods

Correspondence can be resolved by means other than correlation, namely, by optical flow and Fourier transforms (Weng, 1993), and more recently complex wavelets (Pan 1996). Optical flow and Fourier transform are badly suited to the structure of the bone vertebra, because of the high frequency components introduced by the irregular bone surface. Wavelets have not been explored because of the complexity they introduce to the system.

Color Correlation

The inability to isolate features that are easy to manipulate from the image imposes a large computational burden, to which the following solution is proposed: the bone will be placed under three coplanar colored light sources pointing to it from equally spaced angles. In principle, the discrimination of neighboring points should increase, thereby facilitating the matching process. This method makes use of the white color of the bone and of its very irregular contour, which allows a large color gradient on a small surface. Figure 3.8 introduces the idea of color correlation.

The next chapter is devoted to the description and analysis of the color correspondence scheme.

Camera Calibration

Almost all computer vision calculations rely on an ideal camera model called the pinhole model (Figure 3.9). Moreover, a camera system is usually characterized by its intrinsic parameters that depend solely on the camera structure, and its extrinsic parameters that depend on its position. Calibrating a camera is determining its intrinsic and extrinsic parameters, where the intrinsic parameters will be used to relate the camera model to the perfect pinhole model. The following sections discuss the different parameters and calibration techniques.

Focal Length Calculation

The focal length of the camera is the distance between the CCD array and the center of the lens. Usually, it is specified in the camera data sheet; however, two reasons stand behind the need for recalculating the focal length:

Consequently, experimental computation of a value for the focal length should be addressed before any further work can be carried out with the camera. From the depth equation, it seems relatively easy to calculate f given a set of points with known 3D coordinates (a grid); however, the following complications will inevitably arise.

Lens Aberration

To focus the image on the CCD array, a convergent lens has to be used, and if no other correcting lens (or system of lenses) is introduced, the image may suffer from several distorting effects, the most important of which is the radial-lens distortion. Figure 3.10 (a) shows a distorted image of a grid, and Figure 3.11 (b) the same grid with radial lens correction.

The optical solution to this problem requires a sophisticated set of lens and zooming equipment. The non-optical alternative is to model the distortion and eliminate it from the image. Radial-lens distortion can effectively be modeled by (Wolf 1983):

where r is the distance between a point and the image center, r' is the corrected distance, and k₁, k₂ are constants found by interpolation.

Direct Calculation

A major difficulty in finding k₁ and k₂is to obtain a good set of data to be used for interpolation. The exact focal length has to be known in advance. However, in obtaining the focal length, the radial-distortion was also needed. A proposed solution to this recursive problem is to work with length ratios instead of points. Referring to Figure 3.10, the width of the small square near the center is compared to that of the ones near the edge, this should approximate the ratio of the distances of the centers of the squares from the image center. Figure 3.11 depicts a radial lens correction for a set of sample points using MATLAB®.

An Improved Calculation

The algorithm requires accurate knowledge of the 3D position of two additional reference points, as shown in Figure 3.12. Moreover, only two pairs of points are used for each calculation, and the image is totally reconstructed before the next parameter may be calculated, resulting in a large probability for error build-up.

Robust Validation

An implementation to check for the validity of the interpolated k's is imaging a straight line on one of the borders. The corrected image should also mirror a straight line, regardless of the direction Figure. This is shown in Figure 3.13, where the blue square is the correction of the red one.

This could further be exploited to obtain k₁ and k₂from minimizing a colineation function of several point on a straight line, the procedure is as follows:

This procedure only requires that the calibration board be rigid, and that it spans the entire image. It is not affected by the position or orientation of the board with respect to the camera. Furthermore, the board could even be folded near its center, to allow for the acquisition of calibration points from a wider focal depth, thus further decreasing the sensitivity of the procedure to the current focal length. Such a calibration board is shown in Figure 3.14. The algorithm used for point recognition is get105pt (see Appendix A), which recognizes the points based on their gray scale intensity, and differentiates their relative position on the board from their surrounding color.

Finally, and as is the case with any non-linear optimization techniques, a good initial guess is needed to avoid false local minima. In this case, the coefficients obtained using the first method give consistent convergence. The values obtained for k₁ and k₂ are:

Exact Focal Length Calculation

As described in Kanatani (1993), an estimate of the focal length can be used to get the exact focal length up to measuring precision. The procedure relies on properties of the vanishing points, as shown in Figure 3.15.

The importance of this calculation length lies in the fact that the error caused by lens aberration and misalignments can now be lumped in the approximate focal length. The above procedure was tried for several sets of data, always yielding the same value of f = 403.537.

Camera Position Estimation

Some mathematical relations can be applied to directly solve for the camera position. An example of such an approach has already been introduced in 3.2.2.2, where the roll, pitch and yaw of the camera are calculated with respect to a reference board. However, the presented calculations, except for the camera roll, assume knowledge of the distance to the center of the board. An approach that does not use the distance from the calibration object is next proposed.

Starting from a parallel configuration, the two camera are tilted inwards to obtain a convergent configuration, this stereo configuration (Figure 3.16) is widely used because the depth calculations could be approximated by the depth of focus equation discussed in 3.2.2.1.

Consider a horizontal bar of known length placed between the two cameras, with the fixation point on its center. The tilt angle could then be calculated as described next:

Referring to Figure 3.17, and keeping in mind that rotating a camera around an object is equivalent to rotating that object around the camera, the dimensions of the bar can be characterized by:

where 2a is length of the image of the bar before the rotation, and b & c after a tilt of y. a and b are the angles between the straight lines relating the camera center to the center of the bar and the camera center to the respective ends of the bar.

Once y is obtained, the exact orientation of the cameras will be known, and depth calculation can be carried out without any approximation.

Finally, the camera roll can also be found using a trigonometric identity described in Bridges (1997).

The above procedure was automated using MATLAB®, and tested on a synthetic image and achieved an absolute accuracy of 0.1%.

Robust Calibration

By robust calibration is meant that the calculated parameters should be acquired in a way that insures the convergence of the model towards an ideal reference model, usualy the pinhole model. The motivation behind robust calibration, as is the case with most non-linear models, is that the calculated parameters cannot be easily uncoupled from the model nonlinearities. Moreover, in calculating parameters separately, the overall error of the system will most likely be amplified by the error of each parameter.

Robust calibration guarantees that the parameter will be such that the overall system meets the ideal specifications set for it. The parameters would converge to values that do not necessarily correspond to their true values, but to values that balance the errors in other parameters, and account for any un-modeled non-linearity. Practically, this approach delivers very satisfactory results, unless active vision is used, where knowledge of the exact parameters becomes necessary, as argued by Shih et al. (1996).

Mathematical Preliminaries

The Pinhole Model

Moreover, and as is usually the case, if the camera coordinate system is scaled by k_u in the u direction and k_v in the v direction, in addition to an origin translation of (u₀, v₀), then the previous equation would be replaced by:

The above matrix represents the mapping between a 3D coordinate system and the image coordinate system. It only depends on the internal characteristics of the camera, or its intrinsic parameters. Note that the camera is still assumed to have no distortion caused by its lenses or by the misallignment of the CCD array. Modeling these deformations is possible by incorporating appropriate factors into the intrinsic matrix. However, these would be limited to a first order lens aberration correction. An alternative for correcting these aberrations has already been discissed in section 3.3.2.

The Essential Matrix

If the coordinate system to be used is not the standard coordinate system, and the new one is found by rotating the standard coordinate system by R, and translating it by t, as shown in Figure 3.18, then the relation between the new 3D coordinates and the image coordinates is:

The second matrix is calculated from the camera extrinsic parameters, the essential matrix is defined as:

The Fundamental Matrix

Finally, and referring again to Figure 3.18, the relationships between the image m₁ of point M on the first camera, and m2 on the second camera is:

where the product of the three matrices is called the fundamental matrix. The importance of the previous result lies in the fact that it could be used to correlate epipolar points. An example of epipolar points was already introduced in Figure 3.7, where F and F^Trelate ep₁(m₂) to ep₂(m₁).

Implementation

Variations in the above approaches are also possible, such as setting a number of parameters to reduce the number of simultaneously estimated parameters. When possible, this significantly reduces the computational burden of minimizing a non-linear function of several parameters. Moreover, some approaches use a calibration object of known dimension to facilitate the processing. The following sections present some of these techniques.

Least Square estimation of extrinsic parameters

If no calibration object is to be used, the extrinsic parameters can still be obtained given the intrinsic parameters and a set of correspondence points with coordinates m and m'. The coordinates of the points are stored as [x y f], where f is the focal point. Moreover, and since the set of point pairs is obtained by using a certain correlation technique that inevitably introduces some error, a least square approach is used. An example of such a scheme is developed in Kanatani (1993):

The above procedure relies solely on the epipolarity property to estimate up to a scaling factor the camera extrinsic parameters. It minimizes the essential matrix introduced in 3.4.1.2, which has 8 degrees of freedoms (6 for the rotation, 3 for the translation and 1 constrained by projective geometry). The most important aspect of the procedure, as is the case with all robust calibration methods, is the way non-linear minimization is implemented: it uses variations of the generalized Moore-Penrose inverse to solve for least square equalities. The procedure has been automated using MATLAB®, and tested on a synthetic object shown in Figure 3.19.

Although the delivered results where satisfactory, the procedure is very sensitive to noise (as confirmed by Zhang et al. 1995). Moreover, and because the focal length is explicitly used in all the calculations, the procedure is also very sensitive to the focal length, as shown in Figure 3.21. Finally, the procedure does not account for the error introduced by lens aberrations, thus the correction for radial-lens distortion had to be carried out independently before the calibration.

Finally, it should be noted that if non-linear minimization is added, the procedure delivers better results in the presence of noise. The non-linear estimation used is that embedded in MATLAB®, and is based on a Nelder-Med type simplex search.

Tsai's Algorithm

An algorithm that estimates the intrinsic and extrinsic parameters in the same time was introduced by Tsai (1987). It minimizes in a least square sense the fundamental matrix in for pair points from each image. A recent implementation under MATLAB® was carried out by Heikkiliä based on Heikkiliä and Silvén (1996). This implementation uses a Levenberg-Marquardt optimization for non-linear minimization. The drawbacks of the Tsai algorithm are that it needs detailed dimensions about the CCD array, and that it models only first order radial lens distortion. A calibration example will next be presented. Figure 3.22 shows the calibration board with the recognized points, and Figure 3.23 the location of the points in 3D.

The algorithm failed to converge for several sets of calibration points. However, when radial lens distortion was corrected before the calibration was performed, the algorithm converged with a high standard deviation. Nevertheless, the obtained results where totally inaccurate. The reasons for this will be elaborated in Chapter 7.

Image Acquisition