Calibration of Kinect-type RGB-D Sensors for Robotic Applications

The paper presents a calibration model suitable for software-based calibration of Kinect-type RGB-D sensors. Additionally, it describes a two-step calibration procedure that requires only a simple checkerboard pattern. Finally, the paper presents a calibration case study, showing that calibration may improve sensor accuracy 3 to 5 times, depending on the anticipated use of the sensor. The results obtained in this study using calibration models of different levels of complexity reveal that depth measurement correction is an important component of calibration, as it may reduce the errors in sensor readings by 50%.


INTRODUCTION
Robot system building is a process that regularly involves a number of design compromises. System integrators are often faced with the problem of making a cost-effective system that at the same time provides high accuracy and rich sensing capabilities. Hence, there is an understandable interest in employing consumer-targeted sensing devices, with Microsoft Kinect being a notable example. Thanks to its mass production and thus low price, Kinect is perceived as an attractive option in many robotic applications.
Since the initial release of Kinect XBOX 360 in 2010 [1], several variants of the sensor have appeared. In 2012, Microsoft presented the enhanced Kinect for Windows [2]. At the same time, ASUSTeK and PrimeSense used the same technology to develop 3D vision sensors with similar features [3,4], so that it is now possible to speak of a generation of consumer-grade 3D sensors.
Kinect and similar 3D sensors consist of a color (RGB) camera and a depth (D) sensor. Generally, they are low-accuracy and low-precision devices. Diverse studies, including this one, show that their accuracy is on the order of 2-3%. At a distance of 4 m from the sensor, this corresponds to an RMS error on the order of 10 cm. This level of accuracy is quite satisfactory in, e.g., human interaction applications. However, it may be unsuitable for some specific robotic uses (e.g., indoor navigation or fine manipulation).
Accuracy can be improved by software correction of sensor outputs. The correction is based on a specific calibration model whose parameters are identified during the calibration process. The calibration procedure consists, in essence, of collecting sensor outputs and comparing them to reference data; it assumes the use of a special calibration rig, i.e., an object of precisely known dimensions, and/or a high-precision measurement tool.
This paper contributes to software-based techniques for enhancing Kinect sensor accuracy in three ways. First, it presents a calibration model suitable for software calibration of Kinect-type sensors. Second, it describes a two-step calibration procedure recently proposed by this author [5,6] for identification of the model parameters. The procedure is simple in the sense that it makes use of only a commonly adopted camera calibration tool (i.e., a checkerboard) and does not require additional specific calibration objects or external measurement devices. Finally, the paper contains a calibration example involving two use cases: one in which only the depth sensor is employed, and another in which joint use of both the RGB camera and the depth sensor is assumed. The case study covers calibration and accuracy analysis of the Kinect XBOX 360 sensor. However, the results and the approach may well be applicable to other similar sensors.
The paper is written as an extended version of the conference paper [5]. Compared to [5], this paper presents a derivation of the Kinect depth measurement correction model and additionally provides a few more details about the calibration process.
The paper is organized as follows. A short review of several representative works on accuracy and calibration of Kinect-type sensors is given in the next section. Section 3 contains a description of the employed sensor calibration model and the model identification procedure. Section 4 presents an analysis of the accuracy that can be expected in different scenarios and with calibration models of different levels of complexity. Section 5 summarizes conclusions on the attained results.

RELATED WORK
Accuracy of Kinect-type sensors has been investigated using different measuring devices. Dutta [7] measured the accuracy of a Kinect sensor using a high-precision seven-camera Vicon 3D motion capture system and reported mean errors of up to 10 cm (standard deviation on the order of 5 cm) for distances of up to 3 m from the sensor. The sensor was previously calibrated, and the calibration involved identification of the parameters of the sensor camera. Gonzales-Jorge et al. [8] investigated the accuracy of uncalibrated Kinect XBOX 360 and Asus Xtion sensors using a specially designed measurement fixture. Their measurements confirm that both accuracy and precision deteriorate with distance. These authors obtained similar accuracy/precision for both examined sensors; the values of RMS error were on the order of 10 mm (standard deviation on the order of 8 mm) at a distance of 2 m.
Several works addressed improvement of sensor accuracy. In earlier works, e.g., Burrus [9] and Zhang and Zhang [10], the main concern was procedures and methods for identification of intrinsic parameters of sensor cameras. Later, the focus moved to calibration of the sensor depth measurement model, with important works by Khoshelham and Elberink [11], Smisek et al. [12], and Herrera C. et al. [13]. These works addressed transformation of disparity maps provided by the sensor into depth maps. However, with the current OpenNI [14] and Microsoft Kinect SDK [15] application programming interfaces, disparity data are already converted into depth using a nominal factory model. To account for this change, a reformulation of the depth calibration model was proposed in [6], advocating the use of a linear relationship between actual and sensor-provided inverse depths.
Calibration of Kinect-type 3D sensors is naturally split into two parts: identification of the parameters of the sensor cameras and identification of the parameters of the depth measurement model. It is natural and simplest to identify camera parameters from raw camera data, but there were also attempts to calibrate the sensor camera directly from depth maps [10] and to perform a joint depth/RGB camera calibration [13]. Although advantageous in principle, this approach suffers from the problem of calibrating the camera using a low-precision depth map: the low precision leads to a need for an extremely large number of measurements. Additionally, the joint calibration, although having the potential to improve the optimal solution, may display an undesired interaction: for example, the data on joint calibration reported in [13] show that a refinement of the depth model can paradoxically lead to an enlargement of the reprojection errors of the RGB camera.
An important issue in depth model calibration is the measurement of depth, as it normally requires either a special 3D calibration rig or an external measuring tool. In [11], depth was measured using a simple measuring tape. In [13], correspondence between the depth map and the RGB camera image was established using the external corners of a calibration table. A similar approach was proposed by Draelos et al. [16]. Geiger et al. [17] proposed a more complex calibration object consisting of multiple checkerboards. Shibo and Qing [18] designed a specific calibration board with regularly spaced drilled holes allowing their easy identification in both RGB images and depth maps.
In this work, depth calibration follows a procedure proposed in [6], where the Kinect RGB camera, once calibrated, is used as a depth measuring device for subsequent calibration of the depth model. Details of both the model and the calibration procedure are given in the next section.

SENSOR CALIBRATION
Kinect-type 3D sensors considered in this work operate as structured light sensors. A sensor (Fig. 1) incorporates a laser infra-red (IR) diode for emitting a dotted light pattern and an IR camera for capturing reflected patterns. Depth is calculated by sensor software on the basis of the disparity of reflected patterns with respect to the reference patterns obtained for a plane placed at a known distance from the sensor. A supplementary RGB camera is added to provide additional information on the color and texture of the surface. Thus, sensor output consists of three data flows: images from the RGB camera, raw images from the IR camera, and depth maps calculated by sensor firmware. Sensor calibration can be viewed as a refinement of correspondences between 3D object coordinates and coordinates in RGB, IR, and depth images. The proposed calibration procedure consists of two steps: the first step comprises calibration of the sensor's RGB/IR cameras, whereas the calibration of the depth model is performed within the second step.
[...] and, finally, the transformation into pixel coordinates. In this work, the functional forms of the distortion function and camera matrix given by (1) and (2) were considered (see, e.g., [19] for details). It is readily seen that the model (1) neglects tangential distortions. This simplification is made since early tests with Kinect revealed that the tangential distortion of its RGB camera was below the achievable level of precision. Besides, the distortion found in the IR camera was extremely low, and neglecting the tangential distortion did not change the total reprojection error much. Thus, the adopted functional form (1) involves radial distortion only.
Identification of the RGB/IR camera parameters was conducted in the Matlab environment, using Bouguet's camera calibration toolbox [19]. A 9 × 8 checkerboard with 30 mm square fields was employed as a calibration rig. A set of ten pairs of close-up RGB/IR images of the checkerboard placed in different positions and orientations was collected and submitted to calibration (Fig. 2). To achieve appropriate light conditions for calibrating the IR camera, the IR emitter was disabled during imaging (ordinary sticky tape was used to cover the projector; the newer Kinect for Windows sensor model allows programmable control over the IR emitter).
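As an illustration of a radial-only camera model of this kind, the sketch below projects 3D points given in the camera frame to pixel coordinates. The polynomial factor 1 + k1*r^2 + k2*r^4 + k3*r^6 follows the convention commonly used by Bouguet-style toolboxes and is an assumption here; it is not necessarily the exact form of model (1). The function name `project_radial` and all numerical values are hypothetical.

```python
import numpy as np

def project_radial(X, K, k):
    """Project 3D points (N, 3) in the camera frame to pixel coordinates (N, 2).

    Assumed model: pinhole projection followed by the radial distortion
    factor 1 + k1*r^2 + k2*r^4 + k3*r^6 (no tangential terms).
    """
    X = np.asarray(X, dtype=float)
    xn = X[:, :2] / X[:, 2:3]                        # normalized image coordinates
    r2 = np.sum(xn**2, axis=1, keepdims=True)        # squared radius
    k1, k2, k3 = k
    xd = xn * (1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.column_stack((fx * xd[:, 0] + cx, fy * xd[:, 1] + cy))

# Hypothetical camera matrix and zero distortion, for illustration only
K = np.array([[580.0, 0.0, 320.0],
              [0.0, 580.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = project_radial([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]], K, (0.0, 0.0, 0.0))
```

With zero distortion coefficients the function reduces to the plain pinhole model, which is a convenient sanity check before identified coefficients are plugged in.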
While acquiring images, both cameras were set to their maximum resolutions, which were 1280 × 960 for the RGB camera and 640 × 480 for the IR camera. The results are summarized in Table 1. The most important difference between the nominal and identified parameters is in the focal length, which differs by about 1.8% for the RGB camera and 2.4% for the IR camera.
With $z_{IR}$ denoting the distance from the sensor, the transformation of a depth map point into 3D coordinates is given by (3) and (4). Besides, the depth map can be extended with texture and color information from the RGB image using the coordinate transformations (5) and (6), where $Z_{RGB}$ is the z-coordinate of $X_{RGB}$. Evaluation of expressions (3)-(6) involves computation of distortions, and therefore it is of interest to explore whether a simpler approximation of distortions is possible and whether the distortions could perhaps be completely neglected. Additional insight into the actual level of distortions introduced by camera optics is provided by Fig. 3, where the amount of distortion, expressed in pixels, is shown on contour lines and the direction of distortion is shown by blue arrows. It is seen that the direction of distortions is opposite for the RGB and IR cameras, so that they effectively affect the net deviation in the same direction. Distortions increase from the center to the periphery of the images: if the object of interest is kept within the central circles in Fig. 3, the distortions introduced by the cameras could be as low as 1 pixel for the RGB camera and 0.25 pixels for the IR camera. Since pixel distortions are converted into 3D deviations by the factor $z/f$, these values correspond to 3D deviations on the order of 3.8 mm for the RGB camera and 1.7 mm for the IR camera at a distance of 4 m from the sensor. Thus, the deviations could be neglected for such central objects. On the other hand, toward the peripheral image area the deviations grow and a more complex model of deviations becomes necessary.
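The conversion from pixel distortion to 3D deviation can be checked with a few lines of arithmetic. The sketch below scales a pixel error by z/f; the focal lengths (in pixels) are assumed values close to nominal Kinect figures, not the identified values from Table 1, and the function name is hypothetical.

```python
def deviation_mm(pixel_error, z_mm, focal_px):
    """3D deviation (mm) caused by an image-plane error of `pixel_error` pixels."""
    return pixel_error * z_mm / focal_px

# Assumed focal lengths in pixels (hypothetical, close to nominal values)
F_RGB = 1050.0   # RGB camera at 1280 x 960
F_IR = 580.0     # IR camera at 640 x 480

rgb_dev = deviation_mm(1.0, 4000.0, F_RGB)    # 1 pixel at 4 m
ir_dev = deviation_mm(0.25, 4000.0, F_IR)     # 0.25 pixels at 4 m
print(f"RGB: {rgb_dev:.1f} mm, IR: {ir_dev:.1f} mm")   # RGB: 3.8 mm, IR: 1.7 mm
```

Under these assumed focal lengths the result agrees with the 3.8 mm and 1.7 mm figures quoted in the text.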

Depth measurement calibration
The reading of Kinect-type sensors is based on internal computation of depth using the detected disparity between images of IR light beams obtained after reflection from the measurement surface and from the reference surface. The value of such inferred depth depends on sensor geometry and optical characteristics that are subject to manufacturing variations.
To derive a relationship between the disparity and the measured depth, consider Fig. 4, which schematically displays a beam PMC that is emitted from the projector P under a given angle, is reflected from the measurement surface at the point M, and hits the image plane at the point C, forming an angle with the sensor optical axis.
Looking at Fig. 4, we obtain relations (7) and (8), where $l$ is the length of the baseline between the projector and the IR camera, $z$ is the distance from the reflection surface, $x$ is the offset of the reflected beam on the sensor surface, and $f$ is the focal length of the IR camera. By combining (7) and (8), after rearrangement we obtain relation (9). For a reference surface at the distance $z_0$, relation (9) takes the form (10). By combining (9) and (10), the measurement model (11) is obtained, expressed in terms of the disparity between the offsets obtained for the measurement and reference surfaces. The model (11) is internally used by sensor software to transform detected disparity into depth. It is readily seen that the inferred value depends on $l$, $f$, $z_0$, which are subject to manufacturing variations. Thus, the actual output $z_S$ from the sensor is really an approximation based on the nominal model (12). From (11) and (12), the relation (13) between $z$ and $z_S$ is obtained, in which $a_Z$, $b_Z$ are values characteristic of the particular sensor. Ideally, $a_Z = 1$ and $b_Z = 0$.
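One possible form of this derivation, consistent with the definitions above, is sketched below. The angle symbols $\alpha$ (emission) and $\beta$ (incidence), the sign conventions, the disparity symbol $d$, and the subscript $N$ for nominal factory values are assumptions introduced for illustration; the final line has the linear inverse-depth form attributed to [6] in the related work section.

```latex
\begin{align*}
  l &= z\tan\alpha + z\tan\beta, \qquad x = f\tan\beta
      && \text{(triangle geometry)} \\
  \frac{x}{f} &= \frac{l}{z} - \tan\alpha
      && \text{(measurement surface)} \\
  \frac{x_0}{f} &= \frac{l}{z_0} - \tan\alpha
      && \text{(reference surface)} \\
  \frac{1}{z} &= \frac{1}{z_0} + \frac{d}{f\,l}, \qquad d = x - x_0
      && \text{(actual model)} \\
  \frac{1}{z_S} &= \frac{1}{z_{0N}} + \frac{d}{f_N\,l_N}
      && \text{(nominal model)} \\
  \frac{1}{z} &= a_Z\,\frac{1}{z_S} + b_Z, \qquad
  a_Z = \frac{f_N\,l_N}{f\,l}, \quad
  b_Z = \frac{1}{z_0} - \frac{a_Z}{z_{0N}}
      && \text{(correction model)}
\end{align*}
```

When the actual parameters equal the nominal ones ($f = f_N$, $l = l_N$, $z_0 = z_{0N}$), this gives $a_Z = 1$ and $b_Z = 0$, so that $z = z_S$.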

In this work, identification of the parameters $a_Z$, $b_Z$ is conducted by the procedure proposed in [6]. First, a set of pairs of RGB/depth images of the same checkerboard that was used in camera calibration is collected. However, whereas the camera calibration step employed a set of close-up views, here a set of views with different object distances is selected. Note yet another difference: in the camera calibration case, pairs of RGB/IR images were acquired, whereas for the depth calibration pairs of RGB/depth images are collected. Each RGB image is afterward converted into grayscale, and the corner coordinates are then extracted for the inner 10 × 9 corners. See Fig. 5 for an example, where the extracted corners are emphasized in both RGB and depth images. For each view, say the $k$-th, the extracted pixel coordinates are used to estimate the 3D corner coordinates $\hat{X}_{IR}(i,j,k)$ in the coordinate frame of the IR camera (14). The z-component $\hat{Z}(i,j,k)$ of $\hat{X}_{IR}(i,j,k)$ is afterward compared to the sensor reading. The sensor value is determined by converting $\hat{X}_{IR}(i,j,k)$ into pixel coordinates of the IR camera (15) and by searching for the nearest neighbor in the sensor depth map $m_k$ (16). Finally, the obtained set of pairs of $\hat{Z}(i,j,k)$ and the corresponding sensor values is used to identify the parameters $a_Z$, $b_Z$ of the depth measurement model (13) using a least squares fit.
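The final least squares step can be sketched as follows, assuming that model (13) is linear in inverse depths, i.e. 1/z = a_Z * (1/z_S) + b_Z, as stated for [6] in the related work section. The function name and the synthetic data are hypothetical; in practice the reference depths would come from the calibrated RGB camera and the sensor depths from the depth map.

```python
import numpy as np

def fit_inverse_depth_model(z_ref, z_sensor):
    """Least-squares fit of 1/z_ref = a_Z * (1/z_sensor) + b_Z."""
    inv_ref = 1.0 / np.asarray(z_ref, dtype=float)
    inv_sensor = 1.0 / np.asarray(z_sensor, dtype=float)
    # Design matrix with columns [1/z_sensor, 1]
    A = np.column_stack((inv_sensor, np.ones_like(inv_sensor)))
    (a_z, b_z), *_ = np.linalg.lstsq(A, inv_ref, rcond=None)
    return a_z, b_z

# Synthetic, noise-free data (hypothetical parameters, depths in mm)
true_a, true_b = 0.98, 1.0e-5
z_sensor = np.linspace(800.0, 4000.0, 50)
z_ref = 1.0 / (true_a / z_sensor + true_b)
a_z, b_z = fit_inverse_depth_model(z_ref, z_sensor)
```

Fitting in inverse-depth space keeps the problem linear in the two unknowns, so an ordinary least squares solver is sufficient.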
Calibration was performed using the three views displayed in Fig. 5. Fig. 6 illustrates the depth correction curve obtained for the identified parameter values (the curve shown in black). It is seen that the correction increases with depth, up to a value on the order of 55 mm at the end of the sensor range. For reference, Fig. 6 also shows measurement errors obtained for all checkerboard corners in all views in Fig. 7. The measurement points are shown as green (lighter) dots; the points that were used in calibration are highlighted in red (darker dots). It is important to underline that the error reduction that can be achieved by applying the described procedure is in fact a reduction of the differences between the readings of the depth sensor and the depth values obtained using the sensor's RGB camera. Therefore, a well-calibrated camera is an absolute prerequisite for a quality calibration.
In this section, an analysis is made of the achievable accuracy of both the uncalibrated and the calibrated sensor and of the improvement introduced by calibration.

Figure 7. Views used in accuracy analysis
To this end, two possible scenarios are examined. In the first case, use of only the depth sensor is assumed. Examples of this scenario in the context of robotic SLAM (Simultaneous Localization and Mapping) are provided by the works of Audras et al. [20] and Fioraio and Konolige [21], where point clouds produced by Kinect are used directly to build a dense map of the environment. Registration of the point clouds in successive depth frames is performed in an iterative process known as ICP (Iterative Closest Point). An alternative scenario, which speeds up the relatively slow ICP process, involves registration using sparse salient keypoints in 2D images obtained from the RGB camera, whereas the depth of the keypoints is found by searching the depth map. Representative implementations of this approach are described by Henry et al. [22] and Endres et al. [23]. From the standpoint of Kinect accuracy, the main characteristic of this approach is the joint use of the depth sensor and the RGB camera, and the concern is the errors of 3D coordinates inferred for selected points in RGB images.
In the framework of the analysis, three levels of sensor modeling are investigated in both examined scenarios: the nominal sensor model, the sensor with calibrated RGB/IR cameras, and the case with the additional depth measurement model calibration.
The analysis is based on measurements conducted on the same checkerboard employed in sensor calibration, using the views of the checkerboard shown in Fig. 7 (with the exclusion of the three central views in the 2nd, 5th, and 8th rows, which were used in calibration of the depth measurement model). As in the calibration case, measurement points correspond to the 90 inner corners of the checkerboard.
For the purpose of the analysis, the views were divided into nine groups, each containing two or three views of the checkerboard placed at approximately the same position with respect to the sensor but in different orientations. In this manner, a set of nine clusters of measurement points was obtained. The clusters were used to estimate RMS errors and standard deviations for different distances from the sensor.
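The per-cluster statistics can be computed along the following lines. The sketch assumes that each measurement point contributes one scalar error (for example, a depth difference in mm) and that the standard deviation is the population value taken around the cluster mean; both choices, the function name, and the sample data are assumptions made for illustration.

```python
import numpy as np

def cluster_statistics(errors_mm, cluster_ids):
    """Return {cluster_id: (rms, std)} for per-point errors grouped by cluster."""
    errors_mm = np.asarray(errors_mm, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    stats = {}
    for cid in np.unique(cluster_ids):
        e = errors_mm[cluster_ids == cid]
        rms = float(np.sqrt(np.mean(e**2)))   # root mean square error
        std = float(np.std(e))                # spread around the cluster mean
        stats[int(cid)] = (rms, std)
    return stats

# Hypothetical error samples (mm) for two clusters
stats = cluster_statistics([3.0, 4.0, 3.0, 4.0, 10.0, 10.0], [0, 0, 0, 0, 1, 1])
```

Reporting both values separates the systematic part of the error, which dominates the RMS value, from the random spread captured by the standard deviation.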

Depth sensor accuracy
In this analysis, the accuracy of the 3D coordinates $X_{IR}(i,j,k)$ determined by the sensor for selected points $x^{(P)}_{IR}(i,j,k)$ in its depth map is examined. The points $x^{(P)}_{IR}$ are selected as projections of checkerboard corners, and they are computed using the same procedure as in depth measurement calibration: first, the transform $T^{(k)}_{RGB,C}$ is determined for each view $k$; then (14) is applied to find the best guess $\hat{X}_{IR}(i,j,k)$ of the actual external coordinates, from which $x^{(P)}_{IR}(i,j,k)$ are calculated using (15). The sensor output is afterward generated by applying, in order, (16), (13), (3), and (4).
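The last two stages of this chain, depth correction followed by conversion to 3D coordinates, can be sketched as below. The sketch assumes the inverse-depth form 1/z = a_Z * (1/z_S) + b_Z for the correction and a distortion-free pinhole back-projection in place of (3) and (4); the function name and all numerical values are hypothetical.

```python
import numpy as np

def depth_pixel_to_3d(c, r, z_sensor, a_z, b_z, fx, fy, cx, cy):
    """Correct a sensor depth and back-project pixel (c, r) to 3D coordinates."""
    z = 1.0 / (a_z / z_sensor + b_z)    # assumed inverse-depth correction
    x = (c - cx) * z / fx               # pinhole back-projection, no distortion
    y = (r - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics (pixels) and ideal depth parameters (a_Z = 1, b_Z = 0)
p = depth_pixel_to_3d(c=378.0, r=240.0, z_sensor=2000.0,
                      a_z=1.0, b_z=0.0,
                      fx=580.0, fy=580.0, cx=320.0, cy=240.0)
```

Neglecting distortion here is only reasonable for points near the image center; toward the periphery the undistortion step would have to be included.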
(From the order of calculation, it is seen that the factors affecting the accuracy are the IR camera and the depth measurement algorithm.) The resulting deviations of $X_{IR}(i,j,k)$ from $\hat{X}_{IR}(i,j,k)$ are clustered according to average depth, and the obtained statistics (root mean square error and standard deviation) for different clusters are summarized in Table 2 and Fig. 8. It is seen that the application of the nominal model produces large average errors, which are on the order of 35 mm at the distance of 2 m and on the order of 75 mm at the distance of 3 m. Calibration of the IR camera yields a significant reduction of errors at shorter distances. (It could be noted here that almost the same results were obtained after neglecting all deviations in parameters of the camera except for the focal length. This insensitivity is attributed to small camera distortion and the fact that the measurement points were always in the central region of the image.) However, the errors remain large at distances larger than 2.5 m. In this range, calibration of the depth model allows decreasing average errors more than two times compared to the case of calibrating only the camera model and more than three times compared to the nominal sensor model.
[...] and finally finding an intersection of the ray with the undistorted depth map. As in the first scenario, points $X_{IR}(i,j,k)$ are further compared to $\hat{X}_{IR}(i,j,k)$, and the statistics for clusters of points are calculated. The results are illustrated in Fig. 9, where it is seen that the errors obtained with nominal parameters are larger by approximately 50% compared to the first scenario. On the other hand, errors obtained with the calibrated RGB camera remained almost unchanged, which is expected bearing in mind that the evaluation of $\hat{X}_{IR}(i,j,k)$ and $X_{IR}(i,j,k)$ was done using the same RGB camera model.

CONCLUSION
This study confirmed large RMS errors of the considered sensors. The errors were on the order of 35 mm (standard deviation on the order of 10 mm) at the distance of 2 m and on the order of 75 mm (standard deviation on the order of 15 mm) at the distance of 3 m. When considering joint use of the depth sensor with its associated RGB camera, the effective RMS errors increased by 50%.
Errors could be reduced by calibration of the sensor's cameras and depth measurement model. The procedure proposed in [5,6] proved to be effective, as it allowed reducing RMS errors more than three times compared to the case when the nominal sensor model was employed.
Calibration of the depth model was an important element of the overall calibration, as it allowed reducing the RMS errors left after calibration of the sensor's cameras by 50%.

Camera calibration
Camera calibration assumes identification of the parameters of functions modeling the transformation of 3D coordinates of external objects into coordinates in the image plane.

Here, $(c, r)$ are the column/row indices of image pixels. Thus, calibration consists in identification of:
- Intrinsic parameters: elements of the camera matrix $K$ and of the distortion function.
- Extrinsic parameters: elements of the matrix $T_e$.
For a stereo pair of Kinect cameras, calibration encompasses identification of the camera matrices and distortion functions of the RGB and IR cameras, together with the matrix that transforms homogeneous 3D coordinates from the coordinate frame of the RGB camera to the coordinate frame of the IR camera.
The adopted functional form (1) involved only a radial distortion, specified by the parameters $k_1$, $k_2$, $k_3$.

Figure 2. Close-up views employed in camera calibration (a. RGB images, b. IR images)

Rotation vector RGB→IR: [0.00260, -0.00600, 0.00175]
Translation vector RGB→IR: [-25.07450, 0.28102, 0.79722]

Once the parameters of the transformations are obtained, depth map images provided by the sensor are easily converted into 3D maps. Assume that a point in the depth map is available in the form of (column, row) coordinates in the IR image coordinate frame, together with its distance $z_{IR}$ from the sensor.

Figure 4. Depth measurement geometry

In practice, the depth model parameters differ from their ideal values, which leads to systematic errors in depth measurement. Therefore, appropriate tuning of depth model parameters may improve the accuracy.

Figure 5. Views employed in depth calibration