DBSCAN for Hand Tracking and Gesture Recognition

Hand segmentation and tracking are important issues for hand gesture recognition. Depth data can speed up the segmentation process because unnecessary data, such as the image background, can be removed easily. In this research, we modify the DBSCAN clustering algorithm to make it faster and suitable for our system. The method is used in both hand tracking and hand gesture recognition. The results show that our method performs well in this system and outperforms the original DBSCAN and other clustering methods in terms of computational time.


Introduction
Accurate object tracking depends on the condition of the input data. Good data can be obtained through a preprocessing or segmentation method. A lot of research indicates that these methods can provide good results [1], but they meet difficulties when detecting subjects against a cluttered background or with articulated poses. That condition can decrease accuracy; moreover, using unnecessary information increases computational time and affects accuracy [2]. Some algorithms need high computation time when implemented [3][4], so they are difficult to use in real-time object tracking systems.
More recently, depth data has been found to be very beneficial. Compared to color data, depth data has the advantage of providing additional 3D information that offers more detail and important hints [5]. Depth cameras can work in dark conditions, reduce ambiguity in scale, are largely color and texture invariant, and resolve some silhouette ambiguities. They also act as a pre-processing step by simplifying background subtraction [6]. All of these advantages make the data faster to process.
Developing a hand gesture recognition system requires attention to hand tracking as an important part. Hand detection or tracking must successfully segment the hand as an input to provide the best result for the recognition algorithm [7]. Considering the importance of clustering in image segmentation, the right method should be chosen [8]. The depth data produced by a depth camera is not always free of noise [9]; moreover, most of these cameras have high noise levels in the raw depth map [10]. Unexpected and large amounts of noise can be a problem and may limit the information [11][12], so we need to eliminate it to obtain good results.
Many applications need a fast method to solve clustering problems. Many methods are currently used for clustering, such as DBSCAN, K-Means [13], Mean Shift [14], and OPTICS [15]. DBSCAN is believed to be the fastest of these methods and is able to cluster data while reducing the existing noise [16]. In this research, we propose a modified Density-Based Spatial Clustering of Applications with Noise (DBSCAN) as a data clustering algorithm for fast hand tracking and hand gesture recognition.

Related Work
Image segmentation is an important step in image processing, particularly in relation to image recognition systems. A meaningful part of the image is obtained through the segmentation process. The most frequent problem occurs when the background is similar to the object [17]. Some images now also include depth data, and with this data segmentation becomes easier, as shown in existing research [18][19].
Generally speaking, hand tracking and hand gesture recognition are challenging problems to solve. Many methods are used for clustering, such as k-means, hierarchical clustering, and expectation-maximization (EM) [20]. Classification methods such as Nearest Neighbor, Support Vector Machines, and Linear Discriminant Analysis (LDA) are also broadly used for recognition problems [21]. Here we propose a different approach: we use DBSCAN, originally a clustering method, to handle the object tracking and recognition problems. Using a clustering method makes the system more flexible when classes are added at run time, and no training data is needed to solve the problems. Instead of only using vanilla DBSCAN [22], we make some modifications to make it faster and fit for a real-time system.

Proposed Method
This system has three main processes. First, a segmentation process that uses a depth threshold on the depth data as input; this step results in depth data containing only the hand object. Second, a tracking process that uses our proposed method, a modified version of DBSCAN, to separate the hands into left and right; to maintain the tracking result, we find the closest distance between the center points of the current and previous labels. Third, a recognition process that also uses our proposed method to count the number of raised fingers. The overall process is shown in Figure 1.

Fig. 1. System Architecture
Like the original DBSCAN, the proposed method iterates until all data are clustered. However, our DBSCAN finishes the clustering process for all data without revisiting data that are not yet clustered: it runs in a single pass over all data and clusters the data at the same time. The detailed explanation is given in Algorithm 1. Here we set both the minimum number of neighbors and epsilon to one.
Algorithm 1: Modified DBSCAN
Input: matrix of x and y positions
list = list of positions of non-zero data
Set the label of the first position data as cluster 1
Repeat for all data in list
    If the surrounding of the current data is not empty
        Set the label according to the smallest label among these data
        If the clusters of the surrounding data are different
            Set all the cluster members to the same label
    Else
        Set the cluster as a new cluster
end
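The steps of Algorithm 1 can be sketched as a single raster-order pass over the non-zero pixels, merging touching clusters into the smallest label. The function and variable names below are ours, not the paper's, and the merge bookkeeping is one possible way to realize the "set all the cluster members into the same label" step.

```python
import numpy as np

def modified_dbscan(mask):
    """Sketch of the modified DBSCAN: one pass over all non-zero
    pixels with epsilon = 1 and minimum neighbors = 1."""
    labels = np.zeros(mask.shape, dtype=int)
    merge = {}                        # label -> smaller label it merged into

    def resolve(label):               # follow merge chains to the root label
        while label in merge:
            label = merge[label]
        return label

    next_label = 1
    for y, x in zip(*np.nonzero(mask)):        # raster order over non-zero data
        # labels of already-visited neighbors within distance one
        neigh = labels[max(y - 1, 0):y + 1, max(x - 1, 0):x + 2]
        roots = {resolve(l) for l in neigh.ravel() if l > 0}
        if roots:                     # surrounding data is not empty
            smallest = min(roots)
            labels[y, x] = smallest
            for r in roots - {smallest}:       # unify differing clusters
                merge[r] = smallest
        else:                         # start a new cluster
            labels[y, x] = next_label
            next_label += 1
    # final pass: rewrite every label to its merged root
    flat = labels.ravel()
    for i, l in enumerate(flat):
        if l > 0:
            flat[i] = resolve(l)
    return labels
```

Because each pixel only looks at neighbors that were already visited, no pixel is examined twice, which is where the speedup over the original region-growing DBSCAN comes from.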

Segmentation
The segmentation process is an important process for object tracking and hand gesture recognition. It is performed as a preprocessing step to make the input data more stable and easier to use. In this system, we use a simple depth threshold as segmentation to remove the background from the original depth data Dt. A fixed threshold value T is defined, and any object whose value is less than the threshold is considered a target. We assume the target is a person who wants to perform a gesture. The average depth value Adepth of the target is used as a depth reference, and anything in front of it is considered a hand. The depth value Adepth can be calculated as in equation 1.

Fig. 2. Data Segmentation Process (a) Original depth image, (b) After background removal, and (c) Hand segmentation
The data segmentation process is illustrated in Figure 2, where (a) is the original depth image in which the person (including his hand) and the area behind the target are detected, (b) is the image after background removal using the depth threshold, and (c) is the hand detection result as the final output of the segmentation process. In this figure, different colors indicate different depth values.

Object Tracking
DBSCAN is an unsupervised clustering method, which means it does not create fixed labels or representations for the clusters [23]. The method only separates the object into clusters, and if the data is refreshed in the next frame the labels may change. In this step we have two labels for each frame, one hand area H for each hand. For each area we find a center point C by averaging the points inside the hand area, as in equation 2. This center point is represented by a red dot in Figure 3.
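The center point in equation 2 is just the mean of the pixel coordinates in the hand area; a minimal sketch (with our own function name):

```python
import numpy as np

def center_point(labels, label):
    """Center point C of a hand area: the average of the pixel
    coordinates belonging to that label (as in equation 2)."""
    ys, xs = np.nonzero(labels == label)
    return float(xs.mean()), float(ys.mean())
```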

Fig. 4. Hand Tracking
This center point is used to determine the label by comparing the center points of the current frame with those of the previous frame. Figure 4 shows the hand in white as the previous frame and the hand in blue as the current frame. Each hand has a center point: C1 and C2 are in the previous frame, and C3 and C4 are in the current frame. We calculate the distance of one center point in the current frame to both center points in the previous frame, using the Euclidean distance D as in equation 3. Suppose we choose center point C3 of the current frame; we then calculate D1 from (C3, C1) and D2 from (C3, C2). If D1 is less than D2, then C3 has the same label as C1, and vice versa.
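The nearest-center matching above can be sketched as follows. The dictionary-based names are illustrative; the logic is the distance comparison of equation 3.

```python
import math

def match_labels(curr_centers, prev_centers):
    """Each current-frame center point takes the label of the closest
    previous-frame center point (Euclidean distance, equation 3)."""
    assigned = {}
    for name, c in curr_centers.items():
        # distance D from this center to every previous-frame center
        dists = {label: math.dist(c, p) for label, p in prev_centers.items()}
        assigned[name] = min(dists, key=dists.get)
    return assigned
```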

Hand Gesture Recognition
Hand gesture recognition is a way to distinguish between various hand poses. Here we recognize gestures according to which fingers are raised and can count up to 10 raised fingers in total across both hands. The recognition only needs a hand region as input. To keep the focus on the hand, we crop the input according to the width (X) and height (Y) of both hands relative to the center point C, using equation 4 below, to get the crop area Cr for each hand as shown in Figure 5.
The hand center point C is used to obtain the hand palm area P. We assume that this center point is the center of the palm and use it to remove the area within a radius distance R of the hand's center point, as illustrated in Figure 6. The value of R is important for the recognition: if the value is too small, several fingers will merge; on the other hand, if it is too large, some of the fingers will be removed. Both conditions can lead to a wrong recognition result, so it is very important to set the correct value of R.
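Removing the palm disc amounts to zeroing every pixel within radius R of C; a minimal sketch, where the radius value must be tuned as discussed above:

```python
import numpy as np

def remove_palm(hand, center, R):
    """Zero every pixel within radius R of the hand center point C,
    leaving (ideally) only the raised fingers."""
    h, w = hand.shape
    ys, xs = np.ogrid[:h, :w]          # coordinate grids for the mask
    cx, cy = center
    palm = (xs - cx) ** 2 + (ys - cy) ** 2 <= R ** 2
    return np.where(palm, 0, hand)
```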
The finger area F is obtained using equation 5. The fingers are counted as a gesture using our proposed DBSCAN algorithm. However, we know from the image that some part of the hand N cannot be removed, so we treat it as noise and do not count it as a finger.
The gesture is not affected by which fingers are raised. Figure 7 shows hand gestures with fingers in different positions; some of them are the same gesture, as long as the number of raised fingers is the same. In that figure we can see that gestures 2, 3, and 4 each have different raised fingers.

Experiments and Results
Computational time is one of the issues that needs to be solved, and our proposed method provides some improvement here. The comparison of computational times is shown in Table 1; it shows that our method outperforms the other methods, with a 192.5% average speedup compared to the original DBSCAN. However, to make it more reliable for a real-time system, an even faster result is needed, so we decided to reduce the input size by resizing it. A smaller input size produces a faster computational time. The fastest computational time with our method was about 5 ms, but at that size the data is too small: important parts of the data go missing, which affects the recognition result. We tried various input sizes, as shown in Figure 8, and chose the 0.25 resize, which produces a stable result with the smallest input size.

Table 2 shows the confusion matrix from the testing results. The testing was conducted by performing fifty tests per gesture. This matrix can be used to measure how accurate the method is by calculating the Precision, Recall, and Accuracy. The Precision, computed over the predicted positives, averages around 80% for all gestures. The Recall, computed over the actual positives, averages 81%. The Accuracy, measuring correct predictions against the total testing results, averages 90.3%. Based on these results, our method achieves a good result on the recognition task. The recognition is also robust to rotation, because we handle the hand rotation problem by preprocessing the hand data before it is fed to our proposed method. Figure 9 shows that this method can handle rotation.
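The measures reported above can be computed from the confusion matrix as follows. The example matrix, the function name, and the per-class accuracy formulation are our own assumptions; the paper does not state its exact formulas.

```python
import numpy as np

def gesture_metrics(cm):
    """Per-gesture Precision, Recall, and Accuracy from a confusion
    matrix (rows = actual gesture, columns = predicted gesture)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                               # correctly predicted
    total = cm.sum()
    precision = tp / cm.sum(axis=0)                # over predicted positives
    recall = tp / cm.sum(axis=1)                   # over actual positives
    tn = total - cm.sum(axis=0) - cm.sum(axis=1) + tp
    accuracy = (tp + tn) / total                   # per-class accuracy
    return precision, recall, accuracy
```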

Conclusion
This paper describes the possibility of using DBSCAN as a method to handle object tracking and gesture recognition problems. The results show that it is robust against various hand poses and hand rotations for both problems. The proposed method outperforms other clustering methods, achieving a 192.5% average speedup compared to the original DBSCAN, and the recognition has a high accuracy of up to 90%.