Semantic segmentation

We present a depth-adaptive deep neural network that uses a depth map for semantic segmentation. Typical deep neural networks receive inputs at predetermined locations regardless of the distance from the camera. This fixed receptive field makes it difficult to generalize features of objects at various distances: the predetermined receptive field is too small for nearby objects and too large for distant ones. To overcome this challenge, we develop a neural network that adapts its receptive field not only for each layer but also for each neuron at each spatial location. To adjust the receptive field, we propose the depth-adaptive multiscale (DaM) convolution layer, which consists of an adaptive perception neuron and an in-layer multiscale neuron. The adaptive perception neuron adjusts the receptive field at each spatial location using the corresponding depth information. The in-layer multiscale neuron applies receptive fields of different sizes across the feature space to learn features at multiple scales. The proposed DaM convolution is applied to two fully convolutional neural networks. We demonstrate the effectiveness of the proposed networks on a publicly available RGB-D dataset for semantic segmentation and a novel hand segmentation dataset for hand-object interaction. The experimental results show that the proposed method outperforms state-of-the-art methods without any additional layers or pre/post-processing.
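The core idea of the adaptive perception neuron, a per-pixel receptive field chosen from depth, can be sketched as depth-dependent dilation. The following is a minimal illustrative sketch; the dilation rule, threshold, and function names are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def depth_adaptive_conv(feat, depth, kernel, base_dilation=1, far_dilation=3, thresh=1.5):
    """Toy depth-adaptive 3x3 convolution: each output pixel samples its
    neighborhood with a dilation chosen from its depth value. Nearby objects
    appear large in the image, so near pixels get the larger dilation."""
    H, W = depth.shape
    out = np.zeros((H, W))
    pad = 3 * max(base_dilation, far_dilation)
    fpad = np.pad(feat, pad, mode='edge')
    for y in range(H):
        for x in range(W):
            # hypothetical rule: enlarge the receptive field for near pixels
            d = far_dilation if depth[y, x] < thresh else base_dilation
            for ky in (-1, 0, 1):
                for kx in (-1, 0, 1):
                    out[y, x] += kernel[ky + 1, kx + 1] * \
                                 fpad[y + pad + ky * d, x + pad + kx * d]
    return out
```

With a constant feature map and an averaging kernel, the output is constant regardless of the per-pixel dilation, which makes the sketch easy to sanity-check.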


B. Kang, Y. Lee, and T. Nguyen, "Depth Adaptive Deep Neural Network for Semantic Segmentation," IEEE Transactions on Multimedia, 2018

Hand segmentation

Hand segmentation for hand-object interaction is a necessary preprocessing step in many applications such as augmented reality, medical applications, and human-robot interaction. However, typical methods rely on color information, which is not robust to objects with skin-like color, skin pigment differences, and lighting variations. Thus, we propose a hand segmentation method for hand-object interaction that uses only a depth map. This is challenging because of the small depth difference between a hand and objects during an interaction. To overcome this challenge, we propose a two-stage random decision forest (RDF) method that first detects hands and then segments them. To validate the proposed method, we demonstrate results on a publicly available dataset of hand segmentation for hand-object interaction. The proposed method achieves high accuracy with a short processing time compared to other state-of-the-art methods.
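The two-stage idea, detect candidate hand pixels first and then refine within the detection, can be sketched with scikit-learn random forests. The per-pixel features below (center depth plus offset depth differences) and the synthetic scene are illustrative assumptions, not the paper's actual feature design:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(depth, pixels, offsets):
    """Illustrative per-pixel features: center depth plus offset depth
    differences, with offsets clamped at the image border."""
    H, W = depth.shape
    rows = []
    for y, x in pixels:
        row = [depth[y, x]]
        for dy, dx in offsets:
            yy = min(max(y + dy, 0), H - 1)
            xx = min(max(x + dx, 0), W - 1)
            row.append(depth[yy, xx] - depth[y, x])
        rows.append(row)
    return np.asarray(rows)

# Synthetic scene: a "hand" square at 1 m in front of a 3 m background.
depth = np.full((16, 16), 3.0)
depth[4:12, 4:12] = 1.0
pixels = [(y, x) for y in range(16) for x in range(16)]
labels = np.array([int(4 <= y < 12 and 4 <= x < 12) for y, x in pixels])
offsets = [(0, 4), (4, 0), (0, -4), (-4, 0)]
X = pixel_features(depth, pixels, offsets)

# Stage 1 detects candidate hand pixels; a second forest, trained the same
# way on the candidate region, would then refine the segmentation.
stage1 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, labels)
candidates = [p for p, yhat in zip(pixels, stage1.predict(X)) if yhat == 1]
```

Restricting the second stage to stage-1 candidates is what keeps the overall per-frame cost low: the expensive fine-grained classification runs only on a small fraction of the pixels.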


B. Kang, K.-H. Tan, N. Jiang, H.-S. Tai, D. Tretter, and T. Nguyen, "Hand Segmentation for Hand-Object Interaction from Depth map," IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017

Sign language fingerspelling recognition using CNNs

Sign language recognition is important for natural and convenient communication between the deaf community and the hearing majority. We take an efficient first step toward an automatic fingerspelling recognition system using convolutional neural networks (CNNs) applied to depth maps. In this work, we consider a relatively large number of classes compared with the previous literature: we train CNNs to classify 31 letters and digits using a subset of depth data collected from multiple subjects. Using different learning configurations, such as hyper-parameter selection with and without validation, we achieve 99.99% accuracy for observed signers and 83.58% to 85.49% accuracy for new signers. The results show that accuracy improves as we include data from more subjects during training. The processing time is 3 ms for the prediction of a single image. To the best of our knowledge, the system achieves the highest accuracy and speed. The trained model and dataset are available in our repository.
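A 31-class depth-map classifier of this kind can be sketched as a small PyTorch CNN. The layer sizes and the 64×64 input resolution below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FingerspellCNN(nn.Module):
    """Minimal CNN sketch for classifying single-channel depth patches
    into 31 fingerspelling classes (letters and digits)."""
    def __init__(self, n_classes=31):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 64x64 input -> two 2x poolings -> 16x16 spatial map with 32 channels
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):  # x: (N, 1, 64, 64) depth patches
        h = self.features(x)
        return self.classifier(h.flatten(1))
```

A forward pass on a batch of shape `(N, 1, 64, 64)` returns an `(N, 31)` logit tensor, one score per class.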

B. Kang, S. Tripathi, and T. Nguyen, "Real-time Sign Language Fingerspelling Recognition using Convolutional Neural Networks from Depth map," ACPR 2015

Hand articulations tracking

Real-time hand articulation tracking is important for many applications such as interacting with virtual/augmented reality devices or tablets. However, most existing algorithms rely heavily on expensive, power-hungry GPUs to achieve real-time processing, which makes them unsuitable for mobile and wearable devices. Therefore, we propose an efficient hand tracking system that does not require high-performance GPUs.

In our system, we track hand articulations by minimizing the discrepancy between the depth map from the sensor and a computer-generated hand model. We also initialize the hand pose at each frame using finger detection and classification. Our contributions are: (a) an adaptive hand model that accounts for users' different hand shapes without generating a personalized hand model; (b) a highly efficient frame initialization for robust tracking and automatic initialization; (c) hierarchical random sampling of pixels from each depth map to improve tracking accuracy while limiting the required computation. To the best of our knowledge, this is the first system to achieve both automatic hand model adjustment and real-time tracking without using GPUs.
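Contribution (c) can be illustrated as a coarse-to-fine pixel-sampling scheme: a cheap uniform pass locates the image blocks that contain valid depth, and the remaining sample budget is spent only inside those blocks. The block rule, budgets, and validity test below are assumptions for illustration, not the paper's actual sampling scheme:

```python
import numpy as np

def hierarchical_sample(depth, n_coarse=64, n_fine=192, grid=8, seed=0):
    """Illustrative hierarchical random sampling over a depth map.
    Invalid pixels are marked with non-finite depth values."""
    rng = np.random.default_rng(seed)
    H, W = depth.shape
    # coarse pass: uniform samples over the whole image
    coarse = list(zip(rng.integers(0, H, n_coarse), rng.integers(0, W, n_coarse)))
    # grid x grid blocks that received at least one valid coarse sample
    hit = {(y * grid // H, x * grid // W) for y, x in coarse if np.isfinite(depth[y, x])}
    # fine pass: spend the remaining budget only inside the hit blocks
    fine, tries = [], 0
    while len(fine) < n_fine and tries < 100 * n_fine:
        y, x = int(rng.integers(0, H)), int(rng.integers(0, W))
        tries += 1
        if (y * grid // H, x * grid // W) in hit and np.isfinite(depth[y, x]):
            fine.append((y, x))
    return coarse, fine
```

The point of the hierarchy is that the model-fitting cost is governed by the total number of sampled pixels, while the fine samples concentrate where the hand actually is.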

B. Kang, Y. Lee, and T. Nguyen, "Efficient Hand Articulations Tracking using Adaptive Hand Model and Depth map," ISVC 2015

Facial muscles 3D modeling using MRI

We propose 3D human face modeling based on facial muscles using magnetic resonance imaging (MRI) with an ultra-short echo-time (UTE) pulse sequence. T1-weighted 3D in vivo data with isotropic resolution (1.0×1.0×1.0 mm³) was acquired with a 3 T MR scanner. We employed an anisotropic diffusion filter, morphological operations, and a region growing algorithm to segment the facial muscles. We were able to segment and reconstruct the following facial muscles: orbicularis oris, mentalis, orbicularis oculi, zygomaticus major, zygomaticus minor, temporalis, and buccinator. The segmented muscles from UTE images can improve 3D human face modeling, which should account for facial muscles in order to produce accurate face models for trustworthy imaginary plastic surgery results and natural 3D animations.
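The region growing step can be illustrated as a simple 6-connected flood fill on a 3-D volume; this is a sketch of the general technique, not the paper's actual parameters or full pipeline:

```python
import numpy as np
from collections import deque

def region_grow(vol, seed, tol):
    """6-connected region growing: starting from a seed voxel, accept
    neighbors whose intensity is within `tol` of the seed intensity."""
    mask = np.zeros(vol.shape, dtype=bool)
    ref = vol[seed]
    q = deque([seed])
    mask[seed] = True
    while q:
        z, y, x = q.popleft()
        for dz, dy, dx in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)):
            n = (z + dz, y + dy, x + dx)
            if all(0 <= n[i] < vol.shape[i] for i in range(3)) \
                    and not mask[n] and abs(vol[n] - ref) <= tol:
                mask[n] = True
                q.append(n)
    return mask
```

In a pipeline like the one described, the anisotropic diffusion filter would smooth the volume before growing, and morphological operations would clean up the resulting mask.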


B. Kang, M. Kim, T. Hong, and D. Kim, “Facial Muscles 3D Modeling using Ultra-short Echo-time (UTE) Magnetic Resonance Imaging (MRI),” IEEK Summer Conference 2013