Presenting Technologies For 3D Skeleton-based Action Classification
Abstract: Human action classification is developing rapidly, and with the availability of stereo and depth cameras, 3D depth sequences are now used for action classification. This survey reviews the history of action classification technologies and highlights the state of the art of recent techniques for 3D skeleton-based action classification. We explain the motivations for and the different methods of data pre-processing, and introduce various action feature representations and classifiers. Furthermore, this survey presents a categorization of recent works in 3D skeleton-based action classification.
Keywords — Action classification, Depth sequence, 3D skeleton
I. INTRODUCTION
Nowadays, there are many application fields for the detection, recognition and analysis of human behaviors and actions. In common fields such as surveillance, human-computer interaction, assistive technologies and sign language, researchers have contributed extensively to applications of human action recognition and classification.
There are many scenario-based approaches to recognizing composite events for visual surveillance and content-based video retrieval. Paper [1] proposed a novel framework to describe and recognize complex video events, called scenarios, within a semantic structure. A new feature model based on a string representation of the video was proposed by Gaur et al.; by matching the string representations of the query and the test video, the final recognition is obtained, and variability in sampling rates and in the speed of activity execution is accommodated [2]. A Hierarchical Bayesian Network is used for recognizing two-person interactions in paper [3], which offers a solution to recognizing multi-body interactions. A self-similarity-based descriptor for view-independent video analysis is proposed in paper [4]; its central application is human action recognition, and it can also serve surveillance. Human-computer interaction also needs to collect a variety of nonverbal information (e.g., arm movements, facial expressions, eye fixations) [5]. For people with disabilities, young children or the elderly, assistive systems make life easier; the basic idea is that, through the intervention of computers and motors, the limitations in strength and range of motion of those users can be overcome [6]. Sign language recognition is an important application, and much work has been devoted to improving hand-shape recognition accuracy [7-9]. Based on shoppers' facial expressions and behaviors, consumer behavior analysis can help computers understand consumers' goals and interests [10]. All of the mentioned applications motivate the computer vision community to conduct research on human action recognition and modelling.
There is a path that leads from biological vision to computer vision, that is, modeling how human vision responds to distal stimuli and proximal stimulus patterns and simulating it on a computer. Johansson used bright spots as carriers of element motions in complex motion patterns. His experiment [11] analyzed the visual information from typical motion patterns when some pictorial form aspect of the patterns is known. Bright or dark spots were placed on the joints of the human body against a homogeneous, contrasting background. The experiment showed that motion perception may be affected by the number of bright spots and their distribution, and that human vision can detect motion and discriminate different motion patterns. The study inspired many researchers working on human pose estimation and recognition, and became influential with the development of machine learning, because people wish machines could recognize human poses and motion patterns by action class.
The initial research used single monocular images to estimate human pose and recognize motions [12]. Because of the constraints of body articulation, that is, joints being the ends of bones of constant length, part detectors and pictorial structure models are used to model the body parts and infer the body pose [13-17].
However, because of the body's great range of postures, estimating the body pose is hard to achieve. Researchers have found ways to recognize several simple actions by ignoring the specific joints and focusing instead on a holistic representation of the human body [18-20]. Paper [18] describes a "bag-of-rectangles" method for representing and recognizing human actions in videos: rectangles are spread over the body and separate SVM classifiers are trained for each action. Paper [20] presents an instant action recognition method that recognizes an action in real time from only two consecutive video frames by capturing the optical flow and the edges of the body.
With the development of technology and the appearance of cheap stereo cameras, researchers began to perform pose estimation and recognition using depth sequences. Paper [21] proposes a method that uses a single depth image to predict the 3D positions of body joints without relying on information from preceding frames. The depth image proved useful in providing data for quick and accurate pose or gesture estimation. A skeleton can be used to describe the human body pose or gesture, and it has been defined by the computer vision community as a representation of the human torso, head and limbs [11]. Therefore, the human body pose can be defined by the relative positions of the joints in the skeleton.
In recent years, 3D action classification based on depth maps and skeleton images has been applied in human-computer interfaces and games, and a batch of new works builds on those technologies [22-24].
The mentioned works imply that 3D skeleton-based action classification is difficult to achieve because semantically similar motions may not necessarily be numerically similar [25]. Moreover, in order to capture the relationships among different joints, most works attempt to introduce novel human body pose representations for motion recognition from 3D skeleton data [33,26,27]. Other works [28,29] try to find a discriminative model of joints, or of groups of joints, for each class. Furthermore, researchers have begun to consider the trajectories of the 3D joints. Various classification methods and frameworks have been proposed and applied to action classification.
In this survey, we focus on action classification from skeleton data. We give an overview of approaches concerning depth maps and related technologies, body pose estimation, pre-processing of skeletal data, action representation and classification, fusing skeletal data with other sensors' data, datasets and validation protocols, and a comparison of methods at the state of the art; based on these, we discuss future research directions.
II. Depth Maps And Related Technologies
A depth map is an image that stores, for each pixel, the distance from the camera to the corresponding point in the scene, possibly alongside the RGB values. Depth maps have been studied and applied for a long time in the robotics field [30,31] and in scene reconstruction [32-34]. Paper [30] deals with the design and implementation of the visual control of a robotic system composed of a dexterous hand and stereo cameras. Chella et al. [31] present a cognitive architecture for posture learning of an anthropomorphic robotic hand. Henry et al. [32] use depth maps to reconstruct models of indoor environments, and objects can also be reconstructed using the same kind of approach [33,34].
Although various techniques have been investigated to estimate the depth map of a scene, in very recent years the technology has advanced quickly because of the launch of low-priced depth cameras.
We describe the three most popular technologies used to generate depth maps, without going into their details in this survey.
A. Stereo Cameras
A stereo camera is a type of camera with two or more lenses, each with a separate image sensor or film frame. It can estimate a depth map by stereo triangulation [35]. The depth map estimated by a stereo camera is still unreliable when the environment contains homogeneous colors or intensities.
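For intuition, stereo triangulation on a rectified image pair reduces to a simple relation between disparity and depth. The sketch below assumes a hypothetical calibrated rig with known focal length and baseline and is not tied to any specific product in Table 1.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (in pixels) to a depth map (in meters).

    Assumes a rectified stereo pair: Z = f * B / d. Pixels with zero or
    negative disparity (no match found) are mapped to an invalid depth (inf).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example: a 2x2 disparity map from a hypothetical rig (f = 700 px, B = 0.12 m).
print(disparity_to_depth([[35.0, 0.0], [70.0, 14.0]], 700.0, 0.12))
```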
B. Time-of-flight (ToF)
A time-of-flight camera uses modulated light as the source and measures the time the light takes to travel to the scene and back in order to obtain the depth of the scene; the delay of the reflected light is the "time of flight". Some cameras instead measure the phase difference between the emitted and the reflected light or signal. The measurements may not be precise and reliable because of low spatial resolution, errors due to radiometric, geometric and illumination variations, motion blur, background light, and multiple reflections [36].
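The two ToF measurement principles mentioned above reduce to simple formulas: pulsed ToF converts the round-trip time directly, while continuous-wave ToF converts the phase shift measured at the modulation frequency. The sketch below is purely illustrative; real sensors must also deal with phase wrapping and the error sources listed above.

```python
import math

# Speed of light in m/s.
C = 299_792_458.0

def depth_from_round_trip(time_s: float) -> float:
    """Pulsed ToF: light travels to the scene and back, so d = c * t / 2."""
    return C * time_s / 2.0

def depth_from_phase(phase_rad: float, mod_freq_hz: float) -> float:
    """Continuous-wave ToF: d = c * phi / (4 * pi * f_mod), valid only up to
    the unambiguous range c / (2 * f_mod) because the phase wraps at 2*pi."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

# A 20 ns round trip is about 3 m; a pi/2 phase shift at 20 MHz is about 1.87 m.
print(depth_from_round_trip(20e-9), depth_from_phase(math.pi / 2, 20e6))
```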
C. Structured Light
A structured light camera projects an infrared structured light pattern onto the scene. When a pattern is projected onto a 3D surface, the observed pattern is geometrically distorted [37]. By comparing the original pattern and the deformed observed pattern, an exact geometric reconstruction of the surface shape can be recovered. Depth estimation can be unreliable, especially in the case of reflective or transparent surfaces.
III. Body Pose Estimation
We introduce the most common techniques for body pose estimation from depth maps.
A. Methods for skeleton estimation
Several different approaches for estimating body parts in RGB data [35,38,39,40] or depth maps [21,41] have been reported in the literature. The technologies used to obtain skeleton data can be listed as follows.
Motion capture: Motion capture sensors can record human actions. A motion capture system can be implemented with several kinds of technologies. An optical tracking system triangulates the 3D position of the target using stereo vision: special retroreflective markers attached to the body mark the key joints, and the cameras recover the joint positions by detecting the markers. There are also systems that identify body motion by means of multiple LEDs [42].
Intensity image: An intensity image is a data matrix whose values represent intensities within some range. In recent years the most popular model for pose has been the pictorial structure [15]. The model decomposes the whole body into local parts, each with its own position and orientation in space, and the parts are assembled into a skeleton by enforcing pairwise constraints.
Advances in the pictorial structure model have focused on modeling the appearance of the parts with stronger detectors [13,38] or flexible mixtures of models [39]. Other articles enhance the model by introducing further pairwise configuration constraints among body parts [43,44]. While the original pictorial structure uses a tree model [38,39], which permits efficient and exact inference of the body pose, some non-tree models used in other articles still achieve good results; in these cases, however, the computational complexity increases and requires pruning strategies [16,45,46] or approximate inference [44,47]. Some methods use, for example, conditional random fields [17] or max-margin learning [39,47,48] to learn the part-based model parameters.
Depth maps: Shotton et al. [21] propose a method to quickly and accurately predict human pose from a single depth image; it is widely used to infer skeletons of 20 joints on a frame-by-frame basis from depth images captured by the Kinect sensor. An object recognition algorithm is trained to recognize 31 distinct body parts, and a single input depth image is segmented into a dense probabilistic body part labeling. By designing an intermediate representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques exist.
In a few words, from a single input depth image a per-pixel body part distribution is inferred. As training data, realistic synthetic depth images of humans of many shapes and sizes, in highly varied poses sampled from a large motion capture database, are generated. The classifier is a deep randomized decision forest. A forest is an ensemble of decision trees; each tree consists of split and leaf nodes, and each split node stores a feature and a threshold. To classify a pixel, the current node is set to the root and the feature is computed; the current node is then updated to the left or right child according to the comparison with the threshold, and the process is repeated until a leaf node is reached. Once a leaf is reached, the pixel is assigned the class-conditional distribution stored in that leaf, and the distributions are averaged over all the trees in the forest. The per-pixel information is then pooled across pixels to generate reliable proposals for the positions of the 3D skeletal joints: a local mode-finding approach based on mean shift with a weighted Gaussian kernel is employed to accumulate the global 3D centers of probability mass of each part.
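To make the per-pixel classification step more concrete, the sketch below traverses a single toy decision tree using depth comparison features in the spirit of [21] (the difference between the depth at two offsets scaled by the inverse depth at the pixel, which makes the feature approximately depth invariant). The tree structure, offsets, thresholds and three-part label set are hypothetical placeholders, not the trained forest of [21].

```python
import numpy as np

def depth_feature(depth, x, y, u, v):
    """Depth comparison feature in the spirit of Shotton et al.:
    f = d(p + u / d(p)) - d(p + v / d(p)), with offsets scaled by 1 / depth."""
    d = depth[y, x]
    def probe(offset):
        ox, oy = int(round(x + offset[0] / d)), int(round(y + offset[1] / d))
        if 0 <= oy < depth.shape[0] and 0 <= ox < depth.shape[1]:
            return depth[oy, ox]
        return 1e6  # a large constant for probes that fall outside the image
    return probe(u) - probe(v)

# A toy tree: internal nodes hold (u, v, threshold, left, right);
# leaves hold a distribution over hypothetical body-part labels.
TOY_TREE = {
    "u": (60.0, 0.0), "v": (-60.0, 0.0), "thresh": 0.1,
    "left": {"leaf": np.array([0.8, 0.1, 0.1])},    # e.g. P(head, torso, arm)
    "right": {"leaf": np.array([0.1, 0.2, 0.7])},
}

def classify_pixel(tree, depth, x, y):
    """Walk from the root to a leaf and return the stored class distribution."""
    node = tree
    while "leaf" not in node:
        f = depth_feature(depth, x, y, node["u"], node["v"])
        node = node["left"] if f < node["thresh"] else node["right"]
    return node["leaf"]

depth_map = np.full((100, 100), 2.0)       # flat synthetic depth map at 2 m
print(classify_pixel(TOY_TREE, depth_map, 50, 50))
# In a forest, these distributions are averaged over all trees, and the
# per-pixel results are then pooled (e.g. with mean shift) into joint proposals.
```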
The OpenNI library [49] is also widely used for skeleton estimation.
B. Off-the-shelf solutions
We have collected some off-the-shelf solutions for depth machine vision systems; by their principle, they can be divided into passive and active stereo vision. A passive stereo vision system uses multiple cameras capturing the same scene to obtain images from different viewpoints. An active stereo vision system projects patterns or infrared light onto the scene and infers depth by detecting the delay of the returned signal or the deformation of the pattern.
Table 1 summarizes the solutions we collected; the list may not be exhaustive. It is worth noting that, according to the latest news, the Kinect series will halt production; we still include it because its rich software development kit (SDK) remains a worthwhile reference.
Table 1 Hardware solutions for depth vision system
Name of product | Link to Website |
ZED Stereo Camera | https://www.stereolabs.com/ |
The Captury | http://thecaptury.com/ |
OptiTrack Mo-Cap | http://www.optitrack.com/ |
Vicon | https://www.vicon.com/ |
DepthSense 325 | https://www.softkinetic.com/ |
Kinect sensor V1 | https://www.microsoft.com/ |
Kinect sensor V2 | https://www.microsoft.com/ |
Intel RealSense 3D Camera | https://www.intel.com/ |
Asus Xtion Pro Live | https://www.asus.com/ |
Creative Senz3D | https://us.creative.com/ |
Infineon 3D Image Sensor | https://www.infineon.com/ |
Heptagon-Mesa Imaging Sensor | http://enterprise.hptg.com/ |
Brekel Pro Pointcloud | http://brekel.com/ |
Structure Sensor | https://structure.io/ |
IV. Pre-processing of skeletal data
The purpose of pre-processing skeletal data is to deal with biometric differences among subjects and with the different temporal durations of the sequences. Biometric differences can be compensated for by normalizing the poses.
Handling biometric differences may require a transformation of each skeleton into a normalized pose; the problem is related to motion retargeting. Motion retargeting is, simply put, the editing of existing motions to achieve a desired effect. Since motion is difficult to create from scratch with traditional methods, changing existing motions is a quicker way to obtain the target motion. Michael Gleicher [50] presented a technique for adapting an animated motion from one character to another, a problem rooted in computer graphics. Motion retargeting can generate the parameters of the motion, such as joint angles or other specific properties, for the target character while satisfying constraints on the resulting poses.
An action is described as a sequence of 3D trajectories of joints, i.e., a time series. The representation of an action depends on the choice of reference coordinate system, which varies with biometric differences. Some works, such as [29], compute the joint angles between any two connected limbs and use the time series of joint angles as the skeleton motion data. In paper [21], all the 3D joint coordinates are transformed from the world coordinate system to a person-centric coordinate system by placing the hip center (or the shoulder center, where applicable) at the origin. In [28], skeletons are aligned based on the head location, and their scale is normalized by the head length. Hussein et al. [51] normalize the joint coordinates over the sequence to the range [0,1] in all dimensions before computing the descriptor, making it scale invariant. Zanfir et al. [52] smooth each coordinate of the normalized pose vector along the time dimension with a 5x1 Gaussian filter to obtain a better numerical approximation. In the above examples, training data are needed to obtain a reference skeleton used as a standard; the lengths of the limbs of the reference skeleton are adjusted to unit norm.
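A minimal sketch of this kind of pre-processing is given below: each frame is centered on a root joint (assumed here to be the hip center), the skeleton is rescaled by a reference length, and the coordinates are smoothed along time with a Gaussian filter in the spirit of [52]. The joint indices and the choice of reference length are placeholder assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def normalize_sequence(joints, root_idx=0, ref_idx=1, sigma=1.0):
    """Normalize a skeleton sequence of shape (T, J, 3).

    1. Translate every frame so the root joint (e.g. hip center) is the origin.
    2. Divide by a reference length (here: the mean root-to-reference-joint
       distance over the sequence) to compensate for biometric differences.
    3. Smooth each coordinate along the time axis with a Gaussian filter.
    """
    joints = np.asarray(joints, dtype=np.float64)
    centered = joints - joints[:, root_idx:root_idx + 1, :]
    ref_len = np.linalg.norm(centered[:, ref_idx, :], axis=-1).mean()
    scaled = centered / max(ref_len, 1e-8)
    return gaussian_filter1d(scaled, sigma=sigma, axis=0)

# Example: a random sequence of 30 frames with 20 joints.
sequence = np.random.rand(30, 20, 3)
print(normalize_sequence(sequence).shape)  # (30, 20, 3)
```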
There is also the issue in action classification that the action sequences may have different lengths; the length of an action sequence may depend on the velocity and the type of the action.
Some works address this issue: works such as [22,28] use a global feature representation of the entire sequence, sacrificing, in general, the information about the temporal structure of the sequence. The most common approach is the bag-of-words model for human action recognition, which represents a sequence in terms of code words from a dictionary. A temporal pyramid pooling scheme can be used to create a descriptor of an action sequence [51,53,54].
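The sketch below illustrates a simple temporal pyramid pooling of per-frame descriptors: the sequence is pooled over its whole duration, then over halves and quarters, and the pooled vectors are concatenated so that sequences of different lengths yield a fixed-size descriptor. The number of levels and the use of mean pooling are arbitrary illustrative choices rather than the exact schemes of [51,53,54].

```python
import numpy as np

def temporal_pyramid(frame_descriptors, levels=3):
    """Pool per-frame descriptors (T, D) into a fixed-length vector.

    Level l splits the sequence into 2**l contiguous segments and mean-pools
    each segment, so the output length is D * (2**levels - 1) regardless of T.
    """
    X = np.asarray(frame_descriptors, dtype=np.float64)
    pooled = []
    for level in range(levels):
        for segment in np.array_split(X, 2 ** level, axis=0):
            pooled.append(segment.mean(axis=0) if len(segment) else np.zeros(X.shape[1]))
    return np.concatenate(pooled)

# Two sequences of different lengths map to descriptors of the same size.
print(temporal_pyramid(np.random.rand(45, 16)).shape)   # (112,)
print(temporal_pyramid(np.random.rand(90, 16)).shape)   # (112,)
```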
Approaches like [55,56] overcome the issue of the different lengths of the sequences by assuming the same order of the linear dynamical system for all action classes and performing system identification to compute the parameters of the model.
V. Action Representation And Classification
3D action representations can be categorized into joint-based representations, mined-joints-based descriptors and dynamics-based descriptors. Joint-based representations can be further categorized into spatial descriptors, geometric descriptors and key-pose-based descriptors.
A. Joint-based representation
The basic idea of this category is to capture the correlation of the body joints by extracting suitable feature descriptors. According to the kind of characteristics extracted from the skeleton sequence, three subcategories can be distinguished: spatial descriptors, geometric descriptors and key-pose-based descriptors; their state of the art is introduced one by one in the following.
The first subcategory, spatial descriptors, is the simplest way to capture the relative body joint locations by considering all the pairwise distances of the body joints, including the distances between all joints within the current frame and the distances between the joints of the current frame and those of the previous frame. For example, in [26], the distance between every pair of points in the current frame is computed from the skeletons produced by the OpenNI software. To capture motion information, the Euclidean distances for all joint location pairs between the current frame and the previous frame are computed, and, to capture the overall dynamics of body movement, similar distances are computed between the current frame and a generic skeleton. Each individual feature value is clustered into one of 5 groups via k-means and replaced with a 5-bit vector containing a 1 at the cluster index of the value and 0 for all other bits. The work presents a novel logistic regression learning framework that automatically finds the most discriminative canonical body pose representation of each action and then performs classification using these extracted poses. Paper [23] uses a similar method based on 3D position differences, rather than distances, of joints within the same skeleton, between the joints in the current frame and those in the previous frame, and between the current frame and the initial frame. It then applies Principal Component Analysis to the joint differences to obtain EigenJoints, reducing redundancy and noise. A Naive-Bayes-Nearest-Neighbor (NBNN) classifier is employed to recognize multiple action categories. There are further attempts to capture the correlation of joints by means of the covariance matrix of the joint locations to represent the skeleton sequence [51] or by convolutional neural networks [56].
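A simplified sketch of this family of spatial descriptors follows: it computes the within-frame pairwise joint distances together with the joint displacements from the previous frame, and then applies a PCA-like projection in the spirit of the EigenJoints idea. It is an illustration of the general recipe, not the exact features of [26] or [23].

```python
import numpy as np

def pairwise_distances(frame):
    """All pairwise Euclidean distances between the joints of one frame (J, 3)."""
    diff = frame[:, None, :] - frame[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(frame), k=1)
    return dist[iu]

def spatial_descriptor(sequence):
    """Per-frame descriptor: within-frame pairwise distances concatenated with
    the joint displacements from the previous frame (zeros for the first)."""
    seq = np.asarray(sequence, dtype=np.float64)
    feats = []
    for t, frame in enumerate(seq):
        motion = (frame - seq[t - 1]).ravel() if t > 0 else np.zeros(frame.size)
        feats.append(np.concatenate([pairwise_distances(frame), motion]))
    return np.stack(feats)

def eigen_features(descriptors, n_components=32):
    """PCA-style reduction of per-frame descriptors (an EigenJoints-like step):
    project onto the top principal directions to remove redundancy and noise."""
    X = descriptors - descriptors.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

seq = np.random.rand(40, 20, 3)                 # 40 frames, 20 joints
desc = spatial_descriptor(seq)                  # 190 distances + 60 motion values
print(desc.shape, eigen_features(desc).shape)   # (40, 250) (40, 32)
```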
The second subcategory, geometric descriptors, tries to estimate the sequence of geometric transformations required to represent the relations among different body parts. Evangelidis et al. [57] propose a local skeleton descriptor that encodes the relative position of joint quadruples. Such a coding implies a similarity normalization transform that leads to a compact (6D) view-invariant skeletal feature, referred to as a skeletal quad. Furthermore, a Fisher kernel representation is used to describe the skeletal quads contained in a (sub)action: a Gaussian mixture model is trained on the training data so that any set of quads can be encoded by its Fisher vector. A multi-layer representation of Fisher vectors leads to an action description that roughly preserves the order of sub-actions within each action sequence. Efficient classification is achieved with linear SVMs. The proposed action representation is tested on widely used datasets, MSRAction3D and HDM05.
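As a rough illustration of the similarity normalization behind skeletal quads, the sketch below maps the first joint of a quadruple to the origin and the second to [1,1,1], then expresses the remaining two joints in that normalized frame, yielding a 6D code. The rotation aligning the two anchor joints is not unique; one valid choice (a Rodrigues rotation between the two directions) is used here, so this is only an approximation of the construction in [57].

```python
import numpy as np

def rotation_between(a, b):
    """One rotation matrix mapping unit vector a onto unit vector b (Rodrigues)."""
    v = np.cross(a, b)
    c, s = float(np.dot(a, b)), np.linalg.norm(v)
    if s < 1e-12:
        # Degenerate case: vectors already parallel. (For exactly opposite
        # vectors a proper 180-degree rotation should be built instead.)
        return np.eye(3)
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K * ((1 - c) / s ** 2)

def skeletal_quad(x1, x2, x3, x4):
    """6D code of a joint quadruple: a similarity transform sends x1 to the
    origin and x2 to [1, 1, 1]; the transformed x3 and x4 form the descriptor."""
    x1, x2, x3, x4 = (np.asarray(p, dtype=np.float64) for p in (x1, x2, x3, x4))
    axis = x2 - x1
    scale = np.sqrt(3.0) / np.linalg.norm(axis)
    R = rotation_between(axis / np.linalg.norm(axis), np.ones(3) / np.sqrt(3.0))
    transform = lambda p: scale * (R @ (p - x1))
    return np.concatenate([transform(x3), transform(x4)])

# Sanity check on random joints: x1 -> 0 and x2 -> (1, 1, 1) by construction.
j = np.random.rand(4, 3)
print(skeletal_quad(*j))
```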
The final subcategory, key-pose-based descriptors, computes a set of key-poses and represents the skeleton by the closest key-poses. A key-pose is defined by means of only one or two specific features of the complete posture. Histograms of postures are used as a baseline method, ranking the joints in different ways and comparing them, with the best ranking used to represent the action sequence; this approach is described in [29], where it is called Histograms of Most Informative Joints (HMIJ).
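A generic key-pose pipeline (not the specific construction of [29]) can be sketched as follows: key-poses are obtained by clustering training frames with k-means, and each sequence is then represented as a histogram of the key-poses closest to its frames.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_key_poses(training_frames, n_key_poses=20, seed=0):
    """Cluster flattened training poses (N, J*3) into a dictionary of key-poses."""
    return KMeans(n_clusters=n_key_poses, n_init=10, random_state=seed).fit(training_frames)

def sequence_histogram(kmeans, sequence_frames):
    """Represent a sequence (T, J*3) as a normalized histogram of the
    key-poses closest to its frames."""
    labels = kmeans.predict(sequence_frames)
    hist = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

# Toy example with random poses of 20 joints (flattened to 60 dimensions).
train = np.random.rand(500, 60)
model = learn_key_poses(train)
print(sequence_histogram(model, np.random.rand(80, 60)))
```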
B. Mined joints based descriptors
Mined-joints-based descriptors attempt to learn which body parts are involved in an action and use them to discriminate among actions; they use the information from subsets of joints to represent the whole motion. Paper [53] also uses this kind of descriptor: it models each joint by its location and velocity in a spherical coordinate system, and each action is modeled as a set of histograms; a temporal pyramid can then be used to capture the temporal structure of the poses. At this point it should be clear that the different action representations are compatible; they are not sharply distinguished and not independent of each other.
C. Dynamics-based descriptor
This method treats the skeleton sequence as 3D trajectories and models the dynamics of such time series. It can be achieved by considering linear dynamical systems (LDS) [55,56], hidden Markov models (HMM), or mixed approaches [58].
Paper [54] divides the skeleton into different parts and presents a hierarchical skeletal feature extraction procedure for each frame. It then models the dynamics of these features over the entire sequence, as well as over small spatial and temporal windows, using a set of LDSs. Finally, it proposes to use Multiple Kernel Learning (MKL) to compute discriminative metrics for these sets of LDSs by learning the optimal weights for each part configuration and temporal extent.
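As an illustration of dynamics-based modeling, a first-order linear dynamical system x_{t+1} ≈ A x_t can be identified from a feature time series with a least-squares fit; classification then typically compares the learned system parameters through suitable metrics or kernels, as in [54,55]. The sketch below shows only the fitting step.

```python
import numpy as np

def fit_lds_transition(features):
    """Least-squares estimate of A in x_{t+1} = A x_t for a series (T, D).

    Solves min_A || X_next - X_prev A^T ||_F and returns A of shape (D, D).
    """
    X = np.asarray(features, dtype=np.float64)
    X_prev, X_next = X[:-1], X[1:]
    A_T, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
    return A_T.T

# A noisy planar rotation is recovered reasonably well from its trajectory.
true_A = np.array([[np.cos(0.1), -np.sin(0.1)], [np.sin(0.1), np.cos(0.1)]])
x = [np.array([1.0, 0.0])]
for _ in range(199):
    x.append(true_A @ x[-1] + 0.01 * np.random.randn(2))
print(np.round(fit_lds_transition(np.stack(x)), 3))
```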
VI. Fusing skeletal data with other sensors’ data
There are several approaches, such as Action Recognition Based on a Bag of 3D Points [59], which employs an action graph to explicitly model the dynamics of the actions and a bag of 3D points to characterize a set of salient postures corresponding to the nodes of the action graph, and Mining Actionlet Ensemble [60], which learns an actionlet ensemble to represent each action captured by a depth camera. These approaches fuse data or information from several data streams, such as skeleton data, depth maps and videos.
Approaches working only with depth maps try to describe the approximate distance from the human surface to the camera. Vieira and Nascimento et al. [61] divide the space and time axes into space-time occupancy patterns (STOP), which are computed to represent the 3D action. In [24], the depth sequence is described by histograms of oriented 4D surface normals (HON4D) captured in the 4D volume spanned by the spatial coordinates, depth and time.
There are also methods that re-interpret techniques which proved successful for action recognition from RGB video and that may prove useful for depth data: for example, Ohn-Bar and Trivedi [62] use a spatiotemporal feature for depth maps based on a modified histogram of oriented gradients (HOG) to recognize the action.
Other works use the skeleton data to detect body parts from a depth camera. Wang et al. [60] use depth data and the estimated 3D joint positions to compute a local occupancy pattern feature; the temporal structure of the action is captured by a Fourier temporal pyramid. In [63], HOG is used to describe both the images and the depth maps, combining RGB, depth and hand positions with body pose and motion features extracted from the skeleton joints.
Human activity recognition using multi-features and multiple kernel learning (MKL) [64] focuses on human activity recognition in RGB-D data sequences. Shape and motion information are fused: shape features describe the 3D silhouette structure from the depth map, while motion features describe body movement from the estimated 3D joint positions. Four distal limb segments of the human body are used to describe the motion in this method: the left and right arms and legs. Each distal limb segment is described by its orientation and translation distance with respect to the initial frame. MKL is used to produce an optimal kernel matrix within the SVM.
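To give a flavor of kernel-level fusion, the sketch below combines kernels computed on two hypothetical feature streams (a "shape" and a "motion" descriptor) into a single weighted kernel fed to an SVM with a precomputed kernel. Real MKL, as used in [64], learns the combination weights; fixed weights are used here only for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(feature_blocks, weights):
    """Weighted sum of RBF kernels, one per feature stream."""
    return sum(w * rbf_kernel(X) for X, w in zip(feature_blocks, weights))

# Hypothetical descriptors for 100 sequences: a "shape" and a "motion" stream.
shape_feats, motion_feats = np.random.rand(100, 50), np.random.rand(100, 30)
labels = np.random.randint(0, 4, size=100)

# Fixed weights stand in for the weights that MKL would learn from data.
K_train = combined_kernel([shape_feats, motion_feats], weights=[0.6, 0.4])
clf = SVC(kernel="precomputed").fit(K_train, labels)
print(clf.predict(K_train)[:10])
```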
Combining skeletal data and accelerometer measurements may be a good way to go for assistive monitoring technologies. In [65], a hidden Markov model (HMM) classifier over the concatenated observed features is used for hand gesture recognition. In the next section, some datasets are described.
Fig. 1. Sample of depth maps with 3D joint positions for the tennis serve action.
VII. Datasets
In Table 2, the most commonly used datasets for skeleton-based action recognition are listed. The columns of the table are: the dataset name, the reference to the paper introducing the dataset, the number of action classes, the number of action sequences, the number of subjects (NS) performing the actions, and the presence (N=No, Y=Yes) of RGB, depth and skeletal data. Some datasets may be inappropriate for methods requiring the learning of the parameters of complex models because of the small number of available samples.
Table 2. The most commonly used datasets for skeleton-based action recognition.
Dataset | Ref. | Classes | Seq. | NS | RGB | Depth | Skel. |
UCF | [26] | 16 | 1280 | 16 | N | N | Y |
MHAD | [66] | 11 | 660 | 12 | Y | Y | Y |
MSRA 3D | [60] | 20 | 557 | 10 | N | Y | Y |
MSRDA | [60] | 16 | 320 | 16 | Y | Y | Y |
UTKA | [58] | 10 | 195 | 10 | Y | Y | Y |
A. UCF
The UCF dataset contains only skeletal data. It provides the skeleton (15 joints) data for 16 actions performed 5 times by 16 individuals (13 males and 3 females, all aged between 20 and 35). There are 1280 action samples in total, with a temporal duration ranging in [27,229] frames and an average length of 66±34 frames. The actions in this dataset are: balance, climb ladder, climb up, duck, hop, kick, leap, punch, run, step back, step front, step left, step right, twist left, twist right, and vault.
B. MHAD
The MHAD (Multimodal Human Action Database) [66] provides data from a motion capture system, stereo cameras, depth sensors, accelerometers, and microphones. There are 660 motion sequences of 11 actions performed 5 times by 12 actors. The actions in this dataset are: jumping in place, jumping jacks, bending, punching, waving two hands, waving right hand, clapping, throwing a ball, sit down and stand up, sit down, and stand up.
C. MSRA3D
The MSRA3D dataset [59] provides both skeletal and depth data. The skeleton (20 joints) for 20 actions performed 2-3 times by 10 subjects is provided. Both 3D world coordinates and screen coordinates plus depth are available for the detected skeleton joints. The actions in this dataset are: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up and throw.
Table 4. Average accuracy values reported for skeleton-based action recognition. (S) indicates that the performance refers to the adoption of skeletal data only, while the cited work also reports performance on a hybrid representation. The Cat. column gives the representation category of Section V: S = spatial, G = geometric, K = key-pose, M = mined joints, D = dynamics-based.
Cat. | Methods | Classifier | UCF | MHAD | MSRDA | UTKA | MSRA-3D All |
S | [26] | Log. reg. | 95.94% | – | – | – | 65.7% |
S | [56] | sHMM + CNN | – | – | – | – | 82% |
S | [67] | CNN | – | 98.38%a | – | – | – |
S | [54] | SVM | 98.8% | – | – | – | 91.4%a |
S | [52] | KNN + Vote | 98.50% | – | 73.8%b | – | 91.7%b |
S | [62] (S) | SVM | 97.07% | – | – | – | 83.53% |
S | [66] | K-SVM | – | 79.93% | – | – | – |
S | [68] (S) | Rnd forest | – | – | – | 87.9%b | – |
G | [27] | DTW+SVM | – | – | – | 97.08% | 89.48% |
G | [69] | DP+KNN | – | – | – | 91.5% | – |
G | [70] | DP+KNN | 99.2% | – | – | 91.5% | – |
K | [71] | Gen. model | – | – | – | – | 89.5%b |
K | [22] | sHMM | – | – | – | 90.92% | – |
K | [72] | SVM | – | – | – | – | 90.56%c |
M | [29] | SVM | – | 95.37% | – | – | 33.99%d |
M | [73] | NBNN | – | – | 70% | – | – |
M | [53] | PLS-SVM | – | – | 70% | – | 91.5% |
M | [53] | PLS-SVM+TP | – | – | 73.1% | – | 90.1% |
M | [60] (S) | SVM | – | – | 68% | – | 88.2% |
D | [54] | MKL-SVM | – | 100% | – | – | 90%d |
D | [55] | SVM | 97.91%b | – | – | 88.5% | 91.21%d |
D | [58] | dHMM | – | – | – | 86.7% | 89.1% |
D | [74] | dHMM | 97.66% | – | – | – | 89.23% |
D. UTKA
The UTKA dataset [58] provides RGB, depth and skeletal data. The skeleton consists of 20 joints; 10 actions are performed twice by 10 subjects. The actions in this dataset are: walk, sit down, stand up, pick up, carry, throw, push, pull, wave and clap hands. The dataset comprises 195 sequences. The length of the skeleton sequences ranges in [5,170] frames, with an average value of 30.5±20 frames.
Experiments in [22] are performed in leave-one-out cross-validation (LOOCV) on a subset of manually selected joints: head, L/R elbow, L/R hand, L/R knee, L/R foot, hip center and L/R hip.
E. MSR daily activity 3D
The MSR Daily Activity 3D dataset [60] contains 16 activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk, play guitar, stand up, and sit down. Actions are performed twice by 10 different subjects: once in a standing position and once in a sitting position. There are three types of streams, depth maps, skeleton joint positions (20 joints), and RGB video, for 320 gesture samples. In some cases more than one skeleton is detected in a frame; the authors of [60] suggest using the first detected skeleton. Joint locations are represented both in real-world coordinates and in normalized screen coordinates plus depth.
VIII. Comparison of methods and Conclusions
We will discuss the performance of the reviewed methods in 3D skeleton-based action classification in this section.
First of all, we need to clarify how the performance of these methods is evaluated; then we pay particular attention to methods that report experimental results on the most commonly used datasets.
The experimental results are taken from [75], which reports an average accuracy value for each method on the most commonly used datasets. Finally, we draw a conclusion.
A. Performance evaluation
The measure we use to assess the performance of a classification method is the average accuracy value, defined as the number of correctly classified sequences over the total number of sequences to classify.
Here we only report the average accuracy value attained by each method, unless specified otherwise.
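For completeness, this measure amounts to a one-line computation; a hypothetical example:

```python
def average_accuracy(predicted_labels, true_labels):
    """Number of correctly classified sequences over the total number."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# E.g., 3 of 4 test sequences classified correctly gives 75% accuracy.
print(average_accuracy(["wave", "kick", "run", "run"],
                       ["wave", "kick", "run", "walk"]))  # 0.75
```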
B. Datasets included in the comparison
The most used datasets/benchmarks are the UTKA, UCF, MSRA-3D, MHAD and MSRDA datasets. Among these benchmarks, we note that MHAD is the only one providing Mo-Cap data, while UCF provides skeletons of 15 joints estimated from depth maps.
The rest of the datasets provide skeletons of 20 joints estimated from depth maps (see Table 2 for details on the main characteristics of each dataset).
C. Discussion
We have collected in Table 4 the accuracy values reported by methods at the state of the art on UCF, MHAD, MSRDA, UTKA and MSRA-3D. In this table, all action classes are used in the experiments.
D. Conclusion
Thanks to cheaper depth cameras and to successful works on tracking from depth maps and skeleton estimation, a lot of work on human action recognition from skeleton sequences has been done. We have mentioned the main technologies (hardware and software) in this paper. These technologies for 3D skeleton-based action classification can help us identify an action when the only input is the skeleton data over time.
In this paper, we also describe the most commonly used datasets. The most adopted datasets for skeleton-based action classification are UCF, MHAD, MSRDA, UTKA and MSRA-3D; they differ in the number of body joints in the skeletons, the availability of depth/RGB data, and so on.
Finally, we have discussed the performance of these classification methods by using the experimental results reported in [75].
Last but not least, in order to extend our understanding of human action recognition, we believe that reasoning on skeleton data is very convenient. Such studies could greatly enhance both gaming applications and human-computer interaction [76].
References
[1] S. Kwak, B. Han, J. Han, Scenario-based video event recognition by constraint flow, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, 2011, pp. 3345–3352, http://dx.doi.org/10.1109/CVPR.2011.5995435.
[2] U. Gaur, Y. Zhu, B. Song, A. Roy-Chowdhury, A string of feature graphs model for recognition of complex activities in natural videos, in: Proceedings of International Conference on Computer Vision (ICCV), IEEE, Barcelona, Spain, 2011, pp. 2595–2602, http://dx.doi.org/10.1109/ICCV.2011.6126549.
[3] S. Park, J. Aggarwal, Recognition of two-person interactions using a hierarchical Bayesian network, in: First ACM SIGMM International Workshop on Video Surveillance, ACM, Berkeley, California, 2003, pp. 65–76, http://dx.doi.org/10.1145/982453.982461.
[4] I. Junejo, E. Dexter, I. Laptev, P. Pérez, View-independent action recognition from temporal self-similarities, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 172–185, http://dx.doi.org/10.1109/TPAMI.2010.68.
[5] Z. Duric, W. Gray, R. Heishman, F. Li, A. Rosenfeld, M. Schoelles, C. Schunn, H. Wechsler, Integrating perceptual and cognitive modeling for adaptive and intelligent human–computer interaction, Proc. IEEE 90 (2002) 1272–1289, http://dx.doi.org/10.1109/JPROC.2002.801450.
[6] Y.-J. Chang, S.-F. Chen, J.-D. Huang, A Kinect-based system for physical rehabilitation: a pilot study for young adults with motor disabilities, Res. Dev. Disabil. 32 (6) (2011) 2566–2570, http://dx.doi.org/10.1016/j.ridd.2011.07.002.
[7] A. Thangali, J.P. Nash, S. Sclaroff, C. Neidle, Exploiting phonological constraints for handshape inference in ASL video, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, 2011, pp. 521–528, http://dx.doi.org/10.1109/CVPR.2011.5995718.
[8] A. Thangali Varadaraju, Exploiting phonological constraints for handshape recognition in sign language video (Ph.D. thesis), Boston University, MA, USA, 2013.
[9] H. Cooper, R. Bowden, Large lexicon detection of sign language, in: Proceedings of International Workshop on Human–Computer Interaction (HCI), Springer, Berlin, Heidelberg, Beijing, P.R. China, 2007, pp. 88–97.
[10] H. Moon, R. Sharma, N. Jung, Method and system for measuring shopper response to products based on behavior and facial expression, US Patent 8,219,438, July 10, 2012 〈http://www.google.com/patents/US8219438〉.
[11] G. Johansson, Visual perception of biological motion and a model for its analysis, Percept. Psychophys. 14 (2) (1973) 201–211.
[12] L. Sigal, Human pose estimation, Comput. Vis.: A Ref. Guide (2014) 362–370.
[13] K. Mikolajczyk, B. Leibe, B. Schiele, Multiple object class detection with a generative model, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE, New York, 2006, pp. 26–37.
[14] P. Viola, M.J. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, in: Proceedings of International Conference on Computer Vision (ICCV), IEEE, Nice, France, 2003, pp. 734–742.
[15] P.F. Felzenszwalb, D.P. Huttenlocher, Pictorial structures for object recognition, Int. J. Comput. Vis. 61 (1) (2005) 55–79, http://dx.doi.org/10.1023/B:VISI.0000042934.15159.50.
[16] V. Ferrari, M. Marin-Jimenez, A. Zisserman, Progressive search space reduction for human pose estimation, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Anchorage, Alaska, 2008, pp. 1–8, http://dx.doi.org/10.1109/CVPR.2008.4587468.
[17] D. Ramanan, Learning to parse images of articulated objects, in: Advances in Neural Information Processing Systems 134 (2006).
[18] N. Ikizler, P. Duygulu, Human action recognition using distribution of oriented rectangular patches, in: Proceedings of Workshop on Human Motion Understanding, Modeling, Capture and Animation, Springer, Rio de Janeiro, Brazil, 2007, pp. 271–284.
[19] A. Klaser, M. Marszałek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, in: Proceedings of British Machine Vision Conference (BMVC), BMVA Press, Leeds, UK. 2008, p. 275:1.
[20] L. Wang, Y. Wang, T. Jiang, D. Zhao, W. Gao, Learning discriminative features for fast frame-based action recognition, Pattern Recognit. 46 (7) (2013) 1832–1840, http://dx.doi.org/10.1016/j.patcog.2012.08.016.
[21] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore, Real-time human pose recognition in parts from single depth images, Commun. ACM 56 (1) (2013) 116–124, http://dx.doi.org/10.1145/2398356.2398381.
[22] L. Xia, C.-C. Chen, J. Aggarwal, View invariant human action recognition using histograms of 3D joints, in: Proceedings of Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Providence, Rhode Island, 2012, pp. 20–27, http://dx.doi.org/10.1109/CVPRW.2012.6239233.
[23] X. Yang, Y. Tian, Eigenjoints-based action recognition using Naive-Bayes-Nearest-Neighbor, in: Proceedings of Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Providence, Rhode Island, 2012, pp. 14–19, http://dx.doi.org/10.1109/CVPRW.2012.6239232.
[24] O. Oreifej, Z. Liu, W. Redmond, HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), Portland, Oregon, 2013, pp. 716–723, http://dx.doi.org/10.1109/CVPR.2013.98.
[25] A. Yao, J. Gall, G. Fanelli, L.J. Van Gool, Does human action recognition benefit from pose estimation? in: Proceedings of the British Machine Vision Conference (BMVC), vol. 3, BMVA Press, Dundee, UK, 2011, pp. 67.1–67.11, http://dx.doi.org/10.5244/C.25.67.
[26] S.Z. Masood, C. Ellis, M.F. Tappen, J.J. LaViola, R. Sukthankar, Exploring the trade-off between accuracy and observational latency in action recognition, Int. J. Comput. Vis. 101 (3) (2013) 420–436, http://dx.doi.org/10.1007/s11263-012-0550-7.
[27] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie Group, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Columbus, Ohio, 2014, pp. 588–595, http://dx.doi.org/10.1109/CVPR.2014.82.
[28] C. Wang, Y. Wang, A.L. Yuille, An approach to pose-based action recognition, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Portland, Oregon, 2013, pp. 915–922, http://dx.doi.org/10.1109/CVPR.2013.123.
[29] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition, J. Vis. Commun. Image Represent. 25 (1) (2014) 24–38, http://dx.doi.org/10.1016/j.jvcir.2013.04.007.
[30] I. Infantino, A. Chella, H. Dindo, I. Macaluso, Visual control of a robotic hand, in: Proceedings of International Conference on Intelligent Robots and Systems (IROS), vol. 2, IEEE, Las Vegas, NV, USA, 2003, pp. 1266–1271, http://dx.doi.org/10.1109/IROS.2003.1248819.
[31] A. Chella, H. Dindo, I. Infantino, I. Macaluso, A posture sequence learning system for an anthropomorphic robotic hand, Robot. Auton. Syst. 47 (2) (2004) 143–152, http://dx.doi.org/10.1016/j.robot.2004.03.008.
[32] P. Henry, M. Krainin, E. Herbst, X. Ren, D. Fox, RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments, in: Experimental Robotics, Springer Tracts in Advanced Robotics, vol. 79, Springer, Berlin, Heidelberg, 2014, pp. 477–491, http://dx.doi.org/10.1007/978-3-642-28572-1_33.
[33] J.C. Carr, R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum, T. R. Evans, Reconstruction and representation of 3D objects with radial basis functions, in: Proceedings of Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), ACM, Los Angeles, CA, USA, 2001, pp. 67–76, http://dx.doi.org/10.1145/383259.383266.
[34] V. Kolmogorov, R. Zabih, Multi-camera scene reconstruction via graph cuts, in: Proceedings of European Conference on Computer Vision (ECCV), Springer, Copenhagen, Denmark, 2002, pp. 82–96.
[35] E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, vol. 201, Prentice Hall, Englewood Cliffs, 1998.
[36] S. Foix, G. Alenya, C. Torras, Lock-in time-of-flight (tof) cameras: a survey, IEEE Sens. J. 11 (9) (2011) 1917–1926.
[37] D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured light, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE, Madison, Wisconsin, 2003, p. I-195.
[38] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Anchorage, Alaska, 2008, pp. 1–8, http://dx.doi.org/10.1109/CVPR.2008.4587597.
[39] Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures-of- parts, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, 2011, pp. 1385–1392, http://dx.
[40] J. Shen, W. Yang, Q. Liao, Part template: 3D representation for multiview human pose estimation, Pattern Recognit. 46 (7) (2013) 1920–1932, http://dx.doi.org/10.1016/j.patcog.2013.01.001.
[41] M. Ye, X. Wang, R. Yang, L. Ren, M. Pollefeys, Accurate 3d pose estimation from a single depth image, in: Proceedings of International Conference on Computer Vision (ICCV), IEEE, Barcelona, Spain, 2011, pp. 731–739.
[42] T.B. Moeslund, E. Granum, A survey of computer vision-based human motion capture, Comput. Vis. Image Underst. 81 (3) (2001) 231–268, http://dx.doi.org/10.1006/cviu.2000.0897.
[43] X. Ren, A.C. Berg, J. Malik, Recovering human body configurations using pairwise constraints between parts, in: Proceedings of International Conference on Computer Vision (ICCV), vol. 1, IEEE, Beijing, P.R. China, 2005, pp. 824–831.
[44] T.-P. Tian, S. Sclaroff, Fast globally optimal 2D human detection with loopy graph models, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, San Francisco, CA, USA, 2010, pp. 81–88.
[45] D. Tran, D. Forsyth, Improved human parsing with a full relational model, in: Proceedings of European Conference on Computer Vision (ECCV), Springer, Crete, Greece, 2010, pp. 227–241.
[46] B. Sapp, A. Toshev, B. Taskar, Cascaded models for articulated pose estimation, in: Proceedings of European Conference on Computer Vision (ECCV), Springer, Crete, Greece, 2010, pp. 406–420.
[47] Y. Wang, D. Tran, Z. Liao, Learning hierarchical poselets for human parsing, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, 2011, pp. 1705–1712.
[48] M.P. Kumar, A. Zisserman, P.H. Torr, Efficient discriminative learning of parts-based models, in: Proceedings of International Conference on Computer Vision (ICCV), IEEE, Kyoto, Japan, 2009, pp. 552–559.
[49] OpenNI 2 SDK Binaries, 〈http://structure.io/openni〉, 2014.
[50] M. Gleicher, Retargetting motion to new characters, in: Proceedings of Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), ACM, Orlando, Florida, USA, 1998, pp. 33–42, http://dx.doi.org/10.1145/280814.280820.
[51] M.E. Hussein, M. Torki, M.A. Gowayyed, M. El-Saban, Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations, in: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), AAAI Press, Beijing, P.R. China, 2013, pp. 2466–2472.
[52] M. Zanfir, M. Leordeanu, C. Sminchisescu, The moving pose: an efficient 3d kinematics descriptor for low-latency action recognition and detection, in: Proceedings of International Conference on Computer Vision (ICCV), IEEE, Sydney, Australia, 2013, pp. 2752–2759.
[53] A. Eweiwi, M.S. Cheema, C. Bauckhage, J. Gall, Efficient pose-based action recognition, in: Proceedings of Asian Conference on Computer Vision (ACCV), Springer, Singapore, 2014, pp. 1–16.
[54] T. Kerola, N. Inoue, K. Shinoda, Spectral graph skeletons for 3D action recognition, in: Proceedings of Asian Conference on Computer Vision (ACCV), Springer, Singapore, 2014, pp. 1–16.
[55] R. Slama, H. Wannous, M. Daoudi, A. Srivastava, Accurate 3D action recognition using learning on the Grassmann manifold, Pattern Recognit. 48 (2) (2015) 556–567, http://dx.doi.org/10.1016/j.patcog.2014.08.011.
[56] D. Wu, L. Shao, Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Columbus, Ohio, 2014, pp. 724–731.
[57] G. Evangelidis, G. Singh, R. Horaud, et al., Skeletal quads: human action recognition using joint quadruples, in: Proceedings of International Conference on Pattern Recognition (ICPR), IEEE, Stockholm, Sweden, 2014, pp. 4513–4518, http://dx.doi.org/10.1109/ICPR.2014.772.
[58] L. Lo Presti, M. La Cascia, S. Sclaroff, O. Camps, Gesture modeling by Hanklet-based hidden Markov model, in: D. Cremers, I. Reid, H. Saito, M.-H. Yang (Eds.), Proceedings of Asian Conference on Computer Vision (ACCV 2014), Lecture Notes in Computer Science, Springer International Publishing, Singapore, 2015, pp. 529–546, http://dx.doi.org/10.1007/978-3-319-16811-1_35.
[59] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: Proceedings of Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, San Francisco, CA, USA, 2010, pp. 9–14, http://dx.doi.org/10.1109/CVPRW.2010.5543273.
[60] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Providence, Rhode Island, 2012, pp. 1290–1297, http://dx.doi.org/10.1109/CVPR.2012.6247813.
[61] A.W. Vieira, E.R. Nascimento, G.L. Oliveira, Z. Liu, M.F. Campos, STOP: space–time occupancy patterns for 3D action recognition from depth map sequences, Prog. Pattern Recognit. Image Anal. Comput. Vis. Appl. (2012) 252–259, http://dx.doi.org/10.1007/978-3-642-33275-331.
[62] E. Ohn-Bar, M.M. Trivedi, Joint angles similarities and HOG2 for action recognition, in: Proceedings of Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Portland, Oregon, 2013, pp. 465–470, http://dx.doi.org/10.1109/CVPRW.2013.76.
[63] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: Proceedings of International Conference on Robotics and Automation (ICRA), IEEE, St. Paul, Minnesota, 2012, pp. 842–849, http://dx.doi.org/10.1109/ICRA.2012.6224591.
[64] S. Althloothi, M.H. Mahoor, X. Zhang, R.M. Voyles, Human activity recognition using multi-features and multiple kernel learning, Pattern Recognit. 47 (5) (2014) 1800–1812.
[65] K. Liu, C. Chen, R. Jafari, N. Kehtarnavaz, Fusion of inertial and depth sensor data for robust hand gesture recognition, IEEE Sens. J. 14 (6) (2014) 1898–1903.
[66] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Berkeley MHAD: a comprehensive multimodal human action database, in: Proceedings of Workshop on Applications of Computer Vision (WACV), IEEE, Clearwater Beach, Florida, 2013, pp. 53–60.
[67] E.P. Ijjina, C.K. Mohan, Human action recognition based on MOCAP information using convolution neural networks, in: Proceedings of International Conference on Machine Learning and Applications (ICMLA), IEEE, Detroit, Michigan, 2014, pp. 159–164, http://dx.doi.org/10.1109/ICMLA.2014.30.
[68] Y. Zhu, W. Chen, G. Guo, Fusing spatiotemporal features and joints for 3D action recognition, in: Proceedings of Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Portland, Oregon, 2013, pp. 486–491, http://dx.doi.org/10.1109/CVPRW.2013.78.
[69] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, A. Del Bimbo, Space–time pose representation for 3D human action recognition, in: Proceedings of the International Conference on Image Analysis and Processing (ICIAP), Springer, Naples, Italy, 2013, pp. 456–464, http://dx.doi.org/10.1007/978-3-642-41190-850.
[70] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, A. Del Bimbo, 3-D human action recognition by shape analysis of motion trajectories on Rie- mannian manifold, IEEE Trans. Cybern. 45 (7) (2015) 1340–1353.
[71] I. Lillo, A. Soto, J.C. Niebles, Discriminative hierarchical modeling of spatio-temporally composable human activities, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Columbus, Ohio, 2014, pp. 812–819.
[72] M. Barnachon, S. Bouakaz, B. Boufama, E. Guillou, Ongoing human action recognition with motion capture, Pattern Recognit. 47 (1) (2014) 238–247, http://dx.doi.org/10.1016/j.patcog.2013.06.020.
[73] L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo, P. Pala, Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses, in: Proceedings of Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Portland, Oregon, 2013, pp. 479–485.
[74] L. Lo Presti, M. La Cascia, S. Sclaroff, O. Camps, Hankelet-based dynamical systems modeling for 3D action recognition, Image and Vision Computing 44 (2015) 29–43, http://dx.doi.org/10.1016/j.imavis.2015.09.007 〈http://www.sciencedirect.com/science/article/pii/S0262885615001134〉.
[75] L. Lo Presti, M. La Cascia, 3D skeleton-based human action classification: a survey, Pattern Recognit. 53 (2016) 130–147.
[76] A. Malizia, A. Bellucci, The artificiality of natural user interfaces, Commun. ACM 55 (3) (2012) 36–39.