A Comparison of Two Neural Network Based Methods for Human Activity Recognition

Document Type: Research Article


Yazd University


This paper introduces two methods for human activity recognition from video signals. The first method explores the effectiveness of combining local feature descriptors with an artificial neural network classifier. It follows the traditional approach: local descriptors extract interest points or local patches from the videos, feature vectors are constructed from these interest points, and the feature vectors are then used as the input to a two-layer feed-forward artificial neural network (ANN). Experimental results show that the HOG3D descriptor combined with the ANN gives the best performance. Because deep learning architectures have attracted considerable attention for automatic feature extraction in recent years, an improved 3D convolutional neural network architecture is designed as the second method. Both methods are implemented and compared with state-of-the-art approaches on two data sets. The results show that method 1 is superior when a shortage of sample data is the main restriction; it achieves recognition accuracies of 97.8% and 99.8% on the Weizmann and KTH action data sets, respectively. Method 2 is notable for its automatic feature extraction and achieves acceptable results when a large amount of original training data is available: it reaches a recognition accuracy of 92% on the KTH data set, while its accuracy drops drastically on the Weizmann data set.
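The classification stage of method 1 can be illustrated with a minimal sketch of a two-layer feed-forward network applied to pre-computed descriptor vectors. This is not the authors' implementation: the feature dimension (96), the hidden-layer size (20), the tanh activation, and the random placeholder data are assumptions made for illustration only, the weights are untrained, and the six output classes correspond to the six action categories of the KTH data set.

```python
import numpy as np

def init_params(n_in, n_hidden, n_classes, seed=0):
    """Randomly initialise a hidden layer and an output layer (untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }

def forward(params, x):
    """Return class probabilities for a batch of feature vectors."""
    # Hidden layer with tanh activation (an assumed choice).
    h = np.tanh(x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    # Numerically stable softmax over the class axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def predict(params, x):
    """Return the index of the most probable class for each sample."""
    return forward(params, x).argmax(axis=1)

# Placeholder descriptors: 4 videos, each summarised by a 96-dimensional
# feature vector (hypothetical size; real descriptors would come from a
# local descriptor such as HOG3D).
rng = np.random.default_rng(1)
features = rng.normal(size=(4, 96))

params = init_params(n_in=96, n_hidden=20, n_classes=6)
probs = forward(params, features)
labels = predict(params, features)
```

In practice the weights would be fitted on labelled training videos before prediction; the sketch only shows how descriptor vectors flow through the two layers to produce one class label per video.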

