To precisely capture global contextual information from videos with heavy camera motion and scene changes, this study proposes an improved spatiotemporal two-stream neural network architecture with a novel convolutional fusion layer. The three main contributions of this study are: 1) a ResNet-101 network is integrated independently into each of the two streams of the target network; 2) the two kinds of feature maps (i.e., optical-flow motion and RGB-channel information) produced by corresponding convolution layers of the two streams are superimposed on each other; 3) temporal information is combined with spatial information through an integrated three-dimensional (3D) convolutional neural network (CNN) to extract more latent information from the videos. The proposed approach was evaluated on the UCF-101 and HMDB51 benchmark datasets, and the experimental results show that the proposed two-stream 3D CNN model achieves a substantial improvement in recognition accuracy for video-based analysis.
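
The abstract does not give implementation details, but the fusion idea can be illustrated with a minimal PyTorch sketch: per-frame feature maps from the spatial (RGB) and temporal (optical-flow) streams are superimposed at a matching convolution stage, stacked over time, and passed through a 3D convolution before classification. The layer names, channel counts, element-wise-sum fusion rule, and the 101-class output (matching UCF-101) are illustrative assumptions, not the paper's exact specification.

```python
# Hedged sketch of convolutional fusion of two streams followed by a 3D conv.
# All module names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn


class ConvFusion3D(nn.Module):
    def __init__(self, in_channels=2048, fused_channels=512, num_classes=101):
        super().__init__()
        # 1x1 convolution applied after superimposing the two streams' maps.
        self.fuse = nn.Conv2d(in_channels, fused_channels, kernel_size=1)
        # 3D convolution mixes temporal and spatial information across the
        # stacked per-frame fused feature maps.
        self.conv3d = nn.Conv3d(fused_channels, fused_channels,
                                kernel_size=(3, 3, 3), padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(fused_channels, num_classes)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, time, channels, height, width),
        # e.g. final conv feature maps from two ResNet-101 backbones.
        b, t, c, h, w = rgb_feats.shape
        # Superimpose corresponding feature maps (element-wise sum assumed).
        fused = rgb_feats + flow_feats
        fused = self.fuse(fused.view(b * t, c, h, w))
        # Restore the time axis: (batch, channels, time, height, width).
        fused = fused.view(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        out = self.conv3d(fused)
        out = self.pool(out).flatten(1)
        return self.fc(out)


if __name__ == "__main__":
    model = ConvFusion3D()
    rgb = torch.randn(2, 8, 2048, 7, 7)   # dummy RGB-stream features
    flow = torch.randn(2, 8, 2048, 7, 7)  # dummy flow-stream features
    print(model(rgb, flow).shape)         # torch.Size([2, 101])
```

Summing the two streams' maps before a 1x1 convolution is one common fusion choice; the paper's actual fusion operator may differ (e.g., channel-wise concatenation), in which case the 1x1 convolution's input channel count would change accordingly.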