Video analysis for human action recognition is one of the most active research areas in pattern recognition and computer vision, owing to its wide range of applications. Deep learning-based approaches have proven more effective than conventional feature engineering-based models; however, their performance remains unreliable in real-world application scenarios. Inspired by the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) recurrent network, this paper presents an augmented treble-stream deep neural network architecture that supports direct extraction of spatio-temporal features from video streams and their corresponding dense optical flows. This approach enables effective detection of complex video events characterized by rich appearance and motion features. Experiments carried out and benchmarked on public video event datasets, for example UCF-101 and HMDB-51, record substantially improved recognition accuracy. Analytical evaluation confirms the validity and effectiveness of the treble-stream neural network design.
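To make the treble-stream idea concrete, the following is a minimal PyTorch sketch of one plausible instantiation: a CNN-plus-LSTM stream over RGB frames (appearance), a second over dense optical flow (motion), and a third over their fused per-frame features, combined by late fusion for classification. This is an illustration only, not the paper's exact architecture; the class names (`TrebleStreamNet`, `StreamCNN`), the shallow CNN backbone, and hyperparameters such as `feat_dim` and `hidden_dim` are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class StreamCNN(nn.Module):
    """Small per-frame CNN encoder; a stand-in for a deeper backbone."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        h = self.features(x).flatten(1)        # (B, 64)
        return self.fc(h)                      # (B, feat_dim)


class TrebleStreamNet(nn.Module):
    """Hypothetical treble-stream model: appearance, motion, and
    fused appearance-motion streams with late fusion."""
    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.rgb_cnn = StreamCNN(3, feat_dim)    # RGB appearance stream
        self.flow_cnn = StreamCNN(2, feat_dim)   # dense optical flow (u, v)
        self.rgb_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.flow_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fused_lstm = nn.LSTM(2 * feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def encode(self, cnn, clip):                 # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = cnn(clip.flatten(0, 1))          # (B*T, feat_dim)
        return feats.view(b, t, -1)              # (B, T, feat_dim)

    def forward(self, rgb, flow):
        rgb_seq = self.encode(self.rgb_cnn, rgb)
        flow_seq = self.encode(self.flow_cnn, flow)
        # Three temporal streams: appearance, motion, and their fusion.
        _, (h_rgb, _) = self.rgb_lstm(rgb_seq)
        _, (h_flow, _) = self.flow_lstm(flow_seq)
        _, (h_fused, _) = self.fused_lstm(
            torch.cat([rgb_seq, flow_seq], dim=-1))
        joint = torch.cat([h_rgb[-1], h_flow[-1], h_fused[-1]], dim=-1)
        return self.classifier(joint)            # (B, num_classes) logits


if __name__ == "__main__":
    model = TrebleStreamNet(num_classes=101)     # e.g. the 101 UCF-101 classes
    rgb = torch.randn(2, 16, 3, 112, 112)        # batch of 16-frame RGB clips
    flow = torch.randn(2, 16, 2, 112, 112)       # matching dense optical flow
    print(model(rgb, flow).shape)                # torch.Size([2, 101])
```

In this sketch, late fusion (concatenating the final LSTM hidden states before a single linear classifier) is one design choice among several; score averaging across streams, as in classic two-stream networks, would be an equally reasonable alternative.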