A Review of the Action Recognition Literature Relating to Violence Detection

Esin GÖKÇE, Özgün GÜLER, Arda Efe ŞEN

(c1611022, c1611023, c1611053)@student.cankaya.edu.tr

Department of Computer Engineering, Cankaya University, Nov 5, 2020, Version 1.0


Abstract

Action recognition is a sub-branch of computer vision and has been studied in many areas. Since the 1980s, many different applications have been built around the study of human activities. However, while this work covers simple actions such as walking and running, violent and aggressive movements have been studied relatively less. Moreover, the difficulty people have in spotting acts of mass violence in surveillance footage poses a major problem today. There is therefore a general demand to remove the human factor from crowd-violence monitoring and to establish automatic systems. This paper surveys articles that address the problem of violence detection with different datasets and different methods.

Keywords: Action recognition, Computer vision, Violence detection, Deep learning, Violence detection from videos, Surveillance system, Recognition.

1. Introduction

Computer vision can be described as the attempt to carry out, in a computerized environment, tasks that can otherwise be done visually by humans. The decisions involved are ones humans normally make from digital images or video. Computer vision produces information by analyzing, creating, processing, and interpreting digital images.

Recognition, in turn, is a subtopic of computer vision, image processing, machine learning, and deep learning: once images are captured, it is the determination of whether they contain an object, feature, or activity relevant to a given problem.

There are many subheadings under recognition. Some of them are as follows. Object identification locates and recognizes objects in a video or image. Identification is the process of recognizing a target object by evaluating it against a sample. Detection is based on scanning digital images or videos for a specific condition and is generally computed with machine learning algorithms. Content-based image retrieval is the process of recognizing and finding all images with a certain content within a large image collection. Pose estimation is the estimation of a target object's position, direction, and condition from the camera image. Facial recognition is a technology that can identify and verify a person in digital environments. Shape recognition is the automatic recognition of patterns and regularities in data.

Another topic that needs to be addressed among recognition examples is action recognition, which is concerned with motion estimation over a processed image sequence. Some of the methods involved are visual odometry, tracking, and optical flow. Visual odometry is the estimation of the camera's 3D translation and rotation from the sequence of images it produces; in computer vision, it means predicting the movement of a camera relative to a rigid scene. Tracking is the process of following the movements of a smaller set of interest points or objects through the image sequence; these can be vehicles, people, or other assets, and tracking systems are widely preferred in areas such as security and surveillance. Optical flow determines, for each point in the image, how that point moves relative to the image plane, and it is used in many areas such as object and action extraction.
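As a small, hedged illustration of the optical flow idea described above, the sketch below computes dense flow between two consecutive frames with OpenCV's Farneback method and derives the per-pixel motion magnitude and direction; the video filename is a placeholder assumption.

```python
import cv2

# Placeholder input; any short video readable by OpenCV works here.
cap = cv2.VideoCapture("surveillance_clip.mp4")
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
if not (ok1 and ok2):
    raise RuntimeError("could not read two frames from the video")

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# flow[y, x] holds the (dx, dy) displacement of each pixel between the frames.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)

# Per-pixel motion magnitude and direction, the raw material of the
# flow-based violence descriptors discussed later in this review.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean motion magnitude:", magnitude.mean())
```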

2. Common Findings

Across the articles reviewed and referenced, the methods used to detect violent videos are mostly similar, and every study aims at improvement, since machine learning is a field that is open to and in need of it. The number of studies available on detecting violence in video is limited, and so are the current methods, but the area is clearly developing. First of all, the most important factor in machine learning is the dataset. Collecting a dataset is a difficult and lengthy process, and the dataset used must also be clean and usable. Therefore, as we have observed in the referenced articles, the datasets used are mostly ones that have already been published, sorted, and cleaned. For example, alongside datasets such as Hockey and Movies, the MediaEval datasets are among the most preferred. Studies using their own datasets were also observed. We should point out, however, that an abundance of data and its cleanliness make testing the accuracy of the algorithms more reliable. Common datasets were used throughout the articles reviewed.

Second, when the methods used are examined, the main purpose is to classify data. Classifying the videos we want to study as violent or non-violent is, in the referenced literature, a point where machine learning algorithms are widely applied. SVM and MoSIFT approaches are common in classification, while CNNs are widely used among deep learning algorithms. In addition, the use of a temporal segment network for violence detection was also observed. In short, the priority is to apply classification algorithms to a dataset, and the studies then propose approaches to improve the accuracy and running time of machine learning with those algorithms.
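As a minimal sketch of this common classification step, the example below trains an SVM to separate violent from non-violent clips; the 128-dimensional feature vectors are random stand-ins for real descriptors such as MoSIFT histograms, so the printed accuracy is meaningless, and only the pipeline shape matters.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))    # one 128-d descriptor per clip (stand-in)
y = rng.integers(0, 2, size=200)   # 1 = violent, 0 = non-violent (stand-in)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # the widely used SVM classifier
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```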

3. Related Work

This section covers approximately 30 articles, some dependent on and some independent of each other. Each article is summarized briefly.

In this study by Febin et al., since the temporal derivative is fast but gives a very low accuracy rate compared to optical flow alone, a violence detection method based on motion boundary SIFT (MoBSIFT) was used. In this method, the MoSIFT descriptor was improved in terms of both accuracy and complexity by adding a motion boundary histogram (MBH) and a movement filtering algorithm. KTH, Weizmann, and IXMAS, popular datasets in action recognition, were used (Febin, Jayasree, & Joy, 2019).

In this study by Song et al., three-dimensional Convolutional Neural Networks (3D ConvNets) were used for action recognition, and a new scheme with preprocessing was prepared to make the 3D ConvNet method more effective. It was examined with respect to video lengths and patterns. The experimental results reached 99.62% accuracy on hockey fights, 99.97% on movies, and 94.3% on crowd violence (Song et al., 2019).
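The sketch below shows, in PyTorch, the basic shape of a 3D ConvNet: kernels that convolve over time as well as space, so a whole clip is classified at once. The layer sizes are illustrative assumptions, not the modified architecture of Song et al.

```python
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # spatio-temporal kernel
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),                 # halves frames, height, width
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)     # violent / non-violent

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip).flatten(1))

model = Tiny3DConvNet()
dummy_clip = torch.randn(1, 3, 16, 112, 112)  # one 16-frame RGB clip
print(model(dummy_clip).shape)                # torch.Size([1, 2])
```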

In this project by Mahmoodi & Salajeghe, classification algorithms are used to detect violence. The Histogram of Optical flow Magnitude and Orientation (HOMO) method was proposed to improve on the shortcomings of existing approaches. In preprocessing, the input frames are first converted to grayscale, then the optical flow magnitude and orientation of each pixel are compared. According to the threshold values of the compared pixels, six binary indicators are formed and classified with a Support Vector Machine in MATLAB (Mahmoodi & Salajeghe, 2019).
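A hedged sketch of the general idea behind such a descriptor follows: convert frames to grayscale, compute per-pixel flow magnitude and orientation, and bin them into a joint histogram that a classifier can consume. The binning here is an illustrative assumption, not the exact six-indicator construction of HOMO.

```python
import cv2
import numpy as np

def flow_histogram(prev_bgr, next_bgr, mag_bins=8, ori_bins=8):
    """Joint magnitude/orientation histogram of dense optical flow."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
    hist, _, _ = np.histogram2d(
        mag.ravel(), ang.ravel(),
        bins=[mag_bins, ori_bins],
        range=[[0, mag.max() + 1e-6], [0, 2 * np.pi]])
    return (hist / hist.sum()).ravel()  # normalised feature vector for an SVM
```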

This article by Chen et al. aims to reduce the effects of violent movie scenes on children, dividing the problem into violence perception, action scene perception, and bloody frame perception. The movie was divided into scenes, and Support Vector Machines were used for classification. Face, blood, and motion information served as distinctive features for identifying scenes with violent content. The experimental results showed the method to be successful (Chen, Su, & Hsu, 2011).

This article by Lloyd et al. notes that the human factor is insufficient for perceiving acts of violence and proposes an automatic method that detects violent content with computer vision techniques. Since acts of violence are more common in crowds, visual texture measures were used for detection. This dynamics-dependent approach, Violent Crowd Texture, reaches ROC (Receiver Operating Characteristic) values of 0.98 and 0.91 on real-world and violent-flow data, respectively. The KTH and Weizmann datasets, common in action recognition, are also used (Lloyd, Rosin, Marshall, & Moore, 2016).

In this study by Khan et al., which aims to detect inappropriate violent and magic-related content in the cartoons children watch continuously, an application is proposed that filters certain frames of the video using image processing techniques. The dataset used for the experiments consists of approximately 100 videos, both violent and non-violent (Khan, Tahir, & Ahmed, 2018).

In this study by Mu et al., unlike previous violent scene detection studies, the focus was on the idea that acoustic cues can play an important role in detecting violent acts. After preprocessing the videos, CNN and SVM classifiers were compared, and CNN was observed to give better results than SVM. The MediaEval 2015 dataset is used in this study (Mu, Cao, & Jin, 2016).

This article by Demarty et al. examines Hollywood scenes for the detection of violent content. The aim is to minimize the impact on children of the violence that occurs in the majority of Hollywood movies. The widely used MediaEval 2013 dataset is employed in this project (Demarty et al., 2014).

In this article by Gu et al., a new model for violence detection is proposed that captures the semantic correspondence between visual and auditory data from the same video. Neural networks are used to extract motion, audio, and image features. Competitive results were obtained on the Violent Scene Detection 2015 (VSD) dataset and on the Violence Correspondence Detection dataset the authors created (Gu, Wu, & Wang, 2020).

In this study by Accattoli et al., the aim is to reduce the error rate in violence detection with software that is smarter than the human factor in video surveillance. To achieve this, a solution based on 3D Convolutional Neural Networks is proposed for detecting aggressive movements. The Hockey Fight, Crowd Violence, and Movie Violence datasets used in most other studies gave positive results (Accattoli, Sernani, Falcionelli, Mekuria, & Dragoni, 2020).

In this article by Soliman et al., an LSTM on top of VGG-16, a model pre-trained on ImageNet, is used as a neural network over the video's RGB frames to detect violent images. A Real-Life Violence Situations (RLVS) dataset of 2000 videos was also created to improve the results of the proposed model and to reduce the disadvantages of previously used datasets (Soliman et al., 2019).

In this project by Bermejo et al., MoSIFT, an extension of the SIFT image descriptor, was used for video recognition: it extracts histograms of oriented gradients as in standard SIFT and was adopted because it outperforms the alternatives. A Bag of Words (BoW) approach was also preferred; it represents each video sequence as a histogram over visual words. This method was observed to detect violent content in fight sequences with about 90% accuracy. Datasets such as KTH, INRIA IXMAS, and CAVIAR are used for actions such as walking and running as well as acts of violence. The project also used a 1000-video hockey dataset from the National Hockey League (NHL) (Bermejo, Deniz, Bueno, & Sukthankar, 2011).
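A minimal bag-of-words sketch follows: cluster local descriptors into a visual vocabulary with k-means, then describe each video as a histogram of its nearest visual words. Random vectors stand in for MoSIFT descriptors, and the vocabulary size is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
train_descriptors = rng.normal(size=(5000, 128))  # stand-ins for MoSIFT descriptors

k = 100  # vocabulary size (illustrative)
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)

def bow_histogram(video_descriptors):
    """Histogram over visual words for one video's local descriptors."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

one_video = rng.normal(size=(300, 128))  # descriptors extracted from one clip
print(bow_histogram(one_video).shape)    # (100,) — fixed-length video representation
```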

In this study by Xu et al., the MoSIFT algorithm is used, with Kernel Density Estimation (KDE) added to increase the success rate and better distinguish the videos. SVM was used together with the sparse coding method. The Hockey and Crowd Violence datasets were used, and the method achieved nearly 94% accuracy (Xu, Gong, Yang, Wu, & Yao, 2014).

In this study by Gao et al., Oriented Violent Flows (OViF), based on horizontal and vertical directions, is used as an action descriptor to address the problem of human action recognition. Violent flows were preferred to obtain densely sampled points, and these dense trajectories were then used to compute local descriptors for action recognition. Violent Flow can capture the consistent movement of moving objects with optical flow histograms based on horizontal and vertical directions, which is a good feature for motion detection and tracking and a widely used method. The Hockey Fight and Violent Flows datasets were among those used, and a success rate of 87% was achieved on them (Gao, Liu, Sun, Wang, & Liu, 2016).

This article by Sudhakaran & Lanz aims to classify violent content in videos with an end-to-end trainable neural network. Better results were obtained with fewer parameters, and the authors report outperforming models trained with other deep neural networks. Convolutional neural networks (CNN) combined with a convolutional LSTM (ConvLSTM) are used for this. The methods were tested on three main datasets, and on hockey fights the approach was observed to be more efficient than a plain LSTM (Sudhakaran & Lanz, 2017).
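The sketch below shows the core of a ConvLSTM in PyTorch: the LSTM gates are computed with convolutions, so the recurrent state keeps its spatial layout and can accumulate localized motion over the clip. Channel counts and kernel size are illustrative assumptions, not the configuration of Sudhakaran & Lanz.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        # A single convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=3, hid_ch=16)
h = torch.zeros(1, 16, 64, 64)   # spatial hidden state
c = torch.zeros(1, 16, 64, 64)   # spatial cell state
for frame in torch.randn(8, 1, 3, 64, 64):  # an 8-frame clip
    h, c = cell(frame, (h, c))
print(h.shape)  # torch.Size([1, 16, 64, 64]) — spatial memory of the whole clip
```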

In this study by Zhou et al., the state-of-the-art Temporal Segment Network (TSN), which can model long-range temporal structure over the entire video, is used for violent interaction detection. Based on the TSN, a network called FightNet was created. For training the ConvNet models, violent interaction videos were collected into a dataset named the Violent Interaction Dataset (VID): 1000 hockey video clips, 500 of which are violent, plus 200 video samples taken from films (Zhou, Ding, Luo, & Hou, 2017).

In this study by Kaya & Keçeli, the optical flows of the videos were computed with the Lucas-Kanade method. A 2D template was then created from the stacked optical flow magnitudes and orientations, and these templates were fed to a pre-trained CNN. Cubic-kernel Support Vector Machine and K-nearest neighbor classifiers were tested on three different datasets using the extracted features and the proposed method. Both crowded and non-crowded datasets were preferred (Kaya & Keçeli, 2017).
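The sketch below illustrates the Lucas-Kanade step with OpenCV's pyramidal implementation: detect corner features in one frame and track them into the next. The video path is a placeholder, and the corner-detection parameters are illustrative.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")  # placeholder input video
ok1, f1 = cap.read()
ok2, f2 = cap.read()
if not (ok1 and ok2):
    raise RuntimeError("could not read two frames")

g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY)

# Corners worth tracking, then their estimated positions in the next frame.
p0 = cv2.goodFeaturesToTrack(g1, maxCorners=200, qualityLevel=0.01, minDistance=7)
p1, status, _err = cv2.calcOpticalFlowPyrLK(g1, g2, p0, None)

displacements = (p1 - p0)[status.ravel() == 1]
print("tracked points:", len(displacements),
      "mean speed:", np.linalg.norm(displacements, axis=-1).mean())
```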

In this project by Dai et al., a subset of the ImageNet classes and a CNN model were used, with the CNN framework extracting features from static frames and from motion optical flow. This was extended with Long Short-Term Memory models, and SVM is used as the classifier (Dai et al., 2015).

This study by Zajdel et al. aimed to detect violent behavior in public places using audio and video. For this, violent scenarios were acted out at a train station. With a Dynamic Bayesian Network, the visual and audio events were combined to produce an aggregate aggression indicator. Thirteen violence videos with sound were used as the dataset (Zajdel, Krijnders, Andringa, & Gavrila, 2007).

In this project by Lin & Wang, sound and image are classified separately with the aim of improving violence detection. The evaluation has two parts, audio and video. The audio part is classified as violent or non-violent, while the image part is classified with a model that combines violent events such as motion, flame, explosion, and blood. pLSA was used for the audio evaluation and macroblock (MB) classification for the image evaluation. Five films containing violence were used as the dataset (Lin & Wang, 2009).

In this article by Bilinski & Bremond, a study was conducted to recognize and localize violence in surveillance videos; the purpose is to determine the existence and the time of violence. The study proposes the Improved Fisher Vectors (IFV) approach, in which videos are represented using both local features and their spatio-temporal positions. The popular sliding window approach is then combined with the IFV, and the Improved Fisher Vectors are reformulated accordingly. A summed-area table data structure is used to evaluate the approach in a shorter time. The study shows that, with these improvements, violence can be detected more accurately and faster (Bilinski & Bremond, 2016).
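The summed-area table trick mentioned above is easy to show in isolation: after one pass of cumulative sums, the total of any sliding window costs four lookups instead of a fresh summation. A minimal sketch:

```python
import numpy as np

def summed_area_table(a):
    # Pad with a zero row and column so window queries need no edge cases.
    return np.pad(a.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def window_sum(sat, top, left, height, width):
    b, r = top + height, left + width
    return sat[b, r] - sat[top, r] - sat[b, left] + sat[top, left]

scores = np.arange(12.0).reshape(3, 4)  # e.g. per-cell feature sums
sat = summed_area_table(scores)
print(window_sum(sat, 1, 1, 2, 2))      # 30.0 == scores[1:3, 1:3].sum()
```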

This article by Acar et al. studies violent videos, with an emphasis on time efficiency and on modeling concepts. The widely accepted approaches to detecting the violent parts of video content model a single concept and lean on auditory or visual features in the feature space. However, if violence is not placed compactly in the feature space, such modeling cannot faithfully represent it in terms of audio-visual features. To address this shortcoming, the article proposes to model violence through multiple (sub)concepts using audio-visual features (MFCC-based audio and advanced motion features). The analysis starts at a coarse level with time-efficient audio features, to avoid the heavy computation caused by motion features, and continues with more advanced features when needed (Acar, Hopfgartner, & Albayrak, 2016).

This study by Giannakopoulos et al. aims to detect violent movie scenes with a multi-step approach using fusion methodologies. Automated audio and visual processing and analysis first produce probabilistic measures on a dataset of 10 films; then a meta-classification architecture that combines the audiovisual information classifies each segment as violent or non-violent (Giannakopoulos, Makris, Kosmopoulos, Perantonis, & Theodoridis, 2010).

In this study by Giannakopoulos et al., the problem of violence detection is addressed using several popular frame-level audio features from the time and frequency domains. A Support Vector Machine classifier that decides on violence is fed the computed feature statistics as input. The results demonstrate the feasibility of the approach, with room for better performance (Giannakopoulos, Kosmopoulos, Aristidou, & Theodoridis, 2006).

In this study by de Souza et al., an assessment of the importance of local spatio-temporal features is presented using a cross-validation method. A violence detector is built with linear Support Vector Machines on top of visual codebooks. The results show that motion models matter far more for distinguishing violence from normal motion than visual descriptors based solely on the spatial domain (de Souza, Chavez, do Valle Jr., & de A. Araujo, 2010).

This study by Hassner et al. aims to detect violent behavior in surveillance video in real time. Statistics collected over short frame sequences, based on how the magnitudes of flow vectors change over time, are captured by the Violent Flows (ViF) descriptor. Using a linear SVM, ViF descriptors are classified as violent or non-violent. The work also establishes standard benchmarks designed to test the accuracy of both violence classification and its temporal detection (Hassner, Itcher, & Kliper-Gross, 2012).
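A hedged sketch of the ViF idea follows: compute flow magnitudes for consecutive frame pairs, mark pixels whose magnitude changed significantly between pairs, average those binary maps over the clip, and pool the result into a grid to get a fixed-length vector for a linear SVM. The change threshold and grid size here are illustrative assumptions.

```python
import cv2
import numpy as np

def vif_like_descriptor(gray_frames, grid=4):
    """ViF-style descriptor from a list of grayscale frames (>= 3 frames)."""
    mags = []
    for a, b in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(a, b, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(cv2.cartToPolar(flow[..., 0], flow[..., 1])[0])

    # Binary "significant change" maps between consecutive flow fields.
    change = [np.abs(m2 - m1) > np.abs(m2 - m1).mean()
              for m1, m2 in zip(mags, mags[1:])]
    mean_map = np.mean(change, axis=0)  # how often each pixel changed

    # Pool into grid cells to obtain a fixed-length vector for a linear SVM.
    h, w = mean_map.shape
    return np.array([mean_map[i * h // grid:(i + 1) * h // grid,
                              j * w // grid:(j + 1) * w // grid].mean()
                     for i in range(grid) for j in range(grid)])
```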

In this study by Gong et al., violence is detected in three stages. First, potential violent shots are identified according to universal film-making rules; a semi-supervised cross-feature learning (SCFL) technique is used, which can exploit unlabeled data to combine different types of features and improve classification performance. Second, typical violence-related sound effects are detected, and the classification of the different audio events yields a shot-level violence score. Finally, the outputs of the first two stages are integrated to form the final inference (Gong, Wang, Jiang, Huang, & Gao, 2008).

In this study by Goto & Aoki, a new approach to violent scene detection is put forward using machine learning of visual and auditory features. Multiple kernel learning is used to make the most of the videos' modalities, and implicit learning is applied in this context. A mid-level Violence Clustering method is proposed, and the study adopts a shot-level violence score calculation approach (Goto & Aoki, 2014).

In this study by Datta et al., aimed at detecting human violence, the motion trajectory and the orientation information of a person's limbs are used. The direction and magnitude of motion are combined to define an Acceleration Measure Vector (AMV), and jerk is defined as the temporal derivative of the AMV (Datta, Shah, & Lobo, 2002).

This project by Ye et al. aims to detect school violence. Moving targets in the foreground were detected with the KNN (K-Nearest Neighbor) method, then processed with morphological operations. Using rectangular frames around the moving targets, optical flow features were extracted, chosen to maximize the differences between daily-life and violent activities. Relief-F and wrapper algorithms were used to reduce the feature dimensionality, and SVM was applied as the classifier. To improve performance, a two-layer DT-SVM (Decision Tree-SVM) classifier was developed, raising the accuracy to 97.6%. Daily-life activities, including school violence and campus sports, were used as the dataset (Ye, Wang, Ferdinando, Seppanen, & Alasaarela, 2018).
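As a hedged sketch of the foreground-extraction step, the example below uses OpenCV's KNN-based background subtractor to find moving regions, cleans the mask with morphological opening, and draws rectangular frames around the moving targets. It stands in for, rather than reproduces, the exact procedure of Ye et al.; the video path and area threshold are placeholders.

```python
import cv2

cap = cv2.VideoCapture("campus.mp4")  # placeholder input video
subtractor = cv2.createBackgroundSubtractorKNN(detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Morphological opening removes small noise blobs from the foreground mask.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Rectangular frames around sufficiently large moving targets; optical
    # flow features would then be computed inside these boxes.
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
    print("moving targets in frame:", len(boxes))
```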

4. Conclusion

In conclusion, across the research and articles we examined, we found that work on action recognition and computer vision, and the algorithms and datasets used in this context, often share common ground. This literature review, kept broad in both the number of articles and the scope of our research field, has served as a preliminary study of which methods will be most usable for us and which datasets will be most efficient.

5. References

Acar, E., Hopfgartner, F., & Albayrak, Ş. (2016). Breaking Down Violence Detection: Combining Divide-et-Impera and Coarse-to-Fine Strategies. Neurocomputing, pp. 225-237.

Accattoli, S., Sernani, P., Falcionelli, N., Mekuria, D. N., & Dragoni, A. F. (2020). Violence Detection in Videos by Combining 3D Convolutional Neural Networks and Support Vector Machines. Applied Artificial Intelligence, pp. 1-16. doi:10.1080/08839514.2020.1723876

Bermejo, E., Deniz, O., Bueno, G., & Sukthankar, R. (2011). Violence Detection in Video Using Computer Vision Techniques. In CAIP, pp. 332-339.

Bilinski, P., & Bremond, F. (2016). Human Violence Recognition and Detection in Surveillance Videos.

Chen, L. H., Su, C. W., & Hsu, H. W. (2011). Violent Scene Detection in Movies. International Journal of Pattern Recognition and Artificial Intelligence, 25(8), pp. 1161-1172. doi:10.1142/S0218001411009056

Dai, Q., Zhao, R.-W., Wu, Z., Wang, X., Gu, Z., Wu, W., & Jiang, Y.-G. (2015). Detecting Violent Scenes and Affective Impact in Movies with Deep Learning. Fudan-Huawei at MediaEval.

Datta, A., Shah, M., & Lobo, N. D. (2002). Person-on-Person Violence Detection in Video Data. Proceedings of the 16th International Conference on Pattern Recognition (ICPR), pp. 433-438.

de Souza, F. D., Chavez, G. C., do Valle Jr., E. A., & de A. Araujo, A. (2010). Violence Detection in Video Using Spatio-Temporal Features. SIBGRAPI Conference on Graphics, Patterns and Images.

Demarty, C.-H., Ionescu, B., Jiang, Y.-G., Quang, V. L., Schedl, M., & Penet, C. (2014). Benchmarking Violent Scenes Detection in Movies. 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI). doi:10.1109/cbmi.2014.6849827

Febin, I. P., Jayasree, K., & Joy, P. T. (2019). Violence detection in videos for an intelligent surveillance system using MoBSIFT and movement filtering algorithm. Pattern Analysis and Applications. doi:10.1007/s10044-019-00821-3

Gao, Y., Liu, H., Sun, X., Wang, C., & Liu, Y. (2016). Violence detection using Oriented VIolent Flows. Image and Vision Computing.

Giannakopoulos, T., Kosmopoulos, D., Aristidou, A., & Theodoridis, S. (2006). Violence Content Classification Using Audio Features. Lecture Notes in Computer Science, pp. 502-507.

Giannakopoulos, T., Makris, A., Kosmopoulos, D., Perantonis, S., & Theodoridis, S. (2010). Audio-Visual Fusion for Detecting Violent Scenes in Videos. Lecture Notes in Computer Science, pp. 91-100.

Gong, Y., Wang, W., Jiang, S., Huang, Q., & Gao, W. (2008). Detecting Violent Scenes in Movies by Auditory and Visual Cues. Lecture Notes in Computer Science, pp. 317-326.

Goto, S., & Aoki, T. (2014). Violent Scenes Detection Using Mid-Level Violence Clustering. Computer Science & Information Technology (CS & IT), pp. 283-296.

Gu, C., Wu, X., & Wang, S. (2020). Violent Video Detection Based on Semantic Correspondence. IEEE Access. doi:10.1109/ACCESS.2020.2992617

Hassner, T., Itcher, Y., & Kliper-Gross, O. (2012). Violent Flows: Real-Time Detection of Violent Crowd Behavior. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

Kaya, A., & Keçeli, A. S. (2017). Violent activity detection with transfer learning method. Electronics Letters, 53(15), pp. 1047-1048.

Khan, M., Tahir, M. A., & Ahmed, Z. (2018). Detection of Violent Content in Cartoon Videos using Multimedia Content Detection Techniques. 2018 IEEE 21st International Multi-Topic Conference (INMIC).

Lin, J., & Wang, W. (2009). Weakly-Supervised Violence Detection in Movies with Audio and Video Based Co-training. Advances in Multimedia Information Processing - PCM, pp. 930-935.

Lloyd, K., Rosin, P. L., Marshall, D., & Moore, S. C. (2016). Detecting Violent Crowds using Temporal Analysis of GLCM Texture. arXiv preprint.

Mahmoodi, J., & Salajeghe, A. (2019). A classification method based on optical flow for violence detection. Expert Systems with Applications, 127, pp. 121-127.

Mu, G., Cao, H., & Jin, Q. (2016). Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features. Pattern Recognition, pp. 451-463. doi:10.1007/978-981-10-3005-5_37

Soliman, M. M., Kamal, M. H., El-Massih Nashed, M. A., Mostafa, Y. M., Chawky, B. S., & Khattab, D. (2019). Violence Recognition from Videos using Deep Learning Techniques. 2019 IEEE Ninth International Conference on Intelligent Computing and Information Systems (ICICIS). doi:10.1109/ICICIS46948.2019.9014714

Song, W., Zhang, D., Zhao, X., Yu, J., Zheng, R., & Wang, A. (2019). A Novel Violent Video Detection Scheme Based on Modified 3D Convolutional Neural Networks. IEEE Access, pp. 39172-39179.

Sudhakaran, S., & Lanz, O. (2017). Learning to Detect Violent Videos using Convolutional Long Short-Term Memory. In Proc. 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

Xu, L., Gong, C., Yang, J., Wu, Q., & Yao, L. (2014). Violent Video Detection Based on MoSIFT Feature and Sparse Coding. 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3538-3542.

Ye, L., Wang, L., Ferdinando, H., Seppanen, T., & Alasaarela, E. (2018). A Video-Based DT–SVM School Violence Detecting Algorithm. Sensors, 20(7).

Zajdel, W., Krijnders, J. D., Andringa, T., & Gavrila, D. M. (2007). CASSANDRA: Audio-video sensor fusion for aggression detection. IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 200-205.

Zhou, P., Ding, Q., Luo, H., & Hou, X. (2017). Violent Interaction Detection in Video Based on Deep Learning. Journal of Physics: Conference Series.