Week 10 (W4 Jan25) Crimes in the UK - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Note: For unknown abbreviations and terms, please consult the glossary. If anything is missing, create an issue or write us an email and we will add it. A list of our data attributes can be found here.
Content
- Predicting Number of Crimes
- Outlier detection
- Update for the UK Police
- Preparing the data for stop and search predictions
Predicting Number of Crimes
For our first prediction goal, predicting the number of crimes between 09/2011 and 09/2016, we ran several of the prediction algorithms available in RapidMiner with different settings and parameters, and then evaluated the performance of each model.
Our general RapidMiner process looks as follows:
- Retrieve data: We worked with both the MSOAs (~7200 tuples) and the LSOAs (~35,000 tuples); the data were retrieved from CSV files.
- Select Attributes: Three types of attributes were used:
  - The attribute to be predicted: number_crimes or number_robbery or number_bulgary ...
  - Demographic data: 150 attributes for the MSOAs and 100 attributes for the LSOAs, selected by a feature selection algorithm.
  - Number of points of interest: the total number as well as the number for each of 36 categories.
- Set Role: We set the attribute to be predicted as the label and "msoa_code"/"lsoa_code" as the ID.
- Split Data: We split the data into a training set (70%) and a test set (30%).
- Cross Validation: See below.
- Apply Model: This operator takes the test set and uses the learned model to predict the values.
- Generate Attributes: We computed the squared error for every tuple in the test set.
- Performance: It computes various performance measures.
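The steps listed above can be approximated outside RapidMiner. The following is a minimal sketch in Python with scikit-learn, not our actual process: the data are synthetic stand-ins for the MSOA table, Linear Regression stands in for whichever learner is being tested, and the 5 folds are an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Retrieve data: synthetic stand-in for the MSOA table (assumption; the real
# data come from CSV files and have far more attributes).
rng = np.random.default_rng(0)
n_areas, n_features = 200, 5
X = rng.normal(size=(n_areas, n_features))          # demographic + POI attributes
true_weights = np.array([3.0, -2.0, 0.5, 0.0, 1.0])  # hypothetical relationship
y = X @ true_weights + rng.normal(scale=0.1, size=n_areas)  # e.g. number_crimes

# Split Data: 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Cross Validation: runs on the training set only (5 folds is an assumption).
model = LinearRegression()
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -cv_scores.mean()

# Apply Model: fit on the whole training set, then predict the test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Generate Attributes: squared error for every tuple in the test set.
squared_error = (y_test - y_pred) ** 2

# Performance: report the cross-validated and the held-out error.
print(f"cross-validated RMSE: {cv_rmse:.3f}")
print(f"test RMSE: {np.sqrt(squared_error.mean()):.3f}")
```

The per-tuple `squared_error` array is what a map visualisation of mispredictions would be coloured by.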
Inside the Cross Validation operator, one side contains the training operator...
... and the other side contains the operators that measure the performance for a validation iteration:
We worked with the Deep Learning, Linear Regression, Polynomial Regression and SVM operators; after testing different parameters, we found that Deep Learning on the MSOA data performs best.
To look for patterns in the mispredictions, we inspected the data manually with a map visualisation based on the squared error computed by the Generate Attributes operator. In the following map, the gray areas represent the data used for training and the remaining areas the data used for testing. The redder an area on the map, the worse the prediction. Unfortunately, we could not find any pattern.
For the model, the most important variables are:
- Number of cars in households with the household type "Other"
- Number of public service buildings
- Number of shopping supermarkets
- Number of ATMs
- Number of other shopping facilities
- Number of males between 35 and 49 who have no cars in the household.
In total, we got the following performance result:
- Root mean squared error: 1424
- Root relative squared error: 37.3%
- Correlation (between the predicted and actual value): 93.5%
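For readers unfamiliar with these three measures, here is a small sketch of how they can be computed. It assumes the usual definitions (root relative squared error compares the model against always predicting the mean of the actual values); the function name `regression_metrics` and the sample numbers are purely illustrative and are not taken from our data.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return (RMSE, root relative squared error, Pearson correlation)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Sum of squared errors of the model.
    residual_ss = np.sum((predicted - actual) ** 2)
    # Sum of squared errors of a baseline that always predicts the mean.
    baseline_ss = np.sum((actual - actual.mean()) ** 2)
    rmse = np.sqrt(residual_ss / actual.size)
    rrse = np.sqrt(residual_ss / baseline_ss)
    correlation = np.corrcoef(actual, predicted)[0, 1]
    return rmse, rrse, correlation

# Illustrative values only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
rmse, rrse, correlation = regression_metrics(actual, predicted)
print(f"RMSE={rmse:.3f}  RRSE={rrse:.1%}  correlation={correlation:.1%}")
```

A root relative squared error below 100% means the model beats the mean-only baseline; the absolute RMSE has to be read against the scale of the predicted attribute.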
We also ran the prediction for every crime_type; the performance of all of those predictions was quite similar.
Violence and sexual offences
Criminal damage and arson
Outlier detection
We used the final table, with MSOAs as rows and demographic structure and crime data as columns, as the training set for our second prediction goal. The goal is to detect areas in the UK whose criminal behavior is anomalous given their demographic structure.
Since we are not sure whether we have a clean dataset that represents the population of normal observations, we chose an unsupervised “outlier” detection approach rather than a “novelty” detection approach. One possible algorithm for this is “OneClassSVM”. As seen in the image below, OneClassSVM estimates the support of a high-dimensional distribution: for a given dataset it learns a soft boundary around the bulk of the data, and the resulting decision function separates observations that are similar to the rest from those that are different.
Please note that this image is used here only to visually explain the OneClassSVM. It doesn't represent our own dataset. Here is the source of the image (http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM)
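As a rough illustration of how OneClassSVM is used for unsupervised outlier detection in scikit-learn, consider the following sketch. The data are synthetic, and the scaling step and the values of `kernel`, `gamma` and `nu` are assumptions for the example, not the settings of our run.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the MSOA table: mostly "typical" rows plus a few
# rows scattered over a much wider range (assumption, not our data).
rng = np.random.default_rng(0)
typical = rng.normal(loc=0.0, scale=1.0, size=(300, 4))
unusual = rng.uniform(low=-8.0, high=8.0, size=(10, 4))
features = np.vstack([typical, unusual])

# RBF kernels are sensitive to feature scale, so standardise first.
scaled = StandardScaler().fit_transform(features)

# nu roughly bounds the share of rows treated as outliers (0.05 is assumed).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = detector.fit_predict(scaled)        # +1 = inlier, -1 = outlier
scores = detector.decision_function(scaled)  # lower = further outside the boundary

outlier_rows = np.flatnonzero(labels == -1)
print(f"{len(outlier_rows)} of {len(features)} rows flagged as outliers")
```

The detector only returns a label and a score per row; it does not say which attributes made a row abnormal.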
OneClassSVM flagged 673 MSOAs as abnormal. This confronted us with the question: in which respect are those 673 MSOAs abnormal? Unfortunately, the library does not provide methods for answering this question. We therefore plan to take the following approach:
- Cluster the MSOAs (including the detected outliers) based only on the demographic structure, without crime data. → Gain knowledge about which MSOAs are demographically similar to the outliers. This serves as the basis for comparison in a later step.
- After clustering, add the crime data to the MSOAs and calculate the mean values of the crime data for each cluster. This results in one representative vector of criminal behavior per cluster.
- For each MSOA classified as abnormal, compare its crime data against the mean vector of its cluster.
By following this approach, we believe that we can determine what makes an outlier an outlier. In other words, we can understand which features contributed most to the 673 MSOAs being classified as outliers. We will implement this approach next week, after getting our supervisor's feedback this week.
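The three planned steps could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation: KMeans with 4 clusters is one possible clustering choice, the data are synthetic, the outlier flags are set by hand, and the two crime columns are only examples.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins (assumptions): 120 areas, 3 demographic attributes,
# 2 crime attributes, and hand-picked outlier flags.
rng = np.random.default_rng(0)
n_areas, n_clusters = 120, 4
demographics = rng.normal(size=(n_areas, 3))
crime = rng.poisson(lam=20, size=(n_areas, 2)).astype(float)
crime_names = ["number_robbery", "number_crimes"]
is_outlier = np.zeros(n_areas, dtype=bool)
is_outlier[[5, 17]] = True
crime[5, 0] = 200.0  # planted deviation so the example has something to find

# Step 1: cluster on the demographic structure only, without crime data.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
cluster_of = kmeans.fit_predict(demographics)

# Step 2: one mean crime vector per cluster (outliers are included in the
# mean here, as in the description above).
cluster_means = np.vstack(
    [crime[cluster_of == c].mean(axis=0) for c in range(n_clusters)]
)

# Step 3: compare each outlier's crime data with the mean of its cluster
# and record the attribute with the largest absolute deviation.
top_feature = {}
for area in np.flatnonzero(is_outlier):
    deviation = crime[area] - cluster_means[cluster_of[area]]
    top_feature[int(area)] = crime_names[int(np.argmax(np.abs(deviation)))]
    print(area, dict(zip(crime_names, np.round(deviation, 1))))
```

Raw differences favour attributes with large counts; dividing each deviation by the cluster's standard deviation would be one way to make attributes comparable.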
Update for the UK Police
As mentioned in the previous wiki entry, last week we collected all findings of the descriptive statistics part of the lab in one centralized PDF document. We thought it would be a good idea to share these findings with the UK police and have therefore sent them this document. In the same mail, we informed them about our prediction goals and told them that in 2 weeks we will send a second document with our findings on those goals. We will post their reply on the wiki as soon as we receive one.
Preparing the data for stop and search predictions
Our third prediction goal is to determine the probability that a person (of a specific gender, in a specific age range, and with a specific ethnicity) will be stopped and searched in an area (on a specific weekday). An example of such a prediction would be the probability that a 27-year-old Asian male is stopped in the MSOA "E02002139".
For this prediction, we have created a new table which contains stop and search data and demographic data per MSOA. The initial stop and search table did not contain the MSOA in which each stop and search took place. Therefore, we computed the MSOA of each stop and search from the given geolocations.
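Mapping a geolocation to an MSOA is a point-in-polygon lookup. The sketch below shows the idea with a plain ray-casting test; it is not the code we used. The area codes, the square boundaries and the helper names `point_in_polygon` and `assign_msoa` are hypothetical, and real MSOA boundaries are more complex shapes that a GIS library would handle more robustly.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count crossings of a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_msoa(lon, lat, boundaries):
    """Return the code of the first area containing the point, else None."""
    for code, polygon in boundaries.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None

# Hypothetical boundaries: two adjacent unit squares as (lon, lat) vertices.
boundaries = {
    "AREA_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "AREA_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

# Hypothetical stop and search locations as (lon, lat).
stops = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]
codes = [assign_msoa(lon, lat, boundaries) for lon, lat in stops]
print(codes)
```

Points that fall outside every boundary get `None`, so records with missing or imprecise geolocations can be filtered out before building the prediction table.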
Last steps
- Finish the stop and search prediction table.
- Run the prediction algorithms on the dataset.