Dataset "Crime Reports in the UK" - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Dataset / Crime in England, Wales and Northern Ireland
- Proposer: Miguel Sandim - @msandim
- Votes: @paulafortuna, @kapoorabhishek24, @ishaanraj, @winduprasetya, @carockets
- Team members:
  - @chaoran-chen
  - @ssoima
  - @aserhany
Summary
This dataset includes crime reports from the UK police between 2010 and 2016. The available information includes the location of each crime, its category, and its last known outcome.
Weekly Progress
- Week 01 (W46-Nov16) Crimes in the UK -- Main findings:
  - Anti-social behaviour has been the dominant crime type in the UK since 2010. Males are stopped and searched 11 times as often as females.
  - Stored the raw data in a database using a Python script. Joined the “Streets” and “Outcomes” tables in the database.
  - Discovered opportunities for data enrichment: i) aggregated neighborhood statistics using LSOAs and ii) integration of points of interest (POIs) within a specific range of crimes.
- Week 02 (W47-Nov23) Crimes in the UK -- Main findings:
  - The number of points of interest and of police stops near crimes differs significantly across crime types and crime outcomes. For example, thefts from the person mostly happen at places with many points of interest.
  - The given location data are anonymised. To find out the degree of inaccuracy and to assess its effect on our approach, we are in contact with the UK police.
- Week 03 (W48-Nov30) Crimes in the UK -- Main findings:
  - Crime rate at the regional level: London (73.76%), North East (69.70%), and Yorkshire and the Humber (67.13%) are considered to be the regions with the highest crime rates in the UK.
  - First approach to predicting crime outcome types in the city of London using a random forest classifier. The weighted F1-score of the prediction model was 0.76478.
- Week 05 (W50-Dec14) Crimes in the UK -- Main findings:
  - This first experiment in predicting crime outcomes with a random forest showed no correlation between the current input features, at least in their current form, and the crime outcome. This confirms our impression that the current data lack the information about crimes that is essential for predicting outcomes (perpetrator’s and victim’s age, ethnicity, criminal and psychological history, witness statements, etc.).
  - Approached the data analysis from a new angle, namely a location-based one, with the help of the newly included demographic data (174 features) about the 120 local authorities in the UK.
  - Created i) a correlation matrix of the location-based aggregated crime data and ii) maps of the distribution of crime types across local authority areas in the UK. We cannot yet draw direct conclusions from these two documents, but they will serve the new location-based prediction goals.
- Week 07 (W51-Dec28) Crimes in the UK -- Prediction goals (more details in the wiki entry):
  - Predict the effects of demographic structure on criminal behaviour.
  - Detect anomalous areas with respect to criminal behaviour based on demographic structure.
  - Predict the probability of a person with a specific background getting stopped and searched in a certain area on a specific weekday.
  - Qualitative assessment of police forces with respect to solving crimes of different types.
  - Conducted descriptive statistics on the crime data, location accuracy, and demographic influences.
  - Included two new datasets about i) police forces and ii) the demographic structure of LSOAs and MSOAs.
  - Measured the inaccuracy rate of crime locations, illustrated in the dark maps.
  - We have computed the points of interest, stop and searches, and crimes in LSOAs and MSOAs as preparation for the coming predictive tasks.
  - We selected 150 demographic features from a list of over 130,000 using scikit-learn.
  - Second prediction goal: 673 MSOAs were classified as outliers with respect to their criminal behaviour, based on the demographic data and the points of interest. We conceptualized an approach for understanding the factors that make those 673 MSOAs outliers. It will be implemented next week after getting the supervisor’s feedback.
  - We have sent our descriptive statistics PDF document to the UK police in order to share our findings from the first half of the lab course.
  - We compared the probability of people in different age ranges and of various ethnicities being stopped and searched.
  - We implemented the concept for understanding the factors that played the biggest role in defining the MSOA outliers. Those are: "Total crime types", "Number of Anti-Social behaviour", "Number of other theft", "Number of shoplifting", and "Number Robbery".
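The weighted F1-score reported for the Week 03 random forest can be read as the per-class F1 scores averaged with each class weighted by its number of true samples. The notes do not say how the score was computed, so the following is only a sketch of that definition in plain Python (it matches the usual meaning of scikit-learn's `f1_score(..., average="weighted")`). The function name `weighted_f1` and the sample labels are invented for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Invented outcome labels, not taken from the real dataset.
y_true = ["no suspect", "no suspect", "no suspect", "charged"]
y_pred = ["no suspect", "no suspect", "charged", "charged"]
example_score = weighted_f1(y_true, y_pred)
print(round(example_score, 5))
```

Because the weights follow the class frequencies, a frequent outcome category dominates this score, which is worth keeping in mind when the outcome classes are imbalanced.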
Prediction Goals
- What are the expected outcomes for a crime in a certain region? Has this pattern changed over time?
Other Goals
- Discover epicenters of certain types of crimes in the city using clustering techniques.
- Correlate the approximate locations in the dataset with nearby places using the Google Places API (e.g. are thefts more likely to happen in areas with supermarkets or shops?).
EDIT 1: Since the presentation on Wednesday, I've found an additional dataset regarding stops and searches (see Long Description). Additional goals for this dataset could be:
- Analyze the correlation location-wise between stop and searches and reported crimes.
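One possible way to approach the location-wise correlation between stop and searches and reported crimes is to bin both sets of coordinates into a coarse grid and correlate the per-cell counts. The sketch below is an illustration under assumptions only: the grid size of 0.01 degrees, the helper names (`grid_cell`, `counts_per_cell`, `cell_correlation`) and the sample coordinates are invented here, and Pearson correlation is just one candidate measure.

```python
from collections import Counter
from math import floor, sqrt


def grid_cell(lat, lon, size=0.01):
    """Map a coordinate to the integer index of its grid cell."""
    return (floor(lat / size), floor(lon / size))


def counts_per_cell(points, size=0.01):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, size) for lat, lon in points)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def cell_correlation(crimes, stops, size=0.01):
    """Correlate crime counts with stop-and-search counts over all cells."""
    crime_counts = counts_per_cell(crimes, size)
    stop_counts = counts_per_cell(stops, size)
    cells = sorted(set(crime_counts) | set(stop_counts))
    return pearson([crime_counts[c] for c in cells],
                   [stop_counts[c] for c in cells])


# Invented coordinates: three cells with 3, 1 and 2 crimes and 6, 2 and 4 stops.
crimes = [(51.505, -0.125), (51.506, -0.124), (51.504, -0.126),
          (51.525, -0.105),
          (51.545, -0.085), (51.546, -0.084)]
stops = crimes[:3] * 2 + crimes[3:4] * 2 + crimes[4:] * 2
r = cell_correlation(crimes, stops)
print(r)
```

Since the published coordinates are anonymised, the cell size should probably not be much smaller than the expected location inaccuracy.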
Long Description
The proposed dataset relates to crime and policing in England, Wales and Northern Ireland and consists of monthly reports provided by 45 police forces from December 2010 to August 2016. The dataset is divided into one folder per month, and each folder holds one CSV file per force with that force's crime reports for the month, for a total of 6.29 GB of reports.
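Given the one-folder-per-month layout, the reports can be streamed file by file instead of being loaded all at once. The following is a minimal sketch, assuming that the crime files end in `-street.csv` and carry a `Crime type` column. The file name pattern, the helper names and the sample rows written to a temporary directory are assumptions made for the example.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def iter_crime_rows(root, pattern="*/*-street.csv"):
    """Yield one dict per crime report from every matching monthly CSV file."""
    for path in sorted(Path(root).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


def crime_type_counts(root):
    """Count reports per crime type across all monthly folders."""
    return Counter(row["Crime type"] for row in iter_crime_rows(root))


def write_sample(root):
    """Create a two-month sample tree with invented rows for demonstration."""
    samples = {
        "2016-07/2016-07-example-force-street.csv":
            ["Burglary", "Anti-social behaviour"],
        "2016-08/2016-08-example-force-street.csv":
            ["Anti-social behaviour"],
    }
    for relative, crime_types in samples.items():
        path = Path(root) / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["Month", "Reported by", "Crime type"])
            for crime_type in crime_types:
                writer.writerow([path.parent.name, "Example Force", crime_type])


with tempfile.TemporaryDirectory() as tmp:
    write_sample(tmp)
    counts = crime_type_counts(tmp)
print(counts.most_common())
```

Streaming rows through a generator keeps memory use flat, which matters for a download of several gigabytes.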
This dataset includes the following features:
- Month - Month in which the crime was reported.
- Reported by - The force that provided the data about the crime.
- Longitude and latitude - The anonymised coordinates of the crime.
- LSOA code and name
- Crime type - Category of the crime (e.g. robbery, violence and sexual offence, ...)
- Last outcome category - Outcome of the crime report (e.g. still under investigation, suspect was found guilty, ...)
- Context - Text description of the crime reported (I've always found this field left empty).
Check the dataset's website for more information on the data.
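The Weekly Progress notes mention storing the raw data in a database and joining the "Streets" and "Outcomes" tables. The notes do not say which database or join key was used, so the following is a sketch under assumptions: it uses SQLite, the hypothetical table names `street` and `outcomes`, a `Crime ID` column as the join key, and invented sample rows.

```python
import csv
import io
import sqlite3

# Tiny inline samples standing in for the monthly CSV files (invented rows).
STREET_CSV = """Crime ID,Month,LSOA code,Crime type
a1,2016-08,E01000001,Burglary
b2,2016-08,E01000002,Robbery
,2016-08,E01000001,Anti-social behaviour
"""
OUTCOMES_CSV = """Crime ID,Month,Outcome type
a1,2016-09,Offender given a caution
"""


def load_csv(conn, table, text):
    """Create `table` from the CSV header and insert all rows as text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)


def join_street_outcomes(conn):
    """Left-join street-level crimes with their outcome via the crime ID."""
    query = '''
        SELECT s."Crime ID", s."Crime type", o."Outcome type"
        FROM street AS s
        LEFT JOIN outcomes AS o
          ON s."Crime ID" = o."Crime ID" AND s."Crime ID" <> ''
        ORDER BY s.rowid
    '''
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_csv(conn, "street", STREET_CSV)
load_csv(conn, "outcomes", OUTCOMES_CSV)
rows = join_street_outcomes(conn)
for row in rows:
    print(row)
```

The left join keeps crimes that have no recorded outcome, and the extra condition on empty IDs is a precaution so that rows without an identifier never match each other.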
EDIT 1: I found out that an additional monthly dataset of "stop and searches" conducted by each police force is available, which may be interesting to include in the analysis. Combined with the previous dataset, the total size is 6.40 GB (so the addition itself takes roughly 100 to 110 MB). It includes data from December 2014 to August 2016.
This dataset includes the following features:
- Type - Type of the search (e.g. vehicle, person).
- Date - Specific date and hour of the stop and search.
- Policing operation - Was this stop and search part of a policing operation? If so, what is the name of the operation?
- Longitude and latitude
- Gender - Gender of searched individual.
- Age range - Age range of the suspect.
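- Self-defined ethnicity - Ethnicity of the suspect as stated by the suspect.
- Officer-defined ethnicity - Ethnicity of the suspect as perceived by the officer.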
- Object of search - I suspect this is the reason for the stop and search.
- Outcome - Outcome of the stop and search (e.g. suspect arrested, nothing found, ...)
- Outcome linked to the object of search (boolean field)
- Removal of more than just outer clothing (boolean field)
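With fields such as Gender, Age range and ethnicity, a first descriptive step is to look at how the recorded stops are distributed across groups. The sketch below is a minimal illustration: the helper name `share_by` and the sample records are invented, and only the `Gender` column name comes from the list above.

```python
from collections import Counter


def share_by(records, field):
    """Return each group's share of the records, skipping empty values."""
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Invented records, not taken from the real dataset.
sample = ([{"Gender": "Male"}] * 6
          + [{"Gender": "Female"}] * 2
          + [{"Gender": ""}])
shares = share_by(sample, "Gender")
ratio = shares["Male"] / shares["Female"]
print(shares, ratio)
```

These are shares of recorded stops, not per-person probabilities. Turning them into a probability of being stopped would additionally require the population size of each group in the area.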
Links / Data / Other
Due to the large size of the dataset, I will include download instructions here:
- Go to the dataset's website (P.S. - I had problems with Chrome, so try another browser).
- Select the range December 2010 to August 2016.
- Check all forces, Include crime data and Include stop and search data (this last one for the additional dataset).
- Click on Generate file and download it.