Cleaning data part II and Implementing Lasso - Gabya06/GlobalTerrorism GitHub Wiki

Cleaning data part II: Old + New steps

Top 6 countries:

  1. Added Thailand, removed USA. Final list: "Iraq","Pakistan","Afghanistan","India","Philippines","Thailand".
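A minimal sketch of restricting the data to the final country list, written in plain Python for illustration only. The column name `country_txt`, the helper `keep_top6`, and the sample rows are assumptions, not part of the project.

```python
# Final top-6 list, copied from the note above.
TOP6 = {"Iraq", "Pakistan", "Afghanistan", "India", "Philippines", "Thailand"}


def keep_top6(rows, country_key="country_txt"):
    """Return only the rows whose country is in the top-6 list.

    `country_key` is an assumed column name holding the country as text.
    """
    return [row for row in rows if row.get(country_key) in TOP6]


# Made-up rows, only to show the filter in action.
sample = [
    {"country_txt": "Iraq", "nkill": 3},
    {"country_txt": "United States", "nkill": 0},
    {"country_txt": "Thailand", "nkill": 1},
]
filtered = keep_top6(sample)
```

Using a set for the country list keeps the membership check cheap even when the table has many rows.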

Cleaning columns:

  1. Removed rows in which more than half of the n columns (>n/2) are NA.

  2. Removed columns with text.

  3. Removed additional columns using domain knowledge - "eventid","provstate","city","latitude","longitude","specificity","location","summary","targsubtype1","motive","weapdetail","propcomment","scite1","scite2","dbsource","target1","corp1","nkillter","nkillus","nwoundus","nwoundte".

  4. Removed integer columns in which missing values are coded as -9.

  5. Response variable: nvictim = nkill + nwound

  6. Encoded the text column "gname" as an integer index and replaced it with "gname.index".

  7. Workarounds for columns with NAs (refer to "NAVector" for the status of NAs):

     a. column "natlty": filled missing values with the country of the incident.

     b. column "guncertain1": deleted the rows with missing values.

     c. column "ishostkid1": deleted the rows with missing values.

     d. column "nvictim": deleted the rows with missing values.

     e. columns "nperp" and "nperpcap": not important according to BIC, AIC and other scores. Ignored for now.

     f. column "weapsubtype1": trying linear regression to impute the missing 4% of values (work in progress).
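The column-cleaning steps above can be sketched as a small pipeline. This is an illustration in plain Python under stated assumptions, not the project's actual code: missing cells are assumed to be parsed as `None`, step 4 is read as "drop any column that contains the code -9", the numeric `country` column and all helper names are hypothetical, and "gname.index" is assumed to be a 1-based code assigned in order of first appearance. Steps 2, 3, 7e and 7f are not covered.

```python
def is_na(value):
    """Assumption: missing cells were parsed to None."""
    return value is None


def drop_sparse_rows(rows, columns):
    """Step 1: drop rows where more than half of the columns are NA."""
    limit = len(columns) / 2
    return [r for r in rows if sum(is_na(r.get(c)) for c in columns) <= limit]


def drop_columns_with_code(rows, columns, code=-9):
    """Step 4 (assumed reading): drop columns that contain the -9 code."""
    bad = {c for c in columns if any(r.get(c) == code for r in rows)}
    return [{k: v for k, v in r.items() if k not in bad} for r in rows]


def add_nvictim(rows):
    """Step 5: nvictim = nkill + nwound; NA if either part is NA (assumed)."""
    out = []
    for r in rows:
        r = dict(r)
        kills, wounded = r.get("nkill"), r.get("nwound")
        r["nvictim"] = None if is_na(kills) or is_na(wounded) else kills + wounded
        out.append(r)
    return out


def index_gname(rows):
    """Step 6: replace "gname" with an integer "gname.index"."""
    codes, out = {}, []
    for r in rows:
        r = dict(r)
        name = r.pop("gname")
        r["gname.index"] = codes.setdefault(name, len(codes) + 1)
        out.append(r)
    return out


def fill_natlty(rows, country_key="country"):
    """Step 7a: fill a missing natlty with the country of the incident."""
    return [dict(r, natlty=r[country_key]) if is_na(r.get("natlty")) else dict(r)
            for r in rows]


def drop_rows_with_na(rows, columns):
    """Steps 7b-7d: delete rows with an NA in any of the given columns."""
    return [r for r in rows if not any(is_na(r.get(c)) for c in columns)]


# Made-up rows (codes and group names are placeholders, not real data).
sample = [
    {"country": 1, "natlty": None, "guncertain1": 0, "ishostkid1": 0,
     "property": 1, "nkill": 2, "nwound": 5, "gname": "Group A"},
    {"country": 2, "natlty": 2, "guncertain1": None, "ishostkid1": 0,
     "property": -9, "nkill": 1, "nwound": 0, "gname": "Group B"},
    {"country": 3, "natlty": 3, "guncertain1": 0, "ishostkid1": 1,
     "property": 0, "nkill": None, "nwound": 3, "gname": "Group B"},
    {"country": 1, "natlty": None, "guncertain1": None, "ishostkid1": None,
     "property": 1, "nkill": None, "nwound": None, "gname": "Group A"},
]

columns = list(sample[0])
rows = drop_sparse_rows(sample, columns)
rows = drop_columns_with_code(rows, columns)
rows = add_nvictim(rows)
rows = index_gname(rows)
rows = fill_natlty(rows)
cleaned = drop_rows_with_na(rows, ["guncertain1", "ishostkid1", "nvictim"])
```

Each helper returns new rows instead of mutating its input, so the steps can be reordered or skipped while experimenting with the cleaning choices.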