Data Understanding - mdengo/cab-survey GitHub Wiki

Gathering Data

Outline data requirements

The data needs to be provided in a format that can be loaded for use with sklearn, such as CSV.

Verify data availability

The Clinical, Anthropometric & Bio-Chemical (CAB) Survey data set is publicly available on kaggle.com in CSV format. Because some columns contain values of mixed types, the CSV files should be imported with low_memory=False (assuming pandas.read_csv, where low_memory=True is the default and can lead to mixed type inference).
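
The import can be sketched as follows, assuming pandas is used to read the files. According to the pandas documentation, low_memory defaults to True, and setting it to False (or passing explicit dtypes) is what avoids mixed type inference. The in-memory sample and its column names are illustrative stand-ins, not the real CAB columns.

```python
import io

import pandas as pd

# Tiny stand-in for a CAB csv file; column names and values are illustrative only.
sample_csv = io.StringIO(
    "state_code,age,haemoglobin_level\n"
    "9,34,12.5\n"
    "9,2,refused\n"
    "9,51,10.1\n"
)

# low_memory=False lets pandas infer each column's dtype from the whole file
# instead of chunk by chunk, which avoids mixed-type inference on large files.
df = pd.read_csv(sample_csv, low_memory=False)

# Columns that mix numbers and text are not parsed as numeric.
print(df.dtypes)
```

For the real files, the StringIO object would be replaced by the path to the downloaded CSV.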

Define selection criteria

The only data source is https://www.kaggle.com/rajanand/cab-survey, last accessed on 22.11.18 at 12:02 in Tartu, Estonia. Of the available files ('CAB_22_CT.csv', 'CAB_05_UT.csv', 'CAB_20_JH.csv', 'CAB_23_MP.csv', 'CAB_08_RJ.csv', 'CAB_data_dictionary.xlsx', 'CAB_21_OR.csv', 'CAB_10_BH.csv', 'CAB_09_UP.csv', 'CAB_18_AS.csv'), we choose the biggest data set, which is 'CAB_09_UP.csv'.
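
Picking the biggest data set can be automated. The following is a minimal sketch that assumes "biggest" means largest file size on disk (it could equally mean the most rows); the helper name largest_csv is hypothetical, and the demo uses dummy files in a temporary directory rather than the real downloads.

```python
import tempfile
from pathlib import Path


def largest_csv(directory: Path) -> Path:
    """Return the csv file with the largest size in bytes (non-csv files are ignored)."""
    csv_files = list(directory.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"no csv files found in {directory}")
    return max(csv_files, key=lambda path: path.stat().st_size)


# Demo with dummy files; the contents are placeholders, not survey data.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "CAB_09_UP.csv").write_text("a,b\n" * 100)
    (tmp_path / "CAB_05_UT.csv").write_text("a,b\n" * 10)
    (tmp_path / "CAB_data_dictionary.xlsx").write_text("x" * 10000)
    biggest_name = largest_csv(tmp_path).name

print(biggest_name)
```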

Our goal is to find good features for each of the three disease indicators: blood sugar, blood haemoglobin, and blood pressure. For each of these disease indicators, the feature selection (and sample selection) must be done differently:

  • outcome of blood sugar testing (features 36-38)
    • a fasting blood glucose level outside the 70-100 mg/dL range indicates diabetes
    • only adults
  • blood haemoglobin for individuals 6 months or older (features 27-29)
    • blood haemoglobin below 13.5 g/dL in men or 12.0 g/dL in women indicates anemia
  • systolic/diastolic blood pressure (features 30-33)
    • only adults

After trying to predict or recognize patterns with these features, we can add other features that improve the results.
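
The thresholds listed above can be written as small helper functions for labelling cases. This is a minimal sketch: the function names and the "male"/"female" encoding are assumptions for illustration, and the actual encoding of sex and of untested individuals in the CAB files has to be checked against the data dictionary.

```python
from typing import Optional

# Thresholds as stated in the text above.
GLUCOSE_RANGE_MG_DL = (70.0, 100.0)
HB_THRESHOLD_G_DL = {"male": 13.5, "female": 12.0}  # hypothetical sex encoding


def glucose_out_of_range(fasting_glucose_mg_dl: Optional[float]) -> Optional[bool]:
    """True if the fasting glucose value lies outside 70-100 mg/dL, None if untested."""
    if fasting_glucose_mg_dl is None:
        return None
    low, high = GLUCOSE_RANGE_MG_DL
    return not (low <= fasting_glucose_mg_dl <= high)


def is_anaemic(haemoglobin_g_dl: Optional[float], sex: str) -> Optional[bool]:
    """True if haemoglobin is below the sex-specific threshold, None if unknown."""
    if haemoglobin_g_dl is None or sex not in HB_THRESHOLD_G_DL:
        return None
    return haemoglobin_g_dl < HB_THRESHOLD_G_DL[sex]


print(glucose_out_of_range(126.0), is_anaemic(13.0, "male"))
```

Returning None for untested individuals keeps refusals separate from negative results, so they are not silently counted as healthy.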

Describing Data

We use the Uttar Pradesh data set, provided in CSV format. It consists of approximately 490,000 cases and 53 features. The features fall into three groups: survey-related data (such as date, state), general personal data (such as sex, age…) and individual health data (such as haemoglobin, pulse rate, illness type). Some of the 53 features are only relevant depending on the age/sex of the individual:

  • features 22-26 for individuals aged 1 month or older
  • features 27-29 for individuals aged 6 months or older
  • features 30-38 for individuals aged 18 years or older
  • features 39-41 for women aged 15-49
  • features 42-50 for children under 3 years
  • features 51-53 for children under 5 years
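
The age and sex conditions in the list above can be encoded as a lookup that returns the feature ranges relevant for one individual. This is a sketch under assumptions: the function name, the age in months as input, and the "female" encoding are hypothetical, and features outside the listed ranges are not covered.

```python
from typing import List, Tuple


def applicable_feature_ranges(age_months: float, sex: str) -> List[Tuple[int, int]]:
    """Return the (first, last) feature ranges from the list that apply to a person."""
    ranges = []
    if age_months >= 1:  # aged 1 month or older
        ranges.append((22, 26))
    if age_months >= 6:  # aged 6 months or older
        ranges.append((27, 29))
    if age_months >= 18 * 12:  # aged 18 years or older
        ranges.append((30, 38))
    if sex == "female" and 15 * 12 <= age_months < 50 * 12:  # women aged 15-49
        ranges.append((39, 41))
    if age_months < 3 * 12:  # children under 3 years
        ranges.append((42, 50))
    if age_months < 5 * 12:  # children under 5 years
        ranges.append((51, 53))
    return ranges


print(applicable_feature_ranges(30 * 12, "female"))
print(applicable_feature_ranges(3, "male"))
```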

Depending on our goal, we can probably discard several cases. There is data about 1.89 million individuals, so the number of cases will probably still be sufficient. As the Indian states in question have many inhabitants, the sample size needs to be considered when drawing conclusions.

Exploring Data

Initial exploration revealed that all states except Rajasthan have an approximately balanced number of male and female individuals. According to the data dictionary attached to the data set, the feature age_code marks whether the age feature gives the age in years (Y), months (M) or days (D). For the Rajasthan data this column is numeric and not easily interpreted. These findings suggest we should exclude the Rajasthan data.
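
Because age is recorded in different units, it has to be normalised before age-based selection (for example "only adults"). The following is a minimal sketch assuming age_code takes the values Y, M and D as described in the data dictionary; the function name and the choice of 365.25 days per year are assumptions. Codes that cannot be interpreted, such as the numeric ones in the Rajasthan data, yield None.

```python
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumed average year length
MONTHS_PER_YEAR = 12.0


def age_in_years(age: float, age_code: object) -> Optional[float]:
    """Convert an age given in years (Y), months (M) or days (D) to years."""
    code = str(age_code).strip().upper()
    if code == "Y":
        return float(age)
    if code == "M":
        return age / MONTHS_PER_YEAR
    if code == "D":
        return age / DAYS_PER_YEAR
    return None  # unknown code, e.g. a numeric value


print(age_in_years(24, "M"), age_in_years(30, "Y"), age_in_years(5, 2))
```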

Verifying Data Quality

Further exploration revealed some input errors that have to be dealt with, but this should not become a major problem. Surveyed individuals also had the option of refusing blood sugar, blood pressure, or haemoglobin testing. We looked at the proportion of individuals for whom testing was conducted: for each state and each indicator, over 50% of all individuals consented to testing. This confirms that the data we need exists.
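
A check of this kind can be sketched with pandas as follows. The column names and the toy values are invented for illustration, and the sketch assumes that an untested individual shows up as a missing value; how refusals are actually encoded in the CAB files needs to be confirmed from the data dictionary.

```python
import pandas as pd

# Toy data; state labels, column names and values are illustrative only.
df = pd.DataFrame(
    {
        "state": ["UP", "UP", "UP", "BH", "BH", "BH"],
        "haemoglobin": [12.1, None, 10.4, None, None, 13.0],
        "fasting_glucose": [90, 110, None, 95, 101, None],
    }
)
indicator_cols = ["haemoglobin", "fasting_glucose"]

# Share of individuals per state with a recorded (non-missing) value per indicator.
tested_share = df[indicator_cols].notna().groupby(df["state"]).mean()

# True where more than half of the individuals were tested.
majority_tested = tested_share > 0.5

print(tested_share)
print(majority_tested)
```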