Data consolidation - AgileDataScienceUB/ADS4 GitHub Wiki

Starting from three distinct CSV files containing data on the domain we are analyzing, we decided to keep the following features:

  • EmployeeID: Integer identifying an employee
  • record id: Integer identifying a unique row of our final dataset
  • hire date: Date of hire of the employee
  • record date: Date on which the employee was recorded in the dataset
  • termination date: Date of termination of the employee (set to a default date if the employee is still working)
  • length of service: Integer expressing the number of years from hire date to record date
  • age: Integer referring to the age of the employee
  • target: Boolean expressing our class value. It is 1 if an employee is likely to leave, 0 otherwise.
  • other target fields: list of other fields related to our target value.
  • job title: String giving the role of the employee in the organization
  • salary: Monthly income of an employee
  • Special field types: optional

These features were selected using domain knowledge and a careful analysis of the two original datasets, combining a data-driven approach with an analysis-driven one, as is common in data warehouse design.
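
As a minimal sketch, the consolidated record described in the feature list could be modeled as shown here. This is not the project's actual code: the class name `EmployeeRecord`, the snake_case field names, and the sentinel `STILL_EMPLOYED` are assumptions made for illustration, and the real default termination date is not stated on this page. The sketch derives length of service from hire date and record date rather than storing it, so the three fields cannot disagree.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sentinel: the actual default termination date used in the
# dataset is not given on this page.
STILL_EMPLOYED = date(1900, 1, 1)


@dataclass
class EmployeeRecord:
    """One row of the consolidated dataset (field names are assumptions)."""
    employee_id: int
    record_id: int
    hire_date: date
    record_date: date
    termination_date: date
    age: int
    target: int        # 1 if the employee is likely to leave, 0 otherwise
    job_title: str
    salary: float      # monthly income

    @property
    def length_of_service(self) -> int:
        """Whole years elapsed from hire date to record date."""
        years = self.record_date.year - self.hire_date.year
        # Subtract one year if the hire anniversary has not yet been
        # reached in the record year.
        anniversary_pending = (
            (self.record_date.month, self.record_date.day)
            < (self.hire_date.month, self.hire_date.day)
        )
        return years - 1 if anniversary_pending else years

    @property
    def is_active(self) -> bool:
        """True when the termination date is the default placeholder."""
        return self.termination_date == STILL_EMPLOYED


# Made-up values, for illustration only.
record = EmployeeRecord(
    employee_id=1,
    record_id=1,
    hire_date=date(2010, 6, 15),
    record_date=date(2015, 12, 31),
    termination_date=STILL_EMPLOYED,
    age=40,
    target=0,
    job_title="Analyst",
    salary=3000.0,
)
```

Computing `length_of_service` as a property is one possible design choice; if the source files already provide that column, it could instead be stored and checked against the two dates.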