Data consolidation - AgileDataScienceUB/ADS4 GitHub Wiki
Starting from three distinct CSV files containing data about the domain we're analyzing, we've decided to keep the following features:
- EmployeeID: Integer identifying an employee
- record id: Integer identifying a unique row of our final dataset
- hire date: Date of hire of the employee
- record date: Date on which the employee's record was registered in the dataset
- termination date: Date of termination of the employee (a default date is used if the employee is still working)
- length of service: Integer number of years from hire date to record date
- age: Integer referring to the age of the employee
- target: Boolean class value. It is 1 if an employee is likely to leave, 0 otherwise.
- other target fields: list of other fields related to our target value.
- job title: String describing the role of the employee in the organization.
- salary: Monthly income of an employee
- Special field types: optional
These features were chosen using domain knowledge and a careful analysis of the two original datasets, combining a data-driven approach with an analysis-driven one, as is common in data warehouse design.
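
As a rough illustration of the consolidated record described above, here is a minimal Python sketch using only the standard library. It is not the project's actual code: the column names (`record_id`, `hire_date`, and so on), the ISO date format, the `1900-01-01` default termination date, and the sample row are all assumptions made for the example. It also assumes that length of service is derived from the hire and record dates rather than read from the source files.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

# Assumed placeholder meaning "still employed"; the real default date may differ.
DEFAULT_TERMINATION = date(1900, 1, 1)


@dataclass
class EmployeeRecord:
    employee_id: int
    record_id: int
    hire_date: date
    record_date: date
    termination_date: date
    age: int
    target: int      # 1 = likely to leave, 0 otherwise
    job_title: str
    salary: float    # monthly income

    @property
    def length_of_service(self) -> int:
        """Whole years elapsed from hire date to record date."""
        years = self.record_date.year - self.hire_date.year
        # Subtract one year if the hiring anniversary has not yet occurred
        # in the record year.
        if (self.record_date.month, self.record_date.day) < (
            self.hire_date.month,
            self.hire_date.day,
        ):
            years -= 1
        return years

    @property
    def is_active(self) -> bool:
        """True when the termination date is the default placeholder."""
        return self.termination_date == DEFAULT_TERMINATION


def parse_row(row: dict) -> EmployeeRecord:
    """Convert one CSV row (hypothetical column names) into a typed record."""
    return EmployeeRecord(
        employee_id=int(row["EmployeeID"]),
        record_id=int(row["record_id"]),
        hire_date=date.fromisoformat(row["hire_date"]),
        record_date=date.fromisoformat(row["record_date"]),
        termination_date=date.fromisoformat(row["termination_date"]),
        age=int(row["age"]),
        target=int(row["target"]),
        job_title=row["job_title"],
        salary=float(row["salary"]),
    )


# Made-up sample row, used only to exercise the sketch.
sample = io.StringIO(
    "EmployeeID,record_id,hire_date,record_date,termination_date,"
    "age,target,job_title,salary\n"
    "1318,1,2010-06-15,2015-12-31,1900-01-01,52,0,Cashier,1500\n"
)
records = [parse_row(row) for row in csv.DictReader(sample)]
```

Keeping the dates as `date` objects instead of strings lets derived fields such as length of service be recomputed consistently for every row, whichever source file the row came from.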