Data consolidation - AgileDataScienceUB/ADS4 GitHub Wiki

Starting from three distinct CSV files containing data from the domain we are analyzing, we decided to keep the following features:

EmployeeID: Integer identifying an employee
record id: Integer identifying a unique row of our final dataset
hire date: Date of hire of the employee
record date: Date on which the employee's record was registered in the dataset
termination date: Date of termination of the employee (it can be a default date if the employee is still working)
length of service: Integer number of years from hire date to record date
age: Integer age of the employee
target: Boolean class value. It is 1 if the employee is likely to leave, 0 otherwise.
other target fields: List of other fields related to our target value.
job title: String describing the role of the employee in the organization.
salary: Monthly income of an employee
Special field types: optional
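
As a rough illustration of the features listed above, the following sketch models one row of the consolidated dataset and derives the length of service from the hire date and record date. The class name `ConsolidatedRecord`, the helper `years_between`, the sentinel `DEFAULT_TERMINATION_DATE`, and the sample values are all assumptions made for this example; the actual default termination date and column names are not specified here, and the "other target fields" and "special field types" entries are left out because their contents are not described.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sentinel for employees who are still working.
# The real default date used in the dataset is not stated in this page.
DEFAULT_TERMINATION_DATE = date(1900, 1, 1)


def years_between(start: date, end: date) -> int:
    """Return the number of whole years elapsed from start to end."""
    years = end.year - start.year
    # Subtract one year if the anniversary has not been reached yet.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


@dataclass
class ConsolidatedRecord:
    """One row of the final dataset (illustrative field names)."""
    employee_id: int
    record_id: int
    hire_date: date
    record_date: date
    age: int
    job_title: str
    salary: float  # monthly income
    target: int = 0  # 1 if the employee is likely to leave, 0 otherwise
    termination_date: date = DEFAULT_TERMINATION_DATE

    @property
    def length_of_service(self) -> int:
        # Years from hiring to the record date, as defined in the list above.
        return years_between(self.hire_date, self.record_date)

    @property
    def is_active(self) -> bool:
        # An employee is treated as still working when the sentinel is set.
        return self.termination_date == DEFAULT_TERMINATION_DATE


# Made-up sample values, for illustration only.
record = ConsolidatedRecord(
    employee_id=1318,
    record_id=1,
    hire_date=date(2010, 8, 28),
    record_date=date(2015, 12, 31),
    age=35,
    job_title="Cashier",
    salary=2500.0,
)
print(record.length_of_service, record.is_active)
```

Computing length of service from the two dates, rather than storing it independently, is one way to keep the three fields consistent; whether the consolidated dataset stores or derives this value is a design choice this sketch does not settle.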

These features were chosen using domain knowledge and a careful analysis of the two original datasets, combining a data-driven approach with an analysis-driven approach, as is common in data warehouse design.