12_Weekly Team Update and Planning - ulquyorra-11/Cinemalytics GitHub Wiki

📅 Entry Date: 2024-02-25

📌 Meeting Topic: Working on Data Sets (Separation, Merging and Cleaning)

✍️ Author: Uzair

🙋 Attendance

  • Samer
  • Uzair

✅ Highlights & Achievements

  • Separated the Netflix, Prime, and Disney+ datasets into movies and series datasets
  • Combined the movies datasets from all three platforms
  • Combined the series datasets from all three platforms
  • Removed unnecessary columns: director, date_added, show_id, cast, type
  • Renamed column listed_in to genre
  • Renamed 'duration' to 'duration_min' (movies) and 'duration_seasons' (series)
  • Replaced empty values with NULL
  • Created cleaned dataset files for further use
  • Created and assigned new tickets for next tasks
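
The separation, merging, and cleaning steps listed above can be sketched in pandas roughly as follows. This is only an illustration, not the code the team actually ran: the sample rows, the `make_sample` and `clean` helpers, the `country` column, and the `type` values "Movie" and "TV Show" are all assumptions made for the example.

```python
import pandas as pd

# Columns the notes list as unnecessary.
DROP_COLUMNS = ["director", "date_added", "show_id", "cast", "type"]


def make_sample(prefix):
    """Hypothetical two-row stand-in for one platform's raw dataset."""
    return pd.DataFrame({
        "show_id": [f"{prefix}1", f"{prefix}2"],
        "type": ["Movie", "TV Show"],  # assumed type labels
        "title": [f"{prefix} film", f"{prefix} show"],
        "director": ["", ""],
        "cast": ["", ""],
        "country": ["", "India"],  # assumed extra column with one empty value
        "date_added": ["2021-01-01", "2021-02-01"],
        "listed_in": ["Dramas", "Comedies"],
        "duration": ["90 min", "2 Seasons"],
    })


def clean(raw, type_value, duration_name):
    """Keep one content type, drop unused columns, rename, and mark empties as missing."""
    subset = raw[raw["type"] == type_value]
    subset = subset.drop(columns=DROP_COLUMNS)
    subset = subset.rename(columns={"listed_in": "genre", "duration": duration_name})
    # Empty strings become missing values (written as NULL/empty on export).
    subset = subset.replace("", pd.NA)
    return subset.reset_index(drop=True)


platforms = [make_sample(name) for name in ("netflix", "prime", "disney")]

# One combined movies dataset and one combined series dataset.
movies = pd.concat(
    [clean(df, "Movie", "duration_min") for df in platforms], ignore_index=True
)
series = pd.concat(
    [clean(df, "TV Show", "duration_seasons") for df in platforms], ignore_index=True
)

print(movies)
print(series)
```

Running the same `clean` function on every platform keeps the column names and order identical, which makes the later merge a plain concatenation.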

❗ Challenges

  • When a DataFrame is saved with Python, an extra column is added at the beginning of the dataset. Some datasets were prepared in SQL and did not have this additional column. We discovered the mismatch while combining the datasets.
  • As a solution, we processed all the datasets in SQL to avoid mixing two platforms. As a suggestion for future work, only one platform should be used for the whole process to keep the data uniform.
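
The notes do not say how the files were written. If they were saved with pandas `to_csv`, the extra leading column is most likely the DataFrame's row index, which is written by default. A minimal sketch under that assumption (the sample data is hypothetical):

```python
import io

import pandas as pd

# Hypothetical one-row dataset.
df = pd.DataFrame({"title": ["Example Film"], "genre": ["Dramas"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(header_with)     # ,title,genre
print(header_without)  # title,genre
```

Passing `index=False` on export would keep Python-generated files aligned with the SQL-generated ones, should the team choose to stay with Python for a later step.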

📝 Notes

  • Cleaned and combined datasets are now available in Data -> Clean folder
  • Data separation, merging, and cleaning have been done in both Python and SQL for learning
  • The next step is visualization of the cleaned datasets
  • The next group meeting date has been set to Thursday (29.02.2024) at 7:00 PM (19:00)