IMDB Dataset Proposal - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Summary

The Internet Movie Database (IMDB) is one of the most extensive and up-to-date movie datasets available. IMDB offers its data via FTP for non-commercial use, and the dataset contains many different attributes for each movie, for example title, genre, actors/actresses, directors, company, year, etc. These features, in conjunction with the openly available MovieLens user ratings for movies, offer an extensive list of possible feature derivations.

Prediction Goals

Possible prediction goals that I can think of right now, keeping in mind the available features, are:

  • Classify movies into high, medium and low rated ones, based on what combination of cast members, budget, directors, writers, etc. worked on the movie.
  • For an upcoming movie, predict the rating that the movie might get based on the above-mentioned factors.
  • For the different genres of movies, predict what combination of team members would have the best chance of bagging high ratings.
  • Figure out from past data if there is a pattern in what kinds of movies are usually high earning (and thus presumably also highly ranked).
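As a rough illustration of the first goal, the target classes could be derived by binning a movie's mean rating. This is only a minimal sketch: the function name `rating_class`, the 1-10 rating scale, the cutoffs 5.0 and 7.0, and the example ratings are all assumptions made for illustration and would have to be chosen from the actual data.

```python
def rating_class(mean_rating, low_cutoff=5.0, high_cutoff=7.0):
    """Map a mean rating (assumed 1-10 scale) to 'low', 'medium' or 'high'.

    The cutoffs are placeholder assumptions, not values taken from the data.
    """
    if mean_rating >= high_cutoff:
        return "high"
    if mean_rating >= low_cutoff:
        return "medium"
    return "low"


# Hypothetical example ratings, not real IMDB values.
example_ratings = {"Movie A": 8.1, "Movie B": 6.2, "Movie C": 3.9}

# Label every movie with its rating class.
labels = {title: rating_class(r) for title, r in example_ratings.items()}
print(labels)
```

In practice the cutoffs might be better picked from the rating distribution (for example, terciles) so that the three classes are roughly balanced.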

Long description

The IMDB provides about 25 attributes for more than 40,000 movies. These attributes would require a fair amount of preprocessing, which suits the requirements of this course well. From feature extraction and combination to feature derivation, this dataset together with the MovieLens dataset allows for many different approaches. The attributes in IMDB come in list form and would usually need to be converted into a bag-of-words format before they can be used. The MovieLens dataset provides a linking method to map the movies in its dataset to IMDB movies.
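The two preprocessing steps mentioned above could look roughly like the following sketch. It assumes that each movie has already been parsed into a record with a list-valued attribute, and that the MovieLens links are available as a CSV table with `movieId` and `imdbId` columns. The helper `multi_hot`, the field names, and all ids and rows are hypothetical and only serve as an illustration.

```python
import csv
import io


def multi_hot(records, attribute):
    """Turn a list-valued attribute into binary bag-of-words vectors."""
    # Vocabulary: every distinct value seen for this attribute, sorted.
    vocabulary = sorted({value for rec in records for value in rec[attribute]})
    index = {value: i for i, value in enumerate(vocabulary)}
    vectors = []
    for rec in records:
        row = [0] * len(vocabulary)
        for value in rec[attribute]:
            row[index[value]] = 1  # mark presence of this value
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical movie records, not real IMDB entries.
movies = [
    {"imdb_id": "0000001", "genres": ["Comedy", "Drama"]},
    {"imdb_id": "0000002", "genres": ["Drama"]},
    {"imdb_id": "0000003", "genres": ["Action", "Comedy"]},
]
vocabulary, vectors = multi_hot(movies, "genres")

# Assumed layout of the MovieLens link table (hypothetical rows).
links_csv = "movieId,imdbId\n10,0000001\n20,0000003\n"
reader = csv.DictReader(io.StringIO(links_csv))
movielens_to_imdb = {row["movieId"]: row["imdbId"] for row in reader}

# Keep only the links whose IMDB id is present in our movie records.
imdb_ids = {m["imdb_id"] for m in movies}
linked = {ml: imdb for ml, imdb in movielens_to_imdb.items() if imdb in imdb_ids}
print(vocabulary, vectors, linked)
```

The same encoding would apply to other list attributes such as actors or directors, although their much larger vocabularies would probably call for a sparse representation or a minimum-frequency filter.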