Dataset Yelp's Academic Dataset - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Dataset Yelp's Academic Dataset

Proposer: Abhishek Kapoor - @kapoorabhishek24
Votes 🗳:

Summary

Yelp's Academic Dataset is a treasure trove of local business data, spanning categories such as Restaurants, Hotels, Education, Travel, and Local Services. The data is collected from 10 cities across 4 countries: the UK, Canada, Germany, and the US. It would be interesting to learn users' opinions across these countries, identify cultural trends, predict seasonal effects on a particular business, and find out which kinds of business are currently in demand, in case you ever want to start one of your own.

Prediction Goals

Figuring out the trend setters: which businesses made things popular?
Seasonal effects: is winter preferred for a certain kind of business?
Does location have an impact on the business?
Finding the expert users
Business trends and recommendations
Finding N-grams in customer reviews: which words are frequently used for well-reputed restaurants?
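
As a rough illustration of the N-gram goal above, the following sketch counts the most frequent word bigrams across a handful of review texts. The function names (`ngrams`, `top_ngrams`), the simple regex tokenizer, and the sample reviews are all assumptions made for this example; they are not part of the dataset or of any Yelp tooling.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of `text`, lowercased, punctuation dropped."""
    # Naive tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(reviews, n=2, k=3):
    """Count n-grams over all review texts and return the k most common."""
    counts = Counter()
    for text in reviews:
        counts.update(ngrams(text, n))
    return counts.most_common(k)


# Made-up review texts, for illustration only.
sample_reviews = [
    "Great food and great service",
    "Great service, friendly staff",
]

print(top_ngrams(sample_reviews, n=2, k=1))
```

On real data, the same counting could be restricted to reviews of highly rated restaurants and compared against the rest to see which phrases stand out.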

Long Description

Some facts about Yelp's Academic Dataset:
* Size: ~2.6 GB
* Format: JSON
* User Reviews: 2.7M
* Businesses: 86K
* Users: 687K

The dataset can yield plenty of interesting insights. It is diverse and contains a large collection of real-world business reviews and customer expectations. With a diverse set of cities, we can compare and contrast what makes a particular city different. What cuisines are people raving about in these different countries? Does location play a role in business success, and which cities favor which kinds of business? Which business type is trending most, either because people need it or because it plays a major role in their daily lives? We could also detect changepoints for a business, such as which event, review, or time led to its failure.

All these predictions and findings will tell us about current trends in business success and failure, user expectations and opinions, and so on. We can also submit this project to Yelp (there is no deadline for academic research).

Notes on the Dataset

Each file is composed of a single object type, with one JSON object per line.

Business: {
    'type': 'business',
    'business_id': (encrypted business id),
    'name': (business name),
    'neighborhoods': [(hood names)],
    'full_address': (localized address),
    'city': (city),
    'state': (state),
    'latitude': latitude,
    'longitude': longitude,
     ...

Users: {
   'type': 'user',
   'user_id': (encrypted user id),
   'name': (first name),
   'review_count': (review count),
   'average_stars': (floating point average, like 4.31),
   'votes': {(vote type): (count)},
   'friends': [(friend user_ids)],
   'elite': [(years_elite)],
   'yelping_since': (date, formatted like '2012-03'),
   'compliments': {
       (compliment_type): (num_compliments_of_this_type),
       ...
   },
   'fans': (num_fans),
  }
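
Because each file holds one JSON object per line, it can be streamed without loading everything into memory. The sketch below parses such lines and ranks users by the `fans` field from the user object above, as one possible starting point for finding expert users. The helper names (`read_json_lines`, `top_users_by_fans`) and the in-memory sample records are assumptions for illustration; the real files are assumed to use standard double-quoted JSON.

```python
import io
import json


def read_json_lines(handle):
    """Yield one parsed object per non-empty line of a text handle."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def top_users_by_fans(users, k=2):
    """Return the k users with the most fans (missing 'fans' counts as 0)."""
    return sorted(users, key=lambda u: u.get("fans", 0), reverse=True)[:k]


# Made-up records standing in for a user file; only a few fields are shown.
sample = io.StringIO(
    '{"type": "user", "user_id": "u1", "name": "Ann", "fans": 12}\n'
    '{"type": "user", "user_id": "u2", "name": "Bob", "fans": 3}\n'
    '{"type": "user", "user_id": "u3", "name": "Cy", "fans": 40}\n'
)

users = list(read_json_lines(sample))
top = top_users_by_fans(users)
print([u["user_id"] for u in top])
```

With the actual dataset, the `io.StringIO` object would be replaced by an opened file handle, and other fields such as `elite` or `review_count` could be used as alternative ranking signals.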

For further information check here.