Dish Discovery from Yelp Reviews
Restaurants
Reviews

Abstract

This paper proposes a novel framework for automatic dish discovery via word embeddings on restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to extract dish words. We then use the processed reviews as training text to learn word embedding vectors with the skip-gram model. A nearest-neighbor-like score function is proposed to rank the dishes based on their learned representations. We briefly analyze the preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.

Background

With the growth of social media, companies such as Yelp have accumulated a large amount of user-generated content (UGC). Several studies have sought to uncover critical information hidden in this content. While much work has addressed accurate sentiment interpretation of reviews and recommendation, little has focused on dish-level analysis. In this paper, therefore, we aim to provide a novel framework for automatic dish discovery from restaurant reviews via embedding techniques. We first employ regular expressions to parse restaurant reviews and extract dish words, and then use the processed reviews as training text to learn an embedding vector for each word via the skip-gram model. In addition, a nearest-neighbor-like score function is proposed to rank the dishes via their learned representations. Preliminary experiments are conducted on a real-world restaurant review dataset collected from the Yelp Data Challenge.
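The parsing step can be illustrated with a minimal sketch. It assumes that dish words are obtained by matching menu entries in the review text and joining each match into a single underscore-separated token, so that the skip-gram model sees a multi-word dish as one vocabulary item. The function names, the sample menu, and the window size are hypothetical and only serve the illustration; the exact expressions used in our pipeline may differ.

```python
import re

def build_dish_pattern(dish_names):
    """Compile one regex that matches any menu entry (assumed matching scheme)."""
    # Longest names first, so "beef noodle soup" wins over "beef noodle".
    ordered = sorted(dish_names, key=len, reverse=True)
    # Allow any run of whitespace between the words of a dish name.
    alternatives = [
        r"\s+".join(re.escape(word) for word in name.split())
        for name in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

def tag_dishes(review, pattern):
    """Replace each dish mention with a single lowercase, underscore-joined token."""
    def to_token(match):
        return "_".join(match.group(0).lower().split())
    return pattern.sub(to_token, review)

def skipgram_pairs(tokens, window=2):
    """List the (center, context) pairs a skip-gram model would train on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical menu and review, for illustration only.
menu = ["beef noodle soup", "beef noodle", "garlic fries"]
pattern = build_dish_pattern(menu)
tagged = tag_dishes("The Beef Noodle Soup and garlic  fries were great.", pattern)
tokens = re.findall(r"[a-z_']+", tagged.lower())
pairs = skipgram_pairs(tokens, window=2)
```

After tagging, each dish occupies one position in the token sequence, so its embedding is learned from the words that surround the whole dish name rather than from its individual words.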

Dataset

Our preliminary experiments use a real-world restaurant review dataset collected from the Yelp Data Challenge. We first choose the 100 restaurants with the most reviews in the Las Vegas area and then manually extract the menu of each restaurant from its official website. Of those 100 restaurants, we keep only those with a complete menu, and use their reviews and menus as our dataset. In summary, 69 restaurants and 95,578 reviews remain after filtering; the average number of words per review is about 147, and the vocabulary size is 46,017.
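The selection and the summary statistics can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure: the record fields (`business_id`, `city`, `text`) are modeled on the Yelp data format, a "complete menu" is approximated by a non-empty menu entry, and words are counted by whitespace splitting, which need not match the tokenization behind the figures reported above. The toy records are invented for the example.

```python
from collections import Counter

def select_restaurants(businesses, reviews, city, top_n, menus):
    """Pick the top_n most-reviewed businesses in a city, then keep those with a menu."""
    in_city = {b["business_id"] for b in businesses if b["city"] == city}
    counts = Counter(
        r["business_id"] for r in reviews if r["business_id"] in in_city
    )
    top = [bid for bid, _ in counts.most_common(top_n)]
    # Assumption: a missing or empty menu entry means the menu is incomplete.
    return [bid for bid in top if menus.get(bid)]

def corpus_stats(reviews, kept_ids):
    """Compute review count, average words per review, and vocabulary size."""
    kept = set(kept_ids)
    token_lists = [
        r["text"].lower().split() for r in reviews if r["business_id"] in kept
    ]
    total_words = sum(len(tokens) for tokens in token_lists)
    vocab = {word for tokens in token_lists for word in tokens}
    return {
        "reviews": len(token_lists),
        "avg_words": total_words / len(token_lists) if token_lists else 0.0,
        "vocab_size": len(vocab),
    }

# Toy records, invented for illustration only.
businesses = [
    {"business_id": "a", "city": "Las Vegas"},
    {"business_id": "b", "city": "Las Vegas"},
    {"business_id": "c", "city": "Phoenix"},
]
reviews = [
    {"business_id": "a", "text": "great garlic fries"},
    {"business_id": "a", "text": "fries were cold"},
    {"business_id": "b", "text": "nice place"},
    {"business_id": "c", "text": "good tacos"},
]
menus = {"a": ["garlic fries"], "b": []}

kept = select_restaurants(businesses, reviews, "Las Vegas", top_n=2, menus=menus)
stats = corpus_stats(reviews, kept)
```

In the real setting, `top_n` would be 100 and `menus` would hold the manually collected menus.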