Abstract
This paper proposes a novel framework for automatic dish discovery via word
embeddings on restaurant reviews. We collect a dataset of user reviews from
Yelp and parse the reviews to extract dish words. Then, we utilize the
processed reviews as training texts to learn the embedding vectors of words via
the skip-gram model. A nearest-neighbor-like score function is then proposed
to rank the dishes based on their learned representations. We briefly analyze
the preliminary experiments and present a web-based
visualization at
http://clip.csie.org/yelp/.
Background
With the growth of social media, companies such as Yelp have accumulated a
large amount of user-generated content (UGC). Several studies have sought to
uncover critical information hidden in this content. While much work has
addressed sentiment analysis of reviews and recommendation, little has focused
on dish-level analysis. In this paper, therefore, we aim to provide a novel
framework for automatic dish discovery from restaurant reviews via embedding
techniques. We first use regular expressions to parse restaurant reviews and
extract dish words, and then use the processed reviews as training text to
learn an embedding vector for each word via the skip-gram model. In addition,
a nearest-neighbor-like score function is proposed to rank the dishes by their
learned representations. Preliminary experiments are conducted on a
real-world restaurant review dataset collected from the Yelp Data Challenge.
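To make the parsing step concrete, the following is a minimal sketch of how dish words might be marked with regular expressions and how skip-gram training pairs are formed from the processed text. It is an illustration under stated assumptions, not the implementation used in this paper: the helper names (build_dish_pattern, mark_dishes, skipgram_pairs), the underscore-joined dish tokens, the toy menu, and the window size are all hypothetical.

```python
import re


def build_dish_pattern(menu_items):
    """Compile one case-insensitive regex matching any menu item as a whole phrase."""
    # Assumption: the menu is known up front. Longer names go first so that
    # "garlic fried rice" is preferred over the shorter "fried rice".
    ordered = sorted(menu_items, key=len, reverse=True)
    alternatives = "|".join(re.escape(item) for item in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def mark_dishes(review, pattern):
    """Rewrite each matched dish as a single lowercase, underscore-joined token."""
    # Assumption: joining with underscores is one simple way to make a
    # multi-word dish behave as a single word during embedding training.
    return pattern.sub(lambda m: "_".join(m.group(0).lower().split()), review)


def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


# Toy example (hypothetical menu and review text).
menu = ["fried rice", "garlic fried rice", "pad thai"]
pattern = build_dish_pattern(menu)
marked = mark_dishes("The Garlic Fried Rice beats the pad thai", pattern)
tokens = marked.lower().split()
pairs = list(skipgram_pairs(tokens, window=1))
```

In this sketch each dish becomes one vocabulary entry, so the skip-gram model would learn a single vector per dish rather than separate vectors for its constituent words.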
Dataset
Our preliminary experiments involve a real-world restaurant review dataset
collected from the Yelp Data Challenge.
We first select the 100 most-reviewed restaurants in the Las Vegas area and
then manually extract the menu of each restaurant from its official website.
Out of those 100 restaurants, we keep only the restaurants with a complete
menu; their reviews and menus form our dataset.
In summary, 69 restaurants and 95,578 reviews remain after the filtering; the
average review length is about 147 words and the vocabulary size is 46,017.
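The selection and filtering steps above can be sketched as follows. This is a hedged illustration rather than the actual preprocessing code: the record fields business_id and text, the complete_menus set, whitespace tokenization, and the toy reviews are assumptions made for the example, so the statistics it produces would not necessarily match the figures reported here.

```python
from collections import Counter


def top_restaurants(reviews, n):
    """Return the ids of the n restaurants with the most reviews."""
    counts = Counter(r["business_id"] for r in reviews)
    return [bid for bid, _ in counts.most_common(n)]


def corpus_stats(reviews):
    """Compute review count, average words per review, and vocabulary size."""
    # Assumption: lowercase whitespace tokenization; a real pipeline may differ.
    tokenized = [r["text"].lower().split() for r in reviews]
    total_words = sum(len(toks) for toks in tokenized)
    vocab = {tok for toks in tokenized for tok in toks}
    return {
        "reviews": len(reviews),
        "avg_words": total_words / len(reviews) if reviews else 0.0,
        "vocab_size": len(vocab),
    }


# Toy data (hypothetical); the paper uses n = 100 on Las Vegas restaurants.
reviews = [
    {"business_id": "a", "text": "great pad thai"},
    {"business_id": "a", "text": "pad thai was great"},
    {"business_id": "b", "text": "slow service"},
    {"business_id": "c", "text": "fine"},
    {"business_id": "b", "text": "good noodles here"},
]
top = top_restaurants(reviews, 2)

# Hypothetical set of restaurants whose complete menu could be collected.
complete_menus = {"a"}
kept = [
    r for r in reviews
    if r["business_id"] in top and r["business_id"] in complete_menus
]
stats = corpus_stats(kept)
```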