In this exercise you will prepare some TripAdvisor customer review data for brand positioning analyses. The data are of two kinds: numerical ratings that reviewers gave to the hotels they stayed in, and text comments that they wrote about those hotels.
The kind of analysis that the data you’ll prepare will be used for is called perceptual mapping. Different kinds of analytic methods are used to do perceptual mapping, but the basic objective is to use data on customers’ perceptions to represent brands in a low-dimensional space (a “map”) that can be interpreted based on important brand differences. Different kinds of data can be used to do it. In this GrEx you’ll prepare both numerical data and also text data for it.
The data are in a number of json files. Each file contains data for one hotel. The json files are archived in the zip file hotelCustsSu2017.zip.
This GrEx is in two parts. Part 1 is about getting the numeric ratings data ready, and Part 2 is about readying the text data.
These data have been made available by the researchers of the LARA project: http://www.cs.virginia.edu/~hw5x/dataset.html
A lot of this GrEx has to do with Python `dictionaries`, or “dicts.” If you’re completely new to dicts, you may want to spend a few minutes reviewing what the course readings or online sources say about them. There are “plain vanilla,” or “everyday” dicts, and some specializations can be found in the Python `collections` module.
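As a quick refresher, here is a minimal sketch contrasting a plain dict with two of the specializations in `collections`. The variable names and sample values are made up for illustration.

```python
from collections import Counter, defaultdict

# A plain dict: key:value pairs, created literally and updated by key.
ratings = {"Service": 4, "Cleanliness": 5}
ratings["Value"] = 3

# defaultdict supplies a default value for keys it has not seen before,
# so there is no need to check whether a key exists before updating it.
word_counts = defaultdict(int)
for word in ["pool", "bathroom", "pool"]:
    word_counts[word] += 1

# Counter does the same kind of tallying in a single step.
counter = Counter(["pool", "bathroom", "pool"])

print(ratings)
print(dict(word_counts))
print(counter.most_common(1))
```

Either specialization behaves like an ordinary dict once it is built, so you can pass it to `dict()` if you need a plain one.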
Starter: Get a Look At Some of The Hotel Data
Download the zip file with the data in it, and extract the json files in it. Then, in a Python session or in a Jupyter Notebook, read what’s in the hotel file 100506.json. Here’s a way to do it. First, import the `json` package. Then, assuming that the file is in your current working directory,
with open('100506.json') as input_file:
    jsondat = json.load(input_file)
Take a look at jsondat. You see that it consists of nested dictionaries. As you may know, Python dictionaries hold things in key:value pairs. The values in a dictionary can be other dictionaries, lists, scalars (single numbers), pandas DataFrames and Series, and other things. The key is a label for the thing.
Is jsondat itself actually a dictionary? Try type(jsondat) to see. What are the keys in jsondat, and what types are they? Extract each of the key:value pairs from jsondat and check to see what kinds of Python objects they are. You should find something about hotel information, and another thing about hotel reviews. The latter includes whatever reviews have been provided for this particular hotel. Examine what’s in the reviews data. There may be ratings on various scales, and also comments that reviewers may have made when they posted their reviews.
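Here is one way the inspection might look. This is only a sketch: the small `jsondat` below is a made-up stand-in for a real hotel file, and the key names in it (`HotelInfo`, `Reviews`, `Ratings`, and so on) are assumptions for illustration, so check them against what you actually find.

```python
# Hypothetical stand-in for the contents of one hotel file.
# The key names below are assumptions, not a guaranteed file layout.
jsondat = {
    "HotelInfo": {"Name": "Example Hotel", "HotelID": "100506"},
    "Reviews": [
        {
            "ReviewID": "R1",
            "Author": "ann",
            "Ratings": {"Overall": "4.0"},
            "Content": "Nice pool.",
        },
    ],
}

print(type(jsondat))

# Show each top-level key with the type of the key and of its value.
for key, value in jsondat.items():
    print(key, type(key).__name__, type(value).__name__)

# Look inside the first review, if there is one.
first_review = jsondat["Reviews"][0]
print(sorted(first_review.keys()))
```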
Bear in mind that what’s in any hotel data file may vary.
Processing the json Files for Perceptual Mapping
For both parts of this GrEx you’ll need to read the hotel data files into Python (or Jupyter) for processing. You’ll want to read the files in one at a time, process what’s in the file read, and then move on to reading and processing the next file. You can load what’s in each file using code like the above. Note the use of the ‘with open’ idiom. When you read (or write) a file using it, the file is automatically closed when the with clause ends.
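If you want to convince yourself that the file really is closed, here is a small sketch. It writes a tiny made-up json file to a temporary directory first so that it can run on its own; the file name and contents are illustrative only.

```python
import json
import os
import tempfile

# Write a tiny, made-up json file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as output_file:
    json.dump({"HotelInfo": {"HotelID": "1"}}, output_file)

with open(path) as input_file:
    data = json.load(input_file)
    print(input_file.closed)   # checked while still inside the with block

print(input_file.closed)       # checked after the with block has ended
```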
There are a couple of ways you can specify the files to be read, but you’ll want to read them using a loop of some kind, and in the loop you’ll do the processing of each file.
Something to be aware of from the start is that the hotel files don’t all have exactly the same contents, and the reviewers of the hotels didn’t necessarily provide the same ratings. It’s possible that some hotel files may not have any reviews at all, so you’ll need to be sure that your code can handle such possibilities. (Hint: Do you know about Python `try` and `except`?)
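A loop along these lines is one possibility. Treat it as a sketch under assumptions: the function names `read_hotel_files` and `get_reviews` are hypothetical, and `"Reviews"` is an assumed key name that you should confirm in your own files.

```python
import glob
import json
import os


def read_hotel_files(directory):
    """Yield (path, data) for each json file in directory that can be read."""
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        try:
            with open(path) as input_file:
                data = json.load(input_file)
        except (OSError, json.JSONDecodeError) as err:
            # Report the problem file and move on to the next one.
            print(f"Skipping {path}: {err}")
            continue
        yield path, data


def get_reviews(jsondat):
    """Return the list of reviews, or an empty list if there are none."""
    try:
        return jsondat["Reviews"]   # "Reviews" is an assumed key name
    except KeyError:
        return []
```

You would then write something like `for path, jsondat in read_hotel_files("."):` and do the per-hotel processing inside that loop.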
Part 1: Numeric Perceptual Mapping Data
In this part you will use the data in the json files to create a new data set.
(a) Build it as a `pandas` DataFrame. Your DataFrame should have a row for each hotel review, and the following columns in it:
Each rating given by the reviewer in its own column
Your DataFrame should have at least one row for every hotel. Hotel name, Review ID, and Author name should be string type variables. Review date can also be a string. The other columns should be numeric. Use an appropriate code to represent missing values.
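One way to get there is to build a list of row dicts, one per review, and hand it to `pandas`. The sketch below uses made-up reviews; the key names (`ReviewID`, `Author`, `Date`, `Ratings`), the column names, and the helper `to_number` are all assumptions for illustration. It uses NaN as the missing value code, which is one reasonable choice among others.

```python
import pandas as pd


def to_number(value):
    """Convert a rating to float; return NaN if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


# Made-up data standing in for one hotel's reviews.
hotel_name = "Example Hotel"
reviews = [
    {"ReviewID": "R1", "Author": "ann", "Date": "May 1, 2012",
     "Ratings": {"Overall": "4.0", "Service": "5"}},
    {"ReviewID": "R2", "Author": "bob", "Date": "May 3, 2012",
     "Ratings": {"Overall": "2.0"}},
]

rows = []
for review in reviews:
    row = {
        "hotel_name": hotel_name,
        "review_id": review.get("ReviewID"),
        "author": review.get("Author"),
        "date": review.get("Date"),
    }
    # Each rating gets its own column, converted to a number.
    for scale, value in review.get("Ratings", {}).items():
        row[scale] = to_number(value)
    rows.append(row)

# Ratings that a reviewer did not give are filled in as NaN.
ratings_df = pd.DataFrame(rows)
print(ratings_df.dtypes)
```

In your own code the `rows` list would keep growing as you loop over the hotel files, and you would create the DataFrame once at the end.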
(b) Report the number of reviews for each hotel in the DataFrame.
(c) Calculate and report statistics describing the distribution of the overall rating received by the hotels. (This is across all the hotels, not for each hotel separately.)
(d) Save your DataFrame by pickling it, and verify that your DataFrame was saved correctly. Be sure to save it as you may be asked to share it with the class later in the course.
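Parts (b) through (d) might be sketched as follows, using a tiny made-up DataFrame. The column names `hotel_name` and `Overall` and the file name `ratings_df.pkl` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the DataFrame built in part (a).
ratings_df = pd.DataFrame({
    "hotel_name": ["A", "A", "B"],
    "Overall": [4.0, 2.0, 5.0],
})

# (b) Number of reviews for each hotel.
reviews_per_hotel = ratings_df.groupby("hotel_name").size()
print(reviews_per_hotel)

# (c) Distribution of the overall rating across all hotels.
print(ratings_df["Overall"].describe())

# (d) Pickle the DataFrame, read it back, and compare with the original.
pickle_path = os.path.join(tempfile.mkdtemp(), "ratings_df.pkl")
ratings_df.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)
print(restored.equals(ratings_df))
```

If you keep a placeholder row for a hotel that has no reviews, `size()` will count that row too, so you may prefer to count non-missing review IDs instead.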
Some find `pandas.json_normalize` (formerly `pandas.io.json.json_normalize`) helpful for converting json objects into DataFrame type objects.
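For instance, here is a minimal sketch with two made-up reviews, assuming pandas 1.0 or later and assuming that ratings are nested under a `Ratings` key.

```python
import pandas as pd

# Made-up reviews; the key names are assumptions for illustration.
reviews = [
    {"ReviewID": "R1", "Ratings": {"Overall": "4.0", "Service": "5"}},
    {"ReviewID": "R2", "Ratings": {"Overall": "2.0"}},
]

# Nested dicts are flattened into columns with dotted names.
flat = pd.json_normalize(reviews)
print(flat.columns.tolist())
print(flat)
```

Note that the values keep whatever type they had in the json data, so ratings stored as strings would still need converting to numbers.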
Part 2: Text Data for Perceptual Mapping
Here’s the fun part of this GrEx. It’s time to get dirty with some text data.
Each hotel review may include text that the reviewing author wrote. You are going to create a json file that includes for each hotel a dictionary summarizing the “content” words in the authors’ text. This text is the value of the “Content” key in a review.
You want to process the written comments for each hotel, one hotel at a time, to put the number of times each “content” word occurs in the comments into a dict. This dict should have the words as keys, and the counts of the times they occur as the values of the keys. For example, suppose the word ‘bathroom’ occurs 50 times in the comments for a hotel. Then the dict of words for this hotel would have a key of ‘bathroom’, and the value associated with it would be 50.
A “content” word is a word that carries meaning, and that is not a stop word. Stop words are words like prepositions that are often removed when doing text mining or natural language processing (NLP). They are a “fuzzy set” in that there’s no universally agreed upon set of them. See https://en.wikipedia.org/wiki/Stop_words for a bit about them.
To build your dictionary of content word counts for a hotel, you need to exclude stop words, and also any other things like html tags and punctuation marks. Python resources you can use for this include the string, Beautiful Soup, stop-words, and (perhaps) urllib packages. (nltk, too, if you’re feeling adventurous or if you’re familiar with it.)
Now here’s a way you might approach the task for creating a content word count dictionary for a hotel. You might decide to do it in a different way, of course.
1 Create one big long string of the contents of comments about the hotel.
2 Clean the string of all html tags, punctuation, and other “non-word” stuff. This should leave you with a string of words separated by spaces. You might want to remove digits, too.
(Note: there can always be anomalies to contend with when pounding text data into ‘submission.’ And head-scratchers to be resolved, too. For example, what should we do about apostrophes, like in “we’ll?” Oh, well…)
3 Convert the string to a list of words by splitting the string on spaces.
4 Remove all stop words from the list, leaving in it your “content” words.
5 If you’re feeling adventurous, you might want to do word stemming, as well, e.g. using the stemming package or some other resource. (See https://en.wikipedia.org/wiki/Stemming about stemming if you’re interested.)
6 Create a dict from the list in which the keys are the “content” words, and their values are the number of times each word occurs.
(Hint: Think about iterating through your list to build up your dict.)
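The steps above could be sketched with only the standard library, as below. This is one possible approach, not the required one: the function name `content_word_counts` is hypothetical, `STOP_WORDS` is a tiny illustrative set rather than a real stop word list, the regular expression is a crude substitute for a proper html parser such as Beautiful Soup, and step 5 (stemming) is skipped.

```python
import re
import string

# A tiny illustrative stop word set; a real list would be much longer.
STOP_WORDS = {"the", "a", "was", "and", "in", "we", "it", "is"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def content_word_counts(comments, stop_words=STOP_WORDS):
    """Return a dict of content word counts for a list of comment strings."""
    text = " ".join(comments).lower()        # step 1: one long string
    text = re.sub(r"<[^>]+>", " ", text)     # step 2: crude html tag removal
    text = re.sub(r"\d+", " ", text)         #         drop digits
    text = text.translate(PUNCT_TO_SPACE)    #         drop punctuation
    words = text.split()                     # step 3: split on whitespace
    content = [w for w in words if w not in stop_words]   # step 4

    counts = {}                              # step 6: build up the dict
    for word in content:
        counts[word] = counts.get(word, 0) + 1
    return counts


comments = ["The bathroom was <br/>clean.", "Bathroom and pool: 2 thumbs up!"]
print(content_word_counts(comments))
```

With this approach a straight apostrophe is treated as punctuation, so “we'll” becomes the two words “we” and “ll”, while curly apostrophes are left in place because they are not in `string.punctuation`. `collections.Counter` could replace the counting loop if you prefer.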
Creating the json file of hotel word dictionaries
Once you’ve done the above for a hotel’s comments, you need to add the result to a dict that you’ll save as a json file. This dict will be a “dict of dicts.” Each value in it will be your dict of word counts for one hotel, and its key will be that hotel’s ID, which is a number.
Imagine that you name your “dict of dicts” hotWords. The keys in hotWords would be the hotel IDs. The value associated with each hotel ID key is a hotel’s content word count dictionary.
hotWords.keys() would have hotel IDs in it. (What type is hotWords? What about hotWords.keys()? )
Each of the hotel ID keys in hotWords.keys() would have a dict of words and word counts as its value, assuming of course that any comments were made about the hotel.
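A minimal sketch of such a “dict of dicts” follows. The word counts are made up, and the second hotel ID is a hypothetical one used to show a hotel with no comments.

```python
hotWords = {}

# Made-up word count dicts, keyed by hotel ID.
hotWords[100506] = {"bathroom": 50, "pool": 12}
hotWords[999999] = {}   # hypothetical hotel with no comments

print(type(hotWords))
print(type(hotWords.keys()))
print(list(hotWords.keys()))

# Look up one word's count for one hotel.
print(hotWords[100506]["bathroom"])
```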
To summarize, here’s what you need to do for Part 2:
- Create for each hotel a dict of comment content words and their frequencies, the counts of their occurrences.
- Add each of the hotel content word dicts to a dict with their hotel IDs as the keys.
- Write this dict to a json file, and verify that it is written correctly.
- Report the number of unique content words in each of the hotel’s dicts. (Unique content words, not how many times the words occurred.)
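The last two items might look something like this sketch. The contents of `hotWords` and the file name `hotWords.json` are made up for illustration. One thing to watch for when verifying: json object keys are always strings, so numeric hotel IDs come back as strings when the file is read.

```python
import json
import os
import tempfile

# Made-up "dict of dicts"; the second hotel ID is hypothetical.
hotWords = {100506: {"bathroom": 50, "pool": 12}, 999999: {}}

# Write the dict to a json file.
json_path = os.path.join(tempfile.mkdtemp(), "hotWords.json")
with open(json_path, "w") as output_file:
    json.dump(hotWords, output_file)

# Read it back and compare, allowing for the keys having become strings.
with open(json_path) as input_file:
    restored = json.load(input_file)
expected = {str(hotel_id): counts for hotel_id, counts in hotWords.items()}
print(restored == expected)

# Number of unique content words per hotel is the number of keys in its dict.
unique_counts = {hotel_id: len(counts) for hotel_id, counts in hotWords.items()}
print(unique_counts)
```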
Be sure to save the json file you “serialized,” as you may be asked to share it later.
In no more than seven (7) pages, and in a single pdf file, provide the following.
Provide your commented, syntactically correct code for each step you took to create the results described.
Be sure to save the data files you created.
For each of the steps above, provide commented, syntactically correct code. Do not submit all of your code for this assignment in a single block, followed by a block of all of your output. Assignments organized in this manner will not be graded.