Steven Lin | Movie Script Analytics

Exploratory Analysis

The exploratory results show the top characters, locations and scenes, obtained by simply counting the number of utterances (quotes). The unigram analysis matches the exploratory results, with Riley and Joyce as the main characters. In addition, the analysis highlights memory as an important theme in the movie.

Because the characters are named after emotions (Joy, Sadness, Fear, Anger and Disgust), their names were replaced with common names (e.g. Joyce, Felipe, Angelo and Diane) before the text analysis. This avoids problems with parsing and named entity recognition, and avoids biasing the sentiment analysis (e.g. Sadness mentions Joy most of the time).

The unigram word cloud built from the quotes and descriptions in the script conveys basic information about characters, locations and events. For example, Riley and Joyce are the two main characters, followed by Sadness and Bing Bong.
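The counting behind these exploratory results can be sketched as follows. This is a minimal illustration, not the project's code: the `quotes` list, the `tokenize` helper and the sample utterances are hypothetical stand-ins for the parsed script.

```python
import re
from collections import Counter

# Toy (speaker, utterance) pairs; hypothetical stand-ins for the parsed script.
quotes = [
    ("Joyce", "Do you remember this memory?"),
    ("Riley", "I remember the lake."),
    ("Joyce", "Another happy memory for Riley!"),
]


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Number of utterances per character.
quotes_per_character = Counter(speaker for speaker, _ in quotes)

# Unigram frequencies over all utterances (the input to a word cloud).
unigram_counts = Counter(
    token for _, text in quotes for token in tokenize(text)
)
```

Calling `quotes_per_character.most_common()` or `unigram_counts.most_common()` then ranks characters or words by frequency.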

The distribution of quotes by character and time index (quote index) is shown next. Quotes from the five emotions are spread across the whole movie. Joy, Sadness and Fear are the three characters with the most quotes, so their quote distributions are denser than the others. One highlight is that Bing Bong appears within a certain time range, shown by the yellow region in the figure. Another is that Mom’s and Dad’s emotions first appear together during the conversation at the San Francisco dinner table.

The next figure shows the distribution of quotes by location and time index (quote index). Headquarter, Long-term Memory and San Francisco House are the three locations with the most quotes, so their quote distributions are denser than the others. Three circled areas in the figure highlight interesting patterns.

Sentiment Analysis

The sentiment of each character matched the personality that character represents. For example, Joy has one of the most positive moods, while Disgust has the most negative. In addition, the sentiment tracks turning points in the movie. For instance, positive sentiment drops when Riley moves to San Francisco, and drops further in scenes with drama and struggle (e.g. the dinner argument and the attempt to escape from long-term memory). The sentiment becomes more positive at the hockey rink, because of the cheering, and at the end of the movie, when Joy is back.

We visualized the sentiment results for nine main characters. On the right are the average binary sentiment and the average count sentiment. The two methods give similar results: they rank the characters in the same order by average sentiment. Mom and Joy are the two most positive characters, and Sadness and Disgust are the only negative characters among them. Count sentiment gives more weight to the sentiment scores (the counts of positive and negative words), so its averages span a larger range than those of binary sentiment. We therefore selected the count sentiment method for the remaining sentiment visualizations.
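The difference between the two scoring methods can be illustrated with a small sketch. The definitions here are assumptions made for illustration: count sentiment is taken as the number of positive words minus the number of negative words in a quote, and binary sentiment as the sign of that score. The word lists and sample quotes are a hypothetical toy lexicon, not the one used in the project.

```python
# Hypothetical toy lexicon; the project's actual lexicon is not specified here.
POSITIVE = {"great", "love", "happy", "fun"}
NEGATIVE = {"sad", "worst", "hate", "awful"}


def count_sentiment(text):
    """Assumed definition: positive word count minus negative word count."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg


def binary_sentiment(text):
    """Assumed definition: +1, 0 or -1 depending on the sign of the count score."""
    score = count_sentiment(text)
    return (score > 0) - (score < 0)


def average(scores):
    """Mean of an iterable of scores (0.0 if it is empty)."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else 0.0


# Toy quotes for one character.
quotes = ["happy happy fun day", "this is the worst", "we love hockey"]
avg_count = average(count_sentiment(q) for q in quotes)
avg_binary = average(binary_sentiment(q) for q in quotes)
```

Under these assumptions a quote with several positive words moves the count average more than the binary average, which is consistent with the wider range reported for count sentiment.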

In the movie, Joy and Sadness are always together, so it is interesting to compare their sentiment scores. In the figure below, Joy has more positive scores than negative ones; in contrast, Sadness has more negative scores than positive ones. This matches the personalities of the two characters.

The image below shows the average sentiment score for each scene. The average dropped from ‘Riley is Born’ to ‘Moving to San Francisco’. After Joy and Sadness fell into Long-term Memory, Riley’s emotions were no longer controlled by Joy, and the sentiment score dropped sharply to negative levels until they met Bing Bong. The scene ‘Hockey Tryouts’ has the highest sentiment score, since it contains a lot of encouragement.

Topic Analysis

Topic analysis (LDA) shows that the detected topics reflect the personalities of characters, the moods of scenes and the themes of each location.

LDA was used to determine the topics by character, location and scene. LDA outputs the topic distribution for each document (e.g. a character) and the word distribution for each detected topic. The input was the TF-IDF matrix. The top 10 words for each of the top 10 topics are shown on the right.
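As an illustration of the input step only, the sketch below builds a small TF-IDF matrix in plain Python. The weighting variant (term frequency divided by document length, times log(N / document frequency)) is an assumption, since the exact formula used in the project is not stated, and the toy documents are hypothetical. The LDA fit itself is not shown.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents: all quotes of one character concatenated and tokenized.
docs = {
    "Joyce": "happy memory happy core".split(),
    "Riley": "hockey memory home".split(),
}
names = list(docs)
rows = tfidf([docs[name] for name in names])
```

With this variant, a term that occurs in every document (here `memory`) gets weight zero, so only terms that distinguish one document from another carry weight.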

The plots on the left are bubble charts of topic frequency by character. We found that topic 28 is significant only for Bing Bong and topic 26 only for Sadness. The figure also shows the top words of topics 28 and 26. The words ‘fly’ and ‘thank’ in topic 28 match Bing Bong’s personality; similarly, the word ‘worry’ in topic 26 matches Sadness’s personality.

The plots on the right are bubble charts of topic frequency by location. We found that topic 16 is significant only for Minnesota House and topic 17 only for San Francisco. The top words of topics 16 and 17 are shown as well. The words ‘skate’ and ‘money’ in topic 16 refer uniquely to Riley’s life in Minnesota. Moreover, the words ‘worst’ and ‘introduce’ in topic 17 reflect the bad time in San Francisco.

We plotted the overall top 5 topics across scenes. Topics 3 and 17 are comparatively more related to each scene than the other topics. Depending on their keywords, the distributions of some topics fluctuate across scenes. For example, topic 7, which includes the keywords ‘Riley’, ‘Dad’ and ‘mom’, is relevant in the scenes ‘Riley is Born’, ‘Dinner Argument’ and ‘Riley is Back’ but not in ‘Meet Bing Bong’.

Document Similarity

Document similarity analysis shows that the following features lead to high document similarity: characters with similar personalities, two locations that share scenes and characters, and two scenes that take place at the same location with the same characters.

Fear and Disgust have a high similarity because their personalities are alike. The same holds for the high similarity between Joy and Bing Bong. Joy and Sadness have a high similarity because they always appear together in the same locations and scenes.

Headquarter has a high similarity with both Long-term Memory and San Francisco because these locations share scenes and characters.
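One common way to score the similarity of two documents is the cosine of the angle between their term-weight vectors. The sketch below assumes cosine similarity over sparse `{term: weight}` vectors; this section does not name the measure that was used, so treat that choice, the `cosine_similarity` helper and the toy vectors as illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors: two documents with overlapping vocabulary
# and one with no shared terms.
vec_a = {"scared": 1.0, "gross": 1.0}
vec_b = {"scared": 1.0, "gross": 1.0, "broccoli": 1.0}
vec_c = {"hockey": 2.0}

sim_ab = cosine_similarity(vec_a, vec_b)  # overlapping vocabulary
sim_ac = cosine_similarity(vec_a, vec_c)  # no shared terms
```

Documents that share many weighted terms, such as two characters who speak in the same scenes, score close to 1, while documents with no terms in common score 0.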

Clustering Analysis

Hierarchical clustering with Ward linkage performs well for characters: the analysis yields separate groups of emotions, physical-world characters and supporting characters.

Clustering was conducted on the TF-IDF matrix using k-means with different numbers of clusters for characters, locations and scenes. The results show that the main characters are assigned to separate clusters, while the remaining characters are usually assigned to a single cluster.

However, because the dataset is small (e.g. few characters, locations and scenes), hierarchical clustering was also performed, as it is more appropriate for small datasets. The algorithm merges or splits clusters recursively according to a linkage criterion (based on cosine distance in this case). Ward, average and complete linkages were tested, and Ward linkage seems to perform better than the others.
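The merging procedure can be sketched in a few lines of plain Python. Note that this sketch uses average linkage, one of the three linkages tested, rather than the preferred Ward linkage, because average linkage is simple to write directly from pairwise distances. The `average_linkage` function, the labels and the distance values are hypothetical toy inputs, not results from the project; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be used instead.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering: merge the closest clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n_clusters = max(n_clusters, 1)
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up distances (e.g. 1 - cosine similarity) between four toy documents.
labels = ["Riley", "Mom", "Dad", "Extra"]
dist = [
    [0.00, 0.20, 0.30, 0.90],
    [0.20, 0.00, 0.10, 0.80],
    [0.30, 0.10, 0.00, 0.85],
    [0.90, 0.80, 0.85, 0.00],
]
groups = [[labels[i] for i in c] for c in average_linkage(dist, 2)]
```

Recording the order and distance of each merge, instead of stopping at a fixed number of clusters, gives the information a dendrogram displays.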

The figure on the right shows the dendrogram for characters using Ward linkage. At a high level, three clusters are formed: the main characters (e.g. Riley, Mom, Dad, Bing Bong and the emotions), Mom’s emotions, and the rest of the characters. At a lower level, Mom and Dad are grouped together and then joined with Riley. Similar patterns appear among the emotions and the rest of the characters, indicating that the clustering makes sense overall.

Want to know more?

Visit the project GitHub repo to view the comprehensive report and code.

INSIDE OUT Movie Script Text Analytics
