NLP on Movie Scripts + Recommendation System

ML Techniques

  • Unsupervised learning
  • NLP
  • Topic modeling
  • Non-negative matrix factorization (NMF)

Libraries/tools

  • NLTK
  • TfidfVectorizer
  • Beautiful Soup
  • Pandas
  • Plotly

Purpose

For this project, I wanted to create a tool that could be used both by movie production companies to expedite the process of identifying promising scripts and by users to generate movie recommendations based on similar topics found in different scripts.

Process

I started by scraping 70 full-length movie scripts from www.imsdb.com, cleaning them, and splitting each one into 100 chunks. I split the scripts this way because the full-length documents contained too many words to be analyzed effectively at once. I also wrote a pre-processing function that incorporated lemmatization, stop word removal, and part-of-speech tagging.
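
As a rough illustration of the chunking step, here is a minimal sketch. It assumes the split is done on whitespace-separated words into pieces of near-equal length; the write-up does not say how the chunk boundaries were actually chosen, and the function name split_into_chunks is hypothetical. The NLTK pre-processing is not shown.

```python
from __future__ import annotations


def split_into_chunks(text: str, n_chunks: int = 100) -> list[str]:
    """Split a script into n_chunks pieces with near-equal word counts.

    Assumption: chunks are cut on whitespace-separated words, not on
    scene boundaries. Scripts shorter than n_chunks words yield some
    empty chunks.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    words = text.split()
    # Every chunk gets `base` words; the first `extra` chunks get one more.
    base, extra = divmod(len(words), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks
```

Each returned chunk would then be passed through the pre-processing function before vectorization.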

I then used TfidfVectorizer to convert each pre-processed chunk into a document-word vector. These vectors represent all of the words (after pre-processing) that are in each chunk of each script. The document-word vectors were then condensed into just 11 topics (attributes) through the use of non-negative matrix factorization (NMF). The result was a set of 'topic-word' and 'document-topic' vectors. The former show the most heavily weighted words in each topic. A few of the topics included action, comedy, horror, and fantasy. The doc-topic vectors quantitatively represent how much of each topic is present in each chunk (document).

To get total topic scores for each movie, as opposed to just scores for each chunk, I used a simple pandas groupby on the doc-topic dataframe to collect all chunks from the same movie and then summed their scores. Since each movie had the same number of chunks (100), this was a fairly simple way to generate cumulative topic scores for each movie.
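
A minimal sketch of that aggregation is shown below. The column names ("movie", "action", "horror"), the placeholder titles, and the scores are all invented for illustration; the real dataframe had 11 topic columns and 100 rows per movie.

```python
import pandas as pd

# Toy doc-topic dataframe: one row per chunk, tagged with its movie.
doc_topic_df = pd.DataFrame(
    {
        "movie": ["Movie A", "Movie A", "Movie B", "Movie B"],
        "action": [0.30, 0.10, 0.00, 0.05],
        "horror": [0.00, 0.20, 0.40, 0.35],
    }
)

# Sum the chunk-level topic scores within each movie to get one
# movie-topic vector per title.
movie_topic_df = doc_topic_df.groupby("movie").sum()
```

The result is indexed by movie title, with one cumulative score per topic column.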

Now, with each movie represented by its scores on clearly defined topics (movie-topic vectors), the last step was to use this information as the backbone of a simple movie recommendation system. To create this, I calculated the cosine similarities between all of the movie-topic vectors and then wrote a function that outputs the four movies most similar to a given input movie. For example, if the input movie was 2019's 'Joker', the function would output 'Scream', 'Insidious', 'The Book of Eli', and 'Get Out'.

Please see the presentation I gave about this project below (it was required to be brief) and the GitHub repository for the project files.