MOVIE SIMILARITY RANKING TOOL

moviesimilarityranker.PNG

The goal of this project was to generate a list of movies whose plot descriptions are similar to a user-defined plot description. A database of 35,000 Wikipedia movie plot descriptions was used. To compare the plot descriptions, I used spaCy for natural language processing and scikit-learn for tf-idf to determine similarity. In the example above, the first paragraph of the Spider-Man plot description from Wikipedia was used as the user-defined plot description. Unsurprisingly, Spider-Man is the first result. The similarity is ~0.51 rather than 1.00 because I used only the first paragraph of the Wikipedia plot summary, whereas the database file contains the entire plot description.
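To illustrate why a partial description still ranks its own movie first but scores below 1.00, here is a minimal sketch of tf-idf ranking with cosine similarity. It is not the project's code: it uses only the Python standard library in place of scikit-learn, and the titles, plots, and function names (`tfidf_vectors`, `cosine`, `rank`) are made up for the example. The idf formula is the smoothed form that scikit-learn documents as its default.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build one sparse tf-idf vector (a dict) per tokenized document."""
    n_docs = len(token_lists)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, titles, token_lists):
    """Return (title, similarity) pairs, most similar first."""
    vectors, idf = tfidf_vectors(token_lists)
    # Query terms that never occur in the database carry no weight.
    query_vec = {t: c * idf[t] for t, c in Counter(query_tokens).items() if t in idf}
    scores = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, already-preprocessed plots (hypothetical, not from the real database).
titles = ["Spider Movie", "Space Movie", "Heist Movie"]
plots = [
    "student bite spider gain power fight villain save city lose uncle",
    "astronaut crew travel planet find alien signal",
    "thief crew plan rob casino vault",
]
token_lists = [plot.split() for plot in plots]

# The query covers only the opening of the first plot.
query = "student bite spider gain power".split()
ranking = rank(query, titles, token_lists)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

In this toy setup the query matches only "Spider Movie", which comes out on top with a similarity of about 0.67. The stored plot contains six terms the query lacks, and those terms enlarge the document vector's norm without adding to the dot product, which is the same effect that keeps the Spider-Man score near 0.51.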

Stop Words Removal

To make tf-idf (the comparison algorithm) more effective, some of the most common words in the English language (known as stop words in natural language processing) are removed. In addition, punctuation is removed and the words in the descriptions are lemmatized (for example, running becomes run).
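The preprocessing steps can be sketched roughly as follows. This is a simplified stand-in, not the project's implementation: the stop-word set and the lemma lookup table are tiny hypothetical examples, and `preprocess` is a name chosen for illustration. With spaCy, the same filtering would typically rely on the token attributes `is_stop`, `is_punct`, and `lemma_`.

```python
import string

# Tiny illustrative stop-word set; real NLP libraries ship far larger lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "of", "to", "in", "he", "by"}

# Toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"running": "run", "bitten": "bite", "gains": "gain", "powers": "power"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    tokens = text.lower().translate(STRIP_PUNCTUATION).split()
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The student is bitten by a spider, and he gains powers!"))
# ['student', 'bite', 'spider', 'gain', 'power']
```

Only the content-bearing words survive, each in its base form, so descriptions that phrase the same events differently still share terms.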

Plot Description Entry

The user-defined plot description goes through the same stop word removal and lemmatization process as the descriptions in the database, so that both are compared over the same set of terms.
