MOVIE SIMILARITY RANKING TOOL
The goal of this project was to generate a list of movies whose plot descriptions are similar to a user-defined plot description, using a database of 35,000 Wikipedia movie plot descriptions. To compare the plot descriptions, I used spaCy for natural language processing and scikit-learn's tf-idf implementation to measure similarity. In the example shown, the first paragraph of the Spider-Man plot description from Wikipedia was used as the user-defined plot description. Unsurprisingly, Spider-Man is the first result. The similarity is ~0.51 rather than 1.00 because I used only the first paragraph of the Wikipedia plot summary, whereas the database file contains the entire plot description.
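The ranking step described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes the similarity measure is cosine similarity over tf-idf vectors, and the function name `rank_by_similarity`, the three toy titles, and their plots are all hypothetical stand-ins for the real database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_similarity(query, titles, plots, top_n=5):
    """Return the top_n (title, score) pairs most similar to the query."""
    # Learn the vocabulary and idf weights from the database plots only.
    vectorizer = TfidfVectorizer()
    plot_matrix = vectorizer.fit_transform(plots)
    # Project the user-defined description into the same tf-idf space.
    query_vector = vectorizer.transform([query])
    # Assumption: cosine similarity is the comparison measure.
    scores = cosine_similarity(query_vector, plot_matrix).ravel()
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical toy data standing in for the Wikipedia plot database.
titles = ["Spider Film", "Space Film", "Heist Film"]
plots = [
    "A student is bitten by a spider and gains powers. "
    "He later fights a masked villain to protect the city.",
    "A crew travels through a wormhole to find a new home for humanity.",
    "A team of thieves plans an elaborate robbery of a casino vault.",
]

# The query is only the first sentence of the first plot.
query = "A student is bitten by a spider and gains powers."
results = rank_by_similarity(query, titles, plots)
for title, score in results:
    print(f"{title}: {score:.2f}")
```

Because the query covers only part of the matching plot, its top score is below 1.00 even though that plot still ranks first, which mirrors the Spider-Man result above.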
Stop Words Removal
Before tf-idf (the comparison algorithm) is applied, some of the most common words in the English language (known as stop words in natural language processing) are removed to make the comparison more effective. In addition, punctuation is removed and the words in the descriptions are lemmatized (for example, "running" becomes "run").

Plot Description Entry
The user-defined plot description goes through the same stop word removal and lemmatization as the descriptions in the database.