The scenario

Machine learning is a growing field, and it seems that almost every company is deploying machine learning algorithms to solve problems. The question is: is machine learning actually necessary, or is it just so companies can say 'I'm doing that too'?

I was recently speaking to a satellite TV company in the UK that wanted to build a movie recommendation algorithm: one that recommends movies a user might like, based on their previous purchasing behaviour and the star ratings they gave films after watching them. They were looking to solve the problem with a machine learning algorithm.

My question is this: if I were to take only the movies the user had rated 4 or 5 stars, and then search the movie database for other movies with similar properties, would that not provide an accurate recommendation?
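As a minimal sketch of that first filtering step, assume a hypothetical per-user ratings table with `movie_title` and `stars` columns. Neither the table nor the column names come from the TV company's data; they are made up for illustration.

```python
import pandas as pd

# Hypothetical viewing history for one user; the star ratings are illustrative only
user_ratings = pd.DataFrame({
    'movie_title': ['Avatar', 'Spectre', 'The Dark Knight Rises'],
    'stars': [5, 2, 4],
})

# Keep only the movies the user rated 4 or 5 stars
liked = user_ratings.loc[user_ratings['stars'] >= 4, 'movie_title'].tolist()
print(liked)
```

Each title in `liked` would then be used as the starting point for a similarity search like the one built in the rest of this notebook.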

Bringing data into Pandas & stripping out the columns we don't need

In [8]:
import pandas as pd

#Import your data & strip out the columns you don't need
path = 'movie.csv'
df = pd.read_csv(path, delimiter=',', header='infer')
#.copy() gives df2 its own data, so new columns can be added to it safely later on
df2 = df[['movie_title', 'director_name', 'content_rating', 'actor_1_name', 'genres']].copy()
df2.head()
Out[8]:
   movie_title                                 director_name      content_rating  actor_1_name     genres
0  Avatar                                      James Cameron      PG-13           CCH Pounder      Action|Adventure|Fantasy|Sci-Fi
1  Pirates of the Caribbean: At World's End    Gore Verbinski     PG-13           Johnny Depp      Action|Adventure|Fantasy
2  Spectre                                     Sam Mendes         PG-13           Christoph Waltz  Action|Adventure|Thriller
3  The Dark Knight Rises                       Christopher Nolan  PG-13           Tom Hardy        Action|Thriller
4  Star Wars: Episode VII - The Force Awakens  Doug Walker        NaN             Doug Walker      Documentary

Select a row for a specific movie from the dataframe

In [9]:
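#define the movie you're looking for & then search for it
title = "Pirates of the Caribbean: At World's End"
#regex=False matches the title as plain text; na=False skips rows with a missing title
chosen = df2.loc[df2['movie_title'].str.contains(title, regex=False, na=False)]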
chosen 
Out[9]:
   movie_title                               director_name   content_rating  actor_1_name  genres
1  Pirates of the Caribbean: At World's End  Gore Verbinski  PG-13           Johnny Depp   Action|Adventure|Fantasy

Use iloc to store cell contents in variables

In [10]:
#as the search above returns a single row, we can pick the data out of each cell & store it in a variable
chosen_director = chosen.iloc[0]['director_name']
chosen_rating = chosen.iloc[0]['content_rating']
chosen_actor = chosen.iloc[0]['actor_1_name']
chosen_genre = chosen.iloc[0]['genres']

print('Director: ' + chosen_director)
print('Rating: ' + chosen_rating)
print('Actor: ' + chosen_actor)
print('Genre: ' + chosen_genre)
Director: Gore Verbinski
Rating: PG-13
Actor: Johnny Depp
Genre: Action|Adventure|Fantasy

Split genre at the | so we can iterate over it

In [11]:
chosen_genre = chosen_genre.split('|')
print(chosen_genre)
['Action', 'Adventure', 'Fantasy']

Generate the ratings match score

We then compare each field, one by one, with the other movies in the dataset. When a field matches, the movie gets a score of 1 for that field; when it doesn't, it gets a zero. The only complication here is genre: a movie can have several genres, pipe separated. So we split the chosen movie's genres into a list (above) and then iterate through that list, adding one point for every genre the two movies share (below).
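To make the scoring rule concrete, here is a small sketch that scores one pair of movies in plain Python. The helper name `pair_score` is hypothetical and not part of the notebook; the two movies use the values shown in the table above.

```python
def pair_score(chosen, other):
    """Score `other` against `chosen`: 1 point per matching field, plus 1 per shared genre."""
    score = 0
    for field in ('content_rating', 'actor_1_name'):
        if chosen[field] == other[field]:
            score += 1
    # Split both genre strings on '|' and count the genres they have in common
    shared = set(chosen['genres'].split('|')) & set(other['genres'].split('|'))
    return score + len(shared)

pirates = {'content_rating': 'PG-13', 'actor_1_name': 'Johnny Depp',
           'genres': 'Action|Adventure|Fantasy'}
spectre = {'content_rating': 'PG-13', 'actor_1_name': 'Christoph Waltz',
           'genres': 'Action|Adventure|Thriller'}

# 1 (rating) + 0 (actor) + 2 (shared genres: Action, Adventure)
print(pair_score(pirates, spectre))  # prints 3
```

The cells below compute the same thing column by column across the whole dataframe rather than one pair at a time.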

In [12]:
#convert the content_rating column of the dataframe to a list
ratings = df2.content_rating.tolist()
ratings[:4]
Out[12]:
['PG-13', 'PG-13', 'PG-13', 'PG-13']
In [13]:
#create a new empty list to store the ratings in
new_rating = []

#iterate over each item in the ratings list. Append 1 if it matches the chosen rating and 0 if it doesn't
for x in ratings:
    if x == chosen_rating:
        new_rating.append(1)
    else:
        new_rating.append(0)
            
#Convert the list into a pandas series
rating_score = pd.Series(new_rating)

#Add the pandas series to the dataframe
df2['rating_score'] = rating_score.values
df2.head()
Out[13]:
   movie_title                                 director_name      content_rating  actor_1_name     genres                           rating_score
0  Avatar                                      James Cameron      PG-13           CCH Pounder      Action|Adventure|Fantasy|Sci-Fi  1
1  Pirates of the Caribbean: At World's End    Gore Verbinski     PG-13           Johnny Depp      Action|Adventure|Fantasy         1
2  Spectre                                     Sam Mendes         PG-13           Christoph Waltz  Action|Adventure|Thriller        1
3  The Dark Knight Rises                       Christopher Nolan  PG-13           Tom Hardy        Action|Thriller                  1
4  Star Wars: Episode VII - The Force Awakens  Doug Walker        NaN             Doug Walker      Documentary                      0
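The loop above can also be written as a single vectorized comparison, which is the more usual pandas style. This is a sketch on a small stand-in dataframe (the `movies` name is an assumption; the values are taken from the table above), not a change to the notebook's own cells.

```python
import pandas as pd

# Small stand-in for df2, using rows from the table above
movies = pd.DataFrame({
    'movie_title': ['Avatar', 'Spectre', 'Star Wars: Episode VII - The Force Awakens'],
    'content_rating': ['PG-13', 'PG-13', None],
})
chosen_rating = 'PG-13'

# The comparison yields True/False per row; astype(int) turns that into 1/0.
# A missing rating compares as False, so it scores 0.
movies['rating_score'] = (movies['content_rating'] == chosen_rating).astype(int)
print(movies['rating_score'].tolist())  # prints [1, 1, 0]
```

The same one-liner works for any field that is compared for exact equality, such as the lead actor.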

Now generate the same for all the other fields

In [14]:
#does actor1 match?
actors = df2.actor_1_name.tolist()
new_actors = []
for x in actors:
    if x == chosen_actor:
        new_actors.append(1)
    else:
        new_actors.append(0)
        
actor_score = pd.Series(new_actors)
df2['actor_score'] = actor_score.values

#does genre match? This is a bit more complex, as we have multiple genres for each movie.
#The more genres that match, the higher the score
genres = df2.genres.tolist()
new_genre = []
for x in genres:
    #split on '|' so that e.g. 'Music' does not match 'Musical'; a missing value has no genres
    movie_genres = x.split('|') if isinstance(x, str) else []
    ysum = 0
    for y in chosen_genre:
        if y in movie_genres:
            ysum = ysum+1
    new_genre.append(ysum)

    
genre_score = pd.Series(new_genre)
df2['genre_score'] = genre_score.values
df2.head()
Out[14]:
   movie_title                                 director_name      content_rating  actor_1_name     genres                           rating_score  actor_score  genre_score
0  Avatar                                      James Cameron      PG-13           CCH Pounder      Action|Adventure|Fantasy|Sci-Fi  1             0            3
1  Pirates of the Caribbean: At World's End    Gore Verbinski     PG-13           Johnny Depp      Action|Adventure|Fantasy         1             1            3
2  Spectre                                     Sam Mendes         PG-13           Christoph Waltz  Action|Adventure|Thriller        1             0            2
3  The Dark Knight Rises                       Christopher Nolan  PG-13           Tom Hardy        Action|Thriller                  1             0            1
4  Star Wars: Episode VII - The Force Awakens  Doug Walker        NaN             Doug Walker      Documentary                      0             0            0
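One detail worth checking in the genre step is how the match is tested. A sketch using set intersection is shown below; `shared_genres` is a hypothetical helper, and the 'Comedy|Musical' row is an invented example to show why comparing whole genre names is safer than a substring test on the raw string.

```python
import pandas as pd

def shared_genres(cell, wanted):
    """Count how many genres in `wanted` appear in a pipe-separated genre string."""
    # Missing values are not strings, so they share no genres
    if not isinstance(cell, str):
        return 0
    return len(wanted & set(cell.split('|')))

genres = pd.Series(['Action|Adventure|Fantasy|Sci-Fi', 'Action|Thriller', 'Documentary', None])
chosen_genre = {'Action', 'Adventure', 'Fantasy'}

genre_score = genres.apply(lambda cell: shared_genres(cell, chosen_genre))
print(genre_score.tolist())  # prints [3, 1, 0, 0]

# A substring test would treat 'Music' as present in 'Comedy|Musical'; the set test does not
print('Music' in 'Comedy|Musical')                 # prints True
print(shared_genres('Comedy|Musical', {'Music'}))  # prints 0
```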

Calculate the total similarity score

In [16]:
#Calculating match score
df2['match_score'] = df2['actor_score']+df2['rating_score']+df2['genre_score']

#limit result set to just the movie title & the score 
#and sort the dataframe to show most relevant movies first.
search_results = df2[['movie_title', 'match_score']].sort_values(by=['match_score'], ascending=False)
search_results.head()
Out[16]:
     movie_title                                        match_score
13   Pirates of the Caribbean: Dead Man's Chest         5
205  Pirates of the Caribbean: The Curse of the Bla...  5
18   Pirates of the Caribbean: On Stranger Tides        5
1    Pirates of the Caribbean: At World's End           5
108  Warcraft                                           4

It is no surprise that the other Pirates of the Caribbean movies are at the top of the list (the chosen movie itself also appears, since it matches on every field). Warcraft also scores highly: it doesn't have the same lead actor, but all 3 genres and the content rating match.
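The steps above can be gathered into one reusable function. The sketch below is one possible arrangement, not the notebook's own code: the name `similar_movies` is hypothetical, it assumes titles match exactly, and it leaves the chosen movie out of its own results. The sample rows are taken from the tables above.

```python
import pandas as pd

def similar_movies(movies, title, n=5):
    """Rank `movies` by similarity to `title`, leaving out `title` itself."""
    chosen = movies.loc[movies['movie_title'] == title].iloc[0]
    wanted = set(chosen['genres'].split('|'))
    scores = (
        (movies['content_rating'] == chosen['content_rating']).astype(int)
        + (movies['actor_1_name'] == chosen['actor_1_name']).astype(int)
        + movies['genres'].apply(
            lambda g: len(wanted & set(g.split('|'))) if isinstance(g, str) else 0)
    )
    ranked = movies.assign(match_score=scores).loc[movies['movie_title'] != title]
    ranked = ranked.sort_values('match_score', ascending=False)
    return ranked[['movie_title', 'match_score']].head(n)

sample = pd.DataFrame({
    'movie_title': ['Avatar', "Pirates of the Caribbean: At World's End",
                    'Spectre', 'The Dark Knight Rises'],
    'content_rating': ['PG-13'] * 4,
    'actor_1_name': ['CCH Pounder', 'Johnny Depp', 'Christoph Waltz', 'Tom Hardy'],
    'genres': ['Action|Adventure|Fantasy|Sci-Fi', 'Action|Adventure|Fantasy',
               'Action|Adventure|Thriller', 'Action|Thriller'],
})
result = similar_movies(sample, "Pirates of the Caribbean: At World's End")
print(result)
```

On this four-row sample the ranking is Avatar (4), Spectre (3), then The Dark Knight Rises (2).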

There are many other features we could add to the scoring to improve the recommendations (the director, for example, was extracted above but never scored). So in this instance, if you're not going to implement 'other customers like you bought...' functionality, I believe a machine learning algorithm is not necessarily required.
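As an illustration of adding one more feature, the sketch below folds a director match into an existing score. It assumes a stand-in dataframe named `movies` that already holds a `match_score` column; the rows and scores are the ones shown in the tables above, and `director_score` is a new, hypothetical column.

```python
import pandas as pd

# Stand-in for df2 after the match_score step, using rows shown in the tables above
movies = pd.DataFrame({
    'movie_title': ['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre'],
    'director_name': ['James Cameron', 'Gore Verbinski', 'Sam Mendes'],
    'match_score': [4, 5, 3],
})
chosen_director = 'Gore Verbinski'

# 1 where the director matches the chosen movie's director, 0 otherwise
movies['director_score'] = (movies['director_name'] == chosen_director).astype(int)
movies['match_score'] = movies['match_score'] + movies['director_score']
print(movies['match_score'].tolist())  # prints [4, 6, 3]
```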
