Movie Industry
Three decades of movies
CONTEXT
Okay, did somebody say ‘movies’? Of course, it’s the title of this article, so I’m sure you’re hyped already. Who doesn’t like to relax by watching their favorite movie(s)? But what exactly is happening to the movie industry? Why does it look like Netflix is taking over? These are the questions that led me to creating this dataset focused on movie revenue and its analysis over the decades.
CONTENT
There are 6820 movies in this dataset (220 movies per year, 1986–2016). Each movie has the following characteristics:
- budget : the budget of a movie. Some movies don’t have this, so it appears as ‘0’.
- company : the production company.
- country : country where the movie was produced.
- director : the director of the movie.
- genre : genre of the movie.
- gross : revenue of the movie.
- name : name/title of the movie.
- rating : age rating of the movie (for example R or PG).
- release : release date (YYYY/MM/DD).
- runtime : duration of the movie.
- score : IMDb user rating of the movie.
- votes : number of user votes.
- star : main actor/actress of the movie.
- writer : writer of the movie.
- year : year of release.
DATA WRANGLING
Before visualizing and analyzing a dataset, we wrangle the data. What is data wrangling? It is the process of gathering and transforming data so that it can answer an analytical question, and data cleaning is a big part of it. To carry out this operation, we use Python (a programming language) together with ‘pandas’, a library built for exactly this kind of work, which is imported below as ‘pd’. The dataset is a CSV (comma-separated values) file, so we read it into the program with ‘read_csv’.
import pandas as pd
df = pd.read_csv("movies.csv", engine='python')
df
Okay, we now have our dataset. We wouldn’t want to miss any null values in it, so we check for them using ‘df.isnull().sum()’.
df.isnull().sum()
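One thing this check cannot see: the column list says a missing budget is stored as ‘0’, and 0 is a perfectly valid number as far as ‘isnull’ is concerned. Here is a minimal sketch of how those placeholders could be counted and turned into real missing values. The tiny ‘toy’ table and its numbers are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data; names and values are invented.
toy = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C"],
    "budget": [8000000.0, 0.0, 15000000.0],
})

# isnull() finds nothing missing, because 0 is an ordinary number.
nulls_before = toy["budget"].isnull().sum()

# Count the zero placeholders, then replace them with NaN.
zero_budgets = (toy["budget"] == 0).sum()
toy["budget"] = toy["budget"].replace(0, np.nan)
nulls_after = toy["budget"].isnull().sum()

print(nulls_before, zero_budgets, nulls_after)
```

Whether to convert the zeros is a judgment call: NaN keeps them out of averages, while leaving them as 0 would drag the mean budget down.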
What a relief, ‘isnull’ reports no null values. We then proceed to check the names of the companies that produced our favorite movie(s) with ‘df[‘company’].unique()’.
df['company'].unique()
Aww, just look at that: the companies that produced our favorite movies. We also definitely do not want duplicate rows in our dataset, so we check using ‘df.duplicated().sum()’.
df.duplicated().sum()
Output: 0
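To see what that 0 is actually counting, here is a small sketch on a made-up table (the ‘toy’ frame is hypothetical, not part of the dataset). ‘duplicated()’ marks a row as True when it repeats an earlier row in every column, and ‘drop_duplicates()’ is one way to remove such rows if any ever turn up.

```python
import pandas as pd

# Invented example: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie A"],
    "year": [1986, 1990, 1986],
})

# duplicated() flags only the later copy, so the count here is 1.
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```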
Amazing! Such a beautiful dataset. There are no duplicate rows. Let’s check out the summary statistics of the dataset with ‘df.describe()’.
df.describe()
Now we have the mean, max, 25%, 50%, etc. for the numeric columns. We proceed to check the column types and non-null counts using ‘df.info()’.
df.info()
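One thing worth looking for in the ‘info’ output is the type of the date column: ‘read_csv’ loads dates as plain text unless told otherwise. Below is a minimal sketch of converting such a column with ‘pd.to_datetime’, assuming the YYYY/MM/DD layout described in the column list. The two sample dates and the ‘release_year’ column are invented for illustration.

```python
import pandas as pd

# Invented sample dates in the YYYY/MM/DD layout described above.
toy = pd.DataFrame({"release": ["1986/08/22", "2016/12/16"]})

# Parse the text into real datetime values using an explicit format.
toy["release"] = pd.to_datetime(toy["release"], format="%Y/%m/%d")

# With a datetime column, parts of the date are available through .dt
toy["release_year"] = toy["release"].dt.year

print(toy.dtypes)
```

Giving the format explicitly means pandas does not have to guess the layout, and a value that does not match raises an error rather than being parsed incorrectly.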
Alright, we now really know about our favorite movies: their budgets, the companies that produced them, etc. Now we want to check the number of distinct stars in our dataset using ‘df[‘star’].nunique()’.
df['star'].nunique()
Output: 2504
So we can see that there are 2504 different stars in our dataset.
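‘nunique’ only tells us how many different stars there are, not how often each one appears. If that is of interest, ‘value_counts’ is a natural companion. Here is a quick sketch on made-up names (the ‘toy’ frame is hypothetical):

```python
import pandas as pd

# Invented star names; "Actor X" leads three of the five movies.
toy = pd.DataFrame({
    "star": ["Actor X", "Actor Y", "Actor X", "Actor Z", "Actor X"],
})

# Number of distinct stars.
n_stars = toy["star"].nunique()

# Movies per star, most frequent first.
per_star = toy["star"].value_counts()

print(n_stars)
print(per_star.head(1))
```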
You can get the dataset used in this article here. The files for this article are available here on my GitHub profile.
Thanks for reading! Stay tuned for the analysis and visualization.