Exploratory Data Analysis using SQL

Exploratory Data Analysis using SQL

INTRODUCTION

SQL stands for Structured Query Language. Structured query language (SQL) is a programming language for storing and processing information in a relational database. A data analyst uses SQL to manipulate and gain insights from the data. In this project, we will analyze data and for this purpose we will use sakila sample database.

For this project, I will be using MySQL Workbench. You can download it from here. MySQL is an open-source relational database management system. You can use any other RDBMS software. Once your software is set up we shall import Data into our database.

IMPORTING DATA

After downloading sakila database extract the files from the zip file. Create database on workbench and name it Sakila.

Open MYSQL workbench and on file click on open SQL script

Open folder you extracted the data to and click on schema then open and run script

Repeat the same process to import data and run the SQL script

Data Analysis

Find all films with PG-13 films with rental rate of 2.99 or lower

SELECT *
FROM (
SELECT *
FROM film
) x
WHERE x.rating = ‘PG-13’ AND x.rental_rate <= 2.99;

In my filter condition I used and since both of the conditions must be satisfied. The sub-query helped optimize our SQL query run time to save on time and memory.

All films that have deleted scenes

SELECT *
FROM film f
WHERE f.special_features like ‘%Deleted Scenes%’;

special_features field contains description about the film and that is where it specifies whether a film has deleted scene. LIKE operator is used to retrieve the data in a column of a table, based on a specified pattern.

All active customers

SELECT *
FROM customer c
WHERE c.active = 1;

Active field in customers table has a boolean data type of 1 or 0 where 1 is True and 0 is a False. We filtered customers where active is 1 or True.

Distinct names of customers who rented a movie on 26th July 2005

SELECT DISTINCT(concat(c.first_name, ‘ ‘, c.last_name)) as full_name
FROM customer c
inner join rental r USING(customer_id)
WHERE r.rental_date like ‘2005-07-26%’;

I decided to concatenate first and last name just incase two or more customers have the same first name but different last name are not left out

How many rentals we do on each day?

select date(rental_date),
count(rental_id) as num_of_rentals
from rental
group by date(rental_date)
order by num_of_rentals desc;

rental_date column is a datetime data type therefore I extracted date from rental date so that we had to group by date only.

What are the three top earning days so far?

select date(payment_date) as date,
count(rental_id) as num_of_rentals,
sum(amount) as money
FROM payment
GROUP BY date(payment_date)
ORDER BY sum(amount) desc
limit 3;

After grouping our data we need to sort the data in a descending manner by use of order by and to choose the top 3 best days so far we use limit clause.

With this analysis of the data we are now aware of which type of movies to keep in stock, active customers to keep and can also further create a predictive model to predict the sales of film but for that, we need some more data.

Thanks for reading! For full project check it out on Github

If you want to get in touch with me, feel free to reach me at carsonallan71@gmail.com or my LinkedIn Profile.

Please follow and like us:
Pin Share