====================================================
====================================================
I will be working with a dataset consisting of information on 50,000 used cars. The 50,000 data points were sampled from the full dataset to ensure the code runs quickly in the hosted environment.
The goal of this project is to clean the data and analyze the included used car listings, as well as to become familiar with some of the unique benefits Jupyter Notebook provides for pandas.
Read the autos.csv CSV file into pandas, and assign it to the variable name autos.
Try without specifying any encoding (which will default to UTF-8)
If you get an encoding error, try the next two most popular encodings (Latin-1 and Windows-1252)
until you are able to read the file without error.
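Before settling on an encoding, that trial-and-error can be scripted; here is a minimal sketch, assuming autos.csv sits in the working directory (Latin-1 decodes any byte sequence, so it acts as a final fallback):
import pandas as pd
# Try the most common encodings in order until one succeeds
for encoding in ("utf-8", "latin-1", "windows-1252"):
    try:
        autos = pd.read_csv("autos.csv", encoding=encoding)
        print("Read succeeded with encoding:", encoding)
        break
    except UnicodeDecodeError:
        print("Read failed with encoding:", encoding)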
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Latin-1")
Create a new cell with just the variable autos and run this cell.
A neat feature of Jupyter Notebook is its ability to render the first few and last few values of any pandas object.
autos.head(5)
Use the DataFrame.info() and DataFrame.head() methods to print information about the autos dataframe, as well as the first few rows.
autos.info()
autos.head()
Write a markdown cell briefly describing your observations.
Looking at the dataframe, I can see it's a mix of object and integer columns. There are 20 columns in total, some of which have column names that will need to be cleaned up.
=======================================
=======================================
Use the DataFrame.columns attribute to print an array of the existing column names.
autos.columns
Copy that array and make the following edits to column names:
column_name_corrections = {
'dateCrawled' : 'date_crawled',
'name' : 'name',
'seller' : 'seller',
'offerType' : 'offer_type',
'price' : 'price',
'abtest' : 'ab_test',
'vehicleType' : 'vehicle_type',
'yearOfRegistration' : 'registration_year',
'gearbox' : 'gear_box',
'powerPS' : 'power_ps',
'model' : 'model',
'odometer' : 'odometer',
'monthOfRegistration' : 'registration_month',
'fuelType' : 'fuel_type',
'brand' : 'brand',
'notRepairedDamage' : 'unrepaired_damage',
'dateCreated' : 'ad_created',
'nrOfPictures' : 'num_pictures',
'postalCode' : 'postal_code',
'lastSeen' : 'last_seen'
}
Assign the modified column names back to the DataFrame.columns attribute.
autos.columns
pd.__version__
# I'm using pandas 0.24. On older versions, where Index.map() may not accept a dictionary,
# this will fail; use the command commented out below instead.
autos.columns = autos.columns.map(column_name_corrections)
# autos.columns = pd.Series(autos.columns).map(column_name_corrections)
autos.columns
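As a version-independent alternative, DataFrame.rename() accepts the same mapping dictionary; a sketch (equivalent to the map() call above when run on the original column names):
# Rename columns via the mapping; keys not present in the columns are simply ignored
autos = autos.rename(columns=column_name_corrections)
autos.columns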
==================================
==================================
Now it's time to determine what other cleaning tasks need to be done. Things to look for:
- Text columns where all or almost all values are the same. These can often be dropped as they don't contain useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.
The following methods are helpful for exploring the data:
- DataFrame.describe() (with include='all' to get both categorical and numeric columns)
- Series.value_counts() and Series.head() if any columns need a closer look.
Use DataFrame.describe() to look at descriptive statistics for all columns.
autos.describe(include='all')
Write a markdown cell noting my observations:
Answer: It looks like both the price and odometer columns hold numeric values but are actually stored as text. Also, 'seller' and 'offer_type' seem to contain the same value in nearly every row.
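To confirm the suspicion about 'seller' and 'offer_type', a quick sketch with Series.value_counts():
# If each column holds essentially one value, it carries no useful signal
print(autos['seller'].value_counts())
print(autos['offer_type'].value_counts())
# Near-constant columns could then be dropped, e.g.:
# autos = autos.drop(['seller', 'offer_type'], axis=1)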
Removing any non-numeric characters.
# regex=False makes each replacement literal, so '$' isn't treated as a regex anchor
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
Converting the column to a numeric dtype.
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].astype(int)
dtypes = autos.dtypes
print(dtypes)
Using DataFrame.rename() to rename the column to odometer_km.
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.columns
===============================================================
===============================================================
For each of the odometer_km and price columns, I'll use the following techniques to explore the data:
Series.unique().shape to see how many unique values there are:
print(autos['price'].unique().shape)
print(autos['odometer_km'].unique().shape)
Series.describe() to view min/max/median/mean, etc.:
autos[['price','odometer_km']].describe()
Series.value_counts() (chained to .head() if there are lots of values):
autos['price'].value_counts().head()
autos['odometer_km'].value_counts().head()
Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending=True or False to view the highest and lowest values with their counts (we can also chain to head() here).
autos['price'].sort_values(ascending=False)
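And the sort_index() technique just described, as a sketch applied to the price counts:
# Highest and lowest prices with their counts
print(autos['price'].value_counts().sort_index(ascending=False).head())
print(autos['price'].value_counts().sort_index(ascending=True).head())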
autos.columns
# Iterating over a DataFrame yields column names, not rows,
# so use iterrows() to inspect the first few rows themselves
for index, row in autos.head(5).iterrows():
    print(index, row['price'])
# Look at individual entries
autos.iloc[26772]
autos['price'].sort_values(ascending=True).head(1)
If you find there are outliers, remove them and write a markdown paragraph explaining your decision.
After you have removed the outliers, make some observations about the remaining values.
To do this, I took the min + 1 and the max - 1, then made autos equal to everything in between.
# find the min value in the price column, plus 1 so the $0 listings are excluded
min_price = autos['price'].min() + 1
print(min_price)
# Find the max value in the price column, minus 1 so the outlier max is excluded
max_price = autos['price'].max() - 1
print(max_price)
autos = autos[autos["price"].between(min_price,max_price)]
autos['price'].sort_values(ascending=False)
Another way is to find the min and max and keep only the rows that match neither:
# find the min value in price column
min_price = autos['price'].min()
print(min_price)
# Find the max value in price column
max_price = autos['price'].max()
print(max_price)
# Remove any rows with price equal to the min value
autos = autos[autos['price'] != min_price]
# Remove any rows with price equal to the max value
autos = autos[autos['price'] != max_price]
autos['price'].sort_values(ascending=False)
Observations: There was an entry of $99 million and a number of entries at $0. I removed them and kept everything in between.
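As a quick sanity check on the filtered data (a sketch):
# Verify the extremes are gone and inspect the remaining range
print(autos['price'].describe())
print(autos['price'].sort_values().head())   # lowest remaining prices
print(autos['price'].sort_values().tail())   # highest remaining prices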
==============================================================================
==============================================================================
Looking at the columns that contain date data, some are populated by the crawler (date_crawled and last_seen) and some are grabbed from the website itself (ad_created, registration_month and registration_year).
It also looks like date_crawled, last_seen and ad_created are stored as string values, whereas the others are represented numerically.
I will need to convert the string values into a usable representation in order to understand the data quantitatively by running Series.describe().
Printing out the three columns, I can see they are represented as full timestamp values:
autos[['date_crawled','ad_created','last_seen']][0:5]
I can see that the first 10 characters represent the date (year-month-day):
print(autos['date_crawled'].head().str[:10])
Next, extract the date component from each column:
# extract only the year-month-day component for 'date_crawled'
dmy_date_crawled = autos['date_crawled'].str[:10]
dmy_date_crawled.head()
# extract only the year-month-day component for 'ad_created'
dmy_ad_created = autos['ad_created'].str[:10]
dmy_ad_created.head()
# extract only the year-month-day component for 'last_seen'
dmy_last_seen = autos['last_seen'].str[:10]
dmy_last_seen.head()
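As an alternative to string slicing, the timestamps could be parsed properly with pd.to_datetime; a minimal sketch, assuming the ISO-like format shown above:
# Parse the full timestamps and keep just the calendar date
crawled_dates = pd.to_datetime(autos['date_crawled']).dt.date
crawled_dates.head()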
# Get the distribution of 'date_crawled' as percentages (including missing values), sorted by date
dmy_dc = dmy_date_crawled.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_dc)
print(dmy_dc.describe())
# Get the distribution of 'ad_created' as percentages (including missing values), sorted by date
dmy_ac = dmy_ad_created.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_ac)
print(dmy_ac.describe())
# Get the distribution of 'last_seen' as percentages (including missing values), sorted by date
dmy_ls = dmy_last_seen.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_ls)
print(dmy_ls.describe())
date_crawled and last_seen seem to correlate in that they cover the same number of distinct days (34 entries each).
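A quick sketch to check that claim by comparing the number of distinct days in each distribution:
# Number of distinct crawl days vs. distinct last-seen days
print(dmy_dc.shape[0], dmy_ls.shape[0])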
autos['registration_year'].describe()
registration_year looks to have some invalid years on the min and max side. I'll need to address that next
==============================================================================
==============================================================================
From my last exercise, registration_year has some unrealistic values: a min of 1000 and a max of 9999. Also, looking at the dates in date_crawled, ad_created and last_seen, I saw the range of years was roughly 2015-2016. Something is not matching up. To dig in deeper, I'll need to figure out the correct range of years.
autos['registration_year'].sort_values()
Looking at the above output, I can see the first four entries fall outside the realistic range in which a car could have been registered. The same goes for the last several entries, which are quite far into the future. Based on these observations, I will take the year range 1910 through 2019 and filter out everything that falls outside of it.
autos = autos[autos["registration_year"].between(1910,2019)]
autos['registration_year'].describe()
autos['registration_year'].sort_values()
autos['registration_year'].value_counts(normalize=True)
All the values outside our range have been removed. Looking at the distribution, it's clear that the further back the registration year, the fewer the listings.
=======================================================================================
=======================================================================================
# Select the unique brands in the dataset that have a presence of > 5%
autos2 = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)
brand_bool = autos2 > 0.05
brand_autos = autos2[brand_bool]
print(brand_autos)
# Since Series.value_counts() produces a series indexed by brand, I can leverage that index
brand_autos.index
# Loop over the selected brands and assign the mean price for each brand
# to the dictionary 'aggregate_data', with the brand name as the key.
aggregate_data = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    brand_mean = individual_brand['price'].mean()
    aggregate_data[brand] = brand_mean
aggregate_data
This is a good way to see which auto brands fall within a certain budget.
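For example, sorting the dictionary by mean price makes the budget comparison immediate; a quick sketch:
# Brands ordered from cheapest to most expensive mean price
for brand, mean_price in sorted(aggregate_data.items(), key=lambda kv: kv[1]):
    print(brand, round(mean_price))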
============================================================================
============================================================================
# From the dictionary 'aggregate_data', I create a pandas Series called 'brand_mean_prices'
brand_mean_prices = pd.Series(aggregate_data)
brand_mean_prices
# Now, the 'brand_mean_prices' series has an index and I can use that to create a dataframe with a column labeled 'mean_price'
df = pd.DataFrame(brand_mean_prices, columns=['mean_price'])
df
# Calculating the mean mileage and mean price for each of the top brands
# and storing the results in dictionaries keyed by brand name.
mean_price_dict = {}
mean_mileage_dict = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    price_mean = individual_brand['price'].mean()
    mileage_mean = individual_brand['odometer_km'].mean()
    mean_price_dict[brand] = price_mean
    mean_mileage_dict[brand] = mileage_mean
print(mean_price_dict)
print(mean_mileage_dict)
# Converting both dictionaries to series objects
prices_mean = pd.Series(mean_price_dict)
mileage_mean = pd.Series(mean_mileage_dict)
print(prices_mean)
print(mileage_mean)
# Creating a dataframe ('mileage_price') from the 'mileage_mean' series object using the DataFrame constructor.
mileage_price = pd.DataFrame(mileage_mean, columns=['mean_mileage'])
mileage_price
# Assigning the mean_price series as a new column in 'mileage_price'.
mileage_price['mean_price'] = prices_mean
mileage_price
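To rank the candidates directly, the combined dataframe can be sorted by either column; a sketch:
# Rank the top brands by mean price, descending
mileage_price.sort_values('mean_price', ascending=False)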
Looking at mean mileage vs. mean price, I've narrowed down a possible purchase. This is in line with what we usually see on the market: Ford, VW and Opel sit in one price range, while BMW, Audi and Mercedes-Benz sit a bit above that. Having this in front of you can definitely help with deciding which one to go for.