Project 2: Exploring eBay Car Sales Data

====================================================

Step 1/9 - Importing CSV and Inspecting DataFrame

====================================================

I will be working with a dataset consisting of information on 50,000 used cars. The 50,000 data points were sampled from the full dataset so the code runs quickly in the hosted environment.

The goal of this project is to clean the data and analyze the included used car listings, as well as to become familiar with some of the unique benefits Jupyter Notebook provides for working with pandas.

Read the autos.csv CSV file into pandas and assign it to the variable name autos.
Try without specifying any encoding (which will default to UTF-8).
If you get an encoding error, try the next two most popular encodings (Latin-1 and Windows-1252) until you are able to read the file without error.
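
The fallback logic described above can be sketched as a small helper (the function name read_with_fallback is my own, not part of the exercise; it assumes the CSV sits next to the notebook):

```python
import pandas as pd

def read_with_fallback(path, encodings=("utf-8", "latin-1", "windows-1252")):
    # Try each candidate encoding in order until pandas reads the file without error
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # move on to the next candidate encoding
    raise ValueError("none of the candidate encodings worked")

# autos = read_with_fallback("autos.csv")
```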

In [377]:
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Latin-1")

Create a new cell with just the variable autos and run this cell.

A neat feature of Jupyter Notebook is its ability to render the first few and last few values of any pandas object.

In [378]:
autos.head(5)
Out[378]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Use the DataFrame.info() and DataFrame.head() methods to print information about the autos dataframe, as well as the first few rows.

In [379]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [380]:
autos.head()
Out[380]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Write a markdown cell briefly describing your observations.

Looking at the dataframe, I can see it's a mix of object and integer columns. There are 20 columns in total; the names are in camelCase and will need to be cleaned up, and several columns (e.g. vehicleType, gearbox, notRepairedDamage) contain missing values.

=======================================

Step 2/9 - Cleaning Up Column Names

=======================================

Use the DataFrame.columns attribute to print an array of the existing column names.

In [381]:
autos.columns
Out[381]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Copy that array and make the following edits to column names:

  • yearOfRegistration to registration_year
  • monthOfRegistration to registration_month
  • notRepairedDamage to unrepaired_damage
  • dateCreated to ad_created

In [382]:
autos.columns
Out[382]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [383]:
column_name_corrections = {
    'dateCrawled' : 'date_crawled',
    'name' : 'name',
    'seller' : 'seller',
    'offerType' : 'offer_type',
    'price' : 'price',
    'abtest' : 'ab_test',
    'vehicleType' : 'vehicle_type',
    'yearOfRegistration' : 'registration_year',
    'gearbox' : 'gear_box',
    'powerPS' : 'power_ps',
    'model' : 'model',
    'odometer' : 'odometer',
    'monthOfRegistration' : 'registration_month',
    'fuelType' : 'fuel_type',
    'brand' : 'brand',
    'notRepairedDamage' : 'unrepaired_damage',
    'dateCreated' : 'ad_created',
    'nrOfPictures' : 'num_pictures',
    'postalCode' : 'postal_code',
    'lastSeen' : 'last_seen'
}

Assign the modified column names back to the DataFrame.columns attribute.

In [384]:
autos.columns
Out[384]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [385]:
pd.__version__
Out[385]:
'0.24.2'
In [386]:
# I'm using pandas 0.24. Index.map() with a dict requires pandas 0.23 or newer;
# on anything older this will fail, so use the command commented out below instead
autos.columns = autos.columns.map(column_name_corrections)

# autos.columns = pd.Series(autos.columns).map(column_name_corrections)
In [387]:
autos.columns
Out[387]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

==================================

Step 3/9 - Basic Data Exploration

==================================

Now it's time to determine what other cleaning tasks need to be done. Things to look for:

  • Text columns where all or almost all values are the same. These can often be dropped as they don't contain useful information for analysis.
  • Examples of numeric data stored as text which can be cleaned and converted.

The following methods are helpful for exploring the data:

  • DataFrame.describe() (with include='all' to get both categorical and numeric columns)
  • Series.value_counts() and Series.head() if any columns need a closer look.

Use DataFrame.describe() to look at descriptive statistics for all columns.

In [388]:
autos.describe(include = 'all')
Out[388]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gear_box power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-08 10:40:35 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Write a markdown cell noting:

  • Any columns that have mostly one value that are candidates to be dropped
  • Any columns that need more investigation.
  • Any examples of numeric data stored as text that needs to be cleaned

Answer: Both the price and odometer columns hold numeric values but are stored as text. Also, seller and offer_type each take a single value in 49,999 of 50,000 rows, and num_pictures is 0 everywhere, making all three candidates to drop.
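
A quick sketch to back up the near-constant observation (the helper name near_constant and the 99% threshold are my own choices, not part of the exercise):

```python
import pandas as pd

def near_constant(series, threshold=0.99):
    # True when a single value accounts for at least `threshold` of all rows
    return series.value_counts(normalize=True, dropna=False).iloc[0] >= threshold

# On the real frame this would flag seller, offer_type and num_pictures:
# for col in ['seller', 'offerType', 'nrOfPictures']:
#     print(col, near_constant(autos[col]))
```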

Removing any non-numeric characters.

In [389]:
# Strip the currency symbol, unit suffix and thousands separators so the
# columns can be converted to a numeric dtype
autos['price'] = autos['price'].str.replace('$', '').str.replace(',', '')
autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '')

Converting the columns to a numeric dtype.

In [390]:
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].astype(int)
dtypes = autos.dtypes
print(dtypes)
date_crawled          object
name                  object
seller                object
offer_type            object
price                  int64
ab_test               object
vehicle_type          object
registration_year      int64
gear_box              object
power_ps               int64
model                 object
odometer               int64
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_pictures           int64
postal_code            int64
last_seen             object
dtype: object
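
A slightly more defensive alternative (not the approach used above) is to strip both characters in one regex pass and let pd.to_numeric handle the conversion, shown here on a few made-up values:

```python
import pandas as pd

# Made-up raw values mirroring the price column's format
raw = pd.Series(["$5,000", "$8,500", "$1,350"], name="price")

# errors="coerce" turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(cleaned.tolist())  # → [5000, 8500, 1350]
```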

Using DataFrame.rename() to rename the column to odometer_km.

In [391]:
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.columns
Out[391]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

===============================================================

Step 4/9 - Basic Data Exploration - Unique Values and Outliers

===============================================================

We'll use:

  • Series.unique().shape to see how many unique values
  • Series.describe() to view min/max/median/mean etc
  • Series.value_counts(), with some variations:
    • chained to .head() if there are lots of values.
    • Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending=True or False to view the highest and lowest values with their counts (can also chain to head() here).
  • When removing outliers, we can do df[(df["col"] > x) & (df["col"] < y)], but it's more readable to use df[df["col"].between(x, y)]
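
The equivalence claimed in the last bullet can be checked on a toy frame (made-up prices; note that between() bounds are inclusive, so they shift by one relative to the strict comparisons):

```python
import pandas as pd

# Toy prices, including a zero and an implausibly large value
df = pd.DataFrame({"price": [0, 500, 2950, 7200, 99999999]})

mask_style = df[(df["price"] > 0) & (df["price"] < 100000)]
between_style = df[df["price"].between(1, 99999)]  # inclusive bounds

print(mask_style.equals(between_style))  # → True
```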

For each of the odometer_km and price columns:
Use the techniques above to explore the data

Series.unique().shape to see how many unique values

In [392]:
print(autos['price'].unique())
print(autos['odometer_km'].unique())
[ 5000  8500  8990 ...   385 22200 16995]
[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000 100000  40000]

Series.describe() to view min/max/median/mean etc

In [393]:
autos[['price','odometer_km']].describe()
Out[393]:
price odometer_km
count 5.000000e+04 50000.000000
mean 9.840044e+03 125732.700000
std 4.811044e+05 40042.211706
min 0.000000e+00 5000.000000
25% 1.100000e+03 125000.000000
50% 2.950000e+03 150000.000000
75% 7.200000e+03 150000.000000
max 1.000000e+08 150000.000000

Series.value_counts() (chained to .head() since there are lots of values)

In [394]:
autos['price'].value_counts().head()
Out[394]:
0       1421
500      781
1500     734
2500     643
1000     639
Name: price, dtype: int64
In [395]:
autos['odometer_km'].value_counts().head()
Out[395]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending=True or False to view the highest and lowest values with their counts (can also chain to head() here).

In [396]:
autos['price'].sort_values(ascending=False)
Out[396]:
39705    99999999
42221    27322222
39377    12345678
47598    12345678
27371    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
43049      999999
514        999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
38299      250000
37840      220000
40918      198000
43668      197000
28090      194000
20351      190000
17140      180000
11433      175000
32840      169999
18509      169000
           ...   
21171           0
24900           0
1220            0
31264           0
24871           0
2470            0
21256           0
29503           0
37021           0
21253           0
47861           0
29482           0
31336           0
35819           0
27020           0
4266            0
2481            0
15208           0
33620           0
12819           0
31332           0
35821           0
8438            0
43925           0
38832           0
8445            0
29499           0
15225           0
43923           0
18089           0
Name: price, Length: 50000, dtype: int64
In [397]:
autos.columns
Out[397]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [398]:
# Note: iterating over a DataFrame yields its column labels, not its rows
for col in autos.head(5):
    print(col)
date_crawled
name
seller
offer_type
price
ab_test
vehicle_type
registration_year
gear_box
power_ps
model
odometer_km
registration_month
fuel_type
brand
unrepaired_damage
ad_created
num_pictures
postal_code
last_seen
In [399]:
# Look at individual entries
autos.iloc[26772]
Out[399]:
date_crawled                                2016-03-22 10:51:48
name                  Subaru_Justy_1000_4WD_Allrad_Teiletraeger
seller                                                   privat
offer_type                                              Angebot
price                                                        20
ab_test                                                 control
vehicle_type                                         kleinwagen
registration_year                                          1990
gear_box                                                manuell
power_ps                                                     50
model                                                     justy
odometer_km                                              150000
registration_month                                           11
fuel_type                                                benzin
brand                                                    subaru
unrepaired_damage                                           NaN
ad_created                                  2016-03-22 00:00:00
num_pictures                                                  0
postal_code                                               82387
last_seen                                   2016-03-27 23:17:01
Name: 26772, dtype: object
In [400]:
autos['price'].sort_index(ascending=True).head(1)
Out[400]:
0    5000
Name: price, dtype: int64

If you find there are outliers, remove them and write a markdown paragraph explaining your decision. After you have removed the outliers, make some observations about the remaining values.

To do this, I took the minimum price and the maximum minus 1, then kept everything in between, which drops only the single highest-priced listing.

In [401]:
# find the min value in price column
min_price = autos['price'].min()
print(min_price)
0
In [402]:
# Find the max value in price column
max_price = autos['price'].max() - 1
print(max_price)
99999998
In [403]:
autos = autos[autos["price"].between(min_price,max_price)]
In [404]:
autos['price'].sort_values(ascending=False)
Out[404]:
42221    27322222
47598    12345678
39377    12345678
27371    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
43049      999999
514        999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
38299      250000
37840      220000
40918      198000
43668      197000
28090      194000
20351      190000
17140      180000
11433      175000
32840      169999
18509      169000
22673      163991
           ...   
21171           0
24900           0
1220            0
31264           0
24871           0
2470            0
21256           0
29503           0
37021           0
21253           0
47861           0
29482           0
31336           0
35819           0
27020           0
4266            0
2481            0
15208           0
33620           0
12819           0
31332           0
35821           0
8438            0
43925           0
38832           0
8445            0
29499           0
15225           0
43923           0
18089           0
Name: price, Length: 49999, dtype: int64

Another way is to find the min and max and keep only the rows whose price doesn't equal either extreme.

In [405]:
# find the min value in price column
min_price = autos['price'].min()
print(min_price)
# Find the max value in price column
max_price = autos['price'].max()
print(max_price)

# Remove any rows with values == to min value
autos = autos[autos['price'] != min_price]
# Remove any rows with values == to max value
autos = autos[autos['price'] != max_price]
0
27322222
In [406]:
autos['price'].sort_values(ascending=False)
Out[406]:
39377    12345678
27371    12345678
47598    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
43049      999999
514        999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
38299      250000
37840      220000
40918      198000
43668      197000
28090      194000
20351      190000
17140      180000
11433      175000
32840      169999
18509      169000
22673      163991
45387      163500
           ...   
5883            1
49377           1
21342           1
11442           1
17984           1
33067           1
39507           1
32839           1
37914           1
40143           1
20569           1
37500           1
23037           1
344             1
5246            1
20663           1
30162           1
15054           1
39180           1
7791            1
17110           1
25281           1
49845           1
47265           1
30185           1
46595           1
48786           1
1987            1
14350           1
30146           1
Name: price, Length: 48577, dtype: int64

Observations: There was a listing of nearly $100 million and 1,421 listings priced at zero dollars. I removed those extremes and kept everything in between.

==============================================================================

Step 5/9 - Basic Data Exploration - Working with Dates and Date Columns 1/2

==============================================================================

Looking at the columns that contain date data, some are populated by the crawler (date_crawled and last_seen) and some come from the website itself (ad_created, registration_month and registration_year).

It also looks like date_crawled, last_seen and ad_created are stored as string values, whereas the others are represented numerically.

I will need to convert the string values in order to quantitatively understand the data by running Series.describe().

Printing out the three columns, I can see they are represented as full timestamp values:
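
An alternative to the string slicing used below is pd.to_datetime, which parses the full timestamps into proper datetime64 values (a sketch on two sample timestamps in the same format as the columns above):

```python
import pandas as pd

# Two sample timestamps in the same format as the three date columns
ts = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])
parsed = pd.to_datetime(ts)

# The date component can then be formatted back out, or used via .dt.date
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # → ['2016-03-26', '2016-04-04']
```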

In [407]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[407]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50

I can see that the first 10 characters represent the date (YYYY-MM-DD):

In [408]:
print(autos['date_crawled'].head().str[:10])
0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object

Next:

  1. Get a count of each of these dates using value_counts().
  2. Show the counts as percentages, using normalize=True for each series/column.
  3. Include missing values in the distribution by passing dropna=False.
  4. Finally, sort the results by date with sort_index().
In [409]:
# extract only the date component (YYYY-MM-DD) from 'date_crawled'
dmy_date_crawled = autos['date_crawled'].str[:10]
dmy_date_crawled.head()
Out[409]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object
In [410]:
# extract only the date component (YYYY-MM-DD) from 'ad_created'
dmy_ad_created = autos['ad_created'].str[:10]
dmy_ad_created.head()
Out[410]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: ad_created, dtype: object
In [411]:
# extract only the date component (YYYY-MM-DD) from 'last_seen'
dmy_last_seen = autos['last_seen'].str[:10]
dmy_last_seen.head()
Out[411]:
0    2016-04-06
1    2016-04-06
2    2016-04-06
3    2016-03-15
4    2016-04-01
Name: last_seen, dtype: object
In [412]:
# Percentage distribution of 'date_crawled', including missing values, sorted by date
dmy_dc = dmy_date_crawled.value_counts(normalize=True, dropna=False).sort_index() 
print(dmy_dc)
print(dmy_dc.describe())
2016-03-05    0.025321
2016-03-06    0.014040
2016-03-07    0.036005
2016-03-08    0.033308
2016-03-09    0.033102
2016-03-10    0.032176
2016-03-11    0.032567
2016-03-12    0.036931
2016-03-13    0.015666
2016-03-14    0.036540
2016-03-15    0.034275
2016-03-16    0.029602
2016-03-17    0.031640
2016-03-18    0.012907
2016-03-19    0.034770
2016-03-20    0.037878
2016-03-21    0.037405
2016-03-22    0.032999
2016-03-23    0.032217
2016-03-24    0.029335
2016-03-25    0.031599
2016-03-26    0.032196
2016-03-27    0.031085
2016-03-28    0.034852
2016-03-29    0.034131
2016-03-30    0.033678
2016-03-31    0.031846
2016-04-01    0.033678
2016-04-02    0.035469
2016-04-03    0.038599
2016-04-04    0.036519
2016-04-05    0.013093
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled, dtype: float64
count    34.000000
mean      0.029412
std       0.009764
min       0.001400
25%       0.029973
50%       0.032783
75%       0.034831
max       0.038599
Name: date_crawled, dtype: float64
In [413]:
# Percentage distribution of 'ad_created', including missing values, sorted by date
dmy_ac = dmy_ad_created.value_counts(normalize=True, dropna=False).sort_index() 
print(dmy_ac)
print(dmy_ac.describe())
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033164
2016-03-10    0.031888
2016-03-11    0.032896
2016-03-12    0.036766
2016-03-13    0.017004
2016-03-14    0.035181
2016-03-15    0.034008
2016-03-16    0.030117
2016-03-17    0.031291
2016-03-18    0.013587
2016-03-19    0.033678
2016-03-20    0.037940
2016-03-21    0.037610
2016-03-22    0.032814
2016-03-23    0.032052
2016-03-24    0.029273
2016-03-25    0.031743
2016-03-26    0.032258
2016-03-27    0.030982
2016-03-28    0.034975
2016-03-29    0.034070
2016-03-30    0.033493
2016-03-31    0.031888
2016-04-01    0.033678
2016-04-02    0.035140
2016-04-03    0.038846
2016-04-04    0.036890
2016-04-05    0.011816
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
count    76.000000
mean      0.013158
std       0.015957
min       0.000021
25%       0.000021
50%       0.000144
75%       0.032104
max       0.038846
Name: ad_created, dtype: float64
In [414]:
# Percentage distribution of 'last_seen', including missing values, sorted by date
dmy_ls = dmy_last_seen.value_counts(normalize=True, dropna=False).sort_index() 
print(dmy_ls)
print(dmy_ls.describe())
2016-03-05    0.001070
2016-03-06    0.004323
2016-03-07    0.005393
2016-03-08    0.007411
2016-03-09    0.009614
2016-03-10    0.010663
2016-03-11    0.012372
2016-03-12    0.023797
2016-03-13    0.008893
2016-03-14    0.012599
2016-03-15    0.015872
2016-03-16    0.016448
2016-03-17    0.028079
2016-03-18    0.007349
2016-03-19    0.015831
2016-03-20    0.020648
2016-03-21    0.020648
2016-03-22    0.021368
2016-03-23    0.018527
2016-03-24    0.019762
2016-03-25    0.019207
2016-03-26    0.016798
2016-03-27    0.015645
2016-03-28    0.020874
2016-03-29    0.022356
2016-03-30    0.024765
2016-03-31    0.023797
2016-04-01    0.022789
2016-04-02    0.024929
2016-04-03    0.025197
2016-04-04    0.024477
2016-04-05    0.124771
2016-04-06    0.221813
2016-04-07    0.131914
Name: last_seen, dtype: float64
count    34.000000
mean      0.029412
std       0.043717
min       0.001070
25%       0.012429
50%       0.019485
75%       0.023797
max       0.221813
Name: last_seen, dtype: float64

date_crawled and last_seen line up: both cover the same 34 distinct dates (2016-03-05 through 2016-04-07), which makes sense since both are generated by the crawler.
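
That overlap can be checked directly by comparing the sets of distinct dates in the two columns; a sketch with two toy timestamps (on the real data both sets run from 2016-03-05 through 2016-04-07):

```python
import pandas as pd

# Toy timestamps standing in for the two crawler-generated columns
date_crawled = pd.Series(["2016-03-05 10:00:00", "2016-04-07 06:17:27"])
last_seen = pd.Series(["2016-03-05 18:00:00", "2016-04-07 23:59:59"])

crawled_days = set(date_crawled.str[:10])
seen_days = set(last_seen.str[:10])
print(crawled_days == seen_days)  # → True
```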

In [415]:
autos['registration_year'].describe()
Out[415]:
count    48577.000000
mean      2004.752928
std         88.633473
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

registration_year looks to have some invalid years on both the min (1000) and max (9999) ends. I'll need to address that next.

==============================================================================

Step 6/9 - Basic Data Exploration - Working with Dates and Date Columns 2/2

==============================================================================

Looking at my last exercise, registration_year has some odd values: a min of 1000 and a max of 9999. Also, looking at the dates in date_crawled, ad_created and last_seen, the years range over roughly 2015-2016. Something doesn't match up. To dig in deeper, I'll need to figure out a realistic range of years.

In [416]:
autos['registration_year'].sort_values()
Out[416]:
22316    1000
49283    1001
24511    1111
32585    1800
10556    1800
28693    1910
3679     1910
45157    1910
22659    1910
30781    1910
21416    1927
22101    1929
11246    1931
2221     1934
2573     1934
23804    1937
21421    1937
39725    1937
26607    1937
26103    1938
24855    1939
13963    1941
25792    1941
11585    1943
11047    1948
14020    1950
1171     1950
32091    1950
35921    1951
44406    1951
         ... 
42768    2018
2089     2018
41965    2018
21708    2018
5462     2018
18345    2018
13244    2018
1121     2018
19707    2018
7283     2018
14700    2018
3801     2018
44772    2018
5763     2019
49185    2019
27578    2800
4549     4100
453      4500
42079    4800
4164     5000
24519    5000
22799    5000
49153    5000
27618    5911
8360     6200
25003    8888
49910    9000
8012     9999
38076    9999
33950    9999
Name: registration_year, Length: 48577, dtype: int64

Looking at the above output, I can see the first few entries fall outside any realistic date a car could be registered, and the last several are far in the future. Based on these observations, I will take the year range of 1910 through 2019 and filter out everything that falls outside it.

In [417]:
autos = autos[autos["registration_year"].between(1910,2019)]
autos['registration_year'].describe()
Out[417]:
count    48557.000000
mean      2003.459336
std          7.571143
min       1910.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64
In [418]:
autos['registration_year'].sort_values()
Out[418]:
3679     1910
28693    1910
45157    1910
22659    1910
30781    1910
21416    1927
22101    1929
11246    1931
2221     1934
2573     1934
39725    1937
23804    1937
26607    1937
21421    1937
26103    1938
24855    1939
25792    1941
13963    1941
11585    1943
11047    1948
14020    1950
32091    1950
1171     1950
44406    1951
35921    1951
23372    1952
35453    1953
40765    1954
25556    1954
36794    1955
         ... 
13772    2018
38000    2018
23841    2018
43378    2018
44237    2018
43383    2018
29445    2018
47930    2018
3801     2018
6424     2018
28403    2018
29459    2018
30720    2018
16187    2018
6408     2018
48495    2018
41656    2018
44256    2018
21243    2018
33430    2018
28938    2018
41036    2018
17069    2018
11436    2018
7329     2018
38546    2018
21177    2018
32467    2018
49185    2019
5763     2019
Name: registration_year, Length: 48557, dtype: int64
In [419]:
autos['registration_year'].value_counts(normalize=True)
Out[419]:
2000    0.064996
2005    0.060465
1999    0.059682
2004    0.055667
2003    0.055584
2006    0.055008
2001    0.054307
2002    0.051198
1998    0.048664
2007    0.046893
2008    0.045616
2009    0.042960
1997    0.040180
2011    0.033425
2010    0.032724
2017    0.028688
1996    0.028276
2012    0.026979
1995    0.025290
2016    0.025125
2013    0.016537
2014    0.013654
1994    0.012954
2018    0.009700
1993    0.008753
2015    0.008073
1992    0.007641
1990    0.007146
1991    0.006981
1989    0.003583
          ...   
1976    0.000432
1969    0.000391
1975    0.000371
1965    0.000350
1964    0.000247
1963    0.000165
1959    0.000124
1961    0.000124
1910    0.000103
1937    0.000082
1962    0.000082
1958    0.000082
1956    0.000082
1950    0.000062
1955    0.000041
2019    0.000041
1934    0.000041
1957    0.000041
1951    0.000041
1941    0.000041
1954    0.000041
1953    0.000021
1948    0.000021
1927    0.000021
1943    0.000021
1929    0.000021
1939    0.000021
1938    0.000021
1931    0.000021
1952    0.000021
Name: registration_year, Length: 81, dtype: float64

All the values outside our range have been removed. Looking at the distribution, it's clear that the further back in time the registration year, the fewer listings there are.
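The frequency table above is sorted by share, which makes the chronological trend harder to follow. A minimal sketch (using a small toy Series, not the real dataset) of sorting the normalized counts by year instead:

```python
import pandas as pd

# Toy registration years, used only to illustrate the idea:
# sort_index() orders the shares chronologically rather than by frequency.
years = pd.Series([2000, 2000, 2005, 1999, 2005, 2005, 1995])
dist = years.value_counts(normalize=True).sort_index()
print(dist)
```

With the real `registration_year` column, the same `sort_index()` call would make the drop-off in listings toward older years easy to scan.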

=======================================================================================

Step 7/9 - Basic Data Exploration - Exploring Values Within Column Data - Aggregation

=======================================================================================

In [420]:
# Selecting the unique brands that account for more than 5% of listings
autos2 = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)
brand_bool = autos2 > 0.05
brand_autos = autos2[brand_bool]
print(brand_autos)

# Since Series.value_counts produces a Series indexed by brand, I can leverage that index
brand_autos.index
volkswagen       0.212822
opel             0.108635
bmw              0.108635
mercedes_benz    0.095764
audi             0.085837
ford             0.069691
Name: brand, dtype: float64
Out[420]:
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford'], dtype='object')
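The boolean-mask selection above can be distilled into a small sketch (with toy, hypothetical shares rather than the real brand counts): comparing a normalized `value_counts` Series to a threshold yields a boolean Series, which then filters the original.

```python
import pandas as pd

# Hypothetical brand shares, standing in for value_counts(normalize=True).
shares = pd.Series({"vw": 0.21, "ford": 0.07, "fiat": 0.02})

# Comparing to 0.05 produces a boolean mask; indexing with it keeps
# only the brands with more than a 5% share.
top = shares[shares > 0.05]
print(top.index.tolist())
```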
In [421]:
# Looping over the selected brands and assigning the mean price for each to the
# dictionary 'aggregate_data', with the brand name as the key.
aggregate_data = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    brand_mean = individual_brand['price'].mean()
    if brand not in aggregate_data:
        aggregate_data[brand] = brand_mean
In [422]:
aggregate_data
Out[422]:
{'volkswagen': 6600.382233404296,
 'opel': 5281.321706161138,
 'bmw': 8493.857251184834,
 'mercedes_benz': 8526.623225806452,
 'audi': 9212.9306621881,
 'ford': 7305.141843971631}

This is a good way to see which auto brands fall within a certain budget.
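As a side note, the loop-and-dictionary pattern above can also be expressed with `groupby`, which computes a per-brand mean in one line. A minimal sketch on toy data (hypothetical brands and prices, not the real listings):

```python
import pandas as pd

# Toy listings, used only to show that groupby reproduces the
# loop-and-dictionary aggregation.
autos = pd.DataFrame({
    "brand": ["vw", "vw", "bmw", "bmw", "opel"],
    "price": [5000, 7000, 9000, 8000, 4000],
})

# One mean price per brand, indexed by brand name.
mean_prices = autos.groupby("brand")["price"].mean()
print(mean_prices)
```

The explicit loop is a good learning exercise, but `groupby` is the idiomatic pandas tool for this kind of aggregation.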

============================================================================

Step 8/9 - Basic Data Exploration - Data Aggregation(Avg Mileage vs Price)

============================================================================

In [423]:
# From the dictionary 'aggregate_data', I create a pandas Series called 'brand_mean_prices'
brand_mean_prices = pd.Series(aggregate_data)
In [424]:
brand_mean_prices
Out[424]:
volkswagen       6600.382233
opel             5281.321706
bmw              8493.857251
mercedes_benz    8526.623226
audi             9212.930662
ford             7305.141844
dtype: float64
In [425]:
# Now the 'brand_mean_prices' Series has an index, and I can use it to create a dataframe with a column labeled 'mean_price'
df = pd.DataFrame(brand_mean_prices, columns=['mean_price'])
In [426]:
df
Out[426]:
mean_price
volkswagen 6600.382233
opel 5281.321706
bmw 8493.857251
mercedes_benz 8526.623226
audi 9212.930662
ford 7305.141844
In [427]:
# Calculating the mean mileage and mean price for each of the top brands and
# storing the results in dictionaries keyed by brand name.
mean_price_dict = {}
mean_mileage_dict = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    price_mean = individual_brand['price'].mean()
    mileage_mean = individual_brand['odometer_km'].mean()
    mean_price_dict[brand] = price_mean
    mean_mileage_dict[brand] = mileage_mean

In [428]:
print(mean_price_dict)
print(mean_mileage_dict)
{'volkswagen': 6600.382233404296, 'opel': 5281.321706161138, 'bmw': 8493.857251184834, 'mercedes_benz': 8526.623225806452, 'audi': 9212.9306621881, 'ford': 7305.141843971631}
{'volkswagen': 128961.67989161989, 'opel': 129455.92417061611, 'bmw': 132686.2559241706, 'mercedes_benz': 130848.3870967742, 'audi': 129492.56238003839, 'ford': 124327.71867612294}
In [429]:
# Converting both dictionaries to series objects
prices_mean = pd.Series(mean_price_dict)
mileage_mean = pd.Series(mean_mileage_dict)
print(prices_mean)
print(mileage_mean)
volkswagen       6600.382233
opel             5281.321706
bmw              8493.857251
mercedes_benz    8526.623226
audi             9212.930662
ford             7305.141844
dtype: float64
volkswagen       128961.679892
opel             129455.924171
bmw              132686.255924
mercedes_benz    130848.387097
audi             129492.562380
ford             124327.718676
dtype: float64
In [430]:
# Creating a dataframe ('mileage_price') from the 'mileage_mean' Series object using the DataFrame constructor.
mileage_price = pd.DataFrame(mileage_mean, columns=['mean_mileage'])
mileage_price
Out[430]:
mean_mileage
volkswagen 128961.679892
opel 129455.924171
bmw 132686.255924
mercedes_benz 130848.387097
audi 129492.562380
ford 124327.718676
In [431]:
# Assigning the mean_price series as a new column in 'mileage_price'.
mileage_price['mean_price'] = prices_mean
mileage_price
Out[431]:
mean_mileage mean_price
volkswagen 128961.679892 6600.382233
opel 129455.924171 5281.321706
bmw 132686.255924 8493.857251
mercedes_benz 130848.387097 8526.623226
audi 129492.562380 9212.930662
ford 124327.718676 7305.141844

Comparing mean mileage against mean price helps narrow down a possible purchase, and the results line up with what we usually see on the market: Ford, VW, and Opel sit in one price range, while BMW, Audi, and Mercedes-Benz are a step above. Having this table in front of you makes it much easier to decide which brand to go for.
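To make that comparison even easier to read, the combined dataframe can be sorted by mean price. A short sketch using rounded, illustrative numbers for three of the brands (not the exact values computed above):

```python
import pandas as pd

# Illustrative aggregated results, standing in for the 'mileage_price'
# dataframe built above.
mileage_price = pd.DataFrame({
    "mean_mileage": [128961.7, 129455.9, 132686.3],
    "mean_price": [6600.4, 5281.3, 8493.9],
}, index=["volkswagen", "opel", "bmw"])

# Ranking brands from most to least expensive on average.
ranked = mileage_price.sort_values("mean_price", ascending=False)
print(ranked)
```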