====================================================
====================================================
I will be working with a dataset consisting of information on 50,000 used cars. The 50,000 data points were sampled from the full dataset to ensure the code runs quickly in the hosted environment.
The goal of this project is to clean the data and analyze the included used car listings, as well as to become familiar with some of the unique benefits Jupyter Notebook provides for pandas.
Read the autos.csv CSV file into pandas, and assign it to the variable name autos.
Try without specifying any encoding (which will default to UTF-8)
If you get an encoding error, try the next two most popular encodings (Latin-1 and Windows-1252)
until you are able to read the file without error.
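Before settling on an encoding, that trial-and-error can be scripted; here is a minimal sketch, assuming autos.csv sits in the working directory (Latin-1 decodes any byte sequence, so it acts as a final fallback):
import pandas as pd
# Try the most common encodings in order until one succeeds
for encoding in ("utf-8", "latin-1", "windows-1252"):
    try:
        autos = pd.read_csv("autos.csv", encoding=encoding)
        print("Read succeeded with encoding:", encoding)
        break
    except UnicodeDecodeError:
        print("Read failed with encoding:", encoding)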
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Latin-1")
Create a new cell with just the variable autos and run this cell.
A neat feature of Jupyter Notebook is its ability to render the first few and last few values of any pandas object.
autos.head(5)
Use the DataFrame.info() and DataFrame.head() methods to print information about the autos dataframe, as well as the first few rows.
autos.info()
autos.head()
Write a markdown cell briefly describing your observations.
Looking at the dataframe, I can see it's a mix of object and integer columns. There are 20 columns in total, some of which have column names that will need to be cleaned up.
=======================================
=======================================
Use the DataFrame.columns attribute to print an array of the existing column names.
autos.columns
Copy that array and make the following edits to column names:
column_name_corrections = {
'dateCrawled' : 'date_crawled',
'name' : 'name',
'seller' : 'seller',
'offerType' : 'offer_type',
'price' : 'price',
'abtest' : 'ab_test',
'vehicleType' : 'vehicle_type',
'yearOfRegistration' : 'registration_year',
'gearbox' : 'gear_box',
'powerPS' : 'power_ps',
'model' : 'model',
'odometer' : 'odometer',
'monthOfRegistration' : 'registration_month',
'fuelType' : 'fuel_type',
'brand' : 'brand',
'notRepairedDamage' : 'unrepaired_damage',
'dateCreated' : 'ad_created',
'nrOfPictures' : 'num_pictures',
'postalCode' : 'postal_code',
'lastSeen' : 'last_seen'
}
Assign the modified column names back to the DataFrame.columns attribute.
autos.columns
pd.__version__
# I'm using pandas 0.24. On older versions, where Index.map() may not accept a dictionary,
# this will fail; use the command commented out below instead.
autos.columns = autos.columns.map(column_name_corrections)
# autos.columns = pd.Series(autos.columns).map(column_name_corrections)
autos.columns
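As a version-independent alternative, DataFrame.rename() accepts the same mapping dictionary; a sketch (equivalent to the map() call above when run on the original column names):
# Rename columns via the mapping; keys not present in the columns are simply ignored
autos = autos.rename(columns=column_name_corrections)
autos.columns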
==================================
==================================
Now it's time to determine what other cleaning tasks need to be done. Things to look for:
- Text columns where all or almost all values are the same. These can often be dropped as they don't contain useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.
The following methods are helpful for exploring the data:
- DataFrame.describe() (with include='all' to get both categorical and numeric columns)
- Series.value_counts() and Series.head() if any columns need a closer look.
Use DataFrame.describe() to look at descriptive statistics for all columns.
autos.describe(include='all')
Write a markdown cell noting my observations:
Answer: It looks like both the price and odometer columns hold numeric values but are actually stored as text. Also, 'seller' and 'offer_type' seem to contain the same value in nearly every row.
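To confirm the suspicion about 'seller' and 'offer_type', a quick sketch with Series.value_counts():
# If each column holds essentially one value, it carries no useful signal
print(autos['seller'].value_counts())
print(autos['offer_type'].value_counts())
# Near-constant columns could then be dropped, e.g.:
# autos = autos.drop(['seller', 'offer_type'], axis=1)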
Removing any non-numeric characters.
# regex=False makes each replacement literal, so '$' isn't treated as a regex anchor
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
Converting the column to a numeric dtype.
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].astype(int)
dtypes = autos.dtypes
print(dtypes)
Using DataFrame.rename() to rename the column to odometer_km.
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos.columns
===============================================================
===============================================================
For each of the odometer_km and price columns, I'll use the following techniques to explore the data:
Series.unique().shape to see how many unique values there are:
print(autos['price'].unique().shape)
print(autos['odometer_km'].unique().shape)
Series.describe() to view min/max/median/mean, etc.:
autos[['price','odometer_km']].describe()
Series.value_counts() (chained to .head() if there are lots of values):
autos['price'].value_counts().head()
autos['odometer_km'].value_counts().head()
Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending=True or False to view the highest and lowest values with their counts (we can also chain to head() here).
autos['price'].sort_values(ascending=False)
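And the sort_index() technique just described, as a sketch applied to the price counts:
# Highest and lowest prices with their counts
print(autos['price'].value_counts().sort_index(ascending=False).head())
print(autos['price'].value_counts().sort_index(ascending=True).head())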
autos.columns
# Iterating over a DataFrame yields column names, not rows,
# so use iterrows() to inspect the first few rows themselves
for index, row in autos.head(5).iterrows():
    print(index, row['price'])
# Look at individual entries
autos.iloc[26772]
autos['price'].sort_values(ascending=True).head(1)
If you find there are outliers, remove them and write a markdown paragraph explaining your decision.
After you have removed the outliers, make some observations about the remaining values.
To do this, I took the min + 1 and the max - 1, then made autos equal to everything in between.
# find the min value in the price column, plus 1 so the $0 listings are excluded
min_price = autos['price'].min() + 1
print(min_price)
# Find the max value in the price column, minus 1 so the outlier max is excluded
max_price = autos['price'].max() - 1
print(max_price)
autos = autos[autos["price"].between(min_price,max_price)]
autos['price'].sort_values(ascending=False)
Another way is to find the min and max and keep only the rows that match neither:
# find the min value in price column
min_price = autos['price'].min()
print(min_price)
# Find the max value in price column
max_price = autos['price'].max()
print(max_price)
# Remove any rows with price equal to the min value
autos = autos[autos['price'] != min_price]
# Remove any rows with price equal to the max value
autos = autos[autos['price'] != max_price]
autos['price'].sort_values(ascending=False)
Observations: There was an entry of $99 million and a number of entries at $0. I removed them and kept everything in between.
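As a quick sanity check on the filtered data (a sketch):
# Verify the extremes are gone and inspect the remaining range
print(autos['price'].describe())
print(autos['price'].sort_values().head())   # lowest remaining prices
print(autos['price'].sort_values().tail())   # highest remaining prices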
==============================================================================
==============================================================================
Looking at the columns that contain date data, some are populated by the crawler (date_crawled and last_seen) and some are grabbed from the website itself (ad_created, registration_month and registration_year).
It also looks like date_crawled, last_seen and ad_created are stored as string values, whereas the others are represented numerically.
I will need to convert the string values into a usable representation in order to understand the data quantitatively by running Series.describe().
Printing out the three columns, I can see they are represented as full timestamp values:
autos[['date_crawled','ad_created','last_seen']][0:5]
I can see that the first 10 characters represent the date (year-month-day):
print(autos['date_crawled'].head().str[:10])
Next, extract the date component from each column:
# extract only the year-month-day component for 'date_crawled'
dmy_date_crawled = autos['date_crawled'].str[:10]
dmy_date_crawled.head()
# extract only the year-month-day component for 'ad_created'
dmy_ad_created = autos['ad_created'].str[:10]
dmy_ad_created.head()
# extract only the year-month-day component for 'last_seen'
dmy_last_seen = autos['last_seen'].str[:10]
dmy_last_seen.head()
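As an alternative to string slicing, the timestamps could be parsed properly with pd.to_datetime; a minimal sketch, assuming the ISO-like format shown above:
# Parse the full timestamps and keep just the calendar date
crawled_dates = pd.to_datetime(autos['date_crawled']).dt.date
crawled_dates.head()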
# Get the distribution of 'date_crawled' as percentages (including missing values), sorted by date
dmy_dc = dmy_date_crawled.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_dc)
print(dmy_dc.describe())
# Get the distribution of 'ad_created' as percentages (including missing values), sorted by date
dmy_ac = dmy_ad_created.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_ac)
print(dmy_ac.describe())
# Get the distribution of 'last_seen' as percentages (including missing values), sorted by date
dmy_ls = dmy_last_seen.value_counts(normalize=True, dropna=False).sort_index()
print(dmy_ls)
print(dmy_ls.describe())
date_crawled and last_seen seem to correlate in that they cover the same number of distinct days (34 entries each).
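A quick sketch to check that claim by comparing the number of distinct days in each distribution:
# Number of distinct crawl days vs. distinct last-seen days
print(dmy_dc.shape[0], dmy_ls.shape[0])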
autos['registration_year'].describe()
registration_year looks to have some invalid years on the min and max side. I'll need to address that next
==============================================================================
==============================================================================
From my last exercise, registration_year has some unrealistic values: a min of 1000 and a max of 9999. Also, looking at the dates in date_crawled, ad_created and last_seen, I saw the range of years was roughly 2015-2016. Something is not matching up. To dig in deeper, I'll need to figure out the correct range of years.
autos['registration_year'].sort_values()
Looking at the above output, I can see the first four entries fall outside the realistic range in which a car could have been registered. The same goes for the last several entries, which are quite far into the future. Based on these observations, I will take the year range 1910 through 2019 and filter out everything that falls outside of it.
autos = autos[autos["registration_year"].between(1910,2019)]
autos['registration_year'].describe()
autos['registration_year'].sort_values()
autos['registration_year'].value_counts(normalize=True)
All the values outside our range have been removed. Looking at the distribution, it's clear that the further back the registration year, the fewer the listings.
=======================================================================================
=======================================================================================
# Select the unique brands in the dataset that have a presence of > 5%
autos2 = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)
brand_bool = autos2 > 0.05
brand_autos = autos2[brand_bool]
print(brand_autos)
# Since Series.value_counts() produces a series indexed by brand, I can leverage that index
brand_autos.index
# Loop over the selected brands and assign the mean price for each brand
# to the dictionary 'aggregate_data', with the brand name as the key.
aggregate_data = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    brand_mean = individual_brand['price'].mean()
    aggregate_data[brand] = brand_mean
aggregate_data
This is a good way to see which auto brands fall within a certain budget.
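For example, sorting the dictionary by mean price makes the budget comparison immediate; a quick sketch:
# Brands ordered from cheapest to most expensive mean price
for brand, mean_price in sorted(aggregate_data.items(), key=lambda kv: kv[1]):
    print(brand, round(mean_price))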
============================================================================
============================================================================
# From the dictionary 'aggregate_data', I create a pandas Series called 'brand_mean_prices'
brand_mean_prices = pd.Series(aggregate_data)
brand_mean_prices
# Now, the 'brand_mean_prices' series has an index and I can use that to create a dataframe with a column labeled 'mean_price'
df = pd.DataFrame(brand_mean_prices, columns=['mean_price'])
df
# Calculating the mean mileage and mean price for each of the top brands
# and storing the results in dictionaries keyed by brand name.
mean_price_dict = {}
mean_mileage_dict = {}
for brand in brand_autos.index:
    individual_brand_bool = autos['brand'] == brand
    individual_brand = autos[individual_brand_bool]
    price_mean = individual_brand['price'].mean()
    mileage_mean = individual_brand['odometer_km'].mean()
    mean_price_dict[brand] = price_mean
    mean_mileage_dict[brand] = mileage_mean
print(mean_price_dict)
print(mean_mileage_dict)
# Converting both dictionaries to series objects
prices_mean = pd.Series(mean_price_dict)
mileage_mean = pd.Series(mean_mileage_dict)
print(prices_mean)
print(mileage_mean)
# Creating a dataframe ('mileage_price') from the 'mileage_mean' series object using the DataFrame constructor.
mileage_price = pd.DataFrame(mileage_mean, columns=['mean_mileage'])
mileage_price
# Assigning the mean_price series as a new column in 'mileage_price'.
mileage_price['mean_price'] = prices_mean
mileage_price
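To rank the candidates directly, the combined dataframe can be sorted by either column; a sketch:
# Rank the top brands by mean price, descending
mileage_price.sort_values('mean_price', ascending=False)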
Looking at mean mileage vs. mean price, I've narrowed down a possible purchase. This is in line with what we usually see on the market: Ford, VW and Opel sit in one price range, while BMW, Audi and Mercedes-Benz sit a bit above that. Having this in front of you can definitely help with deciding which one to go for.