Project 1 (Guided) - Profitable App Profiles for the App Store and Google Play Markets


Project Description


This project analyzes data from the AppStore and Google Play marketplaces and filters out all the paid apps. It will then segregate based on number of users. The goal of this project is to give developrs insight into what types of apps generate will be most likely to attract the largest amount of users.

==================================

1. Creating Datasets From CSV Files
==================================

  1. Open the two data sets we mentioned above, and save both as lists of lists.
In [7]:
# Import module needed to read csv's
from csv import reader

# Import goole csv and assign dataset to variable 'google/apple_dataset'
opened_google_file = open('googleplaystore.csv')
opened_apple_file = open('AppleStore.csv')
read_google_file = reader(opened_google_file)
read_apple_file = reader(opened_apple_file)
google_dataset = list(read_google_file)
apple_dataset = list(read_apple_file)

# seperate header from rest of data in the dataset
google_header = google_dataset[0]
apple_header = apple_dataset[0]
google_dataset = google_dataset[1:]
apple_dataset = apple_dataset[1:]

#Check to see if the both data sets are Lists
print("google_dataset type is a list of a", type(google_dataset))
print("apple_dataset type is a list of a", type(apple_dataset),'\n\n')
google_dataset type is a list of a <class 'list'>
apple_dataset type is a list of a <class 'list'> 


  1. Explore both data sets using the explore_data() function.
    • Print the first few rows of each data set.
    • Find the number of rows and columns of each data set (recall that the function assumes the argument for the dataset parameter doesn't have a header row).
In [8]:
# We can use this function(created by dq), to periodically check 
# our datasets to make sure all matches up
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# For Google MarketPlace
print("Google MarketPlace")
print("==================")
google_mplace_data = explore_data(google_dataset, 0, 3)
print("\n")
# print(google_mplace_data)
        
# For AppStore
print("App Store")
print("=========")
app_store_data = explore_data(apple_dataset, 0, 3)
print("\n\n")
Google MarketPlace
==================
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


App Store
=========
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16



  1. Print the column names and try to identify the columns that could help us with our analysis. Use the documentation of the data sets if you're having trouble understanding what a column describes. Add a link to the documentation for readers if you think the column names are not descriptive enough.
In [9]:
# Google dataset column names
print("Google Columns\n==============\n", google_header,'\n\n')

# Apple dataset column names
print("Apple Columns\n=============\n", apple_header,'\n\n')
Google Columns
==============
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 


Apple Columns
=============
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 


=============
2. Data Cleaning
=============

1. Removing Rows with Errors

  1. Check dataset to see if any rows have errors. I do this by comparing the len of each row to the len of the header. If it doesn't match, I deleted it and print the row index #
In [10]:
# Removes errors from each dataset
def del_rows_w_errs(dataset):
    if len(dataset[0]) == len(google_header):
        # For google dataset
        g_count = 0
        for row in dataset:
            if len(row) != len(google_header):
                print(row)
                bad_row_num = google_dataset.index(row)
                print("Row number ", bad_row_num, "had an error and is deleted")
                del google_dataset[bad_row_num]
                g_count += 1
        return "For google_dataset, there was a total of", g_count, "errors"
    else:
        # For apple dataset
        a_count = 0
        for row in dataset:
            if len(row) != len(apple_header):
                print(row)
                bad_row_num = apple_dataset.index(row)
                print("Row number ", bad_row_num, "had an error and is deleted")
                del apple_dataset[bad_row_num]
                a_count += 1
        return "For apple_dataset, there was a total of", a_count, "errors"


print(del_rows_w_errs(google_dataset))
print(del_rows_w_errs(apple_dataset))
google_ds_no_errs = google_dataset
apple_ds_no_errs = apple_dataset
print(len(google_ds_no_errs))
print(len(apple_ds_no_errs))
print('\n')
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Row number  10472 had an error and is deleted
('For google_dataset, there was a total of', 1, 'errors')
('For apple_dataset, there was a total of', 0, 'errors')
10840
7197


In [19]:
print(google_ds_no_errs == google_dataset)
print(apple_ds_no_errs == apple_dataset)
True
True
In [132]:
print("Working google dataset = 'google_ds_no_errs'")
print("Working apple dataset = 'apple_ds_no_errs'")
Working google dataset = 'google_ds_no_errs'
Working apple dataset = 'apple_ds_no_errs'

2. Identifying Duplicate Entries

In the next couple of sections, I will filter out which apps have duplicates in both the google mkt place and app store.

  1. Lists - I will be iterating over the name column of each dataset and comparing to what's in each of the lists I created for unique and duplicate elements. If an element exists in unique, it gets entered into the duplicate list. If it does not exist, then the unique list will be appended with that element exist, then the unique list will be appended with that element
In [26]:
# seperating duplicates from google data
duplicate_google_apps = []
unique_google_apps = []
duplicate_apple_apps = []
unique_apple_apps = []
def dedup(dataset):
    if len(dataset[0]) == len(google_header):
        for row in google_ds_no_errs:
            name = row[0]
            if name in unique_google_apps:
                duplicate_google_apps.append(name)
            else:
                unique_google_apps.append(name)
    else:
        # seperating duplicates from apple data
        for row in apple_ds_no_errs:
            name = row[1]
            if name in unique_apple_apps:
                duplicate_apple_apps.append(name)
            else:
                unique_apple_apps.append(name)

dedup(google_ds_no_errs)
dedup(apple_ds_no_errs)
google_ds_no_errs_ddup = unique_google_apps
apple_ds_no_errs_ddup = unique_apple_apps

print("\nGoogle Marketplace")
print("==================")
print("Duplicate Apps -", len(duplicate_google_apps))
print("Unique Apps -",len(google_ds_no_errs_ddup))
print("Total Apps -", len(google_ds_no_errs),"\n")

print("\nAppStore")
print("========")
print("Duplicate Apps -", len(duplicate_apple_apps))
print("Unique Apps -",len(apple_ds_no_errs_ddup))
print("Total Apps -", len(apple_ds_no_errs),"\n")
Google Marketplace
==================
Duplicate Apps - 1181
Unique Apps - 9659 (Working file = 'google_ds_no_errs_ddup')
Total Apps - 10840 


AppStore
========
Duplicate Apps - 2
Unique Apps - 7195 (Working file = 'apple_ds_no_errs_ddup')
Total Apps - 7197 

  1. Dictionaries - Next,I will create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app. In the end, it will acheive the same dedup functionality as above, but this time utilizing dictionaries.
In [165]:
new_google_ds_no_errs_ddup = []
google_dups = []

new_apple_ds_no_errs_ddup = []
apple_dups = []

         ### Step 1 - create dict with unique keys  ####

def ddup_dict(dataset):
    if len(dataset[0]) == len(google_header):
        reviews_max = {}
        for app in google_ds_no_errs:
            name = app[0]
            n_reviews = float(app[3])
            if name in reviews_max and reviews_max[name] > n_reviews:
                reviews_max[name] = n_reviews
            elif name not in reviews_max:
                reviews_max[name] = n_reviews
        print('Found', len(google_ds_no_errs) - len(reviews_max), 'duplicate/s')

        # Inspect the dictionary to make sure everything went as expected. 
        # Measure length of the dictionary — expected length = 9,659
        print("length of reviews_max dict for google dataset is", len(reviews_max))

        ### Step 2 - use dict to create new dataset with unique entries  ####
        # Using the dictionary I created above to remove the duplicate rows
        # I will create a new dataset which is void of errors and duplicates
        for app in google_ds_no_errs:
            name = app[0]
            n_reviews = float(app[3])
            if n_reviews == reviews_max[name] and name not in google_dups:
                new_google_ds_no_errs_ddup.append(app)
                google_dups.append(name)
    elif len(dataset[0]) == len(apple_header):
        reviews_max = {}
        for app in apple_ds_no_errs:
            name = app[1]
            n_reviews = float(app[5])
            if name in reviews_max and reviews_max[name] > n_reviews:
                reviews_max[name] = n_reviews
            elif name not in reviews_max:
                reviews_max[name] = n_reviews
        print('Found', len(apple_ds_no_errs) - len(reviews_max), 'duplicate/s')
        print("length of reviews_max dict for apple dataset is", len(reviews_max))

        for app in apple_ds_no_errs:
            name = app[1]
            n_reviews = float(app[5])
            if n_reviews == reviews_max[name] and name not in apple_dups:
                new_apple_ds_no_errs_ddup.append(app)
                apple_dups.append(name)
    
            
ddup_dict(google_ds_no_errs)
ddup_dict(apple_ds_no_errs)
print("length of 'new_google_ds_no_errs_ddup' is",len(new_google_ds_no_errs_ddup))
print("length of 'new_apple_ds_no_errs_ddup' is",len(new_apple_ds_no_errs_ddup))
Found 1181 duplicate/s
length of reviews_max dict for google dataset is 9659
Found 2 duplicate/s
length of reviews_max dict for apple dataset is 7195
length of 'new_google_ds_no_errs_ddup' is 9659
length of 'new_apple_ds_no_errs_ddup' is 7195
In [164]:
explore_data(new_google_ds_no_errs_ddup, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
In [114]:
explore_data(new_apple_ds_no_errs_ddup, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7195
Number of columns: 16
In [133]:
print("Working google dataset = 'new_google_ds_no_errs_ddup'")
print("Working apple dataset = 'new_apple_ds_no_errs_ddup'")
Working google dataset = 'new_google_ds_no_errs_ddup'
Working apple dataset = 'new_apple_ds_no_errs_ddup'

3. Identifying Rows with Non-English Entries

  1. Both data sets have apps with names that suggest they are not directed toward an English-speaking audience. We need to remove these. First, we will work on identifying the non-English rows.
In [99]:
# Write a function that takes in a string and returns False if
# there's any character in the string that doesn't belong to the
# set of common English characters, otherwise it returns True.
def english_or_not(string):
    for char in string:
        if ord(char) > 127:
            return False
        
    return True
print(english_or_not('Instagram'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))
True
False
False
False
In [116]:
# This is a function that takes in a string and returns False if
# there any 3 characters in the string do not belong to the
# set of common English characters, otherwise it returns True.
# This is a better way than above as it's based on a larger sampling
def english_or_not(string):
    non_english = 0
    
    for character in string:
        if ord(character) > 127:
            non_english += 1
    
    if non_english > 3:
        return "not english"
    else:
        return "english"
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
english
english
not english
In [129]:
# Using the function above, this function will segregate english
# and non_english apps for any dataset into seperate lists
google_ds_no_errs_ddup_eng = []
google_non_eng_apps = []

apple_ds_no_errs_ddup_eng = []
apple_non_eng_apps = []

def eng_only(dataset):
    if len(dataset[0]) == len(google_header):
        for row in dataset:
            if english_or_not(row[0]) == 'english':
                google_ds_no_errs_ddup_eng.append(row)
            else:
                google_non_eng_apps.append(row)
        print("There were", len(google_non_eng_apps), "non english google apps")
    elif len(dataset[0]) == len(apple_header):
        for row in dataset:
            if english_or_not(row[1]) == 'english':
                apple_ds_no_errs_ddup_eng.append(row)
            else:
                apple_non_eng_apps.append(row)
        print("There were", len(apple_non_eng_apps), "non english apple apps")

eng_only(new_google_ds_no_errs_ddup)
eng_only(new_apple_ds_no_errs_ddup)
print("length of 'google_ds_no_errs_ddup_eng' is",len(google_ds_no_errs_ddup_eng))
print("length of 'apple_ds_no_errs_ddup_eng' is",len(apple_ds_no_errs_ddup_eng))
There were 45 non english google apps
There were 1014 non english apple apps
length of 'google_ds_no_errs_ddup_eng' is 9614
length of 'apple_ds_no_errs_ddup_eng' is 6181
In [126]:
explore_data(google_ds_no_errs_ddup_eng, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13
In [127]:
explore_data(apple_ds_no_errs_ddup_eng, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6181
Number of columns: 16
In [131]:
print("Working google dataset = 'google_ds_no_errs_ddup_eng'")
print("Working apple dataset = 'apple_ds_no_errs_ddup_eng'")
Working google dataset = 'google_ds_no_errs_ddup_eng'
Working apple dataset = 'apple_ds_no_errs_ddup_eng'

4. Identifying Free Apps

In the next section, I will filter out which apps are free in both the google mkt place and app store.

  1. This is the final step. Once finished, we will have clean datasets that include only free apps in english without any errors or duplicates
In [135]:
print(google_header,'\n')
print(apple_header,'\n')
print('Using the header from the original datasets, we can see that the price column is at:\n')
print('index 7 for Google Marketplace\n and\nindex 4 for App Store')
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

Using the header from the original datasets, we can see that the price column is at:

index 7 for Google Marketplace
 and
index 4 for App Store
In [147]:
google_ds_final = []
apple_ds_final = []

def free_apps(dataset):
    if len(dataset[0]) == len(google_header):
        for app in dataset:
            price = app[7]
            if price == '0':
                google_ds_final.append(app)
    elif len(dataset[0]) == len(apple_header):
        for app in dataset:
            price = app[4]
            if price == '0.0':
                apple_ds_final.append(app)

free_apps(google_ds_no_errs_ddup_eng)
free_apps(apple_ds_no_errs_ddup_eng)
print("google_ds_final =",len(google_ds_final),"apps")
print("apple_ds_final =",len(apple_ds_final),"apps")
True
google_ds_final = 8862 apps
apple_ds_final = 3220 apps
In [148]:
print("Working google dataset = 'google_ds_final'")
print("Working apple dataset = 'apple_ds_final'")
Working google dataset = 'google_ds_final'
Working apple dataset = 'apple_ds_final'

==============
3. Data Analyzing
==============

1. Identifying Columns with Most Pertinent Info

To begin analysis, we need to identify which columns in our dataset would give us the best insight into the most common genres in each market. We could then use the columns to generate frequency tables

Google - Category[1], Genres[9]
Apple - prime_genre[11]

In [159]:
print("google - " + google_header[1]+"[1] and " + google_header[9]+"[9]")
print("apple - " + apple_header[11]+"[11]")
google - Category[1] and Genres[9]
apple - prime_genre[11]
In [160]:
from pprint import pprint
# Makes sure the variable 'index' is n integer
def check_int(integer):
    if type(integer) == str and integer.isdigit() == False:
        x = input("type a number please ")
        check_int(x)
    else:
        x = int(integer)
        return x
# creates frequency table for any column we choose from dataset
def freq_table(dataset,index):
    frequency_table = {}
    total_num_of_apps = len(dataset)
    if check_int(index) == index:
        for row in dataset:
            if row[index] in frequency_table:
                frequency_table[row[index]] += 1
            else:
                frequency_table[row[index]] = 1
# displays frequency table in percentages                
    for item in frequency_table:
        frequency_table[item]/= total_num_of_apps
        frequency_table[item] = (format(frequency_table[item]*100, '.2f')+"%")
    return frequency_table
    
results = freq_table(google_ds_final,1)
print("google - Categories")
print("===================")
pprint(results)
print(len(results))
google - Categories
===================
{'ART_AND_DESIGN': '0.68%',
 'AUTO_AND_VEHICLES': '0.93%',
 'BEAUTY': '0.60%',
 'BOOKS_AND_REFERENCE': '2.14%',
 'BUSINESS': '4.58%',
 'COMICS': '0.62%',
 'COMMUNICATION': '3.25%',
 'DATING': '1.86%',
 'EDUCATION': '1.25%',
 'ENTERTAINMENT': '1.04%',
 'EVENTS': '0.71%',
 'FAMILY': '18.79%',
 'FINANCE': '3.70%',
 'FOOD_AND_DRINK': '1.24%',
 'GAME': '9.64%',
 'HEALTH_AND_FITNESS': '3.07%',
 'HOUSE_AND_HOME': '0.84%',
 'LIBRARIES_AND_DEMO': '0.94%',
 'LIFESTYLE': '3.90%',
 'MAPS_AND_NAVIGATION': '1.40%',
 'MEDICAL': '3.53%',
 'NEWS_AND_MAGAZINES': '2.80%',
 'PARENTING': '0.65%',
 'PERSONALIZATION': '3.32%',
 'PHOTOGRAPHY': '2.95%',
 'PRODUCTIVITY': '3.89%',
 'SHOPPING': '2.25%',
 'SOCIAL': '2.66%',
 'SPORTS': '3.42%',
 'TOOLS': '8.44%',
 'TRAVEL_AND_LOCAL': '2.34%',
 'VIDEO_PLAYERS': '1.78%',
 'WEATHER': '0.80%'}
33
In [180]:
# uses function above to sort percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
print("GOOGLE GENRE COLUMN [9]")
print("====================")
google_genre_results = (display_table(google_ds_final, 9))
print("\nGOOGLE CATEGORY COLUMN [1]")
print("=======================")
google_category_results = (display_table(google_ds_final, 1))
print("\nAPPLE PRIME GENRE COLUMN [11]")
print("=========================")
apple_genre_results = (display_table(apple_ds_final, 11))
GOOGLE GENRE COLUMN [9]
====================
Tools : 8.43%
Entertainment : 6.07%
Education : 5.35%
Business : 4.58%
Productivity : 3.89%
Lifestyle : 3.89%
Finance : 3.70%
Medical : 3.53%
Sports : 3.46%
Personalization : 3.32%
Communication : 3.25%
Action : 3.10%
Health & Fitness : 3.07%
Photography : 2.95%
News & Magazines : 2.80%
Social : 2.66%
Travel & Local : 2.32%
Shopping : 2.25%
Books & Reference : 2.14%
Simulation : 2.04%
Dating : 1.86%
Arcade : 1.86%
Video Players & Editors : 1.78%
Casual : 1.75%
Maps & Navigation : 1.40%
Food & Drink : 1.24%
Puzzle : 1.13%
Racing : 0.99%
Role Playing : 0.94%
Libraries & Demo : 0.94%
Auto & Vehicles : 0.93%
Strategy : 0.91%
House & Home : 0.84%
Weather : 0.80%
Events : 0.71%
Adventure : 0.68%
Comics : 0.61%
Beauty : 0.60%
Art & Design : 0.60%
Parenting : 0.50%
Card : 0.44%
Casino : 0.43%
Trivia : 0.42%
Educational;Education : 0.39%
Educational : 0.37%
Board : 0.37%
Education;Education : 0.34%
Word : 0.26%
Casual;Pretend Play : 0.24%
Music : 0.20%
Racing;Action & Adventure : 0.17%
Puzzle;Brain Games : 0.17%
Entertainment;Music & Video : 0.17%
Casual;Brain Games : 0.14%
Casual;Action & Adventure : 0.14%
Arcade;Action & Adventure : 0.12%
Action;Action & Adventure : 0.10%
Educational;Pretend Play : 0.09%
Board;Brain Games : 0.09%
Simulation;Action & Adventure : 0.08%
Parenting;Education : 0.08%
Entertainment;Brain Games : 0.08%
Parenting;Music & Video : 0.07%
Educational;Brain Games : 0.07%
Casual;Creativity : 0.07%
Art & Design;Creativity : 0.07%
Education;Pretend Play : 0.06%
Role Playing;Pretend Play : 0.05%
Education;Creativity : 0.05%
Role Playing;Action & Adventure : 0.03%
Puzzle;Action & Adventure : 0.03%
Entertainment;Creativity : 0.03%
Entertainment;Action & Adventure : 0.03%
Educational;Creativity : 0.03%
Educational;Action & Adventure : 0.03%
Education;Music & Video : 0.03%
Education;Brain Games : 0.03%
Education;Action & Adventure : 0.03%
Adventure;Action & Adventure : 0.03%
Video Players & Editors;Music & Video : 0.02%
Sports;Action & Adventure : 0.02%
Simulation;Pretend Play : 0.02%
Puzzle;Creativity : 0.02%
Music;Music & Video : 0.02%
Entertainment;Pretend Play : 0.02%
Casual;Education : 0.02%
Board;Action & Adventure : 0.02%
Trivia;Education : 0.01%
Travel & Local;Action & Adventure : 0.01%
Tools;Education : 0.01%
Strategy;Education : 0.01%
Strategy;Creativity : 0.01%
Strategy;Action & Adventure : 0.01%
Simulation;Education : 0.01%
Role Playing;Brain Games : 0.01%
Racing;Pretend Play : 0.01%
Puzzle;Education : 0.01%
Parenting;Brain Games : 0.01%
Music & Audio;Music & Video : 0.01%
Lifestyle;Pretend Play : 0.01%
Lifestyle;Education : 0.01%
Health & Fitness;Education : 0.01%
Health & Fitness;Action & Adventure : 0.01%
Entertainment;Education : 0.01%
Communication;Creativity : 0.01%
Comics;Creativity : 0.01%
Casual;Music & Video : 0.01%
Card;Brain Games : 0.01%
Card;Action & Adventure : 0.01%
Books & Reference;Education : 0.01%
Art & Design;Pretend Play : 0.01%
Art & Design;Action & Adventure : 0.01%
Arcade;Pretend Play : 0.01%
Adventure;Education : 0.01%

GOOGLE CATEGORY COLUMN [1]
=======================
GAME : 9.64%
TOOLS : 8.44%
BUSINESS : 4.58%
LIFESTYLE : 3.90%
PRODUCTIVITY : 3.89%
FINANCE : 3.70%
MEDICAL : 3.53%
SPORTS : 3.42%
PERSONALIZATION : 3.32%
COMMUNICATION : 3.25%
HEALTH_AND_FITNESS : 3.07%
PHOTOGRAPHY : 2.95%
NEWS_AND_MAGAZINES : 2.80%
SOCIAL : 2.66%
TRAVEL_AND_LOCAL : 2.34%
SHOPPING : 2.25%
BOOKS_AND_REFERENCE : 2.14%
FAMILY : 18.79%
DATING : 1.86%
VIDEO_PLAYERS : 1.78%
MAPS_AND_NAVIGATION : 1.40%
EDUCATION : 1.25%
FOOD_AND_DRINK : 1.24%
ENTERTAINMENT : 1.04%
LIBRARIES_AND_DEMO : 0.94%
AUTO_AND_VEHICLES : 0.93%
HOUSE_AND_HOME : 0.84%
WEATHER : 0.80%
EVENTS : 0.71%
ART_AND_DESIGN : 0.68%
PARENTING : 0.65%
COMICS : 0.62%
BEAUTY : 0.60%

APPLE PRIME GENRE COLUMN [11]
=========================
Entertainment : 7.89%
Games : 58.14%
Photo & Video : 4.97%
Education : 3.66%
Social Networking : 3.29%
Shopping : 2.61%
Utilities : 2.52%
Sports : 2.14%
Music : 2.05%
Health & Fitness : 2.02%
Productivity : 1.74%
Lifestyle : 1.58%
News : 1.34%
Travel : 1.24%
Finance : 1.12%
Weather : 0.87%
Food & Drink : 0.81%
Reference : 0.56%
Business : 0.53%
Book : 0.43%
Navigation : 0.19%
Medical : 0.19%
Catalogs : 0.12%

Questions

  1. Analyzing the frequency table generated for the prime_genre column of the App Store data set.
    • The most common genre is: Games
      The runner-up is: Entertainment
    • What other patterns do you see?
      Business, Productivity and Education are very low
    • What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
      Games and Entertainment
    • Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
      Not necessarily. Would be interesting to compare the number of users agains number of apps in a genre
  2. Analyze the frequency table you generated for the Category and Genres column of the Google Play data set.
    • What are the most common genres?
      Family and games. In that order
    • What other patterns do you see?
      Business and Productivity are higher than the AppStore results
    • Compare the patterns you see for the Google Play market with those you saw for the App Store market. Can you recommend an app profile based on what you found so far?
      Games
    • Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?
      Most frequent app genres
In [230]:
# Function to calculate the popularity of each genre in the apple dataset
# by percentage. It uses a for loop inside of another for loop. (nested)

def freq_table(dataset, index):
    freq_table = {}
    total = 0
    
    for row in dataset:
        total += 1
        app = row[index]
        if app in freq_table:
            freq_table[app] += 1
        else:
            freq_table[app] = 1
    
    freq_table_percentages = {}
    for genre in freq_table:
        percentage = (format((freq_table[genre] / total) * 100, '.2f')+"%")
        freq_table_percentages[genre] = percentage 

    freq_table_display = []
    for key in freq_table_percentages:
        key_val_as_tuple = (freq_table_percentages[key], key)
        freq_table_display.append(key_val_as_tuple)

    freq_table_sorted = sorted(freq_table_display, reverse = True)
    sorted_list = []
    for entry in freq_table_sorted:
        element = entry[1], ':', entry[0]
        sorted_list.append(element)
    return sorted_list

freq_table(apple_ds_final,11)
Out[230]:
[('Entertainment', ':', '7.89%'),
 ('Games', ':', '58.14%'),
 ('Photo & Video', ':', '4.97%'),
 ('Education', ':', '3.66%'),
 ('Social Networking', ':', '3.29%'),
 ('Shopping', ':', '2.61%'),
 ('Utilities', ':', '2.52%'),
 ('Sports', ':', '2.14%'),
 ('Music', ':', '2.05%'),
 ('Health & Fitness', ':', '2.02%'),
 ('Productivity', ':', '1.74%'),
 ('Lifestyle', ':', '1.58%'),
 ('News', ':', '1.34%'),
 ('Travel', ':', '1.24%'),
 ('Finance', ':', '1.12%'),
 ('Weather', ':', '0.87%'),
 ('Food & Drink', ':', '0.81%'),
 ('Reference', ':', '0.56%'),
 ('Business', ':', '0.53%'),
 ('Book', ':', '0.43%'),
 ('Navigation', ':', '0.19%'),
 ('Medical', ':', '0.19%'),
 ('Catalogs', ':', '0.12%')]
In [181]:
print("google installs column [5]")
print("==========================")
google_installs_results = (display_table(google_ds_final, 5))
goolge installs column [5]
==========================
1,000+ : 8.41%
100+ : 6.92%
5,000,000+ : 6.82%
500,000+ : 5.57%
50,000+ : 4.77%
5,000+ : 4.51%
10+ : 3.54%
500+ : 3.25%
50,000,000+ : 2.28%
100,000,000+ : 2.12%
1,000,000+ : 15.75%
100,000+ : 11.55%
10,000,000+ : 10.51%
10,000+ : 10.22%
50+ : 1.92%
5+ : 0.79%
1+ : 0.51%
500,000,000+ : 0.27%
1,000,000,000+ : 0.23%
0+ : 0.05%
0 : 0.01%
In [245]:
# Now I'll calculate the average number of installs per app genre for the
# google data set using a nested loop.
cat_google = freq_table(google_ds_final, 1)

for category in cat_google:
    category = category[0]
    total = 0
    len_category = 0
    for app in google_ds_final:
        category_app = app[1]
        if category_app == category:
            num_installs = app[5]
            num_installs = num_installs.replace(',', '')
            num_installs = num_installs.replace('+', '')
            total += float(num_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
GAME
GAME : 13006872.892271662
TOOLS : 10695245.286096256
BUSINESS : 1704192.3399014778
LIFESTYLE : 1437816.2687861272
PRODUCTIVITY : 16772838.591304347
FINANCE : 1387692.475609756
MEDICAL : 107167.23322683707
SPORTS : 4274688.722772277
PERSONALIZATION : 5201482.6122448975
COMMUNICATION : 38326063.197916664
HEALTH_AND_FITNESS : 4167457.3602941176
PHOTOGRAPHY : 17805627.643678162
NEWS_AND_MAGAZINES : 9549178.467741935
SOCIAL : 23253652.127118643
TRAVEL_AND_LOCAL : 13984077.710144928
SHOPPING : 7036877.311557789
BOOKS_AND_REFERENCE : 8767811.894736841
FAMILY : 4371709.123123123
DATING : 854028.8303030303
VIDEO_PLAYERS : 24790074.17721519
MAPS_AND_NAVIGATION : 4056941.7741935486
EDUCATION : 3057207.207207207
FOOD_AND_DRINK : 1924897.7363636363
ENTERTAINMENT : 19428913.04347826
LIBRARIES_AND_DEMO : 638503.734939759
AUTO_AND_VEHICLES : 647317.8170731707
HOUSE_AND_HOME : 1313681.9054054054
WEATHER : 5074486.197183099
EVENTS : 253542.22222222222
ART_AND_DESIGN : 1905351.6666666667
PARENTING : 542603.6206896552
COMICS : 817657.2727272727
BEAUTY : 513151.88679245283

This concludes my first project in Dataquest. Based on the above, it seems that 'Games' is the best direction for developing an app that shows the most potential for being profitable

In [ ]: