This is a Pacmann Academy SQL & Data Wrangling project. It covers visual exploration of the datasets, identification of business problems, and proposed solutions. The project first introduces Olist and its datasets. Olist is a Brazilian e-commerce company that provides solutions for online sales and e-commerce services, offering a variety of technologies, tools, and connectors that help streamline and speed up online business processes. The project then explores the data and closes with recommendations for the challenges found and an evaluation of their benefits. Python is used for the data analysis and the graphics.
⚠️ Disclaimer: I'm using a dataset from Pacmann Academy's learning management system, but you can use the same dataset from Kaggle.
- Number of Orders per Product Category: Identify the number of orders Olist received in each product category to understand how popular each category is.
- Most Ordered Items in the Top Product Category: Analyze the top product category to understand how much its items contribute to Olist's revenue and popularity.
- Top 3 Product Category Names per Product Category: Display the top three items in each product category to show which are the most popular in each category.
- Top 10 Customer State Capacity: Identify the ten states with the largest customer capacity so that Olist can find its regions with the most potential.
- Revenue for Each State: Analyze the revenue Olist received from each state to understand the revenue contribution of each region.
- Monthly Revenue: Identify Olist's monthly revenue trends to assist in financial planning and business growth.
- Total Revenue per Product Category: Calculate the total revenue earned by each product category to estimate the relative contribution of each category.
- The Relationship between the Price of the Product and the Payment Value: Analyze the relationship between product price and payment value to see whether any correlations or trends can be identified.
- The Relationship between Product Volume and the Price: Analyze the relationship between product volume and price to evaluate whether product volume affects price.
- Average Price over Time by Product Category: Understand how the average product price changes over time within each product category.
- Payment Type Distribution: Display the distribution of payment types used by Olist customers to support better payment strategies.
- Payment Installment: Analyze payments made in installments to identify installment payment patterns.
- Dataset
- Accessing dataset
- Load dataset
- Create dataframe
- Exploration and Processing
- NaN identification
- Outlier identification
- Identify inconsistent format
- Identify duplicate data
- Other checks required
- Exploring Data & Analysis
Accessing dataset
    # sqlite3 integrates the SQLite database with Python
    import sqlite3

    # pandas and numpy for data manipulation
    import pandas as pd
    import numpy as np
    pd.set_option('display.max_columns', 100)

    # matplotlib and seaborn for plotting
    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib.dates import MonthLocator, YearLocator

    import warnings
    warnings.filterwarnings("ignore")
Load Dataset
    # Create a function to load data from the database
    def get_result(query):
        dbfile = 'olist.db'                   # Path to the database file
        connection = sqlite3.connect(dbfile)  # Open a connection to the database
        cursor = connection.cursor()          # Cursor object used to execute SQL commands
        cursor.execute(query)                 # Execute the SQL command
        data = cursor.fetchall()              # Retrieve the result of the SQL command
        cursor.close()                        # Close the cursor
        connection.close()                    # Close the connection
        return data                           # Return the result of the SQL command
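As a side note, the cursor and connection in a helper like this stay open if `execute` raises. The sketch below shows one way to guard against that with `contextlib.closing`. The function name `run_query` and the in-memory table `demo_orders` are hypothetical and only serve the illustration; they are not part of the project.

```python
import sqlite3
from contextlib import closing


def run_query(connection, query, params=()):
    """Run a query on an open connection and return all rows as a list of tuples."""
    # closing() releases the cursor even if execute() raises
    with closing(connection.cursor()) as cursor:
        cursor.execute(query, params)
        return cursor.fetchall()


# Hypothetical in-memory database standing in for olist.db
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo_orders (order_id TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO demo_orders VALUES (?, ?)",
        [("a1", 10.0), ("b2", 25.5)],
    )
    rows = run_query(conn, "SELECT order_id, price FROM demo_orders ORDER BY order_id")

print(rows)
```

Passing the connection in, rather than opening it inside the function, also avoids reconnecting for every query.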
View the names of tables and columns in the dataset.
    # SQL command to list the tables in the olist database
    query_get_tables = "SELECT name FROM sqlite_master WHERE type='table';"

    # Run the SQL command using the get_result function
    tables = get_result(query_get_tables)

    # Show each table name together with its columns
    for table in tables:
        table_name = table[0]
        print(f'Table Name : {table_name}')

        # SQL command to get the column structure of the table
        query_get_column = f'PRAGMA table_info({table_name});'

        # Run the SQL command using the get_result function
        columns = get_result(query_get_column)

        # Display the names of the columns in the table
        for column in columns:
            column_name = column[1]
            print(f' Column : {column_name}')
Create Dataframe
    # Function for creating a dataframe
    def create_df(data, columns):
        # Build the dataframe and drop the stored 'index' column
        process_data = pd.DataFrame(data=data, columns=columns).drop('index', axis=1)
        # Reset the row index
        process_data = process_data.reset_index(drop=True)
        # Join multi-level column labels into single, stripped column names
        process_data.columns = [''.join(col).strip() for col in process_data.columns.values]
        return process_data
Create the column lists for each table

    # Create column variables for each table
    # Customer columns
    olist_customer_column = ['index','customer_id','customer_unique_id','customer_zip_code_prefix',
                             'customer_city','customer_state']
    # Order columns
    olist_order_column = ['index','order_id','customer_id','order_status','order_purchase_timestamp',
                          'order_approved_at','order_delivered_carrier_date',
                          'order_delivered_customer_date','order_estimated_delivery_date']
    # Order review columns
    olist_order_reviews_column = ['index','review_id','order_id','review_score','review_comment_title',
                                  'review_comment_message','review_creation_date','review_answer_timestamp']
    # Order payment columns
    olist_order_payments_column = ['index','order_id','payment_sequential','payment_type',
                                   'payment_installments','payment_value']
    # Order item columns
    olist_order_items_column = ['index','order_id','order_item_id','product_id','seller_id',
                                'shipping_limit_date','price','freight_value']
    # Product columns
    olist_products_column = ['index','product_id','product_category_name','product_name_lenght',
                             'product_description_lenght','product_photos_qty','product_weight_g',
                             'product_length_cm','product_height_cm','product_width_cm']
    # Seller columns
    olist_sellers_column = ['index','seller_id','seller_zip_code_prefix','seller_city','seller_state']
    # Geolocation columns
    olist_geolocation_column = ['index','geolocation_zip_code_prefix','geolocation_lat','geolocation_lng',
                                'geolocation_city','geolocation_state']
    # Product category columns
    olist_product_category_column = ['index','product_category_name','product_category_name_english']
Retrieve all datasets

    # Create df Customer
    olist_customer = create_df(get_result('SELECT * FROM olist_order_customer_dataset'), olist_customer_column)
    # Create df Order
    olist_order = create_df(get_result('SELECT * FROM olist_order_dataset'), olist_order_column)
    # Create df Order Review
    olist_order_reviews = create_df(get_result('SELECT * FROM olist_order_reviews_dataset'), olist_order_reviews_column)
    # Create df Order Payment
    olist_order_payment = create_df(get_result('SELECT * FROM olist_order_payments_dataset'), olist_order_payments_column)
    # Create df Order Item
    olist_order_items = create_df(get_result('SELECT * FROM olist_order_items_dataset'), olist_order_items_column)
    # Create df Product
    olist_products = create_df(get_result('SELECT * FROM olist_products_dataset'), olist_products_column)
    # Create df Seller
    olist_sellers = create_df(get_result('SELECT * FROM olist_sellers_dataset'), olist_sellers_column)
    # Create df Geolocation
    olist_geolocation = create_df(get_result('SELECT * FROM olist_geolocation_dataset'), olist_geolocation_column)
    # Create df Product Category
    olist_product_category = create_df(get_result('SELECT * FROM product_category_name_translation'), olist_product_category_column)
Merge the necessary tables
    # Merge the necessary tables
    df_olist = pd.merge(olist_customer, olist_order, on='customer_id', how='inner')
    df_olist = df_olist.merge(olist_order_payment, on='order_id', how="inner")
    df_olist = df_olist.merge(olist_order_items, on='order_id', how="inner")
    df_olist = df_olist.merge(olist_products, on='product_id', how="inner")
    df_olist = df_olist.merge(olist_product_category, on='product_category_name', how="inner")
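One thing worth checking after a chain of inner merges like this is row multiplication: when an order has several payment rows and several item rows, joining both tables on `order_id` produces every combination of them. The sketch below uses made-up toy tables to show the effect on a summed `payment_value`. Whether and how much this affects the real Olist figures depends on the data and is not verified here.

```python
import pandas as pd

# Hypothetical toy tables: one order paid with 2 payments and containing 2 items
payments = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "payment_value": [60.0, 40.0],
})
items = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "product_id": ["p1", "p2"],
    "price": [70.0, 30.0],
})

# Joining on order_id pairs every payment row with every item row
merged = payments.merge(items, on="order_id", how="inner")

true_total = payments["payment_value"].sum()    # what the customer actually paid
merged_total = merged["payment_value"].sum()    # the same payments counted once per item

print(len(merged), true_total, merged_total)
```

Comparing row counts and totals before and after each merge is a cheap way to spot this kind of duplication.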
Classify product names into multiple product categories
    # Show the unique values in the product_category_name_english column
    column_product_categories = df_olist['product_category_name_english'].unique()
    print(column_product_categories)

    # Create a product category classification function
    def classify_product(x):
        categories = {
            'Beauty & Health': ['health_beauty','perfumery','diapers_and_hygiene'],
            'Book & Stationary': ['stationery','books_general_interest','books_imported','books_technical'],
            'Electronics': ['computers_accessories','auto','air_conditioning','telephony','watches_gifts',
                            'consoles_games','electronics','small_appliances',
                            'small_appliances_home_oven_and_coffee','signaling_and_security',
                            'musical_instruments','fixed_telephony','tablets_printing_image','computers',
                            'audio','security_and_services'],
            'Entertainment': ['sports_leisure','toys','art','music','dvds_blu_ray','christmas_supplies',
                              'party_supplies','cine_photo','cds_dvds_musicals','arts_and_craftmanship'],
            'Fashion': ['baby','fashio_female_clothing','cool_stuff','fashion_bags_accessories',
                        'fashion_male_clothing','fashion_shoes','fashion_underwear_beach','fashion_sport',
                        'fashion_childrens_clothes'],
            'Food & Drinks': ['food_drink','drinks','food'],
            'Furniture': ['office_furniture','home_confort','furniture_decor','bed_bath_table',
                          'kitchen_dining_laundry_garden_furniture','home_construction',
                          'furniture_living_room','furniture_bedroom','furniture_mattress_and_upholstery',
                          'home_comfort_2'],
            'Home & Garden': ['housewares','garden_tools','pet_shop','construction_tools_lights',
                              'luggage_accessories','home_appliances_2','home_appliances','market_place',
                              'costruction_tools_garden','la_cuisine','flowers'],
            'Industry & Construction': ['costruction_tools_tools','construction_tools_construction',
                                        'industry_commerce_and_business','construction_tools_safety',
                                        'agro_industry_and_commerce']
        }
        for category, keywords in categories.items():
            if x in keywords:
                return category
        return None

    df_olist['product_category'] = df_olist['product_category_name_english'].apply(classify_product)
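An alternative to looping over the dictionary for every row is to invert it once into a name-to-group lookup and use `Series.map`. The sketch below uses a shortened, hypothetical version of the dictionary. It also lists the raw names that map to no group, which helps catch misspelled or missing keys.

```python
import pandas as pd

# Hypothetical, shortened version of the category dictionary
categories = {
    "Food & Drinks": ["food_drink", "drinks", "food"],
    "Beauty & Health": ["health_beauty", "perfumery"],
}

# Invert once: raw category name -> high-level group
lookup = {name: group for group, names in categories.items() for name in names}

raw = pd.Series(["food", "perfumery", "unknown_name"])
grouped = raw.map(lookup)  # names missing from the lookup become NaN

# Raw names that were not assigned to any group
unmapped = raw[grouped.isna()].unique().tolist()

print(grouped.tolist())
print(unmapped)
```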
Drop irrelevant data

    # Remove unnecessary columns
    df_olist.drop(['customer_zip_code_prefix', 'customer_city', 'payment_sequential', 'order_item_id',
                   'shipping_limit_date', 'freight_value', 'product_name_lenght',
                   'product_description_lenght', 'product_photos_qty', 'product_weight_g',
                   'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date',
                   'order_estimated_delivery_date', 'product_category_name', 'order_status'],
                  axis=1, inplace=True)
Exploration and Processing
NaN Identification
    # Count NaN values per column, keeping only the columns that have any
    nan_value = df_olist.isna().sum()[df_olist.isna().sum() > 0]

    # Build a dataframe with the NaN count and NaN percentage of each column
    nan_df_olist = pd.DataFrame({
        'NaN_count': nan_value,
        'NaN_percentage': nan_value / len(df_olist) * 100
    }).sort_values(by='NaN_percentage', ascending=False)

    # Show the data
    nan_df_olist

    # Summarize how many NaN values are in 'product_length_cm', 'product_height_cm',
    # and 'product_width_cm' for each 'product_category_name_english'
    nan_summary = pd.pivot_table(df_olist,
                                 index='product_category_name_english',
                                 values=['product_length_cm', 'product_height_cm', 'product_width_cm'],
                                 aggfunc=lambda x: x.isna().sum())

    # Keep only the categories with at least one NaN in any of the three columns.
    # Each "> 0" comparison returns a boolean Series that is True where the NaN count is positive.
    filtered_rows = nan_summary.loc[(nan_summary['product_length_cm'] > 0) |
                                    (nan_summary['product_height_cm'] > 0) |
                                    (nan_summary['product_width_cm'] > 0)]
    filtered_rows
Calculate the mode of the 'baby' product category only and use it to fill the NaN values

    # Select the rows of the product category 'baby' only
    baby_mode = df_olist.loc[df_olist['product_category_name_english'] == 'baby']

    # Find the most frequent value in each dimension column
    product_length_mode = baby_mode['product_length_cm'].mode()[0]
    product_height_mode = baby_mode['product_height_cm'].mode()[0]
    product_width_mode = baby_mode['product_width_cm'].mode()[0]

    # Show each mode value
    print(f'Product length mode : {product_length_mode}')
    print(f'Product height mode : {product_height_mode}')
    print(f'Product width mode : {product_width_mode}')

    # Fill the NaN values in each dimension column with its mode
    # (assignment is used instead of inplace=True on a column selection)
    df_olist['product_length_cm'] = df_olist['product_length_cm'].fillna(product_length_mode)
    df_olist['product_height_cm'] = df_olist['product_height_cm'].fillna(product_height_mode)
    df_olist['product_width_cm'] = df_olist['product_width_cm'].fillna(product_width_mode)
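The fill above applies the 'baby' mode to every NaN in the dataframe, which is fine as long as only that category has missing dimensions. If several categories had gaps, a per-category fill would be safer. The following is a minimal sketch on made-up data; the column names `category` and `length_cm` and the helper `fill_with_group_mode` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each category
df = pd.DataFrame({
    "category": ["baby", "baby", "baby", "toys", "toys", "toys"],
    "length_cm": [20.0, 20.0, np.nan, 35.0, np.nan, 35.0],
})


def fill_with_group_mode(series):
    """Fill NaN in one group's values with that group's own mode."""
    mode = series.mode()
    # mode() is empty when the group has no observed values; leave NaN in that case
    return series.fillna(mode.iloc[0]) if not mode.empty else series


df["length_cm"] = df.groupby("category")["length_cm"].transform(fill_with_group_mode)

print(df)
```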
Create product volume column
    # Create the product volume column
    df_olist['product_volume_cm3'] = (df_olist['product_length_cm']
                                      * df_olist['product_height_cm']
                                      * df_olist['product_width_cm'])

    # Drop the length, height, and width columns
    df_olist.drop(['product_length_cm','product_height_cm','product_width_cm'], axis=1, inplace=True)
Outlier Identification

    # Generate descriptive statistics
    df_olist.describe()

[!NOTE] In the price column, the gap between the 75th percentile and the maximum is very large, which suggests that outliers are present.
    # Show the distribution of the price data
    plt.figure(figsize=(10, 7))
    sns.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif',
                  font_scale=1, color_codes=True, rc=None)
    # No palette argument here: it has no effect on a histogram without a hue variable
    sns.histplot(df_olist['price'], bins=100)
    plt.title('Distribution of data price')
    plt.xlabel('Price')
    plt.show()

[!NOTE]
- It can be seen that the scale of the x axis reaches 7000.
- This happens because there is data whose value is close to 7000.
- This can be validated by looking at the statistical description of the price column.
    # Show the price data in a box plot
    plt.figure(figsize=(10,7))
    sns.boxplot(x=df_olist['price'])
    plt.title('Box Plot : Distribution of data price')
    plt.xlabel('Price')
    plt.show()

    # Generate descriptive statistics for the 'price' column
    df_olist['price'].describe()

[!NOTE]
- It can be seen that the maximum value of the price column is 6735.
- This is far above the Q3 value of 1349.
- Data far above Q3 is potentially an outlier.
- We will treat these values as outliers.
We treat a data point as an outlier if its value is greater than Q3 + 1.5 × IQR.

    # Outlier detection
    # Calculate the upper and lower limits from the IQR
    q1 = df_olist['price'].quantile(0.25)
    q3 = df_olist['price'].quantile(0.75)
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr

    # Keep only the rows whose price is below the upper limit
    df_olist = df_olist[df_olist['price'] < upper]

[!NOTE]
- It can be seen that Q3 and the maximum value are not far apart.
- Outliers have been removed.
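To make the IQR rule concrete, here is a small worked sketch on made-up prices. The helper `iqr_bounds` is hypothetical and not part of the project. Unlike the filter above, it applies both the lower and the upper limit.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits of the k * IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical prices with one obviously extreme value
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

lower, upper = iqr_bounds(prices)
kept = prices[prices.between(lower, upper)]

print(lower, upper)
print(kept.tolist())
```

For these six values Q1 is 22.5 and Q3 is 47.5, so the IQR is 25, the limits are -15 and 85, and only the price of 1000 is removed.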
    # Show the distribution of the price data without outliers
    plt.figure(figsize=(10,7))
    sns.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif',
                  font_scale=1, color_codes=True, rc=None)
    sns.histplot(df_olist['price'], bins=100)
    plt.title('Distribution of data price')
    plt.xlabel('Price')
    plt.show()

The distribution now looks positively skewed.
Identify inconsistent format
    # Identify inconsistent formats in 'product_category_name_english'
    # Show the data
    df_olist['product_category_name_english'].unique()

    # Create a variable that maps the inconsistent names to consistent ones
    replace_product_name = {'home_confort':'home_comfort',
                            'home_comfort_2':'home_comfort',
                            'home_appliances_2':'home_appliances'}

    # Apply the replacement to the dataframe
    # (assignment is used instead of inplace=True on a column selection)
    df_olist['product_category_name_english'] = df_olist['product_category_name_english'].replace(replace_product_name)

    # Show the unique values in the order_purchase_timestamp column
    df_olist['order_purchase_timestamp'].unique()

[!NOTE]
- The output above shows that the data type of the column does not match what the column represents.
- order_purchase_timestamp: Indicates the purchase timestamp.
- Because the values in this column begin with the year, we set the yearfirst parameter of the to_datetime function to True.

    # Convert order_purchase_timestamp to datetime
    df_olist['order_purchase_timestamp'] = pd.to_datetime(df_olist['order_purchase_timestamp'],
                                                          errors='coerce', yearfirst=True)
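Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting how many values failed to parse. The sketch below also passes an explicit `format` string. The format `%Y-%m-%d %H:%M:%S` is an assumption about how the timestamps are stored and should be checked against the actual data; the sample values are made up.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately invalid
raw = pd.Series(["2017-10-02 10:56:33", "2018-07-24 20:41:37", "not a date"])

# An explicit format avoids ambiguous parsing; invalid values become NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed
n_failed = int(parsed.isna().sum())

print(parsed.dtype, n_failed)
```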
Identify duplicate data
    # Check for duplicate rows in the dataframe
    df_olist[df_olist.duplicated(keep=False)]

    # Delete the duplicate rows
    df_olist = df_olist.drop_duplicates(keep='first').reset_index(drop=True)
Exploring Data & Analysis
Number of Orders per Product Category

    # Visualize the product category counts with a bar plot
    plt.figure(figsize=(10,7))
    ax = sns.barplot(x=df_olist['product_category'].value_counts().values,
                     y=df_olist['product_category'].value_counts().index,
                     palette='Set2')
    plt.title('Number of order per each product category')

    max_values = df_olist.groupby('product_category')['product_category'].count().max()
    min_values = df_olist.groupby('product_category')['product_category'].count().min()

    # Label only the bars with the highest and the lowest count
    for index, value in enumerate(df_olist['product_category'].value_counts().values):
        if value == max_values:
            ax.text(value, index, f'{value}', ha='right', va='center')
        if value == min_values:
            ax.text(value, index, f'{value}', ha='left', va='center')
    plt.show()

Electronics is the most ordered product category with 25,033 orders, and Food & Drinks is the least ordered with 1,008 orders.
Most Ordered Items in the Top Product Category

    # Find the product category with the highest count
    df_olist['product_category'].value_counts().idxmax()

    Output: 'Electronics'

    # Show the items of the Electronics product category
    # Filter the dataframe for product_category equal to "Electronics"
    electronic_df = df_olist[df_olist['product_category'] == 'Electronics']

    # Count the number of occurrences of each product_category_name_english
    category_count = electronic_df['product_category_name_english'].value_counts().reset_index()
    category_count.columns = ['product_category_name_english', 'count']

    # Sort the dataframe by count in descending order
    category_count = category_count.sort_values(by='count', ascending=False)
    max_category_count = category_count['count'].max()

    # Visualize the data with a bar plot
    plt.figure(figsize=(12, 8))
    ax = sns.barplot(x='count', y='product_category_name_english', data=category_count, palette='Set2')
    plt.title('Product Categories in Electronics')
    plt.ylabel('Product Category Name ')
    highest_count_category = category_count[category_count['count'] == max_category_count]['product_category_name_english'].values[0]
    ax.text(max_category_count, 0, f'{max_category_count}', ha='right', va='center', fontsize=10)
    plt.show()

The most ordered item group within Electronics turned out to be computer accessories, with 6,740 orders.
Top 3 Product Category Names per Product Category

    # Group the data by 'product_category' and 'product_category_name_english', then count the rows
    category_counts = (df_olist.groupby(['product_category', 'product_category_name_english'])
                       .size().reset_index(name='count'))

    # Keep the top 3 product_category_name_english in each product_category
    top_categories = (category_counts.groupby('product_category')
                      .apply(lambda x: x.nlargest(3, 'count')).reset_index(drop=True))

    # Visualize each product_category with a bar plot
    plt.figure(figsize=(12, 8))
    sns.barplot(data=top_categories, x='count', y='product_category_name_english',
                hue='product_category', palette='Set2')
    plt.xlabel('Count')
    plt.ylabel('Product Category Name')
    plt.title('Top 3 Product Categories by Product Category')
    plt.show()

Health Beauty, Stationery, Computers Accessories, Sports Leisure, Cool Stuff, Food, Bed Bath Table, Housewares, and Construction Tools Construction are the most ordered items in their respective product categories.
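The top-3 selection per group can also be written without `apply`, by sorting once and taking the first three rows of each group. This is only an alternative sketch on made-up counts, not the method used in the project; the column `name` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical counts: group A has four items, group B has two
counts = pd.DataFrame({
    "product_category": ["A", "A", "A", "A", "B", "B"],
    "name": ["a1", "a2", "a3", "a4", "b1", "b2"],
    "count": [5, 9, 1, 7, 3, 8],
})

# Sort by group, then by count descending, and keep the first 3 rows per group
top3 = (
    counts.sort_values(["product_category", "count"], ascending=[True, False])
    .groupby("product_category")
    .head(3)
    .reset_index(drop=True)
)

print(top3)
```

Groups with fewer than three items, like B here, simply keep all of their rows.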
Top 10 Customer State Capacity

    # Show the top 10 customer states
    plt.figure(figsize=(10,7))
    sns.barplot(x=df_olist['customer_state'].value_counts().values[:10],
                y=df_olist['customer_state'].value_counts().index[:10],
                palette='Set2')
    plt.title('Top 10 Customer State Capacity')
    plt.show()

SP (São Paulo) is the state with the most orders, at 41,115.
Revenue for Each State

    # Sum the payment value per state
    state_revenue = df_olist.groupby('customer_state')['payment_value'].sum().reset_index()

    # Sort the state revenue in descending order
    state_revenue = state_revenue.sort_values(by='payment_value', ascending=False)

    # Visualize the state revenue with a bar plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='customer_state', y='payment_value', data=state_revenue, palette='Set2')
    plt.title('Total Revenue per State')
    plt.xlabel('Customer State')
    plt.ylabel('Total Revenue (Million)')
    plt.xticks(rotation=45)
    plt.show()

SP (São Paulo) is the state with the highest revenue at 4,493,263.20, and RR (Roraima) has the lowest at 6,678.50.
Monthly Revenue

    # Extract the order month from order_purchase_timestamp
    df_olist['order_month'] = df_olist['order_purchase_timestamp'].dt.to_period('M').dt.to_timestamp()

    # Sum payment_value per order_month to get the monthly revenue
    monthly_revenue = df_olist.groupby('order_month')['payment_value'].sum().reset_index()

    plt.figure(figsize=(12, 6))
    sns.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif',
                  font_scale=1, color_codes=True, rc=None)
    sns.lineplot(x='order_month', y='payment_value', data=monthly_revenue)
    plt.title('Monthly Revenue')
    plt.xlabel('Month')
    plt.ylabel('Revenue (Million)')
    plt.xticks(rotation=45)  # Rotate the x-axis labels for easier reading

    # Find the months with the highest and the lowest revenue
    max_revenue_point = monthly_revenue.loc[monthly_revenue['payment_value'].idxmax()]
    min_revenue_point = monthly_revenue.loc[monthly_revenue['payment_value'].idxmin()]

    # Annotate the highest revenue point
    plt.annotate(f'Highest: {max_revenue_point["payment_value"]:.2f}',
                 xy=(max_revenue_point['order_month'], max_revenue_point['payment_value']),
                 xytext=(max_revenue_point['order_month'], max_revenue_point['payment_value'] + 1000),
                 arrowprops=dict(arrowstyle='->'))

    # Annotate the lowest revenue point
    plt.annotate(f'Lowest: {min_revenue_point["payment_value"]:.2f}',
                 xy=(min_revenue_point['order_month'], min_revenue_point['payment_value']),
                 xytext=(min_revenue_point['order_month'], min_revenue_point['payment_value'] - 1000),
                 arrowprops=dict(arrowstyle='->'))
    plt.show()

November 2017 had the highest revenue, totaling 879,899.45, while December 2016 had the lowest, totaling 19.62.
Total Revenue per Product Category

    # Sum payment_value per product_category to get the revenue of each category
    revenue_product_cat = df_olist.groupby('product_category')['payment_value'].sum().reset_index()

    # Sort the category revenue in descending order
    revenue_product_cat = revenue_product_cat.sort_values(by='payment_value', ascending=False)

    # Visualize with a bar plot
    plt.figure(figsize=(12,7))
    sns.barplot(x='product_category', y='payment_value', data=revenue_product_cat, palette='Set2')
    plt.title('Total Revenue Per Each Product Category')
    plt.xlabel('Product Category')
    plt.ylabel('Total Revenue (Million)')
    plt.xticks(rotation=45)
    plt.show()

Electronics is the product category with the highest revenue at 2,794,148.72, while Food & Drinks has the lowest at 82,544.05.
The Relationship between the Price of the Product and the Payment Value

    # Visualize with a scatter plot
    plt.figure(figsize=(10, 7))
    sns.scatterplot(x='price', y='payment_value', data=df_olist, color='green', alpha=0.5)
    plt.title('Price vs. Payment Value')
    plt.xlabel('Price (R$)')
    plt.ylabel('Payment Value')

    # Calculate the regression slope and intercept
    slope, intercept = np.polyfit(df_olist['price'], df_olist['payment_value'], 1)

    # Create an array of x-values for the linear trend line
    x = np.array([df_olist['price'].min(), df_olist['price'].max()])

    # Create an array of y-values for the linear trend line using the regression equation
    y = slope * x + intercept

    # Add the linear trend line to the plot
    plt.plot(x, y, color='red', linestyle='--', label=f'Trendline: y = {slope:.2f}x + {intercept:.2f}')
    plt.legend()
    plt.show()

- Positive Relationship: The visualization shows a positive relationship between "price" and "payment_value", with the majority of the data points forming an upward pattern from left to right. Put another way, the payment value tends to increase with the product's price.
- Outliers: Outliers are data points that deviate significantly from the general pattern. Here they are transactions with a considerably higher payment value than other transactions at a similar price. This could be a sign of exceptional products or large purchases that result in payments well above the price of a single item.
- Data Concentration: The majority of the data points have relatively low prices and payment values. This shows that most transactions involve moderately priced goods.
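A trend line shows the direction of a relationship but not its strength. The Pearson correlation coefficient is a common way to quantify it. The sketch below computes it with `np.corrcoef` next to the `np.polyfit` slope on made-up numbers; the values are illustrative and are not taken from the Olist data.

```python
import numpy as np

# Hypothetical prices and payment values (not from the dataset)
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
payment_value = np.array([15.0, 24.0, 37.0, 44.0, 58.0])

# Pearson correlation: close to +1 means a strong positive linear relationship
r = np.corrcoef(price, payment_value)[0, 1]

# Slope and intercept of the least-squares line, as used for the trend line
slope, intercept = np.polyfit(price, payment_value, 1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

On a real dataframe, `df_olist['price'].corr(df_olist['payment_value'])` would give the same kind of coefficient.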
The Relationship between Product Volume and the Price

    # Visualize with a scatter plot
    plt.figure(figsize=(10, 7))
    sns.scatterplot(x='product_volume_cm3', y='price', data=df_olist, color='blue', alpha=0.5)
    plt.title('Scatter Plot Product Volume vs. Price')
    plt.xlabel('Product Volume (cm3)')
    plt.ylabel('Price (R$)')

    # Calculate the regression slope and intercept
    slope, intercept = np.polyfit(df_olist['product_volume_cm3'], df_olist['price'], 1)

    # Create an array of x-values for the linear trend line
    x = np.array([df_olist['product_volume_cm3'].min(), df_olist['product_volume_cm3'].max()])

    # Create an array of y-values for the linear trend line using the regression equation
    y = slope * x + intercept

    # Add the linear trend line to the plot
    plt.plot(x, y, color='red', linestyle='--', label=f'Trendline: y = {slope:.2f}x + {intercept:.2f}')
    plt.legend()
    plt.show()

- Positive Relationship: The visualization shows a positive relationship between "product_volume_cm3" and "price", with the majority of the data points forming an upward pattern from left to right. Put another way, a product's price tends to increase with its volume.
- Data Concentration: The majority of the data points have a relatively small product volume and a low price. This shows that most transactions involve small, moderately priced goods.
Average Price over Time by Product Category

    # Group the data by 'order_month' and 'product_category' and calculate the mean 'price'
    avg_price = df_olist.groupby(['order_month', 'product_category'])['price'].mean().unstack()

    # Visualize with a line plot
    plt.figure(figsize=(12, 6))
    for category in avg_price.columns:
        plt.plot(avg_price.index, avg_price[category], label=category)
    plt.title('Average Price Over Time by Product Category')
    plt.xlabel('Order Month')
    plt.ylabel('Average Price')
    plt.legend(loc='best', bbox_to_anchor=(1, 1))
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.show()

From January 2017 to August 2018, the average price of each product category stayed in the range of 75 to 100 R$, except for the Food & Drinks category.
Payment Type Distribution

    # Visualize with a pie chart
    plt.figure(figsize=(10,10))
    plt.pie(df_olist['payment_type'].value_counts().values,
            autopct='%1.1f%%', shadow=False, startangle=90,
            labels=df_olist['payment_type'].value_counts().index)
    plt.title('Payment Type Distribution')
    plt.show()

The majority of customers (73.9%) pay by credit card when placing orders.
Payment Installment
    # Count the values in payment_installments
    df_olist['payment_installments'].value_counts()

    Output:
    payment_installments
    1     49997
    2     12038
    3     10028
    4      6685
    5      4851
    10     3919
    6      3409
    8      3277
    7      1442
    9       571
    12      117
    15       54
    11       24
    18       20
    13       17
    14       14
    24       12
    20       10
    17        7
    16        5
    21        3
    0         2
    23        1
    22        1
    Name: count, dtype: int64

    # Remove the rows with a value of 0 in the payment_installments column
    df_olist = df_olist[df_olist['payment_installments'] != 0]
    df_olist.reset_index(drop=True, inplace=True)

    # Visualize the payment installments with a count plot
    plt.figure(figsize=(10,6))
    sns.countplot(x=df_olist['payment_installments'], palette='Set2')
    plt.title('Payment Installment Distribution')
    plt.xlabel('Number of Installments')
    plt.show()

The majority of payments (49,997) are made in a single installment.
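Raw counts are easier to compare as shares. `value_counts(normalize=True)` returns the fraction of rows per value. The sketch below uses a short made-up series rather than the project data.

```python
import pandas as pd

# Hypothetical installment counts for eight payments
installments = pd.Series([1, 1, 1, 2, 3, 1, 2, 10])

# Fraction of payments per number of installments, ordered by installment count
shares = installments.value_counts(normalize=True).sort_index()

# Share of payments made in a single installment
single_share = shares.loc[1]

print(shares)
print(f"Single installment share: {single_share:.1%}")
```

Applied to `df_olist['payment_installments']`, the same call would show directly whether single-installment payments form a majority.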
- With 25,033 orders, "Electronics" was the most popular product category, while "Food & Drinks" had the fewest orders at 1,008. To meet the higher demand for "Electronics", Olist may consider expanding its inventory in that category. It can also step up promotion of the "Food & Drinks" category to increase consumer interest in it.
- "Computer Accessories" was the most ordered item group within "Electronics", with a total of 6,740 orders. Emphasizing "Computer Accessories" might boost the popularity of the whole "Electronics" category, so Olist may consider developing and marketing such products.
- Some products within each category are extremely popular. Knowing this helps focus marketing and promotion on them. Olist can highlight the best-selling items in each category through sales, discounts, or customer engagement.
- The state of SP (São Paulo) placed the most orders, at 41,115. Olist can focus more on customers in SP and increase services or promotions in this region.
- SP (São Paulo) has the highest revenue, while RR (Roraima) has the lowest. Concentrating on SP may yield more revenue. Olist may consider increased marketing or customer service efforts in these states.
- Olist's revenue peaked in November 2017. A thorough understanding of monthly trends makes inventory and marketing planning easier. Olist can adjust its sales tactics to maximize profit during the high season.
- "Electronics" contributed the most revenue and "Food & Drinks" the least. Since "Electronics" is a high-revenue category, Olist should highlight more of these items.
- Product prices and payment values are positively correlated. The outliers indicate substantial purchases with a higher payment value. Olist can boost its revenue by focusing on large, high-value purchases, and goods that offer good value for money deserve extra attention.
- Product price and volume are positively correlated. Most transactions involve relatively inexpensive, low-volume products. To boost revenue, Olist can consider offering product bundles with larger quantities at higher prices.
- Average prices fluctuate within a product category, and understanding this can help with pricing. Olist can adjust product prices in specific categories to maximize revenue.
- Credit card is the most popular payment method among Olist customers. Olist can keep supporting credit card payments and offer promotions or incentives for using this method.
- The majority of payments are made in a single installment. Olist can offer more installment options to customers and promote their benefits.
We examined several facets of Olist's operations in this analysis. We found patterns and trends, from the popularity of products to customer payments, that can help Olist grow its revenue and business. Possible tactics are to draw attention to the in-demand "Electronics" product category, promote "Computer Accessories" more, optimize product pricing across categories, and keep supporting credit card payments with exclusive offers. Growth also hinges on improving customer service and promotions in states like SP (São Paulo) that contribute the most revenue. With these measures, Olist can increase its revenue more effectively and economically.
Note
📌 This dataset allows for a more in-depth investigation. There is more to learn from it in the future, but for now we stick to the 12 objectives. 🤝
You can find another, easier-to-read report in my portfolio on Medium.
I'm still learning to write, and mistakes are unavoidable even when I try my hardest. Feel free to offer feedback and recommendations, and let me know if you spot any problems or mistakes 🙏

























