Data Analytics Step 1: Understand Your Data (using Python)

Understanding your data is crucial before delving into analysis because it lays the foundation for informed and accurate insights. Data is the raw material from which patterns and trends are extracted, forming the basis of any analytical exercise. Without a thorough comprehension of the data’s context, quality, and characteristics, there is a risk of drawing erroneous conclusions or making misguided decisions. Understanding the nuances of the data helps in identifying potential biases, outliers, and limitations, allowing for appropriate preprocessing and cleaning. It also enables researchers and analysts to choose the most suitable analytical methods, ensuring that the results are not only statistically robust but also aligned with the real-world implications of the data. In short, understanding your data thoroughly and then processing it to address the identified anomalies enhances the reliability and validity of analytical outcomes, contributing to more informed and effective decision-making.

This article explains the step-by-step process of understanding your data to lay a robust foundation for insightful analysis. By following these key steps, you’ll be equipped to extract meaningful patterns and make informed decisions from your dataset. You can follow along with the examples in this article using the provided dataset, which is available on my GitHub. This will enable you to apply the concepts and techniques discussed in a hands-on manner, reinforcing your understanding of the importance of data exploration. While this article explains the concepts using Python code, you can head here if you are using R.

1. Importing the data

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Mount Google Drive so the notebook can access files stored there
from google.colab import drive
drive.mount('/content/gdrive')

# Load the superstore dataset from the mounted Drive folder
ss = pd.read_excel('/content/gdrive/My Drive/Colab Notebooks/superstore_analysis.xlsx')
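
If you are not running this notebook in Google Colab, the same file can be read from a local path instead. This is only a sketch; the filename below assumes you have downloaded the dataset from the GitHub repository mentioned above into your working directory.

Python
# Read the dataset from a local copy instead of Google Drive
# (reading .xlsx files requires the openpyxl package to be installed)
ss = pd.read_excel('superstore_analysis.xlsx')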

2. Exploring the Data Overview with head():

Python
ss.head()

Output:

Category City Country Customer.ID Customer.Name Discount Market Shu-Sh Order.Date Order.ID ... Region Row.ID Sales Segment Ship.Date Ship.Mode Shipping.Cost State Sub.Category
0 Office Supplies Los Angeles United States LS-172304 Lycoris Saunders 0.0 US 1 2011-01-07 CA-2011-130813 ... West 36624 19 Consumer 2011-01-09 Second Class 4.37 California Paper
1 Office Supplies Los Angeles United States MV-174854 Mark Van Huff 0.0 US 1 2011-01-21 CA-2011-148614 ... West 37033 19 Consumer 2011-01-26 Standard Class 0.94 California Paper
2 Office Supplies Los Angeles United States CS-121304 Chad Sievert 0.0 US 1 2011-08-05 CA-2011-118962 ... West 31468 21 Consumer 2011-08-09 Standard Class 1.81 California Paper
3 Office Supplies Los Angeles United States CS-121304 Chad Sievert 0.0 US 1 NaT CA-2011-118962 ... West 31469 111 Consumer 2011-08-09 Standard Class 4.59 California Paper
4 Office Supplies Los Angeles United States AP-109154 Arthur Prichep 0.0 US 1 2011-09-29 CA-2011-146969 ... West 32440 6 Consumer 2011-10-03 Standard Class 1.32 California Paper
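
Because the dataset has many columns, head() truncates the display (the '...' in the output above). A quick way to check the overall shape and see the full column list is sketched below.

Python
# Dataset dimensions (rows, columns) and the full list of column names
print(ss.shape)
print(ss.columns.tolist())

# Optionally, show every column in subsequent head() calls
pd.set_option('display.max_columns', None)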

3. Unveiling the Structure with info(): The info() method provides a concise summary of each column, the number of non-missing entries it contains, and its data type.

Python
ss.info()  #Ship.Date in character form instead of date

Output: A snapshot with selected columns. The column “Ship.Date”, which should be in a date format, is stored with the ‘object’ data type. We also observe that all but two columns have 51,389 records; ‘Country’ and ‘Order.Date’ have fewer entries, indicating missing values. A quick way to quantify these is sketched after the output below.

2   Country         51084 non-null  object
8   Order.Date      50624 non-null  datetime64[ns]
15  Region          51389 non-null  object        
16  Row.ID          51389 non-null  int64         
17  Sales           51389 non-null  int64         
18  Segment         51389 non-null  object        
19  Ship.Date       51389 non-null  object
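
To quantify the missing values that info() hints at, a quick check with isnull().sum() works well. This is a minimal sketch based on the column counts reported above; only columns with at least one missing entry are printed.

Python
# Count missing values per column and show only the columns that have any
missing_counts = ss.isnull().sum()
print(missing_counts[missing_counts > 0])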

4. Descriptive Statistics with describe(): Next, we leverage the describe() function to obtain basic statistical summaries of the data.

Python
ss.describe()

Output: Here we make quite a few observations. First, the column Shu-Sh has no variation (min = 1, max = 1, std = 0) and contains only a single value across all rows. Second, some rows have negative quantities (min = -11), which looks like an inconsistency in the data. Third, the minimum value for Sales is 0, which may need investigation: if the sale value is 0, why is that entry part of the records at all? Simple checks for each of these issues are sketched after the output below.

     	Discount	Shu-Sh	Profit	Quantity	Row.ID	Sales	Shipping.Cost
count	51389.000000	51389.0	51389.000000	51389.000000	51389.000000	51389.000000	51389.000000
mean	0.142835	1.0	28.619139	3.463504	25642.236043	246.490961	26.370548
std	0.212209	0.0	174.232675	2.298596	14803.961859	487.384037	57.275683
min	0.000000	1.0	-6599.978000	-11.000000	1.000000	0.000000	0.002000
25%	0.000000	1.0	0.000000	2.000000	12820.000000	31.000000	2.610000
50%	0.000000	1.0	9.240000	3.000000	25641.000000	85.000000	7.791000
75%	0.200000	1.0	36.807000	5.000000	38460.000000	251.000000	24.450000
max	0.850000	1.0	8399.976000	14.000000	51290.000000	22638.000000	933.570000
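
Each of these observations can be verified directly with a few one-liners. The sketch below assumes the column names shown in the summary above.

Python
# Constant column: number of distinct values in Shu-Sh (expected to be 1)
print(ss['Shu-Sh'].nunique())

# Inconsistency: number of rows with a negative Quantity
print((ss['Quantity'] < 0).sum())

# Investigation: number of rows where Sales is exactly 0
print((ss['Sales'] == 0).sum())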

5. Exploring Unique Values with unique(): As the next step, we scrutinize each character column using the unique() function. This step is crucial for uncovering potential anomalies, irregularities, or unexpected values within categorical variables. Identifying unique values helps in detecting data quality issues and ensures that the data is suitable for the intended analysis.

Python
ss['Category'].unique()
ss['Market'].unique()
ss['Market2'].unique()
ss['Order.Priority'].unique()
ss['Segment'].unique()
ss['Ship.Mode'].unique()
ss['Sub.Category'].unique()

Output: Here again, we notice a few anomalies in the data which need to be addressed before we dive into the analysis. First, we notice the presence of NaN as one of the values in ‘Order.Priority’. Second, we see that the only difference between the columns ‘Market’ and ‘Market2’ is that in ‘Market2’, ‘US’ and ‘Canada’ are combined into a single entry, ‘North America’. Lastly, under the column ‘Category’, we see two different heads: Office Supplies and Office Supplies & Stationary. This needs investigation to understand whether these two heads are really different or just a data entry error. A quick cross-check of both issues is sketched after the output below.

array(['Office Supplies', 'Office Supplies & Stationary', 'Technology',
       'Furniture'], dtype=object)
       
array(['US', 'EU', 'LATAM', 'Africa', 'APAC', 'EMEA', 'Canada'],
      dtype=object)
      
array(['North America', 'EU', 'LATAM', 'Africa', 'APAC', 'EMEA'],
      dtype=object)
      
array(['High', 'Medium', 'Critical', nan, 'Low'], dtype=object)

array(['Consumer', 'Home Office', 'Corporate'], dtype=object)

array(['Second Class', 'Standard Class', 'Same Day', 'First Class'],
      dtype=object)

array(['Paper', 'Art', 'Storage', 'Appliances', 'Supplies', 'Envelopes',
       'Fasteners', 'Labels', 'Binders', 'Accessories', 'Phones',
       'Copiers', 'Machines', 'Tables', 'Bookcases', 'Chairs',
       'Furnishings'], dtype=object)
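
To dig into these anomalies, a crosstab of the two market columns and a count of rows per category label are quick sanity checks. This is a sketch using the column names from the dataset above.

Python
# How do Market values map onto Market2 values?
print(pd.crosstab(ss['Market'], ss['Market2']))

# How many rows fall under each Category label?
print(ss['Category'].value_counts())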

6. Checking presence of duplicate rows with a combination of sum() and duplicated(): In addition to exploring the structure and characteristics of our data, it’s essential to address the potential presence of duplicate rows. Duplicate rows can introduce biases and inaccuracies into our analyses, leading to skewed results. To identify them, we use the duplicated().sum() expression, which sums the boolean flags returned by duplicated() and gives the total count of duplicated rows, letting us quickly assess whether there is any data repetition. Resolving duplicates is a crucial step in ensuring the integrity of our analysis, as it prevents double-counting of observations and provides a more accurate representation of the underlying patterns within the dataset.

Python
ss.duplicated().sum()

Output: We see that the data contains 99 duplicate rows in total, which need to be addressed / cleaned before we begin our analysis. A sketch for inspecting and removing them follows the output below.

99
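
If you want to inspect the duplicate rows before removing them, the sketch below is one common approach; drop_duplicates() keeps the first occurrence of each duplicated row by default.

Python
# Inspect duplicated rows (keep=False marks every copy, not just the repeats)
dupes = ss[ss.duplicated(keep=False)]
print(dupes.shape)

# Drop exact duplicate rows, keeping the first occurrence of each
ss_clean = ss.drop_duplicates()
print(ss_clean.shape)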

In conclusion, a meticulous understanding of your data is not merely a preliminary step; it is the cornerstone of successful analysis. By navigating through the initial exploration with tools like head(), info(), and describe(), we lay the groundwork for robust analysis, ensuring that our insights are built upon a foundation of well-understood data. The scrutiny of unique values and the detection of duplicate rows further refine our dataset, eliminating potential pitfalls and fortifying the reliability of our findings.

Remember, the journey from raw data to meaningful insights is a nuanced process, and each step contributes to the overall narrative that the data tells us. Through this comprehensive approach, we empower ourselves to make informed decisions, avoiding common pitfalls such as biases, outliers, and inaccuracies. The real-world implications of our analyses become more tangible when supported by a solid understanding of the data’s context and characteristics.

As you embark on your own data exploration endeavors, embrace the power of these fundamental techniques. Let them be your guiding compass, steering you through the intricate landscape of your dataset. The ability to extract meaningful insights from data is not just a skill; it’s an art form that evolves with each exploration. So, go ahead, dive into your data with confidence, armed with the knowledge that a thorough understanding of your dataset is the key to unlocking its full potential. Happy exploring!
