Understanding your data is crucial before delving into analysis because it lays the foundation for informed and accurate insights. Data serves as the raw material from which patterns and trends are extracted, forming the basis of any analytical exercise. Without a thorough comprehension of the data’s context, quality, and characteristics, there is a risk of drawing erroneous conclusions or making misguided decisions. Understanding the nuances of the data helps in identifying potential biases, outliers, and limitations, allowing for appropriate preprocessing and cleaning. Moreover, it enables researchers and analysts to choose the most suitable analytical methods, ensuring that the results are not only statistically robust but also align with the real-world implications of the data. In essence, understanding your data thoroughly and then processing it to address the identified anomalies enhances the reliability and validity of analytical outcomes, contributing to more informed and effective decision-making.
This article explains the step-by-step process of understanding your data to lay a robust foundation for insightful analysis. By following these key steps, you’ll be equipped to extract meaningful patterns and make informed decisions from your dataset. You can follow along with the examples in this article using the provided dataset, which is available on my GitHub. This will enable you to apply the concepts and techniques discussed in a hands-on manner, reinforcing your understanding of the importance of data exploration. While this article explains the concept using R code, you can head here if you are using Python.
1. Importing the data
setwd("C:/Users/user/Superstore_data")
library(readxl) #for reading xlsx files
library(writexl) #for writing xlsx files
library(dplyr) #for data cleaning and wrangling
library(psych) #for descriptive statistcs
ss = read_xlsx("superstore_analysis.xlsx") #51389 observations
2. Exploring the Data Overview with head()
To start, we employ the head() function in R to gain an initial glimpse of the dataset. This simple yet powerful command allows us to observe the first few rows of the data (6 rows by default), providing a quick overview of its structure and content. Understanding the dataset at this level helps us identify the available columns and get a feel for the types of information they contain.
#---------------PART 1: understand the data--------------------
head(ss)
Output:
# A tibble: 10 × 25
Category City Country Customer.ID Customer.Name Discount Market `Shu-Sh`
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 Office Supplies Los Angeles United S… LS-172304 Lycoris Saun… 0 US 1
2 Office Supplies Los Angeles United S… MV-174854 Mark Van Huff 0 US 1
3 Office Supplies Los Angeles United S… CS-121304 Chad Sievert 0 US 1
4 Office Supplies Los Angeles United S… CS-121304 Chad Sievert 0 US 1
5 Office Supplies Los Angeles United S… AP-109154 Arthur Prich… 0 US 1
# ℹ 17 more variables: Order.Date <dttm>, Order.ID <chr>, Order.Priority <chr>,
# Product.ID <chr>, Product.Name <chr>, Profit <dbl>, Quantity <dbl>, Region <chr>,
# Row.ID <dbl>, Sales <dbl>, Segment <chr>, Ship.Date <chr>, Ship.Mode <chr>,
# Shipping.Cost <dbl>, State <chr>, Sub.Category <chr>, Market2 <chr>
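Because the tibble print truncates the remaining variables, a small optional check such as names() lists every column in one go:
names(ss) #prints all 25 column names, including the ones hidden in the truncated output above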
3. Unveiling the Structure with str()
Moving on, we delve deeper into the dataset’s structure using the str() function. This command reveals essential details such as the data type of each variable, the number of observations, and the overall structure of the dataset. Ensuring that columns behave as expected is crucial for a successful analysis, and str() is our go-to tool for this purpose.
str(ss) #Ship.Date in character form instead of date
Output: A snapshot of a few columns to keep it simple. Here we observe that the column “Ship.Date”, which should be in Date format, is stored as the ‘character’ data type. Congratulations on spotting the first anomaly in the data; a sketch for fixing it follows the output below.
$ Row.ID : num [1:51389] 36624 37033 31468 31469 32440 ...
$ Sales : num [1:51389] 19 19 21 111 6 13 19 12 54 49 ...
$ Segment : chr [1:51389] "Consumer" "Consumer" "Consumer" "Consumer" ...
$ Ship.Date : chr [1:51389] "2011-01-09" "2011-01-26" "2011-08-09" "2011-08-09" ...
$ Ship.Mode : chr [1:51389] "Second Class" "Standard Class" "Standard Class" "Standard Class" ...
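Once an anomaly like this is spotted, the fix is usually a simple type conversion. The snippet below is a minimal sketch that assumes every Ship.Date value follows the “YYYY-MM-DD” pattern visible above; any value that does not match the format would become NA and should be investigated separately.
ss$Ship.Date <- as.Date(ss$Ship.Date, format = "%Y-%m-%d") #convert the character column to Date
class(ss$Ship.Date) #should now return "Date"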
4. Descriptive Statistics with describe()
Next, we leverage the describe() function from the psych package to obtain basic statistical summaries of the data. This step provides insights into central tendencies, dispersion, and the overall distribution of numerical variables. Understanding the basic statistics of the dataset aids in identifying potential outliers and assessing the general characteristics of the data.
#---------------PART 2: examine the data--------------------
#Descriptive statistics
#unique values in each character column
#duplicate rows
#------------------------------------------------------------
psych::describe(ss)
Output: A snapshot of a few columns (with select statistics) to keep it simple. Here we make quite a few observations. First, the column Shu-Sh has no variation (min = 1, max = 1, sd = 0) and contains only a single value throughout the rows. Second, we observe negative quantities in some of the rows (min = -11), which looks like an inconsistency in the data. Third, we observe that the minimum value for Sales is 0, which may need investigation: if the value is 0, why is that entry part of the records? And lastly, we observe that all but two columns have 51,389 records; columns ‘Country’ and ‘Order.Date’ have fewer entries, indicating missing values. A sketch of quick follow-up checks appears after the output below.
vars n mean sd median trimmed mad min max range skew kurtosis se
Category* 1 51389 2.21 0.97 2.00 2.13 0.00 1.00 4.00 3.00 0.87 -0.24 0.00
Country* 3 51084 80.08 46.68 82.00 81.45 68.20 1.00 147.00 146.00 0.01 -1.41 0.21
Shu-Sh 8 51389 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN 0.00
Order.Date 9 50624 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
Quantity 15 51389 3.46 2.30 3.00 3.16 1.48 -11.00 14.00 25.00 1.28 2.37 0.01
Sales 18 51389 246.49 487.38 85.00 142.04 99.33 0.00 22638.00 22638.00 8.13 176.63 2.15
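To quantify how widespread these issues are before fixing anything, the following sketch counts the suspect rows directly; it simply assumes the column names shown in the output above.
#quick follow-up checks on the anomalies flagged by describe()
sum(is.na(ss$Country)) #number of missing country values
sum(is.na(ss$Order.Date)) #number of missing order dates
sum(ss$Quantity < 0) #rows with negative quantities
sum(ss$Sales == 0) #rows with zero sales
length(unique(ss$`Shu-Sh`)) #confirms Shu-Sh holds a single value throughout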
5. Exploring Unique Values with unique()
As the next step, we scrutinize each character column using the unique() function. This step is crucial for uncovering potential anomalies, irregularities, or unexpected values within categorical variables. Identifying unique values helps in detecting data quality issues and ensures that the data is suitable for the intended analysis.
unique(ss$Category)
unique(ss$Country)
unique(ss$Market)
unique(ss$Order.Priority)
unique(ss$Region)
unique(ss$Ship.Mode)
unique(ss$Sub.Category)
unique(ss$Market2)
Output: Here again, we notice a few anomalies in the data which need to be addressed before we dive into the analysis. First, we notice the presence of the text value “NA” among the values in ‘Order.Priority’. Second, we see that the only difference between columns ‘Market’ and ‘Market2’ is that in ‘Market2’, ‘US’ and ‘Canada’ are combined into a single entry, ‘North America’. Lastly, under column ‘Category’, we see two different heads: Office Supplies and Office Supplies & Stationary. This needs investigation to understand whether these two heads are really different or just a data entry error. A sketch of follow-up checks appears after the output below.
> unique(ss$Category)
[1] "Office Supplies" "Office Supplies & Stationary" "Technology"
[4] "Furniture"
> unique(ss$Market)
[1] "US" "EU" "LATAM" "Africa" "APAC" "EMEA" "Canada"
> unique(ss$Order.Priority)
[1] "High" "Medium" "Critical" "NA" "Low"
> unique(ss$Region)
[1] "West" "East" "South" "Central" "Africa" "Central Asia" "North Asia"
[8] "Caribbean" "North" "EMEA" "Oceania" "Southeast Asia" "Canada"
> unique(ss$Ship.Mode)
[1] "Second Class" "Standard Class" "Same Day" "First Class"
> unique(ss$Sub.Category)
[1] "Paper" "Art" "Storage" "Appliances" "Supplies" "Envelopes" "Fasteners" "Labels"
[9] "Binders" "Accessories" "Phones" "Copiers" "Machines" "Tables" "Bookcases" "Chairs"
[17] "Furnishings"
> unique(ss$Market2)
[1] "North America" "EU" "LATAM" "Africa" "APAC" "EMEA"
6. Checking presence of duplicate rows with a combination of sum() and duplicated()
In addition to exploring the structure and characteristics of our data, it’s essential to address the potential presence of duplicate rows. Duplicate rows can introduce biases and inaccuracies into our analyses, leading to skewed results. To identify them, we employ a simple yet effective check: the sum(duplicated()) expression. duplicated() flags every row that is an exact repeat of an earlier one, and sum() counts those flags, allowing us to quickly assess whether there are any instances of data repetition. Resolving duplicates is a crucial step in ensuring the integrity of our analysis, as it prevents double-counting of observations and provides a more accurate representation of the underlying patterns within the dataset.
sum(duplicated(ss))
Output: We see that the data contains 99 duplicate rows in total, which need to be addressed / cleaned before we begin our analysis. A sketch for handling them follows the output below.
> sum(duplicated(ss))
[1] 99
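As one way of handling this (a minimal sketch rather than the only approach), you can inspect the repeated rows and then drop them with dplyr’s distinct(), which keeps the first occurrence of each row:
ss[duplicated(ss), ] #inspect the 99 duplicated rows before removing them
ss_clean <- distinct(ss) #drop exact duplicates, keeping the first occurrence
nrow(ss) - nrow(ss_clean) #should equal 99, the count reported above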
In conclusion, a meticulous understanding of your data is not merely a preliminary step; it is the cornerstone of successful analysis. By navigating through the initial exploration with tools like head(), str(), and describe(), we lay the groundwork for robust analysis, ensuring that our insights are built upon a foundation of well-understood data. The scrutiny of unique values and the detection of duplicate rows further refine our dataset, eliminating potential pitfalls and fortifying the reliability of our findings.
Remember, the journey from raw data to meaningful insights is a nuanced process, and each step contributes to the overall narrative that the data tells us. Through this comprehensive approach, we empower ourselves to make informed decisions, avoiding common pitfalls such as biases, outliers, and inaccuracies. The real-world implications of our analyses become more tangible when supported by a solid understanding of the data’s context and characteristics.
As you embark on your own data exploration endeavors, embrace the power of these fundamental techniques. Let them be your guiding compass, steering you through the intricate landscape of your dataset. The ability to extract meaningful insights from data is not just a skill; it’s an art form that evolves with each exploration. So, go ahead, dive into your data with confidence, armed with the knowledge that a thorough understanding of your dataset is the key to unlocking its full potential. Happy exploring!