Understanding Data Pre-Processing in Statistical Analysis
January 22, 2021
Collected data is
often incomplete and inconsistent, and it is likely to contain many errors. Data
preprocessing is a proven method of resolving these issues.
In simple words,
data preprocessing is the process of collecting, selecting, and transforming
data so that it can be analyzed. Basically, it means converting raw data into a
form that is ready for analysis.
It is also known as data cleaning or munging. It is the most time-consuming part of statistical analysis, accounting for almost 80% of the time taken for analysis. Data preprocessing is very important in any statistical analysis; if it is neglected, it will directly affect the success rate of the project.
Data preprocessing allows us to remove any unnecessary data with the use of various
techniques, which gives the user a dataset that contains more valuable
information. The data are edited to correct data corruption or human error. This is an important
step for getting accurate quantifiers such as the true positives, true negatives, false
positives, and false negatives found in a confusion matrix, which are commonly
used in medical diagnosis.
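
To make the four confusion-matrix quantities concrete, here is a minimal sketch that counts them for a binary diagnosis. The function name `confusion_counts` and the label lists are hypothetical and invented purely for illustration; they do not come from any real dataset.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Predicted positive: correct if the truth is also positive.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the truth was positive.
            counts["FN" if truth == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Hypothetical labels: 1 = disease present, 0 = disease absent.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(actual, predicted))
```

A single corrupted or mistyped label in `actual` moves a record from one cell to another, which is why errors in the data need to be corrected before these counts are trusted.
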
Unnecessary information is removed from the data, which allows the analyst to achieve higher
accuracy. Analysts use Python scripts together with the pandas
library, which lets them import data from comma-separated values (CSV) files
as a DataFrame.
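
As a rough illustration of the import step, the sketch below loads CSV text into a pandas DataFrame with `pandas.read_csv` and counts the missing entries per column. The column names and values are made up, and an in-memory string stands in for a real file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path instead.
raw_csv = io.StringIO(
    "patient_id,age,blood_pressure\n"
    "1,34,120\n"
    "2,,135\n"
    "3,51,\n"
)

# Empty fields are read as missing values (NaN).
df = pd.read_csv(raw_csv)

print(df.shape)         # number of rows and columns
print(df.isna().sum())  # missing values per column
```

Checking the shape and the missing-value counts right after import is a cheap first look at how much cleaning the dataset is likely to need.
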
Methods of Data Preprocessing:
Now let us
discuss the methods of data preprocessing. Data preprocessing is done using the
following two methods:
Missing Value Treatment
Outlier Correction/Treatment
Let us look at missing value treatment first:
i. Missing Value Treatment:
There could be
several reasons behind missing values, such as human error, incorrectly
received data, output errors, and so on. To fill in the
missing values, we use imputation
techniques based on the mean, median, or mode.
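
The three imputation techniques mentioned above can be sketched with pandas as follows. The DataFrame and its column names are hypothetical, and the pairing of each column with a technique (mean for `age`, median for `income`, mode for `city`) is only one reasonable choice, not a rule.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, 30.0, None, 35.0],
    "income": [40000.0, None, 52000.0, 1000000.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

imputed = df.copy()

# Mean imputation: suits numeric columns without extreme values.
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

# Median imputation: less sensitive to extreme values than the mean.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Mode imputation: the usual choice for categorical columns.
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(imputed)
```

The median is used for `income` here because one very large value would pull the mean far away from what is typical for the other rows.
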
Let us now look
at the second method of data preprocessing.
ii. Outlier Correction/Treatment:
Before we understand the procedure, let’s understand what outliers are:
An outlier is a
data point that lies outside the range of the remaining data points in a
dataset.
E.g., sales of
electronic goods during holidays such as the Black Friday sale, New Year, and so on
could result in outliers in the sales data for that entire year. This is because
electronic goods sales are much higher on these days of the year than
on normal days.
To identify outliers, we have a simple technique called the box plot method. The values that
fall outside the upper and lower limits are called outliers.
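
The text does not say how the upper and lower limits are computed. The sketch below assumes the conventional box plot rule, in which the limits sit 1.5 times the interquartile range (IQR) below the first quartile and above the third quartile. The function name `box_plot_outliers` and the sales figures are hypothetical.

```python
import statistics


def box_plot_outliers(values, k=1.5):
    """Return (lower limit, upper limit, outliers) using the IQR rule.

    Assumes the conventional multiplier k = 1.5 for the box plot whiskers.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical daily sales with one holiday spike.
daily_sales = [100, 102, 98, 105, 97, 101, 99, 103, 100, 400]

lower, upper, outliers = box_plot_outliers(daily_sales)
print(lower, upper, outliers)
```

Whether a flagged value such as the holiday spike should be removed, capped, or kept is a judgment call that depends on the analysis; the box plot method only identifies the candidates.
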
Data preprocessing is a crucial stage in the whole process of
statistical analysis, and you cannot afford to make mistakes at
this point. If you make an error here, you won’t get the desired
result from the analysis. As said, this step alone takes almost 80% of the
time of the statistical analysis process.