Data Preprocessing for Machine Learning
What is Data Pre-Processing?
Data pre-processing is the set of techniques Data Scientists use to clean raw
data so that a machine can process it and feed it to the models best suited to
the problem. Here, the raw data is transformed into clean data using data
manipulation techniques, which makes it easier for the machine to interpret
the data in question.
Why Data Pre-Processing?
Think of it like this: to qualify as a Data Scientist by industry standards,
you have to be trained not just in the theory of Data Science but also in its
real-world applications. That practical training is where doubts about the
knowledge you have accumulated get cleared up, and where you are prepared to
deal with client data and business problems that cannot be taken lightly.
Pre-processing does the same job for raw data.
Tools for Data Pre-Processing
The tools for data pre-processing depend on the type of data at our disposal.
If you are familiar with the CRISP-DM model, you will know the back and forth
it recommends: a Data Scientist constantly checks whether the data is ready to
be fed into the machine learning algorithms or not. To make that possible, it
is essential to clean the data at its very source, that is, when it is
collected. This can be done manually or by a third party. Either way, here is
a list of the tools required to make these changes.
1. Data Understanding – This can be done using Data Visualisation tools
which can show bar plots, scatter plots, histograms or heat maps where one
could spot discrepancies with the naked eye.
2. Data Manipulation – For missing values in a data set, we can apply a
missing value treatment chosen according to the type of data and the
relationships between its features. Deciding on that treatment requires a
univariate, bivariate, or multivariate analysis.
Before we dive deep into Data Cleaning and Data Transformation, we have to
understand these types of analysis, since they help us see the impact that
corrections have on the raw data.
Univariate Analysis: This analysis examines a single variable, which is why
it is called univariate. It is typically used to obtain the variable’s mean,
median, mode, variance, range, interquartile range, or standard deviation. For
example, an array of numerical values is summarised on its own, without being
compared with another data set.
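The statistics listed above can be computed with a few lines of code. The following is a minimal sketch using only Python's standard library; the function name `univariate_summary` and the sales figures are made up for illustration.

```python
import statistics

def univariate_summary(values):
    """Return basic descriptive statistics for a single numeric variable."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "variance": statistics.variance(ordered),  # sample variance
        "std_dev": statistics.stdev(ordered),      # sample standard deviation
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
    }

# Hypothetical daily sales figures, used only to exercise the function.
daily_sales = [12, 15, 15, 18, 20, 22, 40]
summary = univariate_summary(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that quartiles can be computed with several conventions, so other tools may report a slightly different interquartile range for the same data.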
Bivariate Analysis: This is the analysis of two variables or data sets, which
are compared to draw conclusions or insights about how they relate to each
other. Normally, we perform it wherever we suspect that one variable may
influence the other.
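One common way to quantify how two numeric variables move together is the Pearson correlation coefficient. The article does not name a specific measure, so treat this as one possible sketch; `pearson_correlation`, `ad_spend`, and `revenue` are illustrative names with made-up values.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant.
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)

# Hypothetical data: revenue rises in a straight line with ad spend.
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 45, 65, 85, 105]
r = pearson_correlation(ad_spend, revenue)
print(f"correlation: {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. Correlation alone does not show that one variable influences the other.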
Multivariate Analysis: Multivariate analysis is the same as bivariate analysis,
except that it considers more than two variables or data sets.
All in all, in pre-processing these analyses give us a clear indication of
which treatment or method to use for cleaning and preparing the data for
modeling. Now let’s take a look at some of these methods.
A. Missing Value Treatment:
In order to prepare the data, we either transform the data set (for example,
by dropping incomplete records) or assign new values to the missing entries.
During multivariate analysis, if we have a large number of missing values, it
may be suitable to drop the nulls. In univariate analysis, however, dropping
the missing values entirely can bias the way we look at the data set as a
whole.
<1% Missing Values: Usually, we consider the percentage of missing values
relative to the entire data set. When fewer than 1% of the values are missing,
we can usually drop the affected rows safely, as doing so makes little
difference to the overall picture. For example, the sales of a single day are
negligible compared with the sales made throughout the year.
<7% Missing Values: A rule of thumb often cited in industry is to replace
nulls with a finite value when fewer than 7% of the values are missing. In
this case, we replace each null with the mean, median, or mode observed in the
data set or in a sample of it.
>30% Missing Values: Another major criterion to look out for is when more
than 30% of the values are missing. In this case we drop the feature
altogether, because filling in so many values could bias the shape of the
entire data set and misrepresent the data. Hence, we avoid imputing here.
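The three thresholds above can be expressed as a small decision function. This is a sketch under stated assumptions: missing entries are represented as `None`, the function name `treat_missing` and its return labels are hypothetical, and the "review" branch for the 7% to 30% band is an addition of this sketch, since the text gives no rule for that range.

```python
import statistics

# Thresholds taken from the rules of thumb described above.
DROP_ROWS_BELOW = 0.01
IMPUTE_BELOW = 0.07
DROP_FEATURE_ABOVE = 0.30

def treat_missing(column, strategy="median"):
    """Return (action, treated_column) for one numeric column."""
    if not column:
        raise ValueError("column must not be empty")
    present = [v for v in column if v is not None]
    fraction = (len(column) - len(present)) / len(column)

    if fraction == 0:
        return "keep", list(column)
    if fraction < DROP_ROWS_BELOW:
        # In a real table the whole row would be dropped, not just this cell.
        return "drop_rows", present
    if fraction < IMPUTE_BELOW:
        fillers = {
            "mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode,
        }
        fill = fillers[strategy](present)
        return "impute", [fill if v is None else v for v in column]
    if fraction > DROP_FEATURE_ABOVE:
        return "drop_feature", None
    # Assumption: the text does not cover this band, so flag it for a human.
    return "review", list(column)

# Hypothetical column with 5% of its values missing.
readings = list(range(1, 96)) + [None] * 5
action, treated = treat_missing(readings)
print(action, treated[-3:])
```

Keeping the thresholds as named constants makes it easy to adjust them, since these percentages are conventions rather than fixed rules.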
B. Outlier Treatment:
Outliers become a constant threat to the dataset when they are not taken care
of. A simple problem that may arise is that they take on extreme values that
distort the distribution. How to handle them, however, is a decision for the
business team, and a Data Scientist doesn’t usually act without its approval.
To identify outliers, a box plot is commonly used. The changes are then made
in consultation with the business team, which decides the replacement value,
usually the mean of the data set.
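A box plot conventionally draws its whiskers at 1.5 times the interquartile range beyond the quartiles, and points outside them are flagged as outliers. The sketch below applies that convention numerically. The function names are hypothetical, and replacing outliers with the mean of the remaining values (rather than the mean of the full data set) is an assumption of this sketch; in practice the replacement value is whatever the business team agrees on.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences, as drawn by a conventional box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def replace_outliers_with_mean(values, k=1.5):
    """Replace values outside the fences with the mean of the inliers."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    # Assumption: the mean is taken over inliers so the outliers
    # themselves do not skew the replacement value.
    replacement = statistics.mean(inliers)
    return [v if low <= v <= high else replacement for v in values]

# Hypothetical order counts with one extreme value.
orders = [10, 12, 11, 13, 12, 11, 95]
cleaned = replace_outliers_with_mean(orders)
print(cleaned)
```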
C. Bad encoding (text)
Let’s consider the example of Alexa, which understands only a select few
languages. If we give it instructions in a language it does not recognize, it
is not going to work. When this happens with text data, we call it bad
encoding: a machine learning model needs numbers, not raw text. For such
text, we turn the categorical data into numerical data with a process called
one-hot encoding. Here we create a dummy variable (column) for each category
and assign 1 to the column that is true for the row and 0 to the rest. If we
invert the scheme, marking the true column with 0 and the rest with 1, it’s
called one-cold encoding.
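To make the idea concrete, here is a minimal one-hot encoder in plain Python. The function name `one_hot_encode`, the `is_` column prefix, and the sample colors are all illustrative choices; data libraries provide ready-made equivalents such as `pandas.get_dummies`.

```python
def one_hot_encode(values):
    """Return (categories, rows), one dummy column per distinct category."""
    categories = sorted(set(values))
    rows = [
        # 1 in the column matching this value, 0 in every other column.
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]
    return categories, rows

# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
for color, row in zip(colors, rows):
    print(color, row)
```

Each row contains exactly one 1, which is what gives the encoding its name.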
Conclusions: Data preprocessing is a key stage in data analysis, where the
data is shaped, scaled, or standardized using the tools mentioned above. This
process makes the data fit for Machine Learning and keeps it aligned with the
business understanding that lies at the core of any Data Science project.
If you’re interested in a career in Data Science, one of the hottest skills in
the market today, you can enroll in the course with Skillslash. At Skillslash,
you get a unique opportunity to gain real work experience at top MNCs upon
completion of the course. To find out more, get in touch with one of our
counselors today by visiting https://skillslash.com/data-science-course-training-kolkata