Statistics for Data Science: Measures of Dispersion

What is dispersion?

Dispersion in statistics measures how far the data points spread out from a reference point. Usually this reference point is the mean, which describes the central tendency of a data set. That makes it a natural point from which to measure how far each value lies, so that all the data can be compared against a single criterion.

For example, let’s assume the average of all students’ scores in a class is 70%. From this overall average, we can calculate how far each student’s score deviates. In other words, this figure tells us how far behind or ahead a student is compared to the class average.
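The idea can be sketched in a few lines of Python. The student names and scores below are hypothetical, chosen only so that the class average comes out at 70; they are not taken from any real class.

```python
from statistics import mean

# Hypothetical scores (in percent), chosen so that the class average is 70.
scores = {"Asha": 55, "Ben": 62, "Chen": 70, "Dev": 78, "Eva": 85}

class_average = mean(list(scores.values()))

# Deviation of each student from the class average:
# negative means behind the average, positive means ahead of it.
deviations = {name: score - class_average for name, score in scores.items()}

print(class_average)
print(deviations)
```

Here Asha is 15 points behind the average and Eva is 15 points ahead of it.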

In Data Science, where we deal with enormous amounts of data, measures of dispersion are a natural starting point before we analyse the data on a deeper level. The measures of dispersion commonly used are Range, Interquartile Range, Standard Deviation, and Variance. Let’s look into what these are with the help of some examples.

Measures of dispersion:

  1. Range: Range is the difference between the highest and lowest values in a data set. When we are dealing with large amounts of data, the range gives a quick idea of the span our data points cover before any deeper analysis. Because it depends only on the two extreme values, it is sensitive to outliers. Calculating it by hand and in Python is covered in the sections below.
  2. Interquartile Range: Suppose we sort the data set in ascending order and split it into four parts, each holding a quarter of the data points. The cut points between these parts are the quartiles: Q1 (25%), Q2 (the median, 50%) and Q3 (75%). The interquartile range is the difference between the third quartile and the first quartile, so it describes the spread of the middle 50% of the data points and ignores the extreme values at either end.
  3. Standard Deviation: Standard Deviation tells us how far, on average, the data points lie from the mean, which serves as the reference point. It is expressed in the same units as the data, so the distance of any single value from the mean can be stated as a number of standard deviations.
  4. Variance: Variance, like standard deviation, describes how far the data points spread out from the mean. It is the average of the squared differences between each data point and the mean: take each difference, square it, add the squares up, and divide by the number of data points. The standard deviation is the square root of the variance.

How to calculate measures of dispersion:

  1. Range: As mentioned earlier, range calculates how far the data set has spread out. It can be calculated as follows:

Range = (Highest value in data set) - (Lowest value in data set)

Example: Batting scores of a Cricket Team = (100,90,80,70,60,50,50,20,20,90,10)

From this data set, we can find the range of batting scores by (100) - (10) = 90

So the range of this data set is 90 runs: the highest and lowest batting scores are 90 runs apart.
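The same calculation can be checked in plain Python. This is a minimal sketch that uses only the built-in max() and min() functions on the batting scores from the example.

```python
# Batting scores from the example above.
scores = [100, 90, 80, 70, 60, 50, 50, 20, 20, 90, 10]

# Range = highest value - lowest value.
score_range = max(scores) - min(scores)

print(score_range)  # 90
```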

  2. Interquartile range: This is given by Q3 - Q1, where Q3 is the third quartile and Q1 is the first quartile.

Let’s take the same example of a cricket team’s batting scores, only this time we’ll arrange them in ascending order: (10,20,20,50,50,60,70,80,90,90,100)

From this data set, we first find the median, which is the value of the 6th data point = 60. Leaving the median out, we have five values on either side of it. The median of the left side, i.e. (10,20,20,50,50), is 20, which becomes our Q1, and the median of the right side, i.e. (70,80,90,90,100), is 90, which becomes our Q3.

Now, Q3 - Q1 = 90 - 20 = 70, which becomes our interquartile range of batting scores, i.e. the spread of the middle 50% of the scores, between the 25% and 75% marks.
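The median-of-halves steps above can be sketched in Python as follows. The function name iqr_median_of_halves is hypothetical, introduced here only for illustration, and the sketch assumes at least two data points.

```python
from statistics import median


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method.

    For an odd number of values, the overall median is left out of
    both halves, as in the worked example.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median
    upper = ordered[-half:]   # values above the median
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


scores = [100, 90, 80, 70, 60, 50, 50, 20, 20, 90, 10]
q1, q3, iqr = iqr_median_of_halves(scores)
print(q1, q3, iqr)  # 20 90 70
```

Quartiles can be defined in more than one way. Library functions that interpolate between data points, which is the default in NumPy, give Q1 = 35 and Q3 = 85 for these scores, and therefore an interquartile range of 50 rather than 70.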

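  3. Standard Deviation:

σ = √( ∑ (X − X1)² / (N − 1) )

Where X = the value of each data point, X1 = the mean value, N = the total number of data points. Dividing by N − 1 gives the sample standard deviation; dividing by N instead gives the population standard deviation.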

In the cricket scores example, the mean is 640 / 11 ≈ 58.18. Subtracting the mean from each score, squaring the differences and adding them up gives about 9763.64. Dividing by N − 1 = 10 and taking the square root gives a standard deviation of about 31.25 runs.
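The standard deviation calculation can be followed step by step in Python. This is a minimal sketch that assumes the sample form (dividing by N − 1); the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

scores = [100, 90, 80, 70, 60, 50, 50, 20, 20, 90, 10]

n = len(scores)
mean_score = sum(scores) / n

# Squared difference between each score and the mean.
squared_deviations = [(x - mean_score) ** 2 for x in scores]

# Sample standard deviation: divide by N - 1, then take the square root.
sample_std = sqrt(sum(squared_deviations) / (n - 1))

print(round(mean_score, 2))  # 58.18
print(round(sample_std, 2))  # 31.25

# The standard library's stdev() uses the same N - 1 definition.
print(round(stdev(scores), 2))
```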

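  4. Variance: σ² = ∑ (X − μ)² / N

Where X = the value of each data point, μ = the mean value, N = the total number of data points. Dividing by N gives the population variance; the sample variance divides by N − 1 instead.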

In the cricket scores example, the mean is about 58.18 and the squared differences from it add up to about 9763.64, so the variance is 9763.64 / 11 ≈ 887.60.
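A short Python sketch of the variance calculation, assuming the population form (dividing by N) and comparing it with the sample form for reference. The variable names are illustrative.

```python
from statistics import pvariance, variance

scores = [100, 90, 80, 70, 60, 50, 50, 20, 20, 90, 10]

n = len(scores)
mu = sum(scores) / n

# Population variance: mean of the squared differences from the mean.
population_variance = sum((x - mu) ** 2 for x in scores) / n

print(round(population_variance, 2))  # 887.6

# The standard library offers both forms:
# pvariance() divides by N, variance() divides by N - 1.
print(round(pvariance(scores), 2))
print(round(variance(scores), 2))
```

The sample variance is always a little larger than the population variance for the same data, because it divides the same sum by a smaller number.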

How to calculate the measures of dispersion using Python:

With Python, you can use functions from libraries such as NumPy and SciPy to calculate the measures of dispersion. Note that Python's built-in range() produces a sequence of integers and does not compute the statistical range; for that, subtract min() from max() or use NumPy's ptp() function. The interquartile range can be calculated using the iqr() function from the scipy.stats module. For standard deviation we use the std() function and for variance we use var(), both from the NumPy library. By default std() and var() divide by N; pass ddof=1 to divide by N − 1 instead.
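As a rough illustration, the snippet below applies these library functions to the batting scores used earlier. It assumes NumPy and SciPy are installed, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

scores = np.array([100, 90, 80, 70, 60, 50, 50, 20, 20, 90, 10])

score_range = np.ptp(scores)         # highest value - lowest value
score_iqr = stats.iqr(scores)        # Q3 - Q1, interpolated percentiles
sample_std = np.std(scores, ddof=1)  # divides by N - 1
population_var = np.var(scores)      # divides by N (ddof=0 is the default)

print(score_range)
print(score_iqr)
print(round(float(sample_std), 2))
print(round(float(population_var), 2))
```

The interquartile range printed here is 50, not the 70 found by the median-of-halves method, because iqr() interpolates between data points when locating the 25% and 75% marks.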

Conclusion:

In this article, we’ve covered the main measures of dispersion and seen how they can be calculated using examples. These measures help Data Scientists understand how their data is spread, and they lay the groundwork for handling data and drawing conclusions from it.

If you’re interested in pursuing Data Science as a career, which is very much in demand right now, you can enroll in a course with Skillslash, which offers real work experience with top MNCs at the end of the course. To know more, get in touch with one of our counselors at Data Science Course in Training in Mumbai today.
