Descriptive Statistics: Part I

Vishnu Satheesh
Nov 30, 2020

Getting started in analytics is a great milestone. This article covers the basic terminology and ideas you should know before getting your hands dirty with data. Happy reading!

Photo by Carlos Muza on Unsplash

· Descriptive Statistics: The science of analyzing and summarizing collected raw data to make sense of it.

· Population: The group from which the data is collected.

· Sample: The subset of the population that is available for analysis.
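
To make the population and sample distinction concrete, here is a minimal sketch using Python's standard library. The exam scores are made-up values and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
import statistics

# Hypothetical population: exam scores of 1,000 students (made-up values).
random.seed(42)
population = [random.randint(40, 100) for _ in range(1000)]

# A sample is a subset of the population; here 50 scores drawn without replacement.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population size: {len(population)}, mean: {population_mean:.2f}")
print(f"Sample size: {len(sample)}, mean: {sample_mean:.2f}")
```

The sample mean will usually be close to the population mean, but not identical to it.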

Types of Data

Based on Structure

· Structured Data: The information is available in a matrix form with labelled rows and columns

· Unstructured Data: The information is not originally in matrix form with labelled rows and columns.

E.g.: e-mails, videos

Based on Type of data collected

· Cross-Sectional Data: Collected on many subjects of interest at the same point in time or over the same period

E.g.: Temperature, Rainfall, Humidity across cities of America in November 2020

· Time Series Data: Collected on a single subject at multiple points in time

E.g.: Temperature, Rainfall, Humidity in New York in 2018, 2019, 2020

· Panel Data: A combination of cross-sectional and time series data, i.e. multiple subjects observed at multiple points in time

E.g.: Temperature, Rainfall, Humidity across cities of America in 2018, 2019, 2020
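
The three types above can be sketched with a small Python dictionary. The cities, years and temperature values below are hypothetical and only illustrate the structure: a panel holds many subjects over many periods, and the other two types are slices of it.

```python
# Hypothetical average temperatures (in °C) keyed by (city, year); values are made up.
panel = {
    ("New York", 2018): 12.9, ("New York", 2019): 13.1, ("New York", 2020): 13.6,
    ("Chicago", 2018): 10.2, ("Chicago", 2019): 9.8, ("Chicago", 2020): 10.9,
    ("Houston", 2018): 21.0, ("Houston", 2019): 21.4, ("Houston", 2020): 21.7,
}

# Cross-sectional data: many subjects (cities) at one point in time (2020).
cross_section = {city: temp for (city, year), temp in panel.items() if year == 2020}

# Time series data: one subject (New York) at many points in time.
time_series = {year: temp for (city, year), temp in panel.items() if city == "New York"}

print(cross_section)
print(time_series)
```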

Data Measurement Scales (Levels of Measurement)

· Nominal Scale: Used to describe qualitative or categorical data.

E.g.: (Male, Female), (Single, Married, Divorced)

· Ordinal Scale: Used to describe data that has a natural order, with values ranked by magnitude.

E.g.: (Very Unsatisfied — 1, Unsatisfied — 2, Neutral — 3, Satisfied — 4, Very Satisfied — 5)

· Interval Scale: Used to describe numeric data where differences between values are meaningful but there is no true zero, so ratios are not meaningful.

E.g.: Intelligence Quotient (IQ) Scores

· Ratio Scale: Used when the data has a true zero, so ratios can be calculated and give meaningful insights. The majority of quantitative variables fall under this category.

E.g.: Sales, Salary (Salary of A is 2 times salary of B)
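
As a rough illustration of the four scales, the sketch below shows the kind of operation that is usually considered meaningful on each one. All values are made up, and the pairing of operations to scales follows the common textbook convention rather than anything stated in this article.

```python
import statistics

# Nominal: labels with no order, so counting and the mode are meaningful.
marital_status = ["Single", "Married", "Married", "Divorced", "Married"]
most_common_status = statistics.mode(marital_status)

# Ordinal: ordered categories, so ranking and the median are meaningful,
# but the gap between codes 1 and 2 need not equal the gap between 4 and 5.
satisfaction_codes = {
    "Very Unsatisfied": 1, "Unsatisfied": 2, "Neutral": 3,
    "Satisfied": 4, "Very Satisfied": 5,
}
responses = ["Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Unsatisfied"]
median_code = statistics.median(satisfaction_codes[r] for r in responses)

# Interval: differences are meaningful, ratios are not (no true zero).
iq_a, iq_b = 120, 100
iq_difference = iq_a - iq_b

# Ratio: a true zero exists, so "A earns twice as much as B" is meaningful.
salary_a, salary_b = 60000, 30000
salary_ratio = salary_a / salary_b

print(most_common_status, median_code, iq_difference, salary_ratio)
```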

Descriptive Statistics of a Continuous Variable

Measure of Central Tendency: A measure that helps us comprehend and summarize the data with a single value.

· Mean: Sum of all observations divided by the total number of observations (arithmetic average).

· Median: The middle observation of the sorted data, i.e. the value that divides the distribution into two equal halves.

· Mode: The most common observation.
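
The three measures above can be computed with Python's built-in `statistics` module. The sales figures are hypothetical; the large last value is included on purpose to show how an outlier affects the mean more than the median.

```python
import statistics

# Hypothetical sales figures (made-up values) with one large outlier.
sales = [12, 15, 15, 18, 20, 22, 95]

mean_sales = statistics.mean(sales)      # sum of observations / number of observations
median_sales = statistics.median(sales)  # middle value of the sorted data
mode_sales = statistics.mode(sales)      # most common observation

print(f"Mean: {mean_sales:.2f}, Median: {median_sales}, Mode: {mode_sales}")
```

Here the outlier pulls the mean well above the median, which is one reason the median is often reported alongside the mean.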

Measure of Variation: The measures that help us understand the variability of the data.

· Range: The difference between the maximum and the minimum value of the data.

· Inter-Quartile Distance (IQD): The difference between Quartile 3 and Quartile 1 of the data.

Quartiles divide the sorted data into four equal parts. Quartile 1 (Q1) is the value with 25% of the data below it, Quartile 2 (the median) has 50% below it, and Quartile 3 (Q3) has 75% below it. Similarly, percentiles (Px: the value with x% of the data below it) and deciles (which divide the data into 10 equal parts) are used to identify positions within the dataset.

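· Variance: The average squared deviation of the observations from the mean value.

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Where σ² is the variance of the population, X is an individual observation, µ is the mean of the population and N is the number of observations.

$$s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$$

Where s² is the variance of the sample, X is an individual observation, x̄ is the mean of the sample and n is the number of observations in the sample.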

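· Standard Deviation: The square root of variance.

$$SD = \sqrt{\frac{\sum (X - \mu)^2}{N}}, \qquad s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$$

Where SD is the standard deviation of the population and s is the standard deviation of the sample.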

The sample variance and sample standard deviation formulas divide by n - 1 instead of n. This is known as Bessel’s correction. There are two arguments for it. First, when we work with only a sample from the population, the deviations are measured from the sample mean rather than the true population mean, which causes a downward bias and underestimates the variance. Dividing by n - 1 instead of n corrects this bias. The second argument uses degrees of freedom: if there are n observations and k parameters are estimated from the sample, the degrees of freedom are n - k. Here we estimate one parameter, the mean x̄, so the degrees of freedom are n - 1.
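
The measures of variation above, and the effect of Bessel’s correction, can be sketched with Python's `statistics` module. The data values are made up. Quartiles can be computed under several conventions; `method="inclusive"` is just one of them and is an assumption here, not the article's stated method. The last part treats `data` as a whole population and enumerates every possible sample of size 2 drawn with replacement, to check which divisor recovers the population variance on average.

```python
import itertools
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical observations

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Quartiles and inter-quartile distance (Q3 - Q1).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqd = q3 - q1

# Population formulas divide by N; sample formulas divide by n - 1.
pop_var = statistics.pvariance(data)
samp_var = statistics.variance(data)
pop_sd = statistics.pstdev(data)
samp_sd = statistics.stdev(data)

print(f"Range: {data_range}, Q1: {q1}, Q3: {q3}, IQD: {iqd}")
print(f"Population variance: {pop_var:.2f}, sample variance: {samp_var:.2f}")
print(f"Population SD: {pop_sd:.2f}, sample SD: {samp_sd:.2f}")

# Bessel's correction check: treat `data` as the population and list
# every ordered sample of size 2 drawn with replacement.
samples = list(itertools.product(data, repeat=2))
avg_divide_by_n = statistics.mean([statistics.pvariance(s) for s in samples])
avg_divide_by_n_minus_1 = statistics.mean([statistics.variance(s) for s in samples])

print(f"Average over samples, dividing by n:     {avg_divide_by_n:.2f}")
print(f"Average over samples, dividing by n - 1: {avg_divide_by_n_minus_1:.2f}")
```

Dividing by n gives an average that falls short of the population variance, while dividing by n - 1 matches it, which is the downward bias described above.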

Measure of Shape: The measures that describe the shape of the distribution of the data.

· Skewness: The measure of asymmetry (lack of symmetry) of the data.

If the value of skewness (g) is close to 0, the data is approximately symmetrical. A positive value indicates positive skewness (a longer right tail) and a negative value indicates negative skewness (a longer left tail).

Fig 1: The three cases of skewness

· Kurtosis: A measure of shape focused on the tails of the distribution (heavy or light).

If the value is less than 3, the distribution is platykurtic (lighter tails than the normal distribution); if it is greater than 3, it is leptokurtic (heavier tails). A distribution with a kurtosis of 3, such as the normal distribution, is called mesokurtic.

Fig 2: Types of kurtosis
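
The skewness and kurtosis described above can be sketched in plain Python. The article's exact formulas are not shown, so this assumes the common moment-based (population) definitions: skewness as the third central moment divided by the 1.5 power of the second, and kurtosis as the fourth central moment divided by the square of the second. The helper names and data values are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis: m4 / m2 ** 2 (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]        # made-up, evenly spread values
right_skewed = [1, 1, 2, 2, 3, 10]  # made-up values with a long right tail

print(f"Symmetric data:    skewness {skewness(symmetric):.2f}, kurtosis {kurtosis(symmetric):.2f}")
print(f"Right-skewed data: skewness {skewness(right_skewed):.2f}, kurtosis {kurtosis(right_skewed):.2f}")
```

Note that some libraries report excess kurtosis (kurtosis minus 3) by default, in which case the comparison value is 0 rather than 3.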

Vishnu Satheesh

Big fan of data, cloud and AI. 3+ years of experience in data science. Completed a Master's in Business Analytics at National University of Ireland, Galway.