DATA STATISTICS - Hadoop Canada

In this article we explain the statistical methods used in data science. These methods can be implemented in any programming language, and understanding them is a prerequisite for machine learning and data analytics.

1- Mean, Mode, Median, Range and Standard Deviation

Mean: the sum of the values in the list divided by the number of items in the list.
Example: (A)- 20,24,25,36,25,22,23
(B)- Sum the values: 20+24+25+36+25+22+23=175
(C)- Divide by the number of items (7): 175/7=25

Mode: the most common value or values in the data set.
Example:
(A) 20, 22, 23, 24, 25, 25, 36
(B) Mode is 25

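Median: order the items in the list and take the middle value. If the list has an even number of items, take the mean of the two middle values.
Example: (A): 20,22,23,24,25,25,36
(B): The list has 7 items, so the middle value is the 4th one: 24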

Range: the difference between the highest and lowest values in the data set.
Example:
(A): 20, 24, 25, 36, 25, 22, 23
(B): find the lowest:20 and highest:36
(C): Range: 36-20= 16
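
The four measures above can be checked with a few lines of Python. This is only a minimal sketch that uses the standard `statistics` module on the same example list; the variable names are chosen for illustration.

```python
import statistics

data = [20, 24, 25, 36, 25, 22, 23]

mean = statistics.mean(data)         # 175 / 7 = 25
mode = statistics.mode(data)         # 25 appears twice, every other value once
median = statistics.median(data)     # middle of the sorted list: 24
value_range = max(data) - min(data)  # 36 - 20 = 16

print("mean:", mean)
print("mode:", mode)
print("median:", median)
print("range:", value_range)
```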

Standard Deviation: a measure of how spread out the numbers are around the mean.

Example:
(A): 20,24,25,36,25,22,23
(B): Calculate Mean: 20+24+25+36+25+22+23=175, 175/7=25
(C): Squaring the Difference
20-25=-5 and -5*-5=25; 24-25=-1 and -1*-1=1; 25-25=0 and 0*0=0;
36-25=11 and 11*11=121; 25-25=0 and 0*0=0; 22-25=-3 and -3*-3=9; and 23-25=-2 and -2*-2=4
(D): Add the squared differences: 25+1+0+121+0+9+4=160
(E): Divide by N-1 to get the sample variance: 160/(7-1) = 160/6 = 26.667
(F): Take the square root: √26.667 = 5.164
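
Steps (B) to (F) can be sketched in Python as follows. The manual calculation is compared with `statistics.stdev`, which also divides by N-1; the variable names are illustrative.

```python
import math
import statistics

data = [20, 24, 25, 36, 25, 22, 23]

mean = sum(data) / len(data)                      # (B) 175 / 7 = 25
squared_diffs = [(x - mean) ** 2 for x in data]   # (C) squared differences
sum_sq = sum(squared_diffs)                       # (D) 160
sample_variance = sum_sq / (len(data) - 1)        # (E) 160 / 6
sample_std = math.sqrt(sample_variance)           # (F) about 5.164

print(round(sample_std, 3))
print(round(statistics.stdev(data), 3))  # library result for comparison
```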

2- Normal distribution

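The normal distribution, also known as the Gaussian distribution, is a probability distribution whose graph is a symmetric, bell-shaped curve. The center of the curve is at the mean. The standard deviation sets the spread: distance to the left or right of the mean is measured in standard deviations, and about 68% of the values lie within one standard deviation of the mean.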
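
As a rough illustration, the sketch below builds a normal distribution with `statistics.NormalDist` (Python 3.8+). The mean 25 and standard deviation 5.164 are borrowed from the example in section 1 purely as sample parameters; this is an assumption for the sketch, not a claim that those seven values are normally distributed.

```python
from statistics import NormalDist

mu, sigma = 25, 5.164  # assumed parameters, taken from the earlier example
dist = NormalDist(mu=mu, sigma=sigma)

# Probability of falling within 1 and 2 standard deviations of the mean
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)
within_two_sd = dist.cdf(mu + 2 * sigma) - dist.cdf(mu - 2 * sigma)

print(round(within_one_sd, 4))  # roughly 0.68
print(round(within_two_sd, 4))  # roughly 0.95
```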

3- Binomial distribution

A binomial distribution gives the probability of getting a given number of successes in a fixed number of trials, where each trial ends in either SUCCESS or FAILURE. Each trial has exactly two possible outcomes (the prefix “bi” means two, or twice).

In the binomial calculation, we use the following items:
A. Binomial coefficient: the number of ways to choose k items from n without regard to order, written C(n,k).
B. Binomial variable: a random variable that counts the number of successful trials. It is made up of independent trials, each with the same probability of success.
Example of a binomial variable: the number of heads when flipping a coin. Each trial is head or tail, the same as success or failure. The number of trials must be fixed.
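
The two items can be sketched in Python. `math.comb` (Python 3.8+) returns the binomial coefficient, and a short simulation counts heads in a fixed number of flips. The seed value and the names are arbitrary choices for this illustration.

```python
import math
import random

n = 5  # fixed number of trials

# A. Binomial coefficients C(5, k) for k = 0..5
coefficients = [math.comb(n, k) for k in range(n + 1)]
print(coefficients)  # [1, 5, 10, 10, 5, 1]

# B. One observation of a binomial variable: heads in n independent flips
rng = random.Random(42)  # arbitrary seed so the run is repeatable
flips = [rng.choice("HT") for _ in range(n)]
heads = flips.count("H")
print(flips, "heads:", heads)
```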

Example of binomial distribution:
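Suppose we want the binomial distribution of the number of heads in 5 flips of a fair coin.

X = number of heads in 5 coin flips
A. Possible outcomes from 5 flips: 2*2*2*2*2=32. Because the coin is fair, each of the 32 outcomes is equally likely, so the probability of k heads is (5 Choose k)/32.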
B. First: P(X=0) = (5 Choose 0)/32
(5 Choose 0) = 5!/(0! * (5-0)!) = (5*4*3*2*1)/(1 * (5*4*3*2*1)) = 1
1/32 = 0.03125
Second: P(X=1) = (5 Choose 1)/32
(5 Choose 1) = 5!/(1! * (5-1)!) = (5*4*3*2*1)/(1 * (4*3*2*1)) = 5
5/32 = 0.156
Third: P(X=2) = (5 Choose 2)/32
(5 Choose 2) = 5!/(2! * (5-2)!) = (5*4*3*2*1)/((2*1) * (3*2*1)) = 120/(2*6) = 10
10/32 = 0.3125
Fourth: P(X=3) = (5 Choose 3)/32
(5 Choose 3) = 5!/(3! * (5-3)!) = (5*4*3*2*1)/((3*2*1) * (2*1)) = 120/12 = 10
10/32 = 0.3125
Fifth: P(X=4) = (5 Choose 4)/32
(5 Choose 4) = 5!/(4! * (5-4)!) = (5*4*3*2*1)/((4*3*2*1) * 1) = 5
5/32 = 0.156
Sixth: P(X=5) = (5 Choose 5)/32
(5 Choose 5) = 5!/(5! * (5-5)!) = (5*4*3*2*1)/((5*4*3*2*1) * 1) = 1
1/32 = 0.03125
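
The six probabilities can be reproduced with a short Python sketch. It uses the general binomial formula C(n,k) * p^k * (1-p)^(n-k), which for a fair coin (p = 0.5) reduces to C(5,k)/32; the names are illustrative.

```python
from math import comb

n = 5    # number of flips
p = 0.5  # probability of heads for a fair coin

# Probability of exactly k heads, for k = 0..5
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

for k, prob in pmf.items():
    print(f"P(X={k}) = {comb(n, k)}/32 = {prob}")

total = sum(pmf.values())  # all six probabilities add up to 1
```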

Binomial distribution for the example

4- Conditional probability

Conditional probability is the probability that one event occurs given that another event has already occurred.
The formula is:
P(A|B) = P(A and B) / P(B)
We can also write it as:
P(A|B) = P(A∩B) / P(B)
See the image of P(A∩B)

Example about the Dice:
Let A be the event that the outcome is an odd number
A={1,3,5}.

Let B be the event that the outcome is less than or equal to 3
B={1,2,3}.

What is P(A|B)?
Note: All die rolls are equally likely.

A= {1,3,5}
S= {1,2,3,4,5,6} (6 possible numbers)
P(A) = |A|/|S| = 3/6
P(A) = 1/2 (half of the possible outcomes of a throw are odd: {1,3,5})

B={1,2,3}
P(B) = |B|/|S| = 3/6 = 1/2
A∩B={1,3,5} ∩ {1,2,3} = {1,3}
P(A∩B) = 2/6 = 1/3

P(A|B) = P(A∩B)/P(B) = (1/3)/(1/2) = 2/3
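
The dice example can be sketched with Python sets and exact fractions. The helper `prob` is a hypothetical name introduced for this illustration; it assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}  # outcome is odd
B = {1, 2, 3}  # outcome is less than or equal to 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A|B) = P(A∩B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3
```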

5- Bayes theorem

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.

Formula:
P(A|B)= P(B|A) P(A)/P(B)

P(A|B) – the probability of event A occurring, given event B has occurred
P(B|A) – the probability of event B occurring, given event A has occurred
P(A) – the probability of event A
P(B) – the probability of event B
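
As a small check of the formula, the sketch below reuses the dice events from section 4 (A: the outcome is odd, B: the outcome is less than or equal to 3). P(B|A) = 2/3 because two of the three odd outcomes {1,3,5} are at most 3. The variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(3, 6)          # P(A): outcome is odd
p_b = Fraction(3, 6)          # P(B): outcome is <= 3
p_b_given_a = Fraction(2, 3)  # P(B|A): 1 and 3 out of {1, 3, 5}

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 2/3
```

The result matches the value obtained directly from P(A∩B)/P(B) in the dice example.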

6- Descriptive statistics

Descriptive statistics are used to describe the basic features of the data in a study.
The two main types are:
Measures of Central Tendency (Mean, Median, and Mode).
Measures of Dispersion or Variation (Variance, Standard Deviation, Range).

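7- Law of Large Numbers

Law of Large Numbers: as the number of trials grows, the average of the observed results gets closer to the expected value (the mean of the whole population).
X̄ = (x1 + x2 + ... + xN)/N approaches the expected value as N goes to infinity.
Example:
If we roll the dice only three times and get 1, 5, 6, the average is (1+5+6)/3 = 12/3 = 4. The expected value of a fair die is (1+2+3+4+5+6)/6 = 3.5, and with more and more rolls the average moves toward 3.5.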
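
A quick simulation can illustrate this behavior. The sketch below rolls a simulated fair die with Python's `random` module and prints the average for growing numbers of rolls; the seed and the roll counts are arbitrary choices, and the exact averages depend on them.

```python
import random

rng = random.Random(0)  # arbitrary seed so the run is repeatable

def average_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

# The average tends to move toward 3.5 as the number of rolls grows
for n in (3, 100, 10_000, 100_000):
    print(n, average_of_rolls(n))
```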

8- Central Limit Theorem

Central Limit Theorem: take a sample from the population and find the mean of that sample. We usually fix the number of data points in each sample. This is useful when the data set is huge and has to be simplified by sampling. If we repeat the sampling many times and plot the sample means, the graph moves toward a normal distribution as the sample size grows, whatever the shape of the original population (provided it has a finite variance). The mean of the sample means approximates the population mean, and their standard deviation approximates the population standard deviation divided by the square root of the sample size.
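
The idea can be sketched with a simulation. Here the population is a fair die, which is not normally distributed; the sample size, the number of samples, and the seed are assumptions chosen for the illustration.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed so the run is repeatable
sample_size = 30        # data points per sample (assumed)
num_samples = 5000      # how many samples we draw (assumed)

# Mean of each sample of die rolls
sample_means = [
    statistics.fmean([rng.randint(1, 6) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # close to 3.5
std_of_means = statistics.stdev(sample_means)   # spread of the sample means

# Population standard deviation of a fair die, divided by sqrt(sample size)
population_std = statistics.pstdev(range(1, 7))
expected_std = population_std / sample_size ** 0.5

print(mean_of_means, std_of_means, expected_std)
```

A histogram of `sample_means` would look roughly bell-shaped even though a single die roll is uniformly distributed.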

9- Linear regression

Linear regression: a model that shows the relationship between a dependent variable and an independent variable. The independent variable is the explanatory variable, and the other is the dependent variable. The model also has a slope, which can be positive or negative.

Formula: y = β0 +β1x+ε
β0 is the y-intercept of the regression line (the expected value of y when x = 0).
β1 is the slope.
ε: the error term, used to account for the variability in y that x does not explain.
E(y) = β0 +β1x is the mean or expected value of y for a given value of x.

The slope in the linear regression equation can have two signs.
Positive: y = β0 +β1x+ε with β1 > 0, so y increases as x increases
Negative: y = β0 -β1x+ε with β1 > 0, so y decreases as x increases
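
To make the formula concrete, here is a minimal sketch that estimates β0 and β1 with ordinary least squares, a common fitting method that the article itself does not name. The data points are hypothetical and exist only for this illustration.

```python
# Hypothetical data: x is the explanatory variable, y the dependent variable
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Least-squares estimates of the slope (beta1) and intercept (beta0)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
beta1 = s_xy / s_xx
beta0 = y_mean - beta1 * x_mean

# Residuals play the role of the error term for each observed point
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

print("slope:", round(beta1, 4))      # positive, so y rises with x
print("intercept:", round(beta0, 4))
```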