In this post we show how methods of statistics and probability can be used to
understand the chances of actually winning the lottery. The various methods will be worked out in python.
Analytics
This is really 'descriptive' analytics; we are just telling the story of what's there. Afterwards we can get into predictive and prescriptive analytics, but before that the foundation needs some building.
First, let's get an understanding of the odds of winning the lottery, and how we find out what those odds are.
There are actually two interpretations of probability; two different approaches to working out how likely an event is to happen. One is called "classical" and has evolved a bit to be termed "frequentist". The other is usually called "subjective" or "Bayesian". Let's see how these approaches compare on our lotto chances problem.
Counting
First, the classical approach. We have a set of 49 numbers and we want to find out what the chances are of drawing a particular set of 6 unique numbers. We can count this as a combination: it's 49 'choose' 6. Combinations are expressed as:
$$ \dfrac{n!}{(n-r)!\,r!} $$
We use combinations instead of permutations because the order of the result doesn't matter. Whether we draw 1, 20, 33, 41, 5, 6 or 1, 33, 41, 20, 6, 5 makes no difference; we just need the same 6 unique numbers.
A solution in python could be something like:
import math

n = 49
r = 6
denominator = math.factorial(n - r) * math.factorial(r)
num_combos = math.factorial(n) // denominator  # integer division keeps the count exact
print(num_combos)  # 13983816
That's about 14 million possible combinations, so you have a 1 in 14 million chance of getting those numbers.
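As a quick cross-check (my addition here, assuming Python 3.8+), the standard library can count the combinations directly, and dividing the permutation count by the 6! possible orderings gives the same number:

import math

tickets = math.comb(49, 6)                    # count the combinations directly
print(tickets)                                # 13983816
print(math.perm(49, 6) // math.factorial(6))  # same count: permutations divided by 6! orderings
print(1 / tickets)                            # ~7.15e-08, i.e. about 1 in 14 million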
Bayesian
Now let's use a Bayesian probability approach to find out how good our chances are. In Bayesian probability you work from what is currently known, and that prior knowledge is taken into account to determine the probability of the next event. Remember that the entire set of data doesn't need to be known; you are just venturing forward with what you have for each new case.

Do the numbers drawn previously matter? They do: we are looking for unique numbers, so we can't draw the same number twice. These are dependent events. Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.
When two events, A and B, are dependent, the probability of both occurring is expressed by:
$$ P(A \text{ and } B) = P(A) \cdot P(B \mid A) $$
It's the probability of the first event, combined with the second, and that result combined with the third, and so on. The order of the result still doesn't matter.
In python it could be expressed like so:
select = 49
choose = 6
total_prob = choose / float(select)
for x in range(5):
    select = select - 1
    choose = choose - 1
    prob = choose / float(select)
    total_prob = total_prob * prob
odds = 1 / float(total_prob)
print(odds)  # 13983816.0
Just under 14 million again, but when you look closer you see that the result is exactly the same (13983816).
It's the same result, just a different way of going about it.
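One way to convince yourself that the two answers agree exactly, and not just to floating point rounding, is to redo the chained product with exact fractions. This is just a small sketch I'm adding here, using Python's standard fractions module:

from fractions import Fraction

select = 49
choose = 6
total_prob = Fraction(choose, select)
for x in range(5):
    select = select - 1
    choose = choose - 1
    total_prob = total_prob * Fraction(choose, select)
print(total_prob)      # 1/13983816
print(1 / total_prob)  # 13983816, exactly the combination count from before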
What is interesting about the Bayesian solution is that we loop until we get to the end, updating the probability one draw at a time, without ever computing over the whole set at once. That way of interpreting probability lends itself nicely to computational models, and is why Bayesian methods are a big part of the emerging field of Data Science. This isn't so obvious in the descriptive part of analytics, but Bayesian methods have really shined when the data isn't complete, or when the data set is extremely large.
Using frequentist methods with large data sets, you need to take samples and understand what constitutes a sample that is large enough to be significant.
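As a rough sketch of that sampling idea (my own illustration, not part of the lotto calculation above), here is a Monte Carlo estimate of a much more common event: the chance that a ticket shares at least one number with the draw. The estimate settles down as the sample grows; the jackpot itself, at about 1 in 14 million, is far too rare to estimate this way with any reasonable number of samples.

import random

ticket = set(random.sample(range(1, 50), 6))  # our fixed 6-of-49 ticket
hits = 0
for trial in range(1, 100001):
    draw = set(random.sample(range(1, 50), 6))  # one simulated draw
    if ticket & draw:                           # at least one number in common
        hits = hits + 1
    if trial % 20000 == 0:
        # frequentist estimate after `trial` simulated draws
        print(trial, hits / float(trial))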
Is one better than the other? I don't think it's a better or worse, but just different. They both have advantages considering the problem and conditions around it, but it's interesting to see how different methods can have the same results.
To get really into it, these are very good reads:
- http://xcelab.net/rm/statistical-rethinking/
- http://www.fharrell.com/2017/02/my-journey-from-frequentist-to-bayesian.html
Save the theory for another day; we have to get back to our Lottery problem!
We now know that the chances are long, no matter how they are measured. How about the actual drawing of the numbers? In the next post, we are going to get some historical data on the numbers that have been drawn, and see if we can find some patterns to better our odds.