2024-12-19: Some of the code in this article has been updated. Check the Git repository for the latest version.
The problem and the theory
So I’ve been interested in a little game theory problem lately. With the holidays in full swing, the NC Education Lottery has been hitting the advertising trail pretty hard. It got me thinking though. Which lottery scratch offs are the best overall value? Which ones are the worst? Is it possible to win the lottery, on average, if you buy enough tickets? Spoiler alert, the answer to that last one is no. But the results were pretty fascinating anyway, and not all scratch offs are created the same.
So how do we figure out which ones are the best, and which ones are the worst? Well, in game theory there’s a surprisingly simple formula for solving this kind of problem called the “expected value”, often denoted E which represents, on average, how valuable something is based on the probability (p) of a particular outcome (x). In our case, the probability is given by the “odds in 1” and the outcome is the amount of money each prize is worth.
\[ E[x]=x_1p_1 + x_2p_2 + … + x_np_n \]
Say you had a lottery where the top prize, or outcome, is $100 (\(x_1\)). There are 100 tickets in circulation, but only one of them has the $100 prize. The rest have no prize. This gives a probability (\(p_1\)) of \(\frac {1}{100}\). The expected value in this case is the probability of winning a ticket multiplied by the value of that outcome: \(E[x]=p_1x_1\). So in this very simple case, each ticket’s expected value is $1. If you can buy a ticket for say, $2, you can expect to lose, on average $0.50 per dollar spent given by: \(\frac{cost}{value}\). You can see how this works if you just imagine buying every single lottery ticket. Your total cost would be $200, and you would win $100 back. Obviously this isn’t a good deal.
The question I wanted to answer was if that ratio ever gets above 1 for any of our scratch off tickets as prizes are redeemed. The original published odds will always have an expected value ratio less than 1 because otherwise it wouldn’t be profitable. But if we could determine new probabilities based on estimates of remaining tickets and prizes, the expected value could change. As it turns out, the NC Education lottery shares not just the prizes, but also the total number of prizes as well as the number remaining and the odds on their website. Using this information, we could extrapolate the actual expected value of a ticket based on the value of all the prizes multiplied by how many prizes remain divided by our estimate of how many total tickets remain in the wild. We’ll get to that last bit, it’s by far the biggest unknown in our equation, but we can take a guess using the information we know.
Scraping HTML is painful
The biggest, and most annoying part of this problem is digesting HTML. Not only is it messy, but it’s inherently fragile. If the developers of the Lottery site decide to use different class names, this all falls apart and we have to refactor the code. Fortunately, HTML also has the notion of tables, which means our friends over at BeautifulSoup and Pandas can at least take a stab at extracting the data cleanly. As with any web scraping task, I try not to abuse the webserver, so the first thing I did was grab the content and store it locally:
wget https://nclottery.com/scratch-off-prizes-remaining -O lottery-main.html
Now that we have a local copy of the HTML content, we can process it freely without worrying about latency or the admins blocking our connections. Pandas has some great tools for reading tables from HTML code, and on my first attempt, it did a pretty good job. What it didn’t do well though was extract elements from the page that weren’t part of a table. In our source data, the value of a ticket isn’t captured in the actual table, but instead it’s captured in one of the HTML classes, as well as the game name and number.
<div class="box cloudfx databox price_5">
<table class="datatable">
<thead>
<tr>
<th colspan="4" class="ticketdetails">
<span class="gamethumb">
<a href="/Content/Images/Instant/nc916_sqr.png" class="preview">
<img src="/Content/Images/Instant/nc916_sqr.png" alt="+" class="autoresize" />
</a>
</span>
<span class="gamename">
<a href="/scratch-off/916/money-bag">Money Bag</a>
</span>
<span class="gamestats">
<span class="gamenumber"><b>Game Number:</b>916</span>
</span>
<span class="gameflags"></span>
</th>
</tr>
This code allows JavaScript on the page to filter by price. But for our purposes, it allows us to determine the ratio of expected value to actual value. This is important because the highest expected value doesn’t necessarily mean it’s the best overall ticket to buy. By looking at the ratio, you can determine which one is best per dollar spent, which is the gold standard for the value.
To get this data, we can use BeautifulSoup to pre-process the HTML before using Pandas to extract the tables themselves. We can also use this to extract the name of the game itself as well as its number since the name is not necessarily unique, which Pandas wasn’t great at doing (or at least, I wasn’t able to figure out what incantations to cast for it to properly extract the data I wanted).
with open("lottery-main.html", "r") as f:
html = f.read()
soup = BeautifulSoup(html, "html.parser")
for game in soup.find_all(class_=re.compile("box cloudfx databox price_.+")):
# Extract ticket value and game name
val = int(game['class'][3].split('_')[1])
name = game.find(class_="gamename").text
number = game.find(class_="gamenumber").text.split(':')[1].lstrip()
Now that we have each game table extracted in the game
variable, we can read it with a Pandas Data Frame. We have a function that strips the $ and , characters so we can interpret the currency fields as floating point numbers instead of strings. The tables also have a row at the bottom that gets interpreted incorrectly, so we can drop that with the dropna()
function
df = pd.read_html(
StringIO(str(game)),
header=1,
converters={'Value': convert_currency}
)[0].dropna()
From this Data Frame, we end up with a table that looks like this..
**Game Name: 200X The Cash**
=== ========== =========== ========== =========
N Value Odds 1Ā in Total Remaining
=== ========== =========== ========== =========
0 5000000.0 2917430.00 6.0 1.0
1 100000.0 972476.67 18.0 1.0
2 50000.0 729357.50 24.0 0.0
3 10000.0 583486.00 30.0 2.0
4 5000.0 112208.85 156.0 2.0
5 1000.0 6002.94 2916.0 101.0
6 500.0 399.52 43814.0 989.0
7 400.0 253.63 69017.0 1588.0
8 200.0 54.52 321044.0 7600.0
9 100.0 25.00 700211.0 17134.0
10 60.0 37.52 466541.0 13772.0
11 50.0 13.63 1284102.0 35575.0
12 40.0 12.50 1400364.0 44344.0
13 30.0 10.00 1750264.0 77117.0
=== ========== =========== ========== =========
This data gives us a good deal of insight into the game. For example, they don’t publish the total number of tickets printed, but if you multiply the odds by the total, you’ll find that you always end up with the same number. In this case 17,504,580 total tickets have been printed. This is crucial information because we can use the following to estimate how many tickets remain in the wild. We could calculate a margin of error here, but in this specific case, because of the large sample size, I’m just going to ignore it.
\[ \frac {remaining_{prize}}{total_{prize}} \cdot total_{tickets} \]
Crunching the numbers
Now that we have our data in a reasonable format, we can extract the expected values. We can do this with the original data first as this will give us which game has the highest inherent value. Based entirely on the original published odds, we can calculate this directly from our table by simply dividing the first column, the prize, by the second column, the odds of this outcome.
def calculate_expected_value(df):
"""Calculates the expected value from a DataFrame of lottery data."""
ev = 0
for row in df.values:
ev += row[0] / row[1]
return ev
This raw expected value only applies when the game is initially printed. Much more interesting would be to see how these odds, and thus the expected value change over time. We can do this by implementing our estimated ticket totals and computing the adjusted odds based on the remaining prizes.
def calculate_adjusted_expected_value(df, total_tickets):
"""Calculates the adjusted expected value based on remaining prizes."""
# Initialize the adjusted expected value (aev) to 0
aev = 0
original_count = 0
current_count = 0
# original_count = how many prizes we started with, current_count = how many prizes remain
for row in df.values:
original_count += row[2]
current_count += row[3]
# This extrapolated total is calculated from the ratio of prizes collected vs. remaining
extrapolated_total = total_tickets * (current_count / original_count)
# Now that we know the denominator of our adjusted odds formula, we can estimate the current odds
for row in df.values:
aev += row[0] * (row[3] / extrapolated_total)
return aev
From here, it’s pretty trivial to calculate the different ratios for original and adjusted expected value. Once we’ve gathered all of this data, we can stuff it into a simple array and sort it by the various metrics we care about. The complete code can be found in my GitHub repo, but I’m sure by now you’re ready for the results, so without further ado…
The results
So now that we have the data, let’s look at the results. In the table below, the columns are labeled as follows:
- Actual - Cost in dollars
- Expected - Expected value, raw, in dollars
- Ratio - The dollar for dollar expected value
- Adj. Expected - Adjusted expected value, in dollars
- Adj. Ratio - Adjusted dollar for dollare expected value
- % Chg. - Percentage of the original expected value with the adjusted odds
- Printed - How many ticets were printed originally
- % Left - Percentage of the original print estimated to be left in the wild
- Name - Game name (and number). Note that the name can be re-used. The number is unique.
N | Actual | Expected | Ratio | Adj. Expected | Adj. Ratio | % Chg. | Printed | % Left | Name |
---|---|---|---|---|---|---|---|---|---|
69 | $30 | $24.08 | 0.803 | $28.41 | 0.947 | 118.0% | 17,504,580 | 3% | 200X The Cash (826) |
47 | $50 | $41.72 | 0.834 | $44.77 | 0.895 | 107.3% | 16,277,910 | 29% | $10 Million Spectacular (913) |
36 | $10 | $7.57 | 0.757 | $8.53 | 0.853 | 112.7% | 4,899,780 | 20% | $1,000,000 Loteria (924) |
43 | $20 | $15.50 | 0.775 | $16.80 | 0.840 | 108.4% | 6,142,230 | 36% | $2,000,000 Diamond Deluxe (917) |
57 | $30 | $24.22 | 0.807 | $24.96 | 0.832 | 103.1% | 11,910,930 | 24% | Black Titanium (892) |
62 | $20 | $15.45 | 0.773 | $16.61 | 0.831 | 107.5% | 7,177,860 | 23% | $2,000,000 Riches (884) |
63 | $30 | $23.97 | 0.799 | $24.87 | 0.829 | 103.8% | 16,674,390 | 36% | $5,000,000 Ultimate (876) |
0 | $50 | $41.02 | 0.820 | $41.17 | 0.823 | 100.4% | 15,211,320 | 96% | $8 Million Money Maker (972) |
55 | $20 | $15.48 | 0.774 | $16.40 | 0.820 | 106.0% | 6,392,820 | 28% | Big Cash Payout (896) |
9 | $30 | $24.09 | 0.803 | $24.38 | 0.813 | 101.2% | 11,117,700 | 86% | Millionaire Bucks (952) |
35 | $30 | $24.04 | 0.801 | $24.37 | 0.812 | 101.4% | 14,818,230 | 66% | 200X The Cash (925) |
53 | $10 | $7.43 | 0.743 | $8.06 | 0.806 | 108.4% | 8,398,980 | 18% | Red Hot Millions (900) |
59 | $10 | $7.51 | 0.751 | $8.03 | 0.803 | 106.9% | 10,375,800 | 15% | Jumbo Bucks (888) |
71 | $30 | $24.52 | 0.817 | $23.97 | 0.799 | 97.7% | 21,581,880 | 10% | Millionaire Maker (774) |
20 | $30 | $24.11 | 0.804 | $23.77 | 0.792 | 98.6% | 9,049,980 | 62% | Supreme 7s (940) |
28 | $20 | $15.45 | 0.772 | $15.79 | 0.789 | 102.2% | 6,918,420 | 57% | Instant Millions (932) |
50 | $20 | $15.41 | 0.770 | $15.72 | 0.786 | 102.0% | 7,411,590 | 58% | Power 20s (904) |
68 | $10 | $7.24 | 0.724 | $7.80 | 0.780 | 107.7% | 12,231,000 | 13% | Scorching Hot 7s (830) |
65 | $20 | $15.40 | 0.770 | $15.56 | 0.778 | 101.0% | 14,828,928 | 20% | 100X The Cash (868) |
17 | $10 | $7.58 | 0.758 | $7.77 | 0.777 | 102.6% | 4,840,740 | 56% | $1,000,000 Cashword (943) |
10 | $20 | $15.44 | 0.772 | $15.37 | 0.769 | 99.6% | 5,187,780 | 70% | Win Big (951) |
13 | $10 | $7.53 | 0.753 | $7.61 | 0.761 | 100.9% | 11,186,190 | 66% | JUMBO BUCKS (948) |
16 | $20 | $14.98 | 0.749 | $14.98 | 0.749 | 100.0% | 5,560,110 | 67% | $50,000 Payout (944) |
3 | $20 | $14.99 | 0.750 | $14.84 | 0.742 | 99.0% | 1,851,120 | 47% | Merry Multiplier (958) |
64 | $10 | $7.49 | 0.749 | $7.38 | 0.738 | 98.6% | 8,151,570 | 44% | VIP Platinum (875) |
58 | $10 | $7.54 | 0.754 | $7.38 | 0.738 | 97.9% | 8,092,680 | 19% | Multiplier Mania (891) |
48 | $5 | $3.45 | 0.689 | $3.67 | 0.734 | 106.6% | 7,347,960 | 12% | Emerald 8s (912) |
31 | $10 | $7.54 | 0.754 | $7.32 | 0.732 | 97.2% | 8,063,250 | 42% | Ultimate Dash (929) |
38 | $10 | $7.45 | 0.745 | $7.28 | 0.728 | 97.8% | 11,609,760 | 21% | 50X The Cash (922) |
4 | $10 | $7.26 | 0.726 | $7.26 | 0.726 | 100.0% | 3,683,890 | 35% | Holiday Cash Blowout (957) |
21 | $10 | $7.25 | 0.725 | $7.25 | 0.725 | 100.0% | 11,494,741 | 65% | $80 Million Cash Blowout (939) |
56 | $10 | $7.25 | 0.725 | $7.21 | 0.721 | 99.4% | 11,494,872 | 6% | $80,000,000 Cash Blowout (895) |
24 | $10 | $7.25 | 0.725 | $7.16 | 0.716 | 98.7% | 6,056,798 | 28% | $2,000 Loaded (936) |
70 | $25 | $18.90 | 0.756 | $17.72 | 0.709 | 93.8% | 11,936,490 | 5% | Spectacular Riches (805) |
32 | $5 | $3.45 | 0.690 | $3.54 | 0.708 | 102.7% | 10,294,080 | 46% | $400,000 Jackpot (928) |
66 | $10 | $7.24 | 0.724 | $7.04 | 0.704 | 97.2% | 9,636,435 | 18% | Triple 777 (864) |
18 | $5 | $3.45 | 0.689 | $3.51 | 0.701 | 101.8% | 6,157,620 | 44% | GAME OF THRONES⢠(942) |
72 | $25 | $18.82 | 0.753 | $17.41 | 0.697 | 92.5% | 18,158,850 | 3% | Extreme Cash (718) |
67 | $20 | $15.61 | 0.780 | $13.93 | 0.697 | 89.3% | 5,292,420 | 9% | Platinum (846) |
60 | $5 | $3.45 | 0.690 | $3.45 | 0.690 | 100.0% | 10,614,000 | 8% | Mega Bucks (887) |
23 | $5 | $3.45 | 0.690 | $3.45 | 0.690 | 100.0% | 7,288,080 | 49% | Super Loteria (937) |
11 | $5 | $3.45 | 0.689 | $3.44 | 0.689 | 99.9% | 6,654,060 | 63% | Ca$h Plu$ (950) |
14 | $5 | $3.45 | 0.689 | $3.42 | 0.685 | 99.3% | 11,111,220 | 66% | MEGA BUCKS (947) |
37 | $5 | $3.45 | 0.690 | $3.43 | 0.685 | 99.3% | 7,319,040 | 34% | Bankroll (923) |
1 | $5 | $3.45 | 0.689 | $3.42 | 0.684 | 99.2% | 6,665,580 | 83% | Diamond 10X (971) |
12 | $5 | $3.45 | 0.690 | $3.41 | 0.682 | 98.9% | 6,907,740 | 77% | Xtreme Cashword (949) |
25 | $5 | $3.45 | 0.690 | $3.40 | 0.681 | 98.7% | 12,215,832 | 48% | $500 Loaded (935) |
5 | $5 | $3.45 | 0.690 | $3.40 | 0.680 | 98.6% | 4,661,700 | 39% | Holiday Cash 50X (956) |
51 | $5 | $3.45 | 0.690 | $3.40 | 0.679 | 98.5% | 8,034,240 | 30% | Power Cashword (903) |
52 | $5 | $3.45 | 0.690 | $3.39 | 0.678 | 98.3% | 7,755,780 | 38% | Power 5s (902) |
29 | $5 | $3.45 | 0.690 | $3.38 | 0.676 | 98.0% | 7,308,300 | 45% | Red Hot Slots (931) |
39 | $5 | $3.45 | 0.689 | $3.37 | 0.673 | 97.7% | 13,359,420 | 21% | 20X The Cash (921) |
2 | $3 | $1.99 | 0.663 | $1.99 | 0.663 | 100.0% | 9,272,900 | 95% | LOTERIAĀ® (959) |
30 | $3 | $1.99 | 0.663 | $1.99 | 0.663 | 100.1% | 5,113,600 | 36% | SKEE-BALLĀ® (930) |
6 | $3 | $1.99 | 0.663 | $1.98 | 0.660 | 99.6% | 3,246,500 | 49% | Holiday Cash 30X (955) |
15 | $2 | $1.33 | 0.663 | $1.31 | 0.656 | 99.0% | 13,336,350 | 71% | Junior BIG OL’ BUCKSĀ® (946) |
49 | $3 | $1.99 | 0.662 | $1.96 | 0.654 | 98.8% | 9,871,600 | 28% | Loteria (911) |
41 | $2 | $1.33 | 0.663 | $1.30 | 0.652 | 98.4% | 12,154,800 | 29% | 10X The Cash (919) |
7 | $2 | $1.33 | 0.663 | $1.30 | 0.648 | 97.7% | 4,514,400 | 53% | Holiday Cash 20X (954) |
26 | $2 | $1.33 | 0.663 | $1.28 | 0.641 | 96.7% | 8,404,497 | 39% | $100 Loaded (934) |
33 | $2 | $1.33 | 0.663 | $1.26 | 0.631 | 95.1% | 7,211,100 | 32% | Payday! (927) |
61 | $2 | $1.33 | 0.664 | $1.22 | 0.609 | 91.7% | 14,442,000 | 18% | Junior Big Ol’ Bucks (885) |
22 | $5 | $3.45 | 0.689 | $3.04 | 0.608 | 88.3% | 6,149,880 | 14% | Triple Red 777’s (938) |
40 | $3 | $1.99 | 0.663 | $1.81 | 0.604 | 91.0% | 7,213,600 | 18% | 15X Cashword (920) |
42 | $1 | $0.61 | 0.608 | $0.60 | 0.604 | 99.3% | 14,334,900 | 25% | 5X The Cash (918) |
19 | $1 | $0.60 | 0.605 | $0.60 | 0.602 | 99.4% | 8,606,100 | 53% | Tic Tac Bonus (941) |
46 | $2 | $1.33 | 0.663 | $1.20 | 0.599 | 90.4% | 6,661,050 | 10% | Triple Winning 7s (914) |
54 | $3 | $1.99 | 0.663 | $1.78 | 0.594 | 89.6% | 4,933,300 | 11% | Lucky 7s Cashword (898) |
8 | $1 | $0.60 | 0.605 | $0.59 | 0.590 | 97.5% | 6,346,800 | 60% | Holiday Cash 10X (953) |
44 | $5 | $3.45 | 0.689 | $2.94 | 0.588 | 85.3% | 4,938,360 | 7% | Money Bag (916) |
45 | $5 | $3.45 | 0.689 | $2.89 | 0.578 | 83.9% | 4,898,640 | 12% | Pot of Gold (915) |
27 | $1 | $0.60 | 0.605 | $0.55 | 0.554 | 91.6% | 8,939,374 | 27% | $50 Loaded (933) |
34 | $1 | $0.61 | 0.608 | $0.54 | 0.536 | 88.1% | 7,322,700 | 19% | Lucky 7s (926) |
Conclusions
In running these calculations, I’ve run across some really interesting features of these games.
First of all (and most obviously I would hope), the expected value will always be less than the ticket initially. However, and this is another fascinating facet of this exercise, the vast majority of the value of these tickets does not come from the top prize. Instead, the lower tier prizes have the highest impact, proportionally, on the value of the ticket itself. Let’s look at our top performer at the moment, 200X The Cash (826).
+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+
Original E[x]
-------------
$5,000,000 (1:2,917,430) -> $1.71
$100,000 (1:972,477) -> $0.10
$50,000 (1:729,358) -> $0.07
$10,000 (1:583,486) -> $0.02
$5,000 (1:112,209) -> $0.04
$1,000 (1:6,003) -> $0.17
$500 (1:400) -> $1.25
$400 (1:254) -> $1.58
$200 (1:55) -> $3.67
$100 (1:25) -> $4.00
$60 (1:38) -> $1.60
$50 (1:14) -> $3.67
$40 (1:12) -> $3.20
$30 (1:10) -> $3.00
You can see that the biggest source of expected value is actually the $100 prize. I also find it interesting that the prizes from $1,000 to $100,000 add the least value to the ticket, leaving a trough of low value prizes near the top of the ticket. This bottom heavy distribution will help to even out the variation in expected value as the game goes on.
I hypothesize that this is done in order to drive sales of lottery tickets. Put simply, you need people to win, otherwise people won’t buy. If you scaled the odds such that each prize contributed exactly the same value to the ticket as the top prize, you would have much longer odds for the low valued prizes. If people don’t win, they won’t play. So it is to the benefit of the game makers to make those low value prizes much more likely. If your neighbor, or someone in your family wins $200 playing 200X The Cash, you’re going to be more likely to buy one.
As the game goes on, however, the weight of each prize category does change, and can have an impact on the expected value of the ticket. Assuming a totally random distribution, it’s conceivable that the game value would increase and eventually exceed 1. The current best valued ticket is close to 1, mainly because the top prizes have begun to exceed their normal likelihood of occurring, and therefore are becoming one of the dominant factors in the expected value formula. The $5M prize started with very low odds of 1:2,917,430 and so the proportional value it lent to the ticket was low, $1.71. But, as you can see in the output below, the odds have greatly increased for the top prize as only about 3% (estimated) of the original games printed are left in the wild. Now the $5M grand prize is the dominant factor in the overall expected value at a whopping $8.81.
+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+
Adjusted E[x]
-------------
$5,000,000 (1:567,219) -> $8.81
$100,000 (1:567,219) -> $0.18
$50,000 (INF) -> $0.00
$10,000 (1:283,610) -> $0.04
$5,000 (1:283,610) -> $0.02
$1,000 (1:5,788) -> $0.17
$500 (1:584) -> $0.86
$400 (1:365) -> $1.10
$200 (1:76) -> $2.63
$100 (1:34) -> $2.97
$60 (1:42) -> $1.44
$50 (1:16) -> $3.09
$40 (1:13) -> $3.08
$30 (1:7) -> $4.04
You can see how, as the game gets closer to being sold out, it’s possible for the high value prizes to start dominating the equation and eventually could go over $1. If this happens, it would theoretically be in your best interest to buy out the rest of the game because you could expect on average to make your money back. Of course in practice, this would be a pretty terrible decision as it would necessitate you winning the only remaining $5M prize to even stand a chance of breaking even. At $30 per ticket, and some half a million tickets remaining, this is obviousy an enormous gamble.
This isn’t exactly ground breaking research here, but it was a fun excercise and bears out the wisdom that it really never will pay off to play the lottery. I hope you enjoyed this diversion as much as I did. If you want to see the full results without running my code, you can download the full text document here.