Applying Game Theory to Scratch Offs

2024-12-19: Some of the code in this article has been updated. Check the Git repository for the latest version.

The problem and the theory

So I’ve been interested in a little game theory problem lately. With the holidays in full swing, the NC Education Lottery has been hitting the advertising trail pretty hard. It got me thinking though. Which lottery scratch offs are the best overall value? Which ones are the worst? Is it possible to win the lottery, on average, if you buy enough tickets? Spoiler alert, the answer to that last one is no. But the results were pretty fascinating anyway, and not all scratch offs are created the same.

So how do we figure out which ones are the best, and which ones are the worst? Well, in game theory there’s a surprisingly simple formula for solving this kind of problem called the “expected value”, often denoted E which represents, on average, how valuable something is based on the probability (p) of a particular outcome (x). In our case, the probability is given by the “odds in 1” and the outcome is the amount of money each prize is worth.

\[ E[x]=x_1p_1 + x_2p_2 + … + x_np_n \]

Say you had a lottery where the top prize, or outcome, is $100 ($x_1$). There are 100 tickets in circulation, but only one of them has the $100 prize. The rest have no prize. This gives a probability ($p_1$) of $\frac {1}{100}$. The expected value in this case is the probability of winning a ticket multiplied by the value of that outcome: $E[x]=p_1x_1$. So in this very simple case, each ticket’s expected value is $1. If you can buy a ticket for say, $2, you can expect to lose, on average $0.50 per dollar spent given by: $\frac{cost}{value}$. You can see how this works if you just imagine buying every single lottery ticket. Your total cost would be $200, and you would win $100 back. Obviously this isn’t a good deal.

The question I wanted to answer was if that ratio ever gets above 1 for any of our scratch off tickets as prizes are redeemed. The original published odds will always have an expected value ratio less than 1 because otherwise it wouldn’t be profitable. But if we could determine new probabilities based on estimates of remaining tickets and prizes, the expected value could change. As it turns out, the NC Education lottery shares not just the prizes, but also the total number of prizes as well as the number remaining and the odds on their website. Using this information, we could extrapolate the actual expected value of a ticket based on the value of all the prizes multiplied by how many prizes remain divided by our estimate of how many total tickets remain in the wild. We’ll get to that last bit, it’s by far the biggest unknown in our equation, but we can take a guess using the information we know.

Scraping HTML is painful

The biggest, and most annoying part of this problem is digesting HTML. Not only is it messy, but it’s inherently fragile. If the developers of the Lottery site decide to use different class names, this all falls apart and we have to refactor the code. Fortunately, HTML also has the notion of tables, which means our friends over at BeautifulSoup and Pandas can at least take a stab at extracting the data cleanly. As with any web scraping task, I try not to abuse the webserver, so the first thing I did was grab the content and store it locally:

wget https://nclottery.com/scratch-off-prizes-remaining -O lottery-main.html

Now that we have a local copy of the HTML content, we can process it freely without worrying about latency or the admins blocking our connections. Pandas has some great tools for reading tables from HTML code, and on my first attempt, it did a pretty good job. What it didn’t do well though was extract elements from the page that weren’t part of a table. In our source data, the value of a ticket isn’t captured in the actual table, but instead it’s captured in one of the HTML classes, as well as the game name and number.

<div class="box cloudfx databox price_5">

    <table class="datatable">

        <thead>
            <tr>
                <th colspan="4" class="ticketdetails">
                    <span class="gamethumb">
                        <a href="/Content/Images/Instant/nc916_sqr.png" class="preview">
                            <img src="/Content/Images/Instant/nc916_sqr.png" alt="+" class="autoresize" />
                        </a>
                    </span>
                    <span class="gamename">
                        <a href="/scratch-off/916/money-bag">Money Bag</a>
                    </span>
                    <span class="gamestats">
                        <span class="gamenumber"><b>Game Number:</b>916</span>
                    </span>
                    <span class="gameflags"></span>
                </th>
            </tr>

This code allows JavaScript on the page to filter by price. But for our purposes, it allows us to determine the ratio of expected value to actual value. This is important because the highest expected value doesn’t necessarily mean it’s the best overall ticket to buy. By looking at the ratio, you can determine which one is best per dollar spent, which is the gold standard for the value.

To get this data, we can use BeautifulSoup to pre-process the HTML before using Pandas to extract the tables themselves. We can also use this to extract the name of the game itself as well as its number since the name is not necessarily unique, which Pandas wasn’t great at doing (or at least, I wasn’t able to figure out what incantations to cast for it to properly extract the data I wanted).

with open("lottery-main.html", "r") as f:
    html = f.read()
    soup = BeautifulSoup(html, "html.parser")

for game in soup.find_all(class_=re.compile("box cloudfx databox price_.+")):
    # Extract ticket value and game name
    val = int(game['class'][3].split('_')[1])
    name = game.find(class_="gamename").text
    number = game.find(class_="gamenumber").text.split(':')[1].lstrip()

Now that we have each game table extracted in the game variable, we can read it with a Pandas Data Frame. We have a function that strips the $ and , characters so we can interpret the currency fields as floating point numbers instead of strings. The tables also have a row at the bottom that gets interpreted incorrectly, so we can drop that with the dropna() function

    df = pd.read_html(
        StringIO(str(game)),
        header=1,
        converters={'Value': convert_currency}
    )[0].dropna()

From this Data Frame, we end up with a table that looks like this..

**Game Name: 200X The Cash**

=== ========== =========== ========== =========
N       Value   Odds 1 in      Total  Remaining
=== ========== =========== ========== =========
0   5000000.0  2917430.00        6.0        1.0
1    100000.0   972476.67       18.0        1.0
2     50000.0   729357.50       24.0        0.0
3     10000.0   583486.00       30.0        2.0
4      5000.0   112208.85      156.0        2.0
5      1000.0     6002.94     2916.0      101.0
6       500.0      399.52    43814.0      989.0
7       400.0      253.63    69017.0     1588.0
8       200.0       54.52   321044.0     7600.0
9       100.0       25.00   700211.0    17134.0
10       60.0       37.52   466541.0    13772.0
11       50.0       13.63  1284102.0    35575.0
12       40.0       12.50  1400364.0    44344.0
13       30.0       10.00  1750264.0    77117.0
=== ========== =========== ========== =========

This data gives us a good deal of insight into the game. For example, they don’t publish the total number of tickets printed, but if you multiply the odds by the total, you’ll find that you always end up with the same number. In this case 17,504,580 total tickets have been printed. This is crucial information because we can use the following to estimate how many tickets remain in the wild. We could calculate a margin of error here, but in this specific case, because of the large sample size, I’m just going to ignore it.

\[ \frac {remaining_{prize}}{total_{prize}} \cdot total_{tickets} \]

Crunching the numbers

Now that we have our data in a reasonable format, we can extract the expected values. We can do this with the original data first as this will give us which game has the highest inherent value. Based entirely on the original published odds, we can calculate this directly from our table by simply dividing the first column, the prize, by the second column, the odds of this outcome.

def calculate_expected_value(df):
    """Calculates the expected value from a DataFrame of lottery data."""
    ev = 0
    for row in df.values:
        ev += row[0] / row[1]
    return ev

This raw expected value only applies when the game is initially printed. Much more interesting would be to see how these odds, and thus the expected value change over time. We can do this by implementing our estimated ticket totals and computing the adjusted odds based on the remaining prizes.

    def calculate_adjusted_expected_value(df, total_tickets):
        """Calculates the adjusted expected value based on remaining prizes."""

        # Initialize the adjusted expected value (aev) to 0
        aev = 0
        original_count = 0
        current_count = 0

        # original_count = how many prizes we started with, current_count = how many prizes remain
        for row in df.values:
            original_count += row[2]
            current_count += row[3]

        # This extrapolated total is calculated from the ratio of prizes collected vs. remaining
        extrapolated_total = total_tickets * (current_count / original_count)

        # Now that we know the denominator of our adjusted odds formula, we can estimate the current odds
        for row in df.values:
            aev += row[0] * (row[3] / extrapolated_total)
        return aev

From here, it’s pretty trivial to calculate the different ratios for original and adjusted expected value. Once we’ve gathered all of this data, we can stuff it into a simple array and sort it by the various metrics we care about. The complete code can be found in my GitHub repo, but I’m sure by now you’re ready for the results, so without further ado…

The results

So now that we have the data, let’s look at the results. In the table below, the columns are labeled as follows:

Actual - Cost in dollars
Expected - Expected value, raw, in dollars
Ratio - The dollar for dollar expected value
Adj. Expected - Adjusted expected value, in dollars
Adj. Ratio - Adjusted dollar for dollare expected value
% Chg. - Percentage of the original expected value with the adjusted odds
Printed - How many ticets were printed originally
% Left - Percentage of the original print estimated to be left in the wild
Name - Game name (and number). Note that the name can be re-used. The number is unique.

N	Actual	Expected	Ratio	Adj. Expected	Adj. Ratio	% Chg.	Printed	% Left	Name
69	$30	$24.08	0.803	$28.41	0.947	118.0%	17,504,580	3%	200X The Cash (826)
47	$50	$41.72	0.834	$44.77	0.895	107.3%	16,277,910	29%	$10 Million Spectacular (913)
36	$10	$7.57	0.757	$8.53	0.853	112.7%	4,899,780	20%	$1,000,000 Loteria (924)
43	$20	$15.50	0.775	$16.80	0.840	108.4%	6,142,230	36%	$2,000,000 Diamond Deluxe (917)
57	$30	$24.22	0.807	$24.96	0.832	103.1%	11,910,930	24%	Black Titanium (892)
62	$20	$15.45	0.773	$16.61	0.831	107.5%	7,177,860	23%	$2,000,000 Riches (884)
63	$30	$23.97	0.799	$24.87	0.829	103.8%	16,674,390	36%	$5,000,000 Ultimate (876)
0	$50	$41.02	0.820	$41.17	0.823	100.4%	15,211,320	96%	$8 Million Money Maker (972)
55	$20	$15.48	0.774	$16.40	0.820	106.0%	6,392,820	28%	Big Cash Payout (896)
9	$30	$24.09	0.803	$24.38	0.813	101.2%	11,117,700	86%	Millionaire Bucks (952)
35	$30	$24.04	0.801	$24.37	0.812	101.4%	14,818,230	66%	200X The Cash (925)
53	$10	$7.43	0.743	$8.06	0.806	108.4%	8,398,980	18%	Red Hot Millions (900)
59	$10	$7.51	0.751	$8.03	0.803	106.9%	10,375,800	15%	Jumbo Bucks (888)
71	$30	$24.52	0.817	$23.97	0.799	97.7%	21,581,880	10%	Millionaire Maker (774)
20	$30	$24.11	0.804	$23.77	0.792	98.6%	9,049,980	62%	Supreme 7s (940)
28	$20	$15.45	0.772	$15.79	0.789	102.2%	6,918,420	57%	Instant Millions (932)
50	$20	$15.41	0.770	$15.72	0.786	102.0%	7,411,590	58%	Power 20s (904)
68	$10	$7.24	0.724	$7.80	0.780	107.7%	12,231,000	13%	Scorching Hot 7s (830)
65	$20	$15.40	0.770	$15.56	0.778	101.0%	14,828,928	20%	100X The Cash (868)
17	$10	$7.58	0.758	$7.77	0.777	102.6%	4,840,740	56%	$1,000,000 Cashword (943)
10	$20	$15.44	0.772	$15.37	0.769	99.6%	5,187,780	70%	Win Big (951)
13	$10	$7.53	0.753	$7.61	0.761	100.9%	11,186,190	66%	JUMBO BUCKS (948)
16	$20	$14.98	0.749	$14.98	0.749	100.0%	5,560,110	67%	$50,000 Payout (944)
3	$20	$14.99	0.750	$14.84	0.742	99.0%	1,851,120	47%	Merry Multiplier (958)
64	$10	$7.49	0.749	$7.38	0.738	98.6%	8,151,570	44%	VIP Platinum (875)
58	$10	$7.54	0.754	$7.38	0.738	97.9%	8,092,680	19%	Multiplier Mania (891)
48	$5	$3.45	0.689	$3.67	0.734	106.6%	7,347,960	12%	Emerald 8s (912)
31	$10	$7.54	0.754	$7.32	0.732	97.2%	8,063,250	42%	Ultimate Dash (929)
38	$10	$7.45	0.745	$7.28	0.728	97.8%	11,609,760	21%	50X The Cash (922)
4	$10	$7.26	0.726	$7.26	0.726	100.0%	3,683,890	35%	Holiday Cash Blowout (957)
21	$10	$7.25	0.725	$7.25	0.725	100.0%	11,494,741	65%	$80 Million Cash Blowout (939)
56	$10	$7.25	0.725	$7.21	0.721	99.4%	11,494,872	6%	$80,000,000 Cash Blowout (895)
24	$10	$7.25	0.725	$7.16	0.716	98.7%	6,056,798	28%	$2,000 Loaded (936)
70	$25	$18.90	0.756	$17.72	0.709	93.8%	11,936,490	5%	Spectacular Riches (805)
32	$5	$3.45	0.690	$3.54	0.708	102.7%	10,294,080	46%	$400,000 Jackpot (928)
66	$10	$7.24	0.724	$7.04	0.704	97.2%	9,636,435	18%	Triple 777 (864)
18	$5	$3.45	0.689	$3.51	0.701	101.8%	6,157,620	44%	GAME OF THRONES™ (942)
72	$25	$18.82	0.753	$17.41	0.697	92.5%	18,158,850	3%	Extreme Cash (718)
67	$20	$15.61	0.780	$13.93	0.697	89.3%	5,292,420	9%	Platinum (846)
60	$5	$3.45	0.690	$3.45	0.690	100.0%	10,614,000	8%	Mega Bucks (887)
23	$5	$3.45	0.690	$3.45	0.690	100.0%	7,288,080	49%	Super Loteria (937)
11	$5	$3.45	0.689	$3.44	0.689	99.9%	6,654,060	63%	Ca$h Plu$ (950)
14	$5	$3.45	0.689	$3.42	0.685	99.3%	11,111,220	66%	MEGA BUCKS (947)
37	$5	$3.45	0.690	$3.43	0.685	99.3%	7,319,040	34%	Bankroll (923)
1	$5	$3.45	0.689	$3.42	0.684	99.2%	6,665,580	83%	Diamond 10X (971)
12	$5	$3.45	0.690	$3.41	0.682	98.9%	6,907,740	77%	Xtreme Cashword (949)
25	$5	$3.45	0.690	$3.40	0.681	98.7%	12,215,832	48%	$500 Loaded (935)
5	$5	$3.45	0.690	$3.40	0.680	98.6%	4,661,700	39%	Holiday Cash 50X (956)
51	$5	$3.45	0.690	$3.40	0.679	98.5%	8,034,240	30%	Power Cashword (903)
52	$5	$3.45	0.690	$3.39	0.678	98.3%	7,755,780	38%	Power 5s (902)
29	$5	$3.45	0.690	$3.38	0.676	98.0%	7,308,300	45%	Red Hot Slots (931)
39	$5	$3.45	0.689	$3.37	0.673	97.7%	13,359,420	21%	20X The Cash (921)
2	$3	$1.99	0.663	$1.99	0.663	100.0%	9,272,900	95%	LOTERIA® (959)
30	$3	$1.99	0.663	$1.99	0.663	100.1%	5,113,600	36%	SKEE-BALL® (930)
6	$3	$1.99	0.663	$1.98	0.660	99.6%	3,246,500	49%	Holiday Cash 30X (955)
15	$2	$1.33	0.663	$1.31	0.656	99.0%	13,336,350	71%	Junior BIG OL’ BUCKS® (946)
49	$3	$1.99	0.662	$1.96	0.654	98.8%	9,871,600	28%	Loteria (911)
41	$2	$1.33	0.663	$1.30	0.652	98.4%	12,154,800	29%	10X The Cash (919)
7	$2	$1.33	0.663	$1.30	0.648	97.7%	4,514,400	53%	Holiday Cash 20X (954)
26	$2	$1.33	0.663	$1.28	0.641	96.7%	8,404,497	39%	$100 Loaded (934)
33	$2	$1.33	0.663	$1.26	0.631	95.1%	7,211,100	32%	Payday! (927)
61	$2	$1.33	0.664	$1.22	0.609	91.7%	14,442,000	18%	Junior Big Ol’ Bucks (885)
22	$5	$3.45	0.689	$3.04	0.608	88.3%	6,149,880	14%	Triple Red 777’s (938)
40	$3	$1.99	0.663	$1.81	0.604	91.0%	7,213,600	18%	15X Cashword (920)
42	$1	$0.61	0.608	$0.60	0.604	99.3%	14,334,900	25%	5X The Cash (918)
19	$1	$0.60	0.605	$0.60	0.602	99.4%	8,606,100	53%	Tic Tac Bonus (941)
46	$2	$1.33	0.663	$1.20	0.599	90.4%	6,661,050	10%	Triple Winning 7s (914)
54	$3	$1.99	0.663	$1.78	0.594	89.6%	4,933,300	11%	Lucky 7s Cashword (898)
8	$1	$0.60	0.605	$0.59	0.590	97.5%	6,346,800	60%	Holiday Cash 10X (953)
44	$5	$3.45	0.689	$2.94	0.588	85.3%	4,938,360	7%	Money Bag (916)
45	$5	$3.45	0.689	$2.89	0.578	83.9%	4,898,640	12%	Pot of Gold (915)
27	$1	$0.60	0.605	$0.55	0.554	91.6%	8,939,374	27%	$50 Loaded (933)
34	$1	$0.61	0.608	$0.54	0.536	88.1%	7,322,700	19%	Lucky 7s (926)

Conclusions

In running these calculations, I’ve run across some really interesting features of these games.

First of all (and most obviously I would hope), the expected value will always be less than the ticket initially. However, and this is another fascinating facet of this exercise, the vast majority of the value of these tickets does not come from the top prize. Instead, the lower tier prizes have the highest impact, proportionally, on the value of the ticket itself. Let’s look at our top performer at the moment, 200X The Cash (826).

+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+

Original E[x]
-------------
$5,000,000 (1:2,917,430) -> $1.71
$100,000 (1:972,477) -> $0.10
$50,000 (1:729,358) -> $0.07
$10,000 (1:583,486) -> $0.02
$5,000 (1:112,209) -> $0.04
$1,000 (1:6,003) -> $0.17
$500 (1:400) -> $1.25
$400 (1:254) -> $1.58
$200 (1:55) -> $3.67
$100 (1:25) -> $4.00
$60 (1:38) -> $1.60
$50 (1:14) -> $3.67
$40 (1:12) -> $3.20
$30 (1:10) -> $3.00

You can see that the biggest source of expected value is actually the $100 prize. I also find it interesting that the prizes from $1,000 to $100,000 add the least value to the ticket, leaving a trough of low value prizes near the top of the ticket. This bottom heavy distribution will help to even out the variation in expected value as the game goes on.

I hypothesize that this is done in order to drive sales of lottery tickets. Put simply, you need people to win, otherwise people won’t buy. If you scaled the odds such that each prize contributed exactly the same value to the ticket as the top prize, you would have much longer odds for the low valued prizes. If people don’t win, they won’t play. So it is to the benefit of the game makers to make those low value prizes much more likely. If your neighbor, or someone in your family wins $200 playing 200X The Cash, you’re going to be more likely to buy one.

As the game goes on, however, the weight of each prize category does change, and can have an impact on the expected value of the ticket. Assuming a totally random distribution, it’s conceivable that the game value would increase and eventually exceed 1. The current best valued ticket is close to 1, mainly because the top prizes have begun to exceed their normal likelihood of occurring, and therefore are becoming one of the dominant factors in the expected value formula. The $5M prize started with very low odds of 1:2,917,430 and so the proportional value it lent to the ticket was low, $1.71. But, as you can see in the output below, the odds have greatly increased for the top prize as only about 3% (estimated) of the original games printed are left in the wild. Now the $5M grand prize is the dominant factor in the overall expected value at a whopping $8.81.

+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+

Adjusted E[x]
-------------
$5,000,000 (1:567,219) -> $8.81
$100,000 (1:567,219) -> $0.18
$50,000 (INF) -> $0.00
$10,000 (1:283,610) -> $0.04
$5,000 (1:283,610) -> $0.02
$1,000 (1:5,788) -> $0.17
$500 (1:584) -> $0.86
$400 (1:365) -> $1.10
$200 (1:76) -> $2.63
$100 (1:34) -> $2.97
$60 (1:42) -> $1.44
$50 (1:16) -> $3.09
$40 (1:13) -> $3.08
$30 (1:7) -> $4.04

You can see how, as the game gets closer to being sold out, it’s possible for the high value prizes to start dominating the equation and eventually could go over $1. If this happens, it would theoretically be in your best interest to buy out the rest of the game because you could expect on average to make your money back. Of course in practice, this would be a pretty terrible decision as it would necessitate you winning the only remaining $5M prize to even stand a chance of breaking even. At $30 per ticket, and some half a million tickets remaining, this is obviousy an enormous gamble.

This isn’t exactly ground breaking research here, but it was a fun excercise and bears out the wisdom that it really never will pay off to play the lottery. I hope you enjoyed this diversion as much as I did. If you want to see the full results without running my code, you can download the full text document here.