Featured image of post Applying Game Theory to Scratch Offs

Applying Game Theory to Scratch Offs

2024-12-19: Some of the code in this article has been updated. Check the Git repository for the latest version.

The problem and the theory

So I’ve been interested in a little game theory problem lately. With the holidays in full swing, the NC Education Lottery has been hitting the advertising trail pretty hard. It got me thinking though. Which lottery scratch offs are the best overall value? Which ones are the worst? Is it possible to win the lottery, on average, if you buy enough tickets? Spoiler alert, the answer to that last one is no. But the results were pretty fascinating anyway, and not all scratch offs are created the same.

So how do we figure out which ones are the best, and which ones are the worst? Well, in game theory there’s a surprisingly simple formula for solving this kind of problem called the “expected value”, often denoted E which represents, on average, how valuable something is based on the probability (p) of a particular outcome (x). In our case, the probability is given by the “odds in 1” and the outcome is the amount of money each prize is worth.

\[ E[x]=x_1p_1 + x_2p_2 + … + x_np_n \]

Say you had a lottery where the top prize, or outcome, is $100 (\(x_1\)). There are 100 tickets in circulation, but only one of them has the $100 prize. The rest have no prize. This gives a probability (\(p_1\)) of \(\frac {1}{100}\). The expected value in this case is the probability of winning a ticket multiplied by the value of that outcome: \(E[x]=p_1x_1\). So in this very simple case, each ticket’s expected value is $1. If you can buy a ticket for say, $2, you can expect to lose, on average $0.50 per dollar spent given by: \(\frac{cost}{value}\). You can see how this works if you just imagine buying every single lottery ticket. Your total cost would be $200, and you would win $100 back. Obviously this isn’t a good deal.

The question I wanted to answer was if that ratio ever gets above 1 for any of our scratch off tickets as prizes are redeemed. The original published odds will always have an expected value ratio less than 1 because otherwise it wouldn’t be profitable. But if we could determine new probabilities based on estimates of remaining tickets and prizes, the expected value could change. As it turns out, the NC Education lottery shares not just the prizes, but also the total number of prizes as well as the number remaining and the odds on their website. Using this information, we could extrapolate the actual expected value of a ticket based on the value of all the prizes multiplied by how many prizes remain divided by our estimate of how many total tickets remain in the wild. We’ll get to that last bit, it’s by far the biggest unknown in our equation, but we can take a guess using the information we know.

Scraping HTML is painful

The biggest, and most annoying part of this problem is digesting HTML. Not only is it messy, but it’s inherently fragile. If the developers of the Lottery site decide to use different class names, this all falls apart and we have to refactor the code. Fortunately, HTML also has the notion of tables, which means our friends over at BeautifulSoup and Pandas can at least take a stab at extracting the data cleanly. As with any web scraping task, I try not to abuse the webserver, so the first thing I did was grab the content and store it locally:

wget https://nclottery.com/scratch-off-prizes-remaining -O lottery-main.html

Now that we have a local copy of the HTML content, we can process it freely without worrying about latency or the admins blocking our connections. Pandas has some great tools for reading tables from HTML code, and on my first attempt, it did a pretty good job. What it didn’t do well though was extract elements from the page that weren’t part of a table. In our source data, the value of a ticket isn’t captured in the actual table, but instead it’s captured in one of the HTML classes, as well as the game name and number.

<div class="box cloudfx databox price_5">

    <table class="datatable">

        <thead>
            <tr>
                <th colspan="4" class="ticketdetails">
                    <span class="gamethumb">
                        <a href="/Content/Images/Instant/nc916_sqr.png" class="preview">
                            <img src="/Content/Images/Instant/nc916_sqr.png" alt="+" class="autoresize" />
                        </a>
                    </span>
                    <span class="gamename">
                        <a href="/scratch-off/916/money-bag">Money Bag</a>
                    </span>
                    <span class="gamestats">
                        <span class="gamenumber"><b>Game Number:</b>916</span>
                    </span>
                    <span class="gameflags"></span>
                </th>
            </tr>

This code allows JavaScript on the page to filter by price. But for our purposes, it allows us to determine the ratio of expected value to actual value. This is important because the highest expected value doesn’t necessarily mean it’s the best overall ticket to buy. By looking at the ratio, you can determine which one is best per dollar spent, which is the gold standard for the value.

To get this data, we can use BeautifulSoup to pre-process the HTML before using Pandas to extract the tables themselves. We can also use this to extract the name of the game itself as well as its number since the name is not necessarily unique, which Pandas wasn’t great at doing (or at least, I wasn’t able to figure out what incantations to cast for it to properly extract the data I wanted).

with open("lottery-main.html", "r") as f:
    html = f.read()
    soup = BeautifulSoup(html, "html.parser")

for game in soup.find_all(class_=re.compile("box cloudfx databox price_.+")):
    # Extract ticket value and game name
    val = int(game['class'][3].split('_')[1])
    name = game.find(class_="gamename").text
    number = game.find(class_="gamenumber").text.split(':')[1].lstrip()

Now that we have each game table extracted in the game variable, we can read it with a Pandas Data Frame. We have a function that strips the $ and , characters so we can interpret the currency fields as floating point numbers instead of strings. The tables also have a row at the bottom that gets interpreted incorrectly, so we can drop that with the dropna() function

    df = pd.read_html(
        StringIO(str(game)),
        header=1,
        converters={'Value': convert_currency}
    )[0].dropna()

From this Data Frame, we end up with a table that looks like this..

**Game Name: 200X The Cash**

=== ========== =========== ========== =========
N       Value   Odds 1Ā in      Total  Remaining
=== ========== =========== ========== =========
0   5000000.0  2917430.00        6.0        1.0
1    100000.0   972476.67       18.0        1.0
2     50000.0   729357.50       24.0        0.0
3     10000.0   583486.00       30.0        2.0
4      5000.0   112208.85      156.0        2.0
5      1000.0     6002.94     2916.0      101.0
6       500.0      399.52    43814.0      989.0
7       400.0      253.63    69017.0     1588.0
8       200.0       54.52   321044.0     7600.0
9       100.0       25.00   700211.0    17134.0
10       60.0       37.52   466541.0    13772.0
11       50.0       13.63  1284102.0    35575.0
12       40.0       12.50  1400364.0    44344.0
13       30.0       10.00  1750264.0    77117.0
=== ========== =========== ========== =========

This data gives us a good deal of insight into the game. For example, they don’t publish the total number of tickets printed, but if you multiply the odds by the total, you’ll find that you always end up with the same number. In this case 17,504,580 total tickets have been printed. This is crucial information because we can use the following to estimate how many tickets remain in the wild. We could calculate a margin of error here, but in this specific case, because of the large sample size, I’m just going to ignore it.

\[ \frac {remaining_{prize}}{total_{prize}} \cdot total_{tickets} \]

Crunching the numbers

Now that we have our data in a reasonable format, we can extract the expected values. We can do this with the original data first as this will give us which game has the highest inherent value. Based entirely on the original published odds, we can calculate this directly from our table by simply dividing the first column, the prize, by the second column, the odds of this outcome.

def calculate_expected_value(df):
    """Calculates the expected value from a DataFrame of lottery data."""
    ev = 0
    for row in df.values:
        ev += row[0] / row[1]
    return ev

This raw expected value only applies when the game is initially printed. Much more interesting would be to see how these odds, and thus the expected value change over time. We can do this by implementing our estimated ticket totals and computing the adjusted odds based on the remaining prizes.

    def calculate_adjusted_expected_value(df, total_tickets):
        """Calculates the adjusted expected value based on remaining prizes."""

        # Initialize the adjusted expected value (aev) to 0
        aev = 0
        original_count = 0
        current_count = 0

        # original_count = how many prizes we started with, current_count = how many prizes remain
        for row in df.values:
            original_count += row[2]
            current_count += row[3]

        # This extrapolated total is calculated from the ratio of prizes collected vs. remaining
        extrapolated_total = total_tickets * (current_count / original_count)

        # Now that we know the denominator of our adjusted odds formula, we can estimate the current odds
        for row in df.values:
            aev += row[0] * (row[3] / extrapolated_total)
        return aev

From here, it’s pretty trivial to calculate the different ratios for original and adjusted expected value. Once we’ve gathered all of this data, we can stuff it into a simple array and sort it by the various metrics we care about. The complete code can be found in my GitHub repo, but I’m sure by now you’re ready for the results, so without further ado…

The results

So now that we have the data, let’s look at the results. In the table below, the columns are labeled as follows:

  • Actual - Cost in dollars
  • Expected - Expected value, raw, in dollars
  • Ratio - The dollar for dollar expected value
  • Adj. Expected - Adjusted expected value, in dollars
  • Adj. Ratio - Adjusted dollar for dollare expected value
  • % Chg. - Percentage of the original expected value with the adjusted odds
  • Printed - How many ticets were printed originally
  • % Left - Percentage of the original print estimated to be left in the wild
  • Name - Game name (and number). Note that the name can be re-used. The number is unique.
N Actual Expected Ratio Adj. Expected Adj. Ratio % Chg. Printed % Left Name
69 $30 $24.08 0.803 $28.41 0.947 118.0% 17,504,580 3% 200X The Cash (826)
47 $50 $41.72 0.834 $44.77 0.895 107.3% 16,277,910 29% $10 Million Spectacular (913)
36 $10 $7.57 0.757 $8.53 0.853 112.7% 4,899,780 20% $1,000,000 Loteria (924)
43 $20 $15.50 0.775 $16.80 0.840 108.4% 6,142,230 36% $2,000,000 Diamond Deluxe (917)
57 $30 $24.22 0.807 $24.96 0.832 103.1% 11,910,930 24% Black Titanium (892)
62 $20 $15.45 0.773 $16.61 0.831 107.5% 7,177,860 23% $2,000,000 Riches (884)
63 $30 $23.97 0.799 $24.87 0.829 103.8% 16,674,390 36% $5,000,000 Ultimate (876)
0 $50 $41.02 0.820 $41.17 0.823 100.4% 15,211,320 96% $8 Million Money Maker (972)
55 $20 $15.48 0.774 $16.40 0.820 106.0% 6,392,820 28% Big Cash Payout (896)
9 $30 $24.09 0.803 $24.38 0.813 101.2% 11,117,700 86% Millionaire Bucks (952)
35 $30 $24.04 0.801 $24.37 0.812 101.4% 14,818,230 66% 200X The Cash (925)
53 $10 $7.43 0.743 $8.06 0.806 108.4% 8,398,980 18% Red Hot Millions (900)
59 $10 $7.51 0.751 $8.03 0.803 106.9% 10,375,800 15% Jumbo Bucks (888)
71 $30 $24.52 0.817 $23.97 0.799 97.7% 21,581,880 10% Millionaire Maker (774)
20 $30 $24.11 0.804 $23.77 0.792 98.6% 9,049,980 62% Supreme 7s (940)
28 $20 $15.45 0.772 $15.79 0.789 102.2% 6,918,420 57% Instant Millions (932)
50 $20 $15.41 0.770 $15.72 0.786 102.0% 7,411,590 58% Power 20s (904)
68 $10 $7.24 0.724 $7.80 0.780 107.7% 12,231,000 13% Scorching Hot 7s (830)
65 $20 $15.40 0.770 $15.56 0.778 101.0% 14,828,928 20% 100X The Cash (868)
17 $10 $7.58 0.758 $7.77 0.777 102.6% 4,840,740 56% $1,000,000 Cashword (943)
10 $20 $15.44 0.772 $15.37 0.769 99.6% 5,187,780 70% Win Big (951)
13 $10 $7.53 0.753 $7.61 0.761 100.9% 11,186,190 66% JUMBO BUCKS (948)
16 $20 $14.98 0.749 $14.98 0.749 100.0% 5,560,110 67% $50,000 Payout (944)
3 $20 $14.99 0.750 $14.84 0.742 99.0% 1,851,120 47% Merry Multiplier (958)
64 $10 $7.49 0.749 $7.38 0.738 98.6% 8,151,570 44% VIP Platinum (875)
58 $10 $7.54 0.754 $7.38 0.738 97.9% 8,092,680 19% Multiplier Mania (891)
48 $5 $3.45 0.689 $3.67 0.734 106.6% 7,347,960 12% Emerald 8s (912)
31 $10 $7.54 0.754 $7.32 0.732 97.2% 8,063,250 42% Ultimate Dash (929)
38 $10 $7.45 0.745 $7.28 0.728 97.8% 11,609,760 21% 50X The Cash (922)
4 $10 $7.26 0.726 $7.26 0.726 100.0% 3,683,890 35% Holiday Cash Blowout (957)
21 $10 $7.25 0.725 $7.25 0.725 100.0% 11,494,741 65% $80 Million Cash Blowout (939)
56 $10 $7.25 0.725 $7.21 0.721 99.4% 11,494,872 6% $80,000,000 Cash Blowout (895)
24 $10 $7.25 0.725 $7.16 0.716 98.7% 6,056,798 28% $2,000 Loaded (936)
70 $25 $18.90 0.756 $17.72 0.709 93.8% 11,936,490 5% Spectacular Riches (805)
32 $5 $3.45 0.690 $3.54 0.708 102.7% 10,294,080 46% $400,000 Jackpot (928)
66 $10 $7.24 0.724 $7.04 0.704 97.2% 9,636,435 18% Triple 777 (864)
18 $5 $3.45 0.689 $3.51 0.701 101.8% 6,157,620 44% GAME OF THRONESā„¢ (942)
72 $25 $18.82 0.753 $17.41 0.697 92.5% 18,158,850 3% Extreme Cash (718)
67 $20 $15.61 0.780 $13.93 0.697 89.3% 5,292,420 9% Platinum (846)
60 $5 $3.45 0.690 $3.45 0.690 100.0% 10,614,000 8% Mega Bucks (887)
23 $5 $3.45 0.690 $3.45 0.690 100.0% 7,288,080 49% Super Loteria (937)
11 $5 $3.45 0.689 $3.44 0.689 99.9% 6,654,060 63% Ca$h Plu$ (950)
14 $5 $3.45 0.689 $3.42 0.685 99.3% 11,111,220 66% MEGA BUCKS (947)
37 $5 $3.45 0.690 $3.43 0.685 99.3% 7,319,040 34% Bankroll (923)
1 $5 $3.45 0.689 $3.42 0.684 99.2% 6,665,580 83% Diamond 10X (971)
12 $5 $3.45 0.690 $3.41 0.682 98.9% 6,907,740 77% Xtreme Cashword (949)
25 $5 $3.45 0.690 $3.40 0.681 98.7% 12,215,832 48% $500 Loaded (935)
5 $5 $3.45 0.690 $3.40 0.680 98.6% 4,661,700 39% Holiday Cash 50X (956)
51 $5 $3.45 0.690 $3.40 0.679 98.5% 8,034,240 30% Power Cashword (903)
52 $5 $3.45 0.690 $3.39 0.678 98.3% 7,755,780 38% Power 5s (902)
29 $5 $3.45 0.690 $3.38 0.676 98.0% 7,308,300 45% Red Hot Slots (931)
39 $5 $3.45 0.689 $3.37 0.673 97.7% 13,359,420 21% 20X The Cash (921)
2 $3 $1.99 0.663 $1.99 0.663 100.0% 9,272,900 95% LOTERIAĀ® (959)
30 $3 $1.99 0.663 $1.99 0.663 100.1% 5,113,600 36% SKEE-BALLĀ® (930)
6 $3 $1.99 0.663 $1.98 0.660 99.6% 3,246,500 49% Holiday Cash 30X (955)
15 $2 $1.33 0.663 $1.31 0.656 99.0% 13,336,350 71% Junior BIG OL’ BUCKSĀ® (946)
49 $3 $1.99 0.662 $1.96 0.654 98.8% 9,871,600 28% Loteria (911)
41 $2 $1.33 0.663 $1.30 0.652 98.4% 12,154,800 29% 10X The Cash (919)
7 $2 $1.33 0.663 $1.30 0.648 97.7% 4,514,400 53% Holiday Cash 20X (954)
26 $2 $1.33 0.663 $1.28 0.641 96.7% 8,404,497 39% $100 Loaded (934)
33 $2 $1.33 0.663 $1.26 0.631 95.1% 7,211,100 32% Payday! (927)
61 $2 $1.33 0.664 $1.22 0.609 91.7% 14,442,000 18% Junior Big Ol’ Bucks (885)
22 $5 $3.45 0.689 $3.04 0.608 88.3% 6,149,880 14% Triple Red 777’s (938)
40 $3 $1.99 0.663 $1.81 0.604 91.0% 7,213,600 18% 15X Cashword (920)
42 $1 $0.61 0.608 $0.60 0.604 99.3% 14,334,900 25% 5X The Cash (918)
19 $1 $0.60 0.605 $0.60 0.602 99.4% 8,606,100 53% Tic Tac Bonus (941)
46 $2 $1.33 0.663 $1.20 0.599 90.4% 6,661,050 10% Triple Winning 7s (914)
54 $3 $1.99 0.663 $1.78 0.594 89.6% 4,933,300 11% Lucky 7s Cashword (898)
8 $1 $0.60 0.605 $0.59 0.590 97.5% 6,346,800 60% Holiday Cash 10X (953)
44 $5 $3.45 0.689 $2.94 0.588 85.3% 4,938,360 7% Money Bag (916)
45 $5 $3.45 0.689 $2.89 0.578 83.9% 4,898,640 12% Pot of Gold (915)
27 $1 $0.60 0.605 $0.55 0.554 91.6% 8,939,374 27% $50 Loaded (933)
34 $1 $0.61 0.608 $0.54 0.536 88.1% 7,322,700 19% Lucky 7s (926)

Conclusions

In running these calculations, I’ve run across some really interesting features of these games.

First of all (and most obviously I would hope), the expected value will always be less than the ticket initially. However, and this is another fascinating facet of this exercise, the vast majority of the value of these tickets does not come from the top prize. Instead, the lower tier prizes have the highest impact, proportionally, on the value of the ticket itself. Let’s look at our top performer at the moment, 200X The Cash (826).

+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+

Original E[x]
-------------
$5,000,000 (1:2,917,430) -> $1.71
$100,000 (1:972,477) -> $0.10
$50,000 (1:729,358) -> $0.07
$10,000 (1:583,486) -> $0.02
$5,000 (1:112,209) -> $0.04
$1,000 (1:6,003) -> $0.17
$500 (1:400) -> $1.25
$400 (1:254) -> $1.58
$200 (1:55) -> $3.67
$100 (1:25) -> $4.00
$60 (1:38) -> $1.60
$50 (1:14) -> $3.67
$40 (1:12) -> $3.20
$30 (1:10) -> $3.00

You can see that the biggest source of expected value is actually the $100 prize. I also find it interesting that the prizes from $1,000 to $100,000 add the least value to the ticket, leaving a trough of low value prizes near the top of the ticket. This bottom heavy distribution will help to even out the variation in expected value as the game goes on.

I hypothesize that this is done in order to drive sales of lottery tickets. Put simply, you need people to win, otherwise people won’t buy. If you scaled the odds such that each prize contributed exactly the same value to the ticket as the top prize, you would have much longer odds for the low valued prizes. If people don’t win, they won’t play. So it is to the benefit of the game makers to make those low value prizes much more likely. If your neighbor, or someone in your family wins $200 playing 200X The Cash, you’re going to be more likely to buy one.

As the game goes on, however, the weight of each prize category does change, and can have an impact on the expected value of the ticket. Assuming a totally random distribution, it’s conceivable that the game value would increase and eventually exceed 1. The current best valued ticket is close to 1, mainly because the top prizes have begun to exceed their normal likelihood of occurring, and therefore are becoming one of the dominant factors in the expected value formula. The $5M prize started with very low odds of 1:2,917,430 and so the proportional value it lent to the ticket was low, $1.71. But, as you can see in the output below, the odds have greatly increased for the top prize as only about 3% (estimated) of the original games printed are left in the wild. Now the $5M grand prize is the dominant factor in the overall expected value at a whopping $8.81.

+---------------------------+
| 200X The Cash (826) - $30 |
+---------------------------+

Adjusted E[x]
-------------
$5,000,000 (1:567,219) -> $8.81
$100,000 (1:567,219) -> $0.18
$50,000 (INF) -> $0.00
$10,000 (1:283,610) -> $0.04
$5,000 (1:283,610) -> $0.02
$1,000 (1:5,788) -> $0.17
$500 (1:584) -> $0.86
$400 (1:365) -> $1.10
$200 (1:76) -> $2.63
$100 (1:34) -> $2.97
$60 (1:42) -> $1.44
$50 (1:16) -> $3.09
$40 (1:13) -> $3.08
$30 (1:7) -> $4.04

You can see how, as the game gets closer to being sold out, it’s possible for the high value prizes to start dominating the equation and eventually could go over $1. If this happens, it would theoretically be in your best interest to buy out the rest of the game because you could expect on average to make your money back. Of course in practice, this would be a pretty terrible decision as it would necessitate you winning the only remaining $5M prize to even stand a chance of breaking even. At $30 per ticket, and some half a million tickets remaining, this is obviousy an enormous gamble.

This isn’t exactly ground breaking research here, but it was a fun excercise and bears out the wisdom that it really never will pay off to play the lottery. I hope you enjoyed this diversion as much as I did. If you want to see the full results without running my code, you can download the full text document here.

Built with Hugo
Theme Stack designed by Jimmy