Post

Alice vs. Bob Coin Flipping Problem

Alice vs. Bob Coin Flipping Problem

I saw this interesting math problem from a coworker and decided to make a post discussing the computation.

Coin Flipping Problem

Flip a fair coin 100 times—it gives a sequence of heads (H) and tails (T). For each HH in the sequence of flips, Alice gets a point; for each HT, Bob does, so e.g. for the sequence THHHT Alice gets 2 points and Bob gets 1 point. Who is most likely to win?

Answer The correct answer is Bob.

Math Solution

Don’t feel bad if you can’t intuit your way to the solution (I guessed wrong twice). We can use Markov Chain Theory to prove this mathematically. Here is an outline of this:

  • Translate problem to math. Specifically, a Markov Chain with states representing the game.
  • Use the Markov Chain to compute a joint distribution of all the states, across all 100 coin flips.
  • Compute the marginal distribution of each player’s score on the 100th coin flip from the joint distribution.
  • Sum the marginal distribution according to whether Alice won, Bob won, or they tied.

First, we need to translate the problem to mathematical terms:

\[\begin{gather*} t = \{1, ..., 100\}, X_t = \{H, T\}, P(X_t = \text{H}) = P(X_t = \text{T}) = \frac{1}{2} \\ Z_t = \text{Event that Alice wins at turn $t$} \\ Y_t = \text{Event that Bob wins at turn $t$} \\ P(Z_{100}) \stackrel{?}{>} P(Y_{100}) \end{gather*}\]

Although intuitive, I wont proceed further with this approach because it becomes very difficult to analyze. Each random variable only models a single time index ($t$), but the problem requires us to model two time steps (namely, $HH$ and $HT$). I got this idea from this stackoverflow post, and will explain it now. First, let’s map our original state space into a more useful state space:

\[\begin{gather*} \{X_{t - 1} = H, X_{t} = H\} \rightarrow HH \\ \{X_{t - 1} = H, X_{t} = T\} \rightarrow HT \\ \{X_{t - 1} \neq H, X_t = H\} \rightarrow H \\ \{X_t = T\} \rightarrow NULL \\ \end{gather*}\]

This allows us to track the two scoring states ($HH$, $HT$), the state before scoring ($H$), and discard the rest ($NULL$). More importantly, this formulation requires us to remember only the most recent state, ($HH$, $HT$, $H$, $NULL$). This is known as the Markov Property and directly leads to a Markov Chain specification.

Coming up with this formulation was often the most difficult part of Markov Chain analysis for me in school. Indeed, with this specified, First Step and Markov Chain analysis becomes a probability calculation.

Now we can write out the Markov Chain’s transition matrix:

 $NULL$$H$$HH$$HT$
$NULL$1/21/200
$H$001/21/2
$HH$001/21/2
$HT$1/21/200

The rows represent a starting point at $t - 1$ and each entry in the corresponding columns are the chances of transitioning into that state at time $t$. For example, starting in $NULL$, you have a $\frac{1}{2}$ chance of rolling a $H$ but its impossible to go from $NULL$ to a $HH$ i.e.

\[P(X_t = HH | X_{t - 1} = NULL) = 0\]

Notice how the sum of each row will always equal 1, indicating that we will always transition to a state next flip. The key here is that since we constructed the definitions of our states to only exist on its previous state and still be meaningful in terms of the problem, we can construct such a table.

This table can also be visualized in a flow chart:

flowchart TD
    NULL -->|1/2| NULL
    NULL -->|1/2| H
    H -->|1/2| HH
    HH -->|1/2| HH
    HH -->|1/2| HT
    HT -->|1/2| H
    H -->|1/2| HT
    HT -->|1/2| NULL

With each arrow indicating a state transition according to the probability from the table.

We now need one more set of random variables, $Z_t$ and $Y_t$:

\[\begin{gather*} Z_t = \text{Alice's points at turn $t$} \in \{0, ..., 100\} \\ Y_t = \text{Bob's points at turn $t$} \in \{0, ..., 100\} \\ \end{gather*}\]

Bob’s max points is only 50 (alternating $HT$ for 50 times), but if we use ${0, …, 100}$ is makes the code easier and doesn’t change the answer. Lets stack all three random variables $(X_t, Z_t, Y_t)$ into a vector:

\[\begin{gather*} \vec S_t = [X_t, Z_t, Y_t] \end{gather*}\]

Now for the calculation part, we simply need to compute $P(\vec S_{100})$. To do this, lets first use the law of total probability and the chain rule of probability to condition on $\vec S_{99}$:

\[\begin{gather*} P(\vec S_{100}) = \sum_{\vec S_{99}} P(\vec S_{100}, \vec S_{99}) \\ = \sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) P(\vec S_{99}) \end{gather*}\]

Lets condition on $\vec S_{98}$ as well:

\[\begin{gather*} P(\vec S_{100}) = \sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) P(\vec S_{99}) \\ = \sum_{\vec S_{98}}\sum_{\vec S_{99}} P(\vec S_{100}, \vec S_{98} | \vec S_{99}) P(\vec S_{99}, \vec S_{98}) \\ = \sum_{\vec S_{98}}\sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) P(\vec S_{99} | \vec S_{98}) P(\vec S_{98}) \\ \end{gather*}\]

Once we condition $\vec S_{100}$ on $\vec S_{99}$, $\vec S_{100}$ becomes conditionally independent from $\vec S_{98}$. Again, this is the Markov Property we are using. We can continue this pattern all the way down to $\vec S_1$:

\[\begin{gather*} P(\vec S_{100}) = \sum_{\vec S_{1}}...\sum_{\vec S_{98}}\sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) P(\vec S_{99} | \vec S_{98}) P(\vec S_{98} | \vec S_{97}) ... P(\vec S_{2} | \vec S_{1}) P(\vec S_1) \\ \end{gather*}\]

From our transition table above, we always know: \(P(\vec S_{t} | \vec S_{t-1})\) and we can create a dummy timestep $t=0$ representing the state before the first flip as:

\[\begin{gather*} P(\vec S_0 = [Null, 0, 0]) = 1 \end{gather*}\]

You would have to then add the condition:

\[P(\vec S_1 | \vec S_0) P(\vec S_0)\]

to our expression for $P(\vec S_{100})$, but for brevity I will suppress this.

Computational Tricks

Now we can use three computational tricks. These aren’t required from a mathematical perspective, but to get the correct answer using numpy in a reasonable amount of time, we need them. First:

\[\begin{gather*} P(\vec S_{100}) = \sum_{\vec S_{1}}...\sum_{\vec S_{98}}\sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) P(\vec S_{99} | \vec S_{98}) P(\vec S_{98} | \vec S_{97}) ... P(\vec S_{2} | \vec S_{1}) P(\vec S_1) \\ = \sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) \sum_{\vec S_{98}} P(\vec S_{99} | \vec S_{98}) \sum_{\vec S_{97}} P(\vec S_{98} | \vec S_{97}) ... \sum_{\vec S_{1}} P(\vec S_{2} | \vec S_{1}) P(\vec S_1) \end{gather*}\]

This allow us to compute the most far right part first, and store it. Consider after we compute the right most part:

\[P(\vec S_{2}) = \sum_{\vec S_{1}} P(\vec S_{2} | \vec S_{1}) P(\vec S_1)\]

When you are working further left to $\vec S_3$ and so on, $P(\vec S_{2})$ will not change since it is a constant. So instead of recomputing this term several times, we can simply cache it. Congrats, you just learned Dynamic Programming and sum-product inference!

The second trick is that to have numerical stability computing the product of these small probabilities, we will compute $P(\vec S_{100})$ in log-space i.e. $\log P(\vec S_{100})$:

\[\begin{gather*} \log P(\vec S_{100}) = \log \big [\sum_{\vec S_{99}} P(\vec S_{100} | \vec S_{99}) \sum_{\vec S_{98}} P(\vec S_{99} | \vec S_{98}) \sum_{\vec S_{97}} P(\vec S_{98} | \vec S_{97}) ... \sum_{\vec S_{1}} P(\vec S_{2} | \vec S_{1}) P(\vec S_1) \big ] \\ =\log \sum_{\vec S_{99}} \exp \log \big [P(\vec S_{100} | \vec S_{99}) \sum_{\vec S_{98}} P(\vec S_{99} | \vec S_{98}) \sum_{\vec S_{97}} P(\vec S_{98} | \vec S_{97}) ... \sum_{\vec S_{1}} P(\vec S_{2} | \vec S_{1}) P(\vec S_1)\big ] \\ =\log \sum_{\vec S_{99}} \exp \log P(\vec S_{100} | \vec S_{99}) + \log \big [\sum_{\vec S_{98}} P(\vec S_{99} | \vec S_{98}) \sum_{\vec S_{97}} P(\vec S_{98} | \vec S_{97}) ... \sum_{\vec S_{1}} P(\vec S_{2} | \vec S_{1}) P(\vec S_1)\big ] \\ ... \text{ Continue converting each product using $\exp \log [\cdot]$ } ... \\ =\log \sum_{\vec S_{99}} \exp \log P(\vec S_{100} | \vec S_{99}) + \log \sum_{\vec S_{98}} \exp \log P(\vec S_{99} | \vec S_{98}) + ... + \log\sum_{\vec S_{1}} \exp \log P(\vec S_{2} | \vec S_{1}) P(\vec S_1) \\ =\log \sum_{\vec S_{99}} \exp \log P(\vec S_{100} | \vec S_{99}) + \log \sum_{\vec S_{98}} \exp \log P(\vec S_{99} | \vec S_{98}) + ... + \log\sum_{\vec S_{1}} \exp \log P(\vec S_{2} | \vec S_{1}) + \log P(\vec S_1) \\ =LSE_{\vec S_{99}} \log P(\vec S_{100} | \vec S_{99}) + LSE_{\vec S_{98}} \log P(\vec S_{99} | \vec S_{98}) + ... + LSE_{\vec S_{1}} \log P(\vec S_{2} | \vec S_{1}) + \log P(\vec S_1) \\ \end{gather*}\]

Where $LSE_x$ stands for Logarithm Summation Exponentiation. This operation is common enough that numpy & scipy has given it a special addition implementation and special summation implementation. And lastly, just add the index $t$ to the vector itself for bookkeeping. This is not a random variable, but is useful to index into a specific time index.

Python Implementation

1
2
3
4
5
6
7
import numpy as np
import pandas as pd
from scipy.special import logsumexp
from numba import njit
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, Markdown
1
2
3
4
5
N = 100
# Set log-probability array
# State (ordering is [NULL, H, HH, HT]), Alice's points, Bob's points, index
log_probs = np.full((4, N + 1, N + 1, N + 1), -np.inf)
log_probs.shape
1
(4, 101, 101, 101)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
@njit
def compute(log_probs: np.ndarray) -> None:
    """Compute log-probabilities for joint states starting in NULL state.

    Args:
        log_probs: Array shaped (4, N + 1, N + 1, N + 1) representing:
            - State ordered [NULL, H, HH, HT]
            - Alice's score 0, ..., 100
            - Bob's score 0, ..., 100
            - Time index 1, ..., 100 prefixed with a dummy time, 0.
    """
    # Set the dummy 0 index to a state where log_prob = 0 or prob = 1 that we
    # are in NULL state with zero points for Alice and Bob.
    log_probs[0, 0, 0, 0] = 0
    # Loop over all time indices, starting from the first coin clip.
    for t in range(1, N + 1):
        # Since we know Z_t <= t and Y_t <= t, we only need to compute probabilities
        # for values up to t.
        for a in range(t + 1):
            for b in range(t + 1):
                # We can enter NULL or H from either NULL or HT.
                log_probs[0, a, b, t] = log_probs[1, a, b, t] = np.logaddexp(
                    log_probs[0, a, b, t - 1] - np.log(2),
                    log_probs[3, a, b, t - 1] - np.log(2),
                )
                # We can enter HH from either H or HH. Since HH is also the same
                # as Alice scoring, we can increment her score in the same step.
                log_probs[2, a, b, t] = (
                    np.logaddexp(
                        log_probs[1, a - 1, b, t - 1] - np.log(2),
                        log_probs[2, a - 1, b, t - 1] - np.log(2),
                    )
                    if a > 0
                    else -np.inf
                )
                # We can enter HT from either H or HH. Since HT is also the same
                # as Bob scoring, we can increment his score in the same step.
                log_probs[3, a, b, t] = (
                    np.logaddexp(
                        log_probs[1, a, b - 1, t - 1] - np.log(2),
                        log_probs[2, a, b - 1, t - 1] - np.log(2),
                    )
                    if b > 0
                    else -np.inf
                )


compute(log_probs=log_probs)
1
2
3
%%timeit
# Running naively takes ~15s
compute(log_probs)
1
5.3 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Joint Probability of t=100

With the entire joint probability computed, we need to marginalize the last time index $t=100$ for the joint distribution of $P(Z_{100}, Y_{100})$:

\[\begin{equation} P(Z_{100}, Y_{100}) = \sum_{X_{100}} P(X_{100}, Z_{100}, Y_{100}, 100) \end{equation}\]
1
2
3
4
5
# Fix the last time step, and sum over the X_{100} dimension
log_probs_final = logsumexp(log_probs[..., -1], axis=0)

# Compute joint probability distribution of points
probs_joint = np.exp(log_probs_final)
1
2
3
4
5
6
7
8
9
10
11
12
13
# Plotting Helper function
def render_plotly_html(fig: go.Figure) -> None:
    """Display a Plotly figure in markdown with HTML."""
    # Show the figure if in a IPython display.
    fig.show()
    # Render the content for html output.
    display(
        Markdown(
            fig.to_html(
                include_plotlyjs="cdn",
            )
        )
    )
1
2
3
4
5
6
7
8
render_plotly_html(
    px.imshow(
        probs_joint,
        aspect="auto",
        labels=dict(x="Bob Score", y="Alice Score", color="Probability"),
        title="P(Alice Score, Bob Score | t=100)",
    )
)

Probability of Winning

For the last step, we need to marginalize over $P(Z_{100}, Y_{100})$. When we do the summation we can bucket an Alice win, tie, and Bob win into 3 separate categories. Specifically:

\[\begin{gather*} W_t = \text{Outcome of the game at round $t$} = \begin{cases} \text{Alice wins} & \text{if $Z_t$ > $Y_t$}\\ \text{tie} & \text{if $Z_t$ = $Y_t$}\\ \text{Bob wins} & \text{otherwise} \end{cases} \\ P(W_{100}) = \sum_{Z_{100}, Y_{100}} P(Z_{100}, Y_{100}) \\ \log P(W_{100}) = \log \sum_{Z_{100}, Y_{100}} \exp \log P(Z_{100}, Y_{100}) \end{gather*}\]

Now we just need to sum over $\log P(Z_{100} = z, Y_{100} = y)$ and for each $(z, y)$ combination, keep a running tally of the probability mass:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Initialize log-probabilities
log_probs_win = np.full(3, -np.inf)
names = ["Alice Wins", "Tie", "Bob Wins"]

# Compute log-probabilities of Alice winning, tie, or Bob winning.
for a in range(N + 1):
    for b in range(N + 1):
        if a > b:
            log_probs_win[0] = np.logaddexp(log_probs_win[0], log_probs_final[a, b])
        elif a == b:
            log_probs_win[1] = np.logaddexp(log_probs_win[1], log_probs_final[a, b])
        else:
            log_probs_win[2] = np.logaddexp(log_probs_win[2], log_probs_final[a, b])

# Compute outcome probabilities
probs_win = np.exp(log_probs_win)
render_plotly_html(
    px.pie(
        pd.DataFrame({"State": names, "Probability": probs_win}),
        names="State",
        values="Probability",
    )
)

Bob, you clever devil.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# A little magic to automatically write my blog :)
import subprocess

_ = subprocess.run(
    [
        "jupyter",
        "nbconvert",
        "--to",
        "markdown",
        "--output",
        "~/jakee417.github.io/_includes/coin_problem.md",
        "coin_problem.ipynb",
    ]
)
1
2
[NbConvertApp] Converting notebook coin_problem.ipynb to markdown
[NbConvertApp] Writing 15252 bytes to /Users/jakeetaylor/jakee417.github.io/_includes/coin_problem.md
This post is licensed under CC BY 4.0 by the author.