Thursday, August 7, 2025

Viterbi semiring intro

Explain the Viterbi semiring

The Viterbi semiring is the algebraic structure underlying the Viterbi algorithm’s “max-product” computations — it formalizes how you score and select the single best path (most probable sequence) in models like Hidden Markov Models (HMMs) or weighted finite-state automata.

Core idea

A semiring is a set $S$ equipped with two binary operations:

  • $\oplus$ (an “addition”-like operation, here used to choose between alternatives),
  • $\otimes$ (a “multiplication”-like operation, here used to accumulate scores along a path),

along with identity elements $\mathbf{0}$ for $\oplus$ and $\mathbf{1}$ for $\otimes$, such that:

  • $\oplus$ is associative and commutative, with identity $\mathbf{0}$,
  • $\otimes$ is associative, with identity $\mathbf{1}$,
  • $\otimes$ distributes over $\oplus$: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$,
  • $\mathbf{0}$ is absorbing for $\otimes$: $\mathbf{0} \otimes a = \mathbf{0}$.

The Viterbi semiring for probabilities is typically:

  • Domain: $[0,1]$ (or, for numerical stability, log-space: $\mathbb{R} \cup \{-\infty\}$),
  • $\oplus = \max$ (choose the most probable alternative),
  • $\otimes = \times$ (multiply probabilities along a path),
  • Identities: $\mathbf{0} = 0$ (since $\max(x,0) = x$ for $x \in [0,1]$), $\mathbf{1} = 1$.

In log-space, this becomes the max-plus semiring:

  • Domain: $\mathbb{R} \cup \{-\infty\}$,
  • $\oplus = \max$,
  • $\otimes = +$,
  • Identities: $\mathbf{0} = -\infty$, $\mathbf{1} = 0$.

Switching to log-space turns products into sums (safer numerically) and keeps the “best path” selection as a max.

Connection to the Viterbi algorithm

The Viterbi algorithm performs dynamic programming over this semiring: at each time step $t$, for each state $j$, it computes

$$ \delta_t(j) = \bigoplus_{i} \left[ \delta_{t-1}(i) \otimes a_{i,j} \right] \otimes b_j(o_t) $$

which, in the usual (probability) notation, is

$$ \delta_t(j) = \max_i \left( \delta_{t-1}(i) \cdot a_{i,j} \right) \cdot b_j(o_t), $$

where:

  • $\delta_t(j)$ is the max probability of any path ending in state $j$ after observing $o_1, \dots, o_t$,
  • $a_{i,j}$ is the transition probability from state $i$ to $j$,
  • $b_j(o_t)$ is the emission probability of observation $o_t$ from state $j$.

In log-space:

$$ \log \delta_t(j) = \max_i \left( \log \delta_{t-1}(i) + \log a_{i,j} \right) + \log b_j(o_t). $$

Backpointers record which predecessor $i$ achieved the $\max$, so the single best full path can be reconstructed.

Small concrete example

HMM with two states $A, B$:

  • Initial probs: $\pi_A = 0.6,\ \pi_B = 0.4$

  • Transitions: $A \to A = 0.7,\ A \to B=0.3;\ B\to A=0.4,\ B\to B=0.6$

  • Emissions for observation sequence $[x, y]$:

    • From $A$: $P(x)=0.5,\ P(y)=0.5$
    • From $B$: $P(x)=0.1,\ P(y)=0.9$

Step 1 (observe $x$):

$$ \delta_1(A)=0.6 \times 0.5=0.3,\quad \delta_1(B)=0.4 \times 0.1=0.04. $$

Step 2 (observe $y$):

$$ \delta_2(A)=\max(0.3 \times 0.7,\ 0.04 \times 0.4) \times 0.5 = \max(0.21,\ 0.016)\times 0.5 = 0.105,\\ \delta_2(B)=\max(0.3 \times 0.3,\ 0.04 \times 0.6) \times 0.9 = \max(0.09,\ 0.024)\times 0.9 = 0.081. $$

Final best state is $A$ with probability $0.105$; backpointers give path $A \to A$.

Contrast: sum-product semiring

The sum-product semiring (used for computing total probability or marginals) has:

  • $\oplus = +$,
  • $\otimes = \times$,
  • Identities: $\mathbf{0} = 0,\ \mathbf{1}=1$.

There, you sum over paths instead of taking the max, so you’re aggregating total mass rather than picking one best path. In log-space that requires log-sum-exp for $\oplus$, which is more expensive than $\max$.

Why thinking in semirings helps

Many dynamic programming algorithms (Viterbi, shortest path, parsing, belief propagation variants) can be seen as instances of the generalized semiring framework: replace $\oplus$ and $\otimes$ to get different behaviors (e.g., highest weight path vs. total weight, counting paths, etc.). The Viterbi semiring specializes this to “best path under multiplicative scoring.”

Properties summary

  • Associativity / commutativity: $\max$ is commutative/associative; multiplication (or addition in log-space) is associative.
  • Distributivity: $a \otimes \max(b,c) = \max(a \otimes b,\ a \otimes c)$ holds.
  • No additive inverse: It’s a semiring, not a ring — there is no operation to “undo” a $\max$.
  • Used for decoding: In speech recognition, bioinformatics, NLP (e.g., Viterbi taggers, parse scoring), etc.

If you want, I can also show how the Viterbi semiring fits into a generic semiring-based code template or how to generalize it to other semirings. Would you like that?

No comments:

Post a Comment