Explain the Viterbi semiring
The Viterbi semiring is the algebraic structure underlying the Viterbi algorithm’s “max-product” computations — it formalizes how you score and select the single best path (most probable sequence) in models like Hidden Markov Models (HMMs) or weighted finite-state automata.
Core idea
A semiring is a set $S$ equipped with two binary operations:
- $\oplus$ (an “addition”-like operation, here used to choose between alternatives),
- $\otimes$ (a “multiplication”-like operation, here used to accumulate scores along a path),
along with identity elements $\mathbf{0}$ for $\oplus$ and $\mathbf{1}$ for $\otimes$, such that:
- $\oplus$ is associative and commutative, with identity $\mathbf{0}$,
- $\otimes$ is associative, with identity $\mathbf{1}$,
- $\otimes$ distributes over $\oplus$: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$,
- $\mathbf{0}$ is absorbing for $\otimes$: $\mathbf{0} \otimes a = \mathbf{0}$.
The Viterbi semiring for probabilities is typically:
- Domain: $[0,1]$ (or, for numerical stability, log-space: $\mathbb{R} \cup \{-\infty\}$),
- $\oplus = \max$ (choose the most probable alternative),
- $\otimes = \times$ (multiply probabilities along a path),
- Identities: $\mathbf{0} = 0$ (since $\max(x,0) = x$ for $x \in [0,1]$), $\mathbf{1} = 1$.
In log-space, this becomes the max-plus semiring:
- Domain: $\mathbb{R} \cup \{-\infty\}$,
- $\oplus = \max$,
- $\otimes = +$,
- Identities: $\mathbf{0} = -\infty$, $\mathbf{1} = 0$.
Switching to log-space turns products into sums (safer numerically) and keeps the “best path” selection as a max.
Connection to the Viterbi algorithm
The Viterbi algorithm performs dynamic programming over this semiring: at each time step $t$, for each state $j$, it computes
$$ \delta_t(j) = \bigoplus_{i} \left[ \delta_{t-1}(i) \otimes a_{i,j} \right] \otimes b_j(o_t) $$
which, in the usual (probability) notation, is
$$ \delta_t(j) = \max_i \left( \delta_{t-1}(i) \cdot a_{i,j} \right) \cdot b_j(o_t), $$
where:
- $\delta_t(j)$ is the max probability of any path ending in state $j$ after observing $o_1, \dots, o_t$,
- $a_{i,j}$ is the transition probability from state $i$ to $j$,
- $b_j(o_t)$ is the emission probability of observation $o_t$ from state $j$.
In log-space:
$$ \log \delta_t(j) = \max_i \left( \log \delta_{t-1}(i) + \log a_{i,j} \right) + \log b_j(o_t). $$
Backpointers record which predecessor $i$ achieved the $\max$, so the single best full path can be reconstructed.
Small concrete example
HMM with two states $A, B$:
-
Initial probs: $\pi_A = 0.6,\ \pi_B = 0.4$
-
Transitions: $A \to A = 0.7,\ A \to B=0.3;\ B\to A=0.4,\ B\to B=0.6$
-
Emissions for observation sequence $[x, y]$:
- From $A$: $P(x)=0.5,\ P(y)=0.5$
- From $B$: $P(x)=0.1,\ P(y)=0.9$
Step 1 (observe $x$):
$$ \delta_1(A)=0.6 \times 0.5=0.3,\quad \delta_1(B)=0.4 \times 0.1=0.04. $$
Step 2 (observe $y$):
$$ \delta_2(A)=\max(0.3 \times 0.7,\ 0.04 \times 0.4) \times 0.5 = \max(0.21,\ 0.016)\times 0.5 = 0.105,\\ \delta_2(B)=\max(0.3 \times 0.3,\ 0.04 \times 0.6) \times 0.9 = \max(0.09,\ 0.024)\times 0.9 = 0.081. $$
Final best state is $A$ with probability $0.105$; backpointers give path $A \to A$.
Contrast: sum-product semiring
The sum-product semiring (used for computing total probability or marginals) has:
- $\oplus = +$,
- $\otimes = \times$,
- Identities: $\mathbf{0} = 0,\ \mathbf{1}=1$.
There, you sum over paths instead of taking the max, so you’re aggregating total mass rather than picking one best path. In log-space that requires log-sum-exp for $\oplus$, which is more expensive than $\max$.
Why thinking in semirings helps
Many dynamic programming algorithms (Viterbi, shortest path, parsing, belief propagation variants) can be seen as instances of the generalized semiring framework: replace $\oplus$ and $\otimes$ to get different behaviors (e.g., highest weight path vs. total weight, counting paths, etc.). The Viterbi semiring specializes this to “best path under multiplicative scoring.”
Properties summary
- Associativity / commutativity: $\max$ is commutative/associative; multiplication (or addition in log-space) is associative.
- Distributivity: $a \otimes \max(b,c) = \max(a \otimes b,\ a \otimes c)$ holds.
- No additive inverse: It’s a semiring, not a ring — there is no operation to “undo” a $\max$.
- Used for decoding: In speech recognition, bioinformatics, NLP (e.g., Viterbi taggers, parse scoring), etc.
If you want, I can also show how the Viterbi semiring fits into a generic semiring-based code template or how to generalize it to other semirings. Would you like that?
No comments:
Post a Comment