This document derives the complete mathematical formulation of the OmniVoice model
directly from the implementation in omnivoice/model.py. Sections follow the code’s
logical structure: input embedding, forward pass & loss, inference scheduling, and
helper functions.
## Model Parameters

The model is parameterized by:
| Symbol | Code name | Description |
|---|---|---|
| $V$ | audio_vocab_size | Acoustic token vocabulary size (default 1025) |
| $M$ | audio_mask_id | Mask token index (default 1024) |
| $C$ | num_audio_codebook | Number of codebook layers (default 8) |
| $H$ | hidden_size | LLM hidden dimension |
| $\mathbf{w} = (w_1, \dots, w_C)$ | audio_codebook_weights | Per-layer loss weights (default $[8,8,6,6,4,4,2,2]$) |
The normalized layer weights used in the loss are:
\[\tilde{w}_c = \frac{w_c}{\sum_{c'=1}^{C} w_{c'}}, \quad c = 1, \dots, C\]

## Input Embedding (_prepare_embed_inputs)

The model receives a joint input tensor of shape $(B, C, S)$, where $B$ is the batch size and $S$ is the sequence length. Each position is either a text token or an audio token, distinguished by the binary audio mask $\mathbf{a} \in \{0,1\}^{B \times S}$.
For text positions, the standard LLM token embedding $E_{\text{text}} \in \mathbb{R}^{|\mathcal{V}_{\text{text}}| \times H}$ is applied to the first codebook layer (layer index 0):

\[\mathbf{h}^{\text{text}}_{b,s} = E_{\text{text}}\!\left(\mathbf{x}_{b,0,s}\right) \in \mathbb{R}^{H}\]
Each codebook layer $c \in \{0, \dots, C-1\}$ has a dedicated segment within a shared embedding table $E_{\text{audio}} \in \mathbb{R}^{(C \cdot V) \times H}$. Layer $c$ is offset by $c \cdot V$, so the token ID for codebook $c$ at position $s$ is shifted as:
\[\hat{x}_{b,c,s} = \mathbf{x}_{b,c,s} \cdot \mathbf{a}_{b,s} + c \cdot V\]

(At text positions the mask zeroes the token ID, so the lookup lands on index $c \cdot V$; the result is discarded by the mask selection below.) The audio embedding at position $s$ is the sum over all codebook layers:

\[\mathbf{h}^{\text{audio}}_{b,s} = \sum_{c=0}^{C-1} E_{\text{audio}}\!\left(\hat{x}_{b,c,s}\right) \in \mathbb{R}^{H}\]

The final per-position embedding selects audio or text based on the mask:

\[\mathbf{h}_{b,s} = \begin{cases} \mathbf{h}^{\text{audio}}_{b,s} & \text{if } \mathbf{a}_{b,s} = 1 \\ \mathbf{h}^{\text{text}}_{b,s} & \text{if } \mathbf{a}_{b,s} = 0 \end{cases}\]

## Forward Pass (forward)

The combined embedding $\mathbf{H} \in \mathbb{R}^{B \times S \times H}$ is passed through the bidirectional Transformer backbone $f_\theta$:
\[\mathbf{Z} = f_\theta(\mathbf{H},\ \mathbf{A}) \in \mathbb{R}^{B \times S \times H}\]

where $\mathbf{A}$ is the attention mask (either a standard boolean mask or a FlexAttention block mask for sequence-packed training).
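As a concrete illustration, the embedding construction above can be sketched with NumPy. This is a toy-sized sketch, not the actual omnivoice/model.py code; all variable names and shapes here are illustrative:

```python
import numpy as np

# Toy sizes (illustrative, not the real defaults).
B, C, S, H, V = 2, 4, 5, 8, 16
rng = np.random.default_rng(0)

E_text = rng.normal(size=(32, H))        # text vocab embedding (32 text tokens)
E_audio = rng.normal(size=(C * V, H))    # shared audio table, one V-sized segment per codebook

x = rng.integers(0, V, size=(B, C, S))   # joint input tokens
a = rng.integers(0, 2, size=(B, S))      # binary audio mask

# Audio path: zero IDs at text positions, offset each layer into its segment,
# embed, and sum over codebook layers.
offsets = (np.arange(C) * V)[None, :, None]          # (1, C, 1)
x_hat = x * a[:, None, :] + offsets                  # (B, C, S)
h_audio = E_audio[x_hat].sum(axis=1)                 # (B, S, H)

# Text path: embed only the first codebook layer.
h_text = E_text[x[:, 0, :]]                          # (B, S, H)

# Select audio or text per position according to the mask.
h = np.where(a[..., None] == 1, h_audio, h_text)     # (B, S, H)
```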
A single linear head $W_{\text{head}} \in \mathbb{R}^{H \times (C \cdot V)}$ projects the hidden states to per-codebook logits:
\[\mathbf{L}_{\text{flat}} = \mathbf{Z}\, W_{\text{head}}^\top \in \mathbb{R}^{B \times S \times (C \cdot V)}\]

These are reshaped and transposed to:

\[\mathbf{L} \in \mathbb{R}^{B \times C \times S \times V}\]

where $L_{b,c,s,v}$ is the unnormalized log-probability of token $v$ at codebook $c$, position $s$, sample $b$.
## Training Loss (forward, labels provided)

Let $\mathcal{M}$ denote the set of labeled (unmasked target) positions, i.e. those with label index $\neq -100$. The cross-entropy loss at each labeled position $(b, c, s)$ is:

\[\ell_{b,c,s} = -\log \frac{\exp\!\left(L_{b,c,s,\, y_{b,c,s}}\right)}{\sum_{v=0}^{V-1} \exp\!\left(L_{b,c,s,v}\right)}\]

where $y_{b,c,s}$ is the ground-truth token. Positions with $y_{b,c,s} = -100$ are excluded (ignore_index=-100).

Let $\mathbf{1}_{b,c,s} = \mathbf{1}[y_{b,c,s} \neq -100]$ be the validity indicator. The mean loss for codebook layer $c$ is:

\[\bar{\ell}_c = \frac{\displaystyle\sum_{b,s} \ell_{b,c,s} \cdot \mathbf{1}_{b,c,s}}{\displaystyle\max\!\left(1,\ \sum_{b,s} \mathbf{1}_{b,c,s}\right)}\]

The final scalar training loss is a weighted sum over codebook layers:

\[\boxed{\mathcal{L} = \sum_{c=1}^{C} \tilde{w}_c \cdot \bar{\ell}_c}\]

where $\tilde{w}_c$ are the normalized codebook weights defined above.
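The weighted masked cross-entropy can be sketched as follows. This is a minimal NumPy version with toy shapes and hypothetical variable names, not the actual implementation (which uses the framework's cross-entropy with ignore_index=-100):

```python
import numpy as np

# Toy sizes and weights (illustrative).
B, C, S, V = 2, 4, 3, 8
rng = np.random.default_rng(1)

w = np.array([8.0, 8.0, 6.0, 6.0])
w_tilde = w / w.sum()                       # normalized layer weights

logits = rng.normal(size=(B, C, S, V))
labels = rng.integers(0, V, size=(B, C, S))
labels[0, :, 0] = -100                      # mark some positions as ignored

# Per-position negative log-likelihood with ignore_index=-100.
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
valid = labels != -100
safe = np.where(valid, labels, 0)           # dummy index at ignored positions
nll = -np.take_along_axis(log_probs, safe[..., None], axis=-1)[..., 0]

# Mean over valid positions per codebook layer, then weighted sum over layers.
per_layer = (nll * valid).sum(axis=(0, 2)) / np.maximum(1, valid.sum(axis=(0, 2)))
loss = (w_tilde * per_layer).sum()
```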
## Iterative Generation (_generate_iterative)

### Time-Step Schedule (_get_time_steps)

A sequence of $N+1$ evenly spaced time steps is first constructed over $[0, 1]$, then warped by a shift parameter $\tau$:

\[t_n^{\text{linear}} = \frac{n}{N}, \quad n = 0, 1, \dots, N\]

\[\boxed{t_n = \frac{\tau \cdot t_n^{\text{linear}}}{1 + (\tau - 1)\cdot t_n^{\text{linear}}}}\]

where $\tau = 0.1$ by default (shift parameter t_shift). The schedule increases monotonically from $t_0 = 0$ to $t_N = 1$, with smaller $\tau$ concentrating more steps near $t = 0$ (i.e., unmasking fewer tokens per early step).
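A minimal sketch of the warped schedule (the helper name time_steps is hypothetical):

```python
import numpy as np

def time_steps(N: int, tau: float = 0.1) -> np.ndarray:
    """Evenly spaced steps on [0, 1], warped by the shift parameter tau."""
    t = np.arange(N + 1) / N
    return tau * t / (1 + (tau - 1) * t)

# Monotone from 0 to 1; small tau packs steps near t = 0.
ts = time_steps(10, tau=0.1)
```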
### Per-Step Unmasking Budget

Let $T$ be the target sequence length and $K = T \cdot C$ the total number of tokens to unmask. At step $n$ ($n = 0, \dots, N-1$), the number of tokens to unmask is:

\[k_n = \min\!\left(\left\lceil K \cdot (t_{n+1} - t_n) \right\rceil,\ K_{\text{rem}}\right)\]

where $K_{\text{rem}}$ is the count of still-masked tokens. On the final step $n = N-1$, all remaining tokens are unmasked: $k_{N-1} = K_{\text{rem}}$.
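The budget bookkeeping can be checked numerically: the min/ceil combination plus the final catch-all step guarantees the counts sum exactly to $K$. A sketch with toy sizes and hypothetical names:

```python
import math

# Toy sizes (illustrative): target length, codebooks, steps, shift.
T, C, N, tau = 25, 8, 10, 0.1
K = T * C
ts = [tau * (n / N) / (1 + (tau - 1) * (n / N)) for n in range(N + 1)]

remaining = K
counts = []
for n in range(N):
    if n == N - 1:
        k = remaining                                    # final step unmasks everything left
    else:
        k = min(math.ceil(K * (ts[n + 1] - ts[n])), remaining)
    counts.append(k)
    remaining -= k
```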
### Classifier-Free Guidance (_predict_tokens_with_scoring)

At each decoding step, the model is run in two conditions simultaneously:

- **Conditional:** the full input with style, text, and reference-audio conditioning.
- **Unconditional:** the target segment only, with no conditioning.
Let $\mathbf{L}^{\text{cond}}$ and $\mathbf{L}^{\text{uncond}} \in \mathbb{R}^{C \times T \times V}$ be the respective logit tensors, with log-softmax probabilities $\log p^{\text{cond}} = \log\text{softmax}(\mathbf{L}^{\text{cond}})$ and $\log p^{\text{uncond}} = \log\text{softmax}(\mathbf{L}^{\text{uncond}})$ taken over the vocabulary axis. The guided log-probability is:

\[\log \tilde{p}_{c,t,v} = \log p^{\text{cond}}_{c,t,v} + \gamma \cdot \left(\log p^{\text{cond}}_{c,t,v} - \log p^{\text{uncond}}_{c,t,v}\right) = (1 + \gamma)\, \log p^{\text{cond}}_{c,t,v} - \gamma\, \log p^{\text{uncond}}_{c,t,v}\]

where $\gamma$ is guidance_scale (default 2.0).
The final guided log-probabilities are renormalized via an additional log-softmax:

\[\log q_{c,t,v} = \log\text{softmax}\!\left(\log \tilde{p}_{c,t,\cdot}\right)_v\]

The mask token index $M$ is suppressed at this stage: $\log q_{c,t,M} = -\infty$.
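A sketch of the guidance combination, renormalization, and mask suppression (toy shapes; log_softmax here is a naive stand-in for a numerically stable implementation):

```python
import numpy as np

# Toy sizes (illustrative): codebooks, positions, vocab, mask index, guidance scale.
C, T, V, M, gamma = 2, 3, 5, 4, 2.0
rng = np.random.default_rng(2)

def log_softmax(x):
    # Naive log-softmax over the last axis (fine for small random values).
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

L_cond = rng.normal(size=(C, T, V))
L_uncond = rng.normal(size=(C, T, V))

lp_c = log_softmax(L_cond)
lp_u = log_softmax(L_uncond)
guided = (1 + gamma) * lp_c - gamma * lp_u   # CFG combination in log space

log_q = log_softmax(guided)                  # renormalize
log_q[..., M] = -np.inf                      # suppress the mask token
```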
Greedy (class_temperature = 0):
\(\hat{x}_{c,t} = \operatorname{argmax}_{v}\; \log q_{c,t,v}\)
Stochastic (class_temperature $> 0$, top-$k$ filtered Gumbel sampling):
Let $\kappa = \lceil \rho \cdot V \rceil$ with ratio $\rho = 0.1$. Define the top-$\kappa$ filtered logits:
\[\tilde{q}_{c,t,v} = \begin{cases} \log q_{c,t,v} & \text{if } v \in \text{Top-}\kappa(\log q_{c,t,\cdot}) \\ -\infty & \text{otherwise} \end{cases}\]

Then apply Gumbel noise with temperature $\tau_{\text{cls}}$ (class_temperature):
\[g_v \sim \text{Gumbel}(0,1), \quad \hat{x}_{c,t} = \operatorname{argmax}_{v}\left(\frac{\tilde{q}_{c,t,v}}{\tau_{\text{cls}}} + g_v\right)\]where $g_v = -\log(-\log(u_v + \epsilon) + \epsilon)$, $u_v \sim \mathcal{U}(0,1)$, $\epsilon = 10^{-10}$.
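The top-$\kappa$ Gumbel sampling step, sketched for a single $(c, t)$ position with hypothetical names (the real code vectorizes this over all positions):

```python
import math
import numpy as np

# Toy setup (illustrative): vocab size, top-k ratio, class temperature, epsilon.
V, rho, tau_cls, eps = 20, 0.1, 1.0, 1e-10
rng = np.random.default_rng(3)

log_q = rng.normal(size=V)                  # guided log-probs for one position
kappa = math.ceil(rho * V)                  # top-k cutoff

# Keep only the kappa largest entries; everything else gets -inf.
filtered = np.full(V, -np.inf)
top = np.argsort(log_q)[-kappa:]
filtered[top] = log_q[top]

# Gumbel-max sampling: argmax of temperature-scaled logits plus Gumbel noise.
u = rng.uniform(size=V)
g = -np.log(-np.log(u + eps) + eps)
x_hat = int(np.argmax(filtered / tau_cls + g))
```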
The confidence score for a predicted token at codebook layer $c$, position $t$ is the maximum log-probability:
\[s_{c,t} = \max_{v}\; \log q_{c,t,v}\]

A layer penalty is subtracted so that lower codebook layers (which carry the more fundamental acoustic features) are unmasked first:

\[\tilde{s}_{c,t} = s_{c,t} - c \cdot \lambda\]

where $\lambda$ is layer_penalty_factor (default 5.0) and $c \in \{0, \dots, C-1\}$ is the zero-indexed codebook layer.
To introduce stochasticity in position selection, Gumbel noise with temperature $\tau_{\text{pos}}$ is added to the penalized scores:
\[\hat{s}_{c,t} = \frac{\tilde{s}_{c,t}}{\tau_{\text{pos}}} + g_{c,t}, \quad g_{c,t} \sim \text{Gumbel}(0,1)\]

where $\tau_{\text{pos}}$ is position_temperature (default 5.0). Positions that are already unmasked are excluded by setting $\hat{s}_{c,t} = -\infty$ there.
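A sketch of the penalized, noised confidence scores (toy shapes; all names here are hypothetical):

```python
import numpy as np

# Toy setup (illustrative): codebooks, positions, penalty, temperature, mask id.
C, T, lam, tau_pos, M = 4, 6, 5.0, 5.0, 9
rng = np.random.default_rng(4)

log_q = rng.normal(size=(C, T, 10))          # guided log-probs
x = rng.integers(0, 10, size=(C, T))         # current tokens; M marks masked slots
x[:, ::2] = M                                # force some positions to be masked

s = log_q.max(axis=-1)                                   # confidence per slot
s_pen = s - np.arange(C)[:, None] * lam                  # layer penalty
g = -np.log(-np.log(rng.uniform(size=(C, T)) + 1e-10) + 1e-10)
s_hat = s_pen / tau_pos + g                              # noised scores
s_hat[x != M] = -np.inf                                  # only masked slots compete
```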
The $k_n$ positions with the highest scores are selected for unmasking at step $n$:
\[\mathcal{U}_n = \operatorname{Top-}k_n\!\left\{\hat{s}_{c,t} : x_{c,t} = M\right\}\]

For each $(c, t) \in \mathcal{U}_n$, the token is set to the predicted value:

\[x_{c,t} \leftarrow \hat{x}_{c,t}\]

## Inference Input Construction (_prepare_inference_inputs)

The full conditional input sequence is formed by concatenating four segments along the temporal axis:
\[\mathbf{x}^{\text{cond}} = \bigl[\underbrace{\mathbf{x}^{\text{style}}}_{N_1}\ \|\ \underbrace{\mathbf{x}^{\text{text}}}_{N_2}\ \|\ \underbrace{\mathbf{x}^{\text{ref}}}_{T_p}\ \|\ \underbrace{\mathbf{x}^{\text{target}}}_{T}\bigr] \in \mathbb{Z}^{C \times S}\]

where $S = N_1 + N_2 + T_p + T$ and:

- $\mathbf{x}^{\text{style}}$: style/instruction tokens (<|denoise|>, <|lang_start|>…<|lang_end|>, <|instruct_start|>…<|instruct_end|>)
- $\mathbf{x}^{\text{text}}$: target text tokens (<|text_start|>…<|text_end|>)
- $\mathbf{x}^{\text{ref}}$: reference audio tokens ($T_p$ frames)
- $\mathbf{x}^{\text{target}}$: the $T$ target positions, initialized to the mask token $M$

The binary audio mask $\mathbf{a} \in \{0,1\}^S$ marks positions belonging to reference or target audio:

\[a_s = \begin{cases} 1 & \text{if } s \geq N_1 + N_2 \\ 0 & \text{otherwise} \end{cases}\]

The unconditional input contains only the target segment, with no conditioning:
\[\mathbf{x}^{\text{uncond}} = \mathbf{x}^{\text{target}} \in \mathbb{Z}^{C \times T}\]

## Audio Post-Processing (_post_process_audio)

Let $\rho_{\text{ref}}$ be the RMS of the reference audio waveform. If the reference audio is quiet ($\rho_{\text{ref}} < 0.1$), the generated audio $\hat{\mathbf{y}}$ is rescaled to match:

\[\hat{\mathbf{y}} \leftarrow \hat{\mathbf{y}} \cdot \frac{\rho_{\text{ref}}}{0.1}\]

When no reference audio is provided (voice design mode), peak normalization is applied:

\[\hat{\mathbf{y}} \leftarrow \frac{\hat{\mathbf{y}}}{\|\hat{\mathbf{y}}\|_\infty} \cdot 0.5\]

The reference audio itself is boosted to a normalized level before tokenization: if $0 < \rho_{\text{ref}} < 0.1$,

\[\mathbf{y}_{\text{ref}} \leftarrow \mathbf{y}_{\text{ref}} \cdot \frac{0.1}{\rho_{\text{ref}}}\]

## Duration Estimation (_estimate_target_tokens)

The estimated number of target audio tokens $\hat{T}$ is produced by a rule-based duration estimator. Given a reference text of $|r|$ characters mapped to $T_p$ reference audio frames, the per-character rate is:

\[\rho_{\text{char}} = \frac{T_p}{|r|}\]
The raw target token count is:

\[\hat{T} = \left\lfloor \rho_{\text{char}} \cdot |t| \right\rfloor\]

where $|t|$ is the character length of the target text.
Given a speed factor $\alpha > 0$:
\[T = \max\!\left(1,\ \left\lfloor \frac{\hat{T}}{\alpha} \right\rfloor\right)\]

Values $\alpha > 1$ produce faster (shorter) speech; $\alpha < 1$ produces slower (longer) speech.
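A worked example of the duration estimate and speed control (the numbers are illustrative, not defaults from the code):

```python
import math

# Illustrative inputs: reference audio frames, reference/target text lengths, speed.
T_p, ref_chars = 250, 50
target_chars, alpha = 80, 1.25

rate = T_p / ref_chars                     # frames per character (5.0 here)
T_hat = math.floor(rate * target_chars)    # raw target token count
T = max(1, math.floor(T_hat / alpha))      # speed-adjusted count
```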
When an explicit duration $d$ (in seconds) is specified, the target token count is set directly from the audio tokenizer frame rate $f_r$:
\[T = \max\!\left(1,\ \lfloor d \cdot f_r \rfloor\right)\]

The effective speed ratio for chunked generation is then back-calculated as:

\[\alpha_{\text{eff}} = \frac{\hat{T}}{T}\]

## Packed Attention Mask (_mask_mod_packed)

For efficient training with sequence packing, a document-aware mask is applied. Each position $s$ in the packed sequence is assigned a document ID $\text{doc}(s)$, and position $q$ may attend to position $k$ only if they belong to the same document:
\[A_{q,k} = \begin{cases} 1 & \text{if } \text{doc}(q) = \text{doc}(k) \\ 0 & \text{otherwise} \end{cases}\]

Since OmniVoice uses a bidirectional Transformer, there is no causal constraint: the full within-document context is available at every position.
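The document mask reduces to a single broadcast comparison. A minimal dense sketch (the real training path builds a FlexAttention block mask rather than materializing a dense matrix):

```python
import numpy as np

# Three packed documents of lengths 3, 2, and 1 (illustrative).
doc = np.array([0, 0, 0, 1, 1, 2])          # per-position document IDs
A = doc[:, None] == doc[None, :]            # (S, S) boolean attention mask
```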
## Summary of Key Formulas

| Component | Formula | Location |
|---|---|---|
| Normalized codebook weights | $\tilde{w}_c = w_c / \sum_{c'} w_{c'}$ | __init__ |
| Audio embedding offset | $\hat{x}_{b,c,s} = x_{b,c,s} \cdot a_{b,s} + c V$ | _prepare_embed_inputs |
| Summed audio embedding | $\mathbf{h}^{\text{audio}}_{b,s} = \sum_c E_{\text{audio}}(\hat{x}_{b,c,s})$ | _prepare_embed_inputs |
| Per-token cross-entropy | $\ell_{b,c,s} = -\log p_{b,c,s,y_{b,c,s}}$ | forward |
| Weighted layer loss | $\mathcal{L} = \sum_c \tilde{w}_c \bar{\ell}_c$ | forward |
| Time-step warp | $t_n = \tau t_n^{\text{lin}} / (1 + (\tau-1)t_n^{\text{lin}})$ | _get_time_steps |
| CFG log-prob | $\log q = \text{logsoftmax}((1{+}\gamma)\log p^c - \gamma \log p^u)$ | _predict_tokens_with_scoring |
| Layer penalty | $\tilde{s}_{c,t} = s_{c,t} - c\lambda$ | _generate_iterative |
| Position Gumbel sampling | $\hat{s}_{c,t} = \tilde{s}_{c,t}/\tau_{\text{pos}} + g_{c,t}$ | _generate_iterative |
| Volume normalization | $\hat{\mathbf{y}} \leftarrow \hat{\mathbf{y}} \cdot \rho_{\text{ref}}/0.1$ | _post_process_audio |
| Speed control | $T = \max(1, \lfloor \hat{T}/\alpha \rfloor)$ | _estimate_target_tokens |
| Packed attention | $A_{q,k} = \mathbf{1}[\text{doc}(q) = \text{doc}(k)]$ | _mask_mod_packed |