A2D2: Finetuning Any-Length Discrete Diffusion for Adaptive Decoding

¹University of Pennsylvania    ²Georgia Institute of Technology
[Figure: A2D2 framework overview showing adaptive joint decoding of any-length discrete diffusion]

A2D2 introduces a unified framework for reward-guided fine-tuning of any-length masked discrete diffusion models. 🃏🔮 By jointly optimizing insertion and unmasking policies with learned quality predictors, A2D2 enables adaptive decoding that provably samples from an intractable reward-tilted distribution without requiring target samples.

Abstract

Masked discrete diffusion models (MDMs) offer a simple and stable likelihood-based framework for sequence generation and have recently been extended to any-length settings via token insertion. However, principled reward-guided fine-tuning for any-length discrete diffusion remains largely unexplored.

We introduce Finetuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length MDMs. A2D2 formulates generation as a controlled continuous-time Markov chain and jointly optimizes insertion and unmasking policies to learn a reward-tilted path measure without requiring target samples. We derive the Radon–Nikodym derivative for the joint insertion–unmasking process and introduce the Adaptive Joint Decoding (AJD) loss, which provably minimizes trajectory-induced error while preserving the target distribution. Empirically, A2D2 improves reward optimization, generation accuracy, and flexibility over prior fixed-length and inference-time guidance methods.

Key Contributions

  • We introduce A2D2, a unified framework for reward-guided fine-tuning of any-length masked discrete diffusion models via joint optimization of the insertion policy, the unmasking policy, and a quality-based inference schedule.
  • We derive the Radon–Nikodym derivative for the joint insertion and unmasking path measures, enabling theoretically guaranteed convergence to an intractable reward-tilted sequence distribution.
  • We establish unmasking and insertion quality as tractable methods of minimizing compounding parallelization error (CPE) and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that minimizes error and generates the reward-tilted distribution.
  • We demonstrate that A2D2 optimizes rewards while improving generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance approaches on multi-objective therapeutic peptide design.

Overview of Framework

🃏 Defining Unmasking and Insertion Quality 🔮

A key challenge in any-length discrete diffusion is that model performance is highly sensitive to the chosen insertion and unmasking trajectory. We formally define the quality of a step taken in the decoding process.

Unmasking Quality

We define the unmasking quality as the probability that the unmasked token is sampled from the unmasking posterior given the context from the rest of the sequence:

$$\mu_\star^\ell(\boldsymbol{y}) := p(\boldsymbol{y}^{\ell} = \boldsymbol{x}_1^\ell \mid \boldsymbol{y}) = f_\theta(\tilde{\boldsymbol{x}}_t, t)[\ell, \boldsymbol{x}_1^\ell]$$

We train a parameterized model $\mu_\phi : \mathcal{X} \to [0,1]$ to predict the unmasking quality by minimizing the Unmasking Quality Loss (UQL):

$$\mathcal{L}_{\text{UQL}}(\phi; \boldsymbol{x}_1) := \underset{t \sim \mathcal{U}(0,1)}{\mathbb{E}} \underset{\tilde{\boldsymbol{x}}_t, \boldsymbol{y}}{\mathbb{E}} \left[ \sum_{\ell \in \mathcal{M}} \text{BCE}\left(\boldsymbol{1}[\boldsymbol{y}^\ell = \boldsymbol{x}_1^\ell], \mu_\phi^\ell(\boldsymbol{y})\right) \right]$$

During inference, the unmasking quality determines which tokens are inconsistent and should be re-masked, enabling adaptive unmasking that maximizes the probability of optimal parallel unmasking.
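The UQL sum for a single draw can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper names `bce` and `unmasking_quality_loss`, the list-based token representation, and the toy peptide-like tokens are all assumptions made for the example, and the outer expectations over $t$, $\tilde{\boldsymbol{x}}_t$, and $\boldsymbol{y}$ are omitted.

```python
import math

def bce(target, pred, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a prediction in (0, 1)."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def unmasking_quality_loss(sampled, clean, masked_positions, predicted_quality):
    """Sum of BCE terms over the masked positions for one (x_t, y) draw.

    sampled:            tokens y proposed by the unmasking policy
    clean:              ground-truth tokens x_1
    masked_positions:   indices in the mask set M
    predicted_quality:  hypothetical outputs mu_phi^l(y), one per position
    """
    loss = 0.0
    for pos in masked_positions:
        label = 1.0 if sampled[pos] == clean[pos] else 0.0  # indicator 1[y^l = x_1^l]
        loss += bce(label, predicted_quality[pos])
    return loss

# Toy draw: positions 1 and 3 were masked; the sample is right at 1 and wrong at 3.
clean = ["A", "C", "D", "E"]
sampled = ["A", "C", "D", "K"]
quality = [0.9, 0.8, 0.9, 0.3]
loss = unmasking_quality_loss(sampled, clean, [1, 3], quality)
```

In this toy draw the predictor is rewarded for assigning high quality to the correct token at position 1 and low quality to the incorrect token at position 3.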

Insertion Quality

The insertion quality is defined as the probability that an inserted mask token is decoded into a true token in the corresponding gap of the target sequence:

$$\nu_\star^\ell(\boldsymbol{y}) := \sum_{i=s_t[\ell-1]}^{s_t[\ell]} p(\boldsymbol{y}^{\ell} = \boldsymbol{x}_1^{i} \mid \boldsymbol{y})$$

We train a parameterized model $\nu_\phi$ by minimizing the Insertion Quality Loss (IQL):

$$\mathcal{L}_{\text{IQL}}(\phi; \boldsymbol{x}_1) := \underset{t \sim \mathcal{U}(0,1)}{\mathbb{E}} \underset{\tilde{\boldsymbol{x}}_t, \boldsymbol{y}}{\mathbb{E}} \left[ \sum_{i \in \mathcal{I}} \text{BCE}\big(\nu_\star^i(\boldsymbol{y}), \nu_\phi^i(\boldsymbol{y})\big) \right]$$

Maximizing insertion quality provides an upper bound for the probability of reconstructing a clean sequence, justifying its use for adaptive removal of low-quality insertions.
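As a rough sketch of how the IQL target could be computed for one inserted mask, the snippet below sums the decoding probabilities over the clean tokens in a gap and scores a prediction with a soft-target BCE. The function names, the dictionary of token probabilities, and the inclusive index range (read literally from the summation bounds above) are assumptions for illustration, not details confirmed by the source.

```python
import math

def insertion_quality_target(token_probs, clean, gap_start, gap_end):
    """Sum p(y^l = x_1^i | y) over clean-sequence indices i in [gap_start, gap_end].

    token_probs: hypothetical token -> probability map at the inserted mask
    clean:       ground-truth tokens x_1
    """
    return sum(token_probs.get(clean[i], 0.0) for i in range(gap_start, gap_end + 1))

def soft_bce(target, pred, eps=1e-12):
    """Binary cross-entropy with a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

# Toy gap: the inserted mask sits where clean tokens "C" and "D" (indices 1..2) belong.
clean = ["A", "C", "D", "E", "F"]
token_probs = {"C": 0.5, "D": 0.2, "K": 0.3}
target = insertion_quality_target(token_probs, clean, 1, 2)

# A prediction equal to the target gives a lower loss than a mismatched one.
loss_matched = soft_bce(target, 0.7)
loss_mismatched = soft_bce(target, 0.4)
```

Because the target is a probability rather than a hard label, the BCE is minimized when the predictor outputs the target value itself.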

🃏 Adaptive Joint Decoding Loss 🔮

We define the optimal reward-tilted path measure for any-length MDMs as:

$$\mathbb{P}^\star(\boldsymbol{X}_{0:1}) := \frac{1}{Z} \mathbb{P}^{\text{pre}}(\boldsymbol{X}_{0:1}) \exp\left(\frac{r(\boldsymbol{X}_1)}{\alpha}\right)$$
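To build intuition for the tilting, the toy example below applies the same reweighting to a small finite set of terminal sequences instead of full trajectories. The sequences, rewards, and the function name `reward_tilt` are made up for illustration; in the actual setting the normalizer $Z$ is intractable, which is why fine-tuning is needed.

```python
import math

def reward_tilt(p_pre, rewards, alpha):
    """Return p*(x) proportional to p_pre(x) * exp(r(x) / alpha) over a finite support."""
    weights = {x: p_pre[x] * math.exp(rewards[x] / alpha) for x in p_pre}
    z = sum(weights.values())  # normalizer Z, tractable only in this toy case
    return {x: w / z for x, w in weights.items()}

# Hypothetical pre-trained distribution and rewards over three sequences.
p_pre = {"AAK": 0.5, "AKK": 0.3, "KKK": 0.2}
rewards = {"AAK": 0.0, "AKK": 1.0, "KKK": 2.0}

tilted = reward_tilt(p_pre, rewards, alpha=1.0)
sharp = reward_tilt(p_pre, rewards, alpha=0.1)
```

A smaller $\alpha$ concentrates more mass on high-reward sequences, while a large $\alpha$ keeps the tilted distribution close to the pre-trained one.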

To optimize toward $\mathbb{P}^\star$, we derive the Radon–Nikodym derivative for the joint insertion–unmasking CTMC path measures. This yields our Adaptive Joint Decoding (AJD) loss:

$$\mathcal{L}_{\text{AJD}}(\theta, \phi) := \underset{\boldsymbol{X}_{0:1} \sim \mathbb{P}^v}{\mathbb{E}} \left[ \frac{1}{Z} e^{W^v} \big[ \mathcal{L}_{\text{unmask}}(\theta; \boldsymbol{X}_1) + \mathcal{L}_{\text{insert}}(\theta; \boldsymbol{X}_1) + \mathcal{L}_{\text{UQL}}(\phi; \boldsymbol{X}_1) + \mathcal{L}_{\text{IQL}}(\phi; \boldsymbol{X}_1) \big] \right]$$

The AJD loss provably yields the optimal unmasking and insertion generators that minimize trajectory-induced error while generating the reward-tilted distribution. It jointly optimizes:

  • The unmasking policy $f_\theta$ and insertion policy $g_\theta$
  • The unmasking quality predictor $\mu_\phi$ and insertion quality predictor $\nu_\phi$
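One way to read the weighting in the AJD loss is as an importance-weighted sum of per-trajectory losses. The sketch below assumes that each sampled trajectory comes with a scalar log-weight standing in for $W^v$ and that $Z$ is estimated by normalizing the weights within the batch. Both are assumptions for illustration, since this page does not spell out how $W^v$ or $Z$ are computed.

```python
import math

def self_normalized_weights(log_weights):
    """Softmax over log-weights, shifted by the max for numerical stability."""
    m = max(log_weights)
    exps = [math.exp(w - m) for w in log_weights]
    total = sum(exps)
    return [e / total for e in exps]

def ajd_batch_loss(log_weights, unmask, insert, uql, iql):
    """Weighted sum of the four per-trajectory loss terms.

    Each argument is a list with one entry per sampled trajectory.
    """
    weights = self_normalized_weights(log_weights)
    return sum(
        w * (l_unmask + l_insert + l_uql + l_iql)
        for w, l_unmask, l_insert, l_uql, l_iql in zip(weights, unmask, insert, uql, iql)
    )

# Two hypothetical trajectories with equal weight: the result is the mean total loss.
loss = ajd_batch_loss(
    log_weights=[0.0, 0.0],
    unmask=[1.0, 3.0],
    insert=[0.5, 0.5],
    uql=[0.25, 0.25],
    iql=[0.25, 0.25],
)
```

Trajectories with larger log-weights contribute more to the gradient, which is what pushes both policies and both quality predictors toward high-reward paths.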

🃏 Adaptive Inference 🔮

At each discrete time step during generation, A2D2 performs:

  1. Adaptive Unmasking: Sample tokens to unmask via $f_\theta$, predict unmasking quality via $\mu_\phi$, and re-mask low-quality tokens that fall below a threshold.
  2. Adaptive Insertion: Insert masks according to $g_\theta$, predict insertion quality via $\nu_\phi$, and remove low-quality insertions.

Generation stops when no masks remain and the insertion expectation falls below $0.5$, or when the total number of time steps is reached.
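The filtering steps and stopping rule above can be sketched as follows. This is a simplified illustration under stated assumptions: the `MASK` symbol, the function names, the dictionary of quality scores, and the use of a threshold for removing insertions are hypothetical, and the policies $f_\theta$, $g_\theta$ and the predictors $\mu_\phi$, $\nu_\phi$ are replaced by hard-coded values.

```python
MASK = "[MASK]"  # hypothetical mask symbol

def remask_low_quality(tokens, newly_unmasked, quality, threshold):
    """Re-mask freshly unmasked positions whose predicted quality is below threshold."""
    out = list(tokens)
    for pos in newly_unmasked:
        if quality[pos] < threshold:
            out[pos] = MASK
    return out

def drop_low_quality_insertions(tokens, inserted, quality, threshold):
    """Remove freshly inserted masks whose predicted insertion quality is below threshold."""
    drop = {pos for pos in inserted if quality[pos] < threshold}
    return [tok for pos, tok in enumerate(tokens) if pos not in drop]

def should_stop(tokens, insertion_expectation, step, max_steps):
    """Stop when no masks remain and expected insertions < 0.5, or the step budget is spent."""
    finished = MASK not in tokens and insertion_expectation < 0.5
    return finished or step >= max_steps

# Toy step: position 1 was just unmasked to "K"; position 3 is a newly inserted mask.
tokens = ["A", "K", "D", MASK]
quality = {1: 0.2, 3: 0.1}  # stand-ins for mu_phi at position 1 and nu_phi at position 3
after_unmask = remask_low_quality(tokens, [1], quality, threshold=0.5)
after_insert = drop_low_quality_insertions(after_unmask, [3], quality, threshold=0.5)
```

In this toy step the low-quality token at position 1 is returned to a mask and the low-quality insertion at position 3 is removed, so the sequence still contains a mask and generation continues.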

Experiments

Multi-Objective Therapeutic Peptide Generation 💉

We pre-train an any-length MDM on a dataset of 11 million peptide SMILES and use A2D2 to fine-tune the model for multiple therapeutic properties, including binding affinity to a protein target, solubility, non-hemolysis, non-fouling, and membrane permeability. We evaluate against fixed-length masked diffusion model baselines, including PepTune (inference-time multi-objective guidance) and TR2-D2 (off-policy RL for fixed-length fine-tuning).

[Figure: Multi-objective fine-tuning curves showing reward optimization with and without quality predictors]

Fine-tuning any-length MDMs with the AJD-weighted loss yields significant increases in reward across multiple objectives, as shown by consistently increasing evaluation curves. Compared to both pre-trained fixed-length and any-length baselines, A2D2 produces higher-scoring sequences across most objectives at the same inference cost. Notably, A2D2 with quality-based adaptive inference increases the fraction of valid peptide sequences, indicating that maximizing quality translates empirically to higher-quality and more accurate generation.


We compare against unconditional sampling from the pre-trained fixed-length and any-length models, PepTune (inference-time guidance), TR2-D2 without search (fixed-length fine-tuning), and A2D2 without quality (fine-tuning without quality predictors). A2D2 produces higher rewards across most properties compared to all baselines across three target proteins: TfR, GLP-1R, and GLAST.

Table 1: Multi-objective peptide design results. All values are averaged over 100 generated peptides. Bold = best, underline = second best. † indicates values taken from TR2-D2.

| Target | Method | Binding Affinity (↑) | Solubility (↑) | Non-hemolysis (↑) | Non-fouling (↑) | Permeability (↑) |
|---|---|---|---|---|---|---|
| TfR | Pre-trained (Fixed Length) † | 8.008 ± 0.673 | 0.742 ± 0.166 | 0.874 ± 0.063 | 0.102 ± 0.083 | -7.470 ± 0.120 |
| TfR | Pre-trained (Any Length) | 7.788 ± 0.798 | 0.773 ± 0.202 | 0.875 ± 0.084 | 0.172 ± 0.163 | -7.248 ± 0.314 |
| TfR | PepTune † | 8.216 ± 0.703 | 0.789 ± 0.144 | 0.902 ± 0.051 | 0.121 ± 0.081 | -7.389 ± 0.119 |
| TfR | TR2-D2 w/o search | 8.518 ± 0.667 | 0.664 ± 0.143 | 0.876 ± 0.048 | 0.067 ± 0.055 | -7.296 ± 0.140 |
| TfR | A2D2 w/o quality | 8.057 ± 0.681 | 0.648 ± 0.271 | 0.862 ± 0.095 | 0.135 ± 0.167 | -7.252 ± 0.320 |
| TfR | A2D2 (Ours) | 11.283 ± 0.295 | 0.820 ± 0.095 | 0.754 ± 0.058 | 0.214 ± 0.048 | -6.628 ± 0.110 |
| GLP-1R | Pre-trained (Fixed Length) † | 8.233 ± 0.367 | 0.742 ± 0.166 | 0.874 ± 0.063 | 0.102 ± 0.083 | -7.470 ± 0.120 |
| GLP-1R | Pre-trained (Any Length) | 7.788 ± 0.798 | 0.773 ± 0.202 | 0.875 ± 0.084 | 0.172 ± 0.163 | -7.248 ± 0.314 |
| GLP-1R | PepTune † | 8.403 ± 0.365 | 0.774 ± 0.170 | 0.907 ± 0.057 | 0.125 ± 0.082 | -7.388 ± 0.128 |
| GLP-1R | TR2-D2 w/o search | 8.698 ± 0.266 | 0.692 ± 0.118 | 0.864 ± 0.048 | 0.243 ± 0.088 | -7.332 ± 0.059 |
| GLP-1R | A2D2 w/o quality | 8.104 ± 0.769 | 0.647 ± 0.273 | 0.863 ± 0.088 | 0.112 ± 0.131 | -7.228 ± 0.336 |
| GLP-1R | A2D2 (Ours) | 9.724 ± 0.628 | 0.795 ± 0.101 | 0.621 ± 0.071 | 0.323 ± 0.073 | -6.689 ± 0.074 |
| GLAST | Pre-trained (Fixed Length) † | 7.830 ± 0.420 | 0.742 ± 0.166 | 0.874 ± 0.063 | 0.102 ± 0.083 | -7.470 ± 0.120 |
| GLAST | Pre-trained (Any Length) | 7.100 ± 1.274 | 0.742 ± 0.166 | 0.874 ± 0.063 | 0.102 ± 0.083 | -7.470 ± 0.120 |
| GLAST | PepTune † | 8.400 ± 0.353 | 0.815 ± 0.139 | 0.937 ± 0.029 | 0.137 ± 0.086 | -7.311 ± 0.106 |
| GLAST | TR2-D2 w/o search | 8.579 ± 0.591 | 0.709 ± 0.144 | 0.913 ± 0.029 | 0.119 ± 0.059 | -7.327 ± 0.063 |
| GLAST | A2D2 w/o quality | 7.545 ± 1.259 | 0.691 ± 0.233 | 0.860 ± 0.084 | 0.134 ± 0.153 | -7.239 ± 0.322 |
| GLAST | A2D2 (Ours) | 11.265 ± 0.345 | 0.827 ± 0.103 | 0.703 ± 0.063 | 0.183 ± 0.055 | -6.460 ± 0.099 |

BibTeX

@article{tang2026a2d2,
  title={A2D2: Finetuning Any-Length Discrete Diffusion for Adaptive Decoding},
  author={Tang, Sophia and Zhu, Yuchen and Tao, Molei and Chatterjee, Pranam},
  journal={ReALM-GEN ICLR 2026 Workshop},
  year={2026}
}