Abstract
Retrosynthetic analysis---the task of decomposing a target molecule into purchasable starting materials---remains a central challenge in computer-aided synthesis planning. We present Rasyn, a hybrid framework that combines three complementary neural architectures: (1) a graph neural network (GNN) head that predicts likely bond disconnections as edit operations, (2) a custom encoder-decoder Transformer with a copy/pointer mechanism (RetroTransformer v2) that generates reactant SMILES conditioned on predicted edits, and (3) a fine-tuned large language model (RSGPT v6, based on LLaMA-2 architecture) that performs edit-conditioned retrosynthesis through in-context learning. On the standard USPTO-50K benchmark (5,007 test reactions), our RetroTransformer v2 with root-aligned SMILES (R-SMILES) and 20x offline augmentation achieves 69.7% Top-1 exact match accuracy with 100% coverage using only 45.5M parameters---surpassing C-SMILES (67.2%) and demonstrating that carefully designed small models can outperform approaches with orders of magnitude more parameters.
1. Introduction
The synthesis of complex organic molecules is a cornerstone of pharmaceutical development, materials science, and chemical biology. Given a target molecule, retrosynthetic analysis seeks to identify a set of simpler precursor molecules (reactants) and the corresponding reaction that transforms them into the desired product. This inverse problem, first formalized by Corey (1991), has historically relied on expert intuition built over decades of training.
Recent advances in deep learning have produced a wealth of methods for single-step retrosynthetic prediction, broadly categorized into three paradigms:
- Template-based methods match the target molecule against a library of reaction templates (SMARTS patterns). While interpretable, they are fundamentally limited by template coverage and cannot generalize to unseen reaction types.
- Template-free methods directly generate reactant SMILES strings from product SMILES using encoder-decoder architectures, treating retrosynthesis as a sequence-to-sequence translation problem.
- Semi-template methods decompose the problem into two stages: first predicting the reaction center, then completing the resulting synthons into full reactant molecules.
In this paper, we present Rasyn, a hybrid framework that synthesizes insights from all three paradigms. Our key contributions include a multi-model architecture, a compact model surpassing larger approaches, comprehensive analysis of the mathematical foundations governing accuracy, and detailed ablation studies quantifying the contribution of each architectural component.

2. The Token-Sequence Accuracy Gap
A critical insight that motivated our architectural decisions is the mathematical relationship between token-level accuracy and sequence-level exact match. For an autoregressive model generating a sequence of length L tokens, if each token is predicted independently with accuracy p, the probability of generating the entire sequence correctly is:
P_exact = p^L

This exponential decay has devastating consequences for character-level tokenization. A typical SMILES string contains L ≈ 30-50 characters. Even at 85% token accuracy (strong token-level performance), the expected exact match rate is:

P_exact = 0.85^30 ≈ 0.8% at L = 30, falling to 0.85^50 ≈ 0.03% at L = 50.
This analysis explains why our v1 RetroTransformer, despite achieving 85% token accuracy, produced only 0.9% Top-1 exact match---a result that initially appeared to be a bug but is in fact a mathematical inevitability of character-level autoregression. Reducing the effective sequence length through better tokenization is therefore critical.
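The decay described above is easy to verify directly. A minimal sketch, using the relation P_exact = p^L from this section:

```python
# Sequence-level exact-match probability under independent per-token
# predictions: P_exact = p^L (the relation derived in Section 2).
def exact_match_rate(token_acc: float, length: int) -> float:
    return token_acc ** length

# Character-level tokenization: L ~ 30-50 characters at 85% token accuracy.
for L in (30, 40, 50):
    print(f"L={L}: {exact_match_rate(0.85, L):.4%}")
```

Even "strong" 85% token accuracy collapses to well under 1% exact match across this length range, matching the 0.9% Top-1 observed for the v1 model.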

3. Methods
The Rasyn framework consists of three primary components operating in a pipeline architecture. We describe each component and the six key innovations that drive performance.
3.1 Graph Head: GNN-Based Edit Prediction
The Graph Head serves as the first stage of the pipeline, predicting which bond disconnections are most likely for a given product molecule. We employ a message-passing neural network (MPNN) operating on the molecular graph where nodes represent atoms and edges represent bonds. After 4 rounds of message passing, each edge receives a disconnection probability. The top-K edges are selected as candidate edits for downstream models.
The Graph Head achieves a validation loss of 0.1349, corresponding to approximately 91% recall@5 for identifying the correct disconnection bond.
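As a sketch of this pipeline stage (the data structures and names here are illustrative, not the paper's API), selecting the top-K candidate edits from per-edge disconnection probabilities might look like:

```python
import heapq

def top_k_edits(edge_probs, k=5):
    """Select the K bonds with the highest predicted disconnection
    probability. edge_probs maps (atom_i, atom_j) bond tuples to the
    MPNN's per-edge probability (names hypothetical)."""
    return heapq.nlargest(k, edge_probs.items(), key=lambda kv: kv[1])

# toy scores over four bonds of a hypothetical product molecule
scores = {(0, 1): 0.05, (1, 2): 0.82, (2, 3): 0.10, (3, 4): 0.03}
print(top_k_edits(scores, k=2))  # bond (1, 2) ranks first
```

The selected edits are then passed downstream as conditioning context for the generative models.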
3.2 RetroTransformer v2: Six Architectural Innovations
RetroTransformer v2 is a custom encoder-decoder Transformer (d_model=512, 6 layers, 8 heads, 45.5M parameters) designed specifically to address the token-sequence accuracy gap. It introduces six key innovations:
Regex-Based Atom-Level Tokenizer (+11.4pp)
Parses SMILES into chemically meaningful tokens (~20 tokens vs. ~50 characters), reducing sequence length by roughly 2.5x. This single change is the most impactful for exact match accuracy.
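The paper does not reproduce its exact pattern; as a sketch, a widely used atom-level SMILES regex (which may differ from the one used here) treats bracket atoms, two-letter halogens, and ring-closure labels as single tokens:

```python
import re

# Atom-level SMILES tokenizer. This regex is a commonly used pattern in
# the retrosynthesis literature; the paper's exact pattern may differ.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list:
    tokens = SMILES_TOKEN_RE.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer must be lossless"
    return tokens

smi = "ClC(=O)c1ccc(Br)cc1[N+](=O)[O-]"
print(len(smi), "characters ->", len(tokenize(smi)), "tokens")  # 31 -> 23
```

Multi-character units like `Cl`, `Br`, and `[N+]` become single tokens, which is where the sequence-length reduction comes from.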
Copy/Pointer Mechanism (+16.2pp)
The decoder can directly copy tokens from the encoder input. Since ~80% of reactant tokens already appear in the product, this dramatically reduces the generation burden.
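The paper does not give its exact mixing formula; a minimal numpy sketch of a standard pointer-generator mixture (a generation distribution blended with copy attention over source positions) is:

```python
import numpy as np

def copy_generate_mix(p_vocab, attn, src_ids, p_gen):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attn[i] over
    source positions i holding token w (standard pointer-generator form)."""
    p_final = p_gen * np.asarray(p_vocab, dtype=float)
    for pos, tok in enumerate(src_ids):
        p_final[tok] += (1.0 - p_gen) * attn[pos]
    return p_final

# toy example: 5-token vocabulary, source sequence contains tokens [2, 3]
p = copy_generate_mix(np.full(5, 0.2), [0.7, 0.3], [2, 3], p_gen=0.5)
print(p)  # source token 2 receives the largest share of probability
```

Because the copy branch concentrates mass on tokens actually present in the product, the decoder rarely has to generate them "from scratch".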
Offline Data Augmentation (+13.6pp)
20x randomized SMILES augmentation with canonical targets. On-the-fly augmentation hurts retrosynthesis because the model never sees the same input twice, preventing it from forming stable input-output associations.
Reaction Class Conditioning (+6.2pp)
Reaction-type tokens (RXN_1 through RXN_10) disambiguate cases where the same product could be synthesized through different reaction types.
Segment Embeddings (+3.8pp)
Learned embeddings distinguish product tokens (segment 0) from synthon tokens (segment 1), helping the model attend differently to source material vs. edit context.
Root-Aligned SMILES (R-SMILES) (+2.2pp)
Aligns each reactant SMILES to start from the same atom as the product, reducing product-reactant edit distance by ~50% and benefiting the copy mechanism.
3.3 RSGPT v6: Fine-Tuned Large Language Model
Our second generation pathway uses a large language model based on the LLaMA-2 architecture. We start from pre-trained RSGPT weights (trained on 10B synthetic reaction tokens) and perform parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation) with rank r=16 and scaling factor α=32. The total number of trainable parameters is approximately 4.2M (0.13% of the full model), making fine-tuning feasible on a single A100 GPU.
Rather than performing raw SMILES-to-SMILES translation, we condition the LLM on the edit information extracted by the Graph Head, significantly reducing the difficulty of the generation task.


4. Results
We evaluate on the standard USPTO-50K benchmark dataset (5,007 test reactions, Schneider split). Our RetroTransformer v2 with R-SMILES 20x augmentation achieves 69.7% Top-1 exact match accuracy with 100% coverage, surpassing C-SMILES (67.2%) while using only 45.5M parameters.
| Method | Params | Coverage | Top-1 |
|---|---|---|---|
| Mol. Transformer | 12M | 100% | 43.7% |
| Retroformer | 18M | 100% | 53.2% |
| EditRetro | --- | 100% | 60.8% |
| C-SMILES | --- | 100% | 67.2% |
| RSGPT | 3.2B | 100% | 63.4% |
| RSGPT (w/ TTA) | 3.2B | 100% | 77.0% |
| RetroDFM-R | --- | 100% | 65.0% |
| RetroTransformer v2 (R-SMILES 20x) (Ours) | 45.5M | 100% | 69.7% |
| RSGPT v6 (Ours) | 3.2B | 77.6% | 61.7% |

4.1 Training Analysis
The model converges within 15-20 epochs with early stopping (patience 15). Token accuracy exceeds 98%, yet exact match is limited to ~70%---illustrating the token-sequence gap: at L = 20 tokens, 0.98^20 ≈ 66.8% is the theoretical maximum. Our actual 69.7% slightly exceeds this bound because the copy mechanism ensures that copied tokens are predicted with ~100% accuracy, effectively reducing the generated length below 20.
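Inverting the relation from Section 2 makes the "effective length" argument concrete:

```python
import math

# Invert P_exact = p^L: what effective generated length L_eff is
# consistent with the observed 69.7% exact match at 98% token accuracy?
p_tok, exact = 0.98, 0.697
L_eff = math.log(exact) / math.log(p_tok)
print(round(L_eff, 1))  # ~17.9 tokens, below the nominal L = 20
```

An effective length of roughly 18 rather than 20 is consistent with a small fraction of tokens being copied essentially error-free.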

4.2 Quality Metrics
Beyond exact match, we evaluate generation quality through SMILES validity, Tanimoto similarity, and beam diversity. Even when predictions are not exact matches, they are structurally similar to the ground truth (86.2% for RetroTransformer v2, 91.0% for RSGPT v6), indicating that “near-misses” are chemically meaningful.
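For reference, Tanimoto similarity over fingerprint bit sets is the Jaccard index; a minimal sketch (toy bit sets, not real molecular fingerprints):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# toy fingerprints for two structurally similar molecules
print(tanimoto({1, 4, 9, 16}, {1, 4, 9, 25}))  # 3/5 = 0.6
```

In practice the bit sets would come from a molecular fingerprint (e.g. a circular fingerprint) computed with a cheminformatics toolkit.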

5. Ablation Studies
To understand the contribution of each architectural component, we conduct systematic ablation experiments. Each row adds one component to the previous configuration, showing the incremental gain; the remaining improvement from 56.7% to the final 69.7% comes from R-SMILES alignment (+2.2pp) and scaling offline augmentation from 5x to 20x.
| Configuration | Top-1 | Δ |
|---|---|---|
| v1 Baseline (char tokenizer) | 0.9% | --- |
| + Regex tokenizer | 12.3% | +11.4 |
| + Copy mechanism | 28.5% | +16.2 |
| + Offline augmentation (5x) | 42.1% | +13.6 |
| + Reaction class tokens | 48.3% | +6.2 |
| + Segment embeddings | 52.1% | +3.8 |
| + Beam search fixes | 56.7% | +4.6 |
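Since the table is cumulative, its internal consistency is easy to check:

```python
# Cumulative ablation: starting from the 0.9% v1 baseline, each
# per-component delta (in percentage points) adds onto the previous row.
baseline = 0.9
deltas = [11.4, 16.2, 13.6, 6.2, 3.8, 4.6]

top1 = baseline
for d in deltas:
    top1 += d
print(round(top1, 1))  # 56.7, matching the final ablation row
```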

6. Error Analysis
We analyze the 30.3% of test reactions where RetroTransformer v2 fails to produce an exact match at Top-1. Notably, the largest category (~35%) consists of alternative valid disconnections---chemically correct pathways that simply differ from the ground truth.
7. Conclusion
We have presented Rasyn, a hybrid framework for single-step retrosynthetic analysis that combines graph neural networks, a copy-augmented Transformer, and a fine-tuned large language model. Our key findings:
- Our RetroTransformer v2 achieves 69.7% Top-1 exact match accuracy on the full USPTO-50K test set with 100% coverage, using only 45.5M parameters---surpassing C-SMILES (67.2%).
- The mathematical analysis of token-to-sequence accuracy (P_exact = p^L) provides a principled framework for understanding and improving retrosynthesis models.
- The architectural innovations collectively improve accuracy from 0.9% to 69.7%, with the copy mechanism (+16.2pp), offline augmentation (+13.6pp), and regex tokenization (+11.4pp) providing the largest gains.
- An honest evaluation methodology---reporting accuracy on total, accuracy on attempted, and coverage separately---is essential for fair comparison.
These results demonstrate that the combination of classical chemical reasoning, modern sequence modeling, and large-scale language model pre-training can push the boundaries of automated retrosynthetic analysis. With ongoing work on model scaling and reinforcement learning with chemical rewards, we expect significant further improvements.