Marcus: A Hierarchical Transformer for Reaction Condition Prediction

Abstract

A retrosynthesis route is only useful if a chemist can actually run each step. We introduce Marcus, a family of hierarchical transformers that predict the joint distribution of solvent, catalyst, base, temperature, and yield for any reactant–product pair. Trained on 4.7M reactions from Reaxys, USPTO, the Open Reaction Database, and Rasyn's in-house optimized-batch dataset, Marcus-1B (1.0B parameters) achieves 92.1% top-3 solvent accuracy, 89.9% top-3 catalyst accuracy, 8.4 °C temperature MAE, and R² = 0.72 on yield prediction - surpassing GraphRXN-Conditions, RXN4Chem (IBM), and the Coley/Maser baselines. The hierarchical decoder factorizes the conditional distribution P(solvent, catalyst, base, T | reaction), preventing the "mode collapse to the most common condition" failure mode of flat seq2seq models. Marcus is deployed inside Rasyn's retrosynthesis pipeline and the Condition Compiler to populate executable lab protocols from skeletal route plans.

1. Introduction

Predicting the right conditions for a chemical reaction has historically been the domain of process chemists with decades of experience. Should this Suzuki coupling use Pd(PPh₃)₄ or Pd(dba)₂/SPhos? Toluene/H₂O at reflux, or 1,4-dioxane at 90 °C? K₂CO₃, K₃PO₄, or Cs₂CO₃? The right answer depends on substrate, scale, and constraints invisible to a SMILES string.

Coley et al. (2018) opened the field with a simple feedforward network on Morgan fingerprints achieving 41% top-1 solvent accuracy. Subsequent work - Maser's Smiles2Conditions (2021), IBM's RXN4Chem (2023), GraphRXN-Conditions (2024) - pushed top-3 accuracy past 80% but treated each condition (solvent, catalyst, base, T) as an independent multi-class classification task. This independence assumption is wrong: the choice of base depends on the catalyst, and the temperature depends on the solvent's boiling point.

Marcus models the conditions as a structured joint distribution and decodes them autoregressively in a chemically motivated order - first solvent (reflecting compatibility with reagents), then catalyst (conditional on solvent), then base/additive, then temperature, then yield. Each step attends to the reaction graph and to all previously decoded conditions, capturing the cross-condition dependencies that matter in practice.
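
Written out, the decoding order described above corresponds to a chain-rule factorization. The symbols are shorthand introduced here for illustration and are not notation from the original text: r is the encoded reaction, and s, c, b, T, y are solvent, catalyst, base/additive, temperature, and yield. T and y are continuous, so their factors are densities. The second line shows, for contrast, the independence assumption attributed to earlier models.

```latex
\begin{align}
P(s, c, b, T, y \mid r) &= P(s \mid r)\, P(c \mid s, r)\, P(b \mid s, c, r)\, P(T \mid s, c, b, r)\, P(y \mid s, c, b, T, r) \\
P_{\text{indep}}(s, c, b, T \mid r) &= P(s \mid r)\, P(c \mid r)\, P(b \mid r)\, P(T \mid r)
\end{align}
```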

2. Method

2.1 Reaction-graph encoder

Each reactant and product is parsed to its molecular graph and atom-mapped via RXNMapper. The encoder operates on the union graph with edge-type embeddings distinguishing intra-molecule bonds, inter-molecule reactant–product correspondence, and edges marking the reaction core (atoms whose bond environment changes). A GraphTransformer (12 layers in Marcus-100M, 24 in Marcus-1B; see Section 2.4) with rotational position encodings produces a contextualized atom embedding tensor.
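
To make the three edge types concrete, here is a minimal pure-Python sketch that builds a typed edge list from atom-mapped bond lists. It is not the Marcus encoder input pipeline: the `EdgeType` names, the `(atom_map, atom_map, bond_order)` input format, and the choice to mark the core by typing every changed bond are all assumptions made for illustration. In practice the bond lists would come from a cheminformatics toolkit after atom mapping.

```python
from enum import Enum


class EdgeType(Enum):
    BOND = 0            # intra-molecule bond
    CORRESPONDENCE = 1  # same mapped atom, reactant side <-> product side
    CORE = 2            # bond that is formed, broken, or changes order


def canonical(bonds):
    """Normalize (a, b, order) triples so that a < b."""
    return {(min(a, b), max(a, b), order) for a, b, order in bonds}


def build_union_graph(reactant_bonds, product_bonds):
    """Return (typed edges, core atoms). Nodes are ("R", map_no) or ("P", map_no)."""
    r, p = canonical(reactant_bonds), canonical(product_bonds)
    edges = []
    for side, bonds in (("R", r), ("P", p)):
        for a, b, _order in sorted(bonds):
            edges.append(((side, a), (side, b), EdgeType.BOND))

    # Link each mapped atom that appears on both sides of the reaction.
    r_atoms = {x for a, b, _order in r for x in (a, b)}
    p_atoms = {x for a, b, _order in p for x in (a, b)}
    for a in sorted(r_atoms & p_atoms):
        edges.append((("R", a), ("P", a), EdgeType.CORRESPONDENCE))

    # Assumption: the reaction core is the set of atoms touching a changed bond.
    changed = r ^ p
    core_atoms = {x for a, b, _order in changed for x in (a, b)}
    for bond in sorted(changed):
        side = "R" if bond in r else "P"
        edges.append(((side, bond[0]), (side, bond[1]), EdgeType.CORE))
    return edges, core_atoms


# Toy substitution: bond 1-2 breaks, bond 1-3 forms, bond 3-4 is untouched.
edges, core = build_union_graph(
    reactant_bonds=[(1, 2, 1), (3, 4, 1)],
    product_bonds=[(1, 3, 1), (3, 4, 1)],
)
```

In the toy example atom 4 stays outside the core because its only bond is unchanged, while atom 2 (the leaving group) is in the core even though it has no bonds on the product side.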

2.2 Hierarchical decoder

The decoder produces conditions in a fixed chemically motivated order:

Solvent - selected from a 142-class controlled vocabulary (DMSO, THF, MeCN, …) plus a "mixture" head for binary solvent systems
Catalyst / pre-catalyst - 384-class vocabulary including ligand-bound complexes (Pd(PPh₃)₄, Pd(dba)₂/SPhos, …)
Base / additive - 89 classes
Temperature - regression head with Gaussian likelihood
Yield - regression head conditioned on all of the above
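
The text specifies only that the temperature head uses a Gaussian likelihood. One common way to train such a head, shown here as a sketch under that assumption, is to predict a mean and a log-variance and minimize the Gaussian negative log-likelihood. The function name and the log-variance parameterization are illustrative choices, not details from the source.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (target - mean) ** 2 / var)


# A 20 degree miss is penalized far less when the head also reports
# high uncertainty (variance 100) than when it claims variance 1.
confident_miss = gaussian_nll(target=110.0, mean=90.0, log_var=0.0)
uncertain_miss = gaussian_nll(target=110.0, mean=90.0, log_var=math.log(100.0))
```

The appeal of this loss is that the predicted variance doubles as a per-reaction uncertainty estimate for the temperature.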

Each step receives the reaction encoding plus embeddings of all previously decoded conditions via cross-attention. This factorization captures couplings such as "K₂CO₃ doesn't dissolve well in toluene → prefer Cs₂CO₃ when toluene is the solvent."
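
The conditional decoding described above can be illustrated with a toy beam search over hand-written probability tables. Everything in this sketch is hypothetical: the probabilities are made up, the vocabularies are truncated to two entries each, and plain dictionaries stand in for the decoder's learned conditional heads. It only shows how joint scores arise from multiplying step-wise conditionals in the solvent → catalyst → base order.

```python
import heapq

# Made-up conditionals standing in for the learned decoder heads.
P_SOLVENT = {"toluene": 0.6, "1,4-dioxane": 0.4}
P_CATALYST = {
    "toluene": {"Pd(PPh3)4": 0.7, "Pd(dba)2/SPhos": 0.3},
    "1,4-dioxane": {"Pd(PPh3)4": 0.4, "Pd(dba)2/SPhos": 0.6},
}


def p_base(solvent, catalyst):
    """Toy P(base | solvent, catalyst); here only the solvent matters."""
    if solvent == "toluene":
        return {"Cs2CO3": 0.8, "K2CO3": 0.2}
    return {"K2CO3": 0.7, "Cs2CO3": 0.3}


def joint_top_k(k, beam_width=4):
    """Beam search over the chain P(s) * P(c | s) * P(b | s, c)."""
    steps = [
        lambda prefix: P_SOLVENT,
        lambda prefix: P_CATALYST[prefix[0]],
        lambda prefix: p_base(prefix[0], prefix[1]),
    ]
    beams = [(1.0, ())]
    for step in steps:
        expanded = [
            (prob * q, prefix + (choice,))
            for prob, prefix in beams
            for choice, q in step(prefix).items()
        ]
        # Keep the highest-scoring partial condition sets.
        beams = heapq.nlargest(max(k, beam_width), expanded)
    return beams[:k]


top = joint_top_k(k=3)
```

With these toy numbers the best joint set pairs toluene with Cs₂CO₃ and the runner-up pairs 1,4-dioxane with K₂CO₃, so the preferred base switches with the solvent, which independent per-condition heads cannot express.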

2.3 Training data

The corpus combines four sources, with deduplication and aggressive filtering for atom-mapping consistency, reasonable yields (5–95%), and exclusion of trivial one-step deprotections:

Reaxys-derived corpus (licensed): 3.1M reactions with full conditions and reported yield
USPTO-MIT-Conditions (open): 1.0M patent reactions parsed by Lowe + manual review
Open Reaction Database (ORD): 540k reactions from open-source contributions
Rasyn in-house optimized batch dataset: 87k high-throughput screening rows from partner labs (Aragen, IFM Therapeutics, 1200 Pharma)
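
A minimal sketch of two of the filters named above, deduplication and the 5–95% yield window, might look as follows. The record field names and the choice of deduplication key are assumptions for illustration; the atom-mapping consistency check and the deprotection filter are omitted because the source does not describe them in enough detail to sketch.

```python
def filter_reactions(records, min_yield=5.0, max_yield=95.0):
    """Drop exact duplicates and records whose yield is missing or out of range."""
    seen = set()
    kept = []
    for rec in records:
        # Hypothetical dedup key: same reaction under the same conditions.
        key = (rec["rxn_smiles"], rec["solvent"], rec["catalyst"], rec["base"])
        if key in seen:
            continue
        if rec["yield"] is None or not (min_yield <= rec["yield"] <= max_yield):
            continue
        seen.add(key)
        kept.append(rec)
    return kept


def _rec(rxn, solvent, yield_pct):
    """Build a placeholder record; real records would carry full SMILES."""
    return {"rxn_smiles": rxn, "solvent": solvent, "catalyst": None,
            "base": "K2CO3", "yield": yield_pct}


records = [
    _rec("A>>B", "THF", 72.0),
    _rec("A>>B", "THF", 72.0),   # exact duplicate, dropped
    _rec("C>>D", "THF", 99.0),   # above the 95% cutoff, dropped
    _rec("E>>F", "THF", None),   # no reported yield, dropped
    _rec("A>>B", "DMSO", 40.0),  # same reaction, different solvent, kept
]
kept = filter_reactions(records)
```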

We employ a curriculum: the model is pretrained on the noisier literature corpus (Reaxys + USPTO + ORD) for the first 80% of training, then fine-tuned on the cleaner in-house optimized-batch data for the remainder. This curriculum accounts for the final +0.3pp top-3 solvent lift in our ablation (Section 5).
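
As a rough sketch of the schedule, the function below picks the data source for a given training step. It assumes the 80% split is measured in optimizer steps and that the switch is a hard cutover; the source states neither, and the function and source names are hypothetical.

```python
def curriculum_source(step, total_steps, pretrain_fraction=0.8):
    """Return which corpus to sample from at a given (0-indexed) training step."""
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    pretrain_steps = round(pretrain_fraction * total_steps)
    return "literature" if step < pretrain_steps else "in_house"


schedule = [curriculum_source(s, total_steps=10) for s in range(10)]
```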

2.4 Model variants

Variant	Encoder layers	Decoder layers	d_model	Params	Tier
Marcus-100M	12	8	768	108M	Pro
Marcus-1B	24	16	2048	1.04B	Enterprise

3. Results

We evaluate on the USPTO-MIT-Conditions held-out test set (50,287 reactions, no overlap with training) and additionally on a curated internal medicinal-chemistry test set of 4,210 reactions across 12 named reaction classes. All baselines were re-trained on identical training data and evaluated with identical metrics for a fair comparison.
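
For reference, the top-k accuracy reported in the figures below counts a prediction as correct when the recorded condition appears among the model's k highest-ranked candidates. The helper is a generic sketch of that metric, not the evaluation code used for the paper.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of examples whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(label in preds[:k] for preds, label in zip(ranked_predictions, labels))
    return hits / len(labels)


ranked = [
    ["THF", "DMSO", "MeCN"],
    ["DMSO", "THF", "MeCN"],
    ["MeCN", "THF", "DMSO"],
]
truth = ["THF", "THF", "DMSO"]
top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```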

Method	Top-1	Top-3	Top-5
Marcus-1B	73.8%	92.1%	96.4%
Marcus-100M	68.4%	88.4%	94.1%
GraphRXN-Conditions (2024)	62.1%	83.2%	89.7%
RXN4Chem (IBM, 2023)	58.4%	79.1%	86.2%
Maser et al. - Smiles2Conditions (2021)	52.1%	75.4%	82.8%
Coley et al. (2018)	41.2%	67.3%	73.9%

Figure 1. Solvent prediction accuracy on the USPTO-MIT-Conditions held-out test set (50,287 reactions, 142-class vocabulary). Marcus-1B reaches 92.1% top-3 - a 9pp absolute improvement over the prior best.

Method	Top-1	Top-3	Top-5
Marcus-1B	71.3%	89.9%	95.1%
Marcus-100M	66.1%	85.7%	92.4%
GraphRXN-Conditions (2024)	58.4%	80.1%	87.3%
RXN4Chem (IBM, 2023)	54.7%	76.8%	84.0%
Maser et al. - Smiles2Conditions (2021)	49.0%	72.4%	80.3%
Coley et al. (2018)	38.1%	64.2%	71.5%

Figure 2. Catalyst / pre-catalyst prediction accuracy (384-class vocabulary including ligand-bound complexes). The harder long tail of rare ligand-metal combinations rewards the larger model - Marcus-1B is 4.2pp top-3 above Marcus-100M (89.9% vs. 85.7%).

3.1 Temperature regression

Method	Params	MAE (°C)	% within ±20 °C
Marcus-1B (ours)	1.0B	8.4	91.2%
Marcus-100M (ours)	108M	11.7	87.4%
GraphRXN-Conditions (2024)	85M	14.2	79.1%
RXN4Chem (IBM, 2023)	40M	16.8	74.3%
Maser et al. - Smiles2Conditions (2021)	12M	19.4	68.0%
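
The two columns in this table can be computed as below. This is a generic sketch of the metrics with made-up example temperatures, not the paper's evaluation script.

```python
def temperature_metrics(predicted, measured, tolerance=20.0):
    """Return (MAE in deg C, fraction of predictions within +/- tolerance)."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need equally sized, non-empty inputs")
    errors = [abs(p - m) for p, m in zip(predicted, measured)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mae, within


# Absolute errors are 10, 0, 35, and 5 deg C.
mae, within_20 = temperature_metrics(
    predicted=[80.0, 100.0, 25.0, 60.0],
    measured=[90.0, 100.0, 60.0, 65.0],
)
```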

4. Yield Prediction

Yield prediction is the hardest task in this benchmark: yield is heavily influenced by execution factors (purity of starting material, mixing, temperature control) that are invisible from a reaction SMILES alone. The R² achievable by any structure-only model is therefore bounded below 1; we estimate the ceiling at ~0.80 by training an oracle model on the in-house high-throughput screening data with explicit batch metadata.

Marcus-1B achieves R² = 0.72 / MAE = 8.7%, closing more than half of the remaining gap to the oracle ceiling.
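
The "more than half" claim can be checked with a few lines of arithmetic. The sketch assumes the gap is measured from the strongest non-Marcus baseline in the table below (GraphRXN's yield head, R² = 0.58) up to the estimated ceiling of 0.80; the text does not say which baseline it has in mind, so that reference point is an interpretation. The `r_squared` helper is the standard coefficient of determination, included for completeness.

```python
def r_squared(predicted, measured):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def gap_closed(model_r2, baseline_r2, ceiling_r2):
    """Share of the baseline-to-ceiling gap recovered by the model."""
    return (model_r2 - baseline_r2) / (ceiling_r2 - baseline_r2)


# Assumed reference point: GraphRXN (yield head), R^2 = 0.58.
fraction = gap_closed(model_r2=0.72, baseline_r2=0.58, ceiling_r2=0.80)

# Made-up yields, only to exercise r_squared.
toy_r2 = r_squared(predicted=[12, 18, 33, 37], measured=[10, 20, 30, 40])
```

Under that assumption the fraction is 0.14 / 0.22, roughly 64%.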

Method	Params	R²	MAE (%)
Marcus-1B (ours)	1.0B	0.72	8.7
Marcus-100M (ours)	108M	0.65	10.2
Yield-BERT (Schwaller, 2021)	12M	0.55	12.4
GraphRXN (yield head)	85M	0.58	11.8
Random Forest (Ahneman, 2018) baseline	-	0.42	14.6

Figure 3. Marcus-1B predicted yield vs. measured yield on the USPTO-MIT-Conditions held-out set. Dashed line is y = x. R² = 0.72 / MAE = 8.7%.

Yield prediction caveats

Because yield depends on execution quality (purity of starting material, mixing efficiency, temperature control) that is invisible from a reaction SMILES alone, predicted yields should be treated as a feasibility ranking, not as absolute promises. A predicted 78% yield is meaningful as "this reaction will work"; the difference between a predicted 78% and an observed 65% is normal noise.

5. Ablation Studies

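The table reports top-3 solvent accuracy on the validation set as components are added one at a time to a plain reactant-SMILES → conditions transformer baseline. Each row includes all components above it, and Δ is the gain in percentage points over the previous row.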

Configuration	Top-3 solvent	Δ
Plain reactant→conditions transformer	71.4%	-
+ Reaction-graph encoder (atom-mapped)	78.6%	+7.2
+ Hierarchical decoder (solvent → catalyst → base → T)	82.1%	+3.5
+ Reaction-class conditioning prior	84.7%	+2.6
+ Reagent–condition co-attention	86.9%	+2.2
+ Multi-task yield head	88.1%	+1.2
+ Curriculum: literature → optimized batch	88.4%	+0.3
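
The ablation rows are cumulative, so the deltas should sum to the total gain over the baseline. The short check below re-derives each Δ from the reported accuracies and computes each component's share of the overall improvement; the shortened configuration labels are just for this sketch.

```python
# (configuration, top-3 solvent accuracy in %, reported delta in pp)
ABLATION = [
    ("plain transformer", 71.4, None),
    ("+ reaction-graph encoder", 78.6, 7.2),
    ("+ hierarchical decoder", 82.1, 3.5),
    ("+ reaction-class prior", 84.7, 2.6),
    ("+ reagent-condition co-attention", 86.9, 2.2),
    ("+ multi-task yield head", 88.1, 1.2),
    ("+ curriculum", 88.4, 0.3),
]


def check_deltas(rows):
    """Return (total gain, share per component); raise if a delta is inconsistent."""
    total = round(rows[-1][1] - rows[0][1], 1)
    shares = {}
    for (_, prev_acc, _), (name, acc, delta) in zip(rows, rows[1:]):
        if round(acc - prev_acc, 1) != delta:
            raise ValueError(f"inconsistent delta for {name!r}")
        shares[name] = delta / total
    return total, shares


total_gain, shares = check_deltas(ABLATION)
```

The total gain is 17.0pp, of which the reaction-graph encoder alone contributes about 42%.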

The reaction-graph encoder pays for itself

Switching from reactant SMILES tokens to a proper atom-mapped reaction graph contributed +7.2pp top-3 - the single largest gain. Reaction conditions depend on which atoms' bond environment changes, which is much easier to read off an atom-mapped graph than to infer from string-level SMILES.

6. Applications in the Rasyn Platform

Marcus is the conditions backbone for three production features:

Retrosynthesis-to-protocol

Every step of a route generated by RetroTransformer v2 is passed through Marcus to produce executable conditions: solvent, catalyst, base, T, expected yield, and a confidence score. The chemist receives a complete lab protocol, not just an arrow-pushing diagram.

Condition Compiler

Given a single reaction with unknown optimal conditions, Marcus enumerates the joint top-K conditions and proposes a 2- or 3-factor DoE plan around the highest-confidence prediction. Outcome data feeds back into Marcus's curriculum fine-tuning loop.
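
As an illustration of what a small DoE plan around a predicted optimum could look like, the sketch below enumerates a full-factorial grid over two factors while holding the remaining conditions at the predicted values. The source does not say which design the Condition Compiler uses, so full factorial is an assumption, and the center point, factor levels, and field names are all hypothetical.

```python
from itertools import product


def full_factorial(center, levels):
    """Vary the factors in `levels` over all combinations; keep the rest at `center`."""
    factors = sorted(levels)
    runs = []
    for combo in product(*(levels[f] for f in factors)):
        run = dict(center)
        run.update(zip(factors, combo))
        runs.append(run)
    return runs


# Hypothetical highest-confidence prediction for a Suzuki coupling.
center = {
    "solvent": "1,4-dioxane",
    "catalyst": "Pd(dba)2/SPhos",
    "base": "K3PO4",
    "temperature_c": 90,
}
plan = full_factorial(
    center,
    levels={"temperature_c": [80, 90, 100], "base": ["K3PO4", "Cs2CO3"]},
)
```

Three temperature levels and two bases give six runs, one of which is the predicted optimum itself.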

Reaction yield ranking

When a route has multiple synthetically valid disconnections, Marcus's predicted yield is one of the inputs to the route-ranker - penalizing low-yield steps that would compound over a long sequence.
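
The compounding effect is easy to see numerically: the overall yield of a linear route is the product of its step yields. The sketch below uses made-up step yields and a hypothetical helper name, and it ignores the other inputs to the route-ranker, which the source does not enumerate.

```python
import math


def route_yield(step_yields):
    """Overall yield of a linear route, with each step yield given as a fraction."""
    return math.prod(step_yields)


# Made-up predicted step yields for two candidate routes.
routes = {
    "route_a": [0.90, 0.85, 0.80],  # three solid steps
    "route_b": [0.95, 0.50],        # shorter, but one weak step
}
ranked = sorted(routes, key=lambda name: route_yield(routes[name]), reverse=True)
ten_step = route_yield([0.80] * 10)
```

Here the three-step route wins (about 61% overall versus about 48%), and ten consecutive 80% steps leave only about 11% of the material.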

7. Limitations

Closed-vocabulary conditions. The 142-class solvent and 384-class catalyst vocabularies cover ~98% of literature reactions but cannot suggest novel solvents or ligands. Open-vocabulary generation is future work.
Yield ceiling at R² ≈ 0.80. Yield depends on execution quality (mixing, purity, temperature control) that is structurally invisible. Any structure-only model has an information-theoretic ceiling well below R²=1.
Long-tail named reactions. Reactions with fewer than ~50 training examples (e.g. exotic transition-metal cross-couplings) show 10–15pp lower top-3 accuracy. Targeted active-learning loops with partner labs are addressing this.
Publication bias. Reaxys and USPTO over-represent successful published reactions. Marcus may underweight unconventional but viable conditions that are simply absent from the literature.

8. Conclusion

Marcus models reaction conditions as a structured joint distribution rather than a bag of independent classifications, and decodes them in a chemically motivated order with cross-condition co-attention. Trained on 4.7M reactions and benefiting from a literature → optimized-batch curriculum, Marcus-1B sets a new state of the art across solvent, catalyst, temperature, and yield prediction simultaneously. Inside Rasyn it is the difference between a retrosynthesis route and an executable lab protocol.
