Evidence

We are scientists building for scientists. Below you will find our approach to benchmarking, the white paper behind it, the research publications underlying the architecture, and a sample client engagement that put our engine to the test.

Benchmarking

Public benchmarks track research progress in a narrow domain. They were never designed to certify an AI for R&D decision-making.

We took a different approach: a multi-tier benchmark architecture designed around the decisions R&D scientists actually make. It has three tiers: specialist competence at each modality, cross-modal grounding and reasoning, and decision-grade output on end-to-end workflows.

Across all three tiers, we’re competitive or ahead of the strongest available systems on more than a dozen benchmarks—and the lead widens at the layers closest to real R&D decisions. That’s the pattern we expected from Modality Fusion combined with scientific reasoning, and it’s the pattern the evidence confirms. Comprehending biology isn’t about topping one benchmark—it’s about reasoning across every scale and modality relevant to biology. The breadth of where the engine performs well is the measure to which we hold ourselves.

We publish what we lead, and we publish what we don’t. Our AI doesn’t need to top every leaderboard—it needs to provide reliable answers for the complex reasoning questions crucial for R&D leaders.

Why we built our own framework

Public biomedical AI benchmarks are small, single-modal, or built to reward final-answer correctness over reasoning. They can’t distinguish a model that reads a paper from one that synthesizes across an omics count matrix, a molecule’s SMILES string, and a clinical trial in the same trace, which is what R&D decisions actually require.

The three-tier architecture

Tier 1—Specialist competence. Each modality model evaluated against the strongest available specialist on its native task.
Tier 2—Cross-modal grounding and reasoning. Whether the reasoning layer extracts checkable content from non-textual modalities and combines it correctly.
Tier 3—Decision-grade output. Whether end-to-end workflows produce hypotheses validated by practicing drug developers.

Some examples: 85.9% on translational analysis reasoning (SOTA: 55.6%); 96.8% negative-control discrimination on biomarker queries (next-best: 93.3%); and 75% on multimodal omics QA (next-best: 58%). Twelve more are in the [white paper].

Five benchmarks we built—because the field hadn’t

Where public benchmarks couldn’t ask the questions that matter, we built our own.

Validated Biomarkers (n=30): expert-curated drug / mechanism-of-action / indication pairs with biomarker positives and adjacent-biology negative controls, testing discrimination rather than recall.
GEO-OmicsQA (n=3,000): free-text biological questions grounded in transcriptomics profiles plus natural-language context.
DILER (935 hypotheses, 257 compounds): mechanistic narrative quality on drug-induced liver injury (DILI).
Drug Development Competence Benchmark (n=109): stratified across ontology resolution, qualitative retrieval, and basic quantitative computation.
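
To make the distinction between discrimination and recall concrete, here is a minimal sketch in Python. It is not the scoring code used for Validated Biomarkers; the function names, the 0.5 threshold, and all scores are hypothetical and chosen only for illustration. It assumes a system assigns each candidate biomarker a confidence score, and it measures discrimination as the fraction of (positive, negative-control) pairs ranked in the right order, which is the standard pairwise reading of ROC-AUC.

```python
from itertools import product


def recall(pos_scores, threshold=0.5):
    """Fraction of true biomarkers scored at or above the threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def pairwise_discrimination(pos_scores, neg_scores):
    """Fraction of (positive, negative-control) pairs in which the positive
    is scored higher; ties count as half. Equivalent to ROC-AUC."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, invented for this example only.
# A system that says "yes" to everything: perfect recall, no discrimination.
indiscriminate_pos = [0.9, 0.9, 0.9]
indiscriminate_neg = [0.9, 0.9, 0.9]

# A system that separates true biomarkers from adjacent-biology controls.
selective_pos = [0.9, 0.8, 0.7]
selective_neg = [0.2, 0.1, 0.3]

print(recall(indiscriminate_pos),
      pairwise_discrimination(indiscriminate_pos, indiscriminate_neg))
print(recall(selective_pos),
      pairwise_discrimination(selective_pos, selective_neg))
```

Both hypothetical systems reach the same recall, but only the second one ranks true biomarkers above the negative controls. That gap is what a benchmark built on adjacent-biology negatives is meant to expose.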

Each is published as a preprint and openly available, with peer review underway—because what the field measures is what the field will eventually build.

The full record—including where we lose

Across all three tiers we’re competitive or ahead of the strongest available systems, but not on every metric. TxGemma leads on mean ROC-AUC across held-out TDC classification tasks (it was trained on TDC and we were not; we ran this comparison deliberately to assess our zero-shot capabilities). Our DILI hallucination rate is higher than TxGemma-Chat’s, which is the cost of generating more mechanistic content per compound. Cell Whisperer leads on Tabula Sapiens cell-type identification, a task for which it is purpose-trained.

The full methodology, results, and analysis—including every case where we’re not the leader—live in our [white paper] and the [research publications].

Build the field with us

Benchmarks are infrastructure. The field needs more—designed against real R&D decisions, validated by practicing scientists, released openly.

If your team is working on a translational or clinical question that doesn’t yet have a benchmark, we’d like to talk. Several of the benchmarks above started as a customer engagement and became a public contribution.

Contact us about benchmark collaboration →

Research Publications

The methods behind the Ingenix Biological Reasoning Engine are published openly. Each preprint below covers a component of the architecture, including molecular reasoning, omics, mechanistic explainability, and multi-agent R&D workflows.

Bolek: A Multimodal Language Model for Molecular Reasoning, Grabowski et al., arXiv:2605.02745 (2026). The molecular reasoning layer: how the engine extracts and reasons over structural and physicochemical information directly from molecular embeddings. Read on arXiv →

BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine, Kinas et al., arXiv:2605.05985 (2026). The multi-agent reasoning architecture that orchestrates specialist models across translational workflows. Read on arXiv →

Simultaneous Learning from Bulk and Single-Cell Expression Data with Perceiver-Based Models, Powalski et al., MLGenX 2026. The Fun-Omics foundation—joint pretraining on bulk and single-cell data, enabling transfer across experimental scales. Read paper →

An Explainable Hypothesis-Driven Approach to Drug-Induced Liver Injury with HADES, Wiśniewski et al., arXiv:2605.02669 (2026). The mechanistic hypothesis pipeline behind the engine’s DILI predictions: ranked causal pathways with severity labels and source-grounded narratives. Read on arXiv →

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning, Sypetkowski et al., arXiv:2605.06728 (2026). The omics reasoning layer—how the engine ingests and reasons over bulk and single-cell expression natively, rather than serializing genes to text. Read on arXiv →

Peer review for the arXiv preprints is currently underway.

Case Study

Dual-Payload ADC Prioritization

The challenge

A biotech developing a next-generation dual-payload antibody-drug conjugate (ADC) needed to identify the most potent payload combination for clinical development. With thousands of possible configurations—far beyond experimental capacity—they engaged Ingenix to apply Modality Fusion to the prioritization problem.

The core question: Which payload class is most likely to deliver synergistic or additive anti-tumor activity with the target payload?

What we did

The Biological Reasoning Engine fused six specialist models—knowledge and context, resistance-mechanism reasoning, synthetic lethality, dual-drug synergy, patient response simulations, and clinical feasibility—drawing on seven data modalities spanning text, molecular, mechanistic, cellular, patient omics and clinical readouts.

The output was a ranked priority matrix of candidate payload combinations, each accompanied by a stepwise reasoning chain—including the biological principle behind the predicted synergy, the molecular events upstream and downstream, and the tumor genotypes in which the synergy was predicted to hold.

Blind expert review of the top 15 hypotheses

5 were publicly known.
2 were supported in existing literature, but not widely cited.
3 were known to the client through proprietary experiments—never published, never disclosed to Ingenix.
5 were novel hypotheses not previously considered by the client team. Of these, the client flagged 3 as actionable candidates, 1 as deprioritized in the cancer context, and 1 as target-infeasible.
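
The arithmetic behind this breakdown can be checked with a short tally. The sketch below uses only the counts reported above; the dictionary keys and variable names are illustrative labels, and grouping the first three categories as "prior support" is an assumption made here for the summary, not terminology from the engagement.

```python
from fractions import Fraction

# Counts as reported in the blind review; keys are illustrative labels.
review = {
    "publicly known": 5,
    "in literature, not widely cited": 2,
    "known only to the client": 3,
    "novel": 5,
}
novel_outcomes = {
    "actionable": 3,
    "deprioritized in the cancer context": 1,
    "target-infeasible": 1,
}

total = sum(review.values())
# Assumption: everything that is not novel counts as having prior support.
prior_support = total - review["novel"]

prior_support_share = Fraction(prior_support, total)
actionable_share = Fraction(novel_outcomes["actionable"], review["novel"])

print(f"{prior_support}/{total} top hypotheses matched existing knowledge")
print(f"{novel_outcomes['actionable']}/{review['novel']} novel hypotheses were flagged actionable")
```

Under that grouping, two thirds of the top 15 hypotheses matched knowledge that already existed somewhere, and three fifths of the novel ones were flagged as actionable.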

Why this matters

The engine recovered both the widely known hypotheses and the results the client had confirmed internally but not shared with us. Furthermore, it made no predictions that the client knew to be false. Together, these results lent the actionable novel hypotheses high credibility as candidates worth pursuing.

The two novel hypotheses deemed non-actionable were an expected cost of generative hypothesis production, and identifying them was straightforward because every prediction arrived with a reasoning trace. The client’s scientists could interrogate the mechanism behind each hit and rank for feasibility in minutes, not weeks.

Blog

Our long-form process writing lives on Substack, where members of our tech team post essays on topics such as biological reasoning, benchmarking philosophy, the architecture choices behind Modality Fusion, and what we’re learning from the field as we build.

Read and subscribe →