Research

Publications

rbio1 — training scientific reasoning LLMs with biological world models as soft verifiers

Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub M. Tomczak, Michaela Torkar, Donghui Li, Theofanis Karaletsos

Abstract: Reasoning Models are typically trained against verification mechanisms in formally specified systems such as code or symbolic math. However, in open domains like biology, we do not generally have access to exact rules facilitating formal verification at scale, and oftentimes resolve to testing hy-potheses in the lab to assess the validity of a prediction. Verification by performing real experiments is slow, expensive, and inherently does not scale with computation. In this work, we show that one can use world models of biology or other prior knowledge as approximate oracles over biological knowledge to utilize as soft verification to train reasoning systems without the need for additional experimental data. We introduce rbio1, a reasoning model for biology that is post-trained from a pretrained LLM using reinforcement learning and uses learned models of biology to obtain biological knowledge for verification during training. We show that soft verification successfully distills biology world models into rbio, at the example of achieving leading performance on perturbation prediction against the PerturbQA benchmark compared to state-of-the-art models; we demonstrate the benefits of compositions of verifiers to learn more general rbio models. We believe rbio provides a proof of concept that demonstrates that predictions from bio-models can be used to train powerful reasoning models using simulations, rather than experimental data, as a new training paradigm.

bioRxiv Preprint | CZI Blog Post | GitHub | Quickstart | VentureBeat Coverage

Accepted as Oral Presentation at AI Virtual Cells and Instruments: A New Era in Drug Discovery and Development Workshop at NeurIPS 2025

A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model

James D. Pearce, Sara E. Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan, Giovanni Palla, Ana-Maria Istrate, Alexander Tarashansky, Benjamin Nelson, Omar Valenzuela, Donghui Li, others

Abstract: Single-cell transcriptomics has revolutionized our understanding of cellular diversity, yet our understanding of the transcriptional programs across the tree of life remains limited. Here we present TranscriptFormer, a family of generative foundation models trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. By jointly modeling gene identities and expression levels using a novel generative architecture, TranscriptFormer encodes multi-scale biological structure, functioning as a queryable virtual cell atlas. We demonstrate state-of-the-art performance on both in-distribution and out-of-distribution cell type classification, with robust performance even for species separated by over 685 million years of evolution. TranscriptFormer can also perform zero-shot disease state identification in human cells and accurately transfers cell state annotations across species boundaries. As a generative model, TranscriptFormer can be prompted to predict cell type-specific transcription factors and gene-gene interactions that align with independent experimental observations. Developmental trajectories, phylogenetic relationships and cellular hierarchies emerge naturally in TranscriptFormer’s representations without any explicit training on these annotations. This work establishes a powerful framework for quantitative single-cell analysis, and comparative cellular biology, thus demonstrating that universal principles of cellular organization can be learned and predicted across the tree of life.

bioRxiv Preprint

CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data

CZI Cell Science Program and Abdulla, Shibla … (includes Ana-Maria Istrate)

Abstract: Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets building on recent advances in large language models and other machine-learning approaches pose exciting new directions to model and extract insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets, data models and accessibility remains a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable single-cell data. Available via a free-to-use online data portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and growing rapidly via community contributions. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces to allow researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.

Journal page

scGenePT: Is language all you need for modeling single-cell perturbations?

Ana-Maria Istrate, Donghui Li, Theofanis Karaletsos

Abstract: Modeling single-cell perturbations is a crucial task in the field of single-cell biology. Predicting the effect of up or down gene regulation or drug treatment on the gene expression profile of a cell can open avenues in understanding biological mechanisms and potentially treating disease. Most foundation models for single-cell biology learn from scRNA-seq counts, using experimental data as a modality to generate gene representations. Similarly, the scientific literature holds a plethora of information that can be used in generating gene representations using a different modality - language - as the basis. In this work, we study the effect of using both language and experimental data in modeling genes for perturbation prediction. We show that textual representations of genes provide additive and complementary value to gene representations learned from experimental data alone in predicting perturbation outcomes for single-cell data. We find that textual representations alone are not as powerful as biologically learned gene representations, but can serve as useful prior information. We show that different types of scientific knowledge represented as language induce different types of prior knowledge. For example, in the datasets we study, subcellular location helps the most for predicting the effect of single-gene perturbations, and protein information helps the most for modeling perturbation effects of interactions of combinations of genes. We validate our findings by extending the popular scGPT model, a foundation model trained on scRNA-seq counts, to incorporate language embeddings at the gene level. We start with NCBI gene card and UniProt protein summaries from the genePT approach and add gene function annotations from the Gene Ontology (GO). We name our model “scGenePT”, representing the combination of ideas from these two models. Our work sheds light on the value of integrating multiple sources of knowledge in modeling single-cell data, highlighting the effect of language in enhancing biological representations learned from experimental data.

bioRxiv Preprint | Model (CZI Virtual Cell Platform) | GenomeWeb Coverage | CZI Article

A machine learning-enabled open biodata resource inventory from the scientific literature

Heidi J. Imker, Kenneth E. Schackart III, Ana-Maria Istrate, Charles E. Cook

Abstract: Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled incredible research, sustained support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure; how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, and we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources were then obtained using article metadata, such as funder and geolocation information. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011–2021. The code was developed to facilitate reuse and includes automated pipelines. All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).

PLOS ONE

A large dataset of software mentions in the biomedical literature

Ana-Maria Istrate, Donghui Li, Dario Taraborelli, Michaela Torkar, Boris Veytsman, Ivana Williams

Abstract: We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and from papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it on the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97600 unique software entities, covering 78% of all software-paper links. We link 185000 of the mentions to a repository, covering about 55% of all software-paper links. We describe in detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.

arXiv Preprint | PDF Link | CZI Blog Post | Nature Coverage

Scientific Software Citation Intent Classification Using Large Language Models

Ana-Maria Istrate, Joshua Fisher, Xinyu Yang, Kara Moraw, Kai Li, Donghui Li, Martin Klein

Abstract: Software has emerged as a crucial tool in the current research ecosystem, frequently referenced in academic papers for its application in studies or the introduction of new software systems. Despite its prevalence, there remains a significant gap in understanding how software is cited within the scientific literature. In this study, we offer a conceptual framework for studying software citation intent and explore the use of large language models, such as BERT-based models, GPT-3.5, and GPT-4 for this task. We compile a representative software-mention dataset by merging two existing gold standard software mentions datasets and annotating them to a common citation intent scheme. This new dataset makes it possible to analyze software citation intent at the sentence level. We observe that in a fine-tuning setting, large language models can generally achieve an accuracy of over 80% on software citation intent classification on unseen, challenging data. Our research paves the way for future empirical investigations into the realm of research software, establishing a foundational framework for exploring this under-examined area.

Springer Chapter

Unsupervised pre-training for biomedical question answering

Vaishnavi Kommaraju, Karthick Gunasekaran, Kun Li, Trapit Bansal, Andrew McCallum, Ivana Williams, Ana-Maria Istrate

Abstract: We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task from unlabeled data designed to reason about biomedical entities in the context. Our pre-training method consists of corrupting a given context by randomly replacing some mention of a biomedical entity with a random entity mention and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that helps QA tasks and minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed pre-training task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.

arXiv Preprint

Predicting the star rating of a business on Yelp using graph convolutional neural networks

Ana-Maria Istrate

Abstract: Social media platforms have been rising steadily in recent years, influencing consumer spaces as a whole and individual users alike. Users also have the power of influencing the popularity of businesses or products on these platforms, driving the success level of different entities. Hence, understanding users' behavior is useful for businesses that want to cater to users' needs and know what market segment to direct efforts towards. In this paper, we are looking at how the star rating of a business on Yelp is determined by the profile of users who have rated it with a high score on Yelp. We are defining a graph between users on Yelp and businesses they have high rtings to, and using graph convolutional neural networks to find node embeddings for businesses, by aggregating information from the users they are connected to. We show how a business's start rating can be predicted by aggregating local information about a business's neighborhood in the Yelp graph, as well as information about the business itself.

Report Link

A Personal Conversational Model

Lawrence Lin Murata, Ana-Maria Istrate

Abstract: NLP Model that takes in a human utterance as input and uses a Support Vector Machine (SVM) with a linear kernel to generate a machine response word-by-word. Created for the CS224N Stanford Class Final Project.

Report Link