Projects

rBio

Training reasoning LLMs for biology using world models as soft verifiers — learning from simulations rather than new wet-lab data. Demonstrates strong performance on perturbation prediction (PerturbQA) and shows how compositions of verifiers can yield more general models.

Paper: bioRxiv preprint
Blog post: CZI — rBio: Reasoning AI model
GitHub: czi-ai/rbio
Quickstart: Virtual Cell Models
Press: VentureBeat feature
Conference: AI4D³ 2025 (oral)

Preprint | Quickstart | GitHub Repo

scGenePT

Multimodal single-cell modeling that combines language-based gene representations with scRNA-seq to improve perturbation prediction and interpretability; integrates textual priors (e.g., gene/protein summaries, GO annotations) with biologically learned embeddings.

Paper: bioRxiv preprint
Model: scGenePT | v1.0 (CZI Virtual Cell Platform)
Press: GenomeWeb coverage
Highlight: CZI article feature

Preprint | Quickstart | GitHub Repo

CZ Software Mentions Dataset

One of the largest datasets of software mentions mined from biomedical papers. I worked on the disambiguation and linking algorithms for software mentions extracted from scientific research articles and led the overall technical effort.

Paper: arXiv preprint
Blog post: New data reveals the hidden impact of open source in science
Nature feature: Hunting for the best bioscience software tool? Check this database

Dataset Link | GitHub Repo

Open Global Data Citation Corpus

I worked on the ML algorithms. In particular, I built a Named Entity Recognition (NER) algorithm to extract mentions of datasets (accession IDs, dataset DOIs) from biomedical research articles.

Project description: Blog post
Slides: Building the Open Global Data Citation Corpus — Chan Zuckerberg Initiative
Talk video: YouTube link

GitHub Repo

Global Biodata Resource Inventory

I developed ML algorithms to extract mentions of biodata resources from scientific research articles — using article classification, NER, and metadata pipelines to build a global inventory of life-sciences data resources described in the literature.

GitHub Repo