Open-source efforts I'm contributing to
Training reasoning LLMs for biology using world models as soft verifiers — learning from simulations rather than new wet-lab data. Demonstrates strong performance on perturbation prediction (PerturbQA) and shows how compositions of verifiers can yield more general models.
Multimodal single-cell modeling that combines language-based gene representations with scRNA-seq to improve perturbation prediction and interpretability; integrates textual priors (e.g., gene/protein summaries, GO annotations) with biologically learned embeddings.
One of the largest datasets of software mentions mined from biomedical papers. I worked on disambiguation and linking algorithms for software mentions extracted from scientific research articles, and led the overall technical effort.
I worked on the ML algorithms, in particular a Named Entity Recognition (NER) algorithm that extracts mentions of datasets (accession IDs, dataset DOIs) from biomedical research articles.
I developed ML algorithms to extract mentions of biodata resources from scientific research articles — using article classification, NER, and metadata pipelines to build a global inventory of life-sciences data resources described in the literature.