A deep dive into single-cell RNA sequencing foundation models

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned for specific tasks, have recently achieved unparalleled success across a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we compare the performance of these pre-trained models against a simple baseline, L1-regularised logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pre-training improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyper-parameter settings and parameter initialisations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model, in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.

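As a concrete illustration of the baseline described in the abstract, the sketch below fits an L1-regularised logistic regression cell type classifier on a cells-by-genes expression matrix using scikit-learn. The synthetic data, the preprocessing steps, and the hyper-parameters (e.g. C=0.1) are illustrative assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of the paper's baseline: L1-regularised logistic
# regression for cell type annotation. The data here is synthetic and
# the preprocessing and hyper-parameters are illustrative assumptions,
# not the pipeline from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1000, 2000)).astype(float)  # stand-in cells x genes count matrix
y = rng.integers(0, 5, size=1000)                      # stand-in cell-type labels (5 types)

# Common scRNA-seq preprocessing: library-size normalisation to 10,000
# counts per cell, followed by a log1p transform.
X = np.log1p(X / X.sum(axis=1, keepdims=True) * 1e4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The L1 penalty drives most gene coefficients to zero, so the model
# selects a sparse set of marker genes; the saga solver supports L1
# with multinomial logistic regression.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```

The few-shot setting studied in the paper amounts to fitting the same model on only a handful of labelled cells per type, which is why such a simple, sparse baseline is a meaningful point of comparison for large pre-trained models.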
Details

Author(s): David Sontag
Publication date: 23 October 2023
Source: bioRxiv
Related programme: MIT Jameel Clinic

More publications

Generative AI in the era of 'alternative facts' | MIT Open Publishing Services
External data and AI are making each other more valuable | Harvard Business Review Press
Removing biases from molecular representations via information maximisation | arXiv
Effective human-AI teams via learned natural language rules and onboarding | arXiv
A deep dive into single-cell RNA sequencing foundation models | bioRxiv
Antibiotic identified by AI | Nature
LLM-grounded video diffusion models | arXiv
Successful development of a natural language processing algorithm for pancreatic neoplasms and associated histologic features | Pancreas
Leveraging artificial intelligence in the fight against infectious diseases | Science
BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences | Cell Systems
Conformal language modeling | arXiv
Comparison of mammography AI algorithms with a clinical risk model for 5-year breast cancer risk prediction: An observational study | Radiological Society of North America
Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii | Nature
Algorithmic pluralism: A structural approach towards equal opportunity | arXiv
Artificial intelligence and machine learning in lung cancer screening | ScienceDirect
Wide and deep neural networks achieve consistency for classification | PNAS
Autocatalytic base editing for RNA-responsive translational control | Nature
DiffDock: Diffusion steps, twists and turns for molecular docking | arXiv
Sybil: A validated deep learning model to predict future lung cancer risk from a single low-dose chest computed tomography | Journal of Clinical Oncology
Sequential multi-dimensional self-supervised learning for clinical time series | Proceedings of Machine Learning Research
Queueing theory: Classical and modern methods | Dynamic Ideas
Toward robust mammography-based models for breast cancer risk | Science
The age of AI: And our human future | Little, Brown and Company
Uniform priors for data-efficient transfer | arXiv
Machine learning under a modern optimisation lens | Dynamic Ideas
The marginal value of adaptive gradient methods in machine learning | Advances in Neural Information Processing Systems
Efficient graph-based image segmentation | International Journal of Computer Vision
