Sachin Gururangan

𝕏 Twitter | 💻 GitHub | 📧 Email | 🎓 Google Scholar

I am a member of the technical staff at Anthropic. Previously, I was a senior research scientist on the Llama team at Meta GenAI. I received my PhD in Computer Science at the University of Washington. During my graduate studies, I was supported by the 2022 Bloomberg PhD Fellowship, was a visiting researcher at FAIR, and was a predoctoral resident at AI2.

Blog Posts

PhD Fellowship Proposal Advice
April 27, 2023
A guide for writing successful PhD fellowship proposals, covering strategy, writing tips, and common pitfalls to avoid based on my experience with the Bloomberg PhD Fellowship.
Personal Statement Advice
September 1, 2020
Advice for writing compelling personal statements for graduate school applications, including structure, content, and how to effectively communicate your research interests and experiences.

Publications

2025

BTS: Harmonizing Specialized Experts into a Generalist LLM
Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan*, Sachin Gururangan*, Mike Lewis* (*joint last author)
We introduce Branch-Train-Stitch (BTS), a novel training method for large language models (LLMs) that asynchronously trains multiple expert models on specialized domains and periodically synchronizes them into a unified generalist model. This approach addresses the challenge of training LLMs that excel across diverse domains while maintaining computational efficiency.

2024

Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Sachin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
We propose a novel approach to improve reward modeling for language models by training them to generate their own critiques. By augmenting the training data with self-generated critiques and their evaluations, we significantly enhance the reward model's ability to distinguish between high and low-quality responses.
The Llama 3 Herd of Models
Llama Team
This paper introduces Llama 3, a new generation of state-of-the-art open large language models. We describe the design principles, training methodology, and comprehensive evaluation of models ranging from 8B to 405B parameters, demonstrating strong performance across diverse tasks.
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Sachin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar
DataComp-LM is a benchmark and competition for improving language model training datasets. We provide a standardized framework for dataset experiments, enabling controlled comparisons of data curation strategies. Our baseline experiments reveal actionable insights for improving pretraining data quality.
Language models scale reliably with over-training and on downstream tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Sachin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
We study the scaling behavior of language models when trained beyond the compute-optimal point. Our experiments show that over-trained models follow predictable scaling laws and that downstream task performance can be reliably predicted from pretraining loss, enabling better resource allocation decisions.
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia, Sadhika Malladi, Sachin Gururangan, Sanjeev Arora, Danqi Chen
LESS is a data selection method that identifies the most influential training examples for instruction tuning. By selecting only 5% of the data that most improves performance on a target task, LESS achieves comparable or better performance than training on the full dataset while being computationally efficient.
Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models
Terra Blevins, Tomasz Limisiewicz, Sachin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer
We propose cross-lingual expert language models to overcome the curse of multilinguality. By training separate experts for different languages and sharing parameters strategically, we achieve better performance than multilingual models while maintaining cross-lingual capabilities.
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Li Lucy, Sachin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge
We analyze how common data filters affect different demographic groups by examining self-descriptions in web pages. Our findings reveal that quality filters disproportionately remove content from and about marginalized communities, raising concerns about representation in training data.

2023

OpenLM
Sachin Gururangan*, Mitchell Wortsman*, Samir Yitzhak Gadre, Achal Dave, Maciej Kilian, Weijia Shi, Jean Mercat, Georgios Smyrnis, Gabriel Ilharco, Matt Jordan, Reinhard Heckel, Alex Dimakis, Ali Farhadi, Vaishaal Shankar, Ludwig Schmidt (*equal contribution)
OpenLM is an open-source language modeling framework designed for research. We provide efficient implementations, training recipes, and evaluation tools to enable reproducible research on language models at various scales.
Time is Encoded in the Weights of Finetuned Language Models
Kai Nylund, Sachin Gururangan, Noah A. Smith
We discover that finetuned language models encode temporal information in their weights, allowing them to be dated based on their knowledge cutoff. This finding has implications for model versioning, temporal reasoning, and understanding how models represent time.
SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore
ICLR 2024, RegML 2024
⭐ Outstanding Paper Award at RegML 2024 Workshop ⭐
Sewon Min*, Sachin Gururangan*, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer (*equal contribution)
SILO is a new language model architecture that manages legal risk by separating the parametric model from a nonparametric datastore. The model is trained only on permissively licensed data, while the datastore can include more diverse content, enabling effective performance while maintaining clear data provenance.
Scaling Expert Language Models with Unsupervised Domain Discovery
JMLR 2024
Sachin Gururangan*, Margaret Li*, Mike Lewis, Weijia Shi, Tim Althoff, Noah A. Smith, Luke Zettlemoyer (*equal contribution)
We present a method for automatically discovering domains in large text corpora and training specialized expert models for each domain. Our approach uses unsupervised clustering to identify coherent domains, then trains experts that can be efficiently composed at inference time to handle diverse inputs.
Editing Models with Task Arithmetic
ICLR 2023
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Sachin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi
We introduce task arithmetic, a simple method for editing models by adding and subtracting task-specific weight updates. This enables model capabilities to be combined, removed, or modified through arithmetic operations on weight vectors.
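The core operation is simple enough to sketch in a few lines. This is a minimal illustration, not the paper's released code: models are stood in for by dicts of numpy arrays, and all names here (`task_vector`, `apply_vectors`) are illustrative.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """A task vector is the finetuned weights minus the pretrained weights."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_vectors(pretrained, vectors, coeffs):
    """Edit a model by adding scaled task vectors to the pretrained weights.

    Positive coefficients add a capability; negative coefficients remove one.
    """
    edited = {k: v.copy() for k, v in pretrained.items()}
    for vec, lam in zip(vectors, coeffs):
        for k in edited:
            edited[k] += lam * vec[k]
    return edited

# Toy example: add "task A", negate "task B".
pre = {"w": np.zeros(3)}
ft_a = {"w": np.array([1.0, 0.0, 0.0])}
ft_b = {"w": np.array([0.0, 1.0, 0.0])}
edited = apply_vectors(
    pre, [task_vector(pre, ft_a), task_vector(pre, ft_b)], [1.0, -1.0]
)
```

In practice the same arithmetic is applied to every parameter tensor of a real checkpoint, with the scaling coefficients tuned on held-out data.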

2022

lo-fi: distributed fine-tuning without communication
TMLR
Mitchell Wortsman, Sachin Gururangan, Shen Li, Ali Farhadi, Ludwig Schmidt, Michael Rabbat, Ari S. Morcos
lo-fi enables distributed fine-tuning of large models without communication between workers. Each worker fine-tunes independently on local data, and the resulting models are merged to create a single performant model, dramatically reducing communication costs.
M2D2: A Massively Multi-Domain Language Modeling Dataset
EMNLP 2022
Machel Reid, Victor Zhong, Sachin Gururangan, Luke Zettlemoyer
M2D2 is a massive multi-domain dataset for language modeling research, containing text from hundreds of diverse domains. This resource enables research on domain adaptation, multi-domain modeling, and understanding how language varies across different contexts.
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
EMNLP 2022
Sachin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Wang, Blarry Wang, Luke Zettlemoyer, and Noah A. Smith
We examine the implicit language ideologies embedded in 'quality' filters used for curating pretraining data. Our analysis reveals systematic biases against certain language varieties and populations, highlighting how data curation practices can perpetuate linguistic discrimination in language models.
kNN-Prompt: Nearest Neighbor Zero-Shot Inference
EMNLP 2022
Weijia Shi, Julian Michael, Sachin Gururangan, and Luke Zettlemoyer
kNN-Prompt enables zero-shot inference by retrieving nearest neighbors from a prompt datastore. This approach improves performance on various tasks without fine-tuning, leveraging similarity in the prompt space to make better predictions.
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Margaret Li*, Sachin Gururangan*, Tim Dettmers, Mike Lewis, Noah A. Smith, and Luke Zettlemoyer (*equal contribution)
Branch-Train-Merge enables embarrassingly parallel training of expert language models. Multiple experts are trained independently on different data subsets, then merged to create a unified model that combines their specialized knowledge.
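One simple merge choice is uniform parameter averaging across experts with the same architecture; a minimal sketch (toy numpy weight dicts, illustrative names, not the paper's code, which also supports ensembling the experts at inference time):

```python
import numpy as np

def merge_experts(experts):
    """Average each parameter across independently trained expert models.

    All experts must share the same architecture (same parameter names/shapes),
    since they were all branched from one seed model.
    """
    keys = experts[0].keys()
    return {k: np.mean([e[k] for e in experts], axis=0) for k in keys}

# Two toy "experts" trained on different data subsets.
expert_1 = {"w": np.array([2.0, 4.0])}
expert_2 = {"w": np.array([4.0, 0.0])}
merged = merge_experts([expert_1, expert_2])
```

Because each expert trains with no communication to the others, the branch and train phases are embarrassingly parallel; only the final merge touches all checkpoints.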
Time Waits for No One! Analysis and Challenges of Temporal Misalignment
NAACL 2022
Kelvin Luu, Daniel Khashabi, Sachin Gururangan, Karishma Mandyam, and Noah A. Smith
We analyze temporal misalignment in NLP systems, where training and deployment data come from different time periods. Our work reveals how this misalignment affects model performance and proposes methods to detect and mitigate temporal distribution shifts.
DEMix Layers: Disentangling Domains for Modular Language Modeling
NAACL 2022
Sachin Gururangan, Mike Lewis, Ari Holtzman, Noah A. Smith, and Luke Zettlemoyer
DEMix introduces modular transformer layers that learn to specialize on different domains without explicit supervision. By disentangling domain-specific and domain-general representations, DEMix enables efficient multi-domain modeling and controlled generation within specific domains.

2021

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
ACL 2021
⭐ Outstanding Paper Award ⭐
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Sachin Gururangan, and Noah A. Smith
We critically examine human evaluation practices for generated text, revealing systematic biases and inconsistencies. Our experiments show that human evaluators often can't distinguish human from machine text, calling into question common evaluation assumptions.
Expected Validation Performance and Estimation of a Random Variable's Maximum
Jesse Dodge, Sachin Gururangan, Roy Schwartz, Dallas Card, and Noah A. Smith
We develop statistical methods for estimating expected validation performance and the maximum of a random variable in machine learning contexts. This helps researchers better understand and report the uncertainty in their experimental results.
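A standard estimator in this line of work treats the n trials as draws with replacement from the empirical distribution of N observed scores; a minimal sketch (toy scores, illustrative function name):

```python
import numpy as np

def expected_max(scores, n):
    """Expected maximum validation score after n random trials,
    estimated from N observed scores (draws with replacement from the
    empirical distribution).

    Uses P(max of n draws = v_(i)) = (i/N)^n - ((i-1)/N)^n for the
    sorted scores v_(1) <= ... <= v_(N).
    """
    v = np.sort(np.asarray(scores, dtype=float))
    N = len(v)
    i = np.arange(1, N + 1)
    probs = (i / N) ** n - ((i - 1) / N) ** n
    return float(np.dot(probs, v))

scores = [0.70, 0.72, 0.75, 0.80]  # e.g. one score per hyperparameter trial
curve = [expected_max(scores, n) for n in (1, 2, 4)]
```

Plotting this curve against n shows expected best performance as a function of tuning budget, which makes comparisons between methods fairer than reporting a single best score.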
Detoxifying Language Models Risks Marginalizing Minority Voices
NAACL 2021
Albert Xu, Eshaan Pathak, Eric Wallace, Sachin Gururangan, Maarten Sap, and Dan Klein
We show that current approaches to detoxifying language models can disproportionately silence minority voices. Toxicity classifiers often mislabel minority identity mentions as toxic, leading to censorship of marginalized communities' perspectives.

2020

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
EMNLP Findings 2020
Sam Gehman, Sachin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith
RealToxicityPrompts is a dataset of 100k naturally occurring prompts for evaluating toxic degeneration in language models. We systematically analyze various models and decoding strategies, revealing the pervasiveness of toxic outputs even from seemingly innocuous prompts.
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
ACL 2020
⭐ Honorable Mention for Best Overall Paper ⭐
Sachin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith
We show that continued pretraining on domain-specific unlabeled data (domain-adaptive pretraining) followed by task-specific pretraining (task-adaptive pretraining) leads to significant performance gains. This simple approach achieves state-of-the-art results on various tasks across biomedical, computer science, news, and review domains.

2019

Variational Pretraining for Semi-supervised Text Classification
ACL 2019
Sachin Gururangan, Tam Dang, Dallas Card, and Noah A. Smith
We introduce VAMPIRE, a variational pretraining approach for semi-supervised text classification. By learning variational autoencoders on unlabeled data, we create better representations that improve downstream classification with limited labels.
Show Your Work: Improved Reporting of Experimental Results
EMNLP 2019
Jesse Dodge, Sachin Gururangan, Roy Schwartz, Dallas Card, and Noah A. Smith
We advocate for better reporting practices in NLP experiments, including reporting results from multiple random seeds and expected validation performance. We provide tools and recommendations to help researchers report more reliable and reproducible results.
Emergent coordination underlying learning to reach to grasp with a brain-machine interface
Journal of Neurophysiology
with many authors 🙂
We study how neural populations coordinate during brain-machine interface learning for reach-to-grasp tasks. Our findings reveal emergent coordination patterns that develop as subjects learn to control prosthetic devices through neural activity.

2018

Annotation Artifacts in Natural Language Inference Data
NAACL 2018
Sachin Gururangan*, Swabha Swayamdipta*, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith (*equal contribution)
We show that natural language inference datasets contain annotation artifacts that allow models to perform well without understanding the relationship between premise and hypothesis. Models trained only on hypotheses achieve surprisingly high accuracy, revealing systematic biases that can be exploited.
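The hypothesis-only baseline is easy to sketch. The snippet below is a toy stand-in (a word-level voting classifier on made-up examples; the paper uses real classifiers on the actual NLI datasets), but it shows the idea: the premise is never consulted, yet surface cues in the hypothesis, such as negation words correlating with contradiction, still predict the label.

```python
from collections import Counter, defaultdict

def train(hypotheses, labels):
    """Count how often each hypothesis token co-occurs with each label."""
    token_label = defaultdict(Counter)
    for text, label in zip(hypotheses, labels):
        for tok in text.lower().split():
            token_label[tok][label] += 1
    return token_label

def predict(model, hypothesis, default="entailment"):
    """Each known token votes for its most frequent training label."""
    votes = Counter()
    for tok in hypothesis.lower().split():
        if tok in model:
            votes[model[tok].most_common(1)[0][0]] += 1
    return votes.most_common(1)[0][0] if votes else default

# Toy artifact: negation words ("nobody", "not") co-occur with contradiction.
model = train(
    ["a man is sleeping", "nobody is outside", "a dog is not running"],
    ["neutral", "contradiction", "contradiction"],
)
pred = predict(model, "nobody is sleeping")  # no premise used at all
```

That a model with no access to the premise beats the majority-class baseline is the diagnostic: the dataset, not genuine inference, is supplying the signal.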

2014

Analysis of Graph Invariants in Functional Neocortical Circuitry Reveals Generalized Features Common to Three Areas of Sensory Cortex
PLOS Computational Biology 2014
Sachin Gururangan, Alex Sadovsky, and Jason MacLean
We analyze graph-theoretic properties of functional neural circuits across three sensory cortical areas. Our analysis reveals common organizational principles and invariant features that characterize information processing in sensory cortex.