Omar Shaikh

Speculative Prefilling

2026-03-19T00:00:00+00:00

If you know what a user is gonna do next, you can prefill your KV cache with candidate contexts and significantly improve your time-to-first-token.

The Interaction

I was revisiting an interaction idea I had a while back: if we have a great model of user context, then we can effectively build an “everywhere” tab to autocomplete. I called this system Tabracadabra 🎉. Here’s the original demo video:

Pretty cool, huh? You can download and try it here.

The Problem

Now what I kinda hid from you here is that the original clip is sped up quite a bit. That spinner spins for quite some time before you see anything. The time you spend watching that spinner is known as the time to first token (TTFT for short). You might also notice that once the first token appears, the actual decoding speed is not so bad!

What exactly is going on when the spinner is running? This is the prefill phase. Here, the model processes all input tokens in parallel, computing the keys and values (the KV cache) for every token. Unlike the decode step, we have all the inputs up front, so we can compute the full attention matrix in one pass. Awesome!

In theory, this should be a LOT faster than decoding, right? We can parallelize the prefill phase, but not the decode phase! So what’s going on?

The catch is that, in the Tabracadabra setting, the context is an order of magnitude larger than what’s generated. For that demo example, I’m retrieving all of my past emails, screenshots of me in similar settings, and so on. We end up in a setting where the context can be 20x-30x larger than the text that’s actually generated, so even a fast parallel prefill has a lot of tokens to get through before anything shows up.

On top of that, the actual decode speed isn’t really a big interaction bottleneck. Users can interrupt decoding early and work from a partial autocomplete, but we can’t skip parts of the context the same way. And the completion animation looks cool!
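To make the asymmetry concrete, here is a back-of-the-envelope sketch. The throughput numbers are made-up assumptions for illustration (not measurements from Tabracadabra), and `latency_breakdown` is a hypothetical helper. The context is set to 25x the generated text, inside the 20x-30x range mentioned above.

```python
def latency_breakdown(context_tokens, generated_tokens, prefill_tps, decode_tps):
    """Estimate the silent wait (TTFT) versus the streamed decode time, in seconds."""
    ttft = context_tokens / prefill_tps          # nothing is visible during prefill
    decode_time = generated_tokens / decode_tps  # tokens stream in as they are produced
    return {"ttft_s": ttft, "decode_s": decode_time}


# Assumed numbers: prefill runs 20x faster per token than decode,
# but the context is 25x larger than the completion.
estimate = latency_breakdown(
    context_tokens=2500,
    generated_tokens=100,
    prefill_tps=1000,  # assumed prefill throughput (tokens/second)
    decode_tps=50,     # assumed decode throughput (tokens/second)
)
print(estimate)  # {'ttft_s': 2.5, 'decode_s': 2.0}
```

Under these assumptions the user stares at a spinner for longer than the whole completion takes to stream, even though prefill is much faster per token.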

Speculative Prefilling

Here’s the idea: if we have a user model that can predict what a user will do next, we can use those predictions to retrieve relevant context ahead of time—and prefill the KV cache before the user ever triggers autocomplete. This is the same intuition behind speculative decoding (pre-generate tokens that might get used, and accept or reject them later) except here we’re speculating over the prefill rather than the decode. That’s it! Here’s a comparison video:

The time to first token here is effectively nothing. To get here, we need an aside on user models:

User Models

We define user models as models that are predictive: they tell us what a user will do next. How do we get a good user model? Under the hood, we use ideas from two of my papers on user models (see GUM and NAP). In NAP specifically, we introduce LongNAP, a user model that predicts what a user will do next given their full multimodal interaction history (screenshots, keystrokes, clicks). LongNAP operates in two phases:

LongNAP first reasons about the user's current context to retrieve relevant memories, then uses those memories to predict concrete next actions.

Phase 1: Reasoning to Retrieve. Given the user’s recent context—say, they just opened a set of paper reviews—LongNAP first generates a reasoning trace about what might come next (e.g., “Received reviews on paper with collaborators… user may revise paper after viewing feedback”). This trace then serves as a query to retrieve relevant entries from a memory of the user’s past observations and reasoning. In our example, the retriever might surface past traces like “Procrastinates heavily on paper writing” and “Prefers using Slack to collaborate.”

Phase 2: Reasoning to Predict. With retrieved context in hand, LongNAP revises its initial reasoning and makes a concrete prediction. The retrieved traces about this user’s tendency to delegate allow the model to go from “user may revise paper” to “user will message coauthors to divide tasks, check which experiments have been run”—and predict concrete next actions like opening Slack and scrolling through Weights & Biases.
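Here is a minimal sketch of the two phases as a prompted scaffold. Everything in it is an assumption for illustration: `reason` and `revise` are stubs standing in for model calls, and the word-overlap retriever is a toy replacement for whatever retriever LongNAP actually uses.

```python
from typing import Callable, List


def tokenize(text: str) -> set:
    # Lowercased whitespace split; a real retriever would likely use embeddings.
    return set(text.lower().split())


def retrieve(query: str, memory: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank memory entries by Jaccard word overlap with the query."""
    q = tokenize(query)

    def score(entry: str) -> float:
        e = tokenize(entry)
        union = q | e
        return len(q & e) / len(union) if union else 0.0

    return sorted(memory, key=score, reverse=True)[:k]


def predict_next_action(
    context: str,
    memory: List[str],
    reason: Callable[[str], str],
    revise: Callable[[str, str, List[str]], str],
    k: int = 2,
) -> str:
    # Phase 1: reason about the context, then use the trace as a retrieval query.
    trace = reason(context)
    retrieved = retrieve(trace, memory, k)
    # Phase 2: revise the reasoning with the retrieved memories and predict.
    return revise(context, trace, retrieved)


memory = [
    "Orders lunch around noon",
    "Procrastinates heavily on paper writing",
    "Prefers using Slack to collaborate",
]
# Stubs standing in for model calls (hypothetical, for illustration only).
reason = lambda ctx: "Received reviews on paper with collaborators, user may revise paper"
revise = lambda ctx, trace, mems: f"{trace} | using: {mems[0]}"

prediction = predict_next_action("opened paper reviews", memory, reason, revise, k=1)
print(prediction)
```

The important design point is that the reasoning trace, not the raw context, is the retrieval query. That is the piece speculative prefilling reuses.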

You can use a prompted scaffold to implement this. In our paper though, the whole pipeline is trained end-to-end via reinforcement learning (GRPO). Since we’re predicting what a user will do, we can just wait and see if they actually do it, scoring predictions against ground truth as a reward (again, more details in the paper).
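For intuition on the GRPO part, here is a sketch of the group-relative advantage that GRPO is generally built around: sample several predictions for the same context, score each one, and normalize the scores within the group. The scores below are invented, and the exact reward and normalization used in the paper may differ, so treat this as an assumption-laden illustration rather than the training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0-1 scores for four sampled predictions of the same next action.
scores = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(scores)
print(advantages)
```

Predictions that score above the group average get a positive advantage and are reinforced; those below get pushed down.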

The Algorithm

With LongNAP in hand, the actual algorithm here is pretty simple. The “reasoning to retrieve” mechanism from Phase 1 is exactly what we need for speculative prefilling. We just run Phase 1 whenever a user’s context switches.

Algorithm: Speculative Prefilling

On context change (e.g. you open an app or switch a tab), run the reasoning-to-retrieve phase (Phase 1) of the user model p_θ: generate a reasoning trace about what the user will do next, and use it to retrieve relevant context from the user's history.
Prefill the KV cache with the retrieved context.
On autocomplete trigger, decode immediately (TTFT ≈ 0), or first append a small amount of fresh context.

That’s it. The user model tells us what context to fetch; we just need to fetch it before the user asks. As the user’s context shifts, we re-run Phase 1, keeping the prefilled context relevant.
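The three steps can be sketched as a small event-driven class. This is a toy under stated assumptions: a plain token list stands in for the KV cache, whitespace splitting stands in for a tokenizer, and `SpeculativePrefiller`, `reason_to_retrieve`, and `fake_phase_one` are hypothetical names rather than anything from Tabracadabra.

```python
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    # Whitespace split stands in for a real tokenizer.
    return text.split()


class SpeculativePrefiller:
    """Toy model of the three steps; a token list stands in for the KV cache."""

    def __init__(self, reason_to_retrieve: Callable[[str], List[str]]):
        # Hypothetical hook for Phase 1 of the user model.
        self.reason_to_retrieve = reason_to_retrieve
        self.prefilled: List[str] = []

    def on_context_change(self, observation: str) -> None:
        # Steps 1 and 2: retrieve likely-relevant context, then prefill it
        # before the user asks for anything. Re-running replaces the old cache.
        retrieved = self.reason_to_retrieve(observation)
        self.prefilled = [tok for doc in retrieved for tok in tokenize(doc)]

    def on_trigger(self, fresh_context: str) -> dict:
        # Step 3: only the small fresh suffix still needs a prefill pass.
        suffix = tokenize(fresh_context)
        return {
            "prompt": self.prefilled + suffix,
            "tokens_to_prefill_now": len(suffix),
            "tokens_already_cached": len(self.prefilled),
        }


def fake_phase_one(observation: str) -> List[str]:
    # Stub retriever: a real system would query the user's history.
    return ["past email thread about the rebuttal", "notes on reviewer two"]


prefiller = SpeculativePrefiller(fake_phase_one)
prefiller.on_context_change("user opened paper reviews")
result = prefiller.on_trigger("Dear reviewers,")
print(result["tokens_to_prefill_now"], result["tokens_already_cached"])  # 2 10
```

In a real serving stack, the token list would be replaced by whatever prefix or KV caching mechanism the inference engine offers. The sketch only shows where the work moves: out of the trigger path and into the context-change path.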

Systems and User Models

A small aside: I think we can view some of these user models as general-purpose human branch predictors. I’ve applied this idea to LLMs here, but you could use a user model anywhere an application might benefit from speculative execution!

Anyway, if you found this interesting, please consider citing:

NAP (Next Action Prediction):

@misc{shaikh2026learningactionpredictorshumancomputer,
  title={Learning Next Action Predictors from Human-Computer Interaction},
  author={Omar Shaikh and Valentin Teutschbein and Kanishk Gandhi and Yikun Chi and Nick Haber and Thomas Robinson and Nilam Ram and Byron Reeves and Sherry Yang and Michael S. Bernstein and Diyi Yang},
  year={2026},
  eprint={2603.05923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.05923},
}

GUM (General User Models):

@misc{shaikh2025creatinggeneralusermodels,
  title={Creating General User Models from Computer Use},
  author={Omar Shaikh and Shardul Sapkota and Shan Rizvi and Eric Horvitz and Joon Sung Park and Diyi Yang and Michael S. Bernstein},
  year={2025},
  eprint={2505.10831},
  archivePrefix={arXiv},
  primaryClass={cs.HC},
  url={https://arxiv.org/abs/2505.10831},
}

Learning Next Action Predictors from Human-Computer Interaction

2026-03-06T00:00:00+00:00

Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts – it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user’s multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user’s next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP’s predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.

How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations

2025-10-26T00:00:00+00:00

AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.

Comparing Text-Only and Virtual Reality-Embodied Conversational AI Agents for Interpersonal Skills Training

2025-10-18T00:00:00+00:00

Conversational AI agents powered by large language models (LLMs) have the potential to support the development of interpersonal skills, which are essential for navigating diverse situations and engaging effectively with a variety of people. However, text-based AI agents often lack crucial nonverbal cues such as facial expressions, body gestures, and tone of voice. In this study, we present a VR simulation featuring an embodied AI agent that leverages nonverbal cues to train interpersonal skills across various scenarios. We compare its efficacy to a Text-Only AI agent in a between-subjects study with twenty-four participants. We find that participants preferred the embodied agent condition, and their initial scores were significantly higher than those of participants in the text condition. However, the difference between the initial and final scores was not statistically significant.

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

2025-10-16T00:00:00+00:00

Large language models promise a broad set of functions, but when not given a specific objective, they default to milquetoast results such as drafting emails littered with cliches. We demonstrate that inferring the user’s in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce tools, interfaces, and responses that are more responsive and desired. We contribute an architecture for automatically inducing just-in-time objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., “Clarify the abstract’s research contribution”) enables automatic generation of tools, e.g., those that critique a draft based on relevant HCI methodologies, anticipate related researchers’ reactions, or surface ambiguous terminology. In a series of experiments (N=14, N=205) on participants’ own tasks, JIT objectives enable LLM outputs that achieve 66-86% win rates over typical LLMs, and in-person use sessions (N=17) confirm that JIT objectives produce specialized tools unique to each participant.

Navigating Rifts in Human-LLM Grounding: Study and Benchmark

2025-07-28T00:00:00+00:00

Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding—the process by which conversation participants establish mutual understanding—can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, we find that early grounding failures predict later interaction breakdowns. Building on these insights, we introduce Rifts, a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on Rifts, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention aimed at mitigating grounding failures.

SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs

2025-07-28T00:00:00+00:00

Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models relies heavily on additional identity information, such as demographic details or a predefined set of preference categories. To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe-induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe-derived prompts with a reward model achieves top performance on PersonalRewardBench, a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.

Creating General User Models from Computer Use

2025-05-20T00:00:00+00:00

Human-computer interaction has long imagined technology that understands us, from our preferences and habits to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they’re attending from messages with a friend. Or recognize that a user is struggling with a collaborator’s feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user’s behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn’t think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.

Aligning Language Models with Demonstrated Feedback

2024-06-24T00:00:00+00:00

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number (<10) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user’s demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users’ demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO’s ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N=16). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 percentage points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.

Social Skill Training with Large Language Models

2024-04-05T00:00:00+00:00

People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.