Ashmit Khandelwal

Research Fellow at Microsoft Research


I’m a researcher at Microsoft Research working with Dr. Nagarajan Natarajan and Dr. Amit Sharma on verification and steering for Large Language Model (LLM) agents. I study how these systems search, analyze, and interact with real-world environments, as well as methods to monitor and steer them.

My recent work, interwhen, focuses on test-time verifiers that asynchronously monitor LLM trajectories at runtime, intervening when the agent deviates from a defined specification. This approach improves both task performance and soundness across code generation, logical reasoning, and agentic settings. I’ve also worked on formally defining and evaluating Deep Research, LLM systems that perform structured search over large corpora, in work published at ICLR 2026.

Previously, I completed my undergraduate degree at BITS Pilani, India, where I studied Computer Science and Data Science. I spent a summer at the Adobe MDSR lab, where I worked on predicting human behavior at scale using multimodal LLMs. This research was a spotlight publication at ICLR 2024.

publications

* = equal contribution
  1. ICLR
    Spotlight
    Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
    Ashmit Khandelwal*, Aditya Agrawal*, Aanisha Bhattacharyya*, Yaman K Singla*, and 7 more authors
    In International Conference on Learning Representations, 2024
  2. ICLR
    Characterizing Deep Research: A Benchmark and Formal Definition
    Abhinav Java*, Ashmit Khandelwal*, Sukruta Prakash Midigeshi*, Aaron Halfaker, and 5 more authors
    In International Conference on Learning Representations, 2026
  3. arXiv preprint
    interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
    Vishak K Bhat, Prateek Chanda, Ashmit Khandelwal, Maitreyi Swaroop, and 4 more authors
    arXiv preprint, 2026