Predicting Causal Effects from Natural Language Queries using Structured Representations
Working paper
arXiv
Randomized controlled trials give reliable estimates of causal effects but are costly and slow to run, motivating interest in forecasting intervention effects from existing experimental evidence. Can large language models predict causal effect sizes directly from free-form questions? We introduce Query2Effect, a large-scale benchmark of more than 73,000 questions aligned with experiment descriptions, built to mimic realistic information-seeking by varying query specificity along implicitness, abstraction, and ambiguity. We propose a modular framework that first uses an LLM to construct a synthetic structured representation of the experiment, then predicts the effect size with a supervised encoder. This substantially reduces prediction error and improves generalization relative to prompting off-the-shelf LLMs, lowering absolute error by 27% to 186%. Separating semantic interpretation from numerical estimation appears critical for forecasting causal effects from language.
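To make the two-stage idea concrete, the following is a minimal sketch of how semantic interpretation can be separated from numerical estimation. It is not the implementation evaluated in this work. Every name in it (`StructuredExperiment`, `interpret_query`, `encode`, `fit_linear`, `predict`) is hypothetical. The LLM stage is replaced by a toy regular-expression parser, the supervised encoder by a hashed bag-of-words linear model, and the three training pairs are invented numbers used only for illustration.

```python
from __future__ import annotations

import re
import zlib
from dataclasses import dataclass

DIM = 256  # size of the hashed feature vector (arbitrary choice for this sketch)


@dataclass
class StructuredExperiment:
    """Hypothetical structured representation of an experiment."""
    intervention: str
    outcome: str
    population: str


def interpret_query(query: str) -> StructuredExperiment:
    """Stage 1 stand-in. A real system would prompt an LLM here; this toy
    parser only understands 'effect of X on Y [in Z]'."""
    m = re.search(r"effect of (.+?) on (.+?)(?: in (.+?))?\??$",
                  query.strip(), re.IGNORECASE)
    if m is None:
        return StructuredExperiment("unknown", "unknown", "unknown")
    return StructuredExperiment(m.group(1), m.group(2), m.group(3) or "unspecified")


def encode(exp: StructuredExperiment) -> list[float]:
    """Stage 2 input: field-aware hashed bag of words (deterministic via crc32)."""
    vec = [0.0] * DIM
    fields = (("i", exp.intervention), ("o", exp.outcome), ("p", exp.population))
    for field, text in fields:
        for token in text.lower().split():
            vec[zlib.crc32(f"{field}:{token}".encode()) % DIM] += 1.0
    return vec


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Linear prediction of the effect size."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_linear(X: list[list[float]], y: list[float],
               epochs: int = 200, lr: float = 0.05) -> tuple[list[float], float]:
    """Stage 2 stand-in: linear regression fitted by SGD on squared error."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = predict(w, b, x) - target
            for j, xj in enumerate(x):
                if xj:
                    w[j] -= lr * err * xj
            b -= lr * err
    return w, b


# Invented (query, effect size) pairs, for illustration only.
TOY_TRAIN = [
    ("What is the effect of tutoring on test scores in primary schools?", 0.30),
    ("What is the effect of cash transfers on household consumption in rural villages?", 0.15),
    ("What is the effect of text reminders on vaccination rates in urban clinics?", 0.05),
]

X = [encode(interpret_query(q)) for q, _ in TOY_TRAIN]
y = [effect for _, effect in TOY_TRAIN]
w, b = fit_linear(X, y)

query = "What is the effect of tutoring on test scores in primary schools?"
estimate = predict(w, b, encode(interpret_query(query)))
print(f"predicted effect size: {estimate:.2f}")
```

The point of the split is that the numerical model never sees the raw question, only the structured fields. The interpretation stage can therefore be swapped (for example, for an LLM call) without retraining the estimator's interface.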