Kyle Richardson

Senior Research Scientist

Allen Institute for Artificial Intelligence


I am a senior research scientist at the Allen Institute for Artificial Intelligence in Seattle, Washington, where I work on natural language processing and machine learning on the Aristo Project. Prior to this, I was a researcher at the Institute for Natural Language Processing (IMS) at the University of Stuttgart in Germany, where I received my PhD in October 2018. Before this, I received my B.A. from the University of Rochester in upstate New York (USA).


Some miscellaneous notes and musings: Number Theory Meets Computability Theory (see also blog post); other lecture notes: Notes on Language Models, Attention and Transformers, Negation as Failure, Mixing Logic and Deep Learning: The Logic as Loss Function Approach, Introduction to Probability. Formal Techniques for Neural-symbolic Modeling, recently taught at ESSLLI 2023.

Recent Talks from me and my extended group: Brief (10 minute) introduction to Natural Language Understanding (NLU) and Language Modeling (intended for a non-technical audience); Overview of my work on diagnostic testing of neural models; Pushing the Limits of Rule Reasoning in Transformers (AAAI 2022); Breakpoint Transformers (EMNLP 2022); Learning to Decompose (EMNLP 2022); Decomposed Prompting (ICLR 2023).

Recent News: Released the Open-Cot leaderboard on Hugging Face, which aims to track model improvements due to chain-of-thought prompting. Three papers accepted to ACL 2024: OLMo and Dolma (our work on open-source large language models) and TimeArena (agent modeling with time constraints).

Recent Posts

I recently started converting some of my research notes into blog posts, with the hope that someone might find them useful (or, even better, that someone might correct me when I’m wrong, since many of the topics covered go outside of my area of expertise).

Number Theory Meets Computability Theory

Solving Equations: In this article, we consider the problem of solving certain types of equations (called polynomial equations). For …

Why Infinity is Strange

What is Kolmogorov Complexity?

Selected Publications

Note: For the most up-to-date versions of my papers, please refer to the arxiv versions (unless stated otherwise).

Ruihan Yang, Jiangjie Chen, Yikai Zhang, Siyu Yuan, Aili Chen, Kyle Richardson, Yanghua Xiao, Deqing Yang (2024) SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals (work in progress) [project page]

Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson, Yanghua Xiao, Jiangjie Chen. (2024) TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation (ACL 2024) [project page]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, et al… Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi (2024) OLMo: Accelerating the Science of Language Models (ACL 2024) [code] [data] [model]

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, et al. (2024) Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (ACL 2024) [code] [data]

Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge. (2023) PALOMA: A Benchmark for Evaluating Language Model Fit [code] [data]

Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge. (2023) Catwalk: A Unified Language Model Evaluation Framework for Many Datasets (technical report) [toolkit code] [eval code]

Kyle Richardson, Ian Magnusson, Oyvind Tafjord, Akshita Bhagia, Iz Beltagy, Arman Cohan, Pradeep Dasigi, Jesse Dodge, Dirk Groeneveld, Yuling Gu, Ananya Harsh Jha, Tushar Khot, Nishant Subramani. (2023) Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk. (extended abstract, accepted to GEM 2023) (details forthcoming)

Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, Kyle Richardson. (2023) Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena (work in progress) [arxiv] [project page] [code]

Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze and Peter Clark. (2023) Language Models with Rationality (EMNLP 2023) [arxiv] [project page]

Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson (2023) DISCO: Distilling Counterfactuals with Large Language Models. (ACL 2023) [arxiv] [code]

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal (2023) Decomposed Prompting: A Modular Approach for Solving Complex Tasks (ICLR 2023) [arxiv] [code] [poster] [slides]

Gregor Betz, Kyle Richardson. (2023) Probabilistic coherence, logical consistency, and Bayesian learning: Neural language models as epistemic agents (PLOS One journal) [publisher] [data/resources]

Kyle Richardson, Ronen Tamari, Oren Sultan, Dafna Shahaf, Reut Tsarfaty and Ashish Sabharwal. (2022) Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs. (EMNLP 2022) [arxiv] [code] [slides]

Ben Zhou, Kyle Richardson, Xiaodong Yu and Dan Roth. (2022) Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts (EMNLP 2022) [arxiv] [data/code]

Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, Peter Clark (2022) What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment (EMNLP 2022) [arxiv] [code/data]

Gregor Betz, Kyle Richardson. (2022) Judgement Aggregation, Discursive Dilemma and Reflective Equilibrium: Neural Language Models as Self-Improving Doxastic Agents. Frontiers in Artificial Intelligence. [publisher]

Aarohi Srivastava et al (+441 authors) (2022) Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [arxiv] [resources]

Tushar Khot, Kyle Richardson, Daniel Khashabi, Ashish Sabharwal (2022) Learning to Solve Complex Tasks by Talking to Agents (Findings of ACL) [arxiv] [code/data] [slides] [poster]

Kyle Richardson, Ashish Sabharwal (2022) Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability (AAAI 2022) [arxiv] [code/data] [slides] [poster]

Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Yejin Choi (2022) PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts (Proceedings of NAACL) [arxiv] [slides]

Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson F. Liu, Reut Tsarfaty and Dafna Shahaf (2022) Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking (*SEM2022) [arxiv] [code/data]

Gregor Betz, Kyle Richardson. (2022) DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models (*SEM2022) [arxiv] [demo] [dataset] [model]

Hai Hu, He Zhou, Zuoyu Tian, Yiwen Zhang, Yina Patterson, Yanting Li, Yixin Nie, Kyle Richardson. (2021) Investigating Transfer Learning in Multi-lingual Pre-trained Language Models through Chinese Natural Language Inference Findings of ACL [code/data] [arxiv] [acl anthology]

Gregor Betz, Christian Voigt, Kyle Richardson. (2021) Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2 work in progress [arxiv]

Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, Dan Roth. (2021) Temporal Reasoning on Implicit Events from Distant Supervision Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) [arxiv] [code] [data] [leaderboard] [slides]

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal (2021) Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) [arxiv] [code/data] [demo] [slides] [poster]

Gregor Betz, Christian Voigt, Kyle Richardson. (2021) Critical Thinking for Language Models Proceedings of International Conference on Computational Semantics (IWCS 2021) [arxiv] [data] [models] [blog] [proceedings] [video]

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark (2021) Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge technical note [arxiv] [data]

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. (2020) CLUE: A Chinese Language Understanding Evaluation Benchmark. in Proceedings of International Conference on Computational Linguistics (COLING) [arxiv] [website/leaderboard] [code/data] [proceedings]

Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson and Eduard Hovy. (2020) A Dataset for Tracking Entities in Open Domain Procedural Text in Proceedings of International Conference on Empirical Methods in Natural Language Processing (EMNLP) [proceedings] [arxiv] [dataset] [code]

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kubler, Lawrence S. Moss. (2020) OCNLI: Original Chinese Natural Language Inference Findings of EMNLP [arxiv] [code/data] [leaderboard] [acl_anthology]

Sumithra Bhakthavatsalam, Kyle Richardson, Niket Tandon, Peter Clark (2020) Do Dogs have Whiskers? A New Knowledge Base of hasPart Relations technical note [arxiv] [data]

Atticus Geiger, Kyle Richardson, Christopher Potts (2020) Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation in Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP) [arxiv] [proceedings] [data]

Kyle Richardson, Ashish Sabharwal (2020). What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge. in Transactions of the Association for Computational Linguistics (TACL) [arxiv] [journal] [code/data][slides (EMNLP2020)]

Peter Clark, Oyvind Tafjord, Kyle Richardson (2020). Transformers as Soft Reasoners over Language. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI) [arxiv] [proceedings] [demo] [data] [data generator code]

Hai Hu, Qi Chen, Kyle Richardson, Atreyee Mukherjee, Lawrence S. Moss, Sandra Kuebler (2020). MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity. Proceedings of Society for Computation in Linguistics (SCIL 2020) [arxiv] [proceedings] [data]

Kyle Richardson, Hai Hu, Lawrence S. Moss, Ashish Sabharwal (2020). Probing Natural Language Inference Models through Semantic Fragments. Proceedings of Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) [arxiv][aaai][code/data][slides]

Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, Michael Schmitz (2020). From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project AI Magazine [arxiv] [New York Times, GeekWire]

Kyle Richardson (2018) New Resources and Ideas for Semantic Parser Induction. PhD Thesis, Institute for Natural Language Processing (IMS), Faculty of Computer Science, Electrical Engineering and Information Technology. University of Stuttgart, Germany [opus][slides][code/data][handout]

Kyle Richardson (2018) A Language for Function Signature Representations. Brief technical note. [arxiv][data]

Kyle Richardson, Jonathan Berant and Jonas Kuhn (2018). Polyglot Semantic Parsing in APIs. Proceedings of 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) [arxiv][data][notes][code][slides][video]

Kyle Richardson, Sina Zarrieß and Jonas Kuhn (2017). The Code2Text Challenge: Text Generation in Source Code Libraries. Proceedings of International Natural Language Generation Conference (INLG) [arxiv][paper][inlg_slides][resources].

Kyle Richardson, Jonas Kuhn (2017). Function Assistant: A Tool for NL Querying of APIs. Proceedings of Empirical Methods in Natural Language Processing (EMNLP) [arxiv][paper][demo][resources][code][poster]

Kyle Richardson, Jonas Kuhn (2017). Learning Semantic Correspondences in Technical Documentation. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) [arxiv][paper][notes][data][acl_poster][stuttgart slides][code].

Kyle Richardson, Jonas Kuhn. (2016) Learning to Make Inferences in a Semantic Parsing Task. Transactions of the Association for Computational Linguistics (TACL) [paper][data][acl_slides][video] [extended version (from thesis)] [based partly on cky/kbest implementation from here].

Cleo Condoravdi, Kyle Richardson, Vishal Sikka, Asuman Suenbuel, and Richard Waldinger (2015) Natural Language Access to Data: It Takes Common Sense!. in Twelfth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-15). AAAI Spring Symposium. [demo][link]

Cleo Condoravdi, Kyle Richardson, Vishal Sikka, Asuman Suenbuel, and Richard Waldinger (2014) Deduction for Natural Language Access to Data. in University of Coimbra CS Technical Reports, CISUC/TR 2014-02. Presented at Joint Workshop on Natural Language and Computer Science (NLCS) and Natural Language Services for Reasoners (NLSR).

Kyle Richardson and Jonas Kuhn (2014) UnixMan Corpus: A Resource for Language Learning in the Unix Domain. in Proceedings of Language Resources and Evaluation (LREC). [link] [data]

Sina Zarriess and Kyle Richardson. (2013) An Automatic Method for Building a Data-to-Text Generator. in Proceedings of 14th European Workshop on Natural Language Generation (ENLG) [link]

Richard Waldinger, Danny Bobrow, Cleo Condoravdi, Amar Das, Kyle Richardson. (2011) Accessing Structured Health Information through English Queries and Automatic Deduction. in Proceedings of AAAI Spring Symposium on Health Communications.