QRData

Introduction

Quantitative reasoning is a critical skill for analyzing data, yet the assessment of this ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, which aims to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.

Examples of advanced quantitative reasoning questions and reasoning steps.

We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods, including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants, on diverse models. The strongest model, GPT-4, achieves an accuracy of 58%, leaving substantial room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, achieves the highest accuracy of 37%. Analysis reveals that models have difficulty with data analysis and causal reasoning, and struggle to use causal knowledge and the provided data simultaneously.

Our benchmark poses unique challenges:

Advanced quantitative knowledge and skills: Models need to know the common methods of statistical and causal reasoning and how to apply them.
Multi-turn reasoning: After model-generated code is executed, models need to read the execution results and then either write subsequent code or interpret the results to produce the final answer.
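
The multi-turn pattern described above can be sketched roughly as follows. This is only an illustration, not the benchmark's actual harness: `scripted_model` is a hypothetical stand-in for an LLM that returns a fixed action on each turn, and `run_code` and `multi_turn_solve` are names chosen here for the example.

```python
import contextlib
import io
from typing import Optional


def run_code(code: str, namespace: dict) -> str:
    """Execute code in a shared namespace and return what it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # Errors are returned as observations so the model can react to them.
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def scripted_model(history: list) -> dict:
    """Hypothetical stand-in for an LLM: one code turn, then an answer turn."""
    if not history:
        code = "values = [2, 4, 6, 8]\nprint(sum(values) / len(values))"
        return {"type": "code", "content": code}
    # Later turn: read the latest execution result and report it as the answer.
    return {"type": "answer", "content": history[-1]}


def multi_turn_solve(model, max_turns: int = 5) -> Optional[str]:
    """Alternate between model steps and code execution until an answer appears."""
    namespace: dict = {}
    history: list = []
    for _ in range(max_turns):
        step = model(history)
        if step["type"] == "answer":
            return step["content"]
        history.append(run_code(step["content"], namespace))
    return None  # no answer within the turn budget


final_answer = multi_turn_solve(scripted_model)
print(final_answer)
```

Keeping one namespace across turns lets later code reuse variables defined earlier, which is one plausible way to support the "write subsequent code" step.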

Overview

We introduce QRData, the first benchmark for advanced quantitative reasoning with data, to assess models' abilities in data-based statistical and causal reasoning. To ensure the quality of our benchmark, we first gather quantitative reasoning teaching and research resources, and then annotate questions based on the materials. We collect multiple-choice questions and numerical questions from these resources, and ensure that the gold answer is unique. We also collect data descriptions, such as the purpose and format of the data sheets, from the resources, and provide them to models along with the questions.
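
Because each question has a unique gold answer, accuracy can be computed by direct comparison. The sketch below shows one possible way to score the two question types; it is an assumption for illustration, not the benchmark's scoring script. In particular, the record fields, the case-insensitive match for multiple-choice answers, and the relative tolerance `rel_tol=0.03` for numerical answers are all hypothetical choices, and the toy records are invented.

```python
import math


def is_correct(prediction: str, gold: str, question_type: str,
               rel_tol: float = 0.03) -> bool:
    """Compare one prediction to its gold answer (hypothetical scoring rule)."""
    if question_type == "multiple_choice":
        return prediction.strip().lower() == gold.strip().lower()
    try:
        # Numerical questions: accept answers within a relative tolerance.
        return math.isclose(float(prediction), float(gold), rel_tol=rel_tol)
    except ValueError:
        return False  # the prediction could not be parsed as a number


def accuracy(records: list) -> float:
    """Return the percentage of records answered correctly."""
    correct = sum(
        is_correct(r["prediction"], r["gold"], r["type"]) for r in records
    )
    return 100.0 * correct / len(records)


# Invented records, for illustration only.
toy_records = [
    {"type": "multiple_choice", "prediction": "B", "gold": "b"},
    {"type": "numerical", "prediction": "0.501", "gold": "0.5"},
    {"type": "numerical", "prediction": "not a number", "gold": "2.0"},
    {"type": "multiple_choice", "prediction": "A", "gold": "C"},
]
toy_accuracy = accuracy(toy_records)
print(toy_accuracy)
```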

Statistics of QRData.

Key concepts in QRData.

Examples of QRData

Example 1 (source: OpenIntro statistics)

Example 2 (source: Quantitative social science)

Example 3 (source: Causal inference for the brave and true)

Example 4 (source: Flow cytometry)

Auxiliary Benchmark: QRText

To separate the challenge of quantitative reasoning from that of data analysis, and to examine whether models have mastered quantitative reasoning skills, we create an auxiliary benchmark for comparison called Quantitative Reasoning with Text (QRText). Questions in QRText can be answered without data. QRText contains 290 questions, 100 for statistical reasoning and 190 for causal reasoning. For ease of model comparison, the ratio of statistical to causal questions in QRText is similar to that of QRData.

Examples of QRText

Example 1 (source: OpenIntro statistics)

Example 2 (source: CLadder)

Example 3 (source: Corr2Cause)

We develop several zero-shot reasoning methods as our baselines:

Table Question Answering (Table QA): The task of table QA asks models to answer a question based on tables. To evaluate whether our benchmark can be solved by existing table QA methods, we select TableLlama, a competitive model.
Chain-of-Thought (CoT): CoT prompting enables models to conduct complex natural language reasoning through intermediate reasoning steps.
Program-of-Thoughts (PoT): PoT prompting asks models to generate a Python program, and uses an external computer to execute the code. The output of the code is regarded as the answer.
ReAct-style Prompting: ReAct is a prevalent prompting paradigm for agent reasoning. It combines reasoning and acting in language models for task solving. To address our task, we restrict the action space to Python code execution, and ask models to generate thoughts and actions.
Code Interpreter Assistants: GPT models can be used as agents (called assistants) with a built-in code interpreter tool, which can execute Python code in a sandbox to interpret the data.
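
To make the Program-of-Thoughts step concrete, here is a minimal sketch of running a generated program in a separate Python process and treating its printed output as the answer. It is not the authors' implementation: `GENERATED_PROGRAM` is a hand-written stand-in for model output, `TOY_CSV` is an invented data sheet, and `run_program` and the file name `data.csv` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Invented data sheet, for illustration only.
TOY_CSV = "group,score\na,1.0\na,3.0\nb,5.0\nb,7.0\n"

# Hand-written stand-in for a model-generated program (not real model output).
GENERATED_PROGRAM = """
import csv
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))
scores = [float(r["score"]) for r in rows if r["group"] == "b"]
print(sum(scores) / len(scores))
"""


def run_program(program: str, data_csv: str, timeout: float = 30.0) -> str:
    """Run the program in a separate interpreter and return its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        # Place the data sheet and the program side by side in a scratch folder.
        Path(workdir, "data.csv").write_text(data_csv)
        Path(workdir, "program.py").write_text(program)
        result = subprocess.run(
            [sys.executable, "program.py"],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    if result.returncode != 0:
        return "Error: " + result.stderr.strip()
    return result.stdout.strip()


pot_answer = run_program(GENERATED_PROGRAM, TOY_CSV)
print(pot_answer)
```

Running the code in a child process with a timeout keeps a faulty or long-running program from affecting the evaluation loop. A ReAct-style variant would feed the returned string back to the model as an observation rather than stopping after one execution.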

We experiment with the CoT, PoT, and ReAct-style prompting methods on the following models:

General-Purpose LLMs: Llama-2-chat, Gemini-Pro and GPT-4
Code LLMs: CodeLlama-instruct and Deepseek-coder-instruct
LLM for Mathematical Reasoning: WizardMath
LLM for Agent Reasoning: AgentLM

Performance of models on QRData. Numbers are accuracies in percentages (%). The best results are in bold. For models evaluated with multiple reasoning methods, the model-level best results are underlined.

GPT-4 with the code interpreter assistant achieves the best performance, and Deepseek-coder-instruct with PoT prompting is the best among open-source models. The best model still falls 18% short of human performance, showing that QRData is challenging for LLMs.

GPT-4 Failure Case

What are the main difficulties of LLMs?

We take a closer look at the primary difficulties models face in quantitative reasoning with data. This may provide insights into how to design methods that better solve our task.

Difficulty of Data Analysis

We evaluate models on our auxiliary benchmark QRText, and compare the performance on QRText and QRData to quantify the difficulty of data analysis.

Performance of models on QRText. Numbers are accuracies in percentages (%). Best results are in bold, and model-level best results are underlined.

All models perform better on QRText than on QRData, by margins of 1.8% to 11.5%, and the gap is larger for smaller models. If we control for the knowledge and skills required by restricting questions to the same source (the book OpenIntro statistics), models perform 6% better on average. These results provide evidence that most models have difficulty with data analysis.

Difficulty of Causal Reasoning

We observe a performance gap between statistical reasoning and causal reasoning on QRData, and the gap remains on QRText. GPT-4 achieves 89% accuracy on statistical questions in QRText, but only about half that accuracy on causal questions. This shows the unique difficulty of causal reasoning, independent of data analysis.

As shown in the failure case of GPT-4, when asked to predict the causal relation between two variables, "L L1 radiculopathy" and "R L5 radiculopathy", GPT-4 proposes a wrong plan of calculating the conditional probabilities in Step 2. Because correlation does not imply causation, GPT-4 makes a wrong prediction even though it executes the plan successfully.
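
A small simulation can illustrate why conditional probabilities alone do not establish a causal relation. The sketch below is a generic toy example with invented probabilities, not the data from the failure case: a hidden binary variable `z` drives both `x` and `y`, and `x` has no effect on `y` at all.

```python
import random

random.seed(0)

N = 100_000
# Invented probabilities: both x and y depend only on the hidden cause z.
P_X_GIVEN_Z = {True: 0.8, False: 0.2}
P_Y_GIVEN_Z = {True: 0.8, False: 0.2}

samples = []
for _ in range(N):
    z = random.random() < 0.5             # hidden common cause
    x = random.random() < P_X_GIVEN_Z[z]  # depends on z only
    y = random.random() < P_Y_GIVEN_Z[z]  # depends on z only, never on x
    samples.append((z, x, y))


def p_y_given_x(rows, x_value):
    """Estimate P(y = True | x = x_value) from the given rows."""
    matching = [y for (_, x, y) in rows if x == x_value]
    return sum(matching) / len(matching)


# Marginally, y looks strongly dependent on x
# (in expectation, about 0.68 versus 0.32).
p_y_x1 = p_y_given_x(samples, True)
p_y_x0 = p_y_given_x(samples, False)

# Within a fixed value of z the dependence vanishes (both are about 0.8).
z_true = [row for row in samples if row[0]]
p_y_x1_z = p_y_given_x(z_true, True)
p_y_x0_z = p_y_given_x(z_true, False)

print(f"P(y|x=1) = {p_y_x1:.3f}, P(y|x=0) = {p_y_x0:.3f}")
print(f"P(y|x=1,z=1) = {p_y_x1_z:.3f}, P(y|x=0,z=1) = {p_y_x0_z:.3f}")
```

A plan that only compares `P(y|x=1)` with `P(y|x=0)` would report a strong association here even though intervening on `x` would leave `y` unchanged, which is the kind of error a purely correlational analysis can make.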

BibTeX


      @article{liu2024llms,
        title={Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data},
        author={Liu, Xiao and Wu, Zirui and Wu, Xueqing and Lu, Pan and Chang, Kai-Wei and Feng, Yansong},
        journal={arXiv preprint arXiv:2402.17644},
        year={2024}
      }