This article is part of a series of LLM article translations; it is a translation of "LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models".
LogicBench: Systematic Evaluation of the Logical Reasoning Ability of Large Language Models
- Abstract
- 1 Introduction
- 2 Related Work
- 3 LogicBench
- 4 Results and Analysis
- 5 Conclusion
- Limitations
Abstract
Recently developed large language models (LLMs) have been shown to perform well on a wide range of language understanding tasks. But can they really "reason" over natural language? This question has received considerable research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied. However, the crucial skill of "logical reasoning" remains underexplored. Existing work on the reasoning ability of LLMs focuses on only a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing this limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning propositional logic, first-order logic, and non-monotonic logic. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focused on the use of a single inference rule. We conduct a detailed analysis of a range of LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, using chain-of-thought prompting. Experimental results show that existing LLMs do not perform well on LogicBench; in particular, they struggle with instances involving complex reasoning and negation. Furthermore, they sometimes overlook the contextual information needed to reason toward the correct conclusion. We believe that our work and findings will facilitate future research on evaluating and improving the logical reasoning ability of LLMs.
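To make the notion of "a single inference rule" concrete, the sketch below shows what a LogicBench-style yes/no instance built around modus tollens might look like, together with a minimal chain-of-thought prompt. The field names and prompt wording here are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch (assumed schema, not the actual LogicBench format):
# a yes/no question answerable by applying one rule, modus tollens
# (from "if p then q" and "not q", conclude "not p").
example = {
    "logic_type": "propositional",
    "rule": "modus_tollens",
    "context": "If it rained last night, then the grass is wet. The grass is not wet.",
    "question": "Does this imply that it did not rain last night?",
    "answer": "yes",
}

# A simple chain-of-thought style prompt, roughly in the spirit of the paper's setup.
prompt = (
    f"Context: {example['context']}\n"
    f"Question: {example['question']}\n"
    "Let's reason step by step, then answer with yes or no."
)
print(prompt)
```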
1 Introduction
2 Related Work
3 LogicBench
4 Results and Analysis
5 Conclusion
In this work, we evaluate the logical reasoning ability of LLMs on 25 different inference rules and reasoning patterns covering propositional logic (PL), first-order logic (FOL), and non-monotonic (NM) logic. To this end, we introduce LogicBench, a natural language question-answering dataset focused on evaluating individual inference rules. We design two tasks using LogicBench: (i) BQA and (ii) MCQA. We evaluate a range of LLMs on both tasks, including GPT-4, ChatGPT, Gemini Pro, Llama-2, and Mistral. Experimental results show that LLMs perform poorly on LogicBench, even though only a single inference rule needs to be applied. In addition, we augment LogicBench into LogicBench(Aug), which can be used for training. Using LogicBench(Aug), we demonstrate that LLMs trained on it develop a better understanding of logical reasoning and thus achieve better performance on existing logic datasets.
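As a rough illustration of how the BQA task could be scored, the sketch below wraps an arbitrary model behind a `generate_fn` callable and checks whether its final yes/no answer matches the gold label. This is an assumed harness for illustration only, not the authors' evaluation code; the item fields mirror the hypothetical schema sketched above.

```python
import re
from typing import Callable, Iterable

def evaluate_bqa(examples: Iterable[dict], generate_fn: Callable[[str], str]) -> float:
    """Score yes/no accuracy for BQA-style items.

    `generate_fn` stands in for any LLM call (GPT-4, ChatGPT, Gemini Pro,
    Llama-2, Mistral, ...); it takes a prompt string and returns model text.
    """
    correct = 0
    total = 0
    for ex in examples:
        prompt = (
            f"Context: {ex['context']}\n"
            f"Question: {ex['question']}\n"
            "Let's reason step by step, then answer with yes or no."
        )
        output = generate_fn(prompt).lower()
        # Take the last yes/no token in the output as the model's final answer.
        matches = re.findall(r"\b(yes|no)\b", output)
        prediction = matches[-1] if matches else ""
        correct += int(prediction == ex["answer"])
        total += 1
    return correct / total if total else 0.0

# Usage with a trivial stub "model" that always concludes "yes".
if __name__ == "__main__":
    data = [{"context": "...", "question": "...", "answer": "yes"}]
    print(evaluate_bqa(data, lambda p: "Step 1: ... Final answer: yes"))
```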