This article is part of a series on LLMs; it is a translation of the paper "A Survey on Efficient Inference for Large Language Models".
Large Language Models: A Review of Efficient Inference Research
- Abstract
- 1 Introduction
- 2 Preliminaries
- 3 Taxonomy
- 4 Data-level optimization
- 5 Model-level optimization
- 6 System-level optimization
- 7 Discussion of key application scenarios
- 8 Conclusion
Abstract
Large language models (LLMs) have attracted widespread attention for their outstanding performance across a wide variety of tasks. However, the heavy compute and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Work in this field has been devoted to developing techniques that improve the efficiency of LLM inference. This article provides a comprehensive review of the existing literature on efficient LLM inference. We first analyze the main causes of inefficient LLM inference: the large model size, the quadratic-complexity attention operation, and the autoregressive decoding approach. We then introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. In addition, the paper conducts comparative experiments on representative methods within key subfields to provide quantitative insights. Finally, we summarize the key takeaways and discuss future research directions.
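To make these two bottlenecks concrete, below is a minimal, self-contained Python sketch of greedy autoregressive decoding. It is our illustration, not code from the survey: `toy_forward`, `W_embed`, and `W_out` are hypothetical stand-ins for a real model. The point is structural: every new token requires a full forward pass, and the toy attention step builds a (T, T) score matrix, which is the quadratic cost in sequence length that the survey identifies.

```python
import numpy as np

VOCAB, DIM = 1000, 64
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(VOCAB, DIM))  # hypothetical embedding table
W_out = rng.normal(size=(DIM, VOCAB))    # hypothetical output projection

def toy_forward(token_ids):
    """One 'attention-like' pass over the whole sequence.

    The (T, T) score matrix below is the quadratic-complexity
    attention: its cost grows with the square of the length T.
    """
    x = W_embed[token_ids]                         # (T, DIM)
    scores = x @ x.T / np.sqrt(DIM)                # (T, T) -- quadratic in T
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # row-wise softmax
    ctx = probs @ x                                # attention-weighted context
    return ctx[-1] @ W_out                         # logits for the next token

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):                         # one full pass PER token:
        logits = toy_forward(np.array(ids))        # the autoregressive bottleneck
        ids.append(int(np.argmax(logits)))         # greedy pick
    return ids

print(generate([1, 2, 3], n_new=5))
```

Real systems amortize the repeated computation with a KV cache, but that trades recomputation for memory footprint and memory traffic that grow with sequence length, which is one reason LLM decoding tends to be memory-bound.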
1 Introduction
2 Preliminaries
3 Taxonomy
4 Data-level optimization
5 Model-level optimization
6 System-level optimization
7 Discussion of key application scenarios
8 Conclusion
Efficient LLM inference focuses on reducing the compute, memory-access, and memory costs of the LLM inference process, with the aim of optimizing efficiency metrics such as latency, throughput, storage, power, and energy. This survey offers a comprehensive review of research on efficient LLM inference, together with insights, suggestions, and future directions for the key techniques. We first introduce a hierarchical taxonomy spanning data-level, model-level, and system-level optimization. Guided by this taxonomy, we then carefully examine and summarize the research at each level and in each subfield. For mature techniques such as model quantization and efficient serving systems, we conduct experiments to evaluate and analyze their performance. Based on these analyses, we offer practical advice to practitioners and researchers in the field and identify promising research avenues.
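As a concrete illustration of one such mature technique, here is a minimal sketch (our own, assuming NumPy; not the survey's code) of per-tensor symmetric INT8 weight quantization, the basic idea underlying the model-quantization methods the survey benchmarks: store the weights as int8 plus a single floating-point scale.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor symmetric quantization: map max |w| onto the int8 range."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                                # int8 weights + one fp scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale            # approximate reconstruction

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"storage: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean abs error: {err:.6f}")
```

Production methods such as GPTQ and AWQ refine this basic recipe with per-channel or per-group scales and error compensation, but the roughly 4x storage reduction shown here (fp32 to int8) is the core motivation.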