From Algorithm to System: Integrated Design of Compiler and Toolchain for Large Model Inference Optimization

Authors

  • Shengyi Gao, Ningxia University, Yinchuan, Ningxia, China

DOI:

https://doi.org/10.71222/032q3y25

Keywords:

large model inference, algorithm optimization, compiler optimization, toolchain scheduling, heterogeneous hardware

Abstract

With the rapid growth in the scale of deep learning models, especially large Transformer-based models such as GPT, efficient inference has become a critical challenge due to increasing computational and memory demands. This paper proposes an integrated optimization framework that unifies algorithmic simplification, compiler transformations, and system-level scheduling to improve large model inference performance. By tightly coupling quantization, pruning, operator fusion, memory reuse, and automated scheduling across heterogeneous hardware, the framework achieves significant improvements in computation reduction, memory efficiency, and parallel execution. Theoretical analysis and design considerations indicate that the framework can deliver predictable performance gains and scale across diverse hardware platforms. Future work will focus on broader hardware support, distributed inference, and adaptive optimization strategies. This integrated approach lays a foundation for efficient, scalable, and accurate deployment of large models in practical AI applications.
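
As a minimal illustration of two of the techniques named above, the NumPy sketch below shows post-training symmetric int8 weight quantization together with a "fused" dequantize-matmul that folds the dequantization scale into the accumulation rather than materializing an fp32 copy of the weights. This is an assumed, simplified example rather than code from the paper; the names quantize_weights and fused_int8_matmul are introduced here purely for illustration.

import numpy as np

def quantize_weights(w):
    """Per-tensor symmetric int8 quantization: w is approximated as scale * q."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def fused_int8_matmul(x, q, scale):
    """Compute x @ (scale * q) without building a dequantized fp32 weight matrix:
    the matmul runs on the quantized values and the scale is folded in once."""
    acc = x.astype(np.float32) @ q.astype(np.float32)   # accumulate in fp32
    return acc * scale                                   # dequantization folded into one multiply

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # toy weight matrix
x = rng.standard_normal((4, 512)).astype(np.float32)    # toy activation batch
q, s = quantize_weights(w)
rel_err = np.linalg.norm(fused_int8_matmul(x, q, s) - x @ w) / np.linalg.norm(x @ w)
print(f"relative error introduced by int8 quantization: {rel_err:.4f}")

In a full compiler toolchain, the same folding idea generalizes: dequantization, bias addition, and activation functions can be fused into a single kernel so that intermediate tensors never round-trip through main memory, which is one way operator fusion and memory reuse reinforce each other.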

Published

25 November 2025

Issue

Vol. 1 No. 4 (2025)

Section

Article

How to Cite

Gao, S. (2025). From Algorithm to System: Integrated Design of Compiler and Toolchain for Large Model Inference Optimization. European Journal of AI, Computing & Informatics, 1(4), 21-33. https://doi.org/10.71222/032q3y25