Abstract

Large Language Models (LLMs) have become a cornerstone of modern Software as a Service (SaaS) offerings, enabling intelligent automation and analytics. Their inference cost, however, remains high, and cloud service providers consequently face scalability challenges. Adaptive Precision Scaling (APS) is the strategy of adapting computational precision during execution. This paper describes a proposed APS architecture for SaaS model serving and introduces a taxonomy of precision scaling that gives a clearer understanding of precision adaptivity.
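To make the core idea concrete, the following is a minimal sketch of runtime precision selection in the spirit of APS. All names, thresholds, and precision tiers here are illustrative assumptions for exposition, not the architecture proposed in this paper.

```python
# Illustrative APS-style controller: choose a numeric precision per batch
# from current server load and a tolerated quality-loss budget.
# Tier names and cutoffs are hypothetical.

PRECISION_TIERS = ["fp16", "int8", "int4"]  # highest to lowest compute cost

def select_precision(load: float, quality_budget: float) -> str:
    """Pick a precision tier for the next inference batch.

    load           -- current server utilization in [0, 1]
    quality_budget -- tolerated quality loss in [0, 1]; larger values
                      permit more aggressive quantization
    """
    if load < 0.5 or quality_budget < 0.1:
        return "fp16"   # spare capacity or strict quality: serve at full precision
    if load < 0.85 and quality_budget < 0.3:
        return "int8"   # moderate pressure: mild quantization
    return "int4"       # overload or lenient budget: aggressive quantization

# A heavily loaded server with a lenient quality budget drops to int4.
print(select_precision(load=0.9, quality_budget=0.4))  # prints "int4"
```

A production controller would replace these static thresholds with a learned policy and a quality estimator, but the interface, mapping runtime signals to a precision tier, is the essence of precision adaptivity.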

Keywords

  • Large Language Models
  • SaaS
  • Adaptive Precision
  • Energy Efficiency
  • Coherence
  • Factuality
  • Model Serving
