Abstract

Large Language Models (LLMs) have become a cornerstone of modern Software as a Service (SaaS) offerings, enabling intelligent automation and analytics. Their inference cost, however, remains high, and cloud service providers consequently face scalability challenges. Adaptive Precision Scaling (APS) is the strategy of adapting computational precision during execution. This paper describes a proposed APS architecture for SaaS model serving and introduces a taxonomy of precision scaling that gives a clearer understanding of precision adaptivity.
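To make the core idea concrete, the following is a minimal sketch of runtime precision selection in the spirit of APS. All names, thresholds, and precision tiers here are illustrative assumptions for exposition, not the architecture proposed in this paper.

```python
# Illustrative APS-style controller: choose a numeric precision per batch
# from current server load and a tolerated quality-loss budget.
# Tier names and cutoffs are hypothetical.

PRECISION_TIERS = ["fp16", "int8", "int4"]  # highest to lowest compute cost

def select_precision(load: float, quality_budget: float) -> str:
    """Pick a precision tier for the next inference batch.

    load           -- current server utilization in [0, 1]
    quality_budget -- tolerated quality loss in [0, 1]; larger values
                      permit more aggressive quantization
    """
    if load < 0.5 or quality_budget < 0.1:
        return "fp16"   # spare capacity or strict quality: serve at full precision
    if load < 0.85 and quality_budget < 0.3:
        return "int8"   # moderate pressure: mild quantization
    return "int4"       # overload or lenient budget: aggressive quantization

# A heavily loaded server with a lenient quality budget drops to int4.
print(select_precision(load=0.9, quality_budget=0.4))  # prints "int4"
```

A production controller would replace these static thresholds with a learned policy and a quality estimator, but the interface, mapping runtime signals to a precision tier, is the essence of precision adaptivity.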

Keywords

  • Large Language Models
  • SaaS
  • Adaptive Precision
  • Energy Efficiency
  • Coherence
  • Factuality
  • Model Serving
