Reinforcement Learning with Reward Shaping for Last-Mile Delivery Dispatch Efficiency

Authors

  • Sichong Huang, Duke University, Durham, North Carolina, United States

DOI:

https://doi.org/10.71222/ejnna704

Keywords:

last-mile delivery, reinforcement learning, multi-dimensional reward shaping, dynamic dispatch, Markov decision process

Abstract

As the final and most labor-intensive segment of the logistics chain, last-mile delivery grapples with inherent challenges: dynamic traffic conditions, fluctuating order volumes, and the conflicting demands of timeliness, cost control, and resource efficiency. Conventional dispatch approaches, such as heuristic algorithms and static optimization models, exhibit limited adaptability to real-time fluctuations, often resulting in suboptimal resource utilization and elevated operational costs. To address these gaps, this study proposes a reinforcement learning (RL) framework integrated with multi-dimensional reward shaping (RS) to enhance dynamic last-mile delivery dispatch efficiency. First, we formalize the dispatch problem as a Markov Decision Process (MDP) that explicitly incorporates real-time factors (e.g., traffic congestion, order urgency, and vehicle status) into the state space. Second, we design a domain-specific RS function that introduces intermediate rewards (e.g., on-time arrival bonuses, empty-running penalties) to mitigate the sparsity of traditional terminal rewards and accelerate RL agent convergence. Experiments were conducted on a real-world dataset from a logistics enterprise in Chengdu (June-August 2024), comparing the proposed RS-PPO framework against two baselines: the classic Savings Algorithm (SA) and standard PPO without reward shaping (PPO-noRS). Results demonstrate that RS-PPO improves the on-time delivery rate (OTR) by 18.2% (vs. SA) and 9.5% (vs. PPO-noRS), reduces the average delivery cost (ADC) by 12.7% (vs. SA) and 7.3% (vs. PPO-noRS), and shortens convergence time by 40.3% (vs. PPO-noRS). Additionally, RS-PPO boosts the vehicle utilization rate (VUR) by 29.8% (vs. SA) and 13.4% (vs. PPO-noRS). This framework provides a practical, data-driven solution for logistics enterprises seeking to balance service quality, cost efficiency, and sustainability, aligning with global last-mile optimization trends.
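
To make the reward-shaping design described in the abstract concrete, the minimal Python sketch below illustrates how dense intermediate rewards (an on-time arrival bonus, a lateness penalty, and an empty-running penalty) could be layered on top of a sparse terminal reward. All names (shaped_reward, empty_km, orders_completed) and all coefficients are hypothetical stand-ins for illustration; the paper's actual reward function and weights are not reproduced here.

```python
# Illustrative sketch of a multi-dimensional shaped reward for a dispatch MDP.
# Field names and coefficients are hypothetical, not the paper's actual values.

def shaped_reward(delivered_on_time: bool,
                  minutes_late: float,
                  empty_km: float,
                  episode_done: bool,
                  orders_completed: int) -> float:
    """Dense intermediate rewards layered on a sparse terminal reward."""
    reward = 0.0

    # Intermediate signal 1: on-time arrival bonus vs. per-minute lateness penalty.
    if delivered_on_time:
        reward += 1.0
    else:
        reward -= 0.05 * minutes_late

    # Intermediate signal 2: empty-running penalty discourages unloaded mileage.
    reward -= 0.02 * empty_km

    # Sparse terminal reward kept at episode end, so shaping densifies the
    # learning signal rather than replacing the original objective.
    if episode_done:
        reward += 0.1 * orders_completed

    return reward
```

Under this kind of design, the agent receives feedback after every delivery decision rather than only at episode end, which is the mechanism the abstract credits for faster convergence than unshaped PPO.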

Published

31 October 2025

Issue

Vol. 1 No. 4 (2025)

Section

Article

How to Cite

Huang, S. (2025). Reinforcement Learning with Reward Shaping for Last-Mile Delivery Dispatch Efficiency. European Journal of Business, Economics & Management, 1(4), 122-130. https://doi.org/10.71222/ejnna704