Reinforcement learning (RL) has shown promise in optimizing treatment strategies for sepsis, a life-threatening condition responsible for significant mortality in intensive care units. However, deploying RL policies in clinical settings requires not only optimizing patient outcomes but also ensuring adherence to established medical guidelines. In this paper, we propose a two-stage safety framework for offline RL-based sepsis treatment. The first stage employs Constraint-Penalized Q-learning combined with Implicit Q-Learning (CPQ-IQL), which incorporates clinical constraints through Lagrangian optimization during policy learning. The second stage applies a runtime safety filter that dynamically validates actions against clinical guidelines before execution. We evaluate our framework on the ICU-Sepsis benchmark with four clinically motivated constraints derived from the Surviving Sepsis Campaign 2021 guidelines. Experimental results averaged over 5 random seeds demonstrate that CPQ-IQL achieves the lowest constraint violation rate (22.88 ± 0.94%) among all baselines while maintaining competitive survival rates (78.4 ± 1.8%). When combined with the Safe Actions filtering mechanism, constraint violations are reduced by 97.2% (from 22.88% to 0.41%), confirming the effectiveness of our two-stage safety framework. Our analysis reveals that the Safe Actions filter modifies approximately 21% of policy decisions, highlighting the importance of runtime safety mechanisms for clinical deployment. These findings suggest that combining constraint-aware offline learning with runtime safety filtering provides a practical pathway toward safe and effective RL-based clinical decision support systems.
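To make the second stage concrete, the following is a minimal sketch of what a runtime safe-action filter could look like: the learned policy's Q-values are masked by a guideline-based safety check, and the highest-valued permitted action is executed, with a fallback when no action passes. All names here (`safe_action`, the mask, the fallback) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def safe_action(q_values, action_is_safe, fallback=0):
    """Select the highest-Q action among those permitted by clinical constraints.

    q_values: per-action value estimates from the learned policy (illustrative).
    action_is_safe: boolean mask, True where the action passes guideline checks.
    fallback: action index used when no action passes (e.g., a clinician default).
    """
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(action_is_safe, dtype=bool)
    if not mask.any():
        # No guideline-compliant action available: defer to the fallback.
        return fallback
    # Exclude unsafe actions by assigning them -inf before the argmax.
    masked = np.where(mask, q, -np.inf)
    return int(np.argmax(masked))

# Example: action 1 has the highest Q-value but is flagged unsafe,
# so the filter overrides the greedy choice and returns action 2.
print(safe_action([1.0, 5.0, 3.0], [True, False, True]))
```

A filter of this shape only intervenes when the greedy action violates a constraint, which is consistent with the reported behavior of modifying roughly 21% of policy decisions while leaving the rest unchanged.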



