Reinforcement learning (RL) has shown promise in optimizing treatment strategies for sepsis, a life-threatening condition responsible for significant mortality in intensive care units. However, deploying RL policies in clinical settings requires not only optimizing patient outcomes but also ensuring adherence to established medical guidelines. In this paper, we propose a two-stage safety framework for offline RL-based sepsis treatment. The first stage employs Constraint-Penalized Q-learning combined with Implicit Q-Learning (CPQ-IQL), which incorporates clinical constraints through Lagrangian optimization during policy learning. The second stage applies a runtime safety filter that dynamically validates actions against clinical guidelines before execution. We evaluate our framework on the ICU-Sepsis benchmark with four clinically motivated constraints derived from the Surviving Sepsis Campaign 2021 guidelines. Experimental results averaged over 5 random seeds demonstrate that CPQ-IQL achieves the lowest constraint violation rate (22.88 ± 0.94%) among all baselines while maintaining competitive survival rates (78.4 ± 1.8%). When combined with the Safe Actions filtering mechanism, constraint violations are reduced by 97.2% (from 22.88% to 0.41%), confirming the effectiveness of our two-stage safety framework. Our analysis reveals that the Safe Actions filter modifies approximately 21% of policy decisions, highlighting the importance of runtime safety mechanisms for clinical deployment. These findings suggest that combining constraint-aware offline learning with runtime safety filtering provides a practical pathway toward safe and effective RL-based clinical decision support systems.
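To make the second stage concrete, the following is a minimal sketch of what a runtime safe-action filter could look like: the learned policy's Q-values are masked by a guideline-based safety check, and the highest-valued permitted action is executed, with a fallback when no action passes. All names here (`safe_action`, the mask, the fallback) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def safe_action(q_values, action_is_safe, fallback=0):
    """Select the highest-Q action among those permitted by clinical constraints.

    q_values: per-action value estimates from the learned policy (illustrative).
    action_is_safe: boolean mask, True where the action passes guideline checks.
    fallback: action index used when no action passes (e.g., a clinician default).
    """
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(action_is_safe, dtype=bool)
    if not mask.any():
        # No guideline-compliant action available: defer to the fallback.
        return fallback
    # Exclude unsafe actions by assigning them -inf before the argmax.
    masked = np.where(mask, q, -np.inf)
    return int(np.argmax(masked))

# Example: action 1 has the highest Q-value but is flagged unsafe,
# so the filter overrides the greedy choice and returns action 2.
print(safe_action([1.0, 5.0, 3.0], [True, False, True]))
```

A filter of this shape only intervenes when the greedy action violates a constraint, which is consistent with the reported behavior of modifying roughly 21% of policy decisions while leaving the rest unchanged.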



