Automated blood glucose control for Type 1 diabetes (T1D) is a safety-critical clinical challenge in which suboptimal control, particularly hypoglycemia, can be life-threatening. Existing reinforcement learning (RL) approaches are limited by the prohibition on online exploration, the difficulty of specifying clinically meaningful reward functions, and insufficient guarantees of worst-case safety. We propose Diffusion-Pref, a purely offline RL framework that integrates three synergistic components: (i) a conditional diffusion world model that captures the multimodal distributional uncertainty of glucose dynamics; (ii) a zero-shot preference construction method that automatically generates trajectory preference labels from established clinical metrics, eliminating the need for human annotation; and (iii) a Conditional Value-at-Risk (CVaR)-regularized Implicit Q-Learning (IQL) algorithm that explicitly optimizes for worst-case safety. We evaluate Diffusion-Pref on the OhioT1DM dataset, which comprises data from 12 real-world T1D patients. The proposed method achieves a Time-in-Range (TIR) of 89.7%, substantially exceeding both the historical treatment record (75.0%) and a Conservative Q-Learning (CQL) baseline (78.1%). Severe hypoglycemia (glucose <54 mg/dL) is reduced from 6.9% to 1.8%, a 74% relative reduction. The overall Time Below Range (TBR) of 10.3% remains above the recommended <4% target; however, supplementary experiments with an explicit TBR penalty (λTBR = 2.0) demonstrate that TBR can be reduced to 5.1% while preserving a TIR of 83.4%. Ablation studies confirm that each component (diffusion world model, zero-shot preference learning, and CVaR constraints) contributes meaningfully to performance.
These results demonstrate that combining generative world models with clinically grounded preference learning and risk-sensitive policy optimization offers a promising pathway toward safer offline glucose control, although prospective clinical validation remains necessary before deployment.
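
As an illustration of the quantities named above, the following minimal Python sketch shows how TIR, TBR, and the severe-hypoglycemia fraction might be computed from a glucose trace, how a preference label between two trajectories could be derived from those metrics without human annotation, and how an empirical lower-tail CVaR can be estimated. It is not the Diffusion-Pref implementation. The 70-180 mg/dL target range, the scoring weights `w_tbr` and `w_severe`, and all function names are assumptions introduced here for illustration; only the <54 mg/dL severe-hypoglycemia cutoff is taken from the text.

```python
import numpy as np

# Assumed thresholds in mg/dL. 70-180 is a commonly used target range and is
# an assumption here; the <54 cutoff is the one quoted in the abstract.
LOW, HIGH, SEVERE_LOW = 70.0, 180.0, 54.0


def clinical_metrics(glucose):
    """Return the fraction of readings in range, below range, and severely low."""
    g = np.asarray(glucose, dtype=float)
    return {
        "tir": float(np.mean((g >= LOW) & (g <= HIGH))),
        "tbr": float(np.mean(g < LOW)),
        "severe": float(np.mean(g < SEVERE_LOW)),
    }


def trajectory_score(glucose, w_tbr=2.0, w_severe=4.0):
    """Hypothetical scalar score: reward TIR, penalize low-glucose time.

    The weights are illustrative placeholders, not values from the paper.
    """
    m = clinical_metrics(glucose)
    return m["tir"] - w_tbr * m["tbr"] - w_severe * m["severe"]


def preference_label(traj_a, traj_b):
    """Return 1.0 if traj_a is preferred, 0.0 if traj_b is, 0.5 on a tie."""
    score_a, score_b = trajectory_score(traj_a), trajectory_score(traj_b)
    if score_a > score_b:
        return 1.0
    if score_a < score_b:
        return 0.0
    return 0.5


def cvar_lower(values, alpha=0.1):
    """Empirical CVaR: mean of the worst (lowest) alpha-fraction of values."""
    v = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * v.size)))  # always keep at least one sample
    return float(v[:k].mean())


if __name__ == "__main__":
    safe = [100, 120, 140, 160]   # every reading inside the assumed range
    risky = [50, 60, 120, 200]    # two lows (one severe) and one high
    print(clinical_metrics(risky))
    print(preference_label(safe, risky))
    print(cvar_lower([4, 1, 3, 2], alpha=0.5))
```

In a risk-sensitive objective of the kind described, a lower-tail statistic such as `cvar_lower` would be applied to returns or value estimates so that the worst outcomes, rather than the average, drive the penalty. How the CVaR term enters the IQL loss is not specified in this abstract and is not modeled here.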




