Offline reinforcement learning enables learning insulin dosing policies from historical data without risky interaction with patients. This study evaluates Implicit Q-Learning (IQL) on three real-world continuous glucose monitoring datasets: OhioT1DM (USA, type 1), ShanghaiT1DM (China, type 1), and ShanghaiT2DM (China, type 2). IQL achieved time in range of 68.5%, 54.3%, and 78.9%, respectively. Cross-dataset transfer experiments demonstrated strong generalization, with over 98% performance retention across geographic regions and diabetes types, suggesting that IQL captures fundamental glucose-insulin dynamics rather than dataset-specific patterns. Ablation studies validated our clinically motivated reward function design, while sensitivity and robustness analyses confirmed algorithm stability across hyperparameter choices and data-quality perturbations.
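The distinguishing feature of IQL is that it fits a state-value function with expectile regression over dataset actions only, so the policy is never evaluated on out-of-distribution actions. A minimal sketch of the asymmetric expectile loss (the expectile parameter `tau=0.7` is an illustrative value, not necessarily the one used in this study):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss used by IQL to fit V(s) toward Q(s, a).

    Positive residuals (Q above V) are weighted by tau, negative ones
    by 1 - tau, so tau > 0.5 biases V toward an upper expectile of Q.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff**2))

# In training, diff = Q(s, a) - V(s) over (s, a) pairs from the dataset;
# here we just evaluate the loss on example residuals.
residuals = np.array([1.0, -1.0])
print(expectile_loss(residuals, tau=0.7))  # (0.7*1 + 0.3*1) / 2 = 0.5
```

With `tau = 0.5` this reduces to ordinary mean-squared error; pushing `tau` toward 1 makes the value estimate approach the in-dataset maximum of Q without ever querying actions outside the data.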

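Time in range, the primary outcome reported above, can be computed directly from a stream of CGN readings. A minimal sketch, assuming the standard consensus target band of 70-180 mg/dL (the paper's exact thresholds are an assumption here):

```python
import numpy as np

def time_in_range(glucose_mg_dl, low=70.0, high=180.0):
    """Fraction of CGM readings falling within [low, high] mg/dL.

    Assumes uniformly spaced readings, so the fraction of samples in
    range equals the fraction of time in range.
    """
    g = np.asarray(glucose_mg_dl, dtype=float)
    return float(np.mean((g >= low) & (g <= high)))

readings = [65, 100, 150, 200, 110, 180]  # mg/dL
print(f"{time_in_range(readings):.1%}")   # 4 of 6 readings in range -> 66.7%
```

Because CGM devices sample at a fixed interval (typically every 5 minutes), the sample fraction is a faithful estimate of the clinical TIR percentage.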


