Open Access
Article

Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data

Fei Deng¹
Catherine H. Feng¹,²
Nan Gao³,⁴
Lanjing Zhang¹,⁴,⁵,⁶,*
Submitted: 29 Jan 2025 | Revised: 7 Apr 2025 | Accepted: 22 May 2025 | Published: 26 May 2025

Abstract

Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent RNA microarray and RNA-seq datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. TCGA breast cancer microarray and RNA-seq datasets were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG) (p < 0.05) were selected based on ANOVA p values and used for data normalization and classification, respectively. Models trained on data from one platform were tested on the other platform. Our data show that selecting NDEG and DEG effectively improved classification performance. Normalization methods based on parametric statistics were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seemed to achieve better performance. NDEG-based normalization therefore appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
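
The gene selection and normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and the reading of LOG_QN as a log2 transform followed by quantile normalization, and of LOG_QNZ as the same plus a per-gene z-score, is an assumption based only on the method names. The abstract does not say how the NDEG set enters the normalization, so the sketch shows selection and normalization as separate building blocks.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_pvalues(expr, labels):
    """One-way ANOVA p value per gene.

    expr: array of shape (n_samples, n_genes); labels: subtype of each sample.
    """
    labels = np.asarray(labels)
    groups = [expr[labels == k] for k in np.unique(labels)]
    # With 2-D inputs, f_oneway tests along axis 0: one test per gene column.
    return f_oneway(*groups).pvalue


def select_genes(pvals, ndeg_min=0.85, deg_max=0.05):
    """Indices of NDEG (p > 0.85) and DEG (p < 0.05), thresholds as in the abstract."""
    return np.flatnonzero(pvals > ndeg_min), np.flatnonzero(pvals < deg_max)


def quantile_normalize(x):
    """Give every sample (row) the same value distribution across genes."""
    # Rank of each gene within its sample; ties are broken by position.
    ranks = np.argsort(np.argsort(x, axis=1), axis=1)
    # Reference distribution: mean of the k-th smallest value over all samples.
    reference = np.sort(x, axis=1).mean(axis=0)
    return reference[ranks]


def log_qn(values, zscore=False):
    """Assumed LOG_QN: log2(x + 1) then quantile normalization.

    zscore=True adds a per-gene z-score (assumed reading of LOG_QNZ).
    """
    x = quantile_normalize(np.log2(np.asarray(values, dtype=float) + 1.0))
    if zscore:
        sd = x.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant genes
        x = (x - x.mean(axis=0)) / sd
    return x
```

In a cross-platform setting, `anova_pvalues` and `select_genes` would be run on the training platform only, and the resulting gene indices reused on the test platform, so that no information from the test data leaks into feature selection.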

How to Cite
Deng, F., Feng, C. H., Gao, N., & Zhang, L. (2025). Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data. Transactions on Artificial Intelligence, 1(1), 5. https://doi.org/10.53941/tai.2025.100005
Copyright & License
Copyright (c) 2025 by the authors.