
Dou, J., & Song, Y. An Improved Generative Adversarial Network with Feature Filtering for Imbalanced Data. International Journal of Network Dynamics and Intelligence. 2023, 2(4), 100017. doi: https://doi.org/10.53941/ijndi.2023.100017

Article

An Improved Generative Adversarial Network with Feature Filtering for Imbalanced Data

Jun Dou 1, and Yan Song 2,*

1 Department of Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China

2 Department of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

* Correspondence: sonya@usst.edu.cn; Tel.: +86-21-55271299; Fax: +86-21-55271299

 

 

Received: 7 October 2023

Accepted: 31 October 2023

Published: 21 December 2023

 

Abstract: The generative adversarial network (GAN) is a prevalent yet promising method to address the data imbalance problem. However, most existing GANs, which are usually inspired by computer vision techniques, have not yet delicately taken the significance and redundancy of features into consideration, and therefore tend to produce rough samples that overlap with existing ones or are incorrect. To address this problem, a novel GAN called the improved GAN with feature filtering (IGAN-FF) is proposed, which establishes a new loss function for model training by replacing the traditional Euclidean distance with the Mahalanobis distance and taking an ℓ1,2-norm regularization term into consideration. The remarkable merits of the proposed IGAN-FF can be highlighted as follows: 1) the utilization of the Mahalanobis distance makes a fair evaluation of different attributes without neglecting any trivial/small-scale but significant ones, and mitigates the disturbance caused by the correlation between features; 2) the embedding of the ℓ1,2-norm regularization term into the loss function contributes greatly to feature filtering by guaranteeing data sparsity, and also helps reduce the risk of overfitting. Finally, empirical experiments on 16 well-known imbalanced datasets demonstrate that the proposed IGAN-FF outperforms 11 other state-of-the-art methods on most evaluation metrics.

Keywords:

imbalanced data; generative adversarial network; ℓ1,2-norm regularization; Mahalanobis distance

1. Introduction

With the rapid growth of data types and data volumes in the era of information explosion, data processing has emerged as a vitally relevant research topic in the field of data analytics. Classification is an important task in machine learning and has led to significant advances in areas such as medical diagnosis [1, 2], fault detection [3−5], and financial fraud detection [6−8]. It is worth mentioning that the success of these approaches depends heavily on balanced data categories. This is not always the case in classification problems, as some classes contain far fewer samples than others due to their low frequency of occurrence. As a result, the larger the imbalance ratio, the greater the difficulty of classification. To tackle the challenging issue of imbalanced data classification, a significant amount of work has been dedicated to this problem and a body of promising results has been reported in the literature [9−13], which can be roughly divided into two categories: model-oriented methods and data-oriented methods. The former usually develops efficient models directly by scrutinizing the intrinsic characteristics of the data, such as imbalance ratios, without changing the amount of data. Nevertheless, the classification performance of these approaches turns out to be unsatisfactory under class overlap and extreme imbalance. As far as the data-oriented approaches are concerned, the primary concept is to regulate the sample size to achieve a balance between the different classes by means of sampling.

Among the data-oriented methods, oversampling and undersampling are two typical tools. However, given that undersampling leads to loss of information and in turn affects the classification performance, oversampling has become an increasingly prominent approach to classification problems with imbalanced datasets. This approach is considered effective as it not only restores data balance but also preserves the inherent characteristics of the original data. Nonetheless, most oversampling methods, such as the synthetic minority oversampling technique (SMOTE) [14] and its variants, generate new samples by interpolating between an existing sample and its K-nearest neighbors, which often leads to ambiguous sample features inconsistent with the original distribution. Recently, many researchers have therefore become interested in preserving the data distribution during oversampling. In [15], K-means SMOTE considers the data distribution by introducing the idea of clustering into the weight-assigning oversampling strategy. Similarly, MWMOTE [16] assigns weights to each minority cluster appropriately and does not generate instances beyond the range of the clusters, which ensures the safety of the synthetic instances. Dai et al. [17] synthesized new samples based on the weighted distribution of two factors, namely the inter-class distance and the cluster capacity. Although these methods take the data distribution into account in some sense, it is still difficult to ensure that the synthesized samples strictly conform to the original distribution for datasets with complex distributions, which directly gives rise to one of the motivations of this paper.
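To make the interpolation idea concrete, the sketch below (our own illustration, not code from [14]; the function name and parameters are assumptions) shows how SMOTE-style methods place a synthetic point on the line segment between a minority sample and one of its K nearest minority neighbours. It is precisely this linear interpolation that can yield ambiguous features when the true distribution is complex.

```python
import numpy as np

def smote_interpolate(minority_X, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours
    (the basic SMOTE idea; illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = minority_X.shape
    # pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority_X[:, None, :] - minority_X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbours per sample

    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        a = rng.integers(n)                          # pick a minority sample
        b = neighbours[a, rng.integers(k)]           # pick one of its k neighbours
        lam = rng.random()                           # interpolation factor in [0, 1]
        synthetic[i] = minority_X[a] + lam * (minority_X[b] - minority_X[a])
    return synthetic

X_min = np.random.default_rng(1).normal(size=(30, 4))    # toy minority class
print(smote_interpolate(X_min, n_new=5).shape)           # (5, 4)
```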

Recently, methods based on generative adversarial networks (GANs) [18] have been applied to sample expansion and data augmentation. Due to their powerful feature extraction and characterization capabilities, GAN-based methods make the generated samples closer to the real data distribution. For instance, Douzas et al. [19] proposed the use of conditional generative adversarial networks (CGANs) as an oversampling method for binary-class imbalanced data. Moreover, Gao et al. [20] investigated the Wasserstein generative adversarial network (WGAN) with gradient penalty to generate data, which increased the accuracy of the classifiers on the experimental datasets. Although the above methods are effective in obtaining samples that match the distribution of the original data, most synthesized samples tend to be considered desirable only within a strictly specific space, which omits the effects of trivial but essential features. The main reason for this problem is that GANs are trained based on the Euclidean distance, which makes it difficult to effectively characterize issues such as correlation and inconsistent magnitudes between features. Additionally, GAN-based sampling algorithms are usually very complex and require many parameters to tune, which not only leads to the characterization of redundant feature information and thus affects the quality of the synthesized samples, but also leads to overfitting during model training. Such issues therefore motivate us to develop a new oversampling method to bridge this gap.

Stimulated by the above discussions, our attention is devoted to investigating the problem of diversity and parameter optimization for GAN synthetic samples. Such a problem is nontrivial due to the following obstacles: 1) how to assess different attributes fairly while mitigating the interference caused by correlations between features? and 2) how to simultaneously avoid the impact of redundant features on synthetic samples and prevent model overfitting? To cope with these obstacles, the contributions of this paper are outlined as follows.

• Utilizing the Mahalanobis distance enables impartial evaluation of different attributes without ignoring any trivial/minor but crucial characteristics and alleviates the interference caused by correlation between features.

• Embedding the ℓ1,2-norm regularization term into the loss function ensures the sparsity of the data, which greatly facilitates feature filtering and reduces the risk of overfitting.

• Our proposed methods achieve satisfactory outcomes in handling the imbalanced classification, as demonstrated by a comparison with various oversampling approaches and experimental assessments on several public datasets.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the details of the main results. The experiment results and analysis are presented in Section 4. Finally, the conclusions are drawn in Section 5.

2. Related Work

In this section, some generative adversarial network-based learning methods for imbalanced data are reviewed. Besides, commonly used regularization terms in optimization functions, as well as distance metrics, are discussed, and the research motivation is given in light of the current shortcomings of GANs.

2.1. Generative Adversarial Network

Imbalanced learning techniques [21, 22] intend to tackle the problem of imbalanced data, in which at least one class of data is much smaller in number than the other classes of data. In general, the minority class tends to have a large impact on many real-world problems, such as cancer detection in medical diagnosis and fault diagnosis in industrial systems.

Existing methods for imbalanced learning mainly encompass: 1) sampling-based methods, which learn the imbalanced classification by oversampling [23, 24] the minority class or undersampling [25] the majority class, with representative methods such as SMOTE [26] generating data from existing minority classes; 2) cost-sensitive methods [27, 28], which employ different cost matrices to compute the cost of any particular data sample; 3) kernel-based methods, which use classifiers such as support vector machines (SVMs) [29] to maximize the separation margin; and 4) GAN-based approaches [30, 31], which, analogously to our proposed method, use the generator to balance the class distribution by creating minority-class samples. Nevertheless, most current GAN methods are used to deal with image problems, and little work applies GAN-based methods to the classification of imbalanced data.

The generative adversarial network is an unsupervised deep learning model proposed by Goodfellow et al. [18]. A GAN is composed of a generator and a discriminator, where the generator produces fake samples from random noise, and the role of the discriminator is to determine whether the generated synthetic samples are true or false. Once the discriminator judges the synthetic samples to be sufficiently similar to the real data, these samples are added to the minority class to constitute a balanced dataset. However, the Euclidean distance used in GAN makes it difficult to deal with problems such as inconsistency of magnitude, which restricts the synthesized samples to a narrow spatial range; this in turn causes the synthesized samples to overlap easily and to neglect essential features. In this regard, how to improve the diversity of the synthesized samples is one research motivation of this paper.

2.2. Regularization in Optimization Functions

In optimization problems, to balance model complexity and performance, a regularization term is usually added to the objective function. To date, the most popular norms used in the objective functions of optimization problems are the ℓ1-norm [32] and the ℓ2-norm [33]. For the former, due to the inherent property of the ℓ1-norm, the solution to an ℓ1-norm optimization is sparse; hence the ℓ1-norm is also called the sparse rule operator, and feature sparsity can be achieved by the ℓ1-norm, thus filtering out some redundant features. Analogously, the ℓ2-norm is the most prevalent norm in optimizing objective functions. For example, overfitting is a recurrent issue encountered when training on a dataset with a small sample size. To address the phenomenon of overfitting and to improve the generalization ability of the model, an ℓ2-norm can be applied to the objective function.

As discussed above, models in real-world problems often require both the ℓ1-norm for filtering redundant features and the ℓ2-norm for preventing the model from overfitting on an excessively complicated training set. Therefore, a mixed norm that incorporates both the ℓ1-norm and the ℓ2-norm is a natural choice for this paper. In addition, there is no relevant research on employing generative adversarial network models embedded with such a mixed norm [34] to deal with the imbalance problem, which is another research motivation of this paper.

3. Main Results

In this section, in order to tackle imbalanced classification problems, we propose a GAN-oriented imbalanced learning method, called IGAN-FF, which incorporates an ℓ1,2-norm regularizer and the Mahalanobis distance. The generator with the ℓ1,2-norm regularizer is effective not only in improving the robustness of the model and removing redundant features but also in preventing overfitting. In addition, the discriminator is trained to discriminate between real samples and fake (i.e., generated) samples, and also between minority samples and majority samples on the synthetic balanced dataset. It should be noted that all distance computations in this paper replace the traditional Euclidean distance with the Mahalanobis distance.

3.1. Optimization Based on Mahalanobis Distance without Regularizer

GANs are extensively used as they can generate high-quality data that matches the distribution of the original data given a sufficient amount of training. As shown in formula (1), the essential idea of this model is to design the generator G and the discriminator D to play a minimax game. There is a competition between G and D. The generator G tries to learn the distribution of the input data from the random variable z and produces a new sample G(z). The discriminator D is responsible for distinguishing between G(z) and real samples, and the training process of GAN is to run the generator and discriminator alternately until the discriminator cannot distinguish between G(z) and real data. In contrast to common generative models, GANs can generate high-quality data and do not need to predefine the data distribution of the input samples. Therefore, we choose to rebalance the imbalanced data by using GAN to generate more samples of minority classes. The objective function is shown in the following equation:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)

where V(D, G) is the value function and E denotes the expectation over the corresponding distribution, x is real data obeying the distribution p_data(x), and z is a noise variable obeying the distribution p_z(z).

It is difficult for GAN to deal with the problem of inconsistency in magnitude when utilizing the Euclidean distance to measure the relationship between samples and features, which limits the spatial range of the synthesized samples and ignores potentially valuable attributes. In this regard, the Mahalanobis distance is used in this paper to impartially assess the different attributes, because it has the merit of handling samples with inconsistent dimensions. Using the Mahalanobis distance for the similarity measurement can effectively remove the interference of correlation between sample features, largely eliminate the influence of dimension on the algorithm, and also help detect outliers. The Mahalanobis distance [35] of two samples x_i and x_j, denoted d_M(x_i, x_j), can be calculated by

d_M(x_i, x_j) = sqrt( (x_i − x_j)^T Σ^{-1} (x_i − x_j) )   (2)

where x_i and x_j are two different samples, and Σ and Σ^{-1} are the covariance matrix of the samples and its inverse, respectively. From this, we can see that the covariance matrix Σ must be inverted. If there is a serious correlation in the variable space, the covariance matrix becomes singular, which obviously results in (2) being unsolvable. Therefore, when the number of minority class samples is less than the number of features, we need to reduce the dimensionality of the samples.

The principal component analysis (PCA) [36, 37] is a multivariate statistical method used to reduce the dimension of variables by projecting high-dimensional data into a low-dimensional principal component space. Therefore, the calculation of the Mahalanobis distance is transferred to the principal component space, and the data dimension is compressed to effectively avoid the problem of matrix singularity. On the premise that (2) remains solvable, the number of principal components is kept as large as possible in order to retain the information of the real data.
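As an illustration of how the two steps fit together, the following sketch (assuming scikit-learn's PCA; the implementation details are our own, not the paper's) first projects the data into the principal component space and then evaluates the Mahalanobis distance of (2) there, so that the covariance matrix remains invertible even when the minority class has fewer samples than features.

```python
import numpy as np
from sklearn.decomposition import PCA

def mahalanobis_after_pca(X, i, j, n_components=None):
    """Project X with PCA, then compute the Mahalanobis distance between
    samples i and j in the principal component space (illustrative sketch)."""
    if n_components is None:
        # keep as many components as possible while the covariance stays invertible
        n_components = min(X.shape[0] - 1, X.shape[1])
    Z = PCA(n_components=n_components).fit_transform(X)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    cov_inv = np.linalg.inv(cov)
    diff = Z[i] - Z[j]
    return float(np.sqrt(diff @ cov_inv @ diff))

X_min = np.random.default_rng(0).normal(size=(10, 30))   # fewer samples than features
print(mahalanobis_after_pca(X_min, 0, 1))
```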

3.2. ℓ1,2-Norm Regularizer

The ℓ1-norm is widely used in optimization problems and is considered a very useful tool for solving sparsity problems [32]. For example, given a dataset with a large number of features for training, a typical issue encountered is the effect of redundant features on the training results, i.e., the trained model may implicitly characterize noisy and irrelevant features in the data, which reduces the training effectiveness. Therefore, one solution to avoid such problems is to include a regularization term in the loss function. In this regard, by using the regularization term represented by the ℓ1-norm in (3), it is possible to make the final desired solution a sparse one by making the value of the ℓ1-norm as small as possible:

||W||_1 = Σ_i Σ_j |w_ij|   (3)

where W denotes a matrix and w_ij denotes the entry in the ith row and jth column of the matrix W. Due to the inherent property of the ℓ1-norm, the solution to an ℓ1-norm optimization is sparse, and hence the ℓ1-norm is also called the sparse rule operator; feature sparsity can be achieved by the ℓ1-norm, thus filtering out some redundant features. For instance, when classifying a user's movie preference, the user may have 100 features, only a dozen of which are useful for classification, while most of the features, such as height and weight, are irrelevant and can be filtered out by using the ℓ1-norm.
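The feature-filtering effect can be seen on a toy example (our own illustration using scikit-learn, not code from the paper): only three of twenty features drive the target, and the ℓ1-penalized model sets most of the irrelevant coefficients exactly to zero, whereas the ℓ2-penalized model merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                     # 20 features, only 3 informative
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                 # l1-penalised linear model
ridge = Ridge(alpha=0.1).fit(X, y)                 # l2-penalised linear model
print("non-zero coefficients with l1 penalty:", int(np.sum(lasso.coef_ != 0)))
print("non-zero coefficients with l2 penalty:", int(np.sum(ridge.coef_ != 0)))
```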

Analogously, the ℓ2-norm is the most prevalent norm in optimizing objective functions [33]. For example, overfitting is another recurrent issue encountered when training on a dataset with a small sample size. To address the phenomenon of overfitting and to improve the generalization ability of the model, an ℓ2-norm can be applied to the objective function, which is defined as shown below:

||W||_2 = ( Σ_i Σ_j w_ij^2 )^{1/2}   (4)

Considering that removing redundant features and preventing model overfitting are often needed simultaneously in real problems, a mixed norm that combines the ℓ1-norm and the ℓ2-norm comes naturally to mind. Until now, there have been only a few studies on ℓ1-norm regularization for GAN training [38], let alone the so-called mixed ℓ1,2-norm. In this paper, we tackle the problem of parameterizing a GAN with a suitable ℓ1,2-norm regularizer that achieves parameter sparsity and prevents model overfitting by mixing the ℓ1- and ℓ2-norm regularizers. Specifically, the ℓ1,2-norm is defined as:

where W is the set of training weights of the generator or discriminator, and w_i denotes the ith entry of W.
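Since the exact form of equation (5) is not reproduced in this text, the following sketch only illustrates the general idea of a mixed ℓ1/ℓ2 penalty over a set of training weights, assuming a PyTorch-style implementation; the weighting between the two terms and the precise definition of the paper's ℓ1,2-norm are assumptions of ours.

```python
import torch

def mixed_l1_l2_penalty(weights, alpha=1.0, beta=1.0):
    """Illustrative mixed l1/l2 penalty over a set of training weights.
    The exact form of the paper's l1,2-norm (equation (5)) is not reproduced
    here; this simply adds an l1 term (sparsity / feature filtering) to an
    l2 term (overfitting control)."""
    weights = list(weights)                                 # allow generators such as model.parameters()
    l1 = sum(w.abs().sum() for w in weights)                # encourages sparse weights
    l2 = torch.sqrt(sum((w ** 2).sum() for w in weights))   # keeps weights small overall
    return alpha * l1 + beta * l2

# usage: add the penalty to the task loss before backpropagation, e.g.
#   loss = task_loss + lambda_reg * mixed_l1_l2_penalty(model.parameters())
demo = [torch.randn(4, 3), torch.randn(3)]
print(float(mixed_l1_l2_penalty(demo)))
```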

3.3. Improved GAN Model with ℓ1,2-Norm Regularizer

With the above discussion, we design a GAN that incorporates the Mahalanobis distance and the ℓ1,2-norm regularizer to solve the problems of inconsistent magnitude between samples, feature redundancy and model overfitting. To be specific, the generator G is a multilayer perceptron (MLP) trained to generate realistic data from random noise z. The loss function of G is

where this loss function consists of four terms. The first and second terms are the confusing discriminator losses over the generated minority samples, which involve the sample labels and the output (prediction probability) of the discriminator. The third term aims at making the generated minority samples close to the real minority samples, and involves the number of synthetic samples and the number of minority class samples, with |·| denoting the absolute value. The last term is the ℓ1,2-norm regularizer over the set of training weights of the generator, scaled by a regularization coefficient.

Similarly, the discriminator D is an MLP trained to distinguish the real data and the generated data in its input. The loss function of D is

where this loss function consists of four terms. The first term is the cross-entropy loss that discriminates whether a sample is generated by the generator or is a real sample of the original dataset, and involves the number of majority class samples together with the quantities defined as in (6). The second term is also a cross-entropy loss, discriminating whether a sample belongs to the minority class or the majority class. The third term aims at making samples of different classes far away from each other. The last term is the ℓ1,2-norm regularizer over the set of training weights of the discriminator, scaled by a regularization coefficient.

Finally, the adversarial training objective function of IGAN-FF is given as equation (8):

The goal of the generator is to generate fake minority samples to simulate the real minority sample distribution to confuse the discriminator. The goal of the discriminator is to correctly classify between the real training samples and the fake samples generated from the generator, and also between the minority samples and the majority samples.

So far, we have discussed the proposed generative adversarial network algorithm embedded with the ℓ1,2-norm regularizer and the Mahalanobis distance for imbalanced data, i.e., IGAN-FF. The pseudo-code for rebalancing imbalanced data is given in Algorithm 1. The characteristics of the proposed method can be highlighted as the following three advantages: 1) the GAN incorporating the Mahalanobis distance can impartially evaluate the different attributes of a sample without ignoring any trivial but crucial attributes and can mitigate the interference caused by correlation between attributes; 2) the generator with the ℓ1,2-norm regularizer is effective not only in improving the robustness of the model and removing redundant features but also in preventing overfitting; and 3) the discriminator can effectively distinguish not only between generated samples and real samples but also between majority samples and minority samples.

4. Experimental Results

4.1. Datasets Analysis

The experimental datasets in this paper are 16 common datasets borrowed from the KEEL and UCI [39] machine learning repositories, used to investigate the performance of the proposed method in various scenarios. Considering that the research task of this paper is to test the learning capability of the algorithm in a binary classification setting, some modifications have been made to the labels of several original datasets with multiple classes. The details of these datasets are shown in Table 1. The column “Minority” indicates the class that is considered the minority class in the experiment, and the last column “IR%” indicates the ratio of the number of samples in the minority class to the number of samples in the majority class.

Table 1. Information of the Datasets

Dataset Minority Features Size Size of Maj Size of Min IR%
Wine Class “2” 13 198 151 47 31.1
Vehicle Class “van” 18 846 646 200 31
Ecoli4 Class “0” 7 336 316 20 6.3
Libra Class “1” 90 360 336 24 7.1
Abalone Class “18” 7 4177 4135 42 1.0
Pima Class “1” 8 768 500 268 53.6
Iris Class “2” 4 150 100 50 50
Vowel Class “0” 18 846 647 199 30.8
Yeast Class “0” 8 1483 1150 303 26.3
Liver Class “1” 6 178 107 71 66.3
Pageblock Class “text” 10 5472 4913 559 11.3
Segment Class “WINDOW” 18 2310 1980 330 16.7
Glass Class “1” 9 208 138 70 50.7
Letter Class “0” 16 20000 19266 734 3.8
Default Class “0” 23 30000 23364 6636 28.4
Judicial Class “5” 60 5473 5242 231 7.04

 

4.2. Evaluation Metrics

Accuracy is a widely used evaluation criterion in classification, which reflects the proportion of samples classified correctly. However, accuracy may not judge the minority samples precisely under the imbalanced learning condition. Hence, for imbalanced data, multiple indices are used to evaluate the effectiveness of the proposed method, including Precision, Recall, F-Score and G-Means [40, 41]. To give a better explanation, a confusion matrix is defined in Table 2.

Table 2. Confusion matrix

Actual Predicted
 Class P Class N
Class P TP FN
Class N FP TN

 

In Table 2, Class P represents the minority class, and Class N stands for the remainder (the majority class). TP and FN are, respectively, the numbers of correctly predicted and incorrectly predicted samples in Class P. TN and FP are the numbers of correctly predicted and incorrectly predicted samples in Class N, respectively.

Based on the above confusion matrix, indices borrowed from [17, 42] are given to evaluate the proposed method as follows:

In experiments, we hope to find sufficiently small values of FP and FN, i.e., FP and FN are expected to be close to 0, so that the above five evaluation metrics are close to 1. Besides, the area under the curve (AUC) of the receiver operating characteristic (ROC) is an effective metric to evaluate the performance, especially for imbalanced data [43, 44]. The closer the AUC is to 1, the better the classification performance will be [12, 45].
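For reference, the sketch below computes these indices directly from the confusion-matrix entries (a minimal example of the standard definitions, with G-Means taken as the geometric mean of sensitivity and specificity; the paper's exact equations (9)−(13) are not reproduced here).

```python
import numpy as np

def imbalance_metrics(tp, fn, fp, tn):
    """Standard indices computed from the confusion-matrix entries of Table 2."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)                 # sensitivity on the minority class
    f_score     = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    g_means     = np.sqrt(recall * specificity)  # geometric mean of the two class accuracies
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                f_score=f_score, g_means=g_means)

print(imbalance_metrics(tp=40, fn=10, fp=20, tn=430))
```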

4.3. Simulation Analysis and Discussion

4.3.1. General Setting

All experiments are conducted on a PC server with a 2.5 GHz CPU and 16 GB memory, and all tested models are implemented in Python to check their stability for real usage. A 70%−30% train-test split and 5-fold cross-validation are employed to obtain unbiased results.
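A minimal sketch of such a protocol is given below (assuming scikit-learn utilities and placeholder data; exactly how the split and the cross-validation are combined in the paper is not specified, so this is only one plausible arrangement).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                       # placeholder features
y = (rng.random(300) < 0.2).astype(int)             # placeholder imbalanced labels

# stratified 70%-30% train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 5-fold cross-validation on the training portion
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    # here one would oversample X_train[tr_idx] (e.g. with IGAN-FF),
    # fit the classifier, and evaluate on X_train[val_idx]
    print(f"fold {fold}: {len(tr_idx)} training / {len(val_idx)} validation samples")
```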

4.3.2. Tested Methods

To demonstrate the performance of our proposed techniques, the experiments compare four state-of-the-art oversampling techniques, four ensemble classifiers and three deep learning GAN-based methods, which include K-means SMOTE, MWMOTE, DFBASO, IFID [9], adaptive boosting (AdaBoost) [46], random forest (RF) [47], gradient boosting (Gboost) [48], self-paced ensemble (SPE) [49], GAN, WGAN, CGAN. In addition, except for the four ensemble classifier methods, the classifiers used are all deep neural networks (DNN) with 2 hidden layers.

In addition, the parameters of IGAN-FF are set as follows. In terms of network structure, the experiments use a three-layer fully connected neural network to construct the generator and the discriminator. The numbers of units for the hidden layers of the generator and the discriminator are set to 32-256 and 16-128, respectively. The batch size is set to 32, the number of epochs is 5000, and the rectified linear unit (ReLU) is chosen as the activation function of the hidden layers. In the output layer, the sigmoid function is used as the activation function. To optimize the losses of the generator and the discriminator, the root mean square propagation (RMSProp) optimizer is applied with a learning rate of 0.001. Figure 1 illustrates the training process of IGAN-FF, where the purple and blue lines indicate the changes in the loss values of the generator and the discriminator, respectively. From the figure, we can easily find that the fluctuations of the generator and discriminator losses are very obvious before 10000 iterations, and when the number of training iterations exceeds 20000, the losses of the generator and the discriminator tend to stabilize. Under this condition, the trained IGAN-FF model can output synthetic data more similar to the real original samples.
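For illustration, the sketch below builds a generator and a discriminator with the reported hyperparameters and runs a few plain adversarial training steps (assuming PyTorch, interpreting "32-256" and "16-128" as the sizes of two hidden layers, and using a standard binary cross-entropy loss; the Mahalanobis-distance terms and the ℓ1,2-norm regularizer of Section 3.3 are omitted, so this is not the full IGAN-FF objective).

```python
import torch
import torch.nn as nn

def make_generator(noise_dim, data_dim):
    # hidden layers of 32 and 256 units with ReLU, sigmoid output layer
    return nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(),
                         nn.Linear(32, 256), nn.ReLU(),
                         nn.Linear(256, data_dim), nn.Sigmoid())

def make_discriminator(data_dim):
    # hidden layers of 16 and 128 units with ReLU, sigmoid output (probability of "real")
    return nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(),
                         nn.Linear(16, 128), nn.ReLU(),
                         nn.Linear(128, 1), nn.Sigmoid())

noise_dim, data_dim, batch_size = 16, 8, 32
G, D = make_generator(noise_dim, data_dim), make_discriminator(data_dim)
opt_g = torch.optim.RMSprop(G.parameters(), lr=1e-3)
opt_d = torch.optim.RMSprop(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.rand(batch_size, data_dim)            # placeholder minority batch in [0, 1]
for _ in range(5):                                 # a few illustrative training steps
    # discriminator step: push D(real) towards 1 and D(fake) towards 0
    fake = G(torch.randn(batch_size, noise_dim)).detach()
    d_loss = bce(D(real), torch.ones(batch_size, 1)) + bce(D(fake), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator step: push D(G(z)) towards 1 to confuse the discriminator
    fake = G(torch.randn(batch_size, noise_dim))
    g_loss = bce(D(fake), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```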

Figure 1. Generator and discriminator loss values for IGAN-FF.

4.3.3. Simulation Analysis and Discussion

According to the evaluation indicators described in (9)−(13), the results of the comparison experiments for the different evaluation indicators are shown in Tables 3−5, where the best result for each indicator is highlighted in bold. Table 3 presents a comparison between IGAN-FF and the 11 other approaches in terms of classification effectiveness on 16 public imbalanced datasets [39], using the F-Score metric with a DNN as the classifier. As can be noticed from the table, the IGAN-FF proposed in this paper performs satisfactorily on the 16 datasets: the best performance is obtained on 11 datasets and the second-best results on 2 datasets. However, the classification performance of IGAN-FF is less desirable on the Libra, Segment and Judicial datasets, which may be because the number of minority samples in these three datasets is too scarce to form a reliable statistical distribution. In addition, these three datasets are severely imbalanced, with IR% values of 7.1%, 16.7%, and 7.04%, respectively, which may lead to a large hyperplane bias of the classifiers; thus, the applicability of the proposed IGAN-FF to such datasets could be improved.

Table 3. Results of F-Score for IGAN-FF on 16 imbalanced datasets

Datasets K-SMOTE MWMOTE DFBASO IFID AdaBoost RF Gboost SPE GAN WGAN CGAN IGAN-FF
Wine 0.978 0.978 0.976 0.971 0.943 0.937 0.927 0.947 0.954 0.979 0.968 0.98
Vehicle 0.841 0.84 0.872 0.851 0.838 0.803 0.802 0.841 0.862 0.873 0.865 0.881
Ecoli4 0.825 0.801 0.819 0.822 0.773 0.798 0.713 0.767 0.836 0.854 0.842 0.857
Libra 0.903 0.855 0.874 0.881 0.767 0.796 0.718 0.785 0.884 0.897 0.888 0.896
Abalone 0.856 0.841 0.822 0.825 0.565 0.612 0.582 0.599 0.841 0.858 0.843 0.861
Pima 0.732 0.732 0.747 0.725 0.672 0.635 0.613 0.664 0.717 0.741 0.746 0.753
Iris 0.877 0.879 0.883 0.866 0.947 0.92 0.948 0.925 0.917 0.928 0.926 0.937
Vowel0 0.903 0.916 0.914 0.927 0.871 0.891 0.906 0.896 0.906 0.924 0.912 0.934
Yeast 0.705 0.748 0.769 0.766 0.761 0.754 0.765 0.748 0.747 0.781 0.766 0.799
Liver 0.681 0.669 0.688 0.674 0.646 0.649 0.645 0.609 0.663 0.683 0.677 0.687
Pageblock 0.779 0.782 0.781 0.784 0.719 0.73 0.68 0.647 0.762 0.793 0.771 0.801
Segment 0.783 0.812 0.859 0.817 0.714 0.759 0.741 0.719 0.815 0.839 0.841 0.855
Glass 0.754 0.734 0.761 0.757 0.417 0.08 0.124 0.333 0.722 0.751 0.744 0.765
Letter 0.698 0.71 0.688 0.691 0.719 0.73 0.68 0.647 0.751 0.801 0.779 0.816
Default 0.792 0.768 0.782 0.784 0.719 0.73 0.68 0.647 0.833 0.845 0.84 0.852
Judicial 0.83 0.845 0.819 0.837 0.719 0.73 0.68 0.647 0.819 0.857 0.839 0.851
Average 0.809 0.807 0.816 0.811 0.737 0.722 0.700 0.714 0.814 0.838 0.828 0.845
Ranking 6.063 6.500 5.125 5.813 9.375 9.250 10.063 10.188 6.625 2.625 4.438 1.438

 

Table 4. Results of G-Means for IGAN-FF on 16 imbalanced datasets

Datasets K-SMOTE MWMOTE DFBASO IFID AdaBoost RF Gboost SPE GAN WGAN CGAN IGAN-FF
Wine 0.974 0.975 0.973 0.969 0.934 0.918 0.908 0.936 0.96 0.988 0.971 0.989
Vehicle 0.934 0.933 0.931 0.929 0.848 0.865 0.828 0.849 0.944 0.952 0.943 0.955
Ecoli4 0.93 0.923 0.933 0.911 0.782 0.809 0.703 0.767 0.929 0.938 0.927 0.933
Libra 0.925 0.885 0.911 0.903 0.768 0.8 0.717 0.778 0.913 0.922 0.915 0.927
Abalone 0.903 0.933 0.871 0.884 0.621 0.676 0.636 0.685 0.899 0.924 0.91 0.922
Pima 0.655 0.656 0.674 0.649 0.463 0.471 0.458 0.467 0.651 0.669 0.671 0.684
Iris 0.812 0.823 0.886 0.838 0.893 0.881 0.914 0.893 0.891 0.911 0.903 0.92
Vowel0 0.895 0.904 0.901 0.924 0.92 0.918 0.923 0.936 0.919 0.937 0.928 0.941
Yeast 0.92 0.937 0.922 0.917 0.742 0.73 0.739 0.718 0.892 0.929 0.914 0.933
Liver 0.646 0.628 0.656 0.635 0.362 0.45 0.437 0.312 0.672 0.689 0.667 0.695
Pageblock 0.923 0.923 0.925 0.911 0.715 0.729 0.685 0.734 0.913 0.919 0.909 0.922
Segment 0.93 0.943 0.928 0.939 0.702 0.74 0.723 0.69 0.9 0.924 0.929 0.933
Glass 0.688 0.682 0.688 0.691 0.388 0.09 0.119 0.319 0.681 0.692 0.688 0.702
Letter 0.893 0.9 0.887 0.899 0.715 0.729 0.685 0.734 0.923 0.946 0.926 0.944
Default 0.795 0.796 0.776 0.781 0.715 0.729 0.685 0.734 0.851 0.869 0.866 0.871
Judicial 0.858 0.892 0.855 0.866 0.715 0.729 0.685 0.734 0.861 0.9 0.892 0.904
Average 0.855 0.858 0.857 0.853 0.705 0.704 0.678 0.705 0.862 0.882 0.872 0.886
Ranking 5.688 5.188 5.813 6.438 10.063 9.750 10.500 9.313 6.000 2.688 4.500 1.625

 

Table 5. Results of AUC for IGAN-FF on 16 imbalanced datasets

Datasets K-SMOTE MWMOTE DFBASO IFID AdaBoost RF Gboost SPE GAN WGAN CGAN IGAN-FF
Wine 0.978 0.979 0.977 0.981 0.949 0.939 0.956 0.958 0.982 0.993 0.98 0.997
Vehicle 0.979 0.974 0.972 0.956 0.914 0.921 0.932 0.924 0.977 0.984 0.979 0.989
Ecoli4 0.947 0.941 0.952 0.937 0.898 0.877 0.861 0.898 0.951 0.959 0.955 0.96
Libra 0.997 0.994 0.992 0.989 0.945 0.955 0.941 0.961 0.993 0.996 0.991 0.996
Abalone 0.956 0.976 0.923 0.946 0.837 0.849 0.822 0.839 0.958 0.961 0.956 0.966
Pima 0.733 0.735 0.748 0.728 0.796 0.828 0.831 0.805 0.734 0.744 0.743 0.761
Iris 0.877 0.886 0.886 0.879 0.985 0.982 0.977 0.972 0.934 0.952 0.946 0.971
Vowel0 0.967 0.982 0.983 0.991 0.959 0.968 0.967 0.976 0.977 0.986 0.981 0.993
Yeast 0.951 0.953 0.971 0.969 0.959 0.971 0.972 0.962 0.969 0.978 0.964 0.982
Liver 0.685 0.671 0.691 0.685 0.758 0.776 0.765 0.713 0.723 0.773 0.758 0.783
Pageblock 0.955 0.96 0.963 0.959 0.887 0.874 0.884 0.878 0.944 0.957 0.951 0.961
Segment 0.931 0.943 0.93 0.941 0.976 0.969 0.965 0.971 0.933 0.945 0.951 0.971
Glass 0.771 0.768 0.774 0.775 0.568 0.585 0.561 0.596 0.759 0.768 0.771 0.78
Letter 0.938 0.945 0.936 0.944 0.887 0.874 0.884 0.878 0.969 0.982 0.971 0.989
Default 0.872 0.855 0.868 0.862 0.787 0.774 0.784 0.778 0.901 0.922 0.911 0.937
Judicial 0.904 0.912 0.891 0.911 0.887 0.874 0.884 0.878 0.901 0.942 0.925 0.939
Average 0.903 0.905 0.904 0.903 0.875 0.876 0.874 0.874 0.913 0.928 0.921 0.936
Ranking 7.250 6.500 6.438 6.813 8.375 8.375 8.313 8.375 6.313 3.438 5.188 1.813

 

Along the same lines, Table 4 exhibits the performance of the proposed method IGAN-FF compared with the 11 diverse approaches in terms of G-Means. Note from Table 4 that IGAN-FF achieves 10 best, 3 second-best and 3 third-best results on the 16 datasets, with an average ranking of 1.625, which is 1.063 better (lower) than that of WGAN, the second-ranked method. An inference can be drawn from the results that the four deep-learning-based algorithms outperform the four oversampling-based methods in terms of both the mean value and the average ranking, while the four ensemble-learning-based algorithms are significantly inferior to the preceding two groups. The cause of this result may be that, although ensemble learning algorithms can handle imbalanced classification problems, their capability of dealing with extremely imbalanced classification problems is still limited.

In Table 5, we compare the AUC results of the different methods. Although IGAN-FF obtains only 9 optimal and 5 sub-optimal results on the 16 datasets, its average ranking is 1.813, which is 1.625 better (lower) than that of the second-ranked WGAN. It is worth mentioning that the average rankings of the four oversampling algorithms, the four ensemble algorithms, and the four deep-learning algorithms are 6.750, 8.360, and 4.188, respectively; similarly, their average AUC values are 0.904, 0.875, and 0.925, respectively. The proposed method IGAN-FF outperforms the highest-ranked oversampling method MWMOTE by 0.031 and the highest-ranked ensemble algorithm RF by 0.060 in terms of average AUC. According to the above discussions, the comparison results demonstrate the effectiveness of our proposed algorithm.

In what follows, Figure 2 illustrates the radar charts of the different approaches on the 16 datasets. From the figure, it is easy to see that the proposed algorithm IGAN-FF is the best in terms of the robustness of the three metrics (F-Score, G-Means, AUC). It is also worth noting that when the number of minority samples is too limited, the minority class may not exhibit a meaningful data distribution or statistical significance; the discriminator then cannot work effectively during the synthesis process, which makes it difficult to determine the sample distribution and perform correct synthesis. Accordingly, given that the method proposed in this paper is based on generative adversarial networks, the number of minority class samples cannot be too small.

Figure 2. Comparison of classification robustness of different methods (a−p).

Next, to better summarize Tables 3−5, the average values and average rankings of each algorithm for the different metrics are displayed in Figure 3. From the figure, it is easy to find that the average value of the proposed method is optimal in terms of F-Score, G-Means and AUC, which sufficiently verifies the robustness and effectiveness of the proposed method. Nevertheless, across the three evaluation metrics, unsatisfactory results can be found for the datasets Pageblock and Segment. The reason for this may be related to an inaccurate transformation of these datasets' unstructured data into structured data, which affects the subsequent classification results. Moreover, the datasets Ecoli4 and Libra also yield poor results, which may occur because the number of samples in the minority class is so sparse that a statistical distribution may not even be available. As a result, the discriminator is incapable of accurately determining whether the synthesized samples are close to the original data, which results in the synthesis of a large number of samples that do not match the original distribution. Based on the above analysis, the method proposed in this paper is more suitable for datasets with a certain number of minority class samples; that is to say, the algorithm will achieve satisfactory results only when the minority class exhibits a statistical distribution.

Figure 3. (a) Average and (b) Ranking of different algorithms for each evaluation metric on 16 datasets.

Finally, to shed further light on whether IGAN-FF is significantly distinct from the other algorithms and to ensure that the improvement is statistically significant, the Mann-Whitney Wilcoxon test [50] is employed to validate the conjecture that the proposed method outperforms the other methods, and the p-value comparison of all algorithms in terms of F-Score, G-Means and AUC is reported in Table 6. In Table 6, the p-values of F-Score, G-Means and AUC of the proposed method IGAN-FF against each compared method on the 16 datasets are described in three rows. Considering the stochastic properties of the computations, a hypothesis test is conducted to show whether there is a significant difference between two algorithms. Without loss of generality, the null hypothesis is that there is no significant difference between the two algorithms. With the significance level α set to 0.05, most of the p-values are substantially smaller than α and are thus statistically significant. Hence, by analyzing the results in Table 6, it can be summarized that the proposed algorithm IGAN-FF obtains 29 significant results out of the total 33 comparisons of F-Score, G-Means and AUC. However, four comparisons are not significant, which may be because the three data-based algorithms and one GAN-based method concerned take the data distribution into account to some extent, resulting in higher-quality synthesized samples and better classification performance. In summary, the proposed algorithm IGAN-FF is significantly superior to the other comparative methods.
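As an example of how such a test can be run, the sketch below applies SciPy's Mann-Whitney U test to the per-dataset F-Scores of IGAN-FF and WGAN taken from Table 3 (the one-sided alternative and this particular pairing are our choices for illustration; the paper's exact test configuration is not reproduced here).

```python
from scipy.stats import mannwhitneyu

# per-dataset F-Scores taken from Table 3
igan_ff = [0.980, 0.881, 0.857, 0.896, 0.861, 0.753, 0.937, 0.934,
           0.799, 0.687, 0.801, 0.855, 0.765, 0.816, 0.852, 0.851]
wgan    = [0.979, 0.873, 0.854, 0.897, 0.858, 0.741, 0.928, 0.924,
           0.781, 0.683, 0.793, 0.839, 0.751, 0.801, 0.845, 0.857]

# null hypothesis: no significant difference between the two methods
stat, p_value = mannwhitneyu(igan_ff, wgan, alternative='greater')
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```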

Table 6. Mann-Whitney Wilcoxon test results of F-Score, G-Means and AUC between IGAN-FF and each of the compared methods

5. Conclusions

In this paper, we have proposed a GAN-oriented imbalanced learning method, called IGAN-FF, which incorporates an ℓ1,2-norm regularizer and the Mahalanobis distance. To evaluate the different attributes fairly without neglecting any trivial/small-scale but important attributes, and also to attenuate the interference caused by correlation between features, a GAN incorporating the Mahalanobis distance has been adopted. Subsequently, the ℓ1,2-norm regularization term has been embedded into the loss function, which ensures the sparsity of the data, greatly facilitates feature filtering, and reduces the risk of overfitting. In the end, experiments on several public datasets have been utilized to show the effectiveness and universal applicability of the proposed method. It can be observed from the above experimental results that the proposed IGAN-FF strategy provides an effective way to handle imbalanced data; meanwhile, it has good practical applicability because 1) the datasets adopted in the experiments cover various fields, such as traffic systems, judicial systems and mail detection, and 2) datasets with different scales and imbalance ratios have been employed in the comparison experiments with other methods. In future work, we may try to develop a new deep-learning method to fulfill data imputation without sacrificing much computational cost. Besides, semi-supervised classification for imbalanced and incomplete data can also be our next research direction.

In future work, we also plan to extend the obtained results to handle datasets whose minority class does not exhibit a statistical distribution.

Author Contributions: Jun Dou: conception and design of study, acquisition of data, analysis and interpretation of data, writing-original draft, writing-review and editing; Yan Song: conception and design of study, formal analysis, revising the manuscript critically for important intellectual content, funding acquisition, approval of the version of the manuscript to be published. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 62073223 and the Natural Science Foundation of Shanghai under Grant 22ZR1443400.

Data Availability Statement: The UCI datasets can be downloaded from: (http://archive.ics.uci.edu/) and the KEEL-dataset repository can be downloaded from: (https://sci2s.ugr.es/keel/datasets.php)

Conflicts of Interest: The authors declare no conflict of interest.

References

  1. Wang, L.; Ye, X.; Li, J.L.; et al. GAN-based dual active learning for nosocomial infection detection. IEEE Trans. Network Sci. Eng., 2022, 9: 3282−3291. doi: 10.1109/TNSE.2021.3100322
  2. Lu, P.; Song, B.Y.; Xu, L. Human face recognition based on convolutional neural network and augmented dataset. Syst. Sci. Control Eng., 2021, 9: 29−37. doi: 10.1080/21642583.2020.1836526
  3. Wang, C.; Wang, Z.D.; Ma, L.F.; et al. Subdomain-alignment data augmentation for pipeline fault diagnosis: An adversarial self-attention network. IEEE Trans. Ind. Informat., 2023, in press.
  4. Wang, C.; Wang, Z.D.; Ma, L.F.; et al. A novel contrastive adversarial network for minor-class data augmentation: Applications to pipeline fault diagnosis. Knowledge-Based Syst., 2023, 271: 110516. doi: 10.1016/j.knosys.2023.110516
  5. Yang, D.D.; Lu, J.Y.; Dong, H.L.; et al. Pipeline signal feature extraction method based on multi-feature entropy fusion and local linear embedding. Syst. Sci. Control Eng., 2022, 10: 407−416. doi: 10.1080/21642583.2022.2063202
  6. Sun, J.; Li, H.; Fujita, H.; et al. Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting. Inf. Fus., 2020, 54: 128−144. doi: 10.1016/j.inffus.2019.07.006
  7. Su, Y.F.; Cai, H.; Huang, J. The cooperative output regulation by the distributed observer approach. Int. J. Network Dyn. Intellig., 2022, 1: 20−35. doi: 10.53941/ijndi0101003
  8. Liu, Y.H.; Huang, F.H.; Yang, H. A fair dynamic content store-based congestion control strategy for named data networking. Syst. Sci. Control Eng., 2022, 10: 73−78. doi: 10.1080/21642583.2022.2031335
  9. Dou, J.; Song, Y.; Wei, G.L.; et al. Fuzzy information decomposition incorporated and weighted Relief-F feature selection: When imbalanced data meet incompletion. Inf. Sci., 2022, 584: 417−432. doi: 10.1016/j.ins.2021.10.057
  10. He, H.B.; Bai, Y.; Garcia, E.A.; et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 01–08 June 2008; IEEE: Hong Kong, China, 2008; pp. 1322–1328.
  11. Dou, J.; Wei, G.L.; Song, Y.; et al. Switching triple-weight-SMOTE in empirical feature space for imbalanced and incomplete data. IEEE Trans. Autom. Sci. Eng., 2023, in press.
  12. Hu, J.; Jia, C.Q.; Liu, H.J.; et al. A survey on state estimation of complex dynamical networks. Int. J. Syst. Sci., 2021, 52: 3351−3367. doi: 10.1080/00207721.2021.1995528
  13. Zhang, Q.C.; Zhou, Y.Y. Recent advances in non-Gaussian stochastic systems control theory and its applications. Int. J. Network Dyn. Intellig., 2022, 1: 111−119. doi: 10.53941/ijndi0101010
  14. Chawla, N.V.; Bowyer, K.; Hall, L.O.; et al. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intellig. Res., 2002, 16: 321−357. doi: 10.1613/jair.953
  15. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf. Sci., 2018, 465: 1−20. doi: 10.1016/j.ins.2018.06.056
  16. Barua, S.; Islam, M.; Yao, X.; et al. MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowledge Data Eng., 2014, 26: 405−425. doi: 10.1109/TKDE.2012.232
  17. Dai, F.F.; Song, Y.; Si, W.Y.; et al. Improved CBSO: A distributed fuzzy-based adaptive synthetic oversampling algorithm for imbalanced judicial data. Inf. Sci., 2021, 569: 70−89. doi: 10.1016/j.ins.2021.04.017
  18. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; et al. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, December 2014; MIT Press: Montreal, Canada, 2014; pp. 2672–2680.
  19. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Exp. Syst. Appl., 2018, 91: 464−471. doi: 10.1016/j.eswa.2017.09.030
  20. Gao, X.; Deng, F.; Yue, X.H. Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing, 2020, 396: 487−494. doi: 10.1016/j.neucom.2018.10.109
  21. Wei, G.L.; Mu, W.M.; Song, Y.; et al. An improved and random synthetic minority oversampling technique for imbalanced data. Knowledge-Based Syst., 2022, 248: 108839. doi: 10.1016/j.knosys.2022.108839
  22. Yu, N.X.; Yang, R.; Huang, M.J. Deep common spatial pattern based motor imagery classification with improved objective function. Int. J. Network Dyn. Intellig., 2022, 1: 73−84. doi: 10.53941/ijndi0101007
  23. Dou, J.; Gao, Z.H.; Wei, G.L.; et al. Switching synthesizing-incorporated and cluster-based synthetic oversampling for imbalanced binary classification. Eng. Appl. Artif. Intellig., 2023, 123: 106193. doi: 10.1016/j.engappai.2023.106193
  24. Wang, X.L.; Sun, Y.; Ding, D.R. Adaptive dynamic programming for networked control systems under communication constraints: A survey of trends and techniques. Int. J. Network Dyn. Intellig., 2022, 1: 85−98. doi: 10.53941/ijndi0101008
  25. Shakiba, F.M.; Shojaee, M.; Azizi, S.; et al. Real-time sensing and fault diagnosis for transmission lines. Int. J. Network Dyn. Intellig., 2022, 1: 36−47. doi: 10.53941/ijndi0101004
  26. Barua, S.; Islam, M.M.; Murase, K. A novel synthetic minority oversampling technique for imbalanced data set learning. In 18th International Conference on Neural Information Processing, Shanghai, China, 13–17 November 2011; Springer: Shanghai, China, 2011; pp. 735–744.
  27. Ting, K.M. An instance-weighting method to induce cost-sensitive trees. IEEE Trans. Knowledge Data Eng., 2002, 14: 659−665. doi: 10.1109/TKDE.2002.1000348
  28. Jia, J.; Zhai, L.M.; Ren, W.X.; et al. An effective imbalanced jpeg steganalysis scheme based on adaptive cost-sensitive feature learning. IEEE Trans. Knowledge Data Eng., 2022, 34: 1038−1052. doi: 10.1109/TKDE.2020.2995070
  29. Suykens, J.A.K.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett., 1999, 9: 293−300. doi: 10.1023/A:1018628609742
  30. Wang, Z.R.; Wang, J.; Wang, Y.R. An intelligent diagnosis scheme based on generative adversarial learning deep neural networks and its application to planetary gearbox fault pattern recognition. Neurocomputing, 2018, 310: 213−222. doi: 10.1016/j.neucom.2018.05.024
  31. Guo, Q.W.; Li, Y.B.; Song, Y.; et al. Intelligent fault diagnosis method based on full 1-D convolutional generative adversarial network. IEEE Trans. Ind. Informat., 2020, 16: 2044−2053. doi: 10.1109/TII.2019.2934901
  32. Zhang, H.C.; Zhang, Y.N.; Nasrabadi, N.M.; et al. Joint-structured-sparsity-based classification for multiple-measurement transient acoustic signals. IEEE Trans. Syst. Man Cybernet. Part B Cybernet., 2012, 42: 1586−1598. doi: 10.1109/TSMCB.2012.2196038
  33. Tropp, J.A. Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Process., 2006, 86: 589−602. doi: 10.1016/j.sigpro.2005.05.031
  34. Xu, Z.B.; Chang, X.Y.; Xu, F.M.; et al. L1/2 regularization: A thresholding representation theory and a fast solver. IEEE Trans. Neural Networks Learn. Syst., 2012, 23: 1013−1027. doi: 10.1109/TNNLS.2012.2197412
  35. Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. The mahalanobis distance. Chemometr. Intellig. Lab. Syst., 2000, 50: 1−18. doi: 10.1016/S0169-7439(99)00047-7
  36. Daffertshofer, A.; Lamoth, C.J.C.; Meijer, O.G.; et al. PCA in studying coordination and variability: A tutorial. Clin. Biomech., 2004, 19: 415−428. doi: 10.1016/j.clinbiomech.2004.01.005
  37. Xu, L.; Song, B.Y.; Cao, M.Y. An improved particle swarm optimization algorithm with adaptive weighted delay velocity. Syst. Sci. Control Eng., 2021, 9: 188−197. doi: 10.1080/21642583.2021.1891153
  38. Qu, L.; Zhu, H.S.; Zheng, R.Q.; et al. ImGAGN: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; ACM: Singapore, 2021; pp. 1390–1398.
  39. Lichman, M. UCI machine learning repository. Available online: http://archive.ics.uci.edu/ml (accessed on 2016).
  40. Tao, X.M.; Li, Q.; Guo, W.J.; et al. Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification. Inf. Sci., 2019, 487: 31−56. doi: 10.1016/j.ins.2019.02.062
  41. Mao, J.Y.; Sun, Y.; Yi, X.J.; et al. Recursive filtering of networked nonlinear systems: A survey. Int. J. Syst. Sci., 2021, 52: 1110−1128. doi: 10.1080/00207721.2020.1868615
  42. Ju, Y.M.; Tian, X.; Liu, H.J.; et al. Fault detection of networked dynamical systems: A survey of trends and techniques. Int. J. Syst. Sci., 2021, 52: 3390−3409. doi: 10.1080/00207721.2021.1998722
  43. Zong, W.W.; Huang, G.B.; Chen, Y.Q. Weighted extreme learning machine for imbalance learning. Neurocomputing, 2013, 101: 229−242. doi: 10.1016/j.neucom.2012.08.010
  44. Wen, P.Y.; Li, X.R.; Hou, N.; et al. Distributed recursive fault estimation with binary encoding schemes over sensor networks. Syst. Sci. Control Eng., 2022, 10: 417−427. doi: 10.1080/21642583.2022.2063203
  45. Li, H.; Wu, P.S.; Zeng, N.Y.; et al. A survey on parameter identification, state estimation and data analytics for lateral flow immunoassay: From systems science perspective. Int. J. Syst. Sci., 2022, 53: 3556−3576. doi: 10.1080/00207721.2022.2083262
  46. Freund, J. Boosting a weak learning algorithm by majority. Inf. Comput., 1995, 121: 256−285. doi: 10.1006/inco.1995.1136
  47. Breiman, L. Random forests. Mach. Learn., 2001, 45: 5−32. doi: 10.1023/A:1010933404324
  48. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot., 2013, 7: 21. doi: 10.3389/fnbot.2013.00021
  49. Liu, Z.N.; Cao, W.; Gao, Z.F.; et al. Self-paced ensemble for highly imbalanced massive data classification. In IEEE 36th International Conference on Data Engineering, Dallas, TX, USA, 20–24 April 2020; IEEE: Dallas, TX, USA, 2020; pp. 841–852.
  50. De Winter, J.F.C.; Dodou, D. Five-point likert items: T test versus Mann-Whitney-Wilcoxon. Pract. Assessm. Res. Evaluat., 2010, 15: 1−12. doi: 10.7275/bj1p-ts64