Med One 2016;1(4):6;DOI:10.20900/mo.20160019


Systematical Evaluation of Candidate Genetic Marks for Leukemia

Xixi Xiang1, Parker Foster2, Xi Zhang1, Xiangning Bu3*

1Department of Hematology, Xinqiao Hospital, Third Military Medical University, Chongqing, 400037, China;

2National Institute of Biomedical Imaging and Bioengineering (NIBIB), NIH, Bethesda, 20892, USA;

3National Heart, Lung, and Blood Institute (NHLBI), NIH, Bethesda, 20814, USA.

Correspondence:Xiangning Bu. Email:;

Published: 8/23/2016 10:45:51 AM


Background: Although leukemia is the most common type of cancer in children, its exact cause is unknown. In recent years, an increasing number of genetic studies have reported over a thousand genes linked to the disease.

Methods: We conducted a systematic evaluation of 1,093 leukemia candidate genes, which were identified from leukemia-gene relation data extracted from the ResNet 11 Mammalian database with the support of 6,524 references. Four network metrics were proposed to evaluate the potential relevance of each individual gene to leukemia. Additionally, gene set enrichment analysis, sub-network enrichment analysis, and network connectivity analysis were conducted to study the attributes of these genes. Lastly, an expression dataset of 71 leukemia patients and 76 healthy controls was employed for validation.

Results: 952 out of 1,093 genes enriched 100 pathways (p < 3.3e-20), demonstrating strong gene-gene interaction. Network metrics analysis revealed 5 genes (TP53, CTNNB1, AKT1, TNF, and RARA) as the top candidates for leukemia in terms of both functional diversity and replication frequency. Validation using the expression data showed that the 1,093 genes as a whole, as well as the top genes selected by each of the proposed metrics, were efficient in distinguishing leukemia patients from controls (maximum classification ratio = 95.3 % with permutation p-value = 0.0054).

Conclusion: The genetic causes of leukemia are linked to a genetic network composed of a large group of genes. The genetic network, together with the network metrics provided in this study, could be used as the groundwork for further molecular studies in the field.


Leukemia is a group of cancers that usually begin in the bone marrow and result in high numbers of abnormal white blood cells. It is the most common type of cancer in children; however, about 90 % of all leukemia cases are diagnosed in adults [1]. Although the exact cause of leukemia remains unknown, both inherited and environmental factors are thought to be involved [2].

An increasing number of articles have reported over a thousand genes related to leukemia, many of which were suggested as potential biomarkers for the disease, such as FLT3, WT1, TET2, and KRAS [3-5]. Additionally, several genes (e.g., IL2 and CSF3) have been studied in clinical trials [6,7]. Other articles have reported genetic changes and quantitative changes of genes in leukemia [8,9], with both increased and decreased gene expression levels/activities observed [10-12]. Notably, many genes were reported to affect the pathogenic development of leukemia through an unknown mechanism [13].

Nevertheless, no systematic analysis has evaluated the quality and strength of these reported genes as a functional network/group to study the underlying biological processes of leukemia. In this study, instead of focusing on a specific gene, we attempt to provide a comprehensive view of the genetic map, and use gene set enrichment analysis (GSEA) as well as sub-network enrichment analysis (SNEA) to study the underlying functional profile of the genes identified [14]. We hypothesized that the leukemia genes were functionally linked to each other, co-regulating the pathogenic development of the disease through multiple pathways.


The workflow of the study was as follows: 1) Acquisition of the leukemia-gene relation data set and identification of leukemia candidate genes; 2) Enrichment analysis of the identified genes to study their pathogenic significance to leukemia; 3) Network metrics analysis to identify genes with specific significance; 4) Network connectivity analysis (NCA) to test the functional association between these reported genes; 5) Validation using an independent gene expression data set.

2.1 Acquisition of Leukemia-Gene relation Data Set

The leukemia-gene relation data were extracted from the Pathway Studio ResNet® Mammalian database, updated May 2016. The genes identified were used as the candidate network nodes/genes. The ResNet® Mammalian database is one part of the Pathway Studio ResNet databases, a group of continuously updated network databases that include curated signaling, cellular process, and metabolic pathways; ontologies and annotations; and molecular interactions and functional relationships extracted from more than 35 million references covering all PubMed abstracts and Elsevier full-text journals. Updated weekly, the ResNet® Mammalian database contains information on over 6,500,000 functional relationships for human, rat, and mouse, each linked to its original literature sources. Entities in the database include: 1) 142,270 proteins; 2) 106,732 small molecules; 3) 8,863 cell processes; 4) 15,911 diseases; 5) 5,038 functional classes; 6) 4,387 clinical parameters; 7) 1,983 pathways; 8) 559 complexes; and 9) 767 cells. For more information about the ResNet databases, refer to

2.2 Literature metrics analysis

For literature metrics analysis, we proposed 2 scores for each gene-disease relationship. We define the reference number underlying a gene-disease relationship as the gene's reference score (RScore) as Eq. (1).

RScore = The number of references underlying a relationship (1)

We define the earliest publication age of a gene-disease relationship as the gene's age score (AScore), as in Eq. (2).

$\mathrm{AScore} = \max_{1 \le i \le n} \mathrm{ArticlePubAge}_i$ (2)

where n is the total number of references supporting a gene-disease relation, and

ArticlePubAge = Current date - Publication date + 1 (3)
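The two literature metrics can be illustrated with a minimal Python sketch. It assumes publication dates are reduced to calendar years, so ages are counted in whole years; the function names and the example years are hypothetical and not taken from the study.

```python
def article_pub_age(pub_year, current_year):
    """Eq. (3): age of one article, counted in years (the unit is an assumption)."""
    return current_year - pub_year + 1


def literature_scores(pub_years, current_year):
    """Return (RScore, AScore) for one gene-disease relationship."""
    if not pub_years:
        raise ValueError("a relationship needs at least one supporting reference")
    rscore = len(pub_years)  # Eq. (1): number of supporting references
    # Eq. (2): age of the oldest supporting reference
    ascore = max(article_pub_age(y, current_year) for y in pub_years)
    return rscore, ascore


# Hypothetical gene supported by three references, evaluated in 2016.
print(literature_scores([2009, 2013, 2016], current_year=2016))  # (3, 8)
```

A gene first reported in the evaluation year therefore has an AScore of 1, the smallest possible value.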

2.3 Enrichment metric analysis

Given a disease associated with a set of genetic pathways, we defined the gene-wise enrichment score (EScore) for each gene within a gene set as Eq. (4).

EScore = [formula not recoverable from the source text] (4)

where the quantities entering Eq. (4) are the enrichment score of each pathway with the gene set and the number of pathways including the gene; we defined the latter as the PScore for the gene:

PScore = The number of disease-related pathways including the gene (5)

Note that PScore represents how many disease-related pathways are associated with a gene, while EScore reflects the significance of the pathways involved.
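As a rough illustration of the PScore, the sketch below counts, for each gene, the enriched pathways that contain it. The pathway names, gene labels, and memberships are hypothetical placeholders. Only the PScore is sketched here; the EScore additionally takes the enrichment significance of each pathway into account.

```python
def pscore(gene, enriched_pathways):
    """Count the enriched pathways whose member set contains the gene."""
    return sum(1 for members in enriched_pathways.values() if gene in members)


# Hypothetical pathway memberships, for illustration only.
pathways = {
    "pathway_A": {"G1", "G2", "G3"},
    "pathway_B": {"G1", "G4"},
    "pathway_C": {"G2"},
}

scores = {g: pscore(g, pathways) for g in ["G1", "G2", "G5"]}
print(scores)  # {'G1': 2, 'G2': 2, 'G5': 0}
```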

2.4 Enrichment analysis

To better understand the underlying functional profile and the pathogenic significance of the reported genes, GSEA and SNEA [15] were performed on 3 groups: the whole gene list (1,093 genes) and 2 subgroups selected using the highest metric scores. In addition, NCA was conducted on the latter two groups.

2.5 Validation using gene expression data

We hypothesized that significant leukemia candidate gene sets should contribute to distinguishing leukemia patients from healthy controls. To validate the effectiveness of the selected genes and the proposed metrics, we performed a Euclidean distance-based multivariate classification [16] on an expression dataset, followed by a leave-one-out (LOO) cross validation, using the overall gene set and the subsets selected by different scores as tentative markers. A permutation of 5,000 runs was then conducted to test the null hypothesis that a randomly selected gene set reaches the same classification accuracy.

The expression data covered 147 subjects, comprising samples from 71 chronic lymphocytic leukemia (CLL) tumors and 76 sorted CD19pos B cells from healthy donors (NCBI GEO: GSE50006). Of the genes measured, 1,031 overlapped with the candidate leukemia gene pool identified within the leukemia-gene dataset.
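The classification and LOO steps can be sketched as follows. The text specifies only a Euclidean distance-based classifier, so the nearest-centroid rule used here is an assumption, as are the function names and the toy expression values.

```python
import math


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def loo_classification_ratio(samples, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed rule)."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Build class centroids from every sample except the held-out one.
        by_class = {}
        for j, (s, lab) in enumerate(zip(samples, labels)):
            if j != i:
                by_class.setdefault(lab, []).append(s)
        centroids = {lab: centroid(rows) for lab, rows in by_class.items()}
        # Assign the held-out sample to the class with the closest centroid.
        predicted = min(centroids, key=lambda lab: euclidean(x, centroids[lab]))
        correct += predicted == y
    return correct / len(samples)


# Toy data: 2 hypothetical genes, 3 "patient" and 3 "control" samples.
samples = [[5.0, 5.1], [4.8, 5.3], [5.2, 4.9], [1.0, 1.2], [0.9, 1.1], [1.1, 0.8]]
labels = ["patient"] * 3 + ["control"] * 3
print(loo_classification_ratio(samples, labels))  # 1.0
```

Each column of `samples` stands for one gene, so evaluating a gene subset amounts to selecting the matching columns before calling the function.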


3.1 Identification of candidate genes

From the leukemia-gene relation data set, we identified 1,093 leukemia candidate genes, supported by 6,524 articles (see Supplementary Material 1). Of these genes, 994 (90.94 %) presented a Regulation relationship to the disease, 133 (12.17 %) Genetic Change, 61 (5.58 %) Quantitative Change, 52 (4.76 %) Cell Expression, 20 (1.83 %) Biomarker, 17 (1.56 %) Clinical Trial, and 5 (0.46 %) State Change. Notably, 148 (13.54 %) genes have been reported to have multiple relationships with the disease. Specifically, 945 (86.46 %) genes presented 1 type of relationship to the disease, 113 (10.34 %) presented 2, 31 (2.84 %) presented 3, 3 (0.27 %) presented 4, and 1 (0.09 %) presented 5. For detailed definition and description of these relation types mentioned above, please refer to the ‘Relations: Definitions and Annotations’ section at To note, genes prefixed with ‘m_*’ and ‘r_*’ represent genes identified in mouse and rat, respectively.

Fig. 1 Gene-wise Relation Type Distribution of 1,093 Genes

The publication date distribution of these 6,524 articles is presented in Fig. 2 (a), with novel genes reported in each year. Notably, these articles have an average publication age of only 6.0 years, indicating that most of the articles were published in recent years. In addition, our analysis showed that the publication date distributions of the articles underlying each of the 1,093 genes were mostly similar, as shown in Fig. 2 (b).

Fig. 2 Histogram of Publications Reporting Gene-disease Relationships between Leukemia and 1,093 Genes. (a) Number of article publications by year; (b) Gene-wise publication date distribution of the supporting references, with mean marked as red star.
3.2 Marker ranking

Among these 1,093 genes, 31 were first reported within this year (Jan. to Apr. 2016); these are listed in Table 1. For comparison purposes, Table 1 also lists the top 31 genes with the highest RScores (in descending order). Full results are provided in Supplementary Material 1.

Table 1 Top 31 Genes with Reported Associations with Leukemia Ranked by Different Scores
3.3 Enrichment analysis

In this section, we present the GSEA and SNEA results for 3 different groups: All 1,093 genes, and the 2 gene groups listed in Table 1.

3.3.1 Enrichment analysis on all 1,093 genes

In Table 2, we present the top 20 pathways/groups enriched with 857/1,093 genes (p-values < 3.7e-41). The full list of 100 pathways/gene sets enriched with 952/1,093 genes (p-value < 3.3e-20) has been listed in Supplementary Material 2.

Among these 100 pathways/gene sets enriched, we identified 6 pathways/gene sets that were related to cell apoptosis (with 345/1,093 genes), 9 to cell growth and proliferation (366/1,093 genes), 6 to protein phosphorylation (201/1,093 genes), 3 to immune system (319/1,093 genes), 11 to transcription factors (449/1,093 genes), 8 to protein kinase (234/1,093 genes), and 2 to neuronal system (257/1,093 genes).

Table 2 Molecular Function Pathways/Groups Enriched by 1,093 Genes Reported

Note: The Jaccard similarity (Js) is a statistic used for comparing the similarity and diversity of sample sets, defined by $J_s(A, B) = |A \cap B| / |A \cup B|$, where A and B are two sample sets.
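The Jaccard similarity reported in the enrichment tables is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows; the gene labels are hypothetical, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two hypothetical gene sets sharing 2 of 4 distinct genes.
print(jaccard_similarity({"G1", "G2", "G3"}, {"G2", "G3", "G4"}))  # 0.5
```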

Besides GSEA, we also performed an SNEA using Pathway Studio for the purpose of identifying the pathogenic significance of the reported genes to other disorders that are potentially related to leukemia. In Table 3, we present the top 10 disease-related sub-networks enriched with a p-value < 5E-254. The full list of results is provided in Supplementary Material 3.

Table 3 Sub-networks Enriched by the 1,093 Genes Reported

From Table 3, many of these reported leukemia related genes were also identified in other cancers, with a large percentage of overlap (Jaccard similarity 0.18).

3.3.2 Enrichment analysis on top 31 genes with highest scores

Here we compared the top 31 genes listed in Table 1 in terms of GSEA and SNEA results. The top 10 pathways/sub-networks for the AScore group and the RScore group are presented (Table 4 and Table 5). A complete report is located in Supplementary Material 2 and 3.

Using the same enrichment p-value threshold (p < 1E-4), we identified 10 pathways/gene sets that were enriched with the 31 genes with top AScores, while the number for RScore group is 153. Table 4 presents the top 10 pathways enriched with the 31 genes from AScore and RScore groups, respectively. The full lists of these pathways/gene sets are provided in Supplementary Material 2.

Table 4 Pathways/Groups Enriched by 31 Genes with the Highest AScore and RScore

From Table 4, the genes with the top AScores and those with the top RScores enriched different groups of pathways, with different p-value ranges (AScore group: 7.62E-07 - 9.23E-05; RScore group: 1.12E-17 - 2.71E-10), indicating that the newly reported genes were both functionally distinct from, and enriched with less significance than, those most frequently reported.

Moreover, we observed that 4 out of the 10 pathways/gene sets enriched by the RScore group (Table 4) were observed in Table 2, which lists the top 20 pathways/gene set enriched with 857/1,093 genes, while the number for AScore group is 0.

For the SNEA analysis, we only performed an enrichment analysis against disease sub-networks. Table 5 presents the top 10 disease related sub-networks enriched by the top 31 genes from AScore group and RScore group, respectively. We provided the full list of results in Supplementary Material 3.

Table 5 SNEA Results by 31 Genes with the Highest AScore and RScore

From Table 5, both groups enriched other cancer-related sub-networks. However, the enrichment p-values for the RScore group were much more significant than those for the AScore group (AScore group: 4.90E-05 - 2.72E-04; RScore group: 1.41E-52 - 9.32E-41), and the Jaccard similarities were higher.

3.4 Connectivity analysis

In addition to GSEA and SNEA, an NCA was performed on the top 31 genes with the highest RScores and AScores (from Table 1) to generate gene-gene interaction networks. Results showed that, for the RScore group, there were 441 connections among all 31 genes, supported by numerous references. In contrast, genes within the AScore group demonstrated only 15 relations among 19/31 genes, with the remaining 12 genes showing no direct relation with other genes in the group (Fig. 3 (b); highlighted in green). This observation was consistent with the GSEA and SNEA results, suggesting that genes with the smallest AScores were not as functionally close to each other as those from the RScore group.

Fig. 3 Connectivity Networks built by 31 Genes from Different Groups. The networks were generated using Pathway Studio. The un-related genes are highlighted in green.
3.5 EScore analysis

Through GSEA, two biological metrics, EScore and PScore were generated for each gene. The value of a PScore represents how many leukemia associated pathways involve the gene, and EScore shows how significant these pathways are.

To compare the EScore and PScore with the two literature metrics, we conducted a correlation analysis using averaged metric values of all 1,093 genes at a group level. We used a group size of 36 genes; that is, we first sorted the 1,093 genes by RScore, then averaged each type of metric value using a moving window of length 36. Results showed that the average scores were strongly correlated, especially for the top ones, as shown in Fig. 4 (a) and Table 6. Notably, group-wise PScore and EScore were very strongly correlated (ρ=0.99).
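The group-level comparison can be sketched in Python as below. The sort direction (descending RScore) and a window step of one gene are assumptions, since the text gives only the window length; the function names and toy values are hypothetical.

```python
import math


def moving_average(values, window):
    """Means of all consecutive windows of the given length (step of 1)."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one
        means.append(total / window)
    return means


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined when either sequence is constant


def groupwise_correlation(rscores, other, window=36):
    """Sort genes by RScore (descending), smooth both metrics, correlate."""
    order = sorted(range(len(rscores)), key=lambda i: rscores[i], reverse=True)
    r_avg = moving_average([rscores[i] for i in order], window)
    o_avg = moving_average([other[i] for i in order], window)
    return pearson(r_avg, o_avg)


# Toy check: a metric that is a linear function of RScore correlates perfectly.
toy_r = list(range(1, 101))
toy_other = [2 * r + 1 for r in toy_r]
print(round(groupwise_correlation(toy_r, toy_other), 6))  # 1.0
```

Averaging over a window smooths gene-level noise, which is one reason group-level correlations can be much higher than gene-level ones.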

Fig. 4 Comparison of Different Metrics Ranking the 1,093 Genes. (a) Comparison of average metrics values with gene set size of 36; (b) A Venn diagram of top 31 genes selected by different metrics.
Table 6 Pearson Correlation Coefficients between Different Metrics

In addition to the group-wise correlations analysis, we also performed a cross-analysis of the top 31 genes selected using different scores, and present a Venn diagram in Fig.4 (b) (Oliveros, 2007-2015).

There was a strong overlap between the PScore group and the EScore group (28/31). These 28 genes were related to the largest numbers of significantly enriched pathways. Specifically, 5 genes were shared by the EScore, PScore, and RScore groups: TP53, CTNNB1, AKT1, TNF, and RARA (RScore: 58.80 ± 11.17 references; PScore: 34.00 ± 2.55 pathways) (see Fig. 4 (b)). Additionally, 23 genes were observed in both the PScore group and the EScore group, but not in the RScore group: RELA, JUN, EGFR, SRC, JAK2, HIF1A, TGFB1, HDAC1, STAT3, IL1B, PTK2, FYN, LCK, LYN, MAPK3, IL6, MAPK1, EP300, CREB1, ERBB2, PDGFRB, GSK3B, and KDR. These genes played roles within multiple pathways significantly enriched for leukemia (32.04 ± 4.28 pathways). Although they were old (AScore: 12.48 ± 6.71 years) and were not frequently replicated (RScore: 10.04 ± 8.94 references), our results suggest that they are worthy of further study.

3.6 Validation using expression data

We hypothesized that significant leukemia candidate gene sets should contribute to distinguishing leukemia patients from healthy controls. Therefore, if our selected gene set (1,093 genes) and the top genes selected by the proposed metric scores are significant to the pathogenesis of leukemia, they should lead to significantly higher classification accuracy compared with randomly selected gene sets. To test whether our 1,093-gene pool and the 4 proposed metrics are effective, classification and leave-one-out (LOO) cross validation were conducted on a gene expression dataset (NCBI GEO: GSE50006), followed by a permutation test of 10,000 runs.
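One way to read the permutation test is sketched below: repeatedly draw random gene sets of the same size and count how often they score at least as well as the observed gene set. The function name, the `score_fn` callback, and the plain hits/runs estimate are assumptions; the study's exact procedure is not given in the text.

```python
import random


def permutation_p_value(observed, score_fn, gene_pool, set_size, runs=1000, seed=0):
    """Fraction of random gene sets scoring at least as well as `observed`.

    `score_fn` is a hypothetical callback mapping a list of genes to a
    classification ratio; `gene_pool` must be a sequence (e.g., a list).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(runs):
        random_set = rng.sample(gene_pool, set_size)
        if score_fn(random_set) >= observed:
            hits += 1
    return hits / runs


# Toy check with a constant scorer: every random set scores exactly 0.5.
pool = [f"G{i}" for i in range(100)]
constant_score = lambda genes: 0.5
print(permutation_p_value(0.9, constant_score, pool, set_size=10))  # 0.0
print(permutation_p_value(0.5, constant_score, pool, set_size=10))  # 1.0
```

In practice `score_fn` would wrap the LOO classification on the expression columns of the sampled genes, and `runs` would be set to the number of permutations used.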

We first ranked the 1,093 genes by each metric score, then used the top k (k = 1, 2, …) genes as input variables for classification and LOO cross validation. Fig. 5 presents the LOO results using different numbers of genes, with the maximum classification ratios (maxCRs) marked at the corresponding number of genes; these are also presented in Table 7.
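The sweep over the number of top-ranked genes can be sketched as a loop over prefixes of the ranked list that keeps the prefix with the highest classification ratio. The function name, the `classification_ratio` callback, and the toy values are hypothetical; ties are resolved in favor of the smaller gene set, which is a choice made here.

```python
def best_top_k(ranked_genes, classification_ratio):
    """Evaluate every top-k prefix of a ranked gene list; return (best_k, max_cr)."""
    best_k, max_cr = 0, float("-inf")
    for k in range(1, len(ranked_genes) + 1):
        cr = classification_ratio(ranked_genes[:k])
        if cr > max_cr:  # strict '>' keeps the smallest k on ties
            best_k, max_cr = k, cr
    return best_k, max_cr


# Toy scorer: the ratio depends only on how many genes are used.
toy_cr = {1: 0.7, 2: 0.9, 3: 0.9, 4: 0.8}
ranked = ["G1", "G2", "G3", "G4"]
print(best_top_k(ranked, lambda genes: toy_cr[len(genes)]))  # (2, 0.9)
```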

Fig. 5 Comparison of Different Metrics through a LOO Cross Validation.

From Fig. 5 we see that the top genes selected by different scores can lead to the highest classification accuracies, while adding more variables/genes with lower scores does not necessarily help, demonstrating the effectiveness of the proposed metrics. The four groups (RScore, AScore, PScore, and EScore) reached their highest CRs of 94.6 %, 95.3 %, 92.1 %, and 92.1 %, respectively, with a relatively small number of genes, and the permutation p-values of all the groups were below the 0.05 threshold. Interestingly, the top 33 genes by AScore led to the highest CR of 95.3 %, with a permutation p-value of 0.0054. Additionally, when all matched 1,031/1,093 genes were employed, a CR of 92.1 % was reached with a permutation p-value of 0.037, suggesting that the majority of the 1,093 genes were effective for leukemia prediction. Table 7 summarizes the results of the LOO cross validation and permutation approaches on different gene sets.

Table 7 Results of LOO Cross Validation and Permutation on Different Gene Sets


This study proposed 4 network metrics to evaluate the 1,093 candidate genes within a genetic network for leukemia, and employed an independent gene expression data set to validate their effectiveness. Furthermore, GSEA, SNEA, and NCA were used to study the pathogenic significance of these candidate genes to the disease.

We noticed that the 1,093 genes identified were not equal in terms of publication frequency (RScore), novelty (AScore), or functional diversity (EScore). Using the proposed quality metric scores, one is able to rank the genes according to different needs/significance and pick the top ones for further analysis (see Supplementary Material 1). Specifically, we observed that some frequently replicated genes (with high RScore) also demonstrate high EScore and PScore, such as TP53, CTNNB1, AKT1, TNF, and RARA (see Fig. 4 (b)). These genes have an average support of 58.80 ± 11.17 references, and were connected to multiple significantly enriched pathways (34.00 ± 2.55 pathways). The results suggest that these genes likely possess biological significance for the disease.

Additionally, there were 23 genes observed in both the PScore group and the EScore group (Fig. 4 (b)), but not in the RScore group. Although they were old (AScore: 12.48 ± 6.71 years) and were not frequently replicated (10.04 ± 8.94 references), our results suggested that they were worthy of further study. For example, the gene RELA, although first reported 4 years ago and so far supported by only 2 references for its relation with leukemia, is linked to 38 significantly enriched pathways, many of which have been implicated in leukemia or cancers in general, such as: positive regulation of NF-kappa B transcription factor activity (GO: 0051092); aging (GO: 0016280); liver development (GO: 0001889); negative regulation of apoptotic process (GO: 0006916); positive regulation of cell proliferation (GO: 0008284); transcription factor complex (GO: 0005667); innate immune response (GO: 0002226); protein kinase binding (GO: 0019901) [17-26]. This observation suggests that these genes may play significant roles in the pathogenic development of leukemia.

Moreover, our results demonstrate that most genes identified by this study were included in pathways previously implicated in leukemia, including 6 cell apoptosis pathways, 9 cell growth and proliferation pathways, 11 transcription factor pathways, 7 protein phosphorylation related pathways, 3 immune system pathways, 8 protein kinase related pathways, and 2 neuronal system pathways [21-27]. We hypothesize that the majority of these literature-reported genes, especially the ones identified from significantly enriched pathways, should be functionally linked to leukemia. Although there may be false positives among the individual published studies, it is less likely that a large group of genes were falsely perturbed [14].

When the members of a gene set exhibit strong cross-correlation, GSEA can boost the signal-to-noise ratio and make it possible to detect modest changes in individual genes [14]. The NCA showed that many of the frequently reported genes relating to leukemia are functionally associated with one another (Fig. 3), supported by hundreds of scientific reports. Furthermore, we note that 952/1,093 genes were included in the top 100 pathways enriched (p-value < 3.3e-020), and 857/1,093 in the top 20 pathways listed in Table 2 (p-value < 3.7e-041). If we define two genes as functionally related when they co-exist within the same genetic pathway, then around 87.1 % (952/1,093) of the 1,093 genes were functionally related. The results indicate that these functionally linked genes are more likely to be true discoveries than noise (false positives): it is less likely that a whole group of functionally related genes was falsely identified than that a single gene was.

In addition to GSEA, we performed a Sub-Network Enrichment Analysis (SNEA), which provides high levels of confidence when interpreting experimentally-derived genetic data against the background of previously published results (Pathway Studio Web Help). SNEA results demonstrated that many of the 1,093 genes (> 90 %) were also identified as causal genes for other health disorders (e.g., breast cancer, hepatocellular carcinoma, and lung cancer) that are in strong association with leukemia [28-30].

Through the LOO cross validation and permutation process using the gene expression data set (NCBI GEO: GSE50006), several significant gene combinations were identified by different scores, generating the highest CRs. Permutation results showed that the top genes selected by those four scores, as well as the 1,031/1,093 genes, were effective in predicting leukemia (p-value < 0.05), indicating the effectiveness of the proposed metric scores. In particular, the top 33 genes selected by AScore reached the highest CR of 95.3 % with a permutation p-value of 0.0054, suggesting that the genes identified during the earliest stage of leukemia genetic studies play a significant role in leukemia prediction.

Nevertheless, this study has several limitations that should be considered in future work. The 1,093 genes were identified from the leukemia-gene relation data extracted from the Pathway Studio ResNet database. Although these data are supported by 6,524 articles, it is still possible that some leukemia-gene relations were not covered. Additionally, although the 4 proposed metrics were shown to be effective in selecting top genes for leukemia prediction, further network analysis with more experimental data may extract additional helpful features for identifying genes that are biologically significant to the disease.


We conclude that leukemia is a complex disease whose genetic causes are linked to a network composed of a large group of genes. Integrating network gene-disease relation data and experimental data, together with GSEA, SNEA, and NCA, could serve as an effective approach to finding these potential target genes. This study provided a landscape overview, with metrics, of current genetic research on leukemia, which could be used as groundwork for further biological/genetic studies in the area.

Declaration of interests

The authors declare no conflicts of interest.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.































