Med One 2016;1(4):3;DOI:10.20900/mo.20160016


Sparse Representation based Genetic Biomarker Evaluation for Congenital Heart Defects

Peng Zhou1, Benjamin H Lehrman2, Hongping Zheng3*

1Department of Biomedical Engineering, Tianjin University, Tianjin 300072, China;

2Rush Medical College, 600 S Paulina St, Chicago, IL 60612, USA.

3Laboratory of Human Carcinogenesis, NCI, NIH, 9000 Rockville Pike, Bethesda, MD 20892, USA

Correspondence: Hongping Zheng. Tel: 301-496-7279;

Published: 8/23/2016 10:45:51 AM


Background Congenital heart defects (CHD) are the most common type of birth defect, affecting approximately 8 out of every 1,000 newborns. Hundreds of genes have been reported as CHD candidate genes. Nevertheless, each patient/patient group may demonstrate unique etiologic characteristics that need personalized treatment.

Methods We proposed a sparse representation based variable selection (SRVS) approach to select disease-related genetic markers from a large pool of disease candidate genes acquired from the ResNet relation database. The proposed approach was applied to evaluate 167 CHD candidate genes, followed by validation on a microarray expression data set. Pathway enrichment analysis (PEA), sub-network enrichment analysis (SNEA) and network connectivity analysis (NCA) were conducted to study the functional profile of the variables selected by SRVS and to compare them with previously reported genetic markers.

Results A significantly high disease prediction accuracy of 81.40% was achieved (permutation p-value < 0.0002) using the top 24 SRVS-selected genes, which were enriched within multiple pathways and sub-networks previously implicated in CHD. In contrast, using the most frequently reported genes out of the 167 CHD candidate genes, the highest accuracy reached was 69.77% (permutation p-value = 0.017). Additionally, enrichment analysis and NCA showed that the top genes selected by the proposed SRVS approach were strongly related to the frequently reported CHD genes, although a functional difference was present.

Conclusion Our study suggests that SRVS is an effective method for data-driven variable selection in CHD. Furthermore, frequently reported CHD candidate genes might not be the best biomarkers for a specific CHD patient/patient group.


Congenital heart defects (CHD) refer to anatomical abnormalities of the heart and large blood vessels that occur during embryonic development [1]. CHD can result from genetic factors or environmental factors; however, a combination of both is typical [2,3]. Recently, an increasing number of articles have reported hundreds of genes/proteins related to CHD, many of which were suggested as candidate genes for the disease. However, every patient/patient group has a unique variation of the human genome that requires treatment based on their predicted response or risk of disease [4].

In recent years, sparse representation has received considerable attention in applications such as signal recovery and identification of significant components [5,6]. However, in applications with a large number of variables and a small number of samples, specific adaptation is required to accomplish the variable selection task. In many biomedical problems (e.g., genomic data, image data), the number of samples is far smaller than the number of variables.

In this study, we proposed a sparse representation based variable selection (SRVS) algorithm that selects significant biomarkers at different detection resolutions and has previously been effective in variable selection with SNP data and fMRI data [7]. Instead of selecting a specific number of variables, this data-driven method ranks all the variables by generating a sparse regression weight for each of them [7].


In this section, we first describe the proposed SRVS algorithm (Section 2.1), then we apply it to a CHD candidate genetic biomarker selection problem (Section 2.2), and finally we study the SRVS selected variables in terms of pathway enrichment analysis (PEA), sub-network enrichment analysis (SNEA) and network connectivity analysis (NCA) (Section 2.3).

2.1 SRVS algorithm
SRVS Algorithm

In Step 3, many methods have been proposed for solving the minimization problem, such as the Homotopy method [8] for P = 1 and the orthogonal matching pursuit (OMP) algorithm [9] for P = 0.
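As an illustration of the P = 0 case, the following is a minimal sketch of orthogonal matching pursuit. It assumes that the minimization in Step 3 takes the usual sparse regression form (find a sparse weight vector x such that A x approximates y); the algorithm box itself is not reproduced here, so this is not the authors' implementation. The function name `omp`, the matrix `A` (samples by variables), the vector `y` (e.g., case/control labels) and the sparsity level `k` are illustrative assumptions.

```python
import numpy as np

def omp(A, y, k, tol=1e-10):
    """Greedy orthogonal matching pursuit: approximate y with at most k columns of A."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    n_vars = A.shape[1]
    # Normalize columns so correlations are comparable across variables.
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0
    An = A / norms
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(min(k, n_vars)):
        corr = np.abs(An.T @ residual)
        if support:
            corr[support] = -np.inf  # never pick the same column twice
        j = int(np.argmax(corr))
        support.append(j)
        # Re-fit all selected columns jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(An[:, support], y, rcond=None)
        residual = y - An[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    x = np.zeros(n_vars)
    x[support] = coef / norms[support]  # undo the column normalization
    return x
```

The nonzero entries of the returned vector play the role of sparse regression weights; variables that are never selected keep a weight of zero.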

2.2 CHD Candidate Genes for Evaluation

A pool of 167 CHD candidate genes was acquired from the CHD-Gene relation data; all of these genes were also included in a gene expression data set (GEO: GSE34457).

The CHD-Gene relation data were acquired from the Pathway Studio (PS) ResNet 11 Mammalian database, updated July 2016. The ResNet® Mammalian database is one part of the PS ResNet Databases, a group of network databases updated in real time that include curated signaling, cellular process and metabolic pathways, ontologies and annotations, as well as molecular interactions and functional relationships extracted from the 35M+ references covering all PubMed abstracts and Elsevier full-text journals. For more information about the PS ResNet Mammalian databases, please refer to

The gene expression profile was acquired from cell lines of 43 Down syndrome patients, of whom 21 had CHD (cases) and 22 did not (controls). The original data include 48,701 probes. All probes with null data entries were removed, resulting in an overlap of 167 genes with the CHD candidate gene pool.

2.3 Validation of the SRVS method

To test the validity of the proposed method, we studied the SRVS-selected genes through four approaches: CHD prediction, PEA, SNEA and NCA. For comparison purposes, we also evaluated the performance of the most frequently referenced CHD candidate genes.

2.3.1 Definition of two scores

We define the reference number underlying a gene-disease relationship as the gene’s reference score (Rscore), and define the SRVS approach generated weights for each gene as the SRVS score (Sscore).

2.3.2 Validation using disease prediction

We hypothesize that a significant CHD candidate gene/gene set should contribute to distinguishing CHD patients from healthy controls. To validate the effectiveness of the selected genes and the proposed SRVS approach, we performed a Euclidean distance-based multivariate classification [7] on the gene expression data set (GEO: GSE34457), followed by a leave-one-out (LOO) cross validation, using the overall gene set and the sub-sets selected by Sscore and Rscore as tentative markers. A permutation test of 5,000 runs was then conducted to test the hypothesis that a randomly selected gene set of the same size can reach an equal or higher classification accuracy (CR).
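The validation procedure can be sketched as follows. The exact classifier of [7] is not described in this section, so the sketch assumes a nearest-centroid rule in Euclidean distance; the function names `loo_accuracy` and `permutation_p_value`, the expression matrix `X` (samples by genes) and the label vector `y` are hypothetical.

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid (Euclidean) classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i  # hold out sample i
        best_label, best_dist = None, np.inf
        for label in np.unique(y[train]):
            centroid = X[train & (y == label)].mean(axis=0)
            dist = np.linalg.norm(X[i] - centroid)
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += int(best_label == y[i])
    return correct / n

def permutation_p_value(X, y, selected, n_runs=5000, seed=0):
    """Share of random gene sets of equal size that match or beat the selected set."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    observed = loo_accuracy(X[:, selected], y)
    hits = 0
    for _ in range(n_runs):
        random_genes = rng.choice(X.shape[1], size=len(selected), replace=False)
        hits += loo_accuracy(X[:, random_genes], y) >= observed
    return observed, hits / n_runs
```

With this estimator the smallest nonzero p-value is 1/n_runs, so 5,000 runs resolve p-values down to 0.0002.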

2.3.3 Enrichment and connectivity analysis

To better understand the underlying functional profile of the genes selected by Sscore and Rscore, we also conducted PEA and SNEA on the top genes selected by the two scores. In addition, we conducted NCA on the subsets of genes using Pathway Studio, which identifies connectivity between given genes/proteins. The weight of an edge is the number of scientific references underlying a reported gene-gene interaction.
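The edge weighting described above can be sketched as a simple count of distinct references per gene pair. This is not Pathway Studio's implementation: the record layout `(gene_a, gene_b, reference_id)`, the function name `build_weighted_network`, and the choice to treat edges as undirected are all assumptions made for illustration.

```python
from collections import defaultdict

def build_weighted_network(relations):
    """Collapse (gene_a, gene_b, reference_id) records into undirected edges
    weighted by the number of distinct supporting references."""
    refs = defaultdict(set)
    for gene_a, gene_b, reference_id in relations:
        if gene_a == gene_b:
            continue  # ignore self-relations
        edge = tuple(sorted((gene_a, gene_b)))  # same key for A-B and B-A
        refs[edge].add(reference_id)
    return {edge: len(ids) for edge, ids in refs.items()}
```

Summing the weights over all edges gives the total number of supporting references for a network, counting a reference once per edge it supports.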


3.1 CHD candidate genes for validation

Analysis of the CHD-Gene relation data revealed a CHD gene pool of 684 genes, supported by 3,237 articles (Supplementary Table S1). Here we evaluated 167 of these CHD candidate genes using the proposed SRVS algorithm with an independent gene expression data set (GEO: GSE34457). Fig. 1 presents these CHD candidate genes. The full list of the 167 genes and related information, including Sscore and Rscore, is provided in Supplementary Table S2.

Fig. 1 The 167 CHD Candidate Genes Analyzed
3.2 Validation in CHD prediction

To evaluate the effectiveness of the SRVS-generated metric, Sscore, a case/control classification and LOO cross validation were conducted on an RNA microarray data set (GEO: GSE34457), followed by a permutation test of 5,000 runs. For comparison purposes, we also tested the Rscore. For the LOO cross validation, we first ranked the 167 genes by each metric score, then used the top n (n = 1, 2, …) genes as input variables for classification and LOO cross validation. Fig. 2 presents the results, with the maximum classification ratios (CRs) marked at the position of the corresponding number of genes.
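The top-n sweep described above can be sketched as follows, assuming that a higher score means a higher rank. The function name `sweep_top_n` and the callable `accuracy_fn` (any function that returns a cross-validated accuracy for a sample-by-gene matrix and its labels) are hypothetical.

```python
import numpy as np

def sweep_top_n(X, y, scores, accuracy_fn):
    """Rank variables by score (highest first) and evaluate every top-n subset."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending; ties keep input order
    accuracies = [accuracy_fn(X[:, order[:n]], y) for n in range(1, len(order) + 1)]
    best_n = int(np.argmax(accuracies)) + 1  # smallest n that reaches the maximum
    return order, accuracies, best_n
```

Running the sweep once per metric score and plotting `accuracies` against n gives curves of the kind summarized in Fig. 2.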

Fig. 2 Comparison of Different Metrics through a LOO Cross Validation (genes ranked in ascending order)
Table 1 LOO cross validation and permutation results

Figure 2 establishes that, compared to the CRs generated by randomly selected gene sets of the same size, the top genes selected by both Sscore and Rscore lead to significantly better classification accuracies. Of note, the highest CRs were acquired using only the top genes selected by each score (see Fig. 2 and Table 1), while adding more genes with lower scores did not necessarily help, suggesting the validity of both Sscore and Rscore. Moreover, we noted that the Sscore led to much higher CRs with lower permutation p-values, demonstrating the effectiveness of the proposed method. We present the top 24 genes by Sscore in Table 2. For comparison purposes, we also provide the top 24 genes by Rscore, and the full lists in Supplementary Table S2.

Table 2 Top 24 Genes with Reported CHD Associations, Ranked by Different Scores
3.3 Compare Top Genes by Sscore and Rscore

To better understand the profile of the genes selected by the SRVS approach, we further compared the two groups of top genes selected by Sscore and Rscore (Table 2) using PEA and NCA.

Analysis identified that the top 24 genes selected by Sscore and the top 24 selected by Rscore shared only two genes, BCOR and G6PC3, as depicted in Fig. 3 (a). Nevertheless, NCA demonstrated that there were 94 relations of different types between 18/24 genes from the Sscore group and 18/24 genes from the Rscore group (Fig. 3 (b)), supported by over 3,000 references (Supplementary Table S3), suggesting a strong relation between the two gene groups.

Fig. 3 Overlap and association between the sub gene sets with the highest Sscore and Rscore. (a) Venn diagram of the top 24 genes by both scores; (b) Gene-gene connections between the top 24 genes by both scores; genes selected by Rscore are highlighted in yellow; genes selected by Sscore in blue.
3.4 Enrichment Analysis

In this section, we present PEA and SNEA results for the different groups listed in Table 3. At the same enrichment p-value threshold (< 0.001), we identified 152 pathways enriched for the top 24 genes by Rscore, while for the Sscore group there were only 17 enriched pathways. We present the top 10 pathways/gene sets by each score in Table 3. The full results are presented in Supplementary Tables S4a and S4b.

Table 3 Pathways/groups enriched by 24 genes with the highest Sscore and Rscore

Table 3 establishes that the genes with the top Sscores and those with the top Rscores were enriched in different groups of pathways (only one overlap: GO: 0007511), with different p-value ranges (Sscore group: 8.93E-05 to 8.54E-04; Rscore group: 3.01E-12 to 1.36E-07), indicating that the top genes selected by SRVS were functionally different from the most frequently reported ones.

Of note, among the 17 pathways/gene sets enriched with the 24 genes by Sscore (p-value ≤ 0.00091, with 16/24 unique genes; see Supplementary Table S4a), there was one gene set related to heart development (GO: 0007511; p-value = 0.00014, overlap: 4) and two gene sets related to transcriptional regulation: positive regulation of DNA-templated transcription (GO: 0045941; p-value = 0.0006, overlap: 5) and positive regulation of transcription from the RNA polymerase II promoter (GO: 0010552; p-value = 0.00091, overlap: 6), both of which were previously implicated in CHD (Clark et al. 2016).
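The text does not state which statistical test produced the enrichment p-values. A common choice for overlap-based enrichment is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), sketched below under that assumption; the function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random,
    without replacement, from a background containing n_pathway pathway genes."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

Under this model, an overlap of 4 to 6 genes out of 24 selected genes yields a small p-value only when the gene set is small relative to the background, which is why both the gene set size and the background size must be reported alongside the overlap.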

In addition to PEA, we also performed SNEA using Pathway Studio with the purpose of identifying the pathogenic significance of the selected genes to other disorders that are possibly related to CHD. The full results are presented in Supplementary Table S5a and S5b. Table 4 shows the top 10 disease-related sub-networks enriched by the two groups of genes.

Table 4 SNEA Results by 24 Genes with the Highest Sscore and Rscore

Table 4 indicates that both groups were enriched in other heart defect related sub-networks, as well as other congenital and genetic mutation related sub-networks. Moreover, we noted an overlap of two sub-networks between the two groups: congenital malformation and leukemia.

3.5 Connectivity analysis

In addition to PEA and SNEA, we performed NCA on the top 24 genes with the highest Rscores and Sscores (Table 2) to generate functional networks. In the Sscore group, 11 of the 24 genes presented 14 direct connections supported by 53 references (Fig. 4 (a)). In contrast, for the Rscore group, there were over 113 relations among 19/24 genes, supported by 1,283 references (Fig. 4 (b)). This observation is consistent with the PEA and SNEA results, suggesting that genes with the highest Sscores were not as functionally related to each other as were the genes from the Rscore group.

Fig. 4 Connectivity networks built by 24 genes from different groups. The networks are generated using Pathway Studio. (a) Sscore group; (b) Rscore group.


Identification of significant biomarkers based on a small number of observations is a fundamental problem in signal processing. This study proposed a sparse representation based genetic marker selection approach and applied it to the evaluation of 167 CHD candidate genes. The genes were identified from a CHD-Gene network relation data set acquired from the ResNet database and were also present in an RNA gene expression data set (GEO: GSE34457). Two metric scores were generated and compared: Sscore from the SRVS analysis and Rscore from the CHD-Gene relation data set analysis.

LOO cross validation demonstrated that using all 167 CHD candidate genes, a classification ratio of only 58.14% was reached, with a permutation p-value of 0.30 (Table 1). However, using the top genes by Sscore and by Rscore, greater CRs were acquired (81.40% and 69.77%, respectively) with more significant permutation p-values (≤ 0.017). This suggested the necessity of variable selection for the candidate CHD genes tested, as well as the efficacy of both Sscore and Rscore.

Furthermore, we noted that the top 24 genes by Sscore led to the greatest CR with the lowest permutation p-value, demonstrating the effectiveness of the proposed SRVS method in variable selection for CHD. To better understand the 24 genes selected by the SRVS method, we compared them with the top 24 genes selected by Rscore. Analysis showed that these two groups share only two genes: BCOR and G6PC3 (Fig. 3 (a)). Their differences were also demonstrated in terms of enriched pathways (Table 3), associated sub-networks (Table 4) and gene-gene interactions (Fig. 4). As the results highlight, even though the well-studied CHD candidate genes were significant to the disease and effective in disease prediction (LOO permutation p-value = 0.017), they were not the best genetic markers for the subjects in the expression data tested (GEO: GSE34457).

Despite the differences between the top genes selected by Sscore and Rscore, we identified that many of the pathways enriched in the Sscore group were previously reported in connection with CHD. Examples include heart development, positive regulation of transcription from the RNA polymerase II promoter, positive regulation of DNA-templated transcription, the canonical Wnt signaling pathway, and histone deacetylase binding [11,12,13]. Furthermore, these genes were also identified as part of the genetic basis of other CHD-related diseases, such as congenital malformation, cardiac hypertrophy, dystrophy, and cancer [14,15]. These results support the biological validity of the top genes selected by the SRVS approach.

In addition to the direct literature support for the association between CHD and the top 24 genes selected by Sscore (Supplementary Table S1), we observed a strong functional association between the top genes of the Sscore and Rscore groups (Fig. 3 (b)), supported by over 3,000 references (Supplementary Table S3). A high Rscore indicates that a gene has strong literature support for its linkage to CHD. Therefore, our observation provides indirect support that the majority of the top genes selected by the SRVS method are functionally relevant to CHD.

Nevertheless, this study has several limitations that warrant future work. Although the algorithm was tested on the 167 CHD candidate genes, there are other genes linked to CHD that were not included in the data set and were therefore not analyzed. More inclusive data sets covering all CHD genes should be used to test the accuracy of the method. Additionally, the method should be tested on other diseases to study its validity.

Altogether, we conclude that CHD is a complex disease whose genetic causes are linked to a network composed of a large group of genes. Each patient/patient group may present with unique genomic variations that require treatment based on their specific disease risk prediction, where our proposed SRVS method can be employed as an effective tool.


The authors declare no conflicts of interest.
















