Med One 2016;1(2):1; DOI:10.20900/mo.20160006

Article

Literature data mining based enrichment analysis on 1,925 genes for lung cancer

Xinming Dong1, McKenzie Ritter2, Hongbao Cao3*, DeXiang Yang4*

1Tianjin sanatorium, Tianjin, 300191, China;

2Unit on Statistical Genomics, National Institute of Mental Health, NIH, Bethesda, 20852, USA;

3Elsevier Inc., Biology Prod Research, Rockville, MD 20852, USA;

4Department of Respiratory, the People's Hospital of Tongling, Anhui, 244000, China

Correspondence: Dr. Yang, Department of Respiratory, the People's Hospital of Tongling, Anhui, 244000, China. Email: drr11154@rjh.com.cn.
or Dr. Cao, Elsevier Inc., Biology Prod Research, Rockville, MD 20852, USA. Email: h.cao@elsevier.com

Published: 4/25/2016 8:36:57 PM

ABSTRACT:

Background: Approximately 8% of lung cancer is due to inherited factors, and the risk is more than doubled in those with relatives who have lung cancer. To date, numerous genetic studies have reported a large group of genes related to lung cancer. However, most of these studies focus on the separate activities of individual genes that influence the development of the disease.

Methods: We conducted a literature data mining (LDM) of 17,884 articles covering publications from 1978 to Feb. 2016. These articles reported multiple types of marker-disease associations between 1,925 genes and lung cancer. We then conducted a gene set enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) to study the functional profile and validate the pathogenic significance of these genes to lung cancer. Last, we performed a network connectivity analysis (NCA) to study the associations between the reported genes.

Results: The reported genes demonstrate multiple types of association with lung cancer. Results from the enrichment analysis confirm the reports and suggest that these genes play significant roles in the pathogenesis of lung cancer, as well as in the pathogenesis of other lung cancer related disorders. Moreover, NCA results demonstrate that these genes, especially the ones with high RScores, present strong functional associations with each other.

Conclusion: Our results suggest that the genetic causes of lung cancer are linked to a network composed of a large group of genes. LDM together with enrichment and network analysis could serve as an effective approach in finding these potential target genes.

Keywords:

1. INTRODUCTION

Lung cancer is a malignant tumor, characterized by uncontrolled cell growth in tissues of the lung. About 10–15% of the cases occur in people who have never smoked (Thun et al., 2008). These cases are often caused by a combination of genetic factors and exposure to radon gas, asbestos, second-hand smoke, or other forms of air pollution. It is estimated that inherited factors alone account for about 8% of lung cancer cases (O'Reilly et al., 2007). Moreover, it is believed that the genetic causes of the disease are a combination of multiple genes and polymorphisms on chromosomes 5, 6 and 15, which are known to affect the risk of lung cancer (Larsen et al., 2011).

Within the last several years, an increasing number of articles have reported nearly 2,000 genes/proteins related to lung cancer, many of which are suggested as biomarkers for the disease. However, most of these publications studied the activities of individual genes/proteins. Based on how the gene-lung cancer relations were reported, these articles can generally be classified into six categories: 1) Biomarker; 2) Clinical Trial; 3) Genetic Change; 4) Quantitative Change; 5) Regulation; 6) State Change.

Biomarkers refer to the identification of proteins/genes that are prognostic or diagnostic of the disease. A relatively small number of articles suggested that the genes reported in their study could serve as biomarkers for the disease (Hara et al., 2001; Liu et al., 2010; Xiao et al., 2013; Natukula et al., 2013; Wang et al., 2014; Wang et al., 2014). Nevertheless, the observations have not been consistent (Padda et al., 2014). For example, Groen et al. concluded that PTGS2 expression in patients with advanced non–small-cell lung cancer was not a prognostic or predictive marker for treatment with celecoxib (Groen et al., 2011).

For multiple reasons, such as expense and ethical issues, relatively few clinical trials have been conducted to study the relationship between these genes and lung cancer (Antonia et al., 2006; Wislez et al., 2014). In contrast, many studies reported a genetic change of these genes in lung cancer (Do et al., 2015; Yoo et al., 2015; Liu et al., 2014; Jiang et al., 2014), in both independent studies and meta-analyses (Guo et al., 2012; Yin et al., 2014; Li et al., 2013; Song et al., 2014; Meric-Bernstam et al., 2015). However, mutations in these genes demonstrated sub-group sensitivities (Sasaki et al., 2009; Remon et al., 2014; Lopez-Chavez et al., 2015; Costa et al., 2015). For example, Sasaki et al. found EGFR mutations in only 63 of 575 lung cancer patients (Sasaki et al., 2006). This limits the use of these genes as biomarkers for the diagnosis and treatment of the disease.

Quantitative change refers to changes in the abundance/activity/expression of a gene/protein in a disease state. Most reports of this type of relationship came from gene expression studies, where many genes were observed to show increased activity/expression levels in lung cancer, including EGFR, CYP1A1, ALK, ROS1, ERBB2, MET, KEAP1, VEGF, PTGS2, TERT and TP53 (Yamashita et al., 2013; Ohno et al., 2011; Narayanan et al., 2013; Liang et al., 2012; Sadiq et al., 2013; Wang et al., 2008; Cathcart et al., 2014; Lee et al., 2013; Aras et al., 2013). In contrast, some genes were found to have decreased activity, such as GSTM1, GSTT1, ERCC1, and KRAS (Alemany et al., 1996; Schreiner et al., 2011; Ozcan et al., 2012). Moreover, similar to genetic changes, the observed quantitative changes vary from case to case among patients with lung cancer (He et al., 2001; Alavanja et al., 2002).

Regulation refers to changing the activity of the target by an unknown mechanism. Reports of this type of relationship are generally equivocal about the mechanism of the association (Zhou et al., 2013; Roskoski et al., 2014; Wang et al., 2014; Schildhaus et al., 2015; Ramalingam et al., 2015). However, some of these studies did suggest mechanisms linking the genes to lung cancer (Hanada et al., 2012; Malhotra et al., 2014; Cheng et al., 2014; Gruber et al., 2015). In addition, several of the aforementioned studies revealed functional correlations between different genes and genetic factors (Schumacker, 2015; Ting et al., 2015).

State change refers to changes in a protein/gene post-translational modification status or alternative splicing events associated with a disease. There were only a few papers reporting state changes of genes in lung cancer (Ohta et al., 1999; Patek et al., 2008). However, as these studies reveal specific protein/gene state changes that may be related to lung cancer, they are important for the understanding of the mechanism of the disease.

To date, however, no systematic analysis has evaluated the quality and strength of these reported genes as one functional network/group to study the underlying biological processes of lung cancer. In this study, instead of focusing on one specific marker or function, we attempt to provide a full view of the genetic map related to lung cancer.

2. MATERIALS AND METHODS

The overview of this study is as follows: 1) literature data mining (LDM) to discover gene-lung cancer relations; 2) enrichment analysis on the identified genes to validate their pathogenic significance to lung cancer; 3) network connectivity analysis (NCA) to test the functional association between these reported genes.

2.1 Literature data mining

In this study, we performed literature data mining (LDM) over all articles available in the Pathway Studio database (www.pathwaystudio.com) as of Feb 2016, which covers over 40 million scientific articles, seeking those that reported gene-lung cancer relations. The LDM was conducted using the finely tuned Natural Language Processing (NLP) system of the Pathway Studio software, which can identify and extract relationship data from the scientific literature. Only publications containing a biological interaction defined by the ResNet Exchange (RNEF) data format were included (http://www.gousinfo.com/AIC%20project/Pathway%20Studio/Elsevier%20RNEF-1.3.htm). The results include a full list of gene names, information on the underlying articles, and the marker scores, which are described below.

2.2 Quality Metric analysis

We performed a quality metrics analysis on all marker-disease relations. Output of the analysis includes quality score (QScore), citation score (CScore), novelty score (NScore) and report frequency score (RScore) at article level as well as marker level. These quality measures can be used to sort the marker list and get the top ones with different significance.

Using the RScore, one can identify the most frequently reported markers. At the article level, RScore = 1 indicates that a marker-disease relation has been reported; otherwise RScore = 0. At marker level, RScore is the sum of article level RScores, representing the report frequency of the marker.

Using the NScore, one can identify the newly reported markers. Here we define the publication age as the current year - publication year + 1. For different publication age thresholds n, we differentiate NScores into NScore_n, where n (years) = 1, 2, ...; at article level, NScore_n = 0 when the publication age of the article is greater than n; otherwise NScore_n > 0. At marker level, NScore_n = 0 means the marker-disease relation was reported more than n years ago.

Using the CScore, one can identify the marker-disease relations that are highly cited. The CScore of an article is defined as its number of citations, and the marker level CScore of a relation is the sum of the total citations of all the articles supporting the relation.

The QScore is a composite index considering three factors of an article-reported relation: 1) the citation number; 2) the publication age, and 3) the RScore. The QScore of an article is in the range of (0,1), and is inversely related to publication age and positively related to its citation number. If an article is recently published with a high citation number, its QScore will be close to 1, and if the article is older with a low number of citations, its QScore will be close to 0. The marker level QScore is the sum of the QScores of all the articles supporting the marker.

It should be noted that both article level and marker level scores are defined at the relation level to evaluate the significance of the article(s) to the relation. If multiple marker-disease relations have been reported by one article, this article will have scores for each of those relations.
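The aggregation from article level to marker level described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article records, their years and their citation counts are invented, the function names are hypothetical, and the marker level NScore_n is reduced to a yes/no flag (whether every supporting article has a publication age of at most n) because the text does not give the formula for its magnitude. The QScore is omitted for the same reason.

```python
from collections import defaultdict

# Hypothetical article records: (gene, publication_year, citation_count).
# The gene symbols appear in the paper; every year and count is invented.
ARTICLES = [
    ("EGFR", 2015, 40),
    ("EGFR", 2009, 120),
    ("KRAS", 2012, 75),
    ("SIRT2", 2016, 3),
]

CURRENT_YEAR = 2016  # the literature cutoff stated in Section 2.1 is Feb 2016


def publication_age(pub_year, current_year=CURRENT_YEAR):
    """Publication age as defined above: current year - publication year + 1."""
    return current_year - pub_year + 1


def marker_scores(articles, n=2, current_year=CURRENT_YEAR):
    """Aggregate article level records into marker level scores.

    RScore counts supporting articles, CScore sums their citations, and
    newly_reported is True only if every supporting article has a
    publication age of at most n (assumed stand-in for NScore_n > 0).
    """
    acc = defaultdict(lambda: {"RScore": 0, "CScore": 0, "max_age": 0})
    for gene, year, citations in articles:
        entry = acc[gene]
        entry["RScore"] += 1          # article level RScore = 1 per reported relation
        entry["CScore"] += citations  # marker level CScore = sum of article citations
        entry["max_age"] = max(entry["max_age"], publication_age(year, current_year))
    return {
        gene: {
            "RScore": e["RScore"],
            "CScore": e["CScore"],
            "newly_reported": e["max_age"] <= n,
        }
        for gene, e in acc.items()
    }


if __name__ == "__main__":
    for gene, scores in marker_scores(ARTICLES).items():
        print(gene, scores)
```

With a current year of 2016 and n = 2, only articles from 2015 and 2016 have a publication age of at most 2, which matches the two-year window used for NScore_2 in the Results.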

2.3 Gene set enrichment analysis

To better understand the underlying functional profile and validate the pathogenic significance of the reported genes, we performed a gene set/pathway enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) on five groups: 1) the whole gene list (1,925 genes); 2) four subgroups selected using the highest quality metric scores (150 genes in each group). In addition, we conducted a network connectivity analysis on a subset of genes using Pathway Studio (www.pathwaystudio.com).

GSEA (also known as functional enrichment analysis) is a method for analyzing high-throughput biological experiments that identifies classes of genes or proteins that are over-represented in a large set of genes or proteins. These gene sets may be known biochemical pathways or otherwise functionally related genes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes and thereby retrieve a functional profile of the input gene set, in order to better understand the underlying biological processes. With this method, one considers not the perturbation of single genes but that of whole (functionally related) gene sets. The advantage of this approach is that it is more robust: a single gene is more likely to be falsely found perturbed than a whole pathway.
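As an illustration of what "over-represented" means, the following minimal sketch computes an upper-tail hypergeometric p-value for the overlap between an input gene list and one gene set. The source does not state which statistical test Pathway Studio applies, so the choice of the hypergeometric test, the function name and all numbers in the example are assumptions made only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(background_size, set_size, query_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    background_size: number of genes in the background
    set_size:        number of background genes annotated to the gene set
    query_size:      number of genes in the input list
    overlap:         number of input genes that fall in the gene set
    """
    max_overlap = min(set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers (invented): 20 background genes, 5 in the gene set,
    # 4 input genes, 2 of which belong to the gene set.
    print(hypergeom_enrichment_p(20, 5, 4, 2))
```

A small p-value means that an overlap at least this large would rarely arise if the input genes were drawn at random from the background, which is the sense in which a pathway is called enriched.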

In addition to GSEA, we performed a sub-network enrichment analysis (SNEA), which was implemented in Pathway Studio using master causal networks (database) containing more than 6.5 million relationships derived from more than 4 million full-text articles and 25 million PubMed abstracts. These networks are generated by a finely tuned Natural Language Processing (NLP) text mining system that extracts relationship data from the scientific literature, rather than by the manual curation process used by IPA (http://www.ingenuity.com/products/ipa). The ability to quickly update the terminologies and linguistic rules used by NLP systems ensures that new terms can be captured soon after entering regular use in the literature. This extensive database of interaction data provides high levels of confidence when interpreting experimentally derived genetic data against the background of previously published results (http://help.pathwaystudio.com/fileadmin/standalone/pathway_studio/help_ps_10.0/index.html?analyze_experiment.htm).

3. RESULTS

3.1 Summary of LDM results

In this study, we conducted LDM on 17,884 articles that reported the 1,925 genes associated with lung cancer. According to the reported category of gene-lung cancer relations, the 17,884 articles can generally be clustered into 6 different groups: 1) Biomarker (0.62%); 2) Clinical Trial (0.16%); 3) Genetic Change (53.91%); 4) Quantitative Change (22.51%); 5) Regulation (21.75%); 6) State Change (1.05%).

We present the publication date distribution of these 17,884 articles in Fig. 1, which shows that this study covers literature from the past 40 years (1976 to 2016). However, these articles have an average publication age of only 6.2 years, indicating that most of them were published recently. Here we define the publication age as the current year - publication year + 1. Notably, the number of publications has increased in recent years, especially after 2010. In addition, our analysis showed that the publication date distributions of the articles underlying each of the 1,925 genes are similar to that presented in Fig. 1.

FIGURE 1
Fig. 1: Histogram of the publications reporting marker-disease relationships between lung cancer and 1,925 genes

3.2 Marker ranking

Fig. 2 shows the marker-wise score values for the 1,925 genes. The x-axis represents the index of markers ranked by QScore, and the y-axis shows the CScore, QScore, NScore, and RScore, each normalized by its maximum value.

FIGURE 2
Fig. 2: Plot of CScore, QScore, NScore and RScore. Each measure was normalized by its maximum value. The NScore presented in this figure is NScore_2, so the NScore is zero if the corresponding gene is supported by articles with a publication age older than 2 years.
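The normalization used for Fig. 2 divides each score by the largest value of that score across all markers, so that the four measures share a common 0 to 1 scale. A minimal sketch follows; the function name is hypothetical, and the three RScore values are the article counts quoted in the text for EGFR, KRAS and TP53.

```python
def normalize_by_max(values):
    """Divide every score by the maximum score so the largest becomes 1.0."""
    peak = max(values.values())
    if peak == 0:
        # All scores are zero; avoid dividing by zero.
        return {gene: 0.0 for gene in values}
    return {gene: score / peak for gene, score in values.items()}


if __name__ == "__main__":
    rscores = {"EGFR": 2141, "KRAS": 1085, "TP53": 1003}
    print(normalize_by_max(rscores))
```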

Using the 4 scores, we found that some genes were frequently reported and supported by large numbers of articles, such as EGFR (2141 articles), KRAS (1085 articles) and TP53 (1003 articles). These genes have the highest RScores. On the other hand, some recently reported genes (e.g., reported within the last two years) have a high NScore, such as MAPK8 (NScore_2: 5.7), MIR423 (NScore_2: 4.0) and SIRT2 (NScore_2: 4.0). These genes usually have fewer articles to support them and therefore a low RScore (see Supplementary Material 1). Moreover, genes with high report frequencies (RScore) do not necessarily have a higher number of citations (CScore), which may be caused by many factors such as the total number of underlying articles and their publication age. To balance these factors, we use the QScore.

Among these 1,925 genes, 150 were reported within the last two years (2015-2016), with an NScore_2 > 0 (Fig. 1). These 150 genes are listed in Table 1 and the full results are provided in Supplementary Material 1. For comparison, Table 1 also presents the top 150 genes with the highest RScores (the most frequently reported). Because the top genes selected by RScore, CScore and QScore overlap substantially (e.g., overlap > 75% for the top 150 genes), we present the 150 genes with the highest CScores and QScores only in Supplementary Material 1 to reduce redundancy.
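The comparison of top-ranked gene lists can be sketched as follows: rank the markers by each score, keep the top k, and report the fraction of genes shared by the two lists. The helper names are hypothetical, GENE_X is a made-up placeholder, and the CScore values are invented. Only the three RScore values come from the text.

```python
def top_k(scores, k):
    """Return the set of the k genes with the highest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap_fraction(first, second):
    """Fraction of genes in the first list that also appear in the second."""
    if not first:
        return 0.0
    return len(first & second) / len(first)


if __name__ == "__main__":
    rscore = {"EGFR": 2141, "KRAS": 1085, "TP53": 1003, "GENE_X": 10}
    cscore = {"EGFR": 900, "KRAS": 300, "TP53": 200, "GENE_X": 500}  # invented
    by_r = top_k(rscore, 3)
    by_c = top_k(cscore, 3)
    print(sorted(by_r), sorted(by_c), overlap_fraction(by_r, by_c))
```

In the study itself k is 150 and the lists are drawn from all 1,925 genes; the toy example uses k = 3 only to keep the output readable.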

TABLE 1
Table 1 Top 150 genes with reported associations with lung cancer, ranked by different scores

Note: NScore here is NScore_2; any marker that was reported more than 2 years ago will have an NScore of 0. In this study, 150 genes were newly reported in 2015 and 2016.

3.2 Enrichment analysis

In this section, we present GSEA and SNEA results for 3 different groups: all 1,925 genes, and the two gene groups listed in Table 1. The results for the top 150 genes with the highest CScores and QScores are presented in Supplementary Material 2 and 3.

3.2.1 Enrichment analysis on all 1,925 genes

The full list of 198 pathways/gene sets enriched with a p-value < 1.4E-015 is given in Supplementary Material 2; of these, 114 are enriched with p-values < 1E-20, 32 with p-values < 1E-40, and 7 with p-values < 1E-70. In Table 2, we present the top 20 pathways/groups enriched by all 1,925 genes, with p-values < 1e-047.

TABLE 2
Table 2 Molecular function pathways/ groups enriched by 1,925 genes reported

Cancer is fundamentally a disease of cell/tissue growth regulation failure. A normal cell transforming into a cancer cell indicates that the genes regulating the cell growth and differentiation have been altered (Croce, 2008). Our GSEA showed that 26 pathways/gene sets related to cell apoptosis, cell growth and cell proliferation were significantly enriched with the 1,925 genes reported. Specifically, there were 11 pathways/gene sets related to cell apoptosis (P-value: [1.3e-074,1e-016]): negative regulation of apoptotic process (GO: 0006916; p-value=1.3e-074, overlap: 210); apoptotic process (GO: 0006917; p-value=9.9e-068, overlap: 224); positive regulation of apoptotic process (GO: 0043065; p-value=2.2e-055, overlap: 140); negative regulation of neuron apoptotic process (GO: 0043524; p-value=8.4e-025, overlap: 59); intrinsic apoptotic signaling pathway (GO: 0008629; p-value=3.9e-022, overlap: 35); positive regulation of neuron apoptotic process (GO: 0043525; p-value=2.7e-021, overlap: 34); negative regulation of cysteine-type endopeptidase activity involved in apoptotic process (GO: 0001719; p-value=4.9e-020, overlap: 37); regulation of apoptotic process (GO: 0042981; p-value=2.4e-019, overlap: 72); activation of cysteine-type endopeptidase activity involved in apoptotic process (GO: 0006919; p-value=1.4e-018, overlap: 39); intrinsic apoptotic signaling pathway in response to DNA damage (GO: 0008630; p-value=5.4e-018, overlap: 31); apoptotic signaling pathway (GO: 0097190; p-value=1e-016, overlap: 43).

In addition, there were 15 pathways/gene sets related to cell growth and proliferation (P-value: [8.8e-072,1e-015]): positive regulation of cell proliferation (GO: 0008284; p-value=8.8e-072, overlap: 192); negative regulation of cell proliferation (GO: 0008285; p-value=1.3e-066, overlap: 168); cell proliferation (GO: 0008283; p-value=3.8e-044, overlap: 131); regulation of cell proliferation (GO: 0042127; p-value=2.1e-041, overlap: 94); epidermal growth factor receptor signaling pathway (GO: 0007173; p-value=6.8e-030, overlap: 73); positive regulation of smooth muscle cell proliferation (GO: 0048661; p-value=4.2e-026, overlap: 40); fibroblast growth factor receptor signaling pathway (GO: 0008543; p-value=1.7e-025, overlap: 61); vascular endothelial growth factor receptor signaling pathway (GO: 0048010; p-value=3.6e-025, overlap: 49); negative regulation of cell growth (GO: 0030308; p-value=1.5e-023, overlap: 54); transforming growth factor beta receptor signaling pathway (GO: 0007179; p-value=4.1e-019, overlap: 49); positive regulation of fibroblast proliferation (GO: 0048146; p-value=5.6e-018, overlap: 30); positive regulation of epithelial cell proliferation (GO: 0050679; p-value=3.3e-017, overlap: 33); growth factor activity (GO: 0008083; p-value=2.3e-016, overlap: 52); negative regulation of epithelial cell proliferation (GO: 0050680; p-value=6.4e-016, overlap: 31); positive regulation of endothelial cell proliferation (GO: 0001938; p-value=1e-015, overlap: 31).

Another cancer-related factor is the immune system (Finn 2012). Here we identified one related gene set: the innate immune response (GO: 0045087; p-value=1.9e-046, overlap: 192).

At the protein level, we identified 7 pathways/gene sets that were related to protein phosphorylation and 9 pathways/gene sets related to protein kinases: protein phosphorylation (GO: 0006468; p-value=3.9e-045, overlap: 173); positive regulation of protein phosphorylation (GO: 0001934; p-value=3.3e-032, overlap: 69); phosphorylation (GO: 0016310; p-value=3.3e-030, overlap: 147); peptidyl-tyrosine phosphorylation (GO: 0018108; p-value=3.4e-028, overlap: 60); protein autophosphorylation (GO: 0046777; p-value=3.5e-027, overlap: 68); positive regulation of peptidyl-serine phosphorylation (GO: 0033138; p-value=8.8e-017, overlap: 33); positive regulation of peptidyl-tyrosine phosphorylation (GO: 0050731; p-value=1.2e-016, overlap: 37); protein kinase binding (GO: 0019901; p-value=2.9e-037, overlap: 125); protein kinase activity (GO: 0050222; p-value=9e-037, overlap: 142); positive regulation of protein kinase B signaling (GO: 0051897; p-value=2.4e-032, overlap: 51); kinase activity (GO: 0016301; p-value=1.5e-030, overlap: 150); protein tyrosine kinase activity (GO: 0004718; p-value=8.7e-030, overlap: 59); transmembrane receptor protein tyrosine kinase activity (GO: 0004714; p-value=4.1e-025, overlap: 36); transmembrane receptor protein tyrosine kinase signaling pathway (GO: 0007169; p-value=8.6e-021, overlap: 48); positive regulation of I-kappaB kinase-NF-kappaB signaling (GO: 0043123; p-value=2.9e-018, overlap: 52); positive regulation of MAP kinase activity (GO: 0043406; p-value=6.2e-016, overlap: 29). A protein kinase is a kinase enzyme that modifies other proteins by chemically adding phosphate groups to them (phosphorylation). Phosphorylation usually results in a functional change of the target protein (substrate) by changing enzyme activity, cellular location, or association with other proteins. Deregulated kinase activity is a frequent cause of cancer, and drugs that inhibit specific kinases are being developed to treat cancers (Zhange et al., 2009).

Moreover, we note that 5 enriched pathways/gene sets are related to the nervous system (P-value: [1e-047, 5.1e-016]) and 5 to drug response (P-value: [3.7e-081, 2e-031]). In addition, one gene set related to aging (GO: 0016280) was also significantly enriched (p-value=9.4e-050, overlap: 106). Although these pathways/gene sets may not be directly related to lung cancer, their enrichment helps us understand the underlying biological processes of the disease, which may benefit drug development. More significantly enriched pathways are presented in Supplementary Material 2.

In addition to GSEA, we performed SNEA using Pathway Studio to identify the pathogenic significance of the reported genes to other disorders that are possibly related to lung cancer. The full list of results is in Supplementary Material 3. Table 3 lists the disease-related sub-networks enriched with a p-value < 1E-323.

TABLE 3
Table 3 Sub-networks enriched by the 1,925 genes reported

From Table 3 we see that many of these reported lung cancer related genes were also identified in other types of cancers, with a large percentage of overlap (Jaccard similarity>0.24).
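The Jaccard similarity quoted above is the size of the intersection of two gene sets divided by the size of their union. A minimal sketch is shown below; the function name is hypothetical, and the two example sets are invented combinations of gene symbols that appear in this paper, not actual sub-network memberships.

```python
def jaccard(first, second):
    """Jaccard similarity |A intersect B| / |A union B| of two gene collections."""
    a, b = set(first), set(second)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(union)


if __name__ == "__main__":
    lung_cancer_genes = {"EGFR", "KRAS", "TP53"}
    other_disease_genes = {"KRAS", "TP53", "ALK", "MET"}
    print(jaccard(lung_cancer_genes, other_disease_genes))
```

A value of 0.24 therefore means that roughly a quarter of all genes appearing in either set appear in both.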

3.2.2 Enrichment analysis on top 150 genes with highest scores

As described in the Methods section, the QScore, CScore and RScore are strongly related, while the NScore is not. Here we compare the corresponding gene groups in terms of GSEA and SNEA results. Considering the similarity of the groups selected by QScore, CScore and RScore, we only present the results for the NScore group and the RScore group (Table 4 and Table 5), and report the full results for the QScore and CScore groups in Supplementary Material 2 and 3.

TABLE 4
Table 4 Pathways/groups enriched by 150 genes with the highest NScore and RScore

Note: 1) NScore is used as NScore_2, a non-zero-value of which represents that the gene is reported within the last two years; there were 150 genes reported to have non-zero NScore_2 values.

From Table 4 we see that the genes with the top NScores and those with the top RScores enrich different groups of pathways, with different p-values (NScore group: 5.09E-08~3.73E-05; RScore group: 2.13E-48~1.26E-21), indicating that the newly reported genes are functionally different from the frequently reported ones.

Moreover, we observed that 8 of the 10 pathways/gene sets enriched by the RScore group (Table 4) were also among the top 20 pathways/gene sets enriched by the overall 1,925 genes (Table 2). Similarly, the cytosol group (GO: 0005829) was enriched by both the overall genes and the NScore group, although with much weaker significance in the latter (4.68E-82 vs. 1.05E-07), indicating that many more genes with similar functions have already been discovered.

For the SNEA analysis, we tested the disease sub-networks that were enriched by the two groups of genes. We provided the full list of results in Supplementary Material 3. Table 5 shows the top 10 disease related sub-networks enriched by the two groups of genes.

TABLE 5
Table 5 SNEA results by 150 genes with the highest NScore and RScore

From Table 5, we see that both groups enriched some cancer/neoplasm-related sub-networks. However, the enrichment p-values for the RScore group are much more significant than those for the NScore group.

3.3 Connectivity analysis

In addition to GSEA and SNEA, we performed a network connectivity analysis (NCA) on the top 150 genes with the highest RScores and NScores (from Table 1) to generate functional networks. Results show that for the RScore group, there are over 5,000 relationships among those 150 genes, supported by numerous literature reports. We present in Fig. 3 (a) a network built using 20 genes randomly selected from these 150 genes, where we see that these genes are functionally connected to each other, forming a complex network. In contrast, the 20 genes randomly selected from the 150-NScore group demonstrate only a few connections, as shown in Fig. 3 (b). NCA shows that there are only 290 relationships among 98 genes for the whole 150-NScore group. This observation is consistent with the GSEA and SNEA results, suggesting that genes with a high NScore are not as functionally close to each other as those in the RScore group.
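The two quantities reported by the NCA (the number of relationships within a gene group and the number of genes that take part in at least one of them) can be illustrated with a small sketch. It assumes relationships are available as gene pairs; the pairs below are invented for illustration and are not taken from the Pathway Studio database, and the function name is hypothetical.

```python
def induced_relationships(relationships, genes):
    """Keep relationships whose two endpoints both belong to the gene group.

    Returns the kept relationships and the set of genes that take part
    in at least one of them.
    """
    group = set(genes)
    kept = [
        (a, b)
        for a, b in relationships
        if a in group and b in group and a != b  # ignore self-relations
    ]
    connected = {gene for pair in kept for gene in pair}
    return kept, connected


if __name__ == "__main__":
    # Invented relationships between gene symbols mentioned in the paper.
    relationships = [
        ("EGFR", "KRAS"),
        ("KRAS", "TP53"),
        ("EGFR", "MET"),
        ("SIRT2", "MAPK8"),
    ]
    group = ["EGFR", "KRAS", "TP53", "ALK"]
    kept, connected = induced_relationships(relationships, group)
    print(len(kept), "relationships among", len(connected), "of", len(group), "genes")
```

In this toy group, ALK has no relationship with the other members, in the same way that only 98 of the 150 NScore genes are involved in the 290 relationships reported above.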

FIGURE 3
(a) By RScore group
FIGURE 3
(b) By NScore group

Fig. 3 Connectivity networks built by 20 genes from different groups. The networks are generated using Pathway Studio. The 20 genes were randomly selected from the 150 genes with highest RScores for (a) and NScores for (b), respectively.

4. DISCUSSION

In this study, we performed an LDM on 17,884 articles (from 1976 to Feb 2016) reporting 1,925 genes associated with lung cancer. We provided the full gene list and related parameters in Supplementary Material 1. In addition, we conducted GSEA and SNEA to study the functional profile and pathogenic significance of the reported genes in lung cancer. We also performed NCA to study the functional associations between the top genes ranked by different scores. Unlike genetic studies that use raw data to report novel discoveries, this is a literature-based summarization and validation of already reported marker-disease relations.

This study has several limitations that need future work. The literature data of the 17,884 articles studied were extracted from the Pathway Studio database. Although the Pathway Studio database covers over 40 million articles, it is still possible that some articles studying gene-lung cancer associations fall outside its coverage. Additionally, the 4 quality scores (RScore, NScore, CScore and QScore) were proposed as quality measures of LDM-identified marker-disease relations and can be used to rank the markers/relations according to different needs. However, although related to biological significance, they do not measure the biological significance of the markers to the disease. Therefore, they cannot replace genetic statistical studies such as GWAS, meta-analysis and enrichment analysis.

As an automatic data mining approach, the Natural Language Processing (NLP) technique used for LDM is effective and necessary in dealing with millions of articles. However, the automatic LDM method may introduce some false positives. Therefore, the results of this study are intended to lay the groundwork for further studies in the area. To this end, we provided in Supplementary Material 1 detailed information on all 17,884 articles studied, including the sentences where a specific relation was located.

Nevertheless, results from this up-to-date LDM reveal that these 1,925 genes have multiple types of association with lung cancer. Enrichment analysis suggests that these genes play significant roles in the pathogenesis of lung cancer, as well as in the pathogenesis of many other lung cancer related disorders. Moreover, NCA results demonstrate that these genes, especially the ones with high RScores, present strong functional associations with each other. Our results suggest that these genes may operate as a functional biomarker network influencing the development of lung cancer.

Altogether, we conclude that lung cancer is a complex disease whose genetic causes are linked to a network composed of a large group of genes. LDM together with GSEA, SNEA and NCA could serve as an effective approach in finding these potential target genes.

DECLARATION OF INTERESTS

The other authors declare no conflict of interests

REFERENCES

