Appendix A: Extended Methodologies
Automated Workflow for Extracting and Analyzing Gene-Disease Associations in Rare Diseases Using RENET2, Comparative Tools, and Advanced Functional Annotation
Objective
To develop a robust, automated workflow that extracts, curates, and analyzes gene-disease associations from full-text articles for a specific rare disease. The workflow uses RENET2, seeded with gene lists from DisGeNET and manually curated literature searches, compares its output against BeFree and DTMiner, and complements the extraction with advanced functional annotation tools.
Approach
Seed Gene List Preparation
The workflow begins with the collection of a seed gene list derived from two primary sources:
- DisGeNET: A database that provides curated gene-disease associations, which will be used as a foundational dataset.
- Manual Literature Curation: An additional gene list will be curated manually through targeted literature searches using search terms specific to the rare disease. This list is intended to capture genes with significant associations that DisGeNET may not cover.
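The merging of the two sources can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `merge_seed_genes`, the placeholder symbols, and the case-insensitive matching rule are all assumptions made for the example.

```python
from collections import defaultdict


def merge_seed_genes(disgenet_genes, manual_genes):
    """Merge two gene lists into one seed list, recording where each gene came from.

    Assumption: symbols are matched case-insensitively after trimming whitespace.
    This is a simplification; symbols with meaningful lowercase letters would
    need a proper mapping to official identifiers instead.
    """
    sources = defaultdict(set)
    for source_name, genes in (("DisGeNET", disgenet_genes), ("manual", manual_genes)):
        for symbol in genes:
            cleaned = symbol.strip().upper()
            if cleaned:  # skip empty entries
                sources[cleaned].add(source_name)
    # Sort genes and sources so the output is deterministic.
    return {gene: sorted(found_in) for gene, found_in in sorted(sources.items())}


# Placeholder symbols, not real associations.
seed = merge_seed_genes(["GENE1", "gene2 ", "GENE3"], ["GENE2", "GENE4"])
for gene, found_in in seed.items():
    print(gene, ", ".join(found_in))
```

Keeping the provenance of each gene makes it possible to report later which associations trace back to DisGeNET, to manual curation, or to both.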
Automated Extraction of Gene-Disease Associations Using RENET2
The seed gene list will be used to retrieve a set of relevant full-text articles from PubMed Central (PMC). RENET2 will then be applied to these articles to extract gene-disease associations. RENET2’s section filtering and ambiguous relation modeling are expected to improve the precision and recall of the extracted associations, especially in complex full-text data. Its iterative training data expansion strategy will further refine the extraction by addressing the scarcity of full-text labels.
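As a sketch of the article retrieval step, the snippet below builds a search request for PMC through the NCBI E-utilities `esearch` endpoint. The query shape (gene symbols joined with OR, combined with a disease term) and the helper names are assumptions for illustration; the RENET2 invocation itself is not shown, and no network request is made here.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_query(genes, disease_term):
    """Combine gene symbols (OR) with a disease term (AND) into one search string."""
    gene_clause = " OR ".join(f'"{gene}"' for gene in genes)
    return f'({gene_clause}) AND "{disease_term}"'


def build_esearch_url(genes, disease_term, retmax=100):
    """Return an esearch URL that would list matching PMC article IDs as JSON."""
    params = {
        "db": "pmc",
        "term": build_pmc_query(genes, disease_term),
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Placeholder genes and disease name.
print(build_pmc_query(["GENE1", "GENE2"], "example rare disease"))
print(build_esearch_url(["GENE1", "GENE2"], "example rare disease"))
```

In practice a long seed list would likely need to be split into batches, since very long query strings are awkward to send in a single request.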
Comparison with BeFree and DTMiner
To evaluate RENET2, its extracted associations will be compared with those obtained from BeFree and DTMiner, with all three tools applied to the same set of full-text articles. The comparison will use precision, recall, and F1-score, which require a reference set of validated associations to score against. The working hypothesis is that RENET2 will outperform both tools, particularly on associations extracted from full-text data.
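The metrics named above can be computed as in the following sketch. It assumes that each tool's output has been reduced to a set of (gene, disease) pairs and that a reference set of validated pairs is available; both the data layout and the function name `score_extractions` are assumptions for the example.

```python
def score_extractions(predicted, reference):
    """Compute precision, recall, and F1 for one tool against a reference set.

    Both arguments are iterables of (gene, disease) pairs.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder data: four validated pairs, three predicted, two of them correct.
reference = {("GENE1", "D"), ("GENE2", "D"), ("GENE3", "D"), ("GENE4", "D")}
predicted = {("GENE1", "D"), ("GENE2", "D"), ("GENE5", "D")}
scores = score_extractions(predicted, reference)
print(scores)  # precision 2/3, recall 1/2, F1 4/7
```

Running the same function on the output of each tool gives directly comparable numbers, provided all tools are scored against the same reference set.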
Functional Annotation and Enrichment Analysis
The workflow will employ advanced tools for functional annotation and enrichment analysis of the extracted gene-disease associations:
- g:Profiler: This tool will be used for comprehensive analysis of Gene Ontology (GO) terms, pathways (KEGG, Reactome), and transcription factor binding sites, offering a more frequently updated alternative to traditional tools like DAVID.
- Enrichr: A powerful tool for enrichment analysis across a wide array of gene set libraries, providing insights into pathways, ontologies, and drug-target interactions.
- Metascape: This tool will integrate data from multiple sources for gene annotation and enrichment analysis, and will offer advanced visualization options.
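To make the idea of enrichment concrete, the sketch below computes a one-sided hypergeometric p-value for a single gene set, which is a common basis for over-representation analysis. It illustrates the concept only: it is not the implementation of g:Profiler, Enrichr, or Metascape, each of which applies its own statistics and multiple-testing correction, and the function name and numbers are assumptions for the example.

```python
from math import comb


def hypergeom_enrichment_p(background_size, term_size, query_size, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    background_size: number of genes in the background
    term_size:       background genes annotated with the term
    query_size:      genes in the extracted list
    overlap:         extracted genes annotated with the term
    """
    upper = min(term_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 3 annotated, 2 queried, both annotated.
p_value = hypergeom_enrichment_p(10, 3, 2, 2)
print(round(p_value, 4))  # 3 / 45
```

A real analysis tests many terms at once, so the raw p-values would still need a multiple-testing correction before interpretation.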
Network Analysis Using STRING and Cytoscape
A protein-protein interaction (PPI) network will be constructed using the STRING database, based on the validated gene-disease associations. Cytoscape will be used to visualize these networks, with plugins like ClueGO and CluePedia integrated to explore functional groupings and pathway interactions within the PPI networks.
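Before visualization, the interaction data can be reduced to a confident network and summarized, as in this sketch. It assumes the interactions have already been exported as (protein A, protein B, score) tuples with scores between 0 and 1; the 0.7 cutoff and the function name `build_ppi_degree` are assumptions for the example, and duplicate edges are not handled.

```python
from collections import Counter


def build_ppi_degree(edges, min_score=0.7):
    """Keep interactions at or above `min_score` and count each protein's partners."""
    degree = Counter()
    kept = []
    for protein_a, protein_b, score in edges:
        if protein_a != protein_b and score >= min_score:  # drop self-loops and weak edges
            kept.append((protein_a, protein_b, score))
            degree[protein_a] += 1
            degree[protein_b] += 1
    return kept, degree


# Placeholder interactions, not real STRING data.
edges = [("P1", "P2", 0.9), ("P1", "P3", 0.8), ("P2", "P3", 0.3)]
kept, degree = build_ppi_degree(edges)
print(degree.most_common(1))  # the most connected protein in the filtered network
```

Degree is only one simple way to rank nodes; the filtered edge list can be imported into Cytoscape for the richer analyses described above.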
Data Integration and Comparative Visualization
The final step involves integrating the outputs from RENET2, BeFree, DTMiner, and the functional annotation tools into Cytoscape for comprehensive visualization. The comparative analysis between RENET2 and the other tools will be visually represented, highlighting areas of overlap and unique findings. A script will automate the generation of reports summarizing gene-disease associations, enrichment analyses, and network interactions, facilitating further research and exploration.
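The overlap and unique findings mentioned above reduce to set operations, as the following sketch shows. It assumes each tool's output is a set of (gene, disease) pairs; the function names, report wording, and sample data are assumptions for the example rather than the planned report format.

```python
def compare_tools(results):
    """Return associations shared by all tools and those unique to each tool.

    results: dict mapping tool name to a set of (gene, disease) pairs.
    """
    shared = set.intersection(*results.values()) if results else set()
    unique = {}
    for tool, associations in results.items():
        others = set().union(*(a for t, a in results.items() if t != tool))
        unique[tool] = associations - others
    return shared, unique


def render_report(shared, unique):
    """Format the comparison as a short plain-text summary."""
    lines = [f"Associations found by all tools: {len(shared)}"]
    for tool in sorted(unique):
        lines.append(f"Unique to {tool}: {len(unique[tool])}")
    return "\n".join(lines)


# Placeholder outputs for the three tools.
results = {
    "RENET2": {("GENE1", "D"), ("GENE2", "D"), ("GENE3", "D")},
    "BeFree": {("GENE1", "D"), ("GENE2", "D")},
    "DTMiner": {("GENE1", "D"), ("GENE4", "D")},
}
shared, unique = compare_tools(results)
print(render_report(shared, unique))
```

The same shared and unique sets could be attached to nodes as attributes so that the comparison is visible when the network is styled in Cytoscape.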
Outcome
This workflow is intended to provide a rigorous and comprehensive approach to extracting and analyzing gene-disease associations from full-text articles, starting from curated gene lists drawn from DisGeNET and manual literature searches. Benchmarking RENET2 against BeFree and DTMiner is meant to validate the extracted associations, while the functional annotation and network analyses are meant to turn them into actionable insights into the molecular mechanisms of the rare disease.