GeTPRA¶
The GeTPRA framework
This project is to develop a framework that systematically predicts Gene-Transcript-Protein-Reaction Associations (GeTPRA) in human metabolims and updates a human genome-scale metabolic model (GEM) accordingly. This source code implements the GeTPRA framework.
Features This source code executes following steps in the GeTPRA framework in order:
Get reactions from biochemical database using EC number
Standardize metabolite information
Compartmentalize metabolic reactions
Generate GeTPRA
Check ‘Exist in Recon 2M.1’
Check ‘Blocked reaction’
Check ‘Experimental evidence available’
See below for the implementation of EFICAz and Wolf PSort (optional features)
Installation Procedure Note: This source code was developed in Linux, and has been tested in Ubuntu 14.04.5 LTS (i7-4770 CPU @ 3.40GHz)
Clone the repository
Create and activate virtual environment
$ virtualenv venv
$ source venv/bin/activate
Install packages at the root of the repository
$ pip install pip --upgrade
$ pip install -r requirements.txt
Input files for the GeTPRA framework Following working input files can be found in: getpra/input_data/getpra_inputs/. These files were used for the data presented in the manuscript.
Gene-transcript ID annotation file format - Download gene-transcript ID annotation file from [Ensembl BioMarts](http://www.ensembl.org/biomart/martview)
NCBI gene ID Gene stable ID Transcript stable ID RefSeq mRNA ID UCSC Stable ID
2733 ENSG00000119392 ENST00000309971 NM_001003722 uc004bvj.4
2733 ENSG00000119392 ENST00000372770 NM_001499 uc004bvi.4
5690 ENSG00000126067 ENST00000373237 NM_002794 uc001bzf.4
5690 ENSG00000126067 ENST00000373237 NM_001199779 uc001bzf.4
5690 ENSG00000126067 ENST00000621781 NM_001199780 uc021olh.3
Download procedure
Go to [Ensembl BioMarts](http://www.ensembl.org/biomart/martview)
Click Dataset on the left menu
Select Ensembl Genes 89 in the drop-down menu CHOOSE DATABASE
Select Human genes (GRCh38.p10) in the drop-down menu CHOOSE DATASET
Click Filters on the left menu
Click GENE: in the main menu (center)
Check Gene type and select protein_coding
Click Attributes on the left menu
Check Features in the center
Click both GENE: and EXTERNAL: in the main menu (center)
GENE: -> Ensembl -> Uncheck Gene stable ID and Transcript stable ID
Check following items in order:
EXTERNAL: -> External References (max 3) -> NCBI gene ID
GENE: -> Ensembl -> Gene stable ID
GENE: -> Ensembl -> Transcript stable ID
EXTERNAL: -> External References (max 3) -> RefSeq mRNA ID
EXTERNAL: -> External References (max 3) -> UCSC Stable ID
Click the button Results on the top left
Click the button Go in the top center
- File names in the source:
Ensembl_GRCh38_EnsemblDB_v84.txt
Ensembl_GRCh38_EnsemblDB_v85.txt
Ensembl_GRCh38_EnsemblDB_v86.txt
Ensembl_GRCh38_EnsemblDB_v87.txt
Ensembl_GRCh38_EnsemblDB_v88.txt
`chem_xref.tsv` from `MetaNetX` - Download [chem_xref.tsv](http://www.metanetx.org/cgi-bin/mnxget/mnxref/chem_xref.tsv) from [MetaNetX](http://www.metanetx.org/) - File name in the source: chem_xref.tsv
`gene2ensembl.gz` from `NCBI FTP` - Download [gene2ensembl.gz](ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz) from [NCBI FTP](ftp://ftp.ncbi.nlm.nih.gov/) - File name in the source: gene2ensembl.gz
`appris_data.principal.txt` from `APPRIS` - Download [appris_data.principal.txt](http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/GRCh38/appris_data.principal.txt) for annotation of principal isoform from [APPRIS](http://appris.bioinfo.cnio.es/) - File name in the source: appris_data.principal.txt
`subcellular_location.csv` from `The Human Protein Atlas` - Download [subcellular_location.csv](http://www.proteinatlas.org/download/subcellular_location.csv.zip) with subcellular localization information from [The Human Protein Atlas](http://www.proteinatlas.org/) - File name in the source: subcellular_location.csv
Recon model - Prepare a human genome-scale metabolic model with consistent TPR associations and that shows biologically reasonable simulation performance - File name in the source: Recon2M.1_Entrez_Gene.xml
EFICAz output file as input file - File name in the source: 20170110_EFICAz_result.txt. - EFICAz can be run with a different set of peptide sequences (see below).
WoLF PSort output file as input file - File name in the source: 20170110_WoLFPSort_result.txt - WoLF PSort can be run with a different set of peptide sequences (see below).
BRENDA data - Set user email address and password before implementing the GeTPRA framework. The framework programmatically fetches BRENDA data from BRENDA through its API.
##Implementation Note: All the arguments shown below should be provided when implementing the framework
Note: Make sure to provide own information for -brenda_email and -brenda_pw
Note: Implementation of this source code takes long (~ 8 h)
$ python run_GeTPRA_framework.py \
-output_dir ./getpra_results/ \
-ec ./input_data/getpra_inputs/20170110_EFICAz_result.txt \
-sl ./input_data/getpra_inputs/20170110_WoLFPSort_result.txt \
-brenda_email user_email_address \
-brenda_pw user_password \
-mnx_xref ./input_data/getpra_inputs/chem_xref.tsv \
-ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt \
-appris ./input_data/getpra_inputs/appris_data.principal.txt \
-model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
-hpa ./input_data/getpra_inputs/subcellular_location.csv \
-ncbi_id_information ./input_data/getpra_inputs/gene2ensembl.gz
##Output files from the GeTPRA framework - Raw output files from the GeTPRA framework, which were used for the publication, are available in: getpra/getpra_results_publication_version/ - New output files upon implementation of the framework are generated in: getpra/getpra_results/. This folder is automatically created.
Implementation of EFICAz and Wolf PSort (optional) Output files of EFICAz and Wolf PSort serve as input files for the GeTPRA framework.
Peptide sequences of metabolic genes as inputs for EFICAz and Wolf PSort - Get peptide sequences of metabolic genes by implementing ./input_data/get_peptide_sequences.py
$ python ./input_data/get_peptide_sequences.py \
-output_dir ./input_data/getpra_inputs/ \
-model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
-ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt
File name in the source: Ensembl_peptide_seq_metabolic_genes.fa
>ENSP00000452494|ENST00000448914 TGGY >ENSP00000488240|ENST00000631435 GTGG >ENSP00000487941|ENST00000632684 GTGG >ENSP00000451515|ENST00000434970 PSY >ENSP00000451042|ENST00000415118 EI
Installation 1. Download [EFICAz2.5](http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html)
Set environment variable for EFICAz2.5
$ export EFICAz25_PATH="[insert-destination-directory]/EFICAz2.5.1/" $ export PATH="${PATH}:${EFICAz25_PATH}"
Download [WoLF PSort](https://github.com/fmaguire/WoLFPSort)
Set environment variable for WoLF PSort
$ export WoLFPSort_PATH="[insert-destination-directory]/WoLFPSort/" $ export PATH="${PATH}:${WoLFPSort_PATH}"
Implementation of EFICAz and Wolf PSort - Predict EC numbers using [EFICAz2.5](http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html)
$ python $EFICAz25_PATH/eficaz2.5 Ensembl_peptide_seq_metabolic_genes.fa
Predict subcellular localization using [WoLF PSort](https://github.com/fmaguire/WoLFPSort)
$ WoLFPSort_PATH/bin/runWolfPsortSummary animal < Ensembl_peptide_seq_metabolic_genes.fa
- Output files from the above two implementations using peptide sequences of entire human genes are available in:
EFICAz: getpra/input_data/20170110_EFICAz_result_using_all_human_genes.txt
Wolf PSort: getpra/input_data/20170110_WoLFPSort_result_using_all_human_genes.txt
Extract metabolic genes from EFICAz and Wolf PSort output data obtained with entire human genes - Extract metabolic genes from the EFICAz and Wolf PSort output data
$ python ./input_data/get_EFICAz_WolfPSort_results.py \ -output_dir ./input_data/getpra_inputs/ \ -model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \ -ec ./input_data/getpra_inputs/20170110_EFICAz_result_using_all_human_genes.txt \ -sl ./input_data/getpra_inputs/20170110_WoLFPSort_result_using_all_human_genes.txt \ -ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt
- Resulting output files in the source:
EFICAz: getpra/input_data/Trimmed_EFICAz_result.txt
Wolf PSort: getpra/input_data/Trimmed_WoLFPSort_result.txt
Publication Jae Yong Ryu 1, Hyun Uk Kim 1 & Sang Yup Lee. Framework and resource for more than 11,000 gene-transcript-protein-reaction associations in human metabolism., Proc. Natl. Acad. Sci. U.S.A., 2017, http://www.pnas.org/content/early/2017/10/23/1713050114