GeTPRA

The GeTPRA framework

This project develops a framework that systematically predicts Gene-Transcript-Protein-Reaction Associations (GeTPRA) in human metabolism and updates a human genome-scale metabolic model (GEM) accordingly. This source code implements the GeTPRA framework.


## Features

This source code executes the following steps of the GeTPRA framework in order:

  • Get reactions from biochemical database using EC number

  • Standardize metabolite information

  • Compartmentalize metabolic reactions

  • Generate GeTPRA

  • Check ‘Exist in Recon 2M.1’

  • Check ‘Blocked reaction’

  • Check ‘Experimental evidence available’

See below for the implementation of EFICAz and WoLF PSort (optional features).

## Installation

### Procedure

Note: This source code was developed on Linux and has been tested on Ubuntu 14.04.5 LTS (i7-4770 CPU @ 3.40GHz).

  1. Clone the repository

  2. Create and activate virtual environment

$ virtualenv venv
$ source venv/bin/activate

  3. Install packages at the root of the repository

$ pip install pip --upgrade
$ pip install -r requirements.txt

## Input files for the GeTPRA framework

The following working input files can be found in: getpra/input_data/getpra_inputs/. These files were used for the data presented in the manuscript.

### Gene-transcript ID annotation file format

  • Download the gene-transcript ID annotation file from [Ensembl BioMarts](http://www.ensembl.org/biomart/martview)

NCBI gene ID        Gene stable ID  Transcript stable ID    RefSeq mRNA ID  UCSC Stable ID
2733        ENSG00000119392 ENST00000309971 NM_001003722    uc004bvj.4
2733        ENSG00000119392 ENST00000372770 NM_001499       uc004bvi.4
5690        ENSG00000126067 ENST00000373237 NM_002794       uc001bzf.4
5690        ENSG00000126067 ENST00000373237 NM_001199779    uc001bzf.4
5690        ENSG00000126067 ENST00000621781 NM_001199780    uc021olh.3
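The annotation file maps one NCBI gene ID to several transcripts, and the same transcript can appear on more than one row when it has several RefSeq IDs. The following is a minimal Python 3 sketch of how such a file could be read into a gene-to-transcript mapping. It assumes the file is tab-separated with the header row shown above; `read_gene_transcripts` is a hypothetical helper for illustration and is not part of the GeTPRA source code.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the example above, assumed to be tab-separated.
SAMPLE = (
    "NCBI gene ID\tGene stable ID\tTranscript stable ID\tRefSeq mRNA ID\tUCSC Stable ID\n"
    "2733\tENSG00000119392\tENST00000309971\tNM_001003722\tuc004bvj.4\n"
    "2733\tENSG00000119392\tENST00000372770\tNM_001499\tuc004bvi.4\n"
    "5690\tENSG00000126067\tENST00000373237\tNM_002794\tuc001bzf.4\n"
    "5690\tENSG00000126067\tENST00000373237\tNM_001199779\tuc001bzf.4\n"
    "5690\tENSG00000126067\tENST00000621781\tNM_001199780\tuc021olh.3\n"
)


def read_gene_transcripts(handle):
    """Map each NCBI gene ID to its sorted, de-duplicated Ensembl transcript IDs."""
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_id = row["NCBI gene ID"].strip()
        transcript_id = row["Transcript stable ID"].strip()
        # Skip rows where either identifier is empty (unmapped entries).
        if gene_id and transcript_id:
            mapping[gene_id].add(transcript_id)
    return {gene: sorted(ids) for gene, ids in mapping.items()}


gene_to_transcripts = read_gene_transcripts(io.StringIO(SAMPLE))
print(gene_to_transcripts["5690"])  # ['ENST00000373237', 'ENST00000621781']
```

Collecting transcripts in a set removes the duplicate row for ENST00000373237, which appears twice only because it carries two RefSeq mRNA IDs.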

Download procedure

  1. Go to [Ensembl BioMarts](http://www.ensembl.org/biomart/martview)

  2. Click Dataset on the left menu

    Select Ensembl Genes 89 in the drop-down menu CHOOSE DATABASE

    Select Human genes (GRCh38.p10) in the drop-down menu CHOOSE DATASET

  3. Click Filters on the left menu

    Click GENE: in the main menu (center)

    Check Gene type and select protein_coding

  4. Click Attributes on the left menu

    Check Features in the center

  5. Click both GENE: and EXTERNAL: in the main menu (center)

    GENE: -> Ensembl -> Uncheck Gene stable ID and Transcript stable ID

    Check the following items in order:

    • EXTERNAL: -> External References (max 3) -> NCBI gene ID

    • GENE: -> Ensembl -> Gene stable ID

    • GENE: -> Ensembl -> Transcript stable ID

    • EXTERNAL: -> External References (max 3) -> RefSeq mRNA ID

    • EXTERNAL: -> External References (max 3) -> UCSC Stable ID

  6. Click the button Results on the top left

  7. Click the button Go in the top center

  • File names in the source:
    • Ensembl_GRCh38_EnsemblDB_v84.txt

    • Ensembl_GRCh38_EnsemblDB_v85.txt

    • Ensembl_GRCh38_EnsemblDB_v86.txt

    • Ensembl_GRCh38_EnsemblDB_v87.txt

    • Ensembl_GRCh38_EnsemblDB_v88.txt

### `chem_xref.tsv` from `MetaNetX`

  • Download [chem_xref.tsv](http://www.metanetx.org/cgi-bin/mnxget/mnxref/chem_xref.tsv) from [MetaNetX](http://www.metanetx.org/)

  • File name in the source: chem_xref.tsv

### `gene2ensembl.gz` from `NCBI FTP`

  • Download [gene2ensembl.gz](ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz) from [NCBI FTP](ftp://ftp.ncbi.nlm.nih.gov/)

  • File name in the source: gene2ensembl.gz
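For orientation, here is a minimal Python 3 sketch of reading a gzip-compressed gene2ensembl file and keeping only human entries. The column order (tax ID, GeneID, Ensembl gene, RNA accession, Ensembl transcript, protein accession, Ensembl protein) is an assumption that should be checked against the header of the downloaded file. `read_gene2ensembl` is a hypothetical helper rather than a function from the GeTPRA source, and the sample rows use placeholder values where the real ones are not given in this README.

```python
import gzip
import os
import tempfile
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def read_gene2ensembl(path, tax_id=HUMAN_TAX_ID):
    """Return {NCBI GeneID: sorted Ensembl transcript IDs} for one taxon."""
    mapping = defaultdict(set)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # header line
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5 or fields[0] != tax_id:
                continue
            gene_id, transcript = fields[1], fields[4]
            if transcript != "-":  # "-" is assumed to mark a missing value
                # Drop a version suffix such as ".4" if one is present.
                mapping[gene_id].add(transcript.split(".")[0])
    return {gene: sorted(ids) for gene, ids in mapping.items()}


# Tiny illustrative sample so the sketch runs without the real download.
SAMPLE_ROWS = [
    "#tax_id\tGeneID\tEnsembl_gene_identifier\tRNA_nucleotide_accession.version\t"
    "Ensembl_rna_identifier\tprotein_accession.version\tEnsembl_protein_identifier",
    "9606\t2733\tENSG00000119392\t-\tENST00000372770\t-\t-",
    "9606\t5690\tENSG00000126067\t-\t-\t-\t-",
    "10090\t1\tPLACEHOLDER_GENE\t-\tPLACEHOLDER_TRANSCRIPT.1\t-\t-",
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "gene2ensembl.gz")
    with gzip.open(sample_path, "wt") as out:
        out.write("\n".join(SAMPLE_ROWS) + "\n")
    human_map = read_gene2ensembl(sample_path)

print(human_map)  # {'2733': ['ENST00000372770']}
```

Reading the file line by line keeps memory use low, which matters because the real file covers many species and only a small fraction of rows are human.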

### `appris_data.principal.txt` from `APPRIS`

  • Download [appris_data.principal.txt](http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/GRCh38/appris_data.principal.txt) for annotation of principal isoform from [APPRIS](http://appris.bioinfo.cnio.es/)

  • File name in the source: appris_data.principal.txt

### `subcellular_location.csv` from `The Human Protein Atlas`

  • Download [subcellular_location.csv](http://www.proteinatlas.org/download/subcellular_location.csv.zip) with subcellular localization information from [The Human Protein Atlas](http://www.proteinatlas.org/)

  • File name in the source: subcellular_location.csv

### Recon model

  • Prepare a human genome-scale metabolic model that has consistent TPR associations and shows biologically reasonable simulation performance

  • File name in the source: Recon2M.1_Entrez_Gene.xml

### EFICAz output file as input file

  • File name in the source: 20170110_EFICAz_result.txt

  • EFICAz can be run with a different set of peptide sequences (see below).

### WoLF PSort output file as input file

  • File name in the source: 20170110_WoLFPSort_result.txt

  • WoLF PSort can be run with a different set of peptide sequences (see below).

### BRENDA data

  • Set the user email address and password before running the GeTPRA framework. The framework programmatically fetches data from BRENDA through its API.

## Implementation

Note: All the arguments shown below must be provided when running the framework.

Note: Make sure to provide your own BRENDA account information for -brenda_email and -brenda_pw.

Note: Running this source code takes a long time (approximately 8 h).

$ python run_GeTPRA_framework.py \
    -output_dir ./getpra_results/ \
    -ec ./input_data/getpra_inputs/20170110_EFICAz_result.txt \
    -sl ./input_data/getpra_inputs/20170110_WoLFPSort_result.txt \
    -brenda_email user_email_address \
    -brenda_pw user_password \
    -mnx_xref ./input_data/getpra_inputs/chem_xref.tsv \
    -ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt \
    -appris ./input_data/getpra_inputs/appris_data.principal.txt \
    -model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
    -hpa ./input_data/getpra_inputs/subcellular_location.csv \
    -ncbi_id_information ./input_data/getpra_inputs/gene2ensembl.gz
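
Because a full run takes about 8 h, it can be worth confirming that every input file exists before starting. The sketch below is a hypothetical pre-flight check in Python 3. It reuses the flag names and paths from the command above, but `find_missing_inputs` is an illustrative helper and not part of run_GeTPRA_framework.py.

```python
import os

# File-valued flags taken from the command above.
INPUT_FLAGS = [
    "-ec", "-sl", "-mnx_xref", "-ensembl",
    "-appris", "-model", "-hpa", "-ncbi_id_information",
]


def find_missing_inputs(arguments):
    """Return (flag, path) pairs whose path is unset or not an existing file."""
    missing = []
    for flag in INPUT_FLAGS:
        path = arguments.get(flag)
        if not path or not os.path.isfile(path):
            missing.append((flag, path))
    return missing


# Paths as used in the example command; adjust them to the local checkout.
base = "./input_data/getpra_inputs/"
arguments = {
    "-ec": base + "20170110_EFICAz_result.txt",
    "-sl": base + "20170110_WoLFPSort_result.txt",
    "-mnx_xref": base + "chem_xref.tsv",
    "-ensembl": base + "Ensembl_GRCh38_EnsemblDB_v88.txt",
    "-appris": base + "appris_data.principal.txt",
    "-model": base + "Recon2M.1_Entrez_Gene.xml",
    "-hpa": base + "subcellular_location.csv",
    "-ncbi_id_information": base + "gene2ensembl.gz",
}

for flag, path in find_missing_inputs(arguments):
    print("Missing input for {}: {}".format(flag, path))
```

Run from the directory that contains run_GeTPRA_framework.py, the loop prints nothing when all eight files are in place.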

## Output files from the GeTPRA framework

  • Raw output files from the GeTPRA framework, which were used for the publication, are available in: getpra/getpra_results_publication_version/

  • New output files produced by running the framework are written to: getpra/getpra_results/. This folder is created automatically.

## Implementation of EFICAz and WoLF PSort (optional)

Output files of EFICAz and WoLF PSort serve as input files for the GeTPRA framework.

### Peptide sequences of metabolic genes as inputs for EFICAz and WoLF PSort

  • Get peptide sequences of metabolic genes by running ./input_data/get_peptide_sequences.py

$ python ./input_data/get_peptide_sequences.py \
-output_dir ./input_data/getpra_inputs/ \
-model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
-ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt
  • File name in the source: Ensembl_peptide_seq_metabolic_genes.fa

    >ENSP00000452494|ENST00000448914
    TGGY
    >ENSP00000488240|ENST00000631435
    GTGG
    >ENSP00000487941|ENST00000632684
    GTGG
    >ENSP00000451515|ENST00000434970
    PSY
    >ENSP00000451042|ENST00000415118
    EI
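
Each FASTA header in this file pairs an Ensembl protein ID with its transcript ID, separated by `|`. The following minimal Python 3 sketch shows one way to read such records into a dictionary keyed by transcript ID. It assumes every header has exactly the `>protein|transcript` form shown above, and `read_peptides` is a hypothetical helper that does not come from the GeTPRA source.

```python
import io

# First records of the example above, one header and one sequence line each.
SAMPLE_FASTA = """\
>ENSP00000452494|ENST00000448914
TGGY
>ENSP00000488240|ENST00000631435
GTGG
>ENSP00000451515|ENST00000434970
PSY
"""


def read_peptides(handle):
    """Return {transcript ID: (protein ID, peptide sequence)} from a FASTA handle."""
    records = {}
    protein_id = transcript_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if transcript_id is not None:
                records[transcript_id] = (protein_id, "".join(chunks))
            protein_id, transcript_id = line[1:].split("|", 1)
            chunks = []
        else:
            # Sequences may span several lines; collect and join them.
            chunks.append(line)
    if transcript_id is not None:
        records[transcript_id] = (protein_id, "".join(chunks))
    return records


peptides = read_peptides(io.StringIO(SAMPLE_FASTA))
print(peptides["ENST00000448914"])  # ('ENSP00000452494', 'TGGY')
```

A header without the `|` separator raises a `ValueError` in this sketch, which makes malformed input visible early instead of silently skipping it.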

### Installation

  1. Download [EFICAz2.5](http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html)

  2. Set environment variable for EFICAz2.5

$ export EFICAz25_PATH="[insert-destination-directory]/EFICAz2.5.1/"
$ export PATH="${PATH}:${EFICAz25_PATH}"

  3. Download [WoLF PSort](https://github.com/fmaguire/WoLFPSort)

  4. Set environment variable for WoLF PSort

$ export WoLFPSort_PATH="[insert-destination-directory]/WoLFPSort/"
$ export PATH="${PATH}:${WoLFPSort_PATH}"

### Implementation of EFICAz and WoLF PSort

  • Predict EC numbers using [EFICAz2.5](http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html)

$ python $EFICAz25_PATH/eficaz2.5 Ensembl_peptide_seq_metabolic_genes.fa

  • Predict subcellular localization using [WoLF PSort](https://github.com/fmaguire/WoLFPSort)

    $ $WoLFPSort_PATH/bin/runWolfPsortSummary animal < Ensembl_peptide_seq_metabolic_genes.fa
    
  • Output files from the above two runs, obtained using peptide sequences of all human genes, are available in:
    • EFICAz: getpra/input_data/20170110_EFICAz_result_using_all_human_genes.txt

    • WoLF PSort: getpra/input_data/20170110_WoLFPSort_result_using_all_human_genes.txt

### Extract metabolic genes from EFICAz and WoLF PSort output data obtained with all human genes

  • Extract metabolic genes from the EFICAz and WoLF PSort output data

$ python ./input_data/get_EFICAz_WolfPSort_results.py \
-output_dir ./input_data/getpra_inputs/ \
-model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
-ec ./input_data/getpra_inputs/20170110_EFICAz_result_using_all_human_genes.txt \
-sl ./input_data/getpra_inputs/20170110_WoLFPSort_result_using_all_human_genes.txt \
-ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt
  • Resulting output files in the source:
    • EFICAz: getpra/input_data/Trimmed_EFICAz_result.txt

    • WoLF PSort: getpra/input_data/Trimmed_WoLFPSort_result.txt

## Publication

Jae Yong Ryu, Hyun Uk Kim & Sang Yup Lee. Framework and resource for more than 11,000 gene-transcript-protein-reaction associations in human metabolism. Proc. Natl. Acad. Sci. U.S.A., 2017. http://www.pnas.org/content/early/2017/10/23/1713050114