GeTPRA
======

.. automodule:: GeTPRA 
    :members: 
    :undoc-members: 
    :show-inheritance: 
    
    **The GeTPRA framework**
    
    This project develops a framework that systematically predicts Gene-Transcript-Protein-Reaction Associations (GeTPRA) in human metabolism and updates a human genome-scale metabolic model (GEM) accordingly. This source code implements the GeTPRA framework.
    
    ------
    
    **Features**

    This source code executes the following steps of the GeTPRA framework in order:

    - Get reactions from biochemical database using EC number
    - Standardize metabolite information
    - Compartmentalize metabolic reactions
    - Generate GeTPRA
    - Check 'Exist in Recon 2M.1'
    - Check 'Blocked reaction'
    - Check 'Experimental evidence available'

    See below for the implementation of EFICAz and WoLF PSort (optional features).

    **Installation**

    **Procedure**

    **Note**: This source code was developed on Linux and has been tested on Ubuntu 14.04.5 LTS (i7-4770 CPU @ 3.40GHz).

    1. Clone the repository

    2. Create and activate virtual environment
    
    .. code-block::

            $ virtualenv venv
            $ source venv/bin/activate

    3. Install packages at the root of the repository
    
    .. code-block::

            $ pip install pip --upgrade
            $ pip install -r requirements.txt

    **Input files for the GeTPRA framework**

    The following working input files can be found in `getpra/input_data/getpra_inputs/`. These files were used to generate the data presented in the manuscript.

    **Gene-transcript ID annotation file format**

    - Download the gene-transcript ID annotation file from `Ensembl BioMart <http://www.ensembl.org/biomart/martview>`__
    
    .. code-block::

            NCBI gene ID	Gene stable ID	Transcript stable ID	RefSeq mRNA ID	UCSC Stable ID
            2733	ENSG00000119392	ENST00000309971	NM_001003722	uc004bvj.4
            2733	ENSG00000119392	ENST00000372770	NM_001499	uc004bvi.4
            5690	ENSG00000126067	ENST00000373237	NM_002794	uc001bzf.4
            5690	ENSG00000126067	ENST00000373237	NM_001199779	uc001bzf.4
            5690	ENSG00000126067	ENST00000621781	NM_001199780	uc021olh.3
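
    Because the annotation file is plain tab-separated text, it can be inspected before it is passed to the framework. The following is a minimal sketch, not part of the GeTPRA source. It assumes the header row shown above, and the helper name ``transcripts_by_gene`` is hypothetical.

    .. code-block:: python

        import csv
        import io
        from collections import defaultdict

        # Three rows copied from the example above; a real run would open the
        # downloaded file instead of this in-memory sample.
        SAMPLE = (
            "NCBI gene ID\tGene stable ID\tTranscript stable ID\tRefSeq mRNA ID\tUCSC Stable ID\n"
            "2733\tENSG00000119392\tENST00000309971\tNM_001003722\tuc004bvj.4\n"
            "2733\tENSG00000119392\tENST00000372770\tNM_001499\tuc004bvi.4\n"
            "5690\tENSG00000126067\tENST00000373237\tNM_002794\tuc001bzf.4\n"
        )

        def transcripts_by_gene(handle):
            """Map each NCBI gene ID to the sorted Ensembl transcript IDs listed for it."""
            mapping = defaultdict(set)
            for row in csv.DictReader(handle, delimiter="\t"):
                gene_id = row["NCBI gene ID"]
                if gene_id:  # skip rows whose NCBI gene ID field is empty
                    mapping[gene_id].add(row["Transcript stable ID"])
            return {gene: sorted(ids) for gene, ids in mapping.items()}

        genes = transcripts_by_gene(io.StringIO(SAMPLE))
        print(genes["2733"])  # ['ENST00000309971', 'ENST00000372770']

    Collecting transcript IDs in a set removes the duplicates that arise when one transcript is listed with several RefSeq mRNA IDs.
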

    **Download procedure**

    1. Go to `Ensembl BioMart <http://www.ensembl.org/biomart/martview>`__
    2. Click *Dataset* on the left menu

        Select *Ensembl Genes 89* in the drop-down menu *CHOOSE DATABASE*

        Select *Human genes (GRCh38.p10)* in the drop-down menu *CHOOSE DATASET*

    3. Click *Filters* on the left menu

        Click *GENE:* in the main menu (center)

        Check *Gene type* and select *protein_coding*

    4. Click *Attributes* on the left menu

        Check *Features* in the center

    5. Click both *GENE:* and *EXTERNAL:* in the main menu (center)

        *GENE:* -> *Ensembl* -> Uncheck *Gene stable ID* and *Transcript stable ID*

        Check the following items in order:

        - *EXTERNAL:* -> *External References (max 3)* -> *NCBI gene ID*
        - *GENE:* -> *Ensembl* -> *Gene stable ID*
        - *GENE:* -> *Ensembl* -> *Transcript stable ID*
        - *EXTERNAL:* -> *External References (max 3)* -> *RefSeq mRNA ID*
        - *EXTERNAL:* -> *External References (max 3)* -> *UCSC Stable ID*

    6. Click the button *Results* on the top left
    7. Click the button *Go* in the top center

    - File names in the source:

      - `Ensembl_GRCh38_EnsemblDB_v84.txt`
      - `Ensembl_GRCh38_EnsemblDB_v85.txt`
      - `Ensembl_GRCh38_EnsemblDB_v86.txt`
      - `Ensembl_GRCh38_EnsemblDB_v87.txt`
      - `Ensembl_GRCh38_EnsemblDB_v88.txt`

    *chem_xref.tsv from MetaNetX*

    - Download `chem_xref.tsv <http://www.metanetx.org/cgi-bin/mnxget/mnxref/chem_xref.tsv>`__ from `MetaNetX <http://www.metanetx.org/>`__
    - File name in the source: `chem_xref.tsv`

    *gene2ensembl.gz from NCBI FTP*

    - Download `gene2ensembl.gz <ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz>`__ from `NCBI FTP <ftp://ftp.ncbi.nlm.nih.gov/>`__
    - File name in the source: `gene2ensembl.gz`
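
    The compressed file can be read directly without unpacking it first. The sketch below is illustrative only and is not taken from the GeTPRA source. It assumes that the first three tab-separated columns are the taxonomy ID, the NCBI GeneID and the Ensembl gene identifier, so check the header line of your download. The function name ``human_gene_to_ensembl`` and the in-memory sample rows are hypothetical.

    .. code-block:: python

        import gzip
        import io

        def human_gene_to_ensembl(handle, tax_id="9606"):
            """Return {NCBI GeneID: Ensembl gene ID} for one taxon (9606 is human)."""
            mapping = {}
            for line in handle:
                if line.startswith("#"):  # header line
                    continue
                fields = line.rstrip("\n").split("\t")
                if fields[0] == tax_id:
                    mapping[fields[1]] = fields[2]
            return mapping

        # Illustrative stand-in for gene2ensembl.gz; the second data row is a
        # made-up non-human placeholder that the filter should drop.
        raw = (
            "#tax_id\tGeneID\tEnsembl_gene_identifier\n"
            "9606\t2733\tENSG00000119392\n"
            "10090\t0\tENSMUSG_PLACEHOLDER\n"
        )
        buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
        with gzip.open(buffer, mode="rt", encoding="utf-8") as handle:
            mapping = human_gene_to_ensembl(handle)
        print(mapping)  # {'2733': 'ENSG00000119392'}

    For the real file, replace ``buffer`` with the path to ``gene2ensembl.gz``.
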

    *appris_data.principal.txt from APPRIS*

    - Download `appris_data.principal.txt <http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/GRCh38/appris_data.principal.txt>`__ for annotation of principal isoforms from `APPRIS <http://appris.bioinfo.cnio.es/>`__
    - File name in the source: `appris_data.principal.txt`

    *subcellular_location.csv from The Human Protein Atlas*

    - Download `subcellular_location.csv <http://www.proteinatlas.org/download/subcellular_location.csv.zip>`__ with subcellular localization information from `The Human Protein Atlas <http://www.proteinatlas.org/>`__
    - File name in the source: `subcellular_location.csv`

    *Recon model*

    - Prepare a human genome-scale metabolic model that has consistent TPR associations and shows biologically reasonable simulation performance
    - File name in the source: `Recon2M.1_Entrez_Gene.xml`

    *EFICAz output file as input file*

    - File name in the source: `20170110_EFICAz_result.txt`
    - EFICAz can be run with a different set of peptide sequences (see below).

    *WoLF PSort output file as input file*

    - File name in the source: `20170110_WoLFPSort_result.txt`
    - WoLF PSort can be run with a different set of peptide sequences (see below).

    *BRENDA data*

    - Set your BRENDA account email address and password before running the GeTPRA framework. The framework fetches BRENDA data programmatically through the BRENDA API.

    **Implementation**

    **Note**: All the arguments shown below should be provided when running the framework.

    **Note**: Make sure to provide your own BRENDA credentials for `-brenda_email` and `-brenda_pw`.

    **Note**: Running this source code takes a long time (about 8 h).
    
    .. code-block::

        $ python run_GeTPRA_framework.py \
            -output_dir ./getpra_results/ \
            -ec ./input_data/getpra_inputs/20170110_EFICAz_result.txt \
            -sl ./input_data/getpra_inputs/20170110_WoLFPSort_result.txt \
            -brenda_email user_email_address \
            -brenda_pw user_password \
            -mnx_xref ./input_data/getpra_inputs/chem_xref.tsv \
            -ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt \
            -appris ./input_data/getpra_inputs/appris_data.principal.txt \
            -model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
            -hpa ./input_data/getpra_inputs/subcellular_location.csv \
            -ncbi_id_information ./input_data/getpra_inputs/gene2ensembl.gz
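
    Because a full run takes hours, it may be worth confirming that every input file exists before launching it. The following is a minimal sketch under that assumption, not part of the GeTPRA source. The flags and paths are copied from the command above, while the helper ``missing_inputs`` is hypothetical.

    .. code-block:: python

        from pathlib import Path

        # Flags and paths copied from the command above.
        INPUTS = {
            "-ec": "./input_data/getpra_inputs/20170110_EFICAz_result.txt",
            "-sl": "./input_data/getpra_inputs/20170110_WoLFPSort_result.txt",
            "-mnx_xref": "./input_data/getpra_inputs/chem_xref.tsv",
            "-ensembl": "./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt",
            "-appris": "./input_data/getpra_inputs/appris_data.principal.txt",
            "-model": "./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml",
            "-hpa": "./input_data/getpra_inputs/subcellular_location.csv",
            "-ncbi_id_information": "./input_data/getpra_inputs/gene2ensembl.gz",
        }

        def missing_inputs(inputs, root="."):
            """Return the flags whose files do not exist under ``root``."""
            return sorted(
                flag for flag, path in inputs.items()
                if not (Path(root) / path).is_file()
            )

        missing = missing_inputs(INPUTS)
        if missing:
            print("Missing input files for:", ", ".join(missing))
        else:
            print("All input files found; ready to run run_GeTPRA_framework.py")

    Run the check from the directory that contains ``run_GeTPRA_framework.py`` so that the relative paths resolve as they do in the command.
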

    **Output files from the GeTPRA framework**

    - Raw output files from the GeTPRA framework, which were used for the publication, are available in `getpra/getpra_results_publication_version/`
    - New output files are generated in `getpra/getpra_results/` when the framework is run. This folder is created automatically.

    *Implementation of EFICAz and WoLF PSort (optional)*

    Output files of EFICAz and WoLF PSort serve as input files for the GeTPRA framework.

    *Peptide sequences of metabolic genes as inputs for EFICAz and WoLF PSort*

    - Get peptide sequences of metabolic genes by running `./input_data/get_peptide_sequences.py`
    
    .. code-block::

            $ python ./input_data/get_peptide_sequences.py \
            -output_dir ./input_data/getpra_inputs/ \
            -model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
            -ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt

    - File name in the source: `Ensembl_peptide_seq_metabolic_genes.fa`. Example records::

            >ENSP00000452494|ENST00000448914
            TGGY
            >ENSP00000488240|ENST00000631435
            GTGG
            >ENSP00000487941|ENST00000632684
            GTGG
            >ENSP00000451515|ENST00000434970
            PSY
            >ENSP00000451042|ENST00000415118
            EI
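
    Each record header in this file joins an Ensembl protein ID and a transcript ID with ``|``. A minimal sketch of reading such records is given here for orientation only. It is not the parser used by the GeTPRA source, and the function name ``read_fasta`` is hypothetical.

    .. code-block:: python

        import io

        def read_fasta(handle):
            """Yield (protein_id, transcript_id, sequence) from 'ENSP|ENST' FASTA records."""
            header, chunks = None, []
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    if header is not None:
                        yield (*header, "".join(chunks))
                    protein_id, transcript_id = line[1:].split("|", 1)
                    header, chunks = (protein_id, transcript_id), []
                else:
                    chunks.append(line)  # a sequence may span several lines
            if header is not None:
                yield (*header, "".join(chunks))

        # Two records copied from the example above.
        SAMPLE = (
            ">ENSP00000452494|ENST00000448914\n"
            "TGGY\n"
            ">ENSP00000451515|ENST00000434970\n"
            "PSY\n"
        )

        records = list(read_fasta(io.StringIO(SAMPLE)))
        print(records[0])  # ('ENSP00000452494', 'ENST00000448914', 'TGGY')
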

    *Installation*

    1. Download `EFICAz2.5 <http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html>`__

    2. Set environment variable for EFICAz2.5
    
      .. code-block::

            $ export EFICAz25_PATH="[insert-destination-directory]/EFICAz2.5.1/"
            $ export PATH="${PATH}:${EFICAz25_PATH}"

    3. Download `WoLF PSort <https://github.com/fmaguire/WoLFPSort>`__

    4. Set environment variable for WoLF PSort
    
      .. code-block::

            $ export WoLFPSort_PATH="[insert-destination-directory]/WoLFPSort/"
            $ export PATH="${PATH}:${WoLFPSort_PATH}"

    *Implementation of EFICAz and WoLF PSort*

    - Predict EC numbers using `EFICAz2.5 <http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html>`__
    
      .. code-block::

            $ python $EFICAz25_PATH/eficaz2.5 Ensembl_peptide_seq_metabolic_genes.fa

    - Predict subcellular localization using `WoLF PSort <https://github.com/fmaguire/WoLFPSort>`__
    
      .. code-block::

            $ $WoLFPSort_PATH/bin/runWolfPsortSummary animal < Ensembl_peptide_seq_metabolic_genes.fa

    - Output files from the above two runs, using peptide sequences of all human genes, are available in:

      - EFICAz: `getpra/input_data/20170110_EFICAz_result_using_all_human_genes.txt`
      - WoLF PSort: `getpra/input_data/20170110_WoLFPSort_result_using_all_human_genes.txt`

    *Extract metabolic genes from EFICAz and WoLF PSort output data obtained with all human genes*

    - Extract metabolic genes from the EFICAz and WoLF PSort output data
      
      .. code-block::
      
            $ python ./input_data/get_EFICAz_WolfPSort_results.py \
            -output_dir ./input_data/getpra_inputs/ \
            -model ./input_data/getpra_inputs/Recon2M.1_Entrez_Gene.xml \
            -ec ./input_data/getpra_inputs/20170110_EFICAz_result_using_all_human_genes.txt \
            -sl ./input_data/getpra_inputs/20170110_WoLFPSort_result_using_all_human_genes.txt \
            -ensembl ./input_data/getpra_inputs/Ensembl_GRCh38_EnsemblDB_v88.txt

    - Resulting output files in the source:

      - EFICAz: `getpra/input_data/Trimmed_EFICAz_result.txt`
      - WoLF PSort: `getpra/input_data/Trimmed_WoLFPSort_result.txt`
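
    Conceptually, the trimming step keeps only the result lines that belong to metabolic genes. The sketch below illustrates that idea and is not the logic of ``get_EFICAz_WolfPSort_results.py``. It assumes, hypothetically, that each result line starts with the ``ENSP|ENST`` identifier from the FASTA header. The real EFICAz and WoLF PSort output layouts are not described here, so the sample lines are placeholders.

    .. code-block:: python

        def keep_metabolic(lines, transcript_ids):
            """Keep lines whose leading 'ENSP|ENST' identifier names a wanted transcript."""
            kept = []
            for line in lines:
                tokens = line.split()
                if not tokens or "|" not in tokens[0]:
                    continue  # skip blank lines and lines without an identifier
                transcript_id = tokens[0].split("|")[1]
                if transcript_id in transcript_ids:
                    kept.append(line)
            return kept

        # Placeholder result lines; "<prediction>" stands for tool-specific output.
        results = [
            "ENSP00000452494|ENST00000448914 <prediction>",
            "ENSP00000451042|ENST00000415118 <prediction>",
        ]
        metabolic = {"ENST00000448914"}  # hypothetical set of metabolic transcripts
        print(keep_metabolic(results, metabolic))

    In practice the set of wanted transcripts would come from the genes in the Recon model, mapped through the Ensembl annotation file.
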

    **Publication**

    Jae Yong Ryu, Hyun Uk Kim & Sang Yup Lee. Framework and resource for more than 11,000 gene-transcript-protein-reaction associations in human metabolism. *Proc. Natl. Acad. Sci. U.S.A.*, 2017. http://www.pnas.org/content/early/2017/10/23/1713050114

    
    
    -------