PROTEOFORMER

What is PROTEOFORMER?

PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide-to-MS/MS matching. This search space can be combined with canonical protein databases or used independently to identify novel translation products. The pipeline makes use of ribosome profiling (RIBO-seq), a next-generation sequencing strategy that provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments. Because it maps the positions of translating ribosomes on mRNA with sub-codon precision, it can indicate which portion of the genome is actually being translated at the time of the experiment, and it can account for sequence variation such as single nucleotide polymorphisms, indels and RNA splicing.

Fig. 1 - Overview of the PROTEOFORMER pipeline.

PROTEOFORMER takes as input two FASTQ files (NGS read files representing ribosome-protected fragments (RPFs) of the elongating and initiating ribosomes) and outputs a FASTA protein sequence database of derived translation products based on Ensembl transcript annotations. In addition, specific metrics (e.g. metagenic classification, gene RPF abundance) are derived to allow verification of the RIBO-seq data quality. The alignment and RPF density information is also output so that it can be easily uploaded and visually evaluated in a genome browser.

The pipeline consists of eight major parts:

  1. Quality control: This preliminary step determines which genes are covered by RPFs and the total RPF count for each of these genes. In addition, a metagenic functional classification is compiled, depicting the distribution of RPFs over 5'UTR, exonic, intronic, 3'UTR, ncRNA and intergenic regions.
  2. Mapping: This step makes use of transcriptome mappers (STAR or TopHat2) to align RPFs from the input files against a reference genome based on the corresponding Ensembl annotation bundle.
  3. Transcript calling: This step uses the ribosome profiles of the elongating ribosomes to mark transcripts with experimental evidence of translation. Extra annotation (CCDS id, canonical transcript) is also added in this step.
  4. TIS calling: This step identifies translation initiation sites (TISs). It implements a rule-based algorithm that combines RIBO-seq information obtained with two translation inhibitors, an elongation inhibitor (e.g. cycloheximide) and an initiation inhibitor (e.g. lactimidomycin), to distinguish TISs from elongating ribosomes.
  5. Variation calling: This part of the pipeline uses SAMtools and/or dbSNP to identify variants in the mapped reads.
  6. Translation assembly: This step assembles all translation products based on the TIS, transcript isoform, and/or SNP information derived from the RIBO-seq data.
  7. Translation database: A non-redundant FASTA-formatted database of the derived translation products is generated, from which all duplicates and sub-sequences are removed.
  8. FLOSS calculation: Reference fractions and cut-off values are first calculated from known protein-coding transcripts. These are then used to calculate and classify a FLOSS score for each possible translation product.
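
To make step 7 more concrete, the following is a minimal Python sketch of how duplicates and sub-sequences could be removed before writing a FASTA database. It is an illustration only and not the pipeline's own code (PROTEOFORMER itself is written in Perl). The function names, the identifiers (tr1_aTIS etc.) and the example sequences are hypothetical, and the real pipeline may keep track of merged identifiers differently.

```python
def non_redundant(products):
    """Collapse duplicate sequences and drop sequences contained in a longer one.

    `products` maps a (hypothetical) identifier to a protein sequence.
    """
    # Collapse exact duplicates, keeping the first identifier seen per sequence.
    unique = {}
    for ident, seq in products.items():
        unique.setdefault(seq, ident)

    # Visit sequences from longest to shortest, so that every sequence that
    # could contain the current one has already been considered.
    kept = []
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return {unique[seq]: seq for seq in kept}


def to_fasta(products, width=60):
    """Format an identifier -> sequence mapping as FASTA text."""
    lines = []
    for ident, seq in products.items():
        lines.append(f">{ident}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Made-up example input (assumption: identifiers and sequences are illustrative).
example = {
    "tr1_aTIS": "MKTAYIAKQR",
    "tr2_aTIS": "MKTAYIAKQR",  # exact duplicate of tr1_aTIS
    "tr1_dTIS": "AYIAKQR",     # sub-sequence of tr1_aTIS
    "tr3_aTIS": "MSLLNDKV",
}
db = non_redundant(example)
print(to_fasta(db), end="")
```

In this example only tr1_aTIS and tr3_aTIS remain. The pairwise containment check is quadratic in the number of sequences, which is acceptable for a sketch; a large database would probably need an indexed approach.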

See our manuscript for a full explanation, and please cite it if you use PROTEOFORMER (http://dx.doi.org/10.1093/nar/gku1283).

PROTEOFORMER was developed in Perl 5 and is freely available for download in a script version and as a Galaxy implementation.

Galaxy Version

All tools can be downloaded from GitHub. The tool documentation and installation instructions are in this README. Pre-configured ublast- and blastp-formatted databases can be downloaded from the data download links below or generated as described in sections 3 and 4 of the README. The dbSNP databases can be downloaded from the NCBI website and the iGenomes from Illumina; they should be placed in the tool-data and igenomes folders, respectively, as described in the README.

Script Based Version

A script-based version of the pipeline can be found on GitHub. The requirements and documentation are available in the README.

Virtual Machine

A customized virtual machine (Ubuntu 12.04 LTS) with all script dependencies and a Galaxy server already installed can be downloaded from the link PROTEOFORMER VM. The Galaxy server instance is installed in the folder /opt/galaxy-server. The data dependencies can be downloaded and installed as described in the README of the Galaxy Version.

PROTEOFORMER was developed and tested on Mac and Ubuntu 12.04 LTS.

Downloads

  • Available from GitHub.
  • Virtual machine.
  • Data dependencies:
      • Illumina iGenomes.
      • dbSNP database.
      • Species-specific ublast- and blastp-formatted SWISSPROT databases [May 2014] can be downloaded below (see blast links).
      • Species-specific SQLite databases holding the necessary Ensembl information can also be downloaded below (see Esembl_SQLiteDB links), or can be generated using this Python script.

License

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. You may obtain a copy of the License at http://www.gnu.org/licenses/. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    Copyright (C) 2014 G. Menschaert, J. Crappé, E. Ndah, A. Koch & S. Steyaert
