Tuesday, September 6, 2016

GA4GH Reference Server - v.0.3.3 installation on CentOS 7

This post walks through the steps I took to install the GA4GH Reference Server, version 0.3.3, on CentOS 7. For the official instructions, refer to the ga4gh installation guide.

This installation uses the Apache web server as the front end that services requests, with ga4gh running behind Apache through mod_wsgi.

Install essential library packages

$ yum install python-devel python-virtualenv zlib-devel libxslt-devel openssl-devel libffi-devel redhat-rpm-config ncurses-devel ncurses samtools

Install Apache web server

yum install httpd
yum install mod_wsgi
mkdir /var/cache/httpd/python-egg-cache
chown apache:apache /var/cache/httpd/python-egg-cache

Create a file /etc/httpd/conf.d/ga4gh.conf with the following contents:

 <VirtualHost *:80>
 WSGIDaemonProcess ga4gh processes=10 threads=1 python-path=/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages python-eggs=/var/cache/httpd/python-egg-cache
 WSGIScriptAlias /ga4gh /srv/ga4gh/application.wsgi
 <Directory /srv/ga4gh>
   WSGIProcessGroup ga4gh
   WSGIApplicationGroup %{GLOBAL}
   Require all granted
 </Directory>
 </VirtualHost>

Install ga4gh reference server

 mkdir -p /srv/ga4gh
 cd /srv/ga4gh  
 virtualenv ga4gh-server-env  
 source ga4gh-server-env/bin/activate  
 pip install ga4gh  

Install the Python packages that were not installed automatically:
pip install flask
pip install flask-cors
pip install humanize
pip install oic
pip install protobuf
pip install pysam

When I tried to run some of the ga4gh command-line scripts, I hit errors like the following:
[root@gatkslave ga4gh]# source ga4gh-server-env/bin/activate
(ga4gh-server-env)[root@gatkslave ga4gh]# ga4gh_repo init registry.db
Traceback (most recent call last):
  File "/srv/ga4gh/ga4gh-server-env/bin/ga4gh_repo", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 3007, in <module>
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 728, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 626, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15

It appears that several python packages were still missing. I used the following commands to install them all, one by one:
pip install sphinx-argparse==0.1.15
pip install lxml==3.4.4
pip install pyOpenSSL==0.15.1
pip install oic==0.7.6
pip install requests==2.7.0
pip install pysam==0.9.0
pip install protobuf==3.0.0.b3
pip install Flask==0.10.1
pip install Flask-Cors==2.0.1
pip install pyjwkest==1.0.1
pip install Jinja2==2.7.3
pip install pycparser==2.14
pip install cffi==1.5.2
yum install libffi-devel
pip install ipaddress==1.0.16
pip install enum34==1.1.2
pip install pyasn1==0.1.9
pip install idna==2.1
pip install cryptography==1.3.1
pip uninstall pycryptodomex
pip uninstall pycryptodome
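
The DistributionNotFound error above is raised when an exact pin such as sphinx-argparse==0.1.15 is not satisfied in the virtualenv. As a rough sketch of that check, the helper below compares a list of pins against a mapping of installed versions. The function names and the sample `installed` mapping are hypothetical and only for illustration; in a real environment the mapping could be built from `pip freeze` output.

```python
# Sketch only: `parse_pin` and `find_unmet_pins` are hypothetical helpers,
# not part of ga4gh, pip, or pkg_resources.


def parse_pin(line):
    """Split 'name==version' into a (normalized name, version) pair."""
    name, _, version = line.strip().partition("==")
    return name.lower().replace("_", "-"), version


def find_unmet_pins(pins, installed):
    """Return the pins whose exact version is not present in `installed`.

    `installed` maps normalized package names to version strings.
    """
    unmet = []
    for line in pins:
        name, version = parse_pin(line)
        if installed.get(name) != version:
            unmet.append("%s==%s" % (name, version))
    return unmet


if __name__ == "__main__":
    pins = ["sphinx-argparse==0.1.15", "lxml==3.4.4", "Flask==0.10.1"]
    # Made-up stand-in for the real environment.
    installed = {"lxml": "3.4.4", "flask": "0.11"}
    for pin in find_unmet_pins(pins, installed):
        print("unmet: %s" % pin)
```

Running a check like this before calling the ga4gh scripts lists every unmet pin at once, instead of surfacing them one traceback at a time.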

Disable SELinux
# put SELinux into permissive mode (this lasts until the next reboot)
setenforce 0

Make everything under /srv traversable by Apache by adding the execute bit recursively:
chmod -R +x /srv

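Create the WSGI file at /srv/ga4gh/application.wsgi with the following contents:
from ga4gh.frontend import app as application
import ga4gh.frontend as frontend
# load the settings from the configuration file created in the next step
frontend.configure("/srv/ga4gh/config.py")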
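
mod_wsgi expects the script named in WSGIScriptAlias to expose a callable called `application`, which is why the WSGI file imports the Flask app under that name. As a minimal sketch of that contract, the stand-in below is a bare WSGI callable (it is not the ga4gh app) together with a hypothetical `call_app` helper that invokes it in-process, without Apache.

```python
# Sketch only: a stand-in WSGI callable, not the ga4gh Flask application.
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Answer every request with a short plain-text body."""
    body = b"hello from a stand-in WSGI app\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI callable in-process and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app(application, "/ga4gh/"))
```

Calling the real `application` object this way from inside the virtualenv can help separate import or configuration errors in ga4gh from problems in the Apache setup.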

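Create the configuration file at /srv/ga4gh/config.py with the following contents. DATA_SOURCE should point at the repository database you populate in the data import steps, so adjust the path if yours lives elsewhere: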
DATA_SOURCE = "/srv/ga4gh/ga4gh-example-data/repo.db"

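Install bgzip

bgzip and tabix are built from the htslib source. The steps below assume that htslib-1.3.1.tar.bz2 has already been downloaded to /usr/src:

cd /usr/src
bzip2 -d htslib-1.3.1.tar.bz2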
tar xvf htslib-1.3.1.tar
cd htslib-1.3.1
make prefix=/usr install

Data import

Reference set

wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
bgzip hs37d5.fa
ga4gh_repo add-referenceset registry.db /srv/ga4gh/hs37d5.fa.gz -d "NCBI assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37
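
The reference is decompressed and then recompressed with bgzip instead of being used as downloaded, presumably because the server needs block-compressed (BGZF) input for indexed access. As a hedged sketch, a file can be checked for the BGZF block header before it is registered. The helper names below are hypothetical, and the check only inspects the first gzip extra subfield, which is where bgzip writes its "BC" marker.

```python
# Sketch only: `looks_like_bgzf` and `file_looks_like_bgzf` are hypothetical
# helpers, not part of htslib, pysam, or ga4gh.


def looks_like_bgzf(header):
    """Return True if the first 18 bytes look like a BGZF block header."""
    return (
        len(header) >= 18
        and header[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA set
        and header[12:14] == b"BC"             # BGZF extra subfield identifier
    )


def file_looks_like_bgzf(path):
    """Read the start of `path` and apply the header check."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(18))
```

For example, `file_looks_like_bgzf("/srv/ga4gh/hs37d5.fa.gz")` should be true after the bgzip step, while a file written by plain gzip fails the check because it lacks the extra field.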


Sequence ontology

wget https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp.obo

ga4gh_repo add-ontology registry.db /srv/ga4gh/so-xp.obo -n so-xp

Create a new dataset

ga4gh_repo add-dataset registry.db NA12878_sample1_rerun_sg1_snvcalls --description "1000genomes genome, an Illumina platinum genome and one that the Kinghorn guys use for testing their sequencer"


Importing VCFs

If you have a directory of VCFs, you can loop over the files and bgzip each one:
for i in *.vcf; do bgzip "$i"; done

Then run tabix on each of the bgzipped files:
for i in *.gz; do tabix -p vcf "$i"; done

Then add the directory of indexed VCFs to the dataset as a variant set:
ga4gh_repo add-variantset registry.db NA12878_sample1_rerun_sg1_snvcalls /srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls --name NA12878_sample1_rerun_sg1 --referenceSetName NCBI37
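
The compress-and-index step can also be scripted. The sketch below only builds the bgzip and tabix command lines for each .vcf file in a directory, so they can be reviewed before anything is run. `compression_commands` and `run_all` are hypothetical helper names, and running the commands assumes bgzip and tabix are on the PATH.

```python
# Sketch only: helper names are hypothetical, not part of ga4gh or htslib.
import os
import subprocess


def compression_commands(directory):
    """Return bgzip and tabix argument lists for each *.vcf file, by name."""
    commands = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".vcf"):
            continue
        path = os.path.join(directory, name)
        commands.append(["bgzip", path])  # replaces path with path + ".gz"
        commands.append(["tabix", "-p", "vcf", path + ".gz"])  # writes .tbi
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

A typical use would be to print the result of `compression_commands("/srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls")` first, and pass the same list to `run_all` once it looks right.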

Test query

curl -X POST -H 'Content-Type: application/json' -d '{"variantSetId": "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0", "referenceName": "22", "start": 17190024, "end": 17671934}' http://your.server/ga4gh/variants/search
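
The variantSetId in the query looks opaque, but it appears to be an unpadded URL-safe base64 encoding of a JSON list. Decoding the ID above gives the dataset name, the marker "vs", and the variant set name. The sketch below decodes such an ID and builds the same request body as the curl command. The helper names are hypothetical, and the encoding is inferred from this one ID, not taken from the server documentation.

```python
# Sketch only: helper names are hypothetical, not part of the ga4gh package.
import base64
import json

# The variantSetId used in the curl example.
EXAMPLE_ID = (
    "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJO"
    "QTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0"
)


def decode_compound_id(obfuscated):
    """Decode an unpadded URL-safe base64 ID into its JSON list of parts."""
    padded = obfuscated + "=" * (-len(obfuscated) % 4)  # restore padding
    raw = base64.urlsafe_b64decode(padded.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def variant_search_body(variant_set_id, reference_name, start, end):
    """Serialize the JSON body for a POST to /variants/search."""
    return json.dumps(
        {
            "variantSetId": variant_set_id,
            "referenceName": reference_name,
            "start": int(start),
            "end": int(end),
        },
        sort_keys=True,
    )


if __name__ == "__main__":
    print(decode_compound_id(EXAMPLE_ID))
    print(variant_search_body(EXAMPLE_ID, "22", 17190024, 17671934))
```

Decoding an ID this way is a quick check that the query targets the dataset and variant set you registered.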

