This post walks through the steps I took to install the GA4GH Reference Server, version 0.3.3, on CentOS 7. For the official instructions, see the ga4gh installation guide.
In this setup the Apache web server is the front end that services requests, with ga4gh running behind it through mod_wsgi.
Install essential library packages
yum install python-devel python-virtualenv zlib-devel libxslt-devel openssl-devel libffi-devel redhat-rpm-config ncurses-devel ncurses samtools
Install Apache web server
yum install httpd
yum install mod_wsgi
mkdir /var/cache/httpd/python-egg-cache
chown apache:apache /var/cache/httpd/python-egg-cache
Create a file /etc/httpd/conf.d/ga4gh.conf with the following contents:
<VirtualHost *:80>
WSGIDaemonProcess ga4gh processes=10 threads=1 python-path=/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages python-eggs=/var/cache/httpd/python-egg-cache
WSGIScriptAlias /ga4gh /srv/ga4gh/application.wsgi
<Directory /srv/ga4gh>
WSGIProcessGroup ga4gh
WSGIApplicationGroup %{GLOBAL}
Require all granted
</Directory>
</VirtualHost>
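A mistyped python-path in the WSGIDaemonProcess line can be easy to miss, since nothing in the file itself tells you whether the directory exists. The following is a minimal sketch of a check, not part of the original setup: it pulls the python-path value out of the configuration text and reports directories that are missing. The function names are my own, and it assumes directories are separated by ":" as mod_wsgi expects on Unix.

```python
import os
import re


def wsgi_python_paths(conf_text):
    """Return every directory named in python-path= on WSGIDaemonProcess lines."""
    paths = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("WSGIDaemonProcess"):
            continue
        match = re.search(r"python-path=(\S+)", line)
        if match:
            # Several directories may be given, separated by ':'
            paths.extend(match.group(1).split(":"))
    return paths


def missing_python_paths(conf_text):
    """Return the python-path directories that do not exist on this machine."""
    return [p for p in wsgi_python_paths(conf_text) if not os.path.isdir(p)]
```

To use it, read /etc/httpd/conf.d/ga4gh.conf into a string once the virtualenv has been created and pass it to missing_python_paths; an empty list means every directory was found.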
Install ga4gh reference server
mkdir -p /srv/ga4gh
cd /srv/ga4gh
virtualenv ga4gh-server-env
source ga4gh-server-env/bin/activate
pip install ga4gh
Install some missing python packages that weren't automatically installed:
pip install flask
pip install flask-cors
pip install humanize
pip install oic
pip install protobuf
pip install pysam
When I tried to use some of the ga4gh command line scripts, I ran into some errors like the following:
[root@gatkslave ga4gh]# source ga4gh-server-env/bin/activate
(ga4gh-server-env)[root@gatkslave ga4gh]# ga4gh_repo init registry.db
Traceback (most recent call last):
  File "/srv/ga4gh/ga4gh-server-env/bin/ga4gh_repo", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 3007, in <module>
    working_set.require(__requires__)
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 728, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 626, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15
It appears that several python packages were still missing. I used the following commands to install them all, one by one:
pip install sphinx-argparse==0.1.15
pip install lxml==3.4.4
pip install pyOpenSSL==0.15.1
pip install oic==0.7.6
pip install requests==2.7.0
pip install pysam==0.9.0
pip install protobuf==3.0.0.b3
pip install Flask==0.10.1
pip install Flask-Cors==2.0.1
pip install pyjwkest==1.0.1
pip install Jinja2==2.7.3
pip install pycparser==2.14
pip install cffi==1.5.2
yum install libffi-devel
pip install ipaddress==1.0.16
pip install enum34==1.1.2
pip install pyasn1==0.1.9
pip install idna==2.1
pip install cryptography==1.3.1
pip uninstall pycryptodomex
pip uninstall pycryptodome
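Rather than discovering missing packages one traceback at a time, it may help to compare what is installed against the pinned versions in a single pass. This is a rough sketch under a few assumptions: the PINNED dictionary is only an excerpt of the list above, the helper names are mine, and versions are compared as plain strings (so a pin such as protobuf 3.0.0.b3, which the package metadata may report in a normalized form, is left out).

```python
try:
    from importlib import metadata as importlib_metadata  # Python 3.8+
except ImportError:
    importlib_metadata = None
    import pkg_resources  # ships with setuptools; used on Python 2.7

# Excerpt of the pins installed above; extend as needed.
PINNED = {
    "sphinx-argparse": "0.1.15",
    "lxml": "3.4.4",
    "pyOpenSSL": "0.15.1",
    "oic": "0.7.6",
    "requests": "2.7.0",
    "pysam": "0.9.0",
    "Flask": "0.10.1",
    "Flask-Cors": "2.0.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    if importlib_metadata is not None:
        try:
            return importlib_metadata.version(name)
        except importlib_metadata.PackageNotFoundError:
            return None
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each mismatched package to a (wanted, found) tuple."""
    problems = {}
    for name, wanted in sorted(pins.items()):
        found = lookup(name)
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

Run inside the activated virtualenv, check_pins(PINNED) returns an empty dictionary when every listed package is present at the expected version.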
Disable SELinux
# disable SELinux
setenforce 0
Make the /srv readable and writable:
Create the WSGI file at /srv/ga4gh/application.wsgi with the following contents:
from ga4gh.frontend import app as application
import ga4gh.frontend as frontend
frontend.configure("/srv/ga4gh/config.py")
Create the configuration file at /srv/ga4gh/config.py with the following contents:
DATA_SOURCE = "/srv/ga4gh/ga4gh-example-data/repo.db"
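Since the server depends on DATA_SOURCE pointing at a real repository file, a quick check before restarting Apache can save some guessing. The sketch below assumes the configuration file contains only plain Python assignments, as it does here; the function names are hypothetical and it does not use any ga4gh code.

```python
import os


def read_data_source(config_path):
    """Execute a Python-style config file and return its DATA_SOURCE (or None)."""
    settings = {}
    with open(config_path) as handle:
        exec(compile(handle.read(), config_path, "exec"), settings)
    return settings.get("DATA_SOURCE")


def data_source_problem(config_path):
    """Return a short description of what is wrong, or None if all is well."""
    source = read_data_source(config_path)
    if source is None:
        return "DATA_SOURCE is not set"
    if not os.path.exists(source):
        return "DATA_SOURCE does not exist: " + source
    return None
```

Calling data_source_problem("/srv/ga4gh/config.py") returns None when the file named by DATA_SOURCE is present.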
Install bgzip
cd /usr/src
bzip2 -d htslib-1.3.1.tar.bz2
tar xvf htslib-1.3.1.tar
cd htslib-1.3.1
make
make prefix=/usr install
Data import
Reference set
wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
bgzip hs37d5.fa
ga4gh_repo add-referenceset registry.db /srv/ga4gh/hs37d5.fa.gz -d "NCBI assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37
Ontology
wget https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp.obo
ga4gh_repo add-ontology registry.db /srv/ga4gh/so-xp.obo -n so-xp
Create a new dataset
ga4gh_repo add-dataset registry.db NA12878_sample1_rerun_sg1_snvcalls --description "1000genomes genome, an Illumina platinum genome and one that the Kinghorn guys use for testing their sequencer"
Importing VCFs
If you have a bunch of VCFs in a directory, you can loop through the files and bgzip each of them:
for i in *.vcf; do bgzip "$i"; done
Then run tabix on each of the bgzipped files:
for i in *.gz; do tabix "$i"; done
Add the directory of compressed, indexed VCFs to the dataset as a variant set:
ga4gh_repo add-variantset registry.db NA12878_sample1_rerun_sg1_snvcalls /srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls --name NA12878_sample1_rerun_sg1 --referenceSetName NCBI37
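If you would rather drive the compression and indexing from a script, the same two passes can be planned in Python. This is only a sketch: the function name is hypothetical, it just builds the command lines without running them, and it passes the explicit "-p vcf" preset to tabix instead of relying on the file extension.

```python
import os


def plan_vcf_indexing(directory):
    """Return bgzip and tabix command lines for every .vcf file in directory."""
    commands = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".vcf"):
            continue  # skip files that are already compressed or unrelated
        path = os.path.join(directory, name)
        commands.append(["bgzip", path])  # bgzip replaces path with path + ".gz"
        commands.append(["tabix", "-p", "vcf", path + ".gz"])
    return commands
```

Each returned list can be handed to subprocess.check_call in order, so a failure in bgzip stops the run before tabix is attempted on a file that does not exist.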
Test query
curl -X POST -H 'Content-Type:application/json' \
  -d '{"variantSetId": "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0", "referenceName":"22","start":17190024,"end":"17671934"}' \
  http://your.server/ga4gh/variants/search
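The variantSetId in the query looks opaque, but this particular value decodes as URL-safe base64 (with the trailing padding removed) of a small JSON list holding the dataset name, the marker "vs", and the variant set name. That is an inference from decoding this one ID, not something documented in this post, so treat the helpers below as a sketch with hypothetical names.

```python
import base64
import json


def decode_compound_id(encoded):
    """Decode an unpadded URL-safe base64 ID into its JSON list of parts."""
    padded = encoded + "=" * (-len(encoded) % 4)  # restore stripped padding
    raw = base64.urlsafe_b64decode(padded.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def encode_compound_id(parts):
    """Encode a list of parts the same way, using compact JSON separators."""
    raw = json.dumps(parts, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


VARIANT_SET_ID = (
    "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJOQTEyODc4"
    "X3NhbXBsZTFfcmVydW5fc2cxIl0"
)

print(decode_compound_id(VARIANT_SET_ID))
# → ['NA12878_sample1_rerun_sg1_snvcalls', 'vs', 'NA12878_sample1_rerun_sg1']
```

Decoding an ID this way is a handy sanity check that a query targets the dataset and variant set you expect.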