Sunday, September 18, 2016

mongoimport and updating data types

Let's say I import a CSV file using the following command:

mongoimport -d importtest --collection documents --type csv --file documents.csv  --headerline

mongoimport is a pretty handy tool for importing existing records stored in CSV or JSON format. However, it does have its limitations: the data types it infers are not always correct, and a field often ends up stored as a String.

For example, if a field holds a date such as '2012-07-14T01:00:00+01:00', mongoimport will store the value as a literal string rather than as a date.

To fix this, run the following in the mongo shell:

db.documents.find({}).forEach(function (d) { db.documents.updateOne({ _id: d._id }, { $set: { dateCreated: new ISODate(d.dateCreated) } }); });

Here I convert the string to an ISODate and write the converted value back to the document.

This same technique can be applied to other data types such as Integer and Boolean.
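
To illustrate the same idea for several types at once, here is a minimal Python sketch of the conversion step on its own. It only transforms a dictionary: pageCount and published are hypothetical field names made up for the example (only dateCreated comes from the import above), the timestamp is simply a well-formed ISO 8601 value, and writing the result back to MongoDB is left out. It assumes Python 3.7 or later for datetime.fromisoformat.

```python
from datetime import datetime


def to_bool(text):
    """Map the strings 'true'/'false' (any case) to a bool."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError("not a boolean: %r" % text)
    return lowered == "true"


# Hypothetical schema: field name -> function that parses the stored string.
CONVERTERS = {
    "dateCreated": datetime.fromisoformat,
    "pageCount": int,
    "published": to_bool,
}


def convert_types(doc, converters=CONVERTERS):
    """Return a copy of doc with known string fields replaced by typed values."""
    converted = dict(doc)
    for field, convert in converters.items():
        value = converted.get(field)
        if isinstance(value, str):  # leave already-typed values alone
            converted[field] = convert(value)
    return converted


row = {
    "dateCreated": "2012-07-14T01:00:00+01:00",
    "pageCount": "42",
    "published": "TRUE",
}
typed = convert_types(row)
```

Keeping the converters in one dictionary means a new field only needs one extra entry, and values that are already typed pass through untouched.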

Tuesday, September 6, 2016

GA4GH Reference Server - v.0.3.3 installation on CentOS 7

Here I present the steps I took to install the GA4GH Reference Server, version 0.3.3, on CentOS 7. For reference, see the official ga4gh installation guide.

This installation uses the Apache web server as the front end to service requests, with ga4gh running behind it.

Install essential library packages

yum install python-devel python-virtualenv zlib-devel libxslt-devel openssl-devel libffi-devel redhat-rpm-config ncurses-devel ncurses samtools

Install Apache web server

yum install httpd
yum install mod_wsgi
mkdir /var/cache/httpd/python-egg-cache
chown apache:apache /var/cache/httpd/python-egg-cache

Create a file /etc/httpd/conf.d/ga4gh.conf with the following contents:

 <VirtualHost *:80>
   WSGIDaemonProcess ga4gh processes=10 threads=1 python-path=/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages python-eggs=/var/cache/httpd/python-egg-cache
   WSGIScriptAlias /ga4gh /srv/ga4gh/application.wsgi
   <Directory /srv/ga4gh>
     WSGIProcessGroup ga4gh
     WSGIApplicationGroup %{GLOBAL}
     Require all granted
   </Directory>
 </VirtualHost>

Install ga4gh reference server

 mkdir -p /srv/ga4gh
 cd /srv/ga4gh  
 virtualenv ga4gh-server-env  
 source ga4gh-server-env/bin/activate  
 pip install ga4gh  

Install some missing python packages that weren't automatically installed:
pip install flask
pip install flask-cors
pip install humanize
pip install oic
pip install protobuf
pip install pysam

When I tried to use some of the ga4gh command line scripts, I ran into some errors like the following:
[root@gatkslave ga4gh]# source ga4gh-server-env/bin/activate
(ga4gh-server-env)[root@gatkslave ga4gh]# ga4gh_repo init registry.db
Traceback (most recent call last):
  File "/srv/ga4gh/ga4gh-server-env/bin/ga4gh_repo", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/", line 3007, in <module>
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/", line 728, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/", line 626, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15

It appears that several Python packages were still missing. I used the following commands to install them all, one by one:
pip install sphinx-argparse==0.1.15
pip install lxml==3.4.4
pip install pyOpenSSL==0.15.1
pip install oic==0.7.6
pip install requests==2.7.0
pip install pysam==0.9.0
pip install protobuf==3.0.0.b3
pip install Flask==0.10.1
pip install Flask-Cors==2.0.1
pip install pyjwkest==1.0.1
pip install Jinja2==2.7.3
pip install pycparser==2.14
pip install cffi==1.5.2
yum install libffi-devel
pip install ipaddress==1.0.16
pip install enum34==1.1.2
pip install pyasn1==0.1.9
pip install idna==2.1
pip install cryptography==1.3.1
pip uninstall pycryptodomex
pip uninstall pycryptodome
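
Each failed run only names one missing package, so the cycle of running a script, reading the error, and installing the package repeats. As a rough sketch, a small Python helper can pull the pinned requirement out of the error output so it can be handed to pip install. The name missing_requirement is made up for this example, and the pattern only matches the error format shown in the traceback above; other setuptools versions may word the message differently.

```python
import re

# Matches the final line of the traceback shown above, for example:
# pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15
_NOT_FOUND = re.compile(r"DistributionNotFound:\s*(\S+==\S+)\s*$")


def missing_requirement(error_output):
    """Return the pinned requirement named in the error output, or None."""
    for line in reversed(error_output.strip().splitlines()):
        match = _NOT_FOUND.search(line)
        if match:
            return match.group(1)
    return None


sample = (
    "Traceback (most recent call last):\n"
    "    raise DistributionNotFound(req)\n"
    "pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15\n"
)
requirement = missing_requirement(sample)
if requirement is not None:
    print("pip install " + requirement)
```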

Disable SELinux
# put SELinux into permissive mode (lasts until the next reboot)
setenforce 0

Set the execute bit on everything under /srv so that Apache can traverse the directories:
chmod -R +x /srv

Create the WSGI file at /srv/ga4gh/application.wsgi with the following contents:
from ga4gh.frontend import app as application
import ga4gh.frontend as frontend

Create the configuration file at /srv/ga4gh/ with the following contents:
DATA_SOURCE = "/srv/ga4gh/ga4gh-example-data/repo.db"

Install bgzip

cd /usr/src
bzip2 -d htslib-1.3.1.tar.bz2
tar xvf htslib-1.3.1.tar
cd htslib-1.3.1
make prefix=/usr install

Data import

Reference set

gunzip hs37d5.fa.gz
bgzip hs37d5.fa
ga4gh_repo add-referenceset registry.db /srv/ga4gh/hs37d5.fa.gz  -d "NCBI assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37
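
The gunzip and bgzip steps above recompress the reference with bgzip, which writes the block-compressed BGZF format rather than plain gzip. As a quick sanity check, the Python sketch below looks at the first bytes of a file for a BGZF block header. looks_like_bgzf is a hypothetical helper name, and the check assumes the 'BC' subfield comes first in the gzip extra field, which is the usual layout for bgzip output.

```python
def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF block header."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    return (
        len(header) >= 16
        # gzip magic bytes, deflate method, FEXTRA flag set
        and header[0:4] == b"\x1f\x8b\x08\x04"
        # BGZF marks its extra subfield with the identifier 'BC'
        # (assumed here to be the first subfield)
        and header[12:14] == b"BC"
    )
```

For example, looks_like_bgzf("/srv/ga4gh/hs37d5.fa.gz") should return True once the file has been recompressed with bgzip, and False for a file written by ordinary gzip.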



Ontology

ga4gh_repo add-ontology registry.db /srv/ga4gh/so-xp.obo -n so-xp

Create a new dataset

ga4gh_repo add-dataset registry.db NA12878_sample1_rerun_sg1_snvcalls --description "1000genomes genome, an Illumina platinum genome and  one that the Kinghorn guys use for testing their sequencer"


Importing VCFs

If you have a bunch of VCFs in a directory, you can loop through them and bgzip each one:
for i in *.vcf; do bgzip "$i"; done

Then run tabix on each of the bgzipped files:
for i in *.vcf.gz; do tabix -p vcf "$i"; done
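
Before registering the directory, it can help to confirm that nothing was missed. The Python sketch below is one way to do that, assuming the usual naming where tabix writes its index next to the data file with a .tbi suffix. pending_vcf_work is a hypothetical helper name; it only inspects file names and does not run bgzip or tabix itself.

```python
from pathlib import Path


def pending_vcf_work(directory):
    """Return (VCFs still to bgzip, compressed VCFs still lacking a .tbi index)."""
    root = Path(directory)
    # Uncompressed files still end in .vcf
    to_compress = sorted(p.name for p in root.glob("*.vcf"))
    # Compressed files need a sibling index named <file>.vcf.gz.tbi
    to_index = sorted(
        p.name
        for p in root.glob("*.vcf.gz")
        if not Path(str(p) + ".tbi").exists()
    )
    return to_compress, to_index


if __name__ == "__main__":
    uncompressed, unindexed = pending_vcf_work(".")
    print("still to bgzip:", uncompressed)
    print("still to index:", unindexed)
```

Both lists should be empty once the two loops above have finished.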

Then register the directory of compressed, indexed VCFs as a variant set:
ga4gh_repo add-variantset registry.db NA12878_sample1_rerun_sg1_snvcalls /srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls --name NA12878_sample1_rerun_sg1 --referenceSetName NCBI37

Test query

curl -X POST -H 'Content-Type:application/json' -d '{"variantSetId": "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0", "referenceName":"22","start":17190024,"end":"17671934"}' http://your.server/ga4gh/variants/search
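
The variantSetId in the query above looks opaque, but decoding it as URL-safe base64 (with the stripped padding restored) yields a JSON list containing the dataset name, the marker "vs", and the variant set name used earlier. The Python sketch below shows that decoding. decode_compound_id is a name made up for this example, and the layout is an observation about this particular ID rather than a documented contract, so treat it as a debugging aid only.

```python
import base64
import json


def decode_compound_id(encoded):
    """Decode a URL-safe base64 ID whose '=' padding has been stripped."""
    padded = encoded + "=" * (-len(encoded) % 4)  # restore padding to a multiple of 4
    return json.loads(base64.urlsafe_b64decode(padded).decode("utf-8"))


variant_set_id = (
    "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMi"
    "LCJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0"
)
parts = decode_compound_id(variant_set_id)
print(parts)
# prints ['NA12878_sample1_rerun_sg1_snvcalls', 'vs', 'NA12878_sample1_rerun_sg1']
```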