Techspot - Databases, web, servers - Philip Wu: September 2016

Let's say I import a CSV file using the following command:

mongoimport -d importtest --collection documents --type csv --file documents.csv  --headerline

mongoimport is a handy tool for importing existing records stored in CSV or JSON format; however, it does have its limitations. The datatypes it infers are not always correct, and fields often end up stored as plain strings.

For example, if I have a date in the following format:

2012-07-14T01:00:00+01:00

mongoimport will store the field as the literal string '2012-07-14T01:00:00+01:00' rather than as a date.

To fix this, run the following in the mongo shell:

db.documents.find({}).forEach( function (d) { d.dateCreated = new ISODate(d.dateCreated); db.documents.save(d); });

Here I convert the string to an ISODate.

The same technique can be applied to other datatypes such as Integer and Boolean.
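
To illustrate the idea for other types, here is a rough sketch in Python rather than the mongo shell. The helper `coerce_fields`, the function `parse_bool` and the field names `count` and `active` are made up for this example; writing the converted document back to MongoDB (for instance through a driver) is deliberately left out.

```python
from datetime import datetime


def parse_bool(text):
    """Treat 'true', '1' and 'yes' (any case) as True, everything else as False."""
    return text.strip().lower() in ("true", "1", "yes")


def coerce_fields(doc, converters):
    """Return a copy of doc with each listed field passed through its converter.

    Fields that are missing or are not strings are left untouched.
    """
    fixed = dict(doc)
    for field, convert in converters.items():
        value = fixed.get(field)
        if isinstance(value, str):
            fixed[field] = convert(value)
    return fixed


# Hypothetical imported record with every value left as a string.
raw = {"dateCreated": "2012-07-14T01:00:00+01:00", "count": "42", "active": "true"}
fixed = coerce_fields(raw, {
    "dateCreated": datetime.fromisoformat,  # needs Python 3.7+
    "count": int,
    "active": parse_bool,
})
```

Keeping the converters in a dictionary keyed by field name makes it easy to see, in one place, which fields are being retyped and how.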

Here I will present the steps taken to install the GA4GH Reference Server, version 0.3.3, on CentOS 7. For reference, see the official ga4gh installation guide.

This installation uses the Apache web server as the front end to service requests, with ga4gh running behind it.

Install essential library packages

$ yum install python-devel python-virtualenv zlib-devel libxslt-devel openssl-devel libffi-devel redhat-rpm-config ncurses-devel ncurses samtools

Install Apache web server

yum install httpd

yum install mod_wsgi

mkdir /var/cache/httpd/python-egg-cache

chown apache:apache /var/cache/httpd/python-egg-cache

Create a file /etc/httpd/conf.d/ga4gh.conf with the following contents:

 <VirtualHost *:80>  
 WSGIDaemonProcess ga4gh processes=10 threads=1 python-path=/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages python-eggs=/var/cache/httpd/python-egg-cache  
 WSGIScriptAlias /ga4gh /srv/ga4gh/application.wsgi  
 <Directory /srv/ga4gh>  
   WSGIProcessGroup ga4gh  
   WSGIApplicationGroup %{GLOBAL}  
   Require all granted  
 </Directory>  
 </VirtualHost>

Install ga4gh reference server

 mkdir -p /srv/ga4gh  
 cd /srv/ga4gh  
 virtualenv ga4gh-server-env  
 source ga4gh-server-env/bin/activate  
 pip install ga4gh

Install some missing python packages that weren't automatically installed:

pip install flask

pip install flask-cors

pip install humanize

pip install oic

pip install protobuf

pip install pysam

When I tried to use some of the ga4gh command line scripts, I ran into some errors like the following:

[root@gatkslave ga4gh]# source ga4gh-server-env/bin/activate

(ga4gh-server-env)[root@gatkslave ga4gh]# ga4gh_repo init registry.db

Traceback (most recent call last):
  File "/srv/ga4gh/ga4gh-server-env/bin/ga4gh_repo", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 3007, in <module>
    working_set.require(__requires__)
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 728, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/srv/ga4gh/ga4gh-server-env/lib/python2.7/site-packages/pkg_resources.py", line 626, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: sphinx-argparse==0.1.15

It appears that several python packages were still missing. I used the following commands to install them all, one by one:

pip install sphinx-argparse==0.1.15

pip install lxml==3.4.4

pip install pyOpenSSL==0.15.1

pip install oic==0.7.6

pip install requests==2.7.0

pip install pysam==0.9.0

pip install protobuf==3.0.0.b3

pip install Flask==0.10.1

pip install Flask-Cors==2.0.1

pip install pyjwkest==1.0.1

pip install Jinja2==2.7.3

pip install pycparser==2.14

pip install cffi==1.5.2

yum install libffi-devel

pip install ipaddress==1.0.16

pip install enum34==1.1.2

pip install pyasn1==0.1.9

pip install idna==2.1

pip install cryptography==1.3.1

pip uninstall pycryptodomex

pip uninstall pycryptodome
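
Instead of discovering each missing pin from a new traceback, the pinned versions can be checked in one pass. The script below is only a sketch: `check_pins` is a hypothetical helper, it compares version strings exactly (so a pin such as 3.0.0.b3 may be reported if the installed version is recorded in a normalized form), and it has to be run with the virtualenv's interpreter.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 2.7 (the interpreter used in this post) has no importlib.metadata,
    # so fall back to setuptools' pkg_resources.
    import pkg_resources
    PackageNotFoundError = pkg_resources.DistributionNotFound

    def version(name):
        return pkg_resources.get_distribution(name).version


def check_pins(pins):
    """Return (name, wanted, installed) for every pin that is not satisfied.

    installed is None when the distribution is missing altogether.
    """
    problems = []
    for pin in pins:
        name, wanted = pin.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems.append((name, wanted, installed))
    return problems


if __name__ == "__main__":
    # A few of the pins from the list above; extend as needed.
    for name, wanted, installed in check_pins([
        "sphinx-argparse==0.1.15",
        "lxml==3.4.4",
        "pysam==0.9.0",
    ]):
        print("%s: want %s, found %s" % (name, wanted, installed))
```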

Disable SELinux

# disable SELinux enforcement (permissive mode until the next reboot)

setenforce 0

Add the execute bit to everything under /srv so that Apache can traverse its directories:

chmod -R +x /srv

Create the WSGI file at /srv/ga4gh/application.wsgi with the following contents:

from ga4gh.frontend import app as application

import ga4gh.frontend as frontend

frontend.configure("/srv/ga4gh/config.py")
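
If Apache answers with a 500 error and it is unclear whether mod_wsgi or ga4gh is at fault, one option is to temporarily replace the contents of application.wsgi with a trivial WSGI script. The following is a generic sketch and not part of ga4gh. By default mod_wsgi looks for a callable named `application`, which is why the real file imports the Flask app under that name.

```python
def application(environ, start_response):
    """Smallest possible WSGI app: answer every request with plain text."""
    body = b"mod_wsgi is working\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

If this page loads under /ga4gh but the real application does not, the problem is more likely in the Python environment or the ga4gh configuration than in the Apache setup.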

Create the configuration file at /srv/ga4gh/config.py with the following contents:

DATA_SOURCE = "/srv/ga4gh/ga4gh-example-data/repo.db"

Install bgzip

cd /usr/src

wget https://github.com/samtools/htslib/releases/download/1.3.1/htslib-1.3.1.tar.bz2

bzip2 -d htslib-1.3.1.tar.bz2

tar xvf htslib-1.3.1.tar

cd htslib-1.3.1

make

make prefix=/usr install

Data import

Reference set

wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz

bgzip hs37d5.fa

ga4gh_repo add-referenceset registry.db /srv/ga4gh/hs37d5.fa.gz -d "NCBI assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37

Ontology

wget https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp.obo

ga4gh_repo add-ontology registry.db /srv/ga4gh/so-xp.obo -n so-xp

Create a new dataset

ga4gh_repo add-dataset registry.db NA12878_sample1_rerun_sg1_snvcalls --description "1000genomes genome, an Illumina platinum genome and one that the Kinghorn guys use for testing their sequencer"

Importing VCFs

If you have a bunch of VCFs in a directory, you can loop through the files and bgzip each of them:

for i in *.vcf; do bgzip "$i"; done

Then run tabix on each of the bgzipped files:

for i in *.gz; do tabix "$i"; done

ga4gh_repo add-variantset registry.db NA12878_sample1_rerun_sg1_snvcalls /srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls --name NA12878_sample1_rerun_sg1 --referenceSetName NCBI37
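
The bgzip and tabix loops above could also be driven from Python, which makes it easier to stop at the first failing file. This is a sketch under a few assumptions: `vcf_commands` and `run_all` are hypothetical helper names, bgzip and tabix are expected to be on the PATH, and unlike the shell loop it only indexes the files it has just compressed rather than every .gz file in the directory.

```python
import glob
import os
import subprocess


def vcf_commands(directory):
    """Build the bgzip and tabix command lines for every .vcf file in directory."""
    commands = []
    for path in sorted(glob.glob(os.path.join(directory, "*.vcf"))):
        commands.append(["bgzip", path])                       # produces path + ".gz"
        commands.append(["tabix", "-p", "vcf", path + ".gz"])  # writes the .tbi index
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.check_call(argv)
```

For example, `run_all(vcf_commands("/srv/ga4gh/datasets/NA12878_sample1_rerun_sg1_snvcalls"))` would process the directory that is registered with add-variantset above. Printing the result of `vcf_commands` first gives a dry run.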

Test query

curl -X POST -H 'Content-Type:application/json' -d '{"variantSetId": "WyJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxX3NudmNhbGxzIiwidnMiLCJOQTEyODc4X3NhbXBsZTFfcmVydW5fc2cxIl0", "referenceName":"22","start":17190024,"end":"17671934"}' http://your.server/ga4gh/variants/search
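
Decoding the variantSetId used in the curl command shows that it is a URL-safe base64 string (padding stripped) wrapping a JSON array of the dataset name, the marker "vs" and the variant set name. The sketch below reproduces that ID and builds the same request with Python 3's standard library. Treat it as an observation about this one ID rather than a documented API: the helper names are made up, and start and end are passed as plain integers here.

```python
import base64
import json
from urllib.request import Request


def encode_id(parts):
    """Compact JSON array -> URL-safe base64 with the '=' padding stripped."""
    raw = json.dumps(parts, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_id(encoded):
    """Inverse of encode_id: restore the padding, then decode and parse."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.urlsafe_b64decode(padded).decode("utf-8"))


def variants_search_request(base_url, variant_set_id, reference_name, start, end):
    """Build (but do not send) the POST request for /variants/search."""
    payload = {
        "variantSetId": variant_set_id,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    }
    return Request(
        base_url.rstrip("/") + "/variants/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


VARIANT_SET_ID = encode_id(
    ["NA12878_sample1_rerun_sg1_snvcalls", "vs", "NA12878_sample1_rerun_sg1"]
)
request = variants_search_request(
    "http://your.server/ga4gh", VARIANT_SET_ID, "22", 17190024, 17671934
)
# To actually send it: urllib.request.urlopen(request).read()
```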

Sunday, September 18, 2016

mongoimport and updating data types

Tuesday, September 6, 2016

GA4GH Reference Server - v.0.3.3 installation on CentOS 7