Sunday, May 14, 2017

Elasticsearch: Issues adding a new node

The Issue

The new node was visible in the cluster, but existing shards were not relocating to it.

The steps

I had a pre-existing elasticsearch cluster of 3 nodes, and I went about adding a new node. In a round-robin fashion, I updated the elasticsearch.yml configuration of the pre-existing nodes to include the new node by updating the list of hosts and the minimum number of master nodes:

elasticsearch.yml
 discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  
 discovery.zen.minimum_master_nodes: 3  
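
As a sanity check on that second setting: discovery.zen.minimum_master_nodes should be a majority quorum of the master-eligible nodes, to avoid split-brain. A quick sketch of the arithmetic:

```python
# Sketch: minimum_master_nodes should be a majority quorum of
# master-eligible nodes. With 4 nodes: 4 // 2 + 1 = 3.
def minimum_master_nodes(master_eligible):
    """Quorum size for a given number of master-eligible nodes."""
    return master_eligible // 2 + 1

print(minimum_master_nodes(4))  # 3, matching the setting above
```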

I then restarted each node and checked the health status as follows:


 [root@mongo-elastic-node-1 centos]# curl -XGET 10.0.0.1:9200/_cluster/health?pretty  
 {  
  "cluster_name" : "cpi",  
  "status" : "green",  
  "timed_out" : false,  
  "number_of_nodes" : 4,  
  "number_of_data_nodes" : 4,  
  "active_primary_shards" : 40,  
  "active_shards" : 71,  
  "relocating_shards" : 2,  
  "initializing_shards" : 0,  
  "unassigned_shards" : 0,  
  "delayed_unassigned_shards" : 0,  
  "number_of_pending_tasks" : 0,  
  "number_of_in_flight_fetch" : 0,  
  "task_max_waiting_in_queue_millis" : 0,  
  "active_shards_percent_as_number" : 100.0  
 }  

The important item to notice from above is the bit about "relocating_shards": here it's saying that the cluster is relocating 2 shards. To find out which shards are going where, you can check with this command:

 [root@mongo-elastic-node-1 centos]# curl -XGET http://10.0.0.9:9200/_cat/shards | grep RELO  
  % Total  % Received % Xferd Average Speed  Time  Time   Time Current  
                  Dload Upload  Total  Spent  Left Speed  
 100 7881 100 7881  0   0  318k   0 --:--:-- --:--:-- --:--:-- 334k  
 cpi12         2 p RELOCATING 6953804  5.8gb 10.0.0.2 cpi2 -> 10.0.0.4 fBmdkD2gT6-jTJ6k_bEF0w cpi4  
 cpi12         0 r RELOCATING 6958611  5.5gb 10.0.0.3 cpi3 -> 10.0.0.4 fBmdkD2gT6-jTJ6k_bEF0w cpi4  

Here it's saying that the cluster is trying to send shards belonging to the index called cpi12 from nodes cpi2 and cpi3 to node cpi4. More specifically, it's RELOCATING shard #2 and shard #0 to cpi4. To monitor its progress, I would log in to cpi4 and check whether the diskspace usage was going up. And here is where I noticed my first problem:


 [root@elastic-node-4 elasticsearch]# df -h  
 Filesystem   Size Used Avail Use% Mounted on  
 /dev/vdb     69G  52M  66G  1% /mnt  

The mounted folder where I expected to find my elasticsearch data remained unchanged at 52 MB.

Debugging

I remained stumped on this one for a long time and did the following checks:

  • Checked the elasticsearch.yml config file on every node, ensuring that discovery.zen.ping.unicast.hosts was correct.
  • Verified every node could ping the new node and vice versa.
  • Verified every node could reach ports 9200 and 9300 on the new node (and vice versa) using the telnet command.
  • Verified every node had sufficient diskspace for the shard relocation.
  • Verified the new node had the right permissions to write to its elasticsearch folder.
  • Checked the cluster settings (curl 'http://localhost:9200/_cluster/settings?pretty') and looked at the cluster.routing settings.
  • Restarted elasticsearch on each node 3 times over.
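The port checks in that list can also be scripted rather than run by hand with telnet. A minimal sketch using only the Python standard library (the hosts and ports are the ones from this post):

```python
import socket

# Sketch: check whether a TCP connection to host:port succeeds,
# equivalent to the manual telnet checks on ports 9200 and 9300.
def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage against the new node:
# for port in (9200, 9300):
#     print(port, can_connect("10.0.0.4", port))
```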
However, none of the above solved the issue. Worse, the repeated restarts of each node got my cluster into a state where some of the shards became UNASSIGNED:

 [root@mongo-elastic-node-1 bin]# curl -XGET http://10.0.0.1:9200/_cat/shards | grep UNASS  
  % Total  % Received % Xferd Average Speed  Time  Time   Time Current  
                  Dload Upload  Total  Spent  Left Speed  
 100 5250 100 5250  0   0  143k   0 --:--:-- --:--:-- --:--:-- 146k  
 .marvel-es-2017.05.13 0 p UNASSIGNED  
 .marvel-es-2017.05.13 0 r UNASSIGNED  
 .marvel-es-2017.05.14 0 p UNASSIGNED  
 .marvel-es-2017.05.14 0 r UNASSIGNED  
 cpi14         1 p UNASSIGNED  
 cpi13         1 p UNASSIGNED  
 cpi13         4 p UNASSIGNED  
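When several shards end up in this state, it helps to script the check instead of eyeballing the output. A small sketch, assuming the whitespace-delimited _cat/shards format shown above:

```python
# Sketch: pick out UNASSIGNED shards from `_cat/shards` output so they
# can be dealt with in a loop. Assumes the whitespace-delimited format
# shown above: index, shard, p/r, state, ...
def unassigned_shards(cat_shards_output):
    """Return (index, shard_number, primary_or_replica) tuples for unassigned shards."""
    result = []
    for line in cat_shards_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            result.append((fields[0], int(fields[1]), fields[2]))
    return result
```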

After much browsing on the web, I found a forum answer mentioning that the state of the plugins on all nodes must be exactly the same: http://stackoverflow.com/questions/28473687/elasticsearch-cluster-no-known-master-node-scheduling-a-retry

The solution

The question about the plugins jogged my memory: I had previously installed the marvel plugin. To see which plugins are installed on each node, run the plugin command from the command-line:

 [root@elastic-node-3 elasticsearch]# cd /usr/share/elasticsearch/bin  
 [root@elastic-node-3 bin]# ./plugin list  
 Installed plugins in /usr/share/elasticsearch/plugins:  
   - license  
   - marvel-agent  

It turned out my pre-existing 3 nodes each had the license and marvel-agent plugins installed, whereas the fresh install of the 4th node had no plugins at all. Because of this, the nodes were able to acknowledge each other but refused to talk. To fix this, I manually removed the plugins on each node:

 [root@elastic-node-3 bin]# ./plugin remove license  
 -> Removing license...  
 Removed license  
 [root@elastic-node-3 bin]# ./plugin remove marvel-agent  
 -> Removing marvel-agent...  
 Removed marvel-agent  

Before I could see if shard relocation would work, I first had to assign the UNASSIGNED shards:

 [root@mongo-elastic-node-1 elasticsearch]# curl -XPOST -d '{ "commands" : [{ "allocate" : { "index": "cpi14", "shard":1, "node":"cpi4", "allow_primary":true } }]}' localhost:9200/_cluster/reroute?pretty  
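Building that reroute body by hand for each shard is tedious and error-prone, so it can help to generate it. A sketch (the index, shard and node values are the ones from the examples in this post):

```python
import json

# Sketch: build the _cluster/reroute body used above for one
# unassigned shard, ready to POST with curl or urllib.
def allocate_command(index, shard, node, allow_primary=True):
    """Build the reroute body for manually allocating one unassigned shard."""
    return {"commands": [{"allocate": {
        "index": index,
        "shard": shard,
        "node": node,
        "allow_primary": allow_primary,
    }}]}

body = json.dumps(allocate_command("cpi14", 1, "cpi4"))
# POST `body` to http://localhost:9200/_cluster/reroute?pretty,
# once per UNASSIGNED shard.
```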

I had to repeat this command for every UNASSIGNED shard. Checking the cluster health, I could see that there were no more unassigned shards and that 2 shards were currently relocating:

 [root@elastic-node-4 elasticsearch]# curl -XGET localhost:9200/_cluster/health?pretty  
 {  
  "cluster_name" : "cpi",  
  "status" : "green",  
  "timed_out" : false,  
  "number_of_nodes" : 4,  
  "number_of_data_nodes" : 4,  
  "active_primary_shards" : 40,  
  "active_shards" : 71,  
  "relocating_shards" : 2,  
  "initializing_shards" : 0,  
  "unassigned_shards" : 0,  
  "delayed_unassigned_shards" : 0,  
  "number_of_pending_tasks" : 0,  
  "number_of_in_flight_fetch" : 0,  
  "task_max_waiting_in_queue_millis" : 0,  
  "active_shards_percent_as_number" : 100.0  
 }  

Checking the diskspace usage on the new node again, this time I could see that shards were indeed relocating. Yay!

References

http://stackoverflow.com/questions/23656458/elasticsearch-what-to-do-with-unassigned-shards

http://stackoverflow.com/questions/28473687/elasticsearch-cluster-no-known-master-node-scheduling-a-retry

https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/listing-removing.html

Wednesday, May 3, 2017

MongoDB: switching to the WiredTiger storage engine

We were already running a replicated 3-node mongodb cluster in production which was running out of diskspace, each node having access to a 750 GB drive at 77% usage. The obvious solution was to expand the diskspace, but at the same time I wanted to be more efficient with the diskspace usage itself.

Previously we were using the storage engine called MMAPv1, which has no support for compression, and I wanted to switch over to the WiredTiger storage engine, which does support compression options.

Here I describe the strategy I used:


Since my mongoDB cluster was replicated, I was able to take down one node at a time to perform the switchover to WiredTiger. Once I was finished with one node, I could bring it back up and take down the next, and so on until all nodes were upgraded. By doing it this way, there was no downtime whatsoever from the user's perspective.


For each node I did the following:



  • Shut down the mongod service.
  • Moved the mongo data folder, in my case /var/lib/mongo, to attached storage on another volume, as a backup in case the procedure failed.
  • Recreated the mongo data folder, in my case /var/lib/mongo, and assigned the appropriate permissions: chown mongod:mongod /var/lib/mongo
  • Modified the /etc/mongod.conf configuration file to include the following: storageEngine=wiredTiger
  • Restarted the mongod service.
  • Checked wiredTiger was configured correctly using the mongo command-line:
 db.serverStatus().storageEngine  
 { "name" : "wiredTiger", "supportsCommittedReads" : true }  
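
For reference, that storageEngine=wiredTiger line uses the older ini-style config format. In the newer YAML config format, the equivalent with an explicit block compressor would look roughly like this (a sketch; snappy is WiredTiger's default compressor):

```
# Old ini-style mongod.conf:
storageEngine=wiredTiger

# Equivalent YAML-style mongod.conf (sketch):
storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
```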

Now that the node is back up and running, replication will happen in the background. If you head over to your primary mongo node and type rs.status(), you should see the new node in the STARTUP2 state.
Once the node has replicated successfully, repeat the same procedure for the next node.

References:


https://docs.mongodb.com/v3.0/release-notes/3.0-upgrade/?_ga=1.86531032.1131483509.1428671022#change-replica-set-storage-engine-to-wiredtiger

https://askubuntu.com/questions/643252/how-to-migrate-mongodb-2-6-to-3-0-with-wiredtiger

Tuesday, March 14, 2017

mongodb won't start: Data directory /data/db not found

Our VM provider suffered a hardware failure and one of our mongo nodes failed to start up. Running the command:

 sudo mongod 

I got the following error message:
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] MongoDB starting : pid=1347 port=27017 dbpath=/data/db 64-bit host=mongodb-node-3.novalocal  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] db version v3.2.5  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] git version: 34e65e5383f7ea1726332cb175b73077ec4a1b02  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] allocator: tcmalloc  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] modules: none  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] build environment:  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten]   distmod: rhel70  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten]   distarch: x86_64  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten]   target_arch: x86_64  
 2017-03-15T09:05:27.963+1100 I CONTROL [initandlisten] options: {}  
 2017-03-15T09:05:27.985+1100 I STORAGE [initandlisten] exception in initAndListen: 29 Data directory /data/db not found., terminating  
 2017-03-15T09:05:27.985+1100 I CONTROL [initandlisten] dbexit: rc: 100  
The important error message is on the STORAGE line: "Data directory /data/db not found."

It appears mongo was completely ignoring the dbpath configured in /etc/mongod.conf. (In fact, running mongod by hand doesn't read /etc/mongod.conf at all unless you pass it with -f, so it falls back to the default /data/db.)

So what was going on?

Turns out that mongo's journal folder got corrupted when the server shut down abruptly, so removing the /var/lib/mongo/journal folder solved the problem.


Then I restarted mongod and that got everything back up and running again!

Reference: http://stackoverflow.com/questions/20729155/mongod-shell-doesnt-start-data-db-doesnt-exsist

Tuesday, February 21, 2017

Fineuploader with Grails

Fineuploader is an excellent frontend javascript library supporting full-featured uploading capabilities such as concurrent file chunking and file resume. Our use case involves uploading extremely large files such as genomic DNA sequencing data, including FASTQs, BAMs and VCFs. But the javascript library does not run on its own out of the box; you will still need to implement some server-side code to handle the file uploading.

There's already a Github repository with many examples of server-side implementations for popular programming languages like Java, Python, PHP, node.js etc., which can be found here:

https://github.com/FineUploader/server-examples

However, I could not find any examples for a Grails implementation, so once again I rolled my own solution, which can be found below. For this implementation I've only focused on concurrent file chunking and file resume. Other features such as file deletion were omitted on purpose, but that's not to say you couldn't modify it to support the remaining features of fineuploader.



And for the GSP I have something like this:


Monday, February 6, 2017

Gradle intellij


Commands used to get IntelliJ to recognize the Gradle project:

gradle idea

From the IntelliJ GUI: File -> Invalidate Caches/Restart

Import the project from Gradle

Monday, January 23, 2017

Grails 3 assets duplicated

For some reason, in Grails 3, the asset (javascript, CSS) declarations were being duplicated in the HTML source, as shown below:

Generated HTML:
   <script type="text/javascript" src="/assets/jquery-2.2.0.min.js?compile=false" ></script>  
   <script type="text/javascript" src="/assets/jquery-ui.min.js?compile=false" ></script>  
   <link rel="stylesheet" href="/assets/jquery-ui.min.css?compile=false" />  
   <link rel="stylesheet" href="/assets/jquery-ui.theme.min.css?compile=false" />  
   <link rel="stylesheet" href="/assets/bootstrap.css?compile=false" />  
 <link rel="stylesheet" href="/assets/grails.css?compile=false" />  
 <link rel="stylesheet" href="/assets/main.css?compile=false" />  
 <link rel="stylesheet" href="/assets/mobile.css?compile=false" />  
 <link rel="stylesheet" href="/assets/application.css?compile=false" />  
   <script type="text/javascript" src="/assets/igv-1.0.6.js?compile=false" ></script>  
   <link rel="stylesheet" href="/assets/igv-1.0.6.css?compile=false" />  
   <script type="text/javascript" src="/assets/jquery-2.2.0.min.js?compile=false" ></script>  
 <script type="text/javascript" src="/assets/bootstrap.js?compile=false" ></script>  
 <script type="text/javascript" src="/assets/igv-1.0.6.js?compile=false" ></script>  
 <script type="text/javascript" src="/assets/jquery-ui.min.js?compile=false" ></script>  
 <script type="text/javascript" src="/assets/application.js?compile=false" ></script>  

As you can see, several assets are defined twice: for example jquery-2.2.0.min.js, jquery-ui.min.js and igv-1.0.6.js!

My GSP code looked like this:

GSP:
   <asset:javascript src="jquery-2.2.0.min.js"/>  
   <asset:javascript src="jquery-ui.min.js"/>  
   <asset:stylesheet src="jquery-ui.min.css" />  
   <asset:stylesheet src="jquery-ui.theme.min.css" />  
   <asset:stylesheet src="application.css"/>  
   <asset:javascript src="igv-1.0.6.js" />  
   <asset:stylesheet src="igv-1.0.6.css"/>  
   <asset:javascript src="application.js"/>  

Strangely enough, if I removed the following line, the duplicates disappeared:

 <asset:javascript src="application.js"/>  

Another problem I had was that in production (though it worked fine in development mode) my javascript files were not being minified, even after I set the configuration to use ES6 in my build.gradle file:


 assets {  
   minifyJs = false  
   minifyCss = true  
   enableSourceMaps = true  
   minifyOptions = [  
       languageMode: 'ES6',  
       targetLanguage: 'ES6'  
   ]  
 }  

To work around the issue, I set minifyJs = false, as shown above.

At the moment the asset pipeline just feels buggy and unstable, and it's probably better to disable some of its features in the configuration to get the basics working.

Not sure if I've misunderstood how assets were meant to be used, but if somebody can explain this, please enlighten me!

Tuesday, January 17, 2017

MyRepublic broadband: how to get connected, from a customer's perspective

So I've finally made the switch to MyRepublic broadband, but it was a bumpy road, and I wanted to share how I managed to get it all working, to save you some headaches.

Previously, I was signed up with iiNet for my broadband connection at around $80/month for NBN 25 Mbps downloads and 3 Mbps uploads, unlimited usage. It was a no-brainer to switch to MyRepublic for just $60 with uncapped speeds and unlimited usage.

When you sign up through their online process, they tell you they will contact you once the service has been activated. I waited probably almost 2 months before I finally got an email saying my broadband had been activated, without any further instruction on how to actually start using it.

Here is a copy of the email:


Congratulations, your MyRepublic service is now active.

To access your service follow the Quick Start Guide that came with your Wi-Fi Hub+ to connect your modem to the MyRepublic network.

If you have ordered a Home Phone service, you will be sent a separate email of when your service will be active.

Support
Our Frequently Asked Questions provide support and information to get you connected and other helpful advice.

Regards,
The Team at MyRepublic

Initially I thought they would automatically cancel my iiNet service and that a simple modem swap with the existing connections would be it. But I waited and waited, monitoring my iiNet account, and my broadband was still going through iiNet. So I rang MyRepublic support to see what was going on. It turns out they had activated my NBN on a secondary port, UNI-D 2, in the box housing the modem, so I really had 2 broadband connections to my home simultaneously activated. Once I plugged the modem into the secondary port, UNI-D 2, I was connected to MyRepublic.



I did a speed test and got around 90 Mbps downloads and 38 Mbps uploads, a significant boost compared to iiNet at a much lower price.

The other issue I had with MyRepublic was the VOIP connection. I waited for many months with still no connection, but I received my first bill, which already included the cost of the phone.
I called them up again, and they decided to credit me back for the phone while they try to get me connected with a new number.

MyRepublic is new to Australia and is still getting their feet wet rolling out these services. In particular, they need to improve how they process connections, communicate back to the user with instructions on how to hook things up and on which ports, and make their billing more accurate based on the services actually connected. On the flipside, their Australian-based support team was very helpful, better than any offshore call centre I've dealt with from other service providers.

In the end, it was definitely worth the pain, though you might want to wait a few months for them to get their act together before making the switch.