The Issue
The new node was visible in the cluster, but existing shards were not relocating to it.
The steps
I had a pre-existing Elasticsearch cluster of 3 nodes, and I went about adding a new node. In a round-robin fashion, I updated the elasticsearch.yml configuration of the pre-existing nodes to include the new node, updating the list of hosts and the minimum number of master nodes:
elasticsearch.yml
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
discovery.zen.minimum_master_nodes: 3
I then restarted each node and checked the cluster health as follows:
[root@mongo-elastic-node-1 centos]# curl -XGET 10.0.0.1:9200/_cluster/health?pretty
{
  "cluster_name" : "cpi",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 40,
  "active_shards" : 71,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
The important item to notice above is "relocating_shards": it says the cluster is currently relocating 2 shards. To find out which shards are going where, check with this command:
[root@mongo-elastic-node-1 centos]# curl -XGET http://10.0.0.9:9200/_cat/shards | grep RELO
cpi12 2 p RELOCATING 6953804 5.8gb 10.0.0.2 cpi2 -> 10.0.0.4 fBmdkD2gT6-jTJ6k_bEF0w cpi4
cpi12 0 r RELOCATING 6958611 5.5gb 10.0.0.3 cpi3 -> 10.0.0.4 fBmdkD2gT6-jTJ6k_bEF0w cpi4
Here it's saying that the cluster is trying to send shards belonging to the index called cpi12 from node cpi2 and node cpi3 to node cpi4. More specifically, it's RELOCATING shard #2 and shard #0 to cpi4. To monitor its progress, I would log in to cpi4 and see if the disk space usage was going up. And here is where I noticed my first problem:
[root@elastic-node-4 elasticsearch]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vdb 69G 52M 66G 1% /mnt
The mounted folder where I expected to find my elasticsearch data remained unchanged at 52 MB.
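As an aside, watching disk space is an indirect measure. The _cat/recovery API reports per-shard recovery progress directly (the files_percent and bytes_percent columns). A minimal sketch, assuming curl access to any node of the cluster (the localhost address here is a placeholder for your own node):

```shell
# Per-shard recovery progress; the ?v flag adds a header row.
# Relocations show up here with their source and target nodes.
curl -s --max-time 5 'http://localhost:9200/_cat/recovery?v' \
  || echo "cluster not reachable"
```
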
Debugging
I remained stumped on this one for a long time and did the following checks:
- Verified the elasticsearch.yml config file on every node, ensuring that discovery.zen.ping.unicast.hosts was correct.
- Verified every node could ping the new node, and vice versa.
- Verified every node could reach ports 9200 and 9300 on the new node, and vice versa, using the telnet command.
- Verified every node had sufficient disk space for the shard relocation.
- Verified the new node had the right permissions to write to its elasticsearch folder.
- Checked the cluster settings:
curl 'http://localhost:9200/_cluster/settings?pretty'
and looked for any cluster.routing settings that might block allocation.
- Restarted elasticsearch on each node, 3 times over.
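One cluster.routing setting worth singling out: if cluster.routing.allocation.enable has been set to "none" (a common step during rolling restarts), shard relocation stalls silently. A sketch of re-enabling it, assuming curl access to a node (host and port are placeholders):

```shell
# Transient settings reset on full cluster restart; use "persistent" to keep them.
BODY='{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
curl -s --max-time 5 -XPUT -d "$BODY" 'http://localhost:9200/_cluster/settings' \
  || echo "cluster not reachable"
```

This wasn't the culprit in my case, but it is a cheap thing to rule out.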
However, none of the above solved the issue. Even worse, the repeated restarts of each node managed to get my cluster into an even worse state, where some of the shards became UNASSIGNED:
[root@mongo-elastic-node-1 bin]# curl -XGET http://10.0.0.1:9200/_cat/shards | grep UNASS
.marvel-es-2017.05.13 0 p UNASSIGNED
.marvel-es-2017.05.13 0 r UNASSIGNED
.marvel-es-2017.05.14 0 p UNASSIGNED
.marvel-es-2017.05.14 0 r UNASSIGNED
cpi14 1 p UNASSIGNED
cpi13 1 p UNASSIGNED
cpi13 4 p UNASSIGNED
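The _cat/shards API can also report why each shard is unassigned, via the unassigned.reason column (values like NODE_LEFT or INDEX_CREATED). A sketch, assuming curl access to a node (the address is a placeholder):

```shell
# h= selects the columns to print; the awk filter keeps only UNASSIGNED rows.
curl -s --max-time 5 \
  'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
  | awk '$4 == "UNASSIGNED"'
```
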
After much browsing on the web, one forum mentioned that the plugins installed on all nodes must be exactly the same, as referenced here: http://stackoverflow.com/questions/28473687/elasticsearch-cluster-no-known-master-node-scheduling-a-retry
The solution
The question about the plugins jogged my memory: I had previously installed the marvel plugin. To see what plugins are installed on each node, run the plugin command from the command line:
[root@elastic-node-3 elasticsearch]# cd /usr/share/elasticsearch/bin
[root@elastic-node-3 bin]# ./plugin list
Installed plugins in /usr/share/elasticsearch/plugins:
- license
- marvel-agent
It turned out my pre-existing 3 nodes each had the license and marvel-agent plugins installed, whereas the fresh install of the 4th node had no plugins at all. Because of this, the nodes were able to acknowledge each other, but refused to talk. To fix this, I manually removed the plugins on each node:
[root@elastic-node-3 bin]# ./plugin remove license
-> Removing license...
Removed license
[root@elastic-node-3 bin]# ./plugin remove marvel-agent
-> Removing marvel-agent...
Removed marvel-agent
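Rather than running ./plugin list on every host, the nodes-info API can confirm in one call that every node now reports the same (empty) plugin list. A sketch, assuming curl access to any node (the address is a placeholder):

```shell
# Lists the plugins loaded by every node in the cluster, so any
# mismatch between nodes stands out in a single response.
curl -s --max-time 5 'http://localhost:9200/_nodes/plugins?pretty' \
  || echo "cluster not reachable"
```
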
Before I could see if shard relocation would work, I first had to assign the UNASSIGNED shards:
[root@mongo-elastic-node-1 elasticsearch]# curl -XPOST -d '{ "commands" : [{ "allocate" : { "index": "cpi14", "shard":1, "node":"cpi4", "allow_primary":true } }]}' localhost:9200/_cluster/reroute?pretty
I had to repeat this command for every UNASSIGNED shard. Checking the cluster health, I could see that there were no more unassigned shards, and that 2 shards were currently relocating:
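Repeating the reroute by hand gets tedious with many shards. A sketch of a loop that builds the same allocate command for every UNASSIGNED shard, assuming curl access to a node; the host, target node name (cpi4), and helper function are illustrative:

```shell
#!/bin/sh
ES=http://localhost:9200  # placeholder; point at any node in the cluster

# Build the reroute body for one shard: index name, shard number, target node.
reroute_body() {
  printf '{ "commands" : [{ "allocate" : { "index" : "%s", "shard" : %s, "node" : "%s", "allow_primary" : true } }]}' "$1" "$2" "$3"
}

# For every UNASSIGNED shard reported by _cat/shards, issue an allocate command.
curl -s --max-time 5 "$ES/_cat/shards" \
  | awk '$4 == "UNASSIGNED" { print $1, $2 }' \
  | while read index shard; do
      curl -s -XPOST -d "$(reroute_body "$index" "$shard" cpi4)" \
        "$ES/_cluster/reroute?pretty"
    done
```

Note that allow_primary forces allocation of an empty primary if no copy of the data exists, so use it knowingly.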
[root@elastic-node-4 elasticsearch]# curl -XGET localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "cpi",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 40,
  "active_shards" : 71,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
Checking the disk space usage on the new node now showed that shards were indeed relocating. Yay!
References
http://stackoverflow.com/questions/23656458/elasticsearch-what-to-do-with-unassigned-shards
http://stackoverflow.com/questions/28473687/elasticsearch-cluster-no-known-master-node-scheduling-a-retry
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/listing-removing.html