Cluster Index Replication Health Check Fails during reindexing

Shrikant Bijapurkar _NTT DATA_
Contributor
July 7, 2024

What is the reason for this error?

 

 

 

Reindexing Completed.png

1 answer

1 vote
Ravina
Community Champion
July 7, 2024

Hi @Shrikant Bijapurkar _NTT DATA_, technically this health check error is shown while a full reindex is running on a node or has just completed.

During a full reindex, the current index is deleted and a new index is built from scratch. While indexing runs on the node, an internal Jira service, NodeReindexService, runs every 5 seconds to distribute index changes from one node to another so that the indexes stay up to date on all nodes in the cluster. This process is resource intensive, and sometimes syncing the indexes between nodes takes long enough that the index health check times out, which produces this error message.

Once indexing has completed, you can copy the updated index to the other nodes so that they pick up the latest data. How long this takes depends on the instance and index size, but it is typically fast, and once it completes the error will be gone at the next health check. This is a common issue when a full reindex is triggered or has just completed.

Capture 2024-07-08 at 10.49.23.png

If the error does not resolve itself after an hour or so, there are several possible causes to check, as described in the KB documents below.

But before checking the KB documents, I would suggest understanding how indexing works in Jira Data Center, as it will help you understand the issue and find the cause of this error.

https://confluence.atlassian.com/jirakb/how-indexing-works-in-jira-1167744587.html

https://confluence.atlassian.com/adminjiraserver/search-indexing-938847710.html

Related KB documents for the cluster index replication errors:

https://confluence.atlassian.com/jirakb/cluster-index-replication-health-check-fails-in-jira-data-center-738722192.html

https://confluence.atlassian.com/jirakb/cluster-index-replication-health-check-fails-in-jira-data-center-due-to-jira-charting-plugin-974387065.html

https://confluence.atlassian.com/jirakb/cluster-index-replication-health-check-reports-delay-when-the-database-is-in-a-different-timezone-than-jira-1051984214.html

If your Data Center is set up correctly, then 90% of the time this error resolves itself once the updated index has been copied to the other active nodes and the next health check passes within the specified time.

Let me know if you have any questions.

Shrikant Bijapurkar _NTT DATA_
Contributor
July 7, 2024

Thanks a lot for your answer, Ravina.

Can you please elaborate on which node details to put in the "From" and "To" boxes?
Ours is a 3-node cluster.
Also, Jira spawns additional nodes if the CPU usage goes above threshold during full reindexing.
So, I am not sure which node number to put in "From" and "To" boxes.
Guidance around this would be helpful.
All I know is the node number that goes down during full reindexing.
Shrikant Bijapurkar _NTT DATA_
Contributor
July 7, 2024

Attached picture shows our 3 nodes.

1 of these goes down when the full reindexing is triggered.

Screenshot 2024-07-08 at 12.19.08 PM.png

Ravina
Community Champion
July 8, 2024

@Shrikant Bijapurkar _NTT DATA_ Nodes do not go down: as you can see, the uptime of all three nodes is the same, and if a node had gone down its uptime would have been reset. But yes, when a full reindex is triggered, the node does not accept active traffic until the full reindex has completed, so it is recommended to remove the node where indexing is triggered from the load balancer so that user traffic is redirected to the other active nodes.
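The uptime reasoning above can be sketched as a small check. This is only an illustration: the function name, the node names, the uptime values, and the 5-minute tolerance are all assumptions, and a node added later by auto scaling would also show a shorter uptime without having restarted.

```python
def nodes_with_reset_uptime(uptimes_seconds, tolerance_seconds=300):
    """Return nodes whose uptime is clearly shorter than the longest uptime.

    uptimes_seconds maps a node ID to its uptime in seconds (hypothetical
    input, e.g. read off the cluster nodes screen by hand).
    """
    if not uptimes_seconds:
        return []
    longest = max(uptimes_seconds.values())
    return [
        node
        for node, uptime in uptimes_seconds.items()
        if longest - uptime > tolerance_seconds
    ]


# Placeholder node names and uptimes, not taken from the thread.
print(nodes_with_reset_uptime({"node-a": 86400, "node-b": 86400, "node-c": 86400}))
print(nodes_with_reset_uptime({"node-a": 86400, "node-b": 86400, "node-c": 120}))
```

An empty result matches the situation described here: all uptimes are the same, so no node actually restarted.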

Regarding the index copy FROM and TO nodes: the node where indexing was triggered and has completed (in your case, the node that appears to go down and is not accessible to users) is your FROM node. You can also identify the nodes by node ID, since each node has a unique ID. For example, if you triggered the full reindex on node i-078239b... and it completed, that is your FROM node and node i-02b7afca... is your TO node for the first copy. Click the Copy Index button and wait for some time. Then perform the same action to copy the index to the remaining node: keep the same FROM node, use node i-0368df0... as TO, and click Copy Index again.

Since you have set up auto scaling to add a node when CPU load increases, a new node joining the cluster is handled differently: once the node has started, NodeReindexService looks for the latest copy of the index from the other nodes or the shared home and builds the index for the new node. So when a new node is added by auto scaling, there is technically no need to perform this copy index action.
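The FROM/TO pairing described above amounts to one copy per remaining node, always from the node that was reindexed. A minimal sketch, with a hypothetical helper name and placeholder node IDs (it only builds the list of pairs, it does not call Jira):

```python
def plan_index_copies(from_node, all_nodes):
    """Return (FROM, TO) pairs: one copy from the reindexed node to each other node."""
    if from_node not in all_nodes:
        raise ValueError(f"{from_node} is not in the node list")
    return [(from_node, node) for node in all_nodes if node != from_node]


# Placeholder node IDs; substitute the IDs shown on your cluster screen.
nodes = ["node-a", "node-b", "node-c"]
for source, target in plan_index_copies("node-a", nodes):
    print(f"Copy index FROM {source} TO {target}")
```

For a 3-node cluster this gives two copy operations, both with the same FROM node.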

So the best practices to follow for full reindexing are:

  • Always perform the full reindex during low-traffic hours or on the weekend.
  • Perform a full reindex regularly, weekly for a large instance or every 15 days to a month for a moderate to small instance, to keep the indexes up to date, since Jira search results depend on index health.
  • In Jira Data Center instances, remove the node from the load balancer while indexing is running so that user traffic is redirected to the other active nodes where indexing is not running. Add the node back to the load balancer once indexing has completed, and copy the latest index from the node where indexing was triggered to the other nodes.
  • Automate the indexing and the adding/removing of the node from the load balancer, and schedule it to run on the weekend so that no manual intervention is required and the indexes are updated weekly.
  • Follow this Atlassian KB document to increase the indexing speed by adding the mentioned parameters to the jira-config.properties file, which needs to be created in the Jira local home if it is not already present.
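The order of operations in the automation bullet can be sketched as below. This is a sketch under stated assumptions, not an Atlassian-provided script: the four callables (`remove_from_lb`, `full_reindex`, `add_to_lb`, `copy_index`) are hypothetical hooks you would implement against your own load balancer and Jira instance, and `full_reindex` is assumed to block until indexing has finished.

```python
from typing import Callable, List, Sequence


def run_full_reindex(
    reindex_node: str,
    other_nodes: Sequence[str],
    remove_from_lb: Callable[[str], None],
    full_reindex: Callable[[str], None],
    add_to_lb: Callable[[str], None],
    copy_index: Callable[[str, str], None],
) -> List[str]:
    """Run the reindex sequence on one node and return a log of the steps taken."""
    steps: List[str] = []
    remove_from_lb(reindex_node)
    steps.append(f"removed {reindex_node} from load balancer")
    try:
        # Assumed to return only once the full reindex has completed.
        full_reindex(reindex_node)
        steps.append(f"full reindex finished on {reindex_node}")
    finally:
        # Put the node back even if the reindex raised an error.
        add_to_lb(reindex_node)
        steps.append(f"added {reindex_node} back to load balancer")
    # Only reached when the reindex succeeded: copy the fresh index out.
    for target in other_nodes:
        copy_index(reindex_node, target)
        steps.append(f"copied index {reindex_node} -> {target}")
    return steps


if __name__ == "__main__":
    # Dry run with stand-in hooks that do nothing; node IDs are placeholders.
    for step in run_full_reindex(
        "node-a",
        ["node-b", "node-c"],
        remove_from_lb=lambda node: None,
        full_reindex=lambda node: None,
        add_to_lb=lambda node: None,
        copy_index=lambda source, target: None,
    ):
        print(step)
```

Passing the hooks in as parameters keeps the sequence testable without touching a real cluster, and the `finally` block keeps a failed reindex from leaving the node out of the load balancer.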

 

Let me know if you have any questions.

Shrikant Bijapurkar _NTT DATA_
Contributor
July 8, 2024

Thanks a bunch for your detailed reply, Ravina.

I understood the FROM and TO nodes.

We are already following the Best Practices viz.

1. We run the Full reindexing in OFF time when there is minimal Jira traffic.

2. The other 2 nodes are always UP

 

I will update the FROM and TO next time we run the full reindexing tomorrow, after it completes.

And then I hope the warnings do not show up next day.

 

If all goes as expected, would surely ACCEPT your answers.

 

Thanks anyways. This has been helpful so far.

Ravina
Community Champion
July 8, 2024

Sure, do let me know how it goes

Shrikant Bijapurkar _NTT DATA_
Contributor
July 9, 2024

The Full reindexing on i-02b7afcad5a3bc955 node completed within 2 hours.

This is very effective, the background reindexing used to take 20 hours or more.

Then I went to the Copy area and put data as below -

From:
Current node: i-02b7afcad5a3bc955

To:
i-03618df0a70329128

 

When I press Copy Index, the page scrolls up and now i-03618df0a70329128 shows as current node, prompting to start full reindexing again on i-03618df0a70329128 node.

 

Actually, I was hoping it would take some time to copy the newly created index to i-03618df0a70329128 and then allow me to also copy from i-02b7afcad5a3bc955 to the 3rd node, viz. i-078239bb88fd7a123.

 

But I am unable to do that. The To box is no longer editable.

 

What am I doing wrong?

In the interim, the Health check warning messages are popping up, and I fear the teams would all see those tomorrow morning in Japan, before we start work in India.

 

Ravina
Community Champion
July 10, 2024

This should not happen, but you can check the application logs to find the exact cause.

Shrikant Bijapurkar _NTT DATA_
Contributor
July 10, 2024

I ran the Full reindex on one of the 2 nodes.

And now the Warning messages have stopped.

So I guess it takes a little while for the new index to propagate to the other nodes.

I might be wrong. 

Ravina
Community Champion
July 10, 2024

Yes, as I said earlier, NodeReindexService runs on each node to sync the indexes between the nodes and add any missing index entries to the local node's index. Since many other services and processes are also running in Jira, the index replication time between nodes varies.