
speed reindex

Affected Versions: 2.13, 2.14, 2.15, 2.16

What steps will reproduce the problem?

  1. Install a base version to upgrade from, e.g. 2.11.x. Note that if reindexing on 2.15, you could install 2.13.x or 2.14 as your base version to upgrade from; you just need to be at least a minor version previous to the version you are on to be forced into a full reindex. Alternatively, it is possible to test this issue on a single version by e.g. installing 2.15 and removing the index location to force full regeneration of the index. The rest of the steps are the same.
    Just to be clear: this is a reindex issue, not an upgrade issue directly. It is just that, because you have to reindex during an upgrade, that is where it is seen without any workaround.

  2. Create some repositories, but most importantly have large amounts of changes in each one. To do this I create 2k files, then update each file with 500 lines of different text, so that we get changes with real-life differencing to be merged / indexed across each file. Further steps / scripts below.

  3. a) Upgrade to the new version by first running "java -jar gerrit.war init -d xxx", or b) delete your site's "index" folder holding the Lucene index content, e.g. if the site path is -d ../gerrit_testsite: "rm -fR ../gerrit_testsite/index".

  4. After the schema update, perform a full offline reindex ( required step ).
  5. In large organizations / with large amounts of changes, this reindex can take between 2-4 weeks, and the time taken is not reduced by adding more cores on larger servers.
  6. This work took roughly the same time on a server with 8 cores as on one with 144 cores and 1TB of RAM.
  7. I have also needed to extend the sorting of the projects beyond just size, but you will see why when batching is introduced below. Note that once no single piece of work can block the whole run, starting a large project first doesn't change the overall time frame to reindex at all. It currently would have ( before the patches below ): as a large project is single-threaded, it would have been best to start it early.

What is the expected output? Reindex needs to run in an appropriate amount of time; companies cannot take their systems offline for 2-4 weeks to perform this upgrade. The time taken is directly related to the changes / differencing being worked out, the number of projects, and the number of changes in each project. I would say within 24-48 hours would be a good guideline.

What do you see instead? Reindex appears to hang on large changes, e.g. 2k - 10k changed files in a repo. Repos are 20GB - 50GB in size.

Please provide any additional information below.

Below are some details of what was seen. I quickly created patches and fixes for this to test that performance increased: from 4 weeks to 48 hours, then to 3 hours, using a combination of all the fixes below.

The reindex is greatly slowed down by the following items.

PreReqs:

  • Set appropriate Java min / max memory settings to avoid GC and memory-growth issues; I have seen the heap take a long time to grow from e.g. 256MB up to 500GB.
  • Set appropriate index threads or the --batch-threads option for the number of cores in your environment.
  • Ensure diff.timeout is small for reindexing. You may want a higher value for a real online system, but taking e.g. 30 secs per file for each diff during reindex is too long.
  • H2 cache settings must be increased from the defaults; these are much too small for a system doing a reindex.
  • Finally, I advise a "jgit gc" to collect packfiles in each repo, as diff / merge-score performance is greatly slowed without it.

E.g. reindex command: "java -Xms6g -Xmx6g -Ddiff.renameLimit=7500 -jar /vagrant_data/gerrit-2.13.11.war reindex -d /home/vagrant/gerrit --verbose --threads 144"
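For reference, a sketch of the matching gerrit.config entries. The key names ( index.batchThreads, cache.diff.timeout, cache.h2CacheSize ) are as I understand them for the 2.13 - 2.16 line; check the documentation for your release, and treat the values as starting points only:

```
[index]
        batchThreads = 144        # match the cores you give to reindex

[cache "diff"]
        timeout = 10 s            # keep per-file diff computations short while reindexing

[cache]
        h2CacheSize = 1g          # raise the per-database H2 cache from its small default
```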

Further changes:

  1. Reindexing is only parallelised by project. So you can have 100 projects, but 2 of these projects are very large: it does the 100 projects in parallel ( see caveats below ) and then ends up processing the last one or two projects, with lots of changes, on their own. Take this example. Projects 1, 2, 3, 4, 5 - all with 2 changes. Project 6 - 1000 changes.

It will say it has processed 5/6 projects, or ~83% of the work ( measured in projects ), but really it has only got through 10 changes out of 1010 total changes.
This in itself isn't as much of a problem, as it is just a reporting issue; but importantly, this last project is processed on only one single core, regardless of what is available for indexing threads on the server, which greatly slows down the overall reindex time. We should be parallelising batches of work, which we can split over the cores, so that we can still be using x cores on a machine to process a single repo. A sketch of change-based progress reporting follows; the batching itself is sketched under item 4 below.
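A minimal sketch of the reporting side of the fix, counting progress in changes rather than projects. The class is hypothetical, not Gerrit's actual progress code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical progress tracker. Progress is counted in changes, so five
// small projects finishing early cannot masquerade as ~83% of the work
// while 1000 changes are still queued behind a single core.
class ReindexProgress {
  private final int totalChanges;
  private final AtomicInteger done = new AtomicInteger();

  ReindexProgress(int totalChanges) {
    this.totalChanges = totalChanges;
  }

  // Call once per indexed change, from any worker thread.
  void changeDone() {
    int d = done.incrementAndGet();
    if (d % 100 == 0 || d == totalChanges) {
      System.out.printf("Reindexed %d/%d changes (%.1f%%)%n",
          d, totalChanges, 100.0 * d / totalChanges);
    }
  }
}
```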

  2. Differencing performance, for the line-by-line merge score, can be slow on large files or large diffs. Configuration of this is possible, but not available for the HISTOGRAM chain size. Adding configuration of this helps in situations where the diff algorithm times out after x secs and then hits the HISTOGRAM algorithm, as you can configure it to use e.g. 32 buckets ( see the sketch below ).
  • Simply allow configuration for better performance.
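Below is a sketch of what such a setting would control at the JGit level. HistogramDiff, setMaxChainLength and the Myers fallback are existing JGit API; wiring the value through from gerrit.config is the part a patch would add:

```java
import org.eclipse.jgit.diff.EditList;
import org.eclipse.jgit.diff.HistogramDiff;
import org.eclipse.jgit.diff.MyersDiff;
import org.eclipse.jgit.diff.RawText;
import org.eclipse.jgit.diff.RawTextComparator;

public class ConfigurableHistogram {
  public static EditList diff(byte[] oldBytes, byte[] newBytes, int maxChainLength) {
    HistogramDiff algorithm = new HistogramDiff();
    // JGit's default chain length is 64; a smaller value (e.g. 32) trades
    // some diff quality for speed on large files with repeated elements.
    algorithm.setMaxChainLength(maxChainLength);
    // When the chain limit is exceeded, fall back instead of failing.
    algorithm.setFallbackAlgorithm(MyersDiff.INSTANCE);
    return algorithm.diff(RawTextComparator.DEFAULT,
        new RawText(oldBytes), new RawText(newBytes));
  }
}
```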
  3. The Lucene index is sequential. So when you increase the number of batching threads and the work throughput rises, you will then see that the Lucene index is being throttled. This is because of the following:
    1. It uses Lucene's auto-detection, which looks at the disk and decides the configuration, and this can hit the following issue: it can detect most storage as spinning rust - hard drives, not SSD. In that case it allocates only a single MERGE thread to the entire Lucene index, and the queue size is also very small. We need to be able to configure / override the queue size and the number of Lucene index threads being used, to cope with the increased throughput.

Again this is a simple set of changes, to expose the Lucene configuration settings in the gerrit.config file.
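As an illustration, this is roughly the Lucene-level override such settings would drive. The Lucene calls are real API for the Lucene 5.x line these Gerrit versions ship with; the surrounding method and parameter names are mine:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public class LuceneTuning {
  public static IndexWriterConfig tunedConfig(int maxMergeCount, int maxMergeThreads) {
    ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
    // By default Lucene guesses from the disk; a guess of "spinning disk"
    // leaves one merge thread and a tiny merge queue, throttling writers.
    scheduler.setMaxMergesAndThreads(maxMergeCount, maxMergeThreads);
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMergeScheduler(scheduler);
    return config;
  }
}
```

Note that the queue depth ( maxMergeCount ) has to rise along with the thread count; otherwise indexing threads still stall waiting for merge slots.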

  4. Batching / splitting the work across x threads, while avoiding the repository change SCAN each time. Each thread that is kicked off opens the project and performs a SCAN of the changes; it would be best to avoid this hit each time by sharing the DB connection and the change results across our batching pool. I will allow the changes of a project to be split and run in multiple threads at once, sharing the DB connection and changes. I do not see much real contention here at all, as it is a very quick read, even with 80 cores processing in parallel. A minimal sketch follows.
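A minimal sketch of the idea, with hypothetical ChangeScanner / ChangeIndexer interfaces standing in for Gerrit's internals and the shared DB connection behind them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical stand-ins for Gerrit internals: the scan runs once per
// project, and its read-only result is shared across the batch workers.
interface ChangeScanner { List<Integer> scanChangeIds(String project); }
interface ChangeIndexer { void index(String project, int changeId); }

class BatchedProjectIndexer {
  private final ExecutorService pool;
  private final ChangeScanner scanner;
  private final ChangeIndexer indexer;
  private final int batchSize;

  BatchedProjectIndexer(ExecutorService pool, ChangeScanner scanner,
      ChangeIndexer indexer, int batchSize) {
    this.pool = pool;
    this.scanner = scanner;
    this.indexer = indexer;
    this.batchSize = batchSize;
  }

  void reindex(String project) throws InterruptedException, ExecutionException {
    // One SCAN up front; the slices below only read from the result.
    List<Integer> ids = scanner.scanChangeIds(project);
    List<Future<?>> futures = new ArrayList<>();
    for (int start = 0; start < ids.size(); start += batchSize) {
      List<Integer> slice =
          ids.subList(start, Math.min(start + batchSize, ids.size()));
      futures.add(pool.submit(() -> slice.forEach(id -> indexer.index(project, id))));
    }
    for (Future<?> f : futures) {
      f.get(); // propagate any indexing failure
    }
  }
}
```

With a batch size of, say, 100, the 1000-change project from item 1 becomes ten units of work spread across the pool instead of one unit pinned to a single core.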