notes on asg and solr - krickert/search-api GitHub Wiki
Deploying Apache Solr in cloud mode (SolrCloud) on AWS EC2 requires careful planning to ensure high availability, scalability, and data integrity. In this guide, we design an Infrastructure-as-Code solution with Terraform to provision a SolrCloud cluster and a ZooKeeper ensemble on AWS using Auto Scaling Groups (ASGs). We’ll cover the architecture, Terraform resources (VPC, EC2 ASGs, security groups, IAM roles, launch templates, storage), backup strategies, automated scaling (rehydration of nodes), and best practices for shard/replica rebalancing and ZooKeeper dynamic ensemble membership.
Figure: High-level architecture – a SolrCloud cluster of multiple Solr nodes (each hosting replicas of collection shards) coordinated by a ZooKeeper ensemble. Separate node groups are used for Solr and ZooKeeper for isolation and scalability. (Image source: [A step by step guide for a high availability Solr environment](https://www.netcentric.biz/insights/2020/07/installing-solr-guide))
Components & Separation: We deploy SolrCloud and ZooKeeper on separate EC2 instance groups. This means one ASG for Solr nodes and another ASG for ZooKeeper nodes. Separating them ensures that each tier can scale and recover independently and avoids resource contention. ZooKeeper is critical for SolrCloud’s cluster state and coordination, so it will run on its own small instances (typically 3 or 5 nodes for quorum). Solr nodes will be on larger instances optimized for search (CPU/memory/disk).
VPC and Networking: Terraform will create a dedicated VPC with multiple subnets across availability zones (AZs). We deploy instances across at least 3 AZs for high availability. For example, the VPC can have 3 private subnets (one per AZ) for the Solr and ZK instances, and perhaps public subnets if needed for bastion or load balancers. Each ASG will target these subnets to distribute instances. Using three AZs ensures the ZooKeeper ensemble can maintain quorum even if one AZ goes down ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=To%20ensure%20your%20Zookeeper%20ensemble,blog%20post%20by%20Credera%20here)).
Security Groups: We define distinct security groups for Solr and for ZooKeeper:
- Solr SG: Allows inbound Solr traffic on the Solr port (usually 8983). This could be restricted to an internal load balancer or specific clients. It should also allow Solr nodes to communicate with each other on 8983 (for intra-cluster replication). Additionally, Solr nodes need to reach ZooKeeper, so the Solr SG should allow outbound to ZK ports.
- ZooKeeper SG: Allows ZK ensemble communication on quorum ports (2888, 3888) between ZK instances (cluster internal). It also allows Solr nodes (and any admin/bastion) to connect to ZK’s client port 2181. For example, open 2181 to the Solr SG. Keep ZooKeeper ports locked down from public access.
IAM Roles: We create IAM roles/instance profiles for each ASG:
- Solr instances’ role may permit writing to CloudWatch Logs/Metrics, using SSM (AWS Systems Manager) for automation, or accessing S3 if needed for backups. It will also need EFS access if using EFS (via mount targets, handled by security group).
- ZooKeeper instances’ role can be minimal (maybe just CloudWatch Logs or SSM for management). If we use a helper mechanism (like a lifecycle hook Lambda) for ZK networking, the role might allow specific EC2 actions (though in our solution we’ll prefer dynamic reconfiguration over IP reassignment).
Load Balancing (Optional): It’s common to put Solr nodes behind a load balancer (for read requests). You might use an internal ALB pointing to Solr instances on 8983. Terraform can provision this (with a target group for port 8983 and health checks on Solr’s admin endpoint). This is optional if clients can discover Solr nodes via ZK, but for simplicity of client integration, an ALB with round-robin to Solr nodes is helpful. We won’t load-balance ZK (clients have ZK quorum info via SolrCloud, and ZK is accessed by Solr internally).
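If you do add the load balancer, a minimal Terraform sketch along those lines is shown below (resource names and the health-check path are illustrative assumptions; `module.vpc`, the Solr security group, and the Solr ASG are defined later on this page):

```hcl
# Internal ALB in front of the Solr nodes (optional). The Solr SG is reused here for simplicity.
resource "aws_lb" "solr_alb" {
  name               = "solr-internal-alb"
  internal           = true
  load_balancer_type = "application"
  subnets            = module.vpc.private_subnets
  security_groups    = [aws_security_group.solr_sg.id]
}

resource "aws_lb_target_group" "solr_tg" {
  name     = "solr-8983"
  port     = 8983
  protocol = "HTTP"
  vpc_id   = module.vpc.vpc_id

  health_check {
    path    = "/solr/admin/info/system"   # assumed health endpoint
    matcher = "200"
  }
}

resource "aws_lb_listener" "solr_http" {
  load_balancer_arn = aws_lb.solr_alb.arn
  port              = 8983
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.solr_tg.arn
  }
}

# Register the Solr ASG's instances with the target group.
resource "aws_autoscaling_attachment" "solr_tg_attach" {
  autoscaling_group_name = aws_autoscaling_group.solr_asg.name
  lb_target_group_arn    = aws_lb_target_group.solr_tg.arn
}
```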
Latest Versions: We assume latest stable Solr and ZK (e.g., Solr 9.x and ZooKeeper 3.7/3.8) running on a modern Java (Java 11 or 17, as required by Solr 9). Our Terraform scripts will allow specifying custom AMIs or using user-data to install the software. Using the latest versions ensures we can leverage new features like Solr’s improved backup/restore and ZK dynamic config.
First, we create the VPC, subnets, and networking basics. We can use the official Terraform AWS VPC module for convenience. For example:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "3.19.0"
name = "solr-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a","us-east-1b","us-east-1c"]
private_subnets = ["10.0.1.0/24","10.0.2.0/24","10.0.3.0/24"]
public_subnets = ["10.0.101.0/24","10.0.102.0/24","10.0.103.0/24"]
enable_nat_gateway = true
tags = { Environment = "prod", Project = "SolrCloud" }
}
This creates a VPC with 3 private and 3 public subnets across AZs. The Solr and ZK ASGs will use the private subnets. A NAT Gateway is enabled so instances can reach the internet for downloading packages (if needed). Adjust CIDRs and region accordingly.
Next, define Security Groups. For example:
resource "aws_security_group" "zk_sg" {
name = "solr-zookeeper-sg"
description = "Security group for ZooKeeper nodes"
vpc_id = module.vpc.vpc_id
ingress = [
// Allow ZK peer communication (ensemble quorum)
{ protocol = "tcp", from_port = 2888, to_port = 2888, self = true },
{ protocol = "tcp", from_port = 3888, to_port = 3888, self = true },
// Allow Solr nodes to connect to ZK client port
{ protocol = "tcp", from_port = 2181, to_port = 2181, security_groups = [aws_security_group.solr_sg.id] }
]
egress = [{ protocol = "-1", from_port = 0, to_port = 0, cidr_blocks = ["0.0.0.0/0"] }]
tags = { Name = "sg-zookeeper" }
}
resource "aws_security_group" "solr_sg" {
name = "solr-cloud-sg"
description = "Security group for SolrCloud nodes"
vpc_id = module.vpc.vpc_id
ingress = [
// Allow Solr HTTP (8983) from load balancer or internal sources
{ protocol = "tcp", from_port = 8983, to_port = 8983, cidr_blocks = ["10.0.0.0/16"] },
// (Optional) allow Solr API from a bastion or your IP for admin access
// { from_port=8983, to_port=8983, cidr_blocks=["X.Y.Z.W/32"], protocol="tcp" },
// Allow Solr nodes to talk to each other on 8983
{ protocol = "tcp", from_port = 8983, to_port = 8983, self = true },
// Allow Solr nodes to reach ZK (outbound is open by default egress rule)
]
egress = [{ protocol = "-1", from_port = 0, to_port = 0, cidr_blocks = ["0.0.0.0/0"] }]
tags = { Name = "sg-solr" }
}
The above SGs ensure ZKs can form a quorum and Solr can register with ZK. Note we allow 2181 from Solr SG to ZK SG. Adjust rules as needed (you may lock down 8983 to specific clients or use an ALB security group).
We also create IAM roles for the instances. For example, an IAM role for Solr that allows writing to S3 (for backup syncing) or using SSM:
resource "aws_iam_role" "solr_role" {
name = "solr-ec2-role"
assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}
resource "aws_iam_role_policy" "solr_policy" {
role = aws_iam_role.solr_role.id
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Effect = "Allow",
Action = [
"cloudwatch:PutMetricData",
"logs:CreateLogGroup","logs:CreateLogStream","logs:PutLogEvents",
"ssm:*", "ec2:DescribeTags"
],
Resource = "*"
},
{
Effect = "Allow",
Action = [ "s3:PutObject", "s3:GetObject", "s3:ListBucket" ],
Resource = "arn:aws:s3:::your-solr-backup-bucket/*"
}
]
})
}
resource "aws_iam_instance_profile" "solr_profile" {
name = "solr-instance-profile"
role = aws_iam_role.solr_role.name
}
The above allows CloudWatch Logs and custom metrics, SSM (to run commands if needed), and S3 access for a specific bucket. Adjust policies to your needs (you might also add EFS mount target access; however, mounting EFS is handled by network/SG rather than IAM).
We would create a similar minimal role for ZooKeeper instances (mainly CloudWatch/SSM). If not using any AWS API from ZK nodes, this can be very limited.
ZooKeeper should run as a 3-node (or 5-node) ensemble on dedicated small EC2 instances. Using an Auto Scaling Group for ZK gives self-healing: if a ZK instance fails, AWS will replace it. However, ZK ensemble membership is traditionally static, which poses a challenge when instances are replaced (new instance = new IP, not recognized by the ensemble) ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=and%20stay%20up%20to%20date,blog%20post%20by%20Credera%20here)). We solve this by leveraging ZooKeeper dynamic reconfiguration (available since ZK 3.5) to allow the ensemble to adjust to new members without downtime.
AMI and Launch Template: We need an AMI that has ZooKeeper installed or a user-data script to install/configure ZK at boot. For automation, a common approach is to bake a custom AMI (perhaps via Packer) with ZK and Java already installed ([GitHub - fscm/terraform-module-aws-zookeeper: Terraform Module to create a Apache Zookeeper cluster on AWS](https://github.com/fscm/terraform-module-aws-zookeeper#:~:text=Apache%20Zookeeper%20AMI)). Alternatively, use a stock Linux AMI and have user-data download and run ZK. Either way, ensure the Java runtime (matching ZK’s requirements, e.g., Java 11+) is present.
ZooKeeper configuration for dynamic ensemble:
- In `zoo.cfg`, enable dynamic reconfig: set `standaloneEnabled=false` (to force quorum mode even with a single node) and `reconfigEnabled=true` ([ZooKeeper Dynamic Reconfiguration](https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#:~:text=Note%3A%20Starting%20with%203,reconfigEnabled%20%20configuration%20option)) ([ZooKeeper Dynamic Reconfiguration](https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#:~:text=default%2C%20and%20has%20to%20be,reconfigEnabled%20%20configuration%20option)).
- Use a dynamic configuration file. With ZK 3.5+, you can start ZK with an initial `zoo.cfg` that points to a `zoo.cfg.dynamic` file containing the list of current servers. New servers can be added via the `reconfig` command.
- Initially, we might bring up the ASG with 3 instances. Each instance's user-data will generate a unique `myid` and initial config. One strategy is to use the EC2 instance's instance ID or an auto-assigned sequential ID to derive the ZK server ID. For example, in the user-data script:
  - Get the last octet of the local IP, or use EC2 instance metadata, to derive an ID.
  - Construct the `server.X=<address>:2888:3888` lines for the initial config, possibly via a startup script that queries the other ZK instances (this can be tricky in a pure ASG).
- Alternatively, use a simpler approach: start with a static ensemble of 3 known slots (even if some are empty), or use an orchestration tool like Netflix Exhibitor to manage the ZK ensemble dynamically. Exhibitor can coordinate ensemble changes automatically in AWS environments.
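To make that concrete, here is a minimal sketch (assuming ZK 3.5+; paths, server IDs, and IPs are illustrative) of what a bootstrap script might write for the static config plus the dynamic membership file:

```bash
# Hypothetical snippet from zk_bootstrap.sh -- writes a ZK 3.5+ config with
# dynamic reconfiguration enabled. IDs, paths, and IPs are placeholders.
ZK_HOME=/opt/zookeeper
MYID=1   # derived per instance (e.g., from instance metadata)

cat > "$ZK_HOME/conf/zoo.cfg" <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
standaloneEnabled=false
reconfigEnabled=true
dynamicConfigFile=$ZK_HOME/conf/zoo.cfg.dynamic
EOF

# Client port lives in the dynamic file when dynamic reconfig is used.
cat > "$ZK_HOME/conf/zoo.cfg.dynamic" <<EOF
server.1=10.0.1.10:2888:3888:participant;2181
server.2=10.0.2.10:2888:3888:participant;2181
server.3=10.0.3.10:2888:3888:participant;2181
EOF

mkdir -p /var/lib/zookeeper
echo "$MYID" > /var/lib/zookeeper/myid
```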
Given the complexity of dynamic reconfiguration, a pragmatic solution used in the past was to assign static network identities to ZK servers. One such approach is using Elastic Network Interfaces (ENIs): attach a pre-created ENI (with a stable private IP or DNS) to each ZK instance when it launches ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=Here%20we%20use%20ENIs%20,quality%20or%20fitness%20for%20purpose)). This way, even if the instance is replaced, it retains the same IP, and the ensemble config (which references those IPs) remains valid. The OpenSource Connections team outlines this ENI approach with an ASG + Lambda hook to attach the ENI on instance launch ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=Depending%20on%20exactly%20how%20you,Do%20this%20for%20each%20ASG)) ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=1,Check%20it%20all%20works)). This avoids needing dynamic reconfig at the ZK level (Solr sees the same ZK addresses). However, this method requires extra setup (Lifecycle Hook, Lambda or EC2 script to attach ENI).
Using Dynamic Reconfiguration: If we opt to fully utilize ZK 3.5+ features, we can bring up new ZK instances and have them join the quorum dynamically. The process might look like:
- Launch the initial 3 ZK nodes with a static config (they form a quorum).
- With `reconfigEnabled` on, a new node can be added. If the ASG scales up (or replaces a node), run a ZooKeeper CLI `reconfig` command to add the new server as a participant (e.g., `reconfig -add "server.N=<address>:2888:3888;2181"`), removing the old member if needed ([ZooKeeper Dynamic Reconfiguration](https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#:~:text=Starting%20with%203,interface%20for%20administrators%20and%20no)) ([ZooKeeper Dynamic Reconfiguration](https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#:~:text=Links%3A%20paper%20,video%2C%20hadoop%20summit%20slides)).
- Automate the above using a small bootstrapping script or an AWS Lambda triggered on instance launch. For example, when a new ZK instance comes up, have it wait until it can connect to the ensemble and then issue a `reconfig` to add itself; likewise, on instance termination, remove it. This can be done via the ZooKeeper Java client or `zkCli.sh` invoked with the `reconfig` command.
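For example, a hedged sketch of the CLI calls such automation might issue (server IDs, addresses, and the connect string are placeholders; depending on your ZK security settings, `reconfig` may also require superuser authentication):

```bash
# Assumes zkCli.sh from ZooKeeper 3.5+ and reconfigEnabled=true on the ensemble.
# Add the newly launched node (ID 4) as a participant...
/opt/zookeeper/bin/zkCli.sh -server 10.0.1.10:2181 \
  reconfig -add "server.4=10.0.2.45:2888:3888;2181"

# ...and remove the member it replaced (ID 2).
/opt/zookeeper/bin/zkCli.sh -server 10.0.1.10:2181 reconfig -remove 2
```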
In Terraform, we define the Launch Template and ASG for ZooKeeper. For instance:
resource "aws_launch_template" "zk_lt" {
name = "zk-node"
image_id = var.zookeeper_ami_id # custom AMI with ZK
instance_type = "t3.small"
security_group_names = [aws_security_group.zk_sg.name]
iam_instance_profile = aws_iam_instance_profile.zk_profile.name
user_data = file("scripts/zk_bootstrap.sh") # cloud-init script to configure ZK
tags = { Name = "ZooKeeper" }
}
resource "aws_autoscaling_group" "zk_asg" {
name = "solr-zookeeper-asg"
min_size = 3
max_size = 5
desired_capacity = 3
health_check_type = "EC2"
vpc_zone_identifier = module.vpc.private_subnets
launch_template {
id = aws_launch_template.zk_lt.id
version = "$Latest"
}
tag = [{ key="Name", value="ZooKeeper", propagate_at_launch=true }]
lifecycle {
// Allow updates in place without destroying entire ASG
ignore_changes = [desired_capacity]
}
}
In the user-data script (`zk_bootstrap.sh`), we would do something like the following (see the sketch below):
- Install `java-11-openjdk` if it is not baked into the AMI.
- Install ZooKeeper (e.g., downloaded from S3 or baked into the AMI).
- Configure `myid` (e.g., derive an ID from the instance metadata service; a random value like `$((RANDOM%1000))` works in a pinch but risks collisions).
- Create the `zoo.cfg` with `tickTime`, `dataDir`, etc., and an initial `server.1, server.2, server.3` list (placeholder IPs can be used if they aren't known yet). Alternatively, each node could come up standalone and then reconfigure.
- Start the ZK service.
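A minimal sketch of `zk_bootstrap.sh` along those lines (package names, artifact locations, and the ID-derivation scheme are assumptions; the config files themselves were sketched earlier):

```bash
#!/bin/bash
# Hypothetical ZooKeeper bootstrap for Amazon Linux 2 -- adjust packages, versions, and paths.
set -euo pipefail

yum install -y java-11-amazon-corretto    # skip if baked into the AMI

# Install ZooKeeper (here from an S3 bucket you control -- an assumption).
aws s3 cp s3://your-artifact-bucket/apache-zookeeper-3.8.4-bin.tar.gz /tmp/
tar -xzf /tmp/apache-zookeeper-3.8.4-bin.tar.gz -C /opt
ln -sfn /opt/apache-zookeeper-3.8.4-bin /opt/zookeeper

# Derive a stable myid, e.g., from the last octet of the private IP (IMDSv2).
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4)
MYID=${PRIVATE_IP##*.}
mkdir -p /var/lib/zookeeper && echo "$MYID" > /var/lib/zookeeper/myid

# (Write zoo.cfg / zoo.cfg.dynamic as shown earlier, then start the service.)
/opt/zookeeper/bin/zkServer.sh start
```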
Note: Dynamic ensemble management is complex. If this complexity is not desirable, an alternative is to keep the ZK ASG at a fixed size (3) and use the ENI or static IP approach so the ensemble config doesn’t change on replacements ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=Here%20we%20use%20ENIs%20,quality%20or%20fitness%20for%20purpose)). That way, when a node is replaced, it assumes the identity of the old one. This is robust and simpler to automate with AWS Lambda hooking the ASG launch ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=1,Check%20it%20all%20works)). For the purpose of maximum automation (and since the question explicitly notes dynamic reconfiguration), we assume ZK 3.5+ with dynamic config turned on to handle changing membership.
For SolrCloud, we use an ASG to allow scaling out and in. All Solr nodes will run the Solr server in cloud mode (pointing to the ZooKeeper ensemble). Key considerations for Solr nodes are storage for indexes, instance initialization (rehydration), and software installation.
AMI and Installation: Ideally, use a custom AMI with Solr pre-installed (and the correct JDK). Another approach is to use user-data to install Java and Solr on first boot:
- The user-data can download the Solr installation package (e.g., Apache Solr TGZ) and extract it.
- Set up as a service (or run via cloud-init).
- Point Solr to ZooKeeper: the start command would include `-z` with the ZK connection string (e.g., `zk1:2181,zk2:2181,zk3:2181/solr`). This connection string can be templated into user-data (fetched from Terraform via the `aws_autoscaling_group.zk_asg` instances if known, or via DNS if we register the ZK instances in Route 53).
Persistent Storage for Solr (EBS vs EFS): Each Solr node should have fast disk storage for its indexes. AWS EBS volumes (gp3 or io2 SSD) are suitable. We attach an EBS volume to each instance (through the launch template) to store the `/var/solr` data. For example, we can define a block device mapping in the launch template for a secondary volume (so the root volume stays OS-only and a dedicated data volume holds the Solr index):
resource "aws_launch_template" "solr_lt" {
name = "solr-node"
image_id = var.solr_ami_id # custom AMI or Linux AMI
instance_type = var.solr_instance_type # e.g., r5.large
key_name = var.key_pair # for SSH, if needed
security_group_names = [aws_security_group.solr_sg.name]
iam_instance_profile = aws_iam_instance_profile.solr_profile.name
user_data = file("scripts/solr_bootstrap.sh")
block_device_mappings {
device_name = "/dev/xvdb"
ebs = {
volume_size = 200 // e.g., 200 GB for index
volume_type = "gp3"
delete_on_termination = true
}
}
tag_specifications {
resource_type = "instance"
tags = { Role = "SolrCloud" }
}
}
This ensures each Solr EC2 instance gets a 200 GB EBS volume attached as `xvdb` (adjust size/type as needed). The Terraform module by Edson Marquezani similarly provisions an EBS data disk per Solr instance for storing indexes ([GitHub - edsonmarquezani/terraform-solr-aws: Terraform+Puppet automation for Solr Cloud deployment on AWS EC2](https://github.com/edsonmarquezani/terraform-solr-aws#:~:text=It%20deploys%20the%20following%20resources,on%20AWS)). The Solr startup script should format and mount this volume (e.g., to `/var/solr`). Alternatively, bake the AMI such that the volume is auto-mounted.
Shared Storage for Backups: For daily backups, we use Amazon EFS. EFS provides a network file system accessible by all Solr nodes. Solr’s backup utility can write backups to a shared filesystem or to cloud storage. At time of writing, Solr supports writing backups to either EFS (NFS mount) or directly to S3 (with a plugin) depending on version ([How to migrate Apache Solr from the existing cluster to Amazon EKS - DEV Community](https://dev.to/haintkit/how-to-migrate-apache-solr-from-the-existing-cluster-to-amazon-eks-3b3l#:~:text=Setup%20Solr%20backup%20storage%20After,information%2C%20please%20visit%20this%20link)). We choose EFS for broad compatibility across Solr versions ([How to migrate Apache Solr from the existing cluster to Amazon EKS - DEV Community](https://dev.to/haintkit/how-to-migrate-apache-solr-from-the-existing-cluster-to-amazon-eks-3b3l#:~:text=Setup%20Solr%20backup%20storage%20After,information%2C%20please%20visit%20this%20link)). Terraform can create the EFS and mount targets easily:
module "efs" {
source = "terraform-aws-modules/efs/aws"
version = "1.2.0"
name = "solr-backup-efs"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
security_group_ids = [aws_security_group.solr_sg.id] // allow NFS from Solr SG
tags = { Name = "solr-backup-efs" }
}
This module sets up an EFS file system and mount targets in each subnet ([How to migrate Apache Solr from the existing cluster to Amazon EKS - DEV Community](https://dev.to/haintkit/how-to-migrate-apache-solr-from-the-existing-cluster-to-amazon-eks-3b3l#:~:text=To%20setup%20EFS%20as%20Solr%E2%80%99s,template%20will%20create%20EFS%20resource)). We ensure the Solr security group has NFS (port 2049) access to the EFS SG (the module can create its own SG). For example, add to the Solr SG ingress, if needed: `{ protocol = "tcp", from_port = 2049, to_port = 2049, security_groups = [module.efs.security_group_id] }`.
On each Solr instance, in user-data, mount the EFS file system to a known path (say `/mnt/solr-backups`). For Amazon Linux 2 or similar:
```bash
# in solr_bootstrap.sh (user-data)
yum install -y amazon-efs-utils
mkdir -p /mnt/solr-backups
efs_id="${efs_fs_id}"   # pass via a Terraform template or an EC2 tag
mount -t efs "$efs_id":/ /mnt/solr-backups
```
This uses the EFS mount helper. Alternatively, use an `nfs4` mount via the EFS DNS name. Ensure the instance's SG and the EFS SG allow this (handled above). You could also add an fstab entry for persistence, as shown below.
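For example, a one-line sketch of such an fstab entry (assuming `amazon-efs-utils` is installed and `$efs_id` holds the file system ID, as above):

```bash
# Remount /mnt/solr-backups automatically on reboot (uses the EFS mount helper)
echo "$efs_id:/ /mnt/solr-backups efs _netdev,tls 0 0" >> /etc/fstab
```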
Auto Scaling Group for Solr: With the launch template ready, define the ASG:

```hcl
resource "aws_autoscaling_group" "solr_asg" {
  name                      = "solr-cloud-asg"
  min_size                  = 2
  max_size                  = 10
  desired_capacity          = 3
  health_check_type         = "EC2"
  health_check_grace_period = 300   # give Solr time to start
  vpc_zone_identifier       = module.vpc.private_subnets

  launch_template {
    id      = aws_launch_template.solr_lt.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "SolrCloudNode"
    propagate_at_launch = true
  }
}
```
We start with 3 Solr nodes (min 2, desired 3, can scale out up to 10). SolrCloud has no single point of failure (no master), so 2 nodes is the bare minimum for fault tolerance (with replication), but 3 is a common starting point to distribute load.
Each Solr instance on boot (via `solr_bootstrap.sh`) should do the following (a sketch follows this list):
- Install Java (if not in the AMI).
- Mount the EBS volume to `/var/solr` and perhaps restore any data if needed (more on rehydration below).
- Mount EFS to `/mnt/solr-backups`.
- Start Solr in cloud mode, pointing to ZK: e.g., `solr start -c -z "$ZK_HOST" -m 4g`. The `ZK_HOST` can be constructed from known ZK ASG instances. If we used fixed ENIs or Route 53 entries for ZK, we can supply a static ZK connection string. If dynamic, the user-data might query the ZK Auto Scaling Group's instances via the AWS API (requires the IAM permission `ec2:DescribeInstances`) to build the connection string.
- Possibly create a Solr collection if none exists (for initial bootstrap).
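A minimal sketch of `solr_bootstrap.sh` along those lines (the Solr version, heap size, device name, EFS ID, and ZK hostnames are placeholders; it assumes ZK is reachable via stable DNS names):

```bash
#!/bin/bash
# Hypothetical Solr node bootstrap for Amazon Linux 2 -- adjust versions and paths.
set -euo pipefail
SOLR_VERSION=9.6.1
ZK_HOST="zk1.internal:2181,zk2.internal:2181,zk3.internal:2181/solr"   # assumed Route 53 names
EFS_ID=fs-0123456789abcdef0          # placeholder; inject via templatefile() or look up by tag

yum install -y java-17-amazon-corretto amazon-efs-utils

# Format and mount the EBS data volume for the index.
mkfs -t xfs /dev/xvdb || true        # mkfs.xfs refuses to overwrite an existing filesystem
mkdir -p /var/solr && mount /dev/xvdb /var/solr

# Mount the shared EFS backup location.
mkdir -p /mnt/solr-backups
mount -t efs "$EFS_ID":/ /mnt/solr-backups

# Install Solr as a service and start it in cloud mode, pointed at ZK.
curl -sL "https://archive.apache.org/dist/solr/solr/${SOLR_VERSION}/solr-${SOLR_VERSION}.tgz" -o /tmp/solr.tgz
tar xzf /tmp/solr.tgz -C /tmp "solr-${SOLR_VERSION}/bin/install_solr_service.sh" --strip-components=2
bash /tmp/install_solr_service.sh /tmp/solr.tgz -n -d /var/solr
echo "ZK_HOST=\"${ZK_HOST}\"" >> /etc/default/solr.in.sh
echo "SOLR_HEAP=\"4g\""       >> /etc/default/solr.in.sh
service solr start
```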
Safe Automated Backups: With EFS mounted, we schedule daily backups of each collection to EFS. Solr provides a Collections API for backups: `http://<solr_node>/solr/admin/collections?action=BACKUP&name=<backup_name>&collection=<coll>&location=/mnt/solr-backups`. We can automate this via a cron job on one of the Solr nodes or an external scheduler. Syncing EFS to S3 is handled separately and is straightforward with AWS DataSync or a cron job running the AWS CLI on an instance. The key is that Solr writes backups to the shared EFS location safely (all nodes see the same NFS share).
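A hedged sketch of such a nightly job (the collection name, backup naming scheme, and S3 bucket are assumptions):

```bash
#!/bin/bash
# /usr/local/bin/solr-nightly-backup.sh (illustrative) -- schedule via cron, e.g.:
#   0 2 * * * root /usr/local/bin/solr-nightly-backup.sh
set -euo pipefail

COLLECTION=my_collection                     # assumed collection name
NAME="${COLLECTION}-$(date +%Y%m%d)"

# Ask SolrCloud to back the collection up onto the shared EFS mount.
curl -sf "http://localhost:8983/solr/admin/collections?action=BACKUP&name=${NAME}&collection=${COLLECTION}&location=/mnt/solr-backups"

# Optionally sync the EFS backup location to S3 for off-site copies.
aws s3 sync /mnt/solr-backups "s3://your-solr-backup-bucket/solr-backups/"
```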
Best practice is to ensure consistent backups:
- Use Solr’s backup API which handles snapshotting index files (only capturing committed index segments) ([Making and Restoring Backups | Apache Solr Reference Guide 8.9](https://solr.apache.org/guide/8_9/making-and-restoring-backups.html#:~:text=8,are%20visible%20in%20search)). All nodes must mount the backup path at the same location path ([Making and Restoring Backups | Apache Solr Reference Guide 8.2](https://solr.apache.org/guide/8_2/making-and-restoring-backups.html#:~:text=8,action%3DBACKUP%20%3A%20This%20command)). The Solr Reference Guide notes that SolrCloud backup/restore “requires a shared file system mounted at the same path on all nodes, or HDFS” ([Making and Restoring Backups | Apache Solr Reference Guide 8.2](https://solr.apache.org/guide/8_2/making-and-restoring-backups.html#:~:text=8,action%3DBACKUP%20%3A%20This%20command)) – this is exactly what our EFS setup provides.
- Stagger backups or run them on a dedicated node to avoid load spikes. Perhaps designate one Solr node (e.g., the first in the ASG) via an instance tag to perform the backup daily.
Additionally, AWS Backup can be used to take EFS-level backups (which would capture the entire backup files system, providing point-in-time recovery) ([Backing Up and Restoring Solr Collections | CDP Public Cloud](https://docs.cloudera.com/runtime/7.2.0/search-managing/topics/search-backing-up-and-restoring-solr-collections.html#:~:text=index%20is%20corrupted%20due%20to,a%20software%20bug)). This can complement Solr’s own backups.
Rehydration: In an auto-scaling or auto-recovery scenario, a new Solr node may come up empty (no indexes) and needs to populate data. SolrCloud is designed to replicate index data to new replicas automatically when they are added to a collection. Our strategy:
- Ensure a sufficient replication factor for each collection so that when one node goes down, its data lives on other nodes. For example, if you have 3 Solr nodes, use replication factor 2 or 3 for critical collections.
- When an instance fails and the ASG replaces it, the new node will join the SolrCloud (because it connects to ZooKeeper on startup). However, by default it won’t have any data until we create new replicas on it.
- We can automate the creation of new replicas on new nodes. Solr 7 introduced an autoscaling triggers framework to handle this (e.g., a nodeAdded trigger to add replicas) ([Autoscale SolrCloud on Kubernetes with Fargate | by Abhijit Choudhury | Medium](https://abhijit-choudhury.medium.com/autoscale-solrcloud-on-kubernetes-with-fargate-b1cfbcd0e9a7#:~:text=%22set,120s)) ([Autoscale SolrCloud on Kubernetes with Fargate | by Abhijit Choudhury | Medium](https://abhijit-choudhury.medium.com/autoscale-solrcloud-on-kubernetes-with-fargate-b1cfbcd0e9a7#:~:text=match%20at%20L207%202,120s%20once%20it%20is%20lost)). An example from a Kubernetes autoscaling scenario: when a new Solr node was added, a trigger waited 120 seconds then added a replica of each shard to the new node; when a node was lost, a trigger removed it after a delay ([Autoscale SolrCloud on Kubernetes with Fargate | by Abhijit Choudhury | Medium](https://abhijit-choudhury.medium.com/autoscale-solrcloud-on-kubernetes-with-fargate-b1cfbcd0e9a7#:~:text=match%20at%20L207%202,120s%20once%20it%20is%20lost)). However, Solr’s internal autoscaling framework was deprecated and removed in Solr 9 ([GitHub - tboeghk/solr-aws-observability: Experiments with Solr autoscaling](https://github.com/tboeghk/solr-aws-observability#:~:text=The%20Solr%20Autoscaling%20Framework%20is,to%20test%20Solr%20Autoscaling%20policies)), so we implement this logic externally.
External Orchestration for Scaling Events: We leverage AWS capabilities:
- Scale-Out (Node Added): We can use an ASG lifecycle hook or a CloudWatch event when a new instance launches. This can trigger a Lambda function that calls Solr's Collections API to add replicas to the new node. The Lambda would need to know which collections exist – it could call ZK or the Solr API to list collections, then for each shard call `ADDREPLICA` specifying the new node's name (a sketch of the call follows this list). Another approach is to have the new Solr node itself request data: for example, on startup, the user-data script could call an API to ask the cluster to replicate data to it. Solr doesn't automatically do this for existing shards (unless you use the deprecated policy triggers or the specific "autoAddReplicas" feature for high availability).
  - AutoAddReplicas: Solr has a cluster-wide setting `autoAddReplicas` which, if true, will automatically create a new replica when a node is lost (to maintain the replication factor) ([SolrCloud Autoscaling Automatically Adding Replicas - Apache Solr](https://solr.apache.org/guide/8_11/solrcloud-autoscaling-auto-add-replicas.html#:~:text=Solr%20solr,specified%20at%20the%20time)). This helps in scale-in (node-loss) scenarios, but not for scale-out: it ensures that if one node dies, its replicas are recreated on the remaining nodes to keep the replication factor. We might enable this in `solr.xml` cluster properties for resilience.
  - For scale-out, since a new node means we potentially want to increase total replicas (not just recover lost ones), a custom script or Lambda is needed. We could tie this to the ASG's instance launch notification.
- Scale-In (Node Removal): When scaling in (or when an instance is terminating), we want to decommission the Solr node gracefully. The ASG can be configured with a lifecycle hook on instance termination ([aws-lambda-lifecycle-hooks-function/README.md at master · aws-samples/aws-lambda-lifecycle-hooks-function · GitHub](https://github.com/aws-samples/aws-lambda-lifecycle-hooks-function/blob/master/README.md#:~:text=When%20an%20Auto%20Scaling%20group,to%20back%20up%20your%20data)). When an instance is about to terminate, the hook can invoke a Lambda that:
  - Calls Solr's Collections API to clear the node out of the cluster. A convenient sequence is `REPLACENODE` (to recreate the node's replicas on another node, if you need to preserve replica counts) followed by `DELETENODE` (`http://<any_solr>/solr/admin/collections?action=DELETENODE&node=<host>:<port>_solr`), which removes the replicas hosted on that node from the cluster state; with replication factor >= 2 the data remains available on the surviving replicas. Decommissioning a node this way before termination is a best practice for SolrCloud downscaling.
  - Alternatively, the Lambda could initiate a backup of that node's data to EFS/S3 as a precaution (especially if the replication factor was 1, to avoid data loss).
  - Completes the lifecycle hook so the instance terminates. (During the hook, the instance can remain running up to a timeout while these actions finish.)
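As referenced above, a hedged sketch of the scale-out call such a Lambda or script might make for each shard (the collection, shard, and node names are placeholders):

```bash
# Add a replica of shard1 of "my_collection" onto the newly launched node.
# The node name is Solr's internal identifier, typically <host>:8983_solr.
NEW_NODE="10.0.2.57:8983_solr"
curl -sf "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=my_collection&shard=shard1&node=${NEW_NODE}"
```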
Using Lambda with lifecycle hooks gives us a chance to run custom commands just in time before an instance is killed ([aws-lambda-lifecycle-hooks-function/README.md at master · aws-samples/aws-lambda-lifecycle-hooks-function · GitHub](https://github.com/aws-samples/aws-lambda-lifecycle-hooks-function/blob/master/README.md#:~:text=When%20an%20Auto%20Scaling%20group,to%20back%20up%20your%20data)) ([aws-lambda-lifecycle-hooks-function/README.md at master · aws-samples/aws-lambda-lifecycle-hooks-function · GitHub](https://github.com/aws-samples/aws-lambda-lifecycle-hooks-function/blob/master/README.md#:~:text=You%20can%20configure%20your%20Auto,machine%20and%20upload%20your%20logs)). AWS’s sample uses SSM Run Command to perform operations on the instance (e.g., upload logs) ([aws-lambda-lifecycle-hooks-function/README.md at master · aws-samples/aws-lambda-lifecycle-hooks-function · GitHub](https://github.com/aws-samples/aws-lambda-lifecycle-hooks-function/blob/master/README.md#:~:text=a%20task%20to%20complete%2C%20or,worker%20to%20complete%20a%20task)). In our case, the Lambda could either call the Solr API (if accessible) or use SSM to run a script on the instance to invoke Solr’s API from localhost. This requires the instance IAM role to allow SSM Run Command.
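A hedged sketch of the decommission script such a hook might run (for example via SSM) on the terminating instance; the hook and ASG names are assumptions, and it assumes the AWS CLI and instance credentials are available:

```bash
#!/bin/bash
# Illustrative decommission script for a terminating Solr node.
set -euo pipefail

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4)
NODE_NAME="${PRIVATE_IP}:8983_solr"

# Remove this node's replicas from the cluster state before it goes away.
curl -sf "http://localhost:8983/solr/admin/collections?action=DELETENODE&node=${NODE_NAME}"

# Tell the ASG the hook's work is done so termination can proceed.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name solr-terminate-hook \
  --auto-scaling-group-name solr-cloud-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID"
```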
Rolling Updates: To update the Solr version or OS, create a new Launch Template version (with a new AMI or user-data) and then have the ASG gradually replace instances (either with the ASG Instance Refresh feature, or manually by rotating nodes as described below). Because our setup is automated for joining/leaving, we can do a rolling upgrade node by node:
- Increase desired capacity by 1 (add a new node with new version).
- Wait for it to join, then trigger replica additions (if not automatic).
- Choose an old node to remove: issue DELETENODE and then decrease desired capacity by 1 (ASG terminates that node).
- Repeat until all nodes are rotated. This achieves zero-downtime upgrades if done carefully with replication.
Terraform can help with rolling updates via `create_before_destroy` or the ASG Instance Refresh feature (you can define a refresh that replaces instances whenever the launch template changes), as sketched below.
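For example, a minimal sketch of an `instance_refresh` block added to the Solr ASG definition (the percentage and warmup values are illustrative):

```hcl
resource "aws_autoscaling_group" "solr_asg" {
  # ... existing arguments from the ASG definition above ...

  # A launch template change (new AMI / user-data) triggers a rolling replacement.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90    # replace roughly one node at a time on a small cluster
      instance_warmup        = 300   # match the Solr health check grace period
    }
  }
}
```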
Maintaining an even distribution of shards and replicas is key to SolrCloud performance, especially during scaling events:
- Even Shard Distribution: When adding nodes (scale-out), you should redistribute some replicas to the new node to balance load and storage. If using the trigger or external-script approach above, ensure that each new node gets at most one replica of any given shard (to maximize fault tolerance). For example, if you have a collection with 1 shard and replication factor 3 on a 3-node cluster, adding a 4th node won't automatically use it. You could increase the replication factor to 4 (adding one replica on the new node), or keep RF=3 but move one of the existing replicas to the new node (same count, but spread out). Solr's Collections API supports `MOVEREPLICA` and related replica-placement commands in Solr 8/9 (or you can delete and add replicas explicitly).
- Preferred Leaders: After scaling events, consider redistributing leader roles. By default, a newly added replica is not the leader. If you consistently add nodes, you can mark replicas as `preferredLeader` (via `ADDREPLICAPROP`) and invoke the `REBALANCELEADERS` command to balance which nodes host shard leaders.
- Scaling Down Safely: Before removing a node, ensure no unique data resides only on that node. With replication factor >= 2, this is satisfied (each shard has another replica elsewhere). If RF=1, you must back up or migrate that data first (it is better to avoid RF=1 in auto-scaling scenarios). Do not count on `DELETENODE` to protect you from removing a shard's last replica – in practice, always run with RF >= 2 for collections that need high availability.
- SolrCloud Autoscaling Policies (Deprecated): In older versions, one could set policies on disk usage or node load to trigger moving replicas. Since those were removed in newer versions, external monitoring is needed. You can monitor metrics like index size per node (via JMX/Prometheus) and trigger admin procedures if one node becomes too full relative to the others (e.g., using the `UTILIZENODE` command in Solr 8 to spread cores).
In summary, on scale-out: add replicas to new nodes (either by increasing replication factor or relocating replicas). On scale-in: remove replicas from the departing node (ensure others have them) before termination. These steps can be automated with a combination of AWS hooks and Solr APIs.
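For example, a hedged sketch of relocating a replica onto a new node with `MOVEREPLICA` (the collection, replica, and target node names are placeholders):

```bash
# Move an existing replica of "my_collection" onto the newly added node.
curl -sf "http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=my_collection&replica=core_node3&targetNode=10.0.3.21:8983_solr"
```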
By using Terraform to codify this architecture, we achieve a reproducible and automated deployment:
- Infrastructure-as-Code: VPC, subnets, ASGs, launch templates, etc., are all defined in Terraform. This includes provisioning of persistent storage like EFS for backups ([How to migrate Apache Solr from the existing cluster to Amazon EKS - DEV Community](https://dev.to/haintkit/how-to-migrate-apache-solr-from-the-existing-cluster-to-amazon-eks-3b3l#:~:text=To%20setup%20EFS%20as%20Solr%E2%80%99s,template%20will%20create%20EFS%20resource)) and EBS volumes for indexes, as well as all required security plumbing.
- Self-Healing: ASGs ensure that if a Solr or ZK instance goes down, a new one comes up. Combined with ZooKeeper dynamic membership, the ZK ensemble continues to operate without manual intervention when nodes are replaced (no more fragile static configs or midnight pages) ([ZooKeeper Dynamic Reconfiguration](https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#:~:text=Starting%20with%203,interface%20for%20administrators%20and%20no)).
- Automation of Cluster Changes: Lifecycle hooks and possibly Lambda/SSM automation handle Solr cluster changes. This maximizes uptime by gracefully integrating new nodes and removing old ones. As noted, waking up people at night to fix cluster issues is undesirable – “the right thing to do here is an automated approach” ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=It%E2%80%99s%20worth%20noting%20at%20this,then%2C%20access%20is%20given%20sparingly)).
- Backup and Restore: Daily backups to EFS ensure that even in a worst-case scenario (loss of all nodes or accidental index corruption), data can be restored. The EFS can also be synced to S3 for off-site storage. For ZooKeeper data (Solr configs, etc.), one can also take periodic backups. ZK’s snapshot files could be stored on EBS and backed up via EBS snapshots or copied to S3 using a cron (as noted by one AWS ZK guide, a python zk-shell can export ZK state to S3 ([Terraforming Zookeeper on AWS. Hey, I don’t want to start with… | by Ravindra Chellubani | Medium](https://ravindra-chellubani.medium.com/terraform-zookeeper-on-aws-78be69bbf81e#:~:text=How%20to%20backup%2Frestore%20Zookeeper%3F)) ([Terraforming Zookeeper on AWS. Hey, I don’t want to start with… | by Ravindra Chellubani | Medium](https://ravindra-chellubani.medium.com/terraform-zookeeper-on-aws-78be69bbf81e#:~:text=For%20backup%20%2F%20Restore%20the,restore%20from%20AWS%20S3))).
- Scaling Policies: We can tie ASG scaling policies to CloudWatch metrics. For example, scale out Solr ASG when CPU > 70% or query rate high; scale in when low. With the automation in place, the cluster will redistribute data accordingly. ZooKeeper ensemble size should generally remain constant (3 or 5), scaling ZK is rarely needed for performance (only for quorum resilience).
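For example, a minimal sketch of a target-tracking policy on the Solr ASG (the 70% CPU target is illustrative):

```hcl
# Target-tracking policy: keep average CPU around 70% across the Solr ASG.
resource "aws_autoscaling_policy" "solr_cpu_target" {
  name                   = "solr-cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.solr_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70
  }
}
```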
Lastly, we can reference existing community projects and modules as starting points:
- The Solr AWS Terraform module by Edson Marquezani which uses Terraform + Puppet to configure SolrCloud on EC2 (it sets up EC2 instances with EBS disks, security groups, etc.) ([GitHub - edsonmarquezani/terraform-solr-aws: Terraform+Puppet automation for Solr Cloud deployment on AWS EC2](https://github.com/edsonmarquezani/terraform-solr-aws#:~:text=It%20deploys%20the%20following%20resources,on%20AWS)) – useful for ideas on provisioning and OS-level config.
- A Solr AWS autoscaling experiment repo by Tobias Boehme which demonstrates using Terraform to quickly spin up Zookeeper and Solr, and even tests Solr’s now-removed autoscaling triggers ([GitHub - tboeghk/solr-aws-observability: Experiments with Solr autoscaling](https://github.com/tboeghk/solr-aws-observability#:~:text=Solr%20AWS%20observability%20%26%20autoscaling,experiments)) ([GitHub - tboeghk/solr-aws-observability: Experiments with Solr autoscaling](https://github.com/tboeghk/solr-aws-observability#:~:text=,0)). It shows how to use cloud-init and systemd to manage Solr instances in an ASG, and observes that “The autoscaling framework in its current form is deprecated and will be removed in Solr 9.0.” ([GitHub - tboeghk/solr-aws-observability: Experiments with Solr autoscaling](https://github.com/tboeghk/solr-aws-observability#:~:text=The%20Solr%20Autoscaling%20Framework%20is,to%20test%20Solr%20Autoscaling%20policies)), reinforcing our approach to manage scaling externally.
- The OpenSource Connections blog on Zookeeper resiliency in AWS ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=To%20ensure%20your%20Zookeeper%20ensemble,blog%20post%20by%20Credera%20here)) which inspired using ASGs for ZK and highlighted the need for strategies (ENI or dynamic config) to handle changing IPs.
- HashiCorp and community Terraform modules for AWS Autoscaling and EFS (e.g., terraform-aws-autoscaling and terraform-aws-efs) which we leveraged to simplify resource creation.
By following these practices, we achieve a fully automated, resilient SolrCloud deployment on AWS:
- Terraform provisions all resources declaratively.
- Dedicated ASGs keep Solr and ZK isolated yet coordinated.
- Daily backups to EFS provide data safety (with offsite sync).
- Dynamic scaling is handled with minimal manual intervention, using SolrCloud’s APIs and AWS hooks for shard rebalancing and node membership changes.
- ZooKeeper dynamic reconfiguration (or stable IP workarounds) ensures Solr is always aware of the ensemble even as it heals itself, eliminating the traditional fragility of ZK on ephemeral infrastructure.
With this setup, your SolrCloud cluster can grow or shrink based on demand while maintaining search availability and data integrity. All infrastructure changes go through Terraform, making it easy to version, review, and replicate the environment in different regions or for disaster recovery. This provides a solid foundation for a search platform that maximizes automation and resiliency.
Sources: The solution builds upon insights from community solutions and official docs, including Terraform modules and real-world Solr on AWS scenarios ([GitHub - edsonmarquezani/terraform-solr-aws: Terraform+Puppet automation for Solr Cloud deployment on AWS EC2](https://github.com/edsonmarquezani/terraform-solr-aws#:~:text=It%20deploys%20the%20following%20resources,on%20AWS)) ([How to migrate Apache Solr from the existing cluster to Amazon EKS - DEV Community](https://dev.to/haintkit/how-to-migrate-apache-solr-from-the-existing-cluster-to-amazon-eks-3b3l#:~:text=To%20setup%20EFS%20as%20Solr%E2%80%99s,template%20will%20create%20EFS%20resource)) ([Zookeeper Resiliency for Solr Cloud in AWS, using Auto-Scaling Groups - OpenSource Connections](https://opensourceconnections.com/blog/2019/06/11/zookeeper-resiliency-aws/#:~:text=To%20ensure%20your%20Zookeeper%20ensemble,blog%20post%20by%20Credera%20here)) ([GitHub - tboeghk/solr-aws-observability: Experiments with Solr autoscaling](https://github.com/tboeghk/solr-aws-observability#:~:text=The%20Solr%20Autoscaling%20Framework%20is,to%20test%20Solr%20Autoscaling%20policies)) ([Autoscale SolrCloud on Kubernetes with Fargate | by Abhijit Choudhury | Medium](https://abhijit-choudhury.medium.com/autoscale-solrcloud-on-kubernetes-with-fargate-b1cfbcd0e9a7#:~:text=match%20at%20L207%202,120s%20once%20it%20is%20lost)), ensuring that best practices in SolrCloud management (backup/restore, safe scaling, ZooKeeper stability) are incorporated throughout.