ElasticSearch essentials; part 2

Advanced Elasticsearch administration: setup, memory configuration, performance troubleshooting, index rollover, and node upgrades.

Table of Contents

  1. Basic Setup
  2. Determine How Much RAM is Needed
  3. Configure Heap
  4. POV/Single Node Deployments
  5. Issues with Performance
  6. Rolling Over an Index
  7. Upgrading/Patching Nodes
  8. Single Node Deployment

Basic Setup

Setting up Elasticsearch involves installing the software, configuring essential settings, and ensuring the environment is optimized for performance. Begin by downloading the appropriate version of Elasticsearch for your operating system from the official website.

Installation Steps

  1. Download and Extract: Download the Elasticsearch distribution and extract it to your desired location.

  2. Configure Basic Settings: Edit config/elasticsearch.yml to set:

    • Cluster name
    • Node name
    • Network host (if needed)
    • Discovery settings (for multi-node clusters)
  3. Set Environment Variables: Configure ES_HOME and ensure Java is properly installed and accessible.

  4. Start Elasticsearch: Run the startup script:

    ./bin/elasticsearch
  5. Verify Installation: Check that Elasticsearch is running:

    curl http://localhost:9200

Essential Configuration Files

  • config/elasticsearch.yml: Main configuration file
  • config/jvm.options: JVM heap and garbage collection settings
  • config/log4j2.properties: Logging configuration
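As a concrete starting point, the basic settings from step 2 can be written out in one go. This is a minimal sketch; the cluster name, node name, and host names below are placeholder values, not defaults, and must be adjusted for your environment:

```shell
# Write a minimal example config/elasticsearch.yml.
# All values below are illustrative placeholders.
mkdir -p config
cat > config/elasticsearch.yml <<'EOF'
cluster.name: my-demo-cluster
node.name: node-1
network.host: 0.0.0.0
discovery.seed_hosts: ["host1", "host2"]
cluster.initial_master_nodes: ["node-1"]
EOF
```

The last two lines only apply to multi-node clusters; a single-node deployment replaces them with `discovery.type: single-node` (covered later in this article).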

Determine How Much RAM is Needed

Memory requirements for Elasticsearch depend on your workload, data volume, and cluster size. Proper RAM allocation is critical for performance and stability.

General Guidelines

As per AppDynamics recommendations for Event Service deployments (which use Elasticsearch), allocate half of the available RAM to the Elasticsearch process, with:

  • Minimum: 7 GB heap
  • Maximum: 31 GB heap
  • Optimal Production: At least 62 GB of RAM on the system/host machine

Factors Affecting RAM Requirements

  1. License Units: Higher license unit consumption requires more RAM
  2. Event Volume: Transaction Analytics and Log Analytics events impact memory needs
  3. Cluster Size: More nodes can distribute the load, but each node still needs adequate RAM
  4. Query Complexity: Complex queries and aggregations consume more memory

Production Recommendations

For production environments:

  • Minimum: 62 GB RAM per node
  • Optimal: 122 GB RAM per node (i2.4xlarge equivalent)
  • High-Volume: 244 GB RAM per node (i2.8xlarge equivalent)
  • Node Count: At least 3 nodes for production (avoid single-node in production)

Systems with less than 62 GB RAM should be used only for POV and testing purposes.

Checking Current Memory Usage

To check how much RAM the Elasticsearch process is using:

# Find the Elasticsearch process ID
netstat -anp | grep 9200 | grep LISTEN

# Check memory consumption (replace <pid> with actual process ID)
echo 0 $(awk '/Rss/ {print "+", $2}' /proc/<pid>/smaps) | bc

The output is in KB. Compare this against your heap configuration and available system RAM.
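The awk/bc pipeline above can also be written as a single awk pass that sums the Rss lines and reports the result in MB directly (a small sketch; smaps reports Rss in KB):

```shell
# Sum all Rss lines from a smaps file and print the total in MB.
# Usage: rss_mb /proc/<pid>/smaps
rss_mb() {
  awk '/^Rss:/ {kb += $2} END {printf "%d\n", kb / 1024}' "$1"
}
```

For example, `rss_mb /proc/1234/smaps` prints the resident memory of process 1234 in MB.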

Configure Heap

Heap memory configuration determines the amount of memory allocated to the Java Virtual Machine (JVM) heap. The heap is the runtime data area from which memory for all class instances and arrays is allocated.

Heap Size Guidelines

  • Allocation Rule: Set heap to 50% of available RAM, up to a maximum of 31 GB
  • Why 31 GB?: Above roughly 31 GB, the JVM can no longer use compressed object pointers (compressed oops), so every object reference takes a full 64 bits, which increases memory overhead and wastes part of the larger heap
  • Why 50%?: The remaining RAM is needed for:
    • Operating system operations
    • File system cache (Lucene uses this for better performance)
    • Other system processes
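The allocation rule above (50% of RAM, capped at 31 GB) is simple enough to express as a small helper. This is a sketch; it takes total system RAM in GB and prints the value to use for both -Xms and -Xmx:

```shell
# Suggest a heap size in GB: half of total RAM, capped at 31 GB.
suggest_heap_gb() {
  local total_gb=$1
  local heap=$(( total_gb / 2 ))
  [ "$heap" -gt 31 ] && heap=31
  echo "$heap"
}
```

For example, `suggest_heap_gb 16` prints 8, while `suggest_heap_gb 122` prints 31 because of the cap.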

Configuration Steps

  1. Edit config/jvm.options:

    # Set initial and maximum heap to the same value (prevents resizing)
    -Xms31g
    -Xmx31g
  2. For systems with less RAM, adjust accordingly:

    # Example for 16 GB system: allocate 7-8 GB to heap
    -Xms7g
    -Xmx7g
  3. Restart Elasticsearch after making changes:

    ./bin/elasticsearch -d

Important Notes

  • Always set -Xms and -Xmx to the same value to prevent heap resizing during runtime
  • Never allocate more than 50% of RAM to heap
  • Never exceed 31 GB heap size
  • Ensure total system RAM is sufficient (heap + OS + file cache)

Verifying Heap Configuration

After restart, verify the heap settings:

# Check JVM settings
curl -s 'http://localhost:9200/_nodes/jvm?pretty'

Look for heap_max_in_bytes in the response to confirm your settings.
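Rather than eyeballing the byte count, you can extract and convert it. A sketch using grep/awk against a saved response file (the JSON value in the test is illustrative):

```shell
# Extract heap_max_in_bytes from a saved _nodes/jvm response and print GB.
# Usage: heap_gb_from_response /tmp/jvm.json
heap_gb_from_response() {
  grep -o '"heap_max_in_bytes" *: *[0-9]*' "$1" \
    | awk -F': *' '{printf "%d\n", $2 / (1024*1024*1024)}'
}
```

Save the curl output to a file first, e.g. `curl -s 'http://localhost:9200/_nodes/jvm?pretty' > /tmp/jvm.json`, then run the helper on it.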

POV/Single Node Deployments

Proof of Value (POV) and single-node deployments are suitable for development, testing, and evaluation purposes, but not recommended for production.

When to Use Single-Node

  • Development and testing environments
  • POV demonstrations
  • Learning and experimentation
  • Low-volume, non-critical applications

Limitations

  1. No High Availability: Single node failure means complete cluster unavailability
  2. No Data Replication: Risk of data loss if the node fails
  3. Performance Constraints: All roles run on one node, causing resource contention
  4. No Rolling Upgrades: Must take full downtime for maintenance

Single-Node Configuration

For a single-node cluster, set in config/elasticsearch.yml:

discovery.type: single-node

This tells the node to elect itself as master and skips the multi-node discovery bootstrap checks, allowing the cluster to form with just one node.

Resource Requirements for POV

Even for POV, ensure minimum resources:

  • RAM: At least 16 GB (8 GB heap minimum)
  • CPU: 4+ cores
  • Storage: SSD recommended, sufficient space for data

Issues with Performance

Performance issues in Elasticsearch can manifest as slow queries, high CPU usage, memory pressure, or cluster instability.

Common Performance Issues

1. High CPU Usage

Symptoms:

  • Slow query responses
  • High CPU utilization on nodes
  • Cluster lag

Troubleshooting:

# Check node stats
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m'

# Check thread pool stats
curl -s 'http://localhost:9200/_cat/thread_pool?v'

Solutions:

  • Optimize queries (avoid deep pagination, use filters instead of queries where possible)
  • Increase node resources (CPU/RAM)
  • Add more nodes to distribute load
  • Review and optimize index mappings

2. Memory Pressure

Symptoms:

  • Frequent garbage collection pauses
  • Out of Memory (OOM) errors
  • Process termination by OS

Troubleshooting:

# Check if Elasticsearch process was killed
dmesg | grep <process-id>

# Look for OOM killer messages
dmesg | grep -i "out of memory"
dmesg | grep -i "oom-kill"

Example OOM Output:

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1010.slice/session-978.scope,task=java,pid=495838,uid=1010
Out of memory: Killed process 495838 (java) total-vm:25599008kB, anon-rss:19702248kB
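Given a kernel log line like the one above, the killed PID and its resident memory can be pulled out mechanically (a sketch parsing the "Killed process" line format shown in the sample):

```shell
# Extract the PID and anon RSS (in kB) from an OOM-killer "Killed process" line.
oom_pid()    { sed -n 's/.*Killed process \([0-9]*\).*/\1/p'; }
oom_rss_kb() { sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p'; }
```

For example, `dmesg | grep 'Killed process' | oom_pid` prints the PID of the most recently OOM-killed process; comparing the anon-rss value against your heap setting shows how far past the configured heap the process had grown.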

Solutions:

  • Increase heap size (within 31 GB limit)
  • Increase system RAM
  • Reduce index shard count
  • Optimize queries to use less memory
  • Enable circuit breakers to prevent OOM

3. Slow Queries

Symptoms:

  • Query timeouts
  • Slow search responses
  • High latency

Troubleshooting:

# Enable slow log
PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}

# Slow queries are written to the search slow log file on each node, e.g.:
tail -f logs/<cluster-name>_index_search_slowlog.log

Solutions:

  • Optimize query structure
  • Use filters instead of queries for exact matches
  • Add appropriate index mappings
  • Consider using aggregations efficiently
  • Review shard allocation

4. Cluster Health Issues

Symptoms:

  • Yellow or red cluster status
  • Unassigned shards
  • Node failures

Troubleshooting:

# Check cluster health
curl -s 'http://localhost:9200/_cat/health?v'

# Check for unassigned shards
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED

# Check node status
curl -s 'http://localhost:9200/_cat/nodes?v'

Solutions:

  • Ensure cluster is in “green” status before operations
  • Fix unassigned shards
  • Ensure sufficient nodes for replica allocation
  • Check disk space and permissions
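To turn the `_cat/shards` output into a number you can alert on, count the rows whose state column reads UNASSIGNED (a sketch; the sample lines in the test are illustrative of the `_cat/shards` column layout: index, shard, prirep, state, ...):

```shell
# Count UNASSIGNED shards in _cat/shards output read on stdin
# (state is the 4th column).
count_unassigned() {
  awk '$4 == "UNASSIGNED" {n++} END {print n+0}'
}
```

Typical usage: `curl -s 'http://localhost:9200/_cat/shards' | count_unassigned`; a non-zero result means replicas or primaries could not be allocated.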

Performance Monitoring

Regular monitoring helps identify issues early:

# Cluster health
curl -s 'http://localhost:9200/_cat/health?v'

# Node stats
curl -s 'http://localhost:9200/_nodes/stats?pretty'

# Index stats
curl -s 'http://localhost:9200/_cat/indices?v'

# Check processes
netstat -tulpn | grep 9200

Rolling Over an Index

Index rollover is the process of creating a new index when the current one reaches a certain size or age. This is essential for managing large datasets and maintaining performance.

When Indexes Roll Over

Indexes typically roll over when:

  • Average shard size breaches a threshold
  • Index age exceeds the data retention period
  • Manual rollover is triggered

Manual Index Rollover

For Event Service deployments using Elasticsearch 8, use the following curl command:

curl -XPOST http://<host>:<port>/v1/admin/cluster/<cluster>/index/<index>/rollover \
  -H"Authorization: Basic <key>" \
  -H"Content-Type: application/json" \
  -H"Accept: application/json" \
  -d '{"numberOfShards": "2"}'

Parameters

Replace the following values:

  • <host>: Hostname (use localhost if running from Event Service CLI)
  • <port>: Event Service port (default 9080)
    grep ad.dw.http.port events-service-api-store.properties
  • <cluster>: Cluster name
    grep ad.es.cluster.name events-service-api-store.properties
  • <index>: Index name to rollover
    curl http://localhost:9200/_cat/indices?v
  • <key>: Base64 encoded ad.accountmanager.key.ops from properties file
    echo -n "<value-from-ad.accountmanager.key.ops>" | base64
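As a worked example of the key encoding step, suppose the ops key value were the hypothetical string `secret` (stand-in only; use the real value from your properties file). Note that the newline must be suppressed, otherwise it gets encoded into the key and authorization fails:

```shell
# Base64-encode the ops key. printf (or echo -n) avoids encoding a
# trailing newline into the key. "secret" is a placeholder value.
key_b64=$(printf '%s' "secret" | base64)
echo "$key_b64"
# prints: c2VjcmV0
```

The resulting string is what goes after `Basic ` in the Authorization header.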

Pre-Rollover Checklist

Before rolling over any index, ensure cluster health is green:

curl -s 'http://localhost:9200/_cat/health?v'

Verify:

  • Cluster status is “green”
  • No unassigned shards
  • All nodes are healthy

After rollover, verify again:

curl -s 'http://localhost:9200/_cat/health?v'
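The before/after checks can be wrapped in a small guard that succeeds only when the status column of the `_cat/health?v` output reads green (a sketch; status is the 4th column of the data row):

```shell
# Succeed only if _cat/health?v output (header + one data row) reports green.
is_green() {
  awk 'NR == 2 { ok = ($4 == "green") } END { exit (ok ? 0 : 1) }'
}
```

Typical usage before a rollover: `curl -s 'http://localhost:9200/_cat/health?v' | is_green || { echo "cluster not green, aborting"; exit 1; }`.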

Common Rollover Scenarios

  1. After Migration: If data older than retention period was migrated, indexes may not roll over automatically
  2. Size Threshold: When index size exceeds recommended limits
  3. Maintenance: During scheduled maintenance windows

Index Lifecycle Management (ILM)

For automated rollover, consider implementing ILM policies:

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

Upgrading/Patching Nodes

Regular updates ensure access to new features, performance improvements, and security patches. Follow a rolling upgrade process to minimize downtime.

Pre-Upgrade Checklist

  1. Backup Data: Always create a snapshot before upgrading

    # Create snapshot repository
    PUT /_snapshot/my_backup
    {
      "type": "fs",
      "settings": {
        "location": "/path/to/backup"
      }
    }
    
    # Create snapshot
    PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
  2. Check Compatibility: Ensure plugins and configurations are compatible with the new version

  3. Review Release Notes: Check for breaking changes and migration requirements

  4. Verify Cluster Health: Ensure cluster is in “green” status

    curl -s 'http://localhost:9200/_cat/health?v'

Rolling Upgrade Process

For Event Service deployments, follow these steps:

  1. Disable Shard Allocation (on the node to be upgraded):

    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "none"
      }
    }
  2. Stop Elasticsearch on the node:

    # Find process
    netstat -anp | grep 9200
    
    # Stop gracefully
    kill -SIGTERM <pid>
  3. Upgrade/Patch the node:

    • Install new version
    • Update configuration files if needed
    • Verify Java version compatibility
  4. Start Elasticsearch:

    ./bin/elasticsearch -d
  5. Verify Node Joined:

    curl -s 'http://localhost:9200/_cat/nodes?v'
  6. Re-enable Shard Allocation:

    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "all"
      }
    }
  7. Wait for Cluster to Stabilize:

    # Monitor until green
    curl -s 'http://localhost:9200/_cat/health?v'
  8. Repeat for each node, one at a time
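Steps 1 and 6 above use the same API with a different value, so they can be wrapped in one helper that you call before and after each node's restart (a sketch; `ES_URL` is an assumed variable pointing at any reachable node):

```shell
# Toggle shard allocation around a node restart.
# ES_URL is assumed to point at a reachable node in the cluster.
ES_URL=${ES_URL:-http://localhost:9200}

set_allocation() {
  # $1 is "none" (before stopping a node) or "all" (after it rejoins)
  curl -s -XPUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "{\"persistent\":{\"cluster.routing.allocation.enable\":\"$1\"}}"
}
```

Usage per node: `set_allocation none`, stop and upgrade the node, start it, wait for it to rejoin, then `set_allocation all` and wait for green before moving to the next node.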

Post-Upgrade Verification

After all nodes are upgraded:

# Check cluster health
curl -s 'http://localhost:9200/_cat/health?v'

# Verify version
curl -s 'http://localhost:9200/'

# Check all nodes are on new version
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version'
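The name/version listing can be reduced to a single number: how many distinct versions are running. A sketch (pipe in the `_cat/nodes?h=name,version` output without the `v` header flag, or skip the header line first; the sample in the test is illustrative):

```shell
# Count distinct versions in "name version" lines read on stdin.
# After a complete upgrade this should print 1.
distinct_versions() {
  awk '{print $2}' | sort -u | wc -l | tr -d ' '
}
```

Typical usage: `curl -s 'http://localhost:9200/_cat/nodes?h=name,version' | distinct_versions`; anything greater than 1 means the rolling upgrade is not yet complete.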

Important Notes

  • Never upgrade more than one node at a time in a multi-node cluster
  • Maintain cluster quorum: Ensure majority of master-eligible nodes are available
  • Test in non-production first: Always test upgrade process in a test environment
  • Have rollback plan: Keep backups and know how to restore if needed

Single Node Deployment

For development, testing, or POV scenarios, you may need to run Elasticsearch as a single-node cluster.

Configuration

Set in config/elasticsearch.yml:

# Single node configuration
discovery.type: single-node

# Optional: reduce resource requirements
cluster.routing.allocation.disk.threshold_enabled: false

Starting Single Node

# Start Elasticsearch
./bin/elasticsearch

# Or as daemon
./bin/elasticsearch -d

Verification

# Check cluster status (should show 1 node)
curl -s 'http://localhost:9200/_cat/nodes?v'

# Check cluster health
curl -s 'http://localhost:9200/_cat/health?v'

Limitations and Considerations

  1. No Replication: Data is stored only once, no redundancy
  2. No High Availability: Node failure means complete downtime
  3. Resource Constraints: All roles (data, master, ingest) run on one node
  4. Not for Production: Use only for development, testing, or POV

Resource Requirements

Even for single-node deployments:

  • Minimum RAM: 8 GB (4 GB heap)
  • Recommended RAM: 16 GB (8 GB heap)
  • CPU: 2-4 cores minimum
  • Storage: SSD recommended

When to Scale

Consider moving to a multi-node cluster when:

  • Moving to production
  • Requiring high availability
  • Handling production workloads
  • Need for better performance

Best Practices Summary

  1. Memory Management:

    • Allocate 50% of RAM to heap (max 31 GB)
    • Monitor memory usage regularly
    • Ensure sufficient system RAM
  2. Cluster Health:

    • Always check cluster health before operations
    • Ensure “green” status before rollovers/upgrades
    • Monitor for unassigned shards
  3. Performance:

    • Optimize queries and mappings
    • Monitor CPU and memory usage
    • Use appropriate shard sizes (10-50 GB)
  4. Operations:

    • Always backup before upgrades
    • Use rolling upgrades for zero downtime
    • Test changes in non-production first
  5. Production:

    • Minimum 3 nodes for production
    • At least 62 GB RAM per node
    • Never use single-node in production
