This paper, originally titled “Capacity management of Java-based business applications running in a virtualized environment”, presents a new technique to evaluate performance and automatically resolve anomalies in Java-based enterprise applications running in virtualized cloud environments. New business metrics for measuring and monitoring user activity and performance data are introduced in addition to the system metrics. The results of capacity analysis are shown through real-world examples.
A cloud environment consists of many distributed components: hardware, virtual machines (VMs), storage and network devices. VMs are widely deployed in data centers because they save co-location space and make capacity easy to extend. At the same time, managing cloud computing resources effectively while provisioning business services to customers is complicated and challenging.
This paper presents the results of performance analysis and capacity evaluation of VM resources, using both system and business metrics. To analyze performance data, production statistics from the RingCentral (RC) enterprise system [RC2013] were collected. The RC private cloud infrastructure consists of more than 2,500 physical and virtual servers in co-location data centers on both coasts of the United States, and is growing rapidly along with customer demand. The Zabbix open source monitoring system [ZAB2012] is used to collect counters from multiple RC hosts on a regular basis and store performance data in a MySQL database for analysis and troubleshooting.
Production servers are combined into pool groups, each running particular business services. The target of this analysis is not the entire enterprise cloud, but only the RC components with Java Enterprise Development Implementation (JEDI).
2. Java Application Problems
Memory utilization is a well-known problem of Java applications, especially in virtualized environments running VMware on a Microsoft Windows Server platform, where CPU and memory loads are higher. Permanent memory leakage is observed in the Java API because objects and classes are dynamically loaded, causing dramatic memory degradation.
Memory leaks are often related to programming bugs. Manual resource management is a major focus for Java developers, and fixing API code is a long-term solution. Garbage collection (GC) is the most popular and effective tool to automatically find and release data objects that won’t be accessed in the future.
Even when GC is enabled and runs periodically, each GC cycle frees memory only partially and does not return it to its initial size. As a result, permanent Java memory leaks were still observed (Figure 1). The degradation trend depends on user activity and business hours. Sooner or later, after free memory reaches a critical threshold, either the JBoss service needs to be restarted or the VM needs to be rebooted to prevent a server crash.
Besides Java memory, other computer resources, such as network channels, DB connections, CPU usage, user interactions, etc., are affected. The next section shows how Java application anomalies can be automatically detected and resolved.
Figure 1 – Permanent Java memory leak
3. Auto-remediation Procedure for JEDI
The remediation procedure is designed to automatically restart the JBoss service on the JEDI host when Java free memory becomes critical. Corresponding triggers are implemented in the Zabbix monitoring system. The auto-remediation process is initiated if at least one of the following cases occurs:
- Java heap free memory size is lower than the critical threshold (5 MB).
- Java non-heap free memory size is lower than the critical threshold (5 MB).
- Java virtual allocated memory is higher than allowed (2 GB for 32-bit VM and 6 GB for 64-bit).
- The JBoss service status is not responding.
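The four conditions above can be restated as a single predicate. The following is a minimal Java sketch of that logic only; the actual triggers are Zabbix trigger expressions, which are not shown in this paper. The class name, method name and parameters are hypothetical, while the thresholds are the ones listed above.

```java
/** Illustrative restatement of the four trigger conditions; all names are hypothetical. */
final class RemediationTriggers {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Thresholds taken from the list above.
    static final long CRITICAL_FREE_BYTES = 5 * MB;  // heap and non-heap free memory
    static final long MAX_VIRTUAL_32BIT = 2 * GB;    // allowed virtual memory, 32-bit VM
    static final long MAX_VIRTUAL_64BIT = 6 * GB;    // allowed virtual memory, 64-bit VM

    /** Returns true if at least one of the four conditions holds. */
    static boolean shouldRemediate(long heapFreeBytes, long nonHeapFreeBytes,
                                   long virtualBytes, boolean is64Bit,
                                   boolean jbossResponding) {
        long maxVirtual = is64Bit ? MAX_VIRTUAL_64BIT : MAX_VIRTUAL_32BIT;
        return heapFreeBytes < CRITICAL_FREE_BYTES
            || nonHeapFreeBytes < CRITICAL_FREE_BYTES
            || virtualBytes > maxVirtual
            || !jbossResponding;
    }

    public static void main(String[] args) {
        // Made-up sample values: 3 MB of free heap is below the 5 MB threshold.
        System.out.println(shouldRemediate(3 * MB, 40 * MB, GB, true, true));
    }
}
```

Because the conditions are combined with a logical OR, a single low counter is enough to start the procedure, which matches the "at least one" wording above.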
The following workflow for JEDI is applied (Figure 2):
- The Zabbix agent sends data about Java memory and JBoss service status periodically, at the specified polling interval, to the Zabbix server, where one of the above triggers is activated in case of free memory anomalies.
- The JBoss stability check script is started from the Zabbix server to verify whether the other JEDI hosts in the pool are available for business service.
- If the problematic host can be safely restarted without a service outage in the pool, a corresponding auto-remediation request is returned.
- The JBoss restart script is initiated on the JEDI host.
- The results of the auto-remediation procedure are verified through logs. In case of failure, a special trigger is fired for manual troubleshooting.
Figure 2 – Auto-remediation workflow for JEDI
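The stability check in the second and third workflow steps can be illustrated with a short Java sketch. This is not the production script, which is not reproduced in this paper. The sketch assumes that availability is known per host and that a restart is considered safe when a minimum number of other hosts are serving traffic; the minHealthyPeers parameter and the host names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative pool stability check; the policy and all names are assumptions. */
final class PoolStabilityCheck {

    /**
     * Returns true when at least minHealthyPeers hosts other than the candidate
     * are available, so the candidate can be restarted without a pool outage.
     */
    static boolean canRestart(String candidate, Map<String, Boolean> hostAvailable,
                              int minHealthyPeers) {
        long healthyPeers = hostAvailable.entrySet().stream()
            .filter(e -> !e.getKey().equals(candidate)) // skip the host to be restarted
            .filter(e -> e.getValue())                  // keep only available hosts
            .count();
        return healthyPeers >= minHealthyPeers;
    }

    public static void main(String[] args) {
        Map<String, Boolean> pool = new LinkedHashMap<>();
        pool.put("jedi-01", false); // problematic host (hypothetical name)
        pool.put("jedi-02", true);
        pool.put("jedi-03", true);
        System.out.println(canRestart("jedi-01", pool, 2));
    }
}
```

The candidate host is excluded from the count on purpose: its own status must not influence whether the rest of the pool can carry the service during the restart.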
4. Implementing Business Metrics with JMX
Standard counters, such as system CPU and memory utilization, do not show all of the internal processes in Java applications, and are of limited help in troubleshooting. An example of performance degradation with no apparent cause is shown in Figure 3. While Java heap used memory looks fine, and is within min-max bounds, the virtual memory usage jumped from 631 MB to 2.37 GB. That could be explained by either a GC problem or user activity growth, but the cause can’t be determined using only system metrics.
Figure 3 – Java virtual memory degradation
Business metrics (such as hourly rate and distribution of service requests, user activity of established web sessions and DB connections, number of processed pages and many others) are needed, in addition to system metrics.
On the one hand, all business-oriented information could be retrieved from application logs. On the other hand, parsing a large number of log files to calculate each metric leads to a significant delay (from minutes to hours) before the result is available in the monitoring database. Daily log files vary in size, reaching up to 20 GB depending on the business service, the day of the week and user activity (Figure 4), and require powerful computing resources to process.
Figure 4 – Daily log file sizes for JEDI
Business metrics for Java applications can be implemented using Java Management eXtensions (JMX) technology [JMX2012]. JMX tools provide access to internal objects, classes, services and other resources for monitoring and managing a Java API remotely.
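As an illustration of how a business metric can be published through JMX, the sketch below defines a Standard MBean with two counters and registers it with the platform MBean server. It is a minimal example under stated assumptions, not the JEDI implementation: the class names, the attribute names and the object name com.example.jedi:type=RequestStats are all hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.ObjectName;

/** Illustrative business-metric MBean; all names are hypothetical. */
final class BusinessMetrics {

    /** Management interface: JMX exposes each getter as a read-only attribute. */
    public interface RequestStatsMBean {
        long getRequestCount();
        int getActiveSessions();
    }

    /** Standard MBean: the class name is the interface name without "MBean". */
    public static final class RequestStats implements RequestStatsMBean {
        private final AtomicLong requests = new AtomicLong();
        private final AtomicInteger sessions = new AtomicInteger();

        // The application would call these hooks from its request handling code.
        public void onRequest() { requests.incrementAndGet(); }
        public void onSessionOpened() { sessions.incrementAndGet(); }
        public void onSessionClosed() { sessions.decrementAndGet(); }

        @Override public long getRequestCount() { return requests.get(); }
        @Override public int getActiveSessions() { return sessions.get(); }
    }

    /** Creates the bean and registers it under the given object name. */
    static RequestStats register(String objectName) {
        RequestStats stats = new RequestStats();
        try {
            ManagementFactory.getPlatformMBeanServer()
                .registerMBean(stats, new ObjectName(objectName));
        } catch (JMException e) {
            throw new IllegalStateException("cannot register " + objectName, e);
        }
        return stats;
    }

    /** Reads one attribute the same way a monitoring client would. */
    static Object read(String objectName, String attribute) {
        try {
            return ManagementFactory.getPlatformMBeanServer()
                .getAttribute(new ObjectName(objectName), attribute);
        } catch (JMException e) {
            throw new IllegalStateException("cannot read " + attribute, e);
        }
    }

    public static void main(String[] args) {
        String name = "com.example.jedi:type=RequestStats"; // hypothetical object name
        RequestStats stats = register(name);
        stats.onRequest();
        stats.onSessionOpened();
        System.out.println("RequestCount=" + read(name, "RequestCount"));
        System.out.println("ActiveSessions=" + read(name, "ActiveSessions"));
    }
}
```

Counters kept in memory and read on demand avoid the log parsing delay described above, since the value is available as soon as the monitoring agent polls the attribute.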
In the RC infrastructure, the following JMX metrics are introduced for JEDI components (Figure 5):
- Java memory size (used, free, average, max, virtual, heap, non-heap).
- GC stats (last time issued, duration, count, result).
- System resources (CPU load for Java processes, system memory usage for Java API).
- User activity metrics (HTTP requests per unit of time, active web thread count, session time).
- DB connections count (total connections established, DB objects count, active DB connections and sessions).
Figure 5 – JMX metrics for JEDI
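Several of the memory and GC counters listed above are available from the platform MXBeans in the standard java.lang.management package. The sketch below shows one way to read them inside the JVM. It is an illustration rather than the JEDI code, and it assumes that "free" heap means committed minus used; the paper does not state which definition the production metrics apply. The class and method names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative reader for built-in JVM counters; class and method names are hypothetical. */
final class JvmCounters {

    /** Free heap in bytes, assumed here to be committed minus used. */
    static long heapFreeBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getCommitted() - heap.getUsed();
    }

    /** Used non-heap memory in bytes. */
    static long nonHeapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
    }

    /** Total number of collections across all collectors. */
    static long gcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds. */
    static long gcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means "undefined"
        }
        return total;
    }

    /** Current number of live threads in this JVM. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("heap free bytes:     " + heapFreeBytes());
        System.out.println("non-heap used bytes: " + nonHeapUsedBytes());
        System.out.println("GC count:            " + gcCount());
        System.out.println("GC time (ms):        " + gcTimeMillis());
        System.out.println("live threads:        " + liveThreads());
    }
}
```

The same attributes can be read remotely by a JMX client, which is how a monitoring agent would typically collect them.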
The advantages of JMX technology are shown in Figure 6. JMX metrics such as active DB sessions or pool leased connections are better indicators of the real workload and resource usage. Total connections (from 5 to 20, depending on the DB connectivity engine) may be established to the host, but user activity differs between threads. In the case shown in Figure 6, the DB connections are not actively used for a long period.
Figure 6 – JMX metrics for active DB connections
5. Measuring User Activity in Pool
JMX metrics allow user activity in the pool to be measured. Figure 8 shows sample stats (HTTP requests per second, Java free memory) for all JEDI servers in one graph. The goal is to investigate how the entire pool withstands the growing user workload.
In spite of the many JEDI servers in the pool, a single user HTTP request spike caused a dramatic degradation of Java free memory on one host, to a critical value below the specified threshold of 5 MB. As a result, the auto-remediation procedure was initiated to prevent a server crash. After the JBoss service restarted on the problematic host, acceptable Java free memory was restored.
At the same time, all of the other JEDI servers worked properly, with no impact on Java memory utilization. No web service outage was encountered in the pool. That means the number of servers in the JEDI pool was sufficient at this point. Nevertheless, more investigation is required to make sure there are no capacity issues.
Figure 8 – Example of user activity spike in pool
6. Capacity Analysis
JMX metrics are used for capacity analysis. Below are some real-world examples of capacity problems that were encountered in practice and resolved without significant impact on RC’s customers.
The sample graph in Figure 9 shows uneven CPU utilization in the pool. The CPU load on host sjc01-p08-pws04 is much higher than on the other servers: it exceeds the 60 percent threshold, and is sometimes close to 100 percent. Root cause analysis (RCA) identified a routing issue related to hardware. The problematic network switch was replaced.
Another reason for poor balancing between hosts could be the use of both VMs and hardware in one pool. Virtual and physical hosts were found to have different response times. Hardware is utilized more often, so combining it with VMs is not recommended.
Figure 9 – Example of different CPU load in pool
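A simple way to flag such a host automatically is to compare each host against both the fixed threshold and the pool median. The Java sketch below illustrates that idea under stated assumptions; it is not the method used in the analysis above, which relied on graphs and RCA. The 60 percent threshold comes from the text, while the median comparison, the factor parameter and all sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative imbalance check for a pool; the heuristic itself is an assumption. */
final class PoolBalance {
    static final double CPU_THRESHOLD_PERCENT = 60.0; // threshold from the text

    /** Hosts above the threshold whose load is also at least factor times the pool median. */
    static List<String> overloadedHosts(Map<String, Double> cpuPercentByHost, double factor) {
        double median = median(cpuPercentByHost.values());
        List<String> result = new ArrayList<>();
        // TreeMap gives a stable, name-ordered result regardless of the input map type.
        for (Map.Entry<String, Double> e : new TreeMap<>(cpuPercentByHost).entrySet()) {
            double cpu = e.getValue();
            if (cpu > CPU_THRESHOLD_PERCENT && cpu >= factor * median) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Median of the given values; throws if the collection is empty. */
    static double median(Collection<Double> values) {
        double[] sorted = values.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        if (sorted.length == 0) {
            throw new IllegalArgumentException("empty pool");
        }
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up sample loads; only the pws04 host name appears in the text.
        Map<String, Double> cpu = new LinkedHashMap<>();
        cpu.put("sjc01-p08-pws01", 22.0);
        cpu.put("sjc01-p08-pws02", 25.0);
        cpu.put("sjc01-p08-pws03", 20.0);
        cpu.put("sjc01-p08-pws04", 95.0);
        System.out.println(overloadedHosts(cpu, 2.0));
    }
}
```

Using the median rather than the mean keeps one overloaded host from raising the baseline it is compared against. Requiring both conditions avoids flagging every host when the whole pool is busy, which would point to a capacity problem rather than a balancing problem.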
The graph in Figure 10 is an example of a connectivity incident on a single host. Web processing active threads became overloaded and reached the maximum allowed value of 100. Those connections were not released for a long time, even though user requests were not critical. The problem was fixed by restarting the server.
Figure 10 – Example of web user activity
The next example (Figure 11) shows a capacity issue in the pool. Java memory is too low on all of the hosts, causing an auto-restart each time free memory falls below the critical threshold. GC doesn’t help either. A restart takes a certain amount of time, and frequent auto-remediation may lead to performance degradation or even a service outage. The solution would be to extend the VM capacity.
Figure 11 – Example of capacity lack in pool
The next case (Figure 12) is the opposite of the previous one. The number of JEDI hosts in the pool seems too large because Java memory usage is stable and far from the critical threshold even in peak hours. No balancing issues between VMs are detected.
The solution was to remove 3 out of 9 servers from the pool and monitor user activity. As shown in Figure 13, the decision was correct. After the iad01-p03-jws[07-09] servers were decommissioned, Java memory usage on the remaining hosts didn’t change significantly and remained steady, even after the workload spike on 5/21 was restored.
Figure 12 – Example of extra capacity in the pool
Figure 13 – Example of enough capacity in the pool
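Before removing hosts, the expected effect on the remaining ones can be estimated with simple arithmetic. The Java sketch below is a rough planning aid, not the procedure used above, where the decision was validated by monitoring. It assumes that the load balancer spreads requests evenly, so per-host load scales with the ratio of old to new pool size. The load unit, the per-host limit and the spare host count are hypothetical.

```java
/** Illustrative pool sizing arithmetic; assumes evenly balanced load. */
final class PoolSizing {

    /** Projected per-host load after resizing the pool from currentHosts to newHosts. */
    static double projectedPerHostLoad(double currentPerHostLoad, int currentHosts, int newHosts) {
        if (currentHosts <= 0 || newHosts <= 0) {
            throw new IllegalArgumentException("host counts must be positive");
        }
        return currentPerHostLoad * currentHosts / newHosts;
    }

    /**
     * Smallest pool that keeps per-host load at or below perHostLimit at peak,
     * plus spareHosts that may be restarting at any given time.
     */
    static int minHosts(double peakPoolLoad, double perHostLimit, int spareHosts) {
        if (perHostLimit <= 0 || spareHosts < 0) {
            throw new IllegalArgumentException("invalid limit or spare count");
        }
        return (int) Math.ceil(peakPoolLoad / perHostLimit) + spareHosts;
    }

    public static void main(String[] args) {
        // Going from 9 to 6 hosts raises per-host load by a factor of 9/6 = 1.5.
        System.out.println(projectedPerHostLoad(40.0, 9, 6));
        // Made-up peak of 360 load units with a hypothetical limit of 80 per host.
        System.out.println(minHosts(360.0, 80.0, 1));
    }
}
```

Java memory usage does not necessarily grow in proportion to request load, so an estimate of this kind only gives a starting point. Monitoring after the change, as done for Figure 13, remains the actual check.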
Virtualization is a good way to improve the effectiveness of performance and capacity management in a fast-growing cloud infrastructure. Nevertheless, a mixture of VM and hardware in a pool is not recommended, due to possible problems in balancing between hosts.
CPU and memory usage are always higher for virtual hosts, which is reasonable given the overhead of running VMware. In real-world practice, GC is the most effective memory management technique, and is widely used to free up Java memory resources. But it is still not enough to keep a production cloud environment healthy and business services stable. Auto-remediation procedures and proactive monitoring tools are designed for that purpose.
New metrics based on JMX technology are proposed for monitoring real user activity and detecting anomalies in Java-based business applications running on VMs. Capacity reports generated for the entire pool allow estimates of the optimal Java memory settings and the number of VMs required for provisioning business services.
[JMX2012] Java Management Extensions (JMX) – Best Practices, Oracle Technology Network. (https://www.oracle.com/technetwork/java/javase/tech/best-practices-jsp-136021.html)
[RC2013] RingCentral cloud business phone systems. (https://www.ringcentral.com)
[ZAB2012] Zabbix enterprise-class monitoring solution. (https://www.zabbix.com/)
Microsoft Windows, MySQL, Oracle, VMware and Zabbix are registered trademarks.
This analysis is based on statistical data from RingCentral Inc.