Guaranteeing Query Runtimes for Analytics-as-a-Service


By Jennifer Ortiz and Magdalena Balazinska, University of Washington

A variety of data analytics systems are available as cloud services today, including Amazon Elastic MapReduce (EMR), Redshift, and Azure HDInsight. These services give users access to compute clusters that come pre-packaged with data analytics tools. With most of these services, users select and pay for a given cluster configuration, i.e., the number and type of service instances. It is well known, however, that users often have difficulty selecting configurations that meet their needs.

For example, if a user has a small 10GB dataset, should the user perform the analysis on a single machine to keep costs down, or invest in a 20-node cluster to make the analysis go faster? Frequently, users need to test many configurations before finding a suitable one. Some database cloud services, such as Amazon RDS and Azure, offer the ability to scale a database application automatically, but they require users to manually specify the scaling conditions, which demands deep expertise and planning. Recent work offers solutions for automatic cluster resizing, but it either focuses on transaction processing or requires a known and profiled workload.

An alternate approach is to let users purchase not a cluster size but a performance level. In previous work, we developed Personalized Service Level Agreements (PSLAs), an approach where users purchase service tiers with query time guarantees, as shown in Figure 1.


Figure 1: Example of performance-oriented SLA offered by the PSLAManager system for the Myria big data analytics service.

However, the challenge behind selling performance-focused SLAs for data analytics is in guaranteeing the query runtimes advertised in the SLAs.


We recently developed a new system called PerfEnforce to address this need. PerfEnforce supports performance-centric SLAs for data analytics services. It targets services such as Amazon EMR or Redshift, where each user runs the service in a separate cluster of virtual machines: the user purchases a service tier with an SLA that specifies query runtimes. These runtimes correspond to query time estimates for specific cluster sizes, which define the tiers of service. Once the user selects a service tier, the cloud service instantiates the corresponding cluster. As the user executes queries, prediction inaccuracies and interference from other tenants can cause query times to differ from the estimates the user purchased. To meet the terms of the performance-based SLA, PerfEnforce automatically resizes the cluster allocated to the user. PerfEnforce seeks to minimize the cluster size allocated to the user, subject to satisfying the query time guarantees in the SLA. Figure 2 shows how PerfEnforce interacts with the PSLAManager to provide SLAs with performance guarantees.
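
To make that objective concrete, here is a minimal sketch of the per-query decision: pick the smallest cluster size whose predicted runtime still meets the SLA deadline. The cluster sizes, feature names, and the simple Amdahl-style runtime model below are illustrative assumptions, not PerfEnforce's actual implementation.

```python
# Illustrative sketch (assumptions, not PerfEnforce's code): choose the
# smallest cluster size whose *predicted* runtime meets the SLA deadline.

AVAILABLE_SIZES = [4, 8, 12, 16, 24, 32]  # hypothetical tiers, in worker nodes

def predict_runtime(query_features, cluster_size):
    """Placeholder runtime model; a real system would learn this from
    query plan features and past executions."""
    work = query_features["estimated_work"]            # e.g., node-seconds of work
    parallel_fraction = query_features.get("parallel_fraction", 0.9)
    # Simple Amdahl-style model: only part of the work speeds up with more nodes.
    return work * ((1 - parallel_fraction) + parallel_fraction / cluster_size)

def choose_cluster_size(query_features, sla_deadline_seconds):
    """Return the smallest size predicted to finish within the deadline;
    fall back to the largest size if no prediction meets it."""
    for size in AVAILABLE_SIZES:  # sizes sorted ascending
        if predict_runtime(query_features, size) <= sla_deadline_seconds:
            return size
    return AVAILABLE_SIZES[-1]

# Example: a query with ~200 node-seconds of estimated work and a 30-second SLA.
print(choose_cluster_size({"estimated_work": 200.0}, sla_deadline_seconds=30.0))  # -> 24
```

The interesting part is how those predicted runtimes are obtained and corrected over time, which is where the scaling algorithms described next come in.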


Figure 2: PerfEnforce deployment: PerfEnforce sits on top of an elastically scalable big data management system (e.g., Myria), in support of performance-oriented SLAs for cloud data analytics (e.g., PSLAManager).

Several recent systems have studied performance guarantees through dynamic resource allocation, either in storage systems using feedback control or in transaction processing systems using reinforcement learning. With PerfEnforce, we show how these techniques can be applied to effectively scale an analytics cluster by adding or removing nodes. We also develop a third technique based on online machine learning. In contrast to feedback control and reinforcement learning, which are reactive methods, online machine learning enables PerfEnforce to change the cluster size before running an incoming query. As the user runs more queries in the session, this approach continuously improves its query runtime prediction model, and it can compensate for remaining prediction errors using a control technique.
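
As a rough illustration of the reactive, feedback-control style of scaling, the sketch below adjusts the cluster size after each query in proportion to how far the observed runtime deviated from the SLA target. The gain constant, size bounds, and overall structure are assumptions for illustration, not PerfEnforce's tuned controller.

```python
# Illustrative proportional feedback controller (assumed constants, not
# PerfEnforce's actual controller): react to each query's observed runtime.

MIN_SIZE, MAX_SIZE = 4, 32   # hypothetical bounds on cluster size
GAIN = 0.5                   # how aggressively to react to violations or slack

def next_cluster_size(current_size, actual_runtime, sla_deadline):
    """Grow the cluster when queries run slower than the SLA, shrink it when
    there is slack; error > 0 means the last query missed its deadline."""
    error = (actual_runtime - sla_deadline) / sla_deadline
    new_size = current_size * (1.0 + GAIN * error)
    return int(min(MAX_SIZE, max(MIN_SIZE, round(new_size))))

# Example: a 10-node cluster whose last query took 48s against a 30s deadline.
print(next_cluster_size(10, actual_runtime=48.0, sla_deadline=30.0))  # -> 13
```

Because such a controller only reacts after a query has already missed (or easily beaten) its deadline, at least one query per adjustment runs on a poorly sized cluster.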

“We also develop a third technique based on online machine learning, which enables PerfEnforce to change the cluster size before running an incoming query.”
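
The proactive, online-learning style can be sketched as follows: maintain a runtime model that is updated incrementally after every query, use it to choose a cluster size before the next query runs, and correct for systematic drift with a simple feedback term. The features, model, and constants below are assumptions for illustration; the actual learners PerfEnforce uses are described in the demo paper.

```python
# Illustrative sketch of proactive scaling with an online model (assumptions,
# not PerfEnforce's implementation). The model is refit incrementally after
# every query, so predictions improve as the session progresses.
import numpy as np
from sklearn.linear_model import SGDRegressor

SIZES = [4, 8, 12, 16, 24, 32]   # hypothetical cluster sizes
model = SGDRegressor(learning_rate="constant", eta0=0.01)
error_ratio = 1.0                # feedback term: observed / predicted runtime

def features(query, size):
    # Assumed features: an optimizer cost estimate and the inverse cluster size.
    return np.array([[query["est_cost"], 1.0 / size]])

def choose_size(query, sla_deadline):
    """Proactively pick the smallest size whose (feedback-adjusted) prediction
    meets the SLA; before any training data exists, default to the largest size."""
    if not hasattr(model, "coef_"):
        return SIZES[-1]
    for size in SIZES:
        predicted = model.predict(features(query, size))[0] * error_ratio
        if predicted <= sla_deadline:
            return size
    return SIZES[-1]

def observe(query, size, actual_runtime):
    """After the query finishes, update both the model and the feedback term."""
    global error_ratio
    x = features(query, size)
    if hasattr(model, "coef_"):
        predicted = max(model.predict(x)[0], 1e-6)
        error_ratio = 0.8 * error_ratio + 0.2 * (actual_runtime / predicted)
    model.partial_fit(x, [actual_runtime])
```

The feedback term plays the compensating role mentioned above: even if the learned model is systematically off, the multiplier pulls the size choices back toward the SLA.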

We will demonstrate PerfEnforce at the upcoming SIGMOD’16 conference. Our demonstration will let attendees experiment with the three cluster-scaling algorithms and experience their benefits and limitations. Attendees will select a performance agreement, a query workload, and a scaling algorithm. They will then observe how the selected algorithm dynamically changes the cluster size and how query performance evolves as a result. Attendees will also be able to adjust the algorithms’ tunable parameters. Plan to come and see our demo at SIGMOD’16!

If you cannot attend SIGMOD’16, see our demo paper: Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska. “PerfEnforce Demonstration: Data Analytics with Performance Guarantees.” SIGMOD 2016 Demonstration.

