Design Considerations II - Interactive & Real-time
This post is part of a Series on Modern Data Warehouse Architectures and was written in collaboration with Oscar Cassetti.
The ability to interactively query a dataset is often confused with real-time visibility of changes in the data. However, they are two distinct concerns. Interactive querying is an important aspect of any modern data warehouse: one should not have to wait minutes to hours for a query to return its results. To enable a broad range of users and use cases, the solution should be capable of providing sub-second response times for the majority of queries and single-digit seconds for the rest. There should also be a level of consistency in query times: a query which takes 1 second now should not take 1 minute if executed 5 minutes later. Engines such as Hive v1 were inherently batch based and could not be considered for interactive analytics. While Hive v2+ was reworked to support more interactive and sub-second analytics, other tools such as Presto, Impala and Drill are designed from the ground up specifically for interactive analytics.
Many organisations require real-time visibility of changes in certain key metrics. To achieve this, there must be a streaming source of data which contains the necessary information to build and maintain a real-time view of these metrics. The latency or staleness of data in this stream represents the lower bound on the speed at which changes can be analysed. Creating a unified query system for both real-time and historic data is an extremely difficult task, and most companies will accept a reasonable ETL delay of minutes to hours for some dashboards to update. In fact, the majority of business metrics do not need to be real-time in the sense of seeing the latest data changes (despite what you may hear). Instead, what users want is the ability to interactively query data and have responsive dashboards and reports; if the data is a few minutes old, it is not a problem. One should always evaluate which metrics really require real-time visibility of data changes, and what that definition of real-time is.
Metrics which only keep track of some recent information, e.g. a sliding window of the top N sites visited in the last 30 minutes, are much easier to achieve than metrics which require a real-time view of information and also have a long-term historical presence, e.g. total purchases of a given product over all time. The former can be solved via a solution designed only for real-time time series, as it doesn't need to worry about long-term storage; it can optimise for such queries easily and there are many databases which can achieve this with little effort. One can develop custom rollups for metrics via streaming tools such as Spark, Flink and KSQL (a recent addition to Kafka). When 100% accuracy is not required and/or scale and latency requirements prevent it, streaming approximate counting algorithms can be used due to their simplicity and cost effectiveness. However, all of this becomes quite complex and costly to manage with a large number of metrics. If a historic view of these metrics is also required, then this results in a complicated solution to build the historic values and join them with the current changes.
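As a rough illustration of the approximate counting approach, here is a minimal sketch (not a production implementation) of a Count-Min sketch, one common streaming algorithm for estimating per-item counts in fixed memory. The page names and sizing parameters are made up for the example:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: fixed memory, estimates never undercount."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One hash-derived column per row; salting with the row index
        # simulates independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest (over-)estimate of the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))

# Simulate a stream of page visits.
stream = ["/home"] * 500 + ["/checkout"] * 120 + ["/about"] * 30
cms = CountMinSketch()
for page in stream:
    cms.add(page)

print(cms.estimate("/home"))  # never less than the true count of 500
```

The appeal here is that memory is bounded by `width * depth` counters regardless of how many distinct items flow through the stream, at the cost of a small, tunable overestimation error.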
Unifying a query interface for a metric which has both a real-time and a historic component is much more difficult. Druid is probably the most well-known OLAP system which attempts to do so. It essentially has separate processes, data structures and storage for real-time and historic data, but it unifies the queries behind a single data interface. This is essentially the Lambda Architecture for OLAP. It allows for very high performance and low latency queries on real-time and historic data by keeping approximates and rollups of the data. A common critique of Druid is that it is difficult to set up and manage. There is a reason for this: the inherently different properties of the systems required to serve both real-time and historic queries in a performant manner. In our opinion, there is currently no avoiding such a setup, and any systems that have tried to unify the two concerns have failed horribly. A critique that we do have of Druid, however, is its lack of ANSI SQL support. It has its own query language, which is not something we feel belongs in a modern Big Data stack and, for all of the reasons we mention in the ANSI SQL discussion later in this series, not something we recommend. There is support for certain SQL features via Apache Calcite, but this is marked as experimental. However, depending on your reporting requirements, Druid may be simpler to use than developing and managing the real-time aggregates and merging them with the historic results via a custom solution. Pinot from LinkedIn is another real-time OLAP system and shares many of the same design features (lambda architecture), benefits and shortcomings as Druid. Naturally they have certain differences which should be analysed in more detail before choosing one over the other.
It is worth mentioning MemSQL in this section, as it appears an obvious solution when real-time analytics is required and cost is not such an issue. It is a closed source product which requires licensing to use basic enterprise features such as high availability. That said, it is a fully relational SQL engine (i.e. ACID transactions), supports streaming ingestion of data and works at very high scale. These are not features one finds in many existing database solutions. VoltDB is another solution with some similarities to MemSQL and was founded by a true pioneer in the database world, Michael Stonebraker. However, it is intended more for the OLTP world with real-time analytics capabilities on recent data, whereas MemSQL attempts to fit both the OLTP and OLAP worlds. MemSQL appears to achieve this by allowing one to choose columnar or row storage for a table when creating it. Therefore, it is unclear how well it supports the hybrid of OLAP and OLTP. VoltDB recommends performing batch analytics against another data mart or Hadoop environment (see here), which means two distinct systems again.
Finally, there is an interesting project called Apache Kudu, which is in the Apache Incubator at the time of writing. It has the potential to enable real-time querying of changelog (journal) data which could then be merged with the results of a historic query. Impala currently supports querying Kudu, and other engines should follow shortly. Using SQL windowing, it is then possible to effectively merge the real-time and historic data to get an up-to-date value. This concept is still in its infancy and is more complex to manage than we would recommend. However, the space and concept could evolve quickly.
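The merge step itself is conceptually simple: take the last batch-built totals and apply the deltas that have arrived since. The sketch below illustrates the idea in Python with invented SKUs and quantities; in practice this UNION-and-aggregate would run inside the query engine across the historic store and the changelog table:

```python
from collections import Counter

# Hypothetical inputs: a nightly batch result and the changelog events
# that have streamed in since that batch was loaded.
historic_totals = {"sku-1": 10_400, "sku-2": 7_250}    # purchases up to last load
changelog = [("sku-1", 3), ("sku-2", 1), ("sku-1", 2)]  # (product, quantity) deltas

def up_to_date_totals(historic, events):
    """Merge batch totals with streamed deltas to get current values."""
    merged = Counter(historic)
    for sku, qty in events:
        merged[sku] += qty
    return dict(merged)

print(up_to_date_totals(historic_totals, changelog))
# {'sku-1': 10405, 'sku-2': 7251}
```

The hard part is not this arithmetic but the surrounding machinery: ensuring each delta is applied exactly once, and atomically swapping the changelog watermark forward when a new batch lands.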
Resource Isolation, QoS and Multi-tenancy
Resource allocation, isolation and QoS are probably some of the most overlooked areas in data engineering. One cannot hope to have an interactive, responsive, real-time reporting system without ensuring appropriate isolation. While auto-scaling resources and tooling have helped to reduce this burden in recent years, it is still extremely difficult to manage across the four critical contention areas of CPU, memory, network and disk. Many users tend to forget that they are often not the only users of the data warehouse, and run complex queries which can have an impact on other processes and users. Maybe they forget to notify an administrator or another group within the company when running a large job, and they end up impacting latencies, or worse, taking down a critical service or datastore. If this happens, whose fault is it? We would argue that it is the architecture's fault and not the user's. Sufficient resource isolation should be in place via multi-tenancy support. However, providing accurate and sufficient resource isolation in most environments is extremely difficult. YARN has support for CPU and memory isolation and extremely basic support for disk isolation. However, the necessary disk and network isolation (bandwidth allocation etc.) are not currently possible with YARN. There are open tickets to tackle these issues and work is ongoing (see YARN-2139 and YARN-2140). Add HDFS disk and network isolation issues into the mix, and the resulting complexity becomes impossible to configure and manage effectively.
This is where isolated environments tend to be a better fit than shared environments. This isolation needs to extend from the underlying data access through to the query execution. When your data and query execution engine are a single unit (e.g. Redshift), this type of isolation is not possible unless explicitly supported by the product, and even then it is difficult for the product to guarantee it from the storage throughput, network, memory and CPU perspectives. When the data and engine are separated, there is a range of flexible options one can take to ensure isolation without being constrained to a particular solution. In cloud environments this can save considerable money while dramatically increasing flexibility. When more execution resources are needed, it is easy to scale up a cluster and scale it back down when complete. This resource elasticity is typically available within a few minutes of the request for the resource. In a classical OLAP configuration, adding a new node is a serious consideration and not a temporary one. It can take days or more for a new node to fully join a cluster while all the data rebalancing takes place, and during this time the cluster is under severe strain and existing performance is degraded.
The ability to have isolated execution environments for different users or groups becomes trivial when the storage is separated. These might be different tenants or different organisational departments. Forget about resource contention or starvation, and simplify the breakdown of billing and costs to individual users. Take the example of having your data stored on S3. You can choose to start a new EMR cluster with Spark, Presto, Hive or combinations thereof, and query the data without any fear of affecting another team or user running queries against the same data in another cluster. This also allows for easy cost allocation, as the costs associated with running a given instance of a cluster can be easily attributed to a given user or group.
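To make the per-team cluster idea concrete, the following sketch builds an EMR cluster definition as a plain dictionary in roughly the shape accepted by boto3's EMR `run_job_flow` call. The release label, instance types and tag names are illustrative assumptions, and the actual AWS call is only shown in a comment so the example stays self-contained:

```python
def emr_cluster_request(name, owner_tag, instance_count=3):
    """Describe an isolated, ephemeral query cluster over shared S3 data.

    Illustrative values only; a real deployment would pick its own release
    label, instance types and applications.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.12.0",                     # assumed release label
        "Applications": [{"Name": "Presto"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,           # stays up for ad hoc queries
        },
        # Tagging the cluster by team is what makes per-group cost
        # allocation trivial in the billing breakdown.
        "Tags": [{"Key": "team", "Value": owner_tag}],
    }

request = emr_cluster_request("adhoc-analytics", owner_tag="marketing")
# A real launch would be: boto3.client("emr").run_job_flow(**request)
print(request["Tags"])
```

Because the cluster reads shared data from S3 and writes nothing back to a shared engine, it can be terminated as soon as the team is done, and its entire cost lands against the owning tag.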
Of course, the storage must be capable of supporting concurrent users and scaling to meet demand. As we discussed already, cloud storage solutions such as AWS S3 are very difficult to beat. When necessary, setting up cross-region bucket replication is trivial and ensures that users can work with data where and how they want. Data can be temporarily copied and deleted with minimal charges to meet any scaling demands (copying from S3 to S3 within the same region incurs no data transfer fees).
Next in the series - Design Considerations III - Query Engines