Next Article Table of Contents Previous Article

Analysis & Commentary:

REAL-TIME DATA WAREHOUSING DEFINED
by Michael Haisten, Daman Consulting

We are now poised for the most radical transition in the short history of data warehousing. We are about to witness the arrival of Real-time Data Warehousing (RTDW). Despite its name, the impact of this change is due less to the increase in speed of availability (which is significant) than to a significant reduction in overall processing complexity. We will replace a brute force process with elegant yet powerful simplicity.

Real-time Data Warehousing will eliminate the artificial bottlenecks we have designed into our solutions from the very beginning. It will solve some of the most vexing data warehouse problems such as change data capture. It will allow us to return to co-engineered, end-to-end system architectures.

So what is RTDW? It is an historical and analytic component of an enterprise level data stream. This data stream supports continuous, asynchronous, multi-point delivery of data. To paraphrase: Data moves straight from the originating source to all uses that do not require some form of staging. This movement takes place soon after the original data is written. Any time delays are due solely to transport latency and (optionally) miniscule processing times to dispatch or transform the instance of data being delivered.

The data warehouse component of this information delivery architecture contains both a real-time and a static partition. The real-time partition gets its data an instance at a time from a transformation engine. At regular intervals, consistent subsets of the real-time data are written to the static partition. The static partition maintains the history and serves as the base for supplying dependent data marts with periodic data.

Traditional Data Warehousing as Archeology

Building a data warehouse is like an archeological expedition. You generally know what you are looking for but you have to decipher dusty old records to find where to dig. You setup elaborate scaffolding that you can use only once to get at the buried artifacts and the work is always dirty. When you find the old relics, you must carefully package them for transport to a laboratory where they are unwrapped and reassembled for display. This is all massively labor-intensive and, in the end, you only have a bunch of old stuff that represents an imperfect subset of what actually happened in the past.

Why is the record incomplete? A flood may wash away a whole layer of the fossil record. This is similar to a purge or selective but undocumented periodic deletions. A geological intrusion can replace something without leaving evidence of what was there before. Our operational systems allow the update of information that effectively destroys the historical record. The items that survive best are those that were built to last or intentionally preserved based on the cultural preferences and bias of that time. Since it is very expensive to store things to last, few things survive. Likewise with information systems, we have been very selective about what we keep since we viewed storage as a scarce and costly luxury.

Perfect Capture

First and foremost, Real-time Data Warehousing eliminates the dig. The data we need now and the data we will need in the future streams out of the sources continuously as they come into being. The vast majority of the labor-intensive development effort, the 65% to 80% involved with extraction, is eliminated.

Changed data capture is the process of collecting incremental net changes so we can record history accurately. It allows us to maintain point-in-time currency. It is essential to doing data warehousing right but is often very difficult, if not impossible, to do after the fact by reading operational data sources in batch. In Real-time Data Warehousing, a perfect form of changed data capture is available at no extra cost. Both the before and the after image of a record is available in memory supporting easy and efficient delta processing. For dimensional data, you have much more freedom in how you represent slowly changing dimensions.

When an event occurs, such as shipping an order, all relevant information can be captured simultaneously. Today if we get an extract from a shipping subsystem, it is not likely to have relevant customer, product, order, or invoice cross-references necessary to fully describe the event. We have to piece them back together by merging data from multiple sources to artificially recreate the context of the event. Differing update timing may make some of the associations spurious. There is enormous advantage in being able to capture the full context of an event at creation time.

Quick Reconfiguration

By definition, our systems record only what we tell them to. Even the monolithic enterprise resource planning suites cannot be designed to anticipate everything we may need to analyze or record for posterity. The ideal capture mechanism for a RTDW is an application service that uses templates (such as SAP's IDOC) that specify a data context with an associated trigger event.

Such a template-driven mechanism allows the data warehouse designer to reconfigure an existing capture specification to include more content in hours. A wholly new specification can be ready for implementation in a matter of days. The limiting factor for availability is likely to be the turnaround time for production migration rather than development time. This is in contrast to a minimum of 3-5 days for building a single purpose interface when using an extraction, transformation, and loading (ETL) tool or the 2-3 weeks when writing it by hand.

Time Compression

As the pace of business picks up, processing cycles get compressed. Buying via mail order used to take a week or two. Express delivery services helped reduce this to 2 or 3 days. Now when you buy by phone or on the Internet, you can have your product the very next day. This means the product is delivered to the shipper the day the order is received.

Keeping pace with this speedy processing puts a strain on the traditional distinctions between operational and analytic information services. We use trended data regarding what sells well by site by hour to manage daily decisions. Actual daily sales activity is monitored to re-plan just-in-time inventory replenishment that in turn impacts longer cycle purchase planning.

A real-time acquisition and delivery engine enables on-the-fly analysis while ratcheting up seamlessly to support incremental aggregation and trend-line confirmation. On-the-fly analysis queries or samples the real-time data warehouse partition. The goal is to support immediate research of abnormal conditions in a manner not supported by the OLTP system. Example: A real-time transportation system getting satellite fed data of carrier deliveries indicates that an abnormal number of shipments are unloading behind schedule. On-the-fly analysis reveals the slow handling is for customer self-unloads and that the effected shipments are a brand-new multi-product configuration. This information is used to quickly write and transmit new breakdown instructions to these customer sites.

Incremental aggregation is an alternative to the old method of creating running totals. At regular intervals, such as every 15 minutes for securities or every 30-60 minutes for product orders or bank transactions, an aggregation engine runs against the real-time partition to create intermediate totals and statistics. These aggregates can be cascaded into long period aggregates (such as daily) and then stored in the static partition of the data warehouse. Because this aggregation is done more continuously, it can actually support high volumes than more traditional daily or weekly bulk processing.

Trend-line confirmation builds on incremental aggregation to provide a new monitoring tool. Any time-sensitive activity such as dropped calls at cells sites can be compared to a baseline at close to real-time. This provides much more insight and control than even active "stop lighting", which only tells you when cell sites have already overloaded. Trend-line confirmation reveals the momentum and direction of a change that is out of the ordinary soon after it begins.

Transitory States

Even if your enterprise is not (yet) running near real-time, you may be prone to the effect of time compression. A key business process may be impacted by transitory states. These are short duration activities that may end in some form of reversal that leaves little or no evidence behind.

For instance, a credit hold is a temporary condition in an online order taking system. An order goes on-hold and comes off, either with the order approved or disapproved. For an approved order, you may not know that it went on-hold at all. At most, there is a residual hold indicator in the system. Disapproved orders will generally record a credit rejection reason. But, in neither case, are the details of the hold event recorded. What happens when you introduce a new product that appeals to a consumer segment with lower credit worthiness? The number of orders per representative may decline drastically because of hold delays but you cannot see this effect directly. The number of disapproved orders may increase but this does not account for the lower number of orders per rep.

With a real-time acquisition mechanism, you can capture the on-hold and off-hold events as they occur. Due to the ability to quickly reconfigure the data stream to capture new events, you can begin analysis of a new phenomenon soon after you first notice it. Combined with other state transition events, you can create a contemporaneous record that can be used for analyzing the full order cycle.

Infinite Snapshots

Many data warehouse traditionalists will already be planning their rebuttals. Real-time and warehousing just do not mix! Warehouses are about snapshots that provide a consistent point-in-time view of the data. They argue that real-time data capture violates this treasured principle.

The enlightened perspective is that you cannot achieve point-in-time consistency without continuous, real-time data capture. The point-in-time that is relevant for one use may be radically wrong for another. Some uses may need to count orders taken but not released, released but not shipped, invoiced but not paid, etc. The most flexible data design does not replicate the document concepts of OLTP system. Instead it captures information about discrete states in a business process as they occur.

With real-time and cross application data capture, we are finally free to implement designs based on event states rather than documents or transactions. Continuing the order example, each row in a state-based table is a snapshot of what is known about the order when it reached that state (taken or picked or shipped). In a sense a state table is a far superior "snapshot", since every row in it represents what is known about the order at a consistent state in the process.

Inventories and balances are a unique category of snapshot. They record the quantity or level of some tangible asset as of a particular valuation point. Whether they are numbers that reflect actual inventory counts or are derived by netting inflows and outgos, they are defined by the discrete time of their calculation. Today, the number of relevant as-of times is increasing. You might have month-end immediate for one use, month-end after the first weekend of processing for another, and month-end final after all adjustments. Weekly, daily, shift, and even instant balances now complement these long cycle views.

The best and possibly only way to account for all the possible variations involves continuous sampling using real-time techniques. Any balance required for legal, analytic, or other reason needs to be frozen for posterity. We do this by writing it to the static data warehouse partition. A data warehouse must be designed to collect a large number of snapshots based on the consistency goals of the driving need.

The technologies to support Real-time Data Warehousing are coming online today. The early adopters are already developing the techniques. The ball is in now in your court.

Components of a Real-time Data Distribution Environment

(1) Real-Time Capture – Data capture functionality is provided by the application or the DBMS on a transaction-by-transaction basis. Several different mechanisms are currently available.

(a) Template-Enabled Event Capture: Data is captured both from an active transaction and from one or more associated data tables as the transaction is processed. It is immediately written to a data table or a message queue based on a predefined mapping template. This method has the advantages of being able to capture related context from the database while allowing precise control over output layout. It may also provide visibility to before and after changes thus supporting full delta (changed data) processing. SAP's IDOC facility is an example of this capability.

(b) Transaction Mirroring: The active transaction is written to an external data table or message queue as it is processed. This method only captures data content embedded in the transaction. It provides neither context from the database nor on-the-fly delta processing. Many off-the-shelf application packages (can be configured to) support this functionality.

(c) DBMS Triggers: Changed data can be captured using database supported triggers or rules. The outbound data element mapping is implemented by coding the stored procedure or rule. Additional contextual data can be added via additionally embedded lookups but is often not done for performance reasons. These DBMS mechanisms are designed for real-time capture but not real-time delivery. Unless they support output to a message queue, an external periodic polling mechanism is required to move data into the delivery pipeline.

(d) Custom Event Traps: Legacy applications that do not support any of the other methods can be modified to capture real-time events. If an application is well structured with centralized transaction management or modularized data writes, a customized code block can be embedded to capture and write event details to a queue.

(2) Real-time Delivery – The prototypical pipeline of the real-time delivery mechanism is a cross-platform message queue such as that provided by Oracle or IBM. Riding on top of this pipeline is a dispatch mechanism that directs each specific event (message) to the multiple targets that requested it. New Era of Networking (NEON) provides a form of publish and subscribe dispatch that extends IBM's MQ Series functionality.

Concerns exist in some quarters regarding how robust and scalable message queue products are today. ERP vendors offer proprietary pipelines with all the limitations that exist with such closed solutions. EAI vendors are attempting to move their more generalized, multi-point technologies into the real-time arena. We believe real-time delivery is a primal capability that will be driven by the infrastructure vendors to be as powerful, as extensible, and as bulletproof as the market demands.

(3) Transformation Engine – Data is transformed instance-by-instance by this continuous processing engine. The engine provides physical format conversion, calculations or derivations, code translation, table lookups, and other forms of row level transforms. When you talk about a transformation engine today, most people think of the Extract/Transform/Load (ETL) tools such as ETI and Constellar and data mart utilities (from Informatica, Sagent, Ardent, and others). All these tools are batch processors not real-time engines. Most of these products can be upgraded and repositioned to attack this need IF the vendors see this trend in time. NEON provides basic transformation as a part of the delivery process. Look for more real-time capability from the new wave of Enterprise Application Integration vendors.

(4) Message-ready Targets – Applications that are "message-ready" are one target for the real-time data distribution environment. These applications are more loosely coupled than application suites while being more intimately interrelated and responsive than batch-interfaced systems. The live feed into these applications can contain raw events as written by the source or transformed facts produced by value-added processing in the transformation engine.

(5) Legacy Systems – All existing applications and data warehouse environments are designed to collect data via batch interfaces. They can be incorporated into the RT/DDE without modification by building a flat file data staging area from the message queue. The flat files are cycled to simulate the load frequency of the target such as an existing operational data store.

(6) The Real-time Data Warehouse – The magic of the RTDW comes from collecting data continuously in a real-time partition while sweeping a consistent snapshot into the static partition at regular intervals. The real-time partition is also the host for incremental aggregation that allows key summary metrics to be built on-the-fly. The static partition functions as a traditional data warehouse. It maintains consistent history of atomic details with persistent storage of incremental net changes.

The term "partition" is used in both a strict and a loose fashion. In the loose sense, the real-time partition could be a separate database or just separate tables in the same database. However either of these methods results in much more complex data designs and creates more cumbersome synchronization problems. The preferred method involves table-level partitioning where the DBMS supports a single logical image (one design) while enforcing performance isolation between the real-time and static partitions. The ideal solution provides DBMS managed "consolidation" of real-time data rows into the static data partition.

(7) Incremental Aggregator – This is the engine that aggregates on-the-fly, as noted above. It supports the dynamic maintenance of multiple levels of aggregation simultaneous. Micrsoft's OLAP Services for SQL Server contains the essential core of this capability.

(8) Preparation Engine – Preparation is a batch process of selecting, collecting, aggregating and projecting data to create specialized access-optimized tables. This engine reads data from the static partition of the data warehouse to populate consumer-specific data marts. The previously mentioned data mart utilities provide these functions.

(9) Data Marts – The Real-time Data Warehouse provides seamless coverage for a broader range of analytic needs than a traditional load-and-go snapshot warehouse. Dependent data marts remain as an essential component of the complete solution. They continue to provide the best means of supporting multiple consumer-specific data slices and timeframes. A data mart can be built using data as recent as last moment to a ten year trend in the most optimal form for the task.

Top of Page


Previous Article  |  Table of Contents  |  Next Article