Analysis & Commentary:
REAL-TIME DATA WAREHOUSING DEFINED
by Michael Haisten, Daman Consulting
We are now poised for the most radical transition in the short history of data
warehousing. We are about to witness the arrival of Real-time Data Warehousing
(RTDW). Despite its name, the impact of this change is due less to the
increase in speed of availability (which is significant) than to a significant
reduction in overall processing complexity. We will replace a brute force
process with elegant yet powerful simplicity.
Real-time Data Warehousing will eliminate the artificial bottlenecks we have
designed into our solutions from the very beginning. It will solve some of the
most vexing data warehouse problems such as change data capture. It will allow
us to return to co-engineered, end-to-end system architectures.
So what is RTDW? It is an historical and analytic component of an enterprise
level data stream. This data stream supports continuous, asynchronous,
multi-point delivery of data. To paraphrase: Data moves straight from the
originating source to all uses that do not require some form of staging. This
movement takes place soon after the original data is written. Any time delays
are due solely to transport latency and (optionally) miniscule processing
times to dispatch or transform the instance of data being delivered.
The data warehouse component of this information delivery architecture
contains both a real-time and a static partition. The real-time partition gets
its data an instance at a time from a transformation engine. At regular
intervals, consistent subsets of the real-time data are written to the static
partition. The static partition maintains the history and serves as the base
for supplying dependent data marts with periodic data.
Traditional Data Warehousing as Archeology
Building a data warehouse is like an archeological expedition. You generally
know what you are looking for but you have to decipher dusty old records to
find where to dig. You setup elaborate scaffolding that you can use only once
to get at the buried artifacts and the work is always dirty. When you find the
old relics, you must carefully package them for transport to a laboratory
where they are unwrapped and reassembled for display. This is all massively
labor-intensive and, in the end, you only have a bunch of old stuff that
represents an imperfect subset of what actually happened in the past.
Why is the record incomplete? A flood may wash away a whole layer of the
fossil record. This is similar to a purge or selective but undocumented
periodic deletions. A geological intrusion can replace something without
leaving evidence of what was there before. Our operational systems allow the
update of information that effectively destroys the historical record. The
items that survive best are those that were built to last or intentionally
preserved based on the cultural preferences and bias of that time. Since it is
very expensive to store things to last, few things survive. Likewise with
information systems, we have been very selective about what we keep since we
viewed storage as a scarce and costly luxury.
Perfect Capture
First and foremost, Real-time Data Warehousing eliminates the dig. The data we
need now and the data we will need in the future streams out of the sources
continuously as they come into being. The vast majority of the labor-intensive
development effort, the 65% to 80% involved with extraction, is eliminated.
Changed data capture is the process of collecting incremental net changes so
we can record history accurately. It allows us to maintain point-in-time
currency. It is essential to doing data warehousing right but is often very
difficult, if not impossible, to do after the fact by reading operational data
sources in batch. In Real-time Data Warehousing, a perfect form of changed
data capture is available at no extra cost. Both the before and the after
image of a record is available in memory supporting easy and efficient delta
processing. For dimensional data, you have much more freedom in how you
represent slowly changing dimensions.
When an event occurs, such as shipping an order, all relevant information can
be captured simultaneously. Today if we get an extract from a shipping
subsystem, it is not likely to have relevant customer, product, order, or
invoice cross-references necessary to fully describe the event. We have to
piece them back together by merging data from multiple sources to artificially
recreate the context of the event. Differing update timing may make some of
the associations spurious. There is enormous advantage in being able to
capture the full context of an event at creation time.
Quick Reconfiguration
By definition, our systems record only what we tell them to. Even the
monolithic enterprise resource planning suites cannot be designed to
anticipate everything we may need to analyze or record for posterity. The
ideal capture mechanism for a RTDW is an application service that uses
templates (such as SAP's IDOC) that specify a data context with an associated
trigger event.
Such a template-driven mechanism allows the data warehouse designer to
reconfigure an existing capture specification to include more content in
hours. A wholly new specification can be ready for implementation in a matter
of days. The limiting factor for availability is likely to be the turnaround
time for production migration rather than development time. This is in
contrast to a minimum of 3-5 days for building a single purpose interface when
using an extraction, transformation, and loading (ETL) tool or the 2-3 weeks
when writing it by hand.
Time Compression
As the pace of business picks up, processing cycles get compressed. Buying via
mail order used to take a week or two. Express delivery services helped reduce
this to 2 or 3 days. Now when you buy by phone or on the Internet, you can
have your product the very next day. This means the product is delivered to
the shipper the day the order is received.
Keeping pace with this speedy processing puts a strain on the traditional
distinctions between operational and analytic information services. We use
trended data regarding what sells well by site by hour to manage daily
decisions. Actual daily sales activity is monitored to re-plan just-in-time
inventory replenishment that in turn impacts longer cycle purchase planning.
A real-time acquisition and delivery engine enables on-the-fly analysis while
ratcheting up seamlessly to support incremental aggregation and trend-line
confirmation. On-the-fly analysis queries or samples the real-time data
warehouse partition. The goal is to support immediate research of abnormal
conditions in a manner not supported by the OLTP system. Example: A real-time
transportation system getting satellite fed data of carrier deliveries
indicates that an abnormal number of shipments are unloading behind schedule.
On-the-fly analysis reveals the slow handling is for customer self-unloads and
that the effected shipments are a brand-new multi-product configuration. This
information is used to quickly write and transmit new breakdown instructions
to these customer sites.
Incremental aggregation is an alternative to the old method of creating
running totals. At regular intervals, such as every 15 minutes for securities
or every 30-60 minutes for product orders or bank transactions, an aggregation
engine runs against the real-time partition to create intermediate totals and
statistics. These aggregates can be cascaded into long period aggregates (such
as daily) and then stored in the static partition of the data warehouse.
Because this aggregation is done more continuously, it can actually support
high volumes than more traditional daily or weekly bulk processing.
Trend-line confirmation builds on incremental aggregation to provide a new
monitoring tool. Any time-sensitive activity such as dropped calls at cells
sites can be compared to a baseline at close to real-time. This provides much
more insight and control than even active "stop lighting", which only tells
you when cell sites have already overloaded. Trend-line confirmation reveals
the momentum and direction of a change that is out of the ordinary soon after
it begins.
Transitory States
Even if your enterprise is not (yet) running near real-time, you may be prone
to the effect of time compression. A key business process may be impacted by
transitory states. These are short duration activities that may end in some
form of reversal that leaves little or no evidence behind.
For instance, a credit hold is a temporary condition in an online order taking
system. An order goes on-hold and comes off, either with the order approved or
disapproved. For an approved order, you may not know that it went on-hold at
all. At most, there is a residual hold indicator in the system. Disapproved
orders will generally record a credit rejection reason. But, in neither case,
are the details of the hold event recorded. What happens when you introduce a
new product that appeals to a consumer segment with lower credit worthiness?
The number of orders per representative may decline drastically because of
hold delays but you cannot see this effect directly. The number of disapproved
orders may increase but this does not account for the lower number of orders
per rep.
With a real-time acquisition mechanism, you can capture the on-hold and
off-hold events as they occur. Due to the ability to quickly reconfigure the
data stream to capture new events, you can begin analysis of a new phenomenon
soon after you first notice it. Combined with other state transition events,
you can create a contemporaneous record that can be used for analyzing the
full order cycle.
Infinite Snapshots
Many data warehouse traditionalists will already be planning their rebuttals.
Real-time and warehousing just do not mix! Warehouses are about snapshots that
provide a consistent point-in-time view of the data. They argue that real-time
data capture violates this treasured principle.
The enlightened perspective is that you cannot achieve point-in-time
consistency without continuous, real-time data capture. The point-in-time that
is relevant for one use may be radically wrong for another. Some uses may need
to count orders taken but not released, released but not shipped, invoiced but
not paid, etc. The most flexible data design does not replicate the document
concepts of OLTP system. Instead it captures information about discrete states
in a business process as they occur.
With real-time and cross application data capture, we are finally free to
implement designs based on event states rather than documents or transactions.
Continuing the order example, each row in a state-based table is a snapshot of
what is known about the order when it reached that state (taken or picked or
shipped). In a sense a state table is a far superior "snapshot", since every
row in it represents what is known about the order at a consistent state in
the process.
Inventories and balances are a unique category of snapshot. They record the
quantity or level of some tangible asset as of a particular valuation point.
Whether they are numbers that reflect actual inventory counts or are derived
by netting inflows and outgos, they are defined by the discrete time of their
calculation. Today, the number of relevant as-of times is increasing. You
might have month-end immediate for one use, month-end after the first weekend
of processing for another, and month-end final after all adjustments. Weekly,
daily, shift, and even instant balances now complement these long cycle views.
The best and possibly only way to account for all the possible variations
involves continuous sampling using real-time techniques. Any balance required
for legal, analytic, or other reason needs to be frozen for posterity. We do
this by writing it to the static data warehouse partition. A data warehouse
must be designed to collect a large number of snapshots based on the
consistency goals of the driving need.
The technologies to support Real-time Data Warehousing are coming online
today. The early adopters are already developing the techniques. The ball is
in now in your court.
Components of a Real-time Data Distribution Environment
(1) Real-Time Capture – Data capture functionality is provided by the
application or the DBMS on a transaction-by-transaction basis. Several
different mechanisms are currently available.
(a) Template-Enabled Event Capture: Data is captured both from an active
transaction and from one or more associated data tables as the transaction is
processed. It is immediately written to a data table or a message queue based
on a predefined mapping template. This method has the advantages of being able
to capture related context from the database while allowing precise control
over output layout. It may also provide visibility to before and after changes
thus supporting full delta (changed data) processing. SAP's IDOC facility is
an example of this capability.
(b) Transaction Mirroring: The active transaction is written to an external
data table or message queue as it is processed. This method only captures data
content embedded in the transaction. It provides neither context from the
database nor on-the-fly delta processing. Many off-the-shelf application
packages (can be configured to) support this functionality.
(c) DBMS Triggers: Changed data can be captured using database supported
triggers or rules. The outbound data element mapping is implemented by coding
the stored procedure or rule. Additional contextual data can be added via
additionally embedded lookups but is often not done for performance reasons.
These DBMS mechanisms are designed for real-time capture but not real-time
delivery. Unless they support output to a message queue, an external periodic
polling mechanism is required to move data into the delivery pipeline.
(d) Custom Event Traps: Legacy applications that do not support any of the
other methods can be modified to capture real-time events. If an application
is well structured with centralized transaction management or modularized data
writes, a customized code block can be embedded to capture and write event
details to a queue.
(2) Real-time Delivery – The prototypical pipeline of the real-time delivery
mechanism is a cross-platform message queue such as that provided by Oracle or
IBM. Riding on top of this pipeline is a dispatch mechanism that directs each
specific event (message) to the multiple targets that requested it. New Era of
Networking (NEON) provides a form of publish and subscribe dispatch that
extends IBM's MQ Series functionality.
Concerns exist in some quarters regarding how robust and scalable message
queue products are today. ERP vendors offer proprietary pipelines with all the
limitations that exist with such closed solutions. EAI vendors are attempting
to move their more generalized, multi-point technologies into the real-time
arena. We believe real-time delivery is a primal capability that will be
driven by the infrastructure vendors to be as powerful, as extensible, and as
bulletproof as the market demands.
(3) Transformation Engine – Data is transformed instance-by-instance by this
continuous processing engine. The engine provides physical format conversion,
calculations or derivations, code translation, table lookups, and other forms
of row level transforms. When you talk about a transformation engine today,
most people think of the Extract/Transform/Load (ETL) tools such as ETI and
Constellar and data mart utilities (from Informatica, Sagent, Ardent, and
others). All these tools are batch processors not real-time engines. Most of
these products can be upgraded and repositioned to attack this need IF the
vendors see this trend in time. NEON provides basic transformation as a part
of the delivery process. Look for more real-time capability from the new wave
of Enterprise Application Integration vendors.
(4) Message-ready Targets – Applications that are "message-ready" are one
target for the real-time data distribution environment. These applications are
more loosely coupled than application suites while being more intimately
interrelated and responsive than batch-interfaced systems. The live feed into
these applications can contain raw events as written by the source or
transformed facts produced by value-added processing in the transformation
engine.
(5) Legacy Systems – All existing applications and data warehouse environments
are designed to collect data via batch interfaces. They can be incorporated
into the RT/DDE without modification by building a flat file data staging area
from the message queue. The flat files are cycled to simulate the load
frequency of the target such as an existing operational data store.
(6) The Real-time Data Warehouse – The magic of the RTDW comes from collecting
data continuously in a real-time partition while sweeping a consistent
snapshot into the static partition at regular intervals. The real-time
partition is also the host for incremental aggregation that allows key summary
metrics to be built on-the-fly. The static partition functions as a
traditional data warehouse. It maintains consistent history of atomic details
with persistent storage of incremental net changes.
The term "partition" is used in both a strict and a loose fashion. In the
loose sense, the real-time partition could be a separate database or just
separate tables in the same database. However either of these methods results
in much more complex data designs and creates more cumbersome synchronization
problems. The preferred method involves table-level partitioning where the
DBMS supports a single logical image (one design) while enforcing performance
isolation between the real-time and static partitions. The ideal solution
provides DBMS managed "consolidation" of real-time data rows into the static
data partition.
(7) Incremental Aggregator – This is the engine that aggregates on-the-fly, as
noted above. It supports the dynamic maintenance of multiple levels of
aggregation simultaneous. Micrsoft's OLAP Services for SQL Server contains the
essential core of this capability.
(8) Preparation Engine – Preparation is a batch process of selecting,
collecting, aggregating and projecting data to create specialized
access-optimized tables. This engine reads data from the static partition of
the data warehouse to populate consumer-specific data marts. The previously
mentioned data mart utilities provide these functions.
(9) Data Marts – The Real-time Data Warehouse provides seamless coverage for a
broader range of analytic needs than a traditional load-and-go snapshot
warehouse. Dependent data marts remain as an essential component of the
complete solution. They continue to provide the best means of supporting
multiple consumer-specific data slices and timeframes. A data mart can be
built using data as recent as last moment to a ten year trend in the most
optimal form for the task.
|