
Features - Enterprise Data Insights:
GETTING THE TRUTH OUT OF YOUR DATA WAREHOUSE
By Gary Mehler, Director of Software Development, SAS
It is possible to get "one version of the truth" from your data warehouse --
without starting from scratch.
Data warehousing has become the foundational layer of data upon which today's
business intelligence processes are built. Without a data mart or data
warehouse that has a documented structure, along with the ability to
accommodate updated data, analytic consistency over time cannot be achieved,
and enterprises will continue to suffer. After all, a data warehouse is
typically thought of as the means to achieve "one version of the truth" for an
organization. The more broadly that advanced analytics are used, the more
crucial the data warehousing process becomes.
In addition, it is important to truly understand the structured and documented
approach required to correctly perform data warehousing so that its principles
can be adopted in higher- level functions as well.
Data warehousing is more than just an ETL (Extract, Transform, and Load)
process for a data collection. It must also integrate the infrastructure upon
which this layer is built. This article examines the importance of metadata
processing, and explores how upgrading a data warehouse is possible through a
couple of new services -- multi-level planning/administration, and data
quality integration -- that significantly impact the efficiency and
manageability of the ETL process.
The importance of metadata processing
Data warehousing, like many other computer-based operations, relies on
metadata -- information that describes basic tabular data -- as well as
processes executed upon the basic data. In order to make infrastructure plans
for a large-scale warehouse, metadata needs should be considered as a critical
first step in the planning process.
Next-generation ETL servers will be highly dependent upon metadata services.
These services can be thought of as storage repositories for metadata, and are
built into collections of repositories that serve a particular function. For
example, the typical warehousing development methodology calls for a
three-level model for development (DEV), testing (TEST), and production (PROD)
use.
Each level represents a collection of repositories that comprise the warehouse
operation for that level. These collections have their own needs for
interoperation with minimal additional extra administrative overhead. Within a
level, such as the DEV environment, multiple ETL developers may be working on
various aspects of a larger project, and each of these projects can benefit
from metadata independence during the early development phase.
Change management is important to metadata processing because it allows for
separation of independent projects to be separated so that work can proceed in
parallel. Without change management, an ill-considered change could
effectively stop all work because a user would be subjected to all changes as
soon as they are made.
Change management is critical for real-world development of any type, and is
especially important for ETL development environments when interdependence of
data elements can be difficult to assess on the surface. Change management
allows developers to check out a particular object (such as a table
definition), experiment with the impact of changes in a protected environment
(a project workspace), and after its safety has been determined, check it back
into a shared work area for other users to see.
Once these metadata needs are considered, new upgrades that improve the
effectiveness of the ETL process become possible through a couple of new
services. Multi-level planning and administration, and data quality
integrationm, in particular.
Multi-user planning
Some forethought is required when multiple developers, administrators or
end-users are brought into the picture. For example,the inclusion of multiple
developers in the warehouse design and implementation process requires that a
common approach be taken for the storage of documentation and implementation
materials. This is so important because it ensures that the work of these
multiple warehouse developers can be integrated into a consistent and
maintainable structure.
In addition to handling multi-level migration, core metadata needs to be
available for multiple distributed sites, and needs to must have backup
support and restoration to mitigate against hardware failure. These
capabilities are supported with the general names of replication, whereby a
repository can be copied to another location. A topology of metadata servers
can be created to support this operation in a straightforward fashion.
Multi-user administration
The large numbers of people that are typically involved in an enterprise-wide
warehousing project means that it is crucial that the tools used must be aware
of the roles of these individuals, and help administrators make consistent
choices throughout, including:
- Creating project work areas for each ETL developer
- Setting group permissions easily for all the ETL developers
- Sharing common information as needed
- Ensuring that updates are made with an audit trail
- Using a check-out/check-in approach to keep prevent overwrites from
occurring
The value of storing information in a common metadata repository is clear when
the re-use of this information through a management dashboard is determined.
An administrator can define complex definitions for all data and application
servers just once, and then share that information with the user
community.
Putting all of the information in one place makes it easy for administrators
to manage large-scale enterprises, but also requires authorization management
to ensure that each user has permission to see just exactly what they are
supposed to see. With a management console in place, this is easy to
manage.
Having this information in one place will become an ever-important feature
because each object in the metadata repository can be better administered.
This means that servers are not only limited to setting authorization rights,
but can also display columns in tables or reports.
Data quality integration
No discussion of advanced data warehousing would be complete without
considering the need to understand and manage quality issues. Whenever data is
assembled from a variety of sources, minimizing the effects of defective data
is a major concern. In addition, reducing data redundancies and standardizing
data elements increases the accuracy of the data, making it more usable. A
data quality server can handle the task of removing erroneous data and match
common data when merging data from previously disparate sources. Cleansed data
leads to more usable reliable results, so consideration of cleansing data in
the ETL processes delivers a good return on investment for the resultant
warehouse.
A data quality server can be used in many ways to help leverage exploit the
full power of data quality in the ETL process:
- To generate match codes for transformation, reporting and analysis in
order
to generate and apply schemes that transform data sets, which can be done
through another ETL transformation
- To determine upper- or lower-case standardization for
data values
- To utilize other functions that can determine gender
or locale
Beginning the ETL design process
The actual ETL design process takes place once the administrative functions
are up and running. Of all of the data warehousing development activities,
the ETL process is usually the most time-consuming and difficult to design and
operate of any of the data warehousing development activities. However, new
ETL tools on the market today, such as the SAS ETL Studio, can greatly
simplify the data management process for IT professionals by gathering and
integrating data from multiple streams onto one platform.
ETL, after all, is about extraction, transformation and loading to support
business intelligence needs, the bottom-line reason for this entire
process.
Initial tasks in an ETL tool revolve around defining sources to extract and
targets to load. For example, an ETL developer has a simple task to perform --
transform basic warehouse and transactional data into a weekly summary table.
This would typically be done in a project area to contain the risk and scope
of work, and eventually would eventually be checked into a common area for
others to use. Further, a grouping would be defined to keep the pieces
together and help avoid confusion with components of other projects.
These tasks can now be implemented and easily accommodated more readily than
ever with today's ETL tools.
Building on a common warehouse metamodel (CWM) standard
When talking about data warehousing and the goal of defining and refining one
version of the truth, it is important to mention the CWM standard. The newest
warehousing products today are largely compliant with the open standard that
supports common warehouse metamodel (CWM) interchange. This important standard
allows for the sharing of metadata information to be shared across
applications. In a typical large enterprise, a data architect manages the
corporate or departmental data model is managed by a data architect. This data
model describes how various pieces interrelate and helps prevents multiple
versions of the truth from confusing a data warehousing environment. Using
standard data modeling tools like ErWin or Rational Rose, large-scale data
models are defined for use. These can be imported directly into many ETL
tools.
The outcome of a CWM import is all the table definitions that comprise the
model. In this way, a large number of target definitions can be transported
from the data model design environment into the ETL design studio for
implementation.
Conclusion
Today's data warehousing provides infrastructure and warehousing solutions
that really work. Through new developments in the metadata process, and the
advent of services such as data quality and multi-user planning, data
warehousing will continue to offer new capabilities for enterprise users that
meet -- and even exceed -- the challenges of an ever-changing business
intelligence software market. When executives in the business world have
greater confidence in the quality of their data, they make better, more
informed decisions. IT managers can trust that the ETL environment within
their infrastructure is stable and secure. An end-to-end solution encompassing
high data quality, deployment, and powerful analytics helps with any advanced
ETL design and implementation requirements, and keeps enterprises ahead of the
ever-changing curve.
About Gary Mehler
Gary Mehler is director of software development at SAS. He has computer
science degrees from the University of Illinois, and has been involved with
enterprise computing for the 12 years he has been with SAS Institute. Gary
has experience with enterprise systems, databases, ERP systems and general
data warehousing, and is a much-sought after speaker at numerous conferences
and trade show events.
|