DSstar Logo Providing News & Information For Data Intensive
Storage Solutions For The Enterprise

  |  Table of Contents  |  

Features - Enterprise Data Insights:

GETTING THE TRUTH OUT OF YOUR DATA WAREHOUSE
By Gary Mehler, Director of Software Development, SAS

It is possible to get "one version of the truth" from your data warehouse -- without starting from scratch.

Data warehousing has become the foundational layer of data upon which today's business intelligence processes are built. Without a data mart or data warehouse that has a documented structure, along with the ability to accommodate updated data, analytic consistency over time cannot be achieved, and enterprises will continue to suffer. After all, a data warehouse is typically thought of as the means to achieve "one version of the truth" for an organization. The more broadly that advanced analytics are used, the more crucial the data warehousing process becomes.

In addition, it is important to truly understand the structured and documented approach required to correctly perform data warehousing so that its principles can be adopted in higher- level functions as well.

Data warehousing is more than just an ETL (Extract, Transform, and Load) process for a data collection. It must also integrate the infrastructure upon which this layer is built. This article examines the importance of metadata processing, and explores how upgrading a data warehouse is possible through a couple of new services -- multi-level planning/administration, and data quality integration -- that significantly impact the efficiency and manageability of the ETL process.

The importance of metadata processing

Data warehousing, like many other computer-based operations, relies on metadata -- information that describes basic tabular data -- as well as processes executed upon the basic data. In order to make infrastructure plans for a large-scale warehouse, metadata needs should be considered as a critical first step in the planning process.

Next-generation ETL servers will be highly dependent upon metadata services. These services can be thought of as storage repositories for metadata, and are built into collections of repositories that serve a particular function. For example, the typical warehousing development methodology calls for a three-level model for development (DEV), testing (TEST), and production (PROD) use.

Each level represents a collection of repositories that comprise the warehouse operation for that level. These collections have their own needs for interoperation with minimal additional extra administrative overhead. Within a level, such as the DEV environment, multiple ETL developers may be working on various aspects of a larger project, and each of these projects can benefit from metadata independence during the early development phase.

Change management is important to metadata processing because it allows for separation of independent projects to be separated so that work can proceed in parallel. Without change management, an ill-considered change could effectively stop all work because a user would be subjected to all changes as soon as they are made.

Change management is critical for real-world development of any type, and is especially important for ETL development environments when interdependence of data elements can be difficult to assess on the surface. Change management allows developers to check out a particular object (such as a table definition), experiment with the impact of changes in a protected environment (a project workspace), and after its safety has been determined, check it back into a shared work area for other users to see.

Once these metadata needs are considered, new upgrades that improve the effectiveness of the ETL process become possible through a couple of new services. Multi-level planning and administration, and data quality integrationm, in particular.

Multi-user planning

Some forethought is required when multiple developers, administrators or end-users are brought into the picture. For example,the inclusion of multiple developers in the warehouse design and implementation process requires that a common approach be taken for the storage of documentation and implementation materials. This is so important because it ensures that the work of these multiple warehouse developers can be integrated into a consistent and maintainable structure.

In addition to handling multi-level migration, core metadata needs to be available for multiple distributed sites, and needs to must have backup support and restoration to mitigate against hardware failure. These capabilities are supported with the general names of replication, whereby a repository can be copied to another location. A topology of metadata servers can be created to support this operation in a straightforward fashion.

Multi-user administration

The large numbers of people that are typically involved in an enterprise-wide warehousing project means that it is crucial that the tools used must be aware of the roles of these individuals, and help administrators make consistent choices throughout, including:

  • Creating project work areas for each ETL developer
  • Setting group permissions easily for all the ETL developers
  • Sharing common information as needed
  • Ensuring that updates are made with an audit trail
  • Using a check-out/check-in approach to keep prevent overwrites from occurring

The value of storing information in a common metadata repository is clear when the re-use of this information through a management dashboard is determined. An administrator can define complex definitions for all data and application servers just once, and then share that information with the user community.

Putting all of the information in one place makes it easy for administrators to manage large-scale enterprises, but also requires authorization management to ensure that each user has permission to see just exactly what they are supposed to see. With a management console in place, this is easy to manage.

Having this information in one place will become an ever-important feature because each object in the metadata repository can be better administered. This means that servers are not only limited to setting authorization rights, but can also display columns in tables or reports.

Data quality integration

No discussion of advanced data warehousing would be complete without considering the need to understand and manage quality issues. Whenever data is assembled from a variety of sources, minimizing the effects of defective data is a major concern. In addition, reducing data redundancies and standardizing data elements increases the accuracy of the data, making it more usable. A data quality server can handle the task of removing erroneous data and match common data when merging data from previously disparate sources. Cleansed data leads to more usable reliable results, so consideration of cleansing data in the ETL processes delivers a good return on investment for the resultant warehouse.

A data quality server can be used in many ways to help leverage exploit the full power of data quality in the ETL process:

  • To generate match codes for transformation, reporting and analysis in order to generate and apply schemes that transform data sets, which can be done through another ETL transformation
  • To determine upper- or lower-case standardization for data values
  • To utilize other functions that can determine gender or locale

Beginning the ETL design process

The actual ETL design process takes place once the administrative functions are up and running. Of all of the data warehousing development activities, the ETL process is usually the most time-consuming and difficult to design and operate of any of the data warehousing development activities. However, new ETL tools on the market today, such as the SAS ETL Studio, can greatly simplify the data management process for IT professionals by gathering and integrating data from multiple streams onto one platform.

ETL, after all, is about extraction, transformation and loading to support business intelligence needs, the bottom-line reason for this entire process.

Initial tasks in an ETL tool revolve around defining sources to extract and targets to load. For example, an ETL developer has a simple task to perform -- transform basic warehouse and transactional data into a weekly summary table. This would typically be done in a project area to contain the risk and scope of work, and eventually would eventually be checked into a common area for others to use. Further, a grouping would be defined to keep the pieces together and help avoid confusion with components of other projects.

These tasks can now be implemented and easily accommodated more readily than ever with today's ETL tools.

Building on a common warehouse metamodel (CWM) standard

When talking about data warehousing and the goal of defining and refining one version of the truth, it is important to mention the CWM standard. The newest warehousing products today are largely compliant with the open standard that supports common warehouse metamodel (CWM) interchange. This important standard allows for the sharing of metadata information to be shared across applications. In a typical large enterprise, a data architect manages the corporate or departmental data model is managed by a data architect. This data model describes how various pieces interrelate and helps prevents multiple versions of the truth from confusing a data warehousing environment. Using standard data modeling tools like ErWin or Rational Rose, large-scale data models are defined for use. These can be imported directly into many ETL tools.

The outcome of a CWM import is all the table definitions that comprise the model. In this way, a large number of target definitions can be transported from the data model design environment into the ETL design studio for implementation.

Conclusion

Today's data warehousing provides infrastructure and warehousing solutions that really work. Through new developments in the metadata process, and the advent of services such as data quality and multi-user planning, data warehousing will continue to offer new capabilities for enterprise users that meet -- and even exceed -- the challenges of an ever-changing business intelligence software market. When executives in the business world have greater confidence in the quality of their data, they make better, more informed decisions. IT managers can trust that the ETL environment within their infrastructure is stable and secure. An end-to-end solution encompassing high data quality, deployment, and powerful analytics helps with any advanced ETL design and implementation requirements, and keeps enterprises ahead of the ever-changing curve.

About Gary Mehler

Gary Mehler is director of software development at SAS. He has computer science degrees from the University of Illinois, and has been involved with enterprise computing for the 12 years he has been with SAS Institute. Gary has experience with enterprise systems, databases, ERP systems and general data warehousing, and is a much-sought after speaker at numerous conferences and trade show events.


Top of Page


  |  Table of Contents  |