DATA WAREHOUSING OVERVIEW

Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations. 
 

Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.  

Time-variant:
All data in the data warehouse is identified with a particular time period. 
Non-volatile
Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. 

 

Pre-Data Warehouse

The pre-Data Warehouse zone provides the data for data warehousing. Data Warehouse designers determine which data contains business value for insertion.

OLTP databases are where operational data are stored. OLTP databases can reside in transactional software applications such as Enterprise Resource Management (ERP), Supply Chain, Point of Sale, Customer Serving Software. OLTPs are design for transaction speed and accuracy.

Metadata ensures the sanctity and accuracy of data entering into the data lifecycle process. Meta-data ensures that data has the right format and relevancy. Organizations can take preventive action in reducing cost for the ETL stage by having a sound Metadata policy. The commonly used terminology to describe meta data is "data about data".

Data Cleansing & ETL

Before data enters the data warehouse, the extraction, transformation and cleaning (ETL) process ensures that the data passes the data quality threshold.

Extract, Transform, and Load (ETL) is a process in data warehousing that involves

  • Extracting data from outside sources, 
  • Transforming it to fit business needs (which can include quality levels), and ultimately 
  • Loading it into the end target, i.e. the data warehouse. 

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database. ETL can also be used for the integration with legacy systems. Usually ETL implementations store an audit trail on positive and negative process runs. In almost all designs, this audit trail is not at the level of granularity which would allow to reproduce the ETL's result if the raw data were not available

Data Repositories

The Data Warehouse repository is the database that stores active data of business value for an organization. The Data Warehouse modeling design is optimized for data analysis. 

There are variants of Data Warehouses - Data Marts and ODS. Data Marts are not physically any different from Data Warehouses. Data Marts can be though of as smaller Data Warehouses built on a departmental rather than on a company-wide level.

Data Warehouses collects data and is the repository for historical data. Hence it is not always efficient for providing up-to-date analysis. This is where ODS, Operational Data Stores, come in. ODS are used to hold recent data before migration to the Data Warehouse.

ODS are used to hold data that have a deeper history that OLTPs. Keep large amounts of data in OLTPs can tie down computer resources and slow down processing - imagine waiting at the ATM for 10 minutes between the prompts for inputs. .

Front-End Analysis

The last and most critical potion of the Data Warehouse overview are the front-end applications that business users will use to interact with data stored in the repositories.

Data Mining is the discovery of useful patterns in data. Data Mining are used for prediction analysis and classification - e.g. what is the likelihood that a customer will migrate to a competitor.

OLAP, Online Analytical Processing, is used to analyze historical data and slice the business information required. OLAPs are often used by managers. Slices of data that are useful to marketing managers can be - How many customers between the ages 24-45, that live in Delhi , buy over Rs 5000 worth of groceries a month?

Reporting tools are used to provide reports on the data. Data are displayed to show relevancy to the business and keep track of key performance indicators (KPI).

OLAP

On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities including:

  • calculations and modeling applied across dimensions, through hierarchies and/or across members
  • trend analysis over sequential time periods
  • slicing subsets for on-screen viewing
  • drill-down to deeper levels of consolidation
  • reach-through to underlying detail data
  • rotation to new dimensional comparisons in the viewing area

OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through use of an OLAP Server

 

Compiled by:
S.Nawab,
Scientist-C
Gautam Budh Nagar