Wednesday, 1 January 2014

Business Intelligence



Business Intelligence (BI) refers to computer-based techniques for identifying, extracting, and analyzing business data, such as sales revenue by product or department, or the associated costs and incomes.

BI technologies provide historical, current, and predictive views of business operations. Common functions of Business Intelligence technologies are reporting, online analytical processing, analytics, data mining, business performance management, benchmarking, text mining, and predictive analytics.

Business Intelligence often aims to support better business decision-making; thus a BI system can be called a decision support system (DSS).[3] Though the term business intelligence is often used as a synonym for competitive intelligence, because both support decision-making, the two differ: BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes, and disseminates information, with or without support from technology and applications, and focuses on all-source information and data (unstructured or structured), mostly external to, but also internal to, a company.

What is NoSQL?



NoSQL is not the name of any particular database. Instead, it refers to a broad class of non-relational databases that differ from classical relational database management systems (RDBMS) in some significant respects, most notably that they do not use SQL as their primary query language, providing access instead through Application Programming Interfaces (APIs).


NoSQL...can be considered "Internet age" databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases.

NoSQL databases and data-processing frameworks are adopted primarily for their speed, scalability and flexibility. Enterprise-level adoption of NoSQL, however, is still emerging. Some consider it the absolute apogee of achievement, while others place it at the Peak of Inflated Expectations on Gartner's Hype Cycle, which characterizes the over-enthusiasm, or "hype", and subsequent disappointment that typically accompany the introduction of new technologies. Still others relegate it to an inferior and inconspicuous position in favor of columnar relational databases such as Sybase IQ or Oracle 11g.



Features of NoSQL databases

One major difference between traditional relational databases and NoSQL is that the latter generally do not guarantee atomicity, consistency, isolation and durability (commonly known as the ACID properties), although some support is beginning to emerge. Instead of ACID, NoSQL databases more or less follow a model called "BASE". We will discuss this in more detail later in the article.
ACID comprises a set of properties that guarantee that database transactions are processed reliably. To learn more about ACID, read What is a database? A question for both pro and newbie.

The other major difference is that NoSQL databases are generally schema-less, that is, records in these databases are not required to conform to a pre-defined storage schema.

In a relational database, the schema is the structure of a database system, described in a formal language supported by the DBMS; it defines how the database is constructed and divided into database objects such as tables, fields, relationships, views, indexes, packages, procedures, functions, queues, triggers and other elements.

In NoSQL databases, schema-free collections are used instead, so that documents with different types and structures, such as {"color": "blue"} and {"price": 23.5}, can be stored within a single collection.
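
To make this concrete, here is a minimal, hedged sketch using pymongo, the Python driver for MongoDB (a popular document store). It assumes a MongoDB server on localhost:27017; the database and collection names are invented for illustration.

# A minimal sketch using pymongo (assumes a MongoDB server on
# localhost:27017; database and collection names are made up).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client.demo_db.items

# Two documents with completely different structures in the same collection:
collection.insert_one({"color": "blue"})
collection.insert_one({"price": 23.5, "currency": "USD"})

# Queries simply match whatever fields a document happens to have.
for doc in collection.find({"color": "blue"}):
    print(doc)

No schema had to be declared before either insert; the applications that read these documents carry the knowledge of what the fields mean.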

The table below lists the major characteristics of NoSQL databases:

Schema-less: "Tables" don't have a pre-defined schema. Records have a variable number of fields that can vary from record to record; record contents and semantics are enforced by the applications.

Shared-nothing architecture: Instead of using a common storage pool (e.g., a SAN), each server uses only its own local storage. This allows storage to be accessed at local-disk speed instead of network speed, and capacity can be increased simply by adding more nodes. Cost is also reduced, since commodity hardware can be used.

Elasticity: Both storage and server capacity can be added on the fly merely by adding more servers; no downtime is required. When a new node is added, the database begins giving it work and requests to fulfill.

Sharding: Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist by assigning each record a partition ID (see the sketch after this table).

Asynchronous replication: In contrast to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases typically employ asynchronous replication. This allows writes to complete more quickly, since they don't depend on extra network traffic, but a side effect is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data.

BASE instead of ACID: NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere) in a way that tends to make true ACID transactions impractical.
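
As a hedged illustration of the application-assisted sharding mentioned above, the following Python sketch routes records to shards by hashing the key. The shard count and the in-process dictionaries standing in for separate servers are assumptions; replication and shard splitting are omitted.

# A minimal sketch of application-assisted sharding: each record key is
# hashed to a shard ID, so any node can compute where a record lives.
# (Real systems add replication and shard splitting; both omitted here.)
import hashlib

NUM_SHARDS = 8
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate servers

def shard_id(key: str) -> int:
    # A stable hash (unlike Python's built-in hash()) so every process
    # routes the same key to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value):
    shards[shard_id(key)][key] = value

def get(key: str):
    return shards[shard_id(key)].get(key)

put("user:42", {"name": "Ada"})
print(shard_id("user:42"), get("user:42"))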


Types of NoSQL databases

NoSQL database systems were created by major internet players such as Google, Facebook and LinkedIn, which faced significantly different data challenges than those addressed by traditional RDBMS solutions. They needed to extract information from large volumes of data that, to a greater or lesser degree, adhered to similar horizontal structures, and they realized that performance and real-time responsiveness mattered more than the consistency to which much of the processing time in a traditional RDBMS is devoted.

As such, NoSQL databases are often highly optimized for retrieve-and-append operations and often offer little functionality beyond record storage. Their reduced run-time flexibility compared to full SQL systems is counterbalanced by significant gains in scalability and performance for certain data models. NoSQL databases show their strengths above all in the flexible handling of variable data (document-oriented databases), in the representation of relationships (graph databases), and in the reduction of a database to a container of key-value pairs (key-value databases).

Consequently, NoSQL databases are often categorized according to the way they store data and fall under the following major categories:
  1. Key-value stores
  2. Columnar (or column-oriented) databases
  3. Graph databases
  4. Document databases
Key-value stores

Key-value stores allow an application to store its data as schema-less (key, value) pairs, held in hash-table-like structures so that each value can be accessed by its key. Such storage may not be very efficient, since it provides only a single way to access the values, but it eliminates the need for a fixed data model.
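
As a quick illustration, here is a hedged sketch using redis-py against Redis, one popular key-value store. It assumes a Redis server running on localhost:6379, and the key names are invented.

# A minimal sketch using redis-py against Redis, a popular key-value
# store (assumes a Redis server on localhost:6379).
import redis

r = redis.Redis(host="localhost", port=6379)

# The only access path is the key; the value is an opaque blob to the store.
r.set("user:42:name", "Ada Lovelace")
r.set("user:42:country", "UK")

print(r.get("user:42:name"))  # b'Ada Lovelace'
# Note there is no query like "find all users in the UK" without scanning
# keys: the single-access-path trade-off described above.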
Columnar databases

A column-oriented DBMS stores its content by column rather than by row. It contains predefined families of columns and is better at scaling and at updating at relatively high speeds, which offers advantages for data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items.
Graph databases

Graph databases optimize the storage of networks, or "graphs", of related nodal data as a single logical unit. A graph database uses graph structures with nodes, edges and properties to represent and store data, and provides index-free adjacency, meaning that every element contains a direct pointer to its adjacent elements and no index lookups are necessary. This is useful for finding degrees of separation, where SQL would require extremely complex queries. A popular movie service, for example, shows the logged-in user a "Best Guess for You" rating for each film based on how similar people rated it, while services such as LinkedIn or Facebook show people in a network at various degrees of separation. Although such queries become simple in graph databases, the relevance of this technology in a financial enterprise is difficult to determine.
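
The following sketch shows the degrees-of-separation idea as a plain breadth-first search over an adjacency list. The friendship data is invented, and a real graph database would execute this traversal natively rather than in application code.

# A minimal sketch of "degrees of separation" over an adjacency-list
# graph, the kind of traversal a graph database answers natively via
# index-free adjacency. The friendship data is invented.
from collections import deque

friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
}

def degrees_of_separation(start: str, target: str) -> int:
    # Breadth-first search: each hop from a person to a friend is one degree.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if person == target:
            return depth
        for friend in friends.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return -1  # not connected

print(degrees_of_separation("alice", "erin"))  # 3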
Document databases

Document stores are used for large, unstructured or semi-structured records. Data is organized in documents that can contain any number of fields of any length. Document-oriented database implementations assume that documents encapsulate and encode data in some standard format, known as an encoding; these can even be MS Office or PDF documents. Document databases should not be confused with Document Management Systems, however: the documents referred to are not actual documents as such, although they can be.

Documents inside a document-oriented database are similar in some ways to records or rows in relational databases, but they are less rigid because they are not required to adhere to a standard schema. Unlike a relational database, where each record has the same set of fields and unused fields might be kept empty, there are no empty fields in document records. This allows new information to be added to or removed from any record without wasting space by creating empty fields on all other records.

In contrast to key-value and columnar databases, which view each record as a list of attributes updated one at a time, document stores allow insertion, updates and queries of entire records, typically using JavaScript Object Notation (JSON). The concept of a join is less relevant in document databases than in traditional RDBMS systems; records that might be joined in a traditional RDBMS are generally denormalized into wide records. Denormalization is the process of optimizing the read performance of a database by adding redundant or grouped data. Some NoSQL vendors, most notably MongoDB, do in fact offer add-on join capabilities as well.

Many of these database categories are beginning to blur, however. Since all of them support the association of values with keys, they are all fundamentally key-value stores, and document databases can perform all of the capabilities of columnar databases from a semantic point of view. As a result, the distinguishing factors must be evaluated in terms of performance and ease of use for a particular solution.
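
As a small, hedged illustration of denormalization, the sketch below collapses what would be three joined tables in an RDBMS into one wide JSON document. All field names and values are invented.

# A minimal sketch of denormalization: what would be three joined tables
# in an RDBMS (orders, customers, order_lines) becomes one wide document.
import json

order_document = {
    "order_id": 1001,
    "customer": {            # embedded instead of joined via customer_id
        "name": "Acme Corp",
        "city": "Oslo",
    },
    "lines": [               # embedded instead of joined via order_id
        {"sku": "A-17", "qty": 2, "price": 9.50},
        {"sku": "B-03", "qty": 1, "price": 23.50},
    ],
}

# The whole record can be read, shipped, or updated as a single unit.
print(json.dumps(order_document, indent=2))

The customer data is duplicated in every order that references it, which costs storage but lets a single read fetch everything the application needs.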
Popular incarnations of NoSQL databases

Most implemented solutions cannot be strictly assigned to a specific type and contain features from two or more categories. We should also recognize that each NoSQL implementation has its own special nuances. Popular offerings include the following:
Apache Cassandra

Apache Cassandra is an open-source, distributed database-management system designed to handle very large amounts of data spread out across many commodity servers while providing a high degree of service availability with no single point of failure. It is particularly fast at write operations as opposed to reads and might therefore lend itself best to applications that require analysis of large sets of data with write-backs.
HBase

HBase is also an open-source, distributed database, modeled after Google's BigTable. It runs on top of the Hadoop distributed file system and works closely with Hadoop's processing frameworks to accomplish highly scalable analyses. HBase scales linearly with the number of nodes and can quickly return queries on tables consisting of billions of rows and millions of columns.
BigTable

BigTable can be described as a sparse, distributed, multi-dimensional sorted map. It is designed to scale into the petabyte range (a petabyte is equivalent to 1 million gigabytes) across hundreds or thousands of machines, and to make it easy to add more machines and start taking advantage of those resources automatically, without any reconfiguration.
Coherence and Ehcache

Coherence and Ehcache provide in-memory caches. Coherence is in heavy use in the financial industry, where network latency (the time it takes for data to cross a network connection from sender to receiver) is a significant factor.
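
The class below is not the Coherence or Ehcache API; it is only a minimal Python illustration of the in-memory caching idea they implement: keeping hot data in RAM with an expiry, so reads avoid the backing store and its network latency.

# Not the Coherence or Ehcache API, just a minimal illustration of an
# in-memory cache: hot data lives in RAM with an expiry.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry_timestamp)

    def put(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() > expires_at:
            del self.store[key]  # expired; caller must re-fetch from source
            return None
        return value

cache = TTLCache(ttl_seconds=30)
cache.put("EURUSD", 1.0874)  # e.g. a market-data quote
print(cache.get("EURUSD"))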
Possible applications of NoSQL databases

NoSQL databases should generally be considered as potential options whenever high-intensity computation or analysis of large data sets is required, especially in real time. This makes them attractive in many industry sectors, e.g. financial institutions' electronic-trading applications. Relational databases, especially the columnar variety, do not generally perform well on updates, so a NoSQL database can present itself as a viable alternative where massive updates are required. In situations involving variable record templates or sparse data, NoSQL document databases can offer a welcome alternative.

Data Warehousing Life Cycle

A Data Warehouse Supports OLTP

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database.
Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database.
If data is simply archived for preservation, it is not available or organized for use by analysts and decision makers. If data is allowed to accumulate in the OLTP database so it can be used for analysis, the OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.
A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak transaction efficiency. High volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.

OLAP is a Data Warehouse Tool

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.
A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. For example, a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less regardless of how many hundreds of millions of rows of data are stored in the data warehouse database.
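As a rough illustration of this kind of summarization, the pandas sketch below aggregates an invented sales table by region, product, and quarter. An OLAP server would precompute such aggregates over vastly larger data; this only shows the shape of the query.

# A minimal sketch of the kind of summarization an OLAP cube precomputes:
# total sales income and quantity by region, product, and period.
# The data set is tiny and invented.
import pandas as pd

sales = pd.DataFrame({
    "product": ["widget", "widget", "gadget", "gadget"],
    "region":  ["north",  "south",  "north",  "south"],
    "quarter": ["2013Q4", "2013Q4", "2013Q4", "2013Q4"],
    "income":  [1200.0,   800.0,    450.0,    600.0],
    "qty":     [12,       8,        3,        4],
})

summary = (
    sales
    .groupby(["region", "product", "quarter"])[["income", "qty"]]
    .sum()
)
print(summary)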
OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high volume update transactions. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries.
In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries.

Data Mining is a Data Warehouse Tool

Data mining is a technology that applies sophisticated and complex algorithms to analyze data and expose interesting information for analysis by decision makers. Whereas OLAP organizes data in a model suited for exploration by analysts, data mining performs analysis on data and provides the results to decision makers. Thus, OLAP supports model-driven analysis and data mining supports data-driven analysis.
Data mining has traditionally operated only on raw data in the data warehouse database or, more commonly, text files of data extracted from the data warehouse database. In SQL Server 2000, Analysis Services provides data mining technology that can analyze data in OLAP cubes, as well as data in the relational data warehouse database. In addition, data mining results can be incorporated into OLAP cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into the OLAP model. For example, data mining can be used to analyze sales data against customer attributes and create a new cube dimension to assist the analyst in the discovery of the information embedded in the cube data.
For more information and details about data mining in SQL Server 2000, see Chapter 24, "Effective Strategies for Data Mining," in the SQL Server 2000 Resource Kit.
Designing a Data Warehouse: Prerequisites
Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the data warehouse be clear and well understood. Because the purpose of a data warehouse is to serve users, it is also critical to understand the various types of users, their needs, and the characteristics of their interactions with the data warehouse.

Data Warehouse Architecture Goals

A data warehouse exists to serve its users—analysts and decision makers. A data warehouse must be designed to satisfy the following requirements:
  • Deliver a great user experience—user acceptance is the measure of success
  • Function without interfering with OLTP systems
  • Provide a central repository of consistent data
  • Answer complex queries quickly
  • Provide a variety of powerful analytical tools, such as OLAP and data mining
Most successful data warehouses that meet these requirements have these common characteristics:
  • Are based on a dimensional model
  • Contain historical data
  • Include both detailed and summarized data
  • Consolidate disparate data from multiple sources while retaining consistency
  • Focus on a single subject, such as sales, inventory, or finance
Data warehouses are often quite large. However, size is not an architectural goal—it is a characteristic driven by the amount of data needed to serve the users.

Data Warehouse Users

The success of a data warehouse is measured solely by its acceptance by users. Without users, historical data might as well be archived to magnetic tape and stored in the basement. Successful data warehouse design starts with understanding the users and their needs.
Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers, Information Consumers, and Executives. Each type makes up a portion of the user population as illustrated in this diagram.
Figure 1. The User Pyramid
Statisticians: There are typically only a handful of sophisticated analysts—Statisticians and operations research types—in any organization. Though few in number, they are some of the best users of the data warehouse; those whose work can contribute to closed loop systems that deeply influence the operations and profitability of the company. It is vital that these users come to love the data warehouse. Usually that is not difficult; these people are often very self-sufficient and need only to be pointed to the database and given some simple instructions about how to get to the data and what times of the day are best for performing large queries to retrieve data to analyze using their own sophisticated tools. They can take it from there.
Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions of user access tools. They will figure out how to quantify a subject area. After a few iterations, their queries and reports typically get published for the benefit of the Information Consumers. Knowledge Workers are often deeply engaged with the data warehouse design and place the greatest demands on the ongoing data warehouse operations team for training and support.
Information Consumers: Most users of the data warehouse are Information Consumers; they will probably never compose a true ad-hoc query. They use static or simple interactive reports that others have developed. It is easy to forget about these users, because they usually interact with the data warehouse only through the work product of others. Do not neglect these users! This group includes a large number of people, and published reports are highly visible. Set up a great communication infrastructure for distributing information widely, and gather feedback from these users to improve the information sites over time.
Executives: Executives are a special case of the Information Consumers group. Few executives actually issue their own queries, but an executive's slightest musing can generate a flurry of activity among the other types of users. A wise data warehouse designer/implementer/owner will develop a very cool digital dashboard for executives, assuming it is easy and economical to do so. Usually this should follow other data warehouse work, but it never hurts to impress the bosses.
How Users Query the Data Warehouse
Information for users can be extracted from the data warehouse relational database or from the output of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational database should be limited to those that cannot be accomplished through existing tools, which are often more efficient than direct queries and impose less load on the relational database.
Reporting tools and custom applications often access the database directly. Statisticians frequently extract data for use by special analytical tools. Analysts may write complex queries to extract and compile specific information not readily accessible through existing tools. Information consumers do not interact directly with the relational database but may receive e-mail reports or access web pages that expose data from the relational database. Executives use standard reports or ask others to create specialized reports for them.
When using the Analysis Services tools in SQL Server 2000, Statisticians will often perform data mining, Analysts will write MDX queries against OLAP cubes and use data mining, and Information Consumers will use interactive reports designed by others.
Developing a Data Warehouse: Details
The phases of a data warehouse project listed below are similar to those of most database projects, starting with identifying requirements and ending with deploying the system:
  • Identify and gather requirements
  • Design the dimensional model
  • Develop the architecture, including the Operational Data Store (ODS)
  • Design the relational database and OLAP cubes
  • Develop the data maintenance applications
  • Develop analysis applications
  • Test and deploy the system

Identify and Gather Requirements
Identify sponsors. A successful data warehouse project needs a sponsor in the business organization and usually a second sponsor in the Information Technology group. Sponsors must understand and support the business value of the project.
Understand the business before entering into discussions with users. Then interview and work with the users, not the data—learn the needs of the users and turn these needs into project requirements. Find out what information they need to be more successful at their jobs, not what data they think should be in the data warehouse; it is the data warehouse designer's job to determine what data is necessary to provide the information. Topics for discussion are the users' objectives and challenges and how they go about making business decisions. Business users should be closely tied to the design team during the logical design process; they are the people who understand the meaning of existing data. Many successful projects include several business users on the design team to act as data experts and "sounding boards" for design concepts. Whatever the structure of the team, it is important that business users feel ownership for the resulting system.
Interview data experts after interviewing several users. Find out from the experts what data exists and where it resides, but only after you understand the basic business needs of the end users. Information about available data is needed early in the process, before you complete the analysis of the business needs, but the physical design of existing data should not be allowed to have much influence on discussions about business needs.
Communicate with users often and thoroughly—continue discussions as requirements continue to solidify so that everyone participates in the progress of the requirements definition.
Design the Dimensional Model
User requirements and data realities drive the design of the dimensional model, which must address business needs, grain of detail, and what dimensions and facts to include.
The dimensional model must suit the requirements of the users and support ease of use for direct access. The model must also be designed so that it is easy to maintain and can adapt to future changes. The model design must result in a relational database that supports OLAP cubes to provide "instantaneous" query results for analysts.
An OLTP system requires a normalized structure to minimize redundancy, provide validation of input data, and support a high volume of fast transactions. A transaction usually involves a single business event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider web of hundreds or even thousands of related tables.
In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and relate to business needs, supports simplified business queries, and provides superior query performance by minimizing table joins.
For example, contrast the very simplified OLTP data model in the first diagram below with the data warehouse dimensional model in the second diagram. Which one better supports the ease of developing reports and simple, efficient summarization queries?
Figure 2. Flow Chart
Figure 3. Star Diagram
Dimensional Model Schemas
The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. When realized in a database, the schema for a dimensional model contains a central fact table and multiple dimension tables. A dimensional model may produce a star schema or a snowflake schema.
Star Schemas
A schema is called a star schema if all dimension tables can be joined directly to the fact table. The following diagram shows a classic star schema.
Figure 4. Classic star schema, sales
The following diagram shows a clickstream star schema.
Figure 5. Clickstream star schema
Snowflake Schemas
A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact table but must join through other dimension tables. For example, a dimension that describes products may be separated into three tables (snowflaked) as illustrated in the following diagram.
Figure 6. Snowflake, three tables
A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram.
Figure 7. Heavily snowflaked schema

OLTP and OLAP Systems



OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.





- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphases for OLTP systems are very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF). (A transaction sketch follows after these two definitions.)

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
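
To make the OLTP side concrete, here is a minimal sketch of a short transaction using Python's sqlite3 module as a stand-in for a production OLTP database; the schema is invented.

# A minimal sketch of a short OLTP-style transaction (sqlite3 stands in
# for a production OLTP database; the schema is invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

try:
    with conn:  # one short transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO orders (customer, total) VALUES (?, ?)",
                     ("Acme Corp", 23.5))
        conn.execute("UPDATE orders SET total = total * 1.25 "
                     "WHERE customer = ?", ("Acme Corp",))
except sqlite3.Error as exc:
    print("transaction rolled back:", exc)

print(conn.execute("SELECT * FROM orders").fetchall())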


The following table summarizes the major differences between OLTP and OLAP system design.

OLTP System (Online Transaction Processing; operational system) vs. OLAP System (Online Analytical Processing; data warehouse)

Source of data
  OLTP: Operational data; OLTP systems are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and historical data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically de-normalized with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP: Back up religiously; operational data is critical to running the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments simply reload the OLTP data as a recovery method.

What is the difference between a Database and a Data Warehouse?

Database
  1. Used for Online Transaction Processing (OLTP), but can be used for other purposes such as data warehousing. It records data from users as transactions occur, building up history.
  2. The tables and joins are complex since they are normalized (for an RDBMS). This is done to reduce redundant data and to save storage space.
  3. Entity-relationship modeling techniques are used for RDBMS database design.
  4. Optimized for write operations.
  5. Performance is low for analytical queries.
Data Warehouse
  1. Used for Online Analytical Processing (OLAP). It reads historical data to support users' business decisions.
  2. The tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
  3. Dimensional modeling techniques are used for data warehouse design.
  4. Optimized for read operations.
  5. High performance for analytical queries.
  6. Is usually a database.
It is also worth noting that a data warehouse can be sourced from zero or more databases.

What is Data Warehousing?

Businesses measure all sorts of things in order to better manage their business. They measure how much a customer has spent on their products, what products have the best margin and how long it takes to get a product delivered to the customer after the order is confirmed. You name it, and most assuredly someone or some business out there is measuring it.

But measurements for a business are only as good as the delivery of this information to the people who manage what is being measured. This is where business reports come into play. Reports offer consolidation of the items measured. Generally speaking, the higher up in an organization a manager is, the more consolidated his or her reports need to be.

Computerized database technologies have been developed over the last few decades to allow businesses to store their data in highly efficient electronic formats, a.k.a. databases. Businesses rely on database systems geared toward capturing data from the users. These ‘front-end’ systems are often referred to as “On-Line Transactional Processing” systems (OLTP). Most transaction systems, in addition to their data entry and capture capabilities, also offer some amount of reporting. Even the most basic of reports can offer consolidation and some bit of context. For example, “How much did we sell last quarter as opposed to our projections?” But even the best and most complex of OLTP systems will have some reporting shortcomings for one of several reasons.


The concept of a “data warehouse” was developed to combat some of these issues. A data warehouse can be thought of as another database whose main purpose is high-volume reporting and analytics, as opposed to row-by-row data retrieval and manipulation. Data warehouses typically contain copies of data already managed by transaction systems, but they are designed and indexed for efficient bulk retrieval and reporting. In the following sections, we’ll explore some possible shortcomings of OLTP systems in regard to reporting and analysis and see how the implementation of a data warehouse can alleviate them.

Tuesday, 31 December 2013

Glossary of ETL

Source System
A database, application, file, or other storage facility from which the data in a data warehouse is derived.

Mapping
The definition of the relationship and data flow between source and target objects.

Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts that build and populate the data warehouse. A repository contains metadata.

Staging Area
A place where data is processed before entering the warehouse.

Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.
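
As a small, hedged example of a transformation step, the Python sketch below cleanses inconsistent country values and then aggregates amounts; the field names and fix-up rule are invented.

# A minimal sketch of a transformation step: cleanse inconsistent source
# values, then aggregate, i.e. anything beyond a straight copy.
from collections import defaultdict

source_rows = [
    {"country": "UK ",            "amount": "100.0"},
    {"country": "United Kingdom", "amount": "50.5"},
    {"country": "NO",             "amount": "75.0"},
]

COUNTRY_FIXUPS = {"united kingdom": "UK"}  # cleansing rule

totals = defaultdict(float)
for row in source_rows:
    country = row["country"].strip()                        # cleanse whitespace
    country = COUNTRY_FIXUPS.get(country.lower(), country)  # normalize names
    totals[country] += float(row["amount"])                 # cast and aggregate

print(dict(totals))  # {'UK': 150.5, 'NO': 75.0}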

Transportation
The process of moving copied or transformed data from a source to a data warehouse.

Target System
A database, application, file, or other storage facility to which the transformed source data is loaded in a data warehouse.