Tuesday, September 16, 2008

Reporting Solution Alternatives

The following section discusses some common reporting solution alternatives. The alternatives usually represent an evolution in a company’s reporting sophistication. Generally, organizations start with some main reports from an OLTP (Online Transaction Processing) system. Once they meet the limitations of the OLTP system, they evolve their reporting into data warehouses. Eventually, even more complex reports and interactivity are required. This usually leads to the implementation of an OLAP system. We will take a look at each of these alternatives and their relative advantages.

Reporting with Relational Data (OLTP)

Transactional databases are designed to capture and manage real data as it is generated, for example, as products are purchased and as services are rendered. Relational databases are designed according to the rules of normal form and typically have many tables, each containing fragments of data rather than comprehensive information or business facts. This helps preserve the integrity and accuracy of data at the detail level, but it presents challenges for deriving useful information from a large volume of transactional data. In order to obtain information with meaningful context, tables must be joined and values must be aggregated.

For simple report requests, this usually is not an issue. Take the example of an invoice. An invoice is a simple report. It displays custom information along with detail for a small number of transactions. For this type of report, querying an OLTP system is not very costly and the query should be relatively straightforward. However, users will eventually move past these simple reports as they start to look for information for an entire year or product line. Developing these types of reports will eventually consume considerable resources on an OLTP system as well as require increasingly difficult queries. Although relational database systems may support complex queries, reporting against these queries routinely could prove to be slow and inefficient.

Relational Data Warehouses

Many organizations evolve away from reporting on their OLTP data. Usually their first step is to create a carbon copy of the OLTP system on another server. This alleviates the resource constraints on the original system, but it does not solve the issues around increasingly difficult queries. OLTP systems simply are not organized in a logical reporting structure.

To deal with increasing reporting needs, an entire industry has evolved to simply handle reporting. From this industry, individuals such as Ralph Kimball have refined standard patterns and methodologies for developing data warehouses. A common misconception is that a data warehouse is simply a denormalized transactional system. In reality, a data warehouse is another form of relational database that is organized into a reporting-friendly schema. Data is centered around what is known as a “fact” table. A fact table relates to business processes such as orders or enrollments. Radiating out from the fact table are dimensional tables. Dimensional tables contain attributes that further define the facts. These attributes could contain product names, geographic sales locations, or time and date information.

Relational data warehouses can significantly improve query performance on large data sets. However, they too have related drawbacks. These drawbacks generally relate to the fact that data is still stored in a relational format. Relational databases require joins to combine information. They also require aggregate functions to calculate summary-level detail. Both joins and aggregate functions can slow queries on very large sets of data. Relational databases also do not understand inherit associations in the data. Take the example of a product table. Each product table has a related subcategory and each subcategory has a related category. If you need to create a report that is product sales with its percentage makeup of each related subcategory, you have to understand the relationship and write it in your query. The same holds true for time relationships. If you need to create a report that contains year-to-date information, you need to understand what the current date is as well as all the related periods in the same year. These things are possible in SQL queries but take additional effort and require more maintenance. That moves us into our next type of reporting alternative: OLAP.

Reporting with Multidimensional Data (OLAP)

Multidimensional databases take a much different approach to data retrieval and storage than relational databases. Multidimensional databases are organized into objects called cubes. Cubes act as a semantic layer above your underlying database. These databases can contain numerous different relationships and very large sets of aggregate data.

As a multidimensional database, information can be aggregated across many dimensions. This data is preprocessed into the multidimensional structure. Because it is preprocessed, query times are significantly reduced for large additive data sets. Multidimensional databases also have the advantage of understanding relationships between and across dimensions. This opens the door to creating calculations and reports that would be extremely difficult in a relational database.

Imagine that a user asks you to create a report that displays the top five customers with their top three products by this year’s sales amount and compared to last year’s sales amount. Writing a SQL query to return the top five customers is fairly straightforward. However, returning each one’s top three products would require additional subqueries because the relational database does not understand the association between products and customers. The final part of the request can prove even more burdensome. Returning a single year’s data is easy, but nesting that data next to last year’s data can prove almost impossible. The SQL query for the above scenario would most likely contain a number of nested queries as well as some creative use of temporary tables. Besides being a terribly complex query, it probably would not perform that well. On the other hand, Multidimensional Expressions (MDX), the language used to query multidimensional databases, can handle this in a few simple calls—not because MDX is a more advanced language, but simply because the underlying database understands the associations in the data and has stored this information for quick retrieval.
References: Professional SQL Server™ 2005 Reporting Services

No comments: