

Manage Large Data Warehouses with SQL Server 2005

SQL Server Technical Article







Writer: Erin Welker, Scalability Experts

Technical Reviewers:    Sunil Agarwal

                  Torsten Grabs

                  Sanjay Mishra



Published: October 2006

Applies To: SQL Server 2005


Summary: This document discusses things to consider when managing a large, high-performance relational data warehouse, especially one that hosts direct queries from users and reports. The focus is on the efficient operation of management functions so that the data warehouse will be highly available to service query requests. The discussion includes some of the new features of SQL Server 2005 and considerations to take into account when using these features.






Table of Contents

Introduction
Partitioning for Management
Data Lifecycle Management
   Inserting new data
      Common insert strategies
   Maintaining pre-aggregated data
   Aging fact data
   Archiving fact data
   Maintaining dimensions (pruning)
Backup and Restore Strategies
   Backups for business continuity
   ETL checkpoints
   Recovery model
Database Health
   Database consistency checking
   Index fragmentation
   Update statistics
A High-Level Maintenance System
   Scenario
   Management schedule
Conclusion
References




Introduction

A number of activities may need to occur to effectively manage a data warehouse on the Microsoft® SQL Server™ platform. These activities include:

·         ETL (extraction, transformation, and loading of incremental data)

·         Database backups

·         Index defragmentation

·         Database health checks

·         Update statistics

·         Summary table updating

·         Moving aged data to less expensive disk

·         Archiving historic data

·         Dimension pruning (removal of obsolete dimension members)

There are costs associated with these activities that must be considered and mitigated. Costs usually come in one of the following forms:

·         Availability. Most of these operations can be performed while users are online, but at the potential cost of increased query execution time. User query response time is a particular concern if incremental updates of the data warehouse are performed throughout the day. If updates can be performed during a daily batch process, outside of the active query time frame, the emphasis is usually on the speed of the batch execution in order to meet user availability requirements.

·         Server resources. Excessive use of server resources can have an impact on users, if management activities are performed outside of a predefined batch process. If these activities can take place offline, perhaps at night or on the weekends, the primary consideration is disk space. Server resources should also be monitored in relation to parallel execution of management activities. The goal is to perform as much work at a time as possible, without thrashing any one of the server resources.

·         Administration costs. The more complex the process, the more highly skilled the administrative staff needs to be. Generally speaking, these processes should be highly automated to lessen the recurring human cost of managing the system. The automation itself should be kept simple, to minimize the cost of diagnosing and resolving occasional problems and to reduce exposure to staff turnover. In addition, the tasks covered in this document should be addressed in order to avoid slow performance and/or unplanned outages.

Dealing with these management issues in a mid- or large-sized relational data warehouse can seem a daunting task. When database administrators (DBAs) are responsible for database sizes over 500 GB, and certainly when the size is in the multi-terabyte range, they can wonder whether it is possible to do all that must be done to ensure a healthy, well-performing database within the required batch window. The thing to remember is that the vast majority of data warehouse fact tables, which are the tables that represent the largest proportion of the database, are mostly stable. That is, they represent history that is rarely, if ever, updated. Oftentimes, only 5-10% (or less) of the database is actively updated. For instance, a 2-terabyte database may only require active maintenance on 100-200 GB. You should also take another look at how traditional maintenance is performed: instead of broad-stroke maintenance, analyze where the maintenance truly needs to occur and when.

This white paper provides insight into other ways of looking at maintenance and how it is performed. First, we discuss how to partition, or breakup, very large tables so they can be more easily managed. Then we discuss the various management activities required to ensure a healthy, high-performance data warehouse. For each of these, we look at some ways to minimize the various costs mentioned above to make the entire process more efficient. Many of these techniques can be applied in OLTP environments, though the data warehouse database is the focus of this document. Finally, we present a sample data warehouse environment scenario and propose a high-level schedule for management activities.

This paper assumes basic knowledge of data warehousing concepts, dimensional modeling, SQL Server 2005 table partitioning, and SQL Server database maintenance activities. For additional information on these topics, see the References section at the end of this white paper.

Partitioning for Management

One of the primary tools for managing data warehouse operations is table partitioning. Partitioning is a means of breaking up data in tables into more manageable sections. Almost all management operations in SQL Server can be done at the partition level, so partitioning provides the ability to break up the management of a very large table into more manageable units. As we will see later, partitioning can also be used to segregate sections of the table that no longer require management so we are only left with those sections that do.

Partitioning can be either vertical or horizontal. Vertical partitioning refers to a method of breaking out columns, which logically belong to a single table, across multiple partitions which are stored as physical tables. This is an effective method in an OLTP environment but not in a data warehouse. In a data warehouse, the very large tables are fact tables with a tremendous number of rows but few columns. Horizontal partitioning involves segmenting tables based on values of a column in the table. Horizontal partitioning is almost a given in a data warehouse environment where the tables are of a significant size, say 300 GB or more.

Microsoft SQL Server has provided partitioning capability through partitioned views since version 7.0. SQL Server 2005 provides a new capability, table and index partitioning. Our focus on partitioning will use this new feature.

We can use table and index partitioning to segment data through a life cycle. For instance, we can decide to partition a Sales fact table by month of sale. As a set of monthly sales records age, say after three years, they lose their value and may be moved to an archival system or deleted. Thus, partitioning by a date value is extremely common in a data warehouse system. From the standpoint of management, any column that represents the age and/or segregates dynamic data from stable data is a potentially good choice.
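As a sketch of such a monthly scheme (all object names, boundary dates, and filegroups below are hypothetical, not taken from this paper), a Sales fact table partitioned by month of sale might be declared as:

```sql
-- Monthly, RANGE RIGHT partitioning: each boundary date begins a new
-- partition, so three boundaries yield four partitions/filegroups.
CREATE PARTITION FUNCTION pfMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('2006-08-01', '2006-09-01', '2006-10-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly
TO (FG_2006_07, FG_2006_08, FG_2006_09, FG_2006_10);

-- The fact table is created directly on the scheme, keyed by the
-- partitioning column (the date of sale in this example).
CREATE TABLE dbo.FactSales (
    OrderDate   datetime NOT NULL,
    ProductKey  int      NOT NULL,
    SalesAmount money    NOT NULL
) ON psMonthly (OrderDate);
```

Mapping each partition to its own filegroup, as here, is what enables the per-partition management operations discussed in the rest of this paper.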

One thing to note is that we typically use partitioning to divide a very large table, usually fact tables. The size boundary that indicates partitioning varies but is generally around the 300‑GB range. When determining whether a table should be partitioned, consider the projected size, not just the current size. Partitioning adds a level of complexity to the environment, so only consider this option for these large tables. Very large dimensions are rarely so large as to benefit greatly from partitioning. However, a very large dimension, such as a Customer dimension, can be partitioned to minimize the portion of the dimension table that requires maintenance at a given time period.

A helpful piece of information to obtain at the onset of a data warehousing project is the point at which data becomes stabilized, or is no longer updated. This may be obtainable from the source systems or from the business users. Late-arriving facts are not uncommon, but there is usually a point at which you can no longer expect them. Even if you cannot absolutely distinguish such a point of time, you can build some intelligence within the ETL (extraction, transformation, and loading) application to automate this. This is discussed in detail later in this paper.

An ultimate goal is to use this information to establish a time period after which fact data becomes stable. This can then be used to create a much more efficient data management strategy, as explained throughout the remainder of this white paper. A strong recommendation is to segment these partitions onto their own filegroup(s) which are then marked as read-only filegroups. Marking these filegroups as read-only is not a requirement, but doing so provides some significant benefits, such as the following:

·         It guarantees that no changes have been made that could require additional maintenance.

·         It eliminates the need for associated transaction log backups when restoring read-only filegroups.

·         It provides more flexible piecemeal restore scenarios when the database is set to the simple recovery model.[1]
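Marking an aged partition's filegroup read-only is a one-line operation (the database and filegroup names here are illustrative); note that setting the attribute requires exclusive access to the database:

```sql
-- Freeze a filegroup once its partitions are past the update window.
-- No users can be connected to the database when this runs.
ALTER DATABASE SalesDW MODIFY FILEGROUP FG_2006_07 READ_ONLY;
```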

With this goal in mind, there are two high-level strategies for laying out partitions on filegroups to segregate active data from latent data:

Create two filegroups, one for partitions that are still updated and one for read-only partitions

·         Advantages

·         Fewer filegroups to maintain

·         Simplicity

·         Disadvantages

·         Less flexibility in piecemeal restore.

·         Data must move when going from active to inactive.

·         Data “freeze” date must be firm, or else partitions may have to move back and forth between the two filegroups.

·         Can restrict management options, such as backup and consistency checking granularity.

Create one filegroup per partition, flag each relevant filegroup as read-only when the partition is no longer updated

·         Advantages

·         Data does not have to be moved when going between active and inactive.

·         Full piecemeal restore capabilities.

·         ETL is more efficient if data is updated after the established “freeze” date.

·         Provides flexible management options.

·         Disadvantages

·         More filegroups to maintain


For these reasons, the second strategy is strongly recommended when implementing the management strategy outlined in the remainder of this white paper. The simplicity of the first option, or a combination of both options, may be desirable in a smaller data warehouse where downtime for management is less of an issue.

A final partitioning decision has to do with granularity. It is very common to set up partition boundaries to be uniform across the entire table, for instance, by month. There is nothing wrong with this strategy, but remember that boundaries do not have to be uniform. It may be beneficial to have more granular partitions for current data and less granular partitions as data ages. For instance, data can be added as daily partitions and merged into weekly or monthly partitions as it ages. The usefulness of this strategy varies from environment to environment. The main thing to consider is whether there are occasional times, perhaps on the weekend, to devote to partition merging, and whether this is worth the time savings observed during data loads. Note that additional partitions are not the issue that they were with partitioned views. The number of partitions in a partitioned table has no effect on query compile time, and the supported number of partitions has been greatly increased with this feature (from 256 partitions with partitioned views to 1,000 partitions with partitioned tables).
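Merging finer-grained partitions into coarser ones is a single statement per boundary. A hypothetical example (function name and dates are illustrative):

```sql
-- Remove the daily boundary between October 1 and October 2, combining
-- those two daily partitions into one; repeating this for each day of
-- the month collapses daily partitions into a single monthly partition.
ALTER PARTITION FUNCTION pfSales() MERGE RANGE ('2006-10-02');
```

Because MERGE physically moves rows when the two partitions live on different filegroups, this is the kind of work best scheduled for a weekend window, as noted above.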

Data Lifecycle Management

The partitioning strategy lays the groundwork for the various management activities addressed in this white paper. The first set of activities address data life cycle management (adding, aging, and removing data). Let’s review some of the considerations for each of these categories.

Inserting new data

There are several activities that occur during the ETL batch cycle—inserting new data comprises a large part. Data inserts (and updates) have repercussions beyond the insert itself, most of which we discuss here.

First, it is important to determine the goals for data updates during the batch cycle. A common goal is to minimize the batch window by making the inserts faster. It is also important to simplify the ETL process to make it easier to maintain. Server resources should be used efficiently, both to reduce costs and to support the first goal of speeding the update process. In other words, parallelize to the extent possible without introducing bottlenecks. Server resources can include, in this case, CPU, disk, tempdb, network, transaction logs, memory, and more. It is also important to understand the side effects of the updates and the additional management activities that must be performed as a result. Index fragmentation can be managed, to an extent, by the design of the index. Row updates and late-arriving facts can still result in fragmentation at both the index and extent level.

Bulk inserts continue to be the fastest means of inserting new data. There are a number of conditions that must exist in order to achieve a true bulk insert, all of which are identified in SQL Server Books Online. The quickest way to determine if you’ve actually met all of the criteria is to examine the locks that are taken while the insert runs by running a query similar to the following:


SELECT request_session_id, resource_type, resource_subtype,
       resource_associated_entity_id, request_mode
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID('REAL_Warehouse_Sample')
ORDER BY request_session_id DESC;


You should see locks that look like the following:

[Figure: Bulk insert locks – minimal logging and locking]

Bulk loading is certainly a worthwhile goal on the initial loading of data, but it is not out of reach even during incremental loads. Aside from setting a few BULK INSERT (or BCP) options, the destination table must be empty or have no indexes. This can be accomplished in two ways: drop the indexes on the table prior to the load, or use a new partition for the new data. Dropping indexes is a common recommendation in any case when inserting a significant amount of data, but it only makes sense on smaller tables or on externally loaded partitions of a larger table.
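For example, a minimally logged load into an empty, unindexed staging table might look like the following (the table name, file path, and format options are hypothetical); TABLOCK is among the conditions listed in Books Online for achieving a bulk-logged load:

```sql
BULK INSERT dbo.FactSales_Staging
FROM '\\etlserver\extracts\sales_20061015.dat'
WITH (
    TABLOCK,                 -- table-level lock, needed for minimal logging
    FIELDTERMINATOR = '|',   -- assumes a pipe-delimited extract file
    ROWTERMINATOR   = '\n'
);
```

Running the sys.dm_tran_locks query shown earlier while this executes confirms whether a BULK_OPERATION lock was actually obtained.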

Common insert strategies

The strategy for inserting data depends on a few factors, namely the availability requirements of the data, the simplicity of the load process, and whether there is a combination of updates and inserts. Again, this is primarily an issue with fact tables, not only because of their size but also because of the number of inserts/updates that occur in a single batch cycle. Inserting new data into the data warehouse can present some challenges. Table partitioning provides the capability to load data outside of the table that users or applications access, so that the availability of the data is greatly increased. This was somewhat true of SQL Server 2000 and partitioned views, but has been improved with partitioned tables.

It is preferred that fact table updates be limited to inserts, but there are scenarios that require a combination of inserts and updates. There are a few high-level strategies for efficiently updating partitioned fact tables during the ETL process. They are mentioned here primarily due to their impact on partitioning design.

·         Insert/update rows directly in the partitioned table. This is the most straightforward method, but it will usually only perform well when there are no indexes on the partitioned table, when the rows to be inserted/updated are evenly scattered across several partitions, or when the number of inserted/updated rows is small. Evaluate the performance of this method against the next one if you are unclear which is better in your environment.

·         SWITCH out the partition to be updated, drop all indexes, update the data, recreate the indexes, then SWITCH the partition back in. This option is obviously more complex to implement. It is typical for most fact table updates to be highly concentrated in just one partition—the current partition. You can combine this method and the previous one, using this method to update the current partition and the previous method for all other inserts/updates. Again, evaluate the performance of either method, or a combination, to develop the best approach for your data loads.

·         Create a new partition to cover the span of time since the last load, insert the data, create the indexes, SPLIT the partitioned table (function) to make a home for the new partition, and SWITCH the partition in. The only benefit this approach has over the second option is that the data that was previously in the partitioned table is never unavailable to users. This might be a good approach for near real-time loads, allowing updates to occur while business users are accessing the data. The downside of this approach is that the partitions will probably need to be merged into larger partitions at some point, which can be time-consuming (perhaps during the weekend).
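A minimal sketch of the third strategy follows (all object names and dates are hypothetical; the staging table must exactly match the fact table's schema and indexes, reside on the target filegroup, and carry a CHECK constraint restricting it to the new partition's range):

```sql
-- Step 1 (not shown): stage and index the new day's rows in
-- dbo.FactSales_Staging on filegroup FG_2006_10.

-- Step 2: tell the scheme where the next partition goes, then SPLIT
-- the function to create the empty partition.
ALTER PARTITION SCHEME psSales NEXT USED FG_2006_10;
ALTER PARTITION FUNCTION pfSales() SPLIT RANGE ('2006-10-16');

-- Step 3: switch the staged data in; this is a metadata-only
-- operation, so existing data remains available throughout.
ALTER TABLE dbo.FactSales_Staging
    SWITCH TO dbo.FactSales PARTITION $PARTITION.pfSales('2006-10-16');
```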

We’ve mentioned a few strategies for managing the time it takes to load new data into the warehouse while mitigating the time data is unavailable for queries. The remarks regarding relative impact on the environment are based on experience, but this could vary from environment to environment. It is always best to compare methods within your own environment if you are unsure which is best for you.

Maintaining pre-aggregated data

Data warehouses are usually supplemented with some form of pre-aggregated data, which removes the need to repeatedly recalculate commonly queried aggregations. This concept is what allows Analysis Services to perform so well. In SQL Server, these pre-aggregations can take one of three forms, all of which require either implicit or explicit management after the data warehouse is updated. Indexed views are maintained implicitly, but this adds overhead when the underlying fact tables are updated. If the fact tables are sizeable and, therefore, partitioned, indexed views are probably not an option, since they would need to be rebuilt every time a partition was switched in or out. User-defined summary tables and OLAP cubes require their own maintenance when the detailed fact data is updated. This can be minimized by building partition awareness into the summary tables and/or cubes and building intelligence into the ETL process to indicate which partitions are updated during a load.

This is usually accomplished in Analysis Services cubes by partitioning the cube’s measure groups in the same way as the underlying fact table. This way, only the measure group partitions based on the fact table partitions that were updated are reprocessed, which greatly diminishes cube processing time.

When maintaining several partitions in a partitioned table, it is practical to maintain a table that is updated during the ETL process to keep track of which partitions are updated during each batch. This can be used to determine during scheduled maintenance which partitions have been updated since the last maintenance. This can be used, of course, to kick off the tasks to update cube partitions or to update segments of user-defined summary tables.
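One possible shape for such a tracking table is shown below; the schema and the $PARTITION-based logging are entirely illustrative, as the paper does not prescribe a design:

```sql
-- Records which partitions each ETL batch touched, so downstream
-- maintenance (cube processing, summary refresh) targets only those.
CREATE TABLE dbo.PartitionLoadLog (
    BatchID         int      NOT NULL,
    TableName       sysname  NOT NULL,
    PartitionNumber int      NOT NULL,
    LoadDate        datetime NOT NULL DEFAULT GETDATE()
);

-- During a load, map the staged rows' dates through the partition
-- function to find the partitions being updated (BatchID is made up):
INSERT dbo.PartitionLoadLog (BatchID, TableName, PartitionNumber)
SELECT 1042, 'FactSales', $PARTITION.pfSales(OrderDate)
FROM dbo.FactSales_Staging
GROUP BY $PARTITION.pfSales(OrderDate);
```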

Likewise, user-defined summary tables should have a component that identifies what rows are associated with a partition. The partitioning key will most likely be highly significant in the data warehouse, so it will naturally become part of any summary table. For example, if order date is used as the partitioning key, some aspect of order date will reside in any summary tables dependent on the granularity of this date dimension in the summary table. Let’s say the fact table is partitioned on order date by calendar month. The summary table date granularity may be higher, lower, or equal, but it will be enough to identify the rows that are changed. If the October partition is updated, all rows in the summary table that represent all or part of October should be updated. This can be used to only recalculate pre-aggregated data in the summary table just like Analysis Services does.

Aging fact data

In most data warehouses, the value of data diminishes over time. A certain amount of data must be kept online and accessible for queries, but the frequency of the query hits on data tends to go down as the data ages. This fact can be used to implement more cost-conscious storage methods by moving older data to less expensive disk subsystems. The cutoff points for aging data vary from environment to environment. This information can be gathered from business users at the beginning of the data warehouse project and monitored in production.

Usually, two or three storage subsystems are used to handle Tier 1, Tier 2, and potentially Tier 3 data. In a three-tier scenario, Tier 1 data storage may represent RAID 1+0 on the fastest and most highly-available storage subsystem (redundant, multipath). Tier 2 may be RAID 5 on the fastest storage (this assumes that data on Tier 2 is older than the known potential update window and won’t incur the write overhead of RAID 5). Tier 3 may represent RAID 5 on a less expensive disk storage technology. These are just examples; your configuration depends on budget, storage vendors, update windows, and other factors.

Part of the regular maintenance plan would be to move data from one tier to another as it ages. This is true data movement, which requires an extended batch window, such as over a weekend. It is best to minimize the number of partitions that move at any given time in order to manage the time required for the data movement. Also, each tier implies additional movement of data; that is, three tiers means that data will move twice in its lifetime, whereas two tiers will only require one move.

To elaborate, imagine a data warehouse where the most actively queried data is 13 rolling months. Year-to-date queries are common, as well as monthly comparisons to the same month last year. These queries represent roughly 75% of the total queries. We also have another cutoff that includes data from the last three fiscal years. This represents an additional 20% of the queries, so Tier 1 and Tier 2 storage will serve up data for 95% of the query load. All data that is prior to the beginning of this fiscal year minus 2 will move to Tier 3. We currently have data starting from January 1st, 2000. If the fiscal year begins on July 1st (FY 2007) and ends on June 30th, the data is represented as follows for this scenario on October 15th, 2006:



Tier     Storage attributes                                    Date Range                  # of Months
Tier 1   RAID 1+0; highest performance, highest availability   10/01/2005 to 10/15/2006    ~13
Tier 2   RAID 5; highest performance                           07/01/2004 to 09/30/2005    15
Tier 3   RAID 5; least expensive                               01/01/2000 to 06/30/2004    54

Note: The number of months does not necessarily directly correlate to the amount of the data in each tier. As business grows, the number of fact rows generally grows proportionally. In this case, Tier 1 space may be larger than Tier 2 space, even though there are more months stored on Tier 2.

Remember that this is just a hypothetical scenario. Two tiers of storage are likely to be more common, in order to minimize data movement and due to the storage options available in a particular environment.

Partitioning makes the data movement more effective. Tables large enough to warrant a storage tier strategy should almost certainly be partitioned for other management reasons.

Some storage vendors also provide a storage system feature to perform such movement behind the scenes. This occurs at the physical disk level and is undetected by SQL Server. This greatly minimizes the impact of many tiers and the amount of data that can move at one time. The database files will probably need to be specific to a partition, since entire disk file(s) will be moving.

After implementing an aging strategy, it is important to monitor the files to make sure that the anticipated activity is accurate. This can be accomplished by reviewing the disk activity in the System Monitor. The ability to monitor the files is somewhat related to the way that the database files are mapped to logical disks. You can ascertain whether data is moving too quickly by comparing disk activity on the Tier 2 storage to that of Tier 1, for example. If a high proportion of disk activity is regularly occurring on Tier 2 as opposed to Tier 1, data may be moving too early.
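In addition to System Monitor, the sys.dm_io_virtual_file_stats dynamic management function introduced in SQL Server 2005 reports cumulative I/O per database file; comparing read activity across the files on each tier shows whether older tiers are busier than expected. The database name below is illustrative:

```sql
-- Per-file I/O since instance startup; sort by reads to see which
-- files (and therefore which tiers) are serving the query load.
SELECT mf.name AS file_name, mf.physical_name,
       vfs.num_of_reads, vfs.num_of_bytes_read
FROM sys.dm_io_virtual_file_stats(DB_ID('SalesDW'), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_reads DESC;
```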

For additional information on data aging, see the Project REAL: Data Lifecycle – Partitioning white paper on the Project REAL Web site.

Archiving fact data

To manage the size of a data warehouse over time it is important to establish a time after which data is no longer needed. This should be asked of the business community at the beginning of the project so it can be built into the management process during development. Waiting until the data warehouse is bursting at the seams puts the DBA in a crisis mode where the process may not be executed as effectively. Also, this information is important to have when designing the partitioning strategy. Partition boundaries should not exceed the portion of data to be removed at any point in time. For instance, if data is to be removed a fiscal month at a time, the partition boundary should be no larger than fiscal month. This allows the use of partition switching as a means of removing the data from the partitioned fact table. The alternative is to use a DELETE command which takes much longer and has a heavy impact on the log.

If the preceding conditions have been met, the sliding window implementation to remove data is easily performed.

1.    Create a new table that mimics the partitioned table in terms of metadata and indexes.

2.    SWITCH out the oldest partition to the new table.

3.    Perform any operations necessary prior to completely removing the data, such as a final backup.

4.    TRUNCATE the new table.

5.    MERGE the oldest partition with the next oldest in the partitioned table (function).

6.    If more than one partition is to be archived, repeat this process from step #2.

7.    When all data has been removed, DROP the table that was used for the switch out process.
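The steps above might be sketched in T-SQL as follows (table, function, filegroup names, and dates are hypothetical; the switch table must match the fact table's columns and indexes and reside on the same filegroup as the partition being removed):

```sql
-- Step 1: a switch target matching the partitioned table's metadata,
-- on the oldest partition's filegroup.
CREATE TABLE dbo.FactSales_Switch (
    OrderDate   datetime NOT NULL,
    ProductKey  int      NOT NULL,
    SalesAmount money    NOT NULL
) ON FG_2003_01;

-- Step 2: metadata-only switch of the oldest partition.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Switch;

-- Step 3: a final backup of the switched-out data would happen here.

-- Steps 4-5: empty the switch table and remove the stale boundary.
TRUNCATE TABLE dbo.FactSales_Switch;
ALTER PARTITION FUNCTION pfSales() MERGE RANGE ('2003-02-01');

-- Step 7: after all partitions to be archived are processed.
DROP TABLE dbo.FactSales_Switch;
```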


For more details, see the partitioning white papers in References at the end of this whitepaper.

Maintaining dimensions (pruning)

A management activity that is often overlooked is dimension pruning, or the removal of old dimension members. Again, business rules dictate when dimension members can be removed. There are benefits to pruning dimensions on a regular basis, including:

·         Improved end user experience. It can be cumbersome to browse large dimensions. Oftentimes, users do not want to see dimension members that have no related fact table rows.

·         Query time. Joins to dimension tables are more efficient on smaller tables.

·         Potential benefits to cube processing and query performance

One criterion is to remove members that are not referenced by any fact table rows, although care must be taken to not remove new dimension members for which no facts have arrived. It is also possible that the business may opt to categorize dimension members in a generic member for very old references. Another consideration is with slowly changing dimensions—you may want to keep the first version of the dimension, even if it is no longer referenced by any fact rows. If a parent key is used to tie all versions of a dimension member together, be sure that this key is updated to the earliest represented version of the member if any of the initial versions are deleted.
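As a sketch of the first criterion (the dimension and fact schema here are entirely hypothetical), prune candidates are members with no referencing fact rows, excluding members added recently enough that facts may still arrive:

```sql
-- Candidate members for pruning: no fact references, and older than a
-- six-month grace period (InsertDate is an assumed audit column).
SELECT d.CustomerKey
FROM dbo.DimCustomer AS d
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.FactSales AS f
                  WHERE f.CustomerKey = d.CustomerKey)
  AND d.InsertDate < DATEADD(month, -6, GETDATE());
```

Review the candidate list against the business rules (slowly changing dimension versions, generic members) before turning the SELECT into a DELETE.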

Dimension pruning is usually an infrequent maintenance activity (quarterly or yearly), though this depends on the growth rate of a dimension. Be sure to rebuild indexes after dimensions are pruned to compact the index pages. Also run Update Statistics, if Auto Update Statistics is turned off.

Finally, the best way to manage the size of dimensions is to minimize the number of new inserts into the dimension. This is especially relevant with Type 2 slowly changing dimensions. Be sure to determine which dimension columns truly require versioning and insert new version rows only when any one of those columns is modified. Also, confirm whether dimension row versioning needs to occur before facts arrive that are related to them. For instance, if a product is created and multiple changes are made to add attributes to the product, it may not be necessary to track all of those changes unless or until the product starts selling (sales fact records start to arrive).

Backup and Restore Strategies

Database backups are often one of the first database maintenance activities that come to mind. Backups can seem less critical in a data warehouse database because the data can often be reproduced by reprocessing ETL. However, this is usually a more time-intensive proposition than simply restoring backups, especially in a severe disaster where the entire database must be recovered. It is also likely that the data is archived off of the source system long before it is removed from the data warehouse, making the data warehouse the only place where this historical data is available. If the database is strategically partitioned, the amount of data that must be initially restored in a recovery scenario can be greatly minimized. The ability to perform piecemeal restores makes critical data available much faster, as we explain in this section. Backups can also be used to create a mirror of a database for load balancing, or to perform additional maintenance, such as checking for index fragmentation or consistency checking, on a separate server.

New functionality in SSIS can also be used to create checkpoints throughout the ETL/batch process. If this process takes five hours every night, it is a good idea to take occasional checkpoints after significant activities so that a problem that occurs well down the line does not require that the entire batch process be rerun. If a problem occurs in the final hour and a checkpoint was taken after four hours of processing, the ETL process can be restarted (after the problem is resolved) from the last checkpoint, leaving less than an hour of batch processing to redo.

Backups for business continuity

A key thing to remember is that only data that has changed requires a new backup. We can utilize the partitioning strategies mentioned in the previous section to minimize the amount of data that requires backups. Specifically, there usually comes a time in any data warehouse where older partitions have little or no potential to be updated. This fact can be utilized to greatly minimize the required maintenance window. A data warehouse can extend into history for 5–7 years or more of fact data, but if only the last two months have the potential to be updated, those are the only partitions that require maintenance. This can be further guaranteed by setting the read-only attribute on filegroups related to stable partitions. As partitions age, simply flag the respective filegroup(s) as read-only and back up those filegroup(s); no further backups are required unless very late fact data arrives.
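The two steps above can be sketched in T-SQL as follows. The database name, filegroup name, and backup path are all illustrative; substitute your own. Note that changing a filegroup's read-only attribute requires exclusive access to the database.

```sql
-- Illustrative names: database [DataWarehouse], filegroup [FG_200608]
-- Flag the aged partition's filegroup as read-only
ALTER DATABASE [DataWarehouse]
MODIFY FILEGROUP [FG_200608] READONLY;

-- Take one final backup of the now-stable filegroup; no further
-- backups are needed unless very late fact data arrives
BACKUP DATABASE [DataWarehouse]
FILEGROUP = N'FG_200608'
TO DISK = N'E:\Backup\DataWarehouse_FG_200608.bak';
```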

The partitioning strategy suggested in this document supports this backup strategy. A best practice is to create one (or more) filegroups for dimension tables, another one for summary tables (if used), and to isolate table partitions to their own filegroup(s). The primary filegroup is used only for system tables. With this setup, only the primary, dimension, summary and active fact table partition filegroups require maintenance.

An additional benefit to this strategy is the ability to flexibly restore a data warehouse database in the event of a disaster. SQL Server 2005 Enterprise Edition provides an online piecemeal restore capability which makes the database available even while additional filegroups are restored. Only the primary filegroup must be restored in order to bring the database online. The remaining filegroups can be restored in order of query priority. This order usually requires restoring the filegroup(s) where the dimensions reside, followed by the summary tables, followed by each fact partition starting with the current partition and going backward. Note that any query that references partitions that have not yet been restored will fail. In other words, the query must include a filter on the partitioning key to eliminate nonloaded partitions.
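A piecemeal restore sequence following this priority order might be sketched as below. The database, filegroup, and backup file names are illustrative; the key points are the WITH PARTIAL clause on the initial sequence and the ordering of the subsequent filegroup restores.

```sql
-- Step 1: restore the primary filegroup; with Enterprise Edition the
-- database comes online as soon as this completes
RESTORE DATABASE [DataWarehouse]
FILEGROUP = 'PRIMARY'
FROM DISK = N'E:\Backup\DW_Primary.bak'
WITH PARTIAL, RECOVERY;

-- Step 2: restore the remaining filegroups in order of query priority,
-- dimensions first, then summaries, then fact partitions
RESTORE DATABASE [DataWarehouse]
FILEGROUP = 'FG_Dimensions'
FROM DISK = N'E:\Backup\DW_Dimensions.bak'
WITH RECOVERY;

-- Current fact partition first, then work backward through history
RESTORE DATABASE [DataWarehouse]
FILEGROUP = 'FG_200610'
FROM DISK = N'E:\Backup\DW_200610.bak'
WITH RECOVERY;
```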

For example, visualize a data warehouse database that has been laid out as follows:




Filegroup contents               Read write/Read-only

System objects                   Read write
All dimension tables             Read write
All summary tables               Read write
10/01/2006 - 10/31/2006          Read write
09/01/2006 - 09/30/2006          Read write
08/01/2006 - 08/31/2006          Read-only
07/01/2006 - 07/31/2006          Read-only
06/01/2006 - 06/30/2006          Read-only
05/01/2006 - 05/31/2006          Read-only
04/01/2006 - 04/30/2006          Read-only
Dormant xx (oldest partition)    Read-only



In this case, we’ll restore the filegroups in the order represented in the table. This means that the active (read write) filegroups will be available first. When restoring these filegroups individually, the database will be available as soon as the primary filegroup restore has successfully completed. This will not be very beneficial to the business users, though. We will continue by restoring the dimension tables, the summary tables, the current partition, and last month’s partition. At this point, a query like the following can be run successfully:

SELECT Subject, SUM(Sale_Amt)
FROM Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i ON f.SK_Item_ID = i.SK_Item_ID
WHERE SK_Date_ID BETWEEN 20061001 AND 20061031
GROUP BY Subject

This next query will not run successfully because it requests August 2006 data, which has not yet been restored:

SELECT Subject, SUM(Sale_Amt)
FROM Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i ON f.SK_Item_ID = i.SK_Item_ID
WHERE SK_Date_ID BETWEEN 20060801 AND 20061031
GROUP BY Subject

The following error is returned:

Msg 679, Level 16, State 1, Line 1
One of the partitions of index 'CIX_Tbl_Fact_Store_Sales_DateItemStore' for table 'dbo.Tbl_Fact_Store_Sales' (partition ID 72057594080067584) resides on a filegroup that cannot be accessed because it is offline, restoring, or defunct. This may limit the query result.

The following query will also not work because partition elimination does not occur due to a join to another table:

SELECT Subject, SUM(Sale_Amt)
FROM Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i ON f.SK_Item_ID = i.SK_Item_ID
JOIN Tbl_Dim_Date d ON f.SK_Date_ID = d.SK_Date_ID
WHERE d.Calendar_Year_ID = 2006
AND d.Calendar_Month_Desc = 'October'
GROUP BY Subject

This may be a good reason for using a smart key for the partitioning key. If the partitioning key is a surrogate key to a date dimension, consider using the date in lieu of a true surrogate key. If possible, use smalldatetime instead of datetime to compact the column from 8 bytes to 4. The valid values for smalldatetime range from January 1, 1900 through June 6, 2079, which is usually more than adequate. Time granularity is to the minute, which is usually not an issue in a data warehouse. Using a true date type instead of an integer enables the use of date functions in queries and reports.
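As a sketch, a monthly partition function declared directly on a smalldatetime key would look like the following. The function name and boundary values are hypothetical.

```sql
-- Each boundary value starts a new monthly partition (RANGE RIGHT)
CREATE PARTITION FUNCTION pfMonthly (smalldatetime)
AS RANGE RIGHT FOR VALUES ('2006-08-01', '2006-09-01', '2006-10-01');
```

Queries can then filter on the key with ordinary date predicates and functions while still achieving partition elimination.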

In general, the data type on all columns should be as compact as possible, especially on the fact table. A savings of 4 bytes may not seem significant, but when multiplied by one billion rows it can become so (that’s just over 3.7 GB). Columns that exist in a clustered index or partitioning key are also propagated to all nonclustered indexes on the same table. Smaller columns mean more data per page, which results in more efficient use of memory and less disk I/O.

In SQL Server 2005, read-only filegroup restores no longer require that a transaction log be applied as long as the filegroup backup is made after the filegroup was designated as read-only. This may not be an issue in a data warehouse where transaction log backups are rare, but the flexibility is there. Also, be sure that all regular database maintenance is performed on the partition(s) in filegroups prior to making them read-only.

Note that online piecemeal restores are only available with the Enterprise Edition of SQL Server 2005. Much of the functionality that we have just described depends on the table partitioning feature of SQL Server 2005, which requires the use of Enterprise Edition. Smaller data warehouses that are running on Standard Edition will likely not require the backup/restore complexity that is suggested here.

ETL checkpoints

Checkpoints should be performed periodically in a long running batch to minimize the amount of work to be redone if the ETL process fails. SQL Server Integration Services (SSIS) provides a new checkpoint capability in packages so that a restart resumes work where it left off. When the package is run, package state information is written to a checkpoint file. The restart of a package whose run was interrupted begins at the last checkpoint in the file (the package must be configured to use this feature). This can be used in combination with package transactions to implement a rollback to the last checkpoint, which should be the beginning of the transaction. Package restartability through checkpoints must be configured by setting the package’s CheckpointFileName, CheckpointUsage, and SaveCheckpoints properties. You must also set the FailPackageOnFailure property to True on all containers that you want to designate as restart points.

It is important to understand the underlying workings of SSIS checkpoints, as they may affect your package design. First of all, checkpoints are taken at the control flow container level. A data flow is a single unit of execution and, therefore, is rerun in its entirety if the package fails during the data flow’s execution. For that reason, you may wish to break up long data flows and call them individually from a control flow.

Also note that ForEach loop tasks and transacted containers constitute a unit of work. If a ForEach loop task is interrupted near the end of its iterations, it starts from the beginning. Likewise, a group of tasks that are grouped in a container that has been designated as a transaction are rolled back to the beginning of the transaction if a failure occurs during the transaction. This is by design and should be the desired outcome, but it can affect the package design.

Transactions should be used in conjunction with checkpoints to provide consistency in the database. Transactions in a package are very much the same as database transactions. They provide a means of designating a group of actions that should be performed as a unit. If there is a failure in the middle of the transaction, any updates that occurred from the beginning of the transaction up to the failure are rolled back. Transactions and checkpoints are used together to preserve database integrity in a restart scenario.

Transactions can be designated at the package, container, For Loop, Foreach Loop, or Sequence container level to provide a high degree of control for the package developer. Transactions are started by setting the TransactionOption property for the package or container to Required. Subsequent containers can join a transaction started by a preceding container by setting the container’s TransactionOption property to Supported.

For a full discussion on implementing checkpoints and transactions, see SQL Server 2005 Books Online or one of the white papers listed in the References section of this white paper.

Recovery model

A detailed discussion of recovery models is best left to SQL Server Books Online and other sources; only a few remarks are made here as they relate to backups of the data warehouse. The simple recovery model is often used in production data warehouses to manage the size of the transaction log during the heavy bulk update activity that occurs during ETL processing. It can seem an easy choice when the update process is fully managed, but it remains a risky one unless you are willing to absorb the cost of re-executing ETL. The simple recovery model also removes much of the flexibility of piecemeal restores. When recovering from backups made while the database was in the simple recovery model, all read-write filegroups must be restored in the initial piecemeal restore sequence. In other words, the database is not available to users until all read-write filegroups have been restored.

The recommended approach is to use the bulk-logged recovery model throughout the ETL batch processing. The data warehouse can be backed up using either differential or filegroup backups. This assumes that a point-in-time recovery is not relevant. After batch processing is complete and the database is readied for the business day, place the database in the simple recovery model. Ideally, make the database read-only to minimize locking, as well.
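The recovery model changes around the batch window can be scripted as follows. The database name is the one used elsewhere in this paper; note that SET READ_ONLY requires exclusive access to the database.

```sql
-- Before the nightly ETL batch: minimally log bulk operations
ALTER DATABASE [DataWarehouse] SET RECOVERY BULK_LOGGED;

-- ... ETL processing and differential/filegroup backups run here ...

-- After batch completes, ready the database for the business day
ALTER DATABASE [DataWarehouse] SET RECOVERY SIMPLE;
ALTER DATABASE [DataWarehouse] SET READ_ONLY;
```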

If the data warehouse is updated more frequently, such as in a near real-time environment, use either the bulk-logged or full recovery model based on your tolerance for point-in-time recovery.

Database Health

As with all SQL Server databases, there are a handful of health diagnostics that it is a best practice to perform on a data warehouse database. These are often omitted in data warehouses due to limited batch windows and their very large size. Maintenance activities such as DBCC CHECKDB may be viewed as a nicety that can be put off, sometimes permanently. This, and the other database health assessment operations described in this section, should be performed on a regular basis for the very reasons they are so important on an OLTP system.

Database consistency checking

SQL Server database administrators are usually well aware of the importance of database consistency checking, especially in the OLTP environment. We have seen this operation ignored frequently in the data warehouse environment due to the amount of time that it requires on these very large databases. This is further impacted by the fact that it is mostly an all-or-nothing proposition. The granularity of consistency checking is at the table level. Fact tables can easily grow into the multiterabyte range, making consistency checking very time consuming. Note that DBCC CHECKTABLE and its related commands now use an internal snapshot of the database in order to maximize concurrency (not block users). These commands can consume tremendous server resources, though, and are preferably run outside of the active query window.

A common trick in OLTP is to restore a backup of the OLTP database to a secondary server, where time-consuming maintenance operations can be performed without affecting users. This has the additional benefits of testing the backup, providing a warm backup for disaster recovery purposes, and creating an additional low-availability source for reporting. This trick has not been a good option for data warehouses because of the disk resources that are required to retain two copies of the database. Partial restores make this a far more viable option.

Partial restores can be used to restore only the active part of the data warehouse, which is usually a small fraction of the full size. Options are available in the BACKUP and RESTORE commands to easily back up and restore only the filegroups that are in read/write mode (the database name here is illustrative):

BACKUP DATABASE [DataWarehouse] READ_WRITE_FILEGROUPS
TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\Partial.bak'

RESTORE DATABASE [DataWarehouse]
FROM DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\Partial.bak'
WITH PARTIAL



At this point, DBCC CHECKTABLE can be used to check for consistency of tables that reside entirely within the restored filegroups, such as system, dimension, and summary tables. In SQL Server 2005 SP1, consistency checking is unavailable for tables that span filegroups where one or more of the filegroups were not recovered. This is the case with large fact tables, where many partitions reside on read-only filegroups that were not recovered during this partial restore. The ability to perform consistency checking of individual partitions of an index contained within a single filegroup by using the DBCC CHECKFILEGROUP command will be available in a future service pack. If partitions are isolated on their own filegroup, this can be viewed as a way of making the CHECKTABLE command partition-aware.
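On the partially restored copy, consistency checks can then be limited to objects that reside entirely within the restored filegroups. The table name below is from the earlier examples in this paper; the filegroup name is illustrative.

```sql
-- Check a dimension table that resides entirely in restored filegroups
DBCC CHECKTABLE ('dbo.Tbl_Dim_Item') WITH NO_INFOMSGS;

-- Or check all objects in a restored read-write filegroup at once
DBCC CHECKFILEGROUP (N'FG_Dimensions') WITH NO_INFOMSGS;
```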

Index fragmentation

Index fragmentation will undoubtedly occur over a period of time as new data is introduced into the data warehouse. This can be mitigated somewhat by insert methods and how indexes are defined, but it is next to impossible to entirely prevent. Creating a chronological clustered index is helpful, but usually cannot entirely prevent fragmentation. For instance, dimension tables often have one or more nonclustered indexes on a nonsequential key. Late-arriving fact records are inserted nonsequentially. Also, updates to dimension rows will often inflate the row to the point where a page split occurs. Indexes should be regularly monitored for fragmentation and reorganized when necessary.

A good way to deal with fragmentation is to prevent it from happening, to the extent possible. For instance, when adding inferred or incomplete dimension members, use column values that are indicative of the anticipated eventual size of the column. This can also be mitigated with Type 2 slowly changing dimensions, where changes in dimension columns result in a new row insert.

Since fragmentation cannot be completely avoided, minimize the maintenance tasks needed to defragment indexes. It is all too common to find that a single maintenance plan is run, perhaps over the weekend, to rebuild every index in every table in the database. Though this is thorough, it is rarely necessary. Use the new DMVs in SQL Server 2005 to determine which indexes in which partitions of which tables are fragmented enough to require an index rebuild. As always, remember that data that does not change cannot become fragmented. It is unnecessary to check for fragmentation on indexes on table partition filegroups that are read-only, let alone rebuild them. Maintenance overhead will be greatly reduced if you check only the active partitions. Also, defragment indexes in order of usage, most heavily used first. If all fragmented indexes cannot be managed in a single batch window, at least the next day’s query load will benefit from this strategy.

The previous means of determining index fragmentation (prior to SQL Server 2005) was to use DBCC SHOWCONTIG. Though this command is still operational in SQL Server 2005, use the new dynamic management view (DMV), sys.dm_db_index_physical_stats, to assess index fragmentation. This DMV is partition-aware and can be run against a partially restored database as previously described. The avg_fragmentation_in_percent column reports the level of fragmentation of a partition of a table index. The following query can be modified but is a good start. It returns a list of all partitions of all fragmented indexes (greater than 10%) in descending order by usage and fragmentation percent:


-- List of fragmented indexes by usage
SELECT OBJECT_NAME(f.object_id) AS 'Table name', i.name AS 'Index',
       f.index_type_desc, f.partition_number,
       ISNULL(user_seeks + user_scans + user_lookups, 0) AS 'Usage count',
       f.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') f
JOIN sys.indexes i
  ON i.object_id = f.object_id AND i.index_id = f.index_id
JOIN sys.tables t
  ON t.object_id = f.object_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats s
  ON s.database_id = f.database_id
 AND s.object_id = f.object_id
 AND s.index_id = f.index_id
WHERE f.avg_fragmentation_in_percent > 10.0
  AND f.index_id > 0
  AND f.page_count > 0
ORDER BY 'Usage count' DESC, f.avg_fragmentation_in_percent DESC


Likewise, the DBCC INDEXDEFRAG and DBCC DBREINDEX commands are being deprecated. Use the REORGANIZE and REBUILD options of the ALTER INDEX command, respectively, to replace these commands. They are equivalent to their predecessors.
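For example, the replacement commands are partition-aware. The index and table names below are from the earlier error-message example; the partition number is illustrative.

```sql
-- Replaces DBCC INDEXDEFRAG: reorganize only the fragmented partition
ALTER INDEX CIX_Tbl_Fact_Store_Sales_DateItemStore
ON dbo.Tbl_Fact_Store_Sales
REORGANIZE PARTITION = 22;

-- Replaces DBCC DBREINDEX: rebuild the index in its entirety
ALTER INDEX CIX_Tbl_Fact_Store_Sales_DateItemStore
ON dbo.Tbl_Fact_Store_Sales
REBUILD;
```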

In addition to checking for indexes that are fragmented, look into those that are rarely (or never) used. They may not be worth the upkeep (update overhead, space requirements, maintenance overhead). The SQL Server Customer Advisory Team has posted great information on one of their blogs, “How can SQL Server 2005 help me evaluate and manage indexes?”, which gives additional insight into useful DMVs and sample queries for comparing unused (or rarely used) indexes against their maintenance costs.

Update statistics

Index statistics are a critical input into the query optimizer. They are used to determine a high-level representation of data, which allows the optimizer to quickly determine the best indexes and access methods to use for fast query execution. The size of a data warehouse database may seem to warrant reconsideration of best practices regarding update statistics. Statistics are table-specific, not partition-specific, so it may seem unreasonable to allow SQL Server to update statistics whenever it deems necessary (after a significant proportion of the data has been updated). We briefly review key characteristics of data warehousing and SQL Server 2005 as they pertain to maintaining statistics.

In SQL Server, the first query to execute following a significant number of changes to one of the underlying tables can result in an update of the statistics prior to query compilation. This can add a few seconds (or sometimes minutes) to the execution time as the query will not run until the statistics update is complete. There is a new database option in SQL Server 2005 called AUTO_UPDATE_STATISTICS_ASYNC. If this option is turned on, statistics are updated in the background in the above scenario. The query that triggered the update statistics will compile, or re-use an existing query plan, based on the previous statistics. Only queries that run after the completion of the statistics update will compile using the updated statistics information.
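Enabling the asynchronous option is a single database setting; AUTO_UPDATE_STATISTICS must remain ON for it to have any effect. The database name is the one used elsewhere in this paper.

```sql
-- Update statistics in the background rather than blocking the
-- triggering query's compilation
ALTER DATABASE [DataWarehouse] SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Verify the setting
SELECT name, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = N'DataWarehouse';
```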

One way to detect whether you need to turn on the async option is to monitor the SP:Recompile event in SQL Server Profiler. The EventSubClass column shows the cause of the recompile (2 = Statistics changed). If recompiles are frequently attributed to changed statistics, queries are regularly paying the cost of statistics updates and the asynchronous option may help.

Though it is important to be aware of this behavior, the default database options (AUTO_UPDATE_STATISTICS ON, AUTO_UPDATE_STATISTICS_ASYNC OFF) are usually appropriate in a data warehouse environment. Long-running queries can benefit from fresh statistics: a few seconds spent updating them can yield an execution plan that shaves minutes or more from the query execution time. Manually updating statistics requires diligence and a thorough understanding of when to do so. If daily reports are run off of the data warehouse, it is a good idea to run them right after ETL processing if they are not already. The reports will absorb the minimal impact of updating statistics, and running them will also warm the cache. Otherwise, consider running some typical user queries at the end of ETL to warm the buffer cache. Warming the buffer cache can have the very positive effect of lowering the physical I/O incurred by first-run queries. This also updates statistics.

Depending on the size of your incremental updates, you might observe that statistics are not updated enough. The leftmost column on a statistics object must have changed for 500 rows plus 20% of the rows that existed when the statistics were last updated (for tables that had more than 500 rows when statistics were updated). On a large fact table that had 1 billion rows when statistics were last updated, the leftmost column had to have changed on 200,000,500 rows (including inserts). Since good query plans depend so much on good statistics, it is a good idea to update statistics manually on very large tables with relatively small incremental updates.
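A manual refresh on a large fact table might look like the following. The table and index names are from the earlier examples; the sampling rate is an assumption to tune for your environment.

```sql
-- Sampled refresh of all statistics on the fact table
UPDATE STATISTICS dbo.Tbl_Fact_Store_Sales WITH SAMPLE 5 PERCENT;

-- Full-scan refresh of the statistics behind one heavily used index
UPDATE STATISTICS dbo.Tbl_Fact_Store_Sales
CIX_Tbl_Fact_Store_Sales_DateItemStore WITH FULLSCAN;
```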

For more information on new features with statistics in SQL Server 2005, see “Statistics Used by the Query Optimizer in Microsoft SQL Server 2005”, listed in the References section.

A High-Level Maintenance System

Now that we’ve discussed the various maintenance activities to perform, let’s look at a high-level system for maintaining a moderate- to large-sized data warehouse with some of the suggestions from the preceding sections of this white paper. The scenario is intended to represent most maintenance challenges resulting from complex business requirements.


You are architecting the management strategy for a retail data warehouse:

·         The database is estimated to initially be 7 terabytes (after three years of historical data is loaded), with growth to 15 terabytes in the next two years due to the opening of new stores and new, innovative marketing strategies.

·         Fact tables are partitioned by month.

·         Data that is older than 7 fiscal years is to be archived (written to permanent media and stored offsite).

·         Analysis of the source system shows that fact data is very rarely received for days prior to six weeks ago.

·         85% of queries are expected to access data no older than 13 months (including the full month corresponding to this year’s month).

·         Data is loaded on a daily basis.

·         Data is not available from the source until midnight EST or later. Some business users are located in the UK, so service level agreements (SLAs) require that the data be available by 3:00 a.m. EST.

·         Due to the short batch window, the business users have agreed that the system can be unavailable on weekends (7:00 EST Friday to 3:00 a.m. EST Monday).

Management schedule

The following is a high-level plan for managing a database to the requirements listed above. This plan can almost certainly not be implemented verbatim due to variances in environments. Follow this plan without modification only if your business requirements match the scenario exactly. The scenario is intended to give a high-level view of how to put a management schedule into place.

Nightly processing

1.    If the database is in READ_ONLY mode, toggle it to READ_WRITE.

2.    Load updates to dimension tables.

3.    Create a new external table for the daily load and bulk insert data into the new table. Split the partition function, add indexes and check constraints to the external table, and switch into the partitioned table.

4.    Load updates prior to the previous day directly into the partitioned fact tables; implement a mechanism to track which partitions are updated either in the pipeline or through Service Broker (asynchronous).

5.    Update summary tables or cubes based on the partitions that are updated.

6.    Back up filegroups associated with the updated partitions.

7.    Change the database to READ_ONLY mode.
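Step 3 of the nightly processing, splitting the partition function and switching in the staged data, can be sketched as follows. The partition function, scheme, filegroup, and staging table names, and the partition number, are all hypothetical.

```sql
-- Designate the filegroup that will hold the new partition, then split
ALTER PARTITION SCHEME psDaily NEXT USED [FG_Current];
ALTER PARTITION FUNCTION pfDaily() SPLIT RANGE ('2006-10-31');

-- After bulk loading the external table, building matching indexes, and
-- adding a check constraint that matches the new boundary, switch it in
ALTER TABLE dbo.Tbl_Fact_Store_Sales_Load
SWITCH TO dbo.Tbl_Fact_Store_Sales PARTITION 34;
```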

Thursday night

1.    After database backups, perform a partial restore of the primary and read-write filegroups on a secondary server.

2.    Execute a query on the restored backup server to establish fragmentation levels of the read-write filegroups. Generate a script to reorganize or rebuild fragmented indexes during the weekend.

Each weekend, outside of scheduled ETL batch

1.    Merge daily partitions from the previous week into the current month partition.

2.    Run DBCC CHECKFILEGROUP on read-write filegroups[2].

3.    Reorganize or rebuild fragmented indexes using the script generated in Step 2 of Thursday night’s processing. This is performed directly on the data warehouse and can run at the same time as Step 2 above.
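Step 1 of the weekend processing merges boundaries out of the partition function; each MERGE removes one daily boundary. The function name and boundary values are hypothetical.

```sql
-- Collapse last week's daily partitions into the month partition
ALTER PARTITION FUNCTION pfDaily() MERGE RANGE ('2006-10-23');
ALTER PARTITION FUNCTION pfDaily() MERGE RANGE ('2006-10-24');
-- ... one MERGE per remaining daily boundary in the week
```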

End of month: weekend

1.    Move aged partitions per the business requirements (older than 13 months).

2.    Remove partitions, per business requirements, using the sliding window method. Copy filegroup backups to an archival media, as appropriate (this most likely should be done when the filegroup is made read-only).

3.    Finalize the partition three months back; for example, when starting December, finalize September:

a.    Validate that there is minimal index fragmentation from the previous Thursday fragmentation check.

b.    Update statistics if necessary (if not auto update, or if many changes have been made since last update statistics).

c.    Confirm that the filegroup is healthy by either re-running DBCC CHECKFILEGROUP or ensuring that no updates have been made to the partition since the last run of this command.

d.    Make the partition filegroup read-only.

e.    Make a final backup of the filegroup. Note that it is important to perform a backup of the filegroup after making it read-only.


4.    Prune dimensions (see Maintaining dimensions (pruning) earlier in this paper).
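Step 2 of the end-of-month processing, the sliding window removal, is typically a metadata-only switch-out followed by a merge. The table, function, and boundary names below are hypothetical; the archive table must match the fact table’s schema and reside on the same filegroup as the partition being switched out.

```sql
-- Switch the expired partition out to a standalone archive table
ALTER TABLE dbo.Tbl_Fact_Store_Sales
SWITCH PARTITION 1 TO dbo.Tbl_Fact_Store_Sales_Archive;

-- Remove the now-empty partition's boundary from the function
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('1999-10-01');
```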

It may be impossible to determine an exact cutoff point for when data will no longer be updated. If this is the case, you can implement ETL functionality to make a read-only filegroup back into a read-write filegroup when an update is identified that will reside in this filegroup. The management schedule can then check when the filegroup (partition) has been stable (no updates) for a number of days and go through the process of finalization again. It is important to perform all final maintenance prior to re-flagging the filegroup as read-only.

Filegroups cannot be changed from read-write to read-only, or vice versa, unless no other users are in the database. Since this property should be changed in a managed environment (ETL or other) this should not be an issue. If user connections remain even after batch has started, they can be terminated:


USE master;
GO

-- Change the database to single user mode - rollback any
-- transactions and disconnect all users in the database
ALTER DATABASE [DataWarehouse]
SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO

-- Change the filegroup to read-only (the filegroup name is illustrative)
ALTER DATABASE [DataWarehouse]
MODIFY FILEGROUP [FG_200608] READONLY;
GO

-- Change the database back to multiuser mode
ALTER DATABASE [DataWarehouse]
SET MULTI_USER;
GO

This is also required to make the database read-only. You will likely want to communicate to your business users that any connections that remain at the end of the day are likely to be terminated after a certain time.


Conclusion

There are several techniques that can be incorporated to minimize the time and resources required for database maintenance of very large data warehouse databases.

·         Isolate (partition) the data that actually changes and perform maintenance only on those partitions as it is necessary.

·         Instead of a broad-brush approach to index reorganization, evaluate index fragmentation and reorganize the most heavily used, most fragmented indexes first.

·         Occasionally verify managed objects, such as indexes and summary tables, to validate their continued use. Removing unused objects saves both management and hardware resources.

·         Build intelligence into the ETL processing to keep track of the changed partitions in order to minimize maintenance.

·         Offload management to another server, if possible.

·         When in doubt, test different data life cycle strategies, such as data inserts, to determine which method best suits your environment based on business requirements.

Several features of SQL Server 2005, such as table partitioning, dynamic management views, SSIS checkpoints, and online piecemeal restores greatly enhance your ability to monitor and manage data, while keeping it highly available to the applications and business users who rely on it.


References

Data warehousing concepts and dimensional modeling

·         “The Data Warehouse Lifecycle Toolkit” by Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite

·         “The Microsoft Data Warehouse Toolkit” by Joy Mundy and Warren Thornthwaite with Ralph Kimball

SQL Server 2005 table partitioning

·         SQL Server 2005 Books Online

·         “Project REAL: Data Lifecycle – Partitioning” by Erin Welker

·         “Partitioned Tables and Indexes in SQL Server 2005” by Kimberly Tripp

SQL Server database maintenance activities

·         SQL Server 2005 Books Online

·         “Statistics Used by the Query Optimizer in Microsoft SQL Server 2005” by Eric N. Hanson

SSIS checkpoints and transactions

·         SQL Server 2005 Books Online

·         “SQL Server 2005 Integration Services, Part 1: Lessons from Project REAL” by Richard Waymire, Len Wyatt, John H. Miller, Donald Farmer, and Jyoti Jacob

·         “Managing and Deploying SQL Server Integration Services” by Michael Otey and Denielle Otey


Project REAL


For more information:



[1] Partial recovery of a database in Simple recovery model restores the Primary and all read/write filegroups in the initial step.  If many of the filegroups are read-only, they can be restored separately.

[2] A future enhancement will allow this capability for tables that also include filegroups that have not been restored, that is, the check will be contained within the filegroup.
