
SSIS Best Practices

SSIS is a complete rewrite of its early predecessor, Data Transformation Services (DTS). Usability, parallelism, and performance have all improved markedly over the years, resulting in a SQL Server component aimed at high-volume, high-performance ETL applications. Here are several tuning steps that can further increase SSIS performance for your particular jobs.

Managing Parallel Execution

MaxConcurrentExecutables and EngineThreads are the two SSIS properties used to tune SSIS’ parallel execution capabilities. The first specifies how many executables (tasks) the SSIS runtime engine may run in parallel. Its default value is -1, which allows a number of concurrent executables equal to the number of logical processors plus two; on an eight-core server, for example, that means ten. If MaxConcurrentExecutables is less than the number of tasks that could run in parallel, only that many tasks run simultaneously, and as one completes, another starts.

EngineThreads limits the number of source and worker threads used to execute the data flow tasks in the data flow pipeline engine. Thus, its effect is “nested” within the overall data flow parallelism influenced by MaxConcurrentExecutables. Its default value is 10, which indicates up to 10 data source and 10 worker threads will be allocated.

To tune these values for an ETL job, turn on logging of the PipelineExecutionTrees event, which tells you how many execution trees each data flow task uses. Use this number to fine-tune the EngineThreads setting.

One caveat here is that these values cannot override precedence constraints between tasks, which may decrease the amount of execution parallelism available.

Use Logging Judiciously

SSIS logging is very helpful for tracking down bugs in package code and for monitoring packages already debugged and deployed. It lets you select only the events of interest and log them to any of several destinations.

The LoggingMode property on a package makes it easy to enable or disable logging without modifying the package itself. Use it wisely, as logging adds runtime overhead. Be aware also that the slowdown caused by excessive logging may mask race conditions in code execution, which is typically the opposite of what you want when debugging intermittent package problems.

Taking Pressure Off TempDB and the Transaction Log

During high-volume data transfers, changing the defaults of two OLE DB Destination settings helps avoid uncontrolled growth of the transaction log and tempdb. By default, Rows per batch is -1 and Maximum insert commit size is 2,147,483,647, which means all incoming rows are treated as a single batch and committed in one go at the end of the load. For high-volume transfers, set these to more reasonable values that match your execution environment.

Avoiding Defragmentation When Loading Large Data Volumes with Keys

When transferring anywhere from several million to hundreds of millions of rows from one table to another, you may experience a sharp slowdown in execution if the target table carries a clustered index or a large number of non-clustered indexes, because the inserts progressively fragment them. Such fragmentation may reach as high as 90 percent.
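
You can measure that fragmentation directly in SQL Server. The query below is a minimal sketch; dbo.FactSales is a hypothetical target table, and sys.dm_db_index_physical_stats does the actual measuring:

  -- Check fragmentation of every index on a (hypothetical) target table dbo.FactSales
  SELECT  i.name                          AS index_name,
          ips.index_type_desc,
          ips.avg_fragmentation_in_percent,
          ips.page_count
  FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.FactSales'),
                                         NULL, NULL, 'LIMITED') AS ips
  JOIN    sys.indexes AS i
          ON  i.object_id = ips.object_id
          AND i.index_id  = ips.index_id
  ORDER BY ips.avg_fragmentation_in_percent DESC;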

It is possible to periodically halt the transfer and rebuild the indexes, but a more efficient solution is to drop the indexes completely, turn the target table into a heap, and transfer all remaining data into that heap. Once the raw transfer is complete, rebuild the indexes. This can cut the overall execution time of the transfer several-fold.
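
As a rough sketch of the pattern, the surrounding steps can be run from Execute SQL Tasks before and after the data flow. The table and index names here are placeholders only:

  -- 1. Drop the indexes so the target becomes a heap
  --    (for a primary key, use ALTER TABLE ... DROP CONSTRAINT instead of DROP INDEX)
  DROP INDEX IX_FactSales_Customer ON dbo.FactSales;
  DROP INDEX IX_FactSales_Product  ON dbo.FactSales;
  DROP INDEX CIX_FactSales         ON dbo.FactSales;

  -- 2. Run the SSIS data flow to bulk-load the heap.

  -- 3. Rebuild the indexes once the raw transfer is complete.
  CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SalesKey);
  CREATE NONCLUSTERED INDEX IX_FactSales_Customer ON dbo.FactSales (CustomerKey);
  CREATE NONCLUSTERED INDEX IX_FactSales_Product  ON dbo.FactSales (ProductKey);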

Save Time with Checkpoints

The checkpoint feature lets a package restart from the point at which it failed instead of from the very beginning. When checkpoints are enabled on a package, execution status is logged to the file named in the CheckpointFileName property. Deleting this file causes the package to start again from the beginning.

Additionally, set CheckpointUsage and SaveCheckpoints on the package, and FailPackageOnFailure on the tasks you want restart capability for. Note that checkpoints operate at the control flow (task) level, not inside data flow tasks, which always restart from the beginning of their internal pipeline.

When performing long, high-volume ETL operations, checkpoints can save many hours. They are also a great help when debugging new packages; in that case, use them in conjunction with the Boolean DelayValidation property, which controls whether SSIS validates packages and components when the package loads or just before each component executes.

Conclusion

SQL Server Integration Services is built for high-volume ETL operations such as loading data warehouses. The focal point of SSIS development is the package, created within Business Intelligence Development Studio. Packages are used for various workflows, including data cleansing, standardization, integration, and administrative tasks.

At the package, task, and system levels there are a number of possible tweaks to tailor SSIS performance to the job at hand or to assist in debugging new or modified packages. Always bear in mind that these tips are not hard-and-fast rules, because SSIS performs its own dynamic optimizations at runtime. Thorough testing and monitoring are always a wise approach when applying these practices to your particular tasks.

This is just the tip of the iceberg for optimizing your SSIS, and there are many more tips and tricks to make your data systems clean and efficient. If you need help or have questions about your current system’s performance, our data analytics specialists will be able to guide you. Give us a call, we’re always happy to help.

How SSIS Fits into an ETL Process

Microsoft SQL Server Integration Services (SSIS) is a common solution for many organizations’ data management requirements with regard to their Business Intelligence (BI) systems. Specifically, SSIS and SQL Server together streamline Extract, Transform, and Load (ETL) operations on data warehouses. The process is essential to providing efficient, accessible data flows to BI applications.

Using SSIS “As-Is” to Design Your ETL Process

Guidance about SSIS’ ETL capabilities typically centers on using control and data flows to extract data and apply multiple transformations before storing the processed results in the original or an alternative database. The approach’s greatest strength is that it runs as a one-step process with no need for intermediate storage. That same aspect, however, leads to certain drawbacks, discussed below, that motivate an alternative development approach based on staging tables.

Standard SSIS Extract/Transform

Extraction is accomplished using either a flat file or an OLE DB connection. Transformation draws on SSIS’ large selection of built-in operations, such as sorting, merging, unions, joins, and data type conversions. Transformations are defined with graphical control and data flows that may include multicasting data to parallel paths; these paths may have separate endpoints or may recombine via a union operation.

The Use of Staging Tables

A staging-table ETL architecture provides several benefits over standard SSIS ETL development, whose main deficiency is that it slows down development and debugging.

Standard SSIS Drawbacks

Because standard development utilizes the built-in one-step process, each data flow test run extracts a batch of data from the original source, which consumes additional time. This problem is compounded if your tests are running on a production database because extra care must be taken not to corrupt its state.

An additional problem is that SSIS data flows are specified at a very low level of abstraction, which makes their development and debugging tedious, much as writing assembly code slows production compared to using C# or Python.

Separating Extract and Transform

Staging separates extraction and transformation into distinct steps in the top-level control flow. Rather than working against the entire data source on every run, you pre-fetch a copy of the data into a staging table with minimal transformation, if any. The only filtering needed is whatever prevents performance problems such as an overly large staging table.

In SSIS, an easy way to create the staging table is to use the destination editor’s option to create a new table, which produces a table with all the correct columns. Data transformations are then implemented as database views in lieu of SSIS transformations; these can be written as SQL queries using CREATE VIEW with SELECT statements for the fields of interest.
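
As a minimal sketch, where the stg.Customer staging table and its columns are purely hypothetical, such a view might look like this:

  -- Transformation expressed as a view over a staging table
  CREATE VIEW dbo.vw_DimCustomer
  AS
  SELECT  c.CustomerID                 AS CustomerKey,   -- rename IDs to Keys for BI tools
          LTRIM(RTRIM(c.CustomerName)) AS CustomerName,  -- light cleansing
          UPPER(c.CountryCode)         AS CountryCode,
          CAST(c.CreatedOn AS date)    AS CreatedDate    -- narrow the data type
  FROM    stg.Customer AS c
  WHERE   c.IsDeleted = 0;                               -- simple filter instead of a Conditional Split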

Tips When Using Staging

While building a staging query, it is a good practice to change field names into more human-readable form as necessary and to rename IDs as Keys to accommodate BI application usage. The multiple staging tables created in this fashion are later joined into dimension and fact tables.
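
Continuing the hypothetical example above, staging data and dimension views can be joined into a fact view in the same style:

  -- Staging data joined with dimension views to form a fact view
  CREATE VIEW dbo.vw_FactOrders
  AS
  SELECT  o.OrderID                AS OrderKey,
          c.CustomerKey,
          p.ProductKey,
          o.OrderDate,
          o.Quantity,
          o.Quantity * o.UnitPrice AS LineAmount
  FROM    stg.Orders          AS o
  JOIN    dbo.vw_DimCustomer  AS c ON c.CustomerKey = o.CustomerID
  JOIN    dbo.vw_DimProduct   AS p ON p.ProductKey  = o.ProductID;  -- vw_DimProduct defined the same way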

Note that using database views over larger data sets may create performance bottlenecks in production, even though views are much more efficient at the development stage. You can get around this by later converting the views into stored procedures that break the view logic into smaller pieces and materialize the results into tables during the ETL transformation step.
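
One hypothetical way to make that conversion for the vw_FactOrders example is a procedure that materializes the heavy join into a work table before swapping the result into the fact table:

  CREATE PROCEDURE dbo.usp_LoadFactOrders
  AS
  BEGIN
      SET NOCOUNT ON;

      -- Step 1: materialize the heavy join once, into a work table
      IF OBJECT_ID(N'dbo.FactOrders_Work') IS NOT NULL
          DROP TABLE dbo.FactOrders_Work;

      SELECT  o.OrderID                AS OrderKey,
              c.CustomerKey,
              p.ProductKey,
              o.OrderDate,
              o.Quantity,
              o.Quantity * o.UnitPrice AS LineAmount
      INTO    dbo.FactOrders_Work
      FROM    stg.Orders          AS o
      JOIN    dbo.vw_DimCustomer  AS c ON c.CustomerKey = o.CustomerID
      JOIN    dbo.vw_DimProduct   AS p ON p.ProductKey  = o.ProductID;

      -- Step 2: reload the fact table from the materialized result
      TRUNCATE TABLE dbo.FactOrders;

      INSERT INTO dbo.FactOrders (OrderKey, CustomerKey, ProductKey, OrderDate, Quantity, LineAmount)
      SELECT OrderKey, CustomerKey, ProductKey, OrderDate, Quantity, LineAmount
      FROM   dbo.FactOrders_Work;
  END;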

Memory Performance Tips for Standard and Staging Approaches

Regardless of which architectural approach you use to develop your ETL process, a key point to bear in mind is that transformation steps should be small enough to run entirely in memory. This practice avoids performance-crushing disk swaps. Keep the following in mind:

  • If SQL Server and SSIS are running on the same system, SQL Server typically gains priority, which means SSIS becomes the bottleneck
  • SELECT only the columns actually needed (see the sketch after this list)
  • Avoid a commit size of zero when inserting into a B-tree (an indexed table); otherwise all incoming rows must be sorted at once and may spill to disk
  • Use narrow data types to reduce memory allocation, but avoid excessive data casts
  • Move sorts to the end of the transformation chain
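
A minimal sketch of a pruned source query, using illustrative table and column names, that applies the column and data type advice above:

  -- Source query trimmed to only the columns the data flow needs, with narrow types
  SELECT  OrderID,
          CAST(OrderDate AS date)         AS OrderDate,   -- date instead of datetime
          CAST(Quantity  AS smallint)     AS Quantity,    -- smallint where the range allows it
          CAST(UnitPrice AS decimal(9,2)) AS UnitPrice
  FROM    dbo.Orders
  WHERE   OrderDate >= DATEADD(day, -1, CAST(GETDATE() AS date));  -- fetch only the rows this run needs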

To determine whether your implementation avoids touching the disk during transformations, keep an eye on the SSIS performance counter Buffers spooled. If the transformation stays in memory but performance still lags, check the obvious first: whether the data source system can keep up.

Be mindful of overall resource utilization as well, including CPU usage, I/O bandwidth, and network usage, especially if you employ the staging approach, since it requires more disk space.

Conclusion

SSIS ETL is an essential pillar in supporting BI information systems. It provides powerful data transformation capabilities to match raw, large data stores to the narrower, optimized data needs of BI applications that are so essential to modern enterprise IT. The architectural approaches to SSIS ETL process development presented here offer trade-offs between performance and development/debugging effort that may improve the efficiency of your organization’s SSIS utilization.

Contact us today if you have any questions about SSIS and your BI system. We’re always happy to help.