What are the SSIS Best Practices?
SSIS represents a complete rewrite of its early predecessor Data Transformation Services. Usability, parallelism and performance have all been vastly improved over the years resulting in an SQL Server component aimed at high-volume, high-performance ETL applications. Here are several additional tuning steps that can further increase SSIS performance for your particular jobs.
Managing Parallel Execution
MaxConcurrentExecutables and EngineThreads are two SSIS properties used to optimize SSIS’ parallel execution capabilities. The first specifies the number of data flow tasks that are run in parallel by the SSIS runtime engine. Its default value is -1, which allocates an amount of threads equal to the number of processors/cores plus 2. If the value of MaxConcurrentExecutables is less than the number of parallel tasks available, only those number of tasks run simultaneously. When one completes, another is added.
EngineThreads limits the number of source and worker threads used to execute the data flow tasks in the data flow pipeline engine. Thus, its effect is “nested” within the overall data flow parallelism influenced by MaxConcurrentExecutables. Its default value is 10, which indicates up to 10 data source and 10 worker threads will be allocated.
In order to tune these values for an ETL job, turn on PipelineExecutionTrees event logging, which tells you how many execution trees are used per data flow task. Use this number to fine-tune the setting of EngineThreads.
One caveat here is that these values cannot override precedence constraints between tasks, which may decrease the amount of execution parallelism available.
Use Logging Judiciously
SSIS logging is very helpful in tracking down bugs in package code or for monitoring packages already debugged and deployed. It allows exclusive selection of events of interest and permits logging them to any of several destinations.
The LoggingMode property within a package enables and disables logging easily without package modification. Use it wisely as logging adds additional SSIS runtime overhead. Be aware also that a decrease in performance because of excessive logging may mask race conditions in code execution, which is typically the opposite effect that is desired when debugging intermittent package problems.
Taking Pressure Off TempDB and the Transaction Log
During high volume data transfers, changing the defaults for two SSIS settings will avoid uncontrolled growth of the transaction log and tempdb memory usage. By default, Rows per batch and Maximum insert commit size are set to maximum values of -1 and 2 GB, respectively. The first value permits all data rows to be processed as a single batch, whereas the second specifies whether all rows in the data transfer are to be committed in one go. Thus, for high volume transfers, set these to more reasonable values to match your execution environment.
Avoiding Defragmentation When Loading Large Data Volumes with Keys
When transferring several to hundreds of millions of rows from one table to another, you may experience a sharp slowdown in execution if the data contains a large number of clustered or non-clustered indices. Such fragmentation may reach as high as 90 percent.
It is possible to periodically halt the transfer and rebuild the indices, but a more efficient solution is to drop the indices completely, turn the target table into a heap and transfer all remaining data to this heap. Once the raw transfer is complete, rebuild the indices. This will decrease the overall execution time of the entire transfer several fold.
Save Time with Checkpoints
The Checkpoint features lets a package restart from the last point at which it failed instead of from the very beginning. When a Checkpoint is enabled on a package, execution status is logged to the file named in the CheckpointFileName property. Deleting this file will cause the package to start from the beginning.
Additionally, set CheckpointUsage, SaveCheckpoints on the package and FailPackageOnFailure for data flow tasks you want to have restart capability. Note that Checkpoint only works at the package level and not inside data flow tasks, which will start at the beginning of their internal sequence.
When performing long, high-volume ETL operations, Checkpoint can save many hours. It also is a great assist in debugging new packages. In the latter case, use it in conjunction with the Boolean DelayValidation property, which controls SSIS early and late validation of packages and components prior to execution.
SQL Server Integration Services’ creates data warehouses for the purpose of high-volume data ETL operations. The focal point of SSIS development is an SSIS package, created within Business Intelligence Development Studio. Packages are used for various workflows including data cleansing, standardization, integration and administrative tasks.
At the package, task and system levels, there are a number of tweaks possible to optimize SSIS performance to the job at hand or to assist the debugging of new or modified packages. Always bear in mind, that these tips are not hard and fast rules due to internal SSIS dynamic optimizations that happen at runtime. Thorough testing and monitoring is always a wise approach when applying these practices to your particular tasks.
This is just the tip of the iceberg for optimizing your SSIS, and there are many more tips and tricks to make your data systems clean and efficient. If you need help or have questions about your current system’s performance, our data analytics specialists will be able to guide you. Give us a call, we’re always happy to help.
Contact us to learn more.