I've been employing SQL Server Integration Services for some years now and find it to be quite an improvement over its predecessor Data Transformation Services. Apart from being an ETL product, it also provides different built-in tasks to manage a SQL Server instance. SSIS is not an enhancement to DTS but rather a new product which has been written from scratch to provide high performance and parallelism and as a result of this it overcomes several limitations of DTS. Not to mention the fact that versions of each package are easilty stored and maintained using TFS, and so plays well with other Visual Studio design elements by your support team.
The best part of SSIS is that it is a component of SQL server. It comes free
with the SQL Server installation and you don't need a separate license for it.
Because of this, along with hardcore BI developers, database developers and
database administrators are also using it to transfer and transform data.
Here are some tips when employing SSIS that I've come across over the past couple years...
Avoid asynchronous transformation wherever possible
Before I talk about different kinds of transformations and its impact on performance, let me briefly talk about of how SSIS works internally. The SSIS runtime engine executes the package. It executes every task other than data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands over the execution of the data flow task to data flow pipeline engine. The data flow pipeline engine breaks the execution of a data flow task into one more execution tree(s) and may execute two or more execution trees in parallel to achieve high performance. Now if you are wondering what an execution tree is, then here is the answer.
An execution tree, as name implies, has a similar structure as a tree. It starts at a source or an asynchronous transformation and ends at destination or first asynchronous transformation in the hierarchy. Each execution tree has a set of allocated buffer and scope of these buffers are associated the execution tree. Also each execution tree is allocated an OS thread (worker-thread) and unlike buffers this thread may be shared by any other execution tree, in other words an OS thread might execute one or more execution trees. In SSIS 2008, the process of breaking data flow task into an execution tree has been enhanced to create an execution path and sub-path so that your package can take advantage of high-end multi-processor systems.
Synchronous transformations get a record, process it and pass it to the other transformation or destination in the sequence. The processing of a record is not dependent on the other incoming rows. Because synchronous transformations output the same number of records as the input, it does not require new buffers (processing is the done in the same incoming buffers i.e. in the same allocated memory) to be created and because of this it is normally faster. For example, in the Derived column transformation, it adds a new column in the each incoming row, but it does not add any additional records to the output.
Unlike synchronous transformations, the asynchronous transformation might output a different number of records than the input requiring new buffers to be created. Because an output is dependent on one or more records it is called a blocking transformation. Depending on the types of blocking it can either be partially blocking or a fully blocking transformation. For example, the Sort Transformation is a fully blocking transformation as it requires all the incoming rows to arrive before processing.
As discussed above, the asynchronous transformation requires addition buffers for its output and does not utilize the incoming input buffers. It also waits for all incoming rows to arrive for processing, that's the reason the asynchronous transformation performs slower and must be avoided wherever possible. For example, instead of using Sort Transformation you can get sorted results from the source itself by using ORDER BY clause.
DefaultBufferMaxSize and DefaultBufferMaxRows
The number of buffer created is dependent on how many rows fit into a buffer and how many rows fit into a buffer dependent on few other factors. The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns from the incoming records. The second consideration is the DefaultBufferMaxSize property of the data flow task. This property specifies the default maximum size of a buffer. The default value is 10 MB and its upper and lower boundaries are constrained by two internal properties of SSIS which are MaxBufferSize (100MB) and MinBufferSize (64 KB). It means the size of a buffer can be as small as 64 KB and as large as 100 MB. The third factor is, DefaultBufferMaxRows which is again a property of data flow task which specifies the default number of rows in a buffer. Its default value is 10000.
Although SSIS does a good job in tuning for these properties in order to create a optimum number of buffers, if the size exceeds the DefaultBufferMaxSize then it reduces the rows in the buffer. For better buffer performance you can do two things. First you can remove unwanted columns from the source and set data type in each column appropriately, especially if your source is flat file. This will enable you to accommodate as many rows as possible in the buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which could improve performance. Beware if you change the values of these properties to a point where page spooling begins, it adversely impacts performance.
BufferTempStoragePath and BLOBTempStoragePath
If there is a lack of memory resource i.e. Windows triggers a low memory notification event, memory overflow or memory pressure, the incoming records, except BLOBs, will be spooled to the file system by SSIS. The file system location is set by the BufferTempStoragePath of the data flow task. By default its value is blank, in that case the location will be based on the of value of the TEMP/TMP system variable.
Likewise SSIS may choose to write the BLOB data to the file system before sending it to the destination because BLOB data is typically large and cannot be stored in the SSIS buffer. Once again the file system location for the spooling BLOB data is set by the BLOBTempStoragePath property of the data flow task. By default its value is blank. In that case the location will be the value of the TEMP/TMP system variable. As I said, if you don't specify the values for these properties, the values of TEMP and TMP system variables will be considered as locations for spooling. What is important is to change the default values of the BufferTempStoragePath/BLOBTempStoragePath properties and specify locations where the user executing the package (if the package is being executed by SQL Server Job, then SQL Server Agent service account) has access to these locations. Preferably both locations should refer to separate fast drives (with separate spindles) to maximize I/O throughput and improve performance.
How DelayValidation property can help you
SSIS uses validation to determine if the package could fail at runtime. SSIS uses two types of validation. First is package validation (early validation) which validates the package and all its components before starting the execution of the package. Second SSIS uses component validation (late validation), which validates the components of the package once started.
Let's consider a scenario where the first component of the package creates an object i.e. a temporary table, which is being referenced by the second component of the package. During package validation, the first component has not yet executed, so no object has been created causing a package validation failure when validating the second component. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario?
To help you in this scenario, every component has a DelayValidation (default=FALSE) property. If you set it to TRUE, early validation will be skipped and the component will be validated only at the component level (late validation) which is during package execution.