Thu 29 Oct 2009
SSIS – Why is my data flow so slow with an OLE DB Command component?
Posted by Phil Brammer under SSIS , SSIS Data flow[3] Comments
“Why does my data flow take so long to execute when I use an OLE DB Command component?
“I need to update values in another table for every row in my data flow.”
These are common questions I see when dealing with “slow” performance of an SSIS package. Sometimes though, the developer of the package does not even understand which component is taking so long to process say, thousands of records in their data flow, so it is understandable why this component may proliferate in many shops. (Sometimes it is needed, of course, but you need to always ask the question – do I need the updates to happen RIGHT NOW?)
If you have a need to update data in another table, while still pushing data from a source to a destination, the OLE DB Command is often used because of the thought that it is part of the data flow, so why not use it.
Read on for more….
The trouble with the OLE DB Command component is that it executes whatever command you have provided for each and every row that passes through it. The data flow is designed to operate on buffers of data, which contain by default around 10,000 rows. If you are simply moving data from one table to another, this concept works really well and quite fast too, I might add.
However, if you need to update data, should you use the OLE DB Command component in that data flow? Are there other options? The answers of course are “maybe” and “yes.”
If you simply need to update data in a table and you don’t care when it happens, then I would recommend not using an OLE DB Command component at all in your data flow. Instead, I recommend you build a staging table on your destination database server, and populate that table with your “update” data. Then, in the control flow, after the data flow successfully executes, you can issue an Execute SQL Task to perform a batch update.
This recommendation is built on the fact that using an OLE DB Command component ensures that your update operations are done row-by-row, rather than in a batch operation.
Why? How much faster is the batch operation compared to the OLE DB Command operation? Well, the answer depends a bit, but in my simple test below, the time to update 5,000 records took 15 minutes using the OLE DB Command component and 4 seconds (yup – SECONDS) to populate a staging table and then perform a batch update.
Here is the setup I used:
CREATE TABLE numbers (
number int not null
)
GO
insert into numbers (number)
select number from
(
select row_number() over (order by s1.name) as number
from sysobjects s1
cross join sysobjects s2
cross join sysobjects s3
cross join sysobjects s4
cross join sysobjects s5
) as numbers
where number between 1 and 1000000;
GO
select number
from numbers
where number < 5003;
GO
/*
number
1
2
3
4
5
...
4998
4999
5000
5001
5002
*/
Here is a picture of the first SSIS package.
I used the following SQL for the OLE DB Source component:
SELECT number
FROM numbers
WHERE number < 5001
ORDER BY number DESC
I used the following expression in the Derived Column component to create a new column, new_number:
number + 1
I used the following SQL in the OLE DB Command component:
update numbers
set number = ?
where number = ?
In the OLE DB Command "Column Mappings" tab, I mapped the incoming column, new_number, to the Param_0 parameter, and the incoming column, number, to the parameter, Param_1.
I then executed the package. Time to execute: 15 minutes, 43 seconds.
Next I tried the batch update approach. Use the same setup as above, except replace the OLE DB Command component with an OLE DB Destination, and push the two data flow columns to a staging table.
Here is the staging table DDL:
CREATE TABLE staging_numbers (
orig_number int not null,
new_number int not null
)
Map the data flow column, number, to the orig_number field in the staging table. Map the data flow column, new_number, to the new_number field in the staging table, as shown below:

Then, in the control flow, hook up an Execute SQL Task to the new Data Flow task, so that the Execute SQL Task executes after the Data Flow, as shown below:

The Execute SQL Task SQL is below:
update numbers
set number = staging_numbers.new_number
from numbers, staging_numbers
where numbers.number = staging_numbers.orig_number
Upon executing the package, the execution time shrank to under five seconds:

So, the moral of this story is that whenever you think you need to use an OLE DB Command component, ask yourself, "Do I *really* need to use this component, or should I use a different approach?"



October 31st, 2009 at 8:07 pm
If anybody interested in high performance Upsert task then check TaskFactory (A collection of SSIS Tasks and Components)
They have really good set of tasks including Upsert Destination
http://www.pragmaticworks.com/Products/Business-Intelligence/TaskFactory/
November 2nd, 2009 at 11:44 am
That is a totally different issue. I couldn’t believe that OLE was that slow except that you are doing a sequential read of millions of rows 5000 times for the update. What is the timing if you put an index on number? Your join makes one pass through the data.
November 2nd, 2009 at 2:19 pm
I’m only reading in 5,000 rows at the source once. From there for each row an UPDATE is issued (when using an OLE DB Command to perform the update).
Even with an index so that a seek occurs on the update, there is still a single update transaction that occurs for each row in the data flow. (In this case, that is 5000 updates)
The point of this post is to illustrate the use of batch updates rather than single-unit-of-work updates that are very costly to implement. Where possible, always choose a batch update rather than row-by-row.