Mar 09, 2007
 

It is one of the more common questions over on the SSIS forums: how can I check if a record exists, and if so, how can I (quickly) check whether it has changed? Jamie Thomson has a blog post that outlines this, and I've made it a sticky post on the forums. However, if you need to check many, if not all, of the columns in a table for changes, Jamie's blogged method can be a bit laborious. This is where the Konesans Checksum transformation comes in. Read on…

The idea behind using a checksum is that you calculate it when you first see the record and store that checksum along with the record in the destination table. Later, we compare just the checksum values to see whether the record has changed, instead of comparing every field individually. This concept is covered in Joy Mundy and Warren Thornthwaite's book, The Microsoft Data Warehouse Toolkit, chapter 6.
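
To make the concept concrete before we build the package, here is a minimal T-SQL sketch of the compare-and-act logic, using the table created in the next step. It uses T-SQL's CHECKSUM() purely for illustration; the package below computes its hash with the Konesans transformation, so the actual values will differ, and whether the key column is included in the hash is a design choice (here only the non-key columns are hashed).

-- Conceptual sketch only: the package uses the Konesans Checksum transformation,
-- not T-SQL, so these values will not match the ones shown later in the post.
DECLARE @column1 varchar(50), @column2 varchar(50), @column3 int, @incoming_hash int;
SET @column1 = 'KEY1';
SET @column2 = 'XXXX';
SET @column3 = 333;
SET @incoming_hash = CHECKSUM(@column2, @column3);

IF NOT EXISTS (SELECT 1 FROM dbo.forums_checksum WHERE column1 = @column1)
    -- New key: insert the row along with its checksum.
    INSERT INTO dbo.forums_checksum (column1, column2, column3, [hash])
    VALUES (@column1, @column2, @column3, @incoming_hash);
ELSE IF EXISTS (SELECT 1 FROM dbo.forums_checksum
                WHERE column1 = @column1 AND [hash] <> @incoming_hash)
    -- Existing key, different checksum: the row changed, so update it.
    UPDATE dbo.forums_checksum
    SET column2 = @column2, column3 = @column3, [hash] = @incoming_hash
    WHERE column1 = @column1;
-- Existing key, same checksum: nothing to do.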

To set this up, first download the Checksum Transformation from SQLIS.com. Install that and then proceed below.

Next, let’s set the stage by creating our table.

CREATE TABLE [dbo].[forums_checksum](
    [column1] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [column2] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [column3] [int] NULL,
    [hash] [int] NULL
) ON [PRIMARY]
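
The original script has no index; as a suggestion (not part of the script above), a unique index on the key column helps both the lookup and the update later in the package:

-- Optional: index the key column that the lookup and the update both use.
CREATE UNIQUE CLUSTERED INDEX IX_forums_checksum_column1
    ON dbo.forums_checksum (column1);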

Let’s use a flat file as our source that looks something like this (and I’ll only work with two rows for simplicity’s sake):

KEY1;XXXX;123
KEY2;ZZZZ;456

Now, inside SSIS, set up your flat file connection and set the delimiter correctly (semi-colon in this case). I'll assume that you already know how to do this. Also, I should point out that "column1" is the "key" column of the table for this exercise.

Here's the package assembled in its entirety, after executing it for the first time. The two records above are NEW and hence go down the insert path.


Checksum Exercise

Selecting from the [forums_checksum] table (SELECT * FROM forums_checksum) shows the stored checksums:

KEY1 XXXX 123 -1338461000
KEY2 ZZZZ 456 -546099347

Now, change the column3 value to 333 for the "KEY1" record and rerun the package:

Checksum Exercise 02

Note that this time we did not perform an insert at all. Instead, both records went down the "records that exist" path coming out of the lookup. Also note that we updated only one (1) record; the other record went into the row counter for illustration purposes. You could simply leave that output of the conditional split transformation unattached, but I always like to capture all rows if I can.

Selecting from the table again will show our update:

KEY1 XXXX 333 995021855
KEY2 ZZZZ 456 -546099347

And that's about it. Pretty simple, really. The only thing to watch out for, if you have millions of rows, is the memory the lookup needs to cache all of the key fields (plus the checksum field).
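
If that cache does become a concern, one option (a suggestion; the screenshots below show how I actually configured the lookup) is to point the lookup at a query that returns only the key and checksum columns, so that is all it has to hold in memory:

-- Reference query for the lookup: only the columns the comparison needs.
SELECT column1, [hash]
FROM dbo.forums_checksum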

Below you will see how I’ve configured each of the main components of my package:

Checksum transformation:

Checksum Exercise 03

 

Lookup Transformation:

Checksum Exercise 04

Second screen for the lookup:

Checksum Exercise 05

Conditional Split:

Checksum Exercise 06

OLE DB Command:

Checksum Exercise 07

OLE DB Command parameter mappings:

Checksum Exercise 08
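
For anyone who cannot make out the screenshots, the SqlCommand in the OLE DB Command is a parameterized UPDATE along these lines (a sketch; the screenshots above are authoritative, and each ? is bound positionally on the parameter mappings screen):

-- SqlCommand for the OLE DB Command. The ? placeholders are mapped, in order,
-- to the pipeline columns on the parameter mappings screen.
UPDATE dbo.forums_checksum
SET column2 = ?,
    column3 = ?,
    [hash] = ?
WHERE column1 = ?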

  36 Responses to “SSIS – Using a checksum to determine if a row has changed”

  1. [...] a Checksum in Integration Services March 12th, 2007 — Charlie Maitland Phil Brammer has an excellent post on how to use a checksum to identify if a row has been changed and how [...]

  2. Hey, I have a file which has the new values.
    I need to update the table rows if they already exist, or insert them if they don't, using SSIS. Can anyone help?

  3. It is all good, but…
    The problem with CHECKSUM is that it does not guarantee to detect a change.
    That means that even if some values change, the CHECKSUM may not change.

    It is well documented in BOL:
    ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/e26d3339-845c-49c2-9d89-243376874c13.htm
    For this reason, Microsoft does NOT recommend using CHECKSUM for change detection purposes.

    Use HASHBYTES instead.
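
    For example (a minimal sketch; note that HASHBYTES returns varbinary, so the stored hash column would need to be varbinary(16) for MD5 rather than int, and the delimiter and CAST length here are only illustrative):

    -- Illustrative HASHBYTES call over the example table's non-key columns.
    -- ISNULL and a delimiter keep NULLs and shifted values from colliding.
    SELECT column1,
           HASHBYTES('MD5',
                     ISNULL(column2, '') + '|' +
                     ISNULL(CAST(column3 AS varchar(12)), '')) AS row_hash
    FROM dbo.forums_checksum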

  4. Hashbytes creates huge keys and also makes the computation difficult. It's a trade-off I think many people are willing to take.

    If you look at the likelihood of a data change not causing the hash code to change, it is so slim that for most purposes it will be fine.

  5. First, the BOL mentions only a SMALL chance that the checksum will return the same value. However, when using a key combined with the checksum value, I feel that this is a non-issue. I also believe that for a given set of columns, if one value changes, a new checksum will be generated.

    The idea here is that you are comparing a checksum value of the incoming data with that of the stored data. For a given key, this should work perfectly.

    Until someone (Konesans, perhaps???) builds an improved checksum component (using MD5 would be my preference), this is a very good solution when you need to compare MANY columns and the development timeline is tight.

  6. Good article, but I face this problem:
    when I execute the package, the session's insert locks with the update.

    I have 2 tables:
    the first is almost 2 million rows,
    the second 59 million.

    Thanks in advance

  7. Hosam,
    Please head over to the SSIS forums and we can try to help you over there.

    http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&SiteID=1

  8. I am running into collisions :(

    I have a big (many columns, type 2 SCD) dimension table, so I am using the checksum to determine whether the record has changed or not. But a change in the record can be as small as a single character's difference, and this is not always detected by the checksum transformation :(

    Any suggestions on how to solve this?

  9. Hi Phil, help me. I have a table with many rows which is in the US, and I am fetching the data from it to my server in India and putting it into a staging table.
    But when putting it into the staging table, I try to compare the rows, and the whole package takes more than 1 hr to complete one compare-and-update pass with more than 100,000 rows.
    Please show me the solution or some hint.
    Thanks,
    Jwalant Natvarlal Soneji

  10. I came across a problem.

    Date #1 – creates a checksum from all changeable fields.
    Date #2 – a field change happens and a new checksum is created.
    Date #3 – all the fields are changed back to their values as of date#1 which creates the same checksum as date#1.

    What happens is I lose the change on Date #3 because this checksum already exists, so the record is not inserted into the table and I lose the chronological history, regardless of how many date changes happen afterwards. I have still lost Date #3. Any suggestions? Thanks

  11. Also this might further clarify what I typed in earlier.

    Date #1 is an inserted record.
    Date #2 field change, I “End Date” the Date #1 record and insert date#2 as a new record.
    Date #3 fields change back to those values in date#1. I then “End Date” the date#2 record and insert a new date#3 record.

    The “End Date” is a date field at the end of each record denoting that the record is no longer the current active record.

    One other note: after each record is inserted, it creates a new key field value for the record, but the fields I'm using for the key are constant, with the other fields changing their values.

    I hope this clarifies my problem I submitted earlier.

  12. Sorry, one last time: Date #3 does not get inserted and Date #2 does not get end-dated because the checksum matches Date #1's checksum.

  13. James,
    Your checksum comparison should only be looking at the current row, not all rows for the given key.

  14. hi,

    I'm fairly new to SSIS and was able to follow the sample given in this article. When I changed the table data type for column2 to char(50), the "lookup check for existing" method always fails and therefore inserts the same records again.

    My question is: does the checksum not work with the char data type?

    thanks,
    benny

  15. Checksums are great for checking whether your data is still valid, hence the name. Especially on big data volumes, a small chance that you miss a change effectively implies that you will miss some changes eventually. Likewise, the impact of missing a change is probably lower when you use your data warehouse only for data mining or reporting on aggregates. But still, you might miss a change that would just flip the dime in favor of another scenario. If you also use the online data store to report at the detail level and to determine whether personal targets are met, you cannot afford to miss a change. A fixed-length checksum will never suffice to detect every change. The only way to achieve this is by storing a compressed version of all column values in a single column, using a custom lossless compression algorithm and a column separator that is not part of the character set used for the column values. You can then compare the compressed version of the row without fetching the entire row and detect any change, no matter how small or unlikely. Maybe a custom component can perform the compression?

  16. Dony,
    Not true. An MD5 hash (fixed length) would work accurately and would not miss a change, in my opinion. The order of the columns would have to be strictly adhered to, though, as it's presented to the MD5 hash algorithm. The checksum component doesn't use MD5, but one could write their own easily enough. Even better, of course, would be SHA-1 or greater.

  17. Be aware that the checksum transformation referenced in this article does not work in the 64-bit runtime. It will run, but will always return the same value. Any idea why that is?

  18. A follow-up to my previous posting: the original checksum algorithm appears to work in the 64-bit runtime, while the new algorithm always returns the same value. The algorithm can be selected through the designer property ChecksumAlgorithm. See also http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2706876&SiteID=1

    Any comments on the new algorithm vs. the old?

  19. There is now an alternative to the Konesans hash component.

    See http://ssismhash.codeplex.com/

    This provides the MD5 and SHA1 hashes that are mentioned by Phil Brammer earlier.

  20. Hi,

    I’m a Consultant at Numius and just tried the code from Keith Martin. http://ssismhash.codeplex.com/

    I can recommend it to anyone, very good code and does exactly what it should do. Good work!

    kind regards,

    Cedric De Vroey
    Consultant

    Numius NV – Better performance starts with insight

  21. When I try to configure the OLE DB command with the ?, I get an error. Can you please provide further clarification as to how this should be configured?

  22. I sincerely appreciate the author. This is exactly the problem I wanted to solve, and the author has helped me.

  23. Will the generated hash change if there is any change in the column sequence?

  24. How about showing a step-by-step example with screenshots? God will bless you.

    Thanks.

  25. Posts like this brighten up my day. Thanks for taking the time.

  26. I am using a T-SQL MERGE to do the delta load.
    Will the lookup transform be better than the T-SQL MERGE?

  27. Will the checksum transform be better than a T-SQL MERGE?

  28. Do not use any checksum transformation built by some third party. Use the CHECKSUM() function provided by Microsoft in SQL Server. In the OLE DB source, instead of selecting a table or view, choose the SQL query option and write a query that generates the checksum in the source itself; then LEFT OUTER JOIN with the target using a merge join, and take only the rows where the target side is NULL using a conditional split. This gives you the changed records using a set-based operation, which is a very fast processing technique compared to a row-by-row lookup operation.
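
    For example, using the table from the post, the OLE DB source query might look something like this (a rough sketch; the staging table name is made up):

    -- Source query with the checksum computed set-based, ready for the merge
    -- join against the target and the conditional split on the NULL rows.
    SELECT column1,
           column2,
           column3,
           CHECKSUM(column2, column3) AS src_hash
    FROM dbo.staging_source   -- hypothetical source/staging table
    ORDER BY column1          -- the merge join needs sorted input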

  29. You need to be very careful using a Checksum. Things like this will yield the exact same checksum even though they are really changes:

    FirstName LastName ZipCode
    Bob Roger 12345
    Roger Bob 12345

    The Checksum method instead of a MERGE is beneficial in data warehousing projects, especially with SCD’s.

  30. The checksum transform is installed but it is not showing in the data flow transformations. Please let me know?

  31. Reply to E Andrews …..

    /*
    E Andrews says:
    September 1, 2011 at 9:23 pm
    You need to be very careful using a Checksum. Things like this will yield the exact same checksum even though they are really changes:

    FirstName LastName ZipCode
    Bob Roger 12345
    Roger Bob 12345

    The Checksum method instead of a MERGE is beneficial in data warehousing projects, especially with SCD’s.
    */

    DECLARE @t TABLE
    (fn VARCHAR(20)
    ,ln VARCHAR(20)
    ,zip INT)

    INSERT INTO @t
    (fn, ln, zip)
    VALUES ('Bob',   -- fn - varchar(20)
    'Roger',         -- ln - varchar(20)
    12345            -- zip - int
    )

    INSERT INTO @t
    (fn, ln, zip)
    VALUES ('Roger', -- fn - varchar(20)
    'Bob',           -- ln - varchar(20)
    12345            -- zip - int
    )

    SELECT *, CHECKSUM(fn, ln, zip)
    FROM @t

    -- seems fine to me
    /*
    fn ln zip (No column name)
    Bob Roger 12345 165913801
    Roger Bob 12345 -1772753975
    */

  32. Hi Dinesh,

    To see the checksum transformation after installation, you need to right-click on the toolbox window (left-hand side) and select Choose Items; this opens the Choose Toolbox Items window. Go to SSIS Data Flow Items, select Checksum, and click OK. It will then show you the checksum transformation.

    Niraj

  33. How does the hash value get updated?

  34. If I am using the checksum for SCD Type 1 in SSIS, will it lock the table (insert and update on the same table)?

  35. Our project needs data validations to be performed post-ETL load: whether the data loaded from the source system is the same as what is loaded in our system or not. One of the validations is to check that a code A001 from the source is loaded with the same value A001 in the destination database. How should I approach this validation using SSIS packages?

  36. The data has almost 20,000 records.
