Sunday, 7 September 2014

Selectively migrate data between tables using SSIS package

So, here is the problem that I faced. I had to move bulk data from multiple tables in one SQL server instance to another. It could be simply achieved by having multiple data flows each to move data from one ADO.NET source to an ADO.NET destination. However, the gotcha was that only those rows from the source table were to be moved that met a certain criterion. A join operation (merge join transform with inner join) with a conditional split would have done, but it required sorting the input data beforehand and repeating this operation(join) for all the tables which would prove costly in terms of memory. So storing the data (only the actual look up data) in an in memory cache was a better solution. I will describe how you can recreate the solution in this post.

Steps:

  1. Create a new project in SQL Server Data Tools.
  2. Add a data flow task. Name it "Populate look up cache".
  3. Get inside the data flow task. 
  4. Add the connection managers and configure them appropriately for the source and destination databases (which may reside in different server instances).Name them as "source" and "destination" respectively.
  5. Drag and drop a ADO.NET source (or any source of your choice). Configure this source with the source connection manager you had earlier selected and select the name of the table from which you want to perform the look up. This table contains the column based on which you want to filter the data and the column that all the other tables also contain (say as foreign keys in them). The data inside this column that meet the selection criterion will be used for the look up.
  6. Drag and drop a "conditional split" transform from the toolbox. Connect the output of the source to the input of this transform. Go to the configuration manager for the conditional split transformation by double clicking it. Specify the match criterion by using variables or hard-coding. You will have a screen like this: 
       
Figure 1: The highlighted region shows the split condition used.

     7.   Now select a "cache transform" from the other transforms in the toolbox. Connect the                          appropriate output from the "conditional split" transform into this transform. Double click to                open the cache transformation editor. Select new as cache connection manager. In the columns            tab select only the columns that will actually be used for the look up. Remove the others by                  selecting the  row and then right clicking and selecting delete. Then specify the index positions           for the remaining columns. For more details on index positions refer  
          You will get a screen like this:
          
Figure 2: The columns that are stored in the cache


      8. Go to the mappings section and see that the mappings are all appropriate. You will see a screen like this:
Figure 3: Appropriate mappings done
9. After all these steps the data flow task should look like this:
Figure 4: Overall data flow task 
10. Go back to the Control flow and add another data flow task. This task will migrate the data from the source table to the destination table based on whether the look up column for the row is present in the cache or not. Connect the output of the previous data flow task to the input of this data flow task.

11. Get inside the data flow task. Drag and drop a source of your choice. Configure the source with the appropriate connection manager. 

12. Drag and drop a "look up" transform.Connect the input of the look up transform with the output of the source. 

13. Double click on the look up transform to open the look up transformation editor. Change the connection type to "Cache connection manager". In the "Specify how to handle rows with no matching entries" select "Redirect rows to no match output".So we end up with something like this:
Figure 5: Values as set up in the transformation editor
14. In the connection option in the left column select the cache configuration manager that you had selected for the previous data flow task.

15. In the columns option, check the columns you want to look up from the "Available look up columns". Then connect the input column from the "Available input columns" by dragging from this table and dropping over the same column in the "Available look up columns" so that a line connects these columns as shown:
Figure 6: Connection between the input columns and look up columns
16. Then drag and drop an appropriate destination. Connect the "Look up Match Output" of the Look up transform as the input of this destination. 

17.Select an appropriate connection manager for the destination. Select an appropriate table where the data will be inserted.Then connect the ADO NET source.[your_look_up_column name] with the same column in the available destination column. This screen will look like this:
Figure 7:Connection between the columns in the destination




18. After this step we will land up with a data flow task that looks like this:



19. Overall the control flow now looks like this:
   
      
20. You can similarly add more data flow tasks each of which has a similar structure as the data flow task shown here but transfers data from different tables. All these tables however contain the column which is stored in the cache.
The final control flow will look like this:



To conclude, you can easily selectively migrate data from source to destination efficiently by using the conditional split, cache and look up transforms. There are advantages as well as disadvantages of using the cache as can be seen from here:

I am still a novice with SSIS packages. I really appreciate any feedback. :)

No comments:

Post a Comment