For the TLDR crowd – I’m supplying downloads of the packages so you can just open them and play.
Download “Cozy Roc Parallel Loop Demo Files” contains:
- setup_sql.txt
- SequentialLoop.dtsx
- ParallelLoop.dtsx
To use the demo files you’ll need at least the evaluation copy of the CozyRoc components installed (you can get 32-bit and x64 versions from the CozyRoc site: http://www.cozyroc.com/products).
Loops in SSIS
SSIS provides a very handy loop container, the ForEach Loop Container – you supply a collection (for example, a variable of type Object) and it iterates through that collection, executing steps or processes for each item in the collection.
Microsoft’s description (http://msdn.microsoft.com/en-us/library/ms139956.aspx) of the closely related For Loop container:
“The For Loop container defines a repeating control flow in a package. The loop implementation is similar to the For looping structure in programming languages. In each repeat of the loop, the For Loop container evaluates an expression and repeats its workflow until the expression evaluates to False.”
Great – we can now loop through a set of data.
- For a given group of something (a collection)
- Iterate through the collection (a variable / instance)
- For each instance, execute a process
In the case of ETL Assistant we use this to do the following:
- We have a concept of a scheduling “group” – a set of source::destination table mappings. Let’s say I want to manage a set of HR-related items (department list, employee list, address information, etc.) as a group, which is easier than managing individual tables. I can put them into a “group” (collection). Let’s call this “HR tables.” I can do the same with a set of patient information (patient / person list, encounter / visit information, and possibly some other patient demographic information). Let’s call this “Patient tables.”
- I can, for each group, pull back a list of tables (instances)
- For each table I can execute a dynamic ETL on them to pump data from a source to a destination (EX: Oracle::employee -> SQL Server::employee)
The loop task does a great job managing simple collections and executing an operation per item.
The problem? I’m now executing all of this serially.
This means if I have a fairly beefy server I’m still potentially sitting idle while I do a simple set of ETL operations. You have several ways to address this, but one I’ve found attractive is to convert from a serial ForEach loop to a parallel ForEach loop using the Parallel Loop Task from CozyRoc. This will let us do n-parallel executions of a given operation. If you have a 64 core host, for example, and the diagram above represented tables you wanted to load from a remote source, you could execute A, B, and C loads in parallel.
Let’s get back to that example using the HR tables (department list, employee list, address information). I can create a “group” (“HR”), then place these three tables into the HR group. When I run a process to pull over the HR group I reference the group, pull back the three table references and place them into a collection. I then iterate through the collection and execute a task once for each of the n items in it.
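To make the serial-versus-parallel difference concrete outside of SSIS, here is a minimal Python sketch. It is not how SSIS or CozyRoc work internally; `GROUPS`, `load_table`, `run_sequential` and `run_parallel` are hypothetical names, and the "load" is just a short sleep standing in for a remote table copy.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the group -> tables lookup described above.
GROUPS = {"HR": ["Employees", "Departments", "Addresses"]}


def load_table(table_name: str) -> str:
    """Pretend to copy one table from a source to a destination."""
    time.sleep(0.1)  # simulated wait on the remote source
    return f"loaded {table_name}"


def run_sequential(tables):
    """One table at a time, like a plain ForEach loop."""
    return [load_table(t) for t in tables]


def run_parallel(tables, max_workers=None):
    """Several tables at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_table, tables))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    tables = GROUPS["HR"]
    _, seq_secs = timed(run_sequential, tables)
    _, par_secs = timed(run_parallel, tables)
    print(f"sequential: {seq_secs:.2f}s  parallel: {par_secs:.2f}s")
```

Both functions do the same work and return the same results; only the elapsed time differs, because the parallel version overlaps the waits.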
The Sequential ForEach Loop in SSIS
Let’s do some quick setup steps to prep this test scenario:
--create a test schema
CREATE SCHEMA cozyroc AUTHORIZATION dbo
GO

--this is our "group" table
--EX: HR
CREATE TABLE cozyroc.etl_groups (
    group_id INT IDENTITY(1,1) NOT NULL,
    group_nm VARCHAR(100) NOT NULL,
    group_dsc VARCHAR(255),
    CONSTRAINT [PK_cozyroc_etl_groups_group_id] PRIMARY KEY CLUSTERED
    (
        [group_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--this is our "table" table
--EX: employees, addresses, etc.
CREATE TABLE cozyroc.etl_tables (
    table_id INT IDENTITY(1,1) NOT NULL,
    table_nm VARCHAR(100) NOT NULL,
    table_dsc VARCHAR(255),
    CONSTRAINT [PK_cozyroc_etl_tables_table_id] PRIMARY KEY CLUSTERED
    (
        [table_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--this is our associative table to store
-- the mapping from group::table
-- I'm using this in the example because in later
-- posts we'll allow the table to be "grouped"
-- multiple times
CREATE TABLE cozyroc.etl_group_tables (
    group_table_id INT IDENTITY(1,1),
    group_id INT NOT NULL,
    table_id INT NOT NULL,
    CONSTRAINT [PK_cozyroc_etl_group_tables_group_table_id] PRIMARY KEY CLUSTERED
    (
        [group_table_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--insert a sample group
INSERT INTO cozyroc.etl_groups (group_nm, group_dsc) VALUES ('HR', 'HR Group')

--insert some sample tables
INSERT INTO cozyroc.etl_tables (table_nm, table_dsc) VALUES ('Employees', 'Employees Table')
INSERT INTO cozyroc.etl_tables (table_nm, table_dsc) VALUES ('Departments', 'Departments Table')
INSERT INTO cozyroc.etl_tables (table_nm, table_dsc) VALUES ('Addresses', 'Addresses Table')

--blindly cross join everything
INSERT INTO cozyroc.etl_group_tables (group_id, table_id)
SELECT g.group_id, t.table_id
FROM cozyroc.etl_groups g,
     cozyroc.etl_tables t

--now let's also create a logging table
-- this is a placeholder for more complex
-- operations
CREATE TABLE cozyroc.parallel_test (
    log_id INT IDENTITY(1,1) NOT NULL,
    group_table_id INT NOT NULL,
    group_id INT NOT NULL,
    table_id INT NOT NULL,
    execution_dts datetime2(7) DEFAULT GETDATE(),
    CONSTRAINT [PK_cozyroc_parallel_test_log_id] PRIMARY KEY CLUSTERED
    (
        [log_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
This will have created several tables:
- cozyroc.etl_groups (“HR”)
- cozyroc.etl_tables (“employees,” “addresses,” etc.)
- cozyroc.etl_group_tables (mapping “employees” to the “HR” group, for example)
- cozyroc.parallel_test (our fake table we’re using to test the parallel loop)
You don’t need all of this for the CozyRoc Parallel Loop Task to work, but I’m trying to just introduce some examples we’re going to use in later posts related to dynamic ETL.
Now let’s create a package
- Launch BIDS and create a new SSIS project
- Create an OLEDB connection to the server and database where you created the tables (in my case, that’s a database called “cozyroc” on localhost)
- For convenience, make sure you have the “variables” panel open (SSIS -> Variables)
- Drag a few things onto the workspace:
- an Execute SQL Task
- a ForEach Loop Container task
- drag another Execute SQL Task into the ForEach Loop Container (make sure it’s placed inside of the container)
- Create a package-level variable called “mylistoftables” – make it of type “Object”
- Click on the ForEach Loop Container so it’s highlighted – now create a variable called “iter” and make it of type “Int32.” Clicking on the ForEach Loop and then creating the variable will scope “iter” to the loop – make sure the scope of the variable is correct.
- Now let’s create another variable also scoped to the ForEach Loop. Let’s call it “sql_insert” and make it of type String. We’re going to set this up to hold a sample insert statement so we can watch the loop in action.
- For the “sql_insert” variable, set EvaluateAsExpression to “True”
- Open the Expression and enter the following:
"insert into cozyroc.parallel_test (group_table_id, group_id, table_id) select gt.group_table_id, gt.group_id, gt.table_id from cozyroc.etl_group_tables gt where gt.group_table_id = " + (DT_WSTR, 100) @[User::iter]
- Connect the first SQL Task to the ForEach Loop
Your package should now look something like…
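If it helps to see what the “sql_insert” expression evaluates to on each pass, here is a rough Python equivalent. `build_sql_insert` is a hypothetical helper name, and the SSIS `(DT_WSTR, 100)` cast is approximated with `str()`.

```python
def build_sql_insert(iter_value: int) -> str:
    """Mimic the SSIS expression: fixed SQL text plus the current iter value."""
    return (
        "insert into cozyroc.parallel_test (group_table_id, group_id, table_id) "
        "select gt.group_table_id, gt.group_id, gt.table_id "
        "from cozyroc.etl_group_tables gt "
        "where gt.group_table_id = " + str(int(iter_value))
    )


if __name__ == "__main__":
    # With iter = 2, the statement ends in "where gt.group_table_id = 2".
    print(build_sql_insert(2))
```

Because the variable is evaluated as an expression, the string is rebuilt every time “iter” changes, so each loop iteration inserts the row for its own group_table_id.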
Now let’s set up the SQL Task
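- Double-click the first SQL Task to open the configuration
- On the “General” panel, set the Connection to your database, enter a SQLStatement that selects everything from cozyroc.etl_group_tables, and set ResultSet to “Full result set”
- Now go to the “Result Set” panel and map Result Name 0 to the “User::mylistoftables” variable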
What did we just do here? We told the SQL Task to execute a query (“get me everything from the cozyroc.etl_group_tables table”) and then store the results in our “mylistoftables” object. Pretty straightforward.
Let’s proceed to setting up the loop
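- Double-click on your ForEach Loop Container to open the properties panels
- On the “Collection” panel, set the Enumerator to “Foreach ADO Enumerator,” choose “User::mylistoftables” as the ADO object source variable, and select “Rows in the first table”
- Now on the “Variable Mappings” panel, map “User::iter” to Index 0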
In this step we told the ForEach Loop Container to loop through the “mylistoftables” collection – on the first table – and set the “User::iter” variable to the first column as it loops. Keep in mind – and this is critical for later – you’re using the User::iter variable scoped to the ForEach Loop Container.
Alright – we’re almost done setting up the basic loop. Now we just have to wire up a task that the loop executes. In the SQL Task within your loop, set the task to execute the “sql_insert” statement:
- Double-click the second SQL Task to open the configuration
- On the “General” panel,
- Set the Connection property to point to your database (ex: “localhost”)
- Set the SQLSourceType to “Variable”
- Set the SQLStatement to your “User::sql_insert” variable
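The whole sequential package can be mimicked in a few lines of Python, which may help when checking the wiring. This is only a sketch under stated assumptions: it uses an in-memory SQLite database attached under the name `cozyroc` as a stand-in for SQL Server, simplified table definitions, and a hypothetical function name `run_sequential_loop`.

```python
import sqlite3


def run_sequential_loop():
    """Simulate: first SQL Task -> ForEach loop -> inner SQL Task."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the "cozyroc." prefix resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS cozyroc")
    conn.execute(
        "CREATE TABLE cozyroc.etl_group_tables ("
        "group_table_id INTEGER PRIMARY KEY, "
        "group_id INTEGER NOT NULL, table_id INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE cozyroc.parallel_test ("
        "log_id INTEGER PRIMARY KEY, group_table_id INTEGER NOT NULL, "
        "group_id INTEGER NOT NULL, table_id INTEGER NOT NULL, "
        "execution_dts TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.executemany(
        "INSERT INTO cozyroc.etl_group_tables (group_id, table_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 3)],
    )

    # First SQL Task: the full result set goes into "mylistoftables".
    mylistoftables = conn.execute(
        "SELECT group_table_id, group_id, table_id FROM cozyroc.etl_group_tables"
    ).fetchall()

    # ForEach loop: column index 0 of each row is mapped to "iter".
    for row in mylistoftables:
        iter_value = row[0]
        sql_insert = (
            "insert into cozyroc.parallel_test (group_table_id, group_id, table_id) "
            "select gt.group_table_id, gt.group_id, gt.table_id "
            "from cozyroc.etl_group_tables gt "
            "where gt.group_table_id = " + str(iter_value)
        )
        conn.execute(sql_insert)  # inner SQL Task

    logged = conn.execute(
        "SELECT group_table_id, group_id, table_id "
        "FROM cozyroc.parallel_test ORDER BY log_id"
    ).fetchall()
    conn.close()
    return logged


if __name__ == "__main__":
    print(run_sequential_loop())  # [(1, 1, 1), (2, 1, 2), (3, 1, 3)]
```

One row is logged per item in the collection, in the order the collection was read, which is what the package should produce in cozyroc.parallel_test.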
Run the package and, if there are no errors, pop over to SQL Server Management Studio for a minute.
Run a query to quickly look at the results of the package:
SELECT * FROM cozyroc.parallel_test
You should see something like…
log_id  group_table_id  group_id  table_id  execution_dts
1       1               1         1         2011-11-04 13:15:00.9500000
2       2               1         2         2011-11-04 13:15:00.9800000
3       3               1         3         2011-11-04 13:15:01.0100000
Note the dates and times. Each row’s timestamp is slightly later than the one before it (about 30 ms apart here), and later group_table_ids have later timestamps. This is the result of the loop running sequentially.
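To put numbers on that pattern, here is a small Python sketch that measures the gaps between the execution_dts values from the sample output above. `parse_dts`, `gaps_ms` and `spread_ms` are hypothetical helper names; your own timestamps will differ.

```python
from datetime import datetime, timedelta

# execution_dts values copied from the sequential run's sample output.
sequential_run = [
    "2011-11-04 13:15:00.9500000",
    "2011-11-04 13:15:00.9800000",
    "2011-11-04 13:15:01.0100000",
]


def parse_dts(text: str) -> datetime:
    """Parse a datetime2(7) string; Python keeps 6 fractional digits, so drop the 7th."""
    return datetime.strptime(text[:26], "%Y-%m-%d %H:%M:%S.%f")


def gaps_ms(timestamps):
    """Whole milliseconds between each consecutive pair of timestamps."""
    parsed = [parse_dts(t) for t in timestamps]
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(parsed, parsed[1:])]


def spread_ms(timestamps):
    """Whole milliseconds between the earliest and latest timestamp."""
    parsed = [parse_dts(t) for t in timestamps]
    return (max(parsed) - min(parsed)) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(gaps_ms(sequential_run))    # [30, 30]
    print(spread_ms(sequential_run))  # 60
```

Steady, non-zero gaps in log_id order are the signature of a sequential loop; the same helpers can be pointed at the output of any later run for comparison.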
Converting to the CozyRoc Parallel Loop Task
Let’s upgrade this to a Parallel Loop Task. Hang on – things are about to get weird.
First things first. Let’s quickly throw on some more components and variables as well as tweak some other bits.
- Drag a “Parallel Loop Task” onto the canvas
- Delete the link from your first SQL Task to the ForEach Loop. Where we’re going we don’t need that flow.
- Now connect that first SQL Task to the Parallel Loop Task
- Double-click on the Parallel Loop Task to open up the configuration panel
- Click on the Package Connection property and set the connection to “<New Connection>.” When the dialog box opens, make sure “Connection to Current Package” is checked and hit “OK.” We just told the Parallel Loop Task to talk to this package when executing. Right – this basically just became a “meta package” with execution steps. Think of this like its own self-referencing parent-child package.
- Now – still inside the configuration panel of the Parallel Loop Task – click on the ForEachLoop configuration item – a new popup should appear. Click on the name of your ForEach Loop within this package.
- Your final Parallel Loop Task configuration should look something like this
Now we’re cooking. Only a few simple changes left.
- Disable the main ForEach Loop. We no longer manage it – the Parallel Loop Task does. It enables/disables this as it fires each instance of this package. If we left the loop enabled things would get very messy – you’d have sequential instances of the loop firing within each parallel instance of this package. Loops in loops – very loopy.
- Create a new package-level variable called “Iter” of type “int32.” That’s right – we have a package and a loop variable now.
That’s it. That’s really all you have to do to take a sequential loop and turn it into a parallel loop. Your final package should look something like this:
Give it a quick test run and then go back and re-run that query to look at the dates and times.
SELECT * FROM cozyroc.parallel_test
Your execution dates and times should now be much, much closer to each other if not completely identical:
log_id  group_table_id  group_id  table_id  execution_dts
4       2               1         2         2011-11-04 13:36:06.3230000
5       3               1         3         2011-11-04 13:36:06.3300000
6       1               1         1         2011-11-04 13:36:06.3300000
See that? Granted, this is a fairly pointless test case, but you get the idea. By default the Parallel Loop Task iteration setting is set to “-1” (as many cores as you have). You may want to play with this (or better yet expose it as a runtime configuration property) depending on your situation.
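As a rough analogy for that setting, the Python sketch below treats -1 as “one worker per core” and any positive number as an explicit cap. This only illustrates the idea; `resolve_worker_count` and `run_with_limit` are hypothetical names and say nothing about how CozyRoc implements the option.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_worker_count(setting: int) -> int:
    """Translate a '-1 means use every core' style setting into a worker count."""
    if setting == -1:
        return os.cpu_count() or 1  # cpu_count() can return None
    if setting < 1:
        raise ValueError("setting must be -1 or a positive integer")
    return setting


def run_with_limit(items, work, setting=-1):
    """Run work(item) for every item, with at most the resolved number at once."""
    with ThreadPoolExecutor(max_workers=resolve_worker_count(setting)) as pool:
        return list(pool.map(work, items))


if __name__ == "__main__":
    # Cap at two concurrent "loads", e.g. to go easy on a shared source system.
    print(run_with_limit(["Employees", "Departments", "Addresses"], str.upper, setting=2))
```

Capping below the core count can make sense when the bottleneck is the remote source or the network rather than the host running the package.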
Next up we’re going to step into how the CozyRoc DFT+ component can make your life easier by side-stepping SSIS’s age-old static design-time column mapping problem. Combined with the Parallel Loop Task, that’s when things really start to get interesting.
Further reading:
- SSIS For Loop Container (http://msdn.microsoft.com/en-us/library/ms139956.aspx)
- SQL Server Integration Services – SSIS – For Loop Container Samples (http://www.sqlis.com/post/For-Loop-Container-Samples.aspx)
- CozyRoc’s Parallel Loop Task (http://www.cozyroc.com/ssis/parallel-loop-task)