
Introducing ETL Assistant – A Dynamic, Templated Approach to SSIS

TLDR
This is the first post in a series where I’m going to explain how to do a few things:

  1. Use the CozyRoc Parallel Loop Task to execute ForEach Loops in parallel
  2. Create a package that will handle connections to and push data between arbitrary OLEDB systems
  3. Dynamically map columns in SSIS at runtime
  4. Dynamically gather error data (column name, column value, error description, etc.) at runtime in SSIS (no, really – it works)
  5. Create a templated package system with inline variables that can be replaced at runtime (don’t get too excited – it’s simple string replacement :) )
  6. All sorts of other nutty and, I hope, useful abuses of SSIS
  7. Create an ASP.NET MVC 3 app on top of everything to help you manage the insanity

UPDATE: While I’ve slacked off in posting detailed entries, we have posted a “screenshot gallery” of the web management UI. That may help explain where we’ve gone with this. Enjoy!

This post mainly covers why you’d try to do any or all of this.  Later posts will get into the more technical stuff.

And now – the long version…

This is my first blog entry.  Not my first this month or first for the EDW, but first ever.  I’m happy that I finally have both something to blog about as well as the opportunity to blog – Northwestern allows and encourages the sharing of information that at most organizations would be considered proprietary.

Good.  Now that we’re past that, let’s get onto the more interesting stuff.  I joined the team here late in 2010 and was immediately impressed by four things:

  1. The team itself – they’re smart, hard-working, and humble.  Can’t beat that power combo.   We also have a large, extended family of indirect team members and organizational supporters who have the same drive, enthusiasm, and dedication to our mission.
  2. The sheer volume of data we deal with.  In terms of both breadth and depth it can be a bit intimidating.  And it’s alive.  It’s an organism.  It’s ever growing, ever changing.  The data just can’t sit still.  You might get something working only to find that the data structures or data within the structures has changed in small, subtle ways that create challenges for everyone.
  3. It’s in our DNA to constantly evolve and improve – and one vital part of that constant improvement is our desire to reduce manual tasks that could be safely and effectively automated.  I’ve been at shops where the entire business model was based around drudgery and inefficiency because they either didn’t get it (sad, but true) or charged clients by the hour.  Either of those is the kiss of death to keeping clients happy, your costs competitive, and your team a happy, hard-working, creative bunch.  Smart, creative people don’t like doing the same task over and over.  They like to solve the problem once and move on.  Smart, creative people are also the types who work hard to improve things because they see a problem (ahem – “opportunity…”) and need to solve it.   In our line of work, when you genuinely care about how you’re helping improve medicine, that desire to improve and that unwillingness to let something be “good enough” is critical.
  4. The tools that have been built in-house to help reduce manual labor are pretty awe-inspiring.  You can read about them on our blog, but one in particular really caught my attention early on – PackageWriter.
PackageWriter was built by EDW-alum Eric Just – a great example of the creative, hard-working, “good enough is not enough” breed we have on our team.  Eric cooked up PackageWriter as a way to help reduce the manual labor involved in constructing the SQL Server Integration Services (SSIS) packages we churn out whenever we bring new data sources into the EDW.   Now – let’s set expectations real quick.  When we bring on a “small” data source it could be something on the order of ~15GB in size, comprising 150+ tables and 170 million rows of data.  If you’ve used SSIS you know that to bring in that “small” system for a full load means you need to set up a data flow task (DFT) process per table and then combine those DFTs into a package you can schedule.
Let’s do some basic math on that.
  • Assuming you have 150 tables, each needs a separate DFT, so that’s 150 DFTs
  • Let’s say you can set up a single DFT by hand in about 10 minutes
  • Let’s also assume you want to keep things semi-sane, so you have a maximum of 10 DFTs per SSIS package
  • We’re now at… 15 packages with 150 DFTs that took you 10 minutes per DFT.  That’s 25 hours.  In work days, if you assume you might get 7 real hours per day of time to work on this, that’s over 3.5 days to set up that import.

Ouch.

But – the work’s not over.  You still need to actually build the destination tables in your data warehouse, so you also need to build table DDL scripts.  You may be moving data from Oracle, PostgreSQL, MySQL, or any number of systems that require you to rewrite those DDL scripts in SQL Server-friendly form.  And, if you care about the quality of your data, you’ll really want to build in error row logging (not included in the 10 minutes per DFT above).  Importing that system doesn’t sound like it’s all that much fun or valuable compared to the “real work” you’ve also been tasked with completing (you know – actually using the data).

Enter PackageWriter.

PackageWriter is a beast of burden sent to do your bidding and lighten your workload by automating everything end to end.  You feed it a source database connection string, a few configuration options like destination schema, filegroup, etc. and then enter the list of source tables you want to import.  From there, it’s magic.  It pops out the DDL scripts (using data type conversion defaults you’ll want to review) and emits all of the packages for you, including the error row logging.
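
To make the “data type conversion defaults” point concrete, here is the kind of translation the generated DDL has to perform when the source is Oracle.  The table, schema, and filegroup names below are made up for illustration, and the type mappings are common defaults rather than PackageWriter’s exact output:

  -- Hypothetical Oracle source table (illustration only):
  --   CREATE TABLE employee (
  --     employee_id  NUMBER(10),
  --     full_name    VARCHAR2(200),
  --     updt_dt_tm   DATE
  --   );
  -- A SQL Server-friendly equivalent, using common default conversions:
  CREATE TABLE edw.employee (
      employee_id  NUMERIC(10, 0) NULL,
      full_name    VARCHAR(200)   NULL,
      updt_dt_tm   DATETIME       NULL
  ) ON [EDW_DATA];

Those defaults are exactly the thing you want to review – Oracle’s DATE carries a time component, NUMBER with no declared precision is ambiguous, and so on.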

Total time for the equivalent effort?

  • You need to type in the database connection string, filegroup default, schema default, and source table names.  I’m lazy, so I listed the table names from the source system table catalog (a quick catalog query like the one sketched after this list).
  • PackageWriter generates the destination DDL. You then use that DDL to create your tables in SQL Server.
  • Next up, PackageWriter emits the SSIS packages for you.  All of them.  And it crafts SSIS’s arcane error row mapping.
  • Total time for the same 150 tables?  About 30 minutes.
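
For that first bullet, “listing the table names from the source system table catalog” is a one-liner against standard catalog views (nothing PackageWriter-specific); adjust the schema or owner filter for your source:

  -- SQL Server source: list the base tables in a schema
  SELECT TABLE_NAME
  FROM INFORMATION_SCHEMA.TABLES
  WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA = 'dbo';

  -- Oracle source: the same idea against Oracle's catalog
  SELECT table_name
  FROM all_tables
  WHERE owner = 'SOURCE_SCHEMA';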

Wow.  That’s just astounding.  And it’s a huge relief.  Now you can get back to your real job.  Bless you, PackageWriter.

I began using PackageWriter to assist with the import of data related to one of the systems we replicate into our EDW.  It’s a larger system – something on the order of about 4,000 tables and TB upon TB of disk.  We currently only replicate ~150 or so tables of that 4,000-table catalog, but we constantly receive requests to pull in more.  I usually field these requests by firing up PackageWriter and creating a full truncate/reload process for the new tables.  This worked out well until I got a request for a somewhat larger table (~200GB) that you wouldn’t want to reimport nightly.  Happily, poor Eric Just, the author of PackageWriter, sat next to me at the time (I’ve mentioned he’s now an EDW alum – I’m not sure if my proximity to him is related to his now being an “alum”).  I turned to Eric, who was, as ever, too polite and didn’t seem to mind my constant, unending stream of questions: “How do we do incremental loads with PackageWriter?”  To which he replied, “You don’t.  Incremental loads need to be written by hand since the queries are table-specific and depend on how changes are managed.  Is there a change date on some column you can use to compare the source and destination table timestamps?” Ah.  I see.  It made perfect sense.  There was no good way to write a generic query for “what changed?” across a series of tables.  And our data volume is just too large for many of the other alternatives you might choose.  My favorite magic bullet would get me part way there, but I needed to go back to writing packages by hand again.

I was new to SSIS when I joined the EDW.  I had worked with Data Transformation Services (DTS – the previous incarnation of SSIS) a few times before, but hadn’t needed to do anything that complex or critical (I’ve traditionally been more of a web-type-guy or a PM).  Luckily, the concepts in SSIS are fairly straightforward and Microsoft has done a pretty darn good job making SSIS far, far more flexible than DTS.  Using PackageWriter as a starting point I began looking at our tables for this one huge data source and we lucked out – all of our tables have a common column called “updt_dt_tm” – the date/time the row was updated.  I could easily come up with a statement to handle that and stage the data for eventual reloading.

Our process is intentionally simple (a T-SQL sketch follows the list):

  1. Get the max(updt_dt_tm) from the local replica table
  2. Using that max(updt_dt_tm), get the source rows whose update timestamp is more recent (where updt_dt_tm > [our max updt_dt_tm])
  3. Stage the data in a holding table
  4. Delete the rows from our local replica where there are matches in staging
  5. Insert the new rows from staging into our local replica
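
Here is that process as a minimal T-SQL sketch.  The table names, the staging schema, and the key column (employee_id) are hypothetical stand-ins, and in the real package step 2 runs against the remote source system rather than a local schema, but the shape is the same:

  -- 1. Get the high-water mark from the local replica
  DECLARE @max_updt_dt_tm DATETIME;
  SELECT @max_updt_dt_tm = MAX(updt_dt_tm) FROM dbo.employee;

  -- 2 and 3. Pull the rows that changed since then and stage them
  INSERT INTO staging.employee
  SELECT *
  FROM source.employee
  WHERE updt_dt_tm > @max_updt_dt_tm;

  -- 4. Delete replica rows that have a newer version in staging
  DELETE r
  FROM dbo.employee AS r
  INNER JOIN staging.employee AS s ON s.employee_id = r.employee_id;

  -- 5. Insert the staged rows into the replica
  INSERT INTO dbo.employee
  SELECT * FROM staging.employee;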

I thought to myself – well – if this is the case, then the SQL is nearly 100% identical across all of our tables.

select * from employee where updt_dt_tm > 'January 1, 2011'
select * from phone where updt_dt_tm > 'January 1, 2011'
select * from address where updt_dt_tm > 'January 1, 2011'

(Look pretty similar, don’t they?)

The only real difference is the reference to table names and the final column mappings in SSIS.  When transferring data from table A to table B, SSIS must map columnA to columnB – it then resolves data conversion issues, etc.

Hrm.  If I could dynamically map those columns then I could build one simple incremental package that could be used for all tables – all that’s really changing is the SQL statement, source connection / table information, and destination table / connection information.
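
Conceptually, “mapping columnA to columnB” boils down to matching columns by name across the two catalogs.  A rough way to picture it (not how the eventual solution works internally), assuming for illustration that both databases happen to be visible from the same SQL Server as SourceDb and DestDb:

  -- Which source columns line up with which destination columns, by name?
  SELECT s.COLUMN_NAME,
         s.DATA_TYPE AS source_type,
         d.DATA_TYPE AS destination_type
  FROM SourceDb.INFORMATION_SCHEMA.COLUMNS AS s
  JOIN DestDb.INFORMATION_SCHEMA.COLUMNS AS d
    ON d.COLUMN_NAME = s.COLUMN_NAME
  WHERE s.TABLE_NAME = 'employee'
    AND d.TABLE_NAME = 'employee';

Anything that doesn’t join by name is a column you have to decide what to do with – which is where the “new and unusual problems” mentioned below come from.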

[Screenshot: ETL Assistant - Dynamic Table Column Mapping]

One small problem – SSIS can’t dynamically map columns at runtime (not without using the API to dynamically construct the package in memory).  It’s a major but understandable shortcoming of the overall toolkit.  Handling dynamic mapping, as I now know, introduces all sorts of new and unusual problems that make you stay up all night wondering why you got into IT.  I started scouring the Internet to see if anyone else had tried to do this (why re-invent the wheel?).  I kept seeing posts from this shop called CozyRoc, noting they could, in fact, do this and more.  At this point I turned to Eric Just again and asked him if he’d ever heard of this “CozyRoc.”  “Oh, yeah – totally.  Great stuff.  We use their zip task.”  Oh, I see.  I began tearing into the CozyRoc control flow and data flow components and found exactly what I was looking for.

  • "Data Flow Task Plus" – "allows setup of dynamic data flows. The dynamic setup options can be controlled with parameters from outside. This feature lessens the need to manually open and modify the data flow design, when new source and destination columns have to be accommodated."
  • "Parallel Loop Task" – "is SSIS Control Flow task, which can execute multiple iterations of the standard Foreach Loop Container concurrently. In tests, a CPU intensive sequential process when executed in parallel on 4-core machine was executed 3 times faster compared to the sequential."

Whoa.  Hold up.  We can now dynamically map columns and run n iterations of this task in parallel?  Sign me up.  With that I could unleash our server on poor, unsuspecting source systems in no time flat.  If we could do that, then we could replace hundreds of packages with just two packages.

[Screenshot: ETL Assistant - Parallel Package Invoker]

And on that same day, in a matter of a few clicks, we delivered the first iteration of “ETL Assistant.”  ETL Assistant has been rebuilt many times since then, but at its core it does a few things:

  • Connects to a source OLEDB system.  Just supply a connection string (I’ve tried it with MSSQL, Oracle, and PostgreSQL thus far)
  • Pulls in the table catalog for that system (based on parameters you specify)
  • Checks for the existence of those tables in your destination system
  • Lets you configure default “global” definitions of query templates for full and incremental jobs – you can then override those templates on a table-by-table basis
  • Lets you create inline variables that can be used to gather small snippets of data in both the source and destination systems at runtime.  They can then be incorporated into your source queries.  Useful for those “find me the latest date in the destination – then get me more recent data from the source system” scenarios.  This really helps unleash the more dynamic aspects of the system.  (A rough template example appears after this list.)
  • Abusing the inline variables, we can also set batch windows on the queries so you can pull recurring n-row batches if you need to (useful when you’re pulling in those 2+TB source tables and want to do so in 10mm-row increments)
  • You can then set up “groups” of tables that you can schedule in SQL Server
  • Custom logging – the package needs it because the standard SSIS logging would be useless here
  • Error row logging to a table (infinitely useful)
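
To give a flavor of the query templates and inline variables, here’s roughly what an incremental template looks like.  The {{...}} delimiter and the variable names are illustrative placeholders, not ETL Assistant’s exact syntax (remember, it’s simple string replacement):

  -- Hypothetical incremental template; tokens are replaced at runtime
  SELECT *
  FROM {{source_table}}
  WHERE updt_dt_tm > '{{dest_max_updt_dt_tm}}'

  -- The same idea with a batch window for the really big tables,
  -- so you can pull the data in, say, 10mm-row slices
  SELECT *
  FROM {{source_table}}
  WHERE updt_dt_tm > '{{dest_max_updt_dt_tm}}'
    AND {{batch_key}} BETWEEN {{batch_start}} AND {{batch_end}}

Here {{dest_max_updt_dt_tm}} would be an inline variable evaluated against the destination before the source query runs – the “find me the latest date in the destination” case from the list above.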

Really, it’s what you would do by hand, so why not let the computer do it for you?  That’s what they’re for.

As a bit of a tease, let me give you a quick example of how this is paying off for us.  I was asked to import a new source Oracle system with 155 tables.

Here’s how we did it

  1. Set up source system connection information (5 minutes – I kept messing up the connection string)
  2. Apply existing query template for “full select from SQL Server” (the web UI has some helper templates built into it since I’m lazy) (1 minute)
  3. Pull in table catalog (2 minutes)
  4. Use TableWriter (part of PackageWriter and now also incorporated into ETL Assistant) to give us helper DDL scripts for the destination tables.  This is an asynchronous ajax call to TableWriter, so it’s pretty quick.  (1 minute)
  5. Create tables in destination SQL Server (2 minutes)
  6. Create a group (for scheduling in SQL Server) (1 minute)
  7. Put tables in group (2 minutes)
  8. Open up SQL Server and create a new job pointing to our parallel ETL Assistant package – supplying the table group ID as a package parameter (5 minutes; a rough T-SQL sketch follows this list)
  9. Run the package.  (20 minutes)
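
For step 8, here’s a minimal sketch of what the Agent job amounts to in T-SQL, assuming the parallel package is run from the file system and takes the group ID as a package variable (the variable name, file path, job name, and group ID here are all made up):

  -- Build the dtexec-style command for the SSIS job step
  DECLARE @cmd NVARCHAR(4000) =
        N'/FILE "D:\Packages\ETLAssistantParallel.dtsx" '
      + N'/SET "\Package.Variables[User::TableGroupID].Properties[Value];42"';

  EXEC msdb.dbo.sp_add_job
      @job_name = N'ETL Assistant - New Oracle Source';

  EXEC msdb.dbo.sp_add_jobstep
      @job_name  = N'ETL Assistant - New Oracle Source',
      @step_name = N'Run parallel package for table group 42',
      @subsystem = N'SSIS',
      @command   = @cmd;

  EXEC msdb.dbo.sp_add_jobserver
      @job_name = N'ETL Assistant - New Oracle Source';

  -- Add msdb.dbo.sp_add_jobschedule here if the load should recur nightly.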

Total time end to end? ~40 minutes to bring in 155 tables containing 170mm rows and consuming 18.5 GB of disk.  Not bad.

But – the cool part?  We get that incremental update ability.  And, since we’re using a dynamic column map inside the package, we can go back and adjust column names and data types in the destination table without needing to change and redeploy the package.  And – if we need to change table replication behavior we can simply adjust either the overall “global” query template for that data source or override it for a specific table.  Or – let’s say you need to globally go back and re-load data from a specific date range – just override the global template and run the package.  The point is – we’re now open to a wee bit more flexibility than we were previously accustomed to.

In the next few posts I’ll be delving into how to get things done.  I’ll try to give practical examples and insights so you can get more done with less – because in IT, wanting to get rid of the monotonous churn isn’t being lazy, it’s being efficient.

