Real-Time ETL (Extract, Transform, Load) [Written 2012, for work done in 2007, and an idea conceived in 1999]
Sometimes I have my head buried in my work and lose track of what is happening, or in this case, not happening, in the mainstream. As someone who understands the necessity of "buzzword compliance," I recently searched high and low on the terms "real-time ETL" and "transactional ETL" and was very surprised to find so little on the subject. What I did read left me scratching my head. It was mostly explanations as to why it was not possible or feasible.
Real-Time ETL is possible
I know Real-Time ETL is possible because I and the team I am a solutions architect for have been doing it for the past five years. Sometimes when you do something you just assume that everyone else is doing it too.
I am assuming my reader knows what ETL is, but we need to come to a common definition of the term for my discussion to make sense to you. ETL stands for Extract, Transform, and Load. The "extract" we should all be familiar with: it involves getting the data out of the source system. Likewise, "load" involves getting the data into the target system. I want to note there is nothing in the acronym that dictates the means by which we must extract or load. It is usually some batch process, but it can also be a "connected feed" that is real-time (or has some degree of latency). It can also be an SQL trigger on the source system that feeds some intermediate staging location. Now the "transform" term is a little dicier and subject to some interpretation.
When it comes to technical terms, I have found we techies like to take common words and severely restrict their definition to a much narrower scope. I tend to look at the term "transform" in a broader context. First, when transforming data we might need to deal with data type issues, for example, moving an int from a dbms that defines it as 2 bytes to a dbms that defines it as 4 bytes. We might need to upcast or downcast the precision of a timestamp. We may want to transform a char to a varchar or the other way around, depending on what data types we have available in our target dbms.
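To make the data type transforms above concrete, here is a minimal sketch of what such a transform step might look like. The column names (`qty`, `updated_at`, `status`) and the specific rules are invented for illustration, not taken from any real system:

```python
from datetime import datetime

def transform_row(row):
    """Apply the simple data type transforms a load step might need."""
    out = dict(row)
    # 2-byte int on the source, 4-byte int on the target: validate the range
    # rather than trusting the source declaration.
    if not -32768 <= out["qty"] <= 32767:
        raise ValueError("qty out of 2-byte range: %d" % out["qty"])
    # Downcast timestamp precision: drop sub-second detail the target lacks.
    out["updated_at"] = out["updated_at"].replace(microsecond=0)
    # char -> varchar: strip the blank padding a fixed-width char column carries.
    out["status"] = out["status"].rstrip()
    return out

row = {"qty": 7,
       "updated_at": datetime(2007, 3, 1, 12, 30, 5, 123456),
       "status": "OPEN   "}
clean = transform_row(row)
```

The point is not the specific rules but that each one is a small, mechanical mapping between what the source dbms stores and what the target dbms can accept.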
We sometimes even have to make data type transformations for more exotic things, like converting a date stored as a number to an SQL date. In an old healthcare medical record database I worked with, we needed to transform a 5-byte timestamp that was stored in a character array to an integer field in the target system.
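As one hedged example of a date-stored-as-a-number conversion: a common encoding stores the date as the integer YYYYMMDD. The actual legacy encodings mentioned above (including the 5-byte character timestamp) are site-specific, so this only illustrates the general shape of the fix:

```python
from datetime import date

def number_to_date(n):
    """Decode an integer date of the assumed form YYYYMMDD into an SQL-style date."""
    return date(n // 10000, (n // 100) % 100, n % 100)

christmas = number_to_date(20071225)
```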
But data transformation can involve much more than data types. In certain types of warehouse or data mart implementations we might need to take discrete row data and roll it up or pivot it. And of course, row data in the source system is sometimes in a very denormalized state, and we need to normalize it as part of the data transformation process and store it in multiple tables in the target dbms.
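A roll-up of discrete row data can be sketched in a few lines. The table shape here (accounts with amounts) is invented for illustration; real roll-ups would typically happen in SQL or in an ETL tool:

```python
from collections import defaultdict

def rollup(rows, key, measure):
    """Aggregate discrete row data by `key`, summing `measure`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[measure]
    return dict(totals)

rows = [
    {"account": "A", "amount": 10.0},
    {"account": "A", "amount": 5.0},
    {"account": "B", "amount": 2.5},
]
summary = rollup(rows, "account", "amount")
```

Note that once the rows are rolled up, the individual source rows are no longer visible in the result, which is exactly the governance problem discussed later in this piece.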
I think the reasons database professionals do not see real-time ETL as viable are, first, they might have a narrow definition of the term "transform," and second, folks who publicly discuss ETL tend to work in the data warehouse domain... which brings me to my next subject.
The ETL Spectrum
ETL is not just confined to data warehousing activities!
The concept of extract, transform, and load is practiced in many other venues of software. ETL activities are done ...
- In real-time or near real-time interfaces utilizing interface engines such as Cloverleaf and protocols such as EDS or HL7.
- In upgrade work that involves migrating from one version of product software to another.
- In Master Data Management activities where source and target systems need to be kept in sync.
- Between development and production systems.
- Between internal development and product release systems.
Now that we have my operating definition of ETL and have discussed the ETL spectrum, I want to get back to the subject of Real-Time ETL starting with a picture.
Real-Time ETL Requires a Dynamic Pipe or Feed
This should be obvious. What may not be so obvious is how we achieve the feed. Database occupations tend to be very segregated in terms of work scope. The industry only has three or four generic occupations we all use: Database Administrator, Database Architect, Database Analyst. But because such deep knowledge is required, many of us specialize in particular facets of data management.
Physical DBAs that work in production OLTP (Online Transaction Processing) environments are acutely aware of a tiny little matter :-) called disaster recovery, or DR. While production systems use mirrored disks and even multiple controller paths to those disks, they still need a means of recovering from a catastrophic system failure. OLTP environments cannot afford to drop transactions or lose data. Think NYSE, Visa, Amazon, Boeing, The Mayo Clinic, DOD, etc. Enterprise data is an organization's lifeblood. DR solutions need to capture transactional activity in real-time. Batch is just not going to cut it.
My Accidental Journey Into Real-Time ETL
About five years ago I was approached by my manager. He told me a customer that was leaving us had contacted him to see if we could build them a data archival system. The whole story is rather long and much of it is not related to the subject at hand, so I will not digress.
On the surface it seemed like a pretty simple problem to solve. We needed to move data from their source system into a more affordable dbms. Of course, they also wanted a front-end application. And their data was stored in a proprietary dbms (HP NonStop Enscribe). But all of that was really not the challenge.
The real challenges were ...
- They were running an enterprise OLTP application with over 3000 concurrent users and would be for the duration of their move to the new vendor.
- Their move to the new vendor was going to take 14 months from implementation to go-live.
- They did not want to wait until they went live to start the archival project. They wanted to make the transition and retire their system in 60 days or less in order to save millions of dollars.
Hmmmmmm.... batch was out of the question. As an architect I have to dance in many functional and technical camps. Fortunately I had a decent understanding of (though not a lick of practical experience with) DR practices. I knew many of our customers were using a product called Golden Gate for their disaster recovery solution. In a nutshell, Golden Gate is a product that reads the transaction logs of the dbms and reconstructs the CRUD (Create, Read, Update, Delete) transactions on a DR system.
One thing I could immediately take to the bank is that any software used for disaster recovery with a history in production has to be rock solid. Big check mark there. A production DR solution cannot be resource invasive to the production system. The beauty of a dbms is its ability to trace every transaction for recovery. The transaction logs can be segregated from the OLTP resources in a way that does not impede the performance (to a large degree) of the OLTP environment. And the transaction log is going to contain every single CRUD event for whatever periodicity is set before the logs are rolled to tape. Another big check mark: I needed to begin extracting data from a system that would still be in use for at least 14 months. Finally, the DR solution essentially builds a mirrored database on another system. Winner, winner, chicken dinner! We had the trifecta I needed to take on this gig.
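This is not Golden Gate itself, but the core idea of log-based replication can be sketched generically: a log reader emits ordered change events, and an apply process replays them against the target. The event shape, table names, and in-memory "database" below are all assumptions made for illustration:

```python
def apply_event(target, event):
    """Replay one ordered change event against an in-memory 'target database'."""
    table = target.setdefault(event["table"], {})
    if event["op"] == "delete":
        table.pop(event["key"], None)
    else:  # insert or update both land the latest row image
        table[event["key"]] = event["row"]

target = {}
events = [
    {"op": "insert", "table": "orders", "key": 1, "row": {"status": "open"}},
    {"op": "update", "table": "orders", "key": 1, "row": {"status": "closed"}},
    {"op": "insert", "table": "orders", "key": 2, "row": {"status": "open"}},
    {"op": "delete", "table": "orders", "key": 2},
]
for ev in events:
    apply_event(target, ev)
```

Because the events come from the transaction log in commit order, the target converges on the same state as the source without ever querying the source tables directly.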
Real-Time ETL - What is Real? What is Practical?
If you go back up to the diagram you will see that a product like Golden Gate can be used to feed data from OLTP source systems into a centralized data warehouse. In dealing with data warehousing activities there are essentially four types of OLAP (On-line Analytical Processing) operations. They are:
- Roll-up (summarization)
- Drill-down (moving from summary back to detail)
- Slice and Dice (partial extraction)
- Pivot (rotation of the data view)
Of our four OLAP operations, we are presented with a few challenges. If we extract the data from the source system and do a "rollup" as part of the transformation process, how do we know the data that we rolled up is clean? WE DON'T! For any of the four OLAP operations, if we are batch loading, how do we know that the data has not changed on the source system? WE DON'T! Unless we go and reload it.
Understanding the Nature and Persistence of the Data You Are Working With
There are instances when a row of data on the source system has come to a stable, persistent state, meaning a state where it is no longer being updated. For example, closed accounts or closed episodes of care in a medical record. There are typically going to be markers and flags in the source database that can tell us the state of a row of data. But most data is fluid and can be changed. For instance, we cannot assume that month-end receivables that have been closed are not going to change. Someone could discover an accounting error months down the road that requires modification to the electronic books. If the error is widespread, it could tell a totally different story than the one currently being told in the warehouse that we have rolled up and reported on!
Remember, with batch feeds you are disconnected from the source. Gross changes to the source system might not be reflected in the target which can impact decision-making when viewing and summarizing data across a wider spectrum.
We could write an entire book on this subject but for now we need to move on.
Regarding Data Governance
Of the four OLAP operations, if you are rolling up data into your warehouse that is not persistent on the source system, how do you validate it? If you are pivoting data or rolling up data as an algorithmic function of the extract process, the job of data governance between what you have on the source system and what you have on the target system becomes quite a challenge.
As a quick aside, we sometimes use a product called Veridata in the systems we build. This product audits the target system data and ensures it matches what we received from the source system. Those audit reports can be stored and referenced, and come in handy when the data on the source system is purged.
Data Storage Is Cheap
If you noticed in the diagram above our warehouse is essentially positioned as the target system from our various source system feeds. As part of the feed we can implement the OLAP slice and dice function as part of the extraction process itself. In other words, in our Real-Time feed we are only bringing into the warehouse the columns and the rows from each table that we need.
Disks are cheap (see the I in RAID) and SAN solutions are readily available and easy to implement and administer. And, by bringing the data from the source system directly into the warehouse before transforming it (with the exception of slice and dice) we end up with traceable data governance back to the source systems (assuming they are not purging the data on the source system).
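The slice-and-dice-during-the-feed idea described above can be sketched as a simple filter applied to each row before it lands in the warehouse. The column names and the predicate are hypothetical:

```python
# Hypothetical column set the warehouse actually needs (the "slice").
WANTED_COLS = {"patient_id", "visit_date", "charge"}

def slice_and_dice(rows, predicate, wanted=WANTED_COLS):
    """Keep only the wanted columns (slice) of the rows that match (dice)."""
    return [{k: v for k, v in r.items() if k in wanted}
            for r in rows if predicate(r)]

rows = [
    {"patient_id": 1, "visit_date": "2007-01-02", "charge": 120.0, "ssn": "x"},
    {"patient_id": 2, "visit_date": "2006-12-30", "charge": 80.0, "ssn": "y"},
]
# Dice: only visits from 2007 on; slice: drop columns like ssn we never need.
recent = slice_and_dice(rows, lambda r: r["visit_date"] >= "2007-01-01")
```

Because this is the only transform done in-flight, every row in the warehouse still maps one-to-one back to a source row, which is what keeps the governance trail intact.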
Advantages of Real-Time ETL
- Greater access to the freshest data for decision makers.
- The source system and the target system stay in sync with one another as they are connected through a zero or near-zero latency pipe.
- The target system gets updated as activity occurs on the source system. However, it may not be desirable to delete rows on the target system that have been deleted on the source system. The Golden Gate software allows us to configure it so certain data is not deleted on the target system. (I do not know whether other vendors' products that do ETL by reading the dbms transaction logs have this functionality.)
- It saves the resources consumed when performing batch ETL directly against tables within an OLTP space.
- Makes data governance between the source and target systems easier to deal with.
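The delete-suppression behavior mentioned in the advantages above can be sketched generically. Golden Gate exposes this through its own configuration; the table name, event shape, and code below are invented to show the idea, not its syntax:

```python
# Tables whose rows should survive a delete on the source system
# (e.g., an archival target that must retain purged source data).
KEEP_DELETED = {"orders"}

def apply_event(target, event):
    """Replay a change event, suppressing deletes for configured tables."""
    table = target.setdefault(event["table"], {})
    if event["op"] == "delete":
        if event["table"] not in KEEP_DELETED:
            table.pop(event["key"], None)
    else:
        table[event["key"]] = event["row"]

target = {}
apply_event(target, {"op": "insert", "table": "orders", "key": 7,
                     "row": {"status": "open"}})
apply_event(target, {"op": "delete", "table": "orders", "key": 7})
# The row survives on the target even though the source deleted it.
```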
Disadvantages / What Real-Time ETL Does Not Do For Us (There is no silver bullet)
- It does not make rolling up and summarizing data in a mart or warehouse any easier than batch ETL.
- Requires monitoring of the transactional feed to catch any missing transactions (though this has benefits when it comes to data governance)
Parting Words and Gifts
Disaster recovery (DR) products have been around for years. They are solid and proven in the marketplace. For whatever reason these products have gotten little use outside of the OLTP environment. For the past five years the team I work with and I have built data warehouses, data marts, and data archival systems using these tools. We have used them for master data management synchronization and for upgrading product-level, enterprise dbms-driven OLTP software that has long implementation times and must run 24/7 with little to zero downtime during switchover.
Real-Time ETL also has great benefit in "off-boxing" data from the OLTP system for regular ol' reporting (a subset of what we now call Business Intelligence, or BI). Minimizing ad-hoc and scheduled reporting on the OLTP system through "off-boxing" frees resources and allows for greater scalability of the OLTP environment.
If properly implemented, real-time ETL gives decision makers access to the most current data. In the digital world we live in, timely information is worth its weight in gold.
If you made it this far my hope is that this has given you some food for thought. I also thank you for taking the time to hear what I have to say on the subject.
I like to get feedback but do not have time to combat all the spammers out there. The best means to contact me is by sending me a message through LinkedIn (I think you need to be a member to do so). Once you signup, just request to add me to your network and I will accept. I will do my best to respond to your message.
Special Thanks to Kate Matthews for editing this piece. She said she did not understand a word of it :-) but her markup was invaluable.