Sometimes I have my head buried in my work and lose track of what is happening, or in this case, not happening, in the mainstream. Recently I did some research on the net and came up with very little regarding real-time ETL or transactional ETL.
As someone who understands the necessity of "buzzword compliance," I searched high and low on the terms "real-time ETL" and "transactional ETL" and was very surprised to find so little on the subject. What I did read left me scratching my head. It was mostly explanations as to why it was not possible or feasible.
I know Real-Time ETL is possible because the team I am a solutions architect for has been doing it for the past five years. Sometimes when you do something you just assume that everyone else is doing it too.
I am assuming my reader knows what ETL is, but I think we need to come to a common definition of the term for my discussion to make sense to you. ETL stands for Extract, Transform, and Load. The "extract" we should all be familiar with: it involves getting the data out of the source system. We also know that "load" involves getting the data into the target system. I want to note that there is nothing in the acronym that dictates the means by which we must extract or load. It is usually some batch process, but it can also be a "connected feed" that is real-time (or has some degree of latency). It can also be an SQL trigger on the source system that feeds some intermediate staging location. Now, the "transform" term is a little dicier and subject to some interpretation.
When it comes to technical terms, I have found we techies like to take common words and severely restrict their definition to a much narrower scope. I tend to look at the term "transform" in a broader context. First, when transforming data we might need to deal with data type issues, for example, moving an int from a dbms that defines it as 2 bytes to a dbms that defines it as 4 bytes. We might need to upcast or downcast the precision of a timestamp. We may want to transform a char to a varchar, or the other way around, depending on what data types we have available in our target dbms.
We sometimes even have to make data type transformations for more exotic things, like converting a date stored as a number to an SQL date. In an old healthcare medical record database I worked with, we needed to transform a 5-byte timestamp that was stored in a character array into an integer field in the target system.
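To make the number-to-date idea concrete, here is a minimal sketch. It assumes the common YYYYMMDD integer encoding; the function name and the encoding are my illustration, not the format of the healthcare system mentioned above.

```python
from datetime import date

def numeric_to_date(n: int) -> date:
    # Split a YYYYMMDD integer (e.g. 20240315) into year, month, and day,
    # then build a proper SQL-friendly date value from the parts.
    return date(n // 10000, (n // 100) % 100, n % 100)

print(numeric_to_date(20240315))  # 2024-03-15
```

The same pattern works in reverse when the target dbms only offers a numeric column for the date.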
But data transformation can involve much more than data types. In the case of certain types of warehousing or data mart implementations, we might need to take discrete row data and roll it up or pivot it. And of course, row data in the source system is sometimes in a very denormalized state, and we need to normalize it as part of the data transformation process and store it in multiple tables in the target dbms.
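A roll-up in this sense is just aggregating detail rows into one summary row per key. A minimal sketch, with hypothetical region/amount rows standing in for real source data:

```python
from collections import defaultdict

# Hypothetical detail rows extracted from the source system.
rows = [
    {"region": "East", "amount": 100},
    {"region": "East", "amount": 50},
    {"region": "West", "amount": 75},
]

def rollup(rows, key, measure):
    # Roll discrete row data up into one summary total per key value.
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

print(rollup(rows, "region", "amount"))  # {'East': 150, 'West': 75}
```

The governance problem discussed later comes from exactly this step: once the detail rows are collapsed into totals, the originals are no longer in the warehouse to validate against.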
I think there are two reasons database professionals do not see real-time ETL as viable: first, they might have a narrow definition of the term "transform"; second, folks who publicly discuss ETL tend to work in the data warehouse domain... which brings me to my next subject.
ETL is not just confined to data warehousing activities!
The concept of extract, transform, and load is practiced in many other venues of software. ETL activities are done ...
Now that we have my operating definition of ETL and have discussed the ETL spectrum, I want to get back to the subject of Real-Time ETL starting with a picture.
This should be obvious. What may not be so obvious is how we achieve the feed. Database occupations tend to be very segregated in terms of work scope. Of course, the industry has only three or four generic occupations we all use: Database Administrator, Database Architect, Database Analyst. But because such deep knowledge is required, many of us specialize in particular facets of data management.
Physical DBAs who work in production OLTP (Online Transaction Processing) environments are acutely aware of a tiny little matter :-) called disaster recovery, or DR. While production systems use mirrored disks and even multiple controller paths to those disks, they still need a means of recovering from a catastrophic system failure. OLTP environments cannot afford to drop transactions or lose data. Think NYSE, Visa, Amazon, Boeing, The Mayo Clinic, DOD, etc. Enterprise data is an organization's lifeblood. DR solutions need to capture transactional activity in real time. Batch is just not going to cut it.
About five years ago I was approached by my manager. He told me that a departing customer had contacted him to see if we could build them a data archival system. The whole story behind this is rather long, and much of it is not related to the subject at hand, so I will not digress.
On the surface it seemed like a pretty simple problem to solve. We needed to move data from their source system into a more affordable dbms. Of course, they also wanted a front end application. And their data was stored in a proprietary dbms (H-P NonStop Enscribe). But all of that was really not the challenge.
The real challenges were ...
Hmmmmmm... batch was out of the question. As an architect I have to dance in many functional and technical camps. Fortunately, I had a decent understanding of (though not a lick of practical experience with) DR practices. I knew many of our customers were using a product called Golden Gate for their disaster recovery solution. In a nutshell, Golden Gate reads the transaction logs of the dbms and replays the CRUD (Create, Read, Update, Delete) transactions on a DR system.
One thing I could immediately take to the bank is that any software used for disaster recovery with a history in production has to be rock solid. Big check mark there. A production DR solution cannot be resource-invasive to the production system. The beauty of a dbms is its ability to trace every transaction for recovery. The transaction logs can be segregated from the OLTP resources in a way that does not impede the performance (to a large degree) of the OLTP environment. And the transaction log is going to contain every single CRUD event for whatever periodicity is set before rolling the logs to tape. Another big check mark: I needed to begin extracting data from a system that would still be in use for at least 14 months. Finally, the DR solution essentially builds a mirrored database on another system. Winner, winner, chicken dinner! We have the trifecta I needed to take on this gig.
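The core idea of log-based replication can be sketched very simply: an ordered stream of change events, reconstructed from the transaction log, is replayed against a target keyed by primary key. The event shapes below are my own illustration, not Golden Gate's actual trail format, which is proprietary.

```python
# Replay an ordered stream of log-derived change events against a
# target "table" modeled as a dict keyed by primary key.
def apply_event(target: dict, event: dict) -> None:
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]       # upsert the full row image
    elif op == "delete":
        target.pop(key, None)            # remove the row if present

# Hypothetical events in commit order, as read from the transaction log.
log = [
    {"op": "insert", "key": 1, "row": {"acct": 1, "bal": 100}},
    {"op": "update", "key": 1, "row": {"acct": 1, "bal": 80}},
    {"op": "insert", "key": 2, "row": {"acct": 2, "bal": 40}},
    {"op": "delete", "key": 2},
]

target = {}
for event in log:
    apply_event(target, event)

print(target)  # {1: {'acct': 1, 'bal': 80}}
```

Because the events come from the log rather than from queries against the live tables, the production OLTP workload is largely untouched, which is exactly why a DR-grade product makes a good real-time feed.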
If you go back up to the diagram you will see that a product like Golden Gate can be used to feed data from OLTP source systems into a centralized data warehouse. In dealing with data warehousing activities there are essentially four types of OLAP (On-line Analytical Processing) operations. They are: roll-up, drill-down, slice and dice, and pivot.
Of our four OLAP operations, we are presented with a few challenges. If we extract the data from the source system and do "roll-up" as part of the transformation process, how do we know the data we rolled up is clean? WE DON'T! For any of the four OLAP operations, if we are batch loading, how do we know the data has not changed on the source system? WE DON'T, unless we go back and reload it!
There are instances when a row of data on the source system has come to a stable, persistent state, meaning a state where it is no longer being updated. For example, closed accounts or closed episodes of care in a medical record. There are typically going to be markers and flags in the source database that can tell us the state of a row of data. But most data is fluid and can be changed. For instance, we cannot assume that month-end receivables that have been closed are not going to change. Someone could discover an accounting error months down the road that requires modification to the electronic books. If the error is widespread, it could tell a totally different story than the one currently being told in the warehouse we have rolled up and reported on!
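Using those markers looks something like the sketch below. The status column, its values, and the notion of "stable" states are all hypothetical; every source schema names these differently.

```python
# Hypothetical rows carrying a status flag that marks whether the row
# has reached a stable, no-longer-updated state (e.g. a closed account).
rows = [
    {"acct": 1, "status": "CLOSED", "bal": 0},
    {"acct": 2, "status": "OPEN",   "bal": 55},
    {"acct": 3, "status": "CLOSED", "bal": 0},
]

STABLE_STATES = {"CLOSED"}

def stable_rows(rows):
    # Keep only rows whose marker says they are safe to roll up.
    return [r for r in rows if r["status"] in STABLE_STATES]

print([r["acct"] for r in stable_rows(rows)])  # [1, 3]
```

Even then, as the receivables example shows, "stable" is a business judgment rather than a guarantee.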
Remember, with batch feeds you are disconnected from the source. Gross changes to the source system might not be reflected in the target which can impact decision-making when viewing and summarizing data across a wider spectrum.
We could write an entire book on this subject but for now we need to move on.
Of the four OLAP operations, if you are rolling up data into your warehouse that is not persistent on the source system, how do you validate it? If you are pivoting or rolling up data as an algorithmic function of the extract process, the job of data governance between what you have on the source system and what you have on the target system becomes quite a challenge.
As a quick aside, we sometimes use a product called Veridata in the systems we build. This product audits the target system to ensure its data matches what we received from the source system. Those audit reports can be stored and referenced, and they come in handy when the data on the source system is purged.
As you may have noticed in the diagram above, our warehouse is essentially positioned as the target system for our various source system feeds. As part of the feed, we can implement the OLAP slice and dice function as part of the extraction process itself. In other words, in our Real-Time feed we bring into the warehouse only the columns and rows from each table that we need.
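Slice and dice at extraction time is just projection (pick columns) plus selection (pick rows). A minimal sketch, with made-up column names and a made-up filter:

```python
def slice_and_dice(rows, wanted_columns, predicate):
    # Project only the needed columns and keep only the matching rows,
    # so the real-time feed carries just what the warehouse requires.
    return [
        {col: row[col] for col in wanted_columns}
        for row in rows
        if predicate(row)
    ]

# Hypothetical source rows; internal_note never leaves the source system.
source_rows = [
    {"id": 1, "region": "East", "amount": 100, "internal_note": "x"},
    {"id": 2, "region": "West", "amount": 75,  "internal_note": "y"},
]

feed = slice_and_dice(source_rows, ["id", "amount"],
                      lambda r: r["region"] == "East")
print(feed)  # [{'id': 1, 'amount': 100}]
```

Doing this in the feed, rather than after loading, keeps the warehouse lean without sacrificing the row-level traceability discussed next.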
Disks are cheap (see the I in RAID) and SAN solutions are readily available and easy to implement and administer. And, by bringing the data from the source system directly into the warehouse before transforming it (with the exception of slice and dice) we end up with traceable data governance back to the source systems (assuming they are not purging the data on the source system).
Disaster recovery (DR) products have been around for years. They are solid and proven in the marketplace. For whatever reason, these products have gotten little use outside of the OLTP environment. For the past five years, the team I work with and I have built data warehouses, data marts, and data archival systems using these tools. We have used them for master data management synchronization and for upgrading enterprise dbms-driven OLTP products that have long implementation times and must run 24/7 with little to zero downtime at switchover.
Real-Time ETL also has great benefit in "off-boxing" data from the OLTP system for regular ol' reporting (a subset of what we now call Business Intelligence, or BI). Minimizing ad-hoc and scheduled reporting on the OLTP system through "off-boxing" frees resources and allows for greater scalability of the OLTP environment.
If properly implemented, real-time ETL gives decision makers access to the most current data. In the digital world we live in, access to timely information is worth its weight in gold.
If you made it this far my hope is that this has given you some food for thought. I also thank you for taking the time to hear what I have to say on the subject.
I like to get feedback but do not have time to combat all the spammers out there. The best way to contact me is by sending a message through LinkedIn (I think you need to be a member to do so). Once you sign up, just request to add me to your network and I will accept. I will do my best to respond to your message.
Special Thanks to Kate Matthews for editing this piece. She said she did not understand a word of it :-) but her markup was invaluable.