For a long time I’ve mulled over whether it’s necessary to keep both a batch and real time infrastructure for Data Integration. After all, if you have a fully functional real time Data Integration solution, then why do you have to operate two sets of software and monitoring?
This is not to say that a large organization must run only one Data Integration solution; several can coexist as long as they communicate effectively. Still, there are clear benefits to minimizing the number of technologies that support a particular capability. Each technology you run requires programmers who are experts in it and people to monitor its operations, in addition to the software licenses and hardware for each solution environment.
So, if you have a real time Data Integration solution in operation then why not just use it for your batch needs as well?
The problem is one of volume and processing time. The volumes involved in batch oriented data integration, such as a nightly Data Warehouse load, would probably be difficult for a real time solution to process within the available load window. Of course, it depends on the particular organization involved. And, given enough money and expertise, you can probably get a real time data integration solution to process whatever volumes are necessary in the time slot available.
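To make the volume-versus-time question concrete, here is a rough back-of-the-envelope sketch. The record count, the throughput figures, and the four-hour window are all made-up assumptions for illustration, not measurements of any real product, and the function name `fits_in_window` is hypothetical.

```python
def fits_in_window(records, records_per_second, window_hours):
    """Return (required_hours, fits) for a load processed at a steady rate."""
    required_hours = records / records_per_second / 3600
    return required_hours, required_hours <= window_hours


# Hypothetical figures: substitute your own measurements.
nightly_records = 200_000_000   # rows in the nightly warehouse load
window_hours = 4                # time slot available for the load

# Assumed sustained throughput of each kind of solution (records per second).
realtime_hours, realtime_ok = fits_in_window(nightly_records, 2_000, window_hours)
batch_hours, batch_ok = fits_in_window(nightly_records, 50_000, window_hours)

print(f"real time: {realtime_hours:.1f} h, fits window: {realtime_ok}")
print(f"batch:     {batch_hours:.1f} h, fits window: {batch_ok}")
```

With these assumed numbers the real time path would need more than a day to finish a load that has to fit in four hours, while the batch path finishes comfortably. Your own volumes and throughput may tell a very different story.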
Batch Data Integration solutions (i.e., ETL tools) are designed to handle large volumes in a small time slot, so for most organizations it is worthwhile to have both batch and real time Data Integration solutions in operation. If you are comparing the costs of the alternatives, remember to include the cost of monitoring the operations of multiple solutions as well as the cost of keeping expertise on staff for multiple technologies – and don’t forget the costs of conversion if you’re thinking of eliminating existing technologies.
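One way to lay out that comparison is sketched below. The cost categories come from the discussion above (licenses, hardware, expert staff, monitoring, conversion); every figure, the three-year horizon, and the names `SolutionCost` and `total_cost` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class SolutionCost:
    """Recurring yearly costs of running one Data Integration solution."""
    licenses: float
    hardware: float
    expert_staff: float
    monitoring: float

    def annual(self) -> float:
        return self.licenses + self.hardware + self.expert_staff + self.monitoring


def total_cost(solutions, years, conversion=0.0):
    """Recurring costs over the horizon plus any one-time conversion cost."""
    return years * sum(s.annual() for s in solutions) + conversion


# Hypothetical figures: replace with your organization's own estimates.
batch = SolutionCost(100_000, 50_000, 200_000, 80_000)
realtime = SolutionCost(150_000, 70_000, 250_000, 100_000)
# Real time solution scaled up to absorb the batch volumes as well.
scaled_realtime = SolutionCost(300_000, 200_000, 300_000, 120_000)

years = 3
keep_both = total_cost([batch, realtime], years)
consolidate = total_cost([scaled_realtime], years, conversion=400_000)

print(f"keep both:   {keep_both:,.0f}")
print(f"consolidate: {consolidate:,.0f}")
```

In this invented scenario keeping both solutions comes out cheaper over three years, mostly because of the conversion cost. A longer horizon or a cheaper conversion could tip the result the other way.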