For my part in this month’s blogfest, I want to talk about a tricky data transfer problem I needed to tackle. We had a need to make the data from a group of tables available for ad hoc queries, but we did not want the load on the production database. The BI team, who needed the access, wanted to pull data via linked server every 15 minutes. Yeah, linked servers … that’s not going to happen.
They asked us to make the tables available via replication or some other operational process. And they needed it done as quickly as possible because they had not anticipated that we would deny their request for a linked server and were planning to go live very soon. Although I am a firm believer of the saying, “a failure to plan on your part does not constitute an emergency on mine,” I wanted to help them out if I could.
Replication was not a possibility because this was a database for a vendor application, and our ability to modify it is limited. The tables in question do not have primary keys. Log shipping was out of the question because the database was in simple recovery model as required by the application. My first thought was to move data via SSIS via CDC, but there was no way were going to waste an Enterprise license on this application. It was using Standard Edition and CDC isn’t possible.
I had decided that moving data via SSIS was the best way to handle this situation, but I had some hurdles to overcome. The challenges to this little project were:
- No CDC or Change Tracking
- The application dynamically creates and drops tables when it seems fit
- No “modified date” or primary key value to identify new or changed data
- Data is regularly entered with bad dates causing older tables or tables for future data to have data added
Identifying New or Changed Data
Without some sort of change tracking mechanism or the ability to modify the schema to add keys or update dates, I resorted to using the Dynamic Management View (DMV) sys.dm_db_index_usage_stats to identify tables that had had changes since the last time I ran the SSIS update. I couldn’t identify which rows had changed, but I could identify the tables that had changed (and they only averaged 20K rows per table) and then merge the data after transfer.
The query below is the one I used to identify the changed tables. I track the last time the package ran as part of the package and update the date value at the bottom at run time. Fortunately, the tables I want all start with “ntEventLog” followed by a datevalue, so it is easy to get the list of tables.
Select OBJECT_NAME(object_id, database_id) As TableName From sys.dm_db_index_usage_stats Where database_id = DB_ID() And OBJECT_NAME(object_id, database_id) Like 'ntEventLog%' And index_id In (0, 1) And user_updates + system_updates > 0 And (last_user_update > '1900-01-01 00:00:00.000' Or last_system_update > '1900-01-01 00:00:00.000')
I insert the table names into a tracking table and then iterate through them dynamically in the SSIS package. No matter how many tables there are, I only have a single loop container that pulls data. The container pulls the list of tables, and then transfers each table to a single staging table on the destination server. This is the package from a high level:
The 3 tasks on the left get the list of tables to be transferred and insert them into a tracking table. The next step, “Get Tracked Tables”, get the list of tables to iterate through, and the container dynamically updates the query to use to pull the data from the tables and transfer to the source database. Then the job runs a stored procedure that merges the data from the staging table into the final table with only the rows that are actually new or updated merged in.