Considerations for Using Azure Data Sync

  • Comments 3
Author:  Steve Howard
Technical reviewers: Prem Mehra, Thomas Kejser, Shaun Tinline-Jones, Murshed Zaman, Praveen M, Sudhesh Suresh
Azure Data Sync offers companies a way of synchronizing data between Azure databases and other Azure databases, or between Azure databases and on premises databases. On selected intervals, only changed rows are synchronized between databases in a synchronization group, thus minimizing the bandwidth requirement at sync time. This makes it an attractive option both for synchronizing data among production databases for locality or functionality, and also for moving data for migration to the cloud to minimize cutover time. However; the minimizing of the bandwidth and synchronization time comes at a cost which must be planned for when Data Sync services are part of your option. This article looks at considerations learned from a recent engagement.

How the tracking of changes is accomplished

When Data Sync is set up, on all tables to be synchronized, a new table is created using the naming convention DataSync.<tablename>_dss_tracking. This table will have the primary key columns from the base table, and 76 bytes/row of meta metadata about modifications to that row. The primary key values for every row in the tracked table, along with the meta-columns will be stored in this table so that at the end of this initial synchronization step, there will be a 1:1 ratio of rows in the base and tracking tables.
If you run the “Disk Usage by Table” report on a Synchronized database immediately after the provisioning, you will see a pattern like Figure 1.
clip_image002[6]

Figure 1: Immediately after provisioning databases to use Azure Data Sync, a base table and its associated tracking table will have a 1:1 ratio of rows. The size (Reserved (KB)) of the tracking table can be significant - sometimes even more than the base table as seen when comparing dbo.PeopleType with DataSync.PeopleType_dss_tracking.

For tables with only a clustered index, the total space required to store the tracking table data and its indexes (as shone in the “Reserved (KB)” column) may be near to the storage space required for the base table. In fact, for a narrow base table such as the PeopleType table in the example, the tracking table may require more storage space than the base table.
To track data modifications, 3 triggers are created on each base table: one for inserts; one for updates; and one for deletes. For inserts, a new row is inserted in the tracking table. However; for deletes, the row is not immediately deleted from the tracking table. If it were, then it would not be possible for the Sync agent to know what rows were deleted. In the current version of Data Sync, the deleted rows will be marked as tombstoned and will stay in the tracking tables for 45 days after a delete operation before being removed by the Sync agent. This time is not configurable in the current version of Sync. This means that after deletes have been run on a table, there will likely be more rows in the tracking table than in the base table, and the tracking table may require significantly more storage space than the base table.

Planning for Data Sync Implementation

When planning to use Azure Data Sync to synchronize databases, you need to plan to leave enough space to store the tracking data both in any on-premises databases, and in any Azure databases to be synchronized. This may mean provisioning larger databases in Azure. In the example in Figure 1, when a synchronization was attempted with a 10 GB Azure database, less than half the data could be synchronized. You can see that even if the People table was reduced to a size of 9GB, the data still could not be synchronized with a 10 GB Azure database because the tracking table for People is nearly as large as the People table. If you have enough storage in the databases for the initial provision, but run out of space in a subsequent synchronization, then Data Sync will synchronize as much as there is room to synchronize, then will stop synchronizing.
With this in mind, here are a few recommendations when planning to use Data Sync:

·         Run the “Disk Usage by Table” report on the database you want to synchronize.

·         Start with the estimation that it may require as much space to store the synchronized tables with their associated tracking data as it takes to store just the table.

·         Remember that wide primary keys will increase the storage space required as primary keys are stored in the tracking tables.

·         Remember that deleted rows are stored for 45 days in the tracking tables. Therefore, with tables with significant numbers of deletes, you may need significantly more than double the storage space of the base table.

Other Considerations and Best Practices

After implementation, you need to consider what happens with schema modifications, or when a synchronized table is truncated.

Schema Modifications

Since only the primary key columns are tracked, the only schema modifications of concern are those that modify the primary key. Consider this chain of events:

1.       Create a table with a primary key on col1.

2.       Insert a row with col1 = 1 and col2 = 1.

3.       Set this table up as a synchronized table.

4.       Drop the primary key and re-create it on (col1, col2).

5.       Insert a new row with col1 = 1 and col2 = 2.

When step 4 is performed, nothing happens to the tracking table, it will still have its primary key on col1 only. This means when step 5 occurs, another row will not be inserted into the tracking table. Instead, the existing row with col1 = 1 in this table is marked as updated. This creates a situation where the tables in the other synchronized databases cannot be updated properly. For this reason, when making changes to your primary keys on your tables, it may be necessary to drop and recreate your synchronization databases or groups.

Truncating tables

A truncation does not fire triggers. In the current version of data sync, nothing prevents synchronized tables from being truncating. However; truncating a table will cause the tables to be out of sync, and in a state where they cannot be synchronized without manual intervention. Do not truncate tables synchronized by data sync.

Conclusion

By understanding how Sync works, and by following best practices when dealing with tables with triggers, you can make use of Azure Data Sync as part of your migration, and as an ongoing part of your hybrid on-premises and Azure database applications. As with all Azure features, Data Sync is undergoing rapid development and improvement. Sync may operate differently in the future. Check the current documentation on Data Sync to verify operations as new features in SQL Azure and Windows Azure may make changes in Data Sync possible.
Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post
  • Hi Steve,

    Just wanted to add a some other points for consideration  (at least as of the current version) :

    1. Schema Modification  - It's not just the PK that is the only schema modification of concern. An addition/removal of a column as well as change in data types will also invalidate a sync group definition. For example, if you have added a 10-column table to a syncgroup and decide to remove one column, this would break the synchronization.

    2. PKs - the current release also cannot handle a change in the PK value. A change in a PK value is not propagated during synchronization.

    3. Bulk Inserts - since Change Tracking is done via triggers, any bulk operation (BULK_INSERT or BCP) that doesn't fire triggers will not insert or modify values in the tracking tables.

    Cheers,

    JuneT

  • Good One  Pram !

  • Awesome Work here Pram. Keep Kicking !!!

Page 1 of 1 (3 items)