Data synchronization - gusenov/kb GitHub Wiki
- objc.io / Two years of issues on advanced iOS and macOS development / Data Synchronization
- The Elements of Sync
- Identifying corresponding objects across stores
- it's important that objects in different stores can be correlated with one another, hence the need for global identifiers
- objects in different stores with the same global identifier are considered to be logically representative of a single instance
- Changes to one object should eventually result in the corresponding object also being updated.
- UUIDs are not appropriate for all objects
- singleton object
- tag-like objects, where uniqueness is determined by a string
- Logically equivalent objects in different stores should have the same identifier, and objects that are not equivalent should have different identifiers.
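A minimal sketch (Python, with illustrative names) of identifier choices that follow these rules: random UUIDs for ordinary objects, deterministic identifiers for tag-like objects, and a fixed identifier for a singleton:

```python
import uuid

def identifier_for_note() -> str:
    # Ordinary objects: a random UUID is fine, because each creation
    # really is a distinct logical instance.
    return str(uuid.uuid4())

def identifier_for_tag(name: str) -> str:
    # Tag-like objects: uniqueness is determined by the string itself,
    # so derive the identifier deterministically. Two stores creating
    # the tag "urgent" independently will agree on its identity.
    return "tag:" + name.strip().lower()

def identifier_for_settings_singleton() -> str:
    # Singleton objects: a fixed, well-known identifier.
    return "settings"
```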
- Determining what has changed since the last sync
- how a sync algorithm determines what has changed since the last synchronization event
- what should be changed locally
- Each change to an object (often called a delta) is usually handled as a CRUD operation: a creation, read, update, or deletion.
- Granularity
- Should all properties in an entity be updated if any single property changes?
- Or should only the changed property be recorded?
- you need a means to record a change (a minimal sketch follows below)
- Boolean attribute in the local store, indicating whether the object is new or has been updated since the last sync
- changes could also be stored outside the main store as a dictionary of changed properties with an associated timestamp
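A minimal sketch, assuming a simple in-memory structure, of recording per-property deltas with timestamps outside the main store:

```python
import time

# Per object id, a dictionary of changed properties with the value and
# the time of the change (structure is illustrative).
pending_changes: dict[str, dict[str, tuple[object, float]]] = {}

def record_change(object_id: str, prop: str, new_value: object) -> None:
    # Record only the changed property (fine-grained deltas), together
    # with a timestamp, so the next sync sends only what actually changed.
    pending_changes.setdefault(object_id, {})[prop] = (new_value, time.time())

def drain_changes() -> dict:
    # The sync algorithm consumes the recorded deltas and resets the log.
    changes = dict(pending_changes)
    pending_changes.clear()
    return changes
```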
- Resolving conflicts due to concurrent changes
- reading and writing a store can be considered an atomic operation, and resolving conflicts simply involves choosing which version of the store to keep
- assume the latest sync operation takes priority
- comparing the creation timestamps of conflicting changes and keeping the most recent
- it is important that the resolution be deterministic
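A sketch of one deterministic resolution rule (field names are assumptions): keep the most recent change, breaking timestamp ties by store identifier so that every store resolves the same conflict identically:

```python
def resolve(change_a: dict, change_b: dict) -> dict:
    # Each change carries a "timestamp" and the "store_id" it came from.
    key_a = (change_a["timestamp"], change_a["store_id"])
    key_b = (change_b["timestamp"], change_b["store_id"])
    # Tuples compare lexicographically: timestamp first, then store_id.
    return change_a if key_a > key_b else change_b
```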
- IBM Developer / Offline data synchronization
- ideal solution: all data is always available offline, never outdated, and synchronized smoothly in the background for every possible app function
- technological limitations, such as the sheer mass of data that needs to be synchronized or limitations in computational power to efficiently process complex data synchronization logic
- Basic strategies
- Mass of data
- consider how best to fit the time slots for downloading the volume of data users need into their daily or weekly work schedules
- Sync cycles and prioritization
- how often offline data is updated and how updates are prioritized
- Some of the data might only be updated once a year, while other data requires updates several times a day, so you can define different synchronization cycles for different chunks of data based on business rules. The smaller the packages of data that require high-frequency updates, the better.
- A helpful tool is the use of forced ranking for the different chunks of data.
- Delta sync: Preprocessed versus on-demand
- delta between the current data on the mobile device and the server
- Calculating such a delta can be a highly complex task for the server. Processing time can be significantly higher than the time required to exchange the data. This becomes even worse if a large number of clients request individual deltas at the same time from the server.
- A better approach can be preprocessing deltas by the server that are shared with all clients. For example, the server could preprocess a delta for a chunk of data overnight based on the updates of the past working day. In the morning, every client could request a copy of this delta file in order to get updated with the past day's changes.
- when access and delta processing capabilities of the back-end systems are limited, the data can be replicated to a dedicated data store in the mobile middleware
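A sketch of the preprocessing idea; the file name and layout are assumptions:

```python
import json
from datetime import date

def preprocess_daily_delta(updates_of_the_day: list) -> str:
    # The server computes one shared delta file per day from the day's
    # updates; every client downloads the same file the next morning
    # instead of requesting an individual delta.
    path = f"delta-{date.today().isoformat()}.json"
    with open(path, "w") as f:
        json.dump(updates_of_the_day, f)
    return path  # served unchanged to all clients
```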
- Advanced strategies
- Modification of offline data
- it's best if data modification actions are disallowed, or at least limited, while the user is offline
- By their nature, offline data modifications are commonly used for documentation rather than collaboration with other workers. It's sufficient to cache this data and play it back to the server the next time the user has an online connection. Most conveniently for the user, this can happen seamlessly in the background (sketched below).
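A sketch of such background playback, with `send` standing in for the real upload call:

```python
# Cached offline modifications, replayed once a connection is available
# (names are illustrative).
offline_queue: list = []

def modify(record: dict, online: bool, send) -> None:
    if online:
        send(record)
    else:
        offline_queue.append(record)  # cache for later playback

def on_connection_restored(send) -> None:
    # Play the cached modifications back to the server in order,
    # seamlessly for the user.
    while offline_queue:
        send(offline_queue.pop(0))
```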
- Shared data sync
- If one dedicated user can modify the data, many users should be able to as well.
- By the very nature of this problem, conflicts can't be avoided.
- A viable approach for many scenarios is that the first update of a data set wins, and any other updates on outdated data will be ignored.
- you need to investigate these scenarios and take them into consideration when defining potential conflicts for offline functions on shared data
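A minimal sketch of the first-update-wins rule, assuming each record carries a version number:

```python
records = {"42": {"value": "initial", "version": 1}}

def try_update(record_id: str, new_value, based_on_version: int) -> bool:
    current = records[record_id]
    if based_on_version != current["version"]:
        return False  # another user updated first; this stale update is ignored
    current["value"] = new_value
    current["version"] += 1
    return True
```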
- Auto-sync versus manual sync
- Modern apps have gotten users accustomed to automatic synchronization processes that run in the background. The user does not need to worry about data updates and is not blocked from doing work. This is very convenient, and many clients expect this approach as standard today. However, this might not be the best approach in complex synchronization scenarios.
- In cases with long sync processing times, the app might be in an inconsistent state, especially if the user can work on the data while the sync process is running in the background.
- If a large amount of data needs to be synced, the user might need to be able to control when and where it starts to download.
- And if data modifications are cached, the user needs control to reliably send them back to the server within a certain amount of time, such as before the end of a work shift.
- Push versus pull
- how to initiate the synchronization: the two major approaches are a push initiated by the server and a pull initiated by the client
- Push is usually used for small changes so that every time a small piece of data is modified, the server sends out a push notification to all clients.
- The other approach — a pull mechanism — provides more reliability. A client contacts the server and requests all data updates since the latest request.
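A sketch of a pull client; the endpoint, parameter, and payload fields are assumptions:

```python
import json
import urllib.request

last_sync_token = "0"  # a real client would persist this between runs

def pull_updates(base_url: str) -> list:
    # Ask the server for all updates since the last successful request;
    # nothing is missed between pulls, which is what makes pull reliable.
    global last_sync_token
    url = f"{base_url}/updates?since={last_sync_token}"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    last_sync_token = payload["server_time"]
    return payload["updates"]
```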
- What is data synchronization? And why is it important?
- Having the same record types appear across applications is clearly essential, but the process of manually re-entering data in applications leaves employees prone to errors that create data discrepancies between systems.
- Data synchronization, or data sync, is the continual process of keeping a record type identical between two or more systems. This can be done in real time, in near real time, or in batches.
- Your synchronization process can work in one of two ways:
- One-way data sync
- It’s simply when changes in the source system lead to changes in downstream systems, but not the other way around
- Two-way data sync
- It’s when changes in either the source system or a downstream system lead to changes in the other systems.
- A few real-world examples:
- Syncing employee data
- Syncing incident data
- Syncing customer data
- Benefits of synchronizing data
- Data silos are removed. Now that employees can access the data they need in the apps they work in, they can avoid the tedious process of requesting access to it—or worse, being unaware that the data even exists.
- Extensive data entry can be prevented. Data synchronization ensures that employees don’t have to re-enter data across apps, and in doing so, it allows them to avoid the negative consequences highlighted above.
- Several data operations can be performed
- Data can be synced in near real time
- FAQ
- How does data integration relate to data synchronization?
- Data integration involves taking data from a variety of sources, validating it (to remove redundancies and inaccuracies), transforming it (to fit the data model used by the data warehouse), and then loading it into the data warehouse.
- Once the data lives in the warehouse, it can be synced with downstream systems in real time, in near real time, or by using the batch method.
- Do data synchronization and data replication mean the same thing?
- unlike data syncing, data replication is often used to back up a full dataset, with the intention of maintaining a high level of data availability
- Common Data Sync Strategies for Application Integration
- Data integration is a technique that is used to synchronize information silos
- One-Way Sync
- Record “Flag” and validate
- the records are extracted from the source application based on some “Flag” value
- Upon successful sync of a record to the other application, the flag is updated on the source application so that the synchronization does not re-capture the same data.
- the flag can be just a bit field representing true/false, or the application may naturally provide a status field
- update this field just after the record is successfully captured
- set the default value of the flag to “Not synced”, or define an initialization value
- update the status of the flag whenever the source data is changed again
- Pros and Cons
- If the flag field is exposed to the user through the user interface, the user can also trigger sync just by updating the record.
- The integration remains stateless.
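A minimal sketch of the flag-based approach, assuming a boolean `synced` field:

```python
def extract_unsynced(records: list) -> list:
    # Extraction is driven purely by the flag on the source records,
    # which is why the integration itself can stay stateless.
    return [r for r in records if not r["synced"]]

def mark_captured(record: dict) -> None:
    # Set the flag immediately after successful capture so the next run
    # does not pick up the same record again.
    record["synced"] = True

def on_user_edit(record: dict) -> None:
    # Any change to the source data marks the record for re-sync.
    record["synced"] = False
```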
- Remember “Last Modified Time”
- the delta is captured using the last-update date
- use a timestamp to capture data changes
- best suited when the API provides a filter criterion for record retrieval
- Pros and Cons
- makes the integration “stateful,” which is sometimes troublesome
- works poorly if the application does not provide filtering by date and time at millisecond precision; it also fails when data imports generate the same DateTime across multiple records
- process sometimes gets complex when there are multiple parameters involved, like pagination, sort criteria, etc.
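A sketch of watermark-based delta capture, illustrating both the idea and its statefulness (record layout is an assumption):

```python
from datetime import datetime, timezone

# The watermark is what makes this strategy stateful: it must be
# persisted between runs in a real integration.
watermark = datetime(1970, 1, 1, tzinfo=timezone.utc)

def capture_delta(records: list) -> list:
    # Retrieve only records modified after the last watermark, then
    # advance the watermark. Records sharing the exact same DateTime
    # (e.g., from bulk imports) can defeat this scheme, as noted above.
    global watermark
    changed = [r for r in records if r["modified_at"] > watermark]
    if changed:
        watermark = max(r["modified_at"] for r in changed)
    return changed
```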
- Capture Data
- record some kind of checksum or hash for each record on the integration platform, so that duplicates can be identified easily even though the same data is retrieved on every run
- store the record id, such that one can easily identify data modifications
- suitable mostly for master data, and should not be used for transactional records
- Pros and Cons
- very stateful, and might sometimes require the entire record to be stored in the integration platform
- With growing data size, the checking of each record can be cumbersome and time-consuming.
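A sketch of hash-based change detection; the record layout is an assumption:

```python
import hashlib
import json

# Hash per record id, kept on the integration platform (stateful).
stored_hashes: dict = {}

def record_hash(record: dict) -> str:
    # A canonical serialization makes the hash stable across runs.
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

def detect_changes(records: list) -> list:
    # A record whose hash differs from the stored one has been modified,
    # even if the full dataset is retrieved every time.
    changed = []
    for record in records:
        digest = record_hash(record)
        if stored_hashes.get(record["id"]) != digest:
            changed.append(record)
            stored_hashes[record["id"]] = digest
    return changed
```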
- Two-Way Sync
- two-way sync is most often required for entities like customers, contacts, and vendors, rather than invoices, quotes, and orders
- Bidirectional updates are sometimes critical and require manual conflict management, which in most cases is not recommended.
- two-way sync requires the same recordset to be updated multiple times
- Challenges
- Conflict Management Technique
- As the two applications involved in the sync operation can both update the same record, conflicts can occur between them.
- mark one application as the master and the other as the slave; the integration tries to merge unique values together and automatically rejects slave updates whenever there is a conflict between the two apps
-
Circular Update Management
- Let us suppose that an update to a record generates new modified time for a record.
- The integration looks for new changes and finds the record again to update.
- It then writes the update to the other application, changing that record's modified date and triggering the cycle again.
- To solve these challenges, we can devise two methods:
- Distinctly identify updates from User vs. Integration
- enhances the flag-based approach
- stores some additional data that is written only when the data is modified or added through the integration platform
- the platform knows the context and performs additional handling to ensure that only data operations performed from the user interface mark the record for sync (sketched below)
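A sketch of this context marker, with the `modified_by` field as an assumption:

```python
INTEGRATION_ACTOR = "integration-platform"

def apply_integration_update(record: dict, new_value) -> None:
    # Writes made by the integration tag themselves, so they can be
    # told apart from user edits.
    record["value"] = new_value
    record["modified_by"] = INTEGRATION_ACTOR

def needs_sync(record: dict) -> bool:
    # Only user-originated modifications are re-captured; integration
    # writes do not restart the sync cycle.
    return record["modified_by"] != INTEGRATION_ACTOR
```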
- Automatically merge data with manual conflict management
- the integration platform decides which data to update and which not
- the integration platform chooses the right value based on the record update or on the master-slave relationship
- this kind of approach can still produce conflicts, which are put into a data bucket for a manual fix
- Pros and Cons
- requires manual intervention in some cases, which might delay the record update for certain conflicting recordsets
- As conflicts are managed automatically by merging data together, there might be cases where the merged data is wrong.
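A sketch of such a merge with a conflict bucket; the record layout is an assumption:

```python
# Rejected slave values are collected here for a later manual fix.
conflict_bucket: list = []

def merge(master: dict, slave: dict) -> dict:
    merged = dict(slave)  # start with the slave's unique values
    for field, master_value in master.items():
        if field in slave and slave[field] != master_value:
            conflict_bucket.append(
                {"field": field, "master": master_value, "rejected": slave[field]}
            )
        merged[field] = master_value  # master wins on conflict
    return merged
```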
- Real-Time Sync
- There are two patterns that one can follow while developing real-time sync:
- Synchronous Pattern
- This pattern is based on the Request/Reply model. When an end user makes changes in the application, the data is pushed to a URL; the integration then performs the transformation and sends the data to the destination application. Once the transaction is complete, the integration creates a reply and responds to the requesting channel.
- Pros and Cons
- The synchronous pattern is safe and the source application can identify whether the process has completed successfully or not.
- The synchronous pattern responds slowly, so the source application might need to wait longer.
- Asynchronous Pattern: Fire and Forget
- Pros and Cons
- The fire-and-forget pattern is ideal when the source application is under heavy load and does not care whether the data is actually processed.
- The fire-and-forget pattern has the overhead of calling another API after execution to check whether the data was successfully processed.
- Fire and forget needs a backing queue in the integration platform to ensure messages are processed in the correct order (sketched below).
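A sketch of fire-and-forget backed by an in-process queue (a real platform would use a durable message queue); the handler is a stub:

```python
import queue
import threading

message_queue: queue.Queue = queue.Queue()

def fire_and_forget(change: dict) -> None:
    # The source enqueues the change and returns immediately; it does
    # not wait for processing.
    message_queue.put(change)

def worker(process) -> None:
    # A single worker drains the queue, so messages are processed in
    # the order they arrived.
    while True:
        change = message_queue.get()
        process(change)  # actual delivery to the destination application
        message_queue.task_done()

threading.Thread(target=worker, args=(print,), daemon=True).start()
```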
- Veritas / The Complete Data Synchronization Guide
- Differences
- Data synchronization: type of integration that keeps data consistent between databases
- Data Integration: combining pieces of software or data from different sources into a unified view or single dataset
- Data Pushes: takes data from a designated point “A” to point “B” immediately after its creation
- Data Replication: stores similar data in several locations
- Types of database synchronization:
- Insert: copies new source table records to the target table
- Update: tracks the table row values and replaces the changed records in the target tables
- Drop: removes corresponding records from the destination database when they are removed from the source
- Mixed: updating, adding, and deleting records in the target database
- Steps:
- An Update Event is Triggered: detects a change made to the data
- Changes Identified and Extracted: identify instances where changes are made
- Changes Made to Other Sources: schedules the movement of data
- Asynchronous: for example, once an hour or once a day
- Synchronous: runs after every change
- Incoming Changes Parsed: incoming data passes through a transformation layer that includes cleansing and harmonization
- Changes Applied to Existing Data: writes incoming changes to the target data
- Transactional: applies changes one-by-one
- Snapshot: applies changes in aggregate
- Merge: merges changes if they occur on both sides
- Successful Updates Confirmed: return a message confirming its success
- Data Synchronization Methods:
- File Synchronization
- Version Control
- Distributed File Systems (DFS)
- Mirror Computing
- Data Synchronization Use Cases:
- Data Harmonization: if the customer changes their address in their e-commerce account, the change should reflect in all other systems using a synchronization process
- Distributed Computing: cloud server reflects and stores any changes they make and forces an update on all the connected devices to replace older versions with the latest copies
- Storage and Analysis: during a disaster recovery scenario, an organization will need an up-to-date data snapshot
- Distribute Updates: amending the structure of a relational database
- Data Synchronization
- Part 1 – One-way Integration
- The Agile principle of delivering value early and often is critical when replacing large, existing software systems.
- Managing the scope of potential releases is key to providing value early and often.
- Viable Releasable Increment (VRI) – The set of features that could be released together to satisfy the needs of users and the selected data integration strategy.
- Loading data in one direction can be a good approach to feed read-only features of either system.
- One-time load & switch
- Load the production data for the segment that you’re replacing once, and then require all future work to happen in the new system.
- Scheduled batch loads
- Load the production data from one system into the other on a regular basis (e.g., nightly, hourly, etc.).
- Event-based integration (near real-time)
- This type of integration involves near-real-time syncing of data from one system to the other triggered by events that change data in the system (e.g., user editing data via a web app, integration with another system, scheduled processes, etc.).
- Part 2 – Two-way Integration
- Event-based integration (near-real-time)
- Continue to use the existing data model and storage
- build a new application on top of the existing data model and storage so that the new and old system share it
- Part 3 – No Integration
- Separate Systems, Separate Data
- you don’t import any data to the new system, and you don’t load any into the existing system. Simply switch over to the new system and start using it.
- Duplicate Maintenance
- you allow editing in both systems and don’t attempt to synchronize the data. Users will make updates in both places and keep them in sync.
- Talend / What Is Data Synchronization and Why Is It Important?
- Data synchronization is the ongoing process of synchronizing data between two or more devices and updating changes automatically between them to maintain consistency within systems.
- In order to successfully synchronize your data, it must pass through five phases:
- Extraction from the source
- Transfer
- Transformation
- Transfer
- Load to target
- DZone Integration / Database Synchronisation Is an Integration Pattern!
- Change data capture
- Microsoft Sync Framework can be used to synchronize data across multiple data stores
- Stack Overflow
- Software Engineering Stack Exchange