Azure Storage - barialim/architecture GitHub Wiki

Table of Contents

Overview
Databases
Data Caching
- Time to Live or TTL
- Type of caching services
Analytics & Big Data Platform
Terminology

Overview

Databases

Azure Database for PostgreSQL

Azure Database for PostgreSQL is available in three deployment modes:

Single Server
Flexible Server
Hyperscale (Citus)

Direct connection to DB vs pulling data through REST API

In a large application backed by a database with millions of records, you quickly realize that scattering plain CRUD statements through your code is not a good way to implement complex logic.

So you start considering alternatives, such as moving the complex logic, with its many conditional blocks, into stored procedures and triggers kept in dedicated files. After some time you realize that this is also an anti-pattern for how an application talks to a database: it improves on what you had, but it is cumbersome to maintain. The database still performs well for CRUD operations thanks to its built-in caching; the problem is the long stored procedures themselves.

Problem with Stored Procedures

Difficult to manage and migrate stored procedures to a different DB

Now imagine you want to switch to a database that does not support stored procedures. What will you do now?

You are forced to move the procedures into your codebase, where, once written in, say, Java, they stay put no matter which database engine you use. Moreover, procedures are usually part of your business logic, and it is not a good idea to have business logic scattered across both your codebase and your database.

Logic implemented in a programming language, for example in Java on top of an ORM framework, is largely independent of the database technology.

Ideally, you should always have a mediator between the database and the client that implements its own business rules. Giving clients direct access to the database is not a good idea, because anyone with that access can reach the tables directly and do pretty much anything with the data.

Advantages of NOT implementing business logic in Stored Procedure

Migrating to other platforms is easier: Migrating to a new database engine? Definitely. Migrating the whole mediator to a new language? Not really.
The business logic has to be written either way: It is needed even when clients call the database directly, so the mediator does not take much longer to develop (see the stored procedure problem above).
Security: With proper authorization, a mediator is much more secure than giving users direct access to the database, because it restricts them to endpoints that run only the queries you allow.
Maintainability: One of the biggest benefits of a mediator. If there is a bug in an API your clients call, you fix it, push the fix to your VCS repository, and build and deploy the mediator from the version containing the fix. All your clients then use the fix immediately, without downloading an update. This is impossible if the queries are embedded in the client applications; in that case, clients must update their application.
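The mediator idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `OrderMediator`, the `orders` table, and the "total must be positive" rule are hypothetical names invented for the example, and an in-memory SQLite database stands in for the real engine. Clients call the two methods; they never see a connection string or write SQL.

```python
import sqlite3


class OrderMediator:
    """Hypothetical mediator: clients call these methods, never the database."""

    def __init__(self, conn):
        self._conn = conn

    def orders_for_customer(self, customer_id):
        # Parameterized query: the caller supplies a value, never SQL text.
        rows = self._conn.execute(
            "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
            (customer_id,),
        )
        return rows.fetchall()

    def place_order(self, customer_id, total):
        # The business rule lives in application code, not in a stored procedure.
        if total <= 0:
            raise ValueError("order total must be positive")
        with self._conn:  # commits on success, rolls back on error
            cur = self._conn.execute(
                "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                (customer_id, total),
            )
        return cur.lastrowid


def build_demo_mediator():
    # In-memory SQLite is an assumption made only to keep the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    return OrderMediator(conn)
```

Because the rule and the queries sit behind the mediator, fixing a bug in `place_order` changes behavior for every client at once, and swapping the database engine only touches this class.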

Disadvantages of implementing business logic in REST API backend layer

Takes longer to develop: Of course. You are building a new system, which takes more time than simply handing the client a database connection string and letting them write the queries.
More complex: Yes. The complexity of a whole system is greater than the complexity of a database query.
The server does more work: Not necessarily. With good design and caching, you can move load from the database server to the mediator's server.
Slower: In terms of development? Yes. In terms of speed when retrieving data? Not necessarily. You can optimize the mediator with caches (Redis and Elasticsearch were popular choices as of January 2016) and serve cached data faster than a plain database query would.
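As a rough illustration of the caching point, here is a minimal in-process cache with a time to live (TTL). It is a sketch, not a replacement for a service such as Redis: `TTLCache`, `get_or_load`, and the injectable `clock` parameter are hypothetical names chosen for this example.

```python
import time


class TTLCache:
    """Tiny in-process cache; each entry expires ttl_seconds after it is stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: skip the database
        value = loader()   # miss or expired: fall through to the database
        self._entries[key] = (now + self._ttl, value)
        return value
```

A mediator endpoint would wrap its query in `loader`, so repeated requests within the TTL never reach the database. The trade-off is staleness: a shorter TTL means fresher data but more database load.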

Sharding vs Partitioning

Similarity

Sharding and partitioning are both about breaking up a large dataset into smaller subsets.

Difference

Sharding: implies that the data is spread across multiple computers; partitioning does not.
Partitioning: groups subsets of data within a single database instance.

Sharding

Sharding is the practice of optimizing a database management system by separating the rows or columns of a large database table into multiple smaller tables. The new tables are called "shards" or "logical shards" (or partitions). Each new table either has the same schema but unique rows (horizontal sharding) or has a schema that is a proper subset of the original table's schema (vertical sharding).

Figure: vertical vs. horizontal sharding

Why is sharding used

Sharding is a common concept in scalable database architectures.

By sharding a large table, you can store the new chunks of data, called logical shards, across multiple nodes to achieve horizontal scalability and improved performance. 🥇

Once the logical shard is stored on another node, it is referred to as a physical shard.
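The logical-to-physical mapping can be sketched as a small routing function. Everything here is an assumption made for illustration: the shard count of 8, the two node names, and the choice of SHA-256 as the hash are not prescribed by the text above.

```python
import hashlib

NUM_LOGICAL_SHARDS = 8  # assumed value for illustration

# Hypothetical placement of logical shards on two physical nodes.
NODE_FOR_SHARD = {shard: f"node-{shard % 2}" for shard in range(NUM_LOGICAL_SHARDS)}


def logical_shard(key):
    # A stable hash (not Python's per-process hash()) keeps routing repeatable.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def node_for_key(key):
    # The physical shard is whichever node currently hosts the logical shard.
    return NODE_FOR_SHARD[logical_shard(key)]
```

Keeping the key-to-logical-shard step separate from the shard-to-node table means a logical shard can be moved to another node by editing `NODE_FOR_SHARD`, without rehashing any keys.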

Horizontal sharding: is effective when queries tend to return a subset of rows that are often grouped together. For example, queries that filter data based on short date ranges are ideal for horizontal sharding since the date range will necessarily limit querying to only a subset of the servers.
Vertical sharding: is effective when queries tend to return only a subset of columns of the data. For example, if some queries request only names, and others request only addresses, then the names and addresses can be sharded onto separate servers.
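The two examples above (date ranges, and names versus addresses) can be made concrete with a toy table held in memory. The rows, the column names, and the two helper functions are hypothetical; a real system would split tables inside the database rather than lists of dictionaries.

```python
from datetime import date

ROWS = [
    {"id": 1, "name": "Ada", "address": "1 Main St", "created": date(2023, 1, 5)},
    {"id": 2, "name": "Lin", "address": "9 Oak Ave", "created": date(2023, 2, 9)},
    {"id": 3, "name": "Sam", "address": "4 Elm Rd", "created": date(2023, 2, 20)},
]


def horizontal_shards(rows):
    """Same columns in every shard; rows are grouped by (year, month)."""
    shards = {}
    for row in rows:
        key = (row["created"].year, row["created"].month)
        shards.setdefault(key, []).append(row)
    return shards


def vertical_shards(rows, column_groups):
    """Every shard keeps all rows but only its own columns plus the id."""
    return {
        name: [{"id": row["id"], **{c: row[c] for c in cols}} for row in rows]
        for name, cols in column_groups.items()
    }
```

For example, `horizontal_shards(ROWS)` puts rows 2 and 3 together under the key `(2023, 2)`, while `vertical_shards(ROWS, {"names": ["name"], "addresses": ["address"]})` yields one shard of names and one of addresses, each still carrying the `id` needed to rejoin them.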

Also, sharded databases can offer higher levels of availability. In the event of an outage on an unsharded database, the entire application is unusable. With a sharded database, only the portions of the application that relied on the missing chunks of data are unusable. In practice, sharded databases often further mitigate the impact of such outages by replicating backup shards on additional nodes.
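The availability argument can be sketched as a read path that tries a shard's primary node and then its replica. This is a simplified model under stated assumptions: `read`, `ShardUnavailable`, and the node names are hypothetical, and real systems also have to handle replication lag and failover, which this ignores.

```python
class ShardUnavailable(Exception):
    """Raised when neither the primary nor the replica of a shard is reachable."""


def read(key, shard_of, primaries, replicas, down):
    """Read key from its shard's primary, falling back to the replica node.

    shard_of: function mapping a key to a shard id.
    primaries, replicas: dicts mapping shard id to node name.
    down: set of node names that are currently offline.
    """
    shard = shard_of(key)
    for node in (primaries[shard], replicas.get(shard)):
        if node is not None and node not in down:
            return f"{key} served by {node}"
    raise ShardUnavailable(f"shard {shard} is offline")
```

With one node down, only keys whose shard has no surviving copy raise `ShardUnavailable`; reads for every other shard keep working.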

https://www.digitalocean.com/community/tutorials/understanding-database-sharding

Database Partitioning

Big Data Analytics Platform

Big data is an extremely large volume of data and datasets that come in diverse forms and from multiple sources.

Many organizations have recognized the advantage of collecting as much data as possible, but it's not enough just to collect and store big data; you have to put it to use. Thanks to the evolution of technology, organizations can use big data analytics to transform terabytes of data into actionable insight.

What is Big data Analytics

Big data analytics describes the process of uncovering trends, patterns, and correlations (connections or relationships between two or more things) in large amounts of raw data to help make data-informed decisions.

With the explosion of data, early innovation projects like Hadoop, Spark, and NoSQL databases were created for the storage and processing of big data. More recently, big data analytics methods have been combined with emerging technologies, like machine learning, to discover and scale more complex insights.

When is data Big data?

There is no official definition. What one person considers big data may just be a traditional dataset in another person's eyes.

That doesn't mean people don't offer definitions. For example, some define it as any type of data that is distributed across multiple systems.

In some respects, that’s a good definition. Distributed systems tend to produce much more information than localized ones because distributed systems involve more machines, more services, and more applications, all of which generate more logs containing more information.

On the other hand, you can have a distributed system that doesn't involve much data. For instance, if you mount your laptop's 500-gigabyte hard disk over the network so that you can share it with other computers in your house, you would technically be creating a distributed data environment. But most people wouldn't consider this an example of big data.

How big data analytics works

Storage Advanced Threat Protection (ATP)

Advanced Threat Protection (ATP) for Azure Storage provides an additional layer of security intelligence that detects unusual and potentially harmful attempts to access or exploit storage accounts. This layer of protection allows you to protect and address concerns about potential threats to your storage accounts as they occur, without needing to be an expert in security.

❓ - Which two storage accounts should you identify?

🅰️ - Storage Threat Detection is available for the Blob Service.

Reference: R1, R2

Distributed Transaction

❓ - You have an app named App1 that uses data from two on-premises Microsoft SQL Server databases named DB1 and DB2. You plan to move DB1 and DB2 to Azure.

You need to implement Azure services to host DB1 and DB2. The solution must support server-side transactions across DB1 and DB2.

🅰️ - You deploy DB1 and DB2 to SQL Server on an Azure virtual machine.

When both the database management system and client are under the same ownership (e.g. when SQL Server is deployed to a virtual machine), transactions are available and the lock duration can be controlled.

Reference: R1

Terminology

Geo-replication: is a type of data storage replication where the same data is stored on a server in multiple distinct physical locations.
Multi-model database:
Geo-distributed database: