Tag Archives: Azure

Microsoft Azure存储架构设计

SQL Azure简介

SQL Azure是Azure存储平台的逻辑数据库,物理数据库仍然是SQL Server。一个物理的SQL Server被分成多个逻辑分片(partition),每一个分片成为一个SQL Azure实例,在分布式系统中也经常被称作子表(tablet)。和大多数分布式存储系统一样,SQL Azure的数据存储三个副本,同一个时刻一个副本为Primary,提供读写服务,其它副本为Secondary,可以提供最终一致性的读服务。每一个SQL Azure实例的允许的最大数据量可以为1GB或者5GB(Web Edition),10GB, 20GB, 30GB, 40GB或者50GB(Business Edition)。由于限制了子表最大数据量,Azure存储平台内部不支持子表分裂。

Azure整体架构.png

如上图,与大多数Web系统架构类似,Azure存储平台大致可以分为四层,从上到下分别为:

  • Client Layer:将用户的请求转化为Azure内部的TDS格式流;
  • Services Layer:相当于网关,相当于普通Web系统的逻辑层;
  • Platform Layer:存储节点集群,相当于普通Web系统的数据库层;
  • Infrastructure Layer:硬件和操作系统。Azure使用的硬件为普通PC机,论文中给出的典型配置为:8核,32GB内存,12块磁盘,大致的价格为3500美金;

Services Layer

服务层相当于普通Web系统的逻辑层,包含的功能包括:路由,计费,权限验证,另外,SQL Azure的服务层还监控Platform Layer中的存储节点,完成宕机检测和恢复,负载均衡等总控工作。Services Layer的架构如下:

Azure Service.png

sorry,图片直接copy的,字体比较小,重点理解功能划分及流程,Utility Layer理解大意即可

如上图,服务层包含四种类型的组件:

1, Front-end cluster:完成路由功能并包含防攻击模块,相当于Web架构中的Web服务器,如Apache或者Nginx;

2, Utility Layer:请求服务器合法性验证,计费等功能;

3, Service Platform:监控存储节点集群的机器健康状况,完成宕机检测和恢复,负载均衡等功能;

4, Master Cluster:配置服务器,保存每个SQL Azure实例的副本所在的物理存储节点信息;

其中,Master Cluster一般配置为七台机器,采用”Quorum Commit”技术,也就是任何一个Master操作必须同步到四个以上副本才算成功,四个以下Master机器故障不影响服务;其它类型的机器都是无状态的,且机器之间同构。上图中,请求的流程说明如下:

1, 客户端与Front-end机器建立连接,Front-end验证是否支持客户端的操作,如CREATE DATABASE这样的操作只能通过Azure实用工具执行;

2, Front-end网关机器与客户端进行SSL协议握手认证,如果客户端拒绝使用SSL协议则断开连接。这个过程中还将执行防攻击保护,比如拒绝某个或某一段范围IP地址频繁访问;

3, Front-end网关机器请求Utility Layer进行必要的验证,如请求服务器地址白名单认证;

4, Front-end网关机器请求Master获取用户请求的数据分片所在的物理存储节点副本信息;

5, Front-end网关机器请求请求Platform Layer中的物理存储节点验证用户的数据库权限;

6, 如果以上认证均通过,客户端和Platform Layer中的存储节点建立新的连接;

7~8, 后续所有的客户端请求都直接发送到Platform Layer中的物理存储节点,Front-end网关只是转发请求和回复数据,起一个中间代理作用。

Platform Layer

平台层就是存储节点集群,运行物理的SQL Server服务器。客户端的请求通过Front-end网关节点转发到平台层的数据节点,每个SQL Azure实例是SQL Server的一个数据分片,每个数据分片在不同的SQL Server数据节点上存储三个副本,同一时刻只有一个副本为Primary,其它副本为Secondary。数据写入采用”Quorum Commit”策略,至少两个副本写成功时才返回客户端,这样即使一个数据节点发生故障也不影响正常服务。Platform Layer的架构如下:

platform.jpg

sorry,图片直接copy,字体太小,请关注后续对存储节点Agent程序的描述

如上图,每个SQL Server数据节点最多服务650个数据分片,每一个数据节点上的所有数据分片的写操作记录到一个操作日志文件中,从而提高写入操作的聚合性能。每个分片的多个副本之间的数据同步是通过同步并回放操作日志实现的,由于每个分片的副本所在的机器可能不同,因此,每个SQL Server存储节点最多需要和650个其它存储节点进行数据同步,网络聚合不够,这也是限制单个存储节点最多服务650个分片的原因。

如上图,每个物理存储节点上都运行了一些实用的deamon程序(称为fabric),大致介绍如下:

1, Failure detection:检测数据节点故障从而触发Reconfiguration过程;

2, Reconfiguration Agent:节点故障后负责在数据节点重新生成Primary或者Secondary数据分片;

3, PM (Partition Manager) Location Resolution:解析Master的地址从而发送数据节点的消息给Master的Partition Manager处理;

4, Engine Throttling:限制每个逻辑的SQL Azure实例占用的资源比例,防止超出容量限制;

5, Ring Topology:所有的数据节点构成一个环,从而每个节点有两个邻居节点可以检测节点是否宕机;

分布式相关问题

1, 数据复制(Replication)

SQL Azure中采用”Quorum Commit”的策略,普通的数据存储三个副本,至少写成功两个副本才可以返回成功;Master存储七个副本,至少需要写成功四个副本。每个SQL Server节点的更新操作写到一个操作日志文件中并通过网络发送到另外两个副本,由于不同数据分片的副本所在的SQL Server机器可能不同,一个存储节点的操作日志最多需要和650个分片数量的机器通信,日志同步的网络聚合效果不够好。Yahoo的PNUTS为了解决这个问题采用了消息中间件进行操作日志分发,达到聚合操作日志的效果。

2, 宕机检测和恢复

SQL Azure的宕机检测论文中讲的不够细,大致的意思是:每个数据节点都被一些对等的数据节点监控,发现宕机则报告总控节点进行宕机恢复过程;同时,如果无法确定数据节点是否宕机,比如待监控数据节点假死而停止回复命令,此时需要由仲裁者节点进行仲裁。判断机器是否宕机需要一些协议控制,后面的文章会专门介绍。

如果数据节点发生了故障,需要启动宕机恢复过程。由于宕机的数据节点服务了最多650个逻辑的SQL Azure实例(子表),这些子表可能是Primary,也可能是Secondary。总控节点统一调度,每次选择一个数据分片进行Reconfiguration,即子表复制过程。对于Secondary数据分片,只需要通过从Primary拷贝数据来增加副本;对于Primary,首先需要从另外两个副本中选择一个Secondary作为新的Primary,接着执行和Secondary数据分片Reconfiguration一样的过程。另外,这里需要进行优先级的控制,比如某个数据分片只有一个副本,需要优先复制;某个数据分片的Primary不可服务,需要优先执行从剩余的副本中选择Secondary切换为Primary的过程。当然,这里还需要配置一些策略,比如只有两个副本的状态持续多长时间开始复制第三个副本,SQL Azure目前配置为两小时。

3, 负载均衡

新的数据节点加入或者发现某个节点负载过高时,总控节点启动负载均衡过程。数据节点负载影响因素包括:读写个数,磁盘/内存/CPU/IO使用量等。这里需要注意的是,新机器加入时需要控制子表迁移的节奏,否则大量的子表同时迁移到新加入的机器导致系统整体性能反而变慢。

SQL Azure由于可以控制每个逻辑SQL Azure实例,即每个子表的大小,因此,为了简便起见,可以不实现子表分裂,很大程度上简化了系统。

4, 事务

SQL Azure支持数据库事务,数据库事务相关的SQL语句都会记录BEGIN TRANSACTION,ROLLBACK TRANSACTION和COMMIT TRANSACTION相关的操作日志。在SQL Azure中,只需要将这些操作日志同步到其它副本即可,由于同一时刻同一个数据分片最多有一个Primary提供写服务,不涉及分布式事务。SQL Azure系统支持的事务级别为READ_COMMITTED。

5, 多租户干扰

云计算系统中多租用的操作相互干扰,因此需要限制每个SQL Azure逻辑实例使用的系统资源:

1, 系统操作系统资源限制,比如CPU和内存。超过限制时回复客户端要求10s后重试;

2, SQL Azure逻辑数据库容量限制。每个逻辑数据库都预先设置了最大的容量,超过限制时拒绝更新请求,但允许删除操作;

3, SQL Server物理数据库数据大小限制。超过该限制时返回客户端系统错误,此时需要人工介入。

与SQL Server的差别

1, 不支持的操作:Microsoft Azure作为一个针对企业级应用的平台,尽管尝试支持尽量多的SQL特性,仍然有一些特性无法支持。比如USE操作:SQL Server可以通过USE切换数据库,不过在SQL Azure不支持,这时因为不同的逻辑数据库可能位于不同的物理机器。具体可以参考SQL Azure vs. SQL Server

2, 观念转变:对于开发人员,需要用分布式系统的思维开发程序,比如一个连接除了成功,失败还有第三种不确定状态:云端没有返回操作结果,操作是否成功我们无从得知,又如,天下没有像SQL这么好的免费午餐;对于DBA同学,数据库的日常维护,比如升级,数据备份等工作都移交给了微软,可能会有更多的精力关注业务系统架构。

完整的信息可以参考微软前不久公布的Azure存储系统架构Inside SQL Azure论文

from:http://www.nosqlnotes.net/archives/83

Windows Azure Storage Architecture Overview

Update 1-2-2012:  See the new post on Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency , which gives a much more detailed and up to date description of the Windows Azure Storage Architecture.

 

In this posting we provide an overview of the Windows Azure Storage architecture to give some understanding of how it works. Windows Azure Storage is a distributed storage software stack built completely by Microsoft for the cloud.

Before diving into the details of this post, please read the prior posting on Windows Azure Storage Abstractions and their Scalability Targets to get an understanding of the storage abstractions (Blobs, Tables and Queues) provided and the concept of partitions.

3 Layer Architecture

The storage access architecture has the following 3 fundamental layers:

  1. Front-End (FE) layer – This layer takes the incoming requests, authenticates and authorizes the requests, and then routes them to a partition server in the Partition Layer. The front-ends know what partition server to forward each request to, since each front-end server caches a Partition Map. The Partition Map keeps track of the partitions for the service being accessed (Blobs, Tables or Queues) and what partition server is controlling (serving) access to each partition in the system.
  2. Partition Layer – This layer manages the partitioning of all of the data objects in the system. As described in the prior posting, all objects have a partition key. An object belongs to a single partition, and each partition is served by only one partition server. This is the layer that manages what partition is served on what partition server. In addition, it provides automatic load balancing of partitions across the servers to meet the traffic needs of Blobs, Tables and Queues. A single partition server can serve many partitions.
  3. Distributed and replicated File System (DFS) Layer – This is the layer that actually stores the bits on disk and is in charge of distributing and replicating the data across many servers to keep it durable. A key concept to understand here is that the data is stored by the DFS layer, but all DFS servers are (and all data stored in the DFS layer is) accessible from any of the partition servers.

These layers and a high level overview are shown in the below figure:

image

Here we can see that the Front-End layer takes incoming requests, and a given front-end server can talk to all of the partition servers it needs to in order to process the incoming requests. The partition layer consists of all of the partition servers, with a master system to perform the automatic load balancing (described below) and assignments of partitions. As shown in the figure, each partition server is assigned a set of object partitions (Blobs, Entities, Queues). The Partition Master constantly monitors the overall load on each partition sever as well the individual partitions, and uses this for load balancing. Then the lowest layer of the storage architecture is the Distributed File System layer, which stores and replicates the data, and all partition servers can access any of the DFS severs.

Lifecycle of a Request

To understand how the architecture works, let’s first go through the lifecycle of a request as it flows through the system. The process is the same for Blob, Entity and Message requests:

  1. DNS lookup – the request to be performed against Windows Azure Storage does a DNS resolution on the domain name for the object’s Uri being accessed. For example, the domain name for a blob request is “<your_account>.blob.core.windows.net”. This is used to direct the request to the geo-location (sub-region) the storage account is assigned to, as well as to the blob service in that geo-location.
  2. Front-End Server Processes Request– The request reaches a front-end, which does the following:
    1. Perform authentication and authorization for the request
    2. Use the request’s partition key to look up in the Partition Map to find which partition server is serving the partition. See this post for a description of a request’s partition key.
    3. Send the request to the corresponding partition server
    4. Get the response from the partition server, and send it back to the client.
  3. Partition Server Processes Request– The request arrives at the partition server, and the following occurs depending on whether the request is a GET (read operation) or a PUT/POST/DELETE (write operation):
    • GET – See if the data is cached in memory at the partition server
      1. If so, return the data directly from memory.
      2. Else, send a read request to one of the DFS Servers holding one of the replicas for the data being read.
    • PUT/POST/DELETE
      1. Send the request to the primary DFS Server (see below for details) holding the data to perform the insert/update/delete.
  4. DFS Server Processes Request – the data is read/inserted/updated/deleted from persistent storage and the status (and data if read) is returned. Note, for insert/update/delete, the data is replicated across multiple DFS Servers before success is returned back to the client (see below for details).

Most requests are to a single partition, but listing Blob Containers, Blobs, Tables, and Queues, and Table Queries can span multiple partitions. When a listing/query request that spans partitions arrives at a FE server, we know via the Partition Map the set of partition servers that need to be contacted to perform the query. Depending upon the query and the number of partitions being queried over, the query may only need to go to a single partition server to process its request. If the Partition Map shows that the query needs to go to more than one partition server, we serialize the query by performing it across those partition servers one at a time sorted in partition key order. Then at partition server boundaries, or when we reach 1,000 results for the query, or when we reach 5 seconds of processing time, we return the results accumulated thus far and a continuation token if we are not yet done with the query. Then when the client passes the continuation token back in to continue the listing/query, we know the Primary Key from which to continue the listing/query.

Fault Domains and Server Failures

Now we want to touch on how we maintain availability in the face of hardware failures. The first concept is to spread out the servers across different fault domains, so if a hardware fault occurs only a small percentage of servers are affected. The servers for these 3 layers are broken up over different fault domains, so if a given fault domain (rack, network switch, power) goes down, the service can still stay available for serving data.

The following is how we deal with node failures for each of the three different layers:

  • Front-End Server Failure – If a front-end server becomes unresponsive, then the load balancer will realize this and take it out of the available servers that serve requests from the incoming VIP. This ensures that requests hitting the VIP get sent to live front-end servers that are waiting to process requests.
  • Partition Server Failure – If the storage system determines that a partition server is unavailable, it immediately reassigns any partitions it was serving to other available partition servers, and the Partition Map for the front-end servers is updated to reflect this change (so front-ends can correctly locate the re-assigned partitions). Note, when assigning partitions to different partition servers no data is moved around on disk, since all of the partition data is stored in the DFS server layer and accessible from any partition server. The storage system ensures that all partitions are always served.
  • DFS Server Failure – If the storage system determines a DFS server is unavailable, the partition layer stops using the DFS server for reading and writing while it is unavailable. Instead, the partition layer uses the other available DFS servers which contain the other replicas of the data. If a DFS Server is unavailable for too long, we generate additional replicas of the data in order to keep the data at a healthy number of replicats for durability.

Upgrade Domains and Rolling Upgrade

A concept orthogonal to fault domains is what we call upgrade domains. Servers for each of the 3 layers are spread evenly across the different fault domains, and upgrade domains for the storage service. This way if a fault domain goes down we lose at most 1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y is the number of upgrade domains. To achieve this, we use rolling upgrades, which allows us to maintain high availability when upgrading the storage service.

The servers in each layer are broken up over a set of upgrade domains, and we upgrade a single upgrade domain at a time. For example, if we have 10 upgrade domains, then upgrading a single domain would potentially upgrade up to 10% of the servers from each layer at a time. A description of upgrade domains and an example of using rolling upgrades is in the PDC 2009 talk on Patterns for Building Scalable and Reliable Applications for Windows Azure (at 25:00).

We upgrade a single domain at a time for our storage service using rolling upgrades. A key part for maintaining availability during upgrade is that before upgrading a given domain, we proactively offload all the partitions being served on partition servers in that upgrade domain. In addition, we mark the DFS servers in that upgrade domain as being upgraded so they are not used while the upgrade is going on. This preparation is done before upgrading the domain, so that when we upgrade we reduce the impact on the service to maintain high availability.

After an upgrade domain has finished upgrading we allow the servers in that domain to serve data again. In addition, after we upgrade a given domain, we validate that everything is running fine with the service before going to the next upgrade domain. This process allows us to verify production configuration, above and beyond the pre-release testing we do, on just a small percentage of servers in the first few upgrade domains before upgrading the whole service. Typically if something is going to go wrong during an upgrade, it will occur when upgrading the first one or two upgrade domains, and if something doesn’t look quite right we pause upgrade to investigate, and we can even rollback to the prior version of the production software if need be.

Now we will go through the lower to layers of our system in more detail, starting with the DFS Layer.

DFS Layer and Replication

Durability for Windows Azure Storage is provided through replication of your data, where all data is replicated multiple times. The underlying replication layer is a Distributed File System (DFS) with the data being spread out over hundreds of storage nodes. Since the underlying replication layer is a distributed file system, the replicas are accessible from all of the partition servers as well as from other DFS servers.

The DFS layer stores the data in what are called “extents”. This is the unit of storage on disk and unit of replication, where each extent is replicated multiple times. The typical extent sizes range from approximately 100MB to 1GB in size.

When storing a blob in a Blob Container, entities in a Table, or messages in a Queue, the persistent data is stored in one or more extents. Each of these extents has multiple replicas, which are spread out randomly over the different DFS servers providing “Data Spreading”. For example, a 10GB blob may be stored across 10 one-GB extents, and if there are 3 replicas for each extent, then the corresponding 30 extent replicas for this blob could be spread over 30 different DFS servers for storage. This design allows Blobs, Tables and Queues to span multiple disk drives and DFS servers, since the data is broken up into chunks (extents) and the DFS layer spreads the extents across many different DFS servers. This design also allows a higher number of IOps and network BW for accessing Blobs, Tables, and Queues as compared to the IOps/BW available on a single storage DFS server. This is a direct result of the data being spread over multiple extents, which are in turn spread over different disks and different DFS servers, since any of the replicas of an extent can be used for reading the data.

For a given extent, the DFS has a primary server and multiple secondary servers. All writes go through the primary server, which then sends the writes to the secondary servers. Success is returned back from the primary to the client once the data is written to at least 3 DFS servers. If one of the DFS servers is unreachable when doing the write, the DFS layer will choose more servers to write the data to so that (a) all data updates are written at least 3 times (3 separate disks/servers in 3 separate fault+upgrade domains) before returning success to the client and (b) writes can make forward progress in the face of a DFS server being unreachable. Reads can be processed from any up-to-date extent replica (primary or secondary), so reads can be successfully processed from the extent replicas on its secondary DFS servers.

The multiple replicas for an extent are spread over different fault domains and upgrade domains, therefore no two replicas for an extent will be placed in the same fault domain or upgrade domain. Multiple replicas are kept for each data item, so if one fault domain goes down, there will still be healthy replicas to access the data from, and the system will dynamically re-replicate the data to bring it back to a healthy number of replicas. During upgrades, each upgrade domain is upgraded separately, as described above. If an extent replica for your data is in one of the domains currently being upgraded, the extent data will be served from one of the currently available replicas in the other upgrade domains not being upgraded.

A key principle of the replication layer is dynamic re-replication and having a low MTTR (mean-time-to-recovery). If a given DFS server is lost or a drive fails, then all of the extents that had a replica on the lost node/drive are quickly re-replicated to get those extents back to a healthy number of replicas. Re-replication is accomplished quickly, since the other healthy replicas for the affected extents are randomly spread across the many DFS servers in different fault/upgrade domains, providing sufficient disk/network bandwidth to rebuild replicas very quickly. For example, to re-replicate a failed DFS server with many TBs of data, with potentially 10s of thousands of lost extent replicas, the healthy replicas for those extents are potentially spread across hundreds to thousands of storage nodes and drives. To get those extents back up to a healthy number of replicas, all of those storage nodes and drives can be used to (a) read from the healthy remaining replicas, and (b) write another copy of the lost replica to a random node in a different fault/upgrade domain for the extent. This recovery process allows us to leverage the available network/disk resources across all of the nodes in the storage service to potentially re-replicate a lost storage node within minutes, which is a key property to having a low MTTR in order to prevent data loss.

Another important property of the DFS replication layer is checking and scanning data for bit rot. All data written has a checksum (internal to the storage system) stored with it. The data is continually scanned for bit rot by reading the data and verifying the checksum. In addition, we always validate this internal checksum when reading the data for a client request. If an extent replica is found to be corrupt by one of these checks, then the corrupted replica is discarded and the extent is re-replicated using one of the valid replicas in order to bring the extent back to healthy level of replication.

Geo-Replication

Windows Azure Storage provides durability by constantly maintaining multiple healthy replicas for your data. To achieve this, replication is provided within a single location (e.g., US South), across different fault and upgrade domains as described above. This provides durability within a given location. But what if a location has a regional disaster (e.g., wild fire, earthquake, etc.) that can potentially affect an area for many miles?

We are working on providing a feature called geo-replication, which replicates customer data hundreds of miles between two locations (i.e., between North and South US, between North and West Europe, and between East and Southeast Asia) to provide disaster recovery in case of regional disasters. The geo-replication is in addition to the multiple copies maintained by the DFS layer within a single location described above. We will have more details in a future blog post on how geo-replication works and how it provides geo-diversity in order to provide disaster recovery if a regional disaster were to occur.

Load Balancing Hot DFS Servers

Windows Azure Storage has load balancing at the partition layer and also at the DFS layer. The partition load balancing addresses the issue of a partition server getting too many requests per second for it to handle for the partitions it is serving, and load balancing those partitions across other partition servers to even out the load. The DFS layer is instead focused on load balancing the I/O load to its disks and the network BW to its servers.

The DFS servers can get too hot in terms of the I/O and BW load, and we provide automatic load balancing for DFS servers to address this. We provide two forms of load balancing at the DFS layer:

  • Read Load Balancing – The DFS layer maintains multiple copies of data through the multiple replicas it keeps, and the system is built to allow reading from any of the up to date replica copies. The system keeps track of the load on the DFS servers. If a DFS server is getting too many requests for it to handle, partition servers trying to access that DFS server will be routed to read from other DFS servers that are holding replicas of the data the partition server is trying to access. This effectively load balances the reads across DFS servers when a given DFS server gets too hot. If all of the DFS servers are too hot for a given set of data accessed from partition servers, we have the option to increase the number of copies of the data in the DFS layer to provide more throughput. However, hot data is mostly handled by the partition layer, since the partition layer caches hot data, and hot data is served directly from the partition server cache without going to the DFS layer.
  • Write Load Balancing – All writes to a given piece of data go to a primary DFS server, which coordinates the writes to the secondary DFS servers for the extent. If any of the DFS servers becomes too hot to service the requests, the storage system will then choose different DFS servers to write the data to.

Why Both a Partition Layer and DFS Layer?

When describing the architecture, one question we get is why do we have both a Partition layer and a DFS layer, instead of just one layer both storing the data and providing load balancing?

The DFS layer can be thought of as our file system layer, it understand files (these large chunks of storage called extents), how to store them, how to replicate them, etc, but it doesn’t understand higher level object constructs nor their semantics. The partition layer is built specifically for managing and understanding higher level data abstractions, and storing them on top of the DFS.

The partition layer understands what a transaction means for a given object type (Blobs, Entities, Messages). In addition, it provides the ordering of parallel transactions and strong consistency for the different types of objects. Finally, the partition layer spreads large objects across multiple DFS server chunks (called extents) so that large objects (e.g., 1 TB Blobs) can be stored without having to worry about running out of space on a single disk or DFS server, since a large blob is spread out over many DFS servers and disks.

Partitions and Partition Servers

When we say that a partition server is serving a partition, we mean that the partition server has been designated as the server (for the time being) that controls all access to the objects in that partition. We do this so that for a given set of objects there is a single server ordering transactions to those objects and providing strong consistency and optimistic concurrency, since a single server is in control of the access of a given partition of objects.

In the prior scalability targets post we described that a single partition can process up to 500 entities/messages per second. This is because all of the requests to a single partition have to be served by the assigned partition server. Therefore, it is important to understand the scalability targets and the partition keys for Blobs, Tables and Queues when designing your solutions (see the upcoming posts focused on getting the most out of Blobs, Tables and Queues for more information).

Load Balancing Hot Partition Servers

It is important to understand that partitions are not tied to specific partition servers, since the data is stored in the DFS layer. The partition layer can therefore easily load balance and assign partitions to different partition servers, since any partition server can potentially provide access to any partition.

The partition layer assigns partitions to partition severs based on each partition’s load. A given partition server may serve many partitions, and the Partition Master continuously monitors the load on all partition servers. If it sees that a partition server has too much load, the partition layer will automatically load balance some of the partitions from that partition server to a partition server with low load.

When reassigning a partition from one partition server to another, the partition is offline only for a handful seconds, in order to maintain high availability for the partition. Then in order to make sure we do not move partitions around too much and make too quick of decisions, the time it takes to decide to load balance a hot partition server is on the order of minutes.

Summary

The Windows Azure Storage architecture had three main layers – Front-End layer, Partition layer, and DFS layer. For availability, each layer has its own form of automatic load balancing and dealing with failures and recovery in order to provide high availability when accessing your data. For durability, this is provided by the DFS layer keeping multiple replicas of your data and using data spreading to keep a low MTTR when failures occur. For consistency, the partition layer provides strong consistency and optimistic concurrency by making sure a single partition server is always ordering and serving up access to each of your data partitions.

Brad Calder

from:http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx

Microsoft Azure Storage架构分析

Microsoft云存储服务分为两个部分,SQL Azure和Azure  Storage。云存储系统的可扩展性和功能不可兼得,必须牺牲一定的关系数据库功能换取可扩展性。Microsoft实现云存储的思路有两种:

1, 做减法。SQL Azure直接在原有的SQL Server上引入分布式的因素,在满足一定可扩展性的前提下尽可能不牺牲原有的关系型数据库功能。SQL  Azure的可扩展性是有限的,单个SQL Azure实例不允许超过50GB,这是因为SQL Azure不支持子表动态分裂,单个SQL  Azure实例必须足够小从而可以被一个节点服务。具体信息可以参考我以前的一篇文章:Microsoft SQL Azure架构设计

2, 做加法。Azure Storage先做好底层可扩展性,然后再逐步加入功能,这与Google GFS &  Bigtable的思路比较一致。Azure Storage分为Blob, Queue和Table三个部分,其中Azure Table  Storage最具有代表性,由于底层系统支持子表分裂,单个用户的最大数据量可以达到100TB。然而Table  Storage支持的功能有限,甚至不支持索引功能。具体信息可以参考msdn的一篇文章:Azure  Storage架构介绍

Microsoft Azure Storage 逻辑上分为三层:

1, 前端(Front-End Layer):类似互联网公司的Web服务器层,可采用LVS +  Nginx的设计,主要做一些杂事,比如权限验证。由于前端服务器没有状态,因此很容易实现可扩展。

2, 存储层 (Partition Layer):类似Bigtable,分为Partition Master和Partition  Server两种角色,分别对应Bigtable Master和Tablet  Server。每一个Partition(相当于Bigtable中的Tablet)同时只能被一个Partition Server服务,Partition  Master会执行负载均衡等工作。Bigtable中分为Root Table, Meta Table和User Table,Azure Table  Storage可能会为了简单起见消除Meta Table,所有的元数据操作全部放到Partition Master上。

3, 文件系统层(DFS Layer):类似GFS,数据按照extent(相当于GFS中的chunk)划分,每个extent大小在100MB ~  1GB之间。数据存储三份,写操作经过Primary同步到多个Secondary,读操作可以选择负载较轻的某个Primary或者Secondary副本。当发生机器故障时,需要选择其它机器上的Secondary切换为Primary继续提供写服务,另外需要通过增加副本操作使得每个extent的副本数维持在安全值,比如三份。为了简单起见,DFS  Layer对上层Partition Layer可以不提供文件系统接口,只提供类似块设备的访问方式。

Azure Storage的主键包括Partition Key + Row Key两部分,其中,Partition  Key用于数据划分,规定相同的Partition Key只属于同一个Partition,从而被一台Partition  Server服务,这就使得Partition Key相同的多个行之间能够支持事务。与SQL Azure不同,Azure  Storage的并发操作通过乐观锁的方式实现。Azure Storage包含三个不同的产品,其中Azure Table  Storage支持用户设置Schema,支持byte[], bool, DataTime, double, Guid, Int32, Int64,  String这几种列类型;Azure Blob Storage将Blob数据存放到底层的DFS  Layer中,Blob过大时可能存放到多个extent中,Partition  Layer存储每个Blob的编号到Blob所在的多个extent位置之间的映射关系;Azure Queue Storage将Partition  Key设置为Queue的编号,Row Key设置为消息的编号,从而保证属于同一个Queue的消息连续存放。

总之,Microsoft Azure  Storage和早期的Bigtable总体架构是很类似的,可能做了一些简化,这也间接说明了一点:如果不发生重大硬件变革,工程上要实现高可扩展的分布式B+树,将云存储系统分成文件+表格两层还是比较靠谱的。开发人员更多地需要在实现细节上下功夫,落实到线上的每行代码上。

from:http://www.nosqlnotes.net/archives/175