All posts by dotte

How to Fix MongoDB Times That Are 8 Hours Behind the Actual Time

Symptom

Times stored in the database are always 8 hours behind the actual time.

Cause

MongoDB stores times in standard UTC (+00:00), while China's time zone is UTC+8.

Solution

If you are using the C# MongoDB.Driver, you only need to add an attribute to the entity's DateTime property and specify the kind (time-zone handling).

For example:

[BsonDateTimeOptions(Kind = DateTimeKind.Local)]
public DateTime EntryTime { get; set; }

This attribute lives in MongoDB.Bson.dll; add a reference to that assembly and the following using directive:

using MongoDB.Bson.Serialization.Attributes;

 

 

Redis 2.4.13 Installation and Deployment

1. Introduction to Redis

Redis is short for Remote Dictionary Server. It is essentially a key/value database, a NoSQL store similar to Memcached, except that its data can be persisted to disk, so data is not lost when the service restarts. Values can be strings, lists, sets, or sorted sets (ordered sets). These data types support operations such as push/pop, add/remove, and server-side union, intersection, and difference between two sets, and all of these operations are atomic. Redis also supports several kinds of sorting.
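To make the data types above concrete, here is a small sketch using the third-party redis-py client (pip install redis); the client library, host, and port are assumptions for illustration and are not part of the original post.

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# strings
r.set("name", "dotte")
print(r.get("name"))                  # b'dotte'

# lists: push/pop
r.rpush("queue", "job1", "job2")
print(r.lpop("queue"))                # b'job1'

# sets: add/remove plus server-side intersection / union / difference
r.sadd("s1", "a", "b", "c")
r.sadd("s2", "b", "c", "d")
print(r.sinter("s1", "s2"))           # {b'b', b'c'}
print(r.sunion("s1", "s2"))           # {b'a', b'b', b'c', b'd'}
print(r.sdiff("s1", "s2"))            # {b'a'}

# sorted sets
r.zadd("ranking", {"alice": 10, "bob": 20})
print(r.zrange("ranking", 0, -1, withscores=True))

Each of these commands executes atomically on the server, as noted above.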

2. Overview of Redis Features

• Redis sharding: the Redis server itself does not provide a sharding feature comparable to MongoDB's; sharding can only be implemented on the client side, typically with a consistent hashing algorithm. Redis currently has no built-in failover, and instances cannot be added to or removed from a cluster online.

• Redis master/slave replication:

  ◦ One master can have multiple slaves.

  ◦ A slave can accept connections from other slaves instead of having them all connect to the master (chained replication).

  ◦ Replication is non-blocking on both the master and the slave side.

  ◦ Replication is used for scalability: slaves serve read-only queries and provide data redundancy.

• Redis Virtual Memory (VM):

  ◦ Because of performance problems, the VM mechanism is completely deprecated as of version 2.4.

  ◦ The VM mode caused real problems in practice.

  ◦ One experience with Redis 2.0.2: with VM enabled and more than about 1,500 concurrent connections, Redis latency increased dramatically, averaging over 4000 ms per request, and the Redis process's CPU usage rose above 100%. After VM was turned off, with the same number of concurrent connections, latency dropped below 2 ms and CPU usage fell to about 1%.

• Append-only file (AOF): Redis appends its writes to an AOF file according to a configurable policy; after a crash, Redis can replay the AOF to restore the state it had before going down.

• Batch writes are supported.

• Transactions: a group of commands is queued and executed in one shot, with no other commands interleaved during execution (guaranteed by Redis's single-threaded design); see the sketch after this list.

• Pipelining: multiple commands are submitted in a single round trip; when the commands are simple writes that do not depend on the results of earlier commands, this can improve efficiency considerably.
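The transaction and pipelining features from the last two items can be exercised from client code. A minimal sketch with redis-py (the same assumed client as in the earlier sketch, not part of the original post):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Transaction: the queued commands are wrapped in MULTI ... EXEC and run
# atomically, with no other clients' commands interleaved.
with r.pipeline(transaction=True) as pipe:
    pipe.incr("counter")
    pipe.lpush("events", "counter-incremented")
    results = pipe.execute()
print(results)

# Plain pipeline: independent commands sent in one network round trip,
# which mostly saves latency rather than providing atomicity.
pipe = r.pipeline(transaction=False)
for i in range(100):
    pipe.set(f"key:{i}", i)
pipe.execute()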

3. Redis Architecture Diagram

4. Installing Redis

shell> wget http://redis.googlecode.com/files/redis-2.4.13.tar.gz    # download the package

shell> tar -zxvf redis-2.4.13.tar.gz    # unpack it

shell> cd redis-2.4.13    # enter the source directory

shell> make    # compile

shell> make test    # run the test suite to verify the build

This version needs neither a configure step nor make install. After make finishes, several new executables appear in the src directory:

redis-server        # starts the Redis server

redis-benchmark     # benchmarks a running Redis server

redis-check-aof     # checks/repairs an append-only file

redis-check-dump    # checks an RDB dump file

redis-cli           # the Redis command-line client

For a tidy, easy-to-manage deployment layout, do the following:

· shell> mkdir -p bin

· shell> mkdir -p conf

· shell> mkdir -p logs

· shell> mkdir -p data

· shell> cd src

· shell> cp redis-server redis-cli redis-benchmark redis-stat ../bin

· shell> cd ..

· shell> cp redis.conf conf

Adjust the configuration file:

daemonize yes    # run as a daemon in the background

pidfile /opt/redis-2.4.13/bin/redis.pid    # pid file path

port 6379    # listening port

logfile /opt/redis-2.4.13/logs/stdout.log    # log file path

dbfilename /opt/redis-2.4.13/data/dump.rdb    # database file path

5. Starting and Stopping the Redis Service

shell> bin/redis-server /opt/redis-2.4.13/conf/redis.conf    # start the service; it runs in the background

shell> ps -ef | grep redis    # check that the process is running

shell> netstat -ntlp | grep 6379    # check the default listening port

shell> bin/redis-benchmark    # benchmark read/write performance on this system

shell> bin/redis-cli    # open the command-line client and confirm the server responds

shell> bin/redis-cli shutdown    # stop the Redis service
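Besides redis-cli, you can verify a running instance from application code. A quick sanity check with the redis-py client (an assumed third-party library, not part of the original post):

import redis

r = redis.Redis(host="127.0.0.1", port=6379, socket_timeout=3)
assert r.ping()                       # raises ConnectionError if the server is down
r.set("smoke-test", "ok")
print(r.get("smoke-test"))            # b'ok'
info = r.info()
print(info["redis_version"], info["connected_clients"])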

6. Redis Configuration File Explained

Configuration parameter reference:

1. Redis does not run as a daemon by default; change this setting to yes to enable daemon mode.
   i. daemonize no

2. When Redis runs as a daemon, it writes its pid to /var/run/redis.pid by default; pidfile changes the location.
   i. pidfile /var/run/redis.pid

3. The port Redis listens on; the default is 6379. The author explained in a blog post why 6379 was chosen: it is what MERZ spells on a phone keypad, after the Italian showgirl Alessia Merz.
   i. port 6379

4. The host address to bind to.
   i. bind 127.0.0.1

5. How long a client may stay idle before the connection is closed; 0 disables this feature.
   i. timeout 300

   If your application uses a connection pool, it is best to set this to 0 so the server never closes connections on its own; otherwise you may see exceptions such as java.net.SocketTimeoutException: Read timed out or "It seems like server has closed the connection". Be sure to cap the number of connections your application opens, and always close connections when you are done with them.

6. Log level. Redis supports four levels: debug, verbose, notice, and warning; the default is verbose.
   i. loglevel verbose

7. Log destination. The default is standard output; note that if Redis is configured to run as a daemon while logging to standard output, the log output is sent to /dev/null.
   i. logfile stdout

8. Number of databases. The default database is 0; use SELECT <dbid> to switch databases on a connection.
   i. databases 16

9. Snapshot policy: save the data to disk after the given number of seconds if at least the given number of changes have occurred. Several conditions can be combined.
   i. save <seconds> <changes>
   ii. The default configuration file ships with three conditions:
   iii. save 900 1
   iv. save 300 10
   v. save 60 10000
   vi. meaning: save after 1 change within 900 seconds (15 minutes), after 10 changes within 300 seconds (5 minutes), or after 10000 changes within 60 seconds.

10. Whether to compress the data when writing the local database file. The default is yes; Redis uses LZF compression. Turning it off saves some CPU time but makes the dump file much larger.
    i. rdbcompression yes

11. Name of the local database file; the default is dump.rdb.
    i. dbfilename dump.rdb

12. Directory in which the local database file is stored.
    i. dir ./

13. When this node is a slave, the IP address and port of its master; on startup Redis automatically synchronizes data from the master.
    i. slaveof <masterip> <masterport>

14. If the master is password-protected, the password the slave uses to connect to it.
    i. masterauth <master-password>

15. Connection password. If set, clients must authenticate with AUTH <password> when connecting. Disabled by default.
    i. requirepass foobared

16. Maximum number of simultaneous client connections. By default there is no limit beyond the maximum number of file descriptors the Redis process may open; maxclients 0 also means no limit. Once the limit is reached, Redis closes new connections and returns a "max number of clients reached" error.
    i. maxclients 128

17. Maximum memory limit. Redis loads data into memory at startup; once the limit is reached it first tries to evict keys that have expired or are about to expire, and if that is still not enough, further writes fail while reads continue to work. Redis's VM mechanism keeps the keys in memory and puts the values in the swap area.
    i. maxmemory <bytes>

18. Whether to log every update operation (append-only mode). By default Redis writes data to disk asynchronously, so with this off a power failure can lose the data written since the last snapshot: RDB snapshots only run according to the save conditions above, so recent data may exist only in memory. The default is no.
    i. appendonly no

19. Name of the append-only file; the default is appendonly.aof.
    i. appendfilename appendonly.aof

20. AOF fsync policy; there are three options:
    i. no: let the operating system decide when to flush the data to disk (fast)
    ii. always: call fsync() after every update (slow, safest)
    iii. everysec: fsync once per second (a compromise, and the default)
    iv. appendfsync everysec

21. Whether to enable the virtual memory mechanism; the default is no. Briefly, the VM mechanism stores data in pages and lets Redis swap rarely accessed pages (cold data) out to disk, while frequently accessed pages are swapped back from disk into memory. (Redis's VM mechanism will be analyzed in detail in a later article.)
    i. vm-enabled no

22. Path of the virtual memory swap file; the default is /tmp/redis.swap. It must not be shared by multiple Redis instances.
    i. vm-swap-file /tmp/redis.swap

23. All data larger than vm-max-memory is stored in virtual memory. No matter how small vm-max-memory is set, all index data stays in memory (Redis's index data is its keys); in other words, with vm-max-memory set to 0 all values live on disk. The default is 0.
    i. vm-max-memory 0

24. The Redis swap file is divided into many pages. One object may span multiple pages, but a page cannot be shared by multiple objects, so vm-page-size should be chosen according to the size of the stored data: for many small objects a page size of 32 or 64 bytes is recommended; for very large objects a larger page size can be used; when in doubt, keep the default.
    i. vm-page-size 32

25. Number of pages in the swap file. Because the page table (a bitmap marking pages as free or in use) is kept in memory, every 8 pages on disk consume 1 byte of memory.
    i. vm-pages 134217728

26. Number of threads used to access the swap file; it is best not to exceed the number of cores on the machine. If set to 0, all swap-file operations are serialized, which can cause fairly long delays. The default is 4.
    i. vm-max-threads 4

27. Whether to glue small reply packets together into a single packet when answering clients; enabled by default.
    i. glueoutputbuf yes

28. Use a special space-saving hash encoding as long as the number of entries and the size of the largest element stay below these thresholds.
    i. hash-max-zipmap-entries 64
    ii. hash-max-zipmap-value 512

29. Whether to enable active rehashing; enabled by default. (Redis's hashing algorithm will be covered in more detail later.)
    i. activerehashing yes

30. Include other configuration files. This makes it possible to share one common configuration file among several Redis instances on the same host while each instance keeps its own specific configuration file.
    i. include /path/to/local.conf
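Most of the parameters above can also be read, and many of them changed, on a running instance with CONFIG GET and CONFIG SET. A small sketch using redis-py (an assumed third-party client, not part of the original post):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

print(r.config_get("maxmemory"))      # e.g. {'maxmemory': '0'}
print(r.config_get("save"))           # e.g. {'save': '900 1 300 10 60 10000'}
print(r.config_get("appendonly"))     # e.g. {'appendonly': 'no'}

# Changes apply to the running instance only; Redis 2.4 does not rewrite
# the configuration file (CONFIG REWRITE arrived in a later release).
r.config_set("maxmemory", 268435456)  # 256 MB, in bytes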

7. Running Multiple Instances on Different Ports

Redis is a single-process, single-threaded service, so the number of instances to run should be chosen according to the number of CPU cores in order to get the most out of the hardware; on an 8-core server, for example, you would ideally run 8 instances. The steps below configure the first instance only.

1. Assume Redis is installed under /opt/redis, i.e. that directory contains the redis-benchmark, redis-cli and redis-server executables. Create a servers folder under it to hold all of the instances (a script that automates steps 1-3 for every instance appears after step 4):

mkdir -p /opt/redis/servers/0/
mkdir -p /opt/redis/servers/0/conf
mkdir -p /opt/redis/servers/0/data
mkdir -p /opt/redis/servers/0/run
mkdir -p /opt/redis/servers/0/logs

2. Copy a configuration file into the instance's directory:

cp redis.conf /opt/redis/servers/0/conf

3. Modify the following settings in that configuration file:

pidfile /opt/redis/servers/0/run/redis.pid
port 6380
logfile /opt/redis/servers/0/logs/stdout.log
dbfilename /opt/redis/servers/0/data/dump.rdb

4. Start and stop the instance:

./redis-server /opt/redis/servers/0/conf/redis.conf    # start the instance

./redis-cli -p 6380 shutdown    # stop the instance
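The per-instance layout from steps 1-3 can also be generated by a script. A rough helper sketch (the paths, port numbering, and minimal configuration template are assumptions; in practice you would start from the stock redis.conf and override these values):

import os

BASE = "/opt/redis/servers"
TEMPLATE = """daemonize yes
pidfile {base}/{n}/run/redis.pid
port {port}
logfile {base}/{n}/logs/stdout.log
dir {base}/{n}/data
dbfilename dump.rdb
"""

def make_instance(n, port):
    # create conf/data/run/logs directories and write a per-instance config
    for sub in ("conf", "data", "run", "logs"):
        os.makedirs(f"{BASE}/{n}/{sub}", exist_ok=True)
    with open(f"{BASE}/{n}/conf/redis.conf", "w") as f:
        f.write(TEMPLATE.format(base=BASE, n=n, port=port))

# one instance per core, e.g. 8 instances on ports 6380-6387
for n in range(8):
    make_instance(n, 6380 + n)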

8. Common Redis Maintenance Commands

1. ../bin/redis-cli keys \*    # list all keys

[root@monitor data]# ../bin/redis-cli keys \*
1) "name"
2) "name1"

2. ../bin/redis-cli info    # show the Redis server's runtime status

[root@monitor data]# ../bin/redis-cli info

redis_version:2.4.13
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.6
process_id:2738
uptime_in_seconds:6888
uptime_in_days:0
lru_clock:1508888
used_cpu_sys:0.08
used_cpu_user:0.02
used_cpu_sys_children:0.01
used_cpu_user_children:0.00
connected_clients:2
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:734976
used_memory_human:717.75K
used_memory_rss:7323648
used_memory_peak:726504
used_memory_peak_human:709.48K
mem_fragmentation_ratio:9.96
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1336293648
bgrewriteaof_in_progress:0
total_connections_received:8
total_commands_processed:19
expired_keys:0
evicted_keys:0
keyspace_hits:9
keyspace_misses:2
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1679
vm_enabled:0
role:master
db0:keys=2,expires=0

from:http://hi.baidu.com/webwatch/item/47c7e3df6f4a37f592a97456

MongoDB Configuration File Explained

Run-time Database Configuration

The command-line and configuration-file interfaces give MongoDB administrators a large number of options and settings for controlling how the database system runs. This document provides an overview of common configurations and examples of best-practice configurations for typical use cases.

Although both interfaces expose the same collection of options and settings, this document mainly uses the configuration-file interface. If you run MongoDB using a control script or a package provided by your operating system, you very likely already have a configuration file located at /etc/mongodb.conf. Check the contents of the /etc/init.d/mongod or /etc/rc.d/mongod script to confirm this, and to make sure the control script starts mongod with the appropriate configuration file (see below).

To start a MongoDB instance using this configuration, issue a command in one of the following forms:

 mongod --config /etc/mongodb.conf
 mongod -f /etc/mongodb.conf

Modify the values in the /etc/mongodb.conf file on your system to control the configuration of your database instance.

Starting, Stopping, and Running the Database

Consider the following basic configuration:

 fork = true
 bind_ip = 127.0.0.1
 port = 27017
 quiet = true
 dbpath = /srv/mongodb
 logpath = /var/log/mongodb/mongod.log
 logappend = true
 journal = true

For most standalone servers this is a sufficient base configuration. It makes several assumptions, which are explained below:

  • fork is true, which enables a daemon mode for mongod: MongoDB detaches (i.e. "forks") from the current session, allowing you to run the database as a conventional server.
  • bind_ip is 127.0.0.1, which forces the server to only listen for requests on the localhost IP. Only bind to secure interfaces that application-level systems can reach, with access control provided by system network filtering (i.e. "firewall") systems.
  • port is 27017, the default MongoDB port for database instances. MongoDB can bind to any port; you can also use network filtering tools to control access.

    Note

    UNIX-like systems require superuser privileges to attach a process to a port lower than 1024.

  • quiet is true. This suppresses all but the most important entries in the output/log file; in normal operation this is the best way to avoid log noise. In diagnostic or testing situations, set this value to false. Use setParameter to modify this setting at run time.
  • dbpath is /srv/mongodb, which specifies where MongoDB stores its data files. /srv/mongodb and /var/lib/mongodb are both popular locations. The user account under which mongod runs needs read and write access to this directory.
  • logpath is /var/log/mongodb/mongod.log, where mongod writes its output. If you do not set this value, mongod writes all output to standard output (stdout).
  • logappend is true, which ensures that mongod does not overwrite an existing log file when the server starts.
  • journal is true, which enables journaling.

    Journaling ensures single-instance write durability. 64-bit builds of mongod enable journaling by default, so this setting may be redundant.

Given the default configuration, some of these values may be redundant. However, in many situations explicitly stating the configuration increases overall understanding of the system.
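Once the instance is up, you can confirm that it is reachable with the settings above from application code. A minimal sketch using the PyMongo driver (an assumed third-party library and a reasonably recent driver/server pairing, not part of the original document):

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

client = MongoClient("mongodb://127.0.0.1:27017", serverSelectionTimeoutMS=2000)
try:
    print("server version:", client.server_info()["version"])
except ServerSelectionTimeoutError:
    print("mongod is not reachable on 127.0.0.1:27017")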

Security Considerations

The following collection of configuration options is useful for limiting access to a mongod instance. Consider the following configuration:

 bind_ip = 127.0.0.1,10.8.0.10,192.168.4.24
 nounixsocket = true
 auth = true

Consider the following explanation of these configuration decisions:

  • bind_ip has three values: 127.0.0.1, the localhost interface; 10.8.0.10, a private IP address typically used for local networks and VPN interfaces; and 192.168.4.24, a private network interface typically used for local networks.

    Because production MongoDB instances need to be accessible from multiple database servers, it is important to bind MongoDB to interfaces that your application servers can reach. At the same time, it is important to limit these interfaces to ones that are controlled and protected at the network layer.

  • nounixsocket is true, which disables the UNIX socket that is otherwise enabled by default. This limits access on the local system; it is desirable when running MongoDB on systems with shared access, but in most situations has minimal impact.
  • auth is true, which enables authentication in MongoDB. When it is enabled, you will need to connect over the localhost interface the first time in order to create user credentials.

See also

The "Security and Authentication" wiki page.

Replication and Sharding Configuration

Replica Set Configuration

Replica set configuration is straightforward: it only requires that replSet have a value that is consistent among all members of the set. Consider the following configuration:

 replSet = set0

Use a descriptive name for the replica set. Once configured, use the mongo shell to add hosts to the replica set.

See also

"Replica Set Reconfiguration".

To enable authentication for the replica set, add the following option:

 keyfile = /srv/mongodb/keyfile

New in version 1.8 for replica sets; version 1.9.1 added support for sharded replica sets.

Setting keyFile enables authentication and specifies a key file for the replica set members to use when authenticating to each other. The content of the key file is arbitrary, but it must be the same on all members of the replica set and on all mongos instances that connect to the set. The key file must be less than 1 KB in size, may only contain characters from the base64 character set, and must not have group or "world" permissions on UNIX systems.

See also

The "Replica Set Reconfiguration" section for information on the process of changing a replica set during operation.

Additionally, consider the "Replica Set Security" section for information on configuring authentication with replica sets.

Finally, see the "Replication" index and the "Replication Fundamentals" document for more information on replication in MongoDB and on replica set configuration in general.

Sharding Configuration

Sharding requires a number of mongod instances with different configurations. The config servers store the cluster's metadata, while the cluster distributes data among one or more shard servers.

Note

Config servers are not replica sets.

To set up one or three "config server" instances, start normal mongod instances and then add the following configuration options:

 configsvr = true
 bind_ip = 10.8.0.12
 port = 27001

This creates a config server running on the private IP address 10.8.0.12 on port 27001. Make sure there are no port conflicts and that the config server is reachable from all of your mongos and mongod instances.

To set up shards, configure two or more mongod instances using your base configuration and add the shardsvr setting:

 shardsvr = true

Finally, to establish the cluster, configure at least one mongos process with the following settings:

 configdb = 10.8.0.12:27001
 chunkSize = 64

You can specify multiple configdb instances by giving their hostnames and ports as a comma-separated list. In general, avoid modifying chunkSize from the default value of 64,[1] and make sure this setting is consistent among all mongos instances.

[1] The default chunk size of 64 MB strikes an ideal balance between the most even distribution of data (for which smaller chunk sizes are best) and minimizing chunk migration (for which larger chunk sizes are optimal).

See also

The "Sharding" wiki page for more information on sharding and shard cluster configuration.

Running Multiple Database Instances on the Same System

In many cases, running multiple mongod instances on a single system is not recommended, but some kinds of deployments[2] may need to run multiple mongod instances on one system, for example for testing purposes.

In these cases, apply the base configuration to each instance, but consider the following configuration values:

 dbpath = /srv/mongodb/db0/
 pidfilepath = /srv/mongodb/db0.pid

The dbpath value controls the location of the mongod instance's data directory; make sure each instance has its own clearly labelled data directory. pidfilepath controls where the mongod process places its pid file; because this path is specific to each mongod process, make sure the file is unique and clearly labelled so that it is easy to start and stop these processes.

Create additional control scripts and/or adjust your existing MongoDB configuration and control scripts as needed to control these processes.

[2] Single-tenant systems with SSDs or other high-performance disks may provide acceptable performance levels for multiple mongod instances. Additionally, you may find that multiple databases with small working sets perform acceptably on a single system.

Diagnostic Configuration

The following configuration options control a number of mongod behaviors that are useful for diagnostics. The values below override defaults that are tuned for general production purposes:

 slowms = 50
 profile = 3
 verbose = true
 diaglog = 3
 objcheck = true
 cpu = true

Start from the base configuration and add these options as needed if you are experiencing an unknown issue or a performance problem:

  • slowms configures the threshold above which the database profiler considers a query "slow". The default value is 100 milliseconds; set a lower value if the database profiler does not return useful results. See the "Optimization" wiki page for more information on optimization in MongoDB.
  • profile sets the database profiler level. The profiler is not active by default because of its possible impact on performance; unless this setting has a value, queries are not profiled.
  • verbose enables a verbose logging mode that modifies mongod's output and increases logging to include a greater number of events. Only use this option if you are experiencing an issue that is not reflected in the normal logging level. If you need greater verbosity, consider the following options:
      v = true
      vv = true
      vvv = true
      vvvv = true
      vvvvv = true

    Each additional v adds to the verbosity of the logging. The verbose option is equivalent to v = true.

  • diaglog enables diagnostic logging; level 3 logs all read and write operations.
  • objcheck forces mongod to validate every request it receives from clients. Use this option to ensure that invalid requests do not cause errors, particularly when running a database with untrusted clients. This option may affect database performance.
  • cpu forces mongod to report the percentage of the last interval spent in write lock. The interval is typically 4 seconds, and each output line in the log includes both the actual interval since the last report and the percentage of time spent in write lock.

    from:http://blog.sina.com.cn/s/blog_9c5dff2f01012n0f.html

Microsoft Azure Storage Architecture Design

SQL Azure Overview

SQL Azure is the logical database of the Azure storage platform; the physical database is still SQL Server. One physical SQL Server is divided into multiple logical partitions, and each partition becomes a SQL Azure instance, often called a tablet in distributed systems. As in most distributed storage systems, SQL Azure keeps three replicas of each piece of data; at any moment one replica is the Primary, which serves reads and writes, while the other replicas are Secondaries, which can serve eventually consistent reads. The maximum data size allowed for a single SQL Azure instance is 1 GB or 5 GB (Web Edition), or 10 GB, 20 GB, 30 GB, 40 GB or 50 GB (Business Edition). Because the maximum tablet size is bounded, the Azure storage platform does not support tablet splitting internally.

[Figure: overall Azure architecture]

As the figure above shows, much like most web system architectures, the Azure storage platform can be divided roughly into four layers, from top to bottom:

  • Client Layer: converts user requests into Azure's internal TDS stream format;
  • Services Layer: acts as a gateway, comparable to the logic tier of an ordinary web system;
  • Platform Layer: the cluster of storage nodes, comparable to the database tier of an ordinary web system;
  • Infrastructure Layer: hardware and operating system. Azure uses commodity PC hardware; the typical configuration given in the paper is 8 cores, 32 GB of RAM and 12 disks, costing roughly 3,500 US dollars.

Services Layer

The Services Layer corresponds to the logic tier of an ordinary web system. Its responsibilities include routing, billing, and authorization; in addition, SQL Azure's services layer monitors the storage nodes in the Platform Layer and handles overall control tasks such as failure detection, recovery, and load balancing. The Services Layer is structured as follows:

[Figure: Services Layer architecture]

(Sorry, the figure is copied directly and its text is small; focus on the division of responsibilities and the request flow. For the Utility Layer, the general idea is enough.)

As the figure shows, the services layer contains four kinds of components:

1. Front-end cluster: performs routing and contains anti-attack modules; comparable to the web servers in a web architecture, such as Apache or Nginx.

2. Utility Layer: validates the legitimacy of the requesting server, handles billing, and similar functions.

3. Service Platform: monitors the health of the machines in the storage node cluster and handles failure detection, recovery, load balancing, and so on.

4. Master Cluster: the configuration servers, which record the physical storage nodes holding the replicas of each SQL Azure instance.

The Master Cluster is normally configured with seven machines and uses the "Quorum Commit" technique: a Master operation counts as successful only once it has been synchronized to at least four replicas, so the failure of fewer than four Master machines does not affect the service. All other component types are stateless, and the machines within each type are homogeneous. In the figure above, a request flows as follows:

1. The client establishes a connection to a Front-end machine, which checks whether the client's operation is supported; operations such as CREATE DATABASE can only be executed through the Azure utility tools.

2. The Front-end gateway machine performs an SSL handshake with the client and drops the connection if the client refuses to use SSL. Anti-attack protection is also applied at this stage, for example rejecting overly frequent access from a particular IP address or range of addresses.

3. The Front-end gateway machine asks the Utility Layer to perform the necessary validation, such as checking the requesting server's address against a whitelist.

4. The Front-end gateway machine asks the Master for the physical storage nodes that hold the replicas of the data partition the user is requesting.

5. The Front-end gateway machine asks the physical storage node in the Platform Layer to verify the user's database permissions.

6. If all of the above checks pass, the client and the storage node in the Platform Layer establish a new connection.

7~8. All subsequent client requests are sent directly to the physical storage node in the Platform Layer; the Front-end gateway merely forwards requests and responses, acting as an intermediate proxy.

Platform Layer

The Platform Layer is the storage node cluster, which runs the physical SQL Server servers. Client requests are forwarded through the Front-end gateway nodes to the data nodes in this layer. Each SQL Azure instance is one data partition of a SQL Server, and each data partition keeps three replicas on different SQL Server data nodes; at any moment only one replica is the Primary and the others are Secondaries. Writes use a "Quorum Commit" policy: a write returns to the client only after at least two replicas have succeeded, so the failure of a single data node does not interrupt normal service. The Platform Layer is structured as follows:

[Figure: Platform Layer architecture]

(Sorry, the figure is copied directly and its text is small; focus on the description of the storage node agent processes below.)

As the figure shows, each SQL Server data node serves at most 650 data partitions, and the write operations of all partitions on a node are recorded in a single operation log file, which improves the aggregate write performance. Data synchronization between a partition's replicas is done by shipping and replaying the operation log. Because the replicas of different partitions may live on different machines, a single SQL Server storage node may have to synchronize with up to 650 other storage nodes, so the log traffic aggregates poorly over the network; this is also the reason a single storage node serves at most 650 partitions.

As the figure also shows, each physical storage node runs a set of utility daemon processes (collectively called the fabric), roughly as follows:

1. Failure detection: detects data node failures and triggers the Reconfiguration process;

2. Reconfiguration Agent: after a node failure, rebuilds Primary or Secondary data partitions on data nodes;

3. PM (Partition Manager) Location Resolution: resolves the Master's address so that messages from the data node can be delivered to the Partition Manager on the Master;

4. Engine Throttling: limits the share of resources each logical SQL Azure instance may use, preventing it from exceeding its capacity limits;

5. Ring Topology: all data nodes form a ring, so each node has two neighbors that can detect whether it has gone down.

Distributed Systems Issues

1. Data Replication

SQL Azure uses a "Quorum Commit" policy: ordinary data is stored in three replicas, and a write succeeds once at least two replicas have been written; the Master stores seven replicas and requires at least four successful writes. Each SQL Server node appends its update operations to an operation log file and ships them over the network to the other two replicas. Because the replicas of different data partitions may live on different SQL Server machines, one storage node's operation log may have to be shipped to as many as 650 other machines, so the log shipping aggregates poorly over the network. To solve this problem, Yahoo's PNUTS distributes operation logs through a message broker, which achieves the desired aggregation.
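To make the quorum idea concrete, here is a toy sketch (an illustration of the general quorum-commit technique only, not SQL Azure's actual implementation): a write is acknowledged once a minimum number of replicas have applied it.

import random

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

    def apply(self, record):
        if random.random() < 0.1:   # simulate a failed or unreachable replica
            return False
        self.log.append(record)
        return True

def quorum_write(replicas, record, quorum):
    acks = sum(1 for rep in replicas if rep.apply(record))
    return acks >= quorum

# Ordinary data: 3 replicas with a quorum of 2; the Master cluster: 7 replicas
# with a quorum of 4.
data_replicas = [Replica(f"node{i}") for i in range(3)]
print(quorum_write(data_replicas, "UPDATE ...", quorum=2))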

2. Failure Detection and Recovery

The paper does not describe SQL Azure's failure detection in much detail. Roughly: each data node is monitored by a set of peer data nodes, and when a failure is detected it is reported to the controller node, which starts the recovery process. If it cannot be determined whether a data node is really down, for example when the monitored node is half-dead and has stopped answering commands, an arbitrator node makes the final decision. Deciding whether a machine is down requires some protocol support, which will be covered in a later article.

If a data node fails, the recovery process starts. The failed node served up to 650 logical SQL Azure instances (tablets), some of them as Primary and some as Secondary. The controller schedules the work and performs Reconfiguration, i.e. tablet re-replication, one data partition at a time. For a Secondary partition, a new replica is simply copied from the Primary; for a Primary partition, one of the other two replicas is first promoted from Secondary to the new Primary, after which the same procedure as for a Secondary partition is carried out. Priorities also need to be managed here: for example, a partition that has only one remaining replica should be re-replicated first, and a partition whose Primary is unavailable should first have one of its remaining replicas promoted to Primary. Some policies also need to be configured, such as how long a partition may remain at two replicas before the third replica is rebuilt; SQL Azure currently sets this to two hours.

3. Load Balancing

When a new data node joins, or an existing node is found to be overloaded, the controller starts the load balancing process. The factors in a data node's load include the number of reads and writes and the disk, memory, CPU and I/O usage. Note that when a new machine joins, the pace of tablet migration must be throttled; otherwise migrating a large number of tablets to the new machine at the same time makes the overall system slower rather than faster.

Because SQL Azure can bound the size of each logical SQL Azure instance, i.e. each tablet, it can, for simplicity, avoid implementing tablet splitting, which greatly simplifies the system.

4. Transactions

SQL Azure supports database transactions. Transaction-related SQL statements produce operation log records for BEGIN TRANSACTION, ROLLBACK TRANSACTION and COMMIT TRANSACTION. In SQL Azure these log records only need to be shipped to the other replicas; because a data partition has at most one Primary serving writes at any moment, no distributed transactions are involved. The transaction isolation level supported by SQL Azure is READ_COMMITTED.

5. Multi-Tenant Interference

In a cloud computing system the operations of different tenants interfere with one another, so the system resources used by each logical SQL Azure instance must be limited:

1. Operating system resource limits, such as CPU and memory: when exceeded, the client is told to retry after 10 seconds;

2. Logical SQL Azure database size limits: each logical database has a preset maximum size; when it is exceeded, update requests are rejected but deletes are still allowed;

3. Physical SQL Server database size limits: when exceeded, a system error is returned to the client and manual intervention is required.

Differences from SQL Server

1. Unsupported operations: as a platform aimed at enterprise applications, Microsoft Azure tries to support as many SQL features as possible, but some cannot be supported. One example is USE: SQL Server lets you switch databases with USE, but SQL Azure does not, because different logical databases may live on different physical machines. See SQL Azure vs. SQL Server for details.

2. A change of mindset: developers need to write programs with distributed systems in mind. For example, besides success and failure a connection now has a third, indeterminate state: the cloud did not return a result and there is no way to know whether the operation succeeded. Likewise, there is no free lunch as convenient as full SQL. For DBAs, routine maintenance such as upgrades and backups is handed over to Microsoft, which may free up more attention for the architecture of the business systems.

For the full picture, see the Inside SQL Azure paper on the Azure storage system architecture that Microsoft published recently.

from:http://www.nosqlnotes.net/archives/83

Windows Azure Storage Architecture Overview

Update 1-2-2012:  See the new post on Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency , which gives a much more detailed and up to date description of the Windows Azure Storage Architecture.

 

In this posting we provide an overview of the Windows Azure Storage architecture to give some understanding of how it works. Windows Azure Storage is a distributed storage software stack built completely by Microsoft for the cloud.

Before diving into the details of this post, please read the prior posting on Windows Azure Storage Abstractions and their Scalability Targets to get an understanding of the storage abstractions (Blobs, Tables and Queues) provided and the concept of partitions.

3 Layer Architecture

The storage access architecture has the following 3 fundamental layers:

  1. Front-End (FE) layer – This layer takes the incoming requests, authenticates and authorizes the requests, and then routes them to a partition server in the Partition Layer. The front-ends know what partition server to forward each request to, since each front-end server caches a Partition Map. The Partition Map keeps track of the partitions for the service being accessed (Blobs, Tables or Queues) and what partition server is controlling (serving) access to each partition in the system.
  2. Partition Layer – This layer manages the partitioning of all of the data objects in the system. As described in the prior posting, all objects have a partition key. An object belongs to a single partition, and each partition is served by only one partition server. This is the layer that manages what partition is served on what partition server. In addition, it provides automatic load balancing of partitions across the servers to meet the traffic needs of Blobs, Tables and Queues. A single partition server can serve many partitions.
  3. Distributed and replicated File System (DFS) Layer – This is the layer that actually stores the bits on disk and is in charge of distributing and replicating the data across many servers to keep it durable. A key concept to understand here is that the data is stored by the DFS layer, but all DFS servers are (and all data stored in the DFS layer is) accessible from any of the partition servers.

These layers and a high level overview are shown in the below figure:

[Figure: high-level overview of the three Windows Azure Storage layers]

Here we can see that the Front-End layer takes incoming requests, and a given front-end server can talk to all of the partition servers it needs to in order to process the incoming requests. The partition layer consists of all of the partition servers, with a master system to perform the automatic load balancing (described below) and assignments of partitions. As shown in the figure, each partition server is assigned a set of object partitions (Blobs, Entities, Queues). The Partition Master constantly monitors the overall load on each partition server as well as the individual partitions, and uses this for load balancing. Then the lowest layer of the storage architecture is the Distributed File System layer, which stores and replicates the data, and all partition servers can access any of the DFS servers.

Lifecycle of a Request

To understand how the architecture works, let’s first go through the lifecycle of a request as it flows through the system. The process is the same for Blob, Entity and Message requests:

  1. DNS lookup – the request to be performed against Windows Azure Storage does a DNS resolution on the domain name for the object’s Uri being accessed. For example, the domain name for a blob request is “<your_account>.blob.core.windows.net”. This is used to direct the request to the geo-location (sub-region) the storage account is assigned to, as well as to the blob service in that geo-location.
  2. Front-End Server Processes Request– The request reaches a front-end, which does the following:
    1. Perform authentication and authorization for the request
    2. Use the request’s partition key to look up in the Partition Map to find which partition server is serving the partition. See this post for a description of a request’s partition key.
    3. Send the request to the corresponding partition server
    4. Get the response from the partition server, and send it back to the client.
  3. Partition Server Processes Request– The request arrives at the partition server, and the following occurs depending on whether the request is a GET (read operation) or a PUT/POST/DELETE (write operation):
    • GET – See if the data is cached in memory at the partition server
      1. If so, return the data directly from memory.
      2. Else, send a read request to one of the DFS Servers holding one of the replicas for the data being read.
    • PUT/POST/DELETE
      1. Send the request to the primary DFS Server (see below for details) holding the data to perform the insert/update/delete.
  4. DFS Server Processes Request – the data is read/inserted/updated/deleted from persistent storage and the status (and data if read) is returned. Note, for insert/update/delete, the data is replicated across multiple DFS Servers before success is returned back to the client (see below for details).

Most requests are to a single partition, but listing Blob Containers, Blobs, Tables, and Queues, and Table Queries can span multiple partitions. When a listing/query request that spans partitions arrives at a FE server, we know via the Partition Map the set of partition servers that need to be contacted to perform the query. Depending upon the query and the number of partitions being queried over, the query may only need to go to a single partition server to process its request. If the Partition Map shows that the query needs to go to more than one partition server, we serialize the query by performing it across those partition servers one at a time sorted in partition key order. Then at partition server boundaries, or when we reach 1,000 results for the query, or when we reach 5 seconds of processing time, we return the results accumulated thus far and a continuation token if we are not yet done with the query. Then when the client passes the continuation token back in to continue the listing/query, we know the Primary Key from which to continue the listing/query.
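The Partition Map lookup described above can be pictured with a toy sketch (an illustration of the general idea only, not Microsoft's implementation): each front-end caches a sorted list of partition key-range boundaries together with the partition server responsible for each range, and routes a request by binary search on its partition key.

import bisect

# (exclusive upper bound of each partition's key range, serving partition server);
# a toy map that assumes keys sort below "~"
partition_map = [
    ("m", "partitionserver1.internal"),
    ("t", "partitionserver2.internal"),
    ("~", "partitionserver3.internal"),
]
upper_bounds = [upper for upper, _ in partition_map]

def route(partition_key: str) -> str:
    i = bisect.bisect_right(upper_bounds, partition_key)
    return partition_map[i][1]

print(route("account1/container/blob.txt"))   # -> partitionserver1.internal
print(route("photos/2012/cat.jpg"))           # -> partitionserver2.internal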

Fault Domains and Server Failures

Now we want to touch on how we maintain availability in the face of hardware failures. The first concept is to spread out the servers across different fault domains, so if a hardware fault occurs only a small percentage of servers are affected. The servers for these 3 layers are broken up over different fault domains, so if a given fault domain (rack, network switch, power) goes down, the service can still stay available for serving data.

The following is how we deal with node failures for each of the three different layers:

  • Front-End Server Failure – If a front-end server becomes unresponsive, then the load balancer will realize this and take it out of the available servers that serve requests from the incoming VIP. This ensures that requests hitting the VIP get sent to live front-end servers that are waiting to process requests.
  • Partition Server Failure – If the storage system determines that a partition server is unavailable, it immediately reassigns any partitions it was serving to other available partition servers, and the Partition Map for the front-end servers is updated to reflect this change (so front-ends can correctly locate the re-assigned partitions). Note, when assigning partitions to different partition servers no data is moved around on disk, since all of the partition data is stored in the DFS server layer and accessible from any partition server. The storage system ensures that all partitions are always served.
  • DFS Server Failure – If the storage system determines a DFS server is unavailable, the partition layer stops using the DFS server for reading and writing while it is unavailable. Instead, the partition layer uses the other available DFS servers which contain the other replicas of the data. If a DFS Server is unavailable for too long, we generate additional replicas of the data in order to keep the data at a healthy number of replicas for durability.

Upgrade Domains and Rolling Upgrade

A concept orthogonal to fault domains is what we call upgrade domains. Servers for each of the 3 layers are spread evenly across the different fault domains, and upgrade domains for the storage service. This way if a fault domain goes down we lose at most 1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y is the number of upgrade domains. To achieve this, we use rolling upgrades, which allows us to maintain high availability when upgrading the storage service.

The servers in each layer are broken up over a set of upgrade domains, and we upgrade a single upgrade domain at a time. For example, if we have 10 upgrade domains, then upgrading a single domain would potentially upgrade up to 10% of the servers from each layer at a time. A description of upgrade domains and an example of using rolling upgrades is in the PDC 2009 talk on Patterns for Building Scalable and Reliable Applications for Windows Azure (at 25:00).

We upgrade a single domain at a time for our storage service using rolling upgrades. A key part for maintaining availability during upgrade is that before upgrading a given domain, we proactively offload all the partitions being served on partition servers in that upgrade domain. In addition, we mark the DFS servers in that upgrade domain as being upgraded so they are not used while the upgrade is going on. This preparation is done before upgrading the domain, so that when we upgrade we reduce the impact on the service to maintain high availability.

After an upgrade domain has finished upgrading we allow the servers in that domain to serve data again. In addition, after we upgrade a given domain, we validate that everything is running fine with the service before going to the next upgrade domain. This process allows us to verify production configuration, above and beyond the pre-release testing we do, on just a small percentage of servers in the first few upgrade domains before upgrading the whole service. Typically if something is going to go wrong during an upgrade, it will occur when upgrading the first one or two upgrade domains, and if something doesn’t look quite right we pause upgrade to investigate, and we can even rollback to the prior version of the production software if need be.

Now we will go through the lower two layers of our system in more detail, starting with the DFS Layer.

DFS Layer and Replication

Durability for Windows Azure Storage is provided through replication of your data, where all data is replicated multiple times. The underlying replication layer is a Distributed File System (DFS) with the data being spread out over hundreds of storage nodes. Since the underlying replication layer is a distributed file system, the replicas are accessible from all of the partition servers as well as from other DFS servers.

The DFS layer stores the data in what are called “extents”. This is the unit of storage on disk and unit of replication, where each extent is replicated multiple times. The typical extent sizes range from approximately 100MB to 1GB in size.

When storing a blob in a Blob Container, entities in a Table, or messages in a Queue, the persistent data is stored in one or more extents. Each of these extents has multiple replicas, which are spread out randomly over the different DFS servers providing “Data Spreading”. For example, a 10GB blob may be stored across 10 one-GB extents, and if there are 3 replicas for each extent, then the corresponding 30 extent replicas for this blob could be spread over 30 different DFS servers for storage. This design allows Blobs, Tables and Queues to span multiple disk drives and DFS servers, since the data is broken up into chunks (extents) and the DFS layer spreads the extents across many different DFS servers. This design also allows a higher number of IOps and network BW for accessing Blobs, Tables, and Queues as compared to the IOps/BW available on a single storage DFS server. This is a direct result of the data being spread over multiple extents, which are in turn spread over different disks and different DFS servers, since any of the replicas of an extent can be used for reading the data.

For a given extent, the DFS has a primary server and multiple secondary servers. All writes go through the primary server, which then sends the writes to the secondary servers. Success is returned back from the primary to the client once the data is written to at least 3 DFS servers. If one of the DFS servers is unreachable when doing the write, the DFS layer will choose more servers to write the data to so that (a) all data updates are written at least 3 times (3 separate disks/servers in 3 separate fault+upgrade domains) before returning success to the client and (b) writes can make forward progress in the face of a DFS server being unreachable. Reads can be processed from any up-to-date extent replica (primary or secondary), so reads can be successfully processed from the extent replicas on its secondary DFS servers.

The multiple replicas for an extent are spread over different fault domains and upgrade domains, therefore no two replicas for an extent will be placed in the same fault domain or upgrade domain. Multiple replicas are kept for each data item, so if one fault domain goes down, there will still be healthy replicas to access the data from, and the system will dynamically re-replicate the data to bring it back to a healthy number of replicas. During upgrades, each upgrade domain is upgraded separately, as described above. If an extent replica for your data is in one of the domains currently being upgraded, the extent data will be served from one of the currently available replicas in the other upgrade domains not being upgraded.

A key principle of the replication layer is dynamic re-replication and having a low MTTR (mean-time-to-recovery). If a given DFS server is lost or a drive fails, then all of the extents that had a replica on the lost node/drive are quickly re-replicated to get those extents back to a healthy number of replicas. Re-replication is accomplished quickly, since the other healthy replicas for the affected extents are randomly spread across the many DFS servers in different fault/upgrade domains, providing sufficient disk/network bandwidth to rebuild replicas very quickly. For example, to re-replicate a failed DFS server with many TBs of data, with potentially 10s of thousands of lost extent replicas, the healthy replicas for those extents are potentially spread across hundreds to thousands of storage nodes and drives. To get those extents back up to a healthy number of replicas, all of those storage nodes and drives can be used to (a) read from the healthy remaining replicas, and (b) write another copy of the lost replica to a random node in a different fault/upgrade domain for the extent. This recovery process allows us to leverage the available network/disk resources across all of the nodes in the storage service to potentially re-replicate a lost storage node within minutes, which is a key property to having a low MTTR in order to prevent data loss.

Another important property of the DFS replication layer is checking and scanning data for bit rot. All data written has a checksum (internal to the storage system) stored with it. The data is continually scanned for bit rot by reading the data and verifying the checksum. In addition, we always validate this internal checksum when reading the data for a client request. If an extent replica is found to be corrupt by one of these checks, then the corrupted replica is discarded and the extent is re-replicated using one of the valid replicas in order to bring the extent back to healthy level of replication.

Geo-Replication

Windows Azure Storage provides durability by constantly maintaining multiple healthy replicas for your data. To achieve this, replication is provided within a single location (e.g., US South), across different fault and upgrade domains as described above. This provides durability within a given location. But what if a location has a regional disaster (e.g., wild fire, earthquake, etc.) that can potentially affect an area for many miles?

We are working on providing a feature called geo-replication, which replicates customer data hundreds of miles between two locations (i.e., between North and South US, between North and West Europe, and between East and Southeast Asia) to provide disaster recovery in case of regional disasters. The geo-replication is in addition to the multiple copies maintained by the DFS layer within a single location described above. We will have more details in a future blog post on how geo-replication works and how it provides geo-diversity in order to provide disaster recovery if a regional disaster were to occur.

Load Balancing Hot DFS Servers

Windows Azure Storage has load balancing at the partition layer and also at the DFS layer. The partition load balancing addresses the issue of a partition server getting too many requests per second for it to handle for the partitions it is serving, and load balancing those partitions across other partition servers to even out the load. The DFS layer is instead focused on load balancing the I/O load to its disks and the network BW to its servers.

The DFS servers can get too hot in terms of the I/O and BW load, and we provide automatic load balancing for DFS servers to address this. We provide two forms of load balancing at the DFS layer:

  • Read Load Balancing – The DFS layer maintains multiple copies of data through the multiple replicas it keeps, and the system is built to allow reading from any of the up to date replica copies. The system keeps track of the load on the DFS servers. If a DFS server is getting too many requests for it to handle, partition servers trying to access that DFS server will be routed to read from other DFS servers that are holding replicas of the data the partition server is trying to access. This effectively load balances the reads across DFS servers when a given DFS server gets too hot. If all of the DFS servers are too hot for a given set of data accessed from partition servers, we have the option to increase the number of copies of the data in the DFS layer to provide more throughput. However, hot data is mostly handled by the partition layer, since the partition layer caches hot data, and hot data is served directly from the partition server cache without going to the DFS layer.
  • Write Load Balancing – All writes to a given piece of data go to a primary DFS server, which coordinates the writes to the secondary DFS servers for the extent. If any of the DFS servers becomes too hot to service the requests, the storage system will then choose different DFS servers to write the data to.

Why Both a Partition Layer and DFS Layer?

When describing the architecture, one question we get is why do we have both a Partition layer and a DFS layer, instead of just one layer both storing the data and providing load balancing?

The DFS layer can be thought of as our file system layer, it understand files (these large chunks of storage called extents), how to store them, how to replicate them, etc, but it doesn’t understand higher level object constructs nor their semantics. The partition layer is built specifically for managing and understanding higher level data abstractions, and storing them on top of the DFS.

The partition layer understands what a transaction means for a given object type (Blobs, Entities, Messages). In addition, it provides the ordering of parallel transactions and strong consistency for the different types of objects. Finally, the partition layer spreads large objects across multiple DFS server chunks (called extents) so that large objects (e.g., 1 TB Blobs) can be stored without having to worry about running out of space on a single disk or DFS server, since a large blob is spread out over many DFS servers and disks.

Partitions and Partition Servers

When we say that a partition server is serving a partition, we mean that the partition server has been designated as the server (for the time being) that controls all access to the objects in that partition. We do this so that for a given set of objects there is a single server ordering transactions to those objects and providing strong consistency and optimistic concurrency, since a single server is in control of the access of a given partition of objects.

In the prior scalability targets post we described that a single partition can process up to 500 entities/messages per second. This is because all of the requests to a single partition have to be served by the assigned partition server. Therefore, it is important to understand the scalability targets and the partition keys for Blobs, Tables and Queues when designing your solutions (see the upcoming posts focused on getting the most out of Blobs, Tables and Queues for more information).

Load Balancing Hot Partition Servers

It is important to understand that partitions are not tied to specific partition servers, since the data is stored in the DFS layer. The partition layer can therefore easily load balance and assign partitions to different partition servers, since any partition server can potentially provide access to any partition.

The partition layer assigns partitions to partition servers based on each partition's load. A given partition server may serve many partitions, and the Partition Master continuously monitors the load on all partition servers. If it sees that a partition server has too much load, the partition layer will automatically load balance some of the partitions from that partition server to a partition server with low load.

When reassigning a partition from one partition server to another, the partition is offline only for a handful of seconds, in order to maintain high availability for the partition. And in order to make sure we do not move partitions around too much or make decisions too quickly, the time it takes to decide to load balance a hot partition server is on the order of minutes.

Summary

The Windows Azure Storage architecture has three main layers – Front-End layer, Partition layer, and DFS layer. For availability, each layer has its own form of automatic load balancing and dealing with failures and recovery in order to provide high availability when accessing your data. For durability, this is provided by the DFS layer keeping multiple replicas of your data and using data spreading to keep a low MTTR when failures occur. For consistency, the partition layer provides strong consistency and optimistic concurrency by making sure a single partition server is always ordering and serving up access to each of your data partitions.

Brad Calder

from:http://blogs.msdn.com/b/windowsazurestorage/archive/2010/12/30/windows-azure-storage-architecture-overview.aspx