A Deep Dive into DynamoDB Partitions

Databases are the backbone of most modern web applications and their performance plays a major role in user experience. Faster response times – even by a fraction of a second – can be the major deciding factor for most users to choose one option over another. Therefore, it is important to take response rate into consideration whilst designing your databases in order to provide the best possible performance. In this article, I’m going to discuss how to optimise DynamoDB database performance by using partitions.

Introducing Partitions

DynamoDB performance starts and ends with the concept of partitions. Partitions are the units of storage and performance. If you don't understand partitions, you will not be able to design highly effective and available databases with DynamoDB, so it's worth understanding what's going on under the hood.

Initially, when you create a table in DynamoDB, a single partition is created and allocated to the table. Any operations on this table – such as inserts, deletes and updates – will be handled by the node where this partition is stored. It is important to remember that you do not have full control over the number of partitions created, but it can be influenced.

One partition can handle 10GB of data, 3000 read capacity units (RCU) and 1000 write capacity units (WCU), indicating a direct relationship between the amount of data stored in a table and performance requirements. A new partition will be added when more than 10GB of data is stored in a table, or RCUs are greater than 3000, or WCUs are greater than 1000. Then, the data will get spread across these partitions.

So how does DynamoDB spread data across multiple partitions? The partition that a particular row is placed within is selected based on a partition key. For each unique partition key value, the item gets assigned to a specific partition.

Let’s use an example to demonstrate. The below table shows a list of examinations and students who have taken them.

table

In this example, there is a many-to-one relationship between an exam and a student (for the sake of simplicity, we’ll assume that students do not resit exams). If this table was just for all the students at a particular school, the dataset would be fairly small. However, if it was all the students in a state or country, there could be millions and millions of rows. This might put us within range of the data storage and performance limits that would lead to a new partition being required.

Below is a virtual representation of how the above data might be distributed if, based on the required RCU and WCU or the size of the dataset, DynamoDB were to decide to scale it out across 3 partitions:

AWS Network Diagram

As we can see above, each exam ID is assigned to a unique partition. A single partition may host multiple partition key values based on the size of the dataset, but the important thing to remember here is that one partition key can only be assigned to a single partition. One exam can be taken by many students. Therefore, the student ID becomes a perfect sort key value to query this data (as it allows sorting of exam results by student ID).

By adding more partitions, or by moving data between partitions, indefinite scaling is possible, based on the size or the performance requirements of the dataset. However, it is also important to remember that there are serious limitations that must be considered.

Firstly, the number of partitions is managed by DynamoDB: partitions are added to accommodate increasing dataset size or increasing performance requirements. While this is true for increasing the number of partitions, there is no automatic decrease in partitions when capacity or performance requirements are reduced.

This leads us to our next important point: the allocated RCU (read capacity unit) and WCU (write capacity unit) values are spread across the number of partitions. Consider, for example, that you need 30000 RCUs to be allocated to the table. The maximum a single partition can support is 3000 RCUs. Therefore, to accommodate the request, DynamoDB will automatically create 10 partitions.

If you are increasing your RCU and WCU via the console, AWS will provide you with an estimated cost per month, as below:

RCU_WCU_Increase

Using the exam-student example, the dataset for each exam is assigned to one partition, which, as you will recall, can hold up to 10GB of data, 3000 RCUs and 1000 WCUs. Yet each exam can have millions of students, so the size of this dataset may go well beyond the 10GB capacity limit (which must be kept in mind when selecting partition keys for a specific dataset).

Increasing the RCU or WCU values for a table beyond 3000 RCUs and 1000 WCUs prompts DynamoDB to create additional partitions with no way to reduce the number of partitions even if the number of required RCUs and WCUs drops. This can lead to a situation where each partition only ends up having a tiny number of RCUs and WCUs.

AWS Network Diagram

Because it is possible to run into performance issues due to throttling – even though the overall assigned RCUs and WCUs are appropriate for the expected load – a formula can be created to calculate the desired number of partitions whilst taking performance into consideration.

Based on our required read performance,

Partitions for desired read performance = 
  Desired RCU / 3000 RCU

and based on our required write performance,

Partitions for desired write performance = 
  Desired WCU / 1000 WCU

Giving us the number of partitions needed for the required performance,

Total partitions for desired performance = 
  (Desired RCU / 3000 RCU) + (Desired WCU / 1000 WCU)

But that’s only the performance aspect. We also have to look at the storage aspect. Assuming the max capacity supported by a single partition is 10GB,

Total partitions for desired storage = Desired capacity in GB / 10GB

The following formula can be used to calculate the total number of partitions needed to accommodate both the performance and the storage requirements:

Total partitions = 
  MAX(Total partitions for desired performance, 
      Total partitions for desired storage)

As an example, consider the following requirements:

  • RCU Capacity:  7500
  • WCU Capacity: 4000
  • Storage Capacity: 100GB

The required number of partitions for performance can be calculated as:

(7500/3000) + (4000/1000) = 2.5 + 4 = 6.5

We’ll round this up to the nearest whole number: 7.

The required number of partitions for capacity is:

100/10 = 10

So the total number of partitions required is:

MAX(7, 10) = 10
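
As a quick sanity check, the arithmetic above can be wrapped in a few lines. This is a back-of-the-envelope sketch of the formula described in this article, not an AWS API; the 3000 RCU, 1000 WCU and 10GB constants are the per-partition limits quoted earlier.

import math

def estimate_partitions(rcu, wcu, storage_gb):
    # Partitions needed for throughput (3000 RCU / 1000 WCU per partition)
    for_performance = math.ceil(rcu / 3000 + wcu / 1000)
    # Partitions needed for storage (10GB per partition)
    for_storage = math.ceil(storage_gb / 10)
    return max(for_performance, for_storage)

# The worked example above: 7500 RCU, 4000 WCU, 100GB -> MAX(7, 10) = 10
print(estimate_partitions(7500, 4000, 100))  # 10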

A critical factor is that the total RCU and WCU is split equally across the total number of partitions. Therefore, you will only get the total allocated RCU and WCU amounts for a table if you are reading and writing in parallel across all partitions. This can only be achieved via a good partition key model, meaning a key that is evenly distributed across the key space.

Picking a good partition key

There is no universal answer when it comes to choosing a good key – it all depends on the nature of the dataset. For a low-volume table, the key selection doesn't matter as much (3000 RCU and 1000 WCU with a single partition is achievable even with a badly-designed key structure). However, as the dataset grows, the key selection becomes increasingly important.

The partition key must be specified at table creation time. If you're using the console, you'll see something similar to the below:

DynamoDB_·_AWS_Console

Or if you’re using the CLI, you’d have to run something like,

aws dynamodb create-table \
  --table-name us_election_2016 \
  --attribute-definitions \
  AttributeName=candidate_id,AttributeType=S \
  AttributeName=voter_id,AttributeType=S \
  --key-schema AttributeName=candidate_id,KeyType=HASH AttributeName=voter_id,KeyType=RANGE \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1

The first criterion for choosing a good partition key is to select an attribute that has as many distinct values as possible. For example, you would choose an employee ID when there are many employees; what you should not select is a department ID when there are only a handful of departments.

The next criterion is to pick an attribute with uniformity of access across all key values. For example, in a voting record system, selecting the candidate ID would be ideal if you expect each candidate to receive a similar number of votes. If one or two candidates are likely to receive 90% of the votes, this becomes less optimal.

Another criterion for a good partition key is that the attribute's reads and writes should be spread evenly across time. If these properties are hard to achieve with an existing attribute, it's worth looking at a synthetic or hybrid value.

Let’s look at an example that uses the 2016 US Elections to highlight everything we’ve just discussed. Specifically, we want to store a record of all of the votes for all of the candidates.

Each political party will have many candidates competing for the party’s electoral nomination. You may have anywhere from two to ten candidates. The problem is that votes between candidates will not be distributed uniformly – there will be one or two candidates that will receive the majority of the votes.

For the sake of this example, let's assume that we expect 10000 WCU worth of votes to be received. Say that, in the first instance, we create a table and naively select the candidate ID as the partition key, and date/time as a range key.

DynamoDB will create 10 partitions for this example (based on our previous formula, 10 partitions are needed to support 10000 WCU). If we also assume we have 10 candidates, DynamoDB will spread these partition keys across the 10 partitions as shown here:

election_1

This model is deeply flawed. Firstly, we are limiting the performance available for a candidate to a value much lower than 10000 WCU. As discussed above, real-world voting will be heavily weighted towards one or two popular candidates, so the performance allocated to the least popular candidates is simply wasted WCU.

Even if we assume voting is uniformly weighted between candidates, their voters may be located in different time zones and may vote at different times. Therefore, there might be spikes of votes for certain candidates at specific times compared to others. Even with carefully designed partition keys, you can run into time-based issues like this.

Let's think about a case where there are only two candidates in the national election. To support the performance requirement, 100000 WCU are assigned, and DynamoDB will create 100 partitions to support this. However, if the candidate ID is chosen as the partition key, each candidate's data will be limited to one partition – even though there are 98 unused partitions. Consequently, the storage limit will be hit quickly, causing the application to fail and stop recording further votes.

This issue is resolved by introducing a key-sharding plan. For each candidate – i.e. for each partition key – the key is prefixed with a value from 1 to 10 or 1 to 1000 (depending on the size of your dataset). This gives us a much wider range of partition keys, which means DynamoDB will distribute the data evenly across multiple partitions. It'll look a bit like this:

election_2
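
To make the key-sharding idea concrete, here is a minimal client-side sketch, assuming the boto3 SDK and the us_election_2016 table defined earlier; the shard count and the record_vote helper are illustrative, not part of the original article.

import random
import boto3

NUM_SHARDS = 10  # assumption: tune to the number of partitions you expect

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("us_election_2016")

def record_vote(candidate_id, voter_id):
    # Prefix the candidate ID with a random shard number, e.g. "7.candidate_a",
    # so one candidate's votes spread across many partition key values.
    sharded_key = f"{random.randint(1, NUM_SHARDS)}.{candidate_id}"
    table.put_item(Item={
        "candidate_id": sharded_key,  # partition key
        "voter_id": voter_id,         # sort key
    })

# Reading a candidate's votes back means querying each shard and merging the results.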

Now, we can look at the histogram before key sharding:

heat_1

Where the corresponding partition keys will look something like (please note, for this example, I’ve only inserted data for 2 candidates):

before_sharding.png

Now here's the histogram after key sharding:

heat_2

We can see how, with a key-sharding plan, the load is much more evenly distributed across the partitions and throttling is minimal. The corresponding partition keys will look like this:

after_sharding

Conclusion

There are many other factors that need to be considered when designing data models in DynamoDB, such as Local Secondary Indexes and Global Secondary Indexes. For further information on these indexes, check out the AWS documentation to understand how they may impact database performance.

Database modelling is very important when choosing a database structure, and it's essential for an optimally-performing application. Even though DynamoDB is a fully managed and highly scalable database service, it still comes down to you to design a solid data model. No matter how powerful DynamoDB is, a poorly designed database model will cause your application to perform poorly.

From: A Deep Dive into DynamoDB Partitions – Shine Solutions Group

Building a Simple OpenAPI Gateway

An API gateway is the single entry point for request traffic. It adapts to different channels and business lines and handles protocol termination, routing and message transformation, synchronous and asynchronous invocation, and so on, in order to manage API endpoints and control request traffic. In a microservice architecture, the gateway is especially important.

Preview diagram

Background

Of course, there is already plenty of open-source software in this space, such as Kong, Gravitee and Zuul.

These open-source gateways are certainly full-featured, but for our business they are a bit too heavyweight, and we have some customization requirements. So we built a lightweight OpenAPI gateway of our own, mainly for third-party channel integrations.

Overview

Features

API authentication

  • Requests expire automatically after 5s
  • Parameters are signed with md5
  • Module-level permission control

API versioning

  • Requests can be forwarded to different services
  • Requests can be forwarded to different endpoints of the same service

Event callbacks

  • Event subscription
  • At most 3 retries
  • Retry intervals follow a back-off policy (30s, 60s, 180s)

System Architecture

For the third-party API request path: a third-party channel calls the OpenAPI gateway over HTTP, the gateway forwards the request to the corresponding internal service endpoint, and that endpoint layer calls the service layer over gRPC; once the request has been processed, the responses are returned back along the same path.

For the event callback path: the service layer sends an event callback request to the OpenAPI gateway over HTTP and immediately gets a success response; the OpenAPI gateway then completes the callback to the third-party channel asynchronously.

System architecture diagram

Implementation

Gateway configuration

Because the gateway holds configuration for both internal services and third-party channels, and to support hot-reloading of configuration, we store the configuration in ETCD in JSON format.

Configuration categories

The configuration falls into the following 3 categories:

  • Third-party AppId configuration
  • Mapping between external and internal APIs
  • Internal service addresses

Configuration structure

a. Third-party AppId configuration

AppId configuration diagram

b. Internal service addresses

Internal service addresses diagram

c. Mapping between external and internal APIs

API mapping diagram

Configuration updates

Using ETCD's watch mechanism, hot-reloading of the configuration is easy to implement.

Configuration hot-reload diagram

There are, of course, still cases where the configuration has to be pulled proactively, for example when the service restarts.

Configuration pull diagram
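
As an illustration of the watch-plus-initial-pull flow described above, here is a rough sketch in Python, assuming the python-etcd3 client and the /openapi/ key prefix used in the examples below; the real gateway is written in Go, so treat this purely as pseudocode for the flow.

import json

import etcd3  # assumption: python-etcd3 client; the real gateway uses a Go ETCD client

etcd = etcd3.client(host="127.0.0.1", port=2379)
config = {}

def load_all():
    # Proactive pull, e.g. when the service (re)starts
    for value, meta in etcd.get_prefix("/openapi/"):
        config[meta.key.decode()] = json.loads(value)

def watch_updates():
    # Hot reload: every put under /openapi/ refreshes the in-memory config
    events, cancel = etcd.watch_prefix("/openapi/")
    for event in events:
        if event.value:  # ignore delete events in this sketch
            config[event.key.decode()] = json.loads(event.value)

load_all()
watch_updates()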

API Endpoints

The sequence of a third-party API call is roughly as follows:

API call sequence diagram

Parameter format

To simplify integration, we standardized the request parameter format of the API. Both POST and GET requests are supported.

API call sequence diagram

API signature

The signature uses an md5 hash; the algorithm can be described as follows:

1. Concatenate the values of the parameters p, m, a, t, v, ak and the secret, in that order, to obtain a string;
2. md5 the string from step 1 and take the first 16 characters to obtain a new string;
3. Convert the string from step 2 to lowercase; that is the signature.

A PHP version of the request looks like this:

$appId = 'app id';
$appSecret = 'app secret';
$api = 'api method';

// business parameters
$businessParams = [
  'orderId' => '123123132',
];

$time = time();
$params = [
  'p'  => json_encode($businessParams),
  'm'  => 'inquiry',
  'a'  => $api,
  't'  => $time,
  'v'  => 1,
  'ak' => $appId,
];

$signStr = implode('', array_values($params)) . $appSecret;
$sign = strtolower(substr(md5($signStr), 0, 16));

$params['s'] = $sign;
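
For comparison, here is a minimal sketch of the corresponding check on the gateway side (Python is used purely for illustration; the concatenation order p, m, a, t, v, ak plus the secret follows the algorithm above, and the 5s window follows the request-expiry rule in the feature list).

import hashlib
import time

def verify_sign(params, secret, max_age_seconds=5):
    # Reject requests outside the allowed time window (5s, per the feature list)
    if abs(time.time() - int(params["t"])) > max_age_seconds:
        return False
    # Same concatenation order as the signing algorithm: p, m, a, t, v, ak + secret
    raw = "".join(str(params[k]) for k in ("p", "m", "a", "t", "v", "ak")) + secret
    expected = hashlib.md5(raw.encode("utf-8")).hexdigest()[:16].lower()
    return params.get("s") == expected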

API versioning

Different API versions can forward requests to different services, or to different endpoints of the same service.

API versioning diagram

Event callbacks

Through the event callback mechanism, third parties can subscribe to the events they care about.

API versioning diagram

Integration

Channel onboarding

Only the third-party AppId information needs to be configured, including the secret, the callback URL and the module permissions.

Channel AppId configuration diagram

That is, run the following against ETCD:

$ etcdctl set /openapi/app/baidu '{
    "Id": "baidu",
    "Secret": "00cf2dcbf8fb6e73bc8de50a8c64880f",
    "Modules": {
        "inquiry": {
            "module": "inquiry",
            "CallBack": "http://www.baidu.com"
        }
    }
}'

Service onboarding

a. Configure the internal service address

Internal service address configuration diagram

That is, run the following against ETCD:

$ etcdctl set /openapi/backend/form_openapi '{
    "type": "form",
    "Url": "http://med-ih-openapi.app.svc.cluster.local"
}'

b. Configure the external-to-internal API mapping

Internal service address configuration diagram

Again, run the following against ETCD:

$ etcdctl set /openapi/api/inquiry/createMedicine.v2 '{
    "Module": "inquiry",
    "Method": "createMedicine",
    "Backend": "form_openapi",
    "ApiParams": {
        "path": "inquiry/medicine-clinic/create"
    }
}'

c. Integrate event callbacks

The integrating service also needs to onboard the same way a third party does and apply for an AppId. The agreed callback business parameters are:

Internal service address configuration diagram

A Golang version of the integration looks like this:

const (
	AppId = "__inquiry"
	AppSecret = "xxxxxxxxxx"
	Version = "1"
)

type CallbackReq struct {
	TargetAppId string                 // target App Id
	Module      string                 // target module
	Event       string                 // event
	Params      map[string]interface{} // parameters
}

func generateData(req CallbackReq) map[string]string {
    params, _ := json.Marshal(req.Params)
	p := map[string]interface{}{
		"ak": req.TargetAppId,
		"m":  req.Module,
		"e":  req.Event,
		"p":  string(params),
	}

	pStr, _ := json.Marshal(p)
	postParams := map[string]string{
		"p":  string(pStr),
		"m":  "callback",
		"a":  "callback",
		"t":  fmt.Sprintf("%d", time.Now().Unix()),
		"v":  Version,
		"ak": AppId,
	}

	postParams["s"] = sign(getSignData(postParams) + AppSecret)
	
	return postParams
}

func getSignData(params map[string]string) string {
	return strings.Join([]string{params["p"], params["m"], params["a"], params["t"], params["v"], params["ak"]}, "")
}

func sign(str string) string {
	return strings.ToLower(utils.Md5(str)[0:16])
}

Future Plans

  • Support configuring AppIds from an admin console
  • Support manual retry of failed event callback requests
  • Request rate limiting

All About Serverless Computing

Technologies and operational systems in businesses tend to evolve every 6 months. Certainly, matching up with the market trend every time there’s a turn is a humongous task. Imagine how much cost and effort would be saved if they were auto-scalable.

While there are many ways to enhance the scalability of a system, this article will talk about AWS serverless technology, which is known to take businesses to a new level of productivity and scalability. The next big question is: why is it called serverless? There are still servers in serverless computing, but the term describes the customer's experience of them: the servers are invisible, never physically in front of the customer, and the customer doesn't have to manage or interact with them in any way.

We can dive deeper only after we have understood the true meaning of serverless computing.

What Is Serverless Computing?

It is a cloud computing execution model that provisions computing resources on demand. It offloads all common infrastructure management tasks, such as patching, provisioning, scheduling and scaling, to cloud providers and tools, allowing engineers to focus on the application customization their clients require.

Features of serverless computing

  • It does not require monitoring and management of infrastructure, which gives developers more time to optimize code and explore innovative ideas for adding features and functionality to the application.

  • Serverless computing runs code on demand only, typically in a stateless container, and only when there is a request; scaling is transparent with the number of requests being served (see the sketch after this list).

  • Serverless computing charges only for what is being used, not for idle capacity.
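
To make "runs code on demand in a stateless container" concrete, here is a minimal AWS Lambda-style handler sketch in Python; the event shape and function name are illustrative only.

import json

def handler(event, context):
    # Invoked per request; no server to provision, patch or scale
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }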

Benefits of Serverless Computing

The serverless market is estimated to grow to around $20B USD by 2025. These striking figures are owed to the many advantages serverless computing has over traditional, server-centric cloud infrastructure. Below are some of the important benefits offered by top serverless cloud computing service providers.

No worries about server maintenance

Because the servers are managed entirely by the vendor, serverless can reduce the investment necessary in DevOps. This not only lowers expenses but also frees up developers to create and expand their applications without being held back by server capacity.

Codes can be used to reduce latency

Since the application is not hosted on an origin server, its code can be run from anywhere. Depending on the provider, it is possible to run application functions on servers that are close to the end users. This reduces latency because requests from the user no longer have to travel all the way to an origin server.

Serverless architecture is scalable

Applications built on serverless architecture scale up automatically during traffic spikes and scale down during quiet periods. Additionally, if a function needs to run in multiple instances, the vendor's servers will start, run and end them as required, often using containers. A serverless application can therefore handle a high number of requests just as well as a single request.

Quick deployments are possible

There is no need for the developer to upload code to servers or do any backend configuration in order to release a working application. Uploading small pieces of code, or one function at a time, lets an application be released quickly. This is possible because the application is not a single monolithic stack but rather a collection of functions provisioned by the vendor. It also makes it easier to patch, fix and add new features to an application.

They are fault-tolerant

It is not the responsibility of the developers to ensure the fault tolerance of the serverless architecture. The cloud provider automatically allocates the IT infrastructure needed to account for any kind of failure.

No upfront costs

Users need to pay only for the running code and there are no upfront costs involved while deploying the serverless cloud infrastructure to build an application.

Why Would You Need Help?

Every technology has its own set of drawbacks which needs expert consultation and technology expertise. Some of the disadvantages of using serverless applications are as follows:

Debugging and testing become tough

It is difficult to replicate the serverless environment in order to check for bugs and see how the code will perform once deployed. Debugging is extremely difficult because developers are not aware of the backend process. Moreover, applications here are broken up into separate, smaller functions.

Solution: Businesses planning to use serverless applications should look for serverless cloud infrastructure providers or vendors with expertise in sandbox technology, who can help reduce the difficulties in testing and debugging.

Be prepared for a new set of security concerns

When applications are run on serverless platforms, the developers do not have access to the security systems, or might not be able to supervise them, which could be a big issue for platforms handling crucial and secret data. Since companies do not have their own assigned servers, serverless providers will often be running code from several of their customers on the same infrastructure, a scenario known as multitenancy. If not implemented properly, this can lead to data exposure.

Solutions: Software service providers that sandbox functions avoid the impacts of multi-tenancy. They also have a powerful infrastructure that avoids data leaks.

Not best for long-term processes

Most long-running applications do not fit the bill: because providers charge only for the time the code is running, a long-standing workload can cost more on serverless architecture than on a traditional one.

Solution: IT consultancy can help businesses understand whether their requirements will be fulfilled by serverless architectures or not. It is advisable to engage IT consultants and solution providers to get the right guidance. This will save businesses not just money but also time.

Risk of cold-start

Since the servers are not constantly in use, the code might require a 'boot up' period when it is invoked. This startup can affect the performance of the application. If the code is used regularly, however, the serverless provider is responsible for keeping it ready for whenever it needs to be activated; a request for this ready-to-go code is called a "warm start".

Solution: Experienced serverless cloud service providers can avoid the cold start by using the Chrome V8 engine, which can restart the application in less than 5 milliseconds. Experts with good exposure to such a setup can manage the performance lag without customers even noticing.

Type-set applications

Serverless cloud applications are often tied to a single vendor and hard to move to another vendor when the time for a transition comes, because the architecture and the workflow vary from one vendor to another.

Solution: Expert service providers can help you migrate applications written in JavaScript against the widely used Service Workers API. This helps with fast and seamless migrations without errors and failures.

Moving to serverless? Get the best help you need from trained developers and expert cloud consultants. Learn all about data pipeline architecture and sync serverless deployments while speeding migration times and reducing costs.

Interdisciplinary General Knowledge

Video link

These are the insights "Suozhang" Lin Chao arrived at after reading Charlie Munger's ideas. The course is committed to pragmatism.

Real-life challenges are not divided up along university discipline lines, but each discipline gives us important mental models for solving problems.

The course introduces roughly 22 disciplines and 120 common mental models, along with some applications:

thermodynamics, functions, engineering, complexity science, systems theory, information theory, accounting, probability theory, finance, biology, investing, sociology, management, physics, neuroscience, cognitive psychology, history, linguistics, logic, economics, marketing, philosophy

Most people spend their whole lives trying to solve every problem with a single thin knowledge structure; this is the narrow thinking that specialization brings.


Talent like me :laughing: (just kidding) should gradually build their own knowledge system. Haha, that's what I've been doing: in my first year of university I touched on applied mathematics, game theory, economics, investing, psychology, information theory and operations research, but my study method was wrong. What I need is not to become an expert in every field, but to specialize in one or two (data science + economics) and pick up the thinking models of the other disciplines. I'll test the waters with Lin Chao's course first.


Ages 20-35 are the golden years of life, so seize them!

1. Entropy and Thermodynamics

Basics

S = k * ln W, i.e. entropy increases with the number of microstates W

more possibilities = higher entropy = more disorder

In this world, disorder is the norm; order has to be deliberately created.

The law of increasing entropy

A closed system, isolated from its surroundings, tends toward disorder over time.

How to move toward order

As long as it does these two things (perception and choice) well, it can make things more ordered.


Application: a decision layer for your thinking

Put a "little person" in your head to judge whether you should keep entertaining the countless, disordered thoughts, and your thinking becomes ordered. This is effectively adding a decision layer to the normal input-output model of thinking! Methodology really does determine how you respond to the outside world. Impressive.


Dissipative structures

Perception and choice require information and energy; the closed system becomes an open system that absorbs energy and information from the outside.

"Running water never goes stale" describes a dissipative structure, and so does the human body. A dissipative structure is a dynamic equilibrium: it keeps changing while maintaining a balanced state.

Operating strategy

PS: Although many people already do this, they never distill it into a conclusion. Without a stated conclusion, you think ineffectively and hesitate when applying it; but once it is treated as an axiom, a theorem, a decision rule, it becomes extremely valuable!

People should put all of their focus on the blue causes (in the diagram) and accept the red causes, rather than blaming themselves for the red outcomes, which only stops the body from continuing to expel entropy. So the empty "chicken soup" advice about relying on willpower and self-blame is really just grabbing the wrong end of the problem.

We can't change the overall system, but we can change the perception + choice part.

Some people stay trapped in the red outcomes produced by the family environment they were born into; others find an "iron rice bowl" and sink into a closed system. This world is full of dialectics, and within dissipation it is choice that matters most. For how to put this into practice, see below 👇

Gradually, both modes form solid closed loops. Mode B always comes with pain; it goes against human nature and is full of difficulty.

Engineering teaches us how to break difficulties down into small tasks and knock them out one by one.

2. Engineering

Many people back off when they see difficulty; in fact, they just don't know enough yet.

Engineering has a huge number of sub-disciplines: biological, agricultural, molecular, civil, software, forestry...

This lesson distills the important methodology they share: action over words, decomposition, quantification, checklists, and trade-offs.

Action over words

"Catch the wind" (opportunity-first) thinking finds the big direction; it is high risk, so pick your moment carefully before acting.

Engineering thinking prefers what is tangible: things whose returns can be clearly seen and that can actually be put into practice.

Personally I still prefer the opportunity-first mindset, because I really can't stand the tedium of grinding, but once the right opportunity is found, getting results does take grinding ┭┮﹏┭┮

Sources of motivation:

This connects to the physiology discussed later:

Endorphins last longer and are better for physical and mental health; grinding clearly belongs to the latter.

People tend to get used to dopamine-style pleasure: it is exciting but short-lived, and when it fades it brings negative emotions such as disappointment.

When material goods were scarce, it usually took a string of small endorphin pleasures before one big dopamine pleasure arrived.

But nowadays big stimuli are within easy reach, which leads to a vicious cycle 👇:

The real formula:

Get endorphin-style satisfaction from the daily grind, and only once you have achieved the result allow yourself some dopamine-style pleasure and a complete break.

If you haven't achieved the result, hold back. That is the virtuous cycle.

Work breakdown structure

Breaking any problem down into parts is the most central idea in engineering.

It comes with a twin skill: focus

focus like a laser, not like a flash

Decompose without limit until you find a handhold; the jump from thinking to doing happens in that instant. In neuroscience terms, it is the peripheral neurons of a new brain region starting to activate the whole of it.

See the trees and the forest.

image-20210809170229459

Application: approaching a brand-new discipline

Find the most authoritative textbook in the field, read the table of contents, and build a framework:

  • Read the table of contents and look up every term you don't know
  • Work out the relationships between the concepts and build a macro, global framework

This is the "knowing" part; a global understanding is also very important.


That is the real unity of knowing and doing. I'm exactly the kind of person who likes looking at the big picture, then doesn't decompose it far enough, and in the end is too lazy to grind, hahaha.

Quantification

image-20210809170731986

At first you only think of a few variables, such as rent and the selling price.

Then start narrating, working it out step by step by "telling the story":

Starting from opening day you need branding and fit-out; once running you need training, equipment, raw materials...

Finally, do the arithmetic.

There are two powerful business analysis frameworks, covered later:

image-20210809171033121

This is actually very simple, especially since I'm a math whiz.

The OKR model

image-20210809171243197

Decompose goals vertically, quantify them horizontally. Example:

image-20210809171352398

This also lays a solid foundation for getting things done!

Checklist thinking

Corresponding to the to-do list and the time schedule.

Very simple, but extremely useful.

Writing things down helps you:

  • Set priorities
  • Stay concentrated; it supports "focus"
  • Scrutinize the plan and find the spots that don't hold up
  • Save brain resources!! (This is why I started making lists back then; trying to hold it all in my head was miserable)

image-20210809191126647

The to-do list and the time schedule are like two dimensions of the same thing, one task-centric and one time-centric; I recommend Feishu (Lark) sheets.

Trade-offs

For most people, the difficulty lies more in the giving-up side.

Generally you can pick at most two.

image-20210809191611562

Engineering thinking: cheap + fast (anti-perfectionist, counter-instinctive)

Artistic thinking: good

3. Systems Theory

Can it be used to analyze companies?

Engineering thinking is reductionism by decomposition; this, by contrast, is holism.

The point: if you only change local parts without looking at the system, the system may keep pulling you back in the end, for example a closed-loop system 👇

image-20210812003721985

Systems theory essentially lays out the mechanics of these "flywheels", letting us grasp the logic as a whole.

Systems

Many entities, through their mutual connections, form a whole that operates by its own rules.

image-20210812005240164

image-20210812005306703

Elements

+ positive information, positive energy, assets

– negative information, negative energy, liabilities. Negative information is information that makes the world more chaotic, such as lies and rumors.

Relationships

+ reinforcing, – weakening

Four kinds of loops

positive element + reinforcing relationship = virtuous cycle

negative element + reinforcing relationship = vicious cycle

negative element + weakening relationship = a sobering moment (error correction)

positive element + weakening relationship = regression to mediocrity (the last two are both mean reversion)

Lag effects

Negative loops

Use cause-and-effect analysis to find the negative loops in your life.

Use the leverage point, the principal contradiction.

image-20210812010338719

In a complex (chaotic) model, if you can find the key variable, pulling one thread moves the whole body and you can completely restructure the system.


Build rituals!

A ritual is something that, once started, requires no further control over the remaining steps; it runs on inertia.

This minimizes the conscious effort our brains have to spend.


Build virtuous cycles

image-20210812011007405

I need to build trust in myself too!

Mean reversion: it is hard for an individual to break past society... I stepped outside that circle, so the people I know are still rather few...

image-20210812011253657

I'll probably let go of expecting fair returns; returns aren't the interesting part anyway.

Damn, lag effects: staying up late now is going to make me feel awful later... ugh, so annoying.

Value investing! Also a lag effect; that is what "certainty" means.

Reservoirs

image-20210812011807899

Protect your own buffer zone.

What lets a person ride out the lag effect is this:

image-20210812011901249

A system with a reserve mechanism is also a highly adaptable system.

This leads to the three key properties of a good system:

  1. Robustness: you can run simulated attacks on it (stepping out of the comfort zone)
  2. Self-organization: spontaneously forming ordered structure according to internal rules, without external instructions (the human body)
  3. Hierarchy: subsystems, recursion, the idea of encapsulation (the rituals just mentioned)

4. Functions

Functions are one of the most wonderful things in this world.

image-20210812012642329

For most people, zone 2 is the most convenient: practical, simple and intuitive.

Textbooks make the definition that abstract only for the sake of rigor. The important power of functions is visualization.

Functions of time

The Gartner hype curve

image-20210813094417037

Don't overestimate the change that can happen in the short term.

Don't underestimate the change that can happen in the long term.

It is formed by superimposing the "human nature" curve from neuroscience and the "physical nature" curve (the logistic function, a sigmoid).

image-20210813095148856

And logistic growth is that species curve made of exponential growth plus environmental resistance, heh.
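
For reference, this is the standard logistic growth model (textbook form, not from the course): exponential growth at rate r, damped as the population P approaches the carrying capacity K:

\frac{dP}{dt} = r P \left(1 - \frac{P}{K}\right),
\qquad
P(t) = \frac{K}{1 + \frac{K - P_0}{P_0}\, e^{-r t}}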

The inverted-U curve

image-20210813095346101

It applies to many phenomena, such as bodily functions.

Sine and cosine functions: oscillation and periodicity.

Exponential functions: once past a certain inflection point, they take off.

Logarithmic functions: the glass ceiling.

image-20210813101214895

This is only a rough picture.

5. Neuroscience

Also called brain science, it underpins almost every discipline that studies human behavior.

image-20210813101722891

image-20210813101844649

The reptilian brain is very hard to regulate consciously (melatonin, for instance, has to be adjusted with chemicals); it runs on survival instinct.

So-called deliberate practice is the process of the human (neocortical) brain taming the mammalian brain.

The four major lobes of the neocortex:

frontal lobe, parietal lobe, temporal lobe, occipital lobe

Frontal lobe

The most central region, the seat of intelligence.

image-20210813102929814

It is divided into several areas:

image-20210813103133065

PS: Broca's area and Wernicke's area are the language centers.

The prefrontal cortex is a crucial part that governs many functions, the most important being social behavior, self-control and focus.

image-20210813102334506

This gap is really about self-control and delayed gratification. Nowadays, though, people's self-control develops more slowly, while high-dimensional information processing grows stronger instead, so self-control doesn't peak until around age 30.

image-20210813103419281

Mirror neurons

Responsible for imitation, social skills, empathy, and music.

They sit in the premotor cortex and the primary motor cortex, which connect directly down to the brainstem and the spine.

They are also found in Broca's area. So the shortcut to learning language and music is to immerse yourself in the environment and imitate others, not to take exams.

Parietal lobe

Responsible for orchestration and coordination, and for spatial imagination.

image-20210813103936139

Temporal lobe: language comprehension, face recognition, insight, noticing details.

image-20210813104113191

Occipital lobe: devoted entirely to vision.


Learning

Notice how many regions of the human brain are devoted to processing vision! Visualization is the GOAT.

image-20210813104345656

Activate more and more brain regions in turn!!

image-20210813104325285

Communicating through language alone is very inefficient.

The mammalian brain

(more developed in women) Limbic system: long-term memory, emotion regulation, smell (related to these, which is why body scent matters), sexual arousal

(especially developed in women) Cingulate gyrus: emotion, anxiety, pain, self-regulation, negative imagination

Amygdala: fear, anger, excitement, fight or flight

image-20210813104933864

So fear suppresses the human (neocortical) brain; however smart you are, you still obey instinct, and fight-or-flight depends on past experience and genes.

With training, experience can teach the amygdala that there is nothing to fear, so that it chooses fight rather than flight.

(marketers' favorite) Basal ganglia: motor skills, habit formation, the reward system, the addiction system

image-20210813105502624

This nucleus is the key to consumer behavior.

Deliberate practice

Deliberate choice, repeated over and over.

It is like the neural connections between different brain regions: repeated stimulation eventually forms a solid pathway.

But slow down at the key steps, so that more brain regions can be recruited to observe the process.

image-20210813110129493

So slapping labels on people is not advisable.

6. Complexity Science

Reductionism is a philosophical position holding that complex systems, things and phenomena can be understood and described by resolving them into combinations of their parts.

In philosophy, reductionism is the idea that a given entity is a collection or combination of simpler or more fundamental entities, or that statements about it can be defined in terms of statements about more fundamental entities. The reductionist method is the core of the classical scientific method: decompose high-level, complex objects into lower-level, simpler ones; the essence of the world lies in simplicity.

Complexity science, which emerged in the 1980s, is a new stage in the development of systems science and one of the frontier areas of contemporary science. Its development has not only triggered change within the natural sciences but is also increasingly permeating philosophy, the humanities and the social sciences. Why has complexity science won such renown and brought such enormous change to scientific research? Mainly because of its breakthroughs and innovations in research methodology. In a sense, what complexity science brings is, first of all, a revolution in methodology and in ways of thinking.

  1. It can only be delimited by its research methods; its yardstick and framework are a non-reductionist research methodology
  2. It is not a single concrete discipline but is spread across many disciplines; it is inherently interdisciplinary
  3. It tries to break down the walls between traditional disciplines that keep to themselves, and to find unified mechanisms by which the disciplines connect and cooperate
  4. It tries to break with the linear theories that have dominated the world since Newtonian mechanics, and to abandon the dream that reductionism applies to every discipline
  5. It seeks to create new theoretical frameworks or paradigms, and to apply new modes of thinking to understand the problems nature poses to us

Complexity science is an emerging, "interdisciplinary" form of scientific research that takes complex systems as its object of study, takes going beyond reductionism as its methodological hallmark, takes revealing and explaining the laws by which complex systems operate as its main task, and takes improving our ability to understand, explore and transform the world as its main purpose.

One scholar's definition: using cross-disciplinary methods to study emergent behavior and unifying regularities across different complex systems.

Complexity: a must-read introduction

Scale: on growth

Systems Theory: an introduction to systems theory


Complex systems

image-20210815174046214
  1. A large number of individuals aggregate
  2. The behavior of each individual is relatively simple, but superimposed it produces complex collective behavior
  3. Through continual evolution, the system adapts to its environment

Key words: aggregation, evolution, adaptability, emergence

image-20210815174524772

image-20210815174958056

Once these three simple rules are iterated over and over, they produce extraordinary results.

That is how simple individuals -> collective intelligence.

image-20210815190757944

*Focus

A four-stage routine:

  1. Clear your head: sort out and list your thoughts, so nothing is nagging at you and you are free of distraction
  2. Selfless focus: enter a hyper-efficient state for 2 hours and get the bulk of the task done
  3. Deliberate rest: only play is allowed; force yourself to play, say for half an hour
  4. Inertia work: keep going along the mostly-finished thread; thanks to the earlier groundwork, this can be done at low energy cost.

Iteration

This is iteration in the engineering sense, especially agile development in software.

Find your own minimal core, then recurse, accumulate and iterate on it layer by layer, until something legendary eventually emerges.

image-20210815192540335

The inflection point only arrived at version 3.0. We should hold the right expectations: not every iteration is an upgrade.

Lean startup

image-20210815192654391

Adaptability

image-20210815192932647

Going too far is as bad as not going far enough... total failure.

7. Cognitive Psychology

Cognitive psychology is a school of psychological thought and line of research that arose in the West in the mid-1950s. In the broad sense it studies humans' higher mental processes, chiefly cognitive processes such as attention, perception, mental imagery, memory, creativity, problem solving, language and thinking. In the narrow sense it is equivalent to contemporary information-processing psychology, i.e. studying cognitive processes from an information-processing standpoint.


 Author: Darren
 Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Darren when republishing!

Comparing DynamoDB and MongoDB

Quick Comparison Table

How DynamoDB compares, as summarized in MongoDB's comparison:

  • Freedom to Run Anywhere: only available on AWS; no support for on-premises deployments; locked in to a single cloud provider.

  • Data Model: limited key-value store with JSON support; maximum 400KB record size; limited data type support (number, string, binary only) increases application complexity.

  • Querying: key-value queries only; the primary key can have at most 2 attributes, limiting query flexibility; analytic queries require replicating data to another AWS service, increasing cost and complexity.

  • Indexing: limited and complex to manage; indexes are sized, billed and provisioned separately from data; hash or hash-range indexes only; global secondary indexes (GSIs) are inconsistent with the underlying data, forcing applications to handle stale data; local secondary indexes (LSIs) can be strongly consistent, but must be defined when a table is created; GSIs can only be declared on top-level item elements and cannot index sub-documents or arrays, making complex queries impossible; maximum of 20 GSIs and 5 LSIs per table.

  • Data Integrity: eventually consistent; complex, as stale data needs to be handled in the application; no data validation, which must be handled in the application; ACID transactions apply to table data only, not to indexes or backups; maximum of 25 writes per transaction.

  • Monitoring and Performance Tuning: black box; fewer than 20 metrics limit visibility into database behavior; no tools to visualize schema or recommend indexes.

  • Backup: on-demand or continuous backups; no queryable backup; additional charge to restore backups; many configurations are not backed up and need to be recreated manually.

  • Pricing: highly variable; throughput-based pricing; a wide range of inputs may affect price (see Pricing and Commercial Considerations below).

What is DynamoDB?

DynamoDB is a proprietary NoSQL database service built by Amazon and offered as part of the Amazon Web Services (AWS) portfolio.

The name comes from Dynamo, a highly available key-value store developed in response to holiday outages on the Amazon e-commerce platform in 2004. Initially, however, few teams within Amazon adopted Dynamo due to its high operational complexity and the trade-offs that needed to be made between performance, reliability, query flexibility, and data consistency.

Around the same time, Amazon found that its developers enjoyed using SimpleDB, its primary NoSQL database service at the time which allowed users to offload database administration work. But SimpleDB, which is no longer being updated by Amazon, had severe limitations when it came to scale; its strict storage limitation of 10 GB and the limited number of operations it could support per second made it only viable for small workloads.

DynamoDB, which was launched as a database service on AWS in 2012, was built to address the limitations of both SimpleDB and Dynamo.

What is MongoDB?

MongoDB is an open, non-tabular database built by MongoDB, Inc. The company was established in 2007 by former executives and engineers from DoubleClick, which Google acquired and now uses as the backbone of its advertising products. The founders originally focused on building a platform as a service using entirely open source components, but when they struggled to find an existing database that could meet their requirements for building a service in the cloud, they began work on their own database system. After realizing the potential of the database software on its own, the team shifted their focus to what is now MongoDB. The company released MongoDB in 2009.

MongoDB was designed to create a technology foundation that enables development teams through:

  1. The document data model – presenting them with the best way to work with data.
  2. A distributed systems design – allowing them to intelligently put data where they want it.
  3. A unified experience that gives them the freedom to run anywhere – allowing them to future-proof their work and eliminate vendor lock-in.

MongoDB stores data in flexible, JSON-like records called documents, meaning fields can vary from document to document and data structure can be changed over time. This model maps to objects in application code, making data easy to work with for developers. Related information is typically stored together for fast query access through the MongoDB query language. MongoDB uses dynamic schemas, allowing users to create records without first defining the structure, such as the fields or the types of their values. Users can change the structure of documents simply by adding new fields or deleting existing ones. This flexible data model makes it easy for developers to represent hierarchical relationships and other more complex structures. Documents in a collection need not have an identical set of fields and denormalization of data is common.

In summer of 2016, MongoDB Atlas, the MongoDB fully managed cloud database service, was announced. Atlas offers genuine MongoDB under the hood, allowing users to offload operational tasks and featuring built-in best practices for running the database with all the power and freedom developers are used to with MongoDB.

Terminology and Concepts

Many concepts in DynamoDB have close analogs in MongoDB. The table below outlines some of the common concepts across DynamoDB and MongoDB.

DynamoDB MongoDB
Table Collection
Item Document
Attribute Field
Secondary Index Secondary Index

Deployment Environments

MongoDB can be run anywhere – from a developer’s laptop to an on-premises data center to any of the public cloud platforms. As mentioned above, MongoDB is also available as a fully managed cloud database with MongoDB Atlas; this model is most similar to how DynamoDB is delivered.

In contrast, DynamoDB is a proprietary database only available on Amazon Web Services. While a downloadable version of the database is available for prototyping on a local machine, the database can only be run in production in AWS. Organizations looking into DynamoDB should consider the implications of building on a data layer that is locked in to a single cloud vendor.

Comparethemarket.com, the UK’s leading price comparison service, completed a transition from on-prem deployments with Microsoft SQL Server to AWS and MongoDB. When asked why they hadn’t selected DynamoDB, a company representative was quoted as saying “DynamoDB was eschewed to help avoid AWS vendor lock-in.”

Data Model

MongoDB stores data in a JSON-like format called BSON, which allows the database to support a wide spectrum of data types including dates, timestamps, 64-bit integers, & Decimal128. MongoDB documents can be up to 16 MB in size; with GridFS, even larger assets can be natively stored within the database.

Unlike some NoSQL databases that push enforcement of data quality controls back into the application code, MongoDB provides built-in schema validation. Users can enforce checks on document structure, data types, data ranges and the presence of mandatory fields. As a result, DBAs can apply data governance standards, while developers maintain the benefits of a flexible document model.

DynamoDB is a key-value store with added support for JSON to provide document-like data structures that better match with objects in application code. An item or record cannot exceed 400KB. Compared to MongoDB, DynamoDB has limited support for different data types. For example, it supports only one numeric type and does not support dates. As a result, developers must preserve data types on the client, which adds application complexity and reduces data re-use across different applications. DynamoDB does not have native data validation capabilities.
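
To illustrate the data type point with a sketch (table, database and field names here are hypothetical): MongoDB stores a native date, while with DynamoDB the application has to serialize it to a string or number and convert it back on read.

import datetime
import boto3
from pymongo import MongoClient

now = datetime.datetime.utcnow()

# MongoDB (BSON): the datetime is stored as a real date type
MongoClient("mongodb://localhost:27017").app.events.insert_one(
    {"name": "signup", "created_at": now}
)

# DynamoDB: no date type, so the client must encode it (ISO-8601 string here)
boto3.resource("dynamodb").Table("events").put_item(
    Item={"name": "signup", "created_at": now.isoformat()}
)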

Queries and Indexes

MongoDB's API enables developers to build applications that can query and analyze their data in multiple ways – by single keys, ranges, faceted search, graph traversals, JOINs and geospatial queries through to complex aggregations, returning responses in milliseconds. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools. This helps users avoid the latency that comes from syncing data between operational and analytical engines.

MongoDB ensures fast access to data by any field with full support for secondary indexes. Indexes can be applied to any field in a document, down to individual values in arrays.

MongoDB supports multi-document transactions, making it the only database to combine the ACID guarantees of traditional relational databases; the speed, flexibility, and power of the document model; and the intelligent distributed systems design to scale-out and place data where you need it.

Multi-document transactions feel just like the transactions developers are familiar with from relational databases – multi-statement, similar syntax, and easy to add to any application. Through snapshot isolation, transactions provide a globally consistent view of data and enforce all-or-nothing execution. MongoDB allows reads and writes against the same documents and fields within the transaction. For example, users can check the status of an item before updating it. MongoDB best practices advise up to 1,000 operations in a single transaction. Learn more about MongoDB transactions here.
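
As an illustrative sketch only (pymongo, with hypothetical database, collection and field names), a multi-document transaction looks like this:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # or a MongoDB Atlas URI
db = client.shop

# Requires a replica set; the operations either all commit or all abort.
with client.start_session() as session:
    with session.start_transaction():
        db.orders.insert_one({"item": "abc", "qty": 1}, session=session)
        db.inventory.update_one(
            {"item": "abc", "qty": {"$gte": 1}},
            {"$inc": {"qty": -1}},
            session=session,
        )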

Supported indexing strategies such as compound, unique, array, partial, TTL, geospatial, sparse, hash, wildcard and text ensure optimal performance for multiple query patterns, data types, and application requirements. Indexes are strongly consistent with the underlying data.

DynamoDB supports key-value queries only. For queries requiring aggregations, graph traversals, or search, data must be copied into additional AWS technologies, such as Elastic MapReduce or Redshift, increasing latency, cost, and developer work. The database supports two types of indexes: Global secondary indexes (GSIs) and local secondary indexes (LSIs). Users can define up to 5 LSIs and 20 GSIs per table. Indexes can be defined as hash or hash-range indexes; more advanced indexing strategies are not supported.

GSIs, which are eventually consistent with the underlying data, do not support ad-hoc queries and usage requires knowledge of data access patterns in advance. GSIs can also not index any element below the top level record structure – so you cannot index sub-documents or arrays. LSIs can be queried to return strongly consistent data, but must be defined when the table is created. They cannot be added to existing tables and they cannot be removed without dropping the table.

DynamoDB indexes are sized and provisioned separately from the underlying tables, which may result in unforeseen issues at runtime. The DynamoDB documentation explains,

“In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.”

DynamoDB also supports multi-record ACID transactions. Unlike MongoDB transactions, each DynamoDB transaction is limited to just 25 write operations; the same item also cannot be targeted with multiple operations as part of the same transaction. As a result, complex business logic may require multiple, independent transactions, which would add more code and overhead to the application, while also resulting in the possibility of more conflicts and transaction failures. Only base data in a DynamoDB table is transactional. Secondary indexes, backups and streams are updated "eventually". This can lead to "silent data loss". Subsequent queries against indexes can return data that has not yet been updated from the base tables, breaking transactional semantics. Similarly, data restored from backups may not be transactionally consistent with the original table.

Consistency

MongoDB is strongly consistent by default as all read/writes go to the primary in a MongoDB replica set, scaled across multiple partitions (shards). If desired, consistency requirements for read operations can be relaxed. Through secondary consistency controls, read queries can be routed only to secondary replicas that fall within acceptable consistency limits with the primary server.

DynamoDB is eventually consistent by default. Users can configure read operations to return only strongly consistent data, but this doubles the cost of the read (see Pricing and Commercial Considerations) and adds latency. There is also no way to guarantee read consistency when querying against DynamoDB’s global secondary indexes (GSIs); any operation performed against a GSI will be eventually consistent, returning potentially stale or deleted data, and therefore increasing application complexity.
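
As a sketch of what the two defaults look like in application code (boto3 and pymongo; the table, collection and key names are hypothetical):

import boto3
from pymongo import MongoClient, ReadPreference

# DynamoDB: eventually consistent by default; opt in per read, at double the RCU cost
users = boto3.resource("dynamodb").Table("users")
item = users.get_item(Key={"user_id": "42"}, ConsistentRead=True)

# MongoDB: reads go to the primary (strongly consistent) unless explicitly relaxed
coll = MongoClient("mongodb://localhost:27017").app.get_collection(
    "users", read_preference=ReadPreference.SECONDARY_PREFERRED
)
doc = coll.find_one({"user_id": "42"})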

Operational Maturity

MongoDB Atlas allows users to deploy, manage, and scale their MongoDB clusters using built in operational and security best practices, such as end-to-end encryption, network isolation, role-based access control, VPC peering, and more. Atlas deployments are guaranteed to be available and durable with distributed and auto-healing replica set members and continuous backups with point in time recovery to protect against data corruption. MongoDB Atlas is fully elastic with zero downtime configuration changes and auto-scaling both storage and compute capacity. Atlas also grants organizations deep insights into how their databases are performing with a comprehensive monitoring dashboard, a real-time performance panel, and customizable alerting.

For organizations that would prefer to run MongoDB on their own infrastructure, MongoDB, Inc. offers advanced operational tooling to handle the automation of the entire database lifecycle, comprehensive monitoring (tracking 100+ metrics that could impact performance), and continuous backup. Product packages like MongoDB Enterprise Advanced bundle operational tooling and visualization and performance optimization platforms with end-to-end security controls for applications managing sensitive data.

MongoDB's deployment flexibility allows single clusters to span racks, data centers and continents. With replica sets supporting up to 50 members and geography-aware sharding across regions, administrators can provision clusters that support global deployments, with write-local/read-global access patterns and data locality. Using Atlas Global Clusters, developers can deploy fully managed "write anywhere" active-active clusters, allowing data to be localized to any region. With each region acting as primary for its own data, the risks of data loss and eventual consistency imposed by the multi-primary approach used by DynamoDB are eliminated, and customers can meet the data sovereignty demands of new privacy regulations. Finally, multi-cloud clusters enable users to provision clusters that span AWS, Azure, and Google Cloud, giving maximum resilience and flexibility in terms of data distribution.

Offered only as a managed service on AWS, DynamoDB abstracts away its underlying partitioning and replication schemes. While provisioning is simple, other key operational tasks are lacking when compared to MongoDB:

  • Fewer than 20 database metrics are reported by AWS Cloudwatch, which limits visibility into real-time database behavior
  • AWS CloudTrail can be used to create audit trails, but it only tracks a small subset of DDL (administrative) actions to the database, not all user access to individual tables or records
  • DynamoDB has limited tooling to allow developers and/or DBAs to optimize performance by visualizing schema or graphically profiling query performance
  • DynamoDB supports cross-region replication with multi-primary global tables; however, these add further application complexity and cost, with eventual consistency, risks of data loss due to write conflicts between regions, and no automatic client failover

Pricing & Commercial Considerations

In this section we will again compare DynamoDB with its closest analog from MongoDB, Inc., MongoDB Atlas.

DynamoDB's pricing model is based on throughput. Users pay for a certain capacity on a given table and AWS automatically throttles any reads or writes that exceed that capacity.

This sounds simple in theory, but the reality is that correctly provisioning throughput and estimating pricing is far more nuanced.

Below is a list of all the factors that could impact the cost of running DynamoDB:

  • Size of the data set per month
  • Size of each object
  • Number of reads per second (pricing is based on “read capacity units”, which are equivalent to reading a 4KB object) and whether those reads need to be strongly consistent or eventually consistent (the former is twice as expensive)
    • If accessing a JSON object, the entire document must be retrieved, even if the application needs to read only a single element
  • Number of writes per second (pricing is based on “write capacity units”, which are the equivalent of writing a 1KB object)
  • Whether transactions will be used. Transactions double the cost of read and write operations
  • Whether clusters will be replicated across multiple regions. This increases write capacity costs by 50%.
  • Size and throughput requirements for each index created against the table
  • Costs for backup and restore. AWS offers on-demand and continuous backups – both are charged separately, at different rates for both the backup and restore operation
  • Data transferred by Dynamo streams per month
  • Data transfers both in and out of the database per month
  • Cross-regional data transfers, EC2 instances, and SQS queues needed for cross-regional deployments
  • The use of additional AWS services to address what is missing from DynamoDB’s limited key value query model
  • Use of on-demand or reserved instances
  • Number of metrics pushed into CloudWatch for monitoring
  • Number of events pushed into CloudTrail for database auditing

It is key to point out from the list above that indexes affect pricing and strongly consistent reads are twice as expensive.
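
A back-of-the-envelope helper for that point, using the unit definitions quoted in the list above (a read capacity unit covers a 4KB read per second, strongly consistent reads cost twice as much as eventually consistent ones, and a write capacity unit covers a 1KB write per second); the function itself is purely illustrative:

import math

def capacity_units(item_kb, reads_per_sec, writes_per_sec, strongly_consistent=False):
    rcu_per_read = math.ceil(item_kb / 4)          # 4KB per read capacity unit
    if not strongly_consistent:
        rcu_per_read /= 2                          # eventually consistent reads cost half
    wcu_per_write = math.ceil(item_kb / 1)         # 1KB per write capacity unit
    return reads_per_sec * rcu_per_read, writes_per_sec * wcu_per_write

# 8KB items at 500 reads/s and 100 writes/s:
print(capacity_units(8, 500, 100, strongly_consistent=True))   # (1000, 800)
print(capacity_units(8, 500, 100, strongly_consistent=False))  # (500.0, 800)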

With DynamoDB, throughput pricing actually dictates the number of partitions, not total throughput. Since users don’t have precise control over partitioning, if any individual partition is saturated, one may have to dramatically increase capacity by splitting partitions rather than scaling linearly. Very careful design of the data model is essential to ensure that provisioned throughput can be realized.

AWS has introduced the concept of Adaptive Capacity, which will automatically increase the available resources for a single partition when it becomes saturated, however it is not without limitations. Total read and write volume to a single partition cannot exceed 3,000 read capacity units and 1,000 write capacity units per second. The required throughput increase cannot exceed the total provisioned capacity for the table. Adaptive capacity doesn’t grant more resources as much as borrow resources from lower utilized partitions. And finally, DynamoDB may take up to 15 minutes to provision additional capacity.

For customers frustrated with capacity planning exercises for DynamoDB, AWS recently introduced DynamoDB On-Demand, which will allow the platform to automatically provision additional resources based on workload demand. On-demand is suitable for low-volume workloads with short spikes in demand. However, it can get expensive quickly: when the database's utilization rate exceeds 14% of the equivalent provisioned capacity, DynamoDB On-Demand becomes more expensive than provisioning throughput.

Compared to DynamoDB, pricing for MongoDB Atlas is relatively straightforward; users select just:

  • the instance size with enough RAM to accommodate the portion of your data (including indexes) that clients access most often
  • the number of replicas and shards that will make up the cluster
  • whether to include fully managed backups
  • the region(s) the cluster needs to run in

Users can adjust any of these parameters on demand. The only additional charge is for data transfer costs.

When to use DynamoDB vs. MongoDB

DynamoDB may work for organizations that are:

  • Looking for a database to support relatively simple key-value workloads
  • Heavily invested in AWS with no plans to change their deployment environment in the future

For organizations that need their database to support a wider range of use cases with more deployment flexibility and no platform lock-in, MongoDB would likely be a better fit.

For example, biotechnology giant Thermo Fisher migrated from DynamoDB to MongoDB for their Instrument Connect IoT app, citing that while both databases were easy to deploy, MongoDB Atlas allowed for richer queries and much simpler schema evolution.

Want to Learn More?

MongoDB Atlas Best Practices

This guide describes the best practices to help you get the most out of the MongoDB Atlas service, including: schema design, capacity planning, security, and performance optimization.

MongoDB Atlas Security Controls

This document will provide you with an understanding of MongoDB Atlas’ Security Controls and Features as well as a view into how many of the underlying mechanisms work.

From: Comparing DynamoDB and MongoDB | MongoDB