Category Archives: NOSQL

A Deep Dive into DynamoDB Partitions

Databases are the backbone of most modern web applications and their performance plays a major role in user experience. Faster response times – even by a fraction of a second – can be the major deciding factor for most users to choose one option over another. Therefore, it is important to take response rate into consideration whilst designing your databases in order to provide the best possible performance. In this article, I’m going to discuss how to optimise DynamoDB database performance by using partitions.

Introducing Partitions

DynamoDB performance starts and ends with the concept of partitions. Partitions are the units of storage and performance. Not understanding partitions means you will not be able to design highly effective and available databases with DynamoDB, so it's worth understanding what's going on under the hood.

Initially, when you create a table on DynamoDB, it'll create one partition and allocate it to the table. Any operations on this table – such as inserts, deletes and updates – will be handled by the node where this partition is stored. It is important to remember that you do not have full control over the number of partitions created, but you can influence it.

One partition can handle 10GB of data, 3000 read capacity units (RCU) and 1000 write capacity units (WCU), indicating a direct relationship between the amount of data stored in a table and performance requirements. A new partition will be added when more than 10GB of data is stored in a table, or RCUs are greater than 3000, or WCUs are greater than 1000. Then, the data will get spread across these partitions.

So how does DynamoDB spread data across multiple partitions? The partition that a particular row is placed within is selected based on a partition key. For each unique partition key value, the item gets assigned to a specific partition.

Let’s use an example to demonstrate. The below table shows a list of examinations and students who have taken them.

table

In this example, there is a many-to-one relationship between an exam and a student (for the sake of simplicity, we’ll assume that students do not resit exams). If this table was just for all the students at a particular school, the dataset would be fairly small. However, if it was all the students in a state or country, there could be millions and millions of rows. This might put us within range of the data storage and performance limits that would lead to a new partition being required.

Below is a virtual representation of how the above data might be distributed if, based on the required RCU and WCU or the size of the dataset, DynamoDB were to decide to scale it out across 3 partitions:

AWS Network Diagram - New Page

As we can see above, each exam ID is assigned to a unique partition. A single partition may host multiple partition key values based on the size of the dataset, but the important thing to remember here is that one partition key can only be assigned to a single partition. One exam can be taken by many students. Therefore, the student ID becomes a perfect sort key value to query this data (as it allows sorting of exam results by student ID).

By adding more partitions, or by moving data between partitions, indefinite scaling is possible, based on the size or the performance requirements of the dataset. However, it is also important to remember that there are serious limitations that must be considered.

Firstly, the number of partitions is managed entirely by DynamoDB: partitions are added to accommodate increasing dataset size or increasing performance requirements. However, whilst partitions are added automatically, they are never removed – there is no automatic decrease in partitions when capacity or performance requirements drop.

This leads us to our next important point: the allocated RCU (read capacity unit) and WCU (write capacity unit) values are spread across a number of partitions. Consider, for example, that you need 30000 RCUs to be allocated to the database. The maximum RCU a single partition can support is 3000. Therefore, to accommodate the request, DynamoDB will automatically create 10 partitions.

If you are increasing your RCU and WCU via the console, AWS will provide you with an estimated cost per month as below,

RCU_WCU_Increase

Using the exam-student example, the dataset for each exam is assigned to one partition, which, as you will recall, can hold up to 10GB of data, 3000 RCUs and 1000 WCUs. Yet each exam can have millions of students, so the size of this dataset may go well beyond the 10GB capacity limit – something that must be kept in mind when selecting partition keys for a specific dataset.

Increasing the RCU or WCU values for a table beyond 3000 RCUs and 1000 WCUs prompts DynamoDB to create additional partitions, with no way to reduce the number of partitions even if the required RCUs and WCUs later drop. This can lead to a situation where each partition ends up with only a tiny share of the RCUs and WCUs.

AWS Network Diagram - New Page

Because it is possible to have performance issues due to throttling – even though the overall assigned RCUs and WCUs are appropriate for the expected load – a formula can be created to calculate the desired number of partitions, whilst taking performance into consideration.

Based on our required read performance,

Partitions for desired read performance = 
  Desired RCU / 3000 RCU

and based on our required write performance,

Partitions for desired write performance = 
  Desired WCU / 1000 WCU

Giving us the number of partitions needed for the required performance,

Total partitions for desired performance = 
  (Desired RCU / 3000 RCU) + (Desired WCU / 1000 WCU)

But that’s only the performance aspect. We also have to look at the storage aspect. Assuming the max capacity supported by a single partition is 10GB,

Total partitions for desired capacity = Desired capacity in GB / 10GB

The following formula can be used to calculate the total number of partitions needed to accommodate both the performance aspect and the capacity aspect.

Total partitions = 
  MAX(Total partitions for desired performance, 
      Total partitions for desired capacity)

As an example, consider the following requirements:

  • RCU Capacity:  7500
  • WCU Capacity: 4000
  • Storage Capacity: 100GB

The required number of partitions for performance can be calculated as:

(7500/3000) + (4000/1000) = 2.5 + 4 = 6.5

We’ll round this up to the nearest whole number: 7.

The required number of partitions for capacity is:

100/10 = 10

So the total number of partitions required is:

MAX(7, 10) = 10
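
To make the arithmetic concrete, here is a minimal JavaScript sketch of the same calculation. The function name and the rounding are my own; the 3000 RCU, 1000 WCU and 10GB per-partition limits are the ones described above.

// Estimate the number of DynamoDB partitions from desired throughput and storage.
// Per-partition limits (as described above): 3000 RCU, 1000 WCU, 10GB.
function estimatePartitions(desiredRcu, desiredWcu, desiredStorageGb) {
  const forPerformance = Math.ceil(desiredRcu / 3000 + desiredWcu / 1000);
  const forCapacity = Math.ceil(desiredStorageGb / 10);
  return Math.max(forPerformance, forCapacity);
}

// The worked example above: 7500 RCU, 4000 WCU and 100GB of storage.
console.log(estimatePartitions(7500, 4000, 100)); // 10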

A critical factor is that the total RCU and WCU is split equally across the total number of partitions. Therefore, you will only get the total allocated RCU and WCU for a table if you are reading and writing in parallel across all partitions. This can only be achieved with a good partition key model, meaning a key that is evenly distributed across the key space.

Picking a good partition key

There is no universal answer when it comes to choosing a good key – it all depends on the nature of the dataset. For a low-volume table, key selection doesn't matter as much (3000 RCU and 1000 WCU with a single partition is achievable even with a badly-designed key structure). However, as the dataset grows, key selection becomes increasingly important.

The partition key must be specified at table creation time. If you're using the console, you'll see something similar to this:

DynamoDB_·_AWS_Console

Or if you’re using the CLI, you’d have to run something like,

aws dynamodb create-table \
  --table-name us_election_2016 \
  --attribute-definitions \
  AttributeName=candidate_id,AttributeType=S \
  AttributeName=voter_id,AttributeType=S \
  --key-schema AttributeName=candidate_id,KeyType=HASH AttributeName=voter_id,KeyType=RANGE \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1

The first criterion in choosing a good partition key is to select an attribute that has as many distinct values as possible. For example, you would choose an employee ID when there are many employees available; you would not choose a department ID where there are only a handful of departments.

The next criterion is to pick an attribute with uniformity of access across all key values. For example, in a voting record system, selecting a candidate ID would be ideal if you expect each candidate to receive a similar number of votes. If one or two candidates are expected to receive 90% of the available votes, this becomes less optimal.

Another criterion for a good partition key candidate is that reads and writes for the attribute should be spread evenly over time. If these properties are hard to achieve with an existing attribute, it's worth looking at a synthetic or hybrid value.

Let’s look at an example that uses the 2016 US Elections to highlight everything we’ve just discussed. Specifically, we want to store a record of all of the votes for all of the candidates.

Each political party will have many candidates competing for the party’s electoral nomination. You may have anywhere from two to ten candidates. The problem is that votes between candidates will not be distributed uniformly – there will be one or two candidates that will receive the majority of the votes.

For the sake of this example, let's assume that we expect 10000 WCU worth of votes to be received. Say that, in the first instance, we create a table and naively select the candidate ID as the partition key, and date/time as a range key.

DynamoDB will create 10 partitions for this example (based on our previous formula, 10 partitions are needed to support 10000 WCU). If we also assume we have 10 candidates, DynamoDB will spread these partition keys across the 10 partitions, as shown here:

election_1

This model is seriously flawed. Firstly, we are limiting the performance for each candidate to a value much lower than 10000 WCU. As we discussed above, real-world candidate voting will be heavily weighted towards one or two popular candidates, so the performance allocated to the least popular candidates is just wasted WCU.

Even if we assume voting is uniformly weighted between candidates, their voters may be located in different time zones and may vote at different times. Therefore, there might be spikes of votes for certain candidates at specific times compared to others. Even with carefully designed partition keys, you can run into time-based issues like this.

Let’s think about a case where there are only two candidates in the national election. To support the performance requirement, 100000 WCU are assigned, and DynamoDB will create 100 partitions to support this. However, if the candidate ID is chosen as the partition key, each candidate’s data will be limited to one partition – even though there are 98 unused partitions. Consequently, the storage limit will be hit quickly, causing the application to fail and stop recording further votes.

This issue is resolved by introducing a key-sharing plan. This means that for each candidate – i.e. for each partition key – the partition key is prefixed with a value of 1 to 10 or 1 to 1000 (depending on the size of your dataset). This gives us a much wider range of partition key values, so DynamoDB will distribute the data evenly across multiple partitions. It’ll look a bit like this:

election_2
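
To illustrate the write side of such a key-sharing scheme, here is a hedged sketch using the AWS SDK for JavaScript (v2) DocumentClient. The table and attribute names follow the earlier CLI example; the shard count of 10, the region and the helper names are assumptions made for illustration only.

// Sketch: spread each candidate's votes across 10 partition key values by
// prefixing the candidate ID with a shard number, as described above.
const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "us-east-1" }); // region assumed

const SHARD_COUNT = 10; // could be 100 or 1000 for larger datasets

function shardedCandidateKey(candidateId) {
  const shard = Math.floor(Math.random() * SHARD_COUNT) + 1; // 1..SHARD_COUNT
  return `${shard}_${candidateId}`; // e.g. "3_CANDIDATE_7"
}

async function recordVote(candidateId, voterId) {
  await docClient
    .put({
      TableName: "us_election_2016",
      Item: {
        candidate_id: shardedCandidateKey(candidateId), // sharded partition key
        voter_id: voterId,                              // sort key
      },
    })
    .promise();
}

Reads for a candidate then fan out across all shard prefixes (1 to SHARD_COUNT) and merge the results.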

Now, we can look at the histogram before key-sharing:

heat_1

Where the corresponding partition keys will look something like (please note, for this example, I’ve only inserted data for 2 candidates):

before_sharding.png

Now here’s the histogram after key-sharing:

heat_2

We can see how, with a key sharing plan, the load is much more evenly distributed across partitions. Throttling is minimal. The corresponding partition keys will look like this:

after_sharding

Conclusion

There are many other factors that need to be considered when designing data models on DynamoDB, such as Local Secondary Indexes and Global Secondary Indexes. For further information on these indexes, check out the AWS documentation to understand how they may impact database performance.

Database modelling is very important when choosing a database structure, and it's essential for an optimally-performing application. Even though DynamoDB is a fully managed and highly scalable database solution, designing a solid application still comes down to you. No matter how powerful DynamoDB is, a poorly designed database model will cause your application to perform poorly.

from:A Deep Dive into DynamoDB Partitions – Shine Solutions Group

Comparing DynamoDB and MongoDB

Quick Comparison Table

DynamoDB at a glance, category by category (the MongoDB side of the comparison is covered in the sections that follow):

Freedom to Run Anywhere
  • Only available on AWS
  • No support for on-premises deployments
  • Locked in to a single cloud provider

Data Model
  • Limited key-value store with JSON support
  • Maximum 400KB record size
  • Limited data type support (number, string, binary only) increases application complexity

Querying
  • Key-value queries only
  • Primary key can have at most 2 attributes, limiting query flexibility
  • Analytic queries require replicating data to another AWS service, increasing cost and complexity

Indexing
  • Limited / complex to manage
  • Indexes are sized, billed & provisioned separately from data
  • Hash or hash-range indexes only
  • Global secondary indexes (GSIs) are inconsistent with underlying data, forcing applications to handle stale data
  • Local secondary indexes (LSIs) can be strongly consistent, but must be defined when a table is created
  • GSIs can only be declared on top-level item elements; sub-documents and arrays cannot be indexed, making complex queries impossible
  • Maximum of 20 GSIs & 5 LSIs per table

Data Integrity
  • Eventually consistent
  • Complex – need to handle stale data in application
  • No data validation – must be handled in application
  • ACID transactions apply to table data only, not to indexes or backups
  • Maximum of 25 writes per transaction

Monitoring and Performance Tuning
  • Black box
  • Fewer than 20 metrics limit visibility into database behavior
  • No tools to visualize schema or recommend indexes

Backup
  • On-demand or continuous backups
  • No queryable backup; additional charge to restore backups; many configurations are not backed up and need to be recreated manually

Pricing
  • Highly variable
  • Throughput-based pricing
  • A wide range of inputs may affect price – see Pricing and Commercial Considerations below

What is DynamoDB?

DynamoDB is a proprietary NoSQL database service built by Amazon and offered as part of the Amazon Web Services (AWS) portfolio.

The name comes from Dynamo, a highly available key-value store developed in response to holiday outages on the Amazon e-commerce platform in 2004. Initially, however, few teams within Amazon adopted Dynamo due to its high operational complexity and the trade-offs that needed to be made between performance, reliability, query flexibility, and data consistency.

Around the same time, Amazon found that its developers enjoyed using SimpleDB, its primary NoSQL database service at the time which allowed users to offload database administration work. But SimpleDB, which is no longer being updated by Amazon, had severe limitations when it came to scale; its strict storage limitation of 10 GB and the limited number of operations it could support per second made it only viable for small workloads.

DynamoDB, which was launched as a database service on AWS in 2012, was built to address the limitations of both SimpleDB and Dynamo.

What is MongoDB?

MongoDB is an open, non-tabular database built by MongoDB, Inc. The company was established in 2007 by former executives and engineers from DoubleClick, which Google acquired and now uses as the backbone of its advertising products. The founders originally focused on building a platform as a service using entirely open source components, but when they struggled to find an existing database that could meet their requirements for building a service in the cloud, they began work on their own database system. After realizing the potential of the database software on its own, the team shifted their focus to what is now MongoDB. The company released MongoDB in 2009.

MongoDB was designed to create a technology foundation that enables development teams through:

  1. The document data model – presenting them the best way to work with data.
  2. A distributed systems design – allowing them to intelligently put data where they want it.
  3. A unified experience that gives them the freedom to run anywhere – allowing them to future-proof their work and eliminate vendor lock-in.

MongoDB stores data in flexible, JSON-like records called documents, meaning fields can vary from document to document and data structure can be changed over time. This model maps to objects in application code, making data easy to work with for developers. Related information is typically stored together for fast query access through the MongoDB query language. MongoDB uses dynamic schemas, allowing users to create records without first defining the structure, such as the fields or the types of their values. Users can change the structure of documents simply by adding new fields or deleting existing ones. This flexible data model makes it easy for developers to represent hierarchical relationships and other more complex structures. Documents in a collection need not have an identical set of fields and denormalization of data is common.
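
As a small, hedged illustration of that flexibility with the Node.js driver (the connection string, database, collection and field names are invented for the example):

// Two documents with different shapes can live in the same collection, no migration needed.
const { MongoClient } = require("mongodb");

async function run() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const people = client.db("demo").collection("people");

  await people.insertOne({ name: "Ada", languages: ["Python", "C"] });
  await people.insertOne({
    name: "Grace",
    title: "Rear Admiral",                        // a field the first document doesn't have
    address: { city: "Arlington", state: "VA" },  // a nested sub-document
  });

  await client.close();
}

run();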

In summer of 2016, MongoDB Atlas, the MongoDB fully managed cloud database service, was announced. Atlas offers genuine MongoDB under the hood, allowing users to offload operational tasks and featuring built-in best practices for running the database with all the power and freedom developers are used to with MongoDB.

Terminology and Concepts

Many concepts in DynamoDB have close analogs in MongoDB. The table below outlines some of the common concepts across DynamoDB and MongoDB.

DynamoDB           MongoDB
Table              Collection
Item               Document
Attribute          Field
Secondary Index    Secondary Index

Deployment Environments

MongoDB can be run anywhere – from a developer’s laptop to an on-premises data center to any of the public cloud platforms. As mentioned above, MongoDB is also available as a fully managed cloud database with MongoDB Atlas; this model is most similar to how DynamoDB is delivered.

In contrast, DynamoDB is a proprietary database only available on Amazon Web Services. While a downloadable version of the database is available for prototyping on a local machine, the database can only be run in production in AWS. Organizations looking into DynamoDB should consider the implications of building on a data layer that is locked in to a single cloud vendor.

Comparethemarket.com, the UK’s leading price comparison service, completed a transition from on-prem deployments with Microsoft SQL Server to AWS and MongoDB. When asked why they hadn’t selected DynamoDB, a company representative was quoted as saying “DynamoDB was eschewed to help avoid AWS vendor lock-in.”

Data Model

MongoDB stores data in a JSON-like format called BSON, which allows the database to support a wide spectrum of data types including dates, timestamps, 64-bit integers, & Decimal128. MongoDB documents can be up to 16 MB in size; with GridFS, even larger assets can be natively stored within the database.

Unlike some NoSQL databases that push enforcement of data quality controls back into the application code, MongoDB provides built-in schema validation. Users can enforce checks on document structure, data types, data ranges and the presence of mandatory fields. As a result, DBAs can apply data governance standards, while developers maintain the benefits of a flexible document model.
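
A minimal sketch of such a validation rule using the $jsonSchema validator through the Node.js driver (the collection and field names are assumptions for illustration):

// Assumes `db` is a connected Db instance and we are inside an async function.
// Reject documents that are missing required fields or use the wrong types.
await db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["customerId", "total"],
      properties: {
        customerId: { bsonType: "objectId" },
        total: { bsonType: "decimal", minimum: 0 },        // Decimal128, non-negative
        status: { enum: ["pending", "paid", "shipped"] },  // restricted to known values
      },
    },
  },
});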

DynamoDB is a key-value store with added support for JSON to provide document-like data structures that better match with objects in application code. An item or record cannot exceed 400KB. Compared to MongoDB, DynamoDB has limited support for different data types. For example, it supports only one numeric type and does not support dates. As a result, developers must preserve data types on the client, which adds application complexity and reduces data re-use across different applications. DynamoDB does not have native data validation capabilities.

Queries and Indexes

MongoDB's API enables developers to build applications that can query and analyze their data in multiple ways – by single keys, ranges, faceted search, graph traversals, JOINs and geospatial queries through to complex aggregations, returning responses in milliseconds. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools. This helps users avoid the latency that comes from syncing data between operational and analytical engines.
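
For instance, a hedged sketch of one such aggregation with the Node.js driver (the collection and field names are invented; `db` is assumed to be a connected Db instance inside an async function):

// Revenue per state for paid orders, computed entirely inside the database.
const totals = await db
  .collection("orders")
  .aggregate([
    { $match: { status: "paid" } },
    { $group: { _id: "$state", revenue: { $sum: "$total" }, orders: { $sum: 1 } } },
    { $sort: { revenue: -1 } },
  ])
  .toArray();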

MongoDB ensures fast access to data by any field with full support for secondary indexes. Indexes can be applied to any field in a document, down to individual values in arrays.

MongoDB supports multi-document transactions, making it the only database to combine the ACID guarantees of traditional relational databases; the speed, flexibility, and power of the document model; and the intelligent distributed systems design to scale-out and place data where you need it.

Multi-document transactions feel just like the transactions developers are familiar with from relational databases – multi-statement, similar syntax, and easy to add to any application. Through snapshot isolation, transactions provide a globally consistent view of data and enforce all-or-nothing execution. MongoDB allows reads and writes against the same documents and fields within the transaction. For example, users can check the status of an item before updating it. MongoDB best practices advise up to 1,000 operations in a single transaction. Learn more about MongoDB transactions here.
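
A brief, hedged sketch of what that looks like with the Node.js driver (the collection names and the transfer logic are invented; `client` is assumed to be a connected MongoClient):

// Move credit between two accounts atomically: both updates commit or neither does.
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    const accounts = client.db("bank").collection("accounts");
    await accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } }, { session });
    await accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } }, { session });
  });
} finally {
  await session.endSession();
}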

Supported indexing strategies such as compound, unique, array, partial, TTL, geospatial, sparse, hash, wildcard and text ensure optimal performance for multiple query patterns, data types, and application requirements. Indexes are strongly consistent with the underlying data.
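
A few of those strategies, sketched with the Node.js driver's createIndex (the collection and field names are illustrative; `db` is assumed to be a connected Db instance inside an async function):

const orders = db.collection("orders");

await orders.createIndex({ customerId: 1, createdAt: -1 });                 // compound
await orders.createIndex({ "items.sku": 1 });                               // multikey: indexes values inside an array
await orders.createIndex({ invoiceNumber: 1 }, { unique: true });           // unique
await orders.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 });  // TTL: documents expire a day after createdAt
await orders.createIndex({ notes: "text" });                                // text search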

DynamoDB supports key-value queries only. For queries requiring aggregations, graph traversals, or search, data must be copied into additional AWS technologies, such as Elastic MapReduce or Redshift, increasing latency, cost, and developer work. The database supports two types of indexes: Global secondary indexes (GSIs) and local secondary indexes (LSIs). Users can define up to 5 LSIs and 20 GSIs per table. Indexes can be defined as hash or hash-range indexes; more advanced indexing strategies are not supported.

GSIs, which are eventually consistent with the underlying data, do not support ad-hoc queries and usage requires knowledge of data access patterns in advance. GSIs can also not index any element below the top level record structure – so you cannot index sub-documents or arrays. LSIs can be queried to return strongly consistent data, but must be defined when the table is created. They cannot be added to existing tables and they cannot be removed without dropping the table.

DynamoDB indexes are sized and provisioned separately from the underlying tables, which may result in unforeseen issues at runtime. The DynamoDB documentation explains,

“In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.”

DynamoDB also supports multi-record ACID transactions. Unlike MongoDB transactions, each DynamoDB transaction is limited to just 25 write operations; the same item also cannot be targeted with multiple operations as a part of the same transaction. As a result, complex business logic may require multiple, independent transactions, which would add more code and overhead to the application, while also resulting in the possibility of more conflicts and transaction failures. Only base data in a DynamoDB table is transactional. Secondary indexes, backups and streams are updated “eventually”. This can lead to “silent data loss”. Subsequent queries against indexes can return data that has not yet been updated from the base tables, breaking transactional semantics. Similarly, data restored from backups may not be transactionally consistent with the original table.
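
For comparison, the shape of a DynamoDB transactional write through the AWS SDK for JavaScript (v2) DocumentClient looks roughly like this (the table and attribute names are invented, the call is assumed to run inside an async function, and the 25-item limit and one-operation-per-item rule described above still apply):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Every entry in TransactItems succeeds or fails together (at most 25, each on a distinct item).
await docClient
  .transactWrite({
    TransactItems: [
      { Put: { TableName: "orders", Item: { order_id: "o-1", status: "paid" } } },
      {
        Update: {
          TableName: "inventory",
          Key: { sku: "sku-42" },
          UpdateExpression: "SET stock = stock - :n",
          ExpressionAttributeValues: { ":n": 1 },
        },
      },
    ],
  })
  .promise();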

Consistency

MongoDB is strongly consistent by default as all read/writes go to the primary in a MongoDB replica set, scaled across multiple partitions (shards). If desired, consistency requirements for read operations can be relaxed. Through secondary consistency controls, read queries can be routed only to secondary replicas that fall within acceptable consistency limits with the primary server.

DynamoDB is eventually consistent by default. Users can configure read operations to return only strongly consistent data, but this doubles the cost of the read (see Pricing and Commercial Considerations) and adds latency. There is also no way to guarantee read consistency when querying against DynamoDB’s global secondary indexes (GSIs); any operation performed against a GSI will be eventually consistent, returning potentially stale or deleted data, and therefore increasing application complexity.
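
The per-request switch for a strongly consistent read looks like this with the DocumentClient (the table and key are invented; docClient is assumed to be set up as in the previous sketch, inside an async function):

// Strongly consistent read: returns the latest committed value, at twice the RCU cost.
const result = await docClient
  .get({
    TableName: "orders",
    Key: { order_id: "o-1" },
    ConsistentRead: true, // default is false (eventually consistent)
  })
  .promise();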

Operational Maturity

MongoDB Atlas allows users to deploy, manage, and scale their MongoDB clusters using built in operational and security best practices, such as end-to-end encryption, network isolation, role-based access control, VPC peering, and more. Atlas deployments are guaranteed to be available and durable with distributed and auto-healing replica set members and continuous backups with point in time recovery to protect against data corruption. MongoDB Atlas is fully elastic with zero downtime configuration changes and auto-scaling both storage and compute capacity. Atlas also grants organizations deep insights into how their databases are performing with a comprehensive monitoring dashboard, a real-time performance panel, and customizable alerting.

For organizations that would prefer to run MongoDB on their own infrastructure, MongoDB, Inc. offers advanced operational tooling to handle the automation of the entire database lifecycle, comprehensive monitoring (tracking 100+ metrics that could impact performance), and continuous backup. Product packages like MongoDB Enterprise Advanced bundle operational tooling and visualization and performance optimization platforms with end-to-end security controls for applications managing sensitive data.

MongoDB’s deployment flexibility allows single clusters to span racks, data centers and continents. With replica sets supporting up to 50 members and geography-aware sharding across regions, administrators can provision clusters that support globally distributed deployments, with write local/read global access patterns and data locality. Using Atlas Global Clusters, developers can deploy fully managed “write anywhere” active-active clusters, allowing data to be localized to any region. With each region acting as primary for its own data, the risks of data loss and eventual consistency imposed by the multi-primary approach used by DynamoDB are eliminated, and customers can meet the data sovereignty demands of new privacy regulations. Finally, multi-cloud clusters enable users to provision clusters that span AWS, Azure, and Google Cloud, giving maximum resilience and flexibility in terms of data distribution.

Offered only as a managed service on AWS, DynamoDB abstracts away its underlying partitioning and replication schemes. While provisioning is simple, other key operational tasks are lacking when compared to MongoDB:

  • Fewer than 20 database metrics are reported by AWS Cloudwatch, which limits visibility into real-time database behavior
  • AWS CloudTrail can be used to create audit trails, but it only tracks a small subset of DDL (administrative) actions to the database, not all user access to individual tables or records
  • DynamoDB has limited tooling to allow developers and/or DBAs to optimize performance by visualizing schema or graphically profiling query performance
  • DynamoDB supports cross region replication with multi-primary global tables, however these add further application complexity and cost, with eventual consistency, risks of data loss due to write conflicts between regions, and no automatic client failover

Pricing & Commercial Considerations

In this section we will again compare DynamoDB with its closest analog from MongoDB, Inc., MongoDB Atlas.

DynamoDB's pricing model is based on throughput. Users pay for a certain capacity on a given table and AWS automatically throttles any reads or writes that exceed that capacity.

This sounds simple in theory, but the reality is that correctly provisioning throughput and estimating pricing is far more nuanced.

Below is a list of all the factors that could impact the cost of running DynamoDB:

  • Size of the data set per month
  • Size of each object
  • Number of reads per second (pricing is based on “read capacity units”, which are equivalent to reading a 4KB object) and whether those reads need to be strongly consistent or eventually consistent (the former is twice as expensive)
    • If accessing a JSON object, the entire document must be retrieved, even if the application needs to read only a single element
  • Number of writes per second (pricing is based on “write capacity units”, which are the equivalent of writing a 1KB object)
  • Whether transactions will be used. Transactions double the cost of read and write operations
  • Whether clusters will be replicated across multiple regions. This increases write capacity costs by 50%.
  • Size and throughput requirements for each index created against the table
  • Costs for backup and restore. AWS offers on-demand and continuous backups – both are charged separately, at different rates for both the backup and restore operation
  • Data transferred by Dynamo streams per month
  • Data transfers both in and out of the database per month
  • Cross-regional data transfers, EC2 instances, and SQS queues needed for cross-regional deployments
  • The use of additional AWS services to address what is missing from DynamoDB’s limited key value query model
  • Use of on-demand or reserved instances
  • Number of metrics pushed into CloudWatch for monitoring
  • Number of events pushed into CloudTrail for database auditing

It is key to point out from the list above that indexes affect pricing and strongly consistent reads are twice as expensive.
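
As a rough, hedged illustration of how the capacity units in the list above translate into provisioned throughput (the object sizes and request rates are invented, the rounding is simplified, and prices are deliberately left out because they vary by region and over time):

// One RCU = one strongly consistent read of up to 4KB per second (an eventually
// consistent read costs half). One WCU = one write of up to 1KB per second.
function requiredCapacity(readsPerSec, readSizeKb, writesPerSec, writeSizeKb, stronglyConsistent) {
  const rcuPerRead = Math.ceil(readSizeKb / 4) * (stronglyConsistent ? 1 : 0.5);
  const wcuPerWrite = Math.ceil(writeSizeKb / 1);
  return {
    rcu: Math.ceil(readsPerSec * rcuPerRead),
    wcu: Math.ceil(writesPerSec * wcuPerWrite),
  };
}

// 500 reads/s of 6KB items (strongly consistent) and 200 writes/s of 2KB items:
console.log(requiredCapacity(500, 6, 200, 2, true)); // { rcu: 1000, wcu: 400 }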

With DynamoDB, throughput pricing actually dictates the number of partitions, not total throughput. Since users don’t have precise control over partitioning, if any individual partition is saturated, one may have to dramatically increase capacity by splitting partitions rather than scaling linearly. Very careful design of the data model is essential to ensure that provisioned throughput can be realized.

AWS has introduced the concept of Adaptive Capacity, which will automatically increase the available resources for a single partition when it becomes saturated, however it is not without limitations. Total read and write volume to a single partition cannot exceed 3,000 read capacity units and 1,000 write capacity units per second. The required throughput increase cannot exceed the total provisioned capacity for the table. Adaptive capacity doesn’t grant more resources so much as borrow them from less utilized partitions. And finally, DynamoDB may take up to 15 minutes to provision additional capacity.

For customers frustrated with capacity planning exercises for DynamoDB, AWS recently introduced DynamoDB On-Demand, which will allow the platform to automatically provision additional resources based on workload demand. On-demand is suitable for low-volume workloads with short spikes in demand. However, it can get expensive quickly — when the database’s utilization rate exceeds 14% of the equivalent provisioned capacity, DynamoDB On-Demand becomes more expensive than provisioning throughput.

Compared to DynamoDB, pricing for MongoDB Atlas is relatively straightforward; users select just:

  • The instance size with enough RAM to accommodate the portion of your data (including indexes) that clients access most often
  • the number of replicas and shards that will make up the cluster
  • whether to include fully managed backups
  • the region(s) the cluster needs to run in

Users can adjust any of these parameters on demand. The only additional charge is for data transfer costs.

When to use DynamoDB vs. MongoDB

DynamoDB may work for organizations that are:

  • Looking for a database to support relatively simple key-value workloads
  • Heavily invested in AWS with no plans to change their deployment environment in the future

For organizations that need their database to support a wider range of use cases with more deployment flexibility and no platform lock-in, MongoDB would likely be a better fit.

For example, biotechnology giant Thermo Fisher migrated from DynamoDB to MongoDB for their Instrument Connect IoT app, citing that while both databases were easy to deploy, MongoDB Atlas allowed for richer queries and much simpler schema evolution.

Want to Learn More?

MongoDB Atlas Best Practices

This guide describes the best practices to help you get the most out of the MongoDB Atlas service, including: schema design, capacity planning, security, and performance optimization.

MongoDB Atlas Security Controls

This document will provide you with an understanding of MongoDB Atlas’ Security Controls and Features as well as a view into how many of the underlying mechanisms work.

from:Comparing DynamoDB and MongoDB | MongoDB

JWT implementation with Refresh Token in Node.js example | MongoDB

In the previous post, we learned how to build Token-based Authentication & Authorization with Node.js, JWT and MongoDB. This tutorial continues by adding a JWT Refresh Token to the Node.js Express Application. You will learn how to expire the JWT and then renew the Access Token with a Refresh Token.

Related Posts:
– Node.js, Express & MongoDb: Build a CRUD Rest Api example
– How to upload/store images in MongoDB using Node.js, Express & Multer
– Using MySQL/PostgreSQL instead: JWT Refresh Token implementation in Node.js example

Associations:
– MongoDB One-to-One relationship with Mongoose example
– MongoDB One-to-Many Relationship tutorial with Mongoose examples
– MongoDB Many-to-Many Relationship with Mongoose examples

The code in this post is based on the previous article, which you should read first:
Node.js + MongoDB: User Authentication & Authorization with JWT

Overview of JWT Refresh Token with Node.js example

We already have a Node.js Express & MongoDB application in that:

  • User can signup new account, or login with username & password.
  • Based on the User's role (admin, moderator, user), we authorize the User to access resources

With APIs:

Methods   Urls                 Actions
POST      /api/auth/signup     signup new account
POST      /api/auth/signin     login an account
GET       /api/test/all        retrieve public content
GET       /api/test/user       access User's content
GET       /api/test/mod        access Moderator's content
GET       /api/test/admin      access Admin's content

For more details, please visit this post.

We’re gonna add Token Refresh to this Node.js & JWT Project.
The final result can be described with the following requests/responses:

– Send /signin request, return response with refreshToken.

jwt-refresh-token-node-js-example-mongodb-signin

– Access resource successfully with accessToken.

jwt-refresh-token-node-js-example-mongodb-access-resource

– When the accessToken is expired, user cannot use it anymore.

jwt-refresh-token-node-js-example-mongodb-expire-token

– Send /refreshtoken request, return response with new accessToken.

jwt-refresh-token-node-js-example-mongodb-send-token-refresh-request

– Access resource successfully with new accessToken.

jwt-refresh-token-node-js-example-mongodb-new-token-access-resource

– Send an expired Refresh Token.

jwt-refresh-token-node-js-example-mongodb-expire-refresh-token

– Send an inexistent Refresh Token.

jwt-refresh-token-node-js-example-mongodb-token-not-exist-database

– Axios Client to check this: Axios Interceptors tutorial with Refresh Token example


Flow for JWT Refresh Token implementation

The diagram shows flow of how we implement Authentication process with Access Token and Refresh Token.

jwt-refresh-token-node-js-example-flow

– A legal JWT must be added to HTTP Header if Client accesses protected resources.
– A refreshToken will be provided at the time the user signs in.

How to Expire JWT Token in Node.js

The Refresh Token has a different value and expiration time from the Access Token.
We usually configure the Refresh Token's expiration time to be longer than the Access Token's.

Open config/auth.config.js:

module.exports = {
  secret: "bezkoder-secret-key",
  jwtExpiration: 3600,           // 1 hour
  jwtRefreshExpiration: 86400,   // 24 hours

  /* for test */
  // jwtExpiration: 60,          // 1 minute
  // jwtRefreshExpiration: 120,  // 2 minutes
};

Update the middlewares/authJwt.js file to catch TokenExpiredError in the verifyToken() function.

const jwt = require("jsonwebtoken");
const config = require("../config/auth.config");
const db = require("../models");
...
const { TokenExpiredError } = jwt;

const catchError = (err, res) => {
  if (err instanceof TokenExpiredError) {
    return res.status(401).send({ message: "Unauthorized! Access Token was expired!" });
  }

  return res.status(401).send({ message: "Unauthorized!" });
}

const verifyToken = (req, res, next) => {
  let token = req.headers["x-access-token"];

  if (!token) {
    return res.status(403).send({ message: "No token provided!" });
  }

  jwt.verify(token, config.secret, (err, decoded) => {
    if (err) {
      return catchError(err, res);
    }
    req.userId = decoded.id;
    next();
  });
};

Create Refresh Token Model

This Mongoose model has a one-to-one relationship with the User model. It contains an expiryDate field whose value is set by adding the config.jwtRefreshExpiration value above.

There are 2 static methods:

  • createToken: uses the uuid library to create a random token and saves a new object into the MongoDB database
  • verifyExpiration: compares expiryDate with the current date/time to check whether the token has expired

const mongoose = require("mongoose");
const config = require("../config/auth.config");
const { v4: uuidv4 } = require('uuid');

const RefreshTokenSchema = new mongoose.Schema({
  token: String,
  user: {
    type: mongoose.Schema.Types.ObjectId,
    ref: "User",
  },
  expiryDate: Date,
});

RefreshTokenSchema.statics.createToken = async function (user) {
  let expiredAt = new Date();

  expiredAt.setSeconds(
    expiredAt.getSeconds() + config.jwtRefreshExpiration
  );

  let _token = uuidv4();

  let _object = new this({
    token: _token,
    user: user._id,
    expiryDate: expiredAt.getTime(),
  });

  console.log(_object);

  let refreshToken = await _object.save();

  return refreshToken.token;
};

RefreshTokenSchema.statics.verifyExpiration = (token) => {
  return token.expiryDate.getTime() < new Date().getTime();
}

const RefreshToken = mongoose.model("RefreshToken", RefreshTokenSchema);

module.exports = RefreshToken;

Don’t forget to export this model in models/index.js:

const mongoose = require('mongoose');
mongoose.Promise = global.Promise;

const db = {};

db.mongoose = mongoose;

db.user = require("./user.model");
db.role = require("./role.model");
db.refreshToken = require("./refreshToken.model");

db.ROLES = ["user", "admin", "moderator"];

module.exports = db;

Node.js Express Rest API for JWT Refresh Token

Let’s update the payloads for our Rest APIs:
– Requests:

  • { refreshToken }

– Responses:

  • Signin Response: { accessToken, refreshToken, id, username, email, roles }
  • Message Response: { message }
  • RefreshToken Response: { new accessToken, refreshToken }

In the Auth Controller, we:

  • update the method for /signin endpoint with Refresh Token
  • expose the POST API for creating new Access Token from received Refresh Token

controllers/auth.controller.js

const config = require("../config/auth.config");
const db = require("../models");
const { user: User, role: Role, refreshToken: RefreshToken } = db;

const jwt = require("jsonwebtoken");
const bcrypt = require("bcryptjs");

...
exports.signin = (req, res) => {
  User.findOne({
    username: req.body.username,
  })
    .populate("roles", "-__v")
    .exec(async (err, user) => {
      if (err) {
        res.status(500).send({ message: err });
        return;
      }

      if (!user) {
        return res.status(404).send({ message: "User Not found." });
      }

      let passwordIsValid = bcrypt.compareSync(
        req.body.password,
        user.password
      );

      if (!passwordIsValid) {
        return res.status(401).send({
          accessToken: null,
          message: "Invalid Password!",
        });
      }

      let token = jwt.sign({ id: user.id }, config.secret, {
        expiresIn: config.jwtExpiration,
      });

      let refreshToken = await RefreshToken.createToken(user);

      let authorities = [];

      for (let i = 0; i < user.roles.length; i++) {
        authorities.push("ROLE_" + user.roles[i].name.toUpperCase());
      }
      res.status(200).send({
        id: user._id,
        username: user.username,
        email: user.email,
        roles: authorities,
        accessToken: token,
        refreshToken: refreshToken,
      });
    });
};

exports.refreshToken = async (req, res) => {
  const { refreshToken: requestToken } = req.body;

  if (requestToken == null) {
    return res.status(403).json({ message: "Refresh Token is required!" });
  }

  try {
    let refreshToken = await RefreshToken.findOne({ token: requestToken });

    if (!refreshToken) {
      res.status(403).json({ message: "Refresh token is not in database!" });
      return;
    }

    if (RefreshToken.verifyExpiration(refreshToken)) {
      RefreshToken.findByIdAndRemove(refreshToken._id, { useFindAndModify: false }).exec();
      
      res.status(403).json({
        message: "Refresh token was expired. Please make a new signin request",
      });
      return;
    }

    let newAccessToken = jwt.sign({ id: refreshToken.user._id }, config.secret, {
      expiresIn: config.jwtExpiration,
    });

    return res.status(200).json({
      accessToken: newAccessToken,
      refreshToken: refreshToken.token,
    });
  } catch (err) {
    return res.status(500).send({ message: err });
  }
};

In refreshToken() function:

  • Firstly, we get the Refresh Token from request data
  • Next, get the RefreshToken object { id, user, token, expiryDate } from the raw Token using the RefreshToken model's static method
  • We verify the token (expired or not) based on the expiryDate field. If the Refresh Token has expired, remove it from the MongoDB database and return a message
  • Continue to use the user._id field of the RefreshToken object as a parameter to generate a new Access Token using the jsonwebtoken library
  • Return { new accessToken, refreshToken } if everything is done
  • Otherwise, send an error message

Define Route for JWT Refresh Token API

Finally, we need to set up the routes that determine how the server responds at each endpoint.
In routes/auth.routes.js, add one line of code:

...
const controller = require("../controllers/auth.controller");

module.exports = function(app) {
  ...
  app.post("/api/auth/refreshtoken", controller.refreshToken);
};

Conclusion

Today we've learned how to implement a JWT Refresh Token in a Node.js example using an Express Rest API and MongoDB. You also know how to expire the JWT Token and renew the Access Token.

The code in this post is based on the previous article, which you should read first:
Node.js + MongoDB: User Authentication & Authorization with JWT

If you want to use MySQL/PostgreSQL instead, please visit:
JWT Refresh Token implementation in Node.js example

You can test this Rest API with:
– Axios Client: Axios Interceptors tutorial with Refresh Token example

Happy learning! See you again.

Further Reading

Fullstack CRUD application:
– MEVN: Vue.js + Node.js + Express + MongoDB example
– MEAN:
Angular 8 + Node.js + Express + MongoDB example
Angular 10 + Node.js + Express + MongoDB example
Angular 11 + Node.js + Express + MongoDB example
Angular 12 + Node.js + Express + MongoDB example
– MERN: React + Node.js + Express + MongoDB example

Source Code

You can find the complete source code for this tutorial on Github.

from:JWT implementation with Refresh Token in Node.js example | MongoDB – BezKoder

How to Fix Times in MongoDB Being 8 Hours Earlier Than the Actual Time

Symptom

Times stored in the database are always 8 hours earlier than the actual time.

Cause

MongoDB stores times in UTC (+0:00), while China's time zone is UTC+8:00.

Solution

If you are using the C# MongoDB.Driver, you only need to add an attribute to the entity's date/time property specifying the DateTime kind.

For example:

[BsonDateTimeOptions(Kind = DateTimeKind.Local)]
public DateTime EntryTime
{get;set;}

This attribute requires a reference to MongoDB.Bson.dll:

using MongoDB.Bson.Serialization.Attributes;

 

 

Redis 2.4.13 Installation and Deployment

1 Redis Introduction

Redis is short for Remote Dictionary Server. It is essentially a key/value database, a NoSQL database similar to Memcached, but its data can be persisted to disk, so data is not lost when the service restarts. Values can be strings, lists, sets or sorted (ordered) sets, and all of these data types support operations such as push/pop, add/remove, and server-side union, intersection and difference between two sets; all of these operations are atomic. Redis also supports a range of sorting capabilities.

2 Redis Feature Overview

• Sharding: the redis server does not currently provide a built-in shard feature like mongodb does; sharding can only be implemented on the client side, typically with a consistent-hashing algorithm. Redis does not support failover redundancy here, and instances cannot be added to or removed from a cluster online
• Master/slave replication:
  – one master supports multiple slaves
  – a slave can accept connections from other slaves, which may connect to it instead of to the master
  – replication is non-blocking on both the master and the slave
  – replication is used to provide scalability: slaves serve read-only queries and provide data redundancy
• Virtual Memory (VM):
  – because of performance problems, the VM mechanism was completely abandoned in version 2.4
  – the VM mode showed a number of problems in practice
  – having used redis 2.0.2, I found that with VM mode on and more than 1500 concurrent connections, latency increased dramatically: average per-request latency rose above 4000ms and the redis process CPU usage went above 100%. In the end I had no choice but to turn VM off, after which, with the same number of concurrent connections, latency dropped below 2ms and CPU usage fell to 1%
• Append-only file (AOF): Redis saves the dataset to the AOF according to a configured policy, and after a crash it can recover to its pre-crash state from the AOF
• Batch writes are supported
• Transactions: a group of commands can be queued and executed in one go, with no other commands interleaved during execution (guaranteed by Redis's single-threaded model)
• Pipelining: multiple commands can be submitted at once (when the commands do not depend on each other's results, this can improve efficiency considerably); see the sketch after this list
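
As a small, hedged sketch of the transaction (MULTI/EXEC) and pipelining behaviour described above, using the node_redis client from JavaScript rather than redis-cli (the key names are invented; the callback-style API shown is the one used by node_redis 2.x/3.x):

const redis = require("redis");
const client = redis.createClient(6379, "127.0.0.1");

// Transaction: the queued commands run back-to-back on the server,
// with no other client's commands interleaved.
client
  .multi()
  .set("page:views", 0)
  .incr("page:views")
  .incr("page:views")
  .exec((err, replies) => {
    console.log(replies); // [ 'OK', 1, 2 ]
  });

// Pipelining: batch() sends the queued commands in a single round trip,
// but without the atomicity guarantee of multi().
client
  .batch()
  .lpush("recent:ids", "42")
  .sadd("tags", "nosql")
  .exec((err, replies) => {
    console.log(replies);
    client.quit();
  });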

3 Redis Architecture Diagram

4 Redis Installation

Shell> wget http://redis.googlecode.com/files/redis-2.4.13.tar.gz   # download the package
Shell> tar -zxvf redis-2.4.13.tar.gz                                 # extract the package
Shell> cd redis-2.4.13                                               # enter the extracted directory
Shell> make                                                          # compile
Shell> make test                                                     # check that the build succeeded

This version needs neither configure nor make install; after make finishes, a few extra binaries appear in the src directory:

redis-server      # Redis server startup command
redis-benchmark   # benchmarking tool for measuring server performance
redis-check-aof
redis-check-dump
redis-cli         # Redis command-line client

To keep the deployment tidy and easy to manage, do the following:

shell> mkdir -p bin
shell> mkdir -p conf
shell> mkdir -p logs
shell> mkdir -p data
shell> cd src
shell> cp redis-server redis-cli redis-benchmark ../bin
shell> cd ..
shell> cp redis.conf conf

Adjust the configuration file:

daemonize yes                                # run in the background
pidfile /opt/redis-2.4.13/bin/redis.pid      # pid file path
port 6379                                    # listening port
logfile /opt/redis-2.4.13/logs/stdout.log    # log file path
dbfilename /opt/redis-2.4.13/data/dump.rdb   # database file path

5 Starting and Stopping the Redis Service

Shell> bin/redis-server /opt/redis-2.4.13/conf/redis.conf   # start the service; it runs in the background
Shell> ps -ef | grep redis                                  # check that the process is running
Shell> netstat -ntlp | grep 6379                            # check the default listening port
Shell> bin/redis-benchmark                                  # benchmark tool; measures read/write performance on this system
Shell> bin/redis-cli                                        # command-line client; check that the server responds
Shell> bin/redis-cli shutdown                               # stop the Redis service

6 Redis Configuration File Explained

Configuration parameter notes:

1. By default Redis does not run as a daemon; set this option to yes to enable daemon mode
   daemonize no

2. When Redis runs as a daemon it writes its pid to /var/run/redis.pid by default; the location can be changed with pidfile
   pidfile /var/run/redis.pid

3. The port Redis listens on; the default is 6379. The author explained in one of his blog posts that 6379 was chosen because it spells MERZ on a phone keypad, after the Italian showgirl Alessia Merz
   port 6379

4. The host address to bind to
   bind 127.0.0.1

5. How long a client may stay idle before its connection is closed; 0 disables the timeout
   timeout 300
   If the application uses a connection pool, it is best to set this to 0 so the server never disconnects clients automatically; otherwise exceptions such as java.net.SocketTimeoutException: Read timed out or It seems like server has closed the connection are easy to hit. In any case the application must keep its connection count under control and always close the connections it opens

6. Log level; Redis supports four levels: debug, verbose, notice and warning. The default is verbose
   loglevel verbose

7. Log destination; the default is standard output. If Redis is configured to run as a daemon while logging is still set to standard output, the logs are sent to /dev/null
   logfile stdout

8. Number of databases; the default database is 0, and a connection can switch databases with SELECT <dbid>
   databases 16

9. How many changes within how many seconds trigger a save of the data to the data file; several conditions can be combined
   save <seconds> <changes>
   The default configuration file provides three conditions:
   save 900 1
   save 300 10
   save 60 10000
   meaning 1 change within 900 seconds (15 minutes), 10 changes within 300 seconds (5 minutes), or 10000 changes within 60 seconds.

10. Whether to compress data when saving to the local database; the default is yes (LZF compression). Turning it off saves CPU time but makes the database file much larger
    rdbcompression yes

11. Local database file name; the default is dump.rdb
    dbfilename dump.rdb

12. Directory in which the local database is stored
    dir ./

13. When this instance is a slave, the IP address and port of the master; on startup Redis automatically synchronises data from the master
    slaveof <masterip> <masterport>

14. When the master is password-protected, the password the slave uses to connect to it
    masterauth <master-password>

15. Connection password; if set, clients must supply it with AUTH <password> when connecting. Disabled by default
    requirepass foobared

16. Maximum number of simultaneous client connections. By default there is no limit beyond the maximum number of file descriptors the Redis process may open; maxclients 0 means no limit. When the limit is reached, Redis closes new connections and returns a max number of clients reached error
    maxclients 128

17. Maximum memory limit. Redis loads data into memory at startup; once the limit is reached it first tries to evict keys that have expired or are about to expire, and if memory is still exhausted, writes fail while reads keep working. (Under the VM mechanism, keys stay in memory while values are stored in the swap area)
    maxmemory <bytes>

18. Whether to log every update operation (the append-only file). By default Redis writes data to disk asynchronously, so without this option a power failure can lose the data from a short window of time, because the data file is only synchronised according to the save conditions above and some data exists only in memory for a while. The default is no
    appendonly no

19. Append-only file name; the default is appendonly.aof
    appendfilename appendonly.aof

20. Append fsync policy; there are three options:
    no: let the operating system flush the data cache to disk (fast)
    always: call fsync() after every update to write the data to disk (slow, safe)
    everysec: sync once per second (a compromise, and the default)
    appendfsync everysec

21. Whether to enable the virtual memory mechanism; the default is no. Briefly, the VM mechanism stores data in pages and Redis swaps rarely-accessed pages (cold data) out to disk, while frequently-accessed pages are swapped back from disk into memory (the VM mechanism is analysed in detail in a later article)
    vm-enabled no

22. Virtual memory swap file path; the default is /tmp/redis.swap. It must not be shared between multiple Redis instances
    vm-swap-file /tmp/redis.swap

23. All data beyond vm-max-memory is stored in virtual memory. No matter how small vm-max-memory is, all index data (the keys) stays in memory; in other words, with vm-max-memory set to 0 all values actually live on disk. The default is 0
    vm-max-memory 0

24. The Redis swap file is split into many pages. An object may span multiple pages, but a page cannot be shared by multiple objects, so vm-page-size should be set according to the size of the stored data: the author suggests 32 or 64 bytes for many small objects, a larger page size for very large objects, and the default if unsure
    vm-page-size 32

25. Number of pages in the swap file. Because the page table (a bitmap marking pages as free or in use) is kept in memory, every 8 pages on disk consume 1 byte of memory
    vm-pages 134217728

26. Number of threads used to access the swap file; it is best not to exceed the number of cores on the machine. If set to 0, all swap-file operations are serialised, which can cause long delays. The default is 4
    vm-max-threads 4

27. Whether to combine smaller packets into a single packet when replying to clients; enabled by default
    glueoutputbuf yes

28. Use a special hash encoding as long as the number of entries and the size of the largest element stay below these thresholds
    hash-max-zipmap-entries 64
    hash-max-zipmap-value 512

29. Whether to activate incremental rehashing; enabled by default (covered in more detail when Redis's hashing algorithm is introduced later)
    activerehashing yes

30. Include other configuration files, so several Redis instances on the same host can share one common configuration file while each instance keeps its own specific settings
    include /path/to/local.conf

7 Configuration for Running Multiple Ports

Redis is a single-process service, so the number of instances to run should be chosen based on the number of CPU cores in order to get the best performance out of the machine; for example, on an 8-core server it is best to run 8 instances. Only the first instance is shown here as an example.

1. Assume Redis is installed under /opt/redis, i.e. this directory contains the redis-benchmark, redis-cli and redis-server executables. Create a servers folder underneath it to hold all the instances:

mkdir -p /opt/redis/servers/0/
mkdir -p /opt/redis/servers/0/conf
mkdir -p /opt/redis/servers/0/data
mkdir -p /opt/redis/servers/0/run
mkdir -p /opt/redis/servers/0/logs

2. Copy a configuration file into that instance's path:

cp redis.conf /opt/redis/servers/0/conf

3. Modify the following settings in the configuration file:

pidfile /opt/redis/servers/0/run/redis.pid
port 6380
logfile /opt/redis/servers/0/logs/stdout.log
dbfilename /opt/redis/servers/0/data/dump.rdb

4. Start and stop:

./redis-server /opt/redis/servers/0/conf/redis.conf   # start the service

./redis-cli -p 6380 shutdown                          # stop the service

8 Common Redis Maintenance Commands

1. ../bin/redis-cli keys \*   # list all keys

[root@monitordata]# ../bin/redis-cli keys \*

1) "name"

2) "name1"

2. ../bin/redis-cli info   # view redis runtime status

[root@monitordata]# ../bin/redis-cli info

redis_version:2.4.13

redis_git_sha1:00000000

redis_git_dirty:0

arch_bits:64

multiplexing_api:epoll

gcc_version:4.4.6

process_id:2738

uptime_in_seconds:6888

uptime_in_days:0

lru_clock:1508888

used_cpu_sys:0.08

used_cpu_user:0.02

used_cpu_sys_children:0.01

used_cpu_user_children:0.00

connected_clients:2

connected_slaves:0

client_longest_output_list:0

client_biggest_input_buf:0

blocked_clients:0

used_memory:734976

used_memory_human:717.75K

used_memory_rss:7323648

used_memory_peak:726504

used_memory_peak_human:709.48K

mem_fragmentation_ratio:9.96

mem_allocator:jemalloc-2.2.5

loading:0

aof_enabled:0

changes_since_last_save:0

bgsave_in_progress:0

last_save_time:1336293648

bgrewriteaof_in_progress:0

total_connections_received:8

total_commands_processed:19

expired_keys:0

evicted_keys:0

keyspace_hits:9

keyspace_misses:2

pubsub_channels:0

pubsub_patterns:0

latest_fork_usec:1679

vm_enabled:0

role:master

db0:keys=2,expires=0

from:http://hi.baidu.com/webwatch/item/47c7e3df6f4a37f592a97456