What’s Up With Floating Point?

I presume most people reading this have used floating-point numbers at some point, often not intentionally.

I’m also fairly sure that a good number of those who have encountered them did so while trying to discover why the result of a simple computation was incorrect. For example:

0.1 + 0.2
// result: 0.30000000000000004
// Which I'm fairly sure
// should be 0.3 ...

The problem is that without understanding what a floating-point number is, you will often find yourself frustrated when the result you expected is ever-so-slightly different.

My goal here is to clarify what a floating-point number is, why we use them, and how they work.

Why do we even need floating point 🤔?

It’s not a bold statement to say that computers need to store numbers. We store these numbers in binary on our phones, laptops, fridges etc.

I hope that most people reading this are familiar with binary numbers in some form; if not, consider reading this blog post by Linda Vivah.

But what about decimal? Fractions, π, real numbers?

For any useful computation, we need computers to be able to represent the following:

  • Veryveryveryvery SMALL numbers,
  • Veryveryveryvery BIG numbers,
  • Everything in-between!

Starting with the veryveryveryvery small numbers to take us in the right direction; how do we store them?

Well, that’s simple. We store those using the binary equivalent of decimal representation…

Binary fractions

An example should help here. Let’s choose a random binary fraction: 101011.101

 

This is very similar to how decimal numbers work. The only difference is that we have base 2 instead of base 10. If you chuck the above in a binary converter of your choice, you’ll find that it is correct!
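
If you would rather check the conversion yourself than trust a converter, here is a minimal sketch in Python. The function name binary_fraction_to_decimal is my own invention for illustration; it simply sums each digit times its power of 2.

```python
def binary_fraction_to_decimal(text: str) -> float:
    """Convert a binary string such as '101011.101' to its decimal value."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Digits left of the point carry weights 2^0, 2^1, 2^2, ...
    for power, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** power
    # Digits right of the point carry weights 2^-1, 2^-2, 2^-3, ...
    for power, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -power
    return value


print(binary_fraction_to_decimal("101011.101"))  # 43.625
```

The whole part 101011 is 32 + 8 + 2 + 1 = 43, and the fractional part .101 is 1/2 + 1/8 = 0.625.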

So how might we store these binary fractions?

Let’s say we allocate one byte (8 bits) to store our binary fraction: 00000000. We then must choose a place to put our binary separator so that we can have the fractional part of our binary number.

Let’s try smack-bang in the middle!

0000.0000

What’s the biggest number we can represent with this?

1111.1111 = 15.9375

That’s… not very impressive. Neither is the smallest non-zero number we can represent, 0.0625.

There is a lot of wasted storage space here alongside a poor range of possible numbers. The issue is that choosing any place to put our point would leave us with either fractional precision or a larger integer range — not both.
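
To see the trade-off in numbers, here is a small sketch (the helper fixed_point_range is hypothetical, written just for this post) that reports the smallest positive value and the largest value of an unsigned 8-bit number for a few positions of the point.

```python
def fixed_point_range(total_bits: int, fraction_bits: int) -> tuple[float, float]:
    """Return (smallest positive value, largest value) of an unsigned fixed-point format."""
    step = 2.0 ** -fraction_bits             # value of the least significant bit
    largest = (2 ** total_bits - 1) * step   # every bit set to 1
    return step, largest


for fraction_bits in (0, 4, 8):
    print(fraction_bits, fixed_point_range(8, fraction_bits))
# 0 (1.0, 255.0)
# 4 (0.0625, 15.9375)
# 8 (0.00390625, 0.99609375)
```

Moving the point right buys range and costs precision; moving it left does the opposite.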

If we could just move the fractional point around as we needed, we could get so much more out of our limited storage. If only the point could float around as we needed it, a floating point if you will…

(I’m sorry, I had to 😅.)

So what is floating point?

Floating point is exactly that, a floating (fractional) point number that gives us the ability to change the relative size of our number.

So how do we mathematically represent a number in such a way that we,

  1. store the significant digits of the number we want to represent (E.G. the 12 in 0.00000012);
  2. know where to put the fractional point in relation to the significant digits (E.G. all the 0’s in 0.00000012)?

To do this, let’s time travel (for some) back to secondary school…

Standard Form (Scientific Notation)

Anyone remember mathsisfun? I somehow feel old right now but either way, this is from their website:

 

You can do the exact same thing with binary! Instead of 7*10^2 = 700, we can write

1010111100 * 2^0 = 10101111 * 2^2 = 700

Which is equivalent to 175 * 4 = 700. This is a great way to represent a number based on the significant digits and the placement of the fractional point relative to said digits.

That’s it! Floating point is a binary standard-form representation of numbers.

If we want to formalise the representation a little, we need to account for positive and negative numbers. To do this, we also add a sign to our number by multiplying by \pm 1:

(\text{sign}) * (\text{significant digits}) * (\text{base})^{(\text{some power})}

And back to the example given by mathsisfun…

700 = (1) * (7) * (10)^{2} = (1) * (10101111) * (2)^{2}

If you are reading other literature, you’ll find the representation will look something like…

(-1)^s * c * b^{e}

Where s is the sign bit, c is the significand/mantissa, b is the base, and e is the exponent.
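
As a rough illustration, Python can pull a float apart into these pieces. The function decompose below is my own sketch, not part of any library: it uses math.frexp and then strips trailing zero bits so the significand reads nicely. Real hardware stores the fields differently (that is a topic for IEEE 754), so treat this as the maths rather than the bit layout.

```python
import math


def decompose(x: float) -> tuple[int, int, int]:
    """Return (s, c, e) such that x == (-1)**s * c * 2**e, with c an odd integer (or 0)."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    c = int(m * 2 ** 53)        # scale the 53-bit significand up to an integer
    e -= 53
    while c and c % 2 == 0:     # strip trailing zero bits for readability
        c //= 2
        e += 1
    return s, c, e


s, c, e = decompose(700.0)
print(s, bin(c), e)   # 0 0b10101111 2
```

That is exactly the 10101111 * 2^2 from the example above.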

So why am I getting weird errors 😡?

So, we know what floating point is now. But why can’t I do something as simple as add two numbers together without the risk of error?

Well the problem lies partially with the computer, but mostly with mathematics itself.

Recurring Fractions

Recurring fractions are an interesting problem in number representation systems.

Let’s choose any fraction \frac{x}{y} in lowest terms. If y has a prime factor that isn’t also a factor of the base, it will be a recurring fraction.

This is why numbers like 1/21 can’t be represented in a finite number of digits; 21 has 7 and 3 as prime factors, neither of which are a factor of 10.
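
The rule is easy to turn into a check. This is a minimal sketch (terminates is a name I made up) that assumes positive integers: reduce the fraction, strip out of the denominator every factor it shares with the base, and see whether anything is left.

```python
from math import gcd


def terminates(x: int, y: int, base: int) -> bool:
    """True if x/y has a finite expansion in the given base."""
    y //= gcd(x, y)   # reduce the fraction to lowest terms
    # Repeatedly remove from y every factor it shares with the base.
    while (g := gcd(y, base)) > 1:
        y //= g
    return y == 1     # anything left over is a prime the base lacks


print(terminates(1, 21, 10))  # False
print(terminates(1, 10, 10))  # True
print(terminates(1, 10, 2))   # False
print(terminates(3, 8, 2))    # True
```

Note the third line: 1/10 is perfectly finite in base 10 but recurs in base 2.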

Let’s work through an example in decimal.

Decimal

Say you want to add the numbers 1/3 and 2/3. We all know the answer is 1, but if we are working in decimal, this isn’t as obvious.

This is because 1/3 = 0.333333333333…

It isn’t a number that can be represented as a finite number of digits in decimal. As we can’t store infinite digits, we store an approximation accurate to 10 places.

The calculation becomes…

0.3333333333 + 0.6666666666 = 0.9999999999

Which is definitely not 1. It’s real close, but it’s not 1.

The finite nature in which we can store numbers doesn’t mesh well with the inevitable fact that we can’t easily represent all numbers in a finite number of digits.
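
You can reproduce this with Python's decimal module. The sketch below assumes ten significant digits and truncation (ROUND_DOWN) to match the hand calculation above; with the default rounding mode, 2/3 would be stored as 0.6666666667 instead.

```python
from decimal import Context, Decimal, ROUND_DOWN

# Ten significant digits, truncating rather than rounding.
ctx = Context(prec=10, rounding=ROUND_DOWN)

one_third = ctx.divide(Decimal(1), Decimal(3))    # 0.3333333333
two_thirds = ctx.divide(Decimal(2), Decimal(3))   # 0.6666666666
total = ctx.add(one_third, two_thirds)

print(total)                 # 0.9999999999
print(total == Decimal(1))   # False
```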

Binary

This exact same problem occurs in binary, except it’s even worse! The reason for this is that 2 has one fewer prime factor than 10: it lacks 5.

Because of this, recurring fractions happen more commonly in base 2.

An example of this is 0.1:

In decimal, that’s easy. In binary however… 0.00011001100110011..., it’s another recurring fraction!

So trying to perform 0.1 + 0.2 becomes

  0.0001100110011
+ 0.0011001100110
= 0.0100110011001
= 0.299926758

Earlier we got 0.30000000000000004 rather than the value above. The difference comes from the number of bits kept and from things like rounding modes, which I won’t go into here (but will do so in a future post). The same principle is causing the issue.
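
Here is a sketch of the hand calculation using exact fractions. The helper truncate_to_bits is hypothetical, and 13 fractional bits is simply the width used in the sum above; a real double keeps 53 significant bits and rounds to nearest instead of truncating.

```python
import math
from decimal import Decimal
from fractions import Fraction


def truncate_to_bits(value: Fraction, bits: int) -> Fraction:
    """Keep only the first `bits` binary digits after the point."""
    scale = 2 ** bits
    return Fraction(math.floor(value * scale), scale)


a = truncate_to_bits(Fraction(1, 10), 13)   # 0.0001100110011 in binary
b = truncate_to_bits(Fraction(2, 10), 13)   # 0.0011001100110 in binary
print(float(a + b))                         # 0.2999267578125

# The same effect, at the precision your machine actually uses:
print(0.1 + 0.2)                            # 0.30000000000000004
print(Decimal(0.1))                         # the exact value stored for 0.1
```

The last line prints a long decimal slightly above 0.1, which is the nearest double to one tenth.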

The size of the errors introduced this way is lessened by rounding.

Precision

The other main issue comes in the form of precision.

We only have a certain number of bits dedicated to the significant digits of our floating point number.

Decimal

As an example, consider we have three decimal values to store our significant digits.

If we compare 333 and 1/3 * 10^3, we would find that in our system they are the exact same.

This is because we only have three values of precision to store the significant digits of our number, and that involves truncating the end off of our recurring fraction.

In an extreme example, adding 1 to 1 * 10^3 will result in 1 * 10^3, the number hasn’t changed. This is because you need four significant digits to represent 1001.

Binary

This exact same issue occurs in binary with veryveryveryvery small and veryveryveryvery big numbers. In a future post I will be talking more about the limits of floating point.

For completeness, consider the previous example in binary, where we now have 3 bits to represent our significant digits.

Switching to base 2, take 1 * 2^3: adding 1 to it will result in no change. This is because representing 1001 (now equivalent to 9 in decimal) requires 4 bits, so the least significant binary digit is lost.

There is no real fix here: the limits of precision are defined by the number of bits we can store. To push the limit further out, use a larger floating-point data type.

  • E.G. move from a single-precision floating-point number to a double-precision floating-point number.
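
A quick sketch of that move in Python, whose float is double precision. The helper to_single is my own; it uses the struct module to squeeze a value through single precision and back. Single precision keeps 24 significant bits and double keeps 53.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 2**24 + 1 needs 25 significant bits, one more than single precision has.
print(to_single(2.0 ** 24 + 1) == 2.0 ** 24)   # True: the +1 is lost

# Double precision has room to spare for that sum...
print(2.0 ** 24 + 1 == 2.0 ** 24)              # False

# ...but the same loss reappears at 2**53.
print(2.0 ** 53 + 1 == 2.0 ** 53)              # True
```

A bigger type moves the cliff; it does not remove it.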

Summary

TLDR

To bring it all together, floating-point numbers are a representation of binary values akin to standard-form or scientific notation.

It allows us to store BIG and small numbers with precision.

To do this we move the fractional point relative to the significant digits of the number to make that number bigger or smaller at a rate proportional to the base being used.

Most of the errors associated with floating point come in the form of representing recurring fractions in a finite number of bits. Rounding modes help to reduce these errors.

Thanks 🥰!

Thank you so much for reading! I hope this was helpful to those that needed a little refresher on floating point and also to those that are new to this concept.

I will be making one or two updates to this post explaining the in-depths of the IEEE 754-2008 Floating Point Standard, so if you have questions like:

  • “What are the biggest and smallest numbers I can use in floating point?”
  • “What do financial institutions do about these errors?”
  • “Can we use other bases?”
  • “How do we actually perform floating-point arithmetic?”

Then feel free to follow to see an update! You can also follow me on twitter @tim_cb_roderick for updates. If you have any questions please feel free to leave a comment below.

from:https://timroderick.com/floating-point-introduction/

A Detailed Look at RFC 8446 (a.k.a. TLS 1.3)

For the last five years, the Internet Engineering Task Force (IETF), the standards body that defines internet protocols, has been working on standardizing the latest version of one of its most important security protocols: Transport Layer Security (TLS). TLS is used to secure the web (and much more!), providing encryption and ensuring the authenticity of every HTTPS website and API. The latest version of TLS, TLS 1.3 (RFC 8446) was published today. It is the first major overhaul of the protocol, bringing significant security and performance improvements. This article provides a deep dive into the changes introduced in TLS 1.3 and its impact on the future of internet security.

An evolution

One major way Cloudflare provides security is by supporting HTTPS for websites and web services such as APIs. With HTTPS (the “S” stands for secure) the communication between your browser and the server travels over an encrypted and authenticated channel. Serving your content over HTTPS instead of HTTP provides confidence to the visitor that the content they see is presented by the legitimate content owner and that the communication is safe from eavesdropping. This is a big deal in a world where online privacy is more important than ever.

The machinery under the hood that makes HTTPS secure is a protocol called TLS. It has its roots in a protocol called Secure Sockets Layer (SSL) developed in the mid-nineties at Netscape. By the end of the 1990s, Netscape handed SSL over to the IETF, who renamed it TLS and have been the stewards of the protocol ever since. Many people still refer to web encryption as SSL, even though the vast majority of services have switched over to supporting TLS only. The term SSL continues to have popular appeal and Cloudflare has kept the term alive through product names like Keyless SSL and Universal SSL.

Timeline

In the IETF, protocols are published as RFCs. TLS 1.0 was RFC 2246, TLS 1.1 was RFC 4346, and TLS 1.2 was RFC 5246. Today, TLS 1.3 was published as RFC 8446. RFCs are generally published in order, so keeping 46 as part of the RFC number is a nice touch.

TLS 1.2 wears parachute pants and shoulder pads

MC Hammer, like SSL, was popular in the 90s

Over the last few years, TLS has seen its fair share of problems. First of all, there have been problems with the code that implements TLS, including Heartbleed, BERserk, goto fail;, and more. These issues are not fundamental to the protocol and mostly resulted from a lack of testing. Tools like TLS Attacker and Project Wycheproof have helped improve the robustness of TLS implementations, but the more challenging problems faced by TLS have had to do with the protocol itself.

TLS was designed by engineers using tools from mathematicians. Many of the early design decisions from the days of SSL were made using heuristics and an incomplete understanding of how to design robust security protocols. That said, this isn’t the fault of the protocol designers (Paul Kocher, Phil Karlton, Alan Freier, Tim Dierks, Christopher Allen and others), as the entire industry was still learning how to do this properly. When TLS was designed, formal papers on the design of secure authentication protocols like Hugo Krawczyk’s landmark SIGMA paper were still years away. TLS was 90s crypto: It meant well and seemed cool at the time, but the modern cryptographer’s design palette has moved on.

Many of the design flaws were discovered using formal verification. Academics attempted to prove certain security properties of TLS, but instead found counter-examples that were turned into real vulnerabilities. These weaknesses range from the purely theoretical (SLOTH and CurveSwap), to feasible for highly resourced attackers (WeakDH, LogJam, FREAK, SWEET32), to practical and dangerous (POODLE, ROBOT).

TLS 1.2 is slow

Encryption has always been important online, but historically it was only used for things like logging in or sending credit card information, leaving most other data exposed. There has been a major trend in the last few years towards using HTTPS for all traffic on the Internet. This has the positive effect of protecting more of what we do online from eavesdroppers and injection attacks, but has the downside that new connections get a bit slower.

For a browser and web server to agree on a key, they need to exchange cryptographic data. The exchange, called the “handshake” in TLS, has remained largely unchanged since TLS was standardized in 1999. The handshake requires two additional round-trips between the browser and the server before encrypted data can be sent (or one when resuming a previous connection). The additional cost of the TLS handshake for HTTPS results in a noticeable hit to latency compared to HTTP alone. This additional delay can negatively impact performance-focused applications.

Defining TLS 1.3

Unsatisfied with the outdated design of TLS 1.2 and two-round-trip overhead, the IETF set about defining a new version of TLS. In August 2013, Eric Rescorla laid out a wishlist of features for the new protocol:
https://www.ietf.org/proceedings/87/slides/slides-87-tls-5.pdf

After some debate, it was decided that this new version of TLS was to be called TLS 1.3. The main issues that drove the design of TLS 1.3 were mostly the same as those presented five years ago:

  • reducing handshake latency
  • encrypting more of the handshake
  • improving resiliency to cross-protocol attacks
  • removing legacy features

The specification was shaped by volunteers through an open design process, and after four years of diligent work and vigorous debate, TLS 1.3 is now in its final form: RFC 8446. As adoption increases, the new protocol will make the internet both faster and more secure.

In this blog post I will focus on the two main advantages TLS 1.3 has over previous versions: security and performance.

Trimming the hedges

In the last two decades, we as a society have learned a lot about how to write secure cryptographic protocols. The parade of cleverly-named attacks from POODLE to Lucky13 to SLOTH to LogJam showed that even TLS 1.2 contains antiquated ideas from the early days of cryptographic design. One of the design goals of TLS 1.3 was to correct previous mistakes by removing potentially dangerous design elements.

Fixing key exchange

TLS is a so-called “hybrid” cryptosystem. This means it uses both symmetric key cryptography (encryption and decryption keys are the same) and public key cryptography (encryption and decryption keys are different). Hybrid schemes are the predominant form of encryption used on the Internet and are used in SSH, IPsec, Signal, WireGuard and other protocols. In hybrid cryptosystems, public key cryptography is used to establish a shared secret between both parties, and the shared secret is used to create symmetric keys that can be used to encrypt the data exchanged.

As a general rule, public key crypto is slow and expensive (microseconds to milliseconds per operation) and symmetric key crypto is fast and cheap (nanoseconds per operation). Hybrid encryption schemes let you send a lot of encrypted data with very little overhead by only doing the expensive part once. Much of the work in TLS 1.3 has been about improving the part of the handshake where public keys are used to establish symmetric keys.

RSA key exchange

The public key portion of TLS is about establishing a shared secret. There are two main ways of doing this with public key cryptography. The simpler way is with public-key encryption: one party encrypts the shared secret with the other party’s public key and sends it along. The other party then uses its private key to decrypt the shared secret and … voila! They both share the same secret. This technique was discovered in 1977 by Rivest, Shamir and Adleman and is called RSA key exchange. In TLS’s RSA key exchange, the shared secret is decided by the client, who then encrypts it to the server’s public key (extracted from the certificate) and sends it to the server.

The other form of key exchange available in TLS is based on another form of public-key cryptography, invented by Diffie and Hellman in 1976, so-called Diffie-Hellman key agreement. In Diffie-Hellman, the client and server both start by creating a public-private key pair. They then send the public portion of their key share to the other party. When each party receives the public key share of the other, they combine it with their own private key and end up with the same value: the pre-main secret. The server then uses a digital signature to ensure the exchange hasn’t been tampered with. This key exchange is called “ephemeral” if the client and server both choose a new key pair for every exchange.
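
The arithmetic behind the exchange can be sketched in a few lines of Python. This is a toy, not how TLS does it: the prime P and generator G below are picked by hand for illustration, there is no signature step, and real deployments use vetted groups such as X25519 or P-256.

```python
import secrets

# Toy parameters for illustration only; not safe for real use.
P = 2 ** 127 - 1   # a Mersenne prime
G = 3


def make_key_pair() -> tuple[int, int]:
    private = secrets.randbelow(P - 3) + 2   # random value in [2, P - 2]
    public = pow(G, private, P)              # the "key share" sent to the peer
    return private, public


client_private, client_public = make_key_pair()
server_private, server_public = make_key_pair()

# Each side combines its own private key with the other side's public share.
client_secret = pow(server_public, client_private, P)
server_secret = pow(client_public, server_private, P)

print(client_secret == server_secret)   # True
```

Calling make_key_pair again for every connection is what makes the exchange "ephemeral": no long-lived private key is involved in deriving the secret.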

Both modes result in the client and server having a shared secret, but RSA mode has a serious downside: it’s not forward secret. That means that if someone records the encrypted conversation and then gets ahold of the RSA private key of the server, they can decrypt the conversation. This even applies if the conversation was recorded and the key is obtained some time well into the future. In a world where national governments are recording encrypted conversations and using exploits like Heartbleed to steal private keys, this is a realistic threat.

RSA key exchange has been problematic for some time, and not just because it’s not forward-secret. It’s also notoriously difficult to do correctly. In 1998, Daniel Bleichenbacher discovered a vulnerability in the way RSA encryption was done in SSL and created what’s called the “million-message attack,” which allows an attacker to perform an RSA private key operation with a server’s private key by sending a million or so well-crafted messages and looking for differences in the error codes returned. The attack has been refined over the years and in some cases only requires thousands of messages, making it feasible to do from a laptop. It was recently discovered that major websites (including facebook.com) were also vulnerable to a variant of Bleichenbacher’s attack called the ROBOT attack as recently as 2017.

To reduce the risks caused by non-forward secret connections and million-message attacks, RSA encryption was removed from TLS 1.3, leaving ephemeral Diffie-Hellman as the only key exchange mechanism. Removing RSA key exchange brings other advantages, as we will discuss in the performance section below.

Diffie-Hellman named groups

When it comes to cryptography, giving too many options leads to the wrong option being chosen. This principle is most evident when it comes to choosing Diffie-Hellman parameters. In previous versions of TLS, the choice of the Diffie-Hellman parameters was up to the participants. This resulted in some implementations choosing incorrectly, resulting in vulnerable implementations being deployed. TLS 1.3 takes this choice away.

Diffie-Hellman is a powerful tool, but not all Diffie-Hellman parameters are “safe” to use. The security of Diffie-Hellman depends on the difficulty of a specific mathematical problem called the discrete logarithm problem. If you can solve the discrete logarithm problem for a set of parameters, you can extract the private key and break the security of the protocol. Generally speaking, the bigger the numbers used, the harder it is to solve the discrete logarithm problem. So if you choose small DH parameters, you’re in trouble.

The LogJam and WeakDH attacks of 2015 showed that many TLS servers could be tricked into using small numbers for Diffie-Hellman, allowing an attacker to break the security of the protocol and decrypt conversations.

Diffie-Hellman also requires the parameters to have certain other mathematical properties. In 2016, Antonio Sanso found an issue in OpenSSL where parameters were chosen that lacked the right mathematical properties, resulting in another vulnerability.

TLS 1.3 takes the opinionated route, restricting the Diffie-Hellman parameters to ones that are known to be secure. However, it still leaves several options; permitting only one option makes it difficult to update TLS in case these parameters are found to be insecure some time in the future.

Fixing ciphers

The other half of a hybrid crypto scheme is the actual encryption of data. This is done by combining an authentication code and a symmetric cipher for which each party knows the key. As I’ll describe, there are many ways to encrypt data, most of which are wrong.

CBC mode ciphers

In the last section we described TLS as a hybrid encryption scheme, with a public key part and a symmetric key part. The public key part is not the only one that has caused trouble over the years. The symmetric key portion has also had its fair share of issues. In any secure communication scheme, you need both encryption (to keep things private) and integrity (to make sure people don’t modify, add, or delete pieces of the conversation). Symmetric key encryption is used to provide both encryption and integrity, but in TLS 1.2 and earlier, these two pieces were combined in the wrong way, leading to security vulnerabilities.

An algorithm that performs symmetric encryption and decryption is called a symmetric cipher. Symmetric ciphers usually come in two main forms: block ciphers and stream ciphers.

A stream cipher takes a fixed-size key and uses it to create a stream of pseudo-random data of arbitrary length, called a key stream. To encrypt with a stream cipher, you take your message and combine it with the key stream by XORing each bit of the key stream with the corresponding bit of your message. To decrypt, you take the encrypted message and XOR it with the key stream. Examples of pure stream ciphers are RC4 and ChaCha20. Stream ciphers are popular because they’re simple to implement and fast in software.
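
To make the XOR step concrete, here is a toy sketch in Python. The key stream generator is an assumption made purely for illustration (SHA-256 over the key plus a counter); it is neither RC4 nor ChaCha20 and should not be used to protect anything.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy key stream: SHA-256 of the key plus a counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_with_keystream(key: bytes, data: bytes) -> bytes:
    """Encrypts plaintext or decrypts ciphertext: XOR is its own inverse."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


key = b"hypothetical key"
ciphertext = xor_with_keystream(key, b"attack at dawn")
print(xor_with_keystream(key, ciphertext))   # b'attack at dawn'
```

Note that encryption and decryption are the same function, and that nothing here detects tampering: flipping a ciphertext bit flips the matching plaintext bit.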

A block cipher is different than a stream cipher because it only encrypts fixed-sized messages. If you want to encrypt a message that is shorter or longer than the block size, you have to do a bit of work. For shorter messages, you have to add some extra data to the end of the message. For longer messages, you can split your message up into blocks the cipher can encrypt and then use a block cipher mode to combine the pieces together somehow. Alternatively, you can turn your block cipher into a stream cipher by encrypting a sequence of counters with a block cipher and using that as the stream. This is called “counter mode”. One popular way of encrypting arbitrary length data with a block cipher is a mode called cipher block chaining (CBC).

In order to prevent people from tampering with data, encryption is not enough. Data also needs to be integrity-protected. For CBC-mode ciphers, this is done using something called a message-authentication code (MAC), which is like a fancy checksum with a key. Cryptographically strong MACs have the property that finding a MAC value that matches an input is practically impossible unless you know the secret key. There are two ways to combine MACs and CBC-mode ciphers. Either you encrypt first and then MAC the ciphertext, or you MAC the plaintext first and then encrypt the whole thing. In TLS, they chose the latter, MAC-then-Encrypt, which turned out to be the wrong choice.
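
As a rough sketch of the first ordering (encrypt, then MAC the ciphertext), here is how a keyed checksum behaves using Python's standard hmac module. The key and the stand-in ciphertext bytes are hypothetical, and HMAC-SHA256 is just one example of a MAC.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    # compare_digest takes the same time whether or not the tags match.
    return hmac.compare_digest(make_tag(key, message), tag)


mac_key = b"hypothetical shared MAC key"
ciphertext = b"opaque bytes produced by some cipher"

tag = make_tag(mac_key, ciphertext)              # the tag covers the ciphertext
print(verify(mac_key, ciphertext, tag))          # True
print(verify(mac_key, ciphertext + b"!", tag))   # False: data was modified
print(verify(b"wrong key", ciphertext, tag))     # False: key is not shared
```

With this ordering the receiver can reject a modified message before decrypting anything, so it never has to process attacker-controlled padding.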

You can blame this choice for BEAST, as well as a slew of padding oracle vulnerabilities such as Lucky 13 and Lucky Microseconds. Read my previous post on the subject for a comprehensive explanation of these flaws. The interaction between CBC mode and padding was also the cause of the widely publicized POODLE vulnerability in SSLv3 and some implementations of TLS.

RC4 is a classic stream cipher designed by Ron Rivest (the “R” of RSA) that was broadly supported since the early days of TLS. In 2013, it was found to have measurable biases that could be leveraged to allow attackers to decrypt messages.

AEAD Mode

In TLS 1.3, all the troublesome ciphers and cipher modes have been removed. You can no longer use CBC-mode ciphers or insecure stream ciphers such as RC4. The only type of symmetric crypto allowed in TLS 1.3 is a new construction called AEAD (authenticated encryption with associated data), which combines encryption and integrity into one seamless operation.

Fixing digital signatures

Another important part of TLS is authentication. In every connection, the server authenticates itself to the client using a digital certificate, which has a public key. In RSA-encryption mode, the server proves its ownership of the private key by decrypting the pre-main secret and computing a MAC over the transcript of the conversation. In Diffie-Hellman mode, the server proves ownership of the private key using a digital signature. If you’ve been following this blog post so far, it should be easy to guess that this was done incorrectly too.

PKCS#1v1.5

Daniel Bleichenbacher has made a living identifying problems with RSA in TLS. In 2006, he devised a pen-and-paper attack against RSA signatures as used in TLS. It was later discovered that major TLS implementations including those of NSS and OpenSSL were vulnerable to this attack. This issue again had to do with how difficult it is to implement padding correctly, in this case, the PKCS#1 v1.5 padding used in RSA signatures. In TLS 1.3, PKCS#1 v1.5 is removed in favor of the newer design RSA-PSS.

Signing the entire transcript

We described earlier how the server uses a digital signature to prove that the key exchange hasn’t been tampered with. In TLS 1.2 and earlier, the server’s signature only covers part of the handshake. The other parts of the handshake, specifically the parts that are used to negotiate which symmetric cipher to use, are not signed by the private key. Instead, a symmetric MAC is used to ensure that the handshake was not tampered with. This oversight resulted in a number of high-profile vulnerabilities (FREAK, LogJam, etc.). In TLS 1.3 these are prevented because the server signs the entire handshake transcript.

The FREAK, LogJam and CurveSwap attacks took advantage of two things:

  1. the fact that intentionally weak ciphers from the 1990s (called export ciphers) were still supported in many browsers and servers, and
  2. the fact that the part of the handshake used to negotiate which cipher was used was not digitally signed.

The on-path attacker can swap out the supported ciphers (or supported groups, or supported curves) from the client with an easily crackable choice that the server supports. They then break the key and forge two finished messages to make both parties think they’ve agreed on a transcript.

These attacks are called downgrade attacks, and they allow attackers to force two participants to use the weakest cipher supported by both parties, even if more secure ciphers are supported. In this style of attack, the perpetrator sits in the middle of the handshake and changes the list of supported ciphers advertised from the client to the server to only include weak export ciphers. The server then chooses one of the weak ciphers, and the attacker figures out the key with a brute-force attack, allowing the attacker to forge the MACs on the handshake. In TLS 1.3, this type of downgrade attack is impossible because the server now signs the entire handshake, including the cipher negotiation.
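
The idea can be sketched with a hash over everything each side saw. The message contents below are hypothetical and heavily simplified, the length prefix is my own framing choice, and the signature itself is left out because Python's standard library has no signing primitive. The point is only that any change to the negotiation changes the value that gets signed.

```python
import hashlib


def transcript_hash(messages: list[bytes]) -> bytes:
    """Hash every handshake message, in order."""
    h = hashlib.sha256()
    for message in messages:
        h.update(len(message).to_bytes(4, "big"))   # keeps message boundaries unambiguous
        h.update(message)
    return h.digest()


# What the client actually sent and received.
client_view = [b"ClientHello ciphers=AES-GCM,CHACHA20",
               b"ServerHello cipher=AES-GCM"]
# What the server saw after an on-path attacker rewrote the cipher list.
server_view = [b"ClientHello ciphers=EXPORT-RC4-40",
               b"ServerHello cipher=EXPORT-RC4-40"]

print(transcript_hash(client_view) == transcript_hash(server_view))   # False
```

The server signs the hash of its view and the client checks that signature against its own view. Because the two differ, verification fails, and the attacker cannot repair this without the server's private key.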

Better living through simplification

TLS 1.3 is a much more elegant and secure protocol with the removal of the insecure features listed above. This hedge-trimming allowed the protocol to be simplified in ways that make it easier to understand, and faster.

No more take-out menu

In previous versions of TLS, the main negotiation mechanism was the ciphersuite. A ciphersuite encompassed almost everything that could be negotiated about a connection:

  • type of certificates supported
  • hash function used for deriving keys (e.g., SHA1, SHA256, …)
  • MAC function (e.g., HMAC with SHA1, SHA256, …)
  • key exchange algorithm (e.g., RSA, ECDHE, …)
  • cipher (e.g., AES, RC4, …)
  • cipher mode, if applicable (e.g., CBC)

Ciphersuites in previous versions of TLS had grown into monstrously large alphabet soups. Examples of commonly used cipher suites are: DHE-RC4-MD5 or ECDHE-ECDSA-AES-GCM-SHA256. Each ciphersuite was represented by a code point in a table maintained by an organization called the Internet Assigned Numbers Authority (IANA). Every time a new cipher was introduced, a new set of combinations needed to be added to the list. This resulted in a combinatorial explosion of code points representing every valid choice of these parameters. It had become a bit of a mess.

[Images: a take-out menu, captioned TLS 1.2, and a prix fixe menu, captioned TLS 1.3]

TLS 1.3 removes many of these legacy features, allowing for a clean split between three orthogonal negotiations:

  • Cipher + HKDF Hash
  • Key Exchange
  • Signature Algorithm
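
A back-of-the-envelope sketch shows why the split matters. The option lists below are hypothetical and far shorter than the real registries, and not every combination was ever assigned a code point, so read the first number as an upper bound on the bundled style.

```python
from itertools import product

# Hypothetical option lists for illustration.
key_exchanges = ["RSA", "DHE", "ECDHE"]
signatures = ["RSA", "ECDSA"]
ciphers = ["AES-CBC", "AES-GCM", "RC4"]
hashes = ["MD5", "SHA1", "SHA256"]

# Bundled style: one code point per combination of all four choices.
bundled = list(product(key_exchanges, signatures, ciphers, hashes))
print(len(bundled))   # 54

# Orthogonal style: each list is advertised once and negotiated separately.
independent = len(key_exchanges) + len(signatures) + len(ciphers) + len(hashes)
print(independent)    # 11
```

Adding one new cipher to the bundled scheme adds 18 potential entries here; in the orthogonal scheme it adds one.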

This simplified cipher suite negotiation and radically reduced set of negotiation parameters opens up a new possibility. This possibility enables the TLS 1.3 handshake latency to drop from two round-trips to only one round-trip, providing the performance boost that will ensure that TLS 1.3 will be popular and widely adopted.

Performance

When establishing a new connection to a server that you haven’t seen before, it takes two round-trips before data can be sent on the connection. This is not particularly noticeable in locations where the server and client are geographically close to each other, but it can make a big difference on mobile networks where latency can be as high as 200ms, an amount that is noticeable for humans.

1-RTT mode

TLS 1.3 now has a radically simpler cipher negotiation model and a reduced set of key agreement options (no RSA, no user-defined DH parameters). This means that every connection will use a DH-based key agreement and the parameters supported by the server are likely easy to guess (ECDHE with X25519 or P-256). Because of this limited set of choices, the client can simply choose to send DH key shares in the first message instead of waiting until the server has confirmed which key shares it is willing to support. That way, the server can learn the shared secret and send encrypted data one round trip earlier. Chrome’s implementation of TLS 1.3, for example, sends an X25519 keyshare in the first message to the server.

In the rare situation that the server does not support one of the key shares sent by the client, the server can send a new message, the HelloRetryRequest, to let the client know which groups it supports. Because the list has been trimmed down so much, this is not expected to be a common occurrence.

0-RTT resumption

A further optimization was inspired by the QUIC protocol. It lets clients send encrypted data in their first message to the server, resulting in no additional latency cost compared to unencrypted HTTP. This is a big deal, and once TLS 1.3 is widely deployed, the encrypted web is sure to feel much snappier than before.
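
To put numbers on it, here is a small sketch that multiplies the round-trip time by the number of handshake round trips quoted in this post. The 200 ms figure is the mobile-network example from above; the function name is made up, and the model ignores TCP setup and server processing time.

```python
def handshake_delay_ms(rtt_ms: float, handshake_round_trips: int) -> float:
    """Delay the handshake adds before application data can flow."""
    return rtt_ms * handshake_round_trips


RTT_MS = 200

for label, round_trips in [("TLS 1.2 full handshake", 2),
                           ("TLS 1.2 resumption", 1),
                           ("TLS 1.3 full handshake", 1),
                           ("TLS 1.3 0-RTT resumption", 0)]:
    print(f"{label}: {handshake_delay_ms(RTT_MS, round_trips)} ms")
```

On that network a full TLS 1.2 handshake costs 400 ms before the first request leaves, TLS 1.3 halves it, and 0-RTT removes it.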

In TLS 1.2, there are two ways to resume a connection, session ids and session tickets. In TLS 1.3 these are combined to form a new mode called PSK (pre-shared key) resumption. The idea is that after a session is established, the client and server can derive a shared secret called the “resumption main secret”. This can either be stored on the server with an id (session id style) or encrypted by a key known only to the server (session ticket style). This session ticket is sent to the client and redeemed when resuming a connection.

For resumed connections, both parties share a resumption main secret, so key exchange is not necessary except for providing forward secrecy. The next time the client connects to the server, it can take the secret from the previous session and use it to encrypt application data to send to the server, along with the session ticket. Something as amazing as sending encrypted data on the first flight does come with its downsides.

Replayability

There is no interactivity in 0-RTT data. It’s sent by the client and consumed by the server without any interaction. This is great for performance, but comes at a cost: replayability. If an attacker captures a 0-RTT packet that was sent to a server, they can replay it and there’s a chance that the server will accept it as valid. This can have interesting negative consequences.

Figure: 0-RTT replay attack

An example of dangerous replayed data is anything that changes state on the server. If you increment a counter, perform a database transaction, or do anything that has a permanent effect, it’s risky to put it in 0-RTT data.

As a client, you can try to protect against this by only putting “safe” requests into the 0-RTT data. In this context, “safe” means that the request won’t change server state. In HTTP, different methods are supposed to have different semantics. HTTP GET requests are supposed to be safe, so a browser can usually protect HTTPS servers against replay attacks by only sending GET requests in 0-RTT. Since most page loads start with a GET of “/” this results in faster page load time.

Problems start to happen when data sent in 0-RTT are used for state-changing requests. To help protect against this failure case, TLS 1.3 also includes the time elapsed value in the session ticket. If this diverges too much, the client is either approaching the speed of light, or the value has been replayed. In either case, it’s prudent for the server to reject the 0-RTT data.
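The elapsed-time check can be sketched as a simple comparison on the server. This is only an illustration: the class name, the parameter names, and the idea of a fixed tolerance in milliseconds are assumptions, not the wire format or the logic of any particular TLS library.

```java
final class EarlyDataFreshness {
    /**
     * Returns true when the ticket age reported by the client is close enough
     * to the age the server computes from its own clock.
     */
    static boolean acceptEarlyData(long ticketIssuedAtMillis, long nowMillis,
                                   long clientReportedAgeMillis, long toleranceMillis) {
        // How old the ticket is according to the server.
        long serverViewAgeMillis = nowMillis - ticketIssuedAtMillis;
        // A replayed message carries a stale age, so the two views drift apart.
        long skew = Math.abs(serverViewAgeMillis - clientReportedAgeMillis);
        return skew <= toleranceMillis;
    }
}
```

A fresh message reports an age that roughly matches the server's view, while a message replayed a minute later still carries the old age and falls outside the tolerance.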

For more details about 0-RTT, and the improvements to session resumption in TLS 1.3, check out this previous blog post.

Deployability

TLS 1.3 was a radical departure from TLS 1.2 and earlier, but in order to be deployed widely, it has to be backwards compatible with existing software. One of the reasons TLS 1.3 has taken so long to go from draft to final publication was the fact that some existing software (namely middleboxes) wasn’t playing nicely with the new changes. Even minor changes to the TLS 1.3 protocol that were visible on the wire (such as eliminating the redundant ChangeCipherSpec message, bumping the version from 0x0303 to 0x0304) ended up causing connection issues for some people.

Despite the fact that future flexibility was built into the TLS spec, some implementations made incorrect assumptions about how to handle future TLS versions. The phenomenon responsible for this change is called ossification and I explore it more fully in the context of TLS in my previous post about why TLS 1.3 isn’t deployed yet. To accommodate these changes, TLS 1.3 was modified to look a lot like TLS 1.2 session resumption (at least on the wire). This resulted in a much more functional, but less aesthetically pleasing protocol. This is the price you pay for upgrading one of the most widely deployed protocols online.

Conclusions

TLS 1.3 is a modern security protocol built with modern tools like formal analysis that retains its backwards compatibility. It has been tested widely and iterated upon using real world deployment data. It’s a cleaner, faster, and more secure protocol ready to become the de facto two-party encryption protocol online. Draft 28 of TLS 1.3 is enabled by default for all Cloudflare customers, and we will be rolling out the final version soon.

Publishing TLS 1.3 is a huge accomplishment. It is one of the best recent examples of how it is possible to take 20 years of deployed legacy code and change it on the fly, resulting in a better internet for everyone. TLS 1.3 has been debated and analyzed for the last three years and it’s now ready for prime time. Welcome, RFC 8446.

from:https://blog.cloudflare.com/rfc-8446-aka-tls-1-3/

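Implementing Web Log Handling and Distributed Locks with Spring Boot AOP

AOP

AOP stands for Aspect Oriented Programming. In practice, AOP is a technique for maintaining cross-cutting program behavior in one place, implemented through compile-time weaving or runtime dynamic proxies. Different technology stacks implement AOP differently, but the purpose is much the same: we define a pointcut in an existing program and insert extra logic before and after it, so that concerns such as log handling or distributed locking are handled uniformly without modifying the existing business logic.

Why use AOP

In real-world development, an application is split into several layers. A typical Java web application has the following:

  • Web layer: mainly exposes RESTful APIs for the front end to call.
  • Business layer: mainly handles the concrete business logic.
  • Persistence layer: mainly handles database operations (create, read, update, delete).

Each layer appears to do something entirely different, yet they all end up containing similar code, such as logging and security checks. If every layer writes this code independently, the code base gradually becomes hard to maintain. AOP offers an alternative: the shared code is gathered and maintained in one place, and we can flexibly choose where to apply it.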

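Core concepts of AOP

  • Aspect: usually a class in which pointcuts and advice are defined.
  • Join point: a point that gets intercepted. Spring supports only method join points, so in Spring a join point is always an intercepted method; in AOP in general, a join point can also be a field or a constructor.
  • Pointcut: the definition of which join points to intercept.
  • Advice: the code executed once a join point is intercepted. There are five kinds: before, after returning, after throwing, after (finally), and around.
  • AOP proxy: the object created by the AOP framework; the proxy is an enhanced version of the target object. In Spring, an AOP proxy can be a JDK dynamic proxy or a CGLIB proxy; the former is interface-based, the latter subclass-based.

Spring AOP

AOP proxies in Spring depend on the Spring IoC container, which creates and manages the proxies and their dependencies. Spring uses JDK dynamic proxies by default and automatically switches to CGLIB when it needs to proxy a class rather than an interface. Most projects today program to interfaces, so JDK dynamic proxies are still the more common choice. In this article, we combine annotations with AOP to implement web log handling and a distributed lock.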
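To make the idea of an interface-based JDK dynamic proxy concrete, here is a minimal sketch using only java.lang.reflect.Proxy. The GreetingService interface and the event list are hypothetical; Spring's own proxy creation is considerably more involved, so treat this as an illustration of the mechanism only.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class JdkProxySketch {
    /** Hypothetical business interface; JDK proxies can only implement interfaces. */
    interface GreetingService {
        String greet(String name);
    }

    /** Events recorded by the advice, in order, so the effect is easy to inspect. */
    static final List<String> events = new ArrayList<>();

    /** Wraps the target so that every call is preceded and followed by extra logic. */
    static GreetingService withLogging(GreetingService target) {
        InvocationHandler handler = (proxy, method, args) -> {
            events.add("before " + method.getName());      // plays the role of before advice
            try {
                return method.invoke(target, args);         // the join point itself
            } finally {
                events.add("after " + method.getName());   // plays the role of after advice
            }
        };
        return (GreetingService) Proxy.newProxyInstance(
                GreetingService.class.getClassLoader(),
                new Class<?>[] {GreetingService.class},
                handler);
    }

    /** Proxies a trivial implementation and calls it once. */
    static String demo() {
        GreetingService proxied = withLogging(name -> "hello " + name);
        return proxied.greet("aop");
    }
}
```

The caller only ever sees a GreetingService, which is why the target code does not need to change when the extra behavior is added or removed.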

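Spring AOP annotations

  • @Aspect: marks a Java class as an aspect.
  • @Pointcut: defines a pointcut. It can be a rule expression, such as all methods under a package as in the example below, or it can match an annotation.
  • @Before: runs the advice before the join point.
  • @After: runs the advice after the join point finishes, whether it returned normally or threw.
  • @AfterReturning: runs the advice after the join point returns normally (useful for post-processing the return value).
  • @Around: runs the advice around the join point and decides when the join point itself is executed.
  • @AfterThrowing: runs the advice after the join point throws an exception.

@Before, @After, @AfterReturning, @Around, and @AfterThrowing are all advice annotations.

Aspect ordering

In practice we often apply several aspects to the same endpoint, for example logging, distributed locking, and permission checks. This raises a question of priority: how do we tell Spring the order in which to run them? We use the @Order(i) annotation to set each aspect's priority; the smaller the value of i, the higher the priority. Suppose we have two aspects: WebLogAspect with @Order(100) and DistributeLockAspect with @Order(99), so DistributeLockAspect has the higher priority. The execution order is then: for @Before, the @Order(99) advice runs first, followed by @Order(100); for @After and @AfterReturning, the @Order(100) advice runs first, followed by @Order(99). You can think of it as first in, last out.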
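The first-in, last-out ordering can be simulated in a few lines of plain Java. This sketch does not use Spring at all; FakeAspect is a hypothetical stand-in that carries only a name and an order value, and the trace is what we would expect the nesting rule described above to produce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AspectOrderSketch {
    /** A stand-in for an aspect: just a name and its @Order value. */
    static final class FakeAspect {
        final String name;
        final int order;

        FakeAspect(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    /** Returns the sequence of before/after callbacks around one target call. */
    static List<String> trace(List<FakeAspect> aspects) {
        List<FakeAspect> sorted = new ArrayList<>(aspects);
        sorted.sort(Comparator.comparingInt(a -> a.order)); // lowest value is outermost
        List<String> out = new ArrayList<>();
        for (FakeAspect aspect : sorted) {
            out.add(aspect.name + ".before");               // outermost runs first on the way in
        }
        out.add("target");
        for (int i = sorted.size() - 1; i >= 0; i--) {
            out.add(sorted.get(i).name + ".after");         // outermost runs last on the way out
        }
        return out;
    }

    /** The two aspects from the text, deliberately added in the "wrong" order. */
    static List<String> demo() {
        List<FakeAspect> aspects = new ArrayList<>();
        aspects.add(new FakeAspect("WebLogAspect", 100));
        aspects.add(new FakeAspect("DistributeLockAspect", 99));
        return trace(aspects);
    }
}
```

With these values the lock aspect wraps the log aspect, so the lock is taken before logging starts and released after logging ends.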

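Annotation-based AOP configuration

Annotations reduce the amount of configuration we need, and because they are checked at compile time, mistakes are easier to find. As you probably know, Spring projects today use annotations extensively in place of the old XML configuration. Combining annotations with AOP is also the most common approach, and it is the one this article uses to implement two annotations: unified web log handling and a distributed lock. Let's start with the preparation.

Preparation

Prepare a Spring Boot web project

You can generate an empty Spring Boot project from the Spring Initializr page, or download the springboot-pom.xml file and build a Spring Boot project with Maven. Once the project is created, import it into your preferred IDE to make the coding that follows easier; I use IntelliJ IDEA here.

Add dependencies

We need the web dependency and the AOP dependency. Add the following to pom.xml: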

Listing 1. Adding the web dependency
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

Listing 2. Adding the AOP dependency
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

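Other preparation

To make testing easier I also integrated Swagger documentation into the project; for details, see "Using Swagger Documentation in a Spring Boot Project". I also wrote two endpoints for testing; see the source code for this article. Because the distributed lock in this tutorial is built on Redis, you need to install Redis or have a Redis server available.

Implementing web log handling with AOP

Why handle web logs in one place

During development we often need to log an endpoint's request parameters, its response data, and even how long it took, so that problems can be diagnosed. For some important endpoints this information also has to be written to a database. This code is largely the same from endpoint to endpoint, so to improve reuse we can wrap it up with AOP.

The web log annotation

Listing 3. Web log annotation
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface ControllerWebLog {
    String name();
    boolean intoDb() default false;
}

Here name is the name of the endpoint being called, and intoDb indicates whether the log entry should be persisted. For configuring a database connection in Spring Boot, see the article "Configuring Multiple Data Sources in a Spring Boot Project"; the database schema is linked from the original article. With the annotation in place, we next write the AOP aspect that goes with it.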

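Implementing the WebLogAspect aspect

  1. First we define an aspect class, WebLogAspect, as shown in Listing 4. The @Aspect annotation tells Spring to manage the class as an aspect, and the @Component annotation registers it as a Spring component.
    Listing 4. WebLogAspect
    @Aspect
    @Component
    @Order(100)
    public class WebLogAspect {
    }
  2. Next we define a pointcut.
    Listing 5. Web log AOP pointcut
    @Pointcut("execution(* cn.itweknow.sbaop.controller..*.*(..))")
    public void webLog() {}

    The official documentation describes the execution expression as follows (translated):

    Listing 6. The official description of the execution expression
    execution(<modifiers pattern>? <return type pattern> <method name pattern>(<parameter pattern>) <throws pattern>?)

    Everything except the return type pattern, the method name pattern, and the parameter pattern is optional. This can be hard to follow in the abstract, so let's look at a concrete example. The pointcut defined in WebLogAspect has the execution expression * cn.itweknow.sbaop.controller..*.*(..); the table below breaks it down in plain terms: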

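    Table 1. Breakdown of the execution() expression
    Part                              Meaning
    * (first)                         any return type
    cn.itweknow.sbaop.controller..    the cn.itweknow.sbaop.controller package and all of its subpackages
    * (second)                        any class in those packages
    .*(..)                            any method, with any number and type of parameters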
  3. The method annotated with @Before runs before the pointcut is entered. Here we log that the call has started and store the request parameters and start time in a ThreadLocal, so that when the call finishes we can log the parameters and compute how long the endpoint took.
    Listing 7. @Before code
    // Assumes the aspect declares START_TIME, REQUEST_PARAMS, threadLocal, and logger as fields.
    @Before(value = "webLog() && @annotation(controllerWebLog)")
    public void doBefore(JoinPoint joinPoint, ControllerWebLog controllerWebLog) {
        // Start time.
        long startTime = System.currentTimeMillis();
        Map<String, Object> threadInfo = new HashMap<>();
        threadInfo.put(START_TIME, startTime);
        // Request parameters.
        StringBuilder requestStr = new StringBuilder();
        Object[] args = joinPoint.getArgs();
        if (args != null && args.length > 0) {
            for (Object arg : args) {
                // append(Object) is null-safe, unlike arg.toString().
                requestStr.append(arg);
            }
        }
        threadInfo.put(REQUEST_PARAMS, requestStr.toString());
        threadLocal.set(threadInfo);
        logger.info("{} call started: requestData={}", controllerWebLog.name(), threadInfo.get(REQUEST_PARAMS));
    }
  4. @AfterReturning runs when the method completes normally and returns a value. Here we log that the call has finished; finally, and importantly, we must clear the contents of the ThreadLocal.
    Listing 8. @AfterReturning code
    @AfterReturning(value = "webLog() && @annotation(controllerWebLog)", returning = "res")
    public void doAfterReturning(ControllerWebLog controllerWebLog, Object res) {
        Map<String, Object> threadInfo = threadLocal.get();
        long takeTime = System.currentTimeMillis() - (long) threadInfo.getOrDefault(START_TIME, System.currentTimeMillis());
        if (controllerWebLog.intoDb()) {
            // Persist the call record when the annotation asks for it.
            insertResult(controllerWebLog.name(), (String) threadInfo.getOrDefault(REQUEST_PARAMS, ""),
                    JSON.toJSONString(res), takeTime);
        }
        // Always clear the ThreadLocal so pooled threads do not carry stale data.
        threadLocal.remove();
        logger.info("{} call finished: elapsed={}ms, result={}", controllerWebLog.name(),
                takeTime, res);
    }
  5. When the method throws an exception, we also need to log the error:
    Listing 9. @AfterThrowing code
    @AfterThrowing(value = "webLog() && @annotation(controllerWebLog)", throwing = "throwable")
    public void doAfterThrowing(ControllerWebLog controllerWebLog, Throwable throwable) {
        Map<String, Object> threadInfo = threadLocal.get();
        if (controllerWebLog.intoDb()) {
            insertError(controllerWebLog.name(), (String) threadInfo.getOrDefault(REQUEST_PARAMS, ""),
                    throwable);
        }
        threadLocal.remove();
        // Passing the throwable as the last argument makes the logger print its stack trace.
        logger.error("{} call failed", controllerWebLog.name(), throwable);
    }
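  6. The aspect is now complete. Next we apply the ControllerWebLog annotation to our test endpoint. The endpoint body is omitted; see the source code for this article if you need it.
    Listing 10. Test endpoint code
    @PostMapping("/post-test")
    @ApiOperation("API log POST request test")
    @ControllerWebLog(name = "API log POST request test", intoDb = true)
    public BaseResponse postTest(@RequestBody BaseRequest baseRequest) {
        // Body omitted in the article.
    }
  7. Finally, start the project and open the Swagger documentation to test. After calling the endpoint, the console shows logs like those in Figure 1.
    Figure 1. Web log test result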
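The ThreadLocal pattern used across the @Before, @AfterReturning, and @AfterThrowing advice above can be reduced to a small stand-alone sketch. The class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic; the point is only that the context is always removed once the call ends.

```java
import java.util.HashMap;
import java.util.Map;

final class ThreadLocalTimingSketch {
    private static final String START_TIME = "startTime";
    private static final ThreadLocal<Map<String, Object>> CONTEXT = new ThreadLocal<>();

    /** Plays the role of @Before: remember when the call started. */
    static void before(long nowMillis) {
        Map<String, Object> info = new HashMap<>();
        info.put(START_TIME, nowMillis);
        CONTEXT.set(info);
    }

    /** Plays the role of the "after" advice: compute elapsed time, then clean up. */
    static long after(long nowMillis) {
        try {
            Map<String, Object> info = CONTEXT.get();
            if (info == null) {
                return 0L; // before() never ran on this thread
            }
            return nowMillis - (Long) info.get(START_TIME);
        } finally {
            // Runs on every path so a pooled thread never sees stale data.
            CONTEXT.remove();
        }
    }

    /** Reports whether this thread still holds per-call context. */
    static boolean hasContext() {
        return CONTEXT.get() != null;
    }
}
```

Putting the cleanup in a finally block is a design choice of this sketch: it guards against the null case and against exceptions thrown while computing or logging the result.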

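Implementing a distributed lock with AOP

Why use a distributed lock

Most programs have some shared resources or data, and at times we must guarantee that only one thread accesses or modifies them at any given moment. In a traditional single-machine deployment, Java's built-in concurrency APIs are enough. But most services today are deployed in a distributed fashion, so we need a cross-process mutual exclusion mechanism to control access to shared resources. That mechanism is what we call a distributed lock.

Points to note

A reliable distributed lock should satisfy the following:

  1. Mutual exclusion. At any moment, only one client can hold the lock.
  2. No deadlock. Even if a client crashes while holding the lock without releasing it, other clients must still be able to acquire the lock afterwards. Giving the lock a timeout is enough to achieve this.
  3. Fault tolerance. As long as the majority of Redis nodes are running, clients can acquire and release locks.
  4. Only the owner may unlock. The lock must be released by the same client that acquired it; a client must not be able to release a lock held by someone else.

Distributed lock annotation

Listing 11. Distributed lock annotation
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface DistributeLock {
    // Lock key; may contain placeholders that are resolved from method arguments.
    String key();
    // Lock timeout.
    long timeout() default 5;
    // Unit of the timeout.
    TimeUnit timeUnit() default TimeUnit.SECONDS;
}

Here key is the key of the distributed lock, timeout is the lock's timeout (5 by default), and timeUnit is the unit of the timeout (seconds by default).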

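Annotation parameter resolver

Annotation attributes must be constants, so we cannot use method arguments in them directly. Yet in the vast majority of cases the lock key needs to include one or more of the method's arguments. We therefore mark where those arguments go with a special string, and a parameter resolver dynamically resolves their actual values and splices them into the key. For this tutorial I wrote such a resolver, AnnotationResolver. For reasons of space its source is not pasted here; interested readers can find it in the source code.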
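Since the AnnotationResolver source is not shown, the following is only a guess at how such a resolver might work, not the author's implementation. It assumes placeholders of the form #{argName.property} (the shape used in the test listing later in the article), takes arguments as a name-to-object map rather than a JoinPoint, and reads properties through public getters; KeyTemplateSketch and its nested BaseRequest are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyTemplateSketch {
    // Matches placeholders of the form #{argName.property}.
    private static final Pattern PLACEHOLDER = Pattern.compile("#\\{(\\w+)\\.(\\w+)\\}");

    /** Hypothetical request type with a single property used in the lock key. */
    public static final class BaseRequest {
        private final String channel;

        public BaseRequest(String channel) {
            this.channel = channel;
        }

        public String getChannel() {
            return channel;
        }
    }

    /** Replaces every #{arg.prop} in the template with the value of arg.getProp(). */
    static String resolve(String template, Map<String, Object> namedArgs) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            Object arg = namedArgs.get(matcher.group(1));
            String value = String.valueOf(readProperty(arg, matcher.group(2)));
            // quoteReplacement keeps '$' and '\' in the value from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Reads a property through its public getter, e.g. "channel" via getChannel(). */
    private static Object readProperty(Object target, String property) {
        if (target == null) {
            return null;
        }
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            Method method = target.getClass().getMethod(getter);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot read property " + property, e);
        }
    }
}
```

Resolving the key per call is what lets two requests with different channel values take different locks while requests with the same channel contend for one.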

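Acquiring the lock

Listing 12. Acquiring the lock
private String getLock(String key, long timeout, TimeUnit timeUnit) {
    try {
        // Random token that identifies this lock holder.
        String value = UUID.randomUUID().toString();
        Boolean lockStat = stringRedisTemplate.execute((RedisCallback<Boolean>) connection ->
                connection.set(key.getBytes(StandardCharsets.UTF_8), value.getBytes(StandardCharsets.UTF_8),
                        Expiration.from(timeout, timeUnit), RedisStringCommands.SetOption.SET_IF_ABSENT));
        // lockStat may be null, so avoid unboxing it directly.
        if (!Boolean.TRUE.equals(lockStat)) {
            // Failed to acquire the lock.
            return null;
        }
        return value;
    } catch (Exception e) {
        logger.error("Failed to acquire distributed lock, key={}", key, e);
        return null;
    }
}

RedisStringCommands.SetOption.SET_IF_ABSENT gives the SET call the same semantics as the SETNX command: if the key already exists, nothing happens and the call fails; if the key does not exist, it is stored and the call succeeds. SETNX returns 1 on success and 0 on failure. We generate a random value and return it when the lock is acquired so that it can be compared when the lock is released; this ensures that only whoever created the lock can release it. We also set a timeout, so that a failed release cannot leave the endpoint permanently inaccessible.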

Releasing the lock

Listing 13. Releasing the lock
private void unLock(String key, String value) {
    try {
        // Delete the key only if it still holds our token; Redis runs the script atomically.
        String script = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
        Boolean unLockStat = stringRedisTemplate.execute((RedisCallback<Boolean>) connection ->
                connection.eval(script.getBytes(StandardCharsets.UTF_8), ReturnType.BOOLEAN, 1,
                        key.getBytes(StandardCharsets.UTF_8), value.getBytes(StandardCharsets.UTF_8)));
        if (!Boolean.TRUE.equals(unLockStat)) {
            logger.error("Failed to release distributed lock, key={}; it has already timed out and another thread may have re-acquired it", key);
        }
    } catch (Exception e) {
        logger.error("Failed to release distributed lock, key={}", key, e);
    }
}
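The contract shared by the acquire and release methods (set only if absent, delete only if the token matches) can be tried out without a Redis server. The sketch below is an in-memory stand-in built on ConcurrentHashMap; it is not a distributed lock and it omits expiry, and the class name is hypothetical.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class InMemoryLockSketch {
    // Stand-in for Redis: key -> token of the current owner. Expiry is omitted here.
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

    /** Like SET key token NX: returns the owner token on success, null if already locked. */
    String tryLock(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    /** Like the Lua script: delete only if the stored token still matches ours. */
    boolean unlock(String key, String token) {
        return token != null && store.remove(key, token);
    }
}
```

The two-argument remove performs the compare and the delete as one atomic step, which is the same property the Lua script provides on the Redis side.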

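The aspect

The pointcut is defined the same way as the one for web log handling, so it is not repeated here. The advice type we use in this aspect is @Around: before the join point we try to acquire the lock; if that fails we return an error immediately; if it succeeds we execute the method body, and when the method ends (whether normally or with an exception) we release the lock.

Listing 14. Around advice
@Around(value = "distribute() && @annotation(distributeLock)")
public Object doAround(ProceedingJoinPoint joinPoint, DistributeLock distributeLock) throws Exception {
    String key = annotationResolver.resolver(joinPoint, distributeLock.key());
    String keyValue = getLock(key, distributeLock.timeout(), distributeLock.timeUnit());
    if (StringUtil.isNullOrEmpty(keyValue)) {
        // Failed to acquire the lock.
        return BaseResponse.addError(ErrorCodeEnum.OPERATE_FAILED, "Please do not repeat the operation so frequently");
    }
    // Lock acquired.
    try {
        return joinPoint.proceed();
    } catch (Throwable throwable) {
        // Log before converting to an error response so the cause is not lost.
        logger.error("Method failed while holding distributed lock, key={}", key, throwable);
        return BaseResponse.addError(ErrorCodeEnum.SYSTEM_ERROR, "System error");
    } finally {
        // Release the lock.
        unLock(key, keyValue);
    }
}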

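Testing

Listing 15. Distributed lock test code
@PostMapping("/post-test")
@ApiOperation("API log POST request test")
@ControllerWebLog(name = "API log POST request test", intoDb = true)
@DistributeLock(key = "post_test_#{baseRequest.channel}", timeout = 10)
public BaseResponse postTest(@RequestBody BaseRequest baseRequest) {
    try {
        // Hold the lock long enough for a second request to collide with it.
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        // Restore the interrupt flag instead of swallowing the interruption.
        Thread.currentThread().interrupt();
    }
    return BaseResponse.addResult();
}

In this test we set the lock timeout to 10 seconds and make the current thread sleep for 10 seconds inside the endpoint. This keeps the lock held for those 10 seconds and makes testing easier. After starting the project, call the endpoint twice in quick succession, making sure both requests pass the same channel value (the lock key contains it). The second call returns the following result:

Figure 2. Result of the Redis-based distributed lock test

This shows that our distributed lock is working.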

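Conclusion

In this tutorial we looked at AOP and why it is worth using, and showed how to use AOP in a Spring Boot project to implement unified web log handling and a Redis-based distributed lock. The complete implementation is available on GitHub. If you would like to add to this tutorial, feel free to email me (gancy.programmer@gmail.com) or submit a Pull Request on GitHub.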

from:https://www.ibm.com/developerworks/cn/java/j-spring-boot-aop-web-log-processing-and-distributed-locking/index.html

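Git: Please make sure you have the correct access rights and the repository exists

1. First, reset the identity name and email in git:
git config --global user.name "yourname"
git config --global user.email "your@email.com"
2. Delete known_hosts from the .ssh folder (search for the folder directly).
3. Run ssh-keygen -t rsa -C "your@email.com"
This generates two files, id_rsa and id_rsa.pub, in the .ssh folder. Copy the entire contents of id_rsa.pub.
4. Open https://github.com/, log in to your account, go to [Settings] -> [SSH settings], paste the content copied in step 3 into the key field, and submit.
5. Run ssh -T git@github.com to test the connection.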

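Designing a High-Concurrency E-commerce Architecture Based on the "Big Middle Platform, Small Front End" Model

1. What is the big middle platform (business, data, and technology middle platforms)

The "big middle platform, small front end" organizational model has recently become very popular in the industry. It was first put into practice at Supercell, the well-known Finnish mobile game company. Internally, Supercell organizes a number of development teams as small front ends. Each team contains all the roles needed to build a game, so it can make decisions and develop quickly on its own. The infrastructure that supports these teams (data centers, networking, architecture components, and so on), the game engines, and the internal tools for development, testing, and release are provided by the "tribe" (that is, the middle platform). The "tribe" can be expanded into several squads as needed, meaning the middle platform is split into several departments, but all squads share the same goal. As the middle platform, the "tribe" empowers the front-end development teams; it does not deliver games to consumers directly.

In China, by 2015 Alibaba's business lines had become numerous and complex, with cross-dependencies between them and a large number of business teams, and business requirements could not be met in a timely way. In December 2015, Zhang Yong announced the launch of the middle platform strategy: building a more innovative and flexible "big middle platform, small front end" organizational and business mechanism suited to the DT (data technology) era, and innovating in management. Product and technology capabilities and data operation capabilities were separated from the front end into an independent middle platform, including the Search division, the Shared Business division, and the Data Platform division, which serve the front end, namely the retail e-commerce business groups. The front end thus became leaner and agile enough to better meet the needs of business growth and innovation. In May 2017 the book 《企业IT架构转型之道:阿里巴巴中台战略思想和架构实践》 was published, and many internet companies quickly followed with their own middle platform strategies: Didi built a business middle platform in December 2017, and JD.com announced a front-end, middle platform, and back-end organizational structure in December 2018 [1]. In 2019 the model has been in full swing at company after company.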

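So what is a middle platform? It is both an organizational mechanism and a business mechanism. At the organizational level, the company restructures and physically splits out independent middle platform departments. At the business level, it sinks common capabilities down into services, connects those services well, and continuously empowers the business departments. One analogy is an aircraft carrier (the big middle platform) carrying and empowering its carrier-based aircraft (the small front ends) in combat (see Figure 1). Another is that the middle platform produces all kinds of LEGO bricks, sensors, and actuators (see Figure 2). The front end packages and assembles these bricks into various LEGO sets, adds different documentation and packaging plus a small number of custom bricks (for example bricks for a specific IP, such as Star Wars themed bricks), and quickly turns out different products for different users. Conversely, developing 10,000 SKUs of LEGO sets builds up a powerful, almost all-capable LEGO brick middle platform. The more front-end products there are, the stronger the middle platform becomes; the stronger the middle platform, the easier front-end products are to develop, which makes the business highly competitive.

Figure 1 Aircraft carrier and carrier-based aircraft

Figure 2 LEGO bricks and products

To execute the model well, a company first needs to adjust its organizational structure. Alibaba's structure (see Figure 3), for example, consists of middle platform business groups and small front-end business groups. The middle platform groups include the Search division, the Shared Business division (users, items, transactions, and so on), the Data Technology and Products department (OLAP), and the Infrastructure division. The small front-end groups include the e-commerce business group, Ant Financial, the Alibaba Cloud business group, Cainiao Network, the Digital Media and Entertainment group, Alimama, and others.

Figure 3 Alibaba's big middle platform, small front end organizational structure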

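A company's deliverable is its product. To deliver products well, it needs to get three layers right: business architecture, data architecture, and technical architecture. The business architecture (OLTP) comprises business-specific architecture (small front end) and common business architecture (middle platform); the data architecture (OLAP) comprises business-specific data architecture (small front end) and common data architecture (middle platform); the technical architecture is the technical support layer (middle platform). Across these three layers we can abstract further and separate the business-specific parts from the common parts: the business-specific parts form the small front end and the common parts form the middle platform. A company's middle platform therefore divides into a business middle platform, a data middle platform, and a technology middle platform.
Suppose the company's business architecture follows the currently mainstream microservice pattern (see Figure 4). The big middle platform then includes the gateway layer, the common business logic layer, the data access layer, DB, cache, the configuration center, and the registry; the small front end includes the business logic layer and the app clients.

Figure 4 Business architecture

Suppose the company's data architecture follows the currently mainstream Hadoop ecosystem pattern (see Figure 5). The big middle platform then includes the PaaS layer (data transfer, data computation, data storage) and the DaaS layer (data sources, data warehouse, data marts/data models); the small front end includes DA (Data Application): retention applications, profiling applications, business reporting applications, and data intelligence applications.

Figure 5 Data architecture

Suppose the company's technical architecture uses the currently mainstream technology stack (see Figure 6). The big middle platform then includes base platforms (messaging platform, distributed lock platform, APM, multi-dimensional monitoring platform, task scheduling platform, and so on), base components (web framework, RPC framework, distributed transactions, database middleware, and so on), service mesh, the storage stack (RDBMS, NoSQL, NewSQL), elastic container cloud, and so on.

Figure 6 Technical architecture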

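2. What is the small front end

In terms of organizational structure, a company's business-specific departments are the small front end; in terms of business services, its business-specific services are the small front end.

3. Where the big middle platform, small front end model applies

The model is particularly well suited to replicating existing businesses and to new businesses that need a lot of experimentation. Suppose we divide a company's life cycle into a 0-1 stage (startup), a 1-10 stage (high growth), and a 10-100 stage (stable development). The model fits the 10-100 stage best, can start to be tried in the 1-10 stage, and is not suitable for a 0-1 startup.
The big middle platform abstracts and encapsulates common capabilities and knowledge and offers them to the small front ends that need them (as internal products, services, enablement, and so on). This makes the front end more flexible, lowers the cost of innovation, and supports faster, lighter trial and error.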

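4. Designing and implementing a big middle platform, small front end e-commerce architecture

Implementing this business architecture model in e-commerce requires work at two levels, grounded in the business domain: first, sinking common capabilities into services at the company business level; second, connecting those services well and continuously empowering the business departments.
In e-commerce, sinking common capabilities into services means abstracting capabilities such as users, items, transactions, payments, marketing, search, recommendation, and risk control and turning each into an independent service. In the business architecture shown in Figure 7, the gateway layer, common business logic layer, data access layer, DB, cache, registry, and configuration center are e-commerce common capabilities and form the middle platform services. The app clients, mini-program clients, business-specific logic layer, and other business-specific services belong to the small front end.

Figure 7 E-commerce business architecture

The second step in building the model is to fully connect the sunk-down common services so that a small front-end business can plug in with one click. How is this full connection achieved? First, the company needs to define a standard for identifying business lines, for example a three-level hierarchy as shown in Table 1:

Table 1 Three-level business line identification hierarchy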

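Once the company has unified the three-level hierarchy, it needs to provide a unified business registry through which businesses register and query all three levels of the hierarchy. Next, the company needs a unified business line distribution configuration service. Its job is to configure in one place which middle platform services each small front-end business connects to (for example, the mobile phone front-end business connects to the item, search, customer service, and transaction middle platforms), and the specific access policy for distributing that business's data to each middle platform service (for example, when the mobile phone business connects to the search middle platform, which of its fields need to be indexed). See Table 2 for details:

Table 2 Business line distribution configuration policies

With a unified business registry and distribution configuration service in place, the company needs to go one step further and build a distribution connection center. This center distributes two things: the policy flow and the data flow. The first is the policy flow: distributing each business line's configuration policies to the middle platform services. For example, the access policy for the item data type of business line ID 1 in Table 2 is distributed to the services configured for it in Table 2 (the item, search, recommendation, customer service, and data middle platform services), and the access policy for the order data type is distributed to its configured services (the search and customer service middle platform services, among others). When a middle platform service receives these access policies from the distribution connection center, it parses them and processes subsequent data flows accordingly, which completes the full connection of policies. The second is the data flow: when a small front-end business produces data, the data is distributed to the corresponding middle platform services. For example, item data produced by the mobile phone front end is distributed by the distribution connection center to the item, search, recommendation, customer service, and data middle platforms, which completes the full connection of data.
Figure 8 shows the company's connection ecosystem. It contains small front-end business 1, the business registry, the distribution configuration service, the business distribution connection center, and the middle platform services, along with the concrete distribution relationships for one business's policy flow (black lines) and data flow (red lines).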
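The split between policy flow and data flow can be sketched in a few lines of Java. Everything here is hypothetical: the class name, the use of a business line ID plus a data type as the routing key, and the service names are illustrative stand-ins for whatever Table 2 actually configures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DistributionCenterSketch {
    // Policy flow: (business line id, data type) -> middle platform services to notify.
    private final Map<String, List<String>> routes = new HashMap<>();
    // Data flow: what each middle platform service has received so far.
    private final Map<String, List<String>> inboxes = new HashMap<>();

    private static String routeKey(int businessLineId, String dataType) {
        return businessLineId + ":" + dataType;
    }

    /** Registers the access policy for one business line and data type. */
    void configure(int businessLineId, String dataType, List<String> services) {
        routes.put(routeKey(businessLineId, dataType), new ArrayList<>(services));
    }

    /** Delivers one record to every configured service; returns how many received it. */
    int dispatch(int businessLineId, String dataType, String payload) {
        List<String> targets = routes.getOrDefault(
                routeKey(businessLineId, dataType), Collections.<String>emptyList());
        for (String service : targets) {
            inboxes.computeIfAbsent(service, s -> new ArrayList<>()).add(payload);
        }
        return targets.size();
    }

    /** Returns what a given service has received so far. */
    List<String> received(String service) {
        return inboxes.getOrDefault(service, Collections.<String>emptyList());
    }
}
```

Keeping the routing table separate from the delivery step mirrors the text: policies are configured once, and every later record simply follows them.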

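Figure 8 Big middle platform, small front end connection ecosystem

With this connection ecosystem in place, how should the data produced by a small front-end business (for example, the item data of the mobile phone business) be stored? Take item data as an example: it includes data common to all items and data specific to the front-end business. There are two storage options. In the first, common item data is stored by the middle platform department and business-specific item data is stored by the small front-end business department. In the second, both common and business-specific item data are stored by the middle platform department, which favors unified storage and management of data and makes business queries and other access very simple. We recommend the second option (as an exercise, consider what problems the first option would cause). For it, design the tables as a common item data table plus a business-specific item extension table. The common item data table contains the fields shared by items across all business lines, as shown in Table 3: item ID, publisher, category ID, price, publish time, stock, item status, and so on.

Table 3 Common item data table

The business-specific item data table (Table 4) stores data in key-value extension columns. The columns are limited to a few fixed types, such as Long, Double, and String, and all business-specific data is represented and stored using these types. The business meaning of each key is specified in a mapping table (Table 5).

Table 4 Business-specific item extension data table

Table 5 Business-specific item field mapping table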
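The relationship between the extension table and the mapping table can be illustrated with a small sketch. The generic column names (long_1, string_1, and so on) and the business field names are assumptions made up for this example, since the contents of Tables 4 and 5 are not reproduced here.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ExtensionColumnSketch {
    // Mapping table: business line id -> (generic column name -> business field name).
    private final Map<Integer, Map<String, String>> fieldMapping = new HashMap<>();

    /** Declares what a generic column means for one business line. */
    void mapColumn(int businessLineId, String column, String fieldName) {
        fieldMapping.computeIfAbsent(businessLineId, id -> new HashMap<>()).put(column, fieldName);
    }

    /** Translates one extension row (generic columns) into named business fields. */
    Map<String, Object> decode(int businessLineId, Map<String, Object> extensionRow) {
        Map<String, String> mapping =
                fieldMapping.getOrDefault(businessLineId, new HashMap<String, String>());
        Map<String, Object> named = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : extensionRow.entrySet()) {
            String fieldName = mapping.get(column.getKey());
            if (fieldName != null) {
                named.put(fieldName, column.getValue()); // unmapped columns are ignored
            }
        }
        return named;
    }
}
```

Because only the mapping differs per business line, a new front-end business can add its own fields without changing the schema of the shared extension table.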

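With the connection ecosystem described above and this way of storing common and business-specific data, the big middle platform, small front end model can be put into practice effectively within a company.
References:

[1] 中台战略-中台建设与数字商业, 机械工业出版社 (China Machine Press)

from: WeChat official account 架构之美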