Blogroll: CloudFlare

I read blogs, as well as write one. The 'blogroll' on this site reproduces some posts from some of the people I enjoy reading. There are currently 36 posts from the blog 'CloudFlare.'

Disclaimer: Reproducing an article here does not necessarily imply agreement or endorsement!

CloudFlare: Helping Build a Better Internet

Three new ways teams are using Cloudflare Access

Wed, 15/08/2018 - 22:00

Since leaving beta three weeks ago, Cloudflare Access has become our fastest-growing subscription service. Every day, more teams are using Access to leave their VPN behind and connect to applications quickly and securely from anywhere in the world.

We’ve heard from a number of teams about how they’re using Access. Each team has unique needs to consider as they move away from a VPN and to a zero trust model. In a zero trust framework, each request has to prove that a given application should trust its attempt to reach a secure tool. In this post, we’re highlighting some of the solutions that groups are using to transition to Cloudflare Access.

Solution 1: Collaborate with External Partners

Cloudflare Access integrates with popular identity providers (IdPs) so that your team can reach internal applications without adding more credentials. However, teams rarely work in isolation. They frequently rely on external partners who also need to reach shared tools.

Granting and managing permissions for external partners poses a security challenge. Just because you are working with a third party doesn’t mean they should have credentials to your IdP. They typically need access to a handful of tools, not all of your internal resources.

We’ve heard from Access customers who are increasingly using the One-Time Pin feature to solve this problem. With One-Time Pin, you can grant access to third-party users without adding them to your IdP. Your internal team will continue to use their IdP credentials to authenticate while external users input their email address and receive a single-use code in their inbox. Here’s how your team can set this up:


In this example, we have Okta configured as our IdP. We have also enabled One-Time Pin as an additional login method.


Now that both login options are available, we can decide who should be able to reach our application. We’ll start by creating a new Access Group. An Access Group defines a set of users. We’ll name the group “Third-Party Partners” and include the email addresses of the individuals who need permission. Once the list is complete, the group can be saved.

Since Access Groups can be reused across policies, adding or removing a user from this list will apply to all policies that use the “Third-Party Partners” group.


Now that we have saved an Access Group, we can return to the administration panel and build a policy based on that group membership. First, we need to make sure our internal team can reach the application. To do so, we’ll create an Allow decision and include emails ending in our @widgetcorp.tech domain. Since that domain is tied to our Okta account, our internal team can continue to use Okta to reach the tool.

Next, we’ll need a second Include rule in the same policy for the external teams. For this rule, select “Access Groups” from the drop-down options. Once selected, we can pick the “Third-Party Partners” group that was saved in the previous step. This will allow any user who is a member of that group to reach the application.
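Conceptually, the finished policy combines both rules under a single Allow decision. Expressed as a data structure (a hypothetical sketch for illustration only, not the actual Access API schema), it looks something like this:

// Hypothetical shape of the policy described above -- not the real
// Access API schema, just the logic it encodes.
const policy = {
  decision: 'allow',
  include: [
    { emailDomain: 'widgetcorp.tech' },      // internal team, authenticates via Okta
    { accessGroup: 'Third-Party Partners' }, // external users, authenticate via One-Time Pin
  ],
};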


Now when users attempt to reach the application, they are presented with two options. The internal team can continue to log in with Okta. Third-party partners can instead select the One-Time Pin option.


When they choose One-Time Pin, they will be prompted to input their email address. Access will send a one-time code to their inbox. If they are an authorized user, as defined by the Access Group list, they can follow a link in that email or input the code to reach the application.

Solution 2: Require a Specific Network

For some applications, you want to ensure that your end users are both part of an approved list and originate from a known connection, like a secure office network. Building a rule with this requirement adds an extra layer of scrutiny to each request. Teams are using Access to enforce more comprehensive requirements like this one by creating policies with multiple rules. You can set this up for a specific application by creating a policy like the one below.


First, create a new Access Group. List the addresses or ranges you want to require. When adding multiple entries, use the Include rule, which means users must originate from one of the addresses in the list. You can give the group a title like "Office Networks" and save it.


Next, create a new policy. First, allow users to authenticate with their IdP credentials by including your team’s email domain or the group name from your IdP. Second, add a rule to require that requests originate from the network(s) you defined in your Access Group.

In this example, users who want to reach the site would first need to authenticate with the IdP you have configured. In addition, Access will check to make sure their request is coming from the IP range you configured in the second rule underneath the “Require” line.

Solution 3: Reach On-Premise Applications with Argo Tunnel

Some applications are too sensitive to expose to the public internet through firewall ports and access control lists (ACLs). At first glance, these tools seem doomed to live on-premise and require a VPN when your team members are away from the office.

Cloudflare Access can still help. When you combine Access with Cloudflare Argo Tunnel, you can avoid the hassle of a VPN while making your on-premise applications available to end users through secure connections to the Internet. Argo Tunnel securely exposes your web servers to the Internet without opening up firewall ports or requiring ACL configuration. Argo Tunnel ensures that requests route through Cloudflare before reaching the web server.

To configure Argo Tunnel, you’ll first need to create a zone in Cloudflare to serve as the hostname for your web server. Argo Tunnel creates a DNS entry for that hostname so that visitors can find it. Next, lock down that hostname with a new Access policy. Once you’re ready, you can proceed to install Argo Tunnel on your web server by following the instructions here.

With Access and Argo Tunnel, teams are making their on-premise applications feel and operate like SaaS products.

What's next?

We’re always excited to hear about how customers use our products. The feedback helps us iterate and build better solutions for your teams. We’d like to thank our Access beta users, as well as early adopters, for their input. We’re excited to keep improving Access so that your team can continue transitioning away from your VPN.

Categories: Technology

A Detailed Look at RFC 8446 (a.k.a. TLS 1.3)

Sat, 11/08/2018 - 00:00

For the last five years, the Internet Engineering Task Force (IETF), the standards body that defines internet protocols, has been working on standardizing the latest version of one of its most important security protocols: Transport Layer Security (TLS). TLS is used to secure the web (and much more!), providing encryption and ensuring the authenticity of every HTTPS website and API. The latest version of TLS, TLS 1.3 (RFC 8446) was published today. It is the first major overhaul of the protocol, bringing significant security and performance improvements. This article provides a deep dive into the changes introduced in TLS 1.3 and its impact on the future of internet security.

An evolution

One major way Cloudflare provides security is by supporting HTTPS for websites and web services such as APIs. With HTTPS (the “S” stands for secure) the communication between your browser and the server travels over an encrypted and authenticated channel. Serving your content over HTTPS instead of HTTP provides confidence to the visitor that the content they see is presented by the legitimate content owner and that the communication is safe from eavesdropping. This is a big deal in a world where online privacy is more important than ever.

The machinery under the hood that makes HTTPS secure is a protocol called TLS. It has its roots in a protocol called Secure Sockets Layer (SSL) developed in the mid-nineties at Netscape. By the end of the 1990s, Netscape handed SSL over to the IETF, who renamed it TLS and have been the stewards of the protocol ever since. Many people still refer to web encryption as SSL, even though the vast majority of services have switched over to supporting TLS only. The term SSL continues to have popular appeal and Cloudflare has kept the term alive through product names like Keyless SSL and Universal SSL.


In the IETF, protocol specifications are published as RFCs. TLS 1.0 was RFC 2246, TLS 1.1 was RFC 4346, and TLS 1.2 was RFC 5246. Today, TLS 1.3 was published as RFC 8446. RFCs are generally numbered in order, so keeping 46 as part of the RFC number is a nice touch.

TLS 1.2 wears parachute pants and shoulder pads

(Image: MC Hammer, like SSL, was popular in the 90s)

Over the last few years, TLS has seen its fair share of problems. First of all, there have been problems with the code that implements TLS, including Heartbleed, BERserk, goto fail;, and more. These issues are not fundamental to the protocol and mostly resulted from a lack of testing. Tools like TLS Attacker and Project Wycheproof have helped improve the robustness of TLS implementations, but the more challenging problems faced by TLS have had to do with the protocol itself.

TLS was designed by engineers using tools from mathematicians. Many of the early design decisions from the days of SSL were made using heuristics and an incomplete understanding of how to design robust security protocols. That said, this isn’t the fault of the protocol designers (Paul Kocher, Phil Karlton, Alan Freier, Tim Dierks, Christopher Allen and others), as the entire industry was still learning how to do this properly. When TLS was designed, formal papers on the design of secure authentication protocols like Hugo Krawczyk’s landmark SIGMA paper were still years away. TLS was 90s crypto: It meant well and seemed cool at the time, but the modern cryptographer’s design palette has moved on.

Many of the design flaws were discovered using formal verification. Academics attempted to prove certain security properties of TLS, but instead found counter-examples that were turned into real vulnerabilities. These weaknesses range from the purely theoretical (SLOTH and CurveSwap), to feasible for highly resourced attackers (WeakDH, LogJam, FREAK, SWEET32), to practical and dangerous (POODLE, ROBOT).

TLS 1.2 is slow

Encryption has always been important online, but historically it was only used for things like logging in or sending credit card information, leaving most other data exposed. There has been a major trend in the last few years towards using HTTPS for all traffic on the Internet. This has the positive effect of protecting more of what we do online from eavesdroppers and injection attacks, but has the downside that new connections get a bit slower.

For a browser and web server to agree on a key, they need to exchange cryptographic data. The exchange, called the “handshake” in TLS, has remained largely unchanged since TLS was standardized in 1999. The handshake requires two additional round-trips between the browser and the server before encrypted data can be sent (or one when resuming a previous connection). The additional cost of the TLS handshake for HTTPS results in a noticeable hit to latency compared to HTTP alone. This additional delay can negatively impact performance-focused applications.

Defining TLS 1.3

Unsatisfied with the outdated design of TLS 1.2 and two-round-trip overhead, the IETF set about defining a new version of TLS. In August 2013, Eric Rescorla laid out a wishlist of features for the new protocol:
https://www.ietf.org/proceedings/87/slides/slides-87-tls-5.pdf

After some debate, it was decided that this new version of TLS was to be called TLS 1.3. The main issues that drove the design of TLS 1.3 were mostly the same as those presented five years ago:

  • reducing handshake latency
  • encrypting more of the handshake
  • improving resiliency to cross-protocol attacks
  • removing legacy features

The specification was shaped by volunteers through an open design process, and after four years of diligent work and vigorous debate, TLS 1.3 is now in its final form: RFC 8446. As adoption increases, the new protocol will make the internet both faster and more secure.

In this blog post I will focus on the two main advantages TLS 1.3 has over previous versions: security and performance.

Trimming the hedges

(Image credit: Creative Commons Attribution-Share Alike 3.0)

In the last two decades, we as a society have learned a lot about how to write secure cryptographic protocols. The parade of cleverly-named attacks from POODLE to Lucky13 to SLOTH to LogJam showed that even TLS 1.2 contains antiquated ideas from the early days of cryptographic design. One of the design goals of TLS 1.3 was to correct previous mistakes by removing potentially dangerous design elements.

Fixing key exchange

TLS is a so-called “hybrid” cryptosystem. This means it uses both symmetric key cryptography (encryption and decryption keys are the same) and public key cryptography (encryption and decryption keys are different). Hybrid schemes are the predominant form of encryption used on the Internet and are used in SSH, IPsec, Signal, WireGuard and other protocols. In hybrid cryptosystems, public key cryptography is used to establish a shared secret between both parties, and the shared secret is used to create symmetric keys that can be used to encrypt the data exchanged.

As a rule of thumb, public key crypto is slow and expensive (microseconds to milliseconds per operation) and symmetric key crypto is fast and cheap (nanoseconds per operation). Hybrid encryption schemes let you send a lot of encrypted data with very little overhead by only doing the expensive part once. Much of the work in TLS 1.3 has been about improving the part of the handshake where public keys are used to establish symmetric keys.
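As a rough sketch of the hybrid pattern (using Node.js’s built-in crypto module purely for illustration, not TLS’s actual key schedule): an ephemeral key agreement establishes the shared secret, which is then hashed into a symmetric key.

const crypto = require('crypto');

// Each side generates an ephemeral key pair on the P-256 curve.
const client = crypto.createECDH('prime256v1');
const server = crypto.createECDH('prime256v1');
const clientPub = client.generateKeys();
const serverPub = server.generateKeys();

// Only the public halves are exchanged; both sides derive the same secret.
const clientSecret = client.computeSecret(serverPub);
const serverSecret = server.computeSecret(clientPub);
console.log(clientSecret.equals(serverSecret)); // true

// The expensive public-key step happens once; the secret is hashed into
// a symmetric key that cheaply encrypts the rest of the conversation.
const symmetricKey = crypto.createHash('sha256').update(clientSecret).digest();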

RSA key exchange

The public key portion of TLS is about establishing a shared secret. There are two main ways of doing this with public key cryptography. The simpler way is with public-key encryption: one party encrypts the shared secret with the other party’s public key and sends it along. The other party then uses its private key to decrypt the shared secret and ... voila! They both share the same secret. This technique was discovered in 1977 by Rivest, Shamir and Adleman and is called RSA key exchange. In TLS’s RSA key exchange, the shared secret is decided by the client, who then encrypts it to the server’s public key (extracted from the certificate) and sends it to the server.
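A sketch of that flow with Node.js’s crypto module (illustrative only; TLS’s actual wire format differs):

const crypto = require('crypto');

// The server's long-term RSA key pair (in TLS, the public half lives in the certificate).
const { publicKey, privateKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });

// Client: choose the shared secret and encrypt it to the server's public key.
const premaster = crypto.randomBytes(48);
const wrapped = crypto.publicEncrypt(publicKey, premaster);

// Server: decrypt with the private key ... voila, both sides share the secret.
const unwrapped = crypto.privateDecrypt(privateKey, wrapped);
console.log(unwrapped.equals(premaster)); // true

// Note: anyone who records `wrapped` and later steals `privateKey` can
// recover the secret -- the forward-secrecy problem described below.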


The other form of key exchange available in TLS is based on another form of public-key cryptography, invented by Diffie and Hellman in 1976, so-called Diffie-Hellman key agreement. In Diffie-Hellman, the client and server both start by creating a public-private key pair. They then send the public portion of their key share to the other party. When each party receives the public key share of the other, they combine it with their own private key and end up with the same value: the pre-master secret. The server then uses a digital signature to ensure the exchange hasn’t been tampered with. This key exchange is called “ephemeral” if the client and server both choose a new key pair for every exchange.


Both modes result in the client and server having a shared secret, but RSA mode has a serious downside: it’s not forward secret. That means that if someone records the encrypted conversation and then gets ahold of the RSA private key of the server, they can decrypt the conversation. This even applies if the conversation was recorded and the key is obtained some time well into the future. In a world where national governments are recording encrypted conversations and using exploits like Heartbleed to steal private keys, this is a realistic threat.

RSA key exchange has been problematic for some time, and not just because it’s not forward-secret. It’s also notoriously difficult to do correctly. In 1998, Daniel Bleichenbacher discovered a vulnerability in the way RSA encryption was done in SSL and created what’s called the “million-message attack,” which allows an attacker to perform an RSA private key operation with a server’s private key by sending a million or so well-crafted messages and looking for differences in the error codes returned. The attack has been refined over the years and in some cases only requires thousands of messages, making it feasible to do from a laptop. It was recently discovered that major websites (including facebook.com) were also vulnerable to a variant of Bleichenbacher’s attack called the ROBOT attack as recently as 2017.

To reduce the risks caused by non-forward secret connections and million-message attacks, RSA encryption was removed from TLS 1.3, leaving ephemeral Diffie-Hellman as the only key exchange mechanism. Removing RSA key exchange brings other advantages, as we will discuss in the performance section below.

Diffie-Hellman named groups

When it comes to cryptography, giving too many options leads to the wrong option being chosen. This principle is most evident when it comes to choosing Diffie-Hellman parameters. In previous versions of TLS, the choice of the Diffie-Hellman parameters was up to the participants. This resulted in some implementations choosing incorrectly, resulting in vulnerable implementations being deployed. TLS 1.3 takes this choice away.

Diffie-Hellman is a powerful tool, but not all Diffie-Hellman parameters are “safe” to use. The security of Diffie-Hellman depends on the difficulty of a specific mathematical problem called the discrete logarithm problem. If you can solve the discrete logarithm problem for a set of parameters, you can extract the private key and break the security of the protocol. Generally speaking, the bigger the numbers used, the harder it is to solve the discrete logarithm problem. So if you choose small DH parameters, you’re in trouble.

The LogJam and WeakDH attacks of 2015 showed that many TLS servers could be tricked into using small numbers for Diffie-Hellman, allowing an attacker to break the security of the protocol and decrypt conversations.
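To see why parameter size matters, here is a toy discrete-logarithm brute force (a sketch with deliberately tiny numbers; real groups use primes of 2048 bits or more, which makes this loop astronomically infeasible):

// Toy Diffie-Hellman group with a deliberately small prime.
const p = 2147483647n; // 2^31 - 1; real DH primes are 2048+ bits
const g = 16807n;

const secret = 918273n; // the private exponent we'll try to recover
let publicValue = 1n;
for (let i = 0n; i < secret; i++) publicValue = (publicValue * g) % p;

// Brute-force the discrete log: trivial here, hopeless at real sizes.
let y = 1n;
for (let x = 1n; x <= p; x++) {
  y = (y * g) % p;
  if (y === publicValue) { console.log('recovered private key:', x); break; }
}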

Diffie-Hellman also requires the parameters to have certain other mathematical properties. In 2016, Antonio Sanso found an issue in OpenSSL where parameters were chosen that lacked the right mathematical properties, resulting in another vulnerability.

TLS 1.3 takes the opinionated route, restricting the Diffie-Hellman parameters to ones that are known to be secure. However, it still leaves several options; permitting only one option makes it difficult to update TLS in case these parameters are found to be insecure some time in the future.

Fixing ciphers

The other half of a hybrid crypto scheme is the actual encryption of data. This is done by combining an authentication code and a symmetric cipher for which each party knows the key. As I’ll describe, there are many ways to encrypt data, most of which are wrong.

CBC mode ciphers

In the last section we described TLS as a hybrid encryption scheme, with a public key part and a symmetric key part. The public key part is not the only one that has caused trouble over the years. The symmetric key portion has also had its fair share of issues. In any secure communication scheme, you need both encryption (to keep things private) and integrity (to make sure people don’t modify, add, or delete pieces of the conversation). Symmetric key encryption is used to provide both encryption and integrity, but in TLS 1.2 and earlier, these two pieces were combined in the wrong way, leading to security vulnerabilities.

An algorithm that performs symmetric encryption and decryption is called a symmetric cipher. Symmetric ciphers usually come in two main forms: block ciphers and stream ciphers.

A stream cipher takes a fixed-size key and uses it to create a stream of pseudo-random data of arbitrary length, called a key stream. To encrypt with a stream cipher, you take your message and combine it with the key stream by XORing each bit of the key stream with the corresponding bit of your message. To decrypt, you take the encrypted message and XOR it with the key stream again. Examples of pure stream ciphers are RC4 and ChaCha20. Stream ciphers are popular because they’re simple to implement and fast in software.
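The XOR mechanics in miniature (a sketch; random bytes stand in for the key stream that an RC4 or ChaCha20 would expand from a small key):

const crypto = require('crypto');

const message = Buffer.from('attack at dawn');
const keyStream = crypto.randomBytes(message.length); // stand-in key stream

// Encrypt: XOR each message byte with the corresponding key stream byte.
const xor = (data, stream) => Buffer.from(data.map((byte, i) => byte ^ stream[i]));
const ciphertext = xor(message, keyStream);

// Decrypt: XOR with the same key stream restores the plaintext.
console.log(xor(ciphertext, keyStream).toString()); // "attack at dawn"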

A block cipher is different from a stream cipher in that it only encrypts fixed-size messages. If you want to encrypt a message that is shorter or longer than the block size, you have to do a bit of work. For shorter messages, you have to add some extra data (padding) to the end of the message. For longer messages, you can either split your message into blocks the cipher can encrypt and use a block cipher mode to chain the pieces together, or you can turn your block cipher into a stream cipher by encrypting a sequence of counters and using the output as the key stream. This is called “counter mode”. One popular way of encrypting arbitrary-length data with a block cipher is a mode called cipher block chaining (CBC).
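Counter mode in practice (a sketch using Node.js’s AES-CTR; note that no padding is needed for an arbitrary-length message):

const crypto = require('crypto');

const key = crypto.randomBytes(32);
const counter = crypto.randomBytes(16); // initial counter block

// AES encrypts 16-byte blocks; CTR mode encrypts an incrementing counter
// and XORs the output in, turning the block cipher into a stream cipher.
const cipher = crypto.createCipheriv('aes-256-ctr', key, counter);
const ciphertext = Buffer.concat([
  cipher.update('any length works here, no padding required'),
  cipher.final(),
]);

const decipher = crypto.createDecipheriv('aes-256-ctr', key, counter);
const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
console.log(plaintext.toString());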


In order to prevent people from tampering with data, encryption is not enough. Data also needs to be integrity-protected. For CBC-mode ciphers, this is done using something called a message-authentication code (MAC), which is like a fancy checksum with a key. Cryptographically strong MACs have the property that finding a MAC value that matches an input is practically impossible unless you know the secret key. There are two ways to combine MACs and CBC-mode ciphers. Either you encrypt first and then MAC the ciphertext, or you MAC the plaintext first and then encrypt the whole thing. In TLS, they chose the latter, MAC-then-Encrypt, which turned out to be the wrong choice.
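To make the distinction concrete, here is the safer ordering, Encrypt-then-MAC, sketched with Node.js primitives (an illustration, not TLS record-layer code). Because the MAC covers the ciphertext, a tampered record is rejected before any decryption or padding logic runs:

const crypto = require('crypto');

const encKey = crypto.randomBytes(32);
const macKey = crypto.randomBytes(32);
const iv = crypto.randomBytes(16);

// Encrypt first...
const cipher = crypto.createCipheriv('aes-256-cbc', encKey, iv);
const ciphertext = Buffer.concat([cipher.update('hello world'), cipher.final()]);

// ...then MAC the IV and ciphertext so tampering is caught up front.
const tag = crypto.createHmac('sha256', macKey).update(iv).update(ciphertext).digest();

// Receiver recomputes the tag and compares in constant time before decrypting.
const check = crypto.createHmac('sha256', macKey).update(iv).update(ciphertext).digest();
if (!crypto.timingSafeEqual(tag, check)) throw new Error('record tampered with');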

You can blame MAC-then-Encrypt for BEAST, as well as a slew of padding oracle vulnerabilities such as Lucky 13 and Lucky Microseconds. Read my previous post on the subject for a comprehensive explanation of these flaws. The interaction between CBC mode and padding was also the cause of the widely publicized POODLE vulnerability in SSLv3 and some implementations of TLS.

RC4 is a classic stream cipher designed by Ron Rivest (the “R” of RSA) that has been broadly supported since the early days of TLS. In 2013, it was found to have measurable biases that could be leveraged to allow attackers to decrypt messages.

AEAD Mode

In TLS 1.3, all the troublesome ciphers and cipher modes have been removed. You can no longer use CBC-mode ciphers or insecure stream ciphers such as RC4. The only type of symmetric crypto allowed in TLS 1.3 is a new construction called AEAD (authenticated encryption with additional data), which combines encryption and integrity into one seamless operation.
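A sketch of an AEAD in action, using Node.js’s AES-256-GCM: encryption yields a ciphertext plus an authentication tag, and unencrypted “additional data” (in TLS, record header material) is authenticated along with it.

const crypto = require('crypto');

const key = crypto.randomBytes(32);
const nonce = crypto.randomBytes(12);

const cipher = crypto.createCipheriv('aes-256-gcm', key, nonce);
cipher.setAAD(Buffer.from('record header')); // authenticated but not encrypted
const ciphertext = Buffer.concat([cipher.update('secret payload'), cipher.final()]);
const tag = cipher.getAuthTag();

// Decryption verifies the tag; any change to ciphertext, tag, or AAD throws.
const decipher = crypto.createDecipheriv('aes-256-gcm', key, nonce);
decipher.setAAD(Buffer.from('record header'));
decipher.setAuthTag(tag);
const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);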

Fixing digital signatures

Another important part of TLS is authentication. In every connection, the server authenticates itself to the client using a digital certificate, which has a public key. In RSA-encryption mode, the server proves its ownership of the private key by decrypting the pre-master secret and computing a MAC over the transcript of the conversation. In Diffie-Hellman mode, the server proves ownership of the private key using a digital signature. If you’ve been following this blog post so far, it should be easy to guess that this was done incorrectly too.

PKCS#1v1.5

Daniel Bleichenbacher has made a living identifying problems with RSA in TLS. In 2006, he devised a pen-and-paper attack against RSA signatures as used in TLS. It was later discovered that major TLS implementations, including those of NSS and OpenSSL, were vulnerable to this attack. This issue again had to do with how difficult it is to implement padding correctly, in this case the PKCS#1 v1.5 padding used in RSA signatures. In TLS 1.3, PKCS#1 v1.5 is removed in favor of the newer design RSA-PSS.

Signing the entire transcript

We described earlier how the server uses a digital signature to prove that the key exchange hasn’t been tampered with. In TLS 1.2 and earlier, the server’s signature only covers part of the handshake. The other parts of the handshake, specifically the parts that are used to negotiate which symmetric cipher to use, are not signed by the private key. Instead, a symmetric MAC is used to ensure that the handshake was not tampered with. This oversight resulted in a number of high-profile vulnerabilities (FREAK, LogJam, etc.). In TLS 1.3 these are prevented because the server signs the entire handshake transcript.


The FREAK, LogJam and CurveSwap attacks took advantage of two things:

  1. the fact that intentionally weak ciphers from the 1990s (called export ciphers) were still supported in many browsers and servers, and
  2. the fact that the part of the handshake used to negotiate which cipher was used was not digitally signed.

The “man-in-the-middle” attacker can swap out the supported ciphers (or supported groups, or supported curves) from the client with an easily crackable choice that the server supports. They then break the key and forge two finished messages to make both parties think they’ve agreed on a transcript.


These attacks are called downgrade attacks, and they allow attackers to force two participants to use the weakest cipher supported by both parties, even if more secure ciphers are supported. In this style of attack, the perpetrator sits in the middle of the handshake and changes the list of supported ciphers advertised from the client to the server to only include weak export ciphers. The server then chooses one of the weak ciphers, and the attacker figures out the key with a brute-force attack, allowing the attacker to forge the MACs on the handshake. In TLS 1.3, this type of downgrade attack is impossible because the server now signs the entire handshake, including the cipher negotiation.


Better living through simplification

TLS 1.3 is a much more elegant and secure protocol with the removal of the insecure features listed above. This hedge-trimming allowed the protocol to be simplified in ways that make it easier to understand, and faster.

No more take-out menu

In previous versions of TLS, the main negotiation mechanism was the ciphersuite. A ciphersuite encompassed almost everything that could be negotiated about a connection:

  • type of certificates supported
  • hash function used for deriving keys (e.g., SHA1, SHA256, ...)
  • MAC function (e.g., HMAC with SHA1, SHA256, …)
  • key exchange algorithm (e.g., RSA, ECDHE, …)
  • cipher (e.g., AES, RC4, ...)
  • cipher mode, if applicable (e.g., CBC)

Ciphersuites in previous versions of TLS had grown into monstrously large alphabet soups. Examples of commonly used cipher suites include DHE-RC4-MD5 and ECDHE-ECDSA-AES-GCM-SHA256. Each ciphersuite was represented by a code point in a table maintained by an organization called the Internet Assigned Numbers Authority (IANA). Every time a new cipher was introduced, a new set of combinations needed to be added to the list. This resulted in a combinatorial explosion of code points representing every valid choice of these parameters. It had become a bit of a mess.

(Image: TLS 1.2 ciphersuite list)

(Image: TLS 1.3 ciphersuite list)

TLS 1.3 removes many of these legacy features, allowing for a clean split between three orthogonal negotiations:
  • Cipher + HKDF Hash
  • Key Exchange
  • Signature Algorithm


This simplified negotiation model and radically reduced set of parameters opens up a new possibility: the TLS 1.3 handshake can drop from two round-trips to only one, providing the performance boost that will help TLS 1.3 become popular and widely adopted.

Performance

When establishing a new connection to a server that you haven’t seen before, it takes two round-trips before data can be sent on the connection. This is not particularly noticeable in locations where the server and client are geographically close to each other, but it can make a big difference on mobile networks where latency can be as high as 200ms, an amount that is noticeable for humans.

1-RTT mode

TLS 1.3 now has a radically simpler cipher negotiation model and a reduced set of key agreement options (no RSA, no user-defined DH parameters). This means that every connection will use a DH-based key agreement and the parameters supported by the server are likely easy to guess (ECDHE with X25519 or P-256). Because of this limited set of choices, the client can simply choose to send DH key shares in the first message instead of waiting until the server has confirmed which key shares it is willing to support. That way, the server can learn the shared secret and send encrypted data one round trip earlier. Chrome’s implementation of TLS 1.3, for example, sends an X25519 keyshare in the first message to the server.


In the rare situation that the server does not support one of the key shares sent by the client, the server can send a new message, the HelloRetryRequest, to let the client know which groups it supports. Because the list has been trimmed down so much, this is not expected to be a common occurrence.

0-RTT resumption

A further optimization was inspired by the QUIC protocol. It lets clients send encrypted data in their first message to the server, resulting in no additional latency cost compared to unencrypted HTTP. This is a big deal, and once TLS 1.3 is widely deployed, the encrypted web is sure to feel much snappier than before.

In TLS 1.2, there are two ways to resume a connection, session ids and session tickets. In TLS 1.3 these are combined to form a new mode called PSK (pre-shared key) resumption. The idea is that after a session is established, the client and server can derive a shared secret called the “resumption master secret”. This can either be stored on the server with an id (session id style) or encrypted by a key known only to the server (session ticket style). This session ticket is sent to the client and redeemed when resuming a connection.

For resumed connections, both parties share a resumption master secret, so key exchange is not necessary except for providing forward secrecy. The next time the client connects to the server, it can take the secret from the previous session and use it to encrypt application data to send to the server, along with the session ticket. Something as amazing as sending encrypted data on the first flight does come with its downsides.

Replayability

There is no interactivity in 0-RTT data. It’s sent by the client and consumed by the server without any interaction. This is great for performance, but comes at a cost: replayability. If an attacker captures a 0-RTT packet that was sent to a server, they can replay it, and there’s a chance that the server will accept it as valid. This can have interesting negative consequences.


An example of dangerous replayed data is anything that changes state on the server. If you increment a counter, perform a database transaction, or do anything that has a permanent effect, it’s risky to put it in 0-RTT data.

As a client, you can try to protect against this by only putting “safe” requests into the 0-RTT data. In this context, “safe” means that the request won’t change server state. In HTTP, different methods are supposed to have different semantics. HTTP GET requests are supposed to be safe, so a browser can usually protect HTTPS servers against replay attacks by only sending GET requests in 0-RTT. Since most page loads start with a GET of “/” this results in faster page load time.

Problems start to happen when data sent in 0-RTT is used for state-changing requests. To help protect against this failure case, TLS 1.3 also includes a time-elapsed value in the session ticket. If this diverges too much, the client is either approaching the speed of light, or the value has been replayed. In either case, it’s prudent for the server to reject the 0-RTT data.
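The freshness check reduces to simple clock arithmetic. A sketch of the server-side logic (illustrative only, not code from a real TLS stack):

// Illustrative 0-RTT freshness check, not a real TLS stack.
// The server knows when it issued the ticket; the client reports how
// long it has held it. The two measurements should roughly agree.
function acceptEarlyData(ticketIssuedAtMs, clientReportedAgeMs, toleranceMs = 10000) {
  const expectedAgeMs = Date.now() - ticketIssuedAtMs;
  // Large divergence means a replayed ticket (or a faster-than-light client).
  return Math.abs(expectedAgeMs - clientReportedAgeMs) <= toleranceMs;
}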

For more details about 0-RTT, and the improvements to session resumption in TLS 1.3, check out this previous blog post.

Deployability

TLS 1.3 was a radical departure from TLS 1.2 and earlier, but in order to be deployed widely, it has to be backwards compatible with existing software. One of the reasons TLS 1.3 has taken so long to go from draft to final publication was the fact that some existing software (namely middleboxes) wasn’t playing nicely with the new changes. Even minor changes to the TLS 1.3 protocol that were visible on the wire (such as eliminating the redundant ChangeCipherSpec message, bumping the version from 0x0303 to 0x0304) ended up causing connection issues for some people.

Despite the fact that future flexibility was built into the TLS spec, some implementations made incorrect assumptions about how to handle future TLS versions. This phenomenon is called ossification, and I explore it more fully in the context of TLS in my previous post about why TLS 1.3 isn’t deployed yet. To accommodate these implementations, TLS 1.3 was modified to look a lot like TLS 1.2 session resumption (at least on the wire). This resulted in a much more functional, but less aesthetically pleasing, protocol. This is the price you pay for upgrading one of the most widely deployed protocols online.

Conclusions

TLS 1.3 is a modern security protocol built with modern tools like formal analysis, and it retains backwards compatibility. It has been tested widely and iterated upon using real-world deployment data. It’s a cleaner, faster, and more secure protocol ready to become the de facto two-party encryption protocol online. TLS 1.3 is enabled by default for all Cloudflare customers.

Publishing TLS 1.3 is a huge accomplishment. It is one of the best recent examples of how it is possible to take 20 years of deployed legacy code and change it on the fly, resulting in a better internet for everyone. TLS 1.3 has been debated and analyzed for the last three years and it’s now ready for prime time. Welcome, RFC 8446.

Categories: Technology

Optimising Caching on Pwned Passwords (with Workers)

Thu, 09/08/2018 - 16:42

In February, Troy Hunt unveiled Pwned Passwords v2. Containing over half a billion real-world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security.

In supporting this project, I built a k-Anonymity model to add a layer of security to the queries performed. This model allows for enhanced caching by mapping multiple leaked password hashes to a single hash prefix, and it works in a deterministic, HTTP-friendly way (which allows caching, whereas other implementations of Private Set Intersection require a degree of randomness).

Since launch, Pwned Passwords, using this anonymity model and delivered by Cloudflare, has been implemented across a wide variety of platforms - from sites like EVE Online and Kogan to tools like 1Password and Okta's PassProtect. The anonymity model is also used by Firefox Monitor when checking if an email is in a data breach.

Since its adoption, Troy has tweeted about the high cache hit ratio, and people have been asking me about my "secret ways" of achieving it. Over time I have touched various pieces of Cloudflare's caching systems; in late 2016 I worked to bring Bypass Cache on Cookie functionality to our self-service Business plan users and wrestled with the cache implications of CSRF tokens. Pwned Passwords, however, was far more fun: a chance to show the power of Cloudflare's cache functionality from the perspective of a user.

Looks like Pwned Passwords traffic has started to double over the norm, trending around 8M requests a day now. @IcyApril made a cache change to improve stability but reduce hit ratio around the 10th, but that's improving again now with higher volumes (94% for the last week). pic.twitter.com/HwMDLlmBEY

— Troy Hunt (@troyhunt) June 25, 2018

Will @IcyApril secret ways ever be released?!

— Neal (@tun35) May 7, 2018

It is worth noting that PwnedPasswords is not like a typical website in terms of caching - it contains 16^5 possible API queries (every possible string of five hexadecimal characters, in total over a million possible queries) in order to guarantee k-Anonymity in the API. Whilst the API guarantees k-Anonymity, it does not guarantee l-Diversity, meaning some queries can occur more often than others.

For ordinary websites, with fewer assets, the cache hit ratio can be far greater. An example of this is another site Troy set up using our barebones free plan; by simply configuring a Page Rule with the Cache Everything option (and setting an Edge Cache TTL, should the Cache-Control headers from your origin not do so), you are able to cache static HTML easily.

When I've written about really high cache-hit ratios on @haveibeenpwned courtesy of @Cloudflare, some people have suggested it's due to higher-level plans. Here's https://t.co/Y4GlsInvu2 running on the *free* plan: 99.0% cache hit ratio on requests and 99.5% on bandwidth. Free! pic.twitter.com/pP0wo7qKF3

— Troy Hunt (@troyhunt) July 31, 2018

Origin Headers

Indeed, the fact that the queries are usually API queries makes a substantial difference. When optimising caching, the most important thing to look for is instances where the same asset is stored multiple times under different cache keys; for some assets this may involve selectively ignoring query strings for cache purposes, but for APIs the devil is more in the detail.

When an HTTP request is made from a JavaScript asset (as is done when PwnedPasswords is directly implemented in login forms), the site will also send an Origin header to indicate where the fetch originates from.

When you make a search on haveibeenpwned.com/Passwords, there's a bit of JavaScript which applies the k-Anonymity model by SHA-1 hashing the password, truncating the hash to the first five characters, and sending that request off to https://api.pwnedpasswords.com/range/A94A8 (then performing a check to see if any of the contained suffixes are in the response).
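That client-side flow is only a few lines. A minimal sketch (using Node.js 18's global fetch; the real haveibeenpwned.com JavaScript differs in the details):

const crypto = require('crypto');

// k-Anonymity range query: hash locally, send only the 5-character prefix.
async function isPwned(password) {
  const hash = crypto.createHash('sha1').update(password).digest('hex').toUpperCase();
  const prefix = hash.slice(0, 5);
  const suffix = hash.slice(5);

  const response = await fetch('https://api.pwnedpasswords.com/range/' + prefix);
  const body = await response.text();

  // Each line of the response is "SUFFIX:COUNT"; the check happens locally.
  return body.split('\n').some(line => line.split(':')[0].trim() === suffix);
}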

In the headers of this request to PwnedPasswords.com, you can see the request contains an Origin header of the querying site.


This header is often useful for mitigating Cross-Site Request Forgery (CSRF) vulnerabilities by only allowing certain Origins to make HTTP requests using Cross-Origin Resource Sharing (CORS).

In the context of an API, this does not necessarily make sense where there is no state (i.e. cookies). However, Cloudflare's default Cache Key contains this header for those who wish to use it. This means Cloudflare will store a new cached copy of the asset whenever a different Origin header is present. Whilst this is ordinarily not a problem (most sites see one Origin header, or just a handful when using CORS), PwnedPasswords has Origin headers coming from websites all over the internet.

As Pwned Passwords will always give the same response for a given request, regardless of the Origin header, we are able to remove this header from the Cache Key using our Custom Cache Key functionality.

Incidentally, JavaScript CDNs will frequently have assets fetched as sub-resources from another JavaScript asset - removing the Origin header from their Cache Key can have similar benefits:

Just applied some @Cloudflare cache magic I experimented with to get @troyhunt's Pwned Passwords API cache hit ratio to ~91%, to a large JS CDN (@unpkg) during a slow traffic period. Traffic 30mins post deploy shows a growing ~94% Cache Hit Ratio (with a planned cache purge!). pic.twitter.com/ZQmfzEi4Y2

— Junade Ali (@IcyApril) May 6, 2018

Case Insensitivity

One thing I realised after speaking to Stefán Jökull Sigurðarson from EVE Online was that different users were querying assets using different casing; for example, instead of range/A94A8, a request to range/a94a8 would return the same asset. As the Cache Key is case sensitive, the asset would be cached twice.

Unfortunately, the API was already public, with both forms of casing being acceptable, by the time I started these optimisations.

Enter Cloudflare Workers

Instead of adjusting the cache key to solve this problem, I decided to use Cloudflare Workers - allowing me to adjust cache behaviour using JavaScript.

Troy initially had a simple worker on the site to enable CORS:

addEventListener('fetch', event => {
  event.respondWith(checkAndDispatchReports(event.request))
})

async function checkAndDispatchReports(req) {
  if (req.method === 'OPTIONS') {
    let responseHeaders = setCorsHeaders(new Headers())
    return new Response('', { headers: responseHeaders })
  } else {
    return await fetch(req)
  }
}

function setCorsHeaders(headers) {
  headers.set('Access-Control-Allow-Origin', '*')
  headers.set('Access-Control-Allow-Methods', 'GET')
  headers.set('Access-Control-Allow-Headers', 'access-control-allow-headers')
  headers.set('Access-Control-Max-Age', 1728000)
  return headers
}

I extended this worker to ensure that when a request left Workers, the hash prefix would always be upper case; additionally, I used the cacheKey flag to set the Cache Key directly in Workers when making the request (instead of using our internal Custom Cache Key configuration):

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
})

/**
 * Fetch request after making casing of hash prefix uniform
 * @param {Request} request
 */
async function handleRequest(request) {
  if (request.method === 'OPTIONS') {
    let responseHeaders = setCorsHeaders(new Headers())
    return new Response('', { headers: responseHeaders })
  }

  const url = new URL(request.url);
  if (!url.pathname.startsWith("/range/")) {
    const response = await fetch(request)
    return response;
  }

  const prefix = url.pathname.substr(7);
  const newRequest = "https://api.pwnedpasswords.com/range/" + prefix.toUpperCase()

  if (prefix === prefix.toUpperCase()) {
    const response = await fetch(request, { cf: { cacheKey: newRequest } })
    return response;
  }

  const init = {
    method: request.method,
    headers: request.headers
  }
  const modifiedRequest = new Request(newRequest, init)
  const response = await fetch(modifiedRequest, { cf: { cacheKey: newRequest } })
  return response
}

function setCorsHeaders(headers) {
  headers.set('Access-Control-Allow-Origin', '*')
  headers.set('Access-Control-Allow-Methods', 'GET')
  headers.set('Access-Control-Allow-Headers', 'access-control-allow-headers')
  headers.set('Access-Control-Max-Age', 1728000)
  return headers
}

Incidentally, our Workers team is working on some really cool stuff around controlling our cache APIs at a fine-grained level; you'll be able to see some of that in due course by following this blog.

Argo

Finally, Argo plays an important part in improving the cache hit ratio. Once toggled on, it is known for optimising the speed at which traffic travels around the internet - but it also means that when traffic is routed from one Cloudflare data center to another, if an asset is cached closer to the origin web server, the asset will be served from that data center. In essence, it offers Tiered Cache functionality: when traffic arrives at a less used Cloudflare data center, that data center can still utilise the cache of a data center receiving greater traffic (and more likely to have the asset in cache). This prevents a request from having to travel all the way to the origin whilst still being served from cache (even if not optimally close to the user).


Conclusion

By using Cloudflare's caching functionality, we are able to reduce the number of duplicate cache entries created for a single asset by accidental variations in the request parameters. Workers offers a mechanism to control the cache of assets on Cloudflare, with more fine-grained controls under active development.

By implementing this on Pwned Passwords, we are able to provide developers a simple and fast interface to reduce password reuse amongst their users, thereby limiting the effects of Credential Stuffing attacks on their systems. If only Irene Adler had used a password manager.

Interested in helping debug performance, cache and security issues for websites of all sizes? We're hiring for Support Engineers to join us in London, and additionally those speaking Japanese, Korean or Mandarin in our Singapore office.

Categories: Technology

Use Cloudflare Stream to build secure, reliable video apps

Tue, 07/08/2018 - 14:00

It’s our pleasure to announce the general availability of Cloudflare Stream. Cloudflare Stream is the best way for any founder or developer to deliver an extraordinary video experience to their viewers while cutting development time and costs, and as of today it is available to every Cloudflare customer.

If I had to summarize what we’ve learned as we’ve built Stream it would be: Video streaming is hard, but building a successful video streaming business is even harder. This is why our goal has been to take away the complexity of encoding, storage, and smooth delivery so you can focus on all the other critical parts of your business.

Cloudflare Stream API

You call a single endpoint, and Cloudflare Stream delivers a high-quality streaming experience to your visitors. Here’s how it works:

  1. Your app calls the /stream endpoint to upload a video. You can submit the contents of the video with the request or you can provide a URL to a video hosted elsewhere (see the sketch after this list).
  2. Cloudflare Stream encodes the stream in multiple resolutions to enable multi-bitrate streaming. We also automatically prepare DASH and HLS manifest files.
  3. Cloudflare serves your video (in multiple resolutions) from our vast network of 150+ data centers around the world, as close as we can manage to every Internet-connected device on earth.
  4. We provide you an embed code for the video which loads the unbranded and customizable Cloudflare Stream Player.
  5. You place the embed code in your app, and you’re done.
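As a rough sketch of step 1, submitting a hosted video URL (the endpoint path, request shape, and auth headers below are placeholders for illustration; consult the Stream API documentation for the real values):

// Hedged sketch of a Stream upload-by-URL call. ACCOUNT_ID, API_TOKEN,
// the exact path, and the body shape are placeholders, not the documented API.
async function uploadVideoByUrl(videoUrl) {
  const response = await fetch('https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/stream', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer API_TOKEN', // placeholder credential
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url: videoUrl }),
  });
  return response.json(); // assumed to contain the video ID and embed code details
}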
Why Stream

Cloudflare Stream is a simple product by design. We are happy to say we don’t provide every configuration option. Instead we make the best choices possible, both on a player and network level, to deliver a high-quality experience to your visitors.

Low Cost

Cloudflare Stream does not charge you for the intensive and complex job of encoding your video in different resolutions. You pay a dollar for every 1,000 minutes of streaming, and $5/mo for every 1,000 minutes of storage, and that’s it.

Behind the scenes we are driving costs so low both by having the most peered network in the world and by intelligently serving your video from the data center which is fastest when the user has an empty buffer, and most affordable when their buffer is full. This gives viewers the experience they need, while allowing you to serve video at a lower cost than you can find from platforms which can’t make these optimizations.

Efficient Routing

Cloudflare touches as much as one in every ten web requests made over the Internet. If you read this blog you know how much energy and effort we put into optimizing that system to deliver resources faster. When applied to video, it means faster time to first frame and less buffering for your viewers than providers who operate at a smaller scale can offer.

Integrated Solution

The key innovation of Stream is looking at a video as not just a bunch of bytes to be served over the Internet, but as an experience for a user. Our encoding takes into account how files will be delivered from our data centers. Our player uses its knowledge of how we deliver to provide a better experience. All of this is only made possible through working with a partner who can see the entire user experience from developer to viewer.

Common Use Cases
  • Video-on-demand: Whether you have 50 hours or 50,000 hours of video content, you can use Cloudflare Stream to make it streamable to the world.
  • Gaming: Allow your users from around the world to upload and share videos of their gameplay.
  • eLearning: Cloudflare Stream makes it a breeze to build eLearning applications that offer multi-bitrate streaming and other important features such as offline viewing and advanced security tokens.
  • Video Ads: Use the Cloudflare Player to stream video ads with the confidence that your stream will be optimized for your audience.
  • Your Idea: We are here to help make the Internet better so you can build amazing things with it. Reach out with your ideas for how video can make your app, site, or service more powerful.
How to Get Started

To get started, simply sign up for Cloudflare and visit the Stream tab! As of today it is generally available to every Cloudflare user. If you’re an Enterprise customer, speak with your Cloudflare team.

Have a question or idea? Reach out in the community forum.

Categories: Technology

Additional Record Types Available with Cloudflare DNS

Mon, 06/08/2018 - 17:45

(Photo by Mink Mingle / Unsplash)

Cloudflare recently updated the authoritative DNS service to support nine new record types. Since these records are less commonly used than what we previously supported, we thought it would be a good idea to do a brief explanation of each record type and how it is used.

DNSKEY and DS

DNSKEY and DS work together to allow you to enable DNSSEC on a child zone (subdomain) that you have delegated to another nameserver. DS is useful if you are delegating DNS (through an NS record) for a child zone to a separate system and want to keep using DNSSEC for that child zone; without a DS entry in the parent, the child's data will not be validated. We’ve blogged about the details of Cloudflare’s DNSSEC implementation and why it is important in the past, and this new feature allows for more flexible adoption for customers who need to delegate subdomains.

Certificate Related Record Types

Today, there is no way to restrict which TLS (SSL) certificates are trusted to be served for a host. For example, if an attacker were able to maliciously generate an SSL certificate for a host, they could use a man-in-the-middle attack to appear as the original site. With SSHFP, TLSA, SMIMEA, and CERT, a website owner can configure the exact certificate public key that is allowed to be used on the domain, stored inside the DNS and secured with DNSSEC, reducing the risk of these kinds of attacks succeeding.

It is critically important that if you rely on these types of records that you enable and configure DNSSEC for your domain.

SSHFP

This type of record is an answer to the question “When I’m connecting via SSH to this remote machine, it’s authenticating me, but how do I authenticate it?” If you’re the only person connecting to this machine, your SSH client will compare the fingerprint of the public host key to the one it kept in the known_hosts file during the first connection. However, across multiple machines or multiple users in an organization, you need to verify this information against a common source of trust. In essence, you need the equivalent of the authentication that a certificate authority provides by signing an HTTPS certificate, but for SSH. Although it’s possible to set up certificate authorities for SSH and to have them sign public host keys, another way is to publish the fingerprint of the keys in the domain via the SSHFP record type.

Again, for these fingerprints to be trustworthy it is important to enable DNSSEC on your zone.

The SSHFP record type is similar to the TLSA record. You specify the key algorithm, the fingerprint type, and then the fingerprint itself within the record for a given hostname.

If the domain and remote server have SSHFP set and you are running an SSH client (such as OpenSSH 5.1+) that supports it, you can now verify the remote machine upon connection by adding the following parameters to your connection:

❯ ssh -o "VerifyHostKeyDNS=yes" -o "StrictHostKeyChecking=yes" [insertremoteserverhere]
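If you manage the remote server, OpenSSH can generate the matching SSHFP records for you from the server's host keys; run this on the server and paste the output lines into your zone (host.example.com below is a placeholder for your server's hostname):

❯ ssh-keygen -r host.example.com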

TLSA and SMIMEA

TLSA records were designed to specify which keys are allowed to be used for a given domain when connecting via TLS. They were introduced in the DANE specification and allow domain owners to announce which certificate can and should be used for specific purposes for the domain. While most major browsers do not support TLSA, it may still be valuable for non-browser-specific applications and services.

For example, I’ve set a TLSA record for the domain hasvickygoneonholiday.com for TCP traffic over port 443. There are a number of ways to generate the record, but the easiest is likely through Shumon Huque’s tool.

For most of the examples in this post we will be using kdig rather than the ubiquitous dig. Preinstalled versions of dig are often old and may not handle newer record types well. If your queries do not quite match up, you should either upgrade your version of dig or install knot.

;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 2218
;; Flags: qr rd ra ad; QUERY: 1; ANSWER: 2; AUTHORITY: 0; ADDITIONAL: 1

;; QUESTION SECTION:
;; _443._tcp.hasvickygoneonholiday.com. IN TLSA

;; ANSWER SECTION:
_443._tcp.hasvickygoneonholiday.com. 300 IN TLSA 3 1 1 4E48ED671DFCDF6CBF55E52DBC8B9C9CC21121BD149BC24849D1398DA56FB242
_443._tcp.hasvickygoneonholiday.com. 300 IN RRSIG TLSA 13 4 300 20180803233834 20180801213834 35273 hasvickygoneonholiday.com. JvC9mZLfuAyEHZUZdq4n8kyRbF09vwgx4c1fas24Ag925LILr1armjHbr7ZTp8ycS/Go3y3lgyYCuBeW/vT/3w==

;; Received 232 B
;; Time 2018-08-02 15:38:34 PDT
;; From 192.168.1.1@53(UDP) in 28.5 ms

From the above request and response, we can see that the response for the zone is secured and signed with DNSSEC (the ad flag) and that I should verify the certificate by matching the SHA-256 hash of its public key (that is what the “3 1 1” parameters specify) against 4E48ED671DFCDF6CBF55E52DBC8B9C9CC21121BD149BC24849D1398DA56FB242. We can use openssl (v1.1.x or higher) to verify the results:

❯ openssl s_client -connect hasvickygoneonholiday.com:443 -dane_tlsa_domain "hasvickygoneonholiday.com" -dane_tlsa_rrdata "3 1 1 4e48ed671dfcdf6cbf55e52dbc8b9c9cc21121bd149bc24849d1398da56fb242"
CONNECTED(00000003)
depth=0 C = US, ST = CA, L = San Francisco, O = "CloudFlare, Inc.", CN = hasvickygoneonholiday.com
verify return:1
---
Certificate chain
 0 s:/C=US/ST=CA/L=San Francisco/O=CloudFlare, Inc./CN=hasvickygoneonholiday.com
   i:/C=US/ST=CA/L=San Francisco/O=CloudFlare, Inc./CN=CloudFlare Inc ECC CA-2
 1 s:/C=US/ST=CA/L=San Francisco/O=CloudFlare, Inc./CN=CloudFlare Inc ECC CA-2
   i:/C=IE/O=Baltimore/OU=CyberTrust/CN=Baltimore CyberTrust Root
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIE7jCCBJSgAwIBAgIQB9z9WxnovNf/lt2Lkrfq+DAKBggqhkjOPQQDAjBvMQsw
...
---
SSL handshake has read 2666 bytes and written 295 bytes
Verification: OK
Verified peername: hasvickygoneonholiday.com
DANE TLSA 3 1 1 ...149bc24849d1398da56fb242 matched EE certificate at depth 0

SMIMEA records function similarly to TLSA records but are specific to email addresses. The domain for these records should be prefixed with “_smimecert.” and specific formatting is required to attach an SMIMEA record to an email address: the local-part (username) of the email address must be put into a canonical form and SHA-256 hashed, as detailed in the RFC. From the RFC’s example: to request an SMIMEA resource record for a user whose email address is "hugh@example.com", an SMIMEA query would be placed for the following QNAME: c93f1e400f26708f98cb19d936620da35eec8f72e57f9eec01c1afd6._smimecert.example.com

CERT

CERT records are used for generically storing certificates within DNS and are most commonly used by systems for email encryption. To create a CERT record, you must specify the certificate type, the key tag, the algorithm, and the certificate data, which is either the certificate itself, a CRL, a URL to the certificate, or a fingerprint and a URL.
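
As a purely hypothetical zone-file sketch (the base64 blob is a placeholder, not a real certificate), an inline PGP certificate would look something like the line below; for PGP the key tag and algorithm fields do not apply, so both are 0:

mail.example.com. 3600 IN CERT PGP 0 0 mQGiBEXyFJkRBAD...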

Other Newly Supported Record Types

PTR

PTR (Pointer) records are pointers to canonical names. They are similar to CNAME records in structure, meaning they contain only one FQDN (fully qualified domain name), but the RFC dictates that no subsequent lookups are done for PTR records; the value is simply returned to the requestor. This is different from a CNAME, where a recursive resolver would follow the target of the canonical name. The most common use of a PTR record is in reverse DNS, where you can look up which domains are meant to exist at a given IP address. These are useful for outbound mail servers as well as authoritative DNS servers.

It is only possible to delegate the authority for IP addresses that you own from your Regional Internet Registry (RIR). Creating reverse zones and PTR records for IPs that you cannot (or do not) delegate does not serve any practical purpose.

For example, looking up the A record for marek.ns.cloudflare.com gives us the IP of 173.245.59.202.

❯ kdig a marek.ns.cloudflare.com +short
173.245.59.202

Now imagine we want to know if the owner of this IP ‘authorizes’ marek.ns.cloudflare.com to point to it. Reverse zones are specially crafted child zones within in-addr.arpa. (for IPv4) and ip6.arpa. (for IPv6), which are delegated via the Regional Internet Registries to the owners of the IP address space. That is to say, if you own a /24 from ARIN, ARIN will delegate the reverse zone space for your /24 to you to control. The IPv4 address, inverted, is represented as the subdomain in in-addr.arpa. Since Cloudflare owns the IP, we’ve delegated the reverse zone and created a PTR there.

❯ kdig -x 173.245.59.202
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 18658
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0

;; QUESTION SECTION:
;; 202.59.245.173.in-addr.arpa. IN PTR

;; ANSWER SECTION:
202.59.245.173.in-addr.arpa. 1222 IN PTR marek.ns.cloudflare.com.

For completeness, here is the +trace for the 202.59.245.173.in-addr.arpa zone. We can see that the /24 59.245.173.in-addr.arpa has been delegated to Cloudflare from ARIN:

❯ dig 202.59.245.173.in-addr.arpa +trace

; <<>> DiG 9.8.3-P1 <<>> 202.59.245.173.in-addr.arpa +trace
;; global options: +cmd
.                        48419   IN  NS  a.root-servers.net.
.                        48419   IN  NS  b.root-servers.net.
.                        48419   IN  NS  c.root-servers.net.
.                        48419   IN  NS  d.root-servers.net.
.                        48419   IN  NS  e.root-servers.net.
.                        48419   IN  NS  f.root-servers.net.
.                        48419   IN  NS  g.root-servers.net.
.                        48419   IN  NS  h.root-servers.net.
.                        48419   IN  NS  i.root-servers.net.
.                        48419   IN  NS  j.root-servers.net.
.                        48419   IN  NS  k.root-servers.net.
.                        48419   IN  NS  l.root-servers.net.
.                        48419   IN  NS  m.root-servers.net.
;; Received 228 bytes from 2001:4860:4860::8888#53(2001:4860:4860::8888) in 25 ms

in-addr.arpa.            172800  IN  NS  e.in-addr-servers.arpa.
in-addr.arpa.            172800  IN  NS  d.in-addr-servers.arpa.
in-addr.arpa.            172800  IN  NS  b.in-addr-servers.arpa.
in-addr.arpa.            172800  IN  NS  f.in-addr-servers.arpa.
in-addr.arpa.            172800  IN  NS  c.in-addr-servers.arpa.
in-addr.arpa.            172800  IN  NS  a.in-addr-servers.arpa.
;; Received 421 bytes from 192.36.148.17#53(192.36.148.17) in 8 ms

173.in-addr.arpa.        86400   IN  NS  u.arin.net.
173.in-addr.arpa.        86400   IN  NS  arin.authdns.ripe.net.
173.in-addr.arpa.        86400   IN  NS  z.arin.net.
173.in-addr.arpa.        86400   IN  NS  r.arin.net.
173.in-addr.arpa.        86400   IN  NS  x.arin.net.
173.in-addr.arpa.        86400   IN  NS  y.arin.net.
;; Received 165 bytes from 199.180.182.53#53(199.180.182.53) in 300 ms

59.245.173.in-addr.arpa. 86400   IN  NS  ns1.cloudflare.com.
59.245.173.in-addr.arpa. 86400   IN  NS  ns2.cloudflare.com.
;; Received 95 bytes from 2001:500:13::63#53(2001:500:13::63) in 188 ms

NAPTR

Naming Authority Pointer (NAPTR) records are used in conjunction with SRV records, generally as part of the SIP protocol. NAPTR records map domains to specific services, if available for that domain. Anders Brownworth has an excellent, detailed description on his blog. The start of his example, with his permission:

Let’s consider a call to 2125551212@example.com. Given only this address though, we don't know what IP address, port or protocol to send this call to. We don't even know if example.com supports SIP or some other VoIP protocol like H.323 or IAX2. I'm implying that we're interested in placing a call to this URL but if no VoIP service is supported, we could just as easily fall back to emailing this user instead. To find out, we start with a NAPTR record lookup for the domain we were given:

# host -t NAPTR example.com
example.com NAPTR 10 100 "S" "SIP+D2U" "" _sip._udp.example.com.
example.com NAPTR 20 100 "S" "SIP+D2T" "" _sip._tcp.example.com.
example.com NAPTR 30 100 "S" "E2U+email" "!^.*$!mailto:info@example.com!i" _sip._tcp.example.com.

Here we find that example.com gives us three ways to contact example.com, the first of which is "SIP+D2U" which would imply SIP over UDP at _sip._udp.example.com.

URI

Uniform Resource Identifier records are commonly used as a complement to NAPTR records and, per the RFC, can be used to replace SRV records. As such, they contain Weight and Priority fields as well as a Target, similar to SRV.

One use case, proposed by a draft RFC, is to replace SRV records with URI records for discovering Kerberos key distribution centers (KDCs). Compared to SRV records, this minimizes the number of requests and allows the domain owner to specify a preference for TCP or UDP.

The example below specifies that we should use a KDC over TCP at the default port, and over UDP on port 89 should the primary connection fail.

❯ kdig URI _kerberos.hasvickygoneonholiday.com
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 8450
;; Flags: qr rd ra; QUERY: 1; ANSWER: 2; AUTHORITY: 0; ADDITIONAL: 0

;; QUESTION SECTION:
;; _kerberos.hasvickygoneonholiday.com. IN URI

;; ANSWER SECTION:
_kerberos.hasvickygoneonholiday.com. 283 IN URI 1 10 "krb5srv:m:tcp:kdc.hasbickygoneonholiday.com"
_kerberos.hasvickygoneonholiday.com. 283 IN URI 1 20 "krb5srv:m:udp:kdc.hasbickygoneonholiday.com:89"

Summary

Cloudflare now supports CERT, DNSKEY, DS, NAPTR, PTR, SMIMEA, SSHFP, and TLSA in our authoritative DNS products. We would love to hear if you have any interesting example use cases for the new record types and what other record types we should support in the future.

Our DNS engineering teams in London and San Francisco are both hiring if you would like to contribute to the fastest authoritative and recursive DNS services in the world.


Categories: Technology

Growing the Cloudflare Apps Ecosystem

Thu, 02/08/2018 - 18:26
Growing the Cloudflare Apps Ecosystem

Starting today we are announcing the availability of two key pilot programs: Apps with Workers and Workers Service Providers.

Why now? Over the course of the past few months we've seen accelerating interest in Workers, and we frequently field questions about how we plan to combine our growing ecosystem around Workers with our unique delivery capability, Cloudflare Apps. To meet this need, we have introduced two programs, Apps with Workers and Workers Service Providers. Let’s dig into the details:

First, we are announcing the upcoming availability of Cloudflare Apps powered by embeddable Workers. This will allow any developer to build, deploy, and, in the near future, package Workers for distribution to third parties, all using the Cloudflare Apps platform. It will be, in effect, the world's first serverless Apps platform.

Today, it's easy to develop Workers using our UI or API. The ability to App-ify Workers opens up a whole new promise for those who prefer to deal in clicks, not code. For our Apps developers, Apps with Workers allows for more complex App offerings running on Cloudflare; for our customers, it is the next generation of Apps. While we are actively putting the finishing touches on this capability, we are opening up a pilot program for select developers. Early access is limited. To apply, click here for more details.

Growing the Cloudflare Apps Ecosystem

Second, we are announcing the upcoming availability of Cloudflare Workers Service Providers. While many Cloudflare customers write Cloudflare Workers for themselves, many others want to focus on their core business and bring in development expertise when they need it. The goal is simple: make it easy for our customers to connect to an ecosystem of developers and Apps, and to grow a vibrant marketplace around customers and partners. Moving forward, in addition to Apps, we will support the ability to post Solutions and Services backed by a curated set of consultants, experts, and System Integrators, all adding a new richness to the Cloudflare ecosystem. We are excited to hear from our community, so drop us a line.

Growing the Cloudflare Apps Ecosystem

Categories: Technology

How we scaled nginx and saved the world 54 years every day

Tue, 31/07/2018 - 16:00
How we scaled nginx and saved the world 54 years every day

The @Cloudflare team just pushed a change that improves our network's performance significantly, especially for particularly slow outlier requests. How much faster? We estimate we're saving the Internet ~54 years *per day* of time we'd all otherwise be waiting for sites to load.

— Matthew Prince (@eastdakota) June 28, 2018

10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This blog post is about one of them.

How NGINX works

NGINX is one of the programs that popularized using event loops to solve the C10K problem. Every time a network event comes in (a new connection, a request, or a notification that we can send more data, etc.) NGINX wakes up, handles the event, and then goes back to do whatever it needs to do (which may be handling other events). When an event arrives, data associated with the event is already ready, which allows NGINX to efficiently handle many requests simultaneously without waiting.

num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
// handle event[1]: send out response to GET http://cloudflare.com/

For example, here's what a piece of code could look like to read data from a file descriptor:

// we got a read event on fd
while (buf_len > 0) {
    ssize_t n = read(fd, buf, buf_len);
    if (n < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            // try later when we get a read event again
        }
        if (errno == EINTR) {
            continue;
        }
        return total;
    }
    if (n == 0) {
        break; // EOF, nothing more to read
    }
    buf_len -= n;
    buf += n;
    total += n;
}

When fd is a network socket, this will return the bytes that have already arrived. The final call will return EWOULDBLOCK which means we have drained the local read buffer, so we should not read from the socket again until more data becomes available.

Disk I/O is not like network I/O

When fd is a regular file on Linux, EWOULDBLOCK and EAGAIN never happen, and read always waits to read the entire buffer. This is true even if the file was opened with O_NONBLOCK. Quoting open(2):

Note that this flag has no effect for regular files and block devices

In other words, the code above basically reduces to:

if (read(fd, buf, buf_len) > 0) {
    return buf_len;
}

Which means that if an event handler needs to read from disk, it will block the event loop until the entire read is finished, and subsequent event handlers are delayed.

This ends up being fine for most workloads, because reading from disk is usually fast enough, and much more predictable compared to waiting for a packet to arrive from network. That's especially true now that everyone has an SSD, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in 10s of µs. On top of that, we can run NGINX with multiple worker processes so that a slow event handler does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to service requests quickly and efficiently.

SSD performance: not always what’s on the label

As you might have guessed, these rosy assumptions aren’t always true. If each read always took 50µs, then reading 0.19MB in 4KB blocks (and we read in larger blocks) should take only about 2ms. But our own measurements showed that our time to first byte is sometimes much worse, particularly at the 99th and 999th percentiles. In other words, the slowest read per 100 (or per 1000) reads often takes much longer.

SSDs are very fast but they are also notoriously complicated. Inside them are computers that queue up and re-order I/O, and also perform various background tasks like garbage collection and defragmentation. Once in a while, a request gets slowed down enough to matter. My colleague Ivan Babrou ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others. Going forward we will consider performance consistency in our SSD purchases, but in the meantime we need to have a solution for our existing hardware.
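
If you want a feel for such outliers on your own hardware, a latency-focused disk benchmark is a reasonable starting point. A minimal sketch, assuming the ioping utility (not necessarily the tool used in the benchmarks above) and /cache as a placeholder mount point:

❯ ioping -c 100 /cache

The max and mdev values in its summary are the interesting part here, since averages hide exactly the outliers this post is about.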

Spreading the load evenly with SO_REUSEPORT

An individual slow response once in a blue moon is difficult to avoid, but what we really don't want is a 1-second I/O operation blocking 1000 other requests that we receive within the same second. Conceptually, NGINX can handle many requests in parallel, but it only runs one event handler at a time. So I added a metric that measures this:

gettimeofday(&start, NULL);
num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
gettimeofday(&event_start_handle, NULL);
// handle event[1]: send out response to GET http://cloudflare.com/
timersub(&event_start_handle, &start, &event_loop_blocked);

p99 of event_loop_blocked turned out to be more than 50% of our TTFB. Which is to say, half of the time it takes to serve a request is the result of the event loop being blocked by other requests. And event_loop_blocked only measures about half of the blocking (delayed calls to epoll_wait() are not measured), so the actual ratio of blocked time is much higher.

Each of our machines runs NGINX with 15 worker processes, which means one slow I/O operation should block, at most, about 6% of the requests. However, the events are not evenly distributed, with the top worker taking 11% of the requests (or twice as many as expected).

SO_REUSEPORT can solve the uneven distribution problem. Marek Majkowski has previously written about the downside in the context of other NGINX instances, but that downside mostly doesn't apply in our case since upstream connections in our cache process are long-lived, so a slightly higher latency in opening the connection is negligible. This single configuration change to enable SO_REUSEPORT improved peak p99 by 33%.
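
For reference, reuseport is a parameter of NGINX's listen directive, so the whole change can be as small as one line; a minimal sketch (your listen directives will differ):

listen 443 ssl reuseport;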

Moving read() to thread pool: not a silver bullet

A solution to this is to make read() not block. In fact, this is a feature that's implemented in upstream NGINX! When the following configuration is used, read() and write() are done in a thread pool and won't block the event loop:

aio threads;
aio_write on;

However, when we tested this, instead of a 33x response-time improvement we actually saw a slight increase in p99. The difference was within the margin of error, but we were quite discouraged by the result and stopped pursuing this option for a while.

There are a few reasons why we didn’t see the level of improvement that NGINX saw. In their test, they used 200 concurrent connections to request files 4MB in size residing on spinning disks. Spinning disks increase I/O latency, so it makes sense that an optimization that helps latency would have a more dramatic effect.

We are also mostly concerned with p99 (and p999) performance. Solutions that help the average performance don't necessarily help with outliers.

Finally, in our environment, typical file sizes are much smaller. 90% of our cache hits are for files smaller than 60KB. Smaller files mean fewer occasions to block (we typically read the entire file in 2 reads).

If we look at the disk I/O that a cache hit has to do:

// we got a request for https://example.com which has cache key 0xCAFEBEEF
fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
// read up to 32KB for the metadata as well as the headers
// done in thread pool if "aio threads" is on
read(fd, buf, 32*1024);

32KB isn't a static number; if the headers are small we need to read just 4KB (we don't use direct IO, so the kernel will round up to 4KB). The open() seems innocuous, but it's actually not free. At a minimum the kernel needs to check whether the file exists and whether the calling process has permission to open it. For that it has to find the inode of /cache/prefix/dir/EF/BE/CAFEBEEF, and to do that it has to look up CAFEBEEF in /cache/prefix/dir/EF/BE/. Long story short, in the worst case the kernel has to do the following lookups:

/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF

That's 6 separate reads done by open() compared to just 1 read done by read()! Fortunately, most of the time lookups are serviced by the dentry cache and don't require trips to the SSDs. But clearly having read() done in thread pool is only half of the picture.

The coup de grâce: non-blocking open() in thread pools

So I modified NGINX to do most of open() inside the thread pool as well so it won't block the event loop. And the result (both non-blocking open and non-blocking read):

How we scaled nginx and saved the world 54 years every day

On June 26 we deployed our changes to 5 of our busiest data centers, followed by a worldwide roll-out the next day. Overall peak p99 TTFB improved by a factor of 6. In fact, adding up all the time from processing 8 million requests per second, we saved the Internet 54 years of wait time every day.

Our event loop handling is still not completely non-blocking. In particular, we still block when we are caching a file for the first time (both the open(O_CREAT) and rename()), or doing revalidation updates. However, those are rare compared to cache hits. In the future we will consider moving those off of the event loop to further improve our p99 latency.

Conclusion

NGINX is a powerful platform, but scaling extremely high I/O loads on Linux can be challenging. Upstream NGINX can offload reads to separate threads, but at our scale we often need to go one step further. If working on challenging performance problems sounds exciting to you, apply to join our team in San Francisco, London, Austin or Champaign.

Categories: Technology

Minecraft API with Workers + Coffeescript

Tue, 31/07/2018 - 09:00
Minecraft API with Workers + Coffeescript

The following is a guest post by Ashcon Partovi, a computer science and business undergraduate at the University of British Columbia in Vancouver, Canada. He's the founder of a popular Minecraft multiplayer server, stratus.network, that provides competitive, team-based gameplay to thousands of players every week.

Minecraft API with Workers + Coffeescript

If you've ever played a video game in the past couple of years, chances are you know about Minecraft. You might be familiar with the game or even have planted a tree or two, but what you might not know about is the vast number of Minecraft online communities. In this post, I'm going to describe how I used Cloudflare Workers to deploy and scale a production-grade API that solves a big problem for these Minecraft websites.

Introducing the Problem

Here is an example of my Minecraft player profile from one of the many multiplayer websites. It shows some identity information such as my username, a bitmap of my avatar, and a preview of my friends. Although rendering this page with 49 bitmap avatars may seem like an easy task, it's far from trivial. In fact, it's unnecessarily complicated.

Minecraft API with Workers + Coffeescript

Here is the current workflow to render a player profile on a website given their username:

  1. Find the UUID from the player's username.

curl api.mojang.com/users/profiles/minecraft/ElectroidFilms
{
  "id": "dad8b95ccf6a44df982e8c8dd70201e0",
  "name": "ElectroidFilms"
}

  2. Use that UUID to fetch the latest player information from the session server.

curl sessionserver.mojang.com/session/minecraft/profile/dad8b95cc...
{
  "id": "dad8b95ccf6a44df982e8c8dd70201e0",
  "name": "ElectroidFilms",
  "properties": [{
    "name": "textures",
    "value": "eyJ0aW1lc3RhbXAiOjE1MzI1MDI..." // <base64>
  }]
}

  3. Decode the textures string, which is encoded as base64.

echo "eyJ0aW1lc3RhbXAiOjE1MzI1MDIwNDY5NjIsIn..." | base64 --decode
{
  "timestamp": 1532502046962,
  "profileId": "dad8b95ccf6a44df982e8c8dd70201e0",
  "profileName": "ElectroidFilms",
  "textures": {
    "SKIN": {"url": "textures.minecraft.net/texture/741df6aa0..."},
    "CAPE": {"url": "textures.minecraft.net/texture/e7dfea16d..."}
  }
}

  4. Fetch the texture from the URL in the decoded JSON payload.

curl textures.minecraft.net/texture/741df6aa027... > skin.png

  5. Cache the texture in a database to avoid the 60-second rate limit.

mongo
> db.users.findOneAndUpdate(
    { _id: "dad8b95ccf6a44df982e8c8dd70201e0" },
    { skin_png: new BinData(0, "GWA3u4F42GIH318sAlN2wfDAWTQ...") })

Yikes, that's 5 complex operations required to render a single avatar! And that's not all: in my example profile there are 49 avatars, which would require a total of 5 * 49 = 245 operations.

And that's just fetching the data; we haven't even started to serve it to players! Then you have to set up a host to serve the web traffic, ensure that the service scales with demand, handle cache expiration of assets, and deploy across multiple regions. There has to be a better way!

Prototyping with Workers

I'm a strong believer in the future of serverless computing. So naturally, when I learned how Cloudflare Workers allow you to run Javascript code in 150+ points of presence, I started to tinker with the possibilities of solving this problem. After looking at the documentation and using the Workers playground, I quickly put together some Javascript code that aggregated all that profile complexity into a single request.

addEventListener('fetch', event => {
  event.respondWith(renderPlayerBitmap(event.request))
})

async function renderPlayerBitmap(request) {
  var username = request.url.split("/").pop()
  console.log("Starting request for... " + username)
  // Step 1: Username -> UUID
  var uuid = await fetch("https://api.mojang.com/users/profiles/minecraft/" + username)
  if (uuid.ok) {
    uuid = (await uuid.json()).id
    console.log("Found uuid... " + uuid)
    // Step 2: UUID -> Profile
    var session = await fetch("https://sessionserver.mojang.com/session/minecraft/profile/" + uuid)
    if (session.ok) {
      session = await session.json()
      console.log("Found session... " + JSON.stringify(session))
      // Step 3: Profile -> Texture URL
      var texture = atob(session.properties[0].value)
      console.log("Found texture... " + texture)
      // Step 4 + 5: Texture URL -> Texture PNG + Caching
      texture = JSON.parse(texture)
      return fetch(texture.textures.SKIN.url, { cf: { cacheTtl: 60 } })
    }
  }
  return new Response(undefined, { status: 500 })
}

Within a couple minutes I had my first Workers implementation! I gave it my username and it was able to make all the necessary sub-requests to return my player's bitmap texture.

Minecraft API with Workers + Coffeescript

After realizing the potential of Workers, I started to wonder if I could use it for more than just a single script. What if I could design and deploy a production-ready API for Minecraft that runs exclusively on Workers?

Designing an API

I wanted to address an essential problem for Minecraft developers: too many APIs with too many restrictions. The hassle of parsing multiple requests and handling errors prevents developers from focusing on creating great experiences for players. There needs to be a solution that requires only 1 HTTP request with no rate limiting and no client-side caching. After looking at the various use-cases for the existing APIs, I created a JSON schema that encompassed all the essential data into a single response:

GET: api.ashcon.app/mojang/v1/user/<username|uuid>
{
  "uuid": "<uuid>",
  "username": "<username>",
  "username_history": [
    {
      "username": "<username>",
      "changed_at": "<date|null>"
    }
  ],
  "textures": {
    "slim": "<boolean>",
    "custom": "<boolean>",
    "skin": {
      "url": "<url>",
      "data": "<base64>"
    },
    "cape": {
      "url": "<url|null>",
      "data": "<base64|null>"
    }
  },
  "cached_at": "<date>"
}

One of the primary goals I had in mind was to minimize sub-requests by clients. For example, instead of giving developers a URL to an image/png static asset, why not fetch it for them and embed it as a base64 string? Now that's simplicity!
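
For example, a consumer can pull the embedded skin straight out of the single response and decode it locally; a quick sketch, assuming the jq utility is installed:

❯ curl -s https://api.ashcon.app/mojang/v1/user/ElectroidFilms \
    | jq -r '.textures.skin.data' | base64 --decode > skin.png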

Getting Started

For this project, I decided to use Coffeescript, which transcompiles to Javascript and has a simple syntax. We'll also need to use Webpack to bundle all of our code into a single Javascript file to upload to Cloudflare.

# Welcome to Coffeescript!
str = "heyo! #{40+2}"       # 'heyo! 42'
num = 12 if str?            # 12
arr = [1, null, "apple"]    # [1, null, 'apple']
val = arr[1]?.length()      # null
hash =                      # {key: 'value'}
  key: "value"
add = (a, b, {c, d} = {}) ->
  c ?= 3
  d ?= 4
  a + b + c + d
add(1, 2, d: 5)             # 1 + 2 + 3 + 5 = 11

First, let's make sure we have the proper dependencies installed for the project! These commands will create a package.json file and a node_modules/ folder in our workspace.

mkdir -p workspace/src
cd workspace
npm init --yes
npm install --save-dev webpack webpack-cli coffeescript coffee-loader workers-preview

Now, we're going to edit our package.json to add a few helper scripts for later. You can delete the default "test" script as well.

"scripts": { "build": "webpack", "build:watch": "webpack --watch", "preview": "workers-preview < dist/bundle.js" }

We also need to initialize a webpack.config.js file with a coffeescript compiler.

const path = require('path')

module.exports = {
  entry: {
    bundle: path.join(__dirname, './src/index.coffee'),
  },
  output: {
    filename: 'bundle.js',
    path: path.join(__dirname, 'dist'),
  },
  mode: 'production',
  watchOptions: {
    ignored: /node_modules|dist|\.js/g,
  },
  resolve: {
    extensions: ['.coffee', '.js', '.json'],
    plugins: [],
  },
  module: {
    rules: [
      {
        test: /\.coffee?$/,
        loader: 'coffee-loader',
      }
    ]
  }
}

Before we start coding, we'll create a src/index.coffee file and make sure everything is working so far.

addEventListener('fetch', (event) ->
  event.respondWith(route(event.request)))

# We will populate this with our own logic after we test it!
route = (request) ->
  fetch('https://api.ashcon.app/mojang/v1/user/ElectroidFilms')

Open your terminal in the workspace/ directory and run the following commands:

npm run build
npm run preview

Your computer's default internet browser will open up a new window and preview the result of our Worker. If you see a JSON response, then everything is working properly and we're ready to go!

Minecraft API with Workers + Coffeescript

Writing Production Code for Workers

Now that we're set up with a working example, we can design our source code file structure. It's important that we break up our code into easily testable chunks, so I've gone ahead and outlined the approach that I took with this project:

src/
  index.coffee    # routing and serving requests
  api.coffee      # logic layer to mutate and package requests
  mojang.coffee   # non-logic layer to send upstream requests
  http.coffee     # HTTP requesting, parsing, and responding
  util.coffee     # util methods and extensions

If you're feeling adventurous, I've included a simplified version of my API code that you can browse through below. If you look at each file, you'll have a fully working implementation by the end! Otherwise, you can continue reading to learn about my deployment and an analysis of the API's impact.

http.coffee

Since our API will be making several HTTP requests, it's a good idea to code some common request and respond methods that can be reused among multiple requests. At the very least, we need to support parsing JSON or base64 responses and sending JSON or string data back to the client.

# Send a Http request and get a response.
#
# @param {string} url - Url to send the request.
# @param {string} method - Http method (get, post, etc).
# @param {integer} ttl - Time in seconds for Cloudflare to cache the request.
# @param {boolean} json - Whether to parse the response as json.
# @param {boolean} base64 - Whether to parse the response as a base64 string.
# @returns {promise<
#   json -> [err, json]
#   base64 -> string|null
#   else -> response
# >} - A different response based on the method parameters above.
export request = (url, {method, ttl, json, base64} = {}) ->
  method ?= "GET"
  response = await fetch(url, method: method, cf: {cacheTtl: ttl} if ttl)
  if json
    # Return a tuple of [err, json].
    if err = coerce(response.status)
      [err, null]
    else
      [null, await response.json()]
  else if base64
    # Return base64 string or null.
    if response.ok
      Buffer.from(await response.arrayBuffer(), "binary").toString("base64")
  else
    # If no parser is specified, just return the raw response.
    response

export get = (url, options = {}) ->
  request(url, Object.assign(options, {method: "GET"}))

# Respond to a client with a http response.
#
# @param {object} data - Data to send back in the response.
# @param {integer} code - Http status code.
# @param {string} type - Http content type.
# @param {boolean} json - Whether to respond in json.
# @param {boolean} text - Whether to respond in plain text.
# @returns {response} - Raw response object.
export respond = (data, {code, type, json, text} = {}) ->
  code ?= 200
  if json
    type = "application/json"
    # "Pretty-print" our JSON response with 2 spaces.
    data = JSON.stringify(data, undefined, 2)
  else if text
    type = "text/plain"
    data = String(data)
  else
    type ?= "application/octet-stream"
  new Response(data, {status: code, headers: {"Content-Type": type}})

export error = (reason = null, {code, type} = {}) ->
  code ?= 500
  type ?= "Internal Error"
  # An example would be: "Internal Error - 500 (this is the reason)"
  respond("#{code} - #{type}" + (if reason then " (#{reason})" else ""), code: code, text: true)

export badRequest = (reason = null) ->
  error(reason, code: 400, type: "Bad Request")

export notFound = (reason = null) ->
  error(reason, code: 404, type: "Not Found")

export tooManyRequests = (reason = null) ->
  error(reason, code: 429, type: "Too Many Requests")

# Convert common http error codes into error responses.
#
# @param {integer} code - Http status code.
# @returns {response|null} - An error response or null if a 200 code.
export coerce = (code) ->
  switch code
    when 200 then null
    # Some Minecraft APIs use 204 as a stand-in for a 404.
    when 204 then notFound()
    when 400 then badRequest()
    # Theoretically this should never happen, but sometimes does.
    when 429 then tooManyRequests()
    else error("Unknown Response", code: code)

The cf key can be used to control various Cloudflare features, including how sub-requests are cached. See the Workers documentation for a more in-depth explanation.

cf:
  cacheTtl: 120            # Cache for 2 mins.
  # Pro+ only.
  polish: "lossless"       # Compress image data.
  # Enterprise only.
  cacheTtlByStatus:
    "200-299": 60          # Cache for 60 secs.
    "300-399": 0           # Cache but expire instantly.
    "400-404": 10          # Cache for 10 secs.
    "405-599": -1          # Do not cache at all.
  cacheKey: url            # Cache lookup key, defaults to the request URL.

mojang.coffee

Now that we have code to send and parse requests, we can create an interface to retrieve data from the upstream APIs. It's good to note that there should be no mutation logic in this file; its purpose is just to get data from the old APIs, not to change it.

import { get } from "./http"

# Get the UUID of a username at the current time.
#
# @param {string} name - Minecraft username.
# @throws {204} - When no user exists with that name.
# @returns {[err, json]} - An error or username and UUID response.
export usernameToUuid = (name) ->
  get("https://api.mojang.com/users/profiles/minecraft/#{name}", json: true)

# Get the history of usernames for the given UUID.
#
# @param {string} id - The UUID to check the username history.
# @returns {[err, json]} - An error or the username history.
export uuidToUsernameHistory = (id) ->
  get("https://api.mojang.com/user/profiles/#{id}/names", json: true)

# Get the session profile of the UUID.
#
# @param {string} id - UUID to get the session profile.
# @returns {[err, json]} - An error or the session profile.
export uuidToProfile = (id) ->
  get("https://sessionserver.mojang.com/session/minecraft/profile/#{id}", json: true)

api.coffee

This is where the bulk of our API logic will reside. I've broken up the process into 3 interdependent tasks that are executed in order:

  1. Given a username, fetch its UUID.
  2. Given a UUID, fetch the user's profile.
  3. Given a user's profile, decode and fetch the textures.
import { get, respond, error, notFound, badRequest } from "./http"
import { usernameToUuid, uuidToProfile, uuidToUsernameHistory } from "./mojang"

# Get the uuid of a user given their username.
#
# @param {string} name - Minecraft username, must be alphanumeric 16 characters.
# @returns {[err, response]} - An error or the dashed uuid of the user.
export uuid = (name) ->
  if name.asUsername() # Fits regex of a Minecraft username.
    [err, res] = await usernameToUuid(name)
    if id = res?.id?.asUuid(dashed: true)
      [null, respond(id, text: true)]
    else # Response was received, but contains no UUID.
      [err || notFound(), null]
  else
    [badRequest("malformed username '#{name}'"), null]

# Get the full profile of a user given their uuid or username.
#
# @param {string} id - Minecraft username or uuid.
# @returns {[err, json]} - An error or user profile.
export user = (id) ->
  if id.asUsername()
    [err, res] = await uuid(id)
    if err # Could not find a player with that username.
      [err, null]
    else # Recurse with the new UUID.
      await user(id = await res.text())
  else if id.asUuid()
    # Fetch the profile and usernames in parallel.
    [[err0, profile], [err1, history]] = await Promise.all([
      uuidToProfile(id = id.asUuid())
      uuidToUsernameHistory(id)])
    # Extract the textures from the profile.
    # Since this operation is complex, off-load
    # the logic into its own method.
    [err2, texture] = await textures(profile)
    if err = err0 || err1 || err2 # One of the last three operations failed.
      [err, null]
    else # Everything is good, now just put the data together.
      [null, respond(
        uuid: profile.id.asUuid(dashed: true)
        username: profile.name
        username_history: history.map((item) ->
          username: item.name
          changed_at: item.changedToAt?.asDate())
        textures: texture
        cached_at: new Date(),
        json: true)]
  else
    [badRequest("malformed uuid '#{id}'"), null]

# Parse and decode base64 textures from the user profile.
#
# @param {json} profile - User profile from #uuidToProfile(id).
# @returns {json} - Enhanced user profile with more convenient texture fields.
textures = (profile) ->
  unless profile # Will occur if the profile api failed.
    return [error("no user profile found"), null]
  properties = profile.properties
  if properties.length == 1
    texture = properties[0]
  else
    texture = properties.filter((pair) -> pair.name == "textures" && pair.value?)[0]
  # If an embedded texture does not exist or is empty,
  # that user does not have a custom skin.
  if !texture || (texture = JSON.parse(atob(texture.value)).textures).isEmpty()
    skinUrl = "http://assets.mojang.com/SkinTemplates/steve.png"
  # Fetch the skin and cape data in parallel, and cache for a day.
  [skin, cape] = await Promise.all([
    get(skinUrl ?= texture.SKIN?.url, base64: true, ttl: 86400)
    get(capeUrl = texture.CAPE?.url, base64: true, ttl: 86400)])
  unless skin
    [error("unable to fetch skin '#{skinUrl}'"), null]
  else
    texture =
      slim: texture.SKIN?.metadata?.model == "slim"
      skin: {url: skinUrl, data: skin}
      cape: ({url: capeUrl, data: cape} if capeUrl)
    [null, texture]

index.coffee

Now, we parse the request's route and respond with the corresponding API.

import "./util" import { notFound } from "./http" import { uuid, user } from "./api" addEventListener("fetch", (event) -> event.respondWith(route(event.request))) route = (request) -> [base, version, method, id] = request.url.split("/")[3..6] if base == "mojang" && id? if version == "v1" v1(method, id) else notFound("unknown api version '#{version}'") else notFound("unknown route") v1 = (method, id) -> if method == "uuid" [err, res] = await uuid(id) else if method == "user" [err, res] = await user(id) err || res || notFound("unknown v1 route '#{method}'") util.coffee

Finally, we'll add some prototype extensions that we used along the way.

# Insert a string at a given index.
#
# @param {integer} i - Index to insert the string at.
# @param {string} str - String to insert.
String::insert = (i, str) ->
  this.slice(0, i) + str + this.slice(i)

# Ensure that the string is a valid Uuid.
#
# If dashed is enabled, it is possible the input
# string is not the same as the output string.
#
# @param {boolean} dashed - Whether to return a dashed uuid.
# @returns {string|null} - A uuid or null.
String::asUuid = ({dashed} = {}) ->
  if match = uuidPattern.exec(this)
    uuid = match[1..].join("")
    if dashed
      uuid.insert(8, "-")
          .insert(12+1, "-")
          .insert(16+2, "-")
          .insert(20+3, "-")
    else
      uuid

uuidPattern = /^([0-9a-f]{8})(?:-|)([0-9a-f]{4})(?:-|)(4[0-9a-f]{3})(?:-|)([0-9a-f]{4})(?:-|)([0-9a-f]{12})$/i

# Ensure that the string is a valid Minecraft username.
#
# @returns {string|null} - Minecraft username or null.
String::asUsername = ->
  if usernamePattern.test(this) then this else false

usernamePattern = /^[0-9A-Za-z_]{1,16}$/i

# Ensure that the unix number is a Date.
#
# @returns {date} - The number as a floored date.
Number::asDate = ->
  new Date(Math.floor(this))

# Determine if the object is empty.
#
# @returns {boolean} - Whether the object is empty.
Object::isEmpty = ->
  Object.keys(this).length == 0

Analyzing a Workers Deployment

I've had this code deployed and tested by real Minecraft users for the past few weeks. As a developer with global web traffic, it's pivotal that players can quickly get access to my services. The essential advantage of Workers is that I don't need to deploy several replicas of my code to different cloud regions; it's everywhere! That means players from any part of the world get the same great web experience with minimal latency.

Minecraft API with Workers + Coffeescript

As of today, the API is processing over 400k requests per day from users all over the world! Cloudflare caches responses in the closest point of presence to the client, so I don't need to set up a database and developers don't need to worry about rate limiting.

Minecraft API with Workers + Coffeescript

Since each request to the API generates 4 to 5 additional sub-requests, it handles approximately 1.8 million fetches per day with an 88% cache hit rate.

Minecraft API with Workers + Coffeescript

Wrapping Up

Cloudflare Workers have enabled me to solve complex technical problems without worrying about host infrastructure or cloud regions. It's simple, easy to deploy, and works blazing fast all around the world. And at 50 cents per 1 million requests, it's incomparable to the other serverless solutions on the market.

If you're not already convinced to start using Workers, here's the deployment history of my API. I went from 0 to 5 million requests with no scaling, no resizing, no servers, no clusters, and no containers. Just code.

Minecraft API with Workers + Coffeescript

If you're interested in looking at all of the code used in the post, you can find it here:
https://github.com/Electroid/mojang-api

And if you're a Minecraft developer, my API is open for you to use for free:

curl https://api.ashcon.app/mojang/v1/uuid/ElectroidFilms curl https://api.ashcon.app/mojang/v1/user/ElectroidFilms

You can also use this extra goodie that will crop just the face from a player texture:

curl https://api.ashcon.app/mojang/v1/avatar/ElectroidFilms > avatar.png open avatar.png
Categories: Technology

Q2 FY 18 Product Releases, for a better Internet “end-to-end”

Thu, 26/07/2018 - 19:35
Q2 FY 18 Product Releases, for a better Internet “end-to-end”

Q2 FY 18 Product Releases, for a better Internet “end-to-end”
Photo by Liu Zai Hou / Unsplash

In Q2, Cloudflare released several products which enable a better Internet “end-to-end” — from the mobile client to host infrastructure. Now anyone, from an individual developer to large companies and governments, can control, secure, and accelerate their applications from the “perimeter” back to the “host.”

On the client side, Cloudflare’s Mobile SDK extends control directly into your mobile apps, providing visibility into application performance and load times across any global carrier network.

On the host side, Cloudflare Workers lets companies move workloads from their host to the Cloudflare Network, reducing infrastructure costs and speeding up the user experience. Argo Tunnel lets you securely connect your host directly to a Cloudflare data center. If your host infrastructure is running other TCP services besides HTTP(S), you can now protect it with Cloudflare’s DDoS protection using Spectrum.

So for end-to-end control that is easy and fast to deploy, these recent products are all incredible “workers” across the “spectrum” of your needs.

But there’s more to the story

End users want richer experiences, such as more video, interactivity, and images. Meeting those needs can incur real costs in bandwidth, hardware, and time. Cloudflare addresses these with three products that improve video delivery, reduce paint times, and shrink the round-trip times.

Cloudflare now simplifies and reduces the delivery cost of video with Stream Delivery. Pages using plenty of Javascript now have faster paint times and wider mobile-device support with Rocket Loader. If you’re managing multiple origins and want to ensure the fastest delivery based on the shortest round-trip time, Cloudflare Load Balancer now supports Dynamic Steering.

Attackers are shifting their focus to the application layer. Some security features, like CAPTCHA and Javascript Challenge, give you more control and reduce false-positives when blocking rate-based threats at the edge, such as layer 7 DDoS or brute-force attacks.

Finally, Cloudflare extended privacy to consumers through the launch of our DNS resolver 1.1.1.1 on 4/1/2018! Now users who set their DNS resolvers to 1.1.1.1 can browse faster while protecting browser data with Cloudflare’s privacy-first consumer DNS service.

Here is a recap, from April to June, of the features we released in Q2:
Dynamic Steering

Tue, July 10, 2018
Dynamic steering is a load balancing feature that automates traffic steering across origins in multiple geographic regions. Round-trip time (RTT) for health checks is calculated across multiple pools of load balanced servers and origins to determine the fastest server pools. This RTT data enables the load balancers to identify the fastest pools, and to direct user requests to the most responsive origins.

Support for New DNS Record Types

Thu, July 5, 2018
Cloudflare's Authoritative DNS now supports the following record types: CERT, DNSKEY, DS, NAPTR, SMIMEA, SSHFP, TLSA, and URI via the web and API.

Developer Portal Q2 Update

Mon, June 11, 2018
The Developer Portal has been updated in Q2 to include improved search, documentation for new products, and listings of upcoming Cloudflare community events.

Rocket Loader Upgrade

Fri, June 1, 2018
Rocket Loader has been updated to deliver faster performance for website paint and load times by prioritizing website content over JavaScript. The majority of mobile devices are now supported, and compliance with strict content security policies has increased.

Stream Delivery

Thu, May 31, 2018
Cloudflare’s Stream Delivery solution offers fast caching and delivery of video content across our network of 150+ global data centers.

Deprecating TLS 1.0 and 1.1 on api.cloudflare.com

Tue, May 29, 2018
On June 4, Cloudflare will be dropping support for TLS 1.0 and 1.1 on api.cloudflare.com. Additionally, the dashboard will be moved from www.cloudflare.com/a to dash.cloudflare.com and will require a browser that supports TLS 1.2 or higher.
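
To check ahead of the cutoff that your client can negotiate TLS 1.2, you can force the version with curl; a quick sketch (the unauthenticated /ips endpoint is just a convenient test target):

❯ curl --tlsv1.2 https://api.cloudflare.com/client/v4/ips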

Rate Limiting has new Actions and Triggers

Mon, May 21, 2018
Rate Limiting has two new features: challenges (CAPTCHA and JS Challenge) as an Action; and matching Header attributes in the response (from either origin or the cache) as the Trigger. These features give more control over how Cloudflare Rate Limiting responds to threshold violations, giving customers granularity over the types of requests to "count" to fit their different applications. To learn more, go to the blog post.

Support purge-by-tag for large tag sizes

Thu, May 10, 2018
The Cache-Tag header now supports up to 1000 tags and a total header length of 16kb. This update simplifies file purges for customers who deploy websites with Drupal.

Multi-User Access on dash.cloudflare.com

Wed, May 2, 2018
Starting May 2 2018, users can go to the new home of Cloudflare’s Dashboard at dash.cloudflare.com and share account access. This has been supported at our Enterprise level of service, but is now being extended to all customers.

Support full SSL (Strict) mode validation for CNAME domains

Thu, April 12, 2018
Cloudflare is now able to validate origin certificates that use a hostname's CNAME target in Full SSL (Strict) mode. Previously, Cloudflare would not validate any certificate without a direct match of the HTTP hostname and the certificate's Common Name or SAN. This update allows SSL for SaaS customers to more easily enable end-to-end security.
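
To see which names a given origin certificate actually presents (and therefore what can be validated), you can inspect its subject and SANs; a sketch, with origin.example.com as a placeholder hostname and assuming OpenSSL 1.1.1+ for the -ext flag:

❯ openssl s_client -connect origin.example.com:443 -servername origin.example.com </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -ext subjectAltName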

Cloudflare Spectrum

Thu, April 12, 2018
Spectrum protects TCP applications and ports from volumetric DDoS attacks and data theft by proxying non-web traffic through Cloudflare’s Anycast network.

Workers Can Control Cache TTL by Response Code

Wed, April 11, 2018
Cloudflare Workers can now control cache TTL by response code, providing greater control over cached assets.

Argo Tunnel

Thu, April 5, 2018
Argo Tunnel ensures that no visitor or attacker can reach your web server unless they first pass through Cloudflare. Using a lightweight agent installed on your origin, Cloudflare creates an encrypted tunnel between your host infrastructure and our nearest data centers without opening a public inbound port. It’s more secure, more performant, and easier to manage than exposing your services publicly.

Categories: Technology

The Road to QUIC

Thu, 26/07/2018 - 16:04
The Road to QUIC

QUIC (Quick UDP Internet Connections) is a new encrypted-by-default Internet transport protocol that provides a number of improvements designed to accelerate HTTP traffic and make it more secure, with the intended goal of eventually replacing TCP and TLS on the web. In this blog post we outline some of the key features of QUIC, how they benefit the web, and some of the challenges of supporting this radical new protocol.

The Road to QUIC

There are in fact two protocols that share the same name: “Google QUIC” (“gQUIC” for short) is the original protocol designed by Google engineers several years ago, which, after years of experimentation, has now been adopted by the IETF (Internet Engineering Task Force) for standardization.

“IETF QUIC” (just “QUIC” from now on) has already diverged from gQUIC quite significantly such that it can be considered a separate protocol. From the wire format of the packets, to the handshake and the mapping of HTTP, QUIC has improved the original gQUIC design thanks to open collaboration from many organizations and individuals, with the shared goal of making the Internet faster and more secure.

So, what are the improvements QUIC provides?

Built-in security (and performance)

One of QUIC’s more radical deviations from the now venerable TCP is its stated design goal of providing a secure-by-default transport protocol. QUIC accomplishes this by providing security features, like authentication and encryption, that are typically handled by a higher-layer protocol (like TLS), from within the transport protocol itself.

The initial QUIC handshake combines the typical three-way handshake that you get with TCP, with the TLS 1.3 handshake, which provides authentication of the end-points as well as negotiation of cryptographic parameters. For those familiar with the TLS protocol, QUIC replaces the TLS record layer with its own framing format, while keeping the same TLS handshake messages.

Not only does this ensure that the connection is always authenticated and encrypted, but it also makes the initial connection establishment faster as a result: the typical QUIC handshake only takes a single round-trip between client and server to complete, compared to the two round-trips required for the TCP and TLS 1.3 handshakes combined.

The Road to QUIC The Road to QUIC

But QUIC goes even further, and also encrypts additional connection metadata that could be abused by middle-boxes to interfere with connections. For example, packet numbers could be used by passive on-path attackers to correlate users’ activity over multiple network paths when connection migration is employed (see below). By encrypting packet numbers, QUIC ensures that they can’t be used to correlate activity by any entity other than the end-points in the connection.

Encryption can also be an effective remedy to ossification, which makes flexibility built into a protocol (such as being able to negotiate different versions of that protocol) impossible to use in practice due to wrong assumptions made by implementations. Ossification is what delayed deployment of TLS 1.3 for so long: it was only possible after several changes, designed to prevent ossified middle-boxes from incorrectly blocking the new revision of the TLS protocol, were adopted.

Head-of-line blocking

One of the main improvements delivered by HTTP/2 was the ability to multiplex different HTTP requests onto the same TCP connection. This allows HTTP/2 applications to process requests concurrently and better utilize the network bandwidth available to them.

This was a big improvement over the then status quo, which required applications to initiate multiple TCP+TLS connections if they wanted to process multiple HTTP/1.1 requests concurrently (e.g. when a browser needs to fetch both CSS and Javascript assets to render a web page). Creating new connections requires repeating the initial handshakes multiple times, as well as going through the initial congestion window ramp-up, which means that rendering of web pages is slowed down. Multiplexing HTTP exchanges avoids all that.

The Road to QUIC

This however has a downside: since multiple requests/responses are transmitted over the same TCP connection, they are all equally affected by packet loss (e.g. due to network congestion), even if the data that was lost only concerned a single request. This is called “head-of-line blocking”.

QUIC goes a bit deeper and provides first-class support for multiplexing, such that different HTTP streams can in turn be mapped to different QUIC transport streams. The streams still share the same QUIC connection, so no additional handshakes are required and congestion state is shared, but they are delivered independently, so in most cases packet loss affecting one stream doesn't affect the others.

This can dramatically reduce the time required to, for example, render complete web pages (with CSS, Javascript, images, and other kinds of assets) particularly when crossing highly congested networks, with high packet loss rates.

That easy, huh?

In order to deliver on its promises, the QUIC protocol needs to break some of the assumptions that were taken for granted by many network applications, potentially making implementations and deployment of QUIC more difficult.

QUIC is designed to be delivered on top of UDP datagrams, to ease deployment and avoid problems coming from network appliances that drop packets from unknown protocols, since most appliances already support UDP. This also allows QUIC implementations to live in user-space, so that, for example, browsers will be able to implement new protocol features and ship them to their users without having to wait for operating systems updates.

However despite the intended goal of avoiding breakage, it also makes preventing abuse and correctly routing packets to the correct end-points more challenging.

One NAT to bring them all and in the darkness bind them

Typical NAT routers can keep track of TCP connections passing through them by using the traditional 4-tuple (source IP address and port, and destination IP address and port), and by observing TCP SYN, ACK and FIN packets transmitted over the network, they can detect when a new connection is established and when it is terminated. This allows them to precisely manage the lifetime of NAT bindings, the association between the internal IP address and port, and the external ones.

With QUIC this is not yet possible, since NAT routers deployed in the wild today do not understand QUIC, so they typically fall back to the default, less precise handling of UDP flows, which usually involves using arbitrary, and at times very short, timeouts that can affect long-running connections.

When a NAT rebinding happens (due to a timeout for example), the end-point on the outside of the NAT perimeter will see packets coming from a different source port than the one that was observed when the connection was originally established, which makes it impossible to track connections by only using the 4-tuple.

The Road to QUIC

And it's not just NAT! One of the features QUIC is intended to deliver is called “connection migration” and will allow QUIC end-points to migrate connections to different IP addresses and network paths at will. For example, a mobile client will be able to migrate QUIC connections between cellular data networks and WiFi when a known WiFi network becomes available (like when its user enters their favorite coffee shop).

QUIC tries to address this problem by introducing the concept of a connection ID: an arbitrary opaque blob of variable length, carried by QUIC packets, that can be used to identify a connection. End-points can use this ID to track connections that they are responsible for without the need to check the 4-tuple (in practice there might be multiple IDs identifying the same connection, for example to avoid linking different paths when connection migration is used, but that behavior is controlled by the end-points not the middle-boxes).

However this also poses a problem for network operators that use anycast addressing and ECMP routing, where a single destination IP address can potentially identify hundreds or even thousands of servers. Since edge routers used by these networks also don't yet know how to handle QUIC traffic, it might happen that UDP packets belonging to the same QUIC connection (that is, with the same connection ID) but with different 4-tuple (due to NAT rebinding or connection migration) might end up being routed to different servers, thus breaking the connection.

The Road to QUIC

In order to address this, network operators might need to employ smarter layer 4 load balancing solutions, which can be implemented in software and deployed without the need to touch edge routers (see for example Facebook's Katran project).

QPACK

Another benefit introduced by HTTP/2 was header compression (or HPACK) which allows HTTP/2 end-points to reduce the amount of data transmitted over the network by removing redundancies from HTTP requests and responses.

In particular, among other techniques, HPACK employs dynamic tables populated with headers that were sent (or received) from previous HTTP requests (or responses), allowing end-points to reference previously encountered headers in new requests (or responses), rather than having to transmit them all over again.

HPACK's dynamic tables need to be synchronized between the encoder (the party that sends an HTTP request or response) and the decoder (the one that receives them), otherwise the decoder will not be able to decode what it receives.

With HTTP/2 over TCP this synchronization is transparent: since the transport layer (TCP) takes care of delivering HTTP requests and responses in the same order they were sent in, the instructions for updating the tables can simply be sent by the encoder as part of the request (or response) itself, making the encoding very simple. But for QUIC this is more complicated.

QUIC can deliver multiple HTTP requests (or responses) over different streams independently, which means that while it takes care of delivering data in order as far as a single stream is concerned, there are no ordering guarantees across multiple streams.

For example, if a client sends HTTP request A over QUIC stream A and request B over stream B, packet reordering or loss in the network might cause request B to reach the server before request A. If request B was encoded to reference a header from request A, the server will be unable to decode it, since it hasn't yet seen request A.

In the gQUIC protocol this problem was solved by simply serializing all HTTP request and response headers (but not the bodies) over the same gQUIC stream, which meant headers would get delivered in order no matter what. This is a very simple scheme that allows implementations to reuse a lot of their existing HTTP/2 code, but on the other hand it increases the head-of-line blocking that QUIC was designed to reduce. The IETF QUIC working group thus designed a new mapping between HTTP and QUIC (“HTTP/QUIC”) as well as a new header compression scheme called “QPACK”.

In the latest draft of the HTTP/QUIC mapping and the QPACK spec, each HTTP request/response exchange uses its own bidirectional QUIC stream, so there's no head-of-line blocking. In addition, in order to support QPACK, each peer creates two additional unidirectional QUIC streams, one used to send QPACK table updates to the other peer, and one to acknowledge updates received by the other side. This way, a QPACK encoder can use a dynamic table reference only after it has been explicitly acknowledged by the decoder.
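
Here's a loose sketch of that encoder-side rule (this is the idea only; the real QPACK instructions, indexing and stream framing are quite different): table inserts travel on the encoder's unidirectional stream, and a dynamic-table reference is only emitted once the decoder has acknowledged the entry:

class QpackEncoderSketch {
  constructor(sendOnEncoderStream) {
    this.send = sendOnEncoderStream // unidirectional stream for table updates
    this.entries = []               // dynamic table, newest entries last
    this.ackedUpTo = 0              // advanced by acks on the decoder stream
  }
  onAck(index) { this.ackedUpTo = Math.max(this.ackedUpTo, index + 1) }
  encode(name, value) {
    const idx = this.entries.findIndex(([n, v]) => n === name && v === value)
    if (idx >= 0 && idx < this.ackedUpTo) return { indexed: idx } // safe reference
    if (idx < 0) {
      this.entries.push([name, value])
      this.send({ insert: [name, value] }) // update travels on its own stream
    }
    // Until the insert is acknowledged, send the literal so the header
    // block can never arrive before the entry it would depend on.
    return { literal: [name, value] }
  }
}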

Deflecting Reflection

A common problem among UDP-based protocols is their susceptibility to reflection attacks, where an attacker tricks an otherwise innocent server into sending large amounts of data to a third-party victim, by spoofing the source IP address of packets targeted to the server to make them look like they came from the victim.

The Road to QUIC

This kind of attack can be very effective when the response sent by the server happens to be larger than the request it received, a property known as “amplification”.

TCP is not usually used for this kind of attack because the initial packets transmitted during its handshake (SYN, SYN+ACK, …) have the same length, so they don’t provide any amplification potential.

QUIC’s handshake on the other hand is very asymmetrical: like for TLS, in its first flight the QUIC server generally sends its own certificate chain, which can be very large, while the client only has to send a few bytes (the TLS ClientHello message embedded into a QUIC packet). For this reason, the initial QUIC packet sent by a client has to be padded to a specific minimum length (even if the actual content of the packet is much smaller). However this mitigation is still not sufficient, since the typical server response spans multiple packets and can thus still be far larger than the padded client packet.
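
A back-of-the-envelope sketch of why padding alone isn't enough; the server flight size here is invented, and the 1200-byte minimum reflects the draft's requirement at the time of writing:

const MIN_CLIENT_INITIAL = 1200        // padded client Initial (per the draft)
const serverFirstFlight = 4 * 1280     // hypothetical: cert chain across 4 packets

const amplification = serverFirstFlight / MIN_CLIENT_INITIAL
console.log(`amplification: ${amplification.toFixed(1)}x`) // ~4.3x, despite padding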

The QUIC protocol also defines an explicit source-address verification mechanism, in which the server, rather than sending its long response, only sends a much smaller “retry” packet which contains a unique cryptographic token that the client will then have to echo back to the server inside a new initial packet. This way the server has a higher confidence that the client is not spoofing its own source IP address (since it received the retry packet) and can complete the handshake. The downside of this mitigation is that it increases the initial handshake duration from a single round-trip to two.
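
A simplified sketch of that retry-token exchange from the server's point of view (real tokens also carry expiry information and feed back into the handshake; the HMAC construction here is just one plausible choice):

const crypto = require('crypto')
const SECRET = crypto.randomBytes(32) // per-server token secret

function makeToken(addr, port) {
  // Bind the token to the client's claimed source address.
  return crypto.createHmac('sha256', SECRET).update(`${addr}:${port}`).digest('hex')
}

function handleInitial(addr, port, token) {
  if (token !== makeToken(addr, port)) {
    // Small retry packet: no amplification, and only the true owner
    // of this source address will ever receive (and echo) the token.
    return { retry: makeToken(addr, port) }
  }
  return { proceed: true } // address validated, continue the handshake
}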

An alternative solution involves reducing the server's response to the point where a reflection attack becomes less effective, for example by using ECDSA certificates (which are typically much smaller than their RSA counterparts). We have also been experimenting with a mechanism for compressing TLS certificates using off-the-shelf compression algorithms like zlib and brotli, which is a feature originally introduced by gQUIC but not currently available in TLS.

UDP performance

One of the recurring issues with QUIC involves existing hardware and software deployed in the wild not being able to understand it. We've already looked at how QUIC tries to address network middle-boxes like routers, but another potentially problematic area is the performance of sending and receiving data over UDP on the QUIC end-points themselves. Over the years a lot of work has gone into optimizing TCP implementations as much as possible, including building off-loading capabilities in both software (like in operating systems) and hardware (like in network interfaces), but none of that is currently available for UDP.

However it’s only a matter of time until QUIC implementations can take advantage of these capabilities as well. Look, for example, at the recent efforts to implement Generic Segmentation Offloading for UDP on Linux, which would allow applications to transfer multiple UDP segments between user-space and the kernel-space networking stack at the cost of (roughly) a single one, and at the effort to add zero-copy socket support on Linux, which would allow applications to avoid the cost of copying user-space memory into kernel-space.

Conclusion

Like HTTP/2 and TLS 1.3, QUIC is set to deliver a lot of new features designed to improve the performance and security of web sites, as well as other Internet-based properties. The IETF working group is currently set to deliver the first version of the QUIC specifications by the end of the year, and Cloudflare engineers are already hard at work to provide the benefits of QUIC to all of our customers.

Categories: Technology

1.1.1.1 for Your Organization

Wed, 25/07/2018 - 16:00
1.1.1.1 for Your Organization

1.1.1.1 for Your Organization

A few months ago, we announced the world’s fastest, privacy-first, recursive DNS resolver, 1.1.1.1. It’s been exciting watching the community reaction to this project, and to be in a position where we can promote new standards around private DNS.

The Cloudflare network helps to make measurable improvements to the Internet by rolling out security updates to millions of websites at once. This allows us to provide free SSL certificates to any website, and to implement state-of-the-art security for our customers.

We saw the same potential impact when deciding to build 1.1.1.1. From launch, we wanted people to be able to connect to their favorite websites faster, and to ensure that no entity between their computer and the origin web server was recording their browsing history. We’re proud to have achieved that goal with the fastest public DNS resolver in the world.

Consumer adoption of the resolver has been strong, and it makes sense: new legislation allows ISPs to track and sell your web history. But, not everyone feels comfortable changing the default DNS resolver on their computer or home network. We want to empower IT departments and network administrators to change the default DNS resolver for their organization, at the network or device level. Our fast, privacy-centric 1.1.1.1 project can secure your users on the Internet, and you’ll always know that they’ll be the first to benefit from the work of Internet standards bodies like the IETF.

If you, or your IT department, are interested, please get in touch! We’d be delighted to answer your questions and do our best to send you some trendy 1.1.1.1 stickers.

1.1.1.1 for Your Organization

Categories: Technology

Going Proactive on Security: Driving Encryption Adoption Intelligently

Tue, 24/07/2018 - 18:32
 Driving Encryption Adoption Intelligently

It's no secret that Cloudflare operates at a huge scale. Cloudflare provides security and performance to over 9 million websites all around the world, from small businesses and WordPress blogs to Fortune 500 companies. That means one in every 10 web requests goes through our network.

However, hidden behind the scenes, we offer support to all our customers in using our platform - whether they're on our free plan or on our Enterprise offering. This blog post dives into some of the technology that helps make this possible and how we're using it to drive encryption adoption and build a better web.

Why Now?

Recently, web browser vendors have been working on extending encryption on the internet. Traditionally they would use positive indicators to mark encrypted traffic as secure; when traffic was served securely over HTTPS, a green padlock in your browser would indicate that this was the case. In moving to standardise encryption online, Google Chrome has been leading the charge in marking insecure page loads as "Not Secure". Today, this UI change has been pushed out to all Google Chrome users globally for all websites: any website loaded over HTTP will be marked as insecure.

 Driving Encryption Adoption Intelligently

That's not all though; all resources loaded by a website need to be loaded over HTTPS, and such sites need to be configured properly to avoid mixed-content warnings, not to mention correctly configuring secure cryptography at the web server. Cloudflare helped bring widespread adoption of HTTPS to the internet by offering SSL certificates free of charge; in doing so we've become experts at knowing where web developers trip up in configuring HTTPS on their websites. HTTPS is now important for everyone who builds on the web, not just those with an interest in cryptography.

Meet HelperBot

In recent months, we’ve taken this expertise and used it to help our Cloudflare customers avoid common mistakes. One of the things my team and I have been building is an intelligent system which automatically triages support tickets and presents relevant debugging information upfront to the agent assigned to the ticket.

We use a custom-built Natural Language Processing model to determine which issues the customer is discussing, and then run technical tests in a Chain-of-Responsibility pattern (with the tests most relevant to the customer running first) to determine what's going wrong. We then automatically triage the ticket and present this information to the support agent in the ticket.
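
As an illustration of that chain-of-responsibility flow (hypothetical checks, not our actual internal system): diagnostics run in order of predicted relevance, and the first one that fires is surfaced to the agent:

// Each check inspects the site and reports whether it found the problem.
const checks = [
  { name: 'mixed-content', run: site => site.hasMixedContent },
  { name: 'redirect-loop', run: site => site.redirectsToItself },
  { name: 'expired-cert',  run: site => site.certExpired },
]

function triage(site, rankedChecks) {
  for (const check of rankedChecks) {
    if (check.run(site)) return check.name // present this upfront to the agent
  }
  return null // nothing detected; fall back to manual debugging
}

// e.g. triage({ hasMixedContent: true }, checks) => 'mixed-content'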

Here's an example of a piece of the information we present upfront:

 Driving Encryption Adoption Intelligently

Whilst we initially built the automated debugging tests by hand, we soon used Search-Based Software Engineering strategies to have the system write its own debugging automations based on various data points (such as the underlying technologies powering a site, their configuration or their error rates). When we detect anomalies, we are able to present this information upfront to our support agents to reduce the manual debugging they must conduct. In essence, we are able to get the software to write itself from test behaviour, within reason.

 Driving Encryption Adoption Intelligently

Whilst this data is mostly used internally, we are starting to A/B test new versions of our support ticket submission form which present a subset of this information upfront to users before they write in to us - allowing them to get answers to their problems quicker.

 Driving Encryption Adoption Intelligently

Being Proactive About Security

To help drive adoption of a more secure internet - and drive down common misconfigurations of SSL - we have started to proactively email customers about Mixed Content errors and Redirect Loops associated with HTTPS web server misconfigurations.

By joining forces with our Marketing team, we were able to run an ongoing campaign testing how users respond to proactive security advice. Users receive messages similar to the one below.

 Driving Encryption Adoption Intelligently

With this capability, we decided to expose the functionality to a wider audience, including those not already using Cloudflare.

SSL Test Tool (Powered by HelperBot-External)

 Driving Encryption Adoption Intelligently

To help website owners make the transition to HTTPS, we've launched the SSL Test Tool. We internally codenamed the backend as HelperBot-External, after the internal HelperBot service. We decided to take a subset of the SSL tests we use internally and allow someone to run a basic version of the scan on their own site. This helps users understand what they need to do to move their site to HTTPS by detecting the most common issues. By doing so, we seek to help users who are struggling to get over the line in enabling HTTPS on their sites by providing them some dynamic guidance in a plain-English fashion.

The tool runs 12 tests across three key categories of errors: HTTPS Disabled, Client Errors and Cryptography Errors. Unlike other tools, these tests are based on the questions we see real users ask about their SSL configuration and the tasks they most struggle with. This is a tool designed to support all web developers in enabling HTTPS, not just those with an interest in cryptography. For example, by educating users about mixed content errors, we are able to make the case for them enabling HTTP Strict Transport Security, thereby improving the security practices they adopt.
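
To give a flavour of the simplest kind of check such a scanner can run (this is an illustrative sketch, not the tool's actual implementation, and it assumes a runtime with fetch available): load a page over HTTPS and flag sub-resources still referenced over plain HTTP, i.e. mixed content:

async function findMixedContent(url) {
  const html = await (await fetch(url)).text()
  // Naive scan for src/href attributes that still point at http://
  const matches = [...html.matchAll(/(?:src|href)\s*=\s*["'](http:\/\/[^"']+)["']/gi)]
  return matches.map(m => m[1]) // list of insecure sub-resource URLs
}

// findMixedContent('https://example.com').then(console.log)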

Further, these tests are available to everyone. We believe it’s important that the entire Internet be safer, not only for our customers and their visitors (although, admittedly, Cloudflare’s SSL and crypto features make it very simple to be HTTPS-ready).

Conclusion: Just the Beginning

We are growing our intelligence capabilities to provide better performance and security to our customers. We want to build a better internet and make our users more successful on our platform. Whilst there's still plenty of ground left to cover in building out our intelligent support capability, we're developing rapidly and focussed on using those skills to improve the things our customers care about.

Categories: Technology

Cloudflare Access: Now teams of any size can turn off their VPN

Tue, 24/07/2018 - 17:15
 Now teams of any size can turn off their VPN

 Now teams of any size can turn off their VPN

Using a VPN is painful. Logging in interrupts your workflow. You have to remember a separate set of credentials, which your administrator has to manage. The VPN slows you down when you're away from the office. Beyond mere inconvenience, a VPN can pose a real security risk: a single infected device or malicious user can compromise your network once inside the perimeter.

In response, large enterprises have deployed expensive zero trust solutions. The name sounds counterintuitive - don’t we want to add trust to our network security? Zero trust refers to the default state of these tools: they trust no one; each request has to prove itself. This architecture, most notably demonstrated at Google with BeyondCorp, has allowed teams to start to migrate to a more secure method of access control.

However, users of zero trust tools still suffer from the same latency problems they endured with old-school VPNs. Even worse, the price tag puts these tools out of reach for most teams.

Here at Cloudflare, we shared those same frustrations with VPNs. After evaluating our options, we realized we could build a better zero trust solution by leveraging some of the unique capabilities we have here at Cloudflare:

Our global network of data centers

Cloudflare’s network spans 150+ data centers around the globe. With a data center within 10 ms of 95% of the world’s internet-connected population, we can bring content closer to the end user. We could beat the performance of both VPNs and existing zero trust tools by evaluating permissions and serving pages at the edge of our network.

Cloudflare already protects your sites from threats

Cloudflare shields your site from attacks by sitting between your server and the rest of the internet. We could build on that experience by shielding your site from unauthorized users before the request ever reaches your origin.

With these foundations, we were able to build Cloudflare Access as a fast and secure way to protect applications. We started by using it internally. We migrated applications from our VPN to Access and suddenly our self-hosted tools felt like SaaS products.

We launched Access into beta at the start of 2018. Today, we are excited to announce the release of Cloudflare Access to all customers at a price that makes it affordable for teams of any size to leave their VPN behind.

A Quick Recap of Cloudflare Access

Cloudflare Access controls who can reach your internal resources. You don’t need to change your hosting or add new components to your site to integrate with an identity provider. Access does the work for you.

Before any requests reach your origin, Access checks to make sure they are approved based on policies you configure. We integrate with popular identity providers, like GSuite and Okta, so that you don’t have to manage a new set of credentials.

When your team members need to get to their tools and documents behind Access, they will login with the identity provider credentials managed by your organization. Once authorized, they’ll be able to access those protected resources for a duration that you define.

Your team can use your self-hosted tools as if they were a SaaS deployment. Cloudflare’s global network of 150+ data centers puts those resources closer to your end users, regardless of their location. Your administrators can control groups that should or shouldn’t be able to reach certain materials and review an audit log of account logins.

BeyondCorp for YourCorp

Starting today, you will be able to sign up for an Access plan sized to meet the needs of your team. Access Basic only costs USD $3 per user, per month. The Basic plan can be connected to social identity providers, like Facebook or GitHub. The Access Premium plan starts at USD $5 per user, per month, and integrates with corporate identity providers like Okta, OneLogin, and G Suite. The price per user decreases for larger teams.

As in the beta, the first five users are still free.

Cloudflare wants to make enterprise-grade security available to every team. With Access, teams can select a plan that fits their size. Whether you have 5 or 5,000 employees, Access can ensure that your entire team has secure and fast access to the tools they need.

New Policies: Control by IP or Build a Detour

Access works by requiring that users authenticate with their identity provider credentials to reach your site, or sections of it. However, sometimes you need to open paths for external services or outside groups of users.

As part of today’s release, you can create policies based on IP addresses. For example, if you have a secure office network, you can whitelist the office’s IP. Users outside of the office will be required to authenticate with their IdP. Or you can require that a user both authenticate against the IdP and be using a specific IP address.
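
Expressed as data (a hypothetical illustration only, not the actual Access policy schema), the two arrangements described above might look like this:

const policies = [
  // Whitelist the office network: no IdP prompt needed from this range.
  { decision: 'allow', include: [{ ipRange: '198.51.100.0/24' }] },
  // Require both an IdP-backed identity and a specific source IP.
  {
    decision: 'allow',
    include: [{ emailDomain: 'example-corp.com' }],
    require: [{ ipRange: '203.0.113.7/32' }],
  },
]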

You can also build a detour to allow traffic to a specified path or subdomain to bypass Access. When enabled, Access will not check requests to that destination for authorization tokens. Traffic will still be protected by your standard Cloudflare features, like DDoS mitigation and SSL encryption.

This is helpful when third-party services need to reach your site. Say you manage a WordPress site where you want to control who can access protected resources. WordPress can provide additional functionality by creating a connection between the browser and the server using AJAX. To do so, WordPress needs to reach a particular endpoint. With Bypass, you can allow traffic to reach that endpoint while protecting the rest of your site.

A Quick Demo of New Policy Rules

When creating an Access policy, you can build with Allow or Deny criteria. In that same dropdown, you’ll find the new Bypass policy type. As described above, Access will ignore traffic set to bypass (whether it’s for the entire site or just a section of it).

 Now teams of any size can turn off their VPN

When defining policy rules, you can now use new criteria: IP Ranges and Everyone. You can configure Access to allow or deny requests that meet these profiles.

 Now teams of any size can turn off their VPN

Access Groups

Zero trust solutions let you control who can access tools at a level more specific than “all.” However, defining access policies for the same set of individual users can be tedious. If you have a team of four engineers, and you want to connect them to multiple internal tools, you need to rebuild that “group” each time.

Starting today, you can create an Access Group to quickly apply policies to a set of users that meet membership rules you define. You can build groups based on a number of criteria. For example, create a group that only includes team members in a secure office by specifying the IP range. Or build a super group that consists of multiple, smaller groups defined in your identity provider.

Once you define Access Groups, you can create policies that apply to groups. Access Groups can be reused across sites in your account so that you can quickly reuse membership rules to create policies for all of your tools. Just select the Access Group from the dropdown. Whether you want to include your engineering team, require admin accounts, or exclude certain departments, you can do it with Access Groups.

A quick demo of Access Groups

To create an Access Group, start by giving it a name. Groups use the same rule types as policies; you can configure group membership criteria based on inclusion, exclusion, and requirement.

 Now teams of any size can turn off their VPN

Once you select the type of filter, you can define membership rules based on email addresses, IP ranges, or groups from your identity provider.

When you have saved your group, you can return to modify a policy, or create a new one, and select your Access Group from the drop-down list to build policies based on it.

 Now teams of any size can turn off their VPN

What’s next?

The new features are available today to all Access customers. You can read the documentation here. To our beta customers - thank you for helping make Access better! You can continue to use Access in your current arrangement for the next 30 days. After August 24th, you will need to sign up for a plan. We’re excited to help your team turn off your VPN and improve the speed and security of your most important tools.

Categories: Technology

Today, Chrome Takes Another Step Forward in Addressing the Design Flaw That is an Unencrypted Web

Tue, 24/07/2018 - 16:04
Today, Chrome Takes Another Step Forward in Addressing the Design Flaw That is an Unencrypted Web

The following is a guest post by Troy Hunt, an award-winning security expert, blogger, and Pluralsight author. He’s also the creator of the popular Have I Been Pwned?, the free aggregation service that helps the owners of over 5 billion accounts impacted by data breaches.

Today, Chrome Takes Another Step Forward in Addressing the Design Flaw That is an Unencrypted Web

I still clearly remember my first foray onto the internet as a university student back in the mid 90's. It was a simpler online time back then, of course; we weren't doing our personal banking or our tax returns or handling our medical records so the whole premise of encrypting the transport layer wasn't exactly a high priority. In time, those services came along and so did the need to have some assurances about the confidentiality of the material we were sending around over other people's networks and computers. SSL as it was at the time was costly, but hey, banks and the like could absorb that given the nature of their businesses. However, at the time, there were all sorts of problems with the premise of serving traffic securely ranging from the cost of certs to the effort involved in obtaining and configuring them through to the performance hit on the infrastructure. We've spent the last couple of decades fixing these shortcomings and subsequently, driving site owners towards a more secure web. Today represents just one more step in that journey: as of today, Chrome is flagging all non-secure connections as... not secure!

I want to delve into the premise of this a little deeper because certainly there are those who question the need for the browser to be so shouty about a lack of encryption. I particularly see this point of view expressed as it relates to sites without the need for confidentiality, for example a static site that collects no personal data. But let me set the stage for this blog post because we're actually addressing a very fundamental problem here:

The push for HTTPS is merely addressing a design flaw with the original, unencrypted web.

I mean think about it - we've been plodding along standing up billions of websites and usually having no idea whether requests are successfully reaching the correct destination, whether they've been observed, tampered with, logged or otherwise mishandled somewhere along the way. We'd never sit down and design a network like this today but as with so many aspects of the web, we're still dealing with the legacy of decisions made in a very different time.

So back to Chrome for a moment and the "Not secure" visual indicator. When I run training on HTTPS, I load up a website in the browser over a secure connection and I ask the question - "How do we know this connection is secure?" It's a question usually met by confused stares as we literally see the word "Secure" sitting up next to the address bar. We know the connection is secure because the browser tells us this explicitly. Now, let's try it with a site loaded over an insecure connection - "How do we know this connection is not secure?" And the penny drops, because the answer is always "We know it's not secure because it doesn't tell us that it is secure"! Isn't that an odd inversion? Or rather, it was an odd inversion, because as of today both secure and non-secure connections get an explicit visual indicator, so finally we have parity.

But is parity what we actually want? Think back to the days when Chrome didn't tell you an insecure connection wasn't secure (ah, isn't it nice that's in the past already?!); browsers could get away with this because that was the normal state! Why explicitly say anything when the connection is "normal"? But now we're changing what "normal" means, and in the future that means we'll be able to apply the same logic as Chrome used to: visual indicators for the normal state won't be necessary, or in other words, we won't need to say "secure" any more. Instead, we can focus on the messaging around deviations from normal, namely connections that aren't secure. Google has already flagged that we'll see this behaviour in the future too; it's just a matter of time.

Let's take a moment to reflect on what that word "normal" means as it relates to secure comms on the internet because it's something that changes over time. A perfect example of that is Scott Helme's six-monthly Alexa Top 1M report. A couple of times a year, Scott publishes stats on the adoption of a range of different security constructs by the world's largest websites. One of those security constructs is the use of HTTPS or more specifically, sites that automatically redirect non-secure requests to the secure scheme. In that report above, he found that 6.7% of sites did this in August 2015. Let's have a look at just how quickly that number has changed and for ease of legibility, I'll list them all below followed by the change from the previous scan 6 months earlier:

  • Aug 2015: 6.7%
  • Feb 2016: 9.4% (+42%)
  • Aug 2016: 13.8% (+46%)
  • Feb 2017: 20.0% (+45%)
  • Aug 2017: 30.8% (+48%)
  • Feb 2018: 38.4% (+32%)

That's an astonishingly high growth rate, pretty much doubling every 12 months. We can't sustain that rate forever, of course, but depending on how you look at it, the numbers are even higher than that. Firefox's telemetry suggests that as of today, 73% of all requests are served over a secure HTTPS connection. That number is much higher than Scott's because the world's largest websites - which serve a disproportionate share of requests - implement HTTPS more frequently than the smaller ones. In fact, Scott's own figures graphically illustrate this:

Today, Chrome Takes Another Step Forward in Addressing the Design Flaw That is an Unencrypted Web

Each point on the graph is a cluster of 4,000 websites with the largest ones on the left and the smallest on the right. It's clear that well over half of the largest sites are doing HTTPS by default whilst the smallest ones are much closer to one quarter. This can be explained by the fact that larger services tend to be those that we've traditionally expected higher levels of security on; they're e-commerce sites, social media platforms, banks and so on. Paradoxically, those sites are also the ones that are less trivial to roll over to HTTPS, whilst the ones to the right of the graph are more likely to literally be lunchtime jobs. Last month I produced a free 4-part series called "HTTPS Is Easy" and part 1 literally went from zero HTTPS to full HTTPS across the entire site in 5 minutes. It took another 5 minutes to get a higher grade than what most banks have for their transport layer encryption. HTTPS really is easy!

Yet still, there remain those who are unconvinced that secure connections are always necessary. Content integrity, they argue, is really not that important, what can a malicious party actually do with a static site such as a blog anyway? Good question! In no particular order, they can inject script to modify the settings of vulnerable routers and hijack DNS, inject cryptominers into the browser, weaponise people's browsers into a DDoS cannon or serve malware or phishing pages to unsuspecting victims. Just to really drive home the real-world risks, I demo'd all those in a single video a couple of weeks ago. Mind you, the sorts of sites whose owners are questioning the need for HTTPS are precisely the sorts of sites that tend to be 5-minute exercises to put behind Cloudflare, so regardless of debates about how necessary it is, the actual effort involved in doing it is usually negligible. Oh - and it'll give you access to HTTP/2 and Brotli compression, which are both great for performance and only work over HTTPS, plus a whole range of browser features that are only available in secure contexts.

Today is just one more correction in a series that's been running for some time now. In Jan last year it was both Chrome and Firefox flagging insecure pages accepting passwords or credit cards as not secure. In October Chrome began showing the same visual indicator when entering data into any non-secure form. In March this year Safari on iOS began showing "Not Secure" when entering text into an insecure login form. We all know what's happened today and as I flagged earlier, the future holds yet more changes as we move towards a more "secure by default" web. (Incidentally, note how it's multiple browser vendors driving this change, it's by no means solely Google's doing.)

Bit by bit, we're gradually fixing the design flaws of the web.

A Note from Cloudflare
In June, Troy authored a post entitled “HTTPS is Easy!,” which highlights the simplicity of converting a site to HTTPS with Cloudflare. It’s worth noting that, as indicated in his post, we were (pleasantly) surprised to see this series.

At Cloudflare, it’s our mission to build a better Internet, and a part of that is democratizing modern web technologies to everyone. This was the motivation for launching Universal SSL in 2014 - a move that made us the first company to offer SSL for free to anyone. With the release of Chrome 68, we want to continue making HTTPS easy, and have launched a free tool to help any website owner troubleshoot common problems with HTTPS configuration.

Are you Chrome 68 ready? Check your website with our free SSL Test.
Categories: Technology

I Wanna Go Fast - Load Balancing Dynamic Steering

Sat, 21/07/2018 - 16:51
I Wanna Go Fast - Load Balancing Dynamic Steering

I Wanna Go Fast - Load Balancing Dynamic Steering

Earlier this month we released Dynamic Steering for Load Balancing which allows you to have your Cloudflare load balancer direct traffic to the fastest pool for a given Cloudflare region or colo (Enterprise only).

To build this feature, we had to solve two key problems: 1) How to decide which pool of origins was the fastest and 2) How to distribute this decision to a growing group of 151 locations around the world.

I Wanna Go Fast - Load Balancing Dynamic Steering

Distance, Approximate Latency, and a Better Way

As my math teacher taught me, the shortest distance between two points is a straight line. This is also typically true on the internet - the shorter the path from a user, through a Cloudflare location, to a customer origin, the better the experience for the user. Geography is one way to approximate speed, and we included the Geo Steering function when we initially introduced the Cloudflare Load Balancer. It is powerful, but manual; it’s not the best way. A customer on Twitter said it best:

@Cloudflare #FeatureRequest why can’t your load balancers determine which server is closest to the user then direct them to that one?

I don't want to have configure 10+ regions manually. This feels like something that should be built in? Am I missing it?

cc: @eastdakota

— Adam Evers
Categories: Technology

Securing U.S. Democracy: Athenian Project Update

Thu, 19/07/2018 - 16:01
 Athenian Project Update

 Athenian Project Update
Last December, Cloudflare announced the Athenian Project to help protect U.S. state and local election websites from cyber attack.

Since then, the need to protect our electoral systems has become increasingly urgent. As described by Director of National Intelligence Dan Coats, the “digital infrastructure that serves this country is literally under attack.” Just last week, we learned new details about how state election systems were targeted for cyberattack during the 2016 election. The U.S. government’s indictment of twelve Russian military intelligence officers describes the scanning of state election-related websites for vulnerabilities and theft of personal information related to approximately 500,000 voters.

This direct attack on the U.S. election systems using common Internet vulnerabilities reinforces the need to ensure democratic institutions are protected from attack in the future. The Athenian Project is Cloudflare’s attempt to do our part to secure our democracy.

Engaging with Elections Officials

Since announcing the Athenian Project, we’ve talked to state, county, and municipal officials around the country about protecting their election and voter registration websites. Today, we’re proud to report that we have Athenian Project participants in 19 states, and are in talks with many more. We have also strategized with civil society organizations, government associations, and federal government officials who share the goal of ensuring state and local officials have the tools they need to protect their institutions from cyberattack.

Working with state and local election officials has given us new appreciation for the dedication of those who serve as election officials, and how difficult it can be for those officials to identify and get the resources they need.

Local election officials — like ordinary voters — are the foundation of democracy. They guard the infrastructure of our constitutional system. Many officials juggle multiple roles within local government. They may manage multiple election websites, with limited information technology staff. Yet they know that their community, and sometimes the entire country, is relying on them to protect election integrity from countless global threats against it. The Athenian Project is about giving these dedicated professionals the tools they need to fight back and secure their systems.

A county Clerk-Recorder and Registrar of Voters, who is responsible for a number of election-related websites, told us that election officials worry about drawing attention to themselves, for fear they may be targeted for attack. Although cybersecurity is only one of the many responsibilities on her plate, this official is determined to protect the county, using all the resources at her disposal. But without dedicated information technology staff, she has had difficulty identifying how best to protect county infrastructure.

Cloudflare can help, with both tools and know-how.

 Athenian Project Update

Benefits of Cloudflare services

Given the current threats, we think it’s important to provide more details about what our services do, and how they can help election officials. We’ve understood since the beginning that election websites would benefit from Cloudflare’s security features, including our DDoS mitigation, Web Application Firewall (WAF), IP reputation database, and ability to block traffic by country or IP address. In fact, reports of DDoS attacks on state and local government websites often get the most coverage because the impact — loss of service to the site — is visible to the public. Until our conversations, however, we did not fully appreciate how our services could solve other common problems for state and local government officials.

For election officials, the last day of voter registration and election day are often nerve-wracking events. Their websites can see more traffic in an hour than they’ve seen all year. For example, when the Special Election in Alabama in 2017 drew traffic from around the country, Alabama needed a distributed network and a CDN to ensure that the nearly 5 million Alabamians and everyone else in the U.S. could follow along.

Cloudflare’s other features can also help state and local election websites. The Senate Select Committee on Intelligence summary of the 2016 election hacking attempts concluded that the majority of malicious access attempts on voting-related websites were perpetrated using SQL injection. Cloudflare’s WAF protects against SQL injection, as well as other forms of attack.

Recently, one of the states whose election websites are part of the Athenian Project was attacked and two non-election related websites were defaced. Website defacement occurs when someone who is not authorized to make website changes alters the content on the site, often changing the home page to display the hacker’s logo or other material. Although the state’s election websites saw a 100-fold increase in threat traffic, our WAF helped prevent a similar defacement on those sites.

For election websites that are not already running on HTTPS, Cloudflare can also simplify the process of transitioning to use of SSL. With Google Chrome’s new initiative to mark non-HTTPS sites as insecure, potential voters visiting non-encrypted voter registration websites will be warned not to enter sensitive information on the site “because it could be stolen by attackers.” That is not the message officials want to send to a public nervous about cyberattacks on election infrastructure. Adding a security certificate can be a daunting task for local officials without IT resources, but for Athenian Project participants, it’s available at the click of a button. Athenian Project participants who need help with certificate management are given dedicated, auto-renewed certificates to improve the security of their sites. Cloudflare page rules can then direct all traffic to the HTTPS site.

Lessons learned and new tools

We’ve also tailored the Athenian Project to better address the needs of those we are serving. So what have we done?

  • More tools: We wanted to provide more tools for those who want to learn about and set up our service. We’ve therefore revamped our website to be more intuitive to navigate and to provide more information. We’ve created a new, interactive guide discussing website protection and a short video sharing the experience of current Athenian Project participants.

  • How-to videos: There are videos to not only walk new participants through creating an account and transitioning their DNS servers, but also to provide best practices so that new participants can identify and turn on important features.

Getting Started

Best Practices

  • Support help: We have found that state and local election officials often have challenges at the onboarding stage that are best addressed through personal attention. We’ve therefore added support features — including Athenian-specific support — to increase the personal interaction we have with officials and to provide them an opportunity to describe their own situation and needs.

  • Set up flexibility: We’ve learned to be flexible with how we set up our service. While some counties were eager to leverage as much of the service as possible, including using full DNS delegation and dedicated certificates, others preferred to pick and choose between options. Depending on the circumstances for a given jurisdiction, we customize protection so they can use Cloudflare without needing to change the IT system for the whole state or county.

  • Athenian Project-specific terms of service: To address common government contracting restrictions, we’ve drafted an Athenian Project-specific terms of service.

We hope these new details will make it even easier for election officials to get access to tools that can help them fulfill their critical responsibility to protect our elections.

 Athenian Project Update

What’s next

In November, every state and district in the country will hold congressional elections. Election officials — and all of us — want to make sure that voter information remains secure and that websites stay online as voters seek out information on polling places and voting requirements, and anxiously refresh results pages on election night.

The entire American experiment is built on a simple act: a vote. To work as designed, citizens must trust the electoral system, its strength, integrity, and the people who protect it. Cloudflare is proud to support local officials on the front lines of election security.

And we, like election officials, know that building a resilient system requires long-term commitment. We are committed to continuing to do our part to keep U.S. election websites secure in this election and beyond.

If you would like more information about the Athenian Project, please visit our website cloudflare.com/athenian-project.

Categories: Technology

IPv6 in China

Thu, 19/07/2018 - 01:03
IPv6 in China

IPv6 in China
Photo by chuttersnap / Unsplash

At the end of 2017, Xinhua reported that there would be 200 million IPv6 users inside Mainland China by the end of this year. Halfway into the year, we’re seeing rapid growth in IPv6 users and traffic originating from Mainland China.

Why does this matter?

IPv6 is often referred to as the next generation of IP addressing. The reality is, IPv6 is what is needed for addressing today. Take the largest mobile network in China today: China Mobile has over 900 million mobile subscribers and over 670 million 4G/LTE subscribers. To be able to provide service to their users, they need to provide an IP address to each subscriber’s device. This means close to a billion IP addresses would be required, which is far more than what is available in IPv4, especially as the available IP address pools have been exhausted.
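
The arithmetic is stark even before you account for reserved ranges and addresses already allocated elsewhere; a quick sketch:

const ipv4Space = 2 ** 32          // ~4.29 billion addresses in total
const subscribers = 900e6          // China Mobile's subscriber base alone

const share = (subscribers / ipv4Space * 100).toFixed(0)
console.log(`one carrier would need ~${share}% of the entire IPv4 space`) // ~21%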

What is the solution?

To solve the addressability of clients, many networks, especially mobile networks, use Carrier Grade NAT (CGN). This allows thousands, possibly up to hundreds of thousands, of devices to share a single internet IP address. The CGN equipment can be very expensive to scale and, given the scale of these networks, operators might need to layer CGNs behind other CGNs. This increases costs per subscriber, can reduce performance and makes scaling very challenging. A further solution, NAT64, allows IPv6 addresses to be given to subscribers, which are then translated to IPv4 addresses much like other NATs. This allows networks and ISPs to begin deploying IPv6 to subscribers, a first step in the transition to IPv6.

IPv6 IPv6 IPv6!

IPv6 in China
Announcements of IPv6 address blocks from China Mobile. Source: Hurricane Electric

On June 7, China Mobile started to announce IPv6 address blocks to the Internet at large. At the same time, Cloudflare started seeing traffic being exchanged with China Mobile users over IPv6 connections.

IPv6 in China
IPv4 to IPv6 percentage of traffic as seen from Cloudflare to AS9808 China Mobile’s Guangdong network.

Throughout the past 45 days, we’ve seen more and more IPv6 address blocks being announced to the internet, along with very aggressive usage. Interestingly, this all started on or around June 8th 2018 (seven years to the day from World IPv6 Day).

It’s natural to see traffic graphs like this go up; then down after a while. This could indicate there’s some testing still going on with the deployment. We fully expect that the traffic percentage will climb back up once this is fully rolled out.

It’s fantastic to see the IPv6 enablement! We congratulate China Mobile on a successful rollout and look forward to its continued growth.

Categories: Technology

Proxying traffic to Report URI with Cloudflare Workers

Tue, 17/07/2018 - 14:00
Proxying traffic to Report URI with Cloudflare Workers

The following is a guest post by Scott Helme, a Security Researcher, international speaker, and blogger. He's also the founder of the popular securityheaders.com and report-uri.com, free tools to help people deploy better security.

With the continued growth of Report URI we're seeing a larger and larger variety of sites use the service. With that diversity comes additional requirements that need to be met, some of them simple and some of them less so. Here's a quick look at those challenges and how they can be solved easily with a Cloudflare Worker.

Sending CSP Reports

When a browser sends a CSP report for us to collect at Report URI, we receive the JSON payload sent to us, but we also have access to two other pieces of information: the client IP and the User Agent string. We never store, collect or analyse the client IP, we simply don't need or want to, and all we do with the UA string is extract the browser name like Chrome or Firefox. Most site operators are perfectly happy with our approach here and things work just fine. There are, however, some cases where the site operator simply doesn't want us to have this information, and others where they can't allow us access to it because of restrictions placed on them by a regulator. The other common thing that comes up, which I honestly never anticipated, was simply the perception of the reporting endpoint being a 3rd party address. There are various ways we can and do tackle these problems.

Proxying traffic to Report URI with Cloudflare Workers

CNAME

Up until now, if a client didn't want to report to a 3rd party address we would ask them to CNAME their subdomain to us and run a dedicated instance that would ingest reports using their subdomain. We take control of certificate issuance and renewal and the customer doesn't need to do anything further. This is a fairly common approach across many different technical requirements and it's something that has worked well for us. The problem is that it does come with some administrative overheads for both parties. From our side the technical requirements of managing separate infrastructure are an additional burden, we're responsible for a subdomain belonging to someone else and there are more moving parts in the system, increasing complexity. I was curious if there was another way.

HTTP Proxy

One idea that we discussed with a customer a while back, but never deployed, was for them to run a proxy on premise. They could report to their own endpoint under their own control and simply forward reports to their Report URI reporting address. This means they could shield the client IP from us, mask the User Agent string if required and generally do any sanitisation they liked on the payload. The problem with this was that it just seemed like an awful lot of work, I'd much rather have discussed deploying Report URI on premise instead. The client is also still at risk of things like accidentally DDoSing their endpoint, which removes one of the good reasons to use Report URI.

Finding Another Way

For the most part our current model was working really well but there were some customers who had a hard requirement to not send reports directly to us. Our on premise solution also isn't ready for prime time just yet so we needed something that we could offer, without it requiring too much of the overhead mentioned above. That's when I had an idea that might just cut it.

Javascript On A Plane

I was sat on a flight just a few days ago and I never like to waste time. When I'm sat in the car on the way to the airport, sat in the airport or sat on my flight, I'm working. Those are precious hours that can't be wasted and during a recent flight between Manchester and Vienna I was playing around with Cloudflare Workers in their playground. I was tinkering with a worker to add Security Headers to websites, which has since been launched, and whilst inspecting the headers object and looking through the headers that were in the request I saw the User Agent string. "Oh hey, I could remove that if I wanted to" I thought to myself, and then the rapid fire series of events triggered in my brain when you're in the process of realising a great idea. I could remove the UA header... From the request... Then the worker can make any subrequests it likes... Requests to a different origin... THE WORKER CAN RECEIVE A REPORT AND FORWARD IT!!!

I realised that (of course) a Cloudflare Worker could be used to receive reports on a subdomain of your own site and then forward them to your reporting address at Report URI.

Using Cloudflare Workers As A Report Proxy

One of the main benefits of using Report URI is just how simple everything is to do and all of the solutions mentioned at the start of this blog changed that. With a Cloudflare Worker we could keep the absolute simplicity of deploying Report URI but also easily allow you the option to shield your client's IP addresses, or any other information in the payload, from us.


let subdomain = 'scotthelme'

addEventListener('fetch', event => {
  event.respondWith(forwardReport(event.request))
})

async function forwardReport(req) {
  // Forward only the User-Agent header; the client IP never reaches Report URI.
  let newHdrs = new Headers()
  newHdrs.set('User-Agent', req.headers.get('User-Agent'))

  const init = {
    body: req.body,
    headers: newHdrs,
    method: 'POST'
  }

  // Preserve the report path and relay the report to your Report URI subdomain.
  let path = new URL(req.url).pathname
  let address = 'https://' + subdomain + '.report-uri.com' + path
  let response = await fetch(address, init)

  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText
  })
}

This simple worker, deployed on your own site, provides a solution to all of the above problems. All you need to do is configure your subdomain in the var on the first line and everything else will be taken care of for you. Deploy this worker onto the subdomain you want to send reports to, follow the same naming convention for the path when sending reports, and everything will Just Work(TM).

Proxying traffic to Report URI with Cloudflare Workers

The script above is configured for my subdomain, so if I wanted to deploy this on any site, say example.com, I'd choose the subdomain on my site where I wanted to send reports, say report-uri.example.com, and off we go.


https://scotthelme.report-uri.com/r/d/csp/enforce becomes https://report-uri.example.com/r/d/csp/enforce

The reports are now being sent to a subdomain under your own control, the worker will intercept the request and forward it to the destination at Report URI for you. In the process you will shield the client IP as we will only see the source IP as being the Cloudflare Worker and in the example above we are forwarding the UA string for browser identification.

Amazingly Simple

With the worker above we don't need to worry about setting up a CNAME, certificate provisioning, separate infrastructure or anything else that goes with it. You also don't need to worry about setting up and managing a proxy to forward the reports to us, or the traffic and processing power required to do so. The worker will take care of all of that, and what's best is that it will take care of it with minimal overhead, taking only a few minutes to set up and costing only $0.50 for every 1 million reports it processes.

Taking It One Step Further

The great thing about this is that once the worker is setup and processing reports, you can start to do some pretty awesome things beyond just proxying reports, workers are incredibly powerful.

Downsample report volume

If you want to save your quota on Report URI, maybe you're early in the process of deploying CSP and it's quite noisy, no problem. The worker can select a random downsample of reports to forward on so you can still receive reports but not eat your quota quite as quickly. Make the following change to the start of the forwardReport() function to randomly drop 50% of reports.


async function forwardReport(req) {
  if (Math.floor((Math.random() * 100) + 1) <= 50) {
    return new Response("discarded")
  }
Hide the UA string

If you did want to hide the UA string from Report URI and not let us see that either, you simply need to remove the following line of code.


newHdrs.set('User-Agent', req.headers.get('User-Agent'))
Advanced work

The worker can pretty much do anything you like. Maybe there are sections of your site that you don't want to send reports from. You could parse the JSON and check which page triggered the report and discard them. You could do a regex match on the JSON payload to make sure no sensitive tokens or information get sent too. The possibilities are basically endless and what we can say is that if you need to do it, it's easy and cheap enough to do in a Cloudflare Worker.

Pricing

Talking of cheap, I thought I'd actually quantify that and quote the Cloudflare pricing for Workers.

Proxying traffic to Report URI with Cloudflare Workers

Starting at $5 per month and covering your first 10 million requests is an amazingly cheap deal. Most websites that report through us wouldn't even come close to sending 10 million reports so you'd probably never pay any more than $5 for your Cloudflare Worker. That's it, $5 per month... By the time you've even thought about creating a CNAME or standing up your own proxy you've probably blown through more than Cloudflare Workers would ever cost you. What's best is that if you already use Cloudflare Workers then you can roll this into your existing usage and it might not even increase the cost if you have some of your initial 10 million requests spare. If you don't use Cloudflare on your site already then you could just as easily grab a new domain name exclusively for reporting, that'd cost just a few dollars, and stand that up behind Cloudflare too. One way or another this is insanely easy and insanely cheap.
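
As a quick sanity check on those numbers (using only the pricing quoted in this post - $5 per month including the first 10 million requests, then $0.50 per additional million):

function monthlyCost(reports) {
  const extraMillions = Math.max(0, reports - 10e6) / 1e6
  return 5 + Math.ceil(extraMillions) * 0.5
}

console.log(monthlyCost(8e6))  // 5    - well within the included quota
console.log(monthlyCost(25e6)) // 12.5 - still cheaper than running a proxy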

Categories: Technology

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

Sat, 14/07/2018 - 18:13
DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750SGL.iNet GL-AR750S in black, same form-factor as the prior white GL.iNet GL-AR750. Credit card for comparison.

Back in April, I wrote about how it was possible to modify a router to encrypt DNS queries over TLS using Cloudflare's 1.1.1.1 DNS Resolver. For this, I used the GL.iNet GL-AR750 because it was pre-installed with OpenWRT (LEDE). The folks at GL.iNet read that blog post and decided to bake DNS-Over-TLS support into their new router using the 1.1.1.1 resolver; they sent me one to take a look at before it was available for pre-order. Their new router can also be configured to force DNS traffic to be encrypted before leaving your local network, which is particularly useful for any IoT or mobile device with hard-coded DNS settings that would ordinarily ignore your router's DNS settings and send DNS queries in plain-text.

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

In my previous blog post I discussed how DNS was often the weakest link in the chain when it came to browsing privacy; whilst HTTP traffic is increasingly encrypted, this is seldom the case for DNS traffic. This makes it relatively trivial for an intermediary to work out what site you're sending traffic to. In that post, I went through the technical steps required to modify a router using OpenWRT to support DNS Privacy using the DNS-Over-TLS protocol.

GL.iNet had been in contact since I wrote the original blog post and were very supportive of encrypting DNS queries at the router level. Last week, whilst I was working in Cloudflare's San Francisco office, they reached out to me over Twitter to let me know they were soon to launch a new product with a new web UI containing a "DNS over TLS from Cloudflare" feature, and offered to send me the new router before it was even available for pre-order.

On arriving back at our London office, I found a package from Hong Kong waiting for me. Aside from the difference in colour, the AR750S is identical in form-factor to the AR750 and was packaged up very similarly. Both have capacity for external storage, an OpenVPN client and the ability to be powered over USB, amongst many other useful features. Alongside the S suffixed to the model number, I did notice the new model has some upgraded specs, but I won't dwell on that here.

Below you can see the white AR750 and the new black AR750S router together for comparison. Both have a WAN ethernet port, 2 LAN ethernet ports, a USB port for external storage (plus a micro SD port) and a micro USB power port.

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

The UI is where the real changes are. In the More Settings tab, there's a panel for configuring DNS with some nice options.

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

One notable option is the DNS over TLS from Cloudflare toggle. This option uses the TLS security protocol for encrypting DNS queries, helping increase privacy and prevent eavesdropping.

Another option, Override DNS Settings for All Clients, forcibly overrides the DNS configuration on all clients so that queries are encrypted on their way out to the WAN. The router intercepts unencrypted DNS traffic and forces it through its own local resolver, transparently rewriting queries to be encrypted before they leave the router and head out into the public internet to the upstream resolver - 1.1.1.1.

This option is particularly useful when dealing with embedded systems or IoT devices which don't have configurable DNS options; Smart TVs, TV boxes, your toaster, etc. As this router can proxy traffic over to other Wi-Fi networks (and is portable), this is particularly useful when connecting out to an ordinarily insecure Wi-Fi network; the router can sit in the middle and transparently upgrade unencrypted DNS queries. This is even useful when dealing with phones and tablets where you can't install a DNS-Over-TLS client.

Both of these options come disabled by default, but can easily be toggled in the UI. As before, you can configure other DNS resolvers by toggling "Manual DNS Server Settings" and entering any other DNS servers.

There are a number of other cool features in this router; for example, the More Settings > Advanced option takes you into the standard LuCI UI that ordinarily comes bundled with LEDE routers. As with previous routers, you can easily SSH into the device, install various programs and perform customisations.

For example; after installing TCPDump on the router, I am able to run tcpdump -n -i wlan-sta 'port 853' to see encrypted DNS traffic leaving the router. When I run a DNS query over an unencrypted resolver (using dig A junade.com on my local computer), I can see the outgoing DNS traffic upgraded to encrypted queries on 1.1.1.1 and 1.0.0.1.

DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S

If you're interested in learning how to configure 1.1.1.1 on other routers, your computer or your phone, check out the project landing page at https://1.1.1.1/. If you're a developer and want to learn how you can integrate 1.1.1.1 into your project with either DNS-Over-TLS or DNS-Over-HTTPS, check out the 1.1.1.1 Developer Documentation.
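
As a taste of what that integration can look like, here is a minimal sketch of a DNS-Over-HTTPS lookup against 1.1.1.1's JSON endpoint; the queried hostname is just an example, and error handling is kept deliberately brief.

// Minimal DNS-Over-HTTPS lookup using Cloudflare's JSON endpoint (sketch only)
async function dohLookup(name, type) {
  const url = 'https://cloudflare-dns.com/dns-query?name=' +
    encodeURIComponent(name) + '&type=' + encodeURIComponent(type)
  // Ask for the JSON response format rather than binary DNS wireformat
  const res = await fetch(url, { headers: { 'Accept': 'application/dns-json' } })
  if (!res.ok) {
    throw new Error('DoH query failed: ' + res.status)
  }
  const json = await res.json()
  // Each answer carries a name, type and data field (e.g. the A record's IP address)
  return (json.Answer || []).map(answer => answer.data)
}

// Example: resolve the A records for an example hostname
dohLookup('example.com', 'A').then(ips => console.log(ips))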

Categories: Technology

Introducing Proudflare, Cloudflare's LGBTQIA+ Group

Thu, 12/07/2018 - 18:55
Introducing Proudflare, Cloudflare's LGBTQIA+ Group

Introducing Proudflare, Cloudflare's LGBTQIA+ Group

With Pride month now in our collective rearview mirror for 2018, I wanted to share what some of us have been up to at Cloudflare. We're so proud that, in the last 8 months, we've formed an LGBTQIA+ Employee Resource Group (ERG) called Proudflare. We've launched chapters and monthly activities in each of our primary locations: San Francisco, London, Singapore, and Austin. This month, we came out in force! We transformed our company's social profiles, wrapped our HQ building in rainbow window decals, highlighted several non-profits we support, and threw a heck of an inaugural Pride Celebration.

We’re a very young group — just 8 months old — but we have big plans. Check out some of our activities and future plans, follow us on social media, and consider starting an ERG at your company too.

The History of Proudflare

On my first day at Cloudflare in October 2017, I logged into HipChat and searched for "LGBTQ". Fortunately for me, an "LGBT at Cloudflare" chat room already existed, and I started making connections right away. I found that there had been a couple of informal group outings, but there was no regular activity, no sharing of resources, and no official group. Proudflare was born that day, and the ball kept rolling.

Introducing Proudflare, Cloudflare's LGBTQIA+ Group

Our first official event was a Lunch & Discussion in December, where eleven Cloudflare employees gathered over lunch to discuss articles about LGBTQIA+ issues in tech. We unanimously agreed to keep holding events like this and to form an ERG.

Here are the first two articles we discussed:

Once we established a regular structure of events, we started introducing Proudflare to our other locations. In March, we held our first SF mixer with LGBTQIA+ ERGs from other tech companies. We decided we wanted to fully announce the group to the whole company during Pride month, so we sent out an email to the entire company introducing Proudflare and gave presentations at our All Hands meeting.

All of Cloudflare welcomed us and embraced us as their first ERG.

Our Pride month activities

Austin

Our Austin chapter held its second Lunch & Discussion event, where Cloudflare employees got together to discuss how to write more inclusive job descriptions. They also discussed ideas for a Pride celebration and announced the first Proudflare service day, where the group will take time off to volunteer at an LGBTQIA+ youth organization.

London

The London chapter held its third Lunch & Discussion event, where the group brainstormed better processes for welcoming new employees to the London office, supporting them with resources, and making Proudflare a more salient part of the office culture. They also began planning their first Pride Celebration, which will take place after London Pride this July.

Singapore

The Singapore chapter held its first event this month and was overwhelmed with support. A group of twenty-five Cloudflarians gathered to learn how they could make the Singapore office more inclusive and supportive of LGBTQIA+ individuals. They discussed articles about LGBTQIA+ issues in Singapore and started planning their first external event in support of Pink Dot's PinkFest.

San Francisco

At our headquarters, where roughly half of our global employee base is located, we felt it important to really make an impact. We wrapped our SOMA offices with rainbow window decals, organized a contingent to march with Bluegrass Pride in the parade, and renamed Cloudflare to "Proudflare".

We also held a Lunch & Discussion event where we shared stories of what Pride means to each of us and hosted our inaugural Pride Celebration, where we welcomed one hundred sixty people into our space to learn about nonprofits we believe in and celebrate with us.

Here are the nonprofits we highlighted:

The Trevor Project: Founded in 1998 by the creators of the Academy Award®-winning short film TREVOR, The Trevor Project is the leading national organization providing crisis intervention and suicide prevention services to lesbian, gay, bisexual, transgender, and questioning (LGBTQ) young people ages 13–24.

We're honored to support the Trevor Project with Cloudflare's Project Galileo. Organizations working on behalf of the arts, human rights, civil society, or democracy, can apply for Project Galileo to get Cloudflare’s highest level of protection for free.

Rainbow Railroad: In response to the confirmed reports of abductions, detentions, enforced disappearances, torture, and deaths targeting over 200 gay and bisexual men in Chechnya, Rainbow Railroad immediately went into action to assist those in danger. Rainbow Railroad has been working closely with the Russian LGBT Network, a non-governmental organization currently leading the campaign to rescue those facing danger in Chechnya.

Openhouse: Openhouse enables San Francisco Bay Area LGBT seniors to overcome the unique challenges they face as they age by providing housing, direct services, and community programs. As a result, they have reduced isolation and empowered LGBT seniors to improve their overall health, well-being, and economic security.

What's Next?

We're a new ERG and we've come a long way in a short amount of time, but we have a lot more planned. Here are some projects we're currently working on:

  • Hosting an event in support of Pink Dot in Singapore
  • Hosting Pride Celebration events in Austin
  • Inserting a presentation about inclusion and ERGs in our new hire orientation
  • Supporting ally skills trainings for employees
  • Working with recruiting on writing inclusive job descriptions
  • Advising human resources on which benefits packages are most LGBTQIA+ friendly
  • Establishing a framework for LGBTQIA+ diversity data collection and reporting with our people team
  • Publishing all Proudflare-related resources in a Wiki for all Cloudflare employees to access easily
Call to Action

I suggest starting an employee resource group at your company. Whether it's focused on LGBTQIA+ people, women, people of color, parents, or other underserved populations in tech, conversations about inclusion and community-building make for a better work atmosphere. Here are some beginning resources I used.

Let's make our industry a better, more inclusive place for all.

Follow & join us

Follow us on social media and join us at our next events.

<3

Proudflare

Introducing Proudflare, Cloudflare's LGBTQIA+ Group

Categories: Technology
