Blogroll: CloudFlare

I read blogs, as well as write one. The 'blogroll' on this site reproduces some posts from some of the people I enjoy reading. There are currently 41 posts from the blog 'CloudFlare.'

Disclaimer: Reproducing an article here need not necessarily imply agreement or endorsement!

Subscribe to CloudFlare feed CloudFlare
The Cloudflare Blog
Updated: 2 hours 44 min ago

Cloudflare Global Network Expands to 193 Cities

Thu, 15/08/2019 - 02:41
Cloudflare Global Network Expands to 193 Cities

Cloudflare’s global network currently spans 193 cities across 90+ countries. With over 20 million Internet properties on our network, we increase the security, performance, and reliability of large portions of the Internet every time we add a location.

Cloudflare Global Network Expands to 193 CitiesExpanding Network to New Cities

So far in 2019, we’ve added a score of new locations: Amman, Antananarivo*, Arica*, Asunción, Baku, Bengaluru, Buffalo, Casablanca, Córdoba*, Cork, Curitiba, Dakar*, Dar es Salaam, Fortaleza, Geneva, Göteborg, Guatemala City, Hyderabad, Kigali, Kolkata, Male*, Maputo, Nagpur, Neuquén*, Nicosia, Nouméa, Ottawa, Port-au-Prince, Porto Alegre, Querétaro, Ramallah, and Thessaloniki.

Our Humble Beginnings

When Cloudflare launched in 2010, we focused on putting servers at the Internet’s crossroads: large data centers with key connections, like the Amsterdam Internet Exchange and Equinix Ashburn. This not only provided the most value to the most people at once but was also easier to manage by keeping our servers in the same buildings as all the local ISPs, server providers, and other people they needed to talk to streamline our services.

This is a great approach for bootstrapping a global network, but we’re obsessed with speed in general. There are over five hundred cities in the world with over one million inhabitants, but only a handful of them have the kinds of major Internet exchanges that we targeted. Our goal as a company is to help make a better Internet for all, not just those lucky enough to live in areas with affordable and easily-accessible interconnection points. However, we ran up against two broad, nasty problems: a) running out of major Internet exchanges and b) latency still wasn’t as low as we wanted. Clearly, we had to start scaling in new ways.

One of our first big steps was entering into partnerships around the world with local ISPs, who have many of the same problems we do: ISPs want to save money and provide fast Internet to their customers, but they often don’t have a major Internet exchange nearby to connect to. Adding Cloudflare equipment to their infrastructure effectively brought more of the Internet closer to them. We help them speed up millions of Internet properties while reducing costs by serving traffic locally. Additionally, since all of our servers are designed to support all our products, a relatively small physical footprint can also provide security, performance, reliability, and more.

Upgrading Capacity in Existing Cities

Though it may be obvious and easy to overlook, continuing to build out existing locations is also a key facet of building a global network. This year, we have significantly increased the computational capacity at the edge of our network. Additionally, by making it easier to interconnect with Cloudflare, we have increased the number of unique networks directly connected with us to over 8,000. This makes for a faster, more reliable Internet experience for the >1 billion IPs that we see daily.

To make these capacity upgrades possible for our customers, efficient infrastructure deployment has been one of our keys to success. We want our infrastructure deployment to be targeted and flexible.

Targeted Deployment

The next Cloudflare customer through our door could be a small restaurant owner on a Pro plan with thousands of monthly pageviews or a fast-growing global tech company like Discord. As a result, we need to always stay one step ahead and synthesize a lot of data all at once for our customers.

To accommodate this expansion, our Capacity Planning team is learning new ways to optimize our servers. One key strategy is targeting exactly where to send our servers. However, staying on top of everything isn’t easy - we are a global anycast network, which introduces unpredictability as to where incoming traffic goes. To make things even more difficult, each city can contain as many as five distinct deployments. Planning isn’t just a question of what city to send servers to, it’s one of which address.

To make sense of it all, we tackle the problem with simulations. Some, but not all, of the variables we model include historical traffic growth rates, foreseeable anomalous spikes (e.g., Cyber Day in Chile), and consumption states from our live deal pipeline, as well as product costs, user growth, end-customer adoption. We also add in site reliability, potential for expansion, and expected regional expansion and partnerships, as well as strategic priorities and, of course, feedback from our fantastic Systems Reliability Engineers.

Flexible Supply Chain

Knowing where to send a server is only the first challenge of many when it comes to a global network. Just like our user base, our supply chain must span the entire world while also staying flexible enough to quickly react to time constraints, pricing changes including taxes and tariffs, import/export restrictions and required certifications - not to mention local partnerships many more dynamic location-specific variables. Even more reason we have to stay quick on our feet, there will always be unforeseen roadblocks and detours even in the most well-prepared plans. For example, a planned expansion in our Prague location might warrant an expanded presence in Vienna for failover.

Once servers arrive at our data centers, our Data Center Deployment and Technical Operations teams work with our vendors and on-site data center personnel (our “Remote Hands” and “Smart Hands”) to install the physical server, manage the cabling, and handle other early-stage provisioning processes.

Our architecture, which is designed so that every server can support every service, makes it easier to withstand hardware failures and efficiently load balance workloads between equipment and between locations.

Join Our Team

If working at a rapidly expanding, globally diverse company interests you, we’re hiring for scores of positions, including in the Infrastructure group. If you want to help increase hardware efficiency, deploy and maintain servers, work on our supply chain, or strengthen ISP partnerships, get in touch.

*Represents cities where we have data centers with active Internet ports and where we are configuring our servers to handle traffic for more customers (at the time of publishing)

Categories: Technology

Building a GraphQL server on the edge with Cloudflare Workers

Wed, 14/08/2019 - 04:21
Building a GraphQL server on the edge with Cloudflare WorkersBuilding a GraphQL server on the edge with Cloudflare Workers

Today, we're open-sourcing an exciting project that showcases the strengths of our Cloudflare Workers platform: workers-graphql-server is a batteries-included Apollo GraphQL server, designed to get you up and running quickly with GraphQL.

Building a GraphQL server on the edge with Cloudflare WorkersTesting GraphQL queries in the GraphQL Playground

As a full-stack developer, I’m really excited about GraphQL. I love building user interfaces with React, but as a project gets more complex, it can become really difficult to manage how your data is managed inside of an application. GraphQL makes that really easy - instead of having to recall the REST URL structure of your backend API, or remember when your backend server doesn't quite follow REST conventions - you just tell GraphQL what data you want, and it takes care of the rest.

Cloudflare Workers is uniquely suited as a platform to being an incredible place to host a GraphQL server. Because your code is running on Cloudflare's servers around the world, the average latency for your requests is extremely low, and by using Wrangler, our open-source command line tool for building and managing Workers projects, you can deploy new versions of your GraphQL server around the world within seconds.

If you'd like to try the GraphQL server, check out a demo GraphQL playground, deployed on This optional add-on to the GraphQL server allows you to experiment with GraphQL queries and mutations, giving you a super powerful way to understand how to interface with your data, without having to hop into a codebase.

If you're ready to get started building your own GraphQL server with our new open-source project, we've added a new tutorial to our Workers documentation to help you get up and running - check it out here!

Finally, if you're interested in how the project works, or want to help contribute - it's open-source! We'd love to hear your feedback and see your contributions. Check out the project on GitHub.

Categories: Technology

On the recent HTTP/2 DoS attacks

Tue, 13/08/2019 - 18:00
On the recent HTTP/2 DoS attacksOn the recent HTTP/2 DoS attacks

Today, multiple Denial of Service (DoS) vulnerabilities were disclosed for a number of HTTP/2 server implementations. Cloudflare uses NGINX for HTTP/2. Customers using Cloudflare are already protected against these attacks.

The individual vulnerabilities, originally discovered by Netflix and are included in this announcement are:

As soon as we became aware of these vulnerabilities, Cloudflare’s Protocols team started working on fixing them. We first pushed a patch to detect any attack attempts and to see if any normal traffic would be affected by our mitigations. This was followed up with work to mitigate these vulnerabilities; we pushed the changes out few weeks ago and continue to monitor similar attacks on our stack.

If any of our customers host web services over HTTP/2 on an alternative, publicly accessible path that is not behind Cloudflare, we recommend you apply the latest security updates to your origin servers in order to protect yourselves from these HTTP/2 vulnerabilities.

We will soon follow up with more details on these vulnerabilities and how we mitigated them.

Full credit for the discovery of these vulnerabilities goes to Jonathan Looney of Netflix and Piotr Sikora of Google and the Envoy Security Team.

Categories: Technology

Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Tue, 13/08/2019 - 14:01
Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Today we’re excited to announce Cloudflare Magic Transit. Magic Transit provides secure, performant, and reliable IP connectivity to the Internet. Out-of-the-box, Magic Transit deployed in front of your on-premise network protects it from DDoS attack and enables provisioning of a full suite of virtual network functions, including advanced packet filtering, load balancing, and traffic management tools.

Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Magic Transit is built on the standards and networking primitives you are familiar with, but delivered from Cloudflare’s global edge network as a service. Traffic is ingested by the Cloudflare Network with anycast and BGP, announcing your company’s IP address space and extending your network presence globally. Today, our anycast edge network spans 193 cities in more than 90 countries around the world.

Once packets hit our network, traffic is inspected for attacks, filtered, steered, accelerated, and sent onward to the origin. Magic Transit will connect back to your origin infrastructure over Generic Routing Encapsulation (GRE) tunnels, private network interconnects (PNI), or other forms of peering.

Enterprises are often forced to pick between performance and security when deploying IP network services. Magic Transit is designed from the ground up to minimize these trade-offs: performance and security are better together. Magic Transit deploys IP security services across our entire global network. This means no more diverting traffic to small numbers of distant “scrubbing centers” or relying on on-premise hardware to mitigate attacks on your infrastructure.

We’ve been laying the groundwork for Magic Transit for as long as Cloudflare has been in existence, since 2010. Scaling and securing the IP network Cloudflare is built on has required tooling that would have been impossible or exorbitantly expensive to buy. So we built the tools ourselves! We grew up in the age of software-defined networking and network function virtualization, and the principles behind these modern concepts run through everything we do.

When we talk to our customers managing on-premise networks, we consistently hear a few things: building and managing their networks is expensive and painful, and those on-premise networks aren’t going away anytime soon.

Traditionally, CIOs trying to connect their IP networks to the Internet do this in two steps:

  1. Source connectivity to the Internet from transit providers (ISPs).
  2. Purchase, operate, and maintain network function specific hardware appliances. Think hardware load balancers, firewalls, DDoS mitigation equipment, WAN optimization, and more.

Each of these boxes costs time and money to maintain, not to mention the skilled, expensive people required to properly run them. Each additional link in the chain makes a network harder to manage.

This all sounded familiar to us. We had an aha! moment: we had the same issues managing our datacenter networks that power all of our products, and we had spent significant time and effort building solutions to those problems. Now, nine years later, we had a robust set of tools we could turn into products for our own customers.

Magic Transit aims to bring the traditional datacenter hardware model into the cloud, packaging transit with all the network “hardware” you might need to keep your network fast, reliable, and secure. Once deployed, Magic Transit allows seamless provisioning of virtualized network functions, including routing, DDoS mitigation, firewalling, load balancing, and traffic acceleration services.

Magic Transit is your network’s on-ramp to the Internet

Magic Transit delivers its connectivity, security, and performance benefits by serving as the “front door” to your IP network. This means it accepts IP packets destined for your network, processes them, and then outputs them to your origin infrastructure.

Connecting to the Internet via Cloudflare offers numerous benefits. Starting with the most basic, Cloudflare is one of the most extensively connected networks on the Internet. We work with carriers, Internet exchanges, and peering partners around the world to ensure that a bit placed on our network will reach its destination quickly and reliably, no matter the destination.

An example deployment: Acme Corp

Let’s walk through how a customer might deploy Magic Transit. Customer Acme Corp. owns the IP prefix, which they use to address a rack of hardware they run in their own physical datacenter. Acme currently announces routes to the Internet from their customer-premise equipment (CPE, aka a router at the perimeter of their datacenter), telling the world is reachable from their autonomous system number, AS64512. Acme has DDoS mitigation and firewall hardware appliances on-premise.

Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Acme wants to connect to the Cloudflare Network to improve the security and performance of their own network. Specifically, they’ve been the target of distributed denial of service attacks, and want to sleep soundly at night without relying on on-premise hardware. This is where Cloudflare comes in.

Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Deploying Magic Transit in front of their network is simple:

  1. Cloudflare uses Border Gateway Protocol (BGP) to announce Acme’s prefix from Cloudflare’s edge, with Acme’s permission.
  2. Cloudflare begins ingesting packets destined for the Acme IP prefix.
  3. Magic Transit applies DDoS mitigation and firewall rules to the network traffic. After it is ingested by the Cloudflare network, traffic that would benefit from HTTPS caching and WAF inspection can be “upgraded” to our Layer 7 HTTPS pipeline without incurring additional network hops.
  4. Acme would like Cloudflare to use Generic Routing Encapsulation (GRE) to tunnel traffic back from the Cloudflare Network back to Acme’s datacenter. GRE tunnels are initiated from anycast endpoints back to Acme’s premise. Through the magic of anycast, the tunnels are constantly and simultaneously connected to hundreds of network locations, ensuring the tunnels are highly available and resilient to network failures that would bring down traditionally formed GRE tunnels.
  5. Cloudflare egresses packets bound for Acme over these GRE tunnels.

Let’s dive deeper on how the DDoS mitigation included in Magic Transit works.

Magic Transit protects networks from DDoS attack

Customers deploying Cloudflare Magic Transit instantly get access to the same IP-layer DDoS protection system that has protected the Cloudflare Network for the past 9 years. This is the same mitigation system that stopped a 942Gbps attack dead in its tracks, in seconds. This is the same mitigation system that knew how to stop memcached amplification attacks days before a 1.3Tbps attack took down Github, which did not have Cloudflare watching its back. This is the same mitigation we trust every day to protect Cloudflare, and now it protects your network.

Cloudflare has historically protected Layer 7 HTTP and HTTPS applications from attacks at all layers of the OSI Layer model. The DDoS protection our customers have come to know and love relies on a blend of techniques, but can be broken into a few complementary defenses:

  1. Anycast and a network presence in 193 cities around the world allows our network to get close to users and attackers, allowing us to soak up traffic close to the source without introducing significant latency.
  2. 30+Tbps of network capacity allows us to soak up a lot of traffic close to the source. Cloudflare's network has more capacity to stop DDoS attacks than that of Akamai Prolexic, Imperva, Neustar, and Radware — combined.
  3. Our HTTPS reverse proxy absorbs L3 (IP layer) and L4 (TCP layer) attacks by terminating connections and re-establishing them to the origin. This stops most spurious packet transmissions from ever getting close to a customer origin server.
  4. Layer 7 mitigations and rate limiting stop floods at the HTTPS application layer.

Looking at the above description carefully, you might notice something: our reverse proxy servers protect our customers by terminating connections, but our network and servers still get slammed by the L3 and 4 attacks we stop on behalf of our customers. How do we protect our own infrastructure from these attacks?

Enter Gatebot!

Gatebot is a suite of software running on every one of our servers inside each of our datacenters in the 193 cities we operate, constantly analyzing and blocking attack traffic. Part of Gatebot’s beauty is its simple architecture; it sits silently, in wait, sampling packets as they pass from the network card into the kernel and onward into userspace. Gatebot does not have a learning or warm-up period. As soon as it detects an attack, it instructs the kernel of the machine it is running on to drop the packet, log its decision, and move on.

Historically, if you wanted to protect your network from a DDoS attack, you might have purchased a specialized piece of hardware to sit at the perimeter of your network. This hardware box (let’s call it “The DDoS Protection Box”) would have been fantastically expensive, pretty to look at (as pretty as a 2U hardware box could get), and required a ton of recurring effort and money to stay on its feet, keep its licence up to date, and keep its attack detection system accurate and trained.

For one thing, it would have to be carefully monitored to make sure it was stopping attacks but not stopping legitimate traffic. For another, if an attacker managed to generate enough traffic to saturate your datacenter’s transit links to the Internet, you were out of luck; no box sitting inside your datacenter can protect you from an attack generating enough traffic to congest the links running from the outside world to the datacenter itself.

Early on, Cloudflare considered buying The DDoS Protection Box(es) to protect our various network locations, but ruled them out quickly. Buying hardware would have incurred substantial cost and complexity. In addition, buying, racking, and managing specialized pieces of hardware makes a network hard to scale. There had to be a better way. We set out to solve this problem ourselves, starting from first principles and modern technology.

To make our modern approach to DDoS mitigation work, we had to invent a suite of tools and techniques to allow us to do ultra-high performance networking on a generic x86 server running Linux.

At the core of our network data plane is the eXpress Data Path (XDP) and the extended Berkeley Packet Filter (eBPF), a set of APIs that allow us to build ultra-high performance networking applications in the Linux kernel. My colleagues have written extensively about how we use XDP and eBPF to stop DDoS attacks:

At the end of the day, we ended up with a DDoS mitigation system that:

  • Is delivered by our entire network, spread across 193 cities around the world. To put this another way, our network doesn’t have the concept of “scrubbing centers” — every single one of our network locations is always mitigating attacks, all the time. This means faster attack mitigation and minimal latency impact for your users.
  • Has exceptionally fast times to mitigate, with most attacks mitigated in 10s or less.
  • Was built in-house, giving us deep visibility into its behavior and the ability to rapidly develop new mitigations as we see new attack types.
  • Is deployed as a service, and is horizontally scalable. Adding x86 hardware running our DDoS mitigation software stack to a datacenter (or adding another network location) instantly brings more DDoS mitigation capacity online.

Gatebot is designed to protect Cloudflare infrastructure from attack. And today, as part of Magic Transit, customers operating their own IP networks and infrastructure can rely on Gatebot to protect their own network.

Magic Transit puts your network hardware in the cloud

We’ve covered how Cloudflare Magic Transit connects your network to the Internet, and how it protects you from DDoS attack. If you were running your network the old-fashioned way, this is where you’d stop to buy firewall hardware, and maybe another box to do load balancing.

With Magic Transit, you don’t need those boxes. We have a long track record of delivering common network functions (firewalls, load balancers, etc.) as services. Up until this point, customers deploying our services have relied on DNS to bring traffic to our edge, after which our Layer 3 (IP), Layer 4 (TCP & UDP), and Layer 7 (HTTP, HTTPS, and DNS) stacks take over and deliver performance and security to our customers.

Magic Transit is designed to handle your entire network, but does not enforce a one-size-fits-all approach to what services get applied to which portion of your traffic. To revisit Acme, our example customer from above, they have brought to the Cloudflare Network. This represents 256 IPv4 addresses, some of which (eg might front load balancers and HTTP servers, others mail servers, and others still custom UDP-based applications.

Each of these sub-ranges may have different security and traffic management requirements. Magic Transit allows you to configure specific IP addresses with their own suite of services, or apply the same configuration to large portions (or all) of your block.

Taking the above example, Acme may wish that the block containing HTTP services fronted by a traditional hardware load balancer instead deploy the Cloudflare Load Balancer, and also wants HTTP traffic analyzed with Cloudflare’s WAF and content cached by our CDN. With Magic Transit, deploying these network functions is straight-forward — a few clicks in our dashboard or API calls will have your traffic handled at a higher layer of network abstraction, with all the attendant goodies applying application level load balancing, firewall, and caching logic bring.

This is just one example of a deployment customers might pursue. We’ve worked with several who just want pure IP passthrough, with DDoS mitigation applied to specific IP addresses. Want that? We got you!

Magic Transit runs on the entire Cloudflare Global Network. Or, no more scrubs!

When you connect your network to Cloudflare Magic Transit, you get access to the entire Cloudflare network. This means all of our network locations become your network locations. Our network capacity becomes your network capacity, at your disposal to power your experiences, deliver your content, and mitigate attacks on your infrastructure.

How expansive is the Cloudflare Network? We’re in 193 cities worldwide, with more than 30Tbps of network capacity spread across them. Cloudflare operates within 100 milliseconds of 98% of the Internet-connected population in the developed world, and 93% of the Internet-connected population globally (for context, the blink of an eye is 300-400 milliseconds).

Magic Transit makes your network smarter, better, stronger, and cheaper to operateAreas of the globe within 100 milliseconds of a Cloudflare datacenter.

Just as we built our own products in house, we also built our network in house. Every product runs in every datacenter, meaning our entire network delivers all of our services. This might not have been the case if we had assembled our product portfolio piecemeal through acquisition, or not had completeness of vision when we set out to build our current suite of services.

The end result for customers of Magic Transit: a network presence around the globe as soon you come on board. Full access to a diverse set of services worldwide. All delivered with latency and performance in mind.

We'll be sharing a lot more technical detail on how we deliver Magic Transit in the coming weeks and months.

Magic Transit lowers total cost of ownership

Traditional network services don’t come cheap; they require high capital outlays up front, investment in staff to operate, and ongoing maintenance contracts to stay functional. Just as our product aims to be disruptive technically, we want to disrupt traditional network cost-structures as well.

Magic Transit is delivered and billed as a service. You pay for what you use, and can add services at any time. Your team will thank you for its ease of management; your management will thank you for its ease of accounting. That sounds pretty good to us!

Magic Transit is available today

We’ve worked hard over the past nine years to get our network, management tools, and network functions as a service into the state they’re in today. We’re excited to get the tools we use every day in customers’ hands.

So that brings us to naming. When we showed this to customers the most common word they used was ‘whoa.’ When we pressed what they meant by that they almost all said: ‘It’s so much better than any solution we’ve seen before. It’s, like, magic!’ So it seems only natural, if a bit cheesy, that we call this product what it is: Magic Transit.

We think this is all pretty magical, and think you will too. Contact our Enterprise Sales Team today.

Categories: Technology

Magic Transit: Network functions at Cloudflare scale

Tue, 13/08/2019 - 14:00
 Network functions at Cloudflare scale

Today we announced Cloudflare Magic Transit, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions with Internet users and pass that data through new sessions they create with origin servers. With Magic Transit, we are now also operating at the IP layer: in addition to terminating sessions, our servers are applying a suite of network functions (DoS mitigation, firewalling, routing, and so on) on a packet-by-packet basis.

Over the past nine years, we’ve built a robust, scalable global network that currently spans 193 cities in over 90 countries and is ever growing. All Cloudflare customers benefit from this scale thanks to two important techniques. The first is anycast networking. Cloudflare was an early adopter of anycast, using this routing technique to distribute Internet traffic across our data centers. It means that any data center can handle any customer’s traffic, and we can spin up new data centers without needing to acquire and provision new IP addresses. The second technique is homogeneous server architecture. Every server in each of our edge data centers is capable of running every task. We build our servers on commodity hardware, making it easy to quickly increase our processing capacity by adding new servers to existing data centers. Having no specialty hardware to depend on has also led us to develop an expertise in pushing the limits of what’s possible in networking using modern Linux kernel techniques.

Magic Transit is built on the same network using the same techniques, meaning our customers can now run their network functions at Cloudflare scale. Our fast, secure, reliable global edge becomes our customers’ edge. To explore how this works, let’s follow the journey of a packet from a user on the Internet to a Magic Transit customer’s network.

Putting our DoS mitigation to work… for you!

In the announcement blog post we describe an example deployment for Acme Corp. Let’s continue with this example here. When Acme brings their IP prefix to Cloudflare, we start announcing that prefix to our transit providers, peers, and to Internet exchanges in each of our data centers around the globe. Additionally, Acme stops announcing the prefix to their own ISPs. This means that any IP packet on the Internet with a destination address within Acme’s prefix is delivered to a nearby Cloudflare data center, not to Acme’s router.

Let’s say I want to access Acme’s FTP server on from my computer in Cloudflare’s office in Champaign, IL. My computer generates a TCP SYN packet with destination address and sends it out to the Internet. Thanks to anycast, that packet ends up at Cloudflare’s data center in Chicago, which is the closest data center (in terms of Internet routing distance) to Champaign. The packet arrives on the data center’s router, which uses ECMP (Equal Cost Multi-Path) routing to select which server should handle the packet and dispatches the packet to the selected server.

Once at the server, the packet flows through our XDP- and iptables-based DoS detection and mitigation functions. If this TCP SYN packet were determined to be part of an attack, it would be dropped and that would be the end of it. Fortunately for me, the packet is permitted to pass.

So far, this looks exactly like any other traffic on Cloudflare’s network. Because of our expertise in running a global anycast network we’re able to attract Magic Transit customer traffic to every data center and apply the same DoS mitigation solution that has been protecting Cloudflare for years. Our DoS solution has handled some of the largest attacks ever recorded, including a 942Gbps SYN flood in 2018. Below is a screenshot of a recent SYN flood of 300M packets per second. Our architecture lets us scale to stop the largest attacks.

 Network functions at Cloudflare scaleNetwork namespaces for isolation and control

The above looked identical to how all other Cloudflare traffic is processed, but this is where the similarities end. For our other services, the TCP SYN packet would now be dispatched to a local proxy process (e.g. our nginx-based HTTP/S stack). For Magic Transit, we instead want to dynamically provision and apply customer-defined network functions like firewalls and routing. We needed a way to quickly spin up and configure these network functions while also providing inter-network isolation. For that, we turned to network namespaces.

Namespaces are a collection of Linux kernel features for creating lightweight virtual instances of system resources that can be shared among a group of processes. Namespaces are a fundamental building block for containerization in Linux. Notably, Docker is built on Linux namespaces. A network namespace is an isolated instance of the Linux network stack, including its own network interfaces (with their own eBPF hooks), routing tables, netfilter configuration,, and so on. Network namespaces give us a low-cost mechanism to rapidly apply customer-defined network configurations in isolation, all with built-in Linux kernel features so there’s no performance hit from userspace packet forwarding or proxying.

When a new customer starts using Magic Transit, we create a brand new network namespace for that customer on every server across our edge network (did I mention that every server can run every task?). We built a daemon that runs on our servers and is responsible for managing these network namespaces and their configurations. This daemon is constantly reading configuration updates from Quicksilver, our globally distributed key-value store, and applying customer-defined configurations for firewalls, routing, etc, inside the customer’s namespace. For example, if Acme wants to provision a firewall rule to allow FTP traffic (TCP ports 20 and 21) to, that configuration is propagated globally through Quicksilver and the Magic Transit daemon applies the firewall rule by adding an nftables rule to the Acme customer namespace:

# Apply nftables rule inside Acme’s namespace $ sudo ip netns exec acme_namespace nft add rule inet filter prerouting ip daddr tcp dport 20-21 accept

Getting the customer’s traffic to their network namespace requires a little routing configuration in the default network namespace. When a network namespace is created, a pair of virtual ethernet (veth) interfaces is also created: one in the default namespace and one in the newly created namespace. This interface pair creates a “virtual wire” for delivering network traffic into and out of the new network namespace. In the default network namespace, we maintain a routing table that forwards Magic Transit customer IP prefixes to the veths corresponding to those customers’ namespaces. We use iptables to mark the packets that are destined for Magic Transit customer prefixes, and we have a routing rule that specifies that these specially marked packets should use the Magic Transit routing table.

(Why go to the trouble of marking packets in iptables and maintaining a separate routing table? Isolation. By keeping Magic Transit routing configurations separate we reduce the risk of accidentally modifying the default routing table in a way that affects how non-Magic Transit traffic flows through our edge.)

 Network functions at Cloudflare scale

Network namespaces provide a lightweight environment where a Magic Transit customer can run and manage network functions in isolation, letting us put full control in the customer’s hands.

GRE + anycast = magic

After passing through the edge network functions, the TCP SYN packet is finally ready to be delivered back to the customer’s network infrastructure. Because Acme Corp. does not have a network footprint in a colocation facility with Cloudflare, we need to deliver their network traffic over the public Internet.

This poses a problem. The destination address of the TCP SYN packet is, but the only network announcing the IP prefix on the Internet is Cloudflare. This means that we can’t simply forward this packet out to the Internet—it will boomerang right back to us! In order to deliver this packet to Acme we need to use a technique called tunneling.

Tunneling is a method of carrying traffic from one network over another network. In our case, it involves encapsulating Acme’s IP packets inside of IP packets that can be delivered to Acme’s router over the Internet. There are a number of common tunneling protocols, but Generic Routing Encapsulation (GRE) is often used for its simplicity and widespread vendor support.

GRE tunnel endpoints are configured both on Cloudflare’s servers (inside of Acme’s network namespace) and on Acme’s router. Cloudflare servers then encapsulate IP packets destined for inside of IP packets destined for a publicly-routable IP address for Acme’s router, which decapsulates the packets and emits them into Acme’s internal network.

 Network functions at Cloudflare scale

Now, I’ve omitted an important detail in the diagram above: the IP address of Cloudflare’s side of the GRE tunnel. Configuring a GRE tunnel requires specifying an IP address for each side, and the outer IP header for packets sent over the tunnel must use these specific addresses. But Cloudflare has thousands of servers, each of which may need to deliver packets to the customer through a tunnel. So how many Cloudflare IP addresses (and GRE tunnels) does the customer need to talk to? The answer: just one, thanks to the magic of anycast.

Cloudflare uses anycast IP addresses for our GRE tunnel endpoints, meaning that any server in any data center is capable of encapsulating and decapsulating packets for the same GRE tunnel. How is this possible? Isn’t a tunnel a point-to-point link? The GRE protocol itself is stateless—each packet is processed independently and without requiring any negotiation or coordination between tunnel endpoints. While the tunnel is technically bound to an IP address it need not be bound to a specific device. Any device that can strip off the outer headers and then route the inner packet can handle any GRE packet sent over the tunnel. Actually, in the context of anycast the term “tunnel” is misleading since it implies a link between two fixed points. With Cloudflare’s Anycast GRE, a single “tunnel” gives you a conduit to every server in every data center on Cloudflare’s global edge.

 Network functions at Cloudflare scale

One very powerful consequence of Anycast GRE is that it eliminates single points of failure. Traditionally, GRE-over-Internet can be problematic because an Internet outage between the two GRE endpoints fully breaks the “tunnel”. This means reliable data delivery requires going through the headache of setting up and maintaining redundant GRE tunnels terminating at different physical sites and rerouting traffic when one of the tunnels breaks. But because Cloudflare is encapsulating and delivering customer traffic from every server in every data center, there is no single “tunnel” to break. This means Magic Transit customers can enjoy the redundancy and reliability of terminating tunnels at multiple physical sites while only setting up and maintaining a single GRE endpoint, making their jobs simpler.

Our scale is now your scale

Magic Transit is a powerful new way to deploy network functions at scale. We’re not just giving you a virtual instance, we’re giving you a global virtual edge. Magic Transit takes the hardware appliances you would typically rack in your on-prem network and distributes them across every server in every data center in Cloudflare’s network. This gives you access to our global anycast network, our fleet of servers capable of running your tasks, and our engineering expertise building fast, reliable, secure networks. Our scale is now your scale.

Categories: Technology

The Serverlist: Building out the SHAMstack

Fri, 09/08/2019 - 17:01
 Building out the SHAMstack

Check out our seventh edition of The Serverlist below. Get the latest scoop on the serverless space, get your hands dirty with new developer tutorials, engage in conversations with other serverless developers, and find upcoming meetups and conferences to attend.

Sign up below to have The Serverlist sent directly to your mailbox.

.newsletter .visually-hidden { position: absolute; white-space: nowrap; width: 1px; height: 1px; overflow: hidden; border: 0; padding: 0; clip: rect(0 0 0 0); clip-path: inset(50%); } .newsletter form { display: flex; flex-direction: row; margin-bottom: 1em; } .newsletter input[type="email"], .newsletter button[type="submit"] { font: inherit; line-height: 1.5; padding-top: .5em; padding-bottom: .5em; border-radius: 3px; } .newsletter input[type="email"] { padding-left: .8em; padding-right: .8em; margin: 0; margin-right: .5em; box-shadow: none; border: 1px solid #ccc; } .newsletter input[type="email"]:focus { border: 1px solid #3279b3; } .newsletter button[type="submit"] { padding-left: 1.25em; padding-right: 1.25em; background-color: #f18030; color: #fff; } .newsletter .privacy-link { font-size: .9em; } Email Submit Your privacy is important to us newsletterForm.addEventListener('submit', async function(e) { e.preventDefault() fetch('', { method: 'POST', body: newsletterForm.elements[0].value }).then(async res => { const thing = await res.text() newsletterForm.innerHTML = thing const homeURL = '' if (window.location.href !== homeURL) { window.setTimeout(_ => { window.location = homeURL }, 5000) } }) }) iframe[seamless]{ background-color: transparent; border: 0 none transparent; padding: 0; overflow: hidden; } const magic = document.getElementById('magic') function resizeIframe() { const iframeDoc = magic.contentDocument const iframeWindow = magic.contentWindow magic.height = iframeDoc.body.clientHeight magic.width = "100%" const injectedStyle = iframeDoc.createElement('style') injectedStyle.innerHTML = ` body { background: white !important; } ` magic.contentDocument.head.appendChild(injectedStyle) function onFinish() { setTimeout(() => { = '' }, 80) } if (iframeDoc.readyState === 'loading') { iframeWindow.addEventListener('load', onFinish) } else { onFinish() } } async function fetchURL(url) { magic.addEventListener('load', resizeIframe) const call = await fetch(`${url}`) const text = await call.text() const divie = document.createElement("div") divie.innerHTML = text const listie = divie.getElementsByTagName("a") for (var i = 0; i < listie.length; i++) { listie[i].setAttribute("target", "_blank") } magic.scrolling = "no" magic.srcdoc = divie.innerHTML } fetchURL("")
Categories: Technology

Introducing Certificate Transparency Monitoring

Thu, 08/08/2019 - 23:00
Introducing Certificate Transparency MonitoringIntroducing Certificate Transparency Monitoring

Today we’re launching Certificate Transparency Monitoring (my summer project as an intern!) to help customers spot malicious certificates. If you opt into CT Monitoring, we’ll send you an email whenever a certificate is issued for one of your domains. We crawl all public logs to find these certificates quickly. CT Monitoring is available now in public beta and can be enabled in the Crypto Tab of the Cloudflare dashboard.


Most web browsers include a lock icon in the address bar. This icon is actually a button — if you’re a security advocate or a compulsive clicker (I’m both), you’ve probably clicked it before! Here’s what happens when you do just that in Google Chrome:

Introducing Certificate Transparency Monitoring

This seems like good news. The Cloudflare blog has presented a valid certificate, your data is private, and everything is secure. But what does this actually mean?


Your browser is performing some behind-the-scenes work to keep you safe. When you request a website (say,, the website should present a certificate that proves its identity. This certificate is like a stamp of approval: it says that your connection is secure. In other words, the certificate proves that content was not intercepted or modified while in transit to you. An altered Cloudflare site would be problematic, especially if it looked like the actual Cloudflare site. Certificates protect us by including information about websites and their owners.

We pass around these certificates because the honor system doesn’t work on the Internet. If you want a certificate for your own website, just request one from a Certificate Authority (CA), or sign up for Cloudflare and we’ll do it for you! CAs issue certificates just as real-life notaries stamp legal documents. They confirm your identity, look over some data, and use their special status to grant you a digital certificate. Popular CAs include DigiCert, Let’s Encrypt, and Sectigo. This system has served us well because it has kept imposters in check, but also promoted trust between domain owners and their visitors.

Introducing Certificate Transparency Monitoring

Unfortunately, nothing is perfect.

It turns out that CAs make mistakes. In rare cases, they become reckless. When this happens, illegitimate certificates are issued (even though they appear to be authentic). If a CA accidentally issues a certificate for your website, but you did not request the certificate, you have a problem. Whoever received the certificate might be able to:

  1. Steal login credentials from your visitors.
  2. Interrupt your usual services by serving different content.

These attacks do happen, so there’s good reason to care about certificates. More often, domain owners lose track of their certificates and panic when they discover unexpected certificates. We need a way to prevent these situations from ruining the entire system.

Certificate Transparency

Ah, Certificate Transparency (CT). CT solves the problem I just described by making all certificates public and easy to audit. When CAs issue certificates, they must submit certificates to at least two “public logs.” This means that collectively, the logs carry important data about all trusted certificates on the Internet. Several companies offer CT logs — Google has launched a few of its own. We announced Cloudflare's Nimbus log last year.

Logs are really, really big, and often hold hundreds of millions of certificate records.

Introducing Certificate Transparency Monitoring

The log infrastructure helps browsers validate websites’ identities. When you request in Safari or Google Chrome, the browser will actually require Cloudflare’s certificate to be registered in a CT log. If the certificate isn’t found in a log, you won’t see the lock icon next to the address bar. Instead, the browser will tell you that the website you’re trying to access is not secure. Are you going to visit a website marked “NOT SECURE”? Probably not.

There are systems that audit CT logs and report illegitimate certificates. Therefore, if your browser finds a valid certificate that is also trusted in a log, everything is secure.

What We're Announcing Today

Cloudflare has been an industry leader in CT. In addition to Nimbus, we launched a CT dashboard called Merkle Town and explained how we made it. Today, we’re releasing a public beta of Certificate Transparency Monitoring.

If you opt into CT Monitoring, we’ll send you an email whenever a certificate is issued for one of your domains. When you get an alert, don’t panic; we err on the side of caution by sending alerts whenever a possible domain match is found. Sometimes you may notice a suspicious certificate. Maybe you won’t recognize the issuer, or the subdomain is not one you offer (e.g. Alerts are sent quickly so you can contact a CA if something seems wrong.

Introducing Certificate Transparency Monitoring

This raises the question: if services already audit public logs, why are alerts necessary? Shouldn’t errors be found automatically? Well no, because auditing is not exhaustive. The best person to audit your certificates is you. You know your website. You know your personal information. Cloudflare will put relevant certificates right in front of you.

You can enable CT Monitoring on the Cloudflare dashboard. Just head over to the Crypto Tab and find the “Certificate Transparency Monitoring” card. You can always turn the feature off if you’re too popular in the CT world.

Introducing Certificate Transparency Monitoring

If you’re on a Business or Enterprise plan, you can tell us who to notify. Instead of emailing the zone owner (which we do for Free and Pro customers), we accept up to 10 email addresses as alert recipients. We do this to avoid overwhelming large teams. These emails do not have to be tied to a Cloudflare account and can be manually added or removed at any time.

Introducing Certificate Transparency Monitoring

How This Actually Works

Our Cryptography and SSL teams worked hard to make this happen; they built on the work of some clever tools mentioned earlier:

  • Merkle Town is a hub for CT data. We process all trusted certificates and present relevant statistics on our website. This means that every certificate issued on the Internet passes through Cloudflare, and all the data is public (so no privacy concerns here).
  • Cloudflare Nimbus is our very own CT log. It contains more than 400 million certificates.

Introducing Certificate Transparency MonitoringNote: Cloudflare, Google, and DigiCert are not the only CT log providers.

So here’s the process... At some point in time, you (or an impostor) request a certificate for your website. A Certificate Authority approves the request and issues the certificate. Within 24 hours, the CA sends this certificate to a set of CT logs. This is where we come in: Cloudflare uses an internal process known as “The Crawler” to look through millions of certificate records. Merkle Town dispatches The Crawler to monitor CT logs and check for new certificates. When The Crawler finds a new certificate, it pulls the entire certificate through Merkle Town.

Introducing Certificate Transparency Monitoring

When we process the certificate in Merkle Town, we also check it against a list of monitored domains. If you have CT Monitoring enabled, we’ll send you an alert immediately. This is only possible because of Merkle Town’s existing infrastructure. Also, The Crawler is ridiculously fast.

Introducing Certificate Transparency Monitoring

I Got a Certificate Alert. What Now?

Good question. Most of the time, certificate alerts are routine. Certificates expire and renew on a regular basis, so it’s totally normal to get these emails. If everything looks correct (the issuer, your domain name, etc.), go ahead and toss that email in the trash.

In rare cases, you might get an email that looks suspicious. We provide a detailed support article that will help. The basic protocol is this:

  1. Contact the CA (listed as “Issuer” in the email).
  2. Explain why you think the certificate is suspicious.
  3. The CA should revoke the certificate (if it really is malicious).

We also have a friendly support team that can be reached here. While Cloudflare is not at CA and cannot revoke certificates, our support team knows quite a bit about certificate management and is ready to help.

The FutureIntroducing Certificate Transparency Monitoring

Certificate Transparency has started making regular appearances on the Cloudflare blog. Why? It’s required by Chrome and Safari, which dominate the browser market and set precedents for Internet security. But more importantly, CT can help us spot malicious certificates before they are used in attacks. This is why we will continue to refine and improve our certificate detection methods.

What are you waiting for? Go enable Certificate Transparency Monitoring!

Categories: Technology

Terminating Service for 8Chan

Mon, 05/08/2019 - 02:44

The mass shootings in El Paso, Texas and Dayton, Ohio are horrific tragedies. In the case of the El Paso shooting, the suspected terrorist gunman appears to have been inspired by the forum website known as 8chan. Based on evidence we've seen, it appears that he posted a screed to the site immediately before beginning his terrifying attack on the El Paso Walmart killing 20 people.

Unfortunately, this is not an isolated incident. Nearly the same thing happened on 8chan before the terror attack in Christchurch, New Zealand. The El Paso shooter specifically referenced the Christchurch incident and appears to have been inspired by the largely unmoderated discussions on 8chan which glorified the previous massacre. In a separate tragedy, the suspected killer in the Poway, California synagogue shooting also posted a hate-filled “open letter” on 8chan. 8chan has repeatedly proven itself to be a cesspool of hate.

8chan is among the more than 19 million Internet properties that use Cloudflare's service. We just sent notice that we are terminating 8chan as a customer effective at midnight tonight Pacific Time. The rationale is simple: they have proven themselves to be lawless and that lawlessness has caused multiple tragic deaths. Even if 8chan may not have violated the letter of the law in refusing to moderate their hate-filled community, they have created an environment that revels in violating its spirit.

We do not take this decision lightly. Cloudflare is a network provider. In pursuit of our goal of helping build a better internet, we’ve considered it important to provide our security services broadly to make sure as many users as possible are secure, and thereby making cyberattacks less attractive — regardless of the content of those websites.  Many of our customers run platforms of their own on top of our network. If our policies are more conservative than theirs it effectively undercuts their ability to run their services and set their own policies. We reluctantly tolerate content that we find reprehensible, but we draw the line at platforms that have demonstrated they directly inspire tragic events and are lawless by design. 8chan has crossed that line. It will therefore no longer be allowed to use our services.

What Will Happen Next

Unfortunately, we have seen this situation before and so we have a good sense of what will play out. Almost exactly two years ago we made the determination to kick another disgusting site off Cloudflare's network: the Daily Stormer. That caused a brief interruption in the site's operations but they quickly came back online using a Cloudflare competitor. That competitor at the time promoted as a feature the fact that they didn't respond to legal process. Today, the Daily Stormer is still available and still disgusting. They have bragged that they have more readers than ever. They are no longer Cloudflare's problem, but they remain the Internet's problem.

I have little doubt we'll see the same happen with 8chan. While removing 8chan from our network takes heat off of us, it does nothing to address why hateful sites fester online. It does nothing to address why mass shootings occur. It does nothing to address why portions of the population feel so disenchanted they turn to hate. In taking this action we've solved our own problem, but we haven't solved the Internet's.

In the two years since the Daily Stormer what we have done to try and solve the Internet’s deeper problem is engage with law enforcement and civil society organizations to try and find solutions. Among other things, that resulted in us cooperating around monitoring potential hate sites on our network and notifying law enforcement when there was content that contained an indication of potential violence. We will continue to work within the legal process to share information when we can to hopefully prevent horrific acts of violence. We believe this is our responsibility and, given Cloudflare's scale and reach, we are hopeful we will continue to make progress toward solving the deeper problem.

Rule of Law

We continue to feel incredibly uncomfortable about playing the role of content arbiter and do not plan to exercise it often. Some have wrongly speculated this is due to some conception of the United States' First Amendment. That is incorrect. First, we are a private company and not bound by the First Amendment. Second, the vast majority of our customers, and more than 50% of our revenue, comes from outside the United States where the First Amendment and similarly libertarian freedom of speech protections do not apply. The only relevance of the First Amendment in this case and others is that it allows us to choose who we do and do not do business with; it does not obligate us to do business with everyone.

Instead our concern has centered around another much more universal idea: the Rule of Law. The Rule of Law requires policies be transparent and consistent. While it has been articulated as a framework for how governments ensure their legitimacy, we have used it as a touchstone when we think about our own policies.

We have been successful because we have a very effective technological solution that provides security, performance, and reliability in an affordable and easy-to-use way. As a result of that, a huge portion of the Internet now sits behind our network. 10% of the top million, 17% of the top 100,000, and 19% of the top 10,000 Internet properties use us today. 10% of the Fortune 1,000 are paying Cloudflare customers.

Cloudflare is not a government. While we've been successful as a company, that does not give us the political legitimacy to make determinations on what content is good and bad. Nor should it. Questions around content are real societal issues that need politically legitimate solutions. We will continue to engage with lawmakers around the world as they set the boundaries of what is acceptable in their countries through due process of law. And we will comply with those boundaries when and where they are set.

Europe, for example, has taken a lead in this area. As we've seen governments there attempt to address hate and terror content online, there is recognition that different obligations should be placed on companies that organize and promote content — like Facebook and YouTube — rather than those that are mere conduits for that content. Conduits, like Cloudflare, are not visible to users and therefore cannot be transparent and consistent about their policies.

The unresolved question is how should the law deal with platforms that ignore or actively thwart the Rule of Law? That's closer to the situation we have seen with the Daily Stormer and 8chan. They are lawless platforms. In cases like these, where platforms have been designed to be lawless and unmoderated, and where the platforms have demonstrated their ability to cause real harm, the law may need additional remedies. We and other technology companies need to work with policy makers in order to help them understand the problem and define these remedies. And, in some cases, it may mean moving enforcement mechanisms further down the technical stack.

Our Obligation

Cloudflare's mission is to help build a better Internet. At some level firing 8chan as a customer is easy. They are uniquely lawless and that lawlessness has contributed to multiple horrific tragedies. Enough is enough.

What's hard is defining the policy that we can enforce transparently and consistently going forward. We, and other technology companies like us that enable the great parts of the Internet, have an obligation to help propose solutions to deal with the parts we're not proud of. That's our obligation and we're committed to it.

Unfortunately the action we take today won’t fix hate online. It will almost certainly not even remove 8chan from the Internet. But it is the right thing to do. Hate online is a real issue. Here are some organizations that have active work to help address it:

Our whole Cloudflare team’s thoughts are with the families grieving in El Paso, Texas and Dayton, Ohio this evening.

Categories: Technology

Why I’m Helping Cloudflare Grow in Australia & New Zealand (A/NZ)

Wed, 31/07/2019 - 21:10
Why I’m Helping Cloudflare Grow in Australia & New Zealand (A/NZ)Why I’m Helping Cloudflare Grow in Australia & New Zealand (A/NZ)

I’ve recently joined Cloudflare as Head of Australia and New Zealand (A/NZ). This is an important time for the company as we continue to grow our presence locally to address the demand in A/NZ, recruit local talent, and build on the successes we’ve had in our other offices around the globe. In this new role, I’m eager to grow our brand recognition in A/NZ and optimise our reach to customers by building up my team and channel presence.

A little about me

I’m a Melburnian born and bred (most livable city in the world!) with more than 20 years of experience in our market. From guiding strategy and architecture of the region’s largest resources company, BHP, to building and running teams and channels, and helping customers solve the technical challenges of their time, I have been in, or led, businesses in the A/NZ Enterprise market, with a focus on network and security for the last six years.

Why Cloudflare?

I joined Cloudflare because I strongly believe in its mission to help build a better Internet, and believe this mission, paired with its massive global network, will enable the company to continue to deliver incredibly innovative solutions to customers of all segments.

Four years ago, I was lucky to build and lead the VMware Network & Security business, working with some of Cloudflare’s biggest A/NZ customers. I was confronted with the full extent of the security challenges that A/NZ businesses face. I recognized that there must be a better way to help customers secure their local and multi-cloud environments. That's how I found Cloudflare. With Cloudflare's Global Cloud Platform, businesses have an integrated solution that offers the best in security, performance and reliability.

Second, something that’s personally important for me as the son of Italian migrants, and now a dad of two gorgeous daughters, is that Cloudflare is serious about culture and diversity. When I was considering joining Cloudflare, I watched videos from the Internet Summit, an annual event that Cloudflare hosts in its San Francisco office. One thing that really stood out to me was that the speakers came from so many different backgrounds.

I’m extremely passionate about encouraging those from all walks of life to pursue opportunities in business and tech, so seeing the diversity of people giving insightful talks made me realise that this was a company I wanted to work for, and hopefully perhaps my girls as well (no pressure).

Cloudflare A/NZ

I strongly believe that Cloudflare’s mission, paired with its massive global network, will enable customers of all sizes in segments in Australia and New Zealand to leverage Cloudflare’s security, performance and reliability solutions.

For example, VicRoads is 85 percent faster now that they are using Argo Smart Routing, Ansarada uses Cloudflare’s WAF to protect against malicious activity, and MyAffiliates harnesses Cloudflare’s global network, which spans more than 180 cities in 80 countries, to ensure an interruption-free service for its customers.

Making security and speed, which are necessary for any strong business, available to anyone with an Internet property is truly a noble goal. That’s another one of the reasons I’m most excited to work at Cloudflare.

Australians and Kiwis alike have always been great innovators and users of technology. However, being so physically isolated (Perth is the most isolated city in the world and A/NZ are far from pretty much everywhere else in the world) has limited our ability to have the diversity of choice and competition. Our isolation from said choice and competition fueled innovation, but at the price of complexity, cost, and ease. This makes having local servers absolutely vital for good performance. With Cloudflare’s expansive network, 98 percent of the Internet-connected developed world is located within 100 milliseconds of our network. In fact, Cloudflare already has data centers in Auckland, Brisbane, Melbourne, Perth, and Sydney, ensuring that customers in A/NZ have access to a secure, fast, and reliable Internet.

Our opportunities in Australia, New Zealand and beyond...

I’m truly looking forward to helping Cloudflare grow its reach over the next five years. If you are a business in Australia and New Zealand and have a cyber-security, performance or reliability need, get in touch with us (1300 748 959). We’d love to explore how we can help.

If you’re interested in exploring careers at Cloudflare, we are hiring globally. Our team in Australia is small today, about a dozen, and we are growing quickly. We have open roles in Solutions Engineering and Business Development Representatives. Check out our careers page to learn more, or send me a note.

Categories: Technology

Securing infrastructure at scale with Cloudflare Access

Fri, 19/07/2019 - 14:00
Securing infrastructure at scale with Cloudflare Access

I rarely have to deal with the hassle of using a corporate VPN and I hope it remains this way. As a new member of the Cloudflare team, that seems possible. Coworkers who joined a few years ago did not have that same luck. They had to use a VPN to get any work done. What changed?

Cloudflare released Access, and now we’re able to do our work without ever needing a VPN again. Access is a way to control access to your internal applications and infrastructure. Today, we’re releasing a new feature to help you replace your VPN by deploying Access at an even greater scale.

Access in an instant

Access replaces a corporate VPN by evaluating every request made to a resource secured behind Access. Administrators can make web applications, remote desktops, and physical servers available at dedicated URLs, configured as DNS records in Cloudflare. These tools are protected via access policies, set by the account owner, so that only authenticated users can access those resources. These end users are able to be authenticated over both HTTPS and SSH requests. They’re prompted to login with their SSO credentials and Access redirects them to the application or server.

For your team, Access makes your internal web applications and servers in your infrastructure feel as seamless to reach as your SaaS tools. Originally we built Access to replace our own corporate VPN. In practice, this became the fastest way to control who can reach different pieces of our own infrastructure. However, administrators configuring Access were required to create a discrete policy per each application/hostname. Now, administrators don’t have to create a dedicated policy for each new resource secured by Access; one policy will cover each URL protected.

When Access launched, the product’s primary use case was to secure internal web applications. Creating unique rules for each was tedious, but manageable. Access has since become a centralized way to secure infrastructure in many environments. Now that companies are using Access to secure hundreds of resources, that method of building policies no longer fits.

Starting today, Access users can build policies using a wildcard subdomain to replace the typical bottleneck that occurs when replacing dozens or even hundreds of bespoke rules within a single policy. With a wildcard, the same ruleset will now automatically apply to any subdomain your team generates that is gated by Access.

How can teams deploy at scale with wildcard subdomains?

Administrators can secure their infrastructure with a wildcard policy in the Cloudflare dashboard. With Access enabled, Cloudflare adds identity-based evaluation to that traffic.

In the Access dashboard, you can now build a rule to secure any subdomain of the site you added to Cloudflare. Create a new policy and enter a wildcard tag (“*”) into the subdomain field. You can then configure rules, at a granular level, using your identity provider to control who can reach any subdomain of that apex domain.

Securing infrastructure at scale with Cloudflare Access

This new policy will propagate to all 180 of Cloudflare’s data centers in seconds and any new subdomains created will be protected.

Securing infrastructure at scale with Cloudflare Access

How are teams using it?

Since releasing this feature in a closed beta, we’ve seen teams use it to gate access to their infrastructure in several new ways. Many teams use Access to secure dev and staging environments of sites that are being developed before they hit production. Whether for QA or collaboration with partner agencies, Access helps make it possible to share sites quickly with a layer of authentication. With wildcard subdomains, teams are deploying dozens of versions of new sites at new URLs without needing to touch the Access dashboard.

For example, an administrator can create a policy for “*” and then developers can deploy iterations of sites at “” and “” and both inherit the global Access policy.

The feature is also helping teams lock down their entire hybrid, on-premise, or public cloud infrastructure with the Access SSH feature. Teams can assign dynamic subdomains to their entire fleet of servers, regardless of environment, and developers and engineers can reach them over an SSH connection without a VPN. Administrators can now bring infrastructure online, in an entirely new environment, without additional or custom security rules.

What about creating DNS records?

Cloudflare Access requires users to associate a resource with a domain or subdomain. While the wildcard policy will cover all subdomains, teams will still need to connect their servers to the Cloudflare network and generate DNS records for those services.

Argo Tunnel can reduce that burden significantly. Argo Tunnel lets you expose a server to the Internet without opening any inbound ports. The service runs a lightweight daemon on your server that initiates outbound tunnels to the Cloudflare network.

Instead of managing DNS, network, and firewall complexity, Argo Tunnel helps administrators serve traffic from their origin through Cloudflare with a single command. That single command will generate the DNS record in Cloudflare automatically, allowing you to focus your time on building and managing your infrastructure.

What’s next?

More teams are adopting a hybrid or multi-cloud model for deploying their infrastructure. In the past, these teams were left with just two options for securing those resources: peering a VPN with each provider or relying on custom IAM flows with each environment. In the end, both of these solutions were not only quite costly but also equally unmanageable.

While infrastructure benefits from becoming distributed, security is something that is best when controlled in a single place. Access can consolidate how a team controls who can reach their entire fleet of servers and services.

Categories: Technology

A Tale of Two (APT) Transports

Thu, 18/07/2019 - 15:12
A Tale of Two (APT) Transports

Securing access to your APT repositories is critical. At Cloudflare, like in most organizations, we used a legacy VPN to lock down who could reach our internal software repositories. However, a network perimeter model lacks a number of features that we consider critical to a team’s security.

As a company, we’ve been moving our internal infrastructure to our own zero-trust platform, Cloudflare Access. Access added SaaS-like convenience to the on-premise tools we managed. We started with web applications and then moved resources we need to reach over SSH behind the Access gateway, for example Git or user-SSH access. However, we still needed to handle how services communicate with our internal APT repository.

We recently open sourced a new APT transport which allows customers to protect their private APT repositories using Cloudflare Access. In this post, we’ll outline the history of APT tooling, APT transports and introduce our new APT transport for Cloudflare Access.

A brief history of APT

Advanced Package Tool, or APT, simplifies the installation and removal of software on Debian and related Linux distributions. Originally released in 1998, APT was to Debian what the App Store was to modern smartphones - a decade ahead of its time!

APT sits atop the lower-level dpkg tool, which is used to install, query, and remove .deb packages - the primary software packaging format in Debian and related Linux distributions such as Ubuntu. With dpkg, packaging and managing software installed on your system became easier - but it didn’t solve for problems around distribution of packages, such as via the Internet or local media; at the time of inception, it was commonplace to install packages from a CD-ROM.

APT introduced the concept of repositories - a mechanism for storing and indexing a collection of .deb packages. APT supports connecting to multiple repositories for finding packages and automatically resolving package dependencies. The way APT connects to said repositories is via a “transport” - a mechanism for communicating between the APT client and its repository source (more on this later).

APT over the Internet

Prior to version 1.5, APT did not include support for HTTPS - if you wanted to install a package over the Internet, your connection was not encrypted. This reduces privacy - an attacker snooping traffic could determine specific package version your system is installing. It also exposes you to man-in-the-middle attacks where an attacker could, for example, exploit a remote code execution vulnerability. Just 6 months ago, we saw an example of the latter with CVE-2019-3462.

Enter the APT HTTPS transport - an optional transport you can install to add support for connecting to repositories over HTTPS. Once installed, users need to configure their APT sources.list with repositories using HTTPS.

The challenge here, of course, is that the most common way to install this transport is via APT and HTTP - a classic bootstrapping problem! An alternative here is to download the .deb package via curl and install it via dpkg. You’ll find the links to apt-transport-https binaries for Stretch here - once you have the URL path for your system architecture, you can download it from the mirror-redirector over HTTPS, e.g. for amd64 (a.k.a. x86_64):

curl -o apt-transport-https.deb -L HASH=c8c4366d1912ff8223615891397a78b44f313b0a2f15a970a82abe48460490cb && echo "$HASH apt-transport-https.deb" | sha256sum -c sudo dpkg -i apt-transport-https.deb

To confirm which APT transports are installed on your system, you can list each “method binary” that is installed:

ls /usr/lib/apt/methods

With apt-transport-https installed you should now see ‘https’ in that list.

The state of APT & HTTPS on Debian

You may be wondering how relevant this APT HTTPS transport is today. Given the prevalence of HTTPS on the web today, I was surprised when I found out exactly how relevant it is.

Up until a couple of weeks ago, Debian Stretch (9.x) was the current stable release; 9.0 was first released in June 2017 - and the latest version (9.9) includes apt 1.4.9 by default - meaning that securing your APT communication for Debian Stretch requires installing the optional apt-transport-https package.

Thankfully, on July 6 of this year, Debian released the latest version - Buster - which currently includes apt 1.8.2 with HTTPS support built-in by default, negating the need for installing the apt-transport-https package - and removing the bootstrapping challenge of installing HTTPS support via HTTPS!

BYO HTTPS APT Repository

A powerful feature of APT is the ability to run your own repository. You can mirror a public repository to improve performance or protect against an outage. And if you’re producing your own software packages, you can run your own repository to simplify distribution and installation of your software for your users.

If you have your own APT repository and you’re looking to secure it with HTTPS we’ve offered free Universal SSL since 2014 and last year introduced a way to require it site-wide automatically with one click. You’ll get the benefits of DDoS attack protection, a Global CDN with Caching, and Analytics.

But what if you’re looking for more than just HTTPS for your APT repository? For companies operating private APT repositories, authentication of your APT repository may be a challenge. This is where our new, custom APT transport comes in.

Building custom transports

The system design of APT is powerful in that it supports extensibility via Transport executables, but how does this mechanism work?

When APT attempts to connect to a repository, it finds the executable which matches the “scheme” from the repository URL (e.g. “https://” prefix on a repository results in the “https” executable being called).

APT then uses the common Linux standard streams: stdin, stdout, and stderr. It communicates via stdin/stdout using a set of plain-text Messages, which follow IETF RFC #822 (the same format that .deb “Package” files use).

Examples of input message include “600 URI Acquire”, and examples of output messages include “200 URI Start” and “201 URI Done”:

A Tale of Two (APT) Transports

If you’re interested in building your own transport, check out the APT method interface spec for more implementation details.

APT meets Access

Cloudflare prioritizes dogfooding our own products early and often. The Access product has given our internal DevTools team a chance to work closely with the product team as we build features that help solve use cases across our organization. We’ve deployed new features internally, gathered feedback, improved them, and then released them to our customers. For example, we’ve been able to iterate on tools for Access like the Atlassian SSO plugin and the SSH feature, as collaborative efforts between DevTools and the Access team.

Our DevTools team wanted to take the same dogfooding approach to protect our internal APT repository with Access. We knew this would require a custom APT transport to support generating the required tokens and passing the correct headers in HTTPS requests to our internal APT repository server. We decided to build and test our own transport that both generated the necessary tokens and passed the correct headers to allow us to place our repository behind Access.

After months of internal use, we’re excited to announce that we have recently open-sourced our custom APT transport, so our customers can also secure their APT repositories by enabling authentication via Cloudflare Access.

By protecting your APT repository with Cloudflare Access, you can support authenticating users via Single-Sign On (SSO) providers, defining comprehensive access-control policies, and monitoring access and change logs.

Our APT transport leverages another Open Source tool we provide, cloudflared, which enables users to connect to your Cloudflare-protected domain securely.

Securing your APT Repository

To use our APT transport, you’ll need an APT repository that’s protected by Cloudflare Access. Our instructions (below) for using our transport will use as a hostname.

To use our APT transport with your own web-based APT repository, refer to our Setting Up Access guide.

APT Transport Installation

To install from source, both tools require Go - once you install Go, you can install `cloudflared` and our APT transport with four commands:

go get sudo cp ${GOPATH:-~/go}/bin/cloudflared /usr/local/bin/cloudflared go get sudo cp ${GOPATH:-~/go}/bin/cfd /usr/lib/apt/methods/cfd

The above commands should place the cloudflared executable in /usr/local/bin (which should be on your PATH), and the APT transport binary in the required /usr/lib/apt/methods directory.

To confirm cloudflared is on your path, run:

which cloudflared

The above command should return /usr/local/bin/cloudflared

Now that the custom transport is installed, to start using it simply configure an APT source with the cfd:// rather than https:// e.g:

$ cat /etc/apt/sources.list.d/example.list deb [arch=amd64] cfd:// stable common

Next time you do `apt-get update` and `apt-get install`, a browser window will open asking you to log-in over Cloudflare Access, and your package will be retrieved using the token returned by `cloudflared`.

Fetching a GPG Key over Access

Usually, private APT repositories will use SecureApt and have their own GPG public key that users must install to verify the integrity of data retrieved from that repository.

Users can also leverage cloudflared for securely downloading and installing those keys, e.g:

cloudflared access login cloudflared access curl | sudo apt-key add -

The first command will open your web browser allowing you to authenticate for your domain. The second command wraps curl to download the GPG key, and hands it off to `apt-key add`.

Cloudflare Access on "headless" servers

If you’re looking to deploy APT repositories protected by Cloudflare Access to non-user-facing machines (a.k.a. “headless” servers), opening a browser does not work. The good news is since February, Cloudflare Access supports service tokens - and we’ve built support for them into our APT transport from day one.

If you’d like to use service tokens with our APT transport, it’s as simple as placing the token in a file in the correct path; because the machine already has a token, there is also no dependency on `cloudflared` for authentication. You can find details on how to set-up a service token in the APT transport README.

What’s next?

As demonstrated, you can get started using our APT transport today - we’d love to hear your feedback on this!

This work came out of an internal dogfooding effort, and we’re currently experimenting with additional packaging formats and tooling. If you’re interested in seeing support for another format or tool, please reach out.

Categories: Technology

Cloudflare em Lisboa

Wed, 17/07/2019 - 07:03
Cloudflare em Lisboa

Eu fui o 24º funcionário da Cloudflare e o primeiro a trabalhar fora de São Francisco. A trabalhar num escritorio improvisado em minha casa, e escrevi um pedaço grande do software da Cloudflare antes de ter contratato uma equipa em Londres. Hoje, Cloudflare London, a nossa a sede da EMEA a região da Europa, Médio Oriente e África tem mais de 200 pessoas a trabalhar no edifício histórico County Hall há frente do Parlamento Britânico. O meu escritório improvisado é agora história antiga.

Cloudflare em LisboaCC BY-SA 2.0 image by Sridhar Saraf

Cloudflare não parou em Londres. Temos pessoas em Munique, Cingapura, Pequim, Austin, Texas, Chicago e Champaign, Illinois, Nova York, Washington,DC, São José, California, Miami, Florida, Sydney, Austrália e também em Sao Francisco e Londres. Hoje estamos a anunciar o estabelecimento de um novo escritório em Lisboa, Portugal. Como parte da abertura do escritório este Verão irei me deslocar para Lisboa juntamente com um pequeno número de pessoal técnico de outros escritórios da Cloudflare.

Estamos a recrutar em Lisboa neste momento. Pode visitar este link para ver todas as oportunidades actuais. Estamos há procura de candidatos para preencher os cargos de Engenheiro, Segurança, Produto, Produto de Estratégia, Investigação Tecnológica e Atendimento ao Cliente.

Se está interessado num cargo que não está actualmente listado na nossa página de carreiras profissionais, também poderá enviar-nos um email para a nossa equipa de recruitamento pelo para expressar o seu interesse.

Cloudflare em LisboaCC BY-SA 2.0 Image by Rustam Aliyev

A minha primeira ideia realista de Lisboa formou-se há 30 anos atrás com a publicação de 1989 do John Le Carré, The Russia House (A casa da Rússia). Tão real, claro, como qualquer Le Carré’s visão do mundo:

[...] dez anos atrás, por um capricho qualquer, Barley Blair, tido herdado uns quantos milhares por uma tia distante, comprou para si um pé de terra mais modesto em Lisboa, onde costumava ter descansos regulares com o peso de uma alma multilateral. Poderia ter sido Cornwall, poderia ter sido a Provença ou mesmo até Timbuktu. Contudo, Lisboa por um acidente agarrou-o [...]

A escolha da Clouflare por Lisboa, não aconteceu por um acaso, mas sim por uma pesquisa cuidadosa de uma nova cidade continental Europeia para localizar um escritório. Eu fui convidado novamente para ir a Lisboa em 2014 para ser um dos oradores na Sapo Codebits e fiquei impressionado com o tamanho e a variedade de talento técnico presente no evento. Subsequentemente, visitámos 45 cidades por 29 países, reduzindo a uma lista final de três.

A combinação de um elevado e crescente ecossistema de tecnologia existente em Lisboa, uma política de imigração atraente,estabilidade política, alto padrão de vida, assim como todos os factores logísticos como o fuso horário (o mesmo que na Grã-Bretanha) e os voos directos para São Francisco fizeram com que fosse o vencedor evidente.

Eu começei a aprender Português há três meses…e estou desejoso para descobrir este país e a cultura, e criar um novo escritório para a Cloudflare.

Encontrámos um ecossistema tecnológico local próspero, apoiado tanto pelo governo como por uma miríade de startups empolgantes, e esperamos colaborar com eles para continuar a elevar o perfil de Lisboa.

Categories: Technology

Cloudflare's new Lisbon office

Wed, 17/07/2019 - 07:00
Cloudflare's new Lisbon office

I was the 24th employee of Cloudflare and the first outside of San Francisco. Working out of my spare bedroom, I wrote a chunk of Cloudflare’s software before starting to recruit a team in London. Today, Cloudflare London, our EMEA headquarters, has more than 200 people working in the historic County Hall building opposite the Houses of Parliament. My spare bedroom is ancient history.

Cloudflare's new Lisbon officeCC BY-SA 2.0 image by Sridhar Saraf

And Cloudflare didn’t stop at London. We now have people in Munich, Singapore, Beijing, Austin, TX, Chicago and Champaign, IL, New York, Washington, DC, San Jose, CA, Miami, FL, and Sydney, Australia, as well as San Francisco and London. And today we’re announcing the establishment of a new technical hub in Lisbon, Portugal. As part of that office opening I will be relocating to Lisbon this summer along with a small number of technical folks from other Cloudflare offices.

We’re recruiting in Lisbon starting today. Go here to see all the current opportunities. We’re looking for people to fill roles in Engineering, Security, Product, Product Strategy, Technology Research, and Customer Support.

Cloudflare's new Lisbon officeCC BY-SA 2.0 Image by Rustam Aliyev

My first real idea of Lisbon dates to 30 years ago with the 1989 publication of John Le Carré’s The Russia House. As real, of course, as any Le Carré view of the world:

[...] ten years ago on a whim Barley Blair, having inherited a stray couple of thousand from a remote aunt, bought himself a scruffy pied-a-terre in Lisbon, where he was accustomed to take periodic rests from the burden of his many-sided soul. It could have been Cornwall, it could have been Provence or Timbuktu. But Lisbon by an accident had got him [...]

Cloudflare’s choice of Lisbon, however, came not by way of an accident but a careful search for a new continental European city in which to locate a technical office. I had been invited to Lisbon back in 2014 to speak at SAPO Codebits and been impressed by the size and range of technical talent present at the event. Subsequently, we looked at 45 cities across 29 countries, narrowing down to a final list of three.

Lisbon’s combination of a large and growing existing tech ecosystem, attractive immigration policy, political stability, high standard of living, as well as logistical factors like time zone (the same as the UK) and direct flights to San Francisco made it the clear winner.

Eu começei a aprender Português há três meses... and I’m looking forward to discovering a country and a culture, and building a new technical hub for Cloudflare. We have found a thriving local technology ecosystem, supported both by the government and a myriad of exciting startups, and we look forward to collaborating with them to continue to raise Lisbon's profile.

Categories: Technology

Details of the Cloudflare outage on July 2, 2019

Fri, 12/07/2019 - 16:45

Almost nine years ago, Cloudflare was a tiny company and I was a customer not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site,, didn’t seem to have working DNS any more. Cloudflare had pushed out a change to its use of Protocol Buffers and it had broken DNS.

I wrote to Matthew Prince directly with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here) to which I replied:

From: John Graham-Cumming Date: Thu, Oct 7, 2010 at 9:14 AM Subject: Re: Where's my dns? To: Matthew Prince Awesome report, thanks. I'll make sure to call you if there's a problem. At some point it would probably be good to write this up as a blog post when you have all the technical details because I think people really appreciate openness and honesty about these things. Especially if you couple it with charts showing your post launch traffic increase. I have pretty robust monitoring of my sites so I get an SMS when anything fails. Monitoring shows I was down from 13:03:07 to 14:04:12. Tests are made every five minutes. It was a blip that I'm sure you'll get past. But are you sure you don't need someone in Europe? :-)

To which he replied:

From: Matthew Prince Date: Thu, Oct 7, 2010 at 9:57 AM Subject: Re: Where's my dns? To: John Graham-Cumming Thanks. We've written back to everyone who wrote in. I'm headed in to the office now and we'll put something on the blog or pin an official post to the top of our bulletin board system. I agree 100% transparency is best.

And so, today, as an employee of a much, much larger Cloudflare I get to be the one who writes, transparently about a mistake we made, its impact and what we are doing about it.

The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push a rule to protect against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.

Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.

CPU utilization in one of our PoPs during the incident

This resulted in our customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front line Cloudflare web servers that still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic.

We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.

It must have been incredibly stressful, frustrating and frightening if you were one of our customers. It was even more upsetting because we haven’t had a global outage for six years.

The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. The regular expression that was at the heart of the outage is (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Although the regular expression itself is of interest to many people (and is discussed more below), the real story of how the Cloudflare service went down for 27 minutes is much more complex than “a regular expression went bad”. We’ve taken the time to write out the series of events that lead to the outage and kept us from responding quickly. And, if you want to know more about regular expression backtracking and what to do about it, then you’ll find it in an appendix at the end of this post.

What happened

Let’s begin with the sequence of events. All times in this blog are UTC.

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. This generated a Change Request ticket. We use Jira to manage these tickets and a screenshot is below.

Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Some of these alerts hit my watch and I jumped out of the meeting I was in and was on my way back to my desk when a leader in our Solutions Engineering group told me we had lost 80% of our traffic. I ran over to SRE where the team was debugging the situation. In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before.

Cloudflare’s SRE team is distributed around the world, with continuous, around-the-clock coverage. Alerts like these, the vast majority of which are noting very specific issues of limited scopes in localized areas, are monitored in internal dashboards and addressed many times every day. This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

The London engineering team was at that moment in our main event space listening to an internal tech talk. The talk was interrupted and everyone assembled in a large conference room and others dialed-in. This wasn’t a normal problem that SRE could handle alone, it needed every relevant team online at once.

At 14:00 the WAF was identified as the component causing the problem and an attack dismissed as a possibility. The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble. At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

But getting to the global WAF kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel (and once we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently).

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). Eventually, a team member executed the global WAF kill at 14:07 and by 14:09 traffic levels and CPU were back to expected levels worldwide. The rest of Cloudflare's protection mechanisms continued to operate.

Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.

At 14:52 we were 100% satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally.

How Cloudflare operates

Cloudflare has a team of engineers who work on our WAF Managed Rules product; they are constantly working to improve detection rates, lower false positives, and respond rapidly to new threats as they emerge. In the last 60 days, 476 change requests have been handled for the WAF Managed Rules (averaging one every 3 hours).

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

As can be seen from the Change Request above there’s a deployment plan, a rollback plan and a link to the internal Standard Operating Procedure (SOP) for this type of deployment. The SOP for a rule change specifically allows it to be pushed globally. This is very different from all the software we release at Cloudflare where the SOP first pushes software to an internal dogfooding network point of presence (PoP) (which our employees pass through), then to a small number of customers in an isolated location, followed by a push to a large number of customers and finally to the world.

The process for a software release looks like this: We use git internally via BitBucket. Engineers working on changes push code which is built by TeamCity and when the build passes, reviewers are assigned. Once a pull request is approved the code is built and the test suite runs (again).

If the build and tests pass then a Change Request Jira is generated and the change has to be approved by the relevant manager or technical lead. Once approved deployment to what we call the “animal PoPs” occurs: DOG, PIG, and the Canaries.

The DOG PoP is a Cloudflare PoP (just like any of our cities worldwide) but it is used only by Cloudflare employees. This dogfooding PoP enables us to catch problems early before any customer traffic has touched the code. And it frequently does.

If the DOG test passes successfully code goes to PIG (as in “Guinea Pig”). This is a Cloudflare PoP where a small subset of customer traffic from non-paying customers passes through the new code.

If that is successful the code moves to the Canaries. We have three Canary PoPs spread across the world and run paying and non-paying customer traffic running through them on the new code as a final check for errors.

Cloudflare software release process

Once successful in Canary the code is allowed to go live. The entire DOG, PIG, Canary, Global process can take hours or days to complete, depending on the type of code change. The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats.

WAF Threats

In the last few years we have seen a dramatic increase in vulnerabilities in common applications. This has happened due to the increased availability of software testing tools, like fuzzing for example (we just posted a new blog on fuzzing here).


What is commonly seen is a Proof of Concept (PoC) is created and often published on Github quickly, so that teams running and maintaining applications can test to make sure they have adequate protections. Because of this, it’s imperative that Cloudflare are able to react as quickly as possible to new attacks to give our customers a chance to patch their software.

A great example of how Cloudflare proactively provided this protection was through the deployment of our protections against the SharePoint vulnerability in May (blog here). Within a short space of time from publicised announcements, we saw a huge spike in attempts to exploit our customer’s Sharepoint installations. Our team continuously monitors for new threats and writes rules to mitigate them on behalf of our customers.

The specific rule that caused last Tuesday’s outage was targeting Cross-site scripting (XSS) attacks. These too have increased dramatically in recent years.


The standard procedure for a WAF Managed Rules change indicates that Continuous Integration (CI) tests must pass prior to a global deploy. That happened normally last Tuesday and the rules were deployed. At 13:31 an engineer on the team had merged a Pull Request containing the change after it was approved.

At 13:37 TeamCity built the rules and ran the tests, giving it the green light. The WAF test suite tests that the core functionality of the WAF works and consists of a large collection of unit tests for individual matching functions. After the unit tests run the individual WAF rules are tested by executing a huge collection of HTTP requests against the WAF. These HTTP requests are designed to test requests that should be blocked by the WAF (to make sure it catches attacks) and those that should be let through (to make sure it isn’t over-blocking and creating false positives). What it didn’t do was test for runaway CPU utilization by the WAF and examining the log files from previous WAF builds shows that no increase in test suite run time was observed with the rule that would ultimately cause CPU exhaustion on our edge.

With the tests passing, TeamCity automatically began deploying the change at 13:42.


Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds. This technology is used by all our customers when making configuration changes in our dashboard or via the API and is the backbone of our service’s ability to respond to changes very, very rapidly.

We haven’t really talked about Quicksilver much. We previously used Kyoto Tycoon as a globally distributed key-value store, but we ran into operational issues with it and wrote our own KV store that is replicated across our more than 180 cities. Quicksilver is how we push changes to customer configuration, update WAF rules, and distribute JavaScript code written by customers using Cloudflare Workers.

From clicking a button in the dashboard or making an API call to change configuration to that change coming into effect takes seconds, globally. Customers have come to love this high speed configurability. And with Workers they expect near instant, global software deployment. On average Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast.  On average we hit a p99 of 2.29s for a change to be distributed to every machine worldwide. Usually, this speed is a great thing. It means that when you enable a feature or purge your cache you know that it’ll be live globally nearly instantly. When you push code with Cloudflare Workers it's pushed out a the same speed. This is part of the promise of Cloudflare fast updates when you need them.

However, in this case, that speed meant that a change to the rules went global in seconds. You may notice that the WAF code uses Lua. Cloudflare makes use of Lua extensively in production and details of the Lua in the WAF have been discussed before. The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression. More on that and what we're doing about it below.

Everything that occurred up to the point the rules were deployed was done “correctly”: a pull request was raised, it was approved, CI/CD built the code and tested it, a change request was submitted with an SOP detailing rollout and rollback, and the rollout was executed.

Cloudflare WAF deployment process

What went wrong

As noted, we deploy dozens of new rules to the WAF every week, and we have numerous systems in place to prevent any negative impact of that deployment. So when things do go wrong, it’s generally the unlikely convergence of multiple causes. Getting to a single root cause, while satisfying, may obscure the reality. Here are the multiple vulnerabilities that converged to get to the point where Cloudflare’s service for HTTP/HTTPS went offline.

  1. An engineer wrote a regular expression that could easily backtrack enormously.
  2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
  3. The regular expression engine being used didn’t have complexity guarantees.
  4. The test suite didn’t have a way of identifying excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
  6. The rollback plan required running the complete WAF build twice taking too long.
  7. The first alert for the global traffic drop took too long to fire.
  8. We didn’t update our status page quickly enough.
  9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
  10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
  11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
What’s happened since last Tuesday

Firstly, we stopped all release work on the WAF completely and are doing the following:

  1. Re-introduce the excessive CPU usage protection that got removed. (Done)
  2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
  3. Introduce performance profiling for all rules to the test suite. (ETA:  July 19)
  4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
  5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
  6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare's edge.
  7. Automating update of the Cloudflare Status page.

In the longer term we are moving away from the Lua WAF that I wrote years ago. We are porting the WAF to use the new firewall engine. This will make the WAF both faster and add yet another layer of protection.


This was an upsetting outage for our customers and for the team. We responded quickly to correct the situation and are correcting the process deficiencies that allowed the outage to occur and going deeper to protect against any further possible problems with the way we use regular expressions by replacing the underlying technology used.

We are ashamed of the outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur.

Appendix: About Regular Expression Backtracking

To fully understand how (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  caused CPU exhaustion you need to understand a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and matching ) are a non-capturing group (i.e. the expression inside the parentheses is grouped together as a single expression).

For the purposes of the discussion of why this pattern causes CPU exhaustion we can safely ignore it and treat the pattern as .*.*=.*. When reduced to this, the pattern obviously looks unnecessarily complex; but what's important is any "real-world" expression (like the complex ones in our WAF rules) that ask the engine to "match anything followed by anything" can lead to catastrophic backtracking. Here’s why.

In a regular expression, . means match a single character, .* means match zero or more characters greedily (i.e. match as much as possible) so .*.*=.* means match zero or more characters, then match zero or more characters, then find a literal = sign, then match zero or more characters.

Consider the test string x=x. This will match the expression .*.*=.*. The .*.* before the equal can match the first x (one of the .* matches the x, the other matches zero characters). The .* after the = matches the final x.

It takes 23 steps for this match to happen. The first .* in .*.*=.* acts greedily and matches the entire x=x string. The engine moves on to consider the next .*. There are no more characters left to match so the second .* matches zero characters (that’s allowed). Then the engine moves on to the =. As there are no characters left to match (the first .* having consumed all of x=x) the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x) and then moves onto the second .*. That .* matches the second x and now there are no more characters left to match. So when the engine tries to match the = in .*.*=.* the match fails. The engine backtracks again.

This time it backtracks so that the first .* is still matching x= but the second .* no longer matches x; it matches zero characters. The engine then moves on to try to find the literal = in the .*.*=.* pattern but it fails (because it was already matched against the first .*). The engine backtracks again.

This time the first .* matches just the first x. But the second .* acts greedily and matches =x. You can see what’s coming. When it tries to match the literal = it fails and backtracks again.

The first .* still matches just the first x. Now the second .* matches just =. But, you guessed it, the engine can’t match the literal = because the second .* matched it. So the engine backtracks again. Remember, this is all to match a three character string.

Finally, with the first .* matching just the first x, the second .* matching zero characters the engine is able to match the literal = in the expression with the = in the string. It moves on and the final .* matches the final x.

23 steps to match x=x. Here’s a short video of that using the Perl Regexp::Debugger showing the steps and backtracking as they occur.

That’s a lot of work but what happens if the string is changed from x=x to x=xx? This time is takes 33 steps to match. And if the input is x=xxx it takes 45. That’s not linear. Here’s a chart showing matching from x=x to x=xxxxxxxxxxxxxxxxxxxx (20 x’s after the =). With 20 x’s after the = the engine takes 555 steps to match! (Worse, if the x= was missing, so the string was just 20 x’s, the engine would take 4,067 steps to find the pattern doesn’t match).

This video shows all the backtracking necessary to match x=xxxxxxxxxxxxxxxxxxxx:

That’s bad because as the input size goes up the match time goes up super-linearly. But things could have been even worse with a slightly different regular expression. Suppose it had been .*.*=.*; (i.e. there’s a literal semicolon at the end of the pattern). This could easily have been written to try to match an expression like foo=bar;.

This time the backtracking would have been catastrophic. To match x=x takes 90 steps instead of 23. And the number of steps grows very quickly. Matching x= followed by 20 x’s takes 5,353 steps. Here’s the corresponding chart. Look carefully at the Y-axis values compared the previous chart.

To complete the picture here are all 5,353 steps of failing to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

Using lazy rather than greedy matches helps control the amount of backtracking that occurs in this case. If the original expression is changed to .*?.*?=.*? then matching x=x takes 11 steps (instead of 23) and so does matching x=xxxxxxxxxxxxxxxxxxxx. That’s because the ? after the .* instructs the engine to match the smallest number of characters first before moving on.

But laziness isn’t the total solution to this backtracking behaviour. Changing the catastrophic example .*.*=.*; to .*?.*?=.*?; doesn’t change its run time at all. x=x still takes 555 steps and x= followed by 20 x’s still takes 5,353 steps.

The only real solution, short of fully re-writing the pattern to be more specific, is to move away from a regular expression engine with this backtracking mechanism. Which we are doing within the next few weeks.

The solution to this problem has been known since 1968 when Ken Thompson wrote a paper titled “Programming Techniques: Regular expression search algorithm”. The paper describes a mechanism for converting a regular expression into an NFA (non-deterministic finite automata) and then following the state transitions in the NFA using an algorithm that executes in time linear in the size of the string being matched against.

Thompson’s paper doesn’t actually talk about NFA but the linear time algorithm is clearly explained and an ALGOL-60 program that generates assembly language code for the IBM 7094 is presented. The implementation may be arcane but the idea it presents is not.

Here’s what the .*.*=.* regular expression would look like when diagrammed in a similar manner to the pictures in Thompson’s paper.

Figure 0 has five states starting at 0. There are three loops which begin with the states 1, 2 and 3. These three loops correspond to the three .* in the regular expression. The three lozenges with dots in them match a single character. The lozenge with an = sign in it matches the literal = sign. State 4 is the ending state, if reached then the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.* we’ll examine matching the string x=x. The program starts in state 0 as shown in Figure 1.

The key to making this algorithm work is that the state machine is in multiple states at the same time. The NFA will take every transition it can, simultaneously.

Even before it reads any input, it immediately transitions to both states 1 and 2 as shown in Figure 2.

Looking at Figure 2 we can see what happened when it considers  first x in x=x. The x can match the top dot by transitioning from state 1 and back to state 1. Or the x can match the dot below it by transitioning from state 2 and back to state 2.

So after matching the first x in x=x the states are still 1 and 2. It’s not possible to reach state 3 or 4 because a literal = sign is needed.

Next the algorithm considers the = in x=x. Much like the x before it, it can be matched by either of the top two loops transitioning from state 1 to state 1 or state 2 to state 2, but additionally the literal = can be matched and the algorithm can transition state 2 to state 3 (and immediately state 4). That’s illustrated in Figure 3.

Next the algorithm reaches the final x in x=x. From states 1 and 2 the same transitions are possible back to states 1 and 2. From state 3 the x can match the dot on the right and transition back to state 3.

At that point every character of x=x has been considered and because state 4 has been reached the regular expression matches that string. Each character was processed once so the algorithm was linear in the length of the input string. And no backtracking was needed.

It might also be obvious that once state 4 was reached (after x= was matched) the regular expression had matched and the algorithm could terminate without considering the final x at all.

This algorithm is linear in the size of its input.

Categories: Technology

Details zum Cloudflare-Ausfall am 2. Juli 2019

Fri, 12/07/2019 - 16:45

Vor etwa neun Jahren war Cloudflare noch ein winziges Unternehmen und ich war ein Kunde, kein Mitarbeiter. Cloudflare gab es erst seit einem Monat. Eines Tages wurde ich darüber benachrichtigt, dass bei meiner kleinen Website der DNS-Service nicht mehr funktionierte. Cloudflare hat seine Verwendung von Protocol Buffers angepasst und dadurch wurde der DNS-Service unterbrochen.

Ich habe eine E-Mail mit dem Titel „Where‘s my dns?“ (Wo ist mein DNS) direkt an Matthew Prince gesendet und er hat mit einer langen, detaillierten, technischen Erklärung reagiert (Sie können den vollständigen E-Mail-Austausch hier lesen), auf die ich antwortete:

Von: John Graham-Cumming Datum: Do., 7. Okt. 2010 um 09:14 Betreff: Re: Wo ist mein DNS? An: Matthew Prince Toller Bericht, danke. Ich werde auf jeden Fall anrufen, wenn es ein Problem geben sollte. Es wäre wahrscheinlich sinnvoll, all das in einem Blog-Beitrag festzuhalten, wenn Sie alle technischen Details haben. Ich glaube nämlich, dass es Kunden wirklich zu schätzen wissen, wenn mit solchen Dingen offen und ehrlich umgegangen wird. Sie könnten auch die Traffic-Zunahme nach der Implementierung mit Diagrammen veranschaulichen. Ich habe eine recht zuverlässige Überwachung für meine Websites eingerichtet, deshalb bekomme ich eine SMS, wenn etwas ausfällt. Meine Daten zeigen, dass die Website von 13:03:07 bis 14:04:12 nicht verfügbar war. Die Tests erfolgen alle fünf Minuten. Das war nur ein kleiner Fehler und ich bin mir sicher, dass Sie etwas daraus lernen. Aber bräuchten Sie nicht vielleicht jemanden in Europa? :-)

Darauf antwortete er:

Von: Matthew Prince Datum: Do., 7. Okt. 2010 um 09:57 Betreff: Re: Wo ist mein DNS? An: John Graham-Cumming Vielen Dank. Wir haben allen geantwortet, die sich bei uns gemeldet haben. In bin gerade auf dem Weg zum Büro und wir werden etwas in den Blog stellen oder einen offiziellen Beitrag ganz oben auf dem Bulletin Board System verankern. Ich stimme Ihnen zu 100 % zu, dass Transparenz der richtige Weg ist.

Und so kommt es, dass ich heute ein Mitarbeiter eines deutlich größeren Cloudflare bin und für Transparenz sorge, indem ich über unsere Fehler, ihre Auswirkungen und unsere Gegenmaßnahmen schreibe.

Die Ereignisse des 2. Juli

Am 2. Juli haben wir eine neue Regel zu unseren WAF Managed Rules hinzugefügt, durch die alle CPU-Kerne überlastet wurden, die HTTP/HTTPS-Traffic im weltweiten Cloudflare-Netzwerk verarbeiten. Wir optimieren die WAF Managed Rules kontinuierlich, um neue Schwachstellen und Bedrohungen zu eliminieren. Zum Beispiel haben wir mit einem schnellen WAF-Update im Mai eine Regel implementiert, um eine schwerwiegende SharePoint-Schwachstelle zu schließen. Die Möglichkeit, Regeln schnell und global bereitzustellen, ist ein besonders wichtiges Feature unserer WAF.

Leider enthielt das Update vom letzten Dienstag einen regulären Ausdruck, der ein enormes Backtracking ausgelöst hat und die CPUs der HTTP/HTTPS-Verarbeitung überlastet hat. Dadurch wurden die grundlegenden Proxy-, CDN- und WAF-Funktionen von Cloudflare deaktiviert. Auf dem folgenden Graphen können Sie sehen, dass die CPUs für den HTTP/HTTPS-Traffic bei allen Servern unseres Netzwerks fast zu 100 % ausgelastet waren.

CPU-Auslastung bei einem unserer PoPs während des Vorfalls

Deshalb wurde unseren Kunden (und deren Kunden) beim Aufrufen einer beliebigen Cloudflare-Domain eine 502-Fehlerseite angezeigt. Die 502-Fehler wurden von den Cloudflare-Webservern erzeugt, die noch über CPU-Kerne verfügten, aber die Prozesse für den HTTP/HTTPS-Traffic nicht erreichen konnten.

Wir wissen, wie sehr der Vorfall unseren Kunden geschadet hat. Wir schämen uns dafür. Auch unsere eigenen Betriebsabläufe waren betroffen, als wir Gegenmaßnahmen ergriffen haben.

Der Ausfall muss Ihnen als Kunde enormen Stress, Frustration und vielleicht sogar Verzweiflung bereitet haben. Wir hatten seit sechs Jahren keinen globalen Ausfall mehr, entsprechend groß war unser Ärger.

Die CPU-Überlastung wurde von einer einzigen WAF-Regel verursacht, die einen schlecht geschriebenen regulären Ausdruck enthielt, der ein enormes Backtracking auslöste. Dies ist der reguläre Ausdruck, der den Ausfall verursacht hat: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Obwohl dieser reguläre Ausdruck für viele Personen von Interesse ist (und unten genauer beschrieben wird), sind die genauen Gründe für die 27 Minuten lange Nichtverfügbarkeit des Cloudflare-Services deutlich komplexer, als dass einfach nur ein schlecht geschriebener regulärer Ausdruck implementiert wurde. Wir haben uns die Zeit genommen, die Ereigniskette aufzuschreiben, die zum Ausfall geführt hat und unsere Reaktion gebremst hat. Wenn Sie mehr über das Backtracking bei regulären Ausdrücken und die möglichen Gegenmaßnahmen erfahren möchten, sehen Sie sich den Anhang am Ende des Beitrags an.

Was passiert ist

Betrachten wir die Ereignisse in ihrer Reihenfolge. Alle Zeitangaben in diesem Blog basieren auf UTC.

Um 13:42 hat ein Engineer des Firewall-Teams eine kleine Änderung an den Regeln der XSS-Erkennung mithilfe eines automatischen Prozesses implementiert. Dadurch wurde ein Ticket für eine Änderungsanfrage erzeugt. Wir verwenden Jira, um diese Tickets zu bearbeiten und unten sehen Sie einen Screenshot davon.

Drei Minuten später ist die erste PagerDuty-Seite ausgefallen, was auf einen Fehler bei der WAF hingedeutet hat. Das war ein synthetischer Test, mit dem außerhalb von Cloudflare überprüft wird, ob die WAF ordnungsgemäß funktioniert (wir nutzen Hunderte solcher Tests). Direkt darauf folgten die Meldungen weiterer End-to-End-Tests über die Ausfälle von Cloudflare-Services bei Websites, eine Warnung wegen einer rapide Abnahme des globalen Traffics, eine enorme Anzahl an 502-Fehlern und dann viele Berichte von unseren PoPs (Points-of-Presence) in Städten auf der ganzen Welt, die eine CPU-Überlastung anzeigten.

Einige dieser Meldungen wurden auf meiner Uhr angezeigt und ich bin während des Meetings aufgesprungen und war gerade auf dem Weg zu meinem Schreibtisch, als ein leitender Solutions Engineer mich darüber informierte, dass wir 80 % unseres Traffics verloren hatten. Ich rannte zu unserer SRE-Abteilung, wo das Team gerade die Situation analysierte. Anfangs wurde sogar spekuliert, ob es sich um einen Angriff ungeahnten Ausmaßes handeln könnte.

Das SRE-Team von Cloudflare ist auf der ganzen Welt verteilt, damit rund um die Uhr für Monitoring gesorgt ist. Warnungen wie diese, die meist nur sehr spezifische Probleme mit überschaubaren Auswirkungen betreffen, werden mit internen Dashboards überwacht und mehrfach täglich überprüft und behandelt. Diese Menge an Websites und Warnungen deutete aber darauf hin, dass etwas äußerst Schwerwiegendes vorgefallen ist, weshalb das SRE-Team dies sofort als P0-Vorfall deklariert hat und ihn zum leitenden Engineering und System Engineering eskaliert hat.

Das Engineering-Team aus London befand sich gerade im zentralen Veranstaltungsraum und hörte sich einen internen Tech Talk an. Der Tech Talk wurde unterbrochen, das Team versammelte sich in einem großen Konferenzraum und andere schalteten sich dazu. Das war kein normales Problem, um das sich das SRE-Team alleine kümmern konnte: Alle relevanten Teams mussten gleichzeitig verfügbar sein.

Um 14:00 wurde erkannt, dass die WAF der Ursprung des Problems ist, und die Möglichkeit eines Angriffs wurde ausgeschlossen. Das Performance Team konnte CPU-Daten in Echtzeit abrufen, die eindeutig belegten, dass die WAF ursächlich war. Ein Teammitglied konnte dies mit strace bestätigen. Ein anderes Team erhielt Fehlerprotokolle, die auf Probleme bei der WAF hindeuteten. Um 14:02 wandten sich alle Blicke des Teams zu mir, als die Verwendung eines „global kill“ im Raum stand, eines Cloudflare-Mechanismus, mit dem eine bestimmte Komponente weltweit deaktiviert werden kann.

Aber dazu mussten wir erst einmal die Fähigkeit zu einem „global kill“ der WAF erhalten. Ohne Weiteres war dies nicht möglich. Wir verwenden unsere eigenen Produkte und da unser Access-Dienst nicht mehr funktionierte, konnten wir uns bei unserem internen Control Panel nicht authentifizieren (wir haben festgestellt, dass einige Teammitglieder ihren Zugriff verloren hatten, weil eine Sicherheitsfunktion ihre Anmeldedaten deaktiviert, wenn sie das interne Control Panel nicht regelmäßig verwenden).

Und wir konnten andere interne Dienste wie Jira oder das Build-System nicht mehr aufrufen. Wir mussten dieses Problem mit einem Mechanismus umgehen, der nur sehr selten verwendet wurde (und ein weiterer Prozess, den wir nach dem Vorfall genauer unter die Lupe nahmen). Letztendlich konnte ein Teammitglied um 14:07 den „global kill“ der WAF ausführen und um 14:09 befanden sich Traffic und CPU-Niveaus wieder weltweit im normalen Bereich. Der restliche Cloudflare-Schutzmechanismus war wieder aktiv.

Dann sorgten wir dafür, dass die WAF wieder funktionierte. Da dieser Vorfall ziemlich ernst war, führten wir in einer einzigen Stadt sowohl negative Tests (mit der Frage, ob wirklich diese eine Änderung das Problem verursacht hatte) als auch positive Tests (zur Überprüfung, ob der Rollback wirklich funktioniert hatte) mit einem Teil des Traffics durch, nachdem wir den Traffic unserer zahlenden Kunden von diesem Standort abgezogen hatten.

Um 14:52 waren wir zu 100 % davon überzeugt, dass wir die Ursache verstanden hatten, das Problem behoben war und die WAF wieder global aktiv war.

Wie Cloudflare arbeitet

Cloudflare verfügt über ein Engineering-Team, das an WAF Managed Rules arbeitet. Es optimiert kontinuierlich die Erkennungsraten, minimiert die falsch-positiven Ergebnisse und reagiert unmittelbar auf neue Bedrohungen. In den vergangenen 60 Tagen wurden 476 Änderungsanfragen für die WAF Managed Rules bearbeitet (durchschnittlich eine alle 3 Stunden).

Diese spezielle Änderung wurde im „Simulationsmodus“ bereitgestellt, in dem der echte Kunden-Traffic zwar von der Regel überprüft wird, er aber ungehindert durchgeleitet wird. Mit diesem Modus testen wir die Effektivität einer Regel und messen die Raten falsch-positiver und falsch-negativer Ergebnisse. Aber selbst im „Simulationsmodus“ müssen die Regeln tatsächlich ausgeführt werden und in diesem Fall enthielt die Regel einen regulären Ausdruck, der eine CPU-Überlastung auslöste.

Wie oben in der Änderungsanfrage ersichtlich, gibt es einen Bereitstellungsplan, einen Rollbackplan und einen Link zum internen Standard Operating Procedure (SOP) für diese Art von Bereitstellung. Das SOP erlaubt ausdrücklich die globale Implementierung einer Regeländerung. Diese Methodik unterscheidet sich deutlich von unserem normalen Ansatz bei der Software-Veröffentlichung, wo SOP die Software zunächst bei einem internen Dogfooding-Netzwerk-PoP (Point of Presence) implementiert (den unsere Kunden nur passieren), dann bei einer geringen Kundenzahl an einem isolierten Standort, gefolgt von einer großen Kundenzahl und schließlich weltweit. „Dogfooding“ ist übrigens ein englischer Ausdruck dafür, dass ein Unternehmen sein eigenes Produkt verwendet.

Der Prozess zur Software-Veröffentlichung sieht folgendermaßen aus: Wir verwenden intern git mittels BitBucket. Die Engineers, die Änderungen bearbeiten, schreiben Code, der in TeamCity erstellt wird. Wenn das Build bestätigt wird, werden Prüfer zugewiesen. Sobald ein Pull Request bestätigt wurde, wird der Code erstellt und die Test-Suite (erneut) ausgeführt.

Wenn der Build-Test erfolgreich war, wird bei Jira eine Änderungsanfrage erstellt und die Änderung muss von der zuständigen Führungskraft oder einer technischen Leitung bestätigt werden. Nach der Bestätigung erfolgt die Bereitstellung an den „Animal PoPs“, wie wir sie nennen: DOG, PIG und Canaries.

Der DOG-PoP ist ein Cloudflare-PoP (genau wie eine unserer Städte auf der Welt), der aber nur von Cloudflare-Mitarbeitern verwendet wird. Mithilfe dieses Dogfooding-PoPs können wir Probleme beheben, bevor ein Kunden-Traffic damit in Kontakt kommt. Und genau das passiert auch häufig.

Wenn der DOG-Test erfolgreich abgeschlossen wird, geht der Code in die PIG-Phase über (Englisch „Guinea Pig“, zu Deutsch „Meerschweinchen“). Dies ist ein Cloudflare-PoP, an dem ein kleiner Anteil des Traffics von kostenlosen Benutzern den neuen Code durchläuft.

Wenn dieser Test erfolgreich ist, geht der Code zu den „Canaries“ (Kanarienvögeln) über. Wir verfügen über drei auf die ganze Welt verteilte Canary-PoPs, über die der Traffic von zahlenden und kostenlosen Kunden geleitet wird, damit der neue Code noch ein letztes Mal auf Fehler überprüft werden kann.

Veröffentlichungsprozess bei Cloudflare-Software

Nach dem erfolgreichen Canary-Test ist der Code zur globalen Implementierung freigegeben. Je nach Codeänderung können mehrere Stunden oder Tage bis zum Abschluss des gesamten Prozesses aus DOG, PIG, Canary und Global vergehen. Dank der Vielseitigkeit des Netzwerks und der Kunden von Cloudflare können wir den Code gründlich überprüfen, bevor eine neue Version global für alle Kunden eingeführt wird. Im Falle der WAF findet dieser Prozess aber keine Anwendung, da Bedrohungen ja eine schnelle Reaktion erfordern.


In den vergangenen Jahren mussten wir eine drastische Zunahme an Schwachstellen bei gängigen Anwendungen feststellen. Das lässt sich auf die steigende Verfügbarkeit von Softwaretestingtools mit Methoden wie Fuzzing zurückführen (einen neuen Blog-Beitrag zum Thema Fuzzing haben wir erst kürzlich hier veröffentlicht).


Häufig wird ein Proof of Concept (PoC) erstellt und direkt auf Github veröffentlicht, damit Teams, die Anwendungen ausführen und bearbeiten, ihre Tests durchführen können, um sicherzustellen, dass sie über geeignete Schutzmaßnahmen verfügen. Deshalb muss Cloudflare unbedingt so schnell wie möglich auf neue Bedrohungen reagieren und Softwarepatches für seine Kunden bereitstellen.

Ein gutes Beispiel für diesen proaktiven Schutz von Cloudflare ist die Bereitstellung der Schutzmaßnahmen wegen der SharePoint-Schwachstelle im Mai hier im Blog). Direkt nach der öffentlichen Bekanntgabe verzeichneten wir einen signifikanten Anstieg der Exploit-Versuche bei den SharePoint-Installationen unserer Kunden. Unser Team hält kontinuierlich Ausschau nach neuen Bedrohungen und schreibt Regeln, um sie im Sinne unserer Kunden zu bekämpfen.

Bei der spezifischen Regel, die den Ausfall am letzten Dienstag verursachte, ging es um XSS-Angriffe (Cross-Site Scripting). Auch diese haben in den letzten Jahren signifikant zugenommen.


Im Rahmen des Standardverfahrens für eine Anpassung der WAF Managed Rules sind erfolgreiche CI-Tests (Continuous Integration) vor der globalen Bereitstellung vorgesehen. Diese wurden am letzten Dienstag erfolgreich durchgeführt und die Regeln wurden bereitgestellt. Um 13:31 hat ein Engineer des Teams einen Pull Request mit der Änderung implementiert, nachdem sie bestätigt worden war.

Um 13:37 hat TeamCity die Regeln erstellt, seine Tests durchgeführt und grünes Licht gegeben. Die WAF-Testsuite überprüft die grundlegenden WAF-Funktionen und besteht aus einer großen Testsammlung für individuelle Abgleichfunktionen. Nach dem Testlauf werden die individuellen WAF-Regeln getestet, indem eine große Anzahl an HTTP-Anfragen unter Einbeziehung der WAF ausgeführt wird. Diese HTTP-Anfragen sind als Testanfragen konzipiert, die von der WAF blockiert werden sollen (damit potenzielle Angriffe abgewehrt werden) bzw. nicht blockiert werden sollen (damit nicht zu viel blockiert wird und keine falsch-positiven Ergebnisse entstehen). Nicht getestet wurde jedoch die übermäßige CPU-Auslastung durch die WAF. Auch die Überprüfung der Protokolldateien von vorherigen WAF-Builds hat ergeben, dass bei der Regel keine überhöhte Testsuite-Laufzeit erkannt wurde, die letztendlich eine CPU-Auslastung verursachen könnte.

Die Tests wurden erfolgreich abgeschlossen und TeamCity begann um 13:42 mit der automatischen Bereitstellung der Änderung.


Da WAF-Regeln akute Bedrohungen abwehren müssen, werden sie mit unserer verteilten Schlüssel-Werte-Datenbank Quicksilver bereitgestellt, die Änderungen in wenigen Sekunden global implementiert. Diese Technologie wird von allen unseren Kunden für Konfigurationsänderungen in unserem Dashboard oder per API verwendet und darauf beruht unsere Fähigkeit, auf Änderungen äußerst schnell zu reagieren.

Bis jetzt haben wir noch nicht viel über Quicksilver gesprochen. Zuvor haben wir Kyoto Tycoonals globalen Schlüssel-Werte-Speicher (KV-Speicher) verwendet, aber wir hatten damit Probleme im Betrieb und erstellten dann unseren eigenen KV-Speicher, der für unsere über 180 Standorte repliziert wird. Mit Quicksilver übertragen wir Änderungen an Kundenkonfigurationen, aktualisieren wir WAF-Regeln und verteilen JavaScript-Code, der von Kunden mit Cloudflare Workers geschrieben wurde.

Es dauert vom Klicken auf eine Schaltfläche im Dashboard oder Tätigen eines API-Aufrufs zum Ändern der Konfiguration nur ein paar Sekunden, bis die Änderung aktiv ist – global. Die Kunden lieben mittlerweile diese Konfigurierbarkeit mit Höchstgeschwindigkeit. Sie erwarten bei Workers eine praktisch sofortige, globale Softwarebereitstellung. Quicksilver verteilt durchschnittlich etwa 350 Änderungen pro Sekunde.

Und Quicksilver ist sehr schnell.  Unser P99 für die Verteilung einer Änderung an jeden Rechner weltweit lag bei durchschnittlich 2,29 s. Diese Geschwindigkeit ist normalerweise eine tolle Sache. Wenn man ein Feature aktiviert oder den Cache entleert, weiß man, dass diese Änderung praktisch sofort live ist, weltweit. Jede Codeübermittlung mit Cloudflare Workers erfolgt mit der gleichen Geschwindigkeit. Das ist Teil des Versprechens der schnellen Updates von Cloudflare, die da sind, wenn man sie braucht.

In diesem Fall hieß die Geschwindigkeit jedoch, dass eine Änderung an den Regeln innerhalb von Sekunden global live war. Wie Sie sehen, nutzt der WAF-Code Lua. Cloudflare nutzt Lua bei der Produktion in hohem Maße. Details zu Lua in der WAF wurden bereits erörtert. Die Lua-WAF nutzt intern PCRE und verwendet Backtracking zum Abgleich. Sie hat keine Schutzvorrichtung gegen aus der Reihe tanzende Ausdrücke. Mehr dazu, und was wir dagegen tun, lesen Sie unten.

Alles bis zum Zeitpunkt der Regelbereitstellung erfolgte „korrekt“: eine Pull-Anfrage wurde gestellt, sie wurde genehmigt, CI/CD erstellte den Code und testete ihn, eine Änderungsanfrage mit einem SOP mit Details zu Rollout und Rollback wurde eingereicht und das Rollout wurde ausgeführt.

Bereitstellungsprozess für Cloudflare WAF

Was ist schiefgelaufen?

Wie erwähnt, stellen wir jede Woche Dutzende neuer Regeln für die WAF bereit und haben mehrere Systeme installiert, um negative Auswirkungen dieser Bereitstellungen zu vermeiden. Wenn also etwas schiefgeht, ist die Ursache normalerweise ein unwahrscheinliches Zusammentreffen mehrerer Faktoren. Die Suche nach einer einzigen Grundursache mag zwar befriedigend sein, geht aber oft an der Realität vorbei. Dies sind die Verwundbarkeiten, die alle zusammentrafen, bis der Punkt erreicht war, an dem die Cloudflare-Services für HTTP/HTTPS offline gingen.

  1. Ein Techniker schrieb einen regulären Ausdruck, der leicht ein enormes Backtracking bewirken konnte.
  2. Ein Schutz vor übermäßiger CPU-Auslastung durch einen regulären Ausdruck wurde versehentlich einige Wochen vorher während einer Umgestaltung der WAF entfernt. Die Umgestaltung war Teil des Bemühens, die CPU-Nutzung durch die WAF zu reduzieren.
  3. Die verwendete Engine für reguläre Ausdrücke hatte keine Komplexitätsgarantien.
  4. Die Testsuite konnte eine übermäßige CPU-Nutzung nicht erkennen.
  5. Das SOP erlaubte, dass eine nicht mit einem Notfall zusammenhängende Regeländerung global in die Produktion ging, ohne dass ein gestaffelter Rollout stattfand.
  6. Der Rollback-Plan erforderte, dass die komplette WAF zweimal ausgeführt wird, was zu lange dauerte.
  7. Das Auslösen der ersten Warnmeldung für den globalen Traffic-Rückgang dauerte zu lange.
  8. Unsere Statusseite wurde nicht schnell genug aktualisiert.
  9. Wir hatten wegen des Ausfalls Probleme, auf unsere eigenen Systeme zuzugreifen, und die Mitarbeiter waren für das Umgehungsverfahren nicht gut geschult.
  10. Die SREs hatten den Zugriff auf einige Systeme verloren, da ihre Anmeldedaten aus Sicherheitsgründen ausgesetzt wurden.
  11. Unsere Kunden konnten nicht auf das Cloudflare-Dashboard oder die Cloudflare-API zugreifen, da sie durch das Cloudflare-Edge laufen.
Das ist seit letztem Dienstag passiert

Zunächst stellten wir alle Release-Arbeiten am WAF komplett ein und widmeten uns Folgendem:

  1. Wiedereinführung des entfernten Schutzes vor übermäßiger CPU-Auslastung (erledigt)
  2. Manuelles Überprüfen aller 3.868 Regeln in den WAF Managed Rules, um etwaige andere Fälle von potenziell übermäßigem Backtracking zu finden und zu korrigieren (Prüfung abgeschlossen)
  3. Einführung von Performanceprofilierung für alle Regeln an die Testsuite (voraussichtl.  Juli 2019)
  4. Wechsel zur re2- oder Rust regex-Engine, die beide Laufzeitgarantien bieten (voraussichtl. 31. Juli)
  5. Ändern der SOP, sodass gestaffelte Rollouts von Regeln erfolgen, wie sie auch für andere Software bei Cloudflare durchgeführt werden; dabei soll die Fähigkeit für globale Notfallbereitstellungen bei aktiven Angriffen erhalten bleiben
  6. Implementieren einer Notfallfunktion zum Entfernen von Cloudflare-Dashboard und -API aus dem Cloudflare-Edge
  7. Automatisieren von Aktualisierungen der Cloudflare Status-Seite

Langfristig möchten wir von der Lua-WAF abrücken, die ich vor Jahren schrieb. Wir migrieren die WAF, sodass sie die neue Firewall-Engine nutzen kann. Dadurch wird die WAF schneller und es wird eine weitere Schutzebene hinzugefügt.


Dieser Ausfall war für unsere Kunden und das Team äußerst ärgerlich. Wir reagierten schnell, um das Problem zu beheben, und korrigieren nun die Mängel im Prozess, die den Ausfall möglich machten. Wir gehen der Sache auf den Grund, um Schutz vor weiteren potenziellen Problemen durch die Verwendung regulärer Ausdrücke zu bieten, indem wir die zugrunde liegende Technologie ersetzen.

Wir sind beschämt über den Ausfall und entschuldigen uns bei unseren Kunden für die Auswirkungen. Wir denken, dass die von uns vorgenommenen Änderungen dafür sorgen, dass ein solcher Ausfall nie mehr auftreten wird.

Anhang: Über das Backtracking von regulären Ausdrücken

Um genau zu verstehen, wie  (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) die CPU-Überlastung verursachte, müssen wir etwas darüber wissen, wie eine Standard-Engine für reguläre Ausdrücke funktioniert. Der kritische Teil ist .*(?:.*=.*). Das (?: und die passende ) sind eine Gruppe ohne Erfassung (d. h., der Ausdruck innerhalb der Klammern ist als ein einziger Ausdruck zusammen gruppiert).

Bei der Diskussion, warum dieses Muster eine CPU-Überlastung verursachte, können wir ihn getrost ignorieren und das Muster als .*.*=.* behandeln. Wenn es darauf reduziert wird, sieht das Muster natürlich unnötig komplex aus; das Wichtige ist jedoch, dass jeder Ausdruck aus der „realen Welt“ (wie die komplexen Ausdrücke in unseren WAF-Regeln), der von der Engine verlangt, „irgendetwas gefolgt von irgendetwas abzugleichen“, zu katastrophalem Backtracking führen kann. Hier ist der Grund:

In einem regulären Ausdruck bedeutet . den Abgleich eines einzigen Zeichens. .* bedeutet einen „gierigen“ (greedy) Abgleich von null oder mehr Zeichen (d. h. einen Abgleich von so viel wie möglich). .*.*=.* bedeutet also den Abgleich von null oder mehr Zeichen, dann den Abgleich von null oder mehr Zeichen, dann das Finden eines literalen =-Zeichens, dann den Abgleich von null oder mehr Zeichen.

Nehmen wir die Testzeichenfolge x=x. Sie entspricht dem Ausdruck .*.*=.*. Die .*.* vor dem Gleichheitszeichen können mit dem ersten  x abgeglichen werden (einer der .* entspricht dem x, der andere entspricht null Zeichen). Der .* nach dem = entspricht dem letzten x.

Dieser Abgleich erfordert 23 Schritte. Der erste .* in .*.*=.* verhält sich gierig und gleicht die gesamte Zeichenfolge x=x ab. Die Engine fährt dann mit dem nächsten .* fort. Es sind keine passenden Zeichen mehr übrig, also entspricht der zweite .* null Zeichen (das ist zulässig). Dann fährt die Engine mit dem = fort. Da keine Zeichen zum Abgleichen übrig sind (der erste .* hat alle x=x aufgebraucht), schlägt der Abgleich fehl.

An diesem Punkt führt die Engine für reguläre Ausdrücke Backtracking durch. Sie kehrt zum ersten .* zurück und gleicht ihn mit  x= (anstatt x=x) ab, dann wechselt sie zum zweiten .*. Dieser .* entspricht dem zweiten x. Nun sind keine weiteren Zeichen zum Abgleichen übrig. Deshalb schlägt der Abgleich fehl, wenn die Engine nun versucht, das = in .*.*=.* abzugleichen. Die Engine führt erneut Backtracking durch.

Dieses Mal führt sie das Backtracking so durch, dass der erste .* noch x= entspricht, der zweite .* jedoch nicht mehr x, sondern null Zeichen entspricht. Die Engine fährt dann damit fort, das Literal = im Muster .*.*=.* zu suchen, aber das schlägt fehl (weil es bereits mit dem ersten .* abgeglichen wurde). Die Engine führt erneut Backtracking durch.

Dieses Mal entspricht der erste .* nur dem ersten x. Der zweite .* verhält sich jedoch gierig und gleicht =x ab. Sie ahnen, was nun kommt. Wenn die Engine versucht, das Literal = abzugleichen, schlägt dies fehl und sie führt erneut Backtracking durch.

Der erste .* entspricht immer noch nur dem ersten x. Nun entspricht der zweite .* nur =. Die Engine kann aber, Sie ahnen es, das Literal = nicht abgleichen, weil ihm der zweite .* entsprach. Die Engine führt also erneut Backtracking durch. Denn bei all dem geht es ja, Sie erinnern sich, darum, eine Zeichenfolge aus drei Zeichen abzugleichen.

Nun, da der erste .* nur dem ersten x entspricht und der zweite .* null Zeichen entspricht, kann die Engine schließlich den Literal = im Ausdruck mit dem = in der Zeichenfolge abgleichen. Sie fährt fort und der letzte .* entspricht dem letzten x.

23 Schritte zum Abgleich von x=x. Hier ist ein kurzes Video davon mit dem Perl Regexp::Debugger, das die durchgeführten Schritte und Backtrackings zeigt.

Das ist viel Arbeit. Was aber passiert, wenn die Zeichenfolge von x=x zu x=xx geändert wird? Dieses Mal erfordert der Abgleich 33 Schritte. Und wenn die Eingabe x=xxx lautet, sind es 45. Das ist nicht linear. Hier ist ein Diagramm, das den Abgleich von x=x bis x=xxxxxxxxxxxxxxxxxxxx zeigt (20 x nach dem =). Bei 20 x nach dem = benötigt die Engine 555 Schritte für den Abgleich! (Und wenn das x= fehlen würde, sodass die Zeichenfolge nur aus 20 x bestünde, würde die Engine sogar 4.067 Schritte benötigen, um herauszufinden, dass das Muster nicht übereinstimmt).

Dieses Video zeigt alle notwendigen Backtrackings zum Abgleich von x=xxxxxxxxxxxxxxxxxxxx:

Das ist schlecht, denn wenn die Eingabegröße sich erhöht, steigt die Abgleichzeit superlinear. Es hätte jedoch noch schlimmer kommen können, wenn der reguläre Ausdruck etwas anders aussähe. Angenommen, er hätte .*.*=.*; gelautet (d. h. mit einem literalen Semikolon am Ende des Musters). Dieser Ausdruck könnte z. B. geschrieben werden, um einen Ausdruck wie foo=bar; abzugleichen.

Dieses Mal wäre das Backtracking katastrophal gewesen. Der Abgleich von x=x erfordert 90 Schritte statt 23. Und die Zahl der Schritte wächst sehr schnell. Das Abgleichen von x= gefolgt von 20 x erfordert 5.353 Schritte. Hier ist das entsprechende Diagramm. Sehen Sie sich die Y-Achsen-Werte genau an und vergleichen Sie sie mit dem vorherigen Diagramm.

Um das Bild zu vervollständigen, sind hier alle 5.353 Schritte des fehlgeschlagenen Abgleichs von x=xxxxxxxxxxxxxxxxxxxx mit .*.*=.*;

Durch die Verwendung „fauler“ (lazy) anstelle gieriger Abgleiche lässt sich die Zahl der Backtrackings reduzieren, die in diesem Fall auftreten. Wenn der ursprüngliche Ausdruck zu .*?.*?=.*? geändert wird, erfordert der Abgleich von x=x 11 Schritte (statt 23). Genauso ist es beim Abgleich von x=xxxxxxxxxxxxxxxxxxxx. Der Grund ist, dass das ?  nach dem .* die Engine anweist, zuerst die kleinste Anzahl von Zeichen abzugleichen, bevor sie mit den nächsten Schritten fortfährt.

Faulheit ist aber keine umfassende Lösung für dieses Backtracking-Verhalten. Wenn im Beispiel mit dem katastrophalen Backtracking .*.*=.*; zu .*?.*?=.*?; geändert wird, verändert sich seine Laufzeit überhaupt nicht. x=xerfordert weiterhin 555 Schritte und  x= gefolgt von 20 x erfordert weiterhin 5.353 Schritte.

Die einzige echte Lösung, abgesehen von einem kompletten Umschreiben des Musters, ist, von einer Engine für reguläre Ausdrücke mit diesem Backtracking-Mechanismus abzurücken. Genau das tun wir innerhalb der nächsten paar Wochen.

Die Lösung dieses Problems ist seit 1968 bekannt, als Ken Thompson den Artikel „Programming Techniques: Regular expression search algorithm“ veröffentlichte. Darin wird ein Mechanismus zum Umwandeln eines regulären Ausdrucks in einen NEA (nichtdeterministischer endlicher Automat) beschrieben. Außerdem werden die Zustandswechsel im NEA erläutert, die einem Algorithmus folgen, der zeitlich linear für die Größe der abgeglichenen Zeichenfolge ausgeführt wird.


Thompsons Artikel nimmt nicht direkt Bezug auf den NEA, aber der Algorithmus mit linearer Zeit wird genau erklärt und ein ALGOL-60-Programm, das Assemblersprachencode für den IBM 7094 generiert, wird vorgestellt. Die Implementierung mag obskur erscheinen, die Idee ist es nicht.

So sähe der reguläre Ausdruck .*.*=.* aus, wenn er gemäß den Zeichnungen in Thompsons Artikel dargestellt würde:

Abbildung 0 zeigt fünf Zustände, angefangen mit 0. Die drei Kreise zeigen die Zustände 1, 2 und 3. Sie entsprechen den drei .* im regulären Ausdruck. Die drei Rhomben mit Punkten darin entsprechen einem einzelnen Zeichen. Der Rhombus mit einem =-Zeichen entspricht dem literalen =-Zeichen. Zustand 4 ist der Endzustand. Wenn er erreicht ist, wurde der reguläre Ausdruck abgeglichen.

Um zu prüfen, wie ein solches Zustandsdiagramm zum Abgleich des regulären Ausdrucks .*.*=.* verwendet werden kann, sehen wir uns nun den Abgleich der Zeichenfolge x=x an. Das Programm beginnt mit Zustand 0, wie in Abbildung 1 gezeigt.

Der Schlüssel dazu, diesen Algorithmus zum Funktionieren zu bringen, ist, dass der Zustandsautomat gleichzeitig mehrere Zustände aufweist. Der NEA führt jeden Wechsel, den er erreichen kann, gleichzeitig durch.

Noch bevor er eine Eingabe liest, wechselt er sofort sowohl in Zustand 1 als auch in Zustand 2, wie in Abbildung 2 gezeigt.

In Abbildung 2 sehen wir, was passieren würde, wenn er zuerst x in x=x berücksichtigt. Das x kann dem obersten Punkt entsprechen, indem von Zustand 1 gewechselt und wieder zurück zu Zustand 1 gewechselt wird. Oder das x kann dem Punkt darunter entsprechen, indem von Zustand 2 gewechselt und wieder zurück zu Zustand 2 gewechselt wird.

Nach dem Abgleich des ersten x in x=x sind die Zustände also weiterhin 1 und 2. Die Zustände 3 oder 4 können nicht erreicht werden, da dazu ein literales =-Zeichen benötigt wird.

Als nächstes nimmt sich der Algorithmus das = in x=x vor. Ähnlich wie das x zuvor kann es einem der beiden oberen Kreise mit dem Wechsel von Zustand 1 zu Zustand 1 bzw. Zustand 2 zu Zustand 2 entsprechen. Zusätzlich kann jedoch das Literal = abgeglichen werden und der Algorithmus kann von Zustand 2 zu Zustand 3 (und sofort zu Zustand 4) wechseln. Das ist in Abbildung 3 veranschaulicht.

Als nächstes erreicht der Algorithmus das letzte x in x=x. Von den Zuständen 1 und 2 sind die gleichen Wechsel zurück zu den Zuständen 1 und 2 möglich. Von Zustand 3 kann das x dem Punkt auf der rechten Seite entsprechen und zurück zu Zustand 3 wechseln.

An diesem Punkt wurde jedes Zeichen in x=x berücksichtigt; da Zustand 4 erreicht wurde, entspricht der reguläre Ausdruck dieser Zeichenfolge. Jedes Zeichen wurde einmal verarbeitet. Der Algorithmus war also linear für die Länge der Eingabezeichenfolge. Und kein Backtracking war erforderlich.

Es mag offensichtlich sein, aber nachdem Zustand 4 erreicht wurde (nach dem Abgleich von x=), war der reguläre Ausdruck abgeglichen und der Algorithmus konnte enden, ohne das letzte x überhaupt zu berücksichtigen.

Der Algorithmus ist linear für die Größe seiner Eingabe.

Categories: Technology


Fri, 12/07/2019 - 16:45

9年ほど前のCloudflareは小さな会社で、当時私は一顧客であり従業員ではありませんでした。Cloudflareがひと月前に設立されたという時期のある日、私は自分の小さなサイト、jgc.orgのDNSが動作していないという警告を受け取りました。そしてCloudflareはProtocol Buffersの使用に変更を加えた上でDNSを切断したのです。

私は「私のDNSはどうなったのでしょうか?」という件名のメールを直接Matthew Prince宛に出しました。すると彼は長文かつ詳細な返信をくれたのです。(メールのやり取りの全文はこちらからご覧いただけます)。下記は私のそのメールに対する返信です。

From: John Graham-Cumming 日時:2010/10/7(木)9:14 AM 件名:Re: 私のDNSはどうなったのでしょうか? To: Matthew Prince ご報告ありがとうございました。何か問題があれば ご連絡します。 技術詳細に関する全容が判明したら、 本件をブログに記載するのはいかがでしょうか。 本件に対しての開示や誠実であることを他の人も評価すると思うのです。 特に、ローンチ後のトラフィック増加を示すグラフを 添えていただければと思います。 私は自分のサイトを厳格に監視しているので、何かあれば SMSを受け取れます。 監視結果では13:03:07から14:04:12までダウンしていたことが わかりました。 テストは5分おきに実行されています。 本件は大事には至らずに済んでいますし、解決していただけると確信しています。 しかしながら、ヨーロッパには本当に


これに対するMatthewの返信は以下の通りです。 From: Matthew Prince 日時:2010/10/7(木)9:57 AM 件名:Re: 私のDNSはどうなったのでしょうか? To: John Graham-Cumming ありがとうございます。Cloudflareではいただいたメールすべてに対して返信しております。私は現在 オフィスに向かっており、ブログへの投稿またはCloudflareの掲示板システムのトップに 公式投稿をピン留めする予定です。透明性が一番だということには 全面的に同意します。





インシデント中のCloudflare PoPにおけるCPU使用量

この結果、Cloudflareのお客様(およびお客様の顧客の方々)に対し、Cloudflareのドメイン訪問時に502エラーが表示されることとなりました。この502エラーはフロントのCloudflare Webサーバーに利用可能なCPUコアがあるにも関わらずHTTP/HTTPSトラフィックを配信するプロセスに到達できないことにより発生したものです。








3分後、1つ目のPagerDutyページがWAFの異常を表示して停止しました。これはCloudflare外からWAFの機能を確認する模擬テストで(このようなテストは数百とあります)、正常動作を確認するためのものでした。そしてすぐにCloudflareサービスのエンドツーエンドテストの失敗、グローバルなトラフィック低下アラート、502エラーの蔓延がページに表示され、世界各都市のPoint of Presence(PoP)からCPU枯渇に関する報告を多数受けました。




14時00分、WAFが問題の原因コンポーネントであることを特定し、攻撃が原因である可能性は却下されました。パフォーマンスチームはマシンから稼働中のCPUデータを取得し、WAFが原因であることを明示しました。他のチームのメンバーがstraceで確認を行い、また別のチームはWAFが問題を起こしているという記載があるエラーログを発見しました。14時02分、私は全チームに対して「global kill」を行う提案をしました。これはCloudflareに組み込まれた仕組みで、世界中の単一コンポーネントを無効とするものです。

しかしWAFに対するglobal killの実行も簡単にはいきませんでした。また問題が現れたのです。Cloudflareでは自社製品を使用しているため、Accessサービスがダウンすると内部のコントロールパネルで認証することができないのです(復旧後、内部コントロールパネルをあまり使用していないメンバーはセキュリティ機能により資格情報が無効になったためアクセスできなくなっていることがわかりました)。

さらにJiraやビルドシステムのような他の内部サービスも利用できなくなりました。利用できるようにするにはあまり使っていないバイパスの仕組みを使う必要がありました(これも本件の後で検討すべき項目です)。最終的にチームメンバーがWAFのglobal killを14時07分に実行し、14時09分までに世界中のトラフィックレベルおよびCPUが想定状態にまで戻りました。その他のCloudflareの保護の仕組みは継続して運用できています。






上記の変更申請でご確認いただける通り、リリース計画、ロールバック計画、この種のリリース向けの内部標準業務手順書(SOP)へのリンクが記載されています。そして、ルール変更向けのSOPでは特別にグローバルなプッシュが許可されています。これはCloudflareでリリースする他のソフトウェアとは大きく異なるものです。通常SOPのプッシュ先はまず内部の試験運用版ネットワークにあるPoit of Presence(PoP)、次に独立した地域にいる少数のお客様、多数のお客様、最後に世界という順になります。



DOG PoPは(世界の他の場所と同様)CloudflareのPoPですが、Cloudflareの社員のみが利用するものです。この試験運用版のPoPではお客様のトラフィックがコードに接触する前に問題を早期発見することができ、実際頻繁に検出されています。
















WAFルールは新たな脅威に対応する必要があるため、数秒で世界中に変更を適用することのできるCloudflareの分散型Key-Value Store(KVS)、Quicksilverを使用してリリースしています。この技術はCloudflareのダッシュボード内やAPI経由での設定変更時にCloudflareの全てのお客様が使用しているもので、Cloudflareが変更に対して非常に迅速に処理できる理由でもあります。

Quicksilverについてはこれまであまり言及したことがありませんでした。以前CloudflareではKyoto Tycoonを分散型Key-Value Storeとしてグローバルに採用しておりましたが、運用上の問題が発生したため独自KVSを構築して180以上の都市に複製していました。Quicksilverはお客様の設定に変更を加えたり、WAFルールを更新したり、Cloudflare Workersを使用して書いたJavaScriptコードを配信したりするための手段です。


さらに、Quicksilverは非常に高速です。 平均では2.29秒で世界中のマシンへ1つの変更を配信することができます。通常、このスピードは素晴らしいことです。要するに、機能を有効にしたりキャッシュをパージしたりする際、世界中に一瞬で稼働させられるのです。Cloudflare Workersでコードをプッシュすると、同じ速度でプッシュすることができます。これは必要なときに高速で更新ができるという、Cloudflareのお約束の1つです。



Cloudflare WAFのリリース手順



  1. エンジニアが簡単に膨大なバックトラックをしてしまう正規表現を記述しました。
  2. 数週間前に実施したWAFのリファクタリングにより、正規表現によるCPUの過度な消費を防ぐための保護が誤って削除されていました。このリファクタリングはWAFによるCPU消費を抑えるためのものでした。
  3. 使用していた正規表現エンジンには複雑性の保証がありませんでした。
  4. テストスイートに過度なCPU消費を特定する手段がありませんでした。
  5. 段階的ロールアウトをせずに世界中の本番環境へ緊急性のないルール変更を展開できるようなSOPになっていました。
  6. WAFの完全なビルドを2回実行するという時間のかかりすぎることがロールバック計画で要求されていました。
  7. グローバルなトラフィックドロップに対する初めのアラートの発火が遅すぎました。
  8. Cloudflareのステータスページをすぐに更新できませんでした。
  9. 停止およびバイパス手順に不慣れだったため、Cloudflare内からのCloudflare独自システムへのアクセスが困難でした。
  10. セキュリティ上の理由により資格情報がタイムアウトしたため、SREが一部のシステムにアクセスできませんでした。
  11. Cloudflareのエッジを経由するお客様は、CloudflareのダッシュボードやAPIにアクセスできませんでした。


  1. 過度のCPU利用を行う保護を取り除いた上での再導入(完了)
  2. WAFマネージドルールにある全3,868件のルールを手動にて調査し、過度のバックトラッキングが発生する可能性があるその他のインスタンスを検出、修正(検査は完了)
  3. 全ルールに対するパフォーマンスプロファイリングをテストスイートに導入(完了予定: 7月19日)
  4. 正規表現エンジンをre2またはRustに切り替え(どちらもランタイム保証搭載)(完了予定:7月31日)
  5. 進行中の攻撃に対する緊急かつグローバルなリリースを実施できる点を保持しつつ、Cloudflareの他のソフトウェアと同じ方法でルールを段階的ロールアウトするようSOPを変更
  6. CloudflareのダッシュボードやAPIをCloudflareのエッジから外すための緊急機能を設置
  7. Cloudflareステータスページへの更新の自動化






(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) がどのようにCPU枯渇を引き起こすのかを完全に理解するには、正規表現エンジンの動作を少々理解しておく必要があります。重要なのは.*(?:.*=.*)の部分です。(?:と)は非キャプチャグループです(つまり、カッコ内の表現は1つの表現としてグルーピングされています)。



















この問題に対する解決策は1968年のKen Thompson氏による「Programming Techniques:Regular expression search algorithm(プログラミングアルゴリズム:正規表現の検索アルゴリズム)」という論文で知られているものです。この論文では正規表現をNFA(非決定性有限オートマトン)に変換し、その後照合する文字列のサイズで時間線形的に実行するアルゴリズムを用いてNFAの状態遷移をするメカニズムについて説明しています。

この論文で実際にNFAに関する記述があるわけではありませんが、線形時間アルゴリズムに関しては明確に説明されており、IBM 7094用のアセンブリ言語のコードを生成するALGOL60プログラムが提示されています。その実装は難解なものですが、考え方はさほどではありません。













タグ事後検討,停止,Deep Dive

Categories: Technology

Détails de la panne Cloudflare du 2 juillet 2019

Fri, 12/07/2019 - 16:45

Il y a près de neuf ans, Cloudflare était une toute petite entreprise dont j’étais le client, et non l’employé. Cloudflare était sorti depuis un mois et un jour, une notification m’alerte que mon petit site,, semblait ne plus disposer d’un DNS fonctionnel. Cloudflare avait effectué une modification dans l’utilisation de Protocol Buffers qui avait endommagé le DNS.

J’ai contacté directement Matthew Prince avec un e-mail intitulé « Où est mon DNS ? » et il m’a envoyé une longue réponse technique et détaillée (vous pouvez lire tous nos échanges d’e-mails ici) à laquelle j’ai répondu :

De: John Graham-Cumming Date: Jeudi 7 octobre 2010 à 09:14 Objet: Re: Où est mon DNS? À: Matthew Prince Superbe rapport, merci. Je veillerai à vous appeler s’il y a un problème. Il serait peut-être judicieux, à un certain moment, d’écrire tout cela dans un article de blog, lorsque vous aurez tous les détails techniques, car je pense que les gens apprécient beaucoup la franchise et l’honnêteté sur ce genre de choses. Surtout si vous y ajoutez les tableaux qui montrent l’augmentation du trafic suite à votre lancement. Je dispose d’un système robuste de surveillance de mes sites qui m’envoie un SMS en cas de problème. La surveillance montre une panne de 13:03:07 à 14:04:12. Des tests sont réalisés toutes les cinq minutes. Un accident de parcours dont vous vous relèverez certainement. Mais ne pensez-vous pas qu’il vous faudrait quelqu’un en Europe? :-)

Ce à quoi il a répondu :

De: Matthew Prince Date: Jeudi 7 octobre 2010 à 09:57 Objet: Re: Où est mon DNS? À: John Graham-Cumming Merci. Nous avons répondu à tous ceux qui nous ont contacté. Je suis en route vers le bureau et nous allons publier un article sur le blog ou épingler un post officiel en haut de notre panneau d’affichage. Je suis parfaitement d’accord, la transparence est nécessaire.

Aujourd’hui, en tant qu’employé d’un Cloudflare bien plus grand, je vous écris de manière transparente à propos d’une erreur que nous avons commise, de son impact et de ce que nous faisons pour régler le problème.

Les événements du 2 juillet

Le 2 juillet, nous avons déployé une nouvelle règle dans nos règles gérées du pare-feu applicatif Web (WAF) qui ont engendré un surmenage des processeurs sur chaque cœur de processeur traitant le trafic HTTP/HTTPS sur le réseau Cloudflare à travers le monde. Nous améliorons constamment les règles gérées du WAF pour répondre aux nouvelles menaces et vulnérabilités. En mai, par exemple, nous avons utilisé la vitesse avec laquelle nous pouvons mettre à jour le WAF pour appliquer une règle et nous protéger d’une vulnérabilité SharePoint importante. Être capable de déployer rapidement et globalement des règles est une fonctionnalité essentielle de notre WAF.

Malheureusement, la mise à jour de mardi dernier contenait une expression régulière qui nous a fait reculer énormément et qui a épuisé les processeurs utilisés pour la distribution HTTP/HTTPS. Les fonctionnalités essentielles de mise en proxy, du CDN et du WAF de Cloudflare sont tombées en panne. Le graphique suivant présente les processeurs dédiés à la distribution du trafic HTTP/HTTPS atteindre presque 100 % d’utilisation sur les serveurs de notre réseau.

Utilisation des processeurs pendant l’incident dans l’un de nos PoP

En conséquence, nos clients (et leurs clients) voyaient une page d’erreur 502 lorsqu’ils visitaient n’importe quel domaine Cloudflare. Les erreurs 502 étaient générées par les serveurs Web frontaux Cloudflare dont les cœurs de processeurs étaient encore disponibles mais incapables d’atteindre les processus qui distribuent le trafic HTTP/HTTPS.

Nous réalisons à quel points nos clients ont été affectés. Nous sommes terriblement gênés qu’une telle chose se soit produite. L’impact a été négatif également pour nos propres activités lorsque nous avons traité l’incident.

Cela a du être particulièrement stressant, frustrant et angoissant si vous étiez l’un de nos clients. Le plus regrettable est que nous n’avions pas eu de panne globale depuis six ans.

Le surmenage des processeurs fut causé par une seule règle WAF qui contenait une expression régulière mal écrite et qui a engendré un retour en arrière excessif. L’expression régulière au cœur de la panne est (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Bien que l’expression régulière soit intéressante pour de nombreuses personnes (et évoquée plus en détails ci-dessous), comprendre ce qui a causé la panne du service Cloudflare pendant 27 minutes est bien plus complexe « qu’une expression régulière qui a mal tourné ». Nous avons pris le temps de noter les séries d’événements qui ont engendré la panne et nous ont empêché de réagir rapidement. Et si vous souhaitez en savoir plus sur le retour en arrière d’une expression régulière et que faire dans ce cas, veuillez consulter l’annexe à la fin de cet article.

Ce qui s’est produit

Commençons par l’ordre des événements. Toutes les heures de ce blog sont en UTC.

À 13:42, un ingénieur de l’équipe pare-feu a déployé une légère modification aux règles de détection XSS via un processus automatique. Cela a généré un ticket de requête de modification. Nous utilisons Jira pour gérer ces tickets (capture d’écran ci-dessous).

Trois minutes plus tard, la première page PagerDuty indiquait une erreur avec le pare-feu applicatif Web (WAF). C’était un test synthétique qui vérifie les fonctionnalités du WAF (nous disposons de centaines de tests de ce type) en dehors de Cloudflare pour garantir qu’il fonctionne correctement. Ce test fut rapidement suivi par des pages indiquant de nombreux échecs d’autres tests de bout-en-bout des services Cloudflare, une alerte de perte de trafic globale, des erreurs 502 généralisées, puis par de nombreux rapports de nos points de présence (PoP) dans des villes à travers le monde indiquant un surmenage des processeurs.

Préoccupé par ces alertes, j’ai quitté la réunion à laquelle j’assistais et en me dirigeant vers mon bureau, un responsable de notre groupe Ingénieur Solutions m’a dit que nous avions perdu 80 % de notre trafic. J’ai couru au département SRE où l’équipe était en train de déboguer la situation. Au tout début de la panne, certains pensaient à un type d’attaque que nous n’avions jamais connu auparavant.

L’équipe SRE de Cloudflare est répartie dans le monde entier pour assurer une couverture continue. Les alertes de ce type, dont la majorité identifient des problèmes très spécifiques sur des cadres limités et dans des secteurs précis, sont surveillées dans des tableaux de bord internes et traitées de nombreuses fois chaque jour. Cependant, ce modèle de pages et d’alertes indiquait que quelque chose de grave s’était produit, et le département SRE a immédiatement déclaré un incident P0 et transmis à la direction de l’ingénierie et à l’ingénierie des systèmes.

À ce moment, l’équipe d’ingénieurs de Londres assistait à une conférence technique interne dans notre espace principal d’événements. La discussion fut interrompue et tout le monde s’est rassemblé dans une grande salle de conférence ; les autres se sont connectés à distance. Ce n’était pas un problème normal que le département SRE pouvait régler seul ; nous avions besoin que toutes les équipes appropriées soient en ligne au même moment.

À 14:00, nous avons identifié que le composant à l’origine du problème était le WAF, et nous avons écarté la possibilité d’une attaque. L’équipe Performances a extrait les données en direct du processeur d’une machine qui montrait clairement que le WAF était responsable. Un autre membre de l’équipe a utilisé Strace pour confirmer. Une autre équipe a observé des journaux d’erreur indiquant que le WAF présentait un problème. À 14:02, toute l’équipe m’a regardé : la proposition était d’utiliser un « global kill », un mécanisme intégré à Cloudflare pour désactiver un composant unique partout dans le monde.

Mais utiliser un « global kill » pour le WAF, c’était une toute autre histoire. Nous avons rencontré plusieurs obstacles. Nous utilisons nos propres produits et avec la panne de notre service Access, nous ne pouvions plus authentifier notre panneau de contrôle interne (une fois de retour en ligne, nous avons découvert que certains membres de l’équipe avaient perdu leur accès à cause d’une fonctionnalité de sécurité qui désactive les identifiants s’ils n’utilisent pas régulièrement le panneau de contrôle interne).

Et nous ne pouvions pas atteindre les autres services internes comme Jira ou le moteur de production. Pour les atteindre, nous avons du employer un mécanisme de contournement très rarement utilisé (un autre point à creuser après l’événement). Finalement, un membre de l’équipe a exécuté le « global kill » du WAF à 14:07 ; à 14:09, les niveaux du trafic et des processeurs sont revenus à la normale à travers le monde. Le reste des mécanismes de protection de Cloudflare a continué de fonctionner.

Nous sommes ensuite passés à la restauration de la fonctionnalité WAF. Le côté sensible de la situation nous a poussé à réaliser des tests négatifs (nous demander « est-ce réellement ce changement particulier qui a causé le problème ? ») et des tests positifs (vérifier le fonctionnement du retour) dans une ville unique à l’aide d’un sous-ensemble de trafic après avoir retiré de cet emplacement le trafic de nos clients d’offres payantes.

À 14:52, nous étions certains à 100 % d’avoir compris la cause, résolu le problème et que le WAF était réactivé partout dans le monde.

Fonctionnement de Cloudflare

Cloudflare dispose d’une équipe d’ingénieurs dédiée à notre solution de règles gérées du WAF ; ils travaillent constamment pour améliorer les taux de détection, réduire les faux positifs et répondre rapidement aux nouvelles menaces dès qu’elles apparaissent. Dans les 60 derniers jours, 476 requêtes de modifications ont été traitées pour les règles gérées du WAF (en moyenne une toutes les 3 heures).

Cette modification spécifique fut déployée en mode « simulation » où le trafic client passe au travers de la règle mais rien n’est bloqué. Nous utilisons ce mode pour tester l’efficacité d’une règle et mesurer son taux de faux positifs et de faux négatifs. Cependant, même en mode simulation, les règles doivent être exécutées, et dans cette situation, la règle contenait une expression régulière qui consommait trop de processeur.

On peut observer dans la requête de modification ci-dessus un plan de déploiement, un plan de retour et un lien vers la procédure opérationnelle standard (POS) interne pour ce type de déploiement. La POS pour une modification de règle lui permet d’être appliquée à l’échelle mondiale. Ce processus est très différent de tous les logiciels que nous publions chez Cloudflare, où la POS applique d’abord le logiciel au réseau interne de dogfooding d’un point de présence (PoP) (au travers duquel passent nos employés), puis à quelques clients sur un emplacement isolé, puis à un plus grand nombre de clients et finalement au monde entier.

Le processus pour le lancement d’un logiciel ressemble à cela : Nous utilisons Git en interne via BitBucket. Les ingénieurs qui travaillent sur les modifications intègrent du code créé par TeamCity ; les examinateurs sont désignés une fois la construction validée. Lorsqu’une requête d’extraction est approuvée, le code est créé et la série de tests exécutée (à nouveau).

Si la construction et les tests sont validés, une requête de modification Jira est générée et la modification doit être approuvée par le responsable approprié ou la direction technique. Une fois approuvée, le déploiement de ce que l’on appelle les « PDP animaux » survient : DOG, PIG et les Canaries

Le PoP DOG est un PoP Cloudflare (au même titre que toutes nos villes à travers le monde) mais qui n’est utilisé que par les employés Cloudflare. Ce PoP de dogfooding nous permet d’identifier les problèmes avant que le trafic client n’atteigne le code. Et il le fait régulièrement.

Si le test DOG réussit, le code est transféré vers PIG (en référence au « cobaye » ou « cochon d’Inde »). C’est un PoP Cloudflare sur lequel un petit sous-ensemble de trafic client provenant de clients de l’offre gratuite passe au travers du nouveau code.

Si le test réussit, le code est transféré vers les Canaries. Nous disposons de trois PoP Canaries répartis dans le monde entier ; nous exécutons du trafic client des offres payantes et gratuite qui traverse ces points sur le nouveau code afin de vérifier une dernière fois qu’il n’y a pas d’erreur.

Processus de lancement d’un logiciel Cloudflare

Une fois validé aux Canaries, le code peut être déployé. L’intégralité du processus DOG, PIG, Canari, Global peut prendre plusieurs heures ou plusieurs jours en fonction du type de modification de code. La diversité du réseau et des clients de Cloudflare nous permet de tester rigoureusement le code avant d’appliquer un lancement à tous nos clients partout dans le monde. Cependant, le WAF n’est pas conçu pour utiliser ce processus car il doit répondre rapidement aux menaces.

Menaces WAF

Ces derniers années, nous avons observé une augmentation considérable des vulnérabilités dans les applications communes. C’est une conséquence directe de la disponibilité augmentée des outils de test de logiciels, comme par exemple le fuzzing (nous venons de publier un nouveau blog sur le fuzzing ici).


Le plus souvent, une preuve de concept (PoC, Proof of Concept) est créée et publiée rapidement sur Github, afin que les équipes en charge de l’exécution et de la maintenance des applications puissent effectuer des tests et vérifier qu’ils disposent des protections adéquates. Il est donc essentiel que Cloudflare soit capable de réagir aussi vite que possible aux nouvelles attaques pour offrir à nos clients une chance de corriger leur logiciel.

Le déploiement de nos protections contre la vulnérabilité SharePoint au mois de mai illustre parfaitement la façon dont Cloudflare est capable d’offrir une protection de manière proactive (blog ici). Peu de temps après la médiatisation de nos annonces, nous avons observé un énorme pic de tentatives visant à exploiter les installations Sharepoint de notre client. Notre équipe surveille constamment les nouvelles menaces et rédige des règles pour les atténuer pour le compte de nos clients.

La règle spécifique qui a engendré la panne de mardi dernier ciblait les attaques Cross-Site Scripting (XSS). Ce type d’attaque a aussi considérablement augmenté ces dernières années.


La procédure standard pour une modification des règles gérées du WAF indique que les tests d’intégration continue (CI) doivent être validés avant un déploiement à l’échelle mondiale. Cela s’est passé normalement mardi dernier et les règles ont été déployées. À 13:31, un ingénieur de l’équipe fusionnait une requête d’extraction contenant la modification déjà approuvée.

À 13:37, TeamCity a construit les règles et exécuté les tests puis donné son feu vert. La série de tests WAF vérifie que les fonctionnalités principales du WAF fonctionnent ; c’est un vaste ensemble de tests individuels ayant chacun une fonction associée. Une fois les tests individuels réalisés, les règles individuelles du WAF sont testées en exécutant une longue série de requêtes HTTP sur le WAF. Ces requêtes HTTP sont conçues pour tester les requêtes qui doivent être bloquées par le WAF (pour s’assurer qu’il identifie les attaques) et celles qui doivent être transmises (pour s’assurer qu’il ne bloque pas trop et ne crée pas de faux positifs). Il n’a pas vérifié si le WAF utilisait excessivement le processeur, et en examinant les journaux des précédentes constructions du WAF, on peut voir qu’aucune augmentation n’est observée pendant l’exécution de la série de tests avec la règle qui aurait engendré le surmenage du processeur sur notre périphérie.

Après le succès des tests, TeamCity a commencé à déployer automatiquement la modification à 13:42.


Les règles du WAF doivent répondre à de nouvelles menaces ; elles sont donc déployées à l’aide de notre base de données clé-valeur Quicksilver, capable d’appliquer des modifications à l’échelle mondiale en quelques secondes. Cette technologie est utilisée par tous nos clients pour réaliser des changements de configuration dans notre tableau de bord ou via l’API. C’est la base de notre service : répondre très rapidement aux modifications.

Nous n’avons pas vraiment abordé Quicksilver. Auparavant, nous utilisions Kyoto Tycoon comme base de données clé-valeur distribuées globalement, mais nous avons rencontré des problèmes opérationnels et rédigé notre propre base de données clé-valeur que nous avons reproduite dans plus de 180 villes. Quicksilver nous permet d’appliquer des modifications de configuration client, de mettre à jour les règles du WAF et de distribuer du code JavaScript rédigé par nos clients à l’aide de Cloudflare Workers.

Cliquer sur un bouton dans le tableau de bord, passer un appel d’API pour modifier la configuration et cette modification prendra effet en quelques secondes à l’échelle mondiale. Le clients adorent cette configurabilité haut débit. Avec Workers, ils peuvent profiter d’un déploiement logiciel global quasi instantané. En moyenne, Quicksilver distribue environ 350 modifications par seconde.

Et Quicksilver est très rapide.  En moyenne, nous atteignons un p99 de 2,29 s pour distribuer une modification sur toutes nos machines à travers le monde. En général, cette vitesse est un avantage. Lorsque vous activez une fonctionnalité ou purgez votre cache, vous êtes sûr que ce sera fait à l’échelle mondiale presque instantanément. Lorsque vous intégrez du code avec Cloudflare Workers, celui-ci est appliqué à la même vitesse. La promesse de Cloudflare est de vous offrir des mises à jour rapides quand vous en avez besoin.

Cependant, dans cette situation, cette vitesse a donc appliqué la modification des règles à l’échelle mondiale en quelques secondes. Vous avez peut-être remarqué que le code du WAF utilise Lua. Cloudflare utilise beaucoup Lua en production, et des informations sur Lua dans le WAF ont été évoquées auparavant. Le WAF Lua utilise PCRE en interne, le retour en arrière pour la correspondance, et ne dispose pas de mécanisme pour se protéger d’une expression étendue. Ci-dessous, plus d’informations sur nos actions.

Tout ce qui s’est passé jusqu’au déploiement des règles a été fait « correctement » : une requête d’extraction a été lancée, puis approuvée, CI/CD a créé le code et l’a testé, une requête de modification a été envoyée avec une procédure opérationnelle standard précisant le déploiement et le retour en arrière, puis le déploiement a été exécuté.

Processus de déploiement du WAF Cloudflare

Les causes du problème

Nous déployons des douzaines de nouvelles règles sur le WAF chaque semaine, et nous avons de nombreux systèmes en place pour empêcher tout impact négatif sur ce déploiement. Quand quelque chose tourne mal, c’est donc généralement la convergence improbable de causes multiples. Trouver une cause racine unique est satisfaisant mais peut obscurcir la réalité. Voici les vulnérabilités multiples qui ont convergé pour engendrer la mise hors ligne du service HTTP/HTTPS de Cloudflare.

  • Un ingénieur a rédigé une expression régulière qui pouvait facilement causer un énorme retour en arrière.
  • Une protection qui aurait pu empêcher l’utilisation excessive du processeur par une expression régulière a été retirée par erreur lors d’une refonte du WAF quelques semaines plus tôt (refonte dont l’objectif était de limiter l’utilisation du processeur par le WAF).
  • Le moteur d’expression régulière utilisé n’avait pas de garanties de complexité.
  • La série de tests ne présentait pas de moyen d’identifier une consommation excessive du processeur.
  • La procédure opérationnelle standard a permis la mise en production globale d’une modification de règle non-urgente sans déploiement progressif.
  • Le plan de retour en arrière nécessitait la double exécution de l’intégralité du WAF, ce qui aurait pris trop de temps.
  • La première alerte de chute du trafic global a mis trop de temps à se déclencher.
  • Nous n’avons pas mis à jour suffisamment rapidement notre page d’état.
  • Nous avons eu des problèmes pour accéder à nos propres systèmes en raison de la panne et la procédure de contournement n’était pas bien préparée.
  • Les ingénieurs fiabilité avaient perdu l’accès à certains systèmes car leurs identifiants ont expiré pour des raisons de sécurité.
  • Nos clients ne pouvaient pas accéder au tableau de bord ou à l’API Cloudflare car ils passent par la périphérie Cloudflare.
Ce qui s’est passé depuis mardi dernier

Tout d’abord, nous avons interrompu la totalité du travail de publication sur le WAF et réalisons les actions suivantes :

  • Réintroduction de la protection en cas d’utilisation excessive du processeur qui avait été retirée. (Terminé)
  • Inspection manuelle de toutes les 3 868 règles des règles gérées du WAF pour trouver et corriger tout autre cas d’éventuel retour en arrière excessif. (Inspection terminée)
  • Introduction de profil de performances pour toutes les règles sur la série de tests. (ETA :  19 juillet)
  • Passage au moteur d’expression régulière re2 ou Rust qui disposent de garanties de temps d’exécution. (ETA : 31 juillet)
  • Modifier la procédure opérationnelle standard afin d’effectuer des déploiements progressifs de règles de la même manière que pour les autres logiciels chez Cloudflare tout en conservant la capacité de réaliser des déploiements globaux d’urgence pour les attaques actives.
  • Mettre en place une capacité d’urgence pour retirer le tableau de bord et l’API Cloudflare de la périphérie de Cloudflare.
  • Mettre à jour automatiquement la page d’état Cloudflare.

À long terme, nous nous éloignons du WAF Lua que j’ai rédigé il y a plusieurs années. Nous déplaçons le WAF pour qu’il utilise le nouveau moteur pare-feu. Cela rendra le WAF plus rapide et offrira une couche de protection supplémentaire.


Ce fut une panne regrettable pour nos clients et pour l’équipe. Nous avons réagi rapidement pour régler la situation et nous réparons les défauts de processus qui ont permis à cette panne de se produire. Afin de nous protéger contre d’éventuels problèmes ultérieurs avec notre utilisation des expressions régulières, nous allons remplacer notre technologie fondamentale.

Nous sommes très gênés par cette panne et désolés de l’impact sur nos clients. Nous sommes convaincus que les changements réalisés nous permettront de ne plus jamais subir une telle panne.

Annexe : À propos du retour en arrière d’expression régulière

Pour comprendre comment (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))  a causé le surmenage du processeur, vous devez vous familiariser avec le fonctionnement d’un moteur d’expression régulière standard. La partie critique est .*(?:.*=.*). Le(?: et les ) correspondantes sont un groupe de non-capture (l’expression entre parenthèses est regroupée dans une seule expression).

Pour identifier la raison pour laquelle ce modèle engendre le surmenage du processeur, nous pouvons l’ignorer et traiter le modèle .*.*=.*. En le réduisant, le modèle paraît excessivement complexe, mais ce qui est important, c’est que toute expression du « monde réel » (comme les expressions complexes au sein de nos règles WAF) qui demande au moteur de « faire correspondre quelque chose suivi de quelque chose » peut engendrer un retour en arrière catastrophique. Explication.

Dans une expression régulière, . signifie correspondre à un caractère unique, .* signifie correspondre à zéro ou plus de caractères abondamment (c’est à dire faire correspondre autant que possible). .*.*=.* signifie donc correspondre à zéro ou plus de caractères, puis à nouveau zéro ou plus de caractères, puis trouver un signe littéral =, puis correspondre à zéro ou plus de caractères.

Prenons la chaîne test x=x. Cela correspondra à l’expression .*.*=.*. Les .*.* précédant le signe = peuvent correspondre au premier x (l’un des .* correspond au x, l’autre correspond à zéro caractère). Le .* après le signe = correspond au x final.

Il faut 23 étapes pour atteindre cette correspondance. Le premier .* de .*.*=.* agit abondamment et correspond à la chaîne complète x=x. Le moteur avance pour examiner le .* suivant. Il ne reste aucun caractère à faire correspondre, le deuxième .* correspond donc à zéro caractère (c’est autorisé). Le moteur passe ensuite au =. La correspondance échoue car il n’y a plus de caractère à faire correspondre (le premier .* ayant consommé l’intégralité de x=x).

À ce moment, le moteur d’expression régulière retourne en arrière. Il retourne au premier  .* et le fait correspondre à x= (au lieu de x=x) et passe ensuite au deuxième .*. Ce .* correspond au deuxième x et il n’y a maintenant plus aucun caractère à associer. Quand le moteur essaie d’associer le = dans .*.*=.*, la correspondance échoue. Le moteur retourne de nouveau en arrière.

Avec ce retour en arrière, le premier .* correspond toujours à x= mais le deuxième .* ne correspond plus à x ; il correspond à zéro caractère. Le moteur avance ensuite pour essayer de trouver le signe littéral = dans le modèle .*.*=.* mais échoue (car il correspond déjà au premier .*). Le moteur retourne de nouveau en arrière.

Cette fois-ci, le premier .* correspond seulement au premier x. Mais le deuxième .* agit abondamment et correspond à =x. Vous imaginez la suite. Lorsqu’il essaie d’associer le signe littéral =, il échoue et retourne de nouveau en arrière.

Le premier .* correspond toujours seulement au premier x. Désormais, le deuxième .* correspond seulement à =. Comme vous l’avez deviné, le moteur ne peut pas associer le signe littéral = car il correspond déjà au deuxième .*. Le moteur retourne donc de nouveau en arrière. Tout cela pour associer une chaîne de trois caractères.

Enfin, avec le premier .* correspondant uniquement au premier x et le deuxième .* correspondant à zéro caractère, le moteur est capable d’associer le signe littéral = dans l’expression avec le = contenu dans la chaîne. Il avance et le dernier .* correspond au dernier x.

23 étapes pour faire correspondre x=x. Voici une courte vidéo de l’utilisation de Perl Regexp::Debugger présentant les étapes et le retour en arrière.

Cela représente beaucoup de travail, mais que se passe-t-il si la chaîne passe de x=x à x=xx ? La correspondance prend cette fois 33 étapes. Et si l’entrée est x=xxx, 45 étapes. Ce n’est pas linéaire. Voici un graphique présentant la correspondance de x=x à x=xxxxxxxxxxxxxxxxxxxx (20 x après le =). Avec 20 x après le =, le moteur requiert 555 étapes pour associer ! (Pire encore, si le x= était manquant et que la chaîne ne contenait que 20 x, le moteur nécessiterait 4 067 étapes pour réaliser que le modèle ne correspond pas).

Cette vidéo montre tout le retour en arrière nécessaire pour correspondre à  x=xxxxxxxxxxxxxxxxxxxx:

Ce n’est pas bon signe car à mesure que la taille de l’entrée augmente, le temps de correspondance augmente de manière ultra linéaire. Mais cela aurait pu être encore pire avec une expression régulière légèrement différente. Prenons l’exemple .*.*=.*; (avec un point-virgule littéral à la fin du modèle). Nous aurions facilement pu rédiger cela pour tenter d’associer une expression comme foo=bar;.

Le retour en arrière aurait été catastrophique. Associer x=x prend 90 étapes au lieu de 23. Et le nombre d’étapes augmente très rapidement. Associer x= suivi de 20 x prend 5 353 étapes. Voici le graphique correspondant : Observez attentivement les valeurs de l’axe Y par rapport au graphique précédent.

Pour terminer, voici toutes les 5 353 étapes de l’échec de la correspondance de x=xxxxxxxxxxxxxxxxxxxx avec .*.*=.*;

L’utilisation de correspondances fainéantes plutôt que gourmandes (lazy / greedy) permet de contrôler l’étendue du retour en arrière qui survient dans ce cas précis. Si l’expression originale est modifiée en .*?.*?=.*?, la correspondance de x=x prend 11 étapes (au lieu de 23), tout comme la correspondance de x=xxxxxxxxxxxxxxxxxxxx. C’est parce que le ? après le .*ordonne au moteur d’associer le plus petit nombre de caractères avant de commencer à avancer.

Mais la fainéantise n’est pas la solution totale à ce comportement de retour en arrière. Modifier l’exemple catastrophique .*.*=.*; en .*?.*?=.*?; n’affecte pas du tout son temps d’exécution. x=x prend toujours 555 étapes et x= suivi de 20 x prend toujours 5 353 étapes.

La seule vraie solution, mise à part la réécriture complète du modèle pour être plus précis, c’est de s’éloigner du moteur d’expression régulière avec ce mécanisme de retour en arrière. Ce que nous faisons au cours des prochaines semaines.

La solution à ce problème existe depuis 1968, lorsque Ken Thompson écrivait un article intitulé « Techniques de programmation : Algorithme de recherche d’expression régulière ». L’article décrit un mécanisme pour convertir une expression régulière en AFN (automate fini non-déterministe) et suivre les transitions d’état dans l’AFN à l’aide d’un algorithme exécuté en un temps linéaire de la taille de la chaîne que l’on souhaite faire correspondre.

L’article de Thompson n’évoque pas l’AFN mais l’algorithme de temps linéaire est clairement expliqué. Il présente également un programme ALGOL-60 qui génère du code de langage assembleur pour l’IBM 7094. La réalisation paraît ésotérique mais l’idée présentée ne l’est pas.

Voici à quoi ressemblerait l’expression régulière .*.*=.* si elle était schématisée d’une manière similaire aux images de l’article de Thompson.

Le schéma 0 contient 5 états commençant à 0. Il y a trois boucles qui commencent avec les états 1, 2 et 3. Ces trois boucles correspondent aux trois .* dans l’expression régulière. Les trois pastilles avec des points à l’intérieur correspondent à un caractère unique. La pastille avec le signe = à l’intérieur correspond au signe littéral =. L’état 4 est l’état final : s’il est atteint, l’expression régulière a trouvé une correspondance.

Pour voir comment un tel schéma d’état peut être utilisé pour associer l’expression régulière .*.*=.*, nous allons examiner la correspondance de la chaîne x=x. Le programme commence à l’état 0 comme indiqué dans le schéma 1.

La clé pour faire fonctionner cet algorithme est que la machine d’état soit dans plusieurs états au même moment. L’AFN prendra autant de transitions que possible simultanément.

Avant de lire une entrée, il passe immédiatement aux deux états 1 et 2 comme indiqué dans le schéma 2.

En observant le schéma 2, on peut voir ce qui s’est passé lorsqu’il considère le premier x dans x=x. Le x peut correspondre au point supérieur en faisant une transition de l’état 1 et en retournant à l’état 1. Ou le x peut correspondre au point inférieur en faisant une transition de l’état 2 et en retournant à l’état 2.

Après avoir associé le premier x dans x=x, les états restent 1 et 2. On ne peut pas atteindre l’état 3 ou 4 car il faut un signe littéral =.

L’algorithme considère ensuite le = dans x=x. Comme le x avant lui, il peut être associé par l’une des deux boucles supérieures faisant une transition de l’état 1 à l’état 1 ou de l’état 2 à l’état 2. En outre, le signe littéral = peut être associé et l’algorithme peut faire une transition de l’état 2 à l’état 3 (et directement à l’état 4). C’est illustré dans le schéma 3.

Ensuite, l’algorithme atteint le x final dans x=x. Depuis les états 1 et 2, les mêmes transitions sont possibles vers les états 1 et 2. Depuis l’état 3, le x peut correspondre au point sur la droite et retourner à l’état 3.

À ce niveau, chaque caractère de x=x a été considéré, et puisque l’état 4 a été atteint, l’expression régulière correspond à cette chaîne. Chaque caractère a été traité une fois, l’algorithme était donc linéaire avec la longueur de la chaîne saisie. Aucun retour en arrière nécessaire.

Cela peut paraître évident qu’une fois l’état 4 atteint (après la correspondance de x=), l’expression régulière correspond et l’algorithme peut terminer sans prendre en compte le x final.

Cet algorithme est linéaire avec la taille de son entrée.

Categories: Technology

关于 2019 年 7 月 2 日 Cloudflare 中断的详情

Fri, 12/07/2019 - 16:45

大约九年前,Cloudflare 还是一家小公司,我也还是客户,而不是员工。当时,Cloudflare 早在一个月前就已发布了,有一天警报消息显示,这个小网站似乎不再支持 DNS 了。Cloudflare 实施了一项对 Protocol Buffers 使用的改动,这破坏了 DNS。

我直接给 Matthew Prince 写了一封题为“我的 DNS 在哪儿?”的邮件,他回复了一封篇幅很长、内容详实的技术性解答邮件(您可以点击此处查看往来邮件的全部内容),我对该邮件的回复是:

发件人:John Graham-Cumming 日期:2010 年 10 月 7 日星期四上午 9:14 主题:回复:我的 DNS 在哪儿? 收件人:Matthew Prince 谢谢,这是一篇很棒的报告。如果有问题,我一定会去电 问询。 就某种程度而言,在掌握了所有技术细节后, 将它们撰写为一篇博客文章可能会更好,因为我认为 读者会非常感谢博主对这些信息的坦诚公开。 这一点在您看到文章发布后流量增加的图表时, 会感触更深。 我在密切监控着网站,以便在出现任何故障时能够 收到短信通知。 监控显示,我的网站在 13:03:07 至 14:04:12 期间流量下降。 我会每五分钟测试一次。 这只是个小插曲,我相信您会解决这个问题。 但您确定您不需要 有人在欧洲为您分忧吗?:-)


发件人:Matthew Prince 日期:2010 年 10 月 7 日星期四上午 9:57 主题:回复:我的 DNS 在哪儿? 收件人:John Graham-Cumming 谢谢。我们已经回复了所有来信。我现在要去办公室, 我们会在博客上发布些信息,或在我们的公告栏系统中 置顶一篇官方帖文。我同意 100% 透明度是最好的。

因此,今天,作为规模远胜以往的 Cloudflare 公司的一员,我要写一篇文章,清楚讲述我们所犯的错误、它的影响以及我们正在为此采取的行动。

7 月 2 日事件

7 月 2 日,我们在 WAF 托管规则中部署了一项新规则,导致全球 Cloudflare 网络上负责处理 HTTP/HTTPS 流量的各 CPU 核心上的 CPU 耗尽。我们在不断改进 WAF 托管规则,以应对新的漏洞和威胁。例如,我们在 5 月份以更新 WAF 的速度出台了一项规则,以防范严重的 SharePoint 漏洞。能够快速地全局部署规则是 WAF 的一个重要特征。

遗憾的是,上周二的更新中包含了一个规则表达式,它在极大程度上回溯并耗尽了用于 HTTP/HTTPS 服务的 CPU。这降低了 Cloudflare 的核心代理、CDN 和 WAF 功能。下图显示了专用于服务 HTTP/HTTPS 流量的 CPU,在我们网络中的服务器上,这些 CPU 的使用率几乎达到了 100%。

事件发生期间某个 PoP 的 CPU 利用率

这导致我们的客户(以及他们的客户)在访问任何 Cloudflare 域时都会看到 502 错误页面。502 错误是由前端 Cloudflare Web 服务器生成的,这些服务器仍有可用的 CPU 内核,但无法访问服务 HTTP/HTTPS 流量的进程。



CPU 耗尽是由一个 WAF 规则引起的,该规则里包含不严谨的正则表达式,最终导致了过多的回溯。作为中断核心诱因的正则表达式是 (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

尽管正则表达式本身成为很多人关注的焦点(下文将进行详细讨论),但 Cloudflare 服务中断 27 分钟的真实情况要比“正则表达式出错”复杂得多。我们已经花时间写下了导致中断并使我们无法快速响应的一系列事件。如果您想了解更多关于正则表达式回溯以及如何处理该问题的信息,可在本文末尾的附录中查找。


我们按事情发生的先后次序讲述。本博客中的所有时间均为协调世界时 (UTC)。

在 13:42,防火墙团队的一名工程师通过一个自动过程对 XSS 检测规则进行了微小改动。这生成了变更请求票证。我们使用 Jira 管理这些票证,下面是截图。

三分钟后,第一个 PagerDuty 页面出现,显示 WAF 故障。这是一项综合测试,从 Cloudflare 外部检查 WAF 的功能(我们会进行数百个此类测试),以确保其正常工作。紧接着出现了多个页面,显示许多其他的 Cloudflare 服务端到端测试失败、全球流量下降警报、众多的 502 错误,之后便是我们在全球各城市的网点 (PoP) 发来的大量指示 CPU 耗尽的报告。

我收到了其中部分警告并立马起身走出会议室,而正在我回到办公桌的途中,解决方案工程师团队的一名负责人告诉我,我们的流量已经减少了 80%。我跑向 SRE 团队,他们正在排除故障。在中断的最初时刻,有人猜测这是某种我们从未见过的攻击。

Cloudflare 的 SRE 团队成员分布在世界各地,他们全天持续监控着网络。绝大多数此类警报都指出了局部区域有限范围内的非常具体的问题,这些警报均在内部仪表板中监控,并且每天会进行多次处理。但这种页面和警报模式表明发生了严重问题,SRE 立即宣布发生 P0 事件,并上报给工程领导层和系统工程部门。

当时,伦敦工程团队正在我们的主要活动场地听取一场内部技术讲座。讲座被打断,所有人都聚集在大型会议室中,商讨问题或是接打电话。这不是 SRE 能够独立处理的一般问题,它需要所有相关团队立即在线联合处理。

在 14:00,WAF 被确定为导致问题的部分原因,并排除了攻击的可能性。性能团队从一台清楚表明 WAF 为罪魁祸首的机器中获取了实时 CPU 数据。另一名团队成员用 strace 进行了确认。还有一个团队找到了指示 WAF 出现问题的错误日志。在 14:02,有人提议使用“全球终止”,这是 Cloudflare 的一种内置机制,可以在全球范围内禁用单个组件,整个团队都在等我做决定。

但进行全球 WAF 终止是另一回事。我们陷入了两难境地。我们使用自己的产品,并且由于我们的 Access 服务停机,我们无法向内部控制面板进行身份验证(而一旦我们返回,就会发现由于安全功能 - 如果不经常使用内部控制面板就会禁用凭据,团队中的一些成员失去了访问权限)。

此外,我们也无法使用其他内部服务,如 Jira 或构建系统。如果要使用这些服务,我们就必须使用一种不常用的旁路机制(在事件发生后继续钻研)。最终,一名团队成员在 14:07 执行了全球 WAF 终止,到 14:09,流量渐趋平缓,CPU 恢复到全球预期水平。Cloudflare 的其他保护机制继续运行。

之后,我们继续恢复 WAF 功能。由于情况的敏感性,我们在将付费客户的流量从该位置移除后,使用一部分流量在单个城市中执行了负面测试(扪心自问“真的是那个改动导致了问题?”)和正面测试(验证回滚是否正常工作)。

在 14:52,我们心中的大石终于落地,我们找出了根由并进行了修复,WAF 在全球重新启用。

Cloudflare 的运行原理

Cloudflare 拥有一支工程师团队,负责我们的 WAF 托管规则产品;他们不断努力提高检出率,降低误报率,并对新出现的威胁迅速做出反应。在过去的 60 天里,他们已经为 WAF 托管规则处理了 476 个更改请求(平均每 3 小时处理一个)。

这项特定更改将部署在“模拟”模式中,而在这一模式下,真实的客户流量将通过规则输送,但不会阻止任何内容。我们使用该模型测试规则的有效性,并测量其误报率和漏报率。但即使在模拟模式下,也需要实际执行规则,而在此情况中,规则会包含消耗过多 CPU 的正则表达式。

从上面的变更请求中可以看出,对于这种类型的部署,我们提供了部署计划、回滚计划以及内部标准操作程序 (SOP) 的链接。规则更改的 SOP 特别允许将其推向全球。这与我们在 Cloudflare 发布的所有软件都非常不同,在Cloudflare,SOP 首先将软件推送到内部测试网点 (PoP)(我们的员工会通过这个网点),然后推送到隔离区的少量用户,之后再推送到大量客户,最终推向全球。

软件发布的过程就好比:我们通过 BitBucket 在内部使用 git。处理更改的工程师推送由 TeamCity 构建的代码,当构建通过时,将分配审阅者。在拉取请求获得批准后,代码即构建完成,测试套件(再次)运行。

如果构建和测试通过,则生成变更请求 Jira,并且变更必须得到相关经理或技术主管的批准。批准后,即发生向我们所谓的“动物 PoP”的部署:DOG、PIG 和 Canaries。

DOG PoP 是 Cloudflare PoP(就像我们在全球的任何城市一样),但它只供 Cloudflare 员工使用。这种内部测试 PoP 使我们能够在任何客户流量接触代码之前及早发现问题。而且这种情况经常发生。

如果 DOG 测试成功通过,代码将转到 PIG(如同在“Guinea Pig”中)。这是 Cloudflare PoP,其中非付费客户的一小部分客户流量通过新代码。

如果成功,代码将移动到 Canaries。我们在全球分布有三个 Canary PoP,它们在新代码上运行付费和非付费客户流量,以作为对错误的最终检查。

Cloudflare 软件发布流程

一旦在 Canary 中成功,代码即可上线。根据代码更改的类型,整个 DOG、PIG、Canary、全球流程可能需要数小时或数天才能完成。Cloudflare 网络和客户的多样性使我们能够在向全球所有客户发布软件之前彻底测试代码。但根据设计,WAF 没有使用此流程,因为需要对威胁做出快速响应。

WAF 威胁



常见的是创建概念证明 (PoC),并经常在 Github 上快速发布,这样运行和维护应用程序的团队就可以进行测试,以确保它们有足够的保护。正因为如此,Cloudflare 必须能够尽快对新的攻击做出响应,以便让我们的客户有机会修补其软件。

Cloudflare 在 5 月份部署了针对 SharePoint 漏洞的保护措施就是 Cloudflare 主动提供此类保护的很好示例(点击此处查看博客)。在发布公告后的短时间内,我们看到,利用客户的 Sharepoint 安装的企图激增。我们的团队持续监控新威胁,并代表客户编写规则以缓解威胁。

导致上周二中断的具体规则是针对跨站点脚本 (XSS) 攻击的规则。近年来,这些攻击也急剧增多。


WAF 托管规则更改的标准程序表明,在全局部署之前必须通过持续集成 (CI) 测试。上周二,这种情况正常发生,规则也得以实施。在 13:31,团队中的一名工程师在获得批准后合并了一个包含更改的拉取请求。

在 13:37,TeamCity 制定了规则并运行测试,最终允许合并。WAF 测试套件测试 WAF 的核心功能是否有效,其中包含大量针对单个匹配功能的单元测试。在运行单元测试后,通过对 WAF 执行大量 HTTP 请求以测试各个 WAF 规则。这些 HTTP 请求旨在测试应被 WAF 阻止的请求(以确保捕获攻击)和应被允许通过的请求(以确保不会过度阻止并产生误报)。它们无法做到的是测试 WAF 的失控 CPU 利用率以及检查之前 WAF 版本中的日志文件,结果显示使用最终导致 CPU 耗尽的规则,并未使测试套件运行时间增加。

测试通过后,TeamCity 在 13:42 自动开始部署更改。


由于需要 WAF 规则以处理紧急威胁,因此我们使用 Quicksilver 分布式键值 (KV) 存储来部署规则,从而能够在几秒钟内向全球推送更改。我们的所有客户在仪表板中或通过 API 进行配置更改时都会用到这项技术,而它也是我们的服务能够非常快速地响应变更的有力支撑。

我们还没有详细探讨过 Quicksilver。我们之前使用 Kyoto Tycoon 作为全球分布式键值存储,但在使用它时遇到了操作问题,随后,我们编写了自己的 KV 存储并在 180 多个城市进行复制。我们通过 Quicksilver 向客户配置推送更改、更新 WAF 规则并分发客户使用 Cloudflare Workers 编写的 JavaScript 代码。

从单击仪表板上的按钮或调用 API 以更改配置,到该更改在全球范围内生效,只需要几秒钟。客户已经开始喜欢这种高速可配置性。而借助 Workers,客户有望实现近乎即时的全球软件部署。Quicksilver 平均每秒可分发约 350 个更改。

Quicksilver 的速度非常快。 平均而言,我们将一项更改发布到全球每台机器的 p99 达到了 2.29 秒。通常情况下,这种速度代表着极大的突破。这意味着,当您启用某项功能或清除缓存时,它会立刻在全球同步执行。而在您使用 Cloudflare Workers 推送代码时,它会以同样的速度推送出去。这是 Cloudflare 的承诺 - 在您需要时快速执行更新。

然而,在此情况下,这种速度意味着对规则的更改在几秒钟内就会传遍全球。您可能会注意到 WAF 代码使用的是 Lua。Cloudflare 在生产中广泛使用 Lua,之前我们已经讨论过 WAF 中 Lua 的相关详情。Lua WAF 在内部使用 PCRE,它使用回溯进行匹配,并且无任何机制来防范失控表达式。我们将在下文就此点以及我们采取的措施做更多介绍。

在部署规则之前发生的一切都是“正确的”:提出拉取请求,请求获得批准,CI/CD 构建代码并对其进行测试,提交带有 SOP 推广和回滚详情的变更请求,以及执行推广。

Cloudflare WAF 部署流程


如前所述,我们每周都会向 WAF 部署几十条新规则,并且我们有大量系统可以防止此类部署产生任何负面影响。因此,在出现错误时,通常不会是多个原因导致。虽然大家都会满足于找出一个根本原因,但这可能会掩盖真相。以下是导致用于 HTTP/HTTPS 的 Cloudflare 服务离线的多个漏洞。

  • 工程师写下极易引起大量回溯的正则表达式。
  • 在几周前对 WAF 进行重构(旨在使 WAF 使用更少的 CPU)时,一个有助于防止正则表达式过度使用 CPU 的保护被错误地删除。
  • 正在使用的正则表达式引擎没有复杂性保证。
  • 测试套件没有办法识别过多的 CPU 消耗。
  • SOP 允许非紧急规则变更在全球投入生产,而无需分阶段部署。
  • 回滚计划要求运行完整的 WAF 构建两次,耗时太长。
  • 第一个全球流量下降警报花了很长时间才发出。
  • 我们更新状态页面的速度不够快。
  • 由于未经过充分的中断和旁路程序培训,我们很难访问自己的系统。
  • SRE 失去了对部分系统的访问权限,原因是出于安全考虑,他们的凭据已超时。
  • 我们的客户无法访问 Cloudflare 仪表板或 API,因为他们要通过 Cloudflare edge。

首先,我们全面停止了对 WAF 的所有发布工作并着手:

  • 重新引入已删除的过度 CPU 使用保护。(完成)
  • 手动检查 WAF 托管规则中的所有 3,868 条规则,以发现并纠正任何其他可能过度回溯的实例。(检查完毕)
  • 在测试套件中引入对所有规则的性能分析。(预计完成时间: 7 月 19 日)
  • 切换到具有运行时保证的 re2 或 Rust 正则表达式引擎。(预计完成时间:7 月 31 日)
  • 更改 SOP 以与 Cloudflare 的其他软件相同的方式执行规则的阶段性推出,同时保留对主动攻击执行紧急全局部署的能力。
  • 建立应急功能,使 Cloudflare 仪表板和 API 脱离Cloudflare edge。
  • 自动更新 Cloudflare 状态页面。

从长远来看,我们正在远离我多年前编写的 Lua WAF,并在移植 WAF 以使用新的防火墙引擎。这将使 WAF 速度更快,并为其添加另一层保护。





若要充分理解  (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) 如何导致 CPU 耗尽,您需要先了解一些标准正则表达式引擎的工作原理。关键部分是 .*(?:.*=.*)。(?: and matching ) 是非捕获组(也就是说,括号内的表达式被组合成一个单独的表达式)。

为方便讨论这种模式为何会导致 CPU 耗尽,我们完全可以忽略它,并将该模式视为 .*.*=.*。在将这部分减去之后,此模式明显看起来过于复杂;但重要的是,任何要求引擎“匹配 xx 后接 xx”的“真实世界”表达式(就像 WAF 规则中的复杂表达式一样)都可能导致灾难性的回溯。这就是原因。

在正则表达式中,. 表示匹配单个字符,.*  表示可贪婪匹配零个或多个字符(即尽可能匹配),因此 .*.*=.* 表示匹配零个或多个字符,之后匹配零个或多个字符,再之后找到文本 = 符号,然后再匹配零个或多个字符。

考虑测试字符串 x=x。这将匹配表达式 .*.*=.*。等号前的 .*.* 可匹配第一个 x(其中一个 .* 匹配 x,另一个匹配零个字符)。= 后的 .* 匹配最后的 x。

该匹配需要 23 步才能完成。.*.*=.* 中的第一个 .* 执行贪婪匹配,匹配整个 x=x字符串。引擎继续考虑下一个 .*。现在没有可供匹配的字符了,因此第二个 .* 匹配零个字符(这是允许的)。之后,引擎转移到 =。由于已经没有可供匹配的字符串(第一个 .* 消耗了所有 x=x),匹配失败。

此时,正则表达式引擎将回溯。引擎回到第一个 .*,将其与 x=(而不是 x=x)匹配,然后移到第二个 .*。该 .* 匹配第二个 x,现在没有可供匹配的字符了。因此,引擎尝试匹配 .*.*=.* 中的 =,匹配失败。引擎再次回溯。

这次引擎回溯使第一个 .* 仍匹配 x=,但第二个 .* 不再匹配 x,而是匹配零个字符。然后,引擎继续尝试寻找 .*.*=.* 模式中的文本  =,但失败(因为它已经与第一个 .* 匹配)。引擎再次回溯。

这次第一个 .* 只匹配第一个 x。但第二个 .* 执行贪婪匹配,匹配 =x。您可以看到接下来会发生什么。引擎尝试匹配文本 = 时失败,再次回溯。

第一个 .* 依旧只匹配第一个 x。现在,第二个 .* 只匹配 =。但您猜对了,引擎无法匹配文本 =,因为第二个 .* 已经与它匹配。因此,引擎再次回溯。请记住,这一切都是为了匹配一个拥有三个字符的字符串。

最后,只有第一个 .* 仅匹配第一个 x,第二个 .* 匹配零个字符,引擎才能将表达式中的文本 = 与字符串中的 = 匹配。引擎继续,最末的 .* 匹配最后的 x。

综上,通过这 23 步完成了 x=x 匹配。下面是一段使用 Perl Regexp::Debugger 的短视频,展示了这些步骤以及步骤发生时的回溯。

这会产生大量工作,但如果将字符串从 x=x 更改为 x=xx 又会发生什么?这会需要 33 步才能完成匹配。而且如果更改为 x=xxx,步骤数会增加到 45。这并不是线性的。下面的图表显示了从 x=x 到  x=xxxxxxxxxxxxxxxxxxxx (=后面有 20 个 x)的匹配步骤数变化。如果 = 后是 20 个 x,引擎需要执行 555 步才能完成匹配!(更糟的是,如果 x= 缺失,那么字符串就只有 20 个 x,引擎需要执行 4,067 步才能发现模式不匹配)。

此视频显示了匹配 x=xxxxxxxxxxxxxxxxxxxx 所需的所有回溯:

这很糟糕,因为随着输入大小的增加,匹配时间会呈超线性增长。但如果使用稍微不同的正则表达式,情况可能会更糟。假设表达式为 .*.*=.*;(即模式的末尾有一个分号)。这很容易编写,以尝试匹配类似 foo=bar; 的表达式。

而这次,回溯将是灾难性的。匹配 x=x 需要 90 步,而不是 23 步。步骤数增长得非常快。匹配 x= 后跟 20 个 x,需要 5,353 步。下面是相应的图表。仔细看 Y 轴的数值,与前一个图表进行比较。

以下是图中没有显示的将 x=xxxxxxxxxxxxxxxxxxxx 与 .*.*=.*; 匹配失败的所有 5,353 步。

在这种情况下,使用惰性匹配而不是贪婪匹配有助于控制回溯的数量。如果将原始表达式更改为 .*?.*?=.*?,则匹配 x=x 需要 11 步(而不是 23 步),匹配 x=xxxxxxxxxxxxxxxxxxxx 也是如此。这是因为 .* 后的 ? 指示引擎在继续之前首先匹配最小的字符数。

但惰性匹配并不是这种回溯行为的全面解决方案。将灾难性实例 .*.*=.*; 更改为 .*?.*?=.*?; 根本不会改变它的运行时间。匹配 x=x 仍需要 555 步,匹配 x= 后跟 20 个 x 也仍需要 5,353 步。


自 1968 年 Ken Thompson 写了一篇名为“编程技术:正则表达式搜索算法”(Programming Techniques:Regular expression search algorithm) 的论文以来,这一问题的解决方案早就广为人知。这篇论文介绍了一种机制,它可以将正则表达式转换为非确定性有限状态自动机 (NFA),然后使用一种按匹配字符串大小的时间线性执行的算法,跟踪 NFA 中的状态转换。

Thompson 的论文实际上没有讨论 NFA,但明确解释了线性时间算法,并且提出了为 IBM 7094 生成汇编语言代码的 ALGOL-60 程序。具体的实施步骤可能晦涩难解,但其中呈现的思路却清晰明了。

下面是以类似于 Thompson 论文中图片的方式用图解法表示的 .*.*=.* 正则表达式。

图 0 有五种从 0 开始的状态。其中有三个循环以状态 1、2 和 3 开始。这三个循环对应正则表达式中的三个 .*。三个带点的菱形匹配一个字符。带有 = 符号的菱形与文本 = 符号匹配。状态 4 是结束状态,如果到达此状态,则表示正则表达式已匹配。

若要了解如何使用此类状态图以匹配正则表达式 .*.*=.*,我们需要检查匹配字符串 x=x。程序从状态 0 开始,如图 1 所示。

使这种算法有效的关键是状态机同时处于多个状态。NFA 将同时进行每个转换。

甚至在读取任何输入之前,它就会立即同时转换到状态 1 和状态 2,如图 2 所示。

在图 2 中,我们可以看到当 NFA 考虑 x=x 中的第一个 x 时会发生什么。x 可以通过在状态 1 中来回转换以匹配顶点。或者 x 可以通过在状态 2 中来回转换以匹配其下面的点。

因此,在匹配了 x=x 中的第一个 x 后,状态仍是 1 和 2。由于需要文本 = 符号,因而不可能到达状态 3 或 4。

接下来,算法考虑 x=x 中的 =。与它前面的 x 非常相似,它可以通过顶部的两个循环在状态 1 或状态 2 中来回转换以进行匹配;此外,文本 = 也可以匹配并且算法可以将状态 2 转换到状态 3(并立即转换到状态 4)。如图 3 所示。

接下来,算法到达 x=x 中的最后一个 x 。从状态 1 和状态 2,同样可以转换回状态 1 和状态 2。从状态 3,x 可以匹配右边的点并转换回状态 3。

此时,已经考虑了 x=x 的每个字符,并且因为已到达状态 4,所以正则表达式匹配该字符串。每个字符都经过了一次处理,因此算法在输入字符串的长度上是线性的。并且不需要回溯。

同样显而易见的是,一旦到达状态 4(在匹配了 x= 之后),正则表达式便已匹配,并且算法可以在完全不考虑最末 x 的情况下终止。


Categories: Technology

The Network is the Computer: A Conversation with John Gage

Thu, 11/07/2019 - 14:03
 A Conversation with John Gage A Conversation with John Gage

To learn more about the origins of The Network is the Computer®, I spoke with John Gage, the creator of the phrase and the 21st employee of Sun Microsystems. John had a key role in shaping the vision of Sun and had a lot to share about his vision for the future. Listen to our conversation here and read the full transcript below.


John Graham-Cumming: I’m talking to John Gage who was what, the 21st employee of Sun Microsystems, which is what Wikipedia claims and it also claims that you created this phrase “The Network is the Computer,” and that's actually one of the things I want to talk about with you a little bit because I remember when I was in Silicon Valley seeing that slogan plastered about the place and not quite understanding what it meant. So do you want to tell me what you meant by it or what Sun meant by it at the time?


John Gage: Well, in 2019, recalling what it meant in 1982 or 83’ will be colored by all our experience since then but at the time it seemed so obvious that when we introduced the first scientific workstations, they were not very powerful computers. The first Suns had a giant screen and they were on the Internet but they were designed as a complementary component to supercomputers. Bill Joy and I had a series of diagrams for talks we’d give, and Bill had the bi-modal, the two node picture. The serious computing occurred on the giant machines where you could fly into the heart of a black hole and the human interface was the workstation across the network. So each had to complement the other, each built on the strengths of the other, and each enhanced the other because to deal in those days with a supercomputer was very ugly. And to run all your very large computations, you could run them on a Sun because we had virtual memory and series of such advanced things but not fast. So the speed of scientific understanding is deeply affected by the tools the scientist has — is it a microscope, is it an optical telescope, is it a view into the heart of a star by running a simulation on a supercomputer? You need to have the loop with the human and the science constantly interacting and constantly modifying each other, and that’s what the network is for, to tie those different nodes together in as seamless a way as possible. Then, the instant anyone that’s ever created a programming language says, “so if I have to create a syntax of this where I’m trying to let you express, do this, how about the delay on the network, the latency”? Does your phrase “The Network is the Computer” really capture this hundreds, thousands, tens of thousands, millions perhaps at that time, now billions and billions and billions today, all these devices interacting and exchanging state with latency, with delay. It’s sort of an oversimplification, and that we would point out, but it’s just network is the computer. Four words, you know, what we tried to do is give a metaphor that allows you to explore it in your mind and think of new things to do and be inspired.


Graham-Cumming: And then by a sort of strange sequence of events, that was a trademark of Sun. It got abandoned. And now Cloudflare has swooped in and trademarked it again. So now it's our trademark which sort of brings us full circle, I suppose.


Gage: Well, trademarks are dealing with the real world, but the inspiration of Cloudflare is to do exactly what Bill Joy and I were talking about in 1982. It's to build an environment in which every participant globally can share with security, and we were not as strong. Bill wrote most of the code of TCP/IP implemented by every other computer vendor, and still these questions of latency, these questions of distributed denial of service which was, how do you block that? I was so happy to see that Cloudflare invests real money and real people in addressing those kinds of critical problems, which are at the core, what will destroy the Internet.


Graham-Cumming: Yes, I agree. I mean, it is a significant investment to actually deal with it and what I think people don't appreciate about the DDoS attack situation is that they are going on all the time and it's just a continuous, you know, just depends who the target is. It's funny you mentioned TCP/IP because about 10 years after, so in about ‘92, my first real job, I had to write a TCP/IP stack for an obscure network card. And this was prior to the Internet really being available everywhere. And so I didn't realize I could go and get the BSD implementation and recompile it. So I did it from scratch from the RFCs.


Gage: You did!


Graham-Cumming: And the thing I recommend here is that nobody ever does that because, you know, the real world, real code that really interacts is really hard when you're trying to work it with other things, so.


Gage: Do you still, John, do you have that code?


Graham-Cumming: I wonder. I have the binary for it.


Gage: Do hunt for it, because our story was at the time DARPA, the Defense Advanced Research Projects Agency, that had funded networking initiatives around the world. I just had a discussion yesterday with Norway and they were one of the first entities to implement using essentially Bill Joy’s code, but to be placed on the ARPANET. And a challenge went out, and at that time the slightly older generation, the Bolt Beranek and Newman Group, Vint Cerf, Bob Con, those names, as Vint Cerf was a grad student at UCLA where he had built one of the four first Internet sites and the DARPA offices were in Arlington, Virginia, they had massive investments in detection of nuclear underground tests, so seismological data, and the moment we made the very first Suns, I shipped them to DARPA, we got the network up and began serving seismic data globally. Really lovely visualization of events. If you’re trying to detect something, those things go off and then there’s a distinctive signature, a collapse of the underground cavern after. So DARPA had tried to implement, as you did, from the spec, from the RFC, the components, and Vint had designed a lot of this, all the acknowledgement codes and so forth that you had to implement in TCP/IP. So Bill, as a graduate student at Berkeley, we had a meeting in Arlington at DARPA headquarters where BBN and AT&T Bell Labs and a number of other people were in the room. Their code didn’t work, this graduate student from Berkeley named Bill Joy, his code did work, and when Bob Kahn and Vint Cerf asked Bill, “Well, so how did you do it?” What he said was exactly what you just said, he said, “I just read the spec and wrote the code.”


Graham-Cumming: I do remember very distinctly because the company I was working at didn’t have a TCP/IP stack and we didn’t have any IP machines, right, we were doing actually stuff that was all IBM networking, SMA stuff. Somehow we bought what was at that point a HP machine, it was an Apollo workstation and a Sun workstation. I had them on Ethernet and talking to each other. And I do distinctly remember the first time a ping packet came back from that Sun box, saying, yes I managed to send you an IP packet, you managed to send me ICMP response and that was pretty magical. And then I got to TCP and that was hard.


Gage: That was hard. Yeah. When you get down to the details, the spec can be wrong. I mean, it will want you to do something that’s a stupid thing to do. So Bill has such good taste in these things. It would be interesting to do a kind of a diff across the various implementations of the stack. Years and years later we had maybe 50 companies all assemble in a room, only engineers, throw out all the marketing people and all the Ps and VPs and every company in this room—IBM, Hewlett-Packard—oh my God, Hewlett-Packard, fix your TCP—and we just kept going until everybody could work with everybody else in sort of a pact. We’re not going to reveal, Honeywell, that you guys were great with earlier absolute assembly code, determinate time control stuff but you have no clue about how packets work, we’ll help you, so that all of us can make every machine interoperate, which yielded the network show, Interop. Every year we would go put a bunch of fiber inside whatever, you know, Geneva, or pick some, Las Vegas, some big venue.


Graham-Cumming: I used to go to Vegas all the time and that was my great introduction to Vegas was going there for Interop, year after year.


Gage: Oh, you did! Oh, great.


Graham-Cumming: Yes, yes, yes.


Gage: You know in a way, what you’re doing with, for example, just last week with the Verizon problem, everybody implementing what you’re doing now that is not open about their mistakes and what they’ve learned and is not sharing this, it’s a problem. And your global presence to me is another absolutely critical thing. We had about, I forget, 600 engineers in Beijing at the East Gate of Tsinghua a lot of networking expertise and lots of those people are at Tencent and Huawei and those network providers throughout the rest of the world, politics comes and goes but the engineering has to be done in a way that protects us. And so these conversations globally are critical.


Graham-Cumming: Yes, that's one of the things that’s fascinating actually about doing real things on the real Internet is there is a global community of people making computers talk to each other and you know, that it's a tremendously complicated thing to actually make that work, and you do it across countries, across languages. But you end up actually making them work, and that's the Internet we're sitting on, that you and I are talking on right now that is based on those conversations around the world.


Gage: And only by doing it do you understand more deeply how to do it. It’s very difficult in the abstract to say what should happen as we begin to spread. As Sun grew, every major city in Africa had installations and for network access, you were totally dependent on an often very corrupt national telco or the complications dealing with these people just to make your packet smooth. And as it turned out, many of the intelligence and military entities in all of these countries had very little understanding of any of this. That’s changed to some degree. But the dangerous sides of the Internet. Total surveillance, IPv6, complete control of exact identity of origins of packets. We implemented, let’s see, you had an early Sun. We probably completed our IPv6 implementation, was it still fluid in the 90s, but I remember 10 years after we finished a complete implementation of IPv6, the U.S. was still IPv4, it’s still IPv4.


Graham-Cumming: It still is, it still is. Pretty much. Except for the mobile carriers right now. I think in general the mobile phone operators are the ones who've gone more into IPv6 than anybody else.


Gage: It was remarkable in China. We used to have a conference. We’d bring a thousand Chinese universities into a room. Professor Wu from Tsinghua who built the Chinese Education and Research Network, CERNET. And now a thousand universities have a building on campus doing Internet research. We would get up and show this map of China and he kept his head down politically, but he managed at the point when there was a big fight between the Minister of Telecom and the Minister of Railways. The Minister of Railways said, look, I have continuity throughout China because I have railines. I’ve just made a partnership with the People’s Liberation Army, and they are essentially slave labor, and they’re going to dig the ditches, and I’m going to run fiber alongside the railways and I don’t care what you, the Minister of Telecommunications, has to say about it, because I own the territory. And that created a separate pathway for the backbone IPv6 network in China. Cheap, cheap, cheap, get everybody doing things.


Graham-Cumming: Yes, now of course in China that’s resulted in an interesting situation where you have China Telecom and China Unicom, who sort of cooperate with each other but they’re almost rivals which makes IP packets quite difficult to route inside China.


Gage: Yes exactly. At one point I think we had four hunks of China. Everyone was geographically divided. You know there were meetings going on, I remember the moment they merged the telecom ministry with the electronics ministry and since we were working with both of them, I walk in a room and there’s a third group, people I didn’t know, it turns out that’s the People’s Liberation Army.


Graham-Cumming: Yes, they’re part of the team. So okay, going back to this “Network is the Computer” notion. So you were talking about the initial things that you were doing around that, why is it that it's okay that Cloudflare has gone out and trademarked that phrase now, because you seem to think that we've got a leg to stand on, I guess.


Gage: Frankly, I’d only vaguely heard of Cloudflare. I’ve been working in areas, I’ve got a project in the middle of Nairobi in the slum where I’ve spent the last 15 years or so learning a lot about clean water and sewage treatment because we have almost 400,000 people in a very small area, biggest slum in East Africa. How can you introduce sanitary water and clean sewage treatment into a very, an often corrupt, a very difficult environment, and so that’s been a fascination of mine and I’ve been spending a lot of time. What's a computer person know about fluid dynamics and pathogens? There’s a lot to learn. So as you guys grew so rapidly, I vaguely knew of you but until I started reading your blog about post-quantum crypto and how do we devise a network in these resilient denial of service attacks and all these areas where you’re a growing company, it’s very hard to take time to do serious advanced research-level work on distributed computing and distributed security, and yet you guys are doing it. When Bill created Java, the subsequent step from Java for billions and billions of devices to share resources and share computations was something we call Genie which is a framework for validation of who you are, movement of code from device to device in a secure way, total memory control so that someone is not capable of taking over memory in your device as we’ve seen with Spectre and the failures of these billions of Intel chips out there that all have a flaw on take all branches parallel compute implementations. So the very hardware you’re using can be insecure so your operating systems are insecure, the hardware is insecure, and yet you’re trying to build on top with fallible pieces in infallible systems. And you’re in the middle of this, John, which I’m so impressed by.


Graham-Cumming: And Jini sort of lives on as called Apache River now. It moved away from Sun and into an Apache project.


Gage: Yes, very few people seem to realize that the name Apache is a poetic phrasing of “a patchy system.” We patch everything because everything is broken. We moved a lot of it, Brian Behlendorf and the Apache group. Well, many of the innovations at Sun, Java is one, file systems that are far more secure and far more resilient than older file systems, the SPARC  implementation, I think the SPARC processor, even though you’re using the new ARM processors, but Fujitsu, I still think keeps the SPARC architecture as the world’s fastest microprocessor.  


Graham-Cumming: Right. Yes. Being British of course, ARM is a great British success. So I'm honor-bound to use that particular architecture. Clearly.


Gage: Oh, absolutely. And the power. That was the one always in a list of what our engineering goals are. We wanted to make, we were building supercomputers, we were building very large file servers for the telcos and the banks and the intelligence agencies and all these different people, but we always wanted to make a low power and it just fell off the list of what you could accomplish and the ARM chips, their ratios of wattage to packets treated are—you have a great metric on your website someplace about measuring these things at a very low level—that’s key.


Graham-Cumming: Yes, and we had Sophie Wilson, who of course is one of the founders of ARM and actually worked on the original chip, tell this wonderful story at our Internet Summit about how the first chip they hooked up was operating fine until they realized they hadn't hooked the power up and they were asked to. It was so low power that it was able to use the power that was coming in over the logic lines to actually power the whole chip. And they said to me, wait a minute, we haven't plugged the power in but the thing is running, which was really, I mean that was an amazing achievement to have done that.


Gage: That’s amazing. We open sourced SPARC, the instruction set, so that anybody doing crypto that also had Fab capabilities could implement detection of ones and zeroes, sheep and goats, or other kinds of algorithms that are necessary for very high speed crypto. And that’s another aspect that I’m so impressed by Cloudflare. Cloudflare is paying attention at a machine instruction level because you’re implementing with your own hardware packages in what, 180 cities? You’re moving logistically a package into Ulan Bator, or into Mombasa and you’re coming up live.


Graham-Cumming: And we need that to be inexpensive and fast because we're promising people that we will make their Internet properties faster and secure at the same time and that's one of the interesting challenges which is not trading those two things off. Which means your crypto better be fast, for example, and that requires a lot of fiddling around at the hardware level and understanding it. In our case because we're using Intel, really what Intel chips are doing at the low level.


Gage: Intel did implement a couple of things in one or another of the more recent chips that were very useful for crypto. We had a group of the SPARC engineers, probably 30, at a dinner five or six months ago discussing, yes, we set the world standard for parallel execution branching optimizations for pipelines and chips, and when the overall design is not matched by an implementation that pays attention to protecting the memory, it’s a fundamental, exploitable flaw. So a lot of discussion about this. Selecting precisely which instructions are the most important, the risk analysis with the ability to make a chip specifically to implement a particular algorithm, there’s a lot more to go. We have multiples of performance ahead of us for specific algorithms based on a more fluid way to add instructions that are necessary into a specific piece of hardware. And then we jump to quantum. Oh my.


Graham-Cumming: Yes. To talk about that a little bit, the ever-increasing speed of processors and the things we can do; Do you think we actually need that given that we're now living in this incredibly distributed world where we are actually now running very distributed algorithms and do we really need beefier machines?


Gage: At this moment, in a way, it’s you making fun of Bill Joy for only wanting a megabit in Aspen. When Steve Jobs started NeXT, sadly his hardware was just terrible, so we sent a group over to boost NeXT. In fact we sort of secretly slipped him $30 million to keep him afloat. And I’d say, “Jobs, if you really understood something about hardware, it would really be useful here.” So one of the main team members that we sent over to NeXT came to live in Aspen and ended up networking the entire valley. At a point, megabit for what you needed to do, seemed reasonable, so at this moment, as things become alive by the introduction of a little bit of intelligence in them, some little flickering chip that’s able to execute an algorithm, many tasks don’t require. If you really want to factor things fast, quantum, quantum. Which will destroy our existing crypto systems. But if you are just bringing the billions of places where a little bit of knowledge can alter locally a little bit of performance, we could do very well with the compute power that we have right now. But making it live on the network, securely, that’s the key part. The attacks that are going on, simple errors as you had yesterday, are simple errors. In a way, across Cloudflare’s network, you’re watching the challenges of the 21st century take place: attacks, obscure, unknown exploits of devices in the power and water control systems. And so, you are in exactly the right spot to not get much sleep and feel a heavy responsibility.


Graham-Cumming: Well it certainly felt like it yesterday when we were offline for 27 minutes, and that’s when we suddenly discovered, we sort of know how many customers we have, and then we really discover when they start phoning us. Our support line had his own DDoS basically where it didn’t work anymore because so many people signed in. But yes, I think that it's interesting your point about a little bit extra on a device somewhere can do something quite magical and then you link it up to the network and you can do a lot. What we think is going on partly is some things around AI, where large amounts of machine learning are happening on big beefy machines, perhaps in the cloud, perhaps groups of machines, and then devices are doing their own little bits of inference or recognizing faces and stuff like that. And that seems to be an interesting future where we have these devices that are actually intelligent in our pockets.


Gage: Oh, I think that’s exactly right. There’s so much power in your pocket. I’m spending a lot of time trying to catch up that little bit of mathematics that you thought you understood so many years ago and it turns out, oh my, I need a little bit of work here. And I’ve been reading Michael Jordan’s papers and watching his talks and he’s the most cited computer scientist in machine learning and he will always say, “Be very careful about the use of the phrase, ‘Artificial Intelligence’.” Maybe it’s a metaphor like “The Network is the Computer.” But, we’re doing gradient descent optimization. Is the slope going up, or is the slope going down? That’s not smart. It’s useful and the real time language translation and a lot of incredible work can occur when you’re doing phrases. There’s a lot of great pattern work you can do, but he’s out in space essentially combining differentiation and integration in a form of integral. And off we go. Are your hessians rippling in the wind? And what’s the shape of this slope? And is this actually the fastest path from here to there to constantly go downhill. Maybe it’s sometimes going uphill and going over and then downhill that’s faster. So there’s just a huge amount of new mathematics coming in this territory and each time, as we move from 2G to 3G to 4G to 5G, many people don’t appreciate that the compression algorithms changed between 2G, 3G, 4G and 5G and as a result, so much more can move into your mobile device for the same amount of power. 10 or 20 times more for the same about of power. And mathematics leads to insights and applications of it. And you have a working group in that area, I think. I tried to probe around to see if you’re hiring.


Graham-Cumming: Well you could always just come around to just ask us because we'll probably tell you because we tend to be fairly transparent. But yes, I mean compression is definitely an area where we are interested in doing things. One of the things I first worked on at Cloudflare was a thing that did differential compression based on the insight that web pages don’t actually change that much when you hit ‘refresh’. And so it turns out that if you if you compress based on the delta from the last thing you served to someone you can actually send many orders of magnitude less data and so there's lots of interesting things you can do with that kind of insight to save a tremendous amount of bandwidth. And so yeah, definitely compression is interesting, crypto is interesting to us. We’ve actually open sourced some of our compression improvements in zlib which was very popular compression algorithm and now it's been picked up. It turns out that in neuroscience, because there's a tremendous amount of data which needs compression and there are pipelines used in neuroscience where actually having better compression algorithms makes you work a lot faster. So it's fascinating to see the sort of overspill of things we’re doing into other areas where I know nothing about what goes on inside the brain.


Gage: Well isn’t that fascinating, John. I mean here you are, the CTO of Cloudflare working on a problem that deeply affects the Internet, enabling a lot more to move across the Internet in less time with less power, and suddenly it turns into a tool for brain modeling and neuroscientists. This is the benefit. There’s a terrific initiative. I’m at Berkeley. The Jupiter notebooks created by Fernando Perez, this environment in which you can write text and code and share things. That environment, taken up by machine learning. I think it’s a major change. And the implementation of diagrams that are causal. These forms of analysis of what caused what. These are useful across every discipline and for you to model traffic and see patterns emerge and find webpages and see the delta has changed and then intelligently change the pattern of traffic in response to it, it’s all pretty much the same thing here.


Graham-Cumming: Yes and then as a mathematician, when I see things that are the same thing, I can't help wondering what the real deep structure is underneath. There must be another layer another layer down or something. So as you know it's this thing. There's some other deeper layer below all this stuff.  


Gage: I think this is just endlessly fascinating. So my only recommendations to Cloudflare: first, double what you’re doing. That’s so hard because as you go from 10 people to 100 people to 1,000 people to 10,000 people, it’s a different world. You are a prime example, you are global. Suddenly you’re able to deal with local authorities in 60-70 countries and deal with some of the world’s most interesting terrain and with network connectivity and moving data, surveillance, and some security of the foundation infrastructure of all countries. You couldn’t be engaged in more exciting things.


Graham-Cumming: It's true. I mean one of the most interesting things to me is that I have grown up with the Internet when I you know I got an email address using actually the crazy JANET scheme in the UK where the DNS names were backwards. I was in Oxford and they gave me an email address and it was I think it was JGC at uk dot ac dot ox dot prg and that then at some point it flipped around and it went to DNS looked like it had won. For a long time my address was the wrong way around. I think that's a typically British decision to be slightly different to everybody else.


Gage: Well, Oxford’s always had that style, that we’re going to do things differently. There’s an Oxford Center for the 21st century that was created by the money from a wonderful guy who had donated maybe $100 million. And they just branched out into every possible research area. But when you went to meetings, you would enter a building that was built at the time of the Raj. It was the India temple of colonialism.


Graham-Cumming: There's quite a few of those in the UK. Are you thinking of the Martin School? James Martin. And he gave a lot of money to Oxford. Well the funny thing about that was the programming research group. The one thing they didn't teach us really as an undergraduate was how to program which was one of the most fascinating things they have because that was a bit getting your hands dirty so you needed to let all the theory. So we learnt all the theory we did a little bit of functional programming and that was the extent of it which set me really up badly for a career in an industry. My first job I had to pretend I knew how to program and see and learn very quickly.


Gage: Oh my. Well now you’ve been writing code in Go.


Graham-Cumming: Yes. Well the thing about Go, the other Oxford thing of course is Tony Hoare, who is a professor of computer science there. He had come up with this thing called CSP (Communicating Sequential Processes) so that was a whole theory around how you do parallel execution. And so of course everybody used his formalism and I did in my doctoral thesis and so when Go came along and they said oh this how Go works, I said, well clearly that’s CSP and I know how to do this. So I can do it again.


Gage: Tony Hoare occasionally would issue a statement about something and it was always a moment. So few people seem to realize the birth of so much of what we took in the 60s, 70s, 80s, in Silicon Valley and Berkeley, derived from the Manchester Group, the virtual memory work, these innovations. Today, Whit Diffie. He used to love these Bletchley stories, they’re so far advanced. That generation has died off.


Graham-Cumming: There’s a very peculiar thing in computer science and the real application of computing which is that we both somehow sit on this great knowledge of the past of computing and at the same time we seem to willfully forget it and reinvent everything every few years. We go through these cycles where it's like, let’s do centralized computing, now distributed computing. No, let’s have desktop PCs, now let’s have the cloud. We seem to have this collective amnesia and then on occasion people go, “Oh, Leslie Lamport wrote this thing in 1976 about this problem”. What other subject do we willfully forget the past and then have to go and doing archaeology to discover again?


Gage: As a sociological phenomenon it means that the older crowd in a company are depressing because they’ll say, “Oh we tried that and it didn’t work”. Over the years as Sun grew from 15 people or so and ended up being like 45,000 people before we were sold off to Oracle and then everybody dumped out because Oracle didn’t know too much about computing. So Ivan Sutherland, Whit Diffie. Ivan actually stayed on. He may actually still have an Oracle email. Almost all of the research groups, certainly the chip group went off to Intel, Fujitsu, Microsoft. It’s funny to think now that Microsoft’s run by a Sun person.


Graham-Cumming: Well that's the same thing. Everyone’s forgotten that Microsoft was the evil empire not that long ago. And so now it’s not. Right now it’s cool again.


Gage: Well, all of the embedded stuff from Microsoft is still that legacy that Bill Gates who’s now doing wonderful things with the Gates Foundation. But the embedded insecurity of the global networks is due to, in large part, the insecurities, that horrible engineering of Microsoft embedded everywhere. You go anywhere in China to some old industrial facility and there is some old not updated junky PC running totally insecure software. And it’s controlling the grid. It’s discouraging. It’s like a lot of the SCADA systems.


Graham-Cumming: I’m completely terrified of SCADA systems.


Gage: The simplest exploits. I mean, it’s nothing even complicated. There are a series of emerging journalists today that are paying attention to cybersecurity and people have come out with books even very recently. Well, now because we’re in this China, US, Iran nightmare, a United States presidential directive taking the cybersecurity crowd and saying, oops, now you’re an offensive force. Which means we got some 20-year-old lieutenant somewhere who suddenly might just for fun turn off Tehran’s water supply or something. This is scary because the SCADA systems are embedded everywhere, and they’re, I don’t know, would you say totally insecure? Just the simple things, just simple exploits. One of the journalists described, I guess it was the Russians who took a bunch of small USB sticks and at a shopping center near a military base just gave them away. And people put them into their PCs inside SIPRNet, inside the secure U.S. Department of Defense network. Instantly the network was taken over just by inserting a USB device to something on the net. And there you are, John, protecting against this.


Graham-Cumming: Trying hard to protect against these things, yes absolutely. It's very interesting because you mentioned before how rapidly Cloudflare had grown over the last few years. And of course Sun also really got going pretty rapidly, didn’t it?


Gage: Well, yes. The first year we were just some students from Berkeley, hardware from Stanford, Andy Bechtolsheim, software from Berkeley, Berkeley Unix BSD, Bill Joy. Combine the two, and 10 of us or so, and we were, I think the first year was 12 million booked, the second year was 50 or 60 million booked, and the third year was 150 or so million booked and then we hit 500 million and then we hit a billion. And now, it’s selling boxes, we were a manufacturing company so that’s different from software or services, but we also needed lots of people and so we instantly raided the immense benefit of variety of people in the San Francisco Bay Area, with Berkeley and Stanford. We had students in computer science, and mechanical engineering, and physics, and mathematics from every country in the world and we recruited from every country in the world. So a great part of Sun’s growth came, as you are, expanding internationally, and at one point I think we ran most of the telcos of the world, we ran China Mobile. 900 million subscribers on China Mobile, all Sun stuff in the back. Throughout Africa, every telco was running Sun and Cisco until Huawei knocked Cisco out. It was an amazing time.


Graham-Cumming: You ran the machine that ran Latek, that let me get my doctoral thesis done.


Gage: You know that’s how I got into it, actually. I was in econometrics and mathematics at Berkeley, and I walk down a hallway and outside a room was that funny smell from photographic paper from something, and there was perfectly typeset mathematics. Troff and nroff, all those old UNIX utilities for the Bell Technical Journal, and I open the door and I’ve got to get in there. There’s two hundred people sitting in front of these beehive-like little terminals all typing away on a UNIX system. And I want to get an account and I walk down the hall and there's this skinny guy who types about 200 words a minute named Bill Joy. And I said, I need an account, I’ve got to type set integral signs, and he said, what’s your name. I tell him my name, John Gage, and he goes voop, and I’ve never seen anybody type as fast as him in my life. This is a new world, here.


Graham-Cumming: So he was rude then?


Gage: Yeah he was, he was. Well, it’s interesting since the arrival of a device at Berkeley to complement the arrival of an MIT professor who had implemented in LISP, mathematical, not typesetting of mathematics, but actual maxima. To get Professor fetman, maxima god from MIT, to come to Berkeley and live a UNIX environment, we had to put a LISP up outside on the PDP. So Bill took that machine which had virtual memory and implemented the environment for significant computational mathematics. And Steve Wolfram took that CalTech, and Princeton Institute for Advanced Studies, and now we have Mathematica. So in a way, all of Sun and the UNIX world derived from attempting to do executable mathematics.


Graham-Cumming: Which in some ways is what computers are doing. I think one of the things that people don’t really appreciate is the extent to which all numbers underneath.


Gage: Well that’s just this discrete versus continuous problem that Michael Jordan is attempting to address. To my current total puzzlement and complete ignorance, is what in the world is symplectic integration? And how do Lyapunov functions work? Oh, no clue.


Graham-Cumming: Are we going to do a second podcast on that? Are you going to come back and teach us?


Gage: Try it. We’re on, you’re on, you’re on. Absolutely. But you’ve got to run a company.


Graham-Cumming: Well I've got some things to do. Yeah. But you can go do that and come tell us about it.


Gage: All right, Great John. Well it was terrific to talk to you.


Graham-Cumming: So yes it was wonderful speaking to you as well. Thank you for helping me dig up memories of when I was first fooling around with Sun Systems and, you know, some of the early days and of course “The Network is the Computer,” I'm not sure I fully yet understand quite the metaphor or even if maybe I do somehow deeply in my soul get it, but we’re going to try and make it a reality, whatever it is.


Gage: Well, I count it as a complete success, because you count as one of our successes because you‘re doing what you’re doing, therefore the phrase, “The Network is the Computer,” resides in your brain and when you get up in the morning and decide what to do, a little bit nudges you toward making the network work.


Graham-Cumming: I think that's probably true. And there's the dog, the dog is saying you've been yakking for an hour and now we better stop. So listen, thank you so much for taking the time. It was wonderful talking to you. You have a good day. Thank you very much.

Interested in hearing more? Listen to my conversations with Ray Rothrock and Greg Papadopoulos of Sun Microsystems:

To learn more about Cloudflare Workers, check out the use cases below:

  • Optimizely - Optimizely chose Workers when updating their experimentation platform to provide faster responses from the edge and support more experiments for their customers.
  • Cordial - Cordial used a “stable of Workers” to do custom Black Friday load shedding as well as using it as a serverless platform for building scalable customer-facing services.
  • - used Workers to avoid significant code changes to their underlying platform when migrating from a legacy provider to a modern cloud backend.
  • Pwned Passwords - Troy Hunt’s popular "Have I Been Pwned" project benefits from cache hit ratios of 94% on its Pwned Passwords API due to Workers.
  • Timely - Using Workers and Workers KV, Timely was able to safely migrate application endpoints using simple value updates to a distributed key-value store.
  • Quintype - Quintype was an eager adopter of Workers to cache content they previously considered un-cacheable and improve the user experience of their publishing platform.
Categories: Technology

The Network is the Computer: A Conversation with Ray Rothrock

Thu, 11/07/2019 - 14:02
 A Conversation with Ray Rothrock A Conversation with Ray Rothrock

Last week I spoke with Ray Rothrock, former Director of CAD/CAM Marketing at Sun Microsystems, to discuss his time at Sun and how the Internet has evolved. In this conversation, Ray discusses the importance of trust as a principle, the growth of Sun in sales and marketing, and that time he gave Vice President Bush a Sun demo. Listen to our conversation here and read the full transcript below.


John Graham-Cumming: Here I am very lucky to get to talk with Ray Rothrock who was I think one of the first investors in Cloudflare, a Series A investor and got the company a little bit of money to get going, but if we dial back a few earlier years than that, he was also at Sun as the Director of CAD/CAM Marketing. There is a link between Sun and Cloudflare. At least one, but probably more than one, which is that Cloudflare has recently trademarked, “The Network is the Computer”. And that was a Sun trademark, wasn’t it?


Ray Rothrock: It was, yes.


Graham-Cumming: I talked to John Gage and I asked him about this as well and I asked him to explain to me what it meant. And I'm going to ask you the same thing because I remember walking around the Valley thinking, that sounds cool; I’m not sure I totally understand it. So perhaps you can tell me, was I right that it was cool, and what does it mean?


Rothrock: Well it certainly was cool and it was extraordinarily unique at the time. Just some quick background. In those early days when I was there, the whole concept of networking computers was brand new. Our competitor Apollo had a proprietary network but Sun chose to go with TCP/IP which was a standard at the time but a brand new standard that very few people know about right. So when we started connecting computers and doing some intensive computing which is what I was responsible for—CAD/CAM in those days was extremely intensive whether it was electrical CAD/camera, or mechanical CAD/CAM, or even simulation solid design modeling and things—having a little extra power from other computers was a big deal. And so this concept of “The Network is the Computer” essentially said that you had one window into the network through your desktop computer in those days—there was no mobile computing at that time, this was like 84’, 85’, 86’ I think. And so if you had the appropriate software you could use other people's computers (for CPU power) and so you could do very hard problems at that single computer could not do because you could offload some of that CPU to the other computers. Now that was very nerdy, very engineering intensive, and not many people did it. We’d go to the SIGGRAPH, which was a huge graphics show in those days and we would demonstrate ten Sun computers for example, doing some graphic rendering of a 3D wireframe that had been created in the CAD/CAM software of some sort. And it was, it was hard, and that was in the mechanical side. On the electrical side, Berkeley had some software that was called Magic—it’s still around and is a very popular EDA software that’s been incorporated in those concepts. But to imagine calculating the paths in a very complicated PCB or a very complicated chip, one computer couldn't do it, but Sun had the fundamental technology. So from my seat at Sun at the time, I had access to what could be infinite computing power, even though I had a single application running, and that was a big selling point for me when I was trying to convince EDA and MDA companies to put their software on the Sun. That was my job.


Graham-Cumming: And hearing it now, it doesn’t sound very revolutionary, because of course we’re all doing that now. I mean I get my phone out of my pocket and connect to goodness knows what computing power which does image recognition and spots faces and I can do all sorts of things. But walk me through what it felt like at the time.


Rothrock: Just doing a Google search, I mean, how many data stores are being spun up for that? At the time it was incredible, because you could actually do side by side comparisons. We created some demonstrations, where one computer might take ten hours to do a calculation, two computers might take three hours, five computers might take 30 minutes. So with this demo, you could turn on computers and we would go out on the TCP/IP network to look for an available CPU that could give me some time. Let's go back even further. Probably 15 years before that, we had time sharing. So you had a terminal into a big mainframe and did all this swapping in and out of stuff to give you a time slice computing. We were doing the exact same thing except we were CPU slicing, not just time slicing. That’s pretty nerdy, but that's what we did. And I had to work with the engineering department, with all these great engineers in those days, to make this work for a demo. It was so unique, you know, their eyes would get big. You remember Novell...


Graham-Cumming: I was literally just thinking about Novell because I actually worked on IPX and SPX networking stuff at the time. I was going to ask you actually, to what extent do you think TCP/IP was a very important part of this revolution?


Rothrock: It was huge. It was fundamentally huge because it was a standard, so it was available and if you implemented it, you didn’t have to pay for it. When Bob Metcalfe did Ethernet, it was on top of the TCP stack. Sun, in my memory, and I could be wrong, was the first company to put a TCP/IP stack on the computer. And so you just plugged in the back, an RJ45 into this TCP/IP network with a switch or a router on it and you were golden. They made it so simple and so cheap that you just did it. And of course if you give an engineer that kind of freedom and it opens up. By the way, as the marketing guy at Sun, this was my first non-engineering job. I came from a very technical world of nuclear physics into Sun. And so it was stunning, just stunning.


Graham-Cumming: It’s interesting that you mentioned Novell and then you mentioned Apollo before that and obviously IBM had SNA networking and there were attempts to do all those networking things. It's interesting that these open standards have really enabled the explosion of everything else we've seen and with everything that's going on in the Internet.


Rothrock: Sun was open, so to speak, but this concept of open source now that just dominates the conversation. As a venture capitalist, every deal I ever invested in had open source of some sort in it. There was a while when it was very problematic in an M&A event, but the world’s gotten used to it. So open, is very powerful. It's like freedom. It's like liberty. Like today, July 4th, it’s a big deal.


Graham-Cumming: Yes, absolutely. It’s just interesting to see it explode today because I spent a lot of my career looking at so many different networking protocols. The thing that really surprises me, or perhaps shouldn’t surprise me when you’ve got these open things, is that you harness so many people's intelligence that you just end up with something that’s just better. It seems simple.


Rothrock: It seems simple. I think part of the magic of Sun is that they made it easy. Easy is the most powerful thing you can do in computing. Computing can be so nerdy and so difficult. But if you just make it easy, and Cloudflare has done a great job with that at that; they did it with their DNS service, they did it with all the stuff we worked on back when I was on the board and actively involved in the company. You’ve got to make it easy. I mean, I remember when Matthew and Lee worked like 20 hours a day on how to switch your DNS from whoever your provider was to Cloudflare. That was supposed to be one click, done. A to B. And that DNA was part of the magic. And whether we agree that Sun did it that way, to me at least, Sun did it that way as well. So it's huge, a huge lift.


Graham-Cumming: It’s funny you talk about that because at the time, how that actually worked is that we just asked people to give us their username and password. And we logged in and did it for them. Early on, Matthew asked me if I’d be interested in joining Cloudflare when it was brand new and because of other reasons I’d moved back to the UK and I wasn’t ready to change jobs and I’d just taken another job. And I remember thinking, this thing is crazy this Cloudflare thing. Who's going to hand over their DNS and their traffic to these four or five people above a nail salon in Palo Alto? And Matthew’s response was, “They’re giving us their passwords, let alone their traffic.” Because they were so desperate for it.


Rothrock: It tells you a lot about Matthew and you know as an attorney, I mean he was very sensitive to that and believes that one of the one of the founding principles is trust. His view was that, if I ever lose the customer’s trust, Cloudflare is toast. And so everything focused around that key value. And he was right.


Graham-Cumming: And you must have, at Sun, been involved with some high performance computing things that involved sensitive customers doing cryptography and things like that. So again trust is another theme that runs through there as well.


Rothrock: Yeah, very true. As the marketing guy of CAD/CAM, I was in the field two-thirds of the time, showing customers what was possible with them. My job was to get third party software onto the Sun box and then to turn that into a presentation to a customer. So I visited many government customers, many aerospace, power, all these very high falutin sort of behind the firewall kinds of guys in those days. So yes, trust was huge. It would come up: “Okay, so I’m using your CPU, how is it that you can’t use mine. And how do you convince me that you've not violated something.” In those days it was a whole different conversation that it is today but it was nonetheless just as important. In fact I remember I spent quite a bit of time at NCSA at the University of Illinois Urbana-Champaign. Larry Smarr was the head of NCSA. We spent a lot of time with Larry. I think John was there with me. John Gage and Vinod and some others but it was a big deal taking about high performance computing because that's what they were doing and doing it with Sun.


Graham-Cumming: So just to dial forward, so you’re at Venrock and you decide to invest in Cloudflare. What was it that made you think that this was worth investing in? Presumably you saw some things that were in some of Sun’s vision. Because Sun had a very wide-ranging visions about what was going to be possible with computing.


Rothrock: Yeah. Let me sort of touch on a few points probably. Certainly Sun was my first computer company I worked for after I got out of the nuclear business and the philosophy of the company was very powerful. Not only we had this cool 19 inch black and white giant Macintosh essentially although the Mac wasn't even born yet, but it had this ease of use that was powerful and had this open, I mean it was we preached that all the time and we made that possible. And Cloudflare—the related philosophy of Matthew and Michelle's genius—was they wanted to make security and distribution of data as free and easy as possible for the long tail. That was the first thinking because you didn't have access if you were in the long tail you were a small company you or you're just going to get whipped around by the big boys. And so there was a bit of, “We're here to help you, we're going to do it.” It's a good thing that the long tail get mobilized if you will or emboldened to use the Internet like the big boys do. And that was part of the attractiveness. I didn't say, “Boy, Matthew, this sounds like Sun,” but the concept of open and liberating which is what they were trying to do with this long tail DNS and CDN stuff was very compelling and seemed easy. But nothing ever is. But they made it look easy.


Graham-Cumming: Yeah, it never is. One of the parallels that I’ve noticed is that I think early on at Sun, a lot of Sun equipment went to companies that later became big companies. So some of these small firms that were using crazy work stations ended up becoming some of the big names in the Valley. To your point about the long tail, they were being ignored and couldn’t buy from IBM even if they wanted to.


Rothrock: They couldn’t afford SNA and they couldn’t do lots of things. So Sun was an enabler for these companies with cool ideas for products and software to use Sun as the underpinning. workstations were all the rage, because PCs were very limited in those days. Very very limited, they were all Intel based. Sun was 68000-based originally and then it was their own stuff, SPARC. You know in the beginning it was a cheap microprocessor from Motorola.


Graham-Cumming: What was the growth like at Sun? Because it was very fast, right?


Rothrock: Oh yes, it was extraordinarily fast. I think I was employee 130 or something like that. I left Sun in 1986 to go to business school and they gave me a leave of absence. Carol Bartz was my boss at that moment. The company was like at 2000 people just two and a half years later. So it was growing like a weed. I measured my success by how thick the catalyst—that was our catalog name and our program—how thick and how quickly I could add bonafide software developers to our catalog. We published on one sheet of paper front to back. When I first got there, our catalyst catalogue was a sheet of paper, and when I left, it was a book. It was about three-quarters of an inch thick. My group grew from me to 30 people in about a year and a half. It was extraordinary growth. We went public during that time, had a lot of capital and a lot of buzz. That openness, that our competition was all proprietary just like you were citing there, John. IBM and Apollo were all proprietary networks. You could buy a NIC card and stick it into your PC and talk to a Sun. And vise versa. And you couldn’t do that with IBM or Apollo. Do you remember those?


Graham-Cumming: I do because I was talking to John Gage. In my first job out of college, I wrote a TCP/IP stack from scratch, for a manufacturer of network cards. The test of this stack was I had an HP Apollo box and I had a Sun workstation and there was a sort of magical, can I talk to these devices? And can I ping them? And then that was already magical the first ping as it went across the network. And then, can I Telnet to one of these? So you know, getting the networking actually running was sort of the key thing. How important was networking for Sun in the early days? Was it always there?


Rothrock: Yeah, it was there from the beginning, the idea of having a network capability. When I got there it was network; the machine wasn’t standalone at all. We sort of mimicked the mainframe world where we had green screens hooked into a Sun in a department for example. And there was time sharing. But as soon as you got a Sun on your desk, which was rare because we were shipping as many as we could build, it was fantastic. I was sharing information with engineering and we were working back and forth on stuff. But I think it was fundamental: you have a microprocessor, you’ve got a big screen, you’ve got a graphic UI, and you have a network that hooks into the greater universe. In those days, to send an all-Sun email around the world, modems spun up everywhere. The network wasn’t what it is now.


Graham-Cumming: I remember in about 89’, I was at a conference and Whit Diffie was there. I asked him what he was doing. He was in a little computer room. I was trying to typeset something. And he said, “I’m telnetting into a machine which is in San Diego.” It was the first time I’d seen this and I stepped over and he was like, “look at this.” And he’s hitting the keyboard and the keys are getting echoed back. And I thought, oh my goodness, this is incredible. It’s right across the Atlantic and across the country as well.


Rothrock: I think, and this is just me talking having lived the last years and with all the investing and stuff I did, but you know it enabled the Internet to come about, the TCP/IP standard. You may recall that Microsoft tried to modify the TCP/IP stack slightly, and the world rejected it, because it was just too powerful, too pervasive. And then along comes HTTP and all the other protocols that followed. Telnetting, FTPing, all that file transfer stuff, we were doing that left, right, and center back in the 80s. I mean you know Cloudflare just took all this stuff and made it better, easier, and literally lower friction. That was the core investment thesis at the time and it just exploded. Much like when Sun adopted TCP/IP, it just exploded. You were there when it happened. My little company that I’m the CEO of now, we use Cloudflare services. First thing I did when I got there was switched to Cloudflare.


Graham-Cumming: And that was one of the things when I joined, we really wanted people get to a point where if you’re putting something on the web, you just say, well I’m going to put Cloudflare or a thing like Cloudflare just on it. Because it protects it, it makes it faster, etc. And of course now what we've done is we’ve given people compute facility. Right now you can write code and run it in our in our machines worldwide which is another whole thing.


Rothrock: And that is “The Network is the Computer”. The other thing that Sun was pitching then was a paperless office. I remember we had posters of paper flying out of a computer window on a Sun workstation and I don't think we've gotten there yet. But certainly, the network is the computer.


Graham-Cumming: It was probably the case that the paperless office was one of those things that was about to happen for quite a long time.


Rothrock: It's still about to happen if you ask me. I think e-commerce and the sort of the digital transformation has driven it harder than just networking. You know, the fact that we can now sign legal documents over the Internet without paper and things like that. People had to adopt. People have to trust. People have to adopt these standards and accept them. And lo and behold we are because we made it easy, we made it cheap, and we made it trustworthy.


Graham-Cumming: If you dial back through Sun, what was the hardest thing? I’m asking because I’m at a 1,000-person company and it feels hard some days, so I’m curious. What do I need to start worrying about?


Rothrock: Well yeah, at 1,000 people, I think that’s when John came into the company and sort of organized marketing. I would say, holding engineering to schedules; that was hard. That was hard because we were pushing the envelope our graphics was going from black and white to color. The networking stuff the performance of all the chips into the boards and just the performance was a big deal. And I remember, for me personally, I would go to a trade show. I'd go to Boston to the Association of Mechanical Engineers with the team there and would show up at these workstations and of course the engineers want to show off the latest. So I would be bringing with me tapes that we had of the latest operating system. But getting the engineers to be ready for a tradeshow was very hard because they were always experimenting. I don't believe the word “code freeze” meant much to them, frankly, but we would we would be downloading the software and building a trade show thing that had to run for three days on the latest and greatest and we knew our competitor would be there right across the aisle from us sort of showing their hot stuff. And working with Eric Schmidt in those days, you know, Eric you just got to be done on this date. But trade shows were wonderful. They focused the company’s endpoints if you will. And marketing and sales drove Sun; Scott McNealy’s culture there was big on that. But we had to show. It’s different today than it was then, I don’t know about the Cloudflare competition, but back then, there were a dozen workstation companies and we were fighting for mindshare and market share every day. So you didn't dare sort of leave your best jewels at home. You brought them with you. I will give John Gage high, high marks. He showed me how to dance through a reboot in case the code crashed and he’s marvelous and I learned how to work that stuff and to survive.


Rothrock: Can I tell you one sort of sales story?  


Graham-Cumming: Yes, I’m very interested in hearing the non-technical stories. As an engineer, I can hear engineering stories all the time, but I’m curious what it was like being in sales and marketing in such an engineering heavy company as Sun.


Rothrock: Yeah. Well it was challenging of course. One of the strategies that Sun had in those days was to get anyone who was building their own computer. This was Computer Vision and Data General and all those guys to adopt the Sun as their hardware platform and then they could put on whatever they wanted. So because I was one of the demo gods, my job was to go along with the sales guys when they wanted to try to convince somebody. So one of the companies we went after was Data General (DG) in Massachusetts. And so I worked for weeks on getting this whole demo suite running MDA, EDA, word processing, I had everything. And this was a big, big, big deal. And I mean like hundreds of millions of dollars of revenue. And so I went out a couple of days early and we were going to put up a bunch of Suns and I had a demo room at DG. So all the gear showed up and I got there at like 5:30 in the morning and started downloading everything, downloading software, making it dance. And at about 8:00 a.m. in the morning the CEO of Data General walks in. I didn't know who he was but it turned out to be Ed de Castro. And he introduces himself and I didn’t know who he was and he said, “What are you doing?” And I explained, “I’m from Sun, I’m getting ready for a big demo. We’ve got a big executive presentation. Mr. McNealy will be here shortly, etc.” And he said, “Well, show me what you’ve got.” So I’m sort of still in the middle of downloading this software and I start making this thing dance. I’ve got these machines talking to each other and showing all kinds of cool stuff. And he left. And the meeting was about 10 or 11 in the morning. And so when the executive team from Sun showed up they said, “Well, how's it going?” I said, “Well I gave a demo to a guy,” and they asked, “Who's the guy,” and I said, “It was Ed de Castro.” And they went, “Oh my God, that was the CEO.” Well, we got the deal. I thought Ed had a little tactic there to come in early, see what he could see, maybe get the true skinny on this thing and see what’s real. I carried the day. But anyway, I got a nice little bonus for that. But Vinod and I would drop into Lockheed down in Southern California. They wanted to put Suns on P-3 airplanes and we'd go down there with an engineer and we’d figure out how to make it. Those were just incredible times. You may remember back in the 80s everyone dressed up except on Fridays. It was dress-down Fridays. And one day I dressed down and Carol Bartz, my boss, saw me wearing blue jeans and just an open collared shirt and she said, “Rothrock, you go home and put on a suit! You never know when a customer is going to walk in the front door.” She was quite right. Kodak shows up. Kodak made a big investment in Sun when it was still private. And I gave that demo and then AT&T, and then interestingly Vice President Bush back in the Reagan administration came to Sun to see the manufacturing and I gave the demo to the Vice President with Scott and Andy and Bill and Vinod standing there.


Graham-Cumming: Do you remember what he saw?


Rothrock: It was my standard two minute Sun demo that I can give in my sleep. We were on the manufacturing floor. We picked up a machine and I created a demo for it and my executive team was there. We have a picture of it somewhere, but it was fun. As John Gage would say, he’d say, “Ray, your job is to make the computer dance.” So I did.  


Graham-Cumming: And one of the other things I wanted to ask you about is at some point Sun was almost Amazon Web Services, wasn't it. There was a rent-a-computer service, right?  


Rothrock: I don't know. I don't remember the rent-a-computer service. I remember we went after the PC business aggressively and went after the data centers which were brand new in those days pretty aggressively, but I don’t remember the rent-a-computer business that much. It wasn’t in my domain.


Graham-Cumming: So what are you up to these days?


Rothrock: I’m still investing. I do a lot of security investing. I did 15 deals while I was at Venrock. Cloudflare was the last one I did, which turned out really well of course. More to come, I hope. And I’m CEO of one of Venrock’s portfolio companies that had a little trouble a few years back but I fixed that and it’s moving up nicely now. But I’ve started thinking about more of a science base. I’m on the board of the Carnegie Institute of Science. I'm on the board of MIT and I just joined the board of the Nuclear Threat Initiative in Washington which is run by Secretary Ernie Moniz, former secretary of energy. So I’m doing stuff like that. John would be pleased with how well that played through. But I'll tell you it is this these fundamental principles, just tying it all back to Sun and Cloudflare, and this sort of open, cheap, easy, enabling humans to do things without too much friction, that is exciting. I mean, look at your phone. Steve Jobs was the master of design to make this thing as sweet as it is.


Graham-Cumming: Yes, and as addictive.


Rothrock: Absolutely, right. I haven’t been to a presentation from Cloudflare in two years, but every time I see an announcement like the DNS service, I immediately switched all my DNS here at the house to Stuff like that. Because I know it’s good and I know it’s trustworthy, and it’s got that philosophy built in the DNA.


Graham-Cumming: Yes definitely. Taking it back to what we talked about at the beginning, it’s definitely the trustworthiness is something that Cloudflare has cared about from the beginning and continues to care about. We’re sort of the guardians of the traffic that passes through it.


Rothrock: Back when the Internet started happening and when Sun was doing Java, I mean, all those things in the 90s, I was of course at Venrock, but I was still pretty connected to [Edward] Zander and [Scott] McNealy. We were hoping that it would be liberating, that it would create a world which was much more free and open to conversation and we’ve seen the dark side of some of that. But I continue to believe that transparency and openness is a good thing and we should never shut it down. I don't mean to get it all waxing philosophical here but way more good comes from being open and transparent than bad.


Graham-Cumming: Listen it's July 4th. It's evening here in London. We can be waxing philosophical as much as we like. Well listen, thank you for taking the time to chat with me. Are there any other reminiscences of Sun that you think the public needs to know in this oral history of “The Network is the Computer.”


Rothrock: Well you know the only thing I'd say is having landed in the Silicon Valley in 1981 and getting on with Sun, I can say this given my age and longevity here, everything is built on somebody else's great ideas. And starting with TCP/IP and then we went to this HTML protocol and browsers, it’s just layer on layer on layer on layer and so Cloudflare is just one of the latest to climb on the shoulders of those giants who put it all together. I mean, we don’t even think about the physical network anymore. But it is there and thank goodness companies like Cloudflare keep providing that fundamental service on which we can build interesting, cool, exciting, and mind-changing things. And without a Cloudflare, without Sun, without Apollo, without all those guys back in the day, it would be different. The world would just be so, so different. I did the New York Times crossword puzzle. I could not do it without Google because I have access to information I would not have unless I went to the library. It’s exponential and it just gets better. Thanks to Michelle and Matthew and Lee for starting Cloudflare and allowing Venrock to invest in it.


Graham-Cumming: Well thank you for being an investor. I mean, it helped us get off the ground and get things moving. I very much agree with you about the standing on the shoulders of giants because people don't appreciate the extent to which so much of this fundamental work that we did was done in the 70s and 80s.


Rothrock: Yea, it’s just like the automobile and the airplane. We reminisce about the history but boy, there were a lot of giants in those industries as well. And computing is just the latest.


Graham-Cumming: Yep, absolutely. Well, Ray, thank you. Have a good afternoon.

Interested in hearing more? Listen to my conversations with John Gage and Greg Papadopoulos of Sun Microsystems:

To learn more about Cloudflare Workers, check out the use cases below:

  • Optimizely - Optimizely chose Workers when updating their experimentation platform to provide faster responses from the edge and support more experiments for their customers.
  • Cordial - Cordial used a “stable of Workers” to do custom Black Friday load shedding as well as using it as a serverless platform for building scalable customer-facing services.
  • - used Workers to avoid significant code changes to their underlying platform when migrating from a legacy provider to a modern cloud backend.
  • Pwned Passwords - Troy Hunt’s popular "Have I Been Pwned" project benefits from cache hit ratios of 94% on its Pwned Passwords API due to Workers.
  • Timely - Using Workers and Workers KV, Timely was able to safely migrate application endpoints using simple value updates to a distributed key-value store. Quintype - Quintype was an eager adopter of Workers to cache content they previously considered un-cacheable and improve the user experience of their publishing platform.
Categories: Technology


Additional Terms