Blogroll: CloudFlare

I read blogs, as well as write one. The 'blogroll' on this site reproduces some posts from some of the people I enjoy reading. There are currently 19 posts from the blog 'CloudFlare.'

Disclaimer: Reproducing an article here does not necessarily imply agreement or endorsement!

Subscribe to CloudFlare feed
Cloudflare Blog
Updated: 3 hours 5 min ago

Our Response to the Senate Vote on FCC Privacy Rules

Fri, 24/03/2017 - 00:16

Today, the U.S. Senate voted narrowly to undo certain regulations governing broadband providers, put in place during the Obama administration, that would have required Internet Service Providers (ISPs) to obtain approval from their customers before sharing information such as web-browsing histories, app usage, and aspects of their financial and health information, with third parties. Now, ISPs may sell targeted advertising or share personal information and browsing history with third party marketers, without first getting explicit consent from web users.

Cloudflare is disappointed with the Senate’s actions, as we feel strongly that consumer privacy rights need to be at the forefront of discussions around how personal information is treated. The new regulations would have steered the U.S. closer to the privacy standards enjoyed by citizens in many other developed countries, rather than away from such rights.

Defaulting to an “opt-in” rather than “opt-out” standard would give consumers greater control over how, when, and with whom their personal information is used and shared. We believe that individuals, rather than corporations, should have the final say on what is done with their personal information.

Regardless of whether Washington ultimately decides to approve rolling back these regulations, Cloudflare will continue to prioritize the sensitivity and privacy of the data we handle from and on behalf of our customers, and to comply with applicable privacy regulations worldwide.

Categories: Technology

Buongiorno, Roma! Cloudflare Data Center CV

Wed, 22/03/2017 - 22:14

CC-BY 2.0 image by Ilaria Giacomi

We’re excited to announce Cloudflare’s 105th data center, in Rome. Visitors in Italy (and especially in and around the Lazio region) to over 6 million Internet properties now benefit from reduced latency and increased security. As our global network grows in breadth and capacity, we are able to stop attacks (which typically originate outside of Italy!) while serving legitimate traffic from our nearest in-country data center. Rome serves as a point of redundancy to our existing data center in Milan, and expands Cloudflare’s European network to 29 cities, with at least five more cities already in the works.

To close followers of the Cloudflare blog: with Rome (R) and Yerevan (Y) live, the only remaining letters that don’t yet begin the name of a city with a Cloudflare data center are E, I and U. Our hardware, on its way to a transcontinental city, promises to make that list even smaller.

Cloudflare partners closely with hosting providers, value-added resellers, managed service providers, digital agencies, and eCommerce/SaaS platforms to optimize our service. If you are a customer working with one of our partners in Italy, including Altervista and Planetel, among many others, you will also see an improvement in performance. Plus, it only takes a few clicks to add Cloudflare via one of these partners, making your site faster and safer in seconds. To become a Cloudflare partner, in Italy or anywhere around the world, click here.

Another day, another continent

After Asia, South America and Europe, our 106th Cloudflare data center will be back in North America. Everything’s up to date in our next city! To win some Cloudflare swag: can you guess the name?

- The Cloudflare Team

The Cloudflare network today

Categories: Technology

¡Hola, Ecuador! Quito Data Center expands Cloudflare network to 104 cities across 52 countries

Tue, 21/03/2017 - 22:16

CC-BY 2.0 image by Scipio

We are excited to announce Cloudflare’s newest data center in the World Heritage City of Quito, Ecuador, located only 15 miles from the Equator. This deployment is made possible in partnership with the NAP.EC Internet exchange run by AEPROVI (Asociación de empresas proveedoras de servicios de internet). Our newest data center expands Cloudflare’s growing Latin America network to six cities, joining Buenos Aires (Argentina), Lima (Peru), Medellin (Colombia), Sao Paulo (Brazil) and Valparaiso (Chile). Quito is our 104th deployment globally, with over a dozen additional cities in the works right now.

Open interconnection

Cloudflare participates at over 150 Internet exchanges globally with an open peering policy, and welcomes the opportunity to interconnect locally. As additional networks peer with Cloudflare, we’ll be able to localize a growing share of traffic that would have otherwise been served from Miami, while helping support the Internet exchange as a whole. In South America, we are existing members of the Terremark NAP do Brasil, Camara Argentina de Internet (CABASE) Buenos Aires and PTT São Paulo (run by NIC.BR).

Latin America expansion continues

With over 350 million Internet users, Latin America is experiencing the second fastest growth globally in mobile penetration. We are now only days away from announcing two additional South American data centers to improve the performance and security of 6 million Internet properties.

Sooner yet is Cloudflare's 105th deployment! After Monday in Asia and Tuesday in South America, tomorrow we travel to a new continent. All roads lead to the home of our next data center.

- The Cloudflare Team

The Cloudflare network today

Categories: Technology

Yerevan, Armenia: Cloudflare Data Center #103

Mon, 20/03/2017 - 19:34

CC-BY 2.0 image by Marco Polo

In the coming days, Cloudflare will be announcing a series of new data centers across five continents. We begin with Yerevan, the capital and largest city of Armenia, the mountainous country in the South Caucasus. This deployment is our 37th data center in Asia, and 103rd data center globally.


CC-BY 2.0 image by PAN Photo

Yerevan, one of the oldest continuously inhabited cities in the world, has a rich history going back all the way to 782 BC. Famous for its cognac, lavash flatbread, and beautiful medieval churches, Armenia is also home to more chess grandmasters per capita than most countries!

6 Million Websites Faster

Latency (ms) decreases 6x for UCOM Internet users in Yerevan connecting to Cloudflare. Source: Cedexis

The newest Cloudflare deployment will make 6 million Internet properties faster and more secure, as we serve traffic to Yerevan and adjoining countries.

If the Cloudflare data center closest to the Equator (to date) is Singapore, the next deployment brings us even closer. Which one do you think it is?

The Cloudflare network today

- The Cloudflare Team

Categories: Technology

Introducing Zero Round Trip Time Resumption (0-RTT)

Wed, 15/03/2017 - 14:00

Cloudflare’s mission is to help build a faster and more secure Internet. Over the last several years, the Internet Engineering Task Force (IETF) has been working on a new version of TLS, the protocol that powers the secure web. Last September, Cloudflare was the first service provider to enable people to use this new version of the protocol, TLS 1.3, improving security and performance for millions of customers.

Today we are introducing another performance-enhancing feature: zero round trip time resumption, abbreviated as 0-RTT. About 60% of the connections we see are from people who are visiting a site for the first time or revisiting after an extended period of time. TLS 1.3 speeds up these connections significantly. The remaining 40% of connections are from visitors who have recently visited a site and are resuming a previous connection. For these resumed connections, standard TLS 1.3 is safer but no faster than any previous version of TLS. 0-RTT changes this. It dramatically speeds up resumed connections, leading to a faster and smoother web experience for web sites that you visit regularly. This speed boost is especially noticeable on mobile networks.

We’re happy to announce that 0-RTT is now enabled by default for all sites on Cloudflare’s free service. For paid customers, it can be enabled in the Crypto app in the Cloudflare dashboard.

This is an experimental feature, and therefore subject to change.

If you're just looking for a live demo, click here.

The cost of latency

A big component of web performance is transmission latency. Simply put, transmission latency is the amount of time it takes for a message to get from one party to another over a network. Lower latency means snappier web pages and more responsive APIs; when it comes to responsiveness, every millisecond counts.

The diagram below comes from a recent latency test of Cloudflare’s network using the RIPE Atlas project. In the experiment, hundreds of probes from around the world sent a single “ping” message to Cloudflare and measured the time it took to get an answer in reply. This time is a good approximation of how long it takes for data to make the round trip from the probe to the server and back, so-called round-trip latency.

Round-trip latency from RIPE Atlas probes to the Cloudflare network.

Latency is usually measured in milliseconds, or thousandths of a second. A thousandth of a second may not seem like a long time, but milliseconds add up quickly. It’s generally accepted that the threshold above which humans no longer perceive something as instantaneous is 100ms. Anything above 100ms will seem fast, but not immediate. For example, Usain Bolt’s reaction time out of the starting blocks in the hundred-meter sprint is around 155ms, a good reference point for thinking about latency. Let’s use 155ms, a fast but humanly perceptible amount of time, as a unit of time measurement. Call 155ms “one Bolt.”

The map above shows that most probes have very low round-trip latency (<20ms) to Cloudflare’s global network. However, for a percentage of probes, the time it takes to reach the nearest Cloudflare data center is much longer, in some cases exceeding 300ms (or nearly two Bolts!).
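The “Bolt” unit makes those probe measurements easy to reason about. Here is a minimal sketch of the conversion, using hypothetical probe round-trip times (not the actual RIPE Atlas data):

```python
# One "Bolt" = 155 ms, Usain Bolt's reaction time out of the blocks.
BOLT_MS = 155.0

def to_bolts(latency_ms):
    """Convert a round-trip latency in milliseconds to Bolts."""
    return latency_ms / BOLT_MS

# Hypothetical probe round-trip times in ms (illustrative only):
# most probes are fast, a few are very slow.
probe_rtts = [12, 18, 25, 95, 310]
for rtt in probe_rtts:
    print("%4d ms = %.2f Bolts" % (rtt, to_bolts(rtt)))
```

A 310ms probe comes out at exactly two Bolts, which is why latencies past the 300ms mark start to feel genuinely slow.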

CC-BY 2.0 image by Nick J. Webb

Connections that travel over longer distances have higher latency. Data travel speed is limited by the speed of light. When Cloudflare opens a new datacenter in a new city, latency is reduced for people in the surrounding areas when visiting sites that use Cloudflare. This improvement is often simply because data has a shorter distance to travel.

Geographic proximity is not the only contributor to latency. WiFi and cellular networks can add tens or even hundreds of milliseconds to transmission latency. For example, using a 3G cellular network adds around 1.5 bolts to every transmission. Satellite Internet connections are even worse, adding up to 4 bolts to every transmission.

Round-trip latency makes an especially big difference for HTTPS. When making a secure connection to a server, there is an additional set-up phase that can require up to three messages to make the round trip between the client and the server before the first request can even be sent. For a visitor 250ms away, this can result in an excruciating one-second (1000ms) delay before a site starts loading. During this time Usain Bolt has run 10 meters and you’re still waiting for a web page. TLS 1.3 and 0-RTT can’t reduce the round-trip latency of a transmission, but they can reduce the number of round trips required to set up an HTTPS connection.
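To make that arithmetic explicit, here is the worst case described above as a quick sketch:

```python
# Round-trip latency for a visitor 250 ms away from the server.
RTT_MS = 250

# Round trips before the first byte of the page on a brand-new HTTPS
# connection: 1 (TCP handshake) + 2 (TLS 1.2 handshake) + 1 (HTTP).
ROUND_TRIPS = 4

delay_ms = RTT_MS * ROUND_TRIPS
print(delay_ms)  # 1000 -- a full second before the site starts loading
```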

HTTPS round trips

For a browser to download a web page over HTTPS, there is some setup that goes on behind the scenes. Here are the four phases that need to happen the first time your browser tries to access a site.

Phase 1: DNS Lookup

Your browser needs to convert the hostname of the website into an Internet IP address (such as 2400:cb00:2048:1::6813:c166) before it can connect to it. DNS resolvers operated by your ISP usually cache the IP addresses of popular domains, and latency to your ISP is fairly low, so this step often takes a negligible amount of time.

Phase 2: TCP Handshake (1 round trip)

The next step is to establish a TCP connection to the server. This phase consists of the client sending a SYN packet to the server and the server responding with a SYN-ACK packet. The details matter less than the fact that data must travel from client to server and back. This takes one round trip.

Phase 3: TLS Handshake (2 round trips)

In this phase, the client and server exchange cryptographic key material and set up an encrypted connection. For TLS 1.2 and earlier, this takes two round trips.

Phase 4: HTTP (1 round trip)

Once the TLS connection has been established, your browser can send an encrypted HTTP request over it. This can be a GET request for a specific URL, for example. The server will respond with an HTTP response containing the webpage’s HTML, and the browser will start displaying the page.

Assuming DNS is instantaneous, this leaves 4 round trips before the browser can start showing the page. If you’re visiting a site you’ve recently connected to, the TLS handshake phase can be shortened from two round trips to one with TLS session resumption.

This leaves the following minimum wait times:

  • New Connection: 4 RTT + DNS
  • Resumed Connection: 3 RTT + DNS
How do TLS 1.3 and 0-RTT improve connection times?

One of the biggest advantages of TLS 1.3 over earlier versions is that it only requires one round trip to set up the connection, resumed or not. This provides a significant speed up for new connections, but none for resumed connections. Our measurements show that around 40% of HTTPS connections are resumptions (either via session IDs or session tickets). With 0-RTT, a round trip can be eliminated for most of that 40%.

TLS connection reuse by time of day.

To summarize the performance differences:

TLS 1.2 (and earlier)

  • New Connection: 4 RTT + DNS
  • Resumed Connection: 3 RTT + DNS

TLS 1.3

  • New Connection: 3 RTT + DNS
  • Resumed Connection: 3 RTT + DNS

TLS 1.3 + 0-RTT

  • New Connection: 3 RTT + DNS
  • Resumed Connection: 2 RTT + DNS
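The summary above can be expressed as a tiny model. This is a sketch for illustration; real connection times also include DNS resolution and server processing:

```python
def setup_round_trips(tls_version, resumed=False, zero_rtt=False):
    """Round trips before the first HTTP response, excluding DNS.

    Always 1 RTT for the TCP handshake and 1 RTT for the HTTP
    request/response, plus the TLS handshake round trips.
    """
    if tls_version < 1.3:
        tls_rtts = 1 if resumed else 2   # session resumption saves one RTT
    elif resumed and zero_rtt:
        tls_rtts = 0                     # request rides along with the handshake
    else:
        tls_rtts = 1
    return 1 + tls_rtts + 1

for label, rtts in [
    ("TLS 1.2, new",              setup_round_trips(1.2)),
    ("TLS 1.2, resumed",          setup_round_trips(1.2, resumed=True)),
    ("TLS 1.3, new",              setup_round_trips(1.3)),
    ("TLS 1.3 + 0-RTT, resumed",  setup_round_trips(1.3, resumed=True, zero_rtt=True)),
]:
    print("%-26s %d RTT + DNS" % (label, rtts))
```

Plugging in a 250ms round trip, the drop from 3 RTT to 2 RTT on resumed connections is a 250ms saving on every revisit.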

The performance gains are huge.

0-RTT in action

Both Firefox Beta and Chrome Beta have TLS 1.3 enabled by default. The stable versions of Chrome and Firefox also ship with TLS 1.3 support, but it has to be enabled manually for now. The only browsers that support 0-RTT as of March 2017 are Firefox Nightly and Aurora. To enable it, do the following:

  • Enter about:config in the address bar
  • Ensure security.tls.version.max is 4 (this enables TLS 1.3)
  • Set security.tls.enable_0rtt_data to true

This demo loads an image from a server that runs the Cloudflare TLS 1.3 0-RTT proxy. In order to emphasize the latency differences, we used Cloudflare's new DNS Load Balancer to direct you to a far away server. If the image is loaded over 0-RTT it will be served orange, otherwise black, based on the CF-0RTT-Unique header.

The image is loaded twice: with and without a query string. 0-RTT is disabled transparently when a query string is used to prevent replays.

The connection is pre-warmed, Keep-Alives are off and caching is disabled to simulate the first request of a resumed connection.


To see what’s going on under the hood, take a look in Firefox’s Developer Tools. We’ve taken a screenshot of a version of this demo as run by a user in San Francisco. In the first screenshot, the image is served with TLS 1.3, in the second with TLS 1.3 and 0-RTT.

Image load over TLS 1.3 (top) and over TLS 1.3 with 0-RTT (bottom).

In the second image, you can see that the blue “Waiting” bar is around 250ms shorter than it is in the first. This 250ms represents the time it took for the extra round trip between the browser and the server. If you’re in San Francisco, 0-RTT enables the image to load 1.5 Bolts faster than it would have otherwise.

What’s the catch?

0-RTT is cutting-edge protocol technology. With it, encrypted HTTPS requests become just as fast as unencrypted HTTP requests. This sort of breakthrough comes at a cost: the security properties that TLS provides to 0-RTT requests are slightly weaker than those it provides to regular requests. However, this weakness is manageable, and applications and websites that follow HTTP semantics shouldn’t have anything to worry about. The weakness has to do with replays.

Unlike any other requests sent over TLS, requests sent as part of 0-RTT resumption are vulnerable to what’s called a replay attack. If an attacker has access to your encrypted connection, they can take a copy of the encrypted 0-RTT data (containing your first request) and send it to the server again pretending to be you. This can result in the server seeing repeated requests from you when you only sent one.

This doesn’t sound like a big deal until you consider that HTTP requests are used for more than just downloading web pages. For example, HTTP requests can trigger transfers of money. If someone makes a request to their bank to “pay $1000 to Craig” and that request is replayed, it could cause Craig to be paid multiple times. A good deal if you’re Craig.

Luckily, the example above is somewhat contrived. Applications need to be replay-safe to work with modern browsers, whether they support 0-RTT or not. Browsers replay data all the time due to normal network glitches, and researchers from Google have even shown that attackers can trick the browser into replaying requests in almost any circumstance by triggering a particular type of network error. To be resilient against this reality, well-designed web applications that handle sensitive requests use application-layer mechanisms to prevent replayed requests from affecting them.

Although web applications should be replay-resilient, that’s not always the reality. To protect these applications from malicious replays, Cloudflare took an extremely conservative approach to choosing which 0-RTT requests would be answered. Specifically, only GET requests with no query parameters are answered over 0-RTT. According to the HTTP specification, GET requests are supposed to be safe and idempotent, meaning that they don’t change state on the server and shouldn’t be used for things like funds transfers. We also enforce a maximum size for 0-RTT requests, and limit how long they can be replayed.
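That conservative policy can be sketched as a small filter. This is illustrative only (the function name and size limit are assumptions, not Cloudflare’s actual implementation):

```python
MAX_EARLY_DATA_BYTES = 1024  # hypothetical size limit, not Cloudflare's real value

def eligible_for_0rtt(method, path, request_size):
    """Decide whether a request may be answered from 0-RTT early data.

    Only GET requests without query parameters, below a size limit,
    are considered safe to answer before the handshake completes.
    """
    if method != "GET":
        return False             # non-idempotent methods are never eligible
    if "?" in path:
        return False             # query strings disable 0-RTT
    return request_size <= MAX_EARLY_DATA_BYTES
```

Anything that falls outside the filter is simply held until the full handshake completes, so a replayed POST never reaches the origin as early data.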

Furthermore, Cloudflare can uniquely identify connection resumption attempts, so we relay this information to the origin by adding an extra header to 0-RTT requests. This header uniquely identifies the request, so if one gets repeated, the origin will know it's a replay attack.

Here’s what the header looks like:

Cf-0rtt-Unique: 37033bcb6b42d2bcf08af3b8dbae305a

The hexadecimal value is derived from a piece of data called a PSK binder, which is unique per 0-RTT request.
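On the origin side, that header makes replay detection straightforward. A minimal sketch, using an in-memory set (a real origin would need a bounded, shared store):

```python
seen_0rtt_ids = set()

def is_replay(headers):
    """Flag a repeated 0-RTT request via its Cf-0rtt-Unique header."""
    unique_id = headers.get("Cf-0rtt-Unique")
    if unique_id is None:
        return False             # not a 0-RTT request: nothing to check
    if unique_id in seen_0rtt_ids:
        return True              # same early data seen before: a replay
    seen_0rtt_ids.add(unique_id)
    return False
```

The first request carrying a given PSK-binder value is accepted; any identical copy that arrives later is flagged, so the origin can refuse to act on it twice.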

Generally speaking, 0-RTT is safe for most web sites and applications. If your web application does strange things and you’re concerned about its replay safety, consider not using 0-RTT until you can be certain that there are no negative effects.


TLS 1.3 is a big step forward for web performance and security. By combining TLS 1.3 with 0-RTT, the performance gains are even more dramatic. Combine this with HTTP/2 and the encrypted web has never been faster, especially on mobile networks. Cloudflare is happy to be the first to introduce this feature on a wide scale.

Categories: Technology

An AMP validator you can cURL

Wed, 08/03/2017 - 14:01

Cloudflare has been a long-time supporter of AMP, an open-source markup language used by 1.5 billion web pages to accelerate their mobile web performance. Cloudflare runs Ampersand, the only alternative to Google’s AMP cache, and earlier this year we launched Accelerated Mobile Links, a way for sites on Cloudflare to open external links in AMP format, as well as Firebolt, which leverages AMP to speed up ad performance.

One of the biggest challenges developers face in converting their web pages to AMP is testing their pages for valid AMP syntax before deploying. It's not enough to make the templates work at dev time; you also need to validate individual pages before they’re published. Imagine, for example, a publishing company where content creators who are unfamiliar with AMP are modifying pages. Because the AMP markup language is so strict, one person adding an interactive element to a page can suddenly break the AMP formatting and stop the page from validating.

We wanted to make it as easy as possible to move webpages and sites to AMP so we built an AMP linter API for developers to check that their AMP pages are formatted correctly, even before they are deployed.

To check if a webpage’s AMP markup is correct, just send the AMP page to the endpoint like this:

curl
{ "source": "", "valid": true, "version": "1488238516283" }

The API has options to send just the markup content, or point the linter to the live site. To send a file, add the --data-binary flag:

curl -X POST --data-binary @amp_page.html -H 'Content-Type: text/html; charset=UTF-8'

If you send an AMP page with invalid AMP syntax, the message returned will tell you exactly what breaks your AMP page, and will point you to the specific place in the AMP reference where you can see the implementation guide for the broken element.

curl -X POST --data-binary @invalid_amp.html -H 'Content-Type: text/html; charset=UTF-8'
{
  "errors": [
    {
      "code": "MANDATORY_TAG_MISSING",
      "col": 7,
      "error": "The mandatory tag 'link rel=canonical' is missing or incorrect.",
      "help": "",
      "line": 13
    }
  ],
  "source": "POST",
  "valid": false,
  "version": "1485227592804"
}

Here’s a reference in Python. If you want to send HTML directly instead of pointing at a live webpage, replace the requests.get call with r ='', data=html)

import requests
u = ''
r = requests.get('' + u)
validation = r.json()
if validation['valid']:
    print u, 'is valid'
else:
    print u, 'failed!'
    for e in validation['errors']:
        print e
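For a build or CI step, the JSON shape shown in the responses above is easy to turn into a pass/fail check. A sketch that operates on an already-parsed response (the function name is hypothetical; field names are taken from the example responses):

```python
def report(validation):
    """Summarize a parsed validator response; return True if the page passed."""
    if validation["valid"]:
        return True
    for e in validation.get("errors", []):
        # Each error carries a code, a position and a human-readable message.
        print("line %d, col %d: %s (%s)" % (
            e["line"], e["col"], e["error"], e["code"]))
    return False
```

Exiting non-zero when report() returns False is enough to fail a build on invalid AMP.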

Let us know what you think - you can send us feedback at Whether you embed this tool into your build and continuous integration processes, or into your CMS workflows, we’re excited to hear how you use it.

Categories: Technology

Cloudflare at Google NEXT 2017

Wed, 08/03/2017 - 00:44

The Cloudflare team is headed down the street to Google NEXT 2017 from March 8th - 10th at Moscone Center booth C7 in San Francisco, CA. We’re excited to meet with existing partners, customers, and new friends!

Come learn about Cloudflare’s recent partnership with Google Cloud Platform (GCP) through their CDN Interconnect Program. Cloudflare offers performance and security to over 25,000 joint customers. The CDN Interconnect program accelerates the delivery of dynamic content by allowing Cloudflare’s servers to establish high-speed interconnections with Google Cloud Platform at various locations around the world.

We’ll be at booth C7 discussing the benefits of Cloudflare, our partnership with Google Cloud Platform, and handing out Cloudflare swag. In addition, our Co-Founder, Michelle Zatlyn, will be presenting “A Cloud Networking Blueprint for Securing Your Workloads” on Thursday, March 9th from 11:20 AM to 12:20 PM at Moscone West, Room 2005.

What is Google Cloud Platform’s CDN Interconnect Program?

Google Cloud Platform’s CDN Interconnect program allows select CDN providers to establish direct interconnect links with Google’s edge network at various locations. Customers egressing network traffic from Google Cloud Platform through one of these links will benefit from the direct connectivity to the CDN providers and will be billed according to the lower Google Cloud Interconnect pricing.

Joint customers of Cloudflare and Google Cloud Platform can expect a bandwidth savings of up to 75% and receive discounted egress pricing. Egress traffic is traffic flowing from Google Cloud Platform servers to Cloudflare’s servers. The high-speed interconnections between GCP and Cloudflare speed up the delivery of dynamic content for visitors.

How does the CDN Interconnect program work?

As part of this program, 41 Cloudflare data centers are directly connected to Google Cloud Platform’s infrastructure. When one of these Cloudflare data centers requests content from a Google Cloud Platform origin, it’s routed through a high-performance interconnect instead of the public Internet. This dramatically reduces latency for origin requests, and it also enables discounted Google Cloud Platform egress pricing in the US, Europe and Asia regions.

Joint Customer Stories

Quizlet and Discord, two prominent joint customers of Cloudflare and Google Cloud Platform, have shared their performance, security, and cost-savings stories.


Discord is a free voice and text chat app designed specifically for gaming. In one year, Discord grew from 25,000 concurrent users to 2.4 million, a 9,000 percent growth. Discord’s 25 million registered users send 100 million messages per day across the platform, requiring a global presence with tremendous amounts of network throughput. As Discord experiences explosive growth, they're thankful Cloudflare helps keep bandwidth & hardware costs down and web performance high.

  • Saving $100,000 on annual hardware costs
  • Saving $100,000 monthly on Google Cloud Network Egress bill
  • Secure traffic even with spikes of WebSocket events up to 2 million/second

Learn more about Discord’s use of Cloudflare on Google Cloud Platform:


Quizlet is the world’s largest student and teacher online learning community. Every month, over 20 million active learners from 130 countries practice and master more than 140 million study sets of content on every conceivable subject and topic. Quizlet’s Alexa ranking is 588 globally, and 104 in the United States, ranking it as one of the most highly-trafficked websites.

Quizlet receives performance and security benefits, while saving more than 50 percent on their Google Cloud networking egress bill by using Cloudflare.

  • Saving 50% on monthly Google Cloud Network Egress bill
  • Reducing daily bandwidth use by 76 percent (over 10 TB)

Learn more about Quizlet’s use of Cloudflare on Google Cloud Platform:

Presentation by Cloudflare Co-Founder Michelle Zatlyn

Cloudflare’s Co-Founder, Michelle Zatlyn, will be presenting alongside Google and Palo Alto Networks, in a talk titled “A Cloud Networking Blueprint for Securing Your Workloads”.

Date & Time

Thursday, March 9th | 11:20 AM - 12:20 PM | Moscone West, Room 2005


Securing your workloads in the cloud requires shifting away from the traditional “perimeter” security to a “pervasive, hierarchical, scalable” security model. In this session, we discuss cloud networking best practices for securing enterprise and cloud-native workloads on Google Cloud Platform. We describe a network security blueprint that covers securing your virtual networks (VPCs), DDoS protection, using third-party security appliances and services, and visibility and analytics for your deployments. We also highlight Google’s experiences in delivering its own services securely and future trends in cloud network security.

Categories: Technology

Quantifying the Impact of "Cloudbleed"

Wed, 01/03/2017 - 15:27

Last Thursday we released details on a bug in Cloudflare's parser impacting our customers. It was an extremely serious bug that caused data flowing through Cloudflare's network to be leaked onto the Internet. We fully patched the bug within hours of being notified. However, given the scale of Cloudflare, the impact was potentially massive.

The bug has been dubbed “Cloudbleed.” Because of its potential impact, the bug has been written about extensively and generated a lot of uncertainty. The burden of that uncertainty has been felt by our partners, customers, and our customers’ customers. The question we’ve been asked most often is: what risk does Cloudbleed pose to me?

We've spent the last twelve days using log data on the actual requests we’ve seen across our network to get a better grip on what the impact was and, in turn, provide an estimate of the risk to our customers. This post outlines our initial findings.

The summary is that, while the bug was very bad and had the potential to be much worse, based on our analysis so far: 1) we have found no evidence based on our logs that the bug was maliciously exploited before it was patched; 2) the vast majority of Cloudflare customers had no data leaked; 3) after a review of tens of thousands of pages of leaked data from search engine caches, we have found a large number of instances of leaked internal Cloudflare headers and customer cookies, but we have not found any instances of passwords, credit card numbers, or health records; and 4) our review is ongoing.

To make sense of the analysis, it's important to understand exactly how the bug was triggered and when data was exposed. If you feel like you've already got a good handle on how the bug got triggered, click here to skip to the analysis.

Triggering the Bug

One of Cloudflare's core applications is a stream parser. The parser scans content as it is delivered from Cloudflare's network and is able to modify it in real time. The parser is used to enable functions like automatically rewriting links from HTTP to HTTPS (Automatic HTTPS Rewrites), hiding email addresses on pages from email harvesters (Email Address Obfuscation), and other similar features.

The Cloudbleed bug was triggered when a page with two characteristics was requested through Cloudflare's network. The two characteristics were: 1) the HTML on the page needed to be broken in a specific way; and 2) a particular set of Cloudflare features needed to be turned on for the page in question.

The specific HTML flaw was that the page had to end with an unterminated attribute. In other words, something like:

<IMG HEIGHT="50px" WIDTH="200px" SRC="

Here's why that mattered. When a page for a particular customer is being parsed it is stored in memory on one of the servers that is a part of our infrastructure. Contents of the other customers' requests are also in adjacent portions of memory on Cloudflare's servers.

The bug caused the parser, when it encountered an unterminated attribute at the end of a page, to not stop when it reached the end of the portion of memory for the particular page being parsed. Instead, the parser continued to read from adjacent memory, which contained data from other customers' requests. The contents of that adjacent memory were then dumped onto the page with the flawed HTML.

The screenshot above is an example of how data was dumped on pages. Most of the data was random binary data which the browser is trying to interpret as largely Asian characters. That is followed by a number of internal Cloudflare headers.

If you had accessed one of the pages that triggered the bug you would have seen what likely looked like random text at the end of the page. The amount of data dumped varied randomly, limited by the size of the heap or cut short when the parser happened across a character that caused the output to terminate.
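As a toy illustration of this failure mode, here is a minimal C sketch. This is not Cloudflare's actual parser code; the buffer contents and function are invented. It shows how a scan that only stops on a terminator character dumps whatever happens to sit next to the page in memory:

```c
#include <stddef.h>

/* One flat buffer standing in for server memory: the flawed page sits
 * directly before another request's data (both contents are made up).
 * The page ends mid-attribute, so it contains no '>' of its own. */
static const char server_memory[] =
    "<img src=\"x"                                /* the flawed page  */
    "Cookie: session=some-other-customers-data>"; /* adjacent request */

/* Naive scan: copy bytes until a '>' terminator is found. With no
 * terminator inside the page itself, the copy runs straight past the
 * page's end and dumps the neighbouring request onto the output. */
size_t scan_until_terminator(const char *p, char *out, size_t cap) {
    size_t n = 0;
    while (n + 1 < cap && p[n] != '>') {
        out[n] = p[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```

Here the leak stops only because the adjacent data happens to contain a '>', mirroring how the real bug's output ended at whatever character terminated the parse.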

Code Path and a New Parser

In addition to a page with flawed HTML, the particular set of Cloudflare features that were enabled mattered because it determined the version of the parser that was used. We rolled out a new version of the parser code on 22 September 2016. This new version of the parser exposed the bug.

Initially, the new parser code would only get executed under a very limited set of circumstances. From 22 September 2016 through 13 February 2017, fewer than 180 sites had the combination of the HTML flaw and the set of features that would trigger the new version of the parser. During that time period, pages that had both characteristics and therefore would trigger the bug were accessed an estimated 605,037 times.

On 13 February 2017, not aware of the bug, we expanded the circumstances under which the new parser would get executed. That expanded the number of sites where the bug could get triggered from fewer than 180 to 6,457. From 13 February 2017 through 18 February 2017, when we patched the bug, the pages that would trigger the bug were accessed an estimated 637,034 times. In total, between 22 September 2016 and 18 February 2017 we now estimate based on our logs the bug was triggered 1,242,071 times.

The pages that typically triggered the bug tended to be on small and infrequently accessed sites. When one of these vulnerable pages was accessed and the bug was triggered, it was random which other customers' content was in adjacent memory and would then get leaked. Higher-traffic Cloudflare customers were more likely to have data leaked because they received more requests and so, probabilistically, were more likely to have content in memory at any given time.

To be clear, customers that had data leak did not need to have flawed HTML or any particular Cloudflare features enabled. They just needed to be unlucky and have their data in memory immediately following a page that triggered the bug.

How a Malicious Actor Would Exploit the Bug

The Cloudbleed bug wasn't like a typical data breach. To analogize to the physical world, a typical data breach would be like a robber breaking into your office and stealing all your file cabinets. The bad news in that case is that the robber has all your files. The good news is you know exactly what they have.

Cloudbleed is different. It's more akin to learning that a stranger may have listened in on two employees at your company talking over lunch. The good news is the amount of information for any conversation that's eavesdropped is limited. The bad news is you can't know exactly what the stranger may have heard, including potentially sensitive information about your company.

If a stranger were listening in on a conversation between two employees, the vast majority of what they would hear wouldn't be harmful. But, every once in awhile, the stranger may overhear something confidential. The same is true if a malicious attacker knew about the bug and were trying to exploit it. Given that the data that leaked was random on a per request basis, most requests would return nothing interesting. But, every once in awhile, the data that leaked may return something of interest to a hacker.

If a hacker were aware of the bug before it was patched and trying to exploit it then the best way for them to do so would be to send as many requests as possible to a page that contained the set of conditions that would trigger the bug. They could then record the results. Most of what they would get would be useless, but some would contain very sensitive information.

The nightmare scenario we have been worried about is if a hacker had been aware of the bug and had been quietly mining data before we were notified by Google's Project Zero team and were able to patch it. For the last twelve days we've been reviewing our logs to see if there's any evidence to indicate that a hacker was exploiting the bug before it was patched. We’ve found nothing so far to indicate that was the case.

Identifying Patterns of Malicious Behavior

For a limited period of time we keep a debugging log of requests that pass through Cloudflare. This is done by sampling 1% of requests and storing information about the request and response. We are then able to look back in time for anomalies in HTTP response codes, response or request body sizes, response times, or other unusual behavior from specific networks or IP addresses.

We have the logs of 1% of all requests going through Cloudflare from 8 February 2017 up to 18 February 2017 (when the vulnerability was patched) giving us the ability to look for requests leaking data during this time period. Requests prior to 8 February 2017 had already been deleted. Because we have a representative sample of the logs for the 6,457 vulnerable sites, we were able to parse them in order to look for any evidence someone was exploiting the bug.
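The 1% sampling can be sketched as follows. This is a hypothetical scheme that keys the keep/discard decision off a request identifier so the sample is reproducible; the post doesn't specify Cloudflare's actual mechanism:

```c
/* Keep roughly 1 in 100 requests by bucketing a request identifier.
 * Keying off an ID (rather than rolling a fresh random number) makes
 * the same request land in or out of the sample reproducibly. */
int in_debug_sample(unsigned long long request_id) {
    return request_id % 100 == 0;
}

/* With a 1% sample, the expected number of stored records. */
unsigned long long expected_records(unsigned long long total_requests) {
    return total_requests / 100;
}
```

At this rate, the 31,874 hits on the test page described below would be expected to yield roughly 318 stored records, consistent with the 316 actually captured.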

The first thing we looked for was a site we knew was vulnerable and for which we had accurate data. In the early hours of 18 February 2017, immediately after the problem was reported to us, we set up a vulnerable page on a test site and used it to reproduce the bug and then verify it had been fixed.

Because we had logging on the test web server itself we were able to quickly verify that we had the right data. The test web server had received 31,874 hits on the vulnerable page due to our testing. We had captured very close to 1% of those requests (316 were stored). From the sampled data, we were also able to look at the sizes of responses, which showed a clear bimodal distribution: small responses from after the bug was fixed, large responses from while data was leaking.

This gave us confidence that we had captured the right information to go hunting for exploitation of the vulnerability.

We wanted to answer two questions:

  1. Did any individual IP hit a vulnerable page enough times that a meaningful amount of data was extracted? This would capture the situation where someone had discovered the problem on a web page and had set up a process to repeatedly download the page from their machine. For example, something as simple as running curl in a loop would show up in this analysis.

  2. Was any vulnerable page accessed enough times that a meaningful amount of data could have been extracted by a botnet? A more advanced hacker would have wanted to cover their footprints by using a wide range of IP addresses rather than repeatedly visiting a page from a single IP. To identify that possibility we wanted to see if any individual page had been accessed enough times and returned enough data for us to suspect that data was being extracted.
Reviewing the Logs

To answer #1, we looked for any IP addresses that had hit a single page on a vulnerable site more than 1,000 times and downloaded more data than the site would normally deliver. We found 7 IP addresses with those characteristics.

Six of the seven IP addresses were accessing three sites with three pages with very large HTML. Manual inspection showed that these pages did not contain the broken HTML that would have triggered the bug. They also did not appear in a database of potentially vulnerable pages that our team gathered after the bug was patched.

The other IP address belonged to a mobile network and was traffic for a ticket booking application. The particular page was very large, but it was not leaking data: it did not contain broken HTML and was not in our database of vulnerable pages.

To look for evidence of #2, we retrieved every page on a vulnerable site that was requested more than 1,000 times during the period. We then downloaded those pages and ran them through the vulnerable version of our software in a test environment to see if any of them would cause a leak. This search turned up the sites we had created to test the vulnerability. However, we found no vulnerable pages, outside of our own test sites, that had been accessed more than 1,000 times.

This leads us to believe that the vulnerability had not been exploited between 8 February 2017 and 18 February 2017. However, we also wanted to look for signs of exploitation between 22 September 2016 and 8 February 2017 — a time period for which we did not have sampled log data. To do that, we turned to our customer analytics database.

Reviewing Customer Analytics

We store customer analytics data with one hour granularity in a large datastore. For every site on Cloudflare and for each hour we have the total number of requests to the site, number of bytes read from the origin web server, number of bytes sent to client web browsers, and the number of unique IP addresses accessing the site.

If a malicious attacker were sending a large number of requests to exploit the bug then we hypothesized that a number of signals would potentially appear in our logs. These include:

  • The ratio of requests per unique IP would increase. While an attacker could use a botnet or large number of machines to harvest data, we speculated that, at least initially, upon discovering the bug the hacker would send a large number of requests from a small set of IPs to gather initial data.

  • The ratio of bandwidth per request would increase. Since the bug leaks a large amount of data onto the page, if the bug were being exploited then the bandwidth per request would increase.

  • The ratio of bandwidth per unique IP would also increase. Since you’d expect that more data was going to the smaller set of IPs the attacker would use to pull down data then the bandwidth per IP would increase.
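The three signals above can be sketched as ratio checks over the hourly rollups. The struct mirrors the counters described earlier; the anomaly threshold factor is a made-up tuning knob, not a value from our analysis:

```c
/* Hourly per-site rollup, mirroring the counters described above. */
struct hourly_rollup {
    double requests;
    double bytes_to_clients;
    double unique_ips;
};

/* The three signals (all assume non-zero counters). */
double req_per_ip(const struct hourly_rollup *h)    { return h->requests / h->unique_ips; }
double bytes_per_req(const struct hourly_rollup *h) { return h->bytes_to_clients / h->requests; }
double bytes_per_ip(const struct hourly_rollup *h)  { return h->bytes_to_clients / h->unique_ips; }

/* Flag an hour if any ratio exceeds the site's pre-bug baseline by
 * `factor` -- a hypothetical tuning knob, not a value from the post. */
int looks_anomalous(const struct hourly_rollup *hour,
                    const struct hourly_rollup *baseline,
                    double factor) {
    return req_per_ip(hour)    > factor * req_per_ip(baseline)
        || bytes_per_req(hour) > factor * bytes_per_req(baseline)
        || bytes_per_ip(hour)  > factor * bytes_per_ip(baseline);
}
```

An exploitation attempt would tend to trip all three checks at once: many requests from few IPs, each pulling down an unusually large response.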

We used the data from before the bug impacted sites to set the baseline for each site for each of these three ratios. We then tracked the ratios above across each site individually during the period for which it was vulnerable and looked for anomalies that may suggest a hacker was exploiting the vulnerability ahead of its public disclosure.

This data is much more noisy than the sampled log data because it is rolled up and averaged over one hour windows. However, we have not seen any evidence of exploitation of this bug from this data.

Reviewing Crash Data

Lastly, when the bug was triggered it would, depending on what was read from memory, sometimes cause our parser application to crash. We have technical operations logs that record every time an application running on our network crashes. These logs cover the entire period of time the bug was in production (22 September 2016 – 18 February 2017).

We ran a suite of known-vulnerable HTML through our test platform to establish the percentage of time that we would expect the application to crash.

We reviewed our application crash logs for the entire period the bug was in production. We did turn up periodic instances of the parser crashing that align with the frequency of how often we estimate the bug was triggered. However, we did not see a signal in the crash data that would indicate that the bug was being actively exploited at any point during the period it was present in our system.

Purging Search Engine Caches

Even if an attacker wasn’t actively exploiting the bug prior to our patching it, there was still potential harm because private data leaked and was cached by various automated crawlers. Because the 6,457 sites that could trigger the bug were generally small, the largest share of their traffic came from search engine crawlers. Of the 1,242,071 requests that triggered the bug, we estimate more than half came from search engine crawlers.

Cloudflare has spent the last 12 days working with various search engines — including Google, Bing, Yahoo, Baidu, Yandex, DuckDuckGo, and others — to clear their caches. We were able to remove the majority of the cached pages before the disclosure of the bug last Thursday.

Since then, we’ve worked with major search engines as well as other online archives to purge cached data. We’ve successfully removed more than 80,000 unique cached pages. That underestimates the total number because we’ve requested search engines purge and recrawl entire sites in some instances. Cloudflare customers who discover leaked data still online can report it by sending our team a link to the cache, and we will work to have it purged.

Analysis of What Data Leaked

The search engine caches provide us an opportunity to analyze what data leaked. While many have speculated that any data passing through Cloudflare may have been exposed, the way that data is structured in memory and the frequency of GET versus POST requests makes certain data more or less likely to be exposed. We analyzed a representative sample of the cached pages retrieved from search engine caches and ran a thorough analysis on each of them. The sample included thousands of pages and was statistically significant to a confidence level of 99% with a margin of error of 2.5%. Within that sample we would expect the following data types to appear this many times in any given leak:

Data Type                          Average per Leak
---------                          ----------------
Internal Cloudflare Headers        67.54
Cookies                            0.44
Authorization Headers / Tokens     0.04
Passwords                          0
Credit Cards / Bitcoin Addresses   0
Health Records                     0
Social Security Numbers            0
Customer Encryption Keys           0

The above can be read to mean that in any given leak you would expect to find 67.54 Cloudflare internal headers. You’d expect to find a cookie in approximately half of all leaks (0.44 cookies per leak). We did not find any passwords, credit cards, health records, social security numbers, or customer encryption keys in the sample set.
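As a sanity check on the stated precision, the standard sample-size formula for a proportion, n = z^2 * p(1-p) / e^2, with worst-case p = 0.5 and z ≈ 2.576 for 99% confidence, implies a sample in the mid-thousands, consistent with reviewing thousands of pages:

```c
/* Minimum sample size to estimate a proportion within a margin of
 * error e at a confidence given by z-score z, worst case p = 0.5. */
double min_sample_size(double z, double e) {
    double p = 0.5;
    return (z * z * p * (1.0 - p)) / (e * e);
}
```

With z = 2.576 (99% confidence) and e = 0.025, this comes out to roughly 2,650 pages.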

Since this is just a sample, it is not correct to conclude that no passwords, credit cards, health records, social security numbers, or customer encryption keys were ever exposed. However, if there was any exposure, based on the data we’ve reviewed, it does not appear to have been widespread. We have also not had any confirmed reports of third parties discovering any of these sensitive data types on any cached pages.

These findings generally make sense given what we know about traffic to Cloudflare sites. Based on our logs, the ratio of GET to POST requests across our network is approximately 100-to-1. Since POSTs are more likely to contain sensitive data like passwords, we estimate that reduces the potential exposure of the most sensitive data from 1,242,071 requests to closer to 12,420. POSTs that contain particularly sensitive information would then represent only a fraction of the 12,420 we would expect to have leaked.

This is not to downplay the seriousness of the bug. For instance, depending on how a Cloudflare customer’s systems are implemented, cookie data, which would be present in GET requests, could be used to impersonate another user’s session. We’ve seen approximately 150 Cloudflare customers’ data in the more than 80,000 cached pages we’ve purged from search engine caches. When data for a customer is present, we’ve reached out to the customer proactively to share the data that we’ve discovered and help them work to mitigate any impact. Generally, if customer data was exposed, invalidating session cookies and rolling any internal authorization tokens is the best advice to mitigate the largest potential risk based on our investigation so far.

How to Understand Your Risk

We have tried to quantify the risk to individual customers that their data may have leaked. Generally, the more requests that a customer sent to Cloudflare, the more likely it is that their data would have been in memory and therefore exposed. This is anecdotally confirmed by the 150 customers whose data we’ve found in third party caches. The customers whose data appeared in caches are typically the customers that send the most requests through Cloudflare’s network.

Probabilistically, we are able to estimate the likelihood of data leaking for a particular customer based on the number of requests per month (RPM) that they send through our network since the more requests sent through our network the more likely a customer’s data is to be in memory when the bug was triggered. Below is a chart of the number of total anticipated data leak events from 22 September 2016 – 18 February 2017 that we would expect based on the average number of requests per month a customer sends through Cloudflare’s network:

Requests per Month   Anticipated Leaks
------------------   -----------------
200B – 300B          22,356 – 33,534
100B – 200B          11,427 – 22,356
50B – 100B           5,962 – 11,427
10B – 50B            1,118 – 5,926
1B – 10B             112 – 1,118
500M – 1B            56 – 112
250M – 500M          25 – 56
100M – 250M          11 – 25
50M – 100M           6 – 11
10M – 50M            1 – 6
< 10M                < 1

More than 99% of Cloudflare’s customers send fewer than 10 million requests per month. At that level, probabilistically we would expect that they would have no data leaked during the period the bug was present. For further context, the 100th largest website in the world is estimated to handle fewer than 10 billion requests per month, so there are very few of Cloudflare’s 6 million customers that fall into the top bands of the chart above. Cloudflare customers can find their own RPM by logging into the Cloudflare Analytics Dashboard and looking at the number of requests per month for their sites.
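The chart is consistent with a simple proportional model: each of the roughly 1.24 million triggers leaked whichever customer's data was adjacent, with probability roughly equal to that customer's share of total network traffic. In this sketch the network-wide figure is back-solved from the chart, not a published number:

```c
#define TOTAL_TRIGGERS  1242071.0  /* estimated bug triggers, 22 Sep 2016 - 18 Feb 2017 */
#define NETWORK_RPM     1.1e13     /* assumed network-wide requests/month (back-solved) */

/* Expected leak events for a customer sending `rpm` requests/month:
 * total triggers times that customer's share of traffic in memory. */
double expected_leaks(double rpm) {
    return TOTAL_TRIGGERS * (rpm / NETWORK_RPM);
}
```

Plugging in 1 billion requests per month gives about 113 expected leaks, and 10 million gives about 1, matching the corresponding bands in the chart.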

The statistics above assume that each leak contained only one customer’s data. That was true for nearly all of the leaks we reviewed from search engine caches. However, there were instances where more data may have been leaked. The probability table above should be considered just an estimate to help provide some general guidance on the likelihood a customer’s data would have leaked.

Interim Conclusion

We are continuing to work with third party caches to expunge leaked data and will not let up until every bit has been removed. We also continue to analyze Cloudflare’s logs and the particular requests that triggered the bug for anomalies. While we were able to mitigate this bug within minutes of it being reported to us, we want to ensure that other bugs are not present in the code. We have undertaken a full review of the parser code to look for any additional potential vulnerabilities. In addition to our own review, we're working with the outside code auditing firm Veracode to review our code.

Cloudflare’s mission is to help build a better Internet. Everyone on our team comes to work every day to help our customers — regardless of whether they are businesses, non-profits, governments, or hobbyists — run their corner of the Internet a little better. This bug exposed just how much of the Internet puts its trust in us. We know we disappointed you and we apologize. We will continue to share what we discover because we believe trust is critical and transparency is the foundation of that trust.

Categories: Technology

Incident report on memory leak caused by Cloudflare parser bug

Thu, 23/02/2017 - 23:01

Last Friday, Tavis Ormandy from Google’s Project Zero contacted Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare.

It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.

For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.

We quickly identified the problem and turned off three minor Cloudflare features (email obfuscation, Server-side Excludes and Automatic HTTPS Rewrites) that were all using the same HTML parser chain that was causing the leakage. At that point it was no longer possible for memory to be returned in an HTTP response.

Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.

Having a global team meant that, at 12 hour intervals, work was handed over between offices enabling staff to work on the problem 24 hours a day. The team has worked continuously to ensure that this bug and its consequences are fully dealt with. One of the advantages of being a service is that bugs can go from reported to fixed in minutes to hours instead of months. The industry standard time allowed to deploy a fix for a bug like this is usually three months; we were completely finished globally in under 7 hours with an initial mitigation in 47 minutes.

The bug was serious because the leaked memory could contain private information and because it had been cached by search engines. We have not discovered any evidence of malicious exploits of the bug or other reports of its existence.

The greatest period of impact was from February 13 to February 18, with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).

We are grateful that it was found by one of the world’s top security research teams and reported to us.

This blog post is rather long but, as is our tradition, we prefer to be open and technically detailed about problems that occur with our service.

Parsing and modifying HTML on the fly

Many of Cloudflare’s services rely on parsing and modifying HTML pages as they pass through our edge servers. For example, we can insert the Google Analytics tag, safely rewrite http:// links to https://, exclude parts of a page from bad bots, obfuscate email addresses, enable AMP, and more by modifying the HTML of a page.

To modify the page, we need to read and parse the HTML to find elements that need changing. Since the very early days of Cloudflare, we’ve used a parser written using Ragel. A single .rl file contains an HTML parser used for all the on-the-fly HTML modifications that Cloudflare performs.

About a year ago we decided that the Ragel parser had become too complex to maintain and we started to write a new parser, named cf-html, to replace it. This streaming parser works correctly with HTML5 and is much, much faster and easier to maintain.

We first used this new parser for the Automatic HTTPS Rewrites feature and have been slowly migrating functionality that uses the old Ragel parser to cf-html.

Both cf-html and the old Ragel parser are implemented as NGINX modules compiled into our NGINX builds. These NGINX filter modules parse buffers (blocks of memory) containing HTML responses, make modifications as necessary, and pass the buffers onto the next filter.

It turned out that the underlying bug that caused the memory leak had been present in our Ragel-based parser for many years but no memory was leaked because of the way the internal NGINX buffers were used. Introducing cf-html subtly changed the buffering which enabled the leakage even though there were no problems in cf-html itself.

Once we knew that the bug was being caused by the activation of cf-html (but before we knew why) we disabled the three features that caused it to be used. Every feature Cloudflare ships has a corresponding feature flag, which we call a ‘global kill’. We activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, thus disabling it quickly stopped almost all memory leaks.

Within a few seconds, those features were disabled worldwide. We confirmed we were not seeing memory leakage via test URIs and had Google double check that they saw the same thing.

We then discovered that a third feature, Server-Side Excludes, was also vulnerable and did not have a global kill switch (it was so old it preceded the implementation of global kills). We implemented a global kill for Server-Side Excludes and deployed a patch to our fleet worldwide. From realizing Server-Side Excludes were a problem to deploying a patch took roughly three hours. However, Server-Side Excludes are rarely used and only activated for malicious IP addresses.
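A "global kill" can be pictured as a per-feature flag consulted before the feature's code path runs, so a single flag flip disables the path fleet-wide. This is a hypothetical sketch; the names and mechanism are invented:

```c
/* Per-feature kill switches. Flipping one disables that feature's
 * code path everywhere without redeploying the software. */
enum feature {
    FEAT_EMAIL_OBFUSCATION,
    FEAT_AUTO_HTTPS_REWRITES,
    FEAT_SERVER_SIDE_EXCLUDES,
    FEAT_COUNT
};

static int global_kill[FEAT_COUNT];  /* 1 = feature disabled everywhere */

int feature_enabled(enum feature f) {
    return !global_kill[f];
}

void activate_global_kill(enum feature f) {
    global_kill[f] = 1;
}
```

The value of the pattern is exactly what the incident showed: a feature that predates the mechanism (as Server-Side Excludes did) needs a patch and deploy, while a flagged feature can be switched off in seconds.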

Root cause of the bug

The Ragel code is converted into generated C code which is then compiled. The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control of the movement of those pointers. The underlying bug occurs because of a pointer error.

/* generated code */ if ( ++p == pe ) goto _test_eof;

The root cause of the bug was that reaching the end of a buffer was checked using the equality operator, and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using >= instead of ==, jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that we wrote. This indicated that we were not using Ragel correctly.

The Ragel code we wrote contained a bug that caused the pointer to jump over the end of the buffer and past the ability of an equality check to spot the buffer overrun.

Here’s a piece of Ragel code used to consume an attribute in an HTML <script> tag. The first line says that it should attempt to find zero or more unquoted_attr_char followed by (that’s the :>> concatenation operator) whitespace, a forward slash, or > signifying the end of the tag.

script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
    >{ ddctx("script consume_attr"); }
    @{ fhold; fgoto script_tag_parse; }
    $lerr{ dd("script consume_attr failed");
           fgoto script_consume_attr; };

If an attribute is well-formed, then the Ragel parser moves to the code inside the @{ } block. If the attribute fails to parse (which is the start of the bug we are discussing today) then the $lerr{ } block is used.

For example, in certain circumstances (detailed below) if the web page ended with a broken HTML tag like this:

<script type=

the $lerr{ } block would get used and the buffer would be overrun. In this case the $lerr does dd("script consume_attr failed"); (that's a debug logging statement that is a nop in production) and then does fgoto script_consume_attr; (the state transitions to script_consume_attr to parse the next attribute).

From our statistics it appears that such broken tags at the end of the HTML occur on about 0.06% of websites.

If you have a keen eye you may have noticed that the @{ } transition also did a fgoto but right before it did fhold and the $lerr{ } block did not. It’s the missing fhold that resulted in the memory leakage.

Internally, the generated C code has a pointer named p that is pointing to the character being examined in the HTML document. fhold is equivalent to p-- and is essential because when the error condition occurs p will be pointing to the character that caused the script_consume_attr to fail.

And it’s doubly important because if this error condition occurs at the end of the buffer containing the HTML document then p will be after the end of the document (p will be pe + 1 internally) and a subsequent check that the end of the buffer has been reached will fail and p will run outside the buffer.

Adding an fhold to the error handler fixes the problem.

Why now

That explains how the pointer could run past the end of the buffer, but not why the problem suddenly manifested itself. After all, this code had been in production and stable for years.

Returning to the script_consume_attr definition above:

script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>')) >{ ddctx("script consume_attr"); } @{ fhold; fgoto script_tag_parse; } $lerr{ dd("script consume_attr failed"); fgoto script_consume_attr; };

What happens when the parser runs out of characters to parse while consuming an attribute differs depending on whether the buffer currently being parsed is the last buffer or not. If it’s not the last buffer, then there’s no need to use $lerr: the parser doesn’t yet know whether an error has occurred, because the rest of the attribute may be in the next buffer.

But if this is the last buffer, then the $lerr is executed. Here’s how the code ends up skipping over the end-of-file and running through memory.

The entry point to the parsing function is ngx_http_email_parse_email (the name is historical, it does much more than email parsing).

ngx_int_t ngx_http_email_parse_email(ngx_http_request_t *r,
                                     ngx_http_email_ctx_t *ctx)
{
  u_char *p = ctx->pos;
  u_char *pe = ctx->buf->last;
  u_char *eof = ctx->buf->last_buf ? pe : NULL;

You can see that p points to the first character in the buffer, pe to the character after the end of the buffer and eof is set to pe if this is the last buffer in the chain (indicated by the last_buf boolean), otherwise it is NULL.

When the old and new parsers are both present during request handling a buffer such as this will be passed to the function above:

(gdb) p *in->buf
$8 = {
  pos = 0x558a2f58be30 "<script type=\"",
  last = 0x558a2f58be3e "",
  [...]
  last_buf = 1,
  [...]
}

Here there is data and last_buf is 1. When the new parser is not present the final buffer that contains data looks like this:

(gdb) p *in->buf
$6 = {
  pos = 0x558a238e94f7 "<script type=\"",
  last = 0x558a238e9504 "",
  [...]
  last_buf = 0,
  [...]
}

A final empty buffer (pos and last both NULL and last_buf = 1) will follow that buffer but ngx_http_email_parse_email is not invoked if the buffer is empty.

So, in the case where only the old parser is present, the final buffer that contains data has last_buf set to 0. That means that eof will be NULL. Now when trying to handle script_consume_attr with an unfinished tag at the end of the buffer the $lerr will not be executed because the parser believes (because of last_buf) that there may be more data coming.

The situation is different when both parsers are present. last_buf is 1, eof is set to pe and the $lerr code runs. Here’s the generated code for it:

/* #line 877 "ngx_http_email_filter_parser.rl" */ { dd("script consume_attr failed"); {goto st1266;} } goto st0; [...] st1266: if ( ++p == pe ) goto _test_eof1266;

The parser runs out of characters while trying to perform script_consume_attr and p will be pe when that happens. Because there’s no fhold (that would have done p--) when the code jumps to st1266 p is incremented and is now past pe.

It then won’t jump to _test_eof1266 (where EOF checking would have been performed) and will carry on past the end of the buffer trying to parse the HTML document.

So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.

Going bug hunting

Research by IBM in the 1960s and 1970s showed that bugs tend to cluster in what became known as “error-prone modules”. Since we’d identified a nasty pointer overrun in the code generated by Ragel it was prudent to go hunting for other bugs.

Part of the infosec team started fuzzing the generated code to look for other possible pointer overruns. Another team built test cases from malformed web pages found in the wild. A software engineering team began a manual inspection of the generated code looking for problems.

At that point it was decided to add explicit pointer checks to every pointer access in the generated code to prevent any future problem and to log any errors seen in the wild. The errors generated were fed to our global error logging infrastructure for analysis and trending.

#define SAFE_CHAR ({\
  if (!__builtin_expect(p < pe, 1)) {\
    ngx_log_error(NGX_LOG_CRIT, r->connection->log, 0,\
                  "email filter tried to access char past EOF");\
    RESET();\
    output_flat_saved(r, ctx);\
    BUF_STATE(output);\
    return NGX_ERROR;\
  }\
  *p;\
})

And we began seeing log lines like this:

2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client:, server: localhost, request: "GET /malformed-test.html HTTP/1.1"

Every log line indicates an HTTP request that could have leaked private memory. By logging how often the problem was occurring we hoped to get an estimate of the number of times an HTTP request had leaked memory while the bug was present.

In order for the memory to leak the following had to be true:

The final buffer containing data had to finish with a malformed script or img tag
The buffer had to be less than 4k in length (otherwise NGINX would crash)
The customer had to either have Email Obfuscation enabled (because it uses both the old and new parsers as we transition), or have Automatic HTTPS Rewrites/Server-Side Excludes (which use the new parser) in combination with another Cloudflare feature that uses the old parser. And Server-Side Excludes only execute if the client IP has a poor reputation (i.e. they do not run for most visitors).

That explains why the buffer overrun resulting in a leak of memory occurred so infrequently.

Additionally, the Email Obfuscation feature (which uses both parsers and would have enabled the bug to happen on the greatest number of Cloudflare sites) was only enabled on February 13 (four days before Tavis’ report).

The three features implicated were rolled out as follows. The earliest date memory could have leaked is 2016-09-22.

2016-09-22 Automatic HTTPS Rewrites enabled
2017-01-30 Server-Side Excludes migrated to new parser
2017-02-13 Email Obfuscation partially migrated to new parser
2017-02-18 Google reports problem to Cloudflare and leak is stopped

The greatest potential impact occurred during the four days starting on February 13, because Automatic HTTPS Rewrites wasn’t widely used and Server-Side Excludes only activate for malicious IP addresses.

Internal impact of the bug

Cloudflare runs multiple separate processes on the edge machines and these provide process and memory isolation. The memory being leaked was from a process based on NGINX that does HTTP handling. It has a separate heap from processes doing SSL, image re-compression, and caching, which meant that we were quickly able to determine that SSL private keys belonging to our customers could not have been leaked.

However, the memory space being leaked did still contain sensitive information. One obvious piece of information that had leaked was a private key used to secure connections between Cloudflare machines.

When processing HTTP requests for customers’ web sites our edge machines talk to each other within a rack, within a data center, and between data centers for logging, caching, and to retrieve web pages from origin web servers.

In response to heightened concerns about surveillance activities against Internet companies, we decided in 2013 to encrypt all connections between Cloudflare machines to prevent such an attack even if the machines were sitting in the same rack.

The private key leaked was the one used for this machine to machine encryption. There were also a small number of secrets used internally at Cloudflare for authentication present.

External impact and cache clearing

More concerning was the fact that chunks of in-flight HTTP requests for Cloudflare customers were present in the dumped memory. That meant that information that should have been private could be disclosed.

This included HTTP headers, chunks of POST data (perhaps containing passwords), JSON for API calls, URI parameters, cookies and other sensitive information used for authentication (such as API keys and OAuth tokens).

Because Cloudflare operates a large, shared infrastructure, an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about another, unrelated Cloudflare site.

An additional problem was that Google (and other search engines) had cached some of the leaked memory through their normal crawling and caching processes. We wanted to ensure that this memory was scrubbed from search engine caches before the public disclosure of the problem so that third-parties would not be able to go hunting for sensitive information.

Our natural inclination was to get news of the bug out as quickly as possible, but we felt we had a duty of care to ensure that search engine caches were scrubbed before a public announcement.

The infosec team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, we found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines.

We also undertook other search expeditions looking for potentially leaked information on sites like Pastebin and did not find anything.

Some lessons

The engineers working on the new HTML parser had been so worried about bugs affecting our service that they had spent hours verifying that it did not contain security problems.

Unfortunately, it was the ancient piece of software that contained a latent security problem and that problem only showed up as we were in the process of migrating away from it. Our internal infosec team is now undertaking a project to fuzz older software looking for potential other security problems.

Detailed Timeline

We are very grateful to our colleagues at Google for contacting us about the problem and working closely with us through its resolution, all of which occurred without any reports that outside parties had identified the issue or exploited it.

All times are UTC.

2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information
2017-02-18 0032 Cloudflare receives details of bug from Google
2017-02-18 0040 Cross functional team assembles in San Francisco
2017-02-18 0119 Email Obfuscation disabled worldwide
2017-02-18 0122 London team joins
2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide
2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide

2017-02-20 2159 SAFE_CHAR fix deployed globally

2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide

NOTE: This post was updated to reflect updated information.

Categories: Technology

LuaJIT Hacking: Getting next() out of the NYI list

Tue, 21/02/2017 - 13:40

At Cloudflare we’re heavy users of LuaJIT and in the past have sponsored many improvements to its performance.

LuaJIT is a powerful piece of software, maybe the highest performing JIT in the industry. But it’s not always easy to get the most out of it, and sometimes a small change in one part of your code can negatively impact other, already optimized, parts.

One of the first pieces of advice anyone receives when writing Lua code to run quickly using LuaJIT is “avoid the NYIs”: the language or library features that can’t be compiled because they’re NYI (not yet implemented). And that means they run in the interpreter.

CC BY-SA 2.0 image by Dwayne Bent

Another very attractive feature of LuaJIT is the FFI library, which allows Lua code to directly interface with C code and memory structures. The JIT compiler weaves these memory operations in line with the generated machine language, making it much more efficient than using the traditional Lua C API.

Unfortunately, if for any reason the Lua code using the FFI library has to run under the interpreter, it takes a very heavy performance hit. As it happens, under the interpreter the FFI is usually much slower than the Lua C API or the basic operations. For many people, this means either avoiding the FFI or committing to a permanent vigilance to maintain the code from falling back to the interpreter.

Optimizing LuaJIT Code

Before optimizing any code, it’s important to identify which parts are actually important. It’s useless to discuss what’s the fastest way to add a few numbers before sending some data, if the send operation will take a million times longer than that addition. Likewise, there’s no benefit avoiding NYI features in code like initialization routines that might run only a few times, as it’s unlikely that the JIT would even try to optimize them, so they would always run in the interpreter. Which, by the way, is also very fast; even faster than the first version of LuaJIT itself.

But optimizing the core parts of a Lua program, like any deep inner loops, can yield huge improvements in the overall performance. In similar situations, experienced developers using other languages are used to inspecting the assembly language generated by the compiler, to see if there’s some change to the source code that can make the result better.

The command line LuaJIT executable provides a bytecode list when running with the -jbc option, a statistical profiler, activated with the -jp option, a trace list with -jv, and finally a detailed dump of all the JIT operations with -jdump.

The last two provide lots of information very useful for understanding what actually happens with the Lua code while executing, but it can be a lot of work to read the huge lists generated by -jdump. Also, some messages are hard to understand without a fairly complete understanding of how the tracing compiler in LuaJIT actually works.

One very nice feature is that all these JIT options are implemented in Lua. To accomplish this the JIT provides ‘hooks’ that can execute a Lua function at important moments with the relevant information. Sometimes the best way to understand what some -jdump output actually means is to read the code that generated that specific part of the output.

CC BY 2.0 image by Kevan

Introducing Loom

After several rounds of this, frustrated by the limitations of the sequentially-generated dump, I decided to write a different version of -jdump: one that gathers more information to process and adds cross-references to help see how things are related before displaying them. The result is loom, which shows roughly the same information as -jdump, but with more resolved references and formatted in HTML with tables, columns, links and colors. It has helped me a lot to understand my own code and the workings of LuaJIT itself.

For example, let's consider the following code in a file called twoloops.lua:

for i=1,1000 do for j=1,1000 do end end

With the -jv option:

$ luajit -jv twoloops.lua
[TRACE   1 twoloops.lua:2 loop]
[TRACE   2 (1/3) twoloops.lua:1 -> 1]

This tells us that there were two traces; the first one contains a loop, and the second one spawns from exit #3 of the first (the “(1/3)” part) and its endpoint returns to the start of trace #1.

Ok, let’s get more detail with -jdump:

$ luajit -jdump twoloops.lua
---- TRACE 1 start twoloops.lua:2
0009  FORL     4 => 0009
---- TRACE 1 IR
0001         int SLOAD  #5    CI
0002       + int ADD    0001  +1
0003 >       int LE     0002  +1000
0004 ------ LOOP ------------
0005       + int ADD    0002  +1
0006 >       int LE     0005  +1000
0007         int PHI    0002  0005
---- TRACE 1 mcode 47
0bcbffd1  mov dword [0x40db1410], 0x1
0bcbffdc  cvttsd2si ebp, [rdx+0x20]
0bcbffe1  add ebp, +0x01
0bcbffe4  cmp ebp, 0x3e8
0bcbffea  jg 0x0bcb0014 ->1
->LOOP:
0bcbfff0  add ebp, +0x01
0bcbfff3  cmp ebp, 0x3e8
0bcbfff9  jle 0x0bcbfff0 ->LOOP
0bcbfffb  jmp 0x0bcb001c ->3
---- TRACE 1 stop -> loop

---- TRACE 2 start 1/3 twoloops.lua:1
0010  FORL     0 => 0005
0005  KSHORT   4   1
0006  KSHORT   5 1000
0007  KSHORT   6   1
0008  JFORI    4 => 0010
---- TRACE 2 IR
0001         num SLOAD  #1    I
0002         num ADD    0001  +1
0003 >       num LE     0002  +1000
---- TRACE 2 mcode 81
0bcbff79  mov dword [0x40db1410], 0x2
0bcbff84  movsd xmm6, [0x41704068]
0bcbff8d  movsd xmm5, [0x41704078]
0bcbff96  movsd xmm7, [rdx]
0bcbff9a  addsd xmm7, xmm6
0bcbff9e  ucomisd xmm5, xmm7
0bcbffa2  jb 0x0bcb0014 ->1
0bcbffa8  movsd [rdx+0x38], xmm6
0bcbffad  movsd [rdx+0x30], xmm6
0bcbffb2  movsd [rdx+0x28], xmm5
0bcbffb7  movsd [rdx+0x20], xmm6
0bcbffbc  movsd [rdx+0x18], xmm7
0bcbffc1  movsd [rdx], xmm7
0bcbffc5  jmp 0x0bcbffd1
---- TRACE 2 stop -> 1

This tells us... well, a lot of things. If you look closely, you’ll see the same two traces, one is a loop, the second starts at 1/3 and returns to trace #1. Each one shows some bytecode instructions, an IR listing, and the final mcode. There are several options to turn on and off each listing, and more info like the registers allocated to some IR instructions, the “snapshot” structures that allow the interpreter to continue when a compiled trace exits, etc.

Now using loom:

There’s the source code, with the corresponding bytecodes, and the same two traces, with IR and mcode listings. The bytecode lines in the traces and in the top listing are linked. Hovering on some arguments in the IR listing highlights the source and use of each value, and the jumps between traces are correctly labeled (and colored). Finally, clicking on the bytecode or IR column headers reveals more information: excerpts from the source code and snapshot formats, respectively.

Writing it was a great learning experience, I had to read the dump script’s Lua sources and went much deeper in the LuaJIT sources than ever before. And then, I was able to use loom not only to analyze and optimize Cloudflare’s Lua code, but also to watch the steps the compiler goes through to make it run fast, and also what happens when it’s not happy.

The code is the code is the code is the code

LuaJIT handles up to four different representations of a program’s code:

First comes the source code, what the developer writes.

The parser analyzes the source code and produces the Bytecode, which is what the interpreter actually executes. It has the same flow as the source code, grouped in functions, with all the calls, iterators, operations, etc. Of course, there’s no nice formatting or comments, the local variable names are replaced by indices, and all constants (other than small numbers) are stored in a separate area.

When the interpreter finds that a given point of the bytecode has been executed several times, it considers it a “hot” part of the code and interprets it once again, but this time it records each bytecode it encounters, generating a “code trace” or just “a trace”. At the same time, it generates an “intermediate representation”, or IR, of the code as it’s executed. The IR doesn’t represent the whole of the function or code portion, just the path it actually takes.

A trace is finished when it hits a loop or a recursion, returns to a lower level than when started, hits a NYI operation, or simply becomes too long. At this point, it can be either compiled into machine language, or aborted if it has reached some code that can’t be correctly translated. If successful, the bytecode is patched with an entry to the machine code, or “mcode”. If aborted, the initial trace point is “penalized” or even “blacklisted” to avoid wasting time trying to compile it again.

What’s next()?

One of the most visible characteristics of the Lua language is the heavy use of dictionary objects called tables. From the Lua manual:

“Tables are the sole data structuring mechanism in Lua; they can be used to represent ordinary arrays, symbol tables, sets, records, graphs, trees, etc.”

To iterate over all the elements in a table, the idiomatic way is to use the standard library function pairs() like this:

for k, v in pairs(t) do -- use the key in ‘k’ and the value in ‘v’ end

In the standard Lua manual, pairs() is defined as “Returns three values: the next function, the table t, and nil”, so the previous code is the same as:

for k, v in next, t, nil do -- use the key in ‘k’ and the value in ‘v’ end

But unfortunately, both the next() and pairs() functions are listed as “not compiled” in the feared NYI list. That means that any such code runs on the interpreter, unless the code inside is complex enough and has other inner loops (loops that don’t use next() or pairs(), of course). Even in that case, the code would have to fall back to the interpreter at each loop end.

This sad news creates a tradeoff: for performance sensitive parts of the code, don’t use the most Lua-like code style. That motivates people to come up with several contortions to be able to use numerical iteration (which is compiled, and very efficient), like replacing any key with a number, storing all the keys in a numbered array, or store both keys and values at even/odd numeric indices.

Getting next() out of the NYI list

So, I finally have a non-NYI next() function! I'd like to say "a fully JITtable next() function", but it wouldn't be totally true; as it happens, there's no way to avoid some annoying trace exits on table iteration.

The purpose of the IR is to provide a representation of the execution path so it can be quickly optimized to generate the final mcode. For that, the IR traces are linear and type-specific; creating some interesting challenges for iteration on a generic container.

Traces are linear

Being linear means that each trace captures a single execution path, it can't contain conditional code or internal jumps. The only conditional branches are the "guards" that make sure that the code to be executed is the appropriate one. If a condition changes and it must now do something different, the trace must be exited. If it happens several times, it will spawn a side trace and the exit will be patched into a conditional branch. Very nice, but this still means that there can be at most one loop on each trace.

The implementation of next() has to internally skip over empty slots in the table to only return valid key/value pairs. If we try to express this in IR code, this would be the "inner" loop and the original loop would be an "outer" one, which doesn't have as much optimization opportunities. In particular, it can't hoist invariable code out of the loop.

The solution is to do that slot skipping in C. Not using the Lua C API, of course, but the inner IR CALL instruction that is compiled into a "fast" call, using CPU registers for arguments as much as possible.

The IR is in Type-specific SSA form

The SSA form (Static Single Assignment) is key for many data flow analysis heuristics that allow quick optimizations like dead code removal, allocation sinking, type narrowing, strength reduction, etc. In LuaJIT's IR it means every instruction is usable as a value for subsequent instructions and has a declared type, fixed at the moment when the trace recorder emits this particular IR instruction. In addition, every instruction can be a type guard, if the arguments are not of the expected type the trace will be exited.

Lua is dynamically typed, every value is tagged with type information so the bytecode interpreter can apply the correct operations on it. This allows us to have variables and tables that can contain and pass around any kind of object without changing the source code. Of course, this requires the interpreter to be coded very "defensively", to consider all valid ramifications of every instruction, limiting the possibility of optimizations. The IR traces, on the other hand, are optimized for a single variation of the code, and deal with only the value types that are actually observed while executing.

For example, this simple code creates a 1,000 element array and then copies to another table:

local t,t2 = {},{}
for i=1,1000 do t[i] = i end
for i,v in ipairs(t) do t2[i]=v end

resulting in this IR for the second loop, the one that does the copy:

0023 ------------ LOOP ------------
0024         num CONV   0017
0025 >       int ABC    0005  0017
0026         p32 AREF   0007  0017
0027         num ASTORE 0026  0022
0028 rbp   + int ADD    0017  +1
0029 >       int ABC    0018  0028
0030         p32 AREF   0020  0028
0031 xmm7 >+ num ALOAD  0030
0032 xmm7    num PHI    0022  0031
0033 rbp     int PHI    0017  0028
0034 rbx     nil RENAME 0017  #3
0035 xmm6    nil RENAME 0022  #2

Here we see that the ALOAD at instruction 0031 guards that the value loaded from the table is indeed a number. If it happens to be any other kind of value, the guard fails and the trace is exited.

But what if we build an array of strings instead of numbers?

a small change:

local t,t2 = {},{}
for i=1,1000 do t[i] = 's'..i end
for i,v in ipairs(t) do t2[i]=v end

gives us this:

0024 ------------ LOOP ------------
0025         num CONV   0018
0026 >       int ABC    0005  0018
0027         p32 AREF   0007  0018
0028         str ASTORE 0027  0023
0029 rbp   + int ADD    0018  +1
0030 >       int ABC    0019  0029
0031         p32 AREF   0021  0029
0032 rbx  >+ str ALOAD  0031
0033 rbx     str PHI    0023  0032
0034 rbp     int PHI    0018  0029
0035 r15     nil RENAME 0018  #3
0036 r14     nil RENAME 0023  #2

It's the same code, but the type that ALOAD is guarding is now a string (and it now uses a different register, I guess a vector register isn't appropriate for a string pointer).

And what if the table has values of a mix of types?

local t,t2={},{}
for i=1,1000,2 do t[i], t[i+1] = i, 's'..i end
for i,v in ipairs(t) do t2[i]=v end

0031 ------------ LOOP ------------
0032         num CONV   0027
0033 >       int ABC    0005  0027
0034         p32 AREF   0007  0027
0035         str ASTORE 0034  0030
0036 r15     int ADD    0027  +1
0037 >       int ABC    0019  0036
0038         p32 AREF   0021  0036
0039 xmm7 >  num ALOAD  0038
0040 >       int ABC    0005  0036
0041         p32 AREF   0007  0036
0042         num ASTORE 0041  0039
0043 rbp   + int ADD    0027  +2
0044 >       int ABC    0019  0043
0045         p32 AREF   0021  0043
0046 rbx  >+ str ALOAD  0045
0047 rbx     str PHI    0030  0046
0048 rbp     int PHI    0027  0043

Now there are two ALOADs, (and two ASTOREs), one for 'num' and one for 'str'. In other words, the JIT unrolled the loop and found that that made the types constant. =8-O

Of course, this would happen only on very simple and regular patterns. In general, it's wiser to avoid unpredictable type mixing; but polymorphic code will be optimized for each type that it's actually used with.

Back to next()

First let's see the current implementation of next() as used by the interpreter:

lj_tab.c

/* Advance to the next step in a table traversal. */
int lj_tab_next(lua_State *L, GCtab *t, TValue *key)
{
  uint32_t i = keyindex(L, t, key);  /* Find predecessor key index. */
  for (i++; i < t->asize; i++)  /* First traverse the array keys. */
    if (!tvisnil(arrayslot(t, i))) {
      setintV(key, i);
      copyTV(L, key+1, arrayslot(t, i));
      return 1;
    }
  for (i -= t->asize; i <= t->hmask; i++) {  /* Then traverse the hash keys. */
    Node *n = &noderef(t->node)[i];
    if (!tvisnil(&n->val)) {
      copyTV(L, key, &n->key);
      copyTV(L, key+1, &n->val);
      return 1;
    }
  }
  return 0;  /* End of traversal. */
}

It takes the input key as a TValue pointer and calls keyindex(). This helper function searches for the key in the table and returns an index; if the key is an integer in the range of the array part, the index is the key itself. If not, it performs a hash query and returns the index of the Node, offset by the array size, if successful, or signals an error if not found (it's an error to give a nonexistent key to next()).

Back at lj_tab_next(), the index is first incremented and, if it’s still within the array part, iterated over any holes until a non-nil value is found. If it wasn’t in the array (or there’s no next value there), a similar “skip the nils” pass is performed on the Node table.

The new lj_record_next() function in lj_record.c, like some other record functions there, first checks not only the input parameters, but also the return values to generate the most appropriate code for this specific iteration, assuming that it will likely be optimal for subsequent iterations. Of course, any such assumption must be backed by the appropriate guard.

For next(), we choose between two different forms. If the returned key is in the array part, then we use lj_tab_nexta(), which takes the input key as an integer and returns the next key, also as an integer, in the rax register. We don’t do the equivalent of the keyindex() function, just check (with a guard) that the key is within the bounds of the array:

lj_tab.c

/* Get the next array index */
MSize LJ_FASTCALL lj_tab_nexta(GCtab *t, MSize k)
{
  for (k++; k < t->asize; k++)
    if (!tvisnil(arrayslot(t, k)))
      break;
  return k;
}

The IR code looks like this:

0014 r13     int FLOAD  0011  tab.asize
0015 rsi >   int CONV   0012  int.num
0017 rax +   int CALLL  lj_tab_nexta  (0011 0015)
0018 >       int ABC    0014  0017
0019 r12     p32 FLOAD  0011  tab.array
0020         p32 AREF   0019  0017
0021 [8] >+  num ALOAD  0020

Clearly, the CALL itself (at 0017) is typed as 'int', as natural for an array key; and the ALOAD (0021) is 'num', because that's what the first few values happened to be.

When we finish with the array part, the bounds check (instruction ABC on 0018) would fail and soon new IR would be generated. This time we use the lj_tab_nexth() function.

lj_tab.c

LJ_FUNCA const Node *LJ_FASTCALL lj_tab_nexth(lua_State *L, GCtab *t,
                                              const Node *n)
{
  const Node *nodeend = noderef(t->node)+t->hmask;
  for (n++; n <= nodeend; n++) {
    if (!tvisnil(&n->val)) {
      return n;
    }
  }
  return &G(L)->nilnode;
}

But before doing the "skip the nils", we need to do a hash query to find the initial Node entry. Fortunately, the HREF IR instruction does exactly that. This is the IR:

0014 rdx     p32 HREF   0011  0012
0016 r12     p32 CALLL  lj_tab_nexth  (0011 0014)
0017 rax >+  str HKLOAD 0016
0018 [8] >+  num HLOAD  0016

There's a funny thing here: HREF is supposed to return a reference to a value in the hash table, and the last argument in lj_tab_nexth() is a Node pointer. Let's see the Node definition:

lj_obj.h

/* Hash node. */
typedef struct Node {
  TValue val;      /* Value object. Must be first field. */
  TValue key;      /* Key object. */
  MRef next;       /* Hash chain. */
#if !LJ_GC64
  MRef freetop;    /* Top of free elements (stored in t->node[0]). */
#endif
} Node;

Ok... the value is the first field, and it says right there "Must be first field". Looks like it's not the first place with some hand-wavy pointer casts.

The return value of lj_tab_nexth() is a Node pointer, which can likewise be implicitly cast by HLOAD to get the value. To get the key, I added the HKLOAD instruction. Both guard for the expected types of the value and key, respectively.

Let's take it for a spin

So, how does it perform? These tests do a thousand loops over a 10,000 element table, first using next() and then pairs(), with a simple addition in the inner loop. To get pairs() compiled, I just disabled the ISNEXT/ITERN optimization, so it actually uses next(). In the third test the variable in the addition is initialized to 0ULL instead of just 0, triggering the use of FFI.

The first test is with all 10,000 elements on sequential integer keys, making the table a valid sequence, so ipairs() (which is already compiled) can be used just as well:

So, compiled next() is quite a lot faster, but the pairs() optimization in the interpreter is very fast. On the other hand, the smallest smell of FFI completely trashes interpreter performance, while making compiled code slightly tighter. Finally, ipairs() is faster, but a big part of it is because it stops on the first nil, while next() has to skip over every nil at the end of the array, which by default can be up to twice as big as the sequence itself.

Now with 5,000 (sequential) integer keys and 5,000 string keys. Of course, we can't use ipairs() here:

Roughly the same pattern: the compiled next() performance is very much the same on the three forms (used directly, under pairs() and with FFI code), while the interpreter benefits from the pairs() optimization and almost dies with FFI. In this case, the interpreted pairs() actually surpasses the compiled next() performance, hinting that separately optimizing pairs() is still desirable.

A big factor in the interpreter pairs() is that it doesn't use next(); instead it directly drives the loop with a hidden variable to iterate in the Node table without having to perform a hash lookup on every step.

Repeating that in a compiled pairs() would be equally beneficial; but has to be done carefully to maintain compatibility with the interpreter. On any trace exit the interpreter would kick in and must be able to seamlessly continue iterating. For that, the rest of the system has to be aware of that hidden variable.

The best part of this is that we have lots of very challenging, yet deeply rewarding, work ahead of us! Come work for us on making LuaJIT faster and more.

Categories: Technology

You can now use Google Authenticator and any TOTP app for Two-Factor Authentication

Thu, 16/02/2017 - 21:52

Since the very beginning, Cloudflare has offered two-factor authentication with Authy, and starting today we are expanding your options to keep your account safe with Google Authenticator and any Time-based One Time Password (TOTP) app of your choice.

If you want to get started right away, visit your account settings. Setting up Two-Factor with Google Authenticator or with any TOTP app is easy - just use the app to scan the barcode you see in the Cloudflare dashboard, enter the code the app returns, and you’re good to go.

Importance of Two-Factor Authentication

Often when you hear that an account was ‘hacked’, it really means that the password was stolen.

If the media stopped saying 'hacking' and instead said 'figured out their password', people would take password security more seriously.

— Khalil Sehnaoui (@sehnaoui) January 5, 2017

Two-Factor authentication is sometimes thought of as something that should be used to protect important accounts, but the best practice is to always enable it when it is available. Without a second factor, any mishap involving your password can lead to a compromise. Journalist Mat Honan’s high profile compromise in 2012 is a great example of the importance of two-factor authentication. When he later wrote about the incident he said, "Had I used two-factor authentication for my Google account, it’s possible that none of this would have happened."

What is a TOTP app?

TOTP (Time-based One Time Password) is the mechanism that Google Authenticator, Authy and other two-factor authentication apps use to generate short-lived authentication codes. We’ve written previously on the blog about how TOTP works.
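To make the mechanism concrete, here is a minimal Python sketch of TOTP (RFC 6238, built on RFC 4226's HOTP): the shared secret from the scanned barcode is combined with the current 30-second time step via HMAC-SHA1, then dynamically truncated to a short code. This illustrates the algorithm only; it is not Cloudflare's implementation.

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, period=30, digits=6, now=None):
    """Compute an RFC 6238 TOTP code from a base32-encoded shared secret."""
    # Decode the base32 secret (re-add any stripped '=' padding).
    key = base64.b32decode(secret_b32.upper() + "=" * (-len(secret_b32) % 8))
    # The moving factor is the number of time steps since the Unix epoch.
    counter = int((time.time() if now is None else now) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): low nibble of the last byte picks an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)
```

The server stores the same secret, computes the same code for the current time step (usually allowing one step of clock skew either way), and compares.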

We didn’t want to limit you to only using two-factor providers that we'd built integrations with, so we built an open TOTP integration in the Cloudflare dashboard, allowing you to set up two-factor with any app that implements TOTP. That means you can choose from a wide array of apps for logging into Cloudflare securely with two-factor such as Symantec, Duo Mobile and 1Password.

Get Started

If you want to enable Two-Factor Authentication with Google Authenticator or any other TOTP provider, visit your account settings here. It’s easy to set up and the best way to secure your account. We also have step by step instructions for you in our knowledge base.

Categories: Technology

Discovering Great Talent with Path Forward

Wed, 15/02/2017 - 19:20

Cloudflare's Path Forward Candidates with Janet

In the fall of 2016, I was just beginning my job search. I’d been lucky to lead HR at a number of great cutting-edge technology start-ups, and I was looking for my next adventure. I wanted to find a company that wasn’t just a great business--I wanted one that was also making a positive impact on the world, and one that had a mission I felt passionately about.

During my two decades running HR/People organizations, I’ve spent a lot of time working with--and talking to--parents in the workplace. I’ve been motivated to do so for a few reasons. According to the US census, mothers are the fastest-growing segment of the US workforce. Companies struggle to retain talented workers after they’ve become parents, especially mothers. It’s been reported that 43 percent of highly qualified women with children leave their careers. Millennials (who make up the majority of the US workforce) are reporting that they want to be more engaged parents and are placing a high value on companies that allow them to parent and still get promoted. Ultimately, I’ve come to believe that the skills you acquire while parenting are extremely relevant and valuable to the workforce.

So when Path Forward announced its launch partners in 2016, I read about the participating companies with great interest. And that is where I discovered Cloudflare! It immediately went to the top of my short list as I knew a company that valued a partnership with Path Forward was aligned with my values.

Path Forward is a nonprofit organization that aims to empower women and men to return to the paid workforce after taking two or more years away from their career to focus on caregiving. This could mean taking two years off to care for a child, or taking multiple years off to care for an elderly family member. Everyone in this program has put their careers on hold to care for the ones they love.

Candidates apply for various roles, undergo a series of interviews, and if selected, participate in an 18-week returnship. The goal, for both candidates and participating companies, is to ultimately hire the candidates for full-time employment.

At Cloudflare, we’re focused on helping to build a better Internet, and to do that, we need the best and brightest. Sometimes that means hiring people who have tenure and skills in a very specific field, and other times, that means bringing in people who can adapt quickly and think critically, contributing to both our culture and our company mission.

Path Forward & Cloudflare

Gloria Mancu re-entered the workforce as a tech support engineer at Cloudflare after taking 10 years off to care for her son. She had initially applied to Cloudflare after seeing a job opening online, and was later integrated into the Path Forward program because of its emphasis on returnships. “Being an intern is a tremendous opportunity, because you get a feeling for the group, the company, and the culture, firsthand. On the other hand, the employer gets to know you, so it goes both ways.”

The Path Forward program is indeed a two-way street. Yes, it helps people return to the workforce, but participants also bring a ton of value to their respective companies. Men and women who’ve taken time off to care for their families bring the kind of maturity and professionalism that only come with life experience.

Wanda Chiu, a software engineer on our Edge team, took time off initially to care for her ailing mother. She later decided to start a family and wanted to be there to watch her kids grow up. Fifteen years later, she says that Path Forward has helped her comfortably transition back into the workforce. “I wasn’t sure I was qualified to apply for software engineering positions, because the industry has adapted so much in the last 15 years and there are so many new tools,” she says. “Cloudflare was willing to give me the time to pick up the new skills I needed to succeed in this software engineering role and contribute to the team.”

It’s so crucial to note that a lot of people returning to the workforce think they have to start at square one. They’ll apply for entry-level positions, only to be bumped up to the next level based on their experience and age, and then finally rejected due to their employment gap. At Cloudflare, we really wanted to give people the opportunity to pick up where they left off and bring with them all of the life experience they’ve gained.

On Monday night, we hosted the Path Forward graduation at our headquarters in San Francisco and celebrated the work of the 10 participants from Demandbase, Coursera, and Zendesk, in addition to Cloudflare. Graduates snacked on hors d’oeuvres and discussed their returnship experiences, following a keynote from Tami Forman, the program’s executive director.

We’ve extended full-time offers to both Wanda and Gloria and look forward to continuing with the Path Forward program. We’re currently interviewing and plan to welcome a new group of Path Forward participants in April. We have five open returnship positions across our Marketing, Engineering, and People teams in San Francisco, so if you or someone you know is interested, please reach out to Ed Burns at

Categories: Technology

NCC Group's Cryptography Services audits our Go TLS 1.3 stack

Wed, 15/02/2017 - 00:49

The Cloudflare TLS 1.3 beta is run by a Go implementation of the protocol based on the Go standard library, crypto/tls. Starting from that excellent Go codebase allowed us to quickly start experimenting, to be the first wide server deployment of the protocol, and to effectively track the changes to the specification draft.

Of course, the security of a TLS implementation is critical, so we engaged NCC Group's Cryptography Services to perform an audit at the end of 2016.

You can find the codebase on the Cloudflare GitHub. It's a drop-in replacement for crypto/tls and comes with a go wrapper to patch the standard library as needed.

The code is developed in the open but is currently targeted only at internal use: the repository is frequently rebased and the API is not guaranteed to be stable or fully documented. You can take a sneak peek at the API here.

The final goal is to upstream the patches to the Go project so that all users of the Go standard library benefit from it. You can follow the process here.

Below we republish the article about the audit, which first appeared on the NCC Group blog.

NCC Group's Cryptography Services Complete an Audit of Cloudflare's TLS1.3

NCC Group's Cryptography Services practice recently completed a two-week audit of Cloudflare's TLS 1.3 implementation. The audit took place between November 11, 2016 and December 9, 2016.

The TLS standard was last updated almost ten years ago and this version brings new features and a simplified handshake to the protocol. Many old cryptographic algorithms have been replaced with more modern ones, key exchanges have forward secrecy by default, the handshake phase will be faster, certificates will be able to enjoy security-proven signature schemes, MAC-then-Encrypt constructions are out—the weakest features of older TLS versions have been updated or removed.

Cryptography Services analyzed Cloudflare's TLS 1.3 implementation for protocol-level flaws and for deviations from the draft specification. The team found a small number of issues during the review—all of which were promptly fixed—and was pleased with the quality of the code.

Cloudflare built their implementation of TLS 1.3 on the Go programming language's standard TLS library, making use of the existing base to correctly and safely parse TLS packets. While building on top of older versions can be challenging, Cloudflare has added TLS 1.3 code in a safe and segregated way, with new defenses against downgrade attacks being added in the final implementation of the specification. This permits support for older versions of TLS while being free from unexpected conflicts or downgrades.

Using Go and its standard libraries enables Cloudflare to avoid common implementation issues stemming from vulnerable strcpy and memcpy operations, pointer arithmetic and manual memory management while providing a best-in-class crypto API.

Cloudflare implemented a conservative subset of the TLS 1.3 specification. State-of-the-art algorithms, such as Curve25519, are given priority over legacy algorithms. Session resumption is limited to the forward secure option. Cloudflare's implementation also considers efficiency, using AES-GCM if it detects accelerated hardware support and the faster-in-software Chacha20-Poly1305 in its absence.

There is still work to be done before TLS 1.3 enjoys large scale adoption. Cloudflare is paving the way with its reliable server implementation of TLS 1.3, and Firefox and Chrome's client implementations make end-to-end testing of the draft specification possible. NCC Group applauds the work of the IETF and these early implementers.

Written by: Scott Stender

Categories: Technology

Want to see your DNS analytics? We have a Grafana plugin for that

Tue, 14/02/2017 - 18:04

Curious where your DNS traffic is coming from, how much DNS traffic is on your domain, and what records people are querying for that don’t exist? We now have a Grafana plugin for you.

Grafana is an open source data visualization tool that you can use to integrate data from many sources into one cohesive dashboard, and even use it to set up alerts. We’re big Grafana fans here - we use Grafana internally for our ops metrics dashboards.

In the Cloudflare Grafana plugin, you can see the response code breakdown of your DNS traffic. During a random prefix flood, a common type of DNS DDoS attack where an attacker queries random subdomains to bypass DNS caches and overwhelm the origin nameservers, you will see the number of NXDOMAIN responses increase dramatically. It is also common during normal traffic to have a small amount of negative answers due to typos or clients searching for missing records.

You can also see the breakdown of queries by data center and by query type to understand where your traffic is coming from and what your domains are being queried for. This is very useful to identify localized issues, and to see how your traffic is spread globally.

You can filter by specific data centers, record types, query types, response codes, and query name, so you can filter down to see analytics for just the MX records that are returning errors in one of the data centers, or understand whether the negative answers are generated because of a DNS attack, or misconfigured records.

Once you have the Cloudflare Grafana Plugin installed, you can also make your own charts using the Cloudflare data set in Grafana, and integrate them into your existing dashboards.

Virtual DNS customers can also take advantage of the Grafana plugin. There is a custom Grafana dashboard that comes installed with the plugin to show traffic distribution and RTT from different Virtual DNS origins, as well as the top queries that are uncached or returning SERVFAIL.

The Grafana plugin takes three steps to install once you have Grafana up and running: cd into the plugins folder, download the plugin, and restart Grafana. Instructions are here. Once you sign in using your user email and API key, the plugin will automatically discover zones and Virtual DNS clusters you have access to.

The Grafana plugin is built on our new DNS analytics API. If you want to explore your DNS traffic but Grafana isn’t your tool of choice, our DNS analytics API is very easy to get started with. Here’s a curl to get you started:

curl -s -H 'X-Auth-Key:####' -H 'X-Auth-Email:####' '…'
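If you'd rather call the API from code than curl, here is a small Python sketch using only the standard library. The authentication headers match the curl example above; the report path under /zones/:id is an assumption based on Cloudflare's v4 API conventions, so check the DNS analytics API documentation for the exact URL and query parameters.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"

def dns_analytics_request(zone_id, auth_key, auth_email):
    """Build an authenticated GET request for a zone's DNS analytics report.

    The /dns_analytics/report path is assumed from the v4 API conventions;
    consult the official docs for the exact endpoint and parameters.
    """
    url = "{}/zones/{}/dns_analytics/report".format(API_BASE, zone_id)
    return urllib.request.Request(url, headers={
        "X-Auth-Key": auth_key,      # same headers as the curl example
        "X-Auth-Email": auth_email,
    })

# To actually fetch the report:
# with urllib.request.urlopen(dns_analytics_request(zone, key, email)) as resp:
#     report = json.load(resp)
```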

To make all of this work, Cloudflare DNS is answering and logging millions of queries each second. Having high resolution data at this scale enables us to quickly pinpoint and resolve problems, and we’re excited to share this with you. More on this in a follow up deep dive blog post on improvements in our new data pipeline.

Instructions for how to get started with Grafana are here and DNS analytics API documentation is here. Enjoy!

Categories: Technology

Cloudflare Crypto Meetup #5: February 28, 2017

Tue, 07/02/2017 - 19:31

Come join us at Cloudflare HQ in San Francisco on Tuesday, February 28, 2017 for another cryptography meetup. We had such a great time at the last one that we decided to host another. It’s becoming a pattern.

We’ll start the evening at 6:00 p.m. with time for networking, followed by short talks from leading experts starting at 6:30 p.m. Pizza and beer are provided! RSVP here.

Here are the confirmed speakers:

Deirdre Connolly

Deirdre is a senior software engineer at Brightcove, where she is trying to secure old and new web applications. Her interests include applied cryptography, secure defaults, elliptic curves and their isogenies.

Post-quantum cryptography

Post-quantum cryptography is an active field of research developing new cryptosystems that will be resistant to attack by future quantum computers. Recently a somewhat obscure area, isogeny-based cryptography, has been getting more attention, including impressive speed and compression optimizations and robust security analyses, bringing it into regular discussion alongside other post-quantum candidates. This talk will cover isogeny-based crypto, specifically these recent results regarding supersingular isogeny Diffie-Hellman, which is a possible replacement for the ephemeral key exchanges in use today.

Maya Kaczorowski

Maya Kaczorowski is a Product Manager at Google in Security & Privacy. Her work focuses on encryption at rest and encryption key management.

How data at rest is encrypted in Google's Cloud, at scale

How does Google encrypt data at rest? This talk will cover how Google shards and encrypts data by default, Google's key management system, root of trust, and Google's cryptographic library. Google Cloud Platform encrypts customer content stored at rest, without any action from the customer, using one or more encryption mechanisms. We will also discuss best practices in implementing encryption for your storage system(s).

Andrew Ayer

Andrew Ayer is a security researcher interested in the Web's Public Key Infrastructure. He is the founder of SSLMate, an automated SSL certificate service, and the author of Cert Spotter, an open source Certificate Transparency monitor. Andrew participates in the IETF's Public Notary Transparency working group and recently used Certificate Transparency logs to uncover over 100 improperly-issued Symantec certificates.

Certificate Transparency

Certificate Transparency improves the security of the Web PKI by logging every publicly-trusted SSL certificate to public, verifiable, append-only logs, which domain owners can monitor to detect improperly-issued certificates for their domains. Certificate Transparency was created by Google and is now being standardized by the IETF. Beginning October 2017, Chrome will require all new certificates be logged with Certificate Transparency.

This talk will explore how Certificate Transparency works, how domain owners can take advantage of it, and what the future holds for Certificate Transparency.

Categories: Technology

DDoS Ransom: An Offer You Can Refuse

Mon, 06/02/2017 - 21:43

DDoS ransom

Cloudflare has covered DDoS ransom groups several times in the past. First, we reported on the copycat group claiming to be the Armada Collective and then not too long afterwards, we covered the "new" Lizard Squad. While in both cases the groups made threats that were ultimately empty, these types of security events can send teams scrambling to determine the correct response. Teams in this situation can choose from three types of responses: pay the ransom and enable these groups to continue their operations, not pay and hope for the best, or prepare an action plan to get protected.

Breaking the Ransom Cycle

We can’t stress enough that you should never pay the ransom. We fully understand that in the moment when your website is being attacked it might seem like a reasonable solution, but by paying the ransom, you only perpetuate the DDoS ransom group’s activities and entice other would-be ransomers to start making similar threats. In fact, we have seen reports of victim organizations receiving multiple subsequent threats after they have paid the ransom. It would seem these groups are sharing lists of organizations that pay, and those organizations are more likely to be targeted again in the future. Victim organizations pay the ransom often enough that we see new “competitors” pop up every few months. As of a few weeks ago, a new group, intentionally left unnamed, has emerged and begun targeting financial institutions around the world. This group follows a similar modus operandi as previous groups, but with a significant twist.

Mostly Bark and Little Bite

The main difference between previous copycats and this new group is that this group actually sends a small demonstration attack before sending the ransom email to the typical role-based email accounts. The hope is to demonstrate to the target that the group will follow through with the ransom threat and convince them to pay the amount requested before the deadline passes. Unsurprisingly though, if the ransom amount is not paid before the deadline expires, the group does not launch a second attack.

When targeting an organization, the group sends two variations of a ransom email. The first variation is a standard threat:

Subject: ddos attack

Hi!
If you dont pay 8 bitcoin until 17. january your network will be hardly ddosed!
Our attacks are super powerfull.
And if you dont pay until 17. january ddos attack will start and price to stop will double!
We are not kidding and we will do small demo now on [XXXXXXXX] to show we are serious.
Pay and you are safe from us forever.
OUR BITCOIN ADDRESS: [XXXXXXXX]
Dont reply, we will ignore! Pay and we will be notify you payed and you are safe.
Cheers!

Interestingly, the second email variation makes reference to "mirai" -- the IoT-based botnet that has been in the news recently as having contributed to many significant attacks. It is important to note -- while the second variation of ransom email references “mirai” there is no actual evidence that these demonstration attacks have anything to do with the Mirai botnet.

Subject: DDoS Attack on XXXXXXXX!

Hi!
If you dont pay 6 bitcoin in 24 hours your servers will be hardly ddosed!
Our attacks are super powerfull.
And if you dont pay in 24 hours ddos attack will start and price to stop will double and keep go up!
IMPORTANT - You think you protected by CloudFlare but we pass CloudFlare and attack your servers directly.
We are not kidding and we will do small demo now to show we are serious.
We dont want to make damage now so we will run small attack on 2 not important your IPs - XXXXXXXX and XXXXXXXX.
Just small UDP flood for 1 hour to prove us.
But dont ignore our demand as we then launch heavy attack by Mirai on all your servers!!
Pay and you are safe from us forever.
OUR BITCOIN ADDRESS: [XXXXXXXX]
Dont reply, we will ignore! Pay and we will be notify you payed and you are safe.
Cheers!

While no two attacks are identical, the group’s demonstration attacks do generally follow a pattern. The attacks usually peak around 10 Gbps, last for less than an hour and use either DNS amplification or NTP reflection as the attack method. Without detailing specifics so as not to tip off the bad guys, there are also specific characteristics about the demonstration attacks that support the theory the attacks are using a booter/stresser type of service to carry out the attacks. Neither of these attack types are new, and Cloudflare successfully mitigates attacks that are substantially larger in volume many times a week.

While in this instance not paying the ransom doesn’t lead to a subsequent attack, this outcome isn’t guaranteed. Not only can your site possibly go down during the demonstration attack, but there is still nothing stopping either the original ransomer or a different attacker from launching a future attack. Regardless of an attacker’s true intent, taking no action is a suboptimal plan.

Building an Action Plan

Scrambling to build an action plan while actively under attack is not only stressful, but this is often when avoidable mistakes happen. We recommend doing your research about what protection is right for you ahead of time. DDoS protection, as well as other application level protections, don’t have to be a hassle to implement, and it can be done in under an hour with Cloudflare. Having a plan and implementing protection before a security event occurs can keep your site running smoothly. However, if you find yourself under attack and without an action plan, it’s important to remember that many of these groups are bluffing. Even when these groups are not bluffing, paying the ransom will only encourage them to continue their efforts. If you have received one of these emails, we encourage you to reach out so that we can discuss the specifics of your situation, and whether or not the specific group in question is known to follow through with their threats.

Categories: Technology

NANOG - the art of running a network and discussing common operational issues

Thu, 02/02/2017 - 12:15

The North American Network Operators Group (NANOG) is the locus of modern Internet innovation and the day-to-day cumulative network-operational knowledge of thousands and thousands of network engineers. NANOG itself is a non-profit membership organization, but you don’t need to be a member in order to attend the conference or join the mailing list. That said, if you can become a member, then you’re helping a good cause.

The next NANOG conference starts in a few days (February 6-8, 2017) in Washington, DC. Nearly 900 network professionals are converging on the city to discuss a variety of network-related issues, both big and small, but all related to running and improving the global Internet. For this upcoming meeting, Cloudflare has three network professionals in attendance: two from the San Francisco office and one from the London office.

With the conference starting next week, it seemed a great opportunity to explain to readers of the blog why a NANOG conference is so worth attending.


While it seems obvious how to do some network tasks (you unpack the spiffy new wireless router from its box, set up its security, and plug it in), alas the global Internet is somewhat more complex. Even seasoned professionals could do with a recap on how traceroute actually works, or how DNSSEC operates, or this year’s subtle BGP complexities, or be enlightened about Optical Networking. All this can assist you with deployments within your networks or datacenter.


If there’s one thing that keeps the Internet (a network-of-networks) operating, it’s peering. Peering is the act of bringing together two or more networks to allow traffic (bits, bytes, packets, email messages, web pages, audio and video streams) to flow efficiently and cleanly between source and destination. The Internet is nothing more than a collection of individual networks. NANOG provides one of many forums for diverse network operators to meet face-to-face and negotiate and enable those interconnections.

While NANOG isn’t the only event that draws networks together to discuss interconnection, it’s one of the early forums to support these peering discussions.

Security and Reputation

In this day-and-age we are brutally aware that security is the number-one issue when using the Internet. This is something to think about when you choose your email password, lock screen password on your laptop, tablet or smartphone. Hint: you should always have a lock screen!

At NANOG the security discussion is focused on a much deeper part of the global Internet, in the very hardware and software practices, that operate and support the underlying networks we all use on a daily basis. An Internet backbone (rarely seen) is a network that moves traffic from one side of the globe to the other (or from one side of a city to the other). At NANOG we discuss how that underlying infrastructure can operate efficiently, securely, and be continually strengthened. The growth of the Internet over the last handful of decades has pushed the envelope when it comes to hardware deployments and network-complexity. Sometimes it only takes one compromised box to ruin your day. Discussions at conferences like NANOG are vital to the sharing of knowledge and collective improvement of everyone's networks.

Above the hardware layer (from a network stack point of view) is the Domain Name System (DNS). DNS has always been a major subject of discussion within the NANOG community. It’s very much up to the operational community to make sure that when you type a website name into a web browser or someone’s email address into your email program there’s a highly efficient process to convert from names to numbers (numbers, or IP addresses, are the address book and routing method of the Internet). DNS has had its fair share of focus in the security arena and it comes down to network operators (and their system administrator colleagues) to protect DNS infrastructure.

Network Operations; best practices and stories of disasters

Nearly everyone knows that bad news sells. It’s a fact. To be honest, the same is the case in the network operator community. However, within NANOG, those stories of disasters are nearly always told from a learning and improvement point of view. There’s simply no need to repeat a failure, no-one enjoys it a second time around. Notable stories have included subjects like route-leaks, BGP protocol hiccups, peering points, and plenty more.

We simply can’t rule out failures within portions of the network; hence NANOG has spent plenty of time discussing redundancy. The Internet operates using routing protocols that explicitly allow for redundancy in the paths that traffic travels. Should a failure occur (a hardware failure, or a fiber cut), the theory is that the traffic will be routed around that failure. This is a recurring topic for NANOG meetings. Subsea cables (and their occasional cuts) always make for good talks.

Network Automation

While we learned twenty or more years ago how to type into Internet routers on the command line, those days are quickly becoming history. We simply can’t scale if network operational engineers have to type the same commands into hundreds (or thousands?) of boxes around the globe. We need automation, and NANOG has been a leader in this space. Cloudflare has been active in this arena: Mircea Ulinic presented our experience with network automation using Salt and NAPALM at the previous NANOG meeting. Mircea (and Jérôme Fleury) will be giving a follow-up in-depth tutorial on the subject at next week’s meeting.

Many more subjects covered

The first NANOG conference was held in June 1994 in Ann Arbor, Michigan, and the conference has grown significantly since then. While it’s fun to follow the history, it’s maybe more important to realize that NANOG has covered a multitude of subjects since that start. Go scan the archives and/or watch some of the online videos.

The socials (downtime between technical talks)

Let’s not forget the advantages of spending time with other operators within a relaxed setting. After all, sometimes the big conversations happen when spending time over a beer discussing common issues. NANOG has long understood this and it’s clear that the Tuesday evening Beer ’n Gear social is set up specifically to let network geeks both grab a drink (soft drinks included) and poke around with the latest and greatest network hardware on show. The social is as much about blinking lights on shiny network boxes as it is about tracking down that network buddy.

Oh; there’s a fair number of vendor giveaways (so far 15 hardware and software vendors have signed up for next week’s event). After all, who doesn’t need a new t-shirt?

But there’s more to the downtime and casual hallway conversations. For myself (the author of this blog), I know that sometimes the most important work is done within the hallways during breaks in the meeting vs. standing in front of the microphone presenting at the podium. The industry has long recognized this and the NANOG organizers were one of the early pioneers in providing full-time coffee and snacks that cover the full conference agenda times. Why? Because sometimes you have to step out of the regular presentations to meet and discuss with someone from another network. NANOG knows its audience!

Besides NANOG, there’s IETF, ICANN, ARIN, and many more

NANOG isn’t the only forum for discussing network operational issues; however, it’s arguably the largest. It started off as a “North American” entity; but in the same way that the Internet doesn’t have country barriers, NANOG meetings (which take place in the US, Canada, and at least once in the Caribbean) have fostered an online community that has grown into a global resource. The mailing list (well worth reviewing) is a bastion of networking discussions.

In a different realm, the Internet Engineering Task Force (IETF) focuses on protocol standards. Its existence is why diverse entities can communicate. Operators participate in IETF meetings; however, the meetings sit outside the core operational mindset.

Central to the Internet’s existence is ICANN. Meeting three times a year at locations around the globe, it focuses on Internet governance, domain names, and related items. Each meeting also includes an excellent Tech Day.

In the numbers arena, ARIN is an example of a Regional Internet Registry (an RIR) that runs member meetings. An RIR deals with allocating resources such as IP addresses and AS numbers. ARIN focuses on the North American region and sometimes holds its meetings alongside NANOG meetings.

ARIN’s counterparts in other parts of the world also hold meetings. Sometimes they focus purely on resource policy, and sometimes they cover network operational issues as well. For example, RIPE (in Europe, Central Asia, and the Middle East) runs a five-day meeting that covers operational and policy issues. APNIC (Asia Pacific), AFRINIC (Africa), and LACNIC (Latin America & Caribbean) all do similar variations. There isn’t one absolute method, and that’s a good thing. It’s worth pointing out that APNIC holds its member meetings once a year in conjunction with APRICOT, which is the primary operations meeting in the Asia Pacific region.

While NANOG is somewhat focused on North America, there are also regional NOGs. These regional NOGs are vital to the education of network operators globally. Japan has JANOG, Southern Africa has SAFNOG, MENOG serves the Middle East, AUSNOG & NZNOG cover Australia & New Zealand, DENOG Germany, PHNOG the Philippines, and, just to be different, the UK has UKNOF (“Forum” vs. “Group”). It would be hard to list them all, but each is a worthwhile forum for operational discussions.

Peering-specific meetings also exist: the Global Peering Forum, the European Peering Forum, and the Peering Forum de LACNOG, for example. These host bilateral meetings within a group of network operators or administrators and focus specifically on interconnect agreements.

In the commercial realm, there are plenty of other meetings attended by networks like Cloudflare. PTC and International Telecoms Week (ITW) are global telecom meetings specifically designed to host one-to-one (bilateral) meetings. They are very commercial in nature and less operational in focus.

NANOG isn’t the only forum Cloudflare attends

As you would guess, you will find our network team at RIR meetings, sometimes at IETF meetings, sometimes at ICANN meetings, and often at various regional NOG meetings (like SANOG in South Asia, NoNOG in Norway, RONOG in Romania, AUSNOG/NZNOG in Australia/New Zealand, and many other NOGs). We get around; however, we also run a global network and need to interact with many networks around the globe. These meetings provide an ideal opportunity for one-to-one discussions.

If you've heard something you like from Cloudflare at one of these operationally focused conferences, then check out our job listings (in various North American cities, London, Singapore, and beyond!).

Categories: Technology

Protecting everyone from WordPress Content Injection

Wed, 01/02/2017 - 16:53

Today a severe vulnerability was announced by the WordPress Security Team that allows unauthenticated users to change content on sites running unpatched WordPress (versions below 4.7.2).

CC BY-SA 2.0 image by Nicola Sap De Mitri

The problem was found by the team at Sucuri and reported to WordPress. The WordPress team worked with WAF vendors, including Cloudflare, to roll out protection before the patch became available.

Earlier this week we rolled out two rules to protect against exploitation of this issue (both types mentioned in the Sucuri blog post). We have been monitoring the situation and have not observed any attempts to exploit this vulnerability before it was announced publicly.

Customers on a paid plan will find two rules in the WAF, WP0025A and WP0025B, that protect unpatched WordPress sites from this vulnerability. If the Cloudflare WordPress ruleset is enabled, these rules are automatically turned on and blocking.
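Sucuri's write-up describes the flaw as a REST API permission bypass: a post `id` parameter containing trailing non-numeric characters (e.g. `123abc`) slips past the capability check while still resolving to a real post. A WAF rule for an issue like this amounts to matching that request shape at the edge. Here is a toy illustration of such a check in Python; it is emphatically not the logic of WP0025A/WP0025B (which is not public), and the function and pattern names are hypothetical:

```python
import re

# Toy sketch: flag WordPress REST API post-update requests whose `id`
# is digits followed by extra characters, the type-juggling shape
# behind the 4.7.0/4.7.1 content injection bypass.
SUSPICIOUS_ID = re.compile(r"^\d+\D")  # e.g. "123abc", but not "123"

def looks_like_content_injection(path: str, id_param: str) -> bool:
    """Return True if the request resembles the known exploit pattern."""
    return "/wp-json/wp/v2/posts" in path and bool(SUSPICIOUS_ID.match(id_param))
```

A real WAF rule would of course inspect the full request (method, body, encodings) rather than two strings, but the principle of virtually patching a known request shape before the origin is patched is the same.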

Protecting Everyone

As we have in the past with other serious and critical vulnerabilities like Shellshock and previous issues with JetPack, we have enabled these two rules for our free customers as well.

Free customers who want full protection for their WordPress sites can upgrade to a paid plan and enable the Cloudflare WordPress ruleset in the WAF.

Categories: Technology

TLS 1.3 explained by the Cloudflare Crypto Team at 33c3

Wed, 01/02/2017 - 14:57

Nick Sullivan and I gave a talk about TLS 1.3 at 33c3, the latest Chaos Communication Congress. The congress, attended by more than 13,000 hackers in Hamburg, has been one of the hallmark events of the security community for more than 30 years.

You can watch the recording below, or download it in multiple formats and languages on the CCC website.

The talk introduces TLS 1.3 and explains how it works in technical detail, why it is faster and more secure, and touches on its history and current status.


The slide deck is also online.

This was an expanded and updated version of the internal talk previously transcribed on this blog.

TLS 1.3 hits Chrome and Firefox Stable

In related news, TLS 1.3 is reaching a percentage of Chrome and Firefox users this week, so websites with the Cloudflare TLS 1.3 beta enabled will load faster and more securely for all those new users.
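If you want to verify TLS 1.3 behaviour from your own client code, modern standard libraries let you pin the protocol floor. A small sketch using Python's `ssl` module (this requires Python 3.7+ built against OpenSSL 1.1.1 or newer, which postdates this post; the function name is ours):

```python
import ssl

def tls13_only_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.3.

    Needs Python 3.7+ linked against OpenSSL 1.1.1 or newer.
    """
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```

Wrapping a socket with this context against a TLS 1.3-enabled site will fail the handshake outright if the server can only speak 1.2 or older, which makes it a handy way to confirm a beta rollout like the one described above.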

The last few days

You can enable the TLS 1.3 beta from the Crypto section of your control panel.

TLS 1.3 toggle

Categories: Technology