Blogroll: CloudFlare

I read blogs as well as write one. The 'blogroll' on this site reproduces a selection of posts from writers I enjoy reading. There are currently 60 posts from the blog 'CloudFlare.'

Disclaimer: Reproducing an article here does not necessarily imply agreement or endorsement!

Subscribe to the CloudFlare feed

CloudFlare
Helping Build a Better Internet

Encrypt that SNI: Firefox edition

Thu, 18/10/2018 - 18:00

A couple of weeks ago we announced support for the encrypted Server Name Indication (SNI) TLS extension (ESNI for short). As promised, our friends at Mozilla landed support for ESNI in Firefox Nightly, so you can now browse Cloudflare websites without leaking the plaintext SNI TLS extension to on-path observers (ISPs, coffee-shop owners, firewalls, …). Today we'll show you how to enable it and how to get full marks on our Browsing Experience Security Check.


Here comes the night

The first step is to download and install the very latest Firefox Nightly build, or, if you have Nightly already installed, make sure it’s up to date.

When we announced our support for ESNI we also created a test page you can point your browser to: https://encryptedsni.com. It checks whether your browser / DNS configuration is providing a more secure browsing experience by using secure DNS transport, DNSSEC validation, TLS 1.3 and ESNI itself when it connects to our test page. Before you make any changes to your Firefox configuration, you might well see a result something like this:

[Image: Browsing Experience Security Check results before any changes]

So, room for improvement! Next, head to the about:config page and look for the network.security.esni.enabled option (you can type the name in the search box at the top to filter out unrelated options), and switch it to true by double clicking on its value.

[Image: the network.security.esni.enabled option in about:config]

Now encrypted SNI is enabled and will be automatically used when you visit websites that support it (including all websites on Cloudflare).

It’s important to note that, as explained in our blog post, you must also enable support for DNS over HTTPS (also known as “Trusted Recursive Resolver” in Firefox) in order to avoid leaking the websites visited through plaintext DNS queries. To do that with Firefox, you can simply follow the instructions on this page.

Mozilla recommends setting up the Trusted Recursive Resolver in mode “2”, which means that if, for whatever reason, the DNS query to the TRR fails, it will be retried using the system’s DNS resolver. This is good for avoiding broken web browsing due to DNS misconfigurations; however, Firefox will also fall back to the system resolver in case of a failed DNSSEC signature verification, which might affect the user’s security and privacy, since the query will then be retried over plaintext DNS.
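
If you prefer to manage these settings in a user.js file in your Firefox profile rather than flipping them in about:config, a minimal sketch looks like the following (the ESNI and TRR preference names are the ones discussed above; the resolver URL is an assumption based on the Mozilla/Cloudflare DNS over HTTPS endpoint, so check it against the instructions linked earlier):

user_pref("network.security.esni.enabled", true);                              // encrypted SNI
user_pref("network.trr.mode", 2);                                              // DoH with fallback to the system resolver
user_pref("network.trr.uri", "https://mozilla.cloudflare-dns.com/dns-query");  // assumed DoH endpoint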

This is because any DNS failure, including a DNSSEC failure, is reported by the resolver with the generic SERVFAIL return code, which is not granular enough for Firefox to distinguish between failure scenarios. We are looking into options to address this on our 1.1.1.1 resolver, in order to give Firefox and other DNS clients more information on the type of DNS failure experienced and to avoid the fallback behaviour when appropriate.

Now that everything is in place, go ahead and visit our Browsing Experience Security Check page, and click on the “Check My Browser” button. You should now see results something like this:

[Image: Browsing Experience Security Check results after the changes]

Note: As you make changes in about:config to the ESNI & TRR settings, you will need to hard refresh the check page to ensure a new TLS connection is established. We plan to fix this in a future update.

To test for encrypted SNI support on your Cloudflare domain, you can visit the “/cdn-cgi/trace” page, for example, https://www.cloudflare.com/cdn-cgi/trace (replace www.cloudflare.com with your own domain). If the browser encrypted the SNI you should see sni=encrypted in the trace output.
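
For a rough idea of what the trace output looks like from a terminal, you can fetch the same endpoint with cURL (a sketch; note that cURL itself does not negotiate ESNI, so a request made this way should report a plaintext SNI, and the point of the exercise is really to test your browser):

$ curl -s https://www.cloudflare.com/cdn-cgi/trace | grep sni
sni=plaintext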

On the wire

You can also go a step further and download and build the latest Wireshark code from its git repository (this feature hasn’t landed in a stable release yet so building from source is required for now).

This will allow you to see what the encrypted SNI extension looks like on the wire, while you visit a website that supports ESNI (e.g. https://cloudflare.com).

This is how a normal TLS connection looks with a plaintext SNI:

[Image: Wireshark capture of a TLS handshake with a plaintext SNI]

And here it is again, but this time with the encrypted SNI extension:

[Image: Wireshark capture of a TLS handshake with the encrypted SNI extension]

Fallback

As mentioned in our earlier post there may be cases when the DNS record fetched by the client doesn’t match a valid key owned by the TLS server, in which case the connection using ESNI would simply fail to be established.

This might happen for example if the authoritative DNS server and the TLS server somehow get out of sync (for example, the TLS server rotates its own key, but the DNS record is not updated accordingly). But this could also be caused by external parties, for example, a caching DNS resolver that doesn’t properly respect the TTL set by the authoritative server might serve an outdated ESNI record even though the authoritative server is up-to-date. When this happens, Firefox will fail to connect to the website.

The way we work around this problem on the Cloudflare edge network is to have the TLS termination stack keep a list of ESNI keys that were valid over the past few hours, rather than just the most recent key. This allows the TLS server to decrypt the encrypted SNI sent by a client even if a slightly outdated DNS record was used to produce it. The lifetime of ESNI keys needs to balance service availability, which favours keeping as many keys around as possible, against the security and forward secrecy of ESNI, which requires keeping as few keys as possible.

There is some room for experimentation while the encrypted SNI specification is not yet finalized. One proposed solution would allow the server to detect the failure and serve a fresh ESNI record to the client, which can then try to connect again using the newly received record without having to disable ESNI completely. While this might seem easy, in practice a lot of things need to be taken into account: the server needs to serve a certificate to the client, so the client can make sure the connection is not being intercepted, but at the same time the server doesn't know which certificate to serve because it can't decrypt and inspect the SNI, which introduces the need for some sort of "fallback certificate". Additionally, any such fallback mechanism would inevitably add an extra round trip to the connection handshake, negating one of the main performance improvements introduced by TLS 1.3 (that is, shorter handshakes).

Conclusion

For our part, we’ll continue to experiment and to evolve our implementation as the specification matures, to make encrypted SNI work best for our customers and users.

Categories: Technology

Why I’m helping Cloudflare grow in Germany, Austria, and Switzerland

Thu, 18/10/2018 - 13:00

Why Cloudflare?

I am incredibly excited to announce that I’m joining Cloudflare as the Head of DACH to help grow demand for Cloudflare in Germany, Austria, and Switzerland. Having been in the technology industry for many years, I can say that Cloudflare’s mission to help build a better Internet was frankly the reason I joined, and I’m now very eager to start working towards it.

I quickly learned how Cloudflare helps to speed up and secure over 10 million Internet properties by protecting these customers from a wide range of online attacks and providing the reliability needed to run strong businesses. Security, privacy, and performance are key drivers for almost every business: from large traditional enterprises to purely online businesses and even individuals building their own personal brand. I could go on and on. The more I learned, the more excited I became.

One of Cloudflare’s major strengths is its global network. Cloudflare already has data centers in seven cities in the DACH region (with more to come), helping to ensure the Internet is fast, safe, and reliable for users in the region. So while I get the honor of opening our first office in Germany (in Munich), I loved that Cloudflare had already been working towards this and was already in the market with customers.

Another important aspect for me was the company’s culture. During my interview experience with Cloudflare, I witnessed an incredible passion for the company from everyone, which left me with a strong feeling that this is the right environment for me. This team wants to make a difference. Cloudflare has a very determined team, and everyone is aligned behind the same goal: to help make the Internet better, for everyone. I also appreciated the company’s commitment to diversity in our employee base, and I will be building up the DACH team with that same commitment in mind. I can’t wait for what’s ahead.

Cloudflare is at the forefront of the direction the market is heading. We have an extremely talented and passionate team, and I am thrilled to now be a part of achieving Cloudflare’s mission.

What’s going on in the region?

Over the last 17 years, I have helped Symantec and Veritas to build strong teams and grow their businesses in Central Europe, including in the DACH region. I’m now excited to help expand on our strong global network and to build an even greater presence for Cloudflare in the DACH region.

Germany has the largest national economy in Europe and the fourth-largest by nominal GDP in the world.  From many of the largest corporations in the world, to the thriving German “Mittelstand” companies, I see organisations in the region trying to gain advantages from technology in a secure, reliable, and scalable way. With the opening of the new office in Munich, and the ongoing support of our EMEA headquarters in London, we will be able to significantly step up our support for DACH customers and partners.

Looking ahead

I’m excited to get started. Please look out for announcements about upcoming customer events and webinars. I’d be delighted to meet you there in person. Or, you can get in touch with me at shenke (at) cloudflare.com.

And, in case you are wondering, yes, we are hiring in the region. We are looking for Account Executives and Solution Engineers in Munich. If you are interested in exploring a career on our team in Germany, please keep in touch.

Categories: Technology

My First Grace Hopper Celebration

Wed, 17/10/2018 - 21:30

Cloudflare #GHC18 team

I am 25+ years into my career in technology, and this was the very first time I attended a conference geared towards women.

A couple of weeks ago I went to Grace Hopper Celebration (#GHC18), and I can still feel the exuberant energy from the 22,000 women over the intensive 3 day conference. I attended with our Cloudflare team; our purpose was to connect with women in the greater tech community and recruit new talent to join our team and mission to help build a better Internet.

Cloudflare prioritizes GHC because we recognize that diversity in our company, and particularly in our technical departments, is crucial to our success. We believe that the best companies are diverse companies. This was Cloudflare’s second time sponsoring GHC, and I was part of the planning committee. This year I headed to the event with 20 of my colleagues to meet all of the incredible attendees, hold on-site interviews, and even host our own Cloudflare panel and luncheon.

Getting to #GHC18

Early Tuesday morning, the day before the conference, as I joined the Southwest Airlines boarding line at Oakland Airport, my fellow passengers were not the usual contingent of suited men on their way to business meetings. Instead I was surrounded by hundreds of women (and some men) in conversation about what to expect in Houston. The anticipation was palpable, and energy was invigorating.

The flight itself was essentially a Grace Hopper networking event. I sat next to two others who were also attending on behalf of their companies. In my row there was a product manager at a well-known and successful startup, as well as an executive who was heading to Grace Hopper to learn and hire. That was the best professional conversation I ever had on an airplane. The topics ranged from how to scale data pipelines at rapidly growing software companies, to how to find and hire great women engineers. All three of us were using the spotty airplane wifi to communicate last-minute conference plans with our colleagues all heading to the event. One of my seatmates showed me a massive airplane selfie that one of his colleagues had sent him—the whole plane was filled with women from his company, and the pilot had even made a special announcement welcoming them.

Upon arriving in Houston there was more of the same energy—it was just warmer and a bit muggier now that we were in Texas. The area of Houston around the conference centre was overtaken by the 22,000 attendees, most of whom were women at various stages of their studies. Uber drivers were eager to ask us what the hell was going on. Why so many women?

Three Non-Stop Days at #GHC18


Cloudflare Expo Booth photo

As a member of the Cloudflare GHC contingent I had a few jobs—working the booth on the expo floor, interviewing candidates, and being one of four panelists at our Cloudflare: Women in Leadership Lunch.

Working the booth was a whole lot more fun than I could have imagined. I am an introvert and tend to avoid crowds and interactions with too many strangers. I surprised myself by taking on the role of “traffic control”— walking the expo floors and approaching women to ask if they are looking for a great place to work. Cloudflare is a great place to work so I could authentically express my feelings and also specifically speak to why it’s an ideal place to start your career. Cloudflare is a company where you work to solve some of the internet’s biggest problems at a scale where it has real impact.

I would then walk any interested people over to our booth so that my colleagues and I could engage them further. I got so much from my conversations with these women. They gave me insight into why the celebration is so well attended. Women at various stages of their studies and careers had very specific reasons for being there.

The highlight of my week was the Women in Leadership Luncheon that Cloudflare hosted on the last day of the event. It gave us an opportunity to interact with some of the women we had met throughout the week in a more thoughtful and private way where we could open up about our careers and personal goals.


Cloudflare Women in Leadership Luncheon w/ Jessica Rosenberg, Jade Wang, Lisa Retief, and Suzanne Aldrich


We mingled with women in a relaxed setting, and had conversations about their situations and experiences. I found it very inspiring. As part of the event, I joined a panel with my three colleagues Jade Wang, head of developer relations, Rebecca Rosenberg, head of brand design, and Suzanne Aldrich, solutions engineering lead, to share some of our experiences and career journeys. All of us have different paths and have landed in different areas of the company, but all play integral roles in Cloudflare’s success. I don’t think you can overstate the impact of seeing someone you can relate to in a position you may aspire to. This is an opportunity I wish I had when I was younger, and I am now thrilled to share it with the next generation of leaders in tech.

Another personal highlight of GHC was getting to really know my colleagues, many of whom I had never directly worked with. We were a team of women and men across different departments and locations who were excited to represent Cloudflare and ready to make some hires. We all had fun doing this and worked well together. While I didn’t go out dancing and singing quite as often as some of them, I made friends who I now greet enthusiastically whenever we cross paths at work. Two things we look for in candidates are empathy and curiosity, so it was great to be able to bond with my colleagues and get to see that side and know each of them personally.


Team dinner @ #GHC18

As I left Houston, I reflected on the contrast between the national headlines and what I had experienced at the conference. The week had coincided with Dr. Christine Blasey Ford giving testimony that resonated with many of us. It was hard to hear. In spite of this, I saw at the conference a groundswell of potential to transform today’s companies into places that can help effect change.

When people ask me about what it’s like being a woman in tech, I often joke that I have never had to wait in line for the restroom. And while I’m being funny, it’s true. GHC was a very different experience, however. For me, attending GHC was like entering an alternate universe — something like a Margaret Atwood speculative fiction novel, except this was not a dystopian future. It was a future I want to see happen.

I look forward to #GHC19.

Categories: Technology

A Question of Timing

Wed, 17/10/2018 - 13:00

Photo by Aron / Unsplash

When considering website performance, the term TTFB - time to first byte - crops up regularly. Often we see measurements from cURL and Chrome, and this article will show what timings those tools can produce, including time to first byte, and discuss whether this is the measurement you are really looking for.

Timing with cURL

cURL is an excellent tool for debugging web requests, and it includes the ability to take timing measurements. Let’s take an example website www.zasag.mn (the Mongolian government), and measure how long a request to its home page takes:

First configure the output format for cURL in ~/.curlrc:

$ cat .curlrc
-w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n"
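
If you'd rather not touch ~/.curlrc, the same format string can be passed inline with the -w option; a sketch using the same variables:

$ curl -so /dev/null -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n" https://www.zasag.mn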

Now connect to the site dropping the output (-o /dev/null) since we’re only interested in the timing:

$ curl -so /dev/null https://www.zasag.mn
dnslookup: 1.510 | connect: 1.757 | appconnect: 2.256 | pretransfer: 2.259 | starttransfer: 2.506 | total: 3.001 | size: 53107

These timings are in seconds. Depending on your version of cURL, you may get more decimal places than this example. 3 seconds is a long time, and remember this is only for the HTML from the home page - it doesn’t include any JavaScript, images, etc.

The diagram below shows what each of those timings refer to against a typical HTTP over TLS 1.2 connection (TLS 1.3 setup needs one less round trip):

[Image: cURL timing variables mapped onto a typical HTTP over TLS 1.2 connection]

  • time_namelookup in this example takes a long time. To exclude DNS resolver performance from the figures, you can resolve the IP for cURL: --resolve www.zasag.mn:443:218.100.84.167. It may also be worth looking for a faster resolver :).
  • time_connect is the TCP three-way handshake from the client’s perspective. It ends just after the client sends the ACK - it doesn't include the time taken for that ACK to reach the server. It should be close to the round-trip time (RTT) to the server. In this example, RTT looks to be about 200 ms.
  • time_appconnect here is TLS setup. The client is then ready to send its HTTP GET request.
  • time_starttransfer is just before cURL reads the first byte from the network (it hasn't actually read it yet). time_starttransfer - time_appconnect is practically the same as Time To First Byte (TTFB) from this client - 250 ms in this example case. This includes the round trip over the network, so you can get a better guess of how long the server spent on the request by calculating TTFB - (time_connect - time_namelookup). In this example the server spent only a few milliseconds responding; the rest of the time was the network (the arithmetic is worked through just after this list).
  • time_total is just after the client has sent the FIN connection tear down.
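
Putting the example numbers together (a quick sketch using the figures from the run above, and the --resolve form already quoted in the list):

$ curl -so /dev/null --resolve www.zasag.mn:443:218.100.84.167 https://www.zasag.mn

TTFB              ≈ time_starttransfer - time_appconnect    = 2.506 - 2.256 ≈ 0.250 s
TCP connect time  ≈ time_connect - time_namelookup          = 1.757 - 1.510 ≈ 0.247 s
Server processing ≈ TTFB - (time_connect - time_namelookup) ≈ 0.250 - 0.247 ≈ 0.003 s
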
Timing with Chrome

Chrome, and some other testing tools, use the W3C Resource Timing standard for measurements. In Chrome developer tools this looks like this:

[Image: request timing breakdown in Chrome developer tools]

Again, here’s how this maps onto a typical HTTP over TLS 1.2 connection, also showing the Resource Timing attribute names:

[Image: Resource Timing attributes mapped onto a typical HTTP over TLS 1.2 connection]

  • Stalled (fetchStart to domainLookupStart) is the browser waiting to start the connection, e.g. allocating cache on disk, if there are higher priority requests, or if there are already 6 connections open to this host.
  • Initial connection shown by Chrome is connectStart to connectEnd. Unlike cURL timings, this includes SSL connection setup, so if you want a fair estimate of RTT, this would be Initial connection - SSL. If an existing connection is being reused, then DNS Lookup, Initial connection and SSL won't be shown.
  • Request sent is connectEnd - requestStart, which should be negligible.
  • Similarly to cURL, if we subtract the TCP handshake time from TTFB, we can guess the amount of time the server really spent processing (again, we don't have an exact RTT timing, so this is an approximation).
What are we looking for again?

These measurements, including TTFB, can be helpful in diagnosing problems, and might help you to delve into a specific problem, but do they actually tell you how well a website is performing? Ultimately, if you are looking to measure the experience of users, the time it takes for the first byte of some HTML to return isn’t effective. A web page might contain hundreds of images, and it might have JavaScript and styles that need to load before you can interact. To reflect real user experience, you need to time how long it takes until the web page becomes useful, and to take those measurements from a representative sample of the locations your users access the site from. And that's a topic for another day :)

Categories: Technology

Serverless Rust with Cloudflare Workers

Tue, 16/10/2018 - 13:00

The Workers team just announced support for WebAssembly (WASM) within Workers. If you saw my post on Internet Native Apps, you'll know that I believe WebAssembly will play a big part in the apps of the future.

It's exciting times for Rust developers. Cloudflare's Serverless Platform, Cloudflare Workers, allows you to compile your code to WASM, upload to 150+ data centers and invoke those functions just as easily as if they were JavaScript functions. Today I'm going to convert my lipsum generator to use Rust and explore the developer experience (hint: it's already pretty nice).

The Workers team notes in the documentation:

...WASM is not always the right tool for the job. For lightweight tasks like redirecting a request to a different URL or checking an authorization token, sticking to pure JavaScript is probably both faster and easier than WASM. WASM programs operate in their own separate memory space, which means that it's necessary to copy data in and out of that space in order to operate on it. Code that mostly interacts with external objects without doing any serious "number crunching" likely does not benefit from WASM.

OK, I'm unlikely to gain significant performance improvements on this particular project, but it serves as a good opportunity to illustrate the developer experience and tooling.

Categories: Technology

DC CyberWeek Is Here!

Mon, 15/10/2018 - 18:15

Photo by Sarah Ferrante Goodrich / Unsplash

This October is the 15th annual National Cybersecurity Awareness Month in the United States, a collaboration between the US government and industry to raise awareness about the part we can all play in staying more secure online. Here at Cloudflare, where our mission is to help build a better internet, we look forward to this month all year.

As part of this month-long education campaign, Cloudflare is participating in DC CyberWeek this week, the largest cybersecurity festival in the U.S., taking place in Washington, DC. This year’s event is expected to have over 10,000 attendees, more than 100 events, and feature representatives from over 180 agencies, private companies, and service providers. We will join with other leaders in cybersecurity to share best practices, find ways to collaborate, and work to achieve common goals.

Along with the United States, the European Union also runs a month-long cyber awareness campaign in October, with the initiative having started back in 2012. The aim of this advocacy campaign is similar: promoting cybersecurity among citizens and organizations, and providing information on available tools and resources. Watch our CTO speak to some of the main considerations around good cyber hygiene, business practices and appropriate policy making in the field of cybersecurity as part of EU #CyberSecMonth.

Cloudflare’s Cybersecurity Commitment

As well as our own company efforts, we have joined with 60 other global companies to sign on to the Cybersecurity Tech Accord. The Tech Accord is a public commitment to protect and empower civilians to take action to secure the internet. The accord itself covers four simple commitments:

  • That we will protect all of our users everywhere
  • That we will oppose cyberattacks on innocent citizens and enterprises from anywhere
  • That we will help empower users, customers, and developers to strengthen cybersecurity protection
  • That we will partner with each other and with likeminded groups to enhance cybersecurity

But more than that, it is about creating a forum where companies large and small can come together to share best practices, debate threats, and hold each other accountable for our efforts in this arena. It is also a place where we can share ideas for ways in which the government can help shape good cybersecurity hygiene through appropriate laws and policies. Signing on was an easy decision for us; these are commitments we have long supported in practice.

Cloudflare’s Cybersecurity Contribution

Beyond our collaboration with the cybersecurity community, Cloudflare runs two other initiatives, designed to make the internet a more secure place for vulnerable groups who might lack financial or technical resources.


Project Galileo

At Cloudflare, we believe that limited resources shouldn’t preclude vulnerable groups from receiving the support they need. As part of our commitment to the overall health of the internet, we started Project Galileo in 2014 to ensure that at-risk public interest groups are able to stay online securely. We started it in response to cyber attacks launched with the intent of silencing important and vulnerable groups, like humanitarian organizations, political dissidents, and artistic groups. We partner with well-respected free speech, public interest, and civil society organizations to help us identify at-risk websites in need of our pro bono efforts. Once our partners have identified these groups, we extend our DDoS and WAF protection to ensure these websites stay online. The hundreds of websites we protect through Project Galileo range from a national organization providing crisis intervention and suicide prevention services to lesbian, gay, bisexual, transgender and questioning (LGBTQ) young people, to an editorial cartoonist, to an organization designed to help veterans with PTSD.


The Athenian Project

The Athenian Project was born out of a recognition that state and local governments faced similar challenges to those of our Project Galileo participants. In an era of increasing distrust on the internet, it is essential that state and locally run election websites are safe, accurate, and online. So we extended our Enterprise-level services to those sites for free. We believe it’s imperative that voter data and election integrity are maintained, and that we can and should help prevent attackers from stealing sensitive voter information that may allow them to sway an election. Election sites should stay online during peak times, like voter registration deadlines and election days. We have seen huge surges of traffic on those key days, and our anycast network has allowed these sites to stay up.

Moving Forward

We believe CyberWeek is an important time for private companies to spend some time thinking about the broader world. This is just the tip of the iceberg, as we continue to think about new and innovative ways we can be good members of this community. We hope that you will join us in our efforts to help make the internet more secure.

Categories: Technology

Optimizing HTTP/2 prioritization with BBR and tcp_notsent_lowat

Fri, 12/10/2018 - 13:00

Getting the best end-user performance from HTTP/2 requires good support for resource prioritization. While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. This article will expose some of the interactions between the web server, Operating System and network and how to tune a server to optimize performance for end users.

tl;dr

On Linux 4.9 kernels and later, enable BBR congestion control and set tcp_notsent_lowat to 16KB for HTTP/2 prioritization to work reliably. This can be done in /etc/sysctl.conf:

net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_notsent_lowat = 16384
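
To load the new values without a reboot and confirm they took effect, something like this should work (a sketch; flags may vary slightly between distributions):

$ sudo sysctl -p /etc/sysctl.conf
$ sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_notsent_lowat
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_notsent_lowat = 16384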

Browsers and Request Prioritization

A single web page is made up of dozens to hundreds of separate pieces of content that a web browser pulls together to create and present to the user. The main content (HTML) for the page you are visiting is a list of instructions on how to construct the page and the browser goes through the instructions from beginning to end to figure out everything it needs to load and how to put it all together. Each piece of content requires a separate HTTP request from the browser to the server responsible for that content (or if it has been loaded before, it can be loaded from a local cache in the browser).

In a simple implementation, the web browser could wait until everything is loaded and constructed and then show the result, but that would be pretty slow. Not all of the content is critical to the user; it can include things such as images far down the page, analytics for tracking usage, ads, like buttons, etc. All modern browsers work more incrementally, displaying content as it becomes available. This results in a much faster user experience. The visible part of the page can be displayed while the rest of the content is being loaded in the background. Deciding on the best order to request the content in is where browser request prioritization comes into play. Done correctly, the visible content can display significantly faster than with a naive implementation.

[Image: HTML Parser blocking page render for styles and scripts in the head of the document.]

Most modern browsers use similar prioritization schemes which generally look like:

  1. Load similar resources (scripts, images, styles) in the order they were listed in the HTML.
  2. Load styles/CSS before anything else because content cannot be displayed until styles are complete.
  3. Load blocking scripts/JavaScript next because blocking scripts stop the browser from moving on to the next instruction in the HTML until they have been loaded and executed.
  4. Load images and non-blocking scripts (async/defer).

Fonts are a bit of a special case in that they are needed to draw the text on the screen but the browser won’t know that it needs to load a font until it is actually ready to draw the text to the screen. So they are discovered pretty late. As a result they are generally given a very high priority once they are discovered but aren’t known about until fairly late in the loading process.

Chrome also applies some special treatment to images that are visible in the current browser viewport (part of the page visible on the screen). Once the styles have been applied and the page has been laid out it will give visible images a much higher priority and load them in order from largest to smallest.

HTTP/1.x prioritization

With HTTP/1.x, each connection to a server can support one request at a time (practically anyway as no browser supports pipelining) and most browsers will open up to 6 connections at a time to each server. The browser maintains a prioritized list of the content it needs and makes the requests to each server as a connection becomes available. When a high-priority piece of content is discovered it is moved to the front of a list and when the next connection becomes available it is requested.

HTTP/2 prioritization

With HTTP/2, the browser uses a single connection and the requests are multiplexed over the connection as separate “streams”. The requests are all sent to the server as soon as they are discovered along with some prioritization information to let the server know the preferred ordering of the responses. It is then up to the server to do its best to deliver the most important responses first, followed by lower priority responses. When a high priority request comes in to the server, it should immediately jump ahead of the lower priority responses, even mid-response. The actual priority scheme implemented by HTTP/2 allows for parallel downloads with weighting between them and more complicated schemes. For now it is easiest to just think about it as a priority ordering of the resources.

Most servers that support prioritization will send data for the highest priority responses for which they have data available. But if the most important response takes longer to generate than lower priority responses, the server may end up starting to send data for a lower priority response and then interrupt its stream when the higher priority response becomes available. That way it can avoid wasting available bandwidth and head-of-line blocking where a slow response holds everything else up.

[Image: Browser requesting a high-priority resource after several low-priority resources.]

In an optimal configuration, the time to retrieve a top-priority resource on a busy connection with lots of other streams will be identical to the time to retrieve it on an empty connection. Effectively that means that the server needs to be able to interrupt the response streams of all of the other responses immediately with no additional buffering to delay the high-priority response (beyond the minimal amount of data in-flight on the network to keep the connection fully utilized).

Buffers on the Internet

Excessive buffering is pretty much the nemesis for HTTP/2 because it directly impacts the ability for a server to be nimble in responding to priority shifts. It is not unusual for there to be megabytes-worth of buffering between the server and the browser which is larger than most websites. Practically that means that the responses will get delivered in whatever order they become available on the server. It is not unusual to have a critical resource (like a font or a render-blocking script in the <head> of a document) delayed by megabytes of lower priority images. For the end-user this translates to seconds or even minutes of delay rendering the page.

TCP send buffers

The first layer of buffering between the server and the browser is in the server itself. The operating system maintains a TCP send buffer that the server writes data into. Once the data is in the buffer then the operating system takes care of delivering the data as-needed (pulling from the buffer as data is sent and signaling to the server when the buffer needs more data). A large buffer also reduces CPU load because it reduces the amount of writing that the server has to do to the connection.

The actual size of the send buffer needs to be big enough to keep a copy of all of the data that has been sent to the browser but has yet to be acknowledged in case a packet gets dropped and some data needs to be retransmitted. Too small of a buffer will prevent the server from being able to max-out the connection bandwidth to the client (and is a common cause of slow downloads over long distances). In the case of HTTP/1.x (and a lot of other protocols), the data is delivered in bulk in a known-order and tuning the buffers to be as big as possible has no downside other than the increase in memory use (trading off memory for CPU). Increasing the TCP send buffer sizes is an effective way to increase the throughput of a web server.
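
For reference, the kernel's send-buffer autotuning limits live in net.ipv4.tcp_wmem as three values (min, default and max, in bytes); a sketch of the kind of setting often used to raise the ceiling for high-bandwidth, long-distance connections (the numbers are illustrative, not a recommendation from this post):

net.ipv4.tcp_wmem = 4096 65536 16777216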

For HTTP/2, the problem with large send buffers is that it limits the nimbleness of the server to adjust the data it is sending on a connection as high priority responses become available. Once the response data has been written into the TCP send buffer it is beyond the server’s control and has been committed to be delivered in the order it is written.

[Image: High-priority resource queued behind low-priority resources in the TCP send buffer.]

The optimal send buffer size for HTTP/2 is the minimal amount of data required to fully utilize the available bandwidth to the browser (which is different for every connection and changes over time even for a single connection). Practically you’d want the buffer to be slightly bigger to allow for some time between when the server is signaled that more data is needed and when the server writes the additional data.

TCP_NOTSENT_LOWAT

TCP_NOTSENT_LOWAT is a socket option that allows configuration of the send buffer so that it is always the optimal size plus a fixed additional buffer. You provide a buffer size (X), which is the additional amount you’d like beyond the minimum needed to fully utilize the connection, and it dynamically adjusts the TCP send buffer to always be X bytes larger than the current connection congestion window. The congestion window is the TCP stack’s estimate of the amount of data that needs to be in-flight on the network to fully utilize the connection.

TCP_NOTSENT_LOWAT can be configured in code on a socket-by-socket basis if the web server software supports it or system-wide using the net.ipv4.tcp_notsent_lowat sysctl:

net.ipv4.tcp_notsent_lowat = 16384
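
For the socket-by-socket case mentioned above, here is a minimal C sketch of what setting the option on a single connection might look like (illustrative only; a real server would apply it to its accepted connections, and the fallback #define covers older libc headers):

/* tcp_notsent_lowat.c - set TCP_NOTSENT_LOWAT on a socket (Linux) */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25 /* value from linux/tcp.h, for older headers */
#endif

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0); /* stand-in for a real connection socket */
    int lowat = 16384;                        /* keep at most 16KB beyond the congestion window */
    if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat)) != 0)
        perror("setsockopt(TCP_NOTSENT_LOWAT)");
    return 0;
}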

We have a patch we are preparing to upstream for NGINX to make it configurable but it isn’t quite ready yet so configuring it system-wide is required. Experimentally, the value 16,384 (16K) has proven to be a good balance where the connections are kept fully-utilized with negligible additional CPU overhead. That will mean that at most 16KB of lower priority data will be buffered before a higher priority response can interrupt it and be delivered. As always, your mileage may vary and it is worth experimenting with.

[Image: High-priority resource ready to send with minimal TCP buffering.]

Bufferbloat

Beyond buffering on the server, the network connection between the server and the browser can act as a buffer. It is increasingly common for networking gear to have large buffers that absorb data that is sent faster than the receiving side can consume it. This is generally referred to as Bufferbloat. I hedged my explanation of the effectiveness of tcp_notsent_lowat a little bit in that it is based on the current congestion window which is an estimate of the optimal amount of in-flight data needed but not necessarily the actual optimal amount of in-flight data.

The buffers in the network can be quite large at times (megabytes) and they interact very poorly with the congestion control algorithms usually used by TCP. Most classic congestion-control algorithms determine the congestion window by watching for packet loss. Once a packet is dropped then it knows there was too much data on the network and it scales back from there. With Bufferbloat that limit is raised artificially high because the buffers are absorbing the extra packets beyond what is needed to saturate the connection. As a result, the TCP stack ends up calculating a congestion window that spikes to much larger than the actual size needed, then drops to significantly smaller once the buffers are saturated and a packet is dropped and the cycle repeats.

[Image: Loss-based congestion control congestion window graph.]

TCP_NOTSENT_LOWAT uses the calculated congestion window as a baseline for the size of the send buffer it needs to use so when the underlying calculation is wrong, the server ends up with send buffers much larger (or smaller) than it actually needs.

I like to think about Bufferbloat as being like a line for a ride at an amusement park. Specifically, one of those lines where it’s a straight shot to the ride when there are very few people in line but once the lines start to build they can divert you through a maze of zig-zags. Approaching the ride it looks like a short distance from the entrance to the ride but things can go horribly wrong.

Bufferbloat is very similar. When the data is coming into the network slower than the links can support, everything is nice and fast:

[Image: Response traveling through the network with no buffering.]

Once the data comes in faster than it can go out the gates are flipped and the data gets routed through the maze of buffers to hold it until it can be sent. From the entrance to the line it still looks like everything is going fine since the network is absorbing the extra data but it also means there is a long queue of the low-priority data already absorbed when you want to send the high-priority data and it has no choice but to follow at the back of the line:

[Image: Responses queued in network buffers.]

BBR congestion control

BBR is a new congestion control algorithm from Google that uses changes in packet delays to model the congestion instead of waiting for packets to drop. Once it sees that packets are taking longer to be acknowledged it assumes it has saturated the connection and packets have started to buffer. As a result the congestion window is often very close to the optimal needed to keep the connection fully utilized while also avoiding Bufferbloat. BBR was merged into the Linux kernel in version 4.9 and can be configured through sysctl:

net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
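
You can check that the running kernel actually offers BBR before switching to it; a sketch (the exact list in the output depends on which congestion control modules are available):

$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr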

BBR also tends to perform better overall since it doesn’t require packet loss as part of probing for the correct congestion window and also tends to react better to random packet loss.

[Image: BBR congestion window graph.]

Back to the amusement park line, BBR is like having each person carry one of the RFID cards they use to measure the wait time. Once the wait time looks like it is getting slower the people at the entrance slow down the rate that they let people enter the line.

[Image: BBR detecting network congestion early.]

This way BBR essentially keeps the line moving as fast as possible and prevents the maze of lines from being used. When a guest with a fast pass arrives (the high-priority request) they can jump into the fast-moving line and hop right onto the ride.

[Image: BBR delivering responses without network buffering.]

Technically, any congestion control that keeps Bufferbloat in check and maintains an accurate congestion window will work for keeping the TCP send buffers in check, BBR just happens to be one of them (with lots of good properties).

Putting it all together

The combination of TCP_NOTSENT_LOWAT and BBR reduces the amount of buffering on the network to the absolute minimum and is CRITICAL for good end-user performance with HTTP/2. This is particularly true for NGINX and other HTTP/2 servers that don’t implement their own buffer throttling.

The end-user impact of correct prioritization is huge and may not show up in most of the metrics you are used to watching (particularly any server-side metrics like requests-per-second, request response time, etc).

Even on a 5Mbps cable connection proper resource ordering can result in rendering a page significantly faster (and the difference can explode to dozens of seconds or even minutes on a slower connection). Here is a relatively common case of a WordPress blog served over HTTP/2:

[Image: The page from the tuned server (After) starts to render at 1.8 seconds.]
[Image: The page from the tuned server (After) is completely done rendering at 4.5 seconds, well before the default configuration (Before) even started to render.]
[Image: Finally, at 10.2 seconds the default configuration started to render (8.4 seconds later, or 5.6 times slower than the tuned server).]
[Image: Visually complete on the default configuration arrives at 10.7 seconds (6.2 seconds later, or 2.3 times slower than the tuned server).]

Both configurations served the exact same content using the exact same servers with “After” being tuned for TCP_NOTSENT_LOWAT of 16KB (both configurations used BBR).

Identifying Prioritization Issues In The Wild

If you look at a network waterfall diagram of a page loading, prioritization issues will show up as high-priority requests completing much later than lower-priority requests from the same origin. Usually that will also push metrics like First Paint and DOM Content Loaded (the vertical purple bar below) much later.

[Image: Network waterfall showing critical CSS and JavaScript delayed by images.]

When prioritization is working correctly you will see critical resources all completing much earlier and not be blocked by the lower-priority requests. You may still see SOME low-priority data download before the higher-priority data starts downloading because there is still some buffering even under ideal conditions but it should be minimal.

[Image: Network waterfall showing critical CSS and JavaScript loading quickly.]

Chrome 69 and later may hide the problem a bit. Chrome holds back lower-priority requests even on HTTP/2 connections until after it has finished processing the head of the document. In a waterfall it will look like a delayed block of requests that all start at the same time after the critical requests have completed. That doesn’t mean that it isn’t a problem for Chrome, just that it isn’t as obvious. Even with the staggering of requests there are still high-priority requests outside of the head of the document that can be delayed by lower-priority requests. Most notable are any blocking scripts in the body of the page and any external fonts that were not preloaded.

[Image: Network waterfall showing Chrome delaying the requesting of low-priority resources.]

Hopefully this post gives you a deeper understanding of how HTTP/2 prioritization works, the ability to identify prioritization issues when they happen, and some tools to fix them when they appear.

Categories: Technology

Happy National Coming Out Day: Stories from Proudflare

Thu, 11/10/2018 - 21:01

Today is the 30th Anniversary of National Coming Out Day. We wanted to share some coming out stories from members of Proudflare and draw attention to resources the Human Rights Campaign provides to those who are thinking about coming out or wish to be supportive of those who come out to them.

About National Coming Out Day

On October 11, 1987, about 500,000 people marched on Washington for Lesbian and Gay Rights. This was the second demonstration of this type in the capital and it resulted in the formation of several LGBTQ organizations.

In the late 1980s, the LGBTQ community recognized that they often reacted defensively to anti-LGBTQIA+ actions, and the community came up with the idea of a national day for celebrating coming out. The anniversary of the 1987 march was chosen as that national day.

Each year on October 11th, National Coming Out Day continues to promote a safe world for LGBTQ individuals to live truthfully and openly.

Source: https://www.hrc.org/resources/the-history-of-coming-out

Coming out stories from Proudflare

Here are seven examples of the coming out stories that surfaced from a company-wide awareness campaign. I hope you’ll enjoy reading these and will find inspiration in them. Let’s all be loud and proud and supportive of our (often silent) community members in their own coming out processes.

My Prima Bella

We were teenagers when my cousin (then male) originally came out as gay. We were and still are very close. We were born the same year, traveled Europe as small children, understood various languages and were both very adaptable middle children. Both our families settled in California when we returned to the US and continued to see each other regularly over the years. This gay coming out was no surprise to our large Latino family. We always accepted her just the way she was. It was later on when we were in college, I took a call from her when she was elated to tell me she was now, "working as a woman." That's when everything came into focus and we cried together over her transition to her true female self. She is an inspiration to me, my husband, our children and all the extended family who hold her dear, among many others. I couldn't be more proud of her and count myself lucky to be related to such a talented, honest, creative, beautiful and hard working woman.

My first love happened when I was 16 years old

From Rachel


My first love happened when I was sixteen years old. We dated for four years and had what I considered a normal break up for that age. He wanted to pursue dreams in LA and I wanted to be in the Bay Area close to my family. We both agreed we were too young for long distance, so we amicably went our separate ways and promised to remain friends. We stayed in touch over the years and tried to maintain that we could remain best of friends despite being broken-hearted. I went to visit him a few times and noticed some trends in his friends. He had a lot of gay friends and we went to gay bars while I was there. I chalked it up to the industry that he was in (male model), but I would be lying if I didn't say I started to feel suspicious. Finally by the third time I came to visit, it just seemed so apparent that he had found another part of himself: one that seemed to make him feel at home. I cornered him one evening in a bar and said, "Please just tell me," and his response was, "Why? You already know," to which I said, "Because I need to hear it from you." He then turned to me and said, "I am gay". I looked at him, I kissed him, and my response was and will always be, "And I still love you. You are still the same person to me.”

What people don't know is that because he had been a model I was teased about my "gay" boyfriend while we dated. What people don't know is that I was suspicious of this at the end of our relationship, but at twenty years old how do you talk to someone about that? It was obvious he was closed off and I wasn't ready to admit that I thought my boyfriend of four years was gay. What I did know was that my feelings were not as important as what he was going through. I knew him. I knew how he fought this. I knew that in his head the happily ever after was supposed to be me, or a version of me (aka female), with a white picket fence and children. I knew if I told him how crushed I was at the time it would kill him. So I told him what I truly believe inside my soul to be true and that was, "You are and will always be the same person you have always been to me. You are the same good human that puts everyone else first and are one of the most loyal people I know. You treated me with respect, have always been so loving, and showed me what a good relationship was. I am so proud of you for showing me who you truly are inside and I will stand by your side the rest of my life."

I was best woman in his wedding to a man. Some people don't understand our story. Some people ask me if I felt like our relationship wasn't real because he turned out to be gay. To that, I say our relationship was more real than most. His final choice in sexuality has nothing to do with that. Again, he has and always will be the same person to me. It doesn't change our history. We were a boy and a girl who at the time fell in love and who have now since found the loves of our lives in other people. His just happened to be a man.

I forgot to come out and it still gave me rest

From Daan

Around the age of thirteen, I knew that I had more romantic attractions towards the same sex. I didn’t have a crush, as many love stories tend to start, but I noticed my fellow students showed much more interest in girls than I did, and so I came to the conclusion I was gay. It was that simple, quick and painless. In the next few months I told my friends about it; in my way I was proud about it, proud that I was able to be different.

>> Fast forward four years.

Growing up in the liberal lights of Amsterdam I’ve never had the feeling that coming out was a subject I had to worry about. My mother went to Paris with my sister for the weekend and I had the house to myself, and during this weekend I remembered I had never told my parents that I was gay; it was just never a thing. I decided that when she returned I would tell her. After she returned on Sunday I asked her to sit down because I wanted to tell her something important. She turned all white and asked: what happened? I told her that everything was fine and that I wanted to tell her that I was gay and would come home with a guy at some point. She directly got up from her chair and, I remember this like nothing else, she said “please, never scare me again, I thought something serious happened, you don’t have to tell me you’re gay, as a mother I know”. After that I went to my dad (my parents are divorced) and he replied the same, that he already knew and that it was all good. In a way I expected nothing else but I was still happy with the way it went. I wish everyone the same, and supportive friends. Don’t be worried about the world, put yourself first and the right people will come to you.

There is no single coming out story - nearly every day involves coming out

From Malavika

There is no single coming out story - nearly every day involves coming out. Of course, it is most difficult to come out to the people who are most important to you or whose judgement impacts your life in significant ways, but being out and coming out is a continuous process.

As a bisexual person, coming out becomes even harder: what label do I use? Is it easier to say I am a lesbian? Is it easier to just not say anything at all when I am married to a man? When I first started coming out in a professional context in my early twenties, I simply identified as a lesbian. The label bisexual is often treated either flippantly or with suspicion. But, several years into my first job, I had a serious relationship with a man. I had to come out once again, but this time as a bisexual! It was actually even harder to come out the second time, because at this point, my coworkers and mentors had known me for years as a lesbian. I even had senior executives who were gay invested in my career because of my identity as a lesbian, and I felt as if I would disappoint them by being with a man. Even my mother didn't quite understand my sexuality. Concerned that I was not being true to myself, she told me, "If you like women, you should marry a woman, you don't have to please me or society, I just want you to be happy."

I am now happily married to a man, but I still feel it is important for me to be out as a bisexual woman. It is important for bisexual individuals to maintain this identity, because identifying as straight or gay ignores the totality of our romantic experiences. And with that, I come out once again, proudly, as bisexual.

I have never lived a day as happily as the day I accepted myself

From Rex

The day after the Supreme Court legalized same-sex marriage, I asked my mom what she thought about it. I told her that she should be happy because it means I can get married one day. She cried a bit. But not for long. It didn't start easily, but 3 months later, they were ready to meet my boyfriend and make him part of the family. From there it has been coming out every day to different people, but it makes me happiest to be myself and not a wolf in sheep's clothing.

My Son's Coming Out Story

From Sherry

 Stories from Proudflare

Juliao [left] at age 16 and I [right] at his high school fashion show.

I love telling this story—my son Juliao came out to me at age six.

We had just moved to Santa Monica. Being new to the area, I set out to make new connections on the then popular platform MySpace. One day, a friend named Luna came over to hang out. We were chatting in Juliao’s bedroom while he was playing with his rather large collection of My Little Ponies. I mentioned to Luna that I found it remarkable that most of the folks I had reached out to over the social media platform were gay. I elaborated that they were the most interesting and the best looking.  

Juliao chimed in, “I’m gay” in a very matter of fact way, shrugging his shoulders. Luna and I turned to him, amazed. Luna replied, “How do you know, Juliao? What does that mean?” Juliao quickly answered, “When two boys love each other.” [Like duh.]    

We didn’t make a “big enchilada” of his revelation, though inside I was beaming. I was extremely proud that he could articulate a part of his identity so clearly and fearlessly.

"I feel it's important I tell you that I was recently dating a guy."

From Andrew

I was 24 years old when I first fell in love with a man. Before I met him, I actually thought I was dating men as part of an experimental phase in life. My boyfriend went to school in New York and I lived in Boston, so I'd sneak away on weekends to visit him and lie to my family and friends about where I was and what I was doing.  After we broke up, I knew I needed to come out to my friends and family. I hated that I had been lying to them and to myself.      

It took me a couple weeks to work up the courage to send my mother, father, and brother an email, sharing what was going on. I concluded the email with, "I can't really predict how you'll take this, so I'll probably be avoiding you for a while.  Send me an email when you can to let me know when it's not awkward to talk to you."  

My family welcomed the news swiftly with warmth and support. I was very fortunate to have a wonderful, loving family.

Resources for living openly

To find resources about living openly, visit the Human Rights Campaign’s Coming Out Center. I hope you'll be true to yourselves and always be loud and proud.

About Proudflare

To read more about Proudflare and why Cloudflare cares about inclusion in the workplace, read Proudflare’s pride blog post.

Categories: Technology

Graceful upgrades in Go

Thu, 11/10/2018 - 15:30
Graceful upgrades in Go

The idea behind graceful upgrades is to swap out the configuration and code of a process while it is running, without anyone noticing. If this sounds error-prone, dangerous, undesirable and in general a bad idea – I’m with you. However, sometimes you really need them. Usually this happens in an environment where there is no load-balancing layer. We have environments like this at Cloudflare, which led us to investigate and implement various solutions to this problem.

Dingle Dangle! by Grant C. (CC-BY 2.0)

Coincidentally, implementing graceful upgrades involves some fun low-level systems programming, which is probably why there are already a bajillion options out there. Read on to learn what trade-offs there are, and why you should really really use the Go library we are about to open source. For the impatient, the code is on GitHub and you can read the documentation on godoc.

The basics

So what does it mean for a process to perform a graceful upgrade? Let’s use a web server as an example: we want to be able to fire HTTP requests at it, and never see an error because a graceful upgrade is happening.

We know that HTTP uses TCP under the hood, and that we interface with TCP using the BSD socket API. We have told the OS that we’d like to receive connections on port 80, and the OS has given us a listening socket, on which we call accept() to wait for new clients.
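For reference, here is roughly what that looks like in Go (a minimal sketch of my own, not code from the library discussed below):

package main

import (
	"log"
	"net"
)

func main() {
	// Ask the OS for a listening socket on port 80.
	ln, err := net.Listen("tcp", ":80")
	if err != nil {
		log.Fatal(err)
	}
	for {
		// Accept blocks until the kernel hands us a new connection.
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handleConn(conn)
	}
}

func handleConn(conn net.Conn) {
	defer conn.Close()
	// ... speak HTTP (or any other protocol) here ...
}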

A new client will be refused if the OS doesn’t know of a listening socket for port 80, or nothing is calling accept() on it. The trick of a graceful upgrade is to make sure that neither of these two things occurs while we somehow restart our service. Let’s look at all the ways we could achieve this, from simple to complex.

Just Exec()

OK, how hard can it be? Let’s just Exec() the new binary (without doing a fork first). This does exactly what we want, by replacing the currently running code with the new code from disk.

// The following is pseudo-Go.
func main() {
	var ln net.Listener
	if isUpgrade {
		ln = net.FileListener(os.NewFile(uintptr(fdNumber), "listener"))
	} else {
		ln = net.Listen(network, address)
	}
	go handleRequests(ln)

	<-waitForUpgradeRequest

	syscall.Exec(os.Argv[0], os.Argv[1:], os.Environ())
}

Unfortunately this has a fatal flaw since we can’t “undo” the exec. Imagine a configuration file with too much white space in it or an extra semicolon. The new process would try to read that file, get an error and exit.

Even if the exec call works, this solution assumes that initialisation of the new process is practically instantaneous. We can get into a situation where the kernel refuses new connections because the listen queue is overflowing.

New connections may be dropped if Accept() is not called regularly enough

Specifically, the new binary is going to spend some time after Exec() to initialize, which delays calls to  Accept(). This means the backlog of new connections grows until some are dropped. Plain exec is out of the game.

Listen() all the things

Since just using exec is out of the question, we can try the next best thing. Let’s fork and exec a new process, which then goes through its usual start-up routine. At some point it will create a few sockets by listening on some addresses. Except that won’t work out of the box, due to errno 48, otherwise known as Address Already In Use. The kernel is preventing us from listening on the address and port combination used by the old process.

Of course, there is a flag to fix that: SO_REUSEPORT. This tells the kernel to ignore the fact that there is already a listening socket for a given address and port, and just allocate a new one.

func main() {
	ln := net.ListenWithReusePort(network, address)
	go handleRequests(ln)

	<-waitForUpgradeRequest

	cmd := exec.Command(os.Argv[0], os.Argv[1:])
	cmd.Start()

	<-waitForNewProcess
}
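The net.ListenWithReusePort above is pseudo-code; the standard library has no such helper. One way to get the same effect in real Go (a minimal sketch of my own, assuming a Linux target and golang.org/x/sys/unix) is a net.ListenConfig whose Control hook sets SO_REUSEPORT before the bind:

package reuseport

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenWithReusePort is an illustrative stand-in for the pseudo-code above:
// it sets SO_REUSEPORT on the socket before bind(), so a second process can
// listen on the same address and port.
func listenWithReusePort(network, address string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), network, address)
}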

Now both processes have working listening sockets and the upgrade works. Right?

SO_REUSEPORT is a little bit peculiar in what it does inside the kernel. As systems programmers, we tend to think of a socket as the file descriptor that is returned by the socket call. The kernel however makes a distinction between the data structure of a socket, and one or more file descriptors pointing at it. It creates a separate socket structure if you bind using SO_REUSEPORT, not just another file descriptor. The old and the new process are thus referring to two separate sockets, which happen to share the same address. This leads to an unavoidable race condition: new-but-not-yet-accepted connections on the socket used by the old process will be orphaned and killed by the kernel. GitHub wrote an excellent blog post about this problem.

The engineers at GitHub solved the problems with SO_REUSEPORT by using an obscure feature of the sendmsg syscall called ancillary data. It turns out that ancillary data can include file descriptors. Using this API made sense for GitHub, since it allowed them to integrate elegantly with HAProxy. Since we have the luxury of changing the program, we can use simpler alternatives.

NGINX: share sockets via fork and exec

NGINX is the tried and trusted workhorse of the Internet, and happens to support graceful upgrades. As a bonus we also use it at Cloudflare, so we were confident in its implementation.

NGINX uses a process-per-core model, which means that instead of spawning a bunch of threads, it runs a process per logical CPU core. Additionally, there is a master process which orchestrates graceful upgrades.

The master is responsible for creating all listen sockets used by NGINX and sharing them with the workers. This is fairly straightforward: first, the FD_CLOEXEC bit is cleared on all listen sockets. This means that they are not closed when the exec() syscall is made. The master then does the customary fork() / exec() dance to spawn the workers, passing the file descriptor numbers as an environment variable.
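A rough Go equivalent of that hand-off (a sketch of my own, not NGINX’s code; the LISTEN_FD environment variable name is made up for illustration): Go’s os/exec arranges for ExtraFiles to be inherited by the child, which plays the role of clearing FD_CLOEXEC, and the child rebuilds a net.Listener from the advertised descriptor number.

package inherit

import (
	"fmt"
	"net"
	"os"
	"os/exec"
	"strconv"
)

// Parent side: hand the listening socket to a freshly spawned child.
func spawnChildWithListener(ln *net.TCPListener) error {
	f, err := ln.File() // duplicate of the listening socket
	if err != nil {
		return err
	}
	cmd := exec.Command(os.Args[0])
	// ExtraFiles[0] becomes file descriptor 3 in the child process,
	// and is kept open across the fork/exec.
	cmd.ExtraFiles = []*os.File{f}
	// Advertise the descriptor number through the environment
	// (LISTEN_FD is a made-up name for this sketch).
	cmd.Env = append(os.Environ(), fmt.Sprintf("LISTEN_FD=%d", 3))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Start()
}

// Child side: rebuild a net.Listener from the inherited descriptor.
func inheritedListener() (net.Listener, error) {
	fd, err := strconv.Atoi(os.Getenv("LISTEN_FD"))
	if err != nil {
		return nil, err
	}
	return net.FileListener(os.NewFile(uintptr(fd), "inherited-listener"))
}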

Graceful upgrades make use of the same mechanism. We can spawn a new master process (PID 1176) by following the NGINX documentation. This inherits the existing listeners from the old master process (PID 1017) just like workers do. The new master then spawns its own workers:

CGroup: /system.slice/nginx.service
├─1017 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
├─1019 nginx: worker process
├─1021 nginx: worker process
├─1024 nginx: worker process
├─1026 nginx: worker process
├─1027 nginx: worker process
├─1028 nginx: worker process
├─1029 nginx: worker process
├─1030 nginx: worker process
├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
├─1187 nginx: worker process
├─1188 nginx: worker process
├─1190 nginx: worker process
├─1191 nginx: worker process
├─1192 nginx: worker process
├─1193 nginx: worker process
├─1194 nginx: worker process
└─1195 nginx: worker process

At this point there are two completely independent NGINX processes running. PID 1176 might be a new version of NGINX, or could use an updated config file. When a new connection arrives for port 80, one of the 16 worker processes is chosen by the kernel.

After executing the remaining steps, we end up with a fully replaced NGINX:

CGroup: /system.slice/nginx.service
├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
├─1187 nginx: worker process
├─1188 nginx: worker process
├─1190 nginx: worker process
├─1191 nginx: worker process
├─1192 nginx: worker process
├─1193 nginx: worker process
├─1194 nginx: worker process
└─1195 nginx: worker process

Now, when a request arrives the kernel chooses between one of the eight remaining processes.

This process is rather fickle, so NGINX has a safeguard in place. Try requesting a second upgrade while the first hasn’t finished, and you’ll find the following message in the error log:

[crit] 1176#1176: the changing binary signal is ignored: you should shutdown or terminate before either old or new binary's process

This is very sensible: there is no good reason why there should be more than two processes at any given point in time. Ideally, we want this behaviour from our Go solution as well.

Graceful upgrade wishlist

The way NGINX has implemented graceful upgrades is very nice. There is a clear life cycle which determines valid actions at any point in time:

Graceful upgrades in Go

It also solves the problems we’ve identified with the other approaches. Really, we’d like NGINX-style graceful upgrades as a Go library, one where:

  • No old code keeps running after a successful upgrade
  • The new process can crash during initialisation, without bad effects
  • Only a single upgrade is active at any point in time

Of course, the Go community has produced some fine libraries just for this occasion. We looked at

just to name a few. Each of them is different in its implementation and trade-offs, but none of them ticked all of our boxes. The most common problem is that they are designed to gracefully upgrade an http.Server. This makes their API much nicer, but removes flexibility that we need to support other socket-based protocols. So really, there was absolutely no choice but to write our own library, called tableflip. Having fun was not part of the equation.

tableflip

tableflip is a Go library for NGINX-style graceful upgrades. Here is what using it looks like:

upg, _ := tableflip.New(tableflip.Options{})
defer upg.Stop()

// Do an upgrade on SIGHUP
go func() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGHUP)
	for range sig {
		_ = upg.Upgrade()
	}
}()

// Start a HTTP server
ln, _ := upg.Fds.Listen("tcp", "localhost:8080")
server := http.Server{}
go server.Serve(ln)

// Tell the parent we are ready
_ = upg.Ready()

// Wait to be replaced with a new process
<-upg.Exit()

// Wait for connections to drain.
server.Shutdown(context.TODO())

Calling Upgrader.Upgrade spawns a new process with the necessary net.Listeners, and waits for the new process to signal that it has finished initialisation, to die or to time out. Calling it when an upgrade is ongoing returns an error.

Upgrader.Fds.Listen is inspired by facebookgo/grace and allows inheriting net.Listener easily. Behind the scenes, Fds makes sure that unused inherited sockets are cleaned up. This includes UNIX sockets, which are tricky due to UnlinkOnClose. You can also pass straight up *os.File to the new process if you desire.

Finally, Upgrader.Ready cleans up unused fds and signals the parent process that initialization is done. The parent can then exit, which completes the graceful upgrade cycle.

Categories: Technology

Introducing Single Sign-On for the Cloudflare Dashboard

Wed, 10/10/2018 - 18:00
Introducing Single Sign-On for the Cloudflare Dashboard

The Challenge of Managing User Access to SaaS Applications

As the number of SaaS services people use every day grows, it has become more challenging to juggle the number of password and multi-factor authentication combinations users have to keep track of to get online.

Adopting identity services has allowed companies to centralize employee authentication. With Cloudflare Access, companies can ensure employees use a company-managed identity provider when accessing websites behind Cloudflare. Last week, Sam published a blog on how Cloudflare has made it easier to connect Cloudflare Access to the Atlassian suite of tools.

Since Cloudflare has simplified access control for corporate applications, many enterprise customers have commonly asked for the ability to extend the same ease of access and control to the Cloudflare dashboard itself.

Single Sign-On for the Cloudflare Dashboard

Today, we are announcing support for enterprise customers to use single sign-on (SSO) through their identity provider to access the Cloudflare dashboard.

Cloudflare is a critical piece of infrastructure for customers, and SSO ensures that customers can apply the same authentication policies to access the Cloudflare dashboard as other critical resources.

Introducing Single Sign-On for the Cloudflare Dashboard


Once a customer is onboarded for SSO, all of their users’ logins to the Cloudflare dashboard redirect to the customer’s identity provider. Once all required authentication checks complete successfully, the user is seamlessly redirected back to dash.cloudflare.com and logged in.

Leveraging Access & Workers to Build SSO

At Cloudflare, we  dogfood our own services as both a way to make them better for our customers and to make developing new services more efficient and robust. With SSO, this is no different. Authentication configurations are managed through Access, which allows us to launch with support for the same identity providers available in Access today, including SAML.

Cloudflare is 8 years old and we built our user authentication system way before Cloudflare Access existed. In order to connect Access to our existing authentication system, we built a Cloudflare Worker that converts Access authentication tokens to our own authentication tokens. This greatly simplified the code changes required in our system, and results in faster SSO logins because the Worker runs at the network edge and reduces the number of round trips required to authenticate users.

In addition to leveraging Cloudflare services to build Single Sign-On, we are moving all Cloudflare employees to use SSO through our existing G Suite setup. This ensures Cloudflare can uniformly enforce multi-factor authentication policies for the services we protect with Cloudflare itself.

How to Start using SSO for the Cloudflare Dashboard

Cloudflare Enterprise customers can reach out to their Customer Success Manager to learn how to start using SSO to log in to the Cloudflare dashboard. If you are interested in using SSO yourself and becoming a Cloudflare Enterprise customer, then please get in touch.

Categories: Technology

A Tour Inside Cloudflare's G9 Servers

Wed, 10/10/2018 - 16:17
A Tour Inside Cloudflare's G9 Servers

Cloudflare operates at a significant scale, handling nearly 10% of all Internet HTTP requests, which at peak is more than 25 trillion requests through our network every month. To ensure this is as efficient as possible, we own and operate all the equipment in our 154 locations around the world in order to process the volume of traffic that flows through our network. We spend a significant amount of time speccing and designing the servers that make up our network to meet our ever-changing and growing demands. At regular intervals, we take everything we've learned about our last generation of hardware and refresh each component with the next generation…

If the above paragraph sounds familiar, it’s a look back at where we were 5 years ago, restated with today’s numbers. We’ve made a lot of progress engineering and developing our tools with the latest tech through the years, pushing ourselves to get smarter at what we do.

Here though we’re going to blog about muscle.

Since the last time we blogged about our G4 servers, we’ve iterated one generation each of the past 5 years. Our latest generation is now the G9 server. From a G4 server comprising 12 Intel Sandybridge CPU cores, our G9 server has 192 Intel Skylake CPU cores ready to handle today’s load across Cloudflare’s network. This server is QCT’s T42S-2U multi-node server, where we have 4 nodes per chassis, so each node has 48 cores. Maximizing compute density is the primary goal, since rental colocation space and power are costly. This 2U4N chassis form factor has served us well for the past 3 generations, so we’re revisiting this option once more.

A Tour Inside Cloudflare's G9 Servers

Exploded picture of the G9 server’s main components. 4 sleds represent 4 nodes, each with 2 24-core Intel CPUs

Each high-level hardware component has gone through its own upgrade as well, for a balanced scale-up that keeps our stack CPU-bound, making this generation the most radical revision since we moved on from using HP 5 years ago. Let’s glance through each of those components.

Hardware Changes

CPU

  • Previously: 2x 12-core Intel Xeon Silver 4116 2.1GHz 85W
  • Now: 2x 24-core Intel custom off-roadmap 1.9GHz 150W

The performance of our infrastructure is heavily directed by how much compute we can squeeze into a given physical space and power budget. In essence, requests per second (RPS) per Watt is a critical metric, one where Qualcomm’s 46-core ARM64 Falkor chip had a big advantage over Intel’s Skylake 4116.

Intel proposed to co-innovate with us on an off-roadmap 24-core Xeon Gold CPU made specifically for our workload, offering considerable value in performance per Watt. For this generation, we continue using Intel, as system solutions are widely available, while we work on bringing ARM64’s benefits to production. We expect this CPU to deliver better RPS per Watt right off the bat: roughly double the RPS from doubling the number of cores, while power consumption rises to about 174% as each CPU’s TDP goes from 85W to 150W.
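Taking those two figures at face value (a back-of-the-envelope check using the numbers above, nothing more), the expected improvement in RPS per Watt from the CPU change alone is

\[
\frac{\mathrm{RPS}_{\mathrm{G9}}/P_{\mathrm{G9}}}{\mathrm{RPS}_{\mathrm{G8}}/P_{\mathrm{G8}}} \approx \frac{2.00}{1.74} \approx 1.15,
\]

i.e. roughly 15% better RPS per Watt before counting any of the other platform improvements.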

Disk
  • Previously: 6x Micron 1100 512G SATA SSD
  • Now: 6x Intel S4500 480G SATA SSD

With all the requests we foresee the G9 processing, we need to tame the outlying and long-tail latencies we have seen in our previous SSDs. Lowering p99 and p999 latency has been a serious endeavor. Saving milliseconds in disk response time for 0.01% or even 0.001% of all the traffic we see isn’t a joke!

Datacenter-grade SSDs, in the form of the Intel S4500, will proliferate across our fleet. These disks come with better endurance to last over the expected service life of our servers, and better performance consistency with lower p95+ latency.

Network
  • Previously: dual-port 10G Solarflare Flareon Ultra SFN8522
  • Now: dual-port 25G Mellanox ConnectX-4 Lx OCP

Our DDoS mitigation is done entirely in userspace, so the network adapter can be any model on the market as long as it supports XDP. We went with Mellanox for their solid reliability and their readily available 2x25G CX4 model. Upgrading to a 25G intra-rack Ethernet network is easy future-proofing, since the 10G SFP+ port shares the same physical form factor as 25G’s SFP28. Switch and NIC vendors offer models that can be configured as either 10G or 25G.

Another change is the adapter’s form factor itself being an OCP mezzanine instead of the more conventional Low Profile sized card. QCT is a server system integrator participating in the Open Compute Project, a non-profit organization establishing an open source hardware ecosystem founded by Facebook, Intel, and Rackspace. Their T42S-2U motherboards each include 2 PCIe x16 Gen3 expansion slots: 1 fit for a regular I/O card and 1 for an OCP mezzanine. The form factor change allows us to occupy the OCP slot leaving the regular slot free to integrate something else that may not be offered with an OCP form factor like a high capacity NVMe SSD or a GPU. We like that our server has the room for upgrades if needed.

A Tour Inside Cloudflare's G9 Servers

Both Low Profile and OCP adapters offer the same throughput and features

A Tour Inside Cloudflare's G9 Servers

Rear side of G9 chassis showing all 4 sled nodes, each leaving room to add on a PCI card

Memory
  • Previously: 192G (12x16G) DDR4 2400MHz RDIMM
  • Now: 256G (8x32G) DDR4 2666MHz RDIMM

Going from 192G (12x16G) to 256G (8x32G) made practical sense. The motherboard has 12 DIMM channels, which were all populated in the G8. We want to have the ability to upgrade just in case, as well as keeping memory configuration balanced and at optimal bandwidth capacity. 8x32G works well leaving 4 channels open for future upgrades.

Physical stress test

Our software stack scales nicely enough that we can confidently assume we’ll handle double the requests with twice the CPU cores of the G8. What we need to ensure before we ship any G9 servers out to our current 154 and future PoPs is that there won’t be any design issues pertaining to thermal or power failures. In the extreme case that all of our cores run at 100% load, would that cause our server to run above operating temperature? How much power would a whole server with 192 cores totaling 1200W TDP consume? We set out to record both by applying a stress test to the whole system.

Temperature readings were recorded from ipmitool sdr list, then graphed to show socket and motherboard temperature. With 2U4N being such a compact form factor, it’s worth checking that a server running hot figuratively isn’t also running hot literally. The red lines represent the 4 nodes that compose the whole G9 server under test; the blue lines represent G8 nodes (we didn’t stress the G8s, so their temperature readings are constant).

A Tour Inside Cloudflare's G9 Servers

A Tour Inside Cloudflare's G9 Servers

Both graphs look fine and under control, thanks mostly to the T42S-2U’s four 80mm x 80mm fans, each capable of blowing over 90 CFM; we did drive them to their max spec RPM during the test.

Recording the new system’s max power consumption gives us the critical information we need to properly design our rack stack: choosing the right Power Distribution Unit, and ensuring we stay below the budgeted power while keeping adequate phase balancing. For example, a typical 3-phase US-rated 24-Amp PDU gives you a max power rating of 8.6 kilowatts. We wouldn’t be able to fit 9 servers powered by that same PDU if each were running at 1kW without any way to cap them.
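For what it’s worth, the 8.6 kW figure is consistent with a 208 V line-to-line three-phase feed (a common US rack voltage; the voltage is my assumption, not stated in the post):

\[
P = \sqrt{3} \times V_{LL} \times I = 1.732 \times 208\,\mathrm{V} \times 24\,\mathrm{A} \approx 8.6\,\mathrm{kW}
\]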

A Tour Inside Cloudflare's G9 Servers

A Tour Inside Cloudflare's G9 Servers

The graph above (right) shows our max power to be 1.9kW (the red line), or roughly 475W per node, which is excellent for a modern server. Notice the blue and yellow lines representing the G9’s 2 power supplies, which sum to the total power. The yellow-line PSU appearing off is intentional, part of our testing procedure to show the PSU’s resilience to abrupt power changes.

Stressing all available CPU, I/O, and memory while maxing out fan RPMs is a good indicator of the highest possible heat and power draw this server can produce. Hopefully we won’t ever see such an extreme case in live production environments, and we expect much milder actual results (read: we don’t expect catastrophic failures to be possible).

First Impression in live production

We increased capacity at one of our most loaded PoPs by adding G9 servers. The following time graphs cover a 24-hour range and show how G9 performance compares with G8 in a live PoP.

A Tour Inside Cloudflare's G9 Servers

Great! They're doing over 2x the requests compared to G8 with about 10% less CPU usage. Note that all results here are based on non-optimized systems, so we could add more load to the G9s and bring their CPU usage in line with the G8s. Additionally, they're doing that amount of work with better CPU processing time, shown as nginx execution time. You can see the latency gap between generations widening as we go towards the 99.9th percentile:

A Tour Inside Cloudflare's G9 Servers

Long-tail latencies for NGINX CPU processing time (lower is better)

Talking about latency, let’s check how our new SSDs are doing on that front:

A Tour Inside Cloudflare's G9 Servers

Cache disk IOPS and latency (lower is better)

The trend still holds: G9 is doing better. It’s a good thing that the G9’s SSDs aren’t seeing as many IOPS, since it means we’re not hitting cache disks as often and are able to store and process more in CPU and memory. We’ve cut the read cache hits and latency by half. Fewer writes result in better performance consistency and longevity.

Another metric where the G9 does more is power consumption: about 55% more than the G8. While that’s not something to brag about, it is expected when going from CPUs rated at 85W TDP to ones rated at 150W, and it should be weighed against how much work the G9 servers do:

A Tour Inside Cloudflare's G9 Servers

G9 is actually 1.5x more power efficient than G8. Temperature readings were checked as well. Inlet and outlet chassis temps, as well as CPU temps, are well within operating temperatures.

Now that’s muscle! In other words, for every 3 G8 servers, just 2 G9s would take on the same workload. If one of our racks would normally have 9 G8 servers, we can switch those out for only 6 G9s. Conversely, planning to turn up a cage of 10 G9 racks would be the same as if we did 15 G8 racks!

We have big plans to cover our entire network with G9 servers, with most of them planned for the existing cities your site most likely uses. By 2019, you’ll benefit from increased bandwidth and lower wait times. And we’ll benefit by expanding and turning up datacenters more quickly and efficiently.

What's next?

Gen X? These are exciting times at Cloudflare. Many teams and engineers are testing, porting, and implementing new things that can help us lower operating costs, explore new products and possibilities, and improve Quality of Service. We’re tackling problems and taking on projects that are unique in the industry.

Serverless computing like Cloudflare Workers, and whatever comes after it, will pose new challenges to our infrastructure, as all of our customers can program their own features on Cloudflare’s edge network.

The network architecture that was conventionally made up of routers, switches, and servers has been merged into 3-in-1 box solutions, allowing Cloudflare services to be set up in locations that weren’t possible before.

The advent of NVMe and persistent memory, as well as the possibility of turning SSDs into DRAM, is redefining how we design cache servers and handle tiered caching. SSDs and memory aren’t treated as separate entities the way they used to be.

Hardware brings the company together like a rug in a living room. See how many links I mentioned above to show you how we’re one team dedicated to building a better Internet. Everything that we do here comes down to how we manage the tons of aluminum and silicon we’ve invested in. There’s a lot of work ahead to develop our hardware and help Cloudflare grow to where we envision ourselves to be. If you’d like to contribute, we’d love to hear from you.

Categories: Technology

Mapping Factorio with Leaflet

Wed, 10/10/2018 - 15:09
Mapping Factorio with Leaflet

The following is a guest post by Jacob Hands, Creator of FactorioMaps.com. He is building a community site for the game Factorio centered around sharing user creations.

Factorio is a game about building and maintaining factories. Players mine resources, research new technology and automate production. Resources move along the production line through multiple means of transportation such as belts and trains. Once production starts getting up to speed, alien bugs start to attack the factory requiring strong defenses.

A Factorio factory producing many different items.

A Factorio military outpost fighting the alien bugs.

A Factorio map view of a small factory that’s still too big to easily share fully with screenshots.

At FactorioMaps.com, I am building a place for the community of Factorio players to share their factories as interactive Leaflet maps. Due to the size and detail of the game, it can be difficult to share an entire factory through a few screenshots. A Leaflet map provides a Google Maps-like experience allowing viewers to pan and zoom throughout the map almost as if they are playing the game.

Hosting

Leaflet maps contain thousands of small images for X/Y/Z coordinates. Amazon S3 and Google Cloud Storage are the obvious choices for low-latency object storage. However, after 3.5  months in operation, FactorioMaps.com contains 17 million map images (>1TB). For this use-case, $0.05 per 10,000 upload API calls and $0.08 to 0.12/GB for egress would add up quickly. Backblaze B2 is a better fit because upload API calls are free, egress bandwidth is $0.00/GB to Cloudflare, and storage is 1/4th the price of the competition.


Backblaze B2 requires a prefix of /file/bucketName on all public files, which I don’t want. To remove it, I added a VPS proxy to rewrite paths and add a few 301 redirects. Unfortunately, the latency of user -> VPS -> B2 was sub-par, averaging 800-1200ms in the US.

A Closer Look At Leaflet

Leaflet maps work by loading images at the user's X/Y/Z coordinates to render the current view. As a map is zoomed in, it requires 4x as many images to show the same area. That means 75% of a map's images are in the max rendered zoom level.
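That 75% figure falls out of the tile pyramid's geometry (a quick sanity check, not from the original post). If the deepest zoom level holds $N$ tiles, each shallower level holds a quarter as many, so the total is a geometric series:

\[
N\left(1 + \tfrac{1}{4} + \tfrac{1}{16} + \cdots\right) = \tfrac{4}{3}N,
\qquad
\frac{N}{\tfrac{4}{3}N} = \tfrac{3}{4} = 75\%,
\]

meaning that, in the limit of many zoom levels, about three quarters of a map's tiles sit in the deepest level.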

A diagram of how each zoom level is 4x larger than the previous

Reducing Latency

With hosting working, it's time to start making the site faster. The majority of image requests come from the first few zoom levels, representing less than 25% of a given map's images. Adding a local SSD cache on the VPS containing all except the last 1-3 zoom levels for each map reduces latency for 66% of requests. The problem with SSD storage is that it's difficult to scale with ever-increasing data, and it's still limited to the network and CPU performance of the server it occupies.

Going Serverless with Cloudflare Workers

Cloudflare Workers can run JavaScript using the Service Workers API, which means the path rewrites and redirects the VPS was handling could run on Cloudflare's edge instead.


While Google Cloud Storage is more expensive than B2, it has much lower latency to the US and worldwide destinations because of their network and multi-regional object storage. However, it's not time to move the whole site over to GCS just yet; the upload API calls alone would cost $85 for 17 million files.

Multi-Tier Object Storage

The first few zoom levels are stored in GCS, while the rest are in B2. Cloudflare Workers figure out where files are located by checking both sources simultaneously. By doing this, 66% of requested files come from GCS with a mean latency of <350ms, while only 24% of files are stored on GCS. Another benefit of using B2 as the primary storage is that if GCS becomes too expensive in the future, I can move all requests to B2.

// Race GCS and B2
let gcsReq = new Request('https://storage.googleapis.com/bucketName' + url.pathname, event.request)
let b2Req = new Request(getB2Url(request) + '/bucketName' + url.pathname, event.request);

// Fetch from GCS and B2 with Cloudflare caching enabled
let gcsPromise = fetch(gcsReq, cfSettings);
let b2Promise = fetch(b2Req, cfSettings);
let response = await Promise.race([gcsPromise, b2Promise]);
if (response.ok) {
  return response;
}

// If the winner was bad, find the one that is good (if any)
response = await gcsPromise;
if (response.ok) {
  return response;
}
response = await b2Promise;
if (response.ok) {
  return response;
}

// The request failed/doesn't exist
return response;

Tracking Subrequests

The Cloudflare Workers dashboard contains a few analytics for subrequests, but there is no way to see what responses came from B2 vs. GCS. Fortunately, it’s easy to send request stats to a 3rd party service like StatHat with a few lines of JavaScript.

// Fetch from GCS and B2 with caching
let reqStartTime = Date.now();
let gcsPromise = fetch(gcsReq, cfSettings);
let b2Promise = fetch(b2Req, cfSettings);
let response = await Promise.race([gcsPromise, b2Promise]);
if (response.ok) {
  event.waitUntil(logResponse(event, response, (Date.now() - reqStartTime)));
  return response;
}

The resulting stats prove that GCS is serving the majority of requests, and Cloudflare caches over 50% of those requests. The code for the logResponse function can be found here.  

Making B2 Faster with Argo

Tracking request time surfaced another issue: requests to B2 from countries outside of North America are still quite slow. Cloudflare's Argo can reduce latency by over 50%, but is too expensive to enable for the whole site. Additionally, it would be redundant to smart-route content from GCS, since Google already does an excellent job of keeping latency down. Cloudflare request headers include the country of origin, making it trivial to route this subset of requests through an Argo-enabled domain.

// Use CF Argo for non-US/CA users
function getB2Url(request) {
  let b2BackendUrl = 'https://b2.my-argo-enabled-domain.com/file';
  let country = request.headers.get('CF-IPCountry')
  if (country === 'US' || country === 'CA') {
    b2BackendUrl = 'https://f001.backblazeb2.com/file';
  }
  return b2BackendUrl;
}

Conclusion

Cloudflare Workers are an excellent fit for my project; they enabled me to make a cost-effective solution to hosting Leaflet maps at scale. Check out https://factoriomaps.com for performant Leaflet maps, and if you play Factorio, submit your Factorio world to share with others!

Categories: Technology

Leave your VPN and cURL secure APIs with Cloudflare Access

Fri, 05/10/2018 - 19:30
Leave your VPN and cURL secure APIs with Cloudflare Access

We built Access to solve a problem here at Cloudflare: our VPN. Our team members hated the slowness and inconvenience of the VPN, but that wasn’t the issue we needed to solve. The security risks posed by a VPN required a better solution.

VPNs punch holes in the network perimeter. Once inside, individuals can access everything. This can include critically sensitive content like private keys, cryptographic salts, and log files. Cloudflare is a security company; this situation was unacceptable. We needed a better method, one that gives every application control over precisely who is allowed to reach it.

Access meets that need. We started by moving our browser-based applications behind Access. Team members could connect to applications faster, from anywhere, while we improved the security of the entire organization. However, we weren’t yet ready to turn off our VPN as some tasks are better done through a command line. We cannot #EndTheVPN without replacing all of its use cases. Reaching a server from the command line required us to fall back to our VPN.

Today, we’re releasing a beta command line tool to help your team, and ours. Before we started using this feature at Cloudflare, curling a server required me to stop, find my VPN client and credentials, login, and run my curl command. With Cloudflare’s command line tool, cloudflared, and Access, I can run $ cloudflared access curl https://example.com/api and Cloudflare authenticates my request to the server. I save time and the security team at Cloudflare can control who reaches that endpoint (and monitor the logs).

Protect your API with Cloudflare Access

To protect an API with Access, you’ll follow the same steps that you use to protect a browser-based application. Start by adding the hostname where your API is deployed to your Cloudflare account.

Just like web applications behind Access, you can create granular policies for different paths of your HTTP API. Cloudflare Access will evaluate every request to the API for permission based on settings you configure. Placing your API behind Access means requests from any operation, CLI or other, will continue to be gated by Cloudflare. You can continue to use your API keys, if needed, as a second layer of security.

Reach a protected API

Cloudflare Access protects your application by checking for a valid JSON Web Token (JWT), whether the request comes through a browser or from the command line. We issue and sign that JWT when you successfully login with your identity provider. That token contains claims about your identity and session. The Cloudflare network looks at the claims in that token to determine if the request should proceed to the target application.

When you use a browser with Access, we redirect you to your identity provider, you login, and we store that token in a cookie. Authenticating from the command line requires a different flow, but relies on the same principles. When you need to reach an application behind Access from your command line, the Cloudflare CLI tool, cloudflared, launches a browser window so that you can login with your identity provider. Once you login, Access will generate a JWT for your session, scoped to your user identity.

Rather than placing that JWT in a cookie, Cloudflare transfers the token in a cryptographically secure handoff to your machine. The client stores the token for you so that you don’t need to re-authenticate each time. The token is valid for the session duration as configured in Access.

When you make requests from your command line, Access will look for an HTTP header, cf-jwt-access-assertion, instead of a cookie. We’ll evaluate the token in that header on every request. If you use cURL, we can help you move even faster: cloudflared includes a subcommand that wraps cURL and injects the JWT into the header for you.

Why use cloudflared to reach your application?

With cloudflared and its cURL wrapper, you can perform any cURL operation against an API protected by Cloudflare Access.

  • Control endpoint access for specific users
    Cloudflare Access can be configured to protect specific endpoints. For example, you can create a rule that only a small group within your team can reach a particular URL path. You can apply that granular protection to sensitive endpoints so that you control who can reach those, while making other parts of the tool available to the full team.
  • Download sensitive data
    Placing applications with sensitive data behind Access lets you control who can reach that information. If a particular file is stored at a known location, you can save time by downloading it to your machine from the command line instead of walking through the UI flow.
What's next?

CLI authentication is available today to all Access customers through the cloudflared tool. Just add the API hostname to your Cloudflare account and enable Access to start building policies that control who can reach that API. If you do not have an Access subscription yet, you can read more about the plans here and sign up.

Once you’re ready to continue ditching your VPN, follow this link to install cloudflared today. The tool is in beta and does not yet support automated scripting or service-to-service connections. Full instructions and known limitations can be found here. If you are interested in providing feedback, you can post your comments in this thread.

Categories: Technology

Announcing Firewall Rules

Wed, 03/10/2018 - 21:20
Announcing Firewall Rules

Threat landscapes change every second. As attackers evolve, becoming more dynamic and devious, vulnerabilities materialize faster than engineers can patch their applications. Part of Cloudflare’s mission is to keep you and your applications safe. Today, Cloudflare is launching a new feature, giving customers what they have been requesting - fine-grained control over their incoming requests.

Cloudflare already offers a number of powerful firewall tools such as IP rules, CIDR rules, ASN rules, country rules, HTTP user-agent blocking, Zone Lockdown (for these URIs only allow traffic from those IPs), and our comprehensive managed rules within our WAF (Web Application Firewall). But sometimes you need to combine the power of these tools to fully mitigate an attack, and to express a block rule that goes beyond the boundaries of the existing tools, such as “block traffic to this URI when the request comes from that IP and the user-agent matches one of these”.

Flexibility and Control

Announcing Firewall Rules

© Stefano Kocka : Source Wikipedia

Common themes arose when we spoke to customers about their needs and also reviewed feature requests that our customer support team had seen, and we categorised the top pieces of feedback and feature requests into three core needs:

  1. More flexibility to create a firewall rule that matches more than just a single attribute, like an IP address and User-Agent
  2. Flexibility to not only exactly match an attribute, but also partially match it, through using a string or a pattern, for example User-Agent: *Firefox*
  3. Complete self-service control through the UI and Cloudflare’s public API, even for the most complex firewall rules

The team worked together to investigate what a fresh approach to our firewall would look like, with one overarching mission being front and center: build a Swiss Army knife of firewalls for our customers. Our key objectives were to:

  1. Provide a tool to give customers flexible and granular control of their traffic
  2. Maintain a smooth and intuitive user-experience, something we thrive on delivering
  3. Ensure it is accessible and usable by all of our customers, regardless of user skill or business size
Firewall Rules

Cloudflare’s new capability, Firewall Rules, provides customers the ability to control requests, in a flexible and intuitive way, inspired by the widely known Wireshark®  language. Configuration of rules can be done through not only our Dashboard and API, but also through Terraform (link here).

The Firewall Rules engine can be thought of as 2 components:

  • Matching: define a filter that runs globally and precisely matches your traffic
  • Action: declare the action Cloudflare should apply when the filter is matched

Simply put, when the filter matches, apply the action.

Matching: scoping the rule

Cloudflare Firewall Rules gives customers access to properties of the HTTP request, including referer, user-agent, cookies, Cloudflare Threat Score (IP reputation score), and more.

All of the supported headers can be matched by many operators, including a partial match (contains), a complete string or integer match (equals), and, for our Business and Enterprise customers, pattern matching (matches). Yes, you read that right: we now offer pattern matching using Regular Expressions, directly through the Dashboard and API!

The great thing about Firewall Rules is the flexibility it gives Cloudflare to expose new fields to customers; for example, Cloudflare’s Threat Intelligence, which drives the Security Level functionality on the Dashboard, will be an available field for customers. One of the most important fields Cloudflare is introducing is our cf.client.bot field, which verifies known good bots via reverse DNS. In our initial release, we provide customers access to the general classification of “Known Good Bots”. Details on the list of these bots can be found here. Cloudflare has historically whitelisted Google on behalf of our customers, and utilised Web Application Firewall rules, which are only available to Pro customers and above, to block spoofed crawlers. With Firewall Rules, all customers now have access to these protections. As Cloudflare has removed the whitelisting capability, it is important that customers include cf.client.bot eq true as an Allow rule, to avoid inadvertently blocking good crawlers, which could affect your SEO and monitoring.

Action: what action is applied to the request

All of the standard Cloudflare actions such as JavaScript Challenge, Challenge and Block are available.

There is one new addition to the standard mitigation list: the allow action, which gives a customer the ability to create a rule that says “if this criteria is matched, stop processing further rules”.

Give me some examples!

Sure, here are four cool examples that we see being used today. Advanced or nested rules are not supported in the Visual Rule Builder today; this is noted below each rule.

Example 1 - Challenge all countries except GB
Supported: Visual Builder, Expression Editor
This can be done using our IP Firewall but would require 150+ rules!

(ip.geoip.country ne "GB")

Example 2 - Advanced Hotlink Protection
Supported: Visual Builder, Expression Editor
Cloudflare’s built-in hotlink protection can be restrictive for some customers as it does not provide abilities to bypass certain paths. This also can sometimes catch legitimate crawlers.

(http.request.method eq "GET" and http.referer ne ".example.com" and not http.user_agent matches "(googlebot|facebook)" and http.request.uri.path eq "/content/")

Example 3 - Blocking Clients with a Threat Score greater than 10 or Clients originating from an abusive network by AS Number, to my Login page
Supported: Expression Editor
One of the great things about Firewall Rules is that we have provided you access to cf.threat_score, which is what powers the Security Level within the dashboard today.

(http.request.uri.path eq "/login" and (cf.threat_score < 10 or ip.geoip.asnum eq 12345))

Example 4 - Zone Lockdown-like use case utilising Regular Expressions, IP address CIDRs, Country Codes and AS Numbers to protect authentication endpoints on both a WordPress website and an API.
Supported: Expression Editor
Zone Lockdown is a great tool; however, it is limited for some critical use cases. Here’s something quite crazy to demonstrate the flexibility:

((http.host eq "api.example.com" and http.request.uri.path eq "/api/v2/auth") or (http.host matches "^(www|store|blog)\.example.com" and http.request.uri.path contains "wp-login.php") or ip.geoip.country in {"CN" "TH" "US" "ID" "KR" "MY" "IT" "SG" "GB"} or ip.geoip.asnum in {12345 54321 11111}) and not ip.src in {11.22.33.0/24}

Positive and Negative Security Models

This is an awesome addition to the Firewall, as it provides our customers a way to choose between running a Positive Security policy (allow specific requests and deny everything else) or a Negative Security policy (block specific requests and allow everything else).

Cloudflare’s default for Firewall Rules is an implicit allow-all. The great thing about this way of working is being able to block just the bad stuff. Whilst this is a very effective and efficient way of running a firewall, it causes a number of challenges. By letting all traffic through, your security operations have to be reactive when issues arise.

What the security industry has been pushing is a concept of "Zero Trust". Just as it sounds, Zero Trust means you trust nothing, and everything that comes through has to have some kind of justification. To create a "Zero Trust" security policy, you have to reverse the way your firewall default policy works, i.e. changing the last action from allow to block - aka. a positive security policy. Before today, this was not possible; however with Firewall Rules, now you can.

The Visual Rule Builder and Expression Editor

One of the biggest concerns about giving customers power is delivering that power safely and effectively. The product design and UI engineering teams spent a number of months working through multiple iterations to create a powerful but approachable rule builder and a rule editor, without cluttering or complicating the UI.

Pete Thomas, our Lead Designer on the new Firewall UI took us back to basics running paper prototyping sessions to test and discover how rules are built and managed.

Below is a photo of myself and Matthew Bullock, one of our London Solutions Engineers, going through the testing process.

Announcing Firewall Rules


Through the design process, we wanted to focus on why customers would need Firewall Rules. The results were simple: proactive defensive rules to secure an application, and reactive rules to protect applications that were being attacked.

Within the Visual Rule Builder, we have provided customers an intuitive way to create Firewall Rules, whilst not restricting access to the core functionality. The future roadmap delivers more flexible grouping through the Visual Builder. However, we do have an option for more complex requirements or nested Firewall Rules. These can be created within the Rule Editor, which is based on our Wireshark®-inspired language that allows you to take expressions created in Wireshark and create Firewall Rules. David Kitchen, the Engineering Manager responsible for developing Firewall Rules will be writing a blog in the coming weeks detailing why we chose a Wireshark®-inspired DSL for our filter expressions. For a list of supported fields, head over to our documentation.  

Categories: Technology

Custom Load Balancing With Cloudflare Workers

Wed, 03/10/2018 - 08:59

The following is a guest post by Jayaprabhakar Kadarkarai, Developer of Codiva.io, an Online IDE used by computer science students across the world. He works full stack to deliver low latency and scalable web applications.

Have you launched your website? Getting a lot of traffic? And you are planning to add more servers? You’ll need load balancing to maintain the scalability and reliability of your website. Cloudflare offers powerful Load Balancing, but there are situations where off-the-shelf options can’t satisfy your specific needs. For those situations, you can write your own Cloudflare Worker.

In this post, we’ll learn about load balancers and how to set them up at low cost with Cloudflare Workers.

This post assumes you have a basic understanding of JavaScript, as that’s the language used to write a Cloudflare Worker.

The Basic Pattern

The basic pattern starts with adding ‘fetch’ event listener to intercept the requests. You can configure which requests to intercept on the Cloudflare dashboard or using the Cloudflare API.

Then, modify the hostname of the URL and send the request to the new host.

addEventListener('fetch', event => {
  var url = new URL(event.request.url);
  // https://example.com/path/ to https://myorigin.example.com/path
  url.hostname = 'myorigin.' + url.hostname;
  event.respondWith(fetch(url));
});

This doesn’t do anything useful yet, but this is the basic pattern that will be used in the rest of the examples.

Load Balancer with Random Routing

When you have a list of origin servers, pick a random host to route to.

This is a very basic load balancing technique to evenly distribute the traffic across all origin servers.

var hostnames = [
  "0.example.com",
  "1.example.com",
  "2.example.com"
];

addEventListener('fetch', event => {
  var url = new URL(event.request.url);
  // Randomly pick the next host
  url.hostname = hostnames[getRandomInt(hostnames.length)];
  event.respondWith(fetch(url));
});

function getRandomInt(max) {
  return Math.floor(Math.random() * max);
}

Load Balancer with Fallback

What about when a host is down? A simple fallback strategy is to route the request to a different host. Use this only if you know the requests are idempotent. In general, this means GET requests are okay, but you might wish to handle POST requests another way.

addEventListener('fetch', event => {
  // Randomly pick the primary host
  var primary = getRandomInt(hostnames.length);
  var primaryUrl = new URL(event.request.url);
  primaryUrl.hostname = hostnames[primary];

  var timeoutId = setTimeout(function() {
    var backup;
    do {
      // Naive solution to pick a backup host
      backup = getRandomInt(hostnames.length);
    } while (backup === primary);
    var backupUrl = new URL(event.request.url);
    backupUrl.hostname = hostnames[backup];
    event.respondWith(fetch(backupUrl));
  }, 2000 /* 2 seconds */);

  fetch(primaryUrl)
    .then(function(response) {
      clearTimeout(timeoutId);
      event.respondWith(response);
    });
});

Geographic Routing

Cloudflare adds CF-IPCountry header to all requests once Cloudflare IP Geolocation is enabled.

You can access it using:

var countryCode = event.request.headers.get('CF-IPCountry');

We can use the countryCode to route requests from different locations to different servers in different regions.

For example, 80% of the traffic to Codiva.io is from the US and India. So, I have servers in two different regions (Oregon, USA; and Mumbai, India). Requests from India and  other countries near it are routed to servers in India. All other requests are routed to the US data center.

const US_HOST = "us.example.com"
const IN_HOST = "in.example.com"

var COUNTRIES_MAP = {
  IN: IN_HOST,
  PK: IN_HOST,
  BD: IN_HOST,
  SL: IN_HOST,
  NL: IN_HOST
}

addEventListener('fetch', event => {
  var url = new URL(event.request.url);
  var countryCode = event.request.headers.get('CF-IPCountry');
  if (COUNTRIES_MAP[countryCode]) {
    url.hostname = COUNTRIES_MAP[countryCode];
  } else {
    url.hostname = US_HOST;
  }
  event.respondWith(fetch(url));
});

Putting it all together

Now, let us combine the geographic routing, random load balancing and fallback into a single worker:

const US_HOSTS = [
  "0.us.example.com",
  "1.us.example.com",
  "2.us.example.com"
];
const IN_HOSTS = [
  "0.in.example.com",
  "1.in.example.com",
  "2.in.example.com"
];

var COUNTRIES_MAP = {
  IN: IN_HOSTS,
  PK: IN_HOSTS,
  BD: IN_HOSTS,
  SL: IN_HOSTS,
  NL: IN_HOSTS
}

addEventListener('fetch', event => {
  var countryCode = event.request.headers.get('CF-IPCountry');
  var hostnames = US_HOSTS;
  if (COUNTRIES_MAP[countryCode]) {
    hostnames = COUNTRIES_MAP[countryCode];
  }

  // Randomly pick the next host (by index, so the fallback can avoid it)
  var primary = getRandomInt(hostnames.length);
  var primaryUrl = new URL(event.request.url);
  primaryUrl.hostname = hostnames[primary];

  // Fallback if there is no response within timeout
  var timeoutId = setTimeout(function() {
    var backup;
    do {
      // Naive solution to pick a backup host
      backup = getRandomInt(hostnames.length);
    } while (backup === primary);
    var backupUrl = new URL(event.request.url);
    backupUrl.hostname = hostnames[backup];
    event.respondWith(fetch(backupUrl));
  }, 2000 /* 2 seconds */);

  fetch(primaryUrl)
    .then(function(response) {
      clearTimeout(timeoutId);
      event.respondWith(response);
    });
});

function getRandomInt(max) {
  return Math.floor(Math.random() * max);
}

Recap

In this article, you saw the power of Cloudflare Workers and how simple they are to use. Before implementing a custom load balancer with Workers, take a look at Cloudflare's load balancer.

For more examples, take a look at the recipes on the developer docs page.

Categories: Technology

Ulaanbaatar, Mongolia

Wed, 03/10/2018 - 00:59
Ulaanbaatar, Mongolia

Whenever you get into a conversation about exotic travel or ponder visiting the four corners of the globe, inevitably you end up discussing Ulaanbaatar in Mongolia. Travelers want to experience Mongolia's rich culture and its vivid blue skies, the feature that gives the country its nickname of “Land of the Eternal Blue Sky”.

Ulaanbaatar, Mongolia

Ulaanbaatar (or Ulan Bator, but shortened to UB by many) is the capital of Mongolia, located nearly a mile above sea level just outside the Gobi Desert, a desert that spans a good portion of Mongolia (the rest of the Gobi extends into China). The country is nestled squarely between Russia to the north and China to the south. It's also home to some of the richest and most ancient customs and festivals around, and it's those festivals that successfully draw in the tourists who want to experience something quite unique. Luckily, even with all the tourists, Mongolia has managed to keep its local customs, both in the cities and within its nomadic tribes.

Ulaanbaatar, Mongolia

via Wikipedia

History also has drawn explorers and conquerors to and from the region; but more on that later.

Cloudflare is also drawn into Mongolia

Any avid reader of our blogs will know that we frequently explain that the expansion of our network provides customers and end-users with both more capacity and less latency. That goal (covering 95% of the Internet population with 10 milliseconds or less of latency) means that Mongolia was seriously on our radar.

Now that we have a data center in Ulaanbaatar, Mongolia, latency into that blue-sky country is significantly reduced. Prior to this data center going live, we were shipping bits into the country via Hong Kong, a whopping 1,800 miles away (or 50 milliseconds if we talk latency). That's far! We know this new data center is a win-win for both mobile and broadband customers within the country and for Cloudflare customers as a whole.

Just how did we get Cloudflare into Mongolia?

Ulaanbaatar is city number 154 on Cloudflare’s network. Our expansion into cities like Ulaanbaatar doesn’t just happen instantly; it takes many teams within Cloudflare in order to successfully deploy in a place like this.

However, before deploying, we need to negotiate a place to deploy into. A new city requires a secure data center for us to build into, plus a bandwidth partner: we need access to the local networks, and we need to acquire cache-fill bandwidth in order to operate our CDN. Once those items are negotiated, we can focus on the next steps. Any site we build has to match our own stringent security standards (we are PCI DSS compliant, so all our data centers need to be PCI DSS compliant too). That's a paperwork process, which surprisingly takes longer than most other stages (because we care about security).

Then logistics kicks in. A BOM (Bill of Materials) is created. Serial numbers recorded. Power plugs chosen. Fun fact: Cloudflare data centers are nearly all identical, except the power cables. While we live in a world where fiber optic cables and connectors are universal, independent of location (or speed in some cases), the power connections we receive for our data centers vary widely as we expand around the globe.

The actual shipping is possibly the more interesting part of the logistics process. Getting a few pallets of hardware strapped up and ready to move is only a small part of the process. Paperwork again becomes the all-powerful issue. Each country has its own customs and import process, each shipping company has its own twist on how to process things, and each shipment needs to be coordinated with a receiving party. Our logistics team pulls off this miracle for new sites, upgrades for existing sites, replacement parts for existing sites, all while sometimes calmly listening to mundane on-hold music from around the globe.

Then our hardware arrives! Seriously, this is a biggie and those around the office that follow these new city builds are always celebrating on those days. The logistics team has done their part; now it’s time for the deployment team to kick-in.

The deployment team's main goal is to get hardware racked-and-stacked in a site where (in most cases) we contract out the remote-hands work. Sometimes we send people to a site to build it; however, that's not a scalable process, so we use local remote-hands contractors to do this heavy lifting and cabling. There are diagrams, there are instructions, there are color-coded cables ('cause the right cable needs to go into the right socket). Depending on the size of the build, it can be just a day's work or up to a week's worth of work. We vary our data center sizes based on the capacity needs for each city. Once racked-and-stacked, there is one more job for the Infrastructure team: getting the initial private network connection enabled and secured. That single connection gives us the ability to continue to the next step: actually setting up the network and servers at the new site.

Every new data center site is shipped with zero configuration loaded into its network hardware and compute servers. Everything ships raw, with no Cloudflare knowledge embedded in it. This is by design. The network team's first goal is to configure the routers and switches, which is mainly a bootstrap process so that the hardware can phone home and securely request its full configuration. We have previously written about our extensive network automation methods. In the case of a new site, it's not that different. Once the site can communicate back home, it's clear to the software that its configuration is out of date. Updates are pushed to the site and network monitoring is automatically enabled.

But let's not paint a perfectly rosy picture. There can still be networking issues. Just one is worth pointing out, as it's a recurring issue and one that plagues the industry globally. Fiber optic cables can sometimes be plugged in with their receive and transmit sides reversed. It's a 50:50 chance of being right. Sometimes it just amazes us that we can't get this fixed; but … a quick swap of the two connectors and we are back in business!

Those explorers and conquerors

No conversation about Mongolia would be complete without discussing Genghis Khan. Back in the 13th century, Genghis Khan founded the Mongol Empire. He unified the tribes of what is now Mongolia (and beyond). He established the largest land empire in history and is well described both online and via various TV documentaries (The History Channel doesn't skimp when it comes to covering him). Genghis Khan was a name of honor that he didn't receive until 1206; before that he was simply named "Temujin".

Ulaanbaatar, Mongolia

Photo by SarahTz CC by/2.0

Around 30 miles outside Ulaanbaatar is the equestrian statue of Genghis Khan on the banks of the Tuul River in Gorkhi Terelj National Park. Pictured above, this statue is 131 feet tall and built from 250 tons of steel.

Meanwhile back in Mongolia in present time

We get to announce our new city (and country) data center during a very special time. The Golden Eagle Festival takes place during the first weekend of October (that’s October 6 and 7 this year). It’s a test of a hunter’s speed and agility. In this case the hunters (nomads of Mongolia) are practicing an ancient technique of using Golden Eagles to hunt. It takes place in the far west of the country in the Altai Mountains.

The most famous festival in Mongolia is the Naadam Festival in mid-July. There is so much going on within that festival, all of it a celebration of Mongolian and nomadic culture. The festival celebrates the local sports (wrestling, archery, and horse racing) along with plenty of arts. The opening ceremony can be quite elaborate.

When discussing Mongolia, travelers nearly always want to make sure their itineraries overlap at least one of these festivals!

Cloudflare continues to expand globally

Last week was Cloudflare's Birthday Week and we announced many services and products. Rest assured that everything we announced is instantly available in Ulaanbaatar, Mongolia (data center city 154), just like it's available everywhere else on Cloudflare's network.

One final piece of trivia regarding our Ulaanbaatar data center: with Ulaanbaatar live, we now have data centers covering the letters A through Z (from Amsterdam to Zurich), i.e. with U added, every letter is now fully represented.

If you like what you've read and want to come join our infrastructure team, our logistics team, our network team, our SRE team, or any of the teams that help with these global expansions, then we would love to see your resume or CV. Look here for job openings.

Categories: Technology

Cloudflare Access: Sharing our single-sign on plugin for Atlassian

Tue, 02/10/2018 - 19:56

Here at Cloudflare, we rely on a set of productivity tools built by Atlassian, including Jira and Confluence. We secure them with Cloudflare Access. In the past, when our team members wanted to reach those applications, they first logged in with our identity provider credentials to pass Access. They then broke out a second set of credentials, specific to Atlassian tools, to reach Jira. The flow is inconvenient on a desktop and downright painful on a mobile device.

While Access can determine who should be able to reach an application, the product alone cannot decide what the user should be able to do once they arrive at the destination. The application sets those specific permissions, typically by requiring another set of user credentials. The extra step slows down and frustrates end users. Access saves time by replacing a cumbersome VPN login. However, we wanted to also solve the SSO problem for our team.

We created a plugin, specific to Atlassian, that takes identity data from the token generated by Access and maps it to a user account. Our team members log in with our identity provider to pass Access, and Access then sets their user permissions in Jira or Confluence. No extra credentials required.

We’re excited to announce that we are sharing the same SSO plugin that we use every day at Cloudflare so that all Access customers can deploy it for their hosted Atlassian tools. You can add this plugin immediately to your Atlassian instance and remove that extra credential requirement. Like we did at Cloudflare, you can make the day more convenient for everyone in your team.

We aren’t stopping with Atlassian. We’re working with partners to expand JWT-based authentication. If you’re building or maintaining a product with an authorization flow, we want to help you add this functionality.

What is a JWT and how does Access use them?

JSON Web Tokens (JWTs) allow systems to communicate stateless data, or claims, in a secure format. The tokens follow a standard established in RFC 7519. That open standard allows different groups to exchange signed data in a format that both sides can create and verify. Each JWT consists of three Base64-URL strings: the header, the payload, and the signature.

  • The header identifies the cryptographic algorithm used to sign the data in the JWT.
  • The payload consists of name-value pairs for at least one and typically multiple claims, encoded in JSON. For example, the payload can contain the identity of a user.
  • The signature allows the receiving party to confirm that the payload is authentic.

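A quick way to see those three parts is to split a token on the '.' separators and decode the first two segments. Here is a small JavaScript sketch (the token variable is assumed to hold a JWT; atob is available in browsers and in Workers):

// Decode (but do not verify!) a JWT to inspect its header and payload.
function decodeJwt(token) {
  const [header, payload] = token.split('.');  // the third segment is the signature
  const parse = (segment) => {
    // Convert base64url to base64 and restore any stripped padding.
    let b64 = segment.replace(/-/g, '+').replace(/_/g, '/');
    while (b64.length % 4) b64 += '=';
    return JSON.parse(atob(b64));
  };
  return { header: parse(header), payload: parse(payload) };
}

// Example: decodeJwt(token).header might look like { alg: "RS256", kid: "..." }
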
Cloudflare Access relies on JSON Web Tokens to confirm identity before allowing or denying access to sensitive resources. When your end user reaches an application protected by Access, they first sign in with your identity provider. We communicate with that provider through OAuth or SAML flows to verify the user's identity. Once confirmed, we create a JWT, add data to the payload, and sign the token using the RS256 algorithm with a public/private key pair.

The token we issue stores:

  • User identity: typically the email address of the user retrieved from your identity provider.
  • Authentication domain: the domain that signs the token. For Access, we use “example.cloudflareaccess.com” where “example” is a subdomain you can configure.
  • Audience: The domain of the application you are attempting to reach.
  • Expiration: the time at which the token is no longer valid for use.

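For illustration only, the values below are invented and the exact claim names in a real Access token may differ slightly, but a decoded payload maps onto the fields above roughly like this:

// Roughly what a decoded Access token payload contains (illustrative values):
const payload = {
  email: "user@example.com",                    // user identity
  iss: "https://example.cloudflareaccess.com",  // authentication domain
  aud: "jira.example.com",                      // audience: the application's domain
  exp: 1539800000                               // expiration (Unix timestamp)
};
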
When a request is made to an application behind Access, Cloudflare looks for the presence of that token. If it is present, we verify its signature, validate its claims, and then read the payload. If the payload contains information about a user who should be able to reach the application, we let them through. Applications can use the same validation step to set permissions specific to a user; you just need to associate the user in the JWT with a user in your application, which is a problem a new module, or plugin, can solve.
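
As a rough sketch of what that validation step can look like in an application behind Access (not Cloudflare's reference code): this assumes the Node.js jose library, that Access forwards the token in the Cf-Access-Jwt-Assertion request header, and that the certs URL, issuer, and audience values below are replaced with the ones for your own authentication domain and application.

const { createRemoteJWKSet, jwtVerify } = require('jose');

// Public keys published by the Access authentication domain.
const JWKS = createRemoteJWKSet(
  new URL('https://example.cloudflareaccess.com/cdn-cgi/access/certs'));

async function userFromAccessToken(req) {
  // Access forwards the signed token with each request it lets through.
  const token = req.headers['cf-access-jwt-assertion'];
  if (!token) return null;

  // Throws if the signature, issuer, audience, or expiration is invalid.
  const { payload } = await jwtVerify(token, JWKS, {
    issuer: 'https://example.cloudflareaccess.com',
    audience: 'jira.example.com',  // whatever audience Access sets for your application
  });

  // Map the identity in the token to a user in the application.
  return payload.email;
}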

Sharing our Atlassian plugin

We selected Atlassian because nearly every member of our global team uses the product each day. Saving the time to input a second set of credentials, daily, has real impact. Additionally, removing the extra step makes reaching these core tools easier from mobile devices.

The Java plugin is built on top of the Atlassian Plugin SDK. When requests are made to pages within your Atlassian deployment, Atlassian will redirect to the login page. With the plugin installed, that login page will first look for the presence of the token issued by Cloudflare Access. If present, the plugin will make sure the token is valid and parse the user info in the payload.

Next, the plugin attempts to map the email address in the payload to an email address of a user in your Atlassian account. If an account is found, the plugin will initiate a user session for that specific user. The plugin relies on the email or username in your identity provider matching the email configured in Atlassian.

When we rolled this out at Cloudflare, team members had one fewer disruption in their day. We all became accustomed to it. We only received real feedback when we disabled it, briefly, to test a new release. And that response was loud. When we returned momentarily to the old world of multiple login flows, we started to appreciate just how convenient SSO is for a team. The lesson motivated us to make this available, quickly, to our customers.

We are making the code public for this tool, instead of releasing it as a wrapped plugin in the marketplace, so that you can review the implementation yourself. Install it on your deployment today; you can find instructions here. From our own experience, we think your end users are going to love removing that extra login step.

Install the Atlassian SSO plugin

Expanding JWT-based authorization

Atlassian tools are popular, but solving this problem for Jira and Confluence is just the beginning. Each member of your team probably uses a dozen or more internal applications each day. Different groups within your organization also rely on different sets of tools.

We’re working with projects to expand support for JWT-based SSO. One example is Redash, a popular data query and visualization tool. In the last week, Redash added the functionality to their product. Unlike the Atlassian plugin, the change only required the addition of a new option within the tool’s authentication flow. When enabled, the new feature checks for the presence of a JWT in the request. If valid, Redash will map the identity in the payload to a user profile and start a session with those permissions. The end user logs in once, at the Access screen, and arrives at their unique Redash account.

If you’re interested in improving the SSO flow for your product, please reach out through the form here and we can provide input and best practices on implementation. Feel free to use the Atlassian and Redash examples as templates. We’re excited to work with you to remove redundant steps for your users.

Categories: Technology

#BetterInternet: Join the Movement

Tue, 02/10/2018 - 16:15
#BetterInternet: Join the Movement

When it comes to overall awareness of Cloudflare, it seems most folks fall into one of three camps: 1) those who don't know much about Cloudflare at all, 2) those who are familiar with one or two of Cloudflare's many solutions (e.g. DDoS protection, caching, DNS, etc.), and finally, 3) those who understand the full breadth and scope of Cloudflare's global cloud network. This latter group is especially excited about the broad scope of Cloudflare's mission: "to help build a better Internet." Last week our co-founder Michelle Zatlyn explained in a blog post what this mission actually means:

“Our mission at Cloudflare is to help build a better Internet. That is a big, broad mission that means many things. It means that we push to make Internet properties faster. It means respecting individual’s privacy. It means making it harder for malicious actors to do bad things. It means helping to make the Internet more reliable. It means supporting new Internet standards and protocols, and making sure they are accessible to everyone. It means democratizing technology and making sure the widest possible group has access to it. It means increasing value for our community, while decreasing their costs.”

(See Michelle’s full blog post for more color on each of these areas).

You'll observe that all of our recent Crypto Week and Birthday Week announcements delivered on the tenets of our bold mission. We've been very encouraged that our customers are excited about the progress we've made so far. We're confident we're moving in the right direction, but we know that lots of people still haven't even heard of us yet, and that not enough people yet know how Cloudflare can make their users' Internet experience better. We're excited to invite more people to learn about Cloudflare and join with us (as customers and partners) in helping build a better Internet.

IF bold mission THEN bold message

This week we’re starting to experiment with a new campaign that aims to communicate the scope of Cloudflare’s mission to an ever-growing audience.

This campaign is all about being bold, provocative, and declarative in proclaiming who we are and what we do. After all, Cloudflare isn’t just another technology company — we’re on a very real mission to help build a better Internet. And we’re inviting folks to join us!

Our customers know that Cloudflare is a technology disruptor and democratizer, taking complex and sophisticated cloud network technologies, and making them available to organizations both large and small. Our global cloud network not only serves some of the world’s largest enterprises, but we also proudly serve minority groups, the disenfranchised, and sometimes, even the rebellious. The nature of our role in the Internet ecosystem sometimes puts us in the middle of criticism and controversy but we don’t shrink from the responsibility that rests on our shoulders.

We think it is appropriate that we introduce Cloudflare loudly and unapologetically. The messages in this campaign are direct and provocative. These do NOT look like polished, boring, corporate IT ads. Instead, they look a bit raw and maybe even slightly irreverent. They’ve been created fast and furiously by a team which is inspired to make our message known and to make a difference.  

What you’ll see

Below are some of the campaign assets you may start seeing.

#BetterInternet: Join the Movement

Wild Postings across Manhattan

#BetterInternet: Join the Movement

Muni bus wraps in San Francisco

#BetterInternet: Join the Movement

#BetterInternet: Join the Movement

Civic Center Muni takeover

#BetterInternet: Join the Movement

Select billboard placements in San Francisco

#BetterInternet: Join the Movement

Forbes Takeover

Where you might see it

We plan to start testing this campaign in two US cities (San Francisco and New York), as well as a number of US national publications (digital versions of the New York Times, Forbes, others), social media, and other digital channels. If you’re in SF and NYC, you may see a few billboards and/or posters around town.

Let us know what you think

If you see any of these messages out in the wild, we’d love to see a photo/screenshot and hear what you think. Let us know via social media by tagging #BetterInternet.

Categories: Technology

Free to code

Mon, 01/10/2018 - 16:01
Free to code

This week at the Cloudflare Internet Summit I have the honour of sitting down and talking with Sophie Wilson. She designed the very first ARM processor instruction set in the mid-1980s and was part of the small team that built the foundations for the mobile world we live in: if you are reading this on a mobile device, like a phone or tablet, it almost certainly has an ARM processor in it.

But, despite the amazing success of ARM, it’s not the processor that I think of when I think of Sophie Wilson. It’s the BBC Micro, the first computer I ever owned. And it’s the computer on which Wilson and others created ARM despite it having just an 8-bit 6502 processor and 32k of RAM.

Luckily, I still own that machine and recently plugged it into a TV set and turned it on to make sure it was still working 36 years on (you can read about the time blue smoke came out of it and how I repaired it). I wanted to experience once more the machine Sophie Wilson helped to design. One vital component of that machine was BBC BASIC, stored in a ROM chip on the computer's motherboard. She wrote the code on that ROM.

Free to code

Understandably, BASIC seems old-fashioned, useless and simplistic these days but I was struck by something as I switched on that machine: it booted instantaneously (anyone who used a BBC Micro will be familiar with the two tone boot sound) and I was presented with a prompt without any delay.

Could I immediately write and run code? Yes, I could, and so I did (see below and click here to listen to that program running). And that’s all programmers really want, isn’t it? The ability to write code and run it; the ability to express to the machine their thoughts and have those thoughts turn from the ephemeral into the real.

Free to code

Ideally, nothing gets in the way of the jump from brain to CPU. Anything that does is a useless distraction. Which is why you see programmers complaining bitterly about keyboards and buying custom keys. And it’s why you’ll see them wearing headphones to avoid distractions when they get into the magical zone where code seems to flow from the imagination onto the screen.

And for the same reason some languages have a REPL (read-eval-print loop) so that code can be written, modified and run without distraction. And it's why the venerable make program has built-in rules that mean you can perform common tasks (like compiling a program to an executable) without even having a Makefile at all:

$ ls
helloworld.c
$ cat helloworld.c
#include <stdio.h>

int main(int argc, char* argv[]) {
  printf("Hello, World!\n");
}
$ make helloworld
cc helloworld.c -o helloworld
$ ./helloworld
Hello, World!

It shouldn’t then be a surprise that the last thing programmers really want to think about when programming is the computer and its complexities. Anyone who’s worked with manual tools will know that the brain incorporates those tools as if they were part of the body. Programmers experience a similar feeling and then an almost visceral sensation when something doesn’t work correctly and a program can’t run.

A very common source of trouble is the difference between “it works on my laptop” and “it fails in production”. Programmers configure their laptops to maximize their productivity but subtle differences between the laptop and production server (such as different versions of a system library) mean that beautifully crafted code can suddenly fail to run at all because of some irritating, and irrelevant, mismatch.

All programmers know that pure exasperation of seeing their code fail to run because of some mere configuration problem.

To solve this, first Virtual Machines and then Containers were proposed as ways for the so-called environment (all the things that a program needs to run, like memory, CPU, an operating system, and libraries) to be duplicated between laptop and server, making the transition to production as smooth as possible.

About that buzzword

Recently a new buzzword has appeared: serverless. As I said above programmers really don’t want to think about computers: they want running code. Serverless promises just that: push code to production and it’ll run somewhere, who cares where, who cares how? Much fun is made of the fact that, of course, there are servers running code even when it’s called ‘serverless’.

But ‘serverless’ expresses the desire to neither know nor care about the how or where.

Cloudflare has its own serverless platform that runs in every one of our machines across the globe. We call it Cloudflare Workers: push code to us through our API and it’ll be available everywhere on the planet within seconds. It’s augmented by our Workers KV service that provides a global, key-value store accessible from within a Worker.
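
As a minimal sketch (the namespace binding name GREETINGS and the key are hypothetical, not from this post), a Worker with a KV namespace bound to it can read the store like this:

// Assumes a Workers KV namespace bound to this Worker as GREETINGS.
addEventListener('fetch', event => {
  event.respondWith(handle(event.request));
});

async function handle(request) {
  // Read a value previously stored with GREETINGS.put('hello', '...').
  const greeting = await GREETINGS.get('hello');
  return new Response(greeting || 'No greeting stored yet');
}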

Until today Cloudflare Workers code had to be written in JavaScript. That changed with the release of WebAssembly on the Workers platform. Now you can write in any language that compiles to WebAssembly and deploy that code to the Cloudflare network. Frictionless development and deployment on a serverless platform in your language of choice.

That’s all programmers really want.

Now, if you don’t mind, I’m switching my BBC Micro to the Acornsoft LISP ROM, because I’ve got a REPL to run.

Categories: Technology

WebAssembly on Cloudflare Workers

Mon, 01/10/2018 - 16:00
WebAssembly on Cloudflare Workers

WebAssembly on Cloudflare Workers

We just announced ten major new products and initiatives over Crypto Week and Birthday Week, but our work is never finished. We're continuously upgrading our existing products with new functionality.

Today, we're extending Cloudflare Workers with support for WebAssembly. All Workers customers can now augment their applications with WASM at no additional cost.

What is WebAssembly?

WebAssembly, often abbreviated as "WASM", is a technology that extends the web platform to support compiled languages like C, C++, Rust, Go, and more. These languages can be compiled to a special WASM binary format and then loaded in a browser.

WASM code is securely sandboxed, just like JavaScript. But, because it is based on compiled lower-level languages, it can be much faster for certain kinds of resource-intensive tasks where JavaScript is not a good fit. In addition to performance benefits, WASM allows you to reuse existing code written in languages other than JavaScript.

What are Workers?

WebAssembly on Cloudflare Workers

For those that don't know: Cloudflare Workers lets you deploy "serverless" JavaScript code directly to our 153-and-growing datacenters. Your Worker handles your site's HTTP traffic directly at the location closest to your end user, allowing you to achieve lower latency and reduce serving costs. Last week we added storage to Workers, making it possible to build applications that run entirely on Cloudflare.

Until now, Workers has only supported JavaScript. With the addition of WebAssembly, you can now use a wide range of languages and do more, faster. As always, when you deploy code on Cloudflare, it is distributed to every one of our locations world-wide in under 30 seconds.

When to use WebAssembly

It's important to note that WASM is not always the right tool for the job. For lightweight tasks like redirecting a request to a different URL or checking an authorization token, sticking to pure JavaScript is probably both faster and easier than WASM. WASM programs operate in their own separate memory space, which means that it's necessary to copy data in and out of that space in order to operate on it. Code that mostly interacts with external objects without doing any serious "number crunching" likely does not benefit from WASM.

On the other hand, WASM really shines when you need to perform a resource-hungry, self-contained operation, like resizing an image, or processing an audio stream. These operations require lots of math and careful memory management. While it's possible to perform such tasks in pure JavaScript — and engines like V8 have gone to impressive lengths to optimize such code — in the end nothing beats a compiled language with static types and explicit allocation.

As an example, the image below is resized dynamically by a Cloudflare Worker using a WebAssembly module to decode and resize the image. Only the original image is cached — the resize happens on-the-fly at our edge when you move the slider. Find the code here.

How to use WebAssembly with Cloudflare Workers

WASM used in a Worker must be deployed together with the Worker. When editing a script in the online Workers editor, click on the "Resources" tab. Here, you can add a WebAssembly module.

WebAssembly on Cloudflare Workers

You will be prompted to upload your WASM module file and assign it a global variable name. Once uploaded, your module will appear as a global variable of type WebAssembly.Module in your worker script. You can then instantiate it like this:

// Define imported functions that your WASM can call.
const imports = {
  exampleImport(a, b) {
    return a + b;
  }
}

// Instantiate the module.
const instance = new WebAssembly.Instance(MY_WASM_MODULE, imports)

// Now you can call the functions that your WASM exports.
instance.exports.exampleExport(123);

Check out the MDN WebAssembly API documentation for more details on instantiating WebAssembly modules.
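
As a small follow-on sketch (not from the original post), the instance created above could be used directly in a fetch handler; exampleExport and its argument are the hypothetical names from the snippet:

// Building on the snippet above: answer each request with the result of the export.
addEventListener('fetch', event => {
  const result = instance.exports.exampleExport(123);
  event.respondWith(new Response('exampleExport(123) returned ' + result));
});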

You can also, of course, upload WebAssembly modules via our API instead of the online editor.

Check out the documentation for details »

Building WASM modules

Today, building a WebAssembly module for Cloudflare is a somewhat manual process involving low-level tools. Check out our demo repository for details.

Now that the basic support is in place, we plan to work with Emscripten and the rest of the WASM community to make sure building WASM for Cloudflare Workers is as seamless as building for a web browser. Stay tuned for further developments.

The Future

We're excited by the possibilities that WebAssembly opens up. Perhaps, by integrating with Cloudflare Spectrum, we could allow existing C/C++ server code to handle arbitrary TCP and UDP protocols on the edge, like a sort of massively-distributed inetd. Perhaps game servers could reduce latency by running on Cloudflare, as close to the player as possible. Maybe, with the help of some GPUs and OpenGL bindings, you could do 3D rendering and real-time streaming directly from the edge. Let us know what you'd like to see »

Want to help us build it? We're hiring!

Categories: Technology
