Blogroll: CloudFlare

I read blogs, as well as write one. The 'blogroll' on this site reproduces some posts from some of the people I enjoy reading. There are currently 32 posts from the blog 'CloudFlare.'

Disclaimer: Reproducing an article here does not necessarily imply agreement or endorsement!

Subscribe to CloudFlare feed
Cloudflare Blog
Updated: 1 hour 16 min ago

Want to try Warp? We just enabled the beta for you

Thu, 23/11/2017 - 02:00

Tomorrow is Thanksgiving in the United States. It’s a holiday for getting together with family characterized by turkey dinner and whatever it is that happens in American football. While celebrating with family is great, if you use a computer for your main line of work, sometimes the conversation turns to how to set up the home Wi-Fi or whether Russia can really use Facebook to hack the US election. Just in case you’re a geek who finds yourself in that position this week, we wanted to give you something to play with. To that end, we’re opening the Warp beta to all Cloudflare users. Feel free to tell your family there’s been an important technical development you need to attend to immediately and enjoy!

Hello Warp! Getting Started

Warp allows you to expose a locally running web server to the internet without having to open up ports in the firewall or even needing a public IP address. Warp connects a web server directly to the Cloudflare network where Cloudflare acts as your web server’s network gateway. Every request reaching your origin must travel to the Cloudflare network where you can apply rate limits, access policies and authentication before the request hits your origin. Plus, because your origin is never exposed directly to the internet, attackers can’t bypass protections to reach your origin.

Warp is really easy to get started with. If you use homebrew (we also have packages for Linux and Windows) you can do:

$ brew install cloudflare/cloudflare/warp
$ cloudflare-warp login
$ cloudflare-warp --hostname warp.example.com --hello-world

In this example, replace example.com with the domain you chose during the login command. The warp.example.com subdomain doesn’t need to exist in DNS yet; Warp will add it for you automatically.

That last command spins up a web server on your machine serving the hello warp world webpage. Then Warp starts up an encrypted virtual tunnel from that web server to the Cloudflare edge. When you visit warp.example.com (or whatever domain you chose), your request first hits a Cloudflare data center, then is routed back to your locally running hello world web server on your machine.
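To make the idea of a locally running web server concrete, here is a minimal sketch of the kind of service you could later point Warp at (an illustrative example of mine, not Warp's built-in hello-world server; it assumes Go's net/http and the localhost:8080 address used in the commands below):

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Requests to warp.example.com arrive here through the Warp tunnel.
        fmt.Fprintln(w, "Hello from behind the Warp tunnel!")
    })
    // Bind to localhost only: the server never needs to be reachable from the
    // public internet, because the tunnel connection is made outbound to Cloudflare.
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}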

If someone far away visits warp.example.com, they connect to the Cloudflare data center closest to them, are routed to the Cloudflare data center your Warp instance is connected to, and then travel over the Warp tunnel back to your web server. If you want to make that connection between Cloudflare data centers really fast, enable Argo, which bypasses internet latencies and network congestion on optimized routes linking the Cloudflare data centers.

To point Warp at a real web server you are running instead of the hello world web server, replace the hello-world flag with the location of your locally running server:

$ cloudflare-warp --hostname warp.example.com http://localhost:8080

Using Warp for Load Balancing

Let’s say you have multiple instances of your application running and you want to balance load between them or always route to the closest one for any given visitor. As you spin up Warp, you can register the origins behind Warp to a load balancer. For example, I can run this on 2 different servers (e.g. one on a container in ECS and one on a container in GKE):

$ cloudflare-warp --hostname warp.example.com --lb-pool origin-pool-1 http://localhost:8080

And connections to warp.example.com will be routed seamlessly between the two servers. You can do this with an existing origin pool or a brand new one. If you visit the load balancing dashboard you will see the new pool created with your origins in it, or the origins added to an existing pool.

You can also set up a health check so that if one origin goes offline, it automatically gets deregistered from the load balancer pool and requests are only routed to the origins that remain online.

Automating Warp with Docker

You can add Warp to your Dockerfile so that as containers spin up or as you autoscale, containers automatically register themselves with Warp to connect to Cloudflare. This acts as a kind of service discovery.

A reference Dockerfile is available here.

Requiring User Authentication

If you use Warp to expose dashboards, staging sites, and other internal tools that you don’t want available to everyone on the internet, we have a new product in beta that lets you quickly put a login page in front of your Warp tunnel.

To get started, go to the Access tab in the Cloudflare dashboard.

There you can define which users should be able to log in to use your applications. For example, if I wanted to limit access to warp.example.com to just people who work at Cloudflare, I can do:

Enjoy!

Enjoy the Warp beta! (But don't wander too deep into the Warp tunnel and forget to enjoy time with your family.) The whole Warp team is following this thread for comments, ideas, feedback and show and tell. We’re excited to see what you build.

Categories: Technology

Releasing AddThis on Cloudflare Apps: Making Disciplined Product Design Decisions

Wed, 22/11/2017 - 13:00

This is a guest post by Emily Schwartz, Product Manager for the AddThis team at Oracle. With a background in digital media spanning NPR, WaPo Labs, Trove, and others, Emily cares deeply about helping publishers leverage data and technology for success.

The Process of Paring Down

When our team learned about the opportunity to build an AddThis app on Cloudflare Apps, I was ready to pounce. Building for distribution platforms is a core part of our business and product strategy, and I knew AddThis could bring a lot to the table for Cloudflare users. With a media background in my pocket, I understand the necessity of making content easily and quickly distributable -- and I wanted to get our tools in front of new users so we could learn more about the critical needs of publishers, merchants, and website owners.

The decision to build was the easy part. What to build was the challenging part.

With time and resources tight, I knew building an app that offered our full suite of website tools wouldn’t be immediately feasible—or even make sense. Share buttons, follow buttons, related posts, list building, link promotion, and tip jar are all useful products, but launching with a more narrow tool and feature set meant we could reach the market sooner, learn from user behaviors, and identify needs unique to Cloudflare Apps publishers. I opted to forge ahead with our most popular tool: share buttons.

If you try to configure share buttons from addthis.com, there are a lot of ways to do this: Floating, Inline, Expanded, Image Sharing, Popup, Banner, and Slider. Seven options just for share buttons! My goal with Cloudflare Apps was to launch something simple, useful, and closer to drag-and-drop than code-and-configure. With this in mind, I made a hard decision: pare down our app to the simplest version of our floating sharing sidebar—our most popular share buttons type—and cut many of the advanced configuration options. Instead, I decided to serve auto-personalized buttons and limit settings to cosmetic changes like number of services displayed and bordered styling. Perhaps the biggest change: users don’t even need to register an AddThis account to use our share buttons on Cloudflare Apps or work with any code. We created the simplest version of our share buttons to date.

With the scope trimmed down to “Share Buttons Lite,” we got to work.

The AddThis team is no stranger to building for third-party platforms. Our tools are found on platforms like WordPress, Shopify, Magento, and others. Building for Cloudflare Apps turned out to be a dream, better than we could have imagined. There was one wrinkle to figure out: if we weren’t asking users to create or log in to an addthis.com account, how would we save unique configuration settings?

Some background

Every website with AddThis tools has a configuration object where on-page tool settings are stored. This includes configuration data such as layout, color, theme, and social media handles. This data needs to be stored each time tools are updated via the Cloudflare portal and loaded each time a website visitor lands on a page with AddThis tools. Ideally, this information is stored in a database and read each time tools are rendered. While this approach is feasible when users configure tools through the addthis.com dashboard, it’s not an option in Cloudflare Apps.

How to store and render sidebar settings for each user became an anticipated hurdle. Luckily, there was a good solution: save the tool configuration data on a JS global variable using Cloudflare’s suggested INSTALL_SCOPE JS technique and, using the AddThis Smart Layers API, render the tools from this global variable to display in the preview portal. When the user saves their configuration, we call and serve the settings stored on that global variable each time tools need to be rendered.

Anyone can check out this method in action by previewing the AddThis Share Buttons app from Cloudflare Apps and playing around with the tools’ positioning, styling, and other settings.

In the few weeks since our launch, we’ve received a lot of useful feedback—good, bad, and ugly. The Cloudflare Apps developer portal allows developers to view basic metrics and user comments that keep third parties up-to-date about what’s important to websites and publishers. In the future, we’re considering connecting the app to the addthis.com dashboard and including other tool styles or types. We’ve heard a lot about page speed scores and mobile performance being important to users, and I’m pleased to report these are both areas of continued investment for AddThis. Paring down Share Buttons—AddThis’ flagship product—was a risk, and it’s one we’re happy we took.

Check out a live preview of AddThis in Cloudflare Apps »

Want to shape the future of content and sharing? We’re all ears at help@addthis.com and @addthissupport. Happy sharing.

Categories: Technology

Living In A Multi-Cloud World

Tue, 21/11/2017 - 16:30

A few months ago at Cloudflare’s Internet Summit, we hosted a discussion on A Cloud Without Handcuffs with Joe Beda, one of the creators of Kubernetes, and Brandon Phillips, the co-founder of CoreOS. The conversation touched on multiple areas, but it’s clear that more and more companies are recognizing the need to have some strategy around hosting their applications on multiple cloud providers.

Earlier this year, Mary Meeker published her annual Internet Trends report which revealed that 22% of respondents viewed Cloud Vendor Lock-In as a top 3 concern, up from just 7% in 2012. This is in contrast to previous top concerns, Data Security and Cost & Savings, both of which dropped amongst those surveyed.

At Cloudflare, our mission is to help build a better internet. To fulfill this mission, our customers need to have consistent access to the best technology and services, over time. This is especially the case with respect to storage and compute providers. This means not becoming locked-in to any single provider and taking advantage of multiple cloud computing vendors (such as Amazon Web Services or Google Cloud Platform) for the same end user services.

The Benefits of Having Multiple Cloud Vendors

There are a number of potential challenges when selecting a single cloud provider. Though there may be scenarios where it makes sense to consolidate on a single vendor, we believe it is important that customers are aware of the choice they are making and of the downsides of being locked in to that particular vendor. In short, know what trade-offs you are making should you decide to consolidate parts of your network, compute, and storage with a single cloud provider. While not comprehensive, here are a few trade-offs you may be making if you are locked in to one cloud.

Cost Efficiencies

For some companies, there may be cost savings involved in spreading traffic across multiple vendors. Some can take advantage of free or reduced-cost tiers at lower volumes. Vendors may provide reduced costs for certain times of day when their infrastructure is less utilized. Applications can have varying compute requirements amongst layers of the application: some may require faster, immediate processing while others may benefit from delayed processing at a lower cost.

Negotiation Strength

One of the most important reasons to consider deploying in multiple cloud providers is to minimize your reliance on a single vendor’s technology for your critical business processes. As you become more vertically integrated with any vendor, your negotiation posture for pricing or favorable contract terms becomes diminished. Having production ready code available on multiple providers allows you to have less technical debt should you need to change. If you go a step further and are already sending traffic to multiple providers, you have minimized the technical debt required to switch and can negotiate from a position of strength.

Business Continuity or High Availability

While the major cloud providers are generally reliable, there have been a few notable outages in recent years, the most significant in recent memory being Amazon’s US-EAST S3 outage in February. Some organizations may have a policy specifying multiple providers for high availability, while others should consider it where necessary and feasible as a best practice. A multi-cloud strategy can lower the operational risk of a single vendor’s mistakes causing a significant outage for a mission-critical application.

Experimentation

One of the exciting things about having competition in the space is the level of innovation and feature velocity of each provider. Every year there are major announcements of new products or features that may have a significant impact on improving your organization's competitive advantage. Having test and production environments in multiple providers gives your engineers the ability to understand and experiment with a new capability in the context of your technology stack and data. You may even try these features for a portion of your traffic and get real world data on any benefits realized.

Cloudflare’s Role

Cloudflare is an independent third party in your multi-cloud strategy. Our goal is to minimize the layers of lock-in between you and a provider and lower the effort of change. In particular, one area where we can help right away is to minimize the operational changes necessary at the network, similar to what Kubernetes can do at the storage and compute level. As a benefit of our network, you can also have a centralized point for security and operational control.

Cloudflare’s Load Balancing can easily be configured to act as your global application traffic aggregator and distribute your traffic amongst origins at as many clouds as you choose to utilize. Active layer 7 health checks continually probe your origins and can automatically move traffic in the case of network or application failure. All consolidated web traffic can be inspected and acted upon by Cloudflare’s best of breed Security services, providing a single control point and visibility across all application traffic, regardless of which cloud the origin may be on. You also have the benefit of Cloudflare’s Global Anycast Network, providing for better speed and higher availability regardless of which clouds your origins are hosted on.

Billforward: Using Cloudflare to Implement Multi-Cloud

Billforward is a San Francisco and London based startup on a mission to change the way people bill and charge their customers, providing a solution to the complexities of Quote-to-Cash. Their platform is built on a number of REST APIs that other developers call to bill and generate revenue for their own companies.

Billforward is using Cloudflare for its core customer-facing application to fail over traffic between Google Compute Engine and Amazon Web Services. Acting as a reverse proxy, Cloudflare receives all requests for the application and decides which of Billforward’s two configured cloud origins to use based upon the availability of that origin in near real-time. This allows Billforward to completely manage the connections to and from two disparate cloud providers using Cloudflare’s UI or API. Billforward is in the process of migrating all of their customer-facing domains to a similar setup.

Configuration

Billforward has a single load balanced hostname with two available Pools. They’ve named the two Pools with “gce” and “aws” labels and each Pool has one Origin associated with it. All of the Pools are enabled and the entire LB/hostname is proxied through Cloudflare (as indicated by the orange cloud).

Cloudflare probes Billforward’s Origins once every minute from all of Cloudflare’s data centers around the world (a feature available to all Load Balancing Enterprise customers). If Billforward’s GCE Origin goes down, Cloudflare will quickly and automatically failover to the AWS Origin with no actions required from Billforward’s team.
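Conceptually, the failover behaviour amounts to something like the sketch below. This is purely illustrative: the health-check URLs are hypothetical placeholders, and Cloudflare's own probes run from its data centers rather than from customer code. It simply shows the idea of checking each origin pool in priority order and routing to the first healthy one.

package main

import (
    "fmt"
    "net/http"
    "time"
)

// Hypothetical health-check endpoints for the two origin pools, in priority order.
var origins = []string{
    "https://gce-origin.example.com/health",
    "https://aws-origin.example.com/health",
}

// healthy reports whether an origin answers its health check with HTTP 200.
func healthy(url string) bool {
    client := http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}

func main() {
    // Probe roughly once a minute and pick the first healthy pool.
    for range time.Tick(time.Minute) {
        routed := false
        for _, origin := range origins {
            if healthy(origin) {
                fmt.Println("routing traffic to", origin)
                routed = true
                break
            }
        }
        if !routed {
            fmt.Println("no healthy origins available")
        }
    }
}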

Google Compute Engine was chosen as the primary provider for this application by virtue of cost. Martin Lee, Site Reliability Engineer at Billforward says, “Essentially, GCE is cheaper for our general purpose computing needs but we're more experienced with deployments in AWS. This strategy allows us to switch back and forth at will and avoid being tied in to either platform.” It is likely that Billforward will change the priority as pricing models evolve.

“It's a fairly fast moving world and features released by cloud providers can have a meaningful impact on performance and cost on a week by week basis - it helps to stay flexible,” says Martin. “We may also change priority based on features.”


For orchestration of the compute and storage layers, Billforward uses Docker containers managed through Rancher. They use distinct environments between cloud providers but are considering bridging an environment across cloud providers and using VPNs between them, which will enable them to move load between providers even more easily. “Our system is loosely coupled through a message queue,” adds Martin. “Having a container system across clouds means we can really take advantage of this - we can very easily move workloads across clouds without any danger of dropping tasks or ending up in an inconsistent state.”

Benefits

Billforward manages these connections at Cloudflare’s edge. Through this interface (or via the Cloudflare APIs), they can also manually move traffic from GCE to AWS by simply disabling the GCE pool or by rearranging the Pool priority to make AWS the primary. These changes are near instant on the Cloudflare network and require no downtime for Billforward’s customer-facing application. This allows them to act on potentially advantageous pricing changes between the two cloud providers or move traffic to hit pricing tiers.

In addition, Billforward is no longer “locked in” to either provider’s network; being able to move traffic without any downtime means they can make traffic changes independent of Amazon or Google. They can also integrate additional cloud providers any time they deem fit: adding Microsoft Azure, for example, as a third Origin would be as simple as creating a new Pool and adding it to the Load Balancer.

Billforward is a good example of a forward thinking company that is taking advantage of technologies from multiple providers to best serve their business and customers, while not being reliant on a single vendor. For further detail on their setup using Cloudflare, please check their blog.

Categories: Technology

The Supreme Court Wanders into the Patent Troll Fight

Mon, 20/11/2017 - 18:18

Next Monday, the US Supreme Court will hear oral arguments in Oil States Energy Services, LLC vs. Greene’s Energy Group, LLC, which is a case to determine whether the Inter Partes Review (IPR) administrative process at the US Patent and Trademark Office (USPTO) used to determine the validity of patents is constitutional.

The constitutionality of the IPR process is one of the biggest legal issues facing innovative technology companies, as the availability of this process has greatly reduced the anticipated costs, and thereby lessened the threat, of patent troll litigation. As we discuss in this blog post, it is ironic that the outcome of a case that is of such great importance to the technology community today may hinge on what courts in Britain were and were not doing more than 200 years ago.

Image: Thomas Rowlandson [Public domain], via Wikimedia Commons

As we have discussed in prior blog posts, the stakes are high: if the Supreme Court finds IPR unconstitutional, then the entire system of administrative review by the USPTO — including IPR and ex parte processes — will be shuttered. This would be a mistake, as administrative recourse at the USPTO is one of the few ways to avoid the considerable costs and delays of federal court litigation, which can take years and run into the millions of dollars. Those heavy costs are often leveraged by patent trolls when they threaten litigation in the effort to procure easy and lucrative settlements from their targets.

Cloudflare is Pursuing Our Fight Against Patent Trolls All the Way to the Steps of the Supreme Court

Cloudflare joined Dell, Facebook, and a number of other companies, all practicing entities with large patent portfolios, in a brief amici curiae (or ‘friend of the court’ brief) in support of the IPR process, because it has a substantial positive impact on technological innovation in the United States. Amicus briefs allow parties who are interested in the outcome of a case, but are not parties to the immediate dispute before the court, to have input into the court’s deliberations.

As many of you are aware, we were sued by Blackbird Technologies, a notorious patent troll, earlier this year for patent infringement, and initiated Project Jengo to crowd source prior art searches and invalidate Blackbird’s patents. One of our strategies for quickly and efficiently invalidating Blackbird’s patents is to take advantage of the IPR process at the USPTO, which can be completed in about half the time and at one tenth of the cost of a federal court case, and to initiate ex parte proceedings against Blackbird’s other patents that are overly broad and invalid.

A full copy of the Amicus Brief we joined in the Oil States case is available here, and a summary of the argument follows.

Oil States Makes its Case

Oil States is an oilfield services and drilling equipment manufacturing company. The USPTO invalidated one of its patents related to oil drilling technology in an IPR proceeding while Oil States had a lawsuit pending against one of its competitors claiming infringement of its patent. After it lost the IPR, Oil States lost an appeal in a lower federal court based on the findings of the IPR proceeding. The Supreme Court agreed to hear the case to determine whether once the USPTO issues a patent, an inventor has a constitutionally protected property right that — under Article III of the U.S. Constitution (which outlines the powers of the judicial branch of the government), and the 7th Amendment (which addresses the right to a jury trial in certain types of cases) — cannot be revoked without intervention by the court system.

Image by Paul Lowry

As the patent owner, Oil States argues that the IPR process violates the relevant provisions of the Constitution by allowing an administrative body, the Patent Trial and Appeal Board (PTAB), a non-judicial forum, to decide a matter which was historically handled by the judiciary. This argument rests upon the premise that there was a historical analogue to cancellation of patent claims available in the judiciary. Since cancellation of patent claims was historically available in the judiciary, the argument goes, the cancellation of patent claims today must be consistent with that history and done exclusively by courts.

This argument is flawed on multiple counts, which are set forth in the “friend of the court” brief we joined.

First Flaw: An Administrative Process Even an Originalist Can Love

As the amicus brief we joined points out, patent revocation did not historically rest within the exclusive province of the common law and chancery courts, the historical equivalents in Britain to the judiciary in the United States. Rather, prior to the founding of the United States, patent revocation rested entirely with the Crown of England’s Privy Council, a non-judicial body made up of advisors to the king or queen of England. It wasn’t until later that the Privy Council granted the chancery court (the judicial branch) concurrent authority to revoke patents. Because a non-judicial body had the authority to revoke patents when the US Constitution was framed, the general principles of separation of powers and the right to trial in the Constitution do not require that patentability challenges be decided solely by courts.

Second Flaw: The Judicial Role was Limited

Not only did British courts historically share the power to address patent rights, the part shared by the courts was significantly limited. The common-law and chancery courts received only a partial delegation of the Privy Council’s authority to invalidate patents. Courts only had the authority to invalidate patents for issues such as inequitable conduct (e.g., making false statements in the original patent application). The limited authority delegated to the English courts did not include the authority to cancel claims based on elements intrinsic to the patent or patent application, like lack of novelty or obviousness, as is done in an IPR proceeding. Rather, such authority remained with the Privy Council, a non-court authority, which decided questions like whether the invention was really new. Thus, like the PTAB, the Privy Council was a non-judicial body charged with responsibility to assess patent validity based on criteria that included the novelty of the invention.

We think these arguments are compelling and provide very strong reasons why the Supreme Court should resist the request that such matters be resolved exclusively in federal courts. We hope that’s the position they do take because the real world implications are significant.

Don’t Mess with a Good Thing

The IPR process is not only consistent with the US Constitution, but it also advances the Patent Clause’s objective of promoting the progress of science and useful arts. That is, the “quid pro quo of the patent system; the public must receive meaningful disclosure in exchange for being excluded from practicing the invention for a limited period of time” by patent rights. (Enzo Biochem, Inc. v. Gen-probe Inc.) Congress created the IPR process in the America Invents Act in 2011 to use administrative review to weed out poor-quality patents that did not satisfy this quid pro quo because they had not actually disclosed very much. Congress sought to provide quick and cost effective administrative procedures for challenging the validity of patent claims that did not disclose novel inventions, or that claimed to disclose substantially more innovation than they actually did, to improve patent quality and restore confidence in the presumption of validity. In other words, Congress created a system to specifically permit the efficient challenge of the zealous assertion of vague and overly broad patents.

As a recent study by the Congressional Research Service found, non-practicing entity (i.e., patent troll) patent litigation “activity cost defendants and licensees $29 billion in 2011, a 400 percent increase over $7 billion in 2005” and “the losses are mostly deadweight, with less than 25 percent flowing to innovation and at least that much going towards legal fees.” (see Brian T. Yeh, Cong. Research Serv., R42668) The IPR process enables innovative companies to navigate patent troll activity in an efficient manner and devote a greater proportion of their resources to research and development, rather than litigation or cost-of-litigation settlement fees for invalid patents.

Image by EFF-Graphics (Own work), via Wikimedia Commons

Additionally, the IPR process reduces the total number and associated costs of patent disputes in a number of ways.

  • Patent owners, especially patent trolls, are less likely to threaten litigation or file an infringement suit based on patent claims that they know or suspect to be invalid. In fact, patent owners who threaten or file suit merely to seek cost-of-litigation settlements have become far less prevalent because of the availability of the IPR process to reduce the cost of litigation.

  • Patent owners are less likely to initiate litigation out of concerns that the IPR proceedings may culminate in PTAB’s cancellation of all patent claims asserted in the infringement suit.

  • Where the PTAB does not cancel all asserted claims, statutory estoppel and the PTAB’s claim construction may serve to narrow the infringement issues to be resolved by the district court.

Our hope is that the US Supreme Court justices take into full consideration the larger community of innovative companies that are helped by the IPR system in battling patent trolls, and do not limit their consideration to the implications on the parties to Oil States (neither of which is a non-practicing entity). As we have explained, not only does the IPR process enable innovative companies to focus their resources on technological innovation, instead of legal fees, but allowing the USPTO to administer IPR and ex parte proceedings is entirely consistent with the US Constitution.

While we await a decision in Oil States, expect to see Cloudflare initiate IPR and ex parte proceedings against Blackbird Technologies patents in the coming months.

We will make sure to keep you updated.

Categories: Technology

7 Cloudflare Apps Which Increase User Engagement on Your Site

Tue, 14/11/2017 - 20:21

Cloudflare Apps now lists 95 apps, ranging from apps which grow email lists to apps which acquire new customers to apps which help site owners make more money. The great thing about these apps is that users don't need any coding or development skills. They can just sign up for the app and start using it on their sites.

Let’s take a moment to highlight some apps which increase a site’s user engagement. Check out more Cloudflare Apps which grow your email list, make money on your site, and get more customers.

I hope you enjoy them and I hope you build (or use) great apps like these too.

Check out other Cloudflare Apps »

Build an app on Cloudflare Apps »

1. Privy

Over 100,000 businesses use Privy to capture and convert website visitors. Privy offers a free suite of email capture tools, including exit-intent driven website popups & banners, email list sign-up, an online store, social media channels, mobile capability, and in-store traffic.

In the left preview pane, you can view the different packages users may sign up for, from free to "growth" ($199/month), and their features.

In the right pane, you can preview your choices, seeing how Privy's functionality interacts with the site and users and even play around with the popups yourself. I personally love the confetti.

2. AddThis Share Buttons

The Share Buttons app from AddThis is a super easy way to add sidebar share buttons to a website and get the site's content distributed to over 200 social media channels, auto-personalized for users. Site owners can display between 1 and 10 share services, customized for their users. Site owners are also able to control the size and style of the buttons.

In the left pane, you can see the install options where you can customize the size, theme and position of the banner. You can adjust the number of social platforms listed and choose to list the number of content shares per platform or total.

In the right pane, you can preview choices, seeing what they’d look like on any website and experiment with placement and how it flows with the site. This is very similar to the tool that the app developer uses to test the app for how it behaves on a wide range of web properties.

3. Vimeo

This app embeds Vimeo videos directly onto sites, so people can easily find and view videos the site owner made, or maybe just a few of their favorites. The Vimeo app supports autoplay and multiple videos on one page, in multiple locations on the page. A site owner can put videos almost anywhere on their site, if they wish.

In the left pane, you can change the location of the video on the page, change its orientation in each location, switch Vimeo video links, and add multiple videos to the page.

In the right pane, you can preview your choices, seeing where each video will be displayed.

4. SoundCloud

You can't grow as a musician, podcaster, or influencer if your favorite tracks aren't heard. The SoundCloud app embeds SoundCloud tracks or playlists directly on sites to grow a site owner's audience, help them find fans, or give the reader a taste of the music being reviewed in a blog post, all of which helps keep users engaged on the site longer.

This preview works very similarly to the Vimeo preview. You can change the location of the track/playlist on the page, its orientation in each location, switch audio tracks/playlists, and add multiple tracks/playlists to the page.

5. Drift

Drift uses messaging and artificial intelligence to catch leads that are falling through the cracks in sales funnels. Billions of people around the world are using messaging apps like Slack and HipChat to communicate in real-time these days and customers often don't want to wait for follow-up emails or calls.

After installing Drift, site owners are able to engage with leads when they’re at their most interested - while they’re actually live on their sites. Using intelligent bots, qualified leads can be captured 24/7, even when no human is available to chat.

In Drift's preview, you can view how the styling and messaging of the chat box may be adjusted. You can change the organization name, welcome message, and away message. You can also select the "automatically open chat" option to give users a more visible chat invitation.

6. Skype Live Chat

Skype Live Chat installs on websites easily. Site owners just add their Skype usernames or bot IDs to start chatting with their customers. Considering there are over 74 million Skype users, this app allows users to chat in a way that they're comfortable and familiar with.

In the left pane, you can see options to change the color of the chat button and message bubble. In the right pane, you can see the results and how users will be prompted to log into their Skype accounts, if they aren't already.

7. Weather Widget

WeatherWidget.io provides an easy, interactive interface with which site owners can create and customize a weather widget for any website. WeatherWidget.io is highly customizable: the labels, font, theme, number of forecast days, and icon sets may all be edited to maximize user engagement. The app is responsive to the size of its container and available in 20 languages, with more on the way.

There are a lot of options to play with here. In the left pane, you can see that the location of the banner can be changed. The weather location can be adjusted and custom labels assigned to it; I chose "SF, CA" in the preview. The language and units (Celsius vs. Fahrenheit) can be switched. Next, the font, icon type (I chose animated climacons), and the option to show a forecast vs. current weather can be changed. Finally, you can play with the theme, with options from "orange" to "blue mountains", and adjust the widget's margins.

In the right pane, you can see your options come to life.

I hope you enjoy playing with and using these apps. Cloudflare is encouraging app authors to write their apps on the app platform. I hope you'll consider developing one of your own.

Build an app on Cloudflare Apps »

Categories: Technology

The Super Secret Cloudflare Master Plan, or why we acquired Neumob

Tue, 14/11/2017 - 14:03

We announced today that Cloudflare has acquired Neumob. Neumob’s team built exceptional technology to speed up mobile apps, reduce errors on challenging mobile networks, and increase conversions. Cloudflare will integrate the Neumob technology with our global network to give Neumob truly global reach.

It’s tempting to think of the Neumob acquisition as a point product added to the Cloudflare portfolio. But it actually represents a key part of a long term “Super Secret Cloudflare Master Plan”.

CC BY 2.0 image by Neil Rickards

Over the last few years Cloudflare has been building a large network of data centers across the world to help fulfill our mission of helping to build a better Internet. These data centers all run an identical software stack that implements Cloudflare’s cache, DNS, DDoS, WAF, load balancing, rate limiting, etc.

We’re now at 118 data centers in 58 countries and are continuing to expand with a goal of being as close to end users as possible worldwide.

The data centers are tied together by secure connections which are optimized using our Argo smart routing capability. Our Quicksilver technology enables us to update and modify the settings and software running across this vast network in seconds.

We’ve also been extending the network to reach directly into devices and servers. Our 2014 technology Railgun helped to speed up connections to origin HTTP servers. Our recently announced Warp technology is used to connect servers and services (such as those running inside a Kubernetes cluster) to Cloudflare without having a public HTTP endpoint. Our IoT solution, Orbit, enables smart devices to connect to our network securely.

The goal is that any end device (web browser, mobile application, smart meter, …) should be able to securely connect to our network and have secure, fast communication from device to origin server with every step of the way optimized and secured by Cloudflare.

While we’ve spent a lot of time on the latest encryption and performance technologies for the web browser and server, we had not done the same for mobile applications. That changes with our acquisition of Neumob.

Why Neumob

The Neumob software running on your phone changes how a mobile app interacts with an API running on an HTTP server. Without Neumob, that API traffic uses the standard Internet protocols (such as HTTPS) and is prone to being affected negatively by varying performance and availability in mobile networks.

With the Neumob software any API request from a mobile application is sent across an optimized set of protocols to the nearest Cloudflare data center. These optimized protocols are able to handle mobile network variability gracefully and securely.

Cloudflare's Argo then optimizes the route across the long-haul portion of the network to our data center closest to the origin API server. Then Cloudflare's Warp optimizes the path from the edge of our network to the origin server where the application’s API is running. End-to-end, Cloudflare can supercharge and secure the network experience.

Ultimately, the Neumob software is easily extended to operate as a “VPN” for mobile devices that can secure and accelerate all HTTP traffic from a mobile device (including normal web browsing and app API calls). Most VPN software, frankly, is awful. Using a VPN feels like a step backward to the dial-up era of obscure error messages, slowdowns, and clunky software. It really doesn’t have to be that way.

And in an era where SaaS, PaaS, IaaS and mobile devices have blown up the traditional company ‘perimeter’ the entire concept of a Virtual Private Network is an anachronism.

Going Forward

At the current time the Neumob service has been discontinued as we move their server components onto the Cloudflare network. We’ll soon relaunch it under a new name and make it available to mobile app developers worldwide. Developers of iOS and Android apps will be able to accelerate and protect their applications’ connectivity by adding just two lines of code to take advantage of Cloudflare’s global network.

As a personal note, I’m thrilled that the Neumob team is joining Cloudflare. We’d been tracking their progress and development for a few years and had long wanted to build a Cloudflare Mobile App SDK that would bring the network benefits of Cloudflare right into devices. It became clear that Neumob’s technology and team was world-class and that it made more sense to abandon our own work to build an SDK and adopt theirs.

Categories: Technology

Thwarting the Tactics of the Equifax Attackers

Mon, 13/11/2017 - 16:09

We are now three months on from one of the biggest, most significant data breaches in history, but has it redefined people's awareness of security?

The answer to that is absolutely yes: awareness is at an all-time high. Awareness, however, does not always result in positive action. The fallacy which is often assumed is "surely, if I keep my software up to date with all the patches, that's more than enough to keep me safe?" It's true, keeping software up to date does defend against known vulnerabilities, but it's a very reactive stance. The more important part is protecting against the unknown.

Something every engineer will agree on is that security is hard, and maintaining systems is even harder. Patching or upgrading systems can lead to unforeseen outages or unexpected behaviour due to other fixes which may be applied. This, in most cases, can cause huge delays in the deployment of patches or upgrades, because they require either regression testing or deployment in a staging environment. While processes are followed and tests are done, systems sit vulnerable, ready to be exploited if they are exposed to the internet.

Looking at the wider landscape, an increase in security research has created a surge of CVEs (Common Vulnerabilities and Exposures) being announced. Compounded by GDPR, NIST, and other new data protection legislation, businesses are now forced to pay much more attention to security vulnerabilities that could potentially affect their software and ultimately put them on the forever-growing list of data breach victims.

Stats from cvedetails.com (November 2017)

Dissecting the Equifax tragedy: in testimony, the CEO mentioned that the reason for the breach was that one single person within the organisation was responsible for communicating the availability of the patch for Apache Struts, the software at the heart of the breach. The crucial lesson learned from Equifax is that we are all human and mistakes can happen; having multiple people responsible for communicating and notifying teams about threats is crucial. In this case, the mistake almost destroyed one of the largest credit agencies in the world.

How could attacks and breaches like Equifax be avoided? The first step is understanding how these attacks happen. There are a few key attack types which are often the source of data exfiltration through vulnerable software:

  • Remote Code Execution (RCE) - which is what was used in the Equifax Breach

  • SQL Injection (SQLi), which is delivering an SQL statement hidden in a payload, accessing a backend database powering a website.

Remote Code Execution

The Struts vulnerability, CVE-2017-5638, which is protected against by rule 100054 in Cloudflare Specials, was quite simple. In a payload targeted at the web server, a specific command could be executed, as can be seen in the example below:

"(#context.setMemberAccess(#dm))))." "(#cmd='touch /tmp/hacked')." "(#iswin=(@java.lang.System@getProperty('os.name').toLowerCase().contains('win')))." "(#cmds=(#iswin?{'cmd.exe','/c',#cmd}:{'/bin/bash','-c',#cmd}))." "(#p=new java.lang.ProcessBuilder(#cmds))."

More critically, however, beyond this CVE, Apache Struts also announced another vulnerability earlier this year (CVE-2017-9805), which works by delivering a payload against the REST plugin combined with the XStream handler, which provides XML ingest capability. By delivering a specially crafted XML payload, a shell command can be embedded and will be executed.

<next class="java.lang.ProcessBuilder"> <command> "touch /tmp/hacked". </command> <redirectErrorStream>false</redirectErrorStream> </next>

And the result from the test:

root@struts-demo:~$ ls /tmp
hacked
root@struts-demo:~$

In the last week, we have seen over 180,000 hits on our WAF rules protecting against Apache Struts across the Cloudflare network.

SQL Injection

SQL Injection (SQLi) is an attempt to inject nefarious queries into a GET or POST dynamic variable which is used to query a database. On a day-to-day basis, Cloudflare sees over 2.3 million SQLi attempts on our network. Most commonly, we see SQLi attacks against WordPress sites, as it is one of the biggest web applications used on Cloudflare today. WordPress is used by some of the world's giants, like Sony Music, all the way down to "mom & pop" businesses. The challenge with being the leader in the space is that you then become a hot target. Looking at the CVE list as we near the close of 2017, there have been 41 vulnerabilities found in multiple versions of WordPress which would force people to upgrade to the latest versions. To protect our customers, and buy them time to upgrade, Cloudflare works with a number of vendors to address vulnerabilities and then applies virtual patches using our WAF to prevent these vulnerabilities from being exploited.

The way a SQL injection works is by "breaking out" of, or malforming, a query when a web application needs data from a database. As an example, a Forgotten Password page has a single email input field, which will be used to validate whether the username exists and, if so, send the user a “Forgotten Password” link. Below is a straightforward SQL query example which could be used in a web application:

SELECT user, password FROM users WHERE user = 'john@smith';

Which results in:

+------------+------------------------------+
| user       | password                     |
+------------+------------------------------+
| john@smith | $2y$10$h9XJRX.EBnGFrWQlnt... |
+------------+------------------------------+

Without the right query validation, an attacker could escape out of this query and carry out some extremely malicious queries. For example, if an attacker looking to take over another user’s account found that the query validation was inadequate, they could escape the query and UPDATE the username, which is an email address in this instance, to their own. This can be done simply by entering the query string below into the email input field instead of an email address.

dontcare@bla.com';UPDATE users SET user = 'mr@robot' WHERE user = 'john@smith';

Due to the lack of validation, the query which the web application sends to the database will be:

SELECT user, password FROM users WHERE user = 'dontcare@bla.com';UPDATE users SET user = 'mr@robot' WHERE user = 'john@smith';

Now that this has been updated, the attacker can request a password reset using their own email address, gaining access to the victim’s account.

+----------+------------------------------+
| user     | password                     |
+----------+------------------------------+
| mr@robot | $2y$10$h9XJRX.EBnGFrWQlnt... |
+----------+------------------------------+
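The underlying problem is that user input is concatenated directly into the SQL string. For contrast, here is a minimal sketch of the parameterized alternative, shown with Go's database/sql (the driver choice and connection details are illustrative placeholders, not part of the example above):

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql" // driver chosen only for this sketch
)

func lookupUser(db *sql.DB, email string) (user, hash string, err error) {
    // The email value is passed as a bound parameter, so a payload such as
    // "dontcare@bla.com';UPDATE users ..." is treated as a literal string to
    // search for, never as additional SQL.
    row := db.QueryRow("SELECT user, password FROM users WHERE user = ?", email)
    err = row.Scan(&user, &hash)
    return
}

func main() {
    db, err := sql.Open("mysql", "app:secret@tcp(127.0.0.1:3306)/appdb")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // The injection payload from above simply fails to match any user.
    if _, _, err := lookupUser(db, "dontcare@bla.com';UPDATE users SET user = 'mr@robot' WHERE user = 'john@smith';"); err != nil {
        log.Println("lookup failed:", err)
    }
}

Parameterized queries (or an ORM that uses them) address the root cause; a WAF in front of the application, as described below, adds a second layer of defense for anything that slips through.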

SQLi attacks are often aimed at fields which are not usually considered high risk, like an authentication form. To put the seriousness of SQLi attacks in perspective, in the last week we have seen over 2.4 million matches.

The Cloudflare WAF is built to not only protect customers against SQLi and RCE based attacks, but also add protection against Cross Site Scripting (XSS) and a number of other known attacks. On an average week, just on our Cloudflare Specials WAF ruleset, we see over 138 million matches.

The next important part is communication and awareness: understanding what you have installed, what versions you are running, and, most importantly, what announcements your vendor is making. Most notifications are received via email and are usually quite cumbersome to digest; regardless of their complexity, it is crucial to try to understand them.

And, finally, the last line of defense is to have protection in front of your application, which is where Cloudflare can help. At Cloudflare, security is core to our values and was one of the pillars the company was founded upon. Even today, we are known as one of the most cost-effective ways to shore up your web applications, starting with our Pro Plan at $20/mo.

Categories: Technology

Go, don't collect my garbage

Mon, 13/11/2017 - 10:31

Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. In that case the machine has 24 cores and 48 threads.

CC BY-SA 2.0 image by sponki25

I started with ECDSA P256 Sign, probably because I have warm feelings for that function, since I optimized it for amd64.
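To give a sense of what running the benchmark "on all available threads" looks like, here is a simplified sketch of such a harness (not the exact code used here): it runs the P-256 signing loop on a configurable number of goroutines for a fixed duration and reports operations per second.

package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    const goroutines = 48
    const duration = 10 * time.Second

    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        panic(err)
    }
    digest := sha256.Sum256([]byte("benchmark message"))

    var ops int64
    var wg sync.WaitGroup
    stop := time.Now().Add(duration)

    for i := 0; i < goroutines; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for time.Now().Before(stop) {
                // Every signature allocates short-lived values that become
                // garbage as soon as the result is discarded.
                if _, _, err := ecdsa.Sign(rand.Reader, key, digest[:]); err != nil {
                    panic(err)
                }
                atomic.AddInt64(&ops, 1)
            }
        }()
    }
    wg.Wait()
    fmt.Printf("ECDSA-P256 Sign,%.2f, op/s\n", float64(ops)/duration.Seconds())
}

Each signature produces results that are dropped immediately, which is exactly what makes this workload so sensitive to the garbage collector settings discussed below.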

First, I ran the benchmark on a single goroutine: ECDSA-P256 Sign,30618.50, op/s

That looks good; next I ran it on 48 goroutines: ECDSA-P256 Sign,78940.67, op/s.

OK, that is not what I expected. Just over 2X speedup, from 24 physical cores? I must be doing something wrong. Maybe Go only uses two cores? I ran top; it showed 2,266% utilization. That is not the 4,800% I expected, but it is also way above 400%.

How about taking a step back, and running the benchmark on two goroutines? ECDSA-P256 Sign,55966.40, op/s. Almost double, so pretty good. How about four goroutines? ECDSA-P256 Sign,108731.00, op/s. That is actually faster than 48 goroutines, what is going on?

I ran the benchmark for every number of goroutines from 1 to 48:

[Chart: signatures per second for 1 to 48 goroutines]

Looks like the number of signatures per second peaks at 274,622, with 17 goroutines. And starts dropping rapidly after that.

Time to do some profiling.

(pprof) top 10
Showing nodes accounting for 47.53s, 50.83% of 93.50s total
Dropped 224 nodes (cum <= 0.47s)
Showing top 10 nodes out of 138
      flat  flat%   sum%        cum   cum%
     9.45s 10.11% 10.11%      9.45s 10.11%  runtime.procyield /state/home/vlad/go/src/runtime/asm_amd64.s
     7.55s  8.07% 18.18%      7.55s  8.07%  runtime.futex /state/home/vlad/go/src/runtime/sys_linux_amd64.s
     6.77s  7.24% 25.42%     19.18s 20.51%  runtime.sweepone /state/home/vlad/go/src/runtime/mgcsweep.go
     4.20s  4.49% 29.91%     16.28s 17.41%  runtime.lock /state/home/vlad/go/src/runtime/lock_futex.go
     3.92s  4.19% 34.11%     12.58s 13.45%  runtime.(*mspan).sweep /state/home/vlad/go/src/runtime/mgcsweep.go
     3.50s  3.74% 37.85%     15.92s 17.03%  runtime.gcDrain /state/home/vlad/go/src/runtime/mgcmark.go
     3.20s  3.42% 41.27%      4.62s  4.94%  runtime.gcmarknewobject /state/home/vlad/go/src/runtime/mgcmark.go
     3.09s  3.30% 44.58%      3.09s  3.30%  crypto/elliptic.p256OrdSqr /state/home/vlad/go/src/crypto/elliptic/p256_asm_amd64.s
     3.09s  3.30% 47.88%      3.09s  3.30%  runtime.(*lfstack).pop /state/home/vlad/go/src/runtime/lfstack.go
     2.76s  2.95% 50.83%      2.76s  2.95%  runtime.(*gcSweepBuf).push /state/home/vlad/go/src/runtime/mgcsweepbuf.go

Clearly Go spends a disproportionate amount of time collecting garbage. All my benchmark does is generate signatures and then dump them.

So what are our options? The Go runtime states the following:

The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.

The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:

Let’s see what setting GODEBUG to gctrace=1 does.

gc 1 @0.021s 0%: 0.15+0.37+0.25 ms clock, 3.0+0.19/0.39/0.60+5.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 2 @0.024s 0%: 0.097+0.94+0.16 ms clock, 0.29+0.21/1.3/0+0.49 ms cpu, 4->4->1 MB, 5 MB goal, 48 P
gc 3 @0.027s 1%: 0.10+0.43+0.17 ms clock, 0.60+0.48/1.5/0+1.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 4 @0.028s 1%: 0.18+0.41+0.28 ms clock, 0.18+0.69/2.0/0+0.28 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 5 @0.031s 1%: 0.078+0.35+0.29 ms clock, 1.1+0.26/2.0/0+4.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 6 @0.032s 1%: 0.11+0.50+0.32 ms clock, 0.22+0.99/2.3/0+0.64 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 7 @0.034s 1%: 0.18+0.39+0.27 ms clock, 0.18+0.56/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 8 @0.035s 2%: 0.12+0.40+0.27 ms clock, 0.12+0.63/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 9 @0.036s 2%: 0.13+0.41+0.26 ms clock, 0.13+0.52/2.2/0+0.26 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 10 @0.038s 2%: 0.099+0.51+0.20 ms clock, 0.19+0.56/1.9/0+0.40 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 11 @0.039s 2%: 0.10+0.46+0.20 ms clock, 0.10+0.23/1.3/0.005+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 12 @0.040s 2%: 0.066+0.46+0.24 ms clock, 0.93+0.40/1.7/0+3.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 13 @0.041s 2%: 0.099+0.30+0.20 ms clock, 0.099+0.60/1.7/0+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 14 @0.042s 2%: 0.095+0.45+0.24 ms clock, 0.38+0.58/2.0/0+0.98 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 15 @0.044s 2%: 0.095+0.45+0.21 ms clock, 1.0+0.78/1.9/0+2.3 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 16 @0.045s 3%: 0.10+0.45+0.23 ms clock, 0.10+0.70/2.1/0+0.23 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 17 @0.046s 3%: 0.088+0.40+0.17 ms clock, 0.088+0.45/1.9/0+0.17 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
. . . .
gc 6789 @9.998s 12%: 0.17+0.91+0.24 ms clock, 0.85+1.8/5.0/0+1.2 ms cpu, 4->6->1 MB, 6 MB goal, 48 P
gc 6790 @10.000s 12%: 0.086+0.55+0.24 ms clock, 0.78+0.30/4.2/0.043+2.2 ms cpu, 4->5->1 MB, 6 MB goal, 48 P

The first round of GC kicks in at 0.021s, then it starts collecting every 3ms and then every 1ms. That is insane, the benchmark runs for 10 seconds, and I saw 6,790 rounds of GC. The number that starts with @ is the time since program start, followed by a percentage that supposedly states the amount of time spent collecting garbage. This number is clearly misleading, because the performance indicates at least 90% of the time is wasted (indirectly) on GC, not 12%. The synchronization overhead is not taken into account. What really is interesting are the three numbers separated by arrows. They show the size of the heap at GC start, GC end, and the live heap size. Remember that a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage, and defaults to 100%.

I am running a benchmark where all allocated data is immediately discarded and collected at the next GC cycle. The only live heap belongs to the Go runtime and stays roughly fixed; adding more goroutines does not add to the live heap. In contrast, the freshly allocated data grows much faster with each additional goroutine, triggering increasingly frequent and expensive GC cycles.
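
To make the shape of the benchmark concrete, here is a minimal Go sketch of that pattern. It is not the actual harness used for the numbers in this post; the goroutine count and duration are just the ones discussed above. Every goroutine signs the same digest in a tight loop and throws the result away, so nearly everything it allocates is short-lived garbage.

// Minimal sketch of the benchmark described above: every goroutine signs
// in a tight loop and immediately discards the signature, so all
// allocations become garbage almost instantly.
package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "sync/atomic"
    "time"
)

func main() {
    key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    digest := sha256.Sum256([]byte("benchmark payload"))

    var ops int64
    const goroutines = 48 // one per hardware thread
    deadline := time.Now().Add(10 * time.Second)

    for i := 0; i < goroutines; i++ {
        go func() {
            for time.Now().Before(deadline) {
                // Each Sign call allocates; the result is dropped on the floor.
                if _, _, err := ecdsa.Sign(rand.Reader, key, digest[:]); err == nil {
                    atomic.AddInt64(&ops, 1)
                }
            }
        }()
    }

    time.Sleep(10 * time.Second)
    fmt.Printf("ECDSA-P256 Sign,%.2f, op/s\n", float64(atomic.LoadInt64(&ops))/10.0)
}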

Clearly what I needed to do next was to run the benchmark with the GC disabled, by setting GOGC=off. This led to a dramatic improvement: ECDSA-P256 Sign,413740.30, op/s.

But that is still not the number I was looking for, and running an application without garbage collection is unsustainable in the long run. So I started playing with the GOGC variable. First I set it to 2,400, which made sense since we have 24 cores; perhaps collecting garbage 24 times less frequently would do the trick: ECDSA-P256 Sign,671538.90, op/s. Oh my, that is getting better.

What if I tried 4,800, to match the number of threads? ECDSA-P256 Sign,685810.90, op/s. Getting warmer.

I ran a script to find the best value, from 100 to 20,000, in increments of 100. This is what I got:

[Chart: signatures per second for GOGC values from 100 to 20,000]

Looks like the optimal value for GOGC in this case is 11,300, and it gets us 691,054 signatures/second. That is 22.56 times faster than the single core score, and overall pretty good for a 24-core processor. Remember that when running on a single core the CPU frequency is 3.0GHz, but only 2.1GHz when running on all cores.

Per-goroutine performance when running with GOGC=11300 now looks like this:

The scaling looks much better, and even past 24 goroutines, when we run out of physical cores and start sharing them through hyper-threading, the overall performance still improves.

The bottom line here is that although this type of benchmark is definitely an edge case for garbage collection, with 48 threads allocating large amounts of short-lived data, the situation can occur in real-world scenarios. As many-core CPUs become a commodity, one should be aware of the pitfalls.

Most languages with garbage collection offer some sort of garbage collection control. Go has the GOGC variable, which can also be adjusted at run time with the SetGCPercent function in the runtime/debug package. Don't be afraid to tune the GC to suit your needs.
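
For reference, here is a small sketch of doing the same tuning from inside the program with SetGCPercent rather than the GOGC environment variable. The value 11300 matches the sweet spot found above; the right number is workload-specific, so measure before committing to one.

// A sketch of tuning the collector at run time. Setting 11300 is
// equivalent to running with GOGC=11300: the next collection triggers
// only once freshly allocated data reaches 113x the live heap.
package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    old := debug.SetGCPercent(11300)
    fmt.Printf("GC target changed from %d%% to 11300%%\n", old)

    // ... run the allocation-heavy workload here ...

    // Restore the previous setting once the burst of work is done.
    debug.SetGCPercent(old)
}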

We're always looking for Go programmers, so if you found this blog post interesting, why not check out our jobs page?

Categories: Technology

Cloudflare Wants to Buy Your Meetup Group Pizza

Fri, 10/11/2017 - 15:00

If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza. If your group’s focus falls within one of the subject categories below and you’re willing to give us a 30 second shout out and tweet a photo of your group and @Cloudflare, your meetup’s pizza expense will be reimbursed.

Get Your Pizza $ Reimbursed »

Developer Relations at Cloudflare & why we’re doing this

I’m Andrew Fitch and I work on the Developer Relations team at Cloudflare. One of the things I like most about working in DevRel is empowering community members who are already doing great things out in the world. Whether they’re starting conferences, hosting local meetups, or writing educational content, I think it’s important to support them in their efforts and reward them for doing what they do. Community organizers are the glue that holds developers together socially. Let’s support them and make their lives easier by taking care of the pizza part of the equation.


What’s in it for Cloudflare?
  1. We want web developers to target the apps platform
  2. We want more people to think about working at Cloudflare
  3. Some people only know of Cloudflare as a CDN, but it’s actually a security company and does much more than that. General awareness helps tell the story of what Cloudflare is about.
What kinds of groups we most want to support

We want to work with groups that are closely aligned with Cloudflare, groups which focus on web development, web security, devops, or tech ops. We often work with language-specific meetups (such as Go London User Group) or diversity & inclusion meetups (such as Women Who Code Portland). We’d also love to work with college student groups and any other coding groups which are focused on sharing technical knowledge with the community.

To get sponsored pizza, groups must be focused on...

  • Web performance & optimization

  • Front-end frameworks

  • Language-specific (JavaScript / PHP / Go / Lua / etc.)

  • Diversity & inclusion meetups for web devs / devops / tech ops

  • Workshops / classes for web devs / devops / tech ops

How it works
  1. Interested groups need to read our Cloudflare Meetup Group Pizza Reimbursement Rules document.

  2. If the group falls within the requirements, the group should proceed to schedule their meetup and pull the Cloudflare Introductory Slides ahead of time.

  3. At the event, an organizer should present the Cloudflare Introductory Slides at the beginning, giving us a 30 second shout out, and take a photo of the group in front of the slides. Someone from the group should Tweet the photo and @Cloudflare, so our Community Manager may retweet.

  4. After the event, an organizer will need to fill out our reimbursement form where they will upload a completed W-9, their mailing address, group details, and the link to the Tweet.

  5. Within a month, the organizer should have a check from Cloudflare (and some laptop stickers for their group, if they want) in their hands.

Here are some examples of groups we’ve worked with so far

Go London User Group (GLUG)


The Go London User Group is a London-based community for anyone interested in the Go programming language.

GLUG provides offline opportunities to:

  • Discuss Go and related topics
  • Socialize with friendly people who are interested in Go
  • Find or fill Go-related jobs

They want GLUG to be a diverse and inclusive community. As such all attendees, organizers and sponsors are required to follow the community code of conduct.

This group’s mission is very much in alignment with Cloudflare’s guidelines, so we sponsored the pizza for their October Gophers event.

Women Who Code Portland


Women Who Code’s mission statement

Women Who Code is a global nonprofit dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology.

What they offer

  • Monthly Networking Nights
  • Study Nights (JavaScript, React, DevOps, Design + Product, Algorithms)
  • Technical Workshops
  • Community Events (Hidden Figures screening, Women Who Strength Train, Happy Hours, etc.)
  • Free or discounted tickets to conferences

Women Who Code is a great example of an inclusive technical group. Cloudflare reimbursed them for their pizza at their October JavaScript Study Night.

GDG Phoenix / PHX Android


GDG Phoenix / PHX Android is a group for anyone interested in Android design and development. All skill levels are welcome. Participants are welcome to come to the meetup to hang out and watch, or to be part of the conversation at events.

This group is associated with the Google Developer Group program and coordinates activities with the national organization. Group activities are not exclusively Android related; they can cover any technology.

Cloudflare sponsored their October event “How espresso works . . . Android talk with Michael Bailey”.

I hope a lot of awesome groups around the world will take advantage of our offer. Enjoy the pizza!


Get Your Pizza $ Reimbursed »

© 2017 Cloudflare, Inc. All rights reserved. The Cloudflare logo and marks are trademarks of Cloudflare. All other company and product names may be trademarks of the respective companies with which they are associated.

Categories: Technology

On the dangers of Intel's frequency scaling

Fri, 10/11/2017 - 11:06

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomenon.

When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the ChaCha20-Poly1305 cipher does not scale very well. On a single thread it performed at approximately 2.89GB/s, whereas on 24 cores and 48 threads it performed at just over 35GB/s.

CC BY-SA 2.0 image by blumblaum

Now this is a very high number, but I would like to see something closer to 69GB/s. 35GB/s is just 1.46GB/s/core, or roughly 50% of the single core performance. AES-GCM scales much better, to 80% of single core performance, which is understandable, because the CPU can sustain a higher turbo frequency on a single core, but not on all cores.


Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full blown 64-bit instructions.

To keep power in check, Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new; it has existed since Haswell introduced AVX2 three years ago.

The scaling gets worse when more cores execute AVX-512 and when multiplication is used.

If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.

OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.

So how does this affect you if you mix a little AVX-512 with your real workload? We use Xeon Silver 4116 CPUs, with a base frequency of 2.1GHz, in a dual-socket configuration. From a figure I found on WikiChip, it seems that running AVX-512 on even just one core of this CPU will reduce the base frequency to 1.8GHz. Running AVX-512 on all cores will reduce it to just 1.4GHz.

Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? That is the question I asked myself.

I compiled two versions of NGINX, one with OpenSSL 1.1.1dev and the other with BoringSSL, and installed them on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.

I configured the server to serve a medium sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.

I then monitored the number of requests per second served under full load. This is what I got:

[Chart: requests per second served under full load, OpenSSL vs. BoringSSL, AES-128-GCM vs. ChaCha20-Poly1305]

By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up on two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.

First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.

Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.

Finally, according to perf, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less than 0.3% when only 10% of the requests use ChaCha20-Poly1305. Regardless, the CPU throttles down, because that is what it does when it sees AVX-512 running on all cores.

It is hard to say just how much each core is throttled at any given time, but by doing some sampling with lscpu I found that while executing the openssl speed -evp chacha20-poly1305 -multi 48 benchmark it shows CPU MHz: 1199.963; for OpenSSL with all AES-GCM connections I got CPU MHz: 2399.926, and for OpenSSL with all ChaCha20-Poly1305 connections I saw CPU MHz: 2184.338, which is 9% slower.
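
If you want to watch this happening on your own machine, here is a rough Go sketch that averages the "cpu MHz" field from /proc/cpuinfo (the same value lscpu reports on Linux) once a second. It is only a convenience for observing the throttling described above, not part of the benchmark itself.

// Rough sketch: sample per-core frequency on Linux by reading the
// "cpu MHz" lines from /proc/cpuinfo. Run it alongside the benchmark
// to watch the cores throttle under AVX-512 load.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

func avgMHz() (float64, error) {
    f, err := os.Open("/proc/cpuinfo")
    if err != nil {
        return 0, err
    }
    defer f.Close()

    var sum float64
    var n int
    s := bufio.NewScanner(f)
    for s.Scan() {
        line := s.Text()
        if !strings.HasPrefix(line, "cpu MHz") {
            continue
        }
        parts := strings.SplitN(line, ":", 2)
        if len(parts) != 2 {
            continue
        }
        mhz, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
        if err != nil {
            continue
        }
        sum += mhz
        n++
    }
    if n == 0 {
        return 0, fmt.Errorf("no cpu MHz entries found")
    }
    return sum / float64(n), nil
}

func main() {
    for {
        if mhz, err := avgMHz(); err == nil {
            fmt.Printf("average CPU MHz: %.0f\n", mhz)
        }
        time.Sleep(time.Second)
    }
}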

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower than AES-GCM in OpenSSL, but performs the same in BoringSSL. Why might that be? The reason is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the full base frequency.

OpenSSL 1.1.1dev is still in development, therefore I suspect no one is affected by this issue yet. We switched to BoringSSL months ago, and our server performance is not affected by this issue.

What the future holds is unclear. Intel has announced very cool new ISA extensions for future generations of CPUs that are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those in general purpose cryptography libraries will do (much) more harm than good.

The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance; on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. There are many libraries out there that use AVX and AVX2 instructions; they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance task, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

Categories: Technology

Privacy Pass - “The Math”

Thu, 09/11/2017 - 16:05

This is a guest post by Alex Davidson, a PhD student in Cryptography at Royal Holloway, University of London, who is part of the team that developed Privacy Pass. Alex worked at Cloudflare for the summer on deploying Privacy Pass on the Cloudflare network.

During a recent internship at Cloudflare, I had the chance to help improve the accessibility of websites that are protected by the Cloudflare edge network. Specifically, I helped develop an open-source browser extension named ‘Privacy Pass’ and added support for the Privacy Pass protocol within Cloudflare infrastructure. Currently, Privacy Pass works with the Cloudflare edge to help honest users reduce the number of Cloudflare CAPTCHA pages that they see when browsing the web. However, the operation of Privacy Pass is not limited to the Cloudflare use-case, and we envisage that it will find a wider and more diverse range of applications as support grows.

In summary, this browser extension allows a user to generate cryptographically ‘blinded’ tokens that can then be signed by supporting servers after some demonstration of authenticity (e.g. a CAPTCHA solution). The browser extension can then use these tokens to ‘prove’ honesty in future communications with the server, without having to solve more authenticity challenges.

The ‘blind’ aspect of the protocol means that it is infeasible for a server to link tokens that it signs to tokens that are redeemed in the future. This means that a client using the browser extension should not compromise their own privacy with respect to the server they are communicating with.

In this blog post we hope to give more of an insight into how we have developed the protocol and the security considerations that we have taken into account. We have made use of some interesting and modern cryptographic techniques that we believe could have a future impact on a wide array of problems.

Previously…

The research team released a specification last year for a “blind signing” protocol (very similar to the original proposal of Chaum) using a variant of RSA known as ‘blind RSA’. Blind RSA simply uses the homomorphic properties of the textbook RSA signature scheme to allow the user to have messages signed obliviously. Since then, George Tankersley and Filippo Valsorda gave a talk at Real World Crypto 2017 explaining the idea in more detail and how the protocol could be implemented. The intuition behind a blind signing protocol is also given in Nick’s blog post.

A blind signing protocol between a server A and a client B roughly takes the following form:

  • B generates some value t that they require a signature from A for.
  • B calculates a ‘blinded’ version of t that we will call bt
  • B sends bt to A
  • A signs bt with their secret signing key and returns a signature bz to B
  • B receives bz and ‘unblinds’ to receive a signature z for value t.

Due to limitations arising from the usage of RSA (e.g. large signature sizes, slower operations), there were efficiency concerns surrounding the extra bandwidth and computation time on the client browser. Fortunately, we received a lot of feedback from many notable individuals (full acknowledgments below). In short, this helped us to come up with a protocol with much lower overheads in storage, bandwidth and computation time using elliptic curve cryptography as the foundation instead.

Elliptic curves (a very short introduction)

An elliptic curve is defined over a finite field modulo some prime p. Briefly, an (x,y) coordinate is said to lie on the curve if it satisfies the following equation:

y^2 = x^3 + a*x + b (modulo p)

Nick Sullivan wrote an introductory blog post on the use of elliptic curves in cryptography a while back, so this may be a good place to start if you’re new to the area.

Elliptic curves have been studied for use in cryptography since the independent works of Koblitz and Miller (1984-85). However, EC-based ciphers and signature algorithms have rapidly started replacing older primitives in the Internet-space due to large improvements in the choice of security parameters available. What this translates to is that encryption/signing keys can be much smaller in EC cryptography when compared to more traditional methods such as RSA. This comes with huge efficiency benefits when computing encryption and signing operations, thus making EC cipher suites perfect for use on an Internet-wide scale.

Importantly, there are many different elliptic curve configurations, defined by the choice of p, a and b in the equation above. These present different security and efficiency trade-offs; some have been standardized by NIST. In this work we will be using the NIST-specified P-256 curve; however, this choice is largely agnostic to the protocol that we have designed.
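
As a quick illustration of the curve equation above (using Go purely as an example; any EC library would do), the standard library's crypto/elliptic package exposes exactly this check for P-256 via IsOnCurve:

// Tiny illustration: a valid P-256 public key is an (x, y) pair that
// satisfies y^2 = x^3 + a*x + b (mod p) for the curve's fixed parameters.
package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "fmt"
    "math/big"
)

func main() {
    curve := elliptic.P256()

    // A freshly generated public key lies on the curve.
    _, x, y, _ := elliptic.GenerateKey(curve, rand.Reader)
    fmt.Println("generated point on curve:", curve.IsOnCurve(x, y)) // true

    // Nudging y off the curve breaks the relation.
    y.Add(y, big.NewInt(1))
    fmt.Println("perturbed point on curve:", curve.IsOnCurve(x, y)) // almost certainly false
}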

Blind signing via elliptic curves

Translating our blind signing protocol from RSA to elliptic curves required deriving a whole new protocol. Some of the suggestions pointed us towards cryptographic constructions known as “oblivious pseudorandom functions”. A pseudorandom function, or PRF, is a mainstay of the traditional cryptographic arsenal: it takes a key and some string as input and outputs a value that looks cryptographically random.

Let F be our PRF, then the security requirement on such a function is that evaluating:

y = F(K,x)

is indistinguishable from evaluating:

y’ = f(x)

where f is a randomly chosen function with outputs defined in the same domain as F(K,-). Choosing a function f at random undoubtedly leads to random outputs; for F, however, the randomness is derived from the choice of the key K. In practice, we would instantiate a PRF using something like HMAC-SHA256.
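
For example, here is a minimal Go sketch of a PRF instantiated with HMAC-SHA256, as suggested above; the key K is the only source of the apparent randomness, and without it the outputs are indistinguishable from those of a truly random function.

// F(K, x) = HMAC-SHA256(K, x), a standard way to instantiate a PRF.
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

func prf(key, x []byte) []byte {
    mac := hmac.New(sha256.New, key)
    mac.Write(x)
    return mac.Sum(nil)
}

func main() {
    key := []byte("a secret PRF key")
    fmt.Println(hex.EncodeToString(prf(key, []byte("input-1"))))
    fmt.Println(hex.EncodeToString(prf(key, []byte("input-2"))))
}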

Oblivious PRFs

An oblivious PRF (OPRF) is actually a protocol between a server S and a client C. In the protocol, S holds a key K for some PRF F and C holds an input x. The security goal is that C receives the output y = F(K,x) without learning the key K and S does not learn the value x.


It may seem difficult to construct such a functionality without revealing the input x or the key K. However, there are numerous (and very efficient) constructions of OPRFs with applications to many different cryptographic problems such as private set intersection, password-protected secret-sharing and cryptographic password storage to name a few.

OPRFs from elliptic curves

A simple instantiation of an OPRF from elliptic curves was given by Jarecki et al. in JKK14; we use it as the foundation for our blind signing protocol.

  • Let G be a cyclic group of prime-order
  • Let H be a collision-resistant hash function hashing into G
  • Let k be a private key held by S
  • Let x be a private input held by C

The protocol now proceeds as:

  • C sends H(x) to S
  • S returns kH(x) to C

Clearly, this is an exceptionally simple protocol; security is established since:

  • The collision-resistant hash function prevents S from reversing H(x) to learn x
  • The hardness of the discrete log problem (DLP) prevents C from learning k from kH(x)
  • The output kH(x) is pseudorandom since G is a prime-order group and k is chosen at random.
Blind signing via an OPRF

Using the OPRF design above as the foundation, the research team wrote a variation that we can use for a blind signing protocol; we detail this construction below. In our ‘blind signing’ protocol we require that:

  • The client/user can have random values signed obliviously by the edge server
  • The client can ‘unblind’ these values and present them in the future for verification
  • The edge can commit to the secret key publicly and prove that it is used for signing all tokens globally

The blind signing protocol is split into two phases.

Firstly, there is a blind signing phase that is carried out between the user and the edge after the user has successfully solved a challenge. The result is that the user receives a number of signed tokens (default 30) that are unblinded and stored for future use. Intuitively, this mirrors the execution of the OPRF protocol above.

Secondly, there is a redemption phase where an unblinded token is used for bypassing a future iteration of the challenge.

Let G be a cyclic group of prime-order q. Let H_1,H_2 be a pair of collision-resistant hash functions; H_1 hashes into the group G as before, H_2 hashes into a binary string of length n.

In the following, we will use slightly different notation to make it consistent with existing literature. Let x be a private key held by the server S. Let t be the input held by the user/client C. Let ZZ_q be the ring of integers modulo q. We write all operations in their scalar multiplication form to be consistent with EC notation. Let MAC_K() be a message-authentication code algorithm keyed by a key K.

Signing phase
  • C samples a random ‘blind’ r ← ZZ_q
  • C computes T = H_1(t) and then blinds it by computing rT
  • C sends M = rT to S
  • S computes Z = xM and returns Z to C
  • C computes (1/r)*Z = xT = N and stores the pair (t,N) for some point in the future

We think of T = H_1(t) as a token; these objects form the backbone of the protocol that we use to bypass challenges.
Notice that the only difference between this protocol and the OPRF above is the blinding factor r that we use.
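
The following Go sketch walks through the signing-phase arithmetic on P-256. One important caveat: for brevity the token point T is produced here by scalar-multiplying the base point with a hash, which is only a stand-in so we have a group element to work with; it is not a proper hash-to-curve function H_1 and would not be acceptable in the real protocol.

// Toy sketch of blind/sign/unblind on P-256 using crypto/elliptic.
package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "math/big"
)

func main() {
    curve := elliptic.P256()
    q := curve.Params().N // group order

    // Server key x (kept secret by S).
    x, _ := rand.Int(rand.Reader, q)

    // Client token t, mapped to a point T (stand-in for T = H_1(t)).
    t := []byte("random token seed")
    h := sha256.Sum256(t)
    tx, ty := curve.ScalarBaseMult(h[:])

    // Client blinds: M = rT.
    r, _ := rand.Int(rand.Reader, q)
    mx, my := curve.ScalarMult(tx, ty, r.Bytes())

    // Server signs the blinded point: Z = xM.
    zx, zy := curve.ScalarMult(mx, my, x.Bytes())

    // Client unblinds with 1/r mod q: N = (1/r)Z = xT.
    rInv := new(big.Int).ModInverse(r, q)
    nx, ny := curve.ScalarMult(zx, zy, rInv.Bytes())

    // Check: N equals xT, i.e. the server signed T without ever seeing it.
    ex, ey := curve.ScalarMult(tx, ty, x.Bytes())
    fmt.Println("unblinded signature matches xT:", nx.Cmp(ex) == 0 && ny.Cmp(ey) == 0)
}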


Redemption phase
  • C calculates request binding data req and chooses an unspent token (t,N)
  • C calculates a shared key sk = H_2(t,N) and sends (t, MAC_sk(req)) to S
  • S recalculates req' based on the request data that it witnesses
  • S checks that t has not been spent already and calculates T = H_1(t), N = xT, and sk = H_2(t,N)
  • Finally S checks that MAC_sk(req') =?= MAC_sk(req), and stores t to check against future redemptions

If all the steps above pass, then the server validates that the user has a validly signed token. When we refer to ‘passes’ we mean the pair (t, MAC_sk(req)); if verification is successful, the edge server grants the user access to the requested resource.
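
A small sketch of the redemption check is below, continuing the toy setup above. H_2 is instantiated here with SHA-256 and the MAC with HMAC-SHA256; the byte encodings and the placeholder point coordinates are illustrative only, not the exact serialization the extension and the edge agree on.

// Sketch: both sides derive sk = H_2(t, N) and compare MACs over the
// request binding data req.
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "fmt"
    "math/big"
)

// sharedKey derives sk = H_2(t, N) from the token and the point N = (nx, ny).
func sharedKey(t []byte, nx, ny *big.Int) []byte {
    h := sha256.New()
    h.Write(t)
    h.Write(nx.Bytes())
    h.Write(ny.Bytes())
    return h.Sum(nil)
}

// tag computes MAC_sk(req) with HMAC-SHA256.
func tag(sk, req []byte) []byte {
    m := hmac.New(sha256.New, sk)
    m.Write(req)
    return m.Sum(nil)
}

func main() {
    // The client holds (t, N) from the signing phase; the server recomputes
    // N = xT on its side, so both derive the same shared key.
    t := []byte("random token seed")
    nx, ny := big.NewInt(12345), big.NewInt(67890) // placeholder coordinates

    req := []byte("GET example.com/path") // request binding data

    clientTag := tag(sharedKey(t, nx, ny), req)
    serverTag := tag(sharedKey(t, nx, ny), req)
    fmt.Println("redemption MAC verifies:", hmac.Equal(clientTag, serverTag))
}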


Cryptographic security of protocol

There are many different ways in which we need to ensure that the protocol remains “secure”. Clearly one of the main features is that the user remains anonymous in the transaction. Furthermore, we need to show that the client is unable to leverage the protocol in order to learn the private key of the edge, or arbitrarily gain infinite tokens. We give two security arguments for our protocol that we can easily reduce to cryptographic assumptions on the hardness of widely-used problems. There are a number of other security goals for the protocol but we consider the two arguments below as fundamental security requirements.

Unlinkability in the presence of an adversarial edge

Similarly to the RSA blind signing protocol, the blind r is used to prevent the edge from learning the value of T, above. Since r is not used in the redemption phase of the protocol, there is no way that the server can link a blinded token rT in the signing phase to any token in a given redemption phase. Since S recalculates T during redemption, it may be tempting to think that S could recover r from rT. However, the hardness of the discrete log problem prevents S from launching this attack. Therefore, the server has no knowledge of r.

As mentioned and similarly to the JKK14 OPRF protocol above, we rely on the hardness of standard cryptographic assumptions such as the discrete log problem (DLP), and collision-resistant hash functions. Using these hardness assumptions it is possible to write a proof of security in the presence of a dishonest server. The proof of security shows that assuming that these assumptions are hard, then a dishonest server is unable to link an execution of the signing phase with any execution of the redemption phase with probability higher than just randomly guessing.

Intuitively, in the signing phase, C sends randomly distributed data due to the blinding mechanism and so S cannot learn anything from this data alone. In the redemption phase, C unveils their token, but the transcript of the signing phase witnessed by S is essentially random and so it cannot be used to learn anything from the redemption phase.

This is not a full proof of security but gives an idea as to how we can derive cryptographic hardness for the underlying protocol. We hope to publish a more detailed cryptographic proof in the near future to accompany our protocol design.

Key privacy for the edge

It is also crucial to prove that the exchange does not reveal the secret key x to the user. If this were to happen, then the user would be able to arbitrarily sign their own tokens, giving them an effectively infinite supply.

Notice that the only time when the client is exposed to the key is when they receive Z = xM. In elliptic-curve terminology, the client receives their blinded token scalar multiplied with x. Notice, that this is also identical to the interaction that an adversary witnesses in the discrete log problem. In fact, if the client was able to compute x from Z, then the client would also be able to solve the DLP — which is thought to be very hard for established key sizes. In this way, we have a sufficient guarantee that an adversarial client would not be able to learn the key from the signing interaction.

Preventing further deanonymization attacks using “Verifiable” OPRFs

While the proof of security above gives some assurances about the cryptographic design of the protocol, it does not cover the possibility of out-of-band deanonymization. For instance, the edge server can sign tokens with a new secret key each time. Ignoring the cost that this would incur, the server would be able to link token signing and redemption phases by simply checking the validation for each private key in use.

There is a solution known as a ‘discrete log equivalence proof’ (DLEQ proof). Using this, a server commits to a secret key x by publicly posting a pair (G, xG) for a generator G of the prime-order group G. A DLEQ proof intuitively allows the server to prove to the user that the signed tokens Z = xrT and commitment xG both have the same discrete log relation x. Since the commitment is posted publicly (similarly to a Certificate Transparency Log) this would be verifiable by all users and so the deanonymization attack above would not be possible.

DLEQ proofs

The DLEQ proof objects take the form of a Chaum-Pedersen CP93 non-interactive zero-knowledge (NIZK) proof. Similar proofs were used in JKK14 to show that their OPRF protocol produced “verifiable” randomness; they defined their construction as a VOPRF. In the following, we will describe how these proofs can be augmented into the signing phase above.

The DLEQ proof verification in the extension is still in development and is not completely consistent with the protocol below. We hope to complete the verification functionality in the near future.

Let M = rT be the blinded token that C sends to S, let (G,Y) = (G,xG) be the commitment from above, and let H_3 be a new hash function (modelled as a random oracle for security purposes). In the protocol below, we can think of S playing the role of the 'prover' and C the 'verifier' in a traditional NIZK proof system.

  • S computes Z = xM, as before.
  • S also samples a random nonce k ← ZZ_q and commits to the nonce by calculating A = kG and B = kM
  • S constructs a challenge c ← H_3(G,Y,M,Z,A,B) and computes s = k-cx (mod q)
  • S sends (c,s) to the user C
  • C recalculates A' = sG + cY and B' = s*M + c*Z and hashes c' = H_3(G,Y,M,Z,A’,B’).
  • C verifies that c' =?= c.

Note that correctness follows since

A' = sG + cY = (k-cx)G + cxG = kG and B' = sM + cZ = r(k-cx)T + crxT = krT = kM

We write DLEQ(Z/M == Y/G) to denote the proof that is created by S and validated by C.
In summary, if both parties have a consistent view of (G,Y) for the same epoch then the proof should verify correctly. As long as the discrete log problem remains hard to solve, then this proof remains zero-knowledge (in the random oracle model). For our use-case the proof verifies that the same key x is used for each invocation of the protocol, as long as (G,Y) does not change.
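
Here is a self-contained Go sketch of one DLEQ round trip on P-256: the prover builds (c,s) and the verifier recomputes A' and B' and re-derives the challenge. The hash-to-scalar encoding used for H_3 is an illustrative stand-in, not the extension's exact construction.

// Sketch of a DLEQ proof and verification on P-256.
package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "math/big"
)

var (
    curve = elliptic.P256()
    q     = elliptic.P256().Params().N
)

// h3 stands in for H_3: hash point coordinates into a scalar mod q.
func h3(points ...*big.Int) *big.Int {
    h := sha256.New()
    for _, p := range points {
        h.Write(p.Bytes())
    }
    c := new(big.Int).SetBytes(h.Sum(nil))
    return c.Mod(c, q)
}

func main() {
    gx, gy := curve.Params().Gx, curve.Params().Gy

    // Server key x, commitment Y = xG, blinded token M (random point here), Z = xM.
    x, _ := rand.Int(rand.Reader, q)
    yx, yy := curve.ScalarBaseMult(x.Bytes())
    m, _ := rand.Int(rand.Reader, q)
    mx, my := curve.ScalarBaseMult(m.Bytes())
    zx, zy := curve.ScalarMult(mx, my, x.Bytes())

    // Prover: nonce k, commitments A = kG and B = kM, challenge c, response s = k - cx mod q.
    k, _ := rand.Int(rand.Reader, q)
    ax, ay := curve.ScalarBaseMult(k.Bytes())
    bx, by := curve.ScalarMult(mx, my, k.Bytes())
    c := h3(gx, gy, yx, yy, mx, my, zx, zy, ax, ay, bx, by)
    s := new(big.Int).Mod(new(big.Int).Sub(k, new(big.Int).Mul(c, x)), q)

    // Verifier: A' = sG + cY, B' = sM + cZ, then recompute the challenge.
    a1x, a1y := curve.ScalarBaseMult(s.Bytes())
    a2x, a2y := curve.ScalarMult(yx, yy, c.Bytes())
    apx, apy := curve.Add(a1x, a1y, a2x, a2y)
    b1x, b1y := curve.ScalarMult(mx, my, s.Bytes())
    b2x, b2y := curve.ScalarMult(zx, zy, c.Bytes())
    bpx, bpy := curve.Add(b1x, b1y, b2x, b2y)

    cPrime := h3(gx, gy, yx, yy, mx, my, zx, zy, apx, apy, bpx, bpy)
    fmt.Println("DLEQ proof verifies:", cPrime.Cmp(c) == 0)
}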

Batching the proofs

Unfortunately, a drawback of the proof above is that it has to be instantiated for each individual token sent in the protocol. Since we send 30 tokens by default, this would require the server to also send 30 DLEQ proofs (with two EC elements each) and the client to verify each proof individually.

Interestingly, Henry showed that it was possible to batch the above NIZK proofs into one object with only one verification required Hen14. Using this batching technique substantially reduces the communication and computation cost of including the proof.

Let n be the number of tokens to be signed in the interaction, so we have M_i = r_i*T_i for the set of blinded tokens corresponding to inputs t_i.

  • S generates corresponding Z_i = x*M_i
  • S also computes a seed z = H_3(G,Y,M_1,...,M_n,Z_1,...,Z_n)
  • S then initializes a pseudorandom number generator PRNG with the seed z and outputs c_1, ... , c_n ← PRNG(z) where the output domain of PRNG is ZZ_q
  • S generates composite group elements:
M = (c_1*M_1) + ... + (c_n*M_n), Z = (c_1*Z_1) + ... + (c_n*Z_n)
  • S calculates (c,s) ← DLEQ(M:Z == G:Y) and sends (c,s) to C, where DLEQ(Z/M == Y/G) refers to the proof protocol used in the non-batching case.
  • C computes c’_1, … , c’_n ← PRNG(z) and re-computes M’, Z’ and checks that c’ =?= c

To see why this works, consider the reduced case where n = 2:

Z_1 = x(M_1), Z_2 = x(M_2)
(c_1*Z_1) = c_1(x*M_1) = x(c_1*M_1)
(c_2*Z_2) = c_2(x*M_2) = x(c_2*M_2)
(c_1*Z_1) + (c_2*Z_2) = x[(c_1*M_1) + (c_2*M_2)]

Therefore, all of the elliptic curve points share the same discrete log relation, which is equal to the secret key committed to by the edge.
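
A short sketch of the batching step is below. The weights c_i are expanded from a seed (here with a simple SHA-256 counter construction standing in for the seeded PRNG described above, and with a placeholder seed value), the composite points M and Z are folded together, and the discrete log relation Z = xM survives the folding, so a single DLEQ proof over the composites covers all n tokens.

// Sketch: fold n (M_i, Z_i) pairs into composites M and Z using
// seed-derived weights, preserving the relation Z = x*M.
package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "encoding/binary"
    "fmt"
    "math/big"
)

var (
    curve = elliptic.P256()
    q     = elliptic.P256().Params().N
)

// weight derives c_i from the seed z and the index i (illustrative only).
func weight(z []byte, i int) *big.Int {
    var ctr [4]byte
    binary.BigEndian.PutUint32(ctr[:], uint32(i))
    h := sha256.Sum256(append(z, ctr[:]...))
    c := new(big.Int).SetBytes(h[:])
    return c.Mod(c, q)
}

func main() {
    const n = 3
    x, _ := rand.Int(rand.Reader, q) // server key

    // n blinded tokens M_i (random points for illustration) and signatures Z_i = x*M_i.
    seed := []byte("placeholder for z = H_3(G, Y, M_1..M_n, Z_1..Z_n)")
    var mX, mY, zX, zY [n]*big.Int
    for i := 0; i < n; i++ {
        r, _ := rand.Int(rand.Reader, q)
        mX[i], mY[i] = curve.ScalarBaseMult(r.Bytes())
        zX[i], zY[i] = curve.ScalarMult(mX[i], mY[i], x.Bytes())
    }

    // Composites M = sum(c_i*M_i) and Z = sum(c_i*Z_i).
    var MX, MY, ZX, ZY *big.Int
    for i := 0; i < n; i++ {
        c := weight(seed, i)
        px, py := curve.ScalarMult(mX[i], mY[i], c.Bytes())
        sx, sy := curve.ScalarMult(zX[i], zY[i], c.Bytes())
        if i == 0 {
            MX, MY, ZX, ZY = px, py, sx, sy
        } else {
            MX, MY = curve.Add(MX, MY, px, py)
            ZX, ZY = curve.Add(ZX, ZY, sx, sy)
        }
    }

    // The composites keep the discrete log relation: Z should equal x*M.
    ex, ey := curve.ScalarMult(MX, MY, x.Bytes())
    fmt.Println("Z == x*M for the composites:", ZX.Cmp(ex) == 0 && ZY.Cmp(ey) == 0)
}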

Benefits of V-OPRF vs blind RSA

While the blind RSA specification that we released fulfilled our needs, we make the following concrete gains

  • Simpler, faster primitives
  • 10x savings in pass size (~256 bits using P-256 instead of ~2048)
  • The only thing the edge has to manage is a private scalar. No certificates.
  • No need for public-key encryption at all, since the derived shared key used to calculate each MAC is never transmitted and cannot be found from passive observation without knowledge of the edge key or the user's blinding factor.
  • Exponentiations are more efficient due to use of elliptic curves.
  • Easier key rotation. Instead of managing certificates pinned in TBB and submitted to CT, we can use the DLEQ proofs to allow users to positively verify they're in the same anonymity set with regard to the edge secret key as everyone else.
Download

Privacy Pass v1.0 is available as a browser extension for Chrome and Firefox. If you find any issues while using it, let us know.

Source code

The code for the browser extension and server has been open-sourced and can be found at https://github.com/privacypass/challenge-bypass-extension and https://github.com/privacypass/challenge-bypass-server respectively. We are welcoming contributions if you happen to notice any improvements that can be made to either component. If you would like to get in contact with the Privacy Pass team then find us at our website.

Protocol details

More information about the protocol can be found here.

Acknowledgements

The creation of Privacy Pass has been a joint effort by the team made up of George Tankersley, Ian Goldberg, Nick Sullivan, Filippo Valsorda and myself.

I'd also like to thank Eric Tsai for creating the logo and extension design, Dan Boneh for helping us develop key parts of the protocol, as well as Peter Wu and Blake Loring for their helpful code reviews. We would also like to acknowledge Sharon Goldberg, Christopher Wood, Peter Eckersley, Brian Warner, Zaki Manian, Tony Arcieri, Prateek Mittal, Zhuotao Liu, Isis Lovecruft, Henry de Valence, Mike Perry, Trevor Perrin, Zi Lin, Justin Paine, Marek Majkowski, Eoin Brady, Aaran McGuire, and many others who were involved in one way or another and whose efforts are appreciated.

References

Cha82: Chaum. Blind signatures for untraceable payments. CRYPTO’82
CP93: Chaum, Pedersen. Wallet Databases with Observers. CRYPTO'92.
Hen14: Ryan Henry. Efficient Zero-Knowledge Proofs and Applications, August 2014.
JKK14: Jarecki, Kiayias, Krawczyk. Round-Optimal Password-Protected Secret Sharing and T-PAKE in the Password-Only model.
JKKX16: Jarecki, Kiayias, Krawczyk, Xu. Highly-Efficient and Composable Password-Protected Secret Sharing.

Categories: Technology

Cloudflare supports Privacy Pass

Thu, 09/11/2017 - 16:00
Enabling anonymous access to the web with privacy-preserving cryptography


Cloudflare supports Privacy Pass, a recently-announced privacy-preserving protocol developed in collaboration with researchers from Royal Holloway and the University of Waterloo. Privacy Pass leverages an idea from cryptography — zero-knowledge proofs — to let users prove their identity across multiple sites anonymously without enabling tracking. Users can now use the Privacy Pass browser extension to reduce the number of challenge pages presented by Cloudflare. We are happy to support this protocol and believe that it will help improve the browsing experience for some of the Internet’s least privileged users.

The Privacy Pass extension is available for both Chrome and Firefox. When people use anonymity services or shared IPs, it makes it more difficult for website protection services like Cloudflare to identify their requests as coming from legitimate users and not bots. Privacy Pass helps reduce the friction for these users—which include some of the most vulnerable users online—by providing them a way to prove that they are a human across multiple sites on the Cloudflare network. This is done without revealing their identity, and without exposing Cloudflare customers to additional threats from malicious bots. As the first service to support Privacy Pass, we hope to help demonstrate its usefulness and encourage more Internet services to adopt it.

Adding support for Privacy Pass is part of a broader initiative to help make the Internet accessible to as many people as possible. Because Privacy Pass will only be used by a small subset of users, we are also working on other improvements to our network in service of this goal. For example, we are making improvements in our request categorization logic to better identify bots and to improve the web experience for legitimate users who are negatively affected by Cloudflare’s current bot protection algorithms. As this system improves, users should see fewer challenges and site operators should see fewer requests from unwanted bots. We consider Privacy Pass a piece of this puzzle.

Privacy Pass is fully open source under a BSD license and the code is available on GitHub. We encourage anyone who is interested to download the source code, play around with the implementations and contribute to the project. The Pass Team have also open sourced a reference implementation of the server in Go if you want to test both sides of the system. Privacy Pass support at Cloudflare is currently in beta. If you find a bug, please let the team know by creating an issue on GitHub.

In this blog post I'll be going into depth about the problems that motivated our support for this project and how you can use it to reduce the annoyance factor of CAPTCHAs and other user challenges online.

Enabling universal access to content

Cloudflare believes that the web is for everyone. This includes people who are accessing the web anonymously or through shared infrastructure. Tools like VPNs are useful for protecting your identity online, and people using these tools should have the same access as everyone else. We believe the vast collection of information and services that make up the Internet should be available to every person.

In a blog post last year, our CEO, Matthew Prince, spoke about the tension between security, anonymity, and convenience on the Internet. He posited that in order to secure a website or service while still allowing anonymous visitors, you have to sacrifice a bit of convenience for these users. This tradeoff is something that every website or web service has to make.


The Internet is full of bad actors. The frequency and severity of online attacks is rising every year. This turbulent environment not only threatens websites and web services with attacks, it threatens their ability to stay online. As smaller and more diverse sites become targets of anonymous threats, a greater percentage of the Internet will choose to sacrifice user convenience in order to stay secure and universally accessible.

The average Internet user visits dozens of sites and services every day. Jumping through a hoop or two when trying to access a single website is not that big of a problem for people. Having to do that for every site you visit every day can be exhausting. This is the problem that Privacy Pass is perfectly designed to solve.

Privacy Pass doesn’t completely eliminate this inconvenience. Matthew’s trilemma still applies: anonymous users are still inconvenienced for sites that want security. What Privacy Pass does is to notably reduce that inconvenience for users with access to a browser. Instead of having to be inconvenienced thirty times to visit thirty different domains, you only have to be inconvenienced once to gain access to thirty domains on the Cloudflare network. Crucially, unlike unauthorized services like CloudHole, Privacy Pass is designed to respect user privacy and anonymity. This is done using privacy-preserving cryptography, which prevents Cloudflare or anyone else from tracking a user’s browsing across sites. Before we go into how this works, let’s take a step back and take a look at why this is necessary.

Am I a bot or not?

Image by D J Shin, Creative Commons Attribution-Share Alike 3.0 Unported

Without explicit information about the identity of a user, a web server has to rely on fuzzy signals to guess which request is from a bot and which is from a human. For example, bots often use automated scripts instead of web browsers to do their crawling. The way in which scripts make web requests often differs in subtle ways from how web browsers would make the same requests.

A simple way for a user to prove they are not a bot to a website is by logging in. By providing valid authentication credentials tied to a long-term identity, a user is exchanging their anonymity for convenience. Having valid authentication credentials is a strong signal that a request is not from a bot. Typically, if you authenticate yourself to a website (say by entering your username and password) the website sets what’s called a “cookie”. A cookie is just a piece of data with an expiration date that’s stored by the browser. As long as the cookie hasn’t expired, the browser includes it as part of the subsequent requests to the server that set it. Authentication cookies are what websites use to know whether you’re logged in or not. Cookies are only sent on the domain that set them. A cookie set by site1.com is not sent for requests to site2.com. This prevents identity leakage from one site to another.

A request with an authentication cookie is usually not from a bot, so bot detection is much easier for sites that require authentication. Authentication is by definition de-anonymizing, so putting this in terms of Matthew’s trilemma, these sites can have security and convenience because they provide no anonymous access. The web would be a very different place if every website required authentication to display content, so this signal can only be used for a small set of sites. The question for the rest of the Internet becomes: without authentication cookies, what else can be used as a signal that a user is a person and not a bot?

The Turing Test

One thing that can be used is a user challenge: a task that the server asks the user to do before showing content. User challenges can come in many forms, from a proof-of-work to a guided tour puzzle to the classic CAPTCHA. A CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a test to see if the user is a human or not. It often involves reading some scrambled letters or identifying certain slightly obscured objects — tasks that humans are generally better at than automated programs. The goal of a user challenge is not only to deter bots, but to gain confidence that a visitor is a person. Cloudflare uses a combination of different techniques as user challenges.


CAPTCHAs can be annoying and time-consuming to solve, so they are usually reserved for visitors with a high probability of being malicious.

The challenge system Cloudflare uses is cookie-based. If you solve a challenge correctly, Cloudflare will set a cookie called CF_CLEARANCE for the domain that presented the challenge. Clearance cookies are like authentication cookies, but instead of being tied to an identity, they are tied to the fact that you solved a challenge sometime in the past.

  1. Person sends Request
  2. Server responds with a challenge
  3. Person sends solution
  4. Server responds with set-cookie and bypass cookie
  5. Person sends new request with cookie
  6. Server responds with content from origin

Site visitors who are able to solve a challenge are much more likely to be people than bots; the harder the challenge, the more likely the visitor is a person. The presence of a valid CF_CLEARANCE cookie is a strong positive signal that a request is from a legitimate person.

How Privacy Pass protects your privacy: a voting analogy

You can use cryptography to prove that you have solved a challenge of a certain difficulty without revealing which challenge you solved. The technique that enables this is something called a Zero-knowledge proof. This may sound scary, so let’s use a real-world scenario, vote certification, to explain the idea.

In some voting systems the operators of the voting center certify every ballot before sending them to be counted. This is to prevent people from adding fraudulent ballots while the ballots are being transferred from where the vote takes place to where the vote is counted.

An obvious mechanism would be to have the certifier sign every ballot that a voter submits. However, this would mean that the certifier, having just seen the person that handed them a ballot, would know how each person voted. Instead, we can use a better mechanism that preserves voters’ privacy using an envelope and some carbon paper.

  1. The voter fills out their ballot
  2. The voter puts their ballot into an envelope along with a piece of carbon paper, and seals the envelope
  3. The sealed envelope is given to the certifier
  4. The certifier signs the outside of the envelope. The pressure of the signature transfers the signature from the carbon paper to the ballot itself, effectively signing the ballot.
  5. Later, when the ballot counter unseals the envelope, they see the certifier’s signature on the ballot.

With this system, a voting administrator can authenticate a ballot without knowing its content, and then the ballot can be verified by an independent assessor.

Privacy Pass is like vote certification for the Internet. In this analogy, Cloudflare’s challenge checking service is the vote certifier, Cloudflare’s bot detection service is the vote counter and the anonymous visitor is the voter. When a user encounters a challenge on site A, they put a ballot into a sealed envelope and send it to the server along with the challenge solution. The server then signs the envelope and returns it to the client. Since the server is effectively signing the ballot without knowing its contents, this is called a blind signature.

When the user sees a challenge on site B, the user takes the ballot out of the envelope and sends it to the server. The server then checks the signature on the ballot, which proves that the user has solved a challenge. Because the server has never seen the contents of the ballot, it doesn’t know which site the challenge was solved for, just that a challenge was solved.

It turns out that with the right cryptographic construction, you can approximate this scenario digitally. This is the idea behind Privacy Pass.

The Privacy Pass team implemented this using a privacy-preserving cryptographic construction called an Elliptic Curve Verifiable Oblivious Pseudo-Random Function (EC-VOPRF). Yes, it’s a mouthful. From the Privacy Pass Team:

Every time the Privacy Pass plugin needs a new set of privacy passes, it creates a set of thirty random numbers t1 to t30, hashes them into a curve (P-256 in our case), blinds them with a value b and sends them along with a challenge solution. The server returns the set of points multiplied by its private key and a batch discrete logarithm equivalence proof. Each pair tn, HMAC(n,M) constitutes a Privacy Pass and can be redeemed to solve a subsequent challenge. Voila!



If none of these words make sense to you and you want to know more, check out the Privacy Pass team’s protocol design document.

Making it work in the browser

It takes more than a nice security protocol based on solid cryptography to make something useful in the real world. To bring the advantages of this protocol to users, the Privacy Pass team built a client in JavaScript and packaged it using WebExtensions, a cross-browser framework for developing applications that run in the browser and modify website behavior. This standard is compatible with both Chrome and Firefox. A reference implementation of the server side of the protocol was also implemented in Go.

If you’re a web user and are annoyed by CAPTCHAs, you can download the Privacy Pass extension for Chrome here and for Firefox here. It will significantly improve your web browsing experience. Once it is installed, you’ll see a small icon on your browser with a number under it. The number is how many unused privacy passes you have. If you are running low on passes, simply click on the icon and select “Get More Passes,” which will load a CAPTCHA you can solve in exchange for thirty passes. Every time you visit a domain that requires a user challenge page to view, Privacy Pass will “spend” a pass and the content will load transparently. Note that you may see more than one pass spent when you load a site for the first time if the site has subresources from multiple domains.


The Privacy Pass extension works by hooking into the browser and looking for HTTP responses that have a specific header that indicates support for the Privacy Pass protocol. When a challenge page is returned, the extension will either try to issue new privacy passes or redeem existing privacy passes. The cryptographic operations in the plugin were built on top of SJCL.

If you’re a Cloudflare customer and want to opt out from supporting Privacy Pass, please contact our support team and they will disable it for you. We are soon adding a toggle for Privacy Pass in the Firewall app in the Cloudflare dashboard.

The web is for everyone

The technology behind Privacy Pass is free for anyone to use. We see a bright future for this technology and think it will benefit from community involvement. The protocol is currently only deployed at Cloudflare, but it could easily be used across different organizations. It’s easy to imagine obtaining a Privacy Pass that proves that you have a Twitter or Facebook identity and using it to access other services on the Internet without revealing your identity, for example. There are a wide variety of applications of this technology that extend well beyond our current use cases.

If this technology is intriguing to you and you want to collaborate, please reach out to the Privacy Pass team on GitHub.

Categories: Technology

ARM Takes Wing: Qualcomm vs. Intel CPU comparison

Wed, 08/11/2017 - 20:03

One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.

Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.

Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turbo boost and hyper-threading enabled, for a total of 40 threads per server.

Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.

In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turbo boost. It also has smaller last level cache: 1.375MiB/core, compared to 2.5MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.

As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (codename Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the, ahm ... ThunderX2 core?

The majestic Amberwing, powered by the Falkor CPU. CC BY-SA 2.0 image by DrPhotoMoto

Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.

The actual Amberwing in question

Overview

I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.

Platform          Grantley (Intel)            Purley (Intel)              Centriq (Qualcomm)
Core              Broadwell                   Skylake                     Falkor
Process           14nm                        14nm                        10nm
Issue             8 µops/cycle                8 µops/cycle                8 instructions/cycle
Dispatch          4 µops/cycle                5 µops/cycle                4 instructions/cycle
# Cores           10 x 2S + HT (40 threads)   12 x 2S + HT (48 threads)   46
Frequency         2.2GHz (3.1GHz turbo)       2.1GHz (3.0GHz turbo)       2.5GHz
LLC               2.5 MB/core                 1.35 MB/core                1.25 MB/core
Memory Channels   4                           6                           6
TDP               170W (85W x 2S)             170W (85W x 2S)             120W
Other features    AES, CLMUL, AVX2            AES, CLMUL, AVX512          AES, CLMUL, NEON, Trustzone, CRC32

Overall, on paper, Falkor looks very competitive. In theory a Falkor core can process 8 instructions per cycle, same as Skylake or Broadwell, and it has a higher base frequency at a lower TDP rating.

Ecosystem readiness

Up until now, a major obstacle to the deployment of ARM servers was the lack of (or weak) support from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off: most Linux distros, as well as the most popular libraries, now support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.

At Cloudflare we run a complex software stack that consists of many integrated services, and running each of them efficiently is top priority.

On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, so solid C compiler support is very important.

In addition our flavor of NGINX is highly integrated with the lua-nginx-module, and we rely a lot on LuaJIT.

Finally a lot of our services, such as our DNS server, RRDNS, are written in Go.

The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.

Go has official support for ARMv8 as well, and they improve the arm64 backend constantly.

As for LuaJIT, the stable version, 2.0.5, does not support ARMv8, but the beta version, 2.1.0, does. Let’s hope it gets out of beta soon.

Benchmarks

OpenSSL

The first benchmark I wanted to perform was OpenSSL version 1.1.1 (development version), using the bundled openssl speed tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.

In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.

Public key cryptography

[Charts: OpenSSL public-key benchmark results, single core and all cores]

Public key cryptography is all about raw ALU performance. It is interesting, but not surprising, to see that in the single core benchmark the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.

Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly the ARMv8 instruction set does not have a single instruction to perform 64-bit multiplication, instead it uses a pair of MUL and UMULH instructions.

Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at an RSA2048 signature, and only because RSA2048 does not have an optimized implementation for ARM. The ECDSA performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.

It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark, and only having 20% more cores. This can be explained by more efficient all-core turbo, and improved hyper-threading.

Symmetric key cryptography

Symmetric key performance of the Intel cores is outstanding.

AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since they have improved their performance. ARM introduced a set of similar instructions just recently, with their 64-bit instruction set, and as an optional extension. Fortunately every hardware vendor I know of implemented those. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.

ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead shrinks, because it has to lower its clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency drops to just 1.4GHz, so keep that in mind if you are mixing AVX-512 and other code.

The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all of the other crypto algorithms combined.

Compression

The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as having better compression saves bandwidth, and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.

Obviously the first benchmark would be the popular zlib library. At Cloudflare we use an improved version of the library, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel-specific intrinsics. Comparing this optimized version to the generic zlib library wouldn't be fair. Not to worry: with a little effort I adapted the library to work very well on the ARMv8 architecture, using NEON and CRC32 intrinsics. The result is twice as fast as the generic library for some files.

The second benchmark is the emerging brotli library. It is written in C, which allows for a level playing field across all platforms.

All the benchmarks are performed on the HTML of blog.cloudflare.com, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.
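
As a rough illustration of that setup (the real benchmarks drove the C zlib and brotli libraries, not Go), the sketch below has every CPU compress its own in-memory copy of an HTML file in parallel; the file name is a placeholder.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"os"
	"runtime"
	"sync"
)

func main() {
	// Placeholder stand-in for the HTML page used in the benchmarks.
	html, err := os.ReadFile("blog.html")
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var buf bytes.Buffer
			w, _ := gzip.NewWriterLevel(&buf, 5) // one file per worker, quality 5
			w.Write(html)
			w.Close()
			fmt.Printf("%d -> %d bytes\n", len(html), buf.Len())
		}()
	}
	wg.Wait()
}
```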

gzip

When using gzip, at the single core level Skylake is the clear winner. Despite having a lower frequency than Broadwell, it seems that a lower penalty for branch misprediction helps it pull ahead. The Falkor core is not far behind, especially with lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.

brotli

With brotli on single core the situation is similar. Skylake is the fastest, but Falkor is not very much behind, and with quality setting 9, Falkor is actually faster. Brotli with quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).

When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where running on all cores yields less than 3x the single-core performance.

My understanding is that at those quality levels Brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.

Bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.

Golang

Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.
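
The adaptation looked roughly like the sketch below (an assumed example, not the exact code used): wrap the benchmark body in testing.B.RunParallel so it saturates every core instead of just one.

```go
package bench

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"testing"
)

// BenchmarkSignP256Parallel spreads ECDSA P-256 signing across GOMAXPROCS goroutines.
func BenchmarkSignP256Parallel(b *testing.B) {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	hashed := []byte("some 32-byte-ish message digest.")

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if _, _, err := ecdsa.Sign(rand.Reader, key, hashed); err != nil {
				panic(err)
			}
		}
	})
}
```

Running it with go test -bench . -cpu 1,46 then reports both the single-core and the all-core numbers.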

Go crypto

I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.

As far as Go crypto is concerned, ARM and Intel are not even on the same playing field. Go has very optimized assembly code for ECDSA, AES-GCM and ChaCha20-Poly1305 on Intel. It also has Intel-optimized math functions used in RSA computations. All of those are missing for ARMv8, putting it at a big disadvantage.

Nevertheless, the gap can be bridged with a relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function addMulVVW in assembly, led to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.

Another interesting thing to note is that on Skylake, the Go ChaCha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code. This is again due to AVX2 running at higher clock speeds.

Go gzip

Next in Go performance is gzip. Here again we have a reference point of pretty well optimized code to compare Go against. In the case of Go's gzip library, there are no Intel-specific optimizations in place.

Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, while still lagging behind Skylake. Since we already know that Falkor outperforms both when C is used, this can only mean that Go's backend for ARMv8 is still pretty immature compared to gcc's.

Go regexp

Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the builtin benchmarks on 32KB strings.
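
For reference, a multi-goroutine version of such a run might look like the sketch below; the pattern is in the spirit of the standard library's "easy" regexp benchmarks, and the 32KB input is synthetic.

```go
package bench

import (
	"bytes"
	"regexp"
	"testing"
)

func BenchmarkMatchEasy32K(b *testing.B) {
	re := regexp.MustCompile("ABCDEFGHIJKLMNOPQRSTUVWXYZ$")
	// 32KB of filler with the only possible match at the very end.
	data := append(bytes.Repeat([]byte("x"), 32*1024-26), "ABCDEFGHIJKLMNOPQRSTUVWXYZ"...)
	b.SetBytes(int64(len(data)))

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if !re.Match(data) {
				panic("expected a match")
			}
		}
	})
}
```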

Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.

Doing some profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but only a generic Go implementation for ARM. The easy regexp tests spend most of their time in this function, which explains the even wider gap.

Go strings

Another important library for a webserver is the Go strings library. I only tested the basic Replacer type here.
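
As an illustration of what that exercises (a hypothetical replacer of my own, not the library's built-in benchmark), strings.NewReplacer builds a single multi-pattern replacer that is then applied over and over:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Strip line breaks and collapse doubled spaces, the kind of rewriting a
	// web server might do on every response body.
	r := strings.NewReplacer("\r\n", "", "\n", "", "  ", " ")
	html := "<html>\r\n  <body>\r\n    hello world\r\n  </body>\r\n</html>\r\n"
	fmt.Println(r.Replace(html))
}
```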

In this test Falkor again lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has highly optimized assembly code for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code, and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.

Go conclusion

Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far has been concentrated on the compiler back end, and the library was left largely untouched. There is a lot of low-hanging optimization fruit out there, as my 20 minute fix for addMulVVW clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.

LuaJIT

Lua is the glue that holds Cloudflare together.

With the exception of the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks, and is almost tied in a third.

That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.

NGINX

For the NGINX workload, I decided to generate a load that would resemble an actual server.

I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.

It also uses LuaJIT to redirect the incoming request and remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.

Each server was configured to run with as many workers as it has virtual CPUs: 40 for Broadwell, 48 for Skylake and 46 for Falkor.

As the client for this test, I used the hey program, running from 3 Broadwell servers.

Concurrently with the test, we took power readings from the respective BMC units of each server.

With the NGINX workload, Falkor handled almost the same number of requests as the Skylake server, and both significantly outperform Broadwell. The power readings taken from the BMC show that it did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt vs Skylake's 99 requests/watt and Broadwell's 77.

I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given both are manufactured with the same process, and Skylake has more cores.

The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to become a dominant player in the mobile phone CPU market.

Conclusion

The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.

The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.

Obviously the Skylake server we tested is not the flagship Platinum unit with 28 cores, but those 28 cores come with both a big price and an over-200W TDP, whereas we are interested in improving our bang-for-buck metric and performance per watt.

Currently my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.

Both C and LuaJIT performance are very competitive, and in many cases Falkor outperforms the Skylake contender. In almost every benchmark Falkor shows itself to be a worthy upgrade from Broadwell.

The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.

If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come join us.

Categories: Technology

LavaRand in Production: The Nitty-Gritty Technical Details

Mon, 06/11/2017 - 06:07
Introduction

Lava lamps in the Cloudflare lobby

Courtesy of @mahtin

As some of you may know, there's a wall of lava lamps in the lobby of our San Francisco office that we use for cryptography. In this post, we’re going to explore how that works in technical detail. This post assumes a technical background. For a higher-level discussion that requires no technical background, see Randomness 101: LavaRand in Production.

Background

As we’ve discussed in the past, cryptography relies on the ability to generate random numbers that are both unpredictable and kept secret from any adversary. In this post, we’re going to go into fairly deep technical detail, so there is some background that we’ll need to ensure that everybody is on the same page.

True Randomness vs Pseudorandomness

In cryptography, the term random means unpredictable. That is, a process for generating random bits is secure if an attacker is unable to predict the next bit with greater than 50% accuracy (in other words, no better than random chance).

We can obtain randomness that is unpredictable using one of two approaches. The first produces true randomness, while the second produces pseudorandomness.

True randomness is any information learned through the measurement of a physical process. Its unpredictability relies either on the inherent unpredictability of the physical process being measured (e.g., the unpredictability of radioactive decay), or on the inaccuracy inherent in taking precise physical measurements (e.g., the inaccuracy of the least significant digits of some physical measurement such as the measurement of a CPU’s temperature or the timing of keystrokes on a keyboard). Random values obtained in this manner are unpredictable even to the person measuring them (the person performing the measurement can’t predict what the value will be before they have performed the measurement), and thus are just as unpredictable to an external attacker. All randomness used in cryptographic algorithms begins life as true randomness obtained through physical measurements.

However, obtaining true random values is usually expensive and slow, so using them directly in cryptographic algorithms is impractical. Instead, we use pseudorandomness. Pseudorandomness is generated through the use of a deterministic algorithm that takes as input some other random value called a seed and produces a larger amount of random output (these algorithms are called cryptographically secure pseudorandom number generators, or CSPRNGs). A CSPRNG has two key properties: First, if an attacker is unable to predict the value of the seed, then that attacker will be similarly unable to predict the output of the CSPRNG (and even if the attacker is shown the output up to a certain point - say the first 10 bits - the rest of the output - bits 11, 12, etc - will still be completely unpredictable). Second, since the algorithm is deterministic, running the algorithm twice with the same seed as input will produce identical output.
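
As a toy illustration of those two properties (deterministic given the seed, and able to stretch a short seed into arbitrarily long output), here is HMAC-SHA256 run in counter mode. It shows the idea only; real systems use vetted constructions rather than this sketch.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// expand stretches a short seed into n pseudorandom bytes by hashing a counter
// under the seed.
func expand(seed []byte, n int) []byte {
	out := make([]byte, 0, n)
	for ctr := uint64(0); len(out) < n; ctr++ {
		var block [8]byte
		binary.BigEndian.PutUint64(block[:], ctr)
		mac := hmac.New(sha256.New, seed)
		mac.Write(block[:])
		out = append(out, mac.Sum(nil)...)
	}
	return out[:n]
}

func main() {
	seed := []byte("a few hundred truly random bits")
	a := expand(seed, 1024)
	b := expand(seed, 1024)
	fmt.Println(bytes.Equal(a, b)) // true: same seed, same "random" stream
}
```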

The CSPRNGs used in modern cryptography are both very fast and also capable of securely producing an effectively infinite amount of output1 given a relatively small seed (on the order of a few hundred bits). Thus, in order to efficiently generate a lot of secure randomness, true randomness is obtained from some physical process (this is slow), and fed into a CSPRNG which in turn produces as much randomness as is required by the application (this is fast). In this way, randomness can be obtained which is both secure (since it comes from a truly random source that cannot be predicted by an attacker) and cheap (since a CSPRNG is used to turn the truly random seed into a much larger stream of pseudorandom output).

Running Out of Randomness

A common misconception is that a CSPRNG, if used for long enough, can “run out” of randomness. This is an understandable belief since, as we’ll discuss in the next section, operating systems often re-seed their CSPRNGs with new randomness to hedge against attackers discovering internal state, broken CSPRNGs, and other maladies.

But if an algorithm is a true CSPRNG in the technical sense, then the only way for it to run out of randomness is for somebody to consume far more values from it than could ever be consumed in practice (think consuming values from a CSPRNG as fast as possible for thousands of years or more).2

However, none of the fast CSPRNGs that we use in practice are proven to be true CSPRNGs. They're just strongly believed to be true CSPRNGs, or something close to it. They've withstood the test of academic analysis, years of being used in production, attacks by resourced adversaries, and so on. But that doesn't mean that they are without flaws. For example, SHA-1, long considered to be a cryptographically-secure collision-resistant hash function (a building block that can be used to construct a CSPRNG), was eventually discovered to be insecure. Today, it can be broken for $110,000 worth of cloud computing resources.3

Thus, even though we aren’t concerned with running out of randomness in a true CSPRNG, we also aren’t sure that what we’re using in practice are true CSPRNGs. As a result, to hedge against the possibility that an attacker has figured out how to break our CSPRNGs, designers of cryptographic systems often choose to re-seed CSPRNGs with fresh, newly-acquired true randomness just in case.

Randomness in the Operating System

In most computer systems, one of the responsibilities of the operating system is to provide cryptographically-secure pseudorandomness for use in various security applications. Since the operating system cannot know ahead of time which applications will require pseudorandomness (or how much they will require), most systems simply keep an entropy pool4 - a collection of randomness that is believed to be secure - that is used to seed a CSPRNG (e.g., /dev/urandom on Linux) which serves requests for randomness. The system then takes on the responsibility of not only seeding this entropy pool when the system first boots, but also of periodically updating the pool (and re-seeding the CSPRNG) with new randomness from whatever sources of true randomness are available to the system in order to hedge against broken CSPRNGs or attackers having compromised the entropy pool through other non-cryptographic attacks.

For brevity, and since Cloudflare's production systems run Linux, we will refer to the system's pseudorandomness provider simply as /dev/urandom, although note that everything in this discussion is true of other operating systems as well.

Given this setup of an entropy pool and CSPRNG, there are a few situations that could compromise the security of /dev/urandom:

  • The sources of true randomness used to seed the entropy pool could be too predictable, allowing an attacker to guess the values obtained from these sources, and thus to predict the output of /dev/urandom.
  • An attacker could have access to the sources of true randomness, thus being able to observe their values and thus predict the output of /dev/urandom.
  • An attacker could have the ability to modify the sources of true randomness, thus being able to influence the values obtained from these sources and thus predict the output of /dev/urandom.
Randomness Mixing

A common approach to addressing these security issues is to mix multiple sources of randomness together in the system’s entropy pool, the idea being that so long as some of the sources remain uncompromised, the system remains secure. For example, if sources X, Y, and Z, when queried for random outputs, provide values x, y, and z, we might seed our entropy pool with H(x, y, z), where H is a cryptographically-secure collision-resistant hash function. Even if we assume that two of these sources - say, X and Y - are malicious, so long as the attackers in control of them are not able to observe Z’s output,5 then no matter what values of x and y they produce, H(x, y, z) will still be unpredictable to them.
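
Here is a minimal Go sketch of that mixing idea: hash the outputs of several entropy sources together, so the result stays unpredictable as long as at least one input does. The source names are placeholders, not Cloudflare's actual feeds, and SHA-256 stands in for the generic hash H.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// mix concatenates the source outputs and hashes them into a single seed.
func mix(sources ...[]byte) [32]byte {
	h := sha256.New()
	for _, s := range sources {
		h.Write(s)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	x := []byte("output of source X") // e.g. a hardware RNG
	y := []byte("output of source Y") // e.g. interrupt timings
	z := []byte("output of source Z") // e.g. a LavaRand beacon
	fmt.Printf("%x\n", mix(x, y, z))  // H(x, y, z)
}
```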

LavaRand

The view from the camera

While the probability is obviously very low that somebody will manage to predict or modify the output of the entropy sources on our production machines, it would be irresponsible of us to pretend that it is impossible. Similarly, while cryptographic attacks against state-of-the-art CSPRNGs are rare, they do occasionally happen. It’s important that we hedge against these possibilities by adding extra layers of defense.

That’s where LavaRand comes in.

In short, LavaRand is a system that provides an additional entropy source to our production machines. In the lobby of our San Francisco office, we have a wall of lava lamps (pictured above). A video feed of this wall is used to generate entropy that is made available to our production fleet.

The flow of the “lava” in a lava lamp is very unpredictable,6 and so the entropy in those lamps is incredibly high. Even if we conservatively assume that the camera has a resolution of 100x100 pixels (of course it’s actually much higher) and that an attacker can guess the value of any pixel of that image to within one bit of precision (e.g., they know that a particular pixel has a red value of either 123 or 124, but they aren’t sure which it is), then the total amount of entropy produced by the image is 100x100x3 = 30,000 bits (the x3 is because each pixel comprises three values - a red, a green, and a blue channel). This is orders of magnitude more entropy than we need.

Design

The flow of entropy in LavaRand

The overall design of the LavaRand system is pictured above. The flow of entropy can be broken down into the following steps:

  1. The wall of lava lamps in the office lobby provides a source of true entropy.
  2. In the lobby, a camera is pointed at the wall. It obtains entropy from both the visual input from the lava lamps and also from random noise in the individual photoreceptors.
  3. In the office, there's a server which connects to the camera. The server has its own entropy system, and the output of that entropy system is mixed with the entropy from the camera to produce a new entropy feed.
  4. In one of our production data centers, there's a service which connects to the server in the office and consumes its entropy feed. That service combines this entropy feed with output from its own local entropy system to produce yet another entropy feed. This feed is made available for any production service to consume.

Security of the LavaRand Service

We might conceive of a number of attacks that could be leveraged against this system:

  • An attacker could train a camera on the wall of lava lamps, attempting to reproduce the image captured by our camera.
  • An attacker could reduce the entropy from the wall of lava lamps by turning off the power to the lamps, shining a bright light at the camera, placing a lens cap on the camera, or any number of other physical attacks.
  • An attacker able to compromise the camera could exfiltrate or modify the feed of frames from the camera, replicating or controlling the entropy source used by the server in the office.
  • An attacker with code running on the office server could observe or modify the output of the entropy feed generated by that server.
  • An attacker with code running in the production service could observe or modify the output of the entropy feed generated by that service.

Only one of these attacks would be fatal if successfully carried out: running code on the production service which produces the final entropy feed. In every other case, the malicious entropy feed controlled by the attacker is mixed with a non-malicious feed that the attacker can neither observe nor modify.7 As we discussed in a previous section, as long as the attacker is unable to predict the output of these non-malicious feeds, they will be unable to predict the output of the entropy feed generated by mixing their malicious feed with the non-malicious feed.

Using LavaRand

Having a secure entropy source is only half of the story - the other half is actually using it!

The goal of LavaRand is to ensure that our production machines have access to secure randomness even if their local entropy sources are compromised. Just after boot, each of our production machines contacts LavaRand over TLS to obtain a fixed-size chunk of fresh entropy called a “beacon.” It mixes this beacon into the entropy system (on Linux, by writing the beacon to /dev/random). After this point, in order to predict or control the output of /dev/urandom, an attacker would need to compromise both the machine’s local entropy sources and the LavaRand beacon.
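
A hedged sketch of that client side is shown below: fetch a beacon over TLS and mix it into the kernel's entropy pool by writing it to /dev/random. The URL is hypothetical; the real service, its authentication, and its transport setup are not described here in enough detail to reproduce.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("https://lavarand.example.internal/beacon") // hypothetical endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	beacon, err := io.ReadAll(io.LimitReader(resp.Body, 1024)) // fixed-size chunk
	if err != nil {
		log.Fatal(err)
	}

	// Writing to /dev/random mixes the bytes into the entropy pool (it requires
	// appropriate permissions, and does not credit entropy without an ioctl).
	f, err := os.OpenFile("/dev/random", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.Write(beacon); err != nil {
		log.Fatal(err)
	}
}
```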

Bootstrapping TLS

Unfortunately, the reality isn’t quite that simple. We’ve gotten ourselves into something of a chicken-and-egg problem here: we’re trying to hedge against bad entropy from our local entropy sources, so we have to assume those might be compromised. But TLS, like many cryptographic protocols, requires secure entropy in order to operate. And we require TLS to request a LavaRand beacon. So in order to ensure secure entropy, we have to have secure entropy…

We solve this problem by introducing a second special-purpose CSPRNG, and seeding it in a very particular way. Every machine in Cloudflare’s production fleet has its own permanent store of secrets that it uses just after boot to prove its identity to the rest of the fleet in order to bootstrap the rest of the boot process. We piggyback on that system by storing an extra random seed - unique for each machine - that we use for that first TLS connection to LavaRand.

There’s a simple but very useful result from cryptography theory that says that an HMAC - a hash-based message authentication code - when combined with a random, unpredictable seed, behaves (from the perspective of an attacker) like a random oracle. That’s a lot of crypto jargon, but it basically means that if you have a secret, randomly-generated seed, s, then an attacker will be completely unable to guess the output of HMAC(s, x) regardless of the value of x - even if x is completely predictable! Thus, you can use HMAC(s, x) as the seed to a CSPRNG, and the output of the CSPRNG will be unpredictable. Note, though, that if you need to do this multiple times, you will have to pick different values for x! Remember that while CSPRNGs are secure if used with unpredictable seeds, they’re also deterministic. Thus, if the same value is used for x more than once, then the CSPRNG will end up producing the same stream of “random” values more than once, which in cryptography is often very insecure!

This means that we can combine those unique, secret seeds that we store on each machine with an HMAC and produce a secure random value. We use the current time with nanosecond precision as the input to ensure that the same value is never used twice on the same machine. We use the resulting value to seed a CSPRNG, and we use that CSPRNG for the TLS connection to LavaRand. That way, even if the system’s entropy sources are compromised, we’ll still be able to make a secure connection to LavaRand, obtain a new, secure beacon, and bootstrap the system’s entropy back to a secure state!
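
A sketch of that seed derivation, with hypothetical names, might look like this: mix the machine's stored secret with a nanosecond timestamp through HMAC, and use the result as a one-time CSPRNG seed for the first TLS connection.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"time"
)

// deriveSeed computes HMAC(secret, x) where x is the current time in nanoseconds,
// so the same secret never produces the same seed twice on one machine.
func deriveSeed(secret []byte) []byte {
	var x [8]byte
	binary.BigEndian.PutUint64(x[:], uint64(time.Now().UnixNano()))
	mac := hmac.New(sha256.New, secret)
	mac.Write(x[:])
	return mac.Sum(nil)
}

func main() {
	secret := []byte("per-machine secret from the permanent store") // placeholder
	seed := deriveSeed(secret)
	fmt.Printf("CSPRNG seed: %x\n", seed) // would seed the special-purpose CSPRNG
}
```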

Conclusion

Hopefully we’ll never need LavaRand. Hopefully, the primary entropy sources used by our production machines will remain secure, and LavaRand will serve little purpose beyond adding some flair to our office. But if it turns out that we’re wrong, and that our randomness sources in production are actually flawed, then hopefully LavaRand will be our hedge, making it just a little bit harder to hack Cloudflare.

  1. Some CSPRNGs exist with constraints on how much output can be consumed securely, but those are not the sort that we are concerned with in this post.
  2. Section 3.1, Recommendations for Randomness in the Operating System
  3. The first collision for full SHA-1
  4. “Entropy” and “randomness” are synonyms in cryptography - the former is the more technical term.
  5. If the attacker controls X and Y and can also observe the output of Z, then the attacker can still partially influence the output of H(x, y, z). See here for a discussion of possible attacks.
  6. Noll, L.C. and Mende, R.G. and Sisodiya, S., Method for seeding a pseudo-random number generator with a cryptographic hash of a digitization of a chaotic system
  7. A surprising example of the effectiveness of entropy is the mixing of the image captured by the camera with the random noise in the camera’s photoreceptors. If we assume that every pixel captured is either recorded as the “true” value or is instead recorded as one value higher than the true value (50% probability for each), then even if the input image can be reproduced by an attacker with perfect accuracy, the camera still provides one bit of entropy for each pixel channel. As discussed before, even for a 100x100 pixel camera, that’s 30,000 bits!
Categories: Technology

Randomness 101: LavaRand in Production

Mon, 06/11/2017 - 05:54
Introduction

Lava lamps in the Cloudflare lobby

Courtesy of @mahtin

As some of you may know, there's a wall of lava lamps in the lobby of our San Francisco office that we use for cryptography. In this post, we’re going to explore how that works. This post assumes no technical background. For a more in-depth look at the technical details, see LavaRand in Production: The Nitty-Gritty Technical Details.

Background

Randomness in Cryptography

As we’ve discussed in the past, cryptography relies on the ability to generate random numbers that are both unpredictable and kept secret from any adversary.

But “random” is a pretty tricky term; it’s used in many different fields to mean slightly different things. And like all of those fields, its use in cryptography is very precise. In some fields, a process is random simply if it has the right statistical properties. For example, the digits of pi are said to be random because all sequences of numbers appear with equal frequency (“15” appears as frequently as “38”, “426” appears as frequently as “297”, etc). But for cryptography, this isn’t enough - random numbers must be unpredictable.

To understand what unpredictable means, it helps to consider that all cryptography is based on an asymmetry of information. If you’re trying to securely perform some cryptographic operation, then what you’re concerned about is that somebody - an adversary - will try to break your security. The only thing that distinguishes you from the adversary is that you know some things that the adversary does not, and the job of cryptography is to ensure that this asymmetry of information is enough to keep you secure.

Let’s consider a simple example. Imagine that you and a friend are trying to go to a movie, but you don’t want your nemesis to figure out which movie you’re going to go to (lest she show up and thwart your movie watching!). This week, it’s your turn to choose the movie. Once you’ve made your choice, you’re going to need to send a message to your friend telling him which movie you’ve chosen, but you’re going to need to ensure that even if your nemesis intercepts the message, she won’t be able to figure out what it says.

You devise the following scheme: since there are only two movies available to watch at the moment, you label one A and the other B. Then, while in the presence of your friend, you flip a coin. You agree on the following table outlining which message that you will send depending on your choice of what movie to watch and whether the coin comes up heads (H) or tails (T). Later, once you’ve made up your mind about which movie to see, you’ll use this table to send an encrypted message to your friend telling him which movie you’ve chosen.

Movie | Coin | Message
A     | H    | “The rain in Spain stays mainly on the plain.”
A     | T    | “In Hertford, Hereford, and Hampshire, hurricanes hardly ever happen.”
B     | H    | “In Hertford, Hereford, and Hampshire, hurricanes hardly ever happen.”
B     | T    | “The rain in Spain stays mainly on the plain.”

If you were to decide on movie B, and the coin came up heads, you would send the message, “In Hertford, Hereford, and Hampshire, hurricanes hardly ever happen.” Since your friend knows that the coin came up heads - he was there when it happened - he knows that you must have decided on movie B. But consider it from your nemesis’ perspective. She doesn’t know the result of the coin toss - all she knows is that there’s a 50% chance that the coin came up heads, and a 50% chance that it came up tails. Thus, seeing the message “In Hertford, Hereford, and Hampshire, hurricanes hardly ever happen” doesn’t help her at all! There’s a 50% chance that the coin came up heads (implying movie B), and a 50% chance that it came up tails (implying movie A). She doesn’t know anything more than she knew before!

Let’s return now to the concept of unpredictability. Imagine that the result of the coin toss was completely predictable - say your nemesis planted a trick coin that always comes up heads on the first toss, tails on the second, heads on the third, and so on. Since she would know that there was a 100% chance of the first coin toss coming up heads, then regardless of which message you sent, she would know which movie you were going to see. While the trick coin still exhibits some basic properties of “randomness” as the term is used in the field of statistics - it comes up heads as often as it comes up tails - it’s predictable, which makes it useless for cryptography. The takeaway: when we say random in the context of cryptography, we mean unpredictable.

Randomness in Computing

Unfortunately for cryptographers, if there’s one thing computers are good at, it’s being predictable. They can execute the same code a million times, and so long as they are given the same inputs each time, they’ll always come up with the same outputs. This is very good for reliability, but it’s tricky when it comes to cryptography - after all, we need unpredictability!

The solution to this problem is cryptographically-secure pseudorandom number generators (CSPRNGs). CSPRNGs are algorithms which, provided an input which is itself unpredictable, produce a much larger stream of output which is also unpredictable. This stream can be extended indefinitely, producing as much output as required at any time in the future. In other words, if you were to flip a coin a number of times (a process which is known to be unpredictable) and then use the output of those coin flips as the input to a CSPRNG, an adversary who wasn’t able to predict the output of those coin flips would also be unable to predict the output of the CSPRNG - no matter how much output was consumed from the CSPRNG.

But even though CSPRNGs are a very powerful tool, they're only one half of the equation - they still need an unpredictable input to operate. But as we said, computers aren't unpredictable, so aren't we back at square one? Well, not quite. It turns out that computers do usually have sources of unpredictability that they can use, but they're pretty slow. What we can do is combine that slow process of gathering unpredictable input with a CSPRNG, which can take that input and quickly produce a much larger amount of output, and we can satisfy all of our randomness needs!

But where can a computer get such unpredictable input, even slowly? The answer is the real world. While computers provide a nice simplification of the world for programmers to live in, real, physical computers still exist in the real, physical world. And that world is unpredictable. Computers have all sorts of ways of taking input from the real world - temperature sensors, keyboards, network interfaces, etc. All of these provide the ability to take measurements of the real world, and all of those measurements have some degree of inherent inaccuracy. As we’ll explain in a moment, that inaccuracy is the same thing as unpredictability, and this can be used! Common sources of unpredictable randomness include measuring the temperature of the CPU with high - and thus inaccurate - precision, measuring the timing of keystrokes on the keyboard with high precision, etc. To see how this can be used to produce unpredictable randomness, consider the question, “what’s the temperature of the room I’m sitting in right now?” You can probably estimate to within a couple of degrees - say, somewhere between 70 and 75 degrees Fahrenheit. But you probably have no idea what the temperature is to 2 decimal places of accuracy - is it 73.42 degrees or 73.47 degrees? By measuring the temperature with high precision and then using only the low-order digits of the measured value, you can get highly unpredictable randomness just by observing the world around you. And so can computers.

So, to recap:

  • Randomness used in cryptography needs to be unpredictable.
  • Computers can slowly get a small amount of unpredictable randomness by measuring their environment.
  • Computers can greatly extend this randomness by using a CSPRNG which can quickly turn it into a large amount of unpredictable randomness.
Hedging Your Bets

If there’s one thing cryptographers are wary about, it’s certainty. Cryptographic systems are routinely shown to be less secure than they were originally thought to be, and we’re constantly updating our understanding of what algorithms are safe to use in what scenarios.

So it shouldn’t be any surprise that cryptographers like to hedge their bets by using more security than they believe necessary in case it turns out that one of their assumptions was wrong. It’s sort of like the cryptographer’s version of the engineering practice of designing buildings to withstand far more weight or wind or heat than they think will arise in practice.

When it comes to randomness, this hedging often takes the form of mixing. Unpredictable random values have the neat property that if they're mixed in the right way with more unpredictable random values, then the result is at least as unpredictable as either of the inputs. That means that if you mix a random value which is highly unpredictable with a random value which is somewhat predictable, the result will be a highly unpredictable value.

This mixing property is useful because it allows you to mix unpredictable random values from many sources, and if you later discover that one of those sources was less unpredictable than you’d originally thought, it’s still OK - the other sources come to the rescue.

LavaRand

At Cloudflare, we have thousands of computers in data centers all around the world, and each one of these computers needs cryptographic randomness. Historically, they got that randomness using the default mechanism made available by the operating system that we run on them, Linux.

But being good cryptographers, we’re always trying to hedge our bets. We wanted a system to ensure that even if the default mechanism for acquiring randomness was flawed, we’d still be secure. That’s how we came up with LavaRand.

The view from the camera

LavaRand is a system that uses lava lamps as a secondary source of randomness for our production servers. A wall of lava lamps in the lobby of our San Francisco office provides an unpredictable input to a camera aimed at the wall. A video feed from the camera is fed into a CSPRNG, and that CSPRNG provides a stream of random values that can be used as an extra source of randomness by our production servers. Since the flow of the “lava” in a lava lamp is very unpredictable,1 “measuring” the lamps by taking footage of them is a good way to obtain unpredictable randomness. Computers store images as very large numbers, so we can use them as the input to a CSPRNG just like any other number.

Hopefully we’ll never need it. Hopefully, the primary sources of randomness used by our production servers will remain secure, and LavaRand will serve little purpose beyond adding some flair to our office. But if it turns out that we’re wrong, and that our randomness sources in production are actually flawed, then LavaRand will be our hedge, making it just a little bit harder to hack Cloudflare.

  1. Noll, L.C. and Mende, R.G. and Sisodiya, S., Method for seeding a pseudo-random number generator with a cryptographic hash of a digitization of a chaotic system
Categories: Technology

Perfect locality and three epic SystemTap scripts

Sun, 05/11/2017 - 21:56

In a recent blog post we discussed epoll behavior causing uneven load among NGINX worker processes. We suggested a work around - the REUSEPORT socket option. It changes the queuing from "combined queue model" aka Waitrose (formally: M/M/s), to a dedicated accept queue per worker aka "the Tesco superstore model" (formally: M/M/1). With this setup the load is spread more evenly, but in certain conditions the latency distribution might suffer.

After reading that piece, a colleague of mine, John, said: "Hey Marek, don't forget that REUSEPORT has an additional advantage: it can improve packet locality! Packets can avoid being passed around CPUs!"

John had a point. Let's dig into this step by step.

In this blog post we'll explain the REUSEPORT socket option, how it can help with packet locality and its performance implications. We'll show three advanced SystemTap scripts which we used to help us understand and measure the packet locality.

A shared queue

The standard BSD socket API model is rather simple. In order to receive new TCP connections a program calls bind() and then listen() on a fresh socket. This will create a single accept queue. Programs can share the file descriptor - pointing to one kernel data structure - among multiple processes to spread the load. As we've seen in a previous blog post connections might not be distributed perfectly. Still, this allows programs to scale up processing power from a limited single-process, single-CPU design.

Modern network cards split the inbound packets across multiple RX queues, allowing multiple CPUs to share interrupt and packet processing load. Unfortunately in the standard BSD API the new connections will all be funneled back to a single accept queue, causing a potential bottleneck.

Introducing REUSEPORT

This bottleneck was identified at Google, where an application was reported to be dealing with 40,000 connections per second. Google kernel hackers fixed it by adding TCP support for the SO_REUSEPORT socket option in Linux kernel 3.9.

REUSEPORT allows the application to set multiple accept queues on a single TCP listen port. This removes the central bottleneck and enables the CPUs to do more work in parallel.

REUSEPORT locality

Initially there was no way to influence the load balancing algorithm. While REUSEPORT allowed setting up a dedicated accept queue per worker process, it wasn't possible to influence which packets would go into them. New connections flowing into the network stack would be distributed using only the usual 5-tuple hash. Packets from any of the RX queues, hitting any CPU, might flow into any of the accept queues.

This changed in Linux kernel 4.4 with the introduction of the SO_INCOMING_CPU settable socket option. Now a userspace program could add a hint to make the packets received on a specific CPU go to a specific accept queue. With this improvement the accept queue won't need to be shared across multiple cores, improving CPU cache locality and fixing lock contention issues.

There are other benefits - with proper tuning it is possible to keep the processing of packets belonging to entire connections local. Think about it like this: if a SYN packet was received on some CPU, it is likely that further packets for this connection will also be delivered to the same CPU1. Therefore, making sure the worker on that same CPU calls accept() has strong advantages. With the right tuning all processing of the connection might be performed on a single CPU. This can help keep the CPU cache warm, reduce cross-CPU interrupts and boost the performance of memory allocation algorithms.

The SO_INCOMING_CPU interface is pretty rudimentary and was deemed unsuitable for more complex usage. It was superseded by the more powerful SO_ATTACH_REUSEPORT_CBPF option (and its extended variant: SO_ATTACH_REUSEPORT_EBPF) in kernel 4.6. These flags allow a program to specify a fully functional BPF program as a load balancing algorithm.

Beware that the introduction of SO_ATTACH_REUSEPORT_[CE]BPF broke SO_INCOMING_CPU. Nowadays there isn't a choice - you have to use the BPF variants to get the intended behavior.

Setting CBPF on NGINX

NGINX in "reuseport" mode doesn't set the advanced socket options that increase packet locality. John suggested that improving packet locality is beneficial for performance. We must verify such a bold claim!

We wanted to play with setting a couple of SO_ATTACH_REUSEPORT_CBPF BPF scripts. We didn't want to hack the NGINX sources though. After some tinkering we decided it would be easier to write a SystemTap script to set the option from outside the server process. This turned out to be a big mistake!

After plenty of work, and numerous kernel panics caused by our buggy scripts (running in "guru" mode), we finally managed to get it into working order. The SystemTap script calls setsockopt() with the right parameters. It's one of the most complex scripts we've written so far. Here it is:

We tested it on kernel 4.9. It sets the following CBPF (classical BPF) load balancing program on the REUSEPORT socket group. Sockets received on Nth CPU will be passed to Nth member of the REUSEPORT group:

A = #cpu
A = A % <reuseport group size>
return A
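
For reference, roughly the same program can be attached from inside a process that owns the listening socket, without SystemTap. The Go sketch below uses golang.org/x/sys/unix and is an assumed illustration of the idea, not what NGINX or the post's script does; the hardcoded 0xfffff000 + 36 is the kernel's SKF_AD_OFF + SKF_AD_CPU ancillary offset.

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// attachCPUBalancer installs "A = cpu; A = A % groupSize; return A" on a
// SO_REUSEPORT listening socket, so connections landing on CPU N are handed to
// the Nth socket in the reuseport group.
func attachCPUBalancer(fd, groupSize int) error {
	prog := []unix.SockFilter{
		// A = raw CPU number (SKF_AD_OFF + SKF_AD_CPU ancillary load)
		{Code: unix.BPF_LD | unix.BPF_W | unix.BPF_ABS, K: 0xfffff000 + 36},
		// A = A % groupSize
		{Code: unix.BPF_ALU | unix.BPF_MOD | unix.BPF_K, K: uint32(groupSize)},
		// return A (the index of the socket to deliver to)
		{Code: unix.BPF_RET | unix.BPF_A},
	}
	fprog := unix.SockFprog{Len: uint16(len(prog)), Filter: &prog[0]}
	return unix.SetsockoptSockFprog(fd, unix.SOL_SOCKET, unix.SO_ATTACH_REUSEPORT_CBPF, &fprog)
}

func main() {
	// In a real server fd would be the listener's descriptor; 3 and 12 mirror the
	// values used below (file descriptor 3, a reuseport group of 12 workers).
	if err := attachCPUBalancer(3, 12); err != nil {
		log.Fatal(err)
	}
}
```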

The SystemTap script takes three parameters: pid, file descriptor and REUSEPORT group size. To figure out the pid of a process and a file descriptor number use the "ss" tool:

$ ss -4nlp -t 'sport = :8181' | sort
LISTEN  0  511  *:8181  *:*  users:(("nginx",pid=29333,fd=3),...
LISTEN  0  511  *:8181  *:*  ...
...

In this listing we see that pid=29333 fd=3 points to REUSEPORT descriptor bound to port tcp/8181. On our test machine we have 24 logical CPUs (including HT) and we run 12 NGINX workers - the group size is 12. Example invocation of the script:

$ sudo stap -g setcbpf.stp 29333 3 12

Measuring performance

Unfortunately on Linux it's pretty hard to verify whether setting CBPF actually does anything. To understand what's going on we wrote another SystemTap script. It hooks into a process and prints all successful invocations of the accept() function, including the CPU on which the connection was delivered to the kernel and the current CPU, on which the application is running. The idea is simple - if they match, we'll have good locality!

The script:

Before setting the CBPF socket option on the server, we saw this output:

$ sudo stap -g accept.stp nginx|grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -> fd=30 rxcpu=#19
cpu=#12 pid=29333 accept(3) -> fd=31 rxcpu=#21
cpu=#12 pid=29333 accept(3) -> fd=32 rxcpu=#16
cpu=#12 pid=29333 accept(3) -> fd=33 rxcpu=#22
cpu=#12 pid=29333 accept(3) -> fd=34 rxcpu=#19
cpu=#12 pid=29333 accept(3) -> fd=35 rxcpu=#21
cpu=#12 pid=29333 accept(3) -> fd=37 rxcpu=#16

We can see accept()s done from a worker on CPU #12 returning client sockets received on some other CPUs like: #19, #21, #16 and so on.

Now, let's run CBPF and see the results:

$ sudo stap -g setcbpf.stp `pidof nginx -s` 3 12
[+] Pid=29333 fd=3 group_size=12 setsockopt(SO_ATTACH_REUSEPORT_CBPF)=0

$ sudo stap -g accept.stp nginx|grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -> fd=30 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=31 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=32 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=33 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=34 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=35 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=36 rxcpu=#12

Now the situation is perfect. All accept()s called from the NGINX worker pinned to CPU #12 got client sockets received on the same CPU.

But does it actually help with the performance?

Sadly: no. We've run a number of tests (using the setup introduced in a previous blog post) but we weren't able to record any significant performance difference. Compared to other costs incurred by running a high level HTTP server, a couple of microseconds shaved by keeping connections local to a CPU doesn't seem to make a measurable difference.

Measuring packet locality

But no, we didn't give up!

Not being able to measure an end-to-end performance gain, we decided to try another approach. Why not try to measure packet locality itself!

Measuring locality is tricky. In certain circumstances a packet can cross multiple CPUs on its way down the networking stack. Fortunately we can simplify the problem. Let's define "packet locality" as the probability of a packet (to be specific: the Linux sk_buff data structure, skb) being allocated and freed on the same CPU.

For this, we wrote yet another SystemTap script:

When run without the CBPF option, the script gave us these results:

$ sudo stap -g locality.stp 12
rx= 21%  29kpps tx=  9%  24kpps
rx=  8% 130kpps tx=  8% 131kpps
rx= 11% 132kpps tx=  9% 126kpps
rx= 10% 128kpps tx=  8% 127kpps
rx= 10% 129kpps tx=  8% 126kpps
rx= 11% 132kpps tx=  9% 127kpps
rx= 11% 129kpps tx= 10% 128kpps
rx= 10% 130kpps tx=  9% 127kpps
rx= 12%  94kpps tx=  8%  90kpps

During our test the HTTP server received about 130,000 packets per second and transmitted about as many. 10-11% of the received and 8-10% of the transmitted packets had good locality - that is, they were allocated and freed on the same CPU.

Achieving good locality is not that easy. On the RX side, this means the packet must be received on the same CPU as the application that will read() it. On the transmit side it's even trickier. In the case of TCP, a piece of data must be sent() by the application, get transmitted, and receive back an ACK from the other party, all on the same CPU.

We performed a bit of tuning, which included inspecting:

  • number of RSS queues and their interrupts being pinned to right CPUs
  • the indirection table
  • correct XPS settings on the TX path
  • NGINX workers being pinned to right CPUs
  • NGINX using the REUSEPORT bind option
  • and finally setting CBPF on the REUSEPORT sockets

We were able to achieve almost perfect locality! With all tweaks done the script output looked better:

$ sudo stap -g locality.stp 12
rx= 99%  18kpps tx=100%  12kpps
rx= 99% 118kpps tx= 99% 115kpps
rx= 99% 132kpps tx= 99% 129kpps
rx= 99% 138kpps tx= 99% 136kpps
rx= 99% 140kpps tx=100% 134kpps
rx= 99% 138kpps tx= 99% 135kpps
rx= 99% 139kpps tx=100% 137kpps
rx= 99% 139kpps tx=100% 135kpps
rx= 99%  77kpps tx= 99%  74kpps

Now the test runs at 138,000 packets per second received and transmitted. The packets have a whopping 99% packet locality.

As for performance difference in practice - it's too small to measure. Even though we received about 7% more packets, the end-to-end tests didn't show a meaningful speed boost.

Conclusion

We weren't able to prove definitively whether improving packet locality actually improves performance for a high-level TCP application like an HTTP server. In hindsight it makes sense - the added benefit is minuscule compared to the overhead of running an HTTP server, especially with logic in a high level language like Lua.

This hasn't stopped us from having fun! We (myself, Gilberto Bertin and David Wragg) wrote three pretty cool SystemTap scripts, which are super useful when debugging Linux packet locality. They may come in handy for demanding users, for example those running high performance UDP servers or doing high frequency trading.

Most importantly - in the process we learned a lot about the Linux networking stack. We got to practice writing CBPF scripts, and learned how to measure locality with hackish SystemTap scripts. We got reminded of the obvious - out of the box Linux is remarkably well tuned.

Does dealing with the internals of Linux and NGINX sound interesting? Join our world famous team in London, Austin, San Francisco and our elite office in Warsaw, Poland.

  1. We are not taking into account aRFS - accelerated RFS.

Categories: Technology

5 Strategies to Promote Your App

Fri, 27/10/2017 - 18:30

Brady Gentile from Cloudflare's product team wrote an App Developer Playbook, embedded within the developer documentation page. He decided to write it after he and his team conducted several app developer interviews, finding that many developers wanted to learn how to better promote their apps.

They wanted to help app authors in the areas outside of developers' core expertise. Social media posting, community outreach, email deployment, SEO, blog posting and syndication, etc. can be daunting.

I wanted to take a moment to highlight some of the tips from the App Developer Playbook because I think Brady did a great job of providing clear ways to approach promotional strategies.

5 Promotional Strategies

1. Share with online communities

Your app’s potential audience likely reads community-aggregated news sites such as HackerNews, Product Hunt, or reddit. Sharing your app across these websites is a great way for users to find your app.

For apps that are interesting to developers, designers, scientists, entrepreneurs, etc., be sure to share your work with the Hacker News community. Be sure to follow the official guidelines when posting and when engaging with the community. It may be tempting to ask your friends to upvote you, but honesty is the best policy, and the vote-ring detector will bury your post if you try to game it. Instead, if you don’t frontpage on the first try, consider re-posting on another day, with any of these options: the frontpage of your site, the blog post about the launch of your app, a demo of your app in action, a github repo. It may be worth taking into consideration the rate at which new posts are being added to /newest per minute or per hour, which impacts the likelihood of your post making it to the frontpage.

Since you’re sharing a project that people can play with, be sure to: 1) use “Show HN” and follow Show HN guidelines, and 2) be available to answer questions in the comments.

Be sure to start your title with the words ‘Show HN:’ (this indicates that you’ll be sharing something interesting that you’ve built with the HN community with a live demo people can try), then briefly explain your app within the same field. Rather than just use the name of your app, consider adding something informative, like the short description you use in your Cloudflare Apps marketplace tile. For instance, “Show HN: Trebble (embed voice and music on your site)” is more informative than “Show HN: Trebble” as a post title. Next, you’ll have the option of either submitting the URL of your app or explaining a little bit about yourself, the app, and pasting a link to the app itself.

Lastly, you should probably take the time to explain yourself and what you're all about in a first comment, as it helps build good rapport with the community. Block off some time on your calendar so you’re available to answer questions and engage with the community for however long your post is on the frontpage. In addition to gathering their valuable feedback, a signal that the app author is there (“Hi, I’m Name and I made this app to solve this problem -- I’d love to get your feedback.”) will often make your project more approachable and put a face on a product.

Product Hunt has released a blog post which outlines how to properly submit your app or product to their community. I highly recommended you review this post in its entirety prior to launching your Cloudflare App.

Submit a link to your app, along with some screenshots/videos, and a descriptive title for your post, and select a subreddit to post into. For the title of your post, you’ll want to use something descriptive about your app; for example you could say “I just built an app that does [X].”

If your app isn't relevant to the subreddit in which you post, it'll likely be removed by a moderator, so think carefully about which subreddits would find your app genuinely useful. I also recommend you take some time to engage with that community prior to posting your app, in part because their feedback is valuable, and in part so that you’re not a stranger. Here are two subreddits you should definitely include, though: Apps, Cloudflare.

2. Optimize your app for discoverability

One of the most important steps of the Cloudflare app deployment process is ensuring that both visitors browsing Cloudflare Apps and anyone doing a search on the web may quickly and easily find your app. By optimizing your Cloudflare app for discoverability, you’ll receive a greater number of views, installations, and revenue.

Title and description

Your app’s title and short description are the first thing millions of website owners are going to see when coming across your Cloudflare app, whether it’s through browsing Cloudflare Apps or on a search engine. It’s important that an app’s title is unique, descriptive, and identifiable.

NoAdBlock is a great example.

Screenshots

Showcasing how your app might appear on a user’s website gives confidence to users thinking about previewing and installing. Include a variety of screenshots, showing multiple ways in which the software can be configured on a user’s website.

Read more about how to configure your full app description and categories in the App Developer Playbook.

3. Promote through your properties

Once your app has launched on Cloudflare Apps, it’s important that users are able to envision how your app will work for them and that they're easily able to use it.

Building an app preview link

Preview links allow you to generate a link to the install page for your app, which includes customization options for users to play around with.

Check out this preview for the Spotify app to see one in action.

Install badge and placement

Make installing easy and obvious for users. The Cloudflare Install Button is an interactive badge that can be embedded in any of your online assets, including websites and emails.

To use the full Cloudflare App install badge, paste the code listed in the Playbook onto your website or marketing page, replacing [ appTitle ] and [ appId or appAlias ] with the appropriate details for your app. You can use the standard button or customize it to match your app.

NoAdBlock, for example, uses the install badge on its site.

4. Spread word to existing users

A quick and easy way to announce your app’s availability is to notify your user base that the app is now available for them to preview and install. Read more in the Playbook on how to grow your user base before the launch.

A short announcement email that describes what your app does and links straight to its install page on Cloudflare Apps is a good starting template.

5. Form a presence on social media

Reaching users across multiple channels is an easy way to make sure website owners know your app is now available, and Cloudflare can help: tag @Cloudflare in your posts so they can be retweeted and reshared.

Blog Stuff

Another way to promote the release of your app is by writing a blog post (or several) on your app’s website, delving into the features and benefits that your app brings to users. In addition to your launch post, you can enumerate the new features and bug fixes in a new and improved release, highlight different use cases from your own user base, or deep dive into a fascinating aspect of how you implemented your app.

Here's a well-written launch blog post, from the makers of Admiral.

Other blogs can help you with this as well. Syndication is a great way to gain significant exposure for your posts. Brainstorm a list of blogs that reach the core audience for your app, then reach out and ask if you can contribute a guest post. If developers are the core audience, drop a line to community@cloudflare.com. I'd love to have a conversation about whether a guest post featuring your app would be right for the Cloudflare blog.

Again, this is just a glimpse into the guidance that the App Developer Playbook provides. Check it out and share it with your community of app developers.

Happy, productive app launching to you!

Categories: Technology

Using Google Cloud Platform to Analyze Cloudflare Logs

Thu, 26/10/2017 - 18:54

We’re excited to announce that, working with Google Cloud Platform (GCP), we now offer deep insights into your domain’s web traffic. While Cloudflare Enterprise customers have always had access to their logs, they previously had to rely on their own tools to process them, adding extra complexity and cost.

Cloudflare logs provide real time insight into traffic, malicious activity, attack incidents, and infrastructure health checks. The output is used to help customers adjust their settings, manage costs and resources, and plan for expansion.

Working with Google, we created an end-to-end solution that lets customers retrieve Cloudflare access logs and store and process the data in a simple way. GCP components such as Cloud Storage, Cloud Functions, BigQuery, and Data Studio come together to make this possible.

One of the biggest challenges of data analysis is storing and processing large volumes of data within a short time period while avoiding high costs. GCP Storage and BigQuery address these challenges well.

Cloudflare customers can decide whether to obtain and process data from Cloudflare access logs on demand or on a regular basis. The full solution is described in this Knowledge Base article; initial setup takes no more than 30 minutes to an hour. Moreover, customers can replace any part of the process with their own tool or solution.

The data flows from the Cloudflare Logshare service into a GCP Storage bucket, where a Cloud Function imports it into BigQuery for analysis and, finally, visualization in Data Studio.

The key elements are:


Cloudflare Logshare service

Cloudflare logs are obtained via a REST API. This service can usually run on your local workstation or on a virtual machine; the solution illustrated here uses a GCP Compute Engine micro instance.
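
As a rough sketch only (the exact API path, parameters, and authentication details are covered in the Knowledge Base article, so treat the URL, zone ID, and time range below as placeholders), pulling an hour of logs for a zone and saving the output locally might look something like this:

# Hypothetical zone ID, credentials, and time range; check the Knowledge Base
# article for the real endpoint and parameters.
$ curl -s \
    -H "X-Auth-Email: you@example.com" \
    -H "X-Auth-Key: YOUR_API_KEY" \
    "https://api.cloudflare.com/client/v4/zones/ZONE_ID/logs/requests?start=1509012000&end=1509015600" \
    > cloudflare-logs.json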

Log storage and management

For storing and managing log files we use a GCP Storage bucket, with all logs stored in JSON format. Google Cloud Storage lets you adjust storage capacity as needed and set a retention policy.
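
As a sketch (the bucket name and the lifecycle.json file are assumptions for illustration), creating a bucket, uploading a log file, and applying a lifecycle rule with the gsutil tool from the Google Cloud SDK might look like this:

# Hypothetical bucket name; lifecycle.json would hold a rule such as
# "delete objects older than 30 days".
$ gsutil mb gs://my-cloudflare-logs
$ gsutil cp cloudflare-logs.json gs://my-cloudflare-logs/
$ gsutil lifecycle set lifecycle.json gs://my-cloudflare-logs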

Data Import

Analyzing large data sets can be challenging; Google BigQuery makes it straightforward. When a new log file is uploaded to the GCP Storage bucket, a GCP Cloud Function triggers an import of the data from that file into BigQuery. BigQuery then lets you access your data almost immediately by running a simple query. For example, you can pull the top requested URIs that returned a 404 status code, as sketched below.
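
The original screenshot isn't reproduced here, but a query along the following lines gives the idea. The dataset and table name cloudflare_logs.els are assumptions for illustration, and the field names should be checked against your imported schema:

# Top requested URIs that returned a 404, via the bq command-line tool.
$ bq query --use_legacy_sql=false '
    SELECT ClientRequestURI, COUNT(*) AS hits
    FROM `cloudflare_logs.els`
    WHERE EdgeResponseStatus = 404
    GROUP BY ClientRequestURI
    ORDER BY hits DESC
    LIMIT 10'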

Data Visualization

Based on feedback from our customers about which data they are interested in, we used GCP Data Studio to create visual reports. The following reports can be created in Data Studio using BigQuery as an input: top client IP address requests, requests by URL, error types, cached or uncached URLs, top triggered WAF rules, traffic types by device or location and many more.

Data Studio “Edit” mode

Data Studio “View” mode

$500 GCP credit

Google Cloud is offering a $500 credit towards a new Google Cloud account to help you get started. In order to receive a credit, please follow these instructions.

Costs

Costs depend on several factors, including the number of requests, storage volume, retention policy, and the number of BigQuery queries. For detailed pricing, please use the GCP Pricing Calculator.

Please reach out to your Cloudflare Enterprise Solution Engineer or Customer Success Manager for more information.

Categories: Technology

Spotify's Cloudflare App is open source: fork it for your next project

Wed, 25/10/2017 - 18:00

Earlier this year we launched Cloudflare Apps so that app developers can leverage our global network of 6 million+ websites, applications, and APIs. I’d like to take a moment to highlight Spotify, a launch partner for Cloudflare Apps, especially since they have open-sourced the code for their Cloudflare App.

Spotify Github repo »

About Spotify
Spotify is the leading digital service for streaming music, serving more than 140 million listeners.

What does the Spotify app do?
Recently, Spotify launched a Cloudflare App that lets you instantly and easily embed the Spotify player on your website without having to copy and paste any code.

Who should install the Spotify app?
A musician who runs a site for their band - they can now play samples of new tracks on their tour calendar page and psych up their fans.

A game creator who wants to share their game's soundtrack with their fans.

An activewear company which wants to deliver popular running playlists to its customers.

Web properties that install the Spotify app have the ability to increase user engagement.

Add Spotify widgets to your web pages and let your users play tracks and follow Spotify profiles. Add a Spotify Play Button to your blog, website, or social page; all your fans have to do is hit “Play” to enjoy the music. You can create Play Buttons for albums, tracks, artists, or playlists.

How it works for the user
When a logged-in Spotify user clicks the button on your page, the music will start playing in the Spotify player. If the user isn’t logged into their Spotify account, the play button will play a 30-second audio preview of the music and they will be prompted to login or sign up.

How it works for the website owner
You can customize your button as well as link to any song or album you prefer in Spotify’s music catalog or to a playlist you’ve generated. Take a look at a preview of how the Spotify app would appear on a site here.

The Cloudflare App creator allows you to preview the app on your site without making any changes to your code.

In the left pane, you can see the install options where you can select what kind of widget you’d like displayed: a playlist, a track, or a follow button. You can customize the size, theme and position of the banner on your site. The “Pick a location” tool uses CSS selectors to allow you to pinpoint the location on your site where it’s displayed.

In the right pane, you can preview your choices to see what they’d look like on your website, and experiment with placement and how the widget flows with the rest of the page. This is very similar to the tool that app developers use to test how their app behaves on a wide range of web properties.

Play with the Spotify Preview now »

Fork this App
Our friends at Spotify made their code available on GitHub. You can clone and fork the repository here. It’s a great way to get some practice developing Cloudflare Apps and to start with some basic scaffolding for your app.

Check out the documentation for Cloudflare Apps here.

Check out Cloudflare’s new App Developer Playbook, a step-by-step marketing guide for Cloudflare app developers here.

Categories: Technology

How to Monkey-Patch the Linux Kernel

Tue, 24/10/2017 - 00:28

I have a weird setup. I type in Dvorak. But, when I hold ctrl or alt, my keyboard reverts to Qwerty.

You see, the classic text-editing hotkeys, ctrl+Z, ctrl+X, ctrl+C, and ctrl+V are all located optimally for a Qwerty layout: next to the control key, easy to reach with your left hand while mousing with your right. In Dvorak, unfortunately, these hotkeys are scattered around mostly on the right half of the keyboard, making them much less convenient. Using Dvorak for typing but Qwerty for hotkeys turns out to be a nice compromise.

But, the only way I could find to make this work on Linux / X was to write a program that uses X "grabs" to intercept key events and rewrite them. That was mostly fine, until recently, when my machine, unannounced, updated to Wayland. Remarkably, I didn't even notice at first! But at some point, I realized my hotkeys weren't working right. You see, Wayland, unlike X, actually has some sensible security rules, and as a result, random programs can't just man-in-the-middle all keyboard events anymore. Which broke my setup.

Yes, that's right, I'm that guy from xkcd 1172.

So what was I to do? I began worrying that I'd need to modify the keyboard handling directly in Wayland or in the Linux kernel. Maintaining my own fork of core system infrastructure that changes frequently was not an attractive thought.

Desperate, I asked the Cloudflare Engineering chat channel if anyone knew a better way. That's when Marek Kroemeke came to the rescue with a link to a SystemTap example.

Following Marek's link, I found:

#! /usr/bin/env stap
# This is not useful, but it demonstrates that
# Systemtap can modify variables in a running kernel.
# Usage: ./keyhack.stp -g

probe kernel.function("kbd_event") {
  # Changes 'm' to 'b'.
  if ($event_code == 50) $event_code = 48
}

probe end {
  printf("\nDONE\n")
}

Oh my. What is this? What do you mean, "this is not useful"? This is almost exactly what I want!

SystemTap: Not just for debugging?

SystemTap is a tool designed to allow you to probe the Linux kernel for debugging purposes. It lets you hook any kernel function (yes, any C function defined anywhere in the kernel) and log the argument values, or other system state. Scripts are written in a special language designed to prevent you from doing anything that could break your system.

But it turns out you can do more than just read: With the -g flag (for "guru mode", in which you accept responsibility for your actions), you can not just read, but modify. Moreover, you can inject raw C code, escaping the restrictions of SystemTap's normal language.

SystemTap's command-line tool, stap, compiles your script into a Linux kernel module and loads it. The module, on load, will find the function you want to probe and will overwrite it with a jump to your probing code. The probe code does what you specify, then jumps back to the original function body to continue as usual. When you terminate stap (e.g. via ctrl+C on the command line), it unloads the module, restoring the probed function to its original state.

This means it's easy and relatively safe to inject a probe into your running system at any time. If it doesn't do what you want, you can safely remove it, modify it, and try again. There's no need to modify the actual kernel code nor recompile your kernel. You can make your changes without maintaining a fork.

This is, of course, a well-known practice in dynamic programming languages, where it's generally much easier. We call it "Monkey-Patching".

When is it OK to Monkey-Patch?

"Monkey-patch" is often used as a pejorative. Many developers cringe at the thought. It's an awful hack! Never do that!

Indeed, in a lot of contexts, monkey-patching is a terrible idea. At a previous job, I spent weeks debugging problems caused by a bad (but well-meaning) monkey-patch made by one of our dependencies.

But, often, a little monkey-patch can save a lot of work. By monkey-patching my kernel, I can get the keyboard behavior I want without maintaining a fork forever, and without spending weeks developing a feature worthy of pushing upstream. And when patching my own machine, I can't hurt anyone but myself.

I would propose two rules for monkey patching:

  1. Only the exclusive owner of the environment may monkey-patch it. The "owner" is an entity who has complete discretion and control over all code that exists within the environment in which the monkey-patch is visible. For a self-contained application which specifies all its dependencies precisely, the application developer may be permitted to monkey-patch libraries within the application's runtime -- but libraries and frameworks must never apply monkey-patches. When we're talking about the kernel, the "owner" is the system administrator.
  2. The owner takes full responsibility for any breakages caused. If something doesn't work right, it's up to the owner to deal with it or abandon their patch.

In this case, I'm the owner of my system, and therefore I have the right to monkey-patch it. If my monkey-patch breaks (say, because the kernel functions I was patching changed in a later kernel version), or if it breaks other programs I use, that's my problem and I'll deal with it.

Setting Up

To use SystemTap, you must have the kernel headers and debug symbols installed. I found the documentation was not quite right on my Debian system. I managed to get everything installed by running:

sudo apt install systemtap linux-headers-amd64 linux-image-amd64-dbg

Note that the debug symbols are a HUGE package (~500MB). Such is the price you pay, it seems.

False Starts

Starting from the sample script that remaps 'm' to 'b', it seemed obvious how to proceed. I saved the script to a file and did:

sudo stap -g keyhack.stp

But… nothing happened. My 'm' key still typed 'm'.

To debug, I added some printf() statements (which conveniently print to the terminal where stap runs). But, it appeared the keyboard events were indeed being captured. So why did 'm' still type 'm'?

It turns out, no one was listening. The kbd_event function is part of the text-mode terminal support. Sure enough, if I switched virtual terminals over to a text terminal, the key was being remapped. But Wayland uses a totally different code path to receive key events -- the /dev/input devices. These devices are implemented by the evdev module.

Looking through evdev.c, at first evdev_event() looks tempting as a probe point: it has almost the same signature as kbd_event(). Unfortunately, this function is not usually called by the driver; rather, the multi-event version, evdev_events(), usually is. But that version takes an array, which seems more tedious to deal with.

Looking further, I came across __pass_event(), which evdev_events() calls for each event. It's slightly different from kbd_event() in that the event is encapsulated in a struct, but at least it only takes one event at a time. This seemed like the easiest place to probe, so I tried it:

# DOES NOT WORK
probe module("evdev").function("__pass_event") {
  # Changes 'm' to 'b'.
  if ($event->code == 50) $event->code = 48
}

Alas, this didn't quite work. When running stap, I got:

semantic error: failed to retrieve location attribute for 'event'

This error seems strange. The function definitely has a parameter called event!

The problem is, __pass_event() is a static function that is called from only one place. As a result, the compiler inlines it. When a function is inlined, its parameters often cease to have a well-defined location in memory, so reading and modifying them becomes infeasible. SystemTap relies on debug info tables that specify where to find parameters, but in this case the tables don't have an answer.

The Working Version

Alas, it seemed I'd need to use evdev_events() and deal with the array after all. This function takes an array of events to deliver at once, so its parameters aren't quite as convenient. But, it has multiple call sites, so it isn't inlined. I just needed a little loop:

probe module("evdev").function("evdev_events") { for (i = 0; i < $count; i++) { # Changes 'm' to 'b'. if ($vals[i]->code == 50) $vals[i]->code = 48 } }

Success! This script works. I no longer have any way to type 'm'.

From here, implementing the Dvorak-Qwerty key-remapping behavior I wanted was a simple matter of writing some code to track modifier key state and remap keys. You can find my full script on GitHub.

Categories: Technology
