Blogroll Category: Technology

I read blogs, as well as write one. The 'blogroll' on this site reproduces some posts from some of the people I enjoy reading. There are currently 60 posts from the category 'Technology.'

Disclaimer: Reproducing an article here need not necessarily imply agreement or endorsement!

Taxonomy access fix - Moderately critical - Access bypass - SA-CONTRIB-2019-093

Drupal Contrib Security - Wed, 11/12/2019 - 18:32
Project: Taxonomy access fix
Version: 8.x-2.6, 8.x-2.5, 8.x-2.4
Date: 2019-December-11
Security risk: Moderately critical 13/25 AC:None/A:User/CI:Some/II:None/E:Theoretical/TD:All
Vulnerability: Access bypass
Description:

This module extends access handling of Drupal Core's Taxonomy module.

The module doesn't sufficiently check whether:

  • a given entity should be access controlled, defaulting to allowing access even to unpublished Taxonomy Terms.
  • certain administrative routes should be access controlled, defaulting to allowing access even to users without permission to access these administrative routes.

The vulnerability is mitigated by the following facts:

  • the user interface to change the status of Taxonomy Terms was only released in Drupal Core 8.8; in earlier versions of Drupal Core, a custom or contributed module is required to mark Taxonomy Terms as unpublished.
  • all entity operations (except the view operation) available on affected administrative routes still require appropriate permissions.
  • an attacker must have a role with permission to either access content or view a Taxonomy Term in a vocabulary.
Solution: 

Install the latest version:

Also see the Taxonomy Access Fix project page.

Categories: Technology

Smart Trim - Moderately critical - Cross site scripting - SA-CONTRIB-2019-092

Drupal Contrib Security - Wed, 11/12/2019 - 18:20
Project: Smart Trim
Version: 8.x-1.1, 8.x-1.0, 8.x-1.0-beta1
Date: 2019-December-11
Security risk: Moderately critical 13/25 AC:Basic/A:User/CI:Some/II:Some/E:Theoretical/TD:Default
Vulnerability: Cross site scripting
Description:

The Smart Trim module gives site builders additional control over text summary fields.

The module doesn't sufficiently filter text when certain options are selected.

This vulnerability is mitigated by the fact that an attacker must have a role with the ability to create content on the site when certain options are selected for the trimmed output.

Solution: 

Install the latest version:

Also see the Smart Trim project page.

Categories: Technology

New tools to monitor your server and avoid downtime

CloudFlare - Wed, 11/12/2019 - 10:13

When your server goes down, it’s a big problem. Today, Cloudflare is introducing two new tools to help you understand and respond faster to origin downtime — plus, a new service to automatically avoid downtime.

The new features are:

  • Standalone Health Checks, which notify you as soon as we detect problems at your origin server, without needing a Cloudflare Load Balancer.
  • Passive Origin Monitoring, which lets you know when your origin cannot be reached, with no configuration required.
  • Zero-Downtime Failover, which can automatically avert failures by retrying requests to origin.
Standalone Health Checks

Our first new tool is Standalone Health Checks, which will notify you as soon as we detect problems at your origin server -- without needing a Cloudflare Load Balancer.

A Health Check is a service that runs on our edge network to monitor whether your origin server is online. Health Checks are a key part of our load balancing service because they allow us to quickly and actively route traffic to origin servers that are live and ready to serve requests. Standalone Health Checks allow you to monitor the health of your origin even if you only have one origin or do not yet need to balance traffic across your infrastructure.

We’ve provided many dimensions for you to hone in on exactly what you’d like to check, including response code, protocol type, and interval. You can specify a particular path if your origin serves multiple applications, or you can check a larger subset of response codes for your staging environment. All of these options allow you to properly target your Health Check, giving you a precise picture of what is wrong with your origin.
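To make those configuration dimensions concrete, here is a rough Python sketch of what creating a Standalone Health Check through the Cloudflare API could look like. The endpoint path and field names (for example http_config and expected_codes) are assumptions for illustration rather than the documented schema, so treat this as a starting point and check the Health Checks API reference for the exact format.

```python
import requests

# Hypothetical sketch of creating a Standalone Health Check via the Cloudflare API.
# The endpoint path and field names below are assumptions for illustration only;
# check the Health Checks API documentation for the exact schema.
API_TOKEN = "YOUR_API_TOKEN"
ZONE_ID = "YOUR_ZONE_ID"

health_check = {
    "name": "origin-health",
    "address": "origin.example.com",       # the origin to probe
    "type": "HTTPS",                        # protocol type
    "interval": 60,                         # seconds between checks
    "http_config": {
        "path": "/healthz",                 # target a specific application path
        "expected_codes": ["200", "2xx"],   # or a broader subset for staging
    },
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/healthchecks",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=health_check,
)
resp.raise_for_status()
print(resp.json())
```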


If one of your origin servers becomes unavailable, you will receive a notification letting you know of the health change, along with detailed information about the failure so you can take action to restore your origin’s health.  

Lastly, once you’ve set up your Health Checks across the different origin servers, you may want to see trends or the top unhealthy origins. With Health Check Analytics, you’ll be able to view all the change events for a given health check, isolate origins that may be top offenders or not performing at par, and move forward with a fix. On top of this, in the near future, we are working to provide you with access to all Health Check raw events to ensure you have the detailed lens to compare Cloudflare Health Check Event logs against internal server logs.


Users on the Pro, Business, or Enterprise plan will have access to Standalone Health Checks and Health Check Analytics to promote top-tier application reliability and help maximize brand trust with their customers. You can access Standalone Health Checks and Health Check Analytics through the Traffic app in the dashboard.

Passive Origin Monitoring

Standalone Health Checks are a super flexible way to understand what’s happening at your origin server. However, they require some forethought to configure before an outage happens. That’s why we’re excited to introduce Passive Origin Monitoring, which will automatically notify you when a problem occurs -- no configuration required.

Cloudflare knows when your origin is down, because we’re the ones trying to reach it to serve traffic! When we detect downtime lasting longer than a few minutes, we’ll send you an email.

Starting today, you can configure origin monitoring alerts to go to multiple email addresses. Origin Monitoring alerts are available in the new Notification Center (more on that below!) in the Cloudflare dashboard:


Passive Origin Monitoring is available to customers on all Cloudflare plans.

Zero-Downtime Failover

What’s better than getting notified about downtime? Never having downtime in the first place! With Zero-Downtime Failover, we can automatically retry requests to origin, even before Load Balancing kicks in.

How does it work? If a request to your origin fails, and Cloudflare has another record for your origin server, we’ll just try another origin within the same HTTP request. The alternate record could be either an A/AAAA record configured via Cloudflare DNS, or another origin server in the same Load Balancing pool.

Consider a website, example.com, that has web servers at two different IP addresses: 203.0.113.1 and 203.0.113.2. Before Zero-Downtime Failover, if 203.0.113.1 became unavailable, Cloudflare would attempt to connect, fail, and ultimately serve an error page to the user. With Zero-Downtime Failover, if 203.0.113.1 cannot be reached, then Cloudflare’s proxy will seamlessly attempt to connect to 203.0.113.2. If the second server can respond, then Cloudflare can avert serving an error to example.com’s user.
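As a rough illustration of the idea (not Cloudflare's actual implementation), the sketch below tries each origin address in turn and only moves on when the TCP or TLS connection itself fails, before any request data has been sent, mirroring the safety constraint described later in this post. The addresses come from the example above.

```python
import socket
import ssl

# Illustration only: retry on the next origin when the TCP/TLS connection fails,
# before any HTTP headers or body have been sent. Not Cloudflare's implementation.
ORIGINS = ["203.0.113.1", "203.0.113.2"]   # the example.com origins from the example
HOSTNAME = "example.com"

def connect_with_failover(origins, hostname, port=443, timeout=5.0):
    context = ssl.create_default_context()
    last_error = None
    for ip in origins:
        try:
            raw = socket.create_connection((ip, port), timeout=timeout)  # TCP phase
            return context.wrap_socket(raw, server_hostname=hostname)    # TLS phase
        except (OSError, ssl.SSLError) as exc:
            last_error = exc        # connection failed: safe to try the next origin
    raise ConnectionError(f"all origins failed: {last_error}")

# tls_sock = connect_with_failover(ORIGINS, HOSTNAME)
# The HTTP request would only be written to the socket after a connection succeeds.
```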

Since we rolled out Zero-Downtime Failover a few weeks ago, we’ve prevented tens of millions of requests per day from failing!

Zero-Downtime Failover works in conjunction with Load Balancing, Standalone Health Checks, and Passive Origin Monitoring to keep your website running without a hitch. Health Checks and Load Balancing can avert failure, but take time to kick in. Zero-Downtime failover works instantly, but adds latency on each connection attempt. In practice, Zero-Downtime Failover is helpful at the start of an event, when it can instantly recover from errors; once a Health Check has detected a problem, a Load Balancer can then kick in and properly re-route traffic. And if no origin is available, we’ll send an alert via Passive Origin Monitoring.

To see an example of this in practice, consider an incident from a recent customer. They saw a spike in errors at their origin that would ordinarily cause availability to plummet (red line), but thanks to Zero-Downtime failover, their actual availability stayed flat (blue line).


During a 30 minute time period, Zero-Downtime Failover improved overall availability from 99.53% to 99.98%, and prevented 140,000 HTTP requests from resulting in an error.

It’s important to note that we only attempt to retry requests that have failed during the TCP or TLS connection phase, which ensures that HTTP headers and payload have not been transmitted yet. Thanks to this safety mechanism, we're able to make Zero-Downtime Failover Cloudflare's default behavior for Pro, Business, and Enterprise plans. In other words, Zero-Downtime Failover makes connections to your origins more reliable with no configuration or action required.

Coming soon: more notifications, more flexibility

Our customers are always asking us for more insights into the health of their critical edge infrastructure. Health Checks and Passive Origin monitoring are a significant step towards Cloudflare taking a proactive instead of reactive approach to insights.

To support this work, today we’re announcing the Notification Center as the central place to manage notifications. This is available in the dashboard today, accessible from your Account Home.

From here, you can create new notifications, as well as view any existing notifications you’ve already set up. Today’s release allows you to configure  Passive Origin Monitoring notifications, and set multiple email recipients.


We’re excited about today’s launches to help our customers avoid downtime. Based on your feedback, we have lots of improvements planned that can help you get the timely insights you need:

  • New notification delivery mechanisms
  • More events that can trigger notifications
  • Advanced configuration options for Health Checks, including added protocols, threshold based notifications, and threshold based status changes.
  • More ways to configure Passive Health Checks, like the ability to add thresholds, and filter to specific status codes
Categories: Technology

The Two-Year Anniversary of The Athenian Project: Preparing for the 2020 Elections.

CloudFlare - Tue, 10/12/2019 - 16:00

Two years ago, Cloudflare launched its Athenian Project, an effort to protect state and local government election websites from cyber attacks. With the two-year anniversary and many 2020 elections approaching, we are renewing our commitment to provide Cloudflare’s highest level of services for free to protect election websites and ensure the preservation of these critical infrastructure sites. We started the project at Cloudflare as it directly aligns with our mission: to help build a better Internet. We believe the Internet plays a helpful role in democracy and ensuring constituents’ right to information. By helping state and local government election websites, we ensure the protection of voters’ voices, preserve citizens’ confidence in the democratic process, and enhance voter participation.

We are currently helping 156 local or state websites in 26 states to combat DDoS attacks, SQL injections, and many other hostile attempts to threaten their operations. This is an additional 34 domains in states like Ohio, Florida, Kansas, South Carolina and Wisconsin since we reported statistics after last year’s election.

The need for security protection of critical election infrastructure is not new, but it is in the spotlight again as the 2020 U.S. elections approach, with the President, 435 seats in the U.S. House of Representatives, 35 of the 100 seats in the U.S. Senate, and many state and local representatives on the ballot. According to the Department of Homeland Security and the Federal Bureau of Investigation, election infrastructure in all 50 states was targeted during the 2016 presidential election. The risk is real. Florida counties suffered a spear-phishing attack that gave hackers access to the voter registration rolls, and a Tennessee county website was knocked offline on election night and had to resort to handing out printed election vote counts.

Although the U.S. government has sought to combat malicious actors that target election infrastructure, with Congress approving $250 million in funding in September 2019 for states to administer and secure U.S. elections, there is always more to be done. As states rapidly prepare for the upcoming elections, the need for inexpensive, accessible solutions to protect election infrastructure is at an all-time high. As Micah Van Maanen, the Information Technology Director for Sioux County, Iowa, put it:


At Cloudflare, we believe it is vital to the national interest that elections are secure and free from interference as these fundamentals are essential to United States democracy. In these two years, we have learned a great deal about government election offices all across the U.S., the spread of information and resources available to them, and the small number of people it takes to make an impact in the protection of election infrastructure.

We still have more to learn to ensure the protection of these critical sites and to understand how we can better prepare state and local election websites for the upcoming elections. As we look into the future of the project in upcoming years, it is important to also look at the past.

Stories from the Field:

The jurisdictions that are using Cloudflare to protect their election websites are diverse, with state and local governments representing a range of populations from over 1.2 million residents to fewer than 5,000 residents.

I Voted Stickers - Element 5 Digital on Pexels

In Ohio, the Secretary of State released yearly state directives in June 2018 and June 2019 to all county Boards of Elections on tools, resources, and cybersecurity best practices to strengthen the security of their election systems. The Athenian Project was recommended and encouraged in both directives for its DDoS protection, Web Application Firewall, Rate Limiting, Under Attack Mode, and 24/7 support. Over the past year we have on-boarded 13 counties in Ohio, with a total of 27 domains protected by Cloudflare. In the directives, Ohio sets out to become the leader in best practices for the security of election systems, and we are happy to be aiding in this mission.

The Idaho Secretary of State joined the Athenian Project at the beginning of 2018 and Chad Houck, Idaho’s Chief Deputy Secretary of State, engaged with our team on how exactly the Secretary of State could benefit from Cloudflare services.

On May 11, 2018, two of Idaho’s state agency websites were defaced by an anti-government group that posted a manifesto in Italian. After receiving notifications from many different sources regarding the security breach and following several inquiries from the press regarding the matter, Chad decided to look at the Idaho Secretary of State Cloudflare account to see if there was any evidence of the same hackers trying to penetrate the IDSOS site. Using Cloudflare’s analytics tools, he was able to see 27,000 blocked requests, up from the normal 240 per day, within the same 3.5-hour window that saw the other sites defaced. Cloudflare's Web Application Firewall had automatically blocked the bad requests that attempted to penetrate the site.

Confident in the value of Cloudflare’s tools, Deputy Secretary Houck’s plan is to create policies of operation that assist Idaho’s 44 counties in protecting their election websites and statewide voter registration systems. “With the first two counties already on board for a pilot, our goal is to be the first state to reach 100% county adoption of the Athenian Project tools.”

Understanding the U.S. Electoral System & Athenian Project Expansion:

The United States election system is fragmented and varies greatly from state to state. In some states, the administration of elections is covered by the state government and, in others, by counties or local municipalities. This system is decentralized, meaning that each state and local government has control over how the various duties of elections are distributed. According to the National Conference of State Legislators, “there are more than 10,000 election administration jurisdictions in the U.S. The size of these jurisdictions varies dramatically.” This means the voting experience differs from county to county, and from state to state.

Photo by Brandon Mowinkel on Unsplash

This system fragmentation has been a challenge for the Athenian Project. In the process, we have learned that state and local government election offices vary widely in technical ability and funding. With this in mind, our teams at Cloudflare are looking into new ways to engage the community. Among our efforts, we aim to interact with election security information sharing centers that provide recommendations and resources to strengthen cybersecurity practices for election infrastructure. Doing this helps state and local entities prepare for the upcoming election.

What’s Next:

As we have a year until the 2020 election, we are thinking of how we engage with our current Athenian participants and expand access to Cloudflare services to new states and counties within the United States that might benefit from the Athenian Project. A key aspect that we have learned in this process is that the security of election websites sits with a small group of dedicated government officials that have found each other and built their own networks to share best cybersecurity practices.

When I ask Athenian participants during the onboarding process how they discovered the project and Cloudflare, many of the answers are similar: they heard about the project from another county, a neighboring state, or information sharing centers that recommend using Cloudflare services as an extra layer of protection on their election infrastructure. Rodney Allen, the Executive Director for the Board of Voter Registration & Elections of Pickens County, South Carolina, says that “the great thing about the Athenian Project is that Pickens County Board of Elections site has not experienced any downtime or outages thanks to Cloudflare, especially during South Carolina's 2018 general election and special elections in 2019."

As we set our sights for the 2020 election, we are happy to help provide these state and local governments with the tools they need to protect their election websites. If you run a state or local election website, feel free to reach out through our webform or read more about how our Athenian Project can help secure your online presence.

Categories: Technology

Introducing Load Balancing Analytics

CloudFlare - Tue, 10/12/2019 - 14:00

Cloudflare aspires to make Internet properties everywhere faster, more secure, and more reliable. Load Balancing helps with speed and reliability and has been evolving over the past three years.

Let’s go through a scenario that highlights a bit more of what a Load Balancer is and the value it can provide. A standard load balancer comprises a set of pools, each of which has origin servers specified as hostnames and/or IP addresses. A routing policy is assigned to each load balancer, which determines the origin pool selection process.

Let’s say you build an API that is using cloud provider ACME Web Services. Unfortunately, ACME had a rough week, and their service had a regional outage in their Eastern US region. Consequently, your website was unable to serve traffic during this period, which resulted in reduced brand trust from users and missed revenue. To prevent this from happening again, you decide to take two steps: use a secondary cloud provider (in order to avoid having ACME as a single point of failure) and use Cloudflare’s Load Balancing to take advantage of the multi-cloud architecture. Cloudflare’s Load Balancing can help you maximize your API’s availability for your new architecture. For example, you can assign health checks to each of your origin pools. These health checks can monitor your origin servers’ health by checking HTTP status codes, response bodies, and more. If an origin pool’s response doesn’t match what is expected, then traffic will stop being steered there. This will reduce downtime for your API when ACME has a regional outage because traffic in that region will seamlessly be rerouted to your fallback origin pool(s). In this scenario, you can set the fallback pool to be origin servers in your secondary cloud provider. In addition to health checks, you can use the ‘random’ routing policy in order to distribute your customers’ API requests evenly across your backend. If you want to optimize your response time instead, you can use ‘dynamic steering’, which will send traffic to the origin determined to be closest to your customer.
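As a conceptual sketch of the steering behaviour described above (the names and data structures are illustrative, not Cloudflare's internals), pool selection boils down to: drop pools whose health checks are failing, fall back when nothing healthy remains, and then pick either randomly or by lowest measured round-trip time.

```python
import random

# Conceptual sketch of the pool-selection logic described above; the data
# structures and names are illustrative, not Cloudflare's implementation.
pools = [
    {"name": "acme-us-east", "healthy": False, "avg_rtt_ms": 40},   # ACME outage
    {"name": "acme-us-west", "healthy": True,  "avg_rtt_ms": 85},
    {"name": "backup-cloud", "healthy": True,  "avg_rtt_ms": 110},  # secondary provider
]
fallback_pool = {"name": "backup-cloud-fallback", "healthy": True, "avg_rtt_ms": 150}

def select_pool(pools, policy="random"):
    healthy = [p for p in pools if p["healthy"]]   # health checks gate eligibility
    if not healthy:
        return fallback_pool                       # nothing healthy: use the fallback
    if policy == "random":
        return random.choice(healthy)              # spread requests evenly
    if policy == "dynamic":
        return min(healthy, key=lambda p: p["avg_rtt_ms"])  # lowest measured RTT wins
    raise ValueError(f"unknown policy: {policy}")

print(select_pool(pools, policy="dynamic")["name"])  # -> acme-us-west
```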

Our customers love Cloudflare Load Balancing, and we’re always looking to improve and make our customers’ lives easier. Since Cloudflare’s Load Balancing was first released, the most popular customer request was for an analytics service that would provide insights on traffic steering decisions.

Today, we are rolling out Load Balancing Analytics in the Traffic tab of the Cloudflare  dashboard. The three major components in the analytics service are:

  • An overview of traffic flow that can be filtered by load balancer, pool, origin, and region.
  • A latency map that indicates origin health status and latency metrics from Cloudflare’s global network spanning 194 cities and growing!
  • Event logs denoting changes in origin health. This feature was released in 2018 and tracks pool and origin transitions between healthy and unhealthy states. We’ve moved these logs under the new Load Balancing Analytics subtab. See the documentation to learn more.

In this blog post, we’ll discuss the traffic flow distribution and the latency map.

Traffic Flow Overview

Our users want a detailed view into where their traffic is going, why it is going there, and insights into what changes may optimize their infrastructure. With Load Balancing Analytics, users can graphically view traffic demands on load balancers, pools, and origins over variable time ranges.

Understanding how traffic flow is distributed informs the process of creating new origin pools, adapting to peak traffic demands, and observing failover response during origin pool failures.

Figure 1

In Figure 1, we can see an overview of traffic for a given domain. On Tuesday, the 24th, the red pool was created and added to the load balancer. In the following 36 hours, as the red pool handled more traffic, the blue and green pools both saw a reduced workload. In this scenario, the traffic distribution graph did provide the customer with new insights. First, it demonstrated that traffic was being steered to the new red pool. It also allowed the customer to understand the new level of traffic distribution across their network. Finally, it allowed the customer to confirm whether traffic decreased in the expected pools. Over time, these graphs can be used to better manage capacity and plan for upcoming infrastructure needs.

Latency Map

The traffic distribution overview is only one part of the puzzle. Another essential component is understanding request performance around the world. This is useful because customers can ensure user requests are handled as fast as possible, regardless of where in the world the request originates.

The standard Load Balancing configuration contains monitors that probe the health of customer origins. These monitors can be configured to run from one or more particular regions or, for Enterprise customers, from all Cloudflare locations. They collect useful information, such as round-trip time, that can be aggregated to create the latency map.

The map provides a summary of how responsive origins are from around the world, so customers can see regions where requests are underperforming and may need further investigation. A common metric used to identify performance is request latency. We found that the p90 latency for all Load Balancing origins being monitored is 300 milliseconds, which means that 90% of all monitors’ health checks had a round trip time faster than 300 milliseconds. We used this value to identify locations where latency was slower than the p90 latency seen by other Load Balancing customers.
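As a small worked example of that p90 figure, here is one way to compute a nearest-rank percentile over a set of hypothetical round-trip-time samples; 90% of the samples fall at or below the value returned.

```python
# A worked example of the p90 statistic quoted above: the value at or below which
# 90% of health-check round-trip times fall. The sample data here is made up.
rtt_ms = [42, 55, 61, 78, 90, 120, 145, 180, 240, 310]  # hypothetical RTT samples

def percentile(samples, pct):
    ordered = sorted(samples)
    # nearest-rank method: smallest value that covers pct% of the samples
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

print(percentile(rtt_ms, 90))  # -> 240: 9 of the 10 samples are at or below 240 ms
```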

Figure 2

In Figure 2, we can see the responsiveness of the Northeast Asia pool. The Northeast Asia pool is slow specifically for monitors in South America, the Middle East, and Southern Africa, but fast for monitors that are probing closer to the origin pool. Unfortunately, this means users for the pool in countries like Paraguay are seeing high request latency. High page load times have many unfortunate consequences: higher visitor bounce rate, decreased visitor satisfaction rate, and a lower search engine ranking. In order to avoid these repercussions, a site administrator could consider adding a new origin pool in a region closer to underserved regions. In Figure 3, we can see the result of adding a new origin pool in Eastern North America. We see the number of locations where the domain was found to be unhealthy drops to zero and the number of slow locations cut by more than 50%.

Figure 3

Tied with the traffic flow metrics from the Overview page, the latency map arms users with insights to optimize their internal systems, reduce their costs, and increase their application availability.

GraphQL Analytics API

Behind the scenes, Load Balancing Analytics is powered by the GraphQL Analytics API. As you’ll learn later this week, GraphQL provides many benefits to us at Cloudflare. Customers now only need to learn a single API format that will allow them to extract only the data they require. For internal development, GraphQL eliminates the need for customized analytics APIs for each service, reduces query cost by increasing cache hits, and reduces developer fatigue by using a straightforward query language with standardized input and output formats. Very soon, all Load Balancing customers on paid plans will be given the opportunity to extract insights from the GraphQL API.  Let’s walk through some examples of how you can utilize the GraphQL API to understand your Load Balancing logs.

Suppose you want to understand the number of requests the pools for a load balancer are seeing from the different locations in Cloudflare’s global network. The query in Figure 4 counts the number of unique (location, pool ID) combinations every fifteen minutes over the course of a week.

Figure 4

For context, our example load balancer, lb.example.com, utilizes dynamic steering. Dynamic steering directs requests to the most responsive available origin pool, which is often the closest. It does so using a weighted round-trip time measurement. Let’s try to understand why all traffic from Singapore (SIN) is being steered to our pool in Northeast Asia (asia-ne). We can run the query in Figure 5. This query shows us that the asia-ne pool has an avgRttMs value of 67ms, whereas the other two pools have avgRttMs values that exceed 150ms. The lower avgRttMs value explains why traffic in Singapore is being routed to the asia-ne pool.

Figure 5

Notice how the query in Figure 4 uses the loadBalancingRequestsGroups schema, whereas the query in Figure 5 uses the loadBalancingRequests schema. loadBalancingRequestsGroups queries aggregate data over the requested query interval, whereas loadBalancingRequests provides granular information on individual requests. For those ready to get started, Cloudflare has written a helpful guide. The GraphQL website is also a great resource. We recommend you use an IDE like GraphiQL to make your queries. GraphiQL embeds the schema documentation into the IDE, autocompletes, saves your queries, and manages your custom headers, all of which help make the developer experience smoother.
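For a sense of what such a query might look like when run programmatically, here is a hedged Python sketch in the spirit of Figure 5. The schema names loadBalancingRequests and avgRttMs come from this post, but the exact argument names, filter fields, and response nesting below are assumptions, so check the GraphQL schema documentation (for example in GraphiQL) before relying on them.

```python
import requests

# Hedged sketch of a query in the spirit of Figure 5: inspect average RTT per pool
# for one load balancer and one Cloudflare location. Schema names loadBalancingRequests
# and avgRttMs are from the post; the argument and filter names are assumptions.
API_TOKEN = "YOUR_API_TOKEN"

query = """
query PoolLatency($zoneTag: String!) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      loadBalancingRequests(limit: 100,
                            filter: { lbName: "lb.example.com", coloCode: "SIN" }) {
        pools {
          poolName
          avgRttMs
        }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.cloudflare.com/client/v4/graphql",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"query": query, "variables": {"zoneTag": "YOUR_ZONE_ID"}},
)
print(resp.json())
```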

Conclusion

Now that the Load Balancing Analytics solution is live and available to all Pro, Business, and Enterprise customers, we’re excited for you to start using it! We’ve attached a survey to the Traffic overview page, and we’d love to hear your feedback.

Categories: Technology

Firewall Analytics: Now available to all paid plans

CloudFlare - Mon, 09/12/2019 - 15:16

Our Firewall Analytics tool enables customers to quickly identify and investigate security threats using an intuitive interface. Until now, this tool had only been available to our Enterprise customers, who have been using it to get detailed insights into their traffic and better tailor their security configurations. Today, we are excited to make Firewall Analytics available to all paid plans and share details on several recent improvements we have made.

All paid plans are now able to take advantage of these capabilities, along with several important enhancements we’ve made to improve our customers’ workflow and productivity.

Increased Data Retention and Adaptive Sampling

Previously, Enterprise customers could view 14 days of Firewall Analytics for their domains. Today we’re increasing that retention to 30 days, and again to 90 days in the coming months. Business and Professional plan zones will get 30 and 3 days of retention, respectively.

In addition to the extended retention, we are introducing adaptive sampling to guarantee that Firewall Analytics results are displayed in the Cloudflare Dashboard quickly and reliably, even when you are under a massive attack or otherwise receiving a large volume of requests.

Adaptive sampling works similarly to Netflix: when your internet connection runs low on bandwidth, you receive a slightly downscaled version of the video stream you are watching. When your bandwidth recovers, Netflix then upscales back to the highest quality available.

Firewall Analytics does this sampling on each query, ensuring that customers see the best precision available in the UI given the current load on the zone. When results are sampled, the sampling rate is displayed alongside the results.
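As a minimal sketch of the general idea behind reporting sampled results (illustrative only, not Cloudflare's implementation), a count observed over a sample is extrapolated by dividing it by the sampling rate:

```python
# Minimal sketch of how sampled analytics are usually extrapolated: when a query
# reads only a fraction of events, observed counts are scaled up by the sampling
# rate. Illustrative only; not Cloudflare's implementation.
def estimate_total(observed_count: int, sample_rate: float) -> int:
    """sample_rate is the fraction of events read, e.g. 0.1 for a 10% sample."""
    return round(observed_count / sample_rate)

print(estimate_total(observed_count=1_250, sample_rate=0.1))  # -> 12500 estimated events
```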

Event-Based Logging

As adoption of our expressive Firewall Rules engine has grown, one consistent ask we’ve heard from customers is for a more streamlined way to see all Firewall Events generated by a specific rule. Until today, if a malicious request matched multiple rules, only the last one to execute was shown in the Activity Log, requiring customers to click into the request to see if the rule they’re investigating was listed as an “Additional match”.

To streamline this process, we’ve changed how the Firewall Analytics UI interacts with the Activity Log. Customers can now filter by a specific rule (or any other criteria) and see a row for each event generated by that rule. This change also makes it easier to review all requests that would have been blocked by a rule by creating it in Log mode first before changing it to Block.

Challenge Solve Rates to help reduce False Positives

When our customers write rules to block undesired, automated traffic they want to make sure they’re not blocking or challenging desired traffic, e.g., humans wanting to make a purchase should be allowed but not bots scraping pricing.

To help customers determine what percent of CAPTCHA challenges returned to users may have been unnecessary, i.e., false positives, we are now showing the Challenge Solve Rate (CSR) for each rule. If you’re seeing rates higher than expected, e.g., for your Bot Management rules, you may want to relax the rule criteria. If the rate you see is 0% indicating that no CAPTCHAs are being solved, you may want to change the rule to Block outright rather than challenge.


Hovering over the CSR rate will reveal the number of CAPTCHAs issued vs. solved:

Exporting Firewall Events

Business and Enterprise customers can now export a set of 500 events from the Activity Log. The data exported are those events that remain after any selected filters have been applied.

Column Customization

Sometimes the columns shown in the Activity Log do not contain the details you want to see to analyze the threat. When this happens, you can now click “Edit Columns” to select the fields you want to see. For example, a customer diagnosing a Bot related issue may want to also view the User-Agent and the source country whereas a customer investigating a DDoS attack may want to see IP addresses, ASNs, Path, and other attributes. You can now customize what you’d like to see as shown below.


We would love to hear your feedback and suggestions, so feel free to reach out to us via our Community forums or through your Customer Success team.

If you’d like to receive more updates like this one directly to your inbox, please subscribe to our Blog!

Categories: Technology

Announcing deeper insights and new monitoring capabilities from Cloudflare Analytics

CloudFlare - Mon, 09/12/2019 - 15:15

This week we’re excited to announce a number of new products and features that provide deeper security and reliability insights, “proactive” analytics when there’s a problem, and more powerful ways to explore your data.

If you’ve been a user or follower of Cloudflare for a little while, you might have noticed that we take pride in turning technical challenges into easy solutions. Flip a switch or run a few API commands, and the attack you’re facing is now under control or your site is now 20% faster. However, this ease of use is even more helpful if it’s complemented by analytics. Before you make a change, you want to be sure that you understand your current situation. After the change, you want to confirm that it worked as intended, ideally as fast as possible.

Because of the front-line position of Cloudflare’s network, we can provide comprehensive metrics regarding both your traffic and the security and performance of your Internet property. And best of all, there’s nothing to set up or enable. Cloudflare Analytics is automatically available to all Cloudflare users and doesn’t rely on JavaScript trackers, meaning that our metrics include traffic from APIs and bots and are not skewed by ad blockers.

Here’s a sneak peek of the product launches. Look out for individual blog posts this week for more details on each of these announcements.

  • Product Analytics:
    • Today, we’re making Firewall Analytics available to Business and Pro plans, so that more customers understand how well Cloudflare mitigates attacks and handles malicious traffic. And we’re highlighting some new metrics, such as the rate of solved captchas (useful for Bot Management), and features, such as customizable reports to facilitate sharing and archiving attack information.
    • We’re introducing Load Balancing Analytics, which shows traffic flows by load balancer, pool, origin, and region, and helps explain why a particular origin was selected to receive traffic.
  • Monitoring:
    • We’re announcing tools to help you monitor your origin either actively or passively and automatically reroute your traffic to a different server when needed. Because Cloudflare sits between your end users and your origin, we can spot problems with your servers without the use of external monitoring services.
  • Data tools:
    • The product analytics we’ll be featuring this week use a new API behind the scenes. We’re making this API generally available, allowing you to easily build custom dashboards and explore all of your Cloudflare data the same way we do, so you can easily gain insights and identify and debug issues.
  • Account Analytics:
    • We’re releasing (in beta) a new dashboard that shows aggregated information for all of the domains under your account, allowing you to know what’s happening at a glance.

We’re excited to tell you about all of these new products in this week’s posts and would love to hear your thoughts. If you’re not already subscribing to the blog, sign up now to receive daily updates in your inbox.

Categories: Technology

Cloudflare’s Response to CSAM Online

CloudFlare - Fri, 06/12/2019 - 14:06

Responding to incidents of child sexual abuse material (CSAM) online has been a priority at Cloudflare from the beginning. The stories of CSAM victims are tragic, and bring to light an appalling corner of the Internet. When it comes to CSAM, our position is simple: We don’t tolerate it. We abhor it. It’s a crime, and we do what we can to support the processes to identify and remove that content.

In 2010, within months of Cloudflare’s launch, we connected with the National Center for Missing and Exploited Children (NCMEC) and started a collaborative process to understand our role and how we could cooperate with them. Over the years, we have been in regular communication with a number of government and advocacy groups to determine what Cloudflare should and can do to respond to reports about CSAM that we receive through our abuse process, or how we can provide information supporting investigations of websites using Cloudflare’s services.

Recently, 36 tech companies, including Cloudflare, received this letter from a group of U.S. Senators asking for more information about how we handle CSAM content. The Senators referred to influential New York Times stories published in late September and early November that conveyed the disturbing number of images of child sexual abuse on the Internet, with graphic detail about the horrific photos and how the recirculation of imagery retraumatizes the victims. The stories focused on shortcomings and challenges in bringing violators to justice, as well as efforts, or lack thereof, by a group of tech companies including Amazon, Facebook, Google, Microsoft, and Dropbox, to eradicate as much of this material as possible through existing processes or new tools like PhotoDNA that could proactively identify CSAM material.

We think it is important to share our response to the Senators (copied at the end of this blog post), talk publicly about what we’ve done in this space, and address what else we believe can be done.

How Cloudflare Responds to CSAM

From our work with NCMEC, we know that they are focused on doing everything they can to validate the legitimacy of CSAM reports and then work as quickly as possible to have website operators, platform moderators, or website hosts remove that content from the Internet. Even though Cloudflare is not in a position to remove content from the Internet for users of our core services, we have worked continually over the years to understand the best ways we can contribute to these efforts.

Addressing  Reports

The first prong of Cloudflare’s response to CSAM is proper reporting of any allegation we receive. Every report we receive about content on a website using Cloudflare’s services filed under the “child pornography” category on our abuse report page leads to three actions:

  1. We forward the report to NCMEC. In addition to the content of the report made to Cloudflare, we provide NCMEC with information identifying the hosting provider of the website, contact information for that hosting provider, and the origin IP address where the content at issue can be located.
  2. We forward the report to both the website operator and hosting provider so they can take steps to remove the content, and we provide the origin IP of where the content is located on the system so they can locate the content quickly. (Since 2017, we have given reporting parties the opportunity to file an anonymous report if they would prefer that either the host or the website operator not be informed of their identity).
  3. We provide anyone who makes a report information about the identity of the hosting provider and contact information for the hosting provider in case they want to follow up directly.

Since our founding, Cloudflare has forwarded 5,208 reports to NCMEC. Over the last three years, we have provided 1,111 reports in 2019 (to date), 1,417 in 2018, and 627 in 2017.  

Reports filed under the “child pornography” category account for about 0.2% of the abuse complaints Cloudflare receives. These reports are treated as the highest priority for our Trust & Safety team and they are moved to the front of the abuse response queue. We are generally able to respond by filing the report with NCMEC and providing the additional information within a matter of minutes regardless of time of day or day of the week.

Requests for Information

The second main prong of our response to CSAM is operation of our “trusted  reporter” program to provide relevant information to support the investigations of nearly 60 child safety organizations around the world. The "trusted reporter" program was established in response to our ongoing work with these organizations and their requests for both information about the hosting provider of the websites at issue as well as information about the origin IP address of the content at issue. Origin IP information, which is generally sensitive security information because it would allow hackers to circumvent certain security protections for a website, like DDoS protections, is provided to these organizations through dedicated channels on an expedited basis.

Like NCMEC, these organizations are responsible for investigating reports of CSAM on websites or hosting providers operated out of their local jurisdictions, and they seek the resources to identify and contact those parties as quickly as possible to have them remove the content. Participants in the “trusted reporter” program include groups like the Internet Watch Foundation (IWF), the INHOPE Foundation, the Australian eSafety Commission, and Meldpunt. Over the past five years, we have responded to more than 13,000 IWF requests, and more than 5,000 requests from Meldpunt. We respond to such requests on the same day, and usually within a couple of hours. In a similar way, Cloudflare also receives and responds to law enforcement requests for information as part of investigations related to CSAM or exploitation of a minor.


Among this group, the Canadian Centre for Child Protection has been engaged in a unique effort that is worthy of specific mention. The Centre’s Cybertip program operates their Project Arachnid initiative, a novel approach that employs an automated web crawler that proactively searches the Internet to identify images that match a known CSAM hash, and then alerts hosting providers when there is a match. Based on our ongoing work with Project Arachnid, we have responded to more than 86,000 reports by providing information about the hosting provider and the origin IP address, which we understand they use to contact that hosting provider directly with that report and any subsequent reports.

Although we typically process these reports within a matter of hours, we’ve heard from participants in our “trusted reporter” program that the non-instantaneous response from us causes friction in their systems. They want to be able to query our systems directly to get the hosting provider and origin IP  information, or better, be able to build extensions on their automated systems that could interface with the data in our systems to remove any delay whatsoever. This is particularly relevant for folks in the Canadian Centre’s Project Arachnid, who want to make our information a part of their automated system.  After scoping out this solution for a while, we’re now confident that we have a way forward and informed some trusted reporters in November that we will be making available an API that will allow them to obtain instantaneous information in response to their requests pursuant to their investigations. We expect this functionality to be online in the first quarter of 2020.

Termination of Services

Cloudflare takes steps in appropriate circumstances to terminate its services from a site when it becomes clear that the site is dedicated to sharing CSAM or if the operators of the website and its host fail to take appropriate steps to take down CSAM content. In most circumstances, CSAM reports involve individual images that are posted on user generated content sites and are removed quickly by responsible website operators or hosting providers. In other circumstances, when operators or hosts fail to take action, Cloudflare is unable on its own to delete or remove the content but will take steps to terminate services to the  website.  We follow up on reports from NCMEC or other organizations when they report to us that they have completed their initial investigation and confirmed the legitimacy of the complaint, but have not been able to have the website operator or host take down the content. We also work with Interpol to identify and discontinue services from such sites they have determined have not taken steps to address CSAM.

Based upon these determinations and interactions, we have terminated service to 5,428 domains over the past 8 years.

In addition, Cloudflare has introduced new products where we do serve as the host of content, and we would be in a position to remove content from the Internet, including Cloudflare Stream and Cloudflare Workers.  Although these products have limited adoption to date, we expect their utilization will increase significantly over the next few years. Therefore, we will be conducting scans of the content that we host for users of these products using PhotoDNA (or similar tools) that make use of NCMEC’s image hash list. If flagged, we will remove that content immediately. We are working on that functionality now, and expect it will be in place in the first half of 2020.

Part of an Organized Approach to Addressing CSAM

Cloudflare’s approach to addressing CSAM operates within a comprehensive legal and policy backdrop. Congress and the law enforcement and child protection communities have long collaborated on how best to combat the exploitation of children. Recognizing the importance of combating the online spread of CSAM, NCMEC first created the CyberTipline in 1998, to provide a centralized reporting system for members of the public and online providers to report the exploitation of children online.

In 2006, Congress conducted a year-long investigation and then passed a number of laws to address the sexual abuse of children. Those laws attempted to calibrate the various interests at stake and coordinate the ways various parties should respond. The policy balance Congress struck on addressing CSAM on the Internet had a number of elements for online service providers.

First, Congress formalized NCMEC’s role as the central clearinghouse for reporting and investigation, through the CyberTipline. The law adds a requirement, backed up by fines, for online providers to report any reports of CSAM to NCMEC. The law specifically notes that to preserve privacy, they were not creating a requirement to monitor content or affirmatively search or screen content to identify possible reports.

Second, Congress responded to the many stories of child victims who emphasized the continuous harm done by the transmission of imagery of their abuse. As described by NCMEC, “not only do these images and videos document victims’ exploitation and abuse, but when these files are shared across the internet, child victims suffer re-victimization each time the image of their sexual abuse is viewed” even when viewed for ostensibly legitimate investigative purposes. To help address this concern, the law directs providers to minimize the number of employees provided access to any visual depiction of child sexual abuse.  

Finally, to ensure that child safety and law enforcement organizations had the records necessary to conduct an investigation, the law directs providers to preserve not only the report to NCMEC, but also “any visual depictions, data, or other digital files that are reasonably accessible and may provide context or additional information about the reported material or person” for a period of 90 days.


Because Cloudflare’s services are used so extensively—by more than 20 million Internet properties, and based on data from W3Techs, more than 10% of the world’s top 10 million websites—we have worked hard to understand these policy principles in order to respond appropriately in a broad variety of circumstances. The processes described in this blogpost were designed to make sure that we comply with these principles, as completely and quickly as possible, and take other steps to support the system’s underlying goals.

Conclusion

We are under no illusion that our work in this space is done. We will continue to work with groups that are dedicated to fighting this abhorrent crime and provide tools to more quickly get them information to take CSAM content down and investigate the criminals who create and distribute it.

Cloudflare's Senate Response (PDF)


Categories: Technology

Thinking about color

CloudFlare - Thu, 05/12/2019 - 07:28
Color is my day-long obsession, joy and torment - Claude Monet

Over the last two years we’ve tried to improve our usage of color at Cloudflare. There were a number of forcing functions that made this work a priority. As a small team of designers and engineers we had inherited a bunch of design work that was a mix of values built by multiple teams. As a result it was difficult and unnecessarily time consuming to add new colors when building new components.

We also wanted to improve our accessibility. While we were doing pretty well, we had room for improvement, largely around how we used green. As our UI is increasingly centered around visualizations of large data sets we wanted to push the boundaries of making our analytics as visually accessible as possible.

Cloudflare had also undergone a rebrand around 2016. While our marketing site had rolled out an updated set of visuals, our product UI as well as a number of existing web properties were still using various versions of our old palette.

Cloudflare.com in 2016 & Cloudflare.com in 2019

Our product palette wasn’t well balanced by itself. Many colors had been chosen one or two at a time. You can see how we chose blueberry, ice, and water at a different point in time than marine and thunder.

The color section of our theme file was partially ordered chronologically

Lacking visual cohesion within our own product, we definitely weren’t providing a cohesive visual experience between our marketing site and our product. The transition from the nice blues and purples to our green CTAs wasn’t the streamlined experience we wanted to afford our users.

Cloudflare.com 2017 & Sign in page 2017
Our app dashboard in 2017

Reworking our Palette

Our first step was to audit what we already had. Cloudflare has been around long enough to have more than one website. Beyond cloudflare.com we have dozens of publicly accessible web properties, from our community forums, support docs, blog, and status page to numerous micro-sites.

All-in-all we have dozens of front-end codebases that each represent one more chance to introduce entropy to our visual language. So we were curious to answer the question - what colors were we currently using? Were there consistent patterns we could document for further reuse? Could we build a living style guide that didn’t cover just one site, but all of them?

Screenshots of pages from cloudflare.com contrasted with screenshots from our product in 2017

Our curiosity got the best of us and we went about exploring ways we could visualize our design language across all of our sites.

Above - our product palette. Below - our marketing palette.

A time machine for color

As we first started to identify the scale of our color problems, we tried to think outside the box on how we might explore the problem space. After an initial brainstorming session we combined the Internet Archive's Wayback Machine with the CSS Stats API to build an audit tool that shows how the visual properties of our various websites change over time. We can dynamically select which sites we want to compare and scrub through time to see changes.
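As a rough sketch of how the snapshot-gathering half of such an audit could work, the code below lists archived captures of a site from the Internet Archive's public CDX API and leaves the color-extraction step as a hypothetical placeholder, since the exact CSS Stats API calls aren't described here.

```python
import requests

# Sketch of one way to gather inputs for a historical color audit: list archived
# snapshots of a site via the Internet Archive's CDX API, then hand each archived
# URL to a color-extraction step. extract_palette is a hypothetical placeholder
# for the CSS Stats step, whose API is not described in this post.
def list_snapshots(site: str, year_from: int, year_to: int):
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={
            "url": site,
            "from": str(year_from),
            "to": str(year_to),
            "output": "json",
            "filter": "statuscode:200",
            "collapse": "timestamp:6",   # roughly one capture per month
        },
    )
    rows = resp.json()
    header, entries = rows[0], rows[1:]
    ts_idx = header.index("timestamp")
    return [f"https://web.archive.org/web/{row[ts_idx]}/{site}" for row in entries]

def extract_palette(archived_url: str):
    # Placeholder: fetch the archived page's CSS and pull out its color values,
    # e.g. via the CSS Stats API or a local CSS parser.
    raise NotImplementedError

for url in list_snapshots("cloudflare.com", 2013, 2019)[:5]:
    print(url)
```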

Below is a visualization of palettes from 9 different websites changing over a period of 6 years. Above the palettes is a component that spits out common colors, across all of these sites. The only two common colors across all properties (appearing for only a brief flash) were #ffffff (white) and transparent. Over time we haven’t been very consistent with ourselves.


If we drill in to look at our marketing site compared to our dashboard app - it looks like the video below. We see a bit more overlap at first and then a significant divergence at the 16 second mark when our product palette grew significantly. At the 22 second mark you can see the marketing palette completely change as a result of the rebrand while our product palette stays the same. As time goes on you can see us becoming more and more inconsistent across the two code bases.


As a product team we had some catching up to do to improve our usage of color and to align ourselves with the company brand. The good news was, there was nowhere to go but up.

This style of historical audit gives us a visual indication with real data. We can visualize for stakeholders how consistent and similar our usage of color is across products, and whether we are getting better or worse over time. Having this type of feedback loop was invaluable for us, as auditing this manually is incredibly time consuming and so often doesn't get done. Hopefully in the future, as it becomes standard to track various performance metrics over time at a company, it will also be standard to visualize your current levels of design entropy.

Picking colors

After our initial audit revealed there wasn’t a lot of consistency across sites, we went to work to try and construct a color palette that could potentially be used for sites the product team owned. It was time to get our hands dirty and start “picking colors.”

Hindsight of course is always 20/20. We didn’t start out on day one trying to generate scales based on our brand palette. No, our first bright idea was to generate the entire palette from a single color.

Our logo is made up of two oranges. Both of these seemed like prime candidates to generate a palette from.


We played around with a number of algorithms that took a single color and created a palette. From the initial color we generated an array of scales for each hue. Initial attempts found us applying the exact same curves for luminosity to each hue, but as visual perception of hues is so different, this resulted in wildly different contrasts at each step of the scale.
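A minimal sketch of that first approach is below: take one seed color, derive a set of hues from it, and apply the same lightness curve to every hue. The seed value and curve are made up for illustration, and, as discussed above and below, the identical curve is exactly what produces uneven contrast across hues.

```python
import colorsys

# Minimal sketch of the first approach described above: start from a single seed
# color, derive a set of hues, and apply the *same* lightness curve to every hue.
# Because perceived brightness differs per hue, the resulting scales will not have
# matching contrast at each step, which is the problem discussed in the post.
SEED = "#f38020"          # illustrative orange seed; not necessarily the exact brand value
LIGHTNESS_CURVE = [0.95, 0.85, 0.72, 0.58, 0.45, 0.34, 0.24, 0.15]  # made-up curve

def hex_to_hls(hex_color):
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    return colorsys.rgb_to_hls(r, g, b)

def hls_to_hex(h, l, s):
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return "#{:02x}{:02x}{:02x}".format(*(round(c * 255) for c in (r, g, b)))

def generate_palette(seed_hex, hue_count=10):
    h, _, s = hex_to_hls(seed_hex)
    hues = [(h + i / hue_count) % 1.0 for i in range(hue_count)]
    return {
        f"hue-{i}": [hls_to_hex(hue, l, s) for l in LIGHTNESS_CURVE]
        for i, hue in enumerate(hues)
    }

for name, scale in generate_palette(SEED).items():
    print(name, scale)
```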

Below are a few of our initial attempts at palette generation. Jeeyoung Jung did a brilliant writeup around designing palettes last year.

Visualizing peaks of intensity across hues

We can see the intensity of the colors change across hues in peaks, with yellow and green being the most dominant. One of the downsides of this is that when you are rapidly iterating through theming options, the inconsistent relationships between steps across hues can make it time consuming or impossible to keep visual harmony in your interface.

The video below is another way to visualize this phenomenon. The dividing line in the color picker indicates which part of the palette will be accessible with black and white. Notice how drastically the line changes around green and yellow. And then look back at the charts above.

[Video: Demo of https://kevingutowski.github.io/color.html]

After fiddling with a few different generative algorithms (we made a lot of ugly palettes…) we decided to try a more manual approach. We pursued creating custom curves for each hue in an effort to keep the contrast scales as optically balanced as possible.

[Images: a selection of generated palettes, including a heavily muted palette]

Generating different color palettes makes you confront a basic question. How do you tell if a palette is good? Are some palettes better than others? In an effort to answer this question we constructed various feedback loops to help us evaluate palettes as quickly as possible. We tried a few methods to stress test a palette. At first we attempted to grab the “nearest color” for a bunch of our common UI colors. This wasn't always helpful as sometimes you actually want the step above or below the closest existing color. But it was helpful to visualize for a few reasons.

[Image: Generated palette above a set of components previewing the old and new palette for comparison]

Sometime during our exploration in this space, we stumbled across this tweet thread about building a palette for pixel art. There are a lot of places web and product designers can draw inspiration from game designers.

[Images: Two color palettes visualized to create 3d objects & a color palette applied in a few different contexts]

Here we see a similar concept where a number of different palettes are applied to the same component. This view shows us two things: the different ways a single palette can be applied to a sphere, and the different aesthetics across color palettes.

[Image: Different color palettes previewed against a common component]

It's almost surprising that the default way to construct a color palette for apps and sites isn't to build it while previewing its application against the most common UI patterns. As designers, there are a lot of consistent uses of color we could have baselines for. Many enterprise apps are centered around a white background with blue as the primary color, with mixtures of grays to add depth around cards and page sections. Red is often used for destructive actions like deleting some type of record. Gray for secondary actions. Maybe it's an outline button with the primary color for secondary actions. Either way - the margins between the patterns aren't that large in the grand scheme of things.

Consider the use case of designing UI while the palette or usage of color hasn't been established. Given a single palette, you might want to experiment with applying that palette in a variety of ways that will output a wide variety of aesthetics. Alternatively you may need to test out several different palettes. These are two different modes of exploration that can be extremely time consuming to work through. It can be non-trivial to keep an in-progress design synced with several different options for color application, even with the best use of layer comps or symbols.

How do we visualize the various ways a palette will look when applied to an interface? Here are examples of how palettes are shown on a palette list for pixel artists.

[Images: palette previews from https://lospec.com/palette-list/sweetie-16 & https://lospec.com/palette-list/vines-flexible-linear-ramps]

One method of visualization is to define a common set of primitive UI elements and show each one of them with a single set of colors applied. In isolation this can be helpful. This mode would make it easy to vet a single combination of colors and which UI elements it might be best applied to.

Alternatively we might want to see a composed interface with the closest colors from the palette applied. Consider a set of buttons that includes red, green, blue, and gray button styles. Seeing all of these together can help us visualize the relative nature of these buttons side by side. Given a baseline palette for common UI, we could swap to a new palette and replace each color with the "closest" color. This isn't always a foolproof solution as there are many edge cases to cover, e.g. what happens when replacing a palette of 134 colors with a palette of 24 colors? Even still, this could allow us to quickly take a stab at automating how existing interfaces would change their appearance given a change to the underlying system. Whether locally or against a live site, this mode of working would allow designers to view a color in multiple contexts to truly assess its quality.


After moving on from the idea of generating a palette from a single color, we attempted to use our logo colors as well as our primary brand colors to drive the construction of modular scales. Our goal was to create a palette that would improve contrast for accessibility, stay true to our visual brand, work predictably for developers, work for data visualizations, and provide the ability to design visually balanced and attractive interfaces. No sweat.

[Image: Brand colors showing hue and saturation level]

While we knew going in we might not use every step in every hue, we wanted full coverage across the spectrum so that each hue had a consistent optical difference between each step. We also had no idea which steps across which hues we were going to need just yet. As they would just be variables in a theme file it didn’t add any significant code footprint to expose the full generated palette either.

One of the more difficult parts was deciding on the number of steps for the scales. A consistent number of steps would allow us to edit the palette in the future to a variety of aesthetics and swap the palette out at the theme level without needing to update anything else.

In the future, if we did need to augment the available colors, we could edit the entire palette instead of adding a one-off addition, as we had found one-off additions a difficult way to work over time. In addition to our primary brand colors we also explored adding scales for yellow/gold, violet, and teal, as well as a gray scale.
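As a rough illustration of what "just variables in a theme file" means in practice (hue names and hex values below are placeholders, not our actual palette), every hue gets the same number of steps so a palette swap never requires touching component code:

// Illustrative theme module - values are made up; 10 steps per hue,
// index 0 = lightest, index 9 = darkest.
export const colors = {
  gray:   ['#f7f7f8', '#e2e3e5', '#c9cbce', /* ...six more steps... */ '#17191b'],
  blue:   ['#f0f6fd', '#d8e9fa', '#b6d5f5', /* ...six more steps... */ '#072544'],
  orange: ['#fff4ea', '#ffe3c7', '#ffcc9c', /* ...six more steps... */ '#3f1e02'],
}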

The first interface we built for this work output all of the scales vertically, with their contrast scores against both white and black on the right-hand side. To aid scannability we bolded the values that were above the 4.5 threshold. As we edited the curves, we could see how the contrast ratios were affected at each step. Below you can see an early starting point before the scales were balanced. Red has 6 accessible combos with white, while yellow only has 1. We initially explored having the gray scale be larger than the others.

[Image: Early iteration of palette preview during development]
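The 4.5 figure above is the WCAG threshold for normal text; a rough sketch of the check being run at every step looks something like this (standard WCAG 2.x math, sample colors arbitrary):

// Relative luminance of a hex color per WCAG 2.x
function luminance(hex) {
  const [r, g, b] = [1, 3, 5]
    .map(i => parseInt(hex.slice(i, i + 2), 16) / 255)
    .map(c => (c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4)))
  return 0.2126 * r + 0.7152 * g + 0.0722 * b
}

// Contrast ratio between two hex colors, from 1 to 21
function contrast(a, b) {
  const [hi, lo] = [luminance(a), luminance(b)].sort((x, y) => y - x)
  return (hi + 0.05) / (lo + 0.05)
}

// A step is an accessible combo with white for normal text if it clears 4.5:1
contrast('#ffffff', '#0e3a68') >= 4.5 // true for a sufficiently dark blue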

As both screen luminosity and ambient light can affect perception of color we developed on two monitors, one set to maximum and one set to minimum brightness levels. We also replicated the color scales with a grayscale filter immediately below to help illustrate visual contrast between steps AND across hues. Bouncing back and forth between the grayscale and saturated version of the scale serves as a great baseline reference. We found that going beyond 10 steps made it difficult to keep enough contrast between each step to keep them distinguishable from one another.

[Images: Monitors set to different luminosity & a close up of the visual difference]

Taking a page from our game design friends - as we were balancing the scales and exploring how many steps we wanted in the scales, we were also stress testing the generated colors against various analytics components from our component library.

Our slightly random collection of grays had been a particular pain point as they appeared muddy in a number of places within our interface. For our new palette we used the slightest hint of blue to keep our grays consistent and just a bit off from being purely neutral.

[Image: Optically balanced scales]

With a palette consisting of 90 colors, the amount of combinations and permutations that can be applied to data visualizations is vast and can result in a wide variety of aesthetic directions. The same palette applied to both line and bar charts with different data sets can look substantially different, enough that they might not be distinguishable as being the exact same palette. Working with some of our engineering counterparts, we built a pipeline that would put up the same components rendered against different data sets, to simulate the various shapes and sizes the graph elements would appear in. This allowed us to rapidly test the appearance of different palettes. This workflow gave us amazing insights into how a palette would look in our interface. No matter how many hours we spent staring at a palette, we couldn't get an accurate sense of how the colors would look when composed within an interface.

[Images: Full theme minus green & blues and oranges; grayscale example & monochromatic indigo; analytics charts with blues and purples & analytics charts with blues and greens; analytics charts with blues and oranges. Telling the colors of the lines apart is a different visual experience than separating out the dots in sequential order as they appear in the legend.]

We experimented with a number of ideas on visualizing different sizes and shapes of colors and how they affected our perception of how much a color was changing element to element. In the first frame it is most difficult to tell the values at 2% and 6% apart given the size and shape of the elements.

[Image: Stress testing the application of a palette to many shapes and sizes]

We’ve begun to package up some of this work into a web app others can use to create or import a palette and preview multiple depths of accessible combinations against a set of UI elements.

The goal is to make it easier for anyone to work seamlessly with color and build beautiful interfaces with accessible color contrasts.

[Image: Color by Cloudflare Design]

In an effort to make sure everything we are building will be visually accessible - we built a React component that will preview how a design would look if you were colorblind. The component overlays SVG filters to simulate alternate ways someone can perceive color.

[Image: Analytics component previewed against 8 different types of color blindness]

While this is previewing an analytics component, really any component or page can be previewed with this method.

import React from "react"

// The SVG filter ids in /filters.svg, one per type of color vision deficiency
const filters = [
  'achromatopsia',
  'protanomaly',
  'protanopia',
  'deuteranomaly',
  'deuteranopia',
  'tritanomaly',
  'tritanopia',
  'achromatomaly',
]

// Renders its children once per filter, each copy wrapped in a div that
// applies the corresponding SVG filter to simulate that deficiency
const ColorBlindFilter = ({ itemPadding, itemWidth, ...props }) => {
  return (
    <div {...props}>
      {filters.map((filter, i) => (
        <div
          style={{ filter: 'url(/filters.svg#' + filter + ')' }}
          width={itemWidth}
          px={itemPadding}
          key={i + filter}
        >
          {props.children}
        </div>
      ))}
    </div>
  )
}

ColorBlindFilter.defaultProps = {
  display: 'flex',
  justifyContent: 'space-around',
  flexWrap: 'wrap',
  width: 1,
  itemWidth: 1/4,
}

export default ColorBlindFilter

We’ve also released a Figma plugin that simulates this visualization for a component.

After quite a few iterations, we had finally come up with a color palette. Each scale was optically aligned with our brand colors. The 5th step in each scale is the closest to the original brand color, but adjusted slightly so it’s accessible with both black and white.

[Image: Our preview panel for palette development, showing a fully desaturated version of the palette for reference]

Lyft's writeup "Re-approaching color" and Jeeyoung Jung's "Designing Systematic Colors" are some of the best write-ups on how to work with color at scale you can find.

Color migrations

[Image: A visual representation of how the legacy palette colors would translate to the new scales.]

Getting a team of people to agree on a new color palette is a journey in and of itself. By the time you get everyone to consensus it’s tempting to just collapse into a heap and never think about colors ever again. Unfortunately the work doesn’t stop at this point. Now that we’ve picked our palette, it’s time to get it implemented so this bike shed is painted once and for all.

If you are porting an old legacy part of your app to be updated to the new style guide like we were, even the best color documentation can fall short in helping someone make the necessary changes.

We found it was more common than expected that engineers and designers wanted to know what the new version of a color they were familiar with was. During the transition between palettes we had an interface where people could input any color and get the closest color within our palette.

There are times when migrating colors that the closest color isn't actually what you want. Given the scenario where your brand color has changed from blue to purple, you might want to port all blues to the closest purple within the palette, not to the closest blues, which might still exist in your palette. To help visualize migrations as well as get suggestions on how to consolidate values within the old scale, we of course built a little tool. Here we can define those translations and import a color palette from a URL. As we still have a number of web properties to update to our palette, this simple tool has continued to prove useful.
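A rough sketch of the idea (a simplification for illustration, not the actual tool): fall back to the nearest color by RGB distance, but let an explicit override map win, which covers the blue-to-purple scenario above.

// Simplified color migration helper (illustrative only).
const toRgb = hex => [1, 3, 5].map(i => parseInt(hex.slice(i, i + 2), 16))

// Plain RGB distance is crude - a perceptual metric such as CIEDE2000 would
// track how different the colors actually look - but it keeps the sketch short.
const distance = (a, b) =>
  toRgb(a).reduce((sum, v, i) => sum + (v - toRgb(b)[i]) ** 2, 0)

function migrate(oldColor, newPalette, overrides = {}) {
  if (overrides[oldColor]) return overrides[oldColor]
  return newPalette.reduce((best, candidate) =>
    distance(oldColor, candidate) < distance(oldColor, best) ? candidate : best
  )
}

// Force a legacy blue onto the new purple scale, let everything else snap
// to its nearest neighbour in the new palette.
migrate('#2f7bbf', ['#6b46c1', '#805ad5', '#9f7aea'], { '#2f7bbf': '#805ad5' })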


We wanted to be as gentle as possible in transitioning usage over to the new palette. While the developers found string names for colors brittle and unpredictable, it was still a more familiar system for some than this new one. We first just added our new palette to the existing theme for usage moving forward. Then we started to port colors for existing components and pages.

For our colleagues, we wrote out desired translations and offered warnings in the console that a color was deprecated, with a reference to the new theme value to use.

[Images: Example of console warning when using deprecated color & example of how to check for usage of deprecated values]
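One lightweight way to implement that kind of warning is to wrap the legacy color map in a Proxy - a sketch of the general technique (color names and replacements here are invented, and this is not necessarily how our theme does it):

// Any read of a deprecated color name logs a pointer to its replacement.
const legacyColors = { marine: '#11365c', grassGreen: '#3faa44' }
const replacements = { marine: 'blue.9', grassGreen: 'green.5' }

const colors = new Proxy(legacyColors, {
  get(target, name) {
    if (name in replacements) {
      console.warn(`color "${String(name)}" is deprecated, use theme value "${replacements[name]}" instead`)
    }
    return target[name]
  },
})

colors.marine // warns, but still returns '#11365c' so nothing breaks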

While we had a few bugs along the way, the team was supportive and helped us fix bugs almost as quickly as we could find them.

We're still in the process of updating our web properties with our new palette, largely prioritizing accessibility first while trying to create a more consistent visual brand as a nice by-product of the work. A small example of this is our system status page. In the first image, the blue links in the header, the green status bar, and the about copy were all inaccessible against their backgrounds.

[Images: cloudflarestatus.com in 2017 & cloudflarestatus.com in 2019 with an accessible green]

A lot of the changes have been subtle. Most notably, the green we use in the dashboard is a lot more in line with our brand colors than before. In addition we've also been able to add visual balance by not just using straight black text on background colors. Here we added one of the darker steps from the corresponding scale, to give it a bit more visual balance.

[Images: Notifications 2017 - black text & Notifications 2018 - updated palette, with text set to the 1st step in the corresponding scale of the background color; example page within our Dashboard in 2017 vs 2019]

While we aren’t perfect yet, we’re making progress towards more visual cohesion across our marketing materials and products.

[Images: Cloudflare.com & sign in page in 2017; our app dashboard in 2017; Cloudflare homepage & updated Dashboard in 2019]

Next steps

Trying to keep dozens of sites all using the same palette in a consistent manner across time is a task that you can never complete. It’s an ongoing maintenance problem. Engineers familiar with the color system leave, new engineers join and need to learn how the system works. People still launch sites using a different palette that doesn’t meet accessibility standards. Our work continues to be cut out for us. As they say, a garden doesn’t tend itself.

If we do ever revisit our brand colors, we're excited to have infrastructure in place to update our apps and several of our satellite sites with significantly less effort than our first time around.

Resources

Some of our favorite materials and resources we found while exploring this problem space.

Apps · Writing · Code · Videos

Categories: Technology

Drupal 8.8.0 is available

Drupal - Wed, 04/12/2019 - 10:08
What’s new in Drupal 8.8.0?

The last normal feature release of Drupal 8 includes a stable Media Library as well as several improvements to workspaces and migrations. The new experimental Claro administration theme brings a fresh look to site management. This is also the first release to come with native Composer support.

Download Drupal 8.8.0

Stable Media Library

The Media Library module allows easy reuse of images, documents, videos, and other assets across the site. It is integrated into content forms and seamlessly fits into CKEditor. You can upload media right from the library and even reuse a combination of uploaded and existing media. Media Library was previously included with Drupal core as a beta experimental module.

New experimental administration theme

The Claro administration theme was added to Drupal core with beta experimental stability. The new theme is clean, accessible, and powerful. Administration pages are more touch-friendly, and color combinations and contrasts are more accessible.

Significant improvements to Workspaces

It is now possible to define hierarchical workspaces (such as preparing a "New Year's" issue for a magazine under the "winter issue", while both receive changes to be deployed). Workspaces can now work with Content Moderation, and path alias changes can also be staged.

Native Composer support included

Drupal 8.8.0 is the first release to include native Composer support without reliance on third-party projects to set up Drupal with its dependencies. New sites can be created using a one-line command.

Migration improvements

The multilingual migration path is still experimental, but has received various updates. This includes handling of vocabulary language settings, term language information, and localization. Modules can now specify whether migrations provided by them are finished or not finished to help audit completeness of available migrations.

New experimental Help Topics module

The existing help system is module based, whereas users intend to complete tasks, not use modules. A new task-based Help Topics beta experimental module has been added to bring in-Drupal help to the next level.

The way to Drupal 9

Drupal 8.8 is the last minor release of Drupal 8 to include significant new features or deprecations prior to 9.0.0. The next (and final) minor release, 8.9, is planned to be a long-term support release that will include all the same changes as Drupal 9.0. It will not contain significant new features compared to 8.8.0, although existing experimental modules may become stable, and small API and UX improvements can still be added.

Drupal 8.9.0's planned release date is June 3, 2020, and our target release date for Drupal 9.0.0 is the same day. Most Drupal 9 preparation steps can be done on your Drupal 8 site, custom code and contributed modules now.

What does this mean for me? Drupal 8 site owners

Update to 8.8.0 to continue receiving bug fixes and prepare for 9.0.0 (or 8.9.0). The next bug-fix release (8.8.1) is scheduled for January 8, 2020. (See the release schedule overview for more information.) As of this release, sites on Drupal 8.6 will no longer receive security coverage. (Drupal 8.7 will continue receiving security fixes until June 3, 2020.)

Note that all Drupal 8.8.0 sites (new installs and updates) now require at least PHP 7.0.8.

Updating your site from 8.7.10 to 8.8.0 with update.php is exactly the same as updating from 8.7.8 to 8.7.9. Drupal 8.8.0 also has updates to several dependencies. Modules, themes, and translations may need updates for these and other changes in this minor release, so test the update carefully before updating your production site. Read the 8.8.0 release notes for a full list of changes that may affect your site.

Drupal 7 site owners

Drupal 7 is fully supported by the community until November 2021, and will continue to receive bug and security fixes throughout this time. From November 2021 until at least November 2024, the Drupal 7 Vendor Extended Support program will be offered by vendors.

The migration path for monolingual Drupal 7 sites is stable, as is the built-in migration user interface. For multilingual sites, most outstanding issues have been resolved. Please keep testing and reporting any issues you may find.

Translation, module, and theme contributors

Minor releases like Drupal 8.8.0 include backwards-compatible API additions for developers as well as new features.

Since minor releases are backwards-compatible, modules, themes, and translations that supported Drupal 8.7.x and earlier will be compatible with 8.8.x as well. However, the new version does include some changes to strings, user interfaces, internal APIs and API deprecations. This means that some small updates may be required for your translations, modules, and themes. Read the 8.8.0 release notes for a full list of changes that may affect your modules and themes.

This release has advanced the Drupal project significantly and represents the efforts of hundreds of volunteers and contributors from various organizations, as well as testers from the Minor release beta testing program. Thank you to everyone who contributed to Drupal 8.8.0!

Categories: Technology

Great Southern Homes Sees Greater Results with Drupal 8

Drupal - Tue, 03/12/2019 - 06:05
Great Southern Homes

Completed Drupal site or project URL: https://www.greatsouthernhomes.com/

Great Southern Homes is one of the fastest growing home builders in the United States, consistently delivering dream homes to their customers with a focus on quality, affordability and energy-efficiency.

Categories: Technology

The Serverlist: Full Stack Serverless, Serverless Architecture Reference Guides, and more

CloudFlare - Mon, 02/12/2019 - 20:29

Check out our tenth edition of The Serverlist below. Get the latest scoop on the serverless space, get your hands dirty with new developer tutorials, engage in conversations with other serverless developers, and find upcoming meetups and conferences to attend.

Sign up below to have The Serverlist sent directly to your mailbox.

Categories: Technology

A History of HTML Parsing at Cloudflare: Part 2

CloudFlare - Fri, 29/11/2019 - 08:00

The second blog post in the series on HTML rewriters picks up the story in 2017, after the launch of the Cloudflare edge compute platform Cloudflare Workers. It became clear that the developers using Workers wanted the same HTML rewriting capabilities that we used internally, but accessible via a JavaScript API.

This blog post describes the building of a streaming HTML rewriter/parser with a CSS-selector based API in Rust. It is used as the back-end for the Cloudflare Workers HTMLRewriter. We have open-sourced the library (LOL HTML) as it can also be used as a stand-alone HTML rewriting/parsing library.

The major change compared to LazyHTML, the previous rewriter, is the dual-parser architecture required to overcome the additional performance overhead of wrapping/unwrapping each token when propagating tokens to the Workers runtime. The remainder of the post describes a CSS selector matching engine inspired by a Virtual Machine approach to regular expression matching.

v2 : Give it to everyone and make it faster

In 2017, Cloudflare introduced an edge compute platform - Cloudflare Workers. It was no surprise that customers quickly required the same HTML rewriting capabilities that we were using internally. Our team was impressed with the platform and decided to migrate some of our features to Workers. The goal was to improve our developer experience working with modern JavaScript rather than statically linked NGINX modules implemented in C with a Lua API.

It is possible to rewrite HTML in Workers, though for that you need a third party JavaScript package (such as Cheerio). These packages are not designed for HTML rewriting on the edge due to the latency, speed and memory considerations described in the previous post.

JavaScript is really fast but it still can't always produce performance comparable to native code for some tasks - parsing being one of those. Customers typically needed to buffer the whole content of the page to do the rewriting, resulting in considerable output latency and memory consumption that often exceeded the memory limits enforced by the Workers runtime.

We started to think about how we could reuse the technology in Workers. LazyHTML was a perfect fit in terms of parsing performance, but it had two issues:

  1. API ergonomics: LazyHTML produces a stream of HTML tokens. This is sufficient for our internal needs. However, for an average user, it is not as convenient as the jQuery-like API of Cheerio.
  2. Performance: Even though LazyHTML is tremendously fast, integration with the Workers runtime adds even more limitations. LazyHTML operates as a simple parse-modify-serialize pipeline, which means that it produces tokens for the whole content of the page. All of these tokens then have to be propagated to the Workers runtime and wrapped inside a JavaScript object and then unwrapped and fed back to LazyHTML for serialization. This is an extremely expensive operation which would nullify the performance benefit of LazyHTML.
[Diagrams: LazyHTML with V8 & LOL HTML]

We needed something new, designed with Workers requirements in mind, using a language with the native speed and safety guarantees (it’s incredibly easy to shoot yourself in the foot doing parsing). Rust was the obvious choice as it provides the native speed and the best guarantee of memory safety which minimises the attack surface of untrusted input. Wherever possible the Low Output Latency HTML rewriter (LOL HTML) uses all the previous optimizations developed for LazyHTML such as tag name hashing.

Dual-parser architecture

Most developers are familiar with, and prefer to use, CSS selector-based APIs (as in Cheerio, jQuery or the DOM itself) for HTML mutation tasks. We decided to base our API on CSS selectors as well. Although this meant additional implementation complexity, the decision created even more opportunities for parsing optimizations.
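For a sense of what that selector-based API looks like from the Workers side, here is a minimal handler in the style of the examples in the Workers documentation (the selectors and the attribute-rewriting logic are purely illustrative):

// Rewrite http:// URLs to https:// in selected attributes as the response streams through.
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

class AttributeRewriter {
  constructor(attributeName) {
    this.attributeName = attributeName
  }
  element(element) {
    const value = element.getAttribute(this.attributeName)
    if (value && value.startsWith('http://')) {
      element.setAttribute(this.attributeName, value.replace('http://', 'https://'))
    }
  }
}

async function handle(request) {
  const response = await fetch(request)
  return new HTMLRewriter()
    .on('img[src]', new AttributeRewriter('src'))
    .on('script[src]', new AttributeRewriter('src'))
    .on('a[href]', new AttributeRewriter('href'))
    .transform(response)
}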

As selectors define the scope of the content that should be rewritten, we realised we can skip the content that is not in this scope and not produce tokens for it. This not only significantly speeds up the parsing itself, but also avoids the performance burden of the back and forth interactions with the JavaScript VM. As ever the best optimization is not to do something.


Considering the tasks required, LOL HTML’s parser consists of two internal parsers:

  • Lexer - a regular full parser, that produces output for all types of content that it encounters;
  • Tag scanner - looks for start and end tags and skips parsing the rest of the content. The tag scanner parses only the tag name and feeds it to the selector matcher. The matcher will switch the parser to the lexer if there is a match or if additional information about the tag (such as attributes) is required for matching.

The parser switches back to the tag scanner as soon as input leaves the scope of all selector matches. The tag scanner may also sometimes switch the parser to the Lexer - if it requires additional tag information for the parsing feedback simulation.

[Diagram: LOL HTML architecture]

Having two different parser implementations for the same grammar will increase development costs and is error-prone due to implementation inconsistencies. We minimize these risks by implementing a small Rust macro-based DSL which is similar in spirit to Ragel. The DSL program describes Nondeterministic finite automaton states and actions associated with each state transition and matched input byte.

An example of a DSL state definition:

tag_name_state {
    whitespace => ( finish_tag_name?; --> before_attribute_name_state )
    b'/'       => ( finish_tag_name?; --> self_closing_start_tag_state )
    b'>'       => ( finish_tag_name?; emit_tag?; --> data_state )
    eof        => ( emit_raw_without_token_and_eof?; )
    _          => ( update_tag_name_hash; )
}

The DSL program gets expanded by the Rust compiler into not quite as beautiful, but extremely efficient Rust code.

We no longer need to reimplement the code that drives the parsing process for each of our parsers. All we need to do is to define different action implementations for each. In the case of the tag scanner, the majority of these actions are a no-op, so the Rust compiler does the NFA optimization job for us: it optimizes away state branches with no-op actions and even whole states if all of the branches have no-op actions. Now that’s cool.

Byte slice processing optimisations

Moving to a memory-safe language provided new challenges. Rust has great memory safety mechanisms, however sometimes they have a runtime performance cost.

The task of the parser is to scan through the input and find the boundaries of lexical units of the language - tokens and their internal parts. For example, an HTML start tag token consists of multiple parts: a byte slice of input that represents the tag name and multiple pairs of input slices that represent attributes and values:

struct StartTagToken<'i> {
    name: &'i [u8],
    attributes: Vec<(&'i [u8], &'i [u8])>,
    self_closing: bool
}

As Rust uses bounds checks on memory access, construction of a token might be a relatively expensive operation. We need to be capable of constructing thousands of them in a fraction of a second, so every CPU instruction counts.

Following the principle of doing as little as possible to improve performance we use a “token outline” representation of tokens: instead of having memory slices for token parts we use numeric ranges which are lazily transformed into a byte slice when required.

struct StartTagTokenOutline {
    name: Range<usize>,
    attributes: Vec<(Range<usize>, Range<usize>)>,
    self_closing: bool
}

As you might have noticed, with this approach we are no longer bound to the lifetime of the input chunk which turns out to be very useful. If a start tag is spread across multiple input chunks we can easily update the token that is currently in construction, as new chunks of input arrive by just adjusting integer indices. This allows us to avoid constructing a new token with slices from the new input memory region (it could be the input chunk itself or the internal parser’s buffer).

This time we can't get away with avoiding the conversion of input character encoding; we expose a user-facing API that operates on JavaScript strings, and input HTML can be of any encoding. Luckily, we can still parse without decoding and only encode and decode within token bounds on request (though we still can't do that for UTF-16 encoding).

So, when a user requests an element’s tag name in the API, internally it is still represented as a byte slice in the character encoding of the input, but when provided to the user it gets dynamically decoded. The opposite process happens when a user sets a new tag name.

For selector matching we can still operate on the original encoding representation - because we know the input encoding ahead of time we preemptively convert values in a selector to the page’s character encoding, so comparisons can be done without decoding fields of each token.

As you can see, the new parser architecture along with all these optimizations produced great performance results:

[Chart: Average parsing time depending on the input size - lower is better]

LOL HTML's tag scanner is typically twice as fast as LazyHTML and the lexer has comparable performance, outperforming LazyHTML on bigger inputs. Both are a few times faster than the tokenizer from html5ever - another parser implemented in Rust, used in Mozilla's Servo browser engine.

CSS selector matching VM

With an impressively fast parser on our hands we had only one thing missing - the CSS selector matcher. Initially we thought we could just use Servo’s CSS selector matching engine for this purpose. After a couple of days of experimentation it turned out that it is not quite suitable for our task.

It did not work well with our dual parser architecture. We first need to match just a tag name from the tag scanner, and then, if we fail, query the lexer for the attributes. The selectors library wasn't designed with this architecture in mind so we needed ugly hacks to bail out from matching in case of insufficient information. It was inefficient as we needed to start matching again after the bailout, doing twice the work. There were other problems, such as the integration of lazy character decoding and integration of tag name comparison using tag name hashes.

Matching direction

The main problem encountered was the need to backtrack all the open elements for matching. Browsers match selectors from right to left and traverse all ancestors of an element. This StackOverflow thread has a good explanation of why they do it this way. We would need to store information about all open elements and their attributes - something that we can't do while operating with tight memory constraints. This matching approach would be inefficient for our case - unlike browsers, we expect to have just a few selectors and a lot of elements. In this case it is much more efficient to match selectors from left to right.

And this is when we had a revelation. Consider the following CSS selector:

body > div.foo img[alt] > div.foo ul

It can be split into individual components attributed to a particular element with hierarchical combinators in between:

body > div.foo img[alt] > div.foo ul
----   ------- --------   ------- --

Each component is easy to match having a start tag token - it’s just a matter of comparison of token fields with values in the component. Let’s dive into abstract thinking and imagine that each such component is a character in the infinite alphabet of all possible components:

Selector component    Character
body                  a
div.foo               b
img[alt]              c
ul                    d

Let’s rewrite our selector with selector components replaced by our imaginary characters:

a > b c > b d

Does this remind you of something?

A `>` combinator denotes a child element, and can be read as "immediately followed by".

The ` ` (space) combinator denotes a descendant element, and can be read as "there might be zero or more elements in between".

There is a very well known abstraction to express these relations - regular expressions. The combinators in the selector can be replaced with regular expression syntax:

ab.*cb.*d

We transformed our CSS selector into a regular expression that can be executed on the sequence of start tag tokens. Note that not all CSS selectors can be converted to such a regular grammar and the input on which we match has some specifics, which we’ll discuss later. However, it was a good starting point: it allowed us to express a significant subset of selectors.

Implementing a Virtual Machine

Next, we started looking at non-backtracking algorithms for regular expressions. The virtual machine approach seemed suitable for our task as it was possible to have a non-backtracking implementation that was flexible enough to work around differences between real regular expression matching on strings and our abstraction.

VM-based regular expression matching is implemented as one of the engines in many regular expression libraries such as regexp2 and Rust’s regex. The basic idea is that instead of building an NFA or DFA for a regular expression it is instead converted into DSL assembly language with instructions later executed by the virtual machine - regular expressions are treated as programs that accept strings for matching.

Since the VM program is just a representation of an NFA with ε-transitions, it can exist in multiple states simultaneously during the execution, or, in other words, spawn multiple threads. The regular expression matches if one or more states succeed.

For example, consider the following VM instructions:

  • expect c - waits for the next input character and aborts the thread if it doesn't equal the instruction's operand;
  • jmp L - jump to label ‘L’;
  • thread L1, L2 - spawns threads for labels L1 and L2, effectively splitting the execution;
  • match - succeeds the thread with a match;

For example, using this instruction set, the regular expression “ab*c” can be translated into:

        expect a
L1:     thread L2, L3
L2:     expect b
        jmp L1
L3:     expect c
        match

Let’s try to translate the regular expression ab.*cb.*d from the selector we saw earlier:

        expect a
        expect b
L1:     thread L2, L3
L2:     expect [any]
        jmp L1
L3:     expect c
        expect b
L4:     thread L5, L6
L5:     expect [any]
        jmp L4
L6:     expect d
        match

That looks complex! Though this assembly language is designed for regular expressions in general, regular expressions can be much more complex than our case. For us, the only kind of repetition that matters is “.*”. So, instead of expressing it with multiple instructions we can use just one called hereditary_jmp:

        expect a
        expect b
        hereditary_jmp L1
L1:     expect c
        expect b
        hereditary_jmp L2
L2:     expect d
        match

The instruction tells the VM to memoize the instruction's label operand and unconditionally spawn a thread with a jump to this label on each input character.

There is one significant distinction between the string input of regular expressions and the input provided to our VM. The input can shrink!

A regular string is just a contiguous sequence of characters, whereas we operate on a sequence of open elements. As new tokens arrive this sequence can grow as well as shrink. Assume we represent <div> as the character ‘a’ in our imaginary language; given the input <div><div><div> we can represent it as aaa. If the next token in the input is </div> then our “string” shrinks to aa.

You might think at this point that our abstraction doesn't work and we should try something else. But what we have as input for our machine is a stack of open elements, and we needed a stack-like structure to store the hereditary_jmp instruction labels that the VM had seen so far. So, why not store them on the open element stack? If we store the next instruction pointer on each of the stack items on which the expect instruction was successfully executed, we'll have a full snapshot of the VM state, so we can easily roll back to it if our stack shrinks.

With this implementation we don't need to store anything except a tag name on the stack, and, considering that we can use the tag name hashing algorithm, it is just a 64-bit integer per open element. As an additional small optimization, to avoid traversing the whole stack in search of active hereditary jumps on each new input, we store on each stack item the index of the first ancestor with a hereditary jump.

For example, having the following selector “body > div span” we’ll have the following VM program (let’s get rid of labels and just use instruction indices instead):

0| expect <body>
1| expect <div>
2| hereditary_jmp 3
3| expect <span>
4| match

Having an input “<body><div><div><a>” we’ll have the following stack:

[Diagram: the open element stack for the input above]

Now, if the next token is a start tag <span> the VM will first try to execute the selector's program from the beginning and will fail on the first instruction. However, it will also look for any active hereditary jumps on the stack. We have one which jumps to the instruction at index 3. After jumping to this instruction the VM successfully produces a match. If we get yet another <span> start tag later it will match as well, following the same steps, which is exactly what we expect for the descendant selector.

If we then receive a sequence of “</span></span></div></a></div>” end tags our stack will contain only one item:

[Diagram: the single remaining stack item]

which instructs the VM to jump to the instruction at index 1, effectively rolling back to matching the div component of the selector.

Remember we mentioned earlier that we can bail out from the matching process if we only have a tag name from the tag scanner and need to obtain more information by running the lexer? With a VM approach it is as easy as stopping the execution of the current instruction and resuming it later when we get the required information.

Duplicate selectors

As we need a separate program for each selector we want to match, how can we stop the same simple components from doing the same job? The AST for our selector matching program is a radix tree-like structure whose edge labels are simple selector components and whose nodes are hierarchical combinators.
For example for the following selectors:

body > div > link[rel]
body > span
body > span a

we’ll get the following AST:

[Diagram: the resulting AST for these selectors]


If selectors have common prefixes we can match them just once for all these selectors. In the compilation process, we flatten this structure into a vector of instructions.

[not] JIT-compilation

For performance reasons compiled instructions are macro-instructions - they incorporate multiple basic VM instruction calls. This way the VM can execute only one macro instruction per input token. Each of the macro instructions is compiled using the so-called “[not] JIT-compilation” (the same approach to compilation is used in our other Rust project - wirefilter).

Internally the macro instruction contains an expect and the following jmp, hereditary_jmp and match basic instructions. In that sense macro-instructions resemble microcode, making it easy to suspend execution of a macro instruction if we need to request attribute information from the lexer.

What’s next

It is obviously not the end of the road, but hopefully we've got a bit closer to it. There are still multiple bits of functionality that need to be implemented and certainly there is space for more optimizations.

If you are interested in the topic don’t hesitate to join us in development of LazyHTML and LOL HTML at GitHub and, of course, we are always happy to see people passionate about technology here at Cloudflare, so don’t hesitate to contact us if you are too :).

Categories: Technology

PHP 7.4.0 Released!

PHP - Thu, 28/11/2019 - 09:26
Categories: Technology

A History of HTML Parsing at Cloudflare: Part 1

CloudFlare - Thu, 28/11/2019 - 08:44

To coincide with the launch of streaming HTML rewriting functionality for Cloudflare Workers we are open sourcing the Rust HTML rewriter (LOL  HTML) used to back the Workers HTMLRewriter API. We also thought it was about time to review the history of HTML rewriting at Cloudflare.

The first blog post will explain the basics of a streaming HTML rewriter and our particular requirements. We start around 8 years ago by describing the group of ‘ad-hoc’ parsers that were created with specific functionality, such as rewriting e-mail addresses or minifying HTML. By 2016 the state machine defined in the HTML5 specification could be used to build a single spec-compliant HTML pluggable rewriter to replace the existing collection of parsers. The source code for this rewriter is now public and available here: https://github.com/cloudflare/lazyhtml.

The second blog post will describe the next iteration of rewriter. With the launch of the edge compute platform Cloudflare Workers we came to realise that developers wanted the same HTML rewriting capabilities with a JavaScript API. The post describes the thoughts behind a low latency streaming HTML rewriter with a CSS-selector based API. We open-sourced the Rust library as it can also be used as a stand-alone HTML rewriting/parsing library.

What is a streaming HTML rewriter?

A streaming HTML rewriter takes either an HTML string or a byte stream as input, and parses it into tokens or any other structured intermediate representation (IR) - such as an Abstract Syntax Tree (AST). It then performs transformations on the tokens before converting back to HTML. This provides the ability to modify, extract or add to an existing HTML document as the bytes are being processed. Compare this with a standard HTML tree parser which needs to retrieve the entire file to generate a full DOM tree. The tree-based rewriter will both take longer to deliver the first processed bytes and require significantly more memory.

[Diagram: HTML rewriter]

For example, consider you own a large site with a lot of historical content that you want to now serve over HTTPS. You will quickly run into the problem of resources (images, scripts, videos) being served over HTTP. This ‘mixed content’ opens a security hole and browsers will warn or block these resources. It can be difficult or even impossible to update every link on every page of a website. With a streaming HTML rewriter you can select the URI attribute of any HTML tag and change any HTTP links to HTTPS. We built this very feature, Automatic HTTPS Rewrites, back in 2016 to solve mixed content issues for our customers.

The reader may already be wondering: “Isn’t this a solved problem, aren’t there many widely used open-source browsers out there with HTML parsers that can be used for this purpose?”. The reality is that writing code to run in 190+ PoPs around the world with a strict low latency requirement turns even seemingly trivial problems into complex engineering challenges.

The following blog posts will detail the journey of how starting with a simple idea of finding email addresses within an HTML page led to building an almost spec compliant HTML parser and then on to a CSS selector matching Virtual Machine. We learned a lot on this journey. I hope you find some of this as interesting as we did.

Rewriting at the edge

When rewriting content through Cloudflare we do not want to impact site performance. The balance in designing a streaming HTML rewriter is to minimise the pause in response byte flow by holding onto as little information as possible whilst retaining the ability to rewrite matching tokens.

The differences in requirements compared to an HTML parser used in a browser include:

Output latency

For browsers, the Document Object Model (DOM) is the end product of the parsing process but in our case we have to parse, rewrite and serialize back to HTML. In the case of Cloudflare’s reverse proxy any content processing on the edge server results in latency between the server and an eyeball. It is desirable to minimize the latency impact of HTML handling, which involves parsing, rewriting and serializing back to HTML. In all of these stages we want to be as fast as possible to minimize latency.

Parser throughput

Let's assume that browsers rarely need to deal with HTML pages bigger than 1Mb in size and an average page load time is somewhere around 3s at best. HTML parsing is not the main bottleneck of the page loading process as the browser will be blocked on running scripts and loading other render-critical resources. We can roughly estimate that ~3Mbps is an acceptable throughput for a browser's HTML parser. At Cloudflare we have hundreds of megabytes of traffic per CPU, so we need a parser that is faster by an order of magnitude.

Memory limitations

As most users must realise, browsers have the luxury of being able to consume memory. For example, this simple HTML markup, when opened in a browser, will consume a significant chunk of your system memory before eventually killing the browser tab (and all this memory will be consumed by the parser):

<script>
  document.write('<');
  while(true) {
    document.write('aaaaaaaaaaaaaaaaaaaaaaaa');
  }
</script>

Unfortunately, buffering of some fraction of the input is inevitable even for streaming HTML rewriting. Consider these 2 HTML snippets:

<div foo="bar" qux="qux">

<div foo="bar" qux="qux"


These seemingly similar fragments of HTML will be treated completely differently when encountered at the end of an HTML page. The first fragment will be parsed as a start tag and the second one will be ignored. By just seeing a `<` character followed by a tag name, the parser can’t determine if it has found a start tag or not. It needs to traverse the input in the search of the closing `>` to make a decision, buffering all content in between, so it can later be emitted to the consumer as a start tag token.

This requirement forces browsers to indefinitely buffer content before eventually giving up with the out-of-memory error.

In our case, we can’t afford to spend hundreds of megabytes of memory parsing a single HTML file (actual constraints are even tighter - even using a dozen kilobytes for each request would be unacceptable). We need to be much more sophisticated than other implementations in terms of memory usage and gracefully handle all the situations where provided memory capacity is insufficient to accomplish parsing.

v0 : “Ad-hoc parsers”

As usual with big projects, it all started pretty innocently.

Find and obfuscate an email

In 2010, Cloudflare decided to provide a feature that would stop popular email scrapers. The basic idea of this protection was to find and obfuscate emails on pages and later decode them back in the browser with injected JavaScript code. Sounds easy, right? You search for anything that looks like an email, encode it and then decode it with some JavaScript magic and present the result to the end-user.

However, even such a seemingly simple task already requires solving several issues. First of all, we need to define what an email is, and there is no simple answer. Even the infamous regex supposedly covering the entire RFC is, in fact, outdated and incomplete as the new RFC added lots of valid email constructions, including Unicode support. Let’s not go down that rabbit hole for now and instead focus on a higher-level issue: transforming streaming content.

Content from the network comes in packets, which have to be buffered and parsed as HTTP by our servers. You can’t predict how the content will be split, which means you always need to buffer some of it because content that is going to be replaced can be present in multiple input chunks.

Let’s say we decided to go with a simple regex like `[\w.]+@[\w.]+`. If the content that comes through contains the email “test@example.org”, it might be split in the following chunks:

[Diagram: the email split across two input chunks]

In order to keep good Time To First Byte (TTFB) and consistent speed, we want to ensure that the preceding chunk is emitted as soon as we determine that it’s not interesting for replacement purposes.

The easiest way to do that is to transform our regex into a state machine, or finite automaton. While you could do that by hand, you would end up with hard-to-maintain and error-prone code. Instead, Ragel was chosen to transform regular expressions into efficient native state machine code. Ragel doesn't try to take care of buffering or anything other than traversing the state machine. It provides a syntax that not only describes patterns, but can also associate custom actions (code in a host language) with any given state.

In our case we can pass through buffers until we match the beginning of an email. If we subsequently find out the pattern is not an email we can bail out from buffering as soon as the pattern stops matching. Otherwise, we can retrieve the matched email and replace it with new content.

To turn our pattern into a streaming parser we can remember the position of the potential start of an email and, unless it was already discarded or replaced by the end of the current input, store the unhandled part in a permanent buffer. Then, when a new chunk comes, we can process it separately, resuming from a state Ragel remembers itself, but then use both the buffered chunk and a new one to either emit or obfuscate.
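As a toy illustration of that buffering strategy (a deliberately simplified JavaScript sketch, not the Ragel-generated production code): emit everything that can no longer be part of an email immediately, and carry only the ambiguous tail over to the next chunk.

// Toy streaming email rewriter. Only the trailing run of "email-like"
// characters is held back between chunks; the rest is emitted right away
// to keep TTFB low.
const EMAIL = /[\w.]+@[\w.]+/g

function createEmailRewriter(obfuscate) {
  let carry = ''
  return {
    write(chunk) {
      let text = carry + chunk
      // The tail could still grow into (or still be part of) an email - keep it.
      const tail = text.match(/[\w.@]*$/)[0]
      carry = tail
      text = text.slice(0, text.length - tail.length)
      return text.replace(EMAIL, obfuscate)
    },
    end() {
      const flushed = carry.replace(EMAIL, obfuscate)
      carry = ''
      return flushed
    },
  }
}

const rewriter = createEmailRewriter(() => '[obfuscated]')
rewriter.write('Contact us at test@exam') // -> "Contact us at "
rewriter.write('ple.org today')           // -> "[obfuscated] "
rewriter.end()                            // -> "today"

Note how the split email is still caught even though it arrived in two chunks, while everything before it was emitted immediately.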

Now that we have solved the problem of matching email patterns in text, we need to deal with the fact that they need to be obfuscated on pages. This is when the first hints of HTML “parsing” were introduced.

I’ve put “parsing” in quotes because, rather than implementing the whole parser, the email filter (as the module was called) didn’t attempt to replicate the whole HTML grammar, but rather added custom Ragel patterns just for skipping over comments and tags where emails should not be obfuscated.

This was a reasonable approach, especially back in 2010 - four years before the HTML5 specification, when every browser had its own quirky handling of HTML. However, as you can imagine, this approach did not scale well. If you’re trying to work around quirks in other parsers, you accumulate more and more quirks in your own, and then have to work around those too. At the same time, new features kept being added that also required modifying HTML on the fly (such as automatic insertion of the Google Analytics script), and the existing module seemed to be the best place for them. It grew to handle more and more tags, operations and syntactic edge cases.

Now let’s minify…

In 2011, Cloudflare decided to also add minification to allow customers to speed up their websites even if they had not employed minification themselves. For that, we decided to use an existing streaming minifier - jitify. It already had NGINX bindings, which made it a great candidate for integration into the existing pipeline.

Unfortunately, just like most other parsers from that time as well as ours described above, it had its own processing rules for HTML, JavaScript and CSS, which weren’t precise but rather tried to parse content on a best-effort basis. This led to us having two independent streaming parsers that were incompatible and could produce bugs either individually or only in combination.

v1 : "(Almost) HTML5 Spec compliant parser"

Over the years engineers kept adding new features to the ever-growing state machines, while fixing new bugs arising from imprecise syntax implementations, conflicts between various parsers, and problems in features themselves.

By 2016, it was time to get out of the multiple ad hoc parsers business and do things ‘the right way’.

The next sections describe how we built our HTML5-compliant parser, starting from the state machine in the specification. Using that state machine alone, building a parser should have been straightforward. However, HTML parsing has historically never been strict: to avoid breaking existing content, the specification requires building an actual DOM in order to parse correctly. That is not possible for a streaming rewriter, so we developed a simulator of the parser feedback instead. In terms of performance, it is always better not to do something at all, so we then explain why the rewriter can be ‘lazy’ and skip the expensive decoding and re-encoding of text while rewriting HTML. Finally, we cover the surprisingly difficult problem of deciding whether a response is HTML in the first place.

HTML5

By 2016, HTML5 had defined precise syntax rules for parsing, including compatibility with legacy content and the custom behaviour of older browsers. It was already implemented by all major browsers and many third-party parsers.

The HTML5 parsing specification defines basic HTML syntax in the form of a state machine. We already had experience with Ragel for similar use cases, so there was no question about what to use for the new streaming parser. Despite the complexity of the grammar, the translation of the specification to Ragel syntax was straightforward. The code looks simpler than the formal description of the state machine, thanks to the ability to mix regex syntax with explicit transitions.

[Figure: a visualisation of a small fraction of the HTML state machine. Source: https://twitter.com/RReverser/status/715937136520916992]

HTML5 parsing requires a ‘DOM’

However, HTML has a history. To avoid breaking existing content, HTML5 specifies recovery procedures for incorrect tag nesting and ordering, unclosed tags, missing attributes and all the other quirks that used to work in older browsers. To resolve these issues, the specification expects a tree builder to drive the lexer, which essentially means you can’t correctly tokenize HTML (split it into separate tags) without building a DOM.

[Figure: HTML parsing flow as defined by the specification]

For this reason, most parsers don’t even try to perform streaming parsing and instead take the input as a whole and produce a document tree as an output. This is not something we could do for streaming transformation without adding significant delays to page loading.

An existing HTML5 JavaScript parser - parse5 - had already implemented spec-compliant tree parsing using a streaming tokenizer and rewriter. To avoid having to create a full DOM the concept of a “parser feedback simulator” was introduced.

Tree builder feedback

As you can guess from the name, this is a module that aims to simulate a full parser’s feedback to the tokenizer, without actually building the whole DOM, but instead preserving only the required information and context necessary for correctly driving the state machine.
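As a toy illustration of one kind of feedback such a simulator provides (this is not the LazyHTML code - the real simulator tracks considerably more context, e.g. foreign content), very little state is needed to know that the contents of <script>, <style> or <title> must be tokenized as text rather than as markup:

#include <stdbool.h>
#include <string.h>

/* Toy sketch only: instead of a full DOM, keep just enough context to drive
 * the tokenizer correctly. Tag names are assumed to be already lower-cased
 * and NUL-terminated. */
typedef struct {
    bool in_raw_text;
    char raw_text_tag[16]; /* which end tag switches us back to markup mode */
} feedback_sim_t;

static bool is_raw_text_element(const char *name) {
    /* elements whose contents are tokenized as (raw) text, not markup */
    return strcmp(name, "script") == 0 ||
           strcmp(name, "style") == 0 ||
           strcmp(name, "title") == 0;
}

static void on_start_tag(feedback_sim_t *sim, const char *name) {
    if (!sim->in_raw_text && is_raw_text_element(name)) {
        sim->in_raw_text = true;
        strncpy(sim->raw_text_tag, name, sizeof(sim->raw_text_tag) - 1);
        sim->raw_text_tag[sizeof(sim->raw_text_tag) - 1] = '\0';
    }
}

static void on_end_tag(feedback_sim_t *sim, const char *name) {
    if (sim->in_raw_text && strcmp(name, sim->raw_text_tag) == 0) {
        sim->in_raw_text = false;
    }
}

The actual simulator keeps similarly compact, but richer, context: just enough to drive the tokenizer the way a full tree builder would, without ever materialising the tree.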

After rigorous testing and upstreaming a test runner to parse5, we found this technique to be suitable for the majority of even poorly written pages on the Internet, and employed it in LazyHTML.

[Figure: LazyHTML architecture]

Avoiding decoding - everything is ASCII

Now that we had a streaming tokenizer working, we wanted to make sure it was fast enough that users didn’t notice any slowdown to their pages as they passed through the parser and transformations. Otherwise it would completely defeat any optimisations we’d want to attempt on the fly.

Decoding the incoming content up front would not only cause a performance hit from decoding and re-encoding any modified HTML, but would also significantly complicate our implementation: determining the character encoding requires consulting multiple sources of information, including sniffing the first 1 KB of the content.

The “living” HTML Standard permits only the encodings defined in the Encoding Standard. If we look carefully through those encodings, as well as the remark in the Character encodings section of the HTML spec, we find that all of them are ASCII-compatible, with the exception of UTF-16 and ISO-2022-JP.

This means that any ASCII text is represented in such encodings exactly as it would be in ASCII, and any non-ASCII text is represented by bytes outside of the ASCII range. This property allows us to safely tokenize, compare and even modify the original HTML without decoding it, or even knowing which particular encoding it uses. It is possible because all the token boundaries in the HTML grammar are ASCII characters.
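As a small, purely illustrative sketch of what this property buys us (not library code), a raw, undecoded byte slice can be compared against a lower-case ASCII tag name, and the result is correct for any ASCII-compatible encoding:

#include <stdbool.h>
#include <stddef.h>

/* ASCII-only lower-casing: leaves every non-ASCII byte untouched. */
static unsigned char ascii_lower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/* Compares a raw (undecoded) byte slice against a lower-case ASCII name,
 * case-insensitively, without ever decoding the document. */
static bool raw_name_equals(const char *raw, size_t raw_len,
                            const char *lower_ascii_name) {
    size_t i = 0;
    for (; i < raw_len && lower_ascii_name[i] != '\0'; i++) {
        if (ascii_lower((unsigned char)raw[i]) !=
            (unsigned char)lower_ascii_name[i]) {
            return false;
        }
    }
    return i == raw_len && lower_ascii_name[i] == '\0';
}

A non-ASCII byte can never equal an ASCII letter, so multi-byte sequences elsewhere in the document simply fail the comparison instead of corrupting it.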

We do need to detect UTF-16 by sniffing, and either decode such documents or skip them without modification. We chose the latter, both to avoid the potential security-sensitive bugs that are common with UTF-16 handling and because, luckily, UTF-16 is seen in less than 0.1% of the content we observe.

The only remaining issue with this approach is that, in most places, the HTML tokenization specification requires U+0000 (NUL) characters to be replaced with U+FFFD (the replacement character) during parsing. Presumably this was added as a security precaution against bugs in old C-based engines, which could treat a NUL character - encoded in ASCII / UTF-8 / ... as a 0x00 byte - as the end of a string (yay, null-terminated strings…). It’s problematic for us because U+FFFD is outside the ASCII range and is represented by different byte sequences in different encodings. We don’t know the encoding of the document, so performing the replacement would corrupt the output.

Luckily, we’re not in the same business as browser vendors, and don’t worry about NUL characters in strings as much - we use “fat pointer” string representation, in which the length of the string is determined not by the position of the NUL character, but stored along with the data pointer as an integer field:

typedef struct { const char *data; size_t length; } lhtml_string_t;

Instead, we can quietly ignore this part of the spec (sorry!): we keep U+0000 characters as-is in tag names, attribute names and other strings, and later re-emit them to the document unchanged. This is safe to do because it doesn’t affect any state machine transitions; it merely preserves the original 0x00 bytes and delegates their replacement to the parser in the end user’s browser.
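As a tiny, hypothetical demonstration (the typedef is repeated only to keep the snippet self-contained), a buffer with an embedded 0x00 byte keeps its full length under this representation, whereas a NUL-terminated view of the same bytes appears truncated:

#include <stdio.h>
#include <string.h>

typedef struct { const char *data; size_t length; } lhtml_string_t;

int main(void) {
    /* "a<b" followed by a raw 0x00 byte and ">c" - 6 bytes in total */
    static const char bytes[] = { 'a', '<', 'b', '\0', '>', 'c' };
    lhtml_string_t s = { .data = bytes, .length = sizeof(bytes) };

    printf("fat pointer length: %zu\n", s.length);      /* 6: NUL is just data */
    printf("strlen would see:   %zu\n", strlen(bytes));  /* 3: stops at 0x00    */
    return 0;
}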

Content type madness

We want to be lazy and minimise false positives: we only want to spend time parsing, decoding and rewriting actual HTML, rather than breaking images or JSON. So the question is: how do you decide whether something is an HTML document? Can you just use the Content-Type header, for example? A comment left in the source code best describes the reality.

/*
 * Dear future generations. I didn't like this hack either and hoped we could
 * do the right thing instead. Unfortunately, the Internet was a bad and scary
 * place at the moment of writing. If this ever changes and websites become
 * more standards compliant, please do remove it just like I tried.
 *
 * Many websites use PHP which sets Content-Type: text/html by default. There
 * is no error or warning if you don't provide own one, so most websites don't
 * bother to change it and serve JSON API responses, private keys and binary
 * data like images with this default Content-Type, which we would happily try
 * to parse and transforms. This not only hurts performance, but also easily
 * breaks response data itself whenever some sequence inside it happens to
 * look like a valid HTML tag that we are interested in. It gets even worse
 * when JSON contains valid HTML inside of it and we treat it as such, and
 * append random scripts to the end breaking APIs critical for popular web
 * apps.
 *
 * This hack attempts to mitigate the risk by ensuring that the first
 * significant character (ignoring whitespaces and BOM) is actually `<` -
 * which increases the chances that it's indeed HTML. That way we can
 * potentially skip some responses that otherwise could be rendered by a
 * browser as part of AJAX response, but this is still better than the
 * opposite situation.
 */
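A minimal sketch of the kind of check the comment describes - the function name and the exact whitespace set are illustrative, not the production code:

#include <stdbool.h>
#include <stddef.h>

/* Returns true if the first significant character of the body, ignoring an
 * optional UTF-8 BOM and leading ASCII whitespace, is '<'. */
static bool looks_like_html(const char *data, size_t len) {
    size_t i = 0;
    /* skip a UTF-8 byte order mark if present */
    if (len >= 3 && (unsigned char)data[0] == 0xEF &&
                    (unsigned char)data[1] == 0xBB &&
                    (unsigned char)data[2] == 0xBF) {
        i = 3;
    }
    /* skip ASCII whitespace */
    while (i < len && (data[i] == ' ' || data[i] == '\t' ||
                       data[i] == '\r' || data[i] == '\n' || data[i] == '\f')) {
        i++;
    }
    return i < len && data[i] == '<';
}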

The reader might think that it’s a rare edge case, however, our observations show that almost 25% of the traffic served through Cloudflare with the “text/html” content type is unlikely to be HTML.


The trouble doesn’t end there: it turns out that a considerable amount of XML content is served with the “text/html” content type, and it can’t always be processed correctly when treated as HTML.

Over time, bailouts for binary data, JSON and AMP, plus the need to correctly identify HTML fragments, led to content sniffing logic that can be described by the following diagram:

[Figure: content sniffing decision diagram]

This is a good example of divergence between formal specifications and reality.

Tag name comparison optimisation

But just having fast parsing is not enough - we have functionality that consumes the parser’s output, rewrites it and feeds it back for serialization. All the memory and time constraints that apply to the parser apply to this code as well, as it is part of the same content processing pipeline.

It’s a common requirement to compare parsed HTML tag names, e.g. to determine whether the current tag should be rewritten or not. A naive implementation uses a per-byte comparison, which may require traversing the whole tag name. We were able to narrow this operation down to a single integer comparison instruction in the majority of cases by using a specially designed hashing algorithm.

The tag names of all standard HTML elements contain only alphabetical ASCII characters and the digits 1 to 6 (in the numbered heading tags, i.e. <h1> - <h6>). Comparison of tag names is case-insensitive, so we only need 26 codes to represent the alphabetical characters. Using the same basic idea as arithmetic coding, we can represent each of the 32 possible characters of a tag name using just 5 bits and, thus, fit up to floor(64 / 5) = 12 characters into a 64-bit integer - enough for all the standard tag names, and for any other tag names that satisfy the same requirements! The great part is that we don’t even need an additional pass over the tag name to hash it: we can compute the hash as we parse the tag name, consuming the input byte by byte.

However, there is one problem with this hashing algorithm, and the culprit is not so obvious: to fit all 32 characters into 5 bits we need to use every possible bit combination, including 00000. This means that if the leading character of a tag name is encoded as 00000, we cannot differentiate between different numbers of consecutive repetitions of that character.

For example, considering that ‘a’ is encoded as 00000 and ‘b’ as 00001:

Tag name    Bit representation        Encoded value
ab          00000 00001               1
aab         00000 00000 00001         1

Luckily, we know that HTML grammar doesn’t allow the first character of a tag name to be anything except an ASCII alphabetical character, so reserving numbers from 0 to 5 (00000b-00101b) for digits and numbers from 6 to 31 (00110b - 11111b) for ASCII alphabetical characters solves the problem.
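Putting the pieces together, here is a sketch of how such an incremental hash might be computed (illustrative only; the actual LazyHTML implementation may differ in its exact mapping and error handling):

#include <stdint.h>
#include <stddef.h>

/* Hashes a tag name into a 64-bit integer, 5 bits per character:
 * '1'-'6' map to 0-5, 'a'-'z' (case-insensitive) map to 6-31.
 * Returns 0 for any character outside that set, or for names longer than
 * 12 characters, meaning "not hashable - fall back to byte comparison".
 * Since a valid tag name starts with a letter (code >= 6), a valid name
 * never hashes to 0, so 0 is a safe sentinel. In the real parser the hash
 * is accumulated while consuming input byte by byte; this helper just shows
 * the per-character mapping. */
static uint64_t tag_name_hash(const char *name, size_t len) {
    if (len == 0 || len > 12) return 0;
    uint64_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        char c = name[i];
        uint64_t code;
        if (c >= 'a' && c <= 'z')      code = (uint64_t)(c - 'a') + 6;
        else if (c >= 'A' && c <= 'Z') code = (uint64_t)(c - 'A') + 6;
        else if (c >= '1' && c <= '6') code = (uint64_t)(c - '1');
        else return 0;                 /* not hashable with this scheme */
        hash = (hash << 5) | code;     /* 5 bits per character */
    }
    return hash;
}

Comparing a parsed name against a known tag then becomes tag_name_hash(name, len) == tag_name_hash("img", 3): a single integer comparison once the constant side is precomputed.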

LazyHTML

After taking everything mentioned above into consideration, the LazyHTML (https://github.com/cloudflare/lazyhtml) library was created. It is a fast streaming HTML parser and serializer with a token-based C API, derived from the HTML5 lexer written in Ragel. It provides a pluggable transformation pipeline that allows multiple transformation handlers to be chained together.

An example of a handler that transforms the `href` attribute of links:

// define static string to be used for replacements
static const lhtml_string_t REPLACEMENT = {
    .data = "[REPLACED]",
    .length = sizeof("[REPLACED]") - 1
};

static void token_handler(lhtml_token_t *token, void *extra /* this can be your state */) {
    if (token->type == LHTML_TOKEN_START_TAG) { // we're interested only in start tags
        const lhtml_token_starttag_t *tag = &token->start_tag;
        if (tag->type == LHTML_TAG_A) { // check whether tag is of type <a>
            const size_t n_attrs = tag->attributes.count;
            lhtml_attribute_t *attrs = tag->attributes.items;
            for (size_t i = 0; i < n_attrs; i++) { // iterate over attributes
                lhtml_attribute_t *attr = &attrs[i];
                if (lhtml_name_equals(attr->name, "href")) { // match the attribute name
                    attr->value = REPLACEMENT; // set the attribute value
                }
            }
        }
    }
    lhtml_emit(token, extra); // pass transformed token(s) to next handler(s)
}

So, is it correct and how fast is it?

It is HTML5 compliant, as tested against the official test suites. As part of the work, several contributions were sent to the specification itself to clarify or simplify the spec language.

Unlike the previous parser(s), it didn't bail out on any of the 2,382,625 documents from HTTP Archive, although 0.2% of documents exceeded the expected buffering limits because they were in fact JavaScript, RSS or other types of content incorrectly served with Content-Type: text/html; since anything is valid HTML5, the parser tried to parse e.g. a<b; x=3; y=4 as an incomplete tag with attributes. This is very rare (and drops to 0.03% when two error-prone advertisement networks are excluded from the results), but still needs to be accounted for and is a valid case for bailing out.

As for benchmarks: in September 2016 we used an example that transforms the HTML spec itself (a 7.9 MB HTML file) by replacing every <a href> (only that attribute, and only in those tags) with a static value. LazyHTML was compared against a few existing, popular HTML parsers (only tokenization mode was used, for a fair comparison, so that they didn't need to build an AST and so on). The timings in milliseconds for 100 iterations are the following (lazy mode means that we use raw strings whenever possible; the other mode serializes each token, just for comparison):

[Figure: benchmark results - timings in milliseconds for 100 iterations]

The results show that the LazyHTML parser is around an order of magnitude faster than the alternatives.

That concludes the first post in our series on HTML rewriters at Cloudflare. The next post describes how we built a new streaming rewriter on top of the ideas of LazyHTML. The major update was an easier-to-use CSS selector API. It provides the back end for the Cloudflare Workers HTMLRewriter JavaScript API.

Categories: Technology

Introducing the HTMLRewriter API to Cloudflare Workers

CloudFlare - Thu, 28/11/2019 - 08:00
Introducing the HTMLRewriter API to Cloudflare Workers

We are excited to announce that the HTMLRewriter API for Cloudflare Workers is now GA! You can get started today by checking out our documentation, or trying out our tutorial for localizing your site with the HTMLRewriter.

Want to know how it works under the hood? We are excited to tell you everything you wanted to know (but were afraid to ask) about building a streaming HTML parser on the edge; read about it in part 1 (and stay tuned for part two, coming tomorrow!).

Faster, more scalable applications at the edge

The HTMLRewriter can help solve two big problems web developers face today: making changes to the HTML, when they are hard to make at the server level, and making it possible for HTML to live on the edge, closer to the user — without sacrificing dynamic functionality.

Since their introduction, Workers have helped customers regain control where it either wasn’t provided or was very hard to obtain at the origin. Just as Workers can help you set CORS headers at the middleware layer, between your users and the origin, the HTMLRewriter can assist with things like URL rewrites (see the example below!).

Back in January, we introduced the Cache API, giving you more control than ever over what you could store in our edge caches, and how. By helping users have closer control of the cache, users could push experiences that were previously only able to live at the origin server to the network edge. A great example of this is the ability to cache POST requests, where previously only GET requests were able to benefit from living on the edge.

Later this year, during Birthday Week, we announced Workers Sites, taking the idea of content on the edge one step further, and allowing you to fully deploy your static sites to the edge.

We were not the first to realize that the web could benefit from more content being accessible from the CDN level. The motivation behind the resurgence of static sites is to have a canonical version of HTML that can easily be cached by a CDN.

This has been great progress for static content on the web, but what about so much of the content that’s almost static, but not quite? Imagine an e-commerce website: the items being offered, and the promotions are all the same — except that pesky shopping cart on the top right. That small difference means the HTML can no longer be cached, and sent to the edge. That tiny little number of items in the cart requires you to travel all the way to your origin, and make a call to the database, for every user that visits your site. This time of year, around Black Friday, and Cyber Monday, this means you have to make sure that your origin, and database are able to scale to the maximum load of all the excited shoppers eager to buy presents for their loved ones.

Edge-side JAMstack with HTMLRewriter

One way to solve this problem of reducing origin load while retaining dynamic functionality is to move more of the logic to the client. This approach of making the HTML content static, leveraging the CDN for caching, and relying on client-side API calls for dynamic content is known as the JAMstack (JavaScript, APIs, Markup). The JAMstack is certainly a move in the right direction, trying to maximize content on the edge. We’re excited to embrace that approach even further, by allowing the dynamic logic of the site to live on the edge as well.

In September, we introduced the HTMLRewriter API in beta to the Cloudflare Workers runtime — a streaming parser API to allow developers to develop fully dynamic applications on the edge. HTMLRewriter is a jQuery-like experience inside your Workers application, allowing you to manipulate the DOM using selectors and handlers. The same way jQuery revolutionized front-end development, we’re excited for the HTMLRewriter to change how we think about architecting dynamic applications.

There are a few advantages to rewriting HTML on the edge rather than on the client. First, updating content client-side introduces a tradeoff between the performance and the consistency of the application: if you want to update a page to a logged-in version client-side, the user will either be presented with the static version first and then watch the page change (an inconsistency known as a flicker), or rendering will be blocked until the customization can be fully rendered. Making the DOM modifications edge-side provides the benefits of a scalable application without degrading the user experience. The second problem is the inevitable client-side bloat. Much of the world today is connected to the Internet on a mobile phone with a significantly less powerful CPU, over shaky last-mile connections where each additional API call risks never making it back to the client. With the HTMLRewriter in Cloudflare Workers, the API calls for dynamic content (or even calls directly to Workers KV for user-specific data) can be made from the Worker instead, on a much more powerful machine, over much more resilient backbone connections.

Here’s what the developers at Happy Cog have to say about HTMLRewriter:

"At Happy Cog, we're excited about the JAMstack and static site generators for our clients because of their speed and security, but in the past have often relied on additional browser-based JavaScript and internet-exposed backend services to customize user experiences based on a cookied logged-in state. With Cloudflare's HTMLRewriter, that's now changed. HTMLRewriter is a fundamentally new powerful capability for static sites. It lets us parse the request and customize each user's experience without exposing any backend services, while continuing to use a JAMstack architecture. And it's fast. Because HTMLRewriter streams responses directly to users, our code can make changes on-the-fly without incurring a giant TTFB speed penalty. HTMLRewriter is going to allow us to make architectural decisions that benefit our clients and their users -- while avoiding a lot of the tradeoffs we'd usually have to consider."

Matt Weinberg, President of Development and Technology, Happy Cog

HTMLRewriter in Action

Since most web developers are familiar with jQuery, we chose to keep the HTMLRewriter very similar, using selectors and handlers to modify content.

Below, we have a simple example of how to use the HTMLRewriter to rewrite the URLs in your application from myolddomain.com to mynewdomain.com:

class AttributeRewriter {
  constructor(attributeName) {
    this.attributeName = attributeName
  }

  element(element) {
    const attribute = element.getAttribute(this.attributeName)
    if (attribute) {
      element.setAttribute(
        this.attributeName,
        attribute.replace('myolddomain.com', 'mynewdomain.com')
      )
    }
  }
}

const rewriter = new HTMLRewriter()
  .on('a', new AttributeRewriter('href'))
  .on('img', new AttributeRewriter('src'))

async function handleRequest(req) {
  const res = await fetch(req)
  return rewriter.transform(res)
}

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

The HTMLRewriter, which is instantiated once per Worker, is used here to select the a and img elements.

An element handler responds to any matching element when attached using the .on function of the HTMLRewriter instance.

Here, our AttributeRewriter receives every a and img element matched by the selectors above. We then use it to rewrite the given attribute (href on the a element, for example) to update the URL in the HTML.

Getting started

You can get started with the HTMLRewriter by signing up for Workers today. For a full guide on how to use the HTMLRewriter API, we suggest checking out our documentation.

How does the HTMLRewriter work?

How does the HTMLRewriter work under the hood? We’re excited to tell you about it, and have spared no details. Check out the first part of our two-post series to learn all about the not-so-brief history of rewriting HTML at Cloudflare, and stay tuned for part two tomorrow!

Categories: Technology

Crack the ‘Giving Tuesday’ Code on December 3rd

Drupal - Wed, 27/11/2019 - 22:13

Giving Tuesday (known online as #GivingTuesday) is the Tuesday after the Thanksgiving holiday in the United States, held on December 3, 2019. What began in the States has grown over the years into an international day dedicated to charitable giving at the beginning of the holiday season, whereby hundreds of millions of people give, collaborate and celebrate generosity.

Coincidentally, givingTuesday.org is built using Drupal! 

At the Drupal Association, we’ve decided to participate for the first time this year. Knowing the Drupal Community likes to have fun and delve into challenges — such as the ever-popular Trivia Night held during DrupalCons — our staff collaborated to create a trivia challenge never before seen, with Drupal project and product questions, Association facts and more. 

We challenge you to take the quiz on #GivingTuesday December 3rd, and to share with friends to see who can crack this code! Note: Not a literal code! Visit Drupal.org/giving-tuesday-2019 on Tuesday for the link.

We’ll update the leaderboard and congratulate players throughout the day on December 3! 

Thanks in advance for playing, and we encourage you to post your trivia score on social media; this may be our most challenging trivia game yet! 

We hope you'll take a minute to support the Drupal Association and join/renew membership or donate on this international day of giving. At the end of Giving Tuesday, we’ll announce three leaderboard winners: who has won the trivia, who donated the most, and who referred the most new members. The top 10 winners shown on each leaderboard will be entered to win a handcrafted Drupal prize (hint: it might be pictured in this photo)! 

#GivingTuesday
#MyGivingStory
#GivingToDrupal

Photo: opening reception with dasjo and lizzjoy at DrupalCon Seattle, by Hussain Abbas (hussainweb)

Categories: Technology

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners

CloudFlare - Wed, 27/11/2019 - 17:30

Grabbing the attention of employees at a security and privacy-focused company on security awareness presents a unique challenge: how do you get people who are already thinking about security all day to think about it some more? October marked Cloudflare’s first Security Awareness Month as a public company and to celebrate, the security team challenged our entire company population to create graphics, slogans, and memes to encourage us all to think and act more securely every day.

Employees approached this challenge with gusto; global participation meant plenty of high quality submissions to vote on. In addition to being featured here, the winning designs will be displayed in Cloudflare offices throughout 2020 and the creators will be on the decision panel for next year’s winners. Three rose to the top, highlighting creativity and style that is uniquely Cloudflarian. I sat down with the winners to talk through their thoughts on security and what all companies can do to drive awareness.

Eugene Wang, Design Team, First Place
Sílvia Flores, Executive Assistant, Second Place
Scott Jones, e-Learning Developer, Third Place

Security Haiku

Wipe that whiteboard clean
Visitors may come and see
Secrets not for them

No tailgating please
You may be a nice person
But I don’t know that

1. What inspired your design?

Eugene: The friendly "Welcome" cloud seen in our all company slides was a jumping off point. It seemed like a great character that embodied being a Cloudflarian and had tons of potential to get into adventures. I also wanted security to be a bit fun, where appropriate. Instead of a serious breach (though it could be), here it was more a minor annoyance personified by a wannabe-sneaky alligator. Add a pun, and there you go—poster design!

Sílvia: What inspired my design was the cute Cloudflare mascot the otter since there are so many otters in SF. Also, security can be fun and I added a pun for all the employees to remember the security system in an entertaining and respectful way. This design is very much my style and I believe making things cute and bright can really grab attention from people who are so busy in their work. A bright, orange, leopard print poster cannot be missed!

Scott: I have always loved the haiku form and poems were allowed!

2. What's the number one thing security teams can do to get non-security people excited about security?

Eugene: Make them realize and identify the threats that can happen everyday, and their role in keeping things secure. Cute characters and puns help.

Sílvia: Make it more accessible for people to engage and understand it, possibly making more activities, content, and creating a fun environment for people to be aware but also be mindful.

Scott: Use whatever means available to keep the idea of being security conscious in everyone's active awareness. This can and should be done in a variety of different ways so as to engage everyone in one way or another, visually with posters and signs, mentally by having contests, multi-sensory through B.E.E.R. meeting presentations and yes, even through a careful use of fear by periodically giving examples of what can happen if security is not followed...I believe that people like working here and believe in what we are doing and how we are doing it, so awareness mixed in with a little fear can reach people on a more visceral and personal level.

3. What's your favorite security tip?

Eugene: Look at the destination of the return email.

Sílvia: LastPass. Oh my lord. I cannot remember one single password since we need to make them so difficult! With numbers, caps, symbols, emojis (ahaha). LastPass makes it easier for me to be secure and still be myself and not remembering any password without freaking out.

Scott: “See something, say something” because it both reflects our basic responsibility to each other and exhibits a pride that we have as being part of a company we believe in and want to protect.

For security practitioners and engagement professionals, it's easy to try to boil the ocean when Security Awareness Month comes around; the list of potential topics and guidance is endless. Focusing on two or three key messages, gauging the maturity of your organization, and encouraging participation across the organization makes it a company-wide effort. Extra recognition and glory for those who go over and above never hurts either.

Want to run a security awareness design contest at your company? Reach out to us at securityawareness@cloudflare.com for tips and best practices for getting started, garnering support, and encouraging participation.


Categories: Technology
