Today, FluxNinja is emerging from stealth mode to announce Aperture - the first open-source flow control and reliability management platform for modern web applications.
Reliability as a competitive advantage
Over the last decade, cloud computing platforms have enabled online businesses to reach massive scale and empowered physical enterprises to bring their business online. But keeping these applications reliable is more challenging than ever. A sudden spike in traffic for an e-commerce giant on Black Friday can trigger customer-facing blank screens and crashing apps. Outages take a heavy toll: lost customer trust, missed revenue targets, and stress for internal DevOps and SRE teams.
Companies like LinkedIn (Hodor), Google (Handling Overload), Netflix (Prioritized Load Shedding), and Stripe (API Scaling) have made application reliability a competitive advantage with their cutting-edge flow control technologies. Fundamentally, flow control enables graceful degradation - the ability to preserve key user experience pathways, even in the face of application failures.
Graceful degradation with flow control
Modern web-scale apps are a complex network of interconnected microservices that implement features such as account management, search, payments, and more. This decoupled architecture has advantages for rapid feature development, but introduces complex new failure modes. When traffic surges, queues can build up on critical services, kick-starting a self-reinforcing feedback loop and causing cascading failures. The application stops serving responses in a timely manner and critical end-user transactions are interrupted.
Applications are governed by Little’s Law, which states that the average number of concurrent requests in a system equals the arrival rate of requests multiplied by the average response time. For the application to remain stable, the number of concurrent requests in the system must be kept bounded. Indirect techniques to stabilize applications, such as rate-limiting and auto-scaling, fall short in enabling good user experiences or business outcomes. Rate-limiting individual users is insufficient to protect services. Auto-scaling is slow to respond and can be cost-prohibitive. And as the number of services grows, these techniques get harder to deploy.
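To make the Little's Law argument concrete, here is a small worked example. The traffic figures (1,000 requests per second, 50 ms and 500 ms response times) and the function names are illustrative assumptions, not measurements from a real system.

```python
def concurrency(arrival_rate_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in the system = arrival rate * average latency."""
    return arrival_rate_rps * latency_s


def sustainable_rate(concurrency_limit: float, latency_s: float) -> float:
    """Rearranged: the arrival rate a fixed concurrency budget can absorb."""
    return concurrency_limit / latency_s


# Healthy service: 1,000 req/s at 50 ms average latency.
normal = concurrency(1000, 0.05)

# Same traffic, but latency has degraded to 500 ms.
degraded = concurrency(1000, 0.5)

# If the service can only hold about 50 requests at once, the degraded
# latency leaves room for far fewer requests per second.
admissible = sustainable_rate(50, 0.5)

print(f"normal concurrency:   {normal:.0f}")
print(f"degraded concurrency: {degraded:.0f}")
print(f"admissible rate:      {admissible:.0f} req/s")
```

With these assumed numbers, a tenfold rise in latency means ten times as many requests are in flight at the same arrival rate, which is why a limit on concurrency, rather than a fixed request rate, is the quantity that keeps the service stable.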
This is where flow control comes in. Applications can degrade gracefully in real-time when using flow control techniques with Aperture, by prioritizing high-importance features over others.
The flow control technologies used by teams at LinkedIn, Google, Netflix, Stripe, and others have been years in development. But most companies don’t have the luxury of building these in-house. This is why we are excited to release Aperture as an open-source project - just as Kubernetes democratized deploying cloud infrastructure, we hope to democratize building reliable applications with effective flow control.
How Aperture works
At the fundamental level, Aperture enables flow control through observing, analyzing, and actuating, facilitated by agents and a controller.
Aperture Agents live next to your service instances as a sidecar and provide powerful flow control components such as a weighted fair queuing scheduler for prioritized load-shedding and a distributed rate-limiter for abuse prevention. A flow is the fundamental unit of work from the perspective of an Aperture Agent. It could be an API call, a feature, or even a database query.
Graceful degradation of services is achieved by prioritizing critical application features over background workloads. Just as business class passengers board an aircraft ahead of other passengers, every application has workloads with varying priorities. A video streaming service might view a request to play a movie by a customer as a higher priority than running an internal machine learning workload. A SaaS product might prioritize features used by paid users over those used by free users. Aperture Agents schedule workloads based on their priorities, helping maximize user experience or revenue even during overload scenarios.
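The idea of weighted, priority-aware scheduling can be sketched in a few lines. This is a simplified, self-clocked variant of weighted fair queuing written for illustration only; it is not Aperture's scheduler, and the class name, the workload names, and the weights (4 for "checkout", 1 for "batch") are assumptions.

```python
import heapq
from itertools import count


class WeightedFairQueue:
    """Toy weighted fair queue: higher weight means a larger share of admissions."""

    def __init__(self):
        self._heap = []            # entries: (finish_tag, seq, workload, request)
        self._last_finish = {}     # workload -> finish tag of its latest request
        self._virtual_time = 0.0   # finish tag of the most recently admitted request
        self._seq = count()        # tie-breaker that preserves arrival order

    def enqueue(self, workload, weight, request, cost=1.0):
        # A request starts after the previous request of the same workload,
        # and its virtual duration shrinks as the workload's weight grows.
        start = max(self._virtual_time, self._last_finish.get(workload, 0.0))
        finish = start + cost / weight
        self._last_finish[workload] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), workload, request))

    def dequeue(self):
        # Admit the request with the smallest virtual finish tag.
        finish, _, workload, request = heapq.heappop(self._heap)
        self._virtual_time = finish
        return workload, request

    def __len__(self):
        return len(self._heap)


queue = WeightedFairQueue()
for i in range(6):
    queue.enqueue("checkout", weight=4, request=f"checkout-{i}")
    queue.enqueue("batch", weight=1, request=f"batch-{i}")

# Pretend the service only has capacity for five requests right now.
admitted = [queue.dequeue()[0] for _ in range(5)]
print(admitted)
```

With these weights, four of the five admitted requests belong to the checkout workload and one to the batch workload. The low-priority workload is slowed down rather than starved, which is the property that distinguishes fair queuing from strict priority ordering.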
Aperture Agents monitor golden signals using a built-in telemetry system and a programmable, high-fidelity flow classifier used to label requests based on attributes such as customer tier or request type. These metrics are analyzed by the controller.
The controller is powered by always-on, dataflow-driven policies that continuously track deviations from service-level objectives (SLOs) and calculate recovery or escalation actions. The policies running in the controller are expressed as circuits, much like circuit networks in the game Factorio.
For example, a gradient control circuit component can be used to implement AIMD (Additive Increase, Multiplicative Decrease) style counter-measures that limit the concurrency on a service when response times deteriorate. Advanced control components like PID can be used to further tune the concurrency limits.
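As a rough illustration of the AIMD idea, the sketch below adjusts a concurrency limit from observed latency. It is plain Python rather than an Aperture circuit, and the function name, the 100 ms latency setpoint, the step sizes, and the sample latencies are all assumptions chosen for the example.

```python
def aimd_step(limit, latency_ms, setpoint_ms,
              increase=1.0, decrease=0.5,
              min_limit=1.0, max_limit=1000.0):
    """Return the next concurrency limit given the latest latency sample."""
    if latency_ms > setpoint_ms:
        # Response times have deteriorated: back off multiplicatively.
        new_limit = limit * decrease
    else:
        # The service is healthy: probe for more capacity additively.
        new_limit = limit + increase
    # Keep the limit within sane bounds.
    return max(min_limit, min(max_limit, new_limit))


limit = 100.0
history = []
for latency_ms in [80, 90, 250, 300, 95]:
    limit = aimd_step(limit, latency_ms, setpoint_ms=100)
    history.append(limit)

print(history)
```

In this run the limit creeps up while latency stays under the setpoint, halves on each of the two slow samples, and then resumes its slow climb. The asymmetry is deliberate: backing off quickly protects the service, while recovering slowly avoids immediately re-triggering the overload.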
Aperture’s Controller is comparable in capabilities to autopilot in aircraft or adaptive cruise control in a Tesla.
Aperture can be inserted into service instances with either Service Meshes or SDKs:
- Service Mesh: Aperture can be deployed with no changes to application code, using Envoy. It latches onto Envoy’s External Authorization API for control purposes and collects access logs for telemetry purposes. On each request, Envoy sends request attributes to the Aperture Agent for a flow control decision. Inside the Aperture Agent, the request traverses classifiers, rate-limiters, and schedulers, before the decision to accept or drop the request is sent back to Envoy. Aperture participates in the OpenTelemetry tracing protocol as it inserts flow classification labels into requests, enabling visualization in tracing tools such as Jaeger.
- Aperture SDKs: In addition to service mesh insertion, Aperture provides SDKs that can be used by developers to achieve fine-grained flow control at the feature level inside service code. For example, an e-commerce app may prioritize users in the checkout flow over new sessions when the application is experiencing an overload. The Aperture Controller can be programmed to degrade features as an escalated recovery action when basic load shedding is triggered for several minutes.
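The per-request decision path described above (classify, rate-limit, schedule, then accept or drop) can be sketched as a toy pipeline. This is not Aperture's SDK or Envoy's External Authorization API: `ToyAgent`, `classify`, `TokenBucket`, the request attribute names, and the rule that low-priority flows may use only half of the concurrency limit are all hypothetical, and the scheduler is reduced to a simple concurrency check.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_s: float
    tokens: float
    last_ts: float

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, then try to spend one token.
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def classify(attrs: dict) -> dict:
    """Derive flow labels from request attributes (hypothetical rules)."""
    path = attrs.get("path", "")
    return {
        "user": attrs.get("user", "anonymous"),
        "feature": "checkout" if path.startswith("/checkout") else "other",
    }


class ToyAgent:
    def __init__(self, rate_per_s, burst, concurrency_limit, low_priority_share=0.5):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.concurrency_limit = concurrency_limit
        # Assumption: non-checkout flows may only use part of the limit.
        self.low_priority_limit = concurrency_limit * low_priority_share
        self.buckets = {}
        self.in_flight = 0

    def check(self, attrs: dict, now: float):
        labels = classify(attrs)
        # Stage 1: per-user rate limiting for abuse prevention.
        bucket = self.buckets.setdefault(
            labels["user"],
            TokenBucket(self.burst, self.rate_per_s, self.burst, now),
        )
        if not bucket.allow(now):
            return "drop", "rate_limited", labels
        # Stage 2: load shedding, with headroom reserved for checkout.
        limit = (self.concurrency_limit if labels["feature"] == "checkout"
                 else self.low_priority_limit)
        if self.in_flight >= limit:
            return "drop", "load_shed", labels
        self.in_flight += 1
        return "accept", "ok", labels

    def done(self):
        # Call when an accepted request finishes.
        self.in_flight = max(0, self.in_flight - 1)


agent = ToyAgent(rate_per_s=1.0, burst=2, concurrency_limit=2)
r1 = agent.check({"user": "a", "path": "/browse"}, now=0.0)
r2 = agent.check({"user": "b", "path": "/browse"}, now=0.0)
r3 = agent.check({"user": "b", "path": "/checkout/pay"}, now=0.0)
r4 = agent.check({"user": "a", "path": "/checkout"}, now=0.0)
r5 = agent.check({"user": "a", "path": "/browse"}, now=0.0)
for result in (r1, r2, r3, r4, r5):
    print(result[:2])
```

In this run the second browse request is shed while a checkout request arriving afterwards is still accepted, which mirrors the e-commerce example of favoring users in the checkout flow during overload. The last request from user "a" is rejected by the rate limiter because that user's burst allowance is exhausted.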
Bringing it all together
Our team at FluxNinja is no stranger to the plight of DevOps and SRE teams and the operational challenges they face. We previously built Netsil (acquired by Nutanix) which pioneered a network-centric approach to microservices monitoring. Aperture results from technical insights and customer perspectives gathered over many years of operating directly in the field with large-scale web applications.
Reliability can be a significant competitive advantage, and at FluxNinja we believe that the path to reliability at web scale begins with implementing effective flow control.