NSDI '23 Fall Paper #452 Reviews and Comments
===========================================================================
Paper #452 Practical Intent-driven Routing Configuration Synthesis


Review #452A
===========================================================================

Overall merit
-------------
3. Weak accept (I think it should be accepted, but I am fine if others want to reject.)

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
This paper describes Aura, a system for expressing routing policies in use at Meta. Aura consists of a policy language, a verifier, and a compiler that generates BGP configurations on individual switches based on the policy. Compared to prior work, the authors use knowledge of the datacenter topology to: 1) pre-compute paths and backup paths, 2) specify policies on abstract base paths, and 3) encode policies and source-controlled path choices using community attributes. The authors show that their work speeds up network reconfiguration time and that RPL policies are more straightforward for operators to grok and revise than raw BGP configurations.

Comments for authors
--------------------
Thank you for submitting this paper to NSDI. I enjoyed reading about how Meta applies routing policies to datacenter switches. The key insight — that the symmetry of the datacenter topology can be used to abstract out paths and specify policies on them — was quite interesting. Some comments below:

1. Being an industry-track paper, it would be valuable to point out opportunities for future research and where academic research is not in line with Meta's needs.

2. It makes sense that operators would want to specify routing policies at a higher level of abstraction than BGP router configuration files. It also makes sense that in single-domain environments there might be a mismatch between BGP's existing community attributes and the types of policies operators wish to express. But I was left wondering about the size of this mismatch. For example, I understand the standard method for advertising backup paths to ASes on the Internet is not in line with Meta's policy needs (i.e., do not advertise further if a rack-level switch is the third hop). A more explicit discussion of these mismatches would provide valuable insights. (A rough sketch of the kind of tag-based rule I have in mind appears after this list.)

3. Are there any desired policies that BGP communities are not sufficient to implement?

4. I was a bit confused by the discussion of using RPL / BGP community tags to evolve to OpenR. The discussion seemed to imply that OpenR can work with BGP. Is this true? Slightly more background on OpenR would be useful to motivate the evolvability argument and scope the type of new routing mechanisms Aura can incorporate or use.

5. Though routes are specified in policies as end-to-end paths, they must still be selected by BGP's path selection on individual switches. It would be useful to understand how long the resulting churn lasts and whether anything about the policies extends or decreases churn length.

6. How are switch failures first detected so that community attributes specifying the backup path can be advertised?

7. I feel the discussion about SDN requires a bit more nuance. Effectively, Aura does seem conceptually like SDN, since it has a centralized policy generator, but its policies are implemented via BGP.
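To make comment 2 concrete, here is a rough sketch of the kind of hop-scoped, community-tagged rule I have in mind. This is purely illustrative, not the paper's implementation: the community value, device-type names, and function are invented.

    # Hypothetical illustration (not the paper's code): a community value that
    # marks a route as a backup path, plus a propagation rule of the form
    # "do not advertise further once a rack-level switch (RSW) is the third hop".
    BACKUP_PATH_TAG = "65000:901"   # invented community value

    def should_advertise(route_tags, hop_types):
        """Decide whether a switch should re-advertise a received route.

        route_tags: community strings carried by the route.
        hop_types:  device types of the hops traversed so far, e.g. ["FSW", "SSW", "RSW"].
        """
        if BACKUP_PATH_TAG not in route_tags:
            return True                       # primary paths propagate normally
        if len(hop_types) >= 3 and hop_types[2] == "RSW":
            return False                      # backup route stops at the rack layer
        return True

    # A backup route whose third hop is a rack switch is suppressed; shorter ones propagate.
    assert should_advertise({BACKUP_PATH_TAG}, ["FSW", "SSW", "RSW"]) is False
    assert should_advertise({BACKUP_PATH_TAG}, ["FSW", "SSW"]) is True

Even a small catalogue of rules like this, contrasted with what standard community conventions can express, would make the size of the mismatch much clearer.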

Review #452B
===========================================================================

Overall merit
-------------
4. Accept (Good paper, I'll advocate for it)

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
Network configuration is hard, especially at scale, and frequently results in disruption. This operations-track paper presents Meta's Aura, a routing synthesis system. It includes a high-level language, RPL, for describing the desired behavior and a compiler for generating switch configurations. The paper also shares operational experiences in building and deploying a configuration synthesis system and discusses various interesting anecdotes/statistics (e.g., the reduction in engineering and time investment via automating routing configuration verification and generation) that can be educational and useful for informing future research.

Comments for authors
--------------------
Thanks for submitting to NSDI! I enjoyed reading this paper; some of the anecdotes presented here were new (to my knowledge) and interesting. I think this paper can encourage future research. Here is my "wishlist" from reading the paper; I hope you include some of these items in the revised version:

* In the broader testing/verification frameworks deployed at Meta, where does Aura fall? Is it deployed in conjunction with other routing verification techniques, or is it the primary method for routing verification in the five deployed data centers? Are there classes of desired routing properties that are out of scope for Aura?

* It would have been interesting to see the frequency and the potential impact of the errors/bugs that Aura detects and prevents. Section 7 (Operational Experience) does list a couple of examples, but they seem relatively minor (e.g., taking "the longer backup path, instead of the shorter one"). These examples actually made me wonder whether other testing and verification techniques may be catching potentially more harmful errors.

* The discussion about inconsistencies in the transient states is confusing and at times seems contradictory. The paper initially argues that the deployability hurdle for research proposals targeting this very problem is their high update time. That is a valid point: despite optimizations for accelerating consistent-update techniques, one can imagine that the ultimate speed may not be fast enough at scale. However, the actual operations that the paper later discusses (the 3-step approach in Section 7 of draining switches, waiting for convergence, and deploying BMP monitoring) seem very time-consuming. It would help a lot to provide more data about the time cost of these operations, as well as the constraints and requirements that Meta might have for updates. That can provide the community with some insights into this context.

* Figure 2 was interesting! I wouldn't have imagined that an average policy change requires reconfiguring at least 25.6% of switches! It would be interesting to see some data about these policy changes (beyond the schematic and, I assume, over-simplified Figure 1), e.g., what classes of policy changes are frequent, and what are their frequencies? Also, what are the locations of these affected switches in the topology?

* The approach discussed in Section 5.3 for verification is intriguing, because one can imagine that this emulation-based technique can't capture all potential issues. Were there instances of routing configuration errors that your emulator failed to flag?

* It was interesting to see that faster verification has enabled deploying a new technology (a new AI backend topology)!
Is this an isolated case, or has Aura enabled the rapid deployment of other novel ideas?


Review #452C
===========================================================================

Overall merit
-------------
2. Weak reject (I think it should be rejected, but I am fine if others want to accept.)

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
The paper describes an intent-driven BGP configuration synthesis method that is deployed within Meta datacenters. The synthesis is for intra-datacenter routes. The paper also speaks to extending intent-driven methods to handle dynamic changes such as switch failures and service movements.

Comments for authors
--------------------
Thanks for submitting to NSDI. I appreciated several interesting anecdotes, such as the duration to reconfigure a router (5.2 hours) and the expected growth (doubling of routers every five years). I also appreciated the operational insights.

However, I found that the core techniques in Sections 4 and 5 are presented in the abstract, without making a clear connection to the closest prior work and then speaking about the delta. The usage of intent to model aspects such as service movement and switch failures seems somewhat off-topic to me: for the former, shouldn't routing be primarily about path selection, with service discovery happening elsewhere (e.g., using DNS), except for anycast services? For the latter, why not rely on standard fault-recovery methods? I know BGP fault recovery is not ideal and there may be value in pushing for a better alternative in datacenters, but I am hard pressed to make a case for fault recovery as intent specification. Also, adding multiple configurations to router state is not trivial, since router memory is not infinite, but the paper does not clearly address this constraint. I was also interested in how load balancing is done and ensured in spite of network faults, as well as how capacity skew is accounted for in terms of spreading load, but did not find any intent related to these aspects. Is this not a first-order issue?

Overall, I do not have a good sense of which intents are supported by Aura or are in use in production. I would have been much more positively inclined if appropriate examples of routing configuration intents were shown that draw a contrast with the state of the art. With Figure 1, I do not understand why some of these are intents for a routing layer. For the other intents, I do not see a clear delta over Propane. The key techniques in Sections 4 and 5 are also hard to translate into meaningful value for these intents. Robotron, at SIGCOMM '16, covered a similar space, and it would perhaps help to speak to the evolution to Aura.

Details:

RE: `base paths`, compare/contrast with pathlet routing perhaps? The context is very different, but the notion of stitching together a path from a succinct set of segments is used elsewhere.

`We define a new declarative language called RPL`: what are the specific differences over, say, SQL? New operators? New atoms? Looking carefully at Table 1, I cannot tell what the operators are here.

Configuration staging requires space in the routers. I wonder how many different configurations can be placed in a router a priori.
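To be concrete about what I mean by staging costs, here is a toy model of my reading of configuration staging. It is my illustration, not the paper's design: the class names, states, and policy strings are invented.

    # Toy model of staging (invented names): one pre-installed policy variant per
    # switch state; a state change selects a variant instead of pushing new config.
    from dataclasses import dataclass, field

    @dataclass
    class StagedConfig:
        state: str          # e.g. "LIVE", "DRAINED", "WARM"
        policy_lines: list  # rendered routing-policy statements for that state

    @dataclass
    class Switch:
        name: str
        staged: dict = field(default_factory=dict)  # state -> StagedConfig
        active_state: str = "LIVE"

        def stage(self, cfg: StagedConfig):
            # Every staged variant consumes router memory; this is where the
            # "how many configurations fit a priori?" question bites.
            self.staged[cfg.state] = cfg

        def activate(self, state: str):
            # Fast path: no config push, just select an already-staged variant.
            if state not in self.staged:
                raise KeyError(f"{self.name}: no staged config for {state}")
            self.active_state = state

    sw = Switch("rsw001")
    sw.stage(StagedConfig("LIVE", ["prefer local service SL", "permit backbone"]))
    sw.stage(StagedConfig("DRAINED", ["deny all service prefixes"]))
    sw.activate("DRAINED")  # drain without recompiling or re-deploying configuration

If this reading is roughly right, some numbers on per-variant memory cost and on how many variants a production switch actually holds would address the concern.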
RE: `we stress one overlooked goal of configuration synthesis: producing configuration changes that lead to the shortest time to complete the reconfiguration, and making the process automated, nondisruptive to production traffic and with minimal operator burden`: the WAN traffic-engineering works, albeit in a different context, address some of these concerns. See, for example, "On consistent updates in software-defined networks".

Reachability intents should have a \forall perhaps? That is, any two hosts should have a path between them?

The list of intents in Figure 1 is thought-provoking; however, the gamut of intents is unclear. Do the intents primarily encode reachability? What about having a high-capacity path? Or a low-latency path?

The various states listed in Figure 1 (e.g., `WARM`, `LIVE`, `DRAINED`) are only defined in Section 2.3. I do not believe these are standard terms; perhaps define them at first use?

Figure 1: Should you swap the labels of FAUU and FADU?

It could help to speak about the fraction of inter-rack paths that go through just the spine switches. That is, it is unclear whether traffic within a datacenter needs to touch the FA layer at the top of Figure 1.

Can you speak to how complex the intent spec is for your production use? Is it 10 intents or a thousand intents? A taxonomy of the kinds of intents being used would also be of interest.

I am not sure why `a service migrating from one data center to another` should affect the routing logic inside the datacenter. (From `Intent Changes` in 2.2.)

Do your intents also specify security or access control perhaps? Why would that be the right choice, to couple routing and ACLs?

Can you kindly give concrete examples of how a `better load balancing strategy, a more resilient failure recovery` change the intent? Please show the intent before and the intent after.

In `Switch state changes`, `we need to gracefully remove the impacted switches from serving traffic, to minimize disruptions to services`: I am not sure what the effect is on the underlying intents. If it is to ensure that all hosts retain reachability etc. even when a certain switch changes, compare/contrast with Statesman. If it is to force the choice of one or more configurations that are applicable or appropriate when switches fail, compare/contrast with FFC, fast failover, MPLS reroute, and other pre-computed backup or recovery schemes.

In 2.3, `intents I1 to I3 specify preference for local service SL over global service SG only for RSWs in Pod1 and Pod2`: I am not sure why this is a routing function. Should we think of this as anycast configuration perhaps? There would be some value in putting this into the routing layer; however, why not defer service selection to a different layer?

`To keep the production network operational, operators need a synthesis approach, which generates multiple configurations, for different switch states.` <-- I believe this is more of a method than a requirement. That is, I am not sure you meant to say that no other method exists to keep the production network operational.

`The median reconfiguration time per switch is 5.2 hours` <-- What exactly happens during a reconfiguration? Why so long?

3.1: when the topology is being abstracted and new nodes are added among those that have the same function to account for failures, what happens if multiple nodes with that function fail? Example: multiple switches in the SSW layer. I am not sure how this intuition translates to a meaningful algorithm.
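To illustrate why I am unsure the intuition scales, here is a rough sketch of the abstraction as I understand it. This is my guess, not the paper's algorithm, and all names are invented: grouping switches by function is straightforward, but the number of failure combinations that staged configurations would need to cover grows combinatorially.

    # Rough sketch (invented names, not the paper's algorithm): group concrete
    # switches by function, then enumerate the failure combinations within one
    # group that pre-computed configurations would have to cover.
    from itertools import combinations

    def abstract_topology(switches):
        """Group concrete switches by their function (e.g. RSW/FSW/SSW)."""
        groups = {}
        for name, function in switches:
            groups.setdefault(function, []).append(name)
        return groups

    def failure_states(group, max_failures):
        """Yield every combination of up to max_failures failed switches in a group."""
        for k in range(1, max_failures + 1):
            yield from combinations(group, k)

    switches = [("ssw1", "SSW"), ("ssw2", "SSW"), ("ssw3", "SSW"),
                ("fsw1", "FSW"), ("fsw2", "FSW")]
    groups = abstract_topology(switches)
    # Even for 3 spine switches and at most 2 concurrent failures there are 6 states:
    # ('ssw1',) ('ssw2',) ('ssw3',) ('ssw1','ssw2') ('ssw1','ssw3') ('ssw2','ssw3')
    print(list(failure_states(groups["SSW"], max_failures=2)))

Clarifying how many concurrent failures per function the abstraction is designed to cover, and what happens beyond that, would make the algorithm much easier to evaluate.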
3.1: storing multiple configurations at routers, as noted above, can require a lot of router memory to encode.

Figure 5 and having two different policies in place at the same time to handle migration and maintenance is not technically novel. Consider VROOM ("Virtual Routers on the Move: Live Router Migration as a Network-Management Primitive"), RCP ("The Case for Separating Routing from Routers"), and the router grafting work ("Seamless BGP Migration with Router Grafting"), which all have similar ideas, albeit in other contexts.


Review #452D
===========================================================================

Overall merit
-------------
3. Weak accept (I think it should be accepted, but I am fine if others want to reject.)

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The authors provide a novel method for scalable routing configuration synthesis that reduces the downtime and operational burden of big data centers. They introduce the concepts of base paths and configuration staging. They automate the synthesis process using their custom language and compiler that translates intents to BGP community tags. (They also support OpenR.) The solution is deployed and evaluated using production data centers.

Comments for authors
--------------------
Thank you for submitting this very interesting work to NSDI! I see the following strengths:

- a novel method for scalable routing configuration synthesis: base paths and configuration staging raise important questions
- large-scale real-world measurements
- the proposed solution is implemented, deployed, and validated
- high business value

That said, I also had some doubts, regarding:

- the lack of a built-in consistency guarantee
- some parts of the paper are somewhat high-level and a bit hard to follow (not "actionable": the takeaway is unclear)
- the general relevance of the approach, e.g., in the context of SDN as mentioned, is not clear
- some figures are hard to interpret

Detailed evaluation:

- Section 2.3: consider using different names for the condition categories. Based on the category names, the second category could be a subset of the first category. If I understand correctly, you have "where" and "when" conditions.
- "Exceptions in failure scenarios": the described scenario doesn't sound like a "failure" but rather expected maintenance. If I understand correctly, it is a "precedence" feature between overlapping conditions.
- Figure 2/a: the CDF should be normalized to the [0,1] interval; I think you should switch x and y.
- Section 2.4: "On the one hand, changing O(10K) number of RSWs can take much longer than reconfiguring O(5K) of FSWs." -> Are the numbers correct here? It sounds obvious this way, since we are talking about ~2x more switches.
- Section 2: I think additional background about BGP and its related features would greatly improve the paper's understandability.
- Figure 3: It is not clear to me what is automated and what is manual input. The code snippet suggests that base paths are defined by the programmer as well.
- Page 6: Please introduce FBOSS before using the name.
- Table 1: I think the syntax rules do not really help in understanding the paper. Consider moving it to the appendix if you have page-limit issues.
- Figure 10/a: How did they collect data after the submission deadline?
- Figure 10/b,c: Consider switching x and y; then it would become a CDF that is easier to interpret.
- Section 6.3 / Aura's Performance: How did you calculate these runtime values (e.g., "validation takes 4.3 seconds")? Are these average numbers across all configurations?
- Section 8: "However, re-configuring in SDN context is different, than switch reconfiguration, as forwarding state is changed directly from a centralized controller, avoiding the challenges of a large distributed network." -> Could you clarify your standpoint related to SDN? Would this centralized control plane solve all of your issues, or does SDN simply overlook important aspects of a real data center?
- Do you plan to open-source your tools?

Typos:
- Figure 1: "Spline" -> "Spine"
- Page 3: "intents.In" -> "intents. In"
- Page 8: "maintaine,d" -> "maintained"
- Page 12: "Coordination with non-routing policies" -> "Coordination with non-routing policies:"


Review #452E
===========================================================================

Overall merit
-------------
4. Accept (Good paper, I'll advocate for it)

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper proposes Aura, a system for synthesizing routing configurations for Meta's data center networks. Aura provides a high-level domain-specific language (DSL) called Routing Policy Language (RPL) that allows operators to specify blocks of the topology at their desired level of granularity and a collection of different policies that govern the propagation of different prefixes over abstract paths in that topology. Aura then compiles these policies into BGP configurations (or potentially other protocols), staging multiple configurations in the switches so they can be dynamically activated and deactivated at runtime.

Comments for authors
--------------------
I really enjoyed reading this paper. As an operational-systems-track paper, it provides several insights into applying configuration synthesis techniques in real-world networks. Here are a few that stood out to me:

- In 2.4, the paper provides a detailed discussion of the overhead of switch reconfigurations (e.g., the need to transfer services and drain traffic) and statistics about how long it takes in practice (a median of 5.2 hours per switch). That is, it demonstrates the need for pre-compiling multiple configurations, as opposed to just reconfiguring switches when there is a policy change.

- The paper provides evidence of just how dynamic networks have become when it comes to policies and reconfigurations. For example, there are statistics about drain events (745K in a month, an average of 8K per day) and policy changes (Section 6 preamble and Figure 9). This can motivate further research into handling such a degree of dynamism.

- While prior work has looked into high-level policy description languages, this paper demonstrates the need for extra language constructs to support the needs of today's networks, such as more fine-grained ways to specify scopes, specifying different device "states", and how the policy depends on the device state.

Overall, this paper demonstrates the necessity, benefits, and some practical challenges of applying formal and automated approaches in real large-scale networks. This, IMO, would be quite valuable to the community.


Comment @A1 by Soudeh Ghorbani (Shepherd)
---------------------------------------------------------------------------
This paper was discussed online. We appreciated the paper's anecdotes and operational insights and believe it can encourage further research. That said, we also believe that the paper is not always clear on the definition and scope of intents, and on its acknowledgment of and relationship with existing work.
Moreover, some of the technical details and background on existing systems are too high-level and/or missing. Ultimately, we decided to accept the paper and work with the authors on the above points during shepherding.