Tuesday, October 16, 2018

Demystifying VRFs, Part 1 - The Ugly Theory

In my last post, I explored changing the model of an IPAM application to better describe an entity called VRF. While my result is (I'd humbly say) closer to describing that entity, I'm not completely satisfied. So, to better understand the thing I'm trying to model, I tried to gather some solid information about it. And, as always, along with some really cool stuff, I found that there's also a ton of misinformation and confusion surrounding VRFs, as well as an aura of mystique, especially in the minds of people new to the networking field.

These days you'd be hard pressed to find a service provider network not using VRFs in some form or another - from Cisco's VRF-lite to full-blown MPLS L3VPNs - and that goes for many enterprise networks as well. Therefore a good understanding of what VRFs essentially are, how and when we implement them, and what tradeoffs and caveats we must be wary of is at least beneficial to a network engineer, maybe even important. So, I decided to write a couple of posts on VRFs, and try to throw together some of the answers to the questions that I have struggled with in the past while trying to see the "big picture" about them.

So, without further ado, let's start this series from the beginning and ask the two most obvious questions - what the heck are VRFs and why would we need them anyway?

Mo routers mo problems


You may recall that a useful way of thinking about network devices in general (and Layer-3 devices in particular) is to think of them as a combination of three distinct (yet interconnected and interdependent) entities:

  • Control plane;
  • Forwarding plane (also called "data plane");
  • Management plane.

In the context of a router, the following functions loosely map to each of the planes:

  • Control plane is the part of the router hardware and software (but mostly software) which is in charge of gathering, storing and processing network reachability information - the most important of which are the routes. That's the part that takes the interface addresses and static routes, runs the dynamic routing protocols and takes all of their inputs, and produces the routing table.
  • Forwarding (or data) plane is the software and hardware (but mostly hardware) which is programmed by the control plane functions, and is tasked with actually processing (routing) the packets according to the reachability information gathered by the control plane. This is where the "rubber meets the road" - packets are received, dequeued, decapsulated, outgoing interfaces are determined, packets are mangled or dropped or punted, or, if they're lucky - new headers are (re)written, packets are enqueued and sent on their merry way.
  • Management plane is everything else that allows us as operators to have both an insight into the box and to interact with it and configure it.

So to see where VRFs fit in there, we needn't look farther than the acronym itself: VRF (although it has a long history of being interpreted differently) currently stands for "Virtual Routing and Forwarding".

In a nutshell, VRF is a virtualization feature on layer 3 devices which allows us to have several virtual routers in a single physical router box. More precisely: it is a feature that allows us to segment the router's control plane into several virtual instances - that's the "Virtual Routing" part - and do the same to its forwarding plane (since it is dependent on the control plane) - that's the "Forwarding" part.

The product of such segmentation of control plane and forwarding plane is the creation of Layer 3 Virtual Private Networks:

  • They're Layer 3 because the segmentation process results in creation of routed (layer 3) networks,
  • They're Virtual because we're creating virtual layer 3 topologies on top of shared physical topology,
  • They're Private because those virtual topologies (networks) are separate and disjoint from each other (barring any configuration to the contrary).

So it's useful to keep one thing in mind: VRFs are intrinsically tied to Layer 3 VPNs - implementation of VRFs creates L3 VPNs, and vice versa - you cannot create a Layer 3 VPN without configuring VRFs.

Let's answer the other question - what are those good for? Well, VRFs allow us to:

  1. Save money - instead of adding another physical device whenever we need to have logically separate routing domains, we can just add another VRF and all our troubles are solved [1] and the management loves us because we're not asking for more budget for new gear. [2]
  2. Have and operate overlapping address spaces - the "customers" of your network can use the same addresses/subnets in their networks [3], and since you can now place them in different routing domains (VRFs) - you can easily [4] manage and support all [5] of their crazy addressing notions and routing demands.
  3. Think more logically about the networks - we can keep routing tables smaller when we segment the space, as opposed to a single huge table. [6] That makes them easier to operate on [7] and easier to understand [8].
  4. Isolate the routing information - which can be a boon if your customers are paranoid about their security [9] and want to keep their routing information private to themselves and subsequently disable access to their infrastructure from your other customers.
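
To make the overlapping address space point a bit more concrete, here is a minimal Python sketch. It assumes a toy model in which each VRF is nothing more than its own dictionary of routes - the VRF names, prefixes and next hops are all made up for illustration, and no real router stores its routes this way.

```python
import ipaddress

# Toy model: one independent routing table (a dict) per VRF.
# VRF names, prefixes and next hops are hypothetical.
routing_tables = {
    "CUSTOMER_A": {},
    "CUSTOMER_B": {},
}


def add_route(vrf, prefix, next_hop):
    """Install a route into the routing table of one VRF only."""
    routing_tables[vrf][ipaddress.ip_network(prefix)] = next_hop


# Both customers use the same subnet, which would collide in a single table.
add_route("CUSTOMER_A", "10.0.0.0/24", "192.0.2.1")
add_route("CUSTOMER_B", "10.0.0.0/24", "198.51.100.1")

prefix = ipaddress.ip_network("10.0.0.0/24")
for vrf, table in routing_tables.items():
    print(vrf, prefix, "via", table[prefix])
```

The same prefix lives in both tables with a different next hop, and neither customer's table knows anything about the other's.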

VRFs are available on practically all modern (and not-so-modern) routing platforms - even on Linux hosts. There are some terminological differences between vendors, so I'll try and list those for some of the vendors:

  • Cisco uses two names for this "splitting-routers-into-independent-virtual-instances" technology, with some overlapping functionality and syntax:
    • VRF-Lite;
    • VRF;
  • Juniper uses a concept of "routing-instances", with two types relevant for this discussion (there are others, but they relate to other types of VPNs):
    • virtual-router - analogous to Cisco's VRF-Lite;
    • vrf - for MPLS L3VPNs;
  • Ericsson IPOS and SEOS devices use a concept of "context", which is much more general than "regular" VRFs: it describes virtualization not only of the L3 functionality, but also the router's management plane, so with Ericsson "contexts" you get even individually managed virtual devices, as opposed to "classic" VRFs.
  • Nokia (ex-Alcatel) routers use a concept of "vprn", which stands for "Virtual Private Routed Network", and it's used for both L3 VPNs with or without MPLS transport.

One more thing that I keep mentioning here but not clarifying: what is the difference between the "VRF-Lite" and "just" VRFs? Well, there are quite a few differences, and we'll cover them in more detail in future posts in this series, but for now - let's say that the main difference between the two is MPLS: you can create Layer 3 VPNs using VRF-Lite without having an MPLS transport network. On the other hand, "full" VRFs need to have an MPLS transport infrastructure in place.

How do they implement it on the boxes and why should I care?


There are a couple of ways we could approach this problem. Today in 2018 AD, with virtualization being so ubiquitous - we could fully virtualize the router. We could go all in - virtualize all of the functions, including the management plane, and create as many fully-virtual routers as the hardware permits. While it is a valid approach - and widely used by many vendors today (Cisco Nexus 7k VDCs, Ericsson IPOS and SEOS contexts, all sorts of virtual routers and appliances running on commodity x86 hardware - Juniper vMX and vSRX, Cisco IOS XRv and IOSv, etc.) - we need to remember that the need for such segmentation was evident way, way before such "full" virtualization was so commonplace. In those days, let's say late nineties - virtualization was, of course, not an unknown concept, but was limited to hardware platforms with abundance of resources which could handle the resource hunger of virtualization - think mainframes. Routing platforms in those days hardly fit the description of hardware which could afford virtualization as we know it today. So the early designers needed to adopt a more targeted approach, and hence their decision to focus on virtualizing only control and forwarding plane functionality.

As previously mentioned, the control plane is usually implemented in the router software - the routing protocols are run as software processes, and the resulting routing tables are (usually) kept in RAM. This consequently means that the control plane lends itself to relatively easy virtualization - we can initialize and populate additional structures in RAM to hold the new routing tables, and we can spin up new routing protocol processes to interact with those new routing tables. That would incur some overhead on the box, but it would be manageable, provided enough RAM and CPU power was available.

Forwarding plane presents a bit more of a challenge: it is commonly implemented in hardware, with a plethora of different approaches to implementing it, and a whole lot of different hardware elements - TCAM for storing and matching forwarding information, and all sorts of proprietary packet processors and ASICs in charge of the packet mangling and forwarding. So, how would we go about introducing new entities in functions which are so tied to the hardware?

For starters, the easiest approach would be to throw additional dedicated hardware at the problem - we could install additional store-and-match-thingies (e.g. TCAM) as well as forwarding-thingies (e.g. ASICs) into the box to process the traffic for the newly created routing instances. But the drawbacks of this approach are immediately obvious:

  • Store-and-match-thingies tend to be both expensive and power-hungry - the price of our devices and the price of running them with hardware VRFs would skyrocket;
  • Same thing applies for the forwarding-engine thingies, maybe even more so;
  • Scalability is a major issue: how many additional pieces of hardware should we add? Do we add, for example, 5 entities, and support max 5 additional VRFs? What if the customer needs only three? They paid for hardware which can service 5 VRFs and are using only 3 - not really what we would call great value for money. What if they need more? Equally bad.

So it's kind of obvious that any sort of static and dedicated assignment of hardware elements to VRFs is not the way to go. Therefore, most router platform designs today implement a logical sharing of hardware resources (both forwarding databases and forwarding engines). The forwarding databases are unified across a single hardware "store-and-match-thingy", but they are programmed by individual control-plane entities so that their entries contain identifiers which signify which VRF they refer to, allowing per-VRF lookups and per-VRF specific forwarding.
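
A rough way to picture that shared forwarding database is the Python sketch below. It is only an illustration under some loud assumptions: the VRF names, numeric IDs and interface names are invented, and the linear scan stands in for what real hardware does with TCAM or other lookup structures.

```python
import ipaddress

# Hypothetical VRF name -> numeric ID mapping; ID 0 stands for the global table.
VRF_IDS = {"global": 0, "CUSTOMER_A": 1, "CUSTOMER_B": 2}

# One shared forwarding database: every entry carries the VRF ID in its key.
fib = {}


def program_entry(vrf, prefix, out_interface):
    """What a control-plane instance would do: install an entry tagged with its VRF ID."""
    fib[(VRF_IDS[vrf], ipaddress.ip_network(prefix))] = out_interface


def lookup(vrf, destination):
    """Longest-prefix match restricted to entries tagged with the given VRF's ID."""
    vrf_id = VRF_IDS[vrf]
    address = ipaddress.ip_address(destination)
    best = None
    for (entry_vrf_id, network), out_interface in fib.items():
        if entry_vrf_id != vrf_id or address not in network:
            continue
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, out_interface)
    return best[1] if best else None


program_entry("CUSTOMER_A", "10.0.0.0/8", "Gi0/1")
program_entry("CUSTOMER_A", "10.1.1.0/24", "Gi0/2")
program_entry("CUSTOMER_B", "10.1.1.0/24", "Gi0/3")

print(lookup("CUSTOMER_A", "10.1.1.5"))  # Gi0/2 (most specific match in A)
print(lookup("CUSTOMER_A", "10.9.9.9"))  # Gi0/1 (falls back to the /8 in A)
print(lookup("CUSTOMER_B", "10.1.1.5"))  # Gi0/3 (same prefix, different VRF)
print(lookup("CUSTOMER_B", "10.9.9.9"))  # None (B has no covering route)
```

All entries sit in one structure, yet a lookup for one VRF never sees the entries of another, because the VRF ID is part of the match.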

It is important to note that all this implementation-specific stuff has some subtle consequences of importance not only to network platform designers, but to network engineers as well: all of this can have and most often does have an impact on aspects of performance and scalability. So always, always, always bear in mind: using VRFs - as with many other features and nerd-knobs - is always an exercise in compromises.

Let me illustrate that with a single example: the total number of (unicast) routes a certain platform supports does not change when you implement VRFs. It is a platform limitation representing the total quantity of (usually) forwarding database storage available on the box. This limitation does not magically go away when you partition the routing information into separate spaces - you will not have the total maximum number of routes per VRF, you will have it for the whole box. Which consequently means that your per-VRF maximum number of routes can be limited as well.
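
The arithmetic behind that is simple enough to sketch in a few lines of Python. The class and the limit of 1000 routes are hypothetical, and real platforms handle exhaustion in their own (often less graceful) ways - this only shows that the budget is shared.

```python
class ForwardingDatabase:
    """Toy box-wide route store with one hard limit shared by all VRFs."""

    def __init__(self, max_routes):
        self.max_routes = max_routes
        self.routes_per_vrf = {}

    def total_routes(self):
        return sum(self.routes_per_vrf.values())

    def install(self, vrf, count):
        """Install up to `count` routes for a VRF; return how many actually fit."""
        free = self.max_routes - self.total_routes()
        installed = min(count, free)
        self.routes_per_vrf[vrf] = self.routes_per_vrf.get(vrf, 0) + installed
        return installed


# Hypothetical platform limit of 1000 unicast routes for the whole box.
box = ForwardingDatabase(max_routes=1000)
print(box.install("CUSTOMER_A", 600))  # 600
print(box.install("CUSTOMER_B", 600))  # 400 - only what is left on the box
print(box.total_routes())              # 1000
```

The second VRF asks for 600 routes but gets only 400, because the first one has already eaten into the same box-wide limit.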

So as always - keep these compromises in mind when thinking of implementing VRFs and procuring hardware, and be ready to ask yourself and your vendor support and salespeople: "How does this $performance_thing or this $limit_thing change when we implement VRFs?".

How does it work in practice?


So, the first thing I wanted to know once I started messing with VRFs is - how do the new routing tables get populated? What ends up in there now?

Let's first see what ends up in a routing table in a "normal" scenario, when we're not using VRFs and have just a single (global or default) routing table:

  • For each L3 addressed interface (physical or logical) which is "up" - an entry for a connected network will be unconditionally inserted in a routing table (along with a specific /32 prefix in Cisco and Juniper devices, designated as "Local")
  • All manually configured static routes which pass muster (e.g. resolvable outgoing interface)
  • All routes from routing protocols which pass both protocol criteria ("best route" in that protocol) and successfully compete for installation against other sources (e.g. lowest administrative distance, or AD)
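
The competition between route sources in the last bullet can be sketched like this in Python. It is a simplification that assumes each protocol has already picked its own best route per prefix, and it uses the commonly cited Cisco default administrative distances - other vendors use different names and values. The prefixes and next hops are made up.

```python
from dataclasses import dataclass

# Commonly cited Cisco default administrative distances (lower wins).
ADMIN_DISTANCE = {"connected": 0, "static": 1, "ospf": 110, "rip": 120}


@dataclass(frozen=True)
class Candidate:
    prefix: str
    source: str
    next_hop: str


def build_routing_table(candidates):
    """For each prefix, keep the candidate whose source has the lowest AD."""
    table = {}
    for candidate in candidates:
        current = table.get(candidate.prefix)
        if (current is None
                or ADMIN_DISTANCE[candidate.source] < ADMIN_DISTANCE[current.source]):
            table[candidate.prefix] = candidate
    return table


# Hypothetical candidates: two protocols offer the same prefix.
candidates = [
    Candidate("10.1.0.0/16", "rip", "192.0.2.1"),
    Candidate("10.1.0.0/16", "ospf", "192.0.2.2"),
    Candidate("10.2.0.0/16", "static", "192.0.2.3"),
]
table = build_routing_table(candidates)
for prefix, route in sorted(table.items()):
    print(prefix, route.source, route.next_hop)
```

For 10.1.0.0/16 the OSPF candidate wins over RIP, and the static route is installed unopposed.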

It would be great if we could implement VRFs so that all the above still holds true, so that engineers working with virtual routing instances could have the same behavior as before, with some additions. So, let's review what you can generally expect to see in a (very lite) VRF-lite implementation:

  • We must instruct the router to create a new L3 entity called VRF, and provide it a (hopefully meaningful) name which we can use to refer to it. Most routing platforms allow VRFs to be configured separately for IPv4 and IPv6, so when you configure your VRF - you will have to specify whether you want it to be used for IPv4, IPv6 or both. When you configure a VRF, the router will (in most cases [10] ) perform the needed partitioning of control plane and forwarding plane elements, and create the numeric ID for the VRF that we mentioned before. If we're implementing VRFs as part of an MPLS L3VPN solution, we'll need to configure some additional information besides the name and IP protocol version, but we'll go into those when we tackle MPLS L3VPNs.
  • We must manually assign/configure L3 interfaces so that they "belong" to a VRF, that way the router knows which numerical ID to assign to connected routes - additional syntax is required.
  • The static route syntax will be modified so that it is clear from the configuration in which routing table we want it to be inserted.
  • The routing protocol syntax will be modified so that the router knows under which VRF they are running and into which routing table their "best routes" should attempt to be installed.
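
As a rough illustration of the interface assignment step, the Python sketch below assumes a toy model where every VRF has its own table and "global" stands for the default one. The function, interface names and addresses are hypothetical, and it only handles IPv4.

```python
import ipaddress
from collections import defaultdict

# One routing table per VRF; "global" stands for the default table.
routing_tables = defaultdict(dict)


def configure_interface(name, address, vrf="global", is_up=True):
    """Install the connected prefix and the /32 local route into the interface's VRF."""
    if not is_up:
        return  # a down interface contributes no routes
    interface = ipaddress.ip_interface(address)
    table = routing_tables[vrf]
    table[interface.network] = ("connected", name)
    table[ipaddress.ip_network(f"{interface.ip}/32")] = ("local", name)


# The same address can be configured twice, as long as the VRFs differ.
configure_interface("Gi0/1", "10.0.0.1/24", vrf="CUSTOMER_A")
configure_interface("Gi0/2", "10.0.0.1/24", vrf="CUSTOMER_B")
configure_interface("Gi0/0", "192.0.2.1/30")  # no VRF given: global table

for vrf, table in routing_tables.items():
    for prefix, (kind, interface_name) in table.items():
        print(vrf, prefix, kind, interface_name)
```

The connected and local routes land only in the table of the VRF the interface belongs to, so the global table never learns about 10.0.0.0/24.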

One more syntactic consequence of VRF implementation is that once VRFs are in place, all IP commands on the platform necessarily become "VRF aware". Syntax for displaying the contents of routing or forwarding tables needs to include a reference to the VRF we want to see, ping and traceroute commands must specify in which VRF the traffic will be generated, commands displaying information about routing protocols running in VRFs must specify the VRF, etc. It's easy to forget this, and then wonder why the box isn't giving us the information we need, or why we seemingly have reachability issues with ping/traceroute not working.

Final word of warning: even though we haven't necessarily messed with the router's management plane by introducing VRFs, if we've placed some of the router's management interfaces into a new VRF - we need to think about reachability of the router-generated and for-router traffic:

  • SSH (and hopefully not telnet) access,
  • AAA servers,
  • NTP servers,
  • Syslog/SNMP trap servers,
  • NetFlow export targets.

Conclusion


VRFs are a great tool for segmenting and virtualizing the network. They're a mature solution, with good support across vendors and platforms, and are unavoidable when we need to support overlapping address spaces and isolation on a shared hardware infrastructure. That's a pretty common requirement for networks these days, so I'd argue that you, as a network engineer, must know about them. While it's all nice and rosy, there are a couple of things you should keep in mind:

  • Using VRFs implies some limitations - be sure to know what those are when you're planning and implementing them;
  • Using VRFs means syntax for many IP-related commands becomes VRF-aware;
  • Placing the router's management interfaces into VRFs might change some of the for-router or router-generated traffic patterns, or necessitate some VRF-aware configuration additions.

In the next post, I'll go over configuring and exploring the possibilities of the litest of VRF-lite, on Cisco and Juniper devices. Stay tuned!




 [1] Disclaimer: chances are - when you start meddling with VRFs - your troubles have probably just begun.
 [2] It might actually grow, because vendors often charge extra for VRF support. And - no, you're still asking for too much money. No, the budget should be optimally negative. Nah, the business still hates ya and sees you as a cost center. Welcome to networking.
 [3] Terms and conditions apply. The rest of your network architecture might have something to say about that.
 [4] For a given value of "easily".
 [5] For a given value of "all".
 [6] That's usually marketing. You usually just end up partitioning the space differently, but all of the scale is still there. But it's hidden. But it looks a bit more tidy. But it's not.
 [7] It's not. You get additional syntax to remember.
 [8] It's not. You get additional headaches for trying to understand L3 VPNs.
 [9] They're probably not, but the good news is - you have to be for them.
 [10] Some older platforms actually required the platform to be set up and the forwarding plane elements pre-partitioned by the operator, in advance, in order to support VRFs - see Cisco SDM (Switch Database Management) templates, on Catalyst 3560-X or 3750-X switches.
