Thursday, 30 January 2014

IP Multicast: We Should Do So Much Better

IP multicast is a wonderful tool, but it’s very hard to control and debug at any decent scale. Several years ago I spent a few weeks at Delhi’s new airport trying to tune a network that carried all critical airport applications such as check-in, baggage handling and signage, but also had to transport video feeds from 2500 security cameras. The requirements were simple: each camera spits out two 4Mbit/sec feeds, one unicast to distributed DVRs, one multicast to a set of monitoring stations. With up to 40 monitoring stations looking at 16 feeds at a time, the network had to converge in less than 3 seconds for unicast and in less than a minute for multicast after any switch failure. At the time, 3 seconds was the magic threshold to keep IP phones connected to their signaling server. Piece of cake, right?
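To put rough numbers on those requirements, here is a small back-of-the-envelope calculation. It only multiplies the figures quoted above; it assumes decimal units (1 Gbit = 1000 Mbit) and ignores protocol overhead, and the variable names are mine.

```python
# Figures quoted in the requirements above.
CAMERAS = 2500
FEED_MBIT = 4             # each camera feed, in Mbit/sec
STATIONS = 40
FEEDS_PER_STATION = 16

# Every camera sources one unicast and one multicast feed.
unicast_sourced_mbit = CAMERAS * FEED_MBIT
multicast_sourced_mbit = CAMERAS * FEED_MBIT

# What the monitoring stations actually pull in when all of them are busy.
per_station_mbit = FEEDS_PER_STATION * FEED_MBIT
all_stations_mbit = STATIONS * per_station_mbit

print(f"unicast sourced:   {unicast_sourced_mbit / 1000} Gbit/sec")    # 10.0
print(f"multicast sourced: {multicast_sourced_mbit / 1000} Gbit/sec")  # 10.0
print(f"per station:       {per_station_mbit} Mbit/sec")               # 64
print(f"all stations:      {all_stations_mbit / 1000} Gbit/sec")       # 2.56
```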


IP Multicast can be broken down into two distinct problems: membership management and packet delivery. If you think of multicast as a selective broadcast, you somehow need to track who has requested to receive this broadcast. Local to a switch or router, this is the simplest part of IP Multicast. IGMP is used between end devices and their first router to indicate that the end device is interested in a specific multicast stream (or group). The router tracks these requests, and whenever a multicast packet arrives, it checks who has requested this group and forwards the packet out those ports. It is when you connect multiple routers together that this gets more complicated.
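The local bookkeeping described above can be sketched as a simple table from group to interested ports. This is a toy model, not a real IGMP implementation: the `MembershipTable` class, its method names and the port names are all made up for illustration, and timers, queries and IGMP versions are left out entirely.

```python
from collections import defaultdict


class MembershipTable:
    """Toy model of a router's local multicast state: group -> interested ports."""

    def __init__(self):
        self._ports_by_group = defaultdict(set)

    def join(self, group, port):
        # An end device on `port` reported interest in `group`.
        self._ports_by_group[group].add(port)

    def leave(self, group, port):
        ports = self._ports_by_group.get(group)
        if ports is None:
            return
        ports.discard(port)
        if not ports:
            # Nobody left on this group, so drop the state.
            del self._ports_by_group[group]

    def egress_ports(self, group, ingress_port):
        # Forward to every interested port except the one the packet came in on.
        ports = self._ports_by_group.get(group, set())
        return sorted(ports - {ingress_port})


table = MembershipTable()
table.join("239.1.1.1", "eth1")
table.join("239.1.1.1", "eth2")
table.join("239.1.1.2", "eth3")
print(table.egress_ports("239.1.1.1", ingress_port="eth0"))  # ['eth1', 'eth2']
```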


Protocol Independent Multicast (PIM) is pretty much the standard for IP Multicast control between routers on a network. To reach across wide area domains or create some policy control, MSDP and MBGP provide the ability to glue PIM domains together. In a way, it’s very similar to how OSPF or IS-IS are used inside a routing domain, and BGP between them.


The challenge with PIM is that it somewhat straddles membership control and actual packet forwarding, but without full control of the forwarding paths. PIM relies fully on the unicast topology to build its multicast forwarding topologies. The center point of forwarding in PIM is an entity called a Rendezvous Point (RP). The RP is one or more PIM routers that have been selected to become the anchor point of packet distribution. Packets flow from multicast sources to the RP, then the RP sends those packets back out towards all registered receivers of that specific multicast group.


The distribution of packets from the RP to the destinations is done using a shared tree. This shared tree is a graph with the RP as its root and each of the routers that have members of this multicast group as its leaves. The tree is constructed using unicast routing information: the individual paths of the tree towards the RP are the same paths a listener would be routed along to reach the RP for unicast. It is sometimes called a reverse tree because the tree is constructed using unicast information from the listener to the RP, but the actual traffic flows from the RP to the listener. In this model, the RP is the center of the distribution universe and its placement in the network needs to be very carefully considered.
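As a rough illustration of how such a tree falls out of unicast routing, the sketch below walks each listener's unicast path towards the RP and keeps the reversed hops. It assumes plain hop count as the unicast metric (a real network would use its IGP metrics), and the router names, topology and function names are invented.

```python
from collections import deque


def next_hops_toward(root, links):
    """Breadth-first search from `root`; returns {router: next hop towards root}."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    next_hop = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, []):
            if peer not in next_hop:
                next_hop[peer] = node
                queue.append(peer)
    return next_hop


def shared_tree(rp, listeners, links):
    """Return the (upstream, downstream) edges that carry traffic from the RP."""
    next_hop = next_hops_toward(rp, links)
    edges = set()
    for router in listeners:
        if router not in next_hop:
            continue  # no unicast route to the RP, so no branch can be built
        # Follow the unicast path towards the RP, recording each hop reversed.
        while next_hop[router] is not None:
            upstream = next_hop[router]
            edges.add((upstream, router))
            router = upstream
    return edges


# Hypothetical topology: the C-D link exists but is never on a shortest path to the RP.
links = [("RP", "A"), ("RP", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "D")]
tree = shared_tree("RP", listeners=["C", "D"], links=links)
print(sorted(tree))  # [('A', 'C'), ('A', 'D'), ('RP', 'A')]
```

Note that router B and the C-D link carry nothing for this group: the tree only contains what the unicast routes from the listeners to the RP happen to use.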


There are several optimizations of the traffic distribution. In standard PIM, the traffic from the source to the RP is encapsulated as unicast and then distributed from the RP down to the listeners, but it is also possible to create direct trees between the source and one or more listeners, called source trees. This removes pressure from the RP and creates more direct paths between the source and its listeners, but it also creates a tremendous amount of bookkeeping to track all these groups, trees and who needs to receive what. A later extension to PIM called BiDir (for Bi-Directional shared trees) allows the source to use the same RP-based tree to send its traffic to the RP, which then flows back from the RP to all its listeners. It puts the RP back in the center, but significantly reduces the amount of state that needs to be tracked.
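The bookkeeping difference is easy to see with a quick count. The sketch below assumes a worst-case router that sits on every tree, one (S,G) entry per source per group for source trees, and one (*,G) entry per group for a shared BiDir tree. The group and source addresses are invented, and real routers hold more per-entry detail than a simple count suggests.

```python
def state_entries(sources_by_group):
    """Worst-case entry counts on a router that sits on every tree."""
    # Source trees: one (S,G) entry for every source of every group.
    source_tree_entries = sum(len(sources) for sources in sources_by_group.values())
    # Shared BiDir trees: one (*,G) entry per group, however many sources it has.
    shared_tree_entries = len(sources_by_group)
    return source_tree_entries, shared_tree_entries


# Hypothetical load: 50 groups with 20 sources each.
groups = {
    f"239.1.1.{g}": {f"10.0.{g}.{s}" for s in range(1, 21)}
    for g in range(1, 51)
}
source_tree_entries, shared_tree_entries = state_entries(groups)
print(source_tree_entries, shared_tree_entries)  # 1000 50
```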


In the end, all of these are variations and optimizations on the same theme. IP Multicast distribution is based on how unicast is delivered, which means that different multicast streams to the same listener follow the same path to get there. The only tool to change that is to anchor different groups to different RPs, which is a completely manual exercise.
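That manual exercise boils down to maintaining a table of group ranges and the RP each one is pinned to. Here is a minimal sketch using Python's standard `ipaddress` module. The ranges, the RP addresses and the longest-prefix-wins rule are assumptions for illustration, not taken from any particular router configuration.

```python
import ipaddress

# Hypothetical, hand-maintained mapping of group ranges to RP addresses.
RP_BY_GROUP_RANGE = {
    "239.1.0.0/16": "10.0.0.1",    # e.g. video groups
    "239.2.0.0/16": "10.0.0.2",    # e.g. market data groups
    "224.0.0.0/4": "10.0.0.254",   # catch-all for everything else
}


def rp_for_group(group, mapping=RP_BY_GROUP_RANGE):
    """Return the RP for `group`, choosing the most specific matching range."""
    address = ipaddress.ip_address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    matches = []
    for prefix, rp in mapping.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            matches.append((network.prefixlen, rp))
    if not matches:
        return None
    # The longest prefix is the most specific range.
    return max(matches)[1]


print(rp_for_group("239.1.5.5"))  # 10.0.0.1
print(rp_for_group("239.9.9.9"))  # 10.0.0.254
```

Every new class of traffic that should take a different path means another line in that table and another RP to place, which is exactly why this does not scale as a traffic engineering tool.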


Like unicast applications, and perhaps even more so, IP multicast applications have very different networking needs. Database, financial quote/transaction and other real-time synchronization applications need relatively low-volume but very low-latency multicast distribution. Backups and archiving need lots of multicast bandwidth, but are just fine with multi-microsecond or more latency. Video and voice multicast applications fit somewhere in between.


It is completely possible to build L2 and L3 multicast topologies that are different from their unicast brethren. If you have a complete view of a forwarding domain, you can calculate multicast distribution trees that use links with lots of bandwidth, or ones that use the fewest hops between sources and listeners. And of course you would take into account the amount of other (unicast) traffic that flows on the same links to ensure they do not clash.
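As one possible way to compute such a tree, the sketch below builds a "widest path" tree from a source: a Dijkstra variant that maximizes the bottleneck bandwidth left over after subtracting unicast load. This is my own illustration of the idea, not how any specific product does it, and the topology, capacities and load figures are invented.

```python
import heapq


def widest_paths(source, available):
    """Dijkstra variant: maximize the smallest available bandwidth along the path."""
    neighbors = {}
    for (a, b), bandwidth in available.items():
        neighbors.setdefault(a, []).append((b, bandwidth))
        neighbors.setdefault(b, []).append((a, bandwidth))
    width = {source: float("inf")}
    parent = {source: None}
    heap = [(-width[source], source)]  # max-heap via negated widths
    done = set()
    while heap:
        negated_width, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for peer, bandwidth in neighbors.get(node, []):
            candidate = min(-negated_width, bandwidth)
            if candidate > width.get(peer, 0):
                width[peer] = candidate
                parent[peer] = node
                heapq.heappush(heap, (-candidate, peer))
    return width, parent


def multicast_tree(source, listeners, capacity, unicast_load):
    """Return (upstream, downstream) edges of a bandwidth-aware distribution tree."""
    # Only what unicast leaves free is available to multicast.
    available = {link: cap - unicast_load.get(link, 0) for link, cap in capacity.items()}
    _, parent = widest_paths(source, available)
    edges = set()
    for router in listeners:
        while parent.get(router) is not None:
            edges.add((parent[router], router))
            router = parent[router]
    return edges


# Hypothetical links in Gbit/sec: the direct S-L link is short but nearly full.
capacity = {("S", "L"): 10, ("S", "B"): 40, ("B", "L"): 40}
unicast_load = {("S", "L"): 8}
tree = multicast_tree("S", ["L"], capacity, unicast_load)
print(sorted(tree))  # [('B', 'L'), ('S', 'B')]
```

A hop-count based unicast route would pick the direct S-L link. The bandwidth-aware tree takes the extra hop through B because that path has 40 Gbit/sec free instead of 2.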


Perhaps more visibly than other applications, many multicast applications have well-articulated needs and desires. My surveillance camera exercise in Delhi needed lots and lots of raw bandwidth. Close to 10Gbit/sec worth of multicast video would be traveling in and out of the monitoring stations. At the time we did not have tools to separate that from the other critical security information (smoke detectors, door alarms, you name it) that flowed into the same control room. Times are changing. Taking an application-first approach, then having the means to translate that into network and forwarding behavior, will give us exactly those tools.


[Today's fun fact: Human teeth are almost as hard as rocks. Key word is "almost"]






via Business 2 Community http://ift.tt/1ifabco
