Distributed Switch Architecture
===============================

Introduction
============

This document describes the Distributed Switch Architecture (DSA) subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch is typically comprised of multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.

The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
which is a hardware feature making the switch insert a specific tag for each
Ethernet frame it receives from or sends to specific ports to help the
management interface figure out:

- what port is this frame coming from
- what was the reason why this frame got forwarded
- how to send CPU originated traffic to specific ports

The subsystem does support switches not capable of inserting/stripping tags, but
the features might be slightly limited in that case (traffic separation relies
on Port-based VLAN IDs).

Note that DSA does not currently create network interfaces for the "cpu" and
"dsa" ports because:

- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, would create a duplication of feature, since you
  would get two interfaces for the same conduit: master netdev, and "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either, only the
  downstream, or the top-most upstream interface makes sense with that model

Switch tagging protocols
------------------------

DSA currently supports 5 different tagging protocols, and a tag-less mode as
well.
The different protocols are implemented in:

net/dsa/tag_trailer.c: Marvell's 4-byte trailer tag mode (legacy)
net/dsa/tag_dsa.c: Marvell's original DSA tag
net/dsa/tag_edsa.c: Marvell's enhanced DSA tag
net/dsa/tag_brcm.c: Broadcom's 4-byte tag
net/dsa/tag_qca.c: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
e1000e, mv643xx_eth etc. without having to introduce modifications to these
drivers. Such network devices are also often referred to as conduit network
devices since they act as a pipe between the host processor and the hardware
Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming skb->protocol) with the
networking stack, this is also known as a ptype or packet_type. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

Receive interrupt fires:
- receive function is invoked
- basic packet processing is done: getting length, status etc.
- packet is prepared to be processed by the Ethernet layer by calling
  eth_type_trans

net/ethernet/eth.c:

eth_type_trans(skb, dev)
        if (dev->dsa_ptr != NULL)
                -> skb->protocol = ETH_P_XDSA

drivers/net/ethernet/*:

netif_receive_skb(skb)
        -> iterate over registered packet_type
                -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

net/dsa/dsa.c:
        -> dsa_switch_rcv()
                -> invoke switch tag specific protocol handler in
                   net/dsa/tag_*.c

net/dsa/tag_*.c:
        -> inspect and strip switch tag protocol to determine originating port
        -> locate per-port network device
        -> invoke eth_type_trans() with the DSA slave network device
        -> invoke netif_receive_skb()

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.
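
The tag insertion and removal performed by slave network devices can be
illustrated with a small userspace sketch. The 4-byte tag layout used here
(placed right after the two MAC addresses, port number in the last byte) and
all demo_* names are hypothetical, chosen only for illustration; they do not
match any vendor format or any function in net/dsa/tag_*.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN     6  /* length of one MAC address */
#define DEMO_TAG_LEN 4  /* hypothetical tag size, not a real vendor format */

/*
 * Transmit side: open a DEMO_TAG_LEN gap after the two MAC addresses and
 * store the destination port in the last tag byte. Returns the new frame
 * length, or 0 if the frame is too short or the buffer is too small.
 */
size_t demo_tag_xmit(uint8_t *frame, size_t len, size_t cap, uint8_t port)
{
        if (len < 2 * ETH_ALEN || len + DEMO_TAG_LEN > cap)
                return 0;
        memmove(frame + 2 * ETH_ALEN + DEMO_TAG_LEN,
                frame + 2 * ETH_ALEN, len - 2 * ETH_ALEN);
        memset(frame + 2 * ETH_ALEN, 0, DEMO_TAG_LEN);
        frame[2 * ETH_ALEN + DEMO_TAG_LEN - 1] = port;
        return len + DEMO_TAG_LEN;
}

/*
 * Receive side: read the originating port from the tag, then close the gap
 * so the rest of the stack sees a regular Ethernet frame. Returns the new
 * frame length, or 0 if the frame cannot contain a tag.
 */
size_t demo_tag_rcv(uint8_t *frame, size_t len, uint8_t *port)
{
        if (len < 2 * ETH_ALEN + DEMO_TAG_LEN)
                return 0;
        *port = frame[2 * ETH_ALEN + DEMO_TAG_LEN - 1];
        memmove(frame + 2 * ETH_ALEN,
                frame + 2 * ETH_ALEN + DEMO_TAG_LEN,
                len - 2 * ETH_ALEN - DEMO_TAG_LEN);
        return len - DEMO_TAG_LEN;
}
```

Calling demo_tag_xmit() followed by demo_tag_rcv() on the same buffer returns
the original frame together with the port number, which mirrors the round trip
a frame makes between a slave network device and the switch.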

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
ndo_start_xmit() function, since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver these frames to the physical switch port.

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective:


                        |---------------------------
                        | CPU network device (eth0)|
                        ----------------------------
                        | <tag added by switch     |
                        |                          |
                        |                          |
                        |        tag added by CPU> |
                |--------------------------------------------|
                | Switch driver                              |
                |--------------------------------------------|
                    ||        ||         ||
                |-------|  |-------|  |-------|
                | sw0p0 |  | sw0p1 |  | sw0p2 |
                |-------|  |-------|  |-------|

Slave MDIO bus
--------------

In order to be able to read from/write to a switch PHY built into it, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to return link status, link partner pages,
auto-negotiation results etc..
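
The diversion of MDIO reads described above can be sketched in userspace C as
follows. This is a simplified model under stated assumptions: the demo_switch
structure, its phys_mii_mask field and the demo_* functions are hypothetical
names for this example, and the fake driver callback only reports a link-up
bit in the standard MII status register (register 1, bit 2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a switch as seen by the slave MDIO bus. */
struct demo_switch {
        uint32_t phys_mii_mask; /* bit N set: PHY address N belongs to the switch */
        int (*phy_read)(struct demo_switch *ds, int addr, int regnum);
};

/*
 * Slave MDIO bus read: addresses claimed by the switch are diverted to the
 * switch driver callback, every other address reads back as 0xffff.
 */
int demo_slave_mii_read(struct demo_switch *ds, int addr, int regnum)
{
        if (addr >= 0 && addr < 32 &&
            (ds->phys_mii_mask & (1u << addr)) && ds->phy_read != NULL)
                return ds->phy_read(ds, addr, regnum);
        return 0xffff;
}

/* Fake driver callback: link up in the MII status register, 0 elsewhere. */
int demo_phy_read(struct demo_switch *ds, int addr, int regnum)
{
        (void)ds;
        (void)addr;
        return regnum == 1 ? 0x0004 : 0;
}
```

A real switch driver would translate the (addr, regnum) pair into direct or
indirect accesses to its builtin PHY registers instead of returning constants.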

For Ethernet switches which have both external and internal MDIO busses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

Data structures
---------------

DSA data structures are defined in include/net/dsa.h as well as
net/dsa/dsa_priv.h.

dsa_chip_data: platform data configuration for a given switch device, this
structure describes a switch device's parent device, its address, as well as
various properties of its ports: names/labels, and finally a routing table
indication (when cascading switches)

dsa_platform_data: platform device configuration data which can reference a
collection of dsa_chip_data structures if multiple switches are cascaded, the
master network device this switch tree is attached to needs to be referenced

dsa_switch_tree: structure assigned to the master network device under
"dsa_ptr", this structure references a dsa_platform_data structure as well as
the tagging protocol supported by the switch tree, and which receive/transmit
function hooks should be invoked, information about the directly attached switch
is also provided: CPU port. Finally, a collection of dsa_switch are referenced
to address individual switches in the tree.

dsa_switch: structure describing a switch device in the tree, referencing a
dsa_switch_tree as a backpointer, slave network devices, master network device,
and a reference to the backing dsa_switch_ops

dsa_switch_ops: structure referencing function pointers, see below for a full
description.
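
The relationship between a switch tree and its switches can be modelled with
the following userspace sketch. It is not the layout of the real structures in
include/net/dsa.h: every demo_* name and field is hypothetical, and only the
backpointer, the per-tree CPU port information and the limits of 4 switches
and 12 ports quoted in this document are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SWITCHES 4     /* mirrors the documented DSA_MAX_SWITCHES */
#define DEMO_MAX_PORTS    12    /* mirrors the documented DSA_MAX_PORTS */

enum demo_port_type {
        DEMO_PORT_UNUSED = 0,
        DEMO_PORT_CPU,          /* faces the master network device */
        DEMO_PORT_DSA,          /* link to another switch in the tree */
        DEMO_PORT_SLAVE,        /* front-panel port with a slave netdev */
};

struct demo_switch_tree;

struct demo_switch {
        struct demo_switch_tree *dst;   /* backpointer to the tree */
        int index;                      /* position of this switch in the tree */
        enum demo_port_type ports[DEMO_MAX_PORTS];
};

struct demo_switch_tree {
        int cpu_switch;                 /* switch holding the CPU port */
        int cpu_port;                   /* CPU port number on that switch */
        struct demo_switch *ds[DEMO_MAX_SWITCHES];
};

/* Attach a switch to the first free slot; returns its index or -1 if full. */
int demo_tree_add_switch(struct demo_switch_tree *dst, struct demo_switch *ds)
{
        for (int i = 0; i < DEMO_MAX_SWITCHES; i++) {
                if (dst->ds[i] != NULL)
                        continue;
                dst->ds[i] = ds;
                ds->dst = dst;
                ds->index = i;
                /* Record where the CPU port lives, if this switch has one. */
                for (int p = 0; p < DEMO_MAX_PORTS; p++) {
                        if (ds->ports[p] == DEMO_PORT_CPU) {
                                dst->cpu_switch = i;
                                dst->cpu_port = p;
                        }
                }
                return i;
        }
        return -1;
}
```

In this model a cascaded setup is simply two demo_switch instances added to
the same tree, with only the first one carrying a DEMO_PORT_CPU entry.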

Design limitations
==================

DSA is a platform device driver
-------------------------------

DSA is implemented as a DSA platform device driver which is convenient because
it will register the entire DSA switch tree attached to a master network device
in one-shot, facilitating the device creation and simplifying the device driver
model a bit, this comes however with a number of limitations:

- building DSA and its switch drivers as modules is currently not working
- the device driver parenting does not necessarily reflect the original
  bus/device the switch can be created from
- supporting non-MDIO and non-MMIO (platform) switches is not possible

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(DSA_MAX_SWITCHES), and the number of ports per switch to 12 (DSA_MAX_PORTS).
These limits could be extended to support larger configurations should this need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before.
This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (dev->dsa_ptr becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.

Slave network devices check that the master network device is UP before allowing
you to administratively bring UP these slave network devices. A common
configuration mistake is forgetting to bring UP the master network device first.
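
The ordering constraint between master and slave network devices can be
sketched as below. This is a userspace model only: demo_netdev, DEMO_IFF_UP
and demo_slave_open() are hypothetical names, and returning -ENETDOWN is an
assumption made for this example rather than a statement about the exact error
reported by the subsystem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_IFF_UP 0x1         /* hypothetical "administratively up" flag */

struct demo_netdev {
        unsigned int flags;
        struct demo_netdev *master;     /* NULL for a master network device */
};

/*
 * Bring a slave network device up. The request is refused while the master
 * network device it is stacked on is still down.
 */
int demo_slave_open(struct demo_netdev *slave)
{
        if (slave->master == NULL || !(slave->master->flags & DEMO_IFF_UP))
                return -ENETDOWN;
        slave->flags |= DEMO_IFF_UP;
        return 0;
}
```

With this model, opening a slave port before its master fails, and succeeds
once DEMO_IFF_UP has been set on the master, which is the order an
administrator has to follow on a real system.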

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: drivers/net/phy/phy.c, mdio_bus.c
- Switchdev: net/switchdev/*
- Device Tree for various of_* functions

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (struct phy_device as defined in include/linux/phy.h), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
  fixed PHYs

The PHY configuration is done by the dsa_slave_phy_setup() function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using of_phy_connect()

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  Documentation/devicetree/bindings/net/fixed-link.txt, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA


SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices.
Since DSA primarily deals with
MDIO-connected switches, although not exclusively, SWITCHDEV's
prepare/abort/commit phases are often simplified into a prepare phase which
checks whether the operation is supported by the DSA switch driver, and a commit
phase which applies the changes.

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

Device Tree
-----------

DSA features a standardized binding which is documented in
Documentation/devicetree/bindings/net/dsa/dsa.txt. PHY/MDIO library helper
functions such as of_get_phy_mode(), of_phy_connect() are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc..

Driver development
==================

DSA switch drivers need to implement a dsa_switch_ops structure which will
contain the various members described below.

register_switch_driver() registers this dsa_switch_ops in its internal list
of drivers to probe for. unregister_switch_driver() does the exact opposite.

Unless requested differently by setting the priv_size member accordingly, DSA
does not allocate any driver private context space.

Switch configuration
--------------------

- tag_protocol: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the dsa_tag_protocol enum

- probe: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device. For other
  buses, return a non-NULL string

- setup: setup function for the switch, this function is responsible for setting
  up the dsa_switch_ops private structure with all it needs: register maps,
  interrupts, mutexes, locks etc..
  This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.

- set_addr: Some switches require the programming of the management interface's
  Ethernet MAC address, switch drivers can also disable ageing of MAC addresses
  on the management interface and "hardcode"/"force" this MAC address for the
  CPU/management interface as an optimization

PHY devices and link management
-------------------------------

- get_phy_flags: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags", that is private between the switch
  driver and the Ethernet PHY driver in drivers/net/phy/*.

- phy_read: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc..

- phy_write: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable return a negative error
  code.

- adjust_link: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the phy_device is providing.

- fixed_link_update: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained

Ethtool operations
------------------

- get_strings: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- get_ethtool_stats: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- get_sset_count: ethtool function used to query the number of statistics items

- get_wol: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- set_wol: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to get_wol with similar restrictions

- set_eee: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant.
  This function should enable EEE at the switch port MAC
  controller and data-processing logic

- get_eee: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- get_eeprom_len: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- get_eeprom: ethtool function returning for a given switch the EEPROM contents

- set_eeprom: ethtool function writing specified data to a given switch EEPROM

- get_regs_len: ethtool function returning the register length for a given
  switch

- get_regs: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- suspend: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- resume: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- port_enable: function invoked by the DSA slave network device ndo_open
  function when a port is administratively brought up, this function should be
  fully enabling a given switch port.
  DSA takes care of marking the port with
  BR_STATE_BLOCKING if the port is a bridge member, or BR_STATE_FORWARDING if it
  was not, and propagating these changes down to the hardware

- port_disable: function invoked by the DSA slave network device ndo_close
  function when a port is administratively brought down, this function should be
  fully disabling a given switch port. DSA takes care of marking the port with
  BR_STATE_DISABLED and propagating changes to the hardware if this port is
  disabled while being a bridge member

Bridge layer
------------

- port_bridge_join: bridge layer function invoked when a given switch port is
  added to a bridge, this function should be doing the necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- port_bridge_leave: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should be doing the necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress
  traffic with the remaining bridge members. When the port leaves the bridge,
  it should be aged out at the switch hardware for the switch to (re) learn MAC
  addresses behind this port.

- port_stp_state_set: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing a STP state change based on current and asked parameters and perform
  the relevant ageing based on the intersection results

Bridge VLAN filtering
---------------------

- port_vlan_filtering: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering.
  If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to
  reject 802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules. If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- port_vlan_prepare: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return -EOPNOTSUPP to inform the bridge
  code to fallback to a software implementation. No hardware setup must be done
  in this function. See port_vlan_add for this and details.

- port_vlan_add: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- port_vlan_del: bridge layer function invoked when a VLAN is removed from the
  given switch port

- port_vlan_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- port_fdb_prepare: bridge layer function invoked when the bridge prepares the
  installation of a Forwarding Database entry. If the operation is not
  supported, this function should return -EOPNOTSUPP to inform the bridge code
  to fallback to a software implementation. No hardware setup must be done in
  this function. See port_fdb_add for this and details.

- port_fdb_add: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID

Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be its port-based VLAN, used by the associated bridge device.

- port_fdb_del: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- port_fdb_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- port_mdb_prepare: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return -EOPNOTSUPP to inform the bridge code to fallback
  to a software implementation. No hardware setup must be done in this function.
  See port_mdb_add for this and details.

- port_mdb_add: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be its port-based VLAN, used by the associated bridge device.

- port_mdb_del: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- port_mdb_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.

TODO
====

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specifics. At some point we should envision a merger between
these two subsystems and get the best of both worlds.

Other hanging fruits
--------------------

- making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510