Distributed Switch Architecture
===============================

Introduction
============

This document describes the Distributed Switch Architecture (DSA) subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch typically comprises multiple front-panel ports and one or
more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.

The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
a hardware feature which makes the switch insert a specific tag into each
Ethernet frame it receives from or forwards to specific ports, to help the
management interface figure out:

- what port is this frame coming from
- what was the reason why this frame got forwarded
- how to send CPU originated traffic to specific ports

The subsystem does support switches not capable of inserting/stripping tags, but
the features might be slightly limited in that case (traffic separation relies
on Port-based VLAN IDs).

Note that DSA does not currently create network interfaces for the "cpu" and
"dsa" ports because:

- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, would create a duplication of feature, since you
  would get two interfaces for the same conduit: master netdev, and "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either, only the
  downstream, or the top-most upstream interface makes sense with that model

Switch tagging protocols
------------------------

DSA currently supports 5 different tagging protocols, and a tag-less mode as
well. The different protocols are implemented in:

net/dsa/tag_trailer.c: Marvell's 4-byte trailer tag mode (legacy)
net/dsa/tag_dsa.c: Marvell's original DSA tag
net/dsa/tag_edsa.c: Marvell's enhanced DSA tag
net/dsa/tag_brcm.c: Broadcom's 4-byte tag
net/dsa/tag_qca.c: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface
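
To make the two properties above concrete, here is a minimal user-space sketch
of decoding and building a tag. The 4-byte layout, the example_* names and the
field positions are invented for illustration only; they do not describe any
of the vendor formats listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical 4-byte tag layout, made up for this example:
 *   byte 0: reserved
 *   byte 1: reason code (why the frame reached the management interface)
 *   byte 2: reserved
 *   byte 3: low 4 bits carry the switch port number
 */
#define EXAMPLE_TAG_LEN 4

struct example_tag_info {
	uint8_t port;	/* source port (receive) or destination port (transmit) */
	uint8_t reason;	/* forwarding reason reported by the switch */
};

/* Decode a received tag; returns 0 on success, -1 if the buffer is too short. */
static int example_tag_parse(const uint8_t *tag, size_t len,
			     struct example_tag_info *info)
{
	if (len < EXAMPLE_TAG_LEN)
		return -1;
	info->reason = tag[1];
	info->port = tag[3] & 0x0f;
	return 0;
}

/* Build a tag that steers a CPU originated frame to @port. */
static void example_tag_build(uint8_t *tag, uint8_t port)
{
	assert(port <= 0x0f);	/* only 4 bits are available in this layout */
	tag[0] = 0;
	tag[1] = 0;
	tag[2] = 0;
	tag[3] = port;
}
```

A real tagging driver would read these fields at whatever offsets the vendor
documentation specifies; only the idea of "port plus reason" carries over.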

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
e1000e, mv643xx_eth etc. without having to introduce modifications to these
drivers. Such network devices are also often referred to as conduit network
devices since they act as a pipe between the host processor and the hardware
Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming skb->protocol) with the
networking stack, this is also known as a ptype or packet_type. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

Receive interrupt fires:
- receive function is invoked
- basic packet processing is done: getting length, status etc.
- packet is prepared to be processed by the Ethernet layer by calling
  eth_type_trans

net/ethernet/eth.c:

eth_type_trans(skb, dev)
	if (dev->dsa_ptr != NULL)
		-> skb->protocol = ETH_P_XDSA

drivers/net/ethernet/*:

netif_receive_skb(skb)
	-> iterate over registered packet_type
		-> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

net/dsa/dsa.c:
	-> dsa_switch_rcv()
		-> invoke switch tag specific protocol handler in
		   net/dsa/tag_*.c

net/dsa/tag_*.c:
	-> inspect and strip switch tag protocol to determine originating port
	-> locate per-port network device
	-> invoke eth_type_trans() with the DSA slave network device
	-> invoke netif_receive_skb()

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.
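
The receive sequence above can be modelled in user space as follows. This is
only a sketch: the example_* structures are simplified stand-ins for the kernel
objects, the source port is assumed to have been decoded from the tag already,
and EXAMPLE_ETH_P_XDSA is a local constant assumed to mirror the kernel's
ETH_P_XDSA value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_ETH_P_XDSA 0x00F8	/* assumed to mirror ETH_P_XDSA */
#define EXAMPLE_NUM_PORTS 4		/* hypothetical switch size */

struct example_netdev {
	const char *name;
	void *dsa_ptr;			/* non-NULL once DSA uses this device */
	unsigned long rx_packets;
};

struct example_skb {
	struct example_netdev *dev;
	uint16_t protocol;
	int source_port;		/* stand-in for the port found in the tag */
};

struct example_switch_tree {
	struct example_netdev *ports[EXAMPLE_NUM_PORTS];
};

/* Master side: frames arriving on a DSA master get the fake EtherType. */
static void example_eth_type_trans(struct example_skb *skb, uint16_t real_proto)
{
	assert(skb && skb->dev);
	skb->protocol = skb->dev->dsa_ptr ? EXAMPLE_ETH_P_XDSA : real_proto;
}

/*
 * DSA side: locate the per-port device, hand the frame to it and run the
 * type translation again, this time against the slave device.
 * Returns the slave device, or NULL if the frame cannot be delivered.
 */
static struct example_netdev *example_dsa_rcv(struct example_skb *skb,
					      uint16_t real_proto)
{
	struct example_switch_tree *dst;
	struct example_netdev *slave;

	assert(skb && skb->dev);
	dst = skb->dev->dsa_ptr;
	if (skb->protocol != EXAMPLE_ETH_P_XDSA || !dst)
		return NULL;
	if (skb->source_port < 0 || skb->source_port >= EXAMPLE_NUM_PORTS)
		return NULL;
	slave = dst->ports[skb->source_port];
	if (!slave)
		return NULL;

	skb->dev = slave;
	example_eth_type_trans(skb, real_proto);	/* slave has no dsa_ptr */
	slave->rx_packets++;
	return slave;
}
```

After example_dsa_rcv() returns, the frame looks like any ordinary frame
received on the slave device, which is the property the text describes.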

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
ndo_start_xmit() function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.
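
As an illustration of the transmit side, the sketch below copies a frame into a
new buffer with a tag inserted right after the two MAC addresses. The 4-byte
tag, its placement and the example_* names are assumptions made for this
example; real tagging protocols differ in size and position (some use a
trailer, for instance).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_MAC_HDR_LEN 12	/* destination + source MAC addresses */
#define EXAMPLE_TAG_LEN 4	/* hypothetical tag size */

/*
 * Copy @frame into @out with a hypothetical tag inserted after the MAC
 * addresses; the last tag byte carries the destination port.
 * Returns the tagged length, or 0 if the frame is too short or @out is
 * too small.
 */
static size_t example_tag_xmit(const uint8_t *frame, size_t len, uint8_t port,
			       uint8_t *out, size_t out_len)
{
	assert(frame && out);
	if (len < EXAMPLE_MAC_HDR_LEN || out_len < len + EXAMPLE_TAG_LEN)
		return 0;

	/* MAC addresses stay where they are */
	memcpy(out, frame, EXAMPLE_MAC_HDR_LEN);
	/* tag: three zero bytes followed by the port number */
	memset(out + EXAMPLE_MAC_HDR_LEN, 0, EXAMPLE_TAG_LEN);
	out[EXAMPLE_MAC_HDR_LEN + EXAMPLE_TAG_LEN - 1] = port & 0x0f;
	/* EtherType and payload follow the tag */
	memcpy(out + EXAMPLE_MAC_HDR_LEN + EXAMPLE_TAG_LEN,
	       frame + EXAMPLE_MAC_HDR_LEN, len - EXAMPLE_MAC_HDR_LEN);
	return len + EXAMPLE_TAG_LEN;
}
```

The tagged buffer is what would then be handed to the master network device
for transmission.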

Graphical representation
------------------------

Summarized, this is what DSA looks like from a network device perspective:


                        |--------------------------|
                        | CPU network device (eth0)|
                        |--------------------------|
                        | <tag added by switch     |
                        |                          |
                        |                          |
                        |        tag added by CPU> |
                |--------------------------------------------|
                | Switch driver                              |
                |--------------------------------------------|
                    ||         ||         ||
                |-------|  |-------|  |-------|
                | sw0p0 |  | sw0p1 |  | sw0p2 |
                |-------|  |-------|  |-------|

Slave MDIO bus
--------------

In order to be able to read from/write to a PHY built into a switch, DSA creates
a slave MDIO bus which allows a specific switch driver to divert and intercept
MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
switches, these functions would utilize direct or indirect PHY addressing mode
to return standard MII registers from the switch builtin PHYs, allowing the PHY
library to return link status, link partner pages, auto-negotiation results
etc.

For Ethernet switches which have both external and internal MDIO buses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.
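
The mux/demux role of the slave MII bus can be sketched like this. The address
split (PHY addresses 0-4 internal, everything else external), the register
arrays and all example_* names are hypothetical; a real driver would issue
hardware accesses instead of indexing an array.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define EXAMPLE_NUM_INTERNAL_PHYS 5	/* hypothetical: addresses 0-4 are built in */
#define EXAMPLE_PHY_MAX_ADDR 32
#define EXAMPLE_NUM_REGS 32

/* Simulated register file standing in for one physical MDIO bus. */
struct example_mdio_backend {
	uint16_t regs[EXAMPLE_PHY_MAX_ADDR][EXAMPLE_NUM_REGS];
};

struct example_switch {
	struct example_mdio_backend internal;
	struct example_mdio_backend external;
};

/* Pick the bus a given PHY address lives on. */
static struct example_mdio_backend *example_pick_bus(struct example_switch *sw,
						     int addr)
{
	return addr < EXAMPLE_NUM_INTERNAL_PHYS ? &sw->internal : &sw->external;
}

/* Read through the slave bus; out-of-range accesses read back as 0xffff. */
static int example_slave_mii_read(struct example_switch *sw, int addr, int reg)
{
	assert(sw);
	if (addr < 0 || addr >= EXAMPLE_PHY_MAX_ADDR ||
	    reg < 0 || reg >= EXAMPLE_NUM_REGS)
		return 0xffff;
	return example_pick_bus(sw, addr)->regs[addr][reg];
}

/* Write through the slave bus; out-of-range accesses fail with -EINVAL. */
static int example_slave_mii_write(struct example_switch *sw, int addr, int reg,
				   uint16_t val)
{
	assert(sw);
	if (addr < 0 || addr >= EXAMPLE_PHY_MAX_ADDR ||
	    reg < 0 || reg >= EXAMPLE_NUM_REGS)
		return -EINVAL;
	example_pick_bus(sw, addr)->regs[addr][reg] = val;
	return 0;
}
```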

Data structures
---------------

DSA data structures are defined in include/net/dsa.h as well as
net/dsa/dsa_priv.h.

dsa_chip_data: platform data configuration for a given switch device, this
structure describes a switch device's parent device, its address, as well as
various properties of its ports: names/labels, and finally a routing table
indication (when cascading switches)

dsa_platform_data: platform device configuration data which can reference a
collection of dsa_chip_data structures if multiple switches are cascaded, the
master network device this switch tree is attached to needs to be referenced

dsa_switch_tree: structure assigned to the master network device under
"dsa_ptr", this structure references a dsa_platform_data structure as well as
the tagging protocol supported by the switch tree, and which receive/transmit
function hooks should be invoked, information about the directly attached switch
is also provided: CPU port. Finally, a collection of dsa_switch are referenced
to address individual switches in the tree.

dsa_switch: structure describing a switch device in the tree, referencing a
dsa_switch_tree as a backpointer, slave network devices, master network device,
and a reference to the backing dsa_switch_ops

dsa_switch_ops: structure referencing function pointers, see below for a full
description.

Design limitations
==================

DSA is a platform device driver
-------------------------------

DSA is implemented as a platform device driver, which is convenient because
it will register the entire DSA switch tree attached to a master network device
in one shot, facilitating the device creation and simplifying the device driver
model a bit. This comes, however, with a number of limitations:

- building DSA and its switch drivers as modules is currently not working
- the device driver parenting does not necessarily reflect the original
  bus/device the switch can be created from
- supporting non-MDIO and non-MMIO (platform) switches is not possible

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(DSA_MAX_SWITCHES), and the number of ports per switch to 12 (DSA_MAX_PORTS).
These limits could be extended to support larger configurations should this need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (dev->dsa_ptr becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.

Slave network devices check that the master network device is UP before allowing
you to administratively bring UP these slave network devices. A common
configuration mistake is forgetting to bring UP the master network device first.
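
The ordering requirement between master and slave devices can be modelled with
a few lines of user-space C. The example_netdev structure is a simplified
stand-in for a network device, and returning -ENETDOWN is an assumption about
the error a caller would see, not something stated in this document.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct example_netdev {
	const char *name;
	bool up;
	struct example_netdev *master;	/* NULL for the master device itself */
};

/* Bring @dev up; a slave refuses to open while its master is still down. */
static int example_dev_open(struct example_netdev *dev)
{
	assert(dev);
	if (dev->master && !dev->master->up)
		return -ENETDOWN;	/* assumed error code */
	dev->up = true;
	return 0;
}
```

With this model, opening a slave first fails, opening the master and then the
slave succeeds, which is the order an administrator has to follow.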

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: drivers/net/phy/phy.c, mdio_bus.c
- Switchdev: net/switchdev/*
- Device Tree for various of_* functions

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (struct phy_device as defined in include/linux/phy.h), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
  fixed PHYs

The PHY configuration is done by the dsa_slave_phy_setup() function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property; if found, this PHY device is created and registered
  using of_phy_connect()

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  Documentation/devicetree/bindings/net/fixed-link.txt, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA
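
The three cases above can be read as a decision sequence. The sketch below
encodes that reading in user-space C; the precedence simply follows the order
of the list, and the example_* types and flags are hypothetical rather than
taken from dsa_slave_phy_setup() itself.

```c
#include <assert.h>
#include <stdbool.h>

enum example_phy_source {
	EXAMPLE_PHY_FROM_HANDLE,	/* found through the "phy-handle" property */
	EXAMPLE_PHY_FIXED_LINK,		/* fixed, non-MDIO managed PHY */
	EXAMPLE_PHY_SLAVE_MII,		/* built-in PHY probed on the slave MII bus */
	EXAMPLE_PHY_NONE,		/* nothing to attach */
};

/* Hypothetical summary of what is known about a port. */
struct example_port_desc {
	bool uses_device_tree;
	bool has_phy_handle;
	bool has_fixed_link;
	bool has_builtin_phy;
};

/* Decide where the PHY for a port comes from, in the order of the list. */
static enum example_phy_source
example_phy_setup(const struct example_port_desc *p)
{
	assert(p);
	if (p->uses_device_tree && p->has_phy_handle)
		return EXAMPLE_PHY_FROM_HANDLE;
	if (p->uses_device_tree && p->has_fixed_link)
		return EXAMPLE_PHY_FIXED_LINK;
	if (p->has_builtin_phy)
		return EXAMPLE_PHY_SLAVE_MII;
	return EXAMPLE_PHY_NONE;
}
```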

SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices. Since DSA primarily deals with
MDIO-connected switches, although not exclusively, SWITCHDEV's
prepare/abort/commit phases are often simplified into a prepare phase which
checks whether the operation is supported by the DSA switch driver, and a commit
phase which applies the changes.
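
A minimal sketch of this simplified two-phase model, written as user-space C:
the prepare step only validates and the commit step only applies. The
example_switch structure, its per-VID port bitmap and the 8-port limit are
assumptions made for the example, not part of the SWITCHDEV API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_MAX_VID 4094

/* Simulated switch state: one port bitmap per VLAN ID (up to 8 ports). */
struct example_switch {
	bool supports_vlan;
	uint8_t vlan_members[EXAMPLE_MAX_VID + 1];
};

/* Prepare: check whether the request can be honoured, change nothing. */
static int example_vlan_prepare(const struct example_switch *sw, uint16_t vid)
{
	if (!sw->supports_vlan)
		return -EOPNOTSUPP;
	if (vid == 0 || vid > EXAMPLE_MAX_VID)
		return -EINVAL;
	return 0;
}

/* Commit: apply a change that prepare has already accepted. */
static void example_vlan_commit(struct example_switch *sw, uint16_t vid,
				int port)
{
	assert(example_vlan_prepare(sw, vid) == 0);
	assert(port >= 0 && port < 8);
	sw->vlan_members[vid] |= (uint8_t)(1u << port);
}

/* Caller-side helper running both phases in order. */
static int example_vlan_add(struct example_switch *sw, uint16_t vid, int port)
{
	int err = example_vlan_prepare(sw, vid);

	if (err)
		return err;
	example_vlan_commit(sw, vid, port);
	return 0;
}
```

Because prepare never touches state, a failure leaves nothing to roll back,
which is why the abort phase can often be dropped in this simplified model.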

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

Device Tree
-----------

DSA features a standardized binding which is documented in
Documentation/devicetree/bindings/net/dsa/dsa.txt. PHY/MDIO library helper
functions such as of_get_phy_mode(), of_phy_connect() are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc..

Driver development
==================

DSA switch drivers need to implement a dsa_switch_ops structure which will
contain the various members described below.

register_switch_driver() registers this dsa_switch_ops in its internal list
of drivers to probe for. unregister_switch_driver() does the exact opposite.

Unless requested differently by setting the priv_size member accordingly, DSA
does not allocate any driver private context space.

Switch configuration
--------------------

- tag_protocol: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the dsa_tag_protocol enum

- probe: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device. For other
  buses, return a non-NULL string

- setup: setup function for the switch, this function is responsible for setting
  up the dsa_switch_ops private structure with all it needs: register maps,
  interrupts, mutexes, locks etc.. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.

- set_addr: Some switches require the programming of the management interface's
  Ethernet MAC address, switch drivers can also disable ageing of MAC addresses
  on the management interface and "hardcode"/"force" this MAC address for the
  CPU/management interface as an optimization
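
The port isolation that the setup callback is expected to establish can be
illustrated by computing, for each port, the forwarding vector of its
port-based VLAN. This is a user-space sketch: the bitmap representation and
the example_port_vlan_members() helper are hypothetical, and a real driver
would write the result into switch registers.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_NUM_PORTS 6	/* hypothetical switch size */

/*
 * Return the member bitmap of the port-based VLAN created for @port:
 * the CPU port plus the port itself. Unused ports get an empty vector
 * (they are disabled), and the CPU port gets no private VLAN of its own
 * in this sketch.
 */
static uint32_t example_port_vlan_members(int port, int cpu_port,
					  uint32_t enabled_mask)
{
	assert(port >= 0 && port < EXAMPLE_NUM_PORTS);
	assert(cpu_port >= 0 && cpu_port < EXAMPLE_NUM_PORTS);

	if (port == cpu_port || !(enabled_mask & (1u << port)))
		return 0;
	return (1u << cpu_port) | (1u << port);
}
```

With ports 0-2 enabled and port 5 as the CPU port, the vectors of any two
front-panel ports only overlap on the CPU port bit, so traffic between
front-panel ports has to go through the CPU.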

PHY devices and link management
-------------------------------

- get_phy_flags: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags", that is private between the switch
  driver and the Ethernet PHY driver in drivers/net/phy/*.

- phy_read: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc..

- phy_write: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable, return a negative error
  code.

- adjust_link: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the phy_device is providing.

- fixed_link_update: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained.

Ethtool operations
------------------

- get_strings: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- get_ethtool_stats: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- get_sset_count: ethtool function used to query the number of statistics items

- get_wol: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- set_wol: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to get_wol with similar restrictions

- set_eee: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant. This function should enable EEE at the switch port MAC
  controller and data-processing logic

- get_eee: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- get_eeprom_len: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- get_eeprom: ethtool function returning for a given switch the EEPROM contents

- set_eeprom: ethtool function writing specified data to a given switch EEPROM

- get_regs_len: ethtool function returning the register length for a given
  switch

- get_regs: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- suspend: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- resume: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- port_enable: function invoked by the DSA slave network device ndo_open
  function when a port is administratively brought up, this function should
  fully enable a given switch port. DSA takes care of marking the port with
  BR_STATE_BLOCKING if the port is a bridge member, or BR_STATE_FORWARDING if it
  is not, and propagating these changes down to the hardware

- port_disable: function invoked by the DSA slave network device ndo_close
  function when a port is administratively brought down, this function should
  fully disable a given switch port. DSA takes care of marking the port with
  BR_STATE_DISABLED and propagating changes to the hardware if this port is
  disabled while being a bridge member
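
The STP state handling described for port_enable and port_disable can be
sketched as below. The example_port structure is hypothetical, the enum values
are assumed to mirror the kernel's BR_STATE_* constants, and marking every
closed port as disabled (bridge member or not) is a simplification made for
this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's BR_STATE_* values. */
enum example_br_state {
	EXAMPLE_BR_STATE_DISABLED = 0,
	EXAMPLE_BR_STATE_LISTENING = 1,
	EXAMPLE_BR_STATE_LEARNING = 2,
	EXAMPLE_BR_STATE_FORWARDING = 3,
	EXAMPLE_BR_STATE_BLOCKING = 4,
};

struct example_port {
	bool enabled;			/* what port_enable/port_disable control */
	bool bridged;			/* port is currently a bridge member */
	enum example_br_state stp_state;
};

/* ndo_open path: enable the port, then pick the initial STP state. */
static void example_port_open(struct example_port *p)
{
	assert(p);
	p->enabled = true;
	/* a bridge member waits for the bridge to unblock it */
	p->stp_state = p->bridged ? EXAMPLE_BR_STATE_BLOCKING
				  : EXAMPLE_BR_STATE_FORWARDING;
}

/* ndo_close path: disable the port and mark it as such. */
static void example_port_close(struct example_port *p)
{
	assert(p);
	p->enabled = false;
	p->stp_state = EXAMPLE_BR_STATE_DISABLED;
}
```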

Bridge layer
------------

- port_bridge_join: bridge layer function invoked when a given switch port is
  added to a bridge, this function should do what is necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- port_bridge_leave: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should do what is necessary at the
  switch level to deny the leaving port from ingress/egress traffic from the
  remaining bridge members. When the port leaves the bridge, it should be aged
  out at the switch hardware for the switch to (re)learn MAC addresses behind
  this port.

- port_stp_state_set: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing an STP state change based on current and asked parameters and
  performing the relevant ageing based on the intersection results

Bridge VLAN filtering
---------------------

- port_vlan_filtering: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering. If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to reject
  802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules.  If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- port_vlan_prepare: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return -EOPNOTSUPP to inform the bridge
  code to fallback to a software implementation. No hardware setup must be done
  in this function. See port_vlan_add for this and details.

- port_vlan_add: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- port_vlan_del: bridge layer function invoked when a VLAN is removed from the
  given switch port

- port_vlan_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- port_fdb_prepare: bridge layer function invoked when the bridge prepares the
  installation of a Forwarding Database entry. If the operation is not
  supported, this function should return -EOPNOTSUPP to inform the bridge code
  to fallback to a software implementation. No hardware setup must be done in
  this function. See port_fdb_add for this and details.

- port_fdb_add: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID

Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be its port-based VLAN, used by the associated bridge device.

- port_fdb_del: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- port_fdb_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- port_mdb_prepare: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return -EOPNOTSUPP to inform the bridge code to fallback
  to a software implementation. No hardware setup must be done in this function.
  See port_mdb_add for this and details.

- port_mdb_add: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be its port-based VLAN, used by the associated bridge device.

- port_mdb_del: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- port_mdb_dump: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.
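
Returning to port_vlan_filtering, the acceptance rules it describes (filtering
on: only allowed VLAN IDs pass, and untagged frames need a PVID; filtering off:
everything passes) can be written down as a small user-space predicate. The
example_port_vlan structure and its bitmap are hypothetical; real hardware
keeps this state in its own VLAN tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_NUM_VIDS 4096

/* Simulated per-port VLAN state. */
struct example_port_vlan {
	bool filtering;				/* VLAN filtering turned on */
	bool has_pvid;				/* a PVID is programmed */
	uint8_t allowed[EXAMPLE_NUM_VIDS / 8];	/* bitmap of permitted VIDs */
};

/* Add @vid to the set of VLAN IDs permitted on the port. */
static void example_allow_vid(struct example_port_vlan *p, uint16_t vid)
{
	assert(p && vid < EXAMPLE_NUM_VIDS);
	p->allowed[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

/*
 * Decide whether an ingress frame is accepted.
 * @tagged: the frame carries an 802.1Q header; @vid is ignored otherwise.
 */
static bool example_ingress_accept(const struct example_port_vlan *p,
				   bool tagged, uint16_t vid)
{
	assert(p);
	if (!p->filtering)
		return true;		/* filtering off: accept everything */
	if (!tagged)
		return p->has_pvid;	/* untagged frames need a PVID */
	if (vid >= EXAMPLE_NUM_VIDS)
		return false;
	return (p->allowed[vid / 8] & (1u << (vid % 8))) != 0;
}
```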

TODO
====

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specifics. At some point we should envision a merger between
these two subsystems and get the best of both worlds.

Other hanging fruits
--------------------

- making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510
602- making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS
603- allowing more than one CPU/management interface:
604  http://comments.gmane.org/gmane.linux.network/365657
605- porting more drivers from other vendors:
606  http://comments.gmane.org/gmane.linux.network/365510
607