tc action ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ] [ export UDS_FILE ] [ verbose ]
tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]
    __bcc() {
            clang -O2 -emit-llvm -c $1 -o - | \
            llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
    }

    alias bcc=__bcc

A minimal, stand-alone unit, which matches on all traffic with the default classid (return code of -1) looks like:
    #include <linux/bpf.h>

    #ifndef __section
    # define __section(x)  __attribute__((section(x), used))
    #endif

    __section("classifier") int cls_main(struct __sk_buff *skb)
    {
            return -1;
    }

    char __license[] __section("license") = "GPL";

More examples can be found further below in subsection eBPF PROGRAMMING, as the focus here is on tooling. There can be various other sections, for example for actions. Thus, an eBPF object file can contain multiple entry points. However, a specific entry point must always be specified when configuring with tc. A license must be part of the restricted C code, and the license string syntax is the same as with Linux kernel modules. The kernel reserves the right to restrict some eBPF helper functions to GPL-compatible licenses only, and thus may reject a program from loading into the kernel when such a license mismatch occurs.

The resulting object file from the compilation can be inspected with the usual set of tools that also operate on normal object files, for example objdump(1) for inspecting ELF section headers:
    objdump -h bpf.o
    [...]
      3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                      CONTENTS, ALLOC, LOAD, DATA
    [...]

Adding an eBPF classifier from an object file that contains a classifier in the default ELF section is trivial (note that instead of "object-file" also shortcuts such as "obj" can be used):

    bcc bpf.c
    tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

In case the classifier resides in ELF section "mycls", then that same command needs to be invoked as:

    tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

Dumping the classifier configuration will tell the location of the classifier, in other words that it's from object file "bpf.o" under section "mycls":

    tc filter show dev em1
    filter parent 1: protocol all pref 49152 bpf
    filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

The same program can also be installed on ingress qdisc side as opposed to egress ...

    tc qdisc add dev em1 handle ffff: ingress
    tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1

... and again dumped from there:

    tc filter show dev em1 parent ffff:
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

Attaching a classifier and action on ingress has the restriction that it doesn't have an actual underlying queueing discipline. What ingress can do is to classify, mangle, redirect or drop packets. When queueing is required on ingress side, then ingress must redirect packets to the ifb device, otherwise policing can be used. Moreover, ingress can be used to have an early drop point of unwanted packets before they hit upper layers of the networking stack, perform network accounting with eBPF maps that could be shared with egress, or have an early mangle and/or redirection point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single object file within various sections. In that case, non-default section names must be provided, which is the case for both actions in this example:

    tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
                             action bpf obj bpf.o sec action-mark \
                             action bpf obj bpf.o sec action-rand ok

The advantage of this is that the classifier and the two actions can then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond tc(8) setup lifetime, the ownership can be transferred to an eBPF agent via Unix domain sockets. There are two possibilities for implementing this:

1) Implement a custom eBPF agent that takes care of setting up the Unix domain socket and implementing the protocol that tc(8) dictates. A code example of this can be found inside the iproute2 source package under: examples/bpf/

2) Use tc exec for transferring the eBPF map file descriptors through a Unix domain socket, and spawning an application such as sh(1). The advantage of this approach is that tc will place the file descriptors into the environment and thus make them available just like the stdin, stdout and stderr file descriptors. This means that user applications run from within this fd-owner shell can terminate and restart without losing the eBPF map file descriptors.

Example invocation with the previous classifier and action mixture:

    tc exec bpf imp /tmp/bpf
    tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \
                             action bpf obj bpf.o sec action-mark \
                             action bpf obj bpf.o sec action-rand ok

Assuming that eBPF maps are shared between the classifier and actions, it's enough to export them once, for example, from within the classifier or action command. tc will set up all eBPF map file descriptors at the time when the object file is first parsed.

When a shell has been spawned, the environment will have a couple of eBPF related variables. BPF_NUM_MAPS provides the total number of maps that have been transferred over the Unix domain socket. The value of BPF_MAP<X> is the file descriptor number that can be accessed in eBPF agent applications; in other words, it can directly be used as the file descriptor value for the bpf(2) system call to retrieve or alter eBPF map values. <X> denotes the identifier of the eBPF map. It corresponds to the id member of struct bpf_elf_map from the tc eBPF map specification.

The environment in this example looks as follows:
    sh# env | grep BPF
        BPF_NUM_MAPS=3
        BPF_MAP1=6
        BPF_MAP0=5
        BPF_MAP2=7

    sh# ls -la /proc/self/fd
        [...]
        lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
        lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
        lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map

    sh# my_bpf_agent

eBPF agents are very useful in that they can prepopulate eBPF maps from user space, monitor statistics via maps and, based on that feedback, for example rewrite classids in eBPF map values during runtime. Given that eBPF agents are implemented as normal applications, they can also dynamically receive traffic control policies from external controllers and push them down into eBPF maps to adapt to network conditions. Moreover, eBPF maps can also be shared with other eBPF program types (e.g. tracing), so very powerful combinations can be implemented.
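As a rough illustration of the environment variables described above, the following minimal sketch shows how an agent started from the fd-owner shell might recover the map file descriptors. The function names env_to_int(), agent_num_maps() and agent_map_fd() are hypothetical and not part of iproute2; only the variable names BPF_NUM_MAPS and BPF_MAP<X> come from the text above. The sketch only parses the environment and does not issue any bpf(2) call.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a non-negative decimal integer from the environment variable
 * `name`. Returns -1 if the variable is unset, empty or malformed. */
static int env_to_int(const char *name)
{
        const char *val = getenv(name);
        char *end;
        long num;

        if (!val || !*val)
                return -1;

        errno = 0;
        num = strtol(val, &end, 10);
        if (errno || *end != '\0' || num < 0 || num > INT_MAX)
                return -1;

        return (int)num;
}

/* Number of maps that tc transferred (BPF_NUM_MAPS), or -1. */
static int agent_num_maps(void)
{
        return env_to_int("BPF_NUM_MAPS");
}

/* File descriptor for the map with identifier `id` (BPF_MAP<id>),
 * or -1 if no such variable exists. The returned value is what an
 * agent would then pass as the map fd to bpf(2). */
static int agent_map_fd(unsigned int id)
{
        char name[32];

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        return env_to_int(name);
}
```

With the example environment shown above, agent_num_maps() would be expected to yield 3 and agent_map_fd(1) the descriptor 6. Validating the values strictly is a design choice of this sketch, so that a stale or hand-edited environment is rejected rather than silently misread.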
-1, denotes the default classid configured from the command line
else, everything else will override the default classid to provide a facility for non-linear matching

Supported 32 bit action return codes from the C program and their meanings (linux/pkt_cls.h):

TC_ACT_OK (0), will terminate the packet processing pipeline and allows the packet to proceed
TC_ACT_SHOT (2), will terminate the packet processing pipeline and drops the packet
TC_ACT_UNSPEC (-1), will use the default action configured from tc (similarly as returning -1 from a classifier)
TC_ACT_PIPE (3), will iterate to the next action, if available
TC_ACT_RECLASSIFY (1), will terminate the packet processing pipeline and start classification from the beginning
else, everything else is an unspecified return code

Both classifier and action return codes are supported in eBPF and cBPF programs.

To demonstrate restricted C syntax, a minimal toy classifier example is provided, which assumes that egress packets, for instance originating from a container, have previously been marked in interval [0, 255]. The program keeps statistics on different marks for user space and maps the classid to the root qdisc with the marking itself as the minor handle:
    #include <stdint.h>
    #include <asm/types.h>

    #include <linux/bpf.h>
    #include <linux/pkt_sched.h>

    #include "helpers.h"

    struct tuple {
            long packets;
            long bytes;
    };

    #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
    #define BPF_MAX_MARK            256

    struct bpf_elf_map __section("maps") map_stats = {
            .type           =       BPF_MAP_TYPE_ARRAY,
            .id             =       BPF_MAP_ID_STATS,
            .size_key       =       sizeof(uint32_t),
            .size_value     =       sizeof(struct tuple),
            .max_elem       =       BPF_MAX_MARK,
    };

    static inline void cls_update_stats(const struct __sk_buff *skb,
                                        uint32_t mark)
    {
            struct tuple *tu;

            tu = bpf_map_lookup_elem(&map_stats, &mark);
            if (likely(tu)) {
                    __sync_fetch_and_add(&tu->packets, 1);
                    __sync_fetch_and_add(&tu->bytes, skb->len);
            }
    }

    __section("cls") int cls_main(struct __sk_buff *skb)
    {
            uint32_t mark = skb->mark;

            if (unlikely(mark >= BPF_MAX_MARK))
                    return 0;

            cls_update_stats(skb, mark);

            return TC_H_MAKE(TC_H_ROOT, mark);
    }

    char __license[] __section("license") = "GPL";

Another small example is a port redirector which demuxes destination port 80 into the interval [8080, 8087] steered by RSS, that can then be attached to ingress qdisc. The exercise of adding the egress counterpart and IPv6 support is left to the reader:
    #include <asm/types.h>
    #include <asm/byteorder.h>

    #include <linux/bpf.h>
    #include <linux/filter.h>
    #include <linux/in.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>

    #include "helpers.h"

    static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                     __u16 old_port, __u16 new_port)
    {
            bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                old_port, new_port, sizeof(new_port));
            bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                &new_port, sizeof(new_port), 0);
    }

    static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
    {
            __u16 dport, dport_new = 8080, off;
            __u8 ip_proto, ip_vl;

            ip_proto = load_byte(skb, nh_off +
                                 offsetof(struct iphdr, protocol));
            if (ip_proto != IPPROTO_TCP)
                    return 0;

            ip_vl = load_byte(skb, nh_off);
            if (likely(ip_vl == 0x45))
                    nh_off += sizeof(struct iphdr);
            else
                    nh_off += (ip_vl & 0xF) << 2;

            dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
            if (dport != 80)
                    return 0;

            off = skb->queue_mapping & 7;
            set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                          __cpu_to_be16(dport_new + off));
            return -1;
    }

    __section("lb") int lb_main(struct __sk_buff *skb)
    {
            int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

            if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                    ret = lb_do_ipv4(skb, nh_off);

            return ret;
    }

    char __license[] __section("license") = "GPL";

The related helper header file helpers.h in both examples was:
    /* Misc helper macros. */
    #define __section(x) __attribute__((section(x), used))
    #define offsetof(x, y) __builtin_offsetof(x, y)
    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Used map structure */
    struct bpf_elf_map {
            __u32 type;
            __u32 size_key;
            __u32 size_value;
            __u32 max_elem;
            __u32 id;
    };

    /* Some used BPF function calls. */
    static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                      int len, int flags) =
            (void *) BPF_FUNC_skb_store_bytes;
    static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                      int to, int flags) =
            (void *) BPF_FUNC_l4_csum_replace;
    static void *(*bpf_map_lookup_elem)(void *map, void *key) =
            (void *) BPF_FUNC_map_lookup_elem;

    /* Some used BPF intrinsics. */
    unsigned long long load_byte(void *skb, unsigned long long off)
            asm ("llvm.bpf.load.byte");
    unsigned long long load_half(void *skb, unsigned long long off)
            asm ("llvm.bpf.load.half");

As a best practice, we recommend loading only a single eBPF classifier in tc and performing all necessary matching and mangling from there, instead of using a list of individual classifiers and separate actions. A single classifier tailored for a given use-case will be the most efficient to run.
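The arithmetic used by the two examples above can be checked in user space without loading anything into the kernel. The following is a minimal sketch under stated assumptions: the EX_ prefixed constants are local copies of the values that linux/pkt_sched.h is understood to define for TC_H_MAJ_MASK, TC_H_MIN_MASK and TC_H_ROOT, and the function names ex_classid_for_mark() and ex_redirect_port() are hypothetical helpers introduced only for this illustration.

```c
#include <stdint.h>

/* Local copies of the handle masks (assumed to match linux/pkt_sched.h). */
#define EX_TC_H_MAJ_MASK        0xFFFF0000U
#define EX_TC_H_MIN_MASK        0x0000FFFFU
#define EX_TC_H_ROOT            0xFFFFFFFFU
#define EX_MAX_MARK             256U    /* mirrors BPF_MAX_MARK */

/* Mirrors cls_main(): marks outside [0, 255] yield 0, otherwise the
 * classid is the root major handle combined with the mark as the
 * minor handle, as TC_H_MAKE(TC_H_ROOT, mark) would compute it. */
static uint32_t ex_classid_for_mark(uint32_t mark)
{
        if (mark >= EX_MAX_MARK)
                return 0;

        return (EX_TC_H_ROOT & EX_TC_H_MAJ_MASK) |
               (mark & EX_TC_H_MIN_MASK);
}

/* Mirrors lb_do_ipv4(): the low three bits of the queue mapping pick
 * one of the eight destination ports in [8080, 8087]. The value is in
 * host byte order; the eBPF program converts it before storing. */
static uint16_t ex_redirect_port(uint32_t queue_mapping)
{
        return (uint16_t)(8080 + (queue_mapping & 7));
}
```

For instance, a mark of 0x2a maps to classid 0xffff002a, and a queue mapping of 13 selects port 8085, since only the low three bits are used.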
    #include <pcap.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            struct bpf_program prog;
            struct bpf_insn *ins;
            int i, ret, dlt = DLT_RAW;

            if (argc < 2 || argc > 3)
                    return 1;
            /* Optional first argument selects the link type by name. */
            if (argc == 3) {
                    dlt = pcap_datalink_name_to_val(argv[1]);
                    if (dlt == -1)
                            return 1;
            }

            /* Compile the filter expression given as the last argument. */
            ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                      1, PCAP_NETMASK_UNKNOWN);
            if (ret)
                    return 1;

            /* Emit "<len>,<code> <jt> <jf> <k>,..." without a trailing comma. */
            printf("%d,", prog.bf_len);
            ins = prog.bf_insns;

            for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                    printf("%u %u %u %u,", ins->code,
                           ins->jt, ins->jf, ins->k);
            printf("%u %u %u %u",
                   ins->code, ins->jt, ins->jf, ins->k);

            pcap_freecode(&prog);
            return 0;
    }

Given this small helper, any tcpdump(8) filter expression can be reused as a classifier where a match will result in the default classid:

    bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
    tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

Basically, such a minimal generator is equivalent to:

    tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > /var/bpf/tcp-syn

Since libpcap does not support all Linux-specific cBPF extensions in its compiler, the Linux kernel also ships a minimal BPF assembler called bpf_asm under tools/net/ for providing full control. For detailed syntax and semantics on implementing such programs by hand, see references under FURTHER READING.

Trivial toy example in bpf_asm for classifying IPv4/TCP packets, saved in a text file called foobar:
    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #6, drop
    ret #-1
    drop: ret #0

Similarly, such a classifier can be loaded as:

    bpf_asm foobar > /var/bpf/tcp-syn
    tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

For BPF classifiers, the Linux kernel additionally provides a small BPF debugger called bpf_dbg under tools/net/, which can be used to test a classifier against pcap files, single-step through the classifier program, set breakpoints in it and dump register contents during runtime.

Implementing an action in classic BPF is rather limited in the sense that packet mangling is not supported. Therefore, it's generally recommended to make the switch to eBPF whenever possible.
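To make the bytecode file format used above more concrete, here is a minimal sketch that serializes a hand-encoded cBPF program into the comma-separated "<len>,<code> <jt> <jf> <k>,..." form printed by the generator shown earlier. The struct cbpf_insn, the function cbpf_format() and the ex_ipv4_tcp array are hypothetical names for this illustration. The instruction encoding is a manual translation of the foobar listing using the classic BPF opcode values, and is an assumption that should be compared against real bpf_asm output before relying on it.

```c
#include <stdint.h>
#include <stdio.h>

/* Same layout idea as a classic BPF instruction: opcode, two jump
 * offsets and a 32 bit immediate. */
struct cbpf_insn {
        uint16_t code;
        uint8_t jt;
        uint8_t jf;
        uint32_t k;
};

/* Hand-encoded version of the foobar listing (assumed encoding). */
static const struct cbpf_insn ex_ipv4_tcp[] = {
        { 0x28, 0, 0, 12 },             /* ldh [12]             */
        { 0x15, 0, 3, 0x800 },          /* jne #0x800, drop     */
        { 0x30, 0, 0, 23 },             /* ldb [23]             */
        { 0x15, 0, 1, 6 },              /* jneq #6, drop        */
        { 0x06, 0, 0, 0xffffffffU },    /* ret #-1              */
        { 0x06, 0, 0, 0 },              /* drop: ret #0         */
};

/* Write "<len>,<code> <jt> <jf> <k>,..." into buf. Returns the string
 * length, or -1 if buf is too small. */
static int cbpf_format(const struct cbpf_insn *ins, unsigned int len,
                       char *buf, size_t size)
{
        size_t pos;
        unsigned int i;
        int n;

        n = snprintf(buf, size, "%u", len);
        if (n < 0 || (size_t)n >= size)
                return -1;
        pos = (size_t)n;

        for (i = 0; i < len; i++) {
                n = snprintf(buf + pos, size - pos, ",%u %u %u %u",
                             (unsigned int)ins[i].code,
                             (unsigned int)ins[i].jt,
                             (unsigned int)ins[i].jf,
                             (unsigned int)ins[i].k);
                if (n < 0 || (size_t)n >= size - pos)
                        return -1;
                pos += (size_t)n;
        }

        return (int)pos;
}
```

Calling cbpf_format(ex_ipv4_tcp, 6, buf, sizeof(buf)) with a sufficiently large buffer produces the string "6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 6,6 0 0 4294967295,6 0 0 0", which has the same shape as the content written to /var/bpf/tcp-syn.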