Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation used to support RDS over TCP as well
as IB. Work is in progress to support RDS over iWARP, and using DCE to
guarantee no dropped packets on Ethernet, it may be possible to use RDS over
UDP in the future.

The high-level semantics of RDS from the application's point of view are

 * Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 * Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the
        getsockopt/setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 * sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        These constants haven't been assigned yet, because RDS isn't in
        mainline yet. Currently, the kernel module assigns some constant
        and publishes it to user space through two sysctl files
                /proc/sys/net/rds/pf_rds
                /proc/sys/net/rds/sol_rds

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDSIZE bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVSIZE option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.
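
  Example: creating and sizing an RDS socket
        The following is an illustrative userspace sketch, not code from
        RDS or rds-tools: it reads the dynamically assigned protocol
        family from the sysctl file above, creates a socket, and sets the
        send buffer size (the 1MB value is arbitrary). Error handling is
        abbreviated.

        #include <stdio.h>
        #include <sys/socket.h>

        static int rds_pf(void)
        {
            /* PF_RDS is published via /proc until a constant is assigned */
            FILE *f = fopen("/proc/sys/net/rds/pf_rds", "r");
            int pf = -1;

            if (f) {
                if (fscanf(f, "%d", &pf) != 1)
                    pf = -1;
                fclose(f);
            }
            return pf;
        }

        int main(void)
        {
            int sndbuf = 1 << 20;   /* allow 1MB of queued, unacked sends */
            int fd = socket(rds_pf(), SOCK_SEQPACKET, 0);

            if (fd < 0) {
                perror("socket");
                return 1;
            }
            setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
            return 0;
        }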

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport.

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDSIZE will
        return with -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDSIZE threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.

  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVSIZE, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrives, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        it by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This allows the application to cancel outstanding messages if
        it detects a timeout. For instance, if it tried to send a message,
        and the remote host is unreachable, RDS will keep trying forever.
        The application may decide it's not worth it, and cancel the
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.
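
  Example: sending with back-pressure handling
        A hedged userspace sketch of the sendmsg() semantics above, not
        taken from rds-tools: it queues one datagram to the destination
        in 'dst', waits for send-queue space on EAGAIN, and backs off on
        ENOBUFS (destination port congested). A real application would
        wait for the congestion notifications described in rds(7) rather
        than sleeping.

        #include <errno.h>
        #include <poll.h>
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <netinet/in.h>

        int rds_send_one(int fd, struct sockaddr_in *dst,
                         void *buf, size_t len)
        {
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            struct msghdr msg = {
                .msg_name    = dst,
                .msg_namelen = sizeof(*dst),
                .msg_iov     = &iov,
                .msg_iovlen  = 1,
            };

            for (;;) {
                if (sendmsg(fd, &msg, 0) >= 0)
                    return 0;
                if (errno == EAGAIN) {
                    /* our own send queue is full - wait for room */
                    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
                    poll(&pfd, 1, -1);
                } else if (errno == ENOBUFS) {
                    /* destination congested - POLLOUT will not help */
                    poll(NULL, 0, 10);  /* crude 10ms backoff */
                } else {
                    return -1;
                }
            }
        }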


RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

        The message header is a 'struct rds_header' (see rds.h):
        Fields:
          h_sequence:
                per-packet sequence number
          h_ack:
                piggybacked acknowledgment of last packet received
          h_len:
                length of data, not including header
          h_sport:
                source port
          h_dport:
                destination port
          h_flags:
                CONG_BITMAP - this is a congestion update bitmap
                ACK_REQUIRED - receiver must ack this packet
                RETRANSMITTED - packet has previously been sent
          h_credit:
                indicate to other end of connection that
                it has more credits available (i.e. there is
                more send room)
          h_padding[4]:
                unused, for future use
          h_csum:
                header checksum
          h_exthdr:
                optional data can be passed here. This is currently used for
                passing RDMA-related information.

  ACK and retransmit handling

        One might think that with reliable IB connections you wouldn't need
        to ack messages that have been received. The problem is that IB
        hardware generates an ack message before it has DMAed the message
        into memory. This creates a potential message loss if the HCA is
        disabled for any reason after it sends the ack but before the
        message is DMAed and processed. This is only a potential issue
        if another HCA is available for fail-over.

        Sending an ack immediately would allow the sender to free the sent
        message from its send queue quickly, but could cause excessive
        traffic to be used for acks. RDS piggybacks acks on sent data
        packets. Ack-only packets are reduced by only allowing one to be
        in flight at a time, and by the sender only asking for acks when
        its send buffers start to fill up. All retransmissions are also
        acked.

  Flow Control

        RDS's IB transport uses a credit-based mechanism to verify that
        there is space in the peer's receive buffers for more data. This
        eliminates the need for hardware retries on the connection.

  Congestion

        Messages waiting in the receive queue on the receiving socket
        are accounted against the socket's SO_RCVBUF option value. Only
        the payload bytes in the message are accounted for. If the
        number of bytes queued equals or exceeds rcvbuf then the socket
        is congested. All sends attempted to this socket's address
        should block or return -EWOULDBLOCK.

        Applications are expected to be reasonably tuned such that this
        situation very rarely occurs. Hitting this "back-pressure" is
        considered an application bug.

        This is implemented by having each node maintain bitmaps which
        indicate which ports on bound addresses are congested. As the
        bitmap changes it is sent through all the connections which
        terminate in the local address of the bitmap which changed.

        The bitmaps are allocated as connections are brought up. This
        avoids allocation in the interrupt handling path which queues
        messages on sockets. The dense bitmaps let transports send the
        entire bitmap on any bitmap change reasonably efficiently. This
        is much easier to implement than some finer-grained
        communication of per-port congestion. The sender does a very
        inexpensive bit test to see whether the port it's about to send
        to is congested or not.
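
  Example: per-port congestion bit test

        An illustrative sketch of the bit test described above; the names
        and layout here are assumptions for clarity, not the definitions
        used by net/rds/cong.c. One bit per 16-bit port gives an 8KB
        bitmap per bound address.

        #include <stdbool.h>
        #include <stdint.h>

        #define PORT_BITMAP_BYTES   (65536 / 8)   /* one bit per port */

        struct cong_map {
            uint8_t bits[PORT_BITMAP_BYTES];
        };

        /* cheap test done by the sender on every send */
        static bool port_congested(const struct cong_map *map, uint16_t port)
        {
            return map->bits[port / 8] & (1u << (port % 8));
        }

        /* updated when a peer's congestion bitmap arrives */
        static void mark_port(struct cong_map *map, uint16_t port, bool on)
        {
            if (on)
                map->bits[port / 8] |= 1u << (port % 8);
            else
                map->bits[port / 8] &= ~(1u << (port % 8));
        }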


RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
        aka possibly "rds_outgoing"; the generic RDS layer copies data to
        be sent and sets header fields as needed, based on the socket API.
        This is then queued for the individual connection and sent by the
        connection's transport.
  struct rds_incoming
        a generic struct referring to incoming data that can be handed from
        the transport to the general code and queued by the general code
        while the socket is awoken. It is then passed back to the transport
        code to handle the actual copy-to-user.
  struct rds_socket
        per-socket information
  struct rds_connection
        per-connection information
  struct rds_transport
        pointers to transport-specific functions
  struct rds_statistics
        non-transport-specific statistics
  struct rds_cong_map
        wraps the raw congestion bitmap, contains rbnode, waitq, etc.


Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.
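
  Example: connection state handling

        A simplified sketch of the lifecycle described above, with
        illustrative names and a placeholder queue type rather than the
        definitions used in net/rds/connection.c. The point it shows:
        a dropped connection keeps its send queue, so queued or
        partially-sent datagrams are retransmitted after reconnect.

        enum conn_state {
            CONN_DOWN,
            CONN_CONNECTING,
            CONN_UP,
            CONN_DISCONNECTING,
            CONN_ERROR,
        };

        struct msg_queue { int nr_pending; };   /* placeholder type */

        struct conn {
            enum conn_state state;
            struct msg_queue send_queue;        /* survives reconnects */
        };

        /* called on transport errors; the connection itself is never freed */
        static void conn_drop(struct conn *conn)
        {
            conn->state = CONN_DISCONNECTING;
            /* tear down transport resources here ... */
            conn->state = CONN_DOWN;
            /*
             * A reconnect worker later moves DOWN -> CONNECTING -> UP and
             * restarts transmission from the head of send_queue.
             */
        }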


The send path
=============

  rds_sendmsg()
        struct rds_message built from incoming data
        CMSGs parsed (e.g. RDMA ops)
        transport connection allocated and connected if not already
        rds_message placed on send queue
        send worker awoken
  rds_send_worker()
        calls rds_send_xmit() until queue is empty
  rds_send_xmit()
        transmits congestion map if one is pending
        may set ACK_REQUIRED
        calls transport to send either non-RDMA or RDMA message
        (RDMA ops never retransmitted)
  rds_ib_xmit()
        allocs work requests from send ring
        adds any new send credits available to peer (h_credits)
        maps the rds_message's sg list
        piggybacks ack
        populates work requests
        posts send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
        looks at write completions
        unmaps recv buffer from device
        no errors, call rds_ib_process_recv()
        refill recv ring
  rds_ib_process_recv()
        validate header checksum
        copy header to rds_ib_incoming struct if start of a new datagram
        add to ibinc's fraglist
        if completed datagram:
              update cong map if datagram was cong update
              call rds_recv_incoming() otherwise
        note if ack is required
  rds_recv_incoming()
        drop duplicate packets
        respond to pings
        find the sock associated with this datagram
        add to sock queue
        wake up sock
        do some congestion calculations
  rds_recvmsg()
        copy data into user iovec
        handle CMSGs
        return to application
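
  Example: receiving a datagram and its notifications

        A hedged userspace sketch of the application-facing end of the
        recv path, not code from rds-tools: it pulls one datagram off an
        RDS socket and walks any control messages delivered with it.
        sol_rds is the value read from /proc/sys/net/rds/sol_rds; the
        meaning of each cmsg_type is described in rds(7) and rds-rdma(7).

        #include <stdio.h>
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>

        int rds_recv_one(int fd, int sol_rds)
        {
            char data[8192], cbuf[256];
            struct sockaddr_in from;
            struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
            struct msghdr msg = {
                .msg_name       = &from,
                .msg_namelen    = sizeof(from),
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = cbuf,
                .msg_controllen = sizeof(cbuf),
            };
            struct cmsghdr *cmsg;
            ssize_t len = recvmsg(fd, &msg, 0);

            if (len < 0)
                return -1;

            /* notifications arrive as SOL_RDS-level control messages */
            for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                 cmsg = CMSG_NXTHDR(&msg, cmsg))
                if (cmsg->cmsg_level == sol_rds)
                    printf("notification, cmsg_type %d\n", cmsg->cmsg_type);

            printf("%zd payload bytes from port %u\n",
                   len, ntohs(from.sin_port));
            return 0;
        }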