# URL syntax and their use in curl

## Specifications

The official "URL syntax" is primarily defined in these two different
specifications:

 - [RFC 3986](https://tools.ietf.org/html/rfc3986) (although URL is called "URI" in there)
 - [The WHATWG URL Specification](https://url.spec.whatwg.org/)

RFC 3986 is the earlier one, and curl has always tried to adhere to that one
(since it shipped in January 2005).

The WHATWG URL spec was written later, is incompatible with RFC 3986 and
changes over time.

## Variations

URL parsers as implemented in browsers, libraries and tools usually opt to
support one of the mentioned specifications. Bugs, differences in
interpretations and the moving nature of the WHATWG spec do however make it
very unlikely that multiple parsers treat URLs the exact same way!

## Security

Due to the inherent differences between URL parser implementations, it is
considered a security risk to mix different implementations and assume the
same behavior!

For example, if you use one parser to check if a URL uses a good host name or
the correct auth field, and then pass on that same URL to a *second* parser,
there will always be a risk it treats the same URL differently. There is no
right and wrong in URL land, only differences of opinions.

libcurl offers a separate API to its URL parser for this reason, among others.

Applications may at times find it convenient to allow users to specify URLs
for various purposes and that string would then end up fed to curl. Getting a
URL from an external untrusted party and using it with curl brings several
security concerns:

1. If you have an application that runs as or in a server application, getting
   an unfiltered URL can trick your application to access a local resource
   instead of a remote resource. Protecting yourself against localhost
   accesses is very hard when accepting user provided URLs.

2.
   Such custom URLs can access other ports than you planned as port numbers
   are part of the regular URL format. The combination of a local host and a
   custom port number can allow external users to play tricks with your local
   services.

3. Such a URL might use other schemes than you thought of or planned for.

## "RFC 3986 plus"

curl recognizes a URL syntax that we call "RFC 3986 plus". It is grounded on
the well established RFC 3986 to make sure previously written command lines
and scripts that use curl keep working.

curl's URL parser allows a few deviations from the spec in order to
inter-operate better with URLs that appear in the wild.

### spaces

In particular `Location:` headers that indicate to the client where a resource
has been redirected to, sometimes contain spaces. This is a violation of RFC
3986 but is fine in the WHATWG spec. curl handles these by re-encoding them to
`%20`.

### non-ASCII

Byte values in a provided URL that are outside of the printable ASCII range
are percent-encoded by curl.

### multiple slashes

An absolute URL always starts with a "scheme" followed by a colon. For all the
schemes curl supports, the colon must be followed by two slashes according to
RFC 3986 but not according to the WHATWG spec - which allows any number of
slashes, from one and up.

curl allows one, two or three slashes after the colon to still be considered a
valid URL.

### "scheme-less"

curl supports "URLs" that do not start with a scheme. This is not supported by
any of the specifications. This is a shortcut to entering URLs that was
supported by browsers early on and has been mimicked by curl.

Based on what the host name starts with, curl will "guess" what protocol to
use:

 - `ftp.` means FTP
 - `dict.` means DICT
 - `ldap.` means LDAP
 - `imap.` means IMAP
 - `smtp.` means SMTP
 - `pop3.` means POP3
 - all other means HTTP

### globbing letters

The curl command line tool supports "globbing" of URLs. It means that you can
create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The
letters used for this (`[]{}`) are reserved in RFC 3986 and can therefore not
legitimately be part of such a URL.

They are however not reserved or special in the WHATWG specification, so
globbing can mess up such URLs. Globbing can be turned off for such occasions
(using `--globoff`).

# URL syntax details

A URL may consist of the following components - many of them are optional:

    [scheme][divider][userinfo][hostname][port number][path][query][fragment]

Each component is separated from the following component with a divider
character or string.

For example, this could look like:

    http://user:password@www.example.com:80/index.html?foo=bar#top

## Scheme

The scheme specifies the protocol to use. A curl build can support a few or
many different schemes. You can limit what schemes curl should accept.

curl supports the following schemes on URLs specified to transfer.
They are matched case insensitively:

`dict`, `file`, `ftp`, `ftps`, `gopher`, `gophers`, `http`, `https`, `imap`,
`imaps`, `ldap`, `ldaps`, `mqtt`, `pop3`, `pop3s`, `rtmp`, `rtmpe`, `rtmps`,
`rtmpt`, `rtmpte`, `rtmpts`, `rtsp`, `smb`, `smbs`, `smtp`, `smtps`, `telnet`,
`tftp`

When the URL is specified to identify a proxy, curl recognizes the following
schemes:

`http`, `https`, `socks4`, `socks4a`, `socks5`, `socks5h`, `socks`

## Userinfo

The userinfo field can be used to set user name and password for
authentication purposes in this transfer. The use of this field is discouraged
since it often means passing around the password in plain text and is thus a
security risk.

URLs for IMAP, POP3 and SMTP also support *login options* as part of the
userinfo field. They are given after the password, separated from it by a
semicolon.

## Hostname

The hostname part of the URL contains the address of the server that you want
to connect to. This can be the fully qualified domain name of the server, the
local network name of the machine on your network or the IP address of the
server or machine represented by either an IPv4 or IPv6 address (within
brackets). For example:

    http://www.example.com/

    http://hostname/

    http://192.168.0.1/

    http://[2001:1890:1112:1::20]/

### "localhost"

Starting in curl 7.77.0, curl will use loopback IP addresses for the name
`localhost`: `127.0.0.1` and `::1`. It will not try to resolve the name using
the resolver functions.

This is done to make sure the host accessed is truly the localhost - the local
machine.

### IDNA

If curl was built with International Domain Name (IDN) support, it can also
handle host names using non-ASCII characters.

When built with libidn2, curl uses the IDNA 2008 standard.
This is equivalent
to the WHATWG URL spec, but differs from certain browsers that use IDNA 2003
Transitional Processing. The two standards have a huge overlap but differ
slightly, perhaps most famously in how they deal with the German "double s"
(`ß`).

When winidn is used, curl uses IDNA 2003 Transitional Processing, like the rest
of Windows.

## Port number

If there's a colon after the hostname, that should be followed by the port
number to use, in the range 1 - 65535. curl also supports a blank port number
field - but only if the URL starts with a scheme.

If the port number is not specified in the URL, curl will use a default port
based on the provided scheme:

DICT 2628, FTP 21, FTPS 990, GOPHER 70, GOPHERS 70, HTTP 80, HTTPS 443,
IMAP 143, IMAPS 993, LDAP 389, LDAPS 636, MQTT 1883, POP3 110, POP3S 995,
RTMP 1935, RTMPS 443, RTMPT 80, RTSP 554, SCP 22, SFTP 22, SMB 445, SMBS 445,
SMTP 25, SMTPS 465, TELNET 23, TFTP 69

# Scheme specific behaviors

## FTP

The path part of an FTP request specifies the file to retrieve and from which
directory. If the file part is omitted then libcurl downloads the directory
listing for the directory specified. If the directory is omitted then the
directory listing for the root / home directory will be returned.

FTP servers typically put the user in their "home directory" after login,
which then differs between users. To explicitly specify the root directory of
an FTP server, start the path with double slash `//` or `/%2f` (2F is the
hexadecimal value of the ASCII code for the slash).

## FILE

When a `FILE://` URL is accessed on Windows systems, it can be crafted in a
way so that Windows attempts to connect to a (remote) machine when curl wants
to read or write such a path.

curl only allows the hostname part of a FILE URL to be one out of these three
alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters).
Anything else will make curl fail to parse the URL.

### Windows-specific FILE details

curl accepts that the FILE URL's path starts with a "drive letter". That's a
single letter `a` to `z` followed by a colon or a pipe character (`|`).

The Windows operating system itself will convert some file accesses to perform
network accesses over SMB/CIFS, through several different file path patterns.
This way, a `file://` URL passed to curl *might* be converted into a network
access inadvertently and unknowingly to curl. This is a Windows feature curl
cannot control or disable.

## IMAP

The path part of an IMAP request not only specifies the mailbox to list or
select, but can also be used to check the `UIDVALIDITY` of the mailbox, to
specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and
to specify what messages to search for.
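
To make the layout of such a path concrete, here is a minimal Python sketch
that splits an IMAP URL path into the mailbox name and its `;NAME=value`
parameters. The function name `split_imap_path` is made up for this
illustration, and the code is not how curl parses these URLs; it only shows
how the pieces are laid out in the path.

```python
from urllib.parse import unquote


def split_imap_path(path):
    """Split an IMAP URL path into (mailbox, parameters).

    Illustration only: a hypothetical helper, not curl's parser.
    """
    # The first piece is the mailbox, every later piece is NAME=value.
    mailbox, *params = path.lstrip("/").split(";")
    fields = {}
    for param in params:
        # Each parameter may carry the trailing slash that precedes the
        # next ";NAME=value" piece.
        name, _, value = param.rstrip("/").partition("=")
        fields[name.upper()] = unquote(value)
    return unquote(mailbox.rstrip("/")), fields


print(split_imap_path("/INBOX;UIDVALIDITY=50/;UID=2"))
# prints ('INBOX', {'UIDVALIDITY': '50', 'UID': '2'})
```

A search such as `?NEW` lives in the query part of the URL and is therefore
not handled by this sketch.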

A top level folder list:

    imap://user:password@mail.example.com

A folder list on the user's inbox:

    imap://user:password@mail.example.com/INBOX

Select the user's inbox and fetch message with uid = 1:

    imap://user:password@mail.example.com/INBOX/;UID=1

Select the user's inbox and fetch the first message in the mailbox:

    imap://user:password@mail.example.com/INBOX/;MAILINDEX=1

Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and
fetch message 2 if it is:

    imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2

Select the user's inbox and fetch the text portion of message 3:

    imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT

Select the user's inbox and fetch the first 1024 octets of message 4:

    imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024

Select the user's inbox and check for NEW messages:

    imap://user:password@mail.example.com/INBOX?NEW

Select the user's inbox and search for messages containing "shadows" in the
subject line:

    imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows

For more information about the individual components of an IMAP URL please see
RFC 5092.

## LDAP

The path part of a LDAP request can be used to specify the: Distinguished
Name, Attributes, Scope, Filter and Extension for a LDAP search. Each field is
separated by a question mark and when that field is not required an empty
string with the question mark separator should be included.
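
The question mark separated layout can be sketched in a few lines of Python.
The helper `split_ldap_fields` and the `LDAP_FIELDS` tuple are names invented
for this illustration, not part of curl; the input is the part of the URL that
follows the slash after the host name.

```python
from urllib.parse import unquote

# Field order as listed above; names chosen for this illustration.
LDAP_FIELDS = ("dn", "attributes", "scope", "filter", "extensions")


def split_ldap_fields(rest):
    """Map the '?'-separated pieces of an LDAP URL onto named fields."""
    parts = rest.split("?")
    # Trailing fields that were left out count as empty.
    parts += [""] * (len(LDAP_FIELDS) - len(parts))
    return {name: unquote(part) for name, part in zip(LDAP_FIELDS, parts)}


fields = split_ldap_fields("o=My%20Organisation??sub")
print(fields["dn"], "|", fields["attributes"], "|", fields["scope"])
# prints: o=My Organisation |  | sub
```

The doubled question mark in the input is the empty placeholder for the unused
Attributes field, which is what keeps `sub` in the Scope position.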

Search for the DN as `My Organisation`:

    ldap://ldap.example.com/o=My%20Organisation

The same search, but returning only postalAddress attributes:

    ldap://ldap.example.com/o=My%20Organisation?postalAddress

Search for an empty DN and request information about the
`rootDomainNamingContext` attribute for an Active Directory server:

    ldap://ldap.example.com/?rootDomainNamingContext

For more information about the individual components of a LDAP URL please
see [RFC 4516](https://tools.ietf.org/html/rfc4516).

## POP3

The path part of a POP3 request specifies the message ID to retrieve. If the
ID is not specified then a list of waiting messages is returned instead.

## SCP

The path part of an SCP URL specifies the path and file to retrieve or
upload. The file is taken as an absolute path from the root directory on the
server.

To specify a path relative to the user's home directory on the server, prepend
`~/` to the path portion.

## SFTP

The path part of an SFTP URL specifies the file to retrieve or upload. If the
path ends with a slash (`/`) then a directory listing is returned instead of a
file. If the path is omitted entirely then the directory listing for the root
/ home directory will be returned.

## SMB

The path part of a SMB request specifies the file to retrieve and from what
share and directory or the share to upload to and as such, may not be omitted.
If the user name is embedded in the URL then it must contain the domain name
and as such, the backslash must be URL encoded as `%2f`.

curl supports SMB version 1 (only).

## SMTP

The path part of a SMTP request specifies the host name to present during
communication with the mail server. If the path is omitted, then libcurl will
attempt to resolve the local computer's host name.
However, this may not
return the fully qualified domain name that is required by some mail servers
and specifying this path allows you to set an alternative name, such as your
machine's fully qualified domain name, which you might have obtained from an
external function such as gethostname or getaddrinfo.

The default SMTP port is 25. Some servers use port 587 as an alternative.

## RTMP

There's no official URL spec for RTMP so libcurl uses the URL syntax supported
by the underlying librtmp library. It has a syntax where it wants a
traditional URL, followed by a space and a series of space-separated
`name=value` pairs.

While a space is not typically a "legal" character in a URL, libcurl accepts
it here. When a user wants to pass in a `#` (hash) character it will be
treated as a fragment and get cut off by libcurl if provided literally. You
will instead have to escape it by providing it as backslash and its ASCII
value in hexadecimal: `\23`.
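
As a small illustration of the escaping rule, the Python sketch below builds
an RTMP-style string from a base URL and a set of `name=value` pairs,
replacing each `#` with `\23`. The helper names, the example URL and the
option name `playpath` are placeholders chosen for this example, and only the
`#` case described above is handled.

```python
def escape_hash(value):
    """Replace '#' with a backslash followed by its hex ASCII value (23)."""
    return value.replace("#", "\\23")


def build_rtmp_url(base_url, options):
    """Append space-separated name=value pairs to a traditional URL.

    Hypothetical helper for illustration; option names are not validated.
    """
    pairs = [f"{name}={escape_hash(value)}" for name, value in options.items()]
    return " ".join([base_url, *pairs])


print(build_rtmp_url("rtmp://example.com/app", {"playpath": "a#b"}))
# prints: rtmp://example.com/app playpath=a\23b
```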