// Copyright 2012 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// NB: Modelled after Mozilla's code (originally written by Pamela Greene,
// later modified by others), but almost entirely rewritten for Chrome.
//   (netwerk/dns/src/nsEffectiveTLDService.h)
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Mozilla TLD Service
 *
 * The Initial Developer of the Original Code is
 * Google Inc.
 * Portions created by the Initial Developer are Copyright (C) 2006
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Pamela Greene <pamg.bugs@gmail.com> (original author)
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */

/*
  (Documentation based on the Mozilla documentation currently at
  http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same
  author.)

  The RegistryControlledDomainService examines the hostname of a GURL passed to
  it and determines the longest portion that is controlled by a registrar.
  Although technically the top-level domain (TLD) for a hostname is the last
  dot-portion of the name (such as .com or .org), many domains (such as co.uk)
  function as though they were TLDs, allocating any number of more specific,
  essentially unrelated names beneath them.  For example, .uk is a TLD, but
  nobody is allowed to register a domain directly under .uk; the "effective"
  TLDs are ac.uk, co.uk, and so on.  We wouldn't want to allow any site in
  *.co.uk to set a cookie for the entire co.uk domain, so it's important to be
  able to identify which higher-level domains function as effective TLDs and
  which can be registered.

  The service obtains its information about effective TLDs from a text resource
  that must be in the following format:

  * It should use plain ASCII.
  * It should contain one domain rule per line, terminated with \n, with nothing
    else on the line.  (The last rule in the file may omit the ending \n.)
  * Rules should have been normalized using the same canonicalization that GURL
    applies.  For ASCII, that means they're not case-sensitive, among other
    things; other normalizations are applied for other characters.
  * Each rule should list the entire TLD-like domain name, with any subdomain
    portions separated by dots (.) as usual.
  * Rules should neither begin nor end with a dot.
  * If a hostname matches more than one rule, the most specific rule (that is,
    the one with more dot-levels) will be used.
  * Other than in the case of wildcards (see below), rules do not implicitly
    include their subcomponents.  For example, "bar.baz.uk" does not imply
    "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk"
    will match, but "baz.uk" and "qux.baz.uk" won't.
  * The wildcard character '*' will match any valid sequence of characters.
  * Wildcards may only appear as the entire most specific level of a rule.  That
    is, a wildcard must come at the beginning of a line and must be followed by
    a dot.  (You may not use a wildcard as the entire rule.)
  * A wildcard rule implies a rule for the entire non-wildcard portion.  For
    example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule
    "bar").  This is typically important in the case of exceptions (see below).
  * The exception character '!' before a rule marks an exception to a wildcard
    rule.  If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then
    "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp"
    has an effective TLD of "tokyo.jp" (the exception prevents the wildcard
    match, and we thus fall through to matching on the implied "tokyo.jp" rule
    from the wildcard).
  * If you use an exception rule without a corresponding wildcard rule, the
    behavior is undefined.
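
  As a rough illustration of the matching rules listed above (the most
  specific rule wins, a wildcard implies its non-wildcard portion, and an
  exception cancels a wildcard match), the lookup could be sketched as below.
  This is only a sketch under stated assumptions, not the implementation
  behind this header: EffectiveTldSketch is a hypothetical helper, the host
  and rules are assumed to be canonicalized already, and a std::set stands in
  for however the real rule data is stored.

```cpp
#include <stddef.h>

#include <set>
#include <string>
#include <string_view>

// Hypothetical helper, not part of this header: returns the longest suffix of
// |host| (cut at label boundaries) that the rule set treats as an effective
// TLD, or an empty view if no rule matches.
std::string_view EffectiveTldSketch(std::string_view host,
                                    const std::set<std::string>& rules) {
  // Visit suffixes from most specific to least specific, so the first hit
  // corresponds to the rule with the most dot-levels.
  for (std::string_view suffix = host; !suffix.empty();) {
    const std::string s(suffix);
    const size_t dot = suffix.find('.');
    const std::string_view parent = dot == std::string_view::npos
                                        ? std::string_view()
                                        : suffix.substr(dot + 1);

    // An exception cancels the wildcard match; the rule implied for the
    // non-wildcard portion (the parent) applies instead.
    if (rules.count("!" + s))
      return parent;
    // A plain rule, or the rule implied by a wildcard one level below it.
    if (rules.count(s) || rules.count("*." + s))
      return suffix;
    // A wildcard that covers the most specific label of this suffix.
    if (!parent.empty() && rules.count("*." + std::string(parent)))
      return suffix;

    suffix = parent;
  }
  return std::string_view();
}
```

  With the rules "*.tokyo.jp" and "!pref.tokyo.jp", this sketch yields
  "b.tokyo.jp" for "a.b.tokyo.jp" and "tokyo.jp" for "a.pref.tokyo.jp",
  in line with the exception example in the list above.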

  Firefox has a very similar service, and it's their data file we use to
  construct our resource.  However, the data expected by this implementation
  differs from the Mozilla file in several important ways:
   (1) We require that all single-level TLDs (com, edu, etc.) be explicitly
       listed.  As of this writing, Mozilla's file includes the single-level
       TLDs too, but that might change.
   (2) Our data is expected to be in pure ASCII: all UTF-8 or otherwise encoded
       items must already have been normalized.
   (3) We do not allow comments, rule notes, blank lines, or line endings other
       than LF.
  Rules are also expected to be syntactically valid.

  The utility application tld_cleanup.exe converts a Mozilla-style file into a
  Chrome one, making sure that single-level TLDs are explicitly listed, using
  GURL to normalize rules, and validating the rules.
*/

#ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_
#define NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_

#include <stddef.h>

#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

#include "base/containers/span.h"
#include "net/base/net_export.h"

class GURL;

namespace url {
class Origin;
}

struct DomainRule;

namespace net::registry_controlled_domains {

// This enum is a required parameter to all public methods declared for this
// service. The Public Suffix List (http://publicsuffix.org/) this service
// uses as a data source splits all effective-TLDs into two groups. The main
// group describes registries that are acknowledged by ICANN. The second group
// contains a list of private additions for domains that enable external users
// to create subdomains, such as appspot.com.
// The PrivateRegistryFilter enum lets you choose whether you want to include
// the private additions in your lookup.
// See this for example use cases:
// https://wiki.mozilla.org/Public_Suffix_List/Use_Cases
enum PrivateRegistryFilter {
  EXCLUDE_PRIVATE_REGISTRIES = 0,
  INCLUDE_PRIVATE_REGISTRIES
};

// This enum is a required parameter to the GetRegistryLength functions
// declared for this service. Whenever there is no matching rule in the
// effective-TLD data (or in the default data, if the resource failed to
// load), the result will be dependent on which enum value was passed in.
// If EXCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting registry length
// will be 0. If INCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting
// registry length will be the length of the last subcomponent (e.g. 3 for
// foobar.baz).
enum UnknownRegistryFilter {
  EXCLUDE_UNKNOWN_REGISTRIES = 0,
  INCLUDE_UNKNOWN_REGISTRIES
};
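
/*
  A minimal sketch of the fallback described above, assuming the rule lookup
  has already failed for the host.  UnmatchedRegistryLengthSketch is a
  hypothetical helper, not part of this service, and the enum is repeated only
  so that the snippet compiles on its own.  Trailing dots and IP addresses are
  deliberately left out of the sketch.

```cpp
#include <stddef.h>

#include <string_view>

// Local copy of the enum declared above, so the sketch is self-contained.
enum UnknownRegistryFilter {
  EXCLUDE_UNKNOWN_REGISTRIES = 0,
  INCLUDE_UNKNOWN_REGISTRIES
};

// Hypothetical helper: the registry length to report for |host| when no rule
// matched it.
size_t UnmatchedRegistryLengthSketch(std::string_view host,
                                     UnknownRegistryFilter unknown_filter) {
  if (unknown_filter == EXCLUDE_UNKNOWN_REGISTRIES)
    return 0;
  const size_t last_dot = host.rfind('.');
  // A host with a single subcomponent has no registry to report.
  if (last_dot == std::string_view::npos)
    return 0;
  // Otherwise the last subcomponent is assumed to be the registry.
  return host.size() - last_dot - 1;
}
```

  For "foobar.baz" the sketch reports 3 with INCLUDE_UNKNOWN_REGISTRIES and 0
  with EXCLUDE_UNKNOWN_REGISTRIES.
*/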

// Returns the registered, organization-identifying host and all its registry
// information, but no subdomains, from the given GURL.  Returns an empty
// string if the GURL is invalid, has no host (e.g. a file: URL), has multiple
// trailing dots, is an IP address, has only one subcomponent (i.e. no dots
// other than leading/trailing ones), or is itself a recognized registry
// identifier.  If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), the last subcomponent of
// the host is assumed to be the registry.
//
// Examples:
//   http://www.google.com/file.html -> "google.com"  (com)
//   http://..google.com/file.html   -> "google.com"  (com)
//   http://google.com./file.html    -> "google.com." (com)
//   http://a.b.co.uk/file.html      -> "b.co.uk"     (co.uk)
//   file:///C:/bar.html             -> ""            (no host)
//   http://foo.com../file.html      -> ""            (multiple trailing dots)
//   http://192.168.0.1/file.html    -> ""            (IP address)
//   http://bar/file.html            -> ""            (no subcomponents)
//   http://co.uk/file.html          -> ""            (host is a registry)
//   http://foo.bar/file.html        -> "foo.bar"     (no rule; assume bar)
NET_EXPORT std::string GetDomainAndRegistry(const GURL& gurl,
                                            PrivateRegistryFilter filter);

// Like the GURL version, but takes an Origin. Returns an empty string if the
// Origin is opaque.
NET_EXPORT std::string GetDomainAndRegistry(const url::Origin& origin,
                                            PrivateRegistryFilter filter);

// Like the GURL / Origin versions, but takes a host (which is canonicalized
// internally). Prefer either the GURL or Origin variants instead of this one
// to avoid needing to re-canonicalize the host.
NET_EXPORT std::string GetDomainAndRegistry(std::string_view host,
                                            PrivateRegistryFilter filter);

// Like the Origin version above, but returns a std::string_view that is
// backed by the supplied url::Origin.
NET_EXPORT std::string_view GetDomainAndRegistryAsStringPiece(
    const url::Origin& origin,
    PrivateRegistryFilter filter);

// These convenience functions return true if the two GURLs or Origins both have
// hosts and one of the following is true:
// * The hosts are identical.
// * They each have a known domain and registry, and it is the same for both
//   URLs.  Note that this means the trailing dot, if any, must match too.
// Effectively, callers can use this function to check whether the input URLs
// represent hosts "on the same site".
NET_EXPORT bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2,
                                 PrivateRegistryFilter filter);
NET_EXPORT bool SameDomainOrHost(const url::Origin& origin1,
                                 const url::Origin& origin2,
                                 PrivateRegistryFilter filter);
// Note: this returns false if |origin2| is not set.
NET_EXPORT bool SameDomainOrHost(const url::Origin& origin1,
                                 const std::optional<url::Origin>& origin2,
                                 PrivateRegistryFilter filter);
NET_EXPORT bool SameDomainOrHost(const GURL& gurl,
                                 const url::Origin& origin,
                                 PrivateRegistryFilter filter);

// Finds the length in bytes of the registrar portion of the host in the
// given GURL.  Returns std::string::npos if the GURL is invalid or has no
// host (e.g. a file: URL).  Returns 0 if the GURL has multiple trailing dots,
// is an IP address, has no subcomponents, or is itself a recognized registry
// identifier.  The result is also dependent on the UnknownRegistryFilter.
// If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), returns 0 if
// |unknown_filter| is EXCLUDE_UNKNOWN_REGISTRIES, or the length of the last
// subcomponent if |unknown_filter| is INCLUDE_UNKNOWN_REGISTRIES.
//
// Examples:
//   http://www.google.com/file.html -> 3                 (com)
//   http://..google.com/file.html   -> 3                 (com)
//   http://google.com./file.html    -> 4                 (com)
//   http://a.b.co.uk/file.html      -> 5                 (co.uk)
//   file:///C:/bar.html             -> std::string::npos (no host)
//   http://foo.com../file.html      -> 0                 (multiple trailing
//                                                         dots)
//   http://192.168.0.1/file.html    -> 0                 (IP address)
//   http://bar/file.html            -> 0                 (no subcomponents)
//   http://co.uk/file.html          -> 0                 (host is a registry)
//   http://foo.bar/file.html        -> 0 or 3, depending (no rule; assume
//                                                         bar)
NET_EXPORT size_t GetRegistryLength(const GURL& gurl,
                                    UnknownRegistryFilter unknown_filter,
                                    PrivateRegistryFilter private_filter);
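
/*
  The example tables for GetDomainAndRegistry and GetRegistryLength above line
  up: the returned domain is the registry plus the single label in front of
  it.  The sketch below illustrates that relationship on a plain host string.
  It is an assumption-laden illustration rather than the code behind these
  functions: DomainAndRegistrySketch is a hypothetical helper, and
  |registry_length| is assumed to come from a GetRegistryLength-style lookup
  done with INCLUDE_UNKNOWN_REGISTRIES.

```cpp
#include <stddef.h>

#include <string>
#include <string_view>

// Hypothetical helper, not declared in this header.
std::string_view DomainAndRegistrySketch(std::string_view host,
                                         size_t registry_length) {
  // No host (npos) or no usable registry (0): there is nothing to return.
  if (registry_length == std::string::npos || registry_length == 0)
    return std::string_view();
  // At least one character plus a dot must precede the registry.
  if (host.size() < registry_length + 2)
    return std::string_view();
  // Search backwards for the dot in front of the label that precedes the
  // registry.  The "2" skips the dot between that label and the registry and
  // requires the label to be at least one character long.
  const size_t dot = host.rfind('.', host.size() - registry_length - 2);
  if (dot == std::string_view::npos)
    return host;
  return host.substr(dot + 1);
}
```

  For instance, the host "a.b.co.uk" with a registry length of 5 gives
  "b.co.uk", and "google.com." with a registry length of 4 gives
  "google.com.", as in the tables above.
*/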

// Returns true if the given host name has a registry-controlled domain. The
// host name will be internally canonicalized. Also returns true for invalid
// host names like "*.google.com" as long as it has a valid registry-controlled
// portion (see PermissiveGetHostRegistryLength for particulars).
NET_EXPORT bool HostHasRegistryControlledDomain(
    std::string_view host,
    UnknownRegistryFilter unknown_filter,
    PrivateRegistryFilter private_filter);

// Returns true if the given host name is a registry identifier. The name should
// be already canonicalized, and not an IP address. This returns true for
// registries specified by wildcard rules as well as non-wildcard rules. For
// example, if there is a wildcard rule of "*.foo.bar", then "a.foo.bar" is
// considered a registry identifier.
NET_EXPORT bool HostIsRegistryIdentifier(std::string_view canon_host,
                                         PrivateRegistryFilter private_filter);

// Like GetRegistryLength, but takes a previously-canonicalized host instead of
// a GURL. Prefer the GURL version or HostHasRegistryControlledDomain to
// eliminate the possibility of bugs with non-canonical hosts.
//
// If you have a non-canonical host name, use the "Permissive" version instead.
NET_EXPORT size_t
GetCanonicalHostRegistryLength(std::string_view canon_host,
                               UnknownRegistryFilter unknown_filter,
                               PrivateRegistryFilter private_filter);

// Like GetRegistryLength for a potentially non-canonicalized hostname.  This
// splits the input into substrings at '.' characters, then attempts to
// piecewise-canonicalize the substrings. After finding the registry length of
// the concatenated piecewise string, it then maps back to the corresponding
// length in the original input string.
//
// It will also handle hostnames that are otherwise invalid as long as they
// contain a valid registry controlled domain at the end. Invalid dot-separated
// portions of the domain will be left as-is when the string is looked up in
// the registry database (which will result in no match).
//
// This will handle all cases except for the pattern:
//   <invalid-host-chars> <non-literal-dot> <valid-registry-controlled-domain>
// For example:
//   "%00foo%2Ecom" (would canonicalize to "foo.com" if the "%00" was removed)
// A non-literal dot (like "%2E" or a fullwidth period) will normally get
// canonicalized to a dot if the host chars were valid. But since the %2E will
// be in the same substring as the %00, the substring will fail to
// canonicalize, the %2E will be left escaped, and the valid registry
// controlled domain at the end won't match.
//
// The string won't be trimmed, so things like trailing spaces will be
// considered part of the host and therefore won't match any TLD. It will
// return std::string::npos like GetRegistryLength() for empty input, but
// because invalid portions are skipped, it won't return npos in any other case.
NET_EXPORT size_t
PermissiveGetHostRegistryLength(std::string_view host,
                                UnknownRegistryFilter unknown_filter,
                                PrivateRegistryFilter private_filter);
NET_EXPORT size_t
PermissiveGetHostRegistryLength(std::u16string_view host,
                                UnknownRegistryFilter unknown_filter,
                                PrivateRegistryFilter private_filter);

typedef const struct DomainRule* (*FindDomainPtr)(const char *, unsigned int);

// Used for unit tests. Uses default domains.
NET_EXPORT_PRIVATE void ResetFindDomainGraphForTesting();

// Used for unit tests, so that a frozen list of domains is used.
NET_EXPORT_PRIVATE void SetFindDomainGraphForTesting(
    base::span<const uint8_t> domains);

}  // namespace net::registry_controlled_domains

#endif  // NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_