This license collection script is, fundamentally, one giant pile of
special cases. As such, while there is an attempt to model the rules
that apply to licenses and apply some sort of order to the process,
the code is less than clear. This file attempts to provide an overview.

main.dart is the core of the operation. It first walks the entire
directory tree starting from the root of the repository (which must
be specified on the command line as the only argument), creating an
in-memory representation of the project. Make sure to run this only
after you've run gclient sync, so that all dependencies are on disk.
This is the step that is labeled "Preparing data structures".

Then, it walks this in-memory representation, attempting to assign to
each file one or more licenses. This is the step labeled "Collecting
licenses", which takes a long time.

Finally, it prints out these licenses.

The in-memory representation is a tree of RepositoryEntry objects.
There are three important types of these objects: RepositoryDirectory,
which represents directories; RepositoryLicensedFile, which
represents source files and resources that might end up in the binary;
and RepositoryLicenseFile, which represents license files that do not
themselves end up in the binary other than as a side-effect of this
script.

RepositoryDirectory objects contain three lists: the list of
RepositoryDirectory subdirectories, the list of RepositoryLicensedFile
children, and the list of RepositoryLicenseFile children.

RepositoryDirectory objects are the objects that crawl the filesystem.

While the script is pretty conservative (probably including more
licenses than strictly necessary), it tries to avoid including
material that isn't actually used. To do this, RepositoryDirectory
objects only crawl directories and files for which shouldRecurse
returns true. For example, shouldRecurse returns false for ".git"
files.
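
As a rough illustration of the tree and the shouldRecurse filter
described above, here is a minimal sketch. It is not the code in
main.dart: the class names match the ones above, but the field names,
the crawl method, the nested Map standing in for the disk, and the
"name starts with LICENSE" check are all assumptions made for the
example.

```dart
// Illustrative sketch only; the real classes in main.dart are far richer.
abstract class RepositoryEntry {
  RepositoryEntry(this.name);
  final String name;
}

class RepositoryLicensedFile extends RepositoryEntry {
  RepositoryLicensedFile(String name) : super(name);
}

class RepositoryLicenseFile extends RepositoryEntry {
  RepositoryLicenseFile(String name) : super(name);
}

class RepositoryDirectory extends RepositoryEntry {
  RepositoryDirectory(String name) : super(name);

  // The three lists described above (field names are assumptions).
  final List<RepositoryDirectory> subdirectories = [];
  final List<RepositoryLicensedFile> licensedFiles = [];
  final List<RepositoryLicenseFile> licenseFiles = [];

  // Assumption: only ".git" is skipped here; the real rules are longer.
  bool shouldRecurse(String entryName) => entryName != '.git';

  // Assumption: a nested Map stands in for the disk. A Map value is a
  // directory; any other value is a file.
  void crawl(Map<String, Object?> listing) {
    for (final entry in listing.entries) {
      if (!shouldRecurse(entry.key)) continue;
      final value = entry.value;
      if (value is Map<String, Object?>) {
        subdirectories.add(RepositoryDirectory(entry.key)..crawl(value));
      } else if (entry.key.toUpperCase().startsWith('LICENSE')) {
        licenseFiles.add(RepositoryLicenseFile(entry.key));
      } else {
        licensedFiles.add(RepositoryLicensedFile(entry.key));
      }
    }
  }
}

void main() {
  final root = RepositoryDirectory('root')
    ..crawl(<String, Object?>{
      '.git': <String, Object?>{'HEAD': null},
      'LICENSE': null,
      'src': <String, Object?>{'main.cc': null},
    });
  // ".git" is filtered out, so only "src" remains as a subdirectory.
  print(root.subdirectories.map((d) => d.name).toList());
  print(root.licenseFiles.map((f) => f.name).toList());
}
```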

Some directories and files require special handling, and have specific
subclasses of the above classes. To create the appropriate objects,
RepositoryDirectory calls createSubdirectory and createFile to create
the nodes of the tree.


The low-level handling of files is done by classes in filesystem.dart.
This code supports transparently crawling into archives (e.g. .jar
files), as well as handling UTF-8 vs latin1. It contains much magic,
including hard-coded file names, to handle distinguishing binary
files from text files, and so forth.

This code uses the cache described in cache.dart to try to avoid
having to reopen the same file many times in a row.


In the case of a binary file, the license is found by crawling around
the directory structure looking for a "default" license file. In the
case of text files, though, the file itself often mentions the
license, and therefore the file itself is inspected for copyright or
license text. This scanning is done by determineLicensesFor() in
licenses.dart.

This function uses patterns that are themselves in patterns.dart. In
that file we find all manner of long, complicated, and somewhat crazy
regular expressions. This is where you see quite how absurd this work
can actually be. It is left as an exercise to the reader to look for
the implications of many of the regular expressions; as one example,
though, consider the case of the pattern that matches the AFL/LGPL
dual license statement: there is one file in which the ZIP code for
the Free Software Foundation is off by one, for no clear reason,
leading to the pattern ending with "MA 0211[01]-1307, USA".
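
To see what that character class buys, here is a small sketch using
only the tail of the pattern quoted above. The variable name and the
sample strings are made up for the example; the full pattern in
patterns.dart is much longer and is not reproduced here.

```dart
// Only the tail of the pattern is quoted in the text above; the name
// and the sample inputs are hypothetical.
final RegExp fsfAddressTail = RegExp(r'MA 0211[01]-1307, USA');

void main() {
  const samples = <String>[
    'Boston, MA 02111-1307, USA',
    'Boston, MA 02110-1307, USA',
    'Boston, MA 02112-1307, USA',
  ];
  for (final sample in samples) {
    // [01] accepts both ZIP spellings but nothing else in that position.
    print('${fsfAddressTail.hasMatch(sample)}  $sample');
  }
}
```

The first two samples match and the third does not, which is the
point: one pattern covers both spellings without loosening the match
any further.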


The licenses.dart file also contains the License object, the currently
simplistic normalizer (_reformat) for license text (which mostly just
removes comment syntax), the code that attempts to determine which
copyrights apply to which licenses, and the code that attempts to
identify the licenses themselves (at a high level), to make sure that
appropriate clauses are followed (e.g. including the copyright with a
BSD notice).
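
To give a feel for what "removing comment syntax" means, here is a
minimal sketch. It is not the script's _reformat: the function name
stripCommentSyntax and the set of comment leaders it recognizes are
assumptions chosen for the example.

```dart
// Hypothetical helper, not the script's _reformat.
String stripCommentSyntax(String text) {
  // Assumption: only these comment leaders matter: //, /*, *, */ and #.
  final RegExp leader = RegExp(r'^\s*(//+|/\*+|\*+/?|#+)\s?');
  return text
      .split('\n')
      .map((line) => line.replaceFirst(leader, '').trimRight())
      .join('\n')
      .trim();
}

void main() {
  const header = '/* Copyright 2014 Example Authors\n'
      ' * All rights reserved.\n'
      ' */';
  print(stripCommentSyntax(header));
}
```

A normalizer along these lines lets the same license text compare
equal whether it was found in a C-style block comment, a run of "//"
lines, or a "#"-prefixed script header.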