# Scripts to help with CLDR → Markdown Conversion

Part of the [CLDR to Markdown Conversion Process](https://docs.google.com/document/d/1NoQX0zqSYqU4CUuNijTWKQaphE4SCuHl6Bej2C4mb58/edit?usp=sharing), aiming to automate steps 1-3.

NOTE: These scripts do not eliminate all manual work; images, tables, and a general review still have to be handled by hand.

## File 1: cleanup.py

Objective: this file corrects some of the common mistakes that appear when an HTML-to-Markdown converter is run on the Google Sites CLDR site. The list is not comprehensive and mistakes can still slip through, but it fixes the errors seen most consistently, particularly those produced by the Markdown converter used in pullFromCLDR.py. Most of the adjustments use regular expressions to find and replace specific text. The functions are as follows:

### Link Correction

- Removing redundant links, e.g. `[https://www.example.com](https://www.example.com)` → `https://www.example.com`
- Correcting relative links, e.g. `[index](/index)` → `[index](https://cldr.unicode.org/index)`
- Correcting Google redirect links, e.g. `[people](http://www.google.com/url?q=http%3A%2F%2Fcldr-smoke.unicode.org%2Fsmoketest%2Fv%23%2FUSER%2FPeople%2F20a49c6ad428d880&sa=D&sntz=1&usg=AOvVaw38fQLnn3h6kmmWDHk9xNEm)` → `[people](https://cldr-smoke.unicode.org/cldr-apps/v#/fr/People/20a49c6ad428d880)`
- Correcting regular redirect links
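
The first three corrections above can be sketched with one regular expression and the standard library. This is a minimal illustration, not the code in cleanup.py: the names `LINK`, `fix_link`, and `fix_links` are hypothetical, and the sketch does not follow regular redirects (that step needs a network request), so a Google redirect is only unwrapped to the URL stored in its `q` parameter.

```python
import re
from urllib.parse import urlparse, parse_qs

# Mirrors the constant described for cleanup.py.
RELATIVE_LINK_HEAD = "https://cldr.unicode.org"

# Matches a Markdown link: [text](url). Hypothetical pattern for illustration.
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")

def fix_link(match):
    text, url = match.group(1), match.group(2)
    # Unwrap Google redirect links: the real target is in the "q" parameter.
    parsed = urlparse(url)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            url = target[0]
    # Relative links get the site root placed in front of them.
    if url.startswith("/"):
        url = RELATIVE_LINK_HEAD + url
    # A link whose text is its own URL collapses to the bare URL.
    if text == url:
        return url
    return f"[{text}]({url})"

def fix_links(markdown):
    return LINK.sub(fix_link, markdown)
```

Calling `fix_links` on a whole page applies all three rules to every link in one pass; URLs that themselves contain parentheses are not handled by this simple pattern.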

### Common Formatting Issues

- Bullet points and numbered lists have extra spaces after the list marker
- Bullet points and numbered lists have extra blank lines between items
- Heading links get inserted alongside headings and need to be removed
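
As a rough sketch of the two list fixes above, the following uses two regular expressions. The function name `tidy_lists` is hypothetical and the patterns are assumptions about what the converter output looks like; heading-link removal is not covered because its exact form depends on the converter.

```python
import re

def tidy_lists(markdown):
    # Collapse extra spaces after a bullet or number marker: "*   item" -> "* item".
    markdown = re.sub(
        r"(?m)^([ \t]*(?:[-*+]|\d+\.))[ \t]{2,}", r"\1 ", markdown
    )
    # Drop blank lines that sit between two consecutive list items.
    markdown = re.sub(
        r"(?m)^([ \t]*(?:[-*+]|\d+\.) .*)\n(?:[ \t]*\n)+(?=[ \t]*(?:[-*+]|\d+\.) )",
        r"\1\n",
        markdown,
    )
    return markdown
```

The lookahead in the second pattern leaves the following list item unconsumed, so a run of several items separated by blank lines is tightened in a single pass.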

### Project-specific additions

- Every page has `--- title: PAGE TITLE ---` at the top of the Markdown file
- Every page has the Unicode copyright `![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)` at the bottom of the Markdown file
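
A minimal sketch of these two additions, assuming the title block is written as three-line YAML-style front matter (the README shows it on one line) and using a hypothetical helper name `add_page_wrappers`:

```python
COPYRIGHT = "![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)"

def add_page_wrappers(markdown, title):
    # Front matter with the page title goes first, the copyright image last.
    front_matter = f"---\ntitle: {title}\n---\n"
    return f"{front_matter}\n{markdown.strip()}\n\n{COPYRIGHT}\n"
```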

## File 2: pullFromCLDR.py

Objective: this file is used alongside cleanup.py to automate pulling the HTML and text from a given CLDR page. It uses libraries to retrieve the HTML and the plain text of the page, convert the HTML into Markdown, clean up the Markdown using cleanup.py, and create the .md file and the temporary .txt file in the CLDR site location. A few things to note:

- The nav bar and header are not relevant to each page for this conversion process, so only the HTML within `<div role="main" ...>` is pulled from the page
- To convert the HTML into raw text, the script parses the HTML and then separates the relevant tags with newlines, so the result looks like text copied and pasted from the page
- Unless line 12 of the file is modified, this only works with "https://cldr.unicode.org" pages
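
The first two notes can be illustrated with a small standard-library parser. The script itself uses BeautifulSoup, so this is a stand-in chosen to keep the example self-contained, not the script's code. The class name `MainTextExtractor`, the function `main_text`, and the set of tags treated as line breaks are all assumptions.

```python
from html.parser import HTMLParser

# Closing tags treated as line breaks; this set is an assumption.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "tr"}

class MainTextExtractor(HTMLParser):
    """Collect the text inside <div role="main">, one block per line."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 outside the main div, else <div> nesting depth
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("role") == "main":
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Tracking the `<div>` nesting depth is what lets the parser ignore the nav bar and footer while still handling nested `<div>` elements inside the main content.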

## Usage

### Installation

To run this code, you must have Python 3 installed. You also need the following Python libraries:

- BeautifulSoup (from `bs4`)
- markdownify
- requests

You can install them using pip:

```bash
pip install beautifulsoup4 markdownify requests
```

### Constants

Line 8 of cleanup.py should contain the URL that is prepended to all relative links (always https://cldr.unicode.org):
```python
#head to place at start of all relative links
RELATIVE_LINK_HEAD = "https://cldr.unicode.org"
```

Line 7 of pullFromCLDR.py should contain the local path of your cloned CLDR site; this is where the files will be stored:
```python
#LOCAL LOCATION OF CLDR
CLDR_SITE_LOCATION = "DIRECTORY TO CLDR LOCATION/docs/site"
```

### Running

Before running, ensure that the folders corresponding to the path of the page you are converting exist within your CLDR site directory, and that there is a folder named TEMP-TEXT-FILES.
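
One way to check this before running is a short helper like the one below. It is a sketch under two assumptions that the README does not confirm: that TEMP-TEXT-FILES sits directly under the site directory, and that a page at `/a/b/page` needs the folder `a/b` under the site directory. The name `missing_dirs` is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

def missing_dirs(site_root, page_url):
    """Return the expected folders that do not exist yet."""
    root = Path(site_root)
    # "/a/b/page" -> the page file is assumed to live in <root>/a/b
    page_path = urlparse(page_url).path.strip("/")
    wanted = [root / "TEMP-TEXT-FILES", root / Path(page_path).parent]
    return [d for d in wanted if not d.is_dir()]
```

An empty list means both folders are in place; otherwise the returned paths can be created by hand or with `Path.mkdir(parents=True)`.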

Run with:
```bash
python3 pullFromCLDR.py
```

You will then be prompted to enter the URL of the page you want to convert, after which the script will run.

If you would like to run the unit tests on cleanup.py, or use any of its functions individually, run:
```bash
python3 cleanup.py
```
85