Ripping online dictionaries with Node.js, part 2: dynamic pages; connecting NW.js
The previous part covered the basic steps and related tasks of copying online dictionaries with Node.js. This part describes how to bring in a more serious tool for converting web sources above a certain level of complexity.
If we are processing one of the tag pages, building the headword is simpler: we save only the main title, prefixed with "# ". Thus the complete list of tags can be found under the headword "# Tags by Category", and the list of entries that share a tag under headwords such as "# acronyms and abbreviations".
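A minimal sketch of how such headwords might be assembled. The function name `buildHeadwords` and the shape of its arguments are assumptions made for illustration; they are not taken from the original script.

```javascript
// Hypothetical helper: returns the headword lines for one dictionary card.
function buildHeadwords(title, variants = [], isTagPage = false) {
  // Normalize whitespace that the page markup may have left in the text.
  const clean = (s) => s.replace(/\s+/g, ' ').trim();
  if (isTagPage) {
    // Tag pages get a single headword prefixed with "# ".
    return [`# ${clean(title)}`];
  }
  // Regular entries: main title first, then variants, without empty strings.
  const all = [title, ...variants].map(clean).filter(Boolean);
  // A Set keeps insertion order, so duplicates are dropped and order is preserved.
  return [...new Set(all)];
}

console.log(buildHeadwords('acronyms and abbreviations', [], true));
```

Keeping the prefix in one place makes it easy to change later if "# " turns out to sort badly in a particular dictionary shell.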
Next we extract and convert the main part of the entry. The list of tags collected at the preliminary stage let us sketch a rough processing plan and keep surprises to a minimum. The processing is still something of a risky compromise: part of the structure can be foreseen and relied on with confidence, while against the remaining surprises we can only take approximate precautions.
For our insertions and substitutions we will use the method
So, first we work with the large blocks of the entry, its macrostructure so to speak: we insert an empty line between block-level elements, so that after querying the property
Then we go further down the structure. We enclose quotations in quotation marks. We set out the list bullets. We fix occasional markup errors (for example, in one of the entries the pseudo-tag
After that we begin wrapping the relevant elements in DSL tags.
We mark headings and subheadings with color and bold type. We mark with italics and bold the elements that are emphasized on the page, either by the nature of the tags themselves or via CSS. Some of this we can anticipate (and query with selectors using tag names or class names), but some we have to find out on the fly: for this we query all the elements of
The next step: we mark up all superscripts and subscripts.
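The wrapping step can be pictured as a lookup from HTML tag names to DSL tags. This is only a sketch: the table `DSL_FOR_HTML` and the function `wrapInDsl` are hypothetical names, and the choice of HTML tags in the table is an assumption rather than the author's actual list.

```javascript
// Assumed mapping from inline HTML tags to DSL formatting tags.
const DSL_FOR_HTML = new Map([
  ['sup', 'sup'],
  ['sub', 'sub'],
  ['b', 'b'],
  ['strong', 'b'],
  ['i', 'i'],
  ['em', 'i'],
]);

// Surround the text with the matching DSL tag pair, or return it unchanged.
function wrapInDsl(htmlTag, text) {
  const dsl = DSL_FOR_HTML.get(htmlTag.toLowerCase());
  return dsl ? `[${dsl}]${text}[/${dsl}]` : text;
}

console.log(wrapInDsl('SUP', '2'));
```

In the program itself the DSL tags are inserted into the live document around the existing elements; the string version here only shows the mapping.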
Then we process the links: links that point to the current page itself are left as is; intrasite links become intra-dictionary references in DSL format; external links are written out as network URLs (here we try to keep the link's readable text in front of the address and mark it with underlining, since the DSL format does not allow hiding a network address behind link text and combining the two).
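The three cases can be sketched with the standard `URL` class. The function name `linkToDsl` is hypothetical, the example addresses in the usage line are placeholders, and the sketch assumes that the text of an intrasite link coincides with the headword of the target card, which may not always hold.

```javascript
// Hypothetical sketch: turn one link into DSL text, given the address of the page it is on.
function linkToDsl(href, text, pageUrl) {
  const page = new URL(pageUrl);
  const target = new URL(href, page);   // resolves relative addresses against the page
  target.hash = '';                     // ignore fragments when comparing addresses
  const current = new URL(page.href);
  current.hash = '';
  if (target.href === current.href) {
    return text;                        // self-link: keep the plain text
  }
  if (target.host === page.host) {
    return `[ref]${text}[/ref]`;        // intrasite link: dictionary cross-reference
  }
  // External link: underlined readable text first, then the address itself.
  return `[u]${text}[/u] [url]${target.href}[/url]`;
}

console.log(linkToDsl('http://example.com/a', 'source', 'http://dict.example/words/foo'));
```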
We also replace images with external links to them (the images are fairly large, so we did not embed them in the dictionary; if necessary, the user can follow the address to the website).
Finally, we insert additional indentation to set off the citations (that is, the usage examples for each neologism).
As a result we get two in one: the fully preserved HTML markup and, inside it, the DSL markup, with no conflict between them. And when we query the property
We write the entry to the dictionary file after the list of headwords, not forgetting to apply the final processing function
An NW.js program, as already mentioned, creates its own subfolder in the system's user data folder, somewhat like a browser profile folder. Presumably it stores, among other things, cache files that speed up loading of similar pages. In my case this subfolder took up 138 megabytes after the dictionary was saved. Once you are done, it can be safely deleted: the program will create it again automatically the next time it runs (though the first start will take a little longer).
The script saved the dictionary in about an hour and a half, processing almost three and a half thousand pages with all their resources in that time, without loading the processor noticeably. Memory consumption and the volume of reads and writes by the end of the run can be judged from this screenshot.
The dictionary (reflecting the state of the site as of 16.02.2016) is posted on rghost.net and drive.google.com. The archive includes the DSL sources in UTF-8 and UTF-16, as well as compiled LSD dictionaries for the last three versions of ABBYY Lingvo. Headwords: 5827; cards: 3419; usage examples: 9311.
Thank you for your attention.
Article based on information from habrahabr.ru
I. Why do we need NW.js?
1. The more complex the structure of a dictionary's web pages, the more reason to rely on the full range of capabilities that a mature browser engine provides. JSDOM is a fairly advanced library, but even it cannot compare with the complete toolset of Chromium.
2. The people who create and convert digital dictionaries are largely humanities scholars whom fate has brought into the sphere of IT. They are sometimes more comfortable working with a GUI than with a CLI, especially if they do not write the utilities themselves but use what their colleagues have developed. NW.js offers simple ways to build basic GUI applications for analysing, processing and converting web pages.
As an example for a brief description of this tool I chose www.wordspy.com. Word Spy is a regularly updated dictionary of English neologisms that have already become part of the language: that is, words that were not coined once and used by their authors for private purposes (such words are called "occasionalisms"), but have shown up in several print and online sources of different origin. Compared with Urban Dictionary, which illustrated the first article, Word Spy differs in two important ways: the content of its pages is built asynchronously by scripts, and the structure of those pages is fairly complex and largely unpredictable (whereas Urban Dictionary entries used a very small set of tags, in a uniform order and combination). This was the deciding reason to turn to NW.js.
I do not plan to repeat here parts of the official documentation, which is already fairly complete and systematic. If you are not familiar with NW.js, it is best to start there (after that you can browse the wiki pages on GitHub: although many of them are obsolete, they still contain interesting things not mentioned in the main documentation). I will confine myself to notes on applying the project to the chosen task.
II. Preparatory stage
1. Getting the list of entry addresses
On the whole, the first preparatory script will largely resemble the program from the first article. We will not even bring in NW.js yet, because we only need to find the necessary links on the pages, and JSDOM copes with this perfectly well.
I will outline only the essential differences.
a. At the moment the page has loaded and the function responding to this event receives the window and document objects, the content of the page is not ready yet, so we need to introduce an additional checking loop (because the page is filled in by asynchronous scripts, tracking load events gives us nothing; we could attach handlers to DOM change events, but in this situation that seems an unnecessary complication). Having analyzed how the site's script works, we find a significant element on the page whose presence means that the required structure has been built (in this case, the block with the list of links to dictionary entries). We define the selector of this element alongside the familiar variables in the initial code block (selectorsToCheck; for the future case where different pages need different test elements, we make this variable an array). The second addition is the number of milliseconds that sets how often the key element is checked for (checkFrequency).
b. Word Spy has a convenient two-level table of contents for the whole dictionary: 1) a list of all tags, split into several groups; 2) the link for each tag opens a list of all the words that carry this tag. We will add to our dictionary both the top-level list of tags and all the per-tag word lists. To do this, we add the starting page with the tags to our initial array of addresses (tocURLs), which will be the source of the entry list. Also, unlike the script from the first article, where this array was called abc, we make it a list of URLs right away and do not generate them on the fly from the alphabet, since the tag addresses do not fit a single URL pattern.
c. One thing in the present task is simpler: the Word Spy dictionary is orders of magnitude smaller than Urban Dictionary, so both its address lists and its entries are single-page. We do not have to check for multi-page continuations either in this script or in the dictionary-saving script, which simplifies both URL construction and the corresponding parts of the code.
d. In the function getDoc, the jsdom.env request is slightly different: Urban Dictionary was a static dictionary, whereas here we have to require that the scripts on the pages be loaded and executed, which is reflected in the request options.
e. Because our code now contains one more asynchronous step, we split the old function processDoc in two: in checkDoc we check for possible errors and for the site scripts having finished, and we hand the processing of the finished document over to the deferred function processDoc. The checking loop runs for a certain number of iterations (say, until 5 seconds have passed). If the test element appears during this time, we move on to the document-processing function. If the element is still absent after the timeout, we check whether a redirect has occurred: if not, we can suspect a hitch on the server and repeat the request; if the server has redirected us somewhere, all that remains is to warn the user and stop the program for the time being. Experience showed that the site script usually needed 100-400 milliseconds; sometimes the delay reached several seconds, and only occasionally did it exceed the timeout (in such cases a single repeat request was enough).
f. Processing the finished pages and extracting the required links does not differ significantly from what was described in the first article, except that, while collecting entry addresses from the thematic tag lists, we also add the addresses of these lists themselves to the overall list of URLs, so that they are saved as well for easier navigation in the future dictionary.
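The checking loop from items a and e can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: `decideNextStep`, `pollUntilReady`, `currentUrl`, the selector value and the `timeout` constant are hypothetical, while `selectorsToCheck` and `checkFrequency` are the names mentioned in the text. The page is represented by any object with a `querySelector` method, so the sketch runs without JSDOM.

```javascript
// Hypothetical sketch of the readiness check; values are assumed.
const selectorsToCheck = ['#content a']; // assumed selector of the key element
const checkFrequency = 100;              // ms between checks
const timeout = 5000;                    // ms before giving up

// Pure decision step: what to do after one look at the page.
function decideNextStep({ found, elapsed, redirected }) {
  if (found) return 'process';           // key element is present: go on to processDoc
  if (elapsed < timeout) return 'wait';  // site scripts may still be running
  return redirected ? 'abort' : 'retry'; // redirect: warn and stop; otherwise re-request
}

// Polling loop: looks at the document every checkFrequency ms until a decision is final.
function pollUntilReady(doc, requestedUrl, currentUrl, onDone, started = Date.now()) {
  const found = selectorsToCheck.some((sel) => doc.querySelector(sel) !== null);
  const step = decideNextStep({
    found,
    elapsed: Date.now() - started,
    redirected: requestedUrl !== currentUrl(),
  });
  if (step === 'wait') {
    setTimeout(
      () => pollUntilReady(doc, requestedUrl, currentUrl, onDone, started),
      checkFrequency
    );
  } else {
    onDone(step); // 'process', 'retry' or 'abort'
  }
}
```

Separating the decision from the timer keeps the branching logic easy to test without waiting for real timeouts.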
2. The list of tags
Because the page structure of this dictionary is complex and predictable only up to a point, we will add a preliminary pass before saving: a run over all the relevant pages that collects information about the kinds and frequency of the tags used. For this we create a script much like the dictionary-saving script, except that for now it extracts only very simple information (so we can still make do with JSDOM).
This script can be called a partial hybrid of the script familiar to us from saving Urban Dictionary and the script described just above: it reads the list of addresses (from the beginning, or from the point where it stopped and which it recorded in a special log before stopping), loads each page, runs its scripts and waits until they have built all the essential content of the entry. I will note only a few new details.
a. When saving the dictionary we created three files: the dictionary source, a progress log recording the saved addresses, and an error log. Here we need only two files: we will record the tags in such a way that the tag file can, if necessary, also serve as the log of work done, for resuming after an interruption.
b. The array of key selectors selectorsToCheck now contains two elements: one for ordinary dictionary pages and one for pages with a list of tags (or of words sharing a tag).
c. So as not to overload the analysis with superfluous information, we identify, from a rough preliminary estimate, some elements that will not go into the dictionary and whose tags therefore need not be examined. We store the selectors of these elements in the variable selectorsToDelete and remove the elements before parsing starts.
d. The analysis of each page consists of extracting all the tags from the element of interest, registering their names in the aggregate object tags (incrementing the statistics for each tag as we go), and writing the page address and the list of its tags to the file. At the end of the script the final tags object is written to the file. We thus obtain both the overall tag statistics and the distribution of tags across pages, which lets us see examples of a tag's use by opening any of the addresses under which the tag is recorded. If the script was interrupted, we can rebuild the statistics object tags from the information already written to the file. These two similar processes, reading pages and reading extracts from the log, appear in the two corresponding places in the script: in the initial part (below the line console.log('Reading the tag file...');) and in processDoc.
The rest of the program code contains nothing unfamiliar.
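The idea of a tag file that doubles as a resume log can be sketched as follows. The log format (one JSON object per line) and all function names here are assumptions chosen for the illustration; the original script may store the data differently.

```javascript
// Count how many times each tag name occurs on one page.
function countTags(tagNames) {
  const counts = {};
  for (const name of tagNames) {
    const key = name.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Merge the counts of one page into the aggregate statistics object.
function addToStats(stats, url, counts) {
  for (const [name, n] of Object.entries(counts)) {
    if (!stats[name]) stats[name] = { total: 0, pages: [] };
    stats[name].total += n;      // overall frequency of the tag
    stats[name].pages.push(url); // pages where the tag can be inspected
  }
}

// One log line per page (assumed format: a JSON object with url and counts).
function toLogLine(url, counts) {
  return JSON.stringify({ url, counts });
}

// After an interruption, rebuild the statistics from the lines already written.
function restoreStats(logText) {
  const stats = {};
  for (const line of logText.split('\n')) {
    if (!line.trim()) continue;  // skip blank lines
    const { url, counts } = JSON.parse(line);
    addToStats(stats, url, counts);
  }
  return stats;
}
```

Because the same `addToStats` is used for live pages and for log lines, the restored object matches what the interrupted run had accumulated.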
III. Saving the dictionary
An NW.js program consists of at least two files: a service file in JSON format that describes the basic parameters of the program, and an HTML page that describes the GUI and contains the scripts. The scripts can also be placed in a separate file (or files) and referenced by a local or network address.
1. package.json
Here is the minimal content of our service file:
{
  "name": "NW.WordSpy.get_dic",
  "main": "WordSpy.get_dic.html",
  "window": {
    "title": "Save WordSpy.com"
  }
}
On first launch, an NW.js program creates its own subfolder in the system's user data folder, and the subfolder's name is derived from the program name in the name field.
The main field contains the path to the main file with the GUI elements and the main script.
The optional window subkey contains the parameters of the window to be created; we restrict ourselves to the title.
For more on the format and components of the service file, see the documentation.
2. WordSpy.get_dic.html
The window of our program will be relatively simple and in some ways will even resemble a console application.
In the head section of the markup, besides the required minimum, you can add a custom CSS block. Here it is purely illustrative, and we will not dwell on it.
The first elements of our GUI are two fields for the parameters previously set with command-line keys: one for choosing the file with the addresses of dictionary pages (previously we passed the folder containing the input file and set the file name in the code, to keep the key shorter; this is no longer necessary, and we can select the file directly) and one for the directory in which to create the output files: the dictionary, the list of saved pages and the error log. More on the special features of file fields in NW.js can be found in its documentation.
Next comes a button that triggers the main action of the program, followed by an output field. With CSS and a few script tweaks we will make it look like console output, to keep continuity with the familiar console versions of our scripts.
The invisible audio element serves to attract the user's attention; previously we used a console player for this. The address of the sound file can be anything; I used one of the system files from the standard sound scheme.
Finally, the last element plays the key role of the "browser": into this embedded frame we will load our pages for analysis and extraction. The special features of frames in NW.js and some related precautions are also covered in its documentation.
How the program looks at the beginning and at the end of the dictionary-saving process can be judged from the screenshots:


For convenience, we put the scripting part of the program into a separate file. It is best to reference it at the end of the page, so that the script, once started, can find all the essential elements of the window and begin interacting with them.
3. WordSpy.get_dic.js
In the comments I will try to focus on the differences and innovations compared with the console script from the previous article, since the basic structure and many sections of the code are shared.
a. The first difference appears right at the start of the introductory part. We create variables for the window and document of the program itself (so that they are not confused with the window and document of the loaded pages), and then one for each GUI element. Since the file paths will be set dynamically (not once and for all on the command line, but in response to user actions), we store them as mutable properties of the object io rather than as a scattered set of constants. Another difference is that there are several sets of selectors with different purposes, for more convenient handling of the complex document structure (we already know them from the preceding tag-analysis script). Finally, because interactivity has increased, at the end of the introductory part we create a few flag variables for the current state of the program and for user commands.
b. With a GUI, the temptation to close the window at the wrong moment is greater than with the console. We will therefore build a somewhat more extensive system of safeguards against abnormal termination. To begin with, we assign the function onExit() as the handler for closing the window; what it does is described later.
c. As the documentation shows, the standard HTML5 precaution remains in force: we cannot set the target file paths through attributes or properties of our file fields; this can only be done through user action in the dialog box. But we can save time and effort by remembering the path to the folder in which the user is first offered to choose a file (if what is being chosen is a folder, the initial and final paths will coincide exactly). For this we use one more service file in JSON format, config.json, which stores an object with two properties, one per path. At startup the program checks for this file: if it exists, the program reads its contents into the object config and writes the required paths to the nwworkingdir properties of both fields. If the file does not exist, the object stays empty and the initial directory is determined in the usual browser way.
d. After checking the saved configuration file, we set event handlers for all the interactive elements and invoke the first of them explicitly, to bring the elements into the correct initial state.
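The handling of config.json described in item c might look roughly like this. The function names and the property names `inDir` and `outDir` are assumptions made for the sketch; only config.json, config and nwworkingdir are named in the text. The file fields are represented by plain objects so that the example also runs outside NW.js.

```javascript
const fs = require('fs');

// Read the remembered folders; an absent or unreadable file yields an empty object.
function loadConfig(file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch (err) {
    return {};
  }
}

// Store the folders chosen by the user for the next launch.
function saveConfig(file, config) {
  fs.writeFileSync(file, JSON.stringify(config, null, 2), 'utf8');
}

// Point both file fields at the remembered folders (inDir/outDir are assumed names).
function applyConfig(config, inputField, outputField) {
  if (config.inDir) inputField.nwworkingdir = config.inDir;
  if (config.outDir) outputField.nwworkingdir = config.outDir;
}
```

With an empty object nothing is assigned, so the dialogs open wherever the browser engine would open them by default.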
D. the Function
checkDirs()
checks the definition of all the right ways: if at least one of them is not defined, it displays a message in the information unit, otherwise, writes the data to a file for persistent settings, and unlocks with push button start the main process.E. the Function
onStop()
responds to the command interrupt the main process: it merely translates the indicator this command to the on position, so that the process could then be interrupted at a convenient moment.E. the Function
onExit()
responds to an attempt to close the window. If that's the time to save a dictionary, it asks security question. When confirmation indicators are abort and exit the program transferred to the on position for follow-up at a convenient time. If the user confirms the action, it is ignored. If the save is not made, the program closes, no questions asked.well. function
setSpeedInfo()
significant change affected only the audio signal. I left the refresh rate and the format of the information about productivity at the same level (once an hour), but if necessary they can be corrected (after all, Urban Dictionary persisted for many days, and Word Spy — about a half hour, so that the frequency conversion units can be increased to minutes).s. Function
updateInfo(str)
is responsible for the assimilation of the information block of the console. We set the buffer size to 10 rows and chop off the extra lines first (where the oldest information), scroll unit to the last line. Through this function we output constantly current information in the process of saving. For small dictionaries this behavior can be turned off (then will continue the whole Protocol, rip), but in the long process such limitations save memory and remove redundancy (especially since everything you need is written in the logs).I. the Function
logError(evt)
is called to respond to an event error
inside the inline frame window. I have it has never worked.JJ. Function
secureLow(str)
is a low-level text processing downloaded pages to bring it to the requirements of the DSL and for escaping. Whereas secureHigh
used to process blocks of text (removing extra spaces, insert the padding before the body of the dictionary entries DSL, paste special to preserve blank lines). In the console version of the first article, we have treated one function, but here our procedure for the extraction and presentation of information will change somewhat, and we have this treatment share.K.
saveDic()
— the main function of the program to be run when you click on the save button of the dictionary. It largely meets the initial procedural part of our console script from the first article, but there are some differences. First, we include a variable-indicator the save process and change the appearance and behavior of the home button: now she will be responsible for the interruption of the process. Disconnecting fulfilled its role as a file field. Then we're already familiar with file handling: check a list of addresses, create a blank dictionary and reports, read the list of addresses, read the information about the already saved pages if present in the log for the conservation and where necessary reducing the time to finally begin a series of conservation, requesting the first page in the list. New on this piece of code is to set the event handlers load
and error
for Windows built-in frame is required for the operation of our loop.l
getDoc(url)
— the starting link in the circular chain of preservation. This function we call at the beginning of the cycle and after treatment of each page. It begins with the verification of the interrupt indicator: if it was included, the cycle stops and starts a stop process. If it is turned off after the familiar operations we are changing the address of the frame, causing it to load a new page.m the Function
checkDoc()
is called automatically once a page has fully loaded in our browser. It is partly familiar from the earlier scripts in this article. Only now we begin by creating variables that help us not to confuse the main objects of the program window with those of the loaded page's window. The familiar cycle of checks on the readiness of the page content follows. Depending on the results, we either move on to processing the information, reload the page, or finish the job with a message about an unknown error.
n. Function
processDoc(iWin, iDoc, iLoc, iter)
contains the extraction, processing and saving of a page's dictionary data. It differs most from the corresponding console code, both because the dictionaries differ and because of the features of the new tool.
We start by removing the unwanted parts of the dictionary entry. Then we locate its key element and get a list of all its text nodes (XPath lets us obtain exactly the final text nodes, without nested HTML elements, so we can modify their contents without risk of damaging the structure of the document), and run all of them through the low-level cleaning mentioned above. This way special characters are escaped throughout the text of the entry, and DSL tags can later be added on top of this text painlessly.
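The text-node step can be sketched roughly as follows. This is only an illustration under assumptions: escapeDsl is a hypothetical stand-in for secureLow, the set of characters it escapes is my guess at what DSL treats as special, and escapeTextNodes is an invented helper name. The document object is passed in as a parameter so the function can be used with the frame's document.

```javascript
// Characters assumed to be special in DSL source (an assumption, adjust as needed).
const DSL_SPECIAL = /[\\\[\]{}()~@^#]/g;

// Hypothetical stand-in for secureLow(str): backslash-escape special characters.
function escapeDsl(str) {
  return str.replace(DSL_SPECIAL, '\\$&');
}

// Escape every text node under `root`, leaving the element structure untouched.
// Returns the number of text nodes processed.
function escapeTextNodes(doc, root) {
  const ORDERED_NODE_SNAPSHOT_TYPE = 7; // XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const nodes = doc.evaluate('.//text()', root, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < nodes.snapshotLength; i++) {
    const node = nodes.snapshotItem(i);
    node.nodeValue = escapeDsl(node.nodeValue);
  }
  return nodes.snapshotLength;
}
```

A snapshot result is used rather than an iterator because the nodes are modified while we walk them, and an XPath iterator becomes invalid once the document changes.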
Then we build the headword of the future dictionary entry.
If this is a regular dictionary page, we extract the main headword from the entry and then add alternative spellings, forms, and derivatives to it (we considered this necessary because the dictionary shell may have trouble modeling the morphology of neologisms, so it is better to do this work in the dictionary itself). Here we first meet one of the main reasons for bringing in NW.js: the property
innerText
. It was not available in JSDOM (explanation); the library offered only the property textContent
, with which it is very difficult to extract text from complex elements (because of the mixing of markup (HTML) text and displayed text). The property innerText
gives us the confidence we need on complex pages: whatever the structure of a dictionary entry or its parts, we get the same readable text that we would get by copying it from the page window through the clipboard with the system tools. The same property lets us temporarily exclude extraneous text before extraction (for example, we remove grammatical labels before placing word forms in the headword list): we only need to hide the unneeded elements, and their text does not appear in the value of the property (afterwards we turn their display back on, and the labels remain part of the entry body).
If we are processing one of the tag pages, the task is simpler: we save only the main heading, prefixed with "# ". Thus the overall list of tags can be found under the headword "# Tags by Category", and a list of headwords sharing a tag under headwords such as "# acronyms and abbreviations", and so on.
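The trick of hiding elements before reading innerText might look like this. It is a minimal sketch, not the program's actual code: visibleTextWithout is a hypothetical helper name, and the selector passed to it (for instance a class used for grammatical labels) is an assumption about the page markup.

```javascript
// Return the rendered text of `root` while the elements matching `selector`
// are temporarily hidden; their original display values are restored afterwards.
function visibleTextWithout(root, selector) {
  const hidden = Array.from(root.querySelectorAll(selector));
  const saved = hidden.map(el => el.style.display);
  hidden.forEach(el => { el.style.display = 'none'; });
  try {
    // innerText reflects rendering, so hidden elements contribute no text.
    return root.innerText.trim();
  } finally {
    hidden.forEach((el, i) => { el.style.display = saved[i]; });
  }
}
```

Because the display values are restored in a finally block, the hidden labels are still there when the body of the entry is extracted later.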
Next, we start extracting and converting the main part of the entry. The list of tags obtained at the preliminary stage let us sketch out a rough processing plan, so we can keep surprises to a minimum. This processing is a somewhat risky compromise: part of the structure can be foreseen and relied upon with confidence, while against the rest, the surprises, we can only roughly insure ourselves.
For our insertions and substitutions we will use the method
insertAdjacentHTML()
, because it is the gentlest with respect to the structure of the markup.
So, first we work with the large blocks of the entry, with its macrostructure, so to speak: we insert an empty line between block-level elements, so that after querying the property
innerText
we get more readable text; we represent the hr
element with a pseudo-line of characters; we replace inline frames (videos or tweets, for example) with a notice inviting the reader to view them on the website.
Then we move further down the structure. We enclose quotations in quotation marks. We add list bullets. We fix occasional markup errors (for example, in one of the entries the pseudo-tags
smirk
and flame
accidentally became part of the code in the markup of a quotation and disappeared from the displayed text). We insert into the text the content that is added to the page by means of CSS
and is therefore not included in the property innerText
. After that we begin wrapping the appropriate elements in DSL tags.
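A couple of the insertions described above can be sketched with insertAdjacentHTML. The selectors, the quotation mark characters, and the length of the pseudo-line are all assumptions made for the illustration, and wrapContent and prepareMacroStructure are hypothetical names.

```javascript
// Add literal text at the start and end of an element without touching its children.
function wrapContent(el, before, after) {
  el.insertAdjacentHTML('afterbegin', before);
  el.insertAdjacentHTML('beforeend', after);
}

function prepareMacroStructure(root) {
  // Quotations get visible quotation marks.
  Array.from(root.querySelectorAll('q'))
    .forEach(q => wrapContent(q, '\u201C', '\u201D'));
  // An hr has no text of its own, so a line of characters stands in for it.
  Array.from(root.querySelectorAll('hr'))
    .forEach(hr => hr.insertAdjacentHTML('afterend', '<div>' + '-'.repeat(20) + '</div>'));
}
```

Since insertAdjacentHTML only adds nodes next to or inside the target, the existing elements and their order stay intact.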
We highlight headings and subheadings with color and bold type. We highlight with italics and bold type the elements that stand out on the page either by virtue of the tags themselves or through CSS. Some of this we can foresee (and query with selectors that use tag names or classes), but some of it we have to work out on the fly: for this we request all the elements
span
within the entry, check their computed style parameters, and add tags where needed. We try not to nest identical tags inside each other (unlike HTML, DSL does not allow this): to do so, we mark the processed elements with a special attribute and then check for its presence up the DOM tree.
The next step is to mark up all superscripts and subscripts.
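The on-the-fly check could look roughly like this for bold text. This is a sketch under assumptions: the attribute name data-dsl-b and the function name tagBoldSpans are invented here, and the threshold of 600 for "bold" is my own choice.

```javascript
// Wrap bold <span> elements in DSL [b] tags, skipping spans nested in an
// element that has already been tagged. Returns the number of spans tagged.
function tagBoldSpans(win, root) {
  let tagged = 0;
  // querySelectorAll returns elements in document order, so ancestors come first.
  for (const span of Array.from(root.querySelectorAll('span'))) {
    const weight = win.getComputedStyle(span).fontWeight;
    const isBold = weight === 'bold' || parseInt(weight, 10) >= 600;
    if (!isBold || span.closest('[data-dsl-b]')) continue;
    span.setAttribute('data-dsl-b', '');
    span.insertAdjacentHTML('afterbegin', '[b]');
    span.insertAdjacentHTML('beforeend', '[/b]');
    tagged++;
  }
  return tagged;
}
```

The window of the loaded page is passed in explicitly, because the computed style has to be requested from the frame's window rather than from the program's own window.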
Then we process the links: self-links to the current page are left as plain text; intra-site links are turned into intra-dictionary links in DSL format; external links are written out as web URLs (here we try to keep the readable text of the link in front of the address and mark it with underlining, since the DSL format does not allow hiding a web address behind link text and combining the two).
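The three cases can be sketched as one small function. It is an illustration only: linkToDsl is a hypothetical name, the example assumes that the text of an intra-site link equals the headword of the target entry, and it expects the text to be escaped already.

```javascript
// Convert one link into DSL text, given its href, its readable text
// and the address of the page it appears on.
function linkToDsl(href, text, pageUrl) {
  const page = new URL(pageUrl);
  const target = new URL(href, page);
  const withoutHash = u => u.origin + u.pathname + u.search;
  // A link to the current page carries no information: keep only the text.
  if (withoutHash(target) === withoutHash(page)) return text;
  // A link within the site becomes a reference to another dictionary card.
  if (target.origin === page.origin) return '[ref]' + text + '[/ref]';
  // An external link: underlined text followed by the address itself.
  return '[u]' + text + '[/u] [url]' + target.href + '[/url]';
}
```

For example, with a page address of https://dictionary.example/words/foo, a link to /words/bar with the text "bar" would come out as [ref]bar[/ref].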
We also replace images with external links to them (the images are rather large, so we do not embed them in the dictionary; if necessary, the user can follow the address to the website).
Finally, we insert extra indentation to set off the citations (that is, the usage examples for each neologism).
As a result, we get two in one: the fully preserved HTML markup and, inside it, the DSL markup, with no conflict between them. And when we query the property
innerText
, instead of HTML we get only readable structured text wrapped in DSL tags, ready to be saved immediately as dictionary source.
We write it to the dictionary file after the list of headwords, not forgetting to pass it through the final processing function
secureHigh
. Then we update the log and the information block in the program window (we decided to add debugging information about how quickly the asynchronous scripts build the content of the page), clear the list of headwords before the next iteration, check the array of addresses, and either request the next page or move on to ending the cycle.
o. Function
endSaving()
is called at the end of the save cycle, when the user interrupts it, or after an error. In it we close the file descriptors, clear the path and I/O variables, remove the event handlers that are no longer needed, and restore the original appearance of the interface elements. If the exit flag is set, at the end of the function we forcibly close the main window.
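The circular chain formed by getDoc, checkDoc and endSaving can be sketched in a simplified form. This is not the program's code: createSaver, processPage and onEnd are hypothetical names, the handlers are attached directly to a frame object passed in as a parameter, and the readiness checks and file operations are left out.

```javascript
// A simplified save loop: request a page, process it when it has loaded,
// then request the next one until the list is empty or the user interrupts.
function createSaver(frame, urls, processPage, onEnd) {
  const queue = urls.slice();
  const state = { interrupted: false, current: null, done: [] };

  function endSaving(reason) {
    frame.onload = null;   // handlers are no longer needed
    frame.onerror = null;
    onEnd(reason, state.done);
  }

  function getDoc() {
    if (state.interrupted) return endSaving('interrupted');
    if (queue.length === 0) return endSaving('finished');
    state.current = queue.shift();
    frame.src = state.current; // the load event follows once the page is ready
  }

  function checkDoc() {
    processPage(state.current);
    state.done.push(state.current);
    getDoc(); // close the circle
  }

  return {
    start() {
      frame.onload = checkDoc;
      frame.onerror = () => endSaving('error');
      getDoc();
    },
    interrupt() { state.interrupted = true; }
  };
}
```

The interrupt flag is checked only in getDoc, so the page being processed is always finished before the cycle stops.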
4. Resources
An NW.js program, as already mentioned, creates its own subfolder in the system's user data folder, something like a browser profile folder. We can assume that, among other things, it stores cached files there to speed up the loading of similar pages. In my case this subfolder took up 138 megabytes after the dictionary had been saved. Afterwards it can be safely deleted: the next time the program starts it will be created again automatically (although the first start will take a little longer).
The script saved the dictionary in an hour and a half, processing almost three and a half thousand pages with all their resources in that time, without loading the processor significantly. The memory consumption and the volume of reads and writes by the end of the run can be judged from this screenshot.
The dictionary (reflecting the state of the site as of 16.02.2016) is posted on rghost.net and drive.google.com. The archive includes the DSL source in UTF-8 and UTF-16, and compiled LSD dictionaries for the last three versions of ABBYY Lingvo. Headwords: 5827; cards: 3419; usage examples: 9311.
Thank you for your attention.