Wednesday, October 31, 2018

Proof of concept

My HTML text file, with the actual plain text shown in bold:

<html> 

 htmltext starting
 <body>
 <font> Font text 

 <b> Bold text </>
 <h1> header is now <b> bold</></>
 </font>
 <script> Script text </script>
 <h1> Another header is now <h> header embedded</></>
 Body text
 </body>
 html ending
 </html>


Output from scraper:


htmltext starting

 Font text

 Bold text
 header is now  bold

 Script text
 Another header is now  header embedded
Body text

html ending
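A minimal sketch of such a scraper, in Python with only the standard library. The function name scrape and the (depth, offset, text) tuple shape are my illustration, not the actual implementation: each plain-text run is kept together with its nesting depth and an offset pointing back into the original source, which is the "stack" of pointers described below.

```python
import re

def scrape(html):
    """Strip tags, but keep each plain-text run with its nesting
    depth and its character offset back into the source text."""
    runs = []
    depth = 0
    pos = 0
    for m in re.finditer(r'<[^>]*>', html):
        raw = html[pos:m.start()]
        text = raw.strip()
        if text:
            # (nesting depth, pointer into source, plain text)
            runs.append((depth, pos + raw.index(text), text))
        tag = m.group()
        if tag.startswith('</'):          # closing tag: pop a level
            depth = max(0, depth - 1)
        elif not tag.endswith('/>'):      # opening tag: push a level
            depth += 1
        pos = m.end()
    tail = html[pos:].strip()
    if tail:
        runs.append((depth, pos + html[pos:].index(tail), tail))
    return runs
```

Running this over the sample file above yields the plain-text runs in source order, each carrying its depth, so the nested form is preserved rather than flattened to linear text.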


This is not linear text: the scraper retains the skip and step links, unlike plain text, which is comma-separated only. So the 'stack' is preserved; it holds all the plain-text pointers back into the source in their original nested form. Join will therefore skip and step through the original web page's plain text, and web authors know this. Authors will adopt a slight implicit grammar in their layout to take advantage of structure in the join.

Now the stack of pointers into the original source is an index in a graph database. The attachment treats the two as a skip and step base of plain text interspersed with ignored text. Applications can let users browse via skip and step, external to the HTML formatting. Users skip through plain-text groups; in each group key words are highlighted, and the user can select different key-word lists at will. Extremely efficient news and information gathering for the busy executive.
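A sketch of that browsing idea, assuming the scraper yields plain-text groups as (depth, offset, text) tuples; the function name skim is hypothetical. The user picks a key-word list, and only the groups that mention one of those words survive the skip:

```python
def skim(runs, keywords):
    """Skip through plain-text groups, keeping only those that
    mention a word from the user's chosen key-word list."""
    kw = {w.lower() for w in keywords}
    return [(depth, offset, text) for depth, offset, text in runs
            if kw & set(text.lower().split())]
```

Swapping in a different key-word list re-filters the same stack instantly, since the pointers back into the source never change; selectivity lives entirely in the word list.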

Join is inherently extremely smart, or at least as smart as the word lists are selective. Pruned, efficient word lists will be a big industry, the bots reviving the written word.
