Monday, October 29, 2018

My text scraper

I wrote a quick one. Like I say, find all tags and push them in the order found. Then sort the stack into a B-tree, the way a parser would, or the way we do Huffman. But, a big but, I am pruning big sections of that tree as we go along, so it shrinks down to a string of residual text. I dump format, script, and 'onclick' type tags.

The text scraping job is to toss the mechanics stuff dealing with browsers, which is almost all of the document tree. So, I have a stack of sequence starts and stops, nested, a compact graph. From the stack bottom working up I find the first enclosed sequence and split the stack into two children, except I am tossing more than half of the sequences discovered. My B-tree decomposes to rank zero and I am left with a string of residual text, with big gaps; text from lists, title, body, headers, meta words.

The plain text words, in approximately the order they appear in the document. Then I can be working that plain text with a myriad of cross-referencing word lists. Best search engine available, I should start a company. Send me ten bucks and I will send you one long C source file, drop it into any directory, no special setup, just type:
  gcc onebigfile.c

CutnPaste is the likely software distribution method, informally.

The text scraper will attach to the join, the web text scraper. Run two inputs: a LazyJ search and control graph, and the web scraping attachment, which treats the web like a directed graph of URL addresses. I reuse the dot sequence for URLs: site.subsite.subsite, except I have the comma operator and distributive property, so in LazyJ I have:

LazyJ:  masterSite.(Subsite2,Subsite2.Subsubit1,subsite3);

TextScraper sees the grammar as a series of fetch, step and skip operations from the join machine, and returns the residual text for each page in graph order according to the join grammar. From there I stack on a series of conditional joins with my myriad of word lists. It is all quite simple, anyone can do it; have a personal Watson.
And make money. Experts in a field have word lists in their head and texts, they just need to organize them a bit, from general to specific, like a directed graph. Then develop the sets of joining word lists needed to make an intelligent robot in their field.
