I wrote a quick one.
Like I say, find all tags and push them in the order found. Then sort the stack into a B-tree, the way a parser would, or the way we do Huffman. But, and it is a big but, I am pruning big sections of that tree as we go along, so it shrinks down to a string of residual text. I dump format, script, and 'onclick'-type tags.
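A minimal sketch of that pruning test, in C. The tag list and the function name `keep_tag` are my own illustration, not the author's actual code; the idea is just that a tag either survives or takes its whole subtree out with it.

```c
#include <string.h>

/* Hypothetical prune list: these tags, plus anything carrying an
   'onclick'-style handler, get dumped along with their subtrees. */
static const char *dropped[] = { "script", "style", "head", "form", NULL };

int keep_tag(const char *name, const char *attrs)
{
    for (int i = 0; dropped[i]; i++)
        if (strcmp(name, dropped[i]) == 0)
            return 0;                      /* formatting/script machinery */
    if (attrs && strstr(attrs, "onclick"))
        return 0;                          /* interactive widget, toss it */
    return 1;                              /* residual text candidate */
}
```

Any tag that fails the test never makes it into the tree, which is what shrinks the stack so fast.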
The text-scraping job is to toss the mechanical stuff dealing with browsers, which is almost all of the document tree.
So, I have a stack of sequence starts and stops, nested, a compact graph. From the stack bottom working up, I find the first enclosed sequence and split the stack into two children, except I am tossing more than half of the sequences discovered. My B-tree decomposes to rank zero and I am left with a string of residual text, with big gaps; text from lists, title, body, headers, meta words.
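The bottom-up pass can be sketched as a single walk over the start/stop stack. This is my own illustration, assuming a flat array of open/text/close events; a depth counter tracks how deep we are inside a pruned subtree, so whole sections fall out in one pass and only the residual text survives.

```c
#include <string.h>

enum { OPEN, TEXT, CLOSE };
struct ev { int kind; const char *s; };

/* Assumed prune rule for the sketch; see the tag discussion above. */
static int prune(const char *tag)
{
    return strcmp(tag, "script") == 0 || strcmp(tag, "style") == 0;
}

/* Walk the event stack bottom-up, appending only text that is not
   enclosed by any pruned sequence. */
void residual(const struct ev *e, int n, char *out, size_t cap)
{
    int dead = 0;                  /* nesting depth inside a pruned subtree */
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        if (e[i].kind == OPEN && (dead || prune(e[i].s)))
            dead++;
        else if (e[i].kind == CLOSE && dead)
            dead--;
        else if (e[i].kind == TEXT && !dead) {
            if (out[0]) strncat(out, " ", cap - strlen(out) - 1);
            strncat(out, e[i].s, cap - strlen(out) - 1);
        }
    }
}
```

The gaps in the residual string are exactly where the dead counter was nonzero.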
The plain text words, in approximately the order they appear in the document. Then I can work that plain text with a myriad of cross-referencing word lists. Best search engine available; I should start a company. Send me ten bucks and I will send you one long C source file. Load it into any directory, do no special setup, just type:
gcc onebigfile.c
CutnPaste is the likely software distribution method, informally.
The text scraper will attach to the join, the web text scraper. Run two inputs: a LazyJ search-and-control graph, and the web-scraping attachment, which treats the web like a directed graph of URL addresses. I reuse the dot sequence for URLs: site.subsite.subsite, except I have the comma operator and the distributive property, so in LazyJ I have:
LazyJ: masterSite.(Subsite1,Subsite2.Subsubsite1,subsite3);
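A sketch of that distributive rule: the prefix distributes over the comma list, so masterSite.(a,b.c,d) expands to masterSite.a, masterSite.b.c, masterSite.d. The `expand` function below is my own illustration, assuming a single parenthesised group with no nesting.

```c
#include <stdio.h>
#include <string.h>

/* Distribute prefix over a comma-separated group: prefix.(a,b,...)
   becomes prefix.a, prefix.b, ...  Returns how many paths were made. */
int expand(const char *prefix, const char *group, char out[][128], int max)
{
    char buf[256];
    strncpy(buf, group, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    int n = 0;
    for (char *tok = strtok(buf, ","); tok && n < max;
         tok = strtok(NULL, ","))
        snprintf(out[n++], 128, "%s.%s", prefix, tok);
    return n;
}
```

Each expanded dot path names one page for the scraper to fetch, in graph order.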
TextScraper sees the grammar as a series of fetch, step, and skip operations from the join machine, and returns the residual text for each page in graph order, according to the join grammar. From there I stack on a series of conditional joins with my myriad of word lists. It is all quite simple, anyone can do it; have a personal Watson.
And make money. Experts in a field have word lists in their heads and texts; they just need to organize them a bit, from general to specific, like a directed graph. Then they develop the sets of joining word lists needed to make an intelligent robot in their field.
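The simplest form of such a word-list join can be sketched as a hit count: how many words from a field-specific list appear in a page's residual text. The function name `list_hits` and the sample list are my own illustration; chaining lists from general to specific is the conditional-join idea.

```c
#include <string.h>

/* Hypothetical word-list join: count how many entries from a
   field-specific list occur in the residual text. A conditional join
   would only descend to a more specific list when this count is high. */
int list_hits(const char *text, const char *list[], int nlist)
{
    int hits = 0;
    for (int i = 0; i < nlist; i++)
        if (strstr(text, list[i]))
            hits++;
    return hits;
}
```

Note this is substring matching for brevity; a real pass would tokenize the residual text first.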