A simple recursive algorithm to scrape plain text without the HTML parser. Descend the HTML document recursively, tag by tag. Work from a table of tags that hold acceptable text, and either dump the tag or emit the tag's contents. The whole key is identifying the begin and end tags, which takes only a simple state machine. The simplification is that we dump most of the document tree before attempting to build a tree, and just emit the text at a terminal node.
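To make the "simple state machine" idea concrete, here is a minimal sketch in Python of a scanner that only tracks whether it is inside a tag or inside text. The function name `tokenize` and the token shapes are my own choices for illustration, and it assumes tidy markup: no `>` inside attribute values, comments, or scripts.

```python
def tokenize(html):
    """Split markup into ('open', name), ('close', name) and ('text', s) tokens.

    Hypothetical helper: a two-state machine (in text / in tag).
    Assumes no '>' appears inside attribute values or comments.
    """
    tokens = []
    buf = []
    in_tag = False  # the only state we keep
    for ch in html:
        if not in_tag:
            if ch == "<":
                # leaving text state: flush any non-blank text
                text = "".join(buf).strip()
                if text:
                    tokens.append(("text", text))
                buf = []
                in_tag = True
            else:
                buf.append(ch)
        elif ch == ">":
            # leaving tag state: classify the tag body
            body = "".join(buf).strip()
            buf = []
            in_tag = False
            if body.startswith("/"):
                tokens.append(("close", body[1:].strip().lower()))
            elif body and body[0].isalpha():
                name = body.rstrip("/").split()[0].lower()
                tokens.append(("open", name))
            # anything else (doctype, comments) is simply dropped
        else:
            buf.append(ch)
    tail = "".join(buf).strip()
    if not in_tag and tail:
        tokens.append(("text", tail))
    return tokens


if __name__ == "__main__":
    for token in tokenize("<div><h1>Title</h1><p>Hello <b>world</b></p></div>"):
        print(token)
```

Attributes are thrown away on purpose, since only the tag name matters for deciding what to do with the pair.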
So, the rule book says ignore <div> pairs. You just dumped a bunch of document structure, but that structure had little to do with plain text content. For each tag pair, the choices are: descend into it, dump it, or emit its text. 85% of the document is display boilerplate, needed in the walled garden environment. We can tag some emitted text, as from a header, for classification purposes, but otherwise the most useful information is in the plain text: the word significance and order of appearance.
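The three choices per tag pair can be sketched as a small rule table plus a recursive walk. This is only a rough illustration in Python under my own assumptions: the contents of `RULES`, the `VOID` set, and the names `walk` and `extract` are hypothetical, "ignore <div> pairs" is read as "drop the tag but keep descending", tags not in the table inherit the action of their parent, and the markup is assumed to be well formed (every open tag has its close, no `<` inside scripts).

```python
import re

DESCEND, DUMP, EMIT = "descend", "dump", "emit"

# Hypothetical rule book: what to do with each tag pair.
RULES = {
    "html": DESCEND, "body": DESCEND, "div": DESCEND,
    "head": DUMP, "script": DUMP, "style": DUMP, "nav": DUMP,
    "h1": EMIT, "h2": EMIT, "p": EMIT, "li": EMIT,
}
# Tags with no closing partner; skipped so they do not unbalance the walk.
VOID = {"br", "hr", "img", "input", "link", "meta"}

# Named tag | any other <...> (comment, doctype) | run of text
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|<[^>]*>|([^<]+)")


def tokenize(html):
    for m in TOKEN.finditer(html):
        closing, name, text = m.groups()
        if name:
            yield ("close" if closing else "open", name.lower())
        elif text and text.strip():
            yield ("text", text.strip())


def walk(tokens, pos, action, out, label=None):
    """Consume tokens until the enclosing tag closes; return the next position."""
    while pos < len(tokens):
        kind, value = tokens[pos]
        pos += 1
        if kind == "close":
            return pos                      # end of this tag pair
        if kind == "text":
            if action == EMIT:
                out.append((label, value))  # text tagged with its source tag
            continue
        if value in VOID:
            continue
        if action == DUMP:
            child_action, child_label = DUMP, label  # dumped subtrees stay dumped
        else:
            rule = RULES.get(value)
            child_action = rule or action   # unknown tags inherit the parent
            child_label = value if rule == EMIT else label
        pos = walk(tokens, pos, child_action, out, child_label)
    return pos


def extract(html):
    out = []
    walk(list(tokenize(html)), 0, DESCEND, out)
    return out


if __name__ == "__main__":
    page = ("<html><head><title>T</title></head><body><div><script>var x = 1;</script>"
            "<h1>Title</h1><p>Hello <b>world</b></p></div>"
            "<nav><p>Menu</p></nav></body></html>")
    for label, text in extract(page):
        print(label, text)
```

Each emitted piece keeps the name of the tag it came from, so header text can be told apart from paragraph text later, and the order of appearance is preserved.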
I might think about that, and search the web a bit for a simple text extractor.