For HTML pages, my text extraction tool finds all the tag markers, opening and closing, and pushes them onto a stack as they are found, along with pointers to their locations. The stack becomes an exact schematic of the source document because it retains the nested structure.
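Here is a minimal sketch of that scanning pass in Python. It is not my tool's actual code: the names `TAG_RE` and `scan_tags` are made up for illustration, and the regular expression is a simplification that ignores comments and any `>` characters inside attribute values.

```python
import re

# Simplified tag pattern (an assumption, not a full HTML tokenizer):
# group 1 is "/" for a closing tag, group 2 is the tag name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def scan_tags(html):
    """Return a list of (name, is_closing, start, end) in document order.

    start and end are offsets into html, so each entry points back to
    where its tag marker sits in the source.
    """
    stack = []
    for m in TAG_RE.finditer(html):
        stack.append((m.group(2).lower(), m.group(1) == "/", m.start(), m.end()))
    return stack

for entry in scan_tags("<p>Hi <b>x</b></p>"):
    print(entry)
```

Because opening and closing markers are recorded in the order they appear, the nesting can be recovered just by walking the list.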
Then I pull out the tags and look up the rule for each one, which can be: 1) this is atomic text, emit the contents; 2) this has no text, skip ahead past the closing tag; or 3) this might have text, step into the enclosed HTML. Notice I am back to step, skip, or singleton.
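The three rules can be sketched roughly like this in Python. Everything here is an assumption for illustration: the rule table `RULES`, the choice of which tags get which rule, and the function name `extract_text` are hypothetical, and the matching of closing tags is deliberately naive (it takes the first closing tag of the same name, so nested tags of the same name would fool it).

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

EMIT, SKIP, STEP = "emit", "skip", "step"

# Hypothetical rule table; any tag not listed defaults to STEP.
RULES = {"title": EMIT, "script": SKIP, "style": SKIP}

def extract_text(html):
    out = []
    pos = 0
    while True:
        m = TAG_RE.search(html, pos)
        if m is None:
            out.append(html[pos:])          # trailing text after the last tag
            break
        out.append(html[pos:m.start()])     # text sitting before this tag
        pos = m.end()
        if m.group(1) == "/":               # closing tag: nothing to look up
            continue
        name = m.group(2).lower()
        rule = RULES.get(name, STEP)
        if rule == STEP:                    # step into the enclosed html
            continue
        close = re.compile(r"</" + re.escape(name) + r"\s*>", re.IGNORECASE)
        c = close.search(html, pos)
        if c is None:                       # no closing tag: fall back to step
            continue
        if rule == EMIT:                    # atomic text: emit the contents
            out.append(html[pos:c.start()])
        pos = c.end()                       # emit and skip both jump past the close
    # Pieces are joined with spaces, so a word split by an inline tag
    # comes out as two words; that is part of the accepted sloppiness.
    return " ".join(" ".join(out).split())

page = ("<html><head><title>My Page</title><style>p{color:red}</style></head>"
        "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>")
print(extract_text(page))
```

On this small page the script and style contents are skipped, the title is emitted whole, and everything else is stepped into.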
My theory says I can easily find the text with a few obvious rules, at the cost of a bit of missed text. I do not need a full HTML parser.