First, I find all tags (opening, closing, and singletons) in order of appearance and push them onto a stack. Each stack entry holds a pointer back into the original web page and a pointer to the code book entry describing the tag.
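As a rough illustration of that first pass, here is a minimal sketch of scanning a page into an array of tag entries. All names (StackItem, CodeBookEntry, scan_tags, classify) are hypothetical and are not from my real code. The three-entry code book is a stand-in for a real one, and the sketch assumes lowercase tag names and ignores comments and doctype declarations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { CODE_OPEN, CODE_CLOSE, CODE_SINGLETON } Code;

/* Hypothetical code book entry: describes one class of tag. */
typedef struct { const char *name; Code code; } CodeBookEntry;

/* One stack element: where the tag sits in the page, and what it is. */
typedef struct {
    const char *start;        /* pointer back into the page, at '<' */
    size_t len;               /* length of the tag, through '>' */
    const CodeBookEntry *id;  /* pointer into the code book */
} StackItem;

static const CodeBookEntry OPEN_TAG   = { "open",   CODE_OPEN };
static const CodeBookEntry CLOSE_TAG  = { "close",  CODE_CLOSE };
static const CodeBookEntry SINGLE_TAG = { "single", CODE_SINGLETON };

/* Classify a tag by its syntax; tag points at '<' and a '>' follows. */
static const CodeBookEntry *classify(const char *tag) {
    static const char *const singles[] = { "br", "hr", "img" };
    size_t i;
    if (tag[1] == '/')
        return &CLOSE_TAG;
    for (i = 0; i < sizeof singles / sizeof singles[0]; i++) {
        size_t k = strlen(singles[i]);
        if (strncmp(tag + 1, singles[i], k) == 0 &&
            (tag[1 + k] == '>' || tag[1 + k] == ' ' || tag[1 + k] == '/'))
            return &SINGLE_TAG;
    }
    return &OPEN_TAG;
}

/* Push every tag, in order of appearance, onto stack[0..cap).
   Returns the number of tags found. */
size_t scan_tags(const char *page, StackItem *stack, size_t cap) {
    size_t n = 0;
    const char *p = page;
    assert(page != NULL && stack != NULL);
    while (n < cap && (p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;                       /* unterminated tag: stop scanning */
        stack[n].start = p;
        stack[n].len = (size_t)(end - p) + 1;
        stack[n].id = classify(p);
        n++;
        p = end + 1;
    }
    return n;
}
```

Calling scan_tags on a page such as "<p>Hi<br></p>" would fill three entries: an opening tag, a singleton, and a closing tag, each pointing at its own '<' in the page.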
The grammar is simple. If my stack has an opening tag followed immediately by a closing tag, then there are no embedded tags between them, just plain text or plain unwanted script. So I mark both tags as text, which lets the parent block pass over them, then exit; this is a leaf. (Note: this code is untested.) If they enclose disallowed text, like script, then the pair is marked as singletons instead.
In any event, I recurse back to the same call unless this is the end of the page. When done, the tags on my stack have all been converted into either singletons or text. I can step through the stack and gather the text using the pointers back into the page; I have a database. I can even take a stab at some simple HTML grammar, maybe collect headers at their proper level, presenting some text as a tree.
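The gathering step might look something like the sketch below. It is only an illustration under assumptions: Item, Kind, and gather_text are hypothetical names, the stack is assumed to be fully collapsed already, and text-marked tags are assumed to sit in adjacent open/close pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { KIND_TEXT, KIND_SINGLETON } Kind;  /* states after collapsing */

typedef struct {
    const char *start;  /* pointer back into the page, at '<' */
    size_t len;         /* length of the tag, through '>' */
    Kind kind;
} Item;

/* Copy the text between each adjacent pair of text-marked tags into out,
   one leaf per line. Returns the number of leaves copied. */
size_t gather_text(const Item *stack, size_t n, char *out, size_t cap) {
    size_t used = 0, leaves = 0, i = 0;
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    while (i + 1 < n) {
        if (stack[i].kind == KIND_TEXT && stack[i + 1].kind == KIND_TEXT) {
            const char *from = stack[i].start + stack[i].len;
            size_t len = (size_t)(stack[i + 1].start - from);
            if (used + len + 2 > cap)
                break;                 /* need room for text, '\n', '\0' */
            memcpy(out + used, from, len);
            used += len;
            out[used++] = '\n';
            out[used] = '\0';
            leaves++;
            i += 2;                    /* step past the pair */
        } else {
            i++;                       /* singleton: nothing to gather */
        }
    }
    return leaves;
}
```

Singleton entries, such as a collapsed script pair, are simply stepped over, so their contents never reach the output buffer.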
This is standard parsing, by the way.
/* PStack, Tag, Text and Single are declared elsewhere. */
int process_block(PStack ptr) {
    int i = 0;
    Tag *s = ptr->id;                   /* code book entry for this tag */
    PStack endptr = ptr + 1;

    if (*s->name == 0)                  /* end of page */
        return i;
    if (s->code == Singleton || s->code == Plaintext || s->code == Close)
        return process_block(endptr);   /* just step through: skip one element */
    if (endptr->id->code == Close) {    /* no embedded blocks? */
        if (s->code == Emit) {          /* emit residual text atom */
            ptr->id = &Text;
            endptr->id = &Text;
        } else {                        /* disallowed text, like script */
            ptr->id = &Single;
            endptr->id = &Single;
        }
        endptr++;                       /* step past the collapsed pair */
    }
    return process_block(endptr);       /* descend to start of next block */
}