First, I find all tags, opening, closing, and singletons, in order of appearance, and push them onto a stack. Each tag holds a pointer back into the original web page and a pointer to the code book entry describing the tag.
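Here is a rough sketch of what that stack could look like in C. The struct layout, the `codebook` table, the `Drop` code for disallowed tags, and the `lookup` and `scan_tags` helpers are all assumptions made for illustration, not the actual code. The scanner is deliberately naive: it treats every `<` as the start of a tag, so comments and a `<` inside script text would confuse it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tag classes. Drop is a hypothetical code for disallowed tags. */
typedef enum { Emit, Drop, Singleton, Plaintext, Close } Code;

typedef struct {
    const char *name;   /* tag name; "" marks the end of the page */
    Code code;
} Tag;

typedef struct {
    const char *src;    /* pointer back into the original page */
    Tag *id;            /* pointer into the code book */
} Entry, *PStack;

/* A tiny, made-up code book. */
static Tag codebook[] = {
    {"p", Emit}, {"h1", Emit}, {"script", Drop}, {"br", Singleton},
};
static Tag CloseTag  = {"/", Close};
static Tag Unknown   = {"?", Drop};
static Tag EndOfPage = {"", Singleton};

/* Map a tag name of the given length to its code book entry. */
static Tag *lookup(const char *name, size_t len) {
    if (len > 0 && name[0] == '/')
        return &CloseTag;
    for (size_t i = 0; i < sizeof codebook / sizeof codebook[0]; i++)
        if (strlen(codebook[i].name) == len &&
            strncmp(codebook[i].name, name, len) == 0)
            return &codebook[i];
    return &Unknown;
}

/* Push one entry per tag, in order of appearance, then an end sentinel.
   Returns the number of tags found, not counting the sentinel. */
static size_t scan_tags(const char *page, Entry *stack, size_t cap) {
    size_t n = 0;
    assert(cap > 0);
    for (const char *p = strchr(page, '<'); p != NULL && n + 1 < cap;
         p = strchr(p + 1, '<')) {
        const char *name = p + 1;
        size_t slash = (name[0] == '/');   /* keep the '/' of a closing tag */
        size_t len = slash + strcspn(name + slash, " \t\r\n/>");
        stack[n].src = p;
        stack[n].id = lookup(name, len);
        n++;
    }
    stack[n].src = page + strlen(page);
    stack[n].id = &EndOfPage;
    return n;
}
```

Calling `scan_tags(page, stack, cap)` on a page fills `stack` with entries whose `src` points at each `<` and whose `id` points at the matching code book entry.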
The grammar is simple. If my stack has an opening tag followed directly by a closing tag, then there are no embedded tags, just plain text or plain unwanted script. So I mark those two tags as text, and they are passed over by the parent block; then I exit. This is a leaf. (Note: this code is untested.) If the content is disallowed text, like script, then the tags are marked as singletons instead.
In any event, I recurse back to the same call unless this is the end of the page. When done, the tags on my stack have all been converted into either singletons or text. I can step through the stack and gather the text using the pointers back into the page, and then I have a database. I can even take a stab at some simple HTML grammar, maybe collecting headers at their proper level and presenting some text as a tree.
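The gathering step could be sketched like this. It assumes each stack entry carries a `src` pointer into the page and an `id` pointer into the code book, that collapsed text pairs both point at a shared `Text` entry, and that an entry with an empty name ends the stack. The `gather_text` helper and the field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *name; int code; } Tag;
typedef struct { const char *src; Tag *id; } Entry, *PStack;

/* Shared marker for tags that were collapsed into plain text. */
static Tag Text = {"#text", 0};

/* Copy the text between each adjacent pair of Text entries into out,
   one line per pair. Returns the number of bytes written, not counting
   the terminating NUL. */
static size_t gather_text(PStack ptr, char *out, size_t cap) {
    size_t used = 0;
    assert(cap > 0);
    for (; *ptr->id->name != 0; ptr++) {
        if (ptr->id != &Text || ptr[1].id != &Text)
            continue;                           /* not a text pair */
        const char *from = strchr(ptr->src, '>'); /* end of opening tag */
        const char *to = ptr[1].src;              /* start of closing tag */
        if (from == NULL || from + 1 > to)
            continue;                           /* malformed, skip it */
        from++;
        size_t len = (size_t)(to - from);
        if (used + len + 2 > cap)
            break;                              /* no room for text, '\n', NUL */
        memcpy(out + used, from, len);
        used += len;
        out[used++] = '\n';
        ptr++;                                  /* skip the closing half */
    }
    out[used] = '\0';
    return used;
}
```

Each line of the output buffer is one leaf of text. Collecting headers by level would need the original tag names, which this sketch throws away once a pair is marked as `Text`.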
This is standard parsing, by the way.
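// PStack points at one stack entry; Tag is a code book entry. Both are declared elsewhere.
// Text and Single are the code book entries used to mark collapsed tags.
int process_block(PStack ptr) {
  int i = 0;
  PStack beginptr = ptr;
  PStack endptr;
  Tag *s = beginptr->id;
  if (*s->name == 0)
    return i; // end of page
  if (s->code == Singleton || s->code == Plaintext) { // just step through
    beginptr++;
    process_block(beginptr); // skip one element
  }
  else {
    endptr = beginptr + 1;
    if (endptr->id->code == Close) { // no embedded blocks?
      if (s->code == Emit) { // emit residual text atom
        beginptr->id = &Text;
        endptr->id = &Text;
      }
      else { // disallowed content such as script
        beginptr->id = &Single;
        endptr->id = &Single;
      }
      endptr++; // step past the closing tag of the collapsed pair
    }
    // If endptr was not a closing tag it is a nested opening tag, so examine it next
    process_block(endptr); // descend to start of next block
  }
  return i;
}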
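Since the recursion above is always the last thing the function does, the same pass can be written as a loop, which avoids deep recursion on a long page. This is only a sketch under assumptions: the `Entry` layout, the `Drop` code for disallowed tags, and the `process_stack` name are invented here, and unlike the recursive version it only collapses pairs whose first tag is an opening tag. Like the recursive version, it makes a single pass, so an outer block whose children were just collapsed is left untouched.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { Emit, Drop, Singleton, Plaintext, Close } Code;
typedef struct { const char *name; Code code; } Tag;
typedef struct { const char *src; Tag *id; } Entry, *PStack;

/* Shared markers for collapsed tags. */
static Tag Text   = {"#text", Plaintext};
static Tag Single = {"#single", Singleton};

/* One pass over the stack. Returns the number of open/close pairs
   that were collapsed into Text or Single. */
static int process_stack(PStack ptr) {
    int collapsed = 0;
    assert(ptr != NULL);
    while (*ptr->id->name != 0) {              /* empty name: end of page */
        Tag *s = ptr->id;
        PStack next = ptr + 1;
        int is_open = (s->code == Emit || s->code == Drop);
        if (is_open && *next->id->name != 0 && next->id->code == Close) {
            Tag *mark = (s->code == Emit) ? &Text : &Single;
            ptr->id = mark;                    /* opening tag */
            next->id = mark;                   /* closing tag */
            collapsed++;
            ptr = next + 1;                    /* step past the pair */
        } else {
            ptr++;                             /* singleton, text, or nested open */
        }
    }
    return collapsed;
}
```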