Monday, October 29, 2018

my text scraper, main loop

My simple loop for web pages. It is built on the tag protocol: each tag has a unique identifier, and opening and closing tags are both present, except for singletons. I code the tags I want in a static table, and the unwanted 'blocks' are ignored. I don't build a tree, the grammar is too simple.
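A minimal sketch of what such a static table might look like. The code names Singleton, Plaintext, Close and Emit come from my loop further down; the Skip code, the table contents and the lookup_tag helper are assumptions made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Action codes. Skip is a hypothetical code for unwanted tags. */
typedef enum { Close, Singleton, Plaintext, Emit, Skip } Code;

typedef struct {
    const char *name;  /* tag name without the angle brackets */
    Code code;         /* what the scraper does with this tag */
} Tag;

/* Static code book: only listed tags are wanted.
   The entry with the empty name ends the table. */
static const Tag codebook[] = {
    { "p",      Emit      },
    { "h1",     Emit      },
    { "br",     Singleton },
    { "script", Skip      },
    { "",       Skip      },
};

/* Return the code book entry for a tag name of the given length,
   or the empty-name sentinel when the tag is not in the table. */
static const Tag *lookup_tag(const char *name, size_t len)
{
    const Tag *t = codebook;
    while (*t->name != 0) {
        if (strlen(t->name) == len && strncmp(t->name, name, len) == 0)
            return t;
        t++;
    }
    return t;  /* sentinel: tag not listed, so it is ignored */
}
```

The empty name doubles as the end marker, which matches the end-of-page test on the tag name in the loop.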

First, I find all tags, opening, closing and singletons, in order of appearance and push them onto a stack. Each tag holds a pointer back into the original web page, and a pointer to the code book entry describing the tag.
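The scan could be sketched roughly like this. It is not my actual scanner: the Entry layout, the Kind enum and scan_tags are hypothetical names, and the sketch stores the tag name instead of a code book pointer to stay short. It also does not treat '<' inside script bodies or comments specially.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { OpenTag, CloseTag } Kind;

typedef struct {
    const char *where;  /* pointer back into the page, at the '<' */
    const char *name;   /* start of the tag name inside the page */
    size_t len;         /* length of the tag name */
    Kind kind;          /* opening or closing tag */
} Entry;

/* Find tags in order of appearance and store them in stack.
   Returns the number of entries stored, at most max. */
static size_t scan_tags(const char *page, Entry *stack, size_t max)
{
    size_t n = 0;
    const char *p = page;

    while (n < max && (p = strchr(p, '<')) != NULL) {
        const char *q = p + 1;
        Kind kind = OpenTag;

        if (*q == '/') {  /* closing tag */
            kind = CloseTag;
            q++;
        }
        const char *start = q;
        while (isalnum((unsigned char)*q))
            q++;
        if (q > start) {  /* something like "<!" has no name and is skipped */
            stack[n].where = p;
            stack[n].name = start;
            stack[n].len = (size_t)(q - start);
            stack[n].kind = kind;
            n++;
        }
        p = q;  /* continue after the tag name */
    }
    return n;
}
```

Whether an opening tag is really a singleton is left to the code book lookup, so the scanner only has to tell opening from closing.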

The grammar is simple. If my stack has an opening tag followed by a closing tag, then there are no embedded tags, just plain text or plain unwanted script. So I mark those tags as text, and they are passed over by the parent block, then exit; this is a leaf. (Note this code is untested.) If they are disallowed text, like script, then it is marked as singleton.

In any event, I recurse back to the same call unless this is the end of the page. When done, my stack of tags has all been converted into either singletons or text. I can step through the stack and gather the text using the pointers back into the page, and I have a database. I can even take a stab at some simple html grammar, maybe collect headers at their proper level, presenting some text as a tree.
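The gathering step might look something like the sketch below. The Entry layout, the Mark enum and gather_text are hypothetical, and I assume that only the opening entry of a leaf carries the Text mark while its closing entry, which comes next on the stack, is marked Single.

```c
#include <stddef.h>
#include <string.h>

typedef enum { Single, Text } Mark;

typedef struct {
    const char *where;  /* pointer back into the page, at the '<' of the tag */
    Mark mark;          /* what the main loop converted this entry into */
} Entry;

/* Copy the text of every Text-marked leaf into out, one leaf per line.
   Returns the number of characters written, not counting the final 0. */
static size_t gather_text(const Entry *stack, size_t n, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (stack[i].mark != Text)
            continue;                                    /* pass over singletons */
        const char *start = strchr(stack[i].where, '>'); /* end of opening tag */
        const char *end = stack[i + 1].where;            /* '<' of closing tag */
        if (start == NULL || start + 1 > end)
            continue;
        start++;
        size_t len = (size_t)(end - start);
        if (used + len + 2 > cap)
            break;                                       /* out of room */
        memcpy(out + used, start, len);
        used += len;
        out[used++] = '\n';
    }
    out[used] = '\0';
    return used;
}
```

Nothing is copied until this last pass, so the stack stays small and the page itself is the only text storage.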

This is standard parsing, by the way.

int process_block(PStack beginptr) {
	int i = 0;
	Tag *s = beginptr->tagid;      // code book entry for this tag
	PStack endptr = beginptr + 1;  // the next entry on the stack

	if (*s->name == 0)
		return i;  // end of page
	if (s->code == Singleton || s->code == Plaintext)  // just step through
		return process_block(endptr);  // skip one element
	if (endptr->tagid->code == Close) {  // no embedded blocks: this is a leaf
		if (s->code == Emit)
			beginptr->is = &Text;    // emit the text atom
		else
			beginptr->is = &Single;  // disallowed text, like script
		return process_block(endptr + 1);  // on to the start of the next block
	}
	return process_block(endptr);  // assumed: embedded blocks, so descend into them
}
