Tuesday, October 30, 2018

Simple loop, final

My HTML web extractor (updated).

All it needs to do is recurse through the HTML tree and mark selected tags as plain text. Plain text is what is left after every open and close sequence has been resolved to either a skipped tag or a string of plain text from the original source. The hard part is the spaghetti of crawling through HTML and finding the tags.

The plan is to have an independent utility that grabs web pages and writes them to disk. Then, asynchronously, the scraper attaches to join and refines the pages from gibberish to the main idea, like Watson.
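
The hand-off through disk could be sketched roughly as follows. This is only a guess at its shape: save_page and read_page are hypothetical helpers, the fetching itself is left out, and how the scraper attaches to join is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fetcher side (hypothetical): write one downloaded page to disk.
   Returns 0 on success, -1 on any I/O failure. */
static int save_page(const char *path, const char *html) {
    assert(path != NULL && html != NULL);
    size_t len = strlen(html);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t put = fwrite(html, 1, len, f);
    if (fclose(f) != 0 || put != len) return -1;
    return 0;
}

/* Scraper side (hypothetical): load a saved page into a NUL-terminated
   heap buffer. Returns NULL on failure; the caller frees the buffer. */
static char *read_page(const char *path, size_t *len_out) {
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);
    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    buf[got] = '\0';
    if (len_out) *len_out = got;
    return buf;
}
```

Because the two sides only share files, the scraper can run later, or more than once over the same page, without touching the network again.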


int process_block(int in) { 
 int i=0;
 Tag t;
 int code;
 if((in+2) >= intodex)
  return(1); // done
 PStack ptr = stack+in;
 t= *(ptr->id); // debug look
//
// Process all blocks in sequence until the parent closes.
// We just need to mark the residual text.
 do {
  if(Open(ptr+2))  // Peek beyond this close tag
   process_block(in+2);  // descend into next block
  // Now at this point, until parent close, 
  // everything is Singleton, skip, or text.

  code=(ptr+1)->code; // save original code
  if(ptr->code == Singleton) ptr++; // Singletons ignored
  else if(ptr->code != Skip)  {// Text is what is left after most is filtered away
   ptr->code = Text;
   ptr++;
   ptr->code =Text;
   ptr++;
  } 
  if(code == Close)   // Parent is closed
   return(0);
 } while(1);
 return(1);
}
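
For anyone who wants something to compile and step through, here is a minimal stand-alone sketch of the same idea: recurse on each open tag, ignore singletons, and mark whatever raw text survives. The Token struct, the mark_block and collect_text names, and the script/style skip list are all assumptions made for illustration; the function above works on a different Tag/PStack layout that is not shown in this post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token model, not the Tag/PStack layout used above. */
typedef enum { TOK_OPEN, TOK_CLOSE, TOK_SINGLETON, TOK_RAW } Kind;
typedef enum { MARK_NONE, MARK_SKIP, MARK_TEXT } Mark;

typedef struct {
    Kind kind;
    const char *s;   /* tag name, or the raw text between tags */
    Mark mark;
} Token;

/* Assumed skip list: tags whose contents are never plain text. */
static int is_skipped_tag(const char *name) {
    return strcmp(name, "script") == 0 || strcmp(name, "style") == 0;
}

/* Walk tokens from pos until the enclosing block's close tag (or the end
   of input), marking raw text along the way. Returns the index just past
   that close tag. */
static size_t mark_block(Token *tok, size_t n, size_t pos, int skipping) {
    assert(tok != NULL);
    while (pos < n) {
        switch (tok[pos].kind) {
        case TOK_OPEN: {
            /* Descend; children of a skipped tag stay skipped. */
            int child_skip = skipping || is_skipped_tag(tok[pos].s);
            pos = mark_block(tok, n, pos + 1, child_skip);
            break;
        }
        case TOK_CLOSE:
            return pos + 1;          /* parent is closed */
        case TOK_SINGLETON:
            pos++;                   /* singletons ignored */
            break;
        case TOK_RAW:
            tok[pos].mark = skipping ? MARK_SKIP : MARK_TEXT;
            pos++;
            break;
        }
    }
    return pos;
}

/* Join every token marked as text into out, separated by single spaces.
   Stops early if out is too small. */
static void collect_text(const Token *tok, size_t n, char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (tok[i].mark != MARK_TEXT) continue;
        int w = snprintf(out + used, cap - used, "%s%s",
                         used ? " " : "", tok[i].s);
        if (w < 0 || (size_t)w >= cap - used) break;
        used += (size_t)w;
    }
}
```

Calling mark_block(tok, n, 0, 0) on a token array and then collect_text gives the surviving text in document order. Keeping the marks on the tokens, rather than copying text during the walk, follows the marking approach of the function above.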
