All it needs to do is recurse through the HTML tree and mark selected tags as plain text. Plain text is what is left after every open and close sequence has been resolved to either a skipped tag or a run of plain text in the original source. The hard part is the spaghetti of crawling through the HTML and finding the tags.
The plan is to have an independent utility that grabs web pages and writes them to disk. Then, asynchronously, the scraper attaches to the job and refines the pages from gibberish down to the main idea, like Watson.
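The routine below leans on a tag stack built by an earlier parsing pass, and none of those declarations appear in the post. Here is a minimal sketch of what the code seems to assume; the Entry layout, the code values, the Open() test, and STACK_MAX are all reconstructions, not the original definitions.

/* Hypothetical declarations, inferred from how process_block() uses them.
 * None of this appears in the original post; layouts and values are guesses. */
#define STACK_MAX 4096

typedef struct Tag {
    const char *name;        /* assumed: the tag's name, e.g. "div" */
} Tag;

typedef struct Entry {
    Tag *id;                 /* which tag this stack slot refers to */
    int  code;               /* classification assigned by the parser */
} Entry, *PStack;

/* Assumed classification codes for a stack entry. */
enum { OpenTag, Close, Singleton, Skip, Text };

/* Assumed: true when the entry is an opening tag. */
#define Open(p) ((p)->code == OpenTag)

Entry stack[STACK_MAX];      /* tag stream from the parsing pass */
int   intodex;               /* number of valid entries in stack */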
int process_block(int in)
{
    Tag t;
    int code;

    if ((in + 2) >= intodex) return 1;      // done, ran off the end of the stack
    PStack ptr = stack + in;
    t = *(ptr->id);                         // debug look at the current tag

    // Process all blocks in sequence until the parent close.
    // We just need to mark the residual text.
    do {
        if (Open(ptr + 2))                  // peek beyond this close tag
            process_block((int)(ptr - stack) + 2);  // descend into the next block
                                            // (recompute the index: ptr has moved, in has not)

        // At this point, until the parent close, everything is
        // a Singleton, a Skip, or text.
        code = (ptr + 1)->code;             // save the original code
        if (ptr->code == Singleton)
            ptr++;                          // Singletons are ignored
        else if (ptr->code != Skip) {       // text is what is left after most is filtered away
            ptr->code = Text; ptr++;
            ptr->code = Text; ptr++;
        }
        if (code == Close)                  // parent is closed
            return 0;
    } while (1);

    return 1;                               // unreachable
}
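For what it's worth, a hedged sketch of how the routine might be driven once the parser has filled the stack. The refine_page() wrapper and its output step are hypothetical, since the post never shows a caller.

#include <stdio.h>

/* Hypothetical driver, assuming the declarations sketched above;
 * the original post never shows a caller for process_block(). */
void refine_page(void)
{
    process_block(0);                 /* resolve the whole tag stream */

    /* Whatever survived with code == Text is the residual plain text. */
    for (int i = 0; i < intodex; i++)
        if (stack[i].code == Text)
            printf("text run at slot %d\n", i);  /* stand-in for real output */
}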