Monday, November 5, 2018

Scraper and very large files

Not working at the moment. These files will have to be 'streamed' or processed in large chunks, but that may mean maintaining source pointers into the disk file and using fseek to move data in. If not done right I end up putting the disk in a jam doing fseeks whenever a complete sequence is not in the buffer. On the other hand, I do not want to go to disk for every single character, and I do not know how well the disk IO will stream. Hmm....

I am thinking this. Windows does not like unblocked IO, and a context switch is going to happen on read whether I want it or not, I think. The good news is that the scraper does no back scan; once a set of markers is processed, the system moves past that point in the file, so random access is not needed.
So when processing a stack of HTML tag pointers, I reset the disk and, starting from the beginning, can read large buffers, one at a time.

But I am still stuck: every time I move the source pointer along I have to check for end of buffer and load more. But I can do that inline:

if(!*src) src = getmore();
else src++;

Thus one test per character passed. I can overwrite the buffer in doing so; the data is already established in the stack pointers. On read and index I get the same result: passing src in one direction only makes it easy to keep the input buffer full.

I can do this for plain text. This is mostly test and adjust, fixing the static setup from the lab and adapting it to real-world production, standard procedure. Along the way during test I fix the memory leaks that lab folks leave in the code, as well as duplicate functions.

We get a much broader solution because we can open one large plain text or HTML file and have multiple cursors going through it with join searches. Every cursor contains a void * called start, and the attachment can use start to point to a particular buffer process. So the attachment performs the blind operation of a structure defining input for one of many cursors:

typedef struct {
    char* buf;
    FILE* fin;
    int bufsize;
} Into;


Everything the attachment needs to read more data, as needed. Multiple cursors can open the file for read only. This is a general solution for lazy J searches, plain text, and HTML. It does not work with MEM; mem has inline pointers, not an external index stack. MEM is designed for in and out of memory, fast and structured.


#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char* buf;
    FILE* fin;
    int bufsize;
} Into;
typedef Into * PInto;

/* advance the source pointer; refill the buffer at the terminator */
#define bump(p,q) if(!*p) p = More(q); else p++;

char * More(PInto p) {
    int i;
    i = fread(p->buf, 1, p->bufsize, p->fin);
    if (i <= 0)
        return(0);
    p->buf[i] = 0;
    //printf("%d bytes read start \n", i);
    return(p->buf);
}

#define BUFSIZE 20

int main(void) {
    char * src;
    Into p;
    PInto q = &p;
    p.fin = fopen("grammar.txt", "r");
    if (!p.fin)
        return(1);
    p.bufsize = BUFSIZE;
    p.buf = malloc(p.bufsize + 1); // keep room for terminator
    if (!p.buf)
        return(1);
    *p.buf = 0;                    // empty buffer forces the first More
    src = p.buf;
    do {
        if (*src) putchar(*src);
        bump(src, q);
    } while (src);
    free(p.buf);
    fclose(p.fin);
    return 0;
}
