Notes on the join rules

Join and how it works.

[ see appendix to contact author]
This note uses lazy json to depict graph structure.
Join takes two 'databases' presented as directed graphs, and executes their complete convolution. Examples of operations are not restricted to Json type text files, as the discussion below makes clear.
Consider:
a.b.(xyz) join q.(r,s,t).k

It will decompose into
(a.q).(b.r,b.s,b.t).(x.k,y.k,z.k)

Each of the three graphs (two in and one out) are managed by external attachments that responds to call back methods with status. . None of the attachments need the same data structure, their interfaces are isolated from each other.
  • Status values returned from the attachments, and their methods:
  • Methods implemented by the attachment:
  • Reset=0, // Like a pop of previous pointer
  • Set, // a push notification
  • Fetch, // next node request
  • Step, // step one node forward
  • Skip, // Skip to end of sequence
  • Append //output to a graph
  • Descend, // Start new sequence
  • Branch // Close current sequence
Directives returned from the attachment back to join.

  • Done //a request to break 
  • Break //mean exit from latest push 
  • Push //is a neutral descent by one side or the other due to the Parenth 
  • Null //is the 'no action implied' condition 
  • Match means a true value for some dot operator 
  • Pause // Drop the current in out pair and run another from the list of active join 
  • Singleton // Indicates a Dotted key but unmatched.

Singleton status indicates one element in the sequence and replaces the need to peek ahead when incrementing the Comma loops.

Break, the left side can force the right to skip the current Comma sequence. Break gives us the distributive property. Break and match have nearly the same meaning, except that break is from a match in an Comma sequence, Match is from a Dot operator.

Two Breaks is a match within a comma sequence, and both should skip ahead, and a skip ahead over a singleton default to a step. The join directs the process with call backs

Break and Match, Break and match, or two matches cause the current nodes to be collected at output. The Match logic can be modified by setting the mode value on output.

Break Law An attachment that returns break will not break unless the other has a Match or a Break.

Skip. Applications kow this is graph, they have to get the next descent or increment set; a binary return possibility.

Done will force a break from the loop

Left Right rule

Join let's the left graph have some dominate, but the right graph generates output. The implicit assumption is we have globs of data on the right righ being refined by the more intelligent search graph to the left. Output comes from the right graph, modifies by the left.

Pause allows multiple, interacting joins, one feeding the other. When an attachment issues a pause, the join for that pair is suspended and the next pair in the stack runs. Any two threads can run from tghe same 'database' input. If one join out feeds another join in then the attachment can issue a pause to the one pair catch the other.

Default Pause Rule:

Unless over ridden, the current join will yield upon finishing a right sequence. Normally the current instance is the only one ready and join returns to it.

Output structure

In the default, the output does skip and step following th left side, generally considered the search graph. So output is the collected path through the search graph.

Unequal rank graphs

When the right graph, reports Done, the left will step through Null elements until it replies with Done.

Grouping, the Parenth operation

Works as expected, managed with a Push status.

Null keys will not match, at the moment. Under consideration.

Multi-processing is enabled, being tested. It allows multiple ongoing joins in the same machine, mainly because the joins are chained, one output feeding another input simultaneously, via the pause status. No need for threads in this application.

Round Robin access

It will cycle through its list of cursor pairs, looking for work Here is where it can yield the cursor stack to an outside application for entry and exit.

The Cursor stack and pauses

Only the active cursors are on the stack. A push will cause a duplicate cursor to replace the forking parent's position on the stac..

The stack is generally paused convolves. Input can be halted waiting for data. Output can be jammed.


Not all matches are the same

There is a standard KeyMatch (see appendix(). That match function includes modifiers,(Unmatched, All, Once, Not,Some). This match modifier, the wild card flag, is carrier in the cursor, and is set by the attachment using its own methods and syntax. Lazy J has a usual set of wild card punctuation marls planned. But, the integer is reserved in the cursor, available for matching regardless of the match function.

Macro Shell :

This release includes a bear bones macro shell and command handler with macro definition and expansion.  The shell.c utility includes no file headers other than  standard header. It does expect an EsceCommand API. The command line is parsed into a char*[], with the arguments now being point back into the command string and the proper null terminations applied.  All macros are expanded.

Built ins:

  • //  system wide
  • InitJoin
  • StartJoin
  • LoadDevice device.dll
  • SetCursor slot device args
  • DupCursor slot to from
  • ListCursors
  • ListTypes
  • ExecSlot slot method data
  • ListStack
  • ListMem
  • SaveDB deviceid file
  • // shell specific
  • ExecFile
  • ListMacros
  • GetEnv
  • PutEnv
  • ListArgs
  • InitSystem

I try to follow the PowerShell standard on command names:
MethodObject.  Methods being mostly Put,Get,List and Set.  The Objects being the argumenents themselves, GetArgs, for example.  PutEnv GetEnv just deal with the symbol table.  All operations on Shell object, ExecFile is another example, are handled in shell.c .
Operations on specific oibjects like Cursor, Attachment, Buffers happen down the line, the command arguments passed down from general to specific; until finally the  Debug handler has home made, but executable spaghetti.

Attachment design:


Initialization. The package comes with attachments, enumerated:


CONSOLE, MEM, TEXT, STRING and LAZYJ
  • CONSOLE opens standard IO,either stdin, stdout, or filename if presented.

  • MEM is the standard graph memory structure, for input or output slots.

  • STING sets up a LAZYJ base using the console string.

  • TEXTt is the plain text attachment which extracts large words from a plain text document.

Type the help command for format. The attachments needs its standard initialization, and a new_cursor function.  Join itself is unaware of what attachments are responsible for the main call backs, exec and eval. See appendix


The simplest design is in the appendix, just issue script to the console by link to exec_shell_command(char * argd-str);

Memory interface

The memory interface is a read/append direct graph in memory using normal record structures. It allows stacking of joins. Nothing like json but it still uses the pointer variables in the Cursor structure. Memory is pruned when the left side generator exits the system.

Users can specify the key size on a per instance basis. The record for any of a given number of mem bases is variable, and the calling user provides memory space.:


A key record structure in the mem attachment. The parenth opcode has no key but keeps the first four entries for step and skip.

All integers count records, record size established at each new instance. Linked backward to enable set up on append and skip. The arrangement yields very fast scanning in join. The memory interface has no dot operator, instead it has Singletons, a set of one element.

Null Cursor

It should be default, it returns null unless otherwise instructed by the mode flag. This is part of join, happens internally.

Keys

The join does not use them, but carries a pointer to the key current in the associated cursor. All key matches are deferred to the attachment. So, wildcard keys are out, but I keep a wild card operator, Asterisk. for testing. It mpost;ly fouls thing up, but still useful in the lab. Asterisk will likely be made functional, a wild card operator.

Join doesn't care what the grammar operators mean, it really is a push, pop and loop machine. It requires the attachment to return suggestions about when to push and pop.
Example join when enabling debug print out. Each side reporting the proper push,break and match results.
Dynamics:

The mainloop has five call backs, but that can be reduced to three, and one for an idle attachment Paused. All the rest is about testing the logic of local variables, so this is extremely fast. The operation costly is the Push, executed to start a new sequence..

It takes two passes, sometimes, to move forward, but still the shortest pass.

The general idea is that the two cursors remain synchronized, so the loop is a kind of arbitration method. It is more akin to a telephone digital switch than an OS, hardware enhanceable.

Output uses Skip and and adds a pitch So its natural output is Dot, Comma grammar, a graph.

There is a shared flag called logic in the Cursor define. That is a cheap way for join to signal back to the attachment.


Command shell

Yes, there is one. The help string is in the appendix. The shell can operate from a batch file and set up triples of cursors for multi-level joins.


Debug:


There is an app this is a debug file with a console debug loop one cn call it any time. The console command include a quit and continue, return code so the app can swarp between debug, console and operations. Another command, debug is optional, as is the debug loop.


Files:

Other file links to pages:


DLL interface:

Fairly simple, there are two call backs, one to compare two nodes, and one to execute step and fetch, and other methods. Examples in the code.  The text scraping too is a dll attachment, though not released.
Appendix:

Starting with the next_cursor which defines the round robin access to the loop.
.
// round robin access
int next_cursor(PCursor *l, PCursor *r)
{
    if(top==-1)
        printf("\nStack is empty!!");
    else {
  *l = stack[index];
  *r = stack[index+1];
  index= index+2;
  if(index == MAX) 
   index = 0;
 }
 return(Null);
}

And what is lazy json? , Comma and Dot are special, as is the Parenth pair. Keys are alpha, everything else skipped. It is just one of the maninterfaces to the join, and the term lazy implies some spaghetti in the interface.

Typical printf debug output

"f.(a,b).K" join "(s,f,t).b.K"


Null Push [f.][(]
Null Null [f.][s,]
Match Break [f.][f,]
Push Null [(][b.]
Null Null [a,][b.]
Break Match [b,][b.]
Match Match [K.][K.]

Typical new_cursor interface for attachments:



PCursor new_json_cursor(void){
 PCursor self;
 self = init_cursor(exec_json,eval_json);
 self->prevOp = ',';
 return(self);
}
Setting up join chains example
 PCursor j1,j2,m,t,o;
 j1 = new_lazyJ_cursor("search string");
 j2 = new_lazyJ_cursor("search string");
 m = new_mem_cursor);
 t = new_text_ciursor("filename");
 0 = new_console_cursor();  /printf outputs
 join_server_add(t,j,m);
 join_server_add(m,j,o);
Match Key, standard function wjth enumerated modifiers
enum { M_All = LastStatus,M_Once,M_Not,M_Except,M_Some};
int MatchKey(Element* p, Element* q,int logic) {
 if(logic == M_All) return(Match);
 if(*p->key || *q->key) {  // Null keys do not match
  if(!strcmp(p->key,q->key)) 
   return(Match);
  }
 else 
  return(Null);
 }


Below is a simple  app that simply executes a script line by line to the console executive.  Works fine, has one interface: int exec_shell_command(char *);



/*
A simple app that keep a script inline the 
code to control the join
*/

static char* script[] = {
 "set MEM 2 20 2000",
 "set TEXT 0 grmmar.txt,10000",
 "set LAZYJ 1 join.txt",
 "mode 2 All",
 "join",
 "dup 2,3",
 "set CONSOLOE 5",
 "set LAZYJ 4 join.txt",
 "join" };

int exec_shell_command(char *);
int main(void){
 int i=0;
 int count = sizeof(script)/sizeof(char *); 
 printf("%d %d\n",sizeof(script),count);
 while(i < count) {
   exec_shell_command(setup[i]);
  i++;
 }
 return 0;
}

// Author
Matt Young
youngsanger@gmail.com

No comments: