Friday, November 30, 2018

My C language parser

After looking at two or three simple C interpreters, I stole some code and got a simple parser down to 200 lines of code.  But it is a parser only. It takes a few hundred more lines of code to execute the C code, and I just parse it.

I used the mass approach. I grabbed a list of every actionable token in the C language, and I check each source token against that. If the source token is a key C language token, I get a tag ptr and push it on a stack. The tag ptr tells me the string representation of the action and gives me a ptr into the source token that triggered it.

If the source token is not in the official list, then it is a variable, and goes in the symbol table.
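Here is a minimal sketch of that first pass for a single token. It is not my actual code: the Form enum names, the two-field Idiom layout, the tiny known_tokens list and the classify_token function are all assumptions made up for illustration, and the digit and quote checks for literals are an added guess at how integers and strings get their forms.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical form codes, mirroring the cases used in pass two. */
typedef enum { f_null, f_unary, f_binary, f_command, f_cast, f_pair,
               f_variable, f_integer, f_string } Form;

/* Hypothetical tag layout: only the fields str and form are assumed. */
typedef struct {
    const char *str;   /* pointer to the source token text */
    Form form;         /* what kind of action the token triggers */
} Idiom;

/* A small, illustrative subset of the "official list" of C tokens. */
static const Idiom known_tokens[] = {
    { "while", f_command }, { "if", f_command }, { ";", f_command },
    { "+", f_binary }, { "-", f_binary }, { "!", f_unary },
    { "{", f_pair }, { "int", f_cast },
};

/* Pass one for a single token: a known token gets its listed form,
   a literal gets integer or string, anything else is a variable. */
Idiom classify_token(const char *token) {
    Idiom tag = { token, f_variable };
    size_t n = sizeof known_tokens / sizeof known_tokens[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(token, known_tokens[i].str) == 0) {
            tag.form = known_tokens[i].form;
            return tag;
        }
    }
    if (isdigit((unsigned char)token[0])) tag.form = f_integer;
    else if (token[0] == '"') tag.form = f_string;
    return tag;
}
```

A linear scan is fine for a list this short. A real table covering every C token could be sorted and searched with bsearch instead.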

Then pass two is a simple switch statement:

Idiom * process_tags(Idiom * tag) {
  void * arg_list[20];
  int argc=0;
  while(1) {
    // End of the tags or a closing brace: hand back the next tag
    if(tag->str == 0 || *tag->str == '}') return(tag+1);
    switch(tag->form) {
    case null: // Not implemented or pass through
    case unary: // ^ a
    case binary: // + a b as a cmd sequence
      arg_list[argc]=tag->str; argc++;
      break;
    case command:
      if(*tag->str ==';' || *tag->str==0) // The cmd sequence is complete
        tag = ExecCommand(&argc,arg_list); // ExecCommand calls the generic cmd list
      break;
    case cast: // Already in the symbol table
      break;
    case pair:  // process internals until done
      tag = process_tags(tag+1);
      break;
    case variable: // Already in the symbol table
    case integer: // Becomes an argument to pass or value in an expression
    case string:  // ditto
      arg_list[argc]=tag->str; argc++;
      break;
    }
    tag++;
  }
}



OK, notice one thing: in my syntax the ';' char causes the switch to immediately call ExecCommand on a command of the form:

cmd arg1 arg2 ..;

That process has hundreds of lines of code, all of them unwritten. Here is my object code for while:
while arg1; 

Test the argument, then continue if true, or return from the routine above if not. The decision is made in the command handler for while, and it returns the continuing tag pointer into the stack.  There is some code for that.  Also there is code needed to compute expressions, none of it written, and any of it I do insert will be mostly stolen.

My point is that ExecCommand is a command handler. It can do ls, cd, even gcc, as well as while, for, >>, << and so on: a unified command handler with all arguments being:

Idiom * ExecCommand(int *argc, void *args[]);

In other words, the standard Linux and C argument list grammar is my object code format. The parser above will pass through all the undefined commands as long as they fit a proper cmd sequence form.
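A minimal sketch of how such a unified handler might dispatch, assuming a plain lookup table keyed on the verb in args[0]. The names exec_command, CmdHandler, cmd_count and cmd_clear are made up for illustration, and this version returns an int status rather than a tag pointer to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type: every command sees the same typeless list. */
typedef int (*CmdHandler)(int *argc, void *args[]);

/* Illustrative handler: reports how many arguments follow the verb. */
static int cmd_count(int *argc, void *args[]) {
    (void)args;
    return *argc - 1;
}

/* Illustrative handler: empties the list, showing that a handler can
   write back through argc (the "full duplex" part). */
static int cmd_clear(int *argc, void *args[]) {
    (void)args;
    *argc = 0;
    return 0;
}

static const struct { const char *name; CmdHandler fn; } cmd_table[] = {
    { "count", cmd_count },
    { "clear", cmd_clear },
};

/* Treat args[0] as the verb and call its handler; -1 if it is unknown. */
int exec_command(int *argc, void *args[]) {
    if (*argc < 1) return -1;
    size_t n = sizeof cmd_table / sizeof cmd_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp((const char *)args[0], cmd_table[i].name) == 0)
            return cmd_table[i].fn(argc, args);
    }
    return -1;
}
```

Adding a command is then one handler function and one row in the table. Nothing in the parser changes.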

Well, actually, the parser above does none of that now; it is untested. But we can see the key idea: combine the cmd sequence grammar and the scripting grammar in one parser. We can get this in under a thousand lines of code, get the basic C control structures, and add command implementations as we go to get more functionality.  Making that argument list full duplex and typeless gives great flexibility down at the subsystem level.

Plus there is the idea that this is opaque to any application underneath. A very powerful concept, an adaptable layer that can compete and win over PowerShell.  It is really a layer, slightly more than a protocol; thin, easy to deploy. The customer will load the dynamic lib for any subsystem he needs to manage. The customer gets the standard command format,

VerbAdjectiveNoun arg1 arg2 ..;

The light bulb clicks because, hey, we all already agree on what a command sequence looks like; all the existing utilities have adapted to the Linux argument format. So the pros can write the sophisticated spaghetti to get in and out of Linux, then just connect the user to his system with a load lib on the shell. All the subsystem commands become part of the macro shell language, and the pros can pass arguments down the chain and return well formed results using the arglist format.

Macros and functions?

Do them both.  Macros are already easy, just insert the expanded script in place.  Functions are set on an arg_list, a separate one or even the same one. They use the same parser, but a separate arg list.  The cmd format is a great format: it frees the developer to write whatever spaghetti is desired, and it is always the same interface.
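The macro half could be sketched as plain text substitution before parsing. This is only an illustration under simple assumptions: the function name expand_macro is made up, tokens are split on whitespace only, and a macro name glued to a ';' would not be recognized.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into out, replacing each whitespace-delimited token equal to
   name with body. Returns 0 on success, -1 if out is too small. */
int expand_macro(const char *src, const char *name, const char *body,
                 char *out, size_t out_size) {
    size_t used = 0;
    size_t name_len = strlen(name);
    if (out_size == 0) return -1;
    while (*src) {
        size_t len = strcspn(src, " \t\n");   /* length of the next token */
        if (len == 0) len = 1;                /* a whitespace char: copy as is */
        const char *piece = src;
        size_t piece_len = len;
        if (len == name_len && strncmp(src, name, len) == 0) {
            piece = body;                     /* insert the expansion in place */
            piece_len = strlen(body);
        }
        if (used + piece_len + 1 > out_size) return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        src += len;
    }
    out[used] = '\0';
    return 0;
}
```

The expanded text then goes through the same parser as any other script, so a macro needs no support in pass two.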
