jeudi 7 août 2014

Create a parser using Lemon with re2c for the lexer

Hello,

Last time, I explained how to use re2c to create a lexer, now I will present how to combine it with Lemon for the parser.

I wanted first to show it with a more concrete example than a calculator (the famous one), but I swear I tried, but the result is too big and it is not easy to write an simple article, focused on Lemon.

I decided then to continue showing Lemon for a calculator but then writing other article focusing on how to use it especially for more complex grammar like PLSQL.

I encourage you first to read the documentation (short enough) of Lemon... and read it again, and again ... Actually it is well explained but I had personally to read it multiple times. In fact, you need to try a concrete example to understand it (no I didn't say fully :-)).

http://www.hwaci.com/sw/lemon/lemon.html

Lemon is just a C file to compile. very simple, just download lemon.c and lempar.c
http://www.hwaci.com/sw/lemon/

compile lemon simply by:
> clang lemon.c -o lemon

I personally put lemon and lempar.c in my /usr/local/bin, you can actually put it everywhere you want.

see the grammar of the calculator (calc_parser.y):

%name calcParse
%token_prefix CALC_TOKEN_

%token_type {int}  

%left PLUS MINUS. 
   
%include {
#include ≶assert.h>
#include ≶iostream>
}  
   
%syntax_error {
    std::cout << "syntax error - ";
    int n = sizeof(yyTokenName) / sizeof(yyTokenName[0]);
    for (int i = 0; i < n; ++i) {
            int a = yy_find_shift_action(yypParser, (YYCODETYPE)i);
            if (a < YYNSTATE + YYNRULE) {
                    std::cout << "expected " << yyTokenName[i] << std::endl;
            }
    }
}

calc ::= exp(e) END.                { std::cout << "= " << e << std::endl; }
   
exp(lhs) ::= NUMBER(e).             { lhs = e; }
exp(lhs) ::= LPAR exp(e) RPAR.      { lhs = e; }
exp(lhs) ::= exp(le) MINUS exp(re). { lhs = le-re; }

exp(lhs) ::= exp(le) PLUS exp(re).  { lhs = le+re; }

Lemon is really better than bison or yacc on how to handle variables, I really like to not reference $1 but named variable.

Look at the syntax error, it will show which token was expected if the parsing fails.
(I would have liked a default one like that)

Now, we have to generate the .c and the .h files.
lemon -m -l calc_parser.y

  -m           Output a makeheaders compatible file.
  -l           Do not print #line statements.

as you can see, I generate the .c file compatible with "makeheaders"
makeheaders is also made by the same author than lemon and sqlite.
By default lemon only export the token in the .h, but not the functions.
it will be generated if I use makeheaders

makeheaders calc_parser.c

now the calc_parser.c should be forced to be compiled as c++.
In xcode, it is simple, just select the file and on the right select the source type.


calc_parser.h:

/* This file was automatically generated.  Do not edit! */
#define calcParseTOKENTYPE int
#define calcParseARG_PDECL
void calcParse(void *yyp,int yymajor,calcParseTOKENTYPE yyminor calcParseARG_PDECL);
#if defined(YYTRACKMAXSTACKDEPTH)
int calcParseStackPeak(void *p);
#endif
void calcParseFree(void *p,void(*freeProc)(void *));
void *calcParseAlloc(void *(*mallocProc)(size_t));
#if !defined(NDEBUG)
void calcParseTrace(FILE *TraceFILE,char *zTracePrompt);
#endif
#define calcParseARG_STORE
#define calcParseARG_FETCH
#define calcParseARG_SDECL
#define CALC_TOKEN_RPAR                            6
#define CALC_TOKEN_LPAR                            5
#define CALC_TOKEN_NUMBER                          4
#define CALC_TOKEN_END                             3
#define CALC_TOKEN_MINUS                           2
#define CALC_TOKEN_PLUS                            1
#define INTERFACE 0

Also something to know about lemon is that it is a push parser, it will never call the lexer.
We need then some code to read from the lexer and call the parser (main.cpp):

#include "lexer.hpp"
#include "calc_parser.h"
#include <iostream>

using namespace std;

int main() {
    const char * content = "1+4+1+(5+3)";

    void* parser = calcParseAlloc( malloc );

    Lexer lexer(content);
    
    int token;
    
    do {

        token = lexer.scan();
        
        switch (token) {
            case CALC_TOKEN_NUMBER:
            {
                string num = lexer.getTokenValue();
                calcParse(parser, CALC_TOKEN_NUMBER, atoi(num.c_str()));
                break;
            }
            default:
                calcParse(parser, token, 0);
                break;
        };
    }
    while( token != CALC_TOKEN_END );
    
    calcParse(parser, 0, 0);
    
    cout << "finished!" << endl;

    calcParseFree( parser, free );
    
    return 0;
}

I modified the lexer to use now the token from the parser.

#ifndef LEXER_HPP
#define LEXER_HPP

#include <string>

class Lexer {
public:

    Lexer( const char *s );

    int scan();
    
    std::string getTokenValue() const;

private:
    const char *m_content;
    
    const char *m_start;
    const char *m_cursor;
    const char *m_limit;
    const char *m_marker;
    const char *m_ctxmarker;
};

#endif

and the lexer implementation:

#include "lexer.hpp"
#include "calc_parser.h"
#include <iostream>

using namespace std;

Lexer::Lexer( const char *s ) : m_content(s)
{
    m_start = m_cursor = m_content;
    m_limit = m_content+strlen(m_content);
}

std::string Lexer::getTokenValue() const
{
    return string(m_start,m_cursor-m_start);
}

/*!max:re2c */

int Lexer::scan()
{
    m_start = m_cursor;

    #define YYCTYPE char
    #define YYCURSOR m_cursor
    #define YYLIMIT m_limit
    #define YYMARKER m_marker
    #define YYCTXMARKER     m_ctxmarker
    #define YYFILL(n)   

    /*!re2c
        re2c:indent:top      = 1;
        re2c:yyfill:enable   = 0;
        
        '\000'              { return CALC_TOKEN_END; }
        [0-9]+              { return CALC_TOKEN_NUMBER; }
        "+"                 { return CALC_TOKEN_PLUS; }
        "-"                 { return CALC_TOKEN_MINUS; }
        "("                 { return CALC_TOKEN_LPAR; }
        ")"                 { return CALC_TOKEN_RPAR; }
    */
}

please refer to my previous article how to use re2c to compile the .re file.

everything should work now and produce the output below:
= 14
finished!

please let me know if it has been useful for you