Comments for "PHP_LexerGenerator"

Comments are only accepted during the "Proposal" phase. This proposal is currently in the "Finished" phase.
» Comments
  • Alexander Merz  [2006-06-25 19:30 UTC]

    I miss two things:

    1.) It should be possible to add a phpdoc description to the lexer class.
    2.) A "Full-Service" mode would be nice: you only define the rules and the class name, and the lexer class and its methods are generated automatically following an interface definition. A parser generator could depend on that interface definition, so the developer can simply pass the generated lexer class to the parser without being confronted with any API stuff.

    But I also like the current state!
  • Greg Beaver  [2006-06-26 22:53 UTC]

    I'm not sure what you mean by #1 - to add a phpdoc block, all you need to do is write one in the .plex file.

    If you want one before each of the private lexing methods, that's another story.

    I would enlist you to code #2 if you need it, providing assistance :)
  • Alexander Merz  [2006-06-27 18:26 UTC]

    #1 It would be nice to have something like a docblock for the plex file that is not part of the <?php ?> class section, but sits in a /*!lex2php */ section at the beginning. This 'lexdoc' could then be added to the class as its docblock. Maybe this is just a question of style: as long as we have no 'lexdoc', it doesn't really matter where the docs go, but we could keep it in mind. There is no real doc standard in the lexer and parser world, a fact I hate...

    #2 Ok :-)
  • Alexander Merz  [2006-07-03 18:44 UTC]

    Two additional problems:

    1.) I implemented a "Skip token" feature in the lexer class:

    function advance() {
        $ret = $this->{'yylex' . $this->state}();
        if ($this->token != 17) {
            if ($ret) {
                return true;
            }
            return false;
        } else {
            return $this->advance();
        }
    }

    As you can see, you have to know the number assigned to the pattern. It would be nice to use the pattern name as a placeholder instead. Something like:

    if($this->token != %TOKEN%)...
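The request above can be sketched in plain PHP with a class constant standing in for the magic number 17. This is only an illustration: `SketchLexer`, `T_WHITESPACE`, `T_ID`, and the pre-made token queue are hypothetical names, not something PHP_LexerGenerator is known to produce. The loop also replaces the recursive call with iteration.

```php
<?php
// Hypothetical stand-in for a generated lexer; the class, constants, and
// token queue are illustrative only, not output of PHP_LexerGenerator.
class SketchLexer
{
    const T_ID = 1;
    const T_WHITESPACE = 17; // named constant instead of the magic number

    public $token;
    public $value;
    private $queue;

    public function __construct(array $queue)
    {
        $this->queue = $queue;
    }

    // Stand-in for the generated yylex method: pops one pre-made token.
    private function yylex()
    {
        if (!$this->queue) {
            return false;
        }
        list($this->token, $this->value) = array_shift($this->queue);
        return true;
    }

    // Iterative advance(): keeps reading until a non-whitespace token appears.
    public function advance()
    {
        while ($this->yylex()) {
            if ($this->token !== self::T_WHITESPACE) {
                return true;
            }
        }
        return false;
    }
}

$lexer = new SketchLexer([
    [SketchLexer::T_ID, 'node1'],
    [SketchLexer::T_WHITESPACE, ' '],
    [SketchLexer::T_ID, 'node2'],
]);
$seen = [];
while ($lexer->advance()) {
    $seen[] = $lexer->value;
}
// $seen holds only the non-whitespace values, in input order.
```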

    2.) If you define a literal as the pattern for a token, you would expect the literal to match "correctly", but it doesn't. So you run into trouble when another token's regex pattern could also match the literal.

    An example:

    --- the tokens ---
    NODE = "node" (node is a keyword)
    ID = /\w(\w|\d)*/ (an Id is just a literal)
    ------------------

    --- The data to parse ---
    node [color=grey]
    node1 -> node2
    -------------------------

    In the first line, "node" is correctly recognized as NODE. But in the next line the lexer splits "node1" into "node" and "1", so it is recognized as NODE plus something else. That's not correct; it should be ID.

    I could solve this by using a regex for NODE instead of a literal:
    NODE = /node[^\w\d]/

    Other lexer generators do not seem to have this problem.
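The behaviour described in this comment can be reproduced with plain PCRE calls. The sketch below assumes a scanner that tries the rules in order and takes the first match. The `scan()` helper and the extra `WS` rule are made up for illustration and are not the code the package generates. It also shows that the `/node[^\w\d]/` workaround consumes the character after the keyword.

```php
<?php
// Illustrative first-match-wins scanner; not the code PHP_LexerGenerator emits.
function scan($input, array $rules)
{
    $tokens = [];
    $pos = 0;
    $len = strlen($input);
    while ($pos < $len) {
        foreach ($rules as $name => $regex) {
            // \G anchors the match at the current offset.
            if (preg_match('/\G(?:' . $regex . ')/', $input, $m, 0, $pos) && $m[0] !== '') {
                $tokens[] = [$name, $m[0]];
                $pos += strlen($m[0]);
                continue 2; // restart rule matching at the new offset
            }
        }
        throw new RuntimeException("No rule matches at offset $pos");
    }
    return $tokens;
}

// WS is a hypothetical extra rule so that spaces can be tokenized.
$literal    = ['NODE' => 'node',        'ID' => '\w(\w|\d)*', 'WS' => '\s+'];
$workaround = ['NODE' => 'node[^\w\d]', 'ID' => '\w(\w|\d)*', 'WS' => '\s+'];

$split  = scan('node1', $literal);     // NODE "node", then ID "1"
$whole  = scan('node1', $workaround);  // a single ID "node1"
$greedy = scan('node x', $workaround); // NODE "node " (space included), then ID "x"
```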
  • Greg Beaver  [2006-07-04 23:50 UTC]

    Thanks for trying the package out so extensively, Alex :)

    1) Skip token has been implemented from the beginning. Simply use this as the action:

    {return false;}

    If you change state and wish to re-process the token, use something like:

    {$this->yybegin(self::NEWSTATE);return true;}

    To do what is known as "yymore" in flex, use:

    {return 'more';}

    yymore basically instructs the scanner to ignore the matching rule and try against the remaining rules.
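As a rough illustration of how a scanner loop might act on these return values, here is a hypothetical driver. `nextToken()`, the rule triples, and the closures are assumptions made for this sketch, not the generated code. A `null` return is treated as "accept the token", and the state-change case (`true`) is left out for brevity.

```php
<?php
// Hypothetical driver; names and structure are assumptions, not generated code.
// Each rule is [name, regex, action]; the action's return value decides
// whether the match is accepted (null), skipped (false), or declined ('more').
function nextToken($input, &$pos, array $rules)
{
    $len = strlen($input);
    while ($pos < $len) {
        $skipped = false;
        foreach ($rules as $rule) {
            list($name, $regex, $action) = $rule;
            if (!preg_match('/\G(?:' . $regex . ')/', $input, $m, 0, $pos) || $m[0] === '') {
                continue;
            }
            $result = $action($m[0]);
            if ($result === 'more') {
                continue;          // ignore this rule, try the remaining ones
            }
            $pos += strlen($m[0]);
            if ($result === false) {
                $skipped = true;   // skip token: consume it, emit nothing
                break;
            }
            return [$name, $m[0]]; // accept the token
        }
        if (!$skipped) {
            throw new RuntimeException("No rule accepted input at offset $pos");
        }
    }
    return null; // end of input
}

$rules = [
    ['WS', '\s+', function ($text) { return false; }],
    // Declines short words so that a later rule can claim them.
    ['LONGWORD', '[a-z]+', function ($text) { return strlen($text) < 4 ? 'more' : null; }],
    ['WORD', '[a-z]+', function ($text) { return null; }],
];

$pos = 0;
$tokens = [];
while (($tok = nextToken('ab  lexer', $pos, $rules)) !== null) {
    $tokens[] = $tok;
}
// $tokens: WORD "ab" (LONGWORD declined it), then LONGWORD "lexer"; whitespace skipped.
```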

    For your second problem, use a lookahead, like this simple rule:

    NODE = /node(?=\s)/
    ID = /\w(\w|\d)*/

    Then, to be sure you also match a NODE at the end of input, use:

    NODE {}
    ID {}
    "node" {}

    The lexer processes rules in strict order, so that the first rule to match wins. Many lexers attempt to optimize regular expressions passed in. In my experience this leads to unpredictable behavior, including the inability to do fine-grained matching. In short, it makes it impossible to write a good lexer.
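The rule order above can be checked with plain PCRE, assuming the rules behave like ordered alternatives tried at the start of the input. `firstRule()` and the `literal` key are illustrative names only. Under that assumption, a bare "node" at the end of input is claimed by ID before the literal rule is reached. A negative lookahead such as `node(?!\w)`, which nobody in this thread proposed, may be worth testing as an alternative.

```php
<?php
// Plain PCRE check of the rule order; the helper name is illustrative.
function firstRule($input, array $rules)
{
    foreach ($rules as $name => $regex) {
        // \G anchors the match at offset 0; the first matching rule wins.
        if (preg_match('/\G(?:' . $regex . ')/', $input, $m)) {
            return [$name, $m[0]];
        }
    }
    return null;
}

// Rules in the order given in the comment above.
$rules = [
    'NODE'    => 'node(?=\s)',
    'ID'      => '\w(\w|\d)*',
    'literal' => 'node',
];

$a = firstRule('node [color=grey]', $rules); // NODE "node": whitespace follows
$b = firstRule('node1 -> node2', $rules);    // ID "node1": the lookahead fails
$c = firstRule('node', $rules);              // ID "node": ID is tried before the literal

// Alternative not from the thread: a negative lookahead also covers end of input.
$alt = [
    'NODE' => 'node(?!\w)',
    'ID'   => '\w(\w|\d)*',
];

$d = firstRule('node', $alt);  // NODE "node"
$e = firstRule('node1', $alt); // ID "node1"
```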

    PHP_LexerGenerator allows complete control over the regular expressions and their speed: you are responsible for optimizing and ordering, but it will always be possible to look at a lexer .plex file and figure out how it will work. This is not the case for any other lexer generator (which is why your example "just works" in other lexer generators).

    Hope this explains some of the design choices better.