
Request #8879 lexer not picking longest match
Submitted: 2006-10-06 20:17 UTC
Modified: 2006-10-12 13:42 UTC
From: hholzgra
Assigned: cellog
Status: Closed
Package: PHP_LexerGenerator (version CVS)
PHP Version: 5.2.0 RC4
OS: linux
Roadmaps: (Not assigned)

 [2006-10-06 20:17 UTC] hholzgra (Hartmut Holzgraefe)
Description:
------------
PHP_LexerGenerator picks the first matching expression instead of the longest matching expression. This is different to the behavior of re2c and flex :(

Test script:
---------------
<?php
class plexBug
{
    private $data;
    public $token;
    public $value;
    private $line;
    private $count;

    function __construct($data)
    {
        $this->data = $data;
        $this->count = 0;
        $this->line = 1;
    }

/*!lex2php
%input $this->data
%counter $this->count
%token $this->token
%value $this->value
%line $this->line
whitespace = /[ ]+/
name = /[_a-zA-Z][_a-zA-Z0-9]+/
bool = /bool/
*/
/*!lex2php
whitespace {
    echo "WHITESPACE\n";
}
bool {
    echo "BOOL: {$this->value}\n";
}
name {
    echo "NAME: {$this->value}\n";
}
*/
}

$l = new plexBug("foobar ");
$l->yylex();
$l->yylex();
echo "---\n";

$l = new plexBug("bool ");
$l->yylex();
$l->yylex();
echo "---\n";

$l = new plexBug("boolx ");
$l->yylex();
$l->yylex();
echo "---\n";
?>

Expected result:
----------------
NAME: foobar
WHITESPACE
---
BOOL: bool
WHITESPACE
---
NAME: boolx
WHITESPACE

Actual result:
--------------
NAME: foobar
WHITESPACE
---
BOOL: bool
WHITESPACE
---
BOOL: bool
NAME: x
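The first-match behavior described in this report can be sketched in plain PHP, independent of the generator. This is a minimal illustration, not PHP_LexerGenerator's actual generated code: the `$rules` table and the `firstMatch()` helper are hypothetical names, the rule order is assumed to follow the handler block of the test script, and the syntax assumes a modern PHP (7.1 or later) rather than the 5.2 release named in the ticket.

```php
<?php
// Hypothetical rule table, in the order the handlers appear in the test script.
// \G anchors each pattern at the current scan offset.
$rules = [
    'WHITESPACE' => '/\G[ ]+/',
    'BOOL'       => '/\Gbool/',
    'NAME'       => '/\G[_a-zA-Z][_a-zA-Z0-9]+/',
];

// First-match strategy: return the first rule that matches at $pos,
// even if a later rule would match a longer lexeme.
function firstMatch(array $rules, string $input, int $pos = 0): ?array
{
    foreach ($rules as $token => $regex) {
        if (preg_match($regex, $input, $m, 0, $pos) === 1) {
            return [$token, $m[0]];
        }
    }
    return null;
}

[$token, $lexeme] = firstMatch($rules, 'boolx ');
echo "$token: $lexeme\n"; // prints "BOOL: bool"
```

Because BOOL is tried before NAME, the scan stops at "bool" and never considers the longer lexeme "boolx".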


 [2006-10-07 20:20 UTC] cellog (Greg Beaver)
This is expected behavior and, in my experience, far better. However, you are wrong about re2c: in my experience, it matches the first regex. This system requires you to organize the patterns from most important to least important, and it is deterministic, which makes it far easier to debug a faulty regex. I won't be removing this feature :)
 [2006-10-09 04:43 UTC] hholzgra (Hartmut Holzgraefe)
If I put the "bool" pattern first, I get a BOOL token back for both "bool" and "boolx":

--------
/*!lex2php
...
bool = 'bool'
name = /[_a-zA-Z][_a-zA-Z0-9]*/
*/
/*!lex2php
bool {
    echo "BOOL: '{$this->value}'\n";
}
name {
    echo "NAME: '{$this->value}'\n";
}
*/
------

If I put the name pattern first, both "bool" and "boolx" match the NAME pattern instead:

--------
/*!lex2php
...
name = /[_a-zA-Z][_a-zA-Z0-9]*/
bool = 'bool'
*/
/*!lex2php
name {
    echo "NAME: '{$this->value}'\n";
}
bool {
    echo "BOOL: '{$this->value}'\n";
}
*/
------

So how would I achieve the expected behavior, where "boolx" is a name and not a combination of the keyword "bool" and the name "x"?

With re2c, putting the name pattern first makes it match both "bool" and "boolx" without ever reaching the bool keyword pattern, but with the bool pattern first it matches "bool" as a keyword and "boolx" as a name, as expected.

IMHO the re2c behavior is the right one, and the current plex behavior makes it pretty much useless to me, I'm afraid.
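The re2c-style behavior described in this comment (longest lexeme wins, with ties going to the rule listed first) could be sketched as follows. This is only an illustration under stated assumptions: `longestMatch()`, `$boolFirst`, and `$nameFirst` are hypothetical names, the keyword is written as a regex rather than the quoted string `'bool'`, and the code targets modern PHP (7.1 or later). It is not how re2c or PHP_LexerGenerator is implemented.

```php
<?php
// Longest-match strategy: try every rule at $pos and keep the longest lexeme.
// The strict ">" comparison means that on a tie the earlier rule is kept.
function longestMatch(array $rules, string $input, int $pos = 0): ?array
{
    $best = null;
    foreach ($rules as $token => $regex) {
        if (preg_match($regex, $input, $m, 0, $pos) === 1
            && ($best === null || strlen($m[0]) > strlen($best[1]))) {
            $best = [$token, $m[0]];
        }
    }
    return $best;
}

// The same two rules in both orders; \G anchors at the scan offset.
$boolFirst = ['BOOL' => '/\Gbool/', 'NAME' => '/\G[_a-zA-Z][_a-zA-Z0-9]*/'];
$nameFirst = ['NAME' => '/\G[_a-zA-Z][_a-zA-Z0-9]*/', 'BOOL' => '/\Gbool/'];

foreach (['bool ', 'boolx '] as $input) {
    [$token, $lexeme] = longestMatch($boolFirst, $input);
    echo "bool first: $token '$lexeme'\n";
    [$token, $lexeme] = longestMatch($nameFirst, $input);
    echo "name first: $token '$lexeme'\n";
}
```

With the bool rule first, "bool" comes back as BOOL and "boolx" as NAME. With the name rule first, both come back as NAME, because the tie on "bool" goes to the rule listed first.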
 [2006-10-09 09:53 UTC] cellog (Greg Beaver)
Use a lookahead assertion (see the preg docs for more examples):

BOOL = /bool(?![a-zA-Z_])|bool$/
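To see what the suggested pattern accepts, it can be tried directly with `preg_match`. The helper `isBoolKeyword()` is a hypothetical name, and the leading `^` with a non-capturing group is added here only to pin the test to the start of the string; the pattern body is the one from the comment.

```php
<?php
// Returns true when $input starts with the keyword "bool" that is NOT
// followed by a letter or underscore (negative lookahead), or when "bool"
// is the very last thing in the input.
function isBoolKeyword(string $input): bool
{
    return preg_match('/^(?:bool(?![a-zA-Z_])|bool$)/', $input) === 1;
}

var_dump(isBoolKeyword('bool '));  // bool(true)
var_dump(isBoolKeyword('boolx ')); // bool(false)
var_dump(isBoolKeyword('bool'));   // bool(true)
var_dump(isBoolKeyword('bool1'));  // bool(true): digits are not in the lookahead class
```

The last case is worth noting: because the lookahead class omits digits, "bool1" is still taken as the keyword. If names may contain digits, as in the name pattern of this report, widening the class to `[a-zA-Z0-9_]` would be one possible adjustment.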
 [2006-10-12 13:42 UTC] cellog (Greg Beaver)
This bug has been fixed in CVS.

If this was a documentation problem, the fix will appear by the end of next Sunday (CET). If this was a problem with the website, the change should be live shortly. Otherwise, the fix will appear in the package's next release.

Thank you for the report and for helping us make PEAR better.

This is now changed to a feature request, and is implemented in CVS. Unfortunately, the generated lexers will be slower, but they will be accurate. To use it, add "%longestmatch 1" to the declaration comment.
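As a sketch of what that might look like, here is the declaration comment from the original test script with the new directive added. Its exact position among the other `%` directives is an assumption; the comment above only says that it belongs in the declaration comment.

```php
<?php
class plexBug
{
    // ... properties and constructor as in the test script ...

/*!lex2php
%input $this->data
%counter $this->count
%token $this->token
%value $this->value
%line $this->line
%longestmatch 1
whitespace = /[ ]+/
name = /[_a-zA-Z][_a-zA-Z0-9]+/
bool = /bool/
*/
}
```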