Greg Beaver [2005-01-27 22:10 UTC]
I would be interested in seeing a package like this in PEAR. Your PHP source link appears to be spitting out plain text, and I think you mean PHP license version 3 (it's not the same as the PHP version).
Justin Patrin [2005-01-27 22:15 UTC]
You say in the proposal that this is PHP 4 & 5, but your description says it's written in PHP5.
Did you follow the Exception guidelines that were much talked about a little while ago?
Louis Mullie [2005-01-27 23:11 UTC]
Hello,
Sorry about the .phps issue; it looks like my host doesn't support it, so I converted it on my system. I also changed the license version to 3. (I also added a comment referring to RFC::Header Comment Blocks; it states PHP versions 4 & 5.)
I did follow the Exception guidelines as closely as possible, but some points of it were unclear. Since the class is not very complicated, there is really no need to rethrow/bubble up exceptions. There is one exception base class (descending from PEAR_Exception) and all other exceptions are children of this one.
Ryan King [2005-01-28 00:31 UTC]
This would be a great addition to PEAR. I hope it is soon followed by a parser package (or a parser generator package).
Alan Knowles [2005-01-28 01:16 UTC]
I don't think this should be called purely "Lexer"; it's not a classic lexer in the true sense (as you say, it doesn't create optimized parse classes).
PHP_LexerRegex may be more specific.
Alan Knowles [2005-01-28 01:24 UTC]
Just thinking a bit more on this.
The exceptions should look like
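(presumably something along these lines; the subclass name below is illustrative, and the PEAR_Exception stub is only there to make the sketch self-contained:)

```php
<?php
// Stand-in so the sketch runs on its own; the real package would
// require_once 'PEAR/Exception.php' instead.
if (!class_exists('PEAR_Exception')) {
    class PEAR_Exception extends Exception {}
}

// Package base class: the name Lexer_Exception maps to the file
// Lexer/Exception.php under PEAR's one-class-per-file convention.
class Lexer_Exception extends PEAR_Exception {}

// A more specific failure extends the package base class
// (this subclass name is illustrative, not from the thread).
class Lexer_InvalidRuleException extends Lexer_Exception {}
```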
which allows you to optionally split them into a Lexer/Exception.php file - and still be clear where the code would be found.
Louis Mullie [2005-01-28 01:35 UTC]
Hello,
Thanks for the comment. I changed the exception class names as you stated. I'm not sure about PHP_RegexLexer as a name, but I don't have any other suggestions right now... I'll try to think of a better name.
Nikolas Coukouma [2005-01-28 02:28 UTC]
Do you have plans to produce a finite state machine version?
If so, it might make sense to have classes:
Lexer - factory
Lexer_Common - interface definition
Lexer_Regex - a subclass/implementation of Lexer_Common
It would make the implementation clear while providing an opportunity for other implementations.
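A minimal sketch of that layout (the method names and the whitespace-splitting body are guesses, not from the proposal):

```php
<?php
// Lexer_Common: the shared interface (tokenize() is a guessed name).
interface Lexer_Common {
    public function tokenize($input);
}

// Lexer_Regex: one implementation; here it trivially splits on
// whitespace, standing in for the real parallel-regex logic.
class Lexer_Regex implements Lexer_Common {
    public function tokenize($input) {
        return preg_split('/\s+/', trim($input));
    }
}

// Lexer: the factory that picks an implementation by name.
class Lexer {
    public static function factory($driver) {
        $class = 'Lexer_' . $driver;
        return new $class();
    }
}

$lexer = Lexer::factory('Regex');
print_r($lexer->tokenize('foo  bar'));
```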
As for the issue of compilation, it's not particularly difficult or useless. You're already compiling the grammar into a regular expression (a particular string representation).
If you were generating a finite state machine, you might store it as a class (possibly in string form as well). The SOAP package does this internally.
You could generalize getParallelRegex and rename it to something like getCompiledForm. You could then make the compiled form an optional parameter to the constructor. If it's provided, then you don't need to recompile the grammar.
I'd also move it into the Lexer class because it depends more on that implementation than on Grammar itself.
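A hypothetical sketch of that optional-parameter idea (the RegexLexer name and single-alternation compile step are stand-ins, not the package's code):

```php
<?php
// Hypothetical RegexLexer: the constructor takes an optional
// precompiled form; when it is given, grammar compilation is skipped.
class RegexLexer {
    private $regex;

    public function __construct(array $grammar, $compiled = null) {
        // reuse the cached form if the caller provides one
        $this->regex = ($compiled !== null)
            ? $compiled
            : $this->compile($grammar);
    }

    // expose the compiled string so callers can cache it themselves
    public function getCompiledForm() {
        return $this->regex;
    }

    private function compile(array $grammar) {
        // join the token patterns into a single alternation
        return '/' . implode('|', $grammar) . '/';
    }
}

$lexer  = new RegexLexer(array('\d+', '[a-z]+'));
$cached = $lexer->getCompiledForm();        // the compiled string
$reused = new RegexLexer(array(), $cached); // no recompilation
```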
Michael Wallner [2005-01-28 09:27 UTC]
First, I found Nikolas' comment very reasonable.
I think the package would be a good addition to PEAR, but should go into Text category (as Alan mentioned).
We should recategorize many of the packages within the PHP category anyway (credits to Pierre :).
Bertrand Gugger [2005-01-28 15:52 UTC]
A grammar/BNF simple interpreter is great.
The code is quite rough and supposes the grammar respects some simple rules (not cyclic?).
I'm also worried about people wanting some '/' in their patterns.
Why compile again (build the regexp) if we want to analyse several strings with the same grammar? (Seconding what Nikolas said.)
There is already an FSM package in PEAR.
It would be valuable to see a very simple example of this "Lexer".
Louis Mullie [2005-01-28 23:21 UTC]
Nikolas:
I will definitely use a structure of this sort, great idea. My understanding of finite state machines is quite basic, so I still have to learn more before I code an FSM version. There is already a port of phpLex (a port of JLex to C# that produces PHP code) to PHP (http://cvs.joshuaeichorn.com/cvs.php/phpLex?Horde=fb50029ef79f4b20489a7047830a435d) that produces a finite state machine; I could definitely adapt it to PEAR.
"I'm also worried about people wanting some '/' in their patterns."
They are escaped automatically.
"Why compile again (build the regexp) if we want to analyse several strings with the same grammar?"
I'll definitely cache the regex, and as Nikolas said I'll add an option to pass an already compiled form.
The implementation I suggest is (largely based on what Nikolas said):
o class Lexer_Grammar: grammar as it is now, with some modifications to conform to the FLEX way (i.e. define identifiers, then add patterns associated with callbacks)
o class Lexer_Grammar_Flex extends Lexer_Grammar : ability to convert a FLEX file to a compatible grammar definition
o abstract class Lexer : factory for any type of Lexer and interface definition
o class Lexer_Regex : implementation as it is now
o class Lexer_FSM : eventually, implementation of a finite state machine based Lexer
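For illustration, a FLEX-style Lexer_Grammar definition might look roughly like this (the addToken() method and the token names are guesses at the shape, not the actual interface):

```php
<?php
// Hypothetical sketch of a FLEX-style grammar: named token
// identifiers, each tied to a pattern and an optional callback.
class Lexer_Grammar {
    public $rules = array();

    public function addToken($name, $pattern, $callback = null) {
        $this->rules[$name] = array($pattern, $callback);
    }
}

// Callback run when a NUMBER lexeme is matched.
function makeInt($text) {
    return (int) $text;
}

$grammar = new Lexer_Grammar();
$grammar->addToken('NUMBER', '\d+', 'makeInt');
$grammar->addToken('WORD', '[a-z]+');
```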
Any comments on this?
Joshua Eichorn [2005-01-28 23:52 UTC]
The URL to phpLex isn't really what you want.
Alan did the actual port; I started the work of porting the C# code to PHP so you could generate new parsers without getting Mono running.
But I didn't finish because it's mind-numbingly boring.
Anyhow I think the full version of phpLex is at: http://cvs.sourceforge.net/viewcvs.py/php-sharp/phpLex/
Louis Mullie [2005-01-29 00:08 UTC]
I gave a little more thought to this, and due to interface conflicts, the FSM version would have to be a completely independent class, as its interface would be split in two (the grammar in the generator and the tokenizing in the compiled class).
Harry Fuecks [2005-02-22 09:16 UTC]
Looks cool. I know the original Lexer from SimpleTest quite well now, from building a wiki parser based on it.
Think the approach you've used, returning the matches as an array (vs. callbacks to functions) will make this more appealing to many plus deliver better performance.
At the same time, it might be worth introducing the "modes" that were in the original Lexer - these basically solve the state machine problem. You might have something like:
// 'base' is the initial state
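Something in the spirit of SimpleLexer's addEntryPattern()/addExitPattern() calls; the class below is only a hypothetical stub showing the shape of the API, not the real SimpleTest implementation:

```php
<?php
// Minimal stand-in for SimpleTest's SimpleLexer, just enough to show
// the mode API; the real class also performs the matching.
class ModalLexer {
    public $modes = array();

    // while in $mode, seeing $pattern pushes the lexer into $new_mode
    public function addEntryPattern($pattern, $mode, $new_mode) {
        $this->modes[$mode][] = array($pattern, $new_mode);
    }

    // while in $mode, seeing $pattern pops back to the previous mode
    public function addExitPattern($pattern, $mode) {
        $this->modes[$mode][] = array($pattern, null);
    }
}

$lexer = new ModalLexer();
// in 'base' mode, <pre> switches the lexer into a 'pre' mode...
$lexer->addEntryPattern('<pre>', 'base', 'pre');
// ...where only the exit pattern applies, so other rules stay quiet
$lexer->addExitPattern('</pre>', 'pre');
```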
Now no other rules will be applied while inside the <pre /> tags
Does it support users providing regexes that contain parentheses? I wonder because the version I know escapes these, to disable subpatterns and prevent conflicts with the internal regex subpatterns (I had to hack it so I could use lookbehinds/lookaheads and set regex options from inside the expression).
Also, would it be worth having an alternative UTF-8 version of the Lexer? Personally have found it's doable using the /u pattern modifier in all the pcre calls as well as having alternative implementations of some of the str fns. Have a few you're welcome to re-use here: http://cvs.sourceforge.net/viewcvs.py/xmlrpccom/dokuwiki/lib/utf8_string.php?rev=1.5 - from the looks of it that would only leave you needing to reimplement substr_count()
Following Alan's suggestion on the package name, what about Text_Lexer_Regex? Could imagine there might be a Text_Lexer_Str, based on the string functions, someday.
Louis Mullie [2005-02-25 00:31 UTC]
Hello,
Thanks for your comment. The new version I just released also allows callbacks, but optionally, and still returns a token stack at the end -- as for the naming, if you read the "about" section of the homepage, you'll see that it is much more pertinent to call it "Lexer", now.
How were nested tags handled in the SimpleTest lexer? Or were they handled at all? I think it would be as good to do:
$grammar = new Lexer_Grammar;
as to deal with states: the goal of the regex lexer is to be as simple as possible.
> Now no other rules will be applied while inside the <pre /> tags
With this new version, the pattern above will match the longest text, so the <pre> rule will have priority and nothing else will be matched inside the tags.
I'm a total newbie to encodings -- is UTF-8 widely used in the kinds of text likely to be parsed by this lexer?
>Does it support users providing regexes
>that contain parenthesis? I wonder
>because the version I know escapes
>these, to disable subpatterns and
>prevent conflicts with the internal
>regex subpatterns (had to hack it so I
>could use lookbehinds / lookaheads and
>setting regex options from inside the expression)
The substr_count() calls do just this: they count the parentheses, and an array mapping each preg_match_all() key to a token type is created.
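The idea can be sketched like this (a simplified stand-in with hypothetical token patterns, not the package's actual code):

```php
<?php
// Simplified parallel-regex sketch: every token pattern becomes one
// capturing group, and counting '(' in earlier patterns tells us
// which match index belongs to which token type.
$patterns = array('NUMBER' => '\d+', 'WORD' => '[a-z]+');

$map    = array();
$offset = 1;                          // group 0 is the whole match
foreach ($patterns as $type => $p) {
    $map[$offset] = $type;
    // a pattern's own subgroups shift the later indexes along
    $offset += 1 + substr_count($p, '(');
}

$regex = '/(' . implode(')|(', $patterns) . ')/';

preg_match_all($regex, 'abc 42 def', $matches, PREG_SET_ORDER);
$tokens = array();
foreach ($matches as $set) {
    for ($i = 1; $i < count($set); $i++) {
        if ($set[$i] !== '') {        // the one group that matched
            $tokens[] = array($map[$i], $set[$i]);
            break;
        }
    }
}
// $tokens now holds WORD abc, NUMBER 42, WORD def
```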
Harry Fuecks [2005-03-01 20:46 UTC]
"How were nested tags handled in the SimpleTest lexer?"
Looking at the code again (http://cvs.sourceforge.net/viewcvs.py/simpletest/simpletest/parser.php?rev=1.66) the SimpleTest lexer offers an API to end users which is working at a higher level, the state machine being "bundled" with the lexing capabilities. You'll notice with methods like SimpleLexer::addEntryPattern() that it creates new instances of ParallelRegex - which itself is more or less equivalent to your Lexer. So I guess someone could build that on top of what you already have.
"is UTF-8 widely used in things susceptible of being parsed by this lexer ?"
By no means an expert on this but here's my take.
Basically we should (as web developers) all be converging on UTF-8, but parsing, in particular, could become a problem, depending on what your tokens are, particularly where a specific number of characters is being searched for.
It's difficult to give a full example here, as this page is encoded as ISO-8859-1 (Western Europe), but there's a good starting point here: http://www.intertwingly.net/blog/2005/03/01/Yahoo-Search-and-I18n
For PHP the problem is that all the string functions regard a character as always being a single byte (as it is for the 128 ASCII chars). But in the example on that blog, some of those characters require multiple bytes to represent correctly.
That would mean if you had a regex that was intended to match sequences of three characters, separated by word boundaries, like:
You would match 3 ASCII characters but not three multibyte characters. You could use the /u pattern modifier to instruct the Perl regex engine that the text is UTF-8 encoded (assuming it is) but in your case you'd probably also need to be careful using substr() when reducing the remaining text to parse.
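A small illustration of the difference (not from the original thread):

```php
<?php
// 'été' is three characters but five bytes in UTF-8.
$word = "\xC3\xA9t\xC3\xA9";  // "été" written out byte by byte

// Without /u, PCRE matches byte by byte: five bytes, so no match.
$bytes = preg_match('/^.{3}$/', $word);

// With /u, PCRE counts UTF-8 characters: three, so it matches.
$chars = preg_match('/^.{3}$/u', $word);
```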
Derick Rethans has some more useful stuff up here: http://www.derickrethans.nl/files/wereldveroverend-ffm2004.pdf
Also the best general read I've found is http://www.cs.tut.fi/~jkorpela/chars.html
Louis Mullie [2005-03-17 22:23 UTC]
I'll definitely think of this for a future version (maybe when a UTF package is implemented?), but I'll wait until this has been released and tested a bit longer before I add such a feature. Maybe you could (if you have any time) release a UTF string-functions static class for PEAR?