This package provides support for indexing XML files. It assists you in
It makes use of an XPath-like syntax, and currently stores indexes on the local filesystem.
Example 1 - Numeric index
Example 2 - Attribute value
XML_Indexing does not fully support XPath. It uses an XPath-like syntax to easily locate a subset of the whole XML document. The rule, in this regard, is that every supported expression must be valid XPath, but the converse isn't true. The current plan isn't to support the whole XPath language, but to make a working implementation that addresses what a big XML file is expected to contain.
What does that mean? What does a big XML file usually contain? In my own experience, a lot of repetitive blocks, like RDF, or any other RDB (native XML database) format. Indexing that kind of data, to allow rapidly seeking through the file, is mainly a matter of using numerical indexes (e.g. get me the 12333rd row of that RDB file).
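To make the numeric-index idea concrete, here is a minimal sketch of a byte-offset index over repetitive blocks. It is an illustration only, not XML_Indexing's code or index format: the row element, the sample data, the fetchRow() helper and the in-memory $index array are all assumptions, and a regular expression stands in for the Expat-based scan.

```php
<?php
// Hypothetical sketch: NOT XML_Indexing's actual implementation or index format.
$xml = '<rows><row id="a">first</row><row id="b">second</row><row id="c">third</row></rows>';
$path = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($path, $xml);

// Build step: scan the data once and record [byte offset, byte length] for
// every <row>...</row> block. A regex is enough here because the sample rows
// are flat; a real builder would use a streaming parser instead.
preg_match_all('~<row\b.*?</row>~s', $xml, $matches, PREG_OFFSET_CAPTURE);
$index = [];
foreach ($matches[0] as [$text, $offset]) {
    $index[] = [$offset, strlen($text)];
}

// Lookup step: seek straight to the block instead of parsing the whole file.
// XPath positions are 1-based, so row N lives at $index[N - 1].
function fetchRow(string $path, array $index, int $n): string
{
    [$offset, $length] = $index[$n - 1];
    $fp = fopen($path, 'rb');
    fseek($fp, $offset);
    $block = fread($fp, $length);
    fclose($fp);
    return $block;
}

$second = fetchRow($path, $index, 2);
unlink($path);

echo $second, "\n"; // <row id="b">second</row>
```

The lookup cost depends on the size of the requested block, not on the size of the file, which is where the speed-up for big files would come from.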
Attribute-based search is pretty handy too, so it is supported as well, but XPath functions are not in my current plan.
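Since every supported expression must also be valid XPath, the two kinds of expression discussed here (a numeric index and an attribute value) can be tried against PHP's standard DOM extension. The sketch below uses only DOMDocument and DOMXPath, not the XML_Indexing API; the sample document and the element and attribute names are assumptions.

```php
<?php
// Plain DOM/XPath, shown only to illustrate the expression syntax.
// The sample document is made up for this example.
$xml = '<foo><bar attr="a">one</bar><bar attr="value">two</bar><bar attr="c">three</bar></foo>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Numeric index: the second <bar> child of <foo> (XPath positions start at 1).
$byPosition = $xpath->query('/foo/bar[2]');

// Attribute value: every <bar> whose attr attribute equals "value".
$byAttribute = $xpath->query('/foo/bar[@attr="value"]');

echo $byPosition->item(0)->textContent, "\n";  // two
echo $byAttribute->item(0)->textContent, "\n"; // two
```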
Once again, the plan is not to provide bloated software that tries to reinvent the wheel by implementing the whole of XPath. That would be in total contradiction with the goal of this package: speed.
The idea is to address about 80% of the needs when it comes to accessing big XML files.
In the future, there may be additional classes in this package (e.g. XML_Indexing_PowerReader) that would support all of the XPath language, but these wouldn't replace the much lighter XML_Indexing_Reader class. They would be an alternative for people with complex needs (the remaining 20%).
So, now, what are these supported XPath-like expressions?
I recognize this is somewhat limited, and support for things like multiple attributes ([@name='value' or @otherName='otherValue']), as well as multiple expressions (expr1|expr2), should come soon.
What is not supported:
Current problems with PHP5
The Expat parser is buggy in PHP5, especially the xml_get_current_byte_index() function, which XML_Indexing heavily relies on.
For details, please see the following PHP bug: http://bugs.php.net/bug.php?id=30257
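For readers unfamiliar with that function, here is a minimal sketch of how a byte index can be read from inside a start-element handler. It only uses PHP's standard XML parser functions; the sample document, the row element name and the $offsets array are assumptions, and this is not taken from the package's own builder. No expected output is shown, because the value reported for a given tag may differ between PHP builds.

```php
<?php
// Sketch only: sample data and names are made up for illustration.
$xml = '<rows><row>a</row><row>b</row></rows>';

$offsets = [];
$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start handler: record the parser's current byte index for each <row>.
    function ($parser, $name, $attrs) use (&$offsets) {
        if ($name === 'row') {
            $offsets[] = xml_get_current_byte_index($parser);
        }
    },
    // End handler: nothing to do in this sketch.
    function ($parser, $name) {
    }
);

xml_parse($parser, $xml, true);
var_dump($offsets);
```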
When writing this first implementation, this issue upset me and I decided to make XML_Indexing bugproof by hacking a workaround. It works fine but, as reported by Christian Stocker, it eats memory when building indexes.
This is a very sad point, because I made a point of using a buffer with Expat parsing, so that huge files (gigabytes) can be parsed. They are never entirely loaded in memory; a 1 MB buffer is used.
For now, this feature will only work with PHP4, though. On PHP5, you should expect slow index building and heavy memory usage.
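The buffered approach described above can be sketched with PHP's standard XML parser functions: the file is read in fixed-size chunks and each chunk is handed to xml_parse(), so only one buffer is held in memory at a time. The file contents, the row element name and the small demo buffer size are assumptions made for this example; it is not the package's builder code.

```php
<?php
// Sketch only: generate a throwaway sample file to parse.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<rows>' . str_repeat('<row>x</row>', 1000) . '</rows>');

$count = 0;
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    // Count <row> elements as they stream past.
    function ($parser, $name, $attrs) use (&$count) {
        if ($name === 'row') {
            $count++;
        }
    },
    function ($parser, $name) {
    }
);

// 1024 bytes keeps the demo small; the text above mentions a 1 MB buffer.
$bufferSize = 1024;
$fp = fopen($path, 'rb');
while (!feof($fp)) {
    $chunk = fread($fp, $bufferSize);
    // The third argument tells the parser whether this is the last chunk.
    if (!xml_parse($parser, $chunk, feof($fp))) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}
fclose($fp);
unlink($path);

echo $count, "\n"; // 1000
```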
Namespace support is pretty rudimentary for now, but namespaces are handled as part of the indexing process. That is, at index building time, namespace declarations are extracted and then stored in the index, for speed.
Example 3 - Retrieving namespaces definitions
For a Mozilla RDF file, this should produce something like:
There is another Expat bug in PHP5 that prevents setting up XML namespace declaration handlers when parsing (see http://bugs.php.net/bug.php?id=30061). I've hacked another workaround for this issue, which manually checks all attributes for an 'xmlns:' prefix (see XML/Indexing/Builder.php). According to my tests, the overhead seems acceptable.
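The attribute check can be sketched as follows. This is a hypothetical reconstruction of the idea, not the code from XML/Indexing/Builder.php: the sample document and the $namespaces array are assumptions, and it relies on a parser created without namespace support passing xmlns:* attributes to the start-element handler like any other attribute.

```php
<?php
// Sketch only: a small RDF-like sample, made up for this example.
$xml = '<RDF:RDF xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
     . ' xmlns:NC="http://home.netscape.com/NC-rdf#">'
     . '<RDF:Description NC:name="x"/></RDF:RDF>';

$namespaces = [];
$parser = xml_parser_create(); // no namespace support requested
// Keep attribute names as written so the 'xmlns:' prefix can be matched.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start handler: collect prefix => URI for every xmlns:* attribute.
    function ($parser, $name, $attrs) use (&$namespaces) {
        foreach ($attrs as $attrName => $value) {
            if (strncmp($attrName, 'xmlns:', 6) === 0) {
                $namespaces[substr($attrName, 6)] = $value;
            }
        }
    },
    function ($parser, $name) {
    }
);

xml_parse($parser, $xml, true);
print_r($namespaces);
```

The check runs on every attribute of every element, which is the overhead mentioned above. A default namespace declared with a bare xmlns attribute has no 'xmlns:' prefix and is not collected by this sketch.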
The following graph shows the speed ratio I got on my test system when comparing access times for pure DOM with XML_Indexing + DOM.
I used twenty RDF files, ranging from 300 KB to 7 MB, with a simple attribute-based search ('/foo/bar[@attr="value"]').
Results are pretty encouraging, with XML_Indexing + DOM (7 ms) running 20 times faster than pure DOM (89 ms) for a small 700 KB RDF file. Close to 7 MB, this ratio goes up to 105 times faster (19 ms compared to more than 2 seconds for pure DOM).