Proposal for "XML_Indexing"

» Metadata » Status
» Description

Introduction

This package provides support for indexing XML files. It assists you in
creating and using such indexes in order to reduce access time to local
XML files.

It makes use of an XPath-like syntax, and currently stores indexes on the local filesystem.
For now, it works transparently, that is : it will rebuild the index if the corresponding XML file changes.

Basic usage

Example 1 - Numeric index


<?php

$reader = new XML_Indexing_Reader ('test.xml');

// The first time it is called, the following is going
// to build a "numeric" index for all /foo/bar occurences
// in the document : this takes some time and memory.
// Subsequent calls will then use this index, if they
// are "numeric", that is /foo/bar[n]
$reader->find('/foo/bar[232]'); 

echo "Extracted XML data : \n";
while ($xml = $reader->next()) {
    echo $xml;
}
?>

Example 2 - Attribute value


<?php

$reader = new XML_Indexing_Reader ('test.xml');

// Same as above, except that when first run, this one
// is going to index all of the "name" attributes attached
// to a /foo/bar node. Provided that you always use the
// same "name" attribute, this index will be used, whatever 
// value you use for the comparison in subsequent calls.
$reader->find('/foo/bar[@name="value"]'); 
 
echo "Extracted XML data : \n";
while ($xml = $reader->next()) {
    echo $xml;
}
?>

XPath-like expressions

XML_Indexing does not exactly supports XPath. It uses an XPath like syntax, to easily locate a subset of the whole XML document. The rule, in this regard, is that any of the supported expressions must be valid XPath, but the opposit isn't true. The current plan isn't to provide support for the whole XPath language, but to make a working implementation that will address what a big XML file is expected to contain.

What is this ? What usually contains a big XML file ? In my own experience, that is a lot of repetitive blocks, like RDF, or any other RDB (native XML database) format. And indexing that kind of data, to allow rapidly seeking through the file, is mainly a matter of using numerical indexes (ie: get me the 12333th row of that RDB file).

Now, it is pretty handy to allow attribute based search as well, but XPath functions are not in my current plan.

From here, once again, the plan is not to provide bloated software that will try to reinvent the wheel by implementing the whole of XPath. That would be in total contradiction with the goal of this package : speed.

The idea is to address about 80% of the needs when it comes to accessing big xml files.

Now, in the future, there may be additional classes in this package (ie: XML_Indexing_PowerReader), that would support all of the XPath language, but these wouldn't replace the much lighter XML_Indexing_Reader class. They would be an alternative for people with complex needs (the 20 other percents).

So, now, what are these supported XPath-like expressions ?

  • /foo/bar[n] : The n'th occurence(s) of bar in the scope of foo.
  • /foo/bar[@name] : The /foo/bar occurences which have a 'name' attribute
  • /foo/bar[@name='value'] : The /foo/bar occurences where the 'name' attribute equals 'value'

I recognize this is somehow limited and support for things like multiple attributes ([@name='value' or @otherName='otherValue']), as well as multiple expressions (expr1|expr2) should come soon.

What is not supported :

  • /foo/bar[n]/child : To match that, you need to issue a '/foo/bar[n]' and then locate the 'child' element by yourself. Once again, in my own opinion, this will suit 80% of the needs. In a RDF document for example, you can retrieve the whole RDF:Description branch, which will usually be a few dozen of bytes long, and then locate the NC:Whatever element using the DOM or XML_Serializer.
  • //foo : Currently only absolute addressing is supported, that is '/absolute/path/to/element' + some simple expression like [n] or [@name...]. This kind of "relative" expression may be supported in the future.

Current problems with PHP5

The Expat Parser is buggy in PHP5. Especially the xml_get_current_byte_index() function which XML_Indexing heavily relies on.

For details, please see the following PHP bug : http://bugs.php.net/bug.php?id=30257

When writing this first implementation, this issue upset me and I decided to make XML_Indexing bugproof by hacking a workaround. It works fine, but, as reported, by Christian Stocker, it eats memory when building indexes.

This is a very sad point, because I attached importance to use a buffer with Expat parsing, so that huge files (gigabytes) should be parsable. They are never entirely loaded in memory, a 1Mb buffer is used.

For now, this feature will only work with PHP4, though. On PHP5, you should expect slow index building and heavy memory usage.

Namespaces support

Namespaces support is pretty rudimentary for now, but these are handled as a part of the indexing process, that is : at index building time, namespaces declaration are extracted, and then stored in the index, for speed.

Example 3 - Retrieving namespaces definitions


<?php

$reader = new XML_Indexing_Reader ('test.rdf');

// find() has to be called before getNamespaces()
$reader->find('/RDF:RDF/RDF:Description[100]');

$nsList = $reader->getNamespaces();

print_r ($nsList);
?>

For a Mozilla RDF file, this should produce something like :


Array
(
    [NC] => http://home.netscape.com/NC-rdf#
    [RDF] => http://www.w3.org/1999/02/22-rdf-syntax-ns#
)

There is another Expat Bug in PHP5 that forbids setting up XML namespace declaration handlers when parsing (See : http://bugs.php.net/bug.php?id=30061). I've hacked another workaround for this issue, which manually checks all attributes for an 'xmlns:' prefix (see XML/Indexing/Builder.php). According to my tests, the overhead seems acceptable though.

Benchmark

The following graph shows the speed ratio I got on my test sytem, when comparing access times for pure DOM, with XML_Indexing + DOM.

I used twenty RDF files, ranging from 300Kb to 7Mb, using a simple attribute based search ('/foo/bar[@attr="value"]).

Results are pretty encouraging, with XML_Indexing + DOM (7ms) running 20 times faster than pure DOM (89ms), for a small 700Kb RDF file. When getting close to 7Mb, this ratio goes up to 105 times faster (19ms compared to more than 2 seconds for pure DOM).

http://www.samalyse.com/code/xml_indexing/pico/ratio.png

» Dependencies » Links
  • PEAR >= 1.2
» Timeline » Changelog
  • First Draft: 2004-10-03
  • Proposal: 2004-10-03
  • Call for Votes: 2004-10-10
  • Voting Extended: 2004-10-18
  • Olivier Guilyardi
    [2004-10-04 20:49 UTC]

    XML_Indexing (0.2)

    • Fixed a bug where /foo/bar[n] returned the n'th occurence of /foo/bar in the scope of the whole document : it will now return the n'th occurences of bar in the scope of foo. Thanks Christian Stocker
    • Fixing the above bug has a consequence : searching for /foo/bar[n] may now return several occurences, instead of a single one in the prior implementation : the numeric index format got modified consequently.
    • PEAR proposal update : added two sections to the proposal description : details about "XPath-like expressions", and "Current problems with PHP5"

    XML_Indexing (0.1)

    • This is the initial release
  • Olivier Guilyardi
    [2004-10-05 18:44 UTC]

    XML_Indexing (0.3)

    • Added rudimentary namespaces support (new section in the proposal Description as well)
    • PHP5's Expat bug workaround now eats less memory (still hungry though)
    • Removed the XML_Indexing_Reader::create() factory method
    • Added an option for zlib-compressing the indexes
    • Fixed a bug about the default temporary directory
    • Updated documentation headers