Unserializing XML documents into simple data structures

Sometimes you need to read an XML document but do not want to process it with a full-blown XML parser, because the result is only a handful of small, simple data structures: the simple PHP types string, integer, float, boolean and null, plus arrays for slightly more complex structures. Inspired by the PEAR package XML_Serializer, which also contains such an unserializer implementation, Stubbles provides the XML unserializer class net::stubbles::xml::unserializer::stubXMLUnserializer. The main differences from the PEAR implementation are that it adheres to PHP 5 rules and that support for unserializing into objects has been dropped. If you want to construct objects and object structures from XML files, please use XJConf for PHP, which is bundled with Stubbles.

Usage example

One of the most common usage scenarios of the XML unserializer is to work with documents like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  <item>
    <name>schst</name>
  </item>
  <item>
    <name>mikey</name>
  </item>
</root>

We want to transform this document into the following structure so that we can use it in our program:

Array
(
   [item] => Array
   (
       [0] => Array
           (
               [name] => schst
           )

       [1] => Array
           (
               [name] => mikey
           )

   )

)

First, we need to set the option which tells the unserializer that the tag item should be transformed into a list:

$options = array(stubXMLUnserializerOption::FORCE_LIST => array('item'));

You can list several tag names in the array; each of them will be transformed into its own list, independently of the others. In the next step we create the unserializer instance:

$unserializer = new stubXMLUnserializer($options);

And finally we can get our array structure from the XML document:

$itemList = $unserializer->unserialize($xml);

This assumes that the XML document is available as a string in the $xml variable. Most of the time, however, you will have the XML document in a file:

$itemList = $unserializer->unserializeFile($fileName);
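
To make the effect of a forced list more concrete, here is a minimal sketch in plain PHP built on the SimpleXML extension. It is not the stubXMLUnserializer implementation: toArraySketch() is a hypothetical helper, and its handling of repeated tags that are not in the list is a simplification chosen for this example. It only illustrates the idea that every occurrence of a forced tag is appended to an indexed array, so even a single item ends up as a one-element list.

```php
<?php
// Illustrative sketch only, NOT the stubXMLUnserializer implementation.
// toArraySketch() is a hypothetical helper based on PHP's SimpleXML extension.
function toArraySketch(SimpleXMLElement $element, array $forceList)
{
    if ($element->count() === 0) {
        // Leaf element: return its trimmed text content.
        return trim((string) $element);
    }

    $result = array();
    foreach ($element->children() as $child) {
        $name  = $child->getName();
        $value = toArraySketch($child, $forceList);
        if (in_array($name, $forceList, true)) {
            // Forced list: always append, even if the tag occurs only once.
            $result[$name][] = $value;
        } else {
            // Simplification for this sketch: a later tag with the
            // same name overwrites the earlier value.
            $result[$name] = $value;
        }
    }

    return $result;
}

$xml = '<?xml version="1.0" encoding="iso-8859-1"?>'
     . '<root><item><name>schst</name></item><item><name>mikey</name></item></root>';

$itemList = toArraySketch(simplexml_load_string($xml), array('item'));
print_r($itemList);
```

With the document from the beginning of this section, the sketch yields the same nested array as shown above: an item key holding one array per item tag.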

Options

Options for the unserializer are set with an array in the constructor. You can combine any of the options as you like.

Whitespace handling

Whitespace can be handled in three different ways: keeping it as is, trimming the value, or normalizing the value.

$options = array(stubXMLUnserializerOption::WHITESPACE => stubXMLUnserializerOption::WHITESPACE_KEEP);

This does not change the whitespace of the value.

    This
    is
    a
    value.

remains as is.

$options = array(stubXMLUnserializerOption::WHITESPACE => stubXMLUnserializerOption::WHITESPACE_NORMALIZE);

This will remove line breaks, trim the value, and replace each run of whitespace characters with a single space:

    This
    is
    a
    value.

becomes "This is a value."

$options = array(stubXMLUnserializerOption::WHITESPACE => stubXMLUnserializerOption::WHITESPACE_TRIM);

This will remove whitespace at the beginning and end of the value:

    This
    is
    a
    value.

becomes

This
    is
    a
    value.

If this option is not set, it defaults to stubXMLUnserializerOption::WHITESPACE_TRIM.
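
The three modes can be approximated in a few lines of plain PHP. This is only a sketch of the behaviour described above: applyWhitespaceSketch() is a hypothetical helper, and the mode names 'keep', 'normalize' and 'trim' are plain strings chosen for the example, not Stubbles constants.

```php
<?php
// Hypothetical helper illustrating the three whitespace modes.
// The mode names are plain strings made up for this sketch.
function applyWhitespaceSketch($value, $mode)
{
    switch ($mode) {
        case 'keep':
            // Leave the value untouched.
            return $value;
        case 'normalize':
            // Collapse every run of whitespace (including line breaks)
            // into a single space, then trim both ends.
            return trim(preg_replace('/\s+/', ' ', $value));
        case 'trim':
        default:
            // Strip leading and trailing whitespace only.
            return trim($value);
    }
}

$value = "    This\n    is\n    a\n    value.\n";

echo applyWhitespaceSketch($value, 'normalize'), "\n"; // This is a value.
```

Note that trimming leaves the inner line breaks and indentation intact, which is why the trimmed example above still spans four lines.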

Using a tag map

Sometimes the tag names should not be used as the keys of the resulting array. By setting a tag map it is possible to replace tag names with other names:

$options = array(
    stubXMLUnserializerOption::TAG_MAP => array(
        'foo' => 'bar',
        'bar' => 'foo'
    )
);

The result of unserializing this document:

<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  <foo>FOO</foo>
  <bar>BAR</bar>
</root>

will be:

array('bar' => 'FOO',
      'foo' => 'BAR'
)
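
The swap in this example works because each tag name is looked up in the map exactly once; mapped names are not mapped again. A minimal sketch of that lookup in plain PHP follows. applyTagMapSketch() is a hypothetical helper operating on an already built array, not part of Stubbles, and the extra baz entry is added only to show that unmapped names stay unchanged.

```php
<?php
// Hypothetical helper: renames array keys according to a tag map.
// Each original name is looked up once, so 'foo' and 'bar' can be swapped.
function applyTagMapSketch(array $data, array $tagMap)
{
    $result = array();
    foreach ($data as $tagName => $value) {
        // Fall back to the original tag name if the map has no entry for it.
        $key = isset($tagMap[$tagName]) ? $tagMap[$tagName] : $tagName;
        $result[$key] = $value;
    }

    return $result;
}

$mapped = applyTagMapSketch(
    array('foo' => 'FOO', 'bar' => 'BAR', 'baz' => 'BAZ'),
    array('foo' => 'bar', 'bar' => 'foo')
);
print_r($mapped);
```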

Type guessing

It may be helpful to get numbers as integers or floats, and boolean values as booleans instead of strings. By enabling type guessing you can switch on such a type conversion:

$options = array(stubXMLUnserializerOption::GUESS_TYPES => true);

Applied to this document:

<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  <string>Just a string...</string>
  <booleanValue>true</booleanValue>
  <foo>-563</foo>
  <bar>4.73736</bar>
</root>

The result will be:

array(4) {
  ["string"]=>
  string(16) "Just a string..."
  ["booleanValue"]=>
  bool(true)
  ["foo"]=>
  int(-563)
  ["bar"]=>
  float(4.73736)
}
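
The exact rules the unserializer applies are not spelled out here, so the following is only a sketch of how such guessing could work in plain PHP. guessTypeSketch() is a hypothetical helper, and the patterns it uses for integers and floats are assumptions made for this example.

```php
<?php
// Hypothetical helper: guesses a PHP type from a string value.
// The rules below are assumptions for illustration, not Stubbles' rules.
function guessTypeSketch($value)
{
    if ($value === 'true') {
        return true;
    }
    if ($value === 'false') {
        return false;
    }
    if (preg_match('/^[+-]?\d+$/D', $value)) {
        // Optional sign followed by digits only: treat as integer.
        return (int) $value;
    }
    if (preg_match('/^[+-]?\d+\.\d+$/D', $value)) {
        // Digits, a decimal point, more digits: treat as float.
        return (float) $value;
    }

    // Anything else stays a string.
    return $value;
}

var_dump(guessTypeSketch('-563'));    // int(-563)
var_dump(guessTypeSketch('true'));    // bool(true)
```

A value such as a zip code with a leading zero shows why guessing can go wrong: a rule like the integer pattern above would turn it into a number and drop the zero.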

If the unserializer guesses the type of a value incorrectly, you can help it by adding a type attribute:

<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  <string _type="string">Just a string...</string>
  <booleanValue _type="boolean">true</booleanValue>
  <foo _type="integer">-563</foo>
  <bar _type="float">4.73736</bar>
</root>

The _type attribute is recognized even if type guessing is switched off. If the attribute name _type does not suit your needs, you can change it:

$options = array(stubXMLUnserializerOption::ATTRIBUTE_TYPE => 'typeHint');

Source encoding

The unserializer will try to detect the encoding of the document on its own. If this fails, you can give it a hint about the encoding of the XML document:

$options = array(stubXMLUnserializerOption::ENCODING_SOURCE => 'iso-8859-1');

It is not possible to define the target encoding. Strings returned from the unserializer will always be in UTF-8.
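
To illustrate what the conversion from a source encoding to UTF-8 means at the byte level, here is a small sketch using PHP's iconv() function. It says nothing about how the unserializer performs the conversion internally; the sample name is made up for the example.

```php
<?php
// "Müller" encoded as ISO-8859-1: the u-umlaut is the single byte 0xFC.
$latin1 = "M\xFCller";

// Convert from the known source encoding to UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8 the same character takes two bytes, 0xC3 0xBC.
echo bin2hex($utf8), "\n"; // 4dc3bc6c6c6572
```

If your application works with another encoding internally, a conversion in the opposite direction would have to be applied to the strings returned by the unserializer.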

Parsing attributes

By default, the unserializer ignores attributes. To enable attribute parsing, set the corresponding option:

$options = array(stubXMLUnserializerOption::ATTRIBUTES_PARSE => true);