Tuesday, November 24, 2015

How to write BREX context rules (part 2): XPath Primer

In Part 1, I provided a basic introduction to writing BREX context rules and the <contextRules> element. In this second part of the series, I provide a brief primer on XPath expressions. A complete guide to writing XPath expressions is beyond the scope of this post, but you will need a basic understanding of XPath to start writing structured rules for your project. After reading this post, I recommend reviewing any of the numerous XPath tutorials on the web.

What is XPath?

XPath provides a syntax for identifying parts (formally known as nodes) of an XML document. An XML document intrinsically defines a tree structure, similar to how files on a file system are organized. For example, consider how Windows File Explorer identifies the location of the folder "program" in its address bar:

The absolute location is "C:\Program Files\LibreOffice 5\program". With XPath, we identify XML elements in a similar manner, but we use forward slashes, "/", instead of backslashes. For example, say we have the following XML document structure:

  <dmodule>
    <content>
      <brex> <!-- We want to identify this element here -->
      ...
    </content>
  </dmodule>

We can identify the <brex> element with the following XPath expression:

/dmodule/content/brex
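To try this out, here is a minimal sketch using Python's standard xml.etree.ElementTree module. That module supports only a subset of XPath and evaluates paths relative to an element, so the leading "/dmodule" step is expressed by starting the search at the root element. The sample document is a trimmed-down stand-in for illustration, not a complete BREX data module.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the structure shown above.
doc = """
<dmodule>
  <content>
    <brex/>
  </content>
</dmodule>
"""

root = ET.fromstring(doc)  # root is the <dmodule> element

# Equivalent of /dmodule/content/brex, written relative to <dmodule>.
matches = root.findall("content/brex")
print([m.tag for m in matches])
```

A full XPath processor would accept the absolute form /dmodule/content/brex directly.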

Unlike a file system, an XML document can have multiple items with the same name at the same level. For example:

  <dmodule>
    <content>
      <brex>
        <contextRules>
          <structureObjectRuleGroup>
            <structureObjectRule>... <!-- We want this one -->
            <structureObjectRule>... <!-- Not this one -->
            ...
          </structureObjectRuleGroup>
        </contextRules>
      </brex>
    </content>
  </dmodule>

If we use the XPath expression,

/dmodule/content/brex/contextRules/structureObjectRuleGroup/structureObjectRule

we are actually identifying all <structureObjectRule> elements under <structureObjectRuleGroup>. If we only want the first <structureObjectRule> element, we do the following:

/dmodule/content/brex/contextRules/structureObjectRuleGroup/
    structureObjectRule[1]
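The difference between the two expressions can be checked with a short sketch, again using Python's xml.etree.ElementTree and paths relative to the root element. The "n" attribute is a hypothetical label added only so the matches can be told apart; it is not part of the BREX schema.

```python
import xml.etree.ElementTree as ET

# Two sibling rules; "n" is a made-up label for this illustration only.
doc = """
<dmodule><content><brex><contextRules>
  <structureObjectRuleGroup>
    <structureObjectRule n="first"/>
    <structureObjectRule n="second"/>
  </structureObjectRuleGroup>
</contextRules></brex></content></dmodule>
"""

root = ET.fromstring(doc)
base = "content/brex/contextRules/structureObjectRuleGroup/"

# Without a position test, every <structureObjectRule> is selected.
all_rules = root.findall(base + "structureObjectRule")

# With [1], only the first one under its parent is selected.
first_only = root.findall(base + "structureObjectRule[1]")

print([r.get("n") for r in all_rules])
print([r.get("n") for r in first_only])
```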

Technically, we may still identify more than one <structureObjectRule> element. If you are familiar with the BREX schema, you know that <contextRules> and <structureObjectRuleGroup> are repeatable (this applies to any XML document; BREX is just the example here). Because [1] selects the first <structureObjectRule> within each parent, every <structureObjectRuleGroup> can contribute a match. So if we take,

/dmodule/content/brex/contextRules/structureObjectRuleGroup/
    structureObjectRule[1]

and apply it to the following XML document:

  <dmodule>
    <content>
      <brex>
        <contextRules>
          <structureObjectRuleGroup>
            <structureObjectRule>... <!-- MATCH -->
            ...
          </structureObjectRuleGroup>
        </contextRules>
        <contextRules>
          <structureObjectRuleGroup>
            <structureObjectRule>... <!-- MATCH -->
            ...
          </structureObjectRuleGroup>
        </contextRules>
      </brex>
    </content>
  </dmodule>

we will have identified two <structureObjectRule> elements, one in each <structureObjectRuleGroup>. To identify only the very first <structureObjectRule> element in the document, we use the following:

/dmodule/content/brex/contextRules[1]/
    structureObjectRuleGroup[1]/structureObjectRule[1]
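The per-parent behavior of [1] can be seen in a small sketch with Python's xml.etree.ElementTree, using paths relative to the root element. The "n" labels are hypothetical and exist only to identify which elements matched.

```python
import xml.etree.ElementTree as ET

# Two <contextRules> blocks, each with two rules; "n" is a made-up label.
doc = """
<dmodule><content><brex>
  <contextRules><structureObjectRuleGroup>
    <structureObjectRule n="a1"/>
    <structureObjectRule n="a2"/>
  </structureObjectRuleGroup></contextRules>
  <contextRules><structureObjectRuleGroup>
    <structureObjectRule n="b1"/>
    <structureObjectRule n="b2"/>
  </structureObjectRuleGroup></contextRules>
</brex></content></dmodule>
"""

root = ET.fromstring(doc)

# [1] on the last step only: the first rule under EACH group matches.
per_parent = root.findall(
    "content/brex/contextRules/structureObjectRuleGroup/structureObjectRule[1]"
)

# [1] on every repeatable step: only the very first rule matches.
very_first = root.findall(
    "content/brex/contextRules[1]/structureObjectRuleGroup[1]/structureObjectRule[1]"
)

print([r.get("n") for r in per_parent])
print([r.get("n") for r in very_first])
```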

Identifying by ID

If your XML documents contain IDs, using them to identify elements is much easier than using full paths. Take the following for example:

  <dmodule>
    <content>
      <brex>
        <contextRules>
          <structureObjectRuleGroup>
            <structureObjectRule id="SOR-001">... <!-- We want this one -->
            <structureObjectRule>...
            ...
          </structureObjectRuleGroup>
        </contextRules>
      </brex>
    </content>
  </dmodule>

The element we want to identify can be expressed as follows:

//structureObjectRule[@id="SOR-001"]

The expression contains some components that need further explanation:

//

This is shorthand notation for any descendant node. Since it is at the start of the expression, it searches the entire document, so the element that follows can appear at any depth.

[@id="SOR-001"]

The "[]" represents a conditional expression (formally, a predicate) on the node that precedes it. In this case, the node that precedes it is structureObjectRule. In order for a structureObjectRule to match the expression, the expression inside the brackets must evaluate to true.

In our example, the conditional expression,

@id="SOR-001"

is only true if the attribute named "id" has the value "SOR-001". In XPath, to distinguish an element name from an attribute name, attribute names are prefixed with the '@' character, hence the use of "@id". If we left out the '@', the name "id" would have been interpreted as the name of a child element.
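As a rough sketch, the same lookup can be run with Python's xml.etree.ElementTree, which supports this kind of attribute test. The path is written as ".//" because ElementTree searches relative to an element. The second query drops the '@' to show that "id" is then treated as a child element name, which matches nothing in this sample.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the structure shown above.
doc = """
<dmodule><content><brex><contextRules>
  <structureObjectRuleGroup>
    <structureObjectRule id="SOR-001"/>
    <structureObjectRule/>
  </structureObjectRuleGroup>
</contextRules></brex></content></dmodule>
"""

root = ET.fromstring(doc)

# Attribute test: matches the rule whose id attribute is "SOR-001".
by_attribute = root.findall(".//structureObjectRule[@id='SOR-001']")

# Without '@', "id" is read as a child element named <id>; none exist here.
by_child = root.findall(".//structureObjectRule[id='SOR-001']")

print(len(by_attribute), len(by_child))
```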

Identifying by any attribute

You are not limited to ID attributes for identifying elements in an XML document. For example, if I want to identify all elements marked as deleted, I can use the following:

//*[@changeType="delete"]

The special character "*" will match any element, but the attribute test limits the matches to only those elements that have the changeType attribute set to "delete".
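A minimal sketch of the wildcard test, assuming Python's xml.etree.ElementTree and a made-up sample in which the <para> and <note> elements are placeholders chosen for illustration. Note that ElementTree's ".//*" searches below the starting element, so the root element itself is not tested, whereas "//*" in a full XPath processor would include it.

```python
import xml.etree.ElementTree as ET

# Made-up sample content; only the changeType attribute matters here.
doc = """
<dmodule><content>
  <para changeType="delete">old text</para>
  <para changeType="modify">changed text</para>
  <note changeType="delete">old note</note>
</content></dmodule>
"""

root = ET.fromstring(doc)

# "*" matches any element name; the attribute test filters the results.
deleted = root.findall(".//*[@changeType='delete']")

print([e.tag for e in deleted])
```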

More Information

More complete tutorials on XPath can be found by searching the web.

Next

Learning by Example
