Most current ixml grammars are small. However, there are examples of large grammars, and more are likely to emerge as ixml usage increases.
To make large grammars more manageable, and to enable reuse, it would be useful to have a way to modularise them.
One of the requirements of modularisation for reuse in any notation is to have a method of specifying the contractual interface, such that it is possible for the producers of the modules to change their internal structure without breaking any existing usage of the module.
This paper describes a proposal for an ixml preprocessor that permits an ixml grammar to invoke other modules of ixml grammars, specifying their linkage. Rules whose names clash between modules are renamed using ixml renaming, yielding a single ixml grammar with no rule-name clashes whose XML serialisations remain the same. The invoking grammar remains unchanged.
There is no change to the syntax or semantics of ixml proper.
Keywords: ixml, parsing, context-free grammars, XML, modularisation
Invisible XML (ixml) is a notation and process that uses context-free grammars to describe the format of textual documents.
This allows documents to be parsed into an abstract parse-tree, which can be processed in various ways, but principally serialised into an XML document, thus making the implicit structure of the textual document explicit in the XML.
Most current ixml grammars are small (the grammar for ixml itself, for example, is around 70 lines).
Large grammars may emerge containing subparts that are authored by different people.
For example, there is a grammar for XPath 4 of around 350 lines, which could be used by grammars for languages that use XPath 4.
The nice thing about general context-free grammars is that they can be combined, and remain general context-free, which makes modularisation feasible.
The main problem to be solved: rule name clashes between modules.
Other requirements and desiderata:
Renaming is a new ixml feature agreed by the working group, and is already present in several implementations.
It allows a rule to specify a name to be used on serialisation that differs from the rule's own name.
Consider a grammar that accepts both 31/12/1999 and 31 December 1999 forms of dates:
date: numeric; textual.
-numeric: day, -"/", month, -"/", year.
-textual: day, -" "+, tmonth, -" "+, year.
day: d, d?.
month: d, d?.
year: d, d, d, d.
tmonth: -"January", +"1";
-"February", +"2";
...
-"December", +"12".
-d: ["0"-"9"].
While 31/12/1999 produces
<date> <day>31</day> <month>12</month> <year>1999</year> </date>
31 December 1999 produces
<date> <day>31</day> <tmonth>12</tmonth> <year>1999</year> </date>
where the element names differ only because the date was written in a different input syntax.
Using renaming, you can specify that both have the same serialised name:
tmonth > month:
-"January", +"1";
-"February", +"2";
...
-"December", +"12".
tmonth is the rule name, month is the name used on serialisation.
A module consists of a regular ixml grammar, preceded by specifications of rules used from other modules and what is shared for use from this module.
+uses css from css.ixml
+uses iri, url, uri, urn from uri.ixml
It is possible to combine them:
+uses css from css.ixml; iri, url, uri, urn from uri.ixml
Also possible:
+uses iri from https://example.com/ixml/modules/iri.ixml
The specification of what can be used is similar:
+shares iri, url, uri, urn
There are two main choices for a grammar for these. The first literally recognises the structure as it is specified above:
module: s, (uses; shares)*, ixml.
uses: -"+uses", rs, from++(-";", s).
shares: -"+shares", rs, entries.
from: entries, rs, -"from", rs, source, s.
-entries: share++(-",", s).
share: @name, s.
@source: iri.
using s, rs, name, and ixml from the ixml grammar, and presupposing a rule for iri.
A specification like
+uses css from css.ixml; iri, url, uri, urn from uri.ixml
then produces
<uses>
<from source='css.ixml'>
<share name='css'/>
</from>
<from source='uri.ixml'>
<share name='iri'/>
<share name='url'/>
<share name='uri'/>
<share name='urn'/>
</from>
</uses>
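The second accepts the same syntax but produces a flatter structure, with one uses element per source:
module: s, (multiuse; shares)*, ixml.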
-multiuse: -"+uses", rs, uses++(-";", s).
shares: -"+shares", rs, entries.
uses: entries, rs, -"from", rs, from.
-entries: share++(-",", s).
share: @name, s.
@from: iri, s.
where the resulting structure is then:
<uses from='css.ixml'>
<share name='css'/>
</uses>
<uses from='uri.ixml'>
<share name='iri'/>
<share name='url'/>
<share name='uri'/>
<share name='urn'/>
</uses>
+uses css from css.ixml
+uses iri, url, uri, urn from uri.ixml
+shares model, control
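As a rough illustration of the information a preprocessor needs from these specifications, the following Python sketch reads them with regular expressions instead of an ixml parser. The function name `parse_header` and the shape of its result are assumptions made for this example, and it assumes every specification fits on a single line.

```python
import re

def parse_header(text):
    """Collect the +uses and +shares specifications of a module.

    Returns (uses, shares): uses maps each source to the rule names taken
    from it; shares lists the rule names the module offers.
    Simplification: one specification per line, no ";" inside a source.
    """
    uses, shares = {}, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("+uses"):
            # "a, b from x.ixml; c from y.ixml" gives one entry per source
            for part in line[len("+uses"):].split(";"):
                m = re.fullmatch(r"\s*(.+?)\s+from\s+(\S+)\s*", part)
                if m is None:
                    raise ValueError(f"malformed uses specification: {part!r}")
                names = [n.strip() for n in m.group(1).split(",")]
                uses.setdefault(m.group(2), []).extend(names)
        elif line.startswith("+shares"):
            shares.extend(n.strip() for n in line[len("+shares"):].split(","))
    return uses, shares
```

Grouping the used names by source mirrors the flatter of the two serialisations, where each source appears once with its shared names beneath it.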
The uses and shares specifications in a module must be unique.
Modules are allowed to invoke each other.
E.g. a programming language where declarations can include procedures, and procedures can include declarations.
Module for procedures:
+uses declaration from declaration.ixml
+shares procedure
Module for declarations:
+uses procedure from procedure.ixml
+shares declaration
This illustrates that a uses specification is different from, for instance, #include in C preprocessing, since uses only ensures that the module will be present in the final grammar.
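One way to see why mutual invocation is harmless is to sketch how the set of modules might be collected. In the Python sketch below, `collect_modules` and the `uses_of` table are hypothetical names; the table stands in for reading each module's uses specifications from its file, and `main.ixml` in the usage note is an invented invoking grammar.

```python
def collect_modules(start, uses_of):
    """Return every module reachable from `start`, each listed once.

    `uses_of` maps a module name to the modules named in its uses
    specifications (an in-memory stand-in for fetching the files).
    """
    collected, pending = [], [start]
    while pending:
        module = pending.pop()
        if module in collected:
            continue  # already present, so mutual invocation cannot loop
        collected.append(module)
        pending.extend(uses_of.get(module, []))
    return collected
```

With `main.ixml` using `procedure.ixml`, and the procedure and declaration modules using each other, each of the three modules is collected exactly once.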
A module can only share rules it defines; it is not permitted to share a rule from a different module like this:
+uses x, y from z.ixml
+shares x
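A preprocessor could check this restriction before doing anything else. The Python sketch below is only an approximation: it finds rule definitions with a regular expression rather than a real ixml parser, and `defined_rules`, `undefined_shares` and the pattern `RULE_START` are names invented for the example.

```python
import re

# Rough pattern for the start of a rule at the beginning of a line:
# optional mark, rule name, optional "> alias", then ":" or "=".
RULE_START = re.compile(
    r"^\s*[-@^]?\s*([\w.-]+)\s*(?:>\s*[\w.-]+\s*)?[:=]", re.M
)

def defined_rules(grammar):
    """Names of the rules that a module's grammar text defines."""
    return set(RULE_START.findall(grammar))

def undefined_shares(shares, grammar):
    """Shared names that the module does not define itself."""
    defined = defined_rules(grammar)
    return [name for name in shares if name not in defined]
```

A non-empty result signals a module that tries to share a rule it only uses from elsewhere.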
We can now use modules to define modules:
+uses ixml, name, s, rs from ixml.ixml
+uses iri from iri.ixml
+shares module
module: s, (multiuse; shares)*, ixml.
-multiuse: -"+uses", rs, uses++(-";", s).
shares: -"+shares", rs, entries.
uses: entries, rs, -"from", rs, from.
-entries: share++(-",", s).
share: @name, s.
@from: iri, s.
The invoking module and all invoked modules are collected.
If any two contain the definition of a rule of the same name, one of the rules is renamed:
A rule is renamed by generating a new unique name, different from all other rule names in the set of modules.
If the rule already has a renaming (name > alias), the rule is redefined with the new name and the existing alias (newname > alias).
Otherwise, the rule is redefined with the new name, renamed to the old name (newname > oldname).
All applications of the old name in the module grammar, and in any of the other modules that use that rule, are replaced with the new name.
Once all naming conflicts are resolved, all invoked modules are appended to the invoking module, with the uses and shares specifications removed.
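The final assembly step can be sketched in a few lines of Python. The names `strip_specifications` and `assemble` are hypothetical, the modules are assumed to have had their clashes resolved already, and a uses specification is assumed to continue onto the next line only when a line ends with a semicolon.

```python
def strip_specifications(module_text):
    """Drop the +uses / +shares specifications, keeping only the rules."""
    kept, continuing = [], False
    for line in module_text.splitlines():
        stripped = line.strip()
        if continuing or stripped.startswith(("+uses", "+shares")):
            # a trailing ";" means the specification runs onto the next line
            continuing = stripped.endswith(";")
            continue
        kept.append(line)
    return "\n".join(kept).strip()

def assemble(invoking, invoked):
    """Append the invoked modules' rules to the invoking grammar's rules."""
    parts = [strip_specifications(invoking)]
    parts += [strip_specifications(module) for module in invoked]
    return "\n".join(part for part in parts if part) + "\n"
```

The invoking grammar comes first and is otherwise untouched, so its first rule remains the start rule of the combined grammar.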
What these rules ensure is that:
Imagine a language of identity statements of the style
total=price+tax+shipping
tax=price×10÷100
shipping=5
expressed using the definition of expr from another module:
+uses expr from expr.ixml
data: identity+.
identity: id, -"=", expr, -#a.
id: [L]+.
However, the expr module has a clashing rule for id:
+shares expr
expr: id++op.
id: [L; Nd]+.
op: ["+-×÷"].
Since the invoking grammar never gets changed, the rule in the module gets renamed, resulting in the following complete grammar:
data: identity+.
identity: id, -"=", expr, -#a.
id: [L]+.
expr: id_++op.
id_>id: [L; Nd]+.
op: ["+-×÷"].
If the module's rule for id had instead been a renaming, for instance:
id>ident: [L; Nd]+.
then the renaming would have ended up as:
id_>ident: [L; Nd]+.
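The renaming used in this example can be approximated by a small Python sketch. It is not a real ixml parser: names are restricted to letters, digits and underscores, comments are assumed not to nest, and `fresh_name`, `rename_rule` and `TOKEN` are names invented here. The trailing-underscore convention follows the example above.

```python
import re

# Pieces of grammar text that must never be rewritten: string literals,
# character sets, comments and hex encodings such as #a.
STRING = r'"(?:[^"]|"")*"' + "|" + r"'(?:[^']|'')*'"
SKIP = STRING + r"|\[[^\]]*\]|\{[^}]*\}|#[0-9a-fA-F]+"
TOKEN = re.compile("(?P<skip>" + SKIP + r")|(?P<name>[^\W\d]\w*)")

def fresh_name(old, taken):
    """Generate a name that differs from every name in `taken`."""
    new = old + "_"
    while new in taken:
        new += "_"
    return new

def rename_rule(grammar, old, new):
    """Rename rule `old` to `new` without changing what is serialised."""
    def fix(m):
        if m.group("name") != old:
            return m.group(0)           # other names and skipped text
        before = grammar[:m.start()].rstrip()
        after = grammar[m.end():].lstrip()
        if before.endswith(">"):
            return old                  # an alias, not a rule name
        if after.startswith(">"):
            return new                  # keeps the alias: new > alias
        if after[:1] in (":", "="):
            return f"{new}>{old}"       # definition: new > old
        return new                      # application of the rule
    return TOKEN.sub(fix, grammar)
```

Applied to the expr module above with `old="id"` and `new="id_"`, this yields the rules `expr: id_++op.` and `id_>id: [L; Nd]+.` shown in the combined grammar.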
Making the example slightly more complex, with rules like
result[1]=a1+b1+c1
result[2]=a2+b2+c2
using this grammar:
+uses expr from expr.ixml; identity from identity.ixml
rules: rule+.
rule: identity, -"=", expr, -#a.
Module expr.ixml
+shares expr
expr: operand++op.
operand: id; number.
id: [L], [L; Nd]*.
op: ["+-×÷"].
number: ["0"-"9"]+.
Module identity.ixml has a clash with both id and number:
+shares identity
identity: id; id, -"[", number, -"]".
id: [L]+.
number: digits, (".", digits)?.
-digits: [Nd]+.
The invoking grammar never changes:
rules: rule+.
rule: identity, -"=", expr, -#a.
In module expr.ixml nothing needs changing:
expr: operand++op.
operand: id; number.
id: [L], [L; Nd]*.
op: ["+-×÷"].
number: ["0"-"9"]+.
In identity.ixml both id and number are renamed:
identity: id_; id_, -"[", number_, -"]".
id_>id: [L]+.
number_>number: digits, (".", digits)?.
-digits: [Nd]+.
The rules allow either or both to be renamed in expr.ixml instead.
In this example there are two rules called id, each shared and used by two different modules.
The invoking grammar:
+uses id from ident.ixml; expr from expr.ixml
rules: rule+.
rule: id, -"=", expr.
Module ident.ixml
+shares id
id: [L]+.
Module expr.ixml
+uses id, number from id.ixml
+shares expr
expr: operand++op.
operand: id; number.
op: ["+-×÷"].
Module id.ixml
+shares id, number
id: [L], [L; Nd]*.
number: [Nd]+.
The invoking grammar is never changed:
rules: rule+.
rule: id, -"=", expr.
and since the id rule is used from module ident.ixml, the rule may not be renamed there:
id: [L]+.
This means that the id rule in module id.ixml has to be renamed:
id_>id: [L], [L; Nd]*.
number: [Nd]+.
and in module expr.ixml that uses it:
expr: operand++op.
operand: id_; number.
op: ["+-×÷"].
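The choice of which definition to rename in these examples can be sketched as follows. This Python sketch rests on an assumption drawn from the examples rather than from a stated rule: a definition is left alone if it belongs to the invoking grammar or is used directly by it, and otherwise the first module encountered keeps its name. The function `choose_renames`, its parameters, and the module name `main` are hypothetical.

```python
def choose_renames(defined, invoking, used_by_invoking):
    """Return the (module, rule) pairs whose definitions are to be renamed.

    defined: module name -> set of rule names the module defines
    invoking: name of the invoking module
    used_by_invoking: module name -> names the invoking grammar uses from it
    """
    owners = {}
    for module, names in defined.items():
        for name in names:
            owners.setdefault(name, []).append(module)

    renames = []
    for name, modules in owners.items():
        if len(modules) < 2:
            continue  # no clash for this name

        def protected(module):
            # visible to the invoking grammar, so it must keep its name
            return module == invoking or name in used_by_invoking.get(module, set())

        keep = next((m for m in modules if protected(m)), modules[0])
        renames += [(m, name) for m in modules if m != keep]
    return renames
```

For the example above, only the id rule of id.ixml is selected, since the id of ident.ixml is the one the invoking grammar uses. The sketch does not handle the case where two clashing definitions are both protected.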
Imagine you were defining a textual format for XForms:
Example XForm
style xform.css
model M
  instance data data.xml
  submission save put:data.xml replace:none
input name "What is your name?"
submit "OK"
This is going to need definitions for CSS, URIs, XPath, and a lot more. You might then define a grammar like this (this is not a complete example):
+uses form from form.ixml
+uses content from content.ixml
xform>html: h, form, content.
@h>xmlns: +"http://www.w3.org/1999/xhtml".
+shares form
+uses css from css.ixml;
model from model.ixml;
iri from iri.ixml;
s from xforms-basics.ixml
form>head: title, styling?, model*.
title: ~[" "; #a], ~[#a]+, -#a.
-styling: -"style", s, (style; stylelink).
stylelink>link: csstype, cssrel, href.
style: csstype, css.
@csstype>type: +"text/css".
@cssrel>rel: +"stylesheet".
@href: -iri, s.
+shares model
+uses s, ref, xf from xforms-basics.ixml;
id, name from xml.ixml;
Action from action.ixml;
iri from iri.ixml
model: -"model", s, id, s, xf, -#a,
s, (instance; bind; submission; Action)+.
instance: -"instance", s, id, s, resource, s.
@resource: -iri.
bind: "bind", s, (id, s)?, ref, s, property*.
property: type {; readonly; relevant; required; etc}.
type: "type:", name, s.
submission: -"submission", s, id, s,
(method, -":", resource, s)?, replace?.
@method: "get"; "put".
@replace: -"replace:", name, s.
{etc}
+shares content
+uses IDREF from xml.ixml;
xf, ref, string, s from xforms-basics.ixml
content>body: group.
group: xf, control*.
-control: input; submit {more}.
input: -"input", s, ref, label.
label: string.
submit: -"submit", s, subid?, label?.
@subid>submission: -"submission:", IDREF, s.
From
Example XForm
style xform.css
model M
  instance data data.xml
  submission save put:data.xml replace:none
input name "What is your name?"
submit "OK"
we get:
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Example XForm</title>
<link type='text/css' rel='stylesheet' href='xform.css'/>
<model id='M' xmlns='http://www.w3.org/2002/xforms'>
<instance id='data' resource='data.xml'/>
<submission id='save' method='put' resource='data.xml' replace='none'/>
</model>
</head>
<body>
<group xmlns='http://www.w3.org/2002/xforms'>
<input ref='name'>
<label>What is your name?</label>
</input>
<submit>
<label>OK</label>
</submit>
</group>
</body>
</html>
Modularisation can imitate scoping in a simple and direct manner through renaming.
A preprocessor can produce a complete ixml grammar that produces an identical serialisation of the parsed input.
There is no change in the syntax or semantics of ixml proper.