by Karen Blaedow
It was only a matter of time before Babble learned a second
language. But I didn't think it would be this soon, and I didn't
think it would be XML. As part of the project with the
Trace Center, however, Babble needed to handle requests concerning
TV programs a user might want to watch, record, or just get
information about. To do so, Babble needed to acquire TV listing
information, which is available as an XML document.
Thus I programmed into Babble the ability to understand XML
in a similar way to how it understands natural language: it
maps the XML to tridbits. While this is not the English to
Spanish translation Babble may someday be capable of, there
are some interesting things to note about Babble's new ability.
First of all, being able to understand XML gives Babble
access to a whole new world of information. Many databases
and other data sources are able to export to XML. It also
focuses more attention on the use of Babble as a database
itself.
As with natural language, the XML data is mapped to tridbits
that convey the meaning of the XML. The way in which this
mapping is carried out is by necessity quite different and
it is interesting to compare it with how meaning is generated
from natural language. In many respects natural language
is less ambiguous. I know that sounds strange and I will try
to explain what I mean by that. But first I'll describe in
practical terms how Babble reads XML.
Before Babble can understand an XML document a person who
understands the document must go through it to define how
to process any of the elements that will have meaning to Babble.
So for the TV listing document, Babble is told that when it
sees a <program> tag it should generate a thing referent
representing a TV program and having properties that are defined
by data or attribute elements of that tag.
Thus if you want Babble to read your data, you have to export
it to an XML document and then define the mapping between
the elements in the XML document and the tridbits to be generated.
Once this is done, however, Babble will know how to read all
future documents with that XML structure and be able to converse
about the data using natural language. Babble becomes a natural
language interface to any database that exports XML.
The XML parsing bypasses word use and syntax patterns and
simply maps an element of XML to the generation of one or
more tridbits. All of the difficult disambiguation is resolved
by the XML-to-tridbit definitions created by a human. There
were not enough clues in the XML to disambiguate it any other
way. Even a human can stare at a piece of XML for quite a
while before discerning anything meaningful. Disambiguating
XML is a very difficult task, far harder than disambiguating
natural language.
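As a rough sketch of what such a human-written definition might look like,
the Python below maps a program tag to a thing referent. This is not Babble's
actual implementation: the MAPPING table, the read_referents function, the
plain-dict stand-in for a referent, and the sample document (including the
enclosing programs element and the shortened description) are all assumptions
made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping written by a person who understands the document.
# Each key is an XML tag; the rule names the category of the thing referent
# to generate and which attributes and child elements become its properties.
MAPPING = {
    "program": {
        "category": "TV program",
        "attributes": ["id"],
        "children": ["title", "description"],
    },
}


def read_referents(xml_text, mapping):
    """Build one referent (a plain dict) for every element that has a rule."""
    referents = []
    for element in ET.fromstring(xml_text.strip()).iter():
        rule = mapping.get(element.tag)
        if rule is None:
            continue  # tags without a rule carry no meaning for the reader
        properties = {}
        for name in rule["attributes"]:
            if element.get(name) is not None:
                properties[name] = element.get(name)
        for child_tag in rule["children"]:
            text = element.findtext(child_tag)
            if text:
                properties[child_tag] = text.strip()
        referents.append({"category": rule["category"], "properties": properties})
    return referents


# Illustrative document, not the real TV listing file.
SAMPLE = """
<programs>
  <program id='EP1340710023'>
    <title>The Outer Limits</title>
    <description>A Martian soil sample yields eggs.</description>
  </program>
</programs>
"""

referents = read_referents(SAMPLE, MAPPING)
```

Note that nothing in the code discovers what a program tag means. All of that
knowledge sits in the hand-written MAPPING table.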
To understand the kind of ambiguity I am talking about, consider
the following example from the TV Listing document.
|
The program tag above should generate a thing referent whose
category, loosely taken from the tag, is TV program. The id
attribute assigns a value to the id attribute of the referent.
Being called an id implies that its value is unique to this
referent. Unique keys are an important element of conventional
database design - something natural language lacks. Perhaps
this is why we like to think data stored in a conventional
database is less ambiguous. But there are many other types
of ambiguity that are easy for us to miss since our processing
resolves them without any conscious effort.
|
<subtitle>The Sandkings</subtitle>
<description>A Martian soil sample yields eggs that
hatch intelligent, insect like creatures with a violent bent.</description>
<showType>Series</showType>
|
|
The Sandkings is an episode of The Outer Limits. It is about
a Martian soil sample that yields eggs that hatch intelligent,
insect like creatures with a violent bent.
|
Here the logic gets a little complicated because what is being
represented changes depending on showType. If it is a series,
there are now two thing referents: the series itself (The
Outer Limits) and the episode (The Sandkings). If the showType
were "special" there would not be an episode.
Like the program tag, the subtitle tag generates a thing
referent. But it only does this when showType=Series. The
category of this referent is TV program, but the category
is not taken directly from the subtitle tag as it was with
the program tag. In addition, when the showType=Series the
thing referent generated by the program tag needs to be assigned
a second category of TV series. The name of the episode is
taken from the data element of the subtitle tag, unlike in
the program tag, where the name of the series is taken from
a title tag.
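The conditional logic described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function name
referents_for_program, the dict layout, the episode_of link, the placement of
the description, and both sample program elements (including the made-up id
SH0000000000) are mine, not taken from the TV listing file or from Babble.

```python
import xml.etree.ElementTree as ET


def referents_for_program(program):
    """Return the thing referents implied by one <program> element."""
    show = {
        "categories": ["TV program"],      # category taken from the tag itself
        "id": program.get("id"),
        "name": program.findtext("title"),
    }
    referents = [show]
    description = program.findtext("description")

    if program.findtext("showType") == "Series":
        # The program referent gains a second category, and the subtitle
        # tag generates a second referent for the episode.
        show["categories"].append("TV series")
        episode = {
            "categories": ["TV program"],  # not derived from the tag name
            "name": program.findtext("subtitle"),
            "description": description,    # assumed to describe the episode
            "episode_of": show["name"],
        }
        referents.append(episode)
    else:
        # Assumption: without an episode, the description describes the program.
        show["description"] = description
    return referents


# Illustrative elements, not copied from the real listing document.
SERIES = ET.fromstring(
    "<program id='EP1340710023'>"
    "<title>The Outer Limits</title>"
    "<subtitle>The Sandkings</subtitle>"
    "<showType>Series</showType>"
    "</program>"
)
SPECIAL = ET.fromstring(
    "<program id='SH0000000000'>"
    "<title>A Hypothetical Special</title>"
    "<showType>Special</showType>"
    "</program>"
)

series_referents = referents_for_program(SERIES)
special_referents = referents_for_program(SPECIAL)
```

The same subtitle tag yields a referent in one case and nothing in the other,
which is exactly the kind of rule that has to be supplied by hand.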
The point here is not to knock how the TV listing data is
structured, because it is done in a reasonable fashion following
conventional database design. The point is to illustrate that
this supposedly unambiguous data encoding is anything but. We've only
looked at a handful of tags in this XML file and clearly there
is little consistency that would allow automatic determination
of the referents of this information and especially the relationship
between the referents. And it gets worse:
|
<schedule program='EP1340710023' station='11097' time='2004-11-02T13:00:00Z'
duration='PT01H00M' tvRating='TV-PG' stereo='true' closeCaptioned='true'>
<part number='1' total='2'/>
</schedule>
|
|
The Sandkings episode of The Outer Limits is broadcast by
the Sci-Fi Channel on 11/2/2004 at 13:00. It is an hour long,
has a TV rating of TV-PG, and is in stereo and closed captioned.
It is part 1 of a 2-part series.
|
Here the schedule tag is representing a broadcast event. The
program attribute gives the category and id of the object
of this broadcast event, which must not be interpreted as
a new referent, but rather as one that was previously defined.
It is only because we understand beforehand the things being
represented that we can figure out the different ways the
information is encoded.
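To make the difference between creating a referent and pointing at an existing
one concrete, here is a small Python sketch. It is an illustration only: the
known_programs index, the broadcast_event function and the dict fields are
hypothetical names, and only the schedule element itself comes from the
listing shown above.

```python
import xml.etree.ElementTree as ET

# Referents already generated from <program> tags, indexed by their unique id.
# (Hypothetical structure; a real referent would carry more properties.)
known_programs = {
    "EP1340710023": {"category": "TV program", "id": "EP1340710023"},
}


def broadcast_event(schedule, programs):
    """Turn a <schedule> element into a broadcast event referent."""
    program_id = schedule.get("program")
    if program_id not in programs:
        raise KeyError(f"schedule refers to unknown program {program_id!r}")
    event = {
        "category": "broadcast event",
        # Reuse the existing referent; do not generate a new one.
        "object": programs[program_id],
        "station": schedule.get("station"),
        "time": schedule.get("time"),
        "duration": schedule.get("duration"),
    }
    part = schedule.find("part")
    if part is not None:
        event["part"] = (int(part.get("number")), int(part.get("total")))
    return event


SCHEDULE = ET.fromstring(
    "<schedule program='EP1340710023' station='11097' "
    "time='2004-11-02T13:00:00Z' duration='PT01H00M' tvRating='TV-PG' "
    "stereo='true' closeCaptioned='true'>"
    "<part number='1' total='2'/>"
    "</schedule>"
)

event = broadcast_event(SCHEDULE, known_programs)
```

The program attribute looks just like the id attribute on a program tag, yet
here it has to be treated as a lookup key rather than as a property to store.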
The second thing to take away from this is a new appreciation
for how well natural language disambiguates many of the issues
I raised above. Clearly an average person can read the English
translation of the XML in the right-hand columns and construct
the referents and relationships conveyed, in other words understand
it. Even the person who created the XML would find the English
easier to understand. Natural language provides the clues
needed to disambiguate the referents and how they are related,
but only when the receiving entity understands the structure
and constraints of that information, whether that entity be
a human or Babble.