Babble Learns a Second Language: XML

November 2004



by Karen Blaedow

It was only a matter of time before Babble learned a second language, but I didn't think it would be this soon, and I didn't think it would be XML. As part of the project with the Trace Center, however, Babble needed to handle requests concerning TV programs a user might want to watch, record, or just get information about. So Babble needed to acquire TV listing information, which is available as an XML document.

Thus I programmed into Babble the ability to understand XML in a similar way to how it understands natural language: it maps the XML to tridbits. While this is not the English to Spanish translation Babble may someday be capable of, there are some interesting things to note about Babble's new ability.

First of all, being able to understand XML gives Babble access to a whole new world of information. Many databases and other data sources are able to export to XML. It also focuses more attention on the use of Babble as a database itself.

As with natural language, the XML data is mapped to tridbits that convey the meaning of the XML. The way in which this mapping is carried out is by necessity quite different, and it is interesting to compare it with how meaning is generated from natural language. In many respects natural language is the less ambiguous of the two. I know that sounds strange, and I will try to explain what I mean by it. But first I'll describe in practical terms how Babble reads XML.

 


Before Babble can understand an XML document, a person who understands the document must go through it and define how to process each element that will have meaning to Babble. So for the TV listing document, Babble is told that when it sees a <program> tag it should generate a thing referent representing a TV program, with properties defined by the attributes and data elements of that tag.

Thus if you want Babble to read your data, you have to export it to an XML document and then define the mapping between the elements in the XML document and the tridbits to be generated. Once this is done, however, Babble will know how to read all future documents with that XML structure and will be able to converse about the data using natural language. Babble becomes a natural language interface to any database that exports XML.

The XML parsing bypasses word use and syntax patterns and simply maps an element of XML to the generation of one or more tridbits. All of the difficult disambiguation is resolved by the XML to tridbit definitions created by a human. There were not enough clues in the XML to disambiguate it any other way. Even a human can stare at a piece of XML for quite a while before discerning anything meaningful. It is a very difficult task, far harder than natural language, to disambiguate XML.

To understand the kind of ambiguity I am talking about, consider the following example from the TV Listing document.



Example XML

<program id='EP1340710023'>
<title>The Outer Limits</title>

 

English translation

The Outer Limits is a TV program. It has an id of EP1340710023.


The program tag above should generate a thing referent whose category, loosely taken from the tag, is TV program. The id attribute assigns a value to the id attribute of the referent. Being called an id implies that its value is unique to this referent. Unique keys are an important element of conventional database design - something natural language lacks. Perhaps this is why we like to think data stored in a conventional database is less ambiguous. But there are many other types of ambiguity that are easy for us to miss since our processing resolves them without any conscious effort.
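As a rough illustration of the mapping just described, here is a minimal Python sketch. It is not Babble's code: the `Referent` class, the `TAG_TO_CATEGORY` table, and the function name are hypothetical stand-ins, and the closing `</program>` tag is added only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


# Hypothetical stand-in for a thing referent; not Babble's actual tridbit format.
@dataclass
class Referent:
    categories: list
    properties: dict = field(default_factory=dict)


# Hand-written mapping from tag name to the category of the referent it generates.
TAG_TO_CATEGORY = {"program": "TV program"}


def referent_from_element(elem):
    """Create a referent for an element whose tag has a defined mapping."""
    ref = Referent(categories=[TAG_TO_CATEGORY[elem.tag]])
    # XML attributes (such as id) become properties of the referent.
    ref.properties.update(elem.attrib)
    # Child data elements (such as title) become properties too.
    for child in elem:
        ref.properties[child.tag] = (child.text or "").strip()
    return ref


# The fragment from the listing, closed here so it is well-formed by itself.
xml_text = """<program id='EP1340710023'>
<title>The Outer Limits</title>
</program>"""

program = referent_from_element(ET.fromstring(xml_text))
print(program.categories, program.properties)
```

Note that the category comes from a table written by a person, not from the XML itself, which is the point made above about where the disambiguation really happens.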


<subtitle>The Sandkings</subtitle>
<description>A Martian soil sample yields eggs that hatch intelligent, insect like creatures with a violent bent.</description>
<showType>Series</showType>

 


The Sandkings is an episode of The Outer Limits. It is about a Martian soil sample that yields eggs that hatch intelligent, insect-like creatures with a violent bent.


Here the logic gets a little complicated because what is being represented changes depending on showType. If it is a series, there are now two thing referents: the series itself (The Outer Limits) and the episode (The Sandkings). If the showType were "special" there would not be an episode.

Like the program tag, the subtitle tag generates a thing referent, but only when showType=Series. The category of this referent is TV program, but the category is not taken directly from the subtitle tag as it was with the program tag. In addition, when showType=Series the thing referent generated by the program tag needs to be assigned a second category of TV series. The name of the episode is taken from the data element of the subtitle tag, whereas the name of the series is taken from a title tag nested inside the program tag.
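The conditional logic above can be sketched in a few lines of Python. This is only an illustration under assumptions: referents are represented as plain (name, categories) pairs rather than tridbits, the function name is made up, and the "Special" listing used for contrast is invented, not taken from the TV listing document.

```python
import xml.etree.ElementTree as ET


def referents_for_program(elem):
    """Return (name, categories) pairs generated by one <program> element."""
    title = elem.findtext("title")
    subtitle = elem.findtext("subtitle")
    is_series = elem.findtext("showType") == "Series"

    # The program tag always yields a TV program; a series gets a second category.
    program_categories = ["TV program"]
    if is_series:
        program_categories.append("TV series")
    referents = [(title, program_categories)]

    # Only for a series does the subtitle name a separate episode referent.
    if is_series and subtitle:
        referents.append((subtitle, ["TV program"]))
    return referents


series_xml = """<program id='EP1340710023'>
<title>The Outer Limits</title>
<subtitle>The Sandkings</subtitle>
<showType>Series</showType>
</program>"""

# Invented example of a non-series listing, for contrast only.
special_xml = """<program id='HYPOTHETICAL-ID'>
<title>A Hypothetical Special</title>
<showType>Special</showType>
</program>"""

series_referents = referents_for_program(ET.fromstring(series_xml))
special_referents = referents_for_program(ET.fromstring(special_xml))
print(series_referents)
print(special_referents)
```

The same subtitle element means something different depending on a sibling element, and nothing in the XML itself says so. That rule has to be supplied by whoever writes the mapping.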

The point here is not to knock how the TV listing data is structured, because it is done in a reasonable fashion following conventional database design. The point is to illustrate that this supposedly unambiguous data encoding is anything but. We've only looked at a handful of tags in this XML file, and already there is too little consistency to allow automatic determination of the referents of this information, let alone the relationships between the referents. And it gets worse:


<schedule program='EP1340710023' station='11097' time='2004-11-02T13:00:00Z' duration='PT01H00M' tvRating='TV-PG' stereo='true' closeCaptioned='true'>
<part number='1' total='2'/>
</schedule>

 


The Sandkings episode of The Outer Limits is broadcast by the Sci-Fi Channel on 11/2/2004 at 13:00. It is an hour long, has a TV rating of TV-PG, and is in stereo and closed captioned. It is part 1 of a 2 part series.


Here the schedule tag is representing a broadcast event. The program attribute gives the category and id of the object of this broadcast event, which must not be interpreted as a new referent, but rather one that was previously defined. It is only because we understand beforehand the things being represented that we can figure out the different ways the information is encoded.
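To make the "previously defined referent" point concrete, here is a small Python sketch. It assumes that referents created from program tags are kept in a dictionary keyed by id. The dictionaries, the `known_programs` name, and the function name are all hypothetical, not Babble's internal representation.

```python
import xml.etree.ElementTree as ET

# Referents already created from <program> tags, keyed by their unique id.
# Plain dicts are hypothetical stand-ins for Babble's referents.
known_programs = {
    "EP1340710023": {"category": "TV program", "title": "The Outer Limits"},
}


def broadcast_from_schedule(elem, programs):
    """Build a broadcast event that points at an existing program referent."""
    # Look up the existing referent instead of creating a new one.
    # A KeyError here means the schedule refers to a program not yet seen.
    program = programs[elem.get("program")]
    event = {"category": "broadcast event", "object": program}
    # The remaining attributes describe the broadcast itself.
    for name, value in elem.attrib.items():
        if name != "program":
            event[name] = value
    part = elem.find("part")
    if part is not None:
        event["part"] = (int(part.get("number")), int(part.get("total")))
    return event


schedule_xml = """<schedule program='EP1340710023' station='11097' time='2004-11-02T13:00:00Z' duration='PT01H00M' tvRating='TV-PG' stereo='true' closeCaptioned='true'>
<part number='1' total='2'/>
</schedule>"""

event = broadcast_from_schedule(ET.fromstring(schedule_xml), known_programs)
print(event["object"]["title"], event["time"], event["part"])
```

In the program tag an id attribute introduces a referent, while in the schedule tag a program attribute with the same value refers back to one. Only the hand-written lookup encodes that difference.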

The second thing to take away from this is a new appreciation for how well natural language disambiguates many of the issues I raised above. Clearly an average person can read the English translations that accompany the XML examples above and construct the referents and relationships conveyed, in other words, understand them. Even the person who created the XML would find the English easier to understand. Natural language provides the clues needed to disambiguate the referents and how they are related, but only when the receiving entity understands the structure and constraints of that information, whether that entity be a human or Babble.