The ease of construction and potential Internet-wide impact of autonomous software agents on the World Wide Web have spawned a great deal of discussion and occasional controversy. Our experience in the Repository Based Software Engineering (RBSE) project with the design and operation of the RBSE Spider [4], the MORE repository system [6] and Sulla, a personal agent for the Web, indicates that such tools can create substantial value for users of the Web. Most current service agents involve Web spiders, which traverse the interlinked documents making up the Web and construct an index of the information thus discovered. The difficulty with relying solely upon service agents to assess the relevance of an artifact to a user's interest profile is that none of the service agents retains any state concerning what the user has already been presented with. Each user must periodically poll any given service agent with a query, and then filter that which is new from that which has already been seen.
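The filtering burden described above can be illustrated with a minimal sketch of the per-user state that a user agent might keep. The class name, the JSON state file and the use of URLs as result identifiers are all assumptions made for illustration; they do not describe how Sulla is implemented.

```python
import json
from pathlib import Path


class SeenFilter:
    """Hypothetical per-user record of results that have already been shown."""

    def __init__(self, state_file):
        self.state_file = Path(state_file)
        # Reload the set of previously presented URLs, if any were saved.
        if self.state_file.exists():
            self.seen = set(json.loads(self.state_file.read_text()))
        else:
            self.seen = set()

    def new_results(self, results):
        """Return only the unseen URLs, in order, and record them as seen."""
        fresh = []
        for url in results:
            if url not in self.seen:
                self.seen.add(url)
                fresh.append(url)
        # Persist the state so that it survives across polling sessions.
        self.state_file.write_text(json.dumps(sorted(self.seen)))
        return fresh
```

With state of this kind held by the agent, repeated polling of a service agent yields only the results the user has not yet been shown.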
A user agent's architecture should reflect the concerns of the individual,
rather than the concerns of the community. Sulla supports
long-lived, goal-oriented Web activity. Our initial approach to agent
interaction entailed Sulla mimicking the behavior of a human interacting
with each service agent. This approach suffers from the ambiguities of
natural language and the limitations of interaction through simplistic
query interfaces.
Related Work
InfoHarness [11] is an open, extensible
system designed to provide access to large amounts of heterogeneous
information through encapsulation of these information resources in
meta-data objects. The system architecture comprises an HTTP gateway,
one or more InfoHarness servers, one or more InfoHarness collections, a
meta-data generator which populates the collections, and a set of access
tools (e.g., WAIS and relational databases). Users interact with the
system through the gateway, which transforms requests into a form
acceptable to the servers, which then act upon the request by returning
portions of the meta-data, or by routing appropriate requests on to the
access tools (which are responsible for manipulation of the actual data).
Harvest [2, 3] supports ``gathering, indexing, caching, replicating, and accessing Internet information'' [2]. It was designed for scalability and customization through the separation of gatherers, responsible for the acquisition of information, and brokers, responsible for collection, index generation and dissemination of that information. Gatherers run at provider sites, and transmit information thus acquired back to one or more brokers using a ``summary object interchange format'' [2]. This allows for a significant reduction in network overhead when the transmitted information is heavily summarized or when there are many documents involved. Brokers interact with one or more gatherers for initial acquisition and with other brokers where useful to further filter information already collected by those brokers.
PAINT (Personalized Adaptive Internet Navigation Tool) [8] supports hierarchical hotlists in conjunction with Mosaic. It is distinguished by its intent to support a single individual user, rather than a community of users. PAINT supports the creation of hierarchical clusters of Web resources as name spaces. The principal design goal was to simplify the comprehension of hotlist elements. Judging by the number of hotlist manipulation schemes springing up to support Mosaic, this is a significant problem for serious users of the Web.
The Lycos system [7] employs a GNU DBM file to store the information discovered during its exploration. The information stored for a given document includes the title, the headings, the 100 most weighty words, the first 20 lines of the document and the size of the document, both in bytes and in words. The rationale behind these choices is the creation of a scheme that is finite in scope - the amount of information stored about a document does not depend upon the size of that document. Lycos caches the first 20 lines of the document for display as part of the results of a user search of the index, providing a limited context for the user without the need to access the matched documents.
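The bounded per-document record described above can be sketched as follows. This is an illustration only: the field names and the `summarize` function are hypothetical, and raw term frequency stands in for the notion of a "weighty" word, since the weighting scheme itself is not described here.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocumentSummary:
    """Hypothetical fixed-scope record; field names are illustrative."""
    title: str
    headings: list
    top_words: list
    first_lines: list
    size_bytes: int
    size_words: int


def summarize(title, headings, body, max_words=100, max_lines=20):
    """Build a record whose size is bounded regardless of document length."""
    words = re.findall(r"[a-z0-9]+", body.lower())
    # Assumption: rank words by raw frequency as a stand-in for "weight".
    top = [word for word, _ in Counter(words).most_common(max_words)]
    return DocumentSummary(
        title=title,
        headings=list(headings),
        top_words=top,
        first_lines=body.splitlines()[:max_lines],
        size_bytes=len(body.encode("utf-8")),
        size_words=len(words),
    )
```

Apart from the title and headings, every field is capped by `max_words` or `max_lines`, so the record stays small however long the source document is.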
WebCrawler [9] full-text indexes the documents encountered, operating with multiple retrieval agents in a server-breadth-first approach. The rationale behind the notion of a breadth-first search with respect to servers rather than documents is that most servers currently have many related documents in a single subject area, rather than multiple subject areas. Skipping from server to server ensures broader coverage in results at the cost of requiring users to explore particular servers that seem to be relevant to see if they truly contain what is sought. Of course, subsequent passes by WebCrawler can reduce this coverage gap by eventually indexing the full set of documents on a given server.
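A server-breadth-first ordering of this kind can be sketched as a round-robin over per-host queues. The function below is a hypothetical simplification: it orders an already known list of URLs rather than running a live crawl, and it should not be read as WebCrawler's actual scheduling logic.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def server_breadth_first(urls):
    """Yield URLs round-robin by host, so that every server is visited once
    before any server is visited a second time."""
    queues = OrderedDict()
    for url in urls:
        # Group pending URLs by the server (host) that holds them.
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)
    while queues:
        # Take one document from each server in turn.
        for host in list(queues):
            yield queues[host].popleft()
            if not queues[host]:
                del queues[host]
```

Servers with many documents are thus drained over several passes, which corresponds to the coverage gap that later passes close.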
Relationship with Our Other Projects
The RBSE Spider [4] retains both the
structure of the Web as a graph representation in a relational database
and a full text index of the HTML documents encountered. Searches can thus
be specified either as SQL queries against the database, resulting in
information concerning the nature of the Web itself, or against the full
text index, resulting in information concerning the contents of documents
that make up the Web. The full text index is currently supported both
through a full text index in the relational database and through a
slightly modified WAIS server (provided for performance reasons to support
simple searches). The spider selects candidates for retrieval and indexing
using a set of cached heuristics. The architecture readily supports
multiple discovery modes through respecification of the candidate retrieval
query.
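To make the graph representation concrete, the following sketch stores documents and links in two relational tables and poses two queries against them. The schema, table names and function names are assumptions made for illustration (using SQLite in place of the Spider's database), not the Spider's actual schema.

```python
import sqlite3


def build_graph(documents, links):
    """Load (url, title) documents and (source, target) links into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE link (source TEXT, target TEXT)")
    conn.executemany("INSERT INTO document VALUES (?, ?)", documents)
    conn.executemany("INSERT INTO link VALUES (?, ?)", links)
    return conn


def inbound_counts(conn):
    """A query about the Web's structure: how often is each URL linked to?"""
    return conn.execute(
        "SELECT target, COUNT(*) AS n FROM link "
        "GROUP BY target ORDER BY n DESC, target"
    ).fetchall()


def retrieval_candidates(conn):
    """One possible discovery mode: link targets not yet retrieved."""
    rows = conn.execute(
        "SELECT DISTINCT target FROM link "
        "WHERE target NOT IN (SELECT url FROM document) ORDER BY target"
    )
    return [row[0] for row in rows]
```

Replacing the query in `retrieval_candidates` with a different one (for example, restricting targets to a single host) would give a different discovery mode without changing the rest of the code.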
The Multimedia Oriented Repository Environment (MORE) [6] operates in conjunction with a stock HTTP server to provide access to a relational database of meta-data. MORE provides separate hierarchies of meta-classes and collections, and supports controlled access to proprietary collections through the definition of user groups. With the single exception of the system front page, the entire user interface is implemented as dynamically generated HTML.
MORE is a meta-data based repository - the information stored in its
underlying database is not the artifacts themselves, but rather
information concerning those artifacts, which are stored using other
mechanisms (the file system, another database, or another software package
such as a configuration management tool or CASE environment). The two
distinct representation mechanisms allow a mix of homogeneous (through the
class definition hierarchy) and heterogeneous information (through the
collection hierarchy).
Sulla acts as a proxy, with the user employing an unmodified Web client
from an arbitrary host to interact with the agent, which resides on a
particular host (typically the user's desktop system). We are currently
extending Sulla's scope beyond HTML documents to cover a variety of
information sources, both those accessed directly (e.g., HTML documents,
FTP files, WAIS databases and articles posted to newsgroups) and those
referenced by service agents.
Knowledge Representation
There are few practical information retrieval systems on the Web that operate with
anything other than a simple textual representation of a user query.
Recent work in the area of knowledge engineering has led to significant,
reusable environments for the construction of ontologies - models of the
world that comprehend the relationships and terminologies used by humans
in their reasoning and discourse.
Protocol Definition
Given a knowledge representation scheme, user-agent and agent-agent
interaction requires the ability to construct a fragment of an ontological
context through which a comparison can be made between the sought-after
goals and the information present within a given agent's knowledge base.
We will define an encoding of this representation fragment as the query
portion of a Uniform Resource Locator (URL). This encoding will allow
the layering of the agent protocol on top of the existing HyperText
Transfer Protocol (HTTP), avoiding the need to define and implement a
system/server to system/server communications mechanism.
Prototyping
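As a concrete illustration of the encoding proposed under Protocol Definition, the sketch below carries a small representation fragment in the query portion of a URL. Everything specific in it is an assumption: the fragment is modeled as (subject, relation, object) triples, the parameter names `s`, `r` and `o` are invented, and the host name is a placeholder. The actual encoding remains to be defined.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names for the three parts of each triple.
KEYS = ("s", "r", "o")


def encode_fragment(base_url, triples):
    """Encode (subject, relation, object) triples as a URL query string."""
    pairs = [(key, value) for triple in triples for key, value in zip(KEYS, triple)]
    scheme, netloc, path, _, _ = urlsplit(base_url)
    return urlunsplit((scheme, netloc, path, urlencode(pairs), ""))


def decode_fragment(url):
    """Recover the triples from a URL produced by encode_fragment."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    triples = []
    for i in range(0, len(pairs), 3):
        group = pairs[i:i + 3]
        # Reject queries that do not follow the s, r, o pattern.
        if tuple(key for key, _ in group) != KEYS:
            raise ValueError("malformed fragment query")
        triples.append(tuple(value for _, value in group))
    return triples
```

Because the fragment travels as an ordinary query string, any existing HTTP client or server can carry it unchanged, which is the property the layering on HTTP relies upon.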
Once we have the knowledge representation and transferral schemes
completed, each of our three prototypes (Sulla, Spider and MORE) will be
extended to support the new representations to improve the nature of their
interactions. These enhancements can easily be carried out in parallel
once a shared implementation of the knowledge representation library has
been completed. Furthermore, testing interactions based upon the
enhancements will require only pair-wise integration. This approach
provides scalability, as each of the prototypes is representative of a
distinct class of agent or provider system.
Bibliography