The ease of construction and potential Internet-wide impact of autonomous software agents on the World Wide Web have spawned a great deal of discussion and occasional controversy. Our experience in the Repository Based Software Engineering (RBSE) project with the design and operation of the RBSE Spider [4] indicates that such tools can create substantial value for users of the Web. Most current service agents involve Web spiders, which traverse the interlinked documents making up the Web, constructing an index of the information thus discovered. The difficulty with relying solely upon service agents to assess the relevance of an artifact to a user's interest profile is that none of the service agents provides any persistence of state concerning what the user has already been presented with. Each user must periodically poll any given service agent with a query, and then filter that which is new from that which has already been seen.
A user agent's architecture should reflect the concerns of the individual, rather than the concerns of the community. We are currently constructing (with support from Texas Instruments) Sulla, a user agent that supports long-lived, goal-oriented Web activity. Our current approach to agent interaction entails Sulla mimicking the behavior of a human interacting with each service agent. This approach suffers from the ambiguities of natural language and the limitations of interaction through simplistic query interfaces.
Related Work
InfoHarness [11] is an open, extensible system designed to provide access to large amounts of heterogeneous information through encapsulation of these information resources in meta-data objects. The system architecture comprises an HTTP gateway, one or more InfoHarness servers, one or more InfoHarness collections, a meta-data generator which populates the collections, and a set of access tools (e.g., WAIS, relational databases, etc.). Users interact with the system through the gateway, which transforms requests into a form acceptable to the servers. The servers then act upon each request by returning portions of the meta-data, or by routing appropriate requests on to the access tools (which are responsible for manipulation of actual data).
Harvest [2, 3] supports "gathering, indexing, caching, replicating, and accessing Internet information" [2]. It was designed for scalability and customization through the separation of gatherers, responsible for the acquisition of information, and brokers, responsible for collection, index generation and dissemination of that information. Gatherers run at provider sites, and transmit information thus acquired back to one or more brokers using a "summary object interchange format" [2]. This allows for a significant reduction in network overhead when the transmitted information is heavily summarized or when there are many documents involved. Brokers interact with one or more gatherers for initial acquisition and with other brokers where useful to further filter information already collected by those brokers.
PAINT (Personalized Adaptive Internet Navigation Tool) [8] supports hierarchical hotlists in conjunction with Mosaic. It is distinguished by being intended to support a single individual user, rather than a community of users. PAINT supports the creation of hierarchical clusters of Web resources as name spaces. The principal design goal was to simplify the comprehension of hotlist elements. Judging by the number of hotlist manipulation schemes springing up to support Mosaic, this is a significant problem for serious users of the Web.
The Lycos system [7] employs a GNU DBM file to store the information discovered during its exploration. The information stored for a given document includes the title, headings, the 100 most weighty words, the first 20 lines of the document and the size of the document, both in bytes and in words. The rationale behind these choices is the creation of a scheme that is finite in scope: the information concerning a document is not dependent upon the size of that document. Lycos caches the first 20 lines of the document for display as part of the results of a user search of the index, providing a limited context for the user without the need to access the matched documents.
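The bounded per-document record described above can be illustrated with a minimal Python sketch. The names `DocumentSummary` and `summarize` are hypothetical, and raw term frequency is used only as a stand-in for whatever word weighting Lycos actually applies, which is not specified here.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocumentSummary:
    """Fixed-scope record: its size does not grow with the document."""
    title: str
    headings: list
    top_words: list    # at most max_words entries
    first_lines: list  # at most max_lines entries
    size_bytes: int
    size_words: int


def summarize(title, headings, text, max_words=100, max_lines=20):
    """Build a bounded summary of a document's text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: plain frequency approximates "most weighty".
    top_words = [w for w, _ in Counter(words).most_common(max_words)]
    return DocumentSummary(
        title=title,
        headings=list(headings),
        top_words=top_words,
        first_lines=text.splitlines()[:max_lines],
        size_bytes=len(text.encode("utf-8")),
        size_words=len(words),
    )
```

Because every field except the title and headings is capped, a record of this shape could plausibly be stored as a single value in a DBM-style key/value file keyed by URL.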
WebCrawler [9] full-text indexes the documents encountered, operating with multiple retrieval agents in a server-breadth-first approach. The rationale behind the notion of a breadth-first search with respect to servers rather than documents is that most servers currently have many related documents in a single subject area, rather than multiple subject areas. Skipping from server to server ensures broader coverage in results at the cost of requiring users to explore particular servers that seem to be relevant to see if they truly contain what is sought. Of course, subsequent passes by WebCrawler can reduce this coverage gap by eventually indexing the full set of documents on a given server.
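One way to read "server-breadth-first" is as a round-robin over servers, taking a single pending document from each before returning to any of them. The following Python sketch illustrates that reading only; it is not WebCrawler's actual implementation. The function name and the `links` mapping (an in-memory stand-in for fetching a page and extracting its anchors) are assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlsplit


def server_breadth_first(seeds, links):
    """Yield URLs, visiting one pending document per server in rotation.

    `links` is a hypothetical mapping from a URL to the URLs it references.
    """
    queues = {}         # host -> deque of unvisited URLs on that host
    rotation = deque()  # hosts with pending URLs, in visiting order
    seen = set()

    def enqueue(url):
        if url in seen:
            return
        seen.add(url)
        host = urlsplit(url).netloc
        if host not in queues:
            queues[host] = deque()
            rotation.append(host)
        queues[host].append(url)

    for url in seeds:
        enqueue(url)

    while rotation:
        host = rotation.popleft()
        url = queues[host].popleft()
        yield url
        for target in links.get(url, ()):
            enqueue(target)
        if queues[host]:
            rotation.append(host)  # rejoin behind all other servers
        else:
            del queues[host]
```

With this ordering, a server holding many documents cannot monopolize the traversal: its remaining documents wait until every other server with pending work has been visited once.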
Relationship with Our Other Projects
The RBSE Spider [4] retains both the structure of the Web as a graph representation in a relational database and a full text index of the HTML documents encountered. Searches can thus be specified either as SQL queries against the database, resulting in information concerning the nature of the Web itself, or against the full text index, resulting in information concerning the contents of documents that make up the Web. The full text index is currently supported through a slightly modified WAIS server.
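The structural half of this design, the Web stored as a graph in a relational database and queried with SQL, can be sketched as follows. The two-table schema, the function names and the use of SQLite are assumptions chosen for illustration; the Spider's actual schema and database are not described here, and the full-text index is omitted.

```python
import sqlite3

# Hypothetical schema: one row per document, one row per hyperlink.
SCHEMA = """
    CREATE TABLE document (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
    CREATE TABLE link (source INTEGER REFERENCES document(id),
                       target INTEGER REFERENCES document(id));
"""


def build_graph(edges):
    """Load (source_url, target_url) pairs into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for source, target in edges:
        for url in (source, target):
            conn.execute(
                "INSERT OR IGNORE INTO document (url) VALUES (?)", (url,))
        conn.execute(
            "INSERT INTO link (source, target) "
            "SELECT s.id, t.id FROM document s, document t "
            "WHERE s.url = ? AND t.url = ?",
            (source, target))
    return conn


def most_referenced(conn, n=10):
    """Return (url, inbound_link_count) pairs, most referenced first."""
    return conn.execute(
        "SELECT d.url, COUNT(*) AS inbound "
        "FROM link l JOIN document d ON d.id = l.target "
        "GROUP BY d.url ORDER BY inbound DESC, d.url LIMIT ?",
        (n,)).fetchall()
```

A query such as `most_referenced` answers a question about the nature of the Web itself (which documents are most often linked to) without consulting document contents at all.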
The Multimedia Oriented Repository Environment (MORE) [6] operates in conjunction with a stock HTTP server to provide access to a relational database of meta-data. MORE provides separate hierarchies of meta-classes and collections and support for controlled access to proprietary collections through the definition of user groups. With the single exception of the system front page, the entire user interface is accomplished as dynamically generated HTML.
MORE is a meta-data based repository: the information stored in its underlying database is not the artifacts themselves, but rather information concerning those artifacts, which are stored using other mechanisms (the file system, another database, or another software package such as a configuration management tool or CASE environment). The two distinct representation mechanisms allow a mix of homogeneous information (through the class definition hierarchy) and heterogeneous information (through the collection hierarchy).
Sulla acts as a proxy, with the user employing an unmodified Web client from an arbitrary host to interact with the agent, which resides on a particular host (typically the user's desktop system). We are extending Sulla's current scope (HTML documents) with the ability to access a variety of information sources, both those accessed directly (e.g., HTML documents, FTP files, WAIS databases, articles posted to newsgroups, etc.) and those referenced by service agents.
Research Personnel
Dr. David Eichmann (Principal Investigator)
Eichmann is an Assistant Professor of Software Engineering at UHCL and Director of Research and Development for the Repository Based Software Engineering program, a NASA-funded multi-year project in reuse and reengineering of large software systems. He is also principal investigator on the Sulla project, an industry-funded project developing a user agent for the World Wide Web. The combined projects involve three faculty, three full-time research staff, and ten graduate research assistants.
Michael Weisskopf (Senior Research Associate)
Weisskopf was recently promoted from Programmer Analyst to Senior Research Associate on the Repository Based Software Engineering program. His responsibilities there include coordination of graduate research assistants under his direction, independent interaction with RBSE collaborators (including NASA/JSC, Rockwell Space Operations, Unisys and Loral) and experimental deployment of World Wide Web environments developed by the RBSE Research and Development team.
Graduate Student Research Assistants
Three graduate students will be recruited from the Software Engineering and/or Computer Science Master's programs at UHCL. Each student will be assigned a prototype, will participate in the design of the interaction protocols, and will be responsible for implementing and testing those protocols within the assigned prototype.
Methodology
Knowledge Representation
There are few practical information retrieval systems that operate with anything other than a simple textual representation of a user query. Recent work in the area of knowledge engineering has led to significant, reusable environments for the construction of ontologies - models of the world that comprehend the relationships and terminologies used by humans in their reasoning and discourse.
Protocol Definition
Given a knowledge representation scheme, user-agent and agent-agent interaction requires the ability to construct a fragment of an ontological context through which a comparison can be made between the sought-after goals and the information present within a given agent's knowledge base. We will define an encoding of this representation fragment as the query portion of a Uniform Resource Locator (URL). This encoding will allow the layering of the agent protocol on top of the existing HyperText Transfer Protocol (HTTP), avoiding the need to define and implement a system/server to system/server communications mechanism.
Prototyping
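To make the URL-query encoding proposed under Protocol Definition concrete, the following Python sketch shows one possible round trip. Everything specific in it is an assumption made for illustration: the fragment is modelled as a list of (subject, relation, object) triples, it is serialized as JSON in a single query parameter named `fragment`, and the agent address is a placeholder. The encoding actually defined by the project may differ in all of these respects.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def encode_fragment(agent_url, triples):
    """Attach a knowledge fragment to an agent URL as its query portion."""
    payload = json.dumps([list(t) for t in triples], separators=(",", ":"))
    # urlencode percent-escapes the payload so it is safe inside a URL.
    return agent_url + "?" + urlencode({"fragment": payload})


def decode_fragment(url):
    """Recover the triples from a URL produced by encode_fragment."""
    query = parse_qs(urlsplit(url).query)
    return [tuple(t) for t in json.loads(query["fragment"][0])]
```

A request built this way is an ordinary HTTP GET, for example `encode_fragment("http://agent.example/query", [("document", "has-topic", "software reuse")])`, so an unmodified HTTP server or client can carry it without any additional transport mechanism.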
Once we have the knowledge representation and transferral schemes completed, each of our three prototypes (Sulla, Spider and MORE) will be extended to support the new representations to improve the nature of their interactions. These enhancements can easily be carried out in parallel once a shared implementation of the knowledge representation library has been completed. Furthermore, testing interactions based upon the enhancements will require only pair-wise integration. This approach provides scalability, as each of the prototypes is representative of a distinct class of agent or provider system.
Schedule
We have structured the proposal as a two year effort, but plan distinct annual milestones for evaluation and feedback (durations given are from initiation of project). Work on design will involve the team as a whole. Work on prototypes will proceed in parallel, with pair-wise integration testing occurring as the appropriate prototypes become available: