Semantic Levels of Web Index Interaction(1)
David Eichmann
Repository Based Software Engineering Program
University of Houston -- Clear Lake
2700 Bay Area Boulevard
Houston, TX 77058
eichmann@rbse.jsc.nasa.gov
Table of Contents
1 -- Introduction
2 -- A Strawman Architecture to Support Scalable Shared Indexing
References
1 -- Introduction
Operators of Web indexing facilities are faced with two dramatic drivers: the explosive growth of the shared artifact that they attempt to index (the Web) and the increasing expectations placed upon them by their clientele. In [3], I posed, among other criteria, that service agents should attempt to be authoritative, that is, up-to-date and as complete as reasonable regarding their covered domain. Satisfying the expectations of the user community and the criterion of authoritativeness leads inevitably to the need for indexer/providers of the Web to share information and avoid replication where feasible. Note, however, that I am not advocating the position that research and experimentation in this area are no longer of interest; far from it! I am instead suggesting that the era of "the index with the most URLs cataloged wins" is past and that a new phase of serious research into scalable distributed indexing is needed.
My thesis is that there must be a hierarchy of semantic structure associated with shared Web indices. Such a hierarchy can allow for low entry costs and gradual increases in sophistication by an indexer/provider, rather than expecting a substantial development effort in order to "play." Any such expectation would result in a significant lack of participation, effectively defeating the purpose of the initiative.
2 -- A Strawman Architecture to Support Scalable Shared Indexing
This section lays out a strawman architecture involving increasing levels of sophistication regarding the information that an indexer/provider could both absorb and serve. Each level can be configured in (at least) two distinct ways:
- total dump of index entity (e.g., a set of word occurrences) and the URL as pairs
- a single pair of the set of all index entities and the URL
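One possible reading of these two configurations can be sketched in Python. The sample data, URLs, and function names below are hypothetical and chosen only for illustration; the paper does not prescribe a concrete exchange format.

```python
from collections import defaultdict

# Hypothetical word-occurrence data for two URLs (illustrative only).
occurrences = [
    ("spider", "http://example.org/a.html"),
    ("index", "http://example.org/a.html"),
    ("index", "http://example.org/b.html"),
]


def as_entity_url_pairs(occurrences):
    """Configuration 1: a flat dump of (index entity, URL) pairs."""
    return sorted(set(occurrences))


def as_url_entity_sets(occurrences):
    """Configuration 2: one (set of all index entities, URL) pair per URL."""
    grouped = defaultdict(set)
    for entity, url in occurrences:
        grouped[url].add(entity)
    return [(frozenset(entities), url) for url, entities in sorted(grouped.items())]


pairs = as_entity_url_pairs(occurrences)
per_url = as_url_entity_sets(occurrences)
```

The first form is convenient for a receiver that merges entries into an inverted index; the second keeps everything known about a URL together, which may suit a receiver that stores per-document records.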
This level entails simple matching of word occurrences to URLs. This is scalable in a number of different ways:
- respond with all occurrences
- respond with X most frequent occurrences
- respond with relevant occurrences
- respond with X most relevant occurrences
The semantic support provided by this level of interaction is minimal: there is no context for the document (i.e., the document's local link neighborhood) or correlation to the inquirer's domain of interest (if any). However, even the simplest of spiders can support this interaction to some degree.
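The four response modes listed for this level might be served by a single lookup routine, sketched below. The paper does not define "relevant," so the sketch assumes the caller supplies a scoring function; the toy index, the `density` scorer, and the threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# word -> {url: number of occurrences}; toy data with hypothetical URLs
Index = Dict[str, Dict[str, int]]
Scorer = Callable[[str, int], float]

A = "http://example.org/a.html"
B = "http://example.org/b.html"
C = "http://example.org/c.html"
INDEX: Index = {"spider": {A: 5, B: 1, C: 3}}
DOC_LENGTH = {A: 1000, B: 10, C: 100}  # assumed document lengths in words


def density(url: str, count: int) -> float:
    """Assumed relevance measure: occurrences per word of document text."""
    return count / DOC_LENGTH[url]


def respond(index: Index, word: str, limit: Optional[int] = None,
            relevance: Optional[Scorer] = None,
            threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Return (url, score) pairs, best first.

    relevance=None ranks by raw frequency; otherwise only URLs scoring at
    least `threshold` are kept. `limit` caps the response at X entries.
    """
    hits = index.get(word, {})
    if relevance is None:
        scored = [(url, float(count)) for url, count in hits.items()]
    else:
        scored = [(url, relevance(url, count)) for url, count in hits.items()]
        scored = [(url, score) for url, score in scored if score >= threshold]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored if limit is None else scored[:limit]


all_hits = respond(INDEX, "spider")                       # all occurrences
top_two = respond(INDEX, "spider", limit=2)               # X most frequent
relevant = respond(INDEX, "spider", relevance=density, threshold=0.01)
most_relevant = respond(INDEX, "spider", limit=1, relevance=density, threshold=0.01)
```

Keeping the scorer as a parameter lets each provider plug in whatever notion of relevance it supports without changing the interchange itself.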
This level is a slight variation of that in section 2.1, adding the HTML construct in which the text occurs. This would allow, for instance, WWWW [7] to accept a feed from the RBSE Spider [2] or WebCrawler [9], and only request and/or store title strings, etc.
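As a rough illustration of this level, the sketch below tags each word with the innermost HTML element in which it occurs, so that a receiver could keep only title strings. It uses Python's standard `html.parser`; the class name, the triple layout, and the sample page are assumptions made for the example, not a format taken from any of the cited systems.

```python
import re
from html.parser import HTMLParser

# Elements that never enclose text, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "base"}


class ConstructTagger(HTMLParser):
    """Collect (word, enclosing tag, url) triples from one HTML page."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.stack = []    # currently open elements, innermost last
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        construct = self.stack[-1] if self.stack else ""
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.triples.append((word, construct, self.url))


def tag_occurrences(html, url):
    parser = ConstructTagger(url)
    parser.feed(html)
    parser.close()
    return parser.triples


URL = "http://example.org/page.html"  # hypothetical URL
PAGE = ("<html><head><title>RBSE Spider</title></head>"
        "<body><h1>Index</h1><p>Spider notes</p></body></html>")

triples = tag_occurrences(PAGE, URL)
# A receiver interested only in titles filters on the construct field.
titles_only = [t for t in triples if t[1] == "title"]
```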
This level is independent of those in sections 2.1 and 2.2, in that only the HTML header and the URL would be exchanged. This mode would allow exchange of the new proposals for embedded metadata [1] and potentially handle AliWeb-style interchanges as well [6].
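A header-only exchange could look roughly like the following sketch, which pairs a URL with the title and metadata found in the document head and ignores the body. Representing embedded metadata as `<meta name=... content=...>` elements is an assumption made here for illustration; the proposals cited above may use a different encoding, and the record layout is hypothetical.

```python
from html.parser import HTMLParser


class HeaderExtractor(HTMLParser):
    """Collect the title and named <meta> entries from the <head> only."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title_parts = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def header_record(url, html):
    """Build the (hypothetical) record exchanged at this level."""
    parser = HeaderExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "meta": parser.meta,
    }


URL = "http://example.org/agent.html"  # hypothetical URL
PAGE = ('<html><head><title> Sulla </title>'
        '<meta name="Keywords" content="agents, web"></head>'
        '<body><p>ignored <meta name="x" content="y"></p></body></html>')

record = header_record(URL, PAGE)
```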
The RBSE Spider's current public index is really two distinct databases, one containing word occurrence - URL pairs and one containing a relational representation of the Web as discovered to date. We're currently testing a new implementation that integrates the structure and text into a single data model supporting a mix of structural, temporal and textual search criteria. This implies an architecture level based upon retrieval of a local neighborhood of HTML artifacts (e.g., multi-file documents, technical report series, etc.) without the need to interrogate the provider site.
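Retrieving a local neighborhood from an already-stored link structure can be sketched as a bounded breadth-first traversal. This is not the RBSE Spider's implementation; the adjacency map, URLs, and hop limit are hypothetical.

```python
from collections import deque

ROOT = "http://example.org/report/index.html"
CH1 = "http://example.org/report/ch1.html"
CH2 = "http://example.org/report/ch2.html"
OTHER = "http://example.org/other.html"

# url -> URLs it links to, as previously discovered by a spider (toy data)
links = {
    ROOT: [CH1, CH2],
    CH1: [CH2],
    CH2: [OTHER],
    OTHER: [],
}


def neighborhood(links, start, max_hops):
    """Return {url: hop distance} for every URL within max_hops of start.

    Only the stored link table is consulted; the provider site itself is
    never contacted.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if distance[url] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in links.get(url, []):
            if target not in distance:
                distance[target] = distance[url] + 1
                queue.append(target)
    return distance


one_hop = neighborhood(links, ROOT, 1)
two_hops = neighborhood(links, ROOT, 2)
```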
Enhancing the structural model of section 2.4 through the attribution of nodes with information from sections 2.1, 2.2, or 2.3 yields an enriched layer in the architecture capable of supporting substantial search algorithms and intelligent agents [3, 4].
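To make the idea of attributed nodes concrete, the sketch below attaches word sets and titles to graph nodes and answers a query that mixes a structural criterion (hop distance) with a textual one (word membership). The `Node` fields, the `search` function, and the data are hypothetical and stand in for whatever attributes a provider actually exchanges.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    url: str
    title: str = ""                 # attribute in the style of sections 2.2/2.3
    words: frozenset = frozenset()  # attribute in the style of section 2.1
    links: tuple = ()               # structure in the style of section 2.4


def search(graph, start, term, max_hops):
    """Return (url, hops) for nodes within max_hops of start containing term."""
    seen = {start: 0}
    queue = deque([start])
    matches = []
    while queue:
        url = queue.popleft()
        node = graph[url]
        if term in node.words:
            matches.append((url, seen[url]))
        if seen[url] < max_hops:
            for target in node.links:
                if target in graph and target not in seen:
                    seen[target] = seen[url] + 1
                    queue.append(target)
    return matches


ROOT, A, B, C = (f"http://example.org/{name}.html" for name in ("root", "a", "b", "c"))
graph = {
    ROOT: Node(ROOT, "Root", frozenset({"index"}), (A, B)),
    A: Node(A, "Spiders", frozenset({"spider", "index"}), (C,)),
    B: Node(B, "Agents", frozenset({"agents"})),
    C: Node(C, "More spiders", frozenset({"spider"})),
}

near = search(graph, ROOT, "spider", 1)
wider = search(graph, ROOT, "spider", 2)
```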
This layer involves the construction of a distributed conceptual model, based on knowledge representation, of artifacts accessible through the Web, effectively forming a meta-Web composed of formal characterizations of the Web itself using a shared ontology. The Knowledge Interchange Format (KIF) under development by the ARPA-sponsored Knowledge Sharing Effort [8] is an example of the type of notation that might be used between indexer/providers interacting within such a framework.
References
- [1] Desai, B. C., Semantic Header aka Cover Page, http://www.cs.concordia.ca/~faculty/bcdesai/semantic-header.html
- [2] Eichmann, D., RBSE's URL database, http://rbse.jsc.nasa.gov/eichmann/urlsearch.html
- [3] Eichmann, D., "Ethical Web Agents," Second International World-Wide Web Conference: Mosaic and the Web, Chicago, IL, October 18-20, 1994, pages 3-13.
- [4] Eichmann, D., Sulla - A User Agent for the Web, http://ricis.cl.uh.edu/agents/sulla.html
- [5] Fletcher, J., Jumpstation, http://www.stir.ac.uk/jsbin/js
- [6] Koster, M., ALIWEB (Archie Like Indexing the WEB), http://web.nexor.co.uk/aliweb/doc/aliweb.html
- [7] McBryan, O. A., World Wide Web Worm, http://www.cs.colorado.edu/home/mcbryan/WWWW.html
- [8] Neches, R., The Knowledge Sharing Effort, http://www-ksl.stanford.edu/knowledge-sharing/papers/kse-overview.html
- [9] Pinkerton, B., Finding What People Want: Experiences with the WebCrawler, http://webcrawler.cs.washington.edu/WebCrawler/WWW94.html
Footnotes
- (1) This work has been supported by NASA Cooperative Agreement NCC-9-16, RICIS research activity RB02.