22C:196 Text Retrieval & Text Mining Seminar

Spring 2007

Friday 9:30 am to 11:00 am.

3092 Main Library

3rd floor, Main Library (here is a map showing the building location.
The room is on the side closest to Burlington Street.)

Office hours: Thu 10:00 to 11:30 am and 1:00 to 2:30 pm, and by appointment (3067 Main Library).

Reading Lists from Previous Years

Some Key Conference Deadlines:

ACM SIGIR 2007 (January 28 deadline - Amsterdam)
ACM/IEEE JCDL 2007 (January 29 deadline - Vancouver)

Resources:

TREC web site
BioCreAtIve web site
Downloading SMART from Cornell University
Managing Gigabytes (MG) retrieval system
Lucene
Lucene in Action, by Erik Hatcher and Otis Gospodnetic. Manning Publications Co., 2004.
Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schutze. Cambridge UP, 2007. Draft.
Information Retrieval. C. J. van Rijsbergen. London: Butterworths, 1979.
Information Retrieval Interaction. Peter Ingwersen. Taylor Graham, 1992.
Information Retrieval: A Survey. Ed Greengrass. 2000.
CMU-Cambridge Statistical Language Modeling toolkit

Student Projects

Goal: This seminar course will cover current research in text retrieval and text mining. After reading some foundational papers and book chapters, we will study papers from journals (such as ACM TOIS, TOIT, Bioinformatics) and conference proceedings (such as ACM SIGIR, WWW, CIKM). Examples of problems include expert detection, web retrieval and web mining, ranking strategies, ambiguity resolution, knowledge discovery, web phenomena including social networks, information extraction, and text classification. Interested students (from beginning to advanced) are invited to participate in the reading group. It is run as a seminar, with individuals taking turns to present an overview of the selected paper and lead the discussion. Upon completion of this course, students will have gained broad exposure to a variety of current text-based research problems and applications. The semester-long project will allow students to gain significant understanding of a specific problem.

Special Focus: We will start with fairly introductory concepts and then move quickly to papers from different proceedings and journals. A major emphasis will be on problems studied in TREC (Text REtrieval Conference), an international forum for testing algorithms and models. Students must complete a project for this seminar course and are encouraged to select projects from the TREC framework.

Evaluation: Participation (15%), Project (70%), Project presentation (15%)

  1. January 19, 2007:

    Introduction to seminar.
    The Text REtrieval Conference. Chapter 1 from Experiment and Evaluation in Information Retrieval, edited by Ellen M. Voorhees and Donna K. Harman. MIT Press.

  2. January 26, 2007:

    Chapter 2: What is information retrieval (Greengrass book).
    Chapter 3: Approaches to IR (Greengrass book).
    Chapter 4: Classical Boolean Approach to IR (Greengrass book).
    Chapter 6: Vector Space Approach (Greengrass book; you may stop after 6.3).
    Chapter 6: Scoring and Term Weighting (Manning, Raghavan and Schutze book).
    Chapter 7: Vector Space Retrieval (Manning, Raghavan and Schutze book).

  3. February 2, 2007:

    Exploring the Similarity Space. Zobel and Moffat. ACM SIGIR Forum, 1998. (Do a web search or get from the ACM Digital Library).
    Chapter 6.4: Computation of Similarity between Document & Query (Greengrass book).
    Chapter 6.5: Latent Semantic Indexing ... (Greengrass book).
    Chapter 7 (up to and including 7.4.1): Probabilistic models ... (Greengrass book).

  4. February 9, 2007:

    Lucene - demos.

  5. February 16, 2007:

    TREC - Enterprise Track: website. Read the 2006 overview paper: ENT.OVERVIEW.pdf
    TREC - Spam Track: website. Read guidelines at that site and read the 2006 overview paper: SPAM.OVERVIEW.pdf

  6. February 23, 2007:

    TREC - Legal Track: website. Read the guidelines and the 2006 overview paper: LEGAL.OVERVIEW.pdf
    TREC - Blog Track. Read the 2006 overview paper: BLOG.OVERVIEW.pdf
    TREC - Terabyte Track. Read the 2006 overview paper: TERA.OVERVIEW.pdf

  7. March 2, 2007:

    Common Evaluation Measures. NIST document (2005)
    Each individual picks a paper from their favourite TREC track. Focus on methodology and results.
    Project selection is due: submit a brief one-page writeup.

  8. March 9, 2007:

    Tools for Projects. Presentation by Aditya Sehgal.
    TREC - Question answering track. Read the 2006 overview paper: QA.OVERVIEW.pdf
    The Open University at TREC 2006 Enterprise Track Expert Search Task. To be presented by Jeremy Robinson.
    SVM-Based Spam Filter with Active and Online Learning. To be presented by Nengda Jin.

  9. March 23, 2007:

    Relevance-Based Language Models. Victor Lavrenko and W. Bruce Croft.
    Slides from Brian Almquist on TREC Legal Track.

  10. March 30, 2007:

    Language Models for Expert Finding -- UIUC TREC 2006 Enterprise Track Experiments, H. Fang, L. Zhou, C.-X. Zhai, University of Illinois at Urbana-Champaign
    Information Retrieval Using Language Models. Kieran McDonald thesis (read Chapter 2: Information Retrieval Using Language Models; available from his web site).
    TREC 2006 Genomics Track Overview. W. Hersh et al.
    TREC 2007 Genomics Track protocol

  11. April 6, 2007: no class
    (set up individual meetings with me for April 9, 11 or 12)

  12. April 13, 2007:

    Thumbs up? Sentiment Classification using Machine Learning Techniques, by Pang, Lee and Vaithyanathan. Presented by J.T.
    Concept recognition and the TREC Genomics tasks. (Get from TREC 2006 site). Presented by Si-Chi.
    An Adaptive, Semi-Structured Language Model Approach to Spam Filtering on a New Corpus, by Ben Medlock. Presented by Aravind.

  13. April 20, 2007:

    Using Social Network Analysis to Automatically Discover Competitor Relationships from Business News, by Ma, Sheng and Pant. (Paper emailed to all). Presented by Jeremy Robinson.

  14. April 27, 2007:

    Answering relationship queries on the Web. Luo et al. WWW 2007. Get paper from here. Presented by Ritesh Nadhani.

    Bibliometric impact measures leveraging topic analysis. JCDL 2006. Get paper from here. Presented by Junfeng Zheng.

  15. May 4, 2007:

    Coauthorship networks and patterns of scientific collaboration. M. E. J. Newman, PNAS, 2004. Presented by Chris Timko.

  16. May 11, 2007 - Final Presentations. (9:30 am to 11:30 am)