ICML 2006 Tutorial - Ronen Feldman

Information Extraction (IE) is one of the most prominent techniques currently used in Text Mining. In particular, by combining Natural Language Processing tools, lexical resources and semantic constraints, it can provide effective modules for mining the biomedical literature or for helping to prevent terrorism. Complementary visualization tools enable the user to explore, check, and if required correct the results of the Text Mining process effectively.

As a first step in tagging documents, each document is processed to find (extract) Entities and Relationships that are likely to be meaningful and content-bearing. By “Relationships” we mean Facts or Events involving certain Entities. An “Event” might be that a company has entered into a joint venture to develop a new drug. A “Fact” might be that a gene causes a certain disease. Facts are static and usually do not change; Events are more dynamic and have a specific time stamp associated with them. The extracted information provides more concise and precise data for the mining process than more naive word-based approaches, such as those used for text categorization, and tends to represent concepts and relationships that are more meaningful and relate directly to the examined document’s domain.
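The distinction between Entities, Facts and Events can be sketched as a small data structure. This is only an illustrative sketch, not the representation used by any system covered in the tutorial: the class names, field names, and the sample entities (GeneX, DiseaseY, CompanyA, CompanyB) and date are all hypothetical. It assumes a Relationship counts as an Event exactly when it carries a time stamp, and as a Fact otherwise.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    """A content-bearing item found in a document (hypothetical schema)."""
    text: str          # surface string as it appears in the document
    entity_type: str   # e.g. "Gene", "Disease", "Company"


@dataclass(frozen=True)
class Relationship:
    """A Fact or an Event involving one or more Entities."""
    relation_type: str
    arguments: Tuple[Entity, ...]
    timestamp: Optional[date] = None  # assumption: only Events are dated

    @property
    def kind(self) -> str:
        # Facts are static; Events have a time stamp attached.
        return "Event" if self.timestamp is not None else "Fact"


# A Fact: a gene causes a certain disease (placeholder names).
fact = Relationship(
    relation_type="causes",
    arguments=(Entity("GeneX", "Gene"), Entity("DiseaseY", "Disease")),
)

# An Event: two companies enter a joint venture on a given (made-up) date.
event = Relationship(
    relation_type="joint_venture",
    arguments=(Entity("CompanyA", "Company"), Entity("CompanyB", "Company")),
    timestamp=date(2006, 1, 15),
)

for rel in (fact, event):
    names = ", ".join(e.text for e in rel.arguments)
    print(f"{rel.kind}: {rel.relation_type}({names})")
```

Records of this shape, rather than bags of words, are the kind of input that the mining steps described above would operate on.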

In this tutorial we will present the general theory of Information Extraction and demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for information extraction and outline the algorithms and data structures behind the systems. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of information extraction will be presented in the areas of business intelligence, competitive intelligence, bioinformatics, and military intelligence.

Outline

1. Introduction to Text Mining

a. The need for Text Mining

b. What is unique about Text Mining

c. Term Extraction

d. Introduction to Information Extraction

i. What is Information Extraction?

ii. Architecture of Information Extraction Systems

2. A general architecture for Text Mining

a. Browsing

i. Browsing Large Text Collections

ii. Dynamic Browsers

b. Text Analytics

i. Taxonomy Construction and Refinement

ii. Comparing Distributions

iii. Trend Analysis

iv. Isolating interesting patterns

v. Association Generation

1. Maximal Associations

2. Association Clustering and Filtering Algorithms

c. Text Mining Query Languages

d. Visualizations

i. Visual interfaces for KDD query languages

ii. Keyword Graphs

iii. Visualizing the evolution of concept relationships

3. Information Extraction in Depth

a. Types of IE Tasks

i. Entity Extraction

ii. Fact Extraction

iii. Event Extraction

b. Preprocessing Techniques

i. Zoning

ii. Part-of-Speech Tagging

iii. Morphological Analysis

iv. Shallow Parsing

c. Types of IE systems

i. Rule Based Systems

1. Propositional systems

2. First Order Systems

ii. Machine Learning Based Systems

1. IE Rule Learning Approaches

a. PALKA

b. CRYSTAL

c. RAPIER

d. SRV

2. Bootstrapping Approaches

a. Mutual Bootstrapping

b. Multi-Class Bootstrapping

3. Classic HMM models

a. Learning emission probabilities

b. Learning the HMM Topology

4. Bigram models

a. Using Back-off models

b. Modeling unknown words

5. Creating hybrids between ML and RB systems

a. TEG

b. MERGE

6. Classification based IE

a. SVM

iii. Unsupervised Learning

1. The KnowItAll approach

a. Using HIT COUNT

b. Discriminator Phrases

c. Using Bootstrapping

2. KnowItNow

a. Using the BE Engine

3. The URES System

a. Seed Acquisition

b. Pattern Learning

c. Using the ADIOS Algorithm

d. Anaphora Resolution

i. Manually Encoded Algorithms

ii. Machine Learning Algorithms

e. Environments for Creating IE Systems

i. Programmer’s workbench

ii. Visual Interfaces

f. Evaluation of IE Systems

i. MUC

ii. ACE

4. Applications of Information Extraction

a. Financial Applications

i. Fraud Detection

ii. Money Laundering

iii. Stock Market Prediction

iv. Demo: Extraction of relationships between business entities.

b. Military Applications

i. Anti-Terror Applications

ii. Live Demo: Information Extraction of terror-related events based on public news sources.

c. Information Extraction for Bioinformatics

i. Extraction of Genes, Gene Products, Proteins from Scientific Articles.

ii. Extracting Experimental Evidence about Genes, Proteins and their relationships from Medline Articles.

d. Competitive Intelligence Applications

Lecturer’s Biography

Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in New York. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and has authored numerous papers on these topics. He recently finished writing his book, "The Text Mining Handbook", to be published by Cambridge University Press in early 2006.