ICML 2006 Tutorial - Ronen Feldman
Information Extraction, Theory and Practice
Here are the slides for the tutorial: Click Here to Download the ICML Tutorial Slides
Abstract
Information Extraction (IE) is one of the most prominent techniques currently used in Text Mining. In particular, by combining Natural Language Processing tools, lexical resources, and semantic constraints, it can provide effective modules for mining the biomedical literature or for helping to prevent terrorism. Complementary visualization tools enable the user to explore, check, and (if required) correct the results of the Text Mining process effectively.
As a first step in tagging documents, each document is processed to find (extract) Entities and Relationships that are likely to be meaningful and content-bearing. By “Relationships” we mean Facts or Events involving certain Entities. A possible “Event” is that a company has entered into a joint venture to develop a new drug. A “Fact” may be that a gene causes a certain disease. Facts are static in nature and usually do not change; events are more dynamic and have a specific time stamp associated with them. The extracted information gives the mining process more concise and precise data than naive word-based approaches, such as those used for text categorization, and it tends to represent concepts and relationships that are more meaningful and relate directly to the domain of the examined document.
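The distinction between Entities, Facts, and Events can be made concrete with a small sketch. The following is an illustration only, not the tutorial's own system: the class names, the entity type labels, the two regular-expression patterns, and the sample names (ABC1, Acme, Globex) are all hypothetical and chosen for this example. Real IE systems rely on much richer rules or learned models.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # illustrative type label, e.g. "Gene", "Disease", "Company"


@dataclass(frozen=True)
class Fact:
    """A static relationship between entities (no time stamp)."""
    relation: str
    subject: Entity
    obj: Entity


@dataclass(frozen=True)
class Event:
    """A dynamic relationship that carries a time stamp."""
    relation: str
    participants: Tuple[Entity, ...]
    date: Optional[str]


# Hypothetical toy patterns, assumed here purely for illustration.
FACT_PATTERN = re.compile(
    r"gene (?P<gene>\w+) causes (?P<disease>[\w ]+)", re.IGNORECASE
)
EVENT_PATTERN = re.compile(
    r"On (?P<date>\d{4}-\d{2}-\d{2}), (?P<a>[A-Z]\w+) "
    r"entered into a joint venture with (?P<b>[A-Z]\w+)"
)


def extract(sentence):
    """Return the Facts and Events that the toy patterns find in a sentence."""
    results = []
    fact = FACT_PATTERN.search(sentence)
    if fact:
        results.append(
            Fact(
                "causes",
                Entity(fact.group("gene"), "Gene"),
                Entity(fact.group("disease"), "Disease"),
            )
        )
    event = EVENT_PATTERN.search(sentence)
    if event:
        results.append(
            Event(
                "joint_venture",
                (Entity(event.group("a"), "Company"),
                 Entity(event.group("b"), "Company")),
                event.group("date"),
            )
        )
    return results


if __name__ == "__main__":
    print(extract("The gene ABC1 causes example syndrome."))
    print(extract("On 2006-06-25, Acme entered into a joint venture "
                  "with Globex to develop a new drug."))
```

The only structural difference between the two relationship types in this sketch is the date field on Event, which mirrors the point above that events have a time stamp while facts do not.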
In this tutorial we will present the general theory of Information Extraction and demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for information extraction and outline the algorithms and data structures behind the systems. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of information extraction will be presented in the areas of business intelligence, competitive intelligence, bioinformatics, and military intelligence.
Outline
1. Introduction to Text Mining
a. The need for Text Mining
b. What is unique about Text Mining
c. Term Extraction
d. Introduction to Information Extraction
i. What is Information Extraction?
ii. Architecture of Information Extraction Systems
2. A general architecture for Text Mining
a. Browsing
i. Browsing Large Text Collections
ii. Dynamic Browsers
b. Text Analytics
i. Taxonomy Construction and Refinement
ii. Comparing Distributions
iii. Trend Analysis
iv. Isolating interesting patterns
v. Association Generation
1. Maximal Associations
2. Association Clustering and Filtering Algorithms
c. Text Mining Query Languages
d. Visualizations
i. Visual interfaces for KDD query languages
ii. Keyword Graphs
iii. Visualizing the evolution of concept relationships
3. Information Extraction in Depth
a. Types of IE Tasks
i. Entity Extraction
ii. Fact Extraction
iii. Event Extraction
b. Preprocessing Techniques
i. Zoning
ii. Part of Speech Tagging
iii. Morphological Analysis
iv. Shallow Parsing
c. Types of IE systems
i. Rule Based Systems
1. Propositional systems
2. First Order Systems
ii. Machine Learning Based Systems
1. IE Rule Learning Approaches
a. PALKA
b. CRYSTAL
c. RAPIER
d. SRV
2. Bootstrapping Approaches
a. Mutual Bootstrapping
b. Multi-Class Bootstrapping
3. Classic HMM models
a. Learning emission probabilities
b. Learning the HMM Topology
4. Bigram models
a. Using Back-off models
b. Modeling unknown words
5. Creating hybrids between ML and RB systems
a. TEG
b. MERGE
6. Classification based IE
a. SVM
iii. Unsupervised Learning
1. The KnowItAll approach
a. Using HIT COUNT
b. Discriminator Phrases
c. Using Bootstrapping
2. KnowItNow
a. Using the BE Engine
3. The URES System
a. Seed Acquisition
b. Pattern Learning
c. Using the ADIOS Algorithm
d. Anaphora Resolution
i. Manually Encoded Algorithms
ii. Machine Learning Algorithms
e. Environments for Creating IE Systems
i. Programmer’s workbench
ii. Visual Interfaces
f. Evaluation of IE Systems
i. MUC
ii. ACE
4. Applications of Information Extraction
a. Financial Applications
i. Fraud Detection
ii. Money Laundering
iii. Stock Market Prediction
iv. Demo: Extraction of relationships between business entities.
b. Military Applications
i. Anti-Terror Applications
ii. Live Demo: Information Extraction of terror related events based on public news sources.
c. Information Extraction for Bioinformatics
i. Extraction of Genes, Gene Products, Proteins from Scientific Articles.
ii. Extracting Experimental Evidence about Genes, Proteins and their relationships from Medline Articles.
d. Competitive Intelligence Applications
Lecturer’s Biography
Ronen Feldman is a senior lecturer at the Mathematics and
Computer Science Department of Bar-Ilan University in Israel, and the
Director of the Data Mining Laboratory. He received his B.Sc. in Math,
Physics and Computer Science from the Hebrew University, M.Sc. in
Computer Science from Bar-Ilan University, and his Ph.D. in Computer
Science from Cornell University in NY. He was an Adjunct Professor at
NYU Stern Business School. He is the founder of ClearForest Corporation,
a Boston-based company specializing in the development of text mining tools
and applications. He has given more than 30 tutorials on text mining and
information extraction and has authored numerous papers on these topics. He
has just finished writing his book, "The Text Mining Handbook", to be
published by Cambridge University Press in early 2006.