We develop a new framework for Wikipedia entity expansion and attribute extraction from the Web. Our framework takes as seed input a few existing entities that are automatically collected from a particular Wikipedia category, and explores their attribute infoboxes to obtain clues for discovering more entities in this category and the attribute content of the newly discovered entities. One characteristic of our framework is that it conducts discovery and extraction from semi-structured data record sets that are automatically collected from the Web. A semi-supervised learning model based on Conditional Random Fields is developed to handle extraction learning with the limited number of labeled examples derived from the seed entities. We use a proximate record graph, which captures alignment similarity among data records, to guide the semi-supervised learning process. The learning process can then leverage the unlabeled data in the record set by controlling label regularization under the guidance of the proximate record graph. Extensive experiments on different domains demonstrate the superiority of our framework in discovering new entities and extracting attribute content. Copyright © 2013 ACM.
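The idea of regularizing labels over a record similarity graph can be illustrated with a minimal sketch. This is not the authors' Conditional Random Fields model; it is generic label propagation over a weighted graph, shown only to make the role of the graph concrete. The function name `propagate_labels`, the toy similarity weights, and the two-label setup are all assumptions made for illustration.

```python
from typing import List, Optional


def propagate_labels(
    weights: List[List[float]],
    seed_labels: List[Optional[int]],
    num_labels: int,
    iterations: int = 50,
) -> List[List[float]]:
    """Smooth label distributions over a similarity graph (illustrative only).

    weights: symmetric n x n matrix of non-negative record similarities
             (a stand-in for alignment similarity between data records).
    seed_labels: label index for records labeled from seeds, None otherwise.
    """
    n = len(weights)
    # Labeled records get a one-hot distribution, unlabeled ones start uniform.
    dist = [
        [1.0 / num_labels] * num_labels
        if seed is None
        else [1.0 if k == seed else 0.0 for k in range(num_labels)]
        for seed in seed_labels
    ]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i][j] for j in range(n) if j != i)
            if seed_labels[i] is not None or total == 0:
                # Seed records stay clamped; isolated records are left unchanged.
                updated.append(dist[i])
                continue
            # An unlabeled record takes the similarity-weighted average
            # of its neighbors' current label distributions.
            updated.append([
                sum(weights[i][j] * dist[j][k] for j in range(n) if j != i) / total
                for k in range(num_labels)
            ])
        dist = updated
    return dist


# Toy graph: record 0 is a seed with label 0, record 3 a seed with label 1.
# Record 1 is most similar to record 0, record 2 is most similar to record 3.
weights = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
]
seeds = [0, None, None, 1]
result = propagate_labels(weights, seeds, num_labels=2)
predicted = [max(range(2), key=lambda k: row[k]) for row in result]
```

In this toy setting the unlabeled records end up leaning toward the label of the seed record they most resemble, which is the intuition behind letting a record graph guide learning from few labeled examples.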
Publication status: Published - Feb 2013
Citation: Bing, L., Lam, W., & Wong, T.-L. (2013, February). Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Paper presented at The 6th ACM International Conference on Web Search and Data Mining, Auditorium Antonianum, Roma, Italy.
- Semi-supervised learning
- Information extraction
- Entity expansion
- Proximate record graph