Leveraging visual features and hierarchical dependencies for conference information extraction

Yue YOU, Guandong XU, Jian CAO, Yanchun ZHANG, Guangyan HUANG

Research output: Chapter in Book/Report/Conference proceeding > Chapters

3 Citations (Scopus)

Abstract

Traditional information extraction methods rely mainly on visual features; because they do not consider the hierarchical dependencies within the paragraph structure, some important information is missed. This paper proposes an integrated approach for extracting academic information from conference Web pages. First, Web pages are segmented into text blocks by a new hybrid page segmentation algorithm that combines visual features with the DOM structure. Then, these text blocks are labeled by a Tree-structured Conditional Random Fields model, which differentiates block functions using visual features, semantic features and hierarchical dependencies. Finally, an additional post-processing step tunes the initial annotation results. Our experimental results on real-world data sets demonstrate that the proposed method can effectively and accurately extract the needed academic information from conference Web pages. Copyright © 2013 Springer-Verlag Berlin Heidelberg.
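
The first step described in the abstract, splitting a page into text blocks that carry both a visual cue and a position in the DOM hierarchy, can be illustrated with a minimal sketch. This is not the authors' segmentation algorithm, whose details the abstract does not give. It assumes that tag names stand in for visual features (a real system would use rendered layout such as font size and position), and that each block's hierarchical parent is simply the most recent heading. All names here (`BlockSegmenter`, `segment`, `HEADING_TAGS`, `BOLD_TAGS`, `VOID_TAGS`) are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: tag names approximate visual features; the paper's actual
# features are not listed in the abstract.
HEADING_TAGS = {"h1", "h2", "h3"}
BOLD_TAGS = {"b", "strong"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never closed


class BlockSegmenter(HTMLParser):
    """Collect one candidate block per text node, with simple features."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # one dict per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text:
            return
        self.blocks.append({
            "text": text,
            "depth": len(self.stack),  # DOM depth of the text node
            "heading": any(t in HEADING_TAGS for t in self.stack),
            "bold": any(t in BOLD_TAGS for t in self.stack),
            "parent": None,
        })


def segment(html):
    """Return text blocks; each non-heading block points at the latest heading."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    last_heading = None
    for index, block in enumerate(parser.blocks):
        if block["heading"]:
            last_heading = index
        else:
            block["parent"] = last_heading
    return parser.blocks


page = ("<div><h2>Important Dates</h2><p>Submission: <b>Jan 5</b></p>"
        "<h2>Venue</h2><p>Sydney</p></div>")
for block in segment(page):
    print(block)
```

In a pipeline like the one the abstract outlines, dictionaries of this kind would be the input to the labeling model, with the `parent` links supplying the tree structure over which dependencies between block labels are defined.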

Original language: English
Title of host publication: Web technologies and applications: 15th Asia-Pacific Web Conference, APWeb 2013, Sydney, Australia, April 4-6, 2013, Proceedings
Editors: Yoshiharu ISHIKAWA, Jianzhong LI, Wei WANG, Wenjie ZHANG, Rui ZHANG
Publisher: Springer
Pages: 404-416
ISBN (Electronic): 9783642374012
ISBN (Print): 9783642374005
DOIs: https://doi.org/10.1007/978-3-642-37401-2_41
Publication status: Published - 2013

Citation

You, Y., Xu, G., Cao, J., Zhang, Y., & Huang, G. (2013). Leveraging visual features and hierarchical dependencies for conference information extraction. In Y. Ishikawa, J. Li, W. Wang, W. Zhang, & R. Zhang (Eds.), Web technologies and applications: 15th Asia-Pacific Web Conference, APWeb 2013, Sydney, Australia, April 4-6, 2013, Proceedings (pp. 404-416). Springer. https://doi.org/10.1007/978-3-642-37401-2_41

Keywords

  • Information extraction
  • Visual feature
  • DOM structure
  • Tree-structured Conditional Random Fields
