Normalizing web product attributes and discovering domain ontology with minimal effort

Tak Lam WONG, Lidong BING, Wai LAM

Research output: Chapter in Book/Report/Conference proceeding > Chapters

Abstract

We have developed a framework for normalizing product attributes from Web pages collected from different Web sites without the need for labeled training examples. It handles pages with different layout formats and content in an unsupervised manner, and can therefore be applied to a variety of domains with minimal effort. Our model is a generative probabilistic graphical model that incorporates Hidden Markov Models (HMMs) and considers both attribute names and attribute values, so that text fragments are extracted from Web pages and normalized in a unified manner. A Dirichlet process is employed to handle the unbounded number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method to automatically construct a domain ontology from the normalized product attributes output by inference on the graphical model. We have conducted extensive experiments on product Web pages collected from real-world Web sites in three different domains, and compared our framework with existing works, to demonstrate its effectiveness. Copyright © 2011 ACM.
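
To illustrate how a Dirichlet process prior avoids fixing the number of attributes in advance, the following is a minimal sketch of the Chinese restaurant process, a standard sampling view of the Dirichlet process. It is not the authors' model or inference method: the function name `crp_assignments`, the concentration parameter `alpha`, and the reading of clusters as normalized attributes are assumptions made only for illustration.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Hypothetical illustration: each item stands for a text fragment and
    each cluster for a normalized attribute. alpha > 0 controls how
    readily a new cluster is opened.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items already in cluster k
    assignments = []   # assignments[i] = cluster label of item i
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = crp_assignments(n_items=20, alpha=1.0)
    print(labels)
    print(sizes)
```

The number of clusters is not set by the caller; it grows with the data, which is the property the abstract relies on when it refers to an unbounded number of attributes.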
Original language: English
Title of host publication: Proceedings of the 4th ACM International Conference on Web Search and Data Mining
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 805-814
ISBN (Print): 9781450304931
Publication status: Published - 2011

Citation

Wong, T.-L., Bing, L. & Lam, W. (2011). Normalizing web product attributes and discovering domain ontology with minimal effort. Proceedings of the 4th ACM International Conference on Web Search and Data Mining (pp. 805-814). New York: Association for Computing Machinery.

Keywords

  • Information extraction
  • Graphical models
  • Web mining
