Normalizing web product attributes and discovering domain ontology with minimal effort

Tak Lam WONG, Lidong BING, Wai LAM

Research output: Contribution to conference (Papers)

12 Citations (Scopus)

Abstract

We have developed a framework for normalizing product attributes from Web pages collected from different Web sites without the need for labeled training examples. It can deal with pages composed of different layout formats and content in an unsupervised manner. As a result, it can handle a variety of different domains with minimal effort. Our model is based on a generative probabilistic graphical model incorporating Hidden Markov Models (HMM) that considers both attribute names and attribute values to extract and normalize text fragments from Web pages in a unified manner. A Dirichlet Process is employed to handle the unlimited number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method to automatically construct a domain ontology using the normalized product attributes, which are the output of the inference on the graphical model. We have conducted extensive experiments and compared our framework with existing works using product Web pages collected from real-world Web sites in three different domains to demonstrate its effectiveness. Copyright © 2011 ACM.
Original language: English
Publication status: Published - Feb 2011

Citation

Wong, T.-L., Bing, L. & Lam, W. (2011, February). Normalizing web product attributes and discovering domain ontology with minimal effort. Paper presented at The 4th ACM International Conference on Web Search and Data Mining, Sheraton Hong Kong Hotel and Towers, China.

Keywords

  • Information extraction
  • Graphical models
  • Web mining
