We have developed a framework for normalizing product attributes from Web pages collected from different Web sites without the need for labeled training examples. It can deal with pages that differ in layout format and content in an unsupervised manner. As a result, it can handle a variety of domains with minimal effort. Our model is a generative probabilistic graphical model combined with Hidden Markov Models (HMMs) that considers both attribute names and attribute values to extract and normalize text fragments from Web pages in a unified manner. A Dirichlet process is employed to handle the unlimited number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method to automatically construct a domain ontology from the normalized product attributes, which are the output of inference on the graphical model. We have conducted extensive experiments and compared our framework with existing work, using product Web pages collected from real-world Web sites in three different domains, to demonstrate its effectiveness. Copyright © 2011 ACM.
|Title of host publication||Proceedings of the 4th ACM International Conference on Web Search and Data Mining|
|Place of Publication||New York|
|Publisher||Association for Computing Machinery|
|Publication status||Published - 2011|
Citation: Wong, T.-L., Bing, L. & Lam, W. (2011). Normalizing web product attributes and discovering domain ontology with minimal effort. Proceedings of the 4th ACM International Conference on Web Search and Data Mining (pp. 805-814). New York: Association for Computing Machinery.
- Information extraction
- Graphical models
- Web mining