Structured Web Review Extraction and Opinion Summarization

Suke LI1,2, Zhong CHEN1, Liyong TANG1, Jianbin HU1

1 Institute of Software, School of Electronics Engineering and Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China
2 School of Software and Microelectronics, Peking University

Abstract

This paper focuses on structured Web review extraction and opinion summarization. We propose an opinion extraction framework and a feature-opinion based opinion summarization algorithm (FOOSA). Opinion-related Web pages are first clustered according to the similarity of their links and titles. Reviews are then extracted using tag patterns generated by a one-instance learning method. The extracted reviews are filtered with a set of rules to remove meaningless symbols. Finally, all reviews are segmented into feature-opinion sentences for further processing. For opinion summarization, FOOSA uses feature-opinion pairs as keys to construct feature-opinion (FO) buckets. An inter-bucket reduction method based on information entropy then removes sentences duplicated across buckets. FOOSA also performs intra-bucket clustering to select representative sentences from every bucket, and builds positive and negative text opinion summaries with our ranking method. We have implemented a structured review extraction and opinion summarization system based on this approach. The system provides not only text opinion summarization but also Web-based graphical summarization of overall sentiment.
Experiments show that our review extraction and opinion summarization approaches are effective and promising.

Keywords: Opinion Extraction; Opinion Summarization; Opinion Mining

1. Introduction

In recent years, the World Wide Web has grown dramatically, not only in size but also in the number of online reviews. More and more websites, such as amazon.com, provide services that let Web users publish their opinions about events, products and services. Web users can express their opinions on websites, forums, blogs, etc. Reviews on the Web are often organized as data records. One such record may contain the review holder, the publishing date, the review text, etc. Several reviews may appear on the same Web page, and one or several Web pages may contain reviews about the same topic. These reviews are often generated by programs from backend databases, so they often share the same HTML structure within one website.

We decompose the review extraction and opinion summarization problem into the following main subtasks:
I. Extract reviews.
II. Extract opinion sentences.
III. Determine the polarity of opinion sentences.
IV. Summarize opinions, including positive and negative opinion summarization.
This paper focuses on subtasks I and IV.

Related work. In recent years, opinion mining has been studied extensively and has become a significant research subject in the field of data mining [1]. However, little work has been done on feature-level opinion mining for Chinese reviews [2]. Regularly structured review data records can be extracted through wrapper induction, which usually requires manually labeled positive and negative examples to learn extraction rules. SoftMealy [3], WIEN [4] and Stalker [5] are examples of wrapper induction systems. However, manually labeling data is time consuming and labor intensive. Some research also focuses on automatic extraction approaches [6][7][8].
However, automatic extraction approaches may be less accurate and still require the items of interest to be identified manually [9]. The work of Hu and Liu [10][11][12] is among the earliest feature-based opinion mining efforts. In their work, opinion mining has three steps: mining product features, identifying the sentiment of each sentence, and summarizing the results. Turney et al. [13] use the pointwise mutual information (PMI) [14] method to identify the sentiment of reviews. Other works [15][16] use machine learning methods to identify the sentiment of a sentence. Opinion mining systems such as Review Seer [17], Red Opal [18], Opinion Observer [12] and OPINE [19] also provide opinion summarization.

Unfortunately, most of these systems mine product reviews and focus on the overall sentiment of opinions. Some of them, such as [12], do not provide text summarization, and others, such as [17], only show a few simple opinion sentences. In fact, many reviews are not about products. They may contain opinions about restaurants, shops, services, etc. Such opinion objects may have hundreds of features. If an opinion mining system provides both text summarization and graphical overall sentiment summarization organized by feature, an opinion viewer can easily find the opinions and the overall sentiment about the most important features. In addition, a website commonly holds hundreds of reviews about the same product or topic. Because there are so many of them, reviews of the same topic may be spread over several Web pages, and an opinion viewer has to click "next page" again and again to read all the opinions about that topic. Reading so many reviews is tedious and time consuming. A text opinion summary covering the most important features of the reviews about a topic therefore helps an opinion viewer understand the overall sentiment.
Furthermore, because of the nature of human languages, it is difficult for a machine to identify the sentiment of every sentence correctly, whereas humans can distinguish sentiment more readily.

In this paper, we use a one-instance learning approach to extract review data records. Our approach needs less manual work than traditional wrapper induction but still achieves promising precision. It is also robust and easy to use and implement. A user only needs to select and copy the text snippets he or she is interested in from a Web browser, and then fill in a Web page form to generate the extraction configuration files.

The major contributions of this paper are as follows:
1) We propose a Chinese opinion extraction framework. In this framework, Web pages are first clustered according to the similarity of their links and titles instead of their body contents. Reviews are then extracted using tag patterns generated by a one-instance learning method. All extracted records are filtered with a set of rules. Finally, reviews are segmented into sentences containing feature-opinion (FO) pairs for further processing.
2) We propose a feature-opinion based opinion summarization algorithm (FOOSA). The algorithm uses feature-opinion (FO) pairs as keys to construct FO buckets. It selects representative sentences from every bucket through intra-bucket clustering. Finally, the algorithm builds positive and negative opinion summaries with our ranking method. Our opinion summarization system also provides a Web bar graph that shows the distribution of the most important FO pairs. To the best of our knowledge, no opinion website in China currently provides both text opinion summarization and feature-based graphical overall sentiment summarization.
Note that our opinion summarization differs from traditional text summarization in three main aspects: 1) Our text opinion summarization focuses on sentences bearing FO pairs, not on all the sentences in the corpus; 2) Our text opinion summarization is overall sentiment summarization, so the sentences in our summaries are ranked by FO distribution and classified by positive or negative polarity; 3) Our summarization combines text opinion summarization with graphical overall sentiment summarization.

2. Opinion Extraction Framework

We use our vertical spider to crawl Web pages from opinion-related websites. Our opinion extraction framework is used only to extract structured reviews. Fig. 1 presents the main modules of the framework. It includes four main parts:
1) Web Page Cluster. This module clusters all the opinion Web pages related to the same topic.
2) Review Extractor. We use tag patterns and link patterns to extract the reviews in a Web page. A one-instance learning method is used to extract these reviews because of its simplicity and high precision.
3) Review Cleaner. Reviews may contain meaningless symbols. Filtering out this noise is important for the next processing step. The cleaner uses manually edited rules to filter the reviews.
4) Opinion Sentence Extractor.

Fig. 1 Opinion extraction framework
Fig. 2 An HTML tag tree

2.1. Opinion Web pages clustering

A Web page is opinion related if it contains opinions. Our database contains many Web pages that are unrelated to opinions, such as navigation and index pages. We find that most detailed opinion pages on the same topic in the same website have a similar URL structure and, when a title is present, the same title. Based on these observations we assume that (1) Web pages which contain reviews and have the same URL pattern usually have the same HTML structure layout; (2) Web pages on the same topic have URLs with the same structure.
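The two assumptions above can be illustrated with a minimal sketch that groups pages by domain and URL path shape, and then by identical title. This is not the paper's implementation: the function names `structure_key` and `cluster_pages`, the rule that replaces any path component containing a digit with `*`, and the page titles are all assumptions made for illustration. Only the two dianping.com shop URLs come from this section.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def structure_key(url):
    """Return (host, path shape) for a URL.

    Assumption for this sketch: a path component that contains a digit is
    treated as a record identifier and replaced by '*'.
    """
    parts = urlsplit(url)
    components = [c for c in parts.path.split("/") if c]
    shape = tuple(
        "*" if any(ch.isdigit() for ch in c) else c for c in components
    )
    return parts.netloc.lower(), shape


def cluster_pages(pages):
    """Group (url, title) pairs by URL structure, then by identical title."""
    clusters = defaultdict(list)
    for url, title in pages:
        clusters[(structure_key(url), title.strip())].append(url)
    return dict(clusters)


# The titles below are invented; the last entry is a hypothetical index page.
pages = [
    ("http://www.dianping.com/shop/510047", "Shop A reviews"),
    ("http://www.dianping.com/shop/510047_p2#ur", "Shop A reviews"),
    ("http://www.dianping.com/", "Home"),
]

for key, urls in cluster_pages(pages).items():
    print(key, urls)
```

In this sketch the two shop pages fall into one cluster because they share a host, a path shape and a title, while the index page forms its own cluster. A real system would likely need a softer similarity measure over path components than the exact match used here.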
The clustering process has two steps. First, all Web page URLs are clustered by their domain names and by the similarity of their URL path components. Then, all Web pages with the same title are clustered into one unit. For example, the URL http://www.dianping.com/shop/510047 may be a detailed page of an opinion topic, and http://www.dianping.com/shop/510047_p2#ur may be the second detailed page of the same opinion topic.

2.2. Review extraction and opinion sentence extraction

To obtain tag patterns, a tag tree is constructed from a review page instance. We use the HTML tag tree to extract the review records in the same review page, and adopt a single-instance learning method to extract the records. Before extraction, a user must provide two or more text snippets that present the same kind of information but come from different records. The extraction method has two sub-steps: 1) Get the tag pattern of a page record; 2) Use the tag pattern to extract the other records. Note that we do not use automatic record extraction methods such as [6][7], for the following reasons: 1) Our approach is robust, fast and simple; 2) The extraction precision and recall of our method for structured records are acceptable; 3) We have implemented a mechanism that automatically detects tag pattern changes, and a Web-based graphical extraction interface that does not require the user to know HTML syntax. Fig. 2 gives an HTML tag tree as an example. Note that each pair of HTML tags, such as