Polynomial Networks Model for Arabic Text Summarization
Mohammed Salem Binwahlan
Information Technology Department, College of Applied Science, Seiyun University
Received: 24 January 2023; Accepted: 08 February 2023; Published: 21 March 2023
Abstract- Online sources enable users to meet their information needs, but finding relevant information in such sources has become a major challenge and a time-consuming task due to the massive amount of data those sources contain. Automatic text summarization is an important facility for overcoming this problem. To this end, many text summarization algorithms have been proposed, based on different techniques and methodologies. Text features are the main inputs to text summarization, where each feature plays a different role in exposing the most important content. This study introduces polynomial networks (PN) for the Arabic text summarization problem. The role of the polynomial networks is to compute optimal weights through the training process of the PN classifier; these weights are then used to adjust the text feature scores. Adjusting the feature scores treats the features fairly according to their importance and plays an important role in differentiating between more and less important ones. The proposed model produces a summary of an original document by classifying each sentence as a summary sentence or a non-summary sentence. Six summarizers (Naïve Bayes, AQBTSS, Gen–Summ, LSA–Summ, Sakhr1 and Baseline–1) were used as benchmarks. The proposed model and the benchmarks were evaluated on the same dataset (EASC, the Essex Arabic Summaries Corpus). The results showed that the proposed model outperforms all six summarizers. In addition, comparing the error rates of the proposed model (PN classifier) and Naïve Bayes (NB classifier) makes it clear that the proposed model works better. In general, the proposed model provides a good enhancement, indicating that polynomial networks are a promising technique for the text summarization problem.
Keywords- Automatic text summarization, polynomial networks, sentence similarity, term frequency, text feature.
I. Introduction
Online sources enable users to meet their information needs, but finding relevant information in such sources has become a major challenge and a time-consuming task due to the massive amount of data those sources contain. For this reason, automatic text summarization, the process of scanning a full text to discover the parts bearing the most important meaning and presenting those parts within a limited space, is believed to be an important facility for overcoming this problem. The requirement of including the most informative parts in that limited space (which is called a summary) poses a big challenge. This challenge forces researchers in the area of text summarization to work in two directions: the first is how to determine the most important parts of a full text, and the second is how to control the inclusion of those parts in the limited space (the summary). A summary of the full text content helps readers decide whether or not to read the whole document, and reading the summary instead of the full text can save time and effort. To this end, many text summarization algorithms have been proposed, based on different techniques and methodologies. These algorithms are classified into two main categories, extractive and abstractive (Mani, 2001). Extractive algorithms insert the most important parts of the original document into the final summary without changing the structure of those parts (a simple copy). Similar to extractive algorithms, abstractive algorithms insert the most important parts of the original document into the final summary, but after editing the structure of those parts (performing paraphrasing). This makes abstractive algorithms more complicated than extractive ones.
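For illustration only, the following Python sketch shows the extractive approach in its simplest form: sentences are scored by a simple word-frequency heuristic (a placeholder, not the scoring used by the proposed model), the top-ranked sentences are copied verbatim into the summary, and the original document order is preserved.

    import re
    from collections import Counter

    def extractive_summary(text, k=3):
        # Split into sentences on end-of-sentence punctuation (a rough heuristic).
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        # Word frequencies over the whole document.
        freq = Counter(re.findall(r"\w+", text.lower()))

        def score(s):
            tokens = re.findall(r"\w+", s.lower())
            # Average word frequency, so long sentences are not automatically favored.
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        # Pick the k highest-scoring sentences, then restore document order.
        top = sorted(range(len(sentences)),
                     key=lambda i: score(sentences[i]), reverse=True)[:k]
        return " ".join(sentences[i] for i in sorted(top))

An abstractive system would instead rewrite the selected content, which is why it requires language generation and is considerably harder to build.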
The cornerstone of automatic text summarization systems is the set of approaches dating back to the 1950s and 1960s (Luhn, 1958; Edmundson, 1969). Such approaches depend on a linear combination of shallow features of text units to calculate the scores of those units (Luhn, 1958; Edmundson, 1969; Baxendale, 1958). Luhn (1958) proposed that the significance of a word is determined by the frequency of its occurrence and the significance of a sentence by the relative position of its significant words; a combination of these two measurements determines the significance factor of a sentence. The highest-scoring sentences are chosen as summary sentences (an "auto-abstract"), and the sentences are reordered according to their significance. Edmundson (1969) presented a summarization system for generating extracts in which four features are used: word frequency, positional importance, cue words, and title or heading words. Each sentence is scored by a weighted sum of the four features, with each feature weight assigned manually. The advantages of these approaches are simplicity and efficiency. In Baxendale's study (1958), a sentence is selected as a candidate for the summary based on its position; sentences appearing at the beginning and the end of a paragraph are given more significance. Zechner (1996) presented a purely statistical system that generates abstracts by extracting sentences, employing only the tf*idf weight to score the text sentences; the system is independent of domain knowledge and text characteristics. Although automatic text summarization has attracted researchers' attention since Luhn's work (Luhn, 1958), most of the work on it started from the year 2000 (Binwahlan, 2015).
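To make the linear-combination idea concrete, the Python sketch below scores a sentence Edmundson-style as a weighted sum of four shallow features. The cue-word list, the feature extractors, and the equal weights are simplified hypothetical stand-ins for illustration, not Edmundson's original resources; Baxendale's positional heuristic is folded in as the position feature.

    import re

    # Illustrative cue-word list; Edmundson's actual dictionaries were
    # much larger and hand-built.
    CUE_WORDS = {"significant", "important", "conclusion", "results"}

    def edmundson_score(sentence, position, n_sentences, title_words, freq,
                        weights=(1.0, 1.0, 1.0, 1.0)):
        """Weighted sum of four shallow features, with manually chosen weights."""
        tokens = set(re.findall(r"\w+", sentence.lower()))
        f_cue = len(tokens & CUE_WORDS)           # cue-word feature
        f_key = sum(freq[t] for t in tokens)      # keyword (frequency) feature
        f_title = len(tokens & title_words)       # title/heading overlap feature
        # Positional feature: first or last sentence gets extra credit,
        # in the spirit of Baxendale (1958).
        f_pos = 1.0 if position in (0, n_sentences - 1) else 0.0
        w1, w2, w3, w4 = weights
        return w1 * f_cue + w2 * f_key + w3 * f_title + w4 * f_pos

Here freq would be a word-frequency table (e.g., a collections.Counter) built over the whole document, and title_words the set of words in the document title. The manual choice of weights in such systems is exactly the limitation that motivates learning the weights automatically, as the proposed PN model does.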