[16],[17]. It is a set of metrics commonly used to evaluate text summarization, the task of automatically generating a concise summary of a longer text. It was designed to assess the quality of machine-generated summaries by comparing them to human reference summaries. ROUGE has variants such as ROUGE-N, which focuses on n-gram overlap; ROUGE-L, which uses the longest common subsequence (LCS); and ROUGE-S, which measures skip-bigram overlap. The ROUGE score ranges from 0 to 1, with higher values indicating better summary quality. It is widely used because it is objective, but it may not fully capture semantic meaning or coherence.
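To illustrate how these scores are obtained in practice, the following sketch computes ROUGE-1, ROUGE-2, and ROUGE-L with the open-source rouge-score package; the reference and candidate sentences are hypothetical placeholders rather than outputs of the system described later.

# Minimal sketch of ROUGE evaluation using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The court dismissed the appeal and upheld the original ruling."
candidate = "The appeal was dismissed and the ruling was upheld."

# ROUGE-1 and ROUGE-2 measure unigram/bigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Each result contains precision, recall, and F-measure, all in the range 0 to 1.
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F={result.fmeasure:.3f}")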
Another method used to evaluate the quality of text summarization is BERTScore [18],[19]. This method measures the similarity between the summary and the original text. It addresses the shortcomings of n-gram-based metrics by using contextualized token embeddings from models such as BERT to compute similarity. The process involves representing sentences with contextual embeddings, measuring cosine similarity, matching tokens to obtain precision and recall, weighting words by importance using IDF, and rescaling the values for readability. For a basic BERTScore calculation, the output consists of precision, recall, and F1 scores [20]. BERTScore makes text similarity measurement more accurate and balanced, with potential applications across many areas of natural language processing. However, the method has its pros and cons: for example, BERTScore can handle different types of texts, but it can be biased towards models that are similar to its underlying model.
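Analogously, the sketch below shows how BERTScore precision, recall, and F1 might be computed with the open-source bert-score package; the candidate and reference sentences are again hypothetical placeholders.

# Minimal sketch of BERTScore evaluation using the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The appeal was dismissed and the ruling was upheld."]
references = ["The court dismissed the appeal and upheld the original ruling."]

# Tokens are matched by cosine similarity of contextual embeddings;
# rescale_with_baseline maps the raw scores to a more readable range.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"Precision={P.item():.3f} Recall={R.item():.3f} F1={F1.item():.3f}")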
In essence, ensuring the transparency and interpretability of the summaries is crucial, and this is where explainable artificial intelligence (XAI) plays an important role. Examples of XAI methods in NLP include visualizing attention mechanisms in neural networks, generating textual explanations for model predictions, and interpreting the reasoning behind a model's decision-making process [21]. Vaswani et al. introduced the Transformer, an architecture built entirely on attention, in 2017 [22]. In traditional deep learning models such as LSTMs and RNNs, longer inputs pose challenges for retaining relevant information, prompting the need for attention mechanisms that signal which parts of the input the model should focus on [23]. However, Transformer models, which apply self-attention in all encoder and decoder layers, circumvent this issue [23]. Attention mechanisms are widely used in text summarization across diverse domains such as news, reviews, scientific papers, legal documents, and social media posts, where models such as the Pointer-Generator network, the Transformer, and BART exemplify this trend [24].
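To make the mechanism concrete, the sketch below implements scaled dot-product attention, the core operation behind Transformer self-attention; the token vectors are randomly generated for illustration and do not correspond to any real model.

# Minimal sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 tokens, 8-dimensional representations
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(weights.round(2))                                  # each row sums to 1: how much one token attends to the others

The attention weights computed here are exactly the values that visualization tools highlight to explain which input tokens a model focuses on.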
Jesse Vig [25] proposed an open-source tool for visualizing attention mechanisms in transformer-based language models. The tool offers three levels of granularity: the attention-head, model, and neuron views. Its application has been demonstrated on the BERT and GPT-2 models. The tool aids in interpreting model decisions and identifying patterns, such as detecting model bias, identifying recurring patterns, and linking individual neurons to model behaviour. This allows a comprehensive understanding of how the model attends to different parts of the input and how individual neurons contribute to the attention computation. It enhances model interpretability, enables targeted improvements through user manipulation, and offers versatility for various analysis tasks and model types.
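As a hedged illustration of how such a view can be produced, the sketch below loads a pretrained BERT model through the Hugging Face transformers library and passes its attention weights to the head view of the open-source bertviz package; the input sentence is a made-up example, and the final call is intended to render inside a Jupyter notebook.

# Minimal sketch of the attention-head view using bertviz (pip install bertviz transformers).
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence = "The court granted the motion to dismiss."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)                                 # attentions: one tensor per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)                     # interactive per-head attention view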
On the other hand, a theoretical analysis of local interpretable model-agnostic explanations (LIME) was carried out by Garreau and von Luxburg [26]. This explainer is commonly used to provide interpretability for machine learning models. The study derives closed-form expressions for the coefficients of the interpretable model when the function being explained is linear, demonstrating that LIME can uncover meaningful features proportional to the function's gradient. LIME aids in understanding model decisions, improving trust, and facilitating compliance with regulations. However, the study also highlights potential limitations of LIME, where poor parameter choices may cause the algorithm to overlook important features.
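For completeness, the sketch below applies LIME to a small, hypothetical text classifier using the open-source lime package; the training sentences, labels, and class names are invented solely to illustrate how word-level explanations are produced.

# Minimal sketch of LIME explanations for a text classifier (pip install lime scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Tiny, hypothetical training set.
texts = ["great ruling for the plaintiff", "the appeal was rejected",
         "a favourable settlement was reached", "the claim was dismissed"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["unfavourable", "favourable"])
explanation = explainer.explain_instance(
    "the court dismissed the appeal",
    pipeline.predict_proba,     # LIME perturbs the text and queries this probability function
    num_features=4,
)
print(explanation.as_list())    # (word, weight) pairs of the local interpretable model

LIME fits a sparse linear model on perturbed versions of the input, so the printed weights indicate how strongly each word pushes the prediction towards one class, which is the behaviour the theoretical analysis above examines.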
2. METHODS
The system proposed in this research is a text summarization system with an explainability feature, built using NLP and machine learning techniques. Figure 1 shows the architecture of the proposed system, which consists of backend and frontend parts connected through the Flask framework. In the frontend, the user interacts with the system through a website and can upload a legal document. Once the document is sent to the backend, it undergoes preprocessing such as chunking and tokenization. The preprocessed document is then passed to the trained BART model to generate a summary. Additionally, the attention weights of the tokens in the document are visualized to produce a highlighted original document,