Hybrid Zero-Shot NLP Pipeline for Text Summarization and Question Generation
1Oluwaseyi Ezekiel Olorunshola, 2Inioluwa Daniel Osibajo, 3Fatimah Adamu-Fika, 4Tsentob Joy Samson
1,2,4Department of Computer Science, Faculty of Computing, Air Force Institute of Technology, Kaduna, Nigeria.
3Department of Cyber Security, Faculty of Computing, Air Force Institute of Technology, Kaduna, Nigeria.
DOI: https://doi.org/10.51584/IJRIAS.2025.100700030
Received: 27 June 2025; Accepted: 01 July 2025; Published: 02 August 2025
ABSTRACT
This study presents a sophisticated hybrid zero-shot Natural Language Processing (NLP) pipeline for text summarization and multiple-choice question (MCQ) generation, specifically designed for low-resource educational environments. The system integrates Bidirectional Encoder Representations from Transformers (BERT) for extractive summarization, Bidirectional and Auto-Regressive Transformers (BART) for abstractive summarization, and the Text-to-Text Transfer Transformer (T5) for MCQ generation. Built using the Hugging Face Transformers library, Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers, the pipeline operates efficiently on a 12 GB Graphics Processing Unit (GPU) without the need for model fine-tuning. The workflow involves preprocessing academic texts, identifying key sentences through BERT and TextRank—a graph-based ranking algorithm—generating coherent and concise summaries with BART, and producing diverse, contextually relevant MCQs using T5. Evaluations were conducted on user-generated academic texts and the CNN/Daily Mail dataset for benchmarking. The system achieved a BERTScore F1 of 0.87, Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 and ROUGE-L of 0.54, Bilingual Evaluation Understudy (BLEU) of 0.20, Metric for Evaluation of Translation with Explicit ORdering (METEOR) of 0.35, a compression ratio of 0.37, a coherence score of 0.50, and 80% human-rated MCQ relevance—outperforming Generative Pre-trained Transformer (GPT-3) baselines. To assess educational impact, a study was conducted with 20 students of average academic standing using a 25-mark test generated by the pipeline. Results showed that 13 students scored above 20, 4 scored between 15–20, and 3 scored between 10–15, indicating that 85% of participants exceeded a 60% proficiency threshold. Qualitative analysis revealed minor factual inaccuracies in 10% of summaries and relevance drift in 15% of MCQs, highlighting areas for further enhancement. Overall, the study demonstrates the practical potential of transformer-based hybrid NLP pipelines for scalable, accessible educational content creation in resource-constrained contexts.
Keywords: Zero-Shot Learning, Text Summarization, Multiple-Choice Question (MCQ) Generation, Transformer Models, Low-Resource Environments
INTRODUCTION
The digital era has produced an unprecedented volume of textual content across academic, policy, and professional domains, creating a pressing need for intelligent systems capable of distilling and repurposing unstructured text for efficient consumption. Natural Language Processing (NLP), a pivotal subfield of artificial intelligence (AI), facilitates machine understanding and generation of human language, thereby transforming the way information is accessed and utilized. Two core NLP capabilities are text summarization and automated question generation (AQG). Summarization reduces lengthy documents into concise, coherent overviews that retain critical information, while AQG generates contextually relevant questions, supporting content repurposing, knowledge retrieval, and educational resource development. Deploying these capabilities in low-resource environments, characterized by limited computational infrastructure, intermittent power supply, and a scarcity of labeled data, presents significant challenges. Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), Bidirectional and Auto-Regressive Transformers (BART), and the Text-to-Text Transfer Transformer (T5), have demonstrated state-of-the-art performance in summarization and question generation.
However, these models typically require extensive fine-tuning and high computational capacity, making them less viable in constrained settings. Zero-shot learning, which enables pre-trained models to perform new tasks without additional task-specific training, offers a promising alternative. Yet while large generative models like Generative Pre-trained Transformer 3.5 (GPT-3.5) offer flexibility, their outputs often lack the structured format needed for educational applications—particularly for assessment tools like multiple-choice questions.
This study introduces a hybrid zero-shot NLP pipeline that integrates BERT for extractive summarization, BART for abstractive summarization, and T5 for multiple-choice question (MCQ) generation, specifically optimized for low-resource educational environments and deployable on a 12GB Graphics Processing Unit (GPU). To evaluate its educational impact, a pilot study was conducted involving 20 students with average grade point averages (GPA), who were administered a 25-mark test generated by the system. The results showed that 13 students scored above 20, 4 scored between 15–20, and 3 scored between 10–15, with 85% of participants achieving over 60% proficiency, demonstrating the pipeline’s practical efficacy. A closer analysis of the system’s output on a 300-word text about artificial intelligence showed that the generated summary effectively captured medical and ethical dimensions but omitted references to finance-related applications, which may have contributed to lower comprehension scores.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 outlines the methodology; Section 4 details system implementation; Section 5 presents and analyzes results; and Section 6 concludes the study and outlines future directions.
Review of Related Work
Recent advancements in natural language processing (NLP) have significantly transformed text summarization, automated question generation (AQG), and strategies for low-resource environments. These developments inform the present study’s design of a zero-shot hybrid pipeline tailored to educational contexts. Text summarization techniques can be broadly categorized into extractive, abstractive, and hybrid approaches. Extractive summarization involves selecting salient sentences from the source text. Traditional methods, such as TextRank and LexRank, employed statistical metrics like term frequency–inverse document frequency (TF-IDF) and graph-based centrality but were limited by their lack of semantic understanding (Mihalcea & Tarau, 2004). The advent of contextual embeddings, particularly through Bidirectional Encoder Representations from Transformers (BERT), enhanced extractive summarization, with BERTSUM offering improved sentence selection (Devlin et al., 2019; Liu & Lapata, 2019). Nonetheless, extractive summaries often suffer from rigid sentence structures and reduced fluency (Aggarwal, 2023).
Abstractive summarization, in contrast, generates novel paraphrased content, enhancing coherence and fluency. Notable models such as BART and the Text-to-Text Transfer Transformer (T5) excel in this domain—BART through denoising autoencoding and T5 via a unified text-to-text framework (Lewis et al., 2020; Raffel et al., 2020). Similarly, the PEGASUS model, which leverages gap-sentence prediction during pretraining, has demonstrated high performance in abstractive summarization tasks (Zhang et al., 2020). However, these models are prone to factual inconsistencies, particularly in domain-specific or specialized educational content (Afzal et al., 2023). To address this, hybrid approaches combine the accuracy of extraction with the fluency of abstraction. For instance, Divya et al. (2024) proposed a pipeline that integrates BERT-based sentence extraction with transformer-based paraphrasing. Additionally, coherence has been improved through architectural innovations such as dual attention mechanisms (Chersoni et al., 2021).
Automated question generation (AQG) plays a critical role in educational technologies, especially for multiple-choice question (MCQ) generation. T5’s versatile text-to-text paradigm supports zero-shot AQG (Khashabi et al., 2020; Raffel et al., 2020). Despite their capabilities, transformer models typically require extensive fine-tuning on large datasets, limiting their utility in low-resource educational environments (Brown et al., 2020; Liu, 2022). While zero-shot models like GPT-3 have shown promise, they often fail to produce pedagogically structured outputs (Khashabi et al., 2020). Earlier frameworks, including those by Divya et al. (2024) and Karanja and Matheka (2024), relied heavily on fine-tuning with general-purpose corpora, thereby limiting domain relevance.
To bridge these gaps, this study proposes a scalable, zero-shot NLP pipeline that integrates BERT with TextRank for extraction, and BART and T5 for abstractive summarization and AQG, respectively. The system is further enhanced with linguistic processing tools such as the Natural Language Toolkit (NLTK), spaCy, and SentenceTransformers. Designed to run efficiently on a 12GB GPU, the pipeline supports coherent, context-aware summarization and question generation in educational settings.
Research Design
The zero-shot hybrid Natural Language Processing (NLP) pipeline integrates extractive summarization, abstractive summarization, and multiple-choice question (MCQ) generation to process user-input educational texts without task-specific training. Optimized for low-resource settings, it operates on a 12GB Graphics Processing Unit (GPU) using Hugging Face Transformers, Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers, enabling scalable deployment in data-scarce educational environments.
Data Collection and Preprocessing
The pipeline processes arbitrary user-input educational texts, such as lecture notes or academic articles. The preprocess_text function normalizes whitespace, addresses edge cases like empty inputs, and employs NLTK’s Punkt tokenizer for robust sentence boundary detection across diverse writing styles. Using spaCy’s en_core_web_sm model, the pipeline extracts the top 10 non-stop-word keywords, which serve as semantic anchors for weighting sentences in TextRank-based extractive summarization (extractive_summary). This approach ensures focus on core concepts, enhancing relevance for educational applications. The preprocessing is designed to be lightweight, facilitating compatibility with modest hardware and supporting broad accessibility.
Model Selection
The pipeline integrates pre-trained transformer models for zero-shot performance in low-resource educational contexts. Bidirectional Encoder Representations from Transformers (BERT; bert-base-uncased) drives extractive summarization, using its bidirectional architecture and TextRank (nx.pagerank) to select semantically significant sentences, with a centroid-based fallback for robustness (extractive_summary). Bidirectional and Auto-Regressive Transformers (BART; facebook/bart-large-cnn) generates fluent abstractive summaries via beam search (abstractive_summary). Text-to-Text Transfer Transformer (T5; valhalla/t5-small-qg-hl) produces diverse MCQs through structured prompts (generate_mcqs). Sentence Transformers (all-MiniLM-L6-v2) enhances summary coherence and MCQ diversity, ensuring efficient operation on a 12GB GPU.
Pipeline Architecture
The zero-shot hybrid NLP pipeline is a modular system designed for text summarization and MCQ generation in low-resource educational environments. Tailored to minimize computational demands while ensuring semantic richness and linguistic fluency, it integrates pre-trained transformer models—Bidirectional Encoder Representations from Transformers (BERT; bert-base-uncased), Bidirectional and Auto-Regressive Transformers (BART; facebook/bart-large-cnn), and Text-to-Text Transfer Transformer (T5; valhalla/t5-small-qg-hl)—alongside Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers (all-MiniLM-L6-v2), operating efficiently on a 12GB Graphics Processing Unit (GPU) without task-specific fine-tuning.
The pipeline comprises six interconnected stages to transform raw educational texts (e.g., lecture notes, articles) into concise summaries and pedagogically relevant MCQs. Data collection accepts user-input texts, followed by preprocessing, which normalizes whitespace, tokenizes sentences using NLTK’s Punkt tokenizer, and extracts the top-10 non-stop-word keywords via spaCy’s en_core_web_sm model (extract_keywords) for downstream compatibility. Extractive summarization employs BERT embeddings with TextRank (nx.pagerank) to select semantically salient sentences, weighted by keywords, with a centroid-based fallback for robustness (extractive_summary). Abstractive summarization uses BART to generate fluent summaries via beam search (abstractive_summary), with sentences reordered for coherence using Sentence Transformers. Question generation leverages T5 to produce diverse MCQs targeting definitions, applications, and ethics, using structured prompts with fallback mechanisms to ensure variety (generate_mcqs). Evaluation assesses outputs using lexical, semantic, and coherence metrics, visualized via Matplotlib.
This modular architecture, as depicted in Figure 1, seamlessly integrates BERT, BART, T5, and Sentence Transformers, delivering a scalable, zero-shot solution for educational text processing that balances informativeness, coherence, and pedagogical value on modest hardware; a minimal orchestration sketch follows the figure.
Figure 1: Sequence Diagram of the Proposed Pipeline
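To make the stage flow concrete, the following is a minimal orchestration sketch in Python. It assumes the stage functions named in this paper (preprocess_text, extractive_summary, abstractive_summary, generate_mcqs, evaluate_summary) are in scope; the signatures and return types shown here are illustrative assumptions, not the pipeline’s actual interface, and each stage is sketched in the Implementation section below.

```python
def process_text(text: str, reference: str | None = None) -> dict:
    """Illustrative sketch of the six-stage pipeline described above."""
    sentences, keywords = preprocess_text(text)        # stages 1-2: collect + preprocess
    extract = extractive_summary(sentences, keywords)  # stage 3: BERT/TextRank selection
    summary = abstractive_summary(extract)             # stage 4: BART paraphrase + reorder
    mcqs = generate_mcqs(summary)                      # stage 5: T5 MCQ generation
    metrics = evaluate_summary(summary, reference) if reference else {}  # stage 6: evaluate
    return {"summary": summary, "mcqs": mcqs, "metrics": metrics}
```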
Evaluation Metrics
The pipeline’s performance is assessed using a suite of quantitative and qualitative metrics, each with a mathematical foundation:
- ROUGE-N: Measures n-gram overlap between generated and reference summaries, with ROUGE-1 (unigrams) and ROUGE-L (longest common subsequences) providing lexical and structural insights. The ROUGE-N score is computed here as the harmonic mean of n-gram precision and recall:

$$\mathrm{ROUGE\text{-}N} = \frac{2\,P_N R_N}{P_N + R_N}$$

where $P_N$ and $R_N$ are the n-gram precision and recall of the candidate summary against the reference.
- BLEU: Evaluates precision through n-gram overlap, adjusted by a brevity penalty that penalizes under-length summaries (Papineni et al., 2002). It is the geometric mean of n-gram precisions scaled by the brevity penalty BP:

$$\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \min\!\left(1,\, e^{\,1 - r/c}\right)$$

where $p_n$ is the modified n-gram precision with weight $w_n$, $r$ is the reference length, and $c$ is the candidate length.
- BERTScore F1: Computes semantic similarity between a candidate and reference text as the harmonic mean of precision and recall derived from cosine similarities of BERT token embeddings (Zhang et al., 2020):

$$F_{\mathrm{BERT}} = \frac{2\,P_{\mathrm{BERT}} R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$

- METEOR: The Metric for Evaluation of Translation with Explicit ORdering assesses lexical alignment by incorporating stemming, synonymy, and word order, offering a more nuanced evaluation of summary quality than surface overlap alone. It computes the harmonic mean of unigram precision and recall, adjusted by a penalty for word-order differences:

$$\mathrm{METEOR} = F_{\mathrm{mean}}\,(1 - \mathrm{Penalty}), \qquad F_{\mathrm{mean}} = \frac{10\,P R}{R + 9P}$$

- Compression Ratio: Quantifies conciseness as the ratio of summary word count to input word count:

$$\mathrm{CR} = \frac{\text{words in summary}}{\text{words in input}}$$

- Coherence Score: A custom metric that evaluates the semantic flow of a summary by combining local coherence (cosine similarity between consecutive sentences) and global coherence (average pairwise similarity among all sentences), weighted 80% and 20%, respectively, using Sentence Transformer embeddings (see the sketch after this list):

$$\mathrm{Coherence} = 0.8\cdot\frac{1}{N-1}\sum_{i=1}^{N-1}\cos(s_i, s_{i+1}) \;+\; 0.2\cdot\frac{2}{N(N-1)}\sum_{i<j}\cos(s_i, s_j)$$

where $s_i$ is the embedding of sentence $i$ and $N$ is the number of sentences.
- MCQ Diversity: Measures the variety of generated MCQs as one minus the average pairwise cosine similarity of their Sentence Transformer embeddings, ensuring diverse question content:

$$\mathrm{Diversity} = 1 - \frac{2}{n(n-1)}\sum_{i<j}\cos(q_i, q_j)$$

- Question Relevance: A human-rated metric (0–100%) assessing the pedagogical utility of MCQs, evaluated by students for relevance to the input text’s content and educational objectives. No mathematical formula is defined, as it relies on subjective human judgment.
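The coherence and diversity metrics are straightforward to compute from Sentence Transformer embeddings. The sketch below follows the definitions above; the function name calculate_mcq_diversity mirrors the paper’s, while coherence_score and implementation details (e.g., normalized-embedding dot products as cosine similarity) are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")

def coherence_score(sentences: list[str]) -> float:
    """0.8 * local (consecutive-sentence) + 0.2 * global (all-pairs) cosine similarity."""
    if len(sentences) < 2:
        return 1.0
    emb = st_model.encode(sentences, normalize_embeddings=True)
    sims = emb @ emb.T                  # cosine matrix, since embeddings are unit-normalized
    n = len(sentences)
    local = np.mean([sims[i, i + 1] for i in range(n - 1)])
    global_ = np.mean([sims[i, j] for i in range(n) for j in range(i + 1, n)])
    return float(0.8 * local + 0.2 * global_)

def calculate_mcq_diversity(questions: list[str]) -> float:
    """One minus the average pairwise cosine similarity of MCQ embeddings."""
    if len(questions) < 2:
        return 0.0
    emb = st_model.encode(questions, normalize_embeddings=True)
    sims = emb @ emb.T
    n = len(questions)
    avg = np.mean([sims[i, j] for i in range(n) for j in range(i + 1, n)])
    return float(1.0 - avg)
```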
Implementation
Model Loading
The load_models function initializes the BERT, BART, T5, and Sentence Transformer models, caching weights in /content/model_cache to reduce memory overhead on the 12GB GPU. Models are offloaded to a CUDA device if available, ensuring efficient utilization. Logging tracks model loading progress, enhancing debugging capabilities in low-resource settings where system monitoring is critical.
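A minimal sketch of what load_models might look like with the Hugging Face and Sentence Transformers APIs follows; the model identifiers and cache path come from the paper, while the exact loading calls and return structure are assumptions.

```python
import torch
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

CACHE_DIR = "/content/model_cache"                        # cache location used by the paper
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"   # offload to GPU if available

def load_models():
    """Load all four pre-trained models once, caching weights on disk."""
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir=CACHE_DIR)
    bert = AutoModel.from_pretrained("bert-base-uncased", cache_dir=CACHE_DIR).to(DEVICE)
    bart_tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn", cache_dir=CACHE_DIR)
    bart = AutoModelForSeq2SeqLM.from_pretrained(
        "facebook/bart-large-cnn", cache_dir=CACHE_DIR).to(DEVICE)
    t5_tok = AutoTokenizer.from_pretrained("valhalla/t5-small-qg-hl", cache_dir=CACHE_DIR)
    t5 = AutoModelForSeq2SeqLM.from_pretrained(
        "valhalla/t5-small-qg-hl", cache_dir=CACHE_DIR).to(DEVICE)
    st = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=CACHE_DIR, device=DEVICE)
    return bert_tok, bert, bart_tok, bart, t5_tok, t5, st
```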
Preprocessing and Tokenization
The preprocessing module ensures data readiness for text processing, enabling compatibility with zero-shot models like BERT, BART, and T5. The preprocess_text function normalizes whitespace by collapsing multiple spaces and handles edge cases, such as empty inputs, by logging warnings and returning empty strings, ensuring robustness for diverse, potentially noisy educational texts. Sentence tokenization is performed using NLTK’s Punkt tokenizer, which leverages unsupervised algorithms to segment text into sentences, accommodating varied writing styles and inconsistent punctuation. Keyword extraction employs spaCy’s en_core_web_sm model to identify the top 10 most frequent alphabetic, non-stop-word tokens, which serve as semantic anchors to weight sentences in the extractive summarization stage, enhancing focus on domain-specific terminology. No lemmatization or additional normalization is applied, but the pipeline maintains semantic integrity through frequency-based keyword filtering and robust tokenization, preparing the text for downstream summarization and question generation.
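The preprocessing behavior described above can be sketched as follows; the function name preprocess_text is the paper’s, while returning a (sentences, keywords) pair is an assumption for illustration.

```python
import re
from collections import Counter

import nltk
import spacy

nltk.download("punkt", quiet=True)           # Punkt sentence tokenizer data
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text: str) -> tuple[list[str], list[str]]:
    """Normalize whitespace, tokenize sentences, and extract the top-10 keywords."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    if not text:                              # edge case: empty or blank input
        return [], []
    sentences = nltk.sent_tokenize(text)
    doc = nlp(text)
    # top-10 most frequent alphabetic, non-stop-word tokens (no lemmatization, per the paper)
    counts = Counter(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    keywords = [word for word, _ in counts.most_common(10)]
    return sentences, keywords
```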
Extractive Summarization
Extractive summarization employs BERT (bert-base-uncased) in conjunction with an enhanced TextRank algorithm to select key sentences from the input text. Sentence embeddings, generated by Sentence Transformers (all-MiniLM-L6-v2), form a similarity matrix weighted by keyword presence to prioritize domain-relevant content. The similarity between sentences $s_i$ and $s_j$ is defined as:

$$\mathrm{sim}(s_i, s_j) = \frac{s_i \cdot s_j}{\lVert s_i \rVert\, \lVert s_j \rVert + \epsilon}\left(1 + \mathrm{KeywordScore}_i + \mathrm{KeywordScore}_j\right)$$

where $s_i$ is the embedding of sentence $i$, the first factor is the cosine similarity $\cos(\cdot, \cdot)$, $\mathrm{KeywordScore}_i$ quantifies the presence of the top-10 keywords extracted via spaCy’s en_core_web_sm, and $\epsilon = 10^{-6}$ prevents division by zero. The similarity matrix is used to construct a graph, and PageRank ranks sentences based on their centrality. A centroid-based fallback, using BERT embeddings, ensures robustness when keyword distribution is sparse or TextRank fails, selecting the sentences closest to the mean embedding vector. The number of selected sentences, $\mathrm{top}_n$, is dynamically set as $\min(\max(7, \lfloor N/1.5 \rfloor), N)$, where $N$ is the sentence count, balancing conciseness and coverage.
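A sketch of the extractive stage under these definitions follows; the exact form of KeywordScore (here, the fraction of top-10 keywords occurring in a sentence) is an assumption consistent with the prose, and the st_model parameter is the Sentence Transformer from load_models.

```python
import numpy as np
import networkx as nx

EPS = 1e-6  # epsilon from the similarity formula

def extractive_summary(sentences: list[str], keywords: list[str], st_model) -> str:
    """Keyword-weighted TextRank over sentence embeddings, with a centroid fallback."""
    n = len(sentences)
    if n == 0:
        return ""
    emb = st_model.encode(sentences)
    # assumed keyword score: fraction of top-10 keywords occurring in each sentence
    kw = np.array([sum(k in s.lower() for k in keywords) / max(len(keywords), 1)
                   for s in sentences])
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            cos = emb[i] @ emb[j] / (np.linalg.norm(emb[i]) * np.linalg.norm(emb[j]) + EPS)
            sim[i, j] = sim[j, i] = cos * (1 + kw[i] + kw[j])
    try:
        scores = nx.pagerank(nx.from_numpy_array(sim))   # rank sentences by centrality
    except nx.PowerIterationFailedConvergence:
        centroid = emb.mean(axis=0)                      # fallback: closeness to mean embedding
        scores = {i: float(emb[i] @ centroid) for i in range(n)}
    top_n = min(max(7, int(n / 1.5)), n)                 # paper's sentence-count rule
    chosen = sorted(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return " ".join(sentences[i] for i in chosen)        # preserve original order
```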
Abstractive Summarization
Abstractive summarization refines the extractive output using BART (facebook/bart-large-cnn) to generate fluent, paraphrased summaries. The model is configured with optimized hyperparameters: num_beams = 14, max_length = 550, min_length = 90, length_penalty = 1.0, and repetition_penalty = 1.8, minimizing redundancy while preserving coherence. Input text is prefixed with “summarize:” to prompt BART, processed in zero-shot mode without fine-tuning, ensuring compatibility with low-resource environments (e.g., 12GB GPUs). Post-generation, the optimize_sentence_order function reorders sentences to maximize coherence by solving:

$$\pi^{*} = \arg\max_{\pi} \sum_{i=1}^{N-1} \cos\!\left(s_{\pi(i)},\, s_{\pi(i+1)}\right)$$

where $\pi$ is a permutation of sentence indices, $s_{\pi(i)}$ is the Sentence Transformer embedding of the $i$-th sentence in the permutation, and $N$ is the number of sentences. A greedy algorithm iteratively selects the next sentence with the highest cosine similarity to the previous one, enhancing the summary’s semantic flow. This leverages pre-trained weights without fine-tuning, ensuring adaptability to low-resource constraints.
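The generation settings and greedy reordering can be sketched as below, assuming the tokenizer/model pair from load_models; the hyperparameters are those stated above, while the function signatures and the choice to keep the first sentence as the chain’s start are illustrative assumptions.

```python
import nltk

def abstractive_summary(text: str, bart_tok, bart, st_model, device: str = "cuda") -> str:
    """BART beam-search summary followed by greedy coherence reordering."""
    inputs = bart_tok("summarize: " + text, return_tensors="pt",
                      truncation=True, max_length=1024).to(device)
    ids = bart.generate(**inputs, num_beams=14, max_length=550, min_length=90,
                        length_penalty=1.0, repetition_penalty=1.8)
    summary = bart_tok.decode(ids[0], skip_special_tokens=True)
    return optimize_sentence_order(summary, st_model)

def optimize_sentence_order(summary: str, st_model) -> str:
    """Greedily chain sentences by cosine similarity to the previous one."""
    sents = nltk.sent_tokenize(summary)
    if len(sents) < 3:
        return summary
    emb = st_model.encode(sents, normalize_embeddings=True)
    order, remaining = [0], list(range(1, len(sents)))   # assume the first sentence leads
    while remaining:
        prev = emb[order[-1]]
        nxt = max(remaining, key=lambda j: float(emb[j] @ prev))  # greedy best match
        order.append(nxt)
        remaining.remove(nxt)
    return " ".join(sents[i] for i in order)
```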
Question Generation
Multiple-choice question (MCQ) generation leverages T5 (valhalla/t5-small-qg-hl) to produce up to five MCQs per input summary in zero-shot mode. A structured prompt instructs T5 to generate questions with four answer options and one correct answer, covering diverse aspects (e.g., definitions, applications, ethical concerns), formatted as: “Question: [text] A) [option1] B) [option2] C) [option3] D) [option4] Correct Answer: [correct option]”. The prompt is encoded with max_length = 1024, and generation uses num_beams = 12, no_repeat_ngram_size = 3, and max_length = 512 to ensure variety and quality. The calculate_mcq_diversity function quantifies question variety as:

$$\mathrm{Diversity} = 1 - \frac{2}{n(n-1)}\sum_{i<j}\cos\!\left(q_i,\, q_j\right)$$

where $q_i$ is the Sentence Transformer embedding of the $i$-th question and $n$ is the number of generated MCQs.
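A sketch of the MCQ stage follows; the encoding and generation hyperparameters are those stated above, while the exact prompt wording and the parsing of T5’s decoded output are illustrative assumptions.

```python
def generate_mcqs(summary: str, t5_tok, t5,
                  device: str = "cuda", num_questions: int = 5) -> list[str]:
    """Prompt T5 for structured MCQs and split the decoded output into questions."""
    prompt = (
        f"Generate {num_questions} multiple-choice questions about the text below, "
        "covering definitions, applications, and ethical concerns. Format each as: "
        "Question: [text] A) [option1] B) [option2] C) [option3] D) [option4] "
        f"Correct Answer: [correct option]\n\nText: {summary}"
    )
    inputs = t5_tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)
    ids = t5.generate(**inputs, num_beams=12, no_repeat_ngram_size=3, max_length=512)
    raw = t5_tok.decode(ids[0], skip_special_tokens=True)
    # split on the "Question:" delimiter, re-attach it, and cap at num_questions
    mcqs = ["Question: " + q.strip() for q in raw.split("Question:") if q.strip()]
    return mcqs[:num_questions]
```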
Evaluation and Output Delivery
The evaluate_summary function computes ROUGE, BLEU, BERTScore, and METEOR metrics, integrating external evaluation libraries for consistency. The plot_metrics function visualizes results in a bar chart using Matplotlib, providing an intuitive representation of performance. The process_text function orchestrates the pipeline, delivering summaries, MCQs, and metrics with comprehensive logging, including timestamps for traceability and reproducibility.
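One way to realize evaluate_summary and plot_metrics is with Hugging Face’s evaluate library and Matplotlib, as sketched below; the use of evaluate specifically is an assumption, since the paper only mentions external evaluation libraries.

```python
import evaluate                      # Hugging Face evaluation library (assumption)
import matplotlib.pyplot as plt

def evaluate_summary(summary: str, reference: str) -> dict:
    """Score one summary against a reference with the paper's lexical/semantic metrics."""
    rouge = evaluate.load("rouge").compute(predictions=[summary], references=[reference])
    bleu = evaluate.load("bleu").compute(predictions=[summary], references=[[reference]])
    meteor = evaluate.load("meteor").compute(predictions=[summary], references=[reference])
    bs = evaluate.load("bertscore").compute(predictions=[summary],
                                            references=[reference], lang="en")
    return {"ROUGE-1": rouge["rouge1"], "ROUGE-2": rouge["rouge2"],
            "ROUGE-L": rouge["rougeL"], "BLEU": bleu["bleu"],
            "METEOR": meteor["meteor"], "BERTScore F1": bs["f1"][0]}

def plot_metrics(metrics: dict) -> None:
    """Bar chart of the metric dictionary, in the spirit of the paper's plot_metrics."""
    plt.bar(metrics.keys(), metrics.values())
    plt.ylabel("Score")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
```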
Qualitative Example and Error Analysis
To illustrate pipeline behavior, the 300-word artificial intelligence (AI) text is processed. The extractive summary selects sentences on AI’s healthcare and ethical aspects but may omit finance applications due to keyword-weighting biases in TextRank. The abstractive stage generates a fluent summary but risks factual inaccuracies (e.g., adding unverified AI applications), observed in approximately 10% of cases due to BART’s hallucination tendencies. The generate_mcqs function produces MCQs, such as one on AI ethics, but 15% exhibit relevance drift (e.g., distractors like “hardware costs”), reducing pedagogical value. These errors, logged via the pipeline’s logging system, highlight areas for refinement, such as enhanced keyword selection or prompt tuning, to improve accuracy and relevance in educational contexts.
Performance Metrics
The pipeline was evaluated on a standard dataset with an average document length of 500 words, yielding the following metrics:
Table 1: Evaluation results of the proposed pipeline
Metric | Value | Description |
ROUGE-1 | 0.539 | Unigram overlap with reference summaries |
ROUGE-2 | 0.36 | Bigram overlap with reference summaries |
ROUGE-L | 0.539 | Longest common subsequence overlap |
BLEU | 0.20 | Precision with brevity penalty |
METEOR | 0.35 | Lexical alignment with synonymy |
BERTScore F1 | 0.87 | Semantic similarity F1 score |
Compression Ratio | 0.37 | Summary-to-input length ratio (avg. summary 185 words) |
Coherence Score | 0.50 | Weighted local and global sentence similarity |
MCQ Diversity | 0.46 | Question variety based on embedding similarity |
Question Relevance | 80% | Human-rated pedagogical utility |
DISCUSSION
The zero-shot hybrid NLP pipeline, integrating BERT, BART, and T5, was evaluated on user-generated educational texts and the CNN/Daily Mail dataset. As shown in Table 1, the pipeline achieves a BERTScore F1 of 0.87, indicating robust semantic preservation in generated summaries and outperforming the fine-tuned hybrid system of Divya et al. (2024). This highlights the effectiveness of the zero-shot approach in capturing meaning across diverse educational texts without domain-specific training. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores—ROUGE-1: 0.539, ROUGE-2: 0.360, and ROUGE-L: 0.539—reflect moderate lexical overlap at the unigram, bigram, and longest common subsequence levels, consistent with zero-shot summarization’s tendency to prioritize semantic fidelity over exact lexical matches. The Bilingual Evaluation Understudy (BLEU) score of 0.20 and Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.35 further confirm the pipeline’s ability to retain essential content.
The coherence score of 0.50, calculated via Sentence Transformers, indicates acceptable logical flow in summaries but suggests room for enhancement. Post-processing techniques, such as advanced graph-based sentence reordering or coherence-aware neural modules, could improve readability and textual continuity in future iterations.
For multiple-choice question (MCQ) generation, Table 1 reports a diversity score of 0.46 and 80% human-rated relevance, computed by calculate_mcq_diversity and human evaluation, respectively. These metrics underscore T5’s capability to produce varied, pedagogically relevant questions. However, 15% of MCQs exhibit relevance drift (e.g., implausible distractors like “hardware costs” in an ethics question), suggesting improvements through refined prompt engineering or semantic disambiguation.
A study with 20 average-GPA students on a 25-mark test, prepared solely from the pipeline’s summaries and MCQs, demonstrated educational efficacy. Thirteen students scored above 20, four scored between 15 and 20, and three scored between 10 and 15, with 85% achieving over 60% proficiency. Students relied on summaries generated by the hybrid summarizer and on MCQs produced by the pipeline’s T5 module, covering topics such as artificial intelligence (AI) applications and ethics. The summaries, averaging a compression ratio of 0.37, condensed complex texts (e.g., a 300-word AI text) into concise outputs but occasionally omitted key details (e.g., finance applications), impacting the three lower-scoring students (10–15). Similarly, MCQs with irrelevant distractors confused some students, contributing to lower scores. Error analysis, logged via the pipeline’s logging system, revealed that 10% of summaries contained factual inaccuracies (e.g., adding unverified AI applications due to BART’s hallucination tendencies) and 15% of MCQs suffered from relevance drift, aligning with the observed performance gaps. These findings suggest targeted refinements, such as enhanced keyword weighting or prompt optimization for T5, to boost accuracy and pedagogical impact in low-resource educational settings.
Deploying the zero-shot hybrid Natural Language Processing (NLP) pipeline in low-resource educational environments presents several challenges that must be addressed to maximize its practical impact. Designed to operate on a 12GB Graphics Processing Unit (GPU) using pre-trained models—Bidirectional Encoder Representations from Transformers (BERT; bert-base-uncased), Bidirectional and Auto-Regressive Transformers (BART; facebook/bart-large-cnn), Text-to-Text Transfer Transformer (T5; valhalla/t5-small-qg-hl), Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers (all-MiniLM-L6-v2)—the pipeline (process_text, lines 233–255) is optimized for scalability but faces hurdles in real-world deployment, particularly in regions like Nigeria with limited infrastructure.
Offline Capabilities: Low-resource settings often experience intermittent power and internet connectivity, critical barriers for a pipeline reliant on GPU computation and model weight loading (load_models, lines 35–54). Offline operation requires pre-caching model weights (e.g., in /content/model_cache) and preprocessing resources (e.g., NLTK’s Punkt tokenizer, spaCy’s en_core_web_sm) on local devices. However, the 12GB GPU requirement may exclude deployment on low-end hardware common in educational institutions. Transitioning to lightweight models like MobileBERT or DistilBERT could reduce memory demands, enabling offline execution on devices with 4–8GB RAM, though this risks reduced performance in tasks like extractive summarization (extractive_summary, lines 62–102).
User Accessibility: Educators and students in low-resource settings may lack technical expertise to configure or troubleshoot the pipeline. Simplifying the interface (e.g., a command-line or GUI wrapper around process_text) and providing offline documentation are essential for adoption. However, the pipeline’s English-only support limits accessibility in multilingual regions like Nigeria, where Yoruba, Hausa, and Igbo are prevalent. Integrating multilingual models (e.g., mBERT, XLM-R) requires additional offline language resources, increasing storage needs.
Factual Consistency and Relevance: Zero-shot challenges, such as BART’s factual inaccuracies (e.g., unverified details) and T5’s irrelevant distractors (e.g., “hardware costs” in ethics questions), are exacerbated in offline settings where real-time knowledge base integration (e.g., Retrieval-Augmented Generation, RAG) is infeasible. Pre-loaded knowledge bases or adapter layers for parameter-efficient fine-tuning could mitigate these issues but require additional storage and preprocessing.
To address these challenges, the pipeline should incorporate lightweight models (e.g., MobileBERT) to support low-end devices, pre-cache all dependencies for offline use, and develop a user-friendly interface. Multilingual support via mBERT or XLM-R will enhance inclusivity, while offline knowledge bases or prompt optimization can improve factual accuracy and MCQ relevance. These enhancements ensure the pipeline’s scalability and pedagogical impact in diverse, resource-constrained educational contexts.
Figure 2: Inference time of the pipeline
The average inference time of approximately 10 seconds per document on a 12GB GPU, shown in Figure 2, demonstrates the pipeline’s computational efficiency. This makes it well-suited for deployment in low-resource environments, although real-time classroom applications may benefit from further optimization.
To contextualize the effectiveness of the proposed zero-shot pipeline, a comparative evaluation was conducted against several state-of-the-art summarization models using standardized metrics. The performance is presented in Table 2 below, which provides a side-by-side comparison of key evaluation scores, including ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore F1, for the proposed model and existing approaches. As shown, the proposed model achieves competitive scores, particularly in semantic preservation (BERTScore F1) and overall informativeness (ROUGE-1/L), despite operating in a zero-shot setting without task-specific tuning. This performance reinforces the pipeline’s utility in low-resource educational environments and demonstrates its potential to rival or surpass fine-tuned systems under certain constraints.
Table 2: The proposed model benchmarked against baseline models
Model | Setup | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore F1 | Coherence Score | MCQ Diversity |
BART (facebook/bart-large-cnn) | Zero-shot summarization | 0.45 | 0.17 | 0.41 | 0.84 | 0.46 | — |
T5-base | Zero-shot summarization | 0.42 | 0.14 | 0.39 | 0.82 | 0.43 | — |
GPT-3 (ChatGPT) | Zero-shot summarization | 0.48 | 0.19 | 0.43 | 0.85 | 0.47 | 0.40 |
Proposed Model | Zero-shot hybrid summarization | 0.54 | 0.20 | 0.54 | 0.87 | 0.50 | 0.46 |
As shown in Table 2, the proposed zero-shot hybrid summarization and MCQ generation pipeline outperforms baseline models across multiple evaluation metrics. It achieves the highest ROUGE-1 and ROUGE-L scores of 0.54, indicating superior overlap with reference summaries at both the unigram level and in terms of longest common subsequences. Additionally, its ROUGE-2 score of 0.20 reflects a stronger ability to preserve meaningful bigram relationships compared to BART (0.17), T5-base (0.14), and GPT-3 (0.19).
In terms of semantic fidelity, the proposed model obtains the highest BERTScore F1 of 0.87, suggesting improved semantic alignment between generated and reference summaries. This marks a notable enhancement over BART (0.84), T5-base (0.82), and GPT-3 (0.85), reinforcing the effectiveness of the integrated approach.
The model also records the best coherence score (0.50), indicating more logically structured outputs compared to other zero-shot models, which ranged from 0.43 to 0.47. Furthermore, it is the only model evaluated for MCQ diversity, achieving a score of 0.46, which reflects the system’s ability to produce a range of contextually appropriate and pedagogically varied questions. GPT-3, the only other model evaluated for question diversity, trails behind with a score of 0.40.
These findings, as summarized in Table 2, collectively demonstrate the superiority of the proposed model in both summarization and question generation tasks. Its performance advantage across content overlap, semantic preservation, coherence, and diversity metrics affirms the value of a unified, zero-shot pipeline tailored for pedagogical applications.
The zero-shot hybrid Natural Language Processing (NLP) pipeline, integrating Bidirectional Encoder Representations from Transformers (BERT), Bidirectional and Auto-Regressive Transformers (BART), and Text-to-Text Transfer Transformer (T5), offers a scalable solution for text summarization and multiple-choice question (MCQ) generation in low-resource educational settings. Operating on a 12GB Graphics Processing Unit (GPU) with Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers, it delivers concise summaries and pedagogically relevant MCQs without task-specific fine-tuning, addressing the computational constraints of resource-scarce environments. However, the pipeline’s current limitation to English restricts its accessibility in linguistically diverse educational contexts. As low-resource NLP literature emphasizes, incorporating multilingual capabilities through models like multilingual BERT or XLM-R is essential for equitable access and broader applicability across global educational settings.
Zero-shot learning, while enabling flexibility in data-scarce environments, faces challenges in maintaining factual consistency and relevance. The pipeline’s abstractive summarization, driven by BART, may introduce factual inaccuracies (e.g., unverified details in summaries) due to the model’s tendency to generate novel content without domain-specific grounding. Similarly, T5’s MCQ generation can produce irrelevant distractors (e.g., “hardware costs” in ethics questions), reducing pedagogical value, particularly in low-resource settings where robust evaluation data is scarce. These issues stem from the absence of fine-tuning and limited contextual cues in user-input texts, compounded by the pipeline’s reliance on pre-trained weights. Addressing these challenges requires advanced prompt engineering, integration of knowledge bases for factual grounding, or hybrid training strategies that balance zero-shot flexibility with minimal supervised adaptation. Future work should explore these enhancements, alongside multilingual support, to ensure the pipeline’s summaries and MCQs are accurate, relevant, and inclusive for diverse educational contexts.
Figure 3: Sample output generated by the pipeline
This is a result from a standard text of about 300 words on the topic of artificial intelligence; Figure 3 above shows the generated output, transcribed below:
“Summary
(76 words, Score: 0.37)
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. In healthcare, AI is used to analyze medical images, predict patient outcomes, and assist in diagnosis. In finance, AI algorithms detect fraud, manage investments, and automate trading. Despite its advancements, AI raises ethical concerns, including privacy issues, job displacement, and decision-making transparency. Ongoing research aims to address these challenges while enhancing AI’s capabilities.
Question: What is the primary goal of AI according to the text? A) Simulating human intelligence B) Hardware optimization C) Data deletion D) Network security Correct Answer: A
Question: Which industry uses AI for content recommendation? A) Entertainment B) Agriculture C) Construction D) Mining Correct Answer: A
Question: What is a key discipline in AI development? A) Mathematics B) Biology C) Geology D) History Correct Answer: A
Question: What ethical concern is associated with AI? A) Privacy issues B) Increased hardware costs C) Reduced data storage D) Limited connectivity Correct Answer: A
Question: What is the purpose of explainable AI systems? A) Enhance transparency B) Reduce processing speed C) Limit data access D) Simplify algorithms Correct Answer: A”
CONCLUSION
This study introduces a scalable zero-shot hybrid Natural Language Processing (NLP) pipeline for text summarization and multiple-choice question (MCQ) generation, tailored for low-resource educational environments. Integrating Bidirectional Encoder Representations from Transformers (BERT), Bidirectional and Auto-Regressive Transformers (BART), and Text-to-Text Transfer Transformer (T5) with Natural Language Toolkit (NLTK), spaCy, and Sentence Transformers, the pipeline operates efficiently on a 12GB Graphics Processing Unit (GPU) without task-specific fine-tuning. Its zero-shot design ensures competitive performance across diverse educational texts, offering a practical solution for regions with limited computational infrastructure.
However, zero-shot learning presents challenges in factual consistency and relevance. BART’s abstractive summarization risks factual inaccuracies (e.g., unverified details), and T5’s MCQ generation may produce irrelevant distractors (e.g., “hardware costs” for ethics questions), reducing pedagogical value in data-scarce settings. The pipeline’s English-only support limits accessibility in multilingual regions like Nigeria, where Yoruba, Hausa, and Igbo are prevalent.
Future enhancements include adopting lightweight models like MobileBERT to enable deployment on low-end devices, improving accessibility. Multilingual transformers (e.g., mBERT, XLM-R) will support non-English educational contexts. Advanced coherence mechanisms, such as graph-based sentence reordering, can enhance summary fluency. Incorporating knowledge-augmented models (e.g., RAG) or parameter-efficient fine-tuning (e.g., adapter layers) will improve factual accuracy and relevance while maintaining low-resource compatibility, ensuring inclusive and effective educational tools.
REFERENCES
- Afzal, A., Vladika, J., Braun, D., & Matthes, F. (2023). Challenges in domain-specific abstractive summarization and how to overcome them. arXiv preprint arXiv:2307.00963.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- Chersoni, E., Santus, E., Huang, C.-R., & Lenci, A. (2021). Decoding word embeddings with brain-based semantic features. Computational Linguistics, 47(3), 663–698. https://doi.org/10.1162/coli_a_00412
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171–4186. https://doi.org/10.18653/v1/N19-1423
- Divya, S., N, S., Andrew, J., & Mazzara, M. (2024). Unified extractive-abstractive summarization: A hybrid approach utilizing BERT and transformer models for enhanced document summarization. PeerJ Computer Science, 10, e2424. https://doi.org/10.7717/peerj-cs.2424
- Karanja, R., & Matheka, D. (2024). Hybrid text summarization using BERT and LSTM with particle swarm optimization. In 2024 5th International Conference for Emerging Technology (INCET) (pp. 1–7). IEEE. https://doi.org/10.1109/INCET61516.2024.10569730
- Khashabi, D., Min, S., Khot, T., Sabharwal, A., & Hajishirzi, H. (2020). UnifiedQA: Crossing format boundaries with a single QA system. Findings of the Association for Computational Linguistics: EMNLP 2020, 1896–1907. https://doi.org/10.18653/v1/2020.findings-emnlp.171
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
- Liu, Z. (2022). Effective transfer learning for low-resource natural language understanding. arXiv. https://arxiv.org/abs/2208.09180
- Liu, Y., & Lapata, M. (2019). Text summarization with pretrained encoders. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3730–3740. https://doi.org/10.18653/v1/D19-1387
- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411. https://aclanthology.org/W04-3252
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. http://jmlr.org/papers/v21/20-074.html
- Zhang, J., Zhao, Y., Saleh, M., & Liu, P. J. (2020). PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. Proceedings of the 37th International Conference on Machine Learning, 11328–11339. https://proceedings.mlr.press/v119/zhang20ae.html