How Many Keywords are Enough? Determining the Optimal Top-K for Educational Website Classification
Mohd Nazrien Zaraini, Noorrezam Yusop
Fakulti Teknologi Maklumat Dan Komunikasi, Universiti Teknikal Malaysia Melaka
DOI: https://dx.doi.org/10.47772/IJRISS.2025.906000126
Pages: 1574-1586 | Category: Education
Received: 30 May 2025; Accepted: 03 June 2025; Published: 03 July 2025
ABSTRACT
The classification of educational websites has become increasingly challenging, as traditional indicators such as domain extensions no longer reliably reflect a site’s purpose. This study investigates the optimal number of TF-IDF-ranked keywords (K) required to balance classification accuracy and computational efficiency in a one-class setting. Using a curated dataset sourced from the DMOZ directory and verified educational websites, multiple Top-K keyword subsets (K = 10–200) were evaluated. A One-Class Support Vector Machine (SVM) was employed, with performance assessed through cross-validation and separate positive/negative test sets. Results indicate that classification accuracy peaks within the range of K = 30–100, with diminished performance beyond this range due to the inclusion of irrelevant or noisy terms. These findings offer a practical and scalable framework for content-based educational website classification, particularly for applications in low-resource environments, and challenge the default reliance on exhaustive keyword feature sets.
Keywords: ranking, web classification, tf-idf, top-k, information retrieval
INTRODUCTION
With the growing variety of domain names, it has become increasingly difficult to identify educational websites based solely on their URLs. As a result, content-based classification methods have become more important. One common approach is keyword extraction, which involves identifying significant terms from a web page’s content.
TF-IDF is a widely used technique in keyword extraction and helps highlight words that are important within individual documents relative to a larger collection. While it is well established in information processing, there has been limited research on how many of these top-ranked keywords are needed to accurately classify educational websites. This study focuses on finding the optimal number of keywords (Top-K) that should be used as features to achieve reliable classification.
RELATED WORKS
Educational Website Classification
The exponential growth of online educational materials has created a crucial demand for sophisticated classification methodologies to precisely identify and categorize these resources (Chen et al., 2020). Early classification systems relied heavily on structural elements such as domain name extensions like “.edu” or “.edu.my”, which were initially reliable due to the restricted usage of these domains by accredited educational institutions (Duggal et al., 2021). However, the subsequent proliferation of new top-level domains and the widespread adoption of generic TLDs (e.g., .com, .org) by educational entities have significantly eroded the effectiveness of these methods, complicating the classification process (Chen et al., 2020). Many educational resources prefer common TLDs such as “.org,” “.com,” or country-specific domains, necessitating more sophisticated classification techniques (Li & Peng, 2021). Furthermore, earlier methodologies that depended on manually curated datasets or directories such as DMOZ, while providing structured taxonomies, have proven labour-intensive and lack the scalability needed to keep pace with the rapidly evolving web environment. The growing volume of online data underscores the impracticality of manual filtering and the necessity for automated tools to manage and categorize digital educational content (Álvarez et al., 2023; Jiang et al., 2021). Consequently, the focus has shifted towards automated approaches capable of identifying educational content based on core features and contextual analysis, leveraging advancements in machine learning and natural language processing to overcome the limitations of earlier methods.
Keyword Selection and Feature Reduction in Text Classification
Keyword selection is a fundamental step in text classification, directly influencing the accuracy and efficiency of the classification process, particularly in the context of educational website analysis (Ternikov & Александрова, 2020). The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is a term-weighting scheme that plays a vital role in ranking and filtering relevant features from web documents (Tan & Song, 2021). TF-IDF is a commonly used technique to identify important terms within a document in relation to a broader corpus (Gutiérrez et al., 2020). It balances the frequency of a term within a document against its inverse frequency across the entire corpus, effectively highlighting words that are both frequent in a specific document and relatively rare across the broader collection (Dastani et al., 2021). The number of top-ranked keywords to include, represented by the parameter K in Top-K selection, is a critical determinant of model performance, yet its optimization is frequently ad hoc or overly tailored to specific datasets (“TF–IDF,” 2010). Selecting too few keywords can lead to underfitting, discarding important discriminative information and reducing the model’s ability to classify documents accurately. Conversely, including too many keywords can introduce noise, increase computational complexity, and reduce the model’s ability to generalize. While feature reduction techniques such as mutual information, chi-square selection, and frequency thresholds have been explored in the literature, the specific impact of varying the K value in TF-IDF for web classification remains an area that requires further investigation (Gasparetti et al., 2017). This is particularly relevant in resource-constrained environments, where the balance between accuracy and computational efficiency is paramount, necessitating a deeper understanding of the trade-offs involved in keyword selection. Therefore, effectively tuning the K parameter within TF-IDF is essential for optimizing text classification performance, especially in specialized domains such as educational website analysis (Wu, 2023).
The initial concept of assigning importance to words in a text by examining word frequency was introduced decades ago (Bisht, 2021). The weight of a term is typically calculated as the product of three components: a term frequency component that measures how often the term appears in the document, an inverse document frequency component that measures how rare the term is across the entire corpus, and a normalization component that adjusts for document length (“TF–IDF,” 2010). The TF-IDF approach marks a significant departure from earlier methods that focused primarily on term frequency within and across documents (Nomoto, 2022). An improved variant called CTF-IDF enhances the accuracy of text classification (Xu & Wu, 2014), and another enhanced TF-IDF algorithm factors in part of speech and readers’ comments to improve accuracy (Guan et al., 2019). The most common formulation of TF-IDF weighting is w(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t (“TF–IDF,” 2010). This equation combines local term frequency with global inverse document frequency to derive a composite weight for each term in a document (Imperial & Ong, 2021). The application of this weighting scheme has demonstrated considerable utility across various text mining tasks, enabling more accurate and effective analysis of textual data.
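The weighting just described can be illustrated with a minimal sketch; the function name and the example counts below are illustrative, not taken from the paper:

```python
import math

def tfidf(term_count, doc_len, n_docs, doc_freq):
    """Composite TF-IDF weight: length-normalized local term frequency
    times global inverse document frequency."""
    tf = term_count / doc_len          # how often the term occurs in this document
    idf = math.log(n_docs / doc_freq)  # how rare the term is across the corpus
    return tf * idf

# e.g., a term occurring 5 times in a 100-word page,
# present in 20 of 200 pages in the corpus
weight = tfidf(5, 100, 200, 20)  # = 0.05 * ln(10)
```

A term that appears in every document gets idf = log(1) = 0, so its weight vanishes regardless of its local frequency, which is exactly why TF-IDF demotes ubiquitous words.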
PROPOSED METHODOLOGY
The methodology is organized into four primary stages: data collection, pre-processing, keyword extraction and ranking, and classification with validation. A visual summary of this process is presented in Figure 1.
Figure 1: Methodology for extracting and ranking Top-K TF-IDF keywords for educational website classification using One-Class SVM.
Dataset Preparation
A total of 700 websites were used in this research. These were divided into training and testing datasets. For the training phase, three distinct training sets were used, each comprising 200 pre-determined educational websites randomly selected from the DMOZ directory. Although DMOZ has been archived since 2017, it remains one of the most comprehensive, structured, and openly available web directories. Its educational category provides a reliable, consistent dataset for evaluating classification strategies in a reproducible manner. The findings derived from this dataset remain relevant for content-based classification techniques, especially in low-resource or interpretable model scenarios.
For testing, two different sets were created:
- Positive Test Set: Includes 50 educational websites randomly taken from the DMOZ education category.
- Negative Test Set: Includes 50 websites from a mix of categories such as sports, arts, entertainment, and news, also from the DMOZ directory. This set simulates a real-world scenario where the classifier must distinguish educational content from other types.
Data Pre-processing
Seven data pre-processing activities are proposed for this research: data extraction, tokenization, HTML tag removal, symbol removal, word decapitalization, stop word removal, and word stemming.
Data Extraction
The first step in pre-processing is the data extraction process. Using the data extraction operator in RapidMiner, the textual content of a given HTML document is extracted and the extracted text blocks are returned as documents. Only text blocks containing at least a given number of words are extracted, to prevent isolated words (e.g., in navigation bars) from being kept.
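A rough approximation of this block-filtering behaviour (a sketch of the idea, not RapidMiner's actual implementation) is:

```python
def extract_blocks(text, min_words=3):
    """Split text into blocks separated by blank lines and keep only
    blocks with at least `min_words` words (drops navigation fragments)."""
    blocks = [b.strip() for b in text.split("\n\n")]
    return [b for b in blocks if len(b.split()) >= min_words]

page = "Home\n\nWelcome to our education portal\n\nAbout"
extract_blocks(page)  # keeps only the longer text block
```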
Tokenization
Tokenization is the process of chopping a character sequence or sentence into words, phrases, symbols, or other meaningful elements called tokens (Tamang & Bora, 2024). The objective of tokenization is to expose the individual words in a sentence. At the outset, textual data is only a block of characters, whereas information retrieval requires the individual words of the data set; therefore, a parser is used to tokenize the documents. This may seem trivial since the text is already stored in machine-readable formats, but problems remain: for example, abbreviations and acronyms need to be transformed into a standard form. The main use of tokenization is the identification of meaningful keywords, and the resulting list of tokens becomes the input for further processes such as classification. Figure 2 below shows an example of the tokenization process output.
Figure 2: Example of Tokenization Process
Some problems still remain after tokenization, such as the removal of punctuation marks and other characters like brackets and hyphens. The symbol removal process addresses this.
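A simple regular-expression tokenizer along these lines (an illustrative sketch; the study itself used RapidMiner's tokenization) might look like:

```python
import re

def tokenize(text):
    """Split a character sequence into word tokens, keeping
    internal apostrophes (e.g., contractions like "don't")."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

tokenize("Students learn online.")  # ['Students', 'learn', 'online']
```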
HTML Tag Removal
HTML, or Hypertext Markup Language, is the standard markup language used to create web pages. It is written in the form of HTML elements consisting of tags enclosed in angle brackets, for example <html>. Most of these tags carry no textual meaning and can be removed.
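Using Python's standard library, tag stripping can be sketched as follows (a simplified illustration; script and style content would need additional handling in practice):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

parser = TextExtractor()
parser.feed("<html><body><h1>Courses</h1><p>Learn online</p></body></html>")
parser.text()  # "Courses Learn online"
```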
Symbols Removal
Text pre-processing often includes removing punctuation and special characters and converting text to lowercase (Ding et al., 2023; Gangar et al., 2021). Punctuation can be defined as the use of spacing, conventional signs, and certain typographical devices to aid the understanding and correct reading, both silently and aloud, of handwritten and printed texts (Brown, 2013). Examples of punctuation marks include the apostrophe, bracket, colon, and semi-colon.
Word Decapitalization
Decapitalization is the process of converting words to lower case. This is important because words extracted from websites appear in both lower- and upper-case forms; since learners and classifiers are case sensitive, the same word in different cases would be counted as two different elements. Decapitalizing all words ensures that each word is counted as a single element.
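Symbol removal and decapitalization can be combined into one small normalization step (an illustrative sketch, not the paper's exact pipeline):

```python
import re

def normalize(text):
    """Lower-case the text, replace punctuation and symbols with spaces,
    then collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # symbols and punctuation -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

normalize("E-Learning: Courses, Online!")  # "e learning courses online"
```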
Stop Words Removal
Classification often starts by looking at documents and finding the most important keywords in those documents.
The words that appear most frequently in a document are not necessarily the important keywords; it is usually the other way round. The most frequent words are common words such as “the” or “and,” which help build sentences but carry little meaning themselves (Fan et al., 2017). Therefore, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.
Stop words are common words that often carry very little value for machine learning and classification. One way to determine stop words is to use collection frequency (the total number of times each term appears in the document collection). MySQL provides a complete list of stop words that can be used in data pre-processing, which makes identifying stop words much easier.
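Stop word removal is then a simple set-membership filter. The list below is a small illustrative excerpt; the full list referenced in this paper comes from the MySQL documentation:

```python
# Abbreviated stop word list for illustration; the real list is much longer.
STOP_WORDS = {"a", "about", "above", "after", "again", "an", "and",
              "are", "as", "at", "the", "to", "of", "in", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

remove_stop_words(["the", "students", "are", "learning"])
# ['students', 'learning']
```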
Word Stemming
Stemming is the process of reducing words to their root form. The most common stemming algorithm for English is the Porter Stemmer, developed by Martin Porter at the University of Cambridge in 1980. It works by stripping suffixes according to a set of rules, which does not always produce a linguistically correct root, but it is usually effective enough to improve classification results.
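A heavily simplified suffix-stripping sketch conveys the idea (this is not the full five-step Porter algorithm and will not reproduce all of its outputs; the rule list is illustrative):

```python
def simple_stem(word):
    """Strip the first matching suffix, mimicking a single Porter-style pass.
    The rule list here is far smaller than Porter's real rule set."""
    rules = [("ational", ""), ("ies", "i"), ("ing", ""), ("s", "")]
    for suffix, replacement in rules:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 2:
            return stem + replacement
    return word

simple_stem("educational")  # "educ", matching Table 4.2
simple_stem("studies")      # "studi"
```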
TF-IDF Keyword Extraction and Ranking
After pre-processing, Term Frequency-Inverse Document Frequency (TF-IDF) was applied to compute the importance of each term within a document relative to the entire corpus. The terms were then ranked by their TF-IDF scores, with higher values indicating more informative keywords. To explore the impact of keyword set size, distinct Top-K keyword sets were created, where K = 10, 20, 30, 40, 50, 100, 150, and 200. These sets represented varying levels of feature granularity and were used in the subsequent classification phase.
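The ranking step can be sketched as follows, scoring each term by its average TF-IDF across the corpus (as in the Average_TF-IDF column of Table 4.3) and keeping the K highest-scoring terms; the toy corpus is purely illustrative:

```python
import math
from collections import Counter

def rank_top_k(docs, k):
    """Rank terms of a tokenized corpus by average TF-IDF; return the top K."""
    n = len(docs)
    df = Counter()                        # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for term, d in df.items():
        idf = math.log(n / d)
        avg_tf = sum(doc.count(term) / len(doc) for doc in docs) / n
        scores[term] = avg_tf * idf       # average TF-IDF over all documents
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [["educ", "school"], ["educ", "cours"], ["educ", "learn"]]
rank_top_k(docs, 2)  # "educ" occurs in every document, so its IDF is 0
```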
Training and Classification
This research focuses on classifying a single class of website, where the training data consists only of positive examples and no negative examples. Hence, a one-class classification method, the One-Class Support Vector Machine (SVM), must be used instead of a conventional multi-class classification method. The One-Class SVM learns to recognize patterns typical of educational content and flags deviations as non-educational.
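With scikit-learn (assumed available; the synthetic features and parameter values below are illustrative, not the paper's settings), the one-class setup looks like:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for TF-IDF feature vectors of 200 educational pages.
X_train = rng.normal(loc=1.0, scale=0.1, size=(200, 5))

# Train on positive (educational) examples only; nu bounds the
# fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

educational_like = np.full((1, 5), 1.0)   # resembles the training data
unrelated = np.zeros((1, 5))              # far from the learned region

clf.predict(educational_like)  # [1]  -> inlier (educational)
clf.predict(unrelated)         # [-1] -> outlier (non-educational)
```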
Experimental Setup
Three separate training datasets (Training Data Set 1, 2, and 3) were used to evaluate the performance of web content classification.
Different validation techniques were applied:
- Training Data Set 1: Cross-validation (10 folds).
- Training Data Set 2: Positive Test Set and Negative Test Set evaluations.
- Training Data Set 3: Cross-validation and Positive Test Set evaluations.
Each dataset was tested using varying numbers of keywords (top 10, top 20, …, top 200, and all extracted keywords).
- Positive Test Set: To assess how well the model identifies new educational websites.
- Negative Test Set: To evaluate its ability to reject non-educational content and avoid false positives.
The performance of the classification model was measured using accuracy, defined as the proportion of total correct predictions (true positives and true negatives) over all predictions made.
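The metric reduces to a one-line helper (the counts below are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions (true positives and true
    negatives) over all predictions made."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g., 45 educational and 41 non-educational sites classified
# correctly out of 100 test sites
accuracy(45, 41, 9, 5)  # 0.86
```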
This combination of cross-validation and external testing ensures both robustness and generalizability of the classification results.
RESULT AND DISCUSSION
Data Pre-Processing
Seven data pre-processing activities were carried out in this research: data extraction, tokenization, HTML tag removal, symbol removal, word decapitalization, stop word removal, and word stemming.
After the tokenization, HTML tag removal, and stop word removal processes, more than 3000 words were extracted from the content and metadata of the training and testing websites. The words were also changed to lowercase so that the same word is counted only once.
Stop words are common words that often carry very little value for machine learning and classification. Some examples of stop words removed in this phase are shown in Table 4.1 below. The full list of stop words can be retrieved from the MySQL website (MySQL, 2015).
Table 4.1: Example of stop words removed
a’s | able | about | above | according |
accordingly | across | actually | after | afterwards |
again | against | ain’t | all | allow |
allows | almost | alone | along | already |
also | although | always | am | among |
amongst | an | and | another | any |
anybody | anyhow | anyone | anything | anyway |
anyways | anywhere | apart | appear | appreciate |
appropriate | are | aren’t | around | as |
aside | ask | asking | associated | at |
available | away | awfully | be | became |
The purpose of the word stemming process, performed using the Porter Stemmer, is to reduce a word to its root form. There are five steps in the Porter Stemmer, and within each step, rules are applied until one of them passes its conditions (Porter, 1980). If a rule is accepted, the suffix is removed accordingly, and the next step is performed. The resulting stem at the end of the fifth step is returned. APPENDIX A shows in detail all the steps of the Porter Stemmer algorithm.
Although most words are reduced to their root forms, several words are changed into a different string rather than a recognizable root. Table 4.2 below shows some examples of words that are not changed into their root words.
Table 4.2: Example of words that are not changed into root words
Original words | Stem |
Educational, education, educating, educator, educate | educ |
University, universities | univers |
Course, courses | cours |
Student, students, studies, study | studi |
Resource, resources | resourc |
Online | onlin |
Communicate, communication, communicates, communicating | commun |
Day | dai |
Manage, manages, manager, management | manag |
Service, services | servic |
Certify, certificate, certification | certif |
Science | scienc |
This happens because the Porter Stemmer focuses on removing suffixes (Khan & Yadav, 2019; Luo et al., 2021) and does not always identify the true root of a word. Despite this limitation, it remains a useful initial step in many Information Retrieval processes (Khan & Yadav, 2019).
Keywords Ranking
After the content and metadata information was extracted from the training websites, the number of attributes extracted from the content (3199) and from the metadata (4950) was found to differ substantially. If all attributes were used for classification, the results would be biased by this imbalance. Therefore, keyword ranking based on TF-IDF scores was used to select the same number of attributes for both the web content experiment and the metadata experiment.
TF-IDF multiplies the term frequency of each attribute by its inverse document frequency. Table 4.3 below shows the top twenty attributes with their TF-IDF scores.
Table 4.3: k20 keyword example based on average TF-IDF
Rank | Keyword | Average_TF-IDF |
1 | colleg | 0.033919529 |
2 | languag | 0.031320286 |
3 | homeschool | 0.029999825 |
4 | learn | 0.025583805 |
5 | teacher | 0.023995656 |
6 | school | 0.023971804 |
7 | book | 0.022452303 |
8 | read | 0.021418466 |
9 | student | 0.021374708 |
10 | program | 0.020375315 |
11 | univers | 0.020250898 |
12 | intern | 0.019222332 |
13 | onlin | 0.018737295 |
14 | educ | 0.018705899 |
15 | disabl | 0.018553035 |
16 | curriculum | 0.018108217 |
17 | studi | 0.01763062 |
18 | test | 0.017522349 |
19 | lesson | 0.017362117 |
20 | cours | 0.017133632 |
Training Data Set 1
Training Data Set 1 consists of 200 predetermined educational websites taken from the DMOZ directory. Tests were conducted using datasets containing all keywords as well as the top 10, 20, 30, 40, 50, 100, 150, and 200 keywords.
Each test was validated using 10-fold cross-validation. Table 4.4 below shows the results of each test.
Table 4.4: Classification accuracy
Keywords | Classification Accuracy |
k-10 | 89.07% |
k-20 | 89.07% |
k-30 | 90.05% |
k-40 | 89.55% |
k-50 | 88.05% |
k-100 | 89.07% |
k-150 | 88.52% |
k-200 | 88.00% |
All | 38.71% |
Based on the results shown in Table 4.4, the top 30 keywords give the best content-based classification accuracy at 90.05%.
Training Data Set 2
For Training Data Set 2, the experiment was repeated using different training and testing data sets and a different validation method. Another set of 200 pre-determined educational websites was used as training data. For testing, two data sets were used: an all-positive set named the Positive Test Set and a set with various categories of websites named the Negative Test Set. Table 4.5 shows the results for the Positive Test Set and Table 4.6 below shows the results for the Negative Test Set.
Table 4.5: Result for Positive Test Set
Keywords | Classification Accuracy |
k-10 | 90.00% |
k-20 | 90.00% |
k-30 | 88.00% |
k-40 | 90.00% |
k-50 | 90.00% |
k-100 | 92.00% |
k-150 | 86.00% |
k-200 | 84.00% |
All | 86.00% |
For the Positive Test Set, the best classification accuracy, 92%, was obtained using the top 100 keywords (k-100) as the attributes for content-based classification, while the worst result, 84%, came from the top 200 keywords.
Table 4.6: Result for Negative Test Set
Keywords | Classification Accuracy |
k-10 | 70.00% |
k-20 | 72.00% |
k-30 | 82.00% |
k-40 | 72.00% |
k-50 | 74.00% |
k-100 | 70.00% |
k-150 | 66.00% |
k-200 | 64.00% |
All | 58.00% |
For the Negative Test Set, the best classification accuracy, 82%, was obtained using the top 30 keywords as the attributes for content-based classification. The worst result, 58%, came from using all keywords.
The best attributes for classification come from around the k-30 range, while classification with all keywords gives the worst accuracy. This is because the full keyword set may contain words that are not related to educational terms. Table 4.7 below shows examples of keywords ranked 1000–1020, many of which are not closely related to education.
Table 4.7: Keywords ranked 1000–1020
Rank | Keyword |
1000 | strand |
1001 | summerhai |
1002 | uganda |
1003 | upstag |
1004 | continun |
1005 | educt |
1006 | estat |
1007 | saftei |
1008 | camp |
1009 | cultur |
1010 | enrich |
1011 | fantast |
1012 | lyceum |
1013 | present |
1014 | red |
1015 | ribbon |
1016 | counsellor |
1017 | cypru |
1018 | dubai |
1019 | franc |
1020 | germani |
Training Data Set 3
To ensure the reliability of the results, the web content classification experiment was repeated using an alternative dataset, Training Data Set 3. Two validation approaches were employed: cross-validation and evaluation using a Positive Test Set.
Table 4.8: Result obtained through cross-validation.
Keywords | Classification Accuracy |
k-10 | 87.50% |
k-20 | 89.50% |
k-30 | 87.50% |
k-40 | 86.00% |
k-50 | 84.50% |
k-100 | 83.50% |
k-150 | 81.50% |
k-200 | 81.00% |
All | 40.50% |
Table 4.9: Result based on the Positive Test Set evaluation.
Keywords | Classification Accuracy |
k-10 | 86% |
k-20 | 94% |
k-30 | 94% |
k-40 | 98% |
k-50 | 98% |
k-100 | 98% |
k-150 | 98% |
k-200 | 98% |
All | 98% |
For the Positive Test Set evaluation, the classification accuracy showed a steady increase with the number of keywords used. Accuracy improved from 86% with the top 10 keywords to 98% when 40 or more keywords were used, maintaining a consistent 98% accuracy even when all extracted keywords were included.
In contrast, the cross-validation results exhibited a different pattern. The highest accuracy of 89.5% was achieved when using the top 20 keywords. Beyond this point, the accuracy progressively declined as more keywords were included, reaching only 40.5% when all extracted keywords were used. This indicates that, under cross-validation, using a limited number of carefully selected keywords yields better classification performance, whereas using all keywords introduces noise and significantly degrades accuracy.
DISCUSSION OF KEY TRENDS
Optimal Keywords Threshold
Analysis across multiple datasets revealed that selecting between 30 and 100 top-ranked keywords consistently yielded the highest classification accuracy. This range appears to capture the core thematic content of educational websites while minimizing the inclusion of irrelevant or low-signal terms. Beyond this threshold, performance began to degrade, likely due to the introduction of noisy or overly specific keywords (e.g., “lyceum,” “dubai”), as shown in Table 4.7. These findings support the principle that more features do not always translate to better performance, particularly in text classification settings.
Impact of Validation Method
The effectiveness of keyword quantity was also influenced by the choice of validation method. In cross-validation experiments, smaller keyword sets (K = 20–30) performed best, suggesting that leaner models are more generalizable when trained and tested across randomly partitioned folds. In contrast, evaluation using separate test sets favoured larger keyword sets (K = 40–100), particularly when the test data shared structural similarity with the training data. However, this also raises concerns about potential overfitting, especially when the test set is curated or limited in scope.
Figure 3: Heatmap of Accuracy vs. Number of Keywords for Each Dataset
Practical Implication
From a practical standpoint, selecting the top 30–50 keywords provides an optimal balance between model accuracy and computational efficiency. This range is especially suitable for real-time or resource-constrained applications, such as automated content filtering, educational search systems, or lightweight classifiers deployed in embedded environments.
CONCLUSION
This study presents a systematic evaluation of Top-K keyword selection for the classification of educational websites using a one-class learning approach. By leveraging TF-IDF to rank keyword relevance and employing a One-Class SVM model trained exclusively on positive instances, we examined how varying the number of keywords (K) impacts classification accuracy.
Our results demonstrate that selecting between 30 and 100 keywords consistently yields high classification performance, with peak accuracy achieved around K = 40–50. Beyond this range, the inclusion of lower-ranked, noisy keywords introduces feature redundancy and degrades accuracy. These findings challenge the common practice of using all available keywords and underscore the importance of feature curation, particularly in resource-constrained or interpretable web classification scenarios.
The proposed approach offers a lightweight, interpretable, and effective framework for content-based classification, especially valuable for educational institutions, digital libraries, and filtering systems that lack labelled negative examples. Moreover, the methodology can be extended to other domains where positive-only training data is available and computational simplicity is essential.
Future research may explore the integration of semantic keyword representations, multilingual stemmers, and enhancements to the Porter Stemmer to produce more accurate and interpretable root words.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support provided by Universiti Teknikal Malaysia Melaka (UTeM) and the Fakulti Teknologi Maklumat dan Komunikasi (FTMK).
REFERENCES
- Álvarez, D. P., Orozco, A. L. S., García-Miguel, J. P., & Villalba, L. J. G. (2023). Learning Strategies for Sensitive Content Detection. Electronics, 12(11), 2496. https://doi.org/10.3390/electronics12112496
- Bisht, R. K. (2021). A Comparative Evaluation of Different Keyword Extraction Techniques. International Journal of Information Retrieval Research, 12(1), 1. https://doi.org/10.4018/ijirr.289573
- Chen, Y., Zheng, R., Zhou, A., Liao, S., & Liu, L. (2020). Automatic Detection of Pornographic and Gambling Websites Based on Visual and Textual Content Using a Decision Mechanism. Sensors, 20(14), 3989. https://doi.org/10.3390/s20143989
- Dastani, M., Chelak, A. M., Ziaei, S., & Delghandi, F. (2021). Identifying Emerging Trends in Scientific Texts Using TF-IDF Algorithm: A Case Study of Medical Librarianship and Information Articles. Health Technology Assessment in Action. https://doi.org/10.18502/htaa.v4i2.6231
- Ding, K., Niu, Y., & Choo, W. C. (2023). The evolution of Airbnb research: A systematic literature review using structural topic modeling [Review of The evolution of Airbnb research: A systematic literature review using structural topic modeling]. Heliyon, 9(6). Elsevier BV. https://doi.org/10.1016/j.heliyon.2023.e17090
- Duggal, K., Singh, P., & Gupta, L. R. (2021). Intrinsic and Extrinsic Motivation for Online Teaching in COVID-19: Applications, Issues, and Solution. In Studies in systems, decision and control (p. 327). Springer International Publishing. https://doi.org/10.1007/978-3-030-60039-6_17
- Fan, A., Doshi‐Velez, F., & Miratrix, L. (2017). Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling. arXiv (Cornell University). https://doi.org/10.48550/arXiv.1701.0322
- Gangar, K., Ruparel, H., & Lele, S. (2021). Hindi to English: Transformer-Based Neural Machine Translation. In Lecture notes in electrical engineering (p. 337). Springer Science Business Media. https://doi.org/10.1007/978-981-33-4909-4_25
- Gasparetti, F., Medio, C. D., Limongelli, C., Sciarrone, F., & Temperini, M. (2017). Prerequisites between learning objects: Automatic extraction based on a machine learning approach. Telematics and Informatics, 35(3), 595. https://doi.org/10.1016/j.tele.2017.05.007
- Guan, X., Li, Y., & Gong, H. (2019). Improved TF-IDF for We Media Article Keywords Extraction. Journal of Physics Conference Series, 1302(3), 32003. https://doi.org/10.1088/1742-6596/1302/3/032003
- Gutiérrez, L. F. G., Abri, F., Armstrong, M., Namin, A. S., & Jones, K. S. (2020). Phishing Detection through Email Embeddings. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2012.14488
- Imperial, J. M., & Ong, E. (2021). Under the Microscope: Interpreting Readability Assessment Models for Filipino. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2110.00157
- Jiang, Z., Bo, G., He, Y., Han, Y., Doyle, P., & Zhu, Q. (2021). Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports. Mathematical Problems in Engineering, 2021, 1. https://doi.org/10.1155/2021/6619088
- Khan, N., & Yadav, Prof. S. R. (2019). Analysis of Text Classification Algorithms: A Review [Review of Analysis of Text Classification Algorithms: A Review]. International Journal of Trend in Scientific Research and Development, 579. Rekha Patel. https://doi.org/10.31142/ijtsrd21448
- Li, W., & Peng, Y. (2021). An Intelligent Recommendation Strategy for Online Courses Based on Collaborative Filtering Algorithm for Educational Platforms. Frontiers in Educational Research, 4(16). https://doi.org/10.25236/fer.2021.041605
- Luo, Y., Liang, P., Wang, C., Shahin, M., & Zhan, J. (2021). Characteristics and Challenges of Low-Code Development: The Practitioners’ Perspective.
- MySQL. (2015). MySQL 5.1 Reference Manual: 12.9.4 Full-Text Stopwords. https://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html [Accessed 24 July 2025].
- Porter, M. (1980). An algorithm for suffix stripping. Program Electronic Library and Information Systems, 14(3), 130. https://doi.org/10.1108/eb046814
- Porter Stemming Algorithm. (2002). https://tartarus.org/martin/PorterStemmer/index-old.html
- Nomoto, T. (2022). Keyword Extraction: A Modern Perspective. SN Computer Science, 4(1). https://doi.org/10.1007/s42979-022-01481-7
- Tamang, S., & Bora, D. J. (2024). Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2411.12240
- Tan, M., & Song, Y. (2021). Research on the Topic Mining of Learners’ Interest Based on the Mongolian MOOC Platform Course Discussion Text. 52, 1563. https://doi.org/10.1145/3456887.3459721
- Ternikov, A., & Александрова, Е. (2020). Demand for skills on the labor market in the IT sector. Business Informatics, 14(2), 64. https://doi.org/10.17323/2587-814x.2020.2.64.83
- TF–IDF. (2010). In Encyclopedia of Machine Learning (p. 986). https://doi.org/10.1007/978-0-387-30164-8_832
- Wu, S. (2023). JobHam-place with smart recommend job options and candidate filtering options. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2303.17930
- Xu, D., & Wu, S. B. (2014). An Improved TFIDF Algorithm in Text Classification. Applied Mechanics and Materials, 2258. https://doi.org/10.4028/www.scientific.net/amm.651-653.2258