Interpretable, Expert-Aligned Composite Metric with Domain-Aware Calibration for Evaluating Natural Language Generation
Authors
College of Computing Studies, Information and Communication Technology, Isabela State University (Philippines)
College of Computing Studies, Information and Communication Technology, Isabela State University (Philippines)
Article Information
DOI: 10.51244/IJRSI.2026.13020053
Subject Category: Computer Science
Volume/Issue: 13/2 | Page No: 598-609
Publication Timeline
Submitted: 2026-02-11
Accepted: 2026-02-19
Published: 2026-02-27
Abstract
Automated metrics for natural language generation (NLG) often show weak or unstable alignment with expert judgment in domain-specific settings that require interpretability and tunability. This study therefore designs and validates an interpretable composite metric that can be calibrated to expert consensus while remaining transparent. The researchers propose Comprehensive Quality Scoring (CQS), a hierarchical metric integrating contextual coherence and continuity (C3) with five interpretable linguistic factors (relevance, readability, conciseness, structure, and information density), and introduce CLARION-G, a constrained calibrator that learns a nonnegative simplex weight vector while preserving factor-level attribution. Evaluation uses 20 agriculture-oriented farmer FAQ items with responses generated by a local LLaMA 3.1 (8B) model and scored by expert panels across Agriculture, Linguistics, and Information Technology using a rubric based on MetricEval. Expert ratings are z-scored per rater and aggregated into a consensus target, with reliability assessed via ICC(2,1). To prevent leakage under N=20, calibration is performed strictly within leave-one-out cross-validation (LOOCV): train on N−1 items, freeze the weights, and score the held-out item. Uncertainty is quantified via Fisher-z confidence intervals and bootstrap resampling (B=1000). CLARION-G maximizes a penalized correlation objective with fixed coefficients λ₁=0.01, λ_ent=0.005, and λ_var=0.003, optimized using Differential Evolution (population=15, maxiter=50, tol=10⁻⁴, polish=True) with optional L-BFGS-B refinement (maxiter=300–500, ftol=10⁻⁶–10⁻⁸). In Agriculture, calibrated CQS achieves Pearson's r=0.688 with 95% CI [0.353, 0.867], surpassing baselines (e.g., BERTScore, Prometheus, METEOR) with statistically significant dependent-correlation gains. The learned top-level weights allocate 0.4 to C3 and 0.6 to linguistic quality, emphasizing relevance and information density.
Bland-Altman analysis shows no fixed bias, with limits of agreement of ±0.1134, and runtime remains practical (≈1.254 ms/item), supporting CQS/CLARION-G as an interpretable and operationally lightweight framework for expert-aligned NLG evaluation in specialized domains.
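The calibration pipeline described in the abstract — learning a nonnegative simplex weight vector over factor scores by maximizing a penalized Pearson correlation with Differential Evolution, strictly inside LOOCV, then reporting a Fisher-z confidence interval — can be sketched as follows. This is not the authors' released code: the synthetic data, the single L2 penalty (the paper's objective has additional terms), and the `to_simplex` reparameterization are illustrative assumptions.

```python
# Hedged sketch of CLARION-G-style calibration under LOOCV.
# Data, factor weights, and penalty form are illustrative assumptions.
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
N, K = 20, 6                         # 20 items; C3 plus 5 linguistic factors
X = rng.random((N, K))               # per-item factor scores in [0, 1]
true_w = np.array([0.4, 0.2, 0.1, 0.05, 0.05, 0.2])
y = X @ true_w + 0.02 * rng.standard_normal(N)   # synthetic consensus target

def to_simplex(v):
    """Map a nonnegative box vector onto the probability simplex."""
    w = np.abs(v) + 1e-12
    return w / w.sum()

def objective(v, X_tr, y_tr, lam1=0.01):
    # Negated penalized Pearson correlation (DE minimizes).
    w = to_simplex(v)
    r, _ = pearsonr(X_tr @ w, y_tr)
    return -(r - lam1 * np.sum(w ** 2))

preds = np.empty(N)
for i in range(N):                   # LOOCV: fit on N-1, freeze, score held-out
    mask = np.arange(N) != i
    res = differential_evolution(
        objective, bounds=[(0.0, 1.0)] * K, args=(X[mask], y[mask]),
        popsize=15, maxiter=50, tol=1e-4, polish=True, seed=0)
    preds[i] = X[i] @ to_simplex(res.x)

r_loocv, _ = pearsonr(preds, y)

# Fisher-z 95% confidence interval for the held-out correlation.
z, se = np.arctanh(r_loocv), 1.0 / np.sqrt(N - 3)
ci = (np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se))
print(f"LOOCV r = {r_loocv:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Because the weights are frozen before the held-out item is scored, the reported correlation reflects out-of-sample alignment rather than in-sample fit, which is the leakage safeguard the abstract emphasizes for N=20.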
Keywords
Automated evaluation, Comprehensive Quality Scoring
References
1. Aickin, M., & Gensler, H. (1996). Adjusting for multiple testing when reporting research results: The Bonferroni vs Holm methods. American Journal of Public Health, 86(5), 726–728. https://doi.org/10.2105/AJPH.86.5.726 [Google Scholar] [Crossref]
2. Aynetdinov, A., & Akbik, A. (2024). SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity. http://arxiv.org/abs/2401.17072 [Google Scholar] [Crossref]
3. Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. [Google Scholar] [Crossref]
4. Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19(1), 3–11. https://doi.org/10.2466/pr0.1966.19.1.3 [Google Scholar] [Crossref]
5. Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A. F. T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A. K., Takmaz, E., & Testoni, A. (2024). LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. http://arxiv.org/abs/2406.18403 [Google Scholar] [Crossref]
6. Chung, G., & Baker, E. L. (2003). Issues in the reliability and validity of automated scoring of constructed responses. Automated Essay Grading: A Cross-Disciplinary Approach, 23–40. [Google Scholar] [Crossref]
7. Colombo, P., Peyrard, M., Noiry, N., West, R., & Piantanida, P. (2023). The Glass Ceiling of Automatic Evaluation in Natural Language Generation. IJCNLP-AACL 2023 - 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Findings of the Association for Computational Linguistics: IJCNLP-AACL, 178–183. https://doi.org/10.18653/v1/2023.findings-ijcnlp.16 [Google Scholar] [Crossref]
8. Cox, N. J. (2008). Speaking Stata: Correlation with confidence, or Fisher's z revisited. Stata Journal, 8(3), 413–439. https://doi.org/10.1177/1536867x0800800307 [Google Scholar] [Crossref]
9. Davoodijam, E., & Alambardar Meybodi, M. (2024). Evaluation metrics on text summarization: Comprehensive survey. Knowledge and Information Systems, 66(12), 7717–7738. https://doi.org/10.1007/s10115-024-02217-0 [Google Scholar] [Crossref]
10. Gao, M., Hu, X., Ruan, J., Pu, X., & Wan, X. (2024). LLM-based NLG Evaluation: Current Status and Challenges. http://arxiv.org/abs/2402.01383 [Google Scholar] [Crossref]
11. Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 141–151. https://doi.org/10.11613/BM.2015.015 [Google Scholar] [Crossref]
12. Graham, Y., & Baldwin, T. (2014). Testing for significance of increased correlation with human judgment. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 172–176. https://doi.org/10.3115/v1/d14-1020 [Google Scholar] [Crossref]
13. Gunawan, D., Sembiring, C. A., & Budiman, M. A. (2018). The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents. Journal of Physics: Conference Series, 978(1). https://doi.org/10.1088/1742-6596/978/1/012120 [Google Scholar] [Crossref]
14. Hu, X., Gao, M., Hu, S., Zhang, Y., Chen, Y., Xu, T., & Wan, X. (2024). Are LLM-based Evaluators Confusing NLG Quality Criteria? Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1, 9530–9570. https://doi.org/10.18653/v1/2024.acl-long.516 [Google Scholar] [Crossref]
15. Kalinauskaitė, D. (2018). Detecting information-dense texts: Towards an automated analysis. CEUR Workshop Proceedings, 2145, 95–98. [Google Scholar] [Crossref]
16. Karch, J. (2020). Improving on adjusted R-squared. Collabra: Psychology, 6(1), 1–11. https://doi.org/10.1525/collabra.343 [Google Scholar] [Crossref]
17. Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., & Seo, M. (2024). Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. 12th International Conference on Learning Representations, ICLR 2024, 1–37. [Google Scholar] [Crossref]
18. Kim, S., Suk, J., Welleck, S., Neubig, G., Longpre, S., Yuchen, B., Jamin, L., Lee, M., Lee, K., Seo, M., & Ai, K. (2024). PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models. [Google Scholar] [Crossref]
19. Kirschen, R. H., O'Higgins, E. A., & Lee, R. T. (2000). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. American Journal of Orthodontics and Dentofacial Orthopedics, 118(4), 456–461. https://doi.org/10.1067/mod.2000.109032 [Google Scholar] [Crossref]
20. Lin, C.-Y. (2004). Looking for a Few Good Metrics: ROUGE and its Evaluation. NTCIR Workshop, 1–8. [Google Scholar] [Crossref]
21. Meijer, R. J., & Goeman, J. J. (2013). Efficient approximate k-fold and leave-one-out cross-validation for ridge regression. Biometrical Journal, 55(2), 141–155. https://doi.org/10.1002/bimj.201200088 [Google Scholar] [Crossref]
22. O'Neill, J., & Bollegala, D. (2020). Learning to Evaluate Neural Language Models. In Computational Linguistics (L.-M. Nguyen, X.-H. Phan, K. Hasida, & S. Tojo (eds.); pp. 123–133). Springer Singapore. [Google Scholar] [Crossref]
23. Oyama, M., & Shimodaira, H. (2023). Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings. https://doi.org/10.48550/arXiv.2406.10984 [Google Scholar] [Crossref]
24. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 311–318. https://doi.org/10.3115/1073083.1073135 [Google Scholar] [Crossref]
25. Schmidtova, P., Mahamood, S., Balloccu, S., Dusek, O., Gatt, A., Gkatzia, D., Howcroft, D. M., Platek, O., & Sivaprasad, A. (2025). Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices. 557–583. https://doi.org/10.18653/v1/2024.inlg-main.44 [Google Scholar] [Crossref]
26. Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245–251. https://doi.org/10.1037/0033-2909.87.2.245 [Google Scholar] [Crossref]
27. Susoy, Z. (2023). Lexical Density, Lexical Diversity and Academic Vocabulary Use: Differences in Dissertation Abstracts. Acuity: Journal of English Language Pedagogy, Literature, and Culture, 8(2), 198–210. https://doi.org/10.35974/acuity.v8i2.3079 [Google Scholar] [Crossref]
28. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. http://arxiv.org/abs/2302.13971 [Google Scholar] [Crossref]
29. Vajjala, S., & Meurers, D. (2016). Readability-based Sentence Ranking for Evaluating Text Simplification. http://arxiv.org/abs/1603.06009 [Google Scholar] [Crossref]
30. Verduijn, M., Peek, N., de Keizer, N. F., van Lieshout, E. J., de Pont, A. C. J. M., Schultz, M. J., de Jonge, E., & de Mol, B. A. J. M. (2008). Individual and Joint Expert Judgments as Reference Standards in Artifact Detection. Journal of the American Medical Informatics Association, 15(2), 227–234. https://doi.org/10.1197/jamia.M2493 [Google Scholar] [Crossref]
31. Xiao, Z., Zhang, S., Lai, V., & Liao, Q. V. (2023). Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory. EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, 10967–10982. https://doi.org/10.18653/v1/2023.emnlp-main.676 [Google Scholar] [Crossref]
32. Yang, Y., Zhong, J., Wang, C., & Li, Q. (2022). Exploring Relevance and Coherence for Automated Text Scoring using Multi-task Learning. Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE, 323–328. https://doi.org/10.18293/SEKE2022-024 [Google Scholar] [Crossref]
33. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. 8th International Conference on Learning Representations, ICLR 2020, 1–43. [Google Scholar] [Crossref]
Similar Articles
- What the Desert Fathers Teach Data Scientists: Ancient Ascetic Principles for Ethical Machine-Learning Practice
- Comparative Analysis of Some Machine Learning Algorithms for the Classification of Ransomware
- Comparative Performance Analysis of Some Priority Queue Variants in Dijkstraโs Algorithm
- Transfer Learning in Detecting E-Assessment Malpractice from a Proctored Video Recordings.
- Dual-Modal Detection of Parkinsonโs Disease: A Clinical Framework and Deep Learning Approach Using NeuroParkNet