Multi-Modal Soft-Skill Interview Assessment: Real-Time Emotion, Speech Analytics, and LLM Scoring
Authors
Vishwakarma Institute of Technology, Pune (India)
Article Information
DOI: 10.51244/IJRSI.2025.1213CS0011
Subject Category: Artificial Intelligence
Volume/Issue: 12/13 | Page No: 130-139
Publication Timeline
Submitted: 2025-11-17
Accepted: 2025-11-24
Published: 2025-12-13
Abstract
We introduce a full-stack, multi-modal platform for soft-skill interview assessment that integrates automatic speech recognition (Whisper), facial-emotion analysis (DeepFace), and LLM reasoning (Gemini) into a single, real-time workflow. Audio streams are transcribed and analyzed to compute words-per-minute (WPM), filler-word rate/count, and lightweight lexical cues; webcam frames yield per-frame emotion distributions that are aggregated into an emotion timeline. Resumes are parsed to a normalized skills inventory that seeds skills-aware technical questions, while curated banks provide six soft-skill probes. Each response is scored by the LLM (1–5) with a concise rationale and an "ideal answer," then fused with speech and affect features to infer communication clarity, confidence/composure, attentiveness/engagement, and linguistic hygiene via transparent, rule-based heuristics (e.g., optimal WPM band, low filler rate, neutral/happy dominance with low negative variance). The system is engineered for scale and auditability (stateless services, base64 media handling, prompt versioning, distribution-only emotion storage) and persists metrics and narratives for explainable reporting. We detail the architecture, schemas, and fusion logic, and demonstrate how multi-signal evidence produces consistent, actionable insights that improve interviewer trust and candidate coaching value versus single-modal baselines.
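The speech metrics and rule-based fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the filler-word list, the WPM band (110–160), the filler-rate thresholds, and the function names are all assumptions chosen for clarity.

```python
import re

# Assumed single-word filler vocabulary; the paper's actual list may differ.
FILLERS = {"um", "uh", "like", "basically", "actually"}

def speech_metrics(transcript: str, duration_s: float) -> dict:
    """Compute WPM, filler count, and filler rate from a transcript."""
    words = re.findall(r"[a-zA-Z']+", transcript.lower())
    wpm = len(words) / (duration_s / 60.0)
    filler_count = sum(1 for w in words if w in FILLERS)
    return {
        "wpm": wpm,
        "filler_count": filler_count,
        "filler_rate": filler_count / max(len(words), 1),
    }

def clarity_score(metrics: dict) -> int:
    """Map speech metrics to a 1-5 score via transparent thresholds."""
    score = 3
    if 110 <= metrics["wpm"] <= 160:   # assumed "optimal WPM band"
        score += 1
    if metrics["filler_rate"] < 0.03:  # assumed "low filler rate"
        score += 1
    if metrics["filler_rate"] > 0.10 or metrics["wpm"] < 80:
        score -= 1
    return max(1, min(5, score))
```

Because every threshold is an explicit constant, each score can be traced back to the exact rule that produced it, which is the auditability property the abstract emphasizes.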
Keywords
multi-modal assessment; soft skills; interview analytics; Whisper; DeepFace; Gemini; speech metrics; emotion timeline; LLM scoring
References
1. H. Chandhana, "Resume Analyzer Using LLM," IRJWEB, Dec. 2024. https://www.irjweb.com/RESUME%20ANALYZER%20USING%20LLM..pdf
2. C. Daryani, G. S. Chhabra, H. Patel, I. K. Chhabra, and R. Patel, "An Automated Resume Screening System Using Natural Language Processing and Similarity," in Proc. Ethics and Information Technology (ETIT), 2020, pp. 99–103, doi: 10.26480/etit.02.2020.99.103.
3. H. T. and V. Varalatchoumy, "Facial Emotion Recognition System using Deep Learning and Convolutional Neural Networks," Int. J. Engineering Research & Technology (IJERT), vol. 10, no. 06, June 2021, doi: 10.17577/IJERTV10IS060338.
4. G. M. Dar and R. Delhibabu, "Speech Databases, Speech Features, and Classifiers in Speech Emotion Recognition: A Review," IEEE Access, Jan. 2024, doi: 10.1109/ACCESS.2024.3476960.
5. S. Fareri, N. Melluso, F. Chiarello, and G. Fantoni, "SkillNER: Mining and Mapping Soft Skills from any Text," Expert Systems with Applications, vol. 169, 2021, doi: 10.1016/j.eswa.2021.115544.
6. S. S. Y. Tun, S. Okada, H.-H. Huang, and C. W. Leong, "Multimodal Transfer Learning for Oral Presentation Assessment," IEEE Access, vol. 11, pp. 84013–84026, 2023, doi: 10.1109/ACCESS.2023.3301016.
7. K. Kasa, D. Burns, M. G. Goldenberg, O. Selim, C. Whyne, and M. Hardisty, "Multi-Modal Deep Learning for Assessing Surgeon Technical Skill," Sensors, vol. 22, no. 19, 7328, 2022, doi: 10.3390/s22197328.
8. L. Chen, G. Feng, J. N. Joe, C. W. Leong, C. Kitchen, and C. M. Lee, "Towards Automated Assessment of Public Speaking Skills Using Multimodal Cues," in Proc. 16th Int. Conf. on Multimodal Interaction (ICMI), 2014, pp. 163–170, doi: 10.1145/2663204.2663271.
Similar Articles
- The Role of Artificial Intelligence in Revolutionizing Library Services in Nairobi: Ethical Implications and Future Trends in User Interaction
- ESPYREAL: A Mobile Based Multi-Currency Identifier for Visually Impaired Individuals Using Convolutional Neural Network
- Comparative Analysis of AI-Driven IoT-Based Smart Agriculture Platforms with Blockchain-Enabled Marketplaces
- AI-Based Dish Recommender System for Reducing Fruit Waste through Spoilage Detection and Ripeness Assessment
- SEA-TALK: An AI-Powered Voice Translator and Southeast Asian Dialects Recognition