Multi-Modal Soft-Skill Interview Assessment: Real-Time Emotion, Speech Analytics, and LLM Scoring

Authors

Sachin Jadhav

Vishwakarma Institute of Technology, Pune (India)

Soham Surdas

Vishwakarma Institute of Technology, Pune (India)

Saif Khan

Vishwakarma Institute of Technology, Pune (India)

Soham Kasurde

Vishwakarma Institute of Technology, Pune (India)

Rahul Sakpal

Vishwakarma Institute of Technology, Pune (India)

Article Information

DOI: 10.51244/IJRSI.2025.1213CS0011

Subject Category: Artificial Intelligence

Volume/Issue: 12/13 | Page No: 130-139

Publication Timeline

Submitted: 2025-11-17

Accepted: 2025-11-24

Published: 2025-12-13

Abstract

We introduce a full-stack, multi-modal platform for soft-skill interview assessment that integrates automatic speech recognition (Whisper), facial-emotion analysis (DeepFace), and LLM reasoning (Gemini) into a single, real-time workflow. Audio streams are transcribed and analyzed to compute words-per-minute (WPM), filler-word rate/count, and lightweight lexical cues; webcam frames yield per-frame emotion distributions that are aggregated into an emotion timeline. Resumes are parsed into a normalized skills inventory that seeds skills-aware technical questions, while curated banks provide six soft-skill probes. Each response is scored by the LLM (1–5) with a concise rationale and an “ideal answer,” then fused with speech and affect features to infer communication clarity, confidence/composure, attentiveness/engagement, and linguistic hygiene via transparent, rule-based heuristics (e.g., optimal WPM band, low filler rate, neutral/happy dominance with low negative variance). The system is engineered for scale and auditability (stateless services, base64 media handling, prompt versioning, distribution-only emotion storage) and persists metrics and narratives for explainable reporting. We detail the architecture, schemas, and fusion logic, and demonstrate how multi-signal evidence produces consistent, actionable insights that improve interviewer trust and candidate coaching value versus single-modal baselines.
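The fusion step described above is compact enough to sketch. The following minimal Python illustration shows how the speech metrics (WPM, filler rate/count), the aggregated emotion timeline, and the transparent rule-based heuristics could fit together. The filler-word lexicon and every threshold below (the 110–160 WPM band, the 5% filler cutoff, the 60% neutral/happy share, the negative-variance bound) are illustrative assumptions, not values published in the paper.

```python
"""Illustrative sketch of the speech metrics and rule-based fusion
heuristics described in the abstract. All thresholds and the filler
lexicon are assumptions for illustration only."""

# Assumed filler-word inventory; the actual lexicon is not specified here.
FILLERS = {"um", "uh", "erm", "like", "basically", "actually"}

# Negative emotions in DeepFace's standard seven-label set.
NEGATIVE = ("angry", "disgust", "fear", "sad")


def speech_metrics(transcript: str, duration_sec: float) -> dict:
    """Words-per-minute and filler rate/count from an ASR transcript."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    wpm = len(words) / (duration_sec / 60.0) if duration_sec > 0 else 0.0
    filler_count = sum(1 for w in words if w in FILLERS)
    filler_rate = filler_count / len(words) if words else 0.0
    return {"wpm": wpm, "filler_count": filler_count, "filler_rate": filler_rate}


def emotion_summary(frames: list[dict]) -> dict:
    """Aggregate per-frame emotion distributions (as a DeepFace-style
    analyzer would emit) into mean shares per emotion, plus the variance
    of the per-frame negative-emotion mass."""
    n = len(frames)
    if n == 0:
        return {"means": {}, "neg_variance": 0.0}
    means: dict = {}
    for dist in frames:
        for emotion, p in dist.items():
            means[emotion] = means.get(emotion, 0.0) + p / n
    neg = [sum(d.get(e, 0.0) for e in NEGATIVE) for d in frames]
    neg_mean = sum(neg) / n
    neg_variance = sum((x - neg_mean) ** 2 for x in neg) / n
    return {"means": means, "neg_variance": neg_variance}


def fuse(speech: dict, emotions: dict) -> dict:
    """Rule-based fusion: optimal WPM band, low filler rate, and
    neutral/happy dominance with low negative variance."""
    clarity_ok = 110 <= speech["wpm"] <= 160          # assumed optimal band
    hygiene_ok = speech["filler_rate"] < 0.05         # assumed <5% fillers
    calm = emotions["means"].get("neutral", 0.0) + emotions["means"].get("happy", 0.0)
    composed_ok = calm >= 0.60 and emotions["neg_variance"] < 0.02  # assumed bounds
    return {
        "communication_clarity": "good" if clarity_ok else "needs work",
        "linguistic_hygiene": "good" if hygiene_ok else "needs work",
        "confidence_composure": "good" if composed_ok else "needs work",
    }
```

A call such as fuse(speech_metrics(transcript, 95.0), emotion_summary(frames)) would yield the per-dimension verdicts that, in the described system, are persisted alongside the LLM's 1–5 scores and rationales for explainable reporting.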

Keywords

multi-modal assessment; soft skills; interview analytics; Whisper; DeepFace; Gemini; speech metrics; emotion timeline; LLM scoring

