Special Issue on Emerging Paradigms in Computer Science and Technology  
Multi-Modal Soft-Skill Interview Assessment: Real-Time Emotion,  
Speech Analytics, and LLM Scoring  
Sachin Jadhav1, Soham Surdas2, Saif Khan3, Soham Kasurde4, Rahul Sakpal5  
Vishwakarma Institute of Technology, Pune, India  
Received: 17 November 2025; Accepted: 24 November 2025; Published: 13 December 2025  
ABSTRACT  
We introduce a full-stack, multi-modal platform for soft-skill interview assessment that integrates automatic  
speech recognition (Whisper), facial-emotion analysis (DeepFace), and LLM reasoning (Gemini) into a  
single, real-time workflow. Audio streams are transcribed and analyzed to compute words-per-minute  
(WPM), filler-word rate/count, and lightweight lexical cues; webcam frames yield per-frame emotion  
distributions that are aggregated into an emotion timeline. Resumes are parsed to a normalized skills  
inventory that seeds skills-aware technical questions, while curated banks provide six soft-skill probes. Each  
response is scored by the LLM (1–5) with a concise rationale and an “ideal answer,” then fused with speech  
and affect features to infer communication clarity, confidence/composure, attentiveness/engagement, and  
linguistic hygiene via transparent, rule-based heuristics (e.g., optimal WPM band, low filler rate,  
neutral/happy dominance with low negative variance). The system is engineered for scale and auditability (stateless services, base64 media handling, prompt versioning, distribution-only emotion storage) and persists
metrics and narratives for explainable reporting. We detail the architecture, schemas, and fusion logic, and  
demonstrate how multi-signal evidence produces consistent, actionable insights that improve interviewer  
trust and candidate coaching value versus single-modal baselines.  
Keywords: multi-modal assessment; soft skills; interview analytics; Whisper; DeepFace; Gemini; speech
metrics; emotion timeline; LLM scoring  
I. INTRODUCTION
Organizations increasingly recognize that long-term performance depends not only on technical proficiency  
but on a candidate’s ability to communicate clearly, collaborate effectively, and demonstrate leadership  
potential in dynamic environments. Yet despite this shift, the majority of hiring and talent-development  
assessments remain manual, subjective, and heavily biased toward hard skills. Traditional interview formats  
rely on human interpretation of verbal responses, leaving non-verbal behavior, delivery quality, and  
contextual consistency largely unmeasured. Even when digital tools are used, they are typically single-modal (focused on text sentiment, audio transcription, or questionnaire scoring), producing generic,
surface-level feedback that lacks depth, repeatability, and credibility.  
The limitations of these single-channel approaches are well documented. Text-only systems cannot capture  
tone, pace, hesitation patterns, or emotional dynamics; audio-only systems miss content richness and  
conversational alignment; and rule-based behavioral checklists cannot contextualize performance in relation  
to the candidate’s background, domain, or real-time cognitive load. As a result, organizations struggle to  
obtain decision-grade evidence on soft skills, diminishing trust among hiring managers, candidates, and HR  
teams. The absence of rich, multimodal data also weakens downstream coaching, making development  
plans overly generic and reducing their perceived value.  
To address these gaps, we implement an end-to-end, production-ready multimodal assessment platform that  
unifies verbal signals, non-verbal behavior, and LLM-based reasoning into a single interpretable pipeline.  
Our system integrates three complementary signal streams:  
1. Verbal: Whisper-generated transcripts enriched with quantitative speech analytics, including Words-
per-Minute (WPM), filler rate/count, hesitation clusters, and lexical quality indicators.  
2. Non-verbal: DeepFace-derived emotion distributions aggregated into emotion timelines, per-class  
averages, and temporal variance profiles that reveal engagement, composure, and affective stability.  
3. Reasoning: Gemini for resume understanding & skills normalization, skills-aware question  
generation, per-answer rating (1–5) with rationale and “ideal answer,” and post-interview narrative
synthesis.  
By fusing these modalities, the platform captures a richer representation of candidate behavior, reduces  
interpretive bias, and produces auditable, explainable soft-skill assessments aligned with how organizations  
actually make leadership, promotion, and hiring decisions. The result is a scientifically grounded, real-time  
capability that strengthens evaluative consistency, enhances perceived fairness, and unlocks more targeted  
coaching pathways for continuous development.  
II. LITERATURE SURVEY  
[1] Resume Analyzer Using LLM (2024). This study leverages an LLM with a domain-adaptation strategy  
(e.g., MGAT-style alignment) to classify resumes against job descriptions, reporting an F1 of ~80% and  
outperforming CNN and Bi-LSTM baselines. The use of job-posting text as a source domain, which reduces reliance on large labeled resume corpora, shows that LLMs can provide scalable, high-precision screening with lower data overhead, which benefits resume-aware question generation and skill normalization within our pipeline.
[2] An Automated Resume Screening System Using NLP (2020). Using classic NLP with vectorization, the system extracts entities and key phrases, embeds the resume and the job description, and ranks candidates by cosine similarity between the two embeddings. The findings demonstrate reliable, low-cost, interpretable candidate ranking, which serves as a practical baseline for our LLM-based skill extractor while retaining interpretable career- and skill-similarity scores.
[3] Facial Emotion Detection and Recognition (2021). CNN-based pipelines outperformed standard ML pipelines for frame-level emotion recognition, with the CNN-derived frameworks providing the strongest balance between accuracy and runtime. The authors also note accuracy degradation when multiple faces appear in a frame, reinforcing our decision to opt for single-face tracking in tandem with timeline aggregation to stabilize affective signals during interviews.
[4] Emotion Recognition from Speech: A Review (2024). Reviewing simulated, elicited, and natural speech corpora, the survey links emotional state to acoustic features (pitch, duration, energy) and compares SVMs with neural networks, with neural networks generally performing better. These findings validate our speech-feature taps (e.g., WPM and fillers as paralinguistic proxies) and support the potential of multi-modal fusion with text to improve affect inference.
[5] SkillNER: Mining and Mapping Soft Skills from Any Text (2021). A named-entity-recognition (NER) approach, evaluated on ESCO-aligned job data, identifies soft-skill entities in job texts and maps them to role profiles to improve job classification and skill retrieval. Because the mapping is ontology-aware, it formed the basis of our resume parsing and skill-normalization layer, providing better grounding for skills-aware technical question generation.
[6] Multi-Class Confidence Detection Using Deep Learning (2024). A CNN (e.g., GoogleNet) model  
detected hand-gesture-based confidence states (confidence, cooperation, confusion, discomfort) with an  
accuracy rate of ~90.48%, outperforming SVM/KNN baseline approaches. This work illustrates the  
importance of non-verbal cues and further supports our design decision to use facial-emotion timelines as an easily scalable proxy for engagement and confidence.
[7] Estimation of Presentation Skills from Slides and Audio (2021). ML models classified presenter  
performance based on slide features (word count, images, fonts; ~65% accuracy) and audio prosody  
features (pitch variation, filled pauses; ~69% accuracy). The relative lift from audio corroborates our
emphasis on speech analytics (WPM, fillers) as high-signal indicators for communication quality.  
[8] Automated Prediction of Job Interview Performance (2015). A multimodal framework combining  
prosody (intonation, pitch, pauses) with facial expression analysis (e.g., Smile/Nod detection via Shore +  
AdaBoost) predicts interview outcomes. The work shows that “what you say” and “how you say it” jointly  
drive performance, directly motivating our fusion of ASR-derived metrics, affect timelines, and LLM-based  
scoring to reduce variance and enhance feedback utility.  
Collectively, the reviewed studies show several trends:
1. LLMs/NLP for resume understanding are viable at production scale. Domain-adapted LLMs  
outperform classical NLP and reduce the need for labeled resume corpora; cosine-similarity  
pipelines remain a strong, interpretable baseline.  
2. Paralinguistics matter. Neural models using prosody features (pitch, energy, pauses) beat classical  
SVMs; audio features (e.g., filled pauses) are more predictive of presentation/interview quality than  
slide features.  
3. Multimodal beats unimodal. Combining what is said (text/semantics) and how it is said (speech +  
facial affect) yields stronger prediction of interview performance than any single stream.  
III. SYSTEM ARCHITECTURE  
Our platform operationalizes three signal streams behind a thin orchestration layer (React UI; Python  
services; MongoDB persistence):  
1. Verbal stream (speech → text → analytics). Audio is transcribed by Whisper (base). We compute  
communication KPIs: Words-per-Minute (WPM), Filler-word rate/count, and a lightweight lexical  
signal.  
2. Non-verbal stream (video → affect). Client frames are processed with DeepFace to estimate per-  
frame emotion distributions (happy, neutral, sad, anger, fear, disgust, surprise). Distributions are  
persisted to form an emotion timeline for each interview.  
3. LLM reasoning stream. Gemini is used for (i) resume understanding & skills normalization, (ii)  
skills-aware technical question generation, (iii) per-answer rating (1–5) with rationale and an idealized answer, and (iv) post-interview narrative summaries.
Data model
MongoDB collections:
1. Resume (raw extract, LLM analysis JSON, skills summary).
2. Soft Skill Questions (six banks: communication, teamwork, problem-solving, adaptability, leadership, time management).
3. Interviews (question set, per-answer transcripts, metrics, and LLM assessments, emotion timeline, status, and timestamps).
Ownership is scoped via a lightweight user identity header.
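A minimal sketch of how one completed interview might be persisted with pymongo. The field names below are illustrative assumptions rather than the production schema; they mirror the artifacts described above (per-answer transcripts, metrics, LLM assessments, and the distribution-only emotion timeline).

# Illustrative interview document (assumed field names, not the production schema).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["interview_platform"]

interview_doc = {
    "user_id": "candidate-123",  # ownership scoped via the identity header
    "status": "completed",
    "created_at": datetime.now(timezone.utc),
    "questions": [
        {
            "text": "Describe a time you resolved a team conflict.",
            "category": "teamwork",
            "transcript": "...",
            "metrics": {"wpm": 132.4, "filler_count": 4, "filler_rate": 0.031},
            "llm_assessment": {"rating": 4, "rationale": "...", "ideal_answer": "..."},
        }
    ],
    "emotion_timeline": [
        # distribution-only storage: one distribution per sampled frame, no images
        {"t": 0.0, "happy": 0.12, "neutral": 0.78, "sad": 0.02, "angry": 0.01,
         "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
    ],
}
db.interviews.insert_one(interview_doc)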
Design choices. Stateless processing for scale; base64 media handling and short-lived temp files for  
portability; persisted metrics and model outputs for auditability; minimal PII with options to store emotion  
distributions instead of images.  
Fig. 1. High-level architecture of the multi-modal interview platform  
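As an illustration of the stateless, base64-based media handling noted in the design choices above, the following sketch decodes a client-supplied base64 audio payload to a short-lived temp file and transcribes it with Whisper (base). The helper name and payload shape are assumptions for illustration; only whisper.load_model and transcribe are the package's actual calls.

# Sketch of stateless media handling: decode base64 audio to a short-lived
# temp file, transcribe with Whisper (base), then discard the file.
import base64
import os
import tempfile

import whisper

_model = whisper.load_model("base")  # loaded once per process

def transcribe_base64_audio(audio_b64: str) -> dict:
    audio_bytes = base64.b64decode(audio_b64)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name
    try:
        result = _model.transcribe(tmp_path)
        return {"text": result["text"], "language": result.get("language")}
    finally:
        os.remove(tmp_path)  # no media persisted beyond the request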
IV. METHODOLOGY  
We implemented a multi-modal pipeline that provides decision-grade evidence from interviews and consists  
of four stages: (1) resume understanding to ground the conversation in the candidate’s skills, (2) speech
analytics to quantify delivery quality, (3) facial-emotion tracking to capture engagement dynamics, and  
(4) fusion-based soft-skill inference to translate raw signals into auditable measures. Each stage emits  
structured artifacts (skills lists, transcripts, KPIs, emotion timelines, rationale from LLM) that are persisted  
for explainability, reviewer calibration, and longitudinal coaching.  
Resume Analysis:  
Convert unstructured resumes (PDF/DOCX) into a structured skills repository that informs technical question generation and subsequent assessment.
1. Document Parsing and Text Normalization:
Low-level extraction using PyPDF2/docx2txt.  
Text sanitation including whitespace canonicalization, header/footer removal, and boilerplate  
stripping.  
Sentence segmentation to support downstream skill mining.  
2. LLM-based Semantic Decomposition:
A Gemini-based parser transforms the cleaned text into a structured resume schema:
Personal summary  
Work experience (role, domain, impact statements)  
Technical/non-technical skill sets  
Certifications, tools, and domain tags  
Skills are clustered by domain (analytics, cloud, frontend, process, etc.).
3. Skills-Aware Question Grounding:
The normalized skills list seeds the interview:  
Technical questions require ≥2 distinct skills per prompt to enforce depth and avoid generic  
questions.  
Domain probes are automatically aligned to the candidate’s background, preserving fairness and face  
validity.  
The system stores: original text snippet, parsed structure, and final skills inventory → all linked to  
the interview session.  
Fig. 2. Resume Analysis Flow  
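As a concrete illustration of the parsed structure and the ≥2-skills rule above, the sketch below shows an assumed shape for the LLM analysis JSON and a simple helper that samples distinct skill pairs to seed technical questions; all field names are illustrative assumptions.

# Illustrative resume-analysis artifact (assumed field names) and a helper
# that enforces >= 2 distinct skills per technical-question seed.
import random
from itertools import combinations

parsed_resume = {
    "summary": "Data engineer with three years of ETL and cloud experience.",
    "experience": [{"role": "Data Engineer", "domain": "analytics",
                    "impact": "Cut pipeline latency by 40%"}],
    "skills": {
        "analytics": ["sql", "pandas"],
        "cloud": ["aws", "docker"],
        "process": ["agile"],
    },
    "certifications": ["AWS Certified Cloud Practitioner"],
}

def seed_skill_pairs(parsed: dict, n_questions: int = 3) -> list:
    """Sample distinct skill pairs that seed skills-aware technical questions."""
    flat = sorted({s for group in parsed["skills"].values() for s in group})
    pairs = list(combinations(flat, 2))
    random.shuffle(pairs)
    return pairs[:n_questions]

print(seed_skill_pairs(parsed_resume))  # e.g. [('aws', 'sql'), ('docker', 'pandas'), ...]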
Signal Processing & Feature Extraction: Multi-modal raw signals (audio, video, text) are processed into
quantitative indicators used for soft-skill inference.  
Speech analytics (Whisper + metrics):  
Transcription & language detection: Whisper (base).  
Words-per-Minute (WPM):  
WPM = (words in transcript) / (speaking duration in minutes)
This measure captures pace, fluency, and cognitive load.  
Filler-word rate using a fixed lexicon {“um”, “uh”, “like”, “you know”, “er”, “ah”, “so”, “well”, “actually”}:
Filler Rate = (filler tokens) / (total tokens)
Fig. 3. Speech Analytics Flow  
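A minimal sketch of the two delivery metrics above, assuming the Whisper transcript and the speaking duration (in seconds) are already available; the simple tokenizer is an assumption, and the lexicon mirrors the fixed filler list given earlier.

import re

# Fixed filler lexicon from the text above; multi-word fillers are matched as phrases.
FILLERS = {"um", "uh", "like", "you know", "er", "ah", "so", "well", "actually"}

def speech_metrics(transcript: str, duration_sec: float) -> dict:
    """Compute WPM and filler count/rate from a transcript and its duration."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    words = len(tokens)
    text = " ".join(tokens)
    filler_count = sum(
        len(re.findall(rf"\b{re.escape(f)}\b", text)) for f in FILLERS
    )
    wpm = words / (duration_sec / 60.0) if duration_sec > 0 else 0.0
    filler_rate = filler_count / words if words else 0.0
    return {"wpm": wpm, "filler_count": filler_count, "filler_rate": filler_rate}

# Example: a 30-second answer
print(speech_metrics("So, um, I led the migration and, you know, cut costs.", 30))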
Facial affect timeline (DeepFace):  
For each frame, we obtain dominant emotion and distribution over classes.  
We aggregate over time to compute: per-emotion averages and temporal variance (standard  
deviation) as stability proxies.  
Timeline visualizations enable qualitative inspection:
Predominantly neutral → calm composure (not disengagement).  
Positive peaks (happy/surprise) → enthusiasm/engagement.  
Persistent anger/sad/fear → stress signal; cross-check with filler spikes or off-band WPM.  
Fig. 4. Facial Emotion Timeline Flow  
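A sketch of the timeline aggregation, assuming each sampled frame has already been reduced (e.g., via DeepFace's emotion action) to a distribution over the seven classes normalized to [0, 1]; only these distributions are stored, never frames.

import statistics

EMOTIONS = ["happy", "neutral", "sad", "angry", "fear", "disgust", "surprise"]

def aggregate_timeline(frames: list) -> dict:
    """Per-emotion averages and temporal standard deviation over a timeline.

    `frames` is a list of per-frame distributions (assumed normalized to [0, 1]),
    one dict per sampled webcam frame.
    """
    summary = {}
    for emo in EMOTIONS:
        series = [f.get(emo, 0.0) for f in frames]
        summary[emo] = {
            "mean": statistics.fmean(series) if series else 0.0,
            "std": statistics.pstdev(series) if len(series) > 1 else 0.0,
        }
    return summary

timeline = [
    {"happy": 0.10, "neutral": 0.80, "sad": 0.02, "angry": 0.01,
     "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
    {"happy": 0.25, "neutral": 0.65, "sad": 0.02, "angry": 0.01,
     "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
]
print(aggregate_timeline(timeline))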
LLM reasoning:  
1. Resume understanding & skills normalization → concise bulletized skills inventory that seeds  
technical question generation.  
2. Question orchestration → enforce ≥2 skills per technical question; soft-skill questions sampled from  
curated banks.  
3. Answer evaluation → for each response, the LLM returns a rating (1–5), a short explanation, and an
ideal answer summary.  
4. Post-interview synthesis → multi-signal executive summary plus engagement bullets.  
LLM rationales are persisted to maintain interpretability across sessions.  
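A sketch of the per-answer evaluation call, assuming the google-generativeai client; the prompt wording, model name, and the expectation that the model returns plain JSON are illustrative assumptions rather than the production prompt.

# Sketch of per-answer evaluation (illustrative prompt and model name; assumes
# the response body is plain JSON that can be parsed directly).
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def evaluate_answer(question: str, transcript: str) -> dict:
    prompt = (
        "You are scoring an interview answer on a 1-5 scale.\n"
        f"Question: {question}\nAnswer: {transcript}\n"
        'Return JSON: {"rating": int, "rationale": str, "ideal_answer": str}'
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # persisted alongside the transcript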
Multimodal Fusion & Soft-Skill Inference:  
We infer soft-skill constructs by combining verbal, non-verbal, and LLM signals through transparent rules  
calibrated on practitioner guidance. Our current implementation surfaces metrics and evidence; the same  
logic can be converted into indices when needed.  
Communication Clarity:  
Principal signals: LLM response score and rationale; transcript coherence.  
Support signals: a lower filler rate and a WPM within the optimal band (≈110–160 WPM) indicate clearer delivery.
Heuristic indicator:  
S(clarity) = 0.6 × LLM_rating / 5 + 0.25 × g(WPM) + 0.15 × (1−FillerRate)  
where g(·) is maximized in the 110–160 WPM band (triangular membership).
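The clarity heuristic can be realized directly as below; the triangular membership g follows the stated 110–160 WPM optimal band, while the shoulder width over which the score decays to zero is an illustrative assumption.

def g_wpm(wpm: float, low: float = 110.0, high: float = 160.0,
          shoulder: float = 40.0) -> float:
    """Membership of WPM in the optimal band: 1.0 inside 110-160, decaying
    linearly to 0 over `shoulder` WPM outside it (shoulder width assumed)."""
    if low <= wpm <= high:
        return 1.0
    if wpm < low:
        return max(0.0, 1.0 - (low - wpm) / shoulder)
    return max(0.0, 1.0 - (wpm - high) / shoulder)

def clarity_score(llm_rating: int, wpm: float, filler_rate: float) -> float:
    """S(clarity) = 0.6 * rating/5 + 0.25 * g(WPM) + 0.15 * (1 - filler rate)."""
    return 0.6 * (llm_rating / 5.0) + 0.25 * g_wpm(wpm) + 0.15 * (1.0 - filler_rate)

print(round(clarity_score(llm_rating=4, wpm=132.0, filler_rate=0.03), 2))  # 0.88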
Confidence & Composure:
1. Principal signals: emotion timeline (high averages for neutral/happy, low variance in negative emotions), a consistent WPM band, and a low filler rate.
2. Interpretation rules:
High neutral (e.g., >60%) → calm composure (not disengaged).
Happy/surprise spikes → positive engagement.
Persistent anger/sadness/fear → stress or discomfort flags.
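The interpretation rules above translate straightforwardly into rule-based flags; the >60% neutral threshold comes from the text, while the remaining cut-offs are illustrative assumptions.

def composure_flags(emotion_summary: dict, wpm: float, filler_rate: float) -> dict:
    """Rule-based composure/confidence flags over the aggregated timeline.

    `emotion_summary` maps each emotion to {"mean", "std"} as in the timeline
    aggregation sketch above; thresholds other than the 60% neutral rule are
    illustrative assumptions.
    """
    neutral = emotion_summary["neutral"]["mean"]
    positive = emotion_summary["happy"]["mean"] + emotion_summary["surprise"]["mean"]
    negative = sum(emotion_summary[e]["mean"] for e in ("angry", "sad", "fear"))
    return {
        "calm_composure": neutral > 0.60,        # high neutral, not disengaged
        "positive_engagement": positive > 0.20,  # happy/surprise spikes
        "stress_flag": negative > 0.30,          # persistent negative affect
        "steady_delivery": 110 <= wpm <= 160 and filler_rate < 0.05,
    }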
Attentiveness/Engagement:  
Indicators:  
1. Emotion variance (not flat, not erratic)  
2. Response immediacy (silence < 1.5s)  
3. Alignment to question (LLM rationale)  
4. Verbal continuity (few long pauses)  
Vocabulary & Linguistic Hygiene:  
Indicators:  
1. Filler usage  
2. Lexical richness  
3. Jargon accuracy (checked via LLM explanations)  
4. Misuse or over-generalization of terms  
Why rule-based fusion? It is auditable, easy to calibrate with human reviewers, and avoids opaque weighting.
Where a single number is needed, we use the above weighted indices but always make the underlying  
evidence available (metrics, emotions, transcript snippets, LLM rationale).  
Question Strategy (Skills-aware + Soft-skill probes):  
Technical questions and skills-based probes are grounded in the candidate’s normalized skills list, adding face validity and reducing hallucinated content.
Six soft-skill probes mapped to common organizational competency frameworks:  
Communication  
Teamwork  
Problem-Solving  
Adaptability  
Leadership  
Time Management  
This ensures coverage consistency and comparability across candidates.  
Reporting & Explainability:  
Key Performance Indicator roll-ups: total words, average rating, filler rate, per-emotion  
averages/variance.  
Visual timelines: emotion timeline, WPM banding  
Executive summary: generated with explicit interpretation rules.
Coaching artifacts: per-answer ideal answers and bulletized strengths/areas for improvement per  
soft-skill category.  
V. RESULTS AND DISCUSSION
Our deployment shows that a multi-modal, resume-aware interview process can produce decision-grade  
insights in real time while preserving auditability.  
1. Multi-modal lift: Combining Whisper speech metrics, DeepFace emotion timelines, and Gemini  
scoring produced more consistent and explainable judgments than single-modal baselines.  
2. Actionable coaching: WPM banding, filler profiles, and ideal answers converted abstract feedback
into concrete next steps (pace control, filler reduction, example quality).  
3. Relevance by design: Resume-aware technical questions reduced off-topic drift and increased  
perceived fairness/face validity.  
4. Operational reliability: Stateless services, prompt versioning, and persisted artifacts enabled side-  
by-side reviews and quick issue triage.  
Reason for effectiveness: Soft skills naturally manifest across multiple signals; clear, verifiable fusion rules (optimal WPM band, minimal fillers, neutral/happy dominance, and low negative-emotion variance) convert raw measurements into credible evidence of clarity, confidence, and engagement.
What makes it distinctive: Evidence-first explanations, standardized soft-skill probes, and resume-grounded questions offer both decision support and coaching value, without opaque scoring.
Future Scope  
To scale this multi-modal platform from a prototype to an enterprise-grade competency layer, the long-term
roadmap focuses on three vectors: (1) intelligence (transitioning from rule-based fusion to calibrated, data-driven models); (2) signal depth (adding low-cost, high-signal behavioral cues); and (3) governance (making fairness, drift, and human oversight actionable). These moves preserve “glass-box” explainability while unlocking higher accuracy, role fit, and longitudinal learning value.
1. Learned Fusion & Calibration: Evolve from heuristic weights to data-driven fusion (e.g., regularized  
regression or calibrated ensembles) trained against expert panels; maintain glass-box explainability  
by surfacing the contribution of each feature to each decision.  
2. Richer Signal Stack: Add prosody (pitch, pause ratio, energy), conversation turn-taking dynamics,  
discourse markers, and lightweight head-pose cues to strengthen confidence and engagement  
inference without heavy computation.  
3. Role- and Level-Specific Rubrics: Parameterize prompts and scoring rubrics by function (sales, consulting, support, engineering) and seniority band; enable policy-driven profiles that auto-select question banks and thresholds aligned with each rubric context.
4. Fairness, Drift & Compliance Ops: Execute scheduled bias audits, stratified performance reports, prompt/version drift monitors, and consent-aware data retention, and publish model cards and datasheets to institutionalize governance.
5. Human-in-the-Loop Tooling: Add reviewer calibration workbenches (side-by-side evidence,  
disagreement heat maps), rubric alignment checks, and adjudication workflows to iteratively  
improve inter-rater reliability over time.  
VI. CONCLUSION
This work operationalizes a multi-modal, resume-specific interview assessment pipeline that combines  
ASR, affect analytics, and LLM reasoning into decision-grade, explainable outputs. By integrating transcript  
quality (WPM, fillers), affect dynamics (average emotion/variance), and LLM reasoning (ratings, rationales, ideal answers), the platform provides both evaluation fidelity and coaching utility. For reviewers, the measures prevent black-box scores, while candidates receive targeted, high-leverage guidance instead of generic advice.
Strategically, the architecture is production-ready and governance-friendly: stateless services, prompt versioning, distribution-only emotion storage (no raw video), and auditable artifacts. Tactically, pairing resume-informed technical probes with standardized soft-skill questions has added face validity and comparability across interview sessions. Future directions build on this data with learned fusion, extended non-verbal signals, and role-specific scoring rubrics to improve signal quality while maintaining transparency. With the recommended extensions, along with commitments to fairness, drift monitoring, and enterprise integration, the platform can develop into an institutionally scalable competency layer for hiring and L&D, moving the interview process from subjective snapshots of performance to a measurable model of development.
REFERENCES  
1. H. Chandhana, “Resume Analyzer Using LLM,” IRJWEB, Dec. 2024.  
2. C. Daryani, G. S. Chhabra, H. Patel, I. K. Chhabra, and R. Patel, “An Automated Resume  
Screening System Using Natural Language Processing and Similarity,” in Proc. Ethics and  
Information Technology (ETIT), 2020, pp. 99–103, doi: 10.26480/etit.02.2020.99.103.
3. H. T. and V. Varalatchoumy, “Facial Emotion Recognition System using Deep Learning and  
Convolutional Neural Networks,” Int. J. Engineering Research & Technology (IJERT), vol. 10, no.  
06, June 2021, doi: 10.17577/IJERTV10IS060338.  
4. G. M. Dar and R. Delhibabu, “Speech Databases, Speech Features, and Classifiers in Speech  
Emotion Recognition: A Review,” IEEE Access, Jan. 2024, doi: 10.1109/ACCESS.2024.3476960.  
5. S. Fareri, N. Melluso, F. Chiarello, and G. Fantoni, “SkillNER: Mining and Mapping Soft Skills  
from any Text,” Expert Systems with Applications, vol. 169, 2021,  
doi: 10.1016/j.eswa.2021.115544.  
6. S. S. Y. Tun, S. Okada, H.-H. Huang, and C. W. Leong, “Multimodal Transfer Learning for Oral Presentation Assessment,” IEEE Access, vol. 11, pp. 84013–84026, 2023.
7. K. Kasa, D. Burns, M. G. Goldenberg, O. Selim, C. Whyne, and M. Hardisty, “Multi-Modal Deep Learning for Assessing Surgeon Technical Skill,” Sensors, vol. 22, no. 19, art. 7328, 2022.
8. L. Chen, G. Feng, J. N. Joe, C. W. Leong, C. Kitchen, and C. M. Lee, “Towards Automated Assessment of Public Speaking Skills Using Multimodal Cues,” in Proc. 16th International Conference on Multimodal Interaction (ICMI), 2014, pp. 163–170.