Special Issue on Emerging Paradigms in Computer Science and Technology  
Multi-Modal Soft-Skill Interview Assessment: Real-Time Emotion,  
Speech Analytics, and LLM Scoring  
Sachin Jadhav1, Soham Surdas2, Saif Khan3, Soham Kasurde4, Rahul Sakpal5  
Vishwakarma Institute of Technology, Pune, India  
Received: 17 November 2025; Accepted: 24 November 2025; Published: 13 December 2025  
ABSTRACT  
We introduce a full-stack, multi-modal platform for soft-skill interview assessment that integrates automatic  
speech recognition (Whisper), facial-emotion analysis (DeepFace), and LLM reasoning (Gemini) into a  
single, real-time workflow. Audio streams are transcribed and analyzed to compute words-per-minute  
(WPM), filler-word rate/count, and lightweight lexical cues; webcam frames yield per-frame emotion  
distributions that are aggregated into an emotion timeline. Resumes are parsed to a normalized skills  
inventory that seeds skills-aware technical questions, while curated banks provide six soft-skill probes. Each  
response is scored by the LLM (1–5) with a concise rationale and an “ideal answer,” then fused with speech  
and affect features to infer communication clarity, confidence/composure, attentiveness/engagement, and  
linguistic hygiene via transparent, rule-based heuristics (e.g., optimal WPM band, low filler rate,  
neutral/happy dominance with low negative variance). The system is engineered for scale and auditability (stateless services, base64 media handling, prompt versioning, distribution-only emotion storage) and persists
metrics and narratives for explainable reporting. We detail the architecture, schemas, and fusion logic, and  
demonstrate how multi-signal evidence produces consistent, actionable insights that improve interviewer  
trust and candidate coaching value versus single-modal baselines.  
Keywords: multi-modal assessment; soft skills; interview analytics; Whisper; DeepFace; Gemini; speech
metrics; emotion timeline; LLM scoring  
I. INTRODUCTION
Organizations increasingly recognize that long-term performance depends not only on technical proficiency  
but on a candidate’s ability to communicate clearly, collaborate effectively, and demonstrate leadership  
potential in dynamic environments. Yet despite this shift, the majority of hiring and talent-development  
assessments remain manual, subjective, and heavily biased toward hard skills. Traditional interview formats  
rely on human interpretation of verbal responses, leaving non-verbal behavior, delivery quality, and  
contextual consistency largely unmeasured. Even when digital tools are used, they are typically single-modal (focused on text sentiment, audio transcription, or questionnaire scoring), producing generic,
surface-level feedback that lacks depth, repeatability, and credibility.  
The limitations of these single-channel approaches are well documented. Text-only systems cannot capture  
tone, pace, hesitation patterns, or emotional dynamics; audio-only systems miss content richness and  
conversational alignment; and rule-based behavioral checklists cannot contextualize performance in relation  
to the candidate’s background, domain, or real-time cognitive load. As a result, organizations struggle to  
obtain decision-grade evidence on soft skills, diminishing trust among hiring managers, candidates, and HR  
teams. The absence of rich, multimodal data also weakens downstream coaching, making development  
plans overly generic and reducing their perceived value.  
To address these gaps, we implement an end-to-end, production-ready multimodal assessment platform that  
unifies verbal signals, non-verbal behavior, and LLM-based reasoning into a single interpretable pipeline.  
Our system integrates three complementary signal streams:  
1. Verbal: Whisper-generated transcripts enriched with quantitative speech analytics, including Words-
per-Minute (WPM), filler rate/count, hesitation clusters, and lexical quality indicators.  
2. Non-verbal: DeepFace-derived emotion distributions aggregated into emotion timelines, per-class  
averages, and temporal variance profiles that reveal engagement, composure, and affective stability.  
3. Reasoning: Gemini for resume understanding & skills normalization, skills-aware question  
generation, per-answer rating (1–5) with rationale and “ideal answer,” and post-interview narrative
synthesis.  
By fusing these modalities, the platform captures a richer representation of candidate behavior, reduces  
interpretive bias, and produces auditable, explainable soft-skill assessments aligned with how organizations  
actually make leadership, promotion, and hiring decisions. The result is a scientifically grounded, real-time  
capability that strengthens evaluative consistency, enhances perceived fairness, and unlocks more targeted  
coaching pathways for continuous development.  
II. LITERATURE SURVEY  
[1] Resume Analyzer Using LLM (2024). This study leverages an LLM with a domain-adaptation strategy  
(e.g., MGAT-style alignment) to classify resumes against job descriptions, reporting an F1 of ~80% and  
outperforming CNN and Bi-LSTM baselines. The use of job-posting text as a source domain, which reduces reliance on large labeled resume corpora, shows that LLMs can provide scalable, high-precision screening with lower data overhead, which benefits resume-aware question generation and skill normalization within our pipeline.
[2] An Automated Resume Screening System Using NLP (2020). Using classic NLP with vectorization, the system extracts entities and key phrases, embeds the resume and the job description, and ranks candidates by cosine similarity between the two embeddings. The findings demonstrate reliable, low-cost, interpretable candidate ranking, which serves as a practical baseline for our LLM-based skill extractor while retaining interpretable career- and skill-similarity scores.
[3] Facial Emotion Detection and Recognition (2021). CNN-based pipelines outperformed standard ML pipelines for frame-level emotion recognition, with the CNN-derived frameworks providing the strongest balance between accuracy and runtime. The authors also note accuracy degradation when multiple faces appear in a frame, reinforcing our decision to opt for single-face tracking in tandem with timeline aggregation to stabilize affective signals during interviews.
[4] Emotion Recognition from Speech: A Review (2024). Reviewing simulated, elicited, and natural speech corpora, the survey links emotional state to acoustic features (pitch, duration, energy) and compares SVMs with neural networks, with neural networks generally performing better. These findings validate our speech-feature taps (e.g., WPM and fillers as paralinguistic proxies) and support the potential of multi-modal fusion with text to improve affect inference.
[5] SkillNER: Mining and Mapping Soft Skills from Any Text (2021). A named-entity-recognition (NER) approach, evaluated on ESCO-aligned job data, identifies soft-skill entities in job texts and maps them to role profiles to improve job classification and skill retrieval. Because the mapping is ontology-aware, it formed the basis of our resume parsing and skill-normalization layer, providing better grounding for skills-aware technical question generation.
[6] Multi-Class Confidence Detection Using Deep Learning (2024). A CNN (e.g., GoogleNet) model  
detected hand-gesture-based confidence states (confidence, cooperation, confusion, discomfort) with an  
accuracy rate of ~90.48%, outperforming SVM/KNN baseline approaches. This work illustrates the  
importance of non-verbal cues and further supports our design decision to use facial-emotion timelines as an easily scalable proxy for engagement and confidence.
[7] Estimation of Presentation Skills from Slides and Audio (2021). ML models classified presenter  
performance based on slide features (word count, images, fonts; ~65% accuracy) and audio prosody  
features (pitch variation, filled pauses; ~69% accuracy). The relative lift from audio corroborates our
emphasis on speech analytics (WPM, fillers) as high-signal indicators for communication quality.  
[8] Automated Prediction of Job Interview Performance (2015). A multimodal framework combining  
prosody (intonation, pitch, pauses) with facial expression analysis (e.g., Smile/Nod detection via Shore +  
AdaBoost) predicts interview outcomes. The work shows that “what you say” and “how you say it” jointly  
drive performance, directly motivating our fusion of ASR-derived metrics, affect timelines, and LLM-based  
scoring to reduce variance and enhance feedback utility.  
Collectively, the reviewed studies show several trends:
1. LLMs/NLP for resume understanding are viable at production scale. Domain-adapted LLMs  
outperform classical NLP and reduce the need for labeled resume corpora; cosine-similarity  
pipelines remain a strong, interpretable baseline.  
2. Paralinguistics matter. Neural models using prosody features (pitch, energy, pauses) beat classical  
SVMs; audio features (e.g., filled pauses) are more predictive of presentation/interview quality than  
slide features.  
3. Multimodal beats unimodal. Combining what is said (text/semantics) and how it is said (speech +  
facial affect) yields stronger prediction of interview performance than any single stream.  
III. SYSTEM ARCHITECTURE  
Our platform operationalizes three signal streams behind a thin orchestration layer (React UI; Python  
services; MongoDB persistence):  
1. Verbal stream (speech → text → analytics). Audio is transcribed by Whisper (base). We compute  
communication KPIs: Words-per-Minute (WPM), Filler-word rate/count, and a lightweight lexical  
signal.  
2. Non-verbal stream (video → affect). Client frames are processed with DeepFace to estimate per-  
frame emotion distributions (happy, neutral, sad, anger, fear, disgust, surprise). Distributions are  
persisted to form an emotion timeline for each interview.  
3. LLM reasoning stream. Gemini is used for (i) resume understanding & skills normalization, (ii)  
skills-aware technical question generation, (iii) per-answer rating (1–5) with rationale and an idealized answer, and (iv) post-interview narrative summaries.
Data model
MongoDB collections:
1. Resume (raw extract, LLM analysis JSON, skills summary).
2. Soft Skill Questions (six banks: communication, teamwork, problem-solving, adaptability, leadership, time management).
3. Interviews (question set, per-answer transcripts, metrics, and LLM assessments, emotion timeline, status, and timestamps).
Ownership is scoped via a lightweight user identity header.
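A minimal sketch of how one completed interview might be persisted with pymongo. The field names below are illustrative assumptions rather than the production schema; they mirror the artifacts described above (per-answer transcripts, metrics, LLM assessments, and the distribution-only emotion timeline).

# Illustrative interview document (assumed field names, not the production schema).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["interview_platform"]

interview_doc = {
    "user_id": "candidate-123",  # ownership scoped via the identity header
    "status": "completed",
    "created_at": datetime.now(timezone.utc),
    "questions": [
        {
            "text": "Describe a time you resolved a team conflict.",
            "category": "teamwork",
            "transcript": "...",
            "metrics": {"wpm": 132.4, "filler_count": 4, "filler_rate": 0.031},
            "llm_assessment": {"rating": 4, "rationale": "...", "ideal_answer": "..."},
        }
    ],
    "emotion_timeline": [
        # distribution-only storage: one distribution per sampled frame, no images
        {"t": 0.0, "happy": 0.12, "neutral": 0.78, "sad": 0.02, "angry": 0.01,
         "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
    ],
}
db.interviews.insert_one(interview_doc)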
Design choices. Stateless processing for scale; base64 media handling and short-lived temp files for  
portability; persisted metrics and model outputs for auditability; minimal PII with options to store emotion  
distributions instead of images.  
Fig. 1. High-level architecture of the multi-modal interview platform  
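As an illustration of the stateless, base64-based media handling noted in the design choices above, the following sketch decodes a client-supplied base64 audio payload to a short-lived temp file and transcribes it with Whisper (base). The helper name and payload shape are assumptions for illustration; only whisper.load_model and transcribe are the package's actual calls.

# Sketch of stateless media handling: decode base64 audio to a short-lived
# temp file, transcribe with Whisper (base), then discard the file.
import base64
import os
import tempfile

import whisper

_model = whisper.load_model("base")  # loaded once per process

def transcribe_base64_audio(audio_b64: str) -> dict:
    audio_bytes = base64.b64decode(audio_b64)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name
    try:
        result = _model.transcribe(tmp_path)
        return {"text": result["text"], "language": result.get("language")}
    finally:
        os.remove(tmp_path)  # no media persisted beyond the request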
IV. METHODOLOGY  
We implemented a multi-modal pipeline that provides decision-grade evidence from interviews and consists  
of four stages: (1) resume understanding to ground the conversation in the candidate’s skills, (2) speech
analytics to quantify delivery quality, (3) facial-emotion tracking to capture engagement dynamics, and  
(4) fusion-based soft-skill inference to translate raw signals into auditable measures. Each stage emits  
structured artifacts (skills lists, transcripts, KPIs, emotion timelines, rationale from LLM) that are persisted  
for explainability, reviewer calibration, and longitudinal coaching.  
Resume Analysis:  
Convert unstructured resumes (PDF/DOCX) into a structured skills repository that informs technical question generation and subsequent assessment.
1. Document Parsing and Text Normalization:
Low-level extraction using PyPDF2/docx2txt.  
Text sanitation including whitespace canonicalization, header/footer removal, and boilerplate  
stripping.  
Sentence segmentation to support downstream skill mining.  
2. LLM-based Semantic Decomposition:
A Gemini-based parser transforms the cleaned text into a structured resume schema:
Personal summary  
Work experience (role, domain, impact statements)  
Technical/non-technical skill sets  
Certifications, tools, and domain tags  
Skills are clustered by domain (analytics, cloud, frontend, process, etc.).
3. Skills-Aware Question Grounding:
The normalized skills list seeds the interview:  
Technical questions require ≥2 distinct skills per prompt to enforce depth and avoid generic  
questions.  
Domain probes are automatically aligned to the candidate’s background, preserving fairness and face  
validity.  
The system stores: original text snippet, parsed structure, and final skills inventory → all linked to  
the interview session.  
Fig. 2. Resume Analysis Flow  
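As a concrete illustration of the parsed structure and the ≥2-skills rule above, the sketch below shows an assumed shape for the LLM analysis JSON and a simple helper that samples distinct skill pairs to seed technical questions; all field names are illustrative assumptions.

# Illustrative resume-analysis artifact (assumed field names) and a helper
# that enforces >= 2 distinct skills per technical-question seed.
import random
from itertools import combinations

parsed_resume = {
    "summary": "Data engineer with three years of ETL and cloud experience.",
    "experience": [{"role": "Data Engineer", "domain": "analytics",
                    "impact": "Cut pipeline latency by 40%"}],
    "skills": {
        "analytics": ["sql", "pandas"],
        "cloud": ["aws", "docker"],
        "process": ["agile"],
    },
    "certifications": ["AWS Certified Cloud Practitioner"],
}

def seed_skill_pairs(parsed: dict, n_questions: int = 3) -> list:
    """Sample distinct skill pairs that seed skills-aware technical questions."""
    flat = sorted({s for group in parsed["skills"].values() for s in group})
    pairs = list(combinations(flat, 2))
    random.shuffle(pairs)
    return pairs[:n_questions]

print(seed_skill_pairs(parsed_resume))  # e.g. [('aws', 'sql'), ('docker', 'pandas'), ...]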
Signal Processing & Feature Extraction: Multi-modal raw signals (audio, video, text) are processed into
quantitative indicators used for soft-skill inference.  
Speech analytics (Whisper + metrics):  
Transcription & language detection: Whisper (base).  
Words-per-Minute (WPM):  
WPM = (words in transcript) / (speaking duration in minutes)
This measure captures pace, fluency, and cognitive load.  
Filler-word rate using a fixed lexicon {“um”, “uh”, “like”, “you know”, “er”, “ah”, “so”, “well”, “actually”}:
Filler Rate = (filler tokens) / (total tokens)
Fig. 3. Speech Analytics Flow  
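A minimal sketch of the two delivery metrics above, assuming the Whisper transcript and the speaking duration (in seconds) are already available; the simple tokenizer is an assumption, and the lexicon mirrors the fixed filler list given earlier.

import re

# Fixed filler lexicon from the text above; multi-word fillers are matched as phrases.
FILLERS = {"um", "uh", "like", "you know", "er", "ah", "so", "well", "actually"}

def speech_metrics(transcript: str, duration_sec: float) -> dict:
    """Compute WPM and filler count/rate from a transcript and its duration."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    words = len(tokens)
    text = " ".join(tokens)
    filler_count = sum(
        len(re.findall(rf"\b{re.escape(f)}\b", text)) for f in FILLERS
    )
    wpm = words / (duration_sec / 60.0) if duration_sec > 0 else 0.0
    filler_rate = filler_count / words if words else 0.0
    return {"wpm": wpm, "filler_count": filler_count, "filler_rate": filler_rate}

# Example: a 30-second answer
print(speech_metrics("So, um, I led the migration and, you know, cut costs.", 30))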
Facial affect timeline (DeepFace):  
For each frame, we obtain dominant emotion and distribution over classes.  
We aggregate over time to compute: per-emotion averages and temporal variance (standard  
deviation) as stability proxies.  
Timeline visualizations enable qualitative inspection:
Predominantly neutral → calm composure (not disengagement).  
Positive peaks (happy/surprise) → enthusiasm/engagement.  
Persistent anger/sad/fear → stress signal; cross-check with filler spikes or off-band WPM.  
Fig. 4. Facial Emotion Timeline Flow  
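A sketch of the timeline aggregation, assuming each sampled frame has already been reduced (e.g., via DeepFace's emotion action) to a distribution over the seven classes normalized to [0, 1]; only these distributions are stored, never frames.

import statistics

EMOTIONS = ["happy", "neutral", "sad", "angry", "fear", "disgust", "surprise"]

def aggregate_timeline(frames: list) -> dict:
    """Per-emotion averages and temporal standard deviation over a timeline.

    `frames` is a list of per-frame distributions (assumed normalized to [0, 1]),
    one dict per sampled webcam frame.
    """
    summary = {}
    for emo in EMOTIONS:
        series = [f.get(emo, 0.0) for f in frames]
        summary[emo] = {
            "mean": statistics.fmean(series) if series else 0.0,
            "std": statistics.pstdev(series) if len(series) > 1 else 0.0,
        }
    return summary

timeline = [
    {"happy": 0.10, "neutral": 0.80, "sad": 0.02, "angry": 0.01,
     "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
    {"happy": 0.25, "neutral": 0.65, "sad": 0.02, "angry": 0.01,
     "fear": 0.02, "disgust": 0.01, "surprise": 0.04},
]
print(aggregate_timeline(timeline))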
LLM reasoning:  
1. Resume understanding & skills normalization → concise bulletized skills inventory that seeds  
technical question generation.  
2. Question orchestration → enforce ≥2 skills per technical question; soft-skill questions sampled from  
curated banks.  
3. Answer evaluation → for each response, the LLM returns a rating (1–5), a short explanation, and an
ideal answer summary.  
4. Post-interview synthesis → multi-signal executive summary plus engagement bullets.  
LLM rationales are persisted to maintain interpretability across sessions.  
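A sketch of the per-answer evaluation call, assuming the google-generativeai client; the prompt wording, model name, and the expectation that the model returns plain JSON are illustrative assumptions rather than the production prompt.

# Sketch of per-answer evaluation (illustrative prompt and model name; assumes
# the response body is plain JSON that can be parsed directly).
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def evaluate_answer(question: str, transcript: str) -> dict:
    prompt = (
        "You are scoring an interview answer on a 1-5 scale.\n"
        f"Question: {question}\nAnswer: {transcript}\n"
        'Return JSON: {"rating": int, "rationale": str, "ideal_answer": str}'
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # persisted alongside the transcript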
Multimodal Fusion & Soft-Skill Inference:  
We infer soft-skill constructs by combining verbal, non-verbal, and LLM signals through transparent rules  
calibrated on practitioner guidance. Our current implementation surfaces metrics and evidence; the same  
logic can be converted into indices when needed.  
Communication Clarity:  
Principal signals: LLM response score and rationale; transcript coherence.  
Support signals: a lower filler rate and a WPM within the optimal band (≈110–160 WPM) indicate clearer delivery.
Heuristic indicator:  
S(clarity) = 0.6 × LLM_rating / 5 + 0.25 × g(WPM) + 0.15 × (1−FillerRate)  
where g(·) is maximized in the 110–160 WPM band (triangular membership).
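The clarity heuristic can be realized directly as below; the triangular membership g follows the stated 110–160 WPM optimal band, while the shoulder width over which the score decays to zero is an illustrative assumption.

def g_wpm(wpm: float, low: float = 110.0, high: float = 160.0,
          shoulder: float = 40.0) -> float:
    """Membership of WPM in the optimal band: 1.0 inside 110-160, decaying
    linearly to 0 over `shoulder` WPM outside it (shoulder width assumed)."""
    if low <= wpm <= high:
        return 1.0
    if wpm < low:
        return max(0.0, 1.0 - (low - wpm) / shoulder)
    return max(0.0, 1.0 - (wpm - high) / shoulder)

def clarity_score(llm_rating: int, wpm: float, filler_rate: float) -> float:
    """S(clarity) = 0.6 * rating/5 + 0.25 * g(WPM) + 0.15 * (1 - filler rate)."""
    return 0.6 * (llm_rating / 5.0) + 0.25 * g_wpm(wpm) + 0.15 * (1.0 - filler_rate)

print(round(clarity_score(llm_rating=4, wpm=132.0, filler_rate=0.03), 2))  # 0.88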
Confidence & Composure:
1. Principal signals: emotion timeline (high averages for neutral/happy, low variance in negative emotions), a consistent WPM band, and a low filler rate.
2. Interpretation rules:
High neutral (e.g., >60%) → calm composure (not disengaged).
Happy/surprise spikes → positive engagement.
Persistent anger/sadness/fear → stress or discomfort flags.
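The interpretation rules above translate straightforwardly into rule-based flags; the >60% neutral threshold comes from the text, while the remaining cut-offs are illustrative assumptions.

def composure_flags(emotion_summary: dict, wpm: float, filler_rate: float) -> dict:
    """Rule-based composure/confidence flags over the aggregated timeline.

    `emotion_summary` maps each emotion to {"mean", "std"} as in the timeline
    aggregation sketch above; thresholds other than the 60% neutral rule are
    illustrative assumptions.
    """
    neutral = emotion_summary["neutral"]["mean"]
    positive = emotion_summary["happy"]["mean"] + emotion_summary["surprise"]["mean"]
    negative = sum(emotion_summary[e]["mean"] for e in ("angry", "sad", "fear"))
    return {
        "calm_composure": neutral > 0.60,        # high neutral, not disengaged
        "positive_engagement": positive > 0.20,  # happy/surprise spikes
        "stress_flag": negative > 0.30,          # persistent negative affect
        "steady_delivery": 110 <= wpm <= 160 and filler_rate < 0.05,
    }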
Attentiveness/Engagement:  
Indicators:  
1. Emotion variance (not flat, not erratic)  
2. Response immediacy (silence < 1.5s)  
3. Alignment to question (LLM rationale)  
4. Verbal continuity (few long pauses)  
Vocabulary & Linguistic Hygiene:  
Indicators:  
1. Filler usage  
2. Lexical richness  
3. Jargon accuracy (checked via LLM explanations)  
4. Misuse or over-generalization of terms  
Why rule-based fusion? It is auditable, easy to calibrate with human reviewers, and avoids opaque weighting.
Where a single number is needed, we use the above weighted indices but always make the underlying  
evidence available (metrics, emotions, transcript snippets, LLM rationale).  
Question Strategy (Skills-aware + Soft-skill probes):  
Technical questions and skills-based probes are grounded in the candidate’s normalized skills list, adding face validity and reducing hallucinated content.
Six soft-skill probes mapped to common organizational competency frameworks:  
Communication  
Teamwork  
Problem-Solving  
Adaptability  
Leadership  
Time Management  
This ensures coverage consistency and comparability across candidates.  
Reporting & Explainability:  
Key Performance Indicator roll-ups: total words, average rating, filler rate, per-emotion  
averages/variance.  
Visual timelines: emotion timeline, WPM banding  
Executive summary: generated with explicit interpretation rules.
Coaching artifacts: per-answer ideal answers and bulletized strengths/areas for improvement per  
soft-skill category.  
V. RESULTS AND DISCUSSION
Our deployment shows that a multi-modal, resume-aware interview process can produce decision-grade  
insights in real time while preserving auditability.  
1. Multi-modal lift: Combining Whisper speech metrics, DeepFace emotion timelines, and Gemini  
scoring produced more consistent and explainable judgments than single-modal baselines.  
2. Actionable coaching: WPM banding, filler profiles, and ideal answers converted abstract feedback
into concrete next steps (pace control, filler reduction, example quality).  
3. Relevance by design: Resume-aware technical questions reduced off-topic drift and increased  
perceived fairness/face validity.  
4. Operational reliability: Stateless services, prompt versioning, and persisted artifacts enabled side-  
by-side reviews and quick issue triage.  
Reason for effectiveness: Soft skills naturally manifest across multiple signals; clear, verifiable fusion rules (optimal WPM band, minimal fillers, neutral/happy dominance, and low negative-emotion variance) convert raw measurements into credible evidence of clarity, confidence, and engagement.
What makes it distinctive: Evidence-first explanations, standardized soft-skill probes, and resume-grounded questions offer both decision support and coaching value, without opaque scoring.
Future Scope  
To scale this multi-modal platform from a prototype to an enterprise-grade competency layer, the long-term
roadmap focuses on three vectors: (1) intelligence (transitioning from rule-based fusion to calibrated, data-driven models); (2) signal depth (adding low-cost, high-signal behavioral cues); and (3) governance (making fairness, drift, and human oversight actionable). These moves preserve “glass-box” explainability while unlocking higher accuracy, role fit, and longitudinal learning value.
1. Learned Fusion & Calibration: Evolve from heuristic weights to data-driven fusion (e.g., regularized  
regression or calibrated ensembles) trained against expert panels; maintain glass-box explainability  
by surfacing the contribution of each feature to each decision.  
2. Richer Signal Stack: Add prosody (pitch, pause ratio, energy), conversation turn-taking dynamics,  
discourse markers, and lightweight head-pose cues to strengthen confidence and engagement  
inference without heavy computation.  
3. Role- and Level-Specific Rubrics: Parameterize prompts and scoring rubrics by function (sales, consulting, support, engineering) and seniority band; enable policy-driven profiles that auto-select question banks and thresholds aligned with each rubric context.
4. Fairness, Drift & Compliance Ops: Execute scheduled bias audits, stratified performance reports, prompt/version drift monitors, and consent-aware data retention, and publish model cards and datasheets to institutionalize governance.
5. Human-in-the-Loop Tooling: Add reviewer calibration workbenches (side-by-side evidence,  
disagreement heat maps), rubric alignment checks, and adjudication workflows to iteratively  
improve inter-rater reliability over time.  
VI. CONCLUSION
This work operationalizes a multi-modal, resume-specific interview assessment pipeline that combines  
ASR, affect analytics, and LLM reasoning into decision-grade, explainable outputs. By integrating transcript  
quality (WPM, fillers), affect dynamics (average emotion/variance), and LLM reasoning (ratings, rationales, ideal answers), the platform provides both evaluation fidelity and coaching utility. For reviewers, the measures prevent black-box scores, while candidates receive targeted, high-leverage guidance instead of generic advice.
Strategically, the architecture is production-ready and governance-friendly: stateless services, prompt versioning, distribution-only emotion storage (no raw video), and auditable artifacts. Tactically, pairing resume-informed technical probes with standardized soft-skill questions has added face validity and comparability across interview sessions. Future directions build on this data with learned fusion, extended non-verbal signals, and role-specific scoring rubrics to improve signal quality while maintaining transparency. With the recommended extensions, along with commitments to fairness, drift monitoring, and enterprise integration, the platform can develop into an institutionally scalable competency layer for hiring and L&D, moving the interview process from subjective snapshots of performance to a measurable model of development.
REFERENCES  
1. H. Chandhana, “Resume Analyzer Using LLM,” IRJWEB, Dec. 2024.  
2. C. Daryani, G. S. Chhabra, H. Patel, I. K. Chhabra, and R. Patel, “An Automated Resume  
Screening System Using Natural Language Processing and Similarity,” in Proc. Ethics and  
Information Technology (ETIT), 2020, pp. 99–103, doi: 10.26480/etit.02.2020.99.103.
3. H. T. and V. Varalatchoumy, “Facial Emotion Recognition System using Deep Learning and  
Convolutional Neural Networks,” Int. J. Engineering Research & Technology (IJERT), vol. 10, no.  
06, June 2021, doi: 10.17577/IJERTV10IS060338.  
4. G. M. Dar and R. Delhibabu, “Speech Databases, Speech Features, and Classifiers in Speech  
Emotion Recognition: A Review,” IEEE Access, Jan. 2024, doi: 10.1109/ACCESS.2024.3476960.  
5. S. Fareri, N. Melluso, F. Chiarello, and G. Fantoni, “SkillNER: Mining and Mapping Soft Skills  
from any Text,” Expert Systems with Applications, vol. 169, 2021,  
doi: 10.1016/j.eswa.2021.115544.  
6. S. S. Y. Tun, S. Okada, H.-H. Huang, and C. W. Leong, “Multimodal Transfer Learning for Oral Presentation Assessment,” IEEE Access, vol. 11, pp. 84013–84026, 2023.
7. K. Kasa, D. Burns, M. G. Goldenberg, O. Selim, C. Whyne, and M. Hardisty, “Multi-Modal Deep Learning for Assessing Surgeon Technical Skill,” Sensors, vol. 22, no. 19, art. 7328, 2022.
8. L. Chen, G. Feng, J. N. Joe, C. W. Leong, C. Kitchen, and C. M. Lee, “Towards Automated Assessment of Public Speaking Skills Using Multimodal Cues,” in Proc. 16th International Conference on Multimodal Interaction (ICMI), 2014, pp. 163–170.