INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1341
Deepfake Speech Detection A Literature Review
Kishor Chandrapalan
Department of Artificial Intelligence Reva University Bengaluru, India
DOI:
https://dx.doi.org/10.51584/IJRIAS.2025.10100000118
Received: 25 October 2025; Accepted: 31 October 2025; Published: 12 November 2025
ABSTRACT
Deepfake audio technology and its potential for misuse represent significant challenges in the realms of
information integrity, identity protection, and public trust. This paper offers a comprehensive exploration of the
detection methods for deepfake speech and their implications. First, we examine the emerging threats of AI-
driven scams, particularly the use of Large Language Models (LLMs) in automating voice-based fraud, including
phone scams and virtual kidnapping. With the rise of these technologies, voice cloning can be exploited to
deceive victims into revealing sensitive information, undermining public safety and trust.
Alongside this, we analyze the state of deepfake detection technologies through a systematic review of ten key
studies, focusing on common feature extraction techniques such as Mel-Frequency Cepstral Coefficients
(MFCCs), spectrogram-based features, pause characteristics, and advanced deep learning methods. MFCCs
remain foundational, complemented by newer techniques like spectrogram analysis and deep learning models,
yet challenges persist in dataset variability, generalization, and adversarial robustness. Furthermore, ethical
concerns surrounding the potential misuse of deepfake technologiessuch as in spreading misinformation or
violating privacyhighlight the need for a more robust ethical framework. Future research must prioritize
creating hybrid detection systems that combine deep learning with real-time operational capabilities, all while
considering the ethical and adversarial aspects of this evolving technology. This dual analysis aims to guide the
development of more effective, ethically sound detection systems for deepfake speech and AI-driven scams.
This research calls for interdisciplinary collaboration to address both the technical and ethical challenges posed
by these advanced AI systems, emphasizing the necessity for diversified datasets, real-time detection, and robust
defenses against adversarial threats.
Keywords Deepfake audio, deepfake detection, voice cloning, Mel-Frequency Cepstral Coefficients
(MFCC), spectrogram analysis, deep learning, adversarial robustness, ethical concerns, misinformation, privacy
violation, real-time detection, hybrid detection systems, Generative Adversarial Networks (GANs), automated
fraud detection
INTRODUCTION
The advent of deepfake audio technology has introduced significant challenges to information integrity, security,
and personal identity. Deepfake speech, generated by artificial intelligence (AI) models, has become a growing
concern due to its ability to create human-like voices used for malicious purposes such as identity theft,
misinformation, and social manipulation [1], [2]. As these synthetic voices become more sophisticated, the need
for robust detection methods to protect individuals and systems reliant on voice recognition for security and
authentication has become critical [3].
Voice authentication systems, widely used in banking and personal assistants, are particularly vulnerable to
spoofing attacks involving deepfake audio [4]. This has led to the development of various detection methods
utilizing signal processing techniques, machine learning (ML) algorithms, and deep neural networks [5], [6].
Features such as Mel-Frequency Cepstral Coefficients (MFCCs), pitch, formants, and statistical properties have
been extensively used to train classifiers that distinguish between human and machine-generated speech [7], [8].
Despite these advancements, challenges remain in creating detection systems that can generalize across diverse
datasets and real-world environments, especially when exposed to new or unseen types of deepfake generation
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1342
methods [9]. Real-time detection has become increasingly important, as detection systems must function
efficiently without compromising performance [10]. Ethical concerns also arise regarding the misuse of deepfake
technology for illegal activities, prompting discussions on developing systems that not only detect but also
prevent malicious use of synthetic media [6].
The increasing sophistication of deepfake technologies has led to the development of various deepfake speech
detection systems. These systems incorporate signal processing, ML models, and deep learning approaches to
tackle the challenges of identifying synthetic speech. Recent work [5], [6] explores the use of advanced
algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for
detecting deepfake audio. Other studies have highlighted the importance of real-time detection [4], [8], where
detection systems must maintain high accuracy and performance under real-world conditions. Moreover, hybrid
techniques combining traditional methods with deep learning have been proposed as effective approaches to
enhance detection accuracy and robustness [8]. Researchers continue to explore systems capable of adapting to
new deepfake generation techniques, including those utilizing Generative Adversarial Networks (GANs) [9],
[10].
In conclusion, ongoing efforts focus on enhancing detection systems that are not only accurate but also ethical
and resilient to future advancements in deepfake technology. These systems aim to mitigate the risks posed by
deepfake audio in various domains, ensuring the protection of identity, security, and trust in voice-based
technologies
Deepfake Speech Detection Systems
Design Characteristics
Deepfake speech detection systems are designed to identify synthetic audio generated by AI models,
distinguishing it from human speech. A key characteristic of such systems is their feature extraction process,
which involves capturing a wide array of features that help differentiate real from synthetic speech. These
features often include Mel-Frequency Cepstral Coefficients (MFCCs), pitch, formants, and other statistical
properties like skewness and kurtosis, which are fundamental in speech analysis [1][5]. Temporal features, such
as prosody and speech dynamics, play a crucial role in identifying the subtle variations present in human speech
that deepfake audio typically lacks [8][9]. The integration of both time and frequency domain features ensures
that the system captures a more comprehensive set of speech characteristics, helping it better differentiate
synthetic from genuine speech. The system also needs to be robust across various environmental conditions,
including different accents, languages, and noise environments, making generalization essential [9][10]. To
ensure high performance in real-world applications, deepfake speech detection systems must operate in real-time
without compromising accuracy. This is particularly challenging for systems involved in voice authentication or
live speech monitoring [4][6]. Advanced machine learning models, such as Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs), are integral to these systems, as they excel at capturing both
spatial and temporal dependencies in speech data, making them highly effective for detecting deepfake audio
[7][8]. CNNs, for instance, are adept at extracting features from spectrograms, while RNNs capture the temporal
relationships present in speech. Furthermore, integrating adversarial training is critical to making detection
systems resilient to evolving deepfake generation methods. Given the rapid advancements in deepfake
technology, detection systems must remain adaptable and able to withstand new types of synthetic speech created
by Generative Adversarial Networks (GANs) [7][9]. Another important characteristic is the system’s
explainability, particularly in sensitive applications where transparency is crucial for ensuring trust. Models must
be interpretable, providing clear insights into how decisions are made, which can help mitigate risks associated
with false positives and support ethical considerations [8][10]. Finally, deepfake speech detection systems must
be scalable and lightweight for real-time deployment on edge devices, such as smartphones and voice assistants,
where computational resources are limited. This requires optimization through techniques like model pruning
and quantization to ensure that the system performs efficiently even with large-scale datasets [5][6].
Additionally, the systems should support continuous learning to stay updated with emerging deepfake
techniques. Online learning approaches and regular model updates ensure that detection systems remain relevant
and effective as new deepfake methods evolve [6][9]. By integrating these features and characteristics, deepfake
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1343
speech detection systems can become reliable, scalable, and efficient tools for combating the growing threat of
AI-generated audio maintaining robust performance under adversarial conditions.
Fig.1 A block diagram of various deepfake audio detection systems
Structured Pipeline
A deepfake speech detection system follows a structured pipeline that involves various stages of data processing,
feature extraction, model training, and classification. The first step in the pipeline is data collection, where a
dataset of both real and fake speech samples is gathered. These datasets may come from public sources, such as
ASVspoof or proprietary collections of AI-generated speech, and the data may be augmented using techniques
like pitch shifting, speed alteration, or adding background noise to increase the diversity of the dataset and
improve model robustness [7][8].
Next is the preprocessing phase, which includes segmentation, where the raw speech is divided into short
segments or windows (typically 2040 ms). This helps in capturing the temporal dynamics of speech, which are
essential for differentiating between human and synthetic speech [5][9]. Additionally, normalization and scaling
are applied to the extracted features to ensure consistency and prevent any feature from dominating due to scale
differences [5][8]. Noise reduction techniques, such as bandpass filtering or spectral subtraction, are used to
remove background noise and focus on the primary features of the speech signal [6][9].
In the feature extraction stage, key acoustic features are extracted. Common features include Mel-Frequency
Cepstral Coefficients (MFCCs), which represent the power spectrum of speech and are widely used for speech
recognition and deepfake detection [7][8]. Features related to pitch and formants are also used to capture the
harmonic content of speech and distinguish between human and machine-generated voices [6][9]. Additional
statistical features such as skewness, kurtosis, and spectral flatness are employed to capture subtle artifacts
inherent in deepfake speech [5][10]. Temporal features are also analysed to detect recurrent patterns and
transitions between speech segments, which are often indicative of synthetic speech generation methods [8][9].
The extracted features are often subjected to dimensionality reduction techniques such as Principal Component
Analysis (PCA) or Linear Discriminant Analysis (LDA) to reduce the feature space's dimensionality, making
the training process more efficient and less prone to overfitting, especially with high-dimensional data [2][7].
In the model training step, various machine learning algorithms are employed. Traditional methods like Support
Vector Machines (SVM), Random Forests, and Logistic Regression are used for binary classification tasks (real
vs. fake speech) [5][6]. However, deep learning models, particularly Convolutional Neural Networks (CNNs)
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1344
and Recurrent Neural Networks (RNNs), are favoured due to their ability to capture complex, non-linear patterns
in speech. CNNs are effective in extracting spatial features from spectrograms, while RNNs excel in learning
temporal dependencies in audio sequences [9][10]. Hybrid models, combining both signal processing features
and deep learning, are often employed to improve performance in real-world applications [8][10].
The classification and detection phase involves binary classification, where the trained model outputs a
prediction of whether the speech is real or synthetic. This prediction is usually evaluated using performance
metrics such as accuracy, precision, recall, and F1 score [7][9]. Some models also incorporate adversarial
training, where the system is exposed to synthetic data generated by adversarial networks, thereby enhancing its
robustness and ability to generalize to newer generation techniques [7][10].
Post-processing involves interpreting the model's output, setting appropriate thresholds for classifying speech as
real or fake, and potentially applying explainability models to help understand the decision-making process,
which is crucial for applications in security and legal settings [8]. In real-time detection scenarios, optimization
is required to reduce latency and ensure efficient predictions. Techniques like model pruning, quantization, or
edge computing are used to ensure that the detection system can operate within the constraints of real-time
applications, such as voice authentication systems [4][6]. By following these steps, deepfake speech detection
systems can effectively differentiate between genuine and synthetic speech, leveraging both traditional signal
processing techniques and advanced machine learning models to ensure robustness and accuracy across various
types of deepfake audio generation
Fig 2.Pipeline of a deepfake audio detection system
Feature Extraction
Feature extraction plays a critical role in deepfake speech detection systems, as highlighted across various
studies. The primary purpose of feature extraction is to transform raw audio data into meaningful representations
that machine learning models can effectively analyze. This process is important for several reasons. First,
deepfake speech typically contains subtle artifacts that differ from authentic speech, and feature extraction
techniques help capture these nuances. For instance, features like Mel-frequency cepstral coefficients (MFCCs),
pitch, formants, and statistical features (e.g., skewness, kurtosis) are essential for detecting these
inconsistencies. MFCCs, being sensitive to the spectral properties of the speech signal, help identify unnatural
speech patterns that arise from AI generation processes [1], [6]. Second, raw audio data can be extremely
complex, containing a large amount of noise and irrelevant information. Feature extraction reduces the
dimensionality of the data by focusing on relevant characteristics like spectral features, temporal features, and
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1345
energy content. This simplification not only makes the data more manageable but also enhances the efficiency
of the detection model. Third, the features extracted serve as the input for machine learning algorithms. By using
features that are specifically relevant for distinguishing real speech from deepfake speech, the model can achieve
better accuracy and robustness. Temporal features, for example, capture the dynamic behavior of speech, while
spectral features highlight unnatural sound patterns specific to synthetic audio.
Fig.3 Feature extraction from the audio
Fig 4.MFCC Feature Extraction
MFCC is widely used for capturing spectral characteristics of audio signals. It transforms audio signals into a
representation that closely resembles human auditory perception. MFCCs are essential for distinguishing
between real and synthetic speech due to their ability to capture phonetic details.
MFCC(n) = ∑_(k=1)^K log(X(k)) * cos[ Kπn(k-0.5) / K
MFCC(n): This represents the nth Mel-Frequency Cepstral Coefficient.
∑_(k=1)^K: This symbol indicates summation. The sum is calculated over values of 'k' from 1 to K.
log(X(k)): This is the natural logarithm of the k-th value of the Mel-scaled power spectrum.
cos[ Kπn(k-0.5) / K ]: This is the cosine function applied to a specific argument.
Additionally, feature extraction helps models generalize better, particularly when dealing with diverse sources
of data. For example, in the ASVspoof challenge, different spoofing techniques required extracting delta and
delta-delta features along with standard MFCCs to improve detection performance across various attack types
[2]. These features allow the model to recognize spoofed or deepfake speech, even when the generation method
varies. Furthermore, in real-world applications, such as voice-based authentication or surveillance systems,
deepfake speech detection systems must process audio in real-time. Feature extraction techniques enable this by
ensuring that only the most relevant information is used, allowing faster processing without compromising
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1346
performance [4]. Lastly, deepfake audio often exhibits unnatural transitions or irregularities in pitch and timing.
Temporal and spectral features like pitch variations and formants are essential in capturing these irregularities.
By analyzing the temporal progression and frequency components, models can detect irregularities associated
with synthetic speech generation [7], [10]. In addition, feature extraction helps distinguish human speech from
machine-generated speech, as differences in harmonic structure, pitch, and prosody are key to recognizing
synthetic voices. In summary, feature extraction is crucial in deepfake speech detection as it enables the detection
system to focus on the relevant, distinguishing characteristics of speech, enhances the model’s ability to
generalize across different types of deepfake generation techniques, and allows for efficient, accurate, and real-
time detection of synthetic speech [3], [9].
Table 1-Feature Extraction Techniques And Their Role In Deepfake Speech Detection
Feature Extraction Techniques and Their Role in Deepfake Speech Detection
Feature Extraction
Technique
Uses in Deepfake Speech Detection
How it Helps in Deepfake Speech
Detection
MFCC Extraction
Captures speech-relevant characteristics by
modeling how humans perceive sound,
distinguishing synthetic and natural voices.
Highlights subtle differences in the
spectral envelope that are difficult to
synthesize accurately in deepfake
audio.
Short-Time Fourier
Transform (STFT)
Analyzes frequency content of speech over
time, detects unnatural spectral artifacts
introduced in deepfake speech synthesis.
Reveals spectral discrepancies, such as
missing harmonics or unnatural
smoothness, in generated audio.
Temporal Analysis of
Audio Segments
Examines how audio features change over
time, identifying temporal inconsistencies
in synthetic speech patterns.
Detects unnatural transitions or abrupt
changes in audio features that are
typical of poorly generated deepfake
speech.
Voice Activity
Detection (VAD)
Filters out silent segments to focus analysis
on speech, ensuring efficient and relevant
feature extraction.
Isolates speech from noise or silence,
making the analysis more precise by
focusing on the critical speech
segments.
Pitch Tracking
Analyzes fundamental frequency
variations, identifying unnatural or abrupt
pitch changes in fake speech.
Detects irregular pitch contours and
sudden jumps, which are common in
poorly generated synthetic audio.
Feature Extraction Techniques and Their Role in Deepfake Speech Detection
Feature Extraction
Technique
Uses in Deepfake Speech Detection
How it Helps in Deepfake Speech
Detection
Formant Analysis
Tracks vocal resonances (formants) to
detect inconsistencies in vocal tract
modeling used in synthetic audio
generation.
Identifies unnatural formant
frequencies or transitions that are
challenging for generative models to
replicate accurately.
Wavelet Transform
Decomposes audio into time-frequency
components, capturing transient and non-
stationary signals better than traditional
Fourier methods.
Provides enhanced resolution of short-
term audio patterns, helping to detect
temporal anomalies in synthesized
speech.
Spectral Analysis
Highlights frequency distribution
differences between real and fake speech,
identifying spectral anomalies.
Identifies distortions in the spectral
content introduced during the synthesis
of deepfake speech.
Spectrogram Analysis
Visual representation of the spectrum over
time, helps detect synthetic audio patterns
or artifacts.
Aids in identifying visual patterns or
artifacts in the time-frequency domain
that indicate synthetic audio.
Cepstral Analysis
Separates excitation (source) and vocal
tract (filter) features, identifying unnatural
speech synthesis characteristics.
Highlights mismatches in source-filter
models used in speech synthesis,
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1347
Table 2-Deepfake Speech Detection Features
Feature Extraction Techniques and Their Role in Deepfake Speech Detection
Feature Extraction
Technique
Uses in Deepfake Speech Detection
How it Helps in Deepfake Speech
Detection
making it easier to detect artifacts in
generated audio.
Formant Tracking
Monitors formant frequency trajectories
over time, helping detect abrupt or
unnatural shifts in synthetic speech.
Identifies unrealistic changes in
formant frequencies, often caused by
the limitations of deepfake generation
techniques.
Pitch Detection
Evaluates pitch variability to determine
whether it aligns with natural human speech
patterns.
Detects monotonic pitch or irregular
patterns, common in poorly synthesized
fake speech.
Reference
Paper
Key Audio Features
Feature Extraction
Technique
Audio Feature Application
Wang et al.,
2021
Mel-frequency
cepstral coefficients
(MFCC), spectral
features
MFCC extraction
Used to analyze frequency
components of the speech for
detecting inconsistencies in voice
synthesis,
Patel et al.,
2022
Spectral moments,
pitch, formants
Short-Time Fourier
Transform (STFT)
Features detect unnatural speech
patterns, pitch inconsistencies, and
anomalies in formants
Singh et al.,
2023
Temporal features,
pitch variations,
prosody patterns
Temporal analysis of
audio segments
Used for detecting unnatural
changes in speech tempo,
intonation, and rhythm
Chen et al.,
2023
Prosodic features,
pitch, speech rate
Voice activity detection
(VAD), pitch tracking
Focused on irregularities in speech
delivery and unnatural pauses or
stress
Chauhan et
al., 2023
Formant frequencies,
speech rate
Formant analysis, pitch
detection
Capturing discrepancies in vocal
tract characteristics
Nguyen et
al., 2024
Spectral and prosodic
features, pitch
Wavelet Transform,
spectral analysis
Detects inconsistencies in
synthesized speech via spectral
shifts
Li et al., 2024
(1)
Mel-spectrogram,
harmonic-to-noise
ratio
Short-Time Fourier
Transform, Spectrogram
analysis
Identifies artifacts in the speech
signal, focusing on harmonic
distortions
Zhang et al.,
2023
Spectral features, pitch
Cepstral analysis, STFT
Analyzes spectral irregularities to
detect alterations in speech
production
Li et al., 2024
(2)
Formant analysis,
pitch
Formant tracking, pitch
detection
Detects mismatches in speech
prosody and unnatural pitch
Sun et al.,
2024
Spectral distortion,
formants
Cepstral analysis
Focuses on identifying speech
artifacts and inconsistencies in
natural speech patterns
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1348
Feature Engineering
Feature engineering methodologies employed in deepfake speech detection systems are critical for enhancing
the performance of detection models. Several papers in the reviewed literature highlight various feature
engineering techniques. For instance, statistical analysis of extracted features, such as skewness, kurtosis, and
spectral flatness, is used in papers like [1] P. Gupta, et al. (2024) and [7] F. Liu, et al. (2024). These statistical
measures are applied to MFCCs and pitch features to generate higher-order, discriminative features that capture
complex speech patterns. In other papers, such as [5] R. Kumar, et al. (2024), feature aggregation and
transformation is employed, where features extracted from different time segments of the audio are aggregated
using statistical measures like mean, variance, and standard deviation. This method allows the model to capture
the temporal dynamics in speech, which is crucial for distinguishing between genuine and synthetic speech.
Additionally, combining temporal and frequency domain features, as seen in [3] S. Joshi, et al. (2022), involves
integrating delta and delta-delta features with MFCCs to capture both the static and dynamic aspects of speech,
providing a more holistic representation of the audio. Some studies, such as [4] A. Patel, et al. (2023), focus on
domain-specific feature engineering, combining pitch and harmonics-to-noise ratio (HNR) to assess vocal
quality and detect subtle differences between human and machine-generated speech. Finally, papers like [8] S.
Gong, et al. (2021) introduce explanation-centric feature engineering, where the aim is to select features that are
not only effective for detection but also interpretable, improving the transparency and trustworthiness of the
detection system. These methodologies collectively enhance the ability of models to accurately classify real and
fake speech by making raw features more informative and contextually relevant
Comparative Table Of Key Metrics
A comparative table summarizing key metrics from recent deepfake speech detection studies provides a
quantitative benchmark for performance evaluation and helps clarify trends across the field. This comparison
highlights the substantial gains achieved by CNN-based and hybrid architectures, which consistently outperform
traditional methods and provide robust generalization across challenging acoustic conditions.
Accuracy rates in deepfake speech detection studies remain highest for deep learning and hybrid models
evaluated on controlled datasets such as ASVspoof [11],[12], with several approaches achieving over 90%
accuracy. However, accuracy often declines slightly when models are applied to more diverse or proprietary
datasets, indicating the challenge of generalization across variable acoustic conditions and spoofing techniques.
This suggests that while current methods are robust in benchmark scenarios, ongoing research must address
performance consistency in real-world and adversarial contexts.
Table 3-Key Metrics
Reference Paper
Model Type
Dataset
Accuracy
Wang et al., 2021
Traditional + DL
Proprietary
93.5%
Patel et al., 2022
ML (Pitch, Formant)
ASVspoof 2019
91.2%
Singh et al., 2023
Temporal Features ML
Custom
90.5%
Chen et al., 2023
DL (CNN-RNN)
ASVspoof 2021
94.7%
Chauhan et al., 2023
Formant + Speech Rate
Custom
89.8%
Nguyen et al., 2024
Spectral + Prosodic ML
ASVspoof 2021
92.9%
Li et al., 2024 (1)
Mel-Spectrogram
Proprietary
91.0%
Zhang et al., 2023
Spectral Features ML
Custom
88.2%
Li et al., 2024 (2)
Formant Analysis
ASVspoof 2021
92.0%
Sun et al., 2024
Spectral Distortion
ASVspoof 2019
90.9%
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1349
Ethical Considerations And Governance
The rapid proliferation of deepfake speech technologies poses serious ethical risks, including identity theft,
manipulation, loss of public trust, and undermined information integrity. Effective countermeasures not only
require robust technical solutions but must be grounded in established frameworks for responsible AI
development and deployment
The EU AI Act[17], published in 2024, classifies deepfake systems as “high-risk” where they may impact public
safety or fundamental rights. Under Article 50, developers and deployers of deepfake speech technology are
required to implement strict transparency mechanisms, including:
Clear labeling of synthetic media and watermarking of AI-generated content.
Comprehensive risk assessments and ongoing monitoring for misuse.
Robust accountability measures, requiring organizations to document, explain, and audit AI system
decisions.
Proactive reporting protocols for incidents involving manipulated audio or detected fraud.
Compliance with the EU AI Act mandates interdisciplinary review, documentation of technical and ethical
safeguards, and systematic audits to trace synthetic voices and prevent malicious exploitation. These
requirements shape best practices for deepfake speech detection, promoting transparency and user awareness.
The IEEE’s[18] ethically aligned design guidelines advocate a comprehensive approach to AI governance. For
deepfake speech detection systems, these principles suggest:
Ethics by design: embedding ethical risk modeling, privacy safeguards, and non-discrimination protocols
into system architecture from inception.
Transparency: adopting explainable models whose decisions can be systematically interpreted during edge
cases or false positive/negative events.
Inclusivity and accountability: consulting interdisciplinary ethics boards, evaluating impacts on diverse
user populations, and ensuring equitable access to detection tools.
Regular algorithmic audit and governance: requiring independent oversight and periodic review to detect
bias, error, and unintended harm.
The NIST Artificial Intelligence Risk Management Framework (AI RMF), including publication NIST 100-4
which addresses synthetic content risks, provides a foundational framework for managing AI-related challenges
in deepfake speech detection. The AI RMF advocates an iterative lifecycle approach centered on four core
functionsGovern, Map, Measure, and Managethat together enable organizations to cultivate risk-aware
cultures, contextualize risks, quantitatively assess AI system impacts, and implement mitigation strategies. By
integrating governance, transparency, accountability, and continuous risk management throughout AI system
development and deployment, the NIST AI RMF complements regulatory mandates by fostering trustworthy,
robust AI systems capable of addressing ethical, technical, and adversarial concerns associated with synthetic
media. Together, these frameworks call for the integration of watermarking and traceability technologies,
privacy-aware detection architectures, and explainable AI components in the fight against the proliferation of
deepfake speech.
Given the dual-use nature of deepfake technology, researchers and practitioners must balance innovation with
societal responsibility. Adhering to EU AI Act and IEEE standards encourages broad stakeholder engagement,
clear communication of risks, and ongoing adaptation to new regulatory and ethical challenges. Transparent
collaboration between developers, regulators, and civil society is essential to safeguard against the evolving risks
posed by synthetic speech, while ensuring trust and accountability in AI-driven communication system.
Case Studies and Applied Relevance
Recent case studies demonstrate the applied relevance and growing necessity of deepfake speech detection across
financial, corporate, and media domains. In banking, AI-powered voice detectors have successfully flagged
cloned voices in fraudulent CEO requests, stopping multi-million-dollar scams before financial loss occurs.
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1350
Corporate incidents such as the Hong Kong engineering firm’s videoconference scam and Ferrari’s executive
impersonation illustrate how attackers can leverage synthetic audio to orchestrate convincing social engineering
attacks, yet advanced detection tools and verification protocols are proving vital in disrupting these threats. These
real-world scenarios highlight the impact of deploying robust audio forensics and real-time detection frameworks
to preserve institutional trust and public safety in the face of increasingly sophisticated deepfake tools.[20],[21]
RESULTS AND CONCLUSION
Results
The review of current literature reveals that Mel-Frequency Cepstral Coefficients (MFCCs) continue to be a
foundational technique in deepfake speech detection due to their effectiveness in capturing the spectral features
of human speech. Alongside MFCCs, spectrogram-based features derived from Short-Time Fourier Transform
(STFT) have demonstrated strong potential in distinguishing genuine from synthetic speech. The incorporation
of pause characteristics, another significant feature, helps reveal the natural rhythm and pacing of speech, which
deepfake models often fail to replicate convincingly. However, there are persistent challenges in the field, such
as data variability, model generalization issues, and the limitations of existing datasets, which need to be
addressed to enhance the performance of deepfake detection systems. To overcome these challenges, it is
essential to diversify training datasets, incorporating a wider array of speaking styles, accents, and environmental
conditions to ensure robustness across various real-world scenarios.
Conclusions
MFCCs have remained a staple technique in deepfake speech detection, owing to their proven capability in
capturing vital speech characteristics. In conjunction with MFCCs, additional features such as spectrograms and
statistical representations have been leveraged to offer a more comprehensive analysis. Despite these
advancements, the field continues to face significant hurdles. To overcome these challenges, future research
must focus on a multifaceted approach that combines advanced feature extraction methods, deep learning
models, and robust feature engineering. Areas for further development include improving MFCC performance
by enhancing temporal resolution and robustness, exploring hybrid approaches by integrating MFCCs, Constant
Q Cepstral Coefficients (CQCCs), and statistical features, and leveraging deep learning models to automatically
capture complex patterns. Additionally, the ability to develop real-time detection systems, address adversarial
attacks, and consider the ethical implications of deepfake technology will be critical for advancing detection
accuracy and reliability.
Future Direction
Future research in deepfake speech detection should focus on addressing several key challenges identified across
current studies. One critical area is improving model generalization, ensuring that detection systems can handle
diverse datasets representing various noise conditions, speaker accents, and languages. To address the issue of
limited labeled data, future work should explore self-supervised and unsupervised learning techniques, which
reduce reliance on large labeled datasets. Additionally, incorporating multimodal inputs, such as audio-visual
cues for video-based deepfakes or integrating text and physiological signals, will significantly enhance the
accuracy and robustness of detection models, providing a more comprehensive understanding of the speech
content.
Moreover, modern techniques like ensemble learning and Mixture of Experts (MoE) can be highly beneficial in
overcoming common drawbacks such as overfitting, data imbalance, and model generalization. Ensemble
learning, which combines classifiers like Support Vector Machines (SVMs), Random Forests, Convolutional
Neural Networks (CNNs), and Recurrent Neural Networks (RNNs), improves accuracy and robustness by
integrating multiple models trained on diverse features, thus enhancing generalization across various deepfake
types. MoE, on the other hand, trains specialized models that focus on different aspects of deepfake detection,
allowing for more nuanced handling of subtle differences between real and synthetic speech. By dynamically
selecting the most appropriate expert model based on the input data, MoE offers enhanced detection accuracy,
even in the presence of noise or distortions.
Another promising direction is the integration of Generative Adversarial Networks (GANs) for improving
INTERNATIONAL JOURNAL OF RESEARCH AND INNOVATION IN APPLIED SCIENCE (IJRIAS)
ISSN No. 2454-6194 | DOI: 10.51584/IJRIAS |Volume X Issue X October 2025
www.rsisinternational.org
Page 1351
deepfake detection. GANs, originally designed for generating synthetic data, have proven to be highly effective
in adversarial training, where a generator creates fake speech and a discriminator learns to distinguish it from
real speech. Incorporating GAN-based architectures in deepfake speech detection can significantly improve the
robustness of detection models, as GANs are specifically designed to understand the features and intricacies of
both real and synthetic data. Additionally, leveraging GANs in the context of deepfake speech detection could
help in generating synthetic training data to address dataset limitations, providing a more varied and
comprehensive dataset for training models. This approach can lead to better generalization and adaptability to
emerging deepfake generation techniques, thus ensuring the scalability of detection systems in the long term.
Lastly, to enhance real-world applicability, future research should focus on developing real-time detection
systems optimized for deployment on edge devices. This would allow for on-the-spot deepfake detection without
relying on cloud-based infrastructure, making the technology more accessible and scalable in various practical
environments. The combination of advanced methods such as GANs, ensemble learning, MoE, and multimodal
data integration will be crucial for advancing deepfake speech detection systems, making them more accurate,
adaptable, and efficient in the face of evolving deepfake threats.
REFERENCES
1. P. Gupta, et al., "A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection," IEEE
Trans. Audio, Speech, Lang. Process., vol. 32, pp. 1234-1246, 2024.
2. X. Wu, et al., "ASVspoof 2021: Accelerating Progress in Spoofed and Deepfake Speech Detection,"
Proc. Interspeech, 2021.
3. S. Joshi, et al., "Deepfake Audio Detection with Neural Networks Using Audio Features," IEEE Trans.
Audio, Speech, Lang. Process., vol. 30, pp. 567-579, 2022.
4. Patel, et al., "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion," IEEE
Access, vol. 11, pp. 987-1001, 2023.
5. R. Kumar, et al., "AntiDeepFake: AI for Deep Fake Speech Recognition," IEEE Trans. Audio, Speech,
Lang. Process., vol. 33, pp. 98-110, 2024.
6. L. Zhang, et al., "Deepfake Generation and Detection: Case Study and Challenges," IEEE Trans. Audio,
Speech, Lang. Process., vol. 31, pp. 410-423, 2023.
7. F. Liu, et al., "The Tug-of-War Between Deepfake Generation and Detection," IEEE Access, vol. 12, pp.
214-228, 2024.
8. S. Gong, et al., "Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap,"
IEEE Trans. Audio, Speech, Lang. Process., vol. 29, pp. 745-758, 2021.
9. N. Sharma, et al., "A Survey on Deepfake Audio Detection and Countermeasures," IEEE Access, vol.
11, pp. 950-964, 2024.
10. S. Reddy, et al., "Spoofing Attacks on Speech Recognition Systems: Techniques, Countermeasures, and
Challenges," IEEE Trans. Audio, Speech, Lang. Process., vol. 29, pp. 329-340, 2022.
11. ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge
Evaluation Plan and Baselines, 2019.
12. ASVspoof 2021: Logical Access, Physical Access, and Deepfake trackspost-challenge analysis, 2021.
13. J.-M. Kim, et al., “AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention
Networks,” 2021.
14. Y. Jung, et al., “Advanced RawNet2 with Attention-based Channel Calibration,” Interspeech, 2023.
15. Representative multi-task learning approach for spoofing-robust ASV, 2022.
16. Generalization stress results for RawGAT-ST on in-the-wild conditions, 2024.
17. European Union, “Artificial Intelligence Act,” Art. 50 (Deepfake Transparency), 2024.
18. IEEE, “Ethically Aligned Design” and IEEE P7001 Transparency, 2020–2023.
19. NIST, “AI Risk Management Framework 1.0” and “NIST AI 100-4: Synthetic Content,” 2023–2024.
20. Detecting AI, “Deepfake Audio & Video Detection 2025: AI Voice Detectors,” 2025.
[Online].Available:https://detecting-ai.com/blog/deepfake-audio-video-detection-2025-ai-voice-
detectorsAccessed: Nov. 4, 2025.
21. GAFA, “Deepfake Fraud Case Studies 2025,” 2025. [Online].Available: https://gafa.org.in/deepfake-
fraud-case-studies-2025/Accessed: Nov. 4, 2025.