Towards MeluBot: A Multimodal AI Agent Integrating Text, Voice, Image, and Automation for Education and Health

Authors

Gabriel Henrique Alencar Medeiros

SeaFortress / INSA Rouen Normandie (France)

Article Information

DOI: 10.51584/IJRIAS.2025.10100000123

Subject Category: Artificial Intelligence

Volume/Issue: 10/10 | Page No: 1392-1400

Publication Timeline

Submitted: 2025-10-20

Accepted: 2025-10-26

Published: 2025-11-13

Abstract

This paper presents MeluBot, a multimodal AI agent that integrates text, voice, and image modalities, combined with workflow automation, for interactive applications in education and healthcare. We describe the architectural design, enabling technologies, and use-case scenarios, and we discuss the system's potential, limitations, and future directions. We also position MeluBot with respect to related work in multimodal agents and in intelligent tutoring and medical assistants.
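
To make the integration described in the abstract concrete, the fragment below is a minimal illustrative sketch, not the authors' implementation: it assumes a simple handler-dispatch design in which text, voice, and image inputs are routed to modality-specific handlers, with an optional workflow-automation hook applied to the result. Every name in it (AgentInput, run_agent, handle_voice, and so on) is hypothetical.

    # Hypothetical sketch of a multimodal agent loop in the spirit of the
    # MeluBot description; names and structure are illustrative only.
    from __future__ import annotations
    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class AgentInput:
        modality: str         # "text", "voice", or "image"
        payload: bytes | str  # raw utterance, audio bytes, or image bytes

    def handle_text(payload: str) -> str:
        # Placeholder for a language-model call (e.g., a tutoring or triage prompt).
        return f"LLM response to: {payload}"

    def handle_voice(payload: bytes) -> str:
        # Placeholder for speech-to-text followed by the text pipeline.
        transcript = "<transcribed audio>"
        return handle_text(transcript)

    def handle_image(payload: bytes) -> str:
        # Placeholder for a vision-language model (e.g., reading a worksheet
        # photo or a medical image, per the education/health use cases).
        return "VLM description of the image"

    HANDLERS: Dict[str, Callable] = {
        "text": handle_text,
        "voice": handle_voice,
        "image": handle_image,
    }

    def run_agent(inp: AgentInput,
                  automation: Optional[Callable[[str], None]] = None) -> str:
        """Dispatch one input to its modality handler, then fire automation."""
        handler = HANDLERS.get(inp.modality)
        if handler is None:
            raise ValueError(f"Unsupported modality: {inp.modality}")
        response = handler(inp.payload)
        if automation is not None:
            automation(response)  # e.g., write to an LMS or a patient record
        return response

    if __name__ == "__main__":
        print(run_agent(AgentInput("text", "Explain photosynthesis simply.")))

In a real deployment, the placeholder handlers would wrap a speech-to-text service, a language model, and a vision-language model, and the automation hook would connect to the downstream education or healthcare workflow.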

Keywords

AI Agent
