strong encryption, anonymization, and compliance (e.g., HIPAA, GDPR).
5. Interpretability & trust: explainable multimodal decisions remain an open problem.
6. Data scarcity & bias: multimodal training data are harder to gather than unimodal data, and biases in one modality can propagate to the others.
FUTURE DIRECTIONS
1. Continual & online learning: adapting the model over time from real interactions.
2. Multi-agent collaboration: multiple MeluBots or sub-agents collaborating in classrooms or hospitals [8].
3. Few-shot adaptation & domain transfer: applying the model in new settings with minimal data [10].
4. Embodied / AR interfaces: integrating mixed reality or gesture inputs along with voice/text/image
fusion.
5. Rigorous evaluation benchmarks: combining user studies, robustness tests, adversarial scenarios, and modality ablation analyses.
CONCLUSION
We have presented MeluBot, a multimodal AI agent architecture that integrates text, voice, and image inputs with workflow automation, targeting applications in education and healthcare. We believe this approach points towards a new generation of interactive, context-aware systems capable of deep, seamless human-AI collaboration. Many open challenges remain, especially in latency, robustness, and trust, but the potential impact in socially relevant domains is significant.
ACKNOWLEDGMENTS
This work is part of the SeaFortress initiative.
REFERENCES
1. J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15116
2. Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi, K. Ikeuchi, H. Vo, L. Fei-Fei, and J. Gao, “Agent AI: Surveying the horizons of multimodal interaction,” 2024. [Online]. Available: https://arxiv.org/abs/2401.03568
3. L. R. Soenksen, Y. Ma, C. Zeng, L. Boussioux, K. Villalobos Carballo, L. Na, H. M. Wiberg, M. L. Li, I. Fuentes, and D. Bertsimas, “Integrated multimodal artificial intelligence framework for healthcare applications,” NPJ Digit. Med., vol. 5, no. 1, p. 149, Sep. 2022.
4. F. Krones, U. Marikkar, G. Parsons, A. Szmul, and A. Mahdi, “Review of multimodal machine learning approaches in healthcare,” 2024. [Online]. Available: https://arxiv.org/abs/2402.02460
5. K. Saab and J. Freyberg, “AMIE gains vision: A research AI agent for multimodal diagnostic dialogue,” blog post, Google Research, May 2025, accessed: YYYY-MM-DD. [Online]. Available: https://research.google/blog/amie-gains-vision-a-research-ai-agent-for-multi-modal-diagnostic-dialogue/
6. Z. Gao, B. Zhang, P. Li, X. Ma, T. Yuan, Y. Fan, Y. Wu, Y. Jia, S.-C. Zhu, and Q. Li, “Multi-modal agent tuning: Building a VLM-driven agent for efficient tool usage,” 2025. [Online]. Available: https://arxiv.org/abs/2412.15606
7. L. Chen, Y. Zhang, S. Ren, H. Zhao, Z. Cai, Y. Wang, P. Wang, T. Liu, and B. Chang, “Towards end-to-end embodied decision making via multi-modal large language model: Explorations with GPT4-Vision and beyond,” 2023. [Online]. Available: https://arxiv.org/abs/2310.02071
8. H. Yao, R. Zhang, J. Huang, J. Zhang, Y. Wang, B. Fang, R. Zhu, Y. Jing, S. Liu, G. Li, and D. Tao, “A survey on agentic multimodal large language models,” Oct. 2025, version v1; accessed: YYYY-MM-DD. [Online]. Available: https://arxiv.org/abs/2510.10991