Advanced Architectural Frameworks for Scalable, Production-Grade Agentic RAG Pipelines
DOI:
https://doi.org/10.15662/IJRAI.2026.0901001
Keywords:
Retrieval-Augmented Generation, Agentic AI, Production Architecture, Context Engineering, Vector Databases, Knowledge Graphs, Infrastructure as Code
Abstract
The evolution of artificial intelligence from monolithic generative models to modular, retrieval-augmented architectures represents a fundamental shift in enterprise software engineering. This paper examines production-grade Retrieval-Augmented Generation (RAG) systems and introduces a six-layer architectural framework that addresses the limitations of standalone large language models through distributed computing, autonomous reasoning, and rigorous evaluation protocols. Our analysis demonstrates that modern RAG architectures require systematic context engineering rather than simple retrieval algorithms, with empirical evidence showing 2-3× improvements in GPU utilization from advanced inference engines and up to 90% recall from layered retrieval strategies. The framework gives enterprise organizations a blueprint for building reliable, scalable AI systems that can process millions of documents while maintaining low latency and high grounding fidelity.