RAG-Driven Cybersecurity Intelligence: Leveraging Semantic Search for Improved Threat Detection

Sahaj Tushar Gandhi

doi:10.15662/IJRAI.2023.0603003

Authors

Sahaj Tushar Gandhi, Independent Researcher, San Francisco, CA, USA

DOI:

https://doi.org/10.15662/IJRAI.2023.0603003

Keywords:

Retrieval-Augmented Generation, semantic search, cyber threat intelligence, knowledge graph, vector retrieval, threat detection

Abstract

Retrieval-Augmented Generation (RAG) combines dense retrieval with generative models to ground generated outputs in external documents, reducing hallucinations and supporting up-to-date, domain-specific reasoning. We introduce an architecture that combines semantic search (dense vector retrieval and knowledge-graph indexing) with RAG workflows to improve cyber threat intelligence (CTI) ingestion, correlation, and detection. The system ingests heterogeneous CTI sources (OSINT reports, vendor feeds, malware descriptions), applies semantic chunking and entity linking, and indexes the resulting embeddings in a vector store alongside a cybersecurity knowledge graph for relational reasoning. A policy-aware RAG generator then produces ranked threat hypotheses and suggested actions. Our prototype uses a dense bi-encoder retriever with a FAISS index, an off-the-shelf seq2seq generator fine-tuned on CTI summarization tasks, and a knowledge graph backed by Neo4j. The evaluation is based on a set of 2,400 CTI incident reports and synthetic network alert sequences with known ground truth; metrics include detection precision, recall, F1 measure, time-to-context (TTC), and reduction in analyst workload. Results show a 25.7% increase in detection F1 over a keyword/TTP-matching baseline and an average 31% decrease in analyst triage time for the RAG-driven pipeline, while knowledge-graph augmentation improved true-positive correlation of multi-stage attacks by 22%. The approach also lowered the hallucination rate of generated advisories by 45% (as measured against ground-truth grounding). Limitations include reliance on the quality of the indexed corpus and possible privacy leakage during retrieval. Future work will focus on secure retrieval techniques and automated counter-adversarial training.
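The retrieve-then-ground step described in the abstract can be illustrated with a minimal sketch. It is not the prototype itself: a bag-of-words vector stands in for the dense bi-encoder, a linear cosine-similarity scan stands in for the FAISS index, and the names `embed`, `retrieve`, `build_prompt`, and the sample `CHUNKS` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a dense bi-encoder: a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, chunks, k=2):
    """Linear scan over all chunks (a vector index would replace this at scale)."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the top-k chunks that share at least one term with the query.
    return [c for score, c in scored[:k] if score > 0]

def build_prompt(query, retrieved):
    """Place retrieved chunks ahead of the alert so the generator can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return f"Context:\n{context}\n\nAlert: {query}\nAnswer using only the context above."

# Hypothetical CTI chunks, invented for this example.
CHUNKS = [
    "phishing email delivers macro document that downloads loader",
    "ransomware encrypts file shares after lateral movement over smb",
    "credential dumping from lsass observed before lateral movement",
]

top = retrieve("lateral movement over smb detected", CHUNKS, k=2)
prompt = build_prompt("lateral movement over smb detected", top)
```

In this sketch the generator is omitted; `prompt` is the grounded input that a seq2seq model would receive, and the unrelated phishing chunk is excluded because it shares no terms with the alert.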
