Applying Machine Learning for Automated Data Quality and Anomaly Detection in Enterprise Data Pipelines
DOI: https://doi.org/10.15662/IJRAI.2022.0501006
Keywords: Enterprise AI, Evidence Mapping, Advanced Analytics, Business Intelligence, Pharmaceutical AI, Manufacturing Intelligence, Responsible AI, Data Platforms
Abstract
Data quality failures, including missing values, inconsistent representations, duplicate entities, and anomalous records, remain a dominant barrier to trustworthy analytics and effective machine learning (ML) deployment, particularly as organizations scale across diverse, fast-moving data sources. Traditional rule-based validation and constraint checking are effective in narrow domains but generalize poorly to environments with high volume, high velocity, and heterogeneous schemas, and they often require extensive manual maintenance and domain expertise. Recent ML-based approaches to data management shift this paradigm by learning statistical, relational, and semantic patterns directly from data, which enables automated detection, diagnosis, and in some cases repair of quality defects. This article surveys these approaches through a structured lens, connecting foundational ideas in probabilistic modeling and anomaly detection with modern deep learning techniques and practical data-cleaning systems. Using representative systems such as HoloClean and ActiveClean, we analyze the architectural tradeoffs among accuracy, computational cost, and human-in-the-loop effort, as well as the balance between aggressive cleaning and the risk of propagating errors. Empirical results reported for these systems indicate that ML-informed data quality pipelines can improve anomaly detection accuracy, reduce manual labeling and correction effort, and yield measurable gains in downstream predictive performance. These findings underscore that data quality is a first-class concern in end-to-end ML system design rather than a preprocessing afterthought.