Applying Machine Learning for Automated Data Quality and Anomaly Detection in Enterprise Data Pipelines

Nagender Yamsani

doi:10.15662/IJRAI.2022.0501006

Authors

Nagender Yamsani, Software Development Advisor, USA

Keywords:

Enterprise AI, Evidence Mapping, Advanced Analytics, Business Intelligence, Pharmaceutical AI, Manufacturing Intelligence, Responsible AI, Data Platforms

Abstract

Data quality failures, including missing values, inconsistent representations, duplicate entities, and anomalous records, remain a dominant barrier to trustworthy analytics and effective machine learning (ML) deployment, particularly as organizations scale across diverse, fast-moving data sources. Traditional rule-based validation and constraint checking, while effective in narrow domains, struggle to generalize in environments characterized by high volume, velocity, and schema heterogeneity, and often require extensive manual maintenance and domain expertise. Recent advances in ML-based data management shift this paradigm by learning statistical, relational, and semantic patterns directly from data, enabling automated detection, diagnosis, and, in some cases, repair of quality defects. This article surveys these approaches through a structured lens, connecting foundational ideas in probabilistic modeling and anomaly detection with modern deep learning techniques and practical data-cleaning systems. By examining representative systems such as HoloClean and ActiveClean, we analyze architectural tradeoffs among accuracy, computational cost, and human-in-the-loop effort, as well as the balance between aggressive cleaning and the risk of error propagation. Empirical results reported for these systems show that ML-informed data quality pipelines can significantly improve anomaly detection accuracy, reduce manual labeling and correction effort, and produce measurable gains in downstream predictive performance. These findings underscore data quality as a first-class concern in end-to-end ML system design rather than a preprocessing afterthought.
