Privacy Attacks and Defenses on Foundation Models
DOI: https://doi.org/10.15662/IJRAI.2025.0806001

Keywords: Foundation Models, Membership Inference, Model Inversion, Attribute Inference, Differential Privacy, Adversarial Regularization, Privacy-Utility Trade-off, Black-Box Attacks, Gradient Leakage, Explainability Leakage

Abstract
Foundation models, large pre-trained models such as BERT and GPT, have revolutionized AI across domains. However, their scale and their training on massive, often uncontrolled datasets raise serious privacy concerns. Attacks such as membership inference, model inversion, and prompt-based exfiltration can expose sensitive training data or user information. This paper explores these privacy vulnerabilities and reviews effective defense strategies. We first characterize prominent privacy attacks on foundation models: (1) membership inference, which determines whether a specific data point was used in training; (2) model inversion (reconstruction) attacks, which attempt to recover input data from model outputs or gradients; and (3) attribute inference attacks, which infer sensitive attributes from learned representations. Both white-box and black-box threat scenarios are examined. Next, we survey defense mechanisms, including adversarial regularization to resist membership inference; differential privacy, especially DP-SGD, for provable privacy guarantees during training; and gradient obfuscation techniques to thwart inversion attacks. The trade-offs between privacy, interpretability, and model utility are discussed. Our methodology proposes an empirical evaluation of membership inference risk on masked language models using shadow-model techniques under black-box conditions, together with an assessment of differential privacy applied during fine-tuning. Metrics include inference attack accuracy, privacy-utility trade-offs (e.g., accuracy loss), and explanation leakage. Results from prior work indicate that adversarial regularization can significantly reduce membership inference accuracy with limited utility loss (Nasr et al., 2018), whereas differential privacy provides formal guarantees but often degrades performance. Explanation mechanisms such as gradient- or backpropagation-based saliency maps can also inadvertently leak membership information. We conclude that while protection mechanisms exist, they require careful tuning to balance privacy and performance. Future directions include improved adaptive defenses, privacy auditing tools for deployed foundation models, and standardized privacy benchmarks for LLMs.
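To make the black-box, shadow-model evaluation described above concrete, the sketch below follows the standard recipe under simplifying assumptions: the shadow_confidences helper is hypothetical and merely simulates the typical gap between member and non-member confidence (in a real experiment it would query shadow models trained on splits whose membership is known, and then the target model), and a logistic-regression attack classifier stands in for whatever attack model is used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def shadow_confidences(n_members, n_nonmembers):
    """Hypothetical stand-in for querying one shadow (or the target) model.

    Real experiments would collect per-example confidences or losses from a
    model whose training split is controlled; here we only simulate the usual
    effect that training members receive higher confidence than non-members.
    """
    member_conf = rng.beta(8, 2, size=n_members)        # members skew high
    nonmember_conf = rng.beta(5, 3, size=n_nonmembers)  # non-members skew lower
    return member_conf, nonmember_conf

# 1) Build the attack training set from several shadow models.
features, labels = [], []
for _ in range(4):                                      # e.g., four shadow models
    m, nm = shadow_confidences(500, 500)
    features.append(np.concatenate([m, nm]))
    labels.append(np.concatenate([np.ones_like(m), np.zeros_like(nm)]))
X_attack = np.concatenate(features).reshape(-1, 1)
y_attack = np.concatenate(labels)

# 2) Fit the attack classifier (member vs. non-member from confidence alone).
attack_clf = LogisticRegression().fit(X_attack, y_attack)

# 3) Score the target model's outputs (simulated the same way here) and report AUC.
tgt_m, tgt_nm = shadow_confidences(1000, 1000)
X_target = np.concatenate([tgt_m, tgt_nm]).reshape(-1, 1)
y_target = np.concatenate([np.ones_like(tgt_m), np.zeros_like(tgt_nm)])
attack_scores = attack_clf.predict_proba(X_target)[:, 1]
print(f"Membership inference AUC against the target: {roc_auc_score(y_target, attack_scores):.3f}")
```

An AUC near 0.5 means the attack does no better than chance, while values well above 0.5 quantify membership leakage; this is the inference-attack-accuracy metric the abstract refers to. Similarly, the core of DP-SGD (Abadi et al., 2016), per-example gradient clipping followed by calibrated Gaussian noise, can be sketched in a few lines of PyTorch. This is an illustrative, unoptimized loop rather than an audited implementation; the function name dp_sgd_step and its default hyperparameters are our own.

```python
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients, each clipped to L2 norm <= clip_norm before summation.
    for x, y in zip(xb, yb):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)

    # Gaussian noise calibrated to the clipping bound, then an averaged SGD step.
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * (noise_multiplier * clip_norm)
            p -= lr * (s + noise) / len(xb)
```

In practice a privacy accountant (such as the moments accountant from the same paper) tracks the cumulative (epsilon, delta) budget across steps; libraries such as Opacus bundle the clipping, noising, and accounting that this sketch spells out by hand.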
References
1. Nasr, M., Shokri, R., & Houmansadr, A. (2018). Machine Learning with Membership Privacy using Adversarial Regularization. arXiv preprint.
2. Shokri, R., Strobel, M., & Zick, Y. (2019). On the Privacy Risks of Model Explanations. arXiv preprint.
3. Zhu, L., et al. (2019). Gradient Leakage Attacks on BERT Models. arXiv preprint.
4. Abadi, M., et al. (2016). Deep Learning with Differential Privacy (DP-SGD). In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS).
5. Xin, et al. (2022). Black-box membership inference in pre-trained language models. arXiv preprint.
6. Mireshghallah, F., et al. (2022). Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. arXiv preprint.
7. Fredrikson, M., et al. (2014). Attribute Inference Attacks in Machine Learning. Conference proceedings.
8. Tramèr, F., et al. (2016). Stealing Machine Learning Models via Prediction APIs (model extraction and parameter stealing attacks). In Proceedings of the 25th USENIX Security Symposium.