**CPSC 538L: Differential Privacy**
**Theory and Practice**

Machine learning ecosystems run on vast amounts of personal information, which is digested into models used for understanding and prediction. However, these ML models have been shown to leak information about users. Differential Privacy enables privacy-preserving statistical analyses on sensitive datasets with provable privacy guarantees. As such, it is seeing increasing interest from both academia and industry. In this course we will explore Differential Privacy theory, and its application to machine learning, from individual models to end-to-end applications.

**Instructor**: [Mathias Lécuyer](https://mathias.lecuyer.me)

**Schedule**: MW 10:30-12:00 -- Term 2 (January - April 2022)

**Location**: ICCS 246 (or Zoom for those who want / need to)

**Office Hours**: Thu 2-3pm (tentatively), over Zoom (same link as the class)

**Logistics**: Logistics and discussions will happen on Piazza [^piazza]. You can [login through Canvas here](https://canvas.ubc.ca/courses/89626), which also shows the access code. **If you want to audit or attend the first sessions, email me for the Zoom link**.

Objectives
==========

The learning objectives for this class are to:

- Understand the challenges and importance of privacy in ML.
- Learn the basics of Differential Privacy (DP) theory.
- Get a deeper understanding of some advanced topics in DP through a focus on three broad areas: privacy attacks, DP deep learning, and DP workloads.
- Acquire the necessary tools, and a first experience, to conduct DP research.

This is a seminar course: although there will be some lectures, the majority of lecture time is devoted to student-led paper presentations and discussions.

Prerequisites
=============

There is no formal prerequisite for this class. The main topics we will build on are:

- Probability and statistics. Students who are not familiar with these topics are expected to acquire the necessary background knowledge on their own.
  This will require some work but should be doable. These [useful probability facts](http://www.cs.toronto.edu/~anikolov/CSC2412F20/notes/prob.pdf) are a good starting point.
- Basic ML algorithms (mostly deep learning).
- Python and a deep learning framework for the assignments (using [Jax](https://github.com/google/jax) is highly encouraged, though knowledge of it is not assumed).

Evaluation
==========

Your course grade will be based on the following breakdown:

- Paper reviews (drop any 3): 10%
- Paper presentation: 20%
- Paper discussion participation: 20%
- Assignments: 10%
- Project: 40%

Paper reviews
-------------

Before each paper-discussion class, all students are expected to read the paper and write a short critical review (max 1 page, shorter is no issue) that includes a list of discussion points (see the paper presentation questions for inspiration; there is no need to cover everything each time). Students must write the reviews in their own words, with any text or resources copied from another source appropriately cited; otherwise, it will be construed as plagiarism. Students **must submit their paper reviews in PDF form via Piazza**.

Paper presentation
------------------

Each student must give a 45-60 min presentation of one paper (on some days we may need two presenters, making the presentations a bit shorter). The class discussion will begin after the presentation, and the presenter(s) will be expected to lead the discussion. The following questions can help structure the presentation and discussion:

1. What problem is the paper solving?
2. Why is that problem important?
3. What was the previous state of the art?
4. How does this paper advance the state of the art?
5. How does the system/method/algorithm work?
6. What are the key results?
7. How is it evaluated?
8. What are the challenges in applying the proposed solution?
9. What related problems are still open?
Students are welcome to use slides, and will present over Zoom (in addition to in person, when available). Each student must share their slides with me at least 30 min before the start of class, in case of technical issues. The presentation will be graded on content, clarity, delivery, and participation in the follow-up discussion.

Paper discussion
----------------

Participation will be graded on the quality of participation (not quantity) and consistency (participating in most classes): this is to encourage interesting discussions at each meeting. Students can participate in discussions in class (in person, or via Zoom over voice or chat) as well as on Piazza. If you are unable to attend class on a given day, you can still get participation points by sharing discussion questions on Piazza before the class.

Assignments
-----------

There will be two short assignments. They are open ended, but should be fairly easy. You can start the assignments whenever you want: **the lecture after which you have all the necessary material is marked on the syllabus**. Each student will return a short explanation ($\leq 1$ page, including the plot -- shorter is good), and present their code and results to me in a short meeting scheduled for this purpose.

Assignments:

1. Implement one of the basic DP mechanisms on a query class of your choice. Show one plot that visually and empirically supports the DP guarantee.
2. Implement DP-SGD in a deep learning framework of your choice ([Jax](https://github.com/google/jax) recommended). Accounting can be ported from existing frameworks ([Opacus](https://opacus.ai/) recommended). Show the behavior of interesting quantities in one plot.

Project
-------

The course project can be done in teams of up to 4, and groups are encouraged. The project deliverables will include a project proposal, a final research-paper-like report (one per group, with one paragraph per member explaining their contribution), and an oral presentation.
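As a starting point for assignment 1, here is a minimal sketch of one basic DP mechanism (the Laplace mechanism) on a counting query. This is only an illustration under my own assumptions: the function name, toy dataset, and parameters are hypothetical, and your own assignment should pick its own query class.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    # Adding Laplace noise with scale sensitivity/epsilon gives epsilon-DP
    # for a query with the given L1 sensitivity.
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy example: a counting query, which has sensitivity 1.
rng = np.random.default_rng(0)
data = np.array([1, 0, 1, 1, 0, 1])  # hypothetical binary attribute
true_count = data.sum()
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=rng)
```

One way to produce the supporting plot is to run the mechanism many times on two neighboring datasets (differing in one record) and check that the two output histograms stay within a factor of $e^\epsilon$ of each other.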
Key dates to keep in mind:

- Feb 7: pitch your project ideas to find teammates and get feedback.
- Feb 18: project proposal ($\leq 2$ pages, non-binding).
- TBD: intermediary progress report (in office hours).
- Apr 4-6: presentations.
- Apr 15: final report due ($4-8$ pages, presented like a paper in format and structure. Formalize the setting and privacy results, as well as the empirical results. Shortcomings in the setting/theoretical results are fine, but should be acknowledged and briefly discussed. Empirical results do not have to be good, but should be analyzed: what did you learn, good or bad?).

Feel free to chat with me (early) for:

- project ideas
- feedback on your ideas

Syllabus
========

This is a tentative syllabus and schedule.

10 Jan 2022: Introduction & Notions of Privacy (lecture)

References and related work:

- [Robust De-anonymization of Large Sparse Datasets](https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf)
- [Unique in the Crowd: The privacy bounds of human mobility](https://www.nature.com/articles/srep01376)
- [Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays](https://pubmed.ncbi.nlm.nih.gov/18769715/)

12 Jan 2022: Reconstruction Attacks (lecture)

References and related work:

- Chapter 8 of the [Privacy Book](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf)
- [Revealing Information while Preserving Privacy](https://crypto.stanford.edu/seclab/sem-03-04/psd.pdf)
- [Linear Program Reconstruction in Practice](https://journalprivacyconfidentiality.org/index.php/jpc/article/view/711/693)
- [Understanding Database Reconstruction Attacks on Public Data](https://ecommons.cornell.edu/handle/1813/89104)

17 Jan 2022: Membership attacks in ML (paper)

Shadab Shaikh will present [Membership Inference Attacks Against Machine Learning Models (Shokri, Stronati, Song, Shmatikov 2017)](https://www.cs.cornell.edu/~shmat/shmat_oak17.pdf)

19 Jan 2022: Reconstruction
Attacks in ML (paper)

Amrutha Varshini Ramesh will present [The secret sharer: evaluating and testing unintended memorization in neural networks (Carlini, Liu, Erlingsson, Kos, Song 2019)](https://dl.acm.org/doi/10.5555/3361338.3361358) and (optionally) [Extracting Training Data from Large Language Models (Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea, Raffel 2021)](https://www.usenix.org/system/files/sec21-carlini-extracting.pdf)

24 Jan 2022: DP theory (lecture). ⚠️ Good time to start assignment 1

26 Jan 2022: DP theory (lecture)

31 Jan 2022: DP theory (lecture)

- Strong composition
- Amplification by subsampling

02 Feb 2022: DP-SGD (paper). ⚠️ Good time to start assignment 2

Amir Sabzi will present [Deep Learning with Differential Privacy (Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang 2016)](https://arxiv.org/abs/1607.00133) and [Learning Differentially Private Recurrent Language Models (McMahan, Ramage, Talwar, Zhang 2018)](https://arxiv.org/abs/1710.06963)

03 Feb 2022: ⚠️ Assignment 1 due. Special Office Hours to present your assignment.

04 Feb 2022: ⚠️ Assignment 1. Special Office Hours to present your assignment.

07 Feb 2022: Project pitches 🔥

Everyone can share their project ideas

09 Feb 2022: Rényi-DP (paper)

[Renyi Differential Privacy (Mironov 2017)](https://arxiv.org/abs/1702.07476)

14 Feb 2022: Discussion about Local-DP, Randomized Response. Optional (no review needed): Sampled Gaussian with RDP (paper).
[Rényi Differential Privacy of the Sampled Gaussian Mechanism (Mironov, Talwar, Zhang 2019)](https://arxiv.org/abs/1908.10530)

16 Feb 2022: Questions❓

Coming back to what we have covered so far to answer questions

18 Feb 2022: Project proposals due 🔥

Coming back to what we have covered so far to answer questions

21 Feb 2022: Midterm break, no class

23 Feb 2022: Midterm break, no class

28 Feb 2022: PATE (paper)

Qiaoyue Tang will present [Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data (Papernot, Abadi, Erlingsson, Goodfellow, Talwar 2017)](https://arxiv.org/abs/1610.05755) and [Scalable Private Learning with PATE (Papernot, Song, Mironov, Raghunathan, Talwar, Erlingsson 2018)](https://arxiv.org/abs/1802.08908)

02 Mar 2022: Auditing (paper)

Mishaal Kazmi will present [Auditing Differentially Private Machine Learning: How Private is Private SGD? (Jagielski, Ullman, Oprea 2020)](https://arxiv.org/pdf/2006.07709.pdf)

04 Mar 2022: ⚠️ Assignment 2. Special Office Hours to present your assignment.

07 Mar 2022: DP impact (paper)

Haley Li will present [Differential Privacy Has Disparate Impact on Model Accuracy (Bagdasaryan, Poursaeed, Shmatikov 2019)](https://papers.nips.cc/paper/2019/file/fc0de4e0396fff257ea362983c2dda5a-Paper.pdf) and [Differentially Private Learning Needs Better Features (or Much More Data) (Tramèr, Boneh 2020)](https://arxiv.org/abs/2011.11660)

09 Mar 2022: More DP mechanisms (lecture)

Noisy max, Exponential, Sparse vector

14 Mar 2022: Project updates 🔥

Each project group will have a 25-min slot to present the current version of their project, and get questions/feedback. Focus on: (1) formalizing the DP setting, and (2) scoping the concrete steps you are tackling for the project. Of course, you can also show first results if available!
16 Mar 2022: Model selection (paper)

Zainab Saeed Wattoo will present [Private Selection from Private Candidates (Liu, Talwar, 2019)](https://dl.acm.org/doi/10.1145/3313276.3316377)

21 Mar 2022: Model selection with RDP (paper)

Mayank Tiwary will present [Hyper-parameter Tuning with Rényi Differential Privacy (Papernot, Steinke 2021)](https://arxiv.org/abs/2110.03620)

23 Mar 2022: Questions❓

Coming back to what we have covered so far to answer questions

28 Mar 2022: Overview of argmax mechanisms (continued, ~30min). Lab session. (no review)

Rényi SVT: [Improving Sparse Vector Technique with Renyi Differential Privacy (Zhu, Wang 2020)](https://proceedings.neurips.cc//paper/2020/file/e9bf14a419d77534105016f5ec122d62-Paper.pdf)

Multiplicative Weights: [A Multiplicative Weights Mechanism for Privacy-Preserving Data Analysis (Hardt, Rothblum 2010)](https://ieeexplore.ieee.org/document/5670948)

Private DB release: [Differentially Private Query Release Through Adaptive Projection (Aydore, Brown, Kearns, Kenthapadi, Melis, Roth, Siva 2021)](https://arxiv.org/pdf/2103.06641.pdf)

30 Mar 2022: Overview of argmax mechanisms (end). Time permitting: lab session/review. (no review)

DP ML Systems: [Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform (Lécuyer, Spahn, Vodrahalli, Geambasu, Hsu 2019)](https://arxiv.org/abs/1909.01502) and [Privacy Budget Scheduling (Luo, Pan, Tholoniat, Cidon, Geambasu, Lécuyer 2021)](https://arxiv.org/pdf/2106.15335.pdf)

04 Apr 2022: Project presentations

06 Apr 2022: Project presentations

15 Apr 2022: Project final report due 🔥

Health
======

Learning and teaching are challenging if you are not healthy, safe, and secure. If you face any challenges to your well-being in CPSC 538L, please let us know! We will try to support you.
To make the best of this situation, here are rules/guidelines and mechanisms for flexibility:

- In keeping with BC’s mandate, masks are required for all in-person course activities (except where needed for your health; please refrain from eating and drinking in class). Those seeking a medical masking exemption must work with the Centre for Accessibility at info.accessibility@ubc.ca (they will give you a letter to share with me).
- Lectures will be available on Zoom. You can take advantage of this if you feel more comfortable this way. If you are sick, do not come to class.
- Paper presentations can be done in person (when available) or via the online alternative.
- You can participate remotely through the chat and Piazza.
- Office hours will be online.

**Your personal health**: If you are sick, it is important that you stay home – no matter what you think you may be sick with (e.g., cold, flu, other). The same applies if you have recently tested positive for COVID, or are required to quarantine. Your precautions will help reduce risk and keep everyone safer. The grading scheme is intended to provide flexibility so that you can prioritize your health and still be able to succeed.

[^piazza]: In this course, you will be using Piazza, which is a tool to help facilitate discussions. When creating an account in the tool, you will be asked to provide personally identifying information. Because this tool is hosted on servers in the U.S. and not in Canada, by creating an account you will also be consenting to the storage of your information in the U.S. Please know you are not required to consent to sharing this personal information with the tool if you are uncomfortable doing so. If you choose not to provide consent, you may create an account using a nickname and a non-identifying email address, then let your instructor know what alias you are using in the tool.