Privacy and Safety
Privacy
If you have an Individual GitHub Copilot license, you can choose whether to allow your prompts and suggestions to be used to improve the Copilot model, and you can change this setting at any time. If you opt in, your prompts and suggestions are anonymized and combined with data from other users to enhance the Copilot model. If you opt out, your prompts and suggestions are excluded from training and improvement processes.
For users with a Business or Enterprise GitHub Copilot license, prompts and suggestions are not utilized for training or improving the Copilot model. This measure is in place to ensure the privacy and security of your organization’s data and code. Additionally, these prompts and suggestions are not shared with GitHub or any third parties, and there is no option to opt in.
GitHub collects user interaction data to enhance the performance of the Copilot model and provide user support. This data covers how you interact with the Copilot system, for example whether suggestions are accepted or dismissed, rather than the content of the prompts and suggestions themselves. It is securely stored and managed in accordance with GitHub's privacy policy, and is retained for up to 24 months.
What Data is Collected?
GitHub Copilot processes personal data based on how Copilot is accessed and used: whether via github.com, the mobile app, or one of the various IDE extensions, and through features like code completions in the IDE, suggestions for the command line interface (CLI), or personalized chat on GitHub.com. The types of personal data processed may include:
- User Engagement Data: This includes pseudonymous identifiers captured from user interactions with Copilot, such as accepted or dismissed completions, error messages, system logs, and product usage metrics.
- Prompts: These are inputs for chat or code, along with context, sent to Copilot's AI to generate suggestions.
- Suggestions: These are the AI-generated code lines or chat responses provided to users based on their prompts.
- Feedback Data: This comprises real-time user feedback, including reactions (e.g., thumbs up/down) and optional comments, along with feedback from support tickets. Feedback data is retained as long as necessary.
Relevant Settings
These settings are important because they contribute to the continuous improvement of GitHub's services and AI capabilities, ensuring users benefit from more precise and efficient tools. For more details, refer to the relevant GitHub documentation: GitHub Models and Best practices for preventing data leaks in your organization.
- The setting "Allow GitHub to use my data for product improvements" allows GitHub to utilize your data, such as usage patterns and interactions, to enhance the product and its features. The default setting is typically enabled, helping GitHub improve the overall user experience by refining functionalities and addressing common issues.
- The setting "Allow GitHub to use my data for AI model training" permits GitHub to use your data to train and improve AI models, such as Copilot. This setting is also usually enabled by default, which helps in enhancing the accuracy and relevance of AI-generated suggestions.
Public Code Matches
A common concern with GitHub Copilot is its potential to generate code resembling existing code from public repositories, which raises legal and ethical questions about using AI-generated code that mirrors publicly available content. A public code match occurs when Copilot produces code closely aligning with code in GitHub's public repositories. This can happen because Copilot is trained on a vast dataset that includes code from public repositories, so suggestions may unintentionally mirror pre-existing snippets or structures.
Understanding and addressing such occurrences is vital to ensure proper usage and compliance with licensing requirements. This behavior can be configured through the "Suggestions matching public code" policy in your GitHub Copilot settings, which lets you allow or block suggestions that match public code, so Copilot aligns with your preferences or organizational policies.
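For organizations, this policy can also be read programmatically. Below is a minimal sketch, assuming a personal access token with the manage_billing:copilot scope and the GitHub REST API's Copilot billing endpoint, which at the time of writing reports the organization's public_code_suggestions policy; your-org is a placeholder.

```python
# Minimal sketch: read an organization's "Suggestions matching public code"
# policy via the GitHub REST API Copilot billing endpoint. Assumes a token
# with the manage_billing:copilot scope; ORG is a placeholder value.
import os
import requests

ORG = "your-org"  # placeholder organization name
TOKEN = os.environ["GITHUB_TOKEN"]

resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/copilot/billing",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    timeout=10,
)
resp.raise_for_status()

# The response includes a public_code_suggestions field,
# e.g. "allow", "block", or "unconfigured".
policy = resp.json().get("public_code_suggestions")
print(f"Suggestions matching public code: {policy}")
```

Checking this value in an automated compliance job can confirm that the policy your organization expects is actually in effect.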
Helpful Tools
To ensure that your codebase remains compliant with open source licenses and free from security vulnerabilities, it is recommended to use tools like FOSSA, Mend (Whitesource), Snyk, Black Duck, and Licensee. These tools can scan your GitHub repository to detect any open source code and potential license issues. Integrating these tools into your CI/CD pipeline can help automatically monitor your repositories, providing an extra layer of security and compliance, especially when using tools like GitHub Copilot that may introduce public code. Careful review and verification of AI-generated code are essential to minimize potential legal and ethical risks associated with public code matches.
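The dedicated tools above are far more thorough, but as a rough illustration of the idea, here is a minimal sketch (the keyword list and file extensions are illustrative assumptions) that flags files mentioning common license keywords so they can be reviewed by hand:

```python
# Minimal sketch: flag source files whose contents mention common open source
# license keywords, as a starting point for manual review. Dedicated scanners
# such as FOSSA, Snyk, or Licensee perform far deeper analysis; the keyword
# list and file extensions below are illustrative assumptions.
from pathlib import Path

LICENSE_KEYWORDS = (
    "gpl", "lgpl", "apache license", "mit license",
    "mozilla public license", "bsd license", "copyright (c)",
)
SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".c", ".cpp"}

def scan_repo(root: str) -> list[tuple[Path, str]]:
    """Return (file, keyword) pairs for files that mention a license keyword."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in SOURCE_EXTENSIONS or not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        for keyword in LICENSE_KEYWORDS:
            if keyword in text:
                hits.append((path, keyword))
                break  # one hit per file is enough to flag it
    return hits

if __name__ == "__main__":
    for path, keyword in scan_repo("."):
        print(f"{path}: mentions '{keyword}' - review licensing before shipping")
```

Run from a repository root, the script prints each flagged file; any hit should then be checked against the full license text and your organization's policies, which is exactly the kind of work the dedicated tools automate at scale.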
Legal and Ethical Considerations
Public code matches introduce concerns regarding copyright and intellectual property laws. When GitHub Copilot generates code resembling existing code from public repositories, there is a possibility that the output could violate the copyright or ownership rights of the original author. Such scenarios may result in legal disputes over the use and distribution of the generated code, raising questions about responsibility and accountability for potential infringements.
Ethically, public code matches bring up issues related to plagiarism and proper attribution. If code generated by GitHub Copilot closely mirrors publicly available code and is used without giving appropriate credit to the original author, it could be perceived as plagiarism. This raises important questions about transparency, responsibility, and the ethical use of AI-generated code in software development.
Developers must remain aware of these legal and ethical considerations and take proactive steps to review, validate, and properly attribute any code generated by GitHub Copilot.
Other Ethical Considerations When Using GitHub Copilot
GitHub Copilot is trained on a wide range of public repositories hosted on GitHub. These sources may contain code that reflects biases or is incomplete. It's important to remain vigilant and carefully review any generated code to identify and address potential bias or inaccuracies.
The suggestions provided by GitHub Copilot are derived from the context of your code, which may occasionally include sensitive or confidential information. Users should exercise caution when working with private or sensitive projects and always double-check the generated code to prevent unintentional exposure of critical information.
Code produced by GitHub Copilot is influenced by its training data, which includes a mix of publicly available repositories and licensed content. This means the generated code might inadvertently resemble copyrighted material. As the user, you are responsible for ensuring that your use of generated code aligns with relevant copyright laws and licensing agreements.
Staying informed about these considerations will help you use GitHub Copilot responsibly and effectively while minimizing potential risks.
Learn more at the GitHub Copilot Trust Center