Ethics and privacy - digitalmethodsinitiative/4cat GitHub Wiki
4CAT's approach to research ethics and data privacy
4CAT is designed as a tool for researchers, flexible enough to be used for a variety of use cases. In some of these use cases there may be a legitimate reason to collect and analyse sensitive or personal data; in fact, when analysing social media data, it is almost impossible not to include personal data of some kind.
We consider it the responsibility of the researcher to make the relevant choices here to balance the research interests with the interests of the people whose data is included in a dataset. This includes the choice of whether 4CAT is the right tool for the job, and in what environment to run it. For sensitive data, for example, you might not want to use a 4CAT server to which other people also have access; running 4CAT on your personal laptop and processing the data locally can be an acceptable alternative.
Nevertheless, we do aim to facilitate ethical research through a range of features we include in the tool. The final responsibility for ethical conduct lies with the researcher, but by making it easy to, for example, pseudonymise, anonymise, or delete data, we can gently recommend a particular approach and contribute to a research ethos sensitive to potential privacy risks.
Concretely, the following 4CAT features can help when incorporating the tool into an ethically sound research project:
- Anonymisation and pseudonymisation at capture time: When capturing datasets through 4CAT directly, by default it pseudonymises usernames and similar personal identifiers, replacing them with an irreversible and arbitrary pseudonym. Optionally, these personal identifiers can be removed from the data altogether.
- Anonymisation and pseudonymisation at analysis time: The same features are available after a dataset has been created, if it becomes clear that personal identifiers are no longer needed or should be pseudonymised.
- Filters: Datasets can be filtered in a variety of ways, so you can create a new dataset that excludes items deemed too sensitive to keep, based on relevant criteria. The original dataset can then be removed and the filtered dataset can be analysed on its own.
- Account management: On shared 4CAT servers, datasets are only accessible to the user account that created them, and to accounts that have been granted access by the creator of the dataset.
- Modular analysis workflow: Each step of the analysis creates a separate, independent dataset. These can be downloaded or shared on their own, without requiring one to also share the underlying data.
- Auditable code: 4CAT is an open source tool. Any dataset created through it contains, as part of its metadata, a reference to the exact version of the code that generated the dataset. This makes it possible to audit the code and verify that it handles the data properly and generates correct results.
- Dataset expiration: It is possible to configure 4CAT so that datasets automatically expire a certain amount of time after they have been created. Users can be allowed to opt out of this, or expiration can be made mandatory. This can be a way to force users to download their datasets to a secure offline location, limiting any risk of illicit access through 4CAT itself.
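
To make the idea of irreversible pseudonymisation from the list above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not 4CAT's actual implementation: the function names `make_pseudonymiser` and `pseudonymise_items`, the `author` field, and the choice of a keyed BLAKE2 hash with a throwaway random key are all hypothetical. The same identifier always maps to the same pseudonym within one run, so per-author analysis still works, but once the key is discarded the mapping cannot practically be reversed.

```python
import hashlib
import secrets


def make_pseudonymiser(key=None):
    """Return a function that maps identifiers to stable pseudonyms.

    Assumption for this sketch: a random key is generated per run and
    never stored, so pseudonyms cannot be recomputed from a list of
    candidate usernames afterwards.
    """
    key = key if key is not None else secrets.token_bytes(16)

    def pseudonymise(identifier):
        # Keyed BLAKE2b hash; an 8-byte digest gives 16 hex characters.
        digest = hashlib.blake2b(
            identifier.encode("utf-8"), key=key, digest_size=8
        )
        return digest.hexdigest()

    return pseudonymise


def pseudonymise_items(items, fields=("author",), remove=False):
    """Return copies of items with identifier fields replaced or removed.

    `fields` lists the (hypothetical) keys that hold personal identifiers.
    With remove=True the fields are dropped altogether (anonymisation).
    """
    pseudonymise = make_pseudonymiser()
    result = []
    for item in items:
        cleaned = dict(item)  # leave the original item untouched
        for field in fields:
            if field not in cleaned:
                continue
            if remove:
                del cleaned[field]
            else:
                cleaned[field] = pseudonymise(str(cleaned[field]))
        result.append(cleaned)
    return result
```

Calling `pseudonymise_items(posts)` on a list of dictionaries would replace each `author` value with a 16-character pseudonym, while `pseudonymise_items(posts, remove=True)` would drop the field entirely. Note that this only covers dedicated identifier fields; usernames mentioned inside free text would need separate handling.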
A number of additional measures can be taken to secure a 4CAT instance, but these fall outside the scope of the tool itself:
- Disk encryption: The drive on which 4CAT stores its data could benefit from using an encrypted filesystem, so that access to the hard drive in itself does not allow one to view a dataset's contents.
- Firewalling: 4CAT is accessible in a browser and, to this end, includes a small web server component. This type of access can be further restricted through system-level firewall rules and other access controls (such as an HTTP password).
- Offline use: 4CAT does not require an internet connection to run and can be deployed on e.g. a laptop without internet access, which would allow one to use the software's analysis features with no possibility of access from third parties.
We have written more about these issues and aspects of the tool in two research papers, which may be useful as a further reference:
- Peeters, S., & Hagen, S. (2022). The 4CAT Capture and Analysis Toolkit: A Modular Tool for Transparent and Traceable Social Media Research. Computational Communication Research, 4(2), 571–589. https://doi.org/10.5117/CCR2022.2.007.HAGE
- Rieder, B., Peeters, S., & Borra, E. (2024). From tool to tool-making: Reflections on authorship in social media research software. Convergence, 30(1), 216–235. https://doi.org/10.1177/13548565221127094