Exploring and annotating datasets

4CAT’s Explorer lets you browse and annotate datasets in a way that closely resembles the original platform the data came from. For example, tweets are rendered as interactive HTML cards—with images, names, clickable links, and retweet metrics—making them far more readable than plain rows in Excel.

To open the Explorer, click the Explore & annotate button on your dataset’s page.

Admin note: Explorer settings can be adjusted via Control panel → Settings → Explorer.

By default, the Explorer displays the first few items in your dataset. You can re-order the view using the dropdown to sort by any field. The example below shows Bluesky posts mentioning "4CAT".

Not all data sources have Explorer templates; check this folder for supported data sources. If no template is present, a generic template is used, which shows the id, thread_id, timestamp, author, subject, body, and image fields of uploaded items.
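If you want to check whether a custom upload will render in the generic template, a quick way is to write a CSV with exactly these columns. A minimal sketch in Python (the column names come from the list above; the file name and row values are made up):

```python
# Sketch: write a CSV that the generic Explorer template can render.
# Column names follow the fields listed above; the rows are hypothetical.
import csv

rows = [
    {
        "id": "1",
        "thread_id": "1",
        "timestamp": "2024-01-01 12:00:00",
        "author": "user_a",
        "subject": "Example subject",
        "body": "Example post body",
        "image": "",  # optional URL to an image
    },
]

with open("upload.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```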

Anonymisation

When a dataset is anonymised, the Explorer will only display anonymised items and will remove links to the original posts. This supports ethical research practices while preserving valuable contextual information. See the image above.

Note: Some fields may still contain personal data—review carefully before sharing or publishing.

Annotations

Annotations are pieces of additional data added to an existing dataset. For example, you might add an extracted_url annotation to Bluesky posts. In essence, annotations let you enrich and label your data, whether with informal field notes or structured categories.

Annotation fields are included when downloading a dataset and can also be used as inputs for various processors, making them a useful starting point for deeper analysis.
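For example, after downloading a dataset as a CSV, you can inspect annotation fields like any other column. A minimal pandas sketch, assuming an export named dataset.csv with an extracted_url annotation (both names are hypothetical):

```python
# Sketch: inspect annotation fields in a downloaded dataset.
# Assumes a CSV export "dataset.csv" with an annotation column
# "extracted_url" (both names are hypothetical).
import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.columns.tolist())  # annotation fields appear alongside the original columns
print(df["extracted_url"].dropna().head())  # peek at the annotated values
```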

In 4CAT, annotations can be generated manually and through processors:

Manual annotations

You can create annotations yourself in the Explorer. First, you need to create an annotation field: at the top of the Explorer page, click Edit fields to add, change, or delete annotation fields.

Click New field to add a new annotation field. Give it a Label via the left-most text box. The dropdown then lets you choose the type of manual annotation. The following types are currently supported:

  • Text: a simple text input.
  • Text (large): a larger text input.
  • Single choice: a dropdown menu that lets you select one option.
  • Multiple choice: checkboxes where you can select multiple options.

Clicking Update fields will apply your changes to the page. You can then start annotating:

The Show annotations toggle lets you show or hide annotations for all items in the Explorer. You can also hide specific fields using the eye icon in the fields editor. This is especially helpful during intercoder reliability checks, where maintaining coder independence is important.
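Once both coders are done, you could compute agreement outside 4CAT on the exported annotations. A minimal sketch, assuming each coder's export is a CSV with an id column and a category annotation field (all file and column names here are hypothetical):

```python
# Sketch: a quick intercoder reliability check on exported annotations.
# Assumes two coders annotated the same dataset and their labels live
# in CSV files with an item "id" and an annotation column "category"
# (all names are hypothetical).
import pandas as pd
from sklearn.metrics import cohen_kappa_score

coder_a = pd.read_csv("coder_a.csv").set_index("id")["category"]
coder_b = pd.read_csv("coder_b.csv").set_index("id")["category"]

# Keep only items that both coders annotated
paired = pd.concat([coder_a, coder_b], axis=1, keys=["a", "b"]).dropna()

agreement = (paired["a"] == paired["b"]).mean()
kappa = cohen_kappa_score(paired["a"], paired["b"])
print(f"Percent agreement: {agreement:.2%}, Cohen's kappa: {kappa:.2f}")
```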

Manual annotations are saved automatically after you edit them; you can also click the Save annotations button at the top of the page to make sure. Different users can annotate simultaneously, but newer changes will overwrite older ones.

Processor-made annotations

Some processors can automatically generate annotations. For instance, most LLM-based processors can add annotations directly to the parent dataset.

These machine-generated annotations work just like human-created ones: you can view them in the Explorer, sort items based on their values, or use them as input for further processors.

If a processor can generate annotations, it will look as follows:

Some processors can generate multiple annotations:

When the processor is run, the annotations will show up in the Explorer. This is an example of a Bluesky post being annotated with the Extract URLs processor:
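Conceptually, what such a processor adds is one derived value per item. For illustration only (this is not 4CAT's actual implementation of the processor), URL extraction boils down to something like:

```python
# Illustration only: a naive URL extractor, not 4CAT's actual
# Extract URLs processor.
import re

URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(body: str) -> list[str]:
    """Return all URLs found in a post's text."""
    return URL_PATTERN.findall(body)

print(extract_urls("Read the 4CAT paper at https://example.com/4cat"))
# ['https://example.com/4cat']
```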

This can get quite elaborate with various post-processors. Below, for instance, you see a TikTok dataset annotated with audio transcripts, made possible by chaining the processors Download TikTok videos → Extract audio from videos → Audio to text.

These processor-made annotations can then serve as input for manual coding or for other processors. Why not run text analysis processors on the audio transcripts from TikTok videos?
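As a minimal sketch of that idea, the snippet below counts word frequencies in a transcript annotation column; the file name and column name are hypothetical.

```python
# Sketch: simple text analysis on transcript annotations.
# Assumes a CSV export with an "audio_transcript" annotation column
# (the file and column names are hypothetical).
from collections import Counter
import pandas as pd

df = pd.read_csv("tiktok_dataset.csv")
words = Counter()
for transcript in df["audio_transcript"].dropna():
    words.update(transcript.lower().split())

print(words.most_common(10))  # ten most frequent words across all transcripts
```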