# Available processors
Below is a list of available processors. Depending on when this page was last updated, it may not reflect the current state of 4CAT.
Some of these processors are tried-and-tested, others are more experimental. Always check the code and data if the results seem fishy!
You will only find short descriptions here. Make sure to inspect the code of the processors if you want to know more - we try our best to add as many comments as possible.
Let us know if you have any processors to add to this list.
## Combined processors

| Name | Description | Usage |
| --- | --- | --- |
| Annotate images with Google Vision API | Use the Google Vision API to extract labels detected in the most-linked images from the dataset. Note that this is a paid service and will count towards your API credit. | Requires entering an API key |
| Monthly histogram | Generates a histogram with the number of posts per month. | |
| Extract neologisms | Retrieve uncommon terms by deleting all known words. Assumes English-language data. Uses stopwords-iso as its stopword filter. | |
| Find similar words | Uses Word2Vec models (Mikolov et al.) to find words used in a similar context as the queried word(s). Note that this will usually not give useful results for small (<100,000 items) datasets. | |
| Upload to DMI-TCAT | Convert the dataset to a TCAT-compatible format and upload it to an available TCAT server. | Available TCAT servers to be configured by instance admin |
## Conversion

| Name | Description | Usage |
| --- | --- | --- |
| Convert to JSON | Change a CSV file to a JSON file. | |
| Convert to Excel-compatible CSV | Change a CSV file so it works with Microsoft Excel. | |
| Convert NDJSON file to CSV | Change an NDJSON file to a CSV file (see the sketch below this table). | |
| Convert to TCAT JSON | Convert a Twitter dataset to a TCAT-compatible format. This file can then be uploaded to TCAT. | |
| Convert Vision results to CSV | Convert the Vision API output to a simplified CSV file. | |
| Merge datasets | Merge this dataset with another dataset of the same type. A new, third dataset is created containing items from both original datasets. | |
| Split by thread | Split the dataset per thread. The result is a zip archive containing separate CSV files. | |
| Merge texts | Merges the data from the body column into a single text file. The result can be used for word clouds, word trees, etc. | |
| Upload to DMI-TCAT | Send a TCAT-ready JSON file to a particular DMI-TCAT server. | Available TCAT servers to be configured by instance admin |
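As an indication of what the NDJSON-to-CSV conversion involves, here is a minimal sketch (not 4CAT's actual implementation). It assumes each line of the input file is a flat JSON object and uses the union of all keys as the CSV header; the file names are made up.

```python
import csv
import json

def ndjson_to_csv(ndjson_path, csv_path):
    """Convert newline-delimited JSON (one flat object per line) to CSV."""
    with open(ndjson_path, encoding="utf-8") as infile:
        items = [json.loads(line) for line in infile if line.strip()]

    # Use the union of all keys, in order of first appearance, as the header
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)

ndjson_to_csv("dataset.ndjson", "dataset.csv")
```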
## Cross-platform

| Name | Description | Usage |
| --- | --- | --- |
| Download YouTube thumbnails | Downloads the thumbnails of YouTube videos and stores them in a zip archive. | |
## Filtering

| Name | Description | Usage |
| --- | --- | --- |
| Replace or transliterate accented and non-Latin characters | Replaces non-Latin characters with the closest ASCII equivalent, converting e.g. 'á' to 'a', 'ç' to 'c', et cetera. Creates a new dataset. | |
| Remove author information | Anonymises a dataset by removing the content of any column whose name starts with 'author'. | |
| Filter by value | A generic filter that checks whether a value in a selected column matches a custom requirement. This will create a new dataset. | |
| Filter by date | Retains posts between given dates. This will create a new dataset. | |
| Expand shortened URLs | Replaces any URL in the dataset's 'body' field that is recognised as a shortened URL with the URL it redirects to. URLs are followed up to a depth of 5 links (see the URL expansion sketch below this table). This can take a long time for large datasets, and it is not recommended to run this processor on datasets larger than 10,000 items. This creates a new dataset with expanded URLs in place of redirects. | |
| Update Reddit scores | Updates the scores for each post and comment to more accurately reflect the real score. Can only be used on datasets with < 5,000 posts due to the heavy usage of the Reddit API. | Requires server admin to provide a Reddit API key |
| Filter by words or phrases | Retains posts that contain selected words or phrases, including preset word lists. This creates a new dataset. | |
| Random sample | Retain a pseudorandom set of posts. This creates a new dataset. | |
| Filter for unique posts | Retain posts with a unique body text. Only keeps the first encounter of a text. Useful for filtering spam. This creates a new dataset. | |
| Filter by wildcard | Retains only posts that contain certain words or phrases. Input may contain a wildcard `*`, which matches all text in between (see the wildcard sketch below this table). This creates a new dataset. | |
| Write annotations | Writes annotations from the Explorer to the dataset. Each input field will get a column. This creates a new dataset. | Cannot be called directly; called via the Explorer feature |
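URL expansion sketch: a rough illustration of how shortened URLs can be expanded using the requests library, not the processor's actual code. It follows HTTP redirects one hop at a time, up to five hops; the example URL is hypothetical.

```python
from urllib.parse import urljoin

import requests

def expand_url(url, max_depth=5):
    """Follow HTTP redirects one hop at a time, up to max_depth hops."""
    for _ in range(max_depth):
        response = requests.head(url, allow_redirects=False, timeout=10)
        # Stop as soon as the response is not a redirect
        if response.status_code not in (301, 302, 303, 307, 308):
            break
        # The Location header may be relative, so resolve it against the current URL
        url = urljoin(url, response.headers.get("Location", url))
    return url

# Hypothetical shortened URL, for illustration only
print(expand_url("https://bit.ly/example"))
```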
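Wildcard sketch: the wildcard matching can be approximated with a regular expression in which `*` is translated to "any run of text". This is only a sketch of the idea; the example posts and pattern are made up.

```python
import re

def matches_wildcard(text, pattern):
    """Check whether `text` contains `pattern`, where '*' matches any run of text."""
    # Escape everything except '*', which becomes '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.search(regex, text, flags=re.IGNORECASE) is not None

posts = ["Stop the fake mainstream news", "Just regular news"]
filtered = [post for post in posts if matches_wildcard(post, "fake*news")]
print(filtered)  # ['Stop the fake mainstream news']
```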
## Networks

| Name | Description | Usage |
| --- | --- | --- |
| Custom network | Create a GEXF network file comprised of linked values between a custom set of columns (e.g. 'author' and 'subreddit'). Nodes and edges are weighted by frequency. | |
| Bipartite Author-tag Network | Produces a bipartite graph based on co-occurrence of (hash)tags and people. If someone wrote a post with a certain tag, there will be a link between that person and the tag. The more often they appear together, the stronger the link. Tag nodes are weighted by how often they occur. User nodes are weighted by how many posts they've made. | |
| Co-tag network | Create a GEXF network file of tags co-occurring in posts (see the sketch below this table). Edges are weighted by the number of tag co-occurrences; nodes are weighted by how often the tag appears in the dataset. | |
| Co-word network | Create a GEXF network file of word co-occurrences. Edges denote words that appear close to each other. Edges and nodes are weighted by the number of co-word occurrences. | |
| Reply network | Create a GEXF network file of posts replying to each other. Each reference to another post creates an edge between posts. | Only available for data sources where replying is a feature |
| URL co-occurrence network | Create a GEXF network file comprised of URLs appearing together (in a post or thread). Edges are weighted by the number of co-links. | |
| Google Vision API Label network | Create a GEXF network file comprised of all annotations returned by the Google Vision API. Labels returned by the API are nodes. Labels occurring on the same image are connected by edges. | Requires API key |
| Wikipedia category network | Create a GEXF network file comprised of linked-to Wikipedia pages, connected to the categories they are part of. English Wikipedia only. Will only fetch the first 10,000 links. | Slow! |
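Co-tag network sketch: a rough networkx illustration (not the processor's own code) of how a co-occurrence network like this can be built. Every pair of tags appearing in the same post becomes an edge, node weights count tag occurrences, and edge weights count co-occurrences; the example tag lists are made up.

```python
from itertools import combinations

import networkx as nx

# Hypothetical example data: one list of tags per post
posts_tags = [
    ["politics", "news", "europe"],
    ["politics", "news"],
    ["memes", "politics"],
]

graph = nx.Graph()
for tags in posts_tags:
    # Node weight: how often each tag appears in the dataset
    for tag in tags:
        if graph.has_node(tag):
            graph.nodes[tag]["weight"] += 1
        else:
            graph.add_node(tag, weight=1)
    # Edge weight: how often two tags co-occur in the same post
    for tag_a, tag_b in combinations(sorted(set(tags)), 2):
        if graph.has_edge(tag_a, tag_b):
            graph[tag_a][tag_b]["weight"] += 1
        else:
            graph.add_edge(tag_a, tag_b, weight=1)

nx.write_gexf(graph, "co-tag-network.gexf")
```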
## Post metrics

| Name | Description | Usage |
| --- | --- | --- |
| Count values | Count values in a dataset column, like URLs or hashtags (overall or per timeframe); see the sketch below this table. | |
| Count posts | Counts how many posts are in the dataset (overall or per timeframe). | |
| Google Vision API Analysis | Use the Google Vision API to annotate images with tags and labels identified via machine learning. One request will be made per image per annotation type. Note that this is NOT a free service and requests will be credited by Google to the owner of the API token you provide! | |
| Hatebase analysis | Assigns scores for 'offensiveness' and hate speech probability to each post by using Hatebase. | Uses included Hatebase lexicon (which has limitations) |
| Extract top hateful phrases | Count frequencies for hateful words and phrases found in the dataset and rank the results (overall or per timeframe). | Uses included Hatebase lexicon (which has limitations) |
| Over-time offensiveness trend | Extracts offensiveness trends over time. Offensiveness is measured as the number of words listed on Hatebase that occur in the dataset. Also includes engagement metrics. | Uses included Hatebase lexicon (which has limitations) |
| Over-time word counts | Determines the counts over time of a particular set of words or phrases. | |
| Sort by most replied-to | Sort posts by how often they were replied to by other posts in the dataset. | |
| Extract Text from Images | Uses optical character recognition (OCR) to extract text from images via machine learning. | Requires a separate OCR server, to be configured by 4CAT admin |
| Thread metadata | Create an overview of the threads present in the dataset, containing thread IDs, subjects, and post counts. | |
| Rank image URLs | Collect all image URLs and sort by most-occurring. | |
| Extract top words | Ranks most used tokens per tokenset (overall or per timeframe). Limited to 100 most-used tokens. | |
| Extract YouTube metadata | Extract information from YouTube videos and channels linked to in the dataset. | |
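Counting values per timeframe boils down to grouping by a time interval and tallying occurrences. A minimal pandas sketch of that idea, assuming hypothetical 'timestamp' and 'hashtags' columns (actual column names differ per data source):

```python
import pandas as pd

# Hypothetical dataset: one comma-separated hashtag string per post
df = pd.DataFrame({
    "timestamp": ["2023-01-03", "2023-01-20", "2023-02-11"],
    "hashtags": ["news,politics", "politics", "memes,politics"],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# One row per individual hashtag
exploded = df.assign(hashtag=df["hashtags"].str.split(",")).explode("hashtag")
exploded["month"] = exploded["timestamp"].dt.to_period("M")

# Count how often each hashtag occurs per month
counts = exploded.groupby(["month", "hashtag"]).size().reset_index(name="count")
print(counts)
```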
## Text analysis

| Name | Description | Usage |
| --- | --- | --- |
| Extract co-words | Extracts words appearing close to each other from a set of tokens. | After tokenisation |
| Count documents per topic | Uses the LDA model to predict to which topic each item or sentence belongs and counts it as belonging to whichever topic has the highest probability. | After tokenisation |
| Post/Topic matrix | Uses the LDA model to predict to which topic each item or sentence belongs and creates a CSV file showing this information. Each line represents one 'document'; if tokens are grouped per 'item' and only one column is used (e.g. only the 'body' column), there is one row per post/item, otherwise a post may be represented by multiple rows (for each sentence and/or column used). | After tokenisation |
| Extract nouns | Retrieve nouns detected by SpaCy's part-of-speech tagging, and rank by frequency. Make sure to have selected "Part of Speech" in the previous module, as well as "Dependency parsing" if you want to extract compound nouns or noun chunks. | After SpaCy processing |
| Generate word embedding models | Generates Word2Vec or FastText word embedding models (overall or per timeframe). These calculate coordinates (vectors) per word on the basis of their context. The coordinates are positioned in a "vector space" with a large number of dimensions (so a coordinate can e.g. consist of 100 numbers). These numeric word representations can be used to extract words with similar contexts (see the word embedding sketch below this table). Note that good models require a lot of data. | After tokenisation |
| Extract named entities | Retrieve named entities detected by SpaCy, ranked on frequency. Be sure to have selected "Named Entity Recognition" in the previous module. | |
| Annotate text features with SpaCy | Annotate your text with a variety of linguistic features using the SpaCy library, including part-of-speech tagging, dependency parsing, and named entity recognition. Subsequent processors can extract the words labelled by SpaCy (e.g. as a noun or name). Produces a Doc file using the en_core_web_sm model. Currently only available for datasets with fewer than 100,000 items. | |
| Semantic frames | Extract semantic frames from text. This connects to the VUB's PENELOPE API to extract causal frames from the text using the framework developed by the Evolutionary and Hybrid AI (EHAI) group. | |
| Sentence split | Split a body of posts into discrete sentences. The output file has one row per sentence, containing the sentence and post ID. | |
| Extract similar words | Uses a Word2Vec model to find words used in a similar context. | After tokenisation and model building |
| Tf-idf | Get the tf-idf values of tokenised text. Works better with more documents (e.g. time-separated). | After tokenisation |
| Tokenise | Splits the post body texts into separate words (tokens). This data can then be used for text analysis. The output is a list of lists (each list representing all post tokens or tokens per sentence); see the tokenisation sketch below this table. | |
| Visualise LDA Model | Creates a visualisation of the chosen LDA model allowing exploration of the various words in each topic. | After tokenisation |
| Top words per topic | Creates a CSV file with the top tokens (words) per topic in the generated topic model, and their associated weights. | After tokenisation and model building |
| Generate topic models | Creates topic models per tokenset using Latent Dirichlet Allocation (LDA). For a given number of topics, tokens are assigned a relevance weight per topic, which can be used to find clusters of related words (see the topic modelling sketch below this table). | After tokenisation |
| Count words | Counts all tokens so they are transformed into word => frequency counts. This is also known as a bag of words. | After tokenisation |
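Tokenisation sketch: a minimal illustration of tokenising posts and counting tokens into a bag of words. The real processors offer more options (stemming, lemmatising, stopword removal, grouping per sentence), so treat this only as an indication of the general idea; the example posts are made up.

```python
from collections import Counter

posts = [
    "The quick brown fox jumps over the lazy dog",
    "The dog sleeps",
]

# Tokenise: one list of lowercased tokens per post (a list of lists)
tokens_per_post = [post.lower().split() for post in posts]

# Count words: collapse all tokens into a single bag of words
bag_of_words = Counter(token for tokens in tokens_per_post for token in tokens)
print(bag_of_words.most_common(3))  # [('the', 3), ('dog', 2), ('quick', 1)]
```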
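Word embedding sketch: training a small Word2Vec model with gensim and querying it for words used in a similar context. The toy data below is far too small to give meaningful results and only shows the mechanics; parameters are arbitrary.

```python
from gensim.models import Word2Vec

# Tokenised posts (a list of token lists), normally produced by the Tokenise processor
sentences = [
    ["cats", "are", "nice", "pets"],
    ["dogs", "are", "loyal", "pets"],
    ["parrots", "are", "noisy", "pets"],
]

# Train a small Word2Vec model: each word gets a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Words whose vectors are closest to the vector for 'cats'
print(model.wv.most_similar("cats", topn=3))
```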
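Topic modelling sketch: fitting an LDA model with gensim and listing the top words per topic with their weights. The corpus and hyperparameters here are made up for illustration; the actual processors work on the tokensets produced earlier in the pipeline.

```python
from gensim import corpora
from gensim.models import LdaModel

tokenised_posts = [
    ["election", "vote", "government"],
    ["vote", "parliament", "election"],
    ["cat", "dog", "pets"],
    ["dog", "pets", "food"],
]

# Map each token to an integer ID and represent posts as bags of words
dictionary = corpora.Dictionary(tokenised_posts)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenised_posts]

# Fit an LDA model with two topics
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Top words per topic, with their relevance weights
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=3))
```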
## Thread metrics

| Name | Description | Usage |
| --- | --- | --- |
| Debate metrics | Returns a CSV file with meta-metrics per thread. | |
## Twitter Analysis

| Name | Description | Usage |
| --- | --- | --- |
| Twitter Statistics | Contains the number of tweets, number of tweets with links, number of tweets with hashtags, number of tweets with mentions, number of retweets, and number of replies. | |
| Custom Statistics | Group tweets by category and count tweets per timeframe to collect aggregate group statistics. | For retweets and quotes, hashtags, mentions, URLs, and images from the original tweet are included in the retweet/quote. Data on public metrics (e.g., number of retweets or likes of tweets) are as of the time the data was collected. |
| Aggregated Statistics | Group tweets by category and count tweets per timeframe, then calculate aggregate group statistics (i.e. min, max, average, Q1, median, Q3, and trimmed mean): number of tweets, URLs, hashtags, mentions, etc. See the sketch below this table. | Use for example to find the distribution of the number of tweets per author and compare across time. |
| Aggregated Statistics Visualization | Gathers Aggregated Statistics data and creates box plots visualising the spread of intervals. A large number of intervals will not display properly. | |
| Hashtag Statistics | Lists by hashtag how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. | For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. |
| Identical Tweet Frequency | Groups tweets by text and counts the number of times they have been (re)tweeted identically. | |
| Mentions Export | Identifies mention types and creates a mentions table (tweet id, from author id, from username, to user id, to username, mention type). | |
| Source Statistics | Lists by tweet source how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. | For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. |
| Individual User Statistics | Lists users and their number of tweets, number of followers, number of friends, how many times they are listed, their UTC time offset, whether the user has a verified account, and how many times they appear in the data set. | |
| User Visibility | Collects usernames and totals how many tweets are authored by the user and how many tweets mention the user. | |
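To clarify what "aggregate group statistics" involves, here is a rough pandas sketch (not the processor's own code) that computes min, max, average, quartiles and a trimmed mean for the number of tweets per author per month. The 'author' and 'timestamp' columns and the example data are assumptions made for the illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical tweet data: one row per tweet
tweets = pd.DataFrame({
    "author": ["a", "a", "b", "c", "c", "c", "b", "a"],
    "timestamp": pd.to_datetime([
        "2023-01-02", "2023-01-15", "2023-01-20", "2023-01-21",
        "2023-02-01", "2023-02-03", "2023-02-10", "2023-02-28",
    ]),
})
tweets["month"] = tweets["timestamp"].dt.to_period("M")

# Number of tweets per author per month
per_author = tweets.groupby(["month", "author"]).size()

# Aggregate statistics of that distribution, per month
aggregated = per_author.groupby("month").agg(
    minimum="min",
    maximum="max",
    average="mean",
    q1=lambda s: s.quantile(0.25),
    median="median",
    q3=lambda s: s.quantile(0.75),
    trimmed_mean=lambda s: stats.trim_mean(s, 0.1),
)
print(aggregated)
```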
## Visual

| Name | Description | Usage |
| --- | --- | --- |
| Histogram | Generates a histogram (bar graph) from time frequencies. | |
| Chart diachronic nearest neighbours | Visualise nearest neighbours of a given query across all models and show the closest neighbours per model in one combined graph. Based on the 'HistWords' algorithm by Hamilton et al. | |
| Download images | Download images and store them in a zip file. May take a while to complete as images are retrieved externally. Note that not all images can always be saved. For imgur galleries, only the first image is saved. For animations (GIFs), only the first frame is saved if available. A JSON metadata file is included in the output archive. | 4chan datasets should include the image_md5 column. |
| Download Telegram images | Download images and store them in a zip file. Downloads through the Telegram API might take a while. Note that not all images can always be retrieved. A JSON metadata file is included in the output archive. | |
| Image wall | Put all images in a single combined image. Images can be sorted and resized. | |
| Create PixPlot visualisation | Put all images from an archive into a PixPlot visualisation: an explorable map of images algorithmically grouped by similarity. | Requires a separate PixPlot service, to be configured by the 4CAT admin |
| Side-by-side graphs | Generate area graphs showing prevalence per item over time. These are visualised side-by-side on an isometric plane for easy comparison. | |
| RankFlow diagram | Create a diagram showing changes in prevalence over time for ranked lists (following Bernhard Rieder's RankFlow). | |
| Word tree | Generates a word tree for a given query, a "graphical version of the traditional 'keyword-in-context' method" (Wattenberg & Viégas, 2008). | |
| Word cloud | Generates a word cloud with words sized on occurrence. | |
| YouTube thumbnails image wall | Make an image wall from YouTube video thumbnails. | |
## Additional Processor Instructions
Some processors may require additional setup or modification. Processors can be configured by 4CAT administrators via the '4CAT Settings' navigation menu option.