# Available processors
Below is a list of available processors. Depending on when this page was last updated, it may not reflect the current state of 4CAT.
Some of these processors are tried-and-tested, others are more experimental. Always check the code and data if the results seem fishy!
You will only find short descriptions here. Make sure to inspect the code of the processors if you want to know more - we try our best to add as many comments as possible.
Let us know if you have any processors to add to this list.
## Combined processors

| Name | Description | Usage |
| --- | --- | --- |
| Annotate images with Google Vision API | Use the Google Vision API to extract labels detected in the most-linked images from the dataset. Note that this is a paid service and will count towards your API credit. | Requires entering an API key |
| Monthly histogram | Generates a histogram with the number of posts per month. | |
| Extract neologisms | Retrieve uncommon terms by deleting all known words. Assumes English-language data. Uses stopwords-iso as its stopword filter. | |
| Find similar words | Uses Word2Vec models (Mikolov et al.) to find words used in a similar context as the queried word(s). Note that this will usually not give useful results for small (<100,000 items) datasets. | |
| Upload to DMI-TCAT | Convert the dataset to a TCAT-compatible format and upload it to an available TCAT server. | Available TCAT servers to be configured by instance admin |
## Conversion

| Name | Description | Usage |
| --- | --- | --- |
| Convert to JSON | Change a CSV file to a JSON file. | |
| Convert to Excel-compatible CSV | Change a CSV file so it works with Microsoft Excel. | |
| Convert NDJSON file to CSV | Change an NDJSON file to a CSV file (see the sketch below this table). | |
| Convert to TCAT JSON | Convert a Twitter dataset to a TCAT-compatible format. This file can then be uploaded to TCAT. | |
| Convert Vision results to CSV | Convert the Vision API output to a simplified CSV file. | |
| Merge datasets | Merge this dataset with another dataset of the same type. A new, third dataset is created containing items from both original datasets. | |
| Split by thread | Split the dataset per thread. The result is a zip archive containing separate CSV files. | |
| Merge texts | Merges the data from the body column into a single text file. The result can be used for word clouds, word trees, etc. | |
| Upload to DMI-TCAT | Send a TCAT-ready JSON file to a particular DMI-TCAT server. | Available TCAT servers to be configured by instance admin |
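As an indication of what the NDJSON-to-CSV conversion involves, here is a minimal sketch (not 4CAT's actual implementation). It assumes each line of the input file is a flat JSON object and uses the union of all keys as the CSV header; the file names are made up.

```python
import csv
import json

def ndjson_to_csv(ndjson_path, csv_path):
    """Convert newline-delimited JSON (one flat object per line) to CSV."""
    with open(ndjson_path, encoding="utf-8") as infile:
        items = [json.loads(line) for line in infile if line.strip()]

    # Use the union of all keys, in order of first appearance, as the header
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)

ndjson_to_csv("dataset.ndjson", "dataset.csv")
```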
## Cross-platform

| Name | Description | Usage |
| --- | --- | --- |
| Download YouTube thumbnails | Downloads the thumbnails of YouTube videos and stores them in a zip archive. | |
## Filtering

| Name | Description | Usage |
| --- | --- | --- |
| Replace or transliterate accented and non-Latin characters | Replaces non-Latin characters with the closest ASCII equivalent, converting e.g. 'á' to 'a', 'ç' to 'c', et cetera. Creates a new dataset. | |
| Remove author information | Anonymises a dataset by removing the content of any column whose name starts with 'author'. | |
| Filter by value | A generic filter that checks whether a value in a selected column matches a custom requirement. This will create a new dataset. | |
| Filter by date | Retains posts between given dates. This will create a new dataset. | |
| Expand shortened URLs | Replaces any URL in the dataset's 'body' field that is recognised as a shortened URL with the URL it redirects to. URLs are followed up to a depth of 5 links (see the URL expansion sketch below this table). This can take a long time for large datasets, and it is not recommended to run this processor on datasets larger than 10,000 items. This creates a new dataset with expanded URLs in place of redirects. | |
| Update Reddit scores | Updates the scores for each post and comment to more accurately reflect the real score. Can only be used on datasets with < 5,000 posts due to the heavy usage of the Reddit API. | Requires server admin to provide a Reddit API key |
| Filter by words or phrases | Retains posts that contain selected words or phrases, including preset word lists. This creates a new dataset. | |
| Random sample | Retain a pseudorandom set of posts. This creates a new dataset. | |
| Filter for unique posts | Retain posts with a unique body text. Only keeps the first encounter of a text. Useful for filtering spam. This creates a new dataset. | |
| Filter by wildcard | Retains only posts that contain certain words or phrases. Input may contain a wildcard `*`, which matches all text in between (see the wildcard sketch below this table). This creates a new dataset. | |
| Write annotations | Writes annotations from the Explorer to the dataset. Each input field will get a column. This creates a new dataset. | Cannot be called directly; called via the Explorer feature |
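URL expansion sketch: a rough illustration of how shortened URLs can be expanded using the requests library, not the processor's actual code. It follows HTTP redirects one hop at a time, up to five hops; the example URL is hypothetical.

```python
from urllib.parse import urljoin

import requests

def expand_url(url, max_depth=5):
    """Follow HTTP redirects one hop at a time, up to max_depth hops."""
    for _ in range(max_depth):
        response = requests.head(url, allow_redirects=False, timeout=10)
        # Stop as soon as the response is not a redirect
        if response.status_code not in (301, 302, 303, 307, 308):
            break
        # The Location header may be relative, so resolve it against the current URL
        url = urljoin(url, response.headers.get("Location", url))
    return url

# Hypothetical shortened URL, for illustration only
print(expand_url("https://bit.ly/example"))
```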
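Wildcard sketch: the wildcard matching can be approximated with a regular expression in which `*` is translated to "any run of text". This is only a sketch of the idea; the example posts and pattern are made up.

```python
import re

def matches_wildcard(text, pattern):
    """Check whether `text` contains `pattern`, where '*' matches any run of text."""
    # Escape everything except '*', which becomes '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.search(regex, text, flags=re.IGNORECASE) is not None

posts = ["Stop the fake mainstream news", "Just regular news"]
filtered = [post for post in posts if matches_wildcard(post, "fake*news")]
print(filtered)  # ['Stop the fake mainstream news']
```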
## Networks

| Name | Description | Usage |
| --- | --- | --- |
| Custom network | Create a GEXF network file comprised of linked values between a custom set of columns (e.g. 'author' and 'subreddit'). Nodes and edges are weighted by frequency. | |
| Bipartite Author-tag Network | Produces a bipartite graph based on co-occurrence of (hash)tags and people. If someone wrote a post with a certain tag, there will be a link between that person and the tag. The more often they appear together, the stronger the link. Tag nodes are weighted by how often they occur. User nodes are weighted by how many posts they've made. | |
| Co-tag network | Create a GEXF network file of tags co-occurring in posts (see the sketch below this table). Edges are weighted by the number of tag co-occurrences; nodes are weighted by how often the tag appears in the dataset. | |
| Co-word network | Create a GEXF network file of word co-occurrences. Edges denote words that appear close to each other. Edges and nodes are weighted by the number of co-word occurrences. | |
| Reply network | Create a GEXF network file of posts replying to each other. Each reference to another post creates an edge between posts. | Only available for data sources where replying is a feature |
| URL co-occurrence network | Create a GEXF network file comprised of URLs appearing together (in a post or thread). Edges are weighted by the number of co-links. | |
| Google Vision API Label network | Create a GEXF network file comprised of all annotations returned by the Google Vision API. Labels returned by the API are nodes. Labels occurring on the same image are connected by edges. | Requires API key |
| Wikipedia category network | Create a GEXF network file comprised of linked-to Wikipedia pages, connected to the categories they are part of. English Wikipedia only. Will only fetch the first 10,000 links. | Slow! |
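Co-tag network sketch: a rough networkx illustration (not the processor's own code) of how a co-occurrence network like this can be built. Every pair of tags appearing in the same post becomes an edge, node weights count tag occurrences, and edge weights count co-occurrences; the example tag lists are made up.

```python
from itertools import combinations

import networkx as nx

# Hypothetical example data: one list of tags per post
posts_tags = [
    ["politics", "news", "europe"],
    ["politics", "news"],
    ["memes", "politics"],
]

graph = nx.Graph()
for tags in posts_tags:
    # Node weight: how often each tag appears in the dataset
    for tag in tags:
        if graph.has_node(tag):
            graph.nodes[tag]["weight"] += 1
        else:
            graph.add_node(tag, weight=1)
    # Edge weight: how often two tags co-occur in the same post
    for tag_a, tag_b in combinations(sorted(set(tags)), 2):
        if graph.has_edge(tag_a, tag_b):
            graph[tag_a][tag_b]["weight"] += 1
        else:
            graph.add_edge(tag_a, tag_b, weight=1)

nx.write_gexf(graph, "co-tag-network.gexf")
```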
## Post metrics

| Name | Description | Usage |
| --- | --- | --- |
| Count values | Count values in a dataset column, like URLs or hashtags (overall or per timeframe); see the sketch below this table. | |
| Count posts | Counts how many posts are in the dataset (overall or per timeframe). | |
| Google Vision API Analysis | Use the Google Vision API to annotate images with tags and labels identified via machine learning. One request will be made per image per annotation type. Note that this is NOT a free service and requests will be credited by Google to the owner of the API token you provide! | |
| Hatebase analysis | Assigns scores for 'offensiveness' and hate speech probability to each post by using Hatebase. | Uses included Hatebase lexicon (which has limitations) |
| Extract top hateful phrases | Count frequencies for hateful words and phrases found in the dataset and rank the results (overall or per timeframe). | Uses included Hatebase lexicon (which has limitations) |
| Over-time offensiveness trend | Extracts offensiveness trends over time. Offensiveness is measured as the number of words listed on Hatebase that occur in the dataset. Also includes engagement metrics. | Uses included Hatebase lexicon (which has limitations) |
| Over-time word counts | Determines the counts over time of a particular set of words or phrases. | |
| Sort by most replied-to | Sort posts by how often they were replied to by other posts in the dataset. | |
| Extract Text from Images | Uses optical character recognition (OCR) to extract text from images via machine learning. | Requires a separate OCR server, to be configured by 4CAT admin |
| Thread metadata | Create an overview of the threads present in the dataset, containing thread IDs, subjects, and post counts. | |
| Rank image URLs | Collect all image URLs and sort by most-occurring. | |
| Extract top words | Ranks most used tokens per tokenset (overall or per timeframe). Limited to 100 most-used tokens. | |
| Extract YouTube metadata | Extract information from YouTube videos and channels linked to in the dataset. | |
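Counting values per timeframe boils down to grouping by a time interval and tallying occurrences. A minimal pandas sketch of that idea, assuming hypothetical 'timestamp' and 'hashtags' columns (actual column names differ per data source):

```python
import pandas as pd

# Hypothetical dataset: one comma-separated hashtag string per post
df = pd.DataFrame({
    "timestamp": ["2023-01-03", "2023-01-20", "2023-02-11"],
    "hashtags": ["news,politics", "politics", "memes,politics"],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# One row per individual hashtag
exploded = df.assign(hashtag=df["hashtags"].str.split(",")).explode("hashtag")
exploded["month"] = exploded["timestamp"].dt.to_period("M")

# Count how often each hashtag occurs per month
counts = exploded.groupby(["month", "hashtag"]).size().reset_index(name="count")
print(counts)
```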
## Text analysis

| Name | Description | Usage |
| --- | --- | --- |
| Extract co-words | Extracts words appearing close to each other from a set of tokens. | After tokenisation |
| Count documents per topic | Uses the LDA model to predict to which topic each item or sentence belongs and counts it as belonging to whichever topic has the highest probability. | After tokenisation |
| Post/Topic matrix | Uses the LDA model to predict to which topic each item or sentence belongs and creates a CSV file showing this information. Each line represents one 'document'; if tokens are grouped per 'item' and only one column is used (e.g. only the 'body' column), there is one row per post/item, otherwise a post may be represented by multiple rows (for each sentence and/or column used). | After tokenisation |
| Extract nouns | Retrieve nouns detected by SpaCy's part-of-speech tagging, and rank by frequency. Make sure to have selected "Part of Speech" in the previous module, as well as "Dependency parsing" if you want to extract compound nouns or noun chunks. | After SpaCy processing |
| Generate word embedding models | Generates Word2Vec or FastText word embedding models (overall or per timeframe). These calculate coordinates (vectors) per word on the basis of their context. The coordinates are positioned in a "vector space" with a large number of dimensions (so a coordinate can e.g. consist of 100 numbers). These numeric word representations can be used to extract words with similar contexts (see the word embedding sketch below this table). Note that good models require a lot of data. | After tokenisation |
| Extract named entities | Retrieve named entities detected by SpaCy, ranked on frequency. Be sure to have selected "Named Entity Recognition" in the previous module. | |
| Annotate text features with SpaCy | Annotate your text with a variety of linguistic features using the SpaCy library, including part-of-speech tagging, dependency parsing, and named entity recognition. Subsequent processors can extract the words labelled by SpaCy (e.g. as a noun or name). Produces a Doc file using the en_core_web_sm model. Currently only available for datasets with fewer than 100,000 items. | |
| Semantic frames | Extract semantic frames from text. This connects to the VUB's PENELOPE API to extract causal frames from the text using the framework developed by the Evolutionary and Hybrid AI (EHAI) group. | |
| Sentence split | Split a body of posts into discrete sentences. The output file has one row per sentence, containing the sentence and post ID. | |
| Extract similar words | Uses a Word2Vec model to find words used in a similar context. | After tokenisation and model building |
| Tf-idf | Get the tf-idf values of tokenised text. Works better with more documents (e.g. time-separated). | After tokenisation |
| Tokenise | Splits the post body texts into separate words (tokens). This data can then be used for text analysis. The output is a list of lists (each list representing all post tokens or tokens per sentence); see the tokenisation sketch below this table. | |
| Visualise LDA Model | Creates a visualisation of the chosen LDA model allowing exploration of the various words in each topic. | After tokenisation |
| Top words per topic | Creates a CSV file with the top tokens (words) per topic in the generated topic model, and their associated weights. | After tokenisation and model building |
| Generate topic models | Creates topic models per tokenset using Latent Dirichlet Allocation (LDA). For a given number of topics, tokens are assigned a relevance weight per topic, which can be used to find clusters of related words (see the topic modelling sketch below this table). | After tokenisation |
| Count words | Counts all tokens so they are transformed into word => frequency counts. This is also known as a bag of words. | After tokenisation |
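Tokenisation sketch: a minimal illustration of tokenising posts and counting tokens into a bag of words. The real processors offer more options (stemming, lemmatising, stopword removal, grouping per sentence), so treat this only as an indication of the general idea; the example posts are made up.

```python
from collections import Counter

posts = [
    "The quick brown fox jumps over the lazy dog",
    "The dog sleeps",
]

# Tokenise: one list of lowercased tokens per post (a list of lists)
tokens_per_post = [post.lower().split() for post in posts]

# Count words: collapse all tokens into a single bag of words
bag_of_words = Counter(token for tokens in tokens_per_post for token in tokens)
print(bag_of_words.most_common(3))  # [('the', 3), ('dog', 2), ('quick', 1)]
```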
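Word embedding sketch: training a small Word2Vec model with gensim and querying it for words used in a similar context. The toy data below is far too small to give meaningful results and only shows the mechanics; parameters are arbitrary.

```python
from gensim.models import Word2Vec

# Tokenised posts (a list of token lists), normally produced by the Tokenise processor
sentences = [
    ["cats", "are", "nice", "pets"],
    ["dogs", "are", "loyal", "pets"],
    ["parrots", "are", "noisy", "pets"],
]

# Train a small Word2Vec model: each word gets a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Words whose vectors are closest to the vector for 'cats'
print(model.wv.most_similar("cats", topn=3))
```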
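Topic modelling sketch: fitting an LDA model with gensim and listing the top words per topic with their weights. The corpus and hyperparameters here are made up for illustration; the actual processors work on the tokensets produced earlier in the pipeline.

```python
from gensim import corpora
from gensim.models import LdaModel

tokenised_posts = [
    ["election", "vote", "government"],
    ["vote", "parliament", "election"],
    ["cat", "dog", "pets"],
    ["dog", "pets", "food"],
]

# Map each token to an integer ID and represent posts as bags of words
dictionary = corpora.Dictionary(tokenised_posts)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenised_posts]

# Fit an LDA model with two topics
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Top words per topic, with their relevance weights
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=3))
```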
## Thread metrics

| Name | Description | Usage |
| --- | --- | --- |
| Debate metrics | Returns a CSV file with meta-metrics per thread. | |
## Twitter Analysis

| Name | Description | Usage |
| --- | --- | --- |
| Twitter Statistics | Contains the number of tweets, number of tweets with links, number of tweets with hashtags, number of tweets with mentions, number of retweets, and number of replies. | |
| Custom Statistics | Group tweets by category and count tweets per timeframe to collect aggregate group statistics. | For retweets and quotes, hashtags, mentions, URLs, and images from the original tweet are included in the retweet/quote. Data on public metrics (e.g., number of retweets or likes of tweets) are as of the time the data was collected. |
| Aggregated Statistics | Group tweets by category and count tweets per timeframe, then calculate aggregate group statistics (i.e. min, max, average, Q1, median, Q3, and trimmed mean): number of tweets, URLs, hashtags, mentions, etc. See the sketch below this table. | Use for example to find the distribution of the number of tweets per author and compare across time. |
| Aggregated Statistics Visualization | Gathers Aggregated Statistics data and creates box plots visualising the spread of intervals. A large number of intervals will not display properly. | |
| Hashtag Statistics | Lists by hashtag how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. | For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. |
| Identical Tweet Frequency | Groups tweets by text and counts the number of times they have been (re)tweeted identically. | |
| Mentions Export | Identifies mention types and creates a mentions table (tweet id, from author id, from username, to user id, to username, mention type). | |
| Source Statistics | Lists by tweet source how many tweets contain hashtags, how many times those tweets have been retweeted/replied to/liked/quoted, and information about unique users and hashtags used alongside each hashtag. | For retweets and quotes, hashtags from the original tweet are included in the retweet/quote. |
| Individual User Statistics | Lists users and their number of tweets, number of followers, number of friends, how many times they are listed, their UTC time offset, whether the user has a verified account, and how many times they appear in the data set. | |
| User Visibility | Collects usernames and totals how many tweets are authored by the user and how many tweets mention the user. | |
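To clarify what "aggregate group statistics" involves, here is a rough pandas sketch (not the processor's own code) that computes min, max, average, quartiles and a trimmed mean for the number of tweets per author per month. The 'author' and 'timestamp' columns and the example data are assumptions made for the illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical tweet data: one row per tweet
tweets = pd.DataFrame({
    "author": ["a", "a", "b", "c", "c", "c", "b", "a"],
    "timestamp": pd.to_datetime([
        "2023-01-02", "2023-01-15", "2023-01-20", "2023-01-21",
        "2023-02-01", "2023-02-03", "2023-02-10", "2023-02-28",
    ]),
})
tweets["month"] = tweets["timestamp"].dt.to_period("M")

# Number of tweets per author per month
per_author = tweets.groupby(["month", "author"]).size()

# Aggregate statistics of that distribution, per month
aggregated = per_author.groupby("month").agg(
    minimum="min",
    maximum="max",
    average="mean",
    q1=lambda s: s.quantile(0.25),
    median="median",
    q3=lambda s: s.quantile(0.75),
    trimmed_mean=lambda s: stats.trim_mean(s, 0.1),
)
print(aggregated)
```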
## Visual

| Name | Description | Usage |
| --- | --- | --- |
| Histogram | Generates a histogram (bar graph) from time frequencies. | |
| Chart diachronic nearest neighbours | Visualise nearest neighbours of a given query across all models and show the closest neighbours per model in one combined graph. Based on the 'HistWords' algorithm by Hamilton et al. | |
| Download images | Download images and store them in a zip file. May take a while to complete as images are retrieved externally. Note that not all images can always be saved. For imgur galleries, only the first image is saved. For animations (GIFs), only the first frame is saved if available. A JSON metadata file is included in the output archive. | 4chan datasets should include the image_md5 column. |
| Download Telegram images | Download images and store them in a zip file. Downloads through the Telegram API might take a while. Note that not all images can always be retrieved. A JSON metadata file is included in the output archive. | |
| Image wall | Put all images in a single combined image. Images can be sorted and resized. | |
| Create PixPlot visualisation | Put all images from an archive into a PixPlot visualisation: an explorable map of images algorithmically grouped by similarity. | Requires a separate PixPlot service, to be configured by the 4CAT admin |
| Side-by-side graphs | Generate area graphs showing prevalence per item over time. These are visualised side-by-side on an isometric plane for easy comparison. | |
| RankFlow diagram | Create a diagram showing changes in prevalence over time for ranked lists (following Bernhard Rieder's RankFlow). | |
| Word tree | Generates a word tree for a given query, a "graphical version of the traditional 'keyword-in-context' method" (Wattenberg & Viégas, 2008). | |
| Word cloud | Generates a word cloud with words sized on occurrence. | |
| YouTube thumbnails image wall | Make an image wall from YouTube video thumbnails. | |
## Additional Processor Instructions
Some processors may require additional setup or modification. Processors can be configured by 4CAT administrators via the '4CAT Settings' navigation menu option.