WCMA Collection Cluster Analysis Project - Yolanda128/wcma_digital_collection GitHub Wiki
Project Overview
WCMA’s Mellon Digital Project gives students, faculty, researchers, artists, and many others easy access to downloadable metadata and thumbnail images for the entire WCMA collection. The availability of such a large collection of over 5000 digitized artworks opens up the possibility of creatively engaging with these works using tools that are not traditionally available to art historians and curators, such as machine learning and data visualization.
Typically, art historians and curators define (dis)similarity between artworks based on criteria such as artist, genre, date, and geography. Our primary goal for this project is to cluster a manageable portion of the WCMA collection (500 randomly selected images) according to a set of newly defined features that aim to capture (dis)similarity outside of these typical and rigidly defined categories.
We present four new ways of thinking of dissimilarity between artworks (we refer to these as feature sets):
- Complexity: a combined measure of the variation of hue, luminosity, saturation, etc.
- Style/Personality: a combined measure of the dominance of warm colors, average value/saturation, etc.
- Centrality: whether or not there is a central object in the image or if there is some kind of framing in the image.
- Color: whether or not 40 percent or more of the pixels in an image are primarily of one color (red, orange, yellow, green, blue, purple, black, white, or gray). If so, the variable takes the name of that primary color. If not, we categorize the image as “colorful”.
For the first three ways of defining dissimilarity, we used the k-means clustering algorithm to group the images into clusters based on features that we created. For the last one, we did not need to run a clustering algorithm because there is only one feature in it.
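The 40 percent rule for the color feature can be sketched in a few lines. This is a minimal illustration in plain Python rather than the project's R code. The function name `color_label` is hypothetical, and the sketch assumes each pixel has already been mapped to a color name upstream, since the text does not say how that mapping was done.

```python
from collections import Counter


def color_label(pixel_colors, threshold=0.40):
    """Return the primary color name, or "colorful" if no color reaches the threshold.

    pixel_colors: one color name per pixel. How a pixel is mapped to a name
    is NOT specified in the text; it is assumed to have happened upstream.
    """
    if not pixel_colors:
        raise ValueError("image has no pixels")
    # Most frequent color name and how many pixels carry it.
    name, count = Counter(pixel_colors).most_common(1)[0]
    share = count / len(pixel_colors)
    return name if share >= threshold else "colorful"
```

For example, an image in which half of the pixels are labeled "blue" would be labeled "blue", while an image whose largest color covers only 30 percent of the pixels would be labeled "colorful".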
Frequently Used Terms
Here is a list of terms that are frequently used in the project:
- Feature: a variable pertaining to a specific quality about an image.
- Feature set: a group of features that are used to form clusters.
- Pixel: the smallest single unit of a digital image.
- Resize: using the "imager" R package, we can adjust the number of pixels that form the image. By sizing down, we are essentially reducing the resolution of the image by using a smaller number of pixels to compose it.
- HSV System: a color model in which combinations of hue, saturation, and value can theoretically represent any color.
- n-Neighbor: a pixel that is n pixels away from a selected pixel. For instance, a 1-Neighbor-left refers to the first pixel to the left of a selected pixel.
Method
Feature Engineering
Here are the 11 features that we will be using to capture dissimilarity and cluster the collection. Each image will have a quantitative value for each of its 11 features.
- Center: As shown in Figure 1, we evenly split each (resized) image into 64 boxes and take the absolute difference between the average hue of the 16 boxes in the center (colored in orange) and that of the 48 boxes surrounding them. This feature helps us identify whether there is something in the center of the image that might be the point of focus.
  Figure 1: Demonstration of Center/Edges
- Edges: Measures how much the hue of a picture’s edge varies, via standard deviation. The edge is defined as the outer 28 boxes. Some pictures include a frame, in which case there would be very little variability in the hue of the edge.
- Value Mean: Value is a measure of lightness/darkness. The average of an image’s value across all of its pixels can give us information on the overall brightness of an image relative to other images.
- Value Variance: Measures how much value varies among an image’s pixels. If there is low variability, then the pixels all tend to be either light or dark.
- Saturation Mean: Saturation measures the intensity of a color. The mean of an image’s saturation across all of its pixels can give us information on the overall color intensity of an image relative to other images.
- Saturation Variance: Measures how much saturation varies among an image’s pixels. If there is low variability, then the pixels tend to be either all intensely colored or all muted.
- Dominance: Using the k-means clustering algorithm (described in the Clustering Algorithm section), we formed 8 clusters for the pixels in an image based on their hue, saturation, and value. We recognized that only eight clusters might not capture all the unique colors that can compose an image, but we decided that eight was a good number to represent the fundamental colors, since the most common retail packages for crayons are in multiples of eight. Dominance measures the proportion of pixels that are in the largest cluster. If an image has a “dominance” of .8, that means 80 percent of the pixels of the image are similar in hue, saturation, and value.
- 2-Neighbors-Left: We randomly selected 500 pixels from the image and looked at the correlation between the grayscale intensity of the selected pixel and that of the second pixel to its left.
- 2-Neighbors-Above: We randomly selected 500 pixels from the image and looked at the correlation between the grayscale intensity of the selected pixel and that of the second pixel above it.
- Warmth: Measures the proportion of warm-colored pixels, defined as pixels for which the hue is less than or equal to 90 degrees or between 330 and 360 degrees.
- Color: Outside of these ten features that we used to form clusters, we also created the “color” feature, which we described in the introduction.
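The Center and Edges features from the list above can be sketched as follows. This is an illustrative plain-Python sketch, not the project's R code. It assumes the input is an 8 x 8 grid holding the average hue of each box, that the 16 center boxes are the middle 4 x 4 block, and that the standard deviation is taken over the box averages using the population formula. None of these details are confirmed by the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev

GRID = 8  # 8 x 8 = 64 boxes per resized image


def center_feature(box_hues):
    """Absolute difference between the mean hue of the 16 central boxes
    and the mean hue of the 48 surrounding boxes."""
    center, surround = [], []
    for r in range(GRID):
        for c in range(GRID):
            # Assumption: the center is the middle 4 x 4 block of boxes.
            inside = 2 <= r <= 5 and 2 <= c <= 5
            (center if inside else surround).append(box_hues[r][c])
    return abs(mean(center) - mean(surround))


def edges_feature(box_hues):
    """Standard deviation of hue over the outer ring of 28 boxes."""
    ring = [
        box_hues[r][c]
        for r in range(GRID)
        for c in range(GRID)
        if r in (0, GRID - 1) or c in (0, GRID - 1)
    ]
    # Assumption: population standard deviation over box averages.
    return pstdev(ring)
```

An image with a uniform frame would have an Edges value near zero, and an image whose middle differs strongly in hue from its surroundings would have a large Center value.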
We grouped the set of images based on 4 feature sets (as shown below), each of which consists of a subset of the 11 features above. “Colorful” is special because it forms a feature set by itself, on which we did not need to run a clustering algorithm. Each feature set is supposed to represent one quality that pertains to dissimilarity. For example, the “complexity” feature set contains a set of features that all aim to capture the uniformity/complexity of a given image, including saturation variance. Using the clustering algorithm, we can theoretically group images of similar levels of complexity into their respective clusters.
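The two neighbor features in the feature list (2-Neighbors-Left and 2-Neighbors-Above) can be illustrated with a short sketch. This is plain Python rather than the project's R code, and it makes several assumptions the text does not state: the correlation is Pearson's, pixels are sampled with replacement, only pixels that actually have an n-neighbor are sampled, and a constant sample is given a correlation of 0. The names `pearson` and `neighbor_correlation` are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat constant samples as uncorrelated
    return cov / sqrt(vx * vy)


def neighbor_correlation(gray, n=2, direction="left", samples=500, seed=0):
    """Correlate the grayscale intensity of sampled pixels with that of
    their n-neighbor to the left or above.

    gray: 2D list of grayscale intensities, indexed as gray[row][col].
    """
    height, width = len(gray), len(gray[0])
    rng = random.Random(seed)
    here, there = [], []
    for _ in range(samples):
        if direction == "left":
            r, c = rng.randrange(height), rng.randrange(n, width)
            nr, nc = r, c - n
        elif direction == "above":
            r, c = rng.randrange(n, height), rng.randrange(width)
            nr, nc = r - n, c
        else:
            raise ValueError("direction must be 'left' or 'above'")
        here.append(gray[r][c])
        there.append(gray[nr][nc])
    return pearson(here, there)
```

A smooth image, where nearby pixels have similar intensities, should give a correlation close to 1, while a finely textured image should give a lower or even negative value.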
Dissimilarity Measure: Manhattan Distance
In order to use these features to group the images into clusters, we used a distance measure to calculate how different one image is from another. We decided to use the Manhattan distance, which is the sum of the absolute differences of each of the features for two given images. A given image has a value for each of the 10 features that we use in our clustering algorithm. To calculate the Manhattan distance between image A and image B, we took the difference between the two images' values for the first feature (e.g., “center”) and then took the absolute value of that difference.
Note that we first had to standardize the values so that all features would be on the same scale. We repeated the absolute-difference calculation for each feature in our feature set, and then summed the absolute differences. This gave us the Manhattan distance between image A and image B for that given feature set. Essentially, a larger Manhattan distance implies a more significant difference in the features of two images for a given feature set.
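The calculation described above can be sketched as follows, in plain Python rather than the project's R code. The text only says the values were standardized, so the use of z-scores (subtract the mean, divide by the population standard deviation) is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Put every feature column on the same scale.

    rows: one list of feature values per image.
    Assumption: z-scores; the text does not name the method used.
    """
    cols = list(zip(*rows))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) for col in cols]
    return [
        # A constant column carries no information, so map it to 0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mus, sds)]
        for row in rows
    ]


def manhattan(a, b):
    """Sum of absolute feature-by-feature differences between two images."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

With made-up values, `manhattan([1, 2, 3], [4, 0, 3])` is 3 + 2 + 0 = 5. In practice the distance would be computed on the rows returned by `standardize`, restricted to the features of one feature set.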
Clustering Algorithm
In order to separate the set of images into clusters, we used a clustering algorithm. In essence, a clustering algorithm uses given dissimilarity measures to group images into clusters. The goal is to group the images that are least distant from each other (and therefore most similar) into the same clusters. We chose to implement the k-means partitioning algorithm because it is computationally efficient, intuitive, and generally works well for large datasets, such as the one we are dealing with for this project.
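For readers unfamiliar with k-means, the following is a minimal plain-Python sketch of the usual assign-then-update loop. It is not the project's implementation. The text does not say how the Manhattan distance was combined with k-means, so assigning each image to the nearest cluster mean by Manhattan distance is an assumption here (the classic formulation uses squared Euclidean distance). The random initialization and all names are also illustrative.

```python
import random


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Group points into k clusters; returns (labels, centers).

    points: list of standardized feature vectors, one per image.
    """
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen points as initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: manhattan(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Because the result can depend on the initial centers, implementations commonly run the algorithm from several random starts and keep the best partition.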
In order to run the k-means partitioning algorithm, we needed to pre-specify the number of clusters for each feature set (the number of clusters is referred to as k). To do this, we examined average silhouette plots, which plot the average silhouette on the y-axis and k on the x-axis. Essentially, average silhouette captures how well the clusters are separated from each other. The higher the average silhouette, the more clearly separated the clusters are. Since we wanted the clusters to be distinct from each other, we wanted as high an average silhouette as possible. The average silhouette plot we got for the first feature set is shown below.
For both the first and the second feature set, we found that the average silhouette was greatest at k=2. We were hesitant to use such a small k, as we would risk having dissimilar images clustered together: the nearest cluster mean might be very different from the features of a certain image, but that image would still be assigned to that cluster because its mean is the closest. On the other hand, if we used a large k, we would risk splitting two otherwise similar images into two different groups, since the differences between two groups’ means can be very small (which suggests that the two groups could really be one group).
We decided to use k=5 for the first feature set and k=6 for the second feature set. Anything less would probably result in images within each cluster not being as similar, while any value of k between 5 and 20 for the first feature set, or between 6 and 20 for the second, resulted in a significantly lower average silhouette. For the third feature set, centrality, we decided that it made sense to use k=2, as the algorithm can probably group centered images and un-centered images into two separate clusters.
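To make the average silhouette concrete, here is a minimal plain-Python sketch of the standard definition. For each image, a is its mean distance to the other images in its own cluster, b is its mean distance to the images of the nearest other cluster, and the silhouette is (b - a) / max(a, b). Using the Manhattan distance here is an assumption based on the dissimilarity measure described earlier, and the function names are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def average_silhouette(points, labels):
    """Mean silhouette width over all points (between -1 and 1)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(manhattan(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(manhattan(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Computing this value for the partitions obtained at each candidate k, and plotting it against k, gives the kind of plot described above.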
Average silhouette plot for the "complexity" feature set
We ran a k-means clustering algorithm for each of the three feature sets, creating 5, 6, and 2 clusters for the first, second, and third feature sets, respectively. We also created 9 clusters using the “colorful” feature: red, orange, yellow, green, blue, purple, black, white, and colorful (which means that the color scheme of an image isn’t primarily of one color).
Cluster Results
We visualized the cluster results by creating a composite image for each cluster under each feature set (complexity, personality, centrality, color). The idea of the composite is to resize each image to 128x128 pixels, take the average hue, saturation, and value of each pixel position across all the images in the same cluster, and create a new image using these average values. Each composite image is essentially an average of the images in a given cluster.
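The compositing step can be sketched as a per-pixel average. This plain-Python illustration is not the project's R code. It assumes the images have already been resized to the same dimensions and are stored as grids of (hue, saturation, value) tuples. It takes a plain arithmetic mean of each channel, as the description suggests. Hue is an angle, so a plain mean can be misleading for hues near the 0/360 boundary. Whether the project handled this is not stated.

```python
def composite(images):
    """Average a list of same-sized HSV images pixel by pixel.

    images: list of 2D grids, each cell an (h, s, v) tuple.
    Returns a new grid holding the per-pixel channel means.
    """
    if not images:
        raise ValueError("need at least one image")
    height, width = len(images[0]), len(images[0][0])
    n = len(images)
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            # Mean of each of the three channels at this pixel position.
            pixel = tuple(
                sum(img[r][c][ch] for img in images) / n for ch in range(3)
            )
            row.append(pixel)
        result.append(row)
    return result
```

For two single-pixel images with HSV values (0, 0.25, 1.0) and (60, 0.75, 0.0), the composite pixel is (30, 0.5, 0.5).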
Clustering by "Complexity"
"Complexity" measures the variability of color and brightness of an image. As we can see from the composite images (see Figure 5), there is a clear difference between the most complex groups (whose composites are on the right end) and the least complex groups. Note that the “scale” we present here is not objective: one can easily rearrange the five composites based on how visually complex one perceives each one to be. We use letters to represent each composite, and the number in parentheses represents the size of the cluster. For example, A (114) means that composite A is composed of 114 images.
Figure 5: Composite Images for "Complexity" Feature Set
Clustering by "Personality/Style"
As we can see from the composite images (see Figure 6), each one has a different “personality” to it. For instance, composite E appears to be more gloomy than composite D.
Figure 6: Composite Images for "Personality" Feature Set
Clustering by "Centrality"
Ideally, one of these two composites should not have any discernible edge. However, as shown here (see Figure 7), both composite images have a clear separation between the central subject and the edge, with the first one’s central subject spreading out slightly more toward its edges.
Figure 7: Composite Images for "Centrality" Feature Set
Clustering Images for "Color"
Figure 8: Composite Images for "Color" Feature Set
We are pleased to present the composite results (see Figure 8) for each color group. We believe that this feature did a good job of grouping the colors, as most of the 9 colors are clearly and uniquely visible in their corresponding composites.