Digital sobriety

The analysis of the digital sobriety of this service is part of a student project carried out by Aurélien Desprez and Callista Spiteri during the 2022 spring semester. It was tutored by Aurélien Benel and carried out at the Université de Technologie de Troyes, a French engineering school. Its goal is to observe how the environmental impact of a web service depends on its IT architecture on the AWS cloud.

It was done in collaboration with IMAJ (Institut Mondial d’Art de la Jeunesse), a worldwide institute which aims to promote the recognition of childhood and youth and to work for their inclusion in the memory of humanity, through educational, cultural and scientific activities organized in the international spirit of UNESCO. One of their projects consists of organising an art competition for young people (aged 3 to 25) all over the world. They collect the drawings and analyse them in Troyes, and they will use this service to publicly display the drawings that they have scanned. Because they work for present and future generations, having a system with a positive impact is one of their stakes, hence our partnership.

Our analysis takes place on Amazon Web Services. We are aware that Amazon is not the best provider in terms of sustainability, but our project is only a prototype: it can be transplanted to other providers that offer approximately the same services. Another reason why we chose AWS is that it offers a wide choice of IT architecture types that we can compare within the same economic and environmental frame of reference.

The impacts of IT

Before talking about the project, we feel that it is important to start by understanding the environmental impacts of IT. Indeed, few people know that building and using IT equipment requires digging up rare minerals from distant mines and spending energy to cool the equipment down, processes that heavily pollute their surroundings. For example, if IT were a country, it would be the third largest consumer of electricity in the world, after China and the US. A study from the CNRS shows that half of these impacts are due to the electricity consumed by servers. That is why it is interesting for us to understand and analyse IT architectures that make it possible to reduce the use of servers, and therefore their consumption.

Source: https://hal.archives-ouvertes.fr/hal-03009741v2

The advantages of serverless

To address the problem stated previously, we chose to analyse the serverless architecture. It is a cloud development model that lets developers concentrate on the code alone rather than on a project’s server management. But before going further into serverless, we need to understand what was used before. Usually, an application is hosted on one server owned by the entity developing it, so the servers are managed by the company itself. That is to say, if there is a problem or if the company wants to scale the application to a bigger volume, the company has to handle it, which is a problem when there are not enough human resources or money to manage it properly. This is why serverless is interesting for developers.

(Figure: mutualisation of servers with serverless)

As we can see in this scheme, serverless makes it possible to mutualise servers. Indeed, several apps can share one server, which frees capacity in case of an increase in demand for a hosted service. If we take the example of our provider, AWS is specialised in handling this type of load, thanks to its experience with order peaks during holiday periods such as Christmas. Furthermore, the structure of the system suggests a lower impact: when a server is not used, it is switched off and only put back on when needed. A 2021 report available on HAL Open Science confirms that this is better for the environment.

The aim of this project is to confirm this assumption by comparing serverless with what is usually used. However, it is difficult to get a precise picture of the environmental impacts because the provider publishes no such data. So, we make the hypothesis that the cost of the platform is proportional to its environmental impact: the cost largely reflects the electricity that we use. This will give us an order of magnitude of the environmental impacts, sufficient for comparing both architectures.

Sources: https://www.ibm.com/cloud/learn/serverless, https://www.redhat.com/en/topics/cloud-native-apps/what-is-serverless, https://hal.archives-ouvertes.fr/hal-03009741v2

AWS

Before continuing, we need to make sure that we understand how AWS works. In order to implement the serverless architecture, we used several of this provider’s services:

(Figure: the AWS services used and how they interact)

In a nutshell,

  • API Gateway is the interface between the users and AWS. It lets both entities communicate so that the system knows what the user wants to do;
  • S3 is where the photos are stored;
  • DynamoDB stores the photos’ metadata;
  • Lambda is the link between S3 and DynamoDB (a minimal sketch of this link follows). The advantage of DynamoDB is that its structure can change dynamically: if the name of an attribute changes, DynamoDB stores it as-is and we do not need to change the code.
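
To make the Lambda link concrete, here is a minimal sketch of a handler that records a photo’s metadata in DynamoDB whenever a new object lands in S3. The table, bucket and attribute names are assumptions for illustration, not Steatite’s actual ones.

```python
# Hypothetical Lambda handler: on each S3 upload, record the photo's
# metadata in DynamoDB. Table and attribute names are illustrative.
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("steatite-photos")  # assumed table name

def handler(event, context):
    """Triggered by an S3 'ObjectCreated' event."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table.put_item(Item={
            "photoId": key,                          # e.g. "originals/1234.tiff"
            "size": record["s3"]["object"]["size"],  # size in bytes
            "uploadedAt": record["eventTime"],
        })
```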

Consequently, we used this system to develop a service replicating Steatite. You can see its functionalities in the README file.

Storage class

One of the functionalities that AWS includes in its offer is changing the storage class of the data it stores. To understand what storage classes are, we need to look at how Amazon stores information: where it is, on what type of storage, in how many copies, and so on. If we take the example of the “standard” storage class, the information is stored in several datacentres, in several copies. Indeed, information can be lost to server failures or floods, so it is relevant to keep it in different places. The type of storage also matters: some servers keep information on always-powered drives, while others use magnetic media, for example. The main difference is that always-powered servers stay switched on as long as they hold information, whereas magnetic media are only powered when the data needs to be accessed. Storage that is powered all the time therefore costs more, but its access time is much shorter (retrieving data from magnetic media can even require manual handling). Thus, the storage class to choose depends on how and when we want to access the data.

In our project, we have 3 types of photos:

  • Original: it is the scanned version of the drawing. The detail is precise; therefore, it is the heaviest type of picture (20 Mo per picture). Because each drawing will realistically only be scanned once (rescanning would be too costly), it is important that we do not lose them. However, because we rarely access them, we do not need a short access time;
  • Thumbnail: it is a miniature version of the original, generated from it to give people browsing the collection an idea of the drawing. Therefore, it is very light (3 Ko) and needs frequent access;
  • Optimized: it is a lighter version of the original (3 Mo). It lets the user view the drawing at a good enough quality. Because we do not know whether the optimized version will be needed as soon as the original is uploaded, we generate it on the first request (sketched after this list).
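
As an illustration, here is what generate-on-first-request could look like. This is a minimal sketch, not the actual Steatite code: the bucket name, key layout and JPEG quality are assumptions, and it ignores the case where the original has already moved to Glacier Deep Archive (which would require a restore request first).

```python
# Sketch of generate-on-first-request for the optimized version.
import io

import boto3
from PIL import Image  # pip install Pillow

s3 = boto3.client("s3")
BUCKET = "steatite-photos"  # assumed bucket name

def get_optimized(photo_id: str) -> bytes:
    optimized_key = f"optimized/{photo_id}.jpg"
    try:
        # Return the optimized version if it already exists.
        return s3.get_object(Bucket=BUCKET, Key=optimized_key)["Body"].read()
    except s3.exceptions.NoSuchKey:
        pass
    # First request: derive the optimized version from the original.
    original = s3.get_object(Bucket=BUCKET, Key=f"originals/{photo_id}.tiff")
    image = Image.open(io.BytesIO(original["Body"].read()))
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=85)
    data = buffer.getvalue()
    # Store it so later requests are served directly.
    s3.put_object(Bucket=BUCKET, Key=optimized_key, Body=data)
    return data
```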

Therefore, we can see that there are two distinct storage needs, which means two storage classes that we will use:

  • Glacier Deep Archive for originals: critical, rarely accessed data;
  • RRS (Reduced Redundancy Storage) for thumbnails and optimized versions: noncritical, frequently accessed data (see the sketch below).
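
One possible way to express these choices in S3, as a minimal sketch (the bucket name and key prefixes are assumptions): thumbnails are uploaded directly with reduced redundancy, while originals start in the Standard class and are moved to Glacier Deep Archive by a lifecycle rule, consistent with the 180-day delay discussed below.

```python
# Minimal sketch: mapping photo types to S3 storage classes.
import boto3

s3 = boto3.client("s3")
BUCKET = "steatite-photos"  # assumed bucket name

# One-time setup: let a lifecycle rule move originals to
# Glacier Deep Archive after 180 days in the Standard class.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-originals",
            "Filter": {"Prefix": "originals/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 180, "StorageClass": "DEEP_ARCHIVE"}],
        }]
    },
)

# Noncritical, frequently accessed: reduced redundancy at upload time.
def upload_thumbnail(photo_id: str, data: bytes) -> None:
    s3.put_object(Bucket=BUCKET, Key=f"thumbnails/{photo_id}.jpg",
                  Body=data, StorageClass="REDUCED_REDUNDANCY")
```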

The following scheme shows where the information would be stored if we used datacentres in France, Ireland and the US. The thumbnails and optimized versions would have a single copy in the closest datacentre, on an always-powered server, whereas the originals would be kept in three different places on magnetic media.

(Figure: geographical distribution of the photos across the three datacentres)

This makes it possible to distribute resources intelligently and thus lower the environmental impact of our system.

Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html

Comparison

In order to compare both architectures, we use EC2, representing what is usually used, and S3 & Lambda, representing serverless. To compare the costs, we need to collect data. This is what we found out:

  • There are 100 000 drawings ready to be uploaded; if we scan the front and back of each drawing, that makes 200 000 originals;
  • 1 000 drawings arrive each year (2 000 originals);
  • Each original weighs 20 Mo;
  • Each optimized weighs 3 Mo;
  • Each thumbnail weighs 3 Ko (1 Mo = 1 000 Ko).

To simplify the comparison, we made a few assumptions:

  • The cost of Lambda is negligible: it costs 0,20 $ per million requests, and we will not make nearly enough requests for it to count, so we can leave it out of the calculations;
  • S3 requests are negligible for the same reason as Lambda’s;
  • The thumbnails are negligible: they are more than 5 000 times smaller than the originals, so they will not weigh enough to affect the calculations;
  • The optimized versions are negligible: we assume that Steatite will not be used much (perhaps 12 times per year), so the volume of pictures of that type will be nothing compared to the mass of originals;
  • The EC2 instance needs a 5 000 Go volume to store the pictures (checked in the sketch below).
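
Before computing any price, the short sketch below (plain arithmetic in Python) checks the data volumes implied by these figures: the existing originals amount to 4 000 Go, which fits in the assumed 5 000 Go EC2 volume, and the collection grows by about 40 Go per year.

```python
# Data volumes implied by the figures above (1 Go = 1 000 Mo).
N_ORIGINALS = 200_000  # 100 000 drawings, front and back
NEW_PER_YEAR = 2_000   # 1 000 new drawings per year, both sides
SIZE_ORIGINAL_MO = 20  # weight of one original, in Mo

total_go = N_ORIGINALS * SIZE_ORIGINAL_MO / 1_000
growth_go = NEW_PER_YEAR * SIZE_ORIGINAL_MO / 1_000
print(total_go)   # 4000.0 Go, within the 5 000 Go EC2 volume
print(growth_go)  # 40.0 Go of new originals per year
```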

EC2

The price of EC2 depends on its volume (how much it stores) and the instance it uses (processor and memory). To calculate the cost of the volume, we use this formula: pricePerGoPerMonth x storageNeeded x monthsUsed. Which gives us: 0,0015 USD per Go per month x 5 000 Go x 12 months = 7,5 x 12 = 90 $ per year.

For the instance, we multiply its hourly cost by the time we use it: 0,0376 $ per hour x 8 760 hours = 329,38 $ per year.

This makes a total of 419,38 $ per year.
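
As a sanity check, here is the same calculation in a few lines of Python, using the prices quoted above:

```python
# EC2 cost over one year, reproducing the figures above.
volume_price = 0.0015    # $ per Go per month
instance_price = 0.0376  # $ per hour
storage_go = 5_000
hours_per_year = 8_760

volume_cost = volume_price * storage_go * 12     # 90.0 $
instance_cost = instance_price * hours_per_year  # 329.376 $
print(round(volume_cost + instance_cost, 2))     # 419.38 $ per year
```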

S3

To calculate the cost of S3, we need to multiply the storage by the price of its storage class. Because data only moves to Glacier Deep Archive after 180 days, it is stored in the Standard storage class for the first 6 months. The Standard class costs 0,024 $ per Go per month, whereas Glacier Deep Archive costs 0,00252 $ per Go per month.

Here are the calculations that we made to evaluate the price of the S3 instance:

(Table: detailed calculations of the S3 cost)
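
To make the structure of this calculation explicit, here is a minimal sketch of the cost model with the two unit prices quoted above. It gives the order of magnitude rather than the exact figures of the detailed table:

```python
# Sketch of the S3 cost model: data spends its first 6 months in the
# Standard class before moving to Glacier Deep Archive.
STANDARD_PRICE = 0.024   # $ per Go per month (Standard)
GLACIER_PRICE = 0.00252  # $ per Go per month (Glacier Deep Archive)

def first_year_cost(volume_go: float) -> float:
    # 6 months in Standard, then 6 months in Glacier Deep Archive.
    return volume_go * 6 * (STANDARD_PRICE + GLACIER_PRICE)

def following_year_cost(volume_go: float) -> float:
    # A full year in Glacier Deep Archive.
    return volume_go * 12 * GLACIER_PRICE

print(first_year_cost(4_000))      # ~636 $ for the 200 000 originals
print(following_year_cost(4_000))  # ~121 $ per year afterwards
```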

A critique that could be made is that if we only scanned the front of each drawing, the cost would be divided by 2, because we would consider 100 000 pictures instead of 200 000. Perhaps one could question the relevance of scanning both sides: could the information written on the back of the drawing be stored in the picture’s metadata instead?

S3 vs EC2

Looking at the results, we can see that we cannot compare both architectures by analysing only one year. Indeed, because the first year of S3 is more than six times more expensive than the following ones, we need to spread its costs. To do so, we consider a 10-year plan:

  • The EC2 instance would cost 419,38 x 10 = 4 193,8 $ for 10 years;
  • The S3 instance would cost firstYear + otherYears + newData = 619,2 + 92,59 x 9 + 6 x 8 = 1 500,51 $ for 10 years (the arithmetic is checked in the snippet below).
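
These two totals are simple enough to verify; the snippet below reproduces them and the resulting ratio:

```python
# Ten-year comparison, reproducing the totals above.
ec2_total = 419.38 * 10               # EC2: 4 193,8 $
s3_total = 619.2 + 92.59 * 9 + 6 * 8  # S3: first year + other years + new data
print(ec2_total)                       # 4193.8
print(round(s3_total, 2))              # 1500.51
print(round(ec2_total / s3_total, 1))  # 2.8
```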

That means that the EC2 instance is 2,8 times more environmentally impactful than the S3 instance (if we had only considered the front of the drawings, it would have been 5,6 times more impactful). This shows that the serverless architecture is environmentally better than what is usually used and confirms that mutualisation is better for the environment.