CloudTides Elastic Platform on Idle Cloud Resources - ji-it/CloudTides GitHub Wiki

Overview

Enterprises that run private clouds usually size their capacity to meet average or peak usage. When usage drops, the resources typically keep running but sit idle. Even when idle, they still consume power, cooling, space, and operational labor, and those costs are wasted. Given the large number of companies with private clouds worldwide, the total idle capacity and the resulting waste are substantial.

On the other side, many organizations need computing resources, for example non-profit organizations (NPOs) and scientific research organizations. They may not have the budget to purchase all of the capacity they need, or procurement and installation may take a long time. They therefore have a strong need to use resources that already exist elsewhere. I would like to propose this project to connect the needs of such organizations with the idle resources in enterprises.

Although we don't have market data on how large the demand from such organizations is, existing solutions already serve this need. The well-known SETI@home project run by UC Berkeley is a good example. SETI@home asks people to donate idle time on their personal devices, from home labs to desktops, laptops, and mobile devices. The project has been running for more than 20 years, and at its peak more than 1 million devices were contributing computing resources. This kind of computing is known as volunteer computing. However, SETI@home only includes personal devices. It has collaborated with companies such as IBM to run SETI@home clients on employees' devices, but no company contributes private cloud resources to SETI@home.

In addition, SETI@home uses the BOINC program to distribute jobs to individual devices, and about 30 science projects use BOINC today. Our project aims to support BOINC so that idle private cloud resources can be contributed to a variety of cutting-edge science research projects around the world.

Organizations and programs like SETI@home are the demanders in our project, while enterprises with private clouds are the suppliers. Because VMware's products are the most widely used virtualization software on the market, the project assumes that suppliers run vSphere in their private cloud, so we can use the vSphere API to communicate with it.

From a single enterprise's viewpoint, it can contribute idle resources to multiple science research projects; from a research project's viewpoint, it can distribute computing jobs to multiple enterprises. We also want a stats center to collect usage data.

[Figure: CloudTides overview]

The Architecture

Project Tides is the middleman between the demander and the supplier. It will have the following modules.

  • Resource Registration
    This module will provide an interface to register or deregister idle resources. It will allow administrators to add, remove, update, enable, or disable private cloud resources in the platform. For now, it should support vSphere (vCenter/ESXi/VCD) as the cloud provider and use the vSphere API or vMomi API to connect to the infrastructure.
  • Resource Monitor
    Once a resource is registered in the platform, we need to continuously monitor usage on the target resource. The vSphere/vMomi API provides an interface to query real-time usage data; we need to poll that data and pass it to a queue. Other modules can then kick off the corresponding work based on the defined policies. This module should let users choose whether it runs in a K8S cluster or on static worker VMs, depending on the number of workloads.
  • Contribution Policy Manager
    This module will define the 'idle' state for the registered cloud infrastructure. For example, CPU usage below 30% can be defined as idle and trigger provisioning of contribution workloads. It should also define when to stop deploying new workloads and when to destroy existing ones. Different resources may have different thresholds.
  • Workload Manager
    This module will execute the deployment and destroy jobs. It will monitor the usage data in the queue and compare it with the contribution policy. Once a policy is satisfied, it will execute the defined operation. This module should let users choose whether it runs in a K8S cluster or on static worker VMs, depending on the number of workloads.

    Generally, an operation includes the following steps:
    • Start a workload
      • Deploy a VM in the vSphere environment based on the prepared template.
      • Log in to the VM and kick off the computing job.
      • Record the details of the workloads in the platform.
    • Stop a workload
      • Log in to the VM and stop/terminate the computing job.
      • Destroy the VM in the vSphere environment.
      • Record the details of the job.
  • Template Manager
    This module controls the VM template files in the platform. All computing jobs need to run in VMs, and those VMs should be deployed from these templates. In a template, administrators can install the software and patches needed to comply with the enterprise security policy/guideline.
  • Contribution Manager
    In this module, the administrator can define which external project/site the enterprise wants to contribute the idle resources to.
  • Cost Estimator
    This module will try to estimate the value the enterprise has contributed to the research projects.
  • Dashboard
    The dashboard will provide a UI for the above modules and show the contribution history and the current status of the workloads.

[Figure: CloudTides architecture]
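
To make the interaction between the Contribution Policy Manager and the Workload Manager more concrete, here is a minimal sketch of how one usage sample from the queue might be mapped to an action. The class names, the `destroy_above` threshold, and the 70% value are assumptions for illustration; only the 30% idle example comes from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DEPLOY = "deploy"    # resource is idle: start new contribution VMs
    HOLD = "hold"        # keep running VMs, but deploy nothing new
    DESTROY = "destroy"  # resource is busy: tear down contribution VMs


@dataclass(frozen=True)
class ContributionPolicy:
    # Hypothetical thresholds, expressed as fractions of total CPU capacity.
    idle_below: float = 0.30     # below this, the resource counts as idle
    destroy_above: float = 0.70  # above this, contribution VMs are removed


def decide(policy: ContributionPolicy, cpu_usage: float) -> Action:
    """Map one CPU usage sample (0.0 to 1.0) to a workload action."""
    if not 0.0 <= cpu_usage <= 1.0:
        raise ValueError("cpu_usage must be between 0.0 and 1.0")
    if cpu_usage < policy.idle_below:
        return Action.DEPLOY
    if cpu_usage > policy.destroy_above:
        return Action.DESTROY
    return Action.HOLD


# Different resources may carry different thresholds.
policies = {
    "cluster-a": ContributionPolicy(),
    "cluster-b": ContributionPolicy(idle_below=0.20, destroy_above=0.60),
}

# Stand-in for samples the Resource Monitor would push onto the queue.
samples = [("cluster-a", 0.12), ("cluster-b", 0.25), ("cluster-a", 0.85)]

for resource, usage in samples:
    print(resource, usage, decide(policies[resource], usage).value)
```

Keeping a gap between the two thresholds is one way to avoid deploying and destroying VMs in quick succession when usage hovers around a single cutoff. The measured usage would also include the load of the contribution VMs themselves, so real thresholds would need to account for that.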

The Contribution Workloads

In the current phase, contribution workloads will only run the BOINC program and contribute to BOINC-based projects. Each VM will therefore be a BOINC client configured to point to a specific BOINC server. We will provide a base template with the BOINC program installed, and administrators can customize it to add more tools as needed.

Once a BOINC client VM is started, the platform will log in to the guest OS through SSH or VMTools to configure the BOINC client and start it. When the VM has to be destroyed, the platform will stop the client through SSH or VMTools and then delete the VM. CPU and memory usage can be collected either at runtime or just before the VM is deleted.
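
As a rough illustration of the SSH path, the sketch below builds the guest commands the platform might run. It assumes the template already runs the BOINC client daemon and that `boinccmd` can authenticate to it; `--project_attach` and `--quit` are standard `boinccmd` options, while the helper names, host, user, URL, and account key are hypothetical. The functions only build command lines and do not execute anything.

```python
import shlex


def start_commands(project_url: str, account_key: str) -> list[str]:
    """Guest shell commands to attach a freshly deployed client to a project."""
    return [
        shlex.join(["boinccmd", "--project_attach", project_url, account_key]),
    ]


def stop_commands() -> list[str]:
    """Guest shell commands to shut the client down before the VM is destroyed."""
    return [shlex.join(["boinccmd", "--quit"])]


def over_ssh(host: str, user: str, command: str) -> list[str]:
    """Wrap one guest command in an ssh argv list, e.g. for subprocess.run."""
    return ["ssh", f"{user}@{host}", command]


# Hypothetical VM address, user, project URL, and account key.
for cmd in start_commands("https://boinc.example.org/demo/", "KEY123"):
    print(over_ssh("10.0.0.5", "tides", cmd))
for cmd in stop_commands():
    print(over_ssh("10.0.0.5", "tides", cmd))
```

The VMTools path would need a different transport but could reuse the same guest commands.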

Each deployed BOINC VM should have a small CPU and memory configuration. We can deploy multiple VMs to fully utilize the idle resources and allow different VMs to take jobs from different BOINC servers. A stretch goal is to explore whether we can spawn diskless VMs, which would leave zero footprint on the private cloud.
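
The sizing idea can be sketched as follows: given a target utilization ceiling, work out how many small VMs fit into the remaining headroom, then spread them across BOINC servers. The function names, the round-robin assignment, and the numbers are assumptions for illustration, not part of the design above.

```python
import math
from itertools import cycle


def vms_to_deploy(total_cores: int, used_cores: float,
                  target_utilization: float, vm_cores: int) -> int:
    """Number of small VMs that fit before usage reaches the target ceiling."""
    if vm_cores <= 0:
        raise ValueError("vm_cores must be positive")
    headroom = total_cores * target_utilization - used_cores
    return max(0, math.floor(headroom / vm_cores))


def assign_servers(vm_count: int, server_urls: list[str]) -> list[str]:
    """Round-robin assignment of VMs to BOINC servers (an assumed strategy)."""
    if vm_count > 0 and not server_urls:
        raise ValueError("at least one server URL is required")
    servers = cycle(server_urls)
    return [next(servers) for _ in range(vm_count)]


# Example: a 64-core cluster with 16 cores busy, a 70% ceiling, 2-core VMs.
count = vms_to_deploy(total_cores=64, used_cores=16,
                      target_utilization=0.7, vm_cores=2)
print(count, assign_servers(count, ["https://a.example.org/", "https://b.example.org/"]))
```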

The Security

The Supplier Side

One of the enterprise's key concerns will be security. How can we ensure the security of the existing corporate cloud and intranet, and how can we ensure the security of the other workloads on the private cloud?

Enterprises want to make sure that no malicious jobs or job executors run in their private infrastructure.

  • This platform does not ship with any job executor, but it will try to provide a list of trusted organizations as a reference/recommendation to the enterprise. The administrator decides which organization to contribute to, and can then either use our prepared templates for BOINC-based projects or build their own templates.
  • For BOINC-based projects, the BOINC program is open source and has a signing mechanism to ensure the job executor has not been modified.

In addition, many enterprises have very restrictive controls on internet access, or even no internet access at all from the private cloud. How can Project Tides run workloads under such constraints?

We have a few suggestions to address different levels of security concern.

  1. Run the Project Tides platform in the DMZ
    The Project Tides platform needs internet access, so it has to run in the DMZ and have direct access to the internet.
  2. Contribute idle resources in the DMZ
    To get started with this platform, the enterprise can begin by contributing some idle resources in the DMZ. VMs in the DMZ can access the BOINC server directly and perform computing jobs just like a normal individual device.
  3. Set up a BOINC proxy and only allow VMs to connect to that proxy
    When the VMs are in the intranet or the enterprise does not want/allow direct internet access, the BOINC VMs can connect to a proxy to communicate with the BOINC server. Administrators can add firewall rules in the template to restrict access.
  4. Completely disable/delete the NIC on the VM and use VMTools to load the data and read the results
    A stretch goal is to explore VMs without a NIC. Such VMs would have no network exposure. Project Tides would create such a VM and download the computing job to it through VMTools. It would need to monitor the status of the job and retrieve the results once it is done. The challenge for this solution is caching a few jobs on the Project Tides platform and distributing them when needed. This might require some changes to the BOINC program.

The Demand Side

The research organization also wants to ensure that jobs are not hijacked and that results are not falsified. BOINC already implements such protections, so we can rely on BOINC to provide them.

The Global Stats Center

Because each Project Tides platform runs inside an enterprise network, each platform needs to report its stats data to a central global server so that we know how many resources each platform has contributed and can compute the total value of the contributions. We therefore need such a service running in the public domain. It can also publish important information, e.g. updates, new templates, and newly supported research projects.
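
As an illustration only, a report from one platform to the stats center might be serialized like this. The wiki does not define a schema, so every field name, the identifier, and the values below are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ContributionReport:
    # All field names are hypothetical; no schema is defined in this document.
    platform_id: str
    project_url: str
    period_start: str  # ISO 8601 timestamp
    period_end: str    # ISO 8601 timestamp
    vm_count: int
    cpu_core_hours: float
    estimated_value_usd: float


def to_payload(report: ContributionReport) -> str:
    """Serialize one report as JSON for upload to the stats center."""
    return json.dumps(asdict(report), sort_keys=True)


report = ContributionReport(
    platform_id="enterprise-001",
    project_url="https://boinc.example.org/demo/",
    period_start="2020-01-01T00:00:00Z",
    period_end="2020-01-02T00:00:00Z",
    vm_count=12,
    cpu_core_hours=576.0,
    estimated_value_usd=17.28,
)
print(to_payload(report))
```

Sending only aggregated counters like these, rather than raw per-VM usage, would limit what leaves the enterprise network.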

Misc

Power Consumption

Before an enterprise decides to contribute resources, it may want to know the additional cost of the contribution. Space, cooling, and labor costs won't change because they are sunk costs. The power cost is less clear. We want to investigate the correlation between resource usage and power usage. If no such data is already available, we need to collect it from different server models. With that data, we can then recommend a few usage thresholds to the administrator.
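
Until measured data is available, a first-order estimate could assume that server power grows linearly with CPU utilization between an idle draw and a maximum draw. That linear model is an assumption, not a result of this project, and the wattages and electricity price below are made-up inputs.

```python
def power_watts(utilization: float, idle_watts: float, max_watts: float) -> float:
    """Assumed linear model: idle draw plus a share of the idle-to-max range."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return idle_watts + (max_watts - idle_watts) * utilization


def extra_energy_cost(before: float, after: float, idle_watts: float,
                      max_watts: float, hours: float,
                      price_per_kwh: float) -> float:
    """Added electricity cost of raising utilization from `before` to `after`."""
    extra_watts = (power_watts(after, idle_watts, max_watts)
                   - power_watts(before, idle_watts, max_watts))
    return extra_watts / 1000.0 * hours * price_per_kwh


# Hypothetical server: 200 W idle, 500 W at full load, 0.10 per kWh.
cost = extra_energy_cost(before=0.2, after=0.7, idle_watts=200.0,
                         max_watts=500.0, hours=24.0, price_per_kwh=0.10)
print(f"extra cost per server per day: {cost:.2f}")
```

If collected data shows a non-linear relationship for some server models, only `power_watts` would need to change.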

Cost Estimation

Once an enterprise starts to contribute, it will want to know how much value it has contributed to each research project it participates in. The cost will include power, cooling, space, labor, and any other costs the enterprise wants to count. We need to offer an interface for the administrator to define the cost variables.
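
One possible shape for those cost variables is a set of per-core-hour rates that the administrator fills in. The sketch below assumes that unit, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostVariables:
    # Hypothetical administrator-defined rates, each per CPU core-hour.
    power: float = 0.0
    cooling: float = 0.0
    space: float = 0.0
    labor: float = 0.0
    other: float = 0.0

    def rate_per_core_hour(self) -> float:
        return self.power + self.cooling + self.space + self.labor + self.other


def contributed_value(costs: CostVariables,
                      core_hours_by_project: dict[str, float]) -> dict[str, float]:
    """Estimated value contributed to each project, keyed by project name."""
    rate = costs.rate_per_core_hour()
    return {project: hours * rate
            for project, hours in core_hours_by_project.items()}


costs = CostVariables(power=0.02, cooling=0.01)
usage = {"project-a": 100.0, "project-b": 250.0}
for project, value in contributed_value(costs, usage).items():
    print(f"{project}: {value:.2f}")
```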

BOINC Improvement

BOINC is designed to distribute workloads to individual devices. Each device is relatively small, and its availability can change unpredictably: it may power on or off or disconnect at any time. An enterprise private cloud, by contrast, behaves more stably and offers larger capacity as a single entity. The Project Tides platform can act as a middleman that requests jobs and reports job results in batches. There may be changes we can investigate to improve the BOINC code and then contribute back to BOINC.