cloudigrade use case overview and FAQs

Overview

This document serves as a general overview of cloudigrade and aims to answer a lot of commonly asked questions.

Some additional related resources you may find helpful:

recording of Brad Smith reading and discussing an older version of this document (as of 2020-05-04)
cloudigrade's glossary of jargon and acronyms
Using Cloud Meter, originally written for our decommissioned independent pilot; it may still be a good historical reference

What is cloudigrade? houndigrade?

cloudigrade is our API and asynchronous task processor for tracking use of RHEL and OpenShift in public clouds (currently only AWS). cloudigrade was envisioned to track usage in cases where the customer is not using Red Hat Insights or Subscription Management in its public cloud instances. We expect this to be a common case, especially for short-lived instances that are not configured to phone home or check for updates, such as when a customer needs to scale up for burst capacity.

houndigrade is an image with a small, short-lived process that inspects attached filesystems for specific markers that indicate RHEL presence. In the case of AWS accounts, when cloudigrade discovers that a user has an instance with a base machine image that cloudigrade does not yet know, cloudigrade initiates a sequence of steps to copy that image, run houndigrade against it, and record the results. houndigrade's operation is generally not visible to end users, and a user would only know that houndigrade had run by observing the HTTP API for images or usage data.

What are the general high-level use cases and actions relevant to cloudigrade?

create, update, and destroy cloud accounts
generate RHEL activity in AWS and observe positive findings in the HTTP APIs
generate OpenShift activity in AWS and observe positive findings in the HTTP APIs
generate not-RHEL, not-OpenShift activity in AWS and observe negative findings in the HTTP APIs
generate varying RHEL activity for multiple instances over several days and observe positive findings in the "concurrent usage" HTTP APIs

How do I create a cloudigrade account for tracking?

Accounts should be created by interacting with the sources-api or the sources web app. Objects must be created in sources-api with specific types and relationships in order for cloudigrade to successfully verify access and create its account. See Sources Integration for detailed examples.

For a (possibly outdated by the time you see it) demonstration of how to interact with sources-api to create (and destroy) those objects, see the recording Disable and enable a cloud account.

IMPORTANT NOTE Interacting via sources-api is asynchronous. cloudigrade reads periodically from Kafka topics and processes messages as soon as possible, but this may not immediately follow when you interact with the sources web app. Furthermore, if cloudigrade fails to create an account, as a user you may not see a result unless you actively (and periodically) reload the sources web app and specifically look for the application's status. Upon failure, cloudigrade will attempt to notify sources-api to set the application's status to "unavailable" with an appropriate error message.

Some of the various side effects of interacting with sources-api and the HTTP API for accounts are documented in Managing CloudAccounts.

How does cloudigrade authenticate with the user's AWS account?

The sources-api authentication object must have an AWS ARN for an IAM Role that has an IAM Policy applied to grant cloudigrade's primary AWS account permission to perform specific tasks in the user's AWS account. The current required IAM Policy definition can be retrieved from the sysconfig API in the aws_policies.traditional_inspection element of its response. For example:

http --auth "${REDHAT_USERNAME}:${REDHAT_PASSWORD}" https://cloud.redhat.com/api/cloudigrade/v2/sysconfig/ | jq .aws_policies.traditional_inspection
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CloudigradePolicy",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeImages",
        "ec2:DescribeInstances",
        "ec2:ModifySnapshotAttribute",
        "ec2:DescribeSnapshotAttribute",
        "ec2:DescribeSnapshots",
        "ec2:CopyImage",
        "ec2:CreateTags",
        "ec2:DescribeRegions",
        "cloudtrail:CreateTrail",
        "cloudtrail:UpdateTrail",
        "cloudtrail:PutEventSelectors",
        "cloudtrail:DescribeTrails",
        "cloudtrail:StartLogging",
        "cloudtrail:DeleteTrail"
      ],
      "Resource": "*"
    }
  ]
}
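As a rough illustration of how a policy document like the one above could be checked before it is attached to a role, here is a minimal Python sketch. The function name `missing_actions` is hypothetical and not part of cloudigrade, and the action list is copied from the example response above; the live sysconfig response remains the authoritative source. The sketch compares action names literally and does not expand wildcards such as `ec2:*`.

```python
# Actions copied from the example sysconfig response above (may be outdated).
REQUIRED_ACTIONS = frozenset({
    "ec2:DescribeImages",
    "ec2:DescribeInstances",
    "ec2:ModifySnapshotAttribute",
    "ec2:DescribeSnapshotAttribute",
    "ec2:DescribeSnapshots",
    "ec2:CopyImage",
    "ec2:CreateTags",
    "ec2:DescribeRegions",
    "cloudtrail:CreateTrail",
    "cloudtrail:UpdateTrail",
    "cloudtrail:PutEventSelectors",
    "cloudtrail:DescribeTrails",
    "cloudtrail:StartLogging",
    "cloudtrail:DeleteTrail",
})


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement lists explicitly.

    Hypothetical helper: wildcards and Resource restrictions are ignored.
    """
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            # IAM allows a single action to be given as a plain string.
            actions = [actions]
        granted.update(actions)
    return set(required) - granted
```

An empty result means every action from the example policy is explicitly allowed.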

How does cloudigrade track instance activity in AWS?

Upon initial account creation, cloudigrade performs a thorough "describe" operation in AWS to discover any currently running instances. Following that initial description, cloudigrade never actively "scans" for activity in the customer's account; instead, it relies on AWS CloudTrail to notify it of new activity with EC2 instances.

VERY IMPORTANT AWS CLOUDTRAIL DETAIL: AWS CloudTrail is not fast. On average we have observed 20 minutes between performing an action (start, stop, terminate) on an instance and seeing that action appear in CloudTrail. Sometimes that may be faster; sometimes that may be slower. AWS CloudTrail is a wholly AWS-managed service and we have no control over the speed at which it delivers information.
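To make the CloudTrail-driven approach more concrete, the following Python sketch turns a single CloudTrail record into per-instance power events. It is not cloudigrade's actual handler: the function name, the `power_on`/`power_off` labels, and the choice of which EC2 event names to watch are assumptions made for illustration, and the record shape is a simplified version of what CloudTrail delivers.

```python
# EC2 API event names assumed to matter for usage tracking (illustrative).
POWER_ON_EVENTS = {"RunInstances", "StartInstances"}
POWER_OFF_EVENTS = {"StopInstances", "TerminateInstances"}


def extract_instance_events(record):
    """Return (instance_id, kind, event_time) tuples for one CloudTrail record.

    Records for unrelated API calls, or records without instance details,
    yield an empty list.
    """
    name = record.get("eventName")
    if name in POWER_ON_EVENTS:
        kind = "power_on"
    elif name in POWER_OFF_EVENTS:
        kind = "power_off"
    else:
        return []
    # responseElements can be null when the API call failed.
    response = record.get("responseElements") or {}
    items = (response.get("instancesSet") or {}).get("items") or []
    return [
        (item["instanceId"], kind, record.get("eventTime"))
        for item in items
        if "instanceId" in item
    ]
```

Note that `eventTime` is when the action happened, not when CloudTrail delivered the record, so the delivery delay described above affects when cloudigrade learns about an event rather than the timestamp it records.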

How does cloudigrade track OpenShift presence for AWS instance activity?

cloudigrade relies exclusively on the user to apply a tag named cloudigrade-ocp-present to the EC2 images they intend to use for OpenShift. This is obviously not an air-tight solution, but we have been instructed to trust the user to act in good faith.

cloudigrade looks for that tag during the initial describe, when CloudTrail indicates that a new image has been encountered from instance activity, and when CloudTrail indicates that a user has added or removed the tag from an image.
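A minimal Python sketch of the tag check follows. It assumes the image is a dictionary shaped like one entry of the EC2 DescribeImages response, where tags appear as a `Tags` list of `Key`/`Value` pairs; the helper name `has_openshift_tag` is hypothetical. Only the presence of the tag key is checked, since the text above does not say the value matters.

```python
OPENSHIFT_TAG = "cloudigrade-ocp-present"


def has_openshift_tag(image):
    """Return True if the image description carries the OpenShift tag key."""
    # Untagged images may omit "Tags" entirely, so fall back to an empty list.
    tags = image.get("Tags") or []
    return any(tag.get("Key") == OPENSHIFT_TAG for tag in tags)
```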

How does cloudigrade track RHEL presence for AWS instance activity?

cloudigrade has two methods for tracking RHEL. As with OpenShift, it looks for a tag named cloudigrade-rhel-present on EC2 images; this follows the same process as before. However, the more interesting process involves copying the user's image into our account and running houndigrade against it.

What follows is a slight simplification of the actual process, but this should cover the main points.

VERY IMPORTANT AWS EC2 DETAIL: We copy the customer's image snapshot from their AWS account into our AWS account. This may take seconds, a few minutes, or many minutes depending on unknown factors in AWS. We have no control over and very little visibility into that process, but we do log progress (for example, in check_snapshot_state) as we attempt to proceed from the copy for up to one hour before giving up and raising SnapshotNotReadyException.
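The "keep checking for up to one hour, then give up" behavior can be sketched as a polling loop with a deadline. This is only an illustration and is not how cloudigrade implements it: the `wait_for_snapshot` function, its parameters, and the 30-second poll interval are assumptions, and the exception class here is a local stand-in that merely borrows the name mentioned above. The state strings follow the EC2 snapshot states `pending`, `completed`, and `error`.

```python
import time


class SnapshotNotReadyException(Exception):
    """Local stand-in for the exception named in the text above."""


def wait_for_snapshot(get_state, timeout_seconds=3600, poll_seconds=30,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until the snapshot completes or the deadline passes.

    get_state is a caller-supplied callable returning the snapshot state.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state == "completed":
            return
        if state == "error":
            raise SnapshotNotReadyException("snapshot copy failed")
        if clock() >= deadline:
            raise SnapshotNotReadyException(
                f"snapshot still {state!r} after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```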

After the image snapshot successfully copies, we create a volume from it and add that volume ID to a ready_volumes queue.

Independent of that process so far, a periodic task (scale_up_inspection_cluster) runs hourly (or on the schedule set by HOUNDIGRADE_ECS_SCALE_UP_CLUSTER_SCHEDULE), looks for any items in that ready_volumes queue, and uses their IDs as inputs to run houndigrade. houndigrade then runs in our AWS account and writes its results to an inspection_results queue.

Another periodic task runs once every five minutes (or on the schedule set by HOUNDIGRADE_ECS_PERSIST_INSPECTION_RESULTS_SCHEDULE) to read messages from that inspection_results queue, scale down the houndigrade inspection cluster, and record results to the cloudigrade DB.
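The hand-off between the two periodic tasks can be pictured with a small in-memory model. This sketch is a deliberate simplification under several assumptions: the queues are plain `deque` objects rather than real message queues, the `inspect` callable stands in for houndigrade and runs synchronously, the `persist_inspection_results` name and the result fields (`volume_id`, `rhel_found`) are hypothetical, and cluster scale-down is omitted.

```python
from collections import deque


def scale_up_inspection_cluster(ready_volumes, inspection_results, inspect):
    """Drain ready_volumes and inspect each volume (hourly task in the text)."""
    volume_ids = list(ready_volumes)
    ready_volumes.clear()
    for volume_id in volume_ids:
        # In reality houndigrade runs separately and posts results later.
        inspection_results.append(inspect(volume_id))
    return len(volume_ids)


def persist_inspection_results(inspection_results, database):
    """Drain inspection_results into a dict standing in for the DB."""
    count = 0
    while inspection_results:
        result = inspection_results.popleft()
        database[result["volume_id"]] = result["rhel_found"]
        count += 1
    return count
```

Because the two tasks share only the queues, a volume that becomes ready just after the hourly task runs waits for the next run, which is one reason results can lag by an hour or more.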

What kind of activity will show up as "RHEL positive" in cloudigrade?

running an instance with an image that has the cloudigrade-rhel-present tag
running an instance with a user-owned image that we can read (via mount -t auto) and in which at least one of the following is true:
- image has known RHEL signed packages
- image has known RHEL product certificates
- image has RHEL /etc/release file
- image has enabled RHEL dnf/yum repos
running an instance with a Cloud Access image
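The list above amounts to a logical OR over several signals, which the following Python sketch expresses. The `InspectionFindings` class, its field names, and the `is_rhel_positive` function are hypothetical names chosen for this example; they do not mirror cloudigrade's models. A missing `findings` value represents an image that could not be inspected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InspectionFindings:
    """Markers an inspection may find on a readable image (names assumed)."""
    signed_packages: bool = False
    product_certificates: bool = False
    release_file: bool = False
    enabled_repos: bool = False


def is_rhel_positive(has_rhel_tag: bool, is_cloud_access: bool,
                     findings: Optional[InspectionFindings] = None) -> bool:
    """Return True if any of the listed RHEL signals is present."""
    if has_rhel_tag or is_cloud_access:
        return True
    if findings is None:
        # The image was never inspected or could not be read.
        return False
    return any((
        findings.signed_packages,
        findings.product_certificates,
        findings.release_file,
        findings.enabled_repos,
    ))
```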

What kind of activity will NOT show up as "RHEL positive" in cloudigrade?

running an instance with an image from the AWS Marketplace
running an instance with an encrypted EC2 image
running an instance with an image shared from a third party that does not allow us to read it
running an instance with an image using a filesystem that is not handled by mount -t auto
- Filesystems using LVM may not be recognized.
- Volumes with no partition table may not be recognized.
- Filesystems with non-Linux formats (MS-DOS, Apple File System) may not be recognized.

Why do we ignore AWS Marketplace images, even if they (probably) have RHEL?

cloudigrade as a product only wants to know about RHEL usage when the customer has (or should have) a direct relationship with Red Hat. In the case of Marketplace and shared images, the customer's relationship is with AWS or a third-party vendor, not Red Hat. Presumably the customer is paying someone else for their RHEL subscription, in which case cloudigrade has no interest in tracking it.

What other Red Hat-relevant information does cloudigrade track?

If we proceed and succeed with the full image copy inspection process through houndigrade, we also attempt to find and store the RHEL version and full system purpose (/etc/rhsm/syspurpose/syspurpose.json).

What other general instance information does cloudigrade track?

From AWS, cloudigrade stores the region and EC2 instance type for each running instance it encounters. From that type, cloudigrade knows how much memory (in GBs) and how many (virtual) CPU cores were used. cloudigrade does not know and cannot know the number of CPU sockets because that information is not available and is nonsensical in a public cloud setting.
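Deriving memory and vCPU counts from the instance type is essentially a table lookup, as in this Python sketch. The table holds only three well-known types with their published AWS values (AWS states memory in GiB), and the `describe_capacity` helper is a hypothetical name; how cloudigrade stores its full mapping is not shown here.

```python
# instance type -> (vCPU count, memory in GiB), per AWS published specs.
INSTANCE_TYPES = {
    "t2.micro": (1, 1.0),
    "m5.large": (2, 8.0),
    "c5.xlarge": (4, 8.0),
}


def describe_capacity(instance_type):
    """Return vCPU and memory for a known type, or None if it is unknown."""
    spec = INSTANCE_TYPES.get(instance_type)
    if spec is None:
        return None
    vcpu, memory_gib = spec
    return {"vcpu": vcpu, "memory_gib": memory_gib}
```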

What does the `concurrentusage` API return?

The concurrent usage API looks at whole days over some requested period of time, and for each day, it determines the maximum count of concurrent RHEL-identified instances. The concurrent RHEL instances are grouped by various combinations of system role, sla, architecture, service_type, and usage.
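The "maximum concurrent instances in a day" calculation can be sketched with a sweep over start and stop times. This is an illustration of the idea rather than cloudigrade's implementation: the `max_concurrent` function is hypothetical, datetimes are naive and assumed to be UTC, an end of `None` means the instance is still running, and the grouping by role, SLA, and the other dimensions is left out. When one instance stops at the same instant another starts, this sketch does not count them as concurrent, which is a design choice of the example.

```python
from datetime import date, datetime, time, timedelta


def max_concurrent(runs, day):
    """Return the peak number of overlapping runs within the given day.

    runs is an iterable of (start, end) datetimes; end may be None.
    """
    day_start = datetime.combine(day, time.min)
    day_end = day_start + timedelta(days=1)
    events = []
    for start, end in runs:
        # Clip each run to the day; drop runs that do not touch it.
        clipped_start = max(start, day_start)
        clipped_end = day_end if end is None else min(end, day_end)
        if clipped_start < clipped_end:
            events.append((clipped_start, 1))
            events.append((clipped_end, -1))
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a stop is processed before a simultaneous start.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Calling this once per day in the requested period, and once per group of instances, would yield the per-day maximums the API describes.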

Why am I not seeing the data I expect?

In many situations, it's because AWS CloudTrail is slow, EC2 image copying is slow, or one of the periodic tasks for inspection has not yet run.

IMPORTANT NOTE ABOUT EXPECTATIONS: cloudigrade is not intended to be a real-time reporting tool, specifically because of the aforementioned unknown delays in AWS. The general expected "turnaround time" for results is on the order of hours, and our main consumer (tally service) expects data at a "next day" time frame. Yes, this poses some challenges for integration testing.