How Can a Linux Server Have a Load above 5 with a High Percentage of CPU Being Idle?

Problem scenario
You are using Linux. You run top and see this:

top - 21:38:04 up 14 days, 12:12,  1 user,  load average: 5.30, 5.10, 4.90
Tasks:  54 total,   15 running,  39 sleeping,   0 stopped,   0 zombie
%Cpu(s):  11.7 us,  13.7 sy,  0.0 ni, 74.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

How can the load average be so high with so much CPU being idle?

Solution
The server has many CPUs. The load average counts the processes that are running or waiting to run (on Linux it also counts processes in uninterruptible sleep, such as those waiting on disk I/O), so it should be read relative to the number of CPUs or cores. On a server with a single-core CPU, a load of 1 means the CPU is fully busy, and anything above 1 means processes are waiting their turn. A server with 8 CPUs can have a load under 8 and still be operating below its maximum capacity. A server with 20 CPUs could have a load of 5.3 and a tremendous amount of CPU idle.
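
One way to put the numbers from top in context is to divide the load average by the number of CPUs. The following is a minimal sketch, assuming a Linux system that provides nproc and /proc/loadavg. The function name load_per_cpu is hypothetical and is used only for illustration.

```shell
# Hypothetical helper: divide a load average by a CPU count.
# Usage: load_per_cpu LOAD CPUS
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Compare the 1-minute load average with the number of CPUs on this machine.
# The guard skips the check on systems without /proc/loadavg or nproc.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r one_minute _ < /proc/loadavg
  echo "load per CPU: $(load_per_cpu "$one_minute" "$(nproc)")"
fi
```

A result well below 1.00 suggests the machine has spare CPU capacity, which is consistent with a high idle percentage in top.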

For more information, see this posting.

How Do You Find Out what kubectl Commands Correspond with Revision Numbers in the “rollout history”?

Problem scenario
When you run a kubectl rollout history deployment foobar command, the "CHANGE-CAUSE" is empty for certain rows (revision numbers). What can you do to keep this column with useful descriptions (or relevant commands)?

Possible Solution #1
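Going forward, use the --record flag when running kubectl create [deployment] or kubectl set image deployment … commands. Note that --record has been marked as deprecated in recent versions of kubectl, so one of the annotation-based solutions below may be a better long-term choice.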

Possible Solution #2
Manually edit the manifest of the resource (the deployment) and set the kubernetes.io/change-cause annotation in its metadata. The source of this one is the main Kubernetes website.

Possible Solution #3
Going forward, use the kubectl annotate command to set the kubernetes.io/change-cause annotation on the deployment after each change. See this page for more information.
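
As an illustration of the annotate approach, here is a minimal sketch. It assumes kubectl is configured for your cluster. The function name set_change_cause is hypothetical; kubernetes.io/change-cause is the annotation that the CHANGE-CAUSE column is populated from.

```shell
# Hypothetical helper: record a description for the current revision of a
# deployment by setting the kubernetes.io/change-cause annotation.
# Usage: set_change_cause DEPLOYMENT "DESCRIPTION"
set_change_cause() {
  kubectl annotate deployment "$1" \
    kubernetes.io/change-cause="$2" --overwrite
}
```

For example, after changing the deployment you could run set_change_cause foobar "updated container image" and then check the result with kubectl rollout history deployment foobar.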

Possible Solution #4 (This is related, but not strictly relevant to the question because the revision number would not appear in this instance.)
If you do not see older revisions, and want to see more history, remember that there is a configurable limit. revisionHistoryLimit in the deployment resource is set to a default value of 10 (according to this external site). You can change this value. Here is an example of changing the value.
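
One possible way to change the value is with kubectl patch. This is a sketch under the assumption that the deployment already exists; the function name set_history_limit and the limit of 20 in the usage note are hypothetical.

```shell
# Hypothetical helper: set spec.revisionHistoryLimit on a deployment.
# Usage: set_history_limit DEPLOYMENT LIMIT
set_history_limit() {
  kubectl patch deployment "$1" \
    --patch "{\"spec\":{\"revisionHistoryLimit\":$2}}"
}
```

For example, set_history_limit foobar 20 would ask Kubernetes to keep up to 20 old revisions of the foobar deployment.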

How Do You Troubleshoot the Error “Unable to connect to the server: dial tcp: lookup … eks.amazonaws.com on x.x.x.x:53: no such host” when Running kubectl Commands?

Problem scenario
kubectl commands are failing with an error message like this: "Unable to connect to the server: dial tcp: lookup ABCD123EFG.gr7.us-west-1.eks.amazonaws.com on x.x.x.x:53: no such host"

How do you get kubectl commands to work?

Possible Solution #1
If you are using EKS and you have the name of the cluster (e.g., "foobar"), run a command like this:

aws eks update-kubeconfig --name foobar
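
Before or after refreshing the kubeconfig, it can help to see which hostname kubectl is actually trying to resolve. The sketch below assumes the current kubeconfig context points at the cluster in question; the function name current_api_host is hypothetical.

```shell
# Hypothetical helper: print the API server hostname from the current
# kubeconfig context (scheme, port, and path are stripped).
current_api_host() {
  kubectl config view --minify \
    --output 'jsonpath={.clusters[0].cluster.server}' \
    | sed -e 's|^[a-z]*://||' -e 's|[:/].*$||'
}
```

You could then test name resolution with nslookup "$(current_api_host)". If the hostname does not resolve, the kubeconfig may refer to a cluster endpoint that no longer exists, in which case the update-kubeconfig command above may help.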

Possible Solution #2

  1. Find what VPC your EC2 instance is in. Note the x.x.x.x IP address in the error message above; it is the DNS server that was queried. One way to find the relevant VPC is to use the AWS CLI (link to this posting) and run this command: aws ec2 describe-vpcs
     Look at the output. Find the VpcId value of the VPC whose CIDR block contains the x.x.x.x IP address.
  2. In the web UI, go here (but replace "us-west-1" with the region seen in the error):
    https://us-west-1.console.aws.amazon.com/vpc/home?region=us-west-1#vpcs:
  3. Click on the VPC that was determined in step #1.
  4. Make sure DNS resolution and DNS hostnames are both enabled. These are two separate settings of a VPC.
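
The same two VPC settings can also be enabled from the AWS CLI instead of the web UI. This is a minimal sketch, assuming the AWS CLI is configured with permission to modify the VPC; the function name enable_vpc_dns and the VPC ID in the usage note are hypothetical.

```shell
# Hypothetical helper: enable DNS resolution and DNS hostnames on a VPC.
# Usage: enable_vpc_dns VPC_ID
enable_vpc_dns() {
  # modify-vpc-attribute accepts only one attribute per call.
  aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}'
  aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}
```

For example, enable_vpc_dns vpc-0abc would enable both settings on a VPC with that ID. You can confirm a setting with aws ec2 describe-vpc-attribute --vpc-id vpc-0abc --attribute enableDnsSupport.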

Possible Solution #3
Try redeploying the Kubernetes cluster. The cluster itself may have issues. If you want explicit directions on creating a Kubernetes cluster in EKS, Azure, or Google Cloud Platform, see these postings:

https://www.continualintegration.com/miscellaneous-articles/how-do-you-use-amazon-elastic-kubernetes-service-with-the-cli/
https://www.continualintegration.com/miscellaneous-articles/how-do-you-deploy-a-kubernetes-cluster-to-azure/
https://www.continualintegration.com/miscellaneous-articles/how-do-you-deploy-aks-azures-kubernetes-using-the-gui/
https://www.continualintegration.com/miscellaneous-articles/how-do-you-deploy-a-kubernetes-cluster-in-google-cloud-platform/

How Do You Troubleshoot the Kubernetes Error “no nodes available to schedule pods”?

Problem scenario
You run a kubectl command, but you receive one of the following error messages:

Warning FailedScheduling default-scheduler no nodes available to schedule pods
No resources found

What should you do?

Possible Solution #1
Did the control plane lose connectivity with the worker nodes? An intermediate network device such as a firewall may have been introduced or reconfigured. Did a data center become unavailable?
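
To see what the control plane currently knows about the nodes, a quick check like the following may help. It is only a sketch, assuming kubectl can still reach the API server; the function name show_scheduling_state is hypothetical.

```shell
# Hypothetical helper: show node status and recent scheduling failures.
show_scheduling_state() {
  # "No resources found" here means no nodes are registered with the cluster.
  kubectl get nodes --output wide
  # List events whose reason is FailedScheduling in every namespace.
  kubectl get events --all-namespaces --field-selector reason=FailedScheduling
}
```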

Possible Solution #2 (if using AWS)
1. Create a node.
2. You may need to create an IAM role for Kubernetes; if you need assistance with this, see this posting. Once you have identified or created an IAM role that is suitable, proceed to step 3.
3. Add these policies to the role: AmazonEKSClusterPolicy, AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly.

How Do You Create an IAM Role in AWS to Allow for Nodegroups to Be Created in EKS?

Problem scenario
In the AWS Management Console, you cannot add a Node to an EKS cluster. The "Node IAM Role" never has any option. You click the "refresh" arrow, but all you see is "No roles found. Follow the link above to create a new role." What should you do?

Solution

1. Install and configure the AWS CLI. If you need assistance with this, see this posting.

2.a. Create Test-Role-Trust-Policy.json like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

2.b. Create special.json like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
          "iam:AmazonEKSClusterPolicy",
          "iam:AmazonEKSWorkerNodePolicy",
          "iam:AmazonEC2ContainerRegistryReadOnly"],
      "Resource": "arn:aws:iam::1234567891011:role/contintdelete-role"
    }
  ]
}

3. Run commands like these (but replace "contintdelete-role" with the role name of your choice, "DELETEPOLICY" with the policy name of your choice, and the account ID in special.json with your own 12-digit AWS account ID):

aws iam create-role --role-name contintdelete-role --assume-role-policy-document file://Test-Role-Trust-Policy.json
aws iam put-role-policy --role-name contintdelete-role --policy-name DELETEPOLICY --policy-document file://special.json
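
The three names listed under "Action" in special.json are names of AWS managed policies. Managed policies are normally attached to a role with aws iam attach-role-policy rather than listed as actions in an inline policy. The sketch below shows that alternative, assuming the role from step 3 already exists; the function name attach_eks_policies is hypothetical.

```shell
# Hypothetical helper: attach the three AWS managed policies named in
# special.json to an existing role.
# Usage: attach_eks_policies ROLE_NAME
attach_eks_policies() {
  for policy in AmazonEKSClusterPolicy AmazonEKSWorkerNodePolicy \
                AmazonEC2ContainerRegistryReadOnly; do
    aws iam attach-role-policy --role-name "$1" \
      --policy-arn "arn:aws:iam::aws:policy/$policy"
  done
}
```

For example, attach_eks_policies contintdelete-role attaches all three policies. You can list the result with aws iam list-attached-role-policies --role-name contintdelete-role.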

How Do You Back Up Many Emails from a Web-based Email Quickly?

Problem scenario
You want to back up many different emails from a web-based email. You cannot select several and print them all at once. What should you do?

Solution
Prerequisite

This assumes your web-based email can be configured to work with a desktop email client (such as Outlook).

Procedures
Install and configure Thunderbird, Outlook or another desktop email client. Once the mailbox has synchronized, you should be able to select many messages at once and go to "Save As" or "Print".


How Do You Get Kubernetes Nodes to Be Ready?

Problem scenario
Your Kubernetes cluster is not working. When you run kubectl get nodes, you see that the nodes are not ready. You also see this error from a kubectl describe node foobar command:

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

What should you do?

(If you get a different error, see this posting How Do You Troubleshoot a Kubernetes Cluster That is Not Working at the Node Level?.)

Possible Solution #1
Run these two commands (as we found them to work with an EKS cluster):

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml

kubectl -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
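
After applying the manifests, the nodes can take a short while to become ready. A sketch of one way to wait for them follows; the function name wait_for_nodes and the default timeout of 120 seconds are assumptions made for illustration.

```shell
# Hypothetical helper: wait for every node to report Ready, then list them.
# Usage: wait_for_nodes [TIMEOUT]   (TIMEOUT defaults to 120s)
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-120s}" \
    && kubectl get nodes
}
```

If the wait times out, kubectl describe node foobar should show whether the same network plugin error is still being reported.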

Possible Solution #2
You may want to see this posting if there is a problem with the subnet.env file on the worker nodes.

Possible Solution #3 (if using AKS)
Reboot the node. If that doesn't work, reimage the node. (The error message likely would have been different in the problem scenario for this solution to work. This solution was adapted from this external page.)

Possible Solution #4 (if you are getting an error about a network plugin)
See the posting How Do You Get Nodes in Kubernetes to Be Ready with a Network Plugin Error?.

How Do You Find out What CNI Plugin Has Been Installed in Your Kubernetes Cluster?

Problem scenario
You want to know what CNI plugin your Kubernetes cluster is using (e.g., Flannel, Calico, Weave Net, Romana or another one). What do you do?

Possible solution
According to Kubernetes.io, "You can install only one Pod network per cluster."

Go to a worker node and run these commands and look at the output:

ls -lh /etc/cni/net.d

ls -lh /opt/cni/bin | grep -i flannel

ls -lh /opt/cni/bin | grep -i calico

Use the sudo find / -name command to search for Romana or Weave Net vestiges (for example, sudo find / -iname "*weave*").

Try running this command:

kubectl get pods --namespace kube-system
# The results may indicate Flannel, Calico, Romana or Weave Net
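
To avoid scanning the pod list by eye, the pod names can be filtered for the plugin names mentioned above. This is a minimal sketch; the function name detect_cni is hypothetical, and it only recognizes the four plugins named in this posting.

```shell
# Hypothetical helper: read pod names on standard input and print each CNI
# plugin name (from the four listed above) that appears in them, once.
detect_cni() {
  grep -Eio 'flannel|calico|weave|romana' | tr '[:upper:]' '[:lower:]' | sort -u
}
```

For example: kubectl get pods --namespace kube-system --output name | detect_cni. Some plugin manifests may deploy into a namespace other than kube-system, so replacing --namespace kube-system with --all-namespaces can be worth trying if nothing is printed.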

How Do You Tell If the .yaml File for a kubectl Command Will Work?

Problem scenario
You want to do some pre-testing on the .yaml file(s) you will use with kubectl. How do you validate a .yaml file has correct syntax for Kubernetes?

Possible Solution #1
Try this command (on kubectl versions older than 1.18, use --dry-run=true instead of --dry-run=client):
kubectl apply --validate=true --dry-run=client --filename=nameofyourfile.yaml
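
If you have several .yaml files, the dry run can be repeated for each one. The following is a sketch, assuming kubectl 1.18 or later (where --dry-run takes the values client or server); the function name validate_manifests is hypothetical.

```shell
# Hypothetical helper: run a client-side dry run for each manifest given.
# Prints OK or FAILED per file and returns non-zero if any file failed.
# Usage: validate_manifests FILE...
validate_manifests() {
  status=0
  for manifest in "$@"; do
    if kubectl apply --dry-run=client --filename="$manifest" >/dev/null; then
      echo "OK: $manifest"
    else
      echo "FAILED: $manifest"
      status=1
    fi
  done
  return "$status"
}
```

For example: validate_manifests deployment.yaml service.yaml (both file names are placeholders). A client-side dry run does not check the files against the cluster's admission rules; --dry-run=server sends them to the API server for a more thorough check without persisting anything.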

Possible Solution #2
Try this website:
https://www.kubeyaml.com/

Possible Solution #3
Try kubeval: https://www.kubeval.com/

Possible Solution #4
Use Copper: http://copper.sh/