CONTINUAL INTEGRATION – Page 43 – A Technical I.T./DevOps Blog

How Do You Install GraphQL for Python?

Problem scenario
You want to install GraphQL for Python on Linux. What should you do?

Solution

Install pip3 and venv. (If you need assistance with pip3, see this posting. If you need assistance with venv, see this posting.)
Run these commands:

python3 -m venv to_test
cd to_test
source bin/activate
pip3 install graphql-python
pip3 install flask ariadne flask-sqlalchemy flask-cors
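After running the commands above, a quick check can confirm that the virtual environment is active and that the packages can be imported. The following is a minimal sketch using only the standard library; the function names and the EXPECTED_MODULES list are our own illustrative choices, and the import names are assumed to follow the usual convention of replacing hyphens with underscores.

```python
import importlib.util
import sys

# Import names for the packages installed above. The import name can differ
# from the pip name (for example, flask-sqlalchemy is imported as
# flask_sqlalchemy). This list is an assumption for illustration.
EXPECTED_MODULES = ["flask", "ariadne", "flask_sqlalchemy", "flask_cors"]


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("missing modules:", missing_modules(EXPECTED_MODULES))
```

Run it with the python3 inside the activated environment; an empty "missing modules" list suggests the installation worked.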

What is Pipelinization in Technology/Computing?

Question
In I.T. what is pipelinization?

Answer / Disambiguation
The configuration, creation, or execution of a repeatable process that involves a series of stages with a start and finish. Another definition of pipelinization would be making a procedure into a controlled stream (for reproducibility, for parallel and independent development, or for increased throughput of the original process through parallelism). A final definition would be adapting a manual or automated process by developing a sequence of substeps for a [batch] job to incrementally pass through. The pipeline could be a cyclic (circular) or acyclic (uni-directional) graph.

The three pipelinizations below refer to pipelines; none of those pipelines are operating system pipes (that convey output from one process to another process). The pipelines below are uni-directional.

Pipelinization for the CI/CD pipeline:
It is the transformation of a process into a pipeline. A pipeline is a sequence of operations usually across multiple servers or pods to build, compile, test, and release software code. It can be managed by a tool such as Jenkins, GitLab, TravisCI, CircleCI, Bamboo, Azure DevOps, AWS tools such as CodeBuild & CodePipeline, or others. A pipeline can be triggered automatically based on an event (a certain time of day or code being checked into a repository) or run manually.

Pipelines can create infrastructure (e.g., with Terraform) and virtual networks, send automated messages, perform QA tests, and release software to multiple environments including production.
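The idea of a sequence of stages can be sketched in a few lines. This is a toy illustration, not how Jenkins, GitLab, or any other tool is implemented; the stage names and the run_pipeline function are hypothetical, and each stage is just a callable that reports success or failure.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first stage that fails.

    Returns a log with one "<name>: ok" or "<name>: failed" entry per
    stage that actually ran.
    """
    log = []
    for name, action in stages:
        if action():
            log.append(f"{name}: ok")
        else:
            log.append(f"{name}: failed")
            break  # later stages never run after a failure
    return log


if __name__ == "__main__":
    # Hypothetical stages; each lambda stands in for a real job.
    stages = [
        ("build", lambda: True),
        ("test", lambda: True),
        ("release", lambda: True),
    ]
    for line in run_pipeline(stages):
        print(line)
```

Stopping at the first failure reflects the uni-directional nature of such a pipeline: a release stage should not run when a test stage has failed.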

Pipelinization for an ETL process:
The implementation of a repeatable extract-transform-load (ETL) process. An ETL pipeline is a process where data is taken from one format (e.g., a .csv file), cleansed or modified so it can be inserted into a database table, and finally loaded into the database table. Some database loads happen in an ad hoc way. Creating a configuration or a platform that integrates this loading of data into a database, so that it happens on a regular basis or from a manual trigger, is the pipelinization of an ETL process.
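As a rough sketch of the three ETL steps, the example below reads CSV text, cleans it, and loads it into an in-memory SQLite table. The column names, the payments table, and the sample data are all made up for illustration; a real pipeline would read from files and a production database and would be triggered by a scheduler or by hand.

```python
import csv
import io
import sqlite3

# Made-up sample input: one row has stray whitespace, one lacks an amount.
SAMPLE_CSV = "name,amount\n alice ,10.5\nbob,\ncarol,3\n"


def extract(csv_text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim whitespace, drop incomplete rows, convert types."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip()
        amount = (row.get("amount") or "").strip()
        if not name or not amount:
            continue  # skip incomplete records
        cleaned.append((name.title(), float(amount)))
    return cleaned


def load(conn, records):
    """Load: insert the cleaned records into the database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments (name, amount) VALUES (?, ?)", records)
    conn.commit()


def run_etl(conn, csv_text):
    """Run all three steps and return the number of rows loaded."""
    records = transform(extract(csv_text))
    load(conn, records)
    return len(records)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("rows loaded:", run_etl(connection, SAMPLE_CSV))
```

Wrapping the three steps in one run_etl function is what makes the load repeatable rather than ad hoc.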

Pipelinization for programming a given processor:
Pipelinization is the utilization of parallel processing: sending instructions to a CPU in a way that maximizes the efficiency and overall capabilities of the CPU itself. Pipelining is the design of how instructions are sent to a CPU so that computations can proceed in parallel and take advantage of the faster levels of a computer's memory. CPU registers are faster than the CPU's cache, which is faster than RAM (1). The slowest memory is virtual memory (saved to disk) (1). Swapping or paging is the I/O activity that moves data between RAM and disk when virtual memory is used.

Citation
(1) https://www.elprocus.com/memory-hierarchy-in-computer-architecture/
For processing, "[p]ipelining is a method to obtain high efficiency for processes inside computers." (Page 2 of The Art of High Performance Computing for Computational Science.)
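One common way to see why pipelining improves efficiency is to count clock cycles. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and there are no stalls or hazards; real processors deviate from this. The function names are our own.

```python
def cycles_without_pipelining(instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def cycles_with_pipelining(instructions: int, stages: int) -> int:
    """The first instruction fills the pipeline; after that, one instruction
    completes per cycle (idealized: no stalls, one cycle per stage)."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical workload: 100 instructions, 5 stages
    print("without pipelining:", cycles_without_pipelining(n, k))  # 500
    print("with pipelining:   ", cycles_with_pipelining(n, k))     # 104
```

Under these assumptions, the speedup approaches the number of stages as the number of instructions grows.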

See also this Quora question and answer.

Is It a Best/Recommended Practice to Update a Server Application when Clients/Apps May Not Be Compatible with the Version?

Problem scenario
Some updates are essential for security. In some situations a patch to an OS or an application may cause some clients to have problems connecting to the server. The ramifications could apply to middleware or other applications. Is it a best/recommended practice to push an update to a server's OS or application when some clients or smartphone apps will not be compatible with it?

Answer
It is not clear. Here is one example where "best practices" are not compatible:

TechRepublic says it is a best practice to tolerate system incompatibilities (#8 here). On the other hand, betterprogramming.pub, a well-respected website that gives advice to programmers, recommends eliminating incompatible versions by forcing updates.

From a pragmatic perspective in an enterprise environment, managers wrestle with problems like this. There may not be two options that are clear cut. Forbes uses phrases like "the right security controls" as a best practice. It is a safe bet to use the Goldilocks approach as a "best practice." With qualifiers and specific context, "best practices" can have applicability.

In Kubernetes, What Happens when an API Server or Admission Controller Receives a kubectl Command and Updates etcd?

Problem scenario
You know that an admission controller in the API server has various stages to process a kubectl command in Kubernetes. You want to know how the API server or relevant admission controller works in the correct sequence. The sequence starts with a request (HTTP POST from a kubectl command) and ends with updating etcd. How does this happen in detail?

Solution
The API server's process of handling a kubectl command is very similar to what are called "admission controller phases."*

It is not clear what the difference is (if there is any).

In pure text, the successful process is as follows:

1) a kubectl command is run
2) an HTTP POST request is made to the API server
3) the API server receives the request and handles it
4) an authentication plugin receives the request and passes it on
5) an authorization plugin receives the request and passes it on
6) an admission control plugin receives the request and passes it on
7) object schema validation is done
8) a validating admission control phase may happen that could involve an external validating webhook (this step is optional and could be bypassed). No changes can happen at this stage; the request is either accepted or rejected here*.
9) the changes are saved in the "distributed persistent storage" (page 310 of Kubernetes in Action) system (etcd).
10) a response is returned to the client terminal that ran the original kubectl command.

The above steps were adapted from page 316 (most of the above steps) and 318 (#10) of Kubernetes in Action and this external website (several steps were corroborated or borrowed).
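The sequence above can be mimicked with a small simulation. This is not Kubernetes code and it does not call any Kubernetes API; every function name, field name, and rule below (such as filling in a default namespace or rejecting names that start with "forbidden-") is a hypothetical stand-in, and a plain dictionary plays the role of etcd.

```python
class Rejected(Exception):
    """Raised when any stage refuses the request."""


def authenticate(request):
    if not request.get("user"):
        raise Rejected("authentication failed")


def authorize(request):
    if request["verb"] not in request.get("allowed_verbs", ()):
        raise Rejected("authorization failed")


def admission_control(obj):
    # Assumption for illustration: this plugin fills in a default value.
    changed = dict(obj)
    changed.setdefault("namespace", "default")
    return changed


def validate_schema(obj):
    for field in ("kind", "name"):
        if field not in obj:
            raise Rejected(f"schema validation failed: missing {field}")


def validating_admission(obj):
    # No changes are made here; the object is only accepted or rejected.
    if obj["name"].startswith("forbidden-"):
        raise Rejected("validating admission rejected the object")


def handle_request(request, store):
    """Process one simulated POST and return a response dictionary."""
    try:
        authenticate(request)
        authorize(request)
        obj = admission_control(request["object"])
        validate_schema(obj)
        validating_admission(obj)
    except Rejected as err:
        return {"status": "rejected", "reason": str(err)}
    key = f'{obj["namespace"]}/{obj["kind"]}/{obj["name"]}'
    store[key] = obj  # stands in for the write to etcd
    return {"status": "created", "key": key}


if __name__ == "__main__":
    fake_etcd = {}
    post = {
        "user": "dev",
        "verb": "create",
        "allowed_verbs": ["create"],
        "object": {"kind": "Pod", "name": "web"},
    }
    print(handle_request(post, fake_etcd))
```

The point of the sketch is the ordering: nothing is written to the store unless every earlier stage has passed the request on.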

* This external article refers to "admission controller phases."

What Does “Branch by Abstraction” Entail?

Problem scenario
You have read about "branching by abstraction," but you do not know what it means. You read that it is ideal when the coding you are doing cannot be done incrementally. How do you "branch by abstraction"?

Solution
It is a method of trunk-based development; instead of using a branch, you commit to the trunk (or main) branch of your repository. As you modify and develop a solution, you interact with an abstraction -- not the production code. To branch by abstraction, do the following:

1) Create an abstraction over the part of the system that you need to change.
2) Refactor the rest of the system to use the abstraction layer.
3) Create a new implementation, which is not part of the production code path until complete.
4) Update your abstraction layer to delegate to your new implementation.
5) Remove the old implementation.
6) Remove the abstraction layer if it is no longer appropriate.

(This solution was taken from page 350 of Continuous Delivery by Humble and Farley.)
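A small sketch may make the procedure more concrete. The tax calculator, its class names, and the use_new switch are hypothetical examples of ours and are not taken from the book; amounts are in whole cents to avoid floating-point rounding.

```python
from abc import ABC, abstractmethod


class TaxCalculator(ABC):
    """The abstraction placed over the part of the system being changed."""

    @abstractmethod
    def tax(self, amount_cents: int) -> int:
        ...


class LegacyTaxCalculator(TaxCalculator):
    """The old implementation, still on the production code path."""

    def tax(self, amount_cents: int) -> int:
        return amount_cents * 20 // 100


class NewTaxCalculator(TaxCalculator):
    """The new implementation, developed on trunk but unused until complete."""

    RATE_PERCENT = 20

    def tax(self, amount_cents: int) -> int:
        return amount_cents * self.RATE_PERCENT // 100


def make_calculator(use_new: bool = False) -> TaxCalculator:
    """The abstraction layer decides which implementation to delegate to."""
    return NewTaxCalculator() if use_new else LegacyTaxCalculator()


def invoice_total(amount_cents: int, calculator: TaxCalculator) -> int:
    """The rest of the system depends only on the abstraction."""
    return amount_cents + calculator.tax(amount_cents)


if __name__ == "__main__":
    print(invoice_total(10000, make_calculator()))              # old path
    print(invoice_total(10000, make_calculator(use_new=True)))  # new path
```

Once the new implementation is trusted, the default of make_calculator can be flipped, the legacy class deleted, and the abstraction removed if it no longer earns its keep.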

Is It a Best/Recommended Practice to Use Log4j?

Problem scenario
You are not sure if Log4j is acceptable to use. Is it a best/recommended practice to use Log4j?

Answer
Maybe.

Log4j has been widely adopted by many of the most trusted companies in the I.T. industry. According to cybersecuritydive.com, "Fortinet, IBM, Microsoft, Red Hat, Salesforce, and Siemens" use(d) Log4j. Log4j has been vulnerable since 2013 (according to this external website).

Below are some examples of credible sources referring to "best practices" and using Log4j:

Carbonite says it is a best practice to "stay current" (in this external posting). Many sources say it is a best practice to apply maintenance patches regularly and stay current; these sources include Faronics.com, The C2 Group, and the U.K. National Cyber Security Centre. Forbes recommends covering "[a]ll [y]our [b]ases." Staying current was not enough for Log4j. Some companies used a very old version of Log4j, in contradiction of the "best practice" of staying current. (Legacy software is widely used; if you want proof that Log4j version 1 was still being used after the major December 2021 vulnerability, see this thread: https://github.com/apache/logging-log4j1/pull/18.) These companies using version 1 were not susceptible to the December 2021 Log4j vulnerability. (Admittedly there is a weakness in version 1, but it appears to be less severe according to https://www.slf4j.org/log4shell.html.)

As far as vulnerabilities are concerned, CVE-2021-44228 is probably as bad as it gets.
…
As log4j 1.x does NOT offer a JNDI look up mechanism at the message level, it does NOT suffer from CVE-2021-44228.
https://www.slf4j.org/log4shell.html

What is interesting is that https://www.slf4j.org/log4shell.html estimates that 10 times as many companies use version 1, which has been unmaintained since 2015, as use version 2. Therefore few companies use the "best practice" of switching to version 2. Arguably the market got it right by not adopting Log4j version 2.x. (There is a book that tells a story about investors guessing which company was responsible for the component that made the Challenger fail. The estimates were based on the share prices of the possible contractor companies losing market value at the time of the explosion. At the conclusion of the investigation, the company that lost the largest percentage of value on the day when the Challenger blew up was the company that made the part that caused the failure. This book is called The Wisdom of Crowds. We are not necessarily endorsing the book, but it is an interesting concept.)

According to CRN, these security vendors were using Log4j version 2:
Broadcom, CyberArk, ForgeRock, F-Secure, Okta, SonicWall and Sophos. (This was taken from https://www.crn.com/slide-shows/security/12-cybersecurity-vendors-susceptible-to-the-log4j-vulnerability.)

Were these companies using best practices? Don't many people in the industry trust those companies? Some companies do not worry about peer review and best practices. They are focused on multiple lines of defense.

Even Symantec products were affected:

Broadcom determined as of Tuesday that some or all versions of its CA Advanced Authentication, Symantec PAM Server Control, Symantec SiteMinder, and VIP Authentication Hub products are affected by the Log4j vulnerability.
https://www.crn.com/slide-shows/security/12-cybersecurity-vendors-susceptible-to-the-log4j-vulnerability

The vulnerability reminds us that the term "best practice" suggests there is a clear path toward something safe, but some situations have no clear course of action, even for the most well-respected companies in the world.

Is It a Best Practice to Normalize a SQL Database?

Problem scenario
There are sources that recommend database normalization as a best practice for relational databases. Is it always a best practice to normalize databases?

Answer
There is no clear answer as the sources vary.

A heavily voted-up answer on StackOverflow says that denormalization for OLAP performance is something to be avoided. We find recognition of this "best practice" on a venerated engineering blog:

Keeping data normalized is considered a best practice in MySQL. As we mentioned before, however, when using a NoSQL database like Cassandra, denormalizing data often improves query performance.
https://engineeringblog.yelp.com/2016/08/how-we-scaled-our-ad-analytics-with-cassandra.html

The above sources have legitimacy, but authoritative sources dispute that normalization is a best practice. No one should dispute that there are proponents and opponents of normalization of SQL/relational databases as a best practice.

The opponents of normalization as a best practice say that you need to have denormalized data for OLAP databases (for reporting, to reduce the expensive operation of a JOIN in SQL and to avoid aggregate queries, which can be time-intensive) and for data warehouses. Data warehouses are usually not updated that frequently; it is common for batch jobs to happen monthly with data warehouses. The work to normalize such a database may be economically prohibitive, and the benefits may be negligible or have a negative value.

Kimball cubes are widely used in the I.T. industry; they have changed the database and BI verticals. The man for whom Kimball cubes are named says this on his company's website:
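The trade-off can be illustrated with SQLite. In the sketch below, the same regional totals are computed once from normalized tables (which requires a JOIN) and once from a denormalized, flattened table (which does not). The table names, columns, and data are invented for the example, and it only demonstrates the removal of the JOIN, not any performance difference.

```python
import sqlite3

SCHEMA_AND_DATA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount INTEGER
);
INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
INSERT INTO orders VALUES (1, 1, 10), (2, 1, 15), (3, 2, 7);

-- Denormalized copy: the region is repeated on every order row.
CREATE TABLE orders_flat AS
    SELECT o.id AS order_id, c.region AS region, o.amount AS amount
    FROM orders o JOIN customers c ON c.id = o.customer_id;
"""

# Reporting query against the normalized tables: needs a JOIN.
NORMALIZED_QUERY = """
SELECT c.region, SUM(o.amount)
FROM orders o JOIN customers c ON c.id = o.customer_id
GROUP BY c.region ORDER BY c.region
"""

# The same report against the flattened table: no JOIN.
DENORMALIZED_QUERY = """
SELECT region, SUM(amount)
FROM orders_flat
GROUP BY region ORDER BY region
"""


def build_database():
    """Create an in-memory database holding both designs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    return conn


if __name__ == "__main__":
    db = build_database()
    print(db.execute(NORMALIZED_QUERY).fetchall())
    print(db.execute(DENORMALIZED_QUERY).fetchall())
```

The flattened table answers the report with a simpler query, at the cost of storing the region redundantly and having to keep it in sync when a customer's region changes.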

In general, dimensional designers must resist the normalization urges caused by years of operational database designs and instead denormalize the many-to-one fixed depth hierarchies into separate attributes on a flattened dimension row.
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/denormalized-flattened-dimension/

Netflix and Meta (formerly known as Facebook) both use denormalization. A Microsoft MVP says that a data warehouse best practice is to denormalize data.

In a dataware house the best practice is to denormalize the data to increase the performanze [sic] of ad-hoc reporting; with the prize of a higher data volume.
https://social.technet.microsoft.com/Forums/en-US/8c54d1d2-5fbd-4b62-a07f-16c34a863668/is-denormalization-best-practices?forum=transactsql

Other examples suggesting denormalization is beneficial (and indicating that it is a "best practice") are the following:

If you want to read about the controversy of "best practices" in general, we recommend this posting.

What is the Difference between Yarn, the Package Manager, and YARN, the Hadoop Framework Tool?

Problem scenario
You are familiar with two different software tools called Yarn. You are not sure what they do or how they are different.

Solution / Yarn Disambiguation
There is a package manager called Yarn that has some project manager functionalities. It is often used with JavaScript.

This tool has an icon of a one-line drawing of a cat; the original Yarn website had other cat imagery, and there is the word "cat" in a yarn command. Our theory is that cats like to play with balls of yarn. This type of play can get out of control. Package managers make disparate and complex problems manageable. We do not think the etymology of this tool was that of a backronym.

We don't think resource negotiation is a factor or function of this Yarn. The product became quite big. This tool was released in 2016 (according to https://github.com/yarnpkg/yarn/releases/tag/0.2.0).

To learn more about it, see these postings:

Be advised we do not agree with the Wikipedia statement that this Yarn tool is an acronym.

To learn about the CLI, see this posting.

There is a separate tool called YARN that is associated with Hadoop. Hadoop YARN is "[a] framework for job scheduling and cluster resource management" (as taken from https://hadoop.apache.org/).

The term YARN stands for "Yet Another Resource Negotiator" (according to TechTarget.com).

This tool was released in 2012 (according to Cloudera); it is part of the Hadoop framework. There is a GUI component to this tool as well as a command line component. To learn about the commands, see this external posting. To learn about the GUI, see this posting.

Products have been built on Hadoop YARN such as Tez; Tez is "[a] generalized data-flow programming framework, built on Hadoop YARN…" (according to https://hadoop.apache.org/).

To learn more about the differences between the Yarn tools, see these postings:

https://stackshare.io/stackups/yarn-vs-yarn-hadoop
https://stackoverflow.com/questions/44739402/hadoop-yarn-vs-yarn-package-manager-command-conflict
https://superuser.com/questions/1193406/hadoop-yarn-yarn-npm-name-conflict
https://github.com/yarnpkg/yarn/issues/2337

Is It a Recommended Practice to Use Production Data when Testing Software?

Problem scenario
You want to test your application. You are considering using mock/fake data. But you are concerned that the tests will be insufficient. Is it a best practice to test with production data?

Answer
It depends. We prefer the term "recommended practice" because of questions like these.

Some sources recommend using production data:

…many production issues are due the lack of real(istic) test data…
To ensure software of the highest quality possible, you’ll need to keep the test environment as “in-sync” as possible with production.
https://www.datprof.com/blogs/using-production-data-for-testing/

A Microsoft MVP recommends using production data for testing. This IBM document advises using production data in a test environment. We would summarize this NimbleAMS document as suggesting that production data be restored in non-production environments. This website essentially says that if you are following Agile best practices, then you are testing with production data.*

Some sources give mixed recommendations on this topic
This U.K. source says it sometimes makes sense to use production data. This university thesis gives mixed treatment to using production data in a non-production environment. This thread gives mixed treatment to the practice of copying production data to lower environments. Another StackOverflow.com thread gives mixed treatment to the practice of copying production data to lower environments.

Continuous Delivery by Humble and Farley (on page 204 and 205) says that you spend more time trying to get a dataset than testing if you attempt to get data from production. In 2011 they did not think that production data should be used (but they did say for capacity testing it could be "occasionally" useful).

Some sources do not recommend using production data
These sources explicitly recommend you not use production data:

To not violate GDPR, you probably need to mask the production data (according to this source).
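As a rough illustration of what masking could look like, the sketch below replaces names with a fixed placeholder and e-mail addresses with a salted hash before rows are copied to a test environment. The field names, the salt handling, and the functions are hypothetical. Salted hashing is a form of pseudonymization, and whether it is sufficient under GDPR is a legal question that this example does not answer.

```python
import hashlib


def mask_email(email: str, salt: str) -> str:
    """Replace an e-mail address with a stable, non-reversible stand-in.

    The same input and salt always give the same output, so joins between
    masked tables still line up.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}@example.invalid"


def mask_rows(rows, salt: str):
    """Return copies of the rows with the sensitive fields masked.

    Assumes each row is a dict that may contain "name" and "email" keys;
    all other fields are copied unchanged.
    """
    masked = []
    for row in rows:
        copy = dict(row)
        if "name" in copy:
            copy["name"] = "REDACTED"
        if "email" in copy:
            copy["email"] = mask_email(copy["email"], salt)
        masked.append(copy)
    return masked


if __name__ == "__main__":
    production_rows = [  # invented sample data
        {"id": 1, "name": "Ada Example", "email": "ada@example.com", "plan": "pro"},
    ]
    for row in mask_rows(production_rows, salt="test-env-salt"):
        print(row)
```

Keeping non-sensitive columns such as the plan intact preserves the realistic shape of the data, which is the main reason for wanting production data in the first place.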

These sources also say you should not use production data in a test environment:

Red-gate.com also says you should not use production data in a test environment (see the term "safe"). For further reading on "best practices" versus "recommended practices," see this posting.

* See this quote:

Most of Agile development and product management’s best practices are forms of testing in development. We’re talking about very common practices like

CI/CD
A/B Testing
Phased Rollouts
Canary Deployments
Blue/green deployments
Usability Testing
Smoke & Sanity Testing

If you are following any of these practices—and many more like them—then you are already running tests with real-world users in a live production environment.
https://www.flagship.io/testing-in-production/

How Do You Install and Configure Terragrunt on a Linux Machine?

Prerequisite
Install make. If you need assistance, see this posting.

Procedures
Run these three commands:

curl -L https://github.com/gruntwork-io/terragrunt/releases/download/v0.35.18/terragrunt_linux_arm64 --output /tmp/terragrunt_linux_arm64

sudo mv -i /tmp/terragrunt_linux_arm64 /bin/terragrunt

sudo chmod a+x /bin/terragrunt
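The commands above download the arm64 build. On an x86_64 machine, the amd64 build is needed instead. The sketch below maps the machine type reported by Python to a release file name and checks whether terragrunt is on the PATH. The mapping table and function name are our own, and it assumes the release files follow the terragrunt_linux_<arch> naming seen in the URL above.

```python
import platform
import shutil

# Common values of platform.machine() on Linux mapped to the suffix used in
# the release file names. This table is an assumption and is not exhaustive.
ARCH_SUFFIX = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def terragrunt_asset(machine: str) -> str:
    """Return the Linux release file name for the given machine type."""
    try:
        return "terragrunt_linux_" + ARCH_SUFFIX[machine.lower()]
    except KeyError:
        raise ValueError(f"unrecognized architecture: {machine!r}") from None


if __name__ == "__main__":
    try:
        print("file to download:", terragrunt_asset(platform.machine()))
    except ValueError as err:
        print(err)
    # shutil.which returns None when the command is not on the PATH.
    print("terragrunt found at:", shutil.which("terragrunt"))
```

If the script reports a different file name than the one in the curl command, substitute it in the URL and in the two commands that follow.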