tratnayake.dev

A misadventure with Terraform Sets & PagerDuty Schedules

Thilina Ratnayake — Sat, 22 Jul 2023 23:20:39 GMT

"T, why didn't I get this page?" 🤨
"Wait, why does it show that is on call? They just did it the other week." 🧐

These are two phrases that you don't want to hear after making changes to the Terraform for your PagerDuty schedules.

Intro

In the last couple of weeks, I've been leading the effort to onboard 3 new engineers to our on-call rotation. As part of that work, one of the tasks is to get those engineers added to PagerDuty (PD), the app we use for managing on-call shifts and alerting. While this can easily be done in the PD UI, we implement these changes via Terraform so that they're documented, codified, and tracked via version control, which also adds another layer of auditability.

Some key concepts for working with PagerDuty:

A schedule determines the WHO, and WHEN. (Who will be in the rotation, how long the rotation will be, and when the rotation starts).

An escalation policy determines the ordering/logic for which schedules get paged.

A service is what represents your service (or system) and will be linked to an escalation policy.

So from the top:

When a service has an alert, PD will look at the escalation policy.
Based on the escalation policy and the current situation (i.e. first alert, first loop), PD will notify the appropriate schedule.

You can see a full gist of the old code here.

An important note for this example is that my team is actually considered a subteam (A) that shares its pager with subteam (B)

Before

Prior to this work, I had originally set my schedule up as follows:

I also had each person's membership in a PagerDuty team like this:

Given that:

I was specifying an association to a user twice AND
Creating a new resource for each team membership; I wondered if I could refactor this.

Enter, the Good Idea Fairy 🧚🏼

Since my last brush with Terraform, I'd like to think I'd gotten better with it - especially with the use of for_each statements. So when looking at a solution to this "problem" - I thought:

Why not just create a locals.members list with all the users, and then use that as (1) the members for the schedule and (2) to have a single statement to create the team_memberships via a for_each?
In FACT! Since we have two subteams, I could create two lists and simply combine them!

After

This is what I ended up with after refactoring and making what I thought were good changes. Gist.

I thought I was pretty slick by doing the following:

Setting up the list of teammates in a local variable.

locals {
    my_team_subteam_a_members = toset([
        pagerduty_user.thilina_ratnayake.id,
        pagerduty_user.teammate_b.id,
        pagerduty_user.teammate_c.id,
    ])
    my_team_subteam_b_members = toset([
        pagerduty_user.teammate_a.id,
        pagerduty_user.teammate_d.id
    ])
}

Use the list from above with setunion() to combine both subteams A and B.

resource "pagerduty_schedule" "myteam_schedule" {
  name        = "My Team"
  time_zone   = "America/Los_Angeles"
  description = "PD Schedule for My Team, Slack #my-team, Email: my-team@company.com"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600
    rotation_virtual_start       = "2023-01-1T09:00:00-08:00"
    start                        = "2023-01-1T09:00:00-08:00"
    users = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
  }
}

Iterate through the memberships for each subteam.

resource "pagerduty_team_membership" "my_team_subteam_a_members" {
  for_each = local.my_team_subteam_a_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_a.id
}

resource "pagerduty_team_membership" "my_team_subteam_b_members" {
  for_each = local.my_team_subteam_b_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_b.id
}

Except, I wasn't. Because this didn't go as planned - and the day after I made the changes we noticed that the PagerDuty schedules were completely off.

The Reason

In a schedule, ordering matters.

Before, we had specified the ordering and had that ordering based on a start date. That meant that after every interval (rotation), the next person would be in the hot seat to carry the pager.

However, when we did:

users = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)

This ended up doing a union of the sets, which completely changes & disregards the order. In fact, that's actually specified in the documentation that I missed 🤦🏽:

> setunion(["a", "b"], ["b", "c"], ["d"])
[
  "d",
  "b",
  "c",
  "a",
]

The given arguments are converted to sets, so the result is also a set and the ordering of the given elements is not preserved.

By doing a setunion on local.my_team_subteam_a_members and local.my_team_subteam_b_members, the ordering was completely disregarded, which led to PagerDuty setting up someone who wasn't scheduled as the person on call for the rotation.

Conclusion

While it's great to be DRY and avoid the repetition of values - that shouldn't get in the way of functionality. With regards to Terraform:

If ordering matters in a list, don't use setunion()
Especially if you're setting up a PagerDuty schedule, just "hardcode" / manually specify the rotation order.

Common CI Pipeline Considerations: Ordering and Caching

Thilina Ratnayake — Sat, 15 Jul 2023 17:10:16 GMT

At work last week, I found myself getting burned by ordering and "file doesn't exist" errors. Ultimately - the thing that (would have) helped me get past most of these issues is something that I can't believe I forgot: a table.

If you've been following along with my last two blog posts, I've been tasked with implementing a check on PR to validate dynamically generated configuration files for a service running Open Telemetry Collectors.

To do this we had to:

Use YQ to template out the config files from the Helm templates

Create the bash script that would loop through each of the .yaml's to (1) generate the config files and (2) run them through the OTC's validate command

You'll recall that the goal from Step 2 was to get this running successfully on my laptop so we could then set it up to run on CI.

Which brings us to today.

Setting this up in CI

At this point, I had a working script that I simply had to run on my Continuous Integration (CI) pipeline on each PR. We currently use CircleCI so that meant updating our /.circleci/config.yml file which looks something like this:

workflows:
  version: 2
  build_and_test:
    jobs:
      [...]
      - test-my-service-build

So I figured, maybe I'd just add a new job called validate-my-service-config which would checkout the repo and run anytime any code changed in our Helm chart (where the config files were defined).

Perhaps something like this:

workflows:
  version: 2
  build_and_test:
    jobs:
      [...]
      - test-my-service-build
      - validate-my-service-config

###

validate-my-service-config:
    working_directory: /home/circleci/workspace/myrepo
    # Run this step in a Docker container using the docker image we've
    # spec'd in pipeline params.
    docker:
      - image: cimg/go:<< pipeline.parameters.golang-image-version >>
    steps:
      # Git checkout the repo
      - shallow-checkout
      # Check if the charts dir has changed which contains the config
      - check-if-code-changed:
          path: kubernetes/charts/my-service
      - run:
          name: "validate-my-service-config"
          command: |
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            # make validate-all-configs runs run_all_checks.sh
            make -C kubernetes/charts/my-service validate-all-configs

But there arose two new problems:

This step runs in a Go Docker image, it doesn't have Helm onboard; and
This step runs the make validate-all-configs target which runs run_all_checks.sh script which expects the my-service binary to be present in a specific file location, but what if the binary doesn't exist in the expected location at the time of running?

Dependency: Install Helm

Installing Helm was easy thanks to a make target that was created by SREs before me, called make install-helm (it would fetch the script from get.helm.sh, uncompress it, and install from source).

Dependency: Ensure that the my-service binary is available before the script runs.

This is where ordering became important. You will recall from further up that our CircleCI workflow had a job called test-my-service-build which was used to test and confirm that the binary would, in fact, build.

test-my-service-build:
    working_directory: /home/circleci/workspace/myrepo
    docker:
      - image: cimg/go:<< pipeline.parameters.golang-image-version >>
    steps:
      - shallow-checkout
      - check-if-code-changed:
          path: go/src/github.com/myrepo/my-service
      - run:
          name: "build"
          command: |
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            make -C go/src/github.com/myrepo/my-service build

I initially thought:

Sweet! Maybe I'll just run the validation steps inside of testing the image, because that way the binary will be available AND it kills two birds with one stone.

(Also, on a completely random tangent, that idiom always reminds me of this comic by Nathan W. Pyle.)

Anyways, it would look like this:

test-my-service-build:
    [...]
      # Check build AND config paths for changes
      - check-if-code-changed:
          path: go/src/github.com/myrepo/my-service kubernetes/charts/my-service
      - run:
          name: "build"
          command: |
            [...]
            make -C go/src/github.com/myrepo/my-service build
      - run:
          name: "validate"
          command: |
            [...]
            make -C kubernetes/charts/my-service validate-all-configs

But I had to quickly disqualify that idea, mainly because test-my-service-build was specifically (1) "targeted" to run on file changes to the Go source and (2) meant to only check the build. Doing config validation in this step felt like inappropriate overloading.

What would happen if we only needed to check config, or only check build? Is it appropriate that we have to do the other step as well?

More specifically, this is inappropriate because it could lead to situations where we are unable to make a build related change because config is bad, and vice versa. This creates unnecessary coupling between the two.

Go Code Changes	Helm Config Changes	Result
Yes	None or Yes	A PR dealing with Go code would be blocked due to a config issue.
None or Yes	Yes	A PR dealing with Helm config changes would be blocked due to a Go code issue.

Okay what if we made test-my-service-build a pre-requisite for validate-my-service-config ?

Something like:

workflows:
  version: 2
  build_and_test:
    jobs:
      [...]
      - test-my-service-build
      - validate-my-service-config:
          requires:
            - test-my-service-build

Since the binary is built in test-my-service-build, it would ensure that the binary is available on the file system for validate-my-service-config.

But there was a problem with this approach:

Incorrect Targeting

Since test-my-service-build was targeted as follows:

test-my-service-build:
    [...]
      # Only the Go source path is checked for changes
      - check-if-code-changed:
          path: go/src/github.com/myrepo/my-service

It would never actually build if there was only a change to config, meaning that the binary would never be built and thus would not be available for the validate-my-service-config job, which gets us back to square one.

Go Code Change	Helm Config Change	Result
Yes	No	CI would work as intended.
No	Yes	This would fail because changes were not made in the directory that's targeted by `test-my-service-build` and thus the required binary would not be available.

Okay okay, what if we made validate-my-service-config do its own build?

Something like this, in code:

validate-my-service-config:
    [...]
      # Only the config (chart) path is checked for changes
      - check-if-code-changed:
          path: kubernetes/charts/my-service
      - run:
          name: "build"
          command: |
            [...]
            make -C go/src/github.com/myrepo/my-service build
      - run:
          name: "validate"
          command: |
            [...]
            make -C kubernetes/charts/my-service validate-all-configs

Inefficient

This setup would mean that the my-service binary would need to get built twice if there was an incoming change that required changing Go code and Helm config.

Go Code Changes	Helm Config Changes	Results
Yes	No	Triggers `test-my-service-build`, which builds the `my-service` binary.
No	Yes	Triggers `validate-my-service-config`, which (1) builds the `my-service` binary and (2) validates config.
Yes	Yes	(1) Builds the binary in `test-my-service-build`, then (2) builds the binary again in `validate-my-service-config`.

This was starting to feel like inappropriate overloading again.

Okay okay okay, what if we made use of caching to cut down on the amount of binary building we did?

Yes. This would work. CircleCI has a couple of strategies for persisting data between jobs and workflows and for our use case we picked caching.

Specifically:

test-my-service-build will always build the binary, and then save it to the cache.
validate-my-service-config will restore the cache, so any previously built binary is made available to the validate step:
1. If the binary exists in cache, use that.
2. If it doesn't, create it!
3. If a binary is created, save it to cache!

This solves our problem as follows:

Go Code Changes	Helm Config Changes	Result
Yes	No	Always builds the binary, and saves it to cache.
No	Yes	Reads from cache to fetch a binary if it was recently built, and if not, creates the binary and saves it to cache.
Yes	Yes	Builds the binary ONCE in whichever step runs first (and saves it to cache); the second step reuses that same binary.
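The "use the cached binary, otherwise build it" decision can be sketched as a small shell function. This is only an illustration: ensure_binary is a hypothetical helper name, and the build command is passed in as arguments so the sketch runs without the real Makefile.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ensure_binary PATH BUILD_COMMAND...
# (hypothetical helper) Reuse the binary at PATH if the cache restored it;
# otherwise run the given build command. Prints which branch was taken.
ensure_binary() {
  local bin_path="$1"
  shift
  if [ -e "$bin_path" ]; then
    echo "cache hit: using $bin_path"
  else
    echo "cache miss: building $bin_path"
    "$@"
  fi
}
```

In the validate job this might be called as ensure_binary go/src/github.com/myrepo/my-service/dist/my-service make -C go/src/github.com/myrepo/my-service build, with the cache save step running afterwards either way.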

This is what that looks like in CircleCI config:

workflows:
  version: 2
  build_and_test:
    jobs:
      [...]
      - test-my-service-build
      - validate-my-service-config

###

test-my-service-build:
    working_directory: /home/circleci/workspace/myrepo
    docker:
      - image: cimg/go:<< pipeline.parameters.golang-image-version >>
    steps:
      - shallow-checkout
      - check-if-code-changed:
          path: go/src/github.com/myrepo/my-service
      - run:
          name: "build"
          command: |
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            make -C go/src/github.com/myrepo/my-service build
      - save_cache:
          key: my-service-binary-cache
          paths:
             - go/src/github.com/myrepo/my-service/dist/my-service

###

validate-my-service-config:
    working_directory: /home/circleci/workspace/myrepo
    docker:
      - image: cimg/go:<< pipeline.parameters.golang-image-version >>
    steps:
      - shallow-checkout
      - check-if-code-changed:
          path: kubernetes/charts/my-service
      - restore_cache:
          keys:
             - my-service-binary-cache
      - run:
          name: "validate-my-service-config"
          command: |
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            # Check if a binary exists in the cache
            # (shell variable names cannot contain hyphens)
            my_service_bin="go/src/github.com/myrepo/my-service/dist/my-service"
            if [ ! -e "$my_service_bin" ]; then
                echo "my-service binary does not exist."
                make -C go/src/github.com/myrepo/my-service build
            fi
            make install-helm
            # make validate-all-configs runs run_all_checks.sh
            make -C kubernetes/charts/my-service validate-all-configs
      - save_cache:
          key: my-service-binary-cache
          paths:
             - go/src/github.com/myrepo/my-service/dist/my-service

Learnings

By understanding the ordering required by our CI steps and making use of CircleCI's cache, we are able to ensure that the dependencies for each step are met and that we're being efficient in doing only the steps required for each type of change.

As you saw from each of my iterations on the CI pipeline, I ended up having to build a table to test whether that configuration would satisfy each use case (i.e. Go change, Helm change, Go & Helm change). If I were to do this again in the future, I would make that table a part of my design and planning process.

For example:

If there's a Go Change (Binary)	If there's a Helm change (Config)	What should we test in CI?
Yes	No	We only need to test that the binary is built successfully.
No	Yes	We only need to test that the config files can be validated against the binary. Prerequisite:
Yes	Yes	The binary must be built AND the config files must be validated. But we only need to build the binary once - good opportunity to use caching.
No	No	Nothing is required, kick back and relax.

Considerations:

The binary must be built for every go change
The binary must be built prior to testing a helm change, but it may run after the go binary is built in test-my-service-build - this might be a good opportunity to use caching.
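The planning table can also be written down as a tiny decision function, which makes it easy to check every row before touching the CI config. This is a sketch under assumptions: plan_checks and the two prefix variables are hypothetical names, and "build-binary" stands for "build the binary, or restore it from cache".

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path prefixes taken from the CircleCI config in this post.
GO_PATH_PREFIX="go/src/github.com/myrepo/my-service/"
CHART_PATH_PREFIX="kubernetes/charts/my-service/"

# plan_checks FILE...
# (hypothetical helper) Prints the checks a set of changed files needs,
# one per line, following the rows of the table above.
plan_checks() {
  local go_changed=0 helm_changed=0 file
  for file in "$@"; do
    case "$file" in
      "$GO_PATH_PREFIX"*)    go_changed=1 ;;
      "$CHART_PATH_PREFIX"*) helm_changed=1 ;;
    esac
  done

  # The binary is needed for a Go change AND as a prerequisite for validation.
  if (( go_changed || helm_changed )); then
    echo "build-binary"
  fi
  if (( helm_changed )); then
    echo "validate-config"
  fi
  if (( ! go_changed && ! helm_changed )); then
    echo "nothing"
  fi
}
```

Feeding it one representative file per row (a Go file, a chart file, both, neither) reproduces the four rows of the table.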

Conclusion

In conclusion, understanding the ordering of CI steps and utilizing CircleCI's cache feature can help ensure that dependencies are met efficiently for each type of change. Creating a table during the planning process can help anticipate different scenarios and optimize the CI pipeline accordingly.

Lessons Learned - Sharpening my Bash Skills

Thilina Ratnayake — Fri, 07 Jul 2023 23:12:03 GMT

Bash scripting is one of those things that I always associate with strong engineers, especially those in SRE. Conversely, it's not something I get to write a lot of, so I'll take any opportunity to sharpen those skills.

This blog post explains how to set up a CI pipeline to validate config files for Open Telemetry Collectors (OTCs). It includes a bash script that checks a config file and deletes the directory if the config check passes, and also uses getopts to parse command line arguments and assign values to them. Additionally, I talk about the Shellcheck VS Code Extension which can be used to quickly fix linting errors.

Background

For the past 2 weeks, I've been continuing my task of setting up our CI pipeline to support config file validation for our Open Telemetry Collectors (OTCs). This is a continuation of the work I described in my last blog post.

On the surface - this task seems pretty easy:

You build an Open Telemetry Collector (OTC)
You grab a config file
You feed the config file into the OTC using the validate command like so: opentelemetry validate --config which exits with 0 if valid; and
You do this in your CI (Continuous Integration) pipeline to ensure that the config files you're going to be deploying with, are valid

Easy peasy right?

Not quite.

`Problems & Work to be Done`

There were a couple of problems and todo items that arose:

What do you do when the config files are not statically laying around on disk, but are dynamically generated at deploy-time using Helm? Solved! You can read how we did this with a nifty yq snippet here: https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery
How do you get your CI system to do the things (build the OTC, template out the Helm files)?

As with most things, I figured the first step would be to try doing this on my laptop first:

💡 If I can get this working on my laptop via some scripts, I can then tweak those scripts to work on CI.

Enter, Bash.

`Bash baby, Bash.`

I decided to set this up through a system of scripts:

checks/run_all_checks.sh which would enumerate the environments that my-service was running in, and then based on each environment would run:
checks/run_checks.sh which would:
- Create a checks/test_generated_configs/ directory
- Run the helm template command from above and render out all the config files into that directory
- Iterate through each of the config files and run them through the OTC binary (found at a specific file location) with the validate flag.
- The script would exit printing out the error and with a non-zero code if validation failed.
- Cleanup by deleting the checks/test_generated_configs directory on exit.

This post will centre around run_checks.sh as that is the meat of what I was working on, and here's what I learned.

`The Flow`

I like to set up my bash scripts as follows (there are probably some sort of conventions I should be following, but this has served me well so far).

`Setting up for Success`

Handle errors and let me know when things are going wrong.

Shout out to this amazing gist that explains it in great detail - but we started with this to set ourselves up for success (do you see what I did there?) This line does three things:

-e - tells Bash to bail out immediately on any non-zero exit codes (errors)
-u - tells Bash to bail if an unset variable is referenced (common in substitution operations)
-o pipefail - makes a pipeline fail if any command in it fails, not just the last one (i.e. "fail as a team")
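The line itself is not reproduced in this text; going by the three flags described, it is presumably set -euo pipefail. The sketch below shows what each flag changes by running tiny snippets in a fresh bash process, with and without the flag. The try helper and the variable name definitely_not_set are made up for the illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# try SNIPPET
# (hypothetical helper) Run SNIPPET in a fresh bash process and report
# whether that process exited successfully.
try() {
  if bash -c "$1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "failed"
  fi
}

try 'false; true'                            # no -e: the failure is ignored
try 'set -e; false; true'                    # -e: bail at the first failure
try 'echo "$definitely_not_set"'             # no -u: expands to an empty string
try 'set -u; echo "$definitely_not_set"'     # -u: unset variable is an error
try 'false | true'                           # no pipefail: last command decides
try 'set -o pipefail; false | true'          # pipefail: any failing stage fails
```

Each pair prints ok for the lenient default and failed once the flag is switched on.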

For me, I like to set up my variables next as follows:

One thing I learned here was the use of debug=${DEBUG-0} which is essentially variable instantiation with a default.

This line says: set debug to the value of $DEBUG (which might be provided at runtime), and if DEBUG is not set at all, default to 0.
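One subtlety worth seeing in action: ${DEBUG-0} only falls back to the default when DEBUG is unset, whereas the closely related ${DEBUG:-0} also falls back when DEBUG is set but empty. A small sketch (show_defaults is just an illustrative name):

```shell
#!/usr/bin/env bash
set -euo pipefail

# ${VAR-default}  : default only when VAR is unset
# ${VAR:-default} : default when VAR is unset OR empty
show_defaults() {
  echo "dash=${DEBUG-0} colon-dash=${DEBUG:-0}"
}

unset DEBUG; show_defaults   # DEBUG is unset
DEBUG="";    show_defaults   # DEBUG is set but empty
DEBUG=1;     show_defaults   # DEBUG is set to 1
```

Both forms are safe under set -u, since supplying a default is exactly what stops the unset-variable error.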

`Functions`

Personally, the next thing I like to set up in my script is the functions. To better illustrate how I built this up - I'll show both the functions and their invocation.

This is where the logic comes into play. From the functionality above, I've personally broken them down as follows:

0 - Bootstrapping and Parse Args
1 - Generate Config Files
2 - Validate each Config File
3 - Cleanup

This helps us build our skeleton. We can then proceed with building out.

`0 - Bootstrapping and Parsing Args`

Determine which environment this script should generate and validate configuration files for.

#### MAIN
# 0. Bootstrapping & Parse Args
while getopts de: flag
do
    case "${flag}" in
        d)
          debug=1
          ;;
        e)
          environment=$OPTARG
          ;;
        \?)
          usage
          exit 1
          ;;
    esac
done
shift $((OPTIND -1))

if [[ $environment == "" ]]; then
  usage
  echo >&2 "error: please provide the -e  option (staging, development, public)"
  exit 1
fi

This uses getopts to parse command line arguments and allows you to use their short codes (i.e. -d for debug, -e for environment).

This is done in a while loop with a case statement (which is kinda like a switch) to assign values to their arguments.

Note the \?) line. This tells the script that if a flag from the command line is neither d nor e, it should invoke the usage function and exit.
Note the shift $((OPTIND -1)) at the end which:

removes all the options that have been parsed by getopts from the parameters list, and so after that point, $1 will refer to the first non-option argument passed to the script.

This means that if you have more to your command, like run_checks.sh -e staging -d foo bar baz, then foo, bar and baz move up to the front of the "line" into positions $1, $2 and $3.
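To see the shift in action, here is a minimal sketch of the same getopts loop wrapped in a function so it can be called several times. The wrapper name parse_args and the rest= output are illustrative; OPTIND is made local so each call starts parsing from the first argument.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_args ARGS...
# (illustrative wrapper) Same getopts loop as above, but it prints what it
# parsed plus whatever positional arguments are left after the shift.
parse_args() {
  local OPTIND=1 flag debug=0 environment=""
  while getopts de: flag; do
    case "${flag}" in
      d) debug=1 ;;
      e) environment=$OPTARG ;;
      \?) return 1 ;;
    esac
  done
  # Drop everything getopts consumed; $1 is now the first non-option argument.
  shift $((OPTIND - 1))
  echo "debug=$debug environment=$environment rest=$*"
}

parse_args -e staging -d foo bar baz
```

Here getopts consumes -e staging and -d, so after the shift the remaining arguments are foo, bar and baz.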

if [[ $environment == "" ]]; then
  usage
  echo >&2 "error: please provide the -e  option (staging, development, public)"
  exit 1
fi

If no environment is specified, usage is invoked and the error above is logged to stderr.

`So what's usage?`

#### FUNCTIONS
usage() {
  echo >&2 "This tool will validate the configuration files for our my-service OTC's"
  echo >&2 "See run_all_checks.sh to loop over every env/cluster target."
  echo >&2
  echo >&2 "Usage: $0 -e  [-d]"
  echo >&2 "       $0 -e staging"
  echo >&2 "       $0 -e public"
  echo >&2
}

usage is our very helpful message that gets printed on unexpected command line input to instruct the operator on what to do.

`Checkpoint`

This is what our script looks like now.

Next up we had to implement our steps:

`1 - Generate the Config Files`

Based on the environment passed in, generate config files using our yq command.

You'll recall that from our last blog post, we now had a handy dandy way of generating the config files using a yq command with the helm template

helm template -f staging.yaml -s templates/collector.yaml . | yq 'select(.kind == "OpenTelemetryCollector").spec.config' -s '"staging-config-" + $index'

Now we had to "bashify" this to work with environments other than just staging.

First, we'd need to create a location for these files to live, like /test_generated_configs/
Then we'd want to go ahead and render out each of the config files.

mkdir -p "$TEST_ROOT_DIR"/test_generated_configs/"$environment"
pushd "$TEST_ROOT_DIR" &&
  eval "$(make_helm_template_command)"
popd

This is where we make use of pushd and popd to quickly CD in and out of our specific directory to run our command.

We run the command by using eval "$(make_helm_template_command)"

That's actually from our function here:
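The original function is not reproduced in this text. As a rough, hypothetical sketch of what such a function could look like, the version below simply generalises the staging one-liner to any environment and prints the pipeline as a string for eval. Taking the environment as an argument and using printf %q for quoting are assumptions made for the illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_helm_template_command ENVIRONMENT
# Hypothetical reconstruction: prints (does not run) the helm | yq pipeline
# for one environment, as a string that can be handed to eval.
make_helm_template_command() {
  local environment="$1"
  local select_expr='select(.kind == "OpenTelemetryCollector").spec.config'
  local split_expr="\"${environment}-config-\" + \$index"

  # %q shell-quotes each argument so it survives the eval unchanged.
  printf 'helm template -f %q -s templates/collector.yaml . | yq %q -s %q\n' \
    "${environment}.yaml" "$select_expr" "$split_expr"
}
```

Called as eval "$(make_helm_template_command staging)", this would run the same command as the staging one-liner shown earlier.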

Once this is complete, there will be config files generated like this:

`2- Validate each Config File`

For each config file, pass it into the validate command and check that it's valid.

Due to our rollout of M1 laptops amongst our team, we've noticed a mismatch in the binaries we have to build for running locally and in our infrastructure. Specifically that M1 laptops use arm64 and our infra (like CI) uses amd64. Because of this, we have our binaries stored in arch-specific directories like dist/arm64 or dist/amd64.

Using arch=$(uname -m | sed 's/x86_64/amd64/;s/arm.*/arm64/') is a slick one-liner to figure out which arch is being used in the invocation of the script.
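To make the mapping concrete, here is the same sed expression wrapped in a small function (normalize_arch is an illustrative name) so it can be fed sample values instead of the live uname -m output:

```shell
#!/usr/bin/env bash
set -euo pipefail

# normalize_arch MACHINE
# Map a `uname -m` value onto the dist/<arch> directory names used above.
normalize_arch() {
  sed 's/x86_64/amd64/;s/arm.*/arm64/' <<<"$1"
}

arch="$(normalize_arch "$(uname -m)")"
echo "binary dir: dist/${arch}"
```

One caveat: Linux on 64-bit ARM reports aarch64, which neither substitution touches, so it would pass through unchanged. That is fine for the laptop (arm64) and CI (x86_64) cases described here, but worth knowing if the script ever runs on ARM Linux.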

`3 - Cleanup`

Get rid of our temp files.

Finally, since these config files will no longer be used (and in fact, are generated at deploy time), we want to ensure we clean up our mess. To do so we use the cleanup function.

    echo "$file config check passed "
done
cleanup

Which looks like this:

There, we tell it not to clean up the files if we're in debug mode (so we can examine them after the fact), and to delete the directory otherwise.

Note the trap EXIT which is known as an "exit trap". This tells the bash script to always run this function whenever the script exits for any reason. This is great because it ensures that our config file directories will be deleted whenever the script ends or even if it errors out.
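A minimal sketch of the pattern, assuming a debug flag of 0 or 1 as described earlier. The function name run_checks_demo and its arguments are made up for the illustration; the "script" runs in a subshell so the trap fires when that subshell exits, even though it exits with an error.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_checks_demo DEBUG WORKDIR
# (illustrative) Creates WORKDIR, registers an EXIT trap, then fails on
# purpose. The trap removes WORKDIR unless DEBUG is 1.
run_checks_demo() {
  local debug="$1" workdir="$2"
  (
    cleanup() {
      if [ "$debug" -eq 1 ]; then
        echo "debug mode: keeping $workdir" >&2
      else
        rm -rf "$workdir"
      fi
    }
    trap cleanup EXIT

    mkdir -p "$workdir"
    exit 1   # simulate a failed validation; cleanup still runs
  ) || true
}
```

Calling run_checks_demo 0 some/dir leaves nothing behind, while run_checks_demo 1 some/dir keeps the directory for inspection.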

Finally, our whole script looks like this:

`Bonus`

In trying to get my scripts uploaded, I noticed that our CI pipeline's linter kept on yelling at me for some bash-related things. Turns out that these were all coming from a linting step that used shellcheck.

To fix these errors - I downloaded the shellcheck VS Code Extension which is fantastic because, not only will it give you an explanation of the issue within your IDE, it will allow you to quickly fix most issues as well!

`Conclusion`

Sharpening your Bash skills can greatly improve your ability to generally get things done as an engineer. Specifically in SRE - it helps with things like managing CI pipelines and working with tools like Open Telemetry Collectors. By breaking down tasks into manageable functions, using error handling, and leveraging helpful tools like getopts, you can create efficient and maintainable scripts. Additionally, incorporating linters like shellcheck and its VS Code Extension can help you quickly identify and fix errors, ensuring your scripts are reliable and robust.



Understanding Helm Templates and Utilizing YQ for YAML Parsing Mastery
Thilina Ratnayake — Mon, 26 Jun 2023 03:04:54 GMT
This week I've been finishing up a task that involves updating our CI/CD pipeline to make use of a feature available in the newest release of the Open Telemetry Collector to validate configuration files. In doing so, I got a chance to get up close and personal with Helm templating and see how the sausage (a generated K8s manifest) was made. Along the way, I also learned a cool trick with the tool yq which allowed me to extract a specific set of data from a YAML file.
Come along with me and learn how I learned to decipher the flow of data and variables in the Helm chart, and learn how I used yq to extract exactly the data I need from a pile of YAML.
Background
My team runs a couple of open telemetry collectors (OTCs) to ingest a bunch of telemetry.
These OTCs read their config from a configuration file before starting up and doing their job of collecting whatever they're meant to collect.
The configuration files are built (per environment) and passed in at deploy time by the CI/CD pipeline as the last step.
The configuration files are built by Helm and are passed in as a Helm chart.
Open Telemetry Collector Configuration files look like this:
However, a smol hitch with this setup when used with Helm templating is that (1) the configuration files passed in at deploy-time are specific for the environment (i.e. staging might have a different config than public) and (2) we don't catch a bad config until it's too late (when it's deploy time).
Goal
To update our CI/CD pipeline so that we can validate our configuration files to catch errors early, preferably on a PR whenever the configuration files (or underlying template files) are changed.
Support
One thing that helped us on this quest was the (at the time) pending v0.80 release of the Open Telemetry Collector which would bring support for the validate command to validate configuration files.
In v0.80 - you could provide a config file to the validate command like so:
Which will output a zero exit code (exit successfully) if the config file is valid.
Mission
Update the CI/CD pipeline to use the validate command and catch bad configuration files early. This could be done in two places:
PR Builder on Code Changes
CD Pipeline on Infrastructure Deploys.
Problem:
While it's easy to use the validate command to check a single config file, it turns out there are a lot of configuration files that we need to check. Think the number of environments * number of zones per environment, and we're up to about 12 configuration files.
The configuration files are dynamically built using Helm as part of a Helm chart. This means (1) we need to validate using the same config that's provided at runtime and (2) we need to generate those files for the validate command as they don't exist on disk.
A Quick Refresher
An Open Telemetry Collector config file simply looks like this:
We just need to get the config file(s) so we can use it with the validate command for the Open Telemetry Collector binary - to determine if the config is good or not:
Which exits with a zero exit code if the config file is valid, and a non-zero one otherwise.
Okay, so let's get these Config Files
Well, since they're generated via Helm at runtime, we will need to use helm template command to generate them.
This is where I was at the edge of my comprehension - because I didn't understand how all the values were passed in and through the template. So I went on a side quest to learn how it all fits together.
It's all about the flow
With regards to helm charts and templating, the data flows like this:
Environment-specific values from the environmental yaml files (i.e. staging.yaml) are fed into the Helm chart (i.e. templates/*.yaml)
templates/*.yaml also makes use of the zonalbaseconfig.yaml file.
Helm then generates / outputs a set of YAML that can be fed to Kubernetes (a manifest) to apply.
The environment files may look something like this:
Which then feeds into the template files. Specifically, templates/collector.yaml which defines the manifest for each collector.
And since our staging.yaml doesn't specify a config: block, the Helm template will make use of what's in the $baseConfig variable as the template. Note that we know from earlier up, $baseConfig reads from zonalbaseconfig.yaml which looks like this:
If you've been following the bouncing ball, you'll notice that the flow looks like this:
Now that we understand the flow, we know that the template that generates these files is templates/collector.yaml.
And so, we can see what Helm generates out by running:
helm template -f staging.yaml -s templates/collector.yaml . > templated.yaml
Which gives us something like this:
You'll notice that these are the entire manifests for all the zonal collectors we've specified in the environment file (i.e. staging.yaml) which means in addition to the config block - it also includes:
The Service Accounts that our template probably generates further down in collector.yaml; and
The rest of the manifest which we don't need (like the metadata etc).
So now we're in a better state - we have the data we need plus extra, but how do we refine it to get just the config blocks?
How do I get just the config files?
This is where yq comes in handy.
YQ is a powerful command-line tool for processing YAML files, similar to how JQ works with JSON. It allows users to query, filter, and manipulate YAML data easily and efficiently. With YQ, you can extract specific data, transform the structure, and even merge multiple YAML files.
Since the helm template command outputs a single YAML stream containing multiple documents (you can see this with the --- separator), I essentially need to do the following:
In the whole output, grab the documents that contain an internal value of kind: OpenTelemetryCollector
For each of those documents, grab only the spec.config block. (These are what get read in as config files at runtime.)
After grabbing all the config blocks, write them to disk with predictable file names so that we can then pass them into the validate command.
After much trial and error, behold - the magic 1-liner command that allowed me to do all this:
helm template -f staging.yaml -s templates/collector.yaml . | yq 'select(.kind == "OpenTelemetryCollector").spec.config' -s '"staging-config-" + $index'
Where I now have staging-config-0.yml and staging-config-1.yml in my directory, that look like this:
Next Steps
Now, since I have all my config files available as staging-config-0.yml, staging-config-1.yml and so on, I can easily feed them into the validate command of my Open Telemetry Collector.
The next steps will be updating our CI/CD steps to do so, which will probably be the easier portion of this task.
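As a rough sketch of what such a CI step could look like, the snippet below renders one environment, splits out the config blocks with the same yq expression as above, and validates each resulting file. The function names, the OTELCOL_BIN variable, and the assumption that the values file is named <env>.yaml are illustrative, not our actual pipeline. The binary name (otelcol here) depends on which collector distribution you run.

```shell
#!/usr/bin/env bash
# Sketch of a CI step: render, split, and validate collector configs.
# Assumptions (not from the original setup): values files are named <env>.yaml,
# and OTELCOL_BIN points at the collector binary (defaults to "otelcol").

render_configs() {
  # $1 = environment name, e.g. "staging" (reads staging.yaml)
  # Writes <env>-config-0.yml, <env>-config-1.yml, ... to the current directory.
  helm template -f "$1.yaml" -s templates/collector.yaml . \
    | yq 'select(.kind == "OpenTelemetryCollector").spec.config' \
         -s "\"$1-config-\" + \$index"
}

validate_configs() {
  # Validate every file passed as an argument; return 1 if any of them fails.
  local failed=0
  local f
  for f in "$@"; do
    if "${OTELCOL_BIN:-otelcol}" validate --config="$f"; then
      echo "OK: $f"
    else
      echo "INVALID: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}

check_env() {
  # Render one environment, then validate everything that was written out.
  render_configs "$1" && validate_configs "$1"-config-*.yml
}

# Usage, from the chart directory:
#   check_env staging
```

Checking every file before returning (rather than stopping at the first failure) means a single CI run reports all the broken configs at once.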
Conclusion
In conclusion, using Helm templates and YQ can greatly improve the process of generating and validating Open Telemetry Collector configuration files in a CI/CD pipeline. By understanding the data flow in my Helm templates, and by leveraging YQ's powerful querying capabilities, I was able to extract the necessary config files and use the validate command to catch errors early, ensuring a smoother deployment process.


Speed Up Terraform Debugging Using Terraform Console
Thilina Ratnayake — Fri, 16 Jun 2023 20:39:40 GMT
Want to debug Terraform issues with quicker feedback and instant access to Terraform state? Try Terraform Console - the equivalent of the Python shell, but for Terraform.
Background
What am I doing?
This week I've been working on setting up canary analysis with Argo Rollouts for a service I'll refer to as: my-service. We do this by using Terraform and the way we do it requires having a local variable named local.my-service_pools containing a map of environments and their pools.
The Problem
The pools (zones) that my-service runs in for each environment (production, staging and meta) are read from multiple yaml files and stored in a variable named local.my-service_pools
local.my-service_pools is used by other terraform tooling and therefore must be correct as soon as it's created.
Due to the way that our infrastructure is set up, the variable requires that any pool named us-central1-f be named default, and therefore any mention of us-central1-f would need to be swapped for default in this map.
To tackle this problem, I decided to first read the contents of the file into a temporary variable named local.my-service_zonal_pools_from_yaml which I thought simply looked like this:
The Goal
To create a new variable named local.my-service_pools which uses a for loop to iterate through my temporary local.my-service_zonal_pools_from_yaml and, while doing so, replaces any mention of us-central1-f with default.
A Failed Fix
I used a for loop to make a "copy" of the variable (Terraform doesn't support copying variables) by iterating through each environment, and then through each pool to replace us-central1-f with default
I referred to this as my "transformation" variable.
Which seemed good but... was actually wrong, because the downstream references to that variable kept on yelling at me:
Frustration
The error message doesn't tell me much. Yes, it's a list of strings, but that's what I wanted isn't it?
The feedback loop is too long. Because this is the Terraform for our entire environment's infrastructure and because our Terraform runner (Spacelift) is constantly running plans from PRs across the organization, I have to wait a long time to see the results of my changes via terraform plan.
On Spacelift, I'm waiting for private workers to become available to run my plan; and
On my laptop, it takes 10+ minutes to build because I don't have the state cached (or, more specifically, because the cache keeps changing with every run on Spacelift).
Research
I realized I was getting bitten on syntax / data massaging here - and first I'd need a better, faster, cheaper way to play with the Terraform for quick iteration and results.
And so off to Google I went.
Enter: terraform console
What does Terraform Console do? The terraform console command will read the Terraform configuration in the current working directory and the Terraform state file from the configured backend so that interpolations can be tested against both the values in the configuration and the state file.
With terraform console
I could see the warning I was running into as soon as it started up. (It's my "early-warning" that the code was still incorrect)
I could inspect and see what local.my-service_zonal_pools_from_yaml actually looked like right now!
And then I could compare it against local.my-service_pools to see what my transformation variable instantiation was actually putting out!
The actual problem was that my-service_zonal_pools_from_yaml has each environment's zones in a set (toset()), whereas my-service_pools has those zones in a list (tolist()).
Solution
Surround the for loop with a toset()
Now when I run terraform console there are no errors, which is my first sign that my code is correct.
And, when I compare the two variables in the console, I see that they are both toset()s:
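A minimal, self-contained sketch of the mismatch and the fix. Everything here is a stand-in: the local names (pools_from_yaml, pools_as_list, pools_as_set) and the zone values are hypothetical, not our real configuration. A for expression wrapped in [ ] produces a list-like value, so wrapping it in toset() is what brings the type back in line with the source set. The helper pipes a single expression into terraform console, which reads from stdin when used non-interactively; type() is only available inside the console. Depending on your setup, you may need to run terraform init in the directory first.

```shell
#!/usr/bin/env bash
# Write a tiny stand-alone Terraform config that reproduces the set-vs-list
# mismatch, then inspect it with terraform console.
# All local names and zone values are hypothetical stand-ins.

write_demo_config() {
  # $1 = directory to write main.tf into
  cat > "$1/main.tf" <<'EOF'
locals {
  # Stand-in for the YAML-derived value: each environment maps to a SET of zones.
  pools_from_yaml = {
    staging    = toset(["us-central1-f", "us-east1-b"])
    production = toset(["us-central1-f", "us-west1-a"])
  }

  # The failed fix: the inner [for ...] yields a list-like value, not a set.
  pools_as_list = {
    for env, zones in local.pools_from_yaml :
    env => [for z in zones : z == "us-central1-f" ? "default" : z]
  }

  # The working fix: wrap the inner for expression in toset().
  pools_as_set = {
    for env, zones in local.pools_from_yaml :
    env => toset([for z in zones : z == "us-central1-f" ? "default" : z])
  }
}
EOF
}

inspect_expr() {
  # $1 = directory containing the config, $2 = expression to evaluate
  printf '%s\n' "$2" | terraform -chdir="$1" console
}

# Usage:
#   dir=$(mktemp -d) && write_demo_config "$dir"
#   inspect_expr "$dir" 'type(local.pools_as_list)'
#   inspect_expr "$dir" 'type(local.pools_as_set)'
```

Comparing the two type() results side by side makes the difference visible without waiting for a full terraform plan.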
Caution
While you're in the console, you acquire a state lock 🔒 that could prevent other people from using Terraform.
Releasing state lock. This may take a few moments...
The console holds a lock on the state, and you will not be able to use the console while performing other actions that modify state.
Conclusion
In conclusion, terraform console is a powerful tool for debugging and fine-tuning Terraform code, providing quick feedback and reducing the time spent waiting for terraform plan results. It is especially useful for working with loops, experimenting with new syntax, and resolving structural issues. However, be cautious when using the console as it acquires a state lock, which may prevent others from modifying the state.


Cheating with Terraform State Show
Thilina Ratnayake — Mon, 12 Jun 2023 04:35:08 GMT
Background
One of the cards I took on last week was the task of adding a low-no-data alert for one of our services. This was an Action Item from an incident post-mortem when we were manually alerted to one of our endpoints being down. Having this alert would ensure that we are notified more quickly should this happen again.
Using Lightstep, it was pretty easy to create the alert based on a metric that we had. (We use a different metric IRL for traffic, but for the sake of this tutorial, I'm using scrape_samples_scraped)
But I wasn't quite done yet.
What about making the same alert in all the other environments? I would need to recreate this alert for all of those environments by hand.
What happens if we have the worst day ever and all of our alerts are gone? I would need to remember the queries and details to recreate them.
This is where Infrastructure-as-Code comes in handy, and specifically Terraform - which allows us to codify our resources such as lightstep_alerts. By using Terraform, we can:
Focus on being DRY (Don't Repeat Yourself) - Create one alert and programmatically generate alerts in our other environments.
Have our alerts codified - Be able to recover from a disaster and restore to our current state.
And so I set out to write the Terraform with a file that looked like this:
resource "lightstep_alert" "low-no-requests-api-terraform" {}
Staring at this blank resource stanza made me think:
But wait...I already created the alert in the UI..do I now need to flip back and forth between the Terraform provider documentation and re-create the alert in Terraform?
This would be like the equivalent of getting a transparency paper and attempting to redraw a reference image on a different piece of paper.
I thought this was the only way until I learned a very valuable pattern from one of my very wise coworkers.
[You already have the alert,] Why don't you just import the alert and then use state show to get the code for it?
I..had never thought of that. But, I tried it and it worked! And was a huge time saver.
Here's the tutorial.
Tutorial
Set up Terraform Provider
First up was to ensure I had set up my Terraform provider to use my Lightstep organization and an api_key with a minimum level of Member
provider "lightstep" {
  api_key      = var.ls_api_key
  organization = "LightStep"
}
Get ID of previously created Resource
To import the lightstep_alert, I needed to grab the ID of the existing alert. (This can be easily gathered from the browser, see below).
Import the Resource
With the alert ID in hand, I simply had to run terraform import lightstep_alert.<resource_name> <project_name>.<alert_id>.
In my example it was: terraform import lightstep_alert.low-no-requests-api dev-tratnayake.mXgC4cG1f
terraform import lightstep_alert.low-no-requests-api dev-tratnayake.mXgC4cG1f
lightstep_alert.low-no-requests-api: Importing from ID "dev-tratnayake.mXgC4cG1f"...
lightstep_alert.low-no-requests-api: Import prepared!
  Prepared lightstep_alert for import
lightstep_alert.low-no-requests-api: Refreshing state... [id=mXgC4cG1f]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
Show the imported Resource
Now here's the magic part where you can use Terraform like a 🔬 microscope 🔬 to break down an existing resource into its Terraform. With the resource imported into Terraform state, I could simply use terraform state show to output the alert in Terraform.
terraform state show lightstep_alert.low-no-requests-api
# lightstep_alert.low-no-requests-api:
resource "lightstep_alert" "low-no-requests-api" {
    id           = "mXgC4cG1f"
    name         = "Low-no-data-alert (UI)"
    project_name = "dev-tratnayake"
    type         = "metric_alert"

    expression {
        is_multi   = false
        is_no_data = false
        operand    = "below"

        thresholds {
            critical = "1"
            warning  = "5000"
        }
    }

    query {
        display        = "line"
        hidden         = false
        hidden_queries = {}
        query_name     = "a"
        query_string   = "metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"
    }
}
Use the Code from the imported Resource
With the alert now codified, I could simply copy-pasta it as a new alert in my Terraform file as follows (making sure to strip out the id and type, as those are computed when the Terraform is applied).
resource "lightstep_alert" "low-no-requests-api-terraform" {
    # id           = "mXgC4cG1f"
    name         = "Low-no-data-alert (Terraform)"
    project_name = "dev-tratnayake"
    # type         = "metric_alert"
    [... Rest of the Terraform from `terraform state show` here ... ]
}
Apply the Terraform to create the Resource
Finally, I ran a terraform apply which would:
Create the new alert (low-no-requests-api-terraform); and
Delete the old alert created via UI (low-no-requests-api) because that's not in the Terraform file, and therefore the configuration no longer matches the Terraform state.
terraform apply
lightstep_alert.low-no-requests-api: Refreshing state... [id=mXgC4cG1f]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # lightstep_alert.low-no-requests-api will be destroyed
  # (because lightstep_alert.low-no-requests-api is not in configuration)
  - resource "lightstep_alert" "low-no-requests-api" {
      - id           = "mXgC4cG1f" -> null
      - name         = "Low-no-data-alert (UI)" -> null
      - project_name = "dev-tratnayake" -> null
      - type         = "metric_alert" -> null

      - expression {
          - is_multi   = false -> null
          - is_no_data = false -> null
          - operand    = "below" -> null

          - thresholds {
              - critical = "1" -> null
              - warning  = "5000" -> null
            }
        }

      - query {
          - display        = "line" -> null
          - hidden         = false -> null
          - hidden_queries = {} -> null
          - query_name     = "a" -> null
          - query_string   = "metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum" -> null
        }
    }

  # lightstep_alert.low-no-requests-api-terraform will be created
  + resource "lightstep_alert" "low-no-requests-api-terraform" {
      + id           = (known after apply)
      + name         = "Low-no-data-alert (Terraform)"
      + project_name = "dev-tratnayake"
      + type         = (known after apply)

      + expression {
          + is_multi   = false
          + is_no_data = false
          + operand    = "below"

          + thresholds {
              + critical = "1"
              + warning  = "5000"
            }
        }

      + query {
          + display      = "line"
          + hidden       = false
          + query_name   = "a"
          + query_string = "metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

lightstep_alert.low-no-requests-api: Destroying... [id=mXgC4cG1f]
lightstep_alert.low-no-requests-api-terraform: Creating...
lightstep_alert.low-no-requests-api: Destruction complete after 1s
lightstep_alert.low-no-requests-api-terraform: Creation complete after 2s [id=mw5HVFtch]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed
The end result is the alert that I created in the UI, now codified and managed by Terraform.
Conclusion
There are many times that Terraform is not the first thing we reach for when creating a resource. Many times we may reach for a GUI or specialized tool to create resources, especially during rapid prototyping. Using terraform import and terraform state show allows you to keep that momentum by using Terraform to codify the resources you've already created (in a more user-friendly tool), which can then easily be modified to fit your needs.
Doing this is a huge time-saver that almost felt like cheating, and I will definitely be using this in my future Terraforming travels 🚀


🔐 How-To Securely Work With Secrets During Development
Thilina Ratnayake — Mon, 27 Feb 2023 00:34:03 GMT
If you're working on any software projects that require talking to other services, chances are that you probably have to make use of secrets. The most common type of secrets you may have run into -- are passwords and API keys.
There are others as well, and Cyberark does a good job of defining a secret as a:
[...] private piece of information that acts as a key to unlock protected resources or sensitive information in tools, applications, containers, DevOps and cloud-native environments.
Some of the most common types of secrets include:
Privileged account credentials
Passwords
Certificates
SSH keys
API keys
Encryption keys
As you may have learned, perhaps the hard way through a security incident, secrets can't be treated like just any other type of data. Because of their ability to access resources; they must be handled with care and with security in mind.
🧐 How Do You Store Secrets During Development?
If you've googled this question, you may have run into a couple of articles that suggest the following options:
🐚 Provide Secrets as Shell Variables.
The assumption is that your code (i.e. run-server.sh) is set up to look for a key as a shell variable. For example:
cat run-server.sh
#! /bin/bash
echo "OUTPUT: API Key is ${API_KEY}"
In this approach, you can provide the secret to your program at run-time by running: API_KEY=<your-api-key> ./run-server.sh which provides the following output:
API_KEY=my-api-key ./run-server.sh
OUTPUT: API Key is my-api-key
The drawback of this approach is that your key has now been included in plaintext, within your shell history. If an attacker were to ever compromise your machine, the secrets would be available by running a simple history command.
1815  API_KEY=my-api-key ./run-server.sh
🌲 Provide Secrets as Environment Variables
For this approach, we do a bit better setting the secret as an environment variable, which our program can read from.
Setting the actual environment variable can be done in a number of ways, such as reading from a file on the system. (i.e. api-key-secret.txt)
export API_KEY=$(cat api-key-secret.txt)
./run-server.sh
OUTPUT: API Key is my-env-var-api-key
While the secret is no longer printed out to shell history, the secret is contained within a file that must now be secured appropriately.
You need to ensure that the file containing the secret key isn't accidentally checked into version control (e.g. GitHub).
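One way to reduce that risk, sketched below under a few assumptions: restrict the file to your own user and add it to .gitignore. The helper name protect_secret_file is made up for illustration, and it assumes you run it from the root of the repository. Note that .gitignore only helps for files git isn't already tracking.

```shell
#!/usr/bin/env bash
# Restrict a secret file to its owner and make sure git ignores it.
# protect_secret_file is a hypothetical helper; run it from the repo root.
protect_secret_file() {
  local file="$1"
  # Owner can read/write; nobody else can read the file.
  chmod 600 "$file"
  # Append to .gitignore only if the exact entry is not already present.
  if ! grep -qxF -- "$file" .gitignore 2>/dev/null; then
    printf '%s\n' "$file" >> .gitignore
  fi
}

# Usage:
#   protect_secret_file api-key-secret.txt
```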
📁 Use A .env File
Some programming languages recommend the pattern of using a dotenv file, where programs can be instructed to look up values from a specific file (e.g. development.env). However, this approach has the same drawback as above: the secret still lives in a plaintext file that has to be secured and kept out of version control.
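For reference, here is a minimal sketch of the dotenv pattern in plain shell. The file name, key value and the load_env_file helper are made up for illustration; language-specific dotenv libraries do roughly the same thing with more parsing rules. The plaintext file on disk is exactly the drawback described above.

```shell
#!/usr/bin/env bash
# Load KEY=value pairs from a dotenv-style file into the environment.
load_env_file() {
  # set -a marks every variable assigned while sourcing the file for export,
  # so child processes (like run-server.sh) can see them.
  set -a
  . "$1"
  set +a
}

# Demo with a throwaway file (hypothetical name and value).
demo_dir=$(mktemp -d)
printf 'API_KEY=my-dotenv-api-key\n' > "$demo_dir/development.env"
load_env_file "$demo_dir/development.env"

# A child process now sees the variable, just like run-server.sh would.
sh -c 'echo "OUTPUT: API Key is ${API_KEY}"'
```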
🤔 What's A Better Way To Store Secrets
A decent number of moons ago when I worked as a Security Engineer doing security things, one of the items in our portfolio was a Secret Management system named Conjur (coincidentally, also built by Cyberark).
As an enterprise-grade secret management solution, it had a lot of features such as audit logging and access control policies, but one of the coolest features was the ability to do dynamic secret retrieval.
Specifically, it meant that after engineers checked secrets in to Conjur, they could simply (1) refer to their secrets by their secret reference paths and (2) use the Conjur CLI to "wrap" the invocation of their programs, to pull those secrets at run-time.
I remember thinking that this was super cool, and wished I could do that for personal projects with my existing secret manager (1PW). Unfortunately, the main disqualifier for that at the time, outside of licensing costs, was the requirement to run and secure your own secrets server which wasn't really feasible for local development.
Well, fast forward about 5 years and things have now changed!
🤨 What's A Better Way To Store Secrets in Development?
For me, the "development phase" in personal projects is the phase where I want to rapidly prototype my ideas. I don't want to get bogged down by tedious tasks, which is what secret management used to be. I could move fast by using env or shell variables, but I was only a single .gitignore file mistake away from checking in the contents to version control.
I'd always hoped for a better way and thankfully a couple of weeks ago - I came across this fantastic blog post by 1Password when looking into how to access secrets programmatically.
https://blog.1password.com/1password-cli-2_0/
This blog post goes into using the 1Password CLI to essentially achieve what I wanted to do those years ago. It enables developers to:
Add secrets to their password vault
Get the references to those secrets
Fetch secrets dynamically at runtime.
How does it work?
Download & install the 1Password CLI + 1Password 8
Add your secret(s) into 1Password. In my case my secret value for my-api-key is: secret-key
Grab the references for the secrets you're interested in.
op item get my-api-key --format json
{
  "id": "<>",
  "title": "my-api-key",
  "version": 1,
  "vault": {
    "id": "<>",
    "name": "Blog-post-development"
  },
  "category": "API_CREDENTIAL",
  "last_edited_by": "[...]",
  "created_at": "2023-02-26T23:14:43Z",
  "updated_at": "2023-02-26T23:14:43Z",
  "fields": [
    [...]
    {
      "id": "credential",
      "type": "CONCEALED",
      "label": "credential",
      "value": "secret-key",
      "reference": "op://Blog-post-development/my-api-key/credential"
    },
Which in this case is: op://Blog-post-development/my-api-key/credential
Include the secret reference and wrap your execution command with op run
Before, I used to have a Makefile to run my server:
cat Makefile
run:
	./run-server.sh
Now I simply wrap the run target with op run and my secret reference as follows:
cat Makefile
run:
	API_KEY="op://Blog-post-development/my-api-key/credential" \
	op run ./run-server.sh
When I execute make run, it will first:
Prompt me for a fingerprint (via Touch ID on Mac, or master password everywhere else); and
Retrieve the secret for use in the command.
make run
API_KEY="op://Blog-post-development/my-api-key/credential" \
    op run ./run-server.sh
OUTPUT: API Key is 
Note that 1PW's CLI is smart / safe enough to conceal the secret value. If you want to override this option and display the secret, you can run the same command with the --no-masking flag.
For example:
cat Makefile
run:
	API_KEY="op://Blog-post-development/my-api-key/credential" \
	op run --no-masking ./run-server.sh
make run
API_KEY="op://Blog-post-development/my-api-key/credential" \
    op run --no-masking ./run-server.sh
OUTPUT: API Key is secret-key
Why is this better?
Secrets are stored and retrieved from a single place. The single source of truth for your secret's value is in 1Password. This means that if you read this secret from multiple places, you only need to update it once. It also means that if your secret gets "popped", you only need to change or "rotate" it in one place.
Uses a password or biometric to authenticate access. If you're using a Mac, you'll be prompted for Touch ID. Anywhere else, you'll be prompted for your master password.
Secrets are not stored in code. You don't have to worry about accidentally checking in secrets to version control (e.g. GitHub).
You don't have to worry about the security of where your secrets are stored. That's managed by an organization (1PW) that specializes in secret storage and has a vested interest in the security of your secrets.
You don't need to do any other lift to get this working. There's no need to set up IAM with AWS or GCP; it just uses 1Password.
Further Extensibility
1Password can group secrets into vaults which means you can also extend this to your colleagues & partners. When you add a secret to a vault and add them as members to your vault, they can also use secrets in the same manner from their 1password accounts.
1Password also has other shell-plugins to securely authenticate with services right from your shell.
There's also a way to integrate CI/CD as well.
Conclusion
Overall, I am extremely excited to see that these sorts of features are slowly trickling down and becoming more accessible for more use cases. The more we do to make security "easier", the better that software becomes for all.


Tutorial Notebook: A simple CRUD app with Go
Thilina Ratnayake — Fri, 03 Feb 2023 17:12:25 GMT
Tutorials Notebook is a blog series where I write-up my thoughts & lessons learned after completing a tutorial. A large credit should go to the authors who created the tutorials in the first place.
Tutorial: https://codewithmukesh.com/blog/implementing-crud-in-golang-rest-api/ by Mukesh Murugan
Lately, a couple of things have been motivating me to learn and get better with GoLang.
There's a project that I've wanted to build for a friend since last year;
I'm ending up working more and more in our codebase at work which is all written in Go.
So I figured I'd start from scratch and work towards the goal of building that app for my friend.
Breaking down that task - one thing I'll need to do is create an API that can handle incoming requests. Today, I completed a tutorial to create a very basic app that would enable a user to Create, Retrieve, Update and Delete (CRUD) a product from a database.
Things I Learned
MySQL. It'd been a while since I'd created a database. Helpful commands on mac:
brew install mysql - Install the MySQL server on mac.
brew services start mysql - Start the MySQL server
mysql -u root - Login with username root and no password (default)
create database <database_name>;
use <database_name>;
show tables;
VS Code Shortcut: Typing hand (+ enter) in VS Code will automatically create an HTTP response handler for you.
Go Pointers
You can define what an object is like. In this example, a product consists of 4 pieces of information.
type Product struct {
	ID          uint    `json:"id"`
	Name        string  `json:"name"`
	Price       float64 `json:"price"`
	Description string  `json:"description"`
}
A pointer to that struct is used in this HTTP API handler for POST ing (creating) products.
func CreateProduct(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	var product entities.Product
	json.NewDecoder(r.Body).Decode(&product)
	database.Instance.Create(&product)
	json.NewEncoder(w).Encode(product)
}
Where the important lines are:
var product entities.Product - this line creates an "empty" product whose fields hold the zero values for their types (0 for numbers, "" for strings), as defined in the struct.
json.NewDecoder(r.Body).Decode(&product) - In this line, we use the memory address of variable product as denoted by the & to "fill" it with relevant data.
You can even make cURL commands right from VS Code using the REST Client extension.



Friend (or Foe) Request?
Thilina Ratnayake — Mon, 09 Jan 2023 04:51:23 GMT
ChatGPT, Social Engineering, and the moving goal posts in the AI arms race.
Scene
You have a new connection request.
You awake in the morning to find a new notification on your phone. It's a connection request from someone on LinkedIn. You've never heard of this person before, but considering that it's a social network for working professionals, you start going through your verification process.
Do I know this person?
If not, is it someone that I should add?
Are they a threat? Are they a bot?
Most of us have been trained to answer these questions from our first days on the internet ("don't talk to strangers!"), but that last one is something that's relatively new in our networked interactions. In this new misinformation age, where troll-farms & criminal outfits operate with the intent of everything from derailing national elections to flaming competitors' products, we've slowly been trained to look for markers to verify the validity and humanity of our connections:
Is this a human? Do they walk like one? Do they talk like one?
But what happens when a tool like ChatGPT, a tool that can generate everything from poems to manifestos based off a single prompt, continues to proliferate into the hands of millions?
They say 2023 is the year that AI will continue to disrupt more industries. What does it mean for our personal security on the Internet?
If you've been on the internet in the last couple of weeks, you've probably seen the explosion of Artificial Intelligence (AI) based tools.
And before you ask, no, I am not using ChatGPT or any of those tools as a cheeky way to write this blog post. This post is written from a 100% certified, Grade-T Human meatsack.
A recap:
In the final throes of 2022, a handful of AI based tools shot into popularity.
Some are focused on art, such as dall-e, starryai and lensai.
Dall-e will take in user prompts (An astronaut, riding a horse, in a photorealistic style) and generate artwork that fits those prompts.
A snippet from Dall-e 2's Home Page. This page also allows the reader to try clicking on the other variations of the prompts to see how the artwork changes.
The latter two, starryai and lensai, are more focused on refining images provided as input. You may have seen this recently in your social networks with friends posting impressive artistic self portraits.
Others, such as ChatGPT, are more focused towards using AI with text. Specifically, using language processing models for things like text generation, language translation, text summarization and sentiment analysis.
Some notable examples include people asking ChatGPT to create custom poems, write (and fix their) code and even generate copy for their marketing or blog posts (again, not me).
As an example for this blog post, here's a prompt that was noodling in my head from the most recent podcast I'd listened to.
So, how does an AI helping provide custom output to prompts pose a security risk?
Security Considerations
Security is an arms race. For every attack or sword 🗡, there will eventually be a defence or shield🛡
..until the attackers come up with a bigger sword which will require defenders to build a better shield and so on and so forth.
There are many ways to attack a target that is connected to the internet. However, one of the most dangerous and bountiful is exploiting the weakest link: humans. This method of attack is known as social engineering.
[...] Social engineering is the psychological manipulation of people into performing actions or divulging confidential information. A type of confidence trick for the purpose of information gathering, fraud, or system access, it differs from a traditional con in that it is often one of many steps in a more complex fraud scheme.
Carrying out this sort of attack requires an entrypoint or vector. Whereas in less human attacks this could be exploiting an insecure port or a poorly compartmentalized process, social engineering usually occurs via interactions.
This sort of attack can be stopped at the door by blocking, by default, any interactions from people that are not friends. Thankfully, this appears to be becoming the on-rails setup experience for most new apps.
Attack: Interactions from malicious actors
Defence: Default block interactions from strangers.
However, what if this is not the default? Indeed, the main draw for social networks is the social aspect of connecting with people right?
This is where the friend-or-foe verification process comes into play and where tools like ChatGPT can be problematic.
Most people are quite good at blocking friend requests on more personal social media (e.g. Facebook, Instagram) if they can't recognize the connection on first glance.
The primary verification test for more personal social media is usually:
Do I know this name?
Does this picture look familiar?
The secondary verification test, which occurs based on a looser security posture and upon failure of the primary test, is then to check if there are any common connections.
Does anyone else on my friends list know this person?
That is to say, the minimum acceptable test for verification in more personal social networks is peer verification.
If a bunch of people I know seem to think this person is legit, then they must be.
Attack: Friend requests from malicious actors on personal social networks.
Defence: Verify the validity of the connection by ID via name and picture, OR verify via peer verification.
But what about when it's a setting like LinkedIn? That site is unique because it's a less personal site meant for networking amongst professionals. Just like attending a job fair or conference - it's expected that you may get requests from people that you are not immediately familiar with but who are reaching out for professional reasons. Perhaps that connection request is from someone that's reaching out to head-hunt for a slick new role.
In these less personal settings, most people relax their security posture, lowering the bar for verification. Here, the name and picture check could be bypassed entirely in favour of validation by their network and industry.
Sure, this name and picture might not immediately click, but perhaps I ran into them at a conference. Are they in my industry? And are they connected to any other peers?
The issues from this scenario are as follows:
All it takes is for one bad actor's connection request to be accepted by a person, and they can now start sending connection requests to other members of the original acceptor's professional network.
Every additional connection request that's accepted continues to add credibility.
For this attack, an attacker simply needs to create enough fake profiles (bots) to get through, and they are betting on the fact that the transactional nature of the network will not lead to people reference-checking each other on new connections.
Hey, I see that you accepted a connection request from X, do you know this person?
A defence for this exploit is, obviously, to screen connection requests more carefully.
Sure, this person appears to be known by others in my network - but do they seem to be legitimate?
As an aside, you can see a great example of this type of social engineering in the story of Anna Delvey (the inspiration for the Netflix show Inventing Anna).
She was able to con her way into networking with elite socialites and swindle hundreds of thousands of dollars. She did this by exploiting the transactional nature of those relationships to pass as legitimate.
Well, obviously she's at this big party, someone must have vouched for her.
I don't know her, but she appears to be close with X, therefore she must be legit.
By (1) appearing to belong and (2) counting on everyone trusting that the rest of their peer network had done their due diligence, Anna was able to gain a foothold within the social network and continue exploiting connections for personal gain.
Attack: Friend requests from malicious actors on professional social networks.
Defence: Verify ID via peer acceptance and behaviour.
For most, this is where a credibility test comes into play: trying to determine whether an incoming friend request is legitimate and human.
They are in my industry and professional network: do they act like a member of that network? What are the indicators?
Do they interact with others?
Do they interact with others' posts?
Do they write posts?
In these cases, custom content and interactions are the most reputable indicator.
If I'm assessing the credibility of a stranger that appears to be accepted by my network, interactions and custom content are the best indicators of credibility.
The check is for effort: how much effort could a malicious actor be putting into attacking you? Most of the time, it's little.
Therefore, these are the most reputable indicators because they require the attacker to put in significantly more effort to gain context and create posts and interactions. Whereas fake bot accounts can easily be created, and generic posts can be scripted with a bit more effort, crafting genuine content that can only come from someone that's actually within that industry is considerably harder.
Well, until now.
Enter a tool like ChatGPT. With ChatGPT, an attacker would be able to much more easily create posts, comments and content that is both contextually accurate and very hard to distinguish from a human's.
Imagine that you work in Cloud Infrastructure and you get a connection request from someone. You don't know them, but they appear to be in the same industry.
You see that they are active in writing posts. They appear to be involved with Terraform and have a blog post named "The benefits of using Terraform with AWS".
AI tools can easily create these types of blog posts.
You even see that they appear to leave comments on a peer's post about why Terraform sucks with GCP.
AI tools can, again, easily create the text for these sorts of interactions.
Sure, a more trained Infrastructure Engineer might be able to spot little quirks or issues in the content upon more detailed scrutiny, but could this sort of content pass the first glance?
Unfortunately, I believe the answer is yes. A problematic scenario I imagine is an attacker, armed with cursory knowledge and basic wordsmithing skills, being able to use a tool like ChatGPT to pass the Friend or Foe (or peer) test. Attackers can use these tools to aid in interactions and more easily masquerade as a member of an industry, in order to gain access to a professional network and exploit the people therein. The worst-case scenario is one where the attacker is not even in the loop after the start, and a program is written to script content and interactions using ChatGPT. Essentially, imagine the problem that already exists at scale with low-effort bots, and combine that with the power of ChatGPT.
Seemingly overnight, these AI tools have just become a new tool for attackers to utilize.
Defence
But it's not all doom and gloom. In fact, within a couple of weeks there are already tools being created to detect AI-generated posts, some using AI themselves. One example is GPTZero.
But as with any security posture, I believe the defence is not just one tool, but a layered approach.
How can I strengthen my security posture against social engineering?
Do not accept interactions from anyone that's not a friend or a friend of a friend.
If you receive a friend request from someone that appears to be associated with another connection - do a casual reference check. "Hey, do you know this person?"
If on a less personal network, check the quality of the interactions. (Do these comments make sense? Do they appear to have relevant context?) Perhaps run that content through a tool like GPTZero.
Directly reach out to them to assess validity.
Social Engineering capitalizes on our human tendency to want to connect with others without scrutinizing every detail. The best guard against this is to maintain vigilance, defend in depth and perhaps use the very same technologies (AI) to fight fire with fire.


Using Helm To Include All Files From A Directory In-line
Thilina Ratnayake — Sat, 10 Sep 2022 04:35:51 GMT
Lately I've been working on an interesting project that's required me to learn how to use Helm to include all files within a directory as entries in a Kubernetes (K8s) configmap - which is not as straightforward as one might think.
What Am I Trying To Do?
Run a container in a K8s cluster whose entire job is to spin up and execute a binary file with a specific configuration file as a parameter.
Note: The binary is set up to use a config file that should be mounted in a specific location (i.e. /etc/config/config.yaml)
1st Pass - Mount a single config file, contents pasted in-line.
In K8s, we can make files available to a resource by making use of a configMap.
In regular manifests (plain ol' YAML), you can do the following to add a file to a configMap which can then be mounted as a volume within a container.
1. Specify the ConfigMap with the contents of a single config file.
This is actually exactly the example that's specified in the K8s docs for Add ConfigMap data to a Volume
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  config.yaml: |-
    lorem impsum dolor things.
    foo = bang
    things.
2. Update the pod spec to make use of the configmap and mount it to the desired location.
apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  containers:
    - name: test-container
      image: registry.k8s.io/busybox
      command: [ "/bin/sh", "-c", "ls /etc/config/" ]
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: special-config
  restartPolicy: Never
Note that the volumeMount means that all the keys in the ConfigMap will be mounted as their own files - so in our example, since there is only one element in the data field of the configmap - the file that this will get mounted as is config.yaml within the /etc/config directory. 
However, I thought it was kind of tacky to expect the contents of the configmap to be updated in-line every time. It would be nice if we could decouple this (i.e. if config files could be loaded from elsewhere and be handled differently).
2nd Pass - Mount a single config file - contents inserted from a separate file.
Since it turns out I'm able to make use of Helm (a K8s configuration management and templating tool) for this project, an optimization we did next was to simply read in the contents of a file dynamically (for mounting) at deploy-time instead of including the contents in-line. This means that as long as a deployer has the required file in the right place, no updates are required to the configmap's contents.
The magic snippet being: config.yaml: {{ tpl (.Files.Get ".yaml") . | quote }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  config.yaml: {{ tpl (.Files.Get "super_duper_sweet_config.yaml") . | quote }}
Even better!
Sweet, we're done, right? Nope. While there might only be one configuration file today - the implementation needs to support having multiple configuration files available to be fed into the binary at run-time.
Assume that there is now a directory named /config_files in the top level directory that contains special configuration files. All of these need to be present at start-up for the container, of which one can be provided to the binary.
Some Possible Solutions:
1. Modify the app code (binary) to fetch config files during first-run.
This was the first solution that jumped into our minds, but we decided against it because forcing a change on the binary (instead of making changes in the way a container was deployed) is pretty antithetical to the principles of K8s. It would also mean having to ask the engineering-teams to change the way the program runs which might cause more problems to solve this one.
2. Use an initContainer
[Init Containers are] specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.
Where the setup script could be git clone / download all the config files into a volume mount to be used by the main container(s).
While this was a possibility, we decided against it in the interest of time - specifically because the person I was pairing with is a Helm master who knew about the proper snippet to use. See below.
3. Use Helm To Include All-Files From a Directory At Deploy-Time.
While my partner was a helmMaster and knew that this could be done - this is the StackOverflow post that confirmed it and helped refine the necessary break-through.
Essentially, what we needed Helm to do was to:
Range over a list of YAML files in a directory
Create a new element in the config map per file with the key being the filename and the value (data) being the contents of that file.
# Create a new config map for every Deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  {{- range $path, $_ :=  .Files.Glob  "config_files/**.yaml" }}
  {{ $path | trimPrefix "config_files/" }}: |-
{{ $.Files.Get $path | indent 4 }}
  {{ end }}
Specifically:
  {{- range $path, $_ :=  .Files.Glob  "config_files/**.yaml" }}
  {{ $path | trimPrefix "config_files/" }}: |-
{{ $.Files.Get $path | indent 4 }}
  {{ end }}
This helm snippet ranges over the config_files directory for all .yaml files and then creates key-value pairs where the key is the name of the file (with the config_files/ prefix trimmed off) and the value is the contents of the file.
i.e. If the directory looked like:
/config_files
---> config_a.yaml
---> config_b.yaml
Then the templated configMap (post-templating) would be:
# Create a new config map for every Deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
data:
  config_a.yaml: |-
    (contents of config_a.yaml)
  config_b.yaml: |-
    (contents of config_b.yaml)
This means that when used in conjunction with our previous podSpec, which had the following lines:
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: special-config
The files will be present as:
/etc/config/config_a.yaml
/etc/config/config_b.yaml
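To make the file-to-key mapping concrete, here is a minimal Python sketch that mimics what the range loop produces, outside of Helm. It is not how Helm works internally; the helper name build_configmap_data and the demo file contents are made up for illustration, and it assumes a flat config_files directory (no sub-directories).

```python
import tempfile
from pathlib import Path


def build_configmap_data(chart_root: Path, config_dir: str = "config_files") -> dict:
    """Mimic the Helm range loop: one ConfigMap entry per .yaml file.

    Hypothetical helper for illustration only. Helm does the real work
    with .Files.Glob, trimPrefix and .Files.Get at template time.
    """
    data = {}
    base = chart_root / config_dir
    # Assumes a flat directory: only files directly under config_files/.
    for path in sorted(base.glob("*.yaml")):
        key = path.name                  # the trimPrefix "config_files/" step
        data[key] = path.read_text()     # the .Files.Get step
    return data


def demo() -> dict:
    # Build a throwaway chart layout with two made-up config files.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "config_files").mkdir()
        (root / "config_files" / "config_a.yaml").write_text("foo: a\n")
        (root / "config_files" / "config_b.yaml").write_text("foo: b\n")
        return build_configmap_data(root)


if __name__ == "__main__":
    for key in demo():
        # Each key becomes its own file under the mountPath.
        print(f"/etc/config/{key}")
```

Each key in the returned dict corresponds to one file under the mount path, which is why the keys need to keep their .yaml extension.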
"Well that's great T, but how are you going to make each container choose a different config file?" Stay tuned because that's the next bridge to cross!
Anyways, quick post - but that's how we learned how to use Helm to fetch all files and their contents from a directory and include them in-line. Note that a limitation of this approach is that there's a max limit of 1 MiB worth of data that can be sent through as a ConfigMap.
A ConfigMap is not designed to hold large chunks of data. The data stored in a ConfigMap cannot exceed 1 MiB. If you need to store settings that are larger than this limit, you may want to consider mounting a volume or use a separate database or file service.
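A quick pre-deploy sanity check against that limit could look like the sketch below. It is only an approximation under stated assumptions: it sums the bytes of each file name and its contents in a flat directory and ignores any other overhead in the object. The function names and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path

CONFIGMAP_LIMIT_BYTES = 1024 * 1024  # 1 MiB, per the K8s docs quoted above


def configmap_payload_size(config_dir: Path) -> int:
    """Approximate payload: bytes of every key (file name) plus its contents."""
    total = 0
    for path in config_dir.glob("*.yaml"):
        total += len(path.name.encode("utf-8")) + path.stat().st_size
    return total


def fits_in_configmap(config_dir: Path, limit: int = CONFIGMAP_LIMIT_BYTES) -> bool:
    """True if the approximate payload is within the limit."""
    return configmap_payload_size(config_dir) <= limit


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_dir = Path(tmp)
        (config_dir / "config_a.yaml").write_text("foo: a\n")
        print(configmap_payload_size(config_dir), fits_in_configmap(config_dir))
```

If the check fails, that is the point where mounting a volume or using a separate file service (as the docs suggest) becomes the better option.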


Better Communications With Your Team as a Junior Engineer
Thilina Ratnayake — Sun, 14 Aug 2022 18:47:50 GMT
As an Engineer or Knowledge Worker - our roles require the extraction, synthesis, or modification of information to complete our work and get to the finish line. 
But rarely do we do this in isolation. This almost always requires communicating with other humans to get to the finish line.
As a Junior Software Engineer (or any member within a team of knowledge workers) - the Interpersonal Communication skills employed when working within an Engineering Team help you learn and grow. As a Senior Engineer, these skills become crucial in operating across organizational boundaries.
Below are a couple of thoughts and tips compiled through my journey from support to engineering, and as a Junior and onwards. This is knowledge that was hard earned over years and challenging interactions - and I hope it will be helpful to anyone that is starting out as a New or Junior Engineer.
This was originally a more concise Twitter thread that can be found here: 
https://twitter.com/tratnayake/status/1553981169154215936
New Junior Engineer - Welcome!
Welcome and congratulations on your new role! As a Junior Engineer, it's an exciting, stressful and maybe even intimidating time in your career! There's a seemingly unending waterhose of information to ingest, countless new problems to solve and so. SO. many things to build.
The world's your oyster and after spending years on learning how to sail, you got yourself into a canoe!
And thankfully, it's not just you. You've got a whole TEAM to learn and build with. Whether it's just one other engineer or maybe a large pizza team - your teammates will be a huge part of your experience.
But while your team exists to work together, it's important to understand that everyone on it is their own unique person. All of them possess different strengths, areas of improvement and, most importantly, constraints.
Interpersonal Constraints - Knowledge & Space
As humans, it's in our nature to immediately start assessing new people, groups and surroundings for certain indicators. In early days, it helped keep us alive - and even today, it helps us determine if we are in danger. Outside of basic survival - in group settings we will also start trying to determine hierarchy and structure, and try to make groupings based on different criteria.
In your induction to your team, I would suggest you assess your peers and frame your relationship to them against 2 constraints:
Relevant Functional Knowledge & Space.
Relevant Functional Knowledge
Relevant Functional Knowledge includes the knowledge, skills and experience revolving around the technical areas you work in. For example, the understanding of what makes up a relational database including what makes it ACID compliant is knowledge.  Being able to then log into a MySQL database and then using that knowledge to execute a database migration is a skill. Remembering to take a snapshot AND backup before starting the whole operation is experience (often, hard-won).
Space
Sometimes referred to as capacity or bandwidth, space is the time and energy that is available for tasks.
If we were thinking mathematically - Space is a function of Time multiplied by Energy. 
Space = Time * Energy
If someone has a lot of energy but not enough time in their day to do something, they don't have a lot of space. Conversely, someone that has a lot of time but not a lot of energy is also lacking space.
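As a rough worked example of the formula, here is a small Python sketch. The units (free hours in a day, energy on a 0.0 to 1.0 scale) and the three example people are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    free_hours: float  # Time: hours available today (assumed unit)
    energy: float      # Energy: 0.0 (drained) to 1.0 (fully charged), an assumed scale

    @property
    def space(self) -> float:
        # Space = Time * Energy
        return self.free_hours * self.energy


# Hypothetical people, matching the two cases described above plus a Junior.
energized_but_booked = Person("energized but booked", free_hours=0.5, energy=1.0)
free_but_drained = Person("free but drained", free_hours=4.0, energy=0.1)
junior = Person("junior", free_hours=4.0, energy=0.9)

if __name__ == "__main__":
    for person in (energized_but_booked, free_but_drained, junior):
        print(f"{person.name}: space = {person.space:.1f}")
```

Both of the first two people end up with very little space, for opposite reasons, while the Junior in this made-up example has plenty.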
Communication & Organizational Boundaries
Boundaries exist in any organization, and they're not always bad. Boundaries are helpful because they clearly mark a line of change. For example, in computer networking, information is altered to be sent through network(s) according to the constraints of each of those networks and layers. I like the OSI network model - because there's a concept of layers, and especially lower and higher networks.
Peeling back the skin on this example quite heavily: In networking - you can split communications between local networks (like a home WiFi network) and a wide area network (like the Internet). Communications within a local network are treated differently than in a wider area network. For example on a local network, clients send frames to a switch - whereas WAN networking requires packets transiting through and from routers.
Using this example, think of yourself as localhost, your team as the home WiFi network, and the rest of the org as the Wide Area Network (like the Internet).
Interpersonal Boundary
Inside The Boundary
As a Junior Engineer, you might not have a lot of Relevant Functional Knowledge - but you do have a lot of Space to learn. As a Junior - there is slack explicitly built into the deadlines accounting for learning. This is done in exchange for the implicit expectation that you will be working hard to learn. TL;DR - You're given a lot of space to learn.
Between you and your team is your Interpersonal Boundary.
This is the first organizational boundary you will learn to cross and work around as an Engineer.
Outside the Boundary
On the other side of this boundary is your immediate team.
A good exercise here is to examine the areas of the team's Knowledge vs Space box compared to yours.
Relevant Functional Knowledge: By time-in alone, your peers will probably have more functional knowledge - or at least organization specific context.
Space: Because every member of your team is their own person, with their own goals, deadlines and constraints - the space that they have available for you only exists to a certain limit.
Put more succinctly, your team will have more knowledge than you do - however by also possessing their own constraints, your team (and anyone else) will have less space for you than you have for yourself.
Each member will also occupy their own position within the plot.
Each position will change based on the person and organization. 
Engineering Managers
One unique position on the plot is your Engineering Manager (EM).
On the vertical (Y) axis, they are placed on a scale because EMs can possess a varying level of functional knowledge. Some EMs are more technical, some less.
However, the important note is their close proximity to you on the horizontal axis representing space.
EMs should have the most time, or at least space, for you.
As people managers, EMs trade functional tech work for space to lead and develop their people, which includes you!
Their role includes providing (strategic) leadership, coaching, and helping broker relationships.
If you're stuck, they can help coach you to figure things out.
If you're blocked, they can use their levers to get you free - or provide you the resources to do so.
If you don't know what to do next, they can help you figure it out or at least point you to someone else.
Your EM is a very important member of the team to keep in mind, and fostering a relationship with them is crucial for not only your daily success, but advancement as well.
Teammates - Only Have A Certain Amount of Space For You
The rest of your team will land vertically based on experience and knowledge. The time available for you usually decreases with seniority due to their workloads.
The more senior the teammates, the more knowledge they will have. However, an unfortunate, but understandable, side-effect of having more relevant functional knowledge is also a larger scope of work. A larger scope of work means less free space for you 😟
This isn't personal, this is just an output versus capacity limit.
Even IF you had the most considerate and approachable Senior Engineer that is explicitly assigned to you as a mentor - there will come a limit to the space they can extend to you without risking an impact to their other duties and responsibilities.
Does this mean Juniors shouldn't interact with Senior Engineers? On the flip side, does this mean that the more senior you are - the less approachable you're allowed to be?
No!
Seniors, EMs and organizations must MAKE formal space for Junior Engineers. In fact that is the mark, and a necessity, for any successful engineering organization in the long-term.
As a Junior though, it does mean that:
No one has to help you. They should, but in the absence of the ability to demand help, you're working purely on good-will. Which means it's on you to do the leg-work to make it as easy as possible to be helped.
You should be more considerate in how you communicate with other members of your team - especially by factoring in the Relevant Functional Knowledge vs Space available.
How Does This Help Me As A Junior Engineer?
At some engineering organizations - Junior Engineers don't have to worry about formal task scoping or grooming, and simply need to grab a card and take it to the finish line after all of that planning work has already been done by more senior members.
You want to get to the finish line.
But - it's never a straight line to the finish, and (especially as a Junior) it will rarely be a solo effort. You will need to work with your teammates in order to gain the answers and knowledge necessary to complete your task.
In fact, this is one of the most common ways you will learn as a Junior Engineer.
And like any method, there are ways you can do this better.
Interaction Cost
Every interaction with another engineer incurs costs. For example:
Time: How much time is spent not working on their original tasks.
Context Switching: In the best-case scenario, the effort needed to save and dump state before acquiring new state to help you. Or, most commonly, the cost of reacquiring lost state when switching back to the original task.
And it's true that the reality is: these costs will always exist.
An implicit agreement of team membership is accepting that interactions are necessary & encouraged.
HOWEVER - becoming a good engineer is understanding constraints and optimizing accordingly. Similarly, being a good teammate is about doing the same by being efficient in your interactions.
Efficiently Crossing the Interpersonal Boundary
By focusing on the transition between these boundaries, you can become more efficient in your interactions with teammates.
1. Do your DUE DILIGENCE
In the legal world, due diligence refers to:
Reasonable steps taken by a person in order to satisfy a legal requirement, especially in buying or selling something.
While there might not be a legal requirement when interacting with a teammate, you are effectively buying: trading their time for information. The least you can do is everything in your power to ensure that the interaction is as efficient as possible.
Doing your due diligence for an interaction is task specific, and is something you'll get better at over time. However, here are 3 suggestions which are widely applicable to most interactions.
And remember - one of the benefits of doing this before you cross the boundary to a teammate is that you have as much space as you need to work the problem. You haven't bugged anyone yet so there is no time pressure.
Take your time. And if you feel that time is running out - work with your leadership to ask for more to figure it out.
1. 🦆 DEBUG WITH A RUBBER DUCK.
Ref: https://en.wikipedia.org/wiki/Rubber_duck_debugging
Before you run for help, go through your code / your problem - line by line, premise-by-premise. You'd be surprised how often this approach catches something you missed, or unearths nuggets to get you unblocked.
2. 📖 Read The Freakin Manual (RTFM)
Something that gets old quick when you help other people - is if they keep leaning on you without trying to help themselves first. As a knowledge worker, your job is solely focused on interacting with information until it can be used / modified to be useful. 
One perk of being a knowledge worker is that you are usually provided with a link to the Internet. Make sure you make good use of it. Do the leg work to do your research before asking your question (as much as you can) instead of relying on your helper to do it for you.
3.  REFINE your Ask - Organize Your Thoughts In Logical Order & Prep Question
Even if you're an engineer, you're still in sales.
You're selling an opportunity to get helped, and you want to make sure you do everything you can in your interaction for your recipient to buy the opportunity to help you.
You do this by making sure your sales pitch is as tight as it can be.
Is your pitch in logical order? For someone that might be doing something completely different when you're asking them - does your question / story / quest flow from point to point? Make sure to tighten it up.
Do you provide an easy on-ramp to sync understanding? When you've been working for a long time on a specific problem, it's easy to get in the weeds and forget how much context you've built up. When pitching to someone, it's important to focus on providing an on-ramp to go from 0 to 1. Start by providing information that is generalized and simple before moving to more specific and complex.
Can you execute your pitch in one-go? HAVE YOU PRACTISED? If you're selling door-to-door, it's show-time as soon as you ring that doorbell. Similarly, the moment you send that message - you're on. Practice your pitch thoroughly so as to reduce the time spent clarifying things on the call.
As an example:
I need help! I can't figure out how to deploy to Staging
is not as great as:
Hey can you help me figure out why I can't deploy this to Staging? I read the doc on CI/CD [Ref: Link] and kicked off the pipeline but I seem to be getting this weird 401 error? I read up on that error and it looks like it might be a permissions issue? Do I need to get access to something else?
More on this in Crafting your Message later on.
Once you've done your Due Diligence, it's time to:
2. Determine Recipient(s) & Method of Delivery
Who Do I Talk To? Decision Math
Hey, the Tech Lead wrote the CI/CD pipelines, right? Can't I just go to them with this question?
Sure, but by that logic - they probably also have knowledge about most things on the team.
What happens if everyone goes to the Tech Lead all the time? That's not sustainable.
A considerate team member takes into consideration the impact on others in relation to the time sensitivity and urgency of their requests. Just because you can DM someone that probably knows the answer, doesn't mean you should. Especially in the presence of other options.
Instead, this is where it's important to do some Decision Math in determining who you talk to.
To make this decision, you should factor in 3 considerations:
Least Impact - You should aim to pick the person on your team that will be the least impacted by your interruption; But
Best Answer - You should also aim to pick someone that's probably going to have knowledge on the relevant subject matter. However, at the end of the day;
Quickest Resolution - You need to ensure you ask someone that will be able to give you a timely answer.
For example, if you have a question about your CI/CD pipeline, the Decision Math could go like this:
The problem is around CI/CD pipelines.
Best Answer: This is a problem that the whole team has dealt with. (So anyone could answer this - but you should pick the peer engineer because they are the next lowest level available.)
Time Sensitivity: You need to figure this out so that you can deploy the changes to finish your card. But this isn't due till end of week and it's a Monday.
Least Impact: Based on the two details above - I can speak to any of the engineers on my team. However, I know that my peer engineer is busy pairing with the Tech Lead right now, so they're both busy.
Therefore: I'll ask my Senior Engineer, since that's the least disruptive to the team, and they're probably going to give me the right answer in the fastest manner.
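The Decision Math above can be sketched in a few lines of Python. This is just one way to encode the three considerations; the Candidate fields, the impact scores and the reply-time estimates are all made-up numbers for illustration, not a formula to apply literally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    can_answer: bool       # Best Answer: likely to know the subject?
    impact: int            # Least Impact: cost of interrupting them (higher = worse)
    hours_to_reply: float  # Quickest Resolution: rough time to a useful reply


def pick_recipient(candidates: List[Candidate], hours_until_due: float) -> Optional[Candidate]:
    # Keep only people who can answer, and can do so before the deadline.
    viable = [c for c in candidates if c.can_answer and c.hours_to_reply <= hours_until_due]
    if not viable:
        return None  # nobody fits; a good moment to go to your EM
    # Among viable options: least impact first, fastest reply as the tie-breaker.
    return min(viable, key=lambda c: (c.impact, c.hours_to_reply))


# Hypothetical team: the peer engineer is pairing with the Tech Lead,
# so interrupting either of them has a high impact right now.
team = [
    Candidate("Tech Lead", can_answer=True, impact=5, hours_to_reply=4.0),
    Candidate("Peer Engineer", can_answer=True, impact=4, hours_to_reply=4.0),
    Candidate("Senior Engineer", can_answer=True, impact=2, hours_to_reply=1.0),
]

# It's Monday and the card is due at the end of the week (assumed ~32 working hours).
choice = pick_recipient(team, hours_until_due=32.0)

if __name__ == "__main__":
    print(choice.name if choice else "Ask your EM")
```

With these made-up numbers the sketch lands on the Senior Engineer, the same conclusion as the worked example.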
HOWEVER if you have a dedicated resource (i.e. a technical mentor, big-sib or new-hire buddy), go to them first. They are primed and are known to expect you.
For all this talk about who to talk to though, I suggest the following options, escalating up to the next level if one isn't successful.
1. Ask in the Team Chat
The team chat is the hub that all members work around. In teams with healthy communication cultures - this chatroom should be very active.
Messaging in the chat room has the benefit of getting more eyes and ears on the question / request and the information contained within. Even if it doesn't help you or the reader directly at that moment - it makes the information available for querying and recall by the broader group. It also increases communication and collaboration.
Want an easy way to reduce knowledge silo-ing and single points of failure? 
Leave more breadcrumbs in an easily accessible and frequently monitored space.
Pros
Best Effort, Help Invited - Not tagging anyone specifically means that whoever has the answer, guidance or suggestions can help you based on their constraints (time, energy).
Knowledge Sharing - Putting it in the team chat means that knowledge is shared. DMs are death.
Cons
Not tagging a specific recipient means that there is a lower pressure to respond. If your team does not have a respectful communications / collaboration culture - your question might be ignored.
The open arena for communications might lead to more drawn out conversations. (Bike-shedding, yak-shaving).
If your message is ignored, I would then try to:
2. Ask in the Team Chat - With Targeting or Time Pressure
Same as above but with either a direct CC, direct @ tag or a deadline.
Pros
Same as before + 
Might get a response from someone that missed the message earlier.
Encourages people to respond
Cons
Doing this excessively could annoy your teammates and lose the good-will / patience to help you.
Finally, if that doesn't work:
3. Ask an Expert Directly via DM
If time is of the essence, go direct to someone. Someone who's guaranteed to know the answer. OR - if the right person isn't clear, ask your EM.
Pros
Fastest response.
Cons
Death by DMs - Communications energy drain for the recipient having to check on a channel that they weren't already monitoring.
Death in DMs - Information is now siloed between you and the recipient.
3. Craft Your Message
Now that you know where / who you're going to interact with for your question & help - it's time to craft your message. Remember, this is your sales pitch where you're selling your teammates on the chance to help you - make it as tight as possible.
Some things to consider when crafting your message:
What are their constraints?
How much time do they have for me?
How much information can I strip out (without losing message integrity)?
How much knowledge / context do they have about the questions I'm inquiring about? (Is this something they work in daily?)
How much information do I need to fill them in on?
What sort of follow-up questions might they ask?
Can I incorporate these into the question so that they don't need to ask them?
This is where I recommend reading a previous post I've written on the 5 Step Question: https://tratnayake.dev/asking-better-questions-as-a-junior
After you've done all that - you're good to send!
Conclusion
If you do all of this, your journey might look like this:
If you take the care to understand the Interpersonal Boundary and what it takes to be deliberate, considerate and efficient in your transitions when going to your team - you'll get better answers, be less of a burden on your team and, more importantly, practice the skills and habits that will be essential further on in your career as a more senior engineer, when you will need to cross multiple organizational boundaries.


How-To Set Up a Jupyter Notebook on GCP with Granular Access Control to Read from Big Query, Configured with Terraform
Thilina Ratnayake — Sun, 13 Feb 2022 01:03:36 GMT
Context
In our organization we have data analysts that need to fetch data from different sources to perform their work. In the last couple of weeks, I worked with a Data Analyst that needed a solution to query data from BigQuery (BQ) datasets using R (a programming language for statistical computing).
What's BigQuery? An enterprise data warehouse that is specific to Google Cloud Platform (GCP). Useful in circumstances where folks want to run analysis on their data but don't want to do it on the databases themselves (most databases are set up to ensure data is read/written reliably - whereas data warehouses are built specifically for analytical operations).
One common pattern is to fill BQ datasets from production databases and run analytical operations on those datasets.
Photo by Akinori UEMURA on Unsplash
Constraints
In building / researching a solution for this need, we wanted to work around a couple of constraints:
Future-proofing: The solution should be able to keep up with future demands and reduce the dependency on local hardware. We don't want analyst laptops to be a limiting factor wherever possible. Using the cloud means leveraging the ability to spin up workloads on hardware with specific requirements as needed.
Running this in the cloud also means that a user can continue working / access data from anywhere / any laptop (even if they lose access to their device).
Security - Authentication: The solution must be locked down to specific users. The BigQuery datasets are already subject to security policies that lock down their access. However, as this is a new mechanism making use of the dataset, we need to ensure that this solution, at minimum, does not grant access to any new / unwanted users. Ideally, it only grants access to those who require it (and are already within the security / access policy).
This solution should also allow the ability to query into datasets in other GCP projects that we maintain.
Security - Authorization: The solution must only allow access to specific datasets. In line with the above, even if the requestors are within the security policy to read and write from this dataset, the solution must be locked down to READONLY operations on the data. This is to ensure that another risk of data loss is mitigated where possible.
Infrastructure-As-Code: The solution should be Terraformed. No special snowflakes on our watch! Having this infrastructure terraformed has a lot of benefits, which you can read about here. But for us, it means that everything that's running is codified and can be examined, modified, or nuked from one source of truth.
Cost: Using this notebook shouldn't be prohibitively expensive.
Goal
Photo by ÁLVARO MENDOZA on Unsplash
Implement a solution that allows a data-analyst to run R code against BigQuery Datasets which meets all of these constraints.
Solution
The majority of this solution is already covered in a GCP article: Data Science with R on GCP EDA - however, what this post includes is an approach that builds on Service Account IAM to meet our security requirements, and shows how to achieve this solution with Terraform.
The main character of this solution is the Vertex AI service, which allows you to run Jupyter Notebooks (as an IDE for R) on rapidly configurable VMs. (Future-proofing) It's usually used in AI related workflows like training ML (Machine Learning) models and thus has native support for talking to BigQuery.
These workbooks have access to the Deep Learning family of images, which allows you to quickly instantiate notebooks with specific images (including one with the R framework installed!)
These workbooks run on top of regular VMs that can be configured to specific workload needs (i.e. tweaking processor, memory and disk specs).
Depending on the hardware used, the costs are minimal (Cost). For example, using an e2-medium instance is only $24.46 per month at time of writing (and that's with the assumption that the notebook is running 24/7).
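To put that number in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the $24.46 figure scales linearly with running hours and that a month is about 730 hours; the 8 hours x 22 days usage pattern is a made-up example, and persistent disk charges (which continue while a VM is stopped) are ignored.

```python
# Rough cost estimate for a notebook VM that is only running part of the time.
# Assumptions (not from GCP's pricing page): ~730 hours per month and
# linear per-hour billing for the VM; disk costs are ignored.
HOURS_PER_MONTH = 730


def hourly_rate(monthly_cost_24x7: float) -> float:
    """Convert an always-on monthly price into an approximate hourly rate."""
    return monthly_cost_24x7 / HOURS_PER_MONTH


def estimated_monthly_cost(
    monthly_cost_24x7: float, hours_per_day: float, days_per_month: float
) -> float:
    """Estimate the monthly VM cost for a given usage pattern."""
    return hourly_rate(monthly_cost_24x7) * hours_per_day * days_per_month


if __name__ == "__main__":
    always_on = 24.46  # e2-medium, 24/7, as quoted above
    part_time = estimated_monthly_cost(always_on, hours_per_day=8, days_per_month=22)
    print(f"Always on: ${always_on:.2f} / month")
    print(f"8h x 22 days: ${part_time:.2f} / month")
```

So if the analyst stops the notebook outside of working hours, the VM portion of the bill should shrink roughly in proportion.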
The constraints for our solution are met by the following:
🔑🔑🔑 The Vertex AI User Managed Notebook Instance (hereafter referred to as the notebook) can be tied to a Service Account (SA) 🔑🔑🔑. By applying access control to this SA we can achieve the constraints as follows:
Security - Authorization: We can lock down who has access to this notebook by gating on who gets to have the roles/iam.serviceAccountUser role on the Service Account in GCP IAM.
We can lock down that SA's access to (1) only the datasets required and (2) READ ONLY operations by assigning the following roles with constraints:
roles/bigquery.jobUser (on the whole project)
roles/bigquery.dataViewer on the specific datasets.
This also allows querying datasets in other GCP projects, by granting roles in those projects to this SA.
This is the major key, as the Service Account can scale up to multiple users by binding the roles/iam.serviceAccountUser role to any principal, which can include users AND groups.
Security - Authentication: Because our users log in using their Google accounts, the authentication mechanism is taken care of by GCP (using folks' credentials).
In order for the Notebook to query BigQuery, the Notebooks API must be enabled. This is a MANUAL operation that must be done in the GCP console.
This can all be Terraformed. (Infrastructure as code)
Implementation
1. Enable the Notebooks API
https://console.cloud.google.com/marketplace/product/google/notebooks.googleapis.com
2. Apply the Terraform
locals {
  # CHANGEME
  project_name = "tutorial-344120" # The project
}

# Note this requires running a gcloud auth application-default login
provider "google" {
  # Values declared in a locals block are referenced as local.<name>
  project = local.project_name
}

## 1. Create a Service Account
resource "google_service_account" "analyst_notebook" {
  account_id   = "analyst-notebook"
  display_name = "SA for analysts to access BQ datasets via Vertex notebook"
}

## 2. Create a User Managed Notebook that uses that Service Account
resource "google_notebooks_instance" "analyst_notebook" {
  name     = "analyst-rstudio-notebook"
  location = "us-west1-a"
  # CHANGEME
  machine_type = "e2-medium"
  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "r-latest-cpu-experimental"
  }
  service_account = google_service_account.analyst_notebook.email
}

## 3A. Allow ability to run BQ jobs on all datasets in project
resource "google_project_iam_member" "project" {
  project = local.project_name # CHANGEME if the target datasets are in a different project.
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.analyst_notebook.email}"
}

## 3B. Allow ability to READ on a SPECIFIC BQ dataset.
resource "google_bigquery_dataset_iam_member" "analyst_notebook_data_viewer" {
  project    = local.project_name # CHANGEME if the target datasets are in a different project.
  dataset_id = "rick_morty"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${google_service_account.analyst_notebook.email}"
}

## 4. Allow only the intended user to use the SA and by extension, the notebook
resource "google_service_account_iam_binding" "analyst_notebook_service_account_binding-iam" {
  service_account_id = google_service_account.analyst_notebook.name
  role               = "roles/iam.serviceAccountUser"
  members = [
    # CHANGEME - who should have access to assume the Service Account (and access the Notebook)
    "user:thilina.ratnayake@email.com",
  ]
}
https://github.com/tratnayake/R_Studio_BigQuery_Jupyter_Notebook
Test
1. Can we open the notebook and query the BigQuery dataset using R?
The R code to query a BQ dataset can be found here: Use R with BigQuery
Yes 🕺 
2. Can anyone else attempt to open up the Jupyter notebook?
Nope! 🔒  
3. Can we attempt to access other datasets? (Outside of what is specified in the IAM policy?)
Also Nope!🔒  
Why not use a Service Account key?
The alternative would be to create a Service Account, let the user download the SA key and use it when connecting to the database from their device.
We stayed away from Service Account keys primarily for the number of risks that they add to the security story.
You can read more about those risks here: https://cloud.google.com/iam/docs/best-practices-for-securing-service-accounts
Using a Service Account key with a local device also means losing out on a couple of features:
No infrastructure as code
Hardware is a constraint - lack of spec / loss is a risk.
Photo by Marliese Streefland on Unsplash
Conclusion
Service Accounts can be great. They are a good approach if you need to represent non-human users or persistent access to a system.
Service Account keys...not so much. They are gross, and icky, and very easy to lose, at which point they become a security risk.
When advantageous, use cloud resources to fill the needs of your users as they bring a couple of benefits:
Existing auth mechanisms
Ease of configuration
Infrastructure-as-code
In this case, we combine both and use a Service Account specifically because of its ability to be a single target to apply our security policies to.
The key feature that enabled this solution was GCP's ability to tie a User Managed Notebook Instance to a Service Account, which we could then apply our access policies onto.


Asking Better Questions as a Junior
Thilina Ratnayake — Tue, 08 Feb 2022 18:39:56 GMT
This article has a companion video:
https://youtu.be/kUyz0geFp3c
And there's a chart at the end :)
It can be scary doing something new. Especially if it's something like starting out as a new engineer, or maybe even as an experienced engineer but starting within a new team. In addition to all the technical aspects - there are so many new relationships to feel out, norms to establish and culture to absorb.
I remember just a couple of months ago when I was starting my new job, I had the same anxieties as when I started my first job:
Am I asking too many questions?
omg, are my teammates getting annoyed?
Is this a dumb question?
While a big part of getting past that stage involves pushing through those thoughts to ask questions - there's value in strengthening our questioning skills as it is one act that we have complete control over, and which increases our ability to learn, grow and contribute faster. 
Photo by Benjamin Child on Unsplash
Preamble
My first job after graduating from college with a Bachelors in Computer Systems Technology was starting as an IT Support Administrator at a medium sized corporate org. At this job, I was given a desk that belonged to someone else with the caveat:
"Oh just move some stuff around a bit, but heads up, they might be back soon".
https://media.giphy.com/media/XCmFwjt9wPotobw1xn/giphy.gif
With a cluttered desk, hand-me-down laptop and a notepad - I was left to my own devices to learn about a complex global  IT system with a senior engineer who seemed more annoyed to be interrupted by my presence than interested in teaching me. 
I lasted two months (but walked away with an amazing friend!).
My next job was also another "awkward fit". I was a Junior Front-End engineer working on a very small start-up that had no documentation or process for helping juniors, and where asynchronous PR reviews were the norm. I still remember my excitement in submitting my first ever PR and then the immediate horror as I looked at the Trello card afterwards which had 74 points to fix as single statements (and no advice or suggestions).
Let me know when you're ready to resubmit.
I let them know after 4 weeks - that this probably wasn't the right fit for me.
https://media.giphy.com/media/Z9cRCMdAMzXi25dwhE/giphy.gif
After these discouraging experiences, I went through an identity crisis (the first of many!) where I wondered if I was ever meant to be a Software Engineer. So much so that the next gig I took on was joining a recently acquired start-up as a Customer Support Representative.
And this is where it all "happened" for me. This was one of my most critical formative experiences in tech, and I lean on the skills I learned, refined and mastered there every single day.
Support Engineering
Welcome to Support, The Best Damn Org In The Company
Maybe not what you'd expect to hear in that sort of organization - but that was the first thing my team lead said to me. The culture in that company was electric, but the esprit de corps and morale in that support team was off the charts. And I think that's because a large portion of them were professionals.
The purpose of a support organization is to assist customers with their questions and problems. To this end, our bread-and-butter was working on tickets. A ticket is generated when there's an interaction from a customer (like an email, or a phone call), and all correspondence takes place on that ticket until the ticket is completed.
^ Working on tickets
I must have worked on thousands of tickets over my 2.5 years in support, with each ticket having at least 2 interactions. In these tickets, the goal is to identify a customer's problem and provide solutions or a course of action as soon as possible. Because of this, I became very good at asking clarifying questions to isolate problems and structuring information into logically ordered, bite-sized pieces. In fact I remember thinking to myself:
At school I learned how to do a wide range of things, from building a custom OS kernel from scratch to writing programs that can handle concurrency and YET; the class that I've used the most in this gig is my Philosophy class.
Logical fallacies. Presenting information.  Ordering premises.  My non-technical profs would be thrilled!
As part of our daily work - one activity that would come up is Escalations
Photo by Kelly Sikkema on Unsplash
Escalations
Customer Support Representatives (CSR's) worked on newly created tickets. We would work with the customer to collect diagnostics and recommend solutions based on those efforts. Usually, most problems were common issues that could be solved with a link to a document or a recommendation to do a few things.  These were the cases in which a ticket could be considered completed and be closed.
Sometimes - there would be tickets where as a CSR you run to the end of your abilities. Either in what you know about the issue, or in your abilities to troubleshoot and resolve the issue. This is the point when we'd need to escalate the ticket UP to a Technical Support Engineer (TSE).
https://media.giphy.com/media/l2SqbG9QAz1Z314Uo/giphy.gif
Prior to escalating, you had to fill out Escalation Notes on the ticket and let the customer know that the ticket is being escalated. Think of Escalation Notes like a sticky note that you'd put on an essay before you handed it off to an advisor. These Escalation Notes were crucial to a TSE because they would be their first point to orient themselves on what's happened, what's happening and what needs to happen. If it was a long-running ticket before escalation, the Escalation Notes would hopefully contain the highlights and summary needed to skip reading the rest of the thread. This was especially important if you were transferring an agitated customer who was on a call and demanding an escalation because it's taken too long (oops!) -- these escalation notes could be all the TSE has to skim before jumping on to fight the fire.
In this practice, you learned the value of writing good escalation notes quickly. 
Bad Escalation Notes required a TSE to chase after you to get more information or ask clarifying questions. 
Really Bad Escalation Notes missed critical pieces of information, and would be de-escalated for more fact-finding. Not great if you've just told your customer that you're escalating the ticket.
https://media.giphy.com/media/WxDZ77xhPXf3i/giphy.gif
Oof. Not great when your metrics revolve around the number of interactions you have with a customer, and how long a ticket has been in progress for.
However, a set of Good Escalation Notes -- ones that:
don't require the TSE to come back to you with any more questions,
have a summary of what's been tried,
contain everything they need to continue on the ticket;
are the equivalent of sending a polished bowling ball down a freshly waxed lane. You can see it get accepted and worked on by a TSE and (depending on the issue) quickly move towards resolution. Strike!
Photo by Ella Christenson on Unsplash
If you wrote good escalation notes, tickets got solved faster. 
If you wrote good escalation notes consistently, the TSE's would quietly DM you and teach you about the issue and how to solve it in the future, and sometimes - would even give you the answer and send you back to the customer to be able to close the ticket yourself. (In support, being able to work with the customer to close is a pretty satisfying feeling!)
Photo by Camylla Battani on Unsplash
Questions, as Escalations?
Fast forward a couple of years and I made the leap from Support into Engineering as a Cloud Infrastructure Engineer. I was now a small fish in a huge ocean and had so, many, questions to ask. This need for information, paired with a large amount of anxiety & impostor syndrome, was not a good combination. While my team-mates were so supportive and always around to answer questions, I started wondering:
Hmm, what if I started thinking of questions as escalations? Would that make a difference?
The answer is yes.
I noticed that when I started applying the same principles, the following happened:
My questions would get answered faster.
My questions would highlight resources and share understanding with the rest of the team.
My team-mates started sharing more about their thought process and how it aligned with my initial steps / research.
So with that, here are:
Photo by Zan on Unsplash
5 Tips For Asking Better Questions as a Junior
(Freebie) 0. Imagine Asking Your Question to Yourself.
After I had gotten a bunch of bad escalations punted back down to me - I started getting into this weird "shadow-boxing" mindset where I would assess my notes from the point of view of a TSE.
What questions would they ask?
Does this give enough information?
What if they ask about X?
Should I clarify Y?
Putting yourself in the shoes of the person reading your question bolsters the quality of your question and also allows you to do some self review.
1. Problem Statement - One Liner
What's the problem? 
In one line, boil down the most important parts of the problem. If there are multiple problems, list the most concerning and make a note that there are other problems (with more info available upon request).
2. Context - Desired End State
Why is this a problem? What are you trying to do? What's your desired end-state?
If the problem-statement is your starting point, the context should explain your desired end-state or where you want to go. Doing this allows the person answering your question to simply focus on charting the line between the two as opposed to narrowing down the problem scope with (as many) follow-ups.
3. Steps Taken - Qualify Question
What have you already tried and researched?
This portion of the question is extremely useful for so many parties:
For the people reading your question, this can serve as a list of things on what not to recommend / check because you've already done it - saving time. It also shows that you as the asker have spent time qualifying your question.
If posted in a team / group space - this can highlight resources or context that others may not have known about.
For the asker, this can help you feel more confident about the validity of your question because you see that you have put effort into research, investigation and solving it yourself.
This portion also allows your coach to get a look into your problem-solving and response processes, which will only help hone your instincts for the future.
4. Next Steps - Possible Solutions
What are some next steps you'd try or things you'd investigate?
This step is huge for the person receiving your question.
At best, showing what you've thought about poking next means that they can potentially conserve effort by nudging you in the right direction with some information as opposed to coming up with the whole solution.
At worst, they can disqualify those solutions and explain why - which again, hones your skills for the future.
And sometimes, even they might not even know where to start - but this statement might spark their thought process!
5. Help Requested - Means for Assistance
What help do you need? When are you available?
Most team-mates want to help you. However, they're also busy with their own work. Sometimes, they might see your question (and know the answer!) but the task or call in the moment might short-circuit their desire to reach out to you. 
If you explain what help you need, and what meetings you're available for (and when) - this could allow a team-mate to acknowledge and set up time for later as opposed to needing to acknowledge and try to come up with a solution.
Example of a Good Question
This question is taken from one that I needed to ask last week!
Hey, does anyone know how to decrease the retention window for Prometheus-Server?
The disk on Prometheus-Server has filled up and it's not able to send metrics to Grafana
We've tried making changes in the helm charts but they don't appear to be sticking.
I'm probably going to try doing a live kubectl edit on the cluster next, but not sure if that's the best way.
I'm available for a huddle rn if anyone's available, but also good for a Google meet after 1 hour.
Conclusion
This article is specifically titled Junior and not suffixed with Engineer because while this post is specific to engineers, it can apply to a junior in any field. I validated this with a friend who's in Customer Success, and they mentioned that this is exactly the kind of information they'd want from a Junior Account Manager.
As a more CS tailored example:
Does anyone have a good compelling event to upsell startup to enterprise? 
I have lots of startup clients who need to generate additional revenue for my portfolio
I've already used the multiproduct benefit pitch, but it hasn't landed
I'm thinking of talking about a change in NIST regulation next year as my next angle
I'm free after 2 today if anyone is free to roleplay
I hope this post helps y'all in asking questions, and if there are other points that you'd add - please leave me a comment :)


Oncall Adventures - When your Prometheus-Server mounted to GCE Persistent Disk on K8s is Full
Thilina Ratnayake — Mon, 24 Jan 2022 06:25:26 GMT
Note to anyone that lands on this page in the middle of an Incident and just needs the solution.
Problem: Prometheus-server running in K8s on GCP using a Persistent Volume has run out of disk.
Symptoms:
Grafana shows readings diving off cliff.
Grafana shows no data.
No screaming from other independent sensors (i.e. other teams)
Logs on Prometheus-server show:
target=http://XXX.XXX.XXX.XXX:YYYY/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00004932: no space left on device"
Shelling into Prometheus-server confirms 100% disk usage.
Fix:
Remove link to filled Persistent Volume.
Mount debug pod onto Persistent Volume.
Clean-up old blocks in /data
Kill debug pod and remove link to Persistent Volume
Restart Prometheus-server
Wait 2 hours for pruning to finish 
Photo by Annie Spratt on Unsplash
PagerDuty Alert. You have 1 triggered notification...
That's the phone-call I was woken up with this morning at 3:30AM Pacific Time. 
This was my first time getting paged in the middle of the night at this job, but the response felt very rehearsed. On-call work is one of the less glamorous, but still important, parts of the job - and if you've done it for a couple iterations (and been through a couple of high-intensity outages), it becomes just another activity.
As I rolled out of bed and shuffled over to my desk - I remembered something an old mentor told me as a Junior:
After a while, you'll start seeing the patterns and see that almost every issue you deal with stems from a few specific blueprints.
Sure enough, the incident we worked through last night had a very common culprit - resource exhaustion.
Here are some lessons we learned from this incident:
Photo by Launde Morel on Unsplash
1. If All Your Gauges Go Dark but No One Screams, It's Probably Your Gauges.
The majority of our infrastructure runs on Kubernetes (K8s) which is a container orchestration system.
For observability (o11y) - we make use of Prometheus which works by scraping metrics from our containers and then pushes them to Grafana cloud for data visualization and export. 
This is also where our alerts are configured and how we get paged if something seems wrong.
To understand this setup, there are 2 diagrams:
You'll need to be in light-mode to see the arrows in this diagram 😞 
And the Grafana agent which is on the Prometheus-server and sends the metrics off to Grafana cloud.
The first thing I woke up to was our dashboards showing metrics either:
Diving off a cliff; or
Going dark.
One thing I've learned over time is that while our eyes naturally fixate on anomalies in patterns - it's important to take a look at the bigger picture before diving deep into a single graph.
Specifically, in this case - I saw that all of our graphs were showing the same behaviour (either a steep drop, or lack of data).
This, combined with the fact that no-one else was screaming (our product-engineering teams also have their own monitoring set up more specific to their use) gave me a hunch that these readings might be an issue with our observability into the system rather than a reflection of the system itself.
To confirm this, I wanted to test one of the claims from our monitoring system.
All CPU usage has dropped, memory usage has dropped, your containers are probably dead in the water.
So I logged into the cluster, and thankfully - I found that our pods and containers were swimming along just fine.
The combination of these 3 factors:
All metrics dropping at the same time;
No other alert of issues from an independent source (i.e. a product engineer);
The core claim of large scale outage being confirmed false
Led me to the following assumptions:
The infrastructure is still okay.
The monitoring system is probably degraded.
We're not getting any more data.
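As a rough illustration of that triage reasoning, here is a small Python sketch. The class, field and function names are hypothetical and only encode the three checks described above; it is a thinking aid, not a real diagnostic tool.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """The three signals checked during triage (names are illustrative)."""
    all_metrics_dropped_together: bool
    independent_alerts_firing: bool
    workloads_confirmed_healthy: bool


def likely_culprit(obs: Observations) -> str:
    """Return a first guess at where to look: monitoring or infrastructure."""
    if (
        obs.all_metrics_dropped_together
        and not obs.independent_alerts_firing
        and obs.workloads_confirmed_healthy
    ):
        # Every gauge went dark at once, nobody else is screaming, and the
        # pods are fine: the gauges themselves are the prime suspect.
        return "monitoring"
    if obs.independent_alerts_firing or not obs.workloads_confirmed_healthy:
        # An independent source agrees something is wrong, or the workloads
        # really are unhealthy: treat it as a real outage.
        return "infrastructure"
    return "unclear"


if __name__ == "__main__":
    this_incident = Observations(
        all_metrics_dropped_together=True,
        independent_alerts_firing=False,
        workloads_confirmed_healthy=True,
    )
    print(likely_culprit(this_incident))
```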
Let's poke at the monitoring system!
Photo by Mathias P.R. Reding from Pexels
2. Kubernetes is Just a Wrangler for Your Containers, They Still Need to Eat.
Kubernetes just co-ordinates your containers to get them housed (scheduled) and fed (resourced). If it can't, Kubernetes will try its best - but it doesn't (and can't) send notifications. This is where a monitoring solution like Prometheus comes in. It reports a constant stream of data about what it sees from looking at your app and sends it to Grafana for further visualization and alerting.
In this case, it just so happens that the monitoring system was what was degraded. But how?
Prometheus runs as pods on the cluster and essentially lives to scrape metrics from other containers and send them out. At the end of the day, it's just another creature (container) that needs food (resources) to live (do its job). 
The 4 main resources that any container needs, organized by what they do, are:
The ability to do things - processing, CPU
The ability to use short-term memory - memory, RAM
The ability to use long-term memory and carry assets (like a backpack) - DISK
The ability to talk to others - networking (an IP, port, socket; actually networking has a few more requirements).
If any of these 4 resource requirements are not met, your workloads will fail.
While I'd love to say I have a tried and true method for checking all of these 4 things, the problem in this incident was uncovered by shelling in and taking a look at a couple of things.
One thing that really helped was the use of K9s - which makes it very easy to see the state of a Kubernetes object when listing them. For example when listing pods, you can see whether they're running or experiencing issues from the same screen rather than having to list and then describe. A small but appreciated efficiency.
It also allows you to quickly view the logs of a container and see what's going on.
For me, I was able to find the smoking gun quickly (and luckily) by checking the logs of the prometheus-server which was pumping out the following message a couple of hundred times per second:
target=http://XXX.XXX.XXX.XXX:YYYY/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00004932: no space left on device"
no space left on device
Which was confirmed by doing a quick df which shows
Filesystem           1K-blocks      Used Available Use% Mounted on
[...]
/dev/sdb              51290592  48898732   2375476  100% /data
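If you'd rather script those two checks than eyeball them, a minimal Python sketch could look like the following. The function names are made up for illustration; the only library call is the standard library's shutil.disk_usage, and the percentage it yields may differ slightly from what df prints because df calculates Use% a little differently.

```python
import shutil
from typing import Iterable

# The error text Prometheus logged when the WAL write failed.
ENOSPC_MARKER = "no space left on device"


def count_enospc(log_lines: Iterable[str]) -> int:
    """Count log lines that mention the disk being out of space."""
    return sum(1 for line in log_lines if ENOSPC_MARKER in line.lower())


def disk_used_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Inside the Prometheus container you would point this at /data;
    # "." is used here so the sketch runs anywhere.
    print(f"Disk used: {disk_used_percent('.'):.1f}%")
```

Feeding the container's log lines into count_enospc gives a quick yes/no on whether the disk is the smoking gun before shelling in.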
Alright, so our disk is full - how do we fix this?
I'll tell you how you shouldn't:
1. Do NOT move blocks from /data into other drives.
While this shell-game will buy you some time - the WAL (Write Ahead Log) will continue to fill and you will simply be prolonging the same issue.
More on the WAL:
The current block for incoming samples is kept in memory and is not fully persisted. It is secured against crashes by a write-ahead log (WAL) that can be replayed when the Prometheus server restarts. Write-ahead log files are stored in the wal directory in 128MB segments. These files contain raw data that has not yet been compacted; thus they are significantly larger than regular block files. Prometheus will retain a minimum of three write-ahead log files. High-traffic servers may retain more than three WAL files in order to keep at least two hours of raw data.  - Ref
You should be very careful when fiddling with WAL files, as corruption will mean the inability for Prometheus to restart cleanly.
2. Do NOT resize the Persistent Volume by editing the deployment spec on the fly.
Both of these actions led to the following outcome when trying to restart Prometheus:
err="opening storage failed: repair corrupted WAL: cannot handle error: open WAL segment: 0: open /prometheus/wal/00000000: no such file or directory"
Well, what now?
While I wish I could say we knew exactly what to do and arrived at the solution immediately - there were a couple more learning experiences. We ended up:
Attempting to restart the Prometheus-server in a zombie state so that we could remove that corrupted WAL. We couldn't. The container would die immediately before we could shell in.
Killing the Prometheus-server deployment and hoping it would restart cleanly. It didn't.
Finally, we:
Deleted the Persistent Volume that was maxed out.
Re-deployed the Prometheus-server via Helm chart using our CI/CD.
Which, after some more research, turns out to be just a clumsier, more long-winded way of doing exactly what the Prometheus docs tell you to do:
If your local storage becomes corrupted for whatever reason, the best strategy to address the problem is to shut down Prometheus then remove the entire storage directory. You can also try removing individual block directories, or the WAL directory to resolve the problem. Note that this means losing approximately two hours data per block directory. Again, Prometheus's local storage is not intended to be durable long-term storage; external solutions offer extended retention and data durability. - Reference
cool. cool. cool. cool. v nice.
👍 
Photo by Jessica Lewis Creative from Pexels
3.  Persistent Volumes & Persistent Volume Claims - There Can Only be One
In Kubernetes, you can mount volumes on your containers. There are a whole bunch of different mounts that can be configured and in our case, we were using a GCE Persistent Disk which is a type of Persistent Volume.
A Persistent Volume (PV) is one where the disk is linked to block storage that can survive if a container, pod or even deployment goes down. So in our case, even if our prometheus-server goes down - the data it saw will be available on disk. Kinda like an off-site black-box.
When dealing with Persistent Volumes, it's important to understand their relationship with Persistent Volume Claims (PVC).
A Persistent Volume Claim specifies a request for storage by a user. It can then be used within a Deployment spec as the volume to mount for a container. 
The important thing to note is that there can only be 1 ReadWrite mount per Persistent Volume (more on that later).
This is what the yaml for a PVC looks like:
apiVersion: v1
kind: PersistentVolumeClaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "50Gi"
It's important to note that in GKE, creating a PVC like this automagically takes care of provisioning a volume in the background.
And here is the yaml for our Prometheus server mounting that persistent volume as storage-volume
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    [...]
  name: prometheus-server
  namespace: sre
spec:
  selector:
    matchLabels:
      [...]
  replicas: 1
  template:
    [...]
    spec:
      enableServiceLinks: true
      serviceAccountName: prometheus-server
      containers:
        [...]
        - name: prometheus-server
          [...]
          volumeMounts:
            [...]
            - name: storage-volume
              mountPath: /data
              subPath: ""
            [...]
      [...]
      volumes:
        [...]
        - name: storage-volume
          persistentVolumeClaim:
            claimName: prometheus-server
This is important to understand because there was a fix we could have tried.
Refer back to the 0th thing we tried when we got:
err="opening storage failed: repair corrupted WAL: cannot handle error: open WAL segment: 0: open /prometheus/wal/00000000: no such file or directory"
Attempting to restart the Prometheus-server in a zombie state so that we could remove that corrupted WAL (which ended up being futile).
We wanted to get the Prometheus-server up because we thought that was our only way to touch the volume and delete the corrupted WAL, but we couldn't get it up and running long enough to shell into it 😠 .
Looking back at it now, something we should have tried is what this person recommended in a GitHub issue: https://github.com/kubernetes/test-infra/issues/20439#issuecomment-759119197 which is to:
Use a different container to mount and clean the Persistent Volume
Photo by Mati Mango from Pexels
4. How to Create a Debug Pod and Mount it to a Persistent Volume
A GCE Persistent Disk is like a hard drive. 
Just like you can't plug a single hard drive into two computers at one time, a key constraint of a GCE Persistent Disk is that it can only be mounted in ReadWrite mode to a single node at any given time. (However, you can have multiple ReadOnly mounts.)
Therefore, in order to get in and clean it, you need to follow this order of operations:
1. Remove any Existing Associations (from Prometheus-server) to the Volume.
This can be done by either scaling down the Prometheus-server replicas from 1 to 0, or simply deleting the deployment.
2. Create a Debug pod (running Alpine) to Mount the Disk.
If I were to do it again I'd create a Debug pod file like this:
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  containers:
  - name: debug-container
    image: alpine:latest
    imagePullPolicy: Always
    args: ["tail", "-f", "/dev/null"]
    volumeMounts:
    - mountPath: /data
      name: storage-volume
  volumes:
  - name: storage-volume
    persistentVolumeClaim:
      claimName: mount-for-debug
The tail -f /dev/null is a way to ensure that the container stays up so you can shell into it.
3. Shell into Persistent Volume and Clean-Up
Delete the corrupted WAL files and blocks that are filling up the disk (preferably from oldest date onwards).
4. Clean-up Debug pod and sever link to PV
Simply delete the debug pod.
5. Redeploy the Prometheus-server
Which should now come up successfully as it has access to an empty drive.
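The order of operations above could be sketched as a handful of kubectl calls. To be clear, this is not what we ran during the incident, just a rough sketch of what I'd reach for next time. It assumes the scale-down route (not deleting the deployment), the sre namespace and prometheus-server deployment from the spec earlier, that the debug pod and its claim live in that same namespace, and a Pod manifest saved as debug-pod.yaml (a hypothetical filename). The KUBECTL variable is purely my own convention so the commands can be previewed before touching a cluster.

```shell
#!/bin/sh
# Sketch only: names come from the manifests above, except where noted.
KUBECTL="${KUBECTL:-kubectl}"      # assumption: set KUBECTL=echo to preview commands
NAMESPACE="sre"                    # assumption: debug pod and PVC share this namespace
DEPLOYMENT="prometheus-server"
DEBUG_MANIFEST="debug-pod.yaml"    # hypothetical filename for the debug Pod spec

# 1. Release the volume by scaling Prometheus down to zero replicas.
release_volume() {
  $KUBECTL scale deployment "$DEPLOYMENT" --replicas=0 --namespace "$NAMESPACE"
}

# 2. Create the debug pod that mounts the disk at /data.
start_debug_pod() {
  $KUBECTL apply --filename "$DEBUG_MANIFEST" --namespace "$NAMESPACE"
}

# 3. Open a shell in the debug pod to inspect and clean /data by hand.
open_debug_shell() {
  $KUBECTL exec --stdin --tty debug --namespace "$NAMESPACE" -- sh
}

# 4. Delete the debug pod so nothing else is holding the volume.
remove_debug_pod() {
  $KUBECTL delete pod debug --namespace "$NAMESPACE"
}

# 5. Bring Prometheus back up with a single replica.
restore_prometheus() {
  $KUBECTL scale deployment "$DEPLOYMENT" --replicas=1 --namespace "$NAMESPACE"
}
```

Sourcing this file, setting KUBECTL=echo and calling release_volume prints the command rather than running it, which is a cheap way to double-check the names before doing anything destructive.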
How could we prevent this from happening in the future?
Add alerts on disk usage on Prometheus-server
Allow the volumes to auto-expand.
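For the disk usage alert, a proper Prometheus alerting rule is the real answer, but even a tiny shell check would have given us a heads-up. The sketch below is a stand-in for that, not an alert rule: it reads the Use% column from df and complains past a threshold. The 80% default and the function names are my own assumptions.

```shell
#!/bin/sh
# Sketch: warn when a mount point crosses a usage threshold.
# THRESHOLD and the function names are illustrative assumptions.
THRESHOLD="${THRESHOLD:-80}"

# Extract the Use% column (5th field) from one line of `df -P` output,
# stripping the trailing percent sign.
use_percent() {
  printf '%s\n' "$1" | awk '{ gsub("%", "", $5); print $5 }'
}

# Succeed (exit 0) when the usage on the given df line is at or above THRESHOLD.
is_over_threshold() {
  [ "$(use_percent "$1")" -ge "$THRESHOLD" ]
}

# Check a mount point on the current machine, e.g. `check_mount /data`.
check_mount() {
  line="$(df -P "$1" | tail -n 1)"
  if is_over_threshold "$line"; then
    echo "WARNING: $1 is at $(use_percent "$line")% (threshold ${THRESHOLD}%)"
  fi
}
```

Something like check_mount /data could run from a cron job or sidecar until real alerting is in place.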
More on this in another blog post!
If you made it this far, here's the picture from my desk as the first rays of sun crept in through my blinds as we wrapped up the incident... at 0800!



Hello, World!
Thilina Ratnayake — Mon, 24 Jan 2022 03:15:25 GMT
Howdy, my name is Thilina ("Thi-Li-Nuh") and I'm a dev...ish. Let's see what I write about!
