<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[tratnayake.dev]]></title><description><![CDATA[An SRE working on Cloud Infrastructure who likes to write about "engineering" things.]]></description><link>https://tratnayake.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1642993811224/5F4o4ESRk.png</url><title>tratnayake.dev</title><link>https://tratnayake.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 13:23:53 GMT</lastBuildDate><atom:link href="https://tratnayake.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[From Conversations to Code: "I'm Good at Interacting with People, Can't You See That?"]]></title><description><![CDATA[So what would you say you do here?

Is a question I routinely ask myself, because in a start-up, it’s always changing. (And also because Office Space was a fantastic and very quotable movie).
And as a Founding Customer Success Engineer - the follow-u...]]></description><link>https://tratnayake.dev/from-conversations-to-code-im-good-at-interacting-with-people-cant-you-see-that</link><guid isPermaLink="true">https://tratnayake.dev/from-conversations-to-code-im-good-at-interacting-with-people-cant-you-see-that</guid><category><![CDATA[support]]></category><category><![CDATA[success]]></category><category><![CDATA[engineering]]></category><category><![CDATA[Retool]]></category><category><![CDATA[No Code]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Tue, 14 Oct 2025 04:55:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735951067538/befc7989-3f56-471f-9403-f081febc6d6a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>So what would you say you do here?</em></p>
<p><img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExNmNnYnR6OXZybTFlc2J3ZmhzdDB6MG54aXllcmdhcmg4aXUwdzZjeSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/b7MdMkkFCyCWI/giphy.gif" alt class="image--center mx-auto" /></p>
<p>Is a question I routinely ask myself, because in a start-up, it’s always changing. (And also because <a target="_blank" href="https://www.imdb.com/title/tt0151804/"><strong><em>Office Space</em></strong></a> was a fantastic and <em>very</em> quotable movie).</p>
<p>And as a Founding Customer Success Engineer - the follow-up to this question in the movie actually…kinda fits!</p>
<blockquote>
<p>Well, look, I already told you.</p>
<p>I deal with the [fantastic] customers... so the engineers [can focus on other things].</p>
<p>I have people skills.</p>
<p>I am good at dealing with people!</p>
<p>Can't you understand that?</p>
</blockquote>
<p><img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExeWl2aXI4a2ZmcTVlcjB3andybTd3cHBrZ2wwNW50ODlla3JmZDV6aiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/oz7tyUbBs5SH6/giphy.gif" alt class="image--center mx-auto" /></p>
<p>Now, while I don’t <em>physically</em> bring the requirements from the customers to the engineers, the premise is still the same. My goal is to be out in the field <strong>interacting</strong> with customers to bring back anything that can help <em>them</em> become more successful with the product, which happens by helping <strong><em>us</em></strong> build a better product. (I wasn’t sure how to order this sentence.)</p>
<p>If I’m not talking to customers to bring back concerns, insights, or feedback - we’re not moving forward.</p>
<p>At different parts of the organizational journey - the role can be much more <strong>sales</strong>-leaning, or more <strong>success</strong>-focused; but at all times, my job revolves around <strong>interactions.</strong></p>
<p>In that sense, <strong>interactions</strong> with our customers (existing and prospective!) are my bread and butter, and they are my atomic unit of work. I suppose you could say they are a primitive in my system of work.</p>
<p>And so, when I started here a year ago, I realized one of the first things I’d need to do is start <strong>tracking</strong> these interactions.</p>
<p>To build the tooling &amp; processes to help, here are the steps I went through.</p>
<h1 id="heading-some-questions">Some Questions</h1>
<p>I first needed to figure out some answers:</p>
<h2 id="heading-how-do-i-interact-with-customers">How do I interact with customers?</h2>
<p>Zoom calls - These are the majority of my interactions that happen during the Sales Evaluation phase for getting customers on-boarded.</p>
<p>Emails - For any routine correspondence, which usually happens before and after calls, and for questions &amp; support requests.</p>
<p>Slack Messages - Either 1:1 or 1:M for direct support requests, questions, or if there are any general announcements about the product.</p>
<h2 id="heading-what-are-the-types-of-information-gained-from-those-interactions">What are the types of information gained from those interactions?</h2>
<ul>
<li><p>Bugs - Things that are not going as expected.</p>
</li>
<li><p>Feature Requests - Ideas for net-new functionality that doesn’t exist in the product.</p>
</li>
<li><p>Enhancements - Improvements to existing functionality that would help users be more successful with the product.</p>
</li>
<li><p>Comments - General thoughts that might not have an immediate action-item.</p>
</li>
<li><p>Questions - Self-explanatory.</p>
</li>
</ul>
<h2 id="heading-what-do-we-do-with-these-interactions-where-do-they-go-amp-whats-the-desired-end-result">What do we do with these interactions? Where do they go &amp; what’s the desired end-result?</h2>
<p>I split these interactions into two categories (Actionable vs Good To Know), and each of them has the following destinations &amp; possible end-states:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Type</strong></td><td><strong>Destination</strong></td><td><strong>End Result</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Bugs</strong></td><td>Actionable</td><td>Eng(ineering), P(roduct) M(anagement)</td><td>Bug fixed</td></tr>
<tr>
<td><strong>Enhancements</strong></td><td>Actionable</td><td>PM, Eng</td><td>Enhancement triaged and (1) scheduled for implementation, or (2) placed on-hold / determined won’t do.</td></tr>
<tr>
<td><strong>Feature Requests</strong></td><td>Actionable</td><td>PM</td><td>Request is (1) assessed and (2) built or (3) tabled / cancelled.</td></tr>
<tr>
<td><strong>Comments</strong></td><td>Good To Know</td><td>PM</td><td>Possible FRs or Enhancements. Could help with future strategy / unearthing potential new ideas.</td></tr>
<tr>
<td><strong>Questions</strong></td><td>Good To Know</td><td>Support / Success</td><td>Can help unearth things that could be optimized or better explained, and are a prime source of information for <strong>docs</strong> and <strong>content ideas.</strong></td></tr>
</tbody>
</table>
</div><h2 id="heading-what-sorts-of-information-are-required-for-the-eventual-audience">What sorts of information are required for the eventual audience?</h2>
<p>For each interaction type:</p>
<blockquote>
<p>What would the consumer of this information want when they’re reading it?</p>
<p>What are the specific pieces of information?</p>
</blockquote>
<p><strong>Bugs</strong></p>
<ul>
<li><p>Details: What’s the problem? And in detail?</p>
</li>
<li><p>Severity: How bad is it? Does it stop users from using the product?</p>
</li>
<li><p>Can it be validated &amp; reproduced? (Is this an edge-case, or is this proven to be happening)</p>
</li>
<li><p>Any supporting details that would help resolve the problem? (Anything that the engineers don’t need to hunt for)</p>
</li>
</ul>
<p><strong>Feature Requests (&amp; Enhancements)</strong></p>
<ul>
<li><p>Details &amp; reasoning: What does the customer want, and <strong>why?</strong></p>
</li>
<li><p>Level of importance: How badly do they want this? Will this be a deal-breaker for implementation or continued usage?</p>
</li>
<li><p>How does it fit into their usage?</p>
<ul>
<li>The details here will help PM determine prioritization and feasibility - especially if this request is widely applicable to other members of the customer base or market.</li>
</ul>
</li>
</ul>
<p><strong>Comments</strong></p>
<ul>
<li><p>What do the users have to say?</p>
</li>
<li><p>Sentiment: Is the message <strong>positive</strong>, <strong>negative</strong> or <strong>neutral</strong>?</p>
</li>
</ul>
<p><strong>Questions</strong></p>
<ul>
<li><p>What’s their question?</p>
</li>
<li><p>Has this come up before?</p>
</li>
<li><p>Could this be answered through better messaging, documentation or updates to the product?</p>
</li>
</ul>
<h1 id="heading-some-constraints">Some Constraints</h1>
<p>I’ve worked in medium sized businesses, and I’ve worked in some large mega-corps - and I’ve seen the way that they handle these challenges. Usually, it’s through the use of tools like <strong>ticketing systems</strong> (like Zendesk) and <strong>project management tools</strong> (like Jira) and with all the accoutrements (like middleware to walk between them) to go with it.</p>
<p>But we’re not medium-sized <em>or</em> a mega-corp. We’re a start-up, which means scrappy, <em>resource-efficient</em> and moving quickly. We’re not going to buy those products (yet), and we don’t have all the time to build something very custom.</p>
<blockquote>
<p><em>These tools will not do, Donkey</em> - Shrek<sub>, if he were faced with the same problems</sub>. Probably. idk.</p>
</blockquote>
<h1 id="heading-some-goals">Some Goals</h1>
<ol>
<li><h2 id="heading-make-it-easy">Make it easy</h2>
<p> I think I may have been reading through Atomic Habits by James Clear at the time - because one thought at top of mind was <strong>making it <em>easy</em> to capture and track this information</strong>. Especially since this is something I was going to be doing many times a day. I wanted to get the important information out of my head / conversations - quickly.</p>
</li>
<li><h2 id="heading-make-it-trackable-enable-attribution">Make it trackable (Enable attribution)</h2>
<p> The whole point of this was to get the information into a central place so that it could be (1) organized, (2) reported and (3) acted on.</p>
</li>
</ol>
<blockquote>
<p>Right information, to the right place, at the right time.</p>
</blockquote>
<h1 id="heading-development">Development</h1>
<p>To make this work, I broke down the system &amp; tooling into a couple of units:</p>
<h2 id="heading-components">Components</h2>
<p>This was the initial solution I came up with, made of three main pieces from left to right:</p>
<ol>
<li><p><strong>Portal</strong>: The <strong>interactions</strong> come “in” via me (or anyone else in the org)<br /> To make sure it was easy to do quickly, and especially in a call - I wanted something like a quick-fill form:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735947581497/a3931329-dc78-4fbf-b061-a8b48df97a48.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Database:</strong> The interactions are <strong>stored</strong> in a central place with all the required and relevant information; and then finally</p>
</li>
<li><p><strong>Systems of Record:</strong> The interactions are <strong>routed</strong> to the appropriate places for <strong>consumption</strong>, <strong>recording</strong> or <strong>action</strong>. (An example below)</p>
<ol>
<li><p><strong>Consumption</strong>: Into HubSpot as entries into the CRM (and Slack as messages to tag me)</p>
</li>
<li><p><strong>Recording</strong>: Into the customer “Implementation plans” that we were originally storing in Notion; and lastly</p>
</li>
<li><p><strong>Action</strong>: Into GitHub Projects for PM &amp; Engineers to create issues and build off.</p>
</li>
</ol>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735947817721/b1ad7c55-c309-4c45-90d4-9a2a1a8f1c58.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-architectural-decisions">Architectural Decisions</h2>
<p>I decided that my system would probably have to look something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735947394788/2cd36288-ca82-41e1-a9ea-3e64bcdbd33d.png" alt class="image--center mx-auto" /></p>
<p>where I decided to use <strong>Retool</strong> as the database &amp; front-end for a couple of reasons:</p>
<ol>
<li><p>I was very comfortable with its use,</p>
</li>
<li><p>It provides a powerful no-code editor to get UI components up and running BUT also the ability to really dig into (and implement) your own logic / scripts using their <strong>workflows</strong> and</p>
</li>
<li><p>The ability to connect to many different <strong>resources</strong>. (This comes in handy for things like connecting to product endpoints for usage data and connecting to other APIs for supporting information - i.e. to our AI notetaker’s API for transcripts of conversations.)</p>
</li>
</ol>
<h1 id="heading-implementation-the-user-side">Implementation (the User Side)</h1>
<p>With all of that in mind, this is what capturing interactions looks like in YeshID today:</p>
<ol>
<li><p>When anyone needs to record an <strong>interaction</strong> they head to the SupPortal (which, with the use of go-links, becomes very easy to remember with <code>go/supportal</code>)</p>
<ul>
<li>The org picker at the top is populated with all orgs in our dashboard (and will add in the <code>Org_ID</code> which means no hunting for information through the database)</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735948925106/bc1fafd0-102c-4b96-8ce0-75b73c70cbb4.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><p>The interaction type is a drop-down split into <code>Bug</code>, <code>Enhancement</code>, <code>Customer Request</code> and <code>Comment</code> - which determines how the interaction will be actioned. (Let’s look at a Bug as an example.) I can then go in and fill out the details. An option I have for most interactions is “Is this a blocker”, which adds a specific label onto the issues in GitHub Projects (where we track our work).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949009228/d47a9eda-f56e-4b93-b292-e1c77c7c7e6f.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Once the interaction is <strong>submitted</strong>, it’s fanned out into the following places:</p>
</li>
</ol>
<ul>
<li><p><strong>Slack</strong> - So that I have a record (which was initially my only record)</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949100629/346eddfa-c4f7-4cd4-b9cd-46da9c721ef3.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>GitHub Issue</strong> - So that the work can be picked up by PM or Eng</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949119557/83dcaf0b-6639-41e9-9633-30b2fd8a3a46.png" alt class="image--center mx-auto" /></p>
<p>  and finally:</p>
</li>
</ul>
<p><strong>HubSpot</strong> - So that when I’m prepping for calls or whenever PM / someone else needs a recap on a customer’s journey so far, they can see all the interactions that have happened:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949160256/d8a2dba0-2ad1-4559-8f63-40182cc3bf52.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-bonus-closing-the-loop">Bonus: Closing the Loop</h1>
<p>While <s>knowledge is half the battle</s> getting the information is important - at some point, it only becomes useful for the <strong>user</strong> when they know <strong><em>something’s been done with their feedback</em></strong>. I’ve improved this system so that now when Engineers close a ticket, it will automatically alert me in Slack and in HubSpot.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949308576/f98a1c71-fad0-4ebf-8c31-9975bc57b9fc.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-benefits">Benefits</h1>
<p>By using this system, I’ve noticed a couple of advantages:</p>
<ol>
<li><p>The <strong>right</strong> <strong>information</strong> is collected at the <strong>right time.</strong> No more bug reports going to Eng without basic information like Org IDs or Loom links.</p>
</li>
<li><p>It’s <strong>easy</strong> to collect information and get the information to move quickly (that’s one of our few advantages against the heavy-hitters as a start-up!)</p>
</li>
<li><p>The interactions are <strong>trackable</strong>. All of the interactions are stored in a Retool database which means I can have visibility and the <em>ability</em> to get information / insights like:</p>
<ol>
<li><p>How long do tickets take to get resolved?</p>
</li>
<li><p>How many tickets are attributed to a specific customer?</p>
</li>
<li><p>Did some customers submit more interactions than others? Why? What can we learn from that?</p>
</li>
</ol>
</li>
</ol>
<p>One <em>specific</em> example that comes to mind is the time we were about to enable billing and wanted to reach out to our customers. (A big milestone in any start-up’s career!) One of the tasks on our plate was to communicate the state of billing to our <strong>Lighthouse Customers</strong> (customers who partnered with us early to help gather data / provide feedback - in exchange for certain perks). We needed to let them know that (1) “hey you’re going to see some information about pricing and billing”, but (2) “don’t worry - [here’s how this information applies to you].”</p>
<p>We <em>could</em> have easily come up with some generic boilerplate messaging that could do the job but NOPE. That’s not the experience I want for my customers, and especially for those people whose efforts, time and feedback were <strong>instrumental</strong> in getting us to that very point. And so I thought about the idea of having a tailored blurb in each of those messages to act as a checkpoint to validate and recognize their efforts.</p>
<blockquote>
<p>Wouldn’t it be great if each Lighthouse customer got a little snapshot / blurb of how they helped?</p>
</blockquote>
<p>Before this system, that task would have been painstaking and time-intensive. The effort required to dig through multiple places / systems and attribute the tickets would have been unfeasible on its own. But thanks to this system and having it <em>all</em> go into HubSpot (or into GitHub Projects with attribution) - I was able to make up blurbs like this on very short notice.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735949992764/1287bf6d-e4c3-4475-97f6-01da3709d677.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In a start-up, roles often shift, and as a Founding Customer Success Engineer, my job revolves around interacting with customers to improve our product. To streamline this process, I developed a system for tracking customer interactions using tools like Retool, HubSpot, and GitHub. This setup allows us to efficiently capture, organize, and act on feedback, ensuring vital information reaches the right teams promptly. The system enhances visibility, speeds up resolution times, and empowers personalized communication with key customers, proving invaluable in our fast-paced environment.</p>
]]></content:encoded></item><item><title><![CDATA[A misadventure with Terraform Sets & PagerDuty Schedules]]></title><description><![CDATA["T, why didn't I get this page?" 🤨
"Wait, why does it show that <other_person> is on call? They just did it the other week." 🧐

Are two phrases that you don't want to hear after making changes to your PagerDuty schedules Terraform.
Intro
In the las...]]></description><link>https://tratnayake.dev/a-misadventure-with-terraform-sets-pagerduty-schedules</link><guid isPermaLink="true">https://tratnayake.dev/a-misadventure-with-terraform-sets-pagerduty-schedules</guid><category><![CDATA[Terraform]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><category><![CDATA[oncall]]></category><category><![CDATA[pagerduty]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Sat, 22 Jul 2023 23:20:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/bwOAixLG0uc/upload/29a77939b2b79aacc64722c199e14b5a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>"T, why didn't I get this page?" 🤨</p>
<p>"Wait, why does it show that &lt;other_person&gt; is on call? They just did it the other week." 🧐</p>
</blockquote>
<p>Are two phrases that you <em>don't</em> want to hear after making changes to your PagerDuty schedules Terraform.</p>
<h1 id="heading-intro">Intro</h1>
<p>In the last couple of weeks, I've been leading the efforts to on-board 3 new engineers to our on-call rotation. As part of that work, one of the tasks is to get those engineers added to PagerDuty (PD), the app we use for managing on-call shifts and alerting. While this can easily be done in the PD UI, we implement these changes via Terraform so that they're documented, codified, and tracked via version control, which also adds another layer of auditability.</p>
<p>Some key concepts for working with PagerDuty:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067160267/6ae2cedb-e327-4928-95b1-605060bccd39.png" alt class="image--center mx-auto" /></p>
<ul>
<li>A <code>schedule</code> determines the WHO, and WHEN. (Who will be in the rotation, how long the rotation will be, and when the rotation starts).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067399094/0c721f0f-addc-4556-979c-8bed2242fda0.png" alt class="image--center mx-auto" /></p>
<ul>
<li>An <code>escalation policy</code> determines the ordering/logic for which schedules get paged.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067417718/9b0ba7fb-d79b-47c0-a1f9-d3848e95fa15.png" alt class="image--center mx-auto" /></p>
<ul>
<li>A <code>service</code> is what represents your service (or system) and will be linked to an <code>escalation policy</code>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067429931/918c1d36-7d41-438b-904a-9e28011e3fb0.png" alt class="image--center mx-auto" /></p>
<p>So from the top:</p>
<ol>
<li><p>When a <code>service</code> has an alert, PD will look at the <code>escalation policy</code>.</p>
</li>
<li><p>Based on the <code>escalation policy</code> and the current situation (i.e. first alert, first loop), PD will notify the appropriate <code>schedule</code>.</p>
</li>
</ol>
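<p>A minimal sketch of how these three resources reference each other in Terraform, assuming the PagerDuty provider is already configured. The resource names (<code>example</code>), the two <code>pagerduty_user</code> references, and the numeric values are placeholders for illustration, not the configuration from the gist:</p>
<pre><code class="lang-hcl"># WHO and WHEN: an ordered rotation of users.
resource "pagerduty_schedule" "example" {
  name      = "Example rotation"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "weekday"
    start                        = "2023-01-01T09:00:00-08:00"
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    rotation_turn_length_seconds = 604800 # one week per turn

    # A list, so the order written here is the rotation order.
    # Both users are hypothetical.
    users = [
      pagerduty_user.first.id,
      pagerduty_user.second.id,
    ]
  }
}

# The ordering/logic for which schedule gets paged.
resource "pagerduty_escalation_policy" "example" {
  name      = "Example escalation policy"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.example.id
    }
  }
}

# The service that raises alerts, linked to the escalation policy.
resource "pagerduty_service" "example" {
  name              = "Example service"
  escalation_policy = pagerduty_escalation_policy.example.id
}
</code></pre>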
<p>You can see a full gist of the old code <a target="_blank" href="https://gist.github.com/tratnayake/779fec920613b79a94609436a198a457">here</a>.</p>
<blockquote>
<p>An important note for this example is that my team is actually considered a subteam (A) that shares its pager with subteam (B).</p>
</blockquote>
<h1 id="heading-before">Before</h1>
<p>Prior to this work, I had originally set my schedule up as follows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067399094/0c721f0f-addc-4556-979c-8bed2242fda0.png" alt class="image--center mx-auto" /></p>
<p>I also had each person's membership in a PagerDuty <code>team</code> like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1690067557159/09bdff39-788e-4575-8fec-58a7769060d6.png" alt class="image--center mx-auto" /></p>
<p>Given that I was:</p>
<ol>
<li><p>Specifying an association to a user twice, AND</p>
</li>
<li><p>Creating a new resource for each team membership,</p>
</li>
</ol>
<p>I wondered if I could refactor this.</p>
<h3 id="heading-enter-the-good-idea-fairy">Enter, the Good Idea Fairy 🧚🏼</h3>
<p>Since my last brush with Terraform, I'd like to think I'd gotten better with it - especially with the use of <code>for_each</code> statements. So when looking at a solution to this "problem" - I thought:</p>
<blockquote>
<p><em>Why not just create a</em> <code>local.members</code> list with all the users, and then use that as (1) the members for the <code>schedule</code> and (2) to have a single statement to create the <code>team_memberships</code> via a <code>for_each</code>?</p>
<p>In FACT! Since we have two subteams, I could create two lists and simply combine them!</p>
</blockquote>
<h1 id="heading-after">After</h1>
<p>This is what I ended up with after refactoring and making what I <em>thought</em> were good changes. <a target="_blank" href="https://gist.github.com/tratnayake/108f42e1897884ae5c7783373595d946">Gist</a>.</p>
<p>I thought I was pretty slick by doing the following:</p>
<ul>
<li>Setting up the list of teammates in a local variable.</li>
</ul>
<pre><code class="lang-bash">locals {
    my_team_subteam_a_members = toset([
        pagerduty_user.thilina_ratnayake.id,
        pagerduty_user.teammate_b.id,
        pagerduty_user.teammate_c.id,
    ])
    my_team_subteam_b_members = toset([
        pagerduty_user.teammate_a.id,
        pagerduty_user.teammate_d.id
    ])
}
</code></pre>
<ul>
<li>Using the lists from above with <code>setunion()</code> to combine both subteams A and B.</li>
</ul>
<pre><code class="lang-bash">resource <span class="hljs-string">"pagerduty_schedule"</span> <span class="hljs-string">"myteam_schedule"</span> {
  name        = <span class="hljs-string">"My Team"</span>
  time_zone   = <span class="hljs-string">"America/Los_Angeles"</span>
  description = <span class="hljs-string">"PD Schedule for My Team, Slack #my-team, Email: my-team@company.com"</span>

  layer {
    name                         = <span class="hljs-string">"weekday"</span>
    rotation_turn_length_seconds = 1209600 <span class="hljs-comment"># 14 days</span>
    rotation_virtual_start       = <span class="hljs-string">"2023-01-01T09:00:00-08:00"</span>
    start                        = <span class="hljs-string">"2023-01-01T09:00:00-08:00"</span>
    users                        = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
  }
}
</code></pre>
<ul>
<li>Iterating through the memberships for each subteam.</li>
</ul>
<pre><code class="lang-bash">resource <span class="hljs-string">"pagerduty_team_membership"</span> <span class="hljs-string">"my_team_subteam_a_members"</span> {
  for_each = local.my_team_subteam_a_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_a.id
}

resource <span class="hljs-string">"pagerduty_team_membership"</span> <span class="hljs-string">"my_team_subteam_b_members"</span> {
  for_each = local.my_team_subteam_b_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_b.id
}
</code></pre>
<p>Except, I wasn't. Because this didn't go as planned - and the day after I made the changes we noticed that the PagerDuty schedules were completely off.</p>
<h1 id="heading-the-reason">The Reason</h1>
<blockquote>
<p>In a schedule, <strong>ordering matters.</strong></p>
</blockquote>
<p>Before, we had specified the ordering and had that ordering based on a start date. That meant that after every interval (rotation), the next person would be in the hot seat to carry the pager.</p>
<p>However, when we did:</p>
<p><code>users = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)</code></p>
<p>This ended up doing a <code>union</code> of the sets, which completely changes &amp; disregards the order. In fact, that's actually specified in the <a target="_blank" href="https://developer.hashicorp.com/terraform/language/functions/setunion">documentation</a> <em>that I missed 🤦🏽‍♂️</em>:</p>
<pre><code class="lang-bash">&gt; setunion([<span class="hljs-string">"a"</span>, <span class="hljs-string">"b"</span>], [<span class="hljs-string">"b"</span>, <span class="hljs-string">"c"</span>], [<span class="hljs-string">"d"</span>])
[
  <span class="hljs-string">"d"</span>,
  <span class="hljs-string">"b"</span>,
  <span class="hljs-string">"c"</span>,
  <span class="hljs-string">"a"</span>,
]
</code></pre>
<blockquote>
<p>The given arguments are converted to sets, so the result is also a set and the ordering of the given elements is not preserved.</p>
</blockquote>
<p>By doing a <code>setunion</code> on <code>local.my_team_subteam_a_members</code> and <code>local.my_team_subteam_b_members</code>, the ordering was completely disregarded, which led to PagerDuty setting up someone who wasn't scheduled as the person on-call for the rotation.</p>
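<p>For comparison, here is how the list and set functions behave in <code>terraform console</code> (both examples are taken from the Terraform function documentation). <code>concat()</code> keeps elements in the order given, while <code>toset()</code> drops duplicates and ordering. That suggests the <code>toset([...])</code> locals had likely already lost their order before <code>setunion()</code> ever ran:</p>
<pre><code class="lang-hcl">&gt; concat(["a", ""], ["b", "c"])
[
  "a",
  "",
  "b",
  "c",
]
&gt; toset(["c", "b", "b"])
toset([
  "b",
  "c",
])
</code></pre>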
<h1 id="heading-conclusion">Conclusion</h1>
<p>While it's great to be <code>DRY</code> and avoid the repetition of values - that shouldn't get in the way of functionality. With regards to Terraform:</p>
<ol>
<li><p>If ordering matters in a list, don't use <code>setunion()</code> (or <code>toset()</code>): sets are unordered, so the order you wrote is not preserved.</p>
</li>
<li><p>Especially if you're setting up a PagerDuty schedule, just "hardcode" / manually specify the rotation order.</p>
</li>
</ol>
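<p>A possible middle ground, as a sketch that has not been tested against the real schedule: keep each subteam as an ordered list, join the lists with <code>concat()</code> (which preserves order), and convert to a set only where order is irrelevant, i.e. for <code>for_each</code>. The local name <code>my_team_rotation</code> is made up for this example, and placing subteam A before subteam B is an assumption about the intended rotation. It also assumes the user IDs are already known at plan time, since <code>for_each</code> needs its keys before apply.</p>
<pre><code class="lang-hcl">locals {
  # Lists, not sets: the order written here is the rotation order.
  my_team_subteam_a_members = [
    pagerduty_user.thilina_ratnayake.id,
    pagerduty_user.teammate_b.id,
    pagerduty_user.teammate_c.id,
  ]
  my_team_subteam_b_members = [
    pagerduty_user.teammate_a.id,
    pagerduty_user.teammate_d.id,
  ]

  # concat() joins lists end to end and keeps element order.
  # (Hypothetical local name.)
  my_team_rotation = concat(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
}

resource "pagerduty_schedule" "myteam_schedule" {
  name      = "My Team"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"
    users                        = local.my_team_rotation
  }
}

# Team membership has no ordering, so a set is fine for for_each.
resource "pagerduty_team_membership" "my_team_subteam_a_members" {
  for_each = toset(local.my_team_subteam_a_members)
  user_id  = each.value
  team_id  = pagerduty_team.my_team_subteam_a.id
}
</code></pre>
<p>If the real rotation interleaves the two subteams, that order would need to be written out explicitly in its own list. The only point here is that whatever feeds <code>users</code> stays a list.</p>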
]]></content:encoded></item><item><title><![CDATA[Common CI Pipeline Considerations: Ordering and Caching]]></title><description><![CDATA[At work last week, I found myself getting burned by ordering and "file doesn't exist" errors. Ultimately - the thing that (would have) helped me get past most of these issues is something that I can't believe I forgot: a table.

If you've been follow...]]></description><link>https://tratnayake.dev/common-ci-pipeline-considerations-ordering-and-caching</link><guid isPermaLink="true">https://tratnayake.dev/common-ci-pipeline-considerations-ordering-and-caching</guid><category><![CDATA[Devops]]></category><category><![CDATA[SRE]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[CircleCI]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Sat, 15 Jul 2023 17:10:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/MRb9Hu3hwgM/upload/ff1ead12e5a2c5a69978f3c226798d00.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At work last week, I found myself getting burned by ordering and "file doesn't exist" errors. Ultimately - the thing that (would have) helped me get past most of these issues is something that I can't believe I forgot: a table.</p>
<hr />
<p>If you've been following along with my last two blog posts, I've been tasked with implementing a check on PR to validate dynamically generated configuration files for a service running OpenTelemetry Collectors.</p>
<p>To do this we had to:</p>
<ol>
<li><a target="_blank" href="https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery">Use YQ to template out the config files from the Helm templates ✅</a></li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689437958276/9e13a441-744e-41af-928b-c3e3551e65dc.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><p><a target="_blank" href="https://tratnayake.dev/lessons-learned-sharpening-my-bash-skills">Create the bash script that would loop through each of the <code>&lt;environment&gt;.yaml</code>'s to (1) generate the config files and (2) run them through the OTC's <code>validate</code> command ✅</a>  </p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689437994887/6edb858e-f160-498a-8f9b-e194c3d53337.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<p>You'll recall that the goal from Step 2 was to get this running successfully on my laptop so we could then set it up to run on CI.</p>
<p>Which brings us to today.</p>
<h1 id="heading-setting-this-up-in-ci">Setting this up in CI</h1>
<p>At this point, I had a working script that I simply had to run on my Continuous Integration (CI) pipeline on each PR. We currently use CircleCI so that meant updating our <code>/.circleci/config.yml</code> file which looks something like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">workflows:</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">build_and_test:</span>
    <span class="hljs-attr">jobs:</span>
      [<span class="hljs-string">...</span>]
      <span class="hljs-bullet">-</span> <span class="hljs-string">test-my-service-build</span>
</code></pre>
<p>So I figured, maybe I'd just add a new job called <code>validate-my-service-config</code> which would check out the repo and run any time code changed in our Helm chart (where the config files were defined).</p>
<p>Perhaps something like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">workflows:</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">build_and_test:</span>
    <span class="hljs-attr">jobs:</span>
      [<span class="hljs-string">...</span>]
      <span class="hljs-bullet">-</span> <span class="hljs-string">test-my-service-build</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">validate-my-service-config</span>
<span class="hljs-comment">###</span>
<span class="hljs-attr">validate-my-service-config:</span>
    <span class="hljs-attr">working_directory:</span> <span class="hljs-string">/home/circleci/workspace/myrepo</span>
    <span class="hljs-comment"># Run this step in a Docker container using the docker image we've</span>
    <span class="hljs-comment"># spec'd in pipeline params.</span>
    <span class="hljs-attr">docker:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">image:</span> <span class="hljs-string">cimg/go:&lt;&lt;</span> <span class="hljs-string">pipeline.parameters.golang-image-version</span> <span class="hljs-string">&gt;&gt;</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-comment"># Git checkout the repo</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">shallow-checkout</span>
      <span class="hljs-comment"># Check if the charts dir has changed which contains the config</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">kubernetes/charts/my-service</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"validate-my-service-config"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            # make validate-all-configs runs run_all_checks.sh
            make -C kubernetes/charts/my-service validate-all-configs</span>
</code></pre>
<p>But there arose two new problems:</p>
<ol>
<li><p>This step runs in a Go Docker image, which doesn't have <code>Helm</code> on board; and</p>
</li>
<li><p>This step runs the <code>make validate-all-configs</code> target, which runs the <code>run_all_checks.sh</code> script, which in turn expects the <code>my-service</code> binary to be <strong>present</strong> in a specific file location. <strong>But what if the binary doesn't exist in the expected location at the time of running?</strong></p>
</li>
</ol>
<ol>
<li><h3 id="heading-dependency-install-helm">Dependency: Install Helm</h3>
</li>
</ol>
<p>Installing Helm was easy thanks to a make target created by SREs before me, called <code>make install-helm</code> (it would fetch the script from <a target="_blank" href="http://get.helm.sh"><code>get.helm.sh</code></a>, uncompress it, and install from source).</p>
<ol start="2">
<li><h3 id="heading-dependency-ensure-that-the-my-service-binary-is-available-before-the-script-runs">Dependency: Ensure that the <code>my-service</code> binary is available before the script runs.</h3>
</li>
</ol>
<p>This is where ordering became important. You will recall from further up that our CircleCI workflow had a job called <code>test-my-service-build</code> which was used to test and confirm that the binary would, in fact, build.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">test-my-service-build:</span>
    <span class="hljs-attr">working_directory:</span> <span class="hljs-string">/home/circleci/workspace/myrepo</span>
    <span class="hljs-attr">docker:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">image:</span> <span class="hljs-string">cimg/go:&lt;&lt;</span> <span class="hljs-string">pipeline.parameters.golang-image-version</span> <span class="hljs-string">&gt;&gt;</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">shallow-checkout</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">go/src/github.com/myrepo/my-service</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            make -C go/src/github.com/myrepo/my-service build</span>
</code></pre>
<p>I initially thought:</p>
<blockquote>
<p>Sweet! Maybe I'll just run the validation steps inside of testing the image, because that way the binary will be available AND it kills two birds with one stone.</p>
</blockquote>
<p>(Also, on a completely random tangent, that idiom always reminds me of this comic by Nathan W. Pyle.)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689440738035/dc575848-b6ea-4bed-8c4b-831380694aa8.jpeg" alt class="image--center mx-auto" /></p>
<p>Anyways, it would look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689439064597/869b0348-4146-41c5-8fa3-2231fa64c657.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-yaml"><span class="hljs-attr">test-my-service-build:</span>
    [<span class="hljs-string">...</span>]
      <span class="hljs-comment"># Check build AND config paths for changes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">go/src/github.com/myrepo/my-service</span> <span class="hljs-string">kubernetes/charts/my-service</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            [...]
            make -C go/src/github.com/myrepo/my-service build
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"validate"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            [...]
            make -C kubernetes/charts/my-service validate-all-configs</span>
</code></pre>
<p>But I had to quickly disqualify that idea, mainly because <code>test-my-service-build</code> was (1) specifically "targeted" to run on file changes to the Go source and (2) only meant to check the build. Doing config validation in this step felt like inappropriate overloading.</p>
<p><strong>What would happen if we <em>only</em> needed to check config, or only check build? Is it appropriate that we have to do the other step as well?</strong></p>
<p>More specifically, this is inappropriate because it could lead to situations where we are unable to make a <code>build</code> related change because <code>config</code> is bad, and vice versa. This creates unnecessary coupling between the two.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Go Code Changes</td><td>Helm Config Changes</td><td>Result</td></tr>
</thead>
<tbody>
<tr>
<td>✍🏾</td><td>None or ❌</td><td>A PR dealing with Go Code would be blocked due to a Config issue</td></tr>
<tr>
<td>None or ❌</td><td>✍🏾</td><td>A PR dealing with Helm config changes would be blocked due to a Go Code issue.</td></tr>
</tbody>
</table>
</div><blockquote>
<h2 id="heading-okay-what-if-we-made-test-my-service-build-a-pre-requisite-for-validate-my-service-config">Okay what if we made <code>test-my-service-build</code> a pre-requisite for <code>validate-my-service-config</code> ?</h2>
</blockquote>
<p>Something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689439088364/731520d0-552e-48ee-bb6e-07f822bbab40.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-yaml"><span class="hljs-attr">workflows:</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">build_and_test:</span>
    <span class="hljs-attr">jobs:</span>
      [<span class="hljs-string">...</span>]
      <span class="hljs-bullet">-</span> <span class="hljs-string">test-my-service-build</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">validate-my-service-config:</span>
          <span class="hljs-attr">requires:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">test-my-service-build</span>
</code></pre>
<p>Since the binary is built in <code>test-my-service-build</code>, it would ensure that the binary is available on the file system for <code>validate-my-service-config</code>.</p>
<p>But there was a problem with this approach:</p>
<h3 id="heading-incorrect-targeting">Incorrect Targeting</h3>
<p>Since <code>test-my-service-build</code> was targeted as follows:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">test-my-service-build:</span>
    [<span class="hljs-string">...</span>]
      <span class="hljs-comment"># Only the Go source path is checked for changes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">go/src/github.com/myrepo/my-service</span>
</code></pre>
<p>It would never actually build if there was only a change to config, meaning that the binary would never be built and thus never be available for the <code>validate-my-service-config</code> job, which puts us back at square one.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Go Code Change</td><td>Helm Config Change</td><td>Result</td></tr>
</thead>
<tbody>
<tr>
<td>✍🏾</td><td></td><td>CI would work as intended ✅</td></tr>
<tr>
<td></td><td>✍🏾</td><td>This would fail because changes were not made in the directory that's targeted by <code>test-my-service-build</code> and thus the required binary would not be available.</td></tr>
</tbody>
</table>
</div><blockquote>
<h2 id="heading-okay-okay-what-if-we-made-validate-my-service-config-do-its-own-build">Okay okay, what if we made <code>validate-my-service-config</code> do its own build?</h2>
</blockquote>
<p>Something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689439111543/6bf2f982-0858-4869-a094-e923b3970649.png" alt class="image--center mx-auto" /></p>
<p>In Code:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">validate-my-service-config:</span>
    [<span class="hljs-string">...</span>]
      <span class="hljs-comment"># Only the config (chart) path is checked for changes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">kubernetes/charts/my-service</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            [...]
            make -C go/src/github.com/myrepo/my-service build
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"validate"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            [...]
            make -C kubernetes/charts/my-service validate-all-configs</span>
</code></pre>
<h3 id="heading-inefficient">Inefficient</h3>
<p>This setup would mean that the <code>my-service</code> binary would need to get built twice if there was an incoming change that required changing Go code <strong><em>and</em></strong> Helm config.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Go Code Changes</td><td>Helm Config Changes</td><td>Results</td></tr>
</thead>
<tbody>
<tr>
<td>✍🏾</td><td></td><td>1. Triggers <code>test-my-service-build</code> which builds the <code>my-service</code> binary</td></tr>
<tr>
<td></td><td>✍🏾</td><td>1. Triggers <code>validate-my-service-config</code> which (1) builds the <code>my-service</code> binary and (2) validates config</td></tr>
<tr>
<td>✍🏾</td><td>✍🏾</td><td>1. Builds the binary in <code>test-my-service-build</code><br />2. Builds the binary <em>again</em> in <code>validate-my-service-config</code></td></tr>
</tbody>
</table>
</div><p>This was starting to feel like inappropriate overloading again.</p>
<blockquote>
<h2 id="heading-okay-okay-okay-what-if-we-made-use-of-caching-to-cut-down-on-the-amount-of-image-building-we-did">Okay okay okay, what if - we made use of caching to cut down on the amount of image building we did?</h2>
</blockquote>
<p>Yes. This would work. CircleCI has a couple of strategies for <a target="_blank" href="https://circleci.com/blog/persisting-data-in-workflows-when-to-use-caching-artifacts-and-workspaces/">persisting data between jobs and workflows</a> and for our use case we picked <code>caching</code>.</p>
<p>Specifically:</p>
<ol>
<li><p><code>test-my-service-build</code> ➡️ will always build the binary, and then save it to the <code>cache</code>.</p>
</li>
<li><p><code>cache</code> ➡️ <code>validate-my-service-config</code> ➡️ <code>cache</code> - where the cache is made available to the validate step.</p>
<ol>
<li><p>If the binary exists in cache, use that.</p>
</li>
<li><p>If it doesn't, create it!</p>
</li>
<li><p>If a binary is created, save it to cache!</p>
</li>
</ol>
</li>
</ol>
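<p>As a rough sketch of that decision in shell: the <code>ensure_binary</code> and <code>build_binary</code> names here are made up for illustration, and <code>build_binary</code> is only a stand-in for the real <code>make ... build</code> call, so treat this as an outline of the idea rather than what actually runs in our pipeline.</p>
<pre><code class="lang-bash"># Hypothetical stand-in for `make -C &lt;service dir&gt; build`.
build_binary() {
  local bin_path="$1"
  mkdir -p "$(dirname "$bin_path")"
  printf '#!/bin/sh\necho my-service\n' &gt; "$bin_path"
  chmod +x "$bin_path"
}

# Reuse the binary if the cache restore put one on disk, otherwise build it.
ensure_binary() {
  local bin_path="$1"
  if [ -e "$bin_path" ]; then
    echo "cache hit"
  else
    echo "cache miss"
    build_binary "$bin_path"
  fi
}

tmp_dir="$(mktemp -d)"
ensure_binary "$tmp_dir/dist/my-service"   # first call builds the binary
ensure_binary "$tmp_dir/dist/my-service"   # second call reuses it
</code></pre>
<p>The first call reports a cache miss and builds; the second finds the file already on disk and skips the build, which is the behaviour we want from the validate job.</p>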
<p>This solves our problem as follows:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Go Code Changes</td><td>Helm Config Changes</td><td>Result</td></tr>
</thead>
<tbody>
<tr>
<td>✍🏾</td><td></td><td>Always builds the binary, and saves it to cache.</td></tr>
<tr>
<td></td><td>✍🏾</td><td>Reads from cache to fetch a binary if it was recently built, and if not, creates the binary and saves to cache.</td></tr>
<tr>
<td>✍🏾</td><td>✍🏾</td><td>Will build the binary ONCE in either step (and then save it to cache), and that same binary will be used in the second step. <strong>(The binary only gets built once)</strong></td></tr>
</tbody>
</table>
</div><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689439435710/3bceac8d-7620-47a9-9c22-ec2662b5688d.png" alt class="image--center mx-auto" /></p>
<p>This is what that looks like in CircleCI config:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">workflows:</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">build_and_test:</span>
    <span class="hljs-attr">jobs:</span>
      [<span class="hljs-string">...</span>]
      <span class="hljs-bullet">-</span> <span class="hljs-string">test-my-service-build</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">validate-my-service-config</span>
<span class="hljs-comment">###</span>
<span class="hljs-attr">test-my-service-build:</span>
    <span class="hljs-attr">working_directory:</span> <span class="hljs-string">/home/circleci/workspace/myrepo</span>

    <span class="hljs-attr">docker:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">image:</span> <span class="hljs-string">cimg/go:&lt;&lt;</span> <span class="hljs-string">pipeline.parameters.golang-image-version</span> <span class="hljs-string">&gt;&gt;</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">shallow-checkout</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">go/src/github.com/myrepo/my-service</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
</span>
            <span class="hljs-string">make</span> <span class="hljs-string">-C</span> <span class="hljs-string">go/src/github.com/myrepo/my-service</span> <span class="hljs-string">build</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">save_cache:</span>
          <span class="hljs-attr">key:</span> <span class="hljs-string">my-service-binary-cache</span>
          <span class="hljs-attr">paths:</span> 
            <span class="hljs-bullet">-</span> <span class="hljs-string">go/src/github.com/myrepo/my-service/dist/my-service</span>
<span class="hljs-comment">###</span>
<span class="hljs-attr">validate-my-service-config:</span>
    <span class="hljs-attr">working_directory:</span> <span class="hljs-string">/home/circleci/workspace/myrepo</span>

    <span class="hljs-attr">docker:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">image:</span> <span class="hljs-string">cimg/go:&lt;&lt;</span> <span class="hljs-string">pipeline.parameters.golang-image-version</span> <span class="hljs-string">&gt;&gt;</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">shallow-checkout</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">check-if-code-changed:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">kubernetes/charts/my-service</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">restore_cache:</span>
          <span class="hljs-attr">keys:</span> 
            <span class="hljs-bullet">-</span> <span class="hljs-string">my-service-binary-cache</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">"validate-my-service-config"</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">|
            export MYREPO_REPO_ROOT=/home/circleci/workspace/myrepo
            export PATH=$MYREPO_REPO_ROOT/bin:$PATH
            # Check if a binary exists in the cache
            my_service_bin="go/src/github.com/myrepo/my-service/dist/my-service"
            if [ ! -e "$my_service_bin" ]; then
                echo "my-service binary does not exist, building it."
                make -C go/src/github.com/myrepo/my-service build
            fi
            make install-helm
            # make validate-all-configs runs run_all_checks.sh
            make -C kubernetes/charts/my-service validate-all-configs
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">save_cache:</span>
          <span class="hljs-attr">key:</span> <span class="hljs-string">my-service-binary-cache</span>
          <span class="hljs-attr">paths:</span> 
            <span class="hljs-bullet">-</span> <span class="hljs-string">go/src/github.com/myrepo/my-service/dist/my-service</span>
</code></pre>
<h1 id="heading-learnings">Learnings</h1>
<p>By understanding the ordering required by our CI steps and using CircleCI's cache, we are able to ensure that the dependencies for each step are met and that we're being efficient in doing only the steps required for each type of change.</p>
<p>As you saw from each of my iterations on the CI pipeline, I ended up having to build a table to test whether that configuration would satisfy each use case (i.e. Go change, Helm change, Go &amp; Helm change). If I were to do this again in the future, I think I would make that table a part of my design and planning process.</p>
<p>For example:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>If there's a Go Change (Binary)</td><td>If there's a Helm change (Config)</td><td>What should we test in CI?</td></tr>
</thead>
<tbody>
<tr>
<td>Yes</td><td>No</td><td>We only need to test that the binary is built successfully.</td></tr>
<tr>
<td>No</td><td>Yes</td><td>We only need to test that the config files can be validated against the binary. Prerequisite: the binary must already be built.</td></tr>
<tr>
<td>Yes</td><td>Yes</td><td>The binary must be built AND the config files must be validated. But we only need to build the binary once - good opportunity to use caching.</td></tr>
<tr>
<td>No</td><td>No</td><td>Nothing is required, kick back and relax.</td></tr>
</tbody>
</table>
</div><p>Considerations:</p>
<ol>
<li><p>The binary must be built for every Go change.</p>
</li>
<li><p>The binary must be built prior to testing a helm change, but it <em>may</em> run after the go binary is built in <code>test-my-service-build</code> - this might be a good opportunity to use caching.</p>
</li>
</ol>
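<p>If it helps to see that planning table as something executable, here's a small sketch. The <code>ci_plan</code> function is hypothetical (it isn't part of our repo or of CircleCI); it just maps the two "did it change?" answers to what CI should do, so every combination gets considered up front.</p>
<pre><code class="lang-bash"># Hypothetical helper: given whether Go code and Helm config changed
# ("yes" or "no"), print what CI should do, mirroring the table above.
ci_plan() {
  local go_changed="$1" helm_changed="$2"
  case "${go_changed}:${helm_changed}" in
    yes:no)  echo "build" ;;
    no:yes)  echo "validate (build only if no cached binary)" ;;
    yes:yes) echo "build once, then validate" ;;
    no:no)   echo "nothing" ;;
    *)       echo "unknown input" &gt;&amp;2; return 1 ;;
  esac
}

# Walk every combination, the same way the table does.
for go in yes no; do
  for helm in yes no; do
    printf 'go=%s helm=%s -&gt; %s\n' "$go" "$helm" "$(ci_plan "$go" "$helm")"
  done
done
</code></pre>
<p>Looping over all four combinations is the point: if a row is hard to fill in, that's usually the case the pipeline design hasn't accounted for yet.</p>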
<h1 id="heading-conclusion">Conclusion</h1>
<p>In conclusion, understanding the ordering of CI steps and utilizing CircleCI's cache feature can help ensure that dependencies are met efficiently for each type of change. Creating a table during the planning process can help anticipate different scenarios and optimize the CI pipeline accordingly.</p>
]]></content:encoded></item><item><title><![CDATA[Lessons Learned - Sharpening my Bash Skills]]></title><description><![CDATA[Bash scripting is one of those things that I always associate with a strong engineers and especially those in SRE. Conversely, it's not something I get to write a lot of and so - I'll take any opportunity to sharpen those skills.
This blog post expla...]]></description><link>https://tratnayake.dev/lessons-learned-sharpening-my-bash-skills</link><guid isPermaLink="true">https://tratnayake.dev/lessons-learned-sharpening-my-bash-skills</guid><category><![CDATA[Devops]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Bash]]></category><category><![CDATA[shell scripting]]></category><category><![CDATA[Continuous Integration]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Fri, 07 Jul 2023 23:12:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/0qvBNep1Y04/upload/ce7ec56fe237fbeeeac604a14c52abb9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Bash scripting is one of those things that I always associate with strong engineers, especially those in SRE. Conversely, it's not something I get to write a lot of and so - I'll take any opportunity to sharpen those skills.</p>
<p>This blog post explains how to set up a CI pipeline to validate config files for Open Telemetry Collectors (OTCs). It includes a bash script that checks a config file and deletes the directory if the config check passes, and also uses <code>getopts</code> to parse command line arguments and assign values to them. Additionally, I talk about the Shellcheck VS Code Extension which can be used to quickly fix linting errors.</p>
<h1 id="heading-background">Background</h1>
<p>For the past 2 weeks, I've been continuing my task of setting up our CI pipeline to support config file validation for our Open Telemetry Collectors (OTCs). This is a continuation of the work I described in my last <a target="_blank" href="https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery">blog post</a>.</p>
<p>On the surface - this task <em>seems</em> pretty easy:</p>
<ol>
<li><p>You build an Open Telemetry Collector (OTC)</p>
</li>
<li><p>You grab a config file</p>
</li>
<li><p>You feed the config file into the OTC using the <code>validate</code> command like so: <code>opentelemetry validate --config &lt;config_file.yml&gt;</code> which exits with <code>0</code> if valid; and</p>
</li>
<li><p>You do this in your CI (Continuous Integration) pipeline to ensure that the config files you're going to be deploying with, are valid</p>
</li>
</ol>
<p>Easy peasy right?</p>
<p>Not quite.</p>
<h1 id="heading-problems-andamp-work-to-be-done">Problems &amp; Work to be Done</h1>
<p>There were a couple of problems and todo items that arose:</p>
<ol>
<li><p>What do you do when the config files are not statically lying around on disk, but are dynamically generated at deploy-time using Helm? ✅ Solved! You can read how we did this with a nifty <code>yq</code> snippet here: <a target="_blank" href="https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery">https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery</a></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688759466617/3bf4b741-451b-4235-aa0d-9fc8e09c6ff4.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>How do you get your CI system to ✨ <em>do the things</em> ✨ (build the OTC, template out the Helm files)?</p>
</li>
</ol>
<p>As with most things, I figured the first step would be to try doing this on my laptop first:</p>
<blockquote>
<p>💡 If I can get this working on my laptop via some scripts, I can then tweak those scripts to work on CI.</p>
</blockquote>
<p>Enter, Bash.</p>
<h1 id="heading-bash-baby-bash">Bash baby, Bash.</h1>
<p>I decided to set this up through a system of scripts:</p>
<ul>
<li><p><code>checks/run_all_checks.sh</code> which would enumerate the environments that <code>my-service</code> was running in, and then based on each environment would run:</p>
</li>
<li><p><code>checks/run_checks.sh</code> which would:</p>
<ul>
<li><p>Create a <code>checks/test_generated_configs/&lt;env&gt;</code> directory</p>
</li>
<li><p>Run the <code>helm template</code> command from above and render out all the config files into that directory</p>
</li>
<li><p>Iterate through each of the config files and run them through the OTC binary <strong><em>(found at a specific file location)</em></strong> with the <code>validate</code> flag.</p>
</li>
<li><p>If validation failed, the script would print the error and exit with a non-zero code.</p>
</li>
<li><p>Cleanup by deleting the <code>checks/test_generated_configs</code> directory on exit.</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688768986539/9efe463a-dc99-4e9f-941b-8ab8790d12f9.png" alt class="image--center mx-auto" /></p>
<p>This post will centre around <code>run_checks.sh</code> as that is the meat of what I was working on, and here's what I learned.</p>
<h2 id="heading-the-flow">The Flow</h2>
<p>I like to set up my bash scripts as follows (there is probably some sort of convention I should be following, but this has served me well so far).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769079482/997aa49d-ea52-46ac-ac35-1c36f14e35ee.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-setting-up-for-success">Setting up for Success</h2>
<blockquote>
<p>Handle errors and let me know when things are going wrong.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769116851/f580ec58-a641-48c6-bec3-6e4e046842f9.png" alt class="image--center mx-auto" /></p>
<p>Shout out to this amazing <a target="_blank" href="https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425">gist</a> that explains it in great detail - but we started with this to <code>set</code> ourselves up for success (<em>do you see what I did there?</em>). This line does three things:</p>
<ul>
<li><p><code>-e</code> - tells Bash to bail out immediately on any <code>non-zero</code> exit codes (errors)</p>
</li>
<li><p><code>-u</code> - tells Bash to bail if the script references a variable that is not set (common in substitution operations)</p>
</li>
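<li><p><code>-o pipefail</code> - ensures that if any command in a pipeline (<code>a | b | c</code>) fails, the whole pipeline counts as failed, instead of only the last command's exit code being used (i.e. "fail as a team")</p>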
</li>
</ul>
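<p>To get a feel for what each option changes, here's a small sketch you can paste into a terminal. The <code>status_of</code> helper is something I made up for this example: it runs a snippet in a fresh <code>bash -c</code> and prints the exit code that snippet finished with.</p>
<pre><code class="lang-bash"># Hypothetical helper: run a snippet in a fresh bash and print its exit code.
status_of() {
  local rc=0
  bash -c "$1" 2&gt;/dev/null || rc=$?
  echo "$rc"
}

status_of 'false | true'                       # no pipefail: status comes from `true`
status_of 'set -o pipefail; false | true'      # pipefail: `false` fails the whole pipeline
status_of 'set -u; echo "$not_set_anywhere"'   # -u: referencing an unset variable is an error
status_of 'set -e; false; echo "not reached"'  # -e: the script stops at `false`
</code></pre>
<p>The first call prints <code>0</code> and the second prints <code>1</code>, which is the whole point of <code>pipefail</code>. The last two print a non-zero code, and the <code>echo</code> after <code>false</code> never runs.</p>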
<p>For me, I like to set up my variables next as follows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769175851/d32837ed-1010-4864-83ab-79b2e0cec8d9.png" alt class="image--center mx-auto" /></p>
<p>One thing I learned here was the use of <code>debug=${DEBUG-0}</code> which is essentially variable instantiation with a default.</p>
<p>This line says <code>set debug to the value of $DEBUG (which might be provided at runtime as an environment variable), and if $DEBUG is not set, use 0</code></p>
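<p>One subtlety worth a quick experiment: <code>${DEBUG-0}</code> only falls back to the default when the variable is <em>unset</em>, whereas the closely related <code>${DEBUG:-0}</code> form also replaces an empty value. A minimal sketch (the <code>resolve_debug</code> function is just a wrapper I've added for illustration):</p>
<pre><code class="lang-bash"># Hypothetical wrapper around the expansion so it can be called repeatedly.
resolve_debug() {
  # Use $DEBUG when it is set (even to an empty string), otherwise 0.
  echo "${DEBUG-0}"
}

unset DEBUG
echo "unset: [$(resolve_debug)]"   # falls back to the default
DEBUG=""
echo "empty: [$(resolve_debug)]"   # stays empty, because the variable is set
DEBUG=1
echo "set:   [$(resolve_debug)]"   # uses the provided value
</code></pre>
<p>This prints <code>[0]</code>, <code>[]</code> and <code>[1]</code> in that order. If you'd rather treat an empty <code>DEBUG</code> the same as a missing one, <code>${DEBUG:-0}</code> is the form to reach for.</p>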
<h2 id="heading-functions">Functions</h2>
<p>Personally, the next thing I like to set up in my script is the <strong>functions</strong>. To better illustrate how I built this up - I'll show both the functions and their invocation.</p>
<p>This is where the logic comes into play. From the functionality above, I've personally broken them down as follows:</p>
<ul>
<li><p>0 - Bootstrapping and Parse Args</p>
</li>
<li><p>1 - Generate Config Files</p>
</li>
<li><p>2 - Validate each Config File</p>
</li>
<li><p>3 - Cleanup</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769206463/b0e44e86-0e37-40ab-bd06-ac59a23195cb.png" alt class="image--center mx-auto" /></p>
<p>  This helps us build our skeleton. We can then proceed with building out.</p>
</li>
</ul>
<h3 id="heading-0-bootstrapping-and-parsing-args">0 - Bootstrapping and Parsing Args</h3>
<blockquote>
<p><em>Determine which environment this script should generate and validate configuration files for.</em></p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688770216316/518571b9-8883-4a15-bce3-60120e0175aa.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-bash"><span class="hljs-comment">#### MAIN</span>
<span class="hljs-comment"># 0. Bootstrapping &amp; Parse Args</span>
<span class="hljs-keyword">while</span> <span class="hljs-built_in">getopts</span> de: flag
<span class="hljs-keyword">do</span>
    <span class="hljs-keyword">case</span> <span class="hljs-string">"<span class="hljs-variable">${flag}</span>"</span> <span class="hljs-keyword">in</span>
        d)
          debug=1
          ;;
        e)
          environment=<span class="hljs-variable">$OPTARG</span>
          ;;
        \?)
          usage
          <span class="hljs-built_in">exit</span> 1
          ;;
    <span class="hljs-keyword">esac</span>
<span class="hljs-keyword">done</span>
<span class="hljs-built_in">shift</span> $((OPTIND -<span class="hljs-number">1</span>))

<span class="hljs-keyword">if</span> [[ <span class="hljs-variable">$environment</span> == <span class="hljs-string">""</span> ]]; <span class="hljs-keyword">then</span>
  usage
  <span class="hljs-built_in">echo</span> &gt;&amp;2 <span class="hljs-string">"error: please provide the -e &lt;environment&gt; option (staging, development, public)"</span>
  <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>
</code></pre>
<p>Which uses <code>getopts</code> to parse command line arguments and allows you to use their short codes (i.e. <code>-d</code> for debug, <code>-e</code> for <code>environment</code>).</p>
<p>This is done in a <code>while</code> loop with a <code>case</code> statement (which is kinda like a <code>switch</code>) to assign values to their arguments.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769253734/d8ab7870-9c82-40bf-b8c0-12c28f23b88b.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>Note that <code>\?)</code> line. This tells the script that if the <code>flag</code> from any of the command line arguments is not <code>d</code> or <code>e</code>, it should invoke the <code>usage</code> function and exit.</p>
</li>
<li><p>Note the <code>shift $((OPTIND -1))</code> at the end <a target="_blank" href="https://unix.stackexchange.com/questions/214141/explain-the-shell-command-shift-optind-1">which</a>:</p>
</li>
</ol>
<blockquote>
<p>removes all the options that have been parsed by <code>getopts</code> from the parameters list, and so after that point, <code>$1</code> will refer to the first non-option argument passed to the script.</p>
</blockquote>
<p>This means that if you have more to your command, like <code>run_checks.sh -e staging -d foo bar baz</code>, then <code>foo</code>, <code>bar</code> and <code>baz</code> now move up to the front of the "line" in positions <code>$1</code>, <code>$2</code> and <code>$3</code>.</p>
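<p>To see the effect of <code>shift</code> in isolation, here is a minimal sketch. The <code>parse_demo</code> function is a hypothetical helper written for this illustration (it is not part of the script above); it repeats the same <code>getopts</code> loop and then reports what is left in the positional parameters.</p>
<pre><code class="lang-bash">#!/usr/bin/env bash

# parse_demo is a hypothetical helper for illustration only.
parse_demo() {
  # Reset OPTIND locally so the function can be called more than once.
  local OPTIND=1 flag debug=0 environment=""
  while getopts "de:" flag; do
    case "${flag}" in
      d) debug=1 ;;
      e) environment=$OPTARG ;;
      *) return 1 ;;   # unknown flag: getopts has already printed an error
    esac
  done
  # Drop everything getopts consumed; $1 is now the first non-option argument.
  shift $((OPTIND - 1))
  echo "env=${environment} debug=${debug} first=${1:-} remaining=$#"
}

parse_demo -e staging -d foo bar baz
</code></pre>
<p>Running it prints <code>env=staging debug=1 first=foo remaining=3</code>: the two flags are gone and <code>foo</code> sits in <code>$1</code>.</p>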
<pre><code class="lang-yaml"><span class="hljs-string">if</span> [[ <span class="hljs-string">$environment</span> <span class="hljs-string">==</span> <span class="hljs-string">""</span> ]]<span class="hljs-string">;</span> <span class="hljs-string">then</span>
  <span class="hljs-string">usage</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"error: please provide the -e &lt;environment&gt; option (staging, development, public)"</span>
  <span class="hljs-string">exit</span> <span class="hljs-number">1</span>
<span class="hljs-string">fi</span>
</code></pre>
<p>If no <code>environment</code> is specified, <code>usage</code> is invoked and the error above is logged to <code>stderr</code>.</p>
<h6 id="heading-so-whats-usage">So what's usage?</h6>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769298191/9f2b52ca-05ba-40ec-891d-2f368a5ce423.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-yaml"><span class="hljs-comment">#### FUNCTIONS</span>
<span class="hljs-string">usage()</span> {
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"This tool will validate the configuration files for our my-service OTC's"</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"See run_all_tests.sh to loop over every env/cluster target."</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"Usage: $0 -e &lt;env&gt; [-d]"</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"       $0 -e staging"</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span> <span class="hljs-string">"       $0 -e public"</span>
  <span class="hljs-string">echo</span> <span class="hljs-string">&gt;&amp;2</span>
}
</code></pre>
<p>Is our very helpful message that gets printed out on unexpected command line input to instruct the operator on what to do.</p>
<h4 id="heading-checkpoint">Checkpoint</h4>
<p>This is what our script looks like now.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769335280/9331c912-85ad-4c27-a478-2d22ff1ddfa5.png" alt class="image--center mx-auto" /></p>
<p>Next up we had to implement our steps:</p>
<h3 id="heading-1-generate-the-config-files">1 - Generate the Config Files</h3>
<blockquote>
<p>Based on the &lt;env&gt; passed in, generate config files using our yq command.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688771055146/5db877b7-b235-41c8-b453-a6e9eb233d25.png" alt class="image--center mx-auto" /></p>
<p>You'll recall that from our last blog post, we now had a handy dandy way of generating the config files using a <code>yq</code> command with the <code>helm template</code></p>
<p><code>helm template -f staging.yaml -s templates/collector.yaml . | yq 'select(.kind == "OpenTelemetryCollector").spec.config' -s '"staging-config-" + $index'</code></p>
<p>Now we had to "bashify" this to work with environments other than just staging.</p>
<ol>
<li><p>First, we'd need to create a location for these files to live, like <code>/test_generated_configs/&lt;env&gt;</code></p>
</li>
<li><p>Then we'd want to go ahead and render out each of the config files.</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">mkdir</span> <span class="hljs-string">-p</span> <span class="hljs-string">"$TEST_ROOT_DIR"</span><span class="hljs-string">/test_generated_configs/"$environment"</span>

<span class="hljs-string">pushd</span> <span class="hljs-string">"$TEST_ROOT_DIR"</span> <span class="hljs-string">&amp;&amp;</span>  <span class="hljs-string">eval</span> <span class="hljs-string">"$(make_helm_template_command)"</span>
<span class="hljs-string">popd</span>
</code></pre>
<p>This is where we make use of <code>pushd</code> and <code>popd</code> to quickly CD <code>in</code> and <code>out</code> of our specific directory to run our command.</p>
<p>We run the command by using <code>eval "$(make_helm_template_command)"</code></p>
<p>That's actually from our function here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769389304/4f067b92-6484-4c6c-bd9b-65cd43825277.png" alt class="image--center mx-auto" /></p>
<p>Once this is complete, there will be config files generated like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769404096/5f31ad64-f40f-4f3c-a4dd-866f7a29c224.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-2-validate-each-config-file">2- Validate each Config File</h3>
<blockquote>
<p>For each config file, pass it into the <code>validate</code> command and check that it's valid.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688770292255/f99230da-3e39-4a8c-add3-4195f1468202.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769427371/0e36f6f3-75b5-42bb-8292-b6e1a30659c9.png" alt class="image--center mx-auto" /></p>
<p>Due to our rollout of M1 laptops amongst our team, we've noticed a mismatch in the binaries we have to build for running locally and in our infrastructure. Specifically, M1 laptops use <code>arm64</code> and our infra (like CI) uses <code>amd64</code>. Because of this, we have our binaries stored in arch-specific directories like <code>dist/arm64</code> or <code>dist/amd64</code>.</p>
<p>Using <code>arch=$(uname -m | sed 's/x86_64/amd64/;s/arm.*/arm64/')</code> is a slick one-liner to figure out which arch is being used in the invocation of the script.</p>
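<p>To check what the <code>sed</code> expression does without switching machines, the same substitution can be wrapped in a small function and fed sample <code>uname -m</code> values. <code>normalize_arch</code> is a hypothetical name used only for this sketch.</p>
<pre><code class="lang-bash">#!/usr/bin/env bash

# normalize_arch is a hypothetical wrapper around the one-liner's sed expression.
normalize_arch() {
  printf '%s\n' "$1" | sed 's/x86_64/amd64/;s/arm.*/arm64/'
}

normalize_arch x86_64   # amd64
normalize_arch arm64    # arm64
normalize_arch armv7l   # arm64 (anything starting with "arm" collapses to arm64)
normalize_arch aarch64  # aarch64 (unchanged, neither rule matches)
</code></pre>
<p>The last case is worth knowing about: hosts that report <code>aarch64</code> are not rewritten by this expression, so it would need one more rule if such machines ever ran the script.</p>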
<h3 id="heading-3-cleanup">3 - Cleanup</h3>
<blockquote>
<p>Get rid of our temp files.</p>
</blockquote>
<p>Finally, since these config files will no longer be used (and in fact, are generated at deploy time) we want to ensure we clean up our <s>mess</s> files. To do so we use the cleanup function.</p>
<pre><code class="lang-yaml">    <span class="hljs-string">echo</span> <span class="hljs-string">"$file config check passed ✅"</span>

<span class="hljs-string">done</span>

<span class="hljs-string">cleanup</span>
</code></pre>
<p>Which looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769446623/d71401c4-df1e-4257-9af4-fa5d3e191cb6.png" alt class="image--center mx-auto" /></p>
<p>Where we tell it <strong>not</strong> to clean up the files if we're in debug mode (so we can examine them after the fact) or delete the directory otherwise.</p>
<p>Note the <code>trap &lt;function&gt; EXIT</code> which is known as an "exit trap". <a target="_blank" href="http://redsymbol.net/articles/bash-exit-traps/">This tells the bash script to always run this function whenever the script exits for <em>any reason.</em></a> This is great because it ensures that our config file directories will be deleted whenever the script ends or even if it errors out.</p>
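<p>Here is a small, self-contained sketch of that pattern. It is not the real <code>cleanup</code> function (that one is in the screenshot above); the temp directory, the messages and the <code>run_with_cleanup</code> wrapper are all made up for the demo. The body runs in a subshell so the exit trap can be seen firing without ending your own shell.</p>
<pre><code class="lang-bash">#!/usr/bin/env bash

# run_with_cleanup is a hypothetical stand-in for a whole script run.
# Usage: run_with_cleanup 1 (debug mode) or run_with_cleanup 0.
run_with_cleanup() {
  (
    debug="$1"
    workdir="$(mktemp -d)"

    cleanup() {
      if [ "$debug" = "1" ]; then
        echo "debug mode: keeping $workdir"
      else
        rm -rf "$workdir"
        echo "removed $workdir"
      fi
    }
    # Run cleanup whenever this (sub)shell exits, on success or failure.
    trap cleanup EXIT

    touch "$workdir/staging-config-0.yml"
    exit 3   # simulate a failing check; cleanup still runs
  )
}

run_with_cleanup 0 || echo "script exited with status $?"
</code></pre>
<p>Even though the body bails out with a non-zero status, the directory is removed first, and the original exit status is still what the caller sees.</p>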
<p>Finally, our whole script looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688769477438/6232e2bd-cabc-40dc-93b2-f22c6d6089d1.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-bonus">Bonus</h1>
<p>In trying to get my scripts uploaded, I noticed that our CI pipeline's linter kept on yelling at me for some bash-related things. Turns out that these were all coming from a linting step that used <strong>shellcheck.</strong></p>
<p>To fix these errors - I downloaded the <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=timonwong.shellcheck">shellcheck</a> VS Code Extension which is <strong>fantastic</strong> because, not only will it give you an explanation of the issue within your IDE, it will allow you to quickly fix most issues as well!</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Sharpening your Bash skills can greatly improve your ability to generally get things done as an engineer. Specifically in SRE - it helps with things like managing CI pipelines and working with tools like Open Telemetry Collectors. By breaking down tasks into manageable functions, using error handling, and leveraging helpful tools like <code>getopts</code>, you can create efficient and maintainable scripts. Additionally, incorporating linters like shellcheck and its VS Code Extension can help you quickly identify and fix errors, ensuring your scripts are reliable and robust.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Helm Templates and Utilizing YQ for YAML Parsing Mastery]]></title><description><![CDATA[This week I've been finishing up a task that involves updating our CI/CD pipeline to make use of a feature available in the newest release of the Open Telemetry Collector to validate configuration files. In doing so, I got a chance to get up close an...]]></description><link>https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery</link><guid isPermaLink="true">https://tratnayake.dev/understanding-helm-templates-and-utilizing-yq-for-yaml-parsing-mastery</guid><category><![CDATA[Helm]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[YAML]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 26 Jun 2023 03:04:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nquLTbqTPLc/upload/441b0a37a81b46965ab457b5f1c70c61.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week I've been finishing up a task that involves updating our CI/CD pipeline to make use of a feature available in the newest release of the Open Telemetry Collector to validate configuration files. In doing so, I got a chance to get up close and personal with Helm templating and see how the sausage (a generated K8s manifest) was made. In doing so, I also learned a cool trick with the tool <code>yq</code> which allowed me to extract a specific set of data from a YAML file.</p>
<p>Come along with me and learn how I learned to decipher the flow of data and variables in the Helm chart, and learn how I used <code>yq</code> to extract exactly the data I need from a pile of YAML.</p>
<hr />
<h2 id="heading-background"><strong>Background</strong></h2>
<p>My team runs a couple of open telemetry collectors (OTCs) to ingest a bunch of telemetry.</p>
<ol>
<li><p>These OTCs read their config from a <strong>configuration file</strong> before starting up and doing their job of collecting whatever they're meant to collect.</p>
</li>
<li><p>The configuration files are built (per environment) and passed in at deploy time by the CI/CD pipeline as the last step.</p>
<ol>
<li>The configuration files are built by Helm and are passed in as a Helm chart.</li>
</ol>
</li>
</ol>
<p><a target="_blank" href="https://opentelemetry.io/docs/collector/configuration/">Open Telemetry Collector Configuration files</a> look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687749862193/fcc3723c-6ff7-4684-b559-71684e31d0e7.png" alt class="image--center mx-auto" /></p>
<p>However, a smol hitch with this setup when used with Helm templating is that (1) the configuration files passed in at deploy-time are specific for the environment (i.e. staging might have a different config than public) and (2) we don't catch a <em>bad</em> config until it's too late (when it's deploy time).</p>
<h2 id="heading-goal"><strong>Goal</strong></h2>
<p>To update our CI/CD pipeline so that we can validate our configuration files to catch errors early, preferably on a PR whenever the configuration files (or underlying template files) are changed.</p>
<h3 id="heading-support">Support</h3>
<p>One thing that helps us on this quest was the (at the time) pending <code>v0.80</code> release of the Open Telemetry Collector which would bring support for the <code>validate</code> command to <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector/issues/4671">validate</a> configuration files.</p>
<p>In v0.80 - you could provide a config file to the <code>validate</code> command like so:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687749896468/46ceec4f-6667-427e-9fb0-3defa2779850.png" alt class="image--center mx-auto" /></p>
<p>Which will output a zero exit code (exit successfully) if the config file is valid.</p>
<h2 id="heading-mission">Mission</h2>
<p>Update the CI/CD pipeline to use the <code>validate</code> command and catch bad configuration files early. This could be done in two places:</p>
<ol>
<li><p>PR Builder on Code Changes</p>
</li>
<li><p>CD Pipeline on Infrastructure Deploys.</p>
</li>
</ol>
<h2 id="heading-problem">Problem:</h2>
<ol>
<li><p>While it's easy to use the <code>validate</code> command to check a single file, there turns out to be a <em>lot</em> of configuration files that we need to check. Think the <code>number of environments</code> * <code>number of zones per environment</code> and we're up to about 12 configuration files.</p>
</li>
<li><p>The configuration files are dynamically built using Helm as part of a Helm chart. This means (1) we need to <code>validate</code> using the same config that's provided at runtime and (2) we need to generate those files for the <code>validate</code> command as they don't exist on disk.</p>
</li>
</ol>
<h1 id="heading-a-quick-refresher">A Quick Refresher</h1>
<p>An Open Telemetry Collector config file simply looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687749862193/fcc3723c-6ff7-4684-b559-71684e31d0e7.png" alt class="image--center mx-auto" /></p>
<p>We just need to get the config file(s) so we can use it with the <code>validate</code> command for the Open Telemetry Collector binary - to determine if the config is good or not:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687749896468/46ceec4f-6667-427e-9fb0-3defa2779850.png" alt class="image--center mx-auto" /></p>
<p>Which exits with a zero exit code if the config file is valid ☑️ (and a non-zero one if it isn't).</p>
<h1 id="heading-okay-so-lets-get-these-config-files">Okay, so let's get these Config Files</h1>
<p>Well, since they're generated via Helm at runtime, we will need to use <code>helm template</code> command to generate them.</p>
<p>This is where I was at the edge of my comprehension - because I didn't understand how all the values were passed in and through the template. So I went on a side quest to learn how it all fits together.</p>
<h2 id="heading-its-all-about-the-flow">It's all about the flow</h2>
<p>With regards to helm charts and templating, the data flows like this:</p>
<ol>
<li><p>Environment-specific values from the environmental yaml files (i.e. <code>staging.yaml</code>) are fed into the Helm chart (i.e. <code>templates/*.yaml</code>)</p>
<ol>
<li><code>templates/*.yaml</code> also makes use of the <code>zonalbaseconfig.yaml</code> file.</li>
</ol>
</li>
<li><p>Helm then generates / outputs a set of YAML that can be fed to Kubernetes (a manifest) to apply.</p>
</li>
</ol>
<p>The environment files may look something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687750077793/ebdb87fe-f9a7-4a22-bdd8-74d719cab9e8.png" alt class="image--center mx-auto" /></p>
<p>Which then feeds into the template files. Specifically, <code>templates/collector.yaml</code> which defines the manifest for each collector.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687749981882/ac99c7bb-677e-4215-97ab-3361c5b9f00c.png" alt class="image--center mx-auto" /></p>
<p>And since our <code>staging.yaml</code> doesn't specify a <code>config:</code> block, the Helm template will make use of what's in the <code>$baseConfig</code> variable as the template. Note that we know from earlier up, <code>$baseConfig</code> reads from <code>zonalbaseconfig.yaml</code> which looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687750004550/d3ddb950-c9f4-4b73-9094-4c0e77ebe57a.png" alt class="image--center mx-auto" /></p>
<p>If you've been following the bouncing ball, you'll notice that the flow looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687748232684/ed4c2c51-79b6-4513-a82b-ed6183e3bd9e.png" alt class="image--center mx-auto" /></p>
<p>Now that we understand the flow, we know that the template that generates these files is <code>templates/collector.yaml</code>.</p>
<p>And so, we can see what Helm generates out by running:</p>
<p><code>helm template -f staging.yaml -s templates/collector.yaml . &gt; templated.yaml</code></p>
<p>Which gives us something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687750022591/989f3f3b-bfc3-4fc0-a102-87418931f1ee.png" alt class="image--center mx-auto" /></p>
<p>You'll notice that these are the entire manifests for <em>all</em> the zonal collectors we've specified in the environment file (i.e. <code>staging.yaml</code>) which means <em>in addition</em> to the config block - it also includes:</p>
<ul>
<li><p>The Service Accounts that our template probably generates further down in <code>collector.yaml</code>; and</p>
</li>
<li><p>The rest of the manifest which we don't need (like the metadata etc).</p>
</li>
</ul>
<p>So now we're in a better state - we have the data we need plus extra, but how do we refine it to get just the config blocks?</p>
<h1 id="heading-how-do-i-get-just-the-config-files">How do I get <em>just</em> the config files?</h1>
<p>This is where <code>yq</code> comes in handy.</p>
<blockquote>
<p>YQ is a powerful command-line tool for processing YAML files, similar to how JQ works with JSON. It allows users to query, filter, and manipulate YAML data easily and efficiently. With YQ, you can extract specific data, transform the structure, and even merge multiple YAML files.</p>
</blockquote>
<p>Since the <code>helm template</code> command outputs a single YAML stream containing multiple documents, one per "file" (you can see this with the <code>---</code> separator), I essentially need to do the following:</p>
<ul>
<li><p>In the whole output of files, grab the files that contain an internal value of <code>kind: OpenTelemetryCollector</code></p>
<ul>
<li>For each of those files, grab only the <code>spec.config</code> block. (These are what gets read in as config files at runtime).</li>
</ul>
</li>
<li><p>After grabbing all the config blocks, output them onto disk with some sort of special config so that we can then pass them into the <code>validate</code> command.</p>
</li>
</ul>
<p>After much trial and error, behold - the magic 1-liner command that allowed me to do all this:</p>
<p><code>helm template -f staging.yaml -s templates/collector.yaml . | yq 'select(.kind == "OpenTelemetryCollector").spec.config' -s '"staging-config-" + $index'</code></p>
<p>Where I now have <code>staging-config-0.yml</code> and <code>staging-config-1.yml</code> in my directory, that look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687750298263/a0811161-418d-4b71-80d1-66527f973b4c.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-next-steps">Next Steps</h1>
<p>Now, since I have all my config files available as <code>staging-config-0|1|2|3.yml</code> - I can easily feed them into the <code>validate</code> command of my Open Telemetry Collector.</p>
<p>The next steps will be updating our CI/CD steps to do so, which will probably be the easier portion of this task.</p>
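<p>As a rough sketch of what that step could look like: loop over the generated files, run a validator on each, and fail if any of them is rejected. Both function names are hypothetical, and the binary location <code>./dist/&lt;arch&gt;/otelcol</code> is an assumption for illustration; only the <code>validate --config</code> invocation comes from the collector itself.</p>
<pre><code class="lang-bash">#!/usr/bin/env bash

# Hypothetical wrapper; the binary path is an assumption, adjust to your layout.
otel_validate() {
  local arch
  arch="$(uname -m | sed 's/x86_64/amd64/;s/arm.*/arm64/')"
  "./dist/${arch}/otelcol" validate --config="$1"
}

# validate_all VALIDATOR FILE...
# Runs VALIDATOR on every FILE and returns 1 if any file was rejected.
validate_all() {
  local validator="$1"
  shift
  local file failed=0
  for file in "$@"; do
    if "$validator" "$file"; then
      echo "$file config check passed ✅"
    else
      echo "$file config check FAILED ❌"
      failed=1
    fi
  done
  return "$failed"
}

# Example (needs the collector binary to be present):
#   validate_all otel_validate staging-config-*.yml
</code></pre>
<p>Taking the validator as an argument keeps the loop easy to try out with a stand-in function before the real binary is wired in.</p>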
<h1 id="heading-conclusion">Conclusion</h1>
<p>In conclusion, using Helm templates and YQ can greatly improve the process of generating and validating Open Telemetry Collector configuration files in a CI/CD pipeline. By understanding the data flow in my Helm templates, and by leveraging YQ's powerful querying capabilities, I was able to extract the necessary config files and use the <code>validate</code> command to catch errors early, ensuring a smoother deployment process.</p>
]]></content:encoded></item><item><title><![CDATA[Speed Up Terraform Debugging Using Terraform Console]]></title><description><![CDATA[Want to debug terraform issues with quicker feedback and instant access to terraform state? Try ✨Terraform Console ✨ - the equivalent of the python shell, but for Terraform.
Background

What am I doing?

This week I've been working on setting up cana...]]></description><link>https://tratnayake.dev/speed-up-terraform-debugging-using-terraform-console</link><guid isPermaLink="true">https://tratnayake.dev/speed-up-terraform-debugging-using-terraform-console</guid><category><![CDATA[Terraform]]></category><category><![CDATA[terraformcli]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Fri, 16 Jun 2023 20:39:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/xbEVM6oJ1Fs/upload/be3672628a25e0b79ac497d5178fa35d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Want to debug terraform issues with quicker feedback and instant access to terraform state? Try ✨<a target="_blank" href="https://developer.hashicorp.com/terraform/cli/commands/console">Terraform Console</a> ✨ - the equivalent of the python shell, but for Terraform.</p>
<h1 id="heading-background">Background</h1>
<blockquote>
<p>What am I doing?</p>
</blockquote>
<p>This week I've been working on setting up canary analysis with Argo Rollouts for a service I'll refer to as: <code>my-service</code>. We do this by using Terraform and the <em>way</em> we do it requires having a local variable named <code>local.my-service_pools</code> containing a map of environments and their pools.</p>
<hr />
<h1 id="heading-the-problem">The Problem</h1>
<ol>
<li><p>The pools (zones) that <code>my-service</code> runs in for each environment (<code>production</code>, <code>staging</code> and <code>meta</code>) are read from multiple yaml files and stored in a variable named <code>local.my-service_pools</code></p>
</li>
<li><p><code>local.my-service_pools</code> is used by other terraform tooling and therefore must be correct as soon as it's created.</p>
</li>
<li><p>Due to the way that our infrastructure is set up, the variable requires that any pool named <code>us-central1-f</code> be named <code>default</code>, and therefore any mention of <code>us-central1-f</code> would need to be swapped for default in this map.</p>
</li>
</ol>
<p>To tackle this problem, I decided to first read the contents of the file into a temporary variable named <code>local.my-service_zonal_pools_from_yaml</code> which I <em>thought</em> simply looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686947037479/132b803d-774a-4f99-a958-39f58a8745ce.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-the-goal">The Goal</h1>
<p>To create a new variable named <code>local.my-service_pools</code> which uses a <code>for</code> loop to iterate through my temporary <code>local.my-service_zonal_pools_from_yaml</code> and while doing so: replace any mention of <code>us-central1-f</code> with <code>default</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686948079719/20d71744-7885-44c2-8f6f-0eff1b1f7a05.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-a-failed-fix">A Failed Fix</h1>
<p>I used a <code>for</code> loop to make a "copy" of the variable (Terraform doesn't support copying variables) by iterating through each environment, and then through each pool to replace <code>us-central1-f</code> with <code>default</code></p>
<p>I referred to this as my <em>"transformation"</em> variable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686946110352/b79ca755-76c8-44db-81f6-d9f6b6756de5.png" alt class="image--center mx-auto" /></p>
<p>Which <em>seemed</em> good but... was actually <strong>wrong ❌</strong>, because the downstream references to that variable kept on yelling at me:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686946134668/dd51c571-e874-4267-9693-c6f11549e51f.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-frustration">Frustration</h1>
<ol>
<li><p><strong>The error message doesn't tell me much.</strong> Yes, it's a list of strings, but <em>that's what I wanted isn't it?</em></p>
</li>
<li><p><strong>The Feedback loop is too long.</strong> Because this is the Terraform for our entire environment's infrastructure and because our Terraform runner (Spacelift) is constantly running plans from PR's across the organization - I have to wait a <strong>long</strong> time to see the results of my changes via <code>terraform plan</code></p>
<ol>
<li><p>On Spacelift, I'm competing with private workers to become available to run my plan; and</p>
</li>
<li><p>On my laptop - it takes 10+ minutes to build because I don't have the state cached (or, more specifically, the cache keeps on changing with every run on Spacelift).</p>
</li>
</ol>
</li>
</ol>
<h1 id="heading-research">Research</h1>
<p>I realized I was getting bitten on syntax / data massaging here - and first I'd need a better, faster, cheaper way to play with the Terraform for quick iteration and results.</p>
<p>And so off to Google I went.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686946709221/2463794d-a759-4237-827b-cbd1e859b651.jpeg" alt class="image--center mx-auto" /></p>
<h1 id="heading-enter-terraform-consolehttpsdeveloperhashicorpcomterraformclicommandsconsole">Enter: <a target="_blank" href="https://developer.hashicorp.com/terraform/cli/commands/console">terraform console</a></h1>
<blockquote>
<p>What does Terraform Console do? The terraform console command will read the Terraform configuration in the current working directory and the Terraform state file from the configured backend so that interpolations can be tested against both the values in the configuration and the state file.</p>
</blockquote>
<p>With <code>terraform console</code></p>
<ol>
<li>I could see the warning I was running into as soon as it started up. (It's my "early-warning" that the code was still incorrect)</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686946165253/c200bee8-aa4a-4550-a21d-b46111fedf37.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>I could inspect and see what <code>local.my-service_zonal_pools_from_yaml</code> actually looked like right now!</p>
</li>
<li><p>And then I could compare it against <code>local.my-service_pools</code> to see what my <em>transformation</em> variable instantiation was <em>actually</em> putting out!</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686947167884/98233967-deae-4fbc-8a6c-02b8434bf485.png" alt class="image--center mx-auto" /></p>
<p><strong>‼️ The actual problem was</strong> that: <code>my-service_zonal_pools_from_yaml</code> has each environment's zones in a <strong>toset()</strong> whereas <code>my-service_pools</code> has those zones in a <strong>tolist()</strong></p>
<h1 id="heading-solution">Solution</h1>
<p>✨Surround the <code>for</code> loop with a <code>toset()</code>✨</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686946235500/789fea53-c693-4ac9-8d7e-8ad4c0137f2f.png" alt class="image--center mx-auto" /></p>
<p>Now when I run <code>terraform console</code> there are no errors, which is my first sign that my code is correct.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686947199830/70547594-193c-4282-adbf-606f22455773.png" alt class="image--center mx-auto" /></p>
<p>And, when I compare the two variables in the console, I see that they are both <code>toset()</code>'s ✅:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686947095731/372f9554-25ff-4374-9340-fccde760cea3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-caution">Caution</h2>
<p>While you're in the console, you acquire a state lock 🔒 that could prevent other people from using Terraform.</p>
<pre><code class="lang-yaml"><span class="hljs-string">Releasing</span> <span class="hljs-string">state</span> <span class="hljs-string">lock.</span> <span class="hljs-string">This</span> <span class="hljs-string">may</span> <span class="hljs-string">take</span> <span class="hljs-string">a</span> <span class="hljs-string">few</span> <span class="hljs-string">moments...</span>
</code></pre>
<blockquote>
<p>The console holds a <a target="_blank" href="https://developer.hashicorp.com/terraform/language/state/locking">lock on the state</a>, and you will not be able to use the console while performing other actions that modify state.</p>
</blockquote>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In conclusion, <a target="_blank" href="https://developer.hashicorp.com/terraform/cli/commands/console">terraform console</a> is a powerful tool for debugging and fine-tuning Terraform code, providing quick feedback and reducing the time spent waiting for <code>terraform plan</code> results. It is especially useful for working with loops, experimenting with new syntax, and resolving structural issues. However, be cautious when using the console as it acquires a state lock, which may prevent others from modifying the state.</p>
]]></content:encoded></item><item><title><![CDATA[Cheating with Terraform State Show]]></title><description><![CDATA[Background
One of the cards I took on last week, was the task of adding a low-no-data alert for one of our services. This was an Action Item from an incident post-mortem when we were manually alerted to one of our endpoints being down. Having this al...]]></description><link>https://tratnayake.dev/cheating-with-terraform-state-show</link><guid isPermaLink="true">https://tratnayake.dev/cheating-with-terraform-state-show</guid><category><![CDATA[Terraform]]></category><category><![CDATA[Infrastructure as code]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 12 Jun 2023 04:35:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_c70Nhh6p44/upload/161dc32a0aced07948339d84b71a2c8a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-background">Background</h1>
<p>One of the cards I took on last week was the task of adding a <code>low-no-data</code> alert for one of our services. This was an Action Item from an incident post-mortem when we were <em>manually</em> alerted to one of our endpoints being down. Having this alert would ensure that we are notified more quickly should this happen again.</p>
<p>Using Lightstep, it was pretty easy to create the alert based on a metric that we had. (We use a different metric IRL for traffic, but for the sake of this tutorial, I'm using <code>scrape_samples_scraped</code>)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686540086965/86d4930a-82d4-4da3-acb4-f7b0bb4e8457.png" alt class="image--center mx-auto" /></p>
<p>But I wasn't quite done yet.</p>
<ul>
<li><p>What about making the same alert in all the other <code>environments</code>? <em>I would need to recreate this alert for all of those environments by hand.</em></p>
</li>
<li><p>What happens if we have the <em>worst day ever</em> ™️ and all of our alerts are gone? <em>I would need to remember the queries and details to recreate them.</em></p>
</li>
</ul>
<p>This is where Infrastructure-as-Code comes in handy, and specifically Terraform, which allows us to codify resources such as <code>lightstep_alert</code>s. By using Terraform, we can:</p>
<ol>
<li><p>Focus on being DRY (Don't Repeat Yourself) - Create one alert and programmatically generate alerts in our other environments.</p>
</li>
<li><p>Have our alerts codified - Be able to recover from a disaster and restore to our current state.</p>
</li>
</ol>
<p>And so I set out to write the Terraform with a file that looked like this:</p>
<pre><code class="lang-bash">resource <span class="hljs-string">"lightstep_alert"</span> <span class="hljs-string">"low-no-requests-api-terraform"</span> {

}
</code></pre>
<p>Staring at this blank resource stanza made me think:</p>
<blockquote>
<p>But wait...I already created the alert in the UI..do I now need to flip back and forth between the Terraform provider documentation and re-create the alert in Terraform?</p>
</blockquote>
<p>This would be like taking a sheet of tracing paper and attempting to redraw a reference image onto a different piece of paper.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686544369918/4c38ecd6-ef3b-4142-8476-77057369d8a1.png" alt class="image--center mx-auto" /></p>
<p>I <em>thought</em> this was the only way until I learned a very valuable <strong>pattern</strong> from one of my very wise coworkers.</p>
<blockquote>
<p>[You already have the alert,] Why don't you just <code>import</code> the alert and then use <code>state show</code> to get the code for it?</p>
</blockquote>
<p>I... had never thought of that. But I tried it, it worked, and it was a huge time saver.</p>
<p>Here's the tutorial.</p>
<h1 id="heading-tutorial">Tutorial</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686546414691/52076bd8-c372-43e3-aeae-11e13b43cd51.png" alt="A helpful Diagram" class="image--center mx-auto" /></p>
<ol>
<li><h3 id="heading-set-up-teraform-provider">Set up Terraform Provider</h3>
</li>
</ol>
<p>First up, I made sure my Terraform provider was set up to use my Lightstep <code>organization</code> and an <code>api_key</code> with a minimum level of <code>Member</code>.</p>
<pre><code class="lang-bash">provider <span class="hljs-string">"lightstep"</span> {
  api_key         = var.ls_api_key
  organization    = <span class="hljs-string">"LightStep"</span>
}
</code></pre>
<ol>
<li><h2 id="heading-get-id-of-previously-created-resource">Get ID of previously created Resource</h2>
</li>
</ol>
<p>To import the <code>lightstep_alert</code>, I needed to grab the ID of the existing alert. (This can be easily gathered from the browser, see below).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686540223516/9bf7e101-2eb8-465b-8475-ff7606f03452.png" alt class="image--center mx-auto" /></p>
<ol>
<li><h2 id="heading-import-the-resource">Import the Resource</h2>
</li>
</ol>
<p>With the alert ID in hand, I simply had to run a <code>terraform import lightstep_alert.&lt;name_of_alert&gt; &lt;project&gt;.&lt;id&gt;</code></p>
<ul>
<li>In my example it was: <code>terraform import lightstep_alert.low-no-requests-api dev-tratnayake.mXgC4cG1f</code></li>
</ul>
<pre><code class="lang-bash">terraform import lightstep_alert.low-no-requests-api dev-tratnayake.mXgC4cG1f
lightstep_alert.low-no-requests-api: Importing from ID <span class="hljs-string">"dev-tratnayake.mXgC4cG1f"</span>...
lightstep_alert.low-no-requests-api: Import prepared!
  Prepared lightstep_alert <span class="hljs-keyword">for</span> import
lightstep_alert.low-no-requests-api: Refreshing state... [id=mXgC4cG1f]

Import successful!

The resources that were imported are shown above. These resources are now <span class="hljs-keyword">in</span>
your Terraform state and will henceforth be managed by Terraform.
</code></pre>
<ol>
<li><h2 id="heading-show-the-imported-resource">Show the imported Resource</h2>
</li>
</ol>
<p>Now here's the <strong>magic</strong> part where you can use Terraform like a 🔬 microscope 🔬 to break down an existing resource into its Terraform. With the resource imported into Terraform state, I could simply use <code>terraform state show</code> to output the alert in Terraform.</p>
<pre><code class="lang-bash">terraform state show lightstep_alert.low-no-requests-api
<span class="hljs-comment"># lightstep_alert.low-no-requests-api:</span>
resource <span class="hljs-string">"lightstep_alert"</span> <span class="hljs-string">"low-no-requests-api"</span> {
    id           = <span class="hljs-string">"mXgC4cG1f"</span>
    name         = <span class="hljs-string">"Low-no-data-alert (UI)"</span>
    project_name = <span class="hljs-string">"dev-tratnayake"</span>
    <span class="hljs-built_in">type</span>         = <span class="hljs-string">"metric_alert"</span>

    expression {
        is_multi   = <span class="hljs-literal">false</span>
        is_no_data = <span class="hljs-literal">false</span>
        operand    = <span class="hljs-string">"below"</span>

        thresholds {
            critical = <span class="hljs-string">"1"</span>
            warning  = <span class="hljs-string">"5000"</span>
        }
    }

    query {
        display        = <span class="hljs-string">"line"</span>
        hidden         = <span class="hljs-literal">false</span>
        hidden_queries = {}
        query_name     = <span class="hljs-string">"a"</span>
        query_string   = <span class="hljs-string">"metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"</span>
    }
}
</code></pre>
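<p>A possible shortcut, assuming Terraform 1.5 or newer: the import can be declared in configuration with an <code>import</code> block, and <code>terraform plan -generate-config-out=generated.tf</code> asks Terraform to write the resource stanza for you instead of copying it from <code>state show</code>. The sketch below reuses the address and ID from above; the file name <code>generated.tf</code> is an arbitrary choice, config generation is marked experimental, and how clean the generated output is for the Lightstep provider is an assumption.</p>
<pre><code class="lang-bash"># Declares the import in configuration (Terraform 1.5+) instead of running `terraform import`.
import {
  to = lightstep_alert.low-no-requests-api   # resource address to import into
  id = "dev-tratnayake.mXgC4cG1f"            # same project.id format used with `terraform import`
}
</code></pre>
<p>The generated stanza would still need the same review as the <code>state show</code> output before it is applied.</p>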
<ol>
<li><h2 id="heading-use-the-code-from-the-imported-resource">Use the Code from the imported Resource</h2>
</li>
</ol>
<p>With the alert now codified, I could simply copy-pasta it as a new alert in my Terraform file as follows (making sure to strip out the <code>id</code> and <code>type</code>, as those are computed when the Terraform is applied).</p>
<pre><code class="lang-bash">resource <span class="hljs-string">"lightstep_alert"</span> <span class="hljs-string">"low-no-requests-api-terraform"</span> {
    <span class="hljs-comment"># id           = "mXgC4cG1f"</span>
    name         = <span class="hljs-string">"Low-no-data-alert (Terraform)"</span>
    project_name = <span class="hljs-string">"dev-tratnayake"</span>
    <span class="hljs-comment"># type         = "metric_alert"</span>

[... Rest of the Terraform from `terraform state show` here ... ]
</code></pre>
<ol>
<li><h2 id="heading-apply-the-terraform-to-create-the-resource">Apply the Terraform to create the Resource</h2>
</li>
</ol>
<p>Finally, I ran a <code>terraform apply</code> which would:</p>
<ol>
<li><p>Create the <em>new</em> alert (<code>low-no-requests-api-terraform</code>); and</p>
</li>
<li><p>Delete the <em>old</em> alert created via the UI (<code>low-no-requests-api</code>), because it is in the Terraform state but not in the Terraform file.</p>
</li>
</ol>
<pre><code class="lang-bash">terraform apply
lightstep_alert.low-no-requests-api: Refreshing state... [id=mXgC4cG1f]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  <span class="hljs-comment"># lightstep_alert.low-no-requests-api will be destroyed</span>
  <span class="hljs-comment"># (because lightstep_alert.low-no-requests-api is not in configuration)</span>
  - resource <span class="hljs-string">"lightstep_alert"</span> <span class="hljs-string">"low-no-requests-api"</span> {
      - id           = <span class="hljs-string">"mXgC4cG1f"</span> -&gt; null
      - name         = <span class="hljs-string">"Low-no-data-alert (UI)"</span> -&gt; null
      - project_name = <span class="hljs-string">"dev-tratnayake"</span> -&gt; null
      - <span class="hljs-built_in">type</span>         = <span class="hljs-string">"metric_alert"</span> -&gt; null

      - expression {
          - is_multi   = <span class="hljs-literal">false</span> -&gt; null
          - is_no_data = <span class="hljs-literal">false</span> -&gt; null
          - operand    = <span class="hljs-string">"below"</span> -&gt; null

          - thresholds {
              - critical = <span class="hljs-string">"1"</span> -&gt; null
              - warning  = <span class="hljs-string">"5000"</span> -&gt; null
            }
        }

      - query {
          - display        = <span class="hljs-string">"line"</span> -&gt; null
          - hidden         = <span class="hljs-literal">false</span> -&gt; null
          - hidden_queries = {} -&gt; null
          - query_name     = <span class="hljs-string">"a"</span> -&gt; null
          - query_string   = <span class="hljs-string">"metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"</span> -&gt; null
        }
    }

  <span class="hljs-comment"># lightstep_alert.low-no-requests-api-terraform will be created</span>
  + resource <span class="hljs-string">"lightstep_alert"</span> <span class="hljs-string">"low-no-requests-api-terraform"</span> {
      + id           = (known after apply)
      + name         = <span class="hljs-string">"Low-no-data-alert (Terraform)"</span>
      + project_name = <span class="hljs-string">"dev-tratnayake"</span>
      + <span class="hljs-built_in">type</span>         = (known after apply)

      + expression {
          + is_multi   = <span class="hljs-literal">false</span>
          + is_no_data = <span class="hljs-literal">false</span>
          + operand    = <span class="hljs-string">"below"</span>

          + thresholds {
              + critical = <span class="hljs-string">"1"</span>
              + warning  = <span class="hljs-string">"5000"</span>
            }
        }

      + query {
          + display      = <span class="hljs-string">"line"</span>
          + hidden       = <span class="hljs-literal">false</span>
          + query_name   = <span class="hljs-string">"a"</span>
          + query_string = <span class="hljs-string">"metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"</span>
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only <span class="hljs-string">'yes'</span> will be accepted to approve.

  Enter a value: yes

lightstep_alert.low-no-requests-api: Destroying... [id=mXgC4cG1f]
lightstep_alert.low-no-requests-api-terraform: Creating...
lightstep_alert.low-no-requests-api: Destruction complete after 1s
lightstep_alert.low-no-requests-api-terraform: Creation complete after 2s [id=mw5HVFtch]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed
</code></pre>
<p>The end result is the alert that I created in the UI, now codified and managed by Terraform ☑️</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1686541085705/69a8bfa7-bb68-42d0-951d-61057fa55a13.png" alt class="image--center mx-auto" /></p>
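<p>With the alert codified, the DRY goal from the Background can be sketched with <code>for_each</code>. This is a minimal sketch, not the configuration used above: the <code>alert_projects</code> variable is hypothetical, and it assumes each environment maps to its own Lightstep project. The attribute values are copied from the plan output.</p>
<pre><code class="lang-bash"># Hypothetical variable: the Lightstep projects (one per environment) that should get the alert.
variable "alert_projects" {
  type    = set(string)
  default = ["dev-tratnayake"] # add the other environments' project names here
}

resource "lightstep_alert" "low-no-requests-api-terraform" {
  # One alert instance is created per project name in the set.
  for_each = var.alert_projects

  name         = "Low-no-data-alert (Terraform)"
  project_name = each.value

  expression {
    is_multi   = false
    is_no_data = false
    operand    = "below"

    thresholds {
      critical = "1"
      warning  = "5000"
    }
  }

  query {
    display      = "line"
    hidden       = false
    query_name   = "a"
    query_string = "metric scrape_samples_scraped | filter (job == \"apiserver\") | latest | group_by [\"job\"], sum"
  }
}
</code></pre>
<p>Adding <code>for_each</code> to a resource that already exists changes its address, so the <code>terraform plan</code> output is worth reading closely before applying.</p>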
<h1 id="heading-conclusion">Conclusion</h1>
<p>There are many times that Terraform is <em>not</em> the first thing we reach for when creating a resource. Often we reach for a GUI or a specialized tool to create resources, especially during rapid prototyping. Using <code>terraform import</code> and <code>terraform state show</code> allows you to keep that momentum by using Terraform to codify the resources you've already created (in a more user-friendly tool), which can then easily be modified to fit your needs.</p>
<p>Doing this is a huge time-saver that almost felt like cheating, and I will definitely be using this in my future Terraforming travels 🚀</p>
]]></content:encoded></item><item><title><![CDATA[🔐 How-To Securely Work With Secrets During Development]]></title><description><![CDATA[If you're working on any software projects that require talking to other services, chances are that you probably have to make use of secrets. The most common type of secrets you may have run into -- are passwords and API keys.
There are others as wel...]]></description><link>https://tratnayake.dev/how-to-securely-work-with-secrets-during-development</link><guid isPermaLink="true">https://tratnayake.dev/how-to-securely-work-with-secrets-during-development</guid><category><![CDATA[Security]]></category><category><![CDATA[secrets]]></category><category><![CDATA[development]]></category><category><![CDATA[#howtos]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 27 Feb 2023 00:34:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/q7h8LVeUgFU/upload/227824f26ac63cb582b0351e294692b3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you're working on <em>any</em> software projects that require talking to other services, chances are that you probably have to make use of <strong>secrets</strong>. The most common type of secrets you may have run into -- are passwords and API keys.</p>
<p>There are others as well, and <a target="_blank" href="https://www.cyberark.com/what-is/secrets-management/">Cyberark</a> does a good job of defining a secret as a:</p>
<blockquote>
<p>[...] private piece of information that acts as a key to unlock protected resources or sensitive information in tools, <a target="_blank" href="https://www.cyberark.com/solutions/digital-transformation/business-critical-applications/"><strong>applications</strong></a>, containers, DevOps and <a target="_blank" href="https://www.cyberark.com/solutions/digital-transformation/cloud-virtualization-security/"><strong>cloud-native environments</strong></a>.</p>
<p>Some of the most common types of secrets include:</p>
<ul>
<li><p>Privileged account credentials</p>
</li>
<li><p>Passwords</p>
</li>
<li><p>Certificates</p>
</li>
<li><p>SSH keys</p>
</li>
<li><p>API keys</p>
</li>
<li><p>Encryption keys</p>
</li>
</ul>
</blockquote>
<p>As you may have learned, <em>perhaps the hard way through a security incident,</em> secrets can't be treated like just any other type of data. Because they grant access to resources, they must be handled with care and with security in mind.</p>
<h2 id="heading-how-do-you-store-secrets-during-development">🧐 How Do You Store Secrets During Development?</h2>
<p>If you've googled this question, you may have run into a couple of articles that suggest the following options:</p>
<ol>
<li><h3 id="heading-provide-secrets-as-shell-variables"><strong>🐚 Provide Secrets as Shell Variables.</strong></h3>
</li>
</ol>
<p>The assumption is that your code (i.e. <code>run-server.sh</code>) is set up to look for a key as a shell variable. For example:</p>
<pre><code class="lang-bash">cat run-server.sh
<span class="hljs-comment">#! /bin/bash</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"OUTPUT: API Key is <span class="hljs-variable">${API_KEY}</span>"</span>
</code></pre>
<p>In this approach, you can provide the secret to your program at run-time by running <code>API_KEY=&lt;my-api-key&gt; ./run-server.sh</code>, which provides the following output:</p>
<pre><code class="lang-bash">API_KEY=my-api-key ./run-server.sh                                                                                                      
OUTPUT: API Key is my-api-key
</code></pre>
<p>❗️The drawback of this approach is that your key has now been included in plaintext, within your shell history. If an attacker were to ever compromise your machine, the secrets would be available by running a simple <code>history</code> command.</p>
<pre><code class="lang-bash">1815  API_KEY=my-api-key ./run-server.sh
</code></pre>
<ol>
<li><h3 id="heading-provide-secrets-as-environment-variables"><strong>🌲 Provide Secrets as Environment Variables</strong></h3>
</li>
</ol>
<p>With this approach, we do a bit better by setting the secret as an environment variable, which our program can read from.</p>
<p>Setting the actual environment variable can be done in a number of ways, such as reading from a file on the system. (i.e. <code>api-key-secret.txt</code>)</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> API_KEY=$(cat api-key-secret.txt)
./run-server.sh
OUTPUT: API Key is my-env-var-api-key
</code></pre>
<ul>
<li><p>While the secret is no longer printed out to shell history ✅ , the secret is contained within a file that must now be secured appropriately ❗️.</p>
</li>
<li><p>You need to ensure that the file containing the secret key isn't accidentally checked into version control (e.g. GitHub)❗️.</p>
</li>
</ul>
<ol>
<li><h3 id="heading-use-a-env-file">📁 Use A <code>.env</code> File</h3>
</li>
</ol>
<p>Some programming languages recommend the pattern of using a dotenv file, where programs can be instructed to look up values from a specific file (e.g. <code>development.env</code>). However, this approach has the same drawbacks as above: the file holds the secret in plaintext and must be kept out of version control.</p>
<hr />
<h1 id="heading-whats-a-better-way-to-store-secrets">🤔 What's A Better Way To Store Secrets</h1>
<p>A decent number of moons ago when I worked as a Security Engineer doing ✨ <em>security</em> ✨ things, one of the items in our portfolio was a Secret Management system named <a target="_blank" href="https://www.conjur.org/">Conjur</a> (coincidentally, also built by Cyberark).</p>
<p>As an enterprise-grade secret management solution, it had a lot of features such as audit logging and access control policies, but one of the coolest features was the ability to do <strong>dynamic secret retrieval</strong>.</p>
<p>Specifically, it meant that after engineers checked secrets in to Conjur, they could simply (1) refer to their secrets by their secret reference <strong>paths</strong> and (2) use the Conjur CLI to "wrap" the invocation of their programs, to pull those secrets <strong>at run-time.</strong></p>
<p>I remember thinking that this was super cool, and wished I could do that for personal projects with my existing secret manager (1PW). Unfortunately, the main disqualifier for that at the time, outside of licensing costs, was the requirement to run and <strong>secure</strong> your own secrets server which wasn't really feasible for local development.</p>
<p>Well, fast forward about 5 years and things have now changed!</p>
<hr />
<h1 id="heading-whats-a-better-way-to-store-secrets-in-development">🤨 What's A Better Way To Store Secrets <em>in Development</em>?</h1>
<p>For me, the "development phase" in personal projects is the phase where I want to <strong>rapidly prototype</strong> my ideas. I don't want to get bogged down by tedious tasks, which is what secret management <em>used</em> to be. I could move fast by using env or shell variables, but I was only a single <code>.gitignore</code> file mistake away from checking in the contents to version control.</p>
<p>I'd always hoped for a better way and thankfully a couple of weeks ago - I came across this <em>fantastic</em> blog post by 1Password when looking into how to access secrets programmatically.</p>
<p><a target="_blank" href="https://blog.1password.com/1password-cli-2_0/">https://blog.1password.com/1password-cli-2_0/</a></p>
<p>This blog post goes into using the 1Password CLI to essentially achieve what I wanted to do those years ago. It enables developers to:</p>
<ol>
<li><p>Add secrets to their password vault</p>
</li>
<li><p>Get the references to those secrets</p>
</li>
<li><p>Fetch secrets dynamically at runtime.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1677457994756/27148b8a-a83c-4a79-9a08-7ade13e7cdc8.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-how-does-it-work">How does it work?</h2>
<ol>
<li><p>Download &amp; install the <a target="_blank" href="https://developer.1password.com/docs/cli/get-started/#install">1Password CLI + 1Password 8</a></p>
</li>
<li><p>Add your secret(s) into 1Password. In my case my secret value for <code>my-api-key</code> is: <code>secret-key</code></p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1677455281805/87e32e12-40fc-4be9-85cb-997f21683d28.png" alt class="image--center mx-auto" /></p>
<ol>
<li>Grab the references for the secrets you're interested in.</li>
</ol>
<pre><code class="lang-bash">op item get my-api-key --format json                                                                                                  
{
  <span class="hljs-string">"id"</span>: <span class="hljs-string">"&lt;&lt;REDACTED&gt;&gt;"</span>,
  <span class="hljs-string">"title"</span>: <span class="hljs-string">"my-api-key"</span>,
  <span class="hljs-string">"version"</span>: 1,
  <span class="hljs-string">"vault"</span>: {
    <span class="hljs-string">"id"</span>: <span class="hljs-string">"&lt;&lt;REDACTED&gt;&gt;"</span>,
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"Blog-post-development"</span>
  },
  <span class="hljs-string">"category"</span>: <span class="hljs-string">"API_CREDENTIAL"</span>,
  <span class="hljs-string">"last_edited_by"</span>: <span class="hljs-string">"[...]"</span>,
  <span class="hljs-string">"created_at"</span>: <span class="hljs-string">"2023-02-26T23:14:43Z"</span>,
  <span class="hljs-string">"updated_at"</span>: <span class="hljs-string">"2023-02-26T23:14:43Z"</span>,
  <span class="hljs-string">"fields"</span>: [
    [...]
    {
      <span class="hljs-string">"id"</span>: <span class="hljs-string">"credential"</span>,
      <span class="hljs-string">"type"</span>: <span class="hljs-string">"CONCEALED"</span>,
      <span class="hljs-string">"label"</span>: <span class="hljs-string">"credential"</span>,
      <span class="hljs-string">"value"</span>: <span class="hljs-string">"secret-key"</span>,
      <span class="hljs-string">"reference"</span>: <span class="hljs-string">"op://Blog-post-development/my-api-key/credential"</span>
    },
</code></pre>
<p>Which in this case is: <code>op://Blog-post-development/my-api-key/credential</code></p>
<ol>
<li>Include the secret reference and wrap your execution command with <code>op run</code></li>
</ol>
<p>Before, I used to have a Makefile to run my server:</p>
<pre><code class="lang-bash">cat Makefile                                                                                                                            
run:
    ./run-server.sh
</code></pre>
<p>Now I simply wrap the <code>run</code> target with <code>op run</code> and my secret reference as follows:</p>
<pre><code class="lang-bash">cat Makefile                                                                                                                            
run:
    API_KEY=<span class="hljs-string">"op://Blog-post-development/my-api-key/credential"</span> \
    op run ./run-server.sh
</code></pre>
<p>When I execute <code>make run</code>, it will first:</p>
<ul>
<li>Prompt me for a fingerprint (via Touch ID on Mac, or my master password everywhere else); and</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1677454676156/d85e5814-4f86-480d-b065-39898029bc6a.png" alt class="image--center mx-auto" /></p>
<ul>
<li>Retrieve the secret for use in the command.</li>
</ul>
<pre><code class="lang-bash">make run                                                                                                                                
API_KEY=<span class="hljs-string">"op://Blog-post-development/my-api-key/credential"</span> \
    op run ./run-server.sh
OUTPUT: API Key is &lt;concealed by 1Password&gt;
</code></pre>
<p>Note that 1PW's CLI is smart / safe enough to conceal the secret in the output. If you want to override this and display the secret, you can run the same command with the <code>--no-masking</code> flag.</p>
<p>For example:</p>
<pre><code class="lang-bash">cat Makefile                                                                                                                            
run:
    API_KEY=<span class="hljs-string">"op://Blog-post-development/my-api-key/credential"</span> \
    op run --no-masking ./run-server.sh

make run                                                                                                                                
API_KEY=<span class="hljs-string">"op://Blog-post-development/my-api-key/credential"</span> \
    op run --no-masking ./run-server.sh
OUTPUT: API Key is secret-key
</code></pre>
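<p>If you have more than one secret, the references can also live in a dotenv-style file instead of the Makefile. This is a minimal sketch: the file name <code>development.env</code> is an assumption, and the reference is the one retrieved earlier. Because the file only contains references and not the secret values themselves, it is much less sensitive than a regular <code>.env</code> file. It would be passed to the CLI with <code>op run --env-file="./development.env" -- ./run-server.sh</code>.</p>
<pre><code class="lang-bash"># development.env (hypothetical file name): holds secret references, not secret values.
# `op run --env-file` resolves each op:// reference and exports it to the wrapped command.
API_KEY="op://Blog-post-development/my-api-key/credential"
</code></pre>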
<h1 id="heading-why-is-this-better">Why is this better?</h1>
<ol>
<li><p><strong>Secrets are stored and retrieved from a single place</strong>. <strong>✅</strong> The single source of truth for your secret's value is in 1Password. This means that if you need to read this secret from multiple places, you only need to update it once. It also means that if your secret gets "popped", you only need to change or "rotate" it in one place.</p>
</li>
<li><p><strong>Uses a password or biometric to authenticate access.</strong> <strong>✅</strong> If you're using a Mac, you'll be prompted for Touch ID. Anywhere else, you'll be prompted for your master password.</p>
</li>
<li><p><strong>Secrets are not stored in code. ✅</strong> You don't have to worry about accidentally checking in secrets to version control (e.g. GitHub).</p>
</li>
<li><p><strong>You don't have to worry about the security of where your secrets are stored.</strong> <strong>✅</strong> That's managed by an organization that specializes in secret storage and has a vested interest in the security of your secrets (1PW).</p>
</li>
<li><p><strong>You don't need to do any other lift ✅</strong> to get this working. There's no need to set up IAM with AWS or GCP; it just uses 1Password.</p>
</li>
</ol>
<h2 id="heading-further-extensibility">Further Extensibility</h2>
<ul>
<li><p>1Password can group secrets into <strong>vaults</strong>, which means you can also extend this to your colleagues &amp; partners. When you add a secret to a vault and add your colleagues as members of that vault, they can use the secret in the same manner from their own 1Password accounts.</p>
</li>
<li><p>1Password also has other <a target="_blank" href="https://developer.1password.com/docs/cli/shell-plugins">shell-plugins</a> to securely authenticate with services right from your shell.</p>
</li>
<li><p>There's also a way to integrate this with CI/CD.</p>
</li>
</ul>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Overall, I am <strong>extremely</strong> excited to see that these sorts of features are slowly trickling down and becoming accessible for more use cases. The more we do to make security "easier", the better software becomes for all.</p>
]]></content:encoded></item><item><title><![CDATA[Tutorial Notebook: A simple CRUD app with Go]]></title><description><![CDATA[Tutorials Notebook is a blog series where I write-up my thoughts & lessons learned after completing a tutorial. A large credit should go to the authors who created the tutorials in the first place.

Tutorial: https://codewithmukesh.com/blog/implement...]]></description><link>https://tratnayake.dev/tutorial-notebook-a-simple-crud-app-with-go</link><guid isPermaLink="true">https://tratnayake.dev/tutorial-notebook-a-simple-crud-app-with-go</guid><category><![CDATA[golang]]></category><category><![CDATA[crud]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Fri, 03 Feb 2023 17:12:25 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>Tutorials Notebook is a blog series where I write-up my thoughts &amp; lessons learned after completing a tutorial. A large credit should go to the authors who created the tutorials in the first place.</p>
</blockquote>
<p>Tutorial: <a target="_blank" href="https://codewithmukesh.com/blog/implementing-crud-in-golang-rest-api/">https://codewithmukesh.com/blog/implementing-crud-in-golang-rest-api/</a> by <a class="user-mention" href="https://hashnode.com/@iammukeshm">Mukesh Murugan</a></p>
<hr />
<p>Lately, a couple of things have been motivating me to learn and get better with GoLang.</p>
<ol>
<li><p>There's a project that I've wanted to build for a friend since last year;</p>
</li>
<li><p>I'm working more and more in our codebase at work, which is all written in Go.</p>
</li>
</ol>
<p>So I figured I'd start from scratch and work towards the goal of building that app for my friend.</p>
<p>Breaking down that task - one thing I'll need to do is create an API that can handle incoming requests. Today, I completed a tutorial to create a very basic app that would enable a user to Create, Retrieve, Update and Delete (CRUD) a product from a database.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1675444193068/01fb0c98-aa01-4ba3-87e0-c62f5848f8d1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-things-i-learned">Things I Learned</h2>
<ol>
<li><p>MySQL. It'd been a while since I'd created a database. Helpful commands on mac:</p>
<ul>
<li><p><code>brew install mysql</code> - Install the MySQL server on mac.</p>
</li>
<li><p><code>brew services start mysql</code> - Start the MySQL server</p>
</li>
<li><p><code>mysql -u root</code> - Login with username <code>root</code> and no password (default)</p>
</li>
<li><p><code>create database &lt;name&gt;;</code></p>
</li>
<li><p><code>use &lt;name&gt;;</code></p>
</li>
<li><p><code>show tables;</code></p>
</li>
</ul>
</li>
<li><p>VS Code Shortcut: Typing in <code>hand</code> (+ enter) in VS Code will automatically create an HTTP response handler for you.</p>
</li>
<li><p>Go Pointers</p>
</li>
</ol>
<ul>
<li>You can define the shape of an object with a struct. In this example, a <code>Product</code> consists of 4 pieces of information.</li>
</ul>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Product <span class="hljs-keyword">struct</span> {
    ID          <span class="hljs-keyword">uint</span>    <span class="hljs-string">`json:"id"`</span>
    Name        <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"name"`</span>
    Price       <span class="hljs-keyword">float64</span> <span class="hljs-string">`json:"price"`</span>
    Description <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"description"`</span>
}
</code></pre>
<ul>
<li>A pointer to that struct is used in this HTTP API handler for <code>POST</code> ing (creating) products.</li>
</ul>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">CreateProduct</span><span class="hljs-params">(w http.ResponseWriter, r *http.Request)</span></span> {
    w.Header().Set(<span class="hljs-string">"Content-Type"</span>, <span class="hljs-string">"application/json"</span>)
    <span class="hljs-keyword">var</span> product entities.Product
    json.NewDecoder(r.Body).Decode(&amp;product)
    database.Instance.Create(&amp;product)
    json.NewEncoder(w).Encode(product)
}
</code></pre>
<p>Where the important lines are:</p>
<p><code>var product entities.Product</code> - this line creates an "empty" product in which every field holds its zero value (<code>0</code> for the numeric fields, <code>""</code> for the strings) as defined in the struct.</p>
<p><code>json.NewDecoder(r.Body).Decode(&amp;product)</code> - In this line, we use the memory address of variable <code>product</code> as denoted by the <code>&amp;</code> to "fill" it with relevant data.</p>
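<p>To see the pointer doing its job outside of an HTTP handler, here is a minimal, self-contained sketch. The <code>decodeProduct</code> helper and the sample JSON are assumptions made for illustration and are not part of the tutorial; <code>new(Product)</code> returns the same kind of pointer that <code>&amp;product</code> does. The sketch also checks the error returned by <code>Decode</code>, which the handler above ignores.</p>
<pre><code class="lang-go">package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Product mirrors the struct from the tutorial.
type Product struct {
	ID          uint    `json:"id"`
	Name        string  `json:"name"`
	Price       float64 `json:"price"`
	Description string  `json:"description"`
}

// decodeProduct is a hypothetical helper (not part of the tutorial).
// new(Product) allocates a zero-valued Product and returns a pointer to it,
// so Decode can write the parsed fields into that same memory.
func decodeProduct(body io.Reader) (*Product, error) {
	product := new(Product)
	if err := json.NewDecoder(body).Decode(product); err != nil {
		return nil, fmt.Errorf("decoding product: %w", err)
	}
	return product, nil
}

func main() {
	// A string reader stands in for r.Body from an HTTP request.
	p, err := decodeProduct(strings.NewReader(`{"name":"pen","price":1.5}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print the filled-in struct with its field names.
	fmt.Printf("%+v\n", *p)
}
</code></pre>
<p>Fields that are absent from the JSON (here <code>id</code> and <code>description</code>) keep their zero values, which matches the "empty" product described above.</p>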
<ol>
<li>You can even send HTTP requests (much like you would with cURL) right from VS Code using the REST Client extension.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1675443545482/117dd18c-2855-4c1d-909d-5f953454abe7.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Friend (or Foe) Request?]]></title><description><![CDATA[ChatGPT, Social Engineering, and the moving goal posts in the AI arms race.

Scene

You have a new connection request.

You awake in the morning to find a new notification on your phone. It’s a connection request from someone on LinkedIn. You’ve neve...]]></description><link>https://tratnayake.dev/friend-or-foe-request</link><guid isPermaLink="true">https://tratnayake.dev/friend-or-foe-request</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[AI]]></category><category><![CDATA[Security]]></category><category><![CDATA[uncategorized]]></category><category><![CDATA[information security]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 09 Jan 2023 04:51:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281965823/N4-wpVUF-.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h5 id="heading-chatgpt-social-engineering-and-the-moving-goal-posts-in-the-ai-arms-race">ChatGPT, Social Engineering, and the moving goal posts in the AI arms race.</h5>
<hr />
<h2 id="heading-scene">Scene</h2>
<blockquote>
<p>You have a new connection request.</p>
</blockquote>
<p>You awake in the morning to find a new notification on your phone. It’s a connection request from someone on LinkedIn. You’ve never heard of this person before but considering that it’s a social network for working professionals – you start going through your verification process.</p>
<p><em>Do I know this person?</em></p>
<p><em>If not, is it someone that I should add?</em></p>
<p><em>Are they a threat?</em> <strong><em>Are they a bot?</em></strong></p>
<p>Most of us have been trained to answer these questions from our first days on the internet (<em>don’t talk to strangers!</em>) – but that last one is relatively new in our networked interactions. In this new misinformation age, where troll farms &amp; criminal outfits operate with intents ranging from derailing national elections to flaming competitors’ products, we’ve slowly been trained to look for markers to verify the validity and <em>humanity</em> of our connections:</p>
<p><strong><em>Is this a human? Do they walk like one? Do they talk like one?</em></strong></p>
<p>But what happens when a tool like ChatGPT, which can generate everything from poems to manifestos from a single prompt, continues to proliferate into the hands of millions?</p>
<p>They say 2023 is the year that AI will continue to disrupt more industries. <strong>What does that mean for our personal security on the Internet?</strong></p>
<hr />
<p>If you’ve been on the internet in the last couple of weeks, you’ve probably seen the explosion of Artificial Intelligence (AI) based tools.</p>
<p><strong>And before you ask</strong>, <em>no, I am not using ChatGPT or any of those tools as a cheeky way to write this blog post.</em> This post is written from a 100% certified, Grade-T Human meatsack.</p>
<h2 id="heading-a-recap">A recap:</h2>
<p>In the final throes of 2022, a handful of AI based tools shot into popularity.</p>
<p>Some are focused on <strong>art</strong>, such as <a target="_blank" href="https://openai.com/dall-e-2/">dall-e</a>, <a target="_blank" href="https://starryai.com/">starryai</a> and <a target="_blank" href="https://lens-ai.com/">lensai</a>.</p>
<p>Dall-e will take in user prompts (“<code>An astronaut, riding a horse, in a photorealistic style</code>“) and generate artwork that fits those prompts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281943616/QhkP0ysaB.png" alt /></p>
<p><em>A snippet from Dall-e 2’s Home Page. This page also allows the reader to try clicking on the other variations of the prompts to see how the artwork changes.</em></p>
<p>The others, <a target="_blank" href="https://starryai.com/">starryai</a> and <a target="_blank" href="https://lens-ai.com/">lensai</a>, are more focused on refining images provided as input. You may have seen this recently in your social networks with friends posting impressive artistic self-portraits.</p>
<p>Others, such as <a target="_blank" href="https://chatgpt.pro/">ChatGPT,</a> are more focused towards using AI with <strong>text</strong>. Specifically, using language processing models for things like text generation, language translation, text summarization and sentiment analysis.</p>
<p>Some notable examples include people asking ChatGPT to create custom poems, write (<em>and fix their</em>) code and even generate copy for their marketing or blog posts (<em>again, not me)</em>.</p>
<p>As an example for this blog post, here’s a prompt that was noodling in my head from the most recent podcast I’d listened to.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281945505/Hx0QHAuRv.png" alt /></p>
<p>So, how does an AI helping provide custom output to prompts pose a security risk?</p>
<hr />
<h1 id="heading-security-considerations">Security Considerations</h1>
<p>Security is an arms race. For every <strong>attack</strong> or sword 🗡, there will eventually be a <strong>defence</strong> or shield🛡</p>
<p>..until the attackers come up with a bigger <em>sword</em> which will require defenders to build a better <em>shield</em> and so on and so forth.</p>
<p>There are many ways to attack a target that is connected to the internet. However, one of the most dangerous and bountiful is exploiting the weakest link – humans. This method of attack is known as <strong>social engineering.</strong></p>
<blockquote>
<p>[…]  <strong>Social engineering</strong> is the <a target="_blank" href="https://en.wikipedia.org/wiki/Psychological_manipulation">psychological manipulation</a> of people into performing actions or divulging <a target="_blank" href="https://en.wikipedia.org/wiki/Confidentiality">confidential information</a>. A type of <a target="_blank" href="https://en.wikipedia.org/wiki/Confidence_trick">confidence trick</a> for the purpose of information gathering, fraud, or system access, it differs from a traditional “con” in that it is often one of many steps in a more complex fraud scheme.</p>
</blockquote>
<p>Carrying out this sort of <strong>attack</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281948973/9SyB9bGU4.png" alt="🗡" /></p>
<p>requires an entrypoint or vector. Whereas in less “human” attacks this could be exploiting an insecure port or a poorly compartmentalized process, social engineering usually occurs via <strong>interactions</strong>.</p>
<p>This sort of attack can be stopped “at the door” by blocking, by default, any interactions from people that are not friends. Thankfully, this appears to be becoming the “on-rails” setup experience for most new apps.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281949902/FqBy85wLY.png" alt="🗡" /></p>
<p><strong>Attack</strong>: Interactions from malicious actors</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281950713/buvwcrH2Q.png" alt="🛡" /></p>
<p><strong>Defence</strong>: Default block interactions from strangers.</p>
<p>However, what if this is <em>not</em> the default? Indeed, the main draw for social networks is the <strong>social</strong> aspect of connecting with people right?</p>
<p>This is where the friend-or-foe verification process comes into play and where tools like ChatGPT can be problematic.</p>
<p>Most people are quite good at blocking friend requests on more “personal” social media (i.e. Facebook, Instagram) if they can’t recognize the connection on first glance.</p>
<p>The <strong>primary</strong> verification test for more personal social media is usually:</p>
<ol>
<li><p>Do I know this name?</p>
</li>
<li><p>Does this picture look familiar?</p>
</li>
</ol>
<p>The <strong>secondary</strong> verification test, which occurs based on a looser security posture and upon failure of the primary test, is then to check if there are any common connections.</p>
<ul>
<li>Does anyone else on my friends list know this person?</li>
</ul>
<p>That is to say, the minimum acceptable test for verification on more personal social networks is peer verification.</p>
<blockquote>
<p>If a bunch of people I know seem to think this person is legit, then they must be.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281952015/SfieeWxvn.png" alt="🗡" /></p>
<p><strong>Attack</strong>: Friend requests from malicious actors on <strong>personal</strong> social networks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281952808/xe7l0gi63.png" alt="🛡" /></p>
<p><strong>Defence</strong>: Verify validity of connection by ID via name, picture. OR verify via <em>peer verification.</em></p>
<p>But what about when it’s a setting like LinkedIn? That site is unique because it’s a “less personal” site meant for networking amongst professionals. Just like attending a job fair or conference, it’s expected that you may get requests from people you are not immediately familiar with but who are reaching out for professional reasons. Perhaps that connection request is from someone reaching out to head-hunt for a slick new role.</p>
<p>In these less personal settings, most people relax their security posture, lowering the bar for verification. Here, the name and picture check could be bypassed entirely in favour of validation by their network and industry.</p>
<blockquote>
<p>Sure this name and picture might not immediately click, but perhaps I ran into them at a conference. Are they in my industry? And are they connected to any other peers?</p>
</blockquote>
<p>The issues from this scenario are as follows:</p>
<ol>
<li><p>All it takes is for one bad actor’s connection request to be “accepted” by a person and they can now start sending connection requests to other members of the original acceptor’s professional network.</p>
</li>
<li><p>Every additional connection request that’s accepted continues to add credibility.</p>
</li>
</ol>
<p>For this attack, an attacker simply needs to create enough fake profiles (‘bots’) to get through, and they are betting on the fact that the transactional nature of the network will not lead to people “reference-checking” each other on new connections.</p>
<blockquote>
<p>Hey, I see that you accepted a connection request from X, do you know this person?</p>
</blockquote>
<p>A <strong>defence for this exploit</strong> is to obviously screen connection requests more carefully.</p>
<blockquote>
<p>Sure, this person appears to be known by others in my network – but do they seem to be legitimate?</p>
</blockquote>
<hr />
<p>As an aside, you can see a great example of this type of social engineering in the story of Anna Delvey (the inspiration for the Netflix show “<a target="_blank" href="https://www.imdb.com/title/tt8740976/">Inventing Anna</a>”).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281954434/EIZdsYVEX.jpeg" alt /></p>
<p>She was able to con her way into networking with elite socialites and swindle hundreds of thousands of dollars. She did this by exploiting the transactional nature of those relationships to pass as legitimate.</p>
<p><em>“Well obviously she’s at this big party, someone must have vouched for her”</em></p>
<p><em>“I don’t know her, but she appears to be close with X, therefore she must be legit”</em></p>
<p>By (1) <strong>appearing</strong> to belong and (2) relying on everyone <strong>trusting</strong> that the rest of their peer network had done their due diligence, Anna was able to gain a foothold within the social network and continue exploiting connections for personal gain.</p>
<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281955302/Eb6AgY9Cm.png" alt="🗡" /></p>
<p><strong>Attack</strong>: Friend requests from malicious actors on <strong>professional</strong> social networks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281956158/edo5JH7Bk.png" alt="🛡" /></p>
<p><strong>Defence</strong>: Verify ID via <em>peer acceptance</em> and behaviour.</p>
<p>For most, this is where a credibility test comes into play, trying to determine whether an incoming friend request is legitimate and human:</p>
<blockquote>
<p>“They are in my industry and professional network: do they act like a member of that network? What are the indicators?”</p>
</blockquote>
<ul>
<li><p><em>Do they interact with others?</em></p>
</li>
<li><p><em>Do they interact with others’ posts?</em></p>
</li>
<li><p><em>Do they write posts?</em></p>
</li>
</ul>
<p>In these cases, <strong>custom content</strong> and <strong>interactions</strong> are the most reputable indicators.</p>
<blockquote>
<p>If I’m assessing the credibility of a stranger that appears to be accepted by my network, interactions and custom content are the best indicators of credibility.</p>
</blockquote>
<p>The check is for <strong>effort.</strong> How much effort could a malicious actor be putting into attacking you? Most of the time, it’s little.</p>
<p>Therefore, these are the most reputable indicators because they require the attacker to put in significantly more <strong>effort</strong> to gain context and create posts and interactions. Whereas fake bot accounts can easily be created, and generic posts can be scripted with a bit more effort, crafting genuine content that can only come from someone that’s actually within that industry is <strong>considerably harder.</strong></p>
<p>Well… until now.</p>
<p><strong>Enter</strong>, a tool like ChatGPT. With ChatGPT, an attacker would be able to much more easily create posts, comments and content that is both contextually accurate and very hard to distinguish from a human’s.</p>
<p>Imagine that you work in Cloud Infrastructure and you get a connection request from someone. You don’t know them, but they appear to be in the same industry.</p>
<p>You see that they are active in writing posts. They appear to be involved with Terraform and so have a blog post named “<strong>The benefits of using Terraform with AWS</strong>“.</p>
<p><em>AI tools can easily create these types of blog posts.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281957192/CN0Yugnd7.png" alt /></p>
<p>You even see that they appear to leave comments on a peer’s post about why Terraform sucks with GCP.</p>
<p><em>AI tools can again, easily create the text for these sorts of interactions.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673281958677/3vIMPUGLH.png" alt /></p>
<p>Sure, while a more trained Infrastructure Engineer might be able to spot little quirks or issues in the content upon more detailed scrutiny, could this sort of content pass the <em>first glance?</em></p>
<p>Unfortunately, I believe the answer is <strong>yes</strong>. A problematic scenario I imagine is an attacker, armed with cursory knowledge and basic wordsmithing skills, using a tool like ChatGPT to pass the Friend or Foe (or peer) test. Attackers can use these tools to aid in interactions and more easily masquerade as a member of an industry in order to gain access to a professional network and exploit the people therein. The worst-case scenario is one where the attacker is not even in the loop after the start, and a program is written to script content and interactions using ChatGPT. Essentially, imagine the problem that already exists at scale with low-effort bots, and combine that with the power of ChatGPT.</p>
<p>Seemingly overnight, these AI tools have just become a new tool for attackers to utilize.</p>
<h3 id="heading-defence">Defence</h3>
<p>But it’s not all doom and gloom. In fact, within a couple of weeks there are already tools being created to detect AI-generated posts, some using AI themselves. One example is <a target="_blank" href="https://www.vice.com/en/article/3admg8/a-compsci-student-built-an-app-that-can-detect-chatgpt-generated-text">GPTZero.</a></p>
<p>But as with any security posture, I believe the defence is not just one tool, but a <strong>layered approach.</strong></p>
<p><em>How can I strengthen my security posture against social engineering?</em></p>
<ol>
<li><p>Do not accept interactions from anyone that’s not a friend or a friend of friend.</p>
</li>
<li><p>If you receive a friend request from someone that appears to be associated with another connection – <strong>do a casual reference check.</strong> “Hey do you know this person?”</p>
</li>
<li><p>If on a less personal network, check the <strong>quality</strong> of the interactions. (Do these comments make sense? Do they appear to have relevant context?) Perhaps run that content through a tool like GPTZero.</p>
</li>
<li><p>Reach out to them directly to assess validity.</p>
</li>
</ol>
<p>Social Engineering capitalizes on our human tendency to <em>want</em> to connect with others without scrutinizing every detail. The best guard against this is to maintain vigilance, defend in depth and perhaps use the very same technologies (AI) to fight fire with fire.</p>
]]></content:encoded></item><item><title><![CDATA[Using Helm To Include All Files From A Directory In-line]]></title><description><![CDATA[Lately I've been working on an interesting project that's required me to learn how to use Helm to include all files within a directory as entries in a Kubernetes (K8s) configmap - which is not as straight forward as one might think.
What Am I Trying ...]]></description><link>https://tratnayake.dev/helm-include-all-files-from-directory-in-line</link><guid isPermaLink="true">https://tratnayake.dev/helm-include-all-files-from-directory-in-line</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Helm]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Sat, 10 Sep 2022 04:35:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/Sq0L3SPWLHI/upload/v1662784296227/-5djEDGpL.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately I've been working on an interesting project that's required me to learn how to use Helm to include all files within a directory as entries in a Kubernetes (K8s) configmap - which is not as straight forward as one might think.</p>
<h1 id="heading-what-am-i-trying-to-do">What Am I Trying To Do?</h1>
<p>Run a container in a K8s cluster whose entire job is to spin up and execute a binary file with a specific configuration file as a parameter.</p>
<blockquote>
<p>Note: The binary is set up to use a config file that should be mounted in a specific location (i.e. <code>/etc/config/config.yaml</code>)</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1662782861582/_uW1Zq3B_.png" alt="image.png" /></p>
<h2 id="heading-1st-pass-mount-a-single-config-file-contents-pasted-in-line">1st Pass - Mount a single config file, contents pasted in-line.</h2>
<p>In K8s, we can make files available to a resource by making use of a <code>configMap</code>.</p>
<p>In regular manifests (plain ol' YAML), you can do the following to add a file to a configMap which can then be mounted as a volume within a container.</p>
<h3 id="heading-1-specify-the-configmap-with-the-contents-of-a-single-config-file">1. Specify the ConfigMap with the contents of a single config file.</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1662783140265/mnGy78t8X.png" alt="image.png" /></p>
<p>This is actually exactly the example that's specified in the K8s docs for <a target="_blank" href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#add-configmap-data-to-a-volume">Add ConfigMap data to a Volume</a></p>
<pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: special<span class="hljs-operator">-</span>config
data:
  config.yaml: <span class="hljs-operator">|</span><span class="hljs-operator">-</span>
    lorem impsum dolor things.
    foo <span class="hljs-operator">=</span> bang
    things.
</code></pre><h3 id="heading-2-update-the-pod-spec-to-make-use-of-the-configmap-and-mount-it-to-the-desired-location">2. Update the pod spec to make use of the configmap and mount it to the desired location.</h3>
<pre><code><span class="hljs-attribute">apiVersion</span>: v1
<span class="hljs-attribute">kind</span>: Pod
<span class="hljs-attribute">metadata</span>:
  <span class="hljs-attribute">name</span>: my-test-pod
<span class="hljs-attribute">spec</span>:
  <span class="hljs-attribute">containers</span>:
    - <span class="hljs-attribute">name</span>: test-container
      <span class="hljs-attribute">image</span>: registry.k8s.io/busybox
      <span class="hljs-attribute">command</span>: [ <span class="hljs-string">"/bin/sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"ls /etc/config/"</span> ]
      <span class="hljs-attribute">volumeMounts</span>:
      - <span class="hljs-attribute">name</span>: config-volume
        <span class="hljs-attribute">mountPath</span>: /etc/config
  <span class="hljs-attribute">volumes</span>:
    - <span class="hljs-attribute">name</span>: config-volume
      <span class="hljs-attribute">configMap</span>:
        <span class="hljs-attribute">name</span>: special-config
  <span class="hljs-attribute">restartPolicy</span>: Never
</code></pre><p>Note that the <code>volumeMount</code> means that all the keys in the ConfigMap will be mounted as their own files - so in our example, since there is only one element in the <code>data</code> field of the configmap - the file that this will get mounted as is <code>config.yaml</code> within the <code>/etc/config</code> directory. </p>
<p>However, I thought it was kind of tacky to expect the contents of the configmap to be updated in-line every time. It would be nice if we could decouple this (i.e. if config files could be loaded from elsewhere and be handled differently).</p>
<h2 id="heading-2nd-pass-mount-a-single-config-file-contents-inserted-from-a-separate-file">2nd Pass - Mount a single config file - contents inserted from a separate file.</h2>
<p>Since it turns out I'm able to make use of <a target="_blank" href="https://helm.sh/">Helm</a> (a K8s configuration management and templating tool) for this project, the next optimization was to simply read in the contents of a file dynamically (for mounting) at deploy-time instead of including the contents in-line. This means that as long as a deployer has the required file in the right place, no updates are required to the configmap's contents.</p>
<p>The magic snippet being: <code>config.yaml: {{ tpl (.Files.Get "&lt;filename&gt;.yaml") . | quote }}</code></p>
<pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: special<span class="hljs-operator">-</span>config
data:
  config.yaml: {{ tpl (.Files.Get <span class="hljs-string">"super_duper_sweet_config.yaml"</span>) . | quote }}
</code></pre><p>Even better!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1662783437911/LVNOXIp94.png" alt="image.png" /></p>
<p><strong>Sweet, we're done right?</strong>
Nope. While there might only be <em>one</em> configuration file today - the implementation needs to support having <em>multiple</em> configuration files available to be fed into the binary at run-time.</p>
<p>Assume that there is now a directory named <code>/config_files</code> in the top level directory that contains special configuration files. All of these need to be present at start-up for the container, of which one can be provided to the binary.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1662783448274/r5G7YNI3U.png" alt="image.png" /></p>
<h1 id="heading-some-possible-solutions">Some Possible Solutions:</h1>
<h2 id="heading-1-modify-the-app-code-binary-to-fetch-config-files-during-first-run">1. Modify the app code (binary) to fetch config files during first-run.</h2>
<p>This was the first solution that jumped into our minds, but we decided against it because forcing a change on the binary (instead of making changes in the way a container was deployed) is pretty antithetical to the principles of K8s. It would also mean having to ask the engineering-teams to change the way the program runs which might cause more problems to solve this one.</p>
<h2 id="heading-2-use-an-initcontainer">2. Use an <code>initContainer</code></h2>
<blockquote>
<p>[Init Containers are] specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.</p>
</blockquote>
<p>Where the setup script could be <code>git clone</code> / download all the config files into a volume mount to be used by the main container(s).</p>
<p>While this was a possibility, we decided against it in the interest of time - specifically because the person I was pairing with is a Helm master who knew about the proper snippet to use. See below.</p>
<h1 id="heading-3-use-helm-to-include-all-files-from-a-directory-at-deploy-time">3. Use Helm To Include All-Files From a Directory At Deploy-Time.</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1662783601054/hbQIFyTEB.png" alt="image.png" /></p>
<p>While my partner was a <code>helmMaster</code> and knew that this <em>could</em> be done - <a target="_blank" href="https://stackoverflow.com/questions/47595295/how-do-i-load-multiple-templated-config-files-into-a-helm-chart">this is the StackOverflow post that confirmed it</a> and helped refine the necessary break-through.</p>
<p>Essentially, what we needed Helm to do was to:</p>
<ul>
<li>Range over a list of YAML files in a directory</li>
<li>Create a new element in the config map per file with the <code>key</code> being the filename and the <code>value</code> (data) being the contents of that file.</li>
</ul>
<pre><code><span class="hljs-comment"># Create a new config map for every Deployment.</span>
<span class="hljs-symbol">apiVersion:</span> v1
<span class="hljs-symbol">kind:</span> ConfigMap
<span class="hljs-symbol">metadata:</span>
  <span class="hljs-symbol">name:</span> special-config
<span class="hljs-symbol">data:</span>
  {{- range $path, $_ <span class="hljs-symbol">:</span>=  .Files.Glob  <span class="hljs-string">"config_files/**.yaml"</span> }}    
  {{ $path <span class="hljs-params">| trimPrefix "config_files/" }}: |</span>- 
{{ $.Files.Get $path <span class="hljs-params">| indent 4 }}
  {{ <span class="hljs-keyword">end</span> }}</span>
</code></pre><p>Specifically:</p>
<pre><code>{{<span class="hljs-operator">-</span> range $path, $_ :<span class="hljs-operator">=</span>  .Files.Glob  <span class="hljs-string">"config_files/**.yaml"</span> }}    
  {{ $path <span class="hljs-operator">|</span> trimPrefix <span class="hljs-string">"config_files/"</span> }}: <span class="hljs-operator">|</span><span class="hljs-operator">-</span> 
{{ $.Files.Get $path <span class="hljs-operator">|</span> indent <span class="hljs-number">4</span> }}
  {{ end }}
</code></pre><p>This Helm snippet ranges over the <code>config_files</code> directory for all <code>.yaml</code> files and then creates key-value pairs where the key is the name of the file (with the <code>config_files/</code> prefix trimmed, so the <code>.yaml</code> extension is kept) and the value is the contents of the file.</p>
<p>i.e. If the directory looked like:</p>
<pre><code><span class="hljs-operator">/</span>config_files
<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">&gt;</span> config_a.yaml
<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">&gt;</span> config_b.yaml
</code></pre><p>Then the templated configMap (post-templating) would be:</p>
<pre><code><span class="hljs-comment"># Create a new config map for every Deployment.</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">special-config</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">config_a.yaml:</span> <span class="hljs-string">|-
    &lt;contents&gt;
</span>  <span class="hljs-attr">config_b.yaml:</span> <span class="hljs-string">|-</span>
    <span class="hljs-string">&lt;contents&gt;</span>
</code></pre><p>This means that when used in conjunction with our previous <code>podSpec</code>, which had the following lines:</p>
<pre><code>     <span class="hljs-attribute">volumeMounts</span>:
      - <span class="hljs-attribute">name</span>: config-volume
        <span class="hljs-attribute">mountPath</span>: /etc/config
  <span class="hljs-attribute">volumes</span>:
    - <span class="hljs-attribute">name</span>: config-volume
      <span class="hljs-attribute">configMap</span>:
        <span class="hljs-attribute">name</span>: special-config
</code></pre><p>The files will be present as:</p>
<ul>
<li><code>/etc/config/config_a.yaml</code></li>
<li><code>/etc/config/config_b.yaml</code></li>
</ul>
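<p>As a rough illustration of what the <code>range</code> loop produces, the sketch below re-creates it with Go's standard <code>text/template</code> package, which Helm templates build on. It is not Helm itself: the <code>files</code> map is a stand-in for the result of <code>.Files.Glob</code>, <code>indent</code> and <code>trimPrefix</code> are hand-written stand-ins for the helpers of the same name, and the file contents are invented.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "io"
    "os"
    "strings"
    "text/template"
)

// files stands in for the .Files.Glob result: path to contents (made-up data).
var files = map[string]string{
    "config_files/config_a.yaml": "foo: bar",
    "config_files/config_b.yaml": "baz: qux\nnested:\n  key: value",
}

// indent prefixes every line of s with n spaces.
func indent(n int, s string) string {
    pad := strings.Repeat(" ", n)
    return pad + strings.ReplaceAll(s, "\n", "\n"+pad)
}

// trimPrefix takes the prefix first so the piped value arrives as the last argument.
func trimPrefix(prefix, s string) string {
    return strings.TrimPrefix(s, prefix)
}

// tmpl emits one block-scalar entry per file, keyed by the file name.
const tmpl = `data:
{{- range $path, $contents := . }}
  {{ $path | trimPrefix "config_files/" }}: |-
{{ $contents | indent 4 }}
{{- end }}
`

// render writes the "data:" section for the given files to w.
func render(w io.Writer, files map[string]string) error {
    funcs := template.FuncMap{"indent": indent, "trimPrefix": trimPrefix}
    t, err := template.New("configmap").Funcs(funcs).Parse(tmpl)
    if err != nil {
        return err
    }
    return t.Execute(w, files)
}

func main() {
    if err := render(os.Stdout, files); err != nil {
        fmt.Fprintln(os.Stderr, "render failed:", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Running it prints a <code>data:</code> section with one key per file (<code>config_a.yaml</code>, <code>config_b.yaml</code>), each followed by that file's contents indented four spaces, which is why the mounted files keep their <code>.yaml</code> names.</p>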
<p><em>Well that's great T, but how are you going to make each container choose a different config file?</em>
Stay tuned because that's the next bridge to cross!</p>
<hr />
<p>Anyways, quick post - but that's how we learned how to use Helm to fetch all files and their contents from a directory and include them in-line. Note that a limitation of this approach is that a ConfigMap can hold at most 1 MiB of data.</p>
<blockquote>
<p>A ConfigMap is not designed to hold large chunks of data. The data stored in a ConfigMap cannot exceed 1 MiB. If you need to store settings that are larger than this limit, you may want to consider mounting a volume or use a separate database or file service.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Better Communications With Your Team as a Junior Engineer]]></title><description><![CDATA[Better Communications With Your Team as a Junior Engineer
As an Engineer or Knowledge Worker - our roles require the extraction, synthesis, or modification of information to complete our work and get to the finish line. 

But rarely do we do this in ...]]></description><link>https://tratnayake.dev/better-communications-with-your-team-as-a-junior-engineer</link><guid isPermaLink="true">https://tratnayake.dev/better-communications-with-your-team-as-a-junior-engineer</guid><category><![CDATA[newbie]]></category><category><![CDATA[communication]]></category><category><![CDATA[team]]></category><category><![CDATA[Beginner Developers]]></category><category><![CDATA[Culture]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Sun, 14 Aug 2022 18:47:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1660502571497/FVIF-kxpb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-better-communications-with-your-team-as-a-junior-engineer">Better Communications With Your Team as a Junior Engineer</h1>
<p>As an Engineer or Knowledge Worker - our roles require the extraction, synthesis, or modification of information to complete our work and get to the finish line. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501226467/3KD3s39J2.png" alt="Untitled.png" /></p>
<p>But rarely do we do this in isolation. This almost always requires communicating with other humans to get to the finish line.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501246950/Fb1WU7Obw.png" alt="Untitled (1).png" /></p>
<p>As a Junior Software Engineer (or any member within a team of knowledge workers) - the Interpersonal Communication skills employed when working within an Engineering Team help you learn and grow. As a Senior Engineer, these skills become crucial in operating across organizational boundaries.</p>
<p>Below are a couple of thoughts and tips compiled through my journey from support to engineering, and as a Junior and onwards. This is knowledge that was hard earned over years and challenging interactions - and I hope it will be helpful to anyone that is starting out as a New or Junior Engineer.</p>
<hr />
<p>This was originally a more concise Twitter thread that can be found here: </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/tratnayake/status/1553981169154215936">https://twitter.com/tratnayake/status/1553981169154215936</a></div>
<hr />
<h1 id="heading-new-junior-engineer-welcome">New Junior Engineer - Welcome!</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501278226/QwxPWmjan.png" alt="FZAujnvVsAAIMOc.png" /></p>
<p>Welcome and congratulations on your new role! As a Junior Engineer, it’s an exciting, stressful and maybe even intimidating time in your career! There’s a seemingly unending waterhose of information to ingest, countless new problems to solve and so. SO. many things to build. </p>
<p>The world’s your oyster and after spending years on learning how to sail, you got yourself into a canoe!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501299916/EqT4CMmwo.png" alt="FZAwATQUYAAaQSF.png" /></p>
<p>And thankfully, it’s not just you. You’ve got a whole TEAM to learn and build with. Whether it’s just one other engineer or maybe a large pizza team - your teammates will be a <em>huge</em> part of your experience.</p>
<p>But while your team exists to work together, it’s important to understand that everyone on <em>it</em> is their own unique person, each possessing different strengths, areas of improvement and most importantly, <strong>constraints</strong>.</p>
<h2 id="heading-interpersonal-constraints-knowledge-andamp-space">Interpersonal Constraints - Knowledge &amp; Space</h2>
<p>As humans, it’s in our nature to immediately start assessing new people, groups and surroundings for certain indicators. In early days, it helped keep us alive - and even today, it helps us determine if we are in danger. Outside of basic survival - in group settings we will also start trying to determine hierarchy and structure, and try to make groupings based on different criteria.</p>
<p>In your induction to your team, I would suggest you assess your peers and frame your relationship to them against 2 constraints:</p>
<p><strong>Relevant Functional Knowledge &amp; Space.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501340463/LCq_mrVNE.png" alt="FZA0jd6UUAAXkAe.png" /></p>
<h2 id="heading-relevant-functional-knowledge">Relevant Functional Knowledge</h2>
<p>Relevant Functional Knowledge includes the knowledge, skills and experience revolving around the technical areas you work in. For example, the understanding of what makes up a relational database including what makes it ACID compliant is <em>knowledge</em>.  Being able to then log into a MySQL database and then using that knowledge to execute a database migration is a <em>skill</em>. Remembering to take a snapshot AND backup before starting the whole operation is <em>experience</em> (often, hard-won).</p>
<h2 id="heading-space">Space</h2>
<p>Sometimes referred to as “capacity” or “bandwidth”, space is the time and energy that is available for tasks.</p>
<p>If we were thinking mathematically - Space is a function of Time multiplied by Energy. </p>
<p><code>Space = Time * Energy</code></p>
<p>If someone has a lot of energy but not enough time in their day to do something, they don’t have a lot of space. Conversely, someone that has a lot of time but not a lot of energy is also lacking space.</p>
<h1 id="heading-communication-andamp-organizational-boundaries">Communication &amp; Organizational Boundaries</h1>
<p>Boundaries exist in any organization, and they’re not always bad. Boundaries are helpful because they clearly mark a line of change. For example, in computer networking, information is altered to be sent through network(s) according to the constraints of each of those networks and layers. I like the OSI network model - because there’s a concept of layers, and especially <em>lower</em> and <em>higher</em> networks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501359071/EOYsAC0tv.jpg" alt="data-flow-osi-model (1).jpg" /></p>
<p>Peeling back the skin on this example quite heavily: In networking - you can split communications between local networks (like a home WiFi network) and a wide area network (like the Internet). Communications within a local network are treated differently than in a wide area network. For example, on a local network, clients send frames to a switch - whereas WAN networking requires packets transiting through and from routers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501369465/gk6BHwqQA.png" alt="Untitled (2).png" /></p>
<p>Using this example, think of yourself as localhost, your team as the home WiFi network, and the rest of the org as the Wide Area Network (like the Internet).</p>
<h1 id="heading-interpersonal-boundary">Interpersonal Boundary</h1>
<h3 id="heading-inside-the-boundary">Inside The Boundary</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501397118/eJkI5teyx.png" alt="Untitled (3).png" /></p>
<p>As a Junior Engineer, you might not have a lot of <em>Relevant Functional Knowledge</em> — but you <em>do</em> have a lot of <em>Space</em> to learn. As a Junior - there is slack explicitly built into the deadlines to account for learning. This is done in exchange for the implicit expectation that you will be <em>working hard</em> to learn. TL;DR - You’re given a lot of space to <em>learn</em>.</p>
<p>Between you and your team is your <strong>Interpersonal Boundary.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501408926/p16UT6a_l.jpeg" alt="FZA1SjjVsAED87w.jpeg" /></p>
<p>This is the first “organizational boundary” you will learn to cross and work around as an Engineer.</p>
<h3 id="heading-outside-the-boundary">Outside the Boundary</h3>
<p>On the other side of this boundary is your immediate team.</p>
<p>A good exercise here is to examine the areas of the team’s “Knowledge vs Space” box compared to yours.</p>
<ul>
<li>Relevant Functional Knowledge: By time-in alone, your peers will probably have more functional knowledge — or at least organization specific context.</li>
<li>Space: Because every member of your team is their own person, with their own goals, deadlines and constraints - the space that they have available for you only exists up to a certain limit.</li>
</ul>
<p>Put more succinctly, your team will have more knowledge than you do - however by also possessing their own constraints, <strong>your team (and anyone else) will have less space for you <em>than you have for yourself.</em></strong></p>
<p>Each member will also occupy their own position within the plot.</p>
<p>Each position will change based on the person and organization. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501426982/9frQqnm3M.png" alt="FZA2661VEAE7DjP.png" /></p>
<h3 id="heading-engineering-managers">Engineering Managers</h3>
<p>One unique position on the plot is your Engineering Manager (EM).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501438502/gnqb3tTq9.png" alt="FZA3nXgUsAAhV_A.png" /></p>
<p>On the vertical (Y) axis, they are placed on a scale because EM’s can possess varying levels of <em>functional knowledge</em>. Some EM’s are more technical, some less.</p>
<p>However, the important note is their close proximity to you on the horizontal axis representing <em>space</em>.</p>
<p>EM’s should have the <em>most</em> time, or at least <em>space for you.</em></p>
<p>As people managers, EM’s trade functional tech work for space to lead and develop their people which includes you!</p>
<p>Their role includes providing (strategic) leadership, coaching, and help brokering relationships.</p>
<ul>
<li>If you’re stuck, they can help coach you to figure things out.</li>
<li>If you’re blocked, they can use their levers to get you free - or provide you the resources to do so.</li>
<li>If you don’t know what to do next, they can help you figure it out or at least point you to someone else.</li>
</ul>
<p>Your EM is a very important member of the team to keep in mind, and fostering a relationship with them is crucial not only for your daily success, but for your advancement as well.</p>
<h2 id="heading-teammates-only-have-a-certain-amount-of-space-for-you">Teammates - Only Have A Certain Amount of Space For You</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501451307/sXj39eor-.png" alt="FZA6Ko2UIAEaUFC.png" /></p>
<p>The rest of your team will land vertically based on experience and knowledge. The time available for you usually decreases with seniority due to their workloads.</p>
<p>The more senior the teammates, the more knowledge they will have. However, an unfortunate, but understandable,  side-effect of having more relevant functional knowledge is also a larger scope of work. A larger scope of work means less free space for you 😟</p>
<p>This isn’t personal, this is just an output-versus-capacity limit.</p>
<p>Even IF you had the most considerate and approachable Senior Engineer that is explicitly assigned to you as a mentor - there will come a limit to the space they can extend to you without risking an impact to their other duties and responsibilities.</p>
<p><em>Does this mean Juniors shouldn’t interact with Senior Engineers? On the flip side, does this mean that the more senior you are - the less approachable you’re allowed to be?</em></p>
<p><strong>No!</strong></p>
<p>Seniors, EM’s and organizations <em>must</em> MAKE formal space for Junior Engineers. In fact, that is the mark of, and a necessity for, any engineering organization that is successful in the long-term.</p>
<p>As a Junior though, it does mean that:</p>
<p><strong>No one <em>has</em> to help you.</strong> They <em>should</em>, but in the absence of the ability to <em>demand</em> help, you’re working purely on good-will. Which means it’s on you to do the leg-work to make it as easy as possible to be helped.</p>
<p><em>You should be more considerate in how you communicate with other members of your team - especially by factoring in the Relevant Functional Knowledge vs Space available.</em></p>
<h1 id="heading-how-does-this-help-me-as-a-junior-engineer">How Does This Help Me As A Junior Engineer?</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501470221/9MhF6e6zN.png" alt="FZA7ExNVQAI5EAv.png" /></p>
<p>At some engineering organizations - Junior Engineers don’t have to worry about formal task scoping or grooming, and simply need to grab a card and “take it to the finish line” after all of that planning work has already been done by more senior members.</p>
<p>You want to get to the finish line.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501481673/TPex3ao6H.jpeg" alt="FZA7vJkUIAEM3_R.jpeg" /></p>
<p><em>But</em> - it’s “never a straight line to the finish”, and (especially as a Junior) it will rarely be a solo effort. You <em>will</em> need to work with your teammates in order to gain the answers and knowledge necessary to complete your task.</p>
<p>In fact, this is one of the most common ways you will learn as a Junior Engineer.</p>
<p>And like any method, there are ways you can do this better.</p>
<h1 id="heading-interaction-cost">Interaction Cost</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501494994/gmIRepbvz.png" alt="FZA8Vg1UsAEzEsE.png" /></p>
<p>Every interaction with another engineer incurs costs. For example:</p>
<ul>
<li>Time: How much time is spent not working on their original tasks.</li>
<li>Context Switching: In the best-case scenario, the effort needed to save and dump state before acquiring new state to help you. Or, most commonly, the cost of reacquiring lost state when switching back to the original task.</li>
</ul>
<p>And it’s true that the reality is: <em>these costs will always exist.</em></p>
<p>An implicit agreement of team membership is accepting that interactions are necessary &amp; encouraged.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501509758/-KM999RRQ.png" alt="FZA-AaeUcAAnmbb.png" /></p>
<p>HOWEVER - becoming a good engineer is understanding constraints and optimizing accordingly. Similarly, being a good teammate is about doing the same by being efficient in your interactions.</p>
<h1 id="heading-efficiently-crossing-the-interpersonal-boundary">Efficiently Crossing the Interpersonal Boundary</h1>
<p>By focusing on the transition between these boundaries, you can become more efficient in your interactions with teammates.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501526988/pI0VhImXI.png" alt="Untitled (4).png" /></p>
<h2 id="heading-1-do-your-due-dillience">1.  Do your DUE DILIGENCE</h2>
<p>In the legal world, due diligence refers to:</p>
<blockquote>
<p>Reasonable steps taken by a person in order to satisfy a legal requirement, especially in buying or selling something.</p>
</blockquote>
<p>While there might not be a legal requirement when interacting with a teammate, you <em>are</em> effectively trading information for time - the least you can do is everything in your power to ensure that the interaction is as efficient as possible.</p>
<p>Doing your due diligence for an interaction is task specific, and is something you’ll get better at over time. However, here are 3 suggestions which are applicable to most interactions.</p>
<p>And remember - one of the benefits of doing this <em>before</em> you cross the boundary to a teammate is that you have as much space as you need to work the problem. You haven’t bugged anyone yet so there is no time pressure.</p>
<p>Take your time. And if you feel that time is running out - work with your leadership to ask for more to figure it out.</p>
<h3 id="heading-1-debug-with-a-rubber-duck">1. 🦆 DEBUG WITH A RUBBER DUCK.</h3>
<p>Ref: <a target="_blank" href="https://en.wikipedia.org/wiki/Rubber_duck_debugging">https://en.wikipedia.org/wiki/Rubber_duck_debugging</a></p>
<p>Before you run for help, go through your code / your problem - line by line, premise-by-premise. You’d be surprised how often this approach catches something you missed, or unearths nuggets to get you unblocked.</p>
<h3 id="heading-2-read-the-freakin-manual-rtfm">2. 📖 Read The Freakin’ Manual (RTFM)</h3>
<p>Something that gets old quick when you help other people - is if they keep leaning on you without trying to help themselves first. As a knowledge worker, your job is solely focused on interacting with information until it can be used / modified to be useful. </p>
<p>One perk of being a knowledge worker is that you are usually provided with a link to the Internet. Make sure you make good use of it. Do the leg work to do your research <em>before</em> asking your question (as much as you can) instead of relying on your helper to do it for you.</p>
<h3 id="heading-3-refine-your-ask-organize-your-thoughts-in-logical-order-andamp-prep-question">3. ✍️ REFINE your Ask - Organize Your Thoughts In Logical Order &amp; Prep Question</h3>
<p><em>“Even if you’re an engineer, you’re still in sales.”</em></p>
<p>You’re selling an opportunity to get helped, and you want to make sure you do <em>everything</em> you can in your interaction for your recipient to “buy” the opportunity to help you.</p>
<p>You do this by making sure your sales pitch is as <em>tight</em> as it can be.</p>
<ul>
<li><em>Is your pitch in logical order?</em> For someone that might be doing something <em>completely</em> different when you’re asking them - does your question / story / quest flow from point to point? Make sure to tighten it up.</li>
<li><em>Do you provide an easy on-ramp to sync understanding?</em> When you’ve been working for a long time on a specific problem, it’s easy to get “in the weeds” and forget how much context you’ve built up. When pitching to someone, it’s important to focus on providing an  “on-ramp” to go from 0 → 1. Start by providing information that is <em>generalized</em>  and <em>simple</em> before moving → to more <em>specific</em> and <em>complex.</em></li>
<li><em>Can you execute your pitch in one-go? HAVE YOU PRACTISED?</em> If you’re selling door-to-door, it’s show-time as soon as you ring that doorbell. Similarly, the moment you send that message - you’re on. Practice your pitch thoroughly so as to reduce the time spent clarifying things on the call.</li>
</ul>
<p>As an example:</p>
<blockquote>
<p>I need help! I can’t figure out how to deploy to Staging</p>
</blockquote>
<p>is not as great as:</p>
<blockquote>
<p>Hey can you help me figure out why I can’t deploy this to Staging? I read the doc on CI/CD [Ref: Link] and kicked off the pipeline but I seem to be getting this weird <code>401</code> error? I read up on that error and it looks like it might be a permissions issue? Do I need to get access to something else?</p>
</blockquote>
<p>More on this in “Crafting your Message” later on.</p>
<hr />
<p>Once you’ve done your Due Diligence, it’s time to:</p>
<h1 id="heading-determine-recipients-andamp-method-of-delivery">Determine Recipient(s) &amp; Method of Delivery</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501564695/LznNShEHV.png" alt="Untitled (5).png" /></p>
<h3 id="heading-who-do-i-talk-to-decision-math">Who Do I Talk To? Decision Math</h3>
<p><em>Hey the Tech Lead wrote the CI/CD pipelines right? Can’t I just go to them with this question?</em></p>
<p>Sure, but by that logic - they probably also have knowledge about most things on the team. </p>
<p><strong><em>What happens if everyone goes to the Tech Lead all the time?</em></strong> That’s not sustainable.</p>
<p>A considerate team member weighs the impact on others against the time sensitivity and urgency of their requests. Just because you <em>can</em> DM someone that probably knows the answer, doesn’t mean you should. Especially in the presence of other options.</p>
<p>Instead, this is where it’s important to do some Decision Math in determining who you talk to.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501577760/9_-lE11YK.png" alt="Untitled (6).png" /></p>
<p>To make this decision, you should factor in 3 considerations:</p>
<ul>
<li><strong>Least Impact</strong> - You should aim to pick the person on your team that will be the <em>least</em> impacted by your interruption; But</li>
<li><strong>Best Answer</strong> - You should also aim to pick someone that’s probably going to have knowledge on the relevant subject matter. However, at the end of the day;</li>
<li><strong>Quickest Resolution</strong> - You need to ensure you ask someone that will be able to give you a timely answer.</li>
</ul>
<p>For example, if you have a question about your CI/CD pipeline, the Decision Math could go like this:</p>
<p>The problem is around CI/CD pipelines.</p>
<ul>
<li>Best Answer: This is a problem that the whole team has dealt with. (So anyone could answer this - but you should pick the peer engineer because they are the next lowest level available)</li>
<li>Time Sensitivity: You need to figure this out so that you can deploy the changes to finish your card. But this isn’t due till end of week and it’s a Monday.</li>
<li>Least Impact: Based on the two details above - I can speak to any of the engineers on my team. However I know that my peer engineer is busy pairing with the Tech Lead right now so they’re both busy.</li>
</ul>
<p>Therefore:  I’ll ask my Senior Engineer so that it’s the least disruptive to the team, probably going to give me the right answer and in the fastest manner.</p>
<p>HOWEVER if you have a <em>dedicated resource</em> (i.e. a “technical mentor” or “big-sib” or “new-hire buddy”), go to them first. They are primed and are known to expect you.</p>
<p>For all this talk about <em>who</em> to talk to though, I suggest the following options, <strong>escalating up to the next level if one isn’t successful</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501625117/JpvwXRZmS.png" alt="Untitled (7).png" /></p>
<h3 id="heading-1-ask-in-the-team-chat">1. Ask in the Team Chat</h3>
<p>The team chat is the <em>hub</em> that all members work around. In teams with healthy communication cultures - this chatroom should be <em>very</em> active.</p>
<p>Messaging in the chat room has the benefit of getting more eyes and ears on the question / request and the information contained within. Even if it doesn’t help you or the reader <em>directly at that moment</em>, it makes that information available for querying and recall by the broader group. It also increases communication and collaboration.</p>
<p>Want an <em>easy</em> way to reduce knowledge silo-ing and single points of failure? </p>
<p>Leave more breadcrumbs in an easily accessible and frequently monitored space.</p>
<p><strong>Pro’s</strong></p>
<ol>
<li>Best Effort, Help Invited - Not tagging anyone specifically means that whoever has the answer, guidance or suggestions can help you based on <em>their</em> constraints (time, energy).</li>
<li>Knowledge Sharing - Putting it in the team chat means that knowledge is shared.
DM's are death.</li>
</ol>
<p><strong>Con’s</strong></p>
<ol>
<li>Not tagging a specific recipient means that there is lower pressure to respond. If your team does not have a respectful communications / collaboration culture - your question might be ignored.</li>
<li>The open arena for communications might lead to more drawn out conversations. (Bike-shedding, yak-shaving).</li>
</ol>
<p>If your message is ignored, I would then try to:</p>
<h3 id="heading-2-ask-in-the-team-chat-with-targeting-or-time-pressure">2. Ask in the Team Chat - With Targeting or Time Pressure</h3>
<p>Same as above but with either a direct CC, direct <code>@</code> tag or a deadline.</p>
<p><strong>Pro’s</strong></p>
<p>Same as before + </p>
<ol>
<li>Might get a response from someone that missed the message earlier.</li>
<li>Encourages people to respond</li>
</ol>
<p><strong>Con’s</strong></p>
<ol>
<li>Doing this excessively could annoy your teammates and lose the good-will / patience to help you.</li>
</ol>
<p>Finally, if <em>that</em> doesn’t work;</p>
<h3 id="heading-3-ask-an-expert-directly-via-dm">3. Ask an Expert Directly via DM</h3>
<p>If time is of the essence, go direct to someone. Someone who's guaranteed to
know the answer. OR - if the right person isn't clear, ask your EM.</p>
<p><strong>Pro’s</strong></p>
<ul>
<li>Fastest response.</li>
</ul>
<p><strong>Con’s</strong></p>
<ul>
<li>“Death by DM’s” - Communications energy drain for the recipient having to check on a channel that they weren’t already monitoring.</li>
<li>“Death in DM’s” - Information is now siloed between you and the recipient.</li>
</ul>
<hr />
<h1 id="heading-3-craft-your-message">3. Craft Your Message</h1>
<p>Now that you know where / who you’re going to interact with for your question &amp; help - it’s time to craft your message. Remember, this is your <em>sales pitch</em> where you’re selling your teammates on the chance to help you - make it as tight as possible.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501660293/CRPoImBQ5.png" alt="Untitled (8).png" /></p>
<p>Some things to consider when crafting your message:</p>
<ol>
<li>What are their constraints?<ol>
<li>How much time do they have for me?<ol>
<li>How much information can I strip out (without losing message integrity)</li>
</ol>
</li>
<li>How much knowledge / context do they have about the questions I’m inquiring about? (Is this something they work in daily?)<ol>
<li>How much information do I need to fill them in on?</li>
</ol>
</li>
</ol>
</li>
<li>What sort of follow-up questions might they ask?<ol>
<li>Can I incorporate these into the question so that they don’t need to ask them?</li>
</ol>
</li>
</ol>
<p>This is where I recommend reading more into a previous post I’ve included around the “5 Step Question”:  <a target="_blank" href="https://tratnayake.dev/asking-better-questions-as-a-junior">https://tratnayake.dev/asking-better-questions-as-a-junior</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501672700/tiBDsBQVr.png" alt="Untitled (9).png" /></p>
<p>After you’ve done all that - you’re good to send!</p>
<hr />
<h1 id="heading-conclusion">Conclusion</h1>
<p>If you do all of this, your journey might look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1660501683802/cn1q-Jl_o.png" alt="Untitled (10).png" /></p>
<p>If you take the care to understand the Interpersonal Boundary and what it takes to be deliberate, considerate and efficient when crossing it to reach your team - you’ll get better answers, be less of a burden on your team and, more importantly, practice the skills and habits that will be essential further on in your career as a more senior engineer, when you will need to cross multiple organizational boundaries.</p>
]]></content:encoded></item><item><title><![CDATA[How-To Set Up a Jupyter Notebook on GCP with Granular Access Control to Read from Big Query, Configured with Terraform]]></title><description><![CDATA[Context
In our organization we have data-analysts that need to fetch data from different sources to perform their work. In the last couple of weeks, I worked with a Data Analyst that needed a solution to query data from BigQuery (BQ) datasets using R...]]></description><link>https://tratnayake.dev/terraforming-jupyter-notebooks-r-stats-bigquery-access-control</link><guid isPermaLink="true">https://tratnayake.dev/terraforming-jupyter-notebooks-r-stats-bigquery-access-control</guid><category><![CDATA[Terraform]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Sun, 13 Feb 2022 01:03:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/IKHvOlZFCOg/upload/v1644705732494/SSoq5tRcA.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-context">Context</h1>
<p>In our organization we have data-analysts that need to fetch data from different sources to perform their work. In the last couple of weeks, I worked with a Data Analyst that needed a solution to query data <strong>from</strong> BigQuery (BQ) datasets using R (a programming language for statistical computing).</p>
<p><em>What’s BigQuery?</em> <a target="_blank" href="https://cloud.google.com/bigquery/docs/introduction">An enterprise data warehouse that is specific to Google Cloud Platform (GCP)</a>. Useful in circumstances where folks want to run analysis on their data but don’t want to do it on the databases themselves (Most databases are set up to ensure data is read/written reliably - whereas data warehouses are built specifically for analytical operations). </p>
<p>One common pattern is to fill BQ datasets from production databases and run analytical operations on those datasets.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644705856268/_JUlDf9Un.jpeg" alt="akinori-uemura--T6vP7ZGz0Q-unsplash.jpg" />
Photo by <a href="https://unsplash.com/@a_uem?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Akinori UEMURA</a> on <a href="https://unsplash.com/s/photos/chains?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h2 id="heading-constraints">Constraints</h2>
<p>In building / researching a solution for this need, we wanted to work around a couple of constraints:</p>
<ul>
<li><strong>Future-proofing</strong>: The solution should be able to keep up with future demands and reduce the dependency on local hardware.<ul>
<li>We don’t want analyst laptops to be a limiting factor wherever possible. Using the cloud means leveraging the ability to spin up workloads on hardware with specific requirements as needed.</li>
<li>Running this in the cloud also means that a user can continue working / access data from anywhere / any laptop (even if they lose access to their device).</li>
</ul>
</li>
<li><strong>Security</strong> - Authentication: The solution must be locked down to specific users.<ul>
<li>The BigQuery datasets are already subject to security policies that lock down their access. However, as this is another / new mechanism making use of the dataset, the solution must at minimum not grant access to any new / unwanted users. Ideally, it <strong>only</strong> grants access to those who require it (and are already within the security / access policy).</li>
<li>This solution should also allow the ability to query into datasets in other GCP projects that we maintain.</li>
</ul>
</li>
<li><strong>Security</strong> - Authorization: The solution must only allow access to specific DataSets.<ul>
<li>In line with the above, even if the requestors are within the security policy to read and write from this dataset, the solution must be locked down to READONLY operations on the data. This mitigates one more risk of data loss.</li>
</ul>
</li>
<li><strong>Infrastructure-As-Code:</strong> The solution should be Terraformed<ul>
<li>No special snowflakes*  on our watch! Having this infrastructure terraformed has a lot of benefits, which you can read about <a target="_blank" href="https://learn.hashicorp.com/tutorials/terraform/infrastructure-as-code">here</a>.  But for us, it means that everything that’s running is codified and can be examined, modified, or nuked from one source of truth.</li>
</ul>
</li>
<li><strong>Cost:</strong> Using this notebook shouldn’t be prohibitively expensive.</li>
</ul>
<hr />
<h1 id="heading-goal">Goal</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644705932141/pwg2co3xK.jpeg" alt="alvaro-mendoza-6dRiUBjRvsM-unsplash (1).jpg" />
Photo by <a href="https://unsplash.com/@a56?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">ÁLVARO MENDOZA</a> on <a href="https://unsplash.com/s/photos/goal?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>Implement a solution that allows a data-analyst to run R code against BigQuery Datasets which meets all of these constraints.</p>
<h1 id="heading-solution">Solution</h1>
<p>The majority of this solution is already covered in a GCP article: <a target="_blank" href="https://cloud.google.com/architecture/data-science-with-r-on-gcp-eda">Data Science with R on GCP EDA</a>. What this post adds is an approach that builds on Service Account IAM to meet our security requirements, and shows how to achieve this solution with Terraform.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644705351601/As26e_W-k.png" alt="R_Studio_Jupyter_Notebook.png" /></p>
<ol>
<li>The main character of this solution is the Vertex AI service which allows you to run Jupyter Notebooks (as an IDE for R) on rapidly configurable VM’s. (<strong>Futureproofing</strong> ✅)<ol>
<li>It’s usually used in AI related workflows like training ML (Machine Learning) models and thus in doing so - has <strong>native support for talking to BigQuery.</strong></li>
<li>These workbooks have access to the <code>Deep Learning</code>  <a target="_blank" href="https://cloud.google.com/vertex-ai/docs/workbench/user-managed/images">family of images</a> which allows you to quickly instantiate notebooks with specific images (including one with the R framework installed!)</li>
<li>These workbooks run on top of regular VM’s that can be configured to specific workload needs (i.e. tweaking processor, memory and disk specs)<strong>.</strong></li>
</ol>
</li>
<li>Depending on the hardware used, the costs are minimal (<strong>Cost</strong> ✅).<ol>
<li>For example, an <code>e2-medium</code> instance costs only $24.46 per month at time of writing (and that’s assuming the notebook is running 24/7).</li>
</ol>
</li>
<li>The constraints for our solution are met by the following:<ol>
<li>🔑🔑🔑  The Vertex AI User Managed Notebook Instance (hereafter referred to as the “notebook”) can be tied to a Service Account (SA) 🔑🔑🔑.  By applying access control to this SA we can achieve the constraints as follows:<ol>
<li><strong>Security - Authorization:</strong> <ol>
<li>We can lock down who has access to this notebook by gating on who gets to have the <code>roles/iam.serviceAccountUser</code> role on the Service Account in GCP IAM. </li>
<li>We can lock down that SA’s access to (1) only the datasets required and (2) READ ONLY operations by assigning the following roles with constraints:<ol>
<li><code>roles/bigquery.jobUser</code> (on the whole project)</li>
<li><code>roles/bigquery.dataViewer</code> on the specific datasets.</li>
</ol>
</li>
<li>This also allows querying datasets in other GCP projects, by granting roles in those projects to this SA.</li>
<li>This is the major key, as the Service Account can scale up to multiple users: the <code>roles/iam.serviceAccountUser</code> role can be bound to any principal, which can include users AND groups.</li>
</ol>
</li>
<li><strong>Security - Authentication:</strong> ✅ Because our users log in using their Google accounts, the authentication mechanism is taken care of by GCP (using folks’ credentials).</li>
</ol>
</li>
</ol>
</li>
<li>In order for the Notebook to query BigQuery, the Notebooks API must be enabled.<ol>
<li>This is a MANUAL operation that must be done in the GCP console.</li>
</ol>
</li>
<li>This can all be Terraformed. (<strong>Infrastructure as code</strong> ✅)</li>
</ol>
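<p>The cross-project point in the list above can be sketched in Terraform. This is a minimal, untested sketch: it assumes the <code>google_service_account.analyst_notebook</code> resource defined in the Implementation section of this post, and the project ID <code>other-project-id</code> and dataset ID <code>other_dataset</code> are hypothetical placeholders, not values from the original setup.</p>
<pre><code class="lang-terraform"># Sketch only: grant the notebook's SA read access to a dataset that
# lives in a DIFFERENT project. Both IDs below are hypothetical.
resource "google_bigquery_dataset_iam_member" "analyst_notebook_other_project_viewer" {
  project    = "other-project-id" # CHANGEME - project that owns the dataset
  dataset_id = "other_dataset"    # CHANGEME - dataset to expose read-only
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${google_service_account.analyst_notebook.email}"
}
</code></pre>
<p>With this shape, query jobs should still run in the notebook's own project (where the SA holds <code>roles/bigquery.jobUser</code>); only the read grant lives in the other project.</p>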
<h2 id="heading-implementation">Implementation</h2>
<h3 id="heading-1-enable-the-notebooks-api">1. Enable the Notebooks API</h3>
<p><a target="_blank" href="https://console.cloud.google.com/marketplace/product/google/notebooks.googleapis.com">https://console.cloud.google.com/marketplace/product/google/notebooks.googleapis.com</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644705993245/UxP5bsf70.png" alt="Notebooks_API_–_APIs___Services_–_tutorials_–_Google_Cloud_Platform.png" /></p>
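<p>This post enables the API by hand in the console. As an untested alternative, the Google provider's <code>google_project_service</code> resource can usually enable an API from Terraform. The sketch below assumes the <code>local.project_name</code> value from the configuration in the next step, and that the credentials running Terraform are allowed to enable services on the project.</p>
<pre><code class="lang-terraform"># Sketch only: enable the Notebooks API from Terraform instead of the console.
resource "google_project_service" "notebooks" {
  project = local.project_name
  service = "notebooks.googleapis.com"

  # Leave the API enabled if this resource is later destroyed,
  # so other notebooks in the project are not affected.
  disable_on_destroy = false
}
</code></pre>
<p>If you go this route, the notebook instance would likely need <code>depends_on = [google_project_service.notebooks]</code> so that it is only created once the API is on.</p>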
<h3 id="heading-2-apply-the-terraform">2. Apply the Terraform</h3>
<pre><code class="lang-terraform">locals {
  # CHANGEME
  project_name = "tutorial-344120" # The project
}

# Note this requires running a gcloud auth application-default login
provider "google" {
  project = local.project_name
}



##1. Create a Service Account
resource "google_service_account" "analyst_notebook" {
  account_id   = "analyst-notebook"
  display_name = "SA for analysts to access BQ datasets via Vertex notebook"
}

##2. Create a User Managed Notebook that uses that Service Account
resource "google_notebooks_instance" "analyst_notebook" {
  name     = "analyst-rstudio-notebook"
  location = "us-west1-a"
  #CHANGEME
  machine_type = "e2-medium"
  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "r-latest-cpu-experimental"
  }

  service_account = google_service_account.analyst_notebook.email
}

##3A Allow ability to run BQ jobs on all datasets in project
resource "google_project_iam_member" "project" {
  project = local.project_name #CHANGEME if the target datasets are in diff project.
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.analyst_notebook.email}"
}


##3B Allow ability to READ on a SPECIFIC BQ dataset.
resource "google_bigquery_dataset_iam_member" "analyst_notebook_data_viewer" {
  project    = local.project_name #CHANGEME, if the target datasets are in diff project.
  dataset_id = "rick_morty"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:${google_service_account.analyst_notebook.email}"
}


##4. Allow only  the intended user to use the SA and by extension, the notebook
resource "google_service_account_iam_binding" "analyst_notebook_service_account_binding-iam" {
  service_account_id = google_service_account.analyst_notebook.name
  role               = "roles/iam.serviceAccountUser"

  members = [
    #CHANGEME - who should have access to assume the Service Account (and access the Notebook)
    "user:thilina.ratnayake@email.com",
  ]
}
</code></pre>
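<p>To scale step 4 beyond a single analyst, the same binding can list a group. The sketch below is a variant of that step, meant to replace it rather than sit alongside it, because <code>google_service_account_iam_binding</code> is authoritative for the role it manages. The group address <code>data-analysts@example.com</code> is a hypothetical placeholder.</p>
<pre><code class="lang-terraform"># Sketch only: variant of step 4 that grants access to a group as well as a user.
resource "google_service_account_iam_binding" "analyst_notebook_service_account_binding-iam" {
  service_account_id = google_service_account.analyst_notebook.name
  role               = "roles/iam.serviceAccountUser"

  members = [
    # CHANGEME - hypothetical group; membership is then managed in Google Groups
    "group:data-analysts@example.com",
    "user:thilina.ratnayake@email.com",
  ]
}
</code></pre>
<p>Onboarding a new analyst then becomes a group membership change instead of a Terraform change.</p>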
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tratnayake/R_Studio_BigQuery_Jupyter_Notebook">https://github.com/tratnayake/R_Studio_BigQuery_Jupyter_Notebook</a></div>
<h2 id="heading-test">Test</h2>
<h3 id="heading-1-can-we-open-the-notebook-and-query-the-bigquery-dataset-using-r">1. Can we open the notebook and query the BigQuery dataset using R?</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644705395413/IijcnOzVJ.png" alt="Cursor_and_Untitled_ipynb_-_JupyterLab_and_SQL_workspace_–_BigQuery_–_tutorials_–_Google_Cloud_Platform.png" /></p>
<p>The R code to query a BQ dataset can be found here: <a target="_blank" href="https://cloud.google.com/vertex-ai/docs/workbench/user-managed/use-r-bigquery">Use R with BigQuery</a></p>
<p>Yes 🕺 </p>
<h3 id="heading-2-can-anyone-else-log-attempt-to-open-up-the-jupyter-notebook">2. Can anyone else open up the Jupyter notebook?</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644706112270/cQtF3Cyul.png" alt="Screen_Shot_2022-02-03_at_3_50_23_PM.png" /></p>
<p>Nope! 🔒  ✅</p>
<h3 id="heading-3-can-we-attempt-to-access-other-datasets-outside-of-what-is-specified-in-the-iam-policy">3. Can we attempt to access other datasets? (Outside of what is specified in the IAM policy?)</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644706453094/qmhNkwAys.png" alt="Cursor_and_Untitled_ipynb_-_JupyterLab.png" /></p>
<p>Also Nope!🔒  ✅</p>
<hr />
<h1 id="heading-why-not-use-a-service-account">Why not use a Service Account key?</h1>
<p>The obvious alternative: create a Service Account, let the user download the SA key and use it when connecting to the database from their device.</p>
<p>We stayed away from Service Account keys primarily because of the number of risks that they add to the security story.</p>
<p>You can read more about those risks here: <a target="_blank" href="https://cloud.google.com/iam/docs/best-practices-for-securing-service-accounts">https://cloud.google.com/iam/docs/best-practices-for-securing-service-accounts</a></p>
<p>Using a Service Account key with a local device also comes with a couple of other downsides:</p>
<ul>
<li>No infrastructure as code</li>
<li>Hardware is a constraint - the device may lack the specs for the workload, and losing the device is a risk.</li>
</ul>
<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644706536103/ls4v8DC4A.jpeg" alt="marliese-streefland-2l0CWTpcChI-unsplash.jpeg" />
Photo by <a href="https://unsplash.com/@marliesebrandsma?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Marliese Streefland</a> on <a href="https://unsplash.com/s/photos/dog?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Service Accounts can be great. They are a good approach if you need to represent non-human users or persistent access to a system.</p>
<p>Service Account keys...not so much. They are gross, and icky, and <em>very</em> easy to lose, at which point they become a security risk.</p>
<p>When advantageous, use cloud resources to fill the needs of your users as they bring a couple of benefits:</p>
<ul>
<li>Existing auth mechanisms</li>
<li>Ease of configuration</li>
<li>Infrastructure-as-code</li>
</ul>
<p>In this case, we combine both and use a Service Account specifically because of its ability to be a single target to apply our security policies to.</p>
<p>The key feature that enabled this solution was <strong>GCP’s ability to tie a User Managed Notebook Instance to a Service Account which we could then apply our access policies onto.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Asking Better Questions as a Junior]]></title><description><![CDATA[This article has a companion video:
https://youtu.be/kUyz0geFp3c
And there's a chart at the end :)

It can be scary doing something new. Especially if it's something like starting out as a new engineer, or maybe even as an experienced engineer but ...]]></description><link>https://tratnayake.dev/asking-better-questions-as-a-junior</link><guid isPermaLink="true">https://tratnayake.dev/asking-better-questions-as-a-junior</guid><category><![CDATA[Junior developer ]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[teaching]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Tue, 08 Feb 2022 18:39:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1644345668001/Fog59c9Lg.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article has a companion video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/kUyz0geFp3c">https://youtu.be/kUyz0geFp3c</a></div>
<p>And there's a chart at the end :)</p>
<hr />
<p>It can be scary doing something new. Especially if it's something like starting out as a new engineer, or maybe even as an experienced engineer but starting within a new team.  In addition to all the technical aspects - there are so many new relationships to feel out, norms to establish and culture to absorb. </p>
<p>I remember just a couple of months ago when I was starting my new job, I had the <strong>same</strong> anxieties as when I started my first job:</p>
<blockquote>
<p>Am I asking too many questions?</p>
<p>omg, are my teammates getting annoyed?</p>
<p>Is this a dumb question?</p>
</blockquote>
<p>While a big part of getting past that stage involves pushing through those thoughts to ask questions - there's value in strengthening our questioning skills as it is one act that we have <strong>complete control over</strong>, and which increases our ability to learn, grow and contribute faster. </p>
<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644341758254/6VTX8yMkH.jpeg" alt="photo-1431540015161-0bf868a2d407.jpeg" />
Photo by <a href="https://unsplash.com/@bchild311?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Benjamin Child</a> on <a href="https://unsplash.com/s/photos/corporate?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h1 id="heading-preamble">Preamble</h1>
<p>My first job after graduating from college with a Bachelor's in Computer Systems Technology was as an <strong>IT Support Administrator</strong> at a medium-sized corporate org.  At this job, I was given a desk that belonged to someone else with the caveat: </p>
<blockquote>
<p>Oh just move some stuff around a bit, but heads up, they might be back soon.</p>
</blockquote>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://media.giphy.com/media/XCmFwjt9wPotobw1xn/giphy.gif">https://media.giphy.com/media/XCmFwjt9wPotobw1xn/giphy.gif</a></div>
<p>With a cluttered desk, hand-me-down laptop and a notepad - I was left to my own devices to learn about a complex global IT system, alongside a senior engineer who seemed more annoyed to be interrupted by my presence than interested in teaching me. </p>
<p>I lasted two months (but walked away with an amazing friend!).</p>
<p>My next job was another "awkward fit". I was a Junior Front-End engineer at a very small start-up that had no documentation or process for helping juniors, and where asynchronous PR reviews were the norm. I still remember my excitement in submitting my first ever PR and then the immediate horror as I looked at the Trello card afterwards, which had <strong>74</strong> points to fix as single statements (and no advice or suggestions).</p>
<blockquote>
<p>Let me know when you're ready to resubmit.</p>
</blockquote>
<p>I let them know after 4 weeks - that this probably wasn't the right fit for me.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://media.giphy.com/media/Z9cRCMdAMzXi25dwhE/giphy.gif">https://media.giphy.com/media/Z9cRCMdAMzXi25dwhE/giphy.gif</a></div>
<p>After these discouraging experiences, I went through an identity crisis (the first of many!) where I wondered if I was ever meant to be a Software Engineer. So much so that the next gig I took on was joining a recently acquired start-up as a Customer Support Representative.</p>
<p>And this is where it all "happened" for me. This was one of my most critical formative experiences in tech, and I lean on the skills I learned, refined and mastered there <strong>every single day.</strong></p>
<hr />
<h2 id="heading-support-engineering">Support Engineering</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644342385516/uveX4ScXx.jpeg" alt="14481945_10157482774060526_8268574988926302610_o.jpg" /></p>
<blockquote>
<p>Welcome to Support, The Best Damn Org In The Company</p>
</blockquote>
<p>Maybe not what you'd expect to hear in that sort of organization - but that was the <strong>first</strong> thing my team lead said to me. The culture in that company was electric, but the esprit de corps and morale in that support team were off the charts. And I <strong>think</strong> that's because a large portion of them were professionals.</p>
<p>The purpose of a support organization is to assist customers with their questions and problems. To this end, our bread-and-butter was working on <strong>tickets</strong>. A ticket is generated when there's an interaction from a customer (like an email, or a phone call), and all correspondence takes place on that ticket until the ticket is completed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644341992523/5ZxG9YsNe.jpeg" alt="27993439_10159928959840526_3388268662950844016_o.jpg" />
^ Working on tickets</p>
<p>I must have worked on <strong>thousands</strong> of tickets over my 2.5 years in support, with each ticket having at least 2 interactions. In these tickets, the goal is to identify a customer's problem and provide solutions or a course of action as soon as possible. Because of this, I became <em>very</em> good at asking clarifying questions to isolate problems and structuring information into logically ordered, bite-sized pieces. In fact I remember thinking to myself:</p>
<blockquote>
<p>At school I learned how to do a wide range of things, from building a custom OS kernel from scratch to writing programs that can handle concurrency and YET; the class that I've used the most in this gig is my Philosophy class.</p>
</blockquote>
<p>Logical fallacies. Presenting information.  Ordering premises.  My non-technical profs would be thrilled!</p>
<hr />
<p>As part of our daily work - one activity that would come up is <strong>Escalations</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644342171149/vNU2RqXy5.jpeg" alt="photo-1553044020-8c90843adf96.jpeg" />
Photo by <a href="https://unsplash.com/@kellysikkema?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Kelly Sikkema</a> on <a href="https://unsplash.com/s/photos/sticky-note?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h2 id="heading-escalations">Escalations</h2>
<p>Customer Support Representatives (CSRs) worked on newly created tickets. We would work with the customer to collect diagnostics and recommend solutions based on those efforts. Usually, most problems were common issues that could be solved with a link to a document or a recommendation to do a few things.  These were the cases in which a ticket could be considered completed and be <strong>closed.</strong></p>
<p>Sometimes - there would be tickets where, as a CSR, you ran to the end of your abilities. Either in what you knew about the issue, or in your ability to troubleshoot and resolve the issue. This is when we'd need to <strong>escalate</strong> the ticket <strong>UP</strong> to a Technical Support Engineer (TSE).</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://media.giphy.com/media/l2SqbG9QAz1Z314Uo/giphy.gif">https://media.giphy.com/media/l2SqbG9QAz1Z314Uo/giphy.gif</a></div>
<p>Prior to escalating, you had to fill out Escalation Notes on the ticket and let the customer know that the ticket was being escalated. Think of Escalation Notes like a sticky note that you'd put on an essay before you handed it off to an advisor. These Escalation Notes were crucial to a TSE because they would be their first point to orient themselves on what's happened, what's happening and what needs to happen. If it was a long-running ticket before escalation, the Escalation Notes would ideally contain the highlights and summary needed to skip reading the rest of the thread. This was especially important if you were transferring an agitated customer from a call who was demanding an escalation because it had taken too long (oops!) -- these escalation notes could be all the TSE had to skim before jumping on to fight the fire.</p>
<p>In this practice, you learned the value of writing good escalation notes quickly. </p>
<p><strong>Bad Escalation Notes</strong> required a TSE to chase after you to get more information or ask clarifying questions. </p>
<p><strong>Really Bad Escalation Notes</strong> missed critical pieces of information, and would be de-escalated for more fact-finding. <em>Not great if you've just told your customer that you're escalating the ticket.</em></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://media.giphy.com/media/WxDZ77xhPXf3i/giphy.gif">https://media.giphy.com/media/WxDZ77xhPXf3i/giphy.gif</a></div>
<p>Oof. Not great when your metrics revolve around the number of interactions you have with a customer, and how long a ticket has been in progress for.</p>
<p>However, a set of <strong>Good Escalation Notes</strong>  -- ones that:</p>
<ul>
<li>leave the TSE with no need to come back to you with any more questions, </li>
<li>have a summary of what's been tried,</li>
<li>contain everything the TSE needs to continue on the ticket; </li>
</ul>
<p>Those are the equivalent of sending a polished bowling ball down a freshly waxed lane. You can see the ticket get accepted and worked on by a TSE and (depending on the issue) quickly move towards resolution. Strike!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644342481149/Y3dRZ5pMN.jpeg" alt="ella-christenson-l6DorjudX64-unsplash.jpg" />
Photo by <a href="https://unsplash.com/@ellabella124?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Ella Christenson</a> on <a href="https://unsplash.com/s/photos/bowling?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>If you wrote good escalation notes, tickets got solved faster. </p>
<p>If you wrote good escalation notes <strong>consistently</strong>, the TSEs would quietly DM you and teach you about the issue and how to solve it in the future, and sometimes - would even give you the answer and send you back to the customer to be able to close the ticket yourself. (In support, being able to work with the customer to close is a pretty satisfying feeling!)</p>
<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644342244677/jhN3-0laK.jpeg" alt="photo-1534551767192-78b8dd45b51b.jpeg" />
Photo by <a href="https://unsplash.com/@camylla93?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Camylla Battani</a> on <a href="https://unsplash.com/s/photos/question?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h1 id="heading-questions-as-escalations">Questions, as Escalations?</h1>
<p>Fast forward a couple of years and I made the leap from Support into Engineering as a Cloud Infrastructure Engineer. I was now a small fish in a huge ocean and had <strong>so, many, questions</strong> to ask. This need for information, paired with a large amount of anxiety &amp; impostor syndrome, was <strong>not</strong> a good combination. While my team-mates were so supportive and always around to answer questions, I started wondering:</p>
<blockquote>
<p>Hmm, what if I started thinking of questions as escalations? Would that make a difference?</p>
</blockquote>
<p>The answer is <strong>yes</strong>.</p>
<p>I noticed that when I started applying the same principles, the following happened:</p>
<ol>
<li>My questions would get answered faster.</li>
<li>My questions would highlight resources and share understanding with the rest of the team.</li>
<li>My team-mates started sharing more about their thought process and how it aligned with my initial steps / research.</li>
</ol>
<p>So with that, here are:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644342330844/sqK3pXfbI.jpeg" alt="photo-1552912276-dde406237918.jpeg" />
Photo by <a href="https://unsplash.com/@zanilic?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Zan</a> on <a href="https://unsplash.com/s/photos/high-five?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h1 id="heading-5-tips-for-asking-better-questions-as-a-junior">5 Tips For Asking Better Questions as a Junior</h1>
<h1 id="heading-freebie-0-imagine-asking-your-question-to-yourself">(Freebie) 0. Imagine Asking Your Question to Yourself.</h1>
<p>After I had gotten a bunch of bad escalations punted back down to me - I started getting into this weird "shadow-boxing" mindset where I would assess my notes from the point of view of a TSE.</p>
<blockquote>
<p>What questions would they ask?</p>
<p>Does this give enough information?</p>
<p>What if they ask about X?</p>
<p>Should I clarify Y?</p>
</blockquote>
<p>Putting yourself in the shoes of the person reading your question bolsters the quality of your question and also allows you to do some self review.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344881289/-I9jT3RJe.png" alt="1.png" /></p>
<h2 id="heading-1-problem-statement-one-liner">1. Problem Statement - One Liner</h2>
<blockquote>
<p>What's the problem? </p>
</blockquote>
<p>In one line, boil down the most important parts of the problem. If there are multiple problems, list the most concerning and make a note that there are other problems (with more info available upon request).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344891538/zTpd8-4XV.png" alt="2.png" /></p>
<h2 id="heading-2-context-desired-end-state">2. Context - Desired End State</h2>
<blockquote>
<p>Why is this a problem? What are you trying to do? What's your desired end-state?</p>
</blockquote>
<p>If the <code>problem-statement</code> is your starting point, the context should explain your desired end-state or where you want to go. Doing this allows the person answering your question to simply focus on charting the line between the two as opposed to narrowing down the problem scope with (as many) follow-ups.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344903146/YdguwFZAO.png" alt="3.png" /></p>
<h2 id="heading-3-steps-taken-qualify-question">3. Steps Taken - Qualify Question</h2>
<blockquote>
<p>What have you already tried and researched?</p>
</blockquote>
<p>This portion of the question is <strong>extremely useful for so many parties:</strong></p>
<ul>
<li><p>For the people reading your question, this can serve as a list of things not to recommend / check because you've already done them - saving time. It also shows that you as the asker have spent time qualifying your question.</p>
</li>
<li><p>If posted in a team / group space - this can highlight resources or context that others may not have known about.</p>
</li>
<li><p>For the asker, this can help you feel more confident about the validity of your question because you see that you have put effort into research, investigation and solving it yourself.</p>
</li>
<li><p>This portion also allows your coach to get a look into your problem-solving and response processes, which will only help hone your instincts for the future.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344910914/t2XS8LQUV.png" alt="4.png" /></p>
<h2 id="heading-4-next-steps-possible-solutions">4. Next Steps - Possible Solutions</h2>
<blockquote>
<p>What are some next steps you'd try or things you'd investigate?</p>
</blockquote>
<p>This step is huge for the person receiving your question.</p>
<p>At best, showing what you've thought about poking next means that they can potentially conserve effort by nudging you in the right direction with some information as opposed to coming up with the whole solution.</p>
<p>At worst, they can disqualify those solutions and explain why - which again, hones your skills for the future.</p>
<p>And sometimes, they might not even know where to start - but this statement might spark their thought process!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344918313/gkqxpZANz.png" alt="5.png" /></p>
<h2 id="heading-5-help-requested-means-for-assistance">5. Help Requested - Means for Assistance</h2>
<blockquote>
<p>What help do you need? When are you available?</p>
</blockquote>
<p>Most team-mates want to help you. However, they're also busy with their own work. Sometimes, they might see your question (and know the answer!) but the task or call in the moment might short-circuit their desire to reach out to you. </p>
<p>If you explain what help you need, and what meetings you're available for (and when) - this could allow a team-mate to acknowledge and set up time for later as opposed to needing to acknowledge <em>and</em> try to come up with a solution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1644344440660/XnSNy70Yq.png" alt="5TipsBetterQuestions.png" /></p>
<h1 id="heading-example-of-a-good-question">Example of a Good Question</h1>
<p>This question is taken from one that I needed to ask last week!</p>
<ol>
<li>Hey, does anyone know how to decrease the retention window for Prometheus-Server?</li>
<li>The disk on Prometheus-Server has filled up and it's not able to send metrics to Grafana</li>
<li>We've tried making changes in the helm charts but they don't appear to be sticking.</li>
<li>I'm probably going to try doing a live <code>kubectl edit</code> on the cluster next, but not sure if that's the best way.</li>
<li>I'm available for a huddle rn if anyone's available, but also good for a Google meet after 1 hour.</li>
</ol>
<hr />
<h1 id="heading-conclusion">Conclusion</h1>
<p>This article is specifically titled <code>Junior</code> and not suffixed with <code>Engineer</code> because while this post is specific to engineers, it can apply to a junior in any field. I validated this with a friend who's in Customer Success, and they mentioned that this is exactly the kind of information they'd want from a Junior Account Manager.</p>
<p>As a more CS tailored example:</p>
<ol>
<li>Does anyone have a good compelling event to upsell startup to enterprise? </li>
<li>I have lots of startup clients who need to generate additional revenue for my portfolio</li>
<li>I've already used the multiproduct benefit pitch, but it hasn't landed</li>
<li>I'm thinking of talking about a change in NIST regulation next year as my next angle</li>
<li>I'm free after 2 today if anyone is free to roleplay</li>
</ol>
<p>I hope this post helps y'all in asking questions, and if there are other points that you'd add - please leave me a comment :) </p>
]]></content:encoded></item><item><title><![CDATA[Oncall Adventures - When your Prometheus-Server mounted to GCE Persistent Disk on K8s is Full]]></title><description><![CDATA[Note to anyone that lands on this page in the middle of an Incident and just ✨ needs the solution ✨ .
Problem: Prometheus-server running in K8s on GCP using a Persistent Volume has run out of disk.
Symptoms:

Grafana shows readings diving off cliff.
...]]></description><link>https://tratnayake.dev/oncall-adventures-prometheus-filled-disk</link><guid isPermaLink="true">https://tratnayake.dev/oncall-adventures-prometheus-filled-disk</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[GCP]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 24 Jan 2022 06:25:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/b0p818k8Ok8/upload/v1643002835984/5tRLYrIiL.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Note to anyone that lands on this page in the middle of an Incident and just ✨ needs the solution ✨ .</p>
<p><strong>Problem:</strong> Prometheus-server running in K8s on GCP using a Persistent Volume has run out of disk.</p>
<p><strong>Symptoms</strong>:</p>
<ul>
<li>Grafana shows readings diving off cliff.</li>
<li>Grafana shows no data.</li>
<li>No screaming from other independent sensors (i.e. other teams)</li>
<li>Logs on Prometheus-server show:<pre><code>target=http://XXX.XXX.XXX.XXX:YYYY/metrics msg="Scrape <span class="hljs-keyword">commit</span> <span class="hljs-keyword">failed</span><span class="hljs-string">" err="</span>write <span class="hljs-keyword">to</span> WAL: <span class="hljs-keyword">log</span> samples: write /<span class="hljs-keyword">data</span>/wal/<span class="hljs-number">00004932</span>: <span class="hljs-keyword">no</span> <span class="hljs-keyword">space</span> <span class="hljs-keyword">left</span> <span class="hljs-keyword">on</span> device
</code></pre></li>
<li>Shelling into Prometheus-server confirms 100% disk usage.</li>
</ul>
<p><strong>Fix:</strong></p>
<ol>
<li>Remove link to filled Persistent Volume.</li>
<li>Mount debug pod onto Persistent Volume.</li>
<li>Clean-up old blocks in <code>/data</code></li>
<li>Kill debug pod and remove link to Persistent Volume</li>
<li>Restart Prometheus-server</li>
<li>Wait 2 hours for pruning to finish </li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643003535040/lCc_322_Y.jpeg" alt="annie-spratt-n3T1gBYgkJo-unsplash.jpg" /></p>
<p>Photo by <a href="https://unsplash.com/@anniespratt?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Annie Spratt</a> on <a href="https://unsplash.com/s/photos/empty-bed?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<blockquote>
<p>PagerDuty Alert. You have 1 triggered notification...</p>
</blockquote>
<p>That's the phone-call I was woken up with this morning at 3:30AM Pacific Time. </p>
<p>This was my first time getting paged in the middle of the night at this job, but the response felt very rehearsed. On-call work is one of the less glamorous, but still important, parts of the job - and if you've done it for a couple of iterations (and been through a couple of high-intensity outages), it becomes just another activity.</p>
<p>As I rolled out of bed and shuffled over to my desk - I remembered something an old mentor told me as a Junior:</p>
<blockquote>
<p>After a while, you'll start seeing the patterns and see that almost every issue you deal with stems from a few specific blueprints.</p>
</blockquote>
<p>Sure enough, the incident we worked through last night had a very common culprit - <strong>resource exhaustion</strong>.</p>
<p>Here are some lessons we learned from this incident:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643003714489/YHHen10oJ.jpeg" alt="launde-morel-4VSsq9ErzOs-unsplash.jpg" />
Photo by <a href="https://unsplash.com/@laundemrl?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Launde Morel</a> on <a href="https://unsplash.com/s/photos/dashboard?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<h2 id="heading-1-if-all-your-gauges-go-dark-but-no-one-screams-its-probably-your-gauges">1. If All Your Gauges Go Dark but No One Screams, It's Probably Your Gauges.</h2>
<p>The majority of our infrastructure runs on Kubernetes (K8s), which is a container orchestration system.</p>
<p>For observability (o11y) - we make use of <a target="_blank" href="https://prometheus.io/docs/introduction/overview/">Prometheus</a>, which scrapes metrics from our containers and then pushes them to Grafana Cloud for data visualization and export.</p>
<p>This is also where our alerts are configured and how we get paged if something seems wrong.</p>
<p>To understand this setup, here are 2 diagrams:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643004589485/qxoH4w3Fd.png" alt="Pasted_Image_2022-01-23__10_05_PM.png" /></p>
<p><strong>You'll need to be in <code>light-mode</code> to see the arrows in this diagram 😞</strong> </p>
<p>And the Grafana agent, which runs on the Prometheus-server and sends the metrics off to Grafana Cloud:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643004644684/BpOBoi5sx.jpeg" alt="agent.jpg" /></p>
<p>The first thing I woke up to was our dashboards showing metrics either:</p>
<ol>
<li>Diving off a cliff; or</li>
<li>Going dark.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642995601691/GC0uKBL-0.png" alt="Observability_Stack_-_Grafana.png" /></p>
<p>One thing I've learned over time is that while our eyes naturally fixate on anomalies in patterns - it's important to take a look at the bigger picture before diving deep into a single graph.</p>
<p>Specifically, in this case - I saw that <em>all</em> of our graphs were showing the same behaviour (either a steep drop, or lack of data).</p>
<p>This, combined with the fact that no-one else was screaming (our product-engineering teams also have their own monitoring set up more specific to their use), gave me a hunch that these readings might be an issue with our observability into the system rather than a reflection of the system itself.</p>
<p>To confirm this, I wanted to test one of the claims from our monitoring system.</p>
<blockquote>
<p>All CPU usage has dropped, memory usage has dropped, your containers are probably dead in the water.</p>
</blockquote>
<p>So I logged into the cluster, and thankfully - I found that our pods and containers were swimming along just fine.</p>
<p>The combination of these 3 factors:</p>
<ol>
<li>All metrics dropping at the same time;</li>
<li>No other alert of issues from an independent source (i.e. a product engineer);</li>
<li>The core claim of large scale outage being confirmed false</li>
</ol>
<p>Led me to the following assumptions:</p>
<ol>
<li>The infrastructure is still okay.</li>
<li>The monitoring system is probably degraded.</li>
<li>We're not getting any more data.</li>
</ol>
<p>Let's poke at the monitoring system!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643003849536/IMWCS4kTw.jpeg" alt="pexels-mathias-pr-reding-6966060.jpg" />
Photo by Mathias P.R. Reding from Pexels</p>
<h2 id="heading-2-kubernetes-is-just-a-wrangler-for-your-containers-they-still-need-to-eat">2. Kubernetes is Just a Wrangler for Your Containers, They Still Need to Eat.</h2>
<p>Kubernetes just co-ordinates your containers to get them housed (scheduled) and fed (resourced). If it can't, Kubernetes will try its best - but it doesn't (and can't) send notifications. This is where a monitoring solution like Prometheus comes in. It reports a constant stream of data about what it sees from looking at your app and sends it to Grafana for further visualization and alerting.</p>
<p>In this case, it just so happens that the monitoring system was what was degraded. But how?</p>
<p>Prometheus runs as pods on the cluster and essentially lives to scrape metrics from other containers and send them out. At the end of the day, it's just another creature (container) that needs food (resources) to live (do its job). </p>
<p>The 4 main resources that any container needs, organized by what they do, are:</p>
<ol>
<li>The ability to <strong>do</strong> things - processing, <strong>CPU</strong></li>
<li>The ability to use short-term memory - memory, <strong>RAM</strong></li>
<li>The ability to use long-term memory and <strong>carry</strong>  assets (like a backpack) -  <strong>DISK</strong></li>
<li>The ability to talk to others - networking (an IP, port, socket; actually, networking has a few more requirements).</li>
</ol>
<p>If any of these 4 resource requirements are not met, your workloads will fail.</p>
<p>While I'd love to say I have a tried and true method for checking all of these 4 things, the problem in this incident was uncovered by shelling in and taking a look at a couple of things.</p>
<p>One thing that really helped was the use of K9s - which makes it very easy to see the state of Kubernetes objects when listing them. For example when listing pods, you can see whether they’re running or experiencing issues from the same screen rather than having to list and then describe. A small but appreciated efficiency.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1642997875718/aUo6erEV2.png" alt="k9s.png" /></p>
<p>It also allows you to quickly view the logs of a container and see what's going on.</p>
<p>For me, I was able to find the smoking gun quickly (and luckily) by checking the logs of the <code>prometheus-server</code> which was pumping out the following message a couple of hundred times per second:</p>
<pre><code>target=http://XXX.XXX.XXX.XXX:YYYY/metrics msg="Scrape <span class="hljs-keyword">commit</span> <span class="hljs-keyword">failed</span><span class="hljs-string">" err="</span>write <span class="hljs-keyword">to</span> WAL: <span class="hljs-keyword">log</span> samples: write /<span class="hljs-keyword">data</span>/wal/<span class="hljs-number">00004932</span>: <span class="hljs-keyword">no</span> <span class="hljs-keyword">space</span> <span class="hljs-keyword">left</span> <span class="hljs-keyword">on</span> device<span class="hljs-string">"</span>
</code></pre><p>⚠️<code>no space left on device</code>⚠️</p>
<p>This was confirmed by a quick <code>df</code>, which showed:</p>
<pre><code>Filesystem           1K<span class="hljs-operator">-</span>blocks      Used Available Use<span class="hljs-operator">%</span> Mounted on
[...]
<span class="hljs-operator">/</span>dev<span class="hljs-operator">/</span>sdb              <span class="hljs-number">51290592</span>  <span class="hljs-number">48898732</span>   <span class="hljs-number">2375476</span>  <span class="hljs-number">100</span><span class="hljs-operator">%</span> <span class="hljs-operator">/</span>data
</code></pre><p><strong>Alright, so our disk is full - how do we fix this?</strong></p>
<p>I'll tell you how you shouldn't:</p>
<h3 id="heading-1-do-not-move-blocks-from-data-into-other-drives">1. Do <strong>NOT</strong> move blocks from <code>/data</code> into other drives.</h3>
<p>While this shell-game <em>will</em> buy you some time - the WAL (Write-Ahead Log) will continue to fill, and you will simply be prolonging the same issue.</p>
<p>More on the WAL:</p>
<blockquote>
<p>The current block for incoming samples is kept in memory and is not fully persisted. It is secured against crashes by a write-ahead log (WAL) that can be replayed when the Prometheus server restarts. Write-ahead log files are stored in the wal directory in 128MB segments. These files contain raw data that has not yet been compacted; thus they are significantly larger than regular block files. Prometheus will retain a minimum of three write-ahead log files. High-traffic servers may retain more than three WAL files in order to keep at least two hours of raw data.  - <a target="_blank" href="https://prometheus.io/docs/prometheus/latest/storage/">Ref</a></p>
</blockquote>
<p>You should be <em>very</em> careful when fiddling with WAL files, as corruption will leave Prometheus unable to restart cleanly.</p>
<h3 id="heading-2-do-not-resize-the-persisted-volume-by-editing-the-the-deployment-spec-on-the-fly">2. Do <strong>NOT</strong> resize the Persistent Volume by editing the deployment spec on the fly.</h3>
<p>Both of these actions led to the following outcome when trying to restart Prometheus:</p>
<pre><code>err="opening storage failed: <span class="hljs-keyword">repair</span> corrupted WAL: cannot handle <span class="hljs-keyword">error</span>: <span class="hljs-keyword">open</span> WAL <span class="hljs-keyword">segment</span>: <span class="hljs-number">0</span>: <span class="hljs-keyword">open</span> /prometheus/wal/<span class="hljs-number">00000000</span>: <span class="hljs-keyword">no</span> such <span class="hljs-keyword">file</span> <span class="hljs-keyword">or</span> <span class="hljs-keyword">directory</span><span class="hljs-string">"</span>
</code></pre><p><strong>Well, what now?</strong></p>
<p>While I wish I could say we knew exactly what to do and arrived at the solution immediately - there were a couple more learning experiences. We ended up:</p>
<ol>
<li>Attempting to restart the Prometheus-server in a zombie state so that we could remove that corrupted WAL.<ul>
<li>We couldn't. The container would die immediately before we could shell in.</li>
</ul>
</li>
<li>Killing the Prometheus-server deployment and hoping it would restart cleanly.<ul>
<li>It didn't.</li>
</ul>
</li>
</ol>
<p>Finally, we:</p>
<ul>
<li>Deleted the Persistent Volume that was maxed out.</li>
<li>Re-deployed the Prometheus-server via Helm chart using our CI/CD.</li>
</ul>
<p>Which, after doing some more research, turns out to be just a clumsier and more long-winded way of doing exactly what the Prometheus docs tell you to do:</p>
<blockquote>
<p>If your local storage becomes corrupted for whatever reason, the best strategy to address the problem is to shut down Prometheus then remove the entire storage directory. You can also try removing individual block directories, or the WAL directory to resolve the problem. Note that this means losing approximately two hours data per block directory. Again, Prometheus's local storage is not intended to be durable long-term storage; external solutions offer extended retention and data durability. - <a target="_blank" href="https://prometheus.io/docs/prometheus/latest/storage/">Reference</a></p>
</blockquote>
<p>cool. cool. cool. cool. v nice.</p>
<p>👍 </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643003949991/fnpgYvTHs.jpeg" alt="pexels-jessica-lewis-creative-3361486.jpg" />
Photo by Jessica Lewis Creative from Pexels</p>
<h2 id="heading-3-persistent-volumes-andamp-persistent-volume-claims-there-can-only-be-one">3.  Persistent Volumes &amp; Persistent Volume Claims - There Can Only be One</h2>
<p>In Kubernetes, you can mount volumes on your containers. There are a whole bunch of different mounts that <a target="_blank" href="https://kubernetes.io/docs/concepts/storage/volumes/">can be configured</a> and in our case, we were using a GCE Persistent Disk which is a type of Persistent Volume.</p>
<p>A Persistent Volume (PV) is one where the disk is linked to block storage that can survive if a container, pod or even deployment goes down. So in our case, even if our prometheus-server goes down - the data it saw will be available on disk. Kinda like an off-site black-box.</p>
<p>When dealing with Persistent Volumes, it's important to understand their relationship with Persistent Volume Claims (PVC).</p>
<p>A Persistent Volume Claim specifies a request for storage by a user. It can then be used within a Deployment spec as the volume to mount for a container. </p>
<p>The important thing to note is that there can only be 1 ReadWrite mount per Persistent Volume (more on that later).</p>
<p>This is what the yaml for a PVC looks like:</p>
<pre><code><span class="hljs-attribute">apiVersion</span>: v1
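<span class="hljs-attribute">kind</span>: PersistentVolumeClaim
<span class="hljs-attribute">metadata</span>:
  <span class="hljs-attribute">name</span>: prometheus-server
  <span class="hljs-attribute">namespace</span>: sre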
<span class="hljs-attribute">spec</span>:
  <span class="hljs-attribute">accessModes</span>:
    - ReadWriteOnce
  <span class="hljs-attribute">resources</span>:
    <span class="hljs-attribute">requests</span>:
      <span class="hljs-attribute">storage</span>: <span class="hljs-string">"50Gi"</span>
</code></pre><ul>
<li>Important to note that in GKE - having this included in a PVC automagically takes care of provisioning a volume in the background.</li>
</ul>
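<p>That background provisioning is driven by a StorageClass. Below is a minimal sketch of one, not our actual config: it assumes the cluster uses GKE's Compute Engine Persistent Disk CSI driver, and the class name <code>expandable-pd</code> and the disk type are hypothetical. The part worth noticing is <code>allowVolumeExpansion</code>, which is what permits a bound PVC to be grown later.</p>
<pre><code># Sketch only: the name and disk type are hypothetical.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-pd
# Assumes the GKE Persistent Disk CSI driver is enabled on the cluster.
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
# Allows a bound PVC to request more storage after creation.
allowVolumeExpansion: true
</code></pre>
<p>A PVC would opt in with <code>storageClassName: expandable-pd</code> in its <code>spec</code>. With expansion allowed, growing the disk means raising <code>resources.requests.storage</code> on the PVC itself; as far as I know nothing in Kubernetes does that automatically, so it is still a manual (or scripted) step.</p>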
<p>And here is the yaml for our Prometheus server mounting that persistent volume as <code>storage-volume</code>:</p>
<pre><code>apiVersion: apps<span class="hljs-operator">/</span>v1
kind: Deployment
metadata:
  labels:
    [...]
  name: prometheus<span class="hljs-operator">-</span>server
  namespace: sre
spec:
  selector:
    matchLabels:
      [...]
  replicas: <span class="hljs-number">1</span>
  template:
    [...]
    spec:
      enableServiceLinks: <span class="hljs-literal">true</span>
      serviceAccountName: prometheus<span class="hljs-operator">-</span>server
      containers:
        [...]
        <span class="hljs-operator">-</span> name: prometheus<span class="hljs-operator">-</span>server
          [...]
          volumeMounts:
            [...]
            <span class="hljs-operator">-</span> name: <span class="hljs-keyword">storage</span><span class="hljs-operator">-</span>volume
              mountPath: <span class="hljs-operator">/</span>data
              subPath: <span class="hljs-string">""</span>
            [...]
      [...]
      volumes:
        [...]
        <span class="hljs-operator">-</span> name: <span class="hljs-keyword">storage</span><span class="hljs-operator">-</span>volume
          persistentVolumeClaim:
            claimName: prometheus<span class="hljs-operator">-</span>server
</code></pre><p>The reason this is important to understand is because there was a fix we could have done.</p>
<p>Refer back to the first thing we tried when we got:</p>
<pre><code>err="opening storage failed: <span class="hljs-keyword">repair</span> corrupted WAL: cannot handle <span class="hljs-keyword">error</span>: <span class="hljs-keyword">open</span> WAL <span class="hljs-keyword">segment</span>: <span class="hljs-number">0</span>: <span class="hljs-keyword">open</span> /prometheus/wal/<span class="hljs-number">00000000</span>: <span class="hljs-keyword">no</span> such <span class="hljs-keyword">file</span> <span class="hljs-keyword">or</span> <span class="hljs-keyword">directory</span><span class="hljs-string">"</span>
</code></pre><blockquote>
<ol>
<li>Attempting to restart the Prometheus-server in a zombie state so that we could remove that corrupted WAL (which ended up being futile).</li>
</ol>
</blockquote>
<p>We wanted to get the Prometheus-server up because we thought that was our only way to touch the volume and delete the corrupted WAL, but we couldn't get it up and running long enough to shell into it 😠 .</p>
<p>Looking back at it now, something we should have tried is what this person recommended in <a target="_blank" href="https://github.com/kubernetes/test-infra/issues/20439#issuecomment-759119197">a GitHub issue</a>, which is to:</p>
<blockquote>
<p>Use a different container to mount and clean the Persistent Volume</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643004167089/1aZ3b2HxB.jpeg" alt="pexels-mati-mango-6330644.jpg" />
Photo by Mati Mango from Pexels</p>
<h2 id="heading-4-how-to-create-a-debug-pod-and-mount-it-to-a-persistent-volume">4. How to Create a Debug Pod and Mount it to a Persistent Volume</h2>
<p>A GCE Persistent Disk is like a hard drive. </p>
<blockquote>
<p>Just like you can't plug a single hard-drive into two computers at one time.</p>
</blockquote>
<p>A key constraint of a GCE Persistent Disk is that it can only be mounted in <code>ReadWrite</code> mode to a <strong>single</strong> node at any given time. (However, you <em>can</em> have multiple <code>ReadOnly</code> mounts)</p>
<p>Therefore, in order to get in and clean it, you need to follow this order of operations:</p>
<h3 id="heading-1-remove-any-existing-associations-from-prometheus-server-to-the-volume">1. Remove any Existing Associations (from Prometheus-server) to the Volume.</h3>
<p>This can be done by either scaling down the Prometheus-server replicas from 1 to 0, or simply deleting the deployment.</p>
<h3 id="heading-2-create-a-debug-pod-running-alpine-to-mount-the-disk">2. Create a Debug pod (running Alpine) to Mount the Disk.</h3>
<p>If I were to do it again I'd create a Debug pod file like this:</p>
<pre><code><span class="hljs-attribute">apiVersion</span>: v1
<span class="hljs-attribute">kind</span>: Pod
<span class="hljs-attribute">metadata</span>:
  <span class="hljs-attribute">name</span>: debug
<span class="hljs-attribute">spec</span>:
  <span class="hljs-attribute">containers</span>:
  - <span class="hljs-attribute">name</span>: debug-container
    <span class="hljs-attribute">image</span>: <span class="hljs-attribute">alpine</span>:latest
    <span class="hljs-attribute">imagePullPolicy</span>: Always
    <span class="hljs-attribute">args</span>: [<span class="hljs-string">"tail"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"/dev/null"</span>]
    <span class="hljs-attribute">volumeMounts</span>:
    - <span class="hljs-attribute">mountPath</span>: /data
      <span class="hljs-attribute">name</span>: storage-volume
  <span class="hljs-attribute">volumes</span>:
  - <span class="hljs-attribute">name</span>: storage-volume
    <span class="hljs-attribute">persistentVolumeClaim</span>:
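      <span class="hljs-comment"># Must be the claim bound to the full volume (for us, prometheus-server).</span>
      <span class="hljs-attribute">claimName</span>: mount-for-debug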
</code></pre><ul>
<li>The <code>tail -f /dev/null</code> is a way to ensure that the container stays up so you can shell into it.</li>
</ul>
<h3 id="heading-3-shell-into-persistent-volume-and-clean-up">3. Shell into Persistent Volume and Clean-Up</h3>
<ul>
<li>Delete the corrupted WAL files and blocks that are filling up the disk (preferably from oldest date onwards).</li>
</ul>
<h3 id="heading-4-clean-up-debug-pod-and-sever-link-to-pv">4. Clean-up Debug pod and sever link to PV</h3>
<p>Simply delete the Debug pod.</p>
<h3 id="heading-5-redeploy-the-prometheus-server">5. Redeploy the Prometheus-server</h3>
<p>Which should now come up successfully as it has access to a drive with free space again ✅</p>
<h1 id="heading-how-could-we-prevent-this-from-happening-in-the-future">How could we prevent this from happening in the future?</h1>
<ol>
<li>Add alerts on disk usage on Prometheus-server</li>
<li>Allow the volumes to auto-expand.</li>
</ol>
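<p>For the first item, here is a rough sketch of what a disk-usage alert could look like as Prometheus-style alerting rules. It assumes the kubelet volume metrics (<code>kubelet_volume_stats_available_bytes</code> and <code>kubelet_volume_stats_capacity_bytes</code>) are being scraped; the group name, alert names, thresholds, durations and severities are all hypothetical and would need tuning.</p>
<pre><code># Sketch only: names, thresholds and durations are hypothetical.
groups:
  - name: prometheus-storage
    rules:
      # Fire while there is still room to react (less than 15% free).
      - alert: PrometheusVolumeAlmostFull
        expr: |
          kubelet_volume_stats_available_bytes{namespace="sre", persistentvolumeclaim="prometheus-server"}
            / kubelet_volume_stats_capacity_bytes{namespace="sre", persistentvolumeclaim="prometheus-server"}
            &lt; 0.15
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Prometheus data volume has less than 15% free space"
      # Fire if the last 6h trend would hit zero free bytes within 24h.
      - alert: PrometheusVolumeFillingUp
        expr: |
          predict_linear(kubelet_volume_stats_available_bytes{namespace="sre", persistentvolumeclaim="prometheus-server"}[6h], 24 * 3600) &lt; 0
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Prometheus data volume is on track to fill up within a day"
</code></pre>
<p>The early thresholds matter here: as this incident showed, once the disk is actually full the Prometheus-server stops ingesting, so an alert that only fires at 100% would depend on data that never arrives.</p>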
<p>More on this in another blog post!</p>
<hr />
<p>If you made it this far, here's the picture from my desk as the first rays of sun crept in through my blinds as we wrapped up the incident... at 0800!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643003168031/Lp1M98RBe.jpeg" alt="IMG_1007.JPG" /></p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Hello, World!]]></title><description><![CDATA[Howdy, my name is Thilina ("Thi-Li-Nuh") and I'm a dev...ish. Let's see what I write about!]]></description><link>https://tratnayake.dev/hello-world</link><guid isPermaLink="true">https://tratnayake.dev/hello-world</guid><category><![CDATA[Hello World]]></category><category><![CDATA[Blogging]]></category><category><![CDATA[Developer Blogging]]></category><dc:creator><![CDATA[Thilina Ratnayake]]></dc:creator><pubDate>Mon, 24 Jan 2022 03:15:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/LSuIc8Riv9I/upload/v1642994006272/hD9TkAbbtg.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Howdy, my name is Thilina ("Thi-Li-Nuh") and I'm a dev...ish. Let's see what I write about!</p>
]]></content:encoded></item></channel></rss>