Ramblings of a Coder's Mind

intelligent Engineering: Principles for Building With AI

2025-11-06T00:00:00+05:30

Software engineering is changing — again. Not in a loud, overnight way, but in a quiet structural way that’s already reshaping how good teams build software.

We aren’t replacing engineers. We are upgrading the way engineers think, work, and build.

AI isn’t a shortcut to avoid the hard parts — it simply shifts where the hard parts are.

Over 2 years applying AI in prototyping, experimentation, internal tools, production systems, and team workflows — one thing has become very clear:

AI doesn’t make engineering easier. It makes disciplined engineering more valuable.

Great teams are not the ones who “use AI everywhere.” They’re the ones who use AI well — with clarity, responsibility, and intent.

Below is a working set of principles we’ve found useful for building in this new environment. They aren’t commandments, they aren’t finished, and they will evolve — just like our tools will.

But they help keep us grounded in the parts of engineering that matter.

intelligent Engineering Principles

These principles fall into two buckets — what is new, and what remains timeless but more important than ever.

AI-Native Principles

Principles AI use creates or transforms — they wouldn’t exist without it.

AI augments, humans stay accountable.

AI can extend your reach, accelerate your ideas, and surface possibilities you may not see, but it cannot own the outcome. Engineering judgment, ethical responsibility, and decision-making stay with us. Tools assist; humans remain answerable.

Context is everything.

AI outputs only reflect the clarity, completeness, and structure of the input. If we want meaningful results, we must bring meaningful context — not vague requests. Better thinking in produces better thinking out.

Learn how to manage context well. The larger the system, the more important it becomes to build discipline and practices around how context should be managed. Good engineering practices can help ensure new teammates get AI systems primed with up to date and correct context for every project. These practices also help ensure the system stays up to date. If the context is too large for your model to hold, teams should engineer solutions around it like optionally loaded markdowns or, at larger scales, RAG.

Smarter AI needs smarter guardrails.

As generation gets faster, review must become sharper. Code, ideas, and architectures produced by AI still demand rigorous validation for quality, safety, and alignment with intent. The faster we move, the stronger our checks must be.

Shape AI deliberately.

Don’t let generic tooling decide how your team works. Choose where AI fits, what it should influence, and how it should be used to support — not reshape — your engineering culture. Intentional adoption prevents accidental dependencies being created.

Learning never stops.

AI practices evolved weekly; now they evolve monthly. This is still faster pace than many, if not most are used to. Teams that keep experimenting, reflecting, and adapting stay ahead. Treat AI as a moving system — one that rewards curiosity, continuous improvement, and lightweight experimentation. What didn’t work a few months ago might be possible now and the only way you will know is if you experiment.

Timeless Foundations — Reaffirmed for the AI Era

Good development sense that now matters even more with AI in the loop.

Learn fast, adapt continuously

Start small, validate often, and tighten feedback loops to ensure AI continues to deliver real value.

Sustainable Value over fleeting output.

Unmaintainable, insecure, and rigid solutions waste time and money. Always prioritise building the right value over building the wrong one fast.

What This Looks Like in Practice

This isn’t theory. Here’s what it means day-to-day on an engineering team:

We use AI to explore ideas, but we validate assumptions ourselves.
We generate code fast, then review it twice as hard.
We experiment constantly — but scale only what works.
We write clearer problems, not just faster code.
We design systems with longevity in mind — not convenience today and chaos tomorrow.

This is not “old engineering vs new engineering.” It’s the next chapter of the same story: build well, stay curious, stay accountable.

AI doesn’t remove the craft of engineering. It multiplies the importance of the engineer.

Building the Skills of an Intelligent Engineer

Principles shape how we think. Skills shape what we can do with that thinking.

To build effectively with AI, engineers need to understand not just how to prompt, but how these systems work underneath.

Mastering these skills turns AI from a black box into a design partner. That’s the real craft of intelligent engineering.

Core Practices

Prompt Engineering and Context Engineering are the new craftsmanship of AI-era software building.

It’s no longer about “writing the right prompt” — it’s about structuring intent, constraints, and information so that the model understands your problem the way you do.

Deeper Understanding

To use AI tools responsibly and creatively, engineers should understand the mechanics: how tokens, embeddings, and vector spaces shape what the model “remembers,” “understands,” and “forgets.”

This isn’t about becoming an ML engineer — it’s about having the literacy to reason about your tools.

System Design for AI

Modern AI systems go beyond single prompts.

Concepts like vector search, retrieval-augmented generation (RAG), and agents define how context flows and how reasoning chains form.

Engineers should learn to design with prompt libraries, multi-agent orchestration, and feedback loops that adapt over time.

Why This Matters

Teams that adopt AI without principles create:

Frail systems masked by fast prototypes
Blind trust disguised as speed
Complexity that compounds silently
Teams that stop thinking deeply because “the model knows”

And then they pay for it later — painfully.

Teams that adopt AI with principles:

Build faster and safer
Think more clearly, not less
Use tools to enhance judgment, not bypass it
Ship meaningful, durable systems
Become harder to compete with over time

The future is not “AI builds everything.”

The future is AI-raised engineers who build better than before.

A Closing Thought

Agile reshaped how we deliver. AI is reshaping how we think while we deliver.

Who are we?

Not Prompt Writers.
Not Tool Operators.
intelligent Engineers

We’re just at the beginning. These principles will evolve. If you’d like to build this thinking together — I’d love to hear your take. What principle would you add or challenge?

Level Up Code Quality with an AI Assistant

2025-07-29T00:00:00+05:30

Using AI coding assistants to introduce, automate, and evolve quality checks in your project.

I have talked about teams needing to have a world class developer experience as a pre-requisite for a well functioning team. When teams lack such a setup, the most common response is a lack of time or buy in from stakeholders to build these things. With AI coding assistants being readily available to most developers today, the engineering effort and the cost investment for the business lesser reducing the barrier to entry.

Current State

This post showcases an actual codebase that has not been actively maintained for over 5 years but runs a product that is actively used. It is business critical but did not have the necessary safety nets in place. Let us go through the journey, prompts inclusive, on how to make the code quality of this repository better, one prompt at a time.

The project is a Django backend application that exposes APIs. We start off with a quick overview of the code and notice that there are tests and some documentation but a lack of consistent way to run and test the application.

The Journey

I am assuming you are running these commands using Claude Code (with Claude Sonnet 4 in most cases). This is equally applicable across any coding assistant. Results will vary based on your choices of models, prompts and the codebase.

Setting up Basic Documentation and Some Automation

If you are using a tool like Claude Code, run /init in your repository and you will get a significant part of this documentation.

Can you analyse the code and write up documentation in README.md that
 clearly summarises how to setup, run, test and lint the application.
Please make sure the file is concise and does not repeat itself. 
Write it like technical documentation. Short and sweet.

Next step is to start setting up some automation (like just files) to help make the project easier to use. This will take a couple of attempts to get right but here is a prompt you can start off with

Please write up a just file. I would like the following commands
`just setup` - set up all the dependencies of the project
`just run` - start up the applications including any dependencies
`just test` - run all tests
If you require clarifications, please ask questions. 
Think hard about what other requirements I need to fulfill. 
Be critical and question everything. 
Do not make code changes till you are clear on what needs to be done.

This will give you a base structure for you to modify quickly and get up and running. If you README.md has a preferred way to run the application (locally vs docker), the just will automatically use it. If not, you will have to provide clarification.

Setting up pre-commit for Early Feedback

Let’s start small and build on it.

Please setup pre-commit with a single task to run all tests on every push.
Update the just script to ensure pre-commit hooks are installed locally
 during the setup process.

We probably didn’t need to be this explicit but I find managing context and keeping tasks small mean I move a lot quicker.

Curating Code Quality Tools

Lets begin by finding good tools to use, create a plan for the change and then execute the plan. Start off by moving Claude Code to Plan mode (shift+tab twice)

What's a good tool to check the complexity of the python code this
 repository has and lint on it to provide the team feedback as a 
 pre-commit hook?

It came back with a set of tools I liked but it assumed that the commit will immediately go green. In an existing large codebase with tech debt, this will not happen. Let’s break this down further.

The list of tools you're suggesting sound good. 
The codebase currently will have a very large number of violations. 
I want the ability to incrementally improve things with every commit. 
How do we achieve this?

Creating a Plan

After you iterate on the previous prompt with the agent, you will get a plan that you’ll be happy with. The AI assistant will ask for permission to move forward and execute the plan but before doing so, it will be worth creating a save state. Imagine this as a video game save, if something goes wrong, come back and restore from this point. This also allows you to clear context since everything is dumped to markdown files on disk.

Can you create a plan that is executable in steps?
Write that plan to `docs/code-quality-improvements`.
Try to use multiple background agents if it helps speed up this process.

Give it a few minutes to analyse the code. In my case, the following files were created. README.md says that “Tasks within the same phase can be executed in parallel by multiple Claude Code assistants, as long as prerequisites are met”. You are ready to hit /clear and clear out the context window.

Phase 1 sets up the basic tools, phase 2 configures them, phase 3 focuses on integration and automation and phase 4 adds monitoring and focuses on improving the code quality.

Before executing the plan, I commit the plan (docs/code-quality-improvement). This allows me to track any changes that have been made. When executing the plan, I do not check in the changes made to the plan. This allows me to drop the plan at the end of the process. As a team, we have discussed potentially keeping the plan around as an artifact. To do so, you would have to ask Claude Code to use relative paths (it uses absolute paths when asking for files to be updated in the plan).

Executing the Plan

I would like to improve code quality and I have come up with a plan to do 
so under `docs/code-quality-improvement`.
Can you analyse the plan and start executing it? The `README.md` has a 
quick start section which tasks about how to execute different phases of the 
plan. As you execute the plan, mark tasks as done to track state.

You will notice that Claude Code will add dependencies to requirements-dev.txt and try to run things without installing them. Also, it will add dependencies that do not exist. Stop the execution (by pressing Esc ) and use the following prompt to course correct

For every pip dependency you add to `requirements-dev.txt`, please run 
`pip install`. 
Before adding a dependency to the dependency file, please check if it is 
available on `pip`.

Once phase 1 and phase 2 of the plan are complete, the following files are created and ready to be committed.

When the quality gates are added on phase 3, run the command once to test if everything works and create another commit. After this, I had to prompt it once more to integrate the lint steps into a simplified developer experience.

Please add `just lint` as a command to run all quality checks

Test the brand new lint command and then run a commit. Ask claude code to proceed to phase 4.

You might see Claude Code doubt a plan that it has created. It is a good question because the system is functional but if we prefer the more advanced checks, we should request it pushes on with Phase 4 implementation.

After phase 4, we have a codebase that checks for code quality every time a developer is pushing code. Our repository has pre-commit hooks for linting, runs all quality checks once before pushing. The quality checks will fail if the code added has unformatted files, imports in the wrong order, flake8 lint issues or functions with higher code complexity. It checks this only in the files being touched (because we told it that we had debt that needs to be reduced and all checks will not pass by default)

You still have debt, lets go over fixing this in the next step.

Fixing Existing Debt

Tools like isort can highlight issues and fix them. You should start off running such commands to fix the code. On most codebases, this will touch almost all of the files. The challenge with this is that all the issues that cannot be fixed automatically (like wildcard imports) will need to be fixed manually. This is where you make a choice either to fix issues manually or automatically. If you’re using Claude Code to fix these issues and there is a large number, you’re probably going to pay in upwards of $10 for this session on any decent sized codebase. I recommend moving to GitHub Copilot’s agent to help push down costs here.

Ask your coding assistant of choice to run the lint command and fix the issues. Most of them will stop after 1–2 attempts because the list is large. You can tell it to “keep doing this task till there are no linting errors left. DO NOT stop till the lint command passes”. If your context file (CLAUDE.md) does not talk about how to lint, be explicit and tell your coding assistant what the command to be run is.

What is Left?

If you look at the gradual-tightening task, it created a command to analyse the code and keep being gradually more strict. This command can either be run manually or automatically on a pipeline. One of the parameters it changes is the max-complexity which is set to 20 by default. This complexity will be reduced over a period of time. Similarly, the complexity check tasks have a lower bar to begin with and should be improved periodically to tighten the quality guidelines on this repository.

While our AI coding pair has helped design and improve the code quality to a large extent, the last mile has to be walked by all of our teammates. We now have a strong feedback mechanism for bad code that will fail the pipeline and stop code from being committed or pushed. The last bit requires team culture to be built. On one of my teams, we had a soft check in every retro to see if every member had made the codebase a little bit better in a sprint. A sprint is 10 days and “a little bit” can include refactoring a tiny 2–3 line function and making it better. The bar is really low but the social pressure of wanting to make things better motivated all of us to drive positive change.

Having a high quality codebase with a good developer experience is not a pipe dream and making it a reality is easier than ever with AI coding assistants like Claude Code or Copilot. What have you been able to improve recently? 😃

How to choose your coding assistants

2025-07-17T00:00:00+05:30

Why it’s harder for a professional developer to use a tool despite the wide variety of choices

Coding assistants like Cursor, Windsurf, Claude Code, Gemini CLI, Codex, Aider, OpenCode, JetBrains AI etc. have been making the news for the last few months. Yet, the choice of tools is a lot harder and limited for some of us than it seems.

TL;DR: OpenCode > Claude Code > Aider > Copilot > *

Understanding the tools

Not all tools are created equal. Tools evolve fairly rapidly so the examples listed here might be invalid fairly soon.

You can plot the different types of coding assistants on a graph showcasing the amount of human involvement required (lesser involvement = more automation). The first GitHub Copilot release I used allowed tab completions. It would either complete single lines or entire blocks of code. You could describe your intent by creating a function with a good name or by writing a comment. GitHub Copilot then supported inline prompting or chat sessions.

Coding agents are the current state of the art toolset for most developers on a day to day basis. They allow you to have conversations with them and you should treat them as team mates, albeit ones with anterograde amnesia.

Some problems can be parallelised and background agents triggered locally are incredibly powerful. Claude code supports subagents is frequently used for analysis and solving multiple issues in parallel using git worktrees. Similarly, some people hook up agents to remote instances for things like code reviews using Claude code or Copilot.

The extreme version of this is pure vibe coding. There is enough content out there about why this is a bad idea and the number of issues on real systems because of this.

Challenges with using these tools

When picking up a tool, I have started looking at different aspects of these tools

Choice of models

LLMs change quite quickly. Claude Sonnet 3.7 started off being the favourite model for most developers I know. When Claude Sonnet 4 came out at the same cost as 3.7, it became the new favourite model. Claude Opus 4 is great for larger codebases but expensive.

As I write this (mid-July 2025), the word on the street is that Grok 4 is currently the best model on the block. Choose something that has good coding insights and a large context window. Claude Sonnet has some of the smaller context windows but is tuned quite well for software development.

Cursor supports most of the best models and provides diversity. Tools like Claude Code and Gemini CLI are built and maintained primarily for use with a single model.

Ease of use

This one is fairly subjective and dependent on the developer’s preference. Tools like Cursor are VS Code forks and thus provide tight integration with the editor. Others like Claude Code, Codex and Gemini CLI run on the terminal. Claude Code provides decent integration with the IDEs from the JetBrains family and thus provide good support to pair with your AI assistant.

Speed factors into ease of use too. While Jetbrains AI is the best integrated tool amongst all of these (if you prefer using their IDEs), their AI tool is one of the slowest. Slower tools mean slower feedback cycles. Slower feedback cycles are some of the worst things for dev experience.

Cost per change

Cost pays a huge part in someone’s choice of tools and running LLMs are fairly expensive to run. Most tools charge you per use, some by tokens, some by APIs. Since we’re in the relatively early days of these tools and they are competing to capture the market, some still provide fixed investment offers in exchange for “unlimited plans.

Cursor used to be $20/month with unlimited usage till June 2025. While all “unlimited” usage is rate limited, if the usage limits are generous or the rate limits are not severe, users can manage to have a decent developer experience. More recently, Cursor updated their prices to make the $20/month Pro plan for “light users”. Daily users are recommended to use their $60/month Pro+ plan and power users are recommended to use their $200/month Ultra plan. Users on reddit have complained about how the Ultra plan is insufficient, though Cursor’s documentation says that it should be sufficient. This seems to primarily be because of heavy Claude Opus 4 usage, one of the most expensive models.

Another fixed usage tool is Claude Code for individuals with it’s Pro and Max plans. The $100/month Max plan seems to be the sweet spot for most heavy users and is probably the best value for money, at least until you look at the licensing.

Google’s Gemini CLI, at launched, announced the most insane free tier (that allows you to spend an estimated $620/day) but at the cost of training on your projects. More on this, in the next section. The free tier might not be this generous forever so if the “training on your data” bit isn’t a concern, enjoy Google’s generosity.

IP ownership indemnity and licensing

Licensing is a complicated topic and I go off of the advice that people much more qualified than me give in this space. The current understanding of this space is that you want to be on

company licensing (avoid individual licenses)
a tool that does not train on your data
provides you indemnity against IP claims

You should avoid individual licenses since the protections usually apply to you, not the organisation you work for. If you work with a services company and create IP for your clients, you want to avoid the risk of the protections not covering your clients.

Avoid tools that train on your data if you’re building something commercially. If you’re on a FOSS tool/system, you can ignore this fact. Google Gemini CLI’s free tier is a great example of this. They get to use your data to make the system better in exchange for you having a good coding assistant free of cost.

Anthropic, the creator of Claude Code, indemnifies its commercial users against lawsuits. Most other tools tend to do this too. Interestingly, Cursor does not, at least as of the writing of this article. Their MSA provides this protection, however, they only do this for customers signing up for more than 250 seats. This may change in the future and talking to their support is the best way to clarify this.

For team members who are new to using coding assistants, start off with Copilot where users will appreciate the fixed cost. Learn, experiment. Strengthen your core skills in this new world: Prompt Engineering and Context Engineering (more on these skills in another blog).

When you have mastered these skills, you should consider moving to an API based tool that allows you to switch between models. Personally, I’m a fan of the Claude Sonnet and Opus models over OpenAI (and to some extent, Gemini). If you can manage costs well, move to Claude Code (or an open source tool like OpenCode or Aider). I would put OpenCode above Claude Code due to it’s flexibility.

Patterns for AI assisted software development

2025-07-07T00:00:00+05:30

Moving beyond tools: habits, prompts, and patterns for working well with AI

In the last post — AI for Software Engineering, not (only) Code Generation — we explored how AI is transforming software engineering beyond just writing code. Now, let’s look at what that means for teams and individuals in practice.

There are a few patterns that people running teams and on teams that are going to build software with assistance from AI tools should remember.

For people building teams

Focus on value

With the AI ecosystem shifting weekly, C-level and VP-level stakeholders who prioritise modular documentation, model pairing, scoped context, and tooling agility will drive the highest ROI while keeping teams nimble and ready for whatever comes next. Make it work, make it right and then make it fast/cheap.

Journey per software delivery stage, one stage at a time per team

This journey is going to be transformational for teams. Like most transformations, you do not want to change too much too quickly.

When bringing change to a single team, introduce it one software delivery stage at a time to easily verify effectiveness. In a large organisation, you could try different tools for the same stage on different teams to A/B test effectiveness while taking into account the nuances of the individual teams themselves. We don’t recommend this approach if you would like to converge towards a single tool throughout the organisation because changing tool choices after the team gets used to it causes more friction.

When you have multiple teams willing to take this journey, you can have each of them pick tools in different stages to help reduce the time that your organisation takes to make a decision on a toolset. A couple of teams can try AI tools for requirements analysis while others can try agentic coding tools for development.

Expect a learning curve

Especially if you’re an experienced developer, you will feel slower when you start off on this journey. This is no different than working with a new teammate and feeling that your overall productivity is lower. You trade off your own speed against the value you will get when your teammate is onboarded and can deliver by themselves.

From our experience, you are looking at a 2–4 week drop in perceived productivity before the gains will start showing up. As a result, the costs will go up (slower delivery and cost of tools) before they come back down (faster delivery and more time to focus on quality).

Quality guardrails are a prerequisite

Do not bolt on quality and security guardrails after the fact. Start with them. Ensure a robust test pyramid and implement shift-left strategies for both testing and security, enabling quick and early feedback. These guardrails will be invaluable when your team is moving at breakneck speeds through newer features.

If you don’t have these guardrails first, you can use AI to help generate them and review these plans. Like the Maker-Checker process, if an AI coding assistant has helped you plan and create these guardrails, they should be thoroughly reviewed by someone who has the expertise in these fields to catch the small bugs that can have disastrous consequences later.

Autonomous agents are far away

Humans are required in the loop for software development. 10+ years after the first demos of driverless cars, we’re still waiting for a general purpose implementation. While we have made massive progress, it takes time. While agents have made massive progress in the last 2 years, we still need to exist to make sure things work well and that the systems are maintainable. The skill to build maintainable systems is more important now than ever.

Watch out for ‘AI Slop’

Without the right guardrails and structures in place, teams will produce more code, faster while sacrificing quality and security. Teams that have been given access to AI tools without helping them build skills first often point out longer pull requests coming in faster than ever before making people reviewing the code a bottleneck. Eventually, the reviewers end up accepting pull requests due to pressure or fatigue leading to important issues being missed.

Individuals should focus on small chunks of work and teams should look at key metrics to measure the effectiveness of their tool usage (we talk about both of these later in the post).

Changes to individual responsibilities and team composition over time

If teams in your organisation currently contain distinct individuals playing different roles like business analyst, architect, developer, quality analyst, infrastructure engineer and production support engineer, you will see that the distinct responsibilities of these roles will rely less on administrative tasks freeing each of them to focus on thinking strategically and the core responsibilities of their roles. Different organisations will see a merger of different roles. Some will see a merger of the business analyst and product manager roles. Some will see product and project managers merge. Some will see project managers’ responsibilities be split between technical leads and product owners.

In doing so, individuals will emerge that pick up or demonstrate their ability wear multiple hats for example, talk to the business, design the system, develop, validate, deploy and monitor it. These individuals will understand the challenges of the business and work end to end to address it. We have been calling such individuals Solution Consultants at Sahaj and believe that most teams will need such individuals on their team in the near future once they leverage AI in their delivery.

Beware of reduced intuition for decision making

As teams move towards using automated notetakers to help capture more detailed conversations, we should be on the lookout for a few anti-patterns

While conversation summaries help with a quick read, they are often misleading or inaccurate. Please read the full transcription to help improve confidence in what was spoken about. Transcripts are not a replacement for actually having real conversations, an anti-pattern we have seen come up on recent teams.

Transcripts are also not a replacement for remembering context yourself. Context helps build intuition for decisions and one of our worries is that intuition will reduce over a period of time.

For people on teams

The ‘new teammate’ mindset

Treat the AI system as a new team mate or a collaborative partner and not a tool. You can use a tool, be unhappy about the way the tool works and stop using it. When a new team mate joins your team, the fundamental thought process is different. You try to onboard the team mate and give it better context. Writing good instructions or prompts is key to success.

LLMs are like team mates with anterograde amnesia. They can have some memories but these are fairly limited by the size of their context windows. Understanding how to manage context windows is key to being able to work with our new team mates effectively. Keep only what is necessary in the context window and clear it when it isn’t required. Common context should be added to a file (check rules section below) and included only when necessary.

If your prompts to a coding assistant are vague, the tool will keep going around in circles and not make any progress on the task or do the wrong thing.

For example, when you ask the agent: I have noticed that [http://localhost:4000/create-profile](http://localhost:4000/create-profile) has alignment issues and contains text that is spreading outside the buttons. Can you please fix this?

If the agent has access to the puppeteer MCP, it will open up the UI, take a screenshot, process and fix it. If your application has a login page, it will see that the Create Profile view is not being loaded and decide to “fix” this issue by removing authentication 😞. Adding “Please wait for me to login if required” to the prompt helps avoid this issue.

If your prompts have not told the system that you need a solution that has been simplified or one that does not hard code solutions, it will not follow these instructions. Add your general coding standards to a document and include that in the base context. If you have rules around test quality, split that into a smaller document explaining what good tests look like for the team.

Small chunks of work

Break your work down. Reviewing a 1000 line review has always been hard. You can generate large code diffs with AI quickly. You, the developer, are the bottleneck. You are still responsible for quality and security.

Work on smaller chunks. Review regularly. Do small commits. Age old practices still apply.

Configure the tool based on your team’s rules

Each tool requires configuration. Configurations take time to test. It might take a few tries over multiple days to get these configurations correct. Each tool has a different way to be configured and there is no standardisation. In the Agentic code pairing tool space, every tool has its own configuration mechanism. Cursor has Cursor Rules, Claude has memory, Windsurf has Memories & Rules and IntelliJ’s Junie has guidelines. Each of these looks like a markdown file but has slightly different formats. If you’re experimenting between multiple tools (or different teammates prefer different tools), you will have to keep these rules in sync by hand. What’s worse is that the same instructions do not have the same effectiveness across different tools because their system prompts are different. Testing regularly and tweaking is key. Tools also rapidly update. Claude Code releases every couple of days (at the time of writing). Rules may need to be updated based on changes to the tool of your choice.

Shift in time spent on different responsibilities

Teams will increasingly spend more time upfront in planning what needs to be built and what the right thing to build is than in actually building things. This does not mean that teams are walking away from agile but truly embracing it. The time spent on analysis and planning will go up as a proportion but the overall time taken to deliver a version will go down. Each of the individual activities (analysis, development etc.) will be done in thin slices helping build the system up incrementally.

Over-reliance on AI instead of thinking and remembering yourself

Since AI works fast, it’s easier to be lulled into a sense of security and thus have a sense of reliance on the tools. Over time, some individuals may spend less time thinking critically and making decisions.

For example, if a good note-taking app takes notes and summarises them correctly 95% of the time, it is easy to forget that the 5% of mistakes, especially if they happen in critical parts of the conversation, can be quite expensive to fix. Summaries are good but they are not a replacement for reading the transcript which itself cannot beat actually having a conversation with people.

We need to use these systems to help us be better at our roles. Critical thinking is not optional, now more so than ever. We need to put guardrails in place to spot and correct intellectual laziness. If an issue is found that you missed during review, check if you thought about it critically enough. Do so for teammates too and help provide feedback if they are slipping.

How do you know AI is helping software delivery?

Use both qualitative and quantitative measures. Early stages focus on “leading” indicators: developer sentiment, tool usage, and workflow metrics. Conduct developer surveys and track AI usage statistics (active users, acceptance rates) as GitHub recommends. Complement these with engineering metrics: cycle time (time from commit to deploy), pull-request size and review duration, deployment frequency, and change‑failure rates. These DORA‑style metrics help ensure speedups don’t sacrifice quality. Align these KPIs to business outcomes (e.g. shorter time-to-market, fewer critical bugs). Set “clear, measurable goals” for AI use and monitor both productivity and code quality over time.

Up next, we’ll dive into strategies for managing tech debt and elevating developer experience in a world where AI is part of the team. We’ll explore why it’s now easier than ever to stay ahead of the curve — and share the exact prompts and techniques that make it possible.

Credits

This blog would not have been possible without the constant support and guidance from Greg Reiser, Priyank Gupta, Veda Kanala and Akshay Karle. I would also like George Song and Carmen Mardiros for reviewing multiple versions of this post and providing patient feedback 😀.

This content has been written on the shoulders of giants (at and outside Sahaj) that I have done my best to quote throughout.

AI for Software Engineering, not (only) Code Generation

2025-06-25T00:00:00+05:30

Rethinking the role of AI across the entire software lifecycle

Everyone has been talking about using coding assistants to aid with software delivery. There is more to delivering good software than writing code.

Every software development project requires a few different activities from analysis (what), to planning and design (how), to development (build), to testing (validate), to deployment (implement). Each of these activities depends on different skills and techniques that can benefit from the effective use of modern AI technologies.

All software development methodologies, from waterfall to the different agile techniques, fundamentally follow the same cycle. We feel this cycle is not changing yet but there are improvements waiting to be unlocked for organisations.

This post aims to demonstrate how teams of the future can gear themselves to build better products faster.

Use of AI tools across software delivery

The tools mentioned in this section are examples to help the reader understand the idea and not recommendations on what to use.

During Analysis

Improved analysis

Many teams have integrated AI into their analysis process. Starting with single agent flows that support definition of features, epic and stories, to multi-agent flows that help with addressing different parts of a problem space in parallel. My colleague Carmen Mardiros showcases how to revise a plan using Claude Code where individual agents perform specific tasks to help the analyst optimise a plan before execution. Effectively using AI in support of critical analysis and planning can provide benefits beyond basic requirements definition. Multi-agent systems out-perform single agent systems but spend significantly more tokens (and thus money) to do so.

Taskmaster is an AI powered tool that, together with an interactive coding assistant such as Claude Code, can serve as a virtual technical project manager by helping with defining requirements, offering feedback on edge cases, writing stories and setting up and managing the product backlog.

Since you can also ask Claude Code to analyse the codebase to identify technical debt, you can use the same tools to manage both the technical and feature backlogs of the product. This is particularly important when working with mature (legacy) systems as teams and product owners often struggle with balancing technical debt reduction (payback) and new feature development. Although these tools do not replace the expertise required to effectively manage a backlog and prioritise work, they can significantly reduce the administrative burden of doing so.

If all requirements are documented as PRDs, it becomes easier to measure drift as well as look at cards that might be created but might have parts that have already been implemented. You can run this analysis as a weekly or monthly job to clean up your backlog of tasks that are no longer needed.

Not all administrative tasks have been eliminated. When you transition from PRDs to epics on your backlog, there is a time period when both remain active and during this time, the two need to be consciously kept in sync. Over a period of time, the importance of the PRD wanes and it can be killed off. The same is true for other transitions like the one between stories and code.

Changes in roles for Business Analysts and Project Managers

The roles of business analysts included note taking, summarising and analysing and helping shape the right product for the business. This role is shifting to focus on being more strategic in nature focusing on finding good opportunities for your products, taking away the transcription/administration parts of the role. Similarly, the roles of Project Managers will include less time on administrative tasks and more time on making sure the right features are being built.

This is true for all roles we’re going to be speaking about in this post to some extent, calling this out explicitly since this is the first.

Improved iterative UI/UX design

Tools such as Canva and Figma have helped minimise the time taken to go through a complete feedback cycle with users. AI tools have now started linking up with these tools to help spot implementation drift during development. These tools also have the ability to spot requirements gaps and help us foresee problems. More on this during the feedback cycles section.

Clair Mary Sebastian also talks about using generative AI for requirements analysis and wireframing using OpenAI’s APIs alongside Figma’s wireframe designer.

AI note taking apps for requirement analysis

Copilot4Devops that will take text summaries and help generate user stories or feature specs. This can be a particularly powerful technique to aide in quicker iterations with generating stories and feature specs.

Note taking apps like fireflies.ai have fairly accurate notes across multiple languages with user detection in conversations and help improve user experience and recall for conversations.

While conversation summaries help with a quick read, they are often misleading or inaccurate. A best practice (or should we say “must have practice”) is for participants to review the notes shortly after the meeting and correct any errors before the notes are accepted. In addition to preventing the dissemination of inaccurate information, this practice improves information retention amongst participants and contributes to an improved shared understanding. This is in contrast to the anti-pattern of relying on unreviewed transcripts and meeting notes, an anti-pattern that discourages critical thinking and delays establishment of a shared understanding that is critical to successful delivery.

Transcripts are not a replacement for actually having real conversations, an anti-pattern we have seen come up on recent teams. Transcripts are also not a replacement for remembering context yourself. Context helps build intuition for decisions and one of our worries is that intuition will reduce over a period of time.

Improved communication and context

Currently, users from the business (or product owners as a proxy) work with business analysts from delivery teams to collaboratively help shape the product. This communication usually requires experienced product owners who understand technology well enough at a distance to know what questions to ask and how to shape the conversation to build quick consensus on what the product’s vision is. This communication also requires experienced business analysts who know how to extract details of how the system should work, anticipate challenges during building the product and pre-empt them with questions. Teams who do a good job at analysing the system require individuals at the top of their game. If either of these individuals does not have the pre-requisite knowledge, communication is sub-optimal.

We see that this status-quo is ripe for disruption. Doing so requires us to build a system (or product) that absorbs domain context before it can be used.

Since most teams are distributed, a conversational AI can help users prepare for their synchronous or asynchronous communication with the team given that the AI has the persona of a developer who is an expert at the specific tech that is used to work on the product. Similarly, delivery team members can use a conversational AI system to help understand the business context better and anticipate pushback and prep for it. Being able to understand the devil’s advocate stance in their head and prepare for it is something most people struggle with. Important conversations still happen through direct communication, however, both the users and the business analysts can help pair on preparing for the actual conversation with real people on the other side.

Over a period of time, the conversational AI system can help improve the quality of preparation conversations for both actors providing quicker feedback.

During System Design

AI makes it possible to more quickly and thoroughly define and compare different solution designs for a given problem space. The ability to quickly and thoroughly evaluate the impact of different architectural decisions can multiply the value of experienced architects and may even enable more advanced practices such as emergent architecture as AI can help teams safely adjust the solution design as requirements change or new requirements emerge.

When a system is built, the system design is built to meet some constraints and have a target state. Both the target state and constraints evolve over time. Good teams will track these constraints in the beginning and through the evolution of the product as ADRs and fitness functions. Some teams find it hard to keep track of the delta between the current and target state (current debt). Using AI tools, this debt is easier to identify, track and address. Teams can use specific prompts in different areas to identify these challenges and help evolve the system in the right direction.

Tools like eraser.io exist to allow generation of architectural documents through text. Combining this with the ability to generate documentation based on the code, systems can ensure architectural documents are always up to date.

During Development and validation

In today’s fast-evolving AI landscape, engineers must embrace a dual-mode workflow (planner and executor) to get the most out of coding assistants. As a planner, you leverage a high-reasoning model (for example, Claude Sonnet 4 over 3.7 or GPT-4o) to deconstruct monolithic docs into modular guides (e.g. splitting a bulky claude.md into coding-practices.md and development-workflow.md), map out architectural changes, and draft a detailed implementation roadmap. Once the blueprint is locked in, switch to a specialized coding model (like Sonnet, GitHub Copilot with tailored instructions, or Claude Code) for hands-on development, refactoring, and validation. By matching each task to the model best suited for it and scoping prompts to only the relevant files or services you streamline token usage, accelerate processing, and cut context-window bloat.

Executing at scale also demands a culture of experimentation and flexibility. Expect a learning curve as teams test different assistants (Copilot, Cursor, Claude-Code, etc.) and prompt strategies for different tasks like migrating an entire codebase versus tweaking a single method signature, for example. Build in continuous feedback loops around prompt-to-PR cycle times, code quality metrics, and token costs to identify what works best in each scenario. Agentic integrations via Model Context Protocols and tools like Puppeteer, Slack bots, and GitHub Actions can then automate routine tasks — from branch creation to dependency updates and test orchestration right within your existing toolchain.

During Deployment and Operationalisation

Over the past decade, practices in the DevOps space have changed quite significantly with the focus on automation (CI/CD) observability and improved monitoring tools. As this data became more centralised in platforms like AppDynamics, DataDog and NewRelic, these systems have been able to spot errors, intelligently alert users and help spot anomalies.

Platforms like Harness now support automated error analysis to help understand the root cause of issues and help provide steps to fix them.

During Feedback Cycles

Traditionally, individuals caught drifts in software development. There are tools being built in place to help catch different types of drift. Tools such as Cubyts catch both requirement drift (between requirement specs and stories) and implementation drift (between requirement specs, application mock ups and implementation). This is possible because these tools connect with tools like JIRA, Figma, GitHub etc. to analyse the contents of that platform and find possible challenges using the capabilities LLMs provide.

How do you enable this transformation

Preparation

Identify a candidate project
Ensure the candidate project has good safety nets
Ensure the candidate project has a stable product team with good shared context
Identify the right stage of software development, which is most painful and will benefit from introducing AI tools
Identify seed individuals with prior experience in the space, the right opinions and the ability to mentor team members
Identify the tool to introduce
Set up success criteria for this transformation

The journey

Set up time to up-skill team members (on the skills from the “For people on teams” section). Pair team members with seed individuals for maximum effectiveness.
Set up weekly retrospective meetings to catch trends and course correct as necessary. Timely feedback is critical.
Set up a checkpoint to see if the team members require less support from seed individuals weekly. Until a threshold of independence is reached, keep repeating steps 1–3.
Seed individuals depart from the team and only join retrospectives for support.
Set up a checkpoint to check if seed individuals are required in the retros and to confirm that the team is meeting the success criteria.

The 4-week period are indicative examples of what teams may need. Tweak the time period on a need basis.

AI’s role in software engineering goes far beyond code generation — it’s reshaping how we design systems, make decisions, and collaborate. To truly unlock its potential, we need to rethink not just our tools, but how our teams operate. In the next post, we’ll explore patterns for AI-assisted software delivery — focusing on how to build more effective teams, and how individuals can work differently to make the most of AI in their day-to-day practice.

Credits

This blog would not have been possible without the constant support and guidance from Greg Reiser, Priyank Gupta, Veda Kanala and Akshay Karle. I would also like Swapnil Sankla, George Song, Rhushikesh Apte and Carmen Mardiros for reviewing multiple versions of this document and providing patient feedback 😀.

This content has been written on the shoulders of giants (at and outside Sahaj) that I have done my best to quote throughout.

What makes Developer Experience World-Class?

2025-06-23T00:00:00+05:30

The habits, tools, and practices that set great engineering teams apart.

Developer experience (DevEx) isn’t just about fancy tools or slick UIs - it’s about removing friction so teams can move with confidence, speed, and clarity. In high-performing teams, great DevEx means fewer context switches, faster feedback loops, and more time spent actually building. In this post, we’ll explore the five non-negotiables every codebase should have to support world-class collaboration, and we’ll map out a practical DevEx stack to help your team deliver better products, faster.

The Five Non-negotiables

I. Project readme

Short, sweet and simple

Write a short note with a few lines on what this codebase is responsible for. Indicate the setup process and lifecycle to go to production. Code should act as documentation and anything that code will not document as obviously (what are the first set of things you should read) should be in here.

II. Automated setup

A single command to get your entire workstation setup.

I am a huge fan of using shell scripts for smaller projects and justfiles for larger ones. This isn’t about tools. This is about your experience.

Run just setup and have a workstation that is ready to go (including requiring node/python, installing all the dependencies and setting up a database, if required. I expect other obvious commands like just run, just lint, just test and just build. I admit that I have been spoiled by gradle and maven in JVM land and clearly have withdrawal symptoms in the land of the snakes.

Take this a step further and automate test data creation. If your application is stateful, please generate the test data on startup. This way, you are ready to test what you need the moment your application starts. Test data setup might add a few seconds to your startup but it will save you minutes in testing things and much more than that in your emotional happiness. If you are building an e-commerce website, create a few product categories, products in each of the categories and a few test users. Make sure your test user has elevated privileges to begin with making it easier for you to start testing things. The single just run command should have you ready to test your scenarios.

III. Iterate fast

Faster the feedback, the better

I like fast iterations. Left on my own, I’d commit every 5–10 minutes; sooner, if I can get away with it. This includes the time it takes me to lint and test. This means fast code linting and tests. I love code linting tools that take less than a second and unit tests that take less than 5 seconds across the entire project. If running all tests takes more than 5 seconds, I’ll run them before a push. If it takes more than a minute, I’m refactoring/optimising something.

There are enough engineering techniques to go fast. Got a large number of tests? Run them in parallel. Integration tests take time? Share container context and database test containers.

Once you get used to this, you will not go back.

IV. Enforced pre-commit/pre-push checks

Shift feedback leftward

Use frameworks like pre-commit (others exist for most toolchains) to run your entire CI safety net locally. Early feedback is key. Lint code with every commit. Linters should spot issues (like increased complexity, dead code, etc.) early and format code consistently. Test everything before pushing.

V. Everything runs locally

Nothing should require the internet or external resources if possible

Can you run your code locally? Sounds like a silly suggestion but I bet that there is at least one team still working with an old system that’s written in C that logs into a remote machine and writes code in vim without any setup for code completion, early compilation feedback or running code in an “IDE like setup” with world-class debugging support.

Use a proper IDE. I personally love the IntelliJ suite of tools for most languages. Some of my teammates are emacs and vim power users who have the same setups (code completion, auto-compilation, error detection, running code, and debugging support). IntelliJ even comes with its own set of profiling tools that are a real timesaver for me and easily worth the cost of usage.

A teammate once asked “If you were to get on a flight, could you continue to write code”. This was not a hypothetical question as we used to travel every week and spend 3+ hours on a flight, time you’d like to make good use of. Having a toolset that you can go offline and work comfortably in, even when travelling is a really nice experience as a developer.

The DevEx Stack

Want to go beyond the non negotiable items and dive deeper into improving your team’s DevEx? Here’s a stack with some techniques, tools and practices to try out.

This section is going to be heavy with crosslinks to other articles to keep this article short for people who already know some of these concepts.

        Layer                          Tools/Practices
Code Quality            Linters, Formatters, Typing, Modular Design
Automation              Pre-commit, CI/CD, Makefiles, Containerisation
Testing and Validation  Fast tests, Coverage, Contracts, Security Scans
Documentation           Onboarding, Readmes, ADRs, Comments, PR Templates
Culture and Workflow    Git hygiene, Blameless retros, Tech debt tracking

Foundational code practices

People have their preference in how code is styled and a good codebase is one that looks like a single person has written it.

Have a clear and consistent code style that is enforced via linters and formatters that work across the CLI and IDEs that the team uses. Use a configuration that is checked into version control to ensure consistency.

Duck typing enthusiasts can look away but please prefer strong typing (Typescript over Javascript, mypy on Python etc.). Your IDE suggestions and ease of exploration of language APIs will thank you, especially if your team aren’t experts at the language.

Build a codebase that has clean code architecture (layered, hexagonal, etc.). The codebase should clearly showcase design preferences (composition over inheritance) and even codify them through tests or fitness functions when possible.

When the code isn’t obvious, do not add comments. Write better tests and refactor your code.

Tooling and automation

Run formatters, linters, and tests automatically (using tools like pre-commit and husky). Build CI/CD pipelines that are fast and reliable which provide meaningful feedback when things fail. Automatic deployments to non-prod environments. Automated rollbacks strategies when deploying to production. It’s 2025 and there are very few reasons to need downtime even when running most standard migrations. Build observability into your pipelines to help diagnose issues (like pipelines slowing down) quicker.

A local developer experience that is consistent with production (docker, docker compose, and vagrant environments for more bespoke OS’). Use scripts for common workflows (just, npm scripts).

Builds need to be deterministic and reproducible. Lock your dependencies and avoid dependency hell. This might not be a big deal to the experience of developers on a daily basis but add periodic checks for outdated or vulnerable dependencies (using tools like snyk).

Testing and verification

Automate your tests with good quality unit tests for your logic, integration tests for the boundaries and (hopefully consumer driven) contract tests for external APIs that together, make up a good test pyramid or test trophy. Do not chase test coverage numbers. Use coverage to catch critical paths that are not well tested. Flaky tests suck, please eliminate them like a plague. Make them easy to run using an obvious command (like just test, npm test or ./gradlew test).

Use test doubles when necessary. Mocks and stubs are required but try to be stateful when possible (the last bit is a debatable opinion; one of the few endless debates in this blog). Use snapshot tests when possible but do not abuse this technique.

Lint your code. Do so early. Add security linters (like bandit or semgrep)

Collaboration and documentation

Every project should have a README.md, a TL;DR of your quick start guide for developers. Add a CONTRIBUTION.md for guidelines on how people can be good contributors (do you practice trunk based development or git flow? The answer will not be obvious to everyone when starting off on the project). Set up PR templates and code review guidelines to help aid internal conversations.

Automate your setup. New developers on your team should be productive in less than 5 minutes of the repository checkout (including the time taken to download dependencies).

If you’re creating an SDK or API, please generate auto-generated documentation. Capture decisions as Architectural Decision Records (ADRs) and C4 diagrams. This makes continued context maintenance and acquiring historical context easier.

Team workflows and culture

Decide on Trunk Based Development (TBD) or Git Flow (GF). If you’re going with TBD, merge early and merge often but do so with feature toggles. If you’re going with GF, create short-lived feature branches.

Set up a culture of blameless retros to learn from your mistakes effectively.

Track tech debt actively on the backlog and manage it regularly. Acknowledge and prioritise debt alongside features.

Ensure the team is used to sharing feedback openly. Set up retrospectives as a group and time to introspect as individuals.

This might all sound obvious in hindsight. So, why doesn’t every team invest in it? In truth, many developers have never experienced what great DevEx feels like. They don’t know it can be better, or they’ve accepted the friction as normal. But once you’ve worked in an environment where someone has sweated the details - where every part of the workflow feels seamless - you can’t unsee it. You start to expect it. And that expectation changes everything.

What’s next?

Every team deserves a developer experience that brings out their best work. Start by imagining what “great” looks like for your codebase - your north star. Then chart a course. Build a roadmap. Rally others. The path from chaos to clarity is paved with small, deliberate steps.

Take 10 minutes today to write down your team’s DevEx wishlist. Start a conversation with the team: What’s slowing us down? Pick one thing from the DevEx stack and implement it this week.

Change starts with you.

In the next blog, we’ll dive into how AI coding assistants can help amplify your impact - accelerating code quality, catching issues early, and automating the boring stuff - so you can focus on what really matters: building things that matter.

Credits

Thanks to Vinayak Kadam for providing feedback and Priyadarshan Patil for requesting me to write about this, after my passionate filled monologue in a conversation about Developer Experience.

The Cost of Culture: Transparency

2025-02-12T00:00:00+05:30

Why most people believe they want transparency — but actually don’t.

Transparency has been a cornerstone of Sahaj throughout my journey here. It is not just a value we champion but a principle deeply embedded in how we operate. But transparency is not as simple as it sounds — it comes with its own challenges and costs.

Examples of transparency include sharing all business related data openly internally such as salaries, revenue, forecasted incoming work and people etc.

Transparency promotes empowerment

Trust is essential at Sahaj. We believe in empowering everyone to help our business grow by making informed decisions. For this to work, everyone must have access to the vital information that shows them the bigger picture. We strive hard to avoid informational hierarchy, where some people have information and others don’t. We believe everyone has the right to access key information, which will make people see the bigger picture and make the right decisions to help grow the business.

Having access to this information along with the power to make important decisions to grow our business is how each of us walks the path toward becoming better CEOs and business leaders. Consequently, some of us (myself included) experience an increased sense of fulfilment and engagement. It enables each one of us to articulate our ideas, grow and build expertise in the areas we would like to while simultaneously helping our business grow. However, this empowerment also comes with responsibilities and challenges, as transparency requires effort and understanding to truly benefit the collective.

Balancing collective and individual needs

Transparency, like any valuable principle, comes at a cost. Informational transparency makes information available to everyone. While the context is provided internally, everyone will not have spent the same time to absorb and process the information on their end since their priorities are different. Broadly, folks primarily responsible for operations (business oriented roles), spend more time thinking about these things than folks primarily responsible for software delivery.

Transparency, like radical candor, requires looking beyond initial reactions to understand the deeper context. Let us understand this with an example.

An example through personal experience

Let’s talk about open salaries. The concept is simple, ensure that everyone inside the organisation knows how much everyone else makes. We do this to make sure everyone in the organisation gets paid appropriately. For me, personally, it has made it possible to spend less time worrying about whether I get paid fairly and more time in actually doing my best work.

On day one, transparency might feel overwhelming — like staring at a spreadsheet of numbers without context. As one spends more time in the organisation, understands contexts and the value an individual brings in, they are then able to better co-relate the number with value. At this point, a lot of things start making sense like why I get paid what I do and why others around me get paid a similar amount. There will also be a ton of things that will not make sense. This is a pivotal point because either I could make assumptions or ask for clarification. Assumptions often lead to frustration, a path I prefer to avoid. Much like many other Sahajeevis, I took the route of asking for more information. Sahaj is a pull-based organisation meaning we expect people to ask questions whenever they have them and those with more context will provide it. Over time, as you engage and ask questions, patterns emerge, and you start to see the bigger picture.

At some point, you will interview a candidate you really like. However, based on your understanding of how salaries work internally, you might think this is an offer we should not make (usually because their expected salaries are too high). You know there are internal mechanisms to handle such situations. Despite having “similar knowledge”, some of us will be comfortable moving forward and some of us will not. There are a couple of reasons why people react differently to the same information.

They are comparing the candidate to themselves.
They are unable to see the bigger picture, either because they haven’t fully understood the information or because they cannot see past self interest (#1)

True cost of transparency

When people understand the larger picture and prioritise the collective over immediate self-interest, transparency transforms from a burden into a powerful tool for running an effective business. This balance between discomfort and long-term growth is the true cost — and reward — of transparency.

While many people believe they want transparency, what they often seek is convenience. Convenience to have access to data and to be able to use it to make arguments that serve their self interest. This is a normal part of the journey we all go through in transparent organisations.

The true reason for needing transparency in an organisation is to help teach all of us how to effectively run a business. This requires us to be uncomfortable at times. Uncomfortable because we need to put a collective (our business) before ourselves. Uncomfortable because we have to admit the fact that at times, we want to think of ourselves first and that, as leaders, there are times we cannot or should not. Uncomfortable with the realisation that while we might want to be leading our business, there are moments when we aren’t ready to do so. What we all want at times is just comfort and to not have to think of the big picture.

When we join an organisation like Sahaj, we get access to information. At some point, the information will not make sense to us (based on our context, available information and/or knowledge). The only way to grow is to start a conversation with others and evolve our perspectives, which drives us to realisation. This state of confusion isn’t permanent since we will eventually realise new unanswered questions which require growth in our perspective.

Embracing the Challenge and Growth of Transparency

Transparency grants us access to important information, which challenges us to think critically and grow both personally and professionally. However, this growth often comes with moments of discomfort — times when we must confront opposing viewpoints or accept hard truths. And that’s okay. It’s okay to feel uncomfortable, to need time to process, and to engage in open dialogue to navigate these complexities.

The rewards of transparency are profound: freedom, greater knowledge, continuous learning, and meaningful growth. Yet, the cost of transparency is equally real. It requires us to embrace discomfort, confront differences, and invest time in understanding and resolving them.

If you find yourself wanting transparency but hesitating to embrace the effort it demands, perhaps what you’re truly seeking is convenience — the comfort of data that reinforces your own perspective. There’s no shame in admitting this; I, too, have found myself drawn to the easier path at times. It’s human nature. But real progress, like radical candor, requires a willingness to embrace discomfort and challenge ourselves to change.

Transparency is not just a value — it’s a practice. It asks us to think beyond ourselves, to see the bigger picture, and to lead with empathy and understanding. If we’re willing to pay its price, transparency can transform not just our organisations but also ourselves as leaders and contributors to a collective vision.

Thanks to Kshitij, Swapnil, Puneet, Geeta, and Priyank for their reviews and early feedback.

What are event driven architectures?

2024-09-30T00:00:00+05:30

A couple of years ago, I was part of group of individuals working on defining different Event Driven architectures during a weekend summit. A summary of the summit was already published by Martin Fowler as first as a blog and later as a talk, the blog takes a slightly different view than the explanation I needed and thus this post was created. This is a recreation of the contents in the talk. If you have watched it, you can skip reading this summary.

What is event driven?

This technique is a popular technique to avoid coupling in systems. These systems tend to eventually become good sources of data that the business would like to build data platforms, insights and models on.

This page exists to

Help understand the different patterns at a high level
Understand the implications on building data systems

Events vs Commands

An event is when a system wants to announce what has happened but not what is to be done. For example, a new insurance quote being generated is an event. It announces to the world that a quote has been generated but not what should happen as a result.

A command is when a system wants something to be done and is asking a system to do it. For example, an upstream system might ask the communications system to send an email with specific details and this is a command to the communications system.

Both of these are usually implemented as events on a queue. The primary differences are how they are named and what the intent is.

Different types of event driven patterns

Let’s start with an example to help visualise the problem in which the customer changes an address for their house insurance in an insurance provider’s system which leads to a new quote being generated. This quote needs to be sent back to the user via an email.

If the services are built as visualised with calls across services being made, the services will be tightly coupled in their flow (since customer management needs to know of the existence of the quoting system which in-turn needs to know about the existence and need for communication). Here is how that problem can be solved with event driven architectures.

Event notification pattern

In this pattern, a source system will send a “notification” to all other systems that something has happened. The consumer needs to setup an event listener and figure out how to react to it. An example of this can be seen by the customer management system generating

Since the events do not have any information about what has changed, the downstream systems still need to call the upstream system to understand the details of what has changed to take action on the changes.

Here are a couple of versions of the customer changed event. The first version is one where the customer address changed event could include only the ID of the customer who’s address has changed. For every other part of the information (including what’s changed), the downstream systems need to contact the customer mamangement service.

Of course, these additional questions could be included in the event notification because they are related to the core event itself. There will always be some fields that a downstream system might need that are not directly part of the event but are required by the downstream system.

Advantages of using Event Notification

Systems built are decoupled. When there’s other actions that need to be made based on an address being changed, it’s easy to add another system to take action on this event without changes being required on the customer management side.

Downsides of using Event Notification

Systems built will be devoid of any behavior and there is no easy way to trace what happens downstream. There is no easy way to trace all the changes that happen in the code (by looking at the source code) to understand the list of changes that happen when the user changes their address.

Distributed tracing systems like zipkin aim to address these challenges by allowing visualisation of flows on environments with a full setup. Code can be traced by using mono-repos with the event names being the same across services. These are techniques to deal with the inability to trace code/flows across systems and while neither of them are as effective as tracing usages of your code, they help drive a balance between decoupling and ease of use.

Even when all the related information to the event have been added into the event payload, there will still be a need for downstream systems to require information. This means additional API calls will have to be made to the upstream system. As more downstream systems subscribe to a particular event, the upstream system will be under higher load to provide access to information and the downstream system’s availability is depedent on the upstream system.

Event carried state transfer pattern

Event carried state transfer (or ECST, for short) is sends all information related to the domain object in the event to avoid Event Notification’s need for call backs for additional information.

Downstream systems need to store the parts of the information they need for their usecase. If a difference between the old and new data is required, the data-structures chosen should make calculating differences easier.

Advantages of using ECST

Systems using this pattern have a lower dependence on their upstream services and thus have higher availability.

Downsides of using ECST

The higher availability comes at the cost of making the system eventually consistent. The data will also have higher replication.

Event sourcing

An event sourced system is one where the events are stored on an event store/event log and where the current application state can be completely recreated based on the event store.

The event store is an append only log of events that have occurred and, in the example, the customer DB is an example of a snapshot. A snapshot is required for enhancing the performance of your store (since it stores the current state of your system for quick access).

Both source control systems (like git, svn etc.) and financial accounting ledgers are good examples of event sourcing.

Advantages of using Event Sourcing

This system makes audit, debuggability and replayability simple. Such systems are great to recreate issues and understand what order things happened in. The ability to timetravel with data on a production system is quite useful. Concepts like branching is possible with data and what-ifs are easy to simulate to figure out the difference. Differences can then be applied through the creation of compensating actions.

Downsides of using Event Sourcing

This system will make event versioning mandatory. Interacting with external systems becomes more complicated since these calls are side-effects and event sourced systems should not cause this side-effect again when events are being replayed.

Command Query Responsibility Segregation pattern

Command Query Responsibility Segregation (or CQRS, for short) is a model in which reads and writes are separated. This allows scaling and optimisation reads and writes separately as per requirements.

Advantages of using CQRS

This pattern is rarely necessary but extremely useful to allow write heavy vs read heavy transactions to be scaled separately. The read side(s) can be optimised for the usecases they are being used for.

Downsides of using CQRS

Adds significant complexity building and maintaining a system.

Reading material

What is event driven?
The many meanings of Event-Driven Architecture?
Axon framwork - Framework for event driven, event sourced, CQRS powered applications in Java
Event Sourcing: A Practical Guide to Actually Getting It Done

MLOps: Building a healthy data platform

2021-08-02T00:00:00+05:30

Spoiler: MLOps is to ML Platforms what DevOps is to most tech products. If you think this means MLOps is automating your deployments, this article is for you.

What is DevOps and how is it so much bigger than automating deployments?

You know that a term you coined has made it mainstream when people use it regularly in conversations and rarely understand what you meant.

— Martin Fowler (paraphrased from an in-person conversation)

Rouan summarises DevOps culture well in his post on Martin’s bliki. It is easy for developers to get disinterested with operational concerns. “It works on my machine” used to be a common phrase between developers in yesteryears. Some operations folks can also be less concerned by development challenges. Increased collaboration can help build a bridge in the gap between Developers and Operations team members and thus make your product better.

This increased collaboration has made observed requirements like system and resource utilisation monitoring, (centralised) logging, automated and repeatable deployments, no slow-flake servers etc. key parts of our products. Each of these improve the quality of life of your product either by directly benefiting the end user or making the system more maintainable for Developers and Operations users thus reducing the time to fix issues for end user issues. Developers and Operations folks are also first class users of your system. Their happiness (ease of debugging issues, deploying etc.) is a key part of your product’s success. It allows them to spend more time improving your product for paying end users.

What is MLOps?

MLOps is a culture that increases collaboration between folks building ML models (developers, data scientists etc.) and people who monitor these models and ensure everything is working as intended (operations). The observed requirements in your system will have some overlaps with what we have already talked about like system and resource monitoring, (centralised) logging, automated and repeatable deployments, automated creation of repeatable (non-snowflake) infrastructure etc. It will also include a few Data Platform specific observed requirements such as model and data versioning, data lineage, monitoring effectiveness of your model over an extended period of time, monitoring data drift etc.

Some tools/techniques to build a robust data platform

The need of every data platform is slightly different based on the challenges you are solving and the scale at which you operate. One of the platforms I’ve been working on produces 2TB of data every week. It didn’t take too much time for data storage costs to be the number 1 line item on our bill and we invested some time in optimising our storage and retention strategy. Other teammates have lowered data volumes and focus on reducing the cycle time for model creation. Your mileage may vary.

Based on our experience building data platforms over the past few years, here are a few tools we have used and things we have watched out for.

Data Storage

Choose a storage mechanism that provides cheap and reliable access to your data while meeting all legal requirements for your dataset. If you are in a heavily regulated environment (finance, medicine etc.), you might not be able to use the cloud for customer data. The techniques still remain similar. Partition your data based on access requirements and retention times. Archive data when you do not need it. Use features like push down predicates to efficiently read your data.

We recently wrote about data storage, versioning and partitioning which goes into great depth into this topic.

Job Scheduler/Workflow orchestrator

Your data pipelines will get complex over a period of time. Much like infrastructure as code, we would like our data pipelines in code. Apache Airflow is one of the tools that allows us to do this fairly easily. Sayan Biswas wrote about our airflow usage in 2019. Over the last few years, we have made dozens of improvements to the way we use Airflow. In a subsequent post in this series, we will talk through these improvements.

Monitoring and managing data processing costs

We spawn EMR clusters on demand and terminate them when jobs complete. A cluster runs only 1 spark job (and a few extra tasks for cleanups and reporting). If a job fails due to resource constraints, this helps isolate if another hungry job consumed too many resources before a scaling policy kicked in.

Each EMR cluster has an orchestrator node (AWS and Hadoop call them “master nodes”) and a group of core nodes (Hadoop calls them “worker nodes”). We request for on demand nodes for orchestrators and reserve the instances to reduce cost. We bid for spot instances for cores using a dynamic pricing strategy that is dependent on the current price. We have considered building a system that automatically switches instance types based on availability, price and stability in AWS but failures in spot bids are currently rare enough that it does not justify the cost of developing this feature.

We also monitor the resource utilisation of our spark jobs using Ganglia on AWS EMR. This tells us our CPU, memory, disk and network utilisation for our clusters. Since the information on Ganglia is lost when clusters are terminated, we run an EMR step to export a snapshot of Ganglia before the cluster terminates. This in conjunction with persisted spark history server data on AWS allows us to tune underperforming spark jobs. In a subsequent post, we will go into details of how to monitor your jobs effectively and tune them.

Monitoring the status of data pipeline jobs

Airflow creates EMR clusters and monitors each of the jobs. If a job fails, Airflow notifies us on a specific slack channel with links to the Airflow logs and AWS cluster.

Complex spark applications produce hundreds of megabytes of logs. These logs are distributed across the cluster and will be lost when the cluster is shut down. AWS EMR has an option to automatically copy the logs to S3 with a 2 minute delay.

We have tried using CloudWatch to index and analyse our spark logs but it was far too expensive. We also tried using a self hosted ELK stack but the cost of scaling it up for the volume of logs sent was too high. Dumping it on S3 and analysing it offline gave us the best cost to performance ratio.

To help reduce the time to fix an issue, when an issue is detected, the EMR cluster analyses its logs from YARN and publishes an extract onto slack as an attachment. Any further detailed analysis can be done on the logs in S3.

Monitoring data quality and data drift

Every time we write code, we run tests to ensure the code is safe to be deployed. Why don’t we do the same thing with data every time we access it?

When you first look at the data and build the model, you ensure the quality of the data used for training the model meets acceptable standards for your solution. Data Quality is measured by looking at the qualitative and quantitative pieces of your dataset. Over a period of time, these qualitative and quantitative attributes might drift causing adverse effects on your model. Thus, it is important to monitor your data quality and data drift. Data drift might be large enough that your model does not produce the right results any more or might be small enough to introduce a bias in your results. Monitoring these characteristics is key to producing accurate insights for your business.

Tools like Great Expectations and Deequ will ensure that your data is sound structurally and volumetrically. Deequ also has operators to look at rate of change of data which is a better expectation than having static thresholds on large volumes of data.

For example, given an employee salary database where the salary is nullable, a check to ensure no more than 100 employees out of the 1000 you currently have data for have no reported salary is bound to fail when the data volume increases significantly. If this check was to ensure no more than 10% of employees have no reported salary will work as the data scales as long as it scales evenly. Moving to a check that looks at rate of change of ratio of users not reporting a salary will be more robust. If the number changes significantly (up or down), it might mean that it’s time to tune your model since the source data is drifting away from when it was trained.

There are more complex examples on how we watch for data drift that will have to wait for a dedicated post.

The MLOps mindset

When our end users feel pain, we add new features to make their experience better. The same should be true for developers/operations experience (DevEx/OpsEx).

When it takes us longer to debug a problem or understand why a model did what it did, we improve our tooling and observability into our system. When it ran slower or was more expensive, we improved our observability to investigate inefficiencies quicker.

This has allowed us to grow our data platform 10x in terms of features and data volumes while reducing the time taken to produce insights for our end users by 98.75%, the cost to do so by 35% and not to mention a significant improvement in developer and customer experience.

Thanks to Jayant, Priyank, Anay and Trishna for reviewing drafts and providing early feedback. As always, Niki’s artwork wizardry is key!

Data storage patterns, versioning and partitions

2021-05-09T00:00:00+05:30

When you have large volumes of data, storing it logically helps users discover information and makes understanding the information easier. In this post, we talk about some of the techniques we use to do so in our application.

In this post, we are going to use the terminology of AWS S3 buckets to store information. The same techniques can be applied on other cloud, non cloud providers and bare metal servers. Most setups will include a high bandwidth low latency network attached storage with proximity to the processing cluster or disks on HDFS if the entire platform uses HDFS. Your mileage may vary based on your team’s setup and use case. We are also going to talk about techniques which have allowed us to efficiently process this information using Apache Spark as our processing engine. Similar techniques are available for other data processing engines.

Managing storage on disk

When you have large volumes of data, we have found it useful to separate data that comes in from the upstream providers (if any) from any insights we process and produce. This allows us to segregate access (different parts have different PII classifications) and apply different retention policies.

We would separate each of these datasets so it’s clear where each came from. When setting up the location to store your data, refer to local laws (like GDPR) for details on data residency requirements.

Provider buckets

Providers tend to make their own directories to send us data. This allows them to have access over how long they want to retain data or if they need to modify information. Data is rarely modified but when it is, a heads up is given to re-process information.

If this was an event driven system, we would have different event types suggesting that the data from an earlier date was modified. Since the volume of data is large and the batch nature of data transfer on our platform, verbal/written communication is preferred by our data providers which allows us to re-trigger our data pipelines for the affected days.

Landing bucket

Most data platforms either procure data or produce it internally. The usual mechanism is for a provider to write data into its own bucket and give its consumers (our platform) access. We copy the data into a landing bucket. This data is a full replica of what the provider gives us without any processing. Keeping data we received from the provider separate from data we process and insights we derive allows us to

Ensure that we don’t accidentally share raw data with others (we are contractually obligated not to share source data)
Apply different access policies to raw data when it contains any PII
Preserve an untouched copy of the source if we ever have to re-process the data (providers delete data from their bucket within a month or so)

Core bucket

The data in the landing bucket might be in a format sub optimal for processing (like CSV). The data might also be dirty. We take this opportunity to clean up the data and change the format to something more suitable for processing. For our use case, a downstream pipeline usually consumes a part of what the upstream pipeline produces. Since only a subset of the data is read downstream by a single job, using a file format that allows optimized columnar reads helped us boost performance and thus we use formats like ORC and parquet in our system. The output after this cleanup and transformation is written to the core bucket (since this data is clean input that’s optimised for further processing and thus core to the functioning of the platform).

While landing has an exact replica of what the data provider gave us, core’s raw data just transforms it to a more appropriate format (parquet/ORC for our use case) and processing applies some data cleanup strategies, adds meta-data and a few processed columns.

Derived bucket

Your data platform probably has multiple models running on top of the core data that produce multiple insights. We write the output for each of these into its own directory.

Advantages of data segregation

Separating the data makes it easier to find the data. When you have terabytes or petabytes of information across your organization with multiple teams working on this data platform, it becomes easy to lose track of the information that is already available and it can be hard to find it if they are stored in different places. Having some way to find information is helpful. For us, separating the data by whether we get it from an upstream system, we produce it or we send it out to a downstream system helps teams find information easily.
Different rules apply to different datasets. You might be obligated to delete data from raw information you have purchased under certain conditions (like when they have PII). Rules for retaining derived data are different if it does not contain any PII.
Most platforms allow archiving of data. Separating the dataset makes it easier to archive different datasets. (we’ll talk about other aspects of archiving during data partitioning)

Data partitioning

Partitioning is a technique that allows your processing engine (like Spark) to read data more efficiently thus making the program more efficient. The most optimal way to partition data is based on the way it is read, written and/or processed. Since most data is written once and read multiple times, optimising a dataset for reads makes sense.

We create a core bucket for each region we operate in (based on data residency laws of the area). For example, since the EU data cannot leave the EU, we create a derived-bucket in one of the regions in the EU. Under this bucket, we separate the data based on the country, the model that’s producing the data, a version of the data (based on its schema) and the date partition based on which the data was created.

Reading data from a path like derived-bucket/country=uk/model=alpha/version=1.0 will give you a data set with columns year, month and day. This is useful when you are looking for data across different dates. When filtering the data based on a certain month, frameworks like spark allow the use of push down predicates making reads more efficient.

Data versioning

We change the version of the data every time there is a breaking change. Our versioning strategy is similar to the one talked about in the book for Database Refactoring with a few changes for scale. The book talks about many types of refactoring and the column rename is a common and interesting use case.

Since the data volume is comparatively low in databases (megabytes to gigabytes), migrating everything to the latest schema is (comparatively) inexpensive. It is important to make sure the application is usable at all points and that there is no point at which the application is not usable.

Versioning on large data sets

When the data volume is high (think terabytes to petabytes), running migrations like this is a very expensive process in terms of the time and resources taken. Also, the application downtime during the migration is large or there’s 2 copies of the dataset created (which makes storage more expensive).

Non breaking schema changes

Let’s say you have a dataset that maps the real names to superhero names that you have written to model=superhero-identities/year=2021/month=05/day=01.

+--------------+-----------------+
|  real_name   | superhero_name  |
+--------------+-----------------+
| Tony Stark   | Iron Man        |
| Steve Rogers | Captain America |
+--------------+-----------------+

The next day, if you would like to add their home location, you can write the following data set to the directory day=02.

+------------------+----------------+--------------------------+
|    real_name     | superhero_name |      home_location       |
+------------------+----------------+--------------------------+
| Bruce Banner     | Hulk           | Dayton, Ohio             |
| Natasha Romanoff | Black Widow    | Stalingrad, Soviet Union |
+------------------+----------------+--------------------------+

Soon after, you realize that storing the real name is too risky. The data you have already published was public knowledge but moving forward, you would like to stop publishing real names. Thus on day=03, you remove the real_name column.

+----------------+---------------------------+
| superhero_name |       home_location       |
+----------------+---------------------------+
| Spider-Man     | Queens, New York          |
| Ant-Man        | San Francisco, California |
+----------------+---------------------------+

When you read derived-bucket/country=uk/model=superhero-identities/ using spark, the framework will read the first schema and use it to read the entire dataset. As a result, you do not see the new home_location column.

scala> spark.read.
  parquet("model=superhero-identities").
  show()
+----------------+---------------+----+-----+---+
|       real_name| superhero_name|year|month|day|
+----------------+---------------+----+-----+---+
|Natasha Romanoff|    Black Widow|2021|    5|  2|
|    Bruce Banner|           Hulk|2021|    5|  2|
|            null|        Ant-Man|2021|    5|  3|
|            null|     Spider-Man|2021|    5|  3|
|    Steve Rogers|Captain America|2021|    5|  1|
|      Tony Stark|       Iron Man|2021|    5|  1|
+----------------+---------------+----+-----+---+

Asking Spark to merge the schema for you shows all columns (with missing values shown as null)

scala> spark.read.option("mergeSchema", "true").
  parquet("model=superhero-identities").
  show()
+----------------+---------------+--------------------+----+-----+---+
|       real_name| superhero_name|       home_location|year|month|day|
+----------------+---------------+--------------------+----+-----+---+
|Natasha Romanoff|    Black Widow|Stalingrad, Sovie...|2021|    5|  2|
|    Bruce Banner|           Hulk|        Dayton, Ohio|2021|    5|  2|
|            null|        Ant-Man|San Francisco, Ca...|2021|    5|  3|
|            null|     Spider-Man|    Queens, New York|2021|    5|  3|
|    Steve Rogers|Captain America|                null|2021|    5|  1|
|      Tony Stark|       Iron Man|                null|2021|    5|  1|
+----------------+---------------+--------------------+----+-----+---+

As your model’s schema evolves, using features like merge schema allows you to read the available data across various partitions and then process it. While we have showcased spark’s abilities to merge schemas for parquet files, such capabilities are also available with other file formats.

Breaking changes or parallel runs

Sometimes, you evolve and improve your model. It is useful to do parallel runs and compare the result to verify that it is indeed better before the business switches to use the newer version.

In such cases we bump up the version of the solution. Let’s assume job alpha v1.0.36 writes to the directory derived-bucket/country=uk/model=alpha/version=1.0. When we have a newer version of the model (that either has a very different schema or has to be run in parallel), we bump the version of the job (and the location it writes to) to 2.0 making the job alpha v2.0.0 and it’s output directory derived-bucket/country=uk/model=alpha/version=2.0.

If this change was made and deployed on 1st of Feb and this job runs daily, the latest date partition under model=alpha/version=1.0 will be year=2020/month=01/day=31. From the 1st of Feb, all data will be written to the model=alpha/version=2.0 directory. If the data in version 2.0 is not sufficient for the business on 1st Feb, we either run backfill jobs to get more data under this partition or we run both version 1 and 2 until version 2’s data is ready to be used by the business.

The version on disk represents the version of the schema and can be matched up with the versioning of the artifact when using Semantic Versioning.

Advantages

Each version partition on disk has the same schema (making reads easier)
Downstream systems can choose when to migrate from one version to another
A new version can be tested out without affecting the existing data pipeline chain

Summary

Applications, system architecture and your data always evolve. Your decisions in how you store and access your data affect your system’s ability to evolve. Using techniques like versioning and partitioning helps your system continue to evolve with minimal overhead cost. Thus, we recommend integrating these techniques into your product at its inception so the team has a strong foundation to build upon.

Thanks to Sanjoy, Anay Sathish, Jayant and Priyank for their draft reviews and early feedback. Thanks to Niki for using her artwork wizardry skills.

Version controlled configuration and secrets management for Terraform

2019-08-26T00:00:00+05:30

Terraform is a tool to build your infrastructure as code. We’ve been having a few challenges while trying to figure out how to how to manage configuration and secrets when integrating terraform with our CD pipeline.

Life before version control

Before we can do that, it’s important to understand build process before we began on this journey.

Our build model for this project was branch based. Each environment maps to a branch (main -> dev, uat -> uat and production -> production). All other (feature) branches only ran the plan stage against the dev environment.

As you can notice, the configurations, secrets and keys are all maintained on the build agent. This means, every developer wanting to run plan and test their changes needs to replicate the terraform_variables directory. Any mistakes in doing so masks actual issues that your pipeline might face leading to delayed feedback.

Next, let’s look at what our codebase looked like

terraform
├── module-1
│   ├── backend.tf
│   ├── data.tf
│   ├── resources.tf
│   ├── provider.tf
│   └── variables.tf
├── module-2
│   ├── backend.tf
│   ├── data.tf
│   ├── resources.tf
│   ├── provider.tf
│   └── variables.tf
└── scripts
    └── provision
        ├── apply.sh
        ├── init.sh
        └── plan.sh

The provisioning scripts help us consistently run different stages across modules. Each module is an independent area of our infrastructure (such as core networking, HTTP services etc.)

Each of the provisioning scripts accepted a WORKSPACE_NAME (branch for execution that maps to the environment terraform is running for) and MODULE_NAME (module being executed).

init.sh ran the terraform init stage of the pipeline downloading the necessary plugins and initializing the backend

#!/bin/bash
set -e

cd $MODULE_NAME

echo "init default.tfstate"
terraform init -backend-config="key=default.tfstate"

echo "select or create new workspace $WORKSPACE_NAME"
terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME

echo "init $MODULE_NAME/terraform.tfstate"
terraform init -backend-config="key=$MODULE_NAME/terraform.tfstate" -force-copy -reconfigure

plan.sh ran the terraform plan stage allowing users to review their changes before applying them.

#!/bin/bash
set -e

cd $MODULE_NAME

echo "select or create new workspace $WORKSPACE_NAME"
terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME

echo "plan with var file ~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars"
terraform plan -var-file=~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars -out=$MODULE_NAME.tfplan -input=false

apply.sh applied the changes onto an environment. Developers do not run this command from local to ensure consistency on the environment

#!/bin/bash
set -e

cd $MODULE_NAME

echo "select or create new workspace $WORKSPACE_NAME"
terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME

echo "apply with var file ~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars"
terraform apply -var-file=~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars -auto-approve

Version controlling configuration

We moved the variables into the config directory by making a directory for every branch for each of the 3 environments we had.

terraform
├── config
│   ├── main
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
│   ├── production
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
│   ├── uat
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
├── module-1
│   └── ...
├── module-2
|   └── ...
└── scripts
    ├── provision
    │   ├── apply.sh
    │   ├── functions.sh
    │   ├── init.sh
    │   └── plan.sh
    └── test_variable_names.sh

According to terraform’s documentation, you can export a variable that your terraform codes need with a prefix of TF_VAR.

functions.sh provides convenience functions to read the configuration and secrets.

#!/bin/bash

function fetch_variables() {
    workspace_name=$1
    module_name=$2

    echo $(cat ../config/$workspace_name/$module_name.tfvars | sed '/^$/D' | sed 's/.*/TF_VAR_& /' | tr -d '\n')
}

fetch_variables read the tfvars file, removes empty lines (that were added for readability), prefixed the name with TF_VAR and joined all entries into a single line. The string this method returns can be used as a prefix to the terraform command while running plan and apply making them environment variables.

Updated plan and apply scripts are placed in the secrets management section for brevity

Testing configuration files

The only limitation is that none of these variables can have a hyphen in the name because of shell variable naming rules. As with any potential mistake, a test providing feedback helps protect you from run time failures. test_variable_names.sh does this check for us.

#!/bin/bash

function parse_and_test_properties_entries() {
    prop=$1
    if [[ "$prop" == "" || $prop = \#* ]]; then
        continue
    fi

    key="$(cut -d'=' -f1 <<<"$prop")"
    if [[ $key =~ "-" ]]; then
        echo "$filename contains \"$key\" which contains a hyphen"
        exit 1
    fi
}

function parse_file() {
    filename=$1
    OLD_IFS=$IFS
    props=$(cat $filename)

    IFS=$'\n'
    for prop in ${props[@]}; do
        parse_and_test_properties_entries $prop
    done
    IFS=$OLD_IFS
}

base_dir="config"
for sub_dir in $(find $base_dir -mindepth 1 -maxdepth 1 -type d); do
    workspace_name=${sub_dir#"$base_dir/"}

    for input_file in config/$workspace_name/*.tfvars; do
        parse_file $input_file
    done

    echo "All variables are named correctly in config/$workspace_name"
done

Version controlling secrets

Secrets like passwords can be version controlled in a similar way though they require encryption to keep them safe. We’re using OpenSSL with a symmetric key to encrypt our secrets. Each secret is put into a tfsecrets file (internally a property file just like tfvars files for configuration). When encrypted, the file will have an extension of .tfsecrets.enc. When the plan or apply stages are executed, files are decrypted in memory (and not on disk, for security reasons) and used the same way.

functions.sh gets a new addition to support reading all secrets

function fetch_secrets() {
    workspace_name=$1
    module_name=$2
    secret_key_for_workspace=$(eval "echo \$SECRET_KEY_$workspace_name")
    echo $(openssl enc -aes-256-cbc -d -in ../config/$workspace_name/$module_name.tfsecrets.enc -pass pass:$secret_key_for_workspace | sed '/^$/D' | sed 's/.*/TF_VAR_& /' | tr -d '\n')
}

The astute amongst you probably noticed that we’re using OpenSSL v1.0.2s because v1.1.x changes the syntax on encryption/decryption of files. Also, you might have noticed the use of environment variables like SECRET_KEY_main, SECRET_KEY_uat and SECRET_KEY_production as the encryption keys. These values are stored on our CI server (in our case GitLab) which makes these values available to our CI agent during execution.

For local development, we have scripts to encrypt and decrypt configuration files either one at a time or in bulk per environment. It’s worth noting that re-encryption of the same file will show up on your git diff since the encrypted file’s metadata changes. Only check in encrypted files when their contents have changed (helping you debug future issues)

encrypt.sh takes SECRET_KEY as an environment variable for making local usage easier.

#!/bin/bash
set -e

if [ -z "$SECRET_KEY" ]; then
    echo "Set a SECRET_KEY for \"$WORKSPACE_NAME\" encryption"
    exit 1
fi

function encrypt_file() {
    input_file=$1
    target_file="$input_file.enc"
    echo "Encrypting $input_file to $target_file"
    openssl enc -aes-256-cbc -salt -in $input_file -out $target_file -pass pass:$SECRET_KEY
    rm -f $input_file
}

if [ -z $1 ]; then
    echo "Usage:"
    echo "  ./scripts/encrypt.sh "
    echo "  ./scripts/encrypt.sh all"
    exit 2
elif [ "$1" == "all" ]; then
    for input_file in config/$WORKSPACE_NAME/*.tfsecrets; do
        encrypt_file $input_file
    done
else
    encrypt_file $1
fi

decrypt.sh also takes the same SECRET_KEY as an environment variable for making local usage easier.

#!/bin/bash
set -e

if [ -z "$SECRET_KEY" ]; then
    echo "Set a SECRET_KEY for \"$WORKSPACE_NAME\" decryption"
    exit 1
fi

function decrypt_file() {
    input_file=$1
    target_file=${input_file%".enc"}
    echo "Decrypting $input_file to $target_file"
    openssl enc -aes-256-cbc -d -in $input_file -out $target_file -pass pass:$SECRET_KEY
    rm -f $input_file
}

if [ -z $1 ]; then
    echo "Usage:"
    echo "  ./scripts/decrypt.sh "
    echo "  ./scripts/decrypt.sh all"
    exit 2
elif [ "$1" == "all" ]; then
    for input_file in config/$WORKSPACE_NAME/*.tfsecrets.enc
    do
        decrypt_file $input_file
    done
else
    decrypt_file $1
fi

Testing secret files

If all files for an environment aren’t checked with the same key, you’ll face a runtime error. Since files can be encrypted individually, you must test if all files have been encrypted correctly. This test is also useful when you’re rotating the SECRET_KEY for an environment.

test_encryption.sh needs SECRET_KEY_ values set so it can be executed locally.

#!/bin/bash

base_dir="config"

for sub_dir in $(find $base_dir -mindepth 1 -maxdepth 1 -type d); do
    workspace_name=${sub_dir#"$base_dir/"}
    password_var_name="\$SECRET_KEY_$workspace_name"
    secret_key_for_workspace=$(eval "echo $password_var_name")

    if [ -z "$secret_key_for_workspace" ]; then
        echo "Variable $password_var_name has not been set. Unable to test"
        exit 1
    fi

    for input_file in config/$workspace_name/*.tfsecrets.enc
    do
        openssl enc -aes-256-cbc -d -in $input_file -pass pass:$secret_key_for_workspace &> /dev/null
        if [ $? != 0 ]; then
            echo "Unable to decrypt $input_file with $password_var_name"
            exit 1
        fi
    done

    echo "Successfully decrypted all secrets in config/$workspace_name"
done

End result

Our final project structure contains the following files

terraform
├── config
│   ├── main
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
│   ├── production
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
│   ├── uat
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
├── module-1
│   └── ...
├── module-2
|   └── ...
└── scripts
    ├── decrypt.sh
    ├── encrypt.sh
    ├── provision
    │   ├── apply.sh
    │   ├── functions.sh
    │   ├── init.sh
    │   └── plan.sh
    ├── test_encryption.sh
    └── test_variable_names.sh

plan.sh uses functions.sh to load configuration and secrets

#!/bin/bash
set -e

source $(dirname "$0")/functions.sh

cd $MODULE_NAME

echo "select or create new workspace $WORKSPACE_NAME"
terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME

echo "plan with var file config/$WORKSPACE_NAME/$MODULE_NAME.tfvars"
config=$(fetch_variables $WORKSPACE_NAME $MODULE_NAME)
secrets=$(fetch_secrets $WORKSPACE_NAME $MODULE_NAME)
eval "$secrets $config terraform plan -out=$MODULE_NAME.tfplan -input=false"

apply.sh uses functions.sh in a similar fashion

#!/bin/bash
set -e

source $(dirname "$0")/functions.sh

cd $MODULE_NAME

echo "select or create new workspace $WORKSPACE_NAME"
terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME

echo "apply with var file config/$WORKSPACE_NAME/$MODULE_NAME.tfvars"
config=$(fetch_variables $WORKSPACE_NAME $MODULE_NAME)
secrets=$(fetch_secrets $WORKSPACE_NAME $MODULE_NAME)
eval "$secrets $config terraform apply -auto-approve"

And thus, our terraform project requires no data from the CI agent and can be executed perfectly from any box as long as it has the latest code checked out and the correct version of terraform.

Managing multiple signatures for git repositories

2019-06-11T00:00:00+05:30

Github explains pretty well how to sign commits. You can make it automatic by globally setting commit.gpgsign = true by using

git config --global commit.gpgsign true

What if you have different signatures for your personal ID and your work ID?

First, you create multiple signatures. It is important that the email address in the signature is the same as the one for the user who has authored the commit. Run gpg -K --keyid-format SHORT to see all available keys. The output looks like

/Users/karun/.gnupg/pubring.kbx
-------------------------------
sec   rsa4096/11111111 2019-06-11 [SC]
      1234567890123456789012345678901211111111
uid         [ultimate] Karun Japhet 
ssb   rsa4096/22222222 2019-06-11 [E]

sec   rsa4096/33333333 2019-06-11 [SC]
      0987654321098765432109876543210933333333
uid         [ultimate] Karun Japhet 
ssb   rsa4096/44444444 2019-06-11 [E]

Fetch the ID for each of the signatures. The ID for the personal signature is 11111111 and that for the work signature is 33333333. To assign a signature to the repo, execute git config user.signingkey .

Personally, I have aliases for personal and work signatures and every time I checkout a project, run the alias once.

alias signpersonal= "git config user.signingkey 11111111 && git config user.email \"karun@personal.com\""
alias signwork    = "git config user.signingkey 33333333 && git config user.email \"karun@work.com\""

Run git log --show-signature to verify if a commit used the right signature. Happy commit-signing.

Fixing broken Social logins on your browser

2019-04-16T00:00:00+05:30

Privacy vs Convienience is a constant battle. Personally, I prefer dialing up my privacy up to 11 to avoid being tracked. Every once in a while, social logins are important because it’s the only way to use a service. If this service is an internal company login that only uses social login via the company’s Google ID, you don’t have much of a chance.

If your login just won’t work, try changing the following settings

Privacy Badger

Allow calls to accounts.google.com & apis.google.com

Firefox settings

Allow Third party trackers in Firefox through Settings > Privacy & Security > Cookies > Third-party trackers

The untold guide to troubleshoot Phillips Hue and Google Assistant Integration

2018-11-10T00:00:00+05:30

Recently, I moved into a new home and was setting up my Phillips Hue lights with my Google Home assistants around my house for convenience. I noticed a couple of hick-ups since the last time I did this.

Logging into Phillips Hue app

If the app does not ask you to hit the button on your bridge, your account already has a bridge associated with it.

You can see what bridge is associated with your MeetHue account on the bridges page.

Remove any older bridges you might have on your account and try logging into the Phillips Hue app again. Once complete, you should be able to link your Google Home assistant to your Phillips Hue app.

Other House Keeping for security reasons

You can cleanup how many apps have access to your account and how many other users have access to your bridge. If you see anything that your don’t recognize, remove it. After all, these apps and Hue account users can control the lights in your house. If you don’t know them, remove their access.

In my case, the only users on my bridge are the family members in my house and the only apps I have are the Phillips Hue Android app (for mobile access remotely) and Google (assistant integration).

Efficient logback logging on JVM

2018-11-01T00:00:00+05:30

Efficient logging that doesn’t bring your application down is simple to setup but is often overlooked. Here are some quick tips on how to achieve exactly that

Async Logging

Most applications these days should have a single (console) appender. This can be linked up with your log aggregator of choice. If your application cannot aggregate logs off the console stream, file is your next best alternative.

Wrap each of your appenders with an async appender and add the async appender to your root logger.

Every call to the logger creates a log event. In synchronous logging, that log event was processed and writes were made to all appender streams before the application continued. Since most stream writes involve I/O, this meant the application would wait for I/O before continuining thereby slowing it down. With async logging, the event gets pushed to a log level specific in memory queue. These events are processed and consumed by the appenders asynchronously. Since the application can continue after a log event has been published to the queue, asynchronous logging works quicker (as long as I/O is the long pole in the tent that is publishing log messages)

Here’s a sample configuration:

   name="FILE" class="ch.qos.logback.core.FileAppender">
    myapp.log
    
      %logger{35} - %msg%n
    
   name="ASYNC-FILE" class="ch.qos.logback.classic.AsyncAppender">
     ref="FILE" />
    1024
    false
  
     ref="ASYNC-FILE" />

Every queue has a configurable depth. The depth of the queue is based on how much memory you have and expected ratio in rates of messages coming in through the application and the messages being published through the I/O bottleneck.

If you hit max queue depth on either the WARN or ERROR queues, further statements for those levels become synchronous.

If you hit more than 80% of the max queue depth on any other level, the system will start dropping log statements (due to discardingThreshold=20 by default and neverBlock=true). Therefore, under high load, you can lose INFO, DEBUG and TRACE log messages. This behaviour is acceptable for most cases except specific critical statements (like audit logs). For such cases, you can add asynchronous appenders that are allowed to block.

The percentage of depth after which messages are dropped is configurable. You can make info/debug logs synchronous at 100% too if needed by changing the neverBlock=false (which is the default behaviour).

All of this information is available on logback’s documentation.

Writing log statements

Async logs only work more efficiently because the production of events is synchronous (and hopefully a quick task) and the processing of events (which requires IO) is a slow task.

However if production of log messages takes long time, async logging will not make things better. When you’re printing a large amount of data or if the creation of the log message is an expensive operation, use the following kind of log statement

// style 1: java string interpolation; inefficient and hard to read :P
logger.info("Large object value was " + largeObject1 + " and long operation printed " + largeObject2.longOperation())
// style 2: scala string interpolation; inefficient but easy to read
logger.info(s"Large object value was $largeObject1 and long operation printed ${largeObject2.longOperation()}")
// style 3: logback based string interpolation; efficient but inconvenient to read
logger.info("Large object value was {} and long operation printed {}", largeObject1, largeObject2.longOperation())

While the scala interpolation (style 2) is the easiest to read, we should only do it when the objects being printed are small (small-ish strings or primitives).

Rule of thumb:

For quick statements, use style 2.
For large statements, use style 3 (sacrifices readability for efficiency)
Never use style 1 :P

Using LazyLogging as opposed to creating loggers yourself

Use lazy logging. It internally uses loggers that wraps yours code (during compile time) with if checks to not process log statements if the specific log level doesn’t need to be printed (using macros). Worried about performance due to extra if conditions? You shouldn’t. Modern processors contain black magic called branch prediction that reduce the effect of statements such as this to be effectively nothing.

IMO, every scala project should use lazy logging. It’s light on dependencies and has a nice implementation that makes your logging more efficient run faster for fractionally slower compilation.

The Science in the Art of the Showcase (for distributed teams)

2018-07-03T00:00:00+05:30

Showcases are a key part of our agile ceremonies. We showcase our work to our stakeholders for feedback at the end of every iteration. And as with every presentation, I believe there is a Science in the Art of the Showcase (for distributed teams).

On one of our recent teams, our showcases had challenges. Each of these challenges is a piece of feedback. We added structure to our showcases by running it like a theatre recording TV shows.

This isn’t revolutionary stuff. This is an attempt at defining a structure that should make it easier to organize showcases based off a check-list.

Role

The Master of Ceremonies

The MC is the face of the operations. They are responsible to dessiminate information and keep the crowd engaged. This means that the person should have context about what goes on and how to handle the different failures around client infra (skype issues, VDI issues etc).

Best practices for MCs

Running commentary: Always keep speaking. Is there an issue? Keep the show rolling. Be transparent. Your support (folks below) will keep feeding you information when necessary.

The Stagehand

This is the magician that controls the lighting on stage. This person actually runs the slides and the demos ensuring everything is smooth

Best practices for Stagehands

Practice your demos repeatedly till it’s muscle memory
Ensure the demo windows are already prepared with data entry. Avoid copy pasting unless it cannot be avoided.
Ensure the content on the screen is visible on the media the stakeholders consume it on. If the stakeholders get together in a room and look at a screen projected on the screen or a big TV, please ensure that the font size is readable.

The Conductor

This is the person who runs the show. This person is responsible to stay on the demo co-ordination chat and spot issues and handle them before they become a thing. This person is also responsible to give instant feedback to people running the showcase when needed.

Best practices for The Conductor

Ensure you have an eye on the demo co-ordination chat. Delegate replying to another window if required
Ensure the MC is providing running commentary
Step in only if it is absolutely required
Keep an eye out for schedule

The Theatre Tech

The person who watches the logs and statuses for the services involved in the demo. If there is anything going wrong, talk to the conductor immediately.

Best practices for the Theatre Techs

Have appropriate windows ready to perform the tasks you might need to in a hurry (bouncing services)
Have windows showing instance health
Have log window opens

The Timekeeper

This person is in the room (with clients) and is responsible to keep time. If the discussion goes off, it is your responsibility to cut the discussion off and setup a followup discussion.

If the clients are in multiple locations, have a timekeeper per location. Might be the conductor when available in a location.

The Scribes

Multiple people taking notes and sharing them after the demo. They are responsible to pick up body queues from the people around them and take notes on follow up discussions that we need to have.

Best practices for The Scribe

Be active on a demo co-ordination chat channel and provide instantaneous feedback from different locations. This helps the conductor get more information and is key to their effectiveness.

The Playwright

This person is primarily responsible for the content of the showcase.

The content of a showcase should be like a TV show. A major milestone/deliverable is like a season and should have an overarching story (aka narrative arc). Each showcase is like an episode and should have a subsection of the narrative arc.

The way Presentation Patterns book describes narrative arcs in presentations is true about showcases

~~Presentations~~ Showcases are a form of storytelling; don’t ignore a few thousand years of oratory history. A Narrative Arc is a common trope; organizing your ~~presentation~~ showcase in a similar way leverages your audience’s lifetime of story listening experience.

Execution

Know the people on your team. Identify which team members can do what roles. Invest in and groom people for roles based on their interest, it’s a growth opportunity.

Prep work for the venue

If your client site requires you to book rooms, do so as far out in advance as possible.
If your team is distributed, make sure the room has a good VC with a computer you can use to run the demo. Ensure your laptop can easily connect to the VC equipment in the room.
Know your venue and plan your seating. Presenters closer to the screen. Stakeholders in clean view of the screen and the presenters.

Prep on the day

Sign up for roles based on your skills
Do multiple dry runs
Show up to the room 20 minutes before the start of the meeting. Set it up.

Upgrade everything in brew

2017-10-19T00:00:00+05:30

Homebrew is a the missing package manager for Mac OS. Brew cask extends Homebrew and brings its elegance, simplicity, and speed to Mac OS applications and large binaries alike.

If you’re using these tools and would like to upgrade all of the applications you have, run the following command.

brew update && brew upgrade && (brew cask outdated | cut -f 1 -d " " | xargs brew cask reinstall) && brew cleanup

Breaking it down

Update brew with information from the latest taps: brew update
Upgrade apps in brew: brew upgrade
Update brew cask apps: brew cask outdated | cut -f 1 -d " " | xargs brew cask reinstall
Find outdated cask apps: brew cask outdated
Cut out the app names: cut -f 1 -d " "
Upgrade brew cask apps: xargs brew cask reinstall
Remove installers for brew apps (to release disk space): brew cleanup

Note: brew cask cleanup is now deprecated.

Lombok usage in large enterprises

2017-08-13T00:00:00+05:30

Verbosity of Java

Java is a verbose language. No one disputes it.

Despite the clunky nature of the language syntax, it still is the language of choice in most enterprises. If you work in the services industry or are a technology consultant, chances are that you have to work with Java on a regular basis.

If you’re also a fan of functional programming language and have worked any modern programming language, you’ll recognize that Java’s syntax hinders your productivity because of the large amounts of boilerplate the language will generate. While newer JVM based lanaguages like Kotlin solve these problems in different ways, the open source community created Project Lombok to provide similar syntactic sugar in the world’s most popular enterprise programming language.

What is Lombok?

Lombok is a Java dependency that uses Java annotations to generate byte code straight into the class files during the compilation phase there by allowing the boilerplate code from your codebase to be significantly reduced.

An example from the Software Engineering Trends post from Jan 2010 shows

@Data(staticConstructor="of")
public class Company {
  private final Person founder;
  private String name;
  private List<Person> employees;
}

would generate the same code as

public class Company {
  private final Person founder;
  private String name;
  private List<Person> employees;

  private Company(final Person founder) {
    this.founder = founder;
  }

  public static Company of(final Person founder) {
    return new Company(founder);
  }

  public Person getFounder() {
    return founder;
  }

  public String getName() {
    return name;
  }

  public void setName(final String name) {
    this.name = name;
  }

  public List<Person> getEmployees() {
    return employees;
  }

  public void setEmployees(final List<Person> employees) {
    this.employees = employees;
  }

  @java.lang.Override
  public boolean equals(final java.lang.Object o) {
    if (o == this) return true;
    if (o == null) return false;
    if (o.getClass() != this.getClass()) return false;
    final Company other = (Company)o;
    if (this.founder == null ? other.founder != null : !this.founder.equals(other.founder)) return false;
    if (this.name == null ? other.name != null : !this.name.equals(other.name)) return false;
    if (this.employees == null ? other.employees != null : !this.employees.equals(other.employees)) return false;
    return true;
  }

  @java.lang.Override
  public int hashCode() {
    final int PRIME = 31;
    int result = 1;
    result = result * PRIME + (this.founder == null ? 0 : this.founder.hashCode());
    result = result * PRIME + (this.name == null ? 0 : this.name.hashCode());
    result = result * PRIME + (this.employees == null ? 0 : this.employees.hashCode());
    return result;
  }

  @java.lang.Override
  public java.lang.String toString() {
      return "Company(founder=" + founder + ", name=" + name + ", employees=" + employees + ")";
  }
}

The good

You shouldn’t have to write code that can be generated automatically. Of course, modern IDEs will do this for you with a few clicks of the keyboard

We’re trying to optimize more than a few clicks though. Have a look at the equals method below:

  @java.lang.Override
  public boolean equals(final java.lang.Object o) {
    if (o == this) return true;
    if (o == null) return false;
    if (o.getClass() != this.getClass()) return false;
    final Company other = (Company)o;
    if (this.founder == null ? other.founder != null : !this.founder.equals(other.founder)) return false;
    if (this.name == null ? other.name != null : !this.name.equals(other.name)) return false;
    if (this.employees == null ? other.employees != null : !this.employees.equals(other.employees)) return false;
    return true;
  }

Is this a standard equals method (one where every field in the class is checked for equality)? Did we skip a field? Did we do a non standard check on one of the fields? Unless you go through the method line by line, there is no way to know.

Generating code saves you the hassle of checking. If there is an annotation, you know the what the implementation will be (assuming you know how the framework works). If there’s code, chances are that it’s a non-standard implementation (or someone made a mistake).

The bad

If you wish to check the generated code, you need an IDE that decompiles byte code or a tool that does the same.

If something’s wonky, debugging the issue might not be straight forward.

The downright ugly

Modern IDEs like IntelliJ are built for refactoring. One of the most common refactoring options is the option to Change Signature. It’s an extremely useful option that allows you to reorder method (or constructor) parameters and the IDE takes care of the appropriate changes throughout the codebase.

The order of the constructor parameters in a lombok-fied class is the order in which the parameters are declared. Changing this order changes the constructor signature.

For a class with different parameter types, this is not a problem. Refactoring the following class

@Data
public class Company {
  private final Person founder;
  private String name;
  private List<Person> employees;
}

to the following signature

@Data
public class Company {
  private final Person founder;
  private List<Person> employees;
  private String name;
}

is not a problem. The usage of the constructor will fail to compile and provide feedback.

If you have primitive types in your lombok-fied class, you have a problem. Refactoring the following class

@Data
public class Person {
  private final String employeeId;
  private final String firstName;
  private final String lastName;
}

to the following signature

@Data
public class Person {
  private final String firstName;
  private final String lastName;
  private final String employeeId;
}

will provide no feedback. The code will compile and set employeeIds to firstNames, firstNames to lastNames and lastNames to employeeIds. If you don’t have tests on the behavior of the Person class, you won’t notice this issue until it’s too late. Hopefully, you don’t have tests for a data container with no behavior.

Where is Lombok appropriate?

Do you have a project where you have a strict set of contributors?
- because you’ll have to walk them through the rules of appropriate usage of lombok
Do your contributors understand Lombok well and how it works?
- because you will have unexpected defects due to refactoring if they don’t
Do your contributors understand how to properly unit test and do they understand the automation test pyramid?
- appropriate high level testing could catch functional defects. you don’t want unit tests checking constructors and getters
Do you have strict code quality control?
- without a way to check for inappropriate usage of lombok, defects can very easily creep in
Is the team willing to invest time and effort into training new team members about Lombok and potential downsides?
- your learnings have to be passed to every future member of the codebase
Do most of your models use value objects and avoid primitives?
- because reordering non-primitive fields will lead to compile exceptions providing feedback

If your team can answer yes to all of the above, you should use Lombok.

I must admit, most large teams can’t answer yes to all of the questions. Have you considered using Kotlin instead? :)

Hosting blogs for 1¢ a month

2016-12-05T02:04:56+05:30

If you’re a dev and you self host your blog, I’d love to hear why. Why do you self host blogs? For most simple blogs in this day and age, migration to a static site like Jekyll or Octopress is pretty easy. I did this a while back. This can be followed up by asking Amazon S3 to host your website. You can even get cloudflare to front the SSL for free.

Why? S3 is free for the first year. Even post that period, my bills have been <$0.02/month which is a 99.951% reduction in cost.

Continuous delivery

Snap CI is will integrate with your publically accessible GitHub repositories for free and trigger builds on commit! Connect to your github repository and get it to compile your markdown into html. Deploying to S3 is a piece of cake. Congratulations on having continuous delivery for your blog!

Cloudflare provides the free SSL and Amazon S3 provides the near free hosting. A few cents a month to host your entire website is a good deal!

I’ve been on S3 for a year now and I couldn’t be happier!

** Goodbye Servers, Welcome S3 **

Commonly made mistakes in Unit Testing

2016-02-28T09:24:50+05:30

What is Unit Testing?

Unit testing is all about focusing on one element of the software at a time. This unit is called the often called the ‘System Under Test’ (refer Mocks Aren’t Stubbs). In order to test only one unit at a time, all other units need to not be test at the same time. As obvious as that sounds, it’s easy to miss.

Classes do not exist independent of one another. They usually have dependencies. Such dependencies are called the ‘Collaborators’. There are multiple ways to manage collaborators that have been talked about by Martin in his article.

Pre-requisites to the post before going forward

Before we go on, please ensure you’ve read through Mocks Aren’t Stubbs by Martin Fowler. This post assumes that you’ve gone through the article before continuing on to commonly made mistakes in Unit Testing

Mocks vs Actual Implementations

Consider a board game where the Board class runs the game with the help of it’s collaborators Player and Dice.

public class Board {
  private final List<Player> players;
  private final Dice dice;
  private int currentPlayerId;

  public Board(final Dice dice, final Player... players) {
    this.currentPlayerId = 0;
    this.dice = dice;
    this.players = Arrays.asList(players);
  }

  public void playMove() {
    players.get(currentPlayerId).move(dice.roll());
    currentPlayerId = evaluateNextPlayerId();
  }

  public Player getCurrentPlayer() {
    return players.get(currentPlayerId);
  }

  private int evaluateNextPlayerId() {
    return currentPlayerId + 1 < players.size() ? currentPlayerId + 1 : 0;
  }
}

class Dice {
  public int roll() {
    return (int) Math.round(Math.random() * 6);
  }
}

class Player {
  @lombok.Getter
  private int position;

  public int move(final int moveCount) {
    position += moveCount;
  }
}

If we consider the Board to be the System Under Test, the most tempting trap to fall into is start testing the Board directly.

public class BoardTest {
  @Test
  public void shouldMovePlayerForCorrectPlayer() {
    final Dice dice = new Dice();
    final Player player1 = new Player();
    final Player player2 = new Player();
    final Board board = new Board(dice, player1, player2);

    board.playMove();

    assertThat(player1.getPosition(), greaterThan(0));
    assertThat(player2.getPosition(), is(0));
    assertThat(board.getCurrentPlayer(), is(player2));
  }
}

This is not the greatest example but it does attempt to show you the coupling between the different components. Player1’s current position isn’t predictable since it’s coupling with dice. The dependency also means that if the dice has defects, the board can’t be tested appropriately.

By swapping out player and dice instances with mocks, we have the ability to only test the board independent of potential issues with the dependencies.

The above test can be refactored to look like

public class BoardTest {
  @Test
  public void shouldMovePlayerForCorrectPlayer() {
    final Dice dice = mock(Dice.class);
    final Player player1 = mock(Player.class);
    final Player player2 = mock(Player.class);
    final Board board = new Board(dice, player1, player2);

    when(dice.roll()).thenReturn(3);
    when(player1.move(3)).thenReturn(3);

    board.playMove();

    verify(player1).move(3);
    verify(player2, never()).move(anyInt());
    assertThat(board.getCurrentPlayer(), is(player2));
  }
}

The test now allows you to check if player1 was moved 3 places since the response provided by the dice is in your control. Mocks also allow you to test that player2 was not called.

This becomes even more important in an example where the response from the mock affects the system under test. Controlling the mock allows you to control predict the end state of the system under test with the assumption that your mock setup is correct. These assumptions can be validated with the spec for the individual mocks. The unit test for dice mock can confirm that the dice only returns values between 1 and 6 (inclusive).

Testing inside the boundaries

Every functionality should be tested within it’s boundaries. Let’s take the Dice class as an example and talk about what this means.

Typically a dice produces values between 1 and 6.

It’s corresponding test has to prove that rolling a dice always results in a value between 1-6.

@Test
public void shouldRollValidNumberOnDice() {
  assertThat(new Dice().roll(), isOneOf(1, 2, 3, 4, 5, 6));
}

This test proves that the value is inside the range but does not prove that it will always be in that range. Since the implementation contains a PRNG, the end result cannot be predicted.

Most readers wouldn’t have noticed the defect in the implementation.

class Dice {
  public int roll() {
    return (int) Math.round(Math.random() * 6);
  }
}

The implementation can produce values 0-6. The fact that your test passed proves that it is a flaky unit test. The test has a 1/7 chance of failing. The fact that it didn’t fail when you ran it is not surprising :)

DI, your new best friend

The anti-pattern to take away from the previous example is that the Dice class relies on a library and that the library is contained in the class. The fact that it can’t be injected means that you can’t control it.

Dependency Injection is your friend!

class Dice {
  private final Random random;
  private final int numberOfFaces;

  Dice(final Random random, final int numberOfFaces) {
    this.random = random;
    this.numberOfFaces = numberOfFaces;
  }

  public int roll() {
    return random.nextInt(numberOfFaces - 1) + 1;
  }
}

Now, your test can work with a mocked Random instance for more accurate results.

@Test
public void shouldRollValidNumberOnDice() {
  final Random random = mock(Random.class);
  when(random.nextInt(5)).thenReturn(0, 1, 2, 3, 4, 5);

  final Dice dice = new Dice(random, 6);

  assertThat(dice.roll(), is(1));
  assertThat(dice.roll(), is(2));
  assertThat(dice.roll(), is(3));
  assertThat(dice.roll(), is(4));
  assertThat(dice.roll(), is(5));
  assertThat(dice.roll(), is(6));
}

We’re currently making 2 assumptions on the collaborator.

random.nextInt is always called with parameter 5
random.nextInt(5) always returns values between 0 and 5

The first assumption is in part validated by the mocking library. If Dice called by any other parameter, the results wouldn’t be what we want. But if you want to be extra sure, you could always make the test fail using an argument captor

@Test
public void shouldRollValidNumberOnDice() {
  final Random random = mock(Random.class);
  final ArgumentCaptor<Integer> argumentCaptor = ArgumentCaptor.forClass(Integer.class);
  when(random.nextInt(argumentCaptor.capture())).thenReturn(0, 1, 2, 3, 4, 5);

  final Dice dice = new Dice(random, 6);

  assertThat(dice.roll(), is(1));
  assertThat(dice.roll(), is(2));
  assertThat(dice.roll(), is(3));
  assertThat(dice.roll(), is(4));
  assertThat(dice.roll(), is(5));
  assertThat(dice.roll(), is(6));
  assertThat(argumentCaptor.getAllValues(), is(asList(5, 5, 5, 5, 5, 5)));
}

The second assumption should not be validated by you. If you look at the documentation for random.nextInt() you will notice

/**
 * Returns a pseudorandom, uniformly distributed {@code int} value
 * between 0 (inclusive) and the specified value (exclusive), drawn from
 * this random number generator's sequence.
 ...
 */
public int nextInt(int bound) {...}

It is the responsibility of the library (java.util.Random in this case) to test itself.

How do I know Random will not misbehave? I don’t. The Dice component could be integration tested. It is an absolute necessity if you deem the component to be an untrusted collaborator. If this was a database connection or a REST call, you’d want that. For a Java util or a well tested open source library, you could be forgiven for not writing an integration test.

In this case, I won’t be writing one for sure! ☺

Ramblings of a Coder's Mind

intelligent Engineering: Principles for Building With AI

intelligent Engineering Principles

AI-Native Principles

AI augments, humans stay accountable.

Context is everything.

Smarter AI needs smarter guardrails.

Shape AI deliberately.

Learning never stops.

Timeless Foundations — Reaffirmed for the AI Era

Learn fast, adapt continuously

Sustainable Value over fleeting output.

What This Looks Like in Practice

Building the Skills of an Intelligent Engineer

Core Practices

Deeper Understanding

System Design for AI

Why This Matters

A Closing Thought

Level Up Code Quality with an AI Assistant

Current State

The Journey

Setting up Basic Documentation and Some Automation

Setting up pre-commit for Early Feedback

Curating Code Quality Tools

Creating a Plan

Executing the Plan

Fixing Existing Debt

What is Left?

How to choose your coding assistants

Understanding the tools

Challenges with using these tools

Choice of models

Ease of use

Cost per change

IP ownership indemnity and licensing

What do I use and recommend at this point?

Patterns for AI assisted software development

For people building teams

Focus on value

Journey per software delivery stage, one stage at a time per team

Expect a learning curve

Quality guardrails are a prerequisite

Autonomous agents are far away

Watch out for ‘AI Slop’

Changes to individual responsibilities and team composition over time

Beware of reduced intuition for decision making

For people on teams

The ‘new teammate’ mindset

Small chunks of work

Configure the tool based on your team’s rules

Shift in time spent on different responsibilities

Over-reliance on AI instead of thinking and remembering yourself

How do you know AI is helping software delivery?

Credits

AI for Software Engineering, not (only) Code Generation

Use of AI tools across software delivery

During Analysis

Improved analysis

Changes in roles for Business Analysts and Project Managers

Improved iterative UI/UX design

AI note taking apps for requirement analysis

Improved communication and context

During System Design

During Development and validation

During Deployment and Operationalisation

During Feedback Cycles

How do you enable this transformation

Preparation

The journey

Credits

What makes Developer Experience World-Class?

The Five Non-negotiables

I. Project readme

II. Automated setup

III. Iterate fast

IV. Enforced pre-commit/pre-push checks

V. Everything runs locally

The DevEx Stack

Foundational code practices