
LLM Workflow for CAB

Let's build an LLM workflow that serves on a change advisory board.

Premise

Have you ever received a CAB (change advisory board) submission only for it to be deferred because a key element was absent or an important aspect hadn’t been considered?

The idea here is to build a basic proof-of-concept workflow that reviews change submissions before they reach the CAB, giving the author an opportunity to revise their submission and making it more likely to succeed.

An LLM won’t have all the context it needs, nor a perfect understanding of your environment. That’s okay - we strive to build a workflow with a set of operating principles and guidelines that steer the model to produce useful feedback and strengthen the change control process.

The aim is not to remove humans from the process entirely. A human in the loop is integral and retains final decision authority.

I’d consider this workflow a success if it leads to refined submissions, more expedient processing of changes, and an overall stronger change control process.

Nor is it limited to CAB - the basic principles here would work for any review workflow, I imagine.

This work can serve as a starting point for a broader and more capable AI agent later down the line.

Tools

I’ll be using..

  • Langflow running in Docker
  • Ollama running on a server in my network
  • Phi4, a large language model from Microsoft

..to make this work. If you lack the hardware to run models locally, the Ollama components can easily be swapped out for an LLM API provider such as OpenAI, Anthropic, Google, or another of your choosing.
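To show what that swap looks like in practice, here is a minimal sketch: the same OpenAI-style chat call can target either a local Ollama instance (via its OpenAI-compatible endpoint) or a hosted provider, changing only the base URL, API key, and model name. The endpoint, placeholder key, and model names are assumptions based on common defaults, not part of the Langflow setup itself.

```python
# Minimal sketch: the same chat-completion call can point at a local
# Ollama server (OpenAI-compatible endpoint) or a hosted provider.
# Assumes the `openai` Python package and a locally pulled phi4 model.
from openai import OpenAI

# Local: Ollama's OpenAI-compatible API on its default port 11434.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hosted alternative (hypothetical, for illustration):
# client = OpenAI(api_key="sk-...")  # and model="gpt-4o" or similar below

response = client.chat.completions.create(
    model="phi4",
    messages=[{"role": "user", "content": "Summarise this change in one line: upgrade core switch firmware."}],
)
print(response.choices[0].message.content)
```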

Designing a workflow

Let’s consider, broadly, the approaches we can take:

  • Single-pass
    • Regular model with tuned system prompt
    • CoT model (chain-of-thought) built-in
    • RAG with environmental context
  • Multi-pass
    • Regular model with tuned system prompt
    • CoT model (chain-of-thought) built-in
    • RAG with environmental context
    • Role-playing
  • Agentic
    • RAG with environmental context
    • Role-playing
    • Tool or function calling

Additional considerations:

  • System Prompt Tuning:
    • does a highly detailed system prompt produce better outcomes, or is there a point of diminishing returns?
    • how closely can a model adhere to instructions and response formats?
  • Retrieval-Augmented Generation (RAG):
    • can environmental context - e.g. infrastructure code, documentation, emails - impact the quality of the evaluation?
    • can a model even make sense of all the data we’re throwing at it?
    • can additional context lead to better evaluation or just more nit-picking and repetition?
  • Role-playing:
    • does simulating the perspectives of stakeholders improve the overall quality of the analysis? e.g. a simulated interaction between an engineer and a manager.
    • adversarial models - one model tries to aggressively justify the change to go ahead, whilst the other argues for it to be rejected. Does that lead to better discovery of edge cases?
  • Agents:
    • realistically, how far can we break down a task before we see diminishing returns?
    • can multiple tuned experts produce better results than just a clever system prompt?
  • Tool Use:
    • what kind of tools are available to the LLM and how can they expand its reach?
    • is there a limit to how useful tools can be? Beyond various means of information retrieval, do we want an agent to have the capability to independently test a change? Would that even be possible or are we encroaching on the responsibilities of engineers?
  • Context window:
    • can we fit all the relevant information and supporting evidence in-context?
    • is there a trade-off to be made between cheaper models vs more capable models? probably not at the scale we’re working with.
    • RAG helps when data to be evaluated exceeds context window, but then again - can a model make sense of all that data?
  • Model Selection:
    • does using different LLMs produce better results?
    • are there any biases or weaknesses in specific models that can be overcome by mixing multiple models at different stages of the analysis?

To keep things manageable, I’m going to start off with a multi-pass approach using Microsoft’s Phi4 model, leaving RAG, role-play, and tool-calling agents for later iterations.
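To make “multi-pass” concrete, here is a rough sketch of the idea outside of Langflow: the submission is run through several focused review prompts in sequence, so each pass only has to worry about one aspect. The pass names, prompt wording, and the use of Ollama’s OpenAI-compatible endpoint are my own assumptions, not the exact prompts used in the flow.

```python
# Rough sketch of a multi-pass review: one focused LLM call per aspect.
# Assumes Ollama serving phi4 locally; prompts are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PASSES = {
    "basics": "Check the submission for completeness: title, description, schedule, rollback plan.",
    "risks": "Identify the risks and whether a mitigation is stated for each.",
    "resources": "Assess the resource and capacity implications of the change.",
    "communications": "Review the communications plan and required approvals.",
}

def review(submission: str) -> dict[str, str]:
    """Run the submission through each focused pass and collect the findings."""
    findings = {}
    for name, instruction in PASSES.items():
        result = client.chat.completions.create(
            model="phi4",
            messages=[
                {"role": "system", "content": "You are a change advisory board reviewer. " + instruction},
                {"role": "user", "content": submission},
            ],
        )
        findings[name] = result.choices[0].message.content
    return findings
```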

Prompt tuning

To let us iterate faster, I’m including a CAB generator - just another LLM call with a custom prompt - to generate a unique CAB submission on each run. This gives me a large variety to test with, and I’m hoping it will help me ascertain the effectiveness of the workflow across a wide spectrum of inputs.
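A sketch of what such a generator call might look like; the prompt and topic list below are illustrative assumptions rather than the ones in the template, with a random topic injected so repeated runs don’t converge on the same submission.

```python
# Sketch of the CAB generator: one LLM call with a custom prompt that
# produces a fresh, fictional change submission per run.
import random
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

TOPICS = ["firewall rule change", "database version upgrade", "DNS migration", "patching window"]

def generate_submission() -> str:
    topic = random.choice(TOPICS)
    result = client.chat.completions.create(
        model="phi4",
        temperature=1.0,  # encourage variety between runs
        messages=[{
            "role": "user",
            "content": (
                f"Write a fictional CAB change submission about a {topic}. "
                "Include title, description, schedule, risk assessment, "
                "rollback plan, and communications plan. Deliberately leave "
                "one or two areas weak so the reviewer has something to find."
            ),
        }],
    )
    return result.choices[0].message.content
```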

illustration of the testing process

The workflow

This is the logical flow at a high level. In my testing, the “human” is substituted with an LLM generating the inputs. We can swap between regular, CoT, and other models to see what kind of results we get.

illustration of the workflow at a high level

Langflow Implementation

  • Langflow 1.1.1 running in docker
  • ollama running on a server in my network
  • Phi4 large language model from Microsoft.

Workflow floorplan in Langflow

With everything set up, let’s run the workflow and..

Run through the results

First, the model generates a change..

..it reviews the basics..

..then the risks..

..then resource implications..

.. communications plan and approvals..

.. and now collate the responses, do another review, and generate a recommendation.
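In spirit, that collation step is just one more call that takes the per-stage outputs as context. Here is a hedged sketch of such a final pass; the prompt wording and the APPROVE/REVISE/REJECT framing are my assumptions, not the exact node configuration in the flow.

```python
# Sketch of the collation/recommendation pass: fold the per-stage findings
# into one prompt and ask the model for an advisory verdict.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def recommend(submission: str, findings: dict[str, str]) -> str:
    summary = "\n\n".join(f"## {name}\n{text}" for name, text in findings.items())
    result = client.chat.completions.create(
        model="phi4",
        messages=[
            {"role": "system", "content": (
                "You are a change advisory board chair. Review the submission and the "
                "stage findings, then recommend APPROVE, REVISE, or REJECT with reasons. "
                "A human makes the final decision; phrase this as advice."
            )},
            {"role": "user", "content": f"Submission:\n{submission}\n\nFindings:\n{summary}"},
        ],
    )
    return result.choices[0].message.content
```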

Setup

If you want to try any of this yourself, I’ve uploaded a Docker Compose file and the Langflow template to a GitHub repo. Follow the steps below to reproduce my setup; I’m assuming you’ve already got Docker and Ollama installed. If anything refuses to start, there’s a quick preflight check sketched after the steps.

  1. git clone https://github.com/MRKups/llm-workflow-cab-1.git, or download from web
  2. cd llm-workflow-cab-1
  3. docker compose up -d
  4. Open a terminal and run ollama serve
  5. Open http://localhost:7860/
  6. Select New Flow, Blank Flow
  7. Click on flow title (top center), Import
  8. Import cab-workflow-1.json
  9. Run the workflow
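If step 9 fails, a small script can confirm both backends are reachable before digging into the flow itself. This is only a sketch assuming the default ports from the steps above (Langflow on 7860, Ollama on 11434) and the `requests` package.

```python
# Preflight sketch: confirm Langflow and Ollama are up and phi4 is pulled.
import requests

def check() -> None:
    langflow = requests.get("http://localhost:7860/", timeout=5)
    print("Langflow reachable:", langflow.ok)

    # Ollama's /api/tags endpoint lists locally available models.
    tags = requests.get("http://localhost:11434/api/tags", timeout=5)
    models = [m["name"] for m in tags.json().get("models", [])]
    print("Ollama models:", models)
    if not any(name.startswith("phi4") for name in models):
        print("phi4 not found - run: ollama pull phi4")

if __name__ == "__main__":
    check()
```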

Closing remarks

Overall, I’m quite pleased with how this turned out. While the quality of the final output is heavily dependent on the model you’re using, the results produced by Phi4 are very good. It was able to extract all the relevant information from the change, evaluate it, and correctly highlight areas of weakness while calling attention to unaddressed risks.

Large language model workflows and ‘AI’ agents are opening up a new frontier, bringing the next wave of automation and creating new opportunities. Let’s remember - models and integrations will only improve as time goes on, enabling new use cases and applications.

I hope this got you thinking about how these tools can be applied to everyday problems and what else may be possible with time.

All content authored by a human.
