That’s right folks, it’s another AI post on some rando’s blog. Whoopee.
To be fair, I’m not quite sure where this is going to go. I’d like to think it’ll firmly place my ass on the top of the fence, not quite ready to jump to one side or the other, but I guess I’ll leave that to the reader to figure out.
I’m also not going to bore you by chronicling, in something slightly more than minor detail, my years of experience with AI tooling across a variety of domains. Rather, I’m just going to go straight to my musings.
LaaJ, or How I Learned to Like Smoking
Because I’m a moron, I have no way of proving mathematically or otherwise that what I’m about to opine on here is of any merit whatsoever. That said, if someone smarter than me (not hard) wants to chime in, feel free to do so.
I’ve been involved in quite a few conversations, professionally and otherwise, about LLM output consistency checking and validation. It’s an important pillar in ensuring, principally, the success of a model in a real-world environment, but also, IMHO, a crucial step towards AI tool adoption in the private sector for those who tend to give slightly more of a shit than others. It’s no big secret that AI tooling generates contextually and factually incorrect statements more often than the AI bois would like to admit. I’d need about eleven more pairs of hands to count just half the times this has happened to me over the past four years, and I can speak to what this looks like in a corporate IT setting, which is equally disheartening; it’ll definitely take the wind out of your sails. Thus, obvious questions emerge about reducing “hallucinations”, or filtering out undesired outcomes, given the context, by some means.
Keeping in mind the axiom garbage in, garbage out, one can, with just a passing familiarity with prompt/context engineering (is that even a thing anymore?), prime the pump and maybe reduce the noise. That said, however good an initial prompt may be, especially if it’s designed as a multi-shot prompt pattern with plenty of output templates, and however you’ve tuned the parameters available to you as a user, with or without API access, the outcome is still neither deterministic nor comprehensive. It’s pretty difficult to overstate this, especially for me, since it’s an aspect of AI tool usage I take very seriously.
Accepting the output of an AI tool prima facie, regardless of the methods used to augment said output, is at best wholesale ignorance and at worst criminal. If you have no interest in or methods for validating the output of an AI tool, you likely shouldn’t be using one. Period. And I’ll die on that hill.
I could easily segue off into another topic here about what exactly goes wrong with AI generation in several contexts, and perhaps that’s worthwhile, but I’ll do it in the next section.
I’ll focus on one specific example from work, although I’ll make it as abstract as possible for obvious reasons. A very common use case for AI tooling is parsing and summarizing a documentation corpus. Personally, I use AI tooling for this both inside and outside work. If I need to understand something nuanced about a specific Linux command that’s in the man pages, or find out something about a part of an operating system, or understand polymorphic function call permutations, I’ll generally use an AI tool to help me. Now, an obvious question here is why, when I can just look it up myself, and the answer is that if companies would stop being asshats and keep their documentation easy to read, AND IN THE SAME F^%&#$@ PLACE, I wouldn’t have to use an AI tool for this. Aside from that, and perhaps more germane to the task I’m about to detail, AI tooling can be used as a research accelerator along a variety of axes. This is doubly true when the research effort is spread out amongst everyone on your team, each focusing on different topic areas.
Simply put, each member of my team was tasked with researching specific aspects of an operating system upgrade, and, being the cute little early adopter and evangelist that I am, I suggested we attempt to use our in-house AI tooling as said research accelerant. We’re each regularly at 100% capacity, time is short, and we needed that work done yesterday. Now, I knew exactly what I was asking for by doing this: taking on the role of parting the sea to shepherd the flock to greener (presumably) pastures AND meeting my own obligations to the goal. Being honest, not only are we not really an AI-centric team, but most folks would rather not be bothered with the topic at all for one reason or another. My job, then, was doing the research and winning hearts and minds. Fine. I’m up to the task. I’ve got three kids under 10. I know how this works.
A wealth of cards distributed on our Agile board and a lunch-and-learn facilitated by yours truly demonstrating prompting techniques, parameter tuning, and manual validation practices later, I said make it so. And… well…
For the sake of the immediate topic, I’m going to put aside all the trials and tribulations, and focus on one thread that proved really interesting.
One of my teammates took it upon themselves to start running consistency checks on their results, and was doing so in a way that was, AFAIK, sound: generate three to five results using the exact same prompts and parameter values with one model, then use a separate model to evaluate the result set using specific positive and negative prompts as heuristics. Intuitively, this didn’t make much sense to me for a variety of reasons, but it proved to be an eye-opening exercise regardless. On the whole, defining success as the overlap between the items in the set (which still says nothing of factual accuracy, of course), the AI was successful about 65% of the time. Meaning, prima facie, there’s a potential inaccuracy rate of 35%.
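For the curious, the check my teammate ran can be sketched roughly like this. Everything here is a stand-in: generate() fakes the model call with canned responses (one of them deliberately wrong), and the overlap metric is plain token-set Jaccard rather than whatever the evaluating model actually scored:

```python
# Sketch of the consistency check: sample the same prompt several times,
# then score pairwise overlap of the responses. generate() is a stub for
# a real model call; jaccard() is a toy overlap metric.

def generate(prompt: str, sample: int) -> str:
    # Placeholder for an actual LLM API call; returns canned variants,
    # one of which contains a deliberate hallucination.
    canned = [
        "use malloc and free to manage heap memory in C",
        "use malloc and free for heap memory management in C",
        "use the new keyword to allocate heap memory in C",  # wrong: that's C++
    ]
    return canned[sample % len(canned)]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def consistency(prompt: str, n: int = 3) -> float:
    # Mean pairwise overlap across n samples of the same prompt.
    outs = [generate(prompt, i) for i in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(jaccard(outs[i], outs[j]) for i, j in pairs) / len(pairs)

print(f"mean pairwise overlap: {consistency('How do I manage memory in C?'):.2f}")
```

Even on this toy data, one bad sample out of three drags the mean overlap well below 1.0, which is roughly what the 65% figure was telling us.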
Of course, these numbers and processes require substantial dissection (beyond the scope of this post), and they’re likely not all that accurate relative to the objective: evaluating the so-called truthfulness of the responses an AI tool gives. Whatever this value is, it then propagates towards the overarching objective, which is the accuracy of the research effort. What we’re trying to avoid is an engineer using an AI tool for research, likely treating a single response as gospel despite stern reminders to the contrary, and extrapolating from it actionable products like computer code or attestations to their work commitments. However, with a 35% delta in just the response content, the team’s foundational trust in the tooling diminished quite a bit. For some, it effectively rug pulled all the work I’d done to that point to even get these people to just look at the tool, let alone use it.
Because I know some people are screaming at their monitors: I absolutely understand that percentage doesn’t mean much, to either you or me. It does mean something to general consumers. And, upon further investigation into the actual content of the results, there were demonstrable errors and inaccuracies. Some facts were downright wrong; for others, the tool claimed it had backed them up with references to external documentation sources, but when checked, the external source contained no such reference or data. For the average user, this is more than enough to seal the deal on their alignment with the AI is bullshit camp; who can blame them? Generally speaking, if a tool fails or ceases to be useful, a craftsman will replace it or look elsewhere. (Seasoned) IT Systems Engineers are no different.
And here’s yet another avenue where anthropomorphism comes into play, which boils my piss to no end. “The AI is lying”. Or “The AI is hallucinating.”
Guys, this is a nuclear reactor-powered auto-complete tool. It’s not a sentient being, nor is it on its way towards one. In fact, there are questions being asked by major players in the lab/frontier space about whether LLMs are a kind of dead end: see here and here. And yes, I’m well aware of the feedback to Sutton’s recent podcast with Dwarkesh.
Anyway, as enterprises are wont to do, methods for curtailing the 35%, as it were, became the focal point of conversation. Without putting much stock in the matter, it’s not hard to understand how one potential solution could be using a secondary or even tertiary LLM as a so-called judge of the output of a different LLM. I myself had fallen out of my chair at the overwhelming stupidity of such a suggestion, until I discovered that not only is this not a novel concept in the AI space, but there are entire products of such design. After doing some reading on the subject, I remain unconvinced this is a reliable avenue of interest, and even more perplexed that it has somehow expanded beyond the academic realm.
The supposed whiz and bang of the LLM-as-a-Judge (LaaJ) process, as I understand it, is the scalability, the supposed consistency, and the reduction of Human-in-the-Loop (HITL) occurrences during evaluation, opening the gates for further streamlined and advanced automation opportunities. And while this all sounds well and good, especially for autonomous agents that are proactive rather than reactive, it still raises a lot of obvious questions that, I feel, don’t have great answers. Navigating this space feels to me like it does when I talk about Test-Driven Development (TDD). Some people are like IYKYK, and I’m like WTF.
Three examples justifying LaaJ have been presented to me:
- Regression Testing: LaaJ can grade test questions several orders of magnitude faster than HITL for a fraction of the monetary cost (supposedly).
- Inline Evaluation: Judge models can be used in real-time to screen or filter either user or assistant messages to determine suitability (e.g. does a message contain vulgar language).
- Tooling Optimization (RAG Optimization): If a prompt is supplemented with RAG, the Judge verifies that the retrieval actually returned what it claimed to.
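To make the inline evaluation case concrete, here’s a minimal sketch of a judge sitting between the user and the assistant. The keyword check is a toy stand-in for a real judge-model call; JUDGE_PROMPT, BLOCKLIST, and relay() are all invented for illustration:

```python
# Minimal sketch of inline evaluation: a judge vets each message before
# it's relayed. The set-intersection check below stands in for a real
# judge-model call; a production system would send JUDGE_PROMPT to a
# second LLM and parse its verdict instead.

JUDGE_PROMPT = (
    "You are a content judge. Answer PASS if the message is suitable "
    "for a professional setting, otherwise FAIL.\n\nMessage: {msg}"
)

BLOCKLIST = {"vulgar", "slur"}  # toy heuristic standing in for the judge LLM

def judge(message: str) -> bool:
    # Real version: verdict = llm(JUDGE_PROMPT.format(msg=message))
    return not (set(message.lower().split()) & BLOCKLIST)

def relay(message: str) -> str:
    # Pass the message through, or withhold it on a failed verdict.
    return message if judge(message) else "[message withheld by judge]"

print(relay("please summarize the upgrade notes"))
print(relay("this doc is vulgar nonsense"))
```

Note that in the real architecture the judge itself is an LLM, so every message pays a second model invocation in latency and cost, which matters for the "best solution?" question below.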
I’m not really interested in the academic testing portions of these things, but there are evidently a few buffs here that LaaJ offers in terms of performance.
However, the intuitive question is about cascading probabilities and their net effect over both iterations and the yield. More plainly: you know for certain there’s a greater-than-zero margin for error baked into each assistant message (regardless of whether the quality of the user message contributes to it), and you know that augmentation techniques like RAG and MCP demonstrably reduce but don’t eliminate that margin. If you then rely on tools of the same genus, if not the same make and model, to evaluate an output set from a similar tool, how can you assume, with any degree of certainty, that the inherent error margin didn’t propagate or expand during the evaluation phase and influence the outcome?
I’ll try this again. If I ask a model how to manage memory correctly in the C programming language, and it tells me that new is the keyword for dynamically allocating memory (it’s not, BTW; that’s C++), and I pass the output through a single judge or a series of judges to assert its validity, what guarantees do I have that the subsequent LLMs, as heavily augmented as they are, won’t fall prey to their own inherent shortcomings? Or, worse still, are there emergent pitfalls, potentially unique to these kinds of architectures, to look out for?
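You can put rough numbers on that worry, if you’re willing to make the big (and, I’d argue, dubious) assumption that generator and judge errors are independent. Using the 65% figure from earlier as the generator’s accuracy, Bayes tells you how much a judge of a given skill actually buys you:

```python
# Back-of-envelope compounding, assuming (big assumption) generator and
# judge errors are independent. p: chance the generator is right.
# q: chance the judge classifies any given answer correctly.

def p_correct_given_approved(p: float, q: float) -> float:
    # Bayes: of everything the judge lets through, how much is right?
    approved_correct = p * q              # right answer, judge passes it
    approved_wrong = (1 - p) * (1 - q)    # wrong answer, judge misses it
    return approved_correct / (approved_correct + approved_wrong)

# The 65% figure from the experiment, against judges of varying skill:
for q in (0.95, 0.80, 0.65):
    print(f"judge accuracy {q:.2f} -> {p_correct_given_approved(0.65, q):.2f}")
```

The catch is the independence assumption: if the judge shares training corpora and blind spots with the generator, their errors correlate, and the improvement the arithmetic promises evaporates with it. That’s precisely the propagation question.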
Justifying LLM activity with Regression Testing always pisses me off because it inexorably involves head count, specifically RIF. It’s the quintessential corporate excuse: accomplish X goal faster, and by that metric alone make the cost number go down, then RIF. I don’t really need to say any more about this. Inline evaluations have obvious merit, and it’s hard to argue against them from the perspective of work product. One has to be able to render assertions against an LLM’s output. Is LaaJ the best way to accomplish that? Remains to be seen. Tooling Optimization is just more bullshit. Specific to RAG, users generally need to explicitly provide their content to the model via the tool, so it’s kinda hard to see why you’d need to assert the retrieval. Now, if the grounding mechanism is the Internet (which seems hysterical, given that the Internet was the corpus on which these models were trained), I can almost kind of see how LaaJ comes into play. But again, is this the best solution?
There are also a few properties of LaaJ architecture that vex me:
- So-called “Gold Standard” data, which is basically data pre-tagged by a human clearly stating valid results for a given problem domain.
- Capability asymmetry. LaaJ wants you to be doing your work with “dumb” models and then give your results to “smart” models.
- CoT. Nothing really to say about this. Anyone using an LLM should already be invoking CoT.
- Heavy MCP Use. Some judge models will have MCP tooling available to them so they can, for example, compile and run computer code to assess the output for validity.
- Judge Panels. This is just a fancy way of saying multiple Judge LLMs in the same workflow. For example, one could be Gemini, another Claude, another Grok, and another DeepSeek, all evaluating the output of GPT (because we all know GPT is totally unc now).
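Mechanically, a Judge Panel reduces to majority voting over verdicts. A minimal sketch, with stub lambdas standing in for the actual API calls to Gemini, Claude, et al.:

```python
# Sketch of a judge panel: several judge models each return a verdict on
# the same candidate output, and a simple majority decides. The judges
# here are stubbed lambdas; real ones would be API calls.

from collections import Counter

def panel_verdict(candidate: str, judges) -> str:
    # Tally each judge's verdict and take the majority.
    votes = Counter(j(candidate) for j in judges)
    verdict, _ = votes.most_common(1)[0]
    return verdict

# Toy judges: each "model" just checks a different property of the text.
judges = [
    lambda s: "pass" if "malloc" in s else "fail",   # stands in for judge A
    lambda s: "pass" if "free" in s else "fail",     # stands in for judge B
    lambda s: "pass" if "new" not in s else "fail",  # stands in for judge C
]

print(panel_verdict("allocate with malloc, release with free", judges))
print(panel_verdict("use the new keyword", judges))
```

The voting is trivial; the open question is whether the judges' errors are independent enough for the vote to mean anything, which loops right back to the propagation problem above.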
With the exception of the domain-specific capability asymmetries and Judge Panels, this is all what you should be doing anyway with a plain old LLM in the first place, especially ones you train on your own datasets. So, it’s still hard for me to understand how any of this, whatsoever, addresses the core issues: reducing error probabilities in the result set, and handling any that could emerge during the judging process. It’s still an LLM doing LLM shit, just to another LLM. Not to mention that these kinds of tooling augmentations are historically a sign of something being not quite right with your core technology. Another way to evaluate this, especially if you’re the bean counter type: what holistic cost savings are you actually achieving by scaffolding all this and hoping the results improve, versus up-front sanitized and tagged data, proper RAG implementations, and just more savvy users who’ll take the time to validate the output?
But goise! There’s AI MONEY!
Of course, none of this says anything at all about the actual cost of using this technique. The direct environmental impacts of garden-variety LLM use are well documented, and LaaJ just multiplies the compute, for what net benefit?
So the inevitable question is how does LaaJ fit into the IT Systems Engineering space?
My belief is that it sort of, kind of doesn’t.
A Sober Approach
IMO, there are what I’ll call four pillars of AI integrations in IT Systems Engineering, and for some of these, I’ll hijack buzzwords so I sound cooler than I really am: LLMOps, Automation Augmentations/Assimilations, Agentic Applications, and Security.
LLMOps (See? I used one!) focuses on augmenting Tier 1 and Tier 2 teams in any operational activity. This is the super sketchy area where people can and have suffered redundancies. However, my belief is that no matter how much you train and reinforce a model, the current iteration of LLMs is never going to be suitable to replace people en masse. Don’t believe me? How many dumpster fire companies liquidated a significant portion of their staff, blamed it on AI, then within half a year hired most of them back? Let’s stop being stupid about this. Anyway, common examples in this space could be a documentarian (a perfect use for a chat bot), a process and workflow facilitator, and a problem tutor (there’s an excellent example of this already in the form of Red Hat’s Lightspeed).
Obviously I’m not accounting for what could occur in this space outside these examples, but let’s assume that this is it. Is there an opportunity for LaaJ? A charitable take could be yes, however it’s also very easy to see how any results here could be verified by senior engineering. All your engineering tiers should be working closely anyway despite the stratification and its very valid purpose. It doesn’t cost the company any more money for a bubbling topic to get to senior or lead levels, especially if it ensures continued sane operation. In any case, senior and lead engineers should be setting pace for juniors and associates, so it follows naturally that this is just due course. It’s a very organic and suitable method for addressing these matters. No need to augment or replace it.
Automation Augmentations/Assimilations describes how AI tooling could affect incumbent or emergent automation patterns/tooling/workflows/practices. Automata here typically revolve around server buildouts, initial state provisioning, middleware deployments, post-deployment state maintenance, health monitoring, general maintenance, and infrastructure upgrades, to name a few. The complexity of these automata can and will range from very simple to very complex (unfortunately) and will likely involve more than a few teams, depending on how your org is structured. This aspect will likely complicate adoption and implementation matters, but that’s a topic for another day. The AI question here is multi-faceted: how can AI add value to existing automata, and could AI prove to be a more effective arbiter over a specific process than established or projected automation constructs? As dystopian adjacent as that latter point sounds, it’s not a farfetched idea. All the pieces are in place already; it’s simply a matter of continued work effort and time.
This pillar is interesting because it’s a composition of a vast array of technologies with both overlapping and competing purposes. For example, what kinds of automation can you use for server buildouts? Fancy ones like Terraform, Pulumi, and Attune are common, but so too are more manual methods relying on Bash and/or PowerShell scripts. Some of the calculus here is org size; the rest is engineering mindset. Could AI possibly augment this specific process? Of course. Code that you write for any of these tools could be evaluated by AI in a continuous improvement effort, newer engineers could use AI to summarize the codebase, or AI could even write new code for the environment (with varying degrees of success). Is there room for LaaJ? Possibly, but IMHO not in an immediately obvious way.
Despite my derision at the dearth of idempotency in AI output, a pipe dream I’ve long envisioned is a proactive agentic AI that would, when coupled with predictive models, preemptively intercede on a server upon noticing a specific trend towards a well-defined unhealthy state. On the other hand, if co-morbidities (hey, if frontier labs can anthropomorphise, so too can I) emerge that exhibit overlapping magisteria but don’t lead to the same or any other unhealthy state, the agent could notify an operations or engineering team about the impending doom.
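As a toy illustration of that pipe dream, here’s what the predictive piece could reduce to in its crudest form: fit a trend to a rolling window of some health metric (memory utilization, say), then either intercede or page a human depending on how close the trend says you are to the unhealthy threshold. The thresholds and the intercede/notify split are invented for illustration:

```python
# Crude stand-in for the predictive model: least-squares trend over a
# rolling window of a health metric, with an invented policy of
# "intercede if unhealthy is imminent, notify a human if merely trending".

def slope(samples):
    # Least-squares slope over evenly spaced samples.
    n = len(samples)
    xm = (n - 1) / 2
    ym = sum(samples) / n
    num = sum((i - xm) * (y - ym) for i, y in enumerate(samples))
    den = sum((i - xm) ** 2 for i in range(n))
    return num / den

def assess(mem_pct: list, unhealthy: float = 90.0) -> str:
    trend = slope(mem_pct)
    if trend <= 0:
        return "healthy"
    steps_to_unhealthy = (unhealthy - mem_pct[-1]) / trend
    # Imminent: intercede automatically. Merely trending: tell a human.
    return "intercede" if steps_to_unhealthy < 5 else "notify"

print(assess([70, 74, 79, 85, 88]))   # climbing fast toward 90
print(assess([60, 61, 62, 62, 63]))   # creeping; humans can decide
print(assess([75, 73, 72, 70, 69]))   # recovering
```

A real implementation would swap the linear fit for an actual predictive model and the policy for something an agent could act on, but the intercede-versus-notify split is the part that keeps a human in the loop where it counts.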