Author(s): Nate Liebmann

Originally published on Towards AI.

A Tame Oracle. Generated with Microsoft Designer

With the second anniversary of the ChatGPT earthquake right around the corner, the rush to build useful applications based on large language models (LLMs) like it seems to be in full force. But despite the aura of magic surrounding demos of LLM agents or involved conversations, I am sure many can relate to my own experience developing LLM-based applications: you start with some example that seems to work great, but buyer's remorse soon follows. Other variations of the task can simply fail miserably, with no clear differentiator, and agentic flows can reveal their tendency to diverge once they stray from the original prototyping happy path.

If not for the title, you might have thought at this point that I was a generative AI luddite, which could not be further from the truth. The journey my team at Torq and I have been on over the past two years, developing LLM-based software features that enhance the no-code automation building experience on our platform, has taught me a lot about the great power LLMs bring, if handled correctly. From here on I will discuss three core principles that guide our development and allow our agents to reach successful production deployment and real customer utility. I believe they are just as relevant to other LLM-based applications.

❶ The least freedom principle

LLMs interact through free text, but that is not always how our users will interact with our LLM-based application. In many cases, even if the input is indeed a textual description provided by the user, the output is much more structured, and could be used to take actions in the application automatically. In such a setting, the LLM's great power to solve tasks that would otherwise require massive, complex deterministic logic or human intervention can turn into a problem. The more leeway we give the LLM, the more prone our application is to hallucinations and diverging agentic flows. Therefore, à la the principle of least privilege in security, I believe it is important to constrain the LLM as much as possible.

Fig. 1: The unconstrained, multi-step agentic flow

Consider an agent that takes a snapshot of a hand-written grocery list, extracts the text via OCR, locates the most relevant items in stock, and prepares an order. It may sound tempting to opt for a flexible multi-step agentic flow in which the agent can use methods such as search_product and add_to_order (see fig. 1 above). However, this process could turn out to be very slow, consist of superfluous steps, and might even get stuck in a loop if some function call returns an error the model struggles to recover from. An alternative approach constrains the flow to two steps: the first is a batch search that returns a filtered product tree object, and the second generates the order based on it, referencing appropriate products from the partial product tree returned by the search call (see fig. 2 below). Apart from the clear performance benefit, we can be much more confident the agent will remain on track and complete the task.

Fig. 2: A structured agentic flow with deterministic auto-fixing
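To make this concrete, here is a minimal sketch of such a constrained two-step flow. The helper names (ocr_grocery_list, batch_search_products, call_llm) and the JSON shape of the order are illustrative assumptions, not the actual implementation behind the figures:

```python
import json

# Placeholder stubs: in a real system these would wrap an OCR service,
# the stock catalog, and an LLM client respectively.
def ocr_grocery_list(image_bytes: bytes) -> list[str]: ...
def batch_search_products(terms: list[str]) -> dict: ...   # returns a filtered product tree
def call_llm(prompt: str) -> str: ...                       # returns the raw completion text

def build_order(image_bytes: bytes) -> dict:
    """Constrained two-step flow: one deterministic batch search, one LLM call."""
    # Step 1 (deterministic): OCR the hand-written list and fetch a filtered product tree.
    items = ocr_grocery_list(image_bytes)
    product_tree = batch_search_products(items)

    # Step 2 (single LLM call): map the items onto products in the tree,
    # asking for structured JSON that references products by their path.
    prompt = (
        "Match each grocery item to a product in the tree below.\n"
        f"Items: {items}\n"
        f"Product tree: {json.dumps(product_tree)}\n"
        'Return JSON: {"order": [{"item": ..., "product_path": ..., "quantity": ...}]}'
    )
    return json.loads(call_llm(prompt))
```

Because the single LLM call receives the filtered product tree up front and must answer in one shot, there is no tool-calling loop for it to get stuck in.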
When dealing with problems in the generated output, I believe it is best to do as much of the correction as possible deterministically, without involving the LLM again. This is because, counter to our intuition, sending an error back to an LLM agent and asking it to correct it does not always get it back on track, and may even increase the likelihood of further errors, as some evidence has shown.

Circling back to the grocery shopping agent, it is very likely that in some cases invalid JSON paths will be produced to refer to products (e.g., food.cheeses.goats[0] instead of food.dairy.cheeses.goat[0]). As we have the entire stock at hand, we can apply a simple heuristic to fix the incorrect path automatically and deterministically, for example by using an edit distance algorithm to find the valid path in the product tree closest to the generated one. Even then, some invalid paths might be too far from any valid one. In such a case, we might want to simply retry the LLM request rather than adding the error to the context and asking the model to fix it.
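A minimal sketch of such a deterministic auto-fix, assuming the product tree is a nested structure of dicts and lists; difflib's similarity matching from the standard library stands in for a proper edit distance algorithm, and the cutoff below which we give up and retry is an arbitrary illustrative choice:

```python
import difflib

def _collect_paths(tree, prefix: str = "") -> list[str]:
    """Enumerate all valid JSON-style paths to the leaves of a nested dict/list tree."""
    paths = []
    if isinstance(tree, dict):
        for key, value in tree.items():
            paths += _collect_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(tree, list):
        for i, value in enumerate(tree):
            paths += _collect_paths(value, f"{prefix}[{i}]")
    else:
        paths.append(prefix)
    return paths

def fix_product_path(generated_path: str, product_tree: dict, cutoff: float = 0.8):
    """Snap an LLM-generated path to the closest valid path, or None if it is too far off."""
    valid_paths = _collect_paths(product_tree)
    if generated_path in valid_paths:
        return generated_path
    # difflib's similarity ratio is a cheap stand-in for edit distance; anything
    # below the cutoff is considered "too far", signalling a retry of the LLM call.
    matches = difflib.get_close_matches(generated_path, valid_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For the example above, fix_product_path("food.cheeses.goats[0]", product_tree) would snap to "food.dairy.cheeses.goat[0]" when that path exists in the tree, and return None (triggering a retry) when nothing is close enough.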
❷ Automated empirical evaluation

Unlike traditional third-party APIs, calling an LLM with the exact same input can produce different results each time, even when the temperature hyper-parameter is set to zero. This is in direct conflict with fundamental principles of good software engineering, which is supposed to give users a predictable and consistent experience. The key to tackling this conflict is automated empirical evaluation, which I consider the LLM edition of test-driven development.

The evaluation suite can be implemented as a regular test suite, which has the benefit of integrating naturally into the development cycle and CI/CD pipelines. Crucially, however, the LLMs must actually be called, not mocked. Each evaluation case consists of user inputs and an initial system state, as well as a grading function for the generated output or modified state. Unlike traditional test cases, a binary PASS or FAIL is insufficient here, because the evaluation suite plays an important role in guiding improvements and enhancements, as well as catching unintended degradations. The grading function should therefore return a fitness score for the output or state modifications our agent produces.

How do we actually implement the grading function? Think, for example, of a simple LLM task for generating small Python utility functions. An evaluation case could prompt it to write a function that computes the nth element of the Fibonacci sequence. The model's implementation might take either the iterative or the recursive path, both valid (though suboptimal, as there is a closed-form expression), so we cannot make assertions about the specifics of the function's code. The grading function in this case could, however, take a handful of test […]
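Although the text is cut off at this point, a grading function in this spirit might execute the generated code against a handful of known Fibonacci values and return the fraction that match as the fitness score. A minimal sketch, assuming the agent returns Python source defining a function named fib (an illustrative convention, not something specified in the article):

```python
# Known Fibonacci values used as a behavioural check (0-indexed convention).
FIB_CASES = [(0, 0), (1, 1), (2, 1), (3, 2), (10, 55), (20, 6765)]

def grade_fibonacci_output(generated_code: str) -> float:
    """Return a fitness score in [0, 1] for LLM-generated Fibonacci code."""
    namespace: dict = {}
    try:
        # NOTE: exec-ing untrusted model output is only acceptable inside an
        # isolated evaluation sandbox, never in production code paths.
        exec(generated_code, namespace)
        fib = namespace["fib"]
    except Exception:
        return 0.0  # the output does not even define a callable named fib

    passed = 0
    for n, expected in FIB_CASES:
        try:
            if fib(n) == expected:
                passed += 1
        except Exception:
            pass  # a crash on some input simply earns no credit
    return passed / len(FIB_CASES)
```

Because the result is a score rather than a binary PASS/FAIL, the same suite can both gate CI (for example, failing below a threshold) and show whether a prompt or model change nudges quality up or down.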