Misadventures in LangChain
Chains are powerful, but I'm not convinced you need LangChain to build them.
I scheduled my last post, so while it was going out I was working on ripping out the guts of my demo and rebuilding it with LangChain. It was a great exercise and I learned a lot about the technology, but ultimately I polished up the old guts and put them back.
What is LangChain?
LangChain provides an open-source set of libraries that “make it easy to create custom chains”. They want people to use the free library, and later consume the paid LangSmith (developer platform) and LangServe (deployment and hosting) services.
“Chains” are multi-step LLM processes. They sound niche, but the basic chain that LangChain describes in its docs is almost exactly what I was doing:
This classic chain is prompt => model => parser, and each of those steps was necessary in my pre-LangChain demo. First I created the prompt from a combination of context and messages, then passed that to ChatGPT (the model), and then I had some code that took the model’s response and turned it into the format I wanted to send to the user (the parser). LangChain formalizes those steps so that each one can be reusable and the chain can be more easily extended.
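For concreteness, here’s roughly what that chain looks like in LangChain’s JS library - treat this as a sketch, since import paths and method names have shifted across package versions and the model name is my assumption:

```ts
// The classic prompt => model => parser chain in LangChain JS.
import { ChatPromptTemplate } from "langchain/prompts";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { StringOutputParser } from "langchain/schema/output_parser";

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a helpful assistant."],
  ["human", "{question}"],
]);
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo" });
const parser = new StringOutputParser();

// Each step is reusable on its own; pipe() composes them into a chain.
const chain = prompt.pipe(model).pipe(parser);
const answer = await chain.invoke({ question: "What is a chain?" });
```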
Why would I want that?
LLMs tend to work better with step-by-step instructions. Sometimes you can do this with prompt engineering - if you’re asking it to grade a student’s answer to a textbook question, you get better results by asking it to solve the problem itself and then compare its answer to the student’s than by asking for a grade directly. Sometimes you can’t fit all of that into one prompt, or you can but the LLM gets “lost in the middle”, and it’s easier to string multiple requests together.
The example that I have is ending my conversation - in my heart, I want it to role-play the complaining child until the natural conclusion of the conversation, or until the user writes “finish”. This was way too hard for it, so I ended up asking for 5 exchanges before the finish step. That worked only on the more expensive model, so to use the cheaper model I ultimately wrote separate “continue” and “complete” prompts based on the number of messages in the conversation.
With a chained LLM call, I can first ask ChatGPT if the conversation should be finished, then use that output to submit the continue or the complete prompt. This sounds like a minor optimization, but you might recall that my prompts were quite long already. By splitting out into multiple calls I can give clearer explanations for each request, and I can provide more relevant examples which can significantly improve performance.
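I never built this exact version, but a sketch of that routing could look like the following - the prompt text, helper names, and model choice are illustrative assumptions, not my actual code:

```ts
// Classify with a cheap yes/no call, then route to a focused prompt.
import { ChatOpenAI } from "langchain/chat_models/openai";
import { HumanMessage, SystemMessage } from "langchain/schema";

const cheapModel = new ChatOpenAI({ modelName: "gpt-3.5-turbo" });

// Placeholder prompts - the real ones would carry focused examples.
const continuePrompt = "Role-play the complaining child; keep the conversation going.";
const completePrompt = "Role-play the complaining child winding down and saying goodnight.";

async function shouldFinish(transcript: string): Promise<boolean> {
  const res = await cheapModel.invoke([
    new SystemMessage("Has this conversation reached its natural end? Answer only YES or NO."),
    new HumanMessage(transcript),
  ]);
  return String(res.content).trim().toUpperCase().startsWith("YES");
}

async function nextReply(transcript: string): Promise<string> {
  // Each branch gets a shorter prompt with only the relevant examples.
  const system = (await shouldFinish(transcript)) ? completePrompt : continuePrompt;
  const res = await cheapModel.invoke([new SystemMessage(system), new HumanMessage(transcript)]);
  return String(res.content);
}
```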
Some of these tasks will be easier than others - by making multiple LLM calls, you can tune each step to use the least expensive model that performs at the desired level. Since you pay 2-3x more for output tokens than for input, and the expensive models are 10-30x more expensive than the cheaper ones, there can be real value in this optimization.
Agents
A lot of the excitement about LLMs centers on “agents” - code that can take actions to accomplish a goal. There are two important requirements for agents that I didn’t need for my initial demo - calling other services, and defining circular chains.
Functions
One common toy example is a chatbot that can also give you real-time weather information. You can add instructions to your prompt that say “if you need to know the weather in a particular city, return this particular JSON response with the city and I’ll go get it for you”. By doing this, you’re registering a function. If the user asks for the weather in San Diego, the LLM asks your server for it, your server responds, and then you make another request to the LLM with the requested data. If the user asks for anything else, the LLM can just respond directly and skip the JSON.
This example has just one function - the Weather function - but you can build a request with many functions. Functions can be used to get some realtime data like reading a webpage or getting a stock quote, but they can also be used to take some action like booking a restaurant using OpenTable or creating a Google Calendar event.
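As a sketch of how registration works against OpenAI’s Node SDK directly (the v4-era `functions` parameter; newer SDK versions use `tools` instead) - the `get_weather` name, its schema, and the `getCurrentWeather` helper are made up:

```ts
// Registering a single weather function with the OpenAI Node SDK.
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "What's the weather in San Diego?" }],
  functions: [
    {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  ],
});

const call = response.choices[0].message.function_call;
if (call) {
  // The model wants real-time data: fetch it, then make a second
  // request that appends the function result as a "function" message.
  const { city } = JSON.parse(call.arguments);
  // const weather = await getCurrentWeather(city); // hypothetical helper
} else {
  // The user asked for something else, so the model answered directly.
  console.log(response.choices[0].message.content);
}
```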
Circular Chains
The weather flow chart works fine if you know exactly what you want and how many steps it’ll take, but it will get out of control quickly as you add more functionality. This isn’t the only formulation, but one way that agents are generalized is with a Plan => Execute => Reflect loop:
The agent runs through this loop until the task is completed, or some other exit criteria is reached (the agent tried n times, has come to the conclusion that it’s not possible with these tools, etc). At its core, this is how “Browse with Bing” works - it runs a search, reads the top results, then decides if it should follow more links or run another search.
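In code, that loop might look something like this sketch, where `plan`, `execute`, and `reflect` stand in for LLM-backed helpers you’d have to write yourself, and `MAX_ATTEMPTS` is an arbitrary cap:

```ts
// A sketch of the Plan => Execute => Reflect loop with the exit
// criteria described above: success, a retry cap, or giving up.
type Verdict = "done" | "continue" | "impossible";

declare function plan(goal: string, context: string): Promise<string>;
declare function execute(step: string): Promise<string>;
declare function reflect(goal: string, context: string, result: string): Promise<Verdict>;

const MAX_ATTEMPTS = 5;

async function runAgent(goal: string): Promise<string> {
  let context = "";
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const step = await plan(goal, context);               // decide the next action
    const result = await execute(step);                   // call a tool or function
    context += `\n${step} => ${result}`;
    const verdict = await reflect(goal, context, result); // finished, keep going, or give up?
    if (verdict === "done") return result;
    if (verdict === "impossible") break;
  }
  return "Unable to complete the task with the available tools.";
}
```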
Integration
All this sounds great, so how did it work?
Sadly, not that well.
Read this as one person’s opinion - one person who put about a week into it, and who habitually jumps into coding before reading all the docs¹. However, a colleague remarked that he didn’t feel that “the juice was worth the squeeze”, and that was certainly my experience.
The Good
Being able to chain together multiple calls with `chain = chatPrompt.pipe(model).pipe(parser)` is a pattern that I like. To get it to work you need some base class with common functions (in this case it’s called `Runnable`, and it has methods `invoke`, `batch`, and `stream`). Using it forces the developer into some healthy, modular patterns, which in turn drives reusability.
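The contract looks roughly like this - a simplified sketch, not LangChain’s actual type definitions, which also thread through config and callbacks:

```ts
// A simplified sketch of the Runnable contract.
interface Runnable<In, Out> {
  invoke(input: In): Promise<Out>;
  batch(inputs: In[]): Promise<Out[]>;
  stream(input: In): Promise<AsyncIterable<Out>>;
  pipe<Next>(next: Runnable<Out, Next>): Runnable<In, Next>;
}
```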
This kind of breakdown also makes it easier to swap in and out various models - it’s easy to go from ChatGPT 3.5 to 4 in code, but it’s more of a pain to go from OpenAI’s ChatGPT to Google’s Bard or Anthropic’s Claude. LangChain is pretty good for that.
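For example, swapping providers can be close to a one-line change - a sketch, with import paths from the 2023-era JS package that may since have moved:

```ts
// Swap the model provider without touching the rest of the chain.
import { ChatOpenAI } from "langchain/chat_models/openai";
import { ChatAnthropic } from "langchain/chat_models/anthropic";

const model = process.env.USE_CLAUDE
  ? new ChatAnthropic({ modelName: "claude-2" })
  : new ChatOpenAI({ modelName: "gpt-4" });

// prompt.pipe(model).pipe(parser) stays exactly the same either way.
```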
LangChain also provides a set of Prompt Templates and Parsers, which can be helpful. It has first-class message support - it’s easy to create a prompt from a series of messages and roles, which I had been doing manually. You can put placeholders in your prompt template and populate them at runtime.
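Something like this - a sketch against the 2023-era JS API, where `{name}` and `{message}` are made-up placeholders:

```ts
// Placeholders in a chat prompt template, filled in at runtime.
import { ChatPromptTemplate } from "langchain/prompts";

const chatPrompt = ChatPromptTemplate.fromMessages([
  ["system", "You are role-playing a complaining child named {name}."],
  ["human", "{message}"],
]);

// The placeholders are populated when the prompt is formatted:
const messages = await chatPrompt.formatMessages({
  name: "Alex",
  message: "Time for bed!",
});
```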
Parsers are really helpful for streaming - LLMs can take a long time to answer, so they’re happy to return a word at a time. This is fine for strings, but for structured data that needs matching quotes and a `}` for every `{`, a partial string will cause `JSON.parse` to throw an error. LangChain’s `JsonOutputFunctionsParser` resolves that complexity for you out of the box.
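A sketch of what that looks like end to end - the function schema, prompt, and import paths are my assumptions, and the exact streaming behavior may differ by version:

```ts
// Streaming structured output through JsonOutputFunctionsParser.
import { ChatOpenAI } from "langchain/chat_models/openai";
import { JsonOutputFunctionsParser } from "langchain/output_parsers";

const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo" }).bind({
  functions: [
    {
      name: "reply",
      description: "Reply to the parent while staying in character",
      parameters: {
        type: "object",
        properties: { reply: { type: "string" } },
        required: ["reply"],
      },
    },
  ],
  function_call: { name: "reply" },
});

const chain = model.pipe(new JsonOutputFunctionsParser());

// JSON.parse('{"reply": "Aw, five more minu') would throw, but the
// parser emits a best-effort object for each partial chunk instead:
for await (const partial of await chain.stream("Time for bed!")) {
  console.log(partial); // {} -> { reply: "Aw" } -> { reply: "Aw, five more..." }
}
```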
Finally, the function and tool support lets you specify all that structured formatting once, pass it clearly to the model, and reuse it both for prompting and parsing. In theory.
The Bad
In my experience, it just didn’t work very well. I left my SMS demo deployed while I developed, and after 3 days of wrestling I couldn’t get LangChain’s results to measure up to my old, hacked-up version.
The biggest issue is prompting. LangChain is trying to do a bunch of clever stuff - templating, passing in schemas, formatting messages. Templating uses curly braces, which are also heavily used in JSON, so I needed to change my formatting in a way that I didn’t totally trust. My original schema guidance was 5 lines long; their autogenerated guidance was three times longer, and the model was less consistent in following it. I found that messages had to come from the roles “System”, “Human”, and “AI” (despite their documentation to the contrary), and one way or another the model was flipping from child to parent alarmingly often. At the end of the day, I couldn’t generate sufficiently clear and performant prompts using their tools, which is surprising since the end result is just a string.
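For the curious, the brace collision looks like this - a sketch with made-up prompt text:

```ts
// Template variables and literal JSON both use curly braces, so
// literal braces in the prompt must be doubled to escape them.
import { PromptTemplate } from "langchain/prompts";

const template = PromptTemplate.fromTemplate(
  // {style} is a template variable; {{"reply": "..."}} is literal JSON.
  'Respond as JSON shaped like {{"reply": "..."}} in this style: {style}'
);

const text = await template.format({ style: "whiny child" });
// => 'Respond as JSON shaped like {"reply": "..."} in this style: whiny child'
```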
Documentation was a confounding issue - I wrestled a ton with parsers, in large part due to documentation quality. LangChain has generated a lot of documentation (good!), but hasn’t done a great job of cleaning up out-of-date material². They’ve launched libraries for Python and JavaScript, but Python seems to be their preferred language, and the JS library a step or two behind.
With a bit more patience I think I could have gotten my existing logic into custom templates and parsers and kept using LangChain, but that kind of effort belongs in an optimization phase - it shouldn’t be table stakes. Lotta squeezing required.
Cleaning Up
I ended up spending a half day removing LangChain and putting things back together rather than reverting. There were a bunch of improvements that I’d made in order to get LangChain working in the first place, and I try to be judicious about my bathwater-disposal practices. So my code’s in a better place, but I wish that LangChain’s happy path implementation used a little less magic and was a little more practical.
¹ I actually did read a lot of docs before starting. Honestly, this might have been the first time.
² Using LLMs to clean up documentation is one of the use cases I’m interested in, at least academically - it seems like when you write a new article, you should be able to ask “find me the articles in our knowledge base that appear to contradict this new article”, and then the author can clean up old stuff or correct their new article.