Using LLMs
- Here’s how I use LLMs to help me write code
- Set reasonable expectations
- Account for training cut-off dates
- Context is king
- Ask them for options
- Tell them exactly what to do
- You have to test what it writes!
- Remember it’s a conversation
- Use tools that can run the code for you
- Vibe-coding is a great way to learn
- Be ready for the human to take over
- The biggest advantage is speed of development
- LLMs amplify existing expertise
- Bonus: answering questions about codebases
Some thoughts on using LLMs during programming:
Gemini 2 Flash seems to be good.
You can’t get an LLM to create whole “programs”, but you can (sometimes) very efficiently create well-defined components, and that is where it shines.
The usefulness of present-day LLMs for general-purpose coding is probably overblown.
Sure, it’s easy to get them to implement a function from a definition, or to write simple CRUD code. But getting them to do something very specific (a particular library, a particular style, special considerations, etc.) is hard.
Solving that problem requires a lot of prompting and other work above and beyond the LLM itself.
Use LLMs to build applications composed of smaller building blocks, each of which is well defined and sits above the level of the programming language, so that it is easier to get right, to be useful and, critically, to be refined and enhanced. When the LLM does write code, it writes against a very specific plugin API that is well described in the prompt.
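To make the “building blocks” idea concrete, here is a minimal, hypothetical sketch of the kind of narrow plugin API an LLM can reliably target; the interface and the names are my own illustration, not taken from the comment above:

# Hypothetical plugin interface: small, well defined, and easy to describe in a prompt.
from typing import Protocol

class TextTransform(Protocol):
    name: str

    def apply(self, text: str) -> str:
        """Transform the input text and return the result."""
        ...

# A prompt can then be as narrow as "write a TextTransform plugin that strips
# Markdown formatting", rather than describing a whole application.
class StripMarkdown:
    name = "strip-markdown"

    def apply(self, text: str) -> str:
        # Naive illustration: remove a few common Markdown markers.
        for marker in ("**", "*", "`", "#"):
            text = text.replace(marker, "")
        return text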
First, ask the LLM what it suggests based on a description of what I want to do, and then use the suggested libraries. Don’t push the LLM toward older libraries for which there aren’t enough examples and documentation; use newer, more popular ones. The LLM will produce near-perfect code. I still had to resolve some issues, but they were minor.
Know the limitations of LLMs. Use code that is popular and has many examples.
You really need to be using LLMs in an IDE.
If you use AI properly with tools that support it, it’s a 3000% speed boost.
If you use LLMs in a chat window, they’re mostly a helpful assistant.
Without endorsing any particular commercial solution, there is an open-source project, aider, that is a step in the right direction.
What helps is:
- Understand quite specifically how the context window and the stateless nature of the LLM function, and how the prompt structure can be manipulated well beyond traditional turn-based chat in order to achieve better results.
- Understand that you have to work within the scope of the LLM’s training dataset. Even if you extend that via web search or by providing documentation directly, you’re going to hit limitations if the data isn’t already part of the training parameters (though you can extend its capacity within a given context window quite a bit, if you play your tokens right).
- Either build or use an integrated IDE environment, specifically one that allows dynamic management of documentation and sharing of documents and versions during a “chat” (i.e. each time a context window is sent to the LLM); a rough sketch of this kind of context building follows below.
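As a rough illustration of what that kind of “context window building” can mean in practice, here is a minimal sketch of assembling one large prompt from documentation, source files and the actual task; the function name and the tag format are assumptions for illustration only:

# Hypothetical context builder: gather docs and code into a single prompt
# rather than relying on turn-by-turn chat alone.
import pathlib

def build_context(task: str, doc_paths: list[str], code_paths: list[str]) -> str:
    parts = ["You are helping with the project described by the material below."]
    for p in doc_paths:
        # Provide up-to-date documentation the model may not have been trained on.
        parts.append(f"<doc path='{p}'>\n{pathlib.Path(p).read_text()}\n</doc>")
    for p in code_paths:
        # Share the current version of each relevant source file.
        parts.append(f"<file path='{p}'>\n{pathlib.Path(p).read_text()}\n</file>")
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)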
You have to accept that you will be designing your own flow for integrating the LLM into your process. But once you start to design a dedicated workflow and management process for proper prompting and context window building, the capacity is astounding. It’s truly remarkable. It’s just also quite “limited” in a sense.
The model’s capacity to produce the desired response is very high, but the likelihood of that happening depends entirely on your capacity to provide the prompt (the context window) that allows it to output the response you’re looking for.
The best open-source LLM for code generation might not always be the best choice for enterprise use, and vice versa. It’s essential to evaluate your specific needs, including factors like test generation capabilities, developer productivity impact, and the specific programming languages you work with most often.
Just use the ChatGPT free version unless you are doing a huge project. If you are, subscribe for a month, or until it’s done.
Here’s how I use LLMs to help me write code
https://simonwillison.net/2025/Mar/11/using-llms-for-code/
Set reasonable expectations
Ignore the “AGI” hype—LLMs are still fancy autocomplete. All they do is predict a sequence of tokens—but it turns out writing code is mostly about stringing tokens together in the right order, so they can be extremely useful for this provided you point them in the right direction.
If you assume that this technology will implement your project perfectly without you needing to exercise any of your own skill, you’ll quickly be disappointed.
Instead, use them to augment your abilities. My current favorite mental model is to think of them as an over-confident pair programming assistant who’s lightning fast at looking things up, can churn out relevant examples at a moment’s notice and can execute on tedious tasks without complaint.
Over-confident is important. They’ll absolutely make mistakes—sometimes subtle, sometimes huge. These mistakes can be deeply inhuman—if a human collaborator hallucinated a non-existent library or method you would instantly lose trust in them. Don’t fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way.
When working with LLMs you’ll often find things that they just cannot do. Make a note of these—they are useful lessons! They’re also valuable examples to stash away for the future—a sign of a strong new model is when it produces usable results for a task that previous models had been unable to handle.
Account for training cut-off dates
A crucial characteristic of any model is its training cut-off date. This is the date at which the data they were trained on stopped being collected. For OpenAI’s models this is usually October of 2023. Anthropic and Gemini and other providers may have more recent dates.
This is extremely important for code, because it influences what libraries they will be familiar with. If the library you are using had a major breaking change since October 2023, OpenAI models won’t know about it!
I gain enough value from LLMs that I now deliberately consider this when picking a library—I try to stick with libraries with good stability and that are popular enough that many examples of them will have made it into the training data. I like applying the principles of boring technology—innovate on your project’s unique selling points, stick with tried and tested solutions for everything else.
LLMs can still help you work with libraries that exist outside their training data, but you need to put in more work—you’ll need to feed them recent examples of how those libraries should be used as part of your prompt.
This brings us to the most important thing to understand when working with LLMs:
Context is king
Most of the craft of getting good results out of an LLM comes down to managing its context—the text that is part of your current conversation.
This context isn’t just the prompt that you have fed it: successful LLM interactions usually take the form of conversations, and the context consists of every message from you and every reply from the LLM that exist in the current conversation thread.
When you start a new conversation you reset that context back to zero. This is important to know, as often the fix for a conversation that has stopped being useful is to wipe the slate clean and start again.
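As a minimal sketch of what that means at the API level (this uses the OpenAI Python SDK; the model name and the prompts are placeholders), every follow-up call re-sends the whole message list, and starting a fresh list is exactly what wiping the slate clean amounts to:

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Write a Python slugify() function."}]

reply = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# This follow-up only works because the earlier code is still in `messages`.
messages.append({"role": "user", "content": "Now add type hints and a docstring."})
reply = client.chat.completions.create(model="gpt-4o", messages=messages)

# Starting a new conversation means starting again with an empty message list.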
Some LLM coding tools go beyond just the conversation. Claude Projects for example allow you to pre-populate the context with quite a large amount of text—including a recent ability to import code directly from a GitHub repository which I’m using a lot.
Tools like Cursor and VS Code Copilot include context from your current editor session and file layout automatically, and you can sometimes use mechanisms like Cursor’s @commands to pull in additional files or documentation.
One of the reasons I mostly work directly with the ChatGPT and Claude web or app interfaces is that it makes it easier for me to understand exactly what is going into the context. LLM tools that obscure that context from me are less effective.
You can use the fact that previous replies are also part of the context to your advantage. For complex coding tasks try getting the LLM to write a simpler version first, check that it works and then iterate on building to the more sophisticated implementation.
I often start a new chat by dumping in existing code to seed that context, then work with the LLM to modify it in some way.
One of my favorite code prompting techniques is to drop in several full examples relating to something I want to build, then prompt the LLM to use them as inspiration for a new project. I wrote about that in detail when I described my JavaScript OCR application that combines Tesseract.js and PDF.js—two libraries I had used in the past and for which I could provide working examples in the prompt.
Ask them for options
Most of my projects start with some open questions: is the thing I’m trying to do possible? What are the potential ways I could implement it? Which of those options are the best?
I use LLMs as part of this initial research phase.
I’ll use prompts like “what are options for HTTP libraries in Rust? Include usage examples”—or “what are some useful drag-and-drop libraries in JavaScript? Build me an artifact demonstrating each one” (to Claude).
The training cut-off is relevant here, since it means newer libraries won’t be suggested. Usually that’s OK—I don’t want the latest, I want the most stable and the one that has been around for long enough for the bugs to be ironed out.
If I’m going to use something more recent I’ll do that research myself, outside of LLM world.
The best way to start any project is with a prototype that proves that the key requirements of that project can be met. I often find that an LLM can get me to that working prototype within a few minutes of me sitting down with my laptop—or sometimes even while working on my phone.
Tell them exactly what to do
Once I’ve completed the initial research I change modes dramatically. For production code my LLM usage is much more authoritarian: I treat it like a digital intern, hired to type code for me based on my detailed instructions.
Here’s a recent example:
Write a Python function that uses asyncio httpx with this signature:
async def download_db(url, max_size_bytes=5 * 1024 * 1024) -> pathlib.Path:
Given a URL, this downloads the database to a temp directory and returns a path to it. BUT it checks the content length header at the start of streaming back that data and, if it’s more than the limit, raises an error. When the download finishes it uses sqlite3.connect (…) and then runs a PRAGMA quick_check to confirm the SQLite data is valid—raising an error if not. Finally, if the content length header lies to us— if it says 2MB but we download 3MB—we get an error raised as soon as we notice that problem.
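For reference, here is a minimal sketch of what a function meeting that specification might look like; it is my own illustration rather than Claude’s actual output, and the error type and temporary-file handling are assumptions:

import pathlib
import sqlite3
import tempfile

import httpx

class DownloadError(Exception):
    pass

async def download_db(url, max_size_bytes=5 * 1024 * 1024) -> pathlib.Path:
    path = pathlib.Path(tempfile.mkdtemp()) / "downloaded.db"
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            # Check the declared content-length before streaming any data.
            declared = response.headers.get("content-length")
            if declared is not None and int(declared) > max_size_bytes:
                raise DownloadError("Declared content-length exceeds the limit")
            downloaded = 0
            with open(path, "wb") as f:
                async for chunk in response.aiter_bytes():
                    downloaded += len(chunk)
                    # Catch servers whose content-length header lied to us.
                    if downloaded > max_size_bytes:
                        raise DownloadError("Download exceeded the size limit")
                    f.write(chunk)
    # Confirm that what we downloaded really is a valid SQLite database.
    conn = sqlite3.connect(path)
    try:
        if conn.execute("PRAGMA quick_check").fetchone()[0] != "ok":
            raise DownloadError("SQLite quick_check failed")
    finally:
        conn.close()
    return path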
I could write this function myself, but it would take me the better part of fifteen minutes to look up all of the details and get the code working right. Claude knocked it out in 15 seconds.
I find LLMs respond extremely well to function signatures like the one I use here. I get to act as the function designer, the LLM does the work of building the body to my specification.
I’ll often follow-up with “Now write me the tests using pytest”. Again, I dictate my technology of choice—I want the LLM to save me the time of having to type out the code that’s sitting in my head already.
If your reaction to this is “surely typing out the code is faster than typing out an English instruction of it”, all I can tell you is that it really isn’t for me any more. Code needs to be correct. English has enormous room for shortcuts, and vagaries, and typos, and saying things like “use that popular HTTP library” if you can’t remember the name off the top of your head.
The good coding LLMs are excellent at filling in the gaps. They’re also much less lazy than me—they’ll remember to catch likely exceptions, add accurate docstrings, and annotate code with the relevant types.
You have to test what it writes!
I wrote about this at length last week: the one thing you absolutely cannot outsource to the machine is testing that the code actually works.
Your responsibility as a software developer is to deliver working systems. If you haven’t seen it run, it’s not a working system. You need to invest in strengthening those manual QA habits.
This may not be glamorous but it’s always been a critical part of shipping good code, with or without the involvement of LLMs.
Remember it’s a conversation
If I don’t like what an LLM has written, they’ll never complain at being told to refactor it! “Break that repetitive code out into a function”, “use string manipulation methods rather than a regular expression”, or even “write that better!”—the code an LLM produces first time is rarely the final implementation, but they can re-type it dozens of times for you without ever getting frustrated or bored.
Occasionally I’ll get a great result from my first prompt—more frequently the more I practice—but I expect to need at least a few follow-ups.
I often wonder if this is one of the key tricks that people are missing—a bad initial result isn’t a failure, it’s a starting point for pushing the model in the direction of the thing you actually want.
Use tools that can run the code for you
An increasing number of LLM coding tools now have the ability to run that code for you. I’m slightly cautious about some of these since there’s a possibility of the wrong command causing real damage, so I tend to stick to the ones that run code in a safe sandbox. My favorites right now are:
- ChatGPT Code Interpreter, where ChatGPT can write and then execute Python code directly in a Kubernetes sandbox VM managed by OpenAI. This is completely safe—it can’t even make outbound network connections so really all that can happen is the temporary filesystem gets mangled and then reset.
- Claude Artifacts, where Claude can build you a full HTML+JavaScript+CSS web application that is displayed within the Claude interface. This web app is displayed in a very locked down iframe sandbox, greatly restricting what it can do but preventing problems like accidental exfiltration of your private Claude data.
- ChatGPT Canvas is a newer ChatGPT feature with similar capabilities to Claude Artifacts. I have not explored this enough myself yet.
And if you’re willing to live a little more dangerously:
- Cursor has an “Agent” feature that can do this, as do Windsurf and a growing number of other editors. I haven’t spent enough time with these to make recommendations yet.
- Aider is the leading open source implementation of these kinds of patterns, and is a great example of dogfooding—recent releases of Aider have been 80%+ written by Aider itself.
- Claude Code is Anthropic’s new entrant into this space. I’ll provide a detailed description of using that tool shortly.
This run-the-code-in-a-loop pattern is so powerful that I chose my core LLM tools for coding based primarily on whether they can safely run and iterate on my code.
Vibe-coding is a great way to learn
Andrej Karpathy coined the term vibe-coding just over a month ago, and it has stuck:
There’s a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. […] I ask for the dumbest things like “decrease the padding on the sidebar by half” because I’m too lazy to find it. I “Accept All” always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it.
Andrej suggests this is “not too bad for throwaway weekend projects”. It’s also a fantastic way to explore the capabilities of these models—and really fun.
The best way to learn LLMs is to play with them. Throwing absurd ideas at them and vibe-coding until they almost sort-of work is a genuinely useful way to accelerate the rate at which you build intuition for what works and what doesn’t.
Be ready for the human to take over
I got lucky with this example because it helped illustrate my final point: expect to need to take over.
LLMs are no replacement for human intuition and experience. I’ve spent enough time with GitHub Actions that I know what kind of things to look for, and in this case it was faster for me to step in and finish the project rather than keep on trying to get there with prompts.
The biggest advantage is speed of development
My new colophon page took me just under half an hour from conception to finished, deployed feature.
I’m certain it would have taken me significantly longer without LLM assistance—to the point that I probably wouldn’t have bothered to build it at all.
This is why I care so much about the productivity boost I get from LLMs: it’s not about getting work done faster, it’s about being able to ship projects that I wouldn’t have been able to justify spending time on at all.
I wrote about this in March 2023: AI-enhanced development makes me more ambitious with my projects. Two years later that effect shows no sign of wearing off.
It’s also a great way to accelerate learning new things—today that was how to customize my GitHub Pages builds using Actions, which is something I’ll certainly use again in the future.
The fact that LLMs let me execute my ideas faster means I can implement more of them, which means I can learn even more.
LLMs amplify existing expertise
Could anyone else have done this project in the same way? Probably not! My prompting here leaned on 25+ years of professional coding experience, including my previous explorations of GitHub Actions, GitHub Pages, GitHub itself and the LLM tools I put into play.
I also knew that this was going to work. I’ve spent enough time working with these tools that I was confident that assembling a new HTML page with information pulled from my Git history was entirely within the capabilities of a good LLM.
My prompts reflected that—there was nothing particularly novel here, so I dictated the design, tested the results as it was working and occasionally nudged it to fix a bug.
If I was trying to build a Linux kernel driver—a field I know virtually nothing about—my process would be entirely different.
Bonus: answering questions about codebases
If the idea of using LLMs to write code for you still feels deeply unappealing, there’s another use-case for them which you may find more compelling.
Good LLMs are great at answering questions about code.
This is also very low stakes: the worst that can happen is they might get something wrong, which may take you a tiny bit longer to figure out. It’s still likely to save you time compared to digging through thousands of lines of code entirely by yourself.
The trick here is to dump the code into a long context model and start asking questions. My current favorite for this is the catchily titled gemini-2.0-pro-exp-02-05, a preview of Google’s Gemini 2.0 Pro which is currently free to use via their API.
I used this trick just the other day. I was trying out a new-to-me tool called monolith, a CLI tool written in Rust which downloads a web page and all of its dependent assets (CSS, images etc) and bundles them together into a single archived file.
I was curious as to how it worked, so I cloned it into my temporary directory and ran these commands:
cd /tmp
git clone https://github.com/Y2Z/monolith
cd monolith
files-to-prompt . -c | llm -m gemini-2.0-pro-exp-02-05 -s 'architectural overview as markdown'
I’m using my own files-to-prompt tool (built for me by Claude 3 Opus last year) here to gather the contents of all of the files in the repo into a single stream. Then I pipe that into my LLM tool and tell it (via the llm-gemini plugin) to prompt Gemini 2.0 Pro with a system prompt of “architectural overview as markdown”.
This gave me back a detailed document describing how the tool works—which source files do what and, crucially, which Rust crates it was using. I learned that it used reqwest, html5ever, markup5ever_rcdom and cssparser and that it doesn’t evaluate JavaScript at all, an important limitation.
I use this trick several times a week. It’s a great way to start diving into a new codebase—and often the alternative isn’t spending more time on this, it’s failing to satisfy my curiosity at all.