The power of LLMs in software development

Since the release of the GPT models, I've been keeping an eye on what's going on with them and what good use can be made of them.

When GPT-3 was released, I worked on a product called "Write it for me" that helped makers write publications based on their needs. You could tweak the result by selecting a tone, a language, etc. The target audience was product makers, as I know the pain of doing marketing. Being able to delegate SEO content writing to an AI was a good use case.

Funny enough, ChatGPT was unveiled and took over the world a few months later. Innovation outpaced my shipping velocity.

I've never been a huge fan of LLMs, mainly because I find them overrated. Headlines play on fear by convincing people their jobs will be obsolete shortly, but this is misleading. LLMs bring a groundbreaking new paradigm that can't be ignored. Still, they fail on most value-added tasks.

I'm not against "AI", ML, or LLMs. I'm cautious about them. We see the benefits, but we must not ignore the shortcomings. I wanted to experiment with them in software engineering to solve real-world problems.

When words solve more than algorithms

Sometimes, it's easier to explain what we're trying to achieve with words rather than transcribing it into code.

Last week, I had to do some scraping: gathering and cleaning a dataset. My goal was to extract email addresses from a list of profiles by reading their PDF resumes one by one and pulling the email out of a string.

The first thing you may think of is a RegExp. That's the strategy I used for the first pass. The thing with PDFs is that reading their content from a browser is not always easy. The email sits in the string with noisy text around it. Instead of having whitespace before and after the email to delimit it, we end up with misplaced whitespace. The RegExp is very helpful for narrowing the text down to the range where the email is, with an acceptable 32 characters around it so we don't cut it in half.
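
The exact pattern isn't the point, but a minimal sketch of the idea could look like this (the pattern below is my reconstruction, not the original one):

```js
// Narrow the noisy PDF text down to a small window around the email,
// keeping up to 32 characters on each side of the "@" so the address
// itself never gets cut in half. ([\s\S] also matches newlines.)
const EMAIL_AREA = /[\s\S]{0,32}@[\s\S]{0,32}/;

function extractEmailArea(rawText) {
  const match = rawText.match(EMAIL_AREA);
  return match ? match[0] : null;
}
```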

[Data sample: scraped data that must still be cleaned up]

Based on the screenshot above, your brain is smart enough to guess what the email of our dear John the Spy is, right? But how would you translate it into computer instructions?

This is where an LLM comes in. I thought: "If I explain it with words, and given that LLMs are trained on the public Internet, it might understand and cross-reference with its training dataset."

To validate my hypothesis, I tried a few samples using ChatGPT and got 100% success without any hallucinations. God, does that mean GPT has been learning from the website I'm scraping?!

Implementation

Now that the hypothesis was validated, it was time to industrialize the process. Instead of paying for OpenAI's API (maker bias: spend as little as possible), I gave Ollama a chance to run an LLM locally, using the Llama 3 model.

I tried to write an explicit prompt to get a clear answer, despite not being a skilled prompt engineer. The instructions were:

Reply with the JSON object only, so your response can be used in an API. With the following JSON object, take the "email" property, clean the email address, and save it in a new "clearedEmail" property. {profileJSONObject}

Thanks to the Ollama Node.js library, industrializing the data cleaning was a breeze.
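
For illustration, here is a minimal sketch of the call, assuming the ollama npm package and a llama3 model already pulled locally (the function name is hypothetical):

```js
import ollama from 'ollama';

// Ask the local Llama 3 model to clean one profile's email.
// `profile` is the flat JSON object scraped earlier.
async function cleanProfile(profile) {
  const response = await ollama.chat({
    model: 'llama3',
    messages: [{
      role: 'user',
      content:
        'Reply with the JSON object only, so your response can be used in an API. ' +
        'With the following JSON object, take the "email" property, clean the email ' +
        'address, and save it in a new "clearedEmail" property. ' +
        JSON.stringify(profile),
    }],
  });
  return response.message.content; // raw text, hopefully JSON
}
```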

Integrating the LLM was as simple as that. The story could end here, but I'll dig into the implementation issues.

The first issue was that Llama wasn't responding in a JSON-only format: it would always prepend text such as "Here's the object with the email cleared" (and other variants).

I had to clean up the response string to extract the JSON. Not a big deal, as I was inputting a flat object and expected a flat object in response too.

I looked for the index of { to jump directly to the JSON-stringified object, and for the index of } to find its end. Taking this substring strips any noise from the response.

Once the substring was isolated, it could easily be parsed and manipulated in JavaScript.
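
Put together, the extraction could look like this sketch (I use the last } as the end delimiter, a safe variant of the approach described above):

```js
// Strip the model's chatty preamble and parse the flat JSON object
// sitting between the first "{" and the last "}".
function extractJSON(rawResponse) {
  const start = rawResponse.indexOf('{');
  const end = rawResponse.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error(`No JSON object found in: ${rawResponse}`);
  }
  return JSON.parse(rawResponse.substring(start, end + 1));
}
```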

While it could have been anticipated, I hadn't predicted that the response object would not always have the same structure: sometimes the full input object with a new clearedEmail property, sometimes an altered email property, or even a brand-new object containing only the clearedEmail property.
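
In practice, this means the parsed object has to be normalized before use. A sketch of a defensive fallback (my assumption, not necessarily the original code):

```js
// The model sometimes returns `clearedEmail`, sometimes only rewrites
// `email`, so prefer the former and fall back to the latter.
function pickCleanedEmail(parsed) {
  return parsed.clearedEmail ?? parsed.email ?? null;
}
```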

These various behaviors add to the black box of an obfuscated, undebuggable dependency we're introducing into the project.

Mitigating hallucinations

We're getting close to a working data extraction and formatting project. What could go wrong now?

Well, are you curious? Have you looked into the extracted email address?

Yes, doe.john@gmail.com looks like a valid email, and a coherent one given the object we fed to Llama. But... if it's the email we're looking for, it must also appear somewhere in the original email property, right?

Out of curiosity, let's add watchdogs. Once the response is parsed and the email extracted, let's match it against the original string. If it matches, we have high confidence that the email is correct. Any other case must be flagged as a failure, to either retry or process manually.
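
A sketch of such a watchdog, assuming misplaced whitespace is the only noise the model is allowed to remove (that stripping rule is my assumption):

```js
// Trust the cleaned email only if it literally appears in the scraped
// string once all whitespace has been squashed out.
function isTrustworthy(cleanedEmail, originalEmailField) {
  if (!cleanedEmail) return false;
  const squashed = originalEmailField.replace(/\s+/g, '');
  return squashed.includes(cleanedEmail);
}
```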

Out of a total of 700 datapoints, 370 had an email and 125 failed to be cleaned.

Failure doesn't mean the LLM found no email, but rather that it gave a wrong answer and the datapoint had to be reprocessed (either automatically or manually; I went with the latter).

I find a ~30% failure rate acceptable in this situation, given the time saved on cleaning over 300 datapoints.

Note that in this experiment, I haven't counted false positives, where PARISjohn.doespy19@gmail.com would also have been detected as a valid email. I encountered only one datapoint in this situation.

Now what?

This was my first interaction with an alien Llama, and a concrete use of it.

My skepticism is not as high as it was a few months ago. We're only at the beginning of this technology and, given its progress, it will become a great copilot for every boring task we all have in our jobs. But it's not going to remove jobs, and any company using it to do so is missing its purpose. Jobs are being enhanced and transformed, offloading boring, time-consuming tasks so we can focus on value-added ones.

In production-grade apps, LLMs are not yet ready to take the big leap. A 70% success rate is a good score, but the remaining 30% failure rate is still very high, hard to mitigate, and hard to make predictable at large scale.

In one word, LLMs are copilots. As dogs are man's best friend, LLMs are becoming the developer's best copilot on the job.