Yep, this was generated with OpenAI’s image generation. Not the best picture but I like it.

How to save a lot of money when using GPT models to crawl the web

Helping clients reduce their costs more than tenfold by simply refactoring the input

5 min read · Sep 8, 2023

TL;DR: often, the simplest solutions are the best. In many cases, you can get better results more efficiently by stripping the HTML of all unnecessary content and asking the model to answer using only discrete choices. Finally, ask GPT directly for advice at the end, such as which links are most relevant to follow next.

Introduction

As a freelancer specialized in implementing machine learning in real-world scenarios, I have recently received many project requests about integrating GPT models to enhance existing products, or even to create new solutions from scratch. A common point among these clients is that they were business-oriented: they had an idea and wanted to know how much it would cost to put into production.

Some of them had even already tried a few naive solutions and had been taken aback by the costs. To make things concrete, here’s an actual use case (modified for privacy, of course).

As part of their product, this client crawled various websites and extracted information from them; once processed, that information was the core of their service. Recently, they had asked their team to try the GPT-4 API to get better, more contextual information. This is a great use case for LLMs, which can absorb context and produce processed summaries in whatever format is required.

Their naive approach was simple: download the webpage, send it to the GPT API, and ask it to extract pieces of information. Now let’s look at where this goes wrong.

Optimising the Input

First of all, a web page includes many things: HTML with the core content, CSS for the styling, JS for the dynamics, and assets such as images, PDFs, etc. Pricing for these LLMs is made up of two parts: input tokens (length of input) and output tokens (length of the response), both of which can be optimised.
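To make the two-part pricing concrete, here is a minimal sketch of the arithmetic. The per-1k-token prices and the token counts below are placeholder assumptions for illustration only; check your provider's current price list and measure your own pages.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one API call: input and output tokens are billed separately."""
    return (input_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k


# Placeholder prices (assumptions, not a quote from any price list).
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

# Hypothetical page: raw HTML with a free-text answer,
# versus stripped content with a short categorical answer.
raw_cost = estimate_cost(6000, 500, INPUT_PRICE_PER_1K, OUTPUT_PRICE_PER_1K)
stripped_cost = estimate_cost(300, 50, INPUT_PRICE_PER_1K, OUTPUT_PRICE_PER_1K)

print(f"raw: {raw_cost:.3f}  stripped: {stripped_cost:.3f}  "
      f"ratio: {raw_cost / stripped_cost:.1f}x")
```

Multiply the per-page difference by the number of pages crawled per day and it becomes clear why both sides of the bill are worth optimising.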

The simplest and cheapest approach (both in terms of development and API costs) is to strip everything that isn’t “inner HTML”, i.e. everything that isn’t core content, such as styling and scripts. This will pretty much always reduce the input by more than 90%, and in most cases even 99%. For the majority of applications of this kind, it’s more than enough: the model will usually be able to understand what’s going on. If you want, you can keep only the innermost tags, i.e. the heading family (<h1> to <h6>), <p>, <a>, etc. It’s also worth keeping the alt text of tags such as <img>, as it should contain a textual description of the image.
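As a rough illustration of this stripping step, here is a minimal sketch using only Python's standard library. The class name `ContentExtractor`, the helper `strip_html`, and the `tag-text` output format are my own choices for this example, and the sketch assumes reasonably well-formed HTML (text inside tags that are never closed is dropped).

```python
from html.parser import HTMLParser

KEPT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "a"}
SKIPPED_TAGS = {"script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text from headings, paragraphs and links, plus image alt text."""

    def __init__(self):
        super().__init__()
        self.lines = []       # output lines such as "h1-Foo Bar"
        self._stack = []      # kept tags that are currently open
        self._buffer = []     # text collected for the outermost open tag
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt and alt.strip():
                self.lines.append(f"alt-{alt.strip()}")
        elif tag in KEPT_TAGS:
            self._stack.append(tag)

    def handle_data(self, data):
        if self._skip_depth == 0 and self._stack:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if tag not in self._stack:
            return
        outer = self._stack[0]
        # Pop up to and including the tag being closed.
        while self._stack and self._stack.pop() != tag:
            pass
        if not self._stack:
            # The outermost kept tag is closed: emit one line for it.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(f"{outer}-{text}")
            self._buffer = []


def strip_html(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = (
    '<html><head><style>body{color:red}</style></head><body>'
    '<h1>Foo Bar</h1><p>Some <a href="/x">linked</a> text</p>'
    '<img src="b.png" alt="image of a flying bird"></body></html>'
)
print(strip_html(sample))
```

Text inside a nested tag (such as a link inside a paragraph) is folded into the line of the outermost tag, which keeps the output compact.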

We have the core content; now let’s move up a notch. We can also extract all the web links and send them separately to the model. Why? It may be able to extract information from them. So now our message looks like this:

h1-Foo Bar
h2-Lorem ipsum dolor sit amet
p-Some other text that have extracted
alt-image of a flying bird
p-more text

Here are the links used in the page
fonts.google.com/..
facebook.com/..
twitter.com/..
medium.com/..

This generates a much shorter message to send to the LLM. Congrats! You’ve just reduced your input cost, most likely by a large factor. The links could also easily have been left inline in the text (just as they are in the HTML); that’s up to you and the context.
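The link-extraction step could be sketched as follows, again with the standard library only. The names `LinkCollector`, `collect_links` and `build_message` are hypothetical, and dropping the scheme and query string to save tokens is my own assumption; keep them if your use case needs them.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every tag that has one (<a>, <link>, ...)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(href)
        if parts.scheme not in ("", "http", "https"):
            return  # skip mailto:, javascript:, etc.
        # Keep host + path for absolute URLs; keep relative links as-is.
        self.links.append(parts.netloc + parts.path if parts.netloc else href)


def collect_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(collector.links))


def build_message(content: str, links: list) -> str:
    return content + "\n\nHere are the links used in the page\n" + "\n".join(links)


page = (
    '<link href="https://fonts.google.com/css?family=Roboto">'
    '<a href="https://twitter.com/foobar">Twitter</a>'
    '<a href="https://twitter.com/foobar">Twitter again</a>'
    '<a href="/about">About</a>'
    '<a href="mailto:hi@example.com">Mail</a>'
)
print(build_message("h1-Foo Bar\np-more text", collect_links(page)))
```

Deduplicating matters more than it looks: navigation bars and footers often repeat the same links many times on a single page.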

Optimising the Output

Now comes the output, which also has a cost. Of course, the output depends heavily on your needs, but you will get the best value-per-token ratio when asking the model to categorize the input. For example, in our case, we may want to know:

What’s the industry?
What would you estimate the company size to be? (<10: startup, <50: small, <300: medium, >=300: large)
On a scale of 1 to 5, how active are they?

When doing this, always give the list of categories, or ask the model to use a standard format if one exists. This way, you will be able to reuse the same query generically across many pages.

h1-Foo Bar
h2-Lorem ipsum dolor sit amet
p-Some other text that have extracted
alt-image of a flying bird
p-more text

Here are the links used in the page
fonts.google.com/..
facebook.com/..
twitter.com/..
medium.com/..

Answer only using the given categories:
What would you estimate the company size to be? <10:startup, <50:small, < 300:medium, >=300:large
On a scale of 1 to 5, how active are they?

Answer using standard choices. When possible, use ISO notation:
what's the industry?
in which country are they based?
which countries do they seem to target?
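One practical benefit of discrete choices is that the answers can be checked mechanically before they reach your database. Here is a minimal sketch; the helper names `parse_choice` and `parse_scale` are hypothetical, and the category labels are the ones from the prompt above.

```python
from typing import Optional, Sequence

COMPANY_SIZES = ("startup", "small", "medium", "large")


def parse_choice(answer: str, choices: Sequence[str]) -> Optional[str]:
    """Return the normalised answer if it is one of the allowed choices."""
    cleaned = answer.strip().rstrip(".").lower()
    return cleaned if cleaned in choices else None


def parse_scale(answer: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the answer as an integer if it lies within [low, high]."""
    try:
        value = int(answer.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None


print(parse_choice(" Small. ", COMPANY_SIZES))
print(parse_scale("4"))
print(parse_scale("very active"))
```

When a helper returns `None`, the model strayed from the requested format, and you can decide whether to retry the query or flag the page for review.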

Very importantly, you can also ask the model to prioritize which linked pages to explore next. For example, it may decide (most of the time rightfully) that the terms and conditions are not important given the questions being asked.

h1-Foo Bar
h2-Lorem ipsum dolor sit amet
p-Some other text that have extracted
alt-image of a flying bird
p-more text

Here are the links used in the page
fonts.google.com/..
facebook.com/..
twitter.com/..
medium.com/..

When answering, use the identifier numerals provided at the beginning of each
question. Answer only using the given categories.

1.1 What would you estimate the company size to be? <10:startup, <50:small, 
< 300:medium, >=300:large

1.2 On a scale of 1 to 5, how active are they?

Answer using standard choices. When possible, use ISO notation:

2.1 what's the industry?

2.2 in which country are they based?

2.3 which countries do they seem to target?

3.1 From the given links that I have provided to you, pick the top 3 
that you think are the most relevant to answer these questions.
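The numbered identifiers make the reply easy to parse. Below is a minimal sketch that assumes the model answers one question per line, starting each line with the question's identifier. That reply format is an assumption rather than something the API guarantees, and the sample reply is made up for illustration.

```python
import re

# Matches lines such as "1.1 small" or "1.2: 4".
ANSWER_LINE = re.compile(r"^\s*(\d+\.\d+)\s*[:)\-]?\s*(.+?)\s*$")


def parse_numbered_answers(response: str) -> dict:
    """Map each question identifier (e.g. "1.1") to its raw answer text."""
    answers = {}
    for line in response.splitlines():
        match = ANSWER_LINE.match(line)
        if match:
            answers[match.group(1)] = match.group(2)
    return answers


def split_links(answer: str) -> list:
    """Split a comma-separated list of links, dropping empty entries."""
    return [item.strip() for item in answer.split(",") if item.strip()]


# Hypothetical reply, for illustration only.
reply = """1.1 small
1.2: 4
2.2 FR
3.1 medium.com/.., twitter.com/.."""

answers = parse_numbered_answers(reply)
next_links = split_links(answers.get("3.1", ""))
print(answers)
print(next_links)
```

The links returned for question 3.1 can then be pushed onto the crawl queue, so the model effectively steers which pages get fetched (and paid for) next.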

Finally, a nice addition is to ask GPT itself for advice. For example, we got quite good results when we added a closing request asking it to suggest a couple of additional questions or features to extract that it thinks would fit well with what we currently extract.

Written by Edoardo Barp
