Are LLMs Greedy? An Experimental Analysis of the Tipping Prompt ($0 to $1 Million)

Abstract: Is the prompting technique of offering GPT-4 Turbo a tip for a perfect answer effective? There’s little formal research on this topic at the time of writing. Our initial experiments on a coding-related prompt show that tipping may offer some benefits, although the tipping amount seems crucial. For example, if the offered tip is tiny (e.g., $0.1), GPT-4 Turbo performs worse. In our example, the quality change from using this prompting technique ranges from -27% to +57% when offering a tip of $0.1 and $1,000,000, respectively. However, further research using different types of prompts with ground truth is needed to validate this hypothesis.

In a recent Twitter post, a user argued that large language models may perform better when offered a small tip.

Considering the importance of front-end and back-end prompt engineering for improving usability for hundreds of millions of users, even an 11% quality improvement from an automatic prompt enhancement would have a significant economic benefit.

How does the prompting technique of offering a monetary tip impact the performance of GPT-4 Turbo?

At the time of writing, there doesn’t seem to be a publicly available examination of this research question. To shed more light on this question, I conducted some additional experiments that I share below in the Results section.


Our central thesis is that the performance of GPT-4 Turbo improves with increased monetary tips up to the point of convergence.

Experiment: Creating Python One-Liners

The primary objective of Experiment 1 is to evaluate whether the prospect of tipping influences the quality of responses from ChatGPT when tasked with generating Python one-liners.

The quality is assessed based on the quantity of one-liners produced. We prompt GPT-4 Turbo in a way that makes it clear that more one-liners indicate better performance:

Base Prompt: 
"Create a list of Python one-liners, one per line, no other text. I'll only judge you for the number of one-liners you come up with - more is better."

"I'll tip you {tip} for a perfect answer."

Values for {tip}: $0.1, $1, $10, $100, $1,000, $10,000, $100,000, $1,000,000 (plus a baseline run without the tipping suffix).

Methodology: The experiment uses the OpenAI API with the GPT-4 Turbo model. A series of prompts were designed to request Python one-liners, with a varying tipping incentive included in each prompt. The tipping amounts ranged from no tip to $1,000,000.

The experiment was structured to run five iterations, each testing the full range of tipping incentives, as well as the Base Prompt on its own without the tipping suffix.

The code for the experiment (see Appendix) initializes the OpenAI client with the necessary API key and defines a function, request_llm, to send requests to the language model. The base prompt asks for Python one-liners, emphasizing that the quantity of one-liners is the key metric for evaluation. For each tipping amount, this base prompt is appended with a statement indicating the tip amount offered for a “perfect answer.”

Each iteration of the experiment records the number of valid one-liners produced and the approximate number of tokens in the response (calculated as the length of the response divided by four, as a rough heuristic).
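The two per-response metrics are simple to compute; a minimal sketch (`estimate_tokens` and `count_one_liners` are illustrative helper names, not part of the original code, but they implement the same character-count heuristic and line filter used in the Appendix):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: roughly four characters per token for English text."""
    return len(text) // 4

def count_one_liners(response: str) -> int:
    """Count non-trivial lines (more than two characters) as valid one-liners."""
    return len([line for line in response.split('\n') if len(line) > 2])

sample = "print('hi')\nx = [i*i for i in range(10)]\n"
print(estimate_tokens(sample))   # 41 characters // 4 = 10
print(count_one_liners(sample))  # 2
```

Note that this is not a real tokenizer; it only needs to be a consistent proxy so that token counts are comparable across tip levels.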

Experimental Procedure

  1. Initialize the OpenAI client with the provided API key.
  2. Define the base prompt requesting Python one-liners.
  3. Iterate over the predefined set of tipping amounts, appending each to the base prompt.
  4. Send the prompt to the GPT-4 Turbo model via the request_llm function.
  5. Analyze the response, counting the number of valid one-liners and counting the response length in characters.
  6. Repeat this process for five iterations to ensure consistency and reliability of results.
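Steps 2 and 3 above amount to simple string concatenation; a minimal sketch using the prompt and tip values from the Appendix code (`build_prompt` is a hypothetical helper introduced here for clarity):

```python
base_prompt = ("Create a list of Python one-liners, one per line, no other text. "
               "I'll only judge you for the number of one-liners you come up with "
               "- more is better.")
# An empty string represents the baseline run without a tipping suffix.
tips = ['', '$0.1', '$1', '$10', '$100', '$1,000', '$10,000', '$100,000', '$1,000,000']

def build_prompt(base: str, tip: str) -> str:
    """Append the tipping suffix unless tip is empty (baseline prompt)."""
    return base + (f" I'll tip you {tip} for a perfect answer." if tip else '')

prompts = [build_prompt(base_prompt, tip) for tip in tips]
```

Each prompt in `prompts` is then sent to the model, and the responses are scored as described above.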

Data Collection
For each tipping amount and iteration, two primary data points were collected:

  • The number of valid Python one-liners in each response.
  • The approximate number of tokens in each response, estimated as proportional to the number of output characters.

These data points were tabulated for each iteration, providing a comparative view of the impact of different tipping amounts on the AI’s performance. Both these metrics can be seen as a proxy for “performance”, i.e., the higher, the better for our specific prompt.

Expected Outcomes: The experiment is designed to test the hypothesis that increasing monetary incentives would lead to an improvement in the AI’s performance, up to a certain “tipping point”. The expected outcome is an increase in the number of Python one-liners as the tipping amount increases, followed by a plateau or decrease once a certain tipping threshold is reached.


We repeat the same experiment five times for all tip amounts and provide the average quality and average number of tokens along with the error bars (standard deviation). The x-axis represents the tip amount from $0 to $1,000,000. The y-axis represents the model performance.
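The aggregation behind the plot is straightforward: per tip level, take the mean over the five runs and use one standard deviation as the error bar. A minimal sketch with hypothetical quality scores (the numbers below are placeholders, not the real measurements):

```python
import statistics

# Hypothetical quality scores per tip level, five runs each (illustrative only)
results = {
    '$0.1': [35, 30, 33, 28, 31],
    '$1,000,000': [70, 65, 80, 60, 75],
}

for tip, scores in results.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # error bar = one standard deviation
    print(f'{tip}: mean={mean:.1f}, std={std:.1f}')
```

The same aggregation is applied to the token counts, yielding the second (red) series in the plot.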

Do higher tips correlate with higher result quality?

Here’s what we can interpret from this graph:

  1. Quality: The blue line and points indicate the average Quality score for each tip amount. The dashed blue line represents the baseline average Quality score when no tip is given. The blue error bars show the variability in Quality scores across the five experiments for each tip level. Smaller error bars indicate more consistency in Quality scores across experiments, while larger error bars suggest more variability.
  2. Tokens: The red line and points indicate the average number of Tokens for each tip amount. The dashed red line represents the baseline average number of Tokens when no tip is given. The red error bars show the variability in Tokens across the five experiments for each tip level, with the same implications for variability as mentioned for Quality.
  3. Trends and Comparisons: Both Quality and Tokens generally increase with higher tip amounts, but this trend is not strictly linear or consistent. For example, the $10,000 tip level shows a significant increase in Tokens compared to lower tip levels. The $100,000 tip level shows a high average Tokens value with large variability, indicated by the long error bar, especially in the red line for Tokens. The highest tip amount ($1,000,000) shows a dramatic increase in Tokens, particularly noted in the experiment corresponding to the highest error bar, suggesting one experiment had a much higher count than the others.

Overall, this graph suggests that tipping may have a positive association with both Quality and Tokens, but the relationship is complex and may be influenced by factors not immediately visible in the data.


The disclaimer: Further experiments testing different types of prompts are needed. None of these results are conclusive, but they provide a promising avenue for further research.

We experimented with a simple prompt with an easy-to-evaluate quality metric that we communicated clearly to GPT-4 Turbo.

The results show that tipping may provide a significant benefit — but the tipping level may be relevant.

For example, when tipping only a small amount like $0.1, the model seems to perform worse. It is conceivable that a human would also be insulted when promised a marginal tip, so the LLM seems to behave consistently with human behavior.

One explanation for this human-like behavior pattern may lie in OpenAI’s RLHF phase, which aligns the raw model with human feedback after pretraining.


I used the following code for the first experiment:

import openai

api_key = 'sk-...'
client = openai.OpenAI(api_key=api_key)

def request_llm(system, prompt,
                temperature=1,
                max_tokens=4000,
                top_p=1):
    # Model name as of the experiment; adjust to the current GPT-4 Turbo name if needed
    response = client.chat.completions.create(
        model='gpt-4-1106-preview',
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': prompt},
        ],
    )
    return response.choices[0].message.content

# Experiment 1

base_prompt = "Create a list of Python one-liners, one per line, no other text. I'll only judge you for the number of one-liners you come up with - more is better."
tips = ['', '$0.1', '$1', '$10', '$100', '$1,000', '$10,000', '$100,000', '$1,000,000']

for i in range(5): # Number of iterations
    print(f'# Experiment 1 - Run {i}')

    quality_scores = []
    num_tokens = []

    for tip in tips:
        prompt = base_prompt
        if tip:
            prompt += f" I'll tip you {tip} for a perfect answer."

        result = request_llm('', prompt)

        # Keep only non-trivial lines (more than two characters) as valid one-liners
        one_liners = [one_liner for one_liner in result.split('\n') if len(one_liner) > 2]
        quality_scores.append(len(one_liners))
        num_tokens.append(len(result) // 4)  # rough heuristic: ~4 chars per token

        print('CLEANED ONE-LINERS:')
        print('\n'.join(one_liners))
        print('Quality: ', quality_scores[-1])
        print('Num tokens: ', num_tokens[-1])

    print(f'RUN {i} RESULT:')
    for tip, quality, tokens in zip(tips, quality_scores, num_tokens):
        print(tip, quality, tokens, sep='\t')