AI Model Comparison: GPT-4o, Llama 3.1, and Claude 3.5 on Bedrock

Evolution of AI Models

The generative AI landscape has experienced remarkable progress in recent years, particularly in the development of Large Language Models (LLMs). Since the emergence of GPT-3 in 2020, AI models have evolved dramatically to generate text with greater factual accuracy and flexibility. Beyond OpenAI’s advancements with models like GPT-3.5 and GPT-4, other major players such as Anthropic (with their Claude Sonnet series), Meta (Llama series), and Google DeepMind (Gemini series) are actively competing and innovating.

This AI model evolution is driven by factors like increasing model sizes, the expansion of training datasets, and improvements in inference speed and efficiency. Larger AI models enable more sophisticated reasoning, while more extensive training data enhances AI’s knowledge and the accuracy of its responses. Optimizations in inference speed are achieved through techniques such as quantization and parallel processing, with FP8 quantization and custom accelerators playing a crucial role in accelerating execution and optimizing resource consumption. Furthermore, the latest AI models are increasingly incorporating multimodal capabilities, allowing not only text but also image and video generation, leading to more advanced information processing.

The proliferation of LLMs has led various companies to highlight their unique advanced features and strengths. Options now range from models prioritizing accuracy to those optimized for inference speed or cost-effectiveness, catering to a wide array of applications or specialized areas. However, this increasing number of available options also makes it challenging for users to determine the most suitable model for their specific needs or tasks. Factors such as accuracy, speed, cost, security, and operational flexibility must be carefully considered.

So, what is the right model for a specific task? How can we tell which model is a good fit for diverse needs? In this blog post, we will delve into the evaluation process conducted during the development of dotData Insight, comparing OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet (on Amazon Bedrock), and a self-hosted Llama-3.1-70B-Instruct model. Our evaluation focuses on key aspects relevant to North American data professionals: accuracy, speed, cost, security, and capacity.

Exploring GPT-4o, Llama 3.1, and Claude 3.5

Let’s briefly introduce each AI model and outline the basis for our AI model comparison.

GPT-4o (OpenAI)

GPT-4o, from OpenAI, represents a significant advancement over its predecessor, GPT-4, with substantial optimizations in speed and cost. The model is available in both a free tier and a paid tier that offers additional features. It boasts enhanced multimodal capabilities, handling not only text but also audio and images. It is trained on vast datasets, with token-level optimization for efficiency.

A key advantage of this model over other models is its ease of integration via API, significantly reducing the burden of deployment, operation, and maintenance. Leveraging OpenAI’s cloud infrastructure ensures users always have access to the latest model updates and major upgrades, along with robust security measures and load balancing, minimizing the need for infrastructure management.
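To illustrate this low integration burden, here is a minimal sketch of a hypothesis-generation request with the official OpenAI Python SDK (the prompt and parameters are illustrative assumptions, not dotData Insight's actual implementation):

```python
# Minimal sketch: one hypothesis-generation request to GPT-4o via the OpenAI SDK.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a business analyst who explains data signals."},
        {"role": "user", "content": "Why might customers with a past 'mobile device' loan "
                                    "application have a 1.1x higher default rate?"},
    ],
    temperature=0.2,  # illustrative; a low temperature gives more consistent hypotheses
)
print(response.choices[0].message.content)
```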

Claude 3.5 Sonnet (Anthropic / Amazon Bedrock)

Claude 3.5, developed by Anthropic, is an LLM distinguished by its emphasis on safety and the mitigation of ethical biases. It supports long context windows, allowing it to reason while retaining extensive conversational history, which makes it particularly well-suited for interactive applications.

Being available on Amazon Bedrock, Claude 3.5 integrates seamlessly with the AWS ecosystem, offering scalability and cost-management benefits. However, its API-only access might impose limitations on real-time optimization and extensive customization. Recently, Anthropic also launched Claude 3.7 Sonnet, which they describe as a state-of-the-art reasoning model for coding that delivers significant improvements in content generation, data analysis, and planning.

For our evaluation in this post, we will focus on using Claude 3.5 Sonnet through Amazon Bedrock.
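As a rough sketch of what this looks like in practice, Claude 3.5 Sonnet can be invoked through Bedrock's Converse API with boto3 (the model ID, region, and prompt below are examples; your account needs Bedrock model access enabled):

```python
# Minimal sketch: calling Claude 3.5 Sonnet via Amazon Bedrock's Converse API.
# Assumes `pip install boto3`, AWS credentials, and Bedrock model access.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Why might new customers show a higher loan default rate?"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```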

Llama-3.1-70B-Instruct (Meta / Hosted on AWS)

Llama-3.1-70B-Instruct, an open-source LLM from Meta, offers the significant advantage of being self-hostable. With 70 billion parameters, it is specifically optimized for instruction-following tasks. Its support for FP8 quantization enables faster inference and reduced memory usage.

In a self-hosted environment, the selection of infrastructure and the configuration of the inference engine are critical. For our evaluation, we hosted an FP8-quantized version of Llama-3.1-70B-Instruct on a g5.12xlarge instance, employing vLLM as the inference engine to enhance speed and memory efficiency.
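The sketch below shows roughly how such a deployment can be stood up with vLLM's offline API (the checkpoint name and sampling settings are illustrative, not our exact production configuration; a g5.12xlarge has four A10G GPUs, hence the tensor parallelism of 4):

```python
# Minimal sketch: serving FP8-quantized Llama-3.1-70B-Instruct with vLLM across
# the four A10G GPUs of a g5.12xlarge. Settings here are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # or a pre-quantized FP8 checkpoint
    quantization="fp8",
    tensor_parallel_size=4,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Why might new customers show a higher loan default rate?"], params
)
print(outputs[0].outputs[0].text)
```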

How dotData Insight Leverages Generative AI for Signal Interpretation

dotData Insight, powered by dotData’s proprietary AI engine, extracts crucial patterns (signals) from business data that strongly correlate with business objectives. It then leverages the knowledge of generative AI to complement these statistical “signals” derived from data, facilitating the exploration of practical business hypotheses. This synergy allows business users to intuitively understand data insights, formulate new business hypotheses, and enhance their strategic planning and execution. Specifically, dotData Insight inputs signal descriptions (automatically generated by dotData’s AI engine) and information about the distribution of signals and target variables into generative AI prompts to interpret signals and generate business hypotheses.

Let’s consider a concrete example using consumer loan data to analyze the signals of customers with a higher probability of defaulting. The signal discovered by dotData Insight and an example of a business hypothesis generated by the integrated generative AI are shown below:

Signal: “Customers with at least one past loan application for a ‘mobile device’ in the last 3 years have a 1.1 times higher loan default rate.”

Interpretation: “Frequent borrowing for cell phone purchases.”

LLM Generated Business Hypothesis: “Customers who have a history of applying for loans to purchase cell phones in the last three years tend to frequently acquire the latest smartphones. This suggests a pattern of excessive consumption or impulse buying, potentially indicating poor financial management skills. Consequently, the burden of repaying multiple loans increases their debt-to-income ratio, leading to a higher risk of default.”

The interpretation clarifies that “mobile device” in the signal generally refers to cell phones, and “at least one past loan application” implies frequent borrowing. This provides a more relatable understanding of the signal, identifying individuals who frequently borrow for cell phone purchases. Furthermore, the business hypothesis explains why frequent borrowing for cell phones might lead to default, suggesting that the increased burden of multiple loan repayments elevates the risk. This level of detail enhances the usefulness of the insight for devising effective countermeasures.
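To make the prompt-construction step concrete, here is an illustrative sketch of how a signal description and its statistics might be assembled into a generative-AI prompt (the template and wording are our own invention for illustration; dotData Insight's actual prompts are not shown here):

```python
# Illustrative sketch of assembling a signal into a hypothesis-generation prompt.
# The template and field names are hypothetical, not dotData Insight's actual prompt.
PROMPT_TEMPLATE = """You are a business analyst.
Signal discovered from the data: {signal}
Observed effect: {effect}
First give a short, plain-language interpretation of the signal, then a 2-3
sentence business hypothesis explaining why it might raise the default rate."""

prompt = PROMPT_TEMPLATE.format(
    signal=("Customers with at least one past loan application for a "
            "'mobile device' in the last 3 years"),
    effect="1.1x higher loan default rate than the overall customer base",
)
# `prompt` is then sent to the selected LLM (GPT-4o, Claude 3.5, or Llama 3.1).
```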

As this example illustrates, dotData Insight combines signals discovered by dotData’s AI with generative AI to provide users with valuable hypotheses. However, the quality of these hypotheses heavily depends on the capabilities of the generative AI model used. The next section will detail the comparison of GPT-4o, Llama 3.1, and Claude 3.5 that we conducted during the development of dotData Insight.

Evaluation

Evaluation Criteria

We conducted a comprehensive evaluation of GPT-4o, Llama 3.1, and Claude 3.5 across the following dimensions:

  • Accuracy: Does the generated business hypothesis appear reasonable from a human perspective? (Qualitative assessment)
  • Speed: The time taken for inference.
  • Cost: The expense incurred for running the inference.
  • Security: Feasibility of use within a closed network.
  • Capacity: Limitations on tokens and API access.

Evaluation Results

Accuracy

Since the accuracy assessment involved a subjective evaluation of each LLM’s responses, let’s first examine the specific outputs generated by each AI model. We prompted each LLM to generate a hypothesis based on a signal derived by dotData Insight: “Conditions increasing the home loan default rate: A past loan application exists where the applicant was a ‘new customer’ within the last 3 years.”

GPT-4o Response: OK

“New customers often lack a substantial credit history with financial institutions, making them a higher credit risk. Even if they have a past loan application, the limited credit history makes it difficult to accurately assess their repayment ability and creditworthiness, increasing the risk of default.”

Llama 3.1 Response: Not OK

“If a new customer has had a past loan application approved, it might indicate a poor credit history. Therefore, the risk of default increases. Customers whose past loan applications have been approved are more likely to default in the future, leading to a higher default rate.”

Claude 3.5 Response: OK

“Customers with a record of a past loan application as a new customer within the last 3 years are likely to have a relatively short credit history. This limited credit history makes it challenging to accurately assess their repayment ability and creditworthiness, consequently increasing the risk of default. The absence of a long-term repayment track record also makes their resilience to unexpected financial difficulties uncertain, contributing to a higher default rate.”

Both GPT-4o and Claude 3.5 correctly emphasize the “new customer” aspect and highlight the limited credit history and difficulty in assessment. This aligns with the general understanding of how financial institutions evaluate the risk of new customers, making these reasonable business hypotheses. In contrast, Llama 3.1’s response suggests that a past approved loan application for a new customer indicates a poor credit history and a higher likelihood of future default. This contradicts conventional wisdom and raises concerns.

We conducted this business hypothesis generation task for 10 different signals and compared AI models in terms of response quality:

Model                            GPT-4o   Llama 3.1   Claude 3.5
Number of Unnatural Responses    0/10     3/10        1/10

These results suggest that GPT-4o and Claude 3.5 outperform Llama 3.1 in terms of accuracy for this specific application of business hypothesis generation from signals. Llama 3.1 tended to produce more inconsistent or illogical interpretations compared to the other two AI models.

Speed

The following table shows the time taken for each LLM to generate a single business hypothesis:

Model                      GPT-4o   Llama 3.1   Claude 3.5
Response Time (seconds)    1.9      6.6         5.8

Comparing the speeds of the three LLMs, GPT-4o was the fastest, with a response time of 1.9 seconds. Claude 3.5 took 5.8 seconds, and Llama 3.1 took 6.6 seconds. This indicates that GPT-4o has a clear advantage in response speed compared to the other two models.
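These times were measured per request; a simple harness along the following lines is enough to reproduce this kind of comparison (a sketch, with `generate_hypothesis` standing in for each model's API call; this is not our exact benchmarking code):

```python
# Sketch of a latency harness: average wall-clock time per hypothesis request.
# `generate_hypothesis` is a placeholder for the model-specific call.
import statistics
import time

def mean_latency(generate_hypothesis, prompts):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_hypothesis(prompt)
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```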

Cost

The table below compares the cost of generating 10 business hypotheses for each LLM:

Model         GPT-4o     Llama 3.1   Claude 3.5
Cost (USD)    0.065 *1   0.103 *2    0.069 *1

*1 GPT-4o cost is calculated based on OpenAI’s pricing (reference source), and Claude 3.5 cost is based on AWS Bedrock US West (Oregon) region pricing (reference source).

*2 Llama 3.1 cost is calculated based on the time spent on inference using a g5.12xlarge instance in the AWS US West (Oregon) region (reference source), excluding vLLM startup and model loading times.
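As a back-of-the-envelope check on footnote *2, the self-hosted cost follows directly from instance time (the hourly rate below is an assumed example; consult current AWS pricing):

```python
# Back-of-the-envelope reconstruction of the self-hosted Llama 3.1 cost.
HOURLY_RATE_USD = 5.672      # assumed g5.12xlarge on-demand rate (us-west-2); verify current pricing
SECONDS_PER_REQUEST = 6.6    # measured response time from the Speed section
NUM_REQUESTS = 10

cost = HOURLY_RATE_USD / 3600 * SECONDS_PER_REQUEST * NUM_REQUESTS
print(f"${cost:.3f}")        # ~$0.104, consistent with the $0.103 reported above
```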

From a cost perspective, GPT-4o and Claude 3.5 are more economical. GPT-4o costs $0.065 per 10 requests, while Claude 3.5 costs $0.069 for the same. Llama 3.1, at $0.103 per 10 requests, is more expensive than the other two models.

Furthermore, since Llama 3.1 was evaluated in a self-hosted environment, charges are incurred for the instance’s uptime, regardless of whether requests are being processed. This means that if the volume of requests is not sufficiently high, the cost efficiency of Llama 3.1 could be further diminished. In contrast, GPT-4o and Claude 3.5, being API-based services, follow a pay-as-you-go model based on token usage. This makes them more cost-efficient as there are fewer idle resource costs compared to a self-hosted solution like Llama 3.1.

Security

The following table compares the feasibility of using each LLM within a closed network:

Model                    GPT-4o   Llama 3.1   Claude 3.5
Use in Closed Network    No       Yes         Yes

Llama 3.1 can be self-hosted, making it possible to use in a closed network. Claude 3.5 can also be used within AWS in a closed network. On the other hand, OpenAI’s GPT-4o requires internet access and cannot be used in a closed network.

Capacity

The table below compares the limitations on tokens and API access for each LLM:

Model                  GPT-4o   Llama 3.1   Claude 3.5
Requests per Minute    10K *1   None *2     250 *3
Tokens per Minute      2M *1    None *2     2M *3

*1 Upper limit for OpenAI tier 4 organization (reference source).

*2 As it is self-hosted, there are no inherent limitations. However, as the number of requests and tokens increases, the response time will lengthen, so there is a practical upper limit to the number of requests that a single machine can process. For dotData Insight’s business hypothesis generation, the response time exceeds 10 seconds with 3 concurrent requests, and considering real-time performance, approximately 3 concurrent requests is considered the practical capacity of a single g5.12xlarge instance.

*3 Upper limit for AWS Bedrock US West (Oregon) region (reference source).

From a capacity standpoint, GPT-4o offers significantly higher limits for both requests per minute and tokens per minute compared to Claude 3.5. However, it's crucial to note that GPT-4o cannot be used in a closed network. While Claude 3.5 is available in closed networks, its request-per-minute limit (250) is considerably lower. Llama 3.1, being self-hosted, has no explicit limits on requests or tokens, but its practical limit is determined by response-time degradation under increasing load (see *2).
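The practical limit in footnote *2 can be probed with a simple concurrency test like the sketch below (`generate_hypothesis` again stands in for the self-hosted Llama 3.1 call; this is not the exact script from our evaluation):

```python
# Sketch of a concurrency probe: send N simultaneous requests and time the batch.
# `generate_hypothesis` is a placeholder for the self-hosted Llama 3.1 call.
import time
from concurrent.futures import ThreadPoolExecutor

def probe(generate_hypothesis, prompt, concurrency):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(generate_hypothesis, [prompt] * concurrency))
    return time.perf_counter() - start

# In our setting, per-request latency exceeded 10 s at concurrency=3 on one
# g5.12xlarge, which is why ~3 concurrent requests is the practical cap.
```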

Conclusion

Currently, no single AI model unequivocally excels in all aspects of accuracy, speed, cost, security, and capacity. Therefore, it is crucial to strategically select and potentially combine different models based on specific application needs and business requirements. For instance, self-hosted LLMs are highly valuable in sectors with stringent security requirements, such as finance and healthcare. Conversely, cloud-based LLMs offer advantages in terms of rapid development and flexibility.

At dotData, our products are designed to support multiple models, including OpenAI and Bedrock, allowing individual users to choose the optimal model for their specific needs. This enables us to cater to diverse requirements, whether prioritizing high accuracy, minimizing costs, or ensuring secure operation within a closed network. As the performance of these AI models continues to improve and the options for regional availability and API access expand, the process of selecting the best LLM may become more complex. However, this also broadens the potential for leveraging AI in business. To prepare for an increasingly AI-driven future, continuous monitoring and evaluation of these technologies will be essential.

Hiroshi Tamano, Ph.D.

Hiroshi is the Tech Lead of the analytical core technology for dotData Insight, where he currently leads the development of analytical techniques. Before joining dotData in 2021, Hiroshi was involved in research in AI and machine learning. Hiroshi received his Ph.D. in Statistical Science.
