From what I understand, LLMs work well because of the sheer size of their training data, which is basically everything people write on the internet. But here’s the thing: most of that data reflects average-level thinking, and there isn’t much high-quality content that reflects genuinely advanced reasoning. Synthetic data might fill some gaps, but it doesn’t create new knowledge; it just recombines what’s already there. So either LLMs have to find better ways to look up existing answers, or they need to figure out how to think like humans and come up with new ideas. Meanwhile, scaling isn’t helping much anymore, since we’ve already used up most of the good data, and humans can’t produce high-quality data fast enough. Is the answer some kind of neuro-symbolic AI, or something else?
LLMs don’t rely on the intelligence of their training data; they work by identifying relationships and patterns within it. It doesn’t matter how smart the individual pieces of data are, because the model learns structure and connections rather than the views being expressed.
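To make that concrete, here’s a toy sketch (plain Python, nothing to do with any real LLM stack) of what “learning patterns” means at the most basic level: the model just counts which tokens tend to follow which, so it absorbs the structure of whatever corpus it sees, smart or not.

```python
from collections import Counter, defaultdict
import random

# Toy corpus: the model only ever sees these patterns.
corpus = [
    "the model learns patterns from data",
    "the model learns structure not opinions",
    "patterns come from the training data",
]

# Count bigram transitions: which word tends to follow which.
transitions = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        transitions[prev][nxt] += 1

def next_word(prev):
    """Sample the next word in proportion to how often it followed `prev`."""
    counts = transitions[prev]
    if not counts:
        return None
    choices, weights = zip(*counts.items())
    return random.choices(choices, weights=weights)[0]

# Generate a short continuation. The output can only recombine
# patterns that were present in the corpus.
word = "the"
generated = [word]
for _ in range(6):
    word = next_word(word)
    if word is None:
        break
    generated.append(word)
print(" ".join(generated))
```

Real models replace the counting with a transformer and do it at enormous scale, but the point stands: nothing in the process measures “intelligence,” only statistical structure.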
@Nori
I get that, but here’s the issue: the training data mostly represents patterns from average intelligence. If those higher-level patterns aren’t in the training data, the model can’t generate them. So, isn’t this really about the limits of the data we give it?
@G.cole1
You’re focusing too much on the ‘intelligence’ of the data itself. What’s important is the complexity and connections in the data overall, not how clever the individual sources sound.
Nori said:
@G.cole1
You’re focusing too much on the ‘intelligence’ of the data itself. What’s important is the complexity and connections in the data overall, not how clever the individual sources sound.
Are you saying there’s no link between the language style of the training data and the quality of responses generated by the model?
@G.cole1
No, I’m not saying that. The model learns how to communicate from the data’s language, but it doesn’t inherit the intelligence level of the content.
LLMs are already good at analyzing language patterns to gauge intelligence or other traits. If you wanted, you could train a model to filter inputs based on a certain intelligence threshold. But limiting data like that could cause other problems, like reducing the diversity and inclusiveness of the model.
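As a rough sketch of what that filtering could look like (the `quality_score` heuristic below is a made-up stand-in; a real pipeline would use a trained quality classifier), you’d score each document and keep only those above a threshold:

```python
# Sketch of threshold-based data filtering. The scoring function is a crude
# placeholder; in practice you'd use a learned classifier's score instead.

def quality_score(text: str) -> float:
    """Toy heuristic: reward longer words and a richer vocabulary."""
    words = text.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    lexical_diversity = len(set(words)) / len(words)
    return avg_word_len * lexical_diversity

documents = [
    "lol that's so true",
    "The proof proceeds by induction on the structure of the derivation.",
    "idk lol maybe",
]

THRESHOLD = 4.0  # arbitrary cutoff chosen for this illustration

kept = [doc for doc in documents if quality_score(doc) >= THRESHOLD]
print(kept)
```

And the trade-off I mentioned shows up immediately: raise the threshold and you also throw away informal but still useful text.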
@Scout
The challenge is that everything here comes down to probabilities. You can’t fundamentally change the patterns already baked into the training data, and fine-tuning only goes so far. When the complexity of requests rises but the quality of the training data doesn’t keep up, the model starts making errors or giving weak answers.
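One way to see the “probabilities dominate” point (purely a toy illustration, not how any particular model is implemented): if expert-level continuations are rare in the training data, they end up with low probability, and sampling mostly reproduces the common, average pattern.

```python
import random

# Imaginary next-token distribution learned from a corpus where the
# "average" continuation vastly outnumbers the "expert" one.
next_token_probs = {
    "average_answer": 0.90,
    "mediocre_answer": 0.08,
    "expert_answer": 0.02,
}

random.seed(0)
samples = random.choices(
    list(next_token_probs.keys()),
    weights=list(next_token_probs.values()),
    k=1000,
)

for token in next_token_probs:
    print(token, samples.count(token))
```

Fine-tuning and prompting can reweight a distribution like this somewhat, which is why they help up to a point, but they can’t conjure patterns that were never in the data at all.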