This is the question we’re talking about today… can a weaker, cheaper model actually beat the stronger, more expensive ones when it comes to generating training data for large language models (LLMs)?
Here’s a quick comparison:
Stronger and more expensive (SE) models produce high-quality data, but you get fewer samples for the same budget because each one costs more.
Weaker and cheaper (WC) models produce more data but at a lower quality.
WC models tend to make more errors, which hurts data reliability.
Some findings suggest that, for a fixed compute budget, WC models can produce broader and more diverse data, while SE models simply can’t generate as many samples because of the resources each one needs.
With verification to filter out the bad samples, fine-tuning on WC-generated data can deliver cost-effective results without giving up much performance.
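To make that concrete, here’s a toy sketch of the compute-matched sampling idea: spend one fixed budget either on a few expensive samples or many cheap ones, and keep only the candidates a verifier accepts. The budget numbers are made up, and generate_with and verify_solution are hypothetical stand-ins, not any real API.

```python
# Toy sketch of compute-matched sampling with verification.
# The budget and per-sample costs are arbitrary illustrative numbers,
# and generate_with / verify_solution are hypothetical stand-ins.

BUDGET_FLOPS = 1e18        # total sampling budget
COST_PER_SAMPLE_SE = 5e14  # strong-expensive model: pricey per sample
COST_PER_SAMPLE_WC = 5e13  # weak-cheap model: ~10x cheaper per sample

def build_training_set(problems, model, cost_per_sample, generate_with, verify_solution):
    """Spend the whole budget on one model and keep only verified samples."""
    samples_per_problem = int(BUDGET_FLOPS / (cost_per_sample * len(problems)))
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = generate_with(model, problem)
            # e.g. check the final answer against a known solution
            if verify_solution(problem, candidate):
                kept.append((problem, candidate))
    return kept

# For the same budget the WC model gets ~10x the attempts per problem,
# so even with a higher error rate it can end up with more verified examples.
```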
Are cheaper, weaker models the future of LLMs? Find out more in the full paper linked below.
Oh, thank you for helping me stop thinking I was going crazy! I’ve always thought that fine-tuning a smaller model like GPT-2 (or even combining several of them) could handle most of the problems we currently send off to Amazon’s servers running these big models. Here’s what I found from a summary about fine-tuning smaller models:
Yes, you are right. Smaller models need fewer resources to fine-tune effectively. Here’s why:
Fewer Parameters: There’s simply less to adjust, so they can learn from less data and train faster.
Less Compute: They need less processing power to train, so it’s cheaper and quicker (some rough numbers after this list).
Absorb the Signal: They often pick up the task signal faster because they’re less complex and more focused.
Specialised Training: For a specific task, smaller models can concentrate on it and get good results without spending capacity on the general-purpose abilities of bigger models.
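As a rough illustration of the “fewer parameters, less compute” points, here’s some back-of-envelope arithmetic comparing GPT-2 small (~124M parameters) with a 7B-parameter model. The 16-bytes-per-parameter figure is a common rule of thumb for full fine-tuning with Adam in fp32, not an exact measurement.

```python
# Back-of-envelope memory estimate for full fine-tuning with Adam in fp32:
# ~4 bytes weights + 4 bytes gradients + 8 bytes optimizer state per parameter.
# Rule of thumb only; real numbers depend on precision, activations, batch size.
BYTES_PER_PARAM = 16

for name, params in [("GPT-2 small", 124e6), ("7B model", 7e9)]:
    gib = params * BYTES_PER_PARAM / 2**30
    print(f"{name}: roughly {gib:.1f} GiB for weights, gradients and optimizer state")

# GPT-2 small: ~1.8 GiB -> fits in ordinary RAM, so CPU fine-tuning is plausible
# 7B model:   ~104 GiB  -> already pushes you to multiple GPUs or heavy offloading
```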
But there are some downsides:
Smaller models can’t handle complex tasks well, especially ones that need a lot of world knowledge.
They might struggle with unfamiliar data and more detailed reasoning.
That said, they’re great when resources are limited or the task is narrow.
For anyone interested in my experiment, I fine-tuned GPT-2 to capture Deleuze’s style in just 20 minutes on my low-end CPU: Link to model
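For anyone curious what a run like that can look like in code, here’s a minimal sketch using Hugging Face transformers. It’s not my exact script: deleuze.txt is a placeholder corpus file and the hyperparameters are just reasonable defaults.

```python
# Minimal GPT-2 fine-tuning sketch with Hugging Face transformers.
# "deleuze.txt" is a placeholder corpus file; hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load a plain-text corpus and tokenize it.
raw = load_dataset("text", data_files={"train": "deleuze.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Causal LM collator: labels are the inputs shifted by one, no masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-deleuze",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    logging_steps=50,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # runs on CPU automatically if no GPU is available
trainer.save_model("gpt2-deleuze")
```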