LLM in a Flash – Apple’s Attempt to Inject Intelligence Into the Edge


A new Apple research paper is taking the world by storm: LLM in a flash. ⚑

The idea is simple: make large language models (LLMs) run efficiently on computers that don’t have much memory.

πŸ‘‰ If you’re a human reader, chances are you’re reading this on such a computer or smartphone. If you’re an AI model, you’re probably living in a huge datacenter and are endowed with a lot of memory so don’t bother reading this. πŸ˜‰

Normally, LLMs need a lot of computing power and memory to work well, even for inference, i.e., asking the trained model to give you a response.

πŸ’¬ Quote: “Currently, the standard approach is to load the entire model into DRAM for inference […], this severely limits the maximum model size that can be run. For example, a 7 billion parameter model requires over 14GB of memory just to load the
parameters in half-precision floating point format, exceeding the capabilities of most edge devices.”
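A quick sanity check of that figure, assuming half precision means 2 bytes per parameter (fp16):

```python
# Back-of-the-envelope check of the quoted memory requirement:
# 7 billion parameters, stored in half precision (fp16 = 2 bytes each).
params = 7_000_000_000
bytes_per_param = 2  # fp16

total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # -> 14 GB, more DRAM than most edge devices have
```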

Apple researchers found a way to use less DRAM by keeping the model’s parameters in a different kind of memory (flash memory, hence the paper’s name) and only moving them into main memory (DRAM) when they are actually needed.

Bandwidth in memory architecture: flash has low bandwidth but high storage capacity; DRAM has high bandwidth but low storage capacity. (source)

πŸ’‘ Info: Flash memory is a type of non-volatile storage that retains data without power and is commonly used in USB drives and SSDs, whereas DRAM (Dynamic Random Access Memory) is volatile memory used for fast data access in computers, losing its data when the power is off. For example, the songs stored on your MP3 player are on flash memory, while the programs running on your computer use DRAM.

Flash is slow but capacious and persistent; DRAM is fast but small and volatile. Apple researchers found a way to combine the strengths of both: an LLM infrastructure with flash-sized capacity running at near-DRAM speed.
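Here is a minimal sketch of the core idea, using a hypothetical `FlashBackedWeights` wrapper (an illustration, not Apple’s actual implementation): the weights live in a flash-resident file, and chunks are pulled into a small DRAM-side cache only when requested.

```python
import mmap

CHUNK = 4096  # bytes per weight chunk (illustrative size)

class FlashBackedWeights:
    """Sketch: keep model weights in a flash-resident file and pull
    chunks into a small in-memory cache (the "DRAM" working set)
    only on demand. Hypothetical illustration, not Apple's code."""

    def __init__(self, path):
        self._f = open(path, "rb")
        # Memory-map the file: the OS pages data in from flash as accessed.
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self._cache = {}  # chunk_id -> bytes, our stand-in for DRAM

    def chunk(self, chunk_id):
        if chunk_id not in self._cache:        # DRAM miss ...
            start = chunk_id * CHUNK
            # ... so read the chunk from flash into the cache once.
            self._cache[chunk_id] = self._mm[start:start + CHUNK]
        return self._cache[chunk_id]           # DRAM hit on later calls
```

Only the chunks a layer actually touches ever occupy DRAM; everything else stays on flash.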

They did this by figuring out the best way to use flash memory.

They focused on two main things:

  • 1) reusing data already loaded into DRAM instead of moving it back and forth, and
  • 2) reading data from flash in large, contiguous chunks, which is much faster than many small reads.

They introduced two techniques: “windowing”, which reuses the neurons already loaded for recent tokens, and “row-column bundling”, which stores related rows and columns together so they can be read from flash in one large, contiguous chunk.
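A simplified sketch of row-column bundling (names and layout are illustrative, not the paper’s exact code): for each feed-forward neuron, the matching row of the up-projection and column of the down-projection are stored back to back, so a single sequential flash read fetches everything that neuron needs.

```python
# Row-column bundling, simplified: co-locate the i-th up-projection row
# and the i-th down-projection column so one contiguous read covers both.
# Plain lists stand in for weight vectors here.

def bundle(up_rows, down_cols):
    """Return one flat chunk per neuron: up-row followed by down-column."""
    return [up_rows[i] + down_cols[i] for i in range(len(up_rows))]

def unbundle(chunk, d_model):
    """Split a bundled chunk back into (up_row, down_col)."""
    return chunk[:d_model], chunk[d_model:]

# Tiny example with d_model = 3 and 2 neurons:
up = [[1, 2, 3], [4, 5, 6]]
down = [[7, 8, 9], [10, 11, 12]]
bundled = bundle(up, down)
assert bundled[0] == [1, 2, 3, 7, 8, 9]     # one contiguous flash read
assert unbundle(bundled[0], 3) == ([1, 2, 3], [7, 8, 9])
```

The win is that flash delivers far higher throughput for one large sequential read than for two small scattered ones.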

The key observation behind windowing: the set of active neurons changes only slightly from one sliding window to the next, so most of the data needed for the next token is already sitting in DRAM.
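That observation can be sketched as a set difference (a hypothetical simplification of the paper’s windowing scheme): when the window slides, only the newly activated neurons are loaded from flash, and only the dropped ones are freed.

```python
# Windowing, simplified: DRAM holds the neurons active for the last few
# tokens; sliding the window triggers only incremental flash traffic.

def slide_window(cached, new_active):
    """cached, new_active: sets of neuron ids.
    Returns (to_load, to_free): the incremental flash reads and evictions."""
    to_load = new_active - cached   # neurons we must fetch from flash
    to_free = cached - new_active   # neurons we can evict from DRAM
    return to_load, to_free

# Consecutive tokens activate mostly overlapping neuron sets,
# so the incremental load is tiny compared to a full reload:
window_t = {1, 2, 3, 4, 5}
window_t1 = {2, 3, 4, 5, 6}
load, free = slide_window(window_t, window_t1)
assert load == {6} and free == {1}  # 1 fetch instead of 5
```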

With these methods, they were able to run large language models up to twice the size of the available DRAM.

Inference ran 4-5 times faster on CPUs and 20-25 times faster on GPUs, compared to naive loading approaches.

Their approach is smart because it takes the characteristics of the underlying hardware into account and adapts to them, making it possible to run these large models on devices with limited memory.

πŸ’‘ Takeaway: Alien technology is about to be put in every single device and every imaginable object. The world around us is waking up, starting to sense the environment and react to subtle changes. Intelligence is about to get truly ubiquitous and ambient and we can only guess how much an intelligent environment will disrupt the world we know. For you, this means you need to adapt and do it quickly.

Action Step: Can you start a small home-based business that uses a Raspberry Pi to connect a small local LLM (no WiFi!) with an everyday object, making it aware of its surroundings?

Brainstorm a few ideas of things you could manufacture and sell at a high profit!

To be on the right side of change and stay sharp in the age of generative AI, follow Finxter on WhatsApp and email (free).