FOR loops are awesome.
They are an essential tool for every coder.
A staple... the bread and butter, the 'Je ne sais quois' of programming essence.
But what if your beloved FOR loop is making your NLP code over 150 times slower!?
... Put on your oxygen masks, we just lost cabin pressure...
This dramatic change in efficiency is something we clearly need to avoid when creating our programs, and it's something we need to understand to its core as well. (Pun intended)
Today I'll explore the reasons behind this efficiency shift: the power of compiled languages over interpreted languages, and how using optimised code can make such a huge difference, especially in NLP and data science programming.
We'll also dabble in the fundamentals of computer science and the architecture of processing hardware itself. Exciting times! :D.
The Speed Experiment
Create two arrays, both filled with ten million random numbers... we'll call them 'a' and 'b'.
import numpy as np
a = np.random.rand(10000000)
b = np.random.rand(10000000)
Start the pre-experiment stopwatch...
import time
start_time = time.time()
Now we use a standard FOR loop to multiply each number in array a by the corresponding number in array b, and add up all the results.
result = 0
for i in range(len(a)):
    result += a[i] * b[i]
Stop the clock!!!
stop_time = time.time()
And our survey says:
Running time of the 'For loop' multiplication: 448.495... ms
Which when you think about it, is absolutely incredible.
This processor just did ten million multiplications and ten million additions in under half a second. Computers are cool. Fact.
So let's see what happens when we run the exact same experiment through our optimised NumPy process. Replace the FOR loop with this line:
result = np.dot(a,b)
And the results are in...
Running time of the dot product process: 2.699... ms
The FOR loop is 445.79... ms slower.
Approximately 166 times slower using the FOR loop than dot product.
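For anyone who wants to reproduce the whole experiment in one self-contained script, here's a minimal sketch (exact timings will differ machine to machine):

```python
import time
import numpy as np

n = 10_000_000  # ten million elements, matching the experiment above
a = np.random.rand(n)
b = np.random.rand(n)

# FOR loop version
start = time.time()
result_loop = 0.0
for i in range(len(a)):
    result_loop += a[i] * b[i]
loop_ms = (time.time() - start) * 1000

# NumPy dot product version
start = time.time()
result_dot = np.dot(a, b)
dot_ms = (time.time() - start) * 1000

print(f"For loop:    {loop_ms:.3f} ms")
print(f"Dot product: {dot_ms:.3f} ms")
print(f"Speed-up:    ~{loop_ms / dot_ms:.0f}x")
```

Both versions compute the same number; only the time taken differs.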
So let's unpack our results and answer some questions:
What is this optimised process named 'dot product'?
Why is it performing the same calculation two orders of magnitude faster?
Dot Product
The dot product calculation may ring vague bells of geometry and vectors...
They seem bizarrely unrelated to what we are doing with our numbers in the array.
However, every word in a language model lives in a multi-dimensional geometric space; the arrays are no more than sets of coordinates - a map to the location of each word.
If you were to draw a line between these "x marks the spots", the distance between each word helps the language model work out their relationship to one another.
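To make that concrete, here's a toy sketch — the words and 2-D coordinates below are entirely made up for illustration, real embeddings have hundreds of dimensions:

```python
import numpy as np

# Hypothetical 2-D 'coordinates' for three words in a tiny embedding space
embeddings = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([0.8, 0.9]),
    "car": np.array([0.1, 0.2]),
}

# The straight-line distance between the 'x marks the spot' of two words
def distance(w1, w2):
    return np.linalg.norm(embeddings[w1] - embeddings[w2])

print(distance("cat", "dog"))  # small: related words sit close together
print(distance("cat", "car"))  # large: unrelated words sit far apart
```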
If you're interested in learning the hows and whys of the NLP side of dot product use I've included a short excerpt below for the enthusiasts. Feel free to skip it.
NLP
It all starts in 2017 with the de facto game changer of a paper released by Google, "Attention Is All You Need", which showcased the attention mechanism.
Widely recognised as the pivotal paper in LLM technology, this single technique has "spawned an entire race of language models" - ChatGPT, LLaMA, etc. (Thank you, Morpheus)
Dot product and NLP:
Attention Mechanisms
Attention mechanisms rely heavily on dot products.
The attention “score” between different words in a sentence is calculated using the dot product of their query and key vectors, which are derived from the word embeddings. This helps the model ‘pay attention’ to the relevant parts of the input.
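A bare-bones sketch of dot-product attention scores (toy vectors, not a real model — the numbers are invented for illustration):

```python
import numpy as np

# Toy query and key vectors for three words (values are made up)
query = np.array([1.0, 0.0, 1.0])           # the word doing the 'looking'
keys = np.array([[1.0, 0.0, 1.0],           # word 1
                 [0.0, 1.0, 0.0],           # word 2
                 [0.5, 0.5, 0.5]])          # word 3

# Attention scores are dot products between the query and each key
scores = keys @ query

# Softmax turns raw scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()
print(weights)  # the highest weight lands on the most similar word
```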
Gradient Descent
Optimisation algorithms like gradient descent use dot products to compute gradients.
In NLP models, gradients are used to update parameters (like weights in neural networks) to minimize the loss function. This is essential for training models to understand and generate human language.
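As a rough illustration of where dot products show up in gradient descent — a tiny linear model rather than an NLP network, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                          # targets (dot products again)

w = np.zeros(3)                         # start with all-zero weights
lr = 0.1                                # learning rate
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)   # the gradient is built from dot products
    w -= lr * grad                      # update step: walk downhill on the loss

print(w)  # converges towards true_w
```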
Cosine Similarity
Cosine similarity, which is derived from the dot product, is often used in NLP to measure the similarity between documents or sentences.
It’s particularly useful in tasks like document clustering, information retrieval, and recommendation systems.
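Cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal sketch with toy document vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # dot product, normalised by the vectors' magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc1 = np.array([1.0, 2.0, 0.0])   # toy document vectors (e.g. word counts)
doc2 = np.array([2.0, 4.0, 0.0])   # points in the same direction as doc1
doc3 = np.array([0.0, 0.0, 5.0])   # shares nothing with doc1

print(cosine_similarity(doc1, doc2))  # 1.0: identical direction
print(cosine_similarity(doc1, doc3))  # 0.0: orthogonal, no overlap
```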
Now that we've seen how dot products play a crucial role in NLP, let's return to our performance comparison.
A FOR loop is such a fundamental operation in programming, how can it be that slow compared to its dot product companion?
Enter NumPy
NumPy stands for Numerical Python.
It's a popular library for anyone playing with data science, maths or, well, anything with numbers really. When it comes to doing mathematical operations, it has a few tricks up its sleeve.
Vectorisation:
NumPy does something called "vectorisation," which is a fancy way of saying it performs operations on entire arrays (or large chunks of data) all at once, instead of one element at a time like a traditional for loop.
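For example, multiplying two arrays element by element needs no loop at all:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: one element at a time
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] * b[i]

# Vectorised version: the whole array in one expression
vectorised = a * b
print(vectorised)  # same values as the loop, no loop required
```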
Pre-compiled highly optimised sub-libraries
When it comes to performing mathematical calculations, NumPy relies on two highly optimised libraries: BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package).
By optimised, I mean these libraries are written in lightning-fast, low-level compiled languages - unlike interpreted ones such as Python - and they make efficient use of low-level system calls.
What's the difference, you may ask?
Python is interpreted, not pre-compiled.
Which means it's translated into machine instructions by the system as it goes.
All code that humans write has to be turned into machine code, that is the binary 0's and 1's the CPU understands.
Python does this in a way described as 'interpreted' which means it does it line by line after you click the play button on your favourite IDE.
Compiled languages on the other hand are pre-built into this machine code format, meaning the process of converting into 0's and 1's has already been done. So when you're talking about nanosecond levels of efficiency, you're really speeding things up.
On top of this, NumPy does a few other things well too:
Memory Efficiency:
NumPy arrays are stored in contiguous blocks of memory, which reduces the time spent between grabbing information and carting it over to the CPU.
What do I mean by contiguous?
Think of a NumPy array as a row of lockers.
In Python’s normal lists, each piece of data might be stored in random lockers scattered all over the place.
So every time you need to use data from the list, the program has to run around the locker room, finding where everything is.
With NumPy, all the data is stored next to each other in a row of lockers. This means that when the computer fetches the data, it can grab everything quickly and efficiently, without running back and forth.
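You can actually peek at this 'row of lockers' layout — NumPy exposes it through the array's flags and strides:

```python
import numpy as np

arr = np.arange(10, dtype=np.float64)

# Each 8-byte float sits right next to the previous one in memory
print(arr.flags['C_CONTIGUOUS'])  # True: one unbroken block of memory
print(arr.strides)                # (8,): step 8 bytes to reach the next element

# A plain Python list, by contrast, holds pointers to objects
# that can be scattered anywhere on the heap.
```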
Parallelism:
NumPy can take advantage of parallel processing capabilities of modern CPUs.
Parallelism segues nicely into the topic of CPUs.
You've probably heard about how Nvidia is taking over the world due to its parallel processing capability - but may not have understood entirely what that meant.
So let's take a walk down CPU lane and get a taste of how they do what they do ...
(Just for clarity, they are taking over the world with their GPUs, not CPUs, but the same principles apply and that is a post for another day! :))
Parallel processing and the CPU
How can a CPU perform operations on entire arrays at once? A CPU is a CPU; it does one thing at a time, doesn't it?
Yes and no...
While each core of a CPU executes one instruction at a time, modern CPUs have several features that allow them to perform multiple operations concurrently.
Parallel Processing
Modern CPUs have multiple cores, and each core can execute instructions independently. This means that a CPU can perform multiple operations simultaneously by distributing tasks across its cores. This is known as parallel processing.
For example, if you have a quad-core processor, it can handle four tasks at the same time, significantly speeding up computations.
Hyper-Threading
Hyper-threading is a technology used by some Intel processors to improve parallelization of computations. It allows a single physical core to act like two logical cores, effectively doubling the number of tasks the CPU can handle simultaneously.
This is achieved by sharing the core’s resources between two threads, allowing the CPU to switch between tasks more efficiently and reduce idle time while, for example, memory is being fetched or the operating system needs attention.
Multi-Thread Processing
Multi-threading involves dividing a single process into multiple threads that can be executed concurrently.
Each thread represents a separate path of execution within the process. In the context of NumPy, multi-threading can be used to perform operations on different parts of an array simultaneously.
This is particularly useful for large datasets, as it allows for faster processing by leveraging the CPU’s ability to handle multiple threads.
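As a rough sketch of the splitting idea — note that plain Python threads are limited by the GIL, but NumPy's compiled routines can release it while they crunch numbers, so it's the divide-and-conquer pattern that's being illustrated here:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.random.rand(1_000_000)

# Split the array into four chunks and sum each chunk on its own thread
chunks = np.array_split(a, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(np.sum, chunks))

total = sum(partial_sums)
print(total)  # the same answer as summing the whole array in one go
```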
So, in summary then...
While each core executes one instruction at a time, the presence of multiple cores, hyper-threading, and multi-threading capabilities allow CPUs to handle multiple tasks concurrently.
This is how NumPy can perform vectorised operations efficiently, taking advantage of the CPU’s parallel processing power to apply operations to entire arrays at once.
CPUs in a nutshell
Clock speed and clock cycles are fundamental concepts in understanding how a CPU performs operations, including the aforementioned vectorised operations in NumPy.
Anyone who has bought a PC, or probably even a mobile phone will have been faced with information on what GHz (Gigahertz) the processor ran at.
The length of a CPU’s clock cycle is determined by its clock speed, which is measured in hertz (Hz). A GHz (Gigahertz) is equal to one billion cycles per second.
So a 1 GHz CPU has a clock cycle of 1 nanosecond.
The higher the clock speed, the shorter each cycle - and the more the CPU gets done every second. :)
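The arithmetic behind that is straightforward:

```python
# Cycle time = 1 / clock speed
one_ghz = 1e9                      # 1 GHz = one billion cycles per second
cycle_seconds = 1 / one_ghz
print(cycle_seconds * 1e9, "ns per cycle")   # 1.0 ns per cycle at 1 GHz

# A modern 3 GHz CPU squeezes three cycles into every nanosecond
print(1 / 3e9 * 1e9, "ns per cycle")         # ~0.33 ns per cycle
```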
Not to show my age, but I remember pining over a processor that had a clock speed of 66 Megahertz (MHz) rather than just 33 MHz... A MHz (Megahertz) is a thousand times slower than a GHz, at a million cycles per second. My how times have changed.
Let's wrap this up.
In summary, FOR loops are powerful, but when performance is critical, using optimised libraries like NumPy can make the difference between code that runs and code that flies.
(Or for those of us playing with NLP, waiting 20 hours for your program to finish versus 2 hours).