Parallel and Concurrent programming - LogeshVel/learning_resources GitHub Wiki
Multi-Threads - useful for I/O-bound tasks
Multi-Process - useful for CPU-bound tasks
Sequential vs. Parallel Computing
Sequential computing
Sequential computing is the traditional programming approach where instructions are executed one after another by a single processor, with only one instruction being executed at a time; the processor's capabilities limit the speed of execution.
Parallel computing
Parallel computing utilizes multiple processors working simultaneously on different parts of a task; this approach breaks down tasks into independent parts that can be executed in parallel.
Advantages of Parallel Computing
- Parallel computing significantly increases the overall throughput of a program by enabling faster completion of large or complex tasks that would be impractical for a single processor; it enables the computer to accomplish more tasks in a given amount of time.
- By leveraging multiple processors, parallel computing makes it possible to solve problems too large or complex for a single processor; in many industries, the time saved using parallel computing often outweighs the costs of investing in specialized hardware.
Challenges of Parallel Computing
- Implementing parallel computing increases the complexity of programming, requiring careful coordination between processors to ensure efficient execution; adding more processors doesn't always result in a proportional increase in speed due to coordination overhead.
- There may be waiting times between dependent tasks, which can affect overall performance.
Real-World Applications
- Web search engines are one example of a real-world application that uses parallel computing to process millions of transactions every second
- Many other industries leverage parallel computing, where time savings translate directly to cost savings
Key Takeaways
- Parallel computing offers significant performance benefits for tasks that can be effectively parallelized
- Implementing parallel solutions requires careful planning and coordination to overcome inherent challenges
- Parallel computing has grown in importance across various industries due to its ability to handle complex, data-intensive tasks; it is essential for solving large-scale computing problems in modern applications
Four Classes of Computer Architecture (Flynn’s Taxonomy)
Parallel computing requires parallel hardware with multiple processors. Flynn's taxonomy classifies computer architectures into four categories based on their instruction streams and data streams:
Single Instruction, Single Data (SISD)
- Sequential computer with a single processor unit
- Executes one series of instructions on one data element at a time
Single Instruction, Multiple Data (SIMD)
- Parallel computer with multiple processing units
- All processors execute the same instruction simultaneously on different data elements
- Well-suited for applications performing repetitive operations on large datasets (for example, image processing)
- Modern GPUs often use SIMD instructions
Multiple Instruction, Single Data (MISD)
- Each processing unit executes its own series of instructions on the same data stream
- Not a commonly used architecture due to limited practical applications
Multiple Instruction, Multiple Data (MIMD)
- Each processing unit can execute different instructions on different datasets
- Most commonly used architecture found in multicore PCs, networked clusters, and supercomputers
MIMD Subdivisions
Single Program, Multiple Data (SPMD)
- Multiple processing units execute copies of the same program simultaneously
- Processors can run asynchronously and execute different program parts based on conditional logic
- The most common style of parallel programming
Multiple Program, Multiple Data (MPMD)
- Processors execute different, independent programs simultaneously on different data
- Often uses a "host" or "manager" node to distribute work to other nodes and then collect results
- Less common than SPMD but useful for certain applications
Memory Architecture
Memory organization and access are crucial factors in parallel computing performance. Even with numerous processors, inefficient memory access can negate potential gains because computer memory typically operates slower than processors, creating potential bottlenecks.
Shared memory and distributed memory are the two main memory architectures that support different use cases for parallel computing.
Shared Memory Architecture
All processors access a global memory address space, and changes made by one processor to that memory are visible to all others.
Types of shared memory systems
Uniform Memory Access - UMA
- Processors have equal access speed to memory
- Symmetric multiprocessing (SMP) is a common UMA architecture
- Modern multicore processors use SMP architecture
Symmetric Multiprocessing - SMP
Non-Uniform Memory Access - NUMA
- Often created by connecting multiple SMP systems
- Processors have varying access speeds to different memory parts
- All processors can still access all memory
Caches in shared memory systems
- Each core typically has its own fast, local cache
- Cache coherency becomes a challenge when updating shared memory
- Hardware handles cache coherency in multicore processors
Advantages and disadvantages
- Easier for programming due to simple data sharing
- May not scale well due to increased bus traffic and synchronization needs
Distributed Memory
Each processor has its own local memory and address space; no global address space exists, and processors are connected through a network.
Key characteristics
- Changes in one processor's memory are not automatically reflected in others
- Programmers must explicitly define data communication between nodes
Advantages and disadvantages
- Highly scalable: adding processors increases both processing power and memory
- Cost-effective using commodity hardware
- Communication between nodes can be challenging to program
Process
- A process is an instance of a program executing on a computer
- It consists of the program's code, data, and state information
- Each process has its own independent address space in memory
- Computers can manage hundreds of active processes simultaneously
Threads
- Threads are smaller sub-elements of execution that exist within a process
- They represent independent paths of execution through a program
- Threads are the basic units managed by the operating system for execution
- Multiple threads within a process share the same address space and resources
Comparison of Processes and Threads
- Processes are isolated from each other, each with its own address space
- Threads within the same process can easily share resources and communicate
- Interprocess communication is possible but requires special mechanisms
- Threads are generally considered "lightweight" compared to processes
- Creating and terminating threads typically requires less overhead than processes
- Switching between threads of the same process is usually faster than switching between different processes
Choosing between Threads and Processes
- The choice depends on the specific task and operating environment
- Threads are generally recommended when possible due to their lightweight nature
- Implementing threads and processes can vary across operating systems and programming languages
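As a minimal sketch of the two APIs in Python (the worker function and labels here are purely illustrative), the same callable can be run either as a thread, sharing the parent's memory, or as a separate process with its own address space:
```python
import threading
import multiprocessing

def worker(label):
    # The same callable can be handed to either a thread or a process
    print(f"{label} running")

if __name__ == "__main__":
    t = threading.Thread(target=worker, args=("thread",))          # lightweight, shares memory
    p = multiprocessing.Process(target=worker, args=("process",))  # own interpreter and memory
    t.start()
    p.start()
    t.join()
    p.join()
```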
Inter-Process communication - IPC
Concurrency
Concurrency refers to the ability of a program to be broken into independent parts that can be executed out of order without affecting the final result. It focuses on how a program is structured and composed of independently executing processes. Concurrent execution does not necessarily mean parallel execution.
Concurrent Execution
- Concurrent processes overlap in time but may not execute simultaneously on a single processor
- Rapid task-switching can create an illusion of simultaneous execution, but it's not true parallelism
Parallel Hardware
- Multicore processors in desktop computers and mobile devices
- Graphics processing units (GPUs) with numerous specialized cores
- Computer clusters that distribute processing across multiple systems
Parallel Execution
- Parallel execution requires parallel hardware, such as multiple processors or cores
- Parallel execution allows multiple tasks to be performed simultaneously, potentially speeding up the overall process
Key Distinctions
- Concurrency is about program structure and dealing with multiple things at once
- Parallelism involves simultaneous execution and doing multiple things at once
- Concurrency enables potential parallel execution but doesn't guarantee it
Use Cases
- Concurrent programming is beneficial for I/O-dependent tasks like graphical user interfaces
- Parallel processing is particularly useful for computation-intensive tasks, such as matrix multiplication
Practical Applications
- Device drivers for I/O devices (mouse, keyboard, and hard drive) execute concurrently but may not benefit significantly from parallelism
- Graphical user interfaces use concurrency to maintain responsiveness during time-consuming operations
- Large mathematical operations can be divided into subparts for efficient parallel processing
GIL
Spawn vs Fork process
import multiprocessing as mp
def worker():
print("Worker running")
mp.Process(target=worker).start()
The above code can lead to infinite process spawning, but Python has a mechanism that raises a RuntimeError to prevent it from crashing the system.
On certain operating systems like Windows, this code will create an infinite process spawning loop that can quickly crash your system. On Unix-like systems (Linux, macOS), it will likely run without this issue.
This platform-dependent behavior is due to the different default methods used to create new processes.
The "Spawn" vs. "Fork" Problem
The multiprocessing library uses two main methods to create child processes:
- Spawn: This is the default method on Windows and macOS. The parent process starts a fresh Python interpreter process. This new process then imports the script that contains the target function. If the code that starts the process (mp.Process(...).start()) is not protected, the child process will re-run it upon import, leading to an infinite loop of new processes being created.
  - Parent starts ➡️ Spawns Child 1.
  - Child 1 imports script ➡️ Re-runs the spawning code ➡️ Spawns Child 2.
  - Child 2 imports script ➡️ Re-runs the spawning code ➡️ Spawns Child 3... and so on, forever. 💥
- Fork: This is the default method on Linux. The parent process essentially creates a copy of itself, including its memory state. The child process resumes execution right after the fork call. It doesn't re-import the entire script, so it doesn't re-execute the process-spawning code. This is why the code works on Linux without causing a loop.
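If you want consistent behavior across platforms, you can also request a start method explicitly instead of relying on the platform default. A minimal sketch using multiprocessing.get_context (note that "fork" is only available on Unix-like systems, so "spawn" is the portable choice):
```python
import multiprocessing as mp

def worker():
    print("Worker running")

if __name__ == "__main__":
    # Explicitly pick the start method instead of relying on the OS default.
    # "spawn" works everywhere; "fork" is Unix-only.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```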
The Solution: The if __name__ == "__main__" Guard
To make your multiprocessing code safe and cross-platform compatible, you must protect the "entry point" of your program. You do this by placing the code that starts the processes inside an if __name__ == "__main__": block.
When a Python script is run directly, the interpreter sets the special variable __name__ to "__main__". However, when the script is imported by another script (which is what happens in a spawned child process), __name__ is set to the module's name (i.e., the filename).
This guard ensures the mp.Process().start() call is only executed in the main parent process, not in the spawned children.
Corrected Code
import multiprocessing as mp
import time
def worker():
print("Worker running")
time.sleep(1) # Added a delay to make the output observable
# This is the crucial guard
if __name__ == "__main__":
p = mp.Process(target=worker)
p.start()
p.join() # It's good practice to wait for the process to finish
print("Main process finished")
By using this structure, you explicitly tell Python: "Only start a new process if this script is being run as the main program," effectively breaking the infinite recursion on systems that use the "spawn" method.
Python will raise a RuntimeError if we don't use the if guard when spawning a new child process, to prevent a process explosion from crashing the system.
The Problem: Recursive Spawning
On Windows, when you start a new process with mp.Process().start(), Python must launch a new interpreter and re-import your script to find the worker function.
Here's the chain of events causing the crash:
- Main Process Starts: Your script begins running.
- Enters Loop: It hits the for loop and calls p.start() to create the first child process.
- Child Process Imports Script: The new child process starts and, to do its job, it imports your entire script from the top.
- Child Process Enters Loop: Because the for loop is at the top level of the script (not inside the if __name__ == "__main__": guard), the child process also runs this loop.
- Child Tries to Spawn: The child process hits p.start() and tries to spawn its own child process.
The RuntimeError you see is a safety mechanism. Python detects that a child process is trying to start a new process before it has even finished setting itself up ("bootstrapping"), and it stops everything to prevent an infinite loop of processes that would quickly crash your computer.
The Solution: Use the Guard
By uncommenting the if __name__ == "__main__": line and indenting your process-starting code underneath it, you tell Python: "Only run this part of the code if the script is being executed directly."
When a child process imports the script, the code inside the guard is skipped, breaking the infinite loop.
Execution scheduling
Operating System Scheduling in Concurrent Computing
Computers manage multiple processes and threads competing for limited processor resources. The operating system's scheduler controls when different threads and processes execute on the CPU.
The Role of the Scheduler
- Process management
- The scheduler enables multiple programs to run concurrently on a single processor
- New processes are loaded into memory and placed in the "ready queue" when created
- The scheduler cycles through ready processes, allocating CPU time for each process to execute
- Multiprocessor handling
- For systems with multiple processors, the OS schedules processes across all available processors to maximize resource utilization
- Process execution states
- Processes may run until completion, get blocked waiting for I/O events, or be swapped out after using their allocated timeshare
Context Switch
- Save the state of a currently running process and load the state of another process
- Allows multiple processes to share CPU time
- Context switches are not instantaneous and incur overhead
- Schedulers must balance switch frequency with performance impact
Scheduling Algorithms
- Different operating systems use different scheduling algorithms based on their specific purposes and system requirements
- Some scheduling algorithms aim to maximize throughput, while others focus on minimizing latency for improved responsiveness
- Preemptive scheduling: Can interrupt low-priority processes for high-priority ones
- Non-preemptive scheduling: Allows processes to run for their full allotted time once started
Key Takeaways
- Scheduling details are typically handled by the OS "under the hood"
- Programmers should not assume a specific execution order or equal time allocation for threads or processes
- Programs should be written to function correctly regardless of scheduling decisions
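As a small illustration of the takeaway above, the interleaving of threads is decided by the scheduler, so the output order of this sketch can change from run to run:
```python
import threading

def count(name):
    # The OS scheduler decides when each thread runs, so the
    # interleaving of these prints differs between runs.
    for i in range(3):
        print(f"{name}: {i}")

threads = [threading.Thread(target=count, args=(f"T{n}",)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```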
Thread Lifecycle and States
- A new process begins with a single "main thread" that can spawn additional child threads
- Child threads execute independently within the same process and can spawn their own children
- Threads notify their parent upon completion, with the main thread usually finishing last
Thread States
Threads transition through four main states during their lifecycle:
NEW
- The NEW thread has been created but is not yet running
- It doesn't consume CPU resources at this stage
- The thread is assigned a function to execute
RUNNABLE
- The operating system can schedule the RUNNABLE thread to execute on a processor
- It may be swapped with other threads through context switches
BLOCKED
- The thread enters the BLOCKED state when waiting for an event or resource
- It doesn't use CPU resources while blocked
- The thread returns to the RUNNABLE state when the required condition is met
TERMINATED
- The thread enters the TERMINATED state upon completing execution or being abnormally aborted
- It notifies its parent thread before termination
Key Concepts
- Threads can create child threads to perform parallel tasks
- The join() method allows a parent thread to wait for a child thread to complete
- Different programming languages may use varying terminology for thread states
- Efficient thread management is crucial for concurrent and parallel computing
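A minimal sketch of these lifecycle ideas: is_alive() reports False before start() (NEW) and after join() (TERMINATED), and True while the child is runnable or blocked in sleep():
```python
import threading
import time

def child():
    time.sleep(0.5)  # BLOCKED while sleeping, then finishes

t = threading.Thread(target=child)
print("Before start, alive:", t.is_alive())  # NEW -> False
t.start()
print("After start, alive:", t.is_alive())   # RUNNABLE/BLOCKED -> True
t.join()                                     # parent waits for the child
print("After join, alive:", t.is_alive())    # TERMINATED -> False
```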
Data Races in Concurrent Programming
A data race occurs when two or more threads concurrently access the same memory location AND at least one of those threads is writing to that location to modify its value.
Anatomy of a data race
Even simple operations, like incrementing a numeric value, consist of multiple steps:
- Reading: A thread reads the existing value from the shared resource
- Modifying: The thread performs calculations or modifications based on the value
- Writing: The thread writes the new value back to the shared resource
Challenge of data races
Data races can be difficult to debug because:
- They depend on the unpredictable timing of thread scheduling
- They may occur intermittently, making the problem inconsistent and hard to reproduce
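You can see the read-modify-write split described above in CPython's bytecode with the standard dis module (the exact opcode names vary by Python version, but the separate load, add, and store steps are always there):
```python
import dis

counter = 0

def increment():
    global counter
    counter += 1  # looks like one step, but compiles to several

# A context switch can occur between any of the load, add, and
# store instructions shown in the disassembly below.
dis.dis(increment)
```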
Concurrent Access and Critical Sections
When multiple threads concurrently read and write to a shared resource, that can lead to incorrect behavior, such as a data race. This issue can be mitigated by identifying and protecting critical sections of code.
Critical sections
- A critical section is a part of a program that accesses a shared resource (for example, a data structure in memory)
- The program may not function correctly if multiple threads access the critical section simultaneously
- Critical sections need protection to ensure only one thread or process can execute that code section at a time
Mutex (mutual exclusion)
- A mutex, also known as a lock, is a mechanism to prevent simultaneous access to shared resources
- Only one thread can possess the mutex at a time, forcing threads to take turns accessing the shared resource
- The process involves acquiring the lock, executing the critical section, and then releasing the lock
Atomic operations
- Acquiring a lock is an atomic operation, meaning it executes as a single, indivisible action
- Atomic operations appear instantaneous to the rest of the system and are uninterruptible
Thread blocking
- Threads attempting to acquire a lock that's already held by another thread will block (wait) until it becomes available
Best Practices
- It's crucial to keep the code protected by a mutex as short as possible to prevent threads from getting stuck waiting
- Operations that don't require the shared resource should be performed outside the critical section to improve efficiency
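A minimal sketch of a mutex protecting a critical section in Python, using threading.Lock as a context manager so the lock is always released:
```python
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:        # acquire the mutex; released automatically on exit
            counter += 1  # critical section kept as short as possible

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000, because access is mutually exclusive
```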
Reentrant lock
Reentrant Mutexes
- A reentrant mutex is a type of mutex that can be locked multiple times by the same thread or process
- It keeps track internally of how many times it has been locked by the owning thread
- The reentrant mutex must be unlocked an equal number of times before another thread can acquire it
- Reentrant mutexes are also known as "recursive mutexes" or "recursive locks" in various programming languages
Reentrant Mutexes versus Standard Mutexes
- Standard mutexes can lead to deadlocks if a thread tries to lock a mutex it already owns
- Reentrant mutexes prevent this type of deadlock by allowing multiple locks by the same thread
Use Cases for Reentrant Mutexes
Nested function calls
- Reentrant mutexes are useful when functions that use locks are called within other locked sections of code
- This scenario can occur when retrofitting locks into existing code or when creating functions that use other locked functions
Recursive functions
- Reentrant mutexes are essential for recursive functions that need to lock resources
- As the function calls itself, it can lock the mutex multiple times and then unlock it an equal number of times as it unwinds
Advantages and Considerations
- Reentrant mutexes can simplify coding by reducing concerns about what has already been locked
- Reentrant mutexes can facilitate easier integration of locks into existing code structures; however, there is debate in the programming community about their use, with some arguing for code refactoring to avoid nested locks instead
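A minimal sketch of a reentrant mutex in Python using threading.RLock: the recursive function re-acquires the lock it already holds, which would deadlock with a plain Lock:
```python
import threading

rlock = threading.RLock()

def countdown(n):
    with rlock:               # the same thread may re-acquire an RLock
        if n > 0:
            countdown(n - 1)  # recursive call locks it one level deeper
        # each with-block releases one level as the recursion unwinds

t = threading.Thread(target=countdown, args=(5,))
t.start()
t.join()
print("done")
```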
Try-Lock
A “try-lock” is a non-blocking version of the standard lock or mutex used in multithreaded programming. Try-lock is useful when threads have multiple tasks, and constant blocking for lock acquisition is inefficient.
Acquiring a Try-Lock
- If the mutex is available, it gets locked, and the method returns true
- If the mutex is already held by another thread, it immediately returns false
- This allows threads to continue with other tasks rather than waiting for the lock
Advantages of Try-Lock
- Try-lock provides a way to attempt lock acquisition without forcing threads into a waiting state
- It prevents unnecessary blocking when threads have other useful tasks to perform; this is particularly useful in scenarios where threads have multiple independent tasks to perform
- It improves efficiency by allowing threads to switch to alternative tasks when a resource is unavailable, improving resource utilization
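In Python, the try-lock behavior is expressed as lock.acquire(blocking=False), which returns immediately with True or False. A minimal sketch where a thread does other work when the lock is busy:
```python
import threading
import time

lock = threading.Lock()

def worker(name):
    while True:
        if lock.acquire(blocking=False):  # try-lock: never waits
            try:
                print(f"{name} got the lock, doing protected work")
                time.sleep(0.2)
            finally:
                lock.release()
            break
        else:
            print(f"{name}: lock busy, doing other work instead")
            time.sleep(0.1)

threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```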
Reader-Writer Locks
readerwriterlock is an external package that we need to install in order to use reader-writer locks in Python.
The Problem with Basic Locks
Basic locks restrict access to one thread at a time, regardless of whether it's reading or writing. This approach is not always efficient, especially when many threads only need to read shared data.
Reader-writer locks (or shared mutexes) offer a more flexible solution for managing concurrent access. They can be locked in one of two modes:
- Shared "READ" mode: Allows multiple threads to read simultaneously
- Exclusive "WRITE" mode: Restricts access to one thread for writing
How Reader-Writer Locks Work
- Multiple threads can acquire a shared read lock concurrently
- Only one thread can acquire an exclusive write lock at a time
- When a thread holds a write lock, no other threads can acquire either a read or write lock
- Write locks cannot be acquired while any read locks are held
Use Cases and Considerations
- Reader-writer locks are beneficial when there are significantly more read operations than write operations
- They can improve performance in scenarios like certain types of database applications
- A potential downside of reader-writer locks is that they are more complex to implement and typically use more resources than standard mutexes
- The decision to use reader-writer locks should consider factors such as the ratio of read versus write operations, language-specific implementation details, and whether the lock gives preference to readers or writers
Best Practices
- Use reader-writer locks when there are many more reading threads than writing threads; however, if most threads are writing, a standard mutex may be more appropriate
- Consider the specific requirements and characteristics of your application when choosing between lock types
1. Definition
A Reader–Writer Lock (also called RWLock) is a synchronization primitive that allows:
- Multiple readers at the same time (concurrent read access).
- Only one writer at a time (exclusive write access).
- No reader can read while a writer is writing (prevents dirty reads).
It’s used when reads are more frequent than writes — so instead of locking the resource for every read (like a normal mutex), you let multiple readers work in parallel and only block when a writer comes in.
2. Why It’s Useful
- Prevents unnecessary blocking for read-only operations.
- Improves performance in read-heavy applications.
- Ensures data integrity when writes occur.
- Common in:
  - Databases
  - In-memory caches
  - File read/write operations
  - Shared memory in multiprocessing/multithreading
3. Reader–Writer Lock in Python
Python’s standard library does not have a built-in RWLock, but you can implement one yourself using threading.Lock and threading.Condition, or use third-party libraries such as:
- readerwriterlock (pip install readerwriterlock)
- pyreadwrite
- Or write your own class
3.1 Using the readerwriterlock Library
from readerwriterlock import rwlock
import threading
import time
# Create RW lock
lock = rwlock.RWLockFair() # Fair: avoids starvation of readers or writers
# Reader example
def reader_task(id):
with lock.gen_rlock():
print(f"Reader-{id} is reading")
time.sleep(1)
print(f"Reader-{id} done")
# Writer example
def writer_task(id):
with lock.gen_wlock():
print(f"Writer-{id} is writing")
time.sleep(2)
print(f"Writer-{id} done")
# Create threads
threads = []
for i in range(3):
threads.append(threading.Thread(target=reader_task, args=(i,)))
threads.append(threading.Thread(target=writer_task, args=(1,)))
threads.append(threading.Thread(target=reader_task, args=(99,)))
# Start and join
for t in threads:
t.start()
for t in threads:
t.join()
Key Points:
- gen_rlock() → Generates a read lock context manager.
- gen_wlock() → Generates a write lock context manager.
- Multiple rlocks can be held at the same time, but a wlock blocks all others.
4. Available Methods in readerwriterlock
Using rwlock.RWLockFair() (fair scheduling for readers & writers):
Method | Purpose |
---|---|
gen_rlock() | Returns a read lock (context manager) |
gen_wlock() | Returns a write lock (context manager) |
acquire() | Manually acquire the lock |
release() | Manually release the lock |
locked() | Check if lock is currently held |
Example with manual acquire/release:
read_lock = lock.gen_rlock()
read_lock.acquire()
# Read-only operations
read_lock.release()
5. When to Use Reader–Writer Lock
✅ Use it when:
- You have frequent reads and infrequent writes.
- Multiple threads/processes need safe read access.
- You want higher concurrency than a normal mutex.
❌ Avoid when:
- Writes are frequent (can cause reader starvation without fairness policy).
- Simplicity is more important than performance — then a normal threading.Lock might be enough.
6. Real-World Example
Imagine you have a shared cache dictionary that multiple threads read from but occasionally update:
cache = {}
lock = rwlock.RWLockFair()
def read_cache(key):
with lock.gen_rlock():
return cache.get(key)
def write_cache(key, value):
with lock.gen_wlock():
cache[key] = value
This allows many readers at once, but ensures writers get exclusive access when updating.
Abandoned Lock: A New Form of Deadlock
- An abandoned lock occurs when a thread acquires a lock on a shared resource and exits before releasing its lock on that resource, leaving other threads trying to acquire the lock to wait indefinitely
- When a thread or process acquires a lock and then terminates unexpectedly, it may not release the lock automatically
- Deadlocks can occur not only through resource competition but also through unexpected thread termination
1. What is an Abandoned Lock?
An abandoned lock occurs when:
- A thread acquires a lock but never releases it — usually because:
  - The thread exits unexpectedly
  - An exception occurs before the release() call
  - The programmer simply forgot to release it
Problem: Other threads that need the same lock will wait forever (deadlock situation).
2. Why it Happens in Python
In Python threading:
- If a thread dies without releasing a lock, the lock stays acquired until the process ends.
- Locks are not automatically released when a thread terminates.
- The threading.Lock object does not know whether the owner is still alive — it just sees “locked” and blocks new acquisitions.
3. Example of an Abandoned Lock
import threading
import time
lock = threading.Lock()
def worker():
lock.acquire()
print("Lock acquired by worker")
# Simulating an exception that prevents release
raise RuntimeError("Oops! Something went wrong")
lock.release() # This never runs
t1 = threading.Thread(target=worker)
t1.start()
t1.join()
# Now main thread tries to acquire it
print("Main thread trying to acquire...")
lock.acquire() # This will BLOCK forever
print("Main thread acquired lock")
What happens here:
- t1 acquires the lock but crashes before releasing.
- The lock remains locked forever → main thread gets stuck.
4. How to Avoid Abandoned Locks
Best Practices:
a) Use a with Statement (Context Manager)
Always acquire/release in a with block — Python will release it even if an exception occurs.
import threading
lock = threading.Lock()
def worker():
with lock:
print("Lock acquired safely")
raise RuntimeError("Error happens but lock is released automatically")
threading.Thread(target=worker).start()
b) Use try...finally
If you can’t use with, wrap it in try/finally:
lock.acquire()
try:
# Critical section
pass
finally:
lock.release()
c) Use Timeout on Acquire
Avoid indefinite blocking:
if lock.acquire(timeout=5):
try:
print("Got the lock")
finally:
lock.release()
else:
print("Could not acquire lock in time")
d) Avoid Complex Logic Inside Locks
- Keep the locked section as short as possible.
- Do I/O outside the locked section when possible.
5. Key Points
- Abandoned locks = potential deadlocks
- Python won’t clean up locks for you if the owning thread dies unexpectedly
- Always use with lock: or try...finally
- Timeouts can prevent infinite waits
Starvation
- Starvation occurs when a process or thread is perpetually denied access to the resources it needs, preventing it from progressing
- Ideally, threads will take turns accessing shared resources, but this is not guaranteed due to how operating systems schedule thread execution
- A "greedy" thread frequently holding a lock on a shared resource can lead to other threads being starved out of being able to make progress
- In simple scenarios with a few equally prioritized threads, starvation is less likely to be a concern; however, thread priorities can significantly impact the likelihood of starvation: higher-priority threads are generally scheduled to execute more often, and lower-priority threads may struggle to gain access to resources
Implications
- Starvation can significantly impact system performance and fairness in resource allocation
- Designers of concurrent systems need to consider thread priorities and the number of concurrent threads to prevent starvation
- While occasional delays in resource access may be tolerable, persistent starvation can severely hamper the functionality of affected threads
Why Does Starvation Happen?
Starvation arises due to several factors:
- Lock Priority: If a lock always gives preference to one set of threads (e.g., readers), other threads (e.g., writers) may never get access, leading to starvation.
- Thread Priority: High-priority threads can monopolize the CPU or resource, starving lower-priority ones.
- Unfair Scheduling: Some thread schedulers in the OS or Python runtime can consistently favor certain threads.
- Greedy Threads: Threads that hold resources for too long or repeatedly reacquire them can starve other threads.
- Infinite Loops/Blocking: Threads that never release a resource can prevent others from accessing it.
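A toy sketch of the "greedy thread" scenario described above (deliberately exaggerated: the greedy worker holds the lock for its entire run, so the other thread makes little or no progress during that window):
```python
import threading
import time

lock = threading.Lock()
progress = {"greedy": 0, "starved": 0}

def greedy_worker():
    # Acquires the lock once and keeps it for the whole run
    with lock:
        end = time.time() + 2
        while time.time() < end:
            progress["greedy"] += 1
            time.sleep(0.01)

def starved_worker():
    # Wants frequent, short accesses, but cannot get in while the lock is held
    end = time.time() + 2
    while time.time() < end:
        if lock.acquire(timeout=0.1):
            try:
                progress["starved"] += 1
            finally:
                lock.release()

t1 = threading.Thread(target=greedy_worker)
t2 = threading.Thread(target=starved_worker)
t1.start(); t2.start()
t1.join(); t2.join()
print(progress)  # the starved thread gets few or no turns while the lock is hogged
```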
Livelock
- Livelock is a situation in concurrent computing where two or more threads block each other from making progress
- Unlike deadlock, threads in livelock are actively trying to resolve the problem but fail to make progress
1. What is Livelock?
A livelock is like a deadlock, except the threads aren’t stuck — they’re actively doing something, but still making no real progress.
Analogy: Two people trying to pass each other in a narrow hallway:
- Person A moves left to make space.
- Person B moves left too (same idea).
- Then both move right.
- They keep dancing around, never getting past each other — that’s a livelock.
2. Key Difference from Deadlock
- Deadlock → Threads are stuck waiting forever, no movement at all.
- Livelock → Threads are busy, constantly reacting to each other, but never making progress.
3. Python Livelock Example
Here’s a simulation of two threads that keep releasing and reacquiring a lock in a way that prevents either from completing:
import threading
import time
lock_a = threading.Lock()
lock_b = threading.Lock()
def worker1():
while True:
with lock_a:
print("Worker1 acquired Lock A, trying for Lock B")
time.sleep(0.1)
if lock_b.acquire(blocking=False):
print("Worker1 acquired Lock B - work done")
lock_b.release()
break
else:
print("Worker1 couldn’t get Lock B, releasing Lock A and retrying")
time.sleep(0.1) # Gives chance to other thread
def worker2():
while True:
with lock_b:
print("Worker2 acquired Lock B, trying for Lock A")
time.sleep(0.1)
if lock_a.acquire(blocking=False):
print("Worker2 acquired Lock A - work done")
lock_a.release()
break
else:
print("Worker2 couldn’t get Lock A, releasing Lock B and retrying")
time.sleep(0.1) # Gives chance to other thread
t1 = threading.Thread(target=worker1)
t2 = threading.Thread(target=worker2)
t1.start()
t2.start()
t1.join()
t2.join()
What Happens Here
- Worker1 grabs Lock A, Worker2 grabs Lock B.
- Each tries to acquire the other lock without blocking.
- If unsuccessful, they release their lock and retry.
- Both keep reacting to each other’s moves and releasing locks — but neither progresses → Livelock.
4. How to Avoid Livelock
- Set backoff/wait times differently so they don’t always retry in sync:
time.sleep(random.uniform(0.05, 0.15))
- Introduce priority — let one thread keep its lock until done.
- Use a single lock for both resources when possible.
- Use try-lock + order — always acquire locks in a consistent order.
5. Key Points
- Livelock wastes CPU because threads keep running without progress.
- It’s often caused by overly polite or overly reactive retry logic.
- Random delays, consistent ordering, or priority assignment usually fix it.
Difference Between Deadlock, Livelock, and Starvation
Condition Variables
- Condition variables serve as a queue for threads waiting for a specific condition to occur
- They work in conjunction with mutexes to implement a higher-level construct called a monitor
Limitations of Locks and Mutexes
A lock or mutex restricts multiple threads from accessing a shared resource simultaneously; however, it doesn't provide a way for threads to signal each other or synchronize actions efficiently
Condition Variable Operations
There are three main operations associated with condition variables:
- Wait: Releases the mutex lock and puts the thread to sleep in a queue
- Signal (or notify/wake): Wakes up a single thread from the waiting queue
- Broadcast (or notifyAll/wakeAll): Wakes up all threads in the waiting queue
1. What is a Condition Variable?
A Condition Variable in Python (threading.Condition) is a synchronization primitive that:
- Allows threads to wait until some condition is met.
- Works with an underlying lock to protect shared state.
- Lets one thread signal other waiting threads when the condition changes.
Think of it as a “doorbell”:
- Many threads can wait until the “doorbell” is rung.
- Another thread rings the bell to tell them the state has changed.
2. Why Use Condition Variables?
They are useful when:
- You have producer–consumer type problems.
- One thread needs to wait for another to prepare data.
- You need to coordinate between threads beyond just locking.
3. How It Works
A Condition object has:
- An internal lock (RLock by default).
- Methods:
  - wait() → Release the lock and block until notified.
  - wait_for(predicate) → Wait until a function returns True.
  - notify() → Wake one waiting thread.
  - notify_all() → Wake all waiting threads.
  - acquire()/release() → Manage the underlying lock (usually via with).
4. Basic Example
Producer–Consumer scenario:
import threading
import time
condition = threading.Condition()
shared_data = None
def consumer():
global shared_data
with condition: # Acquire condition's lock
print("Consumer waiting for data...")
# here we can also add our own condition for the wait method to call. but for this simple usecase its not required.
# we can have if shared_data is empty then wait. else consume.
condition.wait() # Wait until notified
print(f"Consumer got data: {shared_data}")
def producer():
global shared_data
time.sleep(2) # Simulate work
with condition:
shared_data = "Hello from producer"
print("Producer setting data and notifying")
condition.notify() # Wake one waiting thread
t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start()
t2.start()
t1.join()
t2.join()
Flow:
- Consumer locks the condition, calls wait() → releases the lock and waits.
- Producer locks the condition, updates shared_data, calls notify() → wakes the consumer.
- Consumer reacquires the lock and resumes work.
5. wait_for Example
Instead of manually looping and checking, you can do:
import threading
import time
condition = threading.Condition()
data_ready = False
def waiter():
with condition:
condition.wait_for(lambda: data_ready)
print("Data is ready!")
def setter():
global data_ready
time.sleep(2)
with condition:
data_ready = True
condition.notify_all()
t1 = threading.Thread(target=waiter)
t2 = threading.Thread(target=setter)
t1.start()
t2.start()
t1.join()
t2.join()
6. Key Points
- Always acquire the condition’s lock before calling wait() or notify().
- wait() temporarily releases the lock so other threads can update the condition.
- notify() wakes one waiting thread; notify_all() wakes all.
- Use wait_for(predicate) to avoid manual while not condition: loops.
- Without proper signaling, waiting threads can wait forever.
Producer-Consumer Pattern - Threads
1. What is the Producer–Consumer Pattern?
It’s a multi-threading design pattern used when:
- Producers generate data and place it into a shared buffer.
- Consumers take data from that buffer and process it.
- Both run independently, but they must be synchronized to avoid race conditions.
Real-world analogy:
- A factory has workers (producers) making products.
- Delivery trucks (consumers) take products to customers.
- There’s a warehouse (shared buffer) between them.
- If the warehouse is full, workers wait.
- If the warehouse is empty, trucks wait.
2. Why Do We Need It?
Without proper synchronization:
- Producer could overwrite data before it’s read.
- Consumer could read empty data.
- Multiple producers/consumers could corrupt shared memory.
3. Key Concepts
- Shared Buffer → Queue (FIFO storage).
- Mutual Exclusion → Prevent multiple threads from changing the buffer at the same time (Locks).
- Signaling → Wake up waiting threads when data/space is available (Condition variables or Semaphores).
4. Producer–Consumer in Python with queue.Queue
(queue.Queue is thread-safe and handles locking/signaling for you.)
import threading
import queue
import time
import random
buffer = queue.Queue(maxsize=5) # Shared buffer
def producer():
for i in range(10):
item = f"Item-{i}"
buffer.put(item) # Waits if buffer is full
print(f"Produced {item}")
time.sleep(random.uniform(0.2, 0.5))
def consumer():
for _ in range(10):
item = buffer.get() # Waits if buffer is empty
print(f"Consumed {item}")
buffer.task_done()
time.sleep(random.uniform(0.3, 0.6))
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
What happens here:
- queue.Queue automatically uses locks + conditions.
- If maxsize is reached, put() blocks until there’s space.
- If the queue is empty, get() blocks until there’s an item.
5. Producer–Consumer Using Condition Variables (Manual Implementation)
If we don’t use queue.Queue, we can implement it ourselves:
import threading
import time
import random
buffer = []
MAX_SIZE = 5
condition = threading.Condition()
def producer():
for i in range(10):
with condition:
while len(buffer) == MAX_SIZE:
condition.wait() # Wait until there’s space
item = f"Item-{i}"
buffer.append(item)
print(f"Produced {item}")
condition.notify() # Notify a waiting consumer
time.sleep(random.uniform(0.2, 0.5))
def consumer():
for _ in range(10):
with condition:
while not buffer:
condition.wait() # Wait until there’s data
item = buffer.pop(0)
print(f"Consumed {item}")
condition.notify() # Notify a waiting producer
time.sleep(random.uniform(0.3, 0.6))
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
Here:
- condition.wait() releases the lock and waits.
- condition.notify() wakes a waiting thread.
- We manually check while len(buffer) == MAX_SIZE or while not buffer to avoid spurious wakeups.
6. Benefits
- Decouples data production from consumption.
- Improves throughput by allowing concurrent execution.
- Handles different speeds — producers can work ahead, consumers can catch up.
7. Common Pitfalls
- Deadlock if neither side signals properly.
- Starvation if one side always gets priority.
- Memory issues if no limit (maxsize) is set.
Graceful shutdown problem in the Producer–Consumer pattern
If the producer stops producing but the consumer keeps calling get() (or waiting on a condition), the consumer will wait forever unless we signal it to stop.
Why it happens
- queue.get() or condition.wait() blocks until an item arrives.
- If no new items arrive and no one signals "we’re done", the consumer never wakes up.
Common Solutions
1. Use a Sentinel Value
- Special "poison pill" item placed in the queue to tell consumers to stop.
- Consumer checks for it and exits instead of processing.
Example with queue.Queue:
import threading
import queue
import time
SENTINEL = None
buffer = queue.Queue()
def producer():
for i in range(5):
buffer.put(f"Item-{i}")
time.sleep(0.2)
buffer.put(SENTINEL) # Signal no more items
def consumer():
while True:
item = buffer.get()
if item is SENTINEL:
buffer.task_done()
break
print(f"Consumed {item}")
buffer.task_done()
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
2. Use a Shared “Stop” Flag
- Have a shared variable like stop_requested.
- Consumer checks this in a loop along with condition.wait().
Example:
import threading
import time
buffer = []
stop_requested = False
condition = threading.Condition()
def producer():
global stop_requested
for i in range(5):
with condition:
buffer.append(f"Item-{i}")
condition.notify()
time.sleep(0.2)
with condition:
stop_requested = True
condition.notify_all() # Wake all waiting consumers
def consumer():
while True:
with condition:
while not buffer and not stop_requested:
condition.wait()
if stop_requested and not buffer:
break
item = buffer.pop(0)
print(f"Consumed {item}")
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
3. Timeout-based Waiting
- Use queue.get(timeout=...) or condition.wait(timeout=...) to periodically check and break out.
Example:
item = buffer.get(timeout=2)
If the queue is empty for 2 seconds, a queue.Empty exception is raised, and you can handle shutdown.
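A minimal sketch of a timeout-based consumer loop built on that idea; the producer_done flag is a threading.Event added here purely for illustration (it is not part of the original example):
```python
import queue
import threading
import time

buffer = queue.Queue()
producer_done = threading.Event()  # illustrative shutdown flag

def producer():
    for i in range(3):
        buffer.put(f"Item-{i}")
        time.sleep(0.5)
    producer_done.set()            # signal that no more items will arrive

def consumer():
    while True:
        try:
            item = buffer.get(timeout=2)   # wake up periodically
        except queue.Empty:
            if producer_done.is_set():     # queue drained and producer finished
                break
            continue                       # just a quiet period, keep waiting
        print(f"Consumed {item}")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```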
✅ Best practice: For a clean shutdown in multi-threaded producer–consumer, the sentinel approach is the most common, especially when you have multiple consumers — you send one sentinel per consumer so each gets the stop signal.
Producer and Consumer Pattern - Process
1. What is the Producer–Consumer Pattern?
- Producer: generates data (tasks, items, etc.) and puts them into a shared buffer/queue.
- Consumer: takes data from the shared buffer and processes it.
- Goal: Decouple data production from data consumption so they can run in parallel.
In multiprocessing, producers and consumers run in separate processes and communicate via inter-process communication (IPC) — most commonly, a multiprocessing.Queue.
2. Why Multiprocessing Instead of Threading?
- Threading is limited by Python’s GIL (Global Interpreter Lock), which prevents true parallel execution of CPU-bound code.
- Multiprocessing spawns separate processes, each with its own Python interpreter and memory space → true parallelism.
3. Multiprocessing Producer–Consumer Example
Using multiprocessing.Queue:
import multiprocessing
import time
import random
SENTINEL = None # Stop signal
def producer(queue, producer_id):
for i in range(5):
item = f"Producer-{producer_id} Item-{i}"
print(f"[Producer-{producer_id}] Producing {item}")
queue.put(item)
time.sleep(random.uniform(0.2, 0.5))
queue.put(SENTINEL) # Send stop signal
def consumer(queue, consumer_id):
while True:
item = queue.get()
if item is SENTINEL:
print(f"[Consumer-{consumer_id}] Stopping")
queue.put(SENTINEL) # Pass sentinel to other consumers
break
print(f"[Consumer-{consumer_id}] Consumed {item}")
time.sleep(random.uniform(0.3, 0.6))
if __name__ == "__main__":
queue = multiprocessing.Queue()
# Multiple producers
producers = [multiprocessing.Process(target=producer, args=(queue, i)) for i in range(2)]
consumers = [multiprocessing.Process(target=consumer, args=(queue, i)) for i in range(2)]
for p in producers:
p.start()
for c in consumers:
c.start()
for p in producers:
p.join()
for c in consumers:
c.join()
4. How This Works
- Queue is the shared buffer between processes.
- Producers create items and put them in the queue.
- Consumers take items from the queue and process them.
- Sentinel value (None) tells consumers to stop.
5. Why Use a Sentinel in Multiprocessing?
In multiprocessing, consumers block indefinitely on queue.get() if no items arrive.
To stop cleanly:
- Send one sentinel per consumer so each process knows when to exit.
- Or, as in the above example, pass the sentinel along until all consumers receive it.
6. Variations
- Multiple producers and multiple consumers — requires more sentinels.
- Joinable queues (multiprocessing.JoinableQueue) to ensure task completion tracking.
- Process pools with concurrent.futures.ProcessPoolExecutor — easier for simpler patterns.
Semaphore
A semaphore is a synchronization primitive that controls access to a shared resource by using a counter.
- When a thread/process acquires the semaphore → counter is decremented.
- When it releases the semaphore → counter is incremented.
- If the counter is 0, further acquires will block until it’s released.
Why Use Semaphores?
- To limit concurrent access to a resource (e.g., max 3 threads can access at once).
- To implement signaling between threads or processes.
- To avoid race conditions.
- Threads can acquire a semaphore when its count is positive, decrementing the counter
- If the counter reaches zero, threads attempting to acquire the semaphore are blocked and queued
- Threads release the semaphore when done, incrementing the counter and potentially signaling waiting threads
Types of Semaphores
Counting Semaphore
- Counter can be any positive integer.
- Controls how many threads/processes can access a resource at the same time.
- Example: A database connection pool with max 5 concurrent connections.
Binary Semaphore
- Special case of a counting semaphore where the counter can only be 0 or 1.
- Works like a gate — either locked (0) or unlocked (1).
- Similar to a mutex, but:
  - Not owned: Any thread can release it, not just the one that acquired it.
  - Often used for signaling rather than strict ownership.
Key Methods
Both threading.Semaphore and multiprocessing.Semaphore support:
- acquire(blocking=True, timeout=None) → Decrease counter; block if zero.
- release() → Increase counter; wake up a waiting thread/process.
- with semaphore: → Context manager for acquire/release.
Threading Example – Counting Semaphore
import threading
import time
sem = threading.Semaphore(2) # At most 2 threads can enter
def task(name):
print(f"{name} waiting...")
with sem:
print(f"{name} entered")
time.sleep(2)
print(f"{name} leaving")
threads = [threading.Thread(target=task, args=(f"Thread-{i}",)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
Behavior: Only 2 threads run the critical section at the same time.
Threading Example – Binary Semaphore
import threading
import time
binary_sem = threading.Semaphore(1) # Binary semaphore
def critical_section(name):
print(f"{name} waiting...")
with binary_sem:
print(f"{name} entered")
time.sleep(2)
print(f"{name} leaving")
t1 = threading.Thread(target=critical_section, args=("Thread-1",))
t2 = threading.Thread(target=critical_section, args=("Thread-2",))
t1.start()
t2.start()
t1.join()
t2.join()
Behavior: Only 1 thread can run at a time — similar to a mutex, but no ownership rules.
Multiprocessing Example – Counting Semaphore
import multiprocessing
import time
def task(sem, pid):
print(f"Process-{pid} waiting...")
with sem:
print(f"Process-{pid} entered")
time.sleep(2)
print(f"Process-{pid} leaving")
if __name__ == "__main__":
sem = multiprocessing.Semaphore(2) # Allow 2 processes
processes = []
for i in range(4):
p = multiprocessing.Process(target=task, args=(sem, i))
processes.append(p)
p.start()
for p in processes:
p.join()
Multiprocessing Example – Binary Semaphore
import multiprocessing
import time
def critical_task(sem, pid):
print(f"Process-{pid} waiting...")
with sem:
print(f"Process-{pid} entered")
time.sleep(2)
print(f"Process-{pid} leaving")
if __name__ == "__main__":
binary_sem = multiprocessing.Semaphore(1) # Binary semaphore
processes = []
for i in range(2):
p = multiprocessing.Process(target=critical_task, args=(binary_sem, i))
processes.append(p)
p.start()
for p in processes:
p.join()
When to Use
- Counting Semaphore → When you want to limit concurrent access (e.g., allow up to N threads).
- Binary Semaphore → When you just need a single permit (e.g., turnstile or signaling).
- Mutex/Lock → When you want strict thread/process ownership rules.
Mutex vs Semaphore
Binary Semaphore vs Mutex
Feature | Binary Semaphore | Mutex (Lock) |
---|---|---|
Ownership | No ownership (any thread/process can release) | Owned by the thread that acquires |
Value Range | 0 or 1 | N/A (locked/unlocked) |
Usage | Signaling or simple mutual exclusion | Strict mutual exclusion |
Process Safe | Yes | Yes |
Race condition vs Data Race
Race Condition
Definition: A race condition happens when the program’s correctness depends on the timing or interleaving of threads/processes. If execution order changes, the output changes — often leading to bugs.
Race conditions are flaws in the timing or order of a program's execution that cause incorrect behavior. They are often more difficult to detect and prevent than data races.
Key Points:
- Caused by unsynchronized access to shared resources.
- Can happen even without simultaneous access (e.g., bad logic relying on timing).
- Happens in multi-threaded or multi-process environments.
Example (Python, threading):
import threading
counter = 0
def increment():
global counter
for _ in range(1000000):
counter += 1
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(counter) # Expected 2,000,000 but often less
Why? Both threads read the old value and write back at the same time → lost updates.
Data Race
Definition: A data race is a specific type of race condition where:
Data races occur when two or more threads concurrently access the same memory location, with at least one thread writing to or changing that memory value. This can lead to threads overwriting each other or reading incorrect values. Data races can be detected using automated tools and prevented by ensuring mutual exclusion for shared resources
- Two or more threads access the same variable concurrently.
- At least one of the accesses is a write.
- There is no synchronization (e.g., no locks).
Key Points:
- Always involves shared data.
- Always causes undefined behavior (in languages like C++); in Python it causes logical errors.
- Data race is a race condition, but not all race conditions are data races.
Example (data race in Python):
import time
import threading
value = 0
def increment():
global value
for _ in range(100000):
val = value # read
val += 1 # modify
time.sleep(0.000001) # force context switch
value = val # write
if __name__ == "__main__":
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print("Expected:", 2 * 100000)
print("Actual:", value)
Here:
- Writer is writing.
- Reader is reading at the same time.
- No locks → data race.
Data race in Multiprocessing
Data races can occur in multiprocessing, but the situation is a bit different from threads.
Why multiprocessing is different
- In threading, threads share the same memory space, so a race happens naturally if multiple threads touch the same variable without synchronization.
- In multiprocessing, each process has its own separate memory space by default, so normal Python variables are not shared — which means you can’t have a data race on a normal variable between processes.
When data races can happen in multiprocessing
A race can occur if you explicitly share memory between processes, for example:
- Using multiprocessing.Value or multiprocessing.Array (shared C-level memory).
- Using multiprocessing.shared_memory (Python 3.8+).
- Using memory-mapped files (mmap).
- Using some database or file without locking.
In these cases, two processes can modify the same memory concurrently, causing lost updates just like in threading.
Data race with multiprocessing.Value
from multiprocessing import Process, Value
import time
def increment(shared_counter):
for _ in range(100000):
val = shared_counter.value # read
val += 1 # modify
time.sleep(0.000001) # force context switch
shared_counter.value = val # write
if __name__ == "__main__":
counter = Value('i', 0) # 'i' = signed int, shared memory
p1 = Process(target=increment, args=(counter,))
p2 = Process(target=increment, args=(counter,))
p1.start()
p2.start()
p1.join()
p2.join()
print("Expected:", 2 * 100000)
print("Actual:", counter.value)
Possible output:
Expected: 200000
Actual: 134527
How to avoid multiprocessing data races
- Use a multiprocessing.Lock:
def increment(shared_counter, lock):
for _ in range(100000):
with lock:
shared_counter.value += 1
- Or use multiprocessing.Manager() objects, which handle synchronization internally.
- Or design processes to pass messages via multiprocessing.Queue instead of sharing memory directly.
Race Condition vs Data Race
Data races and race conditions are different problems that are often confused due to their similar names. It's possible to have data races without race conditions and vice versa. Many race conditions are caused by data races, and many data races lead to race conditions, but they are not dependent on each other.
Aspect | Race Condition | Data Race |
---|---|---|
Definition | Any bug where outcome depends on timing/order of execution | A race condition involving unsynchronized read/write to the same memory |
Involves shared data? | Not necessarily | Always |
Requires write? | Not always | Yes, at least one write |
Fix method | Synchronization, ordering guarantees, atomic operations | Locks, atomic operations, thread-safe structures |
How to Avoid Both
- Locks / Mutexes – Ensure only one thread can access shared data at a time.
- Atomic Operations – Use atomic counters or thread-safe primitives.
- Immutable Data – No write access, only reads.
- Queues / Channels – Pass data between threads instead of sharing.
- Proper Design – Avoid depending on execution order when possible.
Barrier
A barrier is a synchronization primitive used in threading and multiprocessing to coordinate multiple threads or processes so they all wait at a certain point in the program until every participant has reached that point.
A barrier is a synchronization mechanism for thread coordination
It serves as a stopping point for a group of threads, preventing them from proceeding until all (or a sufficient number of) threads have reached the barrier
How a Barrier Works
- Suppose you have several threads or processes working in parallel.
- At some stage, they must all complete a certain part of work before any can move forward.
- Each thread/process calls .wait() on the barrier when it reaches this point.
- The barrier “blocks” each participant when they call .wait(), until all required participants have called .wait().
- Once all participants have arrived, the barrier is lifted simultaneously, and every thread/process can proceed with the next stage of work.
Why Do We Need Barriers?
- Synchronization Across Participants: Barriers ensure all threads or processes reach a specific stage before any move ahead. This is essential when the next steps depend on everyone finishing their current phase (e.g., parallel computations that need to be combined).
- Coordination of Phased Work: In algorithms where work is structured in steps or phases, barriers let you cleanly transition to the next step only after all have completed the previous one.
- Simplifies Code: Rather than manually tracking how many threads have finished and using multiple locks or conditions, barriers provide a simple, reusable object to synchronize groups of workers.
Example Use Cases
-
Combining results: Multiple threads process parts of a dataset. The main thread combines results only after everyone is finished.
-
Phase synchronization: Multiple stages in a complex algorithm, where all workers must align between stages (e.g., parallel matrix computations, distributed simulation time steps).
-
Initialization: Ensure all workers are ready before starting the main task.
Example
Suppose we have two kinds of threads: one doubles a shared number, and the others add 3 to it. A lock prevents the data race on the shared variable, but the result is still inconsistent because it depends on which thread happens to run first (a timing-based race condition).
To overcome this timing-related race condition we can use a barrier, and we need to decide where to place it.
Flow without a barrier: the result depends on whether the doubler runs before or after the adders.
Desired flow: all the threads that add 3 to the variable execute first, and only then does the thread that doubles the value run.
Code
Where we place barrier.wait() determines which threads wait and which continue. With the barrier placed as described, all of the "add 3" threads run first; the doubler thread waits at the barrier right at its beginning and only executes once every adder has arrived. A minimal sketch of this arrangement follows.
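The original code for this example is not reproduced here, but a sketch of the idea, assuming two adder threads and one doubler thread, might look like this:
```python
import threading

value = 0
value_lock = threading.Lock()      # prevents the data race on `value`
barrier = threading.Barrier(3)     # 2 adder threads + 1 doubler thread

def add_three():
    global value
    with value_lock:
        value += 3
    barrier.wait()                 # signal "my addition is done"

def double():
    global value
    barrier.wait()                 # wait until every adder has finished
    with value_lock:
        value *= 2

threads = [threading.Thread(target=add_three),
           threading.Thread(target=add_three),
           threading.Thread(target=double)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(value)                       # always (0 + 3 + 3) * 2 = 12
```
Because the doubler calls `barrier.wait()` before touching the shared value, the barrier guarantees the doubling always happens last, making the result deterministic.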
Barrier in threading
- Implemented by `threading.Barrier`.
- Useful when you have a fixed number of threads that must all reach a certain stage before continuing.
- The barrier resets automatically after all participants have passed through.
Example (threading.Barrier)
```python
import threading
import time
import random

def worker(barrier, thread_id):
    print(f"Thread {thread_id} is working...")
    time.sleep(random.uniform(0.5, 2))  # simulate work
    print(f"Thread {thread_id} waiting at barrier")
    barrier.wait()                      # wait until all threads reach here
    print(f"Thread {thread_id} passed the barrier")

# Barrier for 3 threads
barrier = threading.Barrier(3)
threads = [threading.Thread(target=worker, args=(barrier, i)) for i in range(3)]

for t in threads:
    t.start()
for t in threads:
    t.join()
```
Output (order may vary):
```
Thread 0 is working...
Thread 1 is working...
Thread 2 is working...
Thread 1 waiting at barrier
Thread 0 waiting at barrier
Thread 2 waiting at barrier
Thread 2 passed the barrier
Thread 0 passed the barrier
Thread 1 passed the barrier
```
All threads start work, but none pass the barrier until all 3 reach it.
Barrier in multiprocessing
```python
from multiprocessing import Process, Barrier
import time, random

def worker(barrier, pid):
    print(f"Process {pid} working...")
    time.sleep(random.uniform(0.5, 2))
    print(f"Process {pid} waiting at barrier")
    barrier.wait()
    print(f"Process {pid} passed the barrier")

if __name__ == "__main__":
    # multiprocessing.Barrier has the same API as threading.Barrier
    # (available since Python 3.3)
    barrier = Barrier(3)
    processes = [Process(target=worker, args=(barrier, i)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```
Barriers help prevent race conditions by forcing multiple threads or processes to synchronize at a specific point in execution, ensuring all participants have reached a safe state before any advance further.
How Barriers Prevent Race Conditions
- In concurrent programs, a race condition occurs when multiple threads/processes access and modify shared resources unpredictably, leading to incorrect or inconsistent results.
- A barrier makes every worker stop and wait until all expected workers reach the barrier.
- This ensures that shared data or resources have been fully prepared or processed before moving to the next stage—no thread/process can go ahead and use incomplete/unstable data.
Example Scenario
Suppose each thread/process needs to initialize part of a shared array before all threads/processes start computation:
- Each thread initializes its assigned section.
- All threads/processes reach a barrier.
- Only when all are finished with initialization does anyone start computation, guaranteeing every array section is ready.
If threads are allowed to advance independently, a thread could start computation with uninitialized data from others, causing a data race.
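A minimal sketch of that scenario with `threading.Barrier` (the array contents and worker count are illustrative):
```python
import threading

NUM_WORKERS = 4
data = [None] * NUM_WORKERS
barrier = threading.Barrier(NUM_WORKERS)

def worker(index):
    data[index] = index * 10          # initialize only my own section
    barrier.wait()                    # wait until every section is initialized
    total = sum(data)                 # safe: no entry is still None
    print(f"Worker {index} sees total {total}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
Each worker writes only its own slot before the barrier and reads the whole array only after it, so no thread ever observes a partially initialized array.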
Why Use Barriers for Race Condition Prevention?
- Coordinates stages: Ensures prerequisite work by all parties is complete before progressing.
- Avoids accessing shared resources in inconsistent states.
- Eliminates timing-based errors that happen when one worker gets ahead of others too soon.
Summary
While barriers do not protect individual updates to shared variables (that requires locks or atomic operations), they do prevent race conditions arising from untimely progression—making sure all threads/processes stay in sync across stages where shared state consistency is crucial.
Lock vs Barriers
-
Locks/Mutexes: Used to prevent data races by ensuring only one thread or process accesses a shared resource (like a variable or data structure) at a time. They enforce mutual exclusion so that critical sections (where shared data is read/written) are protected from concurrent modification; this keeps data safe and consistent.
-
Barriers: Used to prevent certain types of race conditions related to sequencing or coordination (not data races on shared data). Barriers force all threads/processes to reach a specific execution point before any proceed, ensuring that operations depending on collective progress or multi-stage computations are correctly synchronized. Barriers do not provide mutual exclusion; they coordinate the timing of multiple participants, not access to critical sections.
Key Differences
Aspect | Lock/Mutex | Barrier |
---|---|---|
Main Role | Protect shared data/critical section | Synchronize execution phases |
Prevent | Data races (simultaneous data access) | Race conditions (out-of-sequence) |
Enforces | Only 1 thread in critical section | All wait until group ready |
Example Use | Incrementing a shared counter | Waiting for all to finish phase 1 |
Summary:
- Use a lock/mutex for direct access protection to shared data.
- Use a barrier to coordinate program phase progression and process/thread synchronization.
Both prevent "race conditions" in the broader sense, but mutex = data access safety, barrier = execution order safety.
Computational Graphs
The key to parallel programming is determining which steps within a program can be executed in parallel, and then figuring out how to coordinate them. One tool for modeling how the steps in a program relate to each other is a computational graph.
-
Computational graphs model relationships between program steps
-
Graphs visualize which steps can be executed in parallel to help coordinate parallel execution and identify dependencies
Directed acyclic graphs (DAGs)
-
Nodes represent tasks or units of work
-
Directed edges indicate progression and dependencies between tasks
Key Terminology
- Spawn/fork: Initiates parallel execution paths
- Sync/join: Synchronizes completion of parallel tasks
- Asynchronous: Tasks that can occur in any order relative to each other
- Critical path: Longest sequence of dependent operations in the graph
Analyzing Parallel Potential
- Work: Total execution time on a single processor (sum of all task times)
- Span: Shortest possible execution time with maximum parallelization (sum of critical path times)
- Ideal parallelism: Ratio of work to span, indicating maximum speed improvement
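For example, with made-up task times: suppose task A (2 s) must run first, tasks B (3 s) and C (4 s) can then run in parallel, and task D (1 s) depends on both B and C. The work is 2 + 3 + 4 + 1 = 10 s, the span is the critical path A → C → D = 2 + 4 + 1 = 7 s, and the ideal parallelism is 10 / 7 ≈ 1.43, so even with unlimited processors this graph can run at most about 1.4× faster than the sequential version.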
Challenges with Creating Many Threads
- As more tasks are added, a program may end up spawning a new thread for each one
- Creating a new thread for each task can lead to inefficiencies because thread creation incurs overhead in terms of processor time and memory usage
Thread Pools
- Thread pools can be a more efficient alternative to creating individual threads for each task
- Thread pools maintain a small collection of worker threads that can be reused to execute multiple tasks
- Submitting tasks to a thread pool is analogous to adding items to a to-do list for worker threads
Benefits of thread pools
- Reusing threads from a pool overcomes the overhead of creating new threads for each task
- Thread pools are especially advantageous when the task execution time is less than thread creation time
- Using preexisting threads in a pool can make programs more responsive by eliminating the delay of thread creation
Thread pool
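As a rough sketch of how a thread pool is used in Python, `concurrent.futures.ThreadPoolExecutor` maintains the reusable worker threads; the task function below is just a placeholder for an I/O-bound job.
```python
from concurrent.futures import ThreadPoolExecutor
import time

def task(n):
    time.sleep(0.5)          # simulate an I/O-bound wait
    return n * n

# Three reusable worker threads process ten tasks from the "to-do list"
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, range(10)))

print(results)               # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
Only three threads are ever created, no matter how many tasks are submitted, which avoids paying the thread-creation overhead per task.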
Future
- Asynchronous tasks allow multiple operations to be accomplished simultaneously
- Futures are a mechanism for handling the results of asynchronous operations
- It acts as a placeholder for a result that will be available at a later time
- A future is like an "I owe you" note for the result of an asynchronous task
Working with Futures
- Futures are read-only and may not have an immediate result
- A thread might need to wait for the future to be resolved
- "Resolving" or "fulfilling" a future means writing the result value to it
Divide and Conquer Algorithms
- Divide-and-conquer algorithms break down complex problems into smaller, manageable subproblems
- They are well suited for parallel execution across multiple processors
A divide-and-conquer algorithm is structured in three parts:
- Divide: The main problem is split into smaller subproblems of roughly equal size
- Conquer: Each subproblem is solved recursively
- Combine: Solutions to subproblems are merged to form the final solution
Advantages
- Subproblems are independent, allowing for parallel execution on different processors
- Can significantly reduce computation time for large datasets
Example - Single process
Example - Multi process
Here the process pool is created the first time the function is called. In the base case, the actual work is submitted to the process pool and the resulting future is returned in a list (so the partial results can later be combined by looping through the list). In the else branch, the problem is divided, the function recurses on each half, and the two lists of futures are concatenated.
Finally, the futures' results are collected and summed to produce the overall total. A sketch of this structure is shown below.
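The original example code is not reproduced here; a sketch along the lines described above (the helper name and threshold value are illustrative) might look like this:
```python
from concurrent.futures import ProcessPoolExecutor

_pool = None  # created lazily on the first call

def par_sum(data, threshold=1000):
    """Return a list of Futures whose results add up to sum(data)."""
    global _pool
    if _pool is None:
        _pool = ProcessPoolExecutor()
    if len(data) <= threshold:               # base case: do the work in the pool
        return [_pool.submit(sum, data)]
    mid = len(data) // 2                      # divide
    # conquer recursively and combine the two lists of futures
    return par_sum(data[:mid], threshold) + par_sum(data[mid:], threshold)

if __name__ == "__main__":
    numbers = list(range(100000))
    futures = par_sum(numbers)
    total = sum(f.result() for f in futures)  # combine: wait for and add all partial sums
    print(total == sum(numbers))              # True
    _pool.shutdown()
```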
Weak Scaling vs Strong Scaling
Weak scaling: Increase the problem size
- Weak scaling allows us to tackle larger problems in the same amount of time by adding more processors while keeping the workload per processor constant
For example, one person can decorate 10 cupcakes in an hour, while two people working in parallel can decorate 20 cupcakes in the same time
Definition: Measure how the runtime changes when you increase the number of processors while keeping the workload per processor constant.
- Goal: Solve a bigger problem in the same amount of time.
- Ideal case: Runtime should stay the same even as the problem size grows proportionally to the number of processors.
Weak Scaling (Workload grows with processors):
```
Workload = 100M per processor
              ┌───────────── Runtime ─────────────┐
Proc=1 (100M) | ########## (10s)
Proc=2 (200M) | ########## (10s)
Proc=4 (400M) | ########## (10s)
Proc=8 (800M) | ########## (10s)
```
Goal: runtime stays constant as workload grows.
Strong scaling: Accomplish tasks faster
- Strong scaling involves breaking down a fixed-size problem across multiple processors to execute it faster
For example, decorating 10 cupcakes takes one person an hour, but two people working together can complete the task in about 30 minutes
Definition: Measure how the runtime decreases when you increase the number of processors for a fixed total problem size.
- Goal: Solve the same problem faster by adding more processors.
- Ideal case: Runtime should shrink proportionally to the number of processors (linear scaling).
Strong Scaling (Fixed problem size):
```
Workload = 1B elements
       ┌───────────── Runtime ─────────────┐
Proc=1 | ############################ (100s)
Proc=2 | ############### (50s)
Proc=4 | ######### (25s)
Proc=8 | ##### (13s)
```
Goal: runtime shrinks with more processors.
Summary Table
Aspect | Strong Scaling | Weak Scaling |
---|---|---|
Problem size | Fixed | Increases with #processors |
Goal | Run same problem faster | Handle bigger problems in same time |
Ideal runtime | ↓ (shrinks with processors) | ≈ constant |
Typical limitation | Overheads, communication, Amdahl’s Law | Network/communication bottlenecks |
✅ Strong Scaling = speedup for fixed work. ✅ Weak Scaling = efficiency for growing work.
Key Metrics in Parallel Computing
Throughput
- Throughput measures the number of tasks completed in a given time period
- Adding more processors increases overall throughput, for example, from 10 cupcakes/hour with one worker to 30 cupcakes/hour with three workers
Latency
- Latency is the time taken to execute a single task from start to finish
- It remains constant per task even when multiple processors are used (for example, six minutes to decorate one cupcake)
Speedup
- Speedup is calculated as the ratio of sequential execution time to parallel execution time
- For example, if one worker takes 60 minutes and two workers take 30 minutes, the speedup is 2
Amdahl's Law
- Amdahl's Law is an equation used to estimate the potential speedup of a parallel program
- It helps determine the effectiveness of parallelizing a program based on its parallelizable portion
Amdahl's law is great for estimating the speedup that you might achieve by parallelizing a program
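In its standard form, Amdahl's law gives the overall speedup S for a program in which a fraction P of the work can be parallelized across N processors:

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}$$

The second expression is the hard ceiling on speedup imposed by the serial portion, no matter how many processors are added.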
Limitations on the Effectiveness of Parallelization
- The degree of parallelization significantly impacts potential speedup
- Programs with 90% parallelizable code have a maximum speedup of 10
- 75% parallelizable programs are limited to a speedup of 4
- 50% parallelizable programs can achieve a maximum speedup of only 2
With 2 processes and 95% of the program parallelizable, the overall speedup of the entire program is about 1.9×.
With 1,000 processes, the speedup is still stuck at roughly 20×.
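These figures follow directly from the formula above; a quick sanity check in Python (the helper function name is illustrative):
```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n processors."""
    return 1 / ((1 - p) + p / n)

print(round(amdahl_speedup(0.95, 2), 2))      # ~1.9
print(round(amdahl_speedup(0.95, 1000), 1))   # ~19.6, i.e. stuck around 20x
print(round(amdahl_speedup(0.90, 10**9), 1))  # approaches the 10x ceiling for 90% parallel code
```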
Decision-Making in Parallel Programming
- Amdahl's Law helps determine whether parallelizing a program is worthwhile
- It illustrates why parallel computing is most beneficial for highly parallelizable programs
- The costs and overhead of parallelization should be weighed against potential benefits
- Parallelizing everything is not always the best approach, despite the availability of multicore processors
Measuring Speedup and Efficiency
Calculating Speedup
Calculating Efficiency
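The formulas themselves are not reproduced above, but the standard definitions are:

$$\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}, \qquad \text{efficiency} = \frac{\text{speedup}}{\text{number of processors}}$$

For example, if a task takes 60 s sequentially and 20 s on 4 processors, the speedup is 3 and the efficiency is 3/4 = 75%.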
Benchmarking Best Practices
- Limit other running programs to avoid resource competition
- Average multiple independent runs to account for execution variability
- Allow for "warm-up" in environments with just-in-time compilation
- Run the algorithm once before measurement to ensure a consistent cache state
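A minimal sketch of these practices in Python, using `time.perf_counter`, an unmeasured warm-up run, and averaging over several independent runs (the workload function is a placeholder):
```python
import time

def benchmark(func, runs=5):
    func()                              # warm-up run: primes caches / JIT, not measured
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)  # average over independent runs

def workload():
    sum(i * i for i in range(1_000_000))  # placeholder task to be measured

print(f"Average runtime: {benchmark(workload):.4f} s")
```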
Parallel design stages
Parallel design in programming typically follows a sequence of well-defined stages that help convert a serial (single-threaded) problem into an efficient parallel algorithm. The aim is to maximize concurrency, minimize inter-process communication, and achieve balanced workload distribution across computational units.
Partitioning
- Goal: Divide the overall computation and data into smaller independent tasks.
- Tasks can be split via data parallelism (same operation, different data chunks) or task parallelism (different operations on either shared or private data).
- Example: In matrix multiplication, each element calculation can be handled as an independent task.
- Partitioning involves dividing the problem into discrete chunks of work for distribution among multiple tasks
- At this stage of the design process, the focus is on maximum decomposition without considering practical limitations like processor count
- Different decomposition methods may have varying advantages depending on the problem and hardware
- Domain decomposition and **functional** decomposition offer valuable perspectives on problem-solving in parallel computing
Domain (data) decomposition
- Focuses on dividing the data associated with the problem into small, ideally equal-sized partitions
- Computations are then associated with the partitioned data
Functional decomposition
- Begins by considering all computational work required by the program
- Divides the overall work into separate tasks performing different portions
- Data requirements for these tasks are a secondary consideration
Communication
After decomposing a problem into separate tasks, the next step is to establish communication between tasks. Communication involves coordinating execution and sharing data
- Goal: Identify and design interactions needed between the tasks.
- This includes transfer of intermediate data, synchronization signals, barriers, and locking mechanisms.
- Minimizing communication and avoiding bottlenecks are key to performance.
Types of Task Communication
- Independent tasks
- Some problems can be decomposed into tasks that don't require data sharing
- Interdependent tasks
- Tasks that require information from other tasks to complete their work
Communication Structures
- Point-to-point communication
- Direct links between neighboring tasks
- Suitable when each task communicates with a small number of other tasks
- Involves sender (producer) and receiver (consumer) roles
- Collective communication
- Used when tasks need to communicate with a larger group
- Includes broadcasting (one-to-many) and scattering/gathering (distributing and collecting data)
- Centralized management
- One task acts as a central coordinator for a group of distributed workers
- Can become a bottleneck as the number of workers increases
- Divide-and-conquer strategies can help distribute computation and communication load
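As an illustration of the point-to-point (producer/consumer) pattern, a minimal sketch using `multiprocessing.Queue` (the names and the sentinel convention are illustrative):
```python
from multiprocessing import Process, Queue

def producer(queue):
    for i in range(5):
        queue.put(i)          # send data to the consumer
    queue.put(None)           # sentinel: signal "no more data"

def consumer(queue):
    while True:
        item = queue.get()    # blocks until a message arrives
        if item is None:
            break
        print("Consumed:", item)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q,))
    c = Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()
```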
Communication Factors to Consider
- Synchronous vs. asynchronous communication
- Synchronous (blocking): Tasks wait for communication to complete before continuing
- Asynchronous (non-blocking): Tasks can perform other work while communication is in progress
- Performance considerations
- Processing overhead: Time spent on communication versus data processing
- Latency: Time for a message to travel from sender to receiver
- Bandwidth: Amount of data that can be communicated per unit of time
Agglomeration
- Goal: Combine smaller tasks or groups for more efficient execution.
- Merge fine-grained tasks to reduce scheduling overhead or communication costs.
- Helps manage granularity: not too fine (overhead), and not too coarse (missed parallelism).
Granularity
Fine-grained parallelism
- Breaks a program into a large number of small tasks
- Advantages: Allows for better load balancing across processors
- Disadvantages: Increases overhead for communication and synchronization, resulting in a low computation-to-communication ratio
Coarse-grained parallelism
- Splits the program into a small number of large tasks
- Advantages: Has lower communication overhead, allowing more time for computation
- Disadvantages: May produce load imbalance, with some tasks processing more data while others remain idle
Medium-grained parallelism
- Balances the trade-offs between fine and coarse-grained approaches
- Often the most efficient solution for general purpose computers
Recommendations
- Avoid hard-coded limits on the number of tasks
- Use compile-time or runtime parameters to control granularity
- Design programs to adapt to changes in the number of available processors
Mapping
- Goal: Allocate tasks to processes, threads, or processors.
- Tasks need to be mapped in a way that balances load and minimizes idle time, taking into account hardware and resource constraints.
- Consider dynamic vs. static mapping depending on workload variability.
- Mapping is the fourth and final stage of our parallel design process
- It involves specifying where each established task will be executed
Applicability
- Mapping is not applicable for single-processor systems or systems with automated task scheduling
- It becomes relevant in distributed systems or specialized hardware with multiple parallel processors