Parallel and Concurrent programming - LogeshVel/learning_resources GitHub Wiki
Multi-Threads - useful for I/O-bound tasks
Multi-Process - useful for CPU-bound tasks
Sequential vs. Parallel Computing
Sequential computing
Sequential computing is the traditional programming approach where instructions are executed one after another by a single processor, with only one instruction being executed at a time; the processor's capabilities limit the speed of execution.
Parallel computing
Parallel computing utilizes multiple processors working simultaneously on different parts of a task; this approach breaks down tasks into independent parts that can be executed in parallel.
Advantages of Parallel Computing
- Parallel computing significantly increases the overall throughput of a program by enabling faster completion of large or complex tasks that would be impractical for a single processor; it enables the computer to accomplish more tasks in a given amount of time.
- By leveraging multiple processors, parallel computing makes it possible to solve problems too large or complex for a single processor; in many industries, the time saved using parallel computing often outweighs the costs of investing in specialized hardware.
Challenges of Parallel Computing
- Implementing parallel computing increases the complexity of programming, requiring careful coordination between processors to ensure efficient execution; adding more processors doesn't always result in a proportional increase in speed due to coordination overhead.
- There may be waiting times between dependent tasks, which can affect overall performance.
Real-World Applications
- Web search engines are one example of a real-world application that uses parallel computing to process millions of transactions every second
- Many other industries leverage parallel computing, where time savings translate directly to cost savings
Key Takeaways
- Parallel computing offers significant performance benefits for tasks that can be effectively parallelized
- Implementing parallel solutions requires careful planning and coordination to overcome inherent challenges
- Parallel computing has grown in importance across various industries due to its ability to handle complex, data-intensive tasks; it is essential for solving large-scale computing problems in modern applications
Four Classes of Computer Architecture (Flynn’s Taxonomy)
Parallel computing requires parallel hardware with multiple processors. Flynn's taxonomy classifies computer architectures into four categories based on their instruction streams and data streams:
Single Instruction, Single Data (SISD)
- Sequential computer with a single processor unit
- Executes one series of instructions on one data element at a time
Single Instruction, Multiple Data (SIMD)
- Parallel computer with multiple processing units
- All processors execute the same instruction simultaneously on different data elements
- Well-suited for applications performing repetitive operations on large datasets (for example, image processing)
- Modern GPUs often use SIMD instructions
Multiple Instruction, Single Data (MISD)
- Each processing unit executes its own series of instructions on the same data stream
- Not a commonly used architecture due to limited practical applications
Multiple Instruction, Multiple Data (MIMD)
- Each processing unit can execute different instructions on different datasets
- Most commonly used architecture found in multicore PCs, networked clusters, and supercomputers
MIMD Subdivisions
Single Program, Multiple Data (SPMD)
- Multiple processing units execute copies of the same program simultaneously
- Processors can run asynchronously and execute different program parts based on conditional logic
- The most common style of parallel programming
Multiple Program, Multiple Data (MPMD)
- Processors execute different, independent programs simultaneously on different data
- Often uses a "host" or "manager" node to distribute work to other nodes and then collect results
- Less common than SPMD but useful for certain applications
Memory Architecture
Memory organization and access are crucial factors in parallel computing performance. Even with numerous processors, inefficient memory access can negate potential gains because computer memory typically operates slower than processors, creating potential bottlenecks.
Shared memory and distributed memory are the two main memory architectures that support different use cases for parallel computing.
Shared Memory Architecture
All processors access a global memory address space, and changes made by one processor to that memory are visible to all others.
Types of shared memory systems
Uniform Memory Access - UMA
- Processors have equal access speed to memory
- Symmetric multiprocessing (SMP) is a common UMA architecture
- Modern multicore processors use SMP architecture
Symmetric Multiprocessing - SMP
Non-Uniform Memory Access - NUMA
- Often created by connecting multiple SMP systems
- Processors have varying access speeds to different memory parts
- All processors can still access all memory
Caches in shared memory systems
- Each core typically has its own fast, local cache
- Cache coherency becomes a challenge when updating shared memory
- Hardware handles cache coherency in multicore processors
Advantages and disadvantages
- Easier for programming due to simple data sharing
- May not scale well due to increased bus traffic and synchronization needs
Distributed Memory
Each processor has its own local memory and address space; no global address space exists, and processors are connected through a network.
Key characteristics
- Changes in one processor's memory are not automatically reflected in others
- Programmers must explicitly define data communication between nodes
Advantages and disadvantages
- Highly scalable: adding processors increases both processing power and memory
- Cost-effective using commodity hardware
- Communication between nodes can be challenging to program
Process
- A process is an instance of a program executing on a computer
- It consists of the program's code, data, and state information
- Each process has its own independent address space in memory
- Computers can manage hundreds of active processes simultaneously
Threads
- Threads are smaller sub-elements of execution that exist within a process
- They represent independent paths of execution through a program
- Threads are the basic units managed by the operating system for execution
- Multiple threads within a process share the same address space and resources
Comparison of Processes and Threads
- Processes are isolated from each other, each with its own address space
- Threads within the same process can easily share resources and communicate
- Interprocess communication is possible but requires special mechanisms
- Threads are generally considered "lightweight" compared to processes
- Creating and terminating threads typically requires less overhead than processes
- Switching between threads of the same process is usually faster than switching between different processes
Choosing between Threads and Processes
- The choice depends on the specific task and operating environment
- Threads are generally recommended when possible due to their lightweight nature
- Implementing threads and processes can vary across operating systems and programming languages
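As a minimal sketch of the two APIs in Python (the worker function and labels here are purely illustrative), the same callable can be run either as a thread, sharing the parent's memory, or as a separate process with its own address space:
```python
import threading
import multiprocessing

def worker(label):
    # The same callable can be handed to either a thread or a process
    print(f"{label} running")

if __name__ == "__main__":
    t = threading.Thread(target=worker, args=("thread",))          # lightweight, shares memory
    p = multiprocessing.Process(target=worker, args=("process",))  # own interpreter and memory
    t.start()
    p.start()
    t.join()
    p.join()
```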
Inter-Process communication - IPC
Concurrency
Concurrency refers to the ability of a program to be broken into independent parts that can be executed out of order without affecting the final result. It focuses on how a program is structured and composed of independently executing processes. Concurrent execution does not necessarily mean parallel execution.
Concurrent Execution
- Concurrent processes overlap in time but may not execute simultaneously on a single processor
- Rapid task-switching can create an illusion of simultaneous execution, but it's not true parallelism
Parallel Hardware
- Multicore processors in desktop computers and mobile devices
- Graphics processing units (GPUs) with numerous specialized cores
- Computer clusters that distribute processing across multiple systems
Parallel Execution
- Parallel execution requires parallel hardware, such as multiple processors or cores
- Parallel execution allows multiple tasks to be performed simultaneously, potentially speeding up the overall process
Key Distinctions
- Concurrency is about program structure and dealing with multiple things at once
- Parallelism involves simultaneous execution and doing multiple things at once
- Concurrency enables potential parallel execution but doesn't guarantee it
Use Cases
- Concurrent programming is beneficial for I/O-dependent tasks like graphical user interfaces
- Parallel processing is particularly useful for computation-intensive tasks, such as matrix multiplication
Practical Applications
- Device drivers for I/O devices (mouse, keyboard, and hard drive) execute concurrently but may not benefit significantly from parallelism
- Graphical user interfaces use concurrency to maintain responsiveness during time-consuming operations
- Large mathematical operations can be divided into subparts for efficient parallel processing
GIL
Spawn vs Fork process
import multiprocessing as mp
def worker():
print("Worker running")
mp.Process(target=worker).start()
The above code can lead to infinite process spawning, but Python has a mechanism that raises a RuntimeError to prevent it from crashing the system.
On certain operating systems like Windows, this code will create an infinite process spawning loop that can quickly crash your system. On Unix-like systems (Linux, macOS), it will likely run without this issue.
This platform-dependent behavior is due to the different default methods used to create new processes.
The "Spawn" vs. "Fork" Problem
The multiprocessing library uses two main methods to create child processes:
- Spawn: This is the default method on Windows and macOS. The parent process starts a fresh Python interpreter process. This new process then imports the script that contains the target function. If the code that starts the process (mp.Process(...).start()) is not protected, the child process will re-run it upon import, leading to an infinite loop of new processes being created.
  - Parent starts ➡️ Spawns Child 1.
  - Child 1 imports script ➡️ Re-runs the spawning code ➡️ Spawns Child 2.
  - Child 2 imports script ➡️ Re-runs the spawning code ➡️ Spawns Child 3... and so on, forever. 💥
- Fork: This is the default method on Linux. The parent process essentially creates a copy of itself, including its memory state. The child process resumes execution right after the fork call. It doesn't re-import the entire script, so it doesn't re-execute the process-spawning code. This is why the code works on Linux without causing a loop.
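If you want consistent behavior across platforms, you can also request a start method explicitly instead of relying on the platform default. A minimal sketch using multiprocessing.get_context (note that "fork" is only available on Unix-like systems, so "spawn" is the portable choice):
```python
import multiprocessing as mp

def worker():
    print("Worker running")

if __name__ == "__main__":
    # Explicitly pick the start method instead of relying on the OS default.
    # "spawn" works everywhere; "fork" is Unix-only.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```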
The Solution: The if __name__ == "__main__" Guard
To make your multiprocessing code safe and cross-platform compatible, you must protect the "entry point" of your program. You do this by placing the code that starts the processes inside an if __name__ == "__main__": block.
When a Python script is run directly, the interpreter sets the special variable __name__ to "__main__". However, when the script is imported by another script (which is what happens in a spawned child process), __name__ is set to the module's name (i.e., the filename).
This guard ensures the mp.Process().start() call is only executed in the main parent process, not in the spawned children.
Corrected Code
import multiprocessing as mp
import time
def worker():
print("Worker running")
time.sleep(1) # Added a delay to make the output observable
# This is the crucial guard
if __name__ == "__main__":
p = mp.Process(target=worker)
p.start()
p.join() # It's good practice to wait for the process to finish
print("Main process finished")
By using this structure, you explicitly tell Python: "Only start a new process if this script is being run as the main program," effectively breaking the infinite recursion on systems that use the "spawn" method.
Python will raise a RuntimeError if we don't use the if guard when spawning a new child process, to prevent a process explosion from crashing the system.
The Problem: Recursive Spawning
On Windows, when you start a new process with mp.Process().start(), Python must launch a new interpreter and re-import your script to find the worker function.
Here's the chain of events causing the crash:
- Main Process Starts: Your script begins running.
- Enters Loop: It hits the for loop and calls p.start() to create the first child process.
- Child Process Imports Script: The new child process starts and, to do its job, it imports your entire script from the top.
- Child Process Enters Loop: Because the for loop is at the top level of the script (not inside the if __name__ == "__main__": guard), the child process also runs this loop.
- Child Tries to Spawn: The child process hits p.start() and tries to spawn its own child process.
The RuntimeError you see is a safety mechanism. Python detects that a child process is trying to start a new process before it has even finished setting itself up ("bootstrapping"), and it stops everything to prevent an infinite loop of processes that would quickly crash your computer.
The Solution: Use the Guard
By uncommenting the if __name__ == "__main__": line and indenting your process-starting code underneath it, you tell Python: "Only run this part of the code if the script is being executed directly."
When a child process imports the script, the code inside the guard is skipped, breaking the infinite loop.
Execution scheduling
Operating System Scheduling in Concurrent Computing
Computers manage multiple processes and threads competing for limited processor resources. The operating system's scheduler controls when different threads and processes execute on the CPU.
The Role of the Scheduler
- Process management
- The scheduler enables multiple programs to run concurrently on a single processor
- New processes are loaded into memory and placed in the "ready queue" when created
- The scheduler cycles through ready processes, allocating CPU time for each process to execute
- Multiprocessor handling
- For systems with multiple processors, the OS schedules processes across all available processors to maximize resource utilization
- Process execution states
- Processes may run until completion, get blocked waiting for I/O events, or be swapped out after using their allocated timeshare
Context Switch
- Save the state of a currently running process and load the state of another process
- Allows multiple processes to share CPU time
- Context switches are not instantaneous and incur overhead
- Schedulers must balance switch frequency with performance impact
Scheduling Algorithms
- Different operating systems use different scheduling algorithms based on their specific purposes and system requirements
- Some scheduling algorithms aim to maximize throughput, while others focus on minimizing latency for improved responsiveness
- Preemptive scheduling: Can interrupt low-priority processes for high-priority ones
- Non-preemptive scheduling: Allows processes to run for their full allotted time once started
Key Takeaways
- Scheduling details are typically handled by the OS "under the hood"
- Programmers should not assume a specific execution order or equal time allocation for threads or processes
- Programs should be written to function correctly regardless of scheduling decisions
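As a small illustration of the takeaway above, the interleaving of threads is decided by the scheduler, so the output order of this sketch can change from run to run:
```python
import threading

def count(name):
    # The OS scheduler decides when each thread runs, so the
    # interleaving of these prints differs between runs.
    for i in range(3):
        print(f"{name}: {i}")

threads = [threading.Thread(target=count, args=(f"T{n}",)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```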
Thread Lifecycle and States
- A new process begins with a single "main thread" that can spawn additional child threads
- Child threads execute independently within the same process and can spawn their own children
- Threads notify their parent upon completion, with the main thread usually finishing last
Thread States
Threads transition through four main states during their lifecycle:
NEW
- The NEW thread has been created but is not yet running
- It doesn't consume CPU resources at this stage
- The thread is assigned a function to execute
RUNNABLE
- The operating system can schedule the RUNNABLE thread to execute on a processor
- It may be swapped with other threads through context switches
BLOCKED
- The thread enters the BLOCKED state when waiting for an event or resource
- It doesn't use CPU resources while blocked
- The thread returns to the RUNNABLE state when the required condition is met
TERMINATED
- The thread enters the TERMINATED state upon completing execution or being abnormally aborted
- It notifies its parent thread before termination
Key Concepts
- Threads can create child threads to perform parallel tasks
- The join() method allows a parent thread to wait for a child thread to complete
- Different programming languages may use varying terminology for thread states
- Efficient thread management is crucial for concurrent and parallel computing
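A minimal sketch of these lifecycle ideas: is_alive() reports False before start() (NEW) and after join() (TERMINATED), and True while the child is runnable or blocked in sleep():
```python
import threading
import time

def child():
    time.sleep(0.5)  # BLOCKED while sleeping, then finishes

t = threading.Thread(target=child)
print("Before start, alive:", t.is_alive())  # NEW -> False
t.start()
print("After start, alive:", t.is_alive())   # RUNNABLE/BLOCKED -> True
t.join()                                     # parent waits for the child
print("After join, alive:", t.is_alive())    # TERMINATED -> False
```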
Data Races in Concurrent Programming
A data race occurs when two or more threads concurrently access the same memory location AND at least one of those threads is writing to that location to modify its value.
Anatomy of a data race
Even simple operations, like incrementing a numeric value, consist of multiple steps:
- Reading: A thread reads the existing value from the shared resource
- Modifying: The thread performs calculations or modifications based on the value
- Writing: The thread writes the new value back to the shared resource
Challenge of data races
Data races can be difficult to debug because:
- They depend on the unpredictable timing of thread scheduling
- They may occur intermittently, making the problem inconsistent and hard to reproduce
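You can see the read-modify-write split described above in CPython's bytecode with the standard dis module (the exact opcode names vary by Python version, but the separate load, add, and store steps are always there):
```python
import dis

counter = 0

def increment():
    global counter
    counter += 1  # looks like one step, but compiles to several

# A context switch can occur between any of the load, add, and
# store instructions shown in the disassembly below.
dis.dis(increment)
```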
Concurrent Access and Critical Sections
When multiple threads concurrently read and write to a shared resource, that can lead to incorrect behavior, such as a data race. This issue can be mitigated by identifying and protecting critical sections of code.
Critical sections
- A critical section is a part of a program that accesses a shared resource (for example, a data structure in memory)
- The program may not function correctly if multiple threads access the critical section simultaneously
- Critical sections need protection to ensure only one thread or process can execute that code section at a time
Mutex (mutual exclusion)
- A mutex, also known as a lock, is a mechanism to prevent simultaneous access to shared resources
- Only one thread can possess the mutex at a time, forcing threads to take turns accessing the shared resource
- The process involves acquiring the lock, executing the critical section, and then releasing the lock
Atomic operations
- Acquiring a lock is an atomic operation, meaning it executes as a single, indivisible action
- Atomic operations appear instantaneous to the rest of the system and are uninterruptible
Thread blocking
- Threads attempting to acquire a lock that's already held by another thread will block (wait) until it becomes available
Best Practices
- It's crucial to keep the code protected by a mutex as short as possible to prevent threads from getting stuck waiting
- Operations that don't require the shared resource should be performed outside the critical section to improve efficiency
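A minimal sketch of a mutex protecting a critical section in Python, using threading.Lock as a context manager so the lock is always released:
```python
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:        # acquire the mutex; released automatically on exit
            counter += 1  # critical section kept as short as possible

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000, because access is mutually exclusive
```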
Reentrant lock
Reentrant Mutexes
- A reentrant mutex is a type of mutex that can be locked multiple times by the same thread or process
- It keeps track internally of how many times it has been locked by the owning thread
- The reentrant mutex must be unlocked an equal number of times before another thread can acquire it
- Reentrant mutexes are also known as "recursive mutexes" or "recursive locks" in various programming languages
Reentrant Mutexes versus Standard Mutexes
- Standard mutexes can lead to deadlocks if a thread tries to lock a mutex it already owns
- Reentrant mutexes prevent this type of deadlock by allowing multiple locks by the same thread
Use Cases for Reentrant Mutexes
Nested function calls
- Reentrant mutexes are useful when functions that use locks are called within other locked sections of code
- This scenario can occur when retrofitting locks into existing code or when creating functions that use other locked functions
Recursive functions
- Reentrant mutexes are essential for recursive functions that need to lock resources
- As the function calls itself, it can lock the mutex multiple times and then unlock it an equal number of times as it unwinds
Advantages and Considerations
- Reentrant mutexes can simplify coding by reducing concerns about what has already been locked
- Reentrant mutexes can facilitate easier integration of locks into existing code structures; however, there is debate in the programming community about their use, with some arguing for code refactoring to avoid nested locks instead
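A minimal sketch of a reentrant mutex in Python using threading.RLock: the recursive function re-acquires the lock it already holds, which would deadlock with a plain Lock:
```python
import threading

rlock = threading.RLock()

def countdown(n):
    with rlock:               # the same thread may re-acquire an RLock
        if n > 0:
            countdown(n - 1)  # recursive call locks it one level deeper
        # each with-block releases one level as the recursion unwinds

t = threading.Thread(target=countdown, args=(5,))
t.start()
t.join()
print("done")
```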
Try-Lock
A “try-lock” is a non-blocking version of the standard lock or mutex used in multithreaded programming. Try-lock is useful when threads have multiple tasks, and constant blocking for lock acquisition is inefficient.
Acquiring a Try-Lock
- If the mutex is available, it gets locked, and the method returns true
- If the mutex is already held by another thread, it immediately returns false
- This allows threads to continue with other tasks rather than waiting for the lock
Advantages of Try-Lock
- Try-lock provides a way to attempt lock acquisition without forcing threads into a waiting state
- It prevents unnecessary blocking when threads have other useful tasks to perform; this is particularly useful in scenarios where threads have multiple independent tasks to perform
- It improves efficiency by allowing threads to switch to alternative tasks when a resource is unavailable, improving resource utilization
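In Python, the try-lock behavior is expressed as lock.acquire(blocking=False), which returns immediately with True or False. A minimal sketch where a thread does other work when the lock is busy:
```python
import threading
import time

lock = threading.Lock()

def worker(name):
    while True:
        if lock.acquire(blocking=False):  # try-lock: never waits
            try:
                print(f"{name} got the lock, doing protected work")
                time.sleep(0.2)
            finally:
                lock.release()
            break
        else:
            print(f"{name}: lock busy, doing other work instead")
            time.sleep(0.1)

threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```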
Reader-Writer Locks
readerwriterlock is an external package that we need to install in order to use reader-writer locks in Python.
The Problem with Basic Locks
Basic locks restrict access to one thread at a time, regardless of whether it's reading or writing. This approach is not always efficient, especially when many threads only need to read shared data.
Reader-writer locks (or shared mutexes) offer a more flexible solution for managing concurrent access. They can be locked in one of two modes:
- Shared "READ" mode: Allows multiple threads to read simultaneously
- Exclusive "WRITE" mode: Restricts access to one thread for writing
How Reader-Writer Locks Work
- Multiple threads can acquire a shared read lock concurrently
- Only one thread can acquire an exclusive write lock at a time
- When a thread holds a write lock, no other threads can acquire either a read or write lock
- Write locks cannot be acquired while any read locks are held
Use Cases and Considerations
- Reader-writer locks are beneficial when there are significantly more read operations than write operations
- They can improve performance in scenarios like certain types of database applications
- A potential downside of reader-writer locks is that they are more complex to implement and typically use more resources than standard mutexes
- The decision to use reader-writer locks should consider factors such as the ratio of read versus write operations, language-specific implementation details, and whether the lock gives preference to readers or writers
Best Practices
- Use reader-writer locks when there are many more reading threads than writing threads; however, if most threads are writing, a standard mutex may be more appropriate
- Consider the specific requirements and characteristics of your application when choosing between lock types
1. Definition
A Reader–Writer Lock (also called RWLock) is a synchronization primitive that allows:
- Multiple readers at the same time (concurrent read access).
- Only one writer at a time (exclusive write access).
- No reader can read while a writer is writing (prevents dirty reads).
It’s used when reads are more frequent than writes — so instead of locking the resource for every read (like a normal mutex), you let multiple readers work in parallel and only block when a writer comes in.
2. Why It’s Useful
- Prevents unnecessary blocking for read-only operations.
- Improves performance in read-heavy applications.
- Ensures data integrity when writes occur.
- Common in:
  - Databases
  - In-memory caches
  - File read/write operations
  - Shared memory in multiprocessing/multithreading
3. Reader–Writer Lock in Python
Python’s standard library does not have a built-in RWLock, but you can implement one yourself using threading.Lock and threading.Condition, or use third-party libraries such as:
- readerwriterlock (pip install readerwriterlock)
- pyreadwrite
- Or write your own class
3.1 Using the readerwriterlock Library
from readerwriterlock import rwlock
import threading
import time
# Create RW lock
lock = rwlock.RWLockFair() # Fair: avoids starvation of readers or writers
# Reader example
def reader_task(id):
with lock.gen_rlock():
print(f"Reader-{id} is reading")
time.sleep(1)
print(f"Reader-{id} done")
# Writer example
def writer_task(id):
with lock.gen_wlock():
print(f"Writer-{id} is writing")
time.sleep(2)
print(f"Writer-{id} done")
# Create threads
threads = []
for i in range(3):
threads.append(threading.Thread(target=reader_task, args=(i,)))
threads.append(threading.Thread(target=writer_task, args=(1,)))
threads.append(threading.Thread(target=reader_task, args=(99,)))
# Start and join
for t in threads:
t.start()
for t in threads:
t.join()
Key Points:
- gen_rlock() → Generates a read lock context manager.
- gen_wlock() → Generates a write lock context manager.
- Multiple rlocks can be held at the same time, but a wlock blocks all others.
4. Available Methods in readerwriterlock
Using rwlock.RWLockFair() (fair scheduling for readers & writers):
Method | Purpose |
---|---|
gen_rlock() | Returns a read lock (context manager) |
gen_wlock() | Returns a write lock (context manager) |
acquire() | Manually acquire the lock |
release() | Manually release the lock |
locked() | Check if lock is currently held |
Example with manual acquire/release:
read_lock = lock.gen_rlock()
read_lock.acquire()
# Read-only operations
read_lock.release()
5. When to Use Reader–Writer Lock
✅ Use it when:
- You have frequent reads and infrequent writes.
- Multiple threads/processes need safe read access.
- You want higher concurrency than a normal mutex.
❌ Avoid when:
- Writes are frequent (can cause reader starvation without fairness policy).
- Simplicity is more important than performance — then a normal threading.Lock might be enough.
6. Real-World Example
Imagine you have a shared cache dictionary that multiple threads read from but occasionally update:
cache = {}
lock = rwlock.RWLockFair()
def read_cache(key):
with lock.gen_rlock():
return cache.get(key)
def write_cache(key, value):
with lock.gen_wlock():
cache[key] = value
This allows many readers at once, but ensures writers get exclusive access when updating.
Abandoned Lock: A New Form of Deadlock
- An abandoned lock occurs when a thread acquires a lock on a shared resource and exits before releasing its lock on that resource, leaving other threads trying to acquire the lock to wait indefinitely
- When a thread or process acquires a lock and then terminates unexpectedly, it may not release the lock automatically
- Deadlocks can occur not only through resource competition but also through unexpected thread termination
1. What is an Abandoned Lock?
An abandoned lock occurs when:
- A thread acquires a lock but never releases it — usually because:
  - The thread exits unexpectedly
  - An exception occurs before the release() call
  - The programmer simply forgot to release it
Problem: Other threads that need the same lock will wait forever (deadlock situation).
2. Why it Happens in Python
In Python threading:
- If a thread dies without releasing a lock, the lock stays acquired until the process ends.
- Locks are not automatically released when a thread terminates.
- The threading.Lock object does not know whether the owner is still alive — it just sees “locked” and blocks new acquisitions.
3. Example of an Abandoned Lock
import threading
import time
lock = threading.Lock()
def worker():
lock.acquire()
print("Lock acquired by worker")
# Simulating an exception that prevents release
raise RuntimeError("Oops! Something went wrong")
lock.release() # This never runs
t1 = threading.Thread(target=worker)
t1.start()
t1.join()
# Now main thread tries to acquire it
print("Main thread trying to acquire...")
lock.acquire() # This will BLOCK forever
print("Main thread acquired lock")
What happens here:
- t1 acquires the lock but crashes before releasing.
- The lock remains locked forever → main thread gets stuck.
4. How to Avoid Abandoned Locks
Best Practices:
a) Use a with Statement (Context Manager)
Always acquire/release in a with block — Python will release it even if an exception occurs.
import threading
lock = threading.Lock()
def worker():
with lock:
print("Lock acquired safely")
raise RuntimeError("Error happens but lock is released automatically")
threading.Thread(target=worker).start()
b) Use try...finally
If you can’t use with, wrap it in try/finally:
lock.acquire()
try:
# Critical section
pass
finally:
lock.release()
c) Use Timeout on Acquire
Avoid indefinite blocking:
if lock.acquire(timeout=5):
try:
print("Got the lock")
finally:
lock.release()
else:
print("Could not acquire lock in time")
d) Avoid Complex Logic Inside Locks
- Keep the locked section as short as possible.
- Do I/O outside the locked section when possible.
5. Key Points
- Abandoned locks = potential deadlocks
- Python won’t clean up locks for you if the owning thread dies unexpectedly
- Always use with lock: or try...finally
- Timeouts can prevent infinite waits
Starvation
- Starvation occurs when a process or thread is perpetually denied access to the resources it needs, preventing it from progressing
- Ideally, threads will take turns accessing shared resources, but this is not guaranteed due to how operating systems schedule thread execution
- A "greedy" thread frequently holding a lock on a shared resource can lead to other threads being starved out of being able to make progress
- In simple scenarios with a few equally prioritized threads, starvation is less likely to be a concern; however, thread priorities can significantly impact the likelihood of starvation: higher-priority threads are generally scheduled to execute more often, and lower-priority threads may struggle to gain access to resources
Implications
- Starvation can significantly impact system performance and fairness in resource allocation
- Designers of concurrent systems need to consider thread priorities and the number of concurrent threads to prevent starvation
- While occasional delays in resource access may be tolerable, persistent starvation can severely hamper the functionality of affected threads
Why Does Starvation Happen?
Starvation arises due to several factors:
- Lock Priority: If a lock always gives preference to one set of threads (e.g., readers), other threads (e.g., writers) may never get access, leading to starvation.
- Thread Priority: High-priority threads can monopolize the CPU or resource, starving lower-priority ones.
- Unfair Scheduling: Some thread schedulers in the OS or Python runtime can consistently favor certain threads.
- Greedy Threads: Threads that hold resources for too long or repeatedly reacquire them can starve other threads.
- Infinite Loops/Blocking: Threads that never release a resource can prevent others from accessing it.
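A toy sketch of the "greedy thread" scenario described above (deliberately exaggerated: the greedy worker holds the lock for its entire run, so the other thread makes little or no progress during that window):
```python
import threading
import time

lock = threading.Lock()
progress = {"greedy": 0, "starved": 0}

def greedy_worker():
    # Acquires the lock once and keeps it for the whole run
    with lock:
        end = time.time() + 2
        while time.time() < end:
            progress["greedy"] += 1
            time.sleep(0.01)

def starved_worker():
    # Wants frequent, short accesses, but cannot get in while the lock is held
    end = time.time() + 2
    while time.time() < end:
        if lock.acquire(timeout=0.1):
            try:
                progress["starved"] += 1
            finally:
                lock.release()

t1 = threading.Thread(target=greedy_worker)
t2 = threading.Thread(target=starved_worker)
t1.start(); t2.start()
t1.join(); t2.join()
print(progress)  # the starved thread gets few or no turns while the lock is hogged
```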
Livelock
- Livelock is a situation in concurrent computing where two or more threads block each other from making progress
- Unlike deadlock, threads in livelock are actively trying to resolve the problem but fail to make progress
1. What is Livelock?
A livelock is like a deadlock, except the threads aren’t stuck — they’re actively doing something, but still making no real progress.
Analogy: Two people trying to pass each other in a narrow hallway:
- Person A moves left to make space.
- Person B moves left too (same idea).
- Then both move right.
- They keep dancing around, never getting past each other — that’s a livelock.
2. Key Difference from Deadlock
- Deadlock → Threads are stuck waiting forever, no movement at all.
- Livelock → Threads are busy, constantly reacting to each other, but never making progress.
3. Python Livelock Example
Here’s a simulation of two threads that keep releasing and reacquiring a lock in a way that prevents either from completing:
import threading
import time
lock_a = threading.Lock()
lock_b = threading.Lock()
def worker1():
while True:
with lock_a:
print("Worker1 acquired Lock A, trying for Lock B")
time.sleep(0.1)
if lock_b.acquire(blocking=False):
print("Worker1 acquired Lock B - work done")
lock_b.release()
break
else:
print("Worker1 couldn’t get Lock B, releasing Lock A and retrying")
time.sleep(0.1) # Gives chance to other thread
def worker2():
while True:
with lock_b:
print("Worker2 acquired Lock B, trying for Lock A")
time.sleep(0.1)
if lock_a.acquire(blocking=False):
print("Worker2 acquired Lock A - work done")
lock_a.release()
break
else:
print("Worker2 couldn’t get Lock A, releasing Lock B and retrying")
time.sleep(0.1) # Gives chance to other thread
t1 = threading.Thread(target=worker1)
t2 = threading.Thread(target=worker2)
t1.start()
t2.start()
t1.join()
t2.join()
What Happens Here
- Worker1 grabs Lock A, Worker2 grabs Lock B.
- Each tries to acquire the other lock without blocking.
- If unsuccessful, they release their lock and retry.
- Both keep reacting to each other’s moves and releasing locks — but neither progresses → Livelock.
4. How to Avoid Livelock
- Set backoff/wait times differently so they don’t always retry in sync:
time.sleep(random.uniform(0.05, 0.15))
- Introduce priority — let one thread keep its lock until done.
- Use a single lock for both resources when possible.
- Use try-lock + order — always acquire locks in a consistent order.
5. Key Points
- Livelock wastes CPU because threads keep running without progress.
- It’s often caused by overly polite or overly reactive retry logic.
- Random delays, consistent ordering, or priority assignment usually fix it.
Difference Between Deadlock, Livelock, and Starvation
Condition Variables
- Condition variables serve as a queue for threads waiting for a specific condition to occur
- They work in conjunction with mutexes to implement a higher-level construct called a monitor
Limitations of Locks and Mutexes
A lock or mutex restricts multiple threads from accessing a shared resource simultaneously; however, it doesn't provide a way for threads to signal each other or synchronize actions efficiently
Condition Variable Operations
There are three main operations associated with condition variables:
- Wait: Releases the mutex lock and puts the thread to sleep in a queue
- Signal (or notify/wake): Wakes up a single thread from the waiting queue
- Broadcast (or notifyAll/wakeAll): Wakes up all threads in the waiting queue
1. What is a Condition Variable?
A Condition Variable in Python (threading.Condition) is a synchronization primitive that:
- Allows threads to wait until some condition is met.
- Works with an underlying lock to protect shared state.
- Lets one thread signal other waiting threads when the condition changes.
Think of it as a “doorbell”:
- Many threads can wait until the “doorbell” is rung.
- Another thread rings the bell to tell them the state has changed.
2. Why Use Condition Variables?
They are useful when:
- You have producer–consumer type problems.
- One thread needs to wait for another to prepare data.
- You need to coordinate between threads beyond just locking.
3. How It Works
A Condition object has:
- An internal lock (RLock by default).
- Methods:
  - wait() → Release the lock and block until notified.
  - wait_for(predicate) → Wait until a function returns True.
  - notify() → Wake one waiting thread.
  - notify_all() → Wake all waiting threads.
  - acquire()/release() → Manage the underlying lock (usually via with).
4. Basic Example
Producer–Consumer scenario:
import threading
import time
condition = threading.Condition()
shared_data = None
def consumer():
global shared_data
with condition: # Acquire condition's lock
print("Consumer waiting for data...")
# here we can also add our own condition for the wait method to call. but for this simple usecase its not required.
# we can have if shared_data is empty then wait. else consume.
condition.wait() # Wait until notified
print(f"Consumer got data: {shared_data}")
def producer():
global shared_data
time.sleep(2) # Simulate work
with condition:
shared_data = "Hello from producer"
print("Producer setting data and notifying")
condition.notify() # Wake one waiting thread
t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start()
t2.start()
t1.join()
t2.join()
Flow:
- Consumer locks the condition, calls wait() → releases the lock and waits.
- Producer locks the condition, updates shared_data, calls notify() → wakes the consumer.
- Consumer reacquires the lock and resumes work.
5. wait_for Example
Instead of manually looping and checking, you can do:
import threading
import time
condition = threading.Condition()
data_ready = False
def waiter():
with condition:
condition.wait_for(lambda: data_ready)
print("Data is ready!")
def setter():
global data_ready
time.sleep(2)
with condition:
data_ready = True
condition.notify_all()
t1 = threading.Thread(target=waiter)
t2 = threading.Thread(target=setter)
t1.start()
t2.start()
t1.join()
t2.join()
6. Key Points
- Always acquire the condition’s lock before calling wait() or notify().
- wait() temporarily releases the lock so other threads can update the condition.
- notify() wakes one waiting thread; notify_all() wakes all.
- Use wait_for(predicate) to avoid manual while not condition: loops.
- Without proper signaling, waiting threads can wait forever.
Producer-Consumer Pattern - Threads
1. What is the Producer–Consumer Pattern?
It’s a multi-threading design pattern used when:
- Producers generate data and place it into a shared buffer.
- Consumers take data from that buffer and process it.
- Both run independently, but they must be synchronized to avoid race conditions.
Real-world analogy:
- A factory has workers (producers) making products.
- Delivery trucks (consumers) take products to customers.
- There’s a warehouse (shared buffer) between them.
- If the warehouse is full, workers wait.
- If the warehouse is empty, trucks wait.
2. Why Do We Need It?
Without proper synchronization:
- Producer could overwrite data before it’s read.
- Consumer could read empty data.
- Multiple producers/consumers could corrupt shared memory.
3. Key Concepts
- Shared Buffer → Queue (FIFO storage).
- Mutual Exclusion → Prevent multiple threads from changing the buffer at the same time (Locks).
- Signaling → Wake up waiting threads when data/space is available (Condition variables or Semaphores).
4. Producer–Consumer in Python with queue.Queue
(queue.Queue is thread-safe and handles locking/signaling for you.)
import threading
import queue
import time
import random
buffer = queue.Queue(maxsize=5) # Shared buffer
def producer():
for i in range(10):
item = f"Item-{i}"
buffer.put(item) # Waits if buffer is full
print(f"Produced {item}")
time.sleep(random.uniform(0.2, 0.5))
def consumer():
for _ in range(10):
item = buffer.get() # Waits if buffer is empty
print(f"Consumed {item}")
buffer.task_done()
time.sleep(random.uniform(0.3, 0.6))
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
What happens here:
- queue.Queue automatically uses locks + conditions.
- If maxsize is reached, put() blocks until there’s space.
- If the queue is empty, get() blocks until there’s an item.
5. Producer–Consumer Using Condition Variables (Manual Implementation)
If we don’t use queue.Queue, we can implement it ourselves:
import threading
import time
import random
buffer = []
MAX_SIZE = 5
condition = threading.Condition()
def producer():
for i in range(10):
with condition:
while len(buffer) == MAX_SIZE:
condition.wait() # Wait until there’s space
item = f"Item-{i}"
buffer.append(item)
print(f"Produced {item}")
condition.notify() # Notify a waiting consumer
time.sleep(random.uniform(0.2, 0.5))
def consumer():
for _ in range(10):
with condition:
while not buffer:
condition.wait() # Wait until there’s data
item = buffer.pop(0)
print(f"Consumed {item}")
condition.notify() # Notify a waiting producer
time.sleep(random.uniform(0.3, 0.6))
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
Here:
- condition.wait() releases the lock and waits.
- condition.notify() wakes a waiting thread.
- We manually check while len(buffer) == MAX_SIZE or while not buffer to avoid spurious wakeups.
6. Benefits
- Decouples data production from consumption.
- Improves throughput by allowing concurrent execution.
- Handles different speeds — producers can work ahead, consumers can catch up.
7. Common Pitfalls
- Deadlock if neither side signals properly.
- Starvation if one side always gets priority.
- Memory issues if no limit (maxsize) is set.
Graceful shutdown problem in the Producer–Consumer pattern
If the producer stops producing but the consumer keeps calling get() (or waiting on a condition), the consumer will wait forever unless we signal it to stop.
Why it happens
- queue.get() or condition.wait() blocks until an item arrives.
- If no new items arrive and no one signals "we’re done", the consumer never wakes up.
Common Solutions
1. Use a Sentinel Value
- Special "poison pill" item placed in the queue to tell consumers to stop.
- Consumer checks for it and exits instead of processing.
Example with queue.Queue:
import threading
import queue
import time
SENTINEL = None
buffer = queue.Queue()
def producer():
for i in range(5):
buffer.put(f"Item-{i}")
time.sleep(0.2)
buffer.put(SENTINEL) # Signal no more items
def consumer():
while True:
item = buffer.get()
if item is SENTINEL:
buffer.task_done()
break
print(f"Consumed {item}")
buffer.task_done()
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
2. Use a Shared “Stop” Flag
- Have a shared variable like stop_requested.
- Consumer checks this in a loop along with condition.wait().
Example:
import threading
import time
buffer = []
stop_requested = False
condition = threading.Condition()
def producer():
global stop_requested
for i in range(5):
with condition:
buffer.append(f"Item-{i}")
condition.notify()
time.sleep(0.2)
with condition:
stop_requested = True
condition.notify_all() # Wake all waiting consumers
def consumer():
while True:
with condition:
while not buffer and not stop_requested:
condition.wait()
if stop_requested and not buffer:
break
item = buffer.pop(0)
print(f"Consumed {item}")
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
3. Timeout-based Waiting
- Use queue.get(timeout=...) or condition.wait(timeout=...) to periodically check and break out.
Example:
item = buffer.get(timeout=2)
If the queue is empty for 2 seconds, a queue.Empty exception is raised, and you can handle shutdown.
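A minimal sketch of a timeout-based consumer loop built on that idea; the producer_done flag is a threading.Event added here purely for illustration (it is not part of the original example):
```python
import queue
import threading
import time

buffer = queue.Queue()
producer_done = threading.Event()  # illustrative shutdown flag

def producer():
    for i in range(3):
        buffer.put(f"Item-{i}")
        time.sleep(0.5)
    producer_done.set()            # signal that no more items will arrive

def consumer():
    while True:
        try:
            item = buffer.get(timeout=2)   # wake up periodically
        except queue.Empty:
            if producer_done.is_set():     # queue drained and producer finished
                break
            continue                       # just a quiet period, keep waiting
        print(f"Consumed {item}")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```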
✅ Best practice: For a clean shutdown in multi-threaded producer–consumer, the sentinel approach is the most common, especially when you have multiple consumers — you send one sentinel per consumer so each gets the stop signal.
Producer and Consumer Pattern - Process
1. What is the Producer–Consumer Pattern?
- Producer: generates data (tasks, items, etc.) and puts them into a shared buffer/queue.
- Consumer: takes data from the shared buffer and processes it.
- Goal: Decouple data production from data consumption so they can run in parallel.
In multiprocessing, producers and consumers run in separate processes and communicate via inter-process communication (IPC) — most commonly, a multiprocessing.Queue.
2. Why Multiprocessing Instead of Threading?
- Threading is limited by Python’s GIL (Global Interpreter Lock), which prevents true parallel execution of CPU-bound code.
- Multiprocessing spawns separate processes, each with its own Python interpreter and memory space → true parallelism.
3. Multiprocessing Producer–Consumer Example
Using multiprocessing.Queue:
import multiprocessing
import time
import random
SENTINEL = None # Stop signal
def producer(queue, producer_id):
for i in range(5):
item = f"Producer-{producer_id} Item-{i}"
print(f"[Producer-{producer_id}] Producing {item}")
queue.put(item)
time.sleep(random.uniform(0.2, 0.5))
queue.put(SENTINEL) # Send stop signal
def consumer(queue, consumer_id):
while True:
item = queue.get()
if item is SENTINEL:
print(f"[Consumer-{consumer_id}] Stopping")
queue.put(SENTINEL) # Pass sentinel to other consumers
break
print(f"[Consumer-{consumer_id}] Consumed {item}")
time.sleep(random.uniform(0.3, 0.6))
if __name__ == "__main__":
queue = multiprocessing.Queue()
# Multiple producers
producers = [multiprocessing.Process(target=producer, args=(queue, i)) for i in range(2)]
consumers = [multiprocessing.Process(target=consumer, args=(queue, i)) for i in range(2)]
for p in producers:
p.start()
for c in consumers:
c.start()
for p in producers:
p.join()
for c in consumers:
c.join()
4. How This Works
- Queue is the shared buffer between processes.
- Producers create items and put them in the queue.
- Consumers take items from the queue and process them.
- Sentinel value (None) tells consumers to stop.
5. Why Use a Sentinel in Multiprocessing?
In multiprocessing, consumers block indefinitely on queue.get() if no items arrive.
To stop cleanly:
- Send one sentinel per consumer so each process knows when to exit.
- Or, as in the above example, pass the sentinel along until all consumers receive it.
6. Variations
- Multiple producers and multiple consumers — requires more sentinels.
- Joinable queues (multiprocessing.JoinableQueue) to ensure task completion tracking.
- Process pools with concurrent.futures.ProcessPoolExecutor — easier for simpler patterns.
Semaphore
A semaphore is a synchronization primitive that controls access to a shared resource by using a counter.
- When a thread/process acquires the semaphore → counter is decremented.
- When it releases the semaphore → counter is incremented.
- If the counter is 0, further acquires will block until it’s released.
Why Use Semaphores?
- To limit concurrent access to a resource (e.g., max 3 threads can access at once).
- To implement signaling between threads or processes.
- To avoid race conditions.
- Threads can acquire a semaphore when its count is positive, decrementing the counter
- If the counter reaches zero, threads attempting to acquire the semaphore are blocked and queued
- Threads release the semaphore when done, incrementing the counter and potentially signaling waiting threads
Types of Semaphores
Counting Semaphore
- Counter can be any positive integer.
- Controls how many threads/processes can access a resource at the same time.
- Example: A database connection pool with max 5 concurrent connections.
Binary Semaphore
- Special case of a counting semaphore where the counter can only be 0 or 1.
- Works like a gate — either locked (0) or unlocked (1).
- Similar to a mutex, but:
  - Not owned: Any thread can release it, not just the one that acquired it.
  - Often used for signaling rather than strict ownership.
Key Methods
Both threading.Semaphore and multiprocessing.Semaphore support:
- acquire(blocking=True, timeout=None) → Decrease counter; block if zero.
- release() → Increase counter; wake up a waiting thread/process.
- with semaphore: → Context manager for acquire/release.
Threading Example – Counting Semaphore
import threading
import time
sem = threading.Semaphore(2) # At most 2 threads can enter
def task(name):
print(f"{name} waiting...")
with sem:
print(f"{name} entered")
time.sleep(2)
print(f"{name} leaving")
threads = [threading.Thread(target=task, args=(f"Thread-{i}",)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
Behavior: Only 2 threads run the critical section at the same time.
Threading Example – Binary Semaphore
import threading
import time
binary_sem = threading.Semaphore(1) # Binary semaphore
def critical_section(name):
print(f"{name} waiting...")
with binary_sem:
print(f"{name} entered")
time.sleep(2)
print(f"{name} leaving")
t1 = threading.Thread(target=critical_section, args=("Thread-1",))
t2 = threading.Thread(target=critical_section, args=("Thread-2",))
t1.start()
t2.start()
t1.join()
t2.join()
Behavior: Only 1 thread can run at a time — similar to a mutex, but no ownership rules.
Multiprocessing Example – Counting Semaphore
import multiprocessing
import time
def task(sem, pid):
print(f"Process-{pid} waiting...")
with sem:
print(f"Process-{pid} entered")
time.sleep(2)
print(f"Process-{pid} leaving")
if __name__ == "__main__":
sem = multiprocessing.Semaphore(2) # Allow 2 processes
processes = []
for i in range(4):
p = multiprocessing.Process(target=task, args=(sem, i))
processes.append(p)
p.start()
for p in processes:
p.join()
Multiprocessing Example – Binary Semaphore
import multiprocessing
import time
def critical_task(sem, pid):
print(f"Process-{pid} waiting...")
with sem:
print(f"Process-{pid} entered")
time.sleep(2)
print(f"Process-{pid} leaving")
if __name__ == "__main__":
binary_sem = multiprocessing.Semaphore(1) # Binary semaphore
processes = []
for i in range(2):
p = multiprocessing.Process(target=critical_task, args=(binary_sem, i))
processes.append(p)
p.start()
for p in processes:
p.join()
When to Use
- Counting Semaphore → When you want to limit concurrent access (e.g., allow up to N threads).
- Binary Semaphore → When you just need a single permit (e.g., turnstile or signaling).
- Mutex/Lock → When you want strict thread/process ownership rules.
Mutex vs Semaphore
Binary Semaphore vs Mutex
Feature | Binary Semaphore | Mutex (Lock) |
---|---|---|
Ownership | No ownership (any thread/process can release) | Owned by the thread that acquires |
Value Range | 0 or 1 | N/A (locked/unlocked) |
Usage | Signaling or simple mutual exclusion | Strict mutual exclusion |
Process Safe | Yes | Yes |
Race condition vs Data Race
Race Condition
Definition: A race condition happens when the program’s correctness depends on the timing or interleaving of threads/processes. If execution order changes, the output changes — often leading to bugs.
Race conditions are flaws in the timing or order of a program's execution that cause incorrect behavior. They are often more difficult to detect and prevent than data races.
Key Points:
- Caused by unsynchronized access to shared resources.
- Can happen even without simultaneous access (e.g., bad logic relying on timing).
- Happens in multi-threaded or multi-process environments.
Example (Python, threading):
import threading
counter = 0
def increment():
global counter
for _ in range(1000000):
counter += 1
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(counter) # Expected 2,000,000 but often less
Why? Both threads read the old value and write back at the same time → lost updates.
Data Race
Definition: A data race is a specific type of race condition where:
Data races occur when two or more threads concurrently access the same memory location, with at least one thread writing to or changing that memory value. This can lead to threads overwriting each other or reading incorrect values. Data races can be detected using automated tools and prevented by ensuring mutual exclusion for shared resources
- Two or more threads access the same variable concurrently.
- At least one of the accesses is a write.
- There is no synchronization (e.g., no locks).
Key Points:
- Always involves shared data.
- Always causes undefined behavior (in languages like C++); in Python it causes logical errors.
- Data race is a race condition, but not all race conditions are data races.
Example (data race in Python):
import time
import threading
value = 0
def increment():
global value
for _ in range(100000):
val = value # read
val += 1 # modify
time.sleep(0.000001) # force context switch
value = val # write
if __name__ == "__main__":
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print("Expected:", 2 * 100000)
print("Actual:", value)
Here:
- Writer is writing.
- Reader is reading at the same time.
- No locks → data race.
Data race in Multiprocessing
Data races can occur in multiprocessing, but the situation is a bit different from threads.
Why multiprocessing is different
- In threading, threads share the same memory space, so a race happens naturally if multiple threads touch the same variable without synchronization.
- In multiprocessing, each process has its own separate memory space by default, so normal Python variables are not shared — which means you can’t have a data race on a normal variable between processes.
When data races can happen in multiprocessing
A race can occur if you explicitly share memory between processes, for example:
- Using multiprocessing.Value or multiprocessing.Array (shared C-level memory).
- Using multiprocessing.shared_memory (Python 3.8+).
- Using memory-mapped files (mmap).
- Using some database or file without locking.
In these cases, two processes can modify the same memory concurrently, causing lost updates just like in threading.
Data race with multiprocessing.Value
from multiprocessing import Process, Value
import time
def increment(shared_counter):
for _ in range(100000):
val = shared_counter.value # read
val += 1 # modify
time.sleep(0.000001) # force context switch
shared_counter.value = val # write
if __name__ == "__main__":
counter = Value('i', 0) # 'i' = signed int, shared memory
p1 = Process(target=increment, args=(counter,))
p2 = Process(target=increment, args=(counter,))
p1.start()
p2.start()
p1.join()
p2.join()
print("Expected:", 2 * 100000)
print("Actual:", counter.value)
Possible output:
Expected: 200000
Actual: 134527
How to avoid multiprocessing data races
- Use a multiprocessing.Lock:
def increment(shared_counter, lock):
for _ in range(100000):
with lock:
shared_counter.value += 1
- Or use multiprocessing.Manager() objects, which handle synchronization internally.
- Or design processes to pass messages via multiprocessing.Queue instead of sharing memory directly.
Race Condition vs Data Race
Data races and race conditions are different problems that are often confused due to their similar names. It's possible to have data races without race conditions and vice versa. Many race conditions are caused by data races, and many data races lead to race conditions, but they are not dependent on each other.
Aspect | Race Condition | Data Race |
---|---|---|
Definition | Any bug where outcome depends on timing/order of execution | A race condition involving unsynchronized read/write to the same memory |
Involves shared data? | Not necessarily | Always |
Requires write? | Not always | Yes, at least one write |
Fix method | Synchronization, ordering guarantees, atomic operations | Locks, atomic operations, thread-safe structures |
How to Avoid Both
- Locks / Mutexes – Ensure only one thread can access shared data at a time.
- Atomic Operations – Use atomic counters or thread-safe primitives.
- Immutable Data – No write access, only reads.
- Queues / Channels – Pass data between threads instead of sharing.
- Proper Design – Avoid depending on execution order when possible.
Barrier
A barrier is a synchronization primitive used in threading and multiprocessing to coordinate multiple threads or processes so they all wait at a certain point in the program until every participant has reached that point.
A barrier is a synchronization mechanism for thread coordination
It serves as a stopping point for a group of threads, preventing them from proceeding until all (or a sufficient number of) threads have reached the barrier
How a Barrier Works
- Suppose you have several threads or processes working in parallel.
- At some stage, they must all complete a certain part of work before any can move forward.
- Each thread/process calls .wait() on the barrier when it reaches this point.
- The barrier “blocks” each participant when they call .wait(), until all required participants have called .wait().
- Once all participants have arrived, the barrier is lifted simultaneously, and every thread/process can proceed with the next stage of work.
Why Do We Need Barriers?
- Synchronization Across Participants: Barriers ensure all threads or processes reach a specific stage before any move ahead. This is essential when the next steps depend on everyone finishing their current phase (e.g., parallel computations that need to be combined).
- Coordination of Phased Work: In algorithms where work is structured in steps or phases, barriers let you cleanly transition to the next step only after all have completed the previous one.
- Simplifies Code: Rather than manually tracking how many threads have finished and using multiple locks or conditions, barriers provide a simple, reusable object to synchronize groups of workers.
Example Use Cases
-
Combining results: Multiple threads process parts of a dataset. The main thread combines results only after everyone is finished.
-
Phase synchronization: Multiple stages in a complex algorithm, where all workers must align between stages (e.g., parallel matrix computations, distributed simulation time steps).
-
Initialization: Ensure all workers are ready before starting the main task.
Example
Suppose we have two kinds of threads: one doubles a shared number, and the others add 3 to it. A lock prevents the data race on the shared variable, but the result is still inconsistent because it depends on which thread happens to run first (a timing-based race condition).
To overcome this timing-related race condition we can use a barrier, and we need to decide where to place it.
Flow without a barrier: the result depends on whether the doubler runs before or after the adders.
Desired flow: all the threads that add 3 to the variable execute first, and only then does the thread that doubles the value run.
Code
Where we place barrier.wait() determines which threads wait and which continue. With the barrier placed as described, all of the "add 3" threads run first; the doubler thread waits at the barrier right at its beginning and only executes once every adder has arrived. A minimal sketch of this arrangement follows.
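The original code for this example is not reproduced here, but a sketch of the idea, assuming two adder threads and one doubler thread, might look like this:
```python
import threading

value = 0
value_lock = threading.Lock()      # prevents the data race on `value`
barrier = threading.Barrier(3)     # 2 adder threads + 1 doubler thread

def add_three():
    global value
    with value_lock:
        value += 3
    barrier.wait()                 # signal "my addition is done"

def double():
    global value
    barrier.wait()                 # wait until every adder has finished
    with value_lock:
        value *= 2

threads = [threading.Thread(target=add_three),
           threading.Thread(target=add_three),
           threading.Thread(target=double)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(value)                       # always (0 + 3 + 3) * 2 = 12
```
Because the doubler calls `barrier.wait()` before touching the shared value, the barrier guarantees the doubling always happens last, making the result deterministic.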
Barrier in threading
- Implemented by `threading.Barrier`.
- Useful when you have a fixed number of threads that must all reach a certain stage before continuing.
- The barrier resets automatically after all participants have passed through.
Example (threading.Barrier)
```python
import threading
import time
import random

def worker(barrier, thread_id):
    print(f"Thread {thread_id} is working...")
    time.sleep(random.uniform(0.5, 2))  # simulate work
    print(f"Thread {thread_id} waiting at barrier")
    barrier.wait()                      # wait until all threads reach here
    print(f"Thread {thread_id} passed the barrier")

# Barrier for 3 threads
barrier = threading.Barrier(3)
threads = [threading.Thread(target=worker, args=(barrier, i)) for i in range(3)]

for t in threads:
    t.start()
for t in threads:
    t.join()
```
Output (order may vary):
```
Thread 0 is working...
Thread 1 is working...
Thread 2 is working...
Thread 1 waiting at barrier
Thread 0 waiting at barrier
Thread 2 waiting at barrier
Thread 2 passed the barrier
Thread 0 passed the barrier
Thread 1 passed the barrier
```
All threads start work, but none pass the barrier until all 3 reach it.
Barrier in multiprocessing
```python
from multiprocessing import Process, Barrier
import time, random

def worker(barrier, pid):
    print(f"Process {pid} working...")
    time.sleep(random.uniform(0.5, 2))
    print(f"Process {pid} waiting at barrier")
    barrier.wait()
    print(f"Process {pid} passed the barrier")

if __name__ == "__main__":
    # multiprocessing.Barrier has the same API as threading.Barrier
    # (available since Python 3.3)
    barrier = Barrier(3)
    processes = [Process(target=worker, args=(barrier, i)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```
Barriers help prevent race conditions by forcing multiple threads or processes to synchronize at a specific point in execution, ensuring all participants have reached a safe state before any advance further.
How Barriers Prevent Race Conditions
- In concurrent programs, a race condition occurs when multiple threads/processes access and modify shared resources unpredictably, leading to incorrect or inconsistent results.
- A barrier makes every worker stop and wait until all expected workers reach the barrier.
- This ensures that shared data or resources have been fully prepared or processed before moving to the next stage—no thread/process can go ahead and use incomplete/unstable data.
Example Scenario
Suppose each thread/process needs to initialize part of a shared array before all threads/processes start computation:
- Each thread initializes its assigned section.
- All threads/processes reach a barrier.
- Only when all are finished with initialization does anyone start computation, guaranteeing every array section is ready.
If threads are allowed to advance independently, a thread could start computation with uninitialized data from others, causing a data race.
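A minimal sketch of that scenario with `threading.Barrier` (the array contents and worker count are illustrative):
```python
import threading

NUM_WORKERS = 4
data = [None] * NUM_WORKERS
barrier = threading.Barrier(NUM_WORKERS)

def worker(index):
    data[index] = index * 10          # initialize only my own section
    barrier.wait()                    # wait until every section is initialized
    total = sum(data)                 # safe: no entry is still None
    print(f"Worker {index} sees total {total}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
Each worker writes only its own slot before the barrier and reads the whole array only after it, so no thread ever observes a partially initialized array.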
Why Use Barriers for Race Condition Prevention?
- Coordinates stages: Ensures prerequisite work by all parties is complete before progressing.
- Avoids accessing shared resources in inconsistent states.
- Eliminates timing-based errors that happen when one worker gets ahead of others too soon.
Summary
While barriers do not protect individual updates to shared variables (that requires locks or atomic operations), they do prevent race conditions arising from untimely progression—making sure all threads/processes stay in sync across stages where shared state consistency is crucial.
Lock vs Barriers
-
Locks/Mutexes: Used to prevent data races by ensuring only one thread or process accesses a shared resource (like a variable or data structure) at a time. They enforce mutual exclusion so that critical sections (where shared data is read/written) are protected from concurrent modification; this keeps data safe and consistent.
-
Barriers: Used to prevent certain types of race conditions related to sequencing or coordination (not data races on shared data). Barriers force all threads/processes to reach a specific execution point before any proceed, ensuring that operations depending on collective progress or multi-stage computations are correctly synchronized. Barriers do not provide mutual exclusion; they coordinate the timing of multiple participants, not access to critical sections.
Key Differences
Aspect | Lock/Mutex | Barrier |
---|---|---|
Main Role | Protect shared data/critical section | Synchronize execution phases |
Prevent | Data races (simultaneous data access) | Race conditions (out-of-sequence) |
Enforces | Only 1 thread in critical section | All wait until group ready |
Example Use | Incrementing a shared counter | Waiting for all to finish phase 1 |
Summary:
- Use a lock/mutex for direct access protection to shared data.
- Use a barrier to coordinate program phase progression and process/thread synchronization.
Both prevent "race conditions" in the broader sense, but mutex = data access safety, barrier = execution order safety.
Computational Graphs
The key to parallel programming is determining which steps within a program can be executed in parallel, and then figuring out how to coordinate them. One tool for modeling how the steps in a program relate to each other is a computational graph.
-
Computational graphs model relationships between program steps
-
Graphs visualize which steps can be executed in parallel to help coordinate parallel execution and identify dependencies
Directed acyclic graphs (DAGs)
-
Nodes represent tasks or units of work
-
Directed edges indicate progression and dependencies between tasks
Key Terminology
- Spawn/fork: Initiates parallel execution paths
- Sync/join: Synchronizes completion of parallel tasks
- Asynchronous: Tasks that can occur in any order relative to each other
- Critical path: Longest sequence of dependent operations in the graph
Analyzing Parallel Potential
- Work: Total execution time on a single processor (sum of all task times)
- Span: Shortest possible execution time with maximum parallelization (sum of critical path times)
- Ideal parallelism: Ratio of work to span, indicating maximum speed improvement
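For example, with made-up task times: suppose task A (2 s) must run first, tasks B (3 s) and C (4 s) can then run in parallel, and task D (1 s) depends on both B and C. The work is 2 + 3 + 4 + 1 = 10 s, the span is the critical path A → C → D = 2 + 4 + 1 = 7 s, and the ideal parallelism is 10 / 7 ≈ 1.43, so even with unlimited processors this graph can run at most about 1.4× faster than the sequential version.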
Challenges with Creating Many Threads
- As more tasks are added, a program may end up spawning a new thread for each one
- Creating a new thread for each task can lead to inefficiencies because thread creation incurs overhead in terms of processor time and memory usage
Thread Pools
- Thread pools can be a more efficient alternative to creating individual threads for each task
- Thread pools maintain a small collection of worker threads that can be reused to execute multiple tasks
- Submitting tasks to a thread pool is analogous to adding items to a to-do list for worker threads
Benefits of thread pools
- Reusing threads from a pool overcomes the overhead of creating new threads for each task
- Thread pools are especially advantageous when the task execution time is less than thread creation time
- Using preexisting threads in a pool can make programs more responsive by eliminating the delay of thread creation
Thread pool
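As a rough sketch of how a thread pool is used in Python, `concurrent.futures.ThreadPoolExecutor` maintains the reusable worker threads; the task function below is just a placeholder for an I/O-bound job.
```python
from concurrent.futures import ThreadPoolExecutor
import time

def task(n):
    time.sleep(0.5)          # simulate an I/O-bound wait
    return n * n

# Three reusable worker threads process ten tasks from the "to-do list"
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, range(10)))

print(results)               # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
Only three threads are ever created, no matter how many tasks are submitted, which avoids paying the thread-creation overhead per task.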
Future
- Asynchronous tasks allow multiple operations to be accomplished simultaneously
- Futures are a mechanism for handling the results of asynchronous operations
- It acts as a placeholder for a result that will be available at a later time
- A future is like an "I owe you" note for the result of an asynchronous task
Working with Futures
- Futures are read-only and may not have an immediate result
- A thread might need to wait for the future to be resolved
- "Resolving" or "fulfilling" a future means writing the result value to it
Divide and Conquer Algorithms
- Divide-and-conquer algorithms break down complex problems into smaller, manageable subproblems
- They are well suited for parallel execution across multiple processors
A divide-and-conquer algorithm is structured in three parts:
- Divide: The main problem is split into smaller subproblems of roughly equal size
- Conquer: Each subproblem is solved recursively
- Combine: Solutions to subproblems are merged to form the final solution
Advantages
- Subproblems are independent, allowing for parallel execution on different processors
- Can significantly reduce computation time for large datasets
Example - Single process
Example - Multi process
Here the process pool is created the first time the function is called. In the base case, the actual work is submitted to the process pool and the resulting future is returned in a list (so the partial results can later be combined by looping through the list). In the else branch, the problem is divided, the function recurses on each half, and the two lists of futures are concatenated.
Finally, the futures' results are collected and summed to produce the overall total. A sketch of this structure is shown below.
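The original example code is not reproduced here; a sketch along the lines described above (the helper name and threshold value are illustrative) might look like this:
```python
from concurrent.futures import ProcessPoolExecutor

_pool = None  # created lazily on the first call

def par_sum(data, threshold=1000):
    """Return a list of Futures whose results add up to sum(data)."""
    global _pool
    if _pool is None:
        _pool = ProcessPoolExecutor()
    if len(data) <= threshold:               # base case: do the work in the pool
        return [_pool.submit(sum, data)]
    mid = len(data) // 2                      # divide
    # conquer recursively and combine the two lists of futures
    return par_sum(data[:mid], threshold) + par_sum(data[mid:], threshold)

if __name__ == "__main__":
    numbers = list(range(100000))
    futures = par_sum(numbers)
    total = sum(f.result() for f in futures)  # combine: wait for and add all partial sums
    print(total == sum(numbers))              # True
    _pool.shutdown()
```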
Weak Scaling vs Strong Scaling
Weak scaling: Increase the problem size
- Weak scaling allows us to tackle larger problems in the same amount of time by adding more processors while keeping the workload per processor constant
For example, one person can decorate 10 cupcakes in an hour, while two people working in parallel can decorate 20 cupcakes in the same time
Definition: Measure how the runtime changes when you increase the number of processors while keeping the workload per processor constant.
- Goal: Solve a bigger problem in the same amount of time.
- Ideal case: Runtime should stay the same even as the problem size grows proportionally to the number of processors.
Weak Scaling (Workload grows with processors):
```
Workload = 100M per processor
              ┌───────────── Runtime ─────────────┐
Proc=1 (100M) | ########## (10s)
Proc=2 (200M) | ########## (10s)
Proc=4 (400M) | ########## (10s)
Proc=8 (800M) | ########## (10s)
```
Goal: runtime stays constant as workload grows.
Strong scaling: Accomplish tasks faster
- Strong scaling involves breaking down a fixed-size problem across multiple processors to execute it faster
For example, decorating 10 cupcakes takes one person an hour, but two people working together can complete the task in about 30 minutes
Definition: Measure how the runtime decreases when you increase the number of processors for a fixed total problem size.
- Goal: Solve the same problem faster by adding more processors.
- Ideal case: Runtime should shrink proportionally to the number of processors (linear scaling).
Strong Scaling (Fixed problem size):
```
Workload = 1B elements
       ┌───────────── Runtime ─────────────┐
Proc=1 | ############################ (100s)
Proc=2 | ############### (50s)
Proc=4 | ######### (25s)
Proc=8 | ##### (13s)
```
Goal: runtime shrinks with more processors.
Summary Table
Aspect | Strong Scaling | Weak Scaling |
---|---|---|
Problem size | Fixed | Increases with #processors |
Goal | Run same problem faster | Handle bigger problems in same time |
Ideal runtime | ↓ (shrinks with processors) | ≈ constant |
Typical limitation | Overheads, communication, Amdahl’s Law | Network/communication bottlenecks |
✅ Strong Scaling = speedup for fixed work. ✅ Weak Scaling = efficiency for growing work.
Key Metrics in Parallel Computing
Throughput
- Throughput measures the number of tasks completed in a given time period
- Adding more processors increases overall throughput, for example, from 10 cupcakes/hour with one worker to 30 cupcakes/hour with three workers
Latency
- Latency is the time taken to execute a single task from start to finish
- It remains constant per task even when multiple processors are used (for example, six minutes to decorate one cupcake)
Speedup
- Speedup is calculated as the ratio of sequential execution time to parallel execution time
- For example, if one worker takes 60 minutes and two workers take 30 minutes, the speedup is 2
Amdahl's Law
- Amdahl's Law is an equation used to estimate the potential speedup of a parallel program
- It helps determine the effectiveness of parallelizing a program based on its parallelizable portion
Amdahl's law is great for estimating the speedup that you might achieve by parallelizing a program
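In its standard form, Amdahl's law gives the overall speedup S for a program in which a fraction P of the work can be parallelized across N processors:

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}$$

The second expression is the hard ceiling on speedup imposed by the serial portion, no matter how many processors are added.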
Limitations on the Effectiveness of Parallelization
- The degree of parallelization significantly impacts potential speedup
- Programs with 90% parallelizable code have a maximum speedup of 10
- 75% parallelizable programs are limited to a speedup of 4
- 50% parallelizable programs can achieve a maximum speedup of only 2
With 2 processes and 95% of the program parallelizable, the overall speedup of the entire program is about 1.9×.
With 1,000 processes, the speedup is still stuck at roughly 20×.
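These figures follow directly from the formula above; a quick sanity check in Python (the helper function name is illustrative):
```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n processors."""
    return 1 / ((1 - p) + p / n)

print(round(amdahl_speedup(0.95, 2), 2))      # ~1.9
print(round(amdahl_speedup(0.95, 1000), 1))   # ~19.6, i.e. stuck around 20x
print(round(amdahl_speedup(0.90, 10**9), 1))  # approaches the 10x ceiling for 90% parallel code
```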
Decision-Making in Parallel Programming
- Amdahl's Law helps determine whether parallelizing a program is worthwhile
- It illustrates why parallel computing is most beneficial for highly parallelizable programs
- The costs and overhead of parallelization should be weighed against potential benefits
- Parallelizing everything is not always the best approach, despite the availability of multicore processors
Measuring Speedup and Efficiency
Calculating Speedup
Calculating Efficiency
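The formulas themselves are not reproduced above, but the standard definitions are:

$$\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}, \qquad \text{efficiency} = \frac{\text{speedup}}{\text{number of processors}}$$

For example, if a task takes 60 s sequentially and 20 s on 4 processors, the speedup is 3 and the efficiency is 3/4 = 75%.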
Benchmarking Best Practices
- Limit other running programs to avoid resource competition
- Average multiple independent runs to account for execution variability
- Allow for "warm-up" in environments with just-in-time compilation
- Run the algorithm once before measurement to ensure a consistent cache state
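A minimal sketch of these practices in Python, using `time.perf_counter`, an unmeasured warm-up run, and averaging over several independent runs (the workload function is a placeholder):
```python
import time

def benchmark(func, runs=5):
    func()                              # warm-up run: primes caches / JIT, not measured
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)  # average over independent runs

def workload():
    sum(i * i for i in range(1_000_000))  # placeholder task to be measured

print(f"Average runtime: {benchmark(workload):.4f} s")
```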
Parallel design stages
Parallel design in programming typically follows a sequence of well-defined stages that help convert a serial (single-threaded) problem into an efficient parallel algorithm. The aim is to maximize concurrency, minimize inter-process communication, and achieve balanced workload distribution across computational units.
Partitioning
- Goal: Divide the overall computation and data into smaller independent tasks.
- Tasks can be split via data parallelism (same operation, different data chunks) or task parallelism (different operations on either shared or private data).
- Example: In matrix multiplication, each element calculation can be handled as an independent task.
- Partitioning involves dividing the problem into discrete chunks of work for distribution among multiple tasks
- At this stage of the design process, the focus is on maximum decomposition without considering practical limitations like processor count
- Different decomposition methods may have varying advantages depending on the problem and hardware
- Domain decomposition and **functional** decomposition offer valuable perspectives on problem-solving in parallel computing
Domain (data) decomposition
- Focuses on dividing the data associated with the problem into small, ideally equal-sized partitions
- Computations are then associated with the partitioned data
Functional decomposition
- Begins by considering all computational work required by the program
- Divides the overall work into separate tasks performing different portions
- Data requirements for these tasks are a secondary consideration
Communication
After decomposing a problem into separate tasks, the next step is to establish communication between tasks. Communication involves coordinating execution and sharing data
- Goal: Identify and design interactions needed between the tasks.
- This includes transfer of intermediate data, synchronization signals, barriers, and locking mechanisms.
- Minimizing communication and avoiding bottlenecks are key to performance.
Types of Task Communication
- Independent tasks
- Some problems can be decomposed into tasks that don't require data sharing
- Interdependent tasks
- Tasks that require information from other tasks to complete their work
Communication Structures
- Point-to-point communication
- Direct links between neighboring tasks
- Suitable when each task communicates with a small number of other tasks
- Involves sender (producer) and receiver (consumer) roles
- Collective communication
- Used when tasks need to communicate with a larger group
- Includes broadcasting (one-to-many) and scattering/gathering (distributing and collecting data)
- Centralized management
- One task acts as a central coordinator for a group of distributed workers
- Can become a bottleneck as the number of workers increases
- Divide-and-conquer strategies can help distribute computation and communication load
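As an illustration of the point-to-point (producer/consumer) pattern, a minimal sketch using `multiprocessing.Queue` (the names and the sentinel convention are illustrative):
```python
from multiprocessing import Process, Queue

def producer(queue):
    for i in range(5):
        queue.put(i)          # send data to the consumer
    queue.put(None)           # sentinel: signal "no more data"

def consumer(queue):
    while True:
        item = queue.get()    # blocks until a message arrives
        if item is None:
            break
        print("Consumed:", item)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q,))
    c = Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()
```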
Communication Factors to Consider
- Synchronous vs. asynchronous communication
- Synchronous (blocking): Tasks wait for communication to complete before continuing
- Asynchronous (non-blocking): Tasks can perform other work while communication is in progress
- Performance considerations
- Processing overhead: Time spent on communication versus data processing
- Latency: Time for a message to travel from sender to receiver
- Bandwidth: Amount of data that can be communicated per unit of time
Agglomeration
- Goal: Combine smaller tasks or groups for more efficient execution.
- Merge fine-grained tasks to reduce scheduling overhead or communication costs.
- Helps manage granularity: not too fine (overhead), and not too coarse (missed parallelism).
Granularity
Fine-grained parallelism
- Breaks a program into a large number of small tasks
- Advantages: Allows for better load balancing across processors
- Disadvantages: Increases overhead for communication and synchronization, resulting in a low computation-to-communication ratio
Coarse-grained parallelism
- Splits the program into a small number of large tasks
- Advantages: Has lower communication overhead, allowing more time for computation
- Disadvantages: May produce load imbalance, with some tasks processing more data while others remain idle
Medium-grained parallelism
- Balances the trade-offs between fine and coarse-grained approaches
- Often the most efficient solution for general purpose computers
Recommendations
- Avoid hard-coded limits on the number of tasks
- Use compile-time or runtime parameters to control granularity
- Design programs to adapt to changes in the number of available processors
Mapping
- Goal: Allocate tasks to processes, threads, or processors.
- Tasks need to be mapped in a way that balances load and minimizes idle time, taking into account hardware and resource constraints.
- Consider dynamic vs. static mapping depending on workload variability.
- Mapping is the fourth and final stage of our parallel design process
- It involves specifying where each established task will be executed
Applicability
- Mapping is not applicable for single-processor systems or systems with automated task scheduling
- It becomes relevant in distributed systems or specialized hardware with multiple parallel processors