Shared Memory in Multiprocessing

Now, if you’ve ever tried to write code for multiple processes working on the same data set, you know how frustrating it can be. You might have even resorted to passing around pickled objects or using a database as a middleman just to avoid dealing with shared memory. Chill out, don’t worry! Today we’re going to show you how to use Python’s built-in multiprocessing module to share data between processes without all the hassle.

First, let’s start by creating some sample code that demonstrates the problem we’re trying to solve:

# Import necessary modules
import time
from multiprocessing import Process, JoinableQueue  # JoinableQueue supports task_done() and join()

# Worker function: pull items off the queue and "process" them
def worker(q):
    while True:
        item = q.get()      # Get the next item from the queue
        # Do something with the item...
        time.sleep(1)       # Simulate some work
        q.task_done()       # Tell the queue this task is finished

# Check if the script is being run directly
if __name__ == '__main__':
    queue = JoinableQueue()  # A plain Queue has no task_done()/join(), so use JoinableQueue
    for i in range(5):
        queue.put(i)         # Put 5 items into the queue

    processes = []           # Keep track of the worker processes
    for _ in range(3):
        p = Process(target=worker, args=(queue,))
        p.daemon = True      # Daemon processes are killed when the main process exits
        p.start()
        processes.append(p)

    # Block until every item that was put on the queue has been processed
    queue.join()

    for p in processes:
        p.terminate()        # The workers loop forever, so stop them explicitly

In this example, we’re creating a simple worker function that pulls items off a JoinableQueue and performs some task on them (in this case, just sleeping for 1 second). We then spin up three worker processes using multiprocessing’s Process class, passing in our queue as an argument.

The problem with this code is that each worker process ends up with its own private copy of the data. Every item a worker pulls off the queue is pickled in the parent and rebuilt inside the worker’s own address space, so anything a worker does to that copy is invisible to the other processes and to the parent. If your workers need to collaborate on one shared data structure, that’s a recipe for all sorts of headaches down the line, as the little sketch below shows.
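To see that copying in action, here’s a tiny sketch (the names are just for illustration, not part of the example above): a list is handed to a child process, the child mutates it, and the parent never sees the change because the child only ever had a pickled copy.

from multiprocessing import Process

# Child process: mutate the list it was given
def mutate(numbers):
    numbers.append(99)              # modifies the child's private copy only
    print("In child:", numbers)     # [0, 1, 2, 3, 4, 99]

if __name__ == '__main__':
    numbers = [0, 1, 2, 3, 4]
    p = Process(target=mutate, args=(numbers,))
    p.start()
    p.join()
    print("In parent:", numbers)    # still [0, 1, 2, 3, 4] -- the change is lost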

To solve this problem, we need actual shared memory instead of passing around pickled objects or using a database as a middleman. Python’s multiprocessing module provides shared-memory primitives for exactly this: shared ctypes arrays from multiprocessing.sharedctypes (and, on Python 3.8+, the multiprocessing.shared_memory.SharedMemory class), which let multiple processes read and modify the same block of memory simultaneously. Here’s a version built around a shared ctypes array:

# Import necessary modules
from multiprocessing import Process, Value, sharedctypes

# A small wrapper around a shared ctypes array plus a shared element counter.
# (RawArray is a factory function, not a class, so it can't be subclassed;
# instead we keep the shared array and the current length as attributes.)
class MySharedData:
    def __init__(self, size=1024 * 1024):
        self._array = sharedctypes.RawArray('i', size)  # shared array of 'i' (int) slots
        self._count = Value('i', 0)                      # shared count of used slots

    # Append an item to the shared data
    def append(self, item):
        with self._count.get_lock():                     # guard against concurrent appends
            self._array[self._count.value] = item
            self._count.value += 1

    # Get an item from the shared data at a specific index
    def get_item(self, index):
        return self._array[index]

    def __len__(self):
        return self._count.value

# Worker function executed by each process: it reads the shared data in place
def worker(data):
    print("Data:", [data.get_item(i) for i in range(len(data))])
    print("Data length:", len(data))
    print("Data type:", type(data))

# Check if the script is being run directly
if __name__ == '__main__':
    # Create an instance of the MySharedData class
    data = MySharedData()

    # Append 5 items to the shared data
    for i in range(5):
        data.append(i)

    # Create 3 processes, each of which sees the same shared array
    processes = []
    for _ in range(3):
        p = Process(target=worker, args=(data,))
        p.start()
        processes.append(p)

    # Wait for the workers to finish
    for p in processes:
        p.join()

In this example, we’ve created a small shared-data wrapper called MySharedData built on top of multiprocessing.sharedctypes.RawArray. RawArray gives us a fixed-size array of integers (or any other ctypes type) that lives in shared memory, so every process can read and write the very same buffer.

We then use the append() method to add items to our shared data set, and the get_item() method to retrieve them later on. Note that, unlike the Queue in our previous example, we’re not handing each process a pickled copy of the data; every process is reading the same memory locations directly!
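If you want to convince yourself that writes really land in shared memory, here’s a quick sanity-check sketch (separate from the example above, names chosen just for illustration): a child process modifies a shared ctypes array and the parent reads the change back.

from multiprocessing import Process, sharedctypes

# Child process: write directly into the shared array
def double_first(arr):
    arr[0] *= 2

if __name__ == '__main__':
    arr = sharedctypes.RawArray('i', [1, 2, 3])   # shared array initialized to 1, 2, 3
    p = Process(target=double_first, args=(arr,))
    p.start()
    p.join()
    print(arr[0])                                  # prints 2 -- the parent sees the child's write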

Of course, there are some caveats to using shared memory: concurrent writes need explicit synchronization (which adds overhead), a RawArray has a fixed size you decide up front, and you need to make sure your data set doesn’t outgrow the memory you allocated. But for certain use cases, like working with large datasets or performing heavy calculations on multiple cores simultaneously, it can be a powerful tool in your programming arsenal. A sketch of how to add that locking follows below.
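Here’s a minimal sketch of the locking, using multiprocessing.sharedctypes.Array, which (unlike RawArray) bundles a lock you can grab with get_lock(); the function and variable names are just for illustration:

from multiprocessing import Process, sharedctypes

# Each worker increments the shared counter 1000 times
def increment(counter):
    for _ in range(1000):
        with counter.get_lock():       # serialize access so no updates are lost
            counter[0] += 1

if __name__ == '__main__':
    counter = sharedctypes.Array('i', [0])   # Array carries a lock; RawArray does not
    workers = [Process(target=increment, args=(counter,)) for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter[0])                  # 3000 -- without the lock, some increments could vanish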

It’s not always the best solution, but sometimes it’s exactly what you need to get the job done.
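One last aside: if you’re on Python 3.8 or newer, the standard library also ships multiprocessing.shared_memory.SharedMemory, which hands you a raw, named block of bytes that any process can attach to. A minimal sketch of how that looks (again, just an illustration, not part of the example above):

from multiprocessing import Process, shared_memory

# Child process: attach to the existing block by name and read from it
def reader(name):
    shm = shared_memory.SharedMemory(name=name)
    print(bytes(shm.buf[:5]))          # b'hello'
    shm.close()

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=5)
    shm.buf[:5] = b'hello'
    p = Process(target=reader, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()                       # release the block once everyone is done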
