Table of Contents
Concurrency and Parallelism
Why concurrency?
Why this guide?
How to follow?
The Basics
Managing Threads
Assignment 1 (b)
Exercise 1
Commentary on the solution
Sharing Data Between Threads
Part 1
Mini Task (Fix for Dispatcher Problem)
Assignment 1 (c)
Part 2
Assignment 1 (d)
Exercise 2 (Bank Account)
Exercise 3 (Many Readers One Writer)
Synchronizing Concurrent Operations
Part 1: Waiting using condition variables
Assignment 1 (e)
Part 2: Futures, Promises, and Asynchronous Tasks
Exercise 4: Parallel Calculator
Assignment 1 (f)
Remarks
Concurrency and Parallelism in the Wider World
Multi-threading in Python
Numba and JIT
Parallel Processing on the GPU
Concurrency and Parallelism
The main resource we are following is C++ Concurrency in Action (referred to as CCIA) by Anthony Williams. It is expected that after going through this guide in its entirety you will be able to look at new design patterns and concepts in concurrency and understand the “Why?”s of what is happening with a lot less friction. You’ll learn the core knowledge required to design and build concurrent applications.
IMPORTANT: If you think you are really stuck at a certain point and are not able to understand something, feel free to contact the person who wrote this guide!
Why concurrency?
Nowadays, almost every system is multi-core, and exploiting that existing compute power is one of the most valuable skills you can learn.
One field which catches everyone’s attention is HFT, and this skill is very valuable to such firms. Game studios require it for building high-performance game engines. In general, processing large amounts of data or handling a lot of operations requires you to make your code more efficient and to utilise the resources it already has available.
Why this guide?
As it presently stands, you won’t find any resource on the internet which follows such a pattern, and not without reason: it is generally assumed that someone dealing with concurrency is already capable of designing applications by themselves, so the focus is mostly on design patterns and other choices.
In other words, most resources on concurrency lack exercises, assuming the reader is already familiar with writing utilities and applications in C++. We assume the contrary: that you know how to write basic applications and have a decent idea about the STL. We focus on giving you sufficient code examples, and a bird’s-eye view of how concurrency is used in real utilities that you might use frequently.
Although we recommend learning concurrency in C++, there is an exceptional resource by Anton Zhiyanov on concurrency in the Go language (for experienced Go programmers), whose philosophy of teaching through examples I have tried to replicate, alas in a very limited fashion due to time and format restrictions. Here is the link if you wish to use it.
How to follow?
A pretty important pre-requisite is to be familiar with C++ to a certain degree. Whether you are familiar enough or not can be tested by Assignment 1 of this guide. If you are not, then here are some resources for learning C++ itself.
If you have time on your hands and are interested in this stuff, read and do everything mentioned, if not more! If you wish to build a good understanding of multi-threading and concurrency but are a little short on time, and just want to understand how to use concurrency, skip the reading assignments marked optional. If you somehow are even busier, then you may skip the project (not recommended though, as it is the key aspect of the guide), but still do all the other assignments.
Having established the “Why”s, let us start!
The Basics
Reading Assignment:
- Why Threads?
- CCIA: Chapter 1, Sections 1.1 to 1.4. Optional: 1.3.
Helpful Videos:
- On Processes (Can skip, not that relevant to concurrency, but you’ll gain an understanding on CPU instructions on a lower level)
- Why Are Threads Needed in Single Core Processors
- Threads on Multicore Processors (Can skip, already explained nicely in reading assignment, watch if not clear)
This section explores different methods of concurrency, single-threading and multi-threading, discusses why one should use concurrency, and establishes that the main motivation behind using it is either separation of concerns or performance.
Here’s an experiment you can try. Build and run the code in the given file. If you wish to look at the code, you may, but at this stage you don’t have to. Essentially, the task is to sum the first billion natural numbers using a simple for loop.
Considering that most of you will have at most 16 hardware threads, all even thread counts up to 16 have been included so you can notice the pattern yourself! Once you run it, you’ll find results similar to these (assuming you have a 16-thread CPU the results should be close to mine, but do try running it yourself).
Available threads 16
(No Threading) Time over single run: 439.338ms
(1 Threads) Average time over 15 runs: 1605.68 ms
(2 Threads) Average time over 15 runs: 811.696 ms
(4 Threads) Average time over 15 runs: 433.742 ms
(6 Threads) Average time over 15 runs: 308.542 ms
(8 Threads) Average time over 15 runs: 232.459 ms
(10 Threads) Average time over 15 runs: 186.434 ms
(12 Threads) Average time over 15 runs: 162.535 ms
(14 Threads) Average time over 15 runs: 141.65 ms
(16 Threads) Average time over 15 runs: 127.17 ms
(20 Threads) Average time over 15 runs: 110.081 ms
(32 Threads) Average time over 15 runs: 114.502 ms
(64 Threads) Average time over 15 runs: 107.696 ms
(128 Threads) Average time over 15 runs: 106.636 ms
(256 Threads) Average time over 15 runs: 106.176 ms
(1024 Threads) Average time over 15 runs: 115.011 ms
(4096 Threads) Average time over 15 runs: 142.688 ms
Can you infer why you see the results that you see here?
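If you would rather not open the benchmark file yet, here is a minimal sketch of the kind of chunked summation it performs. This is illustrative code, not the actual benchmark; the thread count is hardcoded here for simplicity, while the benchmark repeats the measurement for different thread counts and averages the timings.

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const long long N = 1'000'000'000;   // sum 1..N
    const unsigned num_threads = 8;      // illustrative; the benchmark varies this

    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([t, N, num_threads, &partial] {
            // Each thread sums its own contiguous chunk into its own slot,
            // so no synchronisation is needed while summing.
            long long begin = t * (N / num_threads) + 1;
            long long end = (t == num_threads - 1) ? N : (t + 1) * (N / num_threads);
            long long sum = 0;
            for (long long i = begin; i <= end; ++i) sum += i;
            partial[t] = sum;
        });
    }
    for (auto& w : workers) w.join();

    std::cout << "Sum = " << std::accumulate(partial.begin(), partial.end(), 0LL) << '\n';
}
```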
Assignment 1 (a)
Right now you just know how to create a thread. Cute, right? To design concurrent programs, you need to be comfortable with vanilla programs. This assignment involves no use of concurrency. You are tasked with building a find-and-replace utility. We will be working with this base for several sections of the guide, mostly to understand how concurrency is utilised in an actual tool, and how much it can affect speeds.
Here is what we expect the output of ./grep --help to look like (assuming your built binary is called “grep”). Make sure that all the functionality mentioned here is implemented.
A grep-like tool with replacing capabilities.
USAGE:
./grep [OPTIONS] <pattern> <file1> [file2]...
./grep [OPTIONS] -r <replacement> <pattern> <file1> [file2]...
OPTIONS:
-r, --replace <TEXT> Enable find-and-replace mode.
-i, --ignore-case Perform case-insensitive matching.
-n, --line-number Prefix each line of output with its line number.
-v, --invert-match Select non-matching lines.
-h, --help Display this help message.
Refer to our code here if you face difficulties, but make sure you do not just copy-paste stuff.
For benchmarking our tool, we will need two separate data-sets, generated using utility_small, which creates multiple small files, and utility_large, which creates multiple large files. This is to standardize the testing. Both utilities use subsets of different lengths from the Google 10,000 most-used words dataset.
Note: In case your computer has a weaker CPU, you may want to reduce the file size in the larger utility (by default, 8 files of 500MB each), and the number of files (I used 8; I suggest using fewer than 8, for reasons you’ll understand later).
Managing Threads
Reading/Watching Assignment: (In order as mentioned)
- Basic Thread Management Theory; leave Yield, Suspend and Resume for now (or read them if you’re interested).
- CCIA: Sections 2.1.1 - 2.1.3 => Using .join() practically => Section 2.1.4, 2.2, 2.3
Joining threads is mostly for coordination, not necessarily for parallelism or efficiency. There is also a difference between referencing a member function of a struct and referencing a free function: the name of a free function decays to a pointer to that function, whereas the name of a member function does not, since it is not well-defined without the object it is associated with; & must be used to take the address of a member function.
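As a quick, self-contained illustration of that distinction (the names here are invented for the example), compare how a free function and a member function are handed to std::thread:

```cpp
#include <iostream>
#include <thread>

void free_task(int x) {                 // free function: its name decays to a pointer
    std::cout << "free_task(" << x << ")\n";
}

struct Worker {
    void member_task(int x) {           // member function: needs an object to run on
        std::cout << "member_task(" << x << ")\n";
    }
};

int main() {
    std::thread t1(free_task, 1);                  // the function name is enough
    Worker w;
    std::thread t2(&Worker::member_task, &w, 2);   // explicit &, plus the object to call it on
    t1.join();
    t2.join();
}
```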
Assignment 1 (b)
Take inspiration from the technique used in Listing 2.7 from the book, to implement a way to process multiple files concurrently. Think about the design choices you have learnt in this chapter, and figure out one key modification you’ll need to ensure safety. Here are the results I found with my implementation:
No Threading
[sjais ~]$ ./a.out "hello" 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
--- Processed 1.txt in 349.182 ms. ---
--- Processed 2.txt in 347.961 ms. ---
--- Processed 3.txt in 347.135 ms. ---
--- Processed 4.txt in 346.712 ms. ---
--- Processed 5.txt in 350.587 ms. ---
--- Processed 6.txt in 347.18 ms. ---
--- Processed 7.txt in 348.428 ms. ---
--- Processed 8.txt in 345.747 ms. ---
Finished processing files in 2783.06 ms.
With Threading
[sjais ~]$ ./a.out "hello" 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
--- Processed 8.txt in 417.541 ms. ---
--- Processed 7.txt in 418.521 ms. ---
--- Processed 4.txt in 419.33 ms. ---
--- Processed 3.txt in 419.488 ms. ---
--- Processed 5.txt in 420.017 ms. ---
--- Processed 6.txt in 420.865 ms. ---
--- Processed 1.txt in 492.593 ms. ---
--- Processed 2.txt in 493.279 ms. ---
Finished processing files in 493.677 ms.
When testing yourself, I would highly recommend not printing the matching lines, since there will be way too many of them. (Optional) Implement a feature to print the total number of occurrences found.
However, there are a couple of problems with this. In one of my runs, the output looked like this:
[sjais ~]$ ./a.out "hello" 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
--- Processed 8.txt in 419.799 ms. ---
--- Processed 7.txt in 420.15 ms. ---
--- Processed 5.txt in 421.128 ms. ---
--- Processed 3.txt in 421.811 ms. ---
--- Processed 6.txt in --- Processed 4.txt in 422.597 ms. ---
422.507 ms. ---
--- Processed 1.txt in 504.27 ms. ---
--- Processed 2.txt in 506.959 ms. ---
Finished processing files in 507.347 ms.
Can you figure out why? In the next chapter you’ll learn about ways to fix this issue.
Reading/Watching Assignment: (In order as mentioned)
- CCIA: Section 2.4, 2.5
Once you are done with the reading and the assignment, spend some time revising the listings and make sure you understand them at least to a certain degree. The exercise given isn’t tough, but it uses every topic you have learnt so far and is a longer problem. I recommend you play around with the listings in the book, as the book itself suggests; they can be found here.
Exercise 1
A Central Dispatcher receives a continuous stream of unique Delivery Orders. These orders cannot be duplicated. The Dispatcher sends the orders to another centre, let’s say the Post Office. The Post Office is supposed to find (let’s say create) an available Driver (basically a thread), assign the order to the driver, and then return the driver back to the Central Dispatcher, where the driver is added back to the ready-to-go fleet (the thread should be free after delivery). The drivers all work in parallel.
Upon finishing its work, each driver must record a one-line summary of its work in a central logbook, written as soon as the driver finishes. For now, you do not have to worry about conflicts between drivers when they are entering their work in the logbook. This is a known flaw, and you’ll fix it in the next chapter.
Technical Requirements/Design Pattern
- struct DeliveryOrder - Contains order_id, destination and a method execute which returns the log. You may simulate the work using any function you like; I suggest you just let the thread sleep for a few (random number of) milliseconds (google how to put a thread to sleep for this). The log can simply state that the delivery was completed.
- class Driver - This will be your thread manager, which manages a single thread. The driver should be movable, but not copyable (like unique_ptr). Implement the relevant safeguards. The constructor takes the DeliveryOrder and a reference to the logBook. You must directly invoke the execute method of DeliveryOrder. The destructor must also make sure that the thread has been properly shut down. You should also make sure that your Driver object is movable (this is a crucial part; think about what makes an object movable. HINT: If a variable contains the Driver object, and you do var = Driver(order, book), what is a necessary condition for this to go smoothly?)
- create_driver - Or find_driver, you do you. Helper function to create a driver; kinda optional.
- main (Dispatcher) - The logbook is a vector of std::string. The fleet of drivers should be in a std::vector<Driver>. Use good practices to initialize these. Find the optimal number of drivers (Section 2.4), or just hardcode it, your choice. Create a loop to create the drivers and add them to your vector. The main function should wait for all the tasks to be completed, after which it should print the contents of the logbook.
BONUS: As of now, the Driver is movable, but you don’t really use its moving capability much. Try running your code without the Driver being movable. Also, think of how you can put its movability into practice.
Expected Output: Depending on how you actually write your logs, it will look different. But the output will likely be different every time you run it, and end with a Segmentation fault (core dumped) error. Think about why you get this particular error, and fix the segmentation fault too. Here is the output from one of my runs.
Starting Program...
--- Creating 16 orders ---
Creating driver for order 0.
Creating driver for order 1.
Creating driver for order 2.
Driver for order 2 delivering to Sector 2.
Driver for order 2 delivering to Sector 2.
Creating driver for order 3.
Creating driver for order 4.
Driver for order 4 delivering to Sector 4.
Creating driver for order 5.
Driver for order 5 delivering to Sector 5.
Creating driver for order 6.
Driver for order 6 delivering to Sector 6.
Driver for order 6 delivering to Sector 6.
Creating driver for order Creating driver for order 7.
Segmentation fault (core dumped)
Commentary on the solution
I’ll start by saying that the reason for using random times is a surprise for you in later sections. Next, reiterating the design choices. The Driver class initializes the thread and has its destructor called whenever the object goes out of scope. To revise: first the main function reaches its closing brace, after which the drivers vector is destroyed. Since each individual entry of the drivers vector is an instance of the Driver class, its destructor is called, which joins the thread, causing main to wait for the completion of that thread. This kind of pattern, where the object that acquires the resource also handles it without you having to do anything, is known as RAII (Resource Acquisition Is Initialization).
As a sidenote, anyone who has worked with Rust before probably understands RAII already. Rust heavily utilises the RAII design pattern to manage memory, file handles, network connections and, of course, threads. Rust also ensures that each resource has only a single owner, which makes it memory safe by default. These checks have to be added in C++ as and where needed (spoiler alert: they’re not needed everywhere).
Next, we have made the threads (which in this case are Drivers) movable, by defining Driver(Driver&&) (the move constructor) and deleting Driver(const Driver&) (the copy constructor). It is necessary to make them movable, since the standard requires the type stored in a vector to be movable or copyable: during resizing, or on invocation of methods like emplace_back, the data already present in the vector may need to be moved (see CPP Ref). Here are some videos on the constructors in case you are not familiar with them: Default Constructor, (Wiki) Copy Constructor; and just look up the move constructor somewhere.
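To make the RAII and movability points concrete, here is a stripped-down sketch of what such a movable, non-copyable thread wrapper can look like. It is only an outline of the idea, not the full Driver from the solution (it takes an arbitrary callable instead of a DeliveryOrder and a logbook):

```cpp
#include <chrono>
#include <thread>
#include <utility>
#include <vector>

// Minimal RAII thread wrapper: joins in its destructor, movable but not copyable.
class Driver {
public:
    template <typename Fn>
    explicit Driver(Fn&& work) : t_(std::forward<Fn>(work)) {}

    Driver(Driver&& other) noexcept = default;   // move: ownership of the thread transfers
    Driver(const Driver&) = delete;              // copy: a running thread cannot be duplicated
    Driver& operator=(const Driver&) = delete;

    ~Driver() {
        if (t_.joinable()) t_.join();            // RAII: destruction waits for the work to finish
    }

private:
    std::thread t_;
};

int main() {
    std::vector<Driver> fleet;
    for (int i = 0; i < 4; ++i)
        fleet.emplace_back([i] {                 // lambda standing in for "deliver the order"
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * i));
        });
    // fleet's destructor runs at the end of main, joining every thread.
}
```

Because the move constructor is defined and the copy constructor is deleted, std::vector can relocate these objects during growth without ever trying to copy a running thread.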
Another important feature that we have used here is lambda functions. People who have used Python before might find these familiar, although C++ lambdas are a tad more complicated than the ones you possibly know from Python. Here is a video for Lambdas, and I would recommend going through the CPP Ref for the same (if you have not heard of function pointers before, it might be beneficial to watch this video before the one on lambdas: Function Pointers). It was not necessary to use lambdas in this particular problem, and it could have been done without them too, which should be sufficiently easy to figure out, although there is something tricky going on with the logbook if you do it that way.
I mention an optional design pattern, which is the use of the friend declaration. In the given scenario, the constructor itself has been made private! This means that you cannot call it from just any function. Only specific functions which are a “friend” of the class can call its constructor, which in this case is find_driver. This kind of design pattern is known as the “Factory Pattern”. This is completely optional for now, but is an interesting design pattern to know. The reason it has been used here is that it adds verbosity for clarity and keeps the program safe. The more important reason is that such patterns are very important in programs in which you are constantly delegating tasks to other functions to manage them safely, a strong example of which is a thread pool, which we will be exploring in later sections. Lastly, if you have any questions about the make_unique usage, just google it; it basically prevents memory leaks and wraps the newly created object in a unique pointer.
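If the friend/factory idea sounds abstract, here is a tiny hedged sketch of its shape. The names are invented for this example and the actual solution may differ; note also that std::make_unique cannot reach a private constructor, so this sketch wraps new directly:

```cpp
#include <memory>

class Driver {
public:
    // Only find_driver may construct Drivers, because the constructor is private.
    friend std::unique_ptr<Driver> find_driver(int order_id);

    int id() const { return order_id_; }

private:
    explicit Driver(int order_id) : order_id_(order_id) {}
    int order_id_;
};

// Factory function: the single, controlled entry point for creating a Driver.
std::unique_ptr<Driver> find_driver(int order_id) {
    return std::unique_ptr<Driver>(new Driver(order_id));
}

int main() {
    auto d = find_driver(42);   // fine
    // Driver d2(7);            // would not compile: the constructor is private
    return d->id() == 42 ? 0 : 1;
}
```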
Sharing Data Between Threads
Part 1
In this section we will start by solving the race conditions you might have encountered in the previous sections. Later we will build upon the grep tool by developing more sophisticated logging tools.
Reading/Watching Assignment (In Order):
- [YT] On Sharing of Data => Race Conditions Overview => CCIA: Section 3.1 to 3.2.3 => [YT] Mutexes on Low Level (Answers how multiple threads accessing a mutex isn’t itself a race condition) => [YT] Data Race & Mutex (Optional)
The bug you must have encountered previously while attempting the grep problem is also an example of a race condition, albeit not that severe. Notice how the bug isn’t easily reproducible. Similarly, the issue with the Dispatcher problem is another example of a race condition, with higher severity. The output you get in each of these programs feels essentially random and irreproducible. This is one of the main reasons why it is so hard to diagnose race conditions and fix them. Thus you should always make sure that your shared data structures have protection mechanisms, predominantly mutexes (from “mutual exclusion”).
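As a minimal illustration of what “protecting a data structure with a mutex” looks like (the logbook name here is just for the example), every access to the shared vector goes through a function that locks the same mutex first:

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::vector<std::string> logbook;
std::mutex logbook_mutex;            // guards every access to logbook

void add_entry(const std::string& entry) {
    std::lock_guard<std::mutex> lock(logbook_mutex);  // locked here, released at scope exit
    logbook.push_back(entry);        // only one thread can be in this block at a time
}

int main() {
    std::thread t1([] { add_entry("driver 1 done"); });
    std::thread t2([] { add_entry("driver 2 done"); });
    t1.join();
    t2.join();
}
```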
Mini Task (Fix for Dispatcher Problem)
As you probably figured out in the earlier assignment, the segmentation fault originated from multiple threads trying to call push_back on the logbook vector simultaneously. Take appropriate steps to fix this. There are two possible paths you can explore: you can either create two mutexes, one for the output and one for the input, or create a single shared mutex. For this problem, you should do it using a shared mutex, and also explore the advantages/disadvantages of either approach.
Assignment 1 (c)
First you should do some preliminary tasks to get used to mutexes.
- Fix Garbled Output: Create a shared mutex to protect the output stream, and use a lock guard to make the “processed file” message atomic.
- Data Aggregation: Implement the optional feature from earlier, i.e. count the total number of occurrences found across all the files. Create a shared integer containing that number and protect it using a mutex. While updating the total count across files, first lock the data mutex, then, before outputting the number of occurrences in the file just processed, lock the output mutex. A thread should lock the data mutex after processing a file, and add the occurrence count from that single file to the total count.
Now we will aim to demonstrate a new type of bug that can be created. We will do this by adding a “Status Reporting” feature. This feature will run in its own thread and periodically wake up to report the total number of matches found thus far. This is particularly simple to do, and here’s the outline of how to do it.
- Create a new thread in main which runs a status reporting loop.
- Inside of this loop, the thread should:
  - Lock the mutex controlling the output stream to print a message which reports the progress.
  - Lock the mutex controlling the number of matches, reading total_matches and printing its value.
  - Sleep for 100ms.
You will notice that the program you have made sometimes runs to completion, but other times it just hangs and never finishes. Can you figure out why it never finishes?
Part 2
Reading/Watching Assignment (In Order):
- Understanding the Deadlock: [YT] Deadlock => CCIA: Sections 3.2.4 to 3.2.7 => Dining Philosphers Problem and the other linked problems therein
- Other Tools: CCIA: [YT] Lazy Initialization and
unique_lock=> Section 3.3 => Reader Writer Locks
In this section, we will solve the deadlock problem faced in the previous part and remove some inefficiencies, blocking threads only when it is genuinely necessary to do so, in particular for the status reporting from Part 1.
As you must have figured out after reading about deadlocks, the reason your program hangs on some runs is that it enters a deadlock, which occurs when two or more threads are blocked forever, each waiting for a resource held by another thread in the cycle.
In our particular case, a worker thread might lock the mutex guarding the result and then try to lock the output-stream mutex, whereas the status thread might lock the output-stream mutex first and then try to lock the mutex guarding the result. If the timing is right, both threads end up waiting for the other and neither can proceed, freezing the program. This wouldn’t happen if the order of locking the mutexes were the same in both threads.
Moreover, many threads access total_matches, and the status-reporting thread accesses it quite frequently, at least for large files which take multiple seconds to process. We used 100ms as the interval, but if the interval were smaller, the number of accesses would increase further. A standard mutex prevents the status thread from reading the value even while another thread is merely reading it too, which is one cause of the bottleneck. This problem is solved by using a shared_mutex: since readers do not conflict with one another, a shared lock can be held by multiple reading threads simultaneously, while a writer still takes exclusive ownership.
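Here is a hedged sketch of that reader/writer split (C++17); the counter is named to match our tool, but the real assignment code will look different:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex matches_mutex;
long total_matches = 0;

// Writer (worker thread): needs exclusive access to modify the counter.
void add_matches(long n) {
    std::unique_lock<std::shared_mutex> lock(matches_mutex);
    total_matches += n;
}

// Reader (status thread): a shared lock lets many readers in at the same time.
long read_matches() {
    std::shared_lock<std::shared_mutex> lock(matches_mutex);
    return total_matches;
}

int main() {
    add_matches(3);
    return read_matches() == 3 ? 0 : 1;
}
```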
Assignment 1 (d)
In this assignment we will simply fix the deadlock and create a thread-safe logger.
- Fixing the Deadlock: If you thought through the previous assignment thoroughly and read the present section, one solution should be obvious: keep the order of locking consistent (I intentionally instructed you to do otherwise). We are trying to move towards a paradigm where I/O and processing are decoupled. We will implement a thread-safe Logger, which will handle its own mutex for the output, eliminating the need to maintain a separate one.
- Thread-Safe Logger: The Logger class centralizes the I/O. You should use call_once, so that the class is initialized just once even if multiple threads request it. Use an internal mutex to prevent garbled output. (A minimal sketch of such a logger follows this list.)
- Status Reporter: Since this just needs read access, and it need not be exclusive, a shared_mutex is perfect for this use case. Weigh the differences between each of these options.
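As promised, here is a minimal, hedged sketch of a call_once-based logger (names and structure are illustrative; your Logger will need whatever interface your tool actually uses):

```cpp
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

class Logger {
public:
    // Constructed exactly once, no matter how many threads call instance() first.
    static Logger& instance() {
        std::call_once(init_flag_, [] { inst_ = new Logger(); });
        return *inst_;
    }

    // The internal mutex keeps lines from different threads from interleaving.
    void log(const std::string& line) {
        std::lock_guard<std::mutex> lock(io_mutex_);
        std::cout << line << '\n';
    }

private:
    Logger() = default;
    std::mutex io_mutex_;
    static std::once_flag init_flag_;
    static Logger* inst_;
};

std::once_flag Logger::init_flag_;
Logger* Logger::inst_ = nullptr;

int main() {
    std::thread t1([] { Logger::instance().log("worker 1 finished"); });
    std::thread t2([] { Logger::instance().log("worker 2 finished"); });
    t1.join();
    t2.join();
}
```

(A function-local static, the so-called Meyers singleton, would also work here, but the assignment specifically asks you to practise call_once and once_flag.)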
Exercise 2 (Bank Account)
A bank has multiple accounts. When money is deposited into and withdrawn from a bank account, unsafe data structures can cause significant issues in keeping the balances correct. In this exercise, you’ll be simulating this exact scenario. You will model an Account class and implement a transfer function. You will also have to implement thread-safe withdraw and deposit methods within the Account class.
Although you are free to implement it however you like, here’s a possible template which you can use to implement it, if you are unsure. However, regardless of how you do it, the main function should be implemented as mentioned.
Design Pattern
- class Account - Should contain a balance, a mutex to protect the balance, thread-safe methods deposit and withdraw to perform the corresponding operations, and a checkBalance method.
- function transfer - This function should lock both accounts during the transfer. You should decide the order of operations yourself.
- function main - Create two Account instances a & b, and launch two threads. The first thread repeatedly transfers an amount of 10.0 from a to b and the second one does the reverse.
If you are able to implement thread-safe methods which run cleanly without fail, then that’s well and good. Otherwise, try to fix the bugs in the partial solution given in the GitHub repository.
Exercise 3 (Many Readers One Writer)
This problem mostly mirrors the grep tool in some sense. This exercise is to build a settings manager, which will load stuff from a configuration file and follow a pretty similar chain of events, i.e. load settings, read settings and update settings. I suppose you can see the parallel clearly.
Design Pattern
- class Settings - It should contain a map<string, string>, be protected by an appropriate mutex, and handle one-time initialization. It should implement the following member functions:
  - void load(): Simulate loading (or actually load, your call) from a file by populating the map with some dummy data.
  - string get(const string& key): Check/handle whether the settings have been loaded or not, then acquire just read access to the map.
  - void update(const string& key, const string& value): Check/handle whether the settings have been loaded or not, then acquire write access to the map.
- function main - Create a single global instance of the Settings class, then launch multiple reader threads and a single writer thread. Experiment around and ensure that everything works correctly, which it should if you chose the correct objects to use in the correct order.
BONUS: Although this will suffice for most operations, in large applications you want to squeeze out as much parallelism as you can. While a thread is reading, the writer thread will be blocked. The amount of time a lock is held can be reduced quite heavily. Figure out the order of operations which minimizes this time, and implement it. Even better if you can measure it!
No solution is provided for this exercise in the GitHub repo due to its simplicity.
Synchronizing Concurrent Operations
So far our operations have been really simple. If you watched the Computerphile video on Multithreading Code, we have basically just covered what they explained in that single video: guarding a resource using a lock, and unlocking it when necessary. In its essence, that’s what it is all about, just guarding data. But our interactions have been really simple. This section introduces the tools required for more sophisticated coordination, such as waiting for specific conditions and getting results back from background tasks.
Part 1: Waiting using condition variables
Reading/Watching Assignment (In Order):
- [YT] Condition Variable => CCIA: Sections 4.1 to 4.1.2 => Producer/Consumer Pattern & Introduction to Thread Pools (this lecture is written in OCaml, used by Jane Street, btw, but should be easy enough to understand)
If one thread needs to wait for another thread to complete a specific task, then a mutex alone becomes incredibly inefficient, since the consumer would have to repeatedly lock the mutex and check the resource, wasting a lot of valuable CPU cycles. Having the thread sleep until its work is ready would significantly improve resource utilisation. This is where condition variables come in, allowing threads to wait for a certain condition to become true.
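Here is a minimal sketch of that idea: a single producer hands one value to a single consumer, and the consumer sleeps (rather than spins) until it is notified. The names are illustrative:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;
int payload = 0;

void producer() {
    {
        std::lock_guard<std::mutex> lock(m);
        payload = 42;          // prepare the data
        ready = true;          // record that it is ready
    }
    cv.notify_one();           // wake the waiting consumer
}

void consumer() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });   // releases the lock while asleep, re-checks on wake-up
    std::cout << "got " << payload << '\n';
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}
```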
Assignment 1 (e)
So far in our grep tool, every worker is responsible for printing its results directly to the console. Even though the thread-safe logger manages the mutexes, there still remains a performance bottleneck: writing to an I/O stream is a slow operation, and a thread that is writing holds the lock, preventing other threads from making progress.
We will now decouple the file processing from the logging completely. In our producer-consumer pattern, the worker threads will be producers which find matches and put them into a shared work queue. Another thread, the consumer, will then pull the results from this shared work queue and handle the output. This means the workers never have to wait for the console.
This is a significant change in design, so we will ditch the entirety of our previous code and work with the new one from now on. The last part of the grep tool, Assignment 1 (f), which is your capstone project, shall be built on top of this one.
Design Pattern
- Build a Thread-Safe Queue - You can simply reuse CCIA Listing 4.5. This thread-safe queue will hold a struct/class which contains all the necessary information for a result.
- Worker Threads - The logic for finding the occurrences remains the same, but the Logger is no longer needed. Instead, the thread should pack each result into a result object, which is then pushed to the above-mentioned thread-safe queue.
- Logger Thread - In main, create a consumer thread which loops and calls the wait_and_pop method on the queue. As soon as a GrepResult object is popped, it should be formatted and printed to the output stream.
- Shutdown - Figure out some method to convey to the consumer that all the work has been finished, and to shut down the consumer once that condition is met.
Part 2: Futures, Promises and Asynchronous Tasks
Reading/Watching Assignment (In Order):
- [YT] Asynchronous Programming in Game Engines => [YT] Future, Promise and async() => CCIA: Section 4.2.1 to 4.2.5
Asynchronous tasks are a more modern way of handling concurrency: you simply state that a task needs to be done and get the result back when it is finished. Futures and promises give you a higher-level way to handle these scenarios; the future is like a receipt, and you’ve been made a promise by someone that they’ll eventually deliver the product you hold the receipt for.
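In code, the “receipt” metaphor looks roughly like this (a toy computation, not part of the grep tool):

```cpp
#include <future>
#include <iostream>

int slow_square(int x) {
    return x * x;              // imagine this taking a while
}

int main() {
    // Launch the task; the future is the receipt for its eventual result.
    std::future<int> receipt = std::async(std::launch::async, slow_square, 7);

    // ... do other work here while slow_square runs ...

    std::cout << receipt.get() << '\n';   // blocks only if the result is not ready yet
}
```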
Exercise 4: Parallel Calculator
Before attempting the final assignment, it is highly recommended to attempt this exercise. This is a simple task to get acquainted with std::async, std::promise, std::future.
Part A
Write a simple function, say square, which takes an input, sleeps for 1 second and then returns the square of the input. In the main function, create a vector<future<int>>. Fill it with 5 futures by using async(square, i). After launching, create another loop to iterate over the futures and call .get() to get each result and add it to a total. Use chrono to time your operations.
Part B
Create a new function, say promise_worker(promise<int> p, int input), which rejects negative input, and otherwise performs a calculation and uses p.set_value() with the result. In the main function, create a promise<int>, get its future, launch a thread with the worker, and pass in the promise (focus!).
No solution is being provided for this exercise either, since it is pretty fundamental, and should be doable.
Assignment 1 (f)
The goal of this task is to parallelize the search within a single large file. I will highlight the implementation issues and the general problems you will have to work through on top of your base program, although deviating from the specific recommendations isn’t an issue as long as it works.
- main - Earlier, the main function just launched threads and that was it. Here, you’ll need to determine the text chunks depending on the number of files and the number of threads you have available. Moreover, you’ll likely have to maintain a vector of futures to store the results. The aggregation of results will also happen in the main function. Notice a subtlety here: the chunks you generate by just slicing the text might split exactly at the boundary of some word. Moreover, what happens when you are searching for non-alphanumeric characters? How would you handle this splitting issue for those?
- worker - Since we are dealing with very large files here, slicing the text into multiple chunks and storing them would require just as much space in memory as the file itself, which might not be possible.
Beyond Locks
This guide has focused on lock-based concurrency (using std::mutex, std::shared_mutex, etc.). This is the most common, robust, and safest way to handle complex data structures. However, for simple, individual variables (like a counter, a flag, or a pointer), using a mutex is often heavyweight, causing unnecessary thread contention.
For these specific cases, C++ provides std::atomic<T>. An atomic variable guarantees that its operations (like load(), store(), fetch_add()) are “atomic”—they cannot be torn apart by the CPU and will be completed indivisibly without being interrupted by another thread. This allows for lock-free programming, which can be significantly faster in high-performance scenarios (like in drivers, game engines, or HFT systems) because it avoids the kernel-level overhead of a mutex.
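For instance, a shared counter can be made safe without any mutex at all. A minimal sketch:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> counter{0};

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // indivisible increment, no lock
        });
    for (auto& w : workers) w.join();
    std::cout << counter.load() << '\n';   // always 400000
}
```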
Remarks
This marks the end of concurrency and parallelism in C++. Although this covers a significant portion of the basics of concurrency, this is not where it ends. If you look at CCIA, we are basically stopping at the 4th chapter. However, if you want to continue with the book further, I’ll link one talk (C++ atomics, from basic to advanced. What do they really do) which can possibly help you push at least a bit further. Future iterations of this guide will extend it to more advanced and specialised portions of the text which are deemed viable to be put in a guide-like format, and which will benefit from assignments and exercises.
Having learnt parallelism and concurrency to a decent level, you might want to challenge yourself. After all, the assignments and exercises in this chapter were based completely on the basics and barely required independent thinking. If you do want to take on a challenge, I recommend you solve some problems from the book The Modern C++ Challenge, which has some good problems on concurrency for you to explore. The following sections aim to explore how concurrency and parallelism are managed in other languages, and how they differ from what you have learnt in C++ (although the present iteration just has Python).
I strongly encourage you to keep reading the text, watch conferences from CppCon, and most importantly build projects which require you to think, not just in terms of concurrency but overall.
Concurrency and Parallelism in the Wider World
This section aims to explore more of the architectures of parallelism and understand why things are how they are, instead of just asking how they are. This section will be expanded in the near future.
Multi-threading in Python
We have previously discussed parallelism, multi-threading and concurrency, and so far these terms have been used interchangeably. In Python, a significant difference appears. The standard implementation of Python (CPython) has a mechanism called the Global Interpreter Lock (GIL), which synchronizes execution so that only one thread can execute Python code at a time. The GIL is absent in C++; thus, however many threads are created, they can be allocated to different cores and run truly in parallel. In Python, although tasks can still be concurrent, threads can never be truly parallel (the GIL itself is a mutex lock!). You can still have truly parallel Python programs, but that won’t be achieved through multi-threading.
Multi-threading can be utilised for tasks such as reading from I/O while still processing other information, since context-switching is still allowed. However, to run CPU-bound tasks in parallel, you will have to use multi-processing, which means running multiple processes instead of running multiple threads in the same process. This method of parallelism has significant resource overhead compared to multi-threading, since each independent process starts with its own memory space and its own instance of the Python interpreter, making it a heavy operation for the system. Moreover, since threads share the same memory space, communication between two threads is straightforward, which is not the case with processes: passing data requires one to serialize it and send it through a pipe, after which it has to be deserialized by the receiving process.
The official python documentation does a great job explaining the usage of these modules, Thread Based Parallelism, Process Based Parallelism, and should be referred to if you want to learn how to do this.
You might ask: why does Python have a GIL? You should read the linked article in that case, which discusses in detail the issues with the GIL and why it is still a part of the Python programming language. Still, another question arises: why does Python have to use reference counting? How do other languages handle that particular issue?
Python uses reference counting because it aligns with its philosophy of simplicity and predictability. Since objects are removed as soon as their reference count drops to zero, cleanup is deterministic. This makes managing resources straightforward and reduces cognitive load for the programmer, in contrast to languages like C and C++, which give the programmer complete control over memory. That control lets memory be managed exactly as and when needed, but it also increases development time.
Overall the two languages have different philosophies. Since C++ code is compiled directly to machine code, it is highly optimized; for CPU-intensive tasks, C++ is thus much faster than Python. However, due to the single-threaded execution imposed by the GIL, Python has the unique advantage that C extensions written for it do not have to worry about thread-safety issues, and thus are readily integrable. Sections of such libraries which do not need any interaction with Python objects can release the GIL and thereby utilise multiple cores of the CPU. Libraries like NumPy are written in C and are compiled into machine code when the library is built.
Numba and JIT
Having established that for CPU intensive tasks, Python is limited by the GIL, and a solution to this problem is to use pre-compiled libraries like NumPy, what can we do if we want to speed up small parts of the program which do not communicate with Python objects? For programs in which there exist computationally intensive tasks which constitute a significant portion of the run-time, we can speed these up by using Just-in-Time Compilation (JIT).
While a standard Python interpreter reads your code line by line and executes it, an Ahead-of-Time (AOT) compiled language like C++ has a compiler that reads all the code, analyses it and converts it into machine code which can then be loaded directly onto the machine and executed.
You might ask why all languages aren’t AOT compiled if it is so much faster than interpretation. First, an AOT-compiled program is platform dependent and has to be compiled for a specific target. AOT compilation also introduces a compilation step which can slow down development and testing, and AOT programs have little flexibility at runtime. These limitations of AOT-compiled programs are overcome by interpreted and JIT-compiled languages, significantly improving portability and development speed. Thus, whether to use an AOT-compiled, an interpreted, or a JIT-compiled language is a decision which has to be made by considering the trade-offs of each.
A JIT approach is a hybrid: on the initial call to a function, it compiles that function into highly optimized machine code, which can be loaded directly onto the machine, skipping the slow interpretation completely. Here’s a great video by Computerphile about how JIT compilers work.
Numba is one such JIT compiler, which translates a subset of Python code into machine code, and thus can be used to essentially “re-write” your performance-critical functions as if they were C as you start running them. Here is a video, (YT) Make Python code 1000x faster with Numba, which introduces Numba and covers some good practices while using it, along with some simulation-type examples so you understand exactly what Numba is about. Here’s an official quick start guide to Numba. If you wish to explore further, the official documentation is a pretty good resource to do so.
As a sidenote, I would like to link one of my favorite projects to date which utilises the features of Numba, (GitHub) String Art Generator; it is also the project from which I learnt about Numba, because speeding it up was so necessary. It’s a really cool project imo, and you might want to explore it, and maybe generate some cool art with it.
Parallel Processing on the GPU
So far we have limited our discussion to just CPUs, and how we use a handful of cores to speed up our computation, and perform a few complex tasks simultaneously or divide a large task into smaller more manageable tasks. However, the GPU takes a radically different approach to parallel processing, built specifically for repetitive tasks of rendering graphics. As most of you probably already know, GPUs have proven to be extraordinarily effective for scientific computing tasks, which is due to the architecture of the chip.
- The architecture of a CPU gives it a few cores, each capable of performing complex, sequential operations. They handle the overall logic of the program, the user output and everything.
- The architecture of the GPU gives it thousands of cores, each built for simple operations. They all execute the exact same command in unison on a massive amount of data. This makes a GPU ideal for taking large amounts of data and performing simple operations on it.
You might ask: if GPUs have so many cores, why can’t CPUs have them too? Well, tl;dr: there’s a fundamental difference between the architectures of the two chips. Each CPU core has large, dedicated memory caches (L1, L2) to store data, and the CPU uses a huge amount of chip space for features like branch prediction and out-of-order execution. Thus the overall size of a CPU core is much bigger than that of a GPU core, which requires only a tiny cache and minimal control logic. Then another question pops up: why not make bigger CPUs?
We now reach issues at the manufacturing level. The CPUs themselves are printed at the nanometer scale on a silicon wafer, which usually has microscopic defects, and even a single defect can render the entire chip useless. Thus, smaller chips have higher success rates than bigger chips. And to make smaller chips with the same compute, you need finer printing technologies, at which point we soon reach the limits of physical laws, and not just in printing.
The bigger the chip, the more time it takes for a signal to travel across it, which at the rate of billions of operations per second becomes a huge bottleneck. Here is an extremely cool video about [YT] How are Microchips Made. But in the end we will reach a dead end, due to Amdahl’s law, which, as Wikipedia states, is “a formula that shows how much faster a task can be completed when more resources are added to the system”.
The architecture of a GPU is called SIMT (Single Instruction, Multiple Threads). A given group of threads on a GPU has to perform the same operation, in contrast to CPUs, where each thread can have its own operation independent of the other threads. These groups of threads are called “Warps” or “Bundles”, usually of 32 or 64 threads, and an entire warp executes one instruction at a time. I highly recommend the following video to learn about GPU architecture: [YT] How do Graphics Cards Work? Exploring GPU Architecture.
The way you program a GPU is significantly different from how you program a CPU. The workflow for GPGPU (General Purpose Computing on GPU) is:
- Host Code (CPU): The main program still has to run on the CPU.
- Data Transfer: The CPU copies the data to be worked on from system RAM to VRAM. This step is the biggest performance bottleneck.
If you have heard about ARM chips, then you probably know that the main reason Apple Silicon was so much better than its competitors on the market is that Apple has complete control over what goes into their chips. Theoretically speaking, if the CPU and GPU were built with unified memory, this bottleneck would shrink dramatically. In fact, there exists the HSA Foundation, which is developing the HSA, a Heterogeneous System Architecture. Here is an article by NVIDIA exploring the same in CUDA, Unified Memory for CUDA Beginners.
- Kernel Launch: The CPU launches a kernel, which is a function written by the programmer which will be executed by every single GPU thread you’ve launched.
- Device Code (GPU): The threads all run in parallel.
- Result Transfer: Once the kernel is finished the result is copied back from the VRAM to system RAM.
The dominant frameworks for GPGPU as of now are NVIDIA’s CUDA and OpenCL. If you want to get into this stuff, the official CUDA C++ Programming Guide is the best resource out there.
Contributors
- Shivansh Jaiswal | +91 9971104638
- Austin Shijo | +91 8946061070