Graph optimization software
Abstract—Kidney transplantation is typically the most effective treatment for patients with end-stage renal disease. However, the supply of kidneys is far short of the fast-growing demand. Kidney paired donation (KPD) programs provide an innovative approach for increasing the number of available kidneys: in a KPD program, willing but incompatible donor-candidate pairs may exchange donor organs to achieve mutual benefit. Recently, research on exchanges initiated by altruistic donors (ADs) has attracted great attention because the resultant organ exchange mechanisms offer advantages that increase the effectiveness of KPD programs. Currently, most KPD programs focus on rule-based strategies of prioritizing kidney donation. In this paper, we consider and compare two graph-based organ allocation algorithms to optimize an outcome-based strategy defined by the overall expected utility of kidney exchanges in a KPD program with both incompatible pairs and ADs. We develop an interactive software-based decision support system to model, monitor, and visualize a conceptual KPD program, which aims to assist clinicians in the evaluation of different kidney allocation strategies. Using this system, we demonstrate empirically through comprehensive simulation experiments that an outcome-based strategy for kidney exchanges leads to improvement in both the quantity and quality of kidney transplantation.

Index Terms—Kidney exchanges, optimal matches, software.

[Fig. 1. Illustration of three types of kidney exchanges: (a) a two-way cycle-based exchange, (b) a three-way cycle-based exchange, and (c) a chain-based exchange initiated by an AD. Top: in the graphs, donors D and their willing but incompatible candidates C are shown in the same numbered pairs, and arrows denote the kidney transplant from donor D to the compatible candidate C. Bottom: the corresponding graphical representations of the three cases; BD is a bridge donor that triggers another chain-based exchange.]
I. INTRODUCTION

In comparison to dialysis, kidney transplantation has been proven to be a more effective treatment for most patients with end-stage renal disease. However, in response to the growing demand, there is a serious shortage in the supply of transplantable kidneys. As a result, more than 90,000 patients were waiting for kidney transplantation [11]. Transplants from live donors generally have a higher chance of success. Unfortunately, biological incompatibility, such as ABO blood type mismatch or the presence of human leukocyte antigen (HLA) antibodies [9], prevents many intended living-donor transplants from being performed. Therefore, kidney paired donation (KPD) programs [13], also referred to as kidney exchanges, were established to circumvent these incompatibilities by allowing incompatible pairs to exchange donors: the candidate C of one pair is compatible with, and receives the kidney of, the donor D from another pair. Recently, a chain of kidney exchanges triggered by an altruistic donor has attracted great attention, because chain-based exchanges can be advantageous compared with cycle-based exchanges.

The goal of kidney exchanges is to make optimal decisions on which exchanges to perform. Li et al. [7] proposed a probability-based utility measure to assess a variety of uncertainties in KPD, so that the optimality is based on the overall expected utility of exchanges. This paper applies their method of expected utility to develop an algorithm and software. Other researchers have explored theoretical analyses and practical issues of chain exchanges, such as the algorithmic efficiency and optimal length of chains, and the benefits and computational limitations of integrating chains into a KPD program. An overall review of KPD is given in [18]. However, it remains unknown how best to utilize all the forms of simultaneous exchange cycles and nonsimultaneous chains in clinical decision making. In this paper, we consider a graphical model to determine optimal matches for a KPD program in which both cycle and chain exchanges are involved.
Through comprehensive simulations, we evaluate this model under exchange uncertainties, in comparison to the existing strategies for kidney exchanges. The major contributions of this paper are summarized as follows. First, extending the framework proposed by [7], we relax the current strategy of a KPD program so that optimal matches are selected for both exchange cycles and chains, allowing for both operational uncertainty and contingency plans. Second, we develop two algorithms, termed MEU-Parallel and MEU-Sequential, which, respectively, search simultaneously and sequentially for matches maximizing the expected utility of exchanges. Third, we build software for evaluating different organ allocation strategies and the effectiveness of policy; in particular, we build a user-friendly graphical interface which provides easy communication between clinicians and computer tools, thus facilitating convenient and high-quality clinical decision making. We first present the relevant mathematical formulation and algorithms for kidney exchanges in detail in Section II; the sections that follow describe the decision support system. Finally, we conclude and propose future work in Section V.

II. MATHEMATICAL FORMULATION

Let V be the number of vertices (nodes) and E the number of edges in a directed graph G. Each vertex in the graph G represents an incompatible donor-candidate pair or an AD, and each directed edge from vertex i to vertex j indicates that the donor kidney in vertex i is compatible with the candidate in vertex j. In this directed graph, each edge can be assigned a weight representing the edge utility u_ij of a transplant from i to j. In addition, an edge probability p_ij can be included for each edge to characterize the chance of an actual successful kidney transplant from i to j. For example, the edge utility u_ij could be obtained from a medical-outcome-based utility, such as the estimated total number of incremental years of life from transplant (LYFT) [19], which was proposed in the allocation policy for deceased-donor kidney transplants. The edge probability, on the other hand, can be based on clinical data from multiple existing KPD programs. Detailed discussion regarding edge utility and edge probability can be found in [7]. The task of optimizing matches on the graph can therefore be realized by solving an integer program, Equation (1) below, whose constraints indicate that no candidate can be involved in more than one exchange cycle or chain. Here, EU_c is the expected utility of cycle or chain c; the calculation of EU_c based on u_ij and p_ij for all possible cycle and chain configurations is discussed in [7].
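In standard KPD formulations, this optimization is an integer program over the set \(\mathcal{C}\) of admissible exchange cycles and chains, with a binary variable \(y_c\) indicating whether cycle or chain \(c\) is selected; the following is a reconstruction consistent with the constraint described above, though the paper's exact notation may differ:

\[ \max_{y} \; \sum_{c \in \mathcal{C}} \mathrm{EU}_c \, y_c \quad \text{subject to} \quad \sum_{c \in \mathcal{C} :\, v \in c} y_c \le 1 \;\; \text{for all } v \in V, \qquad y_c \in \{0, 1\}. \tag{1} \]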
Turning now to breadth-first search (BFS): in the serial BFS code, the graph is stored in compressed sparse row form with an Offsets array and an Edges array, and I have a parent array with one entry per vertex, initialized to negative 1, plus a queue array where I store the vertices to explore. And then I also have two integers that point to the front and the back of the queue.
So initially, I place the source vertex on the queue; the front of the queue is at position 0, and the back is at position 1. Then, while the queue is not empty, I dequeue the vertex at the front and set current to be that vertex. And then I'll compute the degree of that vertex, which I can do by looking at the difference between consecutive offsets. And I also assume that Offsets of n is equal to m, just to deal with the last vertex.
And then I'm going to loop through all of the neighbors for the current vertex. And to access each neighbor, what I do is I go into the Edges array. And I know that my neighbors start at Offsets of current.
And therefore, to get the i-th neighbor, I just do Offsets of current plus i. That's my index into the Edges array. Now I'm going to check if my neighbor has been explored yet. And I can check that by checking if parent of neighbor is equal to negative 1.
If it is, that means I haven't explored it yet. And then I'll set parent of neighbor to be current and place the neighbor at the back of the queue. And I'm just going to keep repeating this while loop until the queue becomes empty. And here, I'm only generating the parent pointers. But I could also generate the distances if I wanted to with just a slight modification of this code. So any questions on how this code works?
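Here's a sketch of what that serial code might look like in C. This is a minimal rendering of the description above, not the lecture's exact code; it assumes the CSR representation with Offsets having n + 1 entries, so that Offsets[n] equals the number of edges m.

```c
#include <stdint.h>
#include <stdlib.h>

// Serial BFS over a graph in compressed sparse row (CSR) format.
// Offsets has n + 1 entries with Offsets[n] == m; Edges has m entries.
// On return, parent[v] is v's BFS parent, or -1 if v was never reached.
void bfs_serial(int64_t n, const int64_t *Offsets, const int64_t *Edges,
                int64_t source, int64_t *parent) {
  int64_t *queue = malloc(sizeof(int64_t) * n);
  for (int64_t i = 0; i < n; i++) parent[i] = -1;  // -1 means unexplored
  parent[source] = source;
  queue[0] = source;                   // front at position 0, back at 1
  int64_t front = 0, back = 1;
  while (front != back) {              // while the queue is not empty
    int64_t current = queue[front++];  // dequeue the front vertex
    int64_t degree = Offsets[current + 1] - Offsets[current];
    for (int64_t i = 0; i < degree; i++) {
      int64_t ngh = Edges[Offsets[current] + i];  // i-th neighbor
      if (parent[ngh] == -1) {         // not explored yet
        parent[ngh] = current;         // generate the parent pointer
        queue[back++] = ngh;           // enqueue the neighbor
      }
    }
  }
  free(queue);
}
```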
OK, so here's a question. What's the most expensive part of the code? Can you point to one particular line that is the most expensive? It's the access to the parent array when we check whether a neighbor has been explored. Because whenever we access this parent array, the neighbor can appear anywhere in memory. So that's going to be a random access. And if the parent array doesn't fit in our cache, then that's going to cost us a cache miss almost every time.
This Edges array is actually mostly accessed sequentially, because for each vertex, all of its edges are stored contiguously in memory. We do have one random access into the Edges array per vertex, because we have to look up the starting location for that vertex. But it's not one per edge, unlike the check of the parent array, which occurs for every edge. So does that make sense? So let's do a back-of-the-envelope calculation to figure out how many cache misses we would incur, assuming that we start with a cold cache.
And we also assume that n is much larger than the size of the cache, so we can't fit any of these arrays in cache. We'll assume that a cache line has 64 bytes and integers are 4 bytes each. So let's try to analyze this. Initializing the parent array takes about n/16 cache misses. And the reason is that we're initializing this array sequentially, so we're accessing contiguous locations, and this can take advantage of spatial locality: on each cache line, we can fit 16 of the integers, so we only incur about one cache miss per 16 entries.
The enqueue and dequeue operations similarly take about n/16 cache misses in total, because again, this is going to be a sequential access into this queue array. To compute the degree here, that's going to take n cache misses overall, because each of these accesses to the Offsets array is going to be a random access: we have no idea what the value of current here is; it could be anything. So across the entire algorithm, we're going to need n cache misses to access this Offsets array.
Accessing the Edges array takes at most m/16 plus 2n cache misses. So does anyone see where that bound comes from? The m/16 term is there because you're accessing the Edges contiguously. But we also have to add 2n, because whenever we access the Edges for a particular vertex, the first cache line might not only contain that vertex's edges. And similarly, the last cache line that we access might also not just contain that vertex's edges.
So therefore, we're going to waste the first cache line and the last cache line in the worst case for each vertex, and summed across all vertices, that's going to be 2n. Accessing this parent array is going to be a random access every time, so we're going to incur a cache miss in the worst case every time. Summed across all edge accesses, that's going to be m cache misses.
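Adding up the contributions above gives a rough total; the exact constants on the sequential terms are not important, only that they are small:

\[ \underbrace{\tfrac{n}{16}}_{\text{parent init}} + \underbrace{\approx \tfrac{n}{16}}_{\text{queue}} + \underbrace{n}_{\text{Offsets}} + \underbrace{2n + \tfrac{m}{16}}_{\text{Edges}} + \underbrace{m}_{\text{parent checks}} \;\approx\; 3n + m. \]

So the two leading terms are about 3n and m.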
And if m is greater than 3n, then the m term is going to dominate, and m is usually greater than 3n in most real-world graphs. That dominant term comes from the random accesses into the parent array. So let's see if we can optimize this code so that we get better cache performance. So let's say we could fit a bit vector of size n into cache.
But we couldn't fit the entire parent array into cache. What can we do to reduce the number of cache misses? So does anyone have any ideas?
So we're going to use a bit vector to store whether the vertex has been explored yet or not. So we only need 1 bit for that. We're not storing the parent ID in this bit vector. We're just storing a bit to say whether that vertex has been explored yet or not. And then, before we check this parent array, we're going to first check the bit vector to see if that vertex has been explored yet.
And if it has already been explored, we don't even need to access this parent array. If it hasn't been explored, then we will go ahead and access the parent entry of the neighbor. But we only have to do this one time for each vertex in the graph because we can only visit each vertex once. And therefore, we can reduce the number of cache misses from m down to n. So overall, this might improve the number of cache misses.
In fact, it does if the number of edges is large enough relative to the number of vertices. However, you do have to do a little bit more computation because you have to do bit vector manipulation to check this bit vector and then also to set the bit vector when you explore a neighbor.
So here's the code using the bit vector optimization. So here, I'm initializing this bit vector called visited. And then I'm setting all of the bits to 0, except for the source vertex, where I'm going to set its bit to 1.
And I'm doing this bit calculation here to figure out the bit for the source vertex. And then now, when I'm trying to visit a neighbor, I'm first going to check if the neighbor is visited by checking this bit array. And I can do this using this computation here: I index into visited at neighbor divided by 32, and AND it with this mask, 1 left-shifted by neighbor mod 32. And if that's 0, that means the neighbor hasn't been visited yet.
So I'll go inside this IF clause. And then I'll set the visited bit to be true using this statement here. And then I do the same operations as I did before. It turns out that this version is faster for large enough values of m relative to n because you reduce the number of cache misses overall. You still have to do this extra computation here, this bit manipulation.
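Here's a sketch of that bit-vector version in C, again my own rendering of the idea rather than the lecture's exact code:

```c
#include <stdint.h>
#include <stdlib.h>

// Serial BFS with the bit-vector optimization: "visited" packs one bit per
// vertex, so it is 32x smaller than the parent array and may fit in cache
// when the parent array does not.
void bfs_serial_bitvector(int64_t n, const int64_t *Offsets,
                          const int64_t *Edges, int64_t source,
                          int64_t *parent) {
  uint32_t *visited = calloc((n + 31) / 32, sizeof(uint32_t));  // all 0s
  int64_t *queue = malloc(sizeof(int64_t) * n);
  for (int64_t i = 0; i < n; i++) parent[i] = -1;
  visited[source / 32] |= 1u << (source % 32);  // mark source as visited
  parent[source] = source;
  queue[0] = source;
  int64_t front = 0, back = 1;
  while (front != back) {
    int64_t current = queue[front++];
    int64_t degree = Offsets[current + 1] - Offsets[current];
    for (int64_t i = 0; i < degree; i++) {
      int64_t ngh = Edges[Offsets[current] + i];
      // Check the bit vector first; only touch the parent array on a first
      // visit, which happens at most once per vertex (n misses, not m).
      if (!(visited[ngh / 32] & (1u << (ngh % 32)))) {
        visited[ngh / 32] |= 1u << (ngh % 32);  // set the visited bit
        parent[ngh] = current;
        queue[back++] = ngh;
      }
    }
  }
  free(visited);
  free(queue);
}
```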
But if m is large enough, then the reduction in the number of cache misses outweighs the additional computation that you have to do. OK, so that was a serial implementation of breadth-first search. Now let's look at a parallel implementation. So I'm first going to do an animation of how a parallel breadth-first search algorithm would work. The parallel breadth-first search algorithm is going to operate on frontiers, where the initial frontier contains just the source vertex.
And on every iteration, I'm going to explore all of the vertices on the frontier and then place any unexplored neighbors onto the next frontier. And then I move on to the next frontier. So in the first iteration, I'm going to mark the source vertex as explored, set its distance to be 0, and then place the neighbors of that source vertex onto the next frontier. In the next iteration, I'm going to do the same thing, set these distances to 1.
I also am going to generate a parent pointer for each of these vertices. And this parent should come from the previous frontier, and it should be a neighbor of the vertex. And here, there's only one option, which is the source vertex. So I'll just pick that as the parent. And then I'm going to place the neighbors onto the next frontier again, mark those as explored, set their distances, and generate a parent pointer again.
And notice here, when I'm generating these parent pointers, there's actually more than one choice for some of these vertices. And this is because there are multiple vertices on the previous frontier. And some of them explored the same neighbor on the current frontier. So a parallel implementation has to be aware of this potential race.
Here, I'm just picking an arbitrary parent. So as we see here, you can process each of these frontiers in parallel: you can parallelize over all of the vertices on the frontier as well as over all of their outgoing edges. However, you do need to finish processing one frontier before you move on to the next one in this BFS algorithm. And a parallel implementation has to be aware of potential races. As I said earlier, we could have multiple vertices on the frontier trying to visit the same neighbors.
So somehow, that has to be resolved. And also, the amount of work on each frontier changes throughout the course of the algorithm, so you have to be careful with load balancing: you have to make sure that the amount of work each processor has to do is about the same. If you use Cilk to implement this, then load balancing doesn't really become a problem, because the work-stealing scheduler takes care of it for you. So any questions on the BFS algorithm before I go over the code?
OK, so here's the actual code. And here I'm going to initialize these four arrays, so the parent array, which is the same as before. I'm going to have an array called frontier, which stores the current frontier. And then I'm going to have an array called frontierNext, which is a temporary array that I use to store the next frontier of the BFS.
And then also I have an array called degrees. I'm going to place the source vertex at the 0-th index of the frontier. I'll set the frontierSize to be 1. And then I set the parent of the source to be the source itself. While the frontierSize is greater than 0, that means I still have more work to do.
And then I'll set the i-th entry of the degrees array to be the degree of the i-th vertex on the frontier. And I can do this just using the difference between consecutive offsets.
And then I'm going to perform a prefix sum on this degrees array. And we'll see in a minute why I'm doing this prefix sum. But first of all, does anybody recall what prefix sum is? So let's say this is our input array. The output array would store, for each location, the sum of everything before that location in the input array.
So here we see that the first position has a value of 0, because the sum of everything before it is 0; there's nothing before it in the input.
The second position has a value of 2 because the sum of everything before it is just the first location. The third location has a value of 6 because the sum of everything before it is 2 plus 4, which is 6, and so on. So I believe this was on one of your homework assignments. So hopefully, everyone knows what prefix sum is. And later on, we'll see how we use this to do the parallel breadth-first search.
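As a concrete reference, here's a minimal serial sketch in C; a parallel implementation computes the same result in O(log n) span, but this shows the semantics. The input values are chosen to match the board example:

```c
#include <stdint.h>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
// For example, in = {2, 4, 3} gives out = {0, 2, 6}.
void prefix_sum(const int64_t *in, int64_t *out, int64_t n) {
  int64_t sum = 0;
  for (int64_t i = 0; i < n; i++) {
    out[i] = sum;      // sum of everything before position i
    sum += in[i];
  }
}
```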
OK, so I'm going to do a prefix sum on this degrees array. And then I'm going to loop over my frontier again in parallel. I'm going to let v be the i-th vertex on the frontier. Index is going to be equal to degrees of i. And then my degree is going to be Offsets of v plus 1 minus Offsets of v. Now I'm going to loop through all v's neighbors. And here I just have a serial for loop. But you could actually parallelize this for loop.
It turns out that if the number of iterations in the for loop is small enough, there's additional overhead to making this parallel, so I just made it serial for now. But you could make it parallel. To get the neighbor, I just index into this Edges array.
I look at Offsets of v plus j. Then now I'm going to check if the neighbor has been explored yet. And I can check if parent of neighbor is equal to negative 1. So that means it hasn't been explored yet, so I'm going to try to explore it. And I do so using a compare-and-swap. I'm going to try to swap in the value of v with the original value of negative 1 in parent of neighbor. And the compare-and-swap is going to return true if it was successful and false otherwise.
And if it returns true, that means this vertex becomes the parent of this neighbor. And then I'll place the neighbor onto frontierNext at this particular index, index plus j. And otherwise, I'll set a negative 1 at that location. OK, so let's see why I'm using index plus j here. So here's how frontierNext is organized: each vertex on the frontier owns a subset of the locations in this frontierNext array, and these are all contiguous memory locations.
And it turns out that the starting location for each of these vertices in this frontierNext array is exactly the value in this prefix sum array up here. So vertex 1 has its first location at index 0. Vertex 2 has its first location at index 2. Vertex 3 has its first location at index 6, and so on. So by using a prefix sum, I can guarantee that all of these vertices have a disjoint subarray in this frontierNext array.
And then they can all write to this frontierNext array in parallel without any races. And index plus j just gives us the right location to write to in this array. So index is the starting location, and then j is for the j-th neighbor. So here is one potential output after we write to this frontierNext array. So we have some non-negative values.
And these are vertices that we explored in this iteration. We also have some negative 1 values. And the negative 1 here means that either the vertex has already been explored in a previous iteration, or we tried to explore it in the current iteration, but somebody else got there before us.
Because somebody else is doing the compare-and-swap at the same time, and they could have finished before we did, so we failed on the compare-and-swap. So we don't actually want these negative 1 values, so we're going to filter them out. And we can filter them out using a prefix sum again. And this is going to give us a new frontier. And we'll set the frontierSize equal to the size of this new frontier. And then we repeat this while loop until there are no more vertices on the frontier.
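Here's a sketch of one round of that parallel code in C with Cilk. This is a minimal sketch assuming an OpenCilk-style toolchain and the GCC/Clang __sync_bool_compare_and_swap builtin; the prefix sum and the filter are written serially here for brevity, though a real implementation would parallelize them too:

```c
#include <stdint.h>
#include <cilk/cilk.h>

// One round of parallel BFS over the CSR graph (Offsets/Edges).
// parent[v] == -1 means v is unexplored. degrees and frontierNext are
// scratch arrays sized for the frontier's total degree. Returns the size
// of the new frontier, which is written back into frontier.
int64_t bfs_round(const int64_t *Offsets, const int64_t *Edges,
                  int64_t *parent, int64_t *frontier, int64_t frontierSize,
                  int64_t *degrees, int64_t *frontierNext) {
  // Compute each frontier vertex's degree from consecutive offsets.
  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    degrees[i] = Offsets[v + 1] - Offsets[v];
  }
  // Exclusive prefix sum over degrees (serial here; parallelizable in
  // O(log n) span). degrees[i] becomes the start of vertex i's subarray.
  int64_t totalDegree = 0;
  for (int64_t i = 0; i < frontierSize; i++) {
    int64_t d = degrees[i];
    degrees[i] = totalDegree;
    totalDegree += d;
  }
  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    int64_t index = degrees[i];  // start of v's slots in frontierNext
    int64_t degree = Offsets[v + 1] - Offsets[v];
    for (int64_t j = 0; j < degree; j++) {  // could also be parallel
      int64_t ngh = Edges[Offsets[v] + j];
      // Claim the neighbor with a compare-and-swap; exactly one contender
      // succeeds, and everyone else writes -1 into their slot instead.
      if (parent[ngh] == -1 &&
          __sync_bool_compare_and_swap(&parent[ngh], -1, v)) {
        frontierNext[index + j] = ngh;
      } else {
        frontierNext[index + j] = -1;
      }
    }
  }
  // Filter out the -1 entries to form the new frontier (serial for
  // brevity; the parallel version uses flags and another prefix sum).
  int64_t newSize = 0;
  for (int64_t i = 0; i < totalDegree; i++)
    if (frontierNext[i] != -1) frontier[newSize++] = frontierNext[i];
  return newSize;
}
```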
So any questions on this parallel BFS algorithm? One question that comes up is how the filter works. The idea is that you first generate an array that stores a 1 for every element you want to keep and a 0 otherwise. Then you do a prefix sum on that array, which gives us unique offsets into an output array. So then every element you want to keep just looks at the prefix sum array and writes to the output array at that offset. It might be easier if I draw this on the board. OK, so let's say we have an array of size 5 here. What I'm going to do is generate another array which stores a 1 if the value in the corresponding location is not a negative 1, and 0 otherwise.
And then I do a prefix sum on this array here, and this gives me 0, 1, 1, 2, and 2. And now each of the values that is not negative 1 can just look up the corresponding entry in this prefix sum array, and that gives us a unique index into the output array.
So this element will write to position 0, this element will write to position 1, and this element will write to position 2 in my final output. So this would be my final frontier. Does that make sense?
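Here's a sketch of that filter in C (serial loops for clarity; all three loops parallelize, with the prefix sum taking O(log n) span). The input values below are made up, but the offsets match the board example: an input of {5, -1, 3, -1, 8} gives flags {1, 0, 1, 0, 1}, offsets {0, 1, 1, 2, 2}, and output {5, 3, 8}.

```c
#include <stdint.h>

// Pack the entries of "in" that are not -1 into "out", keeping their order.
// flags and offsets are scratch arrays of length n. Returns the output size.
int64_t filter_negatives(const int64_t *in, int64_t n, int64_t *flags,
                         int64_t *offsets, int64_t *out) {
  for (int64_t i = 0; i < n; i++)
    flags[i] = (in[i] != -1);          // 1 if we keep this entry, else 0
  int64_t sum = 0;
  for (int64_t i = 0; i < n; i++) {    // exclusive prefix sum of the flags
    offsets[i] = sum;
    sum += flags[i];
  }
  for (int64_t i = 0; i < n; i++)
    if (flags[i])
      out[offsets[i]] = in[i];         // each kept entry gets a unique slot,
  return sum;                          // so parallel writes are race-free
}
```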
OK, so let's now analyze the work and span of this parallel BFS algorithm. The number of iterations required by the BFS algorithm is upper-bounded by the diameter D of the graph, where the diameter is the maximum shortest-path distance between any pair of vertices in the graph. And that's an upper bound on the number of iterations we need to do. Each iteration takes theta of log m span, assuming that the inner loop over the neighbors of a vertex is parallelized. So to get the span, we just multiply these two terms, and we get theta of D times log m span. What about the work? To compute the work, we have to figure out how much work we're doing per vertex and per edge. So first, notice that the sum of the frontier sizes across the entire algorithm is going to be n, because each vertex can be on the frontier at most once.
Also, each edge is going to be traversed exactly once. So that leads to m total edge visits. On each iteration of the algorithm, we're doing a prefix sum. And the cost of this prefix sum is going to be proportional to the frontier size. So summed across all iterations, the cost of the prefix sum is going to be theta of n.
We also have to do this filter. But the work of the filter is proportional to the number of edges traversed in that iteration. And summed across all iterations, that's going to give theta of m total. So overall, the work is going to be theta of n plus m for this parallel BFS algorithm. So this is a work-efficient algorithm.
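Writing the bounds out, with \(T_1\) for work and \(T_\infty\) for span:

\[ T_1(n, m) = \Theta(n + m), \qquad T_\infty(n, m, D) = \Theta(D \log m). \]

Dividing the work by the span gives a parallelism of \(\Theta\big((n + m)/(D \log m)\big)\), which is large for low-diameter graphs.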
The work matches that of the serial algorithm. Any questions on the analysis? OK, so let's look at how this parallel BFS algorithm runs in practice. So here, I ran some experiments on a random graph with 10 million vertices and 100 million edges. The edges were randomly generated, and I made sure that each vertex had 10 edges. I ran the experiments on a 40-core machine with 2-way hyperthreading. Does anyone know what hyperthreading is?
So hyperthreading is an Intel technology where for each physical core, the operating system actually sees it as two logical cores. They share many of the same resources, but they have their own registers. So if one of the logical cores stalls on a long latency operation, the other logical core can use the shared resources and hide some of the latency.
OK, so here I am plotting the speedup over the single-threaded time of the parallel algorithm versus the number of threads. So we see that on 40 threads, we get a speedup of about 22 or 23X. And when we turn on hyperthreading and use all 80 threads, the speedup is about 32 times on 40 cores. And this is actually pretty good for a parallel graph algorithm. It's very hard to get very good speedups on these irregular graph algorithms.
So 32X on 40 cores is pretty good. I also compared this to the serial BFS algorithm because that's what we ultimately want to compare against. So we see that on 80 threads, the speedup over the serial BFS is about 21, 22X.
This is because the serial BFS is doing less work than the parallel version: the parallel version has to do extra work for the prefix sum and the filter, whereas the serial version doesn't have to do that. But overall, the parallel implementation is still pretty good. OK, questions? So a couple of lectures ago, we saw this slide here. Charles told us never to write nondeterministic parallel programs, because it's very hard to debug these programs and hard to reason about them. So is there nondeterminism in this BFS code that we looked at?
So let's go back to the code. This compare-and-swap here: there's a race, because we can have multiple vertices trying to write to the parent entry of the neighbor at the same time, and the one that wins is nondeterministic. So the BFS tree that you get at the end is nondeterministic. OK, so let's see how we can try to fix this nondeterminism. So, as we said, this is the line that causes the nondeterminism. It turns out that we can actually make the output BFS tree deterministic by going over the outgoing edges in each iteration in two phases.
So how this works is that in the first phase, the vertices on the frontier are still going to write to the parent array, but they're going to be using this writeMin operator. And the writeMin operator is an atomic operation that guarantees that when we have concurrent writes to the same location, the smallest value gets written there. So the value that gets written there is going to be deterministic.
It's always going to be the smallest one that tries to write there. Then in the second phase, each vertex v on the frontier is going to check, for each neighbor, whether parent of neighbor is equal to v.
If it is, that means it was the vertex that successfully wrote to parent of neighbor in the first phase. And therefore, it's going to be responsible for placing this neighbor onto the next frontier.
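The lecture treats writeMin as a primitive; one common way to implement it (a sketch, not necessarily the lecture's implementation) is a compare-and-swap loop:

```c
#include <stdint.h>

// Atomically set *addr = min(*addr, newval). Under concurrent calls, the
// location always ends up holding the smallest value, regardless of timing.
static inline void write_min(int64_t *addr, int64_t newval) {
  int64_t old = *addr;
  while (newval < old) {
    // Install newval only if *addr still holds old; otherwise some other
    // thread wrote in the meantime, so reread and retry if still smaller.
    if (__sync_bool_compare_and_swap(addr, old, newval)) return;
    old = *addr;
  }
}
```

With writeMin, the smallest contending vertex ID always wins, so the resulting BFS tree is the same on every run.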