Epic Graph Battles of History: Chaos vs Order

Today we're going to get the report on an impromptu throw-down between two approaches to graph processing:

  • Chaos is a new scalable system, due to appear at SOSP 2015. It isn't yet public, but my understanding is that it is basically a beast at sequentially streaming through edge data, across as many machines as you can swing.

  • Order was the subject of the recent blog post that stirred up this brouhaha: can something as simple as sorting empower the lowly laptop to compete with the scalable systems? Order isn't actually the name of a system, but it should be.

This started out with me posting about sorting vs not-sorting, and how sorting might be better. There was even a paragraph on when it might not be better.

An X-Stream author fired back, with unfortunate statements like:

This is probably why Frank only evaluated Pagerank on Twitter to illustrate the utility of Hilbert curves. This (and the COST paper in general) is an unfortunate example of doing "single data point science".

This hasn't been true ever, which a read through the COST paper would have revealed, and back in February we even went through using the same technique on the 128 billion edge Common Crawl graph. The only reason I'm not posting a screenshot of me telling this same author about these results on Facebook several months ago is that I can't drive their ridiculous search tools.

Various tweetles, much #trashtalk, and some idle troublemaking on my part got us to a state with the questions raised: "How well does the Hilbert curve approach work for Breadth First Search?", and "Will you wager your mustache that Chaos beats a laptop?". Reflecting back on this, I am probably a bad person.

We're going to find out about the answer to the first question, and my understanding is that we'll also hear how Chaos works for BFS on the same data at a closely related time. I've asked twice, and I am afraid the second question must be taken to be answered in the negative.

For full disclosure, before you read on and get disappointed: I don't actually have the Chaos numbers yet. Not my department. I have been asked to disclose my numbers first, which I actually think is totally fair, because it is wildly annoying to publish numbers and have everyone else work just barely hard enough to beat them and announce "science". But, I will (optimistically) approximate how I think Chaos should perform (as a perfectly scaling version of the authors' prior work) until I get specific numbers from them.

And if you just want to see my numbers, you can scan to the bottom, just above the hilariously presumptive (but totally legit) moralizing.


Update: Despite leading with tweets like

[tweet: Promises]

we have just learned that

[tweet: Broken]

Amitabha did reveal some PageRank numbers (updated in the text, and in the summary), and also that I should cite X-Stream for one of the steps we take. Details in-line.


In the first corner: Unordered Scans

One can process graphs by repeatedly scanning edge lists sequentially. No matter which edges you need, you'll see them again in the next pass, so you can do pretty much anything if you don't mind doing a possibly large number of scans.

We used this interface for the COST measurements, but it is also the approach taken by much more substantial systems. X-Stream in particular is designed to do sequential scans, lots of em, and as fast as one can hope given the hardware.

If you've been following graph processing, you've probably seen lots of algorithms written as sequential scans. PageRank is probably the foremost example, which as a somewhat dressed up matrix-vector multiplication just wants to see all edges in any order:

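// each edge hands dst a 1/degree[src] share of src's damped rank (plus the teleport term)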
for &(src, dst) in &edges {
    new_rank[dst] += (0.85 * old_rank[src] + 0.15) / degree[src];
}

It really doesn't matter in what order you process the edges, or even that they get processed at the same machine, so long as you eventually add everything up.

My laptop does twenty pagerank iterations over the Common Crawl hyperlink data in about 12.9 hours. Oof. There is no reason thirty-two identically configured machines shouldn't be able to crank through the computation in roughly one thirty-second of the time. Maybe a bit more because of networks and such.

The point is: this is the sort of thing X-Stream (and Chaos, reportedly) are designed for. If breadth-first search ends up looking like PageRank, my laptop and I are in deep doo-doo.


Update: The Chaos number for twenty iterations of PageRank is 6220 seconds, "unoptimized" (?). This is about one eighth of the laptop time above, so absolutely better.


In the other corner: Ordered Edges

An alternate approach is to think harder about your set of edges, put them in some sort of order, and then exploit that order to do graph processing more efficiently.

On my laptop, you can take the Common Crawl hyperlink data, provided by Web Data Commons, and order the edges so that they compress down to about 2.8 bits per edge. There are other ways to do this, but I chose to order the edges along a Hilbert space-filling curve, which teases out locality: the differences between subsequent elements are often small, and therefore compress well.

All this meant that I could fit a 128 billion edge graph, which would be 1TB if written as (u32, u32) pairs, into 42GB on my laptop's SSD. If the same compression level held, a one trillion edge graph would be 336GB, which would also fit on my SSD. I can read the data back out at what is effectively 2GB/s of (u32, u32) edge data, because the compression amplifies the 500MB/s throughput of my SSD.
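
To give a flavor of why small differences help, here is a byte-oriented sketch of gap encoding for sorted Hilbert indices. The actual format is bit-level and much tighter (that is where the 2.8 bits per edge comes from), but the principle is the same: store the gap to the previous index rather than the index itself.

// A sketch (not the real format): delta + variable-length bytes for sorted Hilbert indices.
fn write_gaps(sorted: &[u64], out: &mut Vec<u8>) {
    let mut prev = 0u64;
    for &index in sorted {
        let mut gap = index - prev;
        prev = index;
        // emit seven bits per byte; the high bit set means "more bytes follow"
        loop {
            let byte = (gap & 0x7F) as u8;
            gap >>= 7;
            if gap == 0 { out.push(byte); break; }
            out.push(byte | 0x80);
        }
    }
}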

In addition to streaming faster, we can use a broader class of algorithms when all the data are in one place. A great example is graph connectivity: in one scan over the edges, union-find discovers the full connected component structure of the graph, using additional memory linear in the number of nodes.

let mut roots: Vec<u32> = (0..nodes).collect();
let mut ranks = vec![0u8; nodes];

for &(mut src, mut dst) in &edges {

    // look up roots of src and dst
    while src != roots[src] { src = roots[src]; }
    while dst != roots[dst] { dst = roots[dst]; }

    if src != dst {
        // swing pointer from lower rank to higher
        match ranks[src].cmp(&ranks[dst]) {
            Ordering::Less    => { roots[src] = dst as u32 },
            Ordering::Greater => { roots[dst] = src as u32 },
            Ordering::Equal   => { roots[dst] = src as u32;
                                   ranks[src] += 1 },
        }
    }
}

You might worry that this looks like lots of horrible random access. If the number of nodes grows larger than main memory, won't we be boned?

The Hilbert curve layout has a nice locality property, that nearby edges in the curve order will (largely) have nearby src and dst fields. The curve trades the possibility of extreme locality in either (e.g. sorting by one of the fields) for good locality in both. This means that as we go through the edges, we are looking at localized parts of roots and ranks.

Note: This isn't exactly true of all the roots pointer chasing, as written. But, the Hilbert-curve order does ensure that after each power-of-two number of nodes, we have discovered the full connectivity structure on the graph restricted to those nodes; each node might as well point directly to its root (and we could do that in a sequential pass over roots, if it were important).

The standard approach to connectivity in an unordered scan setting is the "label propagation" algorithm. This approach repeatedly circulates node ids around the graph, converging in a number of iterations no less than the distance from the best-labeled node to any other node. In the Common Crawl data, there seems to be a vertex at distance 330 from vertex zero, so naive label propagation shouldn't expect to do fewer than 330 iterations, and can't do fewer than 165 (if it prioritized the label of the midpoint of the 330 long path).
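
For concreteness, here is a minimal sketch of that loop (using `nodes` and `edges` as in the snippets above, and treating edges as undirected):

// A sketch of label propagation for connectivity: every vertex repeatedly adopts the
// smallest label among itself and its neighbors, until a pass changes nothing.
let mut label: Vec<u32> = (0..nodes as u32).collect();
let mut changed = true;
while changed {
    changed = false;
    for &(src, dst) in &edges {
        let (s, d) = (src as usize, dst as usize);
        if label[s] < label[d]      { label[d] = label[s]; changed = true; }
        else if label[d] < label[s] { label[s] = label[d]; changed = true; }
    }
}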

By comparison, the union-find code above does one scan through the graph. Even if thirty two machines take one thirty-second the time of my laptop, the union find approach will still be ten times faster.

Breadth-First Search

We aren't here to do PageRank, and we aren't here to do graph connectivity. We are here for breadth-first search: labeling each graph node with the minimum number of edges that must be traversed from some root node (I'm using node zero).

Update: Apparently we also need a parent index for each node; whatever, we'll do that too.

There is a pretty simple modification to label propagation that circulates distances rather than labels:

// maxing out at 2^16 distance.
let mut dist = vec![65535u16; nodes];
dist[0] = 0;    // the root (node zero) is at distance zero

let mut old_sum: u64 = dist.iter().map(|&d| d as u64).sum::<u64>() + 1;
let mut new_sum: u64 = dist.iter().map(|&d| d as u64).sum();

while new_sum < old_sum {
    for &(src, dst) in &edges {
        match dist[src].cmp(&dist[dst]) {
            Ordering::Less    => dist[dst] = dist[src].saturating_add(1),
            Ordering::Greater => dist[src] = dist[dst].saturating_add(1),
            Ordering::Equal   => { },
        }
    }

    old_sum = new_sum;
    new_sum = dist.iter().map(|&d| d as u64).sum();
}

Sweet, nailed it! Time to announce that BFS is underway, and go have a drink.

Ok, we are back from our drink, and it is still running. It's done five iterations, so let's look at how it is moving along. Here are the counts of vertices at distances zero to ten after each of the first five passes:

[figure: vertex counts at distances zero to ten after each of the first five passes]

Nice. Things are starting to come together, getting hard to tell them apart. We should be done pretty soon.

Ok. It's been a while, and it's not done. Let's look at the number of vertices at distances zero to ten for iterations 50-54:

[figure: vertex counts at distances zero to ten for iterations 50-54]

Oh, right. They are each exactly the same. That should be true at least up to distance 50, because we've explored literally every length 50 path at this point. Let's look at the first place the counts are different, distances 70-79:

First five passes

Well, that's annoying. They are different, but not much, and geez who really cares anyhow? But, we said we were going to compute this thing, so...

Let's think. Each of these iterations takes 1500 seconds and there seem to be a bunch of them. There shouldn't be more than 330 of them, since I already told you that was the farthest distance from node zero, but let's say there were that many.

...calculates...

137.5 hours?

I think this is about how long X-Stream would need (modulo hardware), and Chaos should be 32x faster. Hump that.

As it turns out you only need 166 iterations to converge, because each pass of the algorithm above pushes information along any path that is a subsequence of the ordered edges. To actually need 330 passes, that distance-330 node would have to lie at the end of a shortest path all of whose edges appear in reverse order in the edge order; instead, only about half do. Exactly half, it turns out (the 166th iteration only confirms that we have converged).

69.2 hours. Half the iterations a synchronous scan-based system would have to do. Boom.

That's pretty much three days, which isn't great, but it's how long this computation takes using this type of approach.

Or, how long it would take. If I were to run it.

Excuse me?

Yeah I didn't do that computation. Seriously, three days? My laptop has better shit to do.

No, we are going to be smarter.

I don't see the laptop beating the scalable systems at their own game: scanning lots and lots of edges at a very high rate. Instead, we need to channel the example of graph connectivity: take advantage of the fact that we have more flexibility than the scalable systems, with their restrictive programming model. We have to use our brains, and then do something interesting.

Something interesting

In BFS iteration i, each vertex at distance less than or equal to i has its distance fixed: it will not change. We have looked at all paths of length i or less and not found anything better.

Ok. These vertices have their distances fixed. It's not like we can ignore them, though; they might result in smaller distances for their neighbors, right?

Nope. Once we have finished a pass, all edges from fixed vertices will have had whatever effect they might have. The neighbors will get the opportunity to be at distance i+1, and whether they take it or not, those edges will never present a better opportunity. Edges from vertices with fixed distances are no longer useful for us.
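
Counting them is easy to fold into a pass; a sketch, with `iter` the (u16) index of the pass we just finished and `dist` as before:

// An edge is still "active" after pass `iter` only if neither endpoint's distance is fixed yet.
let mut active: u64 = 0;
for &(src, dst) in &edges {
    if dist[src as usize] > iter && dist[dst as usize] > iter {
        active += 1;
    }
}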

Are there lots of these useless edges? Maybe they are getting in our way. Let's see how the number of "active" edges evolves as we do our iterations:

iteration: 1 @ 1557.82s;    active: 128561867719
iteration: 2 @ 3083.91s;    active: 128530887679
iteration: 3 @ 4600.89s;    active: 122869425331
iteration: 4 @ 6118.64s;    active: 68871095476
iteration: 5 @ 7626.10s;    active: 17992146960
iteration: 6 @ 9123.97s;    active: 2802148909
iteration: 7 @ 10625.84s;   active: 1154659186
iteration: 8 @ 12122.87s;   active: 865194623

Wow. So many of the edges are useless. That sucks. Each iteration past 7 is using less than 1% of the edges it reads in. What a tremendous waste. I mean, it's not like we can skip them without some sort of index (hey not a bad idea!), and we aren't going to re-write the edge list each iteration to pull out the meaningful ones, because we don't have the spare space.

Using all parts of the computer

Ok, think. Think think. We can see these active edges as they fly past, and instead of reading them from disk over and over, we would like somehow to "remember" them for the next pass. "Remember"...

Hey, our computer has "memory", right? Not enough to fit the whole graph, of course. But, once the number of active edges drops below, say one billion, let's just stash them in memory and stop reading data from the disk. The first eight iterations aren't going to look any different, but things should go faster afterwards, right?

iteration: 9 @ 13398.91s;
iteration: 10 @ 13402.89s;  
...
iteration: 165 @ 13992.01s;
iteration: 166 @ 13995.77s;
Echidnatron%

Oh cool. 13,995.77 seconds. That's about four hours. Kinda decent, maybe?


Update: Amitabha would like to call attention to Section 5.3, paragraph 4 of the X-Stream paper, which is totally fair. I think what he is referring to is

Exploring generic stream compression algorithms as well as those specific to graphs [11], or performing extra passes to eliminate those edges that are no longer needed are important avenues of exploration we are pursuing to reduce wastage in X-Stream.

If it turns out that Chaos has a general way of doing this, I'll be delighted. It is a hard problem, and the solutions seem to be different for each problem. Vertex values don't stabilize for the same reasons in BFS as they do in Label Propagation as they do in PageRank, and if they've developed a general purpose solution for that, awesome. On the other hand, the quote is from a section discussing the wastage in BFS on geometric graphs (road networks) where the problem is that most edges don't actually become inactive early, the graphs just have fundamentally large diameter, which challenges approaches that can only do scans.

Paragraph 5 continues:

We conclude that X-Stream is an efficient way to execute a variety of algorithms on real-world graphs, its only limitation being graphs whose structure requires a large number of iterations.

I just want to call out a little bit of bias here. The union_find algorithm requires just one pass. I don't think the number of iterations is the issue, but rather being forced to use a certain type of iteration. More on this later.


A wild tweeter appears!

I was pretty good to go here. But... someone started up with the trash talk.

The goalposts got moved a bit from where I thought they were, and it became important to include not just the distance for each node, but also its parent pointer, which can be any connected node one step closer to the root. Also, I was supposed to report how long the Hilbert curve pre-processing takes, as if one should be asked to re-do that for each computation.

Fair enough. Step back, son.

Optimizing things

Given the excuse, I thought I would optimize my code a bit. Let's go through a few things I did.

Stash the active edges earlier

Rather than doing another pass to collect the active edges, we can speculatively stash them in an array with bounded size, and if at the end of the iteration it turns out the array isn't full, win. I chose 1 << 30 as the threshold because the resulting 8GB doesn't need to spill to disk, and with this limit we don't spend much wasted time trying to fill up the vector either.
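
A sketch of what this looks like inside a pass (reusing `dist`, `iter`, and `edges` from above; the cap is the 1 << 30 threshold):

// Speculatively stash still-active edges; if the cap is hit, forget the stash and keep
// streaming from disk on the next pass.
const CAP: usize = 1 << 30;                        // ~1B (u32, u32) pairs is the 8GB
let mut stash: Vec<(u32, u32)> = Vec::with_capacity(CAP);
let mut overflowed = false;

for &(src, dst) in &edges {
    // ... the usual BFS update for (src, dst) happens here ...

    if dist[src as usize] > iter && dist[dst as usize] > iter {
        if stash.len() < CAP { stash.push((src, dst)); }
        else                 { overflowed = true; }
    }
}

if !overflowed {
    // the next pass iterates over `stash` in memory instead of re-reading the disk
}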

This saves us one pass over the data.

Think a bit harder

It turns out that our characterization of active edges was a bit sloppy. The condition I used for an edge (src,dst) in iteration iter was

dist[src] > iter && dist[dst] > iter

However, if both distances equal iter + 1 the edge is also not interesting; at least one of the two endpoints needs the opportunity to improve. A better condition for an edge (src, dst) at iteration iter is

(dist[src] > iter && dist[dst] > iter + 1) ||
(dist[dst] > iter && dist[src] > iter + 1)

That seems pretty minor, but the number of vertices still in play drops pretty fast, and most of the active edges are between pairs of "about to be fixed" vertices. If we have fewer active edges, we can drop out of the loop one round earlier.

This saves us one pass over the data.

Being a better person

This "active edges" approach only really works because we reach most of the graph. If there were two disconnected components, we'd never actually get to a small number of edges, because all the edges between unreachable vertices would always appear active.

It turns out we were weirdly lucky with our 1 << 30 threshold above: there are apparently about 800 million edges that cannot be reached. Had we chosen a slightly smaller threshold, the active edge count would never have dropped below it, and we would never have switched to memory and finished quickly.

Historical note: the "mustache wager" was offered before I knew if this would be the case; that is how dangerous I like to live.

The more cynical among you might worry that I picked a good threshold after seeing the data, and fair enough. So let's make it right.

The right thing to do here is use our brains again, and remember that we have a one-pass connected components routine just a few screens up. We'll run that as part of the first iteration, and then set the distance of any unreachable vertices to 0; that will make sure their edges are never interesting, and we won't mistake them for actually being the root at the end of the computation. This shaves out 800 million edges early on, which means we can meet that 1 << 30 threshold sooner.
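
A sketch of that step, reusing the `roots` array built by the union-find pass (the chase-to-root helper is mine):

// Pin every vertex outside the root's component: distance zero means its edges never
// look active again, and the connectivity structure tells us afterwards that these
// vertices are unreachable rather than actual roots.
fn find(roots: &[u32], mut node: u32) -> u32 {
    while node != roots[node as usize] { node = roots[node as usize]; }
    node
}

let root_component = find(&roots, 0);
for node in 0..dist.len() {
    if find(&roots, node as u32) != root_component {
        dist[node] = 0;
    }
}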

This saves us one pass over the data.

Helping LLVM out

For reasons that escape me, in my edge decoder LLVM produced some assembly that implemented a fragment like

if rare_event {
    call_method();
}

with a bunch of stack pushing, then a conditional jump, then a bunch of stack popping. All the stack manipulation was burning about half the time in the decoding. Forcing an inline sorted this out, and improved the per-iteration time from 1500s to about 1200s (because we do more than just decode edges).
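
For concreteness, the nudge is just an inline attribute on the method in question (sketched here; the real method isn't shown):

// Forcing inlining removes the stack save/restore that LLVM wrapped around the rare call.
#[inline(always)]
fn call_method() {
    // rare, and cheap once the call overhead is gone
}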

Putting it together

Here is how it looks now:

Echidnatron% cargo run --release -- bfs compressed ~/Projects/Datasets/cc.bin
     Running `target/release/COST bfs compressed /Users/mcsherry/Projects/Datasets/cc.bin`
iteration: 1 @ 1522.82669049401
iteration: 2 @ 2587.3787615860056
iteration: 3 @ 3718.661802833012
iteration: 4 @ 4977.97501425301
iteration: 5 @ 6280.728537422008
iteration: 6 @ 7565.9328179680015
iteration: 7 @ 7587.944992198012
...
iteration: 329 @ 7589.818780647998
iteration: 330 @ 7589.818783035007
iteration: 331 @ 7589.818785973999
Echidnatron%

Notice how it takes off just after iteration six, rather than iteration nine? Less obviously, removing the 800 million edges means that instead of six hundred seconds spent in the tail (go check!) we spend 24 seconds.

You may have noticed that we are doing 331 iterations rather than 166 iterations. We had to do that to make the whole parent pointer nonsense work out.

Right, all that other stuff you had to do.

Yeah, so. I have to tell you about how parent pointers work, and we need to talk through how long it takes to go from unordered edges to Hilbert curve ordered, compressed data.

Parent pointers

I tried this two ways. The first way is boring, and doesn't work as well as the second. The second is interesting, and reinforces a recent point about whether sorting is good for you.

The first approach just keeps about 16GB of Vec<u32> around, and each time a distance gets set we write the appropriate parent into the corresponding entry of the vector. It slows down each iteration a bit, because we are now exercising the swap file.

...
iteration: 331 @ 9060.661106728003
Echidnatron%

Hrm. An extra 1500 seconds. That is dumb.

The second approach is to keep about 32GB of Vec<(u32,u32)> around, where we push any (src,dst) pairs whenever a distance gets set. This is the same information as the vector above, stored in more space, but accessed sequentially rather than randomly.
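
Roughly, the push happens right where a distance improves (pair order chosen as (child, parent), to match the destructuring below):

// Instead of scattering a write into a 16GB array, append sequentially.
if dist[dst as usize] > dist[src as usize].saturating_add(1) {
    dist[dst as usize] = dist[src as usize].saturating_add(1);
    parent_pairs.push((dst, src));   // (child, parent)
}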

Those numbers up above, with the 7589s elapsed measurement, were already doing this.

Unfortunately, this isn't the right format. It is really important to put the parent data into an array indexed by vertex, containing each vertex's parent. How should we do that with a big Vec<(u32, u32)>? The way I see it, we could just do:

let mut parents = vec![0; nodes];
for &(child, parent) in &parent_pairs {
    parents[child] = parent;
}

How long should this take? Hrm. Let's say 60ns for a random write. There are 3.5 billion of these folks. That's 210 seconds. Nice. It's probably lots better than that, because of the locality within each iteration (we probably have 6 large highly sequential runs), but let's just add 210 to 7589 and ...

No!

The reason we are in this stupid mess is because I was dumb enough to say that maybe sorting things might give better performance than just doing blind random access.

Wouldn't this be a great opportunity to double check whether I am full of it, or whether sorting might actually improve this (though, improving 210 seconds is pretty dull; we're just doing this to make a point).

Maybe. Let's see if I get back to this. Ed: Nope.

Hilbert conversion

This is a bit tricky. I don't have the source data I used to convert, and even if I did, teasing apart the costs is hard. The data show up as compressed text, and the dominant cost is parsing the text into integers.

I just did some profiling, and found that I could parse a line containing a pair of u32 values in about 224ns per line:

test parse ... bench:         224 ns/iter (+/- 36)

If we do the math, that is about eight hours just for parsing the text. Thanks, people who produce binary data as text.


Update: This example was bad. I was doing an allocation I could have avoided (in the string splitting). Rust actually makes this easy (by providing an iterator over str references, rather than an owned string), but I was sloppy! The better numbers are:

test parse ... bench:         145 ns/iter (+/- 30)

Five hours instead of eight! In case you were wondering where the time goes in your higher-level language...
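
For reference, the difference is roughly the one between collecting owned Strings and borrowing &str slices straight out of the line; a sketch (assuming whitespace-separated pairs, names mine):

// Allocation-free parsing: split_whitespace yields &str slices borrowed from `line`,
// so the only real work left is the integer parsing itself.
fn parse_edge(line: &str) -> Option<(u32, u32)> {
    let mut fields = line.split_whitespace();
    let src = fields.next()?.parse::<u32>().ok()?;
    let dst = fields.next()?.parse::<u32>().ok()?;
    Some((src, dst))
}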


This compares with Hilbert transformation costs of

test encode_h        ... bench:          15 ns/iter (+/- 2)
test decode_h        ... bench:          29 ns/iter (+/- 6)

So, the transformations are pretty small. And, we only do the encode anyhow; the decode happens (more efficiently) as part of the computation. With 128 billion edges, encoding takes 1920 seconds.
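
If you are curious what encode_h does with its 15 nanoseconds, it is the standard rotate-and-accumulate bit fiddling; a sketch (not necessarily the exact code behind the benchmark):

// Map a (u32, u32) coordinate to its position along the order-32 Hilbert curve,
// peeling off one bit of each coordinate per level and rotating the quadrant.
fn hilbert_encode(x: u32, y: u32) -> u64 {
    let (mut x, mut y) = (x as u64, y as u64);
    let mut d = 0u64;
    let mut s = 1u64 << 31;
    while s > 0 {
        let rx: u64 = if (x & s) > 0 { 1 } else { 0 };
        let ry: u64 = if (y & s) > 0 { 1 } else { 0 };
        d += s * s * ((3 * rx) ^ ry);
        if ry == 0 {
            if rx == 1 {
                // reflect, so the sub-curve enters and exits where it should
                x = 0xFFFF_FFFF - x;
                y = 0xFFFF_FFFF - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s >>= 1;
    }
    d
}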

Next we have to think about sorting the data. The 1TB of edge data are distributed over about 700 files, which means about 1.46GB of u64 data per file.

What an amazing coincidence; we just worked with this amount of data last week. We were sorting (u32, u32) data, which only requires four passes versus the eight we will need. So let's multiply that elapsed time (7.5 seconds) by two.

Fifteen seconds for each of the 700 files. Ouch. That is 10500 seconds.

Finally, we have to merge these files together. I think I did a 10-way merge three times. This seemed to go quite fast; I'm going to say it was IO bound. The Hilbert decompressor delivers about 2GB of edge data per second, and with 3TB of total edge data to process, this is about 1500 seconds.
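
The merge itself is the textbook heap-based k-way merge; a sketch over in-memory runs (the real thing streams from files and writes compressed output):

use std::collections::BinaryHeap;
use std::cmp::Reverse;

// Merge k sorted runs of Hilbert indices into one sorted sequence.
fn merge_runs(runs: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    let mut cursor = vec![0usize; runs.len()];
    for (which, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, which)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((value, which))) = heap.pop() {
        merged.push(value);
        cursor[which] += 1;
        if let Some(&next) = runs[which].get(cursor[which]) {
            heap.push(Reverse((next, which)));
        }
    }
    merged
}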

Let's add up everything that isn't parsing. 1920 seconds, plus 10500 seconds, plus 1500 seconds.

I get 13,920 seconds. It's all an estimate, but there you go. Everything other than the 1500 seconds of merging is trivially parallelizable, and my oft neglected second core might get some use.


Edit: My mistake. We were sorting 4GB last week, not 1.5GB. Sorting 1.5GB of u64 data takes me 5s. So that's 3500 seconds rather than 10500. We just saved 7000 seconds by double-checking our results; yay science! :D


Summing up

Here are the numbers I've got, for various tasks

                 Chaos      Order       FlashGraph
BreadthFirst     ?????      7800s       298s
20xPageRank      6220s      46600s      1306s**
Connectivity     ?????      1285s       461s
Layout           0?         6920s*      ?????

I've added some numbers for FlashGraph, taken from their performance page. FlashGraph is a neat single-box system that uses an array of SSDs to get substantially higher throughput than I can. It avoids the restriction to sequential scans, which makes me super happy.

I've had a request to note the hardware, which is a good point before anyone runs off and draws conclusions about Chaos vs FlashGraph. FlashGraph uses an array of 15 SSDs, and 512 GB of RAM. We don't know what Chaos uses, except that it is 32 machines. It will be good to learn when you should put all of your money into one box, and when you should spread it out over several (and when it really doesn't matter enough to get worked up).

My PageRank number is a bit stale, as it doesn't reflect the improved decoder. It looks like it should be just over 40,000s, though I haven't done a full run yet.

Also, you could totally go faster if you just did the decoding on a second thread. It's super easy in Rust, so I'll probably try that out.

*: This used to be 13920 seconds, but I mis-remembered the sorting times from last week. It is much faster. Sorry about that.

**: The FlashGraph numbers are for 30 iterations (I multiplied their number by 2/3), and in their paper they mention vertices that cease producing updates, making it possible that they are doing an approximate PageRank. If anyone knows specifics, fill me in.


Update Da Zheng (behind FlashGraph) got in touch and reports that their performance numbers improved for their FAST paper, and updated the FlashGraph wiki. This improved their numbers quite a bit. He also confirmed that it is an approximate PageRank.


Conclusions

To the extent that the single-threaded code did well, it was because we took advantage of our ability to write programs as we see fit. This is a super-powerful ability, and not one that computer scientists should so quickly dispose of. Restrictive programming models make it easier to write apparently efficient systems, but if that efficiency comes by forcing you to write shit programs, we haven't obviously improved the world yet.

I think Naiad is awesome, not primarily because it had a rawking implementation, but because it let you do things you couldn't do with other systems. The horribly bad label-propagation algorithm can become the actually sane prioritized label propagation, once you can explain which labels to process each iteration. You get a factor of ten reduction in data exchanged and compute done just by using a better algorithm. Alternately, if you want to incrementalize your computation, you don't need to build a brand new system, you just need to start writing incremental implementations, because you can totally just do that.

I really think that in addition to tightly tuned systems like Chaos, we need to make sure to keep moving towards more expressive systems like Naiad. Otherwise we are cashing in all sorts of great algorithms just so we can draw straighter scaling plots.

Of course, all of this is moot if it turns out that I got my ass handed to me. But that would mean actual serious computer scientific progress happened, which would make me almost as happy as Amitabha wearing a M-for-McSherry mustache, Tony Wonder style:

[image: Amitabha, inverted]