
[Update] Benchmarking Microtests: State of the union of node/benchmark tests and where to go next #25

Closed
ThePrimeagen opened this issue Jan 11, 2016 · 16 comments


@ThePrimeagen
Member

The goal of this issue is to demonstrate the following:

  • Why microbenchmarking is important.
  • Why Benchmark.js is a good library to use.
  • Where I am going to start, and the path forward.

Why micro benchmark?

Micro benchmarking is arguably more important from a library's standpoint than application / integration level benchmarking (I have heard macro benchmarking as a term too). Micro benchmarking will quickly flag slowdowns in the system with little noise, which helps diagnose an issue with little to no investigation needed.

Arguments against micro benchmarks

  • Can be less reliable.
    • This is addressed below in more detail (linked article). It can be measured accurately*.
  • Application / integration benchmarks are more meaningful measurements.
    • Correct and incorrect. They are meaningful as an estimate of the performance of the specific application / integration being measured, but that does not mean the library will be as performant in my application, since our calling patterns could be different and therefore have different performance characteristics.
    • Second, there is no reasonable / practical way to determine where performance issues are arising from when the granularity of the performance tests is at the application level. The noise is too loud.
    • Finally, because application-level measurements are dominated by the application itself, some operations can become 2 - 3x slower and still be eclipsed by its overall performance. The 1000 paper cuts of slowdown accumulate over time with no individual being able to determine where/when it happened.

* measured accurately: if a stable platform and multiple runs are used, one gets the most consistent measurements possible from JavaScript measuring JavaScript.

Where are the current tests at and where do we go?

Overview

After reviewing the set of tests in nodejs/node/benchmark, I see an awesome set of micro benchmarks. It really is a great place to start. It appears that the relevant set of Node-specific libraries is represented here.

Why not just use those tests?

The primary reason the existing tests are invalid forms of measurement can be found here (for the TL;DR portion, read how options A and D work). Second, using a well-known performance-measuring library would also reduce potential bugs and the learning curve, especially since the custom benchmark harness suffers from at least all of the same downfalls as Benchmark.js.

Downfalls of Benchmark.js and their workarounds

The main downfall of Benchmark.js is that it is JavaScript measuring JavaScript (a downfall it shares with node/benchmark/common.js). The operating system can do who knows what during performance runs and cause incorrect measurements, so a more consistent platform (EC2, as an example) can make results more stable. Multiple runs (say 10), tossing out the high and low, will help remove issues from V8 mid-run optimization, OS context switching, Wednesday's bad weather, etc.
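
For illustration, the workaround could look roughly like the sketch below, assuming Benchmark.js is installed (`npm install benchmark`); the workload, the run count, and the trimming strategy are placeholders, not a prescription.

```js
'use strict';
const Benchmark = require('benchmark');

const RUNS = 10;

function measureOnce() {
  // A fresh Benchmark instance per run so earlier runs cannot skew later ones.
  const bench = new Benchmark('JSON.stringify', function () {
    JSON.stringify({ a: 1, b: 'two', c: [3, 4, 5] });
  });
  bench.run();     // synchronous by default
  return bench.hz; // ops/sec for this run
}

const samples = [];
for (let i = 0; i < RUNS; i++) {
  samples.push(measureOnce());
}

// Toss out the highest and lowest samples, then average what is left.
samples.sort((a, b) => a - b);
const trimmed = samples.slice(1, -1);
const mean = trimmed.reduce((sum, hz) => sum + hz, 0) / trimmed.length;

console.log(`JSON.stringify: ${mean.toFixed(0)} ops/sec (trimmed mean of ${RUNS} runs)`);
```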

What about flamegraphs?

Flamegraphs do not give absolute numbers; they give relative numbers. Flamegraphs are amazing for understanding what is taking the most time within a library, not the performance of the library itself.

A side note: this would be a very interesting tool to use for charting performance over time. One could use the % of samples as an indicator of growth in running time. If all tests were measured for a long enough period of time, a complete picture could be established and used build over build / day over day / at some frequency. The only issue I see with this is that there is no out-of-the-box solution for it, and writing such a library would be a feat in itself. So we will defer discussion / implementation of this to a later time, or never.

Where to go from here?

  • Talk to @mhdawson about where to commit these tests to.
  • Now that we have a baseline of where to start, I'll create a set of tests for require. It may be impossible to benchmark requiring new (uncached) module code with Benchmark.js due to require's caching (it really depends on whether I can muck with the memory or not). It will be trivial to test require's cached result retrieval with Benchmark.js (see the sketch after this list).
  • I'll talk to @mhdawson and learn how to integrate the results into the already built
    charting/storage system.
  • I'll start building a suite of tests using Benchmark.js for each of node's subsystems. This would be buffer, path, urlparse, etc. I would follow the lead of node/benchmark.
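
As a rough sketch of how the two require cases might be framed with Benchmark.js: the module path `./some-module` below is hypothetical, and deleting the entry from `require.cache` is the memory-mucking needed to exercise the uncached path.

```js
'use strict';
const Benchmark = require('benchmark');

const MODULE = './some-module';        // hypothetical module under test
const resolved = require.resolve(MODULE);

new Benchmark.Suite('require')
  .add('require (cached)', function () {
    require(MODULE);                   // after the first load, this hits the cache
  })
  .add('require (uncached)', function () {
    delete require.cache[resolved];    // force a full re-load on every iteration
    require(MODULE);
  })
  .on('cycle', function (event) {
    console.log(String(event.target)); // e.g. "require (cached) x 1,234,567 ops/sec ..."
  })
  .run();
```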
@ThePrimeagen
Member Author

BTW, I was going to use this issue to update my progress on the aforementioned set of tasks and then close it when done.

@ThePrimeagen
Member Author

Update
As I started the benchmarking I realized that testing the require module is probably a bad place to start, since it's a one-time start-up cost and probably the least concerning to a sophisticated node application. I will still get the tests up, simply to prove out the framing of our benchmark tests and to get the first one plotted.

@gareth-ellis
Member

I guess it depends on what we're measuring - I'd say if it's startup we're measuring, then require is a cost that almost every node application is going to pay.

Maybe something for another issue, but in your experience, how are people (how is Netflix) running node? I remember at Node.js Interactive the talk on Netflix mentioned the much faster startup time being a definite win for node. How often are node instances being restarted? Do you have any very short-lived node applications?

@mhdawson
Member

The performance of require was called out specifically by sam-github here: #22. Even though startup is a one-time cost, it's still seen as a key metric for runtimes.

@ThePrimeagen
Member Author

@gareth-ellis @mhdawson I doubt the validity of this, but as per the agreed-upon task set, I will start with require.

@yunong can correct me if I am wrong, but the start-up time of the website prior to node was so much longer that it made node a huge win to switch to, regardless of require time. I know I am getting really pedantic, but node startup time and application startup time can be considered separate... :) Just saying.

As mentioned above, I will start on require (will be done shortly) and then I'll move on to what I consider to be the next most used node core lib.

And yes @gareth-ellis, we have shorter-lived node apps. But short-lived is probably a relative term... @yunong can speak to the average life of our node servers.

@ThePrimeagen
Member Author

require, URL, and events are "done".

@jeffreytgilbert

Where does http, keepalive, sockets, etc. performance come in? There's certainly an argument to be made that if you're writing a service in node, you're going to need those basics to show perf gains over time. As important as it is to benchmark the methods you bake all your applications with, I don't know of anyone writing their own http module in javascript.

On that note, I'm curious to know if you still see javascript as the best mechanism for testing things like concurrent connection handling, and how those things could fall over if benched from the same machine as the one that's running the http process. Additionally, clustering and forked process speeds have a serious lack of transparency into how fast they actually perform. I know there were scheduler changes in the latest releases vs 0.10 which were supposed to fix these, but I haven't found any benchmarks that showcase this.

@jeffreytgilbert

To follow that last message up, here are some alternate benchmarking tools I've run in the past which offer fuel for the discussion.

https://github.com/wg/wrk
https://github.com/newsapps/beeswithmachineguns

@ThePrimeagen
Member Author

@jeffreytgilbert Good morning! I love the benchmarking tools you provided. I will definitely be checking those out.

As of right now a few of the members of the WG are working on getting the Acme Air application set up to be part of the benchmarking suite. My part is to focus on the other side. Here is my rough plan of attack.

  1. Start with the smallest elements, true micro benchmarks. These are useful metrics to have and give signals when the overall system has slowed down.
  2. Build some macro tests out of combined micro benchmarks. Some form of caching algorithm, or other fun complex pieces of work that push one aspect to the max and avoid some of the pitfalls of micro benchmarks.
  3. If Acme Air has yet to get off the ground, I'll probably start setting up various low-level http tests. This is where the aforementioned libraries will potentially come in handy, and where I would be testing several of the spawning / process libraries (including cluster). I'll probably create some sort of aggregation service and see how many requests can be processed over some time period (a rough sketch of the shape this could take follows below).
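
To make point 3 concrete: a sketch of a trivial clustered http server whose throughput could then be driven externally by one of the tools above, e.g. `wrk -t4 -c100 -d30s http://localhost:8000/`. The port, worker count, and wrk parameters here are arbitrary, not part of any agreed plan.

```js
'use strict';
const cluster = require('cluster');
const http = require('http');
const os = require('os');

const PORT = 8000;

if (cluster.isMaster) {
  // One worker per CPU; the master only forks and watches for exits.
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited`);
  });
} else {
  http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok\n');
  }).listen(PORT);
}
```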

If you have any thoughts or ideas please let me know.

@jeffreytgilbert

Saving the test data as a png is fine. Saving it to a new repo is fine. Save the raw data as well. Use time-series data storage for some of the benchmarks: OpenTSDB, or a graphing system like InfluxDB or Prometheus, or one of those that you can tie into Grafana or another suite.
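
As a hedged sketch of what that could look like, assuming an InfluxDB 1.x instance on localhost with a `benchmarks` database already created (the measurement and tag names are made up for illustration):

```js
'use strict';
const http = require('http');

function recordResult(benchName, opsPerSec) {
  // InfluxDB 1.x line protocol: measurement,tags fields (timestamp defaults to "now").
  const line = `microbenchmark,name=${benchName} ops_per_sec=${opsPerSec}`;

  const req = http.request({
    hostname: 'localhost',
    port: 8086,
    path: '/write?db=benchmarks',
    method: 'POST',
  }, (res) => {
    console.log(`influxdb responded ${res.statusCode}`); // 204 on success
  });
  req.on('error', (err) => console.error('write failed:', err.message));
  req.end(line);
}

// e.g. after a benchmark run:
recordResult('require_cached', 1234567);
```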

@mhdawson
Member

mhdawson commented Feb 8, 2016

In terms of the plans for how we are working initially to save the data and graph see https://github.com/nodejs/benchmarking/blob/master/benchmarks/README.md. I'm slowly working on this with one of the steps being to add the micro benchmarks added by @michaelbpaulson to those we generate graphs for.

@jeffreytgilbert

Also see for comparison "are we fast yet". I believe I heard someone reference this on the videocast I listened in to, but my memory is failing me. https://arewefastyet.com/

It's relevant as it currently compares browsers using benchmark suites based on their build types, and the results are plotted over time so you can easily see deltas.

@mhdawson
Member

mhdawson commented Feb 9, 2016

@ofrobots
Collaborator

While I do think that micro-benchmarks do have their place, I wanted to offer a slightly different perspective.

My personal experience, and also the experience of the V8 team, has been that there is a lot of day-to-day churn and variability on micro-benchmarks. In a former life I personally worked on teams that wasted person-years of development time because of spurious regressions on micro-benchmarks (*cough* CaffeineMark etc. *cough*).

Micro-benchmarks try to set precise expectations about how a specific piece of code is going to be executed by the computer. That is hard to have in a managed runtime where the JIT and GC are constantly conspiring against you, in different ways each day. As a result your micro-benchmarks may not be measuring what you expect them to be measuring. The exact timing of when the JIT decides to optimize things, or how much it wants to optimize things, or the exact thresholds at which V8 chooses to switch an Array to dictionary mode, can affect micro-benchmarks a lot more than real code. A micro-benchmark may be inadvertently sensitive to such incidental details, and this sensitivity may completely drown out the signal that you were originally intending to capture.
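
A contrived sketch of the kind of trap being described: the two Benchmark.js cases below look like they measure the same work, but an optimizing compiler is free to eliminate the unused computation in the first one, so its ops/sec may say more about dead-code elimination than about `Math.sqrt`. Whether and when V8 actually does this varies by version, which is exactly the problem.

```js
'use strict';
const Benchmark = require('benchmark');

const input = 1e9 + 7;
let sink = 0; // consuming the result keeps the work observable

new Benchmark.Suite('what is actually being measured?')
  .add('sqrt, result discarded', function () {
    Math.sqrt(input);              // result never used
  })
  .add('sqrt, result consumed', function () {
    sink += Math.sqrt(input);      // result feeds an observable value
  })
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', () => console.log('sink =', sink)) // keep `sink` alive
  .run();
```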

Here's a video (by the awesome Vyacheslav Egorov) that talks about benchmarking JavaScript, that touches on these topics: https://www.youtube.com/watch?v=65-RbBwZQdU. See also a related discussion on the topic of micro-benchmarks: robohornet/robohornet#67

It is not always clear whether the results from a micro-benchmark will translate to real-world applications. Apart from being a bad predictor, this may also misplace the incentives for a VM implementer, away from work that improves the performance of real-world applications.

Having said that, I do think targeted micro-benchmarks make sense for specific things. It is perfectly reasonable to have a benchmark for startup performance. It is also perfectly reasonable to have a simple benchmark to measure http server throughput for a trivial http server.

The point I am trying to convey is that you can only evaluate a benchmark on a case-by-case basis. I would contest a blanket statement about the virtues of micro-benchmarks in general, and I would be against this WG adopting such a philosophy. A well-realized performance regression suite should also have larger workloads derived from real-world use cases.

@mhdawson
Member

@michaelbpaulson I think we should close this for now as there has been no update for > 1 year. If you want to restart the conversation feel free to re-open.
