[Update] Benchmarking Microtests: State of the union of node/benchmark tests and where to go next #25
BTW, I was going to use this issue to update my progress on the aforementioned set of tasks, then close it when done.
Update
I guess it depends on what we're measuring. I'd say if it's startup we're measuring, then...

Maybe something for another issue, but in your experience, how are people (how is Netflix) running node? I remember at Node.js Interactive the talk on Netflix mentioned the much faster startup time being a definite win for node. How often are node instances being restarted? Do you have any very short-lived node applications?
The performance of `require` was called out specifically by sam-github here: #22. Even though startup is a one-time cost, it's still seen as a key metric for runtimes.
@gareth-ellis @mhdawson I doubt the validity of this. But, as per the agreed-upon task set, I will start there.

@yunong can correct me if I am wrong, but the start-up time of the website prior to node was significantly longer, enough that it made node a huge win to switch to regardless.

As mentioned above, I will start on that. And yes @gareth-ellis, we do have shorter-lived node apps. But short-lived is probably a relative term... @yunong can speak to the average life of our node servers.
`require`, `URL`, and `events` are "done".
Where does http, keepalive, sockets, etc. performance come in? There's certainly an argument to be made that if you're writing a service in node, you're going to need those basics to show perf gains over time. As important as it is to benchmark the methods you bake all your applications with, I don't know of anyone writing their own http module in JavaScript.

On that note, I'm curious to know if you still see JavaScript as the best mechanism for testing things like concurrent connection handling, and how those things could fall over if benched from the same machine as the one that's running the http process.

Additionally, clustering and forked process speeds have a serious lack of transparency into how fast they actually perform. I know there were scheduler changes in the latest releases vs 0.10 which were supposed to fix these, but I haven't found any benchmarks that showcase this.
To follow that last message up, here are some alternate benchmarking tools I've run in the past which offer fuel for the discussion: https://github.com/wg/wrk
@jeffreytgilbert Good morning! I love the benchmarking tools you provided. I will definitely be checking them out. As of right now a few of the members of the WG are working on getting the Acme Air application set up to be part of the benchmarking suite. My part is to focus from the other side. Here is my rough plan of attack.

If you have any thoughts or ideas, please let me know.
Saving the test data as a png is fine. Saving it to a new repo is fine. Save the raw data as well. Use time-series data storage for some of the benchmarks: OpenTSDB, or a graphing system like InfluxDB or Prometheus, or one of those that you can tie into Grafana or another suite.
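To make the time-series idea concrete, here is a small sketch of formatting one benchmark sample as an InfluxDB line-protocol record (`measurement,tag=value field=value timestamp`); the measurement, tag, and field names are hypothetical, not a schema anyone has agreed on:

```javascript
// Format one benchmark sample as an InfluxDB line-protocol record.
// The names below (node_benchmark, duration_ms) are made up for
// illustration only.
function toLineProtocol(benchmarkName, durationMs, timestampNs) {
  return `node_benchmark,name=${benchmarkName} duration_ms=${durationMs} ${timestampNs}`;
}

// One such record per run could then be written to InfluxDB and
// graphed in Grafana alongside the other suites.
console.log(toLineProtocol('require_cached', 0.012, '1500000000000000000'));
```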
In terms of the plans for how we are working initially to save the data and graph see https://github.com/nodejs/benchmarking/blob/master/benchmarks/README.md. I'm slowly working on this with one of the steps being to add the micro benchmarks added by @michaelbpaulson to those we generate graphs for. |
Also see, for comparison, "are we fast yet": https://arewefastyet.com/ I believe I heard someone reference this on the videocast I listened in on, but my memory is failing me. It's relevant as it currently compares browsers using benchmark suites across their build types, and plots results over time so you can easily see deltas.
Yes, the plan outlined in https://github.com/nodejs/benchmarking/blob/master/benchmarks/README.md will result in graphs along these lines: https://github.com/nodejs/benchmarking/blob/master/benchmarks/startup_footprint/footprint.png
While I do think that micro-benchmarks do have their place, I wanted to offer a slightly different perspective. My personal experience, and also the experience of the V8 team, has been that there is a lot of day-to-day churn and variability on micro-benchmarks. I have personally worked in teams that wasted person-years of development time because of spurious regressions on micro-benchmarks (*cough* CaffeineMark etc. *cough*) in a former life.

Micro-benchmarks try to set precise expectations about how a specific piece of code is going to be executed by the computer. That is hard to achieve in a managed runtime, where the JIT and GC are constantly conspiring against you, in different ways each day. As a result, your micro-benchmarks may not be measuring what you expect them to be measuring. The exact timing of when the JIT decides to optimize things, or how much it wants to optimize things, or the exact threshold at which V8 chooses to switch an Array to dictionary mode, can affect micro-benchmarks a lot more than real code. A micro-benchmark may be inadvertently sensitive to such incidental details, and this sensitivity may completely drown out the signal you were originally intending to capture.

Here's a video (by the awesome Vyacheslav Egorov) that talks about benchmarking JavaScript and touches on these topics: https://www.youtube.com/watch?v=65-RbBwZQdU. See also a related discussion on the topic of micro-benchmarks: robohornet/robohornet#67

It is not always clear that the results from a micro-benchmark will translate to real-world applications. Apart from being a bad predictor, this can also misplace the incentives for a VM implementer, drawing work away from improving the performance of real-world applications.

Having said that, I do think targeted micro-benchmarks make sense for specific things. It is perfectly reasonable to have a benchmark for startup performance.
It is also perfectly reasonable to have a simple benchmark to measure http server throughput for a trivial http server. The point I am trying to convey is that you can only evaluate a benchmark on a case-by-case basis. I would contest a blanket statement about the virtues of micro-benchmarks in general, and I would be against this WG adopting such a philosophy. A well-realized performance regression suite should also have larger workloads derived from real-world use cases.
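One concrete instance of the JIT hazard described above, as a stdlib-only sketch (no benchmark.js): a loop whose result is never used can be partially or wholly optimized away, so the two timings below are not necessarily measuring the same work.

```javascript
// Naive timing: the sqrt result is unused, so the optimizer is free
// to eliminate some or all of the work being "measured".
function naive() {
  const start = process.hrtime.bigint();
  for (let i = 0; i < 1e6; i++) {
    Math.sqrt(i); // dead code from the JIT's point of view
  }
  return Number(process.hrtime.bigint() - start) / 1e6; // ms
}

// Consuming the result in a sink defeats dead-code elimination,
// which is one common (if still imperfect) mitigation.
let sink = 0;
function withSink() {
  const start = process.hrtime.bigint();
  for (let i = 0; i < 1e6; i++) {
    sink += Math.sqrt(i);
  }
  return Number(process.hrtime.bigint() - start) / 1e6; // ms
}

console.log('naive:', naive().toFixed(3), 'ms');
console.log('with sink:', withSink().toFixed(3), 'ms');
```

Whether (and when) the JIT actually eliminates the naive loop varies by V8 version, which is exactly the day-to-day churn being described.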
@michaelbpaulson I think we should close this for now as there has been no update for over a year. If you want to restart the conversation, feel free to re-open.
The goal of this issue is to demonstrate the following:
Why micro benchmark?
Micro benchmarking is arguably more important from a library's standpoint than application / integration level benchmarking (I have heard macro benchmarking as a term too). Micro benchmarking will quickly flag slowdowns in the system with little noise. This will help diagnose the issue with little to no investigation needed.
Arguments against micro benchmarks
* measured accurately: if a stable platform and multiple runs are used, one gets the most consistent measurements possible from JavaScript measuring JavaScript.
Where are the current tests at and where do we go?
Overview
After reviewing the set of tests in nodejs/node/benchmark, I see an awesome set of micro benchmarks. It really is a great place to start. It appears that the relevant set of node-specific libraries is represented there.
Why not just use those tests?
The primary reason why the tests are invalid forms of measurement can be found here (for the TL;DR portion, read how options A and D work). Secondly, using a well-known performance measuring library would reduce potential bugs and the learning curve, especially since the custom benchmark harness suffers from at least all the same downfalls as benchmark.js.
Downfalls of benchmark.js and their workarounds
The main downfall of benchmark.js is that it is JavaScript measuring JavaScript (one of the same downfalls as node/benchmark/common.js). The operating system can do who knows what during performance runs and cause incorrect measurements. Thus a more consistent platform (EC2, as an example) can make results more stable. Multiple runs (say 10), tossing out the high/low, will help remove issues from v8 optimizing mid-run, OS context switching, Wednesday's bad weather, etc.
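The multiple-runs idea can be sketched as follows (stdlib only; the run count and the benchmarked function are placeholders):

```javascript
// Run a benchmark `runs` times, throw away the highest and lowest
// samples, and average the rest: a crude trimmed mean that damps
// one-off spikes from the OS, GC, or mid-run JIT activity.
function measureMs(fn) {
  const start = process.hrtime.bigint();
  fn();
  return Number(process.hrtime.bigint() - start) / 1e6;
}

function trimmedMean(fn, runs = 10) {
  if (runs < 3) throw new Error('need at least 3 runs to trim high/low');
  const samples = [];
  for (let i = 0; i < runs; i++) samples.push(measureMs(fn));
  samples.sort((a, b) => a - b);
  const kept = samples.slice(1, -1); // toss the high and the low
  return kept.reduce((sum, s) => sum + s, 0) / kept.length;
}

console.log(trimmedMean(() => JSON.parse('{"a": 1}')).toFixed(4), 'ms');
```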
What about flamegraphs?
Flamegraphs do not give an absolute number; they give relative numbers. Flamegraphs are amazing for understanding what's taking the most time within a library, not the performance of the library itself.
A side note: this would be a very interesting tool to use for performance charting over time. One could use the % of samples as an indicator of growth in running time. If all tests were measured for a long enough period of time, a complete picture could be established and used build over build / day over day / at some frequency. The only issue I see with this is that there is no out-of-the-box solution for it. Secondly, writing this library would be a feat in and of itself. So we will defer discussion / implementation of this for a later time, or never.
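A toy sketch of the %-of-samples idea, assuming we already have a flat list of sampled function names (real profilers emit full stacks; this is deliberately simplified):

```javascript
// Compute each function's share of profiler samples. Tracked build
// over build, a growing share would indicate growing relative cost.
function sampleShares(samples) {
  const counts = new Map();
  for (const frame of samples) {
    counts.set(frame, (counts.get(frame) || 0) + 1);
  }
  const shares = {};
  for (const [frame, n] of counts) {
    shares[frame] = n / samples.length;
  }
  return shares;
}

console.log(sampleShares(['parse', 'parse', 'resolve', 'parse']));
// { parse: 0.75, resolve: 0.25 }
```

As noted above, shares are relative: a function's share can grow because it got slower or because everything else got faster, which is why this would supplement, not replace, absolute timings.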
Where to go from here?
* `require`. It may be impossible for `require` of new module code to be tested by benchmark.js due to its caching nature (it really depends on whether I can muck with the memory or not). It will be trivial to test `require`'s cached-result retrieval with benchmark.js.
* charting/storage system.
* `buffer`, `path`, `urlparse`, etc. I would follow suit of node/benchmark.