-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[browser][mono] Heap snapshot system for WASM #105713
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to 'arch-wasm': @lewing |
Other than the rollup issue I haven't been able to figure out, this is mostly "complete". I'm sure it has a lot of quality issues though since it touches so many subsystems. |
@@ -235,6 +246,12 @@ acquire_new_pages_initialized (uint32_t page_count) { | |||
if (bytes >= UINT32_MAX) | |||
return NULL; | |||
|
|||
if (first_controlled_page_index == UINT32_MAX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please move this fix into separate PR ?
@@ -582,7 +582,6 @@ new_codechunk (MonoCodeManager *cman, int size) | |||
MONO_PROFILER_RAISE (jit_chunk_created, ((mono_byte *) chunk->data, chunk->size)); | |||
|
|||
code_memory_used += chunk_size; | |||
mono_runtime_resource_check_limit (MONO_RESOURCE_JIT_CODE, code_memory_used); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain this change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was dead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I saw more dead code in this PR, let's move it into separate PR ?
src/mono/mono/metadata/class-init.c
Outdated
|
||
#ifdef HOST_BROWSER | ||
int | ||
mono_class_update_heapshot_scratch_byte (MonoClass *klass, char new_value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scratch_byte
is quite generic name. How do we use it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy to add a comment in code for it, the function that updates it has an explanation. essentially, if we heapshot twice we need a way to tell whether a given class has been recorded in the current heapshot. we do this by increasing the scratch byte for each heapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it to usedInCurrentHeapShotCount
import { mono_log_info } from "../logging"; | ||
import { getU32 } from "../memory"; | ||
import { utf8ToString } from "../strings"; | ||
import { BlobBuilder } from "../jiterpreter-support"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not mix jiterp and heapshot features together, because I think we may want to trim away each separately.
But let's discuss it first, maybe I don't understand well the synergy here.
What functions make BlobBuilder
ideal ?
Afaik that builder is allocating inside of WASM linear memory, correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blobbuilder is mostly ideal because it implements LEB and some other things we need. Duplicating that would add code size and risk, so I decided to start with reuse. It does have to use wasm linear memory, which is why I added a parachute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need LEB ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why memory shot needs to use linear memory for output ?
@@ -13,6 +13,12 @@ import fast_glob from "fast-glob"; | |||
import gitCommitInfo from "git-commit-info"; | |||
import MagicString from "magic-string"; | |||
|
|||
setTimeout(() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the error message ? Or what's the problem ?
How/why does adding .html
file influence rollup ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no error message, it just hangs forever. during the hang no code is running... it seems like the node instance just sits idle forever in a 'server' mode because rollup neglects to perform a process.exit when it's done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append rollup -c --forceExit
in src\mono\browser\runtime\package.json
will fix this
if ((typeof (data) !== "object") || (typeof (data.sender) !== "string") || (typeof (data.cmd) !== "string")) | ||
return; | ||
|
||
console.log("channel_message", data.cmd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this was not flagged by eslint ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, but I can safely remove it. I forgot.
let heapshotStartedWhen = performance.now(); | ||
|
||
const tabId = (Math.random() * 100000).toFixed(0); | ||
const globalChannel : BroadcastChannel | null = globalThis["BroadcastChannel"] ? new BroadcastChannel(".NET Runtime Diagnostics") : null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not do this in global scope, this happens very early when the ES6 module is loaded. Let's move it behind some feature flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also worried about this becoming attack surface. Let's make it opt-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only works for content on the same origin. How would it be used for attacks?
I'm open to opt-in behind a flag or something, but I don't know how to read runtime options from typescript this early in startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it could be only listening on the channel after the user opens browser console and enters
globalThis.getDotnetRuntime(0).enableDiagnosticChannel()
And maybe we could get rid of the whole complexity of HTML deployment if the whole experience was triggered just via
globalThis.getDotnetRuntime(0).downloadMemoryDump()
or
globalThis.getDotnetRuntime(0).logMemoryStats()
I would prefer that.
|
||
console.log("channel_message", data.cmd); | ||
switch (data.cmd) { | ||
case "whosThere": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whosThere
-> dotnetHeapShotHandshake
or something
let title = "unknown"; | ||
if (globalThis["document"] && globalThis["document"]["title"]) | ||
title = globalThis.document.title; | ||
globalChannel!.postMessage({ cmd: "iAmHere", sender: tabId, version: ProductVersion, running: loaderHelpers.is_runtime_running(), title }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iAmHere
-> dotnetHeapShotHandshakeResponse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to integrate the UI (.html) into build process so that it can get copied to the app wwwroot? If so, do we want to do it here or in a follow up? I can help with that part.
@@ -14,6 +14,7 @@ | |||
<WasmExtraFilesToDeploy Include="index.html" /> | |||
<WasmExtraFilesToDeploy Include="main.js" /> | |||
<WasmExtraFilesToDeploy Include="appstart-frame.html" /> | |||
<!-- <WasmExtraFilesToDeploy Include="../../../browser/runtime/heapshot/diagnostics-panel.html" /> --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<!-- <WasmExtraFilesToDeploy Include="../../../browser/runtime/heapshot/diagnostics-panel.html" /> --> |
Fix "next byte" comments in class-private-definition and move min_align to a better place Checkpoint Checkpoint Fix root scanning Checkpoint Remove old file Fix bugs in diagnostics panel Checkpoint Checkpoint Checkpoint Checkpoint Checkpoint Fix repeated heapshots Fix double free Store format version as a counter Push counter chunks to the front of the saved file Better filenames Add a "kind" string to the profiler gcroots event so we can flow through the names of registered roots Migrate wasm heapshot roots to use the existing profiler event Store information on js owned and cs owned objects into wasm heapshots Store counters as signed LEB Better JS object descriptions Basic support for 'partial' heapshots for gathering live statistics Store jiterp counters in snapshots Checkpoint Re-enable a subset of mono_counters and embed them in heap snapshots Fix CI Fix CI Embed dlmalloc info into snapshot counters Checkpoint Cleanup
I prefer to invest into making the EventPipe single-threaded and use it's format. We already have bespoke mono profilers and viewers. |
At present for mono we have two existing ways to capture a heap snapshot for memory analysis, gcdump and heapshot. Unfortunately, as-is for WASM scenarios we can't directly apply them when trying to find the root causes of out-of-memory failures:
This draft PR contains a system based on BroadcastChannel and a helper 'diagnostics panel' tab where the running application can stream information on the GC heap and runtime state to another tab with its own heap, and the other tab will buffer all that information into a file that can be downloaded from the browser and analyzed offline.
The file format is a simple RIFF-based structure where various types of data are split into chunked streams - i.e. each GC object has an 'object header' in the OBJH stream, and object references in the GC heap are emitted into REFS chunks, while each MonoClass generates an entry in the TYPE stream.
At present there are two working prototype viewers, https://github.com/kg/MonoHeapSnapshotViewer/tree/main and https://github.com/1hub/dotnet-heapview/tree/master (thanks @filipnavara for your help here!). The latter can be used in the browser at https://1hub.github.io/dotnet-heapview/ while the former is currently winforms-only.
I've attempted to add additional useful information into streams in the snapshot to aid debugging trickier OOM conditions - this PR re-enables part of the old mono counters system so we can embed counters into the snapshot, which will give information like how much memory has been allocated for classes or jiterpreter traces.
Basic representations of the GCHandle tables, GC roots, and JS interop handle tables are also embedded into snapshots.
By nature this is less capable than the heap snapshot functionality available in Visual Studio for netfx/coreclr, but my goal with this was to build a simple starting point we can use to solve problems in the field today, and evolve forward until it can be fully replaced with one of our other systems.
Diagnostics panel:
Viewer: