Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser][mono] Heap snapshot system for WASM #105713

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

kg
Copy link
Member

@kg kg commented Jul 30, 2024

At present for mono we have two existing ways to capture a heap snapshot for memory analysis, gcdump and heapshot. Unfortunately, as-is for WASM scenarios we can't directly apply them when trying to find the root causes of out-of-memory failures:

  • GCDump is eventpipe/etw-based, which requires threads, and the threaded runtime is not ready yet
  • Heapshot is unmaintained and requires the ability to buffer the data into a FILE, which in emscripten is buffered in malloc'd memory. If the application is out of memory, buffering into the file will fail.

This draft PR contains a system based on BroadcastChannel and a helper 'diagnostics panel' tab where the running application can stream information on the GC heap and runtime state to another tab with its own heap, and the other tab will buffer all that information into a file that can be downloaded from the browser and analyzed offline.

The file format is a simple RIFF-based structure where various types of data are split into chunked streams - i.e. each GC object has an 'object header' in the OBJH stream, and object references in the GC heap are emitted into REFS chunks, while each MonoClass generates an entry in the TYPE stream.

At present there are two working prototype viewers, https://github.com/kg/MonoHeapSnapshotViewer/tree/main and https://github.com/1hub/dotnet-heapview/tree/master (thanks @filipnavara for your help here!). The latter can be used in the browser at https://1hub.github.io/dotnet-heapview/ while the former is currently winforms-only.

I've attempted to add additional useful information into streams in the snapshot to aid debugging trickier OOM conditions - this PR re-enables part of the old mono counters system so we can embed counters into the snapshot, which will give information like how much memory has been allocated for classes or jiterpreter traces.

Basic representations of the GCHandle tables, GC roots, and JS interop handle tables are also embedded into snapshots.

By nature this is less capable than the heap snapshot functionality available in Visual Studio for netfx/coreclr, but my goal with this was to build a simple starting point we can use to solve problems in the field today, and evolve forward until it can be fully replaced with one of our other systems.

Diagnostics panel:
image

Viewer:
image

@kg kg added arch-wasm WebAssembly architecture area-Diagnostics-mono labels Jul 30, 2024
Copy link
Contributor

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

@kg kg marked this pull request as ready for review August 5, 2024 15:42
@kg
Copy link
Member Author

kg commented Aug 5, 2024

Other than the rollup issue I haven't been able to figure out, this is mostly "complete". I'm sure it has a lot of quality issues though since it touches so many subsystems.

@@ -235,6 +246,12 @@ acquire_new_pages_initialized (uint32_t page_count) {
if (bytes >= UINT32_MAX)
return NULL;

if (first_controlled_page_index == UINT32_MAX) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please move this fix into separate PR ?

@@ -582,7 +582,6 @@ new_codechunk (MonoCodeManager *cman, int size)
MONO_PROFILER_RAISE (jit_chunk_created, ((mono_byte *) chunk->data, chunk->size));

code_memory_used += chunk_size;
mono_runtime_resource_check_limit (MONO_RESOURCE_JIT_CODE, code_memory_used);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain this change ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was dead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I saw more dead code in this PR, let's move it into separate PR ?


#ifdef HOST_BROWSER
int
mono_class_update_heapshot_scratch_byte (MonoClass *klass, char new_value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scratch_byte is quite generic name. How do we use it ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to add a comment in code for it, the function that updates it has an explanation. essentially, if we heapshot twice we need a way to tell whether a given class has been recorded in the current heapshot. we do this by increasing the scratch byte for each heapshot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename it to usedInCurrentHeapShotCount

import { mono_log_info } from "../logging";
import { getU32 } from "../memory";
import { utf8ToString } from "../strings";
import { BlobBuilder } from "../jiterpreter-support";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not mix jiterp and heapshot features together, because I think we may want to trim away each separately.
But let's discuss it first, maybe I don't understand well the synergy here.

What functions make BlobBuilder ideal ?

Afaik that builder is allocating inside of WASM linear memory, correct ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blobbuilder is mostly ideal because it implements LEB and some other things we need. Duplicating that would add code size and risk, so I decided to start with reuse. It does have to use wasm linear memory, which is why I added a parachute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need LEB ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why memory shot needs to use linear memory for output ?

@@ -13,6 +13,12 @@ import fast_glob from "fast-glob";
import gitCommitInfo from "git-commit-info";
import MagicString from "magic-string";

setTimeout(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the error message ? Or what's the problem ?
How/why does adding .html file influence rollup ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no error message, it just hangs forever. during the hang no code is running... it seems like the node instance just sits idle forever in a 'server' mode because rollup neglects to perform a process.exit when it's done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append rollup -c --forceExit in src\mono\browser\runtime\package.json will fix this

if ((typeof (data) !== "object") || (typeof (data.sender) !== "string") || (typeof (data.cmd) !== "string"))
return;

console.log("channel_message", data.cmd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this was not flagged by eslint ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, but I can safely remove it. I forgot.

let heapshotStartedWhen = performance.now();

const tabId = (Math.random() * 100000).toFixed(0);
const globalChannel : BroadcastChannel | null = globalThis["BroadcastChannel"] ? new BroadcastChannel(".NET Runtime Diagnostics") : null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not do this in global scope, this happens very early when the ES6 module is loaded. Let's move it behind some feature flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also worried about this becoming attack surface. Let's make it opt-in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only works for content on the same origin. How would it be used for attacks?

I'm open to opt-in behind a flag or something, but I don't know how to read runtime options from typescript this early in startup.

Copy link
Member

@pavelsavara pavelsavara Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be only listening on the channel after the user opens browser console and enters

globalThis.getDotnetRuntime(0).enableDiagnosticChannel()

And maybe we could get rid of the whole complexity of HTML deployment if the whole experience was triggered just via

globalThis.getDotnetRuntime(0).downloadMemoryDump()

or

globalThis.getDotnetRuntime(0).logMemoryStats()

I would prefer that.


console.log("channel_message", data.cmd);
switch (data.cmd) {
case "whosThere": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whosThere -> dotnetHeapShotHandshake or something

let title = "unknown";
if (globalThis["document"] && globalThis["document"]["title"])
title = globalThis.document.title;
globalChannel!.postMessage({ cmd: "iAmHere", sender: tabId, version: ProductVersion, running: loaderHelpers.is_runtime_running(), title });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iAmHere -> dotnetHeapShotHandshakeResponse

Copy link
Member

@maraf maraf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to integrate the UI (.html) into build process so that it can get copied to the app wwwroot? If so, do we want to do it here or in a follow up? I can help with that part.

@@ -14,6 +14,7 @@
<WasmExtraFilesToDeploy Include="index.html" />
<WasmExtraFilesToDeploy Include="main.js" />
<WasmExtraFilesToDeploy Include="appstart-frame.html" />
<!-- <WasmExtraFilesToDeploy Include="../../../browser/runtime/heapshot/diagnostics-panel.html" /> -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<!-- <WasmExtraFilesToDeploy Include="../../../browser/runtime/heapshot/diagnostics-panel.html" /> -->

kg added 2 commits August 29, 2024 15:01
Fix "next byte" comments in class-private-definition and move min_align to a better place

Checkpoint

Checkpoint

Fix root scanning

Checkpoint

Remove old file
Fix bugs in diagnostics panel

Checkpoint

Checkpoint

Checkpoint

Checkpoint

Checkpoint

Fix repeated heapshots

Fix double free
Store format version as a counter
Push counter chunks to the front of the saved file

Better filenames

Add a "kind" string to the profiler gcroots event so we can flow through the names of registered roots
Migrate wasm heapshot roots to use the existing profiler event
Store information on js owned and cs owned objects into wasm heapshots
Store counters as signed LEB

Better JS object descriptions

Basic support for 'partial' heapshots for gathering live statistics

Store jiterp counters in snapshots

Checkpoint

Re-enable a subset of mono_counters and embed them in heap snapshots

Fix CI

Fix CI

Embed dlmalloc info into snapshot counters

Checkpoint

Cleanup
@pavelsavara
Copy link
Member

I prefer to invest into making the EventPipe single-threaded and use it's format.
That would be also useful on single-threaded WASI.

We already have bespoke mono profilers and viewers.
Related #107434

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-wasm WebAssembly architecture area-Diagnostics-mono
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants