Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Updated Jun 27, 2024 - C++
Floating-point matrix multiplication implementation (arbitrary precision)
ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication with/without coarray.
Matrix multiplication on the NPU inside RK3588
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark
In this project, the number of instructions executed by a C program is counted using Pin and C++.
OpenMP Matrix Multiplication Offloading Playground
📰 This repository contains time measurements of various algorithms on the CPU and GPU using PyCUDA: matrix multiplication, Pi computation, and bilateral filtering.
Matrix-matrix multiplication implementations benchmarking