-
To Code, or Not To Code? Exploring Impact of Code in Pre-training,
arXiv, 2408.10914
, arxiv, pdf, cication: -1Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
-
POA: Pre-training Once for Models of All Sizes,
arXiv, 2408.01031
, arxiv, pdf, cication: -1Yingying Zhang, Xin Guo, Jiangwei Lao, Lei Yu, Lixiang Ru, Jian Wang, Guo Ye, Huimei He, Jingdong Chen, Ming Yang · (POA - Qichuzyy)
-
Efficient Continual Pre-training by Mitigating the Stability Gap,
arXiv, 2406.14833
, arxiv, pdf, cication: -1Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
-
Instruction Pre-Training: Language Models are Supervised Multitask Learners,
arXiv, 2406.14491
, arxiv, pdf, cication: -1Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei · (LMOps - microsoft)
-
Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training,
arXiv, 2405.15319
, arxiv, pdf, cication: -1Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu · (llm-stacking.github)
-
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining,
arXiv, 2405.14908
, arxiv, pdf, cication: -1Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
-
Pre-training Small Base LMs with Fewer Tokens,
arXiv, 2404.08634
, arxiv, pdf, cication: -1Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis · (LLM-Inheritune - sanyalsunny111)
-
Training LLMs over Neurally Compressed Text,
arXiv, 2404.03626
, arxiv, pdf, cication: -1Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
-
The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis,
arXiv, 2404.01204
, arxiv, pdf, cication: -1Chen Yang, Junzhuo Li, Xinyao Niu, Xinrun Du, Songyang Gao, Haoran Zhang, Zhaoliang Chen, Xingwei Qu, Ruibin Yuan, Yizhi Li
-
Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training,
arXiv, 2403.00758
, arxiv, pdf, cication: -1Qingyan Guo, Rui Wang, Junliang Guo, Xu Tan, Jiang Bian, Yujiu Yang
-
Reverse Training to Nurse the Reversal Curse,
arXiv, 2403.13799
, arxiv, pdf, cication: -1Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar
-
Language models scale reliably with over-training and on downstream tasks,
arXiv, 2403.08540
, arxiv, pdf, cication: -1Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh
-
fm-cheatsheet - allenai
Website for hosting the Open Foundation Models Cheat Sheet. · (fmcheatsheet)
-
SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection,
arXiv, 2401.13160
, arxiv, pdf, cication: -1Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
-
MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications,
arXiv, 2310.15777
, arxiv, pdf, cication: -1Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao · (jiqizhixin)
-
In-Context Pretraining: Language Modeling Beyond Document Boundaries,
arXiv, 2310.10638
, arxiv, pdf, cication: -1Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis
-
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale,
arXiv, 2309.04564
, arxiv, pdf, cication: 3Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
-
metaseq - facebookresearch
· (mp.weixin.qq) · (bilibili)
-
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters,
arXiv, 2408.03314
, arxiv, pdf, cication: -1Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
-
Scaling Laws for Linear Complexity Language Models,
arXiv, 2406.16690
, arxiv, pdf, cication: -1Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
-
D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models,
arXiv, 2406.01375
, arxiv, pdf, cication: -1Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang
-
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?,
arXiv, 2406.04391
, arxiv, pdf, cication: -1Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
-
Observational Scaling Laws and the Predictability of Language Model Performance,
arXiv, 2405.10938
, arxiv, pdf, cication: -1Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
-
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory,
arXiv, 2405.08707
, arxiv, pdf, cication: -1Xueyan Niu, Bo Bai, Lei Deng, Wei Han
-
Chinchilla Scaling: A replication attempt,
arXiv, 2404.10102
, arxiv, pdf, cication: -1Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You
-
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws,
arXiv, 2404.05405
, arxiv, pdf, cication: -1Zeyuan Allen-Zhu, Yuanzhi Li
-
DiPaCo: Distributed Path Composition,
arXiv, 2403.10616
, arxiv, pdf, cication: -1Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam
-
Algorithmic progress in language models,
arXiv, 2403.05812
, arxiv, pdf, cication: -1Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
· (mp.weixin.qq)
Since 2012, the compute efficiency of pretraining language models (including large language models) has doubled roughly every 8 months, far outpacing the hardware gains predicted by Moore's Law.
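A rough back-of-envelope illustration of what that estimate implies; the 8- and 24-month doubling times are the only inputs, the rest is arithmetic:
```python
# Compare cumulative efficiency gains from an 8-month algorithmic doubling time
# (the paper's estimate) with a 24-month, Moore's-law-style hardware baseline.

def gain(years: float, doubling_months: float) -> float:
    return 2 ** (years * 12 / doubling_months)

years = 10
print(f"algorithmic progress: ~{gain(years, 8):,.0f}x")   # 2^15 ≈ 32,768x
print(f"hardware baseline:    ~{gain(years, 24):,.0f}x")  # 2^5  = 32x
```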
-
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method,
arXiv, 2402.17193
, arxiv, pdf, cication: -1Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat
-
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs,
arXiv, 2402.15627
, arxiv, pdf, cication: -1Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong
-
OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning,
arXiv, 2402.06954
, arxiv, pdf, cication: -1Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen · (openfedllm - rui-ye)
-
A Tale of Tails: Model Collapse as a Change of Scaling Laws,
arXiv, 2402.07043
, arxiv, pdf, cication: -1Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
-
Scaling Laws for Downstream Task Performance of Large Language Models,
arXiv, 2402.04177
, arxiv, pdf, cication: -1Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo
-
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives,
arXiv, 2401.16677
, arxiv, pdf, cication: -1Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
-
Zero Bubble Pipeline Parallelism,
arXiv, 2401.10241
, arxiv, pdf, cication: -1Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin · (zero-bubble-pipeline-parallelism - sail-sg)
-
Asynchronous Local-SGD Training for Language Modeling,
arXiv, 2401.09135
, arxiv, pdf, cication: -1Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism,
arXiv, 2401.02954
, arxiv, pdf, cication: -1DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong
-
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws,
arXiv, 2401.00448
, arxiv, pdf, cication: -1Nikhil Sardana, Jonathan Frankle
-
Unicron: Economizing Self-Healing LLM Training at Scale,
arXiv, 2401.00134
, arxiv, pdf, cication: -1Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
-
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling,
arXiv, 2312.15166
, arxiv, pdf, cication: -1Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim
-
Distributed Inference and Fine-tuning of Large Language Models Over The Internet,
arXiv, 2312.08361
, arxiv, pdf, cication: -1Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
· (petals - bigscience-workshop)
-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models,
arXiv, 2312.06109
, arxiv, pdf, cication: -1Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
-
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism,
arXiv, 2312.04916
, arxiv, pdf, cication: -1Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c)
-
DiLoCo: Distributed Low-Communication Training of Language Models,
arXiv, 2311.08105
, arxiv, pdf, cication: -1Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
-
Microscaling Data Formats for Deep Learning,
arXiv, 2310.10537
, arxiv, pdf, cication: -1Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf
-
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale,
arXiv, 2309.06497
, arxiv, pdf, cication: -1Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat
-
Scaling Laws for Sparsely-Connected Foundation Models,
arXiv, 2309.08520
, arxiv, pdf, cication: 1Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
-
Scaling TransNormer to 175 Billion Parameters,
arXiv, 2307.14995
, arxiv, pdf, cication: 1Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo · (jiqizhixin)
-
Inverse Scaling: When Bigger Isn't Better,
arXiv, 2306.09479
, arxiv, pdf, cication: 15Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu
-
Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training,
arXiv, 2306.08055
, arxiv, pdf, cication: 1Abraham J. Fetterman, Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, Kanjun Qiu
-
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis,
arXiv, 2305.13230
, arxiv, pdf, cication: -1Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You
-
Training Compute-Optimal Large Language Models,
arXiv, 2203.15556
, arxiv, pdf, cication: 202Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling,
ICML, 2023
, arxiv, pdf, cication: -1Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff
-
Unforgettable Generalization in Language Models,
arXiv, 2409.02228
, arxiv, pdf, cication: -1Eric Zhang, Leshem Choshen, Jacob Andreas
-
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models,
arXiv, 2403.13372
, arxiv, pdf, cication: -1Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo · (LLaMA-Factory - hiyouga)
-
Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training,
arXiv, 2403.09613
, arxiv, pdf, cication: -1Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren
-
Larimar: Large Language Models with Episodic Memory Control,
arXiv, 2403.11901
, arxiv, pdf, cication: -1Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiří
-
Simple and Scalable Strategies to Continually Pre-train Large Language Models,
arXiv, 2403.08763
, arxiv, pdf, cication: -1Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
· (twitter)
LLMs can be efficiently updated with new data by combining simple learning-rate re-warming with replaying a small fraction of the original pre-training data to counteract catastrophic forgetting.
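A minimal sketch of those two ingredients; the warmup length, schedule shape, and 5% replay fraction are illustrative assumptions, not the paper's exact settings:
```python
import math
import random

REPLAY_FRACTION = 0.05  # illustrative; the paper studies small replay fractions

def rewarmed_lr(step, total_steps, peak=3e-4, floor=3e-5, warmup=1000):
    """Re-warm the LR from zero when continual pre-training starts, then cosine-decay."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

def replay_mixed(new_batches, old_batches):
    """Swap a small fraction of new-data batches for batches of previous training data."""
    for batch in new_batches:
        yield random.choice(old_batches) if random.random() < REPLAY_FRACTION else batch
```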
-
Personalized Large Language Models,
arXiv, 2402.09269
, arxiv, pdf, cication: -1Stanisław Woźniak, Bartłomiej Koptyra, Arkadiusz Janz, Przemysław Kazienko, Jan Kocoń
-
BitDelta: Your Fine-Tune May Only Be Worth One Bit,
arXiv, 2402.10193
, arxiv, pdf, cication: -1James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai · (BitDelta - FasterDecoding)
-
EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models,
arXiv, 2402.00518
, arxiv, pdf, cication: -1Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c)
-
Scaling Sparse Fine-Tuning to Large Language Models,
arXiv, 2401.16405
, arxiv, pdf, cication: -1Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, Edoardo M. Ponti · (peft - AlanAnsell) · (sft-llm - ducdauge)
-
Tuning Language Models by Proxy,
arXiv, 2401.08565
, arxiv, pdf, cication: -1Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith
· (lightning)
-
LLaMA Pro: Progressive LLaMA with Block Expansion,
arXiv, 2401.02415
, arxiv, pdf, cication: -1Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan · (LLaMA-Pro - TencentARC) · (huggingface)
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models,
arXiv, 2401.01335
, arxiv, pdf, cication: -1Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
· (SPIN - uclaml)
-
Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes,
arXiv, 2312.06353
, arxiv, pdf, cication: -1Zhen Qin, Daoyuan Chen, Bingchen Qian, Bolin Ding, Yaliang Li, Shuiguang Deng
-
Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2,
arXiv, 2311.10702
, arxiv, pdf, cication: -1Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy
-
Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse,
arXiv, 2311.07468
, arxiv, pdf, cication: -1Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan · (jiqizhixin)
-
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization,
arXiv, 2311.06243
, arxiv, pdf, cication: -1Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng
-
A recipe for frontier model post-training
· (mp.weixin.qq)
-
Llama-3.1-Storm-8B: Improved SLM with Self-Curation + Model Merging
· (huggingface) · (x)
-
Fine-tuning GPT3.5-turbo based on 140k slack messages · Ross Lazerowitz
-
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach,
arXiv, 2406.04594
, arxiv, pdf, cication: -1Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao
-
Linear Transformers with Learnable Kernel Functions are Better In-Context Models,
arXiv, 2402.10644
, arxiv, pdf, cication: -1Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov
-
Rethinking Optimization and Architecture for Tiny Language Models,
arXiv, 2402.02791
, arxiv, pdf, cication: -1Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang · (RethinkTinyLM - YuchuanTian)
-
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers,
arXiv, 2311.10642
, arxiv, pdf, cication: -1Vukasin Bozic, Danilo Dordervic, Daniele Coppola, Joseph Thommes
-
UT5: Pretraining Non autoregressive T5 with unrolled denoising,
arXiv, 2311.08552
, arxiv, pdf, cication: -1Mahmoud G. Salem, Jiayu Ye, Chu-Cheng Lin, Frederick Liu
-
How to Build Low-cost Networks for Large Language Models (without Sacrificing Performance)?,
arXiv, 2307.12169
, arxiv, pdf, cication: -1Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
-
Stack More Layers Differently: High-Rank Training Through Low-Rank Updates,
arXiv, 2307.05695
, arxiv, pdf, cication: 2Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky
-
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
· (mp.weixin.qq)
-
SOAP: Improving and Stabilizing Shampoo using Adam,
arXiv, 2409.11321
· (SOAP - nikhilvyas)
-
Theory, Analysis, and Best Practices for Sigmoid Self-Attention,
arXiv, 2409.04431
, arxiv, pdf, cication: -1Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani · (ml-sigmoid-attention - apple)
-
The AdEMAMix Optimizer: Better, Faster, Older,
arXiv, 2409.03137
, arxiv, pdf, cication: -1Matteo Pagliardini, Pierre Ablin, David Grangier
· (AdEMAMix-Optimizer-Pytorch - nanowell)
-
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler,
arXiv, 2408.13359
, arxiv, pdf, cication: -1Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda · (huggingface)
-
grokadamw - cognitivecomputations
-
Patch-Level Training for Large Language Models,
arXiv, 2407.12665
, arxiv, pdf, cication: -1Chenze Shao, Fandong Meng, Jie Zhou · (PatchTrain - shaochenze)
-
optimizers - facebookresearch
-
Scaling Exponents Across Parameterizations and Optimizers,
arXiv, 2407.05872
, arxiv, pdf, cication: -1Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee
-
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion,
arXiv, 2407.01392
, arxiv, pdf, cication: -1Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
· (diffusion-forcing - buoyancy99)
-
MiniCPM: Unveiling the Potential of End-side Large Language Models
-
Adam-mini: Use Fewer Learning Rates To Gain More,
arXiv, 2406.16793
, arxiv, pdf, cication: -1Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
-
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs,
arXiv, 2406.10209
, arxiv, pdf, cication: -1Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele · (goldfish-loss - ahans30)
-
2BP: 2-Stage Backpropagation,
arXiv, 2405.18047
, arxiv, pdf, cication: -1Christopher Rae, Joseph K. L. Lee, James Richings
-
The Road Less Scheduled,
arXiv, 2405.15682
, arxiv, pdf, cication: -1Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch)
-
Thermodynamic Natural Gradient Descent,
arXiv, 2405.13817
, arxiv, pdf, cication: -1Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles
-
The Entropy Enigma: Success and Failure of Entropy Minimization,
arXiv, 2405.05012
, arxiv, pdf, cication: -1Ori Press, Ravid Shwartz-Ziv, Yann LeCun, Matthias Bethge · (EntropyEnigma - oripress)
-
psgd_torch - lixilinx
Pytorch implementation of preconditioned stochastic gradient descent (affine group preconditioner, low-rank approximation preconditioner and more) · (Preconditioned-Stochastic-Gradient-Descent - opooladz)
-
A Large-Scale Exploration of $μ$-Transfer,
arXiv, 2404.05728
, arxiv, pdf, cication: -1Lucas Lingle
-
schedule_free - facebookresearch
Schedule-Free Optimization in PyTorch
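A rough usage sketch, assuming the `schedulefree` package published from this repo and its `AdamWScheduleFree` class (see the README for the authoritative API); the explicit mode switches matter because the optimizer tracks an averaged iterate:
```python
import torch
import schedulefree  # assumed: the package released from facebookresearch/schedule_free

model = torch.nn.Linear(16, 4)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # tell the optimizer it is in the training phase
for _ in range(100):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimizer.eval()   # switch to the averaged weights before evaluation or checkpointing
```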
-
Towards Optimal Learning of Language Models,
arXiv, 2402.17759
, arxiv, pdf, cication: -1Yuxian Gu, Li Dong, Yaru Hao, Qingxiu Dong, Minlie Huang, Furu Wei · (aka)
-
Stabilizing Transformer Training by Preventing Attention Entropy Collapse,
ICML, 2023
, arxiv, pdf, cication: -1Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind · (ml-sigma-reparam - apple)
-
CAME: Confidence-guided Adaptive Memory Efficient Optimization,
arXiv, 2307.02047
, arxiv, pdf, cication: -1Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You · (qbitai)
-
Octopus v4: Graph of language models,
arXiv, 2404.19296
, arxiv, pdf, cication: -1Wei Chen, Zhiyuan Li
-
Training-Free Pretrained Model Merging,
arXiv, 2403.01753
, arxiv, pdf, cication: -1Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, Jie Song
The proposed merging framework reconciles the inconsistency between unit similarities measured in weight space and in activation space by linearly combining the two similarity matrices, yielding better multi-task performance in the merged model.
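A minimal sketch of that core idea, not the paper's exact algorithm; cosine similarity, the mixing weight `lam`, and the single-layer scope are all simplifications:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_sim(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def merge_layer(W_a, W_b, acts_a, acts_b, lam=0.5):
    # Linearly combine weight-space and activation-space unit similarities,
    # align model B's units to model A's, then average the aligned weights.
    # acts_*: (n_samples, n_units) activations collected on shared probe inputs.
    S = lam * cosine_sim(W_a, W_b) + (1 - lam) * cosine_sim(acts_a.T, acts_b.T)
    _, cols = linear_sum_assignment(-S)  # permutation maximizing total similarity
    return 0.5 * (W_a + W_b[cols])
```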
-
Evolutionary Optimization of Model Merging Recipes,
arXiv, 2403.13187
, arxiv, pdf, cication: -1Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha
· (evolutionary-model-merge - sakanaai)
-
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM,
arXiv, 2403.07816
, arxiv, pdf, cication: -1Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston
-
AutoMerger - mlabonne 🤗
· (huggingface)
-
Learning to Decode Collaboratively with Multiple Language Models,
arXiv, 2403.03870
, arxiv, pdf, cication: -1Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag · (co-llm - clinicalml)
-
FuseChat: Knowledge Fusion of Chat Models,
arXiv, 2402.16107
, arxiv, pdf, cication: -1Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi · (FuseLLM - fanqiwan)
-
Knowledge Fusion of Large Language Models,
arXiv, 2401.10491
, arxiv, pdf, cication: -1Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi · (FuseLLM - fanqiwan)
-
Beagle14-7B - mlabonne 🤗
-
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM,
arXiv, 2401.02994
, arxiv, pdf, cication: -1Xiaoding Lu, Adian Liusie, Vyas Raina, Yuwen Zhang, William Beauchamp
-
LLM Augmented LLMs: Expanding Capabilities through Composition,
arXiv, 2401.02412
, arxiv, pdf, cication: 8Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, Partha Talukdar
-
mergekit - cg123
Tools for merging pretrained large language models.
-
LM-Cocktail: Resilient Tuning of Language Models via Model Merging,
arXiv, 2311.13534
, arxiv, pdf, cication: -1Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing · (FlagEmbedding - FlagOpen)
-
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models,
arXiv, 2311.08692
, arxiv, pdf, cication: -1Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou
-
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch,
arXiv, 2311.03099
, arxiv, pdf, cication: -1Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li · (mergelm - yule-buaa)
-
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion,
arXiv, 2306.02561
, arxiv, pdf, cication: 16Dongfu Jiang, Xiang Ren, Bill Yuchen Lin · (LLM-Blender - yuchenlin) · (mp.weixin.qq)
- Model Merging: A Survey - by Cameron R. Wolfe, Ph.D.
- Evolutionary Model Merging For All
- mergekit-gui - julien-c 🤗
- Model merging lessons in The Waifu Research Department
- Model Merging - a osanseviero Collection
- 🤗 PEFT welcomes new merging methods
- Weight averaging and model merging for LLMs seem to be the most interesting themes in 2024 so far.
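The simplest instance of the weight-averaging theme above is a uniform "model soup" over fine-tunes that share one base architecture; a sketch (real merges, e.g. in mergekit, add task vectors, TIES, SLERP, and similar methods):
```python
import torch

def uniform_soup(state_dicts):
    """Uniformly average the parameters of models with identical architectures."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# merged.load_state_dict(uniform_soup([m1.state_dict(), m2.state_dict()]))
```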
-
OLMoE: Open Mixture-of-Experts Language Models,
arXiv, 2409.02060
, arxiv, pdf, cication: -1Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert · (OLMoE - allenai) · (huggingface) · (interconnects)
-
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts,
arXiv, 2408.08274
, arxiv, pdf, cication: -1Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun
-
Layerwise Recurrent Router for Mixture-of-Experts,
arXiv, 2408.06793
, arxiv, pdf, cication: -1Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu · (RMoE - qiuzh20)
-
Mixture of A Million Experts,
arXiv, 2407.04153
, arxiv, pdf, cication: 1Xu Owen He
-
A Survey on Mixture of Experts,
arXiv, 2407.06204
, arxiv, pdf, cication: -1Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang
-
A Closer Look into Mixture-of-Experts in Large Language Models,
arXiv, 2406.18219
, arxiv, pdf, cication: -1Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu · (Look-into-MoEs - kamanphoebe)
-
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts,
arXiv, 2406.12034
, arxiv, pdf, cication: -1Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter
-
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models,
arXiv, 2406.06563
, arxiv, pdf, cication: -1Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng
-
MoEUT: Mixture-of-Experts Universal Transformers,
arXiv, 2405.16039
, arxiv, pdf, cication: -1Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning · (moeut - robertcsordas)
-
Multi-Head Mixture-of-Experts,
arXiv, 2404.15045
, arxiv, pdf, cication: -1Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei
-
mergoo - Leeroo-AI
A library for easily merging multiple LLM experts and efficiently training the merged LLM.
-
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models,
arXiv, 2404.02258
, arxiv, pdf, cication: -1David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro · (qbitai) · (OLMo - thepowerfuldeez) · (twitter)
-
JetMoE - myshell-ai
Reaching LLaMA2 Performance with 0.1M Dollars
· (research.myshell)
JetMoE-8B has 24 blocks, each containing two MoE layers: a Mixture of Attention heads (MoA) and a Mixture of MLP Experts (MoE). Each MoA and MoE layer has 8 experts, of which 2 are activated per input token, giving 2.2B active parameters.
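Hedged arithmetic consistent with that description; the per-expert and shared parameter counts below are placeholders chosen to roughly reproduce the reported 8B-total / 2.2B-active split, not official JetMoE numbers:
```python
blocks, layers_per_block, experts, top_k = 24, 2, 8, 2
expert_params = 20e6   # placeholder size of one expert (MoA or MoE)
shared_params = 0.3e9  # placeholder for embeddings, norms, and other shared weights

total = shared_params + blocks * layers_per_block * experts * expert_params
active = shared_params + blocks * layers_per_block * top_k * expert_params
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")  # ≈ 8.0B / 2.2B
```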
-
megablocks - databricks
-
Scattered Mixture-of-Experts Implementation,
arXiv, 2403.08245
, arxiv, pdf, cication: -1Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville
-
Scaling Laws for Fine-Grained Mixture of Experts,
arXiv, 2402.07871
, arxiv, pdf, cication: -1Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski
-
BlackMamba: Mixture of Experts for State-Space Models,
arXiv, 2402.01771
, arxiv, pdf, cication: -1Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge · (BlackMamba - Zyphra)
-
LocMoE: A Low-overhead MoE for Large Language Model Training,
arXiv, 2401.13920
, arxiv, pdf, cication: -1Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen · (jiqizhixin)
-
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts,
arXiv, 2401.04081
, arxiv, pdf, cication: -1Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,
arXiv, 2401.06066
, arxiv, pdf, cication: -1Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu · (DeepSeek-MoE - deepseek-ai) · (huggingface)
-
Mixtral of Experts,
arXiv, 2401.04088
, arxiv, pdf, cication: -1Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand
-
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning,
arXiv, 2312.12379
, arxiv, pdf, cication: -1Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang · (qbitai)
-
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention,
arXiv, 2312.07987
, arxiv, pdf, cication: -1Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
· (moe_attention - robertcsordas)
-
megablocks-public - mistralai
· (qbitai)
-
llama-mistral - dzhulgakov
Inference code for Mistral and Mixtral hacked up into original Llama implementation
-
SmartMoE - zms1999
A MoE implementation for PyTorch, [ATC'23] SmartMoE · (jiqizhixin)
-
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models,
arXiv, 2305.14705
, arxiv, pdf, cication: 5Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen · (jiqizhixin)
-
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models,
arXiv, 2402.01739
, arxiv, pdf, cication: -1Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
-
OpenMoE - XueFuzhao
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
· (OpenMoE - XueFuzhao)
-
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts,
Proceedings of Machine Learning and Systems, 2023
, arxiv, pdf, cication: -1Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,
Journal of Machine Learning Research, 2022
, arxiv, pdf, cication: -1William Fedus, Barret Zoph, Noam Shazeer
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,
arXiv, 2006.16668
, arxiv, pdf, cication: -1Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,
arXiv, 1701.06538
, arxiv, pdf, cication: -1Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
-
How do mixture-of-experts layers affect transformer models? - Stack Overflow
-
Accelerating MoE model inference with Locality-Aware Kernel Design | PyTorch
-
makeMoE - AviSoori1x
From scratch implementation of a sparse mixture of experts language model inspired by Andrej Karpathy's makemore :) · (huggingface) · (qbitai)
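For reference, the sparse MoE layer such from-scratch repos implement reduces to a linear router plus top-k expert selection; a minimal sketch (not makeMoE's actual code; the dense per-expert loop stands in for the batched dispatch real systems use):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # route each token
        weights = F.softmax(weights, dim=-1)                # renormalize over the k chosen
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 64))  # (10, 64); only 2 of 8 experts run per token
```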
-
8x7B-MoE-test-NOT-MIXTRAL - CausalLM 🤗
-
Unlocking Continual Learning Abilities in Language Models,
arXiv, 2406.17245
, arxiv, pdf, cication: -1Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka Chun Cheung, Reynold Cheng, Jie Fu · (MIGU - wenyudu)
-
Online Training of Large Language Models: Learn while chatting,
arXiv, 2403.04790
, arxiv, pdf, cication: -1Juhao Liang, Ziwei Wang, Zhuoheng Ma, Jianquan Li, Zhiyi Zhang, Xiangbo Wu, Benyou Wang
-
torchtitan - pytorch
A native PyTorch Library for large model training
-
DisTrO - NousResearch
Distributed Training Over-The-Internet
-
1.5-Pints - Pints-AI
A compact LLM pretrained in 9 days by using high quality data
-
Liger-Kernel - linkedin
Efficient Triton Kernels for LLM Training
-
cookbook - EleutherAI
Deep learning for dummies. All the practical details and useful utilities that go into working with real models.
-
autotrain-advanced - huggingface
-
nanotron - huggingface
Minimalistic large language model 3D-parallelism training
-
mistral-finetune - mistralai
-
ColossalAI - hpcaitech
Making large AI models cheaper, faster and more accessible
-
xtuner - InternLM
An efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM)
-
corenet - apple
CoreNet: A library for training deep neural networks
-
maxtext - google
A simple, performant and scalable Jax LLM!
-
lightning-thunder - Lightning-AI
Source to source compiler for PyTorch. It makes PyTorch programs faster on single accelerators and distributed.
-
zero-bubble-pipeline-parallelism - sail-sg
Zero Bubble Pipeline Parallelism · (mp.weixin.qq)
-
levanter - stanford-crfm
Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax
-
axolotl - OpenAccess-AI-Collective
Go ahead and axolotl questions
-
LLMtuner - promptslab
Tune LLM in few lines of code
-
LLM-FineTuning-Large-Language-Models - rohan-paul
LLM (Large Language Model) FineTuning
-
Megatron-LM - NVIDIA
Ongoing research training transformer models at scale
-
saturn - knagrecha
Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.
-
SynapseML - microsoft
Simple and Distributed Machine Learning
-
gpt-llm-trainer - mshumer
· (qbitai)
-
LLaMA-Factory - hiyouga
Easy-to-use LLM fine-tuning framework (LLaMA, BLOOM, Mistral, Baichuan, Qwen, ChatGLM)
-
Megatron-LLaMA - alibaba
Best practice for training LLaMA models in Megatron-LM · (jiqizhixin)
-
Efficient-PyTorch - Lyken17
My best practice of training large dataset using PyTorch.
-
Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs,
arXiv, 2311.02262
, arxiv, pdf, cication: -1Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao
-
Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks,
arXiv, 2310.02244
, arxiv, pdf, cication: -1Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou · (qbitai)
-
Think before you speak: Training Language Models With Pause Tokens,
arXiv, 2310.02226
, arxiv, pdf, cication: -1Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan · (qbitai)
-
Textbooks Are All You Need II: phi-1.5 technical report,
arXiv, 2309.05463
, arxiv, pdf, cication: 9Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee · (jiqizhixin)
-
Towards Robust and Efficient Continual Language Learning,
arXiv, 2307.05741
, arxiv, pdf, cication: 1Adam Fisch, Amal Rannen-Triki, Razvan Pascanu, Jörg Bornschein, Angeliki Lazaridou, Elena Gribovskaya, Marc'Aurelio Ranzato
- Pathways paper deep dive [Paper Reading] - bilibili
- GPipe paper deep dive [Paper Reading] - bilibili
- Megatron-LM paper deep dive [Paper Reading] - bilibili
- ZeRO paper deep dive [Paper Reading] - bilibili
-
static.sched.com/hosted_files/pytorch2024/8f/Pytorch Conference - Making LLM training faster.pdf
· (x)
-
packing-with-FA2 - 🤗
-
Training great LLMs entirely from ground zero in the wilderness as a startup — Yi Tay
· (twitter) · (mp.weixin.qq)
-
Everything about Distributed Training and Efficient Finetuning | Sumanth's Personal Website
-
llm-alignment-survey - Magnetic2014
A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" for more details!