**Why the loss drops noticeably at the end of each epoch during training: an explanation**

https://github.com/huggingface/transformers/issues/18730