PaTH Attention: Position Encoding via Accumulating Householder Transformations. Songlin Yang, Yikang Shen, et al. NeurIPS 2025.
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. William Brandon, Mayank Mishra, et al. NeurIPS 2024.
Granite Code Models: A Family of Open Foundation Models for Code Intelligence. Mayank Mishra, Matthew Stallone, et al. arXiv 2024.