Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Megh Thakkar; Yash More; Quentin Fournier; Matthew Riemer; Pin-Yu Chen; Amal Zouaq; Payel Das; Sarath Chandar

NeurIPS 2024

Workshop paper

10 Dec 2024

Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Abstract

There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.

Workshop paper