
Server Memo

Hardware Assembly

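ASMB9-iKVM | ASUS

Z11PR-D16 | ASUS

Installing an OS on an ASUS server via iKVM - Zhihu

The servers in my research group could not meet my personal needs, so I recently put together a small server for my own study use.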

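| Part | Spec | Price (after coupons) | Link | Notes |
|---|---|---|---|---|
| CPU | Xeon Platinum 8260 × 2 | 550 × 2 | https://item.taobao.com/item.htm?id=852556787651 | |
| Motherboard | ASUS Z11PR-D16 | 1435.7 | https://item.taobao.com/item.htm?id=904033503394 | |
| Memory | Samsung DDR4 64G ECC 2666MHz × 6 | 7150 (combined with the SSD) | | Tax included |
| SSD | ZHITAI TiPlus7100 2T | (included in the memory price) | | |
| Water cooling | Bykski B-FRDSR-V2 LGA3647, dual-socket | 862.6 | https://item.taobao.com/item.htm?id=880270522805 | Note "3647" to customer service |
| PSU | Great Wall 猎金 G11, Gold-rated, fully modular, 1100W | 630.85 | https://detail.tmall.com/item.htm?id=708933603177 | |
| Case | Huntkey GX750C + 3 fans | 315 | Zhongguancun Kemao Electronics Mall, 4B173 | |
| Total | | 11494.15 | | |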

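On a classmate's recommendation I bought the ASUS Z11PR-D16 motherboard. Its AST2500 chip, the baseboard management controller (BMC), comes with ASMB9-iKVM built in and provides IPMI (Intelligent Platform Management Interface), which is a vendor-neutral standard rather than one manufacturer's name for the feature. Plug a network cable into the motherboard (take care not to use the dedicated iKVM port) and connect the other end to the router. Any device on the same LAN can then reach IPMI at the IP address assigned by the router, for remote control, BIOS updates, and similar tasks.

TIP

For installing the OS, a USB drive is still the better choice: loading an image over IPMI is very slow.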

Enable time synchronization

bash
sudo apt install -y systemd-timesyncd
sudo systemctl enable --now systemd-timesyncd

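Run the following commands to stress-test (burn in) the machine

bash
# --cpu 96 matches the 2 × 48 hardware threads; --vm-bytes 384G equals the full installed RAM (6 × 64G), so lower it if the OOM killer fires
stress-ng --cpu 96 --vm 1 --vm-bytes 384G --timeout 300 --metrics-brief
# psensor is a GUI temperature monitor; run it in another terminal while the stress test is active
psensor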
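
Because `--vm-bytes 384G` asks for all of the installed memory, it can help to derive the size from what the kernel actually reports. The following is a minimal sketch under that assumption; the helper name `vm_bytes_from_meminfo` and the 80% figure are illustrative choices, not part of the original notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print <pct>% of MemTotal as a stress-ng size string in MiB.
# $1 = meminfo file (normally /proc/meminfo), $2 = percentage of RAM to use.
vm_bytes_from_meminfo() {
    # MemTotal is reported in kB; convert the requested share to whole MiB.
    awk -v pct="$2" '/^MemTotal:/ { printf "%dm\n", $2 * pct / 100 / 1024 }' "$1"
}

size=$(vm_bytes_from_meminfo /proc/meminfo 80)
# Only print the suggested command; run it by hand once the size looks sensible.
echo "suggested: stress-ng --cpu $(nproc) --vm 1 --vm-bytes ${size} --timeout 300 --metrics-brief"
```

Leaving some headroom keeps the desktop, sshd, and the monitoring tool responsive during the run.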

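User management

SSH login

After installing the OS, first give your own account administrator rights. The administrator account is root and my own user account is fisherd, so run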

bash
su -

to switch to the root user, then run

bash
usermod -aG sudo <user>

To add a new user, run

bash
adduser <user>

Next, configure SSH

bash
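# Update the package index
sudo apt update
# Install the package
sudo apt install -y openssh-server
# Start it now and enable it at boot
sudo systemctl enable --now ssh          # on Ubuntu 18.04+ the service is named ssh; older releases may use sshd
# Confirm that it is listening
sudo ss -tlnp | grep :22                 # an sshd process in the output means success

On the same router, look up the IP address assigned to Fisherd-Server, here 192.168.31.225; you can then log in over SSH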

bash
ssh -p 22 fisherd@192.168.31.225

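Next, set up key-based login, following "Setting up SSH key login | 菜鸟教程 (Runoob)"; it is best to generate the key pair on your local computer, though

shell
# On the local machine. On Windows the private key defaults to C:/Users/<user>/.ssh/id_rsa
# and the public key to C:/Users/<user>/.ssh/id_rsa.pub
# (id_ed25519 and id_ed25519.pub on newer OpenSSH releases that default to Ed25519)
ssh-keygen
# On the server: append the public key to ~/.ssh/authorized_keys, then fix the permissions
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh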
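
The "append the public key, then fix the permissions" step can be wrapped so that running it twice does not duplicate the key. This is only a sketch; the function name `install_pubkey` and its arguments are hypothetical, and it assumes bash on the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a public key to authorized_keys exactly once and
# enforce the permissions sshd expects.
# $1 = public key file, $2 = target .ssh directory (defaults to ~/.ssh).
install_pubkey() {
    local pubkey_file=$1 ssh_dir=${2:-$HOME/.ssh}
    local auth="$ssh_dir/authorized_keys"
    local key
    key=$(<"$pubkey_file") || return 1
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x: whole-line match, -F: fixed string; append only if the key is missing.
    grep -qxF -- "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
}
```

After copying the public key to the server, something like `install_pubkey ~/id_rsa.pub` would do the rest. From a Linux or macOS client, `ssh-copy-id fisherd@192.168.31.225` is a common alternative, but it may not be available with the OpenSSH client on Windows.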

Miscellaneous notes

Show the disk usage of each first-level subdirectory of the current directory

bash
du -h --max-depth=1
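
When the directory has many children, sorting the listing makes the large ones easy to spot. A small sketch, assuming GNU coreutils (`sort -h` understands the human-readable suffixes that `du -h` prints); the function name `dir_usage` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: per-subdirectory usage, smallest first.
# $1 = directory to inspect (defaults to the current directory).
dir_usage() {
    du -h --max-depth=1 -- "${1:-.}" | sort -h
}

dir_usage .
```

The biggest entries, and the total for the directory itself, end up at the bottom of the output.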

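The sh and bash commands are not the same: bash supports more advanced features. On modern Linux, sh may be a symlink to another shell, which you can check with: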

bash
ls -l $(which sh)
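
`ls -l` shows only the first link in the chain. The sketch below follows every symlink and then probes one bash-only feature (arrays) under `sh`; on Debian and Ubuntu `sh` is typically dash, which rejects it. The function name `resolve_shell` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the real binary behind a shell name by following all symlinks.
# $1 = shell name, e.g. sh or bash.
resolve_shell() {
    readlink -f "$(command -v "$1")"
}

echo "sh   -> $(resolve_shell sh)"
echo "bash -> $(resolve_shell bash)"

# Array assignment is a bash extension; a strictly POSIX sh reports a syntax error.
if sh -c 'a=(1 2 3)' 2>/dev/null; then
    echo "sh accepts bash arrays (sh is probably bash)"
else
    echo "sh rejects bash arrays (scripts using them need #!/bin/bash)"
fi
```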

Slurm job management (single-node quick setup)

shell
# 1. Install the packages
sudo apt update
sudo apt install -y munge slurm-wlm slurmctld slurmd

# 1.5 Generate the configuration (optional)
# --- slurm.conf can be generated with the bundled HTML configurator ---
# Locate the configurator
dpkg -L slurmctld | grep slurm-wlm-configurator.html
# Make it readable
sudo chmod +r /usr/share/doc/slurmctld/slurm-wlm-configurator.html
# Serve it temporarily over HTTP (Ctrl+C to stop)
cd /usr/share/doc/slurmctld
python3 -m http.server 8000
# Then open http://localhost:8000/slurm-wlm-configurator.html in a browser
# --- end of the configurator steps ---
# 2. Write the configuration
HOST=$(hostname)
sudo mkdir -p /etc/slurm
# --- a slurm.conf that has tested as working so far ---
cat <<EOF | sudo tee /etc/slurm/slurm.conf
ClusterName=cluster
SlurmctldHost=$HOST
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
#
# COMPUTE NODES
NodeName=$HOST CPUs=$(nproc) RealMemory=$(free -m | awk '/^Mem:/{print $2}') State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
#
# TIMERS
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
SchedulerParameters=allow_oversubscribe
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
EOF

# 3. Create the directories and set ownership in one go
sudo mkdir -p /var/lib/slurm/{slurmctld,slurmd} /var/log/slurm
sudo chown -R slurm: /var/lib/slurm /var/log/slurm
# 4. Start the services
sudo systemctl enable --now munge
sudo systemctl enable --now slurmctld slurmd

# 5. Verify
sinfo
sinfo -Nl
srun -n1 hostname # should print the hostname
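
Once `srun` works, a batch job is the natural next check. Below is a minimal sketch of a job script for the `debug` partition defined in the slurm.conf above; the job name, resource numbers, and the file name `hello.sh` are all hypothetical.

```shell
#!/bin/bash
# Lines starting with #SBATCH are read by sbatch and ignored by bash.
#SBATCH --job-name=hello
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
# %j in the output file name is replaced by the job ID.
#SBATCH --output=hello-%j.out

# Slurm exports SLURM_* variables inside a job; the fallbacks keep the script
# runnable with plain bash as a dry run.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "job ${SLURM_JOB_ID:-local} on $(uname -n) using ${OMP_NUM_THREADS} thread(s)"
```

Saved as `hello.sh`, it would be submitted with `sbatch hello.sh` and watched with `squeue`; the echoed line should then appear in `hello-<jobid>.out`.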

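If something goes wrong and you need to uninstall and reinstall, these are the removal commands

bash
# Stop all services
sudo systemctl stop slurmctld slurmd slurmdbd 2>/dev/null

# Remove the packages (including their configuration files)
# The pattern is quoted so the shell does not expand it against local file names
sudo apt remove --purge -y 'slurm*' munge
sudo apt autoremove -y

# Delete directories, logs, data, and the user (optional)
sudo rm -rf /etc/slurm /var/spool/slurm* /var/log/slurm \
            /usr/local/slurm /run/slurm /var/lib/slurm
sudo deluser --remove-home slurm 2>/dev/null
sudo delgroup slurm 2>/dev/null

# Clear the systemd state
sudo systemctl daemon-reload
sudo systemctl reset-failed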

Handy aliases for viewing jobs

shell
# useful Slurm aliases
alias si='sinfo -o "%.9P %.10A %.5t %.10n %.13C %.8O"'
alias sq='squeue -o "%.18i %.9P %.12j %.12u %.12T %.12M %.16l %.6D %R"'
alias sac='sacct --format=JobID,JobName,AllocCPUs,State,ExitCode,MaxRSS,Elapsed,Start,End' # Slurm accounting
alias sst='sstat --format=JobID,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,AveVMSize,\
MaxRSS,MaxRSSNode,MaxRSSTask,AveRSS' # must be followed by -j <jobid>

Installing development tools

Conda

bash
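sudo apt update && sudo apt install -y wget bzip2
wget https://repo.anaconda.com/archive/Anaconda3-2025.06-1-Linux-x86_64.sh
# Alternative: the Tsinghua mirror (it seems to block headless downloads)
# wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2025.06-1-Linux-x86_64.sh
bash Anaconda3-2025.06-1-Linux-x86_64.sh
# When prompted for the install location, enter /opt/anaconda3
conda config --set auto_activate_base false   # do not activate the base environment automatically

Add the China mirror channels (this affects only the current account and is written to the current user's ~/.condarc)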

bash
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --set show_channel_urls yes

Development tools that can be installed directly with apt

bash
sudo apt install -y libblas-test libblas-dev libblas3 liblapack-test liblapack-dev liblapack3
# sudo apt install -y libscalapack-mpi-dev   # better skipped if you plan to install oneAPI
sudo apt install -y cmake
sudo apt install -y gdb
sudo apt install -y pkg-config

Intel oneAPI

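Install Environment Modules first, which makes the oneAPI environment easier to manage

bash
sudo apt install -y environment-modules

After installing oneAPI from the offline installer, add a script under /etc/profile.d/ that sources setvars.sh automatically at login

bash
sudo tee /etc/profile.d/oneapi.sh >/dev/null <<'EOF'
# Intel oneAPI environment variables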
[[ -f /opt/intel/oneapi/setvars.sh ]] && source /opt/intel/oneapi/setvars.sh
EOF
sudo chmod 644 /etc/profile.d/oneapi.sh

Sourcing setvars.sh is usually slow, though. An alternative is to expose oneAPI as modules and load them automatically

bash
# ===== On the previous shared server =====
# All oneAPI modules were written into a single modulefile, oneapi2024.0, stored under ${HOME}/intel/modulefiles
# For the file contents see https://cuterwrite.top/p/intel-oneapi
export MODULEPATH=${HOME}/intel/modulefiles:$MODULEPATH
module load oneapi2024.0

# ===== On Fisher-Server =====
cd /opt/intel/oneapi
sudo ./modulefiles-setup.sh --output-dir=/opt/intel/oneapi/modulefiles

# ===== Check that the modules can be loaded =====
module use /opt/intel/oneapi/modulefiles
module avail -t | grep -E '^(advisor|ccl|compiler|debugger|dev-utilities|dpct|dpl|intel[^/]*|ishmem|mkl|mpi|tbb|umf|vtune)/latest$'
# ===== Add them to the login startup =====
sudo tee /etc/profile.d/oneapi-modules.sh >/dev/null <<'EOF'
# Intel oneAPI modules
echo "start loading intelOneAPI module"
module use /opt/intel/oneapi/modulefiles
module purge
module load $(module avail -t | grep -E '^(advisor|ccl|compiler|debugger|dev-utilities|dpct|dpl|intel[^/]*|ishmem|mkl|mpi|tbb|umf|vtune)/latest$' | tr '\n' ' ')
echo "finish loading intelOneAPI module"
EOF
sudo chmod 644 /etc/profile.d/oneapi-modules.sh
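
To see what the `grep -E ... | tr` pipeline hands to `module load`, it can be run on canned input. This is a sketch with a shortened pattern and a made-up sample of `module avail -t` output; the function name `select_latest` and the version string are hypothetical. Depending on the Environment Modules version, `module avail` may write to stderr, in which case `2>&1` would be needed before the pipe; that is an assumption to check on the actual machine.

```shell
#!/usr/bin/env bash
# Keep only the "<name>/latest" entries for a chosen set of components and
# join them on one line, as the profile script does before `module load`.
select_latest() {
    grep -E '^(compiler|mkl|mpi|tbb)/latest$' | tr '\n' ' '
}

# Hypothetical sample of `module avail -t` output (one entry per line).
printf '%s\n' \
    '/opt/intel/oneapi/modulefiles:' \
    'compiler/2025.2.0' \
    'compiler/latest' \
    'mkl/latest' \
    'vtune/latest' | select_latest
echo
```

With this sample the function prints `compiler/latest mkl/latest`: the header line and the versioned entry fail the `/latest$` anchor, and `vtune` is absent from the shortened pattern.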

The VS Code terminal is a non-login shell by default and does not load the files under /etc/profile.d automatically, so add the following to .bashrc

shell
# Ensure /etc/profile.d/ scripts are loaded in non-login shells
if [ -d /etc/profile.d ]; then
    for script in /etc/profile.d/*.sh; do
        if [ -r "$script" ]; then
            source "$script"
        fi
    done
fi
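
Whether a given terminal is a login shell can be checked directly, which helps when deciding if the snippet above is needed. A minimal sketch, assuming bash; the function name `shell_kind` is made up.

```shell
#!/usr/bin/env bash
# Report whether the current bash is a login shell.
# login_shell is a read-only shopt flag that bash sets at startup.
shell_kind() {
    if shopt -q login_shell; then
        echo "login"
    else
        echo "non-login"
    fi
}

echo "this shell is a $(shell_kind) shell"
```

Running `bash -lc 'shopt -q login_shell && echo login'` gives a login shell to compare against.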

ELPA

Download the ELPA source code from https://gitlab.mpcdf.mpg.de/elpa/elpa.git. The configure file is only generated after running autogen.sh. Installation manual: https://gitlab.mpcdf.mpg.de/elpa/elpa/-/blob/master/documentation/INSTALL.md

shell
cd ~/Downloads
wget https://gitlab.mpcdf.mpg.de/elpa/elpa/-/archive/new_release_2025.06.001/elpa-new_release_2025.06.001.tar.bz2
tar -xvf elpa-new_release_2025.06.001.tar.bz2
cd elpa-new_release_2025.06.001
sh autogen.sh
cd ..
mkdir elpa_2025.06.001-build
cd elpa_2025.06.001-build
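# The flags below assume MKL_HOME points at the MKL root; oneAPI's own scripts export MKLROOT,
# so set MKL_HOME=$MKLROOT first if MKL_HOME is empty
../elpa-new_release_2025.06.001/configure \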
 --prefix=/opt/elpa_2025.06.001-install FC="mpiifx" FCFLAGS="-qopenmp -O3 -xCORE-AVX512" CC="mpiicx" CFLAGS="-qopenmp -O3 -xCORE-AVX512" --enable-openmp --enable-avx512 \
LDFLAGS="-Wl,--copy-dt-needed-entries" \
SCALAPACK_LDFLAGS="-L$MKL_HOME/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -Wl,-rpath,$MKL_HOME/lib/intel64" \
SCALAPACK_FCFLAGS="-L$MKL_HOME/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -I$MKL_HOME/include/intel64/lp64"
make -j `nproc`
sudo make install
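# Link the versioned header directory so that <prefix>/include/elpa resolves:
# sudo ln -s <ELPA prefix>/include/<elpa versioned dir>/elpa <ELPA prefix>/include
# Concrete example (sudo because the prefix is under /opt):
sudo ln -s /opt/elpa_2025.06.001-install/include/elpa_openmp-2025.06.001/elpa /opt/elpa_2025.06.001-install/include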

Libxc

Libxc - a library of exchange-correlation functionals for density-functional theory

shell
tar -xvf libxc-7.0.0.tar.bz2
cd libxc-7.0.0
cmake -H. -Bobjdir -DCMAKE_C_COMPILER=mpiicx -DCMAKE_INSTALL_PREFIX=/opt/libxc-7.0.0-install
cd objdir && make -j 48
sudo make install

Cereal

Download the cereal source code from https://github.com/USCiLab/cereal/releases.

shell
sudo tar -xzvf cereal-1.3.2.tar.gz -C /opt/

With the oneAPI 2025.1 compilers and later, the cereal source must be patched by deleting two occurrences of template. See the issue "[Compile] Build failed with LibRI via OneAPI 2025.1 · Issue #6190 · deepmodeling/abacus-develop".

Installing computational software

LibRI&LibComm

https://github.com/abacusmodeling/LibRI.git

https://github.com/abacusmodeling/LibComm.git

ABACUS

git clone -o fish https://github.com/Fisherd99/abacus-BSE.git

shell
# Template (replace the <...> placeholders):
# cmake -B build -DELPA_DIR=<ELPA install dir> -DCMAKE_INSTALL_PREFIX=<ABACUS install dir> -DENABLE_DEEPKS=1 -DENABLE_LIBRI=ON -DTorch_DIR=<Torch dir> -Dlibnpy_INCLUDE_DIR=<libnpy dir> -DLibxc_DIR=<libxc dir> -DCEREAL_INCLUDE_DIR=<parent folder of cereal/cereal.hpp>
# Concrete example:
cmake -B build -DELPA_DIR=/opt/elpa_2025.06.001-install -DLibxc_DIR=/opt/libxc-7.0.0-install -DCEREAL_INCLUDE_DIR=/opt/cereal-1.3.2/include -DLIBRI_DIR=$HOME/deepmodeling/LibRI -DLIBCOMM_DIR=$HOME/deepmodeling/LibComm -DGTEST_DIR=$HOME/Downloads/googletest-1.17.0/install -DBUILD_TESTING=ON -DDEBUG_INFO=ON

# Debug build:
cmake -B build_debug -DELPA_DIR=/opt/elpa_2025.06.001-install -DLibxc_DIR=/opt/libxc-7.0.0-install -DCEREAL_INCLUDE_DIR=/opt/cereal-1.3.2/include -DLIBRI_DIR=$HOME/deepmodeling/LibRI -DLIBCOMM_DIR=$HOME/deepmodeling/LibComm -DCMAKE_BUILD_TYPE=Debug
# ------------------------------
cmake --build build -j`nproc`
cmake --build build_debug -j`nproc`
cmake --install build

pyatb

First install eigen-3.4.0

shell
tar -xvf eigen-3.4.0.tar.bz2
cd eigen-3.4.0
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/Downloads/eigen-3.4.0/install
make install

Then install pyatb

shell
conda install pybind11 mpi4py
git clone https://github.com/pyatb/pyatb.git
cd pyatb


Edit setup.py so that the libraries and include paths match the local MKL and Eigen installs

python
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'm', 'dl', 'pthread']

include_dirs = [
    os.path.join("src", "cpp", "core"),
    os.path.join("src", "cpp", "interface_python"),
    os.path.join("eigen"),
    os.path.join("/home","fisherd","Downloads","eigen-3.4.0","install","include","eigen3"),
    os.path.join("/opt","intel","oneapi","mkl","latest","include"),
]

Then build and install

shell
pip install ./

The gcc and MPI libraries that pyatb needs easily conflict with the ones in the conda environment, so add the following to .bashrc

shell
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6:$LD_PRELOAD
export LD_PRELOAD=/opt/intel/oneapi/mpi/2021.16/lib/libmpi.so.12:$LD_PRELOAD
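
When `LD_PRELOAD` starts out empty, the two exports above leave a trailing `:` in the value. A small helper avoids that; this is a sketch only, the name `prepend_to` is hypothetical, and it is demonstrated on a scratch variable so that nothing is actually preloaded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend a value to a colon-separated variable without
# leaving an empty trailing entry when the variable was unset.
# $1 = variable name, $2 = value to prepend.
prepend_to() {
    local name=$1 value=$2
    local current=${!name-}          # indirect expansion; empty if unset
    if [ -n "$current" ]; then
        printf -v "$name" '%s:%s' "$value" "$current"
    else
        printf -v "$name" '%s' "$value"
    fi
    export "$name"
}

# Demo on a scratch variable, using the two library paths from the notes.
unset DEMO_PRELOAD
prepend_to DEMO_PRELOAD /usr/lib/x86_64-linux-gnu/libstdc++.so.6
prepend_to DEMO_PRELOAD /opt/intel/oneapi/mpi/2021.16/lib/libmpi.so.12
echo "$DEMO_PRELOAD"
```

In .bashrc the same calls would name `LD_PRELOAD` instead of `DEMO_PRELOAD`, keeping the order of the original exports (libmpi first, then libstdc++).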

Released under the MIT License.