Slow Training of RL Agent on HPC Compared to Local Machine
I am currently running a MATLAB R2021a script (execute.m, attached for reference) to train a reinforcement learning (RL) agent in Simulink to control a drone. On my local machine the training connects to 6 workers, and the training speed is much higher than on the HPC, where it connects to 12 workers. I have ensured that the whole node, with 28 cores in total, is assigned to the job.
Here is the SLURM script:
#!/bin/bash -l
#SBATCH -J MATLAB_Execute # Job name
#SBATCH -N 1 # Number of nodes
#SBATCH -n 1 # Number of tasks (1 instance of the program)
#SBATCH -c 28 # Number of CPU cores per node
#SBATCH --gres=gpu:0 # Number of GPUs per node (none requested)
#SBATCH --time=1:00:00 # Time limit (1 hour)
#SBATCH -p batch -C skylake # Partition (batch) and node-type constraint (Skylake)
export JAVA_LOG_DIR=/scratch/users/gshetty/java_logs
mkdir -p $JAVA_LOG_DIR
# Load the MATLAB module
module load math/MATLAB/2021a
module load openssl/1.1.1k
export LD_PRELOAD=/usr/lib64/libcrypto.so.1.1
# Run the MATLAB script
srun matlab -nodisplay -nosplash -r execute -logfile execute.out
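For context, the parallel training in execute.m is set up roughly along these lines (a simplified sketch, not the actual attachment; agent, env, and the option values are placeholders):
% Open a pool using the default 'local' cluster profile
parpool;
% Enable parallel episode simulation in the RL training options
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 500, ...   % placeholder value
    'UseParallel', true);
% agent and env are created earlier in the script
trainStats = train(agent, env, trainOpts);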
What can be the potential reason?
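For completeness, here is a quick check that can be run in the MATLAB session on both machines to compare what each one actually sees (all standard functions, no assumptions about execute.m):
% CPUs SLURM granted to the job (should print 28 on the HPC node)
getenv('SLURM_CPUS_PER_TASK')
% Computational threads available to MATLAB
maxNumCompThreads
% Size of the currently open parallel pool, if any
p = gcp('nocreate');
if ~isempty(p), disp(p.NumWorkers), end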
Tags: parallel computing toolbox, simulink, reinforcement learning, hpc