Gaussian 16 can use NVIDIA K40, K80, P100 (Rev. B.01) and V100 (Rev. C.01) GPUs under Linux. Earlier GPUs do not have the computational capabilities or memory size to run the algorithms in Gaussian 16.
Allocating Memory for Jobs
Allocating sufficient amounts of memory to jobs is even more important when using GPUs than for CPUs, since larger batches of work must be done at the same time in order to use the GPUs efficiently. The K40 and K80 units can have up to 16 GB of memory. Typically, most of this should be made available to Gaussian. Giving Gaussian 8-9 GB works well when there is 12 GB total on each GPU; similarly, allocating Gaussian 11-12 GB is appropriate for a 16 GB GPU. In addition, at least an equal amount of memory must be available for each CPU thread which is controlling a GPU.
About Control CPUs
When using GPUs, each GPU must be controlled by a specific CPU. The controlling CPU should be as physically close as possible to the GPU it is controlling. GPUs cannot share controlling CPUs. Note that CPUs used as GPU controllers do not participate as compute nodes during the parts of the calculation that are GPU-parallel.
The hardware arrangement on a system with GPUs can be checked using the nvidia-smi utility. For example, this output is for a machine with two 16-core Haswell CPU chips and four K80 boards, each of which has two GPUs:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity GPU0 X PIX SOC SOC SOC SOC SOC SOC 0-15 cores on first chip GPU1 PIX X SOC SOC SOC SOC SOC SOC 0-15 GPU2 SOC SOC X PIX PHB PHB PHB PHB 16-31 cores on second chip GPU3 SOC SOC PIX X PHB PHB PHB PHB 16-31 GPU4 SOC SOC PHB PHB X PIX PXB PXB 16-31 GPU5 SOC SOC PHB PHB PIX X PXB PXB 16-31 GPU6 SOC SOC PHB PHB PXB PXB X PIX 16-31 GPU7 SOC SOC PHB PHB PXB PXB PIX X 16-31
The important part of this output is the CPU affinity. This example shows that GPUs 0 and 1 (on the first K80 card) are connected to the CPUs on chip 0 while GPUs 2-7 (on the other two K80 cards) are connected to the CPUs on chip 1.
Specifying GPUs & Control CPUs for a Gaussian Job
The GPUs to use for a calculation and their controlling CPUs are specified with the %GPUCPU Link 0 command. This command takes one parameter:
where gpu-list is a comma-separated list of GPU numbers, possibly including numerical ranges (e.g., 0-4,6), and control-cpus is a similarly-formatted list of controlling CPU numbers. The corresponding items in the two lists are the GPU and its controlling CPU.
For example, on a 32-processor system with 6 GPUs, a job which uses all the CPUs—26 CPUs serving solely as compute nodes and 6 CPUs used for controlling GPUs—would use the following Link 0 commands:
%CPU=0-31 Control CPUs are included in this list. %GPUCPU=0,1,2,3,4,5=0,1,16,17,18,19
These command state that CPUs 0-31 will be used in the job. GPUs 0 through 5 will be used, with GPU0 controlled by CPU 0, GPU1 controlled by CPU 1, GPU2 controlled by CPU 16, GPU3 controlled by CPU 17, and so on. Note that the controlling CPUs are included in %CPU.
In the preceding example, the GPU and CPU lists could be expressed more tersely as:
Normally one uses consecutive processors in the obvious way, but things can be associated differently in special cases. For example, suppose the same machine already had a job using 6 CPUs, running with %CPU=16-21. Then, in order to use the other 26 CPUs with 6 controlling GPUs, you would specify:
This job would use a total of 26 processors, employing 20 of them solely for computation, along with the six GPUs controlled by CPUs 0, 1, 22, 23, 24 and 25 (respectively).
In [REV B], the lists of CPUs and GPUs are both sorted and then matched up. This ensures that the the lowest numbered threads are executed on CPUs that have GPUs. Doing so ensures that if a part of a calculation has to reduce the number of processors used (i.e., because of memory limitations), it will preferentially use/retain the threads with GPUs (since it removes threads in reverse order).
GPUs and Overall Job Performance
GPUs are effective for larger molecules when doing DFT energies, gradients and frequencies (for both ground and excited states), but they are not effective for small jobs. They are also not used effectively by post-SCF calculations such as MP2 or CCSD.
Each GPU is several times faster than a CPU. However, on modern machines, there are typically many more CPUs than GPUs. The best performance comes fromm using all the CPUs as well as the GPUs.
In some circumstances, the potential speedup from GPUs can be limited because many CPUs are also used effectively by Gaussian 16. For example, if the GPU is 5x faster than a CPU, then the speedup of using the GPU versus the CPU alone would be 5x. However, the potential speedup resulting from using GPUs on a larger computer with 32 CPUs and 8 GPUs is 2x:
Without GPUs: 32*1 = 32
With GPUs: (24*1) + (8*5) = 64 Remember that control CPUs are not used for computation.
Speedup: 64/32 = 2
Note that this analysis assumes that the GPU-parallel portion of the calculation dominates the total execution time.
Allocation of memory. GPUs can have up to 16 GB of memory. One typically tries to make most of this available to Gaussian. Be aware that there must be at least an equal amount of memory given to the CPU thread running each GPU as is allocated for computation. Using 8-9 GB works well on a 12 GB GPU, or 11-12 GB on a 16 GB GPU (reserving some memory for the system). Since Gaussian gives equal shares of memory to each thread, this means that the total memory allocated should be the number of threads times the memory required to use a GPU efficiently. For example, when using 4 CPUs and 2 GPUs each with 16 GB of memory, you should use 4 × 12 GB of total memory. For example:
%Mem=48GB %CPU=0-3 %GPUCPU=0-1=0,2
You will need to analyze the characteristics of your own environment carefully when making decisions about which processors and GPUs to use and how much memory to allocate.
GPUs in a Cluster
GPUs on nodes in a cluster can be used. Since the %CPU and %GPUCPU specifications are applied to each node in the cluster, the nodes must have identical configurations (number of GPUs and their affinity to CPUs); since most clusters are collections of identical nodes, this restriction is not usually a problem.