New Modeling Capabilities
- TD-DFT analytic second derivatives for predicting vibrational frequencies/IR and Raman spectra and performing transition state optimizations and IRC calculations for excited states.
- EOMCC analytic gradients for performing geometry optimizations.
- Anharmonic vibrational analysis for VCD and ROA spectra: see Freq=Anharmonic.
- Vibronic spectra and intensities: see Freq=FCHT and related options.
- Resonance Raman spectra: see Freq=ReadFCHT.
- New DFT functionals: M08HX, MN15, MN15L.
- New double-hybrid methods: DSDPBEP86, PBE0DH and PBEQIDH.
- PM7 semi-empirical method.
- Ciofini excited state charge transfer diagnostic: see Pop=DCT.
- The EOMCC solvation interaction models of Caricato: see SCRF=PTED.
- Generalized internal coordinates, a facility which allows arbitrary redundant internal coordinates to be defined and used for optimization constraints and other purposes. See Geom=GIC and GIC Info.
- NVIDIA K40 and K80 GPUs are supported under Linux for Hartree-Fock and DFT calculations. See the Using GPUs tab for details.
- Parallel performance on larger numbers of processors has been improved. See the Parallel Performance tab for information about how to get optimal performance on multiple CPUs and clusters.
- Gaussian 16 uses an optimized memory algorithm to avoid I/O during CCSD iterations.
- There are several enhancements to the GEDIIS optimization algorithm.
- CASSCF improvements for active spaces ≥ (10,10) increase performance and make active spaces of up to 16 orbitals feasible (depending on the molecular system).
- Significant speedup of the core correlation energies for W1 compound model.
- Gaussian 16 incorporates algorithmic improvements for significant speedup of the diagonal, second-order self-energy approximation (D2) component of composite electron propagator (CEP) methods as described in [DiazTinoco16]. See EPT.
- Tools for interfacing Gaussian with other programs, both in compiled languages such as Fortran and C and with interpreted languages such as Python and Perl. Refer to Interfacing to Gaussian 16 for details.
- Parameters specified in Link 0 (%) input lines and/or in a Default.Route file can now also be specified via either command-line
arguments or environment variables. See the Link 0 Equivalences tab for details.
- Compute the force constants are every nth step of a geometry optimization: see Opt=Recalc.
The following calculation defaults are different in Gaussian 16:
- Integral accuracy is 10-12 rather than 10-10 in Gaussian 09.
- The default DFT grid for general use is UltraFine rather than FineGrid in G09; the default grid for CPHF is SG1 rather than CoarseGrid. See the discussion of the Integral keyword for details.
- SCRF defaults to the symmetric form of IEFPCM [Lipparini10] (not present in Gaussian 09) rather than the non-symmetric version.
- Physical constants use the 2010 values rather than the 2006 values in Gaussian 09.
The first two items were changed to ensure accuracy in several new calculation types (e.g., TD-DFT frequencies, anharmonic ROA). For these reasons, Integral=(UltraFine,Acc2E=12) was made the default. Using these settings generally improve the reliability of calculations involving numerical integration, e.g., DFT optimizations in solution. There is a modest increase in the CPU requirements for these options compared to the Gaussian 09 defaults of Integral=(FineGrid,Acc2E=10).
The G09Defaults keyword sets all four of these defaults back to the Gaussian 09 values. It is provided for compatibility with previous calculations, but the new defaults are strongly recommended for new studies.
Default Memory Use
Gaussian 16 defaults memory usage to %Mem=100MW (800MB). Even larger values are appropriate for calculations on larger molecules and when using many processors; refer to the Parallel Jobs tab for details.
TDDFT frequency calculations compute second derivatives analytically by default, since these are much faster than the numerical derivatives (the only choice in Gaussian 09).
Gaussian 16 can use NVIDIA K40, K80, P100 (Rev. B.01) and V100 (Rev. C.01) GPUs under Linux. Earlier GPUs do not have the computational capabilities or memory size to run the algorithms in Gaussian 16.
Allocating Memory for Jobs
Allocating sufficient amounts of memory to jobs is even more important when using GPUs than for CPUs, since larger batches of work must be done at the same time in order to use the GPUs efficiently. The K40 and K80 units can have up to 16 GB of memory. Typically, most of this should be made available to Gaussian. Giving Gaussian 8-9 GB works well when there is 12 GB total on each GPU; similarly, allocating Gaussian 11-12 GB is appropriate for a 16 GB GPU. In addition, at least an equal amount of memory must be available for each CPU thread which is controlling a GPU.
About Control CPUs
When using GPUs, each GPU must be controlled by a specific CPU. The controlling CPU should be as physically close as possible to the GPU it is controlling. GPUs cannot share controlling CPUs. Note that CPUs used as GPU controllers do not participate as compute nodes during the parts of the calculation that are GPU-parallel.
The hardware arrangement on a system with GPUs can be checked using the nvidia-smi utility. For example, this output is for a machine with two 16-core Haswell CPU chips and four K80 boards, each of which has two GPUs:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity GPU0 X PIX SOC SOC SOC SOC SOC SOC 0-15 cores on first chip GPU1 PIX X SOC SOC SOC SOC SOC SOC 0-15 GPU2 SOC SOC X PIX PHB PHB PHB PHB 16-31 cores on second chip GPU3 SOC SOC PIX X PHB PHB PHB PHB 16-31 GPU4 SOC SOC PHB PHB X PIX PXB PXB 16-31 GPU5 SOC SOC PHB PHB PIX X PXB PXB 16-31 GPU6 SOC SOC PHB PHB PXB PXB X PIX 16-31 GPU7 SOC SOC PHB PHB PXB PXB PIX X 16-31
The important part of this output is the CPU affinity. This example shows that GPUs 0 and 1 (on the first K80 card) are connected to the CPUs on chip 0 while GPUs 2-7 (on the other two K80 cards) are connected to the CPUs on chip 1.
Specifying GPUs & Control CPUs for a Gaussian Job
The GPUs to use for a calculation and their controlling CPUs are specified with the %GPUCPU Link 0 command. This command takes one parameter:
where gpu-list is a comma-separated list of GPU numbers, possibly including numerical ranges (e.g., 0-4,6), and control-cpus is a similarly-formatted list of controlling CPU numbers. The corresponding items in the two lists are the GPU and its controlling CPU.
For example, on a 32-processor system with 6 GPUs, a job which uses all the CPUs—26 CPUs serving solely as compute nodes and 6 CPUs used for controlling GPUs—would use the following Link 0 commands:
%CPU=0-31 Control CPUs are included in this list. %GPUCPU=0,1,2,3,4,5=0,1,16,17,18,19
These command state that CPUs 0-31 will be used in the job. GPUs 0 through 5 will be used, with GPU0 controlled by CPU 0, GPU1 controlled by CPU 1, GPU2 controlled by CPU 16, GPU3 controlled by CPU 17, and so on. Note that the controlling CPUs are included in %CPU.
In the preceding example, the GPU and CPU lists could be expressed more tersely as:
Normally one uses consecutive processors in the obvious way, but things can be associated differently in special cases. For example, suppose the same machine already had a job using 6 CPUs, running with %CPU=16-21. Then, in order to use the other 26 CPUs with 6 controlling GPUs, you would specify:
This job would use a total of 26 processors, employing 20 of them solely for computation, along with the six GPUs controlled by CPUs 0, 1, 22, 23, 24 and 25 (respectively).
In [REV B], the lists of CPUs and GPUs are both sorted and then matched up. This ensures that the the lowest numbered threads are executed on CPUs that have GPUs. Doing so ensures that if a part of a calculation has to reduce the number of processors used (i.e., because of memory limitations), it will preferentially use/retain the threads with GPUs (since it removes threads in reverse order).
GPUs and Overall Job Performance
GPUs are effective for larger molecules when doing DFT energies, gradients and frequencies (for both ground and excited states), but they are not effective for small jobs. They are also not used effectively by post-SCF calculations such as MP2 or CCSD.
Each GPU is several times faster than a CPU. However, on modern machines, there are typically many more CPUs than GPUs. The best performance comes fromm using all the CPUs as well as the GPUs.
In some circumstances, the potential speedup from GPUs can be limited because many CPUs are also used effectively by Gaussian 16. For example, if the GPU is 5x faster than a CPU, then the speedup of using the GPU versus the CPU alone would be 5x. However, the potential speedup resulting from using GPUs on a larger computer with 32 CPUs and 8 GPUs is 2x:
Without GPUs: 32*1 = 32
With GPUs: (24*1) + (8*5) = 64 Remember that control CPUs are not used for computation.
Speedup: 64/32 = 2
Note that this analysis assumes that the GPU-parallel portion of the calculation dominates the total execution time.
Allocation of memory. GPUs can have up to 16 GB of memory. One typically tries to make most of this available to Gaussian. Be aware that there must be at least an equal amount of memory given to the CPU thread running each GPU as is allocated for computation. Using 8-9 GB works well on a 12 GB GPU, or 11-12 GB on a 16 GB GPU (reserving some memory for the system). Since Gaussian gives equal shares of memory to each thread, this means that the total memory allocated should be the number of threads times the memory required to use a GPU efficiently. For example, when using 4 CPUs and 2 GPUs each with 16 GB of memory, you should use 4 × 12 GB of total memory. For example:
%Mem=48GB %CPU=0-3 %GPUCPU=0-1=0,2
You will need to analyze the characteristics of your own environment carefully when making decisions about which processors and GPUs to use and how much memory to allocate.
GPUs in a Cluster
GPUs on nodes in a cluster can be used. Since the %CPU and %GPUCPU specifications are applied to each node in the cluster, the nodes must have identical configurations (number of GPUs and their affinity to CPUs); since most clusters are collections of identical nodes, this restriction is not usually a problem.
Parallel Usage and Performance Notes
Memory allocation. Calculations involving larger molecules and basis sets benefit from larger memory allocations. 4 GB or more per processor is recommended for calculations involving 50 or more atoms and/or 500 or more basis functions. The freqmem utility estimates the optimal memory size per thread for ground-state frequency calculations, and the same value is reasonable for excited-state frequencies and is more than sufficient for ground and excited state optimizations.
The amount of memory allowed should rise with the number of processors: if 4 GB is reasonable for one processor, then the same job using 8 CPUs would run well in 32 GB. Of course, there may be limitations to smaller values imposed by the particular hardware, but scaling memory linearly with number of CPUs should be the goal. In particular, increasing only the number of CPUs with fixed memory size is unlikely to lead to good performance when using large numbers of processors.
For large frequency calculations and for large CCSD and EOM-CCSD energies, it is also desirable to leave enough memory to buffer the large disk files involved. Therefore, a Gaussian job should only be given 50-70% of the total memory on the system. For example, on a machine with a total of 128 GB, one should typically give 64-80 GB to a job which was using all the CPUs, and leave the remaining memory for the operating system to use as disk cache.
Pinning threads to CPUs. Efficiency is lost when threads are moved from one CPU to another, thereby invalidating the cache and causing other overhead. On most machines, Gaussian can tie threads to specific CPUs, and this is the recommended mode of operation, especially when using larger numbers of processors. The %CPU Link 0 line specifies the numbers of specific CPUs to be used. Thus, on a machine with one 8-core chip, one should use %CPU=0‑7 rather than %NProc=8 because the former ties the first thread to CPU 0, the next to CPU 1, etc.
On some older Intel processors (Nehalem and before), there is not enough memory bandwidth to keep all the CPUs on a chip busy, and it is often preferable to use half the CPUs, each with twice as much memory as if all were used. For example, on such a machine with four 12-core chips and 128 GB of memory, with CPUs 0-11 on the first chip, 12-23 on the second, and so on, it is better to run using 24 processors (6 on each chip) and give them 72 GB/24 procs = 3 GB memory each, rather than use all 48 with only 1.5 GB of memory each. The required input directives would be:
where the /2 means to use every other core: i.e., cores 0, 2, 4, 6, 8, and 10 (on chip 0), 12, 14, 16, 18, 20, and 22 (on chip 1), etc.
With the most recent generations of Intel processors (Haswell and later), the memory bandwidth is better and using all the cores on each chip works well.
As long as sufficient memory is available and threads are tied to specific cores, then parallel efficiency on large molecules is good up to 64 or more cores.
Disable hyperthreading. Hyperthreading is not useful for Gaussian since it effectively divides resources such as memory bandwidth among threads on the same physical CPU. If hyperthreading cannot be turned off, Gaussian jobs should use only one hyperthread on each physical CPU. Under Linux, hyperthreads on different processors are grouped together. That is, if a machine has 2 chips each with 8 cores and 3-way hyperthreading, then “CPUs” 0-7 are across the 8 cores on chip 0, 8-15 are across the 8 cores on chip 1, and 16-23 are the second hyperthreads on the 8 cores of chip 0, and so on. So a job would run best with %CPU=0‑15.
Under AIX, hyperthreads are grouped together with up 8 hyperthread numbers for each CPU even if fewer hyperthreads are in use, so with two 8 core chips and 4-way hyperthreading, “CPUs” 0-3 are all on core 0 of chip 0, 8-11 are on core 1 of chip 0, etc. Thus, one would want to use %CPU=0‑127/8 to select “CPUs” 0, 8, 16, … which are each using a distinct core.
Cluster (Linda) parallelism
Availability. Hartree-Fock and DFT energies, gradients and frequencies run in parallel across clusters, as do MP2 energies and gradients. MP2 frequencies, CCSD, and EOM-CCSD energies and optimizations are SMP parallel but not cluster parallel. Numerical derivatives, such as DFT anharmonic frequencies and CCSD frequencies, are parallelized across nodes of a cluster by doing a complete gradient or second derivative calculation on each node, splitting the directions of differentiation across workers in the cluster.
Combining with MP parallelism. Shared-memory and cluster parallelism can be combined. Generally, one uses shared-memory parallelism across all CPUs in each node of the cluster. Note that %CPU and %Mem apply to each node of the cluster. Thus, if one has 3 nodes names apple, banana and cherry, each with two chips which have 8 CPUs each, then one might specify:
%Mem=64GB %CPU=0-15 %LindaWorkers=apple,banana,cherry # B3LYP/6-311+G(2d,p) Freq …
This would run 16 threads, each pinned to a CPU, on each of the 3 nodes, giving 4 GB to each of the 48 threads.
For the special case of numerical differentiation only—e.g., Freq=Anharm, CCSD Freq, etc.—one extra worker is used to collect the results. So these jobs should be run with two workers on the master node (where Gaussian 16 is started). For the above example if the job was computing anharmonic frequencies, then one would use:
%Mem=64GB %CPU=0-15 %LindaWorkers=apple:2,banana,cherry # B3LYP/6-311+G(2d,p) Freq=Anharm …
where Gaussian 16 is assumed to be started on node apple. This will start 2 workers on node apple, one of which just collects results, and will do the computational work using the other worker on apple and those on banana and cherry.
Memory requirements for CCSD, CCSD(T) and EOM-CCSD calculations
These calculations can use memory to avoid I/O and will run much more efficiently if they are allowed enough memory to store the amplitudes and product vectors in memory. If there are NO active occupied orbitals (NOA in the output) and NV virtual orbitals (NVB in the output) then approximately 9NO2NV2 words of memory are required. This does not depend on the number of processors used.
Most options that control how Gaussian 16 operates can be specified in any of 4 ways. From highest to lowest precedence these are:
- As Link 0 input (%-lines): This is the usual method to control a specific job and the only way to control a specific step within a multi-step input file. Example: %CPU=1,2,3,4
- As options on the command line: This is useful when one wants aliases or other shortcuts for different common ways of running the program. Example: g16 -c="1,2,3,4" …
- As environment variables: This is most useful in standard scripts, for example for generating and submitting jobs to batch queuing systems. Example: export GAUSS_CDEF="1,2,3,4"
- As directives in the Default.Route file: This is most useful when one wants to change the program defaults for all jobs. Example: -C- 1,2,3,4
When searching for a Default.Route file the current default directory is checked first, followed by the directories in the path for Gaussian 16 executables: environment variable GAUSS_EXEDIR, which normally points to $g16root/g16.
The following table lists the most important Link 0 commands and their equivalences:
|Default.Route||Link 0||Option||Env. Var.||Description|
|Gaussian 16 execution defaults|
|-R-||-r||GAUSS_RDEF||Route section keyword list.|
|-M-||%Mem||-m||GAUSS_MDEF||Memory amount for Gaussian jobs.|
|-C-||%CPU||-c||GAUSS_CDEF||Processor/core list for multiprocessor parallel jobs.|
|-G-||%GPUCPU||-g||GAUSS_GDEF||GPUs=Cores list for GPU parallel jobs.|
|-S-||%UseSSH||-s||GAUSS_SDEF||Program to start workers for network parallel jobs: rsh or ssh.|
|-W-||%LindaWorkers||-w||GAUSS_WDEF||List of hostnames for network parallel jobs.|
|-P-||%NProcShared||-p||GAUSS_PDEF||#processors/cores for multiprocessor parallel jobs.
Deprecated; use -C-.
|-L-||%NProcLinda||-l||GAUSS_LDEF||#nodes for network parallel jobs. Deprecated; use -W-.|
|Archive entry data|
|-O-||GAUSS_ODEF||Organization (site) name.|
|Utility program defaults|
|-F-||GAUSS_FDEF||Options for the formchk utility.|
|-U-||GAUSS_UDEF||Memory amount for utilities.|
|Parameters for scripts and external programs|
|# section||-x||GAUSS_XDEF||Complete route for the job (route not read from input file).|
Note that the quotation marks are normally required around the specified value for the command line and environment variables to avoid modification of the parameter string by the shell.
Last updated: 3 March 2017