Python API
Train module
This module contains all routines for training GDML and sGDML models.

class sgdml.train.GDMLTrain(max_memory=None, max_processes=None, use_torch=False)
Bases: object

_assemble_kernel_mat(R_desc, R_d_desc, tril_perms_lin, sig, desc, use_E_cstr=False, col_idxs=slice(None, None, None), alloc_extra_rows=0, callback=None)
Compute force field kernel matrix.
The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).
 Parameters
R_desc (numpy.ndarray) – Array containing the descriptor for each training point.
R_d_desc (numpy.ndarray) – Array containing the gradient of the descriptor for each training point.
tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
sig (int) – Hyperparameter \(\sigma\) (kernel length scale).
use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.
callback (callable, optional) – Kernel assembly progress function that takes three arguments:
 current (int) – Current progress (number of completed entries).
 total (int) – Task size (total number of entries to create).
 done_str (str, optional) – Once complete, this string contains the time it took to assemble the kernel (seconds).
cols_m_limit (int, optional (DEPRECATED)) – Only generate the columns up to index ‘cols_m_limit’. This creates a M*3N x cols_m_limit*3N kernel matrix, instead of M*3N x M*3N.
cols_3n_keep_idxs (numpy.ndarray, optional) – Only generate columns with the given indices in the 3N x 3N kernel function. The resulting kernel matrix will have dimension M*3N x M*len(cols_3n_keep_idxs).
 Returns
Force field kernel matrix.
 Return type
numpy.ndarray
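The matrix dimensions stated in the parameter descriptions above can be sanity-checked with a few lines of arithmetic. The helper below is only an illustrative sketch: `kernel_shape` and its argument names are hypothetical and not part of the library, and it simply restates the M*3N formulas from the text.

```python
def kernel_shape(n_train, n_atoms, cols_m_limit=None, n_cols_3n_keep=None):
    """Return (rows, cols) of the force field kernel matrix.

    n_train is M and n_atoms is N in the notation used above.
    This is a hypothetical helper, not a function of the library.
    """
    dim_i = 3 * n_atoms      # 3N force components per training point
    rows = n_train * dim_i   # M*3N
    if cols_m_limit is not None:
        # Deprecated option: only generate columns up to index cols_m_limit.
        return rows, cols_m_limit * dim_i
    if n_cols_3n_keep is not None:
        # Keep only the selected columns of each 3N x 3N block.
        return rows, n_train * n_cols_3n_keep
    return rows, rows


print(kernel_shape(1000, 9))                      # (27000, 27000)
print(kernel_shape(1000, 9, cols_m_limit=100))    # (27000, 2700)
print(kernel_shape(1000, 9, n_cols_3n_keep=5))    # (27000, 5000)
```

For 1000 training points of a 9-atom molecule, the full matrix already has 27000 x 27000 entries, which is why the column-restricting options can matter for memory.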

_recov_int_const(model, task, R_desc=None, R_d_desc=None)
Estimate the integration constant for a force field model.
The offset between the energies predicted for the original training data and the true energy labels is computed in the least-squares sense. Furthermore, common issues with user-provided datasets are self-diagnosed here.
 Parameters
model (dict) – Data structure of custom type model.
task (dict) – Data structure of custom type task.
R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.
R_d_desc (numpy.ndarray, optional) – An array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.
 Returns
Estimate for the integration constant.
 Return type
float
 Raises
ValueError – If the sign of the force labels in the dataset from which the model emerged is switched (e.g. gradients instead of forces).
ValueError – If inconsistent/corrupted energy labels are detected in the provided dataset.
ValueError – If potentially inconsistent scales in energy vs. force labels are detected in the provided dataset.
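To illustrate what "in the least-squares sense" means for a constant offset, here is a minimal sketch. It assumes the predicted energies are only missing an additive constant; `estimate_offset` is a hypothetical name, and the library's own routine additionally performs the dataset diagnostics listed above, which are not reproduced here.

```python
def estimate_offset(e_true, e_pred):
    """Least-squares estimate of a constant c such that e_pred + c ~ e_true.

    Hypothetical helper for illustration only.
    """
    if len(e_true) != len(e_pred) or not e_true:
        raise ValueError("need two non-empty sequences of equal length")
    # Minimizing sum((e_true - e_pred - c)**2) over c yields the mean residual.
    return sum(t - p for t, p in zip(e_true, e_pred)) / len(e_true)


e_true = [-10.0, -9.5, -9.0]   # reference energy labels
e_pred = [0.1, 0.4, 1.0]       # predictions up to an unknown constant
c = estimate_offset(e_true, e_pred)
print(round(c, 6))
```

The mean residual is the closed-form minimizer because the objective is quadratic in a single unknown.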

create_model(task, solver, R_desc, R_d_desc, tril_perms_lin, std, alphas_F, alphas_E=None)
Create a data structure of custom type model.
These data structures contain the trained model and everything that is needed to generate predictions for new inputs.
Each model also contains the MD5 fingerprints of the used datasets.
 Parameters
task (dict) – Data structure of custom type task from which the model emerged.
solver (str) – Identifier string for the solver that has been used to train this model.
R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.
R_d_desc (numpy.ndarray, optional) – An array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.
tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
std (float) – Standard deviation of the training labels.
alphas_F (numpy.ndarray) – A 1D array of size 3NM containing the linear coefficients that correspond to the force constraints.
alphas_E (numpy.ndarray, optional) – A 1D array of size N containing the linear coefficients that correspond to the energy constraints.
 Returns
Data structure of custom type model.
 Return type
dict

create_task(train_dataset, n_train, valid_dataset, n_valid, sig, lam=1e-10, perms=None, use_sym=True, use_E=True, use_E_cstr=False, callback=None)
Create a data structure of custom type task.
These data structures serve as recipes for model creation, summarizing the configuration of one particular training run. Training and test points are sampled from the provided dataset, without replacement. If the same dataset is given for training and testing, the subsets are drawn without overlap.
Each task also contains a choice for the hyperparameters of the training process and the MD5 fingerprints of the used datasets.
 Parameters
train_dataset (dict) – Data structure of custom type dataset containing the training dataset.
n_train (int) – Number of training points to sample.
valid_dataset (dict) – Data structure of custom type dataset containing the validation dataset.
n_valid (int) – Number of validation points to sample.
sig (int) – Hyperparameter \(\sigma\) (kernel length scale).
lam (float, optional) – Hyperparameter \(\lambda\) (regularization strength).
perms (numpy.ndarray, optional) – A 2D array of size P x N containing P possible permutations of the N atoms in the system. This argument takes priority over the ones provided in the training dataset. No automatic discovery is run when this argument is provided.
use_sym (bool, optional) – True: include symmetries (sGDML), False: GDML.
use_E (bool, optional) – True: reconstruct force field with corresponding potential energy surface, False: ignore energy during training, even if energy labels are available in the dataset. The trained model will still be able to predict energies up to an unknown integration constant. Note that the accuracy of the energy predictions will be untested.
use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML.
callback (callable, optional) – Progress callback function that takes three arguments:
 current (int) – Current progress.
 total (int) – Task size.
 done_str (str, optional) – Once complete, this string is shown.
 Returns
Data structure of custom type task.
 Return type
dict
 Raises
ValueError – If a reconstruction of the potential energy surface is requested, but the energy labels are missing in the dataset.
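A progress callback matching the three documented arguments (current, total and an optional done_str) might look like the sketch below. The factory name `make_progress_callback` and the line format are made up for illustration; how often and with which values the library invokes the callback is not specified here and is assumed.

```python
def make_progress_callback(log):
    """Return a callback that appends one human-readable line per call to `log`."""

    def callback(current, total, done_str=None):
        percent = 100.0 * current / total if total else 100.0
        line = "[{:5.1f}%] {}/{}".format(percent, current, total)
        if done_str is not None:
            # done_str is only passed once the task has completed.
            line += " " + done_str
        log.append(line)

    return callback


log = []
cb = make_progress_callback(log)
cb(5, 10)
cb(10, 10, done_str="done")
print("\n".join(log))
```

Such a function could then be passed as the `callback` argument; collecting lines in a list instead of printing directly keeps the callback easy to test.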

create_task_from_model(model, dataset)
Create a data structure of custom type task from an existing structure of custom type model. This method is used to resume training of unconverged models.
Any hyperparameter (including all symmetry permutations) in the provided model file is reused without further optimization. The current linear coefficients are used as the starting point for the iterative training procedure.
 Parameters
model (dict) – Data structure of custom type model based on which to create the training task.
dataset (dict) – Data structure of custom type dataset containing the original dataset from which the provided model emerged.
 Returns
Data structure of custom type task.
 Return type
dict

draw_strat_sample(T, n, excl_idxs=None)
Draw sample from dataset that preserves its original distribution.
The distribution is estimated from a histogram where the bin size is determined using the Freedman-Diaconis rule. This rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. A reduced histogram is then constructed by sampling uniformly in each bin. It is intended to populate all bins with at least one sample in the reduced histogram, even for small training sizes.
 Parameters
T (numpy.ndarray) – Dataset to sample from.
n (int) – Number of examples.
excl_idxs (numpy.ndarray, optional) – Array of indices to exclude from sample.
 Returns
Array of indices that form the sample.
 Return type
numpy.ndarray
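The sampling idea described above can be sketched in plain Python. This is not the library's implementation: the function names, the proportional quota scheme and the `seed` argument are assumptions chosen for illustration. Only the Freedman-Diaconis bin width, h = 2 * IQR / n^(1/3), and the uniform draw inside each bin follow directly from the description.

```python
import random
import statistics


def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR / n**(1/3) (Freedman-Diaconis rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2.0 * (q3 - q1) / len(values) ** (1.0 / 3.0)


def draw_strat_sample_sketch(values, n, excl_idxs=(), seed=None):
    """Return n sorted indices into `values` whose histogram resembles the original."""
    rng = random.Random(seed)
    excluded = set(excl_idxs)
    candidates = [i for i in range(len(values)) if i not in excluded]
    if n > len(candidates):
        raise ValueError("not enough points left to sample from")

    width = freedman_diaconis_width(values)
    lo = min(values)
    bins = {}
    for i in candidates:
        # A single bin is used if the data has zero interquartile range.
        b = int((values[i] - lo) / width) if width > 0 else 0
        bins.setdefault(b, []).append(i)

    # Proportional quota per bin, aiming for at least one sample per bin.
    total = len(candidates)
    quota = {b: min(len(m), max(1, round(n * len(m) / total)))
             for b, m in bins.items()}
    # Repair rounding so that the quotas add up to exactly n
    # (this may empty some bins again if n is smaller than the bin count).
    while sum(quota.values()) > n:
        quota[max(quota, key=quota.get)] -= 1
    while sum(quota.values()) < n:
        spare = [b for b in bins if quota[b] < len(bins[b])]
        quota[max(spare, key=lambda b: len(bins[b]) - quota[b])] += 1

    sample = []
    for b, members in bins.items():
        sample.extend(rng.sample(members, quota[b]))  # uniform draw inside the bin
    return sorted(sample)


data_rng = random.Random(0)
energies = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
idxs = draw_strat_sample_sketch(energies, 20, excl_idxs=[0, 1, 2], seed=1)
print(len(idxs), "indices drawn")
```

Giving every populated bin a quota of at least one is what keeps rare regions of the distribution represented when n is small.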

train(task, save_progr_callback=None, callback=None)
Train a model based on a training task.
 Parameters
task (dict) – Data structure of custom type task.
desc_callback (callable, optional) – Descriptor and descriptor Jacobian generation status.
 current (int) – Current progress (number of completed descriptors).
 total (int) – Task size (total number of descriptors to create).
 done_str (str, optional) – Once complete, this string contains the time it took to complete this task (seconds).
ker_progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:
 current (int) – Current progress (number of completed entries).
 total (int) – Task size (total number of entries to create).
 done_str (str, optional) – Once complete, this string contains the time it took to assemble the kernel (seconds).
solve_callback (callable, optional) – Linear system solver status.
 done (bool) – False when solver starts, True when it finishes.
 done_str (str, optional) – Once done, this string contains the runtime of the solver (seconds).
 Returns
Data structure of custom type model.
 Return type
dict
 Raises
ValueError – If the provided dataset contains invalid lattice vectors.


sgdml.train._assemble_kernel_mat_wkr(j, tril_perms_lin, sig, use_E_cstr=False, exploit_sym=False, cols_m_limit=None)
Compute one row and column of the force field kernel matrix.
The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).
 Parameters
j (int) – Index of training point.
tril_perms_lin (numpy.ndarray) – 1D array (int) containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
sig (int) – Hyperparameter \(\sigma\).
use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.
exploit_sym (boolean, optional) – Do not create symmetric entries of the kernel matrix twice (this only works for specific inputs for cols_m_limit).
cols_m_limit (int, optional) – Limit the number of columns (include training points 1-M). Note that each training point consists of multiple columns.
 Returns
Number of kernel matrix blocks created, divided by 2 (symmetric blocks are always created together).
 Return type
int
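For orientation, the scalar Matérn kernel that the description refers to can be written down directly. The sketch below assumes that n = 2 corresponds to the smoothness parameter ν = n + 1/2 = 5/2 and evaluates only the kernel itself for a given distance d. The library works with the Hessian of this function with respect to the input geometries, which is not reproduced here, and `matern52` is a hypothetical name.

```python
import math


def matern52(d, sig):
    """Matérn kernel with nu = 5/2 for a distance d >= 0 and length scale sig.

    k(d) = (1 + sqrt(5) d / sig + 5 d^2 / (3 sig^2)) * exp(-sqrt(5) d / sig)
    """
    a = math.sqrt(5.0) * d / sig
    return (1.0 + a + a * a / 3.0) * math.exp(-a)


print(matern52(0.0, 10))  # 1.0
for d in (1.0, 5.0, 20.0):
    print(d, matern52(d, 10))
```

A larger length scale `sig` makes the kernel decay more slowly with distance, so more distant training points contribute to a prediction.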
Return a ctypes array allocated from shared memory with data from a NumPy array.
 Parameters
arr_np (numpy.ndarray) – NumPy array.
typecode_or_type (char or ctype) – Either a ctypes type or a one character typecode of the kind used by the Python array module.
 Return type
array of ctype
Predict module
This module contains all routines for evaluating GDML and sGDML models.

class sgdml.predict.GDMLPredict(model, batch_size=None, num_workers=None, max_memory=None, max_processes=None, use_torch=False, log_level=None)
Bases: object

_set_batch_size(batch_size=None)
Warning
Deprecated! Please use the function _set_chunk_size in future projects.
Set chunk size for each worker process. A chunk is a subset of the training data points whose linear combination needs to be evaluated in order to generate a prediction.
The chunk size determines how much of a process's workload will be passed to Python’s underlying low-level routines at once. This parameter is highly hardware-dependent.
Note
This parameter can be optimally determined using prepare_parallel.
 Parameters
batch_size (int) – Chunk size (maximum value is set if None).

_set_bulk_mp(bulk_mp=False)
Toggles bulk prediction mode.
If bulk prediction is enabled, the prediction is parallelized across input geometries, i.e. each worker generates the complete prediction for one query. Otherwise (depending on the number of available CPU cores) the input geometries are processed sequentially, but every one of them may be processed by multiple workers at once (in chunks).
Note
This parameter can be optimally determined using prepare_parallel.
 Parameters
bulk_mp (bool, optional) – Enable or disable bulk prediction mode.

_set_chunk_size(chunk_size=None)
Set chunk size for each worker process.
Every prediction is generated as a linear combination of the training points that the model is comprised of. If multiple workers are available (and bulk mode is disabled), each one processes an (approximately equal) part of those training points. Then, the chunk size determines how much of a process's workload is passed to NumPy’s underlying low-level routines at once. If the chunk size is smaller than the number of points the worker is supposed to process, it processes them in multiple steps using a loop. This can sometimes be faster, depending on the available hardware.
Note
This parameter can be optimally determined using prepare_parallel.
 Parameters
chunk_size (int) – Chunk size (maximum value is set if None).
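The effect of the chunk size can be illustrated with a toy linear combination. The sketch below uses plain Python lists instead of NumPy arrays, and `linear_combination` is a hypothetical helper. It only demonstrates that looping over chunks yields the same result as one large operation, while touching fewer elements at a time.

```python
def linear_combination(coeffs, values, chunk_size=None):
    """Evaluate sum(c_i * v_i), optionally in chunks of at most chunk_size terms."""
    n = len(coeffs)
    if chunk_size is None:
        chunk_size = max(n, 1)  # one single chunk covering everything
    total = 0.0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Each loop iteration processes one chunk of the workload.
        total += sum(c * v for c, v in zip(coeffs[start:stop], values[start:stop]))
    return total


coeffs = [1.0, 2.0, 3.0, 4.0, 5.0]
values = [2.0] * 5
print(linear_combination(coeffs, values))                # 30.0
print(linear_combination(coeffs, values, chunk_size=2))  # 30.0
```

Which chunk size is fastest depends on cache sizes and memory bandwidth, which is consistent with the note that the parameter is best determined by benchmarking.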

_set_num_workers(num_workers=None, force_reset=False)
Set number of processes to use during prediction.
If bulk_mp == True, each worker handles the whole generation of a single prediction (this is for querying multiple geometries at once). If bulk_mp == False, each worker may handle only a part of a prediction (chunks are defined in ‘wkr_starts_stops’). In that scenario, multiple processes are used to distribute the work of generating a single prediction.
This number should not exceed the number of available CPU cores.
Note
This parameter can be optimally determined using prepare_parallel.
 Parameters
num_workers (int, optional) – Number of processes (maximum value is set if None).
force_reset (bool, optional) – Force applying the new setting.
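When bulk mode is disabled, the training points have to be divided among the workers. One plausible way to build such (start, stop) ranges is sketched below. The function `worker_ranges` is hypothetical, and the actual layout of ‘wkr_starts_stops’ inside the library is an assumption.

```python
def worker_ranges(n_points, num_workers):
    """Split range(n_points) into num_workers contiguous (start, stop) ranges.

    Stop indices are exclusive and the range sizes differ by at most one.
    """
    base, extra = divmod(n_points, num_workers)
    ranges = []
    start = 0
    for w in range(num_workers):
        # The first `extra` workers each take one additional point.
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


print(worker_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Using more workers than training points would leave some ranges empty, which is one more reason not to oversubscribe the available cores.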

get_GPU_batch()
Get batch size used by the GPU implementation to process bulk predictions (predictions for multiple input geometries at once).
This value is determined on-the-fly depending on the available GPU memory.

predict(R=None, return_E=True)
Predict energy and forces for multiple geometries. This function can run on the GPU, if the optional PyTorch dependency is installed and use_torch=True was specified during initialization of this class.
Optionally, the descriptors and descriptor Jacobians for the same geometries can be provided, if already available from some previous calculations.
Note
The order of the atoms in R is not arbitrary and must be the same as used for training the model.
 Parameters
R (numpy.ndarray, optional) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules. If this parameter is omitted, the training error is returned. Note that the training geometries need to be set right after initialization using set_R() for this to work.
return_E (boolean, optional) – If false (default: true), only the forces are returned.
 Returns
numpy.ndarray – Energies stored in a 1D array of size M (unless return_E == False).
numpy.ndarray – Forces stored in a 2D array of size M x 3N.
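The M x 3N input layout means that the N Cartesian triples of each geometry are concatenated into a single row. The following sketch shows that flattening with nested Python lists. `flatten_geometries` is a hypothetical helper, and it assumes the atoms are already in the same order as during training.

```python
def flatten_geometries(geometries):
    """Turn M geometries of N atoms ([[x, y, z], ...]) into M rows of length 3N."""
    rows = [[coord for atom in atoms for coord in atom] for atoms in geometries]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all geometries must have the same number of atoms")
    return rows


geometries = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]],  # geometry 1: two atoms
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.2]],  # geometry 2: same atom order
]
R = flatten_geometries(geometries)
print(len(R), "x", len(R[0]))  # 2 x 6
```

With NumPy, the same layout is obtained by reshaping an array of shape (M, N, 3) with `reshape(M, -1)`.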

prepare_parallel(n_bulk=1, n_reps=1, return_is_from_cache=False)
Find and set the optimal parallelization parameters for the currently loaded model, running on a particular system. The result also depends on the number of geometries n_bulk that will be passed at once when calling the predict function.
This function runs a benchmark in which the prediction routine is repeatedly called n_reps times (default: 1) with varying parameter configurations, while the runtime is measured for each one. The optimal parameters are then cached for fast retrieval in future calls of this function.
We recommend calling this function after initialization of this class, as it will drastically increase the performance of the predict function.
Note
Depending on the parameter n_reps, this routine may take some seconds/minutes to complete. However, once a statistically significant number of benchmark results has been gathered for a particular configuration, it starts returning almost instantly.
 Parameters
n_bulk (int, optional) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).
n_reps (int, optional) – Number of repetitions (bigger value: more accurate, but also slower).
return_is_from_cache (bool, optional) – If enabled, this function returns a second value indicating if the returned results were obtained from cache.
 Returns
int – Force and energy prediction speed in geometries per second.
boolean, optional – Return, whether this function obtained the results from cache.
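The benchmark-and-cache pattern described above can be sketched generically. Everything in this example is hypothetical (the function `find_fastest`, the in-memory `_cache` dictionary and the toy workload). It only mirrors the idea of timing each configuration n_reps times, keeping the fastest, and answering repeated requests from a cache.

```python
import time

_cache = {}  # maps a cache key to (best_config, calls_per_second)


def find_fastest(func, configs, n_reps=1, cache_key=None):
    """Time func(**config) for every config.

    Returns (best_config, calls_per_second, is_from_cache).
    """
    if not configs:
        raise ValueError("at least one configuration is required")
    if cache_key is not None and cache_key in _cache:
        return _cache[cache_key] + (True,)
    best = None
    for config in configs:
        start = time.perf_counter()
        for _ in range(n_reps):
            func(**config)
        elapsed = time.perf_counter() - start
        rate = n_reps / elapsed if elapsed > 0 else float("inf")
        if best is None or rate > best[1]:
            best = (config, rate)
    if cache_key is not None:
        _cache[cache_key] = best
    return best + (False,)


def toy_workload(chunk_size):
    """Stand-in for a prediction call; sums 10000 numbers in chunks."""
    data = list(range(10000))
    return sum(sum(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size))


configs = [{"chunk_size": 10}, {"chunk_size": 1000}]
first = find_fastest(toy_workload, configs, n_reps=3, cache_key="demo")
second = find_fastest(toy_workload, configs, n_reps=3, cache_key="demo")
print(first[0], first[2], second[2])
```

A larger n_reps reduces timing noise at the cost of a longer benchmark, which matches the trade-off stated for the n_reps parameter.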

set_R_d_desc(R_d_desc)
Store a reference to the training geometry descriptor Jacobians. This function must be called before set_alphas() can be used.
This routine is used during iterative model training.
 Parameters
R_d_desc (numpy.ndarray, optional) – An array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

set_R_desc(R_desc)
Store a reference to the training geometry descriptors.
This can accelerate iterative model training.
 Parameters
R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.

set_alphas(alphas_F, alphas_E=None)
Reconfigure the current model with a new set of regression parameters. R_d_desc needs to be set for this function to work.
This routine is used during iterative model training.
 Parameters
alphas_F (numpy.ndarray) – 1D array containing the new model parameters.
alphas_E (numpy.ndarray, optional) – 1D array containing the additional new model parameters, if energy constraints are used in the kernel (use_E_cstr=True).

set_opt_num_workers_and_batch_size_fast(n_bulk=1, n_reps=1)
Warning
Deprecated! Please use the function prepare_parallel in future projects.
 Parameters
n_bulk (int, optional) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).
n_reps (int, optional) – Number of repetitions (bigger value: more accurate, but also slower).
 Returns
Force and energy prediction speed in geometries per second.
 Return type
int


sgdml.predict._predict_wkr(r, r_desc_d_desc, lat_and_inv, glob_id, wkr_start_stop=None, chunk_size=None)
Compute (part of) a prediction.
Every prediction is a linear combination involving the training points used for this model. This function evaluates that combination for the range specified by wkr_start_stop. This workload can optionally be processed in chunks, which can be faster as it requires less memory to be allocated.
Note
It is sufficient to provide either the parameter r or r_desc_d_desc. The other one can be set to None.
 Parameters
r (numpy.ndarray) – An array of size 3N containing the Cartesian coordinates of each atom in the molecule.
r_desc_d_desc (tuple of numpy.ndarray) – A tuple made up of:
(1) An array of size D containing the descriptors of dimension D for the molecule. (2) An array of size D x 3N containing the descriptor Jacobian for the molecule. It has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.
lat_and_inv (tuple of numpy.ndarray) – Tuple of a 3 x 3 matrix containing the lattice vectors as columns and its inverse.
glob_id (int) – Identifier of the global namespace that this function is supposed to be using (zero if only one instance of this class exists at the same time).
wkr_start_stop (tuple of int, optional) – Range defined by the indices of first and last (exclusive) sum element. The full prediction is generated if this parameter is not specified.
chunk_size (int, optional) – Chunk size. The whole linear combination is evaluated in a large vector operation instead of looping over smaller chunks if this parameter is left unspecified.
 Returns
Partial prediction of all force components and energy (appended to array as last element).
 Return type
numpy.ndarray
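Because each worker returns a partial result with the energy appended as the last element, the caller has to combine the partial arrays and split off the energy. The sketch below assumes that partial results are combined by element-wise summation, which follows from the prediction being a linear combination but is not stated explicitly above. `combine_partials` is a hypothetical helper that uses plain lists.

```python
def combine_partials(partials):
    """Sum equally sized partial results element-wise and split off the energy.

    Each partial result holds all force components followed by the energy.
    """
    summed = [sum(column) for column in zip(*partials)]
    forces, energy = summed[:-1], summed[-1]
    return energy, forces


partials = [
    [0.5, -1.0, 0.0, 2.0],   # worker 1: three force components, then energy
    [0.5, 1.0, 0.25, 1.0],   # worker 2: same layout
]
energy, forces = combine_partials(partials)
print(energy, forces)  # 3.0 [1.0, 0.0, 0.25]
```

Keeping the energy in the same array as the forces lets each worker return a single object, so only one reduction over the workers is needed.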
Return a ctypes array allocated from shared memory with data from a NumPy array of type float.
 Parameters
arr_np (numpy.ndarray) – NumPy array.
 Return type
array of ctype