Python API

Train module

This module contains all routines for training GDML and sGDML models.

class sgdml.train.GDMLTrain(max_memory=None, max_processes=None, use_torch=False)[source]

Bases: object

_assemble_kernel_mat(R_desc, R_d_desc, tril_perms_lin, sig, desc, use_E_cstr=False, col_idxs=slice(None, None, None), alloc_extra_rows=0, callback=None)[source]

Compute force field kernel matrix.

The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters
  • R_desc (numpy.ndarray) – Array containing the descriptor for each training point.

  • R_d_desc (numpy.ndarray) – Array containing the gradient of the descriptor for each training point.

  • tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • sig (int) – Hyper-parameter \(\sigma\) (kernel length scale).

  • use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.

  • callback (callable, optional) – Kernel assembly progress function that takes three arguments:

    current (int)

    Current progress (number of completed entries).

    total (int)

    Task size (total number of entries to create).

    done_str (str, optional)

    Once complete, this string contains the time it took to assemble the kernel (seconds).

  • cols_m_limit (int, optional (DEPRECATED)) – Only generate the columns up to index ‘cols_m_limit’. This creates an M*3N x cols_m_limit*3N kernel matrix, instead of M*3N x M*3N.

  • cols_3n_keep_idxs (numpy.ndarray, optional) – Only generate columns with the given indices in the 3N x 3N kernel function. The resulting kernel matrix will have dimension M*3N x M*len(cols_3n_keep_idxs).

Returns

Force field kernel matrix.

Return type

numpy.ndarray

_recov_int_const(model, task, R_desc=None, R_d_desc=None)[source]

Estimate the integration constant for a force field model.

The offset between the energies predicted for the original training data and the true energy labels is computed in the least-squares sense. Furthermore, common issues with the user-provided datasets are diagnosed here automatically.

Parameters
  • model (dict) – Data structure of custom type model.

  • task (dict) – Data structure of custom type task.

  • R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.

  • R_d_desc (numpy.ndarray, optional) – A 2D array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

Returns

Estimate for the integration constant.

Return type

float

Raises
  • ValueError – If the sign of the force labels in the dataset from which the model emerged is switched (e.g. gradients instead of forces).

  • ValueError – If inconsistent/corrupted energy labels are detected in the provided dataset.

  • ValueError – If potentially inconsistent scales in energy vs. force labels are detected in the provided dataset.

create_model(task, solver, R_desc, R_d_desc, tril_perms_lin, std, alphas_F, alphas_E=None)[source]

Create a data structure of custom type model.

These data structures contain the trained model and everything that is needed to generate predictions for new inputs.

Each model also contains the MD5 fingerprints of the datasets used.

Parameters
  • task (dict) – Data structure of custom type task from which the model emerged.

  • solver (str) – Identifier string for the solver that has been used to train this model.

  • R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.

  • R_d_desc (numpy.ndarray, optional) – A 2D array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

  • tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • std (float) – Standard deviation of the training labels.

  • alphas_F (numpy.ndarray) – A 1D array of size 3NM containing the linear coefficients that correspond to the force constraints.

  • alphas_E (numpy.ndarray, optional) – A 1D array of size N containing the linear coefficients that correspond to the energy constraints.

Returns

Data structure of custom type model.

Return type

dict

create_task(train_dataset, n_train, valid_dataset, n_valid, sig, lam=1e-10, perms=None, use_sym=True, use_E=True, use_E_cstr=False, callback=None)[source]

Create a data structure of custom type task.

These data structures serve as recipes for model creation, summarizing the configuration of one particular training run. Training and validation points are sampled from the provided datasets without replacement. If the same dataset is given for training and validation, the subsets are drawn without overlap.

Each task also contains a choice for the hyper-parameters of the training process and the MD5 fingerprints of the used datasets.

Parameters
  • train_dataset (dict) – Data structure of custom type dataset containing the training dataset.

  • n_train (int) – Number of training points to sample.

  • valid_dataset (dict) – Data structure of custom type dataset containing the validation dataset.

  • n_valid (int) – Number of validation points to sample.

  • sig (int) – Hyper-parameter \(\sigma\) (kernel length scale).

  • lam (float, optional) – Hyper-parameter lambda (regularization strength).

  • perms (numpy.ndarray, optional) – A 2D array of size P x N containing P possible permutations of the N atoms in the system. This argument takes priority over the ones provided in the training dataset. No automatic discovery is run when this argument is provided.

  • use_sym (bool, optional) – True: include symmetries (sGDML), False: GDML.

  • use_E (bool, optional) – True: reconstruct force field with corresponding potential energy surface, False: ignore energy during training, even if energy labels are available in the dataset. The trained model will still be able to predict energies up to an unknown integration constant. Note that the energy prediction accuracy will be untested.

  • use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML.

  • callback (callable, optional) – Progress callback function that takes three arguments:

    current (int)

    Current progress.

    total (int)

    Task size.

    done_str (str, optional)

    Once complete, this string is shown.

Returns

Data structure of custom type task.

Return type

dict

Raises

ValueError – If a reconstruction of the potential energy surface is requested, but the energy labels are missing in the dataset.
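For illustration, a minimal usage sketch based on the signature above (the file name, the hyper-parameter values and the dataset contents are assumptions; the dataset is expected to be a dict-like structure in sGDML's .npz dataset format):

    import numpy as np
    from sgdml.train import GDMLTrain

    # Hypothetical dataset file in sGDML's .npz dataset format.
    dataset = np.load('d_ethanol.npz', allow_pickle=True)

    gdml_train = GDMLTrain()

    # Sample 200 training and 1000 validation points from the same dataset
    # (drawn without overlap) and fix the kernel length scale to sig=20.
    task = gdml_train.create_task(
        dataset, 200,    # train_dataset, n_train
        dataset, 1000,   # valid_dataset, n_valid
        sig=20,
        lam=1e-10,
        use_sym=True,    # sGDML: include symmetries
        use_E=True,      # also reconstruct the potential energy surface
    )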

create_task_from_model(model, dataset)[source]

Create a data structure of custom type task from an existing data structure of custom type model. This method is used to resume training of unconverged models.

Any hyper-parameters (including all symmetry permutations) in the provided model file are reused without further optimization. The current linear coefficients are used as the starting point for the iterative training procedure.

Parameters
  • model (dict) – Data structure of custom type model based on which to create the training task.

  • dataset (dict) – Data structure of custom type dataset containing the original dataset from which the provided model emerged.

Returns

Data structure of custom type task.

Return type

dict
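A hedged sketch of resuming training from a previously saved, possibly unconverged model (file names are hypothetical):

    import numpy as np
    from sgdml.train import GDMLTrain

    gdml_train = GDMLTrain()

    # Hypothetical files: a previously saved model and the dataset it emerged from.
    model = dict(np.load('m_ethanol_unconverged.npz', allow_pickle=True))
    dataset = np.load('d_ethanol.npz', allow_pickle=True)

    # All hyper-parameters and permutations are reused from the existing model.
    task = gdml_train.create_task_from_model(model, dataset)
    model = gdml_train.train(task)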

draw_strat_sample(T, n, excl_idxs=None)[source]

Draw sample from dataset that preserves its original distribution.

The distribution is estimated from a histogram where the bin size is determined using the Freedman-Diaconis rule. This rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. A reduced histogram is then constructed by sampling uniformly in each bin. It is intended to populate all bins with at least one sample in the reduced histogram, even for small training sizes.

Parameters
  • T (numpy.ndarray) – Dataset to sample from.

  • n (int) – Number of examples.

  • excl_idxs (numpy.ndarray, optional) – Array of indices to exclude from sample.

Returns

Array of indices that form the sample.

Return type

numpy.ndarray
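As an illustration, a sketch of drawing non-overlapping training and validation indices; stratifying over the energy labels (here assumed to live under the key 'E' of the dataset) is an assumption:

    import numpy as np
    from sgdml.train import GDMLTrain

    gdml_train = GDMLTrain()

    # Assumption: one energy label per geometry, stored under the key 'E'.
    E = np.load('d_ethanol.npz', allow_pickle=True)['E']

    # Draw 200 indices whose energy distribution mirrors the full dataset.
    train_idxs = gdml_train.draw_strat_sample(E, 200)

    # Draw 1000 validation indices that do not overlap with the training sample.
    valid_idxs = gdml_train.draw_strat_sample(E, 1000, excl_idxs=train_idxs)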

train(task, save_progr_callback=None, callback=None)[source]

Train a model based on a training task.

Parameters
  • task (dict) – Data structure of custom type task.

  • desc_callback (callable, optional) – Descriptor and descriptor Jacobian generation status function that takes three arguments:

    current (int)

    Current progress (number of completed descriptors).

    total (int)

    Task size (total number of descriptors to create).

    done_str (str, optional)

    Once complete, this string contains the time it took to complete this task (seconds).

  • ker_progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:

    current (int)

    Current progress (number of completed entries).

    total (int)

    Task size (total number of entries to create).

    done_str (str, optional)

    Once complete, this string contains the time it took to assemble the kernel (seconds).

  • solve_callback (callable, optional) – Linear system solver status.

    done (bool)

    False when solver starts, True when it finishes.

    done_str (str, optional)

    Once done, this string contains the runtime of the solver (seconds).

Returns

Data structure of custom type model.

Return type

dict

Raises

ValueError – If the provided dataset contains invalid lattice vectors.
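Putting the pieces together, a hedged end-to-end training sketch (file names are hypothetical; saving the model dict as an .npz file is one possible way to persist it):

    import numpy as np
    from sgdml.train import GDMLTrain

    dataset = np.load('d_ethanol.npz', allow_pickle=True)  # hypothetical dataset file

    gdml_train = GDMLTrain()
    task = gdml_train.create_task(dataset, 200, dataset, 1000, sig=20)
    model = gdml_train.train(task)

    # Persist the model dict for later use with sgdml.predict.GDMLPredict.
    np.savez_compressed('m_ethanol.npz', **model)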

sgdml.train._assemble_kernel_mat_wkr(j, tril_perms_lin, sig, use_E_cstr=False, exploit_sym=False, cols_m_limit=None)[source]

Compute one row and column of the force field kernel matrix.

The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters
  • j (int) – Index of training point.

  • tril_perms_lin (numpy.ndarray) – 1D array (int) containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • sig (int) – Hyper-parameter \(\sigma\).

  • use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.

  • exploit_sym (bool, optional) – Do not create symmetric entries of the kernel matrix twice (this only works for specific inputs for cols_m_limit).

  • cols_m_limit (int, optional) – Limit the number of columns (include training points 1-M). Note that each training point consists of multiple columns.

Returns

Number of kernel matrix blocks created, divided by 2 (symmetric blocks are always created together).

Return type

int

sgdml.train._share_array(arr_np, typecode_or_type)[source]

Return a ctypes array allocated from shared memory with data from a NumPy array.

Parameters
  • arr_np (numpy.ndarray) – NumPy array.

  • typecode_or_type (char or ctype) – Either a ctypes type or a one character typecode of the kind used by the Python array module.

Returns

Return type

array of ctype

Predict module

This module contains all routines for evaluating GDML and sGDML models.

class sgdml.predict.GDMLPredict(model, batch_size=None, num_workers=None, max_memory=None, max_processes=None, use_torch=False, log_level=None)[source]

Bases: object

_set_batch_size(batch_size=None)[source]

Warning

Deprecated! Please use the function _set_chunk_size in future projects.

Set chunk size for each worker process. A chunk is a subset of the training data points whose linear combination needs to be evaluated in order to generate a prediction.

The chunk size determines how much of a process's workload will be passed to Python's underlying low-level routines at once. This parameter is highly hardware dependent.

Note

This parameter can be optimally determined using prepare_parallel.

Parameters

batch_size (int) – Chunk size (maximum value is set if None).

_set_bulk_mp(bulk_mp=False)[source]

Toggles bulk prediction mode.

If bulk prediction is enabled, the prediction is parallelized across input geometries, i.e. each worker generates the complete prediction for one query. Otherwise (depending on the number of available CPU cores) the input geometries are processed sequentially, but each of them may be processed by multiple workers at once (in chunks).

Note

This parameter can be optimally determined using prepare_parallel.

Parameters

bulk_mp (bool, optional) – Enable or disable bulk prediction mode.

_set_chunk_size(chunk_size=None)[source]

Set chunk size for each worker process.

Every prediction is generated as a linear combination of the training points that the model is comprised of. If multiple workers are available (and bulk mode is disabled), each one processes an (approximately equal) part of those training points. The chunk size then determines how much of a process's workload is passed to NumPy's underlying low-level routines at once. If the chunk size is smaller than the number of points the worker is supposed to process, it processes them in multiple steps using a loop. This can sometimes be faster, depending on the available hardware.

Note

This parameter can be optimally determined using prepare_parallel.

Parameters

chunk_size (int) – Chunk size (maximum value is set if None).

_set_num_workers(num_workers=None, force_reset=False)[source]

Set number of processes to use during prediction.

If bulk_mp == True, each worker handles the whole generation of a single prediction (this is for querying multiple geometries at once). If bulk_mp == False, each worker may handle only a part of a prediction (chunks are defined in ‘wkr_starts_stops’). In that scenario, multiple processes are used to distribute the work of generating a single prediction.

This number should not exceed the number of available CPU cores.

Note

This parameter can be optimally determined using prepare_parallel.

Parameters
  • num_workers (int, optional) – Number of processes (maximum value is set if None).

  • force_reset (bool, optional) – Force applying the new setting.

get_GPU_batch()[source]

Get batch size used by the GPU implementation to process bulk predictions (predictions for multiple input geometries at once).

This value is determined on-the-fly depending on the available GPU memory.

predict(R=None, return_E=True)[source]

Predict energy and forces for multiple geometries. This function can run on the GPU, if the optional PyTorch dependency is installed and use_torch=True was specified during initialization of this class.

Optionally, the descriptors and descriptor Jacobians for the same geometries can be provided, if already available from some previous calculations.

Note

The order of the atoms in R is not arbitrary and must be the same as used for training the model.

Parameters
  • R (numpy.ndarray, optional) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules. If this parameter is omitted, the training error is returned. Note that the training geometries need to be set right after initialization using set_R() for this to work.

  • return_E (boolean, optional) – If False (default: True), only the forces are returned.

Returns

  • numpy.ndarray – Energies stored in a 1D array of size M (unless return_E == False).

  • numpy.ndarray – Forces stored in a 2D array of size M x 3N.
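A minimal prediction sketch, assuming a trained model file and a coordinate array of shape M x 3N (the file name and the coordinate values are placeholders):

    import numpy as np
    from sgdml.predict import GDMLPredict

    # Hypothetical model file produced by the Train module above.
    model = np.load('m_ethanol.npz', allow_pickle=True)
    gdml = GDMLPredict(model)

    # Two query geometries for a molecule with N = 9 atoms, flattened to shape (2, 27).
    # The atom order must match the one used during training.
    R = np.zeros((2, 27))  # placeholder coordinates only

    E, F = gdml.predict(R)  # E has shape (2,), F has shape (2, 27)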

prepare_parallel(n_bulk=1, n_reps=1, return_is_from_cache=False)[source]

Find and set the optimal parallelization parameters for the currently loaded model, running on a particular system. The result also depends on the number of geometries n_bulk that will be passed at once when calling the predict function.

This function runs a benchmark in which the prediction routine is repeatedly called n_reps times (default: 1) with varying parameter configurations, while the runtime is measured for each one. The optimal parameters are then cached for fast retrieval in future calls of this function.

We recommend calling this function after initialization of this class, as it will drastically increase the performance of the predict function.

Note

Depending on the parameter n_reps, this routine may take some seconds/minutes to complete. However, once a statistically significant number of benchmark results has been gathered for a particular configuration, it starts returning almost instantly.

Parameters
  • n_bulk (int, optional) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).

  • n_reps (int, optional) – Number of repetitions (bigger value: more accurate, but also slower).

  • return_is_from_cache (bool, optional) – If enabled, this function returns a second value indicating if the returned results were obtained from cache.

Returns

  • int – Force and energy prediction speed in geometries per second.

  • boolean, optional – Whether this function obtained the results from cache.
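A sketch of tuning the parallelization right after construction, assuming predictions will later be requested in batches of 100 geometries (the model file and the coordinates are placeholders):

    import numpy as np
    from sgdml.predict import GDMLPredict

    model = np.load('m_ethanol.npz', allow_pickle=True)  # hypothetical model file
    gdml = GDMLPredict(model)

    # Benchmark and cache the best worker/chunk configuration for batches of 100.
    gps = gdml.prepare_parallel(n_bulk=100)
    print('Expected throughput: ~{} geometries/s'.format(gps))

    # Subsequent predict() calls with batches of 100 geometries use the tuned settings.
    R = np.zeros((100, 27))  # placeholder coordinates, shape M x 3N
    E, F = gdml.predict(R)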

set_R_d_desc(R_d_desc)[source]

Store a reference to the training geometry descriptor Jacobians. This function must be called before set_alphas() can be used.

This routine is used during iterative model training.

Parameters

R_d_desc (numpy.ndarray, optional) – A 2D array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

set_R_desc(R_desc)[source]

Store a reference to the training geometry descriptors.

This can accelerate iterative model training.

Parameters

R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.

set_alphas(alphas_F, alphas_E=None)[source]

Reconfigure the current model with a new set of regression parameters. R_d_desc needs to be set for this function to work.

This routine is used during iterative model training.

Parameters
  • alphas_F (numpy.ndarray) – 1D array containing the new model parameters.

  • alphas_E (numpy.ndarray, optional) – 1D array containing the additional new model parameters, if energy constraints are used in the kernel (use_E_cstr=True).
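A hedged sketch of the iterative-training use case, where a solver repeatedly updates the coefficients of an existing predictor; the placeholder arrays stand in for quantities produced by the training code (shapes as documented above, file name hypothetical):

    import numpy as np
    from sgdml.predict import GDMLPredict

    model = np.load('m_ethanol.npz', allow_pickle=True)  # hypothetical model file
    gdml = GDMLPredict(model)

    # Placeholders; in practice these come from the training run
    # (M training points, descriptor dimension D, N atoms).
    R_desc = ...        # descriptors, size M x D
    R_d_desc = ...      # descriptor Jacobians, size M x D x 3N
    new_alphas_F = ...  # updated linear coefficients, size 3NM

    gdml.set_R_desc(R_desc)      # optional: speeds up iterative training
    gdml.set_R_d_desc(R_d_desc)  # required before set_alphas()
    gdml.set_alphas(new_alphas_F)
    # Subsequent predict() calls reflect the updated coefficients.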

set_opt_num_workers_and_batch_size_fast(n_bulk=1, n_reps=1)[source]

Warning

Deprecated! Please use the function prepare_parallel in future projects.

Parameters
  • n_bulk (int, optional) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).

  • n_reps (int, optional) – Number of repetitions (bigger value: more accurate, but also slower).

Returns

Force and energy prediction speed in geometries per second.

Return type

int

sgdml.predict._predict_wkr(r, r_desc_d_desc, lat_and_inv, glob_id, wkr_start_stop=None, chunk_size=None)[source]

Compute (part of) a prediction.

Every prediction is a linear combination involving the training points used for this model. This function evaluates that combination for the range specified by wkr_start_stop. This workload can optionally be processed in chunks, which can be faster as it requires less memory to be allocated.

Note

It is sufficient to provide either the parameter r or r_desc_d_desc. The other one can be set to None.

Parameters
  • r (numpy.ndarray) – An array of size 3N containing the Cartesian coordinates of each atom in the molecule.

  • r_desc_d_desc (tuple of numpy.ndarray) –

    A tuple made up of:

    (1) An array of size D containing the descriptor of dimension D for the molecule. (2) An array of size D x 3N containing the descriptor Jacobian for the molecule. It has dimension D with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

  • lat_and_inv (tuple of numpy.ndarray) – Tuple containing the 3 x 3 matrix of lattice vectors (as columns) and its inverse.

  • glob_id (int) – Identifier of the global namespace that this function is supposed to be using (zero if only one instance of this class exists at the same time).

  • wkr_start_stop (tuple of int, optional) – Range defined by the indices of first and last (exclusive) sum element. The full prediction is generated if this parameter is not specified.

  • chunk_size (int, optional) – Chunk size. The whole linear combination is evaluated in a large vector operation instead of looping over smaller chunks if this parameter is left unspecified.

Returns

Partial prediction of all force components and energy (appended to the array as the last element).

Return type

numpy.ndarray

sgdml.predict.share_array(arr_np)[source]

Return a ctypes array allocated from shared memory with data from a NumPy array of type float.

Parameters

arr_np (numpy.ndarray) – NumPy array.

Returns

Return type

array of ctype