Data preparation

The sgdml package uses a proprietary format for its datasets, but we include scripts to convert to and from Extended XYZ files and other popular file formats. Custom converters are straightforward to write using one of those scripts as a template.

To convert a dataset from an Extended XYZ file (example), run:

$ sgdml_dataset_from_extxyz.py <ext_xyz_file>

This will create a dataset file (with the extension .npz) in the supported format. Conversely, a dataset in this proprietary format can be converted back to Extended XYZ using:

$ sgdml_dataset_to_extxyz.py <dataset_file>

Note

Any metadata specific to sgdml, e.g. the dataset name, level of theory, unit descriptions, or the dataset checksum, will be dropped when exporting to other file formats.
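
Because dataset files are NumPy .npz archives, their entries can be listed directly with NumPy. The following is a minimal sketch on a made-up toy file; the key names (name, theory, z, R, E, F) are assumptions for illustration and may not match the entries of a real sgdml dataset.

```python
import os
import tempfile

import numpy as np

# Build a tiny stand-in dataset (3 geometries of a 2-atom molecule).
# The key names below are assumptions for illustration only.
toy = {
    'name': np.array('toy dimer'),
    'theory': np.array('unknown'),
    'z': np.array([1, 1]),      # atomic numbers
    'R': np.zeros((3, 2, 3)),   # geometries
    'E': np.zeros(3),           # energies
    'F': np.zeros((3, 2, 3)),   # forces
}

path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.npz')
np.savez(path, **toy)

# Load the archive and record the shape of every entry it contains.
with np.load(path) as archive:
    summary = {key: archive[key].shape for key in archive.files}

for key, shape in sorted(summary.items()):
    print(f'{key:8s} shape={shape}')
```

Scalar metadata such as the name is stored as a zero-dimensional array, so its shape is the empty tuple.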

Importing third-party formats

Warning

The output format of external programs may change without notice, in which case the following scripts will need to be adjusted.

Fritz Haber Institute ab initio molecular simulations (FHI-aims)

To create a dataset from an FHI-aims molecular dynamics output file (example), run:

$ sgdml_dataset_from_aims.py <aims_output_file>

i-PI: a universal force engine

To create a dataset from i-PI molecular dynamics trajectories (example), run:

$ sgdml_dataset_from_ipi.py <xyz_geometries> <xyz_forces> <energies> [<energy_col>]

i-PI stores geometries, forces and energies in separate files. The desired column in its energy output file is selected via the optional parameter <energy_col>.
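
To illustrate what selecting an energy column means, here is a minimal sketch that reads one column from a whitespace-separated property file with NumPy. The file contents are made up, and the sketch uses NumPy's zero-based column index; whether <energy_col> counts from zero or one is not stated here, so treat the index as an assumption.

```python
import io

import numpy as np

# Made-up stand-in for an i-PI property file: '#' lines describe the columns.
fake_energy_file = io.StringIO(
    "# column 1 --> step\n"
    "# column 2 --> time\n"
    "# column 3 --> potential\n"
    "0 0.0 -1.50\n"
    "1 0.5 -1.48\n"
    "2 1.0 -1.52\n"
)

energy_col = 2  # zero-based index of the column to extract (hypothetical choice)

# Skip comment lines and keep only the requested column.
energies = np.loadtxt(fake_energy_file, comments='#', usecols=energy_col)
print(energies)
```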

Other input formats (sGDML >=0.4.3, requires ASE)

We also include an experimental dataset import script that taps into ASE’s (optional dependency) extensive support for a wide range of additional input formats. Any format from this list (automatically recognized by ASE) can be imported via

$ sgdml_dataset_via_ase.py <dataset_file>

Warning

Caution is advised when using this script, as it can yield unexpected results. Always verify the imported dataset using sgdml show <dataset_file> (see below) and check that the printed properties and statistics make sense.
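
In addition to sgdml show, a few basic consistency checks can be scripted. The sketch below is not part of sgdml; the function check_dataset_arrays is hypothetical and assumes the common layout of geometries and forces with shape (n_points, n_atoms, 3), one energy per geometry and one atomic number per atom.

```python
import numpy as np


def check_dataset_arrays(z, R, E, F):
    """Return a list of human-readable problems (empty if none were found)."""
    if R.ndim != 3 or R.shape[-1] != 3:
        return ['geometries should have shape (n_points, n_atoms, 3)']

    problems = []
    n_points, n_atoms, _ = R.shape

    if F.shape != R.shape:
        problems.append('forces and geometries differ in shape')
    if E.shape != (n_points,):
        problems.append('expected one energy per geometry')
    if z.shape != (n_atoms,):
        problems.append('expected one atomic number per atom')

    # NaN or infinite entries usually point to a parsing problem.
    for label, arr in (('geometries', R), ('energies', E), ('forces', F)):
        if not np.all(np.isfinite(arr)):
            problems.append(f'{label} contain NaN or infinite values')

    return problems


# Toy data: 4 geometries of a 2-atom system (values are made up).
z = np.array([1, 1])
R = np.zeros((4, 2, 3))
E = np.zeros(4)
F = np.zeros((4, 2, 3))

good_report = check_dataset_arrays(z, R, E, F)

F_bad = F.copy()
F_bad[0, 0, 0] = np.nan
bad_report = check_dataset_arrays(z, R, E, F_bad)

print(good_report)
print(bad_report)
```

Checks like these only catch structural problems. Whether the energies and forces are physically plausible still has to be judged from the printed ranges and statistics.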

Viewing dataset properties

The details for any dataset file can be inspected using

$ sgdml show <dataset_file>

which will print something similar to this:

         __________  __  _____
   _____/ ____/ __ \/  |/  / /
  / ___/ / __/ / / / /|_/ / /
 (__  ) /_/ / /_/ / /  / / /___
/____/\____/_____/_/  /_/_____/  0.4.3.dev3                                          found 12 CPU(s)

 SHOW DETAILS
----------------------------------------------------------------------------------------------------
Dataset properties
  Name:              CsPbBr3, 500K, 10fs (40 atoms)
  Theory:            VASP
  Size:              2,000 data points
  Lattice:           a          b           c
                     11.6163     7.873e-06  -9.876e-06
                      7.873e-06 11.6163      1.2881e-05
                     -9.876e-06  1.2881e-05 11.6163
    Lengths:         a = 11.6163, b = 11.6163, c = 11.6163
    Angles [deg]:    alpha = 89.9999, beta = 90.0001, gamma = 89.9999
  Energies [eV]:
    Range:           -1.4e+02 |--   2.02   --| -1.38e+02
    Mean:            -138.988
    Variance:        0.132
  Forces [eV/Ang]:
    Range:           -2.11 |--   4.17   --| 2.05
    Mean:            -0.000
    Variance:        0.109
  Fingerprint:       8c9b459f632a7134848a5c92402a0312

Example geometry (no. 1,161, chosen randomly)
  Copy&paste the string below into Jmol (www.jmol.org), Avogadro (www.avogadro.cc), etc. to
  visualize a geometry from this dataset. A new example will be drawn on each call.

  ---- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE ----
  40
  Lattice="11.616298069 7.873e-06 -9.876e-06 7.873e-06 11.616306553 1.2881e-05 -9.876e-06 1.2881e-05 11.61630755" Energy=-139.23610834 Properties=species:S:1:pos:R:3:forces:R:3
  Pb  0.6149  11.5371   5.96188 -0.658385 -0.01036  -0.309138
  Pb  0.45736  5.54317  5.63009 -0.895169  0.259049 -0.190419
  Pb  6.05761  0.14687  5.91688  0.463358 -0.583672  0.609393
  Pb  5.67843  5.74485  6.04255  0.452753  0.105349 -0.25027
  Pb 11.392   11.4198   0.19456 -0.054637 -0.201066 -0.229103
  Pb 11.1545   5.63932 11.6069  -0.046257  0.866832  0.089131
  Pb  5.86645  0.10878  0.14953 -0.224227  0.243382 -0.221047
  Pb  5.53374  6.10861 11.5692   0.507028 -1.05318   0.287179
  Cs  2.65696  2.76843  1.92402  0.147986  0.083318 -0.277711
  Cs  3.2544   8.72126  2.50281 -0.416375  0.132518  0.060757
  Cs  8.03648  3.24618  2.63973 -0.024972 -0.095778  0.262713
  Cs  8.97209  8.84332  3.64384 -0.092558 -0.13054  -0.485435
  Cs  1.84245  2.88172  9.08498  0.039288 -0.270671  0.331387
  Cs  2.7464   8.54156  8.63513 -0.282382  0.019342 -0.036995
  Cs  8.81882  2.91659  8.51183  0.633232  0.111299 -0.002106
  Cs  9.09737  8.85903  7.42172  0.010123 -0.20591   0.702575
  Br  2.75493 11.2924  11.2354  -0.310241  0.347108 -0.031611
  Br  5.94428  2.9451  11.0683  -0.261107  0.232289  0.21885
  Br  4.90483  5.4145   3.05733  0.211533  0.052715  0.238589
  Br  2.537    5.6697  11.3341   0.324577  0.22128  -0.317152
  Br  5.31398  8.79667 11.3534   0.127817  0.441474 -0.149943
  Br  5.18871  0.50588  3.36554 -0.189689  0.045813 -0.759197
  Br  8.35926 11.4619   1.35293  0.682207 -0.064515  0.037932
  Br 10.938    2.83985  0.08844 -0.182543 -0.508183  0.08176
  Br  0.66206  5.80504  2.80397 -0.13863   0.078374 -0.679582
  Br  8.54658  6.56376  1.14919 -0.378726 -0.367373 -0.310592
  Br 11.2746   8.5901  11.1478   0.135547  0.161482 -0.023156
  Br  0.66982  0.11832  2.86471 -0.040507 -0.422352  0.178096
  Br  3.26133  0.30487  6.70639  0.09869  -0.026901  0.433768
  Br  6.25237  2.8823   6.40061 -0.287116  0.322048 -0.204326
  Br  6.89953  6.16544  9.00266 -0.177235  0.052506 -0.400216
  Br  2.93778  5.20783  6.88583  0.958546  0.204122  0.153683
  Br  5.30321  8.65312  5.55974  0.276979  0.009966  0.17536
  Br  7.22531 11.5782   9.2761  -0.151673 -0.41358  -0.321117
  Br  9.00526  0.23154  5.40351  0.063848  0.198287  0.025501
  Br  0.38697  2.69704  5.25357 -0.080373  0.389529  0.14632
  Br 11.1616   5.73738  8.37778 -0.0481    0.153284  0.71168
  Br  8.95001  5.59011  5.34824 -0.280161  0.113828  0.079881
  Br  0.57208  8.7157   5.15009  0.167686 -0.363692  0.450414
  Br 11.3248   0.05482  8.74895 -0.085141 -0.135819 -0.082334
  ---- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE ----

Periodic boundary conditions (sGDML >=0.4.0)

The sgdml package supports periodic boundary conditions, allowing a description of macro-scale systems like bulk gases, liquids or crystal structures in addition to molecular structures. Internally, this is achieved via a modification of the Euclidean distance metric in the descriptor of the system, such that it adheres to the so-called minimum-image convention, whereby each atom in the unit cell only interacts with the closest copy of each other atom.
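
The minimum-image convention can be sketched in a few lines of NumPy. This is a common textbook formulation, not necessarily the one used inside sgdml: the difference vector is expressed in fractional coordinates, wrapped to the nearest periodic image, and transformed back. Wrapping fractional coordinates this way is only guaranteed to find the closest image for cells that are not strongly skewed. The function name and the cubic test cell are hypothetical.

```python
import numpy as np


def minimum_image_distance(r_i, r_j, lattice):
    """Distance between two atoms under the minimum-image convention.

    `lattice` holds the lattice vectors as columns.
    """
    diff = np.asarray(r_i, dtype=float) - np.asarray(r_j, dtype=float)
    # Express the difference vector in fractional (lattice) coordinates.
    frac = np.linalg.solve(lattice, diff)
    # Wrap each component into [-0.5, 0.5] to select the nearest periodic image.
    frac -= np.round(frac)
    # Transform back to Cartesian coordinates and take the length.
    return np.linalg.norm(lattice @ frac)


lattice = 10.0 * np.eye(3)  # hypothetical cubic cell with edge length 10
d_periodic = minimum_image_distance([0.5, 0.0, 0.0], [9.5, 0.0, 0.0], lattice)
d_plain = np.linalg.norm(np.array([0.5, 0.0, 0.0]) - np.array([9.5, 0.0, 0.0]))
print(d_periodic, d_plain)
```

In this example the two atoms sit near opposite faces of the cell, so the periodic distance (about 1) is much shorter than the plain Euclidean distance (9).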

In order to train a periodic model, a matrix of (column-wise) lattice vectors needs to be included in the dataset file:

dataset['lattice'] = np.array([[H11, H12, H13], [H21, H22, H23], [H31, H32, H33]])

This entry is automatically generated when a dataset in Extended XYZ format is imported (by interpreting the Lattice-string in the original file), but might need to be added manually for alternative sources. We provide a Python script (download) to add this entry manually:

$ python add_pcb_to_file.py <dataset_file>
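
For orientation, adding such an entry by hand could look roughly like the sketch below. It works on a made-up toy file, uses a hypothetical cubic cell, and does not update any checksum stored in the dataset, so the provided script remains the recommended route for real datasets.

```python
import os
import tempfile

import numpy as np

# Stand-in dataset without lattice information (contents are made up).
path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.npz')
np.savez(path, z=np.array([1, 1]), R=np.zeros((3, 2, 3)))

# Read all existing entries into a plain dictionary.
with np.load(path) as archive:
    dataset = {key: archive[key] for key in archive.files}

# Hypothetical cubic cell; lattice vectors are stored as columns.
dataset['lattice'] = 11.6163 * np.eye(3)

# Write the extended dataset back to the same file.
np.savez_compressed(path, **dataset)

with np.load(path) as archive:
    print(archive['lattice'])
```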

Tip

To verify that everything was interpreted correctly, inspect the dataset file (see above) to check the lattice matrix.