Data preparation¶
The sgdml package uses a proprietary format for its datasets, but we include scripts to convert to and from Extended XYZ files and other popular file formats. Custom converters are straightforward to write using one of those scripts as a template.
To convert a dataset from an Extended XYZ file (example), run:
$ sgdml_dataset_from_extxyz.py <ext_xyz_file>
This will create a dataset file (with the extension .npz) in the supported format.
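Since the dataset file is a NumPy .npz archive, its entries can be listed directly with NumPy. The following is a minimal sketch, not part of the sgdml package: the helper name summarize_npz, the demo file name, and the entry names R, F and E are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {entry name: array shape} for every entry in a .npz archive."""
    with np.load(path) as archive:
        return {key: archive[key].shape for key in archive.files}


# Build a tiny stand-in file so the sketch runs on its own.
# The entry names below are hypothetical placeholders.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_dataset.npz")
np.savez_compressed(
    demo_path,
    R=np.zeros((5, 3, 3)),  # 5 geometries, 3 atoms, xyz coordinates
    F=np.zeros((5, 3, 3)),  # matching forces
    E=np.zeros(5),          # one energy per geometry
)

summary = summarize_npz(demo_path)
for key, shape in summary.items():
    print(f"{key}: shape {shape}")
```

Listing the entries this way makes no assumption about which ones a real dataset contains, so it can be pointed at any converted file.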
Vice versa, we can convert back from this proprietary format to Extended XYZ, using:
$ sgdml_dataset_to_extxyz.py <dataset_file>
Note
Any metadata specific to sgdml, e.g. the dataset name, level of theory, unit descriptions, or the dataset checksum, will be dropped when exporting to other file formats.
Importing third-party formats¶
Warning
The output format of external programs may change without notice, in which case the following scripts would need to be adjusted.
Fritz Haber Institute ab initio molecular simulations (FHI-aims)¶
To create a dataset from FHI-aims molecular dynamics output files (example), run:
$ sgdml_dataset_from_aims.py <aims_output_file>
i-PI: a universal force engine¶
To create a dataset from i-PI molecular dynamics trajectories (example), run:
$ sgdml_dataset_from_ipi.py <xyz_geometries> <xyz_forces> <energies> [<energy_col>]
i-PI stores geometries, forces and energies in separate files. The desired column in its energy output file is selected via the optional parameter <energy_col>.
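To illustrate what selecting an energy column amounts to, here is a minimal sketch that reads a single column from a whitespace-delimited table with NumPy. It is not the code used by the import script. The file layout, the '#' comment prefix, the helper name read_energy_column, and the zero-based column index are all assumptions; the script's own indexing convention for <energy_col> may differ.

```python
import io

import numpy as np


def read_energy_column(source, energy_col=0):
    """Read one column from a whitespace-delimited table, skipping '#' lines."""
    return np.loadtxt(source, comments="#", usecols=energy_col, ndmin=1)


# Hypothetical stand-in for an energy output file with three columns.
demo = io.StringIO(
    "# column 1: step, column 2: time, column 3: potential\n"
    "0 0.0 -1.50\n"
    "1 0.5 -1.25\n"
    "2 1.0 -1.75\n"
)

# Pick the third column (index 2 when counting from zero).
energies = read_energy_column(demo, energy_col=2)
print(energies)
```

Inspecting the header of the energy file before importing is a simple way to confirm which column holds the quantity of interest.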
Other input formats (sGDML >=0.4.3, requires ASE)¶
We also include an experimental dataset import script that taps into the extensive support of ASE (an optional dependency) for a wide range of additional input formats. Any format from this list (automatically recognized by ASE) can be imported via:
$ sgdml_dataset_via_ase.py <dataset_file>
Warning
Caution is advised when using this script, as it can yield unexpected results. Always verify the imported dataset using sgdml show <dataset_file> (see below) by checking that the printed properties and statistics make sense.
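The kind of plausibility check meant here can also be done by hand. The sketch below computes the range, mean and variance of energies and forces for a toy dataset; it is an illustration under stated assumptions, not how sgdml show produces its output. The helper name summarize and the toy arrays are hypothetical.

```python
import numpy as np


def summarize(values):
    """Return min, max, span, mean and variance of an array."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    return {
        "min": float(v_min),
        "max": float(v_max),
        "span": float(v_max - v_min),
        "mean": float(values.mean()),
        "var": float(values.var()),
    }


# Hypothetical toy data: 4 geometries of a 2-atom system.
energies = np.array([-1.0, -2.0, -3.0, -2.0])
forces = np.array(
    [
        [[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0]],
        [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]],
        [[0.0, 0.0, 0.25], [0.0, 0.0, -0.25]],
        [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    ]
)

# Basic consistency checks: finite values, one energy per geometry.
assert np.isfinite(energies).all() and np.isfinite(forces).all()
assert forces.shape[0] == energies.shape[0]

energy_stats = summarize(energies)
force_stats = summarize(forces)
print("Energies:", energy_stats)
print("Forces:  ", force_stats)
```

Values that are wildly out of scale, a mismatch between the number of energies and geometries, or non-finite entries usually point to a unit or parsing problem during import.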
Viewing dataset properties¶
The details for any dataset file can be inspected using
$ sgdml show <dataset_file>
which will print something similar to this:
         __________  __  _____
   _____/ ____/ __ \/  |/  / /
  / ___/ / __/ / / / /|_/ / /
 (__  ) /_/ / /_/ / /  / / /___
/____/\____/_____/_/  /_/_____/  0.4.3.dev3  found 12 CPU(s)
SHOW DETAILS
----------------------------------------------------------------------------------------------------
Dataset properties
Name: CsPbBr3, 500K, 10fs (40 atoms)
Theory: VASP
Size: 2,000 data points
Lattice: a b c
11.6163 7.873e-06 -9.876e-06
7.873e-06 11.6163 1.2881e-05
-9.876e-06 1.2881e-05 11.6163
Lengths: a = 11.6163, b = 11.6163, c = 11.6163
Angles [deg]: alpha = 89.9999, beta = 90.0001, gamma = 89.9999
Energies [eV]:
Range: -1.4e+02 |-- 2.02 --| -1.38e+02
Mean: -138.988
Variance: 0.132
Forces [eV/Ang]:
Range: -2.11 |-- 4.17 --| 2.05
Mean: -0.000
Variance: 0.109
Fingerprint: 8c9b459f632a7134848a5c92402a0312
Example geometry (no. 1,161, chosen randomly)
Copy&paste the string below into Jmol (www.jmol.org), Avogadro (www.avogadro.cc), etc. to
visualize a geometry from this dataset. A new example will be drawn on each call.
---- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE ----
40
Lattice="11.616298069 7.873e-06 -9.876e-06 7.873e-06 11.616306553 1.2881e-05 -9.876e-06 1.2881e-05 11.61630755" Energy=-139.23610834 Properties=species:S:1:pos:R:3:forces:R:3
Pb 0.6149 11.5371 5.96188 -0.658385 -0.01036 -0.309138
Pb 0.45736 5.54317 5.63009 -0.895169 0.259049 -0.190419
Pb 6.05761 0.14687 5.91688 0.463358 -0.583672 0.609393
Pb 5.67843 5.74485 6.04255 0.452753 0.105349 -0.25027
Pb 11.392 11.4198 0.19456 -0.054637 -0.201066 -0.229103
Pb 11.1545 5.63932 11.6069 -0.046257 0.866832 0.089131
Pb 5.86645 0.10878 0.14953 -0.224227 0.243382 -0.221047
Pb 5.53374 6.10861 11.5692 0.507028 -1.05318 0.287179
Cs 2.65696 2.76843 1.92402 0.147986 0.083318 -0.277711
Cs 3.2544 8.72126 2.50281 -0.416375 0.132518 0.060757
Cs 8.03648 3.24618 2.63973 -0.024972 -0.095778 0.262713
Cs 8.97209 8.84332 3.64384 -0.092558 -0.13054 -0.485435
Cs 1.84245 2.88172 9.08498 0.039288 -0.270671 0.331387
Cs 2.7464 8.54156 8.63513 -0.282382 0.019342 -0.036995
Cs 8.81882 2.91659 8.51183 0.633232 0.111299 -0.002106
Cs 9.09737 8.85903 7.42172 0.010123 -0.20591 0.702575
Br 2.75493 11.2924 11.2354 -0.310241 0.347108 -0.031611
Br 5.94428 2.9451 11.0683 -0.261107 0.232289 0.21885
Br 4.90483 5.4145 3.05733 0.211533 0.052715 0.238589
Br 2.537 5.6697 11.3341 0.324577 0.22128 -0.317152
Br 5.31398 8.79667 11.3534 0.127817 0.441474 -0.149943
Br 5.18871 0.50588 3.36554 -0.189689 0.045813 -0.759197
Br 8.35926 11.4619 1.35293 0.682207 -0.064515 0.037932
Br 10.938 2.83985 0.08844 -0.182543 -0.508183 0.08176
Br 0.66206 5.80504 2.80397 -0.13863 0.078374 -0.679582
Br 8.54658 6.56376 1.14919 -0.378726 -0.367373 -0.310592
Br 11.2746 8.5901 11.1478 0.135547 0.161482 -0.023156
Br 0.66982 0.11832 2.86471 -0.040507 -0.422352 0.178096
Br 3.26133 0.30487 6.70639 0.09869 -0.026901 0.433768
Br 6.25237 2.8823 6.40061 -0.287116 0.322048 -0.204326
Br 6.89953 6.16544 9.00266 -0.177235 0.052506 -0.400216
Br 2.93778 5.20783 6.88583 0.958546 0.204122 0.153683
Br 5.30321 8.65312 5.55974 0.276979 0.009966 0.17536
Br 7.22531 11.5782 9.2761 -0.151673 -0.41358 -0.321117
Br 9.00526 0.23154 5.40351 0.063848 0.198287 0.025501
Br 0.38697 2.69704 5.25357 -0.080373 0.389529 0.14632
Br 11.1616 5.73738 8.37778 -0.0481 0.153284 0.71168
Br 8.95001 5.59011 5.34824 -0.280161 0.113828 0.079881
Br 0.57208 8.7157 5.15009 0.167686 -0.363692 0.450414
Br 11.3248 0.05482 8.74895 -0.085141 -0.135819 -0.082334
---- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE --- CUT HERE ----
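The text between the CUT HERE markers is a single Extended XYZ frame, so it can also serve as a starting point for a custom converter. The following is a minimal sketch of a parser for frames laid out exactly like the one above (Properties=species:S:1:pos:R:3:forces:R:3). It is not the parser used by the sgdml scripts; the function name parse_extxyz_frame and the two-atom demo frame with made-up values are assumptions for illustration.

```python
import re

import numpy as np


def parse_extxyz_frame(text):
    """Parse one frame with Properties=species:S:1:pos:R:3:forces:R:3."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    header = lines[1]

    # Lattice="..." holds nine numbers. Under the usual Extended XYZ
    # convention, the rows of the reshaped array are the three cell
    # vectors; transpose it if a column-wise layout is needed.
    lattice = None
    lattice_match = re.search(r'Lattice="([^"]*)"', header)
    if lattice_match:
        lattice = np.array(lattice_match.group(1).split(), dtype=float)
        lattice = lattice.reshape(3, 3)

    energy = None
    energy_match = re.search(r"Energy=(\S+)", header, flags=re.IGNORECASE)
    if energy_match:
        energy = float(energy_match.group(1))

    species, positions, forces = [], [], []
    for line in lines[2 : 2 + n_atoms]:
        fields = line.split()
        species.append(fields[0])
        positions.append([float(x) for x in fields[1:4]])
        forces.append([float(x) for x in fields[4:7]])

    return species, np.array(positions), np.array(forces), energy, lattice


# Shortened two-atom frame in the same layout (values are made up).
frame = """2
Lattice="10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0" Energy=-1.5 Properties=species:S:1:pos:R:3:forces:R:3
Pb 0.5 1.0 1.5 -0.25 0.0 0.25
Br 2.0 2.5 3.0 0.25 0.0 -0.25
"""

species, positions, forces, energy, lattice = parse_extxyz_frame(frame)
print(species, positions.shape, forces.shape, energy)
```

A fixed column layout keeps the sketch short; files with a different Properties string would need the column positions to be derived from that string instead.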
Periodic boundary conditions (sGDML >=0.4.0)¶
The sgdml package supports periodic boundary conditions, allowing a description of macro-scale systems like bulk gases, liquids or crystal structures in addition to molecular structures. Internally, this is achieved via a modification of the Euclidean distance metric in the descriptor of the system, such that it adheres to the so-called minimum-image convention, whereby each atom in the unit cell only interacts with the closest copy of each other atom.
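The minimum-image convention can be illustrated with a short sketch. It is a generic textbook formulation, not the package's internal implementation: the function name minimum_image_displacement is hypothetical, the lattice vectors are assumed to be stored as columns, and rounding fractional coordinates is only guaranteed to find the nearest image for orthorhombic or mildly skewed cells.

```python
import numpy as np


def minimum_image_displacement(r_i, r_j, lattice):
    """Displacement from atom j to atom i under the minimum-image convention.

    `lattice` is a 3x3 matrix holding the lattice vectors as columns.
    """
    diff = np.asarray(r_i, dtype=float) - np.asarray(r_j, dtype=float)
    frac = np.linalg.solve(lattice, diff)  # Cartesian -> fractional
    frac -= np.round(frac)                 # shift to the nearest periodic image
    return lattice @ frac                  # fractional -> Cartesian


# Cubic cell with edge length 10; the two atoms sit near opposite faces.
lattice = np.diag([10.0, 10.0, 10.0])
r_i = np.array([9.5, 0.0, 0.0])
r_j = np.array([0.5, 0.0, 0.0])

d = minimum_image_displacement(r_i, r_j, lattice)
distance = np.linalg.norm(d)
print(distance)  # close to 1.0, not the naive 9.0
```

The two atoms are 9 units apart inside the cell, but the closest periodic copy is only about 1 unit away across the cell boundary, which is the distance the convention uses.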
In order to train a periodic model, a matrix of (column-wise) lattice vectors needs to be included in the dataset file:
dataset['lattice'] = np.array([[H11, H12, H13], [H21, H22, H23], [H31, H32, H33]])
This entry is automatically generated when a dataset in Extended XYZ format is imported (by interpreting the Lattice string in the original file), but might need to be added manually for alternative sources. We provide a Python script (download) to add this entry manually:
$ python add_pcb_to_file.py <dataset_file>
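For orientation, adding such an entry to a .npz archive with plain NumPy might look like the sketch below. It is not the provided script and makes several assumptions: the helper name add_lattice, the file names, and the stand-in entry E are hypothetical, and dataset metadata such as the checksum is not updated here, so the provided script remains the safer route for real datasets.

```python
import os
import tempfile

import numpy as np


def add_lattice(src_path, dst_path, lattice):
    """Copy a .npz archive and add a 3x3 'lattice' entry (vectors as columns)."""
    lattice = np.asarray(lattice, dtype=float)
    if lattice.shape != (3, 3):
        raise ValueError("lattice must be a 3x3 matrix")
    with np.load(src_path) as archive:
        contents = {key: archive[key] for key in archive.files}
    contents["lattice"] = lattice
    np.savez_compressed(dst_path, **contents)


# Stand-in input file with a single hypothetical entry.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "no_pbc.npz")
dst = os.path.join(tmp_dir, "with_pbc.npz")
np.savez_compressed(src, E=np.zeros(3))

# Cubic cell with edge length 10 (made-up value).
add_lattice(src, dst, np.diag([10.0, 10.0, 10.0]))

with np.load(dst) as result:
    entries = sorted(result.files)
    stored_lattice = result["lattice"]
print(entries)
```

Writing to a new file rather than overwriting the input keeps the original dataset intact in case the lattice turns out to be wrong.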
Tip
To verify that everything was interpreted correctly, inspect the dataset file (see above) to check the lattice matrix.