Make Dataset
1. Core Concepts & Workflow
1.1 Data Flow Model
Linear Processing Chain: Cards execute sequentially, with each card’s output automatically becoming the next card’s input
In-Group Parallel Flow: All cards within a group share the same input, with outputs automatically merged
Filter Mechanism: Filters can be added at group ends to screen outputs from all group cards
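The chain and group rules above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not NepTrainKit's implementation: the names run_card, run_group and run_chain are hypothetical, and structures are represented by strings rather than real structure objects.

```python
from typing import Callable, List

Structure = str  # stand-in for a real structure object
Card = Callable[[Structure], List[Structure]]


def run_card(card: Card, inputs: List[Structure]) -> List[Structure]:
    """Apply one card to every input structure and collect all outputs."""
    outputs: List[Structure] = []
    for structure in inputs:
        outputs.extend(card(structure))
    return outputs


def run_group(cards: List[Card], inputs: List[Structure]) -> List[Structure]:
    """In-group parallel flow: each card gets the same inputs; outputs are merged."""
    merged: List[Structure] = []
    for card in cards:
        merged.extend(run_card(card, inputs))
    return merged


def run_chain(stages, inputs: List[Structure]) -> List[Structure]:
    """Linear chain: the output of each stage becomes the input of the next."""
    current = list(inputs)
    for stage in stages:
        current = stage(current)
    return current


# Hypothetical cards that only tag the structure name.
def supercell(s: Structure) -> List[Structure]:
    return [s + "+supercell"]


def strain(s: Structure) -> List[Structure]:
    return [s + "+strain"]


def perturb(s: Structure) -> List[Structure]:
    return [s + "+perturb1", s + "+perturb2"]


result = run_chain(
    [
        lambda xs: run_card(supercell, xs),           # a single card
        lambda xs: run_group([strain, perturb], xs),  # a card group
    ],
    ["bulk"],
)
print(result)
```

Both group members receive the supercell output, and their results are concatenated before being handed to whatever follows the group.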
1.2 Basic Operations
Import Structures:
Supported formats: VASP/POSCAR, CIF, XYZ
Methods: Click “Open” button or drag files directly into window
Build Processing Pipeline:
Add processing cards via “Add new card”
Reorder cards via drag-and-drop
Use Card Groups for complex workflows
Save/Load Configuration:
Export: Save current card setup as JSON
Import: Load existing configuration files
Sample configuration file:
{
  "software_version": "2.0.6.dev35",
  "cards": [
    {
      "class": "SuperCellCard",
      "check_state": true,
      "super_cell_type": 0,
      "super_scale_radio_button": false,
      "super_scale_condition": [1,1,1],
      "super_cell_radio_button": true,
      "super_cell_condition": [20,20,20],
      "max_atoms_radio_button": false,
      "max_atoms_condition": [1]
    },
    {
      "class": "CardGroup",
      "check_state": true,
      "card_list": [
        {
          "class": "CellStrainCard",
          "check_state": true,
          "engine_type": "triaxial",
          "x_range": [-5,5,1],
          "y_range": [-5,5,1],
          "z_range": [-5,5,1]
        },
        {
          "class": "PerturbCard",
          "check_state": true,
          "engine_type": 0,
          "organic": true,
          "scaling_condition": [0.3],
          "num_condition": [50]
        },
        {
          "class": "CellScalingCard",
          "check_state": true,
          "engine_type": 0,
          "perturb_angle": true,
          "scaling_condition": [0.04],
          "num_condition": [50]
        }
      ],
      "filter_card": {
        "class": "FPSFilterDataCard",
        "check_state": true,
        "nep_path": "D:\\PycharmProjects\\NepTrainKit\\src\\NepTrainKit\\Config\\nep89.txt",
        "num_condition": [100],
        "min_distance_condition": [0.001]
      }
    }
  ]
}
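Because the configuration is plain JSON, it can be inspected outside the GUI. The sketch below parses a shortened copy of the sample and lists the card classes in order, descending into groups. The helper name list_card_classes is hypothetical, and only the keys visible in the sample ("cards", "class", "card_list", "filter_card") are assumed.

```python
import json

# Shortened copy of the sample configuration shown above.
config_text = """
{
  "software_version": "2.0.6.dev35",
  "cards": [
    {"class": "SuperCellCard", "check_state": true},
    {"class": "CardGroup", "check_state": true,
     "card_list": [{"class": "PerturbCard", "check_state": true}],
     "filter_card": {"class": "FPSFilterDataCard", "check_state": true}}
  ]
}
"""


def list_card_classes(cards, depth=0):
    """Return (depth, class name) pairs, descending into card groups."""
    rows = []
    for card in cards:
        rows.append((depth, card["class"]))
        rows.extend(list_card_classes(card.get("card_list", []), depth + 1))
        filter_card = card.get("filter_card")
        if filter_card:
            rows.append((depth + 1, filter_card["class"] + " (filter)"))
    return rows


config = json.loads(config_text)
outline = list_card_classes(config["cards"])
for depth, name in outline:
    print("  " * depth + name)
```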
2. Production Cards Explained
2.1 Super Cell Generation
Function: Creates supercells through expansion
Parameters:
| Parameter Group | Option | Description | Typical Values |
|---|---|---|---|
| Mode | Maximum | Generates largest possible supercell | - |
| | Iteration | Generates all possible combinations | - |
| Expansion Method | Super Scale | Fixed expansion multiplier | (2,2,2) |
| | Super Cell | Calculates by max lattice constant | (10Å,10Å,10Å) |
| | Max Atoms | Limits by maximum atom count | 200 |
Structure Tagging:
structure.info["Config_type"] += "supercell(nx,ny,nz)" # e.g., supercell(2,2,1)
Algorithm
Super Scale: directly uses [na, nb, nc] as diagonal expansion factors and builds make_supercell(structure, diag([na,nb,nc])).
Super Cell (max lattice): computes [na, nb, nc] = floor([amax/a, bmax/b, cmax/c]) with safety checks, then clamps to at least 1 in each direction.
Max Atoms: enumerates integer triplets [na, nb, nc] whose product times N_atoms does not exceed the limit; sorted by total atoms, return either all (Iteration) or the largest (Maximum).
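The Super Cell and Max Atoms rules above reduce to small integer computations. The following is a minimal sketch under stated assumptions: the function names, the eps threshold used for the near-zero check, and the brute-force enumeration are illustrative choices, not the card's actual code.

```python
import math
from itertools import product


def factors_from_max_lattice(lengths, max_lengths, eps=1e-8):
    """floor(max / length) per axis, clamped to >= 1; near-zero lengths fall back to 1."""
    factors = []
    for length, max_length in zip(lengths, max_lengths):
        if length < eps:
            factors.append(1)
        else:
            factors.append(max(1, math.floor(max_length / length)))
    return tuple(factors)


def factors_from_max_atoms(n_atoms, max_atoms):
    """All (na, nb, nc) with na*nb*nc*n_atoms <= max_atoms, sorted by total atom count."""
    if n_atoms <= 0:
        return []
    limit = max_atoms // n_atoms  # no single factor can exceed this
    triplets = [
        (na, nb, nc)
        for na, nb, nc in product(range(1, limit + 1), repeat=3)
        if na * nb * nc * n_atoms <= max_atoms
    ]
    triplets.sort(key=lambda t: t[0] * t[1] * t[2])
    return triplets


print(factors_from_max_lattice((4.0, 4.0, 25.0), (20.0, 20.0, 20.0)))
candidates = factors_from_max_atoms(n_atoms=2, max_atoms=8)
print(len(candidates), candidates[-1])
```

In this model, Iteration corresponds to using every entry of candidates and Maximum to taking an entry with the largest atom count (the end of the sorted list).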
Caveats
If any lattice vector length is near zero, max‑cell estimation falls back to 1 to avoid divide‑by‑zero.
Very large supercells can create memory pressure; prefer Max Atoms to cap size.
Iteration may generate many structures—use a follow‑up filter (e.g., FPS) to down‑select.
Best Practices
Use Max Atoms to scale safely across diverse inputs.
Prefer Iteration when exploring and Maximum when preparing a single supercell.
Combine with Lattice Strain or Perturb after expansion to enrich diversity.
2.2 Vacancy Defect Generation
Function: Creates vacancy-defect structures by deleting random atoms
Key Parameters:
Random engine: Sobol sequence or Uniform distribution
Vacancy specification:
Vacancy num: fixed number of vacancies
Vacancy concentration: fraction of atoms to remove
Max structures: number of structures to generate
Structure Tagging:
structure.info["Config_type"] += f" Vacancy(num={defect_count})"
Algorithm
If using concentration c, compute max_defects = floor(c * N_atoms); else use fixed number.
Engine Sobol: sample both defect count and positions deterministically in [0,1), then map to indices.
Engine Uniform: draw a random integer count and random, unique atom indices to remove.
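As a rough illustration of the Uniform engine described above, the sketch below removes a random number of unique atoms from a list of element symbols. It rests on assumptions not stated in the text: the count is drawn from 1 to max_defects, the clamp keeps at least one atom, and the function names are hypothetical.

```python
import math
import random


def max_defects_from_concentration(c, n_atoms):
    """floor(c * N_atoms), clamped so that at least one atom always remains."""
    return min(math.floor(c * n_atoms), n_atoms - 1)


def uniform_vacancy(symbols, max_defects, rng):
    """Remove a random count (1..max_defects, assumed) of unique atoms."""
    if max_defects < 1:
        return list(symbols), 0
    count = rng.randint(1, max_defects)
    removed = set(rng.sample(range(len(symbols)), count))
    kept = [s for i, s in enumerate(symbols) if i not in removed]
    return kept, count


rng = random.Random(0)
symbols = ["Si"] * 10
limit = max_defects_from_concentration(0.2, len(symbols))
kept, count = uniform_vacancy(symbols, limit, rng)
print(limit, count, len(kept))
```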
Caveats
A concentration of 1.0 is clamped to avoid removing all atoms.
Removing many atoms can break the intended stoichiometry or introduce periodic-boundary (PBC) artifacts; validate downstream.
Best Practices
Keep c small (e.g., ≤ 0.2) for incremental datasets.
Prefer Sobol for repeatable sweeps; use Uniform for stochastic augmentation.
2.3 Atomic Perturbation
Function: Adds random displacements to atomic positions
Key Parameters:
Random engine: Sobol or Uniform
Max distance: maximum displacement amplitude in Å
Identify organic: treat organic molecules as rigid units
Max structures: number of structures to generate
Structure Tagging:
structure.info["Config_type"] += f" Perturb(distance={max_displacement}, {engine_type})"
Algorithm
Builds a 3N‑dimensional random vector from Sobol or Uniform in [‑1,1], reshaped to N×3, then scaled by max_distance and added to atomic positions.
If “Identify organic” is enabled: organic clusters are translated as rigid units; inorganic clusters perturb per‑atom.
Wraps atoms back into the cell.
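The per-atom branch of this procedure can be sketched with the standard library alone. The example assumes the Uniform engine and an orthorhombic cell so that wrapping is a simple modulo; both function names are hypothetical, and rigid handling of organic clusters is not modelled.

```python
import random


def perturb_positions(positions, max_distance, rng):
    """Add a uniform random offset in [-max_distance, max_distance] to every coordinate."""
    return [
        tuple(x + max_distance * rng.uniform(-1.0, 1.0) for x in position)
        for position in positions
    ]


def wrap_orthorhombic(positions, cell_lengths):
    """Wrap Cartesian positions back into an orthorhombic cell (assumed for simplicity)."""
    return [
        tuple(x % length for x, length in zip(position, cell_lengths))
        for position in positions
    ]


rng = random.Random(42)
positions = [(0.0, 0.0, 0.0), (2.5, 2.5, 2.5)]
moved = perturb_positions(positions, max_distance=0.2, rng=rng)
wrapped = wrap_orthorhombic(moved, (5.0, 5.0, 5.0))
print(wrapped)
```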
Caveats
Large max_distance may cause unrealistic overlaps even with wrapping.
Cluster detection is heuristic; verify for complex organics.
Best Practices
Start small (0.1–0.3 Å) and inspect min interatomic distances.
Combine with FPS Filter to keep diverse displacements while avoiding redundancy.
2.4 Lattice Scaling
Function: Randomly scales lattice vectors
Key Parameters:
Random engine: Sobol or Uniform
Max scaling: 0–1, applied symmetrically as 1±value
Perturb angle: whether lattice angles are also perturbed
Identify organic: treat organic molecules as rigid units
Max structures: number of structures to generate
Structure Tagging:
structure.info["Config_type"] += f" Scaling({scaling_factor})"
Algorithm
Random engine produces a sequence of factors; lattice lengths are scaled by 1 ± s (uniform or Sobol).
If “Perturb angle” is enabled, angles are also perturbed and a new lattice is reconstructed from (lengths, angles).
Optionally treats organics as rigid clusters during scaling.
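A minimal sketch of the length-scaling step, assuming the Uniform engine and an independent factor per lattice length (whether the card draws one factor per axis or one per structure is not stated here). The function names are hypothetical, and angle perturbation is left out.

```python
import random


def random_scale_factors(max_scaling, n_structures, rng):
    """One (fa, fb, fc) triple per structure, each factor in [1 - s, 1 + s]."""
    return [
        tuple(1.0 + rng.uniform(-max_scaling, max_scaling) for _ in range(3))
        for _ in range(n_structures)
    ]


def scale_lengths(lengths, factors):
    """Scale the lattice lengths (a, b, c) by the given factors."""
    return tuple(length * f for length, f in zip(lengths, factors))


rng = random.Random(1)
factor_sets = random_scale_factors(max_scaling=0.04, n_structures=3, rng=rng)
scaled = [scale_lengths((4.0, 4.0, 6.0), f) for f in factor_sets]
print(scaled)
```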
Caveats
Angle perturbation rebuilds the lattice; extreme angles can generate ill‑conditioned cells.
Ensure Max scaling is small (≤ 0.05) when also perturbing angles.
Best Practices
Use Uniform for broad coverage; Sobol for low‑discrepancy sweeps.
Keep scaling conservative for stability, especially before DFT relaxations.
2.5 Lattice Strain
Strain Modes:
Uniaxial
Biaxial
Triaxial
Isotropic (uniform scaling; only X range is used)
Custom Axis Combinations: Supports any XYZ combinations (e.g., “XY”, “XZ”, “YZX”)
# Example: Apply strain only to X and Z axes
strain_axes = "XZ"  # Equivalent to "ZX"
Key Parameters:
Axes: built‑in modes or custom strings like X, XY
X/Y/Z range: strain percentage ranges. In isotropic mode only X values are used
Identify organic: treat organic molecules as rigid units
Structure Tagging:
structure.info["Config_type"] += f" Strain({axis1}:{value1}%, {axis2}:{value2}%)"
Algorithm
Builds mesh over selected axes: np.arange(start, end+step, step) per axis; isotropic uses a single scalar applied to all.
Scales selected lattice vectors by (1 + strain/100); optionally applies rigid handling of organic clusters.
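To get a feel for how quickly the strain mesh grows, the grid can be reproduced without NumPy. In this sketch inclusive_range mirrors the length of np.arange(start, end+step, step) for a positive step; strain_grid and scale_factor are hypothetical helper names.

```python
import math
from itertools import product


def inclusive_range(start, end, step):
    """Same values as np.arange(start, end + step, step) for step > 0.

    Like that expression, it can overshoot `end` when the span is not a
    multiple of `step`.
    """
    count = max(0, math.ceil((end + step - start) / step - 1e-9))
    return [start + i * step for i in range(count)]


def strain_grid(axes, ranges):
    """Cartesian product of strain percentages over the selected axes."""
    per_axis = [inclusive_range(*ranges[axis]) for axis in axes]
    return [dict(zip(axes, combo)) for combo in product(*per_axis)]


def scale_factor(strain_percent):
    """Multiplier applied to a lattice vector for a given strain percentage."""
    return 1.0 + strain_percent / 100.0


ranges = {"X": (-5, 5, 1), "Y": (-5, 5, 1), "Z": (-5, 5, 1)}
print(len(strain_grid("XYZ", ranges)))  # triaxial mesh size
print(len(strain_grid("XZ", ranges)))   # custom two-axis mesh size
```

With the sample ranges of [-5, 5, 1], a triaxial scan already yields 11 × 11 × 11 = 1331 structures per input, which is why a downstream filter is usually worthwhile.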
Caveats
Large strains (>10–15%) can produce severe distortions; prefer smaller increments.
Custom axis strings must only include X/Y/Z.
Best Practices
Use biaxial/triaxial for comprehensive scans; isotropic for quick sweeps.
Pair with FPS or error‑based selection to trim the grid.
2.6 Random Doping Substitution
Function: Randomly substitute atoms according to user-defined rules
Key Parameters:
Doping rules: each rule contains a target element, dopant elements and their ratios, a concentration or count range, and optional groups
Doping algorithm: Random (sample dopants by probability) or Exact (follow ratios exactly)
Max structures: number of structures to generate
2.7 Organic Molecular Rotation (TorsionGuard, PBC-aware)
Function: Generate organic molecular conformations by rotating torsional subtrees around rotatable bonds, with physical guards to keep bonded distances reasonable and avoid non‑bonded clashes. Supports PBC or non‑PBC frames and adds optional Gaussian noise.
Key Parameters:
Confs per structure (perturb_per_frame): number of conformations generated per input structure
Torsion range (torsion_range_deg): angle range in degrees, e.g., [-180, 180]
Max torsions/conf (max_torsions_per_conf): maximum different torsion bonds to rotate per conformation
Gaussian sigma (gaussian_sigma): standard deviation (Å) of added positional noise
PBC mode (pbc_mode): auto (use cell if present), yes (force PBC if lattice exists), no (non‑PBC)
Local‑mode cutoff atoms (local_mode_cutoff_atoms): if atom count exceeds this, rotate only a local subtree for efficiency
Local torsion max subtree (local_torsion_max_subtree): max atoms in the rotated local subtree
Bond detect factor (bond_detect_factor): scale on (r_i + r_j) to detect bonded pairs when building the graph
Bond keep min/max factor (bond_keep_min_factor, bond_keep_max_factor): allowed bonded distance window as factors of (r_i + r_j); if max is None, uses detect factor
Non‑bond min factor (nonbond_min_factor): minimum separation factor for non‑bonded pairs to avoid clashes
Max retries/conf (max_retries_per_frame): retries with smaller angle/noise scales if guards fail
Multi‑bond factor (mult_bond_factor): skip “short” edges (likely multiple bonds) from torsion set
Non‑PBC box size (nonpbc_box_size): cubic box length used to center non‑PBC outputs for visualization
Algorithm:
Build adjacency from covalent radii using bond_detect_factor; find rotatable torsions via graph analysis (prefer bridges; fall back to internal bonds)
For each conformation:
Randomly pick up to max_torsions_per_conf torsion axes
For each, collect a local subtree to rotate; in PBC use MIC to define the bond axis
Apply a random rotation within torsion_range_deg, then add Gaussian noise (gaussian_sigma)
Enforce guards: keep bonded pairs within [min,max] factors; ensure non‑bonded distances above nonbond_min_factor
If a guard fails, retry up to max_retries_per_frame times with halved ranges/noise each attempt
Wrap to cell for PBC; otherwise center to a cubic box (nonpbc_box_size)
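The guard and retry logic in the steps above can be illustrated independently of any geometry code. The sketch below is not the TorsionGuard implementation: the function names are hypothetical, distances are passed in as precomputed (distance, r_i + r_j) pairs, and the accept callback stands in for "rotate, add noise, then check the guards".

```python
import random


def within_bond_window(distance, radii_sum, min_factor, max_factor):
    """Bonded guard: distance must stay inside [min, max] * (r_i + r_j)."""
    return min_factor * radii_sum <= distance <= max_factor * radii_sum


def guards_pass(bonded, nonbonded, min_factor, max_factor, nonbond_min_factor):
    """bonded and nonbonded are lists of (distance, r_i + r_j) pairs."""
    bonds_ok = all(within_bond_window(d, r, min_factor, max_factor) for d, r in bonded)
    no_clash = all(d >= nonbond_min_factor * r for d, r in nonbonded)
    return bonds_ok and no_clash


def sample_with_retries(torsion_range_deg, gaussian_sigma, max_retries, accept, rng):
    """Draw an angle; on rejection halve the angle range and the noise, then retry."""
    low, high = torsion_range_deg
    sigma = gaussian_sigma
    for attempt in range(max_retries + 1):
        angle = rng.uniform(low, high)
        if accept(angle, sigma):
            return angle, sigma, attempt
        low, high, sigma = low / 2.0, high / 2.0, sigma / 2.0
    return None  # every attempt was rejected


rng = random.Random(0)
# Toy acceptance rule: pretend rotations beyond 45 degrees cause a clash.
outcome = sample_with_retries([-180, 180], 0.04, 3, lambda a, s: abs(a) <= 45, rng)
print(outcome)
print(guards_pass([(1.5, 1.52)], [(1.0, 1.52)], 0.6, 1.15, 0.8))
```

Halving the range on each retry means a conformation that is rejected at full amplitude still has a chance to be accepted as a smaller, safer rotation.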
Tagging:
structure.info["Config_type"] += f" TorsionGuard(n={perturb_per_frame}, sigma={gaussian_sigma}, pbc={pbc_mode})"
Caveats:
Ensure a valid lattice for PBC frames; otherwise prefer auto or no
Large angle ranges with high max_torsions_per_conf can raise rejection rate; consider reducing ranges or enabling local mode
bond_keep_max_factor=None uses bond_detect_factor as the upper bound
Very large molecules: increase local_mode_cutoff_atoms and adjust local_torsion_max_subtree
Best Practices:
Start with torsion_range_deg = [-60, 60] and gaussian_sigma = 0.02–0.05 Å
Enable local mode on big organics for faster, safer rotations
Combine with FPS/filters downstream to curate diverse yet non‑redundant conformers
Structure Tagging (Random Doping Substitution):
structure.info["Config_type"] += f" Doping(num={dopant_count})"
When using grouping (group), you must use files in XYZ format and specify the group column. For example:
5
Lattice="6.383697472927415 0.0 0.0 0.0 6.383697472927415 1.4e-15 0.0 8e-16 6.383697472927415" Properties=species:S:1:pos:R:3:group:S:1 pbc="T T T"
Cs 0.00000000 0.00000000 0.00000000 a
I 3.19184873 3.19184873 -0.00000000 b
I 3.19184873 -0.00000000 3.19184873 c
I -0.00000000 3.19184873 3.19184873 c
Pb 3.19184873 3.19184873 3.19184873 d
Rule Schema
Rules is a list; each item is a JSON object with keys:
target (string): Element symbol to be replaced (e.g., “Si”).
dopants (object): Mapping of element symbol → ratio. Ratios are normalized internally (e.g., “Ge:0.7,C:0.3”).
use (string): One of "concentration" or "count".
concentration ([min, max], floats in 0–1): Fraction of eligible sites to replace.
count ([min, max], integers): Number of sites to replace.
group (array, optional): Limit replacement to atoms whose group value is in this list (e.g., “a,c”).
If use is omitted or unrecognized, all eligible sites are replaced.
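For concreteness, here is what a rule following this schema might look like, together with a sketch of how the number of replaced sites could be resolved. The rule values are made up for illustration, the helper names are hypothetical, and truncating the sampled fraction with int() is an assumption about rounding.

```python
import random

# Illustrative rule list; element names and numbers are examples only.
rules = [
    {
        "target": "Si",
        "dopants": {"Ge": 0.7, "C": 0.3},
        "use": "concentration",
        "concentration": [0.02, 0.10],
        "group": ["a", "c"],
    }
]


def normalize(dopants):
    """Scale the dopant ratios so that they sum to 1."""
    total = sum(dopants.values())
    return {element: ratio / total for element, ratio in dopants.items()}


def sites_to_replace(rule, n_eligible, rng):
    """Resolve how many eligible sites one rule replaces, clamped to what exists."""
    use = rule.get("use")
    if use == "concentration":
        low, high = rule["concentration"]
        n = int(rng.uniform(low, high) * n_eligible)  # rounding mode is an assumption
    elif use == "count":
        low, high = rule["count"]
        n = rng.randint(low, high)
    else:
        n = n_eligible  # omitted or unrecognized: replace all eligible sites
    return max(0, min(n, n_eligible))


rng = random.Random(3)
n_sites = sites_to_replace(rules[0], n_eligible=100, rng=rng)
print(n_sites, normalize({"Ge": 7, "C": 3}))
```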
How Ratios and Exact Work
Random: for each selected site, samples a dopant species according to normalized dopants ratios.
Exact: computes integer counts as floor(ratio * N) for each species, assigns leftovers to the largest‑ratio species, shuffles order for randomness.
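The Exact rule is easy to check by hand with a small sketch. The function name exact_assignment is hypothetical, and ties for the largest ratio are broken by dictionary order here, which may differ from the card's behaviour.

```python
import math
import random
from collections import Counter


def exact_assignment(dopants, n_sites, rng):
    """floor(ratio * N) per species, leftovers to the largest-ratio species, then shuffle."""
    total = sum(dopants.values())
    counts = {
        element: math.floor(n_sites * ratio / total)
        for element, ratio in dopants.items()
    }
    leftover = n_sites - sum(counts.values())
    largest = max(dopants, key=dopants.get)
    counts[largest] += leftover
    species = [element for element, n in counts.items() for _ in range(n)]
    rng.shuffle(species)  # randomize which site receives which dopant
    return species


rng = random.Random(7)
assigned = exact_assignment({"Ge": 0.7, "C": 0.3}, n_sites=5, rng=rng)
print(Counter(assigned))
```

For five sites and a 0.7/0.3 split, the floors are 3 and 1, and the one leftover site goes to Ge, giving four Ge and one C.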
Edge Cases and Validation
If fewer eligible target sites exist than requested, the algorithm clamps to the available count.
group matching is exact (case‑sensitive) against the structure’s group array values.
Combining multiple rules applies them sequentially per structure; later rules see results of earlier substitutions.
Tips
Use Exact for reproducible stoichiometry, Random for data augmentation.
Keep concentration windows narrow for incremental variations (e.g., 2–10%).
Prefer group when only specific sublattices or layers should be doped.
Algorithm
For each rule, find candidate sites (optionally limited by group).
If use == concentration, sample a fraction of candidates within the given window; if use == count, sample an integer from the count range.
Choose dopant species by probability (Random) or exact counts (Exact) and substitute symbols.
Caveats
If candidates are fewer than requested, the algorithm clamps to available sites.
Doping can break charge balance; validate for your physics/chemistry constraints.
Best Practices
Prefer Exact to reproduce specific stoichiometries; use Random for augmentation.
Use group to limit substitutions to sublattices or layers.
2.7 Random Vacancy Deletion
Function: Removes atoms according to vacancy rules
Key Parameters:
Vacancy rules: each rule specifies an element, a deletion count range, and optional groups to restrict affected sites
Max structures: number of structures to generate
Structure Tagging:
structure.info["Config_type"] += f" Vacancy(num={removed_count})"
Rule Schema
Rules is a list; each item is a JSON object with keys:
element (string): Element symbol to delete.
count ([min, max], integers): Number of atoms to remove per rule application.
group (array, optional): Restrict deletions to atoms with group in the list.
If element is missing or count.max <= 0, the rule is skipped.
Edge Cases and Validation
Requested deletions are clamped to available eligible atoms.
Multiple rules are applied sequentially; later rules operate on the already‑modified structure.
Tips
Combine with Super Cell to maintain sufficient system size after deletions.
For surface models, use group to target only top layers or named regions.
Algorithm
For each rule, determines deletions by element and optional group, draws a random integer count in range, and deletes unique indices.
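A single rule application can be sketched on plain lists of symbols and group labels, using the five-atom example shown earlier in the doping section. The function name apply_vacancy_rule is hypothetical, and the sketch returns the indices of the atoms that survive rather than a modified structure.

```python
import random


def apply_vacancy_rule(symbols, groups, rule, rng):
    """Return the indices of atoms kept after applying one vacancy rule."""
    all_indices = list(range(len(symbols)))
    element = rule.get("element")
    low, high = rule.get("count", (0, 0))
    if not element or high <= 0:
        return all_indices  # rule is skipped
    allowed = rule.get("group")
    eligible = [
        i for i in all_indices
        if symbols[i] == element and (not allowed or groups[i] in allowed)
    ]
    # Draw a count in range, then clamp to the number of eligible atoms.
    n_remove = min(rng.randint(max(low, 0), high), len(eligible))
    removed = set(rng.sample(eligible, n_remove))
    return [i for i in all_indices if i not in removed]


symbols = ["Cs", "I", "I", "I", "Pb"]
groups = ["a", "b", "c", "c", "d"]
rng = random.Random(5)
kept = apply_vacancy_rule(symbols, groups, {"element": "I", "count": [1, 2], "group": ["c"]}, rng)
print(kept)
```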
Caveats
Excessive deletion can collapse periodic images; keep counts modest.
Best Practices
Use alongside Super Cell to maintain adequate atoms after deletion.
2.8 Random Slab Generation
Function: Builds slabs with random Miller indices and vacuum thickness
Key Parameters:
h/k/l range: Miller index ranges
Layer range: minimum, maximum and step
Vacuum range: vacuum thickness in Å
Structure Tagging:
structure.info["Config_type"] += f" Slab(hkl={h}{k}{l},layers={layers},vacuum={vac})"
Algorithm
Enumerates Miller indices (h,k,l), layer counts, and vacuum values; for each combination builds an ASE slab with surface(...), wraps, and annotates.
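Before running the card it helps to know how many slabs a given set of ranges will produce. The sketch below only enumerates the parameter combinations and skips (0,0,0); it does not build any slab. The function names and the treatment of ranges as inclusive integers are assumptions.

```python
from itertools import product


def inclusive(start, end, step=1):
    """Integer values from start to end, inclusive."""
    return list(range(start, end + 1, step))


def slab_combinations(h_range, k_range, l_range, layer_range, vacuum_values):
    """All (hkl, layers, vacuum) combinations, skipping the invalid (0, 0, 0) plane."""
    combos = []
    for hkl in product(inclusive(*h_range), inclusive(*k_range), inclusive(*l_range)):
        if hkl == (0, 0, 0):
            continue
        for layers, vacuum in product(inclusive(*layer_range), vacuum_values):
            # Each combo would be passed to a slab builder such as ASE's surface().
            combos.append((hkl, layers, vacuum))
    return combos


combos = slab_combinations((0, 1), (0, 1), (0, 1), (3, 5, 1), [10.0, 15.0])
print(len(combos), combos[0])
```

Even these tiny ranges give 7 planes × 3 layer counts × 2 vacuum values = 42 slabs per input structure.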
Caveats
(0,0,0) is skipped; degenerate or redundant orientations can arise for symmetric lattices.
Large enumerations can explode combinatorially; trim ranges sensibly.
Best Practices
Start with a few low‑index planes and small layer/vacuum ranges.
Post‑filter by thickness, surface area, or descriptor distance.
2.9 Shear Angle Strain
Adjusts cell angles (alpha, beta, gamma) over specified ranges.
Parameters: Alpha/Beta/Gamma ranges (degrees), Identify organic.
Algorithm
Convert cell to [a,b,c,alpha,beta,gamma], add angle deltas, rebuild the lattice, and rescale atoms.
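The rebuild step can be written out with the common crystallographic convention in which a lies along x and b lies in the xy plane. This is a sketch under that assumption; the function names are hypothetical, and "rescale atoms" is interpreted as keeping fractional coordinates fixed, which the text does not spell out.

```python
import math


def cellpar_to_vectors(a, b, c, alpha, beta, gamma):
    """Lattice vectors from lengths and angles (degrees): a along x, b in the xy plane."""
    cos_a, cos_b, cos_g = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sin_g = math.sin(math.radians(gamma))
    cy = (cos_a - cos_b * cos_g) / sin_g
    cz_squared = 1.0 - cos_b * cos_b - cy * cy
    if cz_squared <= 0.0:
        raise ValueError("angles do not describe a valid cell")
    return [
        (a, 0.0, 0.0),
        (b * cos_g, b * sin_g, 0.0),
        (c * cos_b, c * cy, c * math.sqrt(cz_squared)),
    ]


def shear_angles(cellpar, d_alpha=0.0, d_beta=0.0, d_gamma=0.0):
    """Add angle deltas to [a, b, c, alpha, beta, gamma] and rebuild the lattice."""
    a, b, c, alpha, beta, gamma = cellpar
    return cellpar_to_vectors(a, b, c, alpha + d_alpha, beta + d_beta, gamma + d_gamma)


def frac_to_cart(frac, cell):
    """Cartesian position of a fractional coordinate in the given cell."""
    return tuple(sum(frac[i] * cell[i][j] for i in range(3)) for j in range(3))


cubic = cellpar_to_vectors(5.0, 5.0, 5.0, 90.0, 90.0, 90.0)
sheared = shear_angles((5.0, 5.0, 5.0, 90.0, 90.0, 90.0), d_gamma=-2.0)
print(sheared[1], frac_to_cart((0.5, 0.5, 0.5), sheared))
```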
Caveats
Large angle changes may yield ill‑conditioned cells; keep steps small.
Best Practices
Combine with small lattice scaling to explore local angle neighborhoods.
2.10 Shear Matrix Strain
Applies a shear matrix to the lattice; optionally symmetric.
Parameters: XY/YZ/XZ strain percentages, Identify organic, Symmetric shear.
Algorithm
Build shear matrix with off‑diagonal terms from percentages; if symmetric, also fill transposed entries; multiply with cell and rescale atoms.
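A pure-Python sketch of this construction follows. The placement of each percentage in the upper triangle and the multiplication order (cell rows times matrix) are assumptions made for illustration; the function names are hypothetical.

```python
def shear_matrix(xy=0.0, yz=0.0, xz=0.0, symmetric=False):
    """Identity plus off-diagonal shear terms given as percentages."""
    m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    m[0][1] = xy / 100.0
    m[1][2] = yz / 100.0
    m[0][2] = xz / 100.0
    if symmetric:
        # Mirror the upper-triangle terms into the transposed entries.
        m[1][0], m[2][1], m[2][0] = m[0][1], m[1][2], m[0][2]
    return m


def apply_shear(cell, matrix):
    """new_cell = cell @ matrix, with lattice vectors stored as rows (assumed)."""
    return [
        [sum(cell[i][k] * matrix[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
one_sided = apply_shear(cell, shear_matrix(xy=5.0))
two_sided = apply_shear(cell, shear_matrix(xy=5.0, symmetric=True))
print(one_sided[0], two_sided[1])
```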
Caveats
Large shear may produce non‑physical cells; maintain moderate ranges (±5%).
Best Practices
Use symmetric shear for balanced distortions unless specific directionality is desired.
2.11 Stacking Fault Generation
Generates stacking faults (or twins) along a specified Miller plane.
Parameters: (h,k,l), Step range (start, end, step), Layers.
Algorithm
Compute plane normal from reciprocal lattice, select a layer split along a perpendicular direction, translate the upper part by multiples in the normal direction, wrap, and annotate.
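The first step, obtaining the plane normal from the reciprocal lattice, can be sketched as follows. The normal of the (h,k,l) plane is parallel to h·b1 + k·b2 + l·b3, where b1, b2, b3 are the reciprocal vectors; the 2π factor is dropped because only the direction matters. The function names and the eps threshold are hypothetical.

```python
import math


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def plane_normal(cell, hkl, eps=1e-8):
    """Unit normal of the (h, k, l) plane, or None when it is ill-defined."""
    a1, a2, a3 = cell
    volume = dot(a1, cross(a2, a3))
    if abs(volume) < eps:
        return None
    # Reciprocal vectors without the 2*pi factor.
    recip = [
        tuple(x / volume for x in cross(a2, a3)),
        tuple(x / volume for x in cross(a3, a1)),
        tuple(x / volume for x in cross(a1, a2)),
    ]
    n = tuple(sum(hkl[i] * recip[i][j] for i in range(3)) for j in range(3))
    norm = math.sqrt(dot(n, n))
    if norm < eps:
        return None  # e.g. hkl = (0, 0, 0): skip modification
    return tuple(x / norm for x in n)


cubic = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
print(plane_normal(cubic, (1, 1, 1)))
```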
Caveats
If the normal becomes ill‑defined, the card skips modification.
Best Practices
Use small step sizes and validate resulting interlayer distances.
3. Filter Cards
3.1 FPS Filter (Farthest Point Sampling)
Algorithm:
Calculates NEP descriptors for all structures
Executes FPS algorithm in high-dimensional space
Key Parameters:
NEP file path (required)
Maximum selection count
Minimum distance threshold
Filter Mechanism:
Filters only affect exported results, not data flow
Export logic:
if filter_active:
    export_filtered_results
else:
    export_raw_merged_results
Algorithm
Runs NEP to compute structure descriptors; uses FPS to select up to Max selected with a minimum pairwise distance Min distance in descriptor space.
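Farthest point sampling itself is independent of how the descriptors are produced. The sketch below runs it on small 2D points standing in for NEP descriptors; starting from index 0 and using Euclidean distance are assumptions, and the function name is hypothetical.

```python
import math


def farthest_point_sampling(points, max_selected, min_distance):
    """Greedy FPS: repeatedly pick the point farthest from everything selected so far."""
    if not points or max_selected <= 0:
        return []
    selected = [0]  # starting point is an arbitrary choice
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < max_selected:
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] < min_distance:
            break  # every remaining point is too close to the selection
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected


descriptors = [(0.0, 0.0), (0.0005, 0.0), (1.0, 0.0), (0.5, 0.0)]
print(farthest_point_sampling(descriptors, max_selected=10, min_distance=0.001))
```

The second point lies only 0.0005 from the first, so it is never selected at a threshold of 0.001, while the two well-separated points are.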
Caveats
Descriptor generation requires a valid NEP file; large datasets can take time—progress is reported.
Best Practices
Use a modest Min distance first (e.g., 1e-3–1e-2) and tune up.
Cascade after generation cards to reduce redundancy and export only representative samples.
4. Container Cards
4.1 Card Group
Usage Guide:
Create Group: Add Card Group card
Add Members: Drag cards into group
Set Filter: Drag filter card to group bottom area
Execution Example:
Scenario: 3 group cards generating 10, 15, and 20 structures respectively
Without Filter: Passes 45 structures to next stage
With Filter: Passes 45 structures but may only export 30
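The scenario above can be reproduced with a toy model that keeps the "passed on" and "exported" lists separate. The function name is hypothetical, and the filter here simply keeps the first 30 structures as a placeholder for a real selection such as FPS.

```python
def run_group_with_filter(card_outputs, filter_fn=None):
    """Merge all card outputs; the filter only decides what is exported."""
    merged = [s for outputs in card_outputs for s in outputs]
    exported = filter_fn(merged) if filter_fn is not None else merged
    return merged, exported


card_outputs = [
    [f"a{i}" for i in range(10)],
    [f"b{i}" for i in range(15)],
    [f"c{i}" for i in range(20)],
]
passed, exported = run_group_with_filter(card_outputs, lambda xs: xs[:30])
print(len(passed), len(exported))
```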
Caveats
Group outputs can be large; consider an in‑group filter to control size.
Best Practices
Use Card Group for parallel variants (e.g., different perturb cards) and a single filter at the bottom to unify selection criteria.
NepTrainKit Custom Card Development Guide
1. Development Environment Setup
1.1 Card Directory Structure
User_Config_Directory/
├── cards/
│ ├── custom_card1.py # Custom card files
│ └── custom_card2.py
1.2 Get Config Directory Path
import os
import platform


def get_user_config_path():
    if platform.system() == 'Windows':
        local_path = os.getenv('LOCALAPPDATA', None)
        if local_path is None:
            local_path = os.getenv('USERPROFILE', '') + '\\AppData\\Local'
        user_config_path = os.path.join(local_path, 'NepTrainKit')
    else:
        user_config_path = os.path.expanduser("~/.config/NepTrainKit")
    return user_config_path
Default paths:
Windows: C:\Users\Username\AppData\Local\NepTrainKit
Linux: ~/.config/NepTrainKit
2. Card Development Template
2.1 Basic Template Structure
from NepTrainKit.core import CardManager
from NepTrainKit.custom_widget.card_widget import MakeDataCard


@CardManager.register_card
class CustomCard(MakeDataCard):
    # Required class attributes
    group = "Custom"  # menu name
    card_name = "Custom Card Name"
    menu_icon = ":/images/src/images/logo.svg"

    def __init__(self, parent=None):
        super().__init__(parent)
        self.setTitle("Card Title")
        self.init_ui()

    def init_ui(self):
        """Initialize UI"""
        self.setObjectName("custom_card_widget")
        # Add controls and layout code here

    def process_structure(self, structure):
        """Core processing logic"""
        processed_structures = []
        # Processing code...
        return processed_structures

    def to_dict(self):
        """Serialize card configuration"""
        return super().to_dict()

    def from_dict(self, data_dict):
        """Deserialize configuration"""
        super().from_dict(data_dict)
        # Custom parameter restoration...
3. Core Function Implementation
3.1 Processing Function Specification
def process_structure(self, structure):
    """
    Parameters:
        structure (ase.Atoms): Input structure object
    Returns:
        List[ase.Atoms]: Processed structure list
    Notes:
        - Must return a list, even with single structure
        - Each structure should use copy() to avoid modifying original
    """
    new_structure = structure.copy()
    # Processing logic...
    return [new_structure]
3.2 UI Development Recommendations
def init_ui(self):
    # Example: Add parameter input
    from qfluentwidgets import SpinBox, BodyLabel
    self.param_label = BodyLabel("Parameter:", self)
    self.param_input = SpinBox(self)
    self.param_input.setRange(1, 100)
    self.param_input.setValue(10)
    self.settingLayout.addWidget(self.param_label, 0, 0)
    self.settingLayout.addWidget(self.param_input, 0, 1)
4. Advanced Features
4.1 State Persistence
def to_dict(self):
    data = super().to_dict()
    data.update({
        'custom_param': self.param_input.value(),
        'other_setting': True
    })
    return data

def from_dict(self, data):
    super().from_dict(data)
    self.param_input.setValue(data.get('custom_param', 10))
Appendix: Complete Example Card
https://github.com/aboys-cb/NepTrainKit/tree/dev/src/NepTrainKit/views/_card