Make Dataset

1. Core Concepts & Workflow

1.1 Data Flow Model

Linear Processing Chain: Cards execute sequentially, with each card’s output automatically becoming the next card’s input
In-Group Parallel Flow: All cards within a group share the same input, with outputs automatically merged
Filter Mechanism: Filters can be added at group ends to screen outputs from all group cards
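
The three rules above can be sketched in plain Python. This is only an illustration, not NepTrainKit's implementation: a card is modeled as a function from one structure to a list of structures, strings stand in for structures, and `run_card`, `run_group`, and the toy cards are hypothetical names.

```python
from typing import Callable, List

Card = Callable[[str], List[str]]  # hypothetical: one structure in, many out


def run_card(card: Card, structures: List[str]) -> List[str]:
    """Apply one card to every input structure and collect the outputs."""
    out: List[str] = []
    for s in structures:
        out.extend(card(s))
    return out


def run_group(cards: List[Card], structures: List[str]) -> List[str]:
    """All cards in a group see the same input; their outputs are merged."""
    merged: List[str] = []
    for card in cards:
        merged.extend(run_card(card, structures))
    return merged


def supercell(s: str) -> List[str]:
    return [s + "+supercell"]


def strain(s: str) -> List[str]:
    return [f"{s}+strain{i}" for i in range(3)]


def perturb(s: str) -> List[str]:
    return [f"{s}+perturb{i}" for i in range(2)]


# Linear chain: the supercell output feeds the group, which merges 3 + 2 outputs.
stage1 = run_card(supercell, ["POSCAR"])
stage2 = run_group([strain, perturb], stage1)
print(len(stage2))  # 5
```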

1.2 Basic Operations

Import Structures:
- Supported formats: VASP/POSCAR, CIF, XYZ
- Methods: Click “Open” button or drag files directly into window
Build Processing Pipeline:
- Add processing cards via “Add new card”
- Reorder cards via drag-and-drop
- Use Card Groups for complex workflows
Save/Load Configuration:
- Export: Save current card setup as JSON
- Import: Load existing configuration files

Sample configuration file:

{
    "software_version": "2.0.6.dev35",
    "cards": [
        {
            "class": "SuperCellCard",
            "check_state": true,
            "super_cell_type": 0,
            "super_scale_radio_button": false,
            "super_scale_condition": [1,1,1],
            "super_cell_radio_button": true,
            "super_cell_condition": [20,20,20],
            "max_atoms_radio_button": false,
            "max_atoms_condition": [1]
        },
        {
            "class": "CardGroup",
            "check_state": true,
            "card_list": [
                {
                    "class": "CellStrainCard",
                    "check_state": true,
                    "engine_type": "triaxial",
                    "x_range": [-5,5,1],
                    "y_range": [-5,5,1],
                    "z_range": [-5,5,1]
                },
                {
                    "class": "PerturbCard",
                    "check_state": true,
                    "engine_type": 0,
                    "organic": true,
                    "scaling_condition": [0.3],
                    "num_condition": [50]
                },
                {
                    "class": "CellScalingCard",
                    "check_state": true,
                    "engine_type": 0,
                    "perturb_angle": true,
                    "scaling_condition": [0.04],
                    "num_condition": [50]
                }
            ],
            "filter_card": {
                "class": "FPSFilterDataCard",
                "check_state": true,
                "nep_path": "D:\\PycharmProjects\\NepTrainKit\\src\\NepTrainKit\\Config\\nep89.txt",
                "num_condition": [100],
                "min_distance_condition": [0.001]
            }
        }
    ]
}
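
To check a saved configuration before loading it, the nesting can be walked with the standard `json` module. The sketch below uses a trimmed copy of the sample above and relies only on keys that appear in it (`cards`, `class`, `card_list`, `filter_card`); the `outline` helper is a hypothetical name.

```python
import json

config_text = """
{
    "software_version": "2.0.6.dev35",
    "cards": [
        {"class": "SuperCellCard", "check_state": true},
        {
            "class": "CardGroup",
            "check_state": true,
            "card_list": [
                {"class": "CellStrainCard", "check_state": true},
                {"class": "PerturbCard", "check_state": true}
            ],
            "filter_card": {"class": "FPSFilterDataCard", "check_state": true}
        }
    ]
}
"""


def outline(cards, depth=0):
    """Return one indented line per card, descending into groups."""
    lines = []
    for card in cards:
        lines.append("  " * depth + card["class"])
        lines.extend(outline(card.get("card_list", []), depth + 1))
        if card.get("filter_card"):
            lines.append("  " * (depth + 1) + "filter: " + card["filter_card"]["class"])
    return lines


config = json.loads(config_text)
print("\n".join(outline(config["cards"])))
```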

2. Production Cards Explained

2.1 Super Cell Generation

Function: Creates supercells through expansion

Parameters:

Parameter Group	Option	Description	Typical Values
Mode	Maximum	Generates the largest possible supercell	-
Mode	Iteration	Generates all possible combinations	-
Expansion Method	Super Scale	Fixed expansion multipliers	(2,2,2)
Expansion Method	Super Cell	Expands up to a maximum lattice length per axis	(10 Å, 10 Å, 10 Å)
Expansion Method	Max Atoms	Limits by maximum atom count	200

Structure Tagging:

structure.info["Config_type"] += f"supercell({nx},{ny},{nz})"  # e.g., supercell(2,2,1)

Algorithm

Super Scale: directly uses [na, nb, nc] as diagonal expansion factors and builds make_supercell(structure, diag([na,nb,nc])).
Super Cell (max lattice): computes [na, nb, nc] = floor([amax/a, bmax/b, cmax/c]) with safety checks, then clamps to at least 1 in each direction.
Max Atoms: enumerates integer triplets [na, nb, nc] whose product times N_atoms does not exceed the limit, sorts them by total atom count, and returns either all of them (Iteration) or the largest (Maximum).
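
The Super Cell and Max Atoms rules can be sketched with the standard library alone. The function names and the near-zero tolerance are assumptions made for illustration; the card's actual safety checks may differ.

```python
import math
from itertools import product


def factors_from_max_lengths(lengths, max_lengths):
    """Super Cell mode: floor(max/length) per axis, at least 1."""
    factors = []
    for length, limit in zip(lengths, max_lengths):
        n = math.floor(limit / length) if length > 1e-8 else 1  # guard near-zero
        factors.append(max(1, n))
    return tuple(factors)


def factors_from_max_atoms(n_atoms, max_atoms):
    """Max Atoms mode: all (na, nb, nc) with na*nb*nc*n_atoms <= max_atoms."""
    top = max(1, max_atoms // n_atoms)
    triplets = [
        t for t in product(range(1, top + 1), repeat=3)
        if t[0] * t[1] * t[2] * n_atoms <= max_atoms
    ]
    # Sort by total atom count; ties keep enumeration order.
    return sorted(triplets, key=lambda t: t[0] * t[1] * t[2])


print(factors_from_max_lengths((4.05, 4.05, 30.0), (20, 20, 20)))  # (4, 4, 1)
candidates = factors_from_max_atoms(n_atoms=4, max_atoms=16)
print(len(candidates), candidates[-1])  # 13 (4, 1, 1)
```

In this sketch, Iteration corresponds to using every entry of `candidates` and Maximum to using only the last one.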

Caveats

If any lattice vector length is near zero, max‑cell estimation falls back to 1 to avoid divide‑by‑zero.
Very large supercells can create memory pressure; prefer Max Atoms to cap size.
Iteration may generate many structures—use a follow‑up filter (e.g., FPS) to down‑select.

Best Practices

Use Max Atoms to scale safely across diverse inputs.
Prefer Iteration when exploring and Maximum when preparing a single supercell.
Combine with Lattice Strain or Perturb after expansion to enrich diversity.

2.2 Vacancy Defect Generation

Function: Creates vacancy-defect structures by deleting random atoms

Key Parameters:

Random engine: Sobol sequence or Uniform distribution
Vacancy specification:
- Vacancy num: fixed number of vacancies
- Vacancy concentration: fraction of atoms to remove
Max structures: number of structures to generate

Structure Tagging:

structure.info["Config_type"] += f" Vacancy(num={defect_count})"

Algorithm

If using concentration c, compute max_defects = floor(c * N_atoms); else use fixed number.
Engine Sobol: sample both defect count and positions deterministically in [0,1), then map to indices.
Engine Uniform: draw a random integer count and random, unique atom indices to remove.
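
A minimal sketch of the Uniform engine, using only the standard `random` module. The lower bound of one vacancy and the rule of always keeping at least one atom are assumptions, and `vacancy_indices` is a hypothetical name. The Sobol engine is not shown.

```python
import math
import random


def vacancy_indices(n_atoms, rng, num=1, concentration=None):
    """Pick unique atom indices to delete (Uniform engine sketch)."""
    if concentration is not None:
        max_defects = math.floor(concentration * n_atoms)
    else:
        max_defects = num
    # Assumption: never remove every atom, so keep at least one.
    max_defects = max(0, min(max_defects, n_atoms - 1))
    if max_defects == 0:
        return []
    count = rng.randint(1, max_defects)               # random integer count
    return sorted(rng.sample(range(n_atoms), count))  # unique indices


rng = random.Random(0)
removed = vacancy_indices(n_atoms=64, rng=rng, concentration=0.1)
print(len(removed), removed)
```

With 64 atoms and c = 0.1, max_defects is floor(6.4) = 6, so between one and six atoms are removed.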

Caveats

A concentration of 1.0 is clamped to avoid removing all atoms.
Removing many atoms can break the intended stoichiometry or introduce periodic-boundary artifacts; validate downstream.

Best Practices

Keep c small (e.g., ≤ 0.2) for incremental datasets.
Prefer Sobol for repeatable sweeps; use Uniform for stochastic augmentation.

2.3 Atomic Perturbation

Function: Adds random displacements to atomic positions

Key Parameters:

Random engine: Sobol or Uniform
Max distance: maximum displacement amplitude in Å
Identify organic: treat organic molecules as rigid units
Max structures: number of structures to generate

Structure Tagging:

structure.info["Config_type"] += f" Perturb(distance={max_displacement}, {engine_type})"

Algorithm

Builds a 3N‑dimensional random vector from Sobol or Uniform in [‑1,1], reshaped to N×3, then scaled by max_distance and added to atomic positions.
If “Identify organic” is enabled: organic clusters are translated as rigid units; inorganic clusters perturb per‑atom.
Wraps atoms back into the cell.
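
The per-atom path can be sketched as follows. To stay dependency-free, the example assumes an orthorhombic cell so that wrapping reduces to a per-axis modulo, and it covers only the Uniform engine without organic-cluster handling; `perturb_positions` is a hypothetical name.

```python
import random


def perturb_positions(positions, cell_lengths, max_distance, rng):
    """Displace each atom by up to max_distance per axis, then wrap.

    Assumes an orthorhombic cell so wrapping is a per-axis modulo.
    """
    n = len(positions)
    flat = [rng.uniform(-1.0, 1.0) for _ in range(3 * n)]  # 3N random vector
    new_positions = []
    for i, pos in enumerate(positions):
        delta = flat[3 * i: 3 * i + 3]                     # row i of the N x 3 reshape
        moved = [p + max_distance * d for p, d in zip(pos, delta)]
        new_positions.append([m % length for m, length in zip(moved, cell_lengths)])
    return new_positions


rng = random.Random(42)
old = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
new = perturb_positions(old, cell_lengths=[4.0, 4.0, 4.0], max_distance=0.3, rng=rng)
print(new)
```

An atom at the origin that is displaced in a negative direction reappears near the opposite cell face, which is the wrapping step at work.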

Caveats

Large max_distance may cause unrealistic overlaps even with wrapping.
Cluster detection is heuristic; verify for complex organics.

Best Practices

Start small (0.1–0.3 Å) and inspect min interatomic distances.
Combine with FPS Filter to keep diverse displacements while avoiding redundancy.

2.4 Lattice Scaling

Function: Randomly scales lattice vectors

Key Parameters:

Random engine: Sobol or Uniform
Max scaling: 0–1, applied symmetrically as 1±value
Perturb angle: whether lattice angles are also perturbed
Identify organic: treat organic molecules as rigid units
Max structures: number of structures to generate

Structure Tagging:

structure.info["Config_type"] += f" Scaling({scaling_factor})"

Algorithm

Random engine produces a sequence of factors; lattice lengths are scaled by 1 ± s (uniform or Sobol).
If “Perturb angle” is enabled, angles are also perturbed and a new lattice is reconstructed from (lengths, angles).
Optionally treats organics as rigid clusters during scaling.
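
As a rough sketch of the length-scaling step, assuming an independent factor per lattice length and the Uniform engine (both assumptions, as is the name `scaled_lengths`):

```python
import random


def scaled_lengths(lengths, max_scaling, rng):
    """Scale each lattice length by an independent factor in [1 - s, 1 + s]."""
    factors = [1.0 + rng.uniform(-max_scaling, max_scaling) for _ in lengths]
    return [length * f for length, f in zip(lengths, factors)], factors


rng = random.Random(1)
for _ in range(3):  # "Max structures" = 3
    new_lengths, factors = scaled_lengths([5.43, 5.43, 5.43], 0.04, rng)
    print([round(x, 3) for x in new_lengths])
```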

Caveats

Angle perturbation rebuilds the lattice; extreme angles can generate ill‑conditioned cells.
Ensure Max scaling is small (≤ 0.05) when also perturbing angles.

Best Practices

Use Uniform for broad coverage; Sobol for low‑discrepancy sweeps.
Keep scaling conservative for stability, especially before DFT relaxations.

2.5 Lattice Strain

Strain Modes:

Uniaxial
Biaxial
Triaxial
Isotropic (uniform scaling; only X range is used)
Custom Axis Combinations: Supports any XYZ combinations (e.g., “XY”, “XZ”, “YZX”)
```
# Example: Apply strain only to X and Z axes
strain_axes = "XZ"  # Equivalent to "ZX"
```

Key Parameters:

Axes: built‑in modes or custom strings like X, XY
X/Y/Z range: strain percentage ranges. In isotropic mode only X values are used
Identify organic: treat organic molecules as rigid units

Structure Tagging:

structure.info["Config_type"] += f" Strain({axis1}:{value1}%, {axis2}:{value2}%)"

Algorithm

Builds mesh over selected axes: np.arange(start, end+step, step) per axis; isotropic uses a single scalar applied to all.
Scales selected lattice vectors by (1 + strain/100); optionally applies rigid handling of organic clusters.
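
The grid size grows quickly with the number of axes, which the following sketch makes concrete. `inclusive_range` is a stand-in for `np.arange(start, end+step, step)` that assumes the range is an exact multiple of the step; all function names are hypothetical, and lattice lengths stand in for full lattice vectors.

```python
from itertools import product


def inclusive_range(start, end, step):
    """Values start, start+step, ... up to and including end (for exact steps)."""
    count = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(count)]


def strain_grid(axes, ranges):
    """Cartesian product of strain percentages over the selected axes."""
    per_axis = [inclusive_range(*ranges[axis]) for axis in axes]
    return [dict(zip(axes, combo)) for combo in product(*per_axis)]


def strained_lengths(lengths, strain):
    """Scale the a/b/c lengths by (1 + strain/100) on the strained axes."""
    return [
        length * (1 + strain.get(axis, 0.0) / 100.0)
        for axis, length in zip("XYZ", lengths)
    ]


ranges = {"X": (-5, 5, 1), "Y": (-5, 5, 1), "Z": (-5, 5, 1)}
grid = strain_grid("XYZ", ranges)
print(len(grid))  # 1331 = 11 ** 3
print(strained_lengths([4.0, 4.0, 4.0], {"X": 5, "Z": -5}))
```

The triaxial ranges from the sample configuration therefore produce 1331 structures per input, which is why trimming the grid with a filter is usually worthwhile.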

Caveats

Large strains (>10–15%) can produce severe distortions; prefer smaller increments.
Custom axis strings must only include X/Y/Z.

Best Practices

Use biaxial/triaxial for comprehensive scans; isotropic for quick sweeps.
Pair with FPS or error‑based selection to trim the grid.

2.6 Random Doping Substitution

Function: Randomly substitute atoms according to user-defined rules

Key Parameters:

Doping rules: each rule contains a target element, dopant elements and their ratios, a concentration or count range, and optional groups
Doping algorithm: Random (sample dopants by probability) or Exact (follow ratios exactly)
Max structures: number of structures to generate

Structure Tagging:

structure.info["Config_type"] += f" Doping(num={dopant_count})"

When a rule uses group, the input structures must be in extended XYZ format with a group column declared in Properties (group:S:1 in the header below). For example:

5
Lattice="6.383697472927415 0.0 0.0 0.0 6.383697472927415 1.4e-15 0.0 8e-16 6.383697472927415" Properties=species:S:1:pos:R:3:group:S:1 pbc="T T T"
Cs       0.00000000       0.00000000       0.00000000 a
I        3.19184873       3.19184873      -0.00000000 b
I        3.19184873      -0.00000000       3.19184873 c
I       -0.00000000       3.19184873       3.19184873 c
Pb       3.19184873       3.19184873       3.19184873 d

Rule Schema

Rules is a list; each item is a JSON object with keys:
- target (string): Element symbol to be replaced (e.g., “Si”).
- dopants (object): Mapping of element symbol → ratio (e.g., “Ge:0.7,C:0.3”). Ratios are normalized internally, so they need not sum to 1.
- use (string): One of "concentration" or "count".
- concentration ([min, max], floats in 0–1): Fraction of eligible sites to replace.
- count ([min, max], integers): Number of sites to replace.
- group (array, optional): Limit replacement to atoms whose group value is in this list (e.g., “a,c”).

If use is omitted or unrecognized, all eligible sites are replaced.
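
A rule list following this schema might look like the JSON below. The accompanying `sites_to_replace` helper is a hypothetical sketch of how the number of replaced sites could be derived from a rule; drawing the fraction uniformly and truncating to an integer are assumptions.

```python
import json
import random

rules_text = """
[
    {
        "target": "Si",
        "dopants": {"Ge": 0.7, "C": 0.3},
        "use": "concentration",
        "concentration": [0.02, 0.10],
        "group": ["a", "c"]
    }
]
"""


def sites_to_replace(rule, n_candidates, rng):
    """Number of candidate sites a rule replaces, clamped to what exists."""
    if rule.get("use") == "concentration":
        low, high = rule["concentration"]
        n = int(rng.uniform(low, high) * n_candidates)
    elif rule.get("use") == "count":
        low, high = rule["count"]
        n = rng.randint(low, high)
    else:
        n = n_candidates  # omitted or unrecognized "use": replace all
    return max(0, min(n, n_candidates))


rules = json.loads(rules_text)
rng = random.Random(0)
print(sites_to_replace(rules[0], n_candidates=200, rng=rng))
```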

How Ratios and Exact Work

Random: for each selected site, samples a dopant species according to normalized dopants ratios.
Exact: computes integer counts as floor(ratio * N) for each species, assigns leftovers to the largest‑ratio species, shuffles order for randomness.
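
The difference between the two modes can be sketched as follows. The function names are hypothetical, and when several species tie for the largest ratio this sketch gives the leftovers to the first one.

```python
import random


def exact_assignment(dopants, n_sites, rng):
    """Exact mode: floor(ratio * N) per species, leftovers to the largest ratio."""
    total = sum(dopants.values())
    ratios = {el: r / total for el, r in dopants.items()}  # normalize
    counts = {el: int(r * n_sites) for el, r in ratios.items()}
    leftover = n_sites - sum(counts.values())
    counts[max(ratios, key=ratios.get)] += leftover
    species = [el for el, c in counts.items() for _ in range(c)]
    rng.shuffle(species)                                   # random site order
    return species


def random_assignment(dopants, n_sites, rng):
    """Random mode: sample each site independently by ratio."""
    elements = list(dopants)
    return rng.choices(elements, weights=[dopants[e] for e in elements], k=n_sites)


rng = random.Random(0)
exact = exact_assignment({"Ge": 3, "C": 1}, n_sites=10, rng=rng)
print(exact.count("Ge"), exact.count("C"))  # 8 2
```

With ratios 3:1 over 10 sites, the floors are 7 and 2, and the one leftover site goes to Ge. Random mode only matches the ratios on average.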

Edge Cases and Validation

If fewer eligible target sites exist than requested, the algorithm clamps to the available count.
group matching is exact (case‑sensitive) against the structure’s group array values.
Combining multiple rules applies them sequentially per structure; later rules see results of earlier substitutions.

Tips

Use Exact for reproducible stoichiometry, Random for data augmentation.
Keep concentration windows narrow for incremental variations (e.g., 2–10%).
Prefer group when only specific sublattices or layers should be doped.

Algorithm

For each rule, find candidate sites (optionally limited by group).
If use == concentration, sample a fraction of candidates; if use == count, sample an integer range.
Choose dopant species by probability (Random) or exact counts (Exact) and substitute symbols.

Caveats

If candidates are fewer than requested, the algorithm clamps to available sites.
Doping can break charge balance; validate for your physics/chemistry constraints.

Best Practices

Prefer Exact to reproduce specific stoichiometries; use Random for augmentation.
Use group to limit substitutions to sublattices or layers.

2.7 Random Vacancy Deletion

Function: Removes atoms according to vacancy rules

Key Parameters:

Vacancy rules: each rule specifies an element, a deletion count range, and optional groups to restrict affected sites
Max structures: number of structures to generate

Structure Tagging:

structure.info["Config_type"] += f" Vacancy(num={removed_count})"

Rule Schema

Rules is a list; each item is a JSON object with keys:
- element (string): Element symbol to delete.
- count ([min, max], integers): Number of atoms to remove per rule application.
- group (array, optional): Restrict deletions to atoms with group in the list.

If element is missing or count.max <= 0, the rule is skipped.

Edge Cases and Validation

Requested deletions are clamped to available eligible atoms.
Multiple rules are applied sequentially; later rules operate on the already‑modified structure.

Tips

Combine with Super Cell to maintain sufficient system size after deletions.
For surface models, use group to target only top layers or named regions.

Algorithm

For each rule, determines deletions by element and optional group, draws a random integer count in range, and deletes unique indices.
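
A sketch of one rule application on plain lists, using the CsPbI3 example from the doping section. `apply_vacancy_rule` is a hypothetical helper, and symbols and groups are kept as parallel lists purely for illustration.

```python
import random


def apply_vacancy_rule(symbols, groups, rule, rng):
    """Delete a random number of atoms of rule['element'], optionally by group."""
    low, high = rule.get("count", (0, 0))
    if not rule.get("element") or high <= 0:
        return symbols, groups, 0                          # rule is skipped
    allowed = rule.get("group")
    eligible = [
        i for i, (s, g) in enumerate(zip(symbols, groups))
        if s == rule["element"] and (not allowed or g in allowed)
    ]
    n_remove = min(rng.randint(low, high), len(eligible))  # clamp to available
    drop = set(rng.sample(eligible, n_remove))             # unique indices
    keep = [i for i in range(len(symbols)) if i not in drop]
    return [symbols[i] for i in keep], [groups[i] for i in keep], n_remove


symbols = ["Cs", "I", "I", "I", "Pb"]
groups = ["a", "b", "c", "c", "d"]
rule = {"element": "I", "count": [1, 2], "group": ["c"]}
new_symbols, new_groups, removed = apply_vacancy_rule(symbols, groups, rule, random.Random(0))
print(new_symbols, removed)
```

Only the two iodine atoms in group c are eligible, so the iodine in group b always survives.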

Caveats

Excessive deletion can collapse periodic images; keep counts modest.

Best Practices

Use alongside Super Cell to maintain adequate atoms after deletion.

2.8 Random Slab Generation

Function: Builds slabs with random Miller indices and vacuum thickness

Key Parameters:

h/k/l range: Miller index ranges
Layer range: minimum, maximum and step
Vacuum range: vacuum thickness in Å

Structure Tagging:

structure.info["Config_type"] += f" Slab(hkl={h}{k}{l},layers={layers},vacuum={vac})"

Algorithm

Enumerates Miller indices (h,k,l), layer counts, and vacuum values; for each combination builds an ASE slab surface(...), wraps, and annotates.
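
Before running the card, it can help to count how many slabs a set of ranges will generate. The sketch below enumerates the combinations only. It assumes integer vacuum values and (min, max, step) tuples, and `slab_jobs` is a hypothetical name.

```python
from itertools import product


def inclusive(start, end, step=1):
    """Integers from start to end inclusive."""
    return list(range(start, end + 1, step))


def slab_jobs(h_range, k_range, l_range, layer_range, vacuum_range):
    """List (hkl, layers, vacuum) combinations, skipping the invalid (0, 0, 0)."""
    jobs = []
    for hkl in product(inclusive(*h_range), inclusive(*k_range), inclusive(*l_range)):
        if hkl == (0, 0, 0):
            continue
        for layers, vacuum in product(inclusive(*layer_range), inclusive(*vacuum_range)):
            jobs.append((hkl, layers, vacuum))
    return jobs


jobs = slab_jobs((0, 1), (0, 1), (0, 1), layer_range=(3, 5, 1), vacuum_range=(10, 20, 10))
print(len(jobs))  # 42 = (2*2*2 - 1) * 3 * 2
```

Each entry would then correspond to one call of the form `ase.build.surface(bulk, hkl, layers, vacuum=vacuum)`, so even modest index ranges multiply into many structures.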

Caveats

(0,0,0) is skipped; degenerate or redundant orientations can arise for symmetric lattices.
Large enumerations can explode combinatorially; trim ranges sensibly.

Best Practices

Start with a few low‑index planes and small layer/vacuum ranges.
Post‑filter by thickness, surface area, or descriptor distance.

2.9 Shear Angle Strain

Adjusts cell angles (alpha, beta, gamma) over specified ranges.

Parameters: Alpha/Beta/Gamma ranges (degrees), Identify organic.

Algorithm

Convert cell to [a,b,c,alpha,beta,gamma], add angle deltas, rebuild the lattice, and rescale atoms.
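
The rebuild step can be illustrated with the standard conversion from cell parameters to lattice vectors (a along x, b in the xy plane). This is a sketch of that textbook formula, not the card's code, and the helper names are hypothetical.

```python
import math


def cellpar_to_vectors(a, b, c, alpha, beta, gamma):
    """Lattice vectors from lengths and angles (degrees); a along x, b in the xy plane."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cy = (ca - cb * cg) / sg
    cz_sq = 1.0 - cb * cb - cy * cy
    if cz_sq <= 0.0:
        raise ValueError("angles do not describe a valid cell")
    return [
        [a, 0.0, 0.0],
        [b * cg, b * sg, 0.0],
        [c * cb, c * cy, c * math.sqrt(cz_sq)],
    ]


def angle_between(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    return math.degrees(math.acos(dot / (math.hypot(*u) * math.hypot(*v))))


cellpar = [4.0, 4.0, 4.0, 90.0, 90.0, 90.0]
deltas = (0.0, 0.0, 2.0)  # +2 degrees on gamma
sheared = cellpar[:3] + [ang + d for ang, d in zip(cellpar[3:], deltas)]
va, vb, vc = cellpar_to_vectors(*sheared)
print(round(angle_between(va, vb), 6))  # 92.0
```

Keeping the fractional coordinates fixed while swapping in the new vectors is one way to rescale the atoms. The `ValueError` branch shows where extreme angle combinations stop describing a valid cell.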

Caveats

Large angle changes may yield ill‑conditioned cells; keep steps small.

Best Practices

Combine with small lattice scaling to explore local angle neighborhoods.

2.10 Shear Matrix Strain

Applies a shear matrix to the lattice; optionally symmetric.

Parameters: XY/YZ/XZ strain percentages, Identify organic, Symmetric shear.

Algorithm

Build shear matrix with off‑diagonal terms from percentages; if symmetric, also fill transposed entries; multiply with cell and rescale atoms.
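
One possible reading of this step is sketched below. Which off-diagonal entry each percentage fills, and the multiplication order (cell rows times shear matrix), are assumptions; the card may use a different convention.

```python
def shear_matrix(xy, yz, xz, symmetric=False):
    """Identity plus off-diagonal shear terms given in percent."""
    m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    m[0][1] = xy / 100.0
    m[1][2] = yz / 100.0
    m[0][2] = xz / 100.0
    if symmetric:  # mirror each term across the diagonal
        m[1][0], m[2][1], m[2][0] = m[0][1], m[1][2], m[0][2]
    return m


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]  # rows are lattice vectors
new_cell = matmul(cell, shear_matrix(xy=5, yz=0, xz=0))
print(new_cell[0])  # [4.0, 0.2, 0.0]
```

A 5% XY shear tilts the first lattice vector by 0.2 Å toward y in this convention. With `symmetric=True` the second vector is tilted toward x by the same amount.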

Caveats

Large shear may produce non‑physical cells; maintain moderate ranges (±5%).

Best Practices

Use symmetric shear for balanced distortions unless specific directionality is desired.

2.11 Stacking Fault Generation

Generates stacking faults (or twins) along a specified Miller plane.

Parameters: (h,k,l), Step range (start, end, step), Layers.

Algorithm

Compute plane normal from reciprocal lattice, select a layer split along a perpendicular direction, translate the upper part by multiples in the normal direction, wrap, and annotate.
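
The geometric steps can be sketched with plain lists. The plane normal follows the standard reciprocal-lattice construction; splitting at the mid-height along the normal and the helper names are assumptions made for illustration, and wrapping and annotation are omitted.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]]


def plane_normal(cell, hkl):
    """Unit normal of the (hkl) plane: h*b1 + k*b2 + l*b3 with reciprocal vectors bi."""
    a1, a2, a3 = cell
    volume = sum(x * y for x, y in zip(a1, cross(a2, a3)))
    recip = [[c / volume for c in cross(a2, a3)],
             [c / volume for c in cross(a3, a1)],
             [c / volume for c in cross(a1, a2)]]
    n = [sum(hkl[i] * recip[i][j] for i in range(3)) for j in range(3)]
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-12:
        return None  # ill-defined normal: skip modification
    return [c / length for c in n]


def shift_upper_part(positions, normal, shift):
    """Translate atoms above the mid-height (assumed split) by shift * normal."""
    heights = [sum(p * n for p, n in zip(pos, normal)) for pos in positions]
    mid = (min(heights) + max(heights)) / 2.0
    return [
        [p + shift * n for p, n in zip(pos, normal)] if h > mid else list(pos)
        for pos, h in zip(positions, heights)
    ]


cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
positions = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
normal = plane_normal(cell, (0, 0, 1))
shifted = shift_upper_part(positions, normal, shift=0.5)
print(normal, shifted)
```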

Caveats

If the normal becomes ill‑defined, the card skips modification.

Best Practices

Use small step sizes and validate resulting interlayer distances.

3. Filter Cards

3.1 FPS Filter (Farthest Point Sampling)

Algorithm:

Calculates NEP descriptors for all structures
Executes FPS algorithm in high-dimensional space

Key Parameters:

NEP file path (required)
Maximum selection count
Minimum distance threshold

Filter Mechanism:

Filters only affect exported results, not data flow

Export logic:

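def results_to_export(filter_active, filtered_results, raw_merged_results):
    # The filter only changes what is exported; the data flow is untouched.
    if filter_active:
        return filtered_results
    return raw_merged_results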

Algorithm

Runs NEP to compute structure descriptors; uses FPS to select up to Max selected with a minimum pairwise distance Min distance in descriptor space.
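
The selection step can be sketched as a greedy farthest-point loop over precomputed descriptors. Starting from the first structure and using Euclidean distance are assumptions, the 2D points stand in for NEP descriptors, and `farthest_point_sampling` is a hypothetical name.

```python
import math


def farthest_point_sampling(descriptors, max_selected, min_distance):
    """Greedy FPS: repeatedly add the point farthest from the selected set."""
    if not descriptors or max_selected <= 0:
        return []
    selected = [0]  # assumption: start from the first structure
    nearest = [math.dist(d, descriptors[0]) for d in descriptors]
    while len(selected) < max_selected:
        candidate = max(range(len(descriptors)), key=nearest.__getitem__)
        if nearest[candidate] < min_distance:  # everything left is too close
            break
        selected.append(candidate)
        for i, d in enumerate(descriptors):    # update distance to the selected set
            nearest[i] = min(nearest[i], math.dist(d, descriptors[candidate]))
    return selected


points = [(0.0, 0.0), (0.0, 0.0005), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(farthest_point_sampling(points, max_selected=100, min_distance=0.001))  # [0, 4, 2, 3]
```

The second point lies only 0.0005 from the first, below the 0.001 threshold, so it is never selected even though the selection count allows it.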

Caveats

Descriptor generation requires a valid NEP file; large datasets can take time—progress is reported.

Best Practices

Use a modest Min distance first (e.g., 1e-3–1e-2) and tune up.
Cascade after generation cards to reduce redundancy and export only representative samples.

4. Container Cards

4.1 Card Group

Usage Guide:

Create Group: Add Card Group card
Add Members: Drag cards into group
Set Filter: Drag filter card to group bottom area

Execution Example:

Scenario: 3 group cards generating 10, 15, and 20 structures respectively
Without Filter: Passes 45 structures to next stage
With Filter: Passes 45 structures but may only export 30

Caveats

Group outputs can be large; consider an in‑group filter to control size.

Best Practices

Use Card Group for parallel variants (e.g., different perturb cards) and a single filter at the bottom to unify selection criteria.

NepTrainKit Custom Card Development Guide

1. Development Environment Setup

1.1 Card Directory Structure

User_Config_Directory/
├── cards/
│   ├── custom_card1.py  # Custom card files
│   └── custom_card2.py

1.2 Get Config Directory Path

import os
import platform

def get_user_config_path():
    if platform.system() == 'Windows':
        local_path = os.getenv('LOCALAPPDATA', None)
        if local_path is None:
            local_path = os.getenv('USERPROFILE', '') + '\\AppData\\Local'
        user_config_path = os.path.join(local_path, 'NepTrainKit')
    else:
        user_config_path = os.path.expanduser("~/.config/NepTrainKit")
    return user_config_path

Default paths:
- Windows: C:\Users\Username\AppData\Local\NepTrainKit
- Linux: ~/.config/NepTrainKit

2. Card Development Template

2.1 Basic Template Structure

from NepTrainKit.core import CardManager
from NepTrainKit.custom_widget.card_widget import MakeDataCard


@CardManager.register_card
class CustomCard(MakeDataCard):
    # Required class attributes
    group = "Custom" # menu name
    card_name = "Custom Card Name"
    menu_icon = ":/images/src/images/logo.svg"
    
    def __init__(self, parent=None):
        super().__init__(parent)
        self.setTitle("Card Title")
        self.init_ui()
    
    def init_ui(self):
        """Initialize UI"""
        self.setObjectName("custom_card_widget")
        # Add controls and layout code here
    
    def process_structure(self, structure):
        """Core processing logic"""
        processed_structures = []
        # Processing code...
        return processed_structures
    
    def to_dict(self):
        """Serialize card configuration"""
        return super().to_dict()
        
    def from_dict(self, data_dict):
        """Deserialize configuration"""
        super().from_dict(data_dict)
        # Custom parameter restoration...

3. Core Function Implementation

3.1 Processing Function Specification

def process_structure(self, structure):
    """
    Parameters:
        structure (ase.Atoms): Input structure object
    
    Returns:
        List[ase.Atoms]: Processed structure list
    
    Notes:
        - Must return a list, even with single structure
        - Each structure should use copy() to avoid modifying original
    """
    new_structure = structure.copy()
    # Processing logic...
    return [new_structure]

3.2 UI Development Recommendations

def init_ui(self):
    # Example: Add parameter input
    from qfluentwidgets import SpinBox, BodyLabel
    
    self.param_label = BodyLabel("Parameter:", self)
    self.param_input = SpinBox(self)
    self.param_input.setRange(1, 100)
    self.param_input.setValue(10)
    
    self.settingLayout.addWidget(self.param_label, 0, 0)
    self.settingLayout.addWidget(self.param_input, 0, 1)

4. Advanced Features

4.1 State Persistence

def to_dict(self):
    data = super().to_dict()
    data.update({
        'custom_param': self.param_input.value(),
        'other_setting': True
    })
    return data

def from_dict(self, data):
    super().from_dict(data)
    self.param_input.setValue(data.get('custom_param', 10))

Appendix: Complete Example Card

https://github.com/aboys-cb/NepTrainKit/tree/dev/src/NepTrainKit/views/_card