MatGPT is a transformer-based framework for property-conditioned generation of inorganic crystal structures. It represents crystals as SLICES strings, trains GPT-style decoder models on structure-property data, samples new SLICES strings for target property values, reconstructs generated structures, and evaluates them with proxy property models.
This repository contains the reusable code. Datasets, trained weights, checkpoints, logs, generated samples, reconstructed CIF files, and plots are kept outside Git and should be supplied separately for reproduction.
MatGPT/
    mat/                      # single-property conditioning
        train.py
        sample_property.py
        sli_to_cif.py
        form_proxy.py
        band_proxy.py
        bulk_proxy.py
        metrics.py
        config/default.yaml
        utils/
        dataset/
        weights/
    mat-multi/                # multi-property conditioning
        train.py
        sample_property.py
        sli_to_cif.py
        band_form.py
        form_bulk.py
        crystal_form.py
        crystal_band.py
        crystal_bulk.py
        form_proxy.py
        band_proxy.py
        metrics.py
        config/default.yaml
        utils/
        dataset/
        weights/
Use mat/ when conditioning on one property. Use mat-multi/ when conditioning on more than one value, including cases where one value is an encoded crystal system.
Processed datasets and trained weights are distributed separately from this code repository. Place them in the following folders before running the scripts:
mat/dataset/
mat/weights/
mat-multi/dataset/
mat-multi/weights/
The source databases used to prepare materials data can be accessed here:
Processed datasets and trained weights are available on Hugging Face: harshasatyavardhan/matgpt_datasets.
Each training CSV must include:
- a SLICES column, configured with training.slices
- one or more numeric property columns, configured with training.selected_properties
- any metadata columns required by the evaluation scripts
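The column requirements above can be checked before training. The following is a minimal sketch using only the standard library; the helper names `missing_columns` and `non_numeric_rows` and the example column names are hypothetical and are not part of the repository.

```python
import csv
import io


def missing_columns(csv_text, slices_column, property_columns):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    required = [slices_column, *property_columns]
    return [name for name in required if name not in header]


def non_numeric_rows(csv_text, property_columns):
    """Return (row_number, column) pairs whose value cannot be parsed as a float."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for name in property_columns:
            try:
                float(row[name])
            except (KeyError, TypeError, ValueError):
                bad.append((row_number, name))
    return bad


# Hypothetical one-row dataset; real files are placed under dataset/.
example = "SLICES,PROPERTY_COLUMN,material_id\nPLACEHOLDER_SLICES,0.5,id-1\n"
print(missing_columns(example, "SLICES", ["PROPERTY_COLUMN"]))
print(non_numeric_rows(example, ["PROPERTY_COLUMN"]))
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to both helpers, using the column names set in training.slices and training.selected_properties.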
Trained model artifacts are expected in this format:
weights/transformer_property_<N>.pth
weights/model_info_property_<N>.json
<N> is the number of conditioning values used by the model.
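As an illustration of the naming scheme, the expected file names can be derived from the number of conditioning values. This is a sketch only; `artifact_paths` is a hypothetical helper, not a function provided by the repository.

```python
from pathlib import Path


def artifact_paths(weights_dir, num_properties):
    """Build the weight and metadata paths for a model with N conditioning values."""
    weights_dir = Path(weights_dir)
    return (
        weights_dir / f"transformer_property_{num_properties}.pth",
        weights_dir / f"model_info_property_{num_properties}.json",
    )


# A two-property model in mat-multi/ would be expected to provide these files.
weights_file, info_file = artifact_paths("weights", 2)
print(weights_file.name, info_file.name)
for path in (weights_file, info_file):
    if not path.exists():
        print(f"missing artifact: {path}")
```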
For runs that use crystal_system_encoded, the encoding used by the evaluation scripts is:
| Code | Crystal system |
|---|---|
| 1 | Triclinic |
| 2 | Monoclinic |
| 3 | Orthorhombic |
| 4 | Tetragonal |
| 5 | Trigonal |
| 6 | Hexagonal |
| 7 | Cubic |
Code 0 is reserved for Unknown in the plotting/evaluation utilities.
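For reference, the table can be written as a lookup. This is a minimal sketch; the names `CRYSTAL_SYSTEMS`, `decode_crystal_system`, and `encode_crystal_system` are hypothetical, and mapping unrecognised codes or names to Unknown is an assumption rather than documented behaviour of the evaluation scripts.

```python
# Codes as listed in the table above, with 0 reserved for Unknown.
CRYSTAL_SYSTEMS = {
    0: "Unknown",
    1: "Triclinic",
    2: "Monoclinic",
    3: "Orthorhombic",
    4: "Tetragonal",
    5: "Trigonal",
    6: "Hexagonal",
    7: "Cubic",
}


def decode_crystal_system(code):
    """Map an encoded value to its crystal system name (assumption: others -> Unknown)."""
    return CRYSTAL_SYSTEMS.get(int(code), "Unknown")


def encode_crystal_system(name):
    """Map a crystal system name to its code (assumption: unknown names -> 0)."""
    lookup = {label.lower(): code for code, label in CRYSTAL_SYSTEMS.items()}
    return lookup.get(name.strip().lower(), 0)


print(decode_crystal_system(7), encode_crystal_system("Hexagonal"))
```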
Create a Python environment with the required scientific and deep-learning packages. A typical installation is:
pip install torch pytorch-lightning hydra-core omegaconf pandas numpy matplotlib seaborn tqdm einops tensorboard pymatgen matgl m3gnet megnet tensorflow
The materials-science stack can be sensitive to Python, CUDA, TensorFlow, and operating-system versions, so a dedicated environment is recommended.
Both code paths use Hydra:
mat/config/default.yaml
mat-multi/config/default.yaml
Important fields:
| Field | Description |
|---|---|
| data_path.fname | path to the training CSV |
| training.selected_properties | property columns used for conditioning |
| training.num_properties | number of conditioning values |
| training.special_tokens | property tokens prepended to the generated sequence |
| training.slices | SLICES column name |
| paths.checkpoint_path | exported model weights used for sampling |
| paths.vocab_and_model | vocabulary and model metadata JSON |
| sample.properties | target property values used during generation |
| sample.save_path | output CSV for generated SLICES strings |
| back_to_cif.output_folder | reconstructed CIF files and conversion plots |
| proxy_model.* | proxy model names and output paths |
Run scripts from inside mat/ or mat-multi/ so the relative paths in the config resolve correctly.
Single-property model:
cd mat
python train.py \
data_path.fname=./dataset/data.csv \
training.selected_properties='["PROPERTY_COLUMN"]' \
training.num_properties=1 \
training.special_tokens='["<PROPERTY_TOKEN>"]' \
training.slices=SLICES
Multi-property model:
cd mat-multi
python train.py \
data_path.fname=./dataset/data.csv \
training.selected_properties='["PROPERTY_COLUMN_1","PROPERTY_COLUMN_2"]' \
training.num_properties=2 \
training.special_tokens='["<PROPERTY_TOKEN_1>","<PROPERTY_TOKEN_2>"]' \
training.slices=SLICES
Training writes logs, PyTorch Lightning checkpoints, exported weights, and vocabulary metadata to the configured paths.
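When launching runs from a script rather than a shell, the overrides shown above can be assembled from a dictionary. The sketch below assumes list values are rendered as compact JSON, which matches the bracketed, double-quoted form used in the commands above; `hydra_overrides` is a hypothetical helper, not part of the repository.

```python
import json


def hydra_overrides(settings):
    """Render a {dotted.key: value} mapping as key=value command-line arguments."""
    args = []
    for key, value in settings.items():
        if isinstance(value, (list, tuple)):
            # Compact JSON yields ["A","B"] or [[1,2]] with no spaces.
            rendered = json.dumps(value, separators=(",", ":"))
        else:
            rendered = str(value)
        args.append(f"{key}={rendered}")
    return args


# Placeholder names mirror the single-property command above.
argv = ["python", "train.py"] + hydra_overrides({
    "data_path.fname": "./dataset/data.csv",
    "training.selected_properties": ["PROPERTY_COLUMN"],
    "training.num_properties": 1,
    "training.special_tokens": ["<PROPERTY_TOKEN>"],
    "training.slices": "SLICES",
})
print(argv)
# To launch: subprocess.run(argv, cwd="mat", check=True)
```

Passing the list to `subprocess.run` without a shell means no extra single quotes are needed around the bracketed values.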
Generate SLICES strings from a trained model:
cd mat
python sample_property.py \
sample.properties='[TARGET_VALUE]' \
sample.number_sequences=500
For multi-property conditioning:
cd mat-multi
python sample_property.py \
sample.properties='[[TARGET_VALUE_1,TARGET_VALUE_2]]' \
sample.number_sequences=500
The sampler loads the configured model weights and metadata, validates generated SLICES strings, and writes the accepted samples to sample.save_path.
Convert generated SLICES strings to CIF structures:
cd mat
python sli_to_cif.py
or:
cd mat-multi
python sli_to_cif.py
Outputs are written under the configured sample/output_structures/ folder.
Single-property evaluation utilities:
cd mat
python form_proxy.py
python band_proxy.py
python bulk_proxy.py
python metrics.py
Multi-property evaluation utilities:
cd mat-multi
python band_form.py
python form_bulk.py
python crystal_form.py
python crystal_band.py
python crystal_bulk.py
python metrics.py
The proxy model names and reconstructed CSV path are configured under proxy_model in the corresponding Hydra config.
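The post-training steps (sampling, reconstruction, evaluation) can be chained from one driver script. The sketch below only builds the command list. Running every evaluation script, in the order listed above, is an assumption, since a given run may need only some of them. `pipeline_commands` is a hypothetical helper.

```python
# Evaluation scripts as listed above for each code path.
EVALUATION_SCRIPTS = {
    "mat": ["form_proxy.py", "band_proxy.py", "bulk_proxy.py", "metrics.py"],
    "mat-multi": [
        "band_form.py",
        "form_bulk.py",
        "crystal_form.py",
        "crystal_band.py",
        "crystal_bulk.py",
        "metrics.py",
    ],
}


def pipeline_commands(workdir):
    """Return the sampling, reconstruction, and evaluation commands for one code path."""
    if workdir not in EVALUATION_SCRIPTS:
        raise ValueError(f"expected 'mat' or 'mat-multi', got {workdir!r}")
    scripts = ["sample_property.py", "sli_to_cif.py", *EVALUATION_SCRIPTS[workdir]]
    return [["python", script] for script in scripts]


for command in pipeline_commands("mat"):
    print(" ".join(command))
    # To execute from the repository root, so relative config paths resolve:
    # subprocess.run(command, cwd="mat", check=True)
```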
- This repository tracks code and configuration only.
- Use the same processed dataset, property columns, target values, weights, and proxy model versions to reproduce a specific result.
- train.py uses the first 90% of rows for training and the remaining 10% for validation.
- Sampling uses nucleus sampling with default temperature=1.2 and top_p=0.9.
- model_info_property_<N>.json stores the vocabulary and block size required for sampling.
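The two defaults noted above, the ordered 90/10 split and nucleus sampling, can be illustrated in plain Python. This is a sketch of the general techniques, not the repository's code: the function names are hypothetical, the rounding at the split boundary is an assumption, and the actual sampler operates on model tensors rather than Python lists.

```python
import math
import random


def split_rows(rows, train_fraction=0.9):
    """Ordered split: the first 90% of rows for training, the rest for validation."""
    cut = int(len(rows) * train_fraction)  # assumption: boundary is rounded down
    return rows[:cut], rows[cut:]


def nucleus_sample(logits, temperature=1.2, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda index: probs[index], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices renormalises the kept probabilities implicitly.
    return rng.choices(kept, weights=[probs[index] for index in kept], k=1)[0]


train, valid = split_rows(list(range(10)))
print(len(train), len(valid))
print(nucleus_sample([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Because the split is positional rather than random, the row order of the training CSV determines which samples end up in the validation set.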
The repository does not track datasets, weights, checkpoints, logs, generated samples, CIF files, plots, Python caches, or local OS files such as .DS_Store.
If you use this repository, please cite the associated manuscript:
Transformer-Based Generation of Inorganic Materials with Targeted Properties.
Manuscript submitted for journal review.
