Pipeline Configuration

Edit the attributes in the user_configuration.yml file to customize how PAL 2.0 runs. The attributes for setting up the pipeline are listed below:

Directory Attributes

  • run_folder
    • Path to the cloned repository on your system, preceded by the home anchor and ending with a /

    • Example: C:/Users/MatDisc_ML/

    • This value is required; there is no default.

  • output_folder
    • List with the home reference as the first element and, as the second element, the path of the directory where the Bayesian Optimization outputs are saved, ending with a /

    • Values = Any string literal for the second element

    • Default value = bo_output/
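
The trailing-slash convention for the directory attributes can be checked before a run. The following is a minimal sketch and not part of PAL 2.0: the helper name `check_directory_attributes` is hypothetical, and the configuration is assumed to be already loaded into a Python dict. Because the exact form of the home anchor is not specified here, the sketch inspects only the last element when an attribute is given as a list.

```python
def check_directory_attributes(config):
    """Return a list of problems found in the directory attributes.

    Hypothetical helper for illustration; `config` is assumed to be a dict
    holding the contents of user_configuration.yml.
    """
    problems = []
    # run_folder is required and has no default value.
    if "run_folder" not in config:
        problems.append("run_folder is required")
    for key in ("run_folder", "output_folder"):
        if key not in config:
            continue
        value = config[key]
        # A list is read as [home reference, path]; only the path is checked.
        path = value[-1] if isinstance(value, (list, tuple)) else value
        if not str(path).endswith("/"):
            problems.append(f"{key} must end with a /")
    return problems
```

With the example value from above, `check_directory_attributes({"run_folder": "C:/Users/MatDisc_ML/"})` returns an empty list.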

General Attributes

  • test_size
    • Fraction of the unobserved materials space that needs to be explored by Bayesian Optimization.

    • Example: With a total of 100 materials, test_size = 0.9 means that 10 materials are used to train the surrogate models initially and the remaining 90 are explored by Bayesian Optimization.

    • Values = Floating point in (0,1)

    • Default value = 0.9

  • verbose
    • Set to True if we want to print the progress of the code and Bayesian Optimization iterations without too much detail regarding the fitting process. It mainly prints out the outputs of the Bayesian Optimization iterations.

    • Values = True or False

    • Default value = True

  • deep_verbose
    • Set to True if we want to print the progress of the code and Bayesian Optimization iterations with all details including training of the Gaussian Process models, prior means and Bayesian Optimization iteration outputs.

    • Values = True or False

    • Default value = False
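
The arithmetic behind the test_size example can be sketched as follows. This is an illustration only: the function name `split_counts` is hypothetical, and rounding to the nearest integer is an assumption, since the pipeline's own rounding rule is not stated here.

```python
def split_counts(n_total, test_size=0.9):
    """Return (n_initial, n_explore) for a dataset of n_total materials.

    Hypothetical helper; the rounding rule is an assumption.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must lie in the open interval (0, 1)")
    # Materials left unobserved for Bayesian Optimization to explore.
    n_explore = round(n_total * test_size)
    # The remainder is used to train the surrogate models initially.
    n_initial = n_total - n_explore
    return n_initial, n_explore


print(split_counts(100, 0.9))  # (10, 90)
```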

Input Attributes

  • dataset_folder
    • Folder name suffix used when saving the Bayesian Optimization output for a given dataset.

    • The standard format of the output folder name is: dataset_folder + test_size + p_Run + Run number

    • Values = Any string literal

    • Default value = newDataset

  • InputType
    • The format in which your dataset is stored.

    • Values = Gryffin, PerovAlloys, PALSearch, MPEA

    • Default value = Gryffin

    • Given below is a table which shows the various input types and their associated file extensions:

      InputType      File Extension
      Gryffin        .pkl
      PerovAlloys    .csv
      PALSearch      .xls, .xlsx
      MPEA           .xls, .xlsx

  • InputPath
    • List with the home reference as the first element and, as the second element, the path of the directory where the dataset is saved, ending with a /

    • Values = Name of the directory where the dataset is stored

    • Default value = datasets/

  • InputFile
    • Name of the dataset file.

    • Values = Filename of the dataset being used

    • Default value = perovskites_GRYFFIN.pkl

  • AddTargetNoise
    • Set to True if we want to add a small amount of Gaussian noise to the target property.

    • Values = True or False

    • Default value = False
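
Since each InputType is tied to specific file extensions, a quick consistency check between InputType and InputFile can catch a mismatched configuration early. The sketch below is an assumption-laden illustration, not code from PAL 2.0: the names `EXTENSIONS` and `input_file_matches_type` are hypothetical, and the mapping simply restates the InputType table.

```python
import os

# Mapping restated from the InputType table; the names here are hypothetical.
EXTENSIONS = {
    "Gryffin": (".pkl",),
    "PerovAlloys": (".csv",),
    "PALSearch": (".xls", ".xlsx"),
    "MPEA": (".xls", ".xlsx"),
}


def input_file_matches_type(input_type, input_file):
    """Return True if the extension of input_file is listed for input_type."""
    if input_type not in EXTENSIONS:
        raise ValueError(f"unknown InputType: {input_type!r}")
    # splitext returns (root, extension); compare case-insensitively.
    extension = os.path.splitext(input_file)[1].lower()
    return extension in EXTENSIONS[input_type]
```

For the documented defaults, `input_file_matches_type("Gryffin", "perovskites_GRYFFIN.pkl")` evaluates to True.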

Feature Selection Attributes

  • test_size_fs
    • Fraction of the data to be used to do feature selection.

    • In the case of running Bayesian Optimization, this needs to be set the same as the test_size variable mentioned earlier.

    • Values = Floating point in (0,1)

    • Default value = 0.1

  • select_features_otherModels
    • Set to True if we want to do feature selection of input descriptors for all models other than the Gaussian Process - Neural Network model.

    • Values = True or False

    • Default value = True

  • select_features_NN
    • Set to True if we want to do feature selection of input descriptors for the Gaussian Process - Neural Network model.

    • Values = True or False

    • Default value = True

  • random_state
    • Seed used when dividing the dataset into train and test sets for feature engineering.

    • Values = Any positive real number

    • Default value = 40

  • onlyImportant
    • Set to True if we want to output only the selected features from the list of input features.

    • Values = True or False

    • Default value = False
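
The role of random_state can be illustrated with a seeded split: the same seed always yields the same partition, which keeps feature selection reproducible across runs. This is a sketch using only the Python standard library, not the pipeline's actual splitting routine. The function name `seeded_split` is hypothetical, and which side of the split feature selection uses is not specified here, so the two parts are returned without that interpretation.

```python
import random


def seeded_split(items, test_size_fs=0.1, random_state=40):
    """Split items into (remainder, fraction_part) reproducibly.

    Hypothetical helper; fraction_part holds roughly test_size_fs of the items.
    """
    # A local generator keeps the global random state untouched.
    rng = random.Random(random_state)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_fraction = round(len(shuffled) * test_size_fs)
    return shuffled[n_fraction:], shuffled[:n_fraction]


first = seeded_split(range(20))
second = seeded_split(range(20))
print(first == second)  # True: same seed, same split
```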

Surrogate Models Training Attributes

  • train_NN
    • Set to True if we want to train the Neural Network model first, before using it as a prior mean to fit the Gaussian Process model.

    • Values = True or False

    • Default value = True

  • saveModel_NN
    • Set to True if we want to save the Neural Network model in a file after fitting.

    • This has to be set to True if we are training the Neural Network model with the given initial data for the first time.

    • Values = True or False

    • Default value = True

  • train_GP
    • Set to True if we want to train the Gaussian Process models

    • Values = True or False

    • Default value = True

  • predict_NN
    • Set to True if we want to use the Neural Network model to do predictions

    • Values = True or False

    • Default value = False
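
The documented defaults for the surrogate training flags, together with the note that saveModel_NN must be True on a first training run, can be combined into a small resolution step. The following is a minimal sketch under stated assumptions: `SURROGATE_DEFAULTS` and `resolve_surrogate_flags` are hypothetical names, and because the sketch cannot tell whether a run is the first one, it only emits a reminder whenever train_NN is True and saveModel_NN is False.

```python
# Defaults as documented above; the dict and function names are hypothetical.
SURROGATE_DEFAULTS = {
    "train_NN": True,
    "saveModel_NN": True,
    "train_GP": True,
    "predict_NN": False,
}


def resolve_surrogate_flags(user_values):
    """Merge user values over the defaults and return (flags, notes)."""
    # Only the four documented flags are read; other keys are ignored.
    flags = {
        name: user_values.get(name, default)
        for name, default in SURROGATE_DEFAULTS.items()
    }
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be True or False")
    notes = []
    # Reminder only: a first-time run cannot be detected from the flags alone.
    if flags["train_NN"] and not flags["saveModel_NN"]:
        notes.append("saveModel_NN must be True when training the NN for the first time")
    return flags, notes
```

Calling `resolve_surrogate_flags({})` returns the documented defaults and an empty list of notes.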