Tutorial ====================== Knowing how to set up valid YAML configuration files is fundamental to properly use **PyDGN**. Custom behavior with more advanced functionalities can be generally achieved by subclassing the individual modules we provide, but this is very much dependent on the specific research project. Data Preprocessing *********************** The ML pipeline starts with the creation of the dataset and of the data splits. The general template that we can use is the following, with an explanation of each field as a comment: .. code-block:: yaml splitter: root: # folder where to store the splits class_name: # dotted path to splitter class args: n_outer_folds: # number of outer folds for risk assessment n_inner_folds: # number of inner folds for model selection seed: stratify: # target stratification: works for graph classification tasks only shuffle: # whether to shuffle the indices prior to splitting inner_val_ratio: # percentage of validation for hold-out model selection. this will be ignored when the number of inner folds is > than 1 outer_val_ratio: # percentage of validation data to extract for risk assessment final runs test_ratio: # percentage of test to extract for hold-out risk assessment. this will be ignored when the number of outer folds is > than 1 dataset: root: # path to data root folder class_name: # dotted path to dataset class args: # arguments to pass to the dataset class arg_name1: arg_namen: transform: # on the fly transforms: useful for social datasets with no node features (with an example) - class_name: pydgn.data.transform.ConstantIfEmpty args: value: 1 # pre_transform: # transform data and store it at dataset creation time # pre_filter: # filter data and store it at dataset creation time Data Splitting ------------------- We provide a general :class:`~pydgn.data.splitter.Splitter` class that is able to split a dataset of multiple graphs. The most important parameters are arguably ``n_outer_folds`` and ``n_inner_folds``, which represent the way in which we want to perform **risk assessment** and **model selection**. For instance: * ``n_outer_folds=10`` and ``n_inner_folds=1``: 10-fold external Cross Validation (CV) on test data, with hold-out model selection inside each of the 10 folds, * ``n_outer_folds=5`` and ``n_inner_folds=3``: Nested CV, * ``n_outer_folds=1`` and ``n_inner_folds=1``: Simple Hold-out model assessment and selection, or ``train/val/test`` split. We assume that the difference between **risk assessment** and **model selection** is clear to the reader. If not, please refer to `Samy Bengio's lecture (Part 3) `_. Here's an snippet of a potential configuration file that splits a graph classification dataset: .. code-block:: yaml splitter: root: examples/DATA_SPLITS/CHEMICAL class_name: pydgn.data.splitter.Splitter args: n_outer_folds: 10 n_inner_folds: 1 seed: 42 stratify: True shuffle: True inner_val_ratio: 0.1 outer_val_ratio: 0.1 test_ratio: 0.1 Dataset Creation ------------------- To create your own dataset, you should implement the :class:`~pydgn.data.dataset.DatasetInterface` interface. For instance, we provide a wrapper around the `TUDataset `_ dataset of Pytorch Geometric in :class:`~pydgn.data.dataset.TUDatasetInterface`, which you can check to get an idea. Here's an snippet of a potential configuration file that downloads and processes the ``PROTEINS`` graph classification dataset: .. code-block:: yaml dataset: root: DATA/ class_name: pydgn.data.dataset.TUDataset args: root: DATA/ name: PROTEINS You can also apply ``transform``, ``pre_transform`` and ``pre_filter`` that follow the same semantic of PyG. Once our data configuration file is ready, we can create the dataset using (for the example above) .. code-block:: bash pydgn-dataset --config-file examples/DATA_CONFIGS/config_PROTEINS.yml Experiment Setup ********************** Once we have created a dataset and its data splits, it is time to implement our model and define a suitable task. Every model must implement the :class:`~pydgn.model.interface.ModelInterface` interface, and it can optionally use a readout module that must implement the :class:`~pydgn.model.interface.ReadoutInterface`. At this point, it is time to define the experiment. The general template that we can use is the following, with an explanation of each field as a comment: .. code-block:: yaml # Dataset and Splits data_root: # path to DATA root folder (same as in data config file) dataset_class: # dotted path to dataset class dataset_name: # dataset name (same as in data config file) data_splits_file: # path to data splits file # Hardware device: # cpu | cuda max_cpus: # > 1 for parallelism max_gpus: # > 0 for gpu usage (device must be cuda though) gpus_per_task: # percentage of gpus to allocate for each task gpus_subset: # optional, comma-separated list of gpu indices, e.g. 0,2. Used to force a particular subset of GPUs being used. # Data Loading dataset_getter: # dotted path to dataset provider class data_loader: class_name: # dotted path to data loader class args: num_workers : pin_memory: # possibly other arguments (we set `worker_init_fn`, `sampler` and `shuffle`, so do not override) # Reproducibility seed: 42 # Experiment result_folder: # path of the folder where to store results exp_name: # name of the experiment experiment: # dotted path to experiment class higher_results_are_better: # model selection: should we select based on max (True) or min (False) main score? evaluate_every: # evaluate on train/val/test every `n` epochs and log results final_training_runs: # how many final (model assessment) training runs to perform to mitigate bad initializations # Grid Search # if only 1 configuration is selected, any inner model selection will be skipped grid: supervised_config: model: # dotted path to model class checkpoint: # whether to keep a checkpoint of the last epoch to resume training shuffle: # whether to shuffle the data batch_size: # batch size epochs: # number of maximum training epochs # Model specific arguments # # TBD # ------------------------ # # Optimizer (with an example - 3 possible alternatives) optimizer: - class_name: pydgn.training.callback.optimizer.Optimizer args: optimizer_class_name: torch.optim.Adam lr: - 0.01 - 0.001 weight_decay: 0. - class_name: pydgn.training.callback.optimizer.Optimizer args: optimizer_class_name: torch.optim.Adagrad lr: - 0.1 weight_decay: 0. # Scheduler (optional) scheduler: null # Loss metric (with an example of Additive Loss) loss: - class_name: pydgn.training.callback.metric.AdditiveLoss args: loss_1: pydgn.training.callback.metric.MulticlassClassification loss_2: pydgn.training.callback.metric.MulticlassClassification # Score metric (with an example of Multi Score) scorer: - class_name: pydgn.training.callback.metric.MultiScore args: main_scorer: pydgn.training.callback.metric.MulticlassAccuracy my_second_metric: pydgn.training.callback.metric.ToyMetric # Readout (optional) readout: # Training engine engine: pydgn.training.engine.TrainingEngine # Gradient clipper (optional) gradient_clipper: null # Early stopper (optional, with an example of "patience" early stopping on the validation score) early_stopper: - class_name: - pydgn.training.callback.early_stopping.PatienceEarlyStopper args: patience: - 5 # SYNTAX: (train_,validation_)[name_of_the_scorer_or_loss_to_monitor] -> we can use MAIN_LOSS or MAIN_SCORE monitor: validation_main_score mode: max # is best the `max` or the `min` value we are monitoring? checkpoint: True # store the best checkpoint # Plotter of metrics plotter: pydgn.training.callback.plotter.Plotter Data Information ----------------- Here we can specify some information about the dataset: .. code-block:: yaml data_root: DATA dataset_class: pydgn.data.dataset.TUDatasetInterface dataset_name: PROTEINS data_splits_file: examples/DATA_SPLITS/CHEMICAL/PROTEINS/PROTEINS_outer10_inner1.splits Hardware ----------------- Here we can define how many resources to allocate to parallelize different experiments: .. code-block:: yaml # this will run a maximum of 4 experiments to allocate all of the 2 gpus we have. # We use some more cpu resources to take into account potential `data loader workers `_. device: cuda max_cpus: 8 max_gpus: 2 gpus_per_task: 0.5 Data Loading ----------------- Here we specify which :class:`~pydgn.data.provider.DataProvider` we want to use to load the data associated with the given splits, and the :class:`DataLoader` that needs to handle such data: .. code-block:: yaml # Data Loading dataset_getter: pydgn.data.provider.DataProvider data_loader: class_name: torch_geometric.loader.DataLoader args: num_workers : 2 pin_memory: True # should be True when device is set to `cuda` Experiment Details -------------------- Here we define the experiment details, including the experiment name and type, and the folder where we want to store our results: .. code-block:: yaml result_folder: RESULTS exp_name: supervised_grid_search_toy experiment: pydgn.experiment.supervised_task.SupervisedTask higher_results_are_better: True evaluate_every: 3 final_training_runs: 3 Grid Search -------------- Grid search is identified by the keyword ``grid`` after the experimental details. This is the easiest setting, in which you can define lists associated to an hyper-parameter and all possible combinations will be created. You can even have nesting of these combinations for maximum flexibility. There is one config file ``examples/MODEL_CONFIGS/config_SupToyDGN.yml`` that you can check to get a better idea. Random Search -------------- Random search, on the other hand, is identified by the keyword ``random`` after the experimental details. One line above we have to specify the number of random trials, using the keyword ``num_samples``. We provide different sampling methods: * choice --> pick at random from a list of arguments * uniform --> pick uniformly from min and max arguments * normal --> sample from normal distribution with mean and std * randint --> pick at random from min and max * loguniform --> pick following the recprocal distribution from log_min, log_max, with a specified base There is one config file ``examples/MODEL_CONFIGS/config_SupToyDGN_RandomSearch.yml`` that you can check to get a better idea. Experiment -------------- Depending on the experiment type, different main keywords are required: * :class:`~pydgn.experiment.supervised_task.SupervisedTask` -> ``supervised_config`` * :class:`~pydgn.experiment.semi_supervised_task.SemiSupervisedTask` -> ``unsupervised_config`` and ``supervised_config`` Note that an **unsupervised** task may be seen as a supervised task with a loss objective that does not depend on the target. In the **supervised** task, we have the "standard" training procedure. In the **semi-supervised** task, we expect a first training phase which is unsupervised and that can produce node/graph embeddings. After that, a second, supervised model is trained on such embeddings, but adjacency information is not preserved in this second stage. Inside the dictionary associated to the keyword ``[un]supervised_config``, we expect another number of keywords to be present: .. code-block:: yaml model: # dotted path to model class checkpoint: # whether to keep a checkpoint of the last epoch to resume training shuffle: # whether to shuffle the data batch_size: epochs: # number of maximum training epochs # Model specific arguments # # ------------------------ # optimizer: scheduler: # (can be "null") loss: scorer: readout: # (can be "null") engine: # training engine gradient_clipper: # (can be "null") early_stopper: # (can be "null") plotter: # (can be "null") Once our experiment configuration file is ready, we can launch an experiment using (see below for a couple of examples) .. code-block:: bash pydgn-train --config-file examples/MODEL_CONFIGS/config_SupToyDGN.yml or .. code-block:: bash pydgn-train --config-file examples/MODEL_CONFIGS/config_SemiSupToyDGN.yml And we are up and running! .. image:: _static/exp_gui.png :width: 600 To stop the computation, use ``CTRL-C`` to send a ``SIGINT`` signal, and consider using the command ``ray stop`` to stop all Ray processes. **Warning:** ``ray stop`` stops **all** ray processes you have launched, including those of other experiments in progress, if any. Some things to notice: because we have chosen a 10-fold CV for risk assessment with a simple hold-out model selection **for each** external fold, you can notice in the picture there are ``10*1`` rows with ``Out_*/Inn_*`` written. For each of these, we have to perform a model selection with ``4`` possible hyper-parameters' configurations (progress shown on the right handside). In addition, there are also some stats about the time required to complete the experiments. After the 10 model selection are complete (i.e., one "best" model for each outer/external fold), it is time to re-train the chosen models on the 10 different train/test splits. Therefore, you can notice ``10`` rows with ``Final run *`` written. Since we have specified ``final_training_runs: 3`` in our exp. config file, we will mitigate unlucky random initializations of the chosen models by averaging test results (of a single outer fold) over 3 training runs. The final generalization performances of the model (a less ambiguous definition would be: the **class of models** you developed) is obtained, for this specific case, as the average of the 10 test scores across the external folds. Again, if this does not make sense to you, please consider reading `Samy Bengio's lecture (Part 3) `_. Inspecting Results -------------------- According to our configuration file, the results are stored in the ``RESULTS`` folder. The hierarchy of folder is the following: .. code-block:: supervised_grid_search_toy_PROTEINS |__ MODEL_ASSESSMENT |__ assessment_results.json # contains the risk assessment results (average of the outer fold results) |__ OUTER_FOLD_1 ... |__ OUTER_FOLD_5 |__ outer_results.json # contains the aggregated results of the three final runs |__ final_run_1 |__ final_run_2 |__ tensorboard # tensorboard folder |__ experiment.log # log file with profiling information |__ best_checkpoint.pth # torch dict holding the "best" checkpoint information according to the early stopper used |__ last_checkpoint.pth # torch dict holding the checkpoint information of the last epoch (top ``checkpoint`` keyword set to true) |__ run_2_results.torch # torch dict holding the results of the 2nd final run on the 5th outer fold. |__ final_run_3 |__ MODEL_SELECTION # files regarding the model selection inside the 5th outer fold |__ winner_config.json # contains the "best model" across the inner folds (in this case just 1 inner fold) for the 5th fold to be used in the final training runs |__ config_1 |__ config_2 |__ config_3 |__ config_results.json # contains the aggregated results of the K inner model selection folds |__ INNER_FOLD_1 # first (and only in this case) inner model selection fold |__ tensorboard |__ experiment.log # log file with profiling information |__ best_checkpoint.pth |__ last_checkpoint.pth |__ fold_1_results.torch # torch dict holding the results of the 1st fold results of the 3rd configuration. |__ config_4 ... |__ OUTER_FOLD_10 Profiling Information ----------------------- Inside each ``experiment.log`` file, you will find training logs and, at the end of each training, the profiler information with the per-epoch and total time required by each :class:`~pydgn.training.event.handler.EventHandler`, provided the time spent is non-negligible (threshold specified in the log file). Here's what it looks like: .. image:: _static/profiler.png :width: 600 Tensorboard ----------------------- We can use the generic :class:`~pydgn.training.callback.plotter.Plotter` class to easily visualize the training trend with Tensorboard, using the information in the ``tensorboard`` folder: .. code-block:: bash tensorboard --logdir RESULTS/supervised_grid_search_toy_PROTEINS/MODEL_ASSESSMENT/OUTER_FOLD_1/final_run1/tensorboard/ And we get: .. image:: _static/tensorboard.png :width: 600 Filtering Configurations for Post-processing of Results ---------------------------------------------------------- You can use some utilities we provide to focus on a specific set of configurations after your experiments are terminated. Assuming you run `pydgn-train --config-file examples/MODEL_CONFIGS/config_SupToyDGN.yml` inside the PyDGN repo, you can then do something like .. code-block:: python3 from pydgn.evaluation.util import retrieve_experiments, filter_experiments configs = retrieve_experiments('RESULTS/supervised_grid_search_toy_NCI1/MODEL_ASSESSMENT/OUTER_FOLD_1/MODEL_SELECTION/') print(len(configs)) # will return 32 filtered_configs = filter_experiments(configs, logic='OR', parameters={'Multiclass Classification': 1, 'lr': 0.001}) print(len(filtered_configs)) # will return 24 Loading Model for Inspection in a Notebook ---------------------------------------------- We provide utilities to use your model immediately after experiments end to run additional analyses. Here's how: .. code-block:: python3 from pydgn.evaluation.util import * config = retrieve_best_configuration('RESULTS/supervised_grid_search_toy_NCI1/MODEL_ASSESSMENT/OUTER_FOLD_1/MODEL_SELECTION/') splits_filepath = 'examples/DATA_SPLITS/CHEMICAL/NCI1/NCI1_outer10_inner1.splits' device = 'cpu' # instantiate dataset dataset = instantiate_dataset_from_config(config) # instantiate model model = instantiate_model_from_config(config, dataset, config_type="supervised_config") # load model's checkpoint, assuming the best configuration has been loaded checkpoint_location = 'RESULTS/supervised_grid_search_toy_NCI1/MODEL_ASSESSMENT/OUTER_FOLD_1/final_run1/best_checkpoint.pth' load_checkpoint(checkpoint_location, model, device=device) # you can now call the forward method of your model y, embeddings = model(dataset[0]) # ------------------------------------------------------------------ # # OPTIONAL: you can also instantiate a DataProvider to load TR/VL/TE splits specific to each fold data_provider = instantiate_data_provider_from_config(config, splits_filepath) # select outer fold 1 (indices start from 0) data_provider.set_outer_k(0) # select inner fold 1 (indices start from 0) data_provider.set_inner_k(0) # set exp seet for workers (does not affect inference) data_provider.set_exp_seed(42) # any seed # load loaders associated with final runs of outer 1 split train_loader = data_provider.get_outer_train() val_loader = data_provider.get_outer_train() test_loader = data_provider.get_outer_train() # Please refer to the DataProvider documentation to use it properly. # ------------------------------------------------------------------ # Telegram Bot ----------------------- Once you have a Telegram bot token and chat id, it is super easy to set up automatic reporting of the main results! Create a file ``telegram_config.yml`` in the main project folder, and set it up like this: .. code-block:: yaml bot_token: [YOUR TOKEN] bot_chat_ID: [YOUR CHAT ID] log_model_selection: True # logs the best config for each outer fold (validation score) log_final_runs: True # logs the outcome of the final runs for each outer fold (test score) Inside your experiment configuration file (see example in ``examples/MODEL_CONFIGS/config_SupToyDGN.yml``), it is sufficient to specify your telegram configuration file by adding: .. code-block:: yaml # Telegram Bot telegram_config_file: telegram_config.yml And that's all you have to do to start receiving messages when the model selection/final runs for a specific fold end! You will also receive a message when the experiment terminates with the test score.