import os
import argparse
import logging
import sys

import numpy as np
import pandas as pd
import mlflow
from sklearn.preprocessing import StandardScaler

SCALERS = {}

# Module-level logger so the functions below can log when this file is
# imported as a module; handlers are attached in the __main__ block.
logger = logging.getLogger(__name__)


def get_arg_parser(parser=None):
    """Parse the command line arguments for the preprocessing component using argparse.

    Args:
        parser (argparse.ArgumentParser or CompliantArgumentParser):
            an argument parser instance

    Returns:
        ArgumentParser: the argument parser instance

    Notes:
        if parser is None, creates a new parser instance
    """
    # add arguments that are specific to the component
    if parser is None:
        parser = argparse.ArgumentParser(description=__doc__)

    parser.add_argument(
        "--raw_training_data",
        type=str,
        required=True,
        help="Directory containing the raw training data (expects train.csv)",
    )
    parser.add_argument(
        "--raw_testing_data",
        type=str,
        required=True,
        help="Directory containing the raw testing data (expects test.csv)",
    )
    parser.add_argument(
        "--train_output",
        type=str,
        required=True,
        help="Directory where the processed training data is written",
    )
    parser.add_argument(
        "--test_output",
        type=str,
        required=True,
        help="Directory where the processed testing data is written",
    )
    parser.add_argument(
        "--metrics_prefix",
        type=str,
        required=False,
        default="default-prefix",
        help="Metrics prefix",
    )
    return parser
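
# A minimal usage sketch (paths are hypothetical) showing how the parser can
# be exercised in isolation, e.g. from a test or a notebook:
#
#   args = get_arg_parser().parse_args([
#       "--raw_training_data", "/data/raw/train_dir",
#       "--raw_testing_data", "/data/raw/test_dir",
#       "--train_output", "/data/processed/train_dir",
#       "--test_output", "/data/processed/test_dir",
#   ])
#   assert args.metrics_prefix == "default-prefix"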


def apply_transforms(df):
    """Applies transformations to datetime and numerical columns.

    Args:
        df (pd.DataFrame):
            dataframe to transform

    Returns:
        pd.DataFrame: transformed dataframe
    """
    global SCALERS

    datetimes = ["trans_date_trans_time"]  # "dob"
    normalize = [
        "age",
        "merch_lat",
        "merch_long",
        "lat",
        "long",
        "city_pop",
        "trans_date_trans_time",
        "amt",
    ]

    for column in datetimes:
        if column not in df.columns:
            continue
        # convert datetimes to int64 nanosecond timestamps so they can be
        # scaled like any other numerical column (Series.view is deprecated,
        # astype gives the same result)
        df.loc[:, column] = pd.to_datetime(df[column]).astype("int64")
    for column in normalize:
        if column not in df.columns:
            continue

        if column not in SCALERS:
            logger.debug(f"Creating scaler for column: {column}")
            # fit the scaler on the first dataframe that contains this column
            # (the training data) and reuse it for subsequent calls
            scaler = StandardScaler()
            scaler.fit(df[column].values.reshape(-1, 1))
            SCALERS[column] = scaler

        scaler = SCALERS.get(column)
        df.loc[:, column] = scaler.transform(df[column].values.reshape(-1, 1))

    return df
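
# A small sketch (synthetic values, not from the real dataset) illustrating
# that the first call fits the scalers and later calls reuse them:
#
#   train = apply_transforms(pd.DataFrame({"amt": [10.0, 20.0, 30.0]}))
#   test = apply_transforms(pd.DataFrame({"amt": [25.0]}))
#   # test["amt"] is scaled with the mean/std fitted on the train values above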


def preprocess_data(
    raw_training_data,
    raw_testing_data,
    train_data_dir="./",
    test_data_dir="./",
    metrics_prefix="default-prefix",
):
    """Preprocess the raw_training_data and raw_testing_data and save the processed data to train_data_dir and test_data_dir.

    Args:
        raw_training_data: Directory containing the raw training data to process
        raw_testing_data: Directory containing the raw testing data to process
        train_data_dir: Directory where the processed training data will be saved
        test_data_dir: Directory where the processed testing data will be saved
        metrics_prefix: Prefix under which metrics are logged to MLflow

    Returns:
        None
    """

    logger.info(
        f"Raw Training Data path: {raw_training_data}, Raw Testing Data path: {raw_testing_data}, "
        f"Processed Training Data dir path: {train_data_dir}, Processed Testing Data dir path: {test_data_dir}"
    )

    logger.debug("Loading data...")
    train_df = pd.read_csv(os.path.join(raw_training_data, "train.csv"), index_col=0)
    test_df = pd.read_csv(os.path.join(raw_testing_data, "test.csv"), index_col=0)

    # create the output directories before anything is written into them
    os.makedirs(train_data_dir, exist_ok=True)
    os.makedirs(test_data_dir, exist_ok=True)

    if "is_fraud" in train_df.columns:
        # ratio of negative to positive samples, used downstream to
        # re-weight the minority (fraud) class
        fraud_weight = (
            train_df["is_fraud"].value_counts()[0]
            / train_df["is_fraud"].value_counts()[1]
        )
        logger.debug(f"Fraud weight: {fraud_weight}")
        np.savetxt(
            os.path.join(train_data_dir, "fraud_weight.txt"), np.array([fraud_weight])
        )

    logger.debug("Applying transformations...")
    train_data = apply_transforms(train_df)
    test_data = apply_transforms(test_df)

    logger.debug(f"Train data samples: {len(train_data)}")
    logger.debug(f"Test data samples: {len(test_data)}")
    logger.info(f"Saving processed data to {train_data_dir} and {test_data_dir}")

    train_data.to_csv(os.path.join(train_data_dir, "data.csv"))
    test_data.to_csv(os.path.join(test_data_dir, "data.csv"))

    # MLflow logging
    log_metadata(train_data, test_data, metrics_prefix)
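
# Hypothetical direct invocation (placeholder paths), useful when debugging
# the component outside of a pipeline:
#
#   preprocess_data(
#       "/tmp/raw_train", "/tmp/raw_test",
#       train_data_dir="/tmp/processed_train",
#       test_data_dir="/tmp/processed_test",
#       metrics_prefix="debug",
#   )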


def log_metadata(train_df, test_df, metrics_prefix):
    """Log the number of train/test datapoints to the root MLflow run.

    Args:
        train_df (pd.DataFrame): processed training data
        test_df (pd.DataFrame): processed testing data
        metrics_prefix (str): prefix under which the metrics are logged
    """
    with mlflow.start_run() as mlflow_run:
        # get the MLflow client
        mlflow_client = mlflow.tracking.client.MlflowClient()
        # log against the root (pipeline) run so that metrics from all
        # components show up in one place
        root_run_id = mlflow_run.data.tags.get("mlflow.rootRunId")
        logger.debug(f"Root runId: {root_run_id}")
        if root_run_id:
            mlflow_client.log_metric(
                run_id=root_run_id,
                key=f"{metrics_prefix}/Number of train datapoints",
                value=train_df.shape[0],
            )

            mlflow_client.log_metric(
                run_id=root_run_id,
                key=f"{metrics_prefix}/Number of test datapoints",
                value=test_df.shape[0],
            )


def main(cli_args=None):
    """Component main function.

    It parses arguments and executes run() with the right arguments.

    Args:
        cli_args (List[str], optional): list of args to feed script, useful for debugging. Defaults to None.
    """
    # build an arg parser
    parser = get_arg_parser()
    # run the parser on cli args
    args = parser.parse_args(cli_args)
    logger.info(f"Running script with arguments: {args}")

    def run():
        """Run script with arguments (the core of the component).

        Uses the parsed args from the enclosing scope.
        """

        preprocess_data(
            args.raw_training_data,
            args.raw_testing_data,
            args.train_output,
            args.test_output,
            args.metrics_prefix,
        )

    run()


if __name__ == "__main__":
    # send DEBUG-level logging to sys.stdout
    logger.setLevel(logging.DEBUG)
    log_format = logging.Formatter("[%(asctime)s] [%(levelname)s] - %(message)s")
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(log_format)
    logger.addHandler(handler)

    main()
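
# Example command line (hypothetical paths; the script filename is assumed),
# mirroring how a pipeline component might call this script:
#
#   python preprocessing.py \
#       --raw_training_data /data/raw/train_dir \
#       --raw_testing_data /data/raw/test_dir \
#       --train_output /data/processed/train_dir \
#       --test_output /data/processed/test_dir \
#       --metrics_prefix preprocessing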