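This article is the Day 23 entry in the IPFactory OB Advent Calendar 2023.

Hi, I'm futabato.

Today I'll try out LEAF, a benchmarking framework for federated learning from CMU. I couldn't readily find hands-on tech blog posts about it even in English, so I hope this is useful.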

LEAF -A Benchmark for Federated Settings-

About LEAF

LEAF is a benchmarking framework for learning in federated settings, with applications including federated learning, multi-task learning, meta-learning, and on-device learning.

leaf.cmu.edu

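LEAF provides six datasets covering five tasks. The basic statistics of each dataset are listed below. Because the data in LEAF is Non-IID, the basic statistics are the number of users (which correspond to clients) and the number of samples each user holds.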

FEMNIST

Task: Image Classification
Statistics:
- number of users: 3,550
- number of samples: 805,263
- mean(samples per user): 226.83
- std: 88.94
- std/mean: 0.39

Shakespeare

Task: Next Character Prediction
Statistics:
- number of users: 1,129
- number of samples: 4,226,158
- mean(samples per user): 3,743.28
- std: 6,212.26
- std/mean: 1.66

Twitter

Task: Sentiment Analysis
Statistics:
- number of users: 660,120
- number of samples: 1,600,498
- mean(samples per user): 2.42
- std: 4.71
- std/mean: 1.94

Celeba

Task: Image Classification
Statistics:
- number of users: 9,343
- number of samples: 200,288
- mean(samples per user): 21.44
- std: 7.63
- std/mean: 0.36

Synthetic Dataset

Task: Classification
Statistics:
- number of users: 1,000
- number of samples: 107,553
- mean(samples per user): 107.55
- std: 213.22
- std/mean: 1.98

Reddit

Task: Language Modeling
Statistics:
- number of users: 1,660,820
- number of samples: 56,587,343
- mean(samples per user): 34.07
- std: 62.9
- std/mean: 1.84
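
The three statistics reported for each dataset above can be derived from a list of per-user sample counts. The following is a minimal sketch, not LEAF's own script: the function name `summarize_users` and the toy counts are hypothetical, and it assumes the reported value is the population standard deviation.

```python
import statistics


def summarize_users(samples_per_user):
    """Summarize how unevenly samples are spread across users.

    `samples_per_user` holds one sample count per user.
    Assumption: the population standard deviation is used.
    """
    num_users = len(samples_per_user)
    num_samples = sum(samples_per_user)
    mean = num_samples / num_users
    std = statistics.pstdev(samples_per_user)
    return {
        "num_users": num_users,
        "num_samples": num_samples,
        "mean": mean,
        "std": std,
        "std_over_mean": std / mean,
    }


# Toy counts for eight hypothetical users (not LEAF data).
stats = summarize_users([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```

A std/mean ratio well above 1, as listed for Twitter or Reddit, suggests that a small number of users hold far more samples than the typical user.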

Trying out LEAF

Here I'll look at how to go about using FEMNIST. The original repository is here.

github.com

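The last commit to this repository was more than two years ago, and it targets Python 3.5 and TensorFlow 1.13.1. Setting up that environment is painful, yet LEAF shows up often in the papers I read, so I wondered whether everyone was struggling in the same way.

I figured that even if the original GitHub repository hasn't been updated, someone must have made it easier to use. After some searching I found a framework that exposes the datasets as PyTorch DataLoaders, so that looks like the one to use.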

fedlab.readthedocs.io

github.com

With FedLab, you can use the datasets commonly used in federated learning research, including the six LEAF datasets.

Installation

FedLab is available through pip, and a requirements.txt is also provided.

pypi.org

This time I'll install it with Poetry. My pyproject.toml looks like this.

[tool.poetry]
name = "fedlab-feminist"
version = "0.1.0"
description = ""
authors = ["futabato <01futabato10@gmail.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"
Pillow = "<=9.5.0"
fedlab = "^1.3.0"
spacy = "^3.7.2"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

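Note that if you set name under [tool.poetry] to fedlab, it clashes with fedlab under [tool.poetry.dependencies] and the installation fails.

If you install by specifying only fedlab==1.3.0, Pillow>=10.0.0 gets installed, but FedLab uses a method that was only available up to Pillow 9.5.0, so I pin the Pillow version up front. spacy is needed to run the dataset creation scripts. If you don't install spacy, you need to tweak the code slightly so that the parts using the Tokenizer are never called.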
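
To confirm that the pin took effect, the installed Pillow version can be checked from Python. This is a minimal sketch using only the standard library; the helper names `parse_version`, `is_at_most`, and `pillow_is_compatible` are made up for illustration, and the 9.5.0 ceiling is taken from the constraint above.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a string such as '9.5.0' into a 3-tuple of ints, ignoring any suffix."""
    numbers = re.findall(r"\d+", text)[:3]
    numbers += ["0"] * (3 - len(numbers))  # pad '9.5' to '9.5.0'
    return tuple(int(n) for n in numbers)


def is_at_most(installed, ceiling):
    """Return True if `installed` is less than or equal to `ceiling`."""
    return parse_version(installed) <= parse_version(ceiling)


def pillow_is_compatible(ceiling="9.5.0"):
    """Return True/False for the installed Pillow, or None if Pillow is absent."""
    try:
        installed = metadata.version("Pillow")
    except metadata.PackageNotFoundError:
        return None
    return is_at_most(installed, ceiling)


print(pillow_is_compatible())
```

Running this inside the Poetry environment (for example through `poetry run python`) should print True when the constraint was honored.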

Tutorial

Download and preprocess data

FedLab/datasets/femnist/ contains a script, preprocess.sh, that downloads and preprocesses the dataset, so I use that.

stats.sh lives in the same directory. Running it calls FedLab/datasets/utils/stats.py, which prints statistics for the FEMNIST dataset produced by preprocess.sh.

Run preprocess.sh. An example invocation:

bash preprocess.sh -s niid --sf 0.05 -k 0 -t sample --tf 0.9

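A brief explanation of the options:

-s: whether to sample in an IID or Non-IID manner
--sf: fraction of the data to sample
-k: minimum number of samples per user
-t: whether to partition users into train and test groups, or to partition each user's samples into train and test data
--tf: train/test split ratio (0.9 puts 90% in the training set)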
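
To make the `-k`, `-t`, and `--tf` options more concrete, here is a rough sketch of the underlying ideas on toy data: dropping users with too few samples, then splitting either each user's samples or the users themselves. It is not the code that preprocess.sh runs; the function names, the dictionary layout, and the fixed seed are all assumptions for illustration.

```python
import random


def filter_users(user_samples, min_samples):
    """Keep only users with at least `min_samples` samples (the idea behind -k)."""
    return {u: s for u, s in user_samples.items() if len(s) >= min_samples}


def split_by_sample(user_samples, train_frac, seed=0):
    """Split every user's samples into train/test (the idea behind -t sample)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, samples in user_samples.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train[user] = shuffled[:cut]
        test[user] = shuffled[cut:]
    return train, test


def split_by_user(user_samples, train_frac, seed=0):
    """Assign whole users to train or test (the idea behind -t user)."""
    rng = random.Random(seed)
    users = sorted(user_samples)
    rng.shuffle(users)
    cut = int(len(users) * train_frac)
    train = {u: user_samples[u] for u in users[:cut]}
    test = {u: user_samples[u] for u in users[cut:]}
    return train, test


# Toy data: three hypothetical users with 10, 20, and 2 samples.
toy = {"u0": list(range(10)), "u1": list(range(20)), "u2": [0, 1]}
kept = filter_users(toy, min_samples=5)
train, test = split_by_sample(kept, train_frac=0.9)
user_train, user_test = split_by_user(kept, train_frac=0.5)
print({u: (len(train[u]), len(test[u])) for u in kept})
```

With `-t sample`, every user appears in both splits, whereas with `-t user` the test set contains only users never seen during training.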

Pickle file stores Dataset

To speed up data loading, FedLab converts the raw data into Dataset objects and pickles them. Loading the resulting pickle files gives you the Dataset for each client.

FedLab/datasets/ contains gen_pickle_dataset.sh, which simply runs pickle_dataset.py.

Run gen_pickle_dataset.sh. An example invocation:

bash gen_pickle_dataset.sh "femnist" "../datasets" "./femnist/pickle_data/"

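A brief explanation of the arguments:
- first argument (dataset): dataset name
- second argument (data_root): path where the data is stored
- third argument (pickle_root): path where the pickle files are saved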

The command above created two directories, train and test, under FedLab/datasets/femnist/pickle_data/femnist/. The pickle files are stored inside these two directories.
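
The idea of storing one pickle file per client and per split can be sketched with the standard library alone. This is only an illustration of the approach: the file naming (`client_<id>.pkl`), the helper names, and the toy records are assumptions and do not reflect FedLab's actual on-disk layout.

```python
import pickle
import tempfile
from pathlib import Path


def save_client_pickles(pickle_root, split, client_data):
    """Write one pickle file per client under <pickle_root>/<split>/."""
    split_dir = Path(pickle_root) / split
    split_dir.mkdir(parents=True, exist_ok=True)
    for client_id, records in client_data.items():
        # Hypothetical file name; the real layout may differ.
        with open(split_dir / f"client_{client_id}.pkl", "wb") as f:
            pickle.dump(records, f)


def load_client_pickle(pickle_root, split, client_id):
    """Read back the records stored for a single client."""
    path = Path(pickle_root) / split / f"client_{client_id}.pkl"
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as root:
    # Toy (features, label) records for two hypothetical clients.
    toy_train = {0: [([0.0, 1.0], 3)], 1: [([1.0, 0.0], 7)]}
    toy_test = {0: [([0.5, 0.5], 3)], 1: [([0.2, 0.8], 7)]}
    save_client_pickles(root, "train", toy_train)
    save_client_pickles(root, "test", toy_test)
    created = sorted(p.name for p in Path(root).iterdir())
    restored = load_client_pickle(root, "train", client_id=1)
    print(created, restored)
```

Because each client has its own file, a simulated client only has to deserialize its own data instead of parsing the whole raw dataset.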

Dataloader loading data set

The following code prepares the DataLoaders.

github.com

# Copyright 2021 Peng Cheng Laboratory (http://www.szpclab.com/) and FedLab Authors (smilelab.group)

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
    get dataloader for dataset in LEAF processed
"""
import logging
from pathlib import Path
import torch
from torch.utils.data import ConcatDataset
from leaf.pickle_dataset import PickleDataset

BASE_DIR = Path(__file__).resolve().parents[2]


def get_LEAF_dataloader(dataset: str, client_id=0, batch_size=128, data_root: str = None, pickle_root: str = None):
    """Get dataloader with ``batch_size`` param for client with ``client_id``

    Args:
        dataset (str):  dataset name string to get dataloader
        client_id (int, optional): assigned client_id to get dataloader for this client. Defaults to 0
        batch_size (int, optional): the number of batch size for dataloader. Defaults to 128
        data_root (str): path for data saving root.
                        Default to None and will be modified to the datasets folder in FedLab: "fedlab-benchmarks/datasets"
        pickle_root (str): path for pickle dataset file saving root.
                        Default to None and will be modified to Path(__file__).parent / "pickle_datasets"
    Returns:
        A tuple with train dataloader and test dataloader for the client with `client_id`

    Examples:
        trainloader, testloader = get_LEAF_dataloader(dataset='shakespeare', client_id=1)
    """
    # Need to run leaf/gen_pickle_dataset.sh to generate pickle files for this dataset firstly
    pdataset = PickleDataset(dataset_name=dataset, data_root=data_root, pickle_root=pickle_root)
    try:
        trainset = pdataset.get_dataset_pickle(dataset_type="train", client_id=client_id)
        testset = pdataset.get_dataset_pickle(dataset_type="test", client_id=client_id)
    except FileNotFoundError:
        logging.error(f"""
                        No built PickleDataset json file for {dataset}-client {client_id} in {pdataset.pickle_root.resolve()}
                        Please run `{BASE_DIR / 'leaf/gen_pickle_dataset.sh'} to generate {dataset} pickle files firstly!` 
                        """)
        raise  # re-raise: trainset/testset are undefined here, so continuing would end in a NameError

    trainloader = torch.utils.data.DataLoader(
        trainset,
        batch_size=batch_size,
        drop_last=False)  # avoid train dataloader size 0
    testloader = torch.utils.data.DataLoader(
        testset,
        batch_size=len(testset),
        drop_last=False,
        shuffle=False)
        
    return trainloader, testloader


def get_LEAF_all_test_dataloader(dataset: str, batch_size=128, data_root: str = None, pickle_root: str = None):
    """Get dataloader for all clients' test pickle file

    Args:
        dataset (str): dataset name
        batch_size (int, optional): the number of batch size for dataloader. Defaults to 128
        data_root (str): path for data saving root.
                        Default to None and will be modified to the datasets folder in FedLab: "fedlab-benchmarks/datasets"
        pickle_root (str): path for pickle dataset file saving root.
                        Default to None and will be modified to Path(__file__).parent / "pickle_datasets"
    Returns:
        ConcatDataset for all clients' test dataset
    """
    pdataset = PickleDataset(dataset_name=dataset, data_root=data_root, pickle_root=pickle_root)

    try:
        all_testset = pdataset.get_dataset_pickle(dataset_type="test")
    except FileNotFoundError:
        logging.error(f"""
                        No built test PickleDataset json file for {dataset} in {pdataset.pickle_root.resolve()}
                        Please run `{BASE_DIR / 'leaf/gen_pickle_dataset.sh'} to generate {dataset} pickle files firstly!` 
                        """)
        raise  # re-raise: all_testset is undefined here, so continuing would end in a NameError
    test_loader = torch.utils.data.DataLoader(
                    all_testset,
                    batch_size=batch_size,
                    drop_last=True)  # note: drops the final incomplete batch of the test set
    return test_loader

When calling it, pass the dataset name that you specified with the --dataset_name option when running pickle_dataset.py.

train_loader, test_loader = get_LEAF_dataloader(dataset="femnist")
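
The loaders above pass `drop_last=False` for the per-client train loader and `drop_last=True` for the all-clients test loader. The effect on the number of batches can be worked out by hand. The sketch below follows the usual rule for a PyTorch DataLoader over a map-style dataset (floor division when the last incomplete batch is dropped, ceiling otherwise); the function name `num_batches` is hypothetical.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches produced for `num_samples` items."""
    if drop_last:
        # The final incomplete batch is discarded.
        return num_samples // batch_size
    # The final incomplete batch is kept.
    return math.ceil(num_samples / batch_size)


# A client with fewer samples than the default batch size of 128.
print(num_batches(88, 128, drop_last=False))   # 1
print(num_batches(88, 128, drop_last=True))    # 0
print(num_batches(300, 128, drop_last=False))  # 3
```

With `drop_last=True`, a client holding fewer samples than `batch_size` would yield no batches at all, which is the situation the comment on the train loader is guarding against.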

My feeling is that it would become far easier to use if it were provided through Torchvision.

The current students' Advent Calendar is here.

qiita.com

Yesterday's Day 22 article was sedbf by n01e0.

feneshi.co

I don't know whether there will be an article tomorrow.

アルゴリズム弱太郎

Twitter @01futabato10
