NLP and Speech
Background
Privacy-preserving methods have recently attracted increasing attention in machine learning (ML) applications that use linguistic data, including text and audio, because such data can carry a wealth of information relating to an identified or identifiable natural person, such as physiological, psychological, economic, cultural, or social identity.
Federated Learning (FL) methods show promising results for collaboratively training models across a large number of clients without sharing their private linguistic data. To facilitate FL research on linguistic data, FederatedScope provides several built-in linguistic datasets and supports tasks such as language modeling and text classification with various FL algorithms.
Natural Language Processing (NLP)
Datasets
We provide three popular text datasets for next-character prediction, next-word prediction, and sentiment analysis.
- Shakespeare: a federated text dataset of Shakespeare dialogues from LEAF [1] for next-character prediction, which contains 422,615 sentences and about 1,100 clients.
- subReddit: a federated text dataset subsampled from Reddit, provided by LEAF for next-word prediction, which contains 216,858 sentences and about 800 clients.
- Sentiment140: a federated text dataset of Twitter posts from LEAF for sentiment analysis, which contains 1,600,498 sentences and about 660,000 clients.
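The figures in the list above imply very different amounts of data per client. A quick calculation makes this concrete; it uses only the approximate counts quoted above, and the variable and function names are illustrative.

```python
# Approximate sizes quoted in the list above: (sentences, clients).
datasets = {
    "Shakespeare": (422_615, 1_100),
    "subReddit": (216_858, 800),
    "Sentiment140": (1_600_498, 660_000),
}


def sentences_per_client(sentences, clients):
    """Average number of sentences held by one client."""
    return sentences / clients


averages = {
    name: sentences_per_client(*size) for name, size in datasets.items()
}

for name, avg in averages.items():
    print(f"{name}: about {avg:.1f} sentences per client")
```

On average, Shakespeare and subReddit clients hold a few hundred sentences each, while a Sentiment140 client holds only two or three.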
Models
We provide an LSTM model implementation in `federatedscope/nlp/model`.
- LSTM: a type of RNN that mitigates the vanishing gradient problem through a memory cell and input, forget, and output gates (`cfg.model.type = 'lstm'`).
```python
class LSTM(nn.Module):
    def __init__(self,
                 in_channels,
                 hidden,
                 out_channels,
                 n_layers=2,
                 embed_size=8):
        pass
```
- We are currently implementing more interfaces to support popular NLP Transformer models and more NLP tasks with HuggingFace Transformers [2].
Start an example
Next-character/word prediction is a classic NLP task: it is used in many consumer applications and can be modeled appropriately by statistical language models. Here we show how to perform next-character prediction in a cross-device FL setting.
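To make the task concrete, the following sketch turns raw text into (input sequence, next character) training pairs. It is only an illustration: the helper names, the tiny sample text, and the window length of 5 are assumptions, and the preprocessing used for the LEAF datasets may differ.

```python
def build_vocab(text):
    """Map each distinct character to an integer index (sorted for determinism)."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}


def next_char_pairs(text, vocab, seq_len):
    """Slide a window over the text and pair it with the character that follows."""
    pairs = []
    for start in range(len(text) - seq_len):
        window = text[start:start + seq_len]
        target = text[start + seq_len]
        pairs.append(([vocab[ch] for ch in window], vocab[target]))
    return pairs


text = "to be or not to be"
vocab = build_vocab(text)
pairs = next_char_pairs(text, vocab, seq_len=5)

# Each pair holds the indices of 5 input characters and the index of the next one.
print(len(vocab), len(pairs), pairs[0])
```

The size of the vocabulary is what the model's input and output dimensions have to match, since the model reads character indices and scores every possible next character.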
- Here we implement a simple LSTM model for next-character prediction: taking a sequence of English characters as input, the model learns to predict the next character. After registering the model, we can use it by specifying `cfg.model.type=lstm` and hyper-parameters such as `cfg.model.in_channels=80`, `cfg.model.out_channels=80`, and `cfg.model.embed_size=8`. The complete code is in `federatedscope/nlp/model/rnn.py` and `federatedscope/nlp/model/model_builder.py`.
```python
import torch.nn as nn


class LSTM(nn.Module):
    def __init__(self,
                 in_channels,
                 hidden,
                 out_channels,
                 n_layers=2,
                 embed_size=8):
        super(LSTM, self).__init__()
        self.in_channels = in_channels
        self.hidden = hidden
        self.embed_size = embed_size
        self.out_channels = out_channels
        self.n_layers = n_layers

        # Character index -> dense embedding vector
        self.encoder = nn.Embedding(in_channels, embed_size)

        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden,
                           num_layers=n_layers,
                           batch_first=True)

        # Hidden state -> scores over the output vocabulary
        self.decoder = nn.Linear(hidden, out_channels)

    def forward(self, input_):
        encoded = self.encoder(input_)
        output, _ = self.rnn(encoded)
        output = self.decoder(output)
        output = output.permute(0, 2, 1)  # change dimension to (B, C, T)

        # Keep only the prediction at the last time step: (B, C)
        final_word = output[:, :, -1]
        return final_word
```
- For the dataset, we use the Shakespeare dataset from LEAF, which is built from The Complete Works of William Shakespeare and partitioned into about 1,100 clients (speaking roles) holding 422,615 sentences in total. We can specify `cfg.data.type=shakespeare` and adjust the sub-sampling fraction (`cfg.data.subsample=0.2`) and the train/val/test ratio (`cfg.data.splits=[0.6,0.2,0.2]`). The complete NLP data code is in `federatedscope/nlp/dataset`.
```python
class LEAF_NLP(LEAF):
    """
    LEAF NLP dataset from leaf.cmu.edu

    Arguments:
        root (str): root path.
        name (str): name of dataset, 'shakespeare' or 'xxx'.
        s_frac (float): fraction of the dataset to be used; default=0.3.
        tr_frac (float): train set proportion for each task; default=0.8.
        val_frac (float): valid set proportion for each task; default=0.0.
    """
    def __init__(self,
                 root,
                 name,
                 s_frac=0.3,
                 tr_frac=0.8,
                 val_frac=0.0,
                 seed=123,
                 transform=None,
                 target_transform=None):
        pass
```
- To enable large-scale client simulation, we provide an online aggregator in standalone mode to save memory; it maintains only three model objects for the FL server aggregation. We can use this feature by specifying `cfg.federate.online_aggr = True` and `cfg.federate.share_local_model = True`. More details about this feature can be found in the post “Simulation and Deployment”.
- To handle the non-i.i.d. challenge, FederatedScope supports several SOTA personalization algorithms and is easy to extend with new ones.
- To enable partial client participation in each FL round, we provide a client sampling feature that can be configured in two ways: 1) `cfg.federate.sample_client_rate`, a value in the range (0, 1] that selects a fraction of the clients using random sampling with replacement; 2) `cfg.federate.sample_client_num`, an integer giving the number of clients sampled in each round.
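The sub-sampling fraction and the train/val/test ratio mentioned in the dataset item above can be illustrated with a small sketch. It is not the `LEAF_NLP` implementation: the helper `split_client_samples`, the made-up client data, and the rounding rule are assumptions, and the real class may sub-sample differently (for example, at the client level).

```python
import random


def split_client_samples(samples, subsample, splits, seed=123):
    """Keep a fraction of one client's samples, then split into train/val/test."""
    rng = random.Random(seed)
    kept = list(samples)
    rng.shuffle(kept)
    kept = kept[:max(1, round(len(kept) * subsample))]

    n_train = round(len(kept) * splits[0])
    n_val = round(len(kept) * splits[1])
    return {
        "train": kept[:n_train],
        "val": kept[n_train:n_train + n_val],
        "test": kept[n_train + n_val:],
    }


# Hypothetical raw data: 3 speaking roles with 50 lines each.
raw = {
    role: [f"{role}-line-{i}" for i in range(50)]
    for role in ("hamlet", "ophelia", "horatio")
}

# One data dict per client, keyed by an integer client ID starting from 1.
data = {
    client_id: split_client_samples(lines, subsample=0.2, splits=[0.6, 0.2, 0.2])
    for client_id, lines in enumerate(raw.values(), start=1)
}

print({cid: {k: len(v) for k, v in d.items()} for cid, d in data.items()})
```

With a sub-sampling fraction of 0.2 and splits of 0.6/0.2/0.2, each hypothetical client keeps 10 of its 50 lines and divides them into 6 training, 2 validation, and 2 test samples.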
With these specifications, we can run the experiment with
`main.py --cfg federatedscope/nlp/baseline/fedavg_lstm_on_shakespeare.yaml`
The FedAvg algorithm reaches an accuracy of around 43.80%.
Other NLP-related scripts for running the next-character prediction experiments can be found in `federatedscope/nlp/baseline`.
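As a quick sanity check of the LSTM shown earlier in this section, the sketch below runs one forward pass on random character indices and inspects the output shapes. The class is re-declared so the snippet is self-contained; it indexes the last time step directly, which gives the same result as the permute-then-slice in the original. The value `hidden=256`, the batch size, and the sequence length are illustrative assumptions.

```python
import torch
from torch import nn


class LSTM(nn.Module):
    """Same layer layout as the LSTM shown earlier in this section."""
    def __init__(self, in_channels, hidden, out_channels, n_layers=2,
                 embed_size=8):
        super().__init__()
        self.encoder = nn.Embedding(in_channels, embed_size)
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden,
                           num_layers=n_layers,
                           batch_first=True)
        self.decoder = nn.Linear(hidden, out_channels)

    def forward(self, input_):
        encoded = self.encoder(input_)   # (B, T, embed_size)
        output, _ = self.rnn(encoded)    # (B, T, hidden)
        output = self.decoder(output)    # (B, T, out_channels)
        return output[:, -1, :]          # last time step: (B, out_channels)


# hidden=256 and the batch/sequence sizes are assumptions for illustration.
model = LSTM(in_channels=80, hidden=256, out_channels=80)
batch = torch.randint(0, 80, (4, 20))    # 4 sequences of 20 character indices
logits = model(batch)
predicted = logits.argmax(dim=1)         # most likely next character per sequence

print(logits.shape, predicted.shape)
```

The model returns one score vector per input sequence, so the predicted next character is simply the index with the highest score.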
Customize your NLP task
FederatedScope enables users to easily implement and register more NLP datasets and models.
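The registration calls in this section follow a common registry pattern: a builder function is stored under a name, and it returns a result only when the configured type matches. The sketch below is a toy illustration of that pattern, not FederatedScope's implementation; `toy_register_data`, `toy_get_data`, and the plain-dict config are hypothetical stand-ins.

```python
# Toy registry: maps a name to a builder function (hypothetical stand-in).
_toy_data_registry = {}


def toy_register_data(name, builder):
    """Store a builder function under a unique name."""
    if name in _toy_data_registry:
        raise KeyError(f"{name!r} is already registered")
    _toy_data_registry[name] = builder


def toy_get_data(config):
    """Ask each registered builder in turn; the first non-None answer wins."""
    for builder in _toy_data_registry.values():
        result = builder(config)
        if result is not None:
            return result
    raise ValueError(f"no builder handles data type {config['type']!r}")


def call_my_data(config):
    # Returns None implicitly when the configured type does not match.
    if config["type"] == "my_nlp_data":
        data = {1: {"train": ["a sample"], "test": ["another sample"]}}
        return data, config


toy_register_data("my_nlp_data", call_my_data)
data, modified_config = toy_get_data({"type": "my_nlp_data"})
print(data)
```

This is why the builder functions in this section check the configured type first: a builder that does not recognize the type stays silent so that another one can handle it.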
- Implement and register your own NLP data
```python
# federatedscope/contrib/data/my_nlp_data.py
from torch.utils.data import DataLoader

from federatedscope.register import register_data
# Assumed module path for LEAF_NLP; adjust it to where the class lives
# under federatedscope/nlp/dataset.
from federatedscope.nlp.dataset.leaf_nlp import LEAF_NLP


def get_my_nlp_data(config):
    r"""
    This function returns a dictionary, where key is the client id and
    value is the data dict of each client with 'train', 'test' or 'val'.
    NOTE: client_id 0 is SERVER!

    Returns:
        dict: {
            'client_id': {
                'train': DataLoader or Data,
                'test': DataLoader or Data,
                'val': DataLoader or Data,
            }
        }
    """
    # The original snippet leaves these names undefined; reading them from
    # the config (and using no transform) is an assumption.
    path = config.data.root
    splits = config.data.splits
    batch_size = config.data.batch_size
    transform = None

    # Build data
    dataset = LEAF_NLP(root=path,
                       name="twitter",
                       s_frac=config.data.subsample,
                       tr_frac=splits[0],
                       val_frac=splits[1],
                       seed=1234,
                       transform=transform)

    # Use at most the configured number of clients (all of them if <= 0)
    client_num = min(len(dataset), config.federate.client_num
                     ) if config.federate.client_num > 0 else len(dataset)
    config.merge_from_list(['federate.client_num', client_num])

    # get local dataset
    data_local_dict = dict()
    for client_idx in range(client_num):
        dataloader = {
            'train': DataLoader(dataset[client_idx]['train'],
                                batch_size,
                                shuffle=config.data.shuffle,
                                num_workers=config.data.num_workers),
            'test': DataLoader(dataset[client_idx]['test'],
                               batch_size,
                               shuffle=False,
                               num_workers=config.data.num_workers)
        }
        if 'val' in dataset[client_idx]:
            dataloader['val'] = DataLoader(dataset[client_idx]['val'],
                                           batch_size,
                                           shuffle=False,
                                           num_workers=config.data.num_workers)
        # Client IDs start from 1 because ID 0 is the server
        data_local_dict[client_idx + 1] = dataloader

    return data_local_dict, config


def call_my_data(config):
    if config.data.type == "my_nlp_data":
        data, modified_config = get_my_nlp_data(config)
        return data, modified_config


register_data("my_nlp_data", call_my_data)
```
- Implement and register your own NLP model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

from federatedscope.register import register_model


class KIM_CNN(nn.Module):
    """
    ref to Kim's CNN text classification paper [3]
    https://github.com/Shawn1993/cnn-text-classification-pytorch
    """
    def __init__(self, args):
        super(KIM_CNN, self).__init__()
        self.args = args

        V = args.embed_num     # vocabulary size
        D = args.embed_dim     # embedding dimension
        C = args.class_num     # number of classes
        Ci = 1                 # input channels
        Co = args.kernel_num   # output channels per kernel size
        Ks = args.kernel_sizes

        self.embed = nn.Embedding(V, D)
        self.convs = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

        # Optionally freeze the embedding weights
        if self.args.static:
            self.embed.weight.requires_grad = False

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)
        x = x.unsqueeze(1)  # (N, Ci, W, D)
        x = [F.relu(conv(x)).squeeze(3)
             for conv in self.convs]  # [(N, Co, W), ...]*len(Ks)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2)
             for i in x]  # [(N, Co), ...]*len(Ks)
        x = torch.cat(x, 1)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        return logit


def call_my_net(model_config, local_data):
    if model_config.type == "my_nlp_model":
        model = KIM_CNN(args=model_config)
        return model


register_model("my_nlp_model", call_my_net)
```
- Then, with the built-in FL experiment scripts, users can run their own FL experiments by replacing the model type and dataset type in the provided scripts.
Speech (Coming soon)
We are working on implementing more interfaces to support Conformer [4] models and more speech-related tasks with WeNet [5], which is designed for various end-to-end speech recognition tasks and provides full-stack solutions for production and real-world applications.
Reference
[1] Caldas, Sebastian, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. “Leaf: A benchmark for federated settings.” arXiv preprint arXiv:1812.01097 (2018).
[2] Wolf, Thomas, et al. “Huggingface’s transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771 (2019).
[3] Kim, Yoon. “Convolutional neural networks for sentence classification.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar (2014).
[4] Gulati, Anmol, et al. “Conformer: Convolution-augmented transformer for speech recognition.” arXiv preprint arXiv:2005.08100 (2020).
[5] Zhang, Binbin, et al. “Wenet: Production first and production ready end-to-end speech recognition toolkit.” arXiv e-prints (2021): arXiv-2102.