
Handling long documents for sequence classification with transformers - RoBERT (Recurrence over BERT)

Photo: Dominos by Bradyn Trollip

Paper Implementations - RoBERT Implementation

BERT [1] is a transformer-based language model used in many natural language processing tasks. Its contextual text representations allow it to capture semantic information in documents. BERT is pretrained on a large corpus and can be fine-tuned for downstream tasks such as text classification, summarization and token classification. However, BERT accepts at most 512 input tokens, which means that for long documents only roughly the first ~400 words fit into a single forward and backward pass. Longer inputs have to be truncated before they are fed into the model, and truncation throws away the later parts of long documents. To handle this, for tasks such as sequence classification and next sentence prediction, a document representation can be obtained by dividing the input sequence into segments.
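For illustration, the snippet below is a minimal sketch (using the Hugging Face transformers tokenizer; the toy long_text string is made up) of how a long input exceeds the 512-token limit unless it is truncated:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
long_text = ' '.join(['word'] * 1000)  # a toy long document

full = tokenizer(long_text, return_tensors='pt')
truncated = tokenizer(long_text, truncation=True, max_length=512, return_tensors='pt')

print(full.input_ids.shape)       # torch.Size([1, 1002]) -- longer than BERT can take
print(truncated.input_ids.shape)  # torch.Size([1, 512]) -- the rest of the document is lost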

In the paper by Pappagari et al. [2], it is suggested that the segment representations of a document can be fed into a recurrent neural network (RNN) or a small transformer architecture. These models are called RoBERT and ToBERT, respectively. With this methodology, all of the text in a long document can be used without losing the later information. As stated in the paper, the authors' approach is as follows:

First, the document to be classified is divided into chunks. For this purpose, the authors use a sliding-window approach, so each chunk overlaps with the previous and the next chunk. The overlap amount and the window size are hyperparameters of the RoBERT model. However, it is not clear in the paper whether the input sequence is split on the raw text (words) or on the numerical BERT tokens. Note that the BERT tokenizer uses sub-word units, so when the sequence is split by tokens, a single word can end up in different chunks. The code below divides a tokenized text into chunks and then stacks the segment-level representations into an embedding tensor. The chunking operation is parameterized by the given chunk size and overlap size.

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.utils.data import Dataset, DataLoader
from transformers import BertForSequenceClassification, BertTokenizer

def chunker(tokenized_text, chunk_size, overlap_size):
    ''' Divides a tokenized text into overlapping chunks of chunk_size tokens '''

    if tokenized_text.input_ids.shape[1] > chunk_size:
        # unfold produces a (num_chunks, chunk_size) view with the given stride
        func = lambda x: (x[0], x[1][0, :].unfold(0, chunk_size, chunk_size - overlap_size))
        return dict(map(func, dict(tokenized_text).items()))
    else:
        return tokenized_text
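# CustomDataset, used below, is not defined in the original post; the following is a
# minimal assumed implementation that serves each chunk as one dataset item.
class CustomDataset(Dataset):
    ''' Wraps the (possibly chunked) token tensors so each chunk is one item '''

    def __init__(self, tokenized_chunks):
        self.tokenized_chunks = tokenized_chunks

    def __len__(self):
        return self.tokenized_chunks['input_ids'].shape[0]

    def __getitem__(self, idx):
        return {key: value[idx] for key, value in self.tokenized_chunks.items()}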
def stacked_tokenized_texts(tokenized_text, model, device, chunk_size, overlap_size, batch_size):
  '''
  This function takes a single tokenized text and divides it into chunks.
  It then embeds these chunks using the given transformer model
  and concatenates all embeddings in order.
  '''

  # divide into chunks
  tokenized_chunks = chunker(tokenized_text, chunk_size, overlap_size)

  # convert into Dataset
  dataset = CustomDataset(tokenized_chunks)

  # create dataloader object
  loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

  embeddings = list()
  for batch in loader:
    with torch.no_grad():
      # send tensors to device
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)

      # take all outputs including hidden states
      outputs = model(input_ids, attention_mask=attention_mask)

      # take only last hidden state
      last_hidden_state_outputs = outputs.hidden_states[-1]

      # take [CLS] token output
      embedding = last_hidden_state_outputs[:,0,:]

      # collect the [CLS] embeddings of this batch
      embeddings.append(embedding)

      torch.cuda.empty_cache()

  return torch.cat(embeddings, dim=0)
model_name_or_directory = 'bert-base-uncased'
model = BertForSequenceClassification.from_pretrained(model_name_or_directory, output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained(model_name_or_directory)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

# chunk_size, overlap_size and batch_size are hyperparameters; text is the document to embed
tokenized_text = tokenizer(text, return_tensors='pt')
embeddings = stacked_tokenized_texts(tokenized_text, model, device, chunk_size, overlap_size, batch_size)
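The RoBERT classifier further below expects a padded batch of these segment-level embedding sequences. As a sketch (the documents list is an assumed placeholder for the corpus, not a variable from the original code), several documents of different lengths can be combined with pad_sequence:

from torch.nn.utils.rnn import pad_sequence

doc_embeddings = []
for doc_text in documents:
    tokenized = tokenizer(doc_text, return_tensors='pt')
    # each document yields a (num_chunks, 768) tensor of segment embeddings
    doc_embeddings.append(stacked_tokenized_texts(tokenized, model, device, chunk_size, overlap_size, batch_size))

# pad to the longest document: shape (num_documents, max_num_chunks, 768)
padded_batch = pad_sequence(doc_embeddings, batch_first=True)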

In the classification stage of RoBERT, the stacked segment-level document representations are fed into a 100-dimensional LSTM layer followed by two fully connected layers (30 units, then the number of classes) with ReLU and Softmax activations. By simply replacing the LSTM layer with two layers of transformer building blocks, RoBERT turns into the ToBERT architecture. The RoBERT architecture code is given below.

For evaluation, while the authors compare performance on three different datasets, I only use the 20newsgroups [3] dataset and compare my implementation with the baselines on the same data split. The baselines are Logistic Regression on TfIdf word vectors and on BERT sentence embeddings. Although the architecture stated in the paper includes three layers (one LSTM and two FC layers), in my case a single LSTM layer converged faster. Since one epoch takes about 1 minute, at the end of 100 epochs I get 50% test accuracy using a single LSTM layer with 20 hidden units. This means that for very long documents truncation is still effectively applied, but roughly 20 times more of the text is taken into account.
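For reference, a minimal sketch of the TfIdf + Logistic Regression baseline (assuming scikit-learn and its standard by-date 20newsgroups split; my actual preprocessing and hyperparameters may differ):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)

print(accuracy_score(test.target, clf.predict(X_test)))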

# Create an LSTM model as stated in the paper
class RoBERT(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(RoBERT, self).__init__()
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)

        self.fc_1 = nn.Linear(hidden_size, 30)

        self.fc_2 = nn.Linear(30, num_classes)

        self.softmax = nn.Softmax(dim=-1)

        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, num_chunks, input_size) padded segment-level embeddings
        total_length = x.size(1)
        seq_lengths = torch.LongTensor(list(map(len, x)))

        packed_input = pack_padded_sequence(x, lengths=seq_lengths,
                                            batch_first=True, enforce_sorted=False)

        packed_output, _ = self.lstm(packed_input)

        out, _ = pad_packed_sequence(packed_output, batch_first=True,
                                     total_length=total_length)

        # FC(30) with ReLU, then FC(num_classes) with Softmax, as described above
        out = self.fc_1(out)
        out = self.relu(out)

        out = self.fc_2(out)

        out = self.softmax(out)

        # use the output at the last time step as the document prediction
        return out[:, -1, :]
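For completeness, since the paper obtains ToBERT by replacing the recurrence with self-attention, a rough sketch of what that replacement could look like is given below. This is only my reading of the idea, not the authors' code; the mean pooling over segments, the number of heads and the layer sizes are my own assumptions.

class ToBERT(nn.Module):
    ''' Transformer over BERT: two transformer encoder layers over segment embeddings '''

    def __init__(self, num_classes, input_size=768, nhead=8):
        super(ToBERT, self).__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=input_size, nhead=nhead,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.fc_1 = nn.Linear(input_size, 30)
        self.fc_2 = nn.Linear(30, num_classes)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        # x: (batch, num_chunks, input_size) stacked segment embeddings
        out = self.encoder(x)
        out = out.mean(dim=1)  # pool over segments
        out = self.relu(self.fc_1(out))
        out = self.fc_2(out)
        return self.softmax(out)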

In the paper, the authors report that RoBERT achieves 60.75% test accuracy without finetuning and 84.71% with finetuning. Using segment-level predictions from the finetuned BERT, they achieve 84.78% test accuracy with the most frequent segment label and 84.51% with the averaged segment predictions. There are two options for finetuning BERT: the first is to treat each text chunk as a separate document with the same label, the second is to simply truncate the text. Whether finetuning is applied with truncation or not is not clear in the paper. I suppose the authors used the first option, because later in the paper they use segment-level predictions to compute the average and the most frequent labels.
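To make the first option concrete, a small hypothetical helper (my own illustration, not code from the paper) could expand a labelled document collection into segment-level training examples that inherit the document label:

def expand_to_segment_examples(texts, labels, tokenizer, chunk_size, overlap_size):
    ''' Every chunk of a document becomes a training example with the document's label '''
    segment_inputs, segment_labels = [], []
    for doc_text, label in zip(texts, labels):
        tokenized = tokenizer(doc_text, return_tensors='pt')
        chunks = chunker(tokenized, chunk_size, overlap_size)
        num_chunks = chunks['input_ids'].shape[0]
        segment_inputs.append(chunks)
        segment_labels.extend([label] * num_chunks)
    return segment_inputs, segment_labels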

In my experiments, the Logistic Regression model achieves 65.5% test accuracy with TfIdf vectors and 72.59% with BERT embeddings. My RoBERT implementation (a single-layer LSTM) gives only 50%. Note, however, that since these configurations are not stated in the paper, the train-test split and hyperparameters were determined independently. The hyperparameters I selected are listed as:

In conclusion, RoBERT (Recurrence over BERT) is, in theory, useful for taking the later parts of long documents into account, but it requires additional training time because of the recurrent model (LSTM). Without finetuning, my RoBERT implementation could not outperform the baselines, although the authors show that performance increases significantly with finetuned embeddings. Comparing finetuned segment-level representations against these results will be my next experiment.

References