I am trying to encode documents sentence-wise with a Hugging Face transformer model. I'm using the very small google/bert_uncased_L-2_H-128_A-2 pretrained model with the following code:

import time

import nltk
import torch


def pre_encode_wikipedia(model, tokenizer, device, save_path):

  document_data_list = []

  for iteration, document in enumerate(wikipedia_small['text']):
    torch.cuda.empty_cache()

    # Start each document with a placeholder embedding and matching mask entries.
    sentence_embeds_per_doc = [torch.randn(128)]
    attention_mask_per_doc = [1]
    special_tokens_per_doc = [1]

    # Split the document into sentences and tokenize them as one batch.
    doc_split = nltk.sent_tokenize(document)
    doc_tokenized = tokenizer.batch_encode_plus(doc_split, padding='longest', truncation=True, max_length=512, return_tensors='pt')

    for key, value in doc_tokenized.items():
      doc_tokenized[key] = doc_tokenized[key].to(device)

    with torch.no_grad():
      doc_encoded = model(**doc_tokenized)

    # Keep the [CLS] embedding of every sentence.
    for sentence in doc_encoded['last_hidden_state']:
      sentence[0].to('cpu')
      sentence_embeds_per_doc.append(sentence[0])
      attention_mask_per_doc.append(1)
      special_tokens_per_doc.append(0)

    sentence_embeds = torch.stack(sentence_embeds_per_doc)
    attention_mask = torch.FloatTensor(attention_mask_per_doc)
    special_tokens_mask = torch.FloatTensor(special_tokens_per_doc)

    document_data = torch.utils.data.TensorDataset(*[sentence_embeds, attention_mask, special_tokens_mask])
    torch.save(document_data, f'{save_path}{time.strftime("%Y%m%d-%H%M%S")}{iteration}.pt')
    print(f"Document at {iteration} encoded and saved.")

After about 200-300 iterations on my local GTX 1060 3GB I get an error saying that my CUDA memory is full. Running this code on Colab with more GPU RAM gives me a few thousand iterations.

Things I've tried:

  • Adding torch.cuda.empty_cache() at the start of every iteration to clear out previously held tensors
  • Wrapping the forward pass in torch.no_grad() so no computation graph is built
  • Calling model.eval() to disable dropout and any other training-time behaviour that might take up memory
  • Sending the output straight to the CPU in the hope of freeing up GPU memory (see the sketch after this list)
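
Put together, those attempts amount to roughly this pattern (a minimal sketch; batch_on_gpu is a stand-in for the tokenized, device-mapped batch from the code above):

model.eval()  # disable dropout and other training-time behaviour
with torch.no_grad():  # no computation graph is built for these ops
  output = model(**batch_on_gpu)
cpu_embeds = output['last_hidden_state'][:, 0, :].to('cpu')  # keep only a CPU copy of the [CLS] embeddings
del output  # drop the reference to the GPU output
torch.cuda.empty_cache()  # release cached blocks back to the driver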

I'm baffled as to why my memory keeps overflowing. I've trained several bigger models, applying all the standard practices of a training loop (optimizer.zero_grad(), etc.), and I've never had this problem. Why does it appear during this seemingly trivial task?

Edit #1: Changing sentence[0].to('cpu') to cpu_sentence = sentence[0].to('cpu') gave me a few thousand iterations before VRAM usage suddenly spiked, causing the run to crash. (Screenshot of the VRAM usage spike not included.)

Question from: https://stackoverflow.com/questions/65906965/pytorch-gpu-memory-leak-during-inference


1 Answer

Can you try replacing

sentence[0].to('cpu')

with

cpu_sentence = sentence[0].to('cpu')

See more info here: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to
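
Tensor.to returns a new tensor rather than moving the original in place, so the result has to be assigned to something. A minimal sketch of the change in context, reusing the loop from the question (note it also appends the CPU copy, which goes a bit beyond the one-line suggestion above):

for sentence in doc_encoded['last_hidden_state']:
  cpu_sentence = sentence[0].to('cpu')          # .to() returns a new tensor on the CPU
  sentence_embeds_per_doc.append(cpu_sentence)  # keep a reference to the CPU copy only
  attention_mask_per_doc.append(1)
  special_tokens_per_doc.append(0)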

