What do you do when your input text is longer than BERT's maximum of 512 tokens? Longformer & BigBird are two closely related models that address this with a technique called Sparse Attention.
In my video lecture (divided into 9 bite-size pieces), I provide the context for Sparse Attention and explain in detail how it works.
I've also created an eBook covering the same material if you prefer that medium!
To put things into practice, there is also a Colab Notebook that applies BigBird to a dataset of longer text sequences (a minimal sketch of that workflow appears after the links below).
Video Tutorial   +   eBook   +   Colab Notebook
PyTorch   +   huggingface/transformers
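To give a flavor of what the Colab Notebook covers, here is a minimal sketch of running BigBird through huggingface/transformers. This is not the Notebook itself: the checkpoint (`google/bigbird-roberta-base`), the two-label classification task, and the 4,096-token cap are illustrative choices of mine.

```python
import torch
from transformers import BigBirdTokenizer, BigBirdForSequenceClassification

# Illustrative checkpoint; the classification head is randomly initialized
# and would need fine-tuning before its outputs mean anything.
model_name = "google/bigbird-roberta-base"
tokenizer = BigBirdTokenizer.from_pretrained(model_name)
model = BigBirdForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

# A stand-in for a document far longer than BERT's 512-token cap.
long_text = "Sparse attention lets us read long documents. " * 300

# BigBird's sparse attention supports sequences up to 4,096 tokens here,
# versus BERT's 512.
inputs = tokenizer(
    long_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)  # torch.Size([1, 2])
```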
Why does BERT have a limitation on sequence length to begin with?
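(The lecture answers this properly, but as a quick back-of-the-envelope illustration of my own: full self-attention compares every token to every other token, so the attention score matrix grows quadratically with sequence length.)

```python
# Illustration (not from the lecture): full self-attention builds an
# n x n score matrix, so cost grows quadratically with sequence length n.
for n in [512, 1024, 4096]:
    scores = n * n  # one attention score per pair of tokens
    print(f"n={n:5d} -> {scores:,} scores per attention head")

# n=  512 -> 262,144 scores per attention head
# n= 1024 -> 1,048,576 scores per attention head
# n= 4096 -> 16,777,216 scores per attention head
```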