AGENDA
6:30 – Arrive and mingle
7:00 – Talk begins
8:00 – Discussion
TOPIC
Modern natural language processing systems must handle widely varying text that comes directly from consumers. In practice, this means models need to provide quality inference on text with misspellings, and even on text in other languages.
Subword tokenization is an NLP technique that splits text into units smaller than whole words before it is vectorized, and it is meant to address these problems while keeping models performant. It is used in modern systems like BERT, XLNet, and their derivatives, and is the emerging standard for preparing text for neural nets.
This presentation will explore recent strategies for implementing subword tokenization, and walk through a simple implementation of byte-pair encoding.
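For a taste of the topic ahead of the talk, here is a minimal sketch of byte-pair encoding in Python. It is not the speaker's implementation. The function names, the "</w>" end-of-word marker, and the toy word counts are all assumptions made for illustration. The idea is to repeatedly find the most frequent pair of adjacent symbols in the training words and merge it into a single new symbol.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Return a new vocab in which every occurrence of `pair` is one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} mapping."""
    # Start from single characters plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def apply_bpe(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols


# Toy corpus statistics, made up for this example.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=5)
print(merges)
print(apply_bpe("lowest", merges))
```

With these toy counts, the word "lowest" never appears in the training data, yet it is split into the pieces "low" and "est</w>", both learned from other words. This is how subword vocabularies cope with unseen or misspelled words.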
THANK YOU
Thanks again to OnePieceWork (http://www.onepiecework.com/) for hosting us!
ABOUT NLSEA
NLSea is a special interest group of PuPPy focused on applications of natural language processing (NLP). The event is for NLP practitioners as well as those who want to get into the field. We plan to cover modern applications of NLP, including project briefs and important recent research papers.