[NLSea] Subword tokenization – handling multilingual data and misspellings

December 3, 2019 @ 6:30 pm - 8:30 pm

AGENDA

6:30 – Arrive and mingle
7:00 – Talk begins
8:00 – Discussion

TOPIC

Modern natural language processing systems have to deal with widely varying data coming directly from consumers. This invariably means models will need to provide quality inference on text with misspellings, and even text in other languages.

Subword tokenization is a modern NLP technique for vectorizing text that is meant to address these problems while keeping models performant. It is used in modern systems like BERT, XLNet, and their derivatives, and is the emerging standard for preparing text for neural nets.

This presentation will explore recent strategies for implementing subword tokenization, and walk through a simple implementation of byte-pair encoding.
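
For readers who want a feel for the technique ahead of the talk, here is a minimal sketch of byte-pair encoding in Python. It is not the speaker's code. The function names (`learn_bpe`, `segment`), the `</w>` end-of-word marker, and the toy word counts are all assumptions made for illustration. The idea is to start from single characters and repeatedly merge the most frequent adjacent pair of symbols into one new symbol.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def segment(word, merges):
    """Split a (possibly unseen or misspelled) word using the learned merges."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols


# Toy word counts, invented for this example.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)
print(toy_merges)
print(segment("lowest", toy_merges))  # "lowest" never appears in toy_counts
```

Because segmentation falls back to single characters when no merge applies, a word that never appeared in the training counts, such as a misspelling, still splits into known symbols rather than a single out-of-vocabulary token.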

THANK YOU

Thanks again to OnePieceWork (http://www.onepiecework.com/) for hosting us!

ABOUT NLSEA

NLSea is a special interest group of PuPPy focused on the application of natural language processing (NLP). The event is for NLP practitioners as well as those wanting to get into the field. We plan to cover modern applications of NLP, including project briefs as well as recent important research papers.

Details

Date: December 3, 2019
Time: 6:30 pm - 8:30 pm
Website: http://www.meetup.com/PSPPython/events/266325236/

Venue

OnePiece Work – Seattle
720 3rd Ave, Suite 1100
Seattle, WA 98104, US