Tokenization and Word Embedding

Tokenization and word embedding are two common NLP preprocessing steps that come before anything else. The reason is that AI models can only take numbers as input and produce numbers as output, while NLP deals mostly with language data such as words and sentences. Tokenization and word embedding bridge this gap by transforming natural language into numbers that models can process.

What is Tokenization?

Tokenization is usually fairly straightforward. Given a word sequence, tokenization turns it into a series of numbers, where each unit of the sequence (such as a word or subword) typically corresponds to one number in the output.
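
To make this concrete, here is a minimal sketch of word-level tokenization in Python. The vocabulary and the tokenize function are hypothetical, built just for illustration; real tokenizers (such as BPE or WordPiece) learn subword vocabularies from data rather than using a hand-built word list.

```python
# A minimal sketch of word-level tokenization with a hypothetical
# toy vocabulary; id 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its vocabulary id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
```

Note that each word maps to exactly one id here, and any word outside the vocabulary falls back to the unknown token; subword tokenizers avoid that problem by breaking rare words into smaller pieces.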
