BpeTokenizer Class

Definition

Namespace: Microsoft.ML.Tokenizers

Assembly: Microsoft.ML.Tokenizers.dll

Package: Microsoft.ML.Tokenizers v1.0.1

Package: Microsoft.ML.Tokenizers v0.22.0

Package: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source: BPETokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Represents the Byte Pair Encoding (BPE) model.

public sealed class BpeTokenizer : Microsoft.ML.Tokenizers.Tokenizer

type BpeTokenizer = class
    inherit Tokenizer

Public NotInheritable Class BpeTokenizer
Inherits Tokenizer

Inheritance: Object → Tokenizer → BpeTokenizer
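
As a rough illustration of the Byte Pair Encoding idea that this class represents, the following toy sketch applies an ordered list of merge rules to a single word. It is not the BpeTokenizer implementation: the `ToyBpe` class, its `Apply` method, and the merge rules in the usage note are hypothetical names and data invented for this example, and it ignores vocabularies, pre-tokenization, and unknown tokens.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only; not part of Microsoft.ML.Tokenizers.
public static class ToyBpe
{
    // Repeatedly merges the adjacent pair with the best (lowest) rank
    // until no merge rule applies to the word.
    public static List<string> Apply(
        string word,
        IReadOnlyList<(string Left, string Right)> merges)
    {
        // Earlier rules have higher priority: rank = position in the list.
        var rank = new Dictionary<(string, string), int>();
        for (int i = 0; i < merges.Count; i++)
        {
            rank[(merges[i].Left, merges[i].Right)] = i;
        }

        // Start from one symbol per UTF-16 character.
        var symbols = new List<string>();
        foreach (char c in word)
        {
            symbols.Add(c.ToString());
        }

        while (symbols.Count > 1)
        {
            int bestIndex = -1;
            int bestRank = int.MaxValue;

            // Find the adjacent pair whose merge rule ranks best.
            for (int i = 0; i < symbols.Count - 1; i++)
            {
                if (rank.TryGetValue((symbols[i], symbols[i + 1]), out int r)
                    && r < bestRank)
                {
                    bestRank = r;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0)
            {
                break; // No applicable merge rule remains.
            }

            // Replace the pair with the merged symbol.
            symbols[bestIndex] = symbols[bestIndex] + symbols[bestIndex + 1];
            symbols.RemoveAt(bestIndex + 1);
        }

        return symbols;
    }
}
```

With the hypothetical rules ("l", "o"), ("lo", "w"), ("e", "r"), calling `ToyBpe.Apply("lower", rules)` yields the two pieces "low" and "er".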

Properties

ContinuingSubwordPrefix	Gets the prefix used for every subword that is not at the beginning of a word.
EndOfWordSuffix	Gets the optional suffix that marks an end-of-word subword.
FuseUnknownTokens	Gets or sets a value indicating whether multiple unknown tokens are fused.
Normalizer	Gets the Normalizer in use by the Tokenizer.
PreTokenizer	Gets the PreTokenizer used by the Tokenizer.
SpecialTokens	Gets the special tokens.
UnknownToken	Gets or sets the unknown token, which is used when an unknown character is encountered.
Vocabulary	Gets the dictionary mapping tokens to Ids.
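
To make the ContinuingSubwordPrefix and EndOfWordSuffix descriptions concrete, here is a minimal sketch of how such markers are conventionally attached to the pieces of one word. It assumes the usual BPE convention (the prefix goes on every piece except the first, the suffix on the last piece) and does not call the library. `SubwordMarkers.Decorate` and the marker strings "##" and "</w>" are hypothetical choices for illustration, so check the library's behavior before relying on this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Microsoft.ML.Tokenizers.
public static class SubwordMarkers
{
    // Attaches the markers to the pieces of a single word.
    public static List<string> Decorate(
        IReadOnlyList<string> pieces,
        string continuingSubwordPrefix = "",
        string endOfWordSuffix = "")
    {
        var result = new List<string>(pieces.Count);
        for (int i = 0; i < pieces.Count; i++)
        {
            string piece = pieces[i];

            // Every piece after the first continues the word.
            if (i > 0)
            {
                piece = continuingSubwordPrefix + piece;
            }

            // Only the final piece ends the word.
            if (i == pieces.Count - 1)
            {
                piece += endOfWordSuffix;
            }

            result.Add(piece);
        }

        return result;
    }
}
```

For example, `SubwordMarkers.Decorate(new[] { "token", "izer" }, "##", "</w>")` produces "token" and "##izer</w>".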

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)	Gets the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)	Gets the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Gets the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
Create(Stream, Stream, PreTokenizer, Normalizer, IReadOnlyDictionary<String,Int32>, String, String, String, Boolean)	Creates a new BPE tokenizer object to use for text encoding.
Create(Stream, Stream)	Creates a new BPE tokenizer object to use for text encoding.
Create(String, String, PreTokenizer, Normalizer, IReadOnlyDictionary<String,Int32>, String, String, String, Boolean)	Creates a new BPE tokenizer object to use for text encoding.
Create(String, String)	Creates a new BPE tokenizer object to use for text encoding.
CreateAsync(Stream, Stream, PreTokenizer, Normalizer, IReadOnlyDictionary<String,Int32>, String, String, String, Boolean)	Asynchronously creates a new BPE tokenizer object to use for text encoding.
Decode(IEnumerable<Int32>, Boolean)	Decodes the given ids back to a String.
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)	Decodes the given ids back to text and stores the result in the `destination` span.
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)	Decodes the given ids back to text and stores the result in the `destination` span.
Decode(IEnumerable<Int32>)	Decodes the given ids back to a String.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Finds the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)	Finds the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)	Finds the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Finds the index of the maximum encoding capacity from the end of the text without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)	Finds the index of the maximum encoding capacity from the end of the text without surpassing the token limit. (Inherited from Tokenizer)
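
The methods listed above might be combined roughly as follows. This is a sketch under stated assumptions rather than an official sample. It assumes the Microsoft.ML.Tokenizers package is referenced, that the two command-line arguments are paths to a BPE vocabulary file and a merges file you already have (no specific file names or formats are implied here), and that the optional Boolean parameters of the inherited methods keep their default values. Return types are left to `var` where they may differ between package versions.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: args[0] and args[1] are paths to an existing
// vocabulary file and merges file for a BPE model.
string vocabFile = args[0];
string mergesFile = args[1];

// Create(String, String): builds the tokenizer from the two files.
BpeTokenizer tokenizer = BpeTokenizer.Create(vocabFile, mergesFile);

string text = "Hello, tokenizer!";

// EncodeToIds(String, Boolean, Boolean): maps the text to token Ids.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// CountTokens(String, Boolean, Boolean): counts tokens for the same text.
int count = tokenizer.CountTokens(text);

// Decode(IEnumerable<Int32>): maps the Ids back to text.
var decoded = tokenizer.Decode(ids);

Console.WriteLine($"Ids: {string.Join(", ", ids)}");
Console.WriteLine($"Token count: {count}");
Console.WriteLine($"Decoded: {decoded}");
```

The Ids, the token count, and whether the decoded text matches the input exactly all depend on the vocabulary, merges, and tokenizer settings in use, so no output is shown here.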

Applies to
