Make Cue Matrix
JudiLing.Cue_Matrix_Struct
— Type
A structure that stores information created by make_cue_matrix:
C: the cue matrix
f2i: a dictionary returning the indices for features
i2f: a dictionary returning the features for indices
gold_ind: a list of indices of gold paths
A: the adjacency matrix
grams: the number of grams for cues
target_col: the column name for target strings
tokenized: whether the dataset target is tokenized
sep_token: the separator
keep_sep: whether to keep separators in cues
start_end_token: the start and end token in boundary cues
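To make the relationship between C, f2i and i2f concrete, here is a minimal sketch in plain Julia. It does not call JudiLing; the toy n-gram lists, the variable names, and the binary presence coding of the matrix are assumptions made for illustration only.

```julia
using SparseArrays

# Toy n-gram lists for two words (trigrams, "#" as the boundary marker).
ngrams_per_word = [["#ab", "ab#"], ["#ab", "abc", "bc#"]]

# Give each distinct n-gram an index, in order of first appearance.
features = unique(vcat(ngrams_per_word...))
f2i = Dict(f => i for (i, f) in enumerate(features))   # feature -> index
i2f = Dict(i => f for (f, i) in f2i)                   # index -> feature

# Sparse matrix with one row per word and one column per n-gram.
# Assumption of this sketch: an entry is 1 when the word contains the n-gram.
rows = Int[]
cols = Int[]
for (r, ngrams) in enumerate(ngrams_per_word), g in unique(ngrams)
    push!(rows, r)
    push!(cols, f2i[g])
end
C = sparse(rows, cols, ones(Int, length(rows)), length(ngrams_per_word), length(f2i))
```

In this sketch i2f[f2i[g]] returns g for every n-gram g, which is the sense in which the two dictionaries are inverses of each other.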
JudiLing.make_cue_matrix
— Function
Construct cue matrix.
JudiLing.make_combined_cue_matrix
— Function
Construct cue matrices in which the features and adjacencies are combined across the training and validation datasets.
JudiLing.make_ngrams
— Function
Given a list of string tokens, extract their n-grams.
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data::DataFrame)
Make the cue matrix for a training dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given a dataset in the form of a DataFrame.
Obligatory Arguments
data::DataFrame: the dataset
Optional Arguments
grams::Int64=3: the number of grams for cues
target_col::Union{String, Symbol}=:Words: the column name for target strings
tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator
keep_sep::Bool=false: if true, keep separators in cues
start_end_token::Union{String, Char}="#": start and end token in boundary cues
verbose::Bool=false: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    latin_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    start_end_token="#",
    keep_sep=false,
    verbose=false
)

# make cue matrix with tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    french_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    start_end_token="#",
    keep_sep=true,
    verbose=false
)
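The gold paths and the adjacency matrix mentioned in the description can be pictured with a small stand-alone sketch. It assumes that a gold path is the sequence of n-gram indices spelling out a word, and that the adjacency matrix marks which n-gram may directly follow which. The data and names are hypothetical, and this is not JudiLing's own code.

```julia
using SparseArrays

# Toy n-gram lists for two words (trigrams, "#" as the boundary marker).
ngrams_per_word = [["#ab", "ab#"], ["#ab", "abc", "bc#"]]
features = unique(vcat(ngrams_per_word...))
f2i = Dict(f => i for (i, f) in enumerate(features))

# Gold path of a word: its n-grams, in order, replaced by their indices.
gold_ind = [[f2i[g] for g in ngrams] for ngrams in ngrams_per_word]

# Adjacency matrix: A[i, j] = 1 when n-gram j directly follows n-gram i
# in at least one word of the dataset.
A = spzeros(Int, length(f2i), length(f2i))
for path in gold_ind, k in 1:(length(path) - 1)
    A[path[k], path[k + 1]] = 1
end
```

Under these assumptions every gold path is a walk through A, so A describes the n-gram transitions attested in the dataset.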
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)
Make the cue matrix for a validation dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given a dataset in the form of a DataFrame.
Obligatory Arguments
data::DataFrame: the dataset
cue_obj::Cue_Matrix_Struct: training cue object
Optional Arguments
grams::Int64=3: the number of grams for cues
target_col::Union{String, Symbol}=:Words: the column name for target strings
tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator
keep_sep::Bool=false: if true, keep separators in cues
start_end_token::Union{String, Char}="#": start and end token in boundary cues
verbose::Bool=false: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    latin_val,
    cue_obj_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    keep_sep=false,
    start_end_token="#",
    verbose=false
)

# make cue matrix with tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    french_val,
    cue_obj_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
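One way to picture why the training cue object is passed in is that the validation matrix has to share its column layout with the training matrix. The sketch below illustrates that idea with hypothetical data and names. Skipping n-grams that never occurred in training is a choice made only for this sketch; how JudiLing itself treats them is not stated here.

```julia
using SparseArrays

train_ngrams = [["#ab", "ab#"], ["#ab", "abc", "bc#"]]
val_ngrams   = [["#ab", "abc", "bc#"], ["#ac", "ac#"]]

# Feature index built from the training data only.
f2i_train = Dict(f => i for (i, f) in enumerate(unique(vcat(train_ngrams...))))

# Validation n-grams that have no column in the training index.
unseen = unique([g for ngrams in val_ngrams for g in ngrams if !haskey(f2i_train, g)])

# Encode the validation words against the training columns.
rows = Int[]
cols = Int[]
for (r, ngrams) in enumerate(val_ngrams), g in unique(ngrams)
    haskey(f2i_train, g) || continue   # sketch-only choice: skip unseen n-grams
    push!(rows, r)
    push!(cols, f2i_train[g])
end
C_val = sparse(rows, cols, ones(Int, length(rows)), length(val_ngrams), length(f2i_train))
```

Here C_val has as many columns as there are training features, so the same column means the same n-gram in both matrices.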
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)
Make the cue matrices for the training and validation datasets at the same time.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
Optional Arguments
grams::Int64=3: the number of grams for cues
target_col::Union{String, Symbol}=:Words: the column name for target strings
tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator
keep_sep::Bool=false: if true, keep separators in cues
start_end_token::Union{String, Char}="#": start and end token in boundary cues
verbose::Bool=false: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
)

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data::DataFrame, pyndl_weights::Pyndl_Weight_Struct)
Make the cue matrix for pyndl mode.
JudiLing.make_combined_cue_matrix
— Method
make_combined_cue_matrix(data_train, data_val)
Make the cue matrix for training and validation datasets at the same time, where the features and adjacencies are combined.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
Optional Arguments
grams::Int64=3: the number of grams for cues
target_col::Union{String, Symbol}=:Words: the column name for target strings
tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator
keep_sep::Bool=false: if true, keep separators in cues
start_end_token::Union{String, Char}="#": start and end token in boundary cues
verbose::Bool=false: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
)

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
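What "combined features and adjacencies" could look like is sketched below in plain Julia, under the assumption that the feature index and the adjacency matrix are simply built from the union of both datasets. All data and names are hypothetical; the sketch illustrates the idea rather than JudiLing's implementation.

```julia
using SparseArrays

train_ngrams = [["#ab", "ab#"], ["#ab", "abc", "bc#"]]
val_ngrams   = [["#ab", "abc", "bc#"], ["#ac", "ac#"]]

# One shared feature index covering the n-grams of both datasets.
all_features = unique(vcat(train_ngrams..., val_ngrams...))
f2i = Dict(f => i for (i, f) in enumerate(all_features))

# One shared adjacency matrix holding the transitions seen in either dataset.
A = spzeros(Int, length(f2i), length(f2i))
for ngrams in vcat(train_ngrams, val_ngrams), k in 1:(length(ngrams) - 1)
    A[f2i[ngrams[k]], f2i[ngrams[k + 1]]] = 1
end

# Every validation n-gram now has a column of its own.
covered = all(haskey(f2i, g) for ngrams in val_ngrams for g in ngrams)
```

With a shared index, n-grams that occur only in the validation data (here "#ac" and "ac#") get columns and transitions too, at the cost of the index no longer reflecting the training data alone.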
JudiLing.make_ngrams
— Method
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)
Given a list of string tokens, return a list of all n-grams for these tokens.
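As an illustration of n-gram extraction with boundary markers, here is a minimal sketch. The helper sketch_ngrams is hypothetical and only mirrors the argument order shown above. It assumes that the token list is padded with start_end_token on both sides and that the tokens inside an n-gram are joined with sep_token only when keep_sep is true.

```julia
# Hypothetical helper for illustration; not JudiLing's make_ngrams.
function sketch_ngrams(tokens::Vector{String}, grams::Int, keep_sep::Bool,
                       sep_token::Union{Nothing,String}, start_end_token::String)
    # Pad the token sequence with the boundary marker on both sides.
    padded = vcat(start_end_token, tokens, start_end_token)
    # Join tokens with the separator only when it should be kept.
    joiner = (keep_sep && sep_token !== nothing) ? sep_token : ""
    # Slide a window of `grams` tokens over the padded sequence.
    return [join(padded[i:(i + grams - 1)], joiner)
            for i in 1:(length(padded) - grams + 1)]
end

# Untokenized case: each character is a token.
letter_ngrams = sketch_ngrams(string.(collect("vocat")), 3, false, nothing, "#")

# Tokenized case: syllables as tokens, separator kept inside the n-grams.
syllable_ngrams = sketch_ngrams(["pa", "ri"], 3, true, "-", "#")
```

In this sketch letter_ngrams is ["#vo", "voc", "oca", "cat", "at#"] and syllable_ngrams is ["#-pa-ri", "pa-ri-#"].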