Make Cue Matrix

JudiLing.Cue_Matrix_Struct (Type)

A structure that stores the information created by make_cue_matrix:

  • C: the cue matrix
  • f2i: a dictionary returning the indices for features
  • i2f: a dictionary returning the features for indices
  • gold_ind: a list of indices of gold paths
  • A: the adjacency matrix
  • grams: the number of grams for cues
  • target_col: the column name for target strings
  • tokenized: whether the dataset target is tokenized
  • sep_token: the separator
  • keep_sep: whether to keep separators in cues
  • start_end_token: the start and end token in boundary cues
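To make the fields concrete, the sketch below builds toy versions of the cue matrix, the two dictionaries, and the gold paths for two words, using character trigrams. It is an illustration under assumptions, not JudiLing's implementation: the helper names toy_trigrams and toy_cue_object are hypothetical, the matrix is a dense 0/1 matrix, and the adjacency matrix is left out.

```julia
# Toy illustration only: this is NOT JudiLing's implementation.
# `toy_trigrams` and `toy_cue_object` are hypothetical helper names.

# Split a word into character trigrams, padded with a boundary token.
function toy_trigrams(word::AbstractString; boundary::AbstractString = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+2]) for i in 1:length(chars)-2]
end

function toy_cue_object(words::Vector{String})
    f2i = Dict{String,Int}()          # feature (n-gram) -> column index
    i2f = Dict{Int,String}()          # column index -> feature (n-gram)
    gold_ind = Vector{Vector{Int}}()  # per word: the sequence of feature indices
    for w in words
        path = Int[]
        for g in toy_trigrams(w)
            # Assign the next free index the first time an n-gram is seen.
            idx = get!(f2i, g, length(f2i) + 1)
            i2f[idx] = g
            push!(path, idx)
        end
        push!(gold_ind, path)
    end
    # One row per word, one column per feature; 1 marks "n-gram occurs in word".
    C = zeros(Int, length(words), length(f2i))
    for (row, path) in enumerate(gold_ind)
        for col in path
            C[row, col] = 1
        end
    end
    return (C = C, f2i = f2i, i2f = i2f, gold_ind = gold_ind)
end

obj = toy_cue_object(["vocat", "vocas"])
println(obj.gold_ind)             # [[1, 2, 3, 4, 5], [1, 2, 3, 6, 7]]
println(obj.i2f[obj.f2i["cas"]])  # cas
```

In the toy, the two words share their first three trigrams, so their rows in C overlap in columns 1 to 3 and differ in the rest. With the real package, the corresponding fields would be read from the returned object, for example cue_obj_train.C or cue_obj_train.f2i.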

JudiLing.make_cue_matrix (Method)
make_cue_matrix(data::DataFrame)

Make the cue matrix for a training dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given a dataset in the form of a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: the separator between tokens in the target strings
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    latin_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    start_end_token="#",
    keep_sep=false,
    verbose=false
    )

# make cue matrix with tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    french_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    start_end_token="#",
    keep_sep=true,
    verbose=false
    )
JudiLing.make_cue_matrix (Method)
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)

Make the cue matrix for a validation dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given a dataset in the form of a DataFrame and the cue object of the training dataset.

Obligatory Arguments

  • data::DataFrame: the dataset
  • cue_obj::Cue_Matrix_Struct: training cue object

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: the separator between tokens in the target strings
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed
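The reason for passing the training cue object can be sketched with a toy example: the validation matrix reuses the training feature index, so both matrices share the same columns. This is an illustration under assumptions, not JudiLing's implementation. The helper names toy_trigrams and toy_val_matrix are hypothetical, and collecting unseen n-grams in a separate list is a choice made for this sketch; how the package itself treats n-grams that are absent from the training data is not stated here.

```julia
# Toy illustration only: this is NOT JudiLing's implementation.
# `toy_trigrams` and `toy_val_matrix` are hypothetical helper names.

# Split a word into character trigrams, padded with a boundary token.
function toy_trigrams(word::AbstractString; boundary::AbstractString = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+2]) for i in 1:length(chars)-2]
end

# A feature index as it might look after processing the training word "vocat".
train_f2i = Dict("#vo" => 1, "voc" => 2, "oca" => 3, "cat" => 4, "at#" => 5)

# Build a validation matrix with the SAME columns as the training matrix.
# Assumption for this sketch: n-grams missing from the training index are
# collected and returned instead of being added as new columns.
function toy_val_matrix(words::Vector{String}, f2i::Dict{String,Int})
    C = zeros(Int, length(words), length(f2i))
    unseen = String[]
    for (row, w) in enumerate(words)
        for g in toy_trigrams(w)
            if haskey(f2i, g)
                C[row, f2i[g]] = 1
            else
                push!(unseen, g)
            end
        end
    end
    return C, unseen
end

C_val, unseen = toy_val_matrix(["vocat", "vocas"], train_f2i)
println(size(C_val))  # (2, 5)
println(unseen)       # ["cas", "as#"]
```

Because the column layout is fixed by the training index, anything learned from the training cue matrix can be applied to the validation rows without re-aligning features.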

Examples

# make cue matrix without tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    latin_val,
    cue_obj_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    keep_sep=false,
    start_end_token="#",
    verbose=false
    )

# make cue matrix with tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    french_val,
    cue_obj_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )
JudiLing.make_cue_matrix (Method)
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)

Make the cue matrices for the training and validation datasets at the same time.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: the separator between tokens in the target strings
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
    )

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )
JudiLing.make_combined_cue_matrix (Method)
make_combined_cue_matrix(data_train, data_val)

Make the cue matrix for training and validation datasets at the same time, where the features and adjacencies are combined.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: the separator between tokens in the target strings
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed
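What "combined features" can mean is sketched below with a toy example: a single feature index is built over the n-grams of both datasets, so n-grams that occur only in the validation data still receive a column. This is an illustration under assumptions, not JudiLing's implementation; the helper names toy_trigrams, toy_combined_index, and toy_matrix are hypothetical, and the adjacency matrix is left out.

```julia
# Toy illustration only: this is NOT JudiLing's implementation.
# `toy_trigrams`, `toy_combined_index`, and `toy_matrix` are hypothetical names.

# Split a word into character trigrams, padded with a boundary token.
function toy_trigrams(word::AbstractString; boundary::AbstractString = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+2]) for i in 1:length(chars)-2]
end

# Build ONE feature index over several word lists, in order of first appearance.
function toy_combined_index(word_lists::Vector{String}...)
    f2i = Dict{String,Int}()
    for words in word_lists
        for w in words
            for g in toy_trigrams(w)
                get!(f2i, g, length(f2i) + 1)
            end
        end
    end
    return f2i
end

# Fill a 0/1 matrix for `words` using a given feature index.
function toy_matrix(words::Vector{String}, f2i::Dict{String,Int})
    C = zeros(Int, length(words), length(f2i))
    for (row, w) in enumerate(words)
        for g in toy_trigrams(w)
            C[row, f2i[g]] = 1
        end
    end
    return C
end

train_words = ["vocat"]
val_words = ["vocas"]

f2i = toy_combined_index(train_words, val_words)
C_train = toy_matrix(train_words, f2i)
C_val = toy_matrix(val_words, f2i)

println(C_train)  # [1 1 1 1 1 0 0]
println(C_val)    # [1 1 1 0 0 1 1]
```

In the toy, columns 6 and 7 exist only because of the validation word, so the training row simply has zeros there. Both matrices have the same width, and no validation n-gram is left without an index.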

Examples

# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
    )

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )
JudiLing.make_ngrams (Method)
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)

Given a list of string tokens, return a list of all n-grams for these tokens.
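The idea can be sketched with a small stand-in function that takes the same five arguments. It is a minimal sketch under assumptions, not the package's implementation: toy_make_ngrams is a hypothetical name, and the behavior shown (pad the token list with the boundary token on both sides, slide a window of size grams, and join each window with the separator only when keep_sep is true) is an assumption about what the arguments mean.

```julia
# Toy sketch only: this is NOT JudiLing's implementation.
# `toy_make_ngrams` is a hypothetical stand-in with the same argument order.
function toy_make_ngrams(tokens::Vector{String}, grams::Int, keep_sep::Bool,
                         sep_token::AbstractString, start_end_token::AbstractString)
    # Assumption: the boundary token is added once at each end.
    padded = String[start_end_token, tokens..., start_end_token]
    # Assumption: the separator is re-inserted between tokens only if keep_sep.
    joiner = keep_sep ? sep_token : ""
    # Slide a window of `grams` tokens over the padded list.
    return [join(padded[i:i+grams-1], joiner) for i in 1:length(padded)-grams+1]
end

# Syllable tokens, bigrams, separator kept.
println(toy_make_ngrams(["vo", "cas"], 2, true, "-", "#"))
# ["#-vo", "vo-cas", "cas-#"]

# Same tokens, separator dropped.
println(toy_make_ngrams(["vo", "cas"], 2, false, "-", "#"))
# ["#vo", "vocas", "cas#"]

# Single characters as tokens, trigrams.
letters = string.(collect("vocas"))
println(toy_make_ngrams(letters, 3, false, "", "#"))
# ["#vo", "voc", "oca", "cas", "as#"]
```

The last call shows that an untokenized word can be treated as a list of one-character tokens, which yields the familiar letter trigrams with boundary markers.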
