Find Paths

Structures

JudiLing.Gold_Path_Info_Struct - Type

Stores information about gold paths, including their indices, the support for each index, and the total support. It can be used to evaluate how low the threshold needs to be set to find most of the correct paths or, if set very low, all of them.
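
As a rough illustration of this kind of threshold evaluation, the sketch below works on plain vectors rather than on the struct itself. The support values and the helper `coverage` are hypothetical; the only assumption taken from the documentation is that an n-gram is considered when its support is higher than the threshold.

```julia
using Statistics

# Hypothetical support values along the gold path of each word
# (one vector per word, one entry per n-gram in the path).
gold_supports = [
    [0.82, 0.40, 0.35],
    [0.15, 0.60, 0.55],
    [0.05, 0.90, 0.70],
    [0.30, 0.25, 0.45],
]

# A gold path survives thresholding only if its weakest n-gram does.
weakest = minimum.(gold_supports)

# Hypothetical helper: fraction of gold paths whose n-grams all exceed the threshold.
coverage(threshold) = mean(weakest .> threshold)

coverage(0.1)   # 0.75
coverage(0.01)  # 1.0
```

Lowering the threshold from 0.1 to 0.01 recovers the one gold path whose weakest n-gram has a support of 0.05.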

Build paths

JudiLing.build_paths - Function

The build_paths function constructs paths by considering only those n-grams that are close to the target. It first takes the predicted c-hat vector and finds its n_neighbors closest neighbors in the C matrix. It then collects all n-grams of these neighbors and constructs all valid paths from them. The path whose predicted semantic vector correlates best with the target semantic vector (synthesis by analysis) is selected.
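
The neighbor-selection step can be pictured with a small sketch in plain Julia. It is not the library's implementation: the toy matrix `C`, the vector `chat`, and the use of Pearson correlation as the closeness measure are assumptions made for illustration.

```julia
using Statistics

# Toy cue matrix: rows are words, columns are n-gram cues (hypothetical data).
C = [1 1 0 0 0;
     0 1 1 0 0;
     0 0 0 1 1]

# Predicted form vector (c-hat) for one target word.
chat = [0.9, 0.8, 0.1, 0.0, 0.1]

# Correlate the prediction with every row of C and keep the top neighbors.
n_neighbors = 2
r = [cor(chat, C[i, :]) for i in axes(C, 1)]
neighbors = partialsortperm(r, 1:n_neighbors, rev = true)

# Pool the n-grams (column indices) that occur in any selected neighbor.
candidate_cues = findall(vec(any(C[neighbors, :] .> 0, dims = 1)))
```

Here the first two words are the closest neighbors, so only cues 1 to 3 remain available for path construction.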

JudiLing.build_paths - Method
build_paths(
    data_val,
    C_train,
    S_val,
    F_train,
    Chat_val,
    A,
    i2f,
    C_train_ind;
    rC = nothing,
    max_t = 15,
    max_can = 10,
    n_neighbors = 10,
    grams = 3,
    tokenized = false,
    sep_token = nothing,
    target_col = :Words,
    start_end_token = "#",
    if_pca = false,
    pca_eval_M = nothing,
    ignore_nan = true,
    verbose = false,
)

The build_paths function constructs paths by considering only those n-grams that are close to the target. It first takes the predicted c-hat vector and finds its n_neighbors closest neighbors in the C matrix. It then collects all n-grams of these neighbors and constructs all valid paths from them. The path whose predicted semantic vector correlates best with the target semantic vector (synthesis by analysis) is selected.

Obligatory Arguments

  • data_val::DataFrame: the validation dataset
  • C_train::SparseMatrixCSC: the C matrix for the training dataset
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix}: the F matrix for the training dataset
  • Chat_val::Matrix: the Chat matrix for the validation dataset
  • A::SparseMatrixCSC: the adjacency matrix
  • i2f::Dict: the dictionary returning features given indices
  • C_train_ind::Array: the gold paths' indices for the training dataset
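
To make the roles of A and i2f concrete, here is a minimal sketch, assuming that two trigrams are adjacent when the last two symbols of the first match the first two symbols of the second. The inventory and the helpers `follows` and `is_valid` are hypothetical and only illustrate the idea of a valid path.

```julia
using SparseArrays

# Hypothetical trigram inventory for the word "#abc#" plus one distractor.
i2f = Dict(1 => "#ab", 2 => "abc", 3 => "bc#", 4 => "xyz")

# Trigram b may follow trigram a when the last two symbols of a are the
# first two symbols of b (ASCII strings assumed).
follows(a, b) = a[2:end] == b[1:end-1]

n = length(i2f)
A = spzeros(Bool, n, n)
for i in 1:n, j in 1:n
    A[i, j] = follows(i2f[i], i2f[j])
end

# A path is valid when every consecutive pair of indices is licensed by A.
is_valid(path) = all(A[path[k], path[k+1]] for k in 1:length(path)-1)

is_valid([1, 2, 3])  # true
is_valid([1, 4])     # false
```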

Optional Arguments

  • rC::Union{Nothing, Matrix}=nothing: correlation matrix of C and Chat; pass it to save computing time
  • max_t::Int64=15: maximum number of timesteps
  • max_can::Int64=10: maximum number of candidates to consider
  • n_neighbors::Int64=10: the top n form neighbors to be considered
  • grams::Int64=3: the number n of grams that make up an n-gram
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator token
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • if_pca::Bool=false: turn on to enable pca mode
  • pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
  • verbose::Bool=false: if true, more information will be printed

Examples

# training dataset
JudiLing.build_paths(
    latin_train,
    cue_obj_train.C,
    S_train,
    F_train,
    Chat_train,
    A,
    cue_obj_train.i2f,
    cue_obj_train.gold_ind,
    max_t=max_t,
    n_neighbors=10,
    verbose=false
    )

# validation dataset
JudiLing.build_paths(
    latin_val,
    cue_obj_train.C,
    S_val,
    F_train,
    Chat_val,
    A,
    cue_obj_train.i2f,
    cue_obj_train.gold_ind,
    max_t=max_t,
    n_neighbors=10,
    verbose=false
    )

# pca mode
res_build = JudiLing.build_paths(
    korean,
    Array(Cpcat),
    S,
    F,
    ChatPCA,
    A,
    cue_obj.i2f,
    cue_obj.gold_ind,
    max_t=max_t,
    if_pca=true,
    pca_eval_M=Fo,
    n_neighbors=3,
    verbose=true
    )

Learn paths

JudiLing.learn_paths - Function

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.
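
The role of the support values can be sketched for a single position as follows. This is a simplified illustration, not the library's code: the `support` vector is made up, and the selection rules are assumptions read off the descriptions of the threshold, tolerance, and max_tolerance arguments.

```julia
# Hypothetical support values for five n-grams at one position.
support = [0.45, 0.08, -0.02, 0.30, -0.40]

threshold = 0.1
tolerance = -0.1

# N-grams whose support is higher than the main threshold.
strong = findall(>(threshold), support)

# In tolerant mode, n-grams between the two thresholds may also be used,
# subject to the max_tolerance budget of a path.
tolerated = findall(s -> tolerance < s <= threshold, support)
```

With these values, n-grams 1 and 4 are selected outright, n-grams 2 and 3 are only available in tolerant mode, and n-gram 5 is never considered.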

JudiLing.learn_paths - Method
learn_paths(
    data::DataFrame,
    cue_obj::Cue_Matrix_Struct,
    S_val::Union{SparseMatrixCSC, Matrix},
    F_train,
    Chat_val::Union{SparseMatrixCSC, Matrix};
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    verbose::Bool = true)

A high-level wrapper for learn_paths that offers much less control. It is aimed at users who are new to JudiLing and to the learn_paths function.

Obligatory Arguments

  • data::DataFrame: the training dataset
  • cue_obj::Cue_Matrix_Struct: the cue matrix object containing the C matrix and all information associated with it
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: either the F matrix for training dataset, or a deep learning comprehension model trained on the training set
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset

Optional Arguments

  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and the threshold and the max_tolerance number has not been reached, the n-gram may still be added to the path
  • max_tolerance::Int64=3: maximum number of n-grams with support below the threshold that are allowed in a path (tolerant mode)
  • activation::Union{Nothing, Function}=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
  • verbose::Bool=true: if true, more information is printed

Examples

res = learn_paths(latin, cue_obj, S, F, Chat)
JudiLing.learn_paths - Method
learn_paths(
    data_train::DataFrame,
    data_val::DataFrame,
    C_train::Union{Matrix, SparseMatrixCSC},
    S_val::Union{Matrix, SparseMatrixCSC},
    F_train,
    Chat_val::Union{Matrix, SparseMatrixCSC},
    A::SparseMatrixCSC,
    i2f::Dict,
    f2i::Dict;
    gold_ind::Union{Nothing, Vector} = nothing,
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    max_t::Int = 15,
    max_can::Int = 10,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    grams::Int = 3,
    tokenized::Bool = false,
    sep_token::Union{Nothing, String} = nothing,
    keep_sep::Bool = false,
    target_col::Union{Symbol, String} = "Words",
    start_end_token::String = "#",
    issparse::Union{Symbol, Bool} = :auto,
    sparse_ratio::Float64 = 0.05,
    if_pca::Bool = false,
    pca_eval_M::Union{Nothing, Matrix} = nothing,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    check_threshold_stat::Bool = false,
    verbose::Bool = false
)

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
  • A::SparseMatrixCSC: the adjacency matrix
  • i2f::Dict: the dictionary returning features given indices
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • max_t::Int64=15: maximum timestep
  • max_can::Int64=10: maximum number of candidates to consider
  • threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and the threshold and the max_tolerance number has not been reached, the n-gram may still be added to the path
  • max_tolerance::Int64=3: maximum number of n-grams with support below the threshold that are allowed in a path (tolerant mode)
  • grams::Int64=3: the number n of grams that make up an n-gram
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String}=nothing: separator token
  • keep_sep::Bool=false: if true, keep separators in cues
  • target_col::Union{Symbol, String}="Words": the column name for target strings
  • start_end_token::String="#": start and end token in boundary cues
  • issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
  • sparse_ratio::Float64=0.05: the ratio used to decide whether a matrix is sparse
  • if_pca::Bool=false: turn on to enable pca mode
  • pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
  • activation::Union{Nothing, Function}=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
  • check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
  • verbose::Bool=false: if true, more information is printed
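
How a ratio such as sparse_ratio might drive the choice between dense and sparse storage can be sketched as below. The rule itself (store sparsely when the share of non-zero entries is below the ratio) and the helper `maybe_sparse` are assumptions for illustration; the documentation above does not spell out the exact criterion.

```julia
using SparseArrays

# Assumed rule, for illustration only: store a matrix sparsely when the
# share of non-zero entries is below sparse_ratio.
function maybe_sparse(M::AbstractMatrix; sparse_ratio = 0.05)
    density = count(!iszero, M) / length(M)
    return density < sparse_ratio ? sparse(M) : Matrix(M)
end

M = zeros(10, 10)
M[1, 1] = 1.0                          # density 0.01

maybe_sparse(M) isa SparseMatrixCSC    # true
maybe_sparse(ones(2, 2)) isa Matrix    # true
```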

Examples

# basic usage without tokenization
res = JudiLing.learn_paths(
    latin,
    latin,
    cue_obj.C,
    S,
    F,
    Chat,
    A,
    cue_obj.i2f,
    cue_obj.f2i, # f2i is a required positional argument
    max_t=max_t,
    max_can=10,
    grams=3,
    threshold=0.1,
    tokenized=false,
    keep_sep=false,
    target_col=:Word,
    verbose=true)

# basic usage with tokenization
res = JudiLing.learn_paths(
    french,
    french,
    cue_obj.C,
    S,
    F,
    Chat,
    A,
    cue_obj.i2f,
    cue_obj.f2i, # f2i is a required positional argument
    max_t=max_t,
    max_can=10,
    grams=3,
    threshold=0.1,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    target_col=:Syllables,
    verbose=true)

# basic usage for validation data
res_val = JudiLing.learn_paths(
    latin_train,
    latin_val,
    cue_obj_train.C,
    S_val,
    F_train,
    Chat_val,
    A,
    cue_obj_train.i2f,
    cue_obj_train.f2i, # f2i is a required positional argument
    max_t=max_t,
    max_can=10,
    grams=3,
    threshold=0.1,
    tokenized=false,
    keep_sep=false,
    target_col=:Word,
    verbose=true)

# turn on tolerance mode
res_val = JudiLing.learn_paths(
...
threshold=0.1,
is_tolerant=true,
tolerance=-0.1,
max_tolerance=4,
...)

# turn on check gold paths mode
res_train, gpi_train = JudiLing.learn_paths(
...
gold_ind=cue_obj_train.gold_ind,
Shat_val=Shat_train,
check_gold_path=true,
...)

res_val, gpi_val = JudiLing.learn_paths(
...
gold_ind=cue_obj_val.gold_ind,
Shat_val=Shat_val,
check_gold_path=true,
...)

# control over sparsity
res_val = JudiLing.learn_paths(
...
issparse=:auto,
sparse_ratio=0.05,
...)

# pca mode
res_learn = JudiLing.learn_paths(
korean,
korean,
Array(Cpcat),
S,
F,
ChatPCA,
A,
cue_obj.i2f,
cue_obj.f2i,
check_gold_path=false,
gold_ind=cue_obj.gold_ind,
Shat_val=Shat,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=true,
sep_token="_",
keep_sep=true,
target_col=:Verb_syll,
if_pca=true,
pca_eval_M=Fo,
verbose=true);
JudiLing.learn_paths_rpi - Method
learn_paths_rpi(
    data_train::DataFrame,
    data_val::DataFrame,
    C_train::Union{Matrix, SparseMatrixCSC},
    S_val::Union{Matrix, SparseMatrixCSC},
    F_train,
    Chat_val::Union{Matrix, SparseMatrixCSC},
    A::SparseMatrixCSC,
    i2f::Dict,
    f2i::Dict;
    gold_ind::Union{Nothing, Vector} = nothing,
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    max_t::Int = 15,
    max_can::Int = 10,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    grams::Int = 3,
    tokenized::Bool = false,
    sep_token::Union{Nothing, String} = nothing,
    keep_sep::Bool = false,
    target_col::Union{Symbol, String} = "Words",
    start_end_token::String = "#",
    issparse::Union{Symbol, Bool} = :auto,
    sparse_ratio::Float64 = 0.05,
    if_pca::Bool = false,
    pca_eval_M::Union{Nothing, Matrix} = nothing,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    check_threshold_stat::Bool = false,
    verbose::Bool = false
)

Runs learn_paths and additionally returns the supports of the indices in the resulting paths.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
  • S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
  • F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
  • Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
  • A::SparseMatrixCSC: the adjacency matrix
  • i2f::Dict: the dictionary returning features given indices
  • f2i::Dict: the dictionary returning indices given features

Optional Arguments

  • gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
  • Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
  • check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
  • max_t::Int64=15: maximum timestep
  • max_can::Int64=10: maximum number of candidates to consider
  • threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
  • is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
  • tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and the threshold and the max_tolerance number has not been reached, the n-gram may still be added to the path
  • max_tolerance::Int64=3: maximum number of n-grams with support below the threshold that are allowed in a path (tolerant mode)
  • grams::Int64=3: the number n of grams that make up an n-gram
  • tokenized::Bool=false: if true, the dataset target is tokenized
  • sep_token::Union{Nothing, String}=nothing: separator token
  • keep_sep::Bool=false: if true, keep separators in cues
  • target_col::Union{Symbol, String}="Words": the column name for target strings
  • start_end_token::String="#": start and end token in boundary cues
  • issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
  • sparse_ratio::Float64=0.05: the ratio used to decide whether a matrix is sparse
  • if_pca::Bool=false: turn on to enable pca mode
  • pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
  • activation::Union{Nothing, Function}=nothing: the activation function you want to pass
  • ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
  • check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
  • verbose::Bool=false: if true, more information is printed

Utility functions

JudiLing.eval_can - Method
eval_can(candidates, S, F::Union{Matrix,SparseMatrixCSC, Chain}, i2f, max_can, if_pca, pca_eval_M)

Calculate for each candidate path the correlation between predicted semantic vector and the gold standard semantic vector, and select as target for production the path with the highest correlation.
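
A minimal sketch of this selection step in plain Julia is given below. It does not call eval_can: the matrix `F`, the gold vector `s_gold`, the candidate paths, and the helper `predict` are all hypothetical, and the predicted semantic vector of a path is assumed to be the sum of the rows of F for its cues.

```julia
using Statistics

# Hypothetical setup: 4 cues, 3 semantic dimensions.
F = [0.5 0.1 0.0;
     0.2 0.6 0.1;
     0.0 0.3 0.7;
     0.4 0.0 0.2]
s_gold = [0.7, 0.7, 0.1]

# Candidate paths as vectors of cue indices.
candidates = [[1, 2], [2, 3], [3, 4]]

# Hypothetical helper: predicted semantic vector of a path,
# taken as the sum of the F rows of its cues.
predict(path) = vec(sum(F[path, :], dims = 1))

scores = [cor(predict(p), s_gold) for p in candidates]

# Treat NaN correlations as worst-case so they are never selected.
best = argmax(map(x -> isnan(x) ? -Inf : x, scores))
```

The first candidate reproduces the gold vector and is therefore selected. Mapping NaN to -Inf mirrors the intent of the ignore_nan option, since a plain argmax in Julia would otherwise pick a NaN entry.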

JudiLing.predict_shat - Method
predict_shat(F::Union{Matrix, SparseMatrixCSC},
             ci::Vector{Int})

Predicts semantic vector shat given a comprehension matrix F and a list of indices of ngrams ci.

Obligatory arguments

  • F::Union{Matrix, SparseMatrixCSC}: Comprehension matrix F.
  • ci::Vector{Int}: Vector of indices of ngrams in c vector. Essentially, this is a vector indicating which ngrams in a c vector are absent and which are present.
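
The computation can be pictured with the following sketch, which does not call predict_shat itself. It assumes that F has one row per n-gram cue and one column per semantic dimension; the values are made up.

```julia
# Hypothetical comprehension matrix: 4 n-gram cues x 2 semantic dimensions.
F = [1.0 0.0;
     0.5 0.5;
     0.0 2.0;
     3.0 3.0]
ci = [1, 3]            # indices of the n-grams that are present

# Binary cue vector: 1 where an n-gram is present, 0 elsewhere.
c = zeros(Int, size(F, 1))
c[ci] .= 1

# Predicted semantic vector: the cue vector mapped through F.
shat = F' * c          # [1.0, 2.0]
```

Under this assumption the result equals the sum of the rows of F selected by ci.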