Find Paths

Structures

JudiLing.Result_Path_Info_Struct — Type

Store information about the paths built by learn_paths or build_paths.

JudiLing.Gold_Path_Info_Struct — Type

Store gold paths' information, including the indices, the support for each index, and the total support. It can be used to evaluate how low the threshold needs to be set in order to find most of the correct paths or, if set very low, all of the correct paths.
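
As a rough illustration of such a threshold check, the sketch below assumes that the per-position supports of each gold path have already been extracted into plain vectors. The variable `gold_supports` and the helper `recoverable_share` are hypothetical and not part of JudiLing. It also assumes non-tolerant mode, where a gold path can only be found if every one of its n-grams exceeds the threshold, so the weakest support decides.

```julia
# Hypothetical data: one vector of n-gram supports per gold path.
gold_supports = [
    [0.92, 0.81, 0.40],
    [0.75, 0.08, 0.66],
    [0.30, 0.25, 0.12],
]

# Share of gold paths whose weakest n-gram support still exceeds `threshold`.
function recoverable_share(supports, threshold)
    count(s -> minimum(s) > threshold, supports) / length(supports)
end

# Lowering the threshold makes more gold paths reachable.
for th in (0.3, 0.1, 0.05)
    println("threshold = ", th, ": ", recoverable_share(gold_supports, th))
end
```

In this toy data the weakest supports are 0.40, 0.08 and 0.12, so only a threshold below 0.08 makes all three gold paths reachable.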

JudiLing.Threshold_Stat_Struct — Type

Store the threshold and tolerance proportions for each timestep.

Build paths

JudiLing.build_paths — Function

The build_paths function constructs paths by only considering those n-grams that are close to the target. It first takes the predicted c-hat vector and finds the closest n neighbors in the C matrix. Then it selects all n-grams of these neighbors, and constructs all valid paths with those n-grams. The path producing the best correlation with the target semantic vector (through synthesis by analysis) is selected.
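
The neighbor-selection step can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not JudiLing's implementation: the matrix `C`, the vector `chat` and the helpers `closest_neighbors` and `candidate_ngrams` are hypothetical, and closeness is measured here as the Pearson correlation between the predicted vector and each row of C.

```julia
using Statistics

# Toy C matrix: rows are training words, columns are n-grams (1 = present).
C = [1 1 0 0 0;
     0 1 1 0 0;
     0 0 0 1 1;
     1 0 1 0 0]

# Hypothetical predicted cue vector (c-hat) for one target word.
chat = [0.9, 0.8, 0.1, 0.0, 0.05]

# Correlate chat with every row of C and keep the n best-matching rows.
function closest_neighbors(C, chat, n)
    r = [cor(chat, C[i, :]) for i in 1:size(C, 1)]
    partialsortperm(r, 1:n, rev = true)
end

# Pool the n-gram indices that occur in the selected neighbors.
function candidate_ngrams(C, neighbors)
    sort(unique(j for i in neighbors for j in findall(!iszero, C[i, :])))
end

neighbors = closest_neighbors(C, chat, 2)
ngrams = candidate_ngrams(C, neighbors)
println("neighbors: ", neighbors, ", candidate n-grams: ", ngrams)
```

Only the pooled n-grams would then be combined into candidate paths, which keeps the search space small compared with considering every n-gram in C.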

JudiLing.build_paths — Method

build_paths(
    data_val,
    C_train,
    S_val,
    F_train,
    Chat_val,
    A,
    i2f,
    C_train_ind;
    rC = nothing,
    max_t = 15,
    max_can = 10,
    n_neighbors = 10,
    grams = 3,
    tokenized = false,
    sep_token = nothing,
    target_col = :Words,
    start_end_token = "#",
    if_pca = false,
    pca_eval_M = nothing,
    ignore_nan = true,
    verbose = false,
)

Obligatory Arguments

data_val::DataFrame: the validation dataset
C_train::SparseMatrixCSC: the C matrix for the training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
F_train::Union{SparseMatrixCSC, Matrix}: the F matrix for the training dataset
Chat_val::Matrix: the Chat matrix for the validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
C_train_ind::Array: the gold paths' indices for the training dataset

Optional Arguments

rC::Union{Nothing, Matrix}=nothing: correlation matrix of C and Chat; pass it to save computing time
max_t::Int64=15: maximum number of timesteps
max_can::Int64=10: maximum number of candidates to consider
n_neighbors::Int64=10: the top n form neighbors to be considered
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator token
target_col::Union{String, Symbol}=:Words: the column name for target strings
start_end_token::String="#": start and end token in boundary cues
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
verbose::Bool=false: if true, more information will be printed

Examples

# training dataset
JudiLing.build_paths(
    latin_train,
    cue_obj_train.C,
    S_train,
    F_train,
    Chat_train,
    A,
    cue_obj_train.i2f,
    cue_obj_train.gold_ind,
    max_t=max_t,
    n_neighbors=10,
    verbose=false
    )

# validation dataset
JudiLing.build_paths(
    latin_val,
    cue_obj_train.C,
    S_val,
    F_train,
    Chat_val,
    A,
    cue_obj_train.i2f,
    cue_obj_train.gold_ind,
    max_t=max_t,
    n_neighbors=10,
    verbose=false
    )

# pca mode
res_build = JudiLing.build_paths(
    korean,
    Array(Cpcat),
    S,
    F,
    ChatPCA,
    A,
    cue_obj.i2f,
    cue_obj.gold_ind,
    max_t=max_t,
    if_pca=true,
    pca_eval_M=Fo,
    n_neighbors=3,
    verbose=true
    )

Learn paths

JudiLing.learn_paths — Function

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.
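
The general idea of such a position-wise search can be sketched on toy data. This is a simplified illustration, not JudiLing's implementation: the per-position supports are given directly instead of being predicted, `A` is a small dense Bool matrix rather than a sparse adjacency matrix, the helper `find_paths` is hypothetical, and every word is assumed to span at least two n-grams.

```julia
# Hypothetical toy setup: four n-grams and the transitions allowed between them.
i2f = Dict(1 => "#ab", 2 => "abc", 3 => "bc#", 4 => "abd")
A = Bool[0 1 0 1;
         0 0 1 0;
         0 0 0 0;
         0 0 0 0]

# supports[t][i]: support for n-gram i at position t (made-up numbers).
supports = [[0.9, 0.0, 0.0, 0.0],
            [0.0, 0.8, 0.0, 0.05],
            [0.0, 0.0, 0.7, 0.0]]

function find_paths(A, i2f, supports; threshold = 0.1, boundary = "#")
    complete = Vector{Vector{Int}}()
    # Position 1: well-supported n-grams that start with the boundary token.
    starts = filter(
        i -> supports[1][i] > threshold && startswith(i2f[i], boundary),
        eachindex(supports[1]),
    )
    working = [[i] for i in starts]
    for t in 2:length(supports)
        extended = Vector{Vector{Int}}()
        for path in working, j in eachindex(supports[t])
            # Extend only along allowed transitions with enough support.
            if A[path[end], j] && supports[t][j] > threshold
                new_path = [path; j]
                # A path is complete once it reaches an n-gram ending in the boundary token.
                target = endswith(i2f[j], boundary) ? complete : extended
                push!(target, new_path)
            end
        end
        working = extended
    end
    complete
end

paths = find_paths(A, i2f, supports)
println([[i2f[i] for i in p] for p in paths])
```

Here the n-gram "abd" is adjacent to "#ab" but its support of 0.05 at position 2 is below the threshold, so the only complete path is "#ab", "abc", "bc#".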

JudiLing.learn_paths — Method

learn_paths(
    data::DataFrame,
    cue_obj::Cue_Matrix_Struct,
    S_val::Union{SparseMatrixCSC, Matrix},
    F_train,
    Chat_val::Union{SparseMatrixCSC, Matrix};
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    verbose::Bool = true)

A high-level wrapper function for learn_paths that offers much less control. It is aimed at users who are new to JudiLing and to the learn_paths function.

Obligatory Arguments

data::DataFrame: the training dataset
cue_obj::Cue_Matrix_Struct: the cue object containing the C matrix and all information associated with it
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: either the F matrix for training dataset, or a deep learning comprehension model trained on the training set
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset

Optional Arguments

Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
max_tolerance::Int64=3: maximum number of below-threshold n-grams allowed in a path (tolerant mode)
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
verbose::Bool=true: if true, more information is printed
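
How threshold, is_tolerant, tolerance and max_tolerance interact can be sketched as a single decision rule. The helper `accept_ngram` below is hypothetical and only mirrors the descriptions above; in particular, the use of strict comparisons at the boundaries is an assumption.

```julia
# Decide whether an n-gram with the given support may extend a path.
# `n_tolerated` counts the below-threshold n-grams already in the path.
# Returns (accepted, updated n_tolerated).
function accept_ngram(support, n_tolerated;
                      threshold = 0.1,
                      is_tolerant = false,
                      tolerance = -1000.0,
                      max_tolerance = 3)
    if support > threshold
        # Well supported: always accepted, costs no tolerance.
        return (true, n_tolerated)
    elseif is_tolerant && support > tolerance && n_tolerated < max_tolerance
        # Weakly supported: accepted only while the tolerance budget lasts.
        return (true, n_tolerated + 1)
    else
        return (false, n_tolerated)
    end
end

println(accept_ngram(0.5, 0))
println(accept_ngram(0.05, 0))
println(accept_ngram(0.05, 0; is_tolerant = true, tolerance = -0.1))
println(accept_ngram(0.05, 3; is_tolerant = true, tolerance = -0.1))
```

With the default tolerance of -1000.0, practically every n-gram qualifies as tolerable, so in tolerant mode max_tolerance is what actually limits how many weak n-grams a path can contain.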

Examples

res = learn_paths(latin, cue_obj, S, F, Chat)

JudiLing.learn_paths — Method

learn_paths(
    data_train::DataFrame,
    data_val::DataFrame,
    C_train::Union{Matrix, SparseMatrixCSC},
    S_val::Union{Matrix, SparseMatrixCSC},
    F_train,
    Chat_val::Union{Matrix, SparseMatrixCSC},
    A::SparseMatrixCSC,
    i2f::Dict,
    f2i::Dict;
    gold_ind::Union{Nothing, Vector} = nothing,
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    max_t::Int = 15,
    max_can::Int = 10,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    grams::Int = 3,
    tokenized::Bool = false,
    sep_token::Union{Nothing, String} = nothing,
    keep_sep::Bool = false,
    target_col::Union{Symbol, String} = "Words",
    start_end_token::String = "#",
    issparse::Union{Symbol, Bool} = :auto,
    sparse_ratio::Float64 = 0.05,
    if_pca::Bool = false,
    pca_eval_M::Union{Nothing, Matrix} = nothing,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    check_threshold_stat::Bool = false,
    verbose::Bool = false
)

A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.

Obligatory Arguments

data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
f2i::Dict: the dictionary returning indices given features

Optional Arguments

gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
max_t::Int64=15: maximum timestep
max_can::Int64=10: maximum number of candidates to consider
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
max_tolerance::Int64=3: maximum number of below-threshold n-grams allowed in a path (tolerant mode)
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String}=nothing: separator token
keep_sep::Bool=false: if true, keep separators in cues
target_col::Union{Symbol, String}="Words": the column name for target strings
start_end_token::String="#": start and end token in boundary cues
issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
verbose::Bool=false: if true, more information is printed
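
To make the roles of grams, tokenized, sep_token, keep_sep and start_end_token more concrete, the sketch below splits a single target string into n-grams. The helper `word_to_ngrams` is hypothetical, and the way separators are kept between units when keep_sep is true is an assumption rather than a description of JudiLing's internals.

```julia
# Split a word into overlapping n-grams, padded with the boundary token.
# With tokenized = true the word is first split on sep_token, so each token
# (for example a syllable) counts as one unit instead of one character.
function word_to_ngrams(word; grams = 3, tokenized = false,
                        sep_token = nothing, keep_sep = false,
                        start_end_token = "#")
    units = tokenized ? String.(split(word, sep_token)) : string.(collect(word))
    units = [start_end_token; units; start_end_token]
    # Re-insert the separator between units only when requested.
    joiner = (tokenized && keep_sep) ? sep_token : ""
    [join(units[i:i+grams-1], joiner) for i in 1:length(units)-grams+1]
end

println(word_to_ngrams("vocoo"))
println(word_to_ngrams("a-ve-nir"; grams = 2, tokenized = true,
                       sep_token = "-", keep_sep = true))
```

The first call treats every character as a unit and yields trigrams; the second treats the hyphen-separated pieces as units and yields bigrams over them.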

Examples

# basic usage without tokenization
res = JudiLing.learn_paths(
latin,
latin,
cue_obj.C,
S,
F,
Chat,
A,
cue_obj.i2f,
cue_obj.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=false,
keep_sep=false,
target_col=:Word,
verbose=true)

# basic usage with tokenization
res = JudiLing.learn_paths(
french,
french,
cue_obj.C,
S,
F,
Chat,
A,
cue_obj.i2f,
cue_obj.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=true,
sep_token="-",
keep_sep=true,
target_col=:Syllables,
verbose=true)

# basic usage for validation data
res_val = JudiLing.learn_paths(
latin_train,
latin_val,
cue_obj_train.C,
S_val,
F_train,
Chat_val,
A,
cue_obj_train.i2f,
cue_obj_train.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=false,
keep_sep=false,
target_col=:Word,
verbose=true)

# turn on tolerance mode
res_val = JudiLing.learn_paths(
...
threshold=0.1,
is_tolerant=true,
tolerance=-0.1,
max_tolerance=4,
...)

# turn on check gold paths mode
res_train, gpi_train = JudiLing.learn_paths(
...
gold_ind=cue_obj_train.gold_ind,
Shat_val=Shat_train,
check_gold_path=true,
...)

res_val, gpi_val = JudiLing.learn_paths(
...
gold_ind=cue_obj_val.gold_ind,
Shat_val=Shat_val,
check_gold_path=true,
...)

# control over sparsity
res_val = JudiLing.learn_paths(
...
issparse=:auto,
sparse_ratio=0.05,
...)

# pca mode
res_learn = JudiLing.learn_paths(
korean,
korean,
Array(Cpcat),
S,
F,
ChatPCA,
A,
cue_obj.i2f,
cue_obj.f2i,
check_gold_path=false,
gold_ind=cue_obj.gold_ind,
Shat_val=Shat,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=true,
sep_token="_",
keep_sep=true,
target_col=:Verb_syll,
if_pca=true,
pca_eval_M=Fo,
verbose=true);

JudiLing.learn_paths_rpi — Method

learn_paths_rpi(
    data_train::DataFrame,
    data_val::DataFrame,
    C_train::Union{Matrix, SparseMatrixCSC},
    S_val::Union{Matrix, SparseMatrixCSC},
    F_train,
    Chat_val::Union{Matrix, SparseMatrixCSC},
    A::SparseMatrixCSC,
    i2f::Dict,
    f2i::Dict;
    gold_ind::Union{Nothing, Vector} = nothing,
    Shat_val::Union{Nothing, Matrix} = nothing,
    check_gold_path::Bool = false,
    max_t::Int = 15,
    max_can::Int = 10,
    threshold::Float64 = 0.1,
    is_tolerant::Bool = false,
    tolerance::Float64 = (-1000.0),
    max_tolerance::Int = 3,
    grams::Int = 3,
    tokenized::Bool = false,
    sep_token::Union{Nothing, String} = nothing,
    keep_sep::Bool = false,
    target_col::Union{Symbol, String} = "Words",
    start_end_token::String = "#",
    issparse::Union{Symbol, Bool} = :auto,
    sparse_ratio::Float64 = 0.05,
    if_pca::Bool = false,
    pca_eval_M::Union{Nothing, Matrix} = nothing,
    activation::Union{Nothing, Function} = nothing,
    ignore_nan::Bool = true,
    check_threshold_stat::Bool = false,
    verbose::Bool = false
)

Run learn_paths and additionally return the supports for the indices of the resulting paths.

Obligatory Arguments

data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for training dataset, or a deep learning comprehension model trained on the training data
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
f2i::Dict: the dictionary returning indices given features

Optional Arguments

gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as second output value
max_t::Int64=15: maximum timestep
max_can::Int64=10: maximum number of candidates to consider
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration only if its support is higher than this value
is_tolerant::Bool=false: if true, select a specified number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the value set for the second threshold (in tolerant mode) such that if the support for an n-gram is in between this value and the threshold and the max_tolerance number has not been reached, then allow this n-gram to be added to the path
max_tolerance::Int64=3: maximum number of below-threshold n-grams allowed in a path (tolerant mode)
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String}=nothing: separator token
keep_sep::Bool=false: if true, keep separators in cues
target_col::Union{Symbol, String}="Words": the column name for target strings
start_end_token::String="#": start and end token in boundary cues
issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
sparse_ratio::Float64=0.05: the ratio to decide whether a matrix is sparse
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the max correlation value
check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
verbose::Bool=false: if true, more information is printed
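
One plausible reading of issparse and sparse_ratio is sketched below. The helper `maybe_sparse` is hypothetical, and the rule it applies in :auto mode (store the matrix as sparse when the share of non-zero entries is below sparse_ratio) is an assumption; the actual criterion used by JudiLing may differ.

```julia
using SparseArrays

# Return M as a sparse or dense matrix.
# In :auto mode the share of non-zero entries is compared with sparse_ratio.
function maybe_sparse(M::AbstractMatrix; issparse = :auto, sparse_ratio = 0.05)
    density = count(!iszero, M) / length(M)
    use_sparse = issparse === :auto ? density < sparse_ratio : issparse
    use_sparse ? sparse(M) : Matrix(M)
end

M = zeros(10, 10)
M[1, 1] = 1.0                      # 1% non-zero entries
println(typeof(maybe_sparse(M)))
println(typeof(maybe_sparse(M; issparse = false)))
println(typeof(maybe_sparse(ones(2, 2))))
```

Passing true or false explicitly overrides the automatic choice, which can be useful when memory use matters more than speed or vice versa.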

Utility functions

JudiLing.eval_can — Method

eval_can(candidates, S, F::Union{Matrix, SparseMatrixCSC, Chain}, i2f, max_can, if_pca, pca_eval_M)

Calculate for each candidate path the correlation between predicted semantic vector and the gold standard semantic vector, and select as target for production the path with the highest correlation.
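
The selection step can be sketched on toy data as follows. The matrix `F`, the vector `s_gold` and the helpers `predict` and `best_candidate` are hypothetical; the sketch assumes that F has one row per n-gram and that the predicted semantic vector of a path is the sum of the F rows of its n-grams. It also shows why ignoring NaN matters: a constant predicted vector has zero variance, so its correlation is NaN, and Julia's `argmax` treats NaN as larger than any number.

```julia
using Statistics

# Toy comprehension matrix F: rows are n-grams, columns are semantic dimensions.
F = [1.0 0.0 0.5;
     0.0 1.0 0.5;
     0.5 0.5 0.0;
     0.0 0.2 1.0]

# Hypothetical gold-standard semantic vector of the target word.
s_gold = [0.9, 1.1, 0.2]

# Candidate paths, each a vector of n-gram indices.
candidates = [[1, 2], [1, 3], [2, 3], [3, 4]]

# Predicted semantic vector of a path: sum of the F rows of its n-grams.
predict(F, path) = vec(sum(F[path, :], dims = 1))

function best_candidate(candidates, s_gold, F; ignore_nan = true)
    r = [cor(predict(F, p), s_gold) for p in candidates]
    # Optionally push NaN correlations to the bottom of the ranking.
    scores = ignore_nan ? replace(r, NaN => -Inf) : r
    candidates[argmax(scores)]
end

println(best_candidate(candidates, s_gold, F))
println(best_candidate(candidates, s_gold, F; ignore_nan = false))
```

The candidate [1, 2] predicts the constant vector [1.0, 1.0, 1.0], so its correlation is NaN; with ignore_nan = false it would be selected even though [2, 3] is the best-correlated real candidate.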

JudiLing.find_top_feature_indices — Method

find_top_feature_indices(rC, C_train_ind)

Find all indices for the n-grams of the top n closest neighbors of a given target.

JudiLing.make_ngrams_ind — Method

make_ngrams_ind(res, n)

Construct n-gram indices.

JudiLing.predict_shat — Method

predict_shat(F::Union{Matrix, SparseMatrixCSC},
             ci::Vector{Int})

Predict the semantic vector shat given a comprehension matrix F and a list of n-gram indices ci.

Obligatory arguments

F::Union{Matrix, SparseMatrixCSC}: Comprehension matrix F.
ci::Vector{Int}: Vector of indices of n-grams in the c vector. Essentially, this is a vector indicating which n-grams in a c vector are absent and which are present.
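
The relation between the index vector ci and a full c vector can be sketched as follows. The helper `shat_from_indices` is hypothetical and is not the package's implementation; it assumes that F has one row per n-gram and one column per semantic dimension.

```julia
# Toy comprehension matrix F: rows are n-grams, columns are semantic dimensions.
F = [1.0 0.0;
     0.0 1.0;
     0.5 0.5]

# Build the binary c vector implied by the indices and map it through F.
function shat_from_indices(F, ci)
    c = zeros(size(F, 1))
    c[ci] .= 1.0          # mark the n-grams that are present
    F' * c                # equivalent to summing the selected rows of F
end

println(shat_from_indices(F, [1, 3]))
```

Working with the indices directly avoids building the mostly empty c vector, since only the rows of F listed in ci contribute to the result.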
