Make Semantic Matrix

Make binary semantic vectors

JudiLing.PS_Matrix_Struct - Type

A structure that stores the discrete semantic vectors: pS is the discrete semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.
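
As a rough illustration of how these three fields relate, the sketch below builds a binary matrix and the two dictionaries by hand in plain Julia. The toy `utterances` vector and the dense 0/1 `Int` matrix are assumptions made for illustration; they are not taken from the JudiLing source.

```julia
# Hypothetical stand-in for a features column: features joined by "_".
utterances = ["greet_polite", "greet_casual", "request_polite"]

# Unique features, in order of first appearance.
features = unique(String.(reduce(vcat, split.(utterances, "_"))))

f2i = Dict(f => i for (i, f) in enumerate(features))  # feature -> column index
i2f = Dict(i => f for (i, f) in enumerate(features))  # column index -> feature

# One row per utterance, one column per feature; 1 marks a present feature.
pS = zeros(Int, length(utterances), length(features))
for (row, u) in enumerate(utterances), f in split(u, "_")
    pS[row, f2i[String(f)]] = 1
end
```

Here `f2i` and `i2f` are inverses of each other, so `i2f[f2i["casual"]]` gives back `"casual"`.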

JudiLing.make_pS_matrix - Method
make_pS_matrix(data)

Create a discrete semantic matrix given a dataframe.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_train = JudiLing.make_pS_matrix(
    utterance,
    features_col=:CommunicativeIntention,
    sep_token="_")
JudiLing.make_pS_matrix - Method
make_pS_matrix(data_val, pS_obj)

Construct the discrete semantic matrix for the validation dataset in the dataframe, given the discrete semantic matrix object of the training dataset.

Obligatory Arguments

  • data_val::DataFrame: the dataset
  • pS_obj::PS_Matrix_Struct: training PS object

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_val = JudiLing.make_pS_matrix(
    data_val,
    s_obj_train,
    features_col=:CommunicativeIntention,
    sep_token="_")
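
The idea of building a validation matrix against an existing training object can be sketched as follows. This is only an illustration under stated assumptions: `train_f2i` is a hypothetical stand-in for the `f2i` field of the training object, and every validation feature is assumed to occur in training. How JudiLing treats unseen features is not covered here.

```julia
# Hypothetical feature index, standing in for the f2i of a training object.
train_f2i = Dict("greet" => 1, "polite" => 2, "casual" => 3)

val_utterances = ["greet_casual", "polite"]

# Validation rows use the training columns, so both matrices line up.
pS_val = zeros(Int, length(val_utterances), length(train_f2i))
for (row, u) in enumerate(val_utterances), f in split(u, "_")
    pS_val[row, train_f2i[String(f)]] = 1
end
```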
JudiLing.make_combined_pS_matrix - Method
make_combined_pS_matrix(
    data_train,
    data_val;
    features_col = :CommunicativeIntention,
    sep_token = "_",
)

Create discrete semantic matrices for a train and validation dataframe.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • features_col::Symbol=:CommunicativeIntention: the column name for target
  • sep_token::String="_": separator

Examples

s_obj_train, s_obj_val = JudiLing.make_combined_pS_matrix(
    data_train,
    data_val,
    features_col=:CommunicativeIntention,
    sep_token="_")
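
A minimal sketch of what "combined" can mean here, assuming the feature inventory is the union of the features in both datasets (the toy data and the helper `encode` are hypothetical, not JudiLing API):

```julia
train = ["greet_polite", "request_polite"]   # hypothetical training features
val = ["greet_casual"]                       # hypothetical validation features

# Shared feature inventory collected over both datasets.
all_features = unique(String.(reduce(vcat, split.(vcat(train, val), "_"))))

# Encode a set of utterances against the shared inventory (rows x features).
encode(utts) = [f in split(u, "_") ? 1 : 0 for u in utts, f in all_features]

pS_train = encode(train)
pS_val = encode(val)
```

Because both matrices are encoded against the same inventory, they have the same number of columns and a feature such as "casual", which occurs only in the validation data, still has a column in `pS_train`.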

Simulate semantic vectors

JudiLing.L_Matrix_Struct - Type

A structure that stores Lexome semantic vectors: L is the Lexome semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.

JudiLing.make_S_matrix - Method
make_S_matrix(data::DataFrame, base::Vector, inflections::Vector)

Create a simulated semantic matrix for the training dataset, given a vector specifying the context lexemes and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd
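
One plausible reading of how these arguments interact is sketched below in plain Julia. It is an assumption for illustration, not taken from the JudiLing source: in deep mode each feature first draws its own mean (scaled by the `*_mean` sd), its vector is then drawn around that mean, and a word form's vector is the sum of its feature vectors plus optional Gaussian noise. The lexeme names and the helper `feature_vector` are hypothetical.

```julia
using Random

rng = MersenneTwister(314)   # fixed seed, in the spirit of seed=314
ncol = 6                     # small stand-in for ncol=200

base_features = ["aller"]                 # hypothetical base lexeme
infl_features = ["present", "singular"]   # hypothetical inflectional lexemes

# Assumed reading of isdeep=true: draw a per-feature mean, then the vector.
function feature_vector(rng, sd_mean, sd, ncol)
    m = sd_mean * randn(rng)
    return m .+ sd .* randn(rng, ncol)
end

vectors = Dict{String,Vector{Float64}}()
for f in base_features
    vectors[f] = feature_vector(rng, 1, 4, ncol)   # sd_base_mean=1, sd_base=4
end
for f in infl_features
    vectors[f] = feature_vector(rng, 1, 4, ncol)   # sd_inflection_mean=1, sd_inflection=4
end

# Word-form vector: sum of its lexeme vectors, then optional noise.
word = vectors["aller"] .+ vectors["present"] .+ vectors["singular"]
noisy_word = word .+ 1 .* randn(rng, ncol)         # add_noise=true, sd_noise=1
```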

Examples

# basic usage
S_train = JudiLing.make_S_matrix(
    french,
    ["Lexeme"],
    ["Tense","Aspect","Person","Number","Gender","Class","Mood"],
    ncol=200)

# deep mode
S_train = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_inflection_mean=1,
    isdeep=true,
    ...)

# non-deep mode
S_train = JudiLing.make_S_matrix(
    ...
    isdeep=false,
    ...)

# add additional Gaussian noise
S_train = JudiLing.make_S_matrix(
    ...
    add_noise=true,
    sd_noise=1,
    ...)

# further control of means and standard deviations
S_train = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_inflection_mean=1,
    sd_base=4,
    sd_inflection=4,
    sd_noise=1,
    ...)
JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create simulated semantic matrices for the training and validation datasets, given a vector specifying the context lexemes and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_S_matrix(
    french,
    french_val,
    ["Lexeme"],
    ["Tense","Aspect","Person","Number","Gender","Class","Mood"],
    ncol=200)

# deep mode
S_train, S_val = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_inflection_mean=1,
    isdeep=true,
    ...)

# non-deep mode
S_train, S_val = JudiLing.make_S_matrix(
    ...
    isdeep=false,
    ...)

# add additional Gaussian noise
S_train, S_val = JudiLing.make_S_matrix(
    ...
    add_noise=true,
    sd_noise=1,
    ...)

# further control of means and standard deviations
S_train, S_val = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_inflection_mean=1,
    sd_base=4,
    sd_inflection=4,
    sd_noise=1,
    ...)
JudiLing.make_S_matrix - Method
make_S_matrix(data::DataFrame, base::Vector)

Create a simulated semantic matrix for the training dataset with only base features, given a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train = JudiLing.make_S_matrix(
    french,
    ["Lexeme"],
    ncol=200)

# deep mode
S_train = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    isdeep=true,
    ...)

# non-deep mode
S_train = JudiLing.make_S_matrix(
    ...
    isdeep=false,
    ...)

# add additional Gaussian noise
S_train = JudiLing.make_S_matrix(
    ...
    add_noise=true,
    sd_noise=1,
    ...)

# further control of means and standard deviations
S_train = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_base=4,
    sd_noise=1,
    ...)
JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create simulated semantic matrices for the training and validation datasets with only base features, given a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_S_matrix(
    french,
    french_val,
    ["Lexeme"],
    ncol=200)

# deep mode
S_train, S_val = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    isdeep=true,
    ...)

# non-deep mode
S_train, S_val = JudiLing.make_S_matrix(
    ...
    isdeep=false,
    ...)

# add additional Gaussian noise
S_train, S_val = JudiLing.make_S_matrix(
    ...
    add_noise=true,
    sd_noise=1,
    ...)

# further control of means and standard deviations
S_train, S_val = JudiLing.make_S_matrix(
    ...
    sd_base_mean=1,
    sd_base=4,
    sd_noise=1,
    ...)
JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, pyndl_weights::Pyndl_Weight_Struct, n_features_columns::Vector)

Create semantic matrix for pyndl mode.

JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S1 = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person","Number","Tense","Voice","Mood"],
    L1,
    add_noise=true,
    sd_noise=1,
    normalized=false
    )
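
When a lexome matrix already exists, building the semantic matrix amounts to looking up and summing rows instead of drawing new random vectors. The sketch below illustrates this in plain Julia; the layout (one row of `L` per feature, indexed through `f2i`) and the toy feature names are assumptions for illustration only.

```julia
using Random

rng = MersenneTwister(314)

# Hypothetical lexome matrix: one row per feature, located through f2i.
f2i = Dict("amare" => 1, "first" => 2, "singular" => 3, "present" => 4)
L = randn(rng, length(f2i), 5)

# Features of two hypothetical word forms.
words = [["amare", "first", "singular"], ["amare", "present"]]

# Each row of S is the sum of the lexome vectors of that word's features.
S = zeros(length(words), size(L, 2))
for (row, feats) in enumerate(words), f in feats
    S[row, :] .+= L[f2i[f], :]
end
```

Noise (add_noise, sd_noise) would then be added on top of `S`; that step is left out here.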
JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::Union{DataFrame, Nothing}: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S1, S2 = JudiLing.make_S_matrix(
    latin,
    latin_val,
    ["Lexeme"],
    L1,
    add_noise=true,
    sd_noise=1,
    normalized=false
    )
JudiLing.make_S_matrix - Method
make_S_matrix(data::DataFrame, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S1 = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    L1,
    add_noise=true,
    sd_noise=1,
    normalized=false
    )
JudiLing.make_S_matrix - Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix where lexome matrix is available.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the lexome matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S1, S2 = JudiLing.make_S_matrix(
    latin,
    latin_val,
    ["Lexeme"],
    ["Person","Number","Tense","Voice","Mood"],
    L1,
    add_noise=true,
    sd_noise=1,
    normalized=false
    )
JudiLing.make_L_matrix - Method
make_L_matrix(data::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors where there are only base features.

Obligatory Arguments

  • data::DataFrame: the dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_base::Int64=4: the sd of base features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
L = JudiLing.make_L_matrix(
    latin,
    ["Lexeme"],
    ncol=200)
JudiLing.make_combined_S_matrix - Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    ["Person","Number","Tense","Voice","Mood"],
    L)
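
The point of combining features from both datasets can be sketched as follows: a lexeme that occurs in training and validation is represented by the same vector in both matrices, and a lexeme that occurs only in validation still has a vector. This is a plain-Julia illustration with hypothetical lexemes, assuming one row of `L` per feature; it does not reproduce JudiLing internals.

```julia
using Random

rng = MersenneTwister(314)

train_lexemes = ["amare", "vocare"]     # hypothetical training lexemes
val_lexemes = ["vocare", "clamare"]     # hypothetical validation lexemes

# One shared feature inventory and one shared set of vectors.
features = unique(vcat(train_lexemes, val_lexemes))
f2i = Dict(f => i for (i, f) in enumerate(features))
L = randn(rng, length(features), 4)

# Rows are looked up in the shared matrix for each dataset.
S_train = L[[f2i[f] for f in train_lexemes], :]
S_val = L[[f2i[f] for f in val_lexemes], :]
```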
JudiLing.make_combined_S_matrix - Method
make_combined_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)

Create simulated semantic matrix for the training datasets and validation datasets with existing Lexome matrix, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::Union{DataFrame, Nothing}: the validation dataset
  • base::Vector: context lexemes
  • L::L_Matrix_Struct: the Lexome Matrix

Optional Arguments

  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    L)
JudiLing.make_combined_S_matrix - Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    ["Person","Number","Tense","Voice","Mood"],
    ncol=n_features)
JudiLing.make_combined_S_matrix - Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create simulated semantic matrix for the training datasets and validation datasets, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized
  • add_noise::Bool=true: if true, add additional Gaussian noise
  • sd_noise::Int64=1: the sd of the Gaussian noise
  • normalized::Bool=false: if true, most values fall between -1 and 1; some may slightly exceed this range, depending on the sd

Examples

# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    ncol=n_features)
JudiLing.make_combined_L_matrix - Method
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes
  • inflections::Vector: grammatic lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
L = JudiLing.make_combined_L_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    ["Person","Number","Tense","Voice","Mood"],
    ncol=n_features)
JudiLing.make_combined_L_matrix - Method
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)

Create Lexome Matrix with simulated semantic vectors, where features are combined from both training datasets and validation datasets.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • base::Vector: context lexemes

Optional Arguments

  • ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
  • sd_base_mean::Int64=1: the sd mean of base features
  • sd_inflection_mean::Int64=1: the sd mean of inflectional features
  • sd_base::Int64=4: the sd of base features
  • sd_inflection::Int64=4: the sd of inflectional features
  • seed::Int64=314: the random seed
  • isdeep::Bool=true: if true, mean of each feature is also randomized

Examples

# basic usage
L = JudiLing.make_combined_L_matrix(
    latin_train,
    latin_val,
    ["Lexeme"],
    ncol=n_features)
JudiLing.L_Matrix_Struct - Method
L_Matrix_Struct(L, sd_base, sd_base_mean, sd_inflection, sd_inflection_mean, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct with deep mode.

JudiLing.L_Matrix_Struct - Method
L_Matrix_Struct(L, sd_base, sd_inflection, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)

Construct L_Matrix_Struct without deep mode.


Load from word2vec, fasttext or similar

JudiLing.load_S_matrix_from_fasttext - Method
load_S_matrix_from_fasttext(data::DataFrame,
                                 language::Symbol;
                                 target_col=:Word,
                                 default_file::Int=1)

Load semantic matrix from fasttext, loaded using the Embeddings.jl package. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available.

The last parameter, default_file, specifies which vectors are loaded. To learn about all available vectors, use the following commands:

using Embeddings
language_files(FastText_Text{:nl})

replacing the language code (here :nl) with the language you are interested in. In general, for all languages other than English, these files are available:

  • default_file=1 loads from https://fasttext.cc/docs/en/crawl-vectors.html. Paper: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
  • default_file=2 loads from https://fasttext.cc/docs/en/pretrained-vectors.html. Paper: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/

Obligatory Arguments

  • data::DataFrame: the dataset
  • language::Symbol: the language of the words in the dataset, officially given as an ISO 639-2 code (see https://github.com/JuliaText/Embeddings.jl/issues/34#issuecomment-782604523); in practice the codes appear closer to ISO 639-1, with ISO 639-2 used only where no ISO 639-1 code is available (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
  • default_file::Int=1: source of vectors, for more information see above and here: https://github.com/JuliaText/Embeddings.jl#loading-different-embeddings

Examples

# basic usage
latin_small, S = JudiLing.load_S_matrix_from_fasttext(latin, :la, target_col=:Word)
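
The two-way subsetting described above (keep only vectors for words in the data, and keep only words that have a vector) can be sketched in plain Julia. The `embeddings` dictionary and the word list are hypothetical stand-ins for the loaded fasttext table and the target_col of the data; the sketch does not call Embeddings.jl.

```julia
# Hypothetical embedding table: word => vector.
embeddings = Dict("amo" => [0.1, 0.2], "voco" => [0.3, 0.4], "et" => [0.5, 0.6])

# Hypothetical contents of target_col; "clamo" has no vector.
words = ["amo", "clamo", "voco"]

# Keep only the words for which a vector is available.
keep = [haskey(embeddings, w) for w in words]
words_small = words[keep]

# Stack the remaining vectors as rows, one row per retained word.
S = permutedims(reduce(hcat, [embeddings[w] for w in words_small]))
```

The number of rows of `S` matches the length of `words_small`, which is why the function returns the subsetted data together with the matrix.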
JudiLing.load_S_matrix_from_fasttext - Method
load_S_matrix_from_fasttext(data_train::DataFrame,
                                 data_val::DataFrame,
                                 language::Symbol;
                                 target_col=:Word,
                                 default_file::Int=1)

Load semantic matrix from fasttext, loaded using the Embeddings.jl package. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

The last parameter, default_file, specifies which vectors are loaded. To learn about all available vectors, use the following commands:

using Embeddings
language_files(FastText_Text{:nl})

replacing the language code (here :nl) with the language you are interested in. In general, for all languages other than English, these files are available:

  • default_file=1 loads from https://fasttext.cc/docs/en/crawl-vectors.html. Paper: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
  • default_file=2 loads from https://fasttext.cc/docs/en/pretrained-vectors.html. Paper: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • language::Symbol: the language of the words in the dataset, officially given as an ISO 639-2 code (see https://github.com/JuliaText/Embeddings.jl/issues/34#issuecomment-782604523); in practice the codes appear closer to ISO 639-1, with ISO 639-2 used only where no ISO 639-1 code is available (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
  • default_file::Int=1: source of vectors, for more information see above and here: https://github.com/JuliaText/Embeddings.jl#loading-different-embeddings

Examples

# basic usage
latin_small_train, latin_small_val, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(latin_train,
                                                      latin_val,
                                                      :la,
                                                      target_col=:Word)
JudiLing.load_S_matrix_from_word2vec_file - Method
load_S_matrix_from_word2vec_file(data::DataFrame,
                            filepath::String;
                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
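
For orientation, the sketch below parses the common uncompressed word2vec text layout: a header line with the number of words and the vector dimension, followed by one line per word with the word and its values separated by spaces. That the file passed as filepath follows exactly this layout is an assumption, and `read_word2vec_text` is a hypothetical helper, not part of JudiLing.

```julia
# Hypothetical helper: read word2vec vectors in text format from an IO stream.
function read_word2vec_text(io::IO)
    header = split(readline(io))          # "<n_words> <dim>"
    dim = parse(Int, header[2])
    vectors = Dict{String,Vector{Float64}}()
    for line in eachline(io)
        fields = split(line)
        isempty(fields) && continue       # skip blank lines
        vectors[String(fields[1])] = parse.(Float64, fields[2:end])
    end
    return vectors, dim
end

# An in-memory buffer stands in for a file on disk.
io = IOBuffer("2 3\namo 0.1 0.2 0.3\nvoco 0.4 0.5 0.6\n")
vectors, dim = read_word2vec_text(io)
```

To read from a path instead of a buffer, the same helper can be passed to `open`, as in `open(read_word2vec_text, filepath)`.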
JudiLing.load_S_matrix_from_word2vec_file - Method
load_S_matrix_from_word2vec_file(data_train::DataFrame,
                            data_val::DataFrame,
                            filepath::String;
                            target_col=:Word)

Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
JudiLing.load_S_matrix_from_fasttext_file - Method
load_S_matrix_from_fasttext_file(data::DataFrame,
                            filepath::String;
                            target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted data and semantic matrix.

Obligatory Arguments

  • data::DataFrame: the dataset
  • filepath::String: path to file with fasttext vectors in .txt or .vec (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data
JudiLing.load_S_matrix_from_fasttext_file - Method
load_S_matrix_from_fasttext_file(data_train::DataFrame,
                            data_val::DataFrame,
                            filepath::String;
                            target_col=:Word)

Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset
  • filepath::String: path to file with fasttext vectors in .txt (not compressed in any way)

Optional Arguments

  • target_col=:Word: column with orthographic representation of words in data

Utility functions

JudiLing.merge_f2i - Method
merge_f2i(base_f2i, infl_f2i, n_base_f, n_infl_f)

Merge base f2i dictionary and inflectional f2i dictionary.
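
A minimal sketch of such a merge, assuming the inflectional indices are shifted by the number of base features so that the merged dictionary covers one contiguous index range. This offset scheme is an assumption for illustration and has not been checked against the JudiLing source; the dictionaries are toy data.

```julia
base_f2i = Dict("amare" => 1, "vocare" => 2)      # hypothetical base features
infl_f2i = Dict("first" => 1, "singular" => 2)    # hypothetical inflectional features
n_base_f = length(base_f2i)

# Base features keep their indices; inflectional ones follow after them.
merged = copy(base_f2i)
for (f, i) in infl_f2i
    merged[f] = n_base_f + i
end
```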

JudiLing.make_St - Method
make_St(L, n, data, base, inflections)

Make S transpose matrix with inflections.
