This repository provides a simple implementation of an embedding layer, taking PyTorch's `Embedding` as a template. The package is not registered in Julia's General registry; to add it, run:

```julia
julia> ]
pkg> add https://github.com/Fainabi/WordEmbedding.jl
```

Note that Flux itself has provided an `Embedding` layer since v0.12.4. This package was developed on Julia 1.6 with Flux v0.12.1, and the tests pass on Flux v0.12.6. If you run into problems, please check your Flux version first.
A sanity test is provided to check embedding training. To run it, clone the repository and run:

```julia
julia> import Pkg
julia> Pkg.activate(".")
julia> include("test/sanity_test.jl")
```

WordEmbedding.jl provides a basic embedding layer. To construct one, use the `Embedding` constructor:
```julia
julia> n_vocab, d_embed = 100, 32;

julia> ve = Embedding(n_vocab, d_embed)
Embedding(100, 32)
```

We can query the embedding either by calling it with a vocabulary id or by indexing it:
```julia
julia> ve[1]
32-element Array{Float32,1}:
 -0.048652478
  0.7851229
  0.43120283
  ⋮

julia> ve(1)
32-element Array{Float32,1}:
 -0.048652478
  0.7851229
  0.43120283
  ⋮
```
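Under the hood, an embedding layer like this is essentially a trainable weight matrix whose columns are looked up by id. A minimal sketch of the idea, assuming Flux (the type `MyEmbedding` and its `weight` field are illustrative, not this package's actual internals):

```julia
using Flux

# A minimal embedding layer: a trainable d_embed × n_vocab weight matrix.
# (Sketch only; names are hypothetical, not this package's API.)
struct MyEmbedding
    weight::Matrix{Float32}
end

MyEmbedding(n_vocab::Integer, d_embed::Integer) =
    MyEmbedding(randn(Float32, d_embed, n_vocab))

# Querying id i returns column i of the weight matrix,
# whether the layer is called or indexed.
(e::MyEmbedding)(i::Integer) = e.weight[:, i]
Base.getindex(e::MyEmbedding, i::Integer) = e.weight[:, i]

# Make the weight visible to Flux as a trainable parameter.
Flux.@functor MyEmbedding
```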
Batched queries are also supported:
```julia
julia> ve(rand(1:100, 5))
32×5 Array{Float32,2}:
 0.463585   2.54845    0.0250509  -1.62112    0.586433
 1.24004   -0.0695169  0.572597    0.645755  -0.473197
 1.89235   -1.36418   -0.995274   -1.48188   -1.42972
 ⋮
```
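One plausible way to implement such batched lookup is plain column indexing with a vector of ids, continuing the hypothetical `MyEmbedding` sketch from above:

```julia
# Batched lookup: selecting several columns at once yields a
# d_embed × batch_size matrix, matching the 32×5 output above.
(e::MyEmbedding)(ids::AbstractVector{<:Integer}) = e.weight[:, ids]
```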
Usually, the embedding layer is applied in front of RNNs. To pass variable-length queries through an RNN model, collect the query ids and the length of each query, pad with zeros, and then call the embedding with the model:
```julia
julia> using Flux

julia> lstm = LSTM(32, 2);

julia> queries = [
           [1, 4, 7, 3, 12],
           [3, 52, 20, 2, 0],
           [13, 0, 32, 59, 0],
           [42, 0, 0, 31, 0],
           [0, 0, 0, 9, 0],
           [0, 0, 0, 84, 0],
       ];

julia> query_len = [4, 2, 3, 6, 1];

julia> ve(lstm, queries, query_len)
2×5 Array{Float32,2}:
  0.183306  -0.139541   0.0306142   0.279944  -0.00556634
 -0.604177  -0.0863534  0.0932789  -0.189739   0.167395
```
The same computation can be written with a `do` block:

```julia
julia> ve(queries, query_len) do seq
           lstm(seq)
       end
2×5 Array{Float32,2}:
  0.183306  -0.139541   0.0306142   0.279944  -0.00556634
 -0.604177  -0.0863534  0.0932789  -0.189739   0.167395
```

Both forms return the states produced by the LSTM, each taken at the specified length of its sequence.
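A plausible implementation of this behavior (a sketch under assumptions; the package's internals may differ) embeds each padded time step, runs it through the model, and keeps, for each sequence, the output at its own length:

```julia
using Flux

# Sketch: `queries` is a vector of time steps, each holding one id per
# sequence in the batch; `query_len[i]` is the true length of sequence i.
function run_padded(ve, model, queries, query_len)
    Flux.reset!(model)                       # clear the recurrent state
    # Pad id 0 is clamped to 1 only to keep the lookup valid; the outputs
    # at padded positions are discarded below anyway.
    outputs = [model(ve(max.(step, 1))) for step in queries]
    # For sequence i, keep the model's output at time step query_len[i].
    hcat((outputs[query_len[i]][:, i] for i in eachindex(query_len))...)
end
```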
One can verify its correctness:

```julia
julia> ve(queries, query_len, identity) == ve([queries[query_len[i]][i] for i in 1:5])
true
```

Several tiny models are also provided, for example an LSTM-based sequence classifier in `models/seq_classification`:
```julia
julia> ]
pkg> activate .

julia> include("models/seq_classification/lstm.jl")
```

The `lstm.jl` script uses the MovieLens dataset, specifically the `ml-100k.zip` data. Please extract `u.data` and `u.user` into the `test/data` directory, then run the script. This model compares predictions with and without embedding training.
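Because the lookup is differentiable, the embedding weights can be trained jointly with the downstream model. A minimal sketch of one training step, assuming Flux's implicit-parameter API (the target `y`, the loss, and the learning rate are illustrative, not the repository's actual training code):

```julia
using Flux

# Hypothetical joint training step for the embedding and the LSTM above.
# `y` is an illustrative 2×5 target matrix for the batch of five sequences.
y = randn(Float32, 2, 5)

ps  = Flux.params(ve, lstm)        # trainable parameters of both layers
opt = Descent(0.1)

function loss(queries, query_len, y)
    Flux.reset!(lstm)              # clear recurrent state before each pass
    Flux.Losses.mse(ve(lstm, queries, query_len), y)
end

gs = gradient(() -> loss(queries, query_len, y), ps)
Flux.Optimise.update!(opt, ps, gs)
```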