Word Embedding

This repository provides a simple implementation of an embedding layer, modeled on PyTorch's Embedding. The package is not registered in Julia's General registry; to add it, run:

julia> ]

pkg> add https://github.com/Fainabi/WordEmbedding.jl

Note

Flux has provided its own Embedding layer since v0.12.4.

Dependencies

This package was developed on Julia 1.6 with Flux v0.12.1, and the tests also pass on Flux v0.12.6. If you run into problems, check your Flux version first.
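To check the installed Flux version, or to install a specific one, the standard Pkg commands can be used (nothing here is specific to this repository):

julia> ]

pkg> status Flux

pkg> add Flux@0.12.6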

Test

One test is provided to check embedding training. To run it, clone the repository and run:

julia> import Pkg

julia> Pkg.activate(".")

julia> include("test/sanity_test.jl")
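The sanity test exercises training of the embedding weights. For orientation only, a single training step with Flux v0.12 might look roughly like the sketch below; the Embedding constructor is described in the next section, the batch, targets and mse loss are invented for illustration, and Flux.params(ve) assumes the layer's weights are registered with Flux:

using Flux, WordEmbedding

ve = Embedding(100, 32)

ids = rand(1:100, 16)                    # invented batch of token ids
targets = randn(Float32, 32, 16)         # invented regression targets

ps = Flux.params(ve)                     # trainable weights (assumes Flux registration)
opt = Descent(0.1)
gs = gradient(ps) do
    Flux.Losses.mse(ve(ids), targets)    # any differentiable loss over the lookups
end
Flux.Optimise.update!(opt, ps, gs)       # one gradient-descent step on the embedding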

Usage

WordEmbedding.jl provides a basic embedding layer. To construct one, use the Embedding constructor:

julia> n_vocab, d_embed = 100, 32;

julia> ve = Embedding(n_vocab, d_embed)
Embedding(100, 32)

We can query the embedding either by indexing it with a vocabulary id or by calling it directly:

julia> ve[1]
32-element Array{Float32,1}:
 -0.048652478
  0.7851229
  0.43120283
  ⋮

julia> ve(1)
32-element Array{Float32,1}:
 -0.048652478
  0.7851229
  0.43120283
  ⋮

Batched queries are also supported:

julia> ve(rand(1:100, 5))
32×5 Array{Float32,2}:
  0.463585    2.54845     0.0250509  -1.62112     0.586433
  1.24004    -0.0695169   0.572597    0.645755   -0.473197
  1.89235    -1.36418    -0.995274   -1.48188    -1.42972
  ⋮

Usually the embedding layer is applied together with an RNN. To pass variable-length queries through the RNN, collect the query ids in a zero-padded, timestep-major layout together with the length of each query, then call the embedding with the model:

julia> using Flux

julia> lstm = LSTM(32, 2);

julia> queries = [
        [1, 4, 7, 3, 12],
        [3, 52, 20, 2, 0],
        [13, 0, 32, 59, 0],
        [42, 0, 0, 31, 0],
        [0, 0, 0, 9, 0],
        [0, 0, 0, 84, 0],
    ];

julia> query_len = [4, 2, 3, 6, 1];

julia> ve(lstm, queries, query_len)
2×5 Array{Float32,2}:
  0.183306  -0.139541   0.0306142   0.279944  -0.00556634
 -0.604177  -0.0863534  0.0932789  -0.189739   0.167395

julia> ve(queries, query_len) do seq
          lstm(seq)
       end
2×5 Array{Float32,2}:
  0.183306  -0.139541   0.0306142   0.279944  -0.00556634
 -0.604177  -0.0863534  0.0932789  -0.189739   0.167395

Both forms return the LSTM state at each sequence's last valid step, as given by query_len.

One can verify its correctness:

julia> ve(queries, query_len, identity) == ve([queries[query_len[i]][i] for i in 1:5])
true
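For completeness, the zero-padded, timestep-major layout of queries above can be built from raw variable-length id sequences with a small helper; pad_queries below is only an illustrative sketch, not an API of this package:

# Sketch: padded[t][i] is the id of sequence i at step t, or 0 once that sequence has ended.
function pad_queries(seqs)
    lens = length.(seqs)
    padded = [[t <= lens[i] ? seqs[i][t] : 0 for i in eachindex(seqs)] for t in 1:maximum(lens)]
    return padded, lens
end

raw = [[1, 3, 13, 42], [4, 52], [7, 20, 32], [3, 2, 59, 31, 9, 84], [12]]
queries, query_len = pad_queries(raw)    # reproduces the queries and query_len used above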

Apply in models

Several tiny example models are provided, for instance an LSTM-based sequence classifier in models/seq_classification:

julia> ]

pkg> activate .

julia> include("models/seq_classification/lstm.jl")

lstm.jl uses the MovieLens dataset (the ml-100k.zip archive). Extract u.data and u.user into the test/data directory, then run the script. The model compares predictions with and without embedding training.
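As a rough outline only (not the actual lstm.jl), such a sequence classifier can be assembled from the embedding, an LSTM, and a dense output layer; all sizes below are placeholders:

using Flux, WordEmbedding

ve = Embedding(100, 32)      # placeholder vocabulary and embedding sizes
rnn = LSTM(32, 16)
head = Dense(16, 2)          # two output classes

# Classify one zero-padded batch: the embedding call returns the LSTM state at each
# sequence's last valid step, which the dense layer maps to class probabilities.
function classify(queries, query_len)
    Flux.reset!(rnn)                       # clear recurrent state between batches
    states = ve(rnn, queries, query_len)   # 16 × batch matrix
    softmax(head(states))
end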
