Learn, Share, Build

September 27, 2017, at 2:16 PM

I am building a word2vec model as follows.

from gensim.models import word2vec, Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')
for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    print(bigrams_)
    print(trigrams_)

# Set values for various parameters
num_features = 10    # Word vector dimensionality                      
min_word_count = 1   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 5          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

model = word2vec.Word2Vec(trigrams_, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, the model's vocabulary comes out as single characters, as shown below.

['h', 'u', 'm', 'a', 'n', ' ', 'c', 'o', 'p', 't']

The bigrams and trigrams are generated correctly, so I am confused about where my code goes wrong. What is the problem?

Answer 1

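This solved my issue. The word2vec model must be given a list of lists (one token list per sentence), not the single token list left in trigrams_ after the loop. I build that list as follows.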

trigram_sentences_project = []

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    #bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    #trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
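    trigram_sentences_project.append(trigrams_)

# Train on the list of token lists, reusing the parameters from the question
model = word2vec.Word2Vec(trigram_sentences_project, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)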
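
Why the vocabulary collapsed to single characters can be illustrated without gensim. Word2Vec expects an iterable of sentences, each of which is itself an iterable of tokens. The sketch below uses a hypothetical helper, collect_vocab, that mimics only this two-level iteration. It is not gensim's implementation, and the token list is made up for illustration.

```python
def collect_vocab(sentences):
    """Hypothetical stand-in for a vocabulary scan: every item of
    `sentences` is one sentence, every item of a sentence is one token."""
    vocab = []
    for sentence in sentences:
        for token in sentence:
            if token not in vocab:
                vocab.append(token)
    return vocab

# A single token list, like the one left in trigrams_ after the loop
# in the question (illustrative values)
single_sentence = ["human computer interaction", "helps", "people"]

# Passed directly, each string is treated as a "sentence",
# so its characters become the tokens
chars = collect_vocab(single_sentence)

# Wrapped in an outer list, each string is one token of one sentence
tokens = collect_vocab([single_sentence])

print(chars[:10])  # ['h', 'u', 'm', 'a', 'n', ' ', 'c', 'o', 'p', 't']
print(tokens)      # ['human computer interaction', 'helps', 'people']
```

The first result reproduces the single-character vocabulary from the question; the second shows the effect of passing a list of lists, which is what trigram_sentences_project provides.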