2016-04-06 23 views

Converting a Wikipedia dump with python -m gensim.scripts.make_wiki

I want to use gensim to convert a Wikipedia dump to plain text with the python -m gensim.scripts.make_wiki script. I run it like this:

python -m gensim.scripts.make_wiki ./enwiki-latest-pages-articles.xml.bz2 ./results 

After running for a while, it fails with this error:

2016-04-06 20:43:46,471 : INFO : storing corpus in Matrix Market format to ./results/_bow.mm 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main 
    "__main__", fname, loader, pkg_name) 
    File "/usr/lib/python2.7/runpy.py", line 72, in _run_code 
    exec code in run_globals 
    File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/scripts/make_wiki.py", line 88, in <module> 
    MmCorpus.serialize(outp + '_bow.mm', wiki, progress_cnt=10000) # another ~9h 
    File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/indexedcorpus.py", line 89, in serialize 
    offsets = serializer.save_corpus(fname, corpus, id2word, progress_cnt=progress_cnt, metadata=metadata) 
    File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/mmcorpus.py", line 49, in save_corpus 
    return matutils.MmWriter.write_corpus(fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata) 
    File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 486, in write_corpus 
    mw = MmWriter(fname) 
    File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 436, in __init__ 
    self.fout = utils.smart_open(self.fname, 'wb+') # open for both reading and writing 
    File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 111, in smart_open 
NotImplementedError: unknown file mode wb+ 

Does anyone know what is going on?
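For reference, the last frame of the traceback shows smart_open rejecting the mode string 'wb+', which the comment in the gensim source describes as "open for both reading and writing". The sketch below only illustrates what that mode means with Python's built-in open; it does not reproduce the gensim or smart_open code path, and the function name and file name are made up for the example.

```python
import os
import tempfile


def write_then_read_back(path, payload):
    """Write payload with mode 'wb+' and read it back through the same handle."""
    # 'w' creates or truncates the file, 'b' selects binary mode,
    # and '+' additionally allows reading from the same handle.
    with open(path, 'wb+') as handle:
        handle.write(payload)
        handle.seek(0)  # rewind before reading what was just written
        return handle.read()


if __name__ == '__main__':
    tmp_dir = tempfile.mkdtemp()
    demo_path = os.path.join(tmp_dir, 'demo_bow.mm')  # hypothetical file name
    print(write_then_read_back(demo_path, b'some bytes\n'))
```

The built-in open accepts this mode, so the NotImplementedError comes from the smart_open wrapper's own mode check rather than from the file system.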

Answer


I'm not sure what causes the error, but the following works for me:

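import logging

from gensim.corpora import WikiCorpus

logger = logging.getLogger(__name__)


def parse_wiki(wiki_bz_file):
    # An empty dictionary is passed because the vocabulary is not needed here.
    wiki = WikiCorpus(wiki_bz_file, lemmatize=False, dictionary={})
    i = 0
    with open('./wiki_text_dump.txt', 'w') as output:
        for text in wiki.get_texts():
            # The original line called u.listToStr(chunk); neither name is defined,
            # so the tokens of each article are joined with spaces instead.
            output.write(' '.join(text) + '\n')
            i += 1
            if i % 50000 == 0:
                logger.info("Saved %d articles", i)
    logger.info("Finished, saved %d articles", i)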
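
The writing loop can also be separated from gensim so that it can be tried on a few hand-made token lists before starting a long run. This is a minimal sketch for Python 3, not part of gensim: the names tokens_to_line and write_articles are hypothetical, and it assumes that tokens may arrive either as bytes or as str depending on the gensim and Python versions in use.

```python
import logging

logger = logging.getLogger(__name__)


def tokens_to_line(tokens):
    """Join one article's tokens into a single space-separated line."""
    # Assumption: tokens may be UTF-8 bytes or str, so bytes are decoded first.
    return ' '.join(
        t.decode('utf-8') if isinstance(t, bytes) else t for t in tokens
    )


def write_articles(texts, out_path, log_every=50000):
    """Write each token list in texts as one line and return the article count."""
    count = 0
    with open(out_path, 'w', encoding='utf-8') as output:
        for count, tokens in enumerate(texts, 1):
            output.write(tokens_to_line(tokens) + '\n')
            if count % log_every == 0:
                logger.info("Saved %d articles", count)
    logger.info("Finished, saved %d articles", count)
    return count
```

With a corpus object built as in the answer above, the call would look like write_articles(wiki.get_texts(), './wiki_text_dump.txt').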