How do you use regular expressions to get only certain parts of an input text?

This problem is fairly simple, but I'm quite lost here.

Input text:

'Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS are/VBP sole/JJ proprietorships/NNS ?/.'

Code:

import re

def get_words(pos_sent):
    # Your code goes here
    s = ""
    x = re.findall(r"\b(\w*?)/\w*?\b", pos_sent)
    for i in range(0, len(x)):
        s = s + " " + x[i]
    return s

def get_noun_phrase(pos_sent): 
    # Penn Tagset 
    # Adjective can be JJ,JJR,JJS
    # Noun can be NN,NNS,NNP,NNPS 
    t = get_words(pos_sent) 
    regex = r'((\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN))' 
    return re.findall(regex, t)

The first function is only supposed to strip the part-of-speech tags, and the second is supposed to take that result and use it to find noun phrases.

The expected output is:

['all US businesses', 'sole proprietorships']

but instead it returns an empty list:

[]

If I change it to take the original tagged sentence instead, I get output that has all the right bits, but also a lot of other stuff in it that I don't want.

I'm still very new to regex, so I'm probably missing something silly.
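A small sketch of why the empty list comes back, reusing the two patterns from the question unchanged. The variable names (`tagged`, `words`, `stripped`) are only illustrative. Once the tags are stripped there is no "/" left in the text, so a pattern that looks for "/DT", "/JJ" or "/NN" has nothing to match. It also shows that "1/2" is split into "1" and "2" by the first pattern.

```python
import re

tagged = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
          'are/VBP sole/JJ proprietorships/NNS ?/.')

# Same pattern as get_words in the question: keep the word, drop the tag
words = re.findall(r"\b(\w*?)/\w*?\b", tagged)
stripped = " ".join(words)
print(stripped)
# Less than 1 2 of all US businesses are sole proprietorships

# Same pattern as get_noun_phrase in the question: it needs "/NN" etc.,
# but the stripped text no longer contains any tags
noun_phrase_pattern = r'((\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN))'
print(re.findall(noun_phrase_pattern, stripped))
# []
```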


2016-04-14 Jane Doe

I'm not sure regex is the best approach here, but I could be wrong. – Adib

You should format your code properly. There is help [here](https://stackoverflow.com/editing-help) if you need it. – jDo

It's fairly easy to match just the one sentence you provided, but it will break if you try to parse other text that doesn't fit the pattern. E.g. to match "all/DT US/NNP businesses/NNS" you could write "\S+/DT \S+/NNP \S+/NNS" and then do a "replace" or "translate" on the output. However, don't you also want to match "all/DT businesses/NNS"? Something tells me you need a trie or a graph, plus some recursion that walks through the string and decides whether the next word/tag is a valid node. If it is, make it the new start node and repeat/recurse. If not, return the path/phrase. – jDo
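A minimal sketch of the walking idea from the comment above, simplified to a loop over a small transition table instead of a trie with recursion. The table `ALLOWED`, the function name `walk_noun_phrases` and the "/END" sentinel are assumptions made for illustration, not something from the thread. Tags are reduced to their first two letters, so NNS, NNP and NNPS all count as NN.

```python
# Hypothetical transition table: which tag family may follow which
# inside a noun phrase. 'START' means no phrase is open yet.
ALLOWED = {
    'START': {'DT', 'JJ', 'NN'},
    'DT': {'JJ', 'NN'},
    'JJ': {'JJ', 'NN'},
    'NN': {'NN'},
}

def walk_noun_phrases(pos_sent):
    phrases, current, state = [], [], 'START'
    # The trailing sentinel token forces the last open phrase to be closed
    for token in pos_sent.split() + ['/END']:
        # Split on the last slash so that "1/2/CD" gives ("1/2", "CD")
        word, _, tag = token.rpartition('/')
        fam = tag[:2]
        if fam not in ALLOWED[state]:
            # Dead end: keep the phrase only if it ended on a noun
            if state == 'NN':
                phrases.append(' '.join(current))
            current, state = [], 'START'
        if fam in ALLOWED[state]:
            current.append(word)
            state = fam
    return phrases

sentence = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
            'are/VBP sole/JJ proprietorships/NNS ?/.')
print(walk_noun_phrases(sentence))
# ['all US businesses', 'sole proprietorships']
```

Because the table decides what may follow what, "all/DT businesses/NNS" is handled by the same code without writing a second pattern.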

You can use the following regular expressions so that "1/2" stays as 1/2 instead of becoming 1 2 (along with improved formatting of the output text):

import re 

string = 'Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS are/VBP sole/JJ proprietorships/NNS ?/.' 

def create_relationship(pos_sent):
    # Get all the words individually
    # (A-Za-z rather than A-z, which would also admit [ \ ] ^ _ `)
    words = re.findall(r'\b([0-9A-Za-z\/]*)\/\w*?\b', pos_sent)
    # ['Less', 'than', '1/2', 'of', 'all', 'US', 'businesses', 'are', 'sole', 'proprietorships']

    # Get all the tags individually
    penn_tag = re.findall(r'\b[0-9A-Za-z\/]*\/(\w*)?\b', pos_sent)
    # ['RBR', 'IN', 'CD', 'IN', 'DT', 'NNP', 'NNS', 'VBP', 'JJ', 'NNS']

    # Create a relationship between the words and penn tag:
    relationship = []
    for i in range(0, len(words)):
        relationship.append([words[i], penn_tag[i]])

    # [['Less', 'RBR'], ['than', 'IN'], ['1/2', 'CD'], ['of', 'IN'], ['all', 'DT'],
    # ['US', 'NNP'], ['businesses', 'NNS'], ['are', 'VBP'], ['sole', 'JJ'], ['proprietorships', 'NNS']]

    return relationship


def get_words(pos_sent):
    # Pass string into relationship engine
    array = create_relationship(pos_sent)

    # Start with empty string
    s = ''

    # Conduct loop to combine string
    for i in range(0, len(array)):
        # index 0 has the words
        s = s + array[i][0] + ' '

    # Return the sentence
    return s

def get_noun_phrase(pos_sent):
    # Penn Tagset
    # Adjective can be JJ,JJR,JJS
    # Noun can be NN,NNS,NNP,NNPS
    # Noun Phrase must be made of: DT+RB+JJ+NN+PR (http://www.clips.ua.ac.be/pages/mbsp-tags)

    # Pass string into relationship engine
    bucket = create_relationship(pos_sent)
    output = []

    # Find the last instance of NN where the next word is not "NN"
    # For example, NNP VBP qualifies. In the case of NN NNP VBP, then
    # the correct instance is NNP. To do this, we need to loop and use
    # a bucket to capture what we need. The bucket will shrink as we
    # pop the words we have already captured

    noun = True

    # Keep doing this until there are no more nouns
    while noun:

        # Reset on every pass, so that an input without nouns (or a stale
        # index left over from the previous pass) cannot cause an error
        last_noun = -1
        for i in range(0, len(bucket)):
            if re.match(r'(NN.*)', bucket[i][1]):
                # Set position of last noun
                last_noun = i

        noun_phrase = []

        # If we don't have a noun, stop the while loop
        if last_noun < 0:
            noun = False
        else:
            # go backwards from the point where you found the last noun
            for x in range(last_noun, -1, -1):
                # The penn tag must match any of these conditions
                if re.match(r'(NN.*|DT.*|JJ.*|RB.*|PR.*)', bucket[x][1]):
                    # if there is a match, then let's build the phrase
                    noun_phrase.append(bucket[x][0])
                    bucket.pop(x)
                else:
                    break

        # Make sure noun phrase isn't empty
        if noun_phrase:
            # Collect the noun phrase
            output.append(" ".join(reversed(noun_phrase)))

    # Fix the reverse issue
    return [i for i in reversed(output)]

print(get_noun_phrase(string))
# ['all US businesses', 'sole proprietorships']
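For comparison, a regex-only sketch that works directly on the tagged sentence, in the spirit of the pattern from the question (determiner, adjectives, nouns) rather than the DT+RB+JJ+NN+PR rule used above. The names `NOUN_PHRASE` and `noun_phrases` are hypothetical. It uses non-capturing groups and `finditer`, because `re.findall` returns a tuple of all groups when a pattern has several capturing groups, which is one way to end up with a lot of extra pieces in the result.

```python
import re

NOUN_PHRASE = re.compile(
    r'(?<!\S)'               # start at the beginning of a token
    r'(?:\S+/DT\s+)?'        # optional determiner
    r'(?:\S+/JJ[RS]?\s+)*'   # zero or more adjectives (JJ, JJR, JJS)
    r'\S+/NNP?S?'            # a noun (NN, NNS, NNP, NNPS)
    r'(?:\s+\S+/NNP?S?)*'    # any further nouns
    r'(?!\S)'                # the last tag must end its token
)

def noun_phrases(pos_sent):
    phrases = []
    for match in NOUN_PHRASE.finditer(pos_sent):
        # Drop the tag after the last slash of every token in the match
        words = [token.rsplit('/', 1)[0] for token in match.group(0).split()]
        phrases.append(' '.join(words))
    return phrases

sentence = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
            'are/VBP sole/JJ proprietorships/NNS ?/.')
print(noun_phrases(sentence))
# ['all US businesses', 'sole proprietorships']
```

This keeps the matching and the tag stripping as two separate steps, so the pattern can be changed without touching the clean-up.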


2016-04-14 09:31:32 Adib

I mean, this kind of does the job. It outputs: [('all/DT US/NN', 'all/DT', '', '', 'US/NN'), ('businesses/NN', '', '', '', 'businesses/NN'), ('sole/JJ proprietorships/NN', '', 'sole/JJ', '', 'proprietorships/NN')], which has all the parts. –

@BNSlug Do you have any additional use cases or examples, and do you know what their output should look like? – Adib

I mean, I can probably find some. Isn't "noun phrase" a common term? –
