2016-04-09 13 views

Python web crawler: can I call functions inside for loops?

I'm trying to write a web crawler that collects the URL of every title, then follows each title link to find the chapter URLs, and then follows each chapter link to find the section URLs.

The problem is that in this tutorial, https://github.com/buckyroberts/Source-Code-from-Tutorials/blob/master/Python/27_workingsolution_python.py, the author manages to call the second function before defining it, which is really confusing.

I tried the same approach but, as expected, got an error saying the name "leveltwo" is not defined. My question is: how do I use the links obtained by the previous function as the parameter for the next function?

My code:

import requests 
from bs4 import BeautifulSoup, SoupStrainer 
import re 


######################################Titles############################### 
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print (titlelinks)


base_url = "http://law.justia.com/codes/alabama/2015/" 
levelone(base_url) 


########################################Chapters########################## 
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                # chapterlinks = "\n" + str(chapterlinks)

                levelthree(chapterlinks)
                # print (chapterlinks)

# leveltwo(titlelinks)  ### I tried calling the function right here, but titlelinks is not defined.

########################################Sections########################## 
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                href = "\n" + str(href)
                print(href)

Answer


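Define the functions first, and then call them. Python looks up a function name only when the call actually runs, so a function body may refer to a function defined further down the file, as long as every definition has executed before the first top-level call.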

import requests 
from bs4 import BeautifulSoup, SoupStrainer 
import re 

########################################Sections########################## 
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                href = "\n" + str(href)
                print(href)

########################################Chapters########################## 
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                # chapterlinks = "\n" + str(chapterlinks)

                levelthree(chapterlinks)
                # print (chapterlinks)

######################################Titles############################### 
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print (titlelinks)

########################################################################### 
base_url = "http://law.justia.com/codes/alabama/2015/" 
levelone(base_url) 
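
The ordering rule can be checked without any network access. Below is a minimal sketch using two hypothetical functions, first and second, which are not part of the crawler above. It assumes the code runs at module level, as in a plain script:

```python
# Hypothetical stand-ins for levelone/leveltwo; nothing here touches the network.

def first():
    # The name "second" is looked up only when this line runs,
    # not when first() is defined.
    return second()

try:
    first()                      # second() does not exist yet
except NameError as exc:
    early_error = str(exc)       # the same kind of error the question describes

def second():
    return "ok"

result = first()                 # works now: second() has been defined
print(early_error)
print(result)
```

The first call fails because second has not been defined when it runs; the identical call succeeds once the definition has executed. This is why a tutorial can mention a later function inside an earlier one: what matters is that the top-level call comes after all the definitions, which is often arranged by putting that call at the very bottom of the file.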