2016-03-29 26 views
1

Python: searching an HTML file and capturing <a> tags with their href and text content

I need help with a Python 3 solution that searches through an HTML file and retrieves all the links on the page, then adds each captured value to a dictionary together with its href (URL).

This is what I have tried so far. I am getting the error shown below the code; any help is appreciated.

import urllib3 
import re 

http = urllib3.PoolManager() 
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL" 
a = http.request("GET",my_url) 
html = a.data 

links = re.finditer(' href="?([^\s^"]+)', html) 

for link in links: 
    print(link) 

...

TypeError: can't use a string pattern on a bytes-like object 

Thanks.

I also tried lxml:

...

links = lxml.html.parse("http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws").xpath("//a/@href") 
for link in links: 
    print(link) 

The result does not show all the links, and I am not sure why.
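One possible explanation, offered as an assumption because the thread never settles it: in that Google URL the search terms sit after the `#`. That part is a URL fragment, which clients do not send to the server, so the HTML handed to lxml would be the generic page rather than the search results (which are presumably filled in later by JavaScript). A small standard-library sketch showing how the URL splits:

```python
from urllib.parse import urlsplit

# URL copied from the lxml attempt above.
url = "http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws"
parts = urlsplit(url)

# Only the part before '#' is sent in the HTTP request.
requested = parts._replace(fragment="").geturl()

print(parts.query)     # gws_rd=ssl
print(parts.fragment)  # q=apple+stock&tbm=nws
print(requested)       # http://www.google.co.uk/?gws_rd=ssl
```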

UPDATE:

New code =>

def news_feed(self, stock): 
    http = urllib3.PoolManager() 
    my_url = "https://in.finance.yahoo.com/q/h?s="+stock 
    a = http.request("GET",my_url) 
    html = a.data.decode('utf-8') 
    xml = fromstring(html, HTMLParser()) 
    a_tags = xml.xpath("//table[@id='yfncsumtab']//a") 
    self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags) 
    pp(self.paired) 
+0

'html = a.data.decode('utf-8')' –

+1

You can parse this manually (using James's suggestion), but I really think you should use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) – Bahrom

+0

Great, that works. I should have known that! –
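For reference, the TypeError in the question appears because `a.data` is a bytes object while the regex pattern is a str. A minimal sketch that reproduces and fixes it without a network call; `raw` is a made-up stand-in for `a.data`, and the pattern is a slightly tidied version of the one in the question:

```python
import re

# Hypothetical stand-in for a.data: urllib3 returns the response body as bytes.
raw = b'<a href="https://example.com/one">One</a> <a href="/two">Two</a>'

pattern = r'href="?([^\s"]+)'

# A str pattern cannot be applied to a bytes object; this is the error from the question.
try:
    re.findall(pattern, raw)
    raised = False
except TypeError:
    raised = True

# Decoding first yields a str, which the str pattern can search.
html = raw.decode("utf-8")
links = re.findall(pattern, html)

print(raised, links)
```

The other way round also works (a bytes pattern such as `rb'...'` against the raw bytes), but decoding once is usually more convenient for the rest of the code.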

Answer

4

Decode the bytes as suggested and use an HTML parser. BeautifulSoup will make the job very easy and is far more reliable than a regex when parsing HTML:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET", my_url)
html = a.data.decode("utf-8")  # a.data is bytes; decode it to str

# Collect the href of every <a> tag that has one
print([a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)])

If you only want links that start with http, you can use a CSS selector:

soup = BeautifulSoup(html, "html.parser")

print([a["href"] for a in soup.select("a[href^=http]")]) 

Which will give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/'] 

To get the text and the href:

from pprint import pprint as pp

soup = BeautifulSoup(html, "html.parser")
a_tags = soup.select("a[href^=http]")
# Map each link's text to its href
paired = dict((a.text, a["href"]) for a in a_tags)

pp(paired)

Output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
u'Capital IQ': 'http://www.capitaliq.com', 
u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com', 
u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/', 
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/', 
u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 
u'Markets': 'https://in.finance.yahoo.com/investing/', 
u'Morningstar, Inc.': 'http://www.morningstar.com/', 
u'My Yahoo': 'http://in.my.yahoo.com', 
u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 
u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
u'Yahoo': 'https://in.yahoo.com/', 
u'Yahoo India Finance': 'https://in.finance.yahoo.com', 
u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 
u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'} 

a[href^=http] means: give me all the a tags that have an href whose value starts with http.
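The effect of that selector can be checked offline. A minimal sketch, assuming made-up sample markup rather than the real Yahoo page, and passing "html.parser" explicitly so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Made-up sample markup, not the real page.
sample = """
<a href="https://example.com/news/story.html">A story</a>
<a href="/q/h?s=AAPL">Relative link</a>
<a name="top">No href at all</a>
<a href="http://example.org/">Plain http</a>
"""

soup = BeautifulSoup(sample, "html.parser")

# ^= is the CSS "attribute value starts with" operator.
# The value is quoted here, which is the stricter CSS spelling.
matched = soup.select('a[href^="http"]')
paired = {tag.get_text(strip=True): tag["href"] for tag in matched}

print(paired)
```

The relative link and the tag without an href are skipped, while both the http and https links are kept, because "https" also starts with "http".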

Using lxml, with the table id to get only the story links, which are probably what you are most interested in:

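from pprint import pprint as pp
from lxml.etree import fromstring, HTMLParser

xml = fromstring(html, HTMLParser())  # html is the decoded page source from above

# Only the links inside the table with id "yfncsumtab"
a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

# Map each link's first text node to its href
paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
pp(paired)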

Which gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30', 
'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'} 
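Note that the 'Older Headlines' entry in this output is a relative URL, unlike the story links. If absolute URLs are wanted, one option (a sketch that is not part of the original answer) is to resolve each href against the page URL with the standard library's `urljoin`:

```python
from urllib.parse import urljoin

# Page URL from the question; used as the base for relative links.
base_url = "https://in.finance.yahoo.com/q/h?s=AAPL"

# Two entries copied from the output above.
paired = {
    "Older Headlines": "/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30",
    "With China weakening, Apple turns to India": "https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html",
}

# urljoin leaves absolute URLs untouched and resolves relative ones.
absolute = {text: urljoin(base_url, href) for text, href in paired.items()}

print(absolute["Older Headlines"])
```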

Since the id should be unique, you can also simply use //*[@id='yfncsumtab']//a as the XPath. We can do the same with select, which gives:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30', 
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'} 

This matches our lxml output.
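The select call itself is not shown in the text above. A minimal sketch of what it could look like, assuming the CSS id selector `#yfncsumtab a` and made-up markup that only imitates the structure of the page:

```python
from bs4 import BeautifulSoup

# Made-up markup; only the table id is taken from the answer.
sample = """
<a href="https://example.com/help">Help</a>
<table id="yfncsumtab">
  <tr><td><ul><li><a href="https://example.com/news/one.html"> Story one </a></li></ul></td></tr>
  <tr><td><ul><li><a href="/q/h?s=AAPL">Older Headlines</a></li></ul></td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# "#yfncsumtab a" selects every <a> that is a descendant of the element with that id.
a_tags = soup.select("#yfncsumtab a")
paired = {a.get_text(strip=True): a["href"] for a in a_tags}

print(paired)
```

The "Help" link outside the table is not selected, which mirrors what the XPath version does.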

To get the first six links from the table with an XPath, you can use the ul elements and extract the first six with ul[position() < 7]:

a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a") 

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags) 
from pprint import pprint as pp 
pp(paired) 

Which will give you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'} 

For small tables, you can also simply slice.
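A minimal sketch of the slicing approach, assuming made-up markup with the same table id. `xpath()` returns an ordinary Python list, so the usual slice syntax applies:

```python
from lxml.etree import fromstring, HTMLParser

# Made-up markup; only the table id is taken from the answer.
sample = """
<table id="yfncsumtab"><tr><td>
  <ul><li><a href="https://example.com/1.html">One</a></li></ul>
  <ul><li><a href="https://example.com/2.html">Two</a></li></ul>
  <ul><li><a href="https://example.com/3.html">Three</a></li></ul>
</td></tr></table>
"""

xml = fromstring(sample, HTMLParser())

# Take the first two links in document order by slicing the result list.
first_two = xml.xpath("//*[@id='yfncsumtab']//a")[:2]
paired = {a.xpath("./text()")[0].strip(): a.get("href") for a in first_two}

print(paired)
```

The two approaches are not identical: the slice counts links in document order, whereas ul[position() < 7] counts ul elements among their siblings, so the results can differ when a ul holds more than one link.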

+1

Welcome. I made the mistake of *not using BeautifulSoup* in one of my projects. – Bahrom

+0

@Bah, it makes life a lot easier; even if you knew nothing about HTML, a quick read of the docs would have you up and running in no time. –

+0

I tried to use BeautifulSoup but could not get it working on Python 3. Any suggestions? –
