Scraping doesn't hurt
I am in general allergic to HTML, especially when it comes to parsing it. However, every now and then something comes up, and it's fun to keep the muscles stretched.
So, consider the TED Talks site. They have a really nice table with information about their talks, just in case you want to do something with them.
But how do you get that information? By scraping it. And what's an easy way to do it? By using Python and BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib

# Read the whole page.
data = urllib.urlopen('http://www.ted.com/talks/quick-list').read()

# Parse it
soup = BeautifulSoup(data)

# Find the table with the data
table = soup.findAll('table', attrs={"class": "downloads notranslate"})[0]

# Get the rows, skip the first one
rows = table.findAll('tr')[1:]

items = []
# For each row, get the data
# And store it somewhere
for row in rows:
    cells = row.findAll('td')
    item = {}
    item['date'] = cells[0].text
    item['event'] = cells[1].text
    item['title'] = cells[2].text
    item['duration'] = cells[3].text
    item['links'] = [a['href'] for a in cells[4].findAll('a')]
    items.append(item)
And that's it! Surprisingly pain-free!
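The script above is written for Python 2 and the old BeautifulSoup 3 package. For anyone on Python 3, here is a rough sketch of the same idea using the `bs4` package. It assumes the table still carries the class `downloads notranslate` and the same five columns; the `SAMPLE_HTML` snippet and the function names `fetch` and `parse_talks` are made up for illustration, not taken from the TED site or from any library.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Hypothetical sample that mimics the table layout the original script expects.
SAMPLE_HTML = """
<table class="downloads notranslate">
  <tr><th>Published</th><th>Event</th><th>Talk</th><th>Duration</th><th>Download</th></tr>
  <tr>
    <td>Jan 2011</td><td>TEDx</td><td>A sample talk</td><td>18:03</td>
    <td><a href="/low.mp4">Low</a> <a href="/high.mp4">High</a></td>
  </tr>
</table>
"""


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def parse_talks(html):
    """Return one dict per data row of the talks table, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # A class_ string with a space matches the exact class attribute value.
    table = soup.find("table", class_="downloads notranslate")
    if table is None:
        return []

    items = []
    # Skip the first row, which holds the column headers.
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        if len(cells) < 5:
            # Not a data row in the layout we assume; ignore it.
            continue
        items.append({
            "date": cells[0].get_text(strip=True),
            "event": cells[1].get_text(strip=True),
            "title": cells[2].get_text(strip=True),
            "duration": cells[3].get_text(strip=True),
            "links": [a["href"] for a in cells[4].find_all("a")],
        })
    return items


items = parse_talks(SAMPLE_HTML)
```

To run it against the real page, something like `parse_talks(fetch('http://www.ted.com/talks/quick-list'))` should do, provided the page still has that table. Returning an empty list when the table is missing is just one choice; raising an error would be equally reasonable.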
You can't translate "scrap" :)
I am a contrarian.
Wouldn't it be easier to parse the RSS?
http://www.ted.com/talks/rss
Yes, but I don't think all of them are in there.
Have you tried Scrapy? It has some nice features to crawl and scrape web pages ;)
I have heard of Scrapy, but have not tried it.