Parsing HTML page into CSV - python

I am trying to transfer all the data I parsed from a website into a CSV file, but I have run into a couple of problems:

1. Even though I have added the character encoding, it still prints out as HTML in Excel rather than plain text, e.g.:

```
<option redirectvalue="/partfinder/Asus/All In One/E Series/ET10B">ET10B</option>
```

2. It prints out in one column rather than a row for each item.

Here is my code so far:

```
import string, urllib2, urlparse, csv, sys, codecs, cStringIO
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval


class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

selects = []
redirects = []
model_info = []

#Opening csv writer
c = UnicodeWriter(open(r"asus_stock.csv", "wb"))
#Object reader
cr = UnicodeWriter(open(r"asus_stock.csv", "rb"))

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
selects.append(select)
for item in selects:
    print item.get_text()

options = select.findAll('option')

for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    #Fetching the main title for each specific model and printing it out
    print "FETCHING MAIN TITLE"
    maintitle = s.find(id='puffBreadCrumbs')
    model_info.append(maintitle)
    print maintitle.get_text()

    datas = s.find(id='accordion')
    a = datas.findAll('a')
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    for data in a:
        if(data.has_attr('onclick')):
            arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
            #model_info.append(arguments)
            print arguments
            #arguments[1] + " " + arguments[3] + " " + arguments[4]

    # Retrieves Part number and Price
    print "FETCHING DATA"
    for complete in content:
        if(complete.has_attr('class')):
            #model_info.append(complete['class'])
            print complete.get_text()

    print "FETCHING IMAGES"
    img = s.find('td')
    images = img.findAll('img')
    model_info.append(images)
    print images

c.writerows(selects)
```

How can I make it print out as:

> 1. Text rather than HTML
> 2. Rows rather than one column

**[EDIT]** This is how I would like the CSV file to be displayed, with an example of the values to be returned:

```
"Brand Name" "CategoryID" "ModelID" "Family" "Name" "Part Number" "Price" "Image src"
Asus | AC Adapter | ET1602 | E Series | Power Cord 3P L:80CM,UK(B) | 14G110008350 | 14.77 | image src
```

**[NEW EDIT]**

These are the outputs for the printed values:

```
print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
selects.append(select)
for item in selects:
    print item.get_text()
```

yields:

```
ET10B
ET1602
ET1602C
etc..
```

Fetching the main title:

```
print "FETCHING MAIN TITLE"
maintitle = s.find(id='puffBreadCrumbs')
model_info.append(maintitle)
print maintitle.get_text()
```

yields:

> Asus - All In One - E Series - ET10B

Fetching the category:

```
datas = s.find(id='accordion')
a = datas.findAll('a')
content = datas.findAll('span')

print "FETCHING CATEGORY"
for data in a:
    if(data.has_attr('onclick')):
        arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
        #model_info.append(arguments)
        print arguments
```

yields:

```
FETCHING CATEGORY
('Asus', 'AC Adapter', 'ET10B', '6941', 'E Series')
('Asus', '04G265003580')
('Asus', '14G110008340')
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
('Asus', 'Cable', 'ET10B', '6983', 'E Series')
('Asus', 'Camera', 'ET10B', '6985', 'E Series')
('Asus', 'Cooling', 'ET10B', '6999', 'E Series')
('Asus', 'Cover', 'ET10B', '6984', 'E Series')
etc..
```

Fetching the name:

```
print "FETCHING NAME"
name = s.find('b').get_text()
print name
```

yields:

> POWER ADAPTER 65W19V 3PIN

Fetching the part number and price:

```
print "FETCHING PART NUMBER AND PRICE (inc. VAT)"
for complete in content:
    if(complete.has_attr('class')):
        #model_info.append(complete['class'])
        print complete.get_text()
```

yields:

```
FETCHING PART NUMBER AND PRICE (inc. VAT)
Part number: 04G265003580
Remote stock 38.09:- EUR
```

Fetching the images:

```
print "FETCHING IMAGES"
img = s.find('td')
images = img.findAll('img')
model_info.append(images)
print images
```

yields:

```
FETCHING IMAGES
[<img alt="" src="/images/Articles/thumbs/04G265003580_thumb.jpg"/>]
```
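For reference, a minimal sketch of one way to get both things at once: plain text instead of markup, and one row per part. It reuses the `UnicodeWriter`, `s`, `a`, `content` and `images` names from the code above, and the header follows the layout in the **[EDIT]**. How the name, part number and price line up with each category tuple is an assumption based on the printed outputs, not something confirmed by the page structure.

```
# Sketch only: assumes this runs inside the existing "for r in redirects:" loop,
# after s, a, content and images have been built for the current model page.

# Create the writer and write the header row once, before the loop:
out = UnicodeWriter(open("asus_stock.csv", "wb"))
out.writerow([u"Brand Name", u"CategoryID", u"ModelID", u"Family",
              u"Name", u"Part Number", u"Price", u"Image src"])

# ...then, per model page, build one plain-text row per category entry.
for data in a:
    if data.has_attr('onclick'):
        arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
        if len(arguments) != 5:
            continue  # skip the short ('Asus', '04G265003580')-style tuples
        brand, category, model, category_id, family = arguments

        # get_text() strips the tags, so plain text (not HTML) ends up in the cells
        name = s.find('b').get_text()
        spans = [span.get_text() for span in content if span.has_attr('class')]
        part_number = spans[0].replace('Part number: ', '') if spans else u''
        price = spans[-1] if len(spans) > 1 else u''
        image_src = images[0]['src'] if images else u''

        # one list per part -> one CSV row per part, instead of one long column
        out.writerow([brand, category, model, family,
                      name, part_number, price, image_src])
```

This doesn't touch `model_info` or the final `c.writerows(selects)` call; the point is only that every cell passed to `writerow()` is already a plain string, and each part gets its own list, which is what puts it on its own row.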