Scrapy output feed international unicode characters (e.g. Japanese chars)
<p>I'm a newbie to Python and Scrapy and I'm following the dmoz tutorial. As a minor variant of the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the Unicode escape sequences instead of the actual Japanese characters.</p>
<p>It seems like I need to use <a href="http://doc.scrapy.org/topics/request-response.html#scrapy.http.TextResponse">TextResponse</a> somehow, but I'm not sure how to make my spider use that object instead of the base Response object.</p>
<ol>
<li>How should I modify my code to show the Japanese chars in my output?</li>
<li>How do I get rid of the square brackets, the single quotes, and the 'u' that wrap my output values?</li>
</ol>
<p>Ultimately, I want an output of, say,</p>
<p><strong>オンラインショップ</strong> (these are Japanese chars)</p>
<p>instead of the current output of</p>
<p><strong>[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']</strong> (the Unicode escapes)</p>
<p>If you look at my screenshot, it corresponds to cell C7, one of the text titles.</p>
<p>Here's my spider (identical to the one in the tutorial, except for a different start_url):</p>
<pre><code>from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/World/Japanese/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []

        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)

        return items
</code></pre>
<p>settings.py:</p>
<pre><code>FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'
</code></pre>
<p>output screenshot: <a href="http://i55.tinypic.com/eplwlj.png">http://i55.tinypic.com/eplwlj.png</a> (sorry, I don't have enough SO points yet to post images)</p>
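To make the two outputs above concrete, here is a minimal sketch in plain Python (no Scrapy involved) of how the bracketed value relates to the desired one. It assumes that the exported cell is the printed form of a one-element list of Unicode strings, which is what the `[u'...']` output suggests; the helper name `first_text` is hypothetical and not part of Scrapy.

```python
# -*- coding: utf-8 -*-

# Assumption: this mirrors the value shown in the question, i.e. a list
# holding a single Unicode string. The \uXXXX escapes are only how the
# list is displayed; the string itself already contains the characters.
extracted = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']


def first_text(values):
    """Return the first string of a list of strings, or u'' if it is empty.

    Hypothetical helper for illustration only.
    """
    return values[0] if values else u''


# Taking the element out of the list removes the brackets and quotes,
# because we now hold the string rather than the list's printed form.
title = first_text(extracted)

# Encoding to UTF-8 gives the bytes one would typically write to a file.
encoded = title.encode('utf-8')
```

Whether this is the right place to unwrap the list (in the spider, in an item pipeline, or in the exporter) is a separate design choice that the sketch does not settle.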