StackOverflow2013


plurals

PO Scrapy output feed international unicode characters (e.g. Japanese chars)
primarykey
Id
6191902
data
AcceptedAnswerId
6193195
AnswerCount
1
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2011-05-31T18:31:04.893
FavoriteCount
3
LastActivityDate
2011-05-31T22:39:52.107
LastEditDate
LastEditorUserId
0
OwnerUserId
773694
ParentId
0
PostTypeId
1
Score
7
ViewCount
2815
LastEditorDisplayName
text
Body
I'm a newbie to Python and Scrapy, and I'm following the dmoz tutorial. As a minor variant of the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the Unicode numeric values instead of the actual Japanese characters.
It seems like I need to use <a href="http://doc.scrapy.org/topics/request-response.html#scrapy.http.TextResponse">TextResponse</a> somehow, but I'm not sure how to make my spider use that object instead of the base Response object.
<ol>
<li>How should I modify my code to show the Japanese chars in my output?</li>
<li>How do I get rid of the square brackets, the single quotes, and the 'u' that's wrapping my output values?</li>
</ol>
Ultimately, I want an output of, say, オンラインショップ (these are Japanese chars) instead of the current output of [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (the Unicode escapes). If you look at my screenshot, it corresponds to cell C7, one of the text titles.
Here's my spider (identical to the one in the tutorial, except for a different start_url):
<pre><code>from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/World/Japanese/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
</code></pre>
settings.py:
<pre><code>FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'
</code></pre>
output screenshot: <a href="http://i55.tinypic.com/eplwlj.png">http://i55.tinypic.com/eplwlj.png</a> (sorry, I don't have enough SO points yet to post images)
Tags
<python><unicode><scrapy>
Title
Scrapy output feed international unicode characters (e.g. Japanese chars)
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PT Answer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PT Question
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. US fortuneRice
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PT Answer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO Scrapy output feed international unicode characters (e.g. Japanese chars)
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO Scrapy output feed international unicode characters (e.g. Japanese chars)
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
3. VO
 singulars
 PostPostId
 PO Scrapy output feed international unicode characters (e.g. Japanese chars)
 UserUserId
 US totoro
 VoteTypeVoteTypeId
 VT Favorite
CommentsPostId
1. CO The scraping is working, the issue is just with how the values get written out to disk. How are you calling Scrapy to run the code?
 singulars
 PostPostId
 PO Scrapy output feed international unicode characters (e.g. Japanese chars)
 UserUserId
 US Thomas K
2. CO @Thomas I think the problem was just that the text was embedded in lists. Once I extracted them from the lists the unicode chars displayed properly.
 singulars
 PostPostId
 PO Scrapy output feed international unicode characters (e.g. Japanese chars)
 UserUserId
 US fortuneRice
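
The fix described in the last comment (taking the text out of the lists before it is written) can be illustrated with a small standalone sketch. It makes several assumptions: it is plain Python 3 and does not import Scrapy, `extracted` is a hand-written stand-in for what `site.select('a/text()').extract()` returns, `first_or_empty` is a hypothetical helper that is not part of Scrapy, and `ascii()` is used only to approximate how Python 2 rendered a list containing a unicode string.

```python
import csv
import io

# Stand-in for the result of site.select('a/text()').extract():
# a list holding one string, not the string itself (assumption for illustration).
extracted = ["\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7"]


def first_or_empty(values):
    """Hypothetical helper: return the first extracted string, or '' if nothing matched."""
    return values[0] if values else ""


title = first_or_empty(extracted)

buffer = io.StringIO()
writer = csv.writer(buffer)

# Writing the list stores its textual representation, brackets and quotes included.
# ascii() approximates the escaped form that Python 2 produced for the list.
writer.writerow([ascii(extracted)])

# Writing the string itself stores the actual characters.
writer.writerow([title])

print(buffer.getvalue())
```

In the spider from the question, the equivalent change would presumably be to store the first element of each `extract()` result in the item rather than the whole list, which is what the comment above reports as having worked.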


Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Rows can be related in a to-one or a to-many fashion. The caption of each to-many relation links to a filtered list view of the related table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.