StackOverflow2013

plurals

PO Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
primarykey
Id
1071667
data
AcceptedAnswerId
0
AnswerCount
6
ClosedDate
CommentCount
9
CommunityOwnedDate
CreationDate
2009-07-01T22:27:56.713
FavoriteCount
0
LastActivityDate
2016-06-05T23:13:49.243
LastEditDate
2009-07-02T06:53:53.467
LastEditorUserId
93975
OwnerUserId
93975
ParentId
0
PostTypeId
1
Score
-1
ViewCount
4920
LastEditorDisplayName
text
Body
I have to convert a number of large EBCDIC 500 encoded files (up to 2 GB) to Latin-1. Since I could only find EBCDIC to ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter. I have the <a href="http://www.tachyonsoft.com/cp01047.htm" rel="nofollow noreferrer">character mapping</a>, so I'm interested in the technical aspects. This is my approach so far:
<pre><code>import sys

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
    0xC1: '41',  # A
    0xC2: '42',  # B
    # and so on...
}

BUFFER_SIZE = 1024 * 64

ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()
</code></pre>
This is the function that does the converting:
<pre><code>def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')
</code></pre>
The question is whether or not this is a sensible approach from an engineering standpoint. Does it have any serious design issues? Is the buffer size OK? And so on...
As for the "proprietary characters" that some don't believe in: each file contains a year's worth of patent documents in SGML format. The patent office had been using EBCDIC until they switched to Unicode in 2005, so there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you the length of the file. I don't really need that information, but if I want to process the file I have to deal with it.
Also:
<pre><code>$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'
</code></pre>
Thanks for the help so far.
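For comparison, the same chunked conversion could be sketched with a 256-byte translation table and `bytes.translate`, which avoids the per-character dictionary lookup and the hex round trip. This is only a sketch in Python 3, not the code from the question: it assumes Python's built-in `cp500` codec matches the files' base mapping, the names `build_table` and `convert` are made up for illustration, and the `overrides` entry in the usage note is a hypothetical stand-in for the patent office's proprietary separator bytes. It does not handle the ASCII length digits at the start of each file.

```python
import io


def build_table(overrides=None):
    """Return a 256-byte table: index = CP500 byte, value = Latin-1 byte."""
    # Decode every possible byte with the built-in cp500 codec, then
    # re-encode as Latin-1 to get the target byte for each source byte.
    table = bytearray(bytes(range(256)).decode("cp500").encode("latin-1"))
    # Hypothetical hook for proprietary codes: {source_byte: target_byte}.
    for src_byte, dst_byte in (overrides or {}).items():
        table[src_byte] = dst_byte
    return bytes(table)


def convert(src, dst, table, buffer_size=64 * 1024):
    """Copy src to dst in chunks, translating each chunk through the table."""
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            break
        dst.write(chunk.translate(table))


if __name__ == "__main__":
    # Small in-memory demonstration; real use would open files in 'rb'/'wb'.
    source = io.BytesIO(b"\xc1\xc2\xf0\x40")
    target = io.BytesIO()
    convert(source, target, build_table())
    print(target.getvalue())
```

A proprietary separator could then be remapped with something like `build_table({0xFF: 0x0A})`, where both byte values are placeholders rather than the real codes. Because the mapping is byte-for-byte, chunk boundaries cannot split a character, so the buffer size only affects throughput.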
Tags
<python><ebcdic>
Title
Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USEisen
UserOwnerUserId
1. USEisen
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTDownMod
CommentsPostId
