Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?
<p>I have to convert a number of large EBCDIC 500 encoded files (up to 2 GB each) to Latin-1. Since I could only find EBCDIC to ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.</p>
<p>I have the <a href="http://www.tachyonsoft.com/cp01047.htm" rel="nofollow noreferrer">character mapping</a>, so I'm interested in the technical aspects.</p>
<p>This is my approach so far:</p>
<pre><code>import sys

# char mapping lookup table: EBCDIC byte value -> Latin-1 byte as a hex string
EBCDIC_TO_LATIN1 = {
    0xC1:'41', # A
    0xC2:'42', # B
    # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

# convert the input one buffer at a time
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()
</code></pre>
<p>This is the function that does the converting:</p>
<pre><code>def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')
</code></pre>
<p>The question is whether or not this is a sensible approach from an engineering standpoint. Does it have any serious design issues? Is the buffer size OK? And so on...</p>
<p>As for the "proprietary characters" that some don't believe in: each file contains a year's worth of patent documents in SGML format. The patent office used EBCDIC until they switched to Unicode in 2005, so there are thousands of documents <em>within</em> each file. They are separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.</p>
<p>Also:</p>
<pre><code>$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'
</code></pre>
<p>Thanks for the help so far.</p>
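For comparison, here is a minimal sketch of the same table-driven, chunked conversion. It is not the code from the question: it assumes Python 3, and it assumes that Python's built-in `cp500` codec is an acceptable starting point for the mapping. The names `make_table` and `convert` and the `overrides` parameter are hypothetical. `overrides` is a placeholder for the patent office's proprietary byte values, which are not specified here. The sketch does not handle the ASCII length digits at the start of each file.

```python
def make_table(overrides=None):
    """Build a 256-byte translation table from EBCDIC 500 to Latin-1.

    Assumption: Python's cp500 codec maps every byte to a code point that
    Latin-1 can encode. `overrides` is a hypothetical {source_byte: target_byte}
    dict for codes that are not part of the standard code page.
    """
    table = bytearray(bytes(range(256)).decode("cp500").encode("latin-1"))
    for src_byte, dst_byte in (overrides or {}).items():
        table[src_byte] = dst_byte
    return bytes(table)


def convert(src, dst, table, chunk_size=64 * 1024):
    """Copy binary stream `src` to `dst`, translating one chunk at a time."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # bytes.translate does the per-byte lookup without a Python-level loop
        dst.write(chunk.translate(table))
```

With files opened in binary mode, a call might look like `convert(open(in_path, "rb"), open(out_path, "wb"), make_table())`, where `in_path` and `out_path` are placeholders. Because the mapping is byte-for-byte, a chunk boundary can never split a character, so the buffer size only affects speed and memory use, not correctness.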