StackOverflow2013


plurals

PO
primarykey
Id
3408618
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2010-08-04T18:31:30.627
FavoriteCount
0
LastActivityDate
2010-08-05T22:36:26.933
LastEditDate
2010-08-05T22:36:26.933
LastEditorUserId
207663
OwnerUserId
207663
ParentId
3407277
PostTypeId
2
Score
7
ViewCount
0
LastEditorDisplayName
text
Body
<blockquote> Can I change the maximum allowed heap space if I am using Swank-Clojure (via Leiningen) the JVM has on startup? </blockquote> You can change the Java heap size by supplying the -Xms (initial heap) and -Xmx (max heap) options at startup; see the <a href="http://download-llnw.oracle.com/javase/6/docs/technotes/tools/windows/java.html#options" rel="noreferrer">docs</a>. So something like <code>java -Xms256m -Xmx1024m ...</code> would give a 256MB initial heap with the option to grow to 1GB. I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere in which you can change the arguments. <blockquote> If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space? </blockquote> Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments. <blockquote> Can I "sample" from the file; e.g. read only every z lines? </blockquote> <a href="http://download.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html" rel="noreferrer">java.io.RandomAccessFile</a> gives random file access by byte index, which you can build on to sample the contents. <blockquote> Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how? </blockquote> <a href="http://clojuredocs.org/v/2048" rel="noreferrer">line-seq</a> returns a lazy sequence of the lines in a file, so you can process as much at a time as you wish. 
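For the sampling question above, here is a minimal Java sketch of one way to build on RandomAccessFile. It assumes a text file with newline-terminated lines, and the class and method names (FileSampler, sampleLines) are made up for illustration. It samples by evenly spaced byte offsets rather than by exact line count, so it is only an approximation of "every z lines".

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class FileSampler {
    /**
     * Returns up to `samples` lines taken at evenly spaced byte offsets.
     * After each seek we are probably in the middle of a line, so the rest
     * of that line is discarded and the following full line is kept.
     * Note: RandomAccessFile.readLine() is unbuffered and maps each byte
     * directly to a char, so this sketch assumes ASCII/Latin-1 text.
     */
    public static List<String> sampleLines(String path, int samples) throws IOException {
        List<String> result = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            for (int i = 0; i < samples; i++) {
                long offset = length * i / samples;
                file.seek(offset);
                if (offset > 0) {
                    file.readLine(); // skip the line we landed in
                }
                String line = file.readLine();
                if (line != null) { // null means we hit end of file
                    result.add(line);
                }
            }
        }
        return result;
    }
}
```

Something like <code>FileSampler.sampleLines("data.txt", 1000)</code> would then pull about a thousand lines spread across the file without reading the rest of it (the file name is a placeholder).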
Alternatively, use the Java mechanisms in <a href="http://download-llnw.oracle.com/javase/6/docs/api/java/io/package-summary.html" rel="noreferrer">java.io</a> - <code>BufferedReader.readLine()</code> or <code>FileInputStream.read(byte[] buffer)</code>. <blockquote> Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? </blockquote> Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time. To make the most out of the memory you have, keep the data as primitive as possible. For some actual numbers, let's assume you want to graph the contents of a music CD: <ul> <li>A CD has two channels, each with 44,100 samples per second <ul> <li>60 min. of music is then ~300 million datapoints</li> </ul></li> <li>Represented as 16 bits (2 bytes, a short) per datapoint: 600MB</li> <li>Represented as a primitive int array (4 bytes per datapoint): 1.2GB</li> <li>Represented as an Integer array (32 bytes per datapoint): 10GB</li> </ul> These figures use the numbers from <a href="http://devblog.streamy.com/2009/07/24/determine-size-of-java-object-class/" rel="noreferrer">this blog</a> for object size (16 bytes of overhead per object, 4 bytes for a primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint). Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will by itself reduce the number of datapoints you can hold in memory by an order of magnitude. If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). 
This is clearly far too little to show any level of detail; you would want some form of subsampling or summary data there. So the solution you describe - process the full dataset one chunk at a time for a summary display in the "overview" timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable. Update: On fast file reads: <a href="http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly" rel="noreferrer">This article</a> times the file reading speed for 13 different ways to read a 100MB file in Java - the <a href="http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#Fullplot" rel="noreferrer">results</a> vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time. The article also has a <a href="http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#ComparisontoC" rel="noreferrer">comparison to C</a> in case anyone is interested. (Spoiler: The fastest Java reads are within a factor of 2 of a memory-mapped file in C.)
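To make the chunk-at-a-time summary idea a bit more concrete, here is a rough Java sketch. It is not taken from the linked article, and it makes several assumptions: the input is a raw file of big-endian 16-bit samples (a real WAV file has a header and is little-endian), and the names OverviewSummary and minMaxPerBucket are invented for this example. It streams the file through an 8k buffer and keeps only a min/max pair per bucket (say, per pixel of the overview timeline), so memory use depends on the number of buckets rather than the file size.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OverviewSummary {
    /**
     * Streams 16-bit samples from `path` and returns one {min, max} pair
     * per bucket of `samplesPerBucket` samples. The last bucket may cover
     * fewer samples if the file does not divide evenly.
     */
    public static short[][] minMaxPerBucket(String path, int samplesPerBucket) throws IOException {
        long totalSamples = new File(path).length() / 2; // 2 bytes per sample
        int buckets = (int) ((totalSamples + samplesPerBucket - 1) / samplesPerBucket);
        short[][] summary = new short[buckets][2];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 8192))) {
            long remaining = totalSamples;
            for (int b = 0; b < buckets; b++) {
                short min = Short.MAX_VALUE;
                short max = Short.MIN_VALUE;
                long count = Math.min(samplesPerBucket, remaining);
                for (long i = 0; i < count; i++) {
                    short sample = in.readShort(); // big-endian read
                    if (sample < min) min = sample;
                    if (sample > max) max = sample;
                }
                remaining -= count;
                summary[b][0] = min;
                summary[b][1] = max;
            }
        }
        return summary;
    }
}
```

For the 60 min CD example, calling this with a bucket size of around 180,000 samples would give roughly one min/max pair per pixel of a 1900 pixel wide timeline, stored in a primitive short array.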
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POHandling large datasets in Java/Clojure: littleBig data
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USj-g-faustus
UserOwnerUserId
1. USj-g-faustus
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POHandling large datasets in Java/Clojure: littleBig data
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
