StackOverflow2013

POloading large matrix from text file into Java arrays
primarykey
Id
6421757
data
AcceptedAnswerId
6421949
AnswerCount
3
ClosedDate
CommentCount
4
CommunityOwnedDate
CreationDate
2011-06-21T07:24:14.233
FavoriteCount
1
LastActivityDate
2011-06-24T22:37:44.167
LastEditDate
2011-06-24T22:37:44.167
LastEditorUserId
807797
OwnerUserId
807797
ParentId
0
PostTypeId
1
Score
2
ViewCount
4423
LastEditorDisplayName
text
Body
My data is stored as large matrices in txt files, each with millions of rows and 4 columns of comma-separated values. (Each column stores a different variable, and each row stores a different millisecond's data for all four variables.) The first dozen or so lines also hold some irrelevant header data.

I need to write Java code to load this data into four arrays, with one array for each column in the txt matrix. The code must be able to tell when the header is done, so that the first data row can be split into entries for the 4 arrays. It then needs to iterate through the millions of data rows, splitting each row into four numbers and storing each number in the array for the column it came from.

Can anyone show me how to alter the code below to accomplish this? I want to find the fastest way to process millions of rows. Here is my code:

MainClass2.java
<pre><code>package packages;

public class MainClass2 {
    public static void main(String[] args) {
        readfile2 r = new readfile2();
        r.openFile();
        int x1Count = r.readFile();
        r.populateArray(x1Count);
        r.closeFile();
    }
}
</code></pre>
readfile2.java
<pre><code>package packages;

import java.io.*;
import java.util.*;

public class readfile2 {
    private Scanner scan1;
    private Scanner scan2;

    public void openFile() {
        try {
            // Two scanners over the same file: scan1 counts entries, scan2 reads the data.
            scan1 = new Scanner(new File("C:\\test\\samedatafile.txt"));
            scan2 = new Scanner(new File("C:\\test\\samedatafile.txt"));
        } catch (Exception e) {
            System.out.println("could not find file");
        }
    }

    public int readFile() {
        // Counts whitespace-delimited tokens (Scanner's default delimiter).
        int scan1Count = 0;
        while (scan1.hasNext()) {
            scan1.next();
            scan1Count += 1;
        }
        return scan1Count;
    }

    public double[][] populateArray(int scan1Count) {
        double[] outputArray1 = new double[scan1Count];
        double[] outputArray2 = new double[scan1Count];
        double[] outputArray3 = new double[scan1Count];
        double[] outputArray4 = new double[scan1Count];
        int i = 0;
        while (scan2.hasNext()) {
            String token = scan2.next(); // consume the current token so the loop advances
            // what code do I write here to:
            // 1.) identify the start of my time series rows after the end of the header rows
            //     (e.g. row starts with a number AT LEAST 4 digits in length.)
            // 2.) split each time series row's data into a separate new entry
            //     for each of the 4 output arrays
            i++;
        }
        // A method can return only one value, so the four arrays are wrapped in a double[][].
        return new double[][] {outputArray1, outputArray2, outputArray3, outputArray4};
    }

    public void closeFile() {
        scan1.close();
        scan2.close();
    }
}
</code></pre>
Here are the first 19 lines of a typical data file:
<pre><code>text and numbers on first line
1 msec/sample
3 channels
ECG
Volts
Z_Hamming_0_05_LPF
Ohms
dz/dt
Volts
min,CH2,CH4,CH41,
,3087747,3087747,3087747,
0,-0.0518799,17.0624,0,
1.66667E-05,-0.0509644,17.0624,-0.00288295,
3.33333E-05,-0.0497437,17.0624,-0.00983428,
5E-05,-0.0482178,17.0624,-0.0161573,
6.66667E-05,-0.0466919,17.0624,-0.0204402,
8.33333E-05,-0.0448608,17.0624,-0.0213986,
0.0001,-0.0427246,17.0624,-0.0207532,
0.000116667,-0.0405884,17.0624,-0.0229672,
</code></pre>

<hr>
<h2>EDIT</h2>

I tested Shilaghae's code suggestion, and it seems to work. However, every resulting array has the same length as x1Count, so zeros remain in the places where Shilaghae's pattern matching code is not able to place a number. (This is a result of how I wrote the code originally.) I had trouble finding the indices where zeros remain, but there seemed to be many more zeros than the ones expected where the header was. When I graphed the derivative of the temp[1] output, I saw a number of sharp spikes where false zeros in temp[1] might be. If I can tell where the zeros in temp[1], temp[2], and temp[3] are, I might be able to modify the pattern matching to better retain all the data.

Also, it would be nice to simply shorten the output arrays so that they no longer include the rows where the header was in the input file.
However, the tutorials I have found regarding variable-length arrays only show oversimplified examples like:
<pre><code>int[] anArray = {100, 200, 300, 400};
</code></pre>
The code might run faster if it no longer used scan1 to produce scan1Count, but I do not want to slow it down by using an inefficient method to produce a variable-length array. I also do not want to skip data in my time series in the cases where the pattern matching is not able to split the input row into 4 numbers. I would rather keep the in-time-series zeros so that I can find them and use them to debug the pattern matching. Can anyone show how to do these things in fast-running code?

<hr>
<h2>SECOND EDIT</h2>

So
<pre><code>"-{0,1}\\d+.\\d+,"
</code></pre>
repeats four times in the expression:
<pre><code>"-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,"
</code></pre>
Does
<pre><code>"-{0,1}\\d+.\\d+,"
</code></pre>
decompose into the following three statements?
<pre><code>"-{0,1}" means that a minus sign occurs zero or one times,
"\\d+." means that the minus sign (or lack of minus sign) is followed by several digits of any value followed by a decimal point,
"\\d+," means that the decimal point is followed by several digits of any value
</code></pre>
If so, what about numbers in my data like "1.66667E-05," or "-8.06131E-05,"? I just scanned one of the input files, and (out of 3+ million 4-column rows) it contains 638 numbers that contain E, of which 5 were in the first column and 633 were in the last column.

<hr>
<h2>FINAL EDIT</h2>

The final code was very simple: it just uses string.split() with "," as the regular expression. To do that, I had to manually delete the headers from the input file so that the data only contained rows with 4 comma-separated numbers.
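The manual header deletion could probably be avoided by letting the code decide, line by line, whether a row is data. The following is only a minimal sketch, assuming that a data row is any line that splits into exactly four fields which all parse as doubles. The names `RowParser` and `parseRow` are hypothetical and do not come from the code in this post.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original post's code.
final class RowParser {

    /**
     * Returns the four values of a data row, or null when the line is not
     * a data row (assumption: header lines never consist of four numbers).
     */
    static double[] parseRow(String line) {
        // split(",") drops trailing empty strings, so the comma that ends
        // each row in the sample file does not produce a fifth field.
        String[] fields = line.split(",");
        if (fields.length != 4) {
            return null;
        }
        double[] values = new double[4];
        try {
            for (int i = 0; i < 4; i++) {
                // Double.parseDouble accepts plain decimals and
                // scientific notation such as 1.66667E-05.
                values[i] = Double.parseDouble(fields[i].trim());
            }
        } catch (NumberFormatException e) {
            // Text fields ("min") and empty fields both end up here.
            return null;
        }
        return values;
    }

    public static void main(String[] args) {
        String[] sample = {
            "ECG",
            "min,CH2,CH4,CH41,",
            ",3087747,3087747,3087747,",
            "1.66667E-05,-0.0509644,17.0624,-0.00288295,"
        };
        for (String line : sample) {
            System.out.println(line + " -> " + Arrays.toString(parseRow(line)));
        }
    }
}
```

With the sample file shown earlier, only the rows after the `,3087747,3087747,3087747,` line would be treated as data under this assumption. Because `Double.parseDouble` handles the values containing E, no separate pattern for scientific notation should be needed.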
In case anyone is curious, the final working code for this is:
<pre><code>// Requires java.io.* (already imported in readfile2.java).
public double[][] populateArray(int scan1Count) {
    double[] outputArray1 = new double[scan1Count];
    double[] outputArray2 = new double[scan1Count];
    double[] outputArray3 = new double[scan1Count];
    double[] outputArray4 = new double[scan1Count];
    File tempfile = new File("C:\\test\\mydatafile.txt");
    // try-with-resources closes the reader even if reading or parsing fails.
    try (BufferedReader br = new BufferedReader(new FileReader(tempfile))) {
        String strLine;
        int i = 0;
        while ((strLine = br.readLine()) != null) {
            // One row per line; each row holds 4 comma-separated numbers.
            String[] split = strLine.split(",");
            outputArray1[i] = Double.parseDouble(split[0]);
            outputArray2[i] = Double.parseDouble(split[1]);
            outputArray3[i] = Double.parseDouble(split[2]);
            outputArray4[i] = Double.parseDouble(split[3]);
            i++;
        }
    } catch (IOException e) {
        System.out.println("e for exception is:" + e);
        e.printStackTrace();
    }
    // Pack the four column arrays into a single return value.
    double[][] temp = new double[4][];
    temp[0] = outputArray1;
    temp[1] = outputArray2;
    temp[2] = outputArray3;
    temp[3] = outputArray4;
    return temp;
}
</code></pre>
Thank you, everyone, for your help. I am going to close this thread now because the question has been answered.
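On the earlier point about avoiding the separate counting pass: one common approach is a growable buffer of primitive doubles that doubles its capacity when full and is trimmed once at the end. The sketch below is only an illustration under that assumption. `DoubleColumn` is a hypothetical class, not something from the standard library or from the answers, and the initial capacity of 1024 is an arbitrary choice.

```java
import java.util.Arrays;

// Hypothetical growable column of primitive doubles (avoids boxing to Double).
final class DoubleColumn {
    private double[] data = new double[1024]; // arbitrary starting capacity
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(double value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    /** Number of values appended so far. */
    int size() {
        return size;
    }

    /** Returns a copy trimmed to exactly the number of values appended. */
    double[] toArray() {
        return Arrays.copyOf(data, size);
    }

    public static void main(String[] args) {
        DoubleColumn column = new DoubleColumn();
        for (int i = 0; i < 3000; i++) {
            column.add(i * 0.5);
        }
        double[] result = column.toArray();
        System.out.println("stored " + result.length + " values");
    }
}
```

Using one such column per variable, the file would only need to be read once, and the returned arrays would contain no leftover zeros from header rows. Doubling keeps the number of reallocations small (roughly a dozen for a few million rows starting from 1024), though whether this is actually faster than the two-pass version would need to be measured on the real files.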
Tags
<java><arrays><text-files><java.util.scanner><scientific-computing>
Title
loading large matrix from text file into Java arrays
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCodeMed
UserOwnerUserId
1. USCodeMed
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POloading large matrix from text file into Java arrays
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POloading large matrix from text file into Java arrays
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
