StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POCounting duplicate data which satisfy conditions and remove data
text
Body
copied!<p>The sample file has been uploaded to <a href="http://www.mediafire.com/?ycibi4avgl9o8rg" rel="nofollow noreferrer">MediaFile</a>.</p> <p><strong>Background Information</strong></p> <p><strong>Section 1: In the sample file, “Sheet1”</strong></p> <pre><code>a. Values in “Column A” are the original name. For example from Cell A1: “>hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 5'pad=0 3'pad=0 strand=+ repeatMasking=none” b. Values in “Column B” is a value that correspond to values in Column A, for example from Cell B1 which correspond to value in Cell A1: “ABCC2” </code></pre> <p><strong>Section 2: In the sample file, “Sheet2”</strong></p> <pre><code>a. In the Sheet2, the values from Sheet1 have been separated to clarify the data because in Sheet1, everything is packed in one cell. b. Column A represents “GENE”, which refers to the value in Column B in Sheet1, for example, “ABCC2” from Section 1 of this article. c. Column B represents “refGENE”, an example of refGENE is “NM000392” which come from the original name from “Sheet1” d. Column C represents “CHROMOSOME”, this is another value that was derived from Values in Column A of Sheet1, for example, “chr10” e. Similar Idea, “EXON START” came from the original name in Column A of Sheet1, for example “101542463” f. And “EXON END” came from the original name in Column A of Sheet1, for example “101542634” </code></pre> <p><strong>The Challenge</strong> is to develop a program that can solve the following requirements:</p> <p><strong>Requirement 1:</strong> counting for each gene, the number of times each refGene is observed, e.g.:</p> <p><a href="https://i.stack.imgur.com/QfhkW.jpg" rel="nofollow noreferrer">Table Example refGENE COUNT NM000927 29 NM00078 32 NM00042 32 . . . . . .</a></p> <p><a href="https://i.stack.imgur.com/QfhkW.jpg" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/QfhkW.jpg" alt="enter image description here"></a> </p> <p>Note: The way I do it is to use SUMPRODUCT in Excel, however, I don't know how to put everything in a simple table. </p> <p><strong>Requirement 2:</strong> This requires comparing values in two different rows, Please note that this requires using the original name from "Sheet1". Please don't use the separated value from "Sheet2". Basically, it is query each row, if Gene, Chromosome, EXONSTART, EXON END are the same, then remove rows with the least frequent refgene. I will explain further below.</p> <p>In "Shee1", there are "Original Name" and "GENE",</p> <p><strong>Step 1:</strong> Compare if the values in Column B are the same. For example, when comparing row 1 and row 2, there are <code>ABCC2</code> and <code>ABCC2</code>. This satisfies the condition, so proceed to Step 2, else continue to compare GENE from different rows.</p> <p><strong>Step 2:</strong> Compare "chr" values from different rows, same example from previous step. Row 1 has <code>chr10</code> and row 2 has <code>chr10</code>, as they are the same continue to the next step, else move on.</p> <p><strong>Step 3:</strong> Now compare "exon start" - a number looks like <code>101542463</code> in row 1 and the number in row 2 looks like <code>101544365</code>, now they are not the same, save the file and move on. Imagine if the numbers are the same, then continue to compare "exon end", which is step 4.</p> <p><strong>Step 4:</strong> Assume, the "exon start" from two different rows are the same, then compare "exon end". The number from row 1 looks like <code>101542634</code> and the number of "exon end" from row 2 looks like <code>101544538</code>. Same condition as above, if they are different, leave the file alone and continue comparing the next GENE. </p> <p>Here is the part that requires attention, if they are the same, that means, "GENE" are the same, "chr" are the same, "exon start" and "exon end" are the same. In the end, everything is the same, that means there is a duplicated row. Now, the duplicated rows will be deleted. But what's the condition of deleting the row. This will link us back to the challenge that we solved from requirement 1. Remember that the number of occurrences has been counted for all refGENE? Recall 29 times for <code>NM000927</code>, 32 times for <code>Nm00078</code>. The rows of "GENE" to be removed are the ones containing <code>NM000927</code>. </p> <p>But, please keep a record for all the deleted data, and all the remaining data, preferably with a table.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload