Finding duplicate video files by database (millions), fingerprint? Pattern recognition?
    <p>In the following scenario:</p>
    <p>I have a project with a catalog of currently some ten thousand video files, and the number is going to increase dramatically.</p>
    <p>However, lots of them are duplicates. With every video file I have associated semantic and descriptive information, which I want to merge across duplicates to achieve better results for every one.</p>
    <p>Now I need some sort of procedure where I index metadata in a database, and whenever a new video enters the catalog the same data is calculated and matched against the database.</p>
    <p>The problem is that the videos aren't exact duplicates. They can have different quality, may be cropped, watermarked, or have a sequel/prequel. Or they are cut off at the beginning and/or end.</p>
    <p>Unfortunately, the better the comparison, the more CPU and memory intensive it gets, so I plan on implementing several layers of comparison that begin with a very lenient but fast comparison (maybe video length with a tolerance of 10%) and end with the final comparison that decides whether it's really a duplicate (that would be a community vote).</p>
    <p>So as I have a community to verify the results, it suffices to deliver "good guesses" with a low miss ratio.</p>
    <p>So now my question is: what layers can you guys think of, or do you have a better approach?</p>
    <p>I don't care about the effort to create the metadata; I have enough worker machines to do that. Just the comparison should be fast. So if it helps I can convert the video 100 times as well...</p>
    <p>Here are my current ideas:</p>
    <ul> <li><p>video length (seconds)</p></li> <li><p>first and last frame picture analysis</p></li> </ul>
    <p>I would resample the picture to thumbnail size, compute the average RGB value, and then serialize it pixel by pixel: each pixel becomes 1 or 0 depending on whether its color is greater or smaller than the average. So I get a binary string which I can store in MySQL, do a bitwise XOR (supported by MySQL internally), and count the remaining unequal bits (also supported internally; that would then be the Hamming distance of the binary strings).</p>
    <ul> <li>development of the bitrate over time with the same VBR codec</li> </ul>
    <p>I would transcode the video into a VBR video file with the exact same settings. Then I would look at the bitrate at certain points in time (percentage of the video completed, or absolute seconds, in which case we would only analyze a portion of the video). Same thing as with the picture: if the bitrate is greater than the average it's 1, else it's 0. We make a binary string, store it in the DB, and calculate the Hamming distance later.</p>
    <ul> <li><p>audio analysis (bitrate and decibel variation over time, just as with the bitrate of the video)</p></li> <li><p>keyframe analysis</p></li> </ul>
    <p>Image comparison just like the first and last frame, but at keyframe positions? We would use the same source files we used for the bitrate calculations, because keyframes depend heavily on the codec and settings.</p>
    <ul> <li>development of color over time</li> </ul>
    <p>Maybe let's take one or more areas/pixels inside the image and see how they develop over time. Again, the change above/below average. Black/white would suffice, I think.</p>
    <ul> <li>present the suggestions to the user for final approval...</li> </ul>
    <p>Or am I going the completely wrong way? I think I can't be the first one having this problem, but I have not had any luck finding solutions.</p>
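    <p>To make the first two layers concrete, here is a minimal Python sketch of the length filter and the first/last-frame comparison. It is only an illustration under assumptions: the thumbnails are taken to be already resampled and flattened to a list of grayscale values, the 10% tolerance is measured against the longer video, and the names <code>VideoFingerprint</code>, <code>average_hash</code>, <code>length_matches</code>, <code>candidate_duplicates</code> and the threshold <code>max_differing_bits</code> are all hypothetical.</p>

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VideoFingerprint:
    # Hypothetical record layout: one row per catalogued video.
    video_id: int
    length_seconds: float
    first_frame_hash: int
    last_frame_hash: int


def average_hash(pixels: Sequence[int]) -> int:
    """Pack one bit per pixel: 1 if the pixel is above the mean, else 0."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes (XOR, then bit count)."""
    return bin(a ^ b).count("1")


def length_matches(a: float, b: float, tolerance: float = 0.10) -> bool:
    """Cheap first layer. Assumption: tolerance is relative to the longer video."""
    return abs(a - b) <= tolerance * max(a, b)


def candidate_duplicates(
    new: VideoFingerprint,
    catalog: Sequence[VideoFingerprint],
    max_differing_bits: int = 10,  # assumed threshold, would need tuning
) -> List[Tuple[int, int]]:
    """Return (video_id, distance) pairs that pass both layers, best match first."""
    candidates = []
    for known in catalog:
        if not length_matches(new.length_seconds, known.length_seconds):
            continue  # rejected by the fast layer
        distance = (
            hamming_distance(new.first_frame_hash, known.first_frame_hash)
            + hamming_distance(new.last_frame_hash, known.last_frame_hash)
        )
        if distance <= max_differing_bits:
            candidates.append((known.video_id, distance))
    return sorted(candidates, key=lambda pair: pair[1])
```

    <p>Because each pixel is compared against the mean of its own thumbnail, a uniformly brighter or darker copy of a frame yields the same hash. If the hashes fit into 64-bit integer columns, the same distance could presumably be computed inside MySQL with <code>BIT_COUNT(a ^ b)</code> instead of in application code. Note that this sketch does nothing for videos that are cut off at the beginning or end, since their first and last frames differ.</p>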