StackOverflow2013

plurals

POFinding duplicate video files by database (millions), fingerprint? Pattern recognition?
primarykey
Id
3591731
data
AcceptedAnswerId
3593153
AnswerCount
3
ClosedDate
CommentCount
5
CommunityOwnedDate
CreationDate
2010-08-28T17:32:15.463
FavoriteCount
14
LastActivityDate
2017-11-08T07:44:04.207
LastEditDate
2017-11-08T07:44:04.207
LastEditorUserId
1685157
OwnerUserId
413910
ParentId
0
PostTypeId
1
Score
21
ViewCount
2919
LastEditorDisplayName
text
Body
Here is the scenario: I have a project with a catalog of currently some ten thousand video files, and the number is going to increase dramatically. However, lots of them are duplicates. With every video file I have associated semantic and descriptive information, which I want to merge across duplicates to achieve better results for each one. Now I need some sort of procedure where I index metadata in a database, and whenever a new video enters the catalog the same data is calculated and matched against the database. The problem is that the videos aren't exact duplicates. They can have different quality, may be cropped or watermarked, or have a sequel/prequel. Or they are cut off at the beginning and/or end. Unfortunately, the better the comparison, the more CPU and memory intensive it gets, so I plan on implementing several layers of comparison that begin with a very lenient but fast comparison (maybe video length with a tolerance of 10%) and end with the final comparison that decides whether it's really a duplicate (that would be a community vote). Since I have a community to verify the results, it suffices to deliver "good guesses" with a low miss ratio. So now my question is: what layers can you think of, or do you have a better approach? I don't care about the effort to create the metadata; I have enough slaves to do that. Just the comparison should be fast. So if it helps, I can convert the video 100 times as well... Here are my current ideas: <ul> <li>video length (seconds)</li> <li>first and last frame picture analysis</li> </ul> I would resample the picture to thumbnail size and get the average RGB values, then serialize pixel by pixel whether the color at this pixel is greater or smaller than the average, represented by 0 or 1.
So I get a binary string which I can store in MySQL, do a boolean bit-sum (XOR, supported by MySQL internally) and count the remaining unequal bits (also supported internally; that would then be the Hamming distance of the binary strings). <ul> <li>development of the bitrate over time with the same VBR codec</li> </ul> I would transcode the video into a VBR video file with the exact same settings. Then I would look at the bitrate at certain points in time (percentage of the video completed, or absolute seconds, in which case we would only analyze a portion of the video). Same thing as with the picture: if the bitrate is greater than the average it's 1, else it's 0. We make a binary string, store it in the DB and calculate the Hamming distance later. <ul> <li>audio analysis (bitrate and decibel variation over time, just like the bitrate of the video)</li> <li>keyframe analysis</li> </ul> Image comparison just like the first and last frame, but at keyframe positions? We would use the same source files we used for the bitrate calculations, because keyframes are heavily dependent on the codec and settings. <ul> <li>development of color over time</li> </ul> Maybe take one or more areas/pixels inside the image and see how they develop over time. Again, the change above/below average. Black/white would suffice, I think. <ul> <li>present the suggestions to the user for final approval...</li> </ul> Or am I going the completely wrong way? I think I can't be the first one having this problem, but I have not had any luck finding solutions.
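To make the layering idea concrete, here is a minimal Python sketch of the two cheapest layers described above: a length pre-filter with a 10% tolerance, followed by a bit-string comparison. Counting the bits that differ after an XOR gives the Hamming distance between two equal-length bit strings. All names (`threshold_bits`, `hamming_distance`, `length_within_tolerance`, `candidate_duplicates`), the catalog layout, and the `max_distance` cutoff are hypothetical and chosen only for illustration; extracting thumbnail pixels or bitrate samples from real video files is assumed to happen elsewhere.

```python
def threshold_bits(values):
    """Encode samples as an int: one bit per sample, 1 if above the mean.

    `values` may be thumbnail pixel intensities or bitrate samples;
    how they are obtained from the video is outside this sketch.
    """
    mean = sum(values) / len(values)
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


def length_within_tolerance(len_a, len_b, tolerance=0.10):
    """Cheap first layer: lengths differ by at most `tolerance` (relative)."""
    return abs(len_a - len_b) <= tolerance * max(len_a, len_b)


def candidate_duplicates(new_video, catalog, max_distance=2):
    """Return ids of catalog entries that pass both layers, closest first.

    Each entry is a dict with hypothetical keys 'id', 'length' (seconds)
    and 'bits' (output of threshold_bits). `max_distance` is an assumed
    cutoff that would need tuning on real data.
    """
    hits = []
    for entry in catalog:
        if not length_within_tolerance(new_video["length"], entry["length"]):
            continue  # layer 1: discard on length alone
        dist = hamming_distance(new_video["bits"], entry["bits"])
        if dist <= max_distance:  # layer 2: fingerprint similarity
            hits.append((dist, entry["id"]))
    return [entry_id for _, entry_id in sorted(hits)]


catalog = [
    {"id": "a", "length": 100, "bits": threshold_bits([10, 200, 30, 220])},
    {"id": "b", "length": 104, "bits": threshold_bits([200, 10, 30, 220])},
    {"id": "c", "length": 300, "bits": threshold_bits([10, 200, 30, 220])},
]
new_video = {"length": 102, "bits": threshold_bits([12, 190, 35, 90])}
print(candidate_duplicates(new_video, catalog))
```

In a database, the same second layer could be expressed with a bitwise XOR and a bit count on integer columns, so the candidates handed to the community vote would be those below the chosen distance cutoff.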
Tags
<language-agnostic><video><comparison><fingerprint><audio-fingerprinting>
Title
Finding duplicate video files by database (millions), fingerprint? Pattern recognition?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USPaolo Forgia
UserOwnerUserId
1. USThe Surrican
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POFinding duplicate video files by database (millions), fingerprint? Pattern recognition?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POFinding duplicate video files by database (millions), fingerprint? Pattern recognition?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POFinding duplicate video files by database (millions), fingerprint? Pattern recognition?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
