Finding Duplicate Data

Started by Brian, April 25, 2012, 03:18:46 AM


Brian

Hi,

Every day at work we get a text file with (at the last count) around 27,000
lines of data. We have to take this file, compare it with the previous day's file,
and output any changes that have occurred between the two to a separate file.

At the moment we use WinMerge, but this is very time consuming.

I'm thinking of writing something to speed up this process and output the
differences file automatically. I don't know how to go about it, but the obvious
approach - taking each line in one file and searching through all 27,000 lines
in the other, over and over until the end of the file is reached - is quite an
intensive task, even for a computer!

Have we anybody out there with a bright idea?

Brian

billhsln

April 25, 2012, 03:58:44 AM #1 Last Edit: April 25, 2012, 04:10:04 AM by billhsln
Are the records a fixed length?  If so, read both files, writing a new file with a 1 appended to the end of each record from the first file and a 2 appended to each record from the second.  Then sort that file; after that, just read through the merged and sorted file: if the current record (without the extra character) equals the previous record (without the extra character), they are the same, so keep a flag telling you whether the previous record was matched or not.  It may not be very elegant, but it would do the job.  You can also probably do the sort with a system command:

SORT c:\temp\input.txt /o c:\temp\output.txt

If you are sure there are no duplicates within each file, then you won't need to add the 1/2 to the end; just merge, sort, and then read the merged/sorted file checking for duplicates within it.

System commands:
copy file1.txt+file2.txt merged.txt
sort merged.txt /o sorted.txt

then read the sorted.txt file.
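For what it's worth, here is a minimal IWBasic sketch of that final read-back step (untested, and assuming there are no duplicates within either source file and that the 1/2 tag is the last character of every record in sorted.txt). Any record whose text doesn't pair up with its neighbour came from only one of the two files:

' sketch: scan the sorted merge and print the records that appear only once,
' i.e. lines present in one day's file but not the other.
' the trailing 1/2 tag on a printed line tells you which file it came from.
const MAX_LEN = 1000
file f
istring cur[MAX_LEN]
istring prev[MAX_LEN]
int havePrev = 0
int prevMatched = 0

if OpenFile(f, "sorted.txt", "R") = 0
	while Read(f, cur) = 0
		if havePrev
			if LEFT$(prev, LEN(prev)-1) = LEFT$(cur, LEN(cur)-1)
				' same record arrived from both files - not a difference
				prevMatched = 1
			else
				if prevMatched = 0 then print prev
				prevMatched = 0
			endif
		endif
		prev = cur
		havePrev = 1
	endwhile
	' don't forget the very last record
	if havePrev
		if prevMatched = 0 then print prev
	endif
	CloseFile(f)
endif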

Hope that helps,
Bill
When all else fails, get a bigger hammer.

LarryMc

I threw this together just to see if I could do it.

It reads the old and new files into separate linked lists (reducing the overall number of disk reads), using a modified routine that LOCODARWIN wrote.

It then goes through the old list, searching for a match in the new list.
When a match is encountered, both entries are deleted and the loops are reset.

When all the old entries have been compared, the searching ends.

The old list then contains all the entries from the old file that do not have an exact match in the new file.

The new list contains all the entries from the new file that do not have an exact match in the old file.

The differences can be due to a line being deleted from or added to the old file, or to a single character changing on a line that is in both the old and new files.

This whole scheme assumes that there are never duplicate entries within a file.

The whole thing slows down if you print the info as it goes, so I waited to print at the end.

There is a routine to copy the lists to files, if you desire.

Hope this helps.

NOTE: Make sure you make the constant MAX_LINE_LEN longer than the longest line you will ever encounter, but don't make it too big or you might run out of memory.


LarryMc

const MAX_LINE_LEN = 1000

' each list entry holds one line of text from a file
type rec
	istring txt[MAX_LINE_LEN]
endtype

POINTER oldLL,newLL
pointer sTempold,sTempnew
pointer oldpos,newpos

' read both files into linked lists (reduces the overall number of disk reads)
oldLL = FileToList(getstartpath+"old.iwb")
newLL = FileToList(getstartpath+"new.iwb")

oldpos = ListGetFirst(oldLL)
int skip=0
WHILE oldpos <> 0
	sTempold = ListGetData(oldpos)
	newpos = ListGetFirst(newLL)
	' search the new list for an exact match of the current old line
	WHILE newpos <> 0
		sTempnew = ListGetData(newpos)
		IF #<rec>sTempold.txt = #<rec>sTempnew.txt
			' match found: delete both entries and restart the outer scan
			oldpos = ListRemove(oldpos,TRUE)
			newpos = ListRemove(newpos,TRUE)
			skip=1
			break
		ELSE
			newpos = ListGetNext(newpos)
		ENDIF
	ENDWHILE
	if skip
		oldpos = ListGetFirst(oldLL)
		skip=0
	else
		oldpos = ListGetNext(oldpos)
	endif
ENDWHILE

' whatever is left in each list had no exact match in the other file
print "Lines in old but no match in new"
For sTempold = Each oldLL As rec
	Print #sTempold.txt
Next
print:print

print "Lines in new but no match in old"
For sTempnew = Each newLL As rec
	Print #sTempnew.txt
Next
print

ListRemoveAll(oldLL, TRUE)
ListRemoveAll(newLL, TRUE)
end

' reads a text file into a new linked list, one rec per line;
' returns NULL if the file could not be opened
Sub FileToList(sFilePath as STRING), POINTER
	POINTER lstList, pTempPoint
	FILE iFile
	INT iStat
	iSTRING sTempText[MAX_LINE_LEN]
	lstList = ListCreate()
	iStat = OpenFile(iFile, sFilePath, "R")
	If iStat = 0
		While Read(iFile, sTempText) = 0
			pTempPoint = ListAdd(lstList, New(rec, 1))
			#<rec>pTempPoint.txt = sTempText
		WEnd
		CloseFile(iFile)
		Return lstList
	Else
		Return NULL
	EndIf
	Return NULL
EndSub

' writes a linked list out to a file, optionally appending;
' returns 1 on success, 0 on failure
Sub ListToFile(lstList as POINTER, sFilePath as STRING, Opt iAppend=0 as INT), INT
	POINTER xx
	FILE iFile
	INT iStat
	If iAppend = 0
		iStat = OpenFile(iFile, sFilePath, "W")
	Else
		iStat = OpenFile(iFile, sFilePath, "A")
	EndIf
	If iStat = 0
		For xx = Each lstList as rec
			Write(iFile, #xx.txt)
		Next
		CloseFile(iFile)
		Return 1
	Else
		Return 0
	EndIf
	Return 0
EndSub
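(If you do want the leftovers in files instead of on the screen, calling the routine just before the ListRemoveAll lines would look something like this - the output file names are invented:)

' save the unmatched lines instead of (or as well as) printing them
ListToFile(oldLL, getstartpath + "old_only.txt")
ListToFile(newLL, getstartpath + "new_only.txt")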
LarryMc
Larry McCaughn :)
Author of IWB+, Custom Button Designer library, Custom Chart Designer library, Snippet Manager, IWGrid control library, LM_Image control library

Brian

Blimey!

Great answers from all of you - I will look at all of them tonight, but having some working
code from Larry might just pip the lot...

Many, many thanks,

Brian

Brian

Larry,

Just curious - why shouldn't there be duplicate lines? The whole idea is to winkle out the lines
that are different and save them to a third file. With the data files I have got, there are always
duplicate lines.

Brian

LarryMc

Quote from: Brian Pugh on April 25, 2012, 08:05:41 AM
Larry,

Just curious - why shouldn't there be duplicate lines? The whole idea is to winkle out the lines
that are different and save them to a third file. With the data files I have got, there are always
duplicate lines.

Brian

Are you telling me that the old file (by itself) or the new file (by itself) will have lines that are duplicates of each other?
I'm not talking about a single line that appears in both the old and new files; my code strips those out.

What my code won't do is handle, say, 3 lines in the old file that equal "aaa" and 5 lines in the new file that equal "aaa".

LarryMc

Brian

Well, yes, exactly that.

The new file will contain lines that are exactly the same as the old. The new file will also have
lines that are amendments, e.g., name changes, address changes, etc.

So there are some lines that will possibly never change, while other lines will be nearly the same
but are obviously different.

Brian

aurelCB

Hi Brian...
I agree with Larry.
Or you could do almost the same with a hash table (dictionary).
Or maybe create two listboxes:
1st LB - load the old file
2nd LB - load the new file
Search and compare, then write the differing strings into a third listbox.
Or something like that... ;)

LarryMc

Quote from: Brian Pugh on April 25, 2012, 08:05:41 AM
Larry,

Just curious - why shouldn't there be duplicate lines? The whole idea is to winkle out the lines
that are different and save them to a third file. With the data files I have got, there are always
duplicate lines.

Brian

I think we are still missing each other.

The code I posted will find, for each line in the old file, one and only one exactly matching line in the new file, if it exists.
If there are two or more lines in the new file that match a single line in the old file, only one of them will be flagged and removed.
The extra matches will be treated as if there was no match.

And since you have indicated that everything is subject to change, that is the closest you are going to get; you will have to keep an eye on the results to really see what is going on.

Other than that, I can't help further without two sample files.
I could go further if I knew there was something like an account number that stays constant for a given person no matter what other changes are made. But from the info you have provided, I don't know that.

LarryMc

Doc

April 25, 2012, 09:29:51 PM #10 Last Edit: April 25, 2012, 10:57:38 PM by Doc
Brian, my curiosity is getting the better of me.

Presuming that each line contains a group of fields...

1.) What delimiter is being used?
2.) Is each and every field in a given line subject to change, requiring a comparison?
3.) Will each file contain the same number of lines?
4.) Will each line of data remain in the same relative position in each file?
   (Ex: Line 10 in file "A" should also be Line 10 in file "B", not 11, 14 or 2500)

-Doc-

Edit: Just for kicks, try this file to see if it (in theory) does what you want, excluding opening or saving data:
http://www.docstoolchest.com/catch_all/Differences.zip

LarryMc

DOC

Based upon work I've helped him with before, here is my GUESS at what is going on.

The files are comma delimited.
The data is associated with insurance policies.
Transactions can occur in any order in either file.
Line entries can be name, address and other personal information changes on one or more policies belonging to a given person.
A single person could have multiple changes on any given day.
There will be a randomly different number of lines in each file (new customers, former customers, added/deleted policies, etc).
A policy # is the only field that is almost constant.

But like I said, that's my GUESS.

LarryMc

LarryMc

I modified my original suggestion to add the part that writes the differences out to a file.

Lines that were in the old file but not in the new are preceded with '--';
lines that are in the new file but not in the old are preceded with '++'.
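(For anyone following along, a rough, untested sketch of how that writing step could look, modelled on the ListToFile routine above and reusing the same rec type; DiffToFile is an invented name, not the attached code:)

' sketch: dump both leftover lists into one differences file,
' prefixing old-only lines with '--' and new-only lines with '++'
sub DiffToFile(oldLL as POINTER, newLL as POINTER, sFilePath as STRING), INT
	pointer xx
	file iFile
	if OpenFile(iFile, sFilePath, "W") = 0
		for xx = each oldLL as rec
			Write(iFile, "--" + #xx.txt)
		next
		for xx = each newLL as rec
			Write(iFile, "++" + #xx.txt)
		next
		CloseFile(iFile)
		return 1
	endif
	return 0
endsub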

Just thinking, this could be a daily dump of all their customers and their policies.

This exercise could easily be used to generate the net changes each day.

LarryMc

--** removed attachment **--

LarryMc

-DOC-

I stole your data and plugged it into my old and new files.
I inserted a new line in the new data at line #6.
I changed some text on the 'Item 7' line in the new file.
I deleted line #4987 in the old file.

This is what my program spit out:

-- Item 7   781676641420   Amorphis - Tuonela   1432   2068   295766_CD   http://df9.us/c?c=4000006
++ lllll
++ Item 7   ffff   Amorphis - Tuonela   1432   2068   295766_CD   http://df9.us/c?c=4000006
++ Item 4987   000427085901   Dell 3J426 Battery - Dell 3J426 Laptop Battery   488   790   3J426   http://df9.us/c?c=4004996

LarryMc

Brian

Morning, All,

Couldn't get anything done last night, due to family commitments...

Looked at some sample data this morning, and realised that:

Any line in the new file that is an exact duplicate of a line in the old file can be discarded;

As a datum point, there is a policy reference number for each entry, consisting of 9 figures; if it is a completely
new entry, the nine figures have a prefix of "H";

When finished, there should be two files (two listviews?) showing the entry from the first file
alongside the corresponding entry from the second file, matched (paired) by the
policy reference number.
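(Since the actual line layout hasn't been posted, here is a hypothetical helper for that pairing step; it assumes, per Larry's guess above, that the lines are comma delimited with the policy reference as the first field. PolicyRef is an invented name and the whole thing is untested:)

' sketch: pull the pairing key out of a line - the 9-figure policy
' reference - stripping the 'H' prefix that marks a completely new entry.
' ASSUMES the reference is the first comma-delimited field on the line.
sub PolicyRef(sLine as STRING), STRING
	string sKey
	int iPos
	iPos = INSTR(sLine, ",")
	if iPos > 0
		sKey = LEFT$(sLine, iPos - 1)
	else
		sKey = sLine
	endif
	' completely new entries carry an 'H' in front of the nine figures
	if LEFT$(sKey, 1) = "H" then sKey = MID$(sKey, 2, LEN(sKey) - 1)
	return sKey
endsub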

Sorry if I have misled anyone, but I was very uncertain myself of what was needed or
required. I think this is the definitive answer.

Many thanks to all who have responded so far - it makes interesting reading

Brian

LarryMc

Quote from: Brian Pugh on April 26, 2012, 02:18:28 AM
Any line in the new file that is an exact duplicate of a line in the old file can be discarded;
Mine does that
Quote from: Brian Pugh on April 26, 2012, 02:18:28 AM
As a datum point, there is a policy reference number for each entry, consisting of 9 figures; if it is a completely
new entry, the nine figures have a prefix of "H";
I read that as: there will be a line in the new file that is not in the old file; mine catches that.
Quote from: Brian Pugh on April 26, 2012, 02:18:28 AM
When finished, there should be two files (two listviews?) showing the entry from the first file
alongside the corresponding entry from the second file, matched (paired) by the
policy reference number.
There won't be a matching entry in every case, because of the new 'H' ones.
I'm also reading (because you are matching) that there is only one entry per policy ref number in each file. Correct?

LarryMc

Brian

That is correct! I can't wait until I get home to try your code.

If I do eventually read the data into listviews, I will have to have a blank entry on one side
if there is only one distinct line.

Will let you know . . .

Brian

LarryMc

In the last version that I attached, there was a bug that made the program run for what seemed like forever.
A corrected version is attached.
This version compared two 27,000+ line files and wrote the results to a file in a little over 10 seconds.

LarryMc

Brian

Hi, Larry,

Just ran it through here - it took 1 min 30 sec from hitting the compile button to
seeing data on the screen. A bit better than the 3+ hours last night! I think
your PC could be a tad quicker than mine. Methinks I could do with an update.

Now I just have to pretty it up a little, and that could take a while! Will let you
know how I go on. Got to understand your code first, though - this is the first time I have used
a LinkedList.

Your efforts are much appreciated

Brian