Cleaning up CSV files with duplicate records.

Egil · May 25, 2016, 12:21:39 PM

This is posted here because I always use CB for experimenting. It is the fastest and easiest tool for such kind of work I've ever seen. But the described method should be easy to convert to IWB or Aurora.

Today I finally found a free csv file containing data for most cities in the world: https://www.maxmind.com/en/free-world-cities-database

More than three million city names are listed with coordinates, ISO two-letter country codes (the same codes used for top domains on the Internet), and population. But the file is not maintained any more, and contains thousands of duplicates. Therefore I started to search for an effective algorithm for cleaning up such files, and do my own "laundry" software. I found lots of suggestions on how to do it, but most of them were rather complicated, and I did not like any of them.

But I was quite convinced that such file "cleaning" can be done relatively fast with rather simple code, so I first made a routine to read lines sequentially, and compare it with the line before. If the two was equal, only one of them was written to a new file. This resulted in a new file containing almost two houndred thousand rows less (!!!) than the original. But it ran rather slow (more than five minutes).
And still a lot of duplicates appeared, mainly because of typing errors. There were other records in between the duplicates.

Experiments to speed up the routine were started. Many of the remaining duplicates were consecutive lines with typing errors. Got rid of many of them by comparing only the first half of each line. Doing it this way also reduced the computing time to 46 seconds. More than ten times faster than the first attempt.

After a lot of head scratching, and trial and error, I ended up with cleaning the file in three passes, each pass comparing small parts at different parts of each line. This three pass method uses 42 seconds to run on my pc.

The original file contains 3,173,960 rows, which after three passes ended up with a file containing 2,256,525 rows. I am quite convinced that no cities were deleted. But there are still some duplicates left,mainly because the duplicates are not recorded in consecutive order. But now they are in numbers that are easy to deal with if I decide to refine the cleaning process any more.

The code is posted below. Also posted the line counting code plus a screendump of the directory with the original file first.

Have fun!
Egil.

First, the cleaning routines.

Code Select

'
'----------------------------------------------------------------------------------------
' CSVclean.cba - Find and delete duplicate lines in text based database files (CSV)
'----------------------------------------------------------------------------------------
'
autodefine "OFF"

DECLARE Clean(Fname:string)
def filter,cfil:string
def number:int

OPENCONSOLE 


filter = " Files (*.*)|*.*||"
cfil=filerequest("Select file",0,1,filter," ",number)
Clean(cfil)

print:print"DONE!"
print:print "Press any key to exit"
do:until inkey$<>""
CLOSECONSOLE 
END 
'----------------------------------------------------------------------------------------
 


SUB Clean(Fname:string)
'----------------------------------------------------------------------------------------
' Clean file contents and write to new file
'----------------------------------------------------------------------------------------
def infile,outfile:FILE 
def tmpfil,newfil,oldtext,newtext,a$,b$:STRING 
def i,cnt:int
oldtext=""

'Pass one:
	print:print"Please wait"
	PRINT "Cleaning - Pass one....."
   OPENFILE(outfile,"clean1.txt","W")
      IF(OPENFILE(infile,Fname,"R") = 0) 
         WHILE EOF(infile) = 0
            i = i+1
            READ(infile,newtext)
			if left$(newtext,13) <> left$(oldtext,13)
				'Writing filtered text to a new file:
            WRITE outfile,newtext
				oldtext=newtext
            newtext=""
			endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

'Pass two:
	PRINT "Cleaning - Pass two....."
   OPENFILE(outfile,"clean2.txt","W")
      IF(OPENFILE(infile,"clean1.txt","R") = 0) 
         WHILE EOF(infile) = 0
            i = i+1
            READ(infile,newtext)
			if right$(newtext,13) <> right$(oldtext,13)
				'Writing filtered text to a new file:
            WRITE outfile,newtext
				oldtext=newtext
            newtext=""
			endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

'Pass three:
	PRINT "Cleaning - Pass three..."
   OPENFILE(outfile,"clean3.txt","W")
      IF(OPENFILE(infile,"clean2.txt","R") = 0) 
         WHILE EOF(infile) = 0
            i = i+1
            READ(infile,newtext)
			if mid$(newtext,13,10) <> mid$(oldtext,13,10)
				'Writing filtered text to a new file:
            WRITE outfile,newtext
				oldtext=newtext
            newtext=""
			endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile
RETURN

Then the line counting code.

Code Select

'
'----------------------------------------------------------------------------------------
' CountLines.cba
'----------------------------------------------------------------------------------------
'
DECLARE  LineCount(fname:string)
def count:int
def infil$,filter:string

OPENCONSOLE

filter = "Creative Basic Files (*.*)|*.*||"
infil$ = filerequest("Select file",0,1,filter," ")

	count = LineCount(infil$)
	PRINT "Number of lines in "+infil$+": ",count

do:until inkey$<>""
CLOSECONSOLE
end
'----------------------------------------------------------------------------------------

'
SUB LineCount(fname:string)
'----------------------------------------------------------------------------------------
' Counting number of lines in a text file.
' Call with filename 
' Returns number of lines
'----------------------------------------------------------------------------------------
def ifile:file
def lines:int
def ln:string
	IF(OPENFILE(ifile,fname,"R") = 0)
		DO 
			Read(ifile,ln)
			lines = lines + 1
		UNTIL EOF(ifile)
		CLOSEFILE ifile
	ENDIF
RETURN lines

Egil · May 26, 2016, 01:02:28 PM

Talked to a friend of mine tonight. He is maintaining similar databases at his job. But he has never experienced so many errors in one of their files, fortunately.
He gave some good advice: Start cleaning by comparing the positions and delete the duplicates found. Then most of the remaining duplicates are results of the fact that many countries have several official languages, and many city names spell differently in each language. He means that making functions to decide what language version I finally use is "a piece of cake". But first I have to decide if I want to use international names or local names.

If I get some spare time, I'll look into this tomorrow...

Egil

Egil · May 30, 2016, 03:14:01 AM

This morning I continued to work on routines cleaning up the mentioned Cities database.
Took a closer look on the norwegian records, and found that most of the duplicates ic because of multiple records with different numbers in the population field. The same for the Sone field. And I do not quite know what sones they are talking about, as none of the recorded Sone numbers for Norway correspond with postal codes or telephone area codes. In addition, in many records the citynames are misspelled, whithout deleting the wrong one when correcting. One small place was listed not less than six times with six different spellings, and none of them, exept the position, were correct. And one of the records was even for Kristiania. The name Kristiania was changed back to Oslo in 1925. And before we got postal zip codes here, most places with names like, or similar sounding, to other places, had the county name added to the city names.
But this way of naming the places became history when the authorities decided to use zip codes instead.
If cases like this are true also for other nations, I wonder if it is worth the effort trying to clean this database.

Anyway, I did follow the advice of my friend, and made subroutines comparing Country Codes and City Name fields with the successive line, and only write the records to a "clelaned" file if they not duplicated the last (or former) line. The same was done for the Latitude and Longitude fields. But even if I got rid of thousands of duplicates, There still were thousands left, mainly because of the Population and Sone fields, as satated above. There fore I made a subroutine for deleting these two fields. (writing a new file whithout these fields, retaining the original).

The original file has 3.173.960 records. And making a new file without the Sone and Population fields took five minutes and twelve seconds.
Then, by comparing the position fields and omitting duplicates, the new file was reduced to 2.780.338 records. But This operation took around sixteen minutes to complete.
Doing the same for the Country and Name fields, using the already "cleaned" file for input, the file was reduced to 2.772.318 records. and took another 16 minutes to complete.

The pc worked for 38 minutes, and by inspecting random places in the resulting file, there are still plenty of duplicates. Mainly because of missplellings where the wrong versions were not deleted when correcting.
This is a far cry from the 48 seconds consumed in my first try last week. And the resulting final number of records differed only slightly.

So what I ended up with, is a program that first remove the Sone and Population fields, then use the twelve rightmost characters in the record (longitudes plus part of the latitudes) to discover and remove duplicates, doing the same on the 20 leftmost characters (Country Codes and City Names.... well, roughly). And finally another run on the rightmost characters of the records.
This resulted in a program that took around six minutes to run, and the final number of records is 2.6625.392. Using the left$ and right$ string functions for testing, appeared to be very fast. Around 15 second each run.
But there are still duplicates in there, mainly because of different ways to write the citynames. This is espepcially true for countries having several official languages. And for Norway, some very old names, no longer in use, were in there. So maybe I shall continue to search for a better database.

I still have a couple of ideas to get rid of the remaining duplicate records, so will not post the code until these ideas are tried. But not till tomorrow.

Have fun!

Egil.

Egil · May 30, 2016, 10:03:04 AM

Seems like young eyes see better than my old ones...
My grandson came along and saw what I was doing, and after looking at the site where I downloaded that database, he wondered why I did not use the other free cities database they offered.
The cities database found here: http://dev.maxmind.com/geoip/legacy/geolite/#Downloads has more than 700.000 records, is free to use, and is updated once a month.

Egil

Bill-Bo · May 30, 2016, 04:19:34 PM

Egil,

You indicate the database is updated once a month. The file date is 05/03/2016. Why is the copyright 2012?

I like the big one because there is more decimal places in the Lat/Long.

Bill

Egil · May 31, 2016, 01:30:39 AM

Hi Bill,

This is what they state on the download webpage ( http://dev.maxmind.com/geoip/legacy/geolite/#Downloads ):

QuoteDatabases

IP Geolocation
The GeoLite databases are our free IP geolocation databases. They are updated on the first Tuesday of each month. These databases are offered in the same binary and csv formats as our subscription databases. Any code which can read the subscription databases can also read the GeoLite databases.

So I'm unable to answer your question.
I like the big one better too, and for the same reason. But since there are so many misspellings and errors, I am not sure how useful it will be.

On the other hand, this "cleaning" exercise has been a real eye-opener for me. Especially the fact that working with specific fields in an attempt to get rid of duplicates took 16 minutes, while just comparing the lines using left$ and right$ took only 14 seconds. That's 60 times faster! And both methods gave the same result.
So today I'll step through my code, just to see that I did not do anything fundamentally wrong. I'll keep you posted..

Egil

Egil · May 31, 2016, 03:33:59 AM

Been checking my code again. Cant find anything wrong, but not sure if I'm thinking right...
Here is the code:

Code Select

'
'----------------------------------------------------------------------------------------
' CSV_clean.cba - Find and delete duplicate lines in text based database files (CSV)
'----------------------------------------------------------------------------------------
'
autodefine "OFF"

DECLARE Clean(Fname:string)
DECLARE RemoveFields(Fname:string)

def filter,cfil:string
def number:int

OPENCONSOLE 

filter = " Files (*.*)|*.*||"
cfil=filerequest("Select file",0,1,filter," ",number)
RemoveFields(cfil)
Clean(cfil) 

print:print "Cleaning DONE........... "+time$
print:print "Press any key to exit"
do:until inkey$<>""
CLOSECONSOLE 
END 
'----------------------------------------------------------------------------------------
 
'
SUB RemoveFields(Fname:string)
'----------------------------------------------------------------------------------------
'Remove fields 4 and 5 (Zone and Population)
'----------------------------------------------------------------------------------------
def infile,outfile:FILE 
def newtext,outtext,tmptxt:STRING 
def i,j,left,right:int
left=0:right=0:j=0
	PRINT "Removing field 4 and 5.. "+time$
    OPENFILE(outfile,"Pass0.txt","W")
      IF(OPENFILE(infile,Fname,"R") = 0) 
         WHILE EOF(infile) = 0
            READ(infile,newtext)
					for i = 1 to len(newtext)
						if mid$(newtext,i,1)="," 	:'counting commas
							j=j+1
							if j=3 then left=i		:'find comma #3 position in line
							if j=5 then right=i+1	:'find comma #5 position in line
						endif
					next i
					outtext = left$(newtext,left) 
					outtext = outtext + right$(newtext,i-right)
					if outtext <> "" then WRITE outfile,outtext
					outtext=""
					j=0
					left=0
					right=0
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

RETURN



SUB Clean(Fname:string)
'----------------------------------------------------------------------------------------
' Clean file contents and write to new file
' (Comparing "random" parts of fields)
'----------------------------------------------------------------------------------------
def infile,outfile:FILE 
def newtext,oldtext,test$:STRING 
'def i,cnt:int
oldtext=""
newtext=""
print  

'Pass 1:
	PRINT "Cleaning - Pass ONE..... "+time$
	oldtext = ""
   OPENFILE(outfile,"Pass1.txt","W")
      IF(OPENFILE(infile,"Pass0.txt","R") = 0) 
         WHILE EOF(infile) = 0
            READ(infile,newtext)
				
				'Compare Positions:
				test$=right$(newtext,12)

				if test$ <> oldtext						:' Compare with former line loaded
					'Write line to a new file when lines do not match:
            	WRITE outfile,newtext				
					oldtext = test$
            	test$ = ""
				endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

'Pass 2:
	PRINT "Cleaning - Pass TWO..... "+time$
   OPENFILE(outfile,"Pass2.txt","W")
      IF(OPENFILE(infile,"Pass1.txt","R") = 0) 
         WHILE EOF(infile) = 0
            READ(infile,newtext)
				test$=left$(newtext,20)
				if test$ <> oldtext						:' Compare with last line loaded
            	WRITE outfile,newtext				:'Write to a new file if lines do not match
					oldtext = test$
            	test$ = ""
				endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

'Pass 3:
	PRINT "Cleaning - Pass THREE... "+time$
	oldtext = ""
   OPENFILE(outfile,"worldcitiespop_CLEANED.txt","W")
      IF(OPENFILE(infile,"Pass2.txt","R") = 0) 
         WHILE EOF(infile) = 0
            READ(infile,newtext)
				test$=right$(newtext,12)
				if test$ <> oldtext						:'Compare with last line loaded
            	WRITE outfile,newtext				:'Write to a new file if lines do not match
					oldtext = test$
            	test$ = ""
				endif
         ENDWHILE
       CLOSEFILE infile
      ENDIF
   CLOSEFILE outfile

RETURN

Also attach screendumps of running program showing start and endtimes for the routines, and one showing the directory to show how much smaller the final file becomes.

For the radiohams in here: Seems like the 50MHz band has a Sporadic-E opening. The signals are just too weak to be useful here. At the moment I am too far north, but things change fast, so Ill be standing by to see if the local conditions improve. No more coding today.

Egil

IonicWind Software

News: