Matching Up
Although I am not quite adept at data analysis or even close to becoming a data scientist, I decided to at least apply for internships so that I could get real-world experience. While searching, I came across an internship posting, and towards the bottom of the page it read:
"Simple assignment to see if you actually read this. Take this dataset of corporations in New York and left join onto it this dataset of companies that are owned by Minorities/Women. You can do it for a specific year, use fuzzy matching on business name, address, or your own business logic."
This was the perfect opportunity to learn more about FuzzyWuzzy.
The first data set provided had more than 2.5 million rows and the second had more than nine thousand. This was the first time I had worked with a data file larger than a couple of thousand rows.
I realized that it would take a couple of hours to fuzzy merge the full data set, so I chose to filter the data by year; I picked 2018 for no particular reason.
Even converting the dates took a while; in hindsight, I should have reduced the data a little more first and then turned the dates into a datetime object.
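To make that step concrete, here is a rough sketch of the filter. The file names and the Initial DOS Filing Date column are assumptions for illustration; the actual dataset headers may differ.

```python
# A minimal sketch of the year filter, assuming hypothetical file names
# and an "Initial DOS Filing Date" column.
import pandas as pd

nycorp = pd.read_csv("ny_corporations.csv")          # ~2.5 million rows
mwbe = pd.read_csv("mwbe_certified_businesses.csv")  # ~9 thousand rows

# Parse the filing date once; in hindsight, shrinking the data first
# would have made this conversion much faster.
nycorp["Initial DOS Filing Date"] = pd.to_datetime(
    nycorp["Initial DOS Filing Date"], errors="coerce"
)
nycorp_2018 = nycorp[nycorp["Initial DOS Filing Date"].dt.year == 2018]
```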
Because I am trying to merge the data sets by address, I looked into the address columns of both. It was confusing, and there were lots of missing values. Still, I decided to concatenate address 1 and address 2 in each data set into a whole new column, creating a full address.
I also made sure to apply the upper method to mwbe's address and vendor name columns, so that they would be consistent with the nycorp dataframe.
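A sketch of how the full-address columns can be built; the exact column names here (DOS Process Address 1/2, Address Line 1/2, Vendor Name) are assumptions based on the post.

```python
# Concatenate the two address parts into one full-address column for each
# data set, treating missing values as empty strings.
nycorp_2018 = nycorp_2018.copy()
nycorp_2018["DOS Address"] = (
    nycorp_2018["DOS Process Address 1"].fillna("").str.strip()
    + " "
    + nycorp_2018["DOS Process Address 2"].fillna("").str.strip()
).str.strip()

mwbe["MWBE Address"] = (
    mwbe["Address Line 1"].fillna("").str.strip()
    + " "
    + mwbe["Address Line 2"].fillna("").str.strip()
).str.strip()

# Upper-case the mwbe fields so they compare cleanly against nycorp.
mwbe["MWBE Address"] = mwbe["MWBE Address"].str.upper()
mwbe["Vendor Name"] = mwbe["Vendor Name"].str.upper()
```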
Notice that for indexes 45 and 111, DOS Address = DOS Process Address 1 + DOS Process Address 2.
Both data sets are now significantly reduced: nycorp is down to 199,544 rows and mwbe to 127 rows.
I used a for loop to iterate over the values in DOS Address and MWBE Address and built a new list of the addresses and their ratio values. Doing this, however, took very long, almost 10 minutes; I need to look into a faster way to fuzzy match data.
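The loop looked roughly like this, assuming fuzzywuzzy is installed (pip install fuzzywuzzy). With 127 × ~200,000 pairwise comparisons, the quadratic cost explains the ten minutes.

```python
# For each mwbe address, scan every nycorp address and keep the best
# fuzz.ratio score. This brute-force pass is what takes ~10 minutes.
from fuzzywuzzy import fuzz

matches = []
for mwbe_addr in mwbe["MWBE Address"]:
    best_ratio, best_addr = 0, None
    for dos_addr in nycorp_2018["DOS Address"]:
        ratio = fuzz.ratio(mwbe_addr, dos_addr)
        if ratio > best_ratio:
            best_ratio, best_addr = ratio, dos_addr
    matches.append((mwbe_addr, best_addr, best_ratio))
```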
After getting the matched values, I took the list and appended the addresses to their respective data sets.
I realized that the fuzzy matching was not 100% accurate: some address pairs gave a ratio of 100 even though the strings were not actually equal.
Notice how the DOS Address column and the Similar column are aligned, yet not all of the values are equal. So I also fuzzy matched the entity names to get better accuracy, as sketched below.
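Continuing from the earlier sketches, the second pass might look like this; the Entity Name column and the 90 threshold are assumptions for illustration, not the exact values I used.

```python
# Attach each best-match address and its score back onto the mwbe frame.
mwbe["Similar"] = [addr for _, addr, _ in matches]
mwbe["addr_ratio"] = [ratio for _, _, ratio in matches]

# Pull in the corporation's name for the matched address, then score the
# entity/vendor names and keep only strong matches on both fields.
merged = mwbe.merge(
    nycorp_2018[["DOS Address", "Entity Name"]],
    left_on="Similar", right_on="DOS Address", how="left",
)
merged["name_ratio"] = [
    fuzz.ratio(str(a), str(b))
    for a, b in zip(merged["Entity Name"], merged["Vendor Name"])
]
confirmed = merged[(merged["addr_ratio"] >= 90) & (merged["name_ratio"] >= 90)]
```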
Now, out of the 91 rows with similar addresses, only 14 had both the same address and the same entity/vendor name.
Conclusion:
Although I merged the two data sets, I only got 14 matches when I set the year to 2018. This seems odd, as I expected many more matches; I had assumed that every business in the mwbe data set would appear in the nycorp data set. The shortfall could be due to many factors, such as vendors entering their addresses in the wrong section/column. In the future, I could change the type of fuzzy matching or adjust the ratio threshold, and there are likely other ways to fix this problem. I could also look into a faster way to fuzzy merge; maybe I can try the TF-IDF method.
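As a sketch of that idea (not something I ran here), scikit-learn's TF-IDF vectorizer over character n-grams plus a nearest-neighbour search can replace the pairwise fuzz.ratio loop; the column names are the same assumed ones as above.

```python
# Vectorise the addresses as character n-grams, then find each mwbe
# address's closest nycorp address by cosine distance in one shot.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(nycorp_2018["DOS Address"])

nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(tfidf)
distances, indices = nn.kneighbors(vectorizer.transform(mwbe["MWBE Address"]))
# Smaller distance means a closer match; threshold it, then verify names.
```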