Ad
Comparing 2 Huge (5-6 GB) Csv Files And Count The Number Of Matching And Unmatched No. Of Rows
There are 2 huge (5-6 GB) each csv files. Now the objective is to compare both these files. how many rows are matching and how many rows are not matching?
Lets say file1.csv contains 5 similar lines, we need to count it as 1 but not 5. Similarly, for file2.csv if there are redundant data, we need to count it as 1.
I expect the output to display the number of rows that are matching and the no. of rows that are different.
Ad
Answer
I have written a file comparer in python that can optimally compare huge files and get matching lines count and different lines count. Replace the input_file1 and input_file2 with your 2 large files and run it. Let me know the results.
input_file1 = r'input_file.txt'
input_file2 = r'input_file.1.txt'
__author__ = 'https://github.com/praveen-kumar-rr'
# Simple Memory Efficient high perfomance file comparer.
# Can be used to efficiently compare large files.
# Alogrithm:
# Hashes the lines and compared first.
# Non matching lines are picked as different count.
# All the matching lines are taken and the exact lines are read from file
# These strings undergo same comparison process based on string itself
def accumulate_index(values):
'''
Returns dict like key: [indexes]
'''
result = {}
for i, v in enumerate(values):
indexes = result.get(v, [])
result[v] = indexes + [i]
return result
def get_lines(fp, line_numbers):
'''
Reads lines from the file pointer based on the lines_numbers list of indexes
'''
return (v for i, v in enumerate(fp) if i in line_numbers)
def get_match_diff(left, right):
'''
Compares the left and right iterables and returns the matching and different items
'''
left_set = set(left)
right_set = set(right)
return left_set ^ right_set, left_set & right_set
if __name__ == '__main__':
# Gets hashes of all lines for both files
dict1 = accumulate_index(map(hash, open(input_file1)))
dict2 = accumulate_index(map(hash, open(input_file2)))
diff_hashes, matching_hashes = get_match_diff(
dict1.keys(), dict2.keys())
diff_lines_count = len(diff_hashes)
matching_lines_count = 0
for h in matching_hashes:
with open(input_file1) as fp1, open(input_file2) as fp2:
left_lines = get_lines(fp1, dict1[h])
right_lines = get_lines(fp2, dict2[h])
d, m = get_match_diff(left_lines, right_lines)
diff_lines_count += len(d)
matching_lines_count += len(m)
print('Total number of matching lines is : ', matching_lines_count)
print('Total number of different lines is : ', diff_lines_count)
Ad
source: stackoverflow.com
Related Questions
- → What are the pluses/minuses of different ways to configure GPIOs on the Beaglebone Black?
- → Django, code inside <script> tag doesn't work in a template
- → React - Django webpack config with dynamic 'output'
- → GAE Python app - Does URL matter for SEO?
- → Put a Rendered Django Template in Json along with some other items
- → session disappears when request is sent from fetch
- → Python Shopify API output formatted datetime string in django template
- → Can't turn off Javascript using Selenium
- → WebDriver click() vs JavaScript click()
- → Shopify app: adding a new shipping address via webhook
- → Shopify + Python library: how to create new shipping address
- → shopify python api: how do add new assets to published theme?
- → Access 'HTTP_X_SHOPIFY_SHOP_API_CALL_LIMIT' with Python Shopify Module
Ad