Merging and Filtering CSV files using Linux

Posted on Wed 03 February 2021 in Linux

I recently ran into a problem after downloading multiple CSV files. I wanted to firstly merge all those files and then extract/filter data from that merged file. Doing this is possible using code, but Linux has a bunch of powerful tools that can also handle such tasks and can do them very quickly(in comparison to Python).

Let us say we have a bunch of CSV files labelled like so: data_1.csv, data_2.csv and so on

Merging

In order to merge[1] all these CSV files, we can use sed:

sed 1d data-*.csv > merged.csv

The command collects all the data- files and after merging them, saves them to merged.csv. From the reference article I learned that if you used *.csv it may also merge the merged.csv file and create duplicates and other issues.

Filtering - exact match

In order to find an exact entry in the CSV file, we can use awk[2]:

awk -F"," '$1==somestring' merged.csv

The "," tells awk to view the comma as a separator. The $1 says to look in the first column. somestring is the exact match we are looking for in column 1. merged.csv is the file we want to look into for the value.

Filtering - partial strings

My main filtering goal was to use partial strings to find certain values that match the partial string. Again we will use awk[3]:

awk 'index($1, "partialstring")' merged.csv

Similar to the above, $1 looks in column 1 for the "partialstring" string. In both the awk commands you can also save the output to a file like so:

awk 'index($1, "partialstring")' merged.csv > output.txt

I hope these 3 commands help you save a lot of time if you are attempting to do anything with CSV files.

Sources:

1 - https://blog.softhints.com/how-to-merge-multiple-csv-files-with-python/

2 - https://stackoverflow.com/questions/2373885/searching-a-csv-file-using-grep

3 - https://stackoverflow.com/questions/17001849/awk-partly-string-match-if-column-word-partly-matches