Text Processing on Linux using grep, awk, and sed

When it comes to text processing on Linux, every user should be familiar with three essential tools: grep, awk, and sed. This article focuses on helping you understand how to think about these tools, walking through common scenarios with examples of how to use each one effectively in your daily text-processing tasks.

grep: Finding Patterns in Text

'grep' is a text-hunting tool that searches through lines of text, helping you track down the words, phrases, or patterns you're looking for. The name grep comes from the ed editor command g/re/p ("globally search for a regular expression and print matching lines"), which reflects its roots in regular-expression search.

Some of the most commonly used grep options:

  • -i: Ignore case when searching

  • -r: Recursively search directories

  • -l: Print only the names of matching files

  • -v: Print lines that do not match the pattern

  • -c: Count the number of matches

These options provide a lot of flexibility and control when using grep, allowing you to customize your searches to suit your specific needs. You can combine multiple options in a single grep command to achieve complex search scenarios.
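As a quick sketch of combining options, the following pairs -i with -c to count matching lines regardless of case. The notes.txt file and its contents here are hypothetical, created just for this example:

```shell
# Create a small hypothetical sample file for this example
printf 'TODO: fix parser\ntodo: update docs\nDone: ship release\n' > notes.txt

# Combine -i (ignore case) and -c (count matching lines):
# both the "TODO" and "todo" lines match
grep -ic "todo" notes.txt
```

This prints 2, since two of the three lines contain "todo" in some case.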

Let's understand the use-cases of grep using the inventory.txt file which contains a list of fruit and vegetable items, their quantities, and prices.

➜  ~ cat inventory.txt
Apple,10,$2.00
Orange,5,$1.50
Grapes,2 pounds,$3.99
Strawberries,1 pound,$4.99
Blueberries,1 pint,$3.99
Spinach,1 bunch,$2.99

The following grep command displays the line in the file that contains the word "apple", where the -i option is used to make the search case-insensitive:

➜  ~ grep -i "apple" inventory.txt
Apple,10,$2.00

The following grep -c command counts the lines that contain the string "pound". Keep in mind that -c counts matching lines, not individual occurrences, and that the pattern matches anywhere within a word, so "pounds" matches too. The result shows that 2 lines in the file match.

➜  ~ grep -c "pound" inventory.txt
2
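The -v option from the list above inverts the match. The snippet below recreates inventory.txt so it runs on its own, then prints every line that does not contain "pound":

```shell
# Recreate the inventory.txt from the article so this runs standalone
printf 'Apple,10,$2.00\nOrange,5,$1.50\nGrapes,2 pounds,$3.99\nStrawberries,1 pound,$4.99\nBlueberries,1 pint,$3.99\nSpinach,1 bunch,$2.99\n' > inventory.txt

# -v inverts the match: print lines that do NOT contain "pound"
grep -v "pound" inventory.txt
```

Four lines are printed (Apple, Orange, Blueberries, and Spinach), i.e. everything except the two lines that -c counted above.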

awk: Extracting and Processing Columns

'awk' is a powerful tool for processing and manipulating text data, especially data organized in rows and columns. It allows you to extract specific columns, perform calculations, and even join data from multiple files. While its syntax may seem complex at first, it is quite consistent and logical, and with some practice, you can use it to perform a wide range of tasks. Let's explore some examples to get a better understanding of how awk works using the inventory.txt file.

We can use the awk command to extract and print specific columns from a file. The following command prints the names of the items in the inventory, extracted from the first column (represented by the variable $1) of each line in the file, using the comma (,) as the field separator (-F ',').

➜  ~ awk -F ',' '{print $1}' inventory.txt
Apple
Orange
Grapes
Strawberries
Blueberries
Spinach

In addition to printing columns, awk can perform calculations on the data in those columns. Let's say we need to calculate the total quantity of apples and oranges by adding the quantities (2nd column) of the first two rows in the inventory.txt file. NR <= 2 is a condition that selects the first two rows, where NR is the current row number. {total += $2} is an action that adds the second field of each selected row to a variable called total. END {print total} prints the final value of total after all rows have been processed:

➜  ~ awk -F ',' 'NR <= 2 {total += $2} END {print total}' inventory.txt
15

awk is also used for performing conditional logic on data. Suppose we need to extract the items from the inventory that cost more than $3. The following command extracts the numeric part of the third column (excluding the dollar sign), converts it to a number, and checks if it's greater than 3. If true, it prints the first and third columns of that line. The output displays the item names and their prices.

➜  ~ awk -F, '{ if (substr($3, 2) + 0 > 3) print $1,$3 }' inventory.txt
Grapes $3.99
Strawberries $4.99
Blueberries $3.99
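Building on the substr trick above, here is a sketch that averages the prices across all rows; the two-decimal printf format is just a presentation choice, and the file is recreated so the snippet is self-contained:

```shell
# Recreate inventory.txt so the example runs standalone
printf 'Apple,10,$2.00\nOrange,5,$1.50\nGrapes,2 pounds,$3.99\nStrawberries,1 pound,$4.99\nBlueberries,1 pint,$3.99\nSpinach,1 bunch,$2.99\n' > inventory.txt

# Strip the leading "$" with substr, accumulate the sum, then average in END
awk -F ',' '{ sum += substr($3, 2); n++ } END { printf "%.2f\n", sum / n }' inventory.txt
```

This prints 3.24, the average of the six prices.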

sed: Performing Text Replacement

'sed' is a stream editor that can be used to perform various text manipulations on input streams. It's commonly used to perform search and replace operations on text data, but it can also be used for more complex text transformations. Here are some examples of how you can use sed to perform operations:

One of the most common uses of sed is to replace one string with another. The following command replaces the word "apples" with "bananas" in the input string using sed's substitution command, s.

➜  ~ echo "Add apples in the inventory, apples are missing" | sed 's/apples/bananas/'
Add bananas in the inventory, apples are missing

In the above example, only the first occurrence of "apples" was replaced with "bananas" because, by default, the s command in sed only replaces the first occurrence of the target string in each line.

To replace all occurrences of "apples" in each line, you can use the g flag (which stands for "global") in the s command. Here's an updated version of the previous example that uses the g flag:

➜  ~ echo "Add apples in the inventory, apples are missing" | sed 's/apples/bananas/g'
Add bananas in the inventory, bananas are missing

This command will replace all occurrences of "apples" with "bananas" in the input string.
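Substitution is not sed's only command. As one more sketch, the d command deletes matching lines from the output; here it drops every inventory line that mentions "pound" (the file is recreated so the snippet runs on its own):

```shell
# Recreate inventory.txt so the example runs standalone
printf 'Apple,10,$2.00\nOrange,5,$1.50\nGrapes,2 pounds,$3.99\nStrawberries,1 pound,$4.99\nBlueberries,1 pint,$3.99\nSpinach,1 bunch,$2.99\n' > inventory.txt

# /pound/d deletes every line matching "pound" from the output
sed '/pound/d' inventory.txt
```

Note that without -i this only affects what sed prints; the file itself is left unchanged.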

In the inventory.txt file, suppose we want to change the price of "Apple" from "2.00" to "2.50". We can first preview the change by combining an address with the -n option and the p flag, which together print only the modified line:

➜  ~ sed -n '/Apple/s/2.00/2.50/p' inventory.txt
Apple,10,$2.50

/Apple/: An address that restricts the substitution to lines matching "Apple".

-n: Suppresses sed's default output, so only lines printed explicitly with p appear.

p: A flag on the s command that prints the line whenever a substitution is performed.

Once the result looks right, the -i option tells sed to edit the file in place, modifying inventory.txt directly:

➜  ~ sed -i '/Apple/s/2.00/2.50/' inventory.txt

Note that the p flag is dropped when editing in place: with -i, sed's output goes to the file, so combining -i with p would write the matched line into the file twice.

Conclusion

grep, awk, and sed are powerful text-processing tools on Unix-like systems, and a solid understanding of their capabilities is essential for anyone working with text data. The three tools can also be combined in a pipeline to perform complex text-processing tasks. For example, you can use grep to filter a file and extract specific lines, then use awk to process those lines and perform calculations or other transformations, and finally use sed to perform additional text manipulations or substitutions.
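As a sketch of such a pipeline on inventory.txt (the choice of pattern and fields is illustrative): grep selects the berry items, awk pulls out the name and price, and sed strips the dollar sign:

```shell
# Recreate inventory.txt so the pipeline runs standalone
printf 'Apple,10,$2.00\nOrange,5,$1.50\nGrapes,2 pounds,$3.99\nStrawberries,1 pound,$4.99\nBlueberries,1 pint,$3.99\nSpinach,1 bunch,$2.99\n' > inventory.txt

# filter lines -> extract columns -> transform text
grep -i "berries" inventory.txt | awk -F ',' '{ print $1, $3 }' | sed 's/\$//'
```

This prints "Strawberries 4.99" and "Blueberries 3.99", one per line.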