awk

Reference

https://blog.robertelder.org/intro-to-awk-command/

A sample exercise

Problem statement

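Convert every temperature in the file to Celsius.

The file: temps.csv (despite the extension, the columns are separated by white space, which is awk's default field separator)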

temp unit
26.1 C
78.1 F
23.1 C
25.7 C
76.3 F
77.3 F
24.2 C
79.3 F
27.9 C
75.1 F
25.9 C
79.0 F

Sample command to show how awk refers to the different columns

awk '{print "First column item: " $1 " Second column item: " $2 }' temps.csv

Output:

First column item: temp Second column item: unit
First column item: 26.1 Second column item: C
First column item: 78.1 Second column item: F
First column item: 23.1 Second column item: C
First column item: 25.7 Second column item: C
First column item: 76.3 Second column item: F
First column item: 77.3 Second column item: F
First column item: 24.2 Second column item: C
First column item: 79.3 Second column item: F
First column item: 27.9 Second column item: C
First column item: 75.1 Second column item: F
First column item: 25.9 Second column item: C
First column item: 79.0 Second column item: F

So, essentially, $1 represents the first column and $2 represents the second column.

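NR = the number of records (by default, lines) read so far, i.e. the current line number. The pattern NR>1 means: skip the first line (the header) and run the action on every other line.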
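To see what NR holds, here is a minimal sketch that prints the record number in front of each line and then skips the first record. The three-line input is a made-up stand-in for temps.csv, and printf is used instead of echo -e only because it behaves the same in every shell.

```shell
# Made-up three-line input standing in for temps.csv.
numbered=$(printf 'temp unit\n26.1 C\n78.1 F\n' | awk '{print NR ": " $0}')
echo "$numbered"
# 1: temp unit
# 2: 26.1 C
# 3: 78.1 F

# NR>1 is false for the header, so only the data lines reach the action.
without_header=$(printf 'temp unit\n26.1 C\n78.1 F\n' | awk 'NR>1 {print $0}')
echo "$without_header"
# 26.1 C
# 78.1 F
```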

Attempt 1

awk 'NR>1 {print ($1-32) / 1.8 }' temps.csv

Output:

-3.27778
25.6111
-4.94444
-3.5
24.6111
25.1667
-4.33333
26.2778
-2.27778
23.9444
-3.38889
26.1111

This converts every row, including the ones that are already in Celsius. We need a condition that checks whether the second column is F (Fahrenheit).
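The negative numbers in that output come from applying the Fahrenheit formula to readings that are already in Celsius. A quick check with two rows taken from the file (the variable names are just for this sketch):

```shell
# 78.1 F is a real Fahrenheit reading, so the formula gives a sensible value.
fahrenheit_row=$(echo "78.1 F" | awk '{print ($1-32) / 1.8}')

# 26.1 C is already Celsius; pushing it through the same formula gives nonsense.
celsius_row=$(echo "26.1 C" | awk '{print ($1-32) / 1.8}')

echo "$fahrenheit_row $celsius_row"
# 25.6111 -3.27778
```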

Attempt 2: Using the ternary operator and printing the Celsius unit in the output.

awk 'NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv

Output:

26.1    C
25.6111 C
23.1    C
25.7    C
24.6111 C
25.1667 C
24.2    C
26.2778 C
27.9    C
23.9444 C
25.9    C
26.1111 C

Good. But it is not printing the header row.

Attempt 3

awk 'NR==1; NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv

Output:

temp unit
26.1    C
25.6111 C
23.1    C
25.7    C
24.6111 C
25.1667 C
24.2    C
26.2778 C
27.9    C
23.9444 C
25.9    C
26.1111 C

NR==1; is a pattern with no action, so awk falls back to its default action and prints the header row unchanged. The semicolon separates it from the NR>1 rule, which handles the data rows.
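To check that the bare NR==1; really is shorthand for "print this line", the sketch below runs the Attempt 3 program next to a version with the action written out and compares the results. The two-line sample is made up for illustration.

```shell
# Two-line stand-in for temps.csv.
sample='temp unit
78.1 F'

# Implicit form from Attempt 3: NR==1 has no action, so awk prints the line.
implicit=$(printf '%s\n' "$sample" | awk 'NR==1; NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}')

# The same program with the default action spelled out.
explicit=$(printf '%s\n' "$sample" | awk 'NR==1{print $0} NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}')

[ "$implicit" = "$explicit" ] && echo "identical output"
```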

Attempt 4: Use a formatted print (printf) to show only one digit after the decimal point.

awk 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv

Output:

temp unit
26.1    C
25.6    C
23.1    C
25.7    C
24.6    C
25.2    C
24.2    C
26.3    C
27.9    C
23.9    C
25.9    C
26.1    C
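Note that %.1f rounds to the nearest tenth rather than cutting digits off, which is why 25.1667 shows up as 25.2 above. A small sketch using two of the converted values from Attempt 3:

```shell
# %.1f rounds: 25.1667 becomes 25.2, 23.9444 becomes 23.9.
rounded=$(echo "25.1667 23.9444" | awk '{printf("%.1f\t%.1f\n", $1, $2)}')
echo "$rounded"
# 25.2    23.9   (separated by a tab)
```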

Different separators

Same operation, but using a comma as the separator instead of white space (this assumes the columns in the file are actually comma-separated):

awk -F',' 'NR==1; NR>1{printf("%.1f,%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv

Same operation, but using a tab as the separator instead of white space (this assumes the columns in the file are actually tab-separated):

awk -F'\t' 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
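The -F commands only make sense on a file that really uses that delimiter. Here is a sketch with a hypothetical comma-separated copy of the data; the file name temps_comma.csv and the temporary directory are assumptions made for this example.

```shell
# Hypothetical comma-separated version of the data, written to a temp directory.
tmpdir=$(mktemp -d)
printf 'temp,unit\n26.1,C\n78.1,F\n' > "$tmpdir/temps_comma.csv"

# -F',' tells awk to split each line on commas instead of white space.
converted=$(awk -F',' 'NR==1; NR>1{printf("%.1f,%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' "$tmpdir/temps_comma.csv")
echo "$converted"
# temp,unit
# 26.1,C
# 25.6,C

rm -r "$tmpdir"
```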

awk language

Awk is extremely easy to learn. It only seems hard because awk has so many implicit defaults, and this is rarely explained.

You can think of every awk command as a collection of ‘if statements’ that run against every line in the file. The syntax of every awk command looks pretty close to something like this:

awk 'if(PATTERN1){...print something...} if(PATTERN2){...print something...} ...'

with the one exception being that the ‘if’ keyword is never actually written out, since it’s assumed to be there by default (if you do write it, you’ll get a syntax error). Therefore, the overall syntax for every awk command (that doesn’t rely on defaults) is pretty much this:

awk '(PATTERN1){...Action 1...} (PATTERN2){...Action 2...} ...'
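A small sketch of that shape with two rules and made-up input. Every rule is tested against every line, so one line can trigger more than one action:

```shell
# Rule 1 fires on the first line; rule 2 fires on any line containing an o.
rules_output=$(printf 'hello\nworld\n' | awk '
  (NR==1)    {print "rule 1 matched: " $0}
  ($0 ~ /o/) {print "rule 2 matched: " $0}
')
echo "$rules_output"
# rule 1 matched: hello
# rule 2 matched: hello
# rule 2 matched: world
```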

In the above command, ‘PATTERN1’ and ‘PATTERN2’ are the conditions that trigger the code inside the ‘{’ ‘}’ characters. Here are a few examples of commonly used patterns:

Print out the second line:

echo -e "hello\nworld" | awk '(NR==2){print $0}'

Output:

world

Print out any line that matches a regular expression (one that just looks for an ‘l’ character). The ‘~’ character is the regular expression match operator:

echo -e "hello\nthere\nworld" | awk '($0 ~ /l/){print $0}'

Output:

hello
world

If the item in the first column is longer than 5 characters, print out the item in the second column:

echo -e "acb def\nsomething else" | awk '(length($1) > 5){print $2}'

Output:

else

This provides some context on what you can do in the ‘pattern’ part, but what about the ‘action’ part? Well, you can use your imagination since that’s where awk becomes a fully fledged programming language. Here is an example awk command that will iterate over every character in the 3rd column on the 4th line and print out each character on a different line:

echo -e "a a a a\na a a a\na a a a\na a hello_there a" | awk '(NR==4){
  n_chrs = split($3, individual_characters, "")
  for (i=1; i <= n_chrs; i++){
    printf("Here is character number %d : %c\n", i, individual_characters[i]);
  }
}'

Output:

Here is character number 1 : h
Here is character number 2 : e
Here is character number 3 : l
Here is character number 4 : l
Here is character number 5 : o
Here is character number 6 : _
Here is character number 7 : t
Here is character number 8 : h
Here is character number 9 : e
Here is character number 10 : r
Here is character number 11 : e

Implicit defaults

awk makes many implicit default assumptions. To illustrate them, let’s do a few more examples of matching regular expressions against the following file ‘animals.txt’.

Rabbit
Bird
Dog
Pig
Lobster
Ape
Chicken
Lion
Pony
Fish
Cow
Cat
Horse
Deer
Turkey
Spider
Duck
Shark
Bear
Snake
Eagle
Bison
Monkey
Dolphin

Regular expression search that will print out any lines that end with the letter ’e’

awk '($0 ~ /e$/){print $0;}' animals.txt

Output:

Ape
Horse
Snake
Eagle

The parentheses around the pattern are optional:

awk '$0 ~ /e$/{print $0;}' animals.txt

We don’t actually need to specify the ‘$0’ part (the variable that denotes the current entire line). If you write a regular expression by itself, awk assumes that you’re comparing it against the contents of the current line. Therefore, we can do this:

awk '/e$/{print $0;}' animals.txt

If we’re printing out the entire line, we don’t actually need to say ‘print $0’, we can just say ‘print;’ and it will assume that we want to print out the current line:

awk '/e$/{print;}' animals.txt

We don’t even need to specify the action at all since it’s optional! In cases where the action is missing, the assumption is to print out the entire line, so we can just do this:

awk '/e$/' animals.txt
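To confirm that each shortening leaves the behaviour unchanged, this sketch runs all five forms on a small made-up subset of animals.txt and keeps each result in a variable for comparison:

```shell
# Small made-up subset of animals.txt.
animals='Rabbit
Ape
Horse
Lion'

full=$(printf '%s\n' "$animals" | awk '($0 ~ /e$/){print $0;}')
no_parens=$(printf '%s\n' "$animals" | awk '$0 ~ /e$/{print $0;}')
bare_regex=$(printf '%s\n' "$animals" | awk '/e$/{print $0;}')
bare_print=$(printf '%s\n' "$animals" | awk '/e$/{print;}')
no_action=$(printf '%s\n' "$animals" | awk '/e$/')

echo "$no_action"
# Ape
# Horse
```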

At this point, we’ve simplified awk to the point where it does pretty much the same thing that grep does with the ‘-E’ flag. Note that in awk the regular expression has to be wrapped in slashes; a bare word would be treated as a variable name:

#  Extended Regular Expression Search In Grep:
grep -E 'THE_REGEX' animals.txt
#  Extended Regular Expression Search In Awk:
awk '/THE_REGEX/' animals.txt
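A concrete side-by-side, as a sketch on made-up input. The regex matches lines that start with D and end in r or k, and in the awk version it is wrapped in slashes:

```shell
# Made-up input; the same extended regular expression is given to both tools.
animals='Dog
Duck
Deer
Cat'

with_grep=$(printf '%s\n' "$animals" | grep -E '^D.*[rk]$')
with_awk=$(printf '%s\n' "$animals" | awk '/^D.*[rk]$/')

echo "$with_awk"
# Duck
# Deer
```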

Here is an example awk command that will replace the ’e’ character at the end of a line with five ‘z’ characters:

awk '{ gsub(/e$/, "zzzzz"); print}' animals.txt
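This relies on one more default: when gsub is called with only two arguments, it works on the current line ($0). A sketch on three made-up lines, showing that lines without a trailing ‘e’ pass through untouched:

```shell
# gsub with two arguments edits $0 in place; print then emits the edited line.
replaced=$(printf 'Ape\nLion\nHorse\n' | awk '{ gsub(/e$/, "zzzzz"); print }')
echo "$replaced"
# Apzzzzz
# Lion
# Horszzzzz
```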

BEGIN & END Actions

Awk has two special patterns called ‘BEGIN’ and ‘END’. The action attached to ‘BEGIN’ runs once when awk starts up, before any input is read, and the action attached to ‘END’ runs once when awk is about to shut down, after the last line has been read. Here is a brief example of this in action:

awk '
        BEGIN{print "I run once when awk starts up."}
        END{print "I run once when awk is about to exit."}
' temps.csv

Output:

I run once when awk starts up.
I run once when awk is about to exit.

We can use them to do important ‘programming language’ type things like setting up and initializing variables in the ‘BEGIN’ action, or checking and aggregating information in the ‘END’ action. Here is an example use case of awk that calculates the average temperature (in Celsius) from our file of mixed Fahrenheit and Celsius values:

awk '
        BEGIN{temp_sum=0; total_records=0; print "Begin calculating average temperature."}
        $2=="F"{temp_sum += ($1-32) / 1.8; total_records += 1;}
        $2=="C"{temp_sum += $1; total_records += 1;}
        END{print "Average temperature: "(temp_sum/total_records)" C = "(temp_sum)" / "(total_records)}
' temps.csv

Output:

Begin calculating average temperature.
Average temperature: 25.3852 C = 304.622 / 12
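One gap in the command above: if no row has F or C in the second column, total_records stays 0 and the END action divides by zero. Below is a sketch of a guarded variant. The %.1f formatting and the message text are my own choices, and the input is a made-up header-only file followed by a two-row sample.

```shell
# Same accumulation rules, plus a check in END before dividing.
# Uninitialized awk variables count as 0 in arithmetic, so BEGIN is not needed.
average_program='
  $2=="F" {temp_sum += ($1-32) / 1.8; total_records += 1}
  $2=="C" {temp_sum += $1; total_records += 1}
  END {
    if (total_records > 0) {
      printf("Average temperature: %.1f C\n", temp_sum / total_records)
    } else {
      print "No temperature records found."
    }
  }
'

header_only=$(printf 'temp unit\n' | awk "$average_program")
two_rows=$(printf 'temp unit\n26.1 C\n78.1 F\n' | awk "$average_program")

echo "$header_only"
# No temperature records found.
echo "$two_rows"
# Average temperature: 25.9 C
```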
