awk
- Reference
- A sample exercise
- Problem statement
- The file: temps.csv
- Sample command to show how it uses different columns
- Attempt 1
- Attempt 2: Using the ternary operator and printing the Celsius units in the output.
- Attempt 3
- Attempt 4: Do a formatted print operation and print only one digit after the decimal point.
- Different separators
- awk language
- Implicit defaults
- BEGIN & END Actions
Reference
https://blog.robertelder.org/intro-to-awk-command/
A sample exercise
Problem statement
Convert all the entries from the csv file into Celsius
The file: temps.csv
temp unit
26.1 C
78.1 F
23.1 C
25.7 C
76.3 F
77.3 F
24.2 C
79.3 F
27.9 C
75.1 F
25.9 C
79.0 F
Sample command to show how it uses different columns
awk '{print "First column item: " $1 " Second column item: " $2 }' temps.csv
Output:
First column item: temp Second column item: unit
First column item: 26.1 Second column item: C
First column item: 78.1 Second column item: F
First column item: 23.1 Second column item: C
First column item: 25.7 Second column item: C
First column item: 76.3 Second column item: F
First column item: 77.3 Second column item: F
First column item: 24.2 Second column item: C
First column item: 79.3 Second column item: F
First column item: 27.9 Second column item: C
First column item: 75.1 Second column item: F
First column item: 25.9 Second column item: C
First column item: 79.0 Second column item: F
So, essentially, $1 represents the first column and $2 represents the second column.
NR is the number of the current record (by default, the current line number); NR>1 means: don’t do the operation for the first line.
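To see these built-in variables side by side, here is a small sketch that prints them for two inline rows (the rows are made up for illustration and piped in instead of reading temps.csv). NF, the number of fields on the current line, is a standard awk variable that the notes above do not use; show_fields is just a hypothetical helper name.

```shell
# Print the record number, field count and first two fields of every line.
show_fields() {
  awk '{ print "record " NR ": " NF " fields, $1=" $1 ", $2=" $2 }'
}

# Two illustrative rows: a header and one data row.
printf 'temp unit\n26.1 C\n' | show_fields
# record 1: 2 fields, $1=temp, $2=unit
# record 2: 2 fields, $1=26.1, $2=C
```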
Attempt 1
awk 'NR>1 {print ($1-32) / 1.8 }' temps.csv
Output:
-3.27778
25.6111
-4.94444
-3.5
24.6111
25.1667
-4.33333
26.2778
-2.27778
23.9444
-3.38889
26.1111
It converts every row, including the ones that are already in Celsius. We need a condition that checks whether the second column is F (Fahrenheit).
Attempt 2: Using the ternary operator and printing the Celsius units in the output.
awk 'NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv
Output:
26.1 C
25.6111 C
23.1 C
25.7 C
24.6111 C
25.1667 C
24.2 C
26.2778 C
27.9 C
23.9444 C
25.9 C
26.1111 C
Good. But it is not printing the header row.
Attempt 3
awk 'NR==1; NR>1{print ($2=="F" ? ($1-32) / 1.8 : $1)"\tC"}' temps.csv
Output:
temp unit
26.1 C
25.6111 C
23.1 C
25.7 C
24.6111 C
25.1667 C
24.2 C
26.2778 C
27.9 C
23.9444 C
25.9 C
26.1111 C
NR==1; is a pattern with no action, so awk falls back to its default action and prints the header row unchanged. The NR>1 pattern on the second rule keeps the conversion from running on the header row.
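The same behaviour can be reproduced on a tiny inline input. This is only a sketch: the three rows are illustrative and header_demo is a hypothetical helper name.

```shell
# Rule 1 has a pattern but no action, so awk prints the matching line as-is.
# Rule 2 has an action, so only what the action prints appears for those lines.
header_demo() {
  awk 'NR==1; NR>1 { print "row: " $0 }'
}

printf 'temp unit\n26.1 C\n78.1 F\n' | header_demo
# temp unit
# row: 26.1 C
# row: 78.1 F
```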
Attempt 4: Do a formatted print operation and print only one digit after the decimal point.
awk 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
Output:
temp unit
26.1 C
25.6 C
23.1 C
25.7 C
24.6 C
25.2 C
24.2 C
26.3 C
27.9 C
23.9 C
25.9 C
26.1 C
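The ternary can also be traded for one rule per unit, which may be easier to extend if more units show up later. The following is a sketch of that variant, not a replacement for Attempt 4: it runs on a few inline rows (illustrative), uses the next statement (standard awk, not used in the attempts above) to stop processing the header line, and to_celsius is a hypothetical helper name.

```shell
# One rule per case instead of a single ternary expression.
to_celsius() {
  awk 'NR==1   { print; next }                        # header: print unchanged, skip other rules
       $2=="F" { printf("%.1f\tC\n", ($1-32) / 1.8) } # Fahrenheit rows: convert to Celsius
       $2=="C" { printf("%.1f\tC\n", $1) }'           # Celsius rows: reformat only
}

printf 'temp unit\n26.1 C\n78.1 F\n79.0 F\n' | to_celsius
```

For these rows it prints the header followed by 26.1, 25.6 and 26.1, each with a tab and the letter C.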
Different separators
Same operation, but using a comma as the field separator instead of whitespace (this assumes the fields in temps.csv are separated by commas):
awk -F',' 'NR==1; NR>1{printf("%.1f,%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
Same operation, but using a tab as the field separator instead of whitespace (this assumes the fields in temps.csv are separated by tabs):
awk -F'\t' 'NR==1; NR>1{printf("%.1f\t%c\n",($2=="F" ? ($1-32) / 1.8 : $1),"C")}' temps.csv
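Since the temps.csv shown earlier is whitespace-separated, the comma variant is easiest to try on inline data. The sketch below assumes a comma-separated copy of the first rows (made up for illustration) and also sets OFS, the output field separator, through -v; OFS and -v are standard awk features that the commands above do not use, and csv_to_celsius is a hypothetical helper name.

```shell
# -F',' splits input fields on commas; OFS=',' joins the arguments of print with commas.
csv_to_celsius() {
  awk -F',' -v OFS=',' '
    NR==1 { print; next }   # header: print unchanged
    {
      c = ($2=="F") ? ($1-32) / 1.8 : $1
      print sprintf("%.1f", c), "C"
    }'
}

printf 'temp,unit\n26.1,C\n78.1,F\n' | csv_to_celsius
# temp,unit
# 26.1,C
# 25.6,C
```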
awk language
Awk is extremely easy to learn. It only seems hard because awk has so many implicit defaults, and nobody ever seems to explain this fact.
You can think of every awk command as a collection of ‘if statements’ that run against every line in the file. The syntax of every awk command looks pretty close to something like this:
awk 'if(PATTERN1){...print something...} if(PATTERN2){...print something...} ...'
with the one exception being that the ‘if’ keyword is never actually written out in front of a pattern, since it’s assumed to be there by default (if you do write it there, you’ll get a syntax error). Therefore, the overall syntax for every awk command (that doesn’t rely on defaults) is pretty much this:
awk '(PATTERN1){...Action 1...} (PATTERN2){...Action 2...} ...'
In the above command, the ‘PATTERN1’ or ‘PATTERN2’ is the trigger you want to cause the stuff inside the ‘{’ ‘}’ characters to actually execute. Here are a few examples of commonly used patterns:
Print out the second line:
echo -e "hello\nworld" | awk '(NR==2){print $0}'
Output:
world
Print out any line that matches a regular expression (one that just looks for an ‘l’ character). The ‘~’ operator tests whether the value on its left matches the regular expression on its right:
echo -e "hello\nthere\nworld" | awk '($0 ~ /l/){print $0}'
Output:
hello
world
If the item in the first column is longer than 5 characters, print out the item in the second column:
echo -e "acb def\nsomething else" | awk '(length($1) > 5){print $2}'
Output:
else
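Because every rule is tested against every line, a single line can trigger more than one action. Here is a minimal sketch of that behaviour, combining the three kinds of pattern shown above on an illustrative three-line input (multi_rule is a hypothetical helper name):

```shell
# Each pattern is checked for each line, in the order the rules are written.
multi_rule() {
  awk '(NR==1){ print "first line: " $0 }
       ($0 ~ /l/){ print "has an l: " $0 }
       (length($0) > 5){ print "long line: " $0 }'
}

printf 'hello\nthere\nwonderful\n' | multi_rule
# first line: hello
# has an l: hello
# has an l: wonderful
# long line: wonderful
```

The line "there" matches none of the patterns, so nothing is printed for it.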
This provides some context on what you can do in the ‘pattern’ part, but what about the ‘action’ part? Well, you can use your imagination since that’s where awk becomes a fully fledged programming language. Here is an example awk command that will iterate over every character in the 3rd column on the 4th line and print out each character on a different line:
echo -e "a a a a\na a a a\na a a a\na a hello_there a" | awk '(NR==4){
n_chrs = split($3, individual_characters, "")
for (i=1; i <= n_chrs; i++){
printf("Here is character number %d : %c\n", i, individual_characters[i]);
}
}'
Output:
Here is character number 1 : h
Here is character number 2 : e
Here is character number 3 : l
Here is character number 4 : l
Here is character number 5 : o
Here is character number 6 : _
Here is character number 7 : t
Here is character number 8 : h
Here is character number 9 : e
Here is character number 10 : r
Here is character number 11 : e
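Splitting on an empty separator works in gawk and several other implementations, but POSIX leaves that behaviour unspecified. As an alternative sketch, the loop below walks the string with length() and substr(), which are standard awk functions. The input word is illustrative and each_char is a hypothetical helper name.

```shell
# Walk the first field one character at a time without relying on split().
each_char() {
  awk '{
    n = length($1)
    for (i = 1; i <= n; i++) {
      # substr(s, i, 1) returns the single character at position i (1-based).
      printf("Here is character number %d : %s\n", i, substr($1, i, 1))
    }
  }'
}

printf 'hey\n' | each_char
# Here is character number 1 : h
# Here is character number 2 : e
# Here is character number 3 : y
```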
Implicit defaults
awk makes many implicit default assumptions. To illustrate them, let’s do a few more examples of matching regular expressions against the following file ‘animals.txt’.
Rabbit
Bird
Dog
Pig
Lobster
Ape
Chicken
Lion
Pony
Fish
Cow
Cat
Horse
Deer
Turkey
Spider
Duck
Shark
Bear
Snake
Eagle
Bison
Monkey
Dolphin
Regular expression search that will print out any lines that end with the letter ’e’
awk '($0 ~ /e$/){print $0;}' animals.txt
Output:
Ape
Horse
Snake
Eagle
The parentheses on the PATTERN are optional
awk '$0 ~ /e$/{print $0;}' animals.txt
We don’t actually need to specify the ‘$0’ part (the variable that denotes the current entire line). If you write a regular expression by itself, awk assumes that you’re comparing it against the contents of the current line. Therefore, we can do this:
awk '/e$/{print $0;}' animals.txt
If we’re printing out the entire line, we don’t actually need to say ‘print $0’, we can just say ‘print;’ and it will assume that we want to print out the current line:
awk '/e$/{print;}' animals.txt
We don’t even need to specify the action at all since it’s optional! In cases where the action is missing, the assumption is to print out the entire line, so we can just do this:
awk '/e$/' animals.txt
At this point, we’ve simplified awk to the point where it does pretty much the same thing that grep does with the ‘-E’ flag. Note that awk still needs the slashes around the regular expression; without them, the word would be treated as a variable name:
# Extended Regular Expression Search In Grep:
grep -E 'THE_REGEX' animals.txt
# Extended Regular Expression Search In Awk:
awk '/THE_REGEX/' animals.txt
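A quick way to check the equivalence is to run both tools on the same input and compare the results. The sketch below uses a four-line subset of animals.txt supplied inline (chosen for illustration); in awk the regular expression is written between slashes.

```shell
# Illustrative subset of animals.txt, supplied inline.
animals='Rabbit
Ape
Horse
Pony'

# Lines ending in "e", once with grep -E and once with awk.
with_grep=$(printf '%s\n' "$animals" | grep -E 'e$')
with_awk=$(printf '%s\n' "$animals" | awk '/e$/')

printf 'grep:\n%s\nawk:\n%s\n' "$with_grep" "$with_awk"
# grep:
# Ape
# Horse
# awk:
# Ape
# Horse
```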
Here is an example awk command that will replace the ’e’ character at the end of a line with five ‘z’ characters:
awk '{ gsub(/e$/, "zzzzz"); print}' animals.txt
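gsub replaces every match and returns the number of replacements it made, while its sibling sub (a standard awk function not used above) replaces only the first match. A small sketch of the difference, run on two names from animals.txt; the pattern /e/ and the helper name sub_vs_gsub are chosen just for this illustration:

```shell
# Compare sub (first match only) with gsub (all matches) on copies of the line.
sub_vs_gsub() {
  awk '{
    first = $0; all = $0
    sub(/e/, "z", first)      # replaces only the first "e"
    n = gsub(/e/, "z", all)   # replaces every "e" and returns how many were replaced
    print first, all, n
  }'
}

printf 'Deer\nEagle\n' | sub_vs_gsub
# Dzer Dzzr 2
# Eaglz Eaglz 1
```

The capital E in Eagle is left alone because the match is case-sensitive.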
BEGIN & END Actions
Awk has two very special patterns called ‘BEGIN’ and ‘END’. The action attached to ‘BEGIN’ runs once, before awk reads any input, and the action attached to ‘END’ runs once, after all the input has been read. Here is a brief example of this in action:
awk '
BEGIN{print "I run once when awk starts up."}
END{print "I run once when awk is about to exit."}
' temps.csv
Output:
I run once when awk starts up.
I run once when awk is about to exit.
We can use them to do important ‘programming language’ type things like setting up and initializing variables in the ‘BEGIN’ action, or checking and aggregating information in the ‘END’ action. Here is an example use case of awk that calculates the average temperature (in Celsius) from our file of mixed Fahrenheit and Celsius values:
awk '
BEGIN{temp_sum=0; total_records=0; print "Begin calculating average temperature."}
$2=="F"{temp_sum += ($1-32) / 1.8; total_records += 1;}
$2=="C"{temp_sum += $1; total_records += 1;}
END{print "Average temperature: "(temp_sum/total_records)" C = "(temp_sum)" / "(total_records)}
' temps.csv
Output:
Begin calculating average temperature.
Average temperature: 25.3852 C = 304.622 / 12
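Two small refinements are worth sketching here: formatting the average with printf, and guarding the division in case no temperature rows were read. Uninitialized awk variables already behave as 0 in arithmetic, so this version drops the BEGIN initialization. The inline rows are illustrative, and average_celsius is a hypothetical helper name.

```shell
# Average in Celsius, rounded to one decimal, with a guard for empty input.
average_celsius() {
  awk '
    $2=="F" { sum += ($1-32) / 1.8; n++ }   # convert Fahrenheit rows before adding
    $2=="C" { sum += $1; n++ }              # Celsius rows are added as-is
    END {
      if (n > 0) {
        printf("Average temperature: %.1f C over %d records\n", sum / n, n)
      } else {
        print "No temperature records found."
      }
    }'
}

printf 'temp unit\n26.1 C\n78.1 F\n' | average_celsius
# Average temperature: 25.9 C over 2 records

printf 'temp unit\n' | average_celsius
# No temperature records found.
```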