We want to know what a file contains and could clearly use a text editor to see for ourselves. Often, however, it is useful to have a simple, automated scheme to do things like search a file for specific contents. Linux provides simple, powerful built-in tools to do this.
Also, it is often useful to capture the screen output of a command into a file so we can examine it at our leisure. This is especially so if the command produces many screenfuls of information. Linux helps here too with built-in commands.
A useful command to search for a character string in a potentially large file is the awful-sounding grep command. (The letters are an acronym for Global Regular Expression Print, hinting at the technical features of the command.) Suppose you wanted to know, for some obscure reason, the names of all the “.txt” files in my “home_list.txt” file we played with before. Since these are mixed in with other file types, reading “home_list.txt” using, say, gedit is painful. Instead, we tell grep to look for the character string “txt” in “home_list.txt”. grep will then print out every line that contains “txt”. Assuming you have a copy of home_list.txt (and if not, go get one!), issue the command:
% grep txt home_list.txt
You should see 66 files. Note the structure of the command: grep, search string and then the searched file. You can easily search the same file for the “xls”, “pdf” and “tex” files it contains as well:
% grep xls home_list.txt
% grep pdf home_list.txt
% grep tex home_list.txt
grep has many useful switches. If you needed to examine the “grep-ed” file around the lines grep found, it could be useful to number the found lines so you could examine their neighbors in the original file using an editor. The -n switch comes to the rescue:
% grep -n txt home_list.txt
If you care only about the total count of lines that contain a search string, and not the lines themselves, use the -c switch:
% grep -c txt home_list.txt
Do you see the integer 66 again?
You can also do an “anti-search” with grep. Suppose you wanted to see all the lines that do not contain our sample search string “txt” in our sample searched file “home_list.txt”. This feat is accomplished using the -v switch. (Think “v” for inverted.)
% grep -v txt home_list.txt
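If you do not have “home_list.txt” handy, the switches so far can be tried on a tiny stand-in. This is only a sketch: the file name mini_list.txt and its four lines are made up for illustration, and the commands are written for a plain sh-style shell.

```shell
# Work in a scratch directory so none of your real files are touched.
cd "$(mktemp -d)"

# Hypothetical four-line listing standing in for home_list.txt.
printf '%s\n' 'notes.txt' 'budget.xls' 'paper.pdf' 'todo.txt' > mini_list.txt

grep txt mini_list.txt       # the two lines that contain txt
grep -n txt mini_list.txt    # the same lines, prefixed with their line numbers
grep -c txt mini_list.txt    # only the count of matching lines
grep -v txt mini_list.txt    # the lines that do NOT contain txt

# Keep the results in variables so they can be inspected afterwards.
matches=$(grep -c txt mini_list.txt)
numbered=$(grep -n txt mini_list.txt | tr '\n' ' ')
others=$(grep -v txt mini_list.txt | tr '\n' ' ')
echo "count: $matches / numbered: $numbered / others: $others"
```

Because the stand-in is so small, you can check every answer by eye before trusting grep on the real file.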
As with the ls command, you can stack switch options with grep.
Can you see how to find the number of lines that do not contain "txt"?
See me if you get stuck.
Sometimes you want to search for a string and do not care whether it is capitalized or not. In this case use the -i switch (for “ignore case”). For example, copy the file “nash.txt” from my home area to your area and cat it to see what it contains.
% cat nash.txt
Now search for the word “love”. What did you find? Now use the -i switch and repeat. What happens now?
How many lines contained "love" or "LOVE" or "Love" or ...?
This is called ... showing the love? grep is powerful because you can search for much more than the simple strings we used here. You can search for strings that are only at the end or beginning of a line, strings that are surrounded by white space, etc. Google “regular expressions” and grep and you’ll find more.
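To make the idea of beginning-of-line and end-of-line searches concrete, here is a small sketch. The file mini_poem.txt and its contents are invented for illustration. The ^ and $ anchors are standard regular expression syntax; the -w switch (match whole words only) is available in the common GNU and BSD versions of grep, so treat it as an assumption about your system.

```shell
cd "$(mktemp -d)"

# Hypothetical four-line file; note the capital L and the word "gloves".
printf '%s\n' 'Love is patient' 'love me do' \
              'all you need is love' 'gloves are warm' > mini_poem.txt

plain=$(grep -c love mini_poem.txt)         # case-sensitive; "gloves" counts too
nocase=$(grep -ci love mini_poem.txt)       # -i also picks up "Love"
at_start=$(grep -c '^love' mini_poem.txt)   # ^ anchors the match to the line start
at_end=$(grep -c 'love$' mini_poem.txt)     # $ anchors the match to the line end
whole=$(grep -cw love mini_poem.txt)        # -w rejects "gloves"

echo "plain=$plain nocase=$nocase start=$at_start end=$at_end whole=$whole"
```

The single quotes around the patterns keep the shell from interpreting ^ and $ before grep sees them.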
Redirection can also take its input from an existing file: the < operator sends the file's contents to a command as its standard input (aka stdin). For example, this works:
% grep love < nash.txt
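A quick way to convince yourself that the two forms find the same lines is to compare them on a small file. This is a sketch; mini_song.txt is a made-up name, not one of the course files.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'love me do' 'yesterday' > mini_song.txt

# grep opens the file itself when it is named on the command line...
from_arg=$(grep love mini_song.txt)
# ...while with < the shell opens the file and grep reads standard input.
from_stdin=$(grep love < mini_song.txt)

echo "argument: $from_arg"
echo "stdin:    $from_stdin"
```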
Capturing screen output into a file is easy with Linux. You use what are called redirection operators: >, >> and >&. To write all the lines that contain “txt” from “home_list.txt” into a separate file “greppie.txt”, do:
% grep txt home_list.txt > greppie.txt
Examine “greppie.txt” somehow. What do you see? You have redirected the output of grep from the monitor (the standard output, aka stdout) to a file. Try something similar again but use the search string “pdf”:
% grep pdf home_list.txt > greppie.txt
Examine “greppie.txt” again. The new output has replaced the old contents. What if you want the new output without overwriting the existing contents? In that case, use the append redirection operator “>>”. Go ahead and delete “greppie.txt”, grep the “txt” lines to it and then use the append operator to append the lines that contain “pdf”:
% grep pdf home_list.txt >> greppie.txt
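The difference between > and >> is easy to see by counting lines after each step. The sketch below uses a made-up three-line file, mini_list.txt, in place of “home_list.txt”, and assumes a plain sh-style shell in which > is allowed to overwrite an existing file.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'notes.txt' 'paper.pdf' 'todo.txt' > mini_list.txt

# Two > redirections in a row: the second replaces what the first wrote.
grep txt mini_list.txt > greppie.txt
grep pdf mini_list.txt > greppie.txt
after_overwrite=$(wc -l < greppie.txt | tr -d ' ')

# Start over, then append with >> so the txt lines survive.
rm greppie.txt
grep txt mini_list.txt > greppie.txt
grep pdf mini_list.txt >> greppie.txt
after_append=$(wc -l < greppie.txt | tr -d ' ')

echo "lines after overwrite: $after_overwrite, after append: $after_append"
```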
Finally, we come to “>&”. This is useful if you run a command that creates errors in addition to its normal output. Normally, error messages travel on a separate stream (the standard error, aka stderr) that just goes to the monitor unless you do something special. If your redirection command line does not contain “>&” and an error occurs, the error message will not be captured in the file intended to capture the output. This is a special case of redirection and I do not have a good example for the novice yet.
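For what it is worth, one candidate illustration is to ask ls about one file that exists and one that does not. The sketch below makes some assumptions: the file names are invented, and the commands use the portable sh spelling "> file 2>&1" to capture both streams. In csh/tcsh (and bash) the shorthand ">& file" does the same job.

```shell
cd "$(mktemp -d)"
echo 'hello' > present.txt

# Only stdout is captured here; the complaint about missing.txt
# still goes to the screen. "|| true" ignores the failure status of ls.
ls present.txt missing.txt > out_only.txt || true

# Capture stdout AND stderr in one file (csh/tcsh shorthand: >& both.txt).
ls present.txt missing.txt > both.txt 2>&1 || true

echo '--- out_only.txt ---'; cat out_only.txt
echo '--- both.txt ---';     cat both.txt
```

The exact wording of the error message differs between systems, but it should appear only in both.txt.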
From time to time you want to know how many characters, words or lines a file contains. The Linux word count command wc handles this case. To count the number of words in the file goethe.txt (see my home area), issue wc with the word switch -w:
% wc -w goethe.txt
You should see 22. Is this accurate? Check it! To count the lines, use the line switch -l:
% wc -l goethe.txt
Check it! What’s the deal here? Well, blank lines count as lines too. Strictly speaking, wc -l counts the number of newlines (or times the key “enter” was typed into the file). To count characters, use the -m switch (I have no mnemonic for this):
% wc -m goethe.txt
How did you, or rather, wc do? To get the lines, words and characters in a file all in one go, just dispense with the switches.
% wc goethe.txt
Is this consistent with what you found above?
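If you want a file small enough to count by hand, the following sketch builds one with a blank line in the middle. The name mini_verse.txt and its contents are made up for illustration. The tr step only strips the padding spaces that some versions of wc put in front of the number.

```shell
cd "$(mktemp -d)"

# Three newlines in total: "one two", an empty line, and "three".
printf 'one two\n\nthree\n' > mini_verse.txt

lines=$(wc -l < mini_verse.txt | tr -d ' ')   # newlines, so the blank line counts
words=$(wc -w < mini_verse.txt | tr -d ' ')
chars=$(wc -m < mini_verse.txt | tr -d ' ')   # every character, newlines included

echo "lines=$lines words=$words chars=$chars"
wc mini_verse.txt                             # all three numbers in one go
```

Counting by hand: 8 characters for "one two" plus its newline, 1 for the blank line, and 6 for "three" plus its newline.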
You can “pipe” the output of one Linux command into the input of another. For example, instead of using the grep -c command to count lines that contain “txt” in “home_list.txt”, we can use a pipe (indicated by the “|” sign):
% grep txt home_list.txt | wc -l
This example is perhaps a bit contrived but illustrates the point. Note the difference between redirection and a pipe. A pipe has a command on either side of it, while redirection has a command on one side of the operator and a file on the other. There is no limit to how many pipes you can use on a command line. Put that in your pipe and ...!
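Here is a sketch of the same idea on a made-up three-line file (mini_list.txt is a hypothetical stand-in for “home_list.txt”). It checks that the pipe and the -c switch agree, then chains two pipes on one command line.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'todo.txt' 'paper.pdf' 'notes.txt' > mini_list.txt

# One pipe: grep finds the lines, wc -l counts them.
via_pipe=$(grep txt mini_list.txt | wc -l | tr -d ' ')
via_switch=$(grep -c txt mini_list.txt)

# Two pipes: keep the txt lines, drop those mentioning todo, count the rest.
filtered=$(grep txt mini_list.txt | grep -v todo | wc -l | tr -d ' ')

echo "pipe=$via_pipe switch=$via_switch filtered=$filtered"
```

Each command in the chain reads what the previous one wrote, so no intermediate file is ever created.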
For files whose contents are arranged in tabular form or as lists, it is often useful to be able to sort the contents line-by-line according to some criterion. The sort command does this. For example, the command ls -l produces output in this form. If you wish to sort the output by file size, from smallest to largest, you can pipe the output from ls -l to sort and then set appropriate sort switches. Note that the information returned by ls -l is separated by one or more blank spaces, the default “field” separator. File size information is contained in the 5th field. Verify this!
To tell sort what field to pay attention to, you include the -k switch (for “key”) and the number of the field. To sort numerically (versus alphabetically) you need also the -n switch (for “numerical”). Combining these features, we would issue:
% cat home_list.txt | sort -nk 5
where we have used cat and the existing file “home_list.txt”. Note the order of the switches. Since 5 modifies -k they need to be next to one another. You could also issue:
% cat home_list.txt | sort -n -k 5
Since “home_list.txt” is quite large, a sensible command would be to use multiple pipes:
% cat home_list.txt | sort -nk 5 | more
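Why the -n switch matters is easiest to see on a table you can read at a glance. The sketch below uses an invented two-column file, mini_sizes.txt, with the size in field 2 rather than field 5. Setting LC_ALL=C is an assumption made here to pin down the alphabetic ordering, which can otherwise depend on your language settings.

```shell
cd "$(mktemp -d)"

# Hypothetical table: a file name and a size, separated by a blank.
printf '%s\n' 'a.txt 10' 'b.txt 9' 'c.txt 100' > mini_sizes.txt

# Alphabetic sort on field 2: compares character by character, so 100 sorts before 9.
alphabetic=$(LC_ALL=C sort -k 2 mini_sizes.txt)

# Numeric sort on field 2: compares the values as numbers.
numeric=$(LC_ALL=C sort -n -k 2 mini_sizes.txt)

echo 'alphabetic:'; echo "$alphabetic"
echo 'numeric:';    echo "$numeric"
```

In the alphabetic result the 9 lands last because the character "9" comes after "1"; only the numeric sort puts the sizes in the order you would expect.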
sort has a switch -r for reverse sorting. Can you sort “home_list.txt” according to the month of last modification, from least to most recent and vice versa? Do you need/want the -n switch in this case? Now sort and reverse sort according to year of modification. What do you need to do now? See me if you get stuck. Read the man pages on sort for more information.
Linux has a utility for displaying unique lines in a file. Adjacent duplicate lines are suppressed. (The file itself is not modified.) To test this, create a file echo.txt that contains adjacent duplicate lines of text. Then issue the unique command uniq:
% uniq echo.txt
Now rearrange the duplicate lines in “echo.txt” so that some of them are not adjacent. Reissue uniq. What happens?
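Once you have tried this yourself, you can compare notes with the sketch below. It builds a small “echo.txt” whose contents are made up for illustration, and it also shows one common workaround, which is an addition here rather than part of the exercise: sorting first makes all duplicates adjacent, so uniq can then remove them.

```shell
cd "$(mktemp -d)"

# The first two lines are adjacent duplicates; the last "echo" is not adjacent.
printf '%s\n' 'echo' 'echo' 'hello' 'echo' > echo.txt

# uniq alone only collapses the adjacent pair.
adjacent_only=$(uniq echo.txt | wc -l | tr -d ' ')

# Sorting first brings every duplicate together before uniq sees them.
sorted_first=$(sort echo.txt | uniq | wc -l | tr -d ' ')

echo "uniq alone: $adjacent_only lines; sort then uniq: $sorted_first lines"
```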
It is sometimes convenient to know what type of file you are dealing with (e.g., ASCII vs. binary) before you start acting on it with an editor or some other specialized command. This is often true if the file has an unfamiliar name and you don’t really know what it is. The file command is helpful in this circumstance. Issue it on various files:
% file nash.txt
% file /lustre/users
% file /etc/localtime
% file /etc/rpc
% file /usr/bin/GET
What do you see? file works on any kind of file.
Command | Meaning |
---|---|
grep angel hell.txt | displays every line in “hell.txt” that contains angel |
grep -n angel hell.txt | displays and numbers each line in “hell.txt” that contains angel |
grep -c angel hell.txt | displays the total number of lines in “hell.txt” that contain angel |
grep -v angel hell.txt | displays lines in hell.txt that do not contain angel |
wc poetry.txt | displays the number of lines, words and characters, respectively, in “poetry.txt” |
ls -l book.txt > prose.txt | capture output of ls -l on book.txt and place in new file “prose.txt” |
ls -l diary.txt >> prose.txt | append output of ls -l on diary.txt to prose.txt |
cat bordeaux.txt >! wine.txt | overwrite contents of wine.txt with those of bordeaux.txt |
grep fun stooges.txt \| wc -l | displays the number of lines containing fun in “stooges.txt” |
sort -k 7 info.txt | sort file info.txt alphabetically according to data in field 7 |
sort -nk 3 info.txt | sort file info.txt numerically according to data in field 3 |
uniq echo.txt | displays echo.txt, skipping adjacent duplicate lines |
file mystery | classifies the file mystery according to its contents |