Two loops for the price of one awk?

Programming languages, Coding, Executables, Package Creation, and Scripting.
Ahtiga Saraz
Posts: 1014
Joined: 2009-06-15 01:19

Two loops for the price of one awk?

#1 Post by Ahtiga Saraz »

According to information I can find, awk has the following model:
  • input: a file consisting of lines all having the same form
and an awk program has the form
  • BEGIN by doing something
  • loop over the lines, doing the same things with each line
  • END by doing something
But it seems that awk would be much more useful if one could loop twice:
  • BEGIN by doing something
  • loop over the lines, doing the same things with each line
  • record a result in a variable
  • loop over the lines again, doing the same things to each line
  • END by doing something
Is it possible?

Put another way, the built in NF function in awk must do something to count up the number of fields. I want an SF function, where the fields are numeric and I sum them.
Ahtiga Saraz

Le peuple debout contre les tyrans! De l'audace, encore de l'audace, toujours l'audace!

drl
Posts: 427
Joined: 2006-09-20 02:27
Location: Saint Paul, Minnesota, USA

Re: Two loops for the price of one awk?

#2 Post by drl »

Hi.

I think of awk as the preeminent data-in-fields-processor. The form of an awk program is a series of statements:

Code: Select all

pattern { action }
where BEGIN and END are optional, special patterns that allow actions to be performed before any of the files is read, and after all files are read. The action uses a syntax very much like C. The input text data files can be of almost any form. I think early on I had an awk program that implemented much of nroff (not written by me).
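
To make that concrete, here is a minimal sketch using both optional patterns (the field counting is only an illustration):

Code: Select all

awk '
BEGIN  { print "starting" }               # runs before any input is read
NF > 0 { nfields += NF }                  # pattern { action } for each non-empty record
END    { print "fields seen:", nfields }  # runs after all files are read
' file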

The object NF is a variable maintained by awk, not a function.

One reason that I might prefer perl over awk is that perl can read non-text files, awk cannot.

Modern awk allows user functions so that you can do modular coding.
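
For instance, a sketch of a user-defined function along the lines asked about above; the name sumfields is only illustrative:

Code: Select all

awk '
# sumfields() adds up the numeric fields of the current record;
# the extra parameters i and s are the usual awk idiom for local variables
function sumfields(    i, s) {
        for (i = 1; i <= NF; i++)
                s += $i
        return s
}
{ print "line", NR, "sum:", sumfields() }
' file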

The article at http://en.wikipedia.org/wiki/AWK has a lot of information and references.

The site http://awk.info/ is for everything awk.

The http://www.unix.com forum has an amazing group of gifted awk coders.

The book by A.W.K. (Aho, Weinberger, and Kernighan) is still in print, but I have no idea why it is priced so high at $80 / $50 at Amazon; I think I paid about $20 for it long ago.

Best wishes ... cheers, drl
["Sure, I can help you with that." -- USBank voice recognition system.
( Mn, 2.6.x, AMD-64 3000+, ASUS A8V Deluxe, 3 GB, SATA + IDE, NV34 )
Debian Wiki | Packages | Backports | Netinstall

tukuyomi
Posts: 150
Joined: 2006-12-05 19:53

Re: Two loops for the price of one awk?

#3 Post by tukuyomi »

BEGIN by doing something
loop over the lines, doing the same things with each line
record a result in a variable
loop over the lines again, doing the same things to each line
END by doing something

Code: Select all

awk '
BEGIN{Do stuff}
NR==FNR{Do stuff for file, record in a variable if you want; next}
{Do stuff for file again}
END{Do stuff}
' file file
NR is incremented on each line; FNR is too, with the difference that FNR is reset at the start of each input file:
Say file has 10 lines. As it is read twice, FNR will go 1~10, then 1~10 again, while NR will go 1~20. NR==FNR is therefore only true during the first pass over the file, which explains the first loop (NR==FNR{...; next})
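
A quick made-up example of that two-pass idea, taking the first field as the value of interest (adapt as needed):

Code: Select all

awk '
NR==FNR { total += $1; next }                        # first pass: accumulate a grand total
        { printf "%s\t%.1f%%\n", $1, 100*$1/total }  # second pass: print each value as a share of it
' file file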

Telemachus
Posts: 4574
Joined: 2006-12-25 15:53
Been thanked: 2 times

Re: Two loops for the price of one awk?

#4 Post by Telemachus »

Ahtiga Saraz wrote:According to information I can find, awk has the following model:
  • input: a file consisting of lines all having the same form
and an awk program has the form
  • BEGIN by doing something
  • loop over the lines, doing the same things with each line
  • END by doing something
But it seems that awk would be much more useful if one could loop twice:
  • BEGIN by doing something
  • loop over the lines, doing the same things with each line
  • record a result in a variable
  • loop over the lines again, doing the same things to each line
  • END by doing something
Is it possible?
I'm worried this is an XY problem. Can you tell us what the real goal is? Ideally give a concrete example with a small amount of realistic data.
Ahtiga Saraz wrote:Put another way, the built in NF function in awk must do something to count up the number of fields. I want an SF function, where the fields are numeric and I sum them.
NF is not a function. It's a built-in variable that stores the number of fields for each line. The code that computes that value is somewhere in the awk interpreter (presumably written in C), and I don't think it's directly available to you as a user of awk. Having said that, summing columns or rows is not very hard in awk, and you can define your own functions as part of a larger awk script.
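
For instance (a quick sketch, not tailored to your data):

Code: Select all

awk '
{ rowsum = 0
  for (i = 1; i <= NF; i++) rowsum += $i   # sum across the current row
  colsum += $1                             # running sum down the first column
  print "row sum:", rowsum }
END { print "column 1 total:", colsum }
' file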

Again it would help if you told us more concretely what you're trying to do.
"We have not been faced with the need to satisfy someone else's requirements, and for this freedom we are grateful."
Dennis Ritchie and Ken Thompson, The UNIX Time-Sharing System

Ahtiga Saraz
Posts: 1014
Joined: 2009-06-15 01:19

Cowardly reluctance to address XY problem

#5 Post by Ahtiga Saraz »

Hi Telemachus, I sure am glad to see you!

Yes, an XY problem.

The origin of my project is that there are FOSSware items--- including a few "toys" in the Debian repos--- which claim to solve certain problems, but they don't work very well and have very limited utility even for "toy problems".

Since I know from experience solving such problems "by hand" that one can do much better than the FOSSware I have found on the web, I decided to try to develop my own set of scripts, each performing various specific small tasks, with the goal of eventually formulating an outline for a modular package, as a way of trying to give something back to the community (in the unlikely event I ever actually came up with anything useful). The general area involves text processing, but I am reluctant to say more in public.

As time and energy permit, I have been writing a few sample scripts to build my skills, generally by modifying an example I found in a book. For example, here is a script I wrote the other day which computes averages across lines in a file of numbers and which uses an associative array:

Code: Select all

#!/bin/bash
# Ahtiga Saraz; modified from an example in Dougherty and Robbins, sed and awk
# Input: a text file consisting of lines of numbers separated by spaces
# (where the lines can contain different numbers of fields)
# Output: average of each line, followed by average of averages
cat "$1" | mawk '
BEGIN { OFS = "\t" }
# Do this to each line
{
# compute line average
        total = 0
        for (i = 1; i <= NF; ++i)
                total += $i
        avg = total / NF
# assign average to element of an array for later reference
        line_avg[NR] = avg
# assign number of fields to an array for later reference
        numf[NR] = NF
# print number of observations and line average
        print "Fields: ", NF, "Line Average: ", avg
}
# Compute average of line averages
END {
        totnum = 0
        for (x = 1; x <= NR; x++)
                totnum += numf[x]
        cum = 0
        for (y = 1; y <= NR; y++)
                cum += numf[y]*line_avg[y]
        cavg = cum / totnum
        print "Total fields: ", totnum, "Cumulative Average: ", cavg
}'
For example, given as input the file

Code: Select all

10 11 12
1 2 3 4 5 6 7 8 9
100
this produces the output

Code: Select all

Fields:         3       Line Average:   11
Fields:         9       Line Average:   5
Fields:         1       Line Average:   100
Total fields:   13      Cumulative Average:     13.6923
(Because other scripts I use consist of long chains where pipes pass data on to another script, I usually use the cat file | awk ' stuff ' style of writing awk scripts. I mention this as an example because some regulars here are making me feel defensive about my alleged unwillingness to try to learn.)

The problems I am having so much trouble with are superficially similar: given a file like

Code: Select all

110 5 25
3 17
1 2 34 13
I want to replace each entry by the value of a function which depends upon that entry and also upon the other entries in the line. I'd settle for doing this for just one line, but it seems clear that if I can do it for one line I can do it for many, and that should increase flexibility. I see no way of doing this kind of task except with at least two loops run one after the other on each line: the first computes a quantity depending on all the values in the line, and the second computes the function value for each entry. Then do this for each line.
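
For the simplest case, suppose the function is just each entry divided by the line total. I can manage a rough sketch like the following (the division is only a stand-in for my real function), but I don't see how to generalize it cleanly:

Code: Select all

#!/bin/bash
# Rough sketch: replace each entry by its share of the line total
cat "$1" | mawk '
{
        tot = 0
        for (i = 1; i <= NF; i++)       # first loop: a quantity depending on the whole line
                tot += $i
        for (i = 1; i <= NF; i++)       # second loop: transform each entry using that quantity
                printf "%s%s", $i/tot, (i < NF ? OFS : ORS)
}'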

There must be a way to do this using defined functions and associative arrays. tukuyomi suggested something which sounds interesting but I couldn't quite figure out how to make it work.

While I am asking here about awk, and naturally prefer to do as much early development as possible using tools I understand better than Python, I am aware that Python is often used these days for text processing tasks and am not averse to learning to use it by and by. So far I have not found a Python book which offers examples which seem relevant. (Quite possibly because I don't yet know Python, or I would perhaps better recognize relevance.) The on-line book called something like "Dive Into Python" proved frustrating.
Ahtiga Saraz

Le peuple debout contre les tyrans! De l'audace, encore de l'audace, toujours l'audace!

tukuyomi
Posts: 150
Joined: 2006-12-05 19:53

Re: Cowardly reluctance to address XY problem

#6 Post by tukuyomi »

Ahtiga Saraz wrote: For example, given as input the file

Code: Select all

10 11 12
1 2 3 4 5 6 7 8 9
100
this produces the output

Code: Select all

Fields:         3       Line Average:   11
Fields:         9       Line Average:   5
Fields:         1       Line Average:   100
Total fields:   13      Cumulative Average:     13.6923
As an awk example:

Code: Select all

#!/bin/sh

awk '
{sum=0
for(i=1;i<=NF;i++)sum+=$i      # sum of the fields on this line
nf+=NF; cumul+=sum             # running totals over all lines
print "Fields:\t"NF"\tLine Average:\t"sum/NF
}
END{print "Total Fields:\t"nf"\tCumulative Avg:\t"cumul/nf}
' file

Ahtiga Saraz
Posts: 1014
Joined: 2009-06-15 01:19

I like your awk style

#7 Post by Ahtiga Saraz »

Nice!

Just to clarify: I was trying to use the weighted average as an example of the general kind of problem where I want to compute the values of a function which depends on all the entries in a line of data.
Ahtiga Saraz

Le peuple debout contre les tyrans! De l'audace, encore de l'audace, toujours l'audace!
