Scheduled Maintenance: We are aware of an issue with Google, AOL, and Yahoo services as email providers which are blocking new registrations. We are trying to fix the issue and we have several internal and external support tickets in process to resolve the issue. Please see: viewtopic.php?t=158230

 

 

 

Improving on my C (little?) understanding.

Programming languages, Coding, Executables, Package Creation, and Scripting.
Message
Author
User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Improving on my C (little?) understanding.

#1 Post by edbarx »

Since at the moment I am revising my knowledge of the C programming language, I would like to take the opportunity to post some of my rudimentary C programs for comments by anyone having a good command of the language. My efforts are not directed towards profit, but to enrich my knowledge to be of better use where contributing to free and open source software is concerned. I have already written a functional free and open source project, that I submitted to the Devuan project, but which can be forked by anyone wanting to use it in any distribution, including this one to which this board is dedicated.

I am reading the book:
The C Programming Language 2nd Edition
By: Brian W. Kernighan and Dennis M. Ritchie

Right now I am revising pointers although till now I haven't used malloc, calloc and alloc.

I will post some of my worked examples for comments as I know feedback is very important where learning is involved.

Since I find myself noticeably weak where pointer use is involved, I think, it is better to start with pointers.

Code: Select all

#include <stdio.h>
#include <ctype.h>

/* assume target string has enough free space */
void strcat1(char* s, char *t) {
  /* find terminating null */
  
/* 
The for loop increments pointer t after every iteration.
Loop starts at t. The final value of t is the position of
/0 character.
*/
  for (; *t; t++);
  
/*
The while loop uses C's bastardized version of an assignment
treating it also as a statement. At each iteration, first the
char at t is assigned the char at s. After this, both t and s
are incremented.

*t++ means first increment the pointer, then dereference it,
otherwise this wouldn't work.
*/  
  while(*t++ = *s++);
}


int main() {
/*
Here I am assuming both source and target are automatically
appended by a \0 character. Otherwise strcat1 would go into
an uncontrolled memory corrupting frenzy only to be stopped 
by a segmentation fault.

However this didn't happen.
*/
  char source[] = "Appended end of string\n";
  char target[1024] = "Target string\n";
  
  strcat1(source, target);
  printf(target); 
  
  return 0;
}

Code: Select all

#include <stdio.h>
#include <ctype.h>

int strend(char* s, char* t) {
  char* saved_t = t;

  for (; *t; t++);
  for (; *s; s++);
  
  while (*--t == *--s);
  return (t + 1 == saved_t);
}

int main() {
  char a[] = "mouse, cat, dog\n";
  char b[] = "cat, dog\n";
  char c[] = "horse, cow\n";
  
  if (strend(a, b))
    printf("b found in a\n");
    else printf("b not found in a\n");
  
  if (strend(a, c))
    printf("c found in a\n");
    else printf("c not found in a\n");
      
  return 0;  
}
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#2 Post by tomazzi »

Hi, edbarx.

Well, there are few things which I definitely don't like in Your code, but first, I would like to say a few words about the C syntax:
C is so powerful language, that many people are calling it a "portable assembly" (falsely) - and because of this (I think), especially the beginners are thinking that C indeed is equivalent to assembly language. As a consequence, many times I've seen a code in which people are trying to "outsmart" the compiler, by applying "optimizations" at the syntax level (which is very different from optimizations at the algorithm level).
The results of "optimizations" at the syntax level are always the same: suboptimal resulting code, very bad readability of the sources and problems in using debugger.

For example (1):

Code: Select all

for (; *t; t++);
Correct version (2):

Code: Select all

char chr = *p_ch;
for (; chr != 0; p_ch++) {
   chr = *p_ch; 
}
1. More text != bigger or slower resulting code.
2. More text == better readability.
3. Versions 1 & 2 are equivalent, because compiler must create a local variable (using CPU register) to perform (implied!) checking for zero.
4. Source Debuggers, such as GDB will not show real values processed in version 1.
5. Single-letter variables can become highly non-distinguishable (depending on what font is used), f.e.: (i||l) ???
6. Syntax highlighting in modern editors/IDEs doesn't work with single-letter names of variables.
7. Variables used in serious projects should have names derived from the variable's type, f.e.: p_char or chr_ptr.

Of course, both versions are just too small and too simple to show the real picture of the problem, but I hope that my points are clear enough to understand.

Regards.
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#3 Post by edbarx »

I have been coding since 1995 although I started to code in C well into the first decade of the third millenium. Single letter variables make sense when their use is limited to loop counters and their number is small. As you said, variable names should give some hint as to their type. When I use widgets (GUI), I name them like this: for a button I use btn_description like btn_close. I do the same when code starts to get complicated.

The code that I pasted follows the same fashion as used in the book. As far as I know, one of the authors was one of the pioneer coders who coded the C language. Their code is far more cryptier than I can succeed to write at the moment. As you rightly insisted, code should be written with readability as one of the major objectives. However, sometimes writing simplified code, is often interpreted as a lack of coding proficiency. Initially, I was against the coding style I posted, but unfortunately, C coders expect such code from whoever says that they have some knowledge of C.

My objective is to join other free and open source developers, therefore I aim to make my code look as familiar to them as they expect. Currently, I joined Devuan and actually have contributed a project (simple-netaid-backend; simple-netaid-gui; simple-netaid-lightweight).
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#4 Post by tomazzi »

edbarx wrote:As far as I know, one of the authors was one of the pioneer coders who coded the C language. Their code is far more cryptier than I can succeed to write at the moment.
Yeah, the IOCCC was started by the "masters" and "pioneers" in C, but I don't think that a code like this:
http://www.ioccc.org/2015/dogon/prog.c
is a proof of mastery - it's rather a proof of good sense of humour ;)

I prefer briliant algorithms/programming techniqes over briliantly squashed, but hard to read code....

So, f.e: "C function can't return more than one variable"
Oh really?

Code: Select all

typedef union __attribute__ ((transparent_union)) {
   int64_t pad;
   struct __attribute__ ((packed)) {
      int16_t var1;
      int16_t var2;
      int16_t var3;
      int16_t var4;
   }
} multivar_t;

multivar_t give_me_four_values(void) {
   multivar_t mvar;
   mvar.var1 = 1;
   mvar.var2 = 2;
   mvar.var3 = 3;
   mvar.var4 = 4;
   return mvar;
}

main() {
   multivar_t mvar = give_me_four_values();
   printf("%d, %d, %d, %d\n", \
             (int) mvar.var1,
             (int) mvar.var2,
             (int) mvar.var3,
             (int) mvar.var4
           )
}
:mrgreen:

Regards.

PS: just a quick example, I hope that the compiler won't camplain about some missing semicolon or something :)

PS2: So what problems do You have with the pointers?
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#5 Post by edbarx »

tomazzi wrote:So, f.e: "C function can't return more than one variable"
Oh really?
Actually, one can return as many values as one may wish. There are various ways to do it. One way is using the return value of a function, another is using external/global variables, and yet another method, is to use pointers.

My problem consists mainly attempting to use incompatible pointer types for the parameters of functions and in assignments. I am still hazy as to why I was receiving those errors.

One such error was: incompatible pointer types - const char* used instead of char*. I am quoting from memory which can be very volatile.
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#6 Post by tomazzi »

edbarx wrote:My problem consists mainly attempting to use incompatible pointer types for the parameters of functions and in assignments. I am still hazy as to why I was receiving those errors.

One such error was: incompatible pointer types - const char* used instead of char*. I am quoting from memory which can be very volatile.
Well, in C language, in many (or rather in most) cases, const types are just declarations used by compiler to detect incorrect access type, i.e. writing to a variable declared as "should not be changed", but which can be changed physically without causing a crash.
In other words, const type itself does not guarantee that the value is read-only. Truly constant variables are those placed in write protected memory areas - and in such case, ignored compiler warning will result in segfault during code execution.

Moreover, the const keyword can be used conditionally - f.e. if some function is not allowed to change some variable then:
a) the variable should be passed "by value" - that is, without any reference to its original memory location.
b) if it's a big data structure, then the function should take const pointer to that structure.
Therefore, this is completely correct to cast pointers to const/non-const type, depending on what function has to be called.

However, it would be nice to see some meaningful code snippet - code speaks for itself ;)
edbarx wrote:Actually, one can return as many values as one may wish. There are various ways to do it. One way is using the return value of a function, another is using external/global variables, and yet another method, is to use pointers.
Yes and no.
I've posted this example for two reasons:
First, in conjunction with the IOCCC code it shows two extremely different faces of C language.

Second, this is not just an example of how to return a structure from a function. The multivar_t is a special case of one of most powerful programming techniques available in C: this data structure fits into a single 64bit register or into a pair of 32bit registers.
In the effect, the structure has many extremely useful properties, but since this post is already becoming too long, here are only few most obvious ones:
1. Atomic, hyper-fast copying: max 1 or 2 assembly instructions:

Code: Select all

mvarA.pad = mvarB.pad;
/* or just: */
mvarA = mvarB;
/* Yes, this is possible thanks to the transparency of union */
2. Atomic, hyper-fast multiplication/division (by power of 2 values), addition/substraction of all 4 vars simultaneously, in max 1 or 2 asm instructions:

Code: Select all

/* increment all the 4 vars by 1: */
mvarA.pad += 0x0001000100010001;
/* multiply all the 4 vars by 2: */
mvarA.pad <<= 1;
In general form, the structure looks like follows:

Code: Select all

 <pseudocode>
typedef union __attribute__ ((transparent_union)) {
   cpu_reg_double_size_t pad;
   /* this allows to do many more tricks: */
   struct __attribute__ ((packed)) {
      cpu_reg_size_t pad_Low;
      cpu_reg_size_t pad_High;
   }
   /* Here, a structure which contains any combination of bitfields,
      bytes, int16_t words, int32_t words etc, but of total size == sizeof(cpu_reg_double_size_t)
   */
   struct __attribute__ ((packed)) {
      <your data structure>
   }
} multivar_t;
It's not true that in C (or in fact in any other language) the function may return any amount of data. "Returning" the variables means that they are passed back to the caller trough the CPU registers - otherwise, it's just "processing the data using provided pointers" - so, if the returned structure is too big to fit the registers, then the compiler will just transform the code to a classic solution based on pointers.

Regards.
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#7 Post by edbarx »

I tried to reproduce the incompatible types errors I used to have when I used C pointers but failed to get any such errors even though I am compiling with the -Wall compiler parameter. Apparently, re-studying C from the very beginning like a clueless newbie is beginning to bear its fruit.

Here is the code.

Code: Select all

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

int main()
{
  char* strStory = "The terrier ran after the mouse but failed to catch it.";
  char* strPhrase = "but failed";
  
  /* char *strstr(const char *haystack, const char *needle) */
  if (strstr(strStory, strPhrase) != NULL)
    printf("Substring found.\n");
    else printf("NOTFOUND\n");
    
  /* copy two strings to a third dynamically allocated string */
  char *strDyna = malloc(1024*sizeof(char)); /* more than enough */
  
  /* char *strcpy(char *dest, const char *src) */
  strcpy(strDyna, strStory);
  
  /* char *strcat(char *dest, const char *src) */
  strcat(strDyna, strPhrase);
  printf("%s\n", strDyna);
  
  free(strDyna);
    
  return 0;  
}
I am tempted to write a small parser in C to continue testing my proficiency in the language. Can you give me a small hint as to what parser I can attempt to write to testing my C coding ability?

EDIT:
I am remembering a simple parser I can use an exercise. This tests the syntax of expressions like this:
(A^B)v(!C^B)

I will allow the use of strings instead of single character identifiers.

Thanks.
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#8 Post by tomazzi »

edbarx wrote:Can you give me a small hint as to what parser I can attempt to write to testing my C coding ability?
hmm... maybe a parser for Regular Expressions? (without the extended syntax) ;)
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#9 Post by edbarx »

I am trying a parser for expressions like (AvB)^(!CvD). The code is a tokeniser and a token lister. The parser is still to be coded.

Code: Select all

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define token_delta 50

typedef char* pchar;
pchar* tokens;
int token_count = 0; 

void tokenise(char* expression) {
  char ch;              /* character at iterator's position */
  char* atoken;         /* token to be added to token list  */
  char* start = NULL;   /* holds a char* to point to start of multicharacter token */
  char* string_token;   /* holds multicharacter token */
  int char_count = 0;   /* keeps track of number of characters in multicharacter token */
  
  int while_counter = 0;
  /* expr_len_and_null holds the total length of input string including null terminator */
  int expr_len_and_null = strlen(expression) + 1;
  while (while_counter++ <= expr_len_and_null) {
    ch = *expression++; 
    switch (ch) {
      case '(': 
      case ')':
      case '^':
      case 'v':
      case '!':
      case '\0':
        if (start && char_count > 0) {
          string_token = malloc((char_count + 1)*sizeof(char));
          strncpy(string_token, start, char_count);
          tokens[token_count++] = string_token;
                  
          start = NULL;
          char_count = 0;
        }                
      
        if (ch == '\0') return;
        atoken = malloc(2*sizeof(char));
        atoken[0] = ch;
        atoken[1] = '\0';
        tokens[token_count++] = atoken;
        break;
        
      case '\n':
        break;  
        
      default:
        if (start == NULL)
          start = expression - 1;
        char_count++;  
    }
  }
}

void free_memory() {
  int k = token_count;
  
  while (--k >= 0) free(tokens[k]);
  free(tokens);
}


int main() {
  char input[101];

  tokens = malloc(token_delta*sizeof(char*));
  fgets(input, 100, stdin);
  tokenise(input);
  
  int k = -1;
  printf("token count is %d\n", token_count);
  while(++k < token_count)
    printf("%s\n", tokens[k]);

  free_memory();
  
  return 0;
}
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#10 Post by tomazzi »

... just to trigger Your attention:
all of the compilers, including gcc, are not able to generate jump-offset-arrays for switch() statements, because they are 'too stupid' - i.e. they can generate correct code *only* in case when the tested variable value is in range 0..n .
The real code generated by the compiler will look like this:
if (chr == '(') then ...
if (chr == ')') then ...
if (chr == '^') then ...

... compilers are stupid, no matter what amazing level of "optimizations" is enabled. dot.

Solution:
a) hashing functions (in case when there are more than let's say 100 possible hits)
b) manually written jump-offset arrays - or arrays of function pointers (for functions which are handling particular cases)

Regards.

... and yes, I'm a speed maniac... ;)
Odi profanum vulgus

reinob
Posts: 1196
Joined: 2014-06-30 11:42
Has thanked: 99 times
Been thanked: 47 times

Re: Improving on my C (little?) understanding.

#11 Post by reinob »

tomazzi wrote: Second, this is not just an example of how to return a structure from a function. The multivar_t is a special case of one of most powerful programming techniques available in C: this data structure fits into a single 64bit register or into a pair of 32bit registers.
In the effect, the structure has many extremely useful properties, but since this post is already becoming too long, here are only few most obvious ones:

...
That may be useful in whatever specific program for a specific platform with a specific compiler (and version) you are developing, but the real strength of C is its portability.

Quoting what you said a few posts above:
tomazzi wrote: C is so powerful language, that many people are calling it a "portable assembly" (falsely) - and because of this (I think), especially the beginners are thinking that C indeed is equivalent to assembly language. As a consequence, many times I've seen a code in which people are trying to "outsmart" the compiler, by applying "optimizations" at the syntax level (which is very different from optimizations at the algorithm level).
Your example above looks like the "portable assembly" C shouldn't be...

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#12 Post by tomazzi »

reinob wrote:That may be useful in whatever specific program for a specific platform with a specific compiler (and version) you are developing, but the real strength of C is its portability.
Portability of C is not better than f.e. portability of C++, so Your claim is at least questionable.

That example code is 100% syntactically correct and 100% portable - personally, I've tested this on armhf, amd64 and x86 using several different compilers in several versions.

Those "magic" attributes used in the example are supported by every serious compiler in the world, like gcc, icc, clang and even by mingw on winblows. They are not supported by microshit's compiler, but who cares?

The biggest strenght of C is that I can tell the compiler what *exactly* I want to do.

This is impossible in C++ and other higher level languages, where significant portion of the resulting (executable) code comes from the compiler itself and not from the sources -> just to support the fancy syntax of the language.

Regards.
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#13 Post by edbarx »

Hi Tomazzi,

Analising the expression grammar, I found these syntax rules to be checked by the parser.

Boolean Expressions like: (AvB)^(!CvD)

Key for Syntax Rules:
a) [ ] means whatever is in the brackets is optional
b) ... means repeat the latter
c) no brackets mean compulsory syntax

Syntax Rules:
i) operand = [ ! ... ] identifier
ii) bracketed operand = [ ( ... xN ] operand [ ) ... xN ]
iii) term = bracketed operand [ [ operator ----> bracketed operand ] ... ]
iv) bracketed term = [ ! ... ] [ ( ... xN ] term [ ) ... xN ]
v) expression = bracketed term [ [ operator -----> bracketed term ] ... ]

Edward
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

reinob
Posts: 1196
Joined: 2014-06-30 11:42
Has thanked: 99 times
Been thanked: 47 times

Re: Improving on my C (little?) understanding.

#14 Post by reinob »

tomazzi wrote:
reinob wrote:That may be useful in whatever specific program for a specific platform with a specific compiler (and version) you are developing, but the real strength of C is its portability.
Portability of C is not better than f.e. portability of C++, so Your claim is at least questionable.
Your logic is also questionable, but it's OK.
That example code is 100% syntactically correct and 100% portable - personally, I've tested this on armhf, amd64 and x86 using several different compilers in several versions.
They are not supported by microshit's compiler, but who cares?
QED.

Code: Select all

/* increment all the 4 vars by 1: */
mvarA.pad += 0x0001000100010001;
The above may be clever, but it's not good programming by any means. It's hacky, unreliable, not clearly understandable, dependent on whatever (re-)definition of your union, etc.

You even told @edbarx "The results of "optimizations" at the syntax level are always the same: suboptimal resulting code, very bad readability of the sources and problems in using debugger.", which is exactly what you are showing.

But in any case, I'm not here to discuss or to complain, but if someone asks for help improving his understanding it's best if you at least one stick to either one argument or to the other, but now you are arguing X and (not X).

As for @edbarx, the code itself looks OK but a bit messy.
Instead of

Code: Select all

  int k = -1;
  ...
  while(++k < token_count)
start with k = 0 and do while(k < token_count) and then k++.

For the compiler (generated code) it will not make any difference, but it helps to make the code much more readable.
As @tomazzi (originally) said (or how I understood him), it's better to write readable, correct, code than messy and possibly incorrect code (imagine you for some reason decide that k will be an unsigned int (like size_t) -- which would actually make sense for a string --, then you have a problem..)

Cheers.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#15 Post by tomazzi »

reinob wrote:QED.
Yeah, so everyone who is using attributes should now change their code, because there's a single crippled compiler in the world, which doesn't stick to the standards?
Even if this is the worst compiler available for the windows platform, which means only about 14% of the market? You must be kidding.
reinob wrote:You even told @edbarx "The results of "optimizations" at the syntax level are always the same: suboptimal resulting code, very bad readability of the sources and problems in using debugger.", which is exactly what you are showing.
Nope. I was talking about this:

Code: Select all

for (; *t; t++);
Where:
a) debugger doesn't work at the source level.
b) avoiding explicit usage of local variable is not an optimization - the compiler will create local variable and it will perform implicit checking for zero using that variable.
No benefits, You can't debug the code, but what is really bad: - you let the compiler to write the essential part of this code (checking for zero). Maybe it'll decide to optimize-out this loop - surprise!. :mrgreen:

My example is:
a) fully debuggable.
b) this is optimization at the algorithm level, which is widely known under the name of "vectorization".
c) this not some "ugly hackish trick" - I'm using here well-known and well-described properties of the binary system - I'm surprised that You don't understand this...
d) Since this is a valid c syntax, it will work with any compiler on any platform(*) - optimizations performed by compiler are not guaranteed to be applied, because this depends on the platform and on the compiler version and on the build-time compiler flags.

(*) Except that crap from microshit, of course.
Odi profanum vulgus

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#16 Post by tomazzi »

edbarx wrote:Hi Tomazzi,

Analising the expression grammar, I found these syntax rules to be checked by the parser.
Hi Edward ;)

OK, so what do You expect from me?
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#17 Post by edbarx »

tomazzi wrote:
edbarx wrote:Hi Tomazzi,

Analising the expression grammar, I found these syntax rules to be checked by the parser.
Hi Edward ;)

OK, so what do You expect from me?
I cannot expect anything by right, however, if you feel you want to, I would appreciate some comments whenever you feel a comment is appropriate. :)

Re-examining the Syntax Rules resulted into the following:
a) operand = identifier
b) term = operand [ operator-->operand ... ]
c) sub-expression = term [ operator-->term ... ]
d) bracketed sub-expression = [ ! ... ] [ ( ... ] sub-expression [ ) ... ]
e) expression = [ ! ... ] sub-expression [ operator-->sub-expression ... ]
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#18 Post by tomazzi »

Well, the thing is, that (as someone already said): "code wins arguments".

Your idea is only as good as the implementation, that is, it doesn't really matter if someone says "OK, this looks good" or "oh, this seems to be suboptimal" - because even apparently suboptimal assumptions can result in very efficient code...

In any way it does not mean, that I'm calling Your plan 'suboptimal' - to make it clear - but I'm just waiting for the implementation ;)

Regards.
Odi profanum vulgus

User avatar
edbarx
Posts: 5401
Joined: 2007-07-18 06:19
Location: 35° 50 N, 14 º 35 E
Been thanked: 2 times

Re: Improving on my C (little?) understanding.

#19 Post by edbarx »

I think the code I am attaching can correctly find errors in Boolean expressions of the type:

a) (a+b).(c+d)
b) ((a+!d).(!c+e))
c) (!!!a+(!d)).(!!!!c+f)

It should also be able to use identifiers with more than one letter like:
a) _abc
b) _12b
c) abc2

The Algorithm:
After many hours of pulling my little left hair trying to make recursive descent parsing to work without going into too many complications, I decided to try an old parsing algorithm I created some 20 years ago. This algorithm is straightforward and works by verifying any token is preceded and succeeded by acceptable tokens. In cases a token cannot follow another if a certain other token is not present at a specified position, the algorithm makes deeper comparisons to detect erroneous tokens.

I am also including the source file for comments for all those who may want to comment.

Code: Select all

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define INVALID 0
#define VALID   1

typedef char* pchar;
pchar* tokens;             /* global variable to hold token list */
int token_count = 0;       /* global variable to keep track of token count */
int error_pos = 0;         /* global variable to hold error position */

#define token_delta 50

typedef struct {
  int start_bracket;
  int stop_bracket;
  int consecutive_brackets;
} t_bracket_values;

void tokenise(char* expression) {
  char ch;              /* character at iterator's position */
  char* atoken;         /* token to be added to token list  */
  char* start = NULL;   /* holds a char* to point to start of multicharacter token */
  char* string_token;   /* holds multicharacter token */
  int char_count = 0;   /* keeps track of number of characters in multicharacter token */
  
  int while_counter = 0;
  /* expr_len_and_null holds the total length of input string including null terminator */
  int expr_len_and_null = strlen(expression) + 1;
  while (while_counter++ <= expr_len_and_null) {
    ch = *expression++; 
    switch (ch) {
      case '(': 
      case ')':
      case '.':
      case '+':
      case '!':
      case '\0':
        if (start && char_count > 0) {
          string_token = malloc((char_count + 1)*sizeof(char));
          strncpy(string_token, start, char_count);
          tokens[token_count++] = string_token;
                  
          start = NULL;
          char_count = 0;
        }                
      
        if (ch == '\0') return;
        atoken = malloc(2*sizeof(char));
        atoken[0] = ch;
        atoken[1] = '\0';
        tokens[token_count++] = atoken;
        break;
        
      case '\n':
        break;  
        
      default:
        if (start == NULL)
          start = expression - 1;
        char_count++;  
    }
  }
}

void free_memory() {
  int k = token_count;
  
  while (--k >= 0) free(tokens[k]);
  free(tokens);
}

/****************PARSER*****************/

static char* getToken(pchar* tokens, int i) {
  if (i < token_count)
    return tokens[i];
    else return NULL;
}
#define alpha              "abcdefghijklmnopqrstuvwxyz"

/* succeeds ". + )" end */
#define alpha_cl_br        ")"alpha

/* succeeds "! alpha " */
#define oper_op_br         "!+.("

/* at the beginning of expression */
#define alpha_excl_op_br   "!("alpha

static int isOperand(pchar* tokens, int i) {
  char* atoken = getToken(tokens, i);
  
  if (atoken == NULL) 
    return 0;
  else {
    if (strlen(atoken) > 1)
      return 1;
      else return (strchr(alpha, atoken[0]) != NULL);
  }  
}

int getErrorPosn(pchar* tokens) {
  int i = -1;
  char prev_ch = '\0', ch;
  int error_pos1 = -1, error_pos2 = -1;
  char* atoken;
  
  if (token_count == 1) {
    if (!isOperand(tokens, 0))
      i = 0;
      
    error_pos = i;
    return error_pos;  
  }  
  
  while (++i < token_count) {
    atoken = getToken(tokens, i);
    ch = atoken[0];
    
    if (i > 0 && i < token_count - 1) {
      if ( (ch == '+' || ch == '.' || ch == ')') && 
           (prev_ch == ')' ||  isOperand(tokens, i - 1))  )
      {    
        prev_ch = ch;  
        continue;
      }
        
      if ( (ch == '!' || ch == '(' || isOperand(tokens, i)) &&
           strchr(oper_op_br, prev_ch) )
      {
        prev_ch = ch;
        continue;
      }
    }
    prev_ch = ch;
    
    /* at the end of token list */
    if ( (i == token_count - 1) && (ch == ')' || isOperand(tokens, i)) )
      continue;
    /* at the beginning of token list */  
    if ( (i == 0) && (ch == '!' || ch == '(' || isOperand(tokens, i)) )
      continue; 
      
    error_pos1 = i;
    break;   
  }
  
  int brackets = 0;
  for (i = 0; i < token_count; i++) {
    atoken = getToken(tokens, i);
    ch = atoken[0];
    
    if (ch == '(') brackets++;
      else if (ch == ')')
        brackets--;
        
    if (brackets < 0) {
      error_pos2 = i;
      break;
    }    
  }
  
  if (brackets > 0)
    error_pos2 = i;
  
  /* choose the earlier error for output */
  if (error_pos1 > error_pos2)
    error_pos = error_pos1;
    else error_pos = error_pos2;
    
  return error_pos;  
}


/***************************************/

int main() {
  extern int error_pos;

  char input[101];

  tokens = malloc(token_delta*sizeof(char*));
  fgets(input, 100, stdin);
  tokenise(input);
  
  int k = -1;
  /* print the tokens one per line */
  printf("token count is %d\n", token_count);
  while(++k < token_count)
    printf("%s\n", tokens[k]);
    
  getErrorPosn(tokens);
  if (error_pos > -1)
    printf("Error found at position %d\n", error_pos + 1);
    else printf("No error found.\n");

  free_memory();
  
  return 0;
}
Debian == { > 30, 000 packages }; Debian != systemd
The worst infection of all, is a false sense of security!
It is hard to get away from CLI tools.

tomazzi
Posts: 730
Joined: 2013-08-02 21:33

Re: Improving on my C (little?) understanding.

#20 Post by tomazzi »

The first thing which is visible in Your code is that You're not checking for NULL pointer when You're calling malloc().

Even when on today's systems this function is rather unlikely to fail, this is not an excuse - without checking the returned pointer the program can randomly crash (depending on the system load).

Currently I have no time for deeper analysis of this code, so that's just a first hint.

EDIT:
The parser seems to work - and that was the target ;)
Just a few more hints here:

Code: Select all

tokens = malloc(token_delta*sizeof(char*));
Size of pointer of any type is always the same as the size of void*
Type of pointer is used mainly in pointer arithmetic, f.e:

Code: Select all

int32_t *pint = 100; pint++ /* gives 104, not 101 */

Code: Select all

atoken = malloc(2*sizeof(char));
Size of char is always 1 because char is guaranteed to be represented by a single byte, so it doesn't make sense to use sizof(char).

Code: Select all

void tokenise(char* expression) ...
static char* getToken(pchar* tokens, int i) ...
In C, static function means that it is visible only for current translation unit, which means, only within "current source file" (to simplify). There's no point in declaring static function when there's only a single source file.

In void tokenise(char* expression):

Code: Select all

int expr_len_and_null = strlen(expression) + 1;
while (while_counter++ <= expr_len_and_null) {
...
if (ch == '\0') return;
...
}
You are checking for the null byte inside the loop, and the while_counter represents the length of the string, so the call to strlen() is redundant.

Regards
Odi profanum vulgus

Post Reply