Reading through large open source projects

hudson · #1 Post by **hudson** » 2015-07-17 04:36

Hello...I've been wondering about reading through large open source projects and if a single individual is capable of understanding one. I'll give examples: xorg, xterm, openbox, apache...large programs with hundreds of files in dozens of directories.

I'm fairly capable in Perl/Python and have read up on C...so I understand programming more or less. But, my head spins when I open the source for some of the above mentioned programs and I end up closing the directory fairly quickly.

So, I'm just curious if there are people out there who spend a week or a month or a year reading through a program of such scale. What does it take? ...how much effort is required? ...and where to start? I used to read a lot of books on programmers and read about a few who could maintain 10,000 lines of code, so I think that might be the max one person can deal with.

I'm sure, in the Debian project, there are many people out there who do just that...I would like to hear about their experience.

Thanks!

Head_on_a_Stick · #2 Post by **Head_on_a_Stick** » 2015-07-17 07:01

I use dwm...

empty@Debian ~ % cat dwm/dwm.c|wc -l
2072
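
For a project that spans many files, the same measurement can be scripted. The following is only a rough sketch: the function names `count_lines` and `largest_files` and the default `.c`/`.h` extensions are assumptions made for illustration, not part of dwm or any other project. It counts newline characters per file, as `wc -l` does, and sorts the files so the bulk of the code is easy to spot.

```python
import os


def count_lines(root, extensions=(".c", ".h")):
    """Return {relative_path: line_count} for source files under root."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                # Count newline bytes, which is what `wc -l` reports.
                with open(path, "rb") as handle:
                    counts[os.path.relpath(path, root)] = handle.read().count(b"\n")
    return counts


def largest_files(counts, top=10):
    """Sort files from largest to smallest and keep the first `top`."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical usage: survey the current directory.
    for path, lines in largest_files(count_lines(".")):
        print(f"{lines:8d}  {path}")
```

Line counts say nothing about quality, but they do give a first idea of where to start reading and which files to leave for later.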

hudson · #3 Post by **hudson** » 2015-07-17 07:10

@Head_on_a_Stick

+1

dasein · #4 Post by **dasein** » 2015-07-17 15:57

hudson wrote:What does it take?

Depends on the code

hudson wrote:how much effort is required?

Depends on the code

hudson wrote:and where to start?

By understanding the problem domain that the code was originally written to address. Without that understanding, slogging through code is like trying to work a jigsaw puzzle without any idea what the finished picture looks like.

That said, there is a jewel of an insight hidden in your current feelings of frustration. Despite knowing that maintenance is the vast majority of programming activity, despite knowing that it's harder (much harder) to read code than to write it, the sad fact is that only a tiny fraction of coders write for maintainability. Strive to become one of them.

hudson · #5 Post by **hudson** » 2015-07-17 18:32

One book I can highly recommend is "Coders at Work - Reflections on the Craft of Programming," which is a book of interviews. A question asked to each programmer was how do you read other people's code. The answer that struck me most was finding a thread and following the thread around until you understand the whole program.
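
As a rough illustration of "following a thread", here is a small sketch that builds a call graph and walks it outward from one entry point. It is my own toy, not something from the book: it only handles Python source (via the standard `ast` module), only top-level functions, and only direct calls by plain name. The function names `call_graph` and `follow_thread` and the sample program are made up for the example.

```python
import ast
from collections import deque


def call_graph(source):
    """Map each top-level function name to the plain names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            calls = []
            for sub in ast.walk(node):
                # Only direct calls such as foo(); method calls are ignored.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    if sub.func.id not in calls:
                        calls.append(sub.func.id)
            graph[node.name] = calls
    return graph


def follow_thread(graph, start):
    """Breadth-first walk from `start`, visiting only functions defined in the graph."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        name = queue.popleft()
        order.append(name)
        for callee in graph.get(name, []):
            if callee in graph and callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return order


# Hypothetical program to read: main -> parse -> tokenize, main -> render.
SAMPLE = """
def tokenize(): pass
def parse(): tokenize()
def render(data): pass
def unused(): pass
def main():
    data = parse()
    render(data)
"""

if __name__ == "__main__":
    print(follow_thread(call_graph(SAMPLE), "main"))
```

The reading order it prints is just the order in which one would open functions when starting at `main`; anything never reached from the entry point (like `unused` here) can wait.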

On the other hand, some old Bell programmer said show me your data structures and I'll understand your program. So maybe reading header files is the place to start?
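
If header files are the place to start, one crude way to get an overview of the data structures is to list the named `struct` definitions they contain. This is only a sketch under loud assumptions: it uses a regular expression rather than a real C parser, so it misses anonymous `typedef struct { ... }` definitions and can be fooled by comments or macros. The name `struct_names` and the sample header are hypothetical.

```python
import re

# Matches "struct Name {" but not forward declarations like "struct Name;".
STRUCT_DEF = re.compile(r"\bstruct\s+([A-Za-z_]\w*)\s*\{")


def struct_names(header_text):
    """Return the names of struct definitions in the order they appear."""
    return STRUCT_DEF.findall(header_text)


# Hypothetical header contents, used only for illustration.
SAMPLE_HEADER = """
struct Monitor;                 /* forward declaration, skipped */
struct Client {
    int x, y;
    struct Monitor *mon;
};
typedef struct Arg { int i; } Arg;
"""

if __name__ == "__main__":
    print(struct_names(SAMPLE_HEADER))
```

For real projects a tool that actually parses C will be more reliable; this only shows the idea of skimming for the data structures first.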

As a side note, in Coders at Work, one guy was talking about an employee (from MIT) he had to lecture about not writing 50 page functions!

@dasein - it is true, I've always heard that reading code is much harder than writing it.

dasein · #6 Post by **dasein** » 2015-07-18 11:17

Sounds like maybe you're interested in the subject of psychology of programming (PoP) in general. There was a series of workshops called Empirical Studies of Programmers (ESP) a few years back. IIRC, some of that early research focused specifically on strategies for program comprehension.

From those early ESP workshops came PPIG (http://ppig.org)

Idle musing: I have zero data to support this hunch, but I'd bet a day's pay that the information visualization literature also has lots of insights potentially applicable to code comprehension. (I'm thinking in particular of Ben Shneiderman's InfoViz "mantra": Overview first, zoom and filter, then details on demand.)

Afterthought: Much more philosophically focused but still well worth a read is Winograd & Flores' Understanding Computers and Cognition

マーズ maazu · #7 Post by **マーズ maazu** » 2015-07-19 03:41

I think bzr, git or Mercurial enables one to follow large open source projects that span many, many files.
They work like a database for classes, functions etc., so you can choose which bits and pieces of the code you want to read, understand or modify, then push your modifications back to the project.

hudson wrote:Hello...I've been wondering about reading through large open source projects and if a single individual is capable of understanding one. I'll give examples: xorg, xterm, openbox, apache...large programs with hundreds of files in dozens of directories.

I'm fairly capable in Perl/Python and have read up on C...so I understand programming more or less. But, my head spins when I open the source for some of the above mentioned programs and I end up closing the directory fairly quickly.

So, I'm just curious if there are people out there who spend a week or a month or a year reading through a program of such scale. What does it take? ...how much effort is required? ...and where to start? I used to read a lot of books on programmers and read about a few who could maintain 10,000 lines of code, so I think that might be the max one person can deal with.

I'm sure, in the Debian project, there are many people out there who do just that...I would like to hear about their experience.

Thanks!

hudson · #8 Post by **hudson** » 2015-07-19 04:54

Thanks everyone. I do enjoy reading about the psychology of programming and the whole field of Human-Computer Interaction. Especially I like stuff from the 60's and 70's when the field was new.

And the idea of an IDE/revision control system is something to look into. I just read an interview with Hal Abelson where he talks about the scale of programs at Google and how it is impossible to deal with such large programs without the help of something like Eclipse (for Java).

Here are a bunch of links I'm going to check out based on the above comments:
http://ppig.org/library
https://www.youtube.com/watch?v=xjOxWKeHWDM Terry Winograd and Fernando Flores (in English after first four minutes)
https://www.youtube.com/watch?v=ZYLWHRa8Et4 Ben Shneiderman
http://codequarterly.com/2011/hal-abelson/ Hal Abelson Q&A (same author as Coders at Work)

dasein · #9 Post by **dasein** » 2015-07-19 13:22

hudson wrote:I do enjoy reading about the psychology of programming and the whole field of Human-Computer Interaction. Especially I like stuff from the 60's and 70's when the field was new.

In that case, you are probably already familiar with the things I'm about to suggest, but just in case...

- Almost anything written by the late Doug Engelbart (in particular, Augmenting Human Intellect). And the Mother of All Demos is well worth the time if you have the patience to sit through grainy 50-year-old video. (Especially when you remind yourself that the term HCI wouldn't be coined for almost another decade.)

- Card, Moran, & Newell's Psychology of HCI is a bit dated, but if you have access to a university library, it's a good read, especially for historical interest/value. I still find their Model Human Processor (MHP) a handy tool for thinking about low-level mechanics of user interaction.

- A second shout-out to Understanding Computers and Cognition. It's short but very dense, and anyone interested in HCI should be able to say s/he's read it at least once. (I'm sure the YouTube video is very nice, and listening to almost anything Terry Winograd has to say is time well spent. But there is no substitute for actually reading this classic work.)

tomazzi · #10 Post by **tomazzi** » 2015-07-19 19:11

hudson wrote:I used to read a lot of books on programmers and read about a few who could maintain 10,000 lines of code, so I think that might be the max one person can deal with.

Seriously? Who wrote that?

Number of lines of code has nothing to do with the ability to understand or maintain the project.
What really matters is the quality of code and whether You can understand what the code is supposed to do.

10,000 LOC is a rather small project. A program with only 1,000 lines of code can be completely unreadable and unmaintainable if it's written badly.

Another thing is that complex projects need special tools, e.g. advanced programmer's editors with syntax highlighting, or IDEs.

The Linux kernel, for example, has a great online cross-referenced source browser:
http://lxr.free-electrons.com/

It's really easy to check what's going on in the kernel, if You're interested ofc.
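
To give a feeling for what a cross-referenced browser does, here is a toy sketch of an identifier index. It is not how the site above is implemented; it simply maps every identifier-looking word to the files and line numbers where it occurs. The name `build_index` and the sample sources are assumptions for illustration, and the index deliberately does not filter out language keywords such as `int`.

```python
import re
from collections import defaultdict

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def build_index(sources):
    """sources: {filename: text}. Return {identifier: [(filename, line_number), ...]}."""
    index = defaultdict(list)
    for filename, text in sources.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            # set() so a name used twice on one line is recorded once.
            for name in set(IDENTIFIER.findall(line)):
                index[name].append((filename, line_number))
    return index


# Hypothetical two-file project.
SOURCES = {
    "a.c": "int foo(void);\nint bar(void) { return foo(); }\n",
    "b.c": "void baz(void) { foo(); }\n",
}

if __name__ == "__main__":
    for location in build_index(SOURCES)["foo"]:
        print(location)
```

Looking up a name then answers "where is this defined and who uses it?" in one step, which is most of what makes such browsers useful for large trees.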

Regards.

hudson · #11 Post by **hudson** » 2015-07-20 03:26

@dasein I saw "the Mother of All Demos" once upon a time. I also like Alan Kay and find all the Bell Labs/Xerox PARC stuff very instructive. Thanks for your other recommendations and I'll definitely look into them.

@tomazzi wait, my mistake...the quote was:

I essentially wrote the whole kernel. This was done in assembly language. We’re talking about programs that are getting to be a little larger here, probably on the order of 10,000 lines of assembler.

- L Peter Deutsch

Well, seriously, I have no idea what people are capable of. I looked online and I guess it depends...C or C++ or GUI with a builder and lots of reuse. There are some very bright people out there...I was reading about the guys from BBN who first set up ARPANET and they wrote and debugged the Interface Message Processors (packet switching software) without a terminal...unbelievable to me!

tomazzi · #12 Post by **tomazzi** » 2015-07-22 02:25

hudson wrote:
I essentially wrote the whole kernel. This was done in assembly language. We’re talking about programs that are getting to be a little larger here, probably on the order of 10,000 lines of assembler.

- L Peter Deutsch

I can't see how You arrived at a maintainability limit of 10,000 LOC for a single programmer (in the context of the above quote)...
I suppose that "assembler" is the magic word here, which for most people could be replaced with the word "Chinese", giving the same effect...

When code is well commented, and the program is well designed, then it doesn't really matter what language is used (in terms of readability or maintainability).

The only difference is that assembler has absolutely no limits

Regards.

runfrodorun · #13 Post by **runfrodorun** » 2015-08-26 01:02

Sorry to resurrect old thread, just saw the first couple sentences and had to respond:

No one human understands how all of xorg works. I think the minimal spanning set is like 4 guys on the dev team, it's freaking huge, been updated for 30+ years and is very very hard to understand. So definitely and certainly no, I would venture to say there is no chance anybody's going to read that and understand it unless they're a child prodigy or have their entire life to give to that, and just reading it wouldn't be enough you'd have to get in and get your hands dirty.

Interesting to me anyway

-RJ

tomazzi · #14 Post by **tomazzi** » 2015-08-26 21:09

runfrodorun wrote:No one human understands how all of xorg works.

Oh really? So how would You explain the fact that it is still developed and it is getting patches from so many different developers?

runfrodorun wrote:it's freaking huge, been updated for 30+ years and is very very hard to understand. So definitely and certainly no, I would venture to say there is no chance anybody's going to read that and understand it unless they're a child prodigy or have their entire life to give to that, and just reading it wouldn't be enough you'd have to get in and get your hands dirty.

What about the Linux kernel? - there are at least a few tens of hundreds of people who do understand *exactly* how it works...

The point is that the transparency of the code is tightly tied to the ability to understand its underlying dependencies, e.g. You have to know how the CPU works, the ACPI interface, the MMU, the bootstrap procedure for various CPUs/architectures, what the connectivity is between the video frame buffer and the system memory (or the memory bus), etc., etc.

I mean, that the most problematic part in understanding the code is to know what it is supposed to do - and if You know this, then the code might be at best complex, but not hard to understand ...

...or more generally: there is no such thing as a "hard task" - there can be only time-consuming tasks, which You do understand - in other cases there are only falsely proclaimed "hard tasks" - where people are just unsure about how to deal with them... (in both cases it's just a matter of time ... and money)

Regards.

runfrodorun · #15 Post by **runfrodorun** » 2015-08-29 02:46

tomazzi wrote:Oh really? So how would You explain the fact that it is still developed and it is getting patches from so many different developers?

Simple... it takes many developers to understand how that works. There was an article I read a few years ago that described a few of the core developers and maintainers of xorg, and you can't become an expert in an area of the code if you're stuck on the big picture. I'm sure there are plenty of people who understand the overall architecture, but that only gets you so far.

tomazzi wrote:What about the Linux kernel? - there are at least a few tens of hundreds of people who do understand *exactly* how it works...

No there aren't.

I'm sure there are many people who know a good deal about how certain areas work. Again, I have no doubt that there are a few select people that are very knowledgeable about it, Torvalds to cite one (because he's owned most of the development that has happened on that for the 24 years it has been around). All of the work I have done on the kernel has been simpler than the stuff you see in xorg. Xorg has a lot of 'artifacts' from all of the changes in graphics technology from the last 35 years, and believe me things have changed a lot. Think about how drastically different the methods of accel are.

For you to have an expert understanding of 'exactly' how an entire operating system kernel works it has to be a lot more than your job. Citation: Software developer, developed 2 O/S kernels from the ground up in assembly language, linux and illumos kernel contributor.

Being good at reading code is one thing; being good at reading somebody else's code is different.

runfrodorun · #16 Post by **runfrodorun** » 2015-08-29 03:14

Just one more thing to help shed a little more light on this:

Linux has ~8mil lines of _implementation_ code (excluding header files, documentation, and anything else that would go along with it) that is to say .c files only.

If you spent every day for ten years no vacations no weekends, you'd have to read about 2200 lines of code and understand it crystal clear every day. It's possible for a prodigy with no life maybe, but that's ten years and those are some pretty unreasonable circumstances. Anyone who has a life might not be doing that
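
The figure checks out. A quick sketch of the arithmetic, taking the ~8 million line estimate above as given (the function name `lines_per_day` and the 250 working days per year are my own assumptions):

```python
def lines_per_day(total_lines, years, days_per_year=365):
    """Lines that must be read each day to finish total_lines in the given years."""
    return total_lines / (years * days_per_year)


if __name__ == "__main__":
    # Every single day for ten years: 8,000,000 / 3,650
    print(round(lines_per_day(8_000_000, 10)))        # 2192
    # Assuming roughly 250 working days a year instead: 8,000,000 / 2,500
    print(round(lines_per_day(8_000_000, 10, 250)))   # 3200
```

So "about 2200 lines a day" holds only with no days off; with ordinary working days it is closer to 3200.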

If you want to read supporting materials to help your understanding, forget it.

Also assumes not having to re-read code as you read other code that calls other functions you forgot about. Hey, it's 8 million lines!

The GPL is like a black hole for linux kernel code. More than half of the changes made today are made by companies, not volunteer contributors. Even microsoft has had some stakes in developing certain features in linux. You can bet that Torvalds might not even know some things that are in there. Scary? probably not... I like to think somebody's looking at it, because these days it aint me

-RJ

tomazzi · #17 Post by **tomazzi** » 2015-08-29 18:34

runfrodorun wrote:If you spent every day for ten years no vacations no weekends, you'd have to read about 2200 lines of code and understand it crystal clear every day. It's possible for a prodigy with no life maybe, but that's ten years and those are some pretty unreasonable circumstances. Anyone who has a life might not be doing that If you want to read supporting materials to help your understanding, forget it.

Also assumes not having to re-read code as you read other code that calls other functions you forgot about. Hey, it's 8 million lines!

That way of thinking is typical for people who actually don't write any code. Writing code is just some abstraction for them, and reading that huge number of lines of "strange" text appears to be an unimaginably hard task...

For a professional programmer, reading code is like looking at a picture - You need maybe a few seconds to realize what it shows. After writing hundreds of thousands of lines of code, reading code becomes easier than reading a book in Your native language.

runfrodorun wrote:The GPL is like a black hole for linux kernel code.

Without GPL Linux kernel would have died as a project 25 years ago.

runfrodorun wrote:(...) Even microsoft has had some stakes in developing certain features in linux. You can bet that Torvalds might not even know some things that are in there. Scary? probably not... I like to think somebody's looking at it (...)

Microsoft added drivers for their Hyper-V machines to improve the performance of GNU/Linux systems running on top of Windows Server. They were simply forced to by their customers - that's the simple truth.
This is actually a good, non-intrusive piece of code. Some people were yelling that Microsoft had infected/taken over the Linux kernel in some way, while the situation is exactly the opposite.

Every patch is discussed and the code is reviewed before it goes to kernel - no need to worry.

GarryRicketson · #18 Post by **GarryRicketson** » 2015-08-29 18:43

@by tomazzi
+100

Just a short comment (I couldn't think of a better way to explain or say what tomazzi said; very well put): compilers and other software make it possible to "scan" large amounts of code, with no need to "manually" read every single "bit".
In other words, the computer does most of the "work".
Essentially this is what many "virus scanners" do, but the same kind of programs can be modified to scan for other types of undesirable code as well.
Normal users would find it equally "impossible" to manually scan every single file in the system, looking for a piece of code, a "kiddie script", that is a virus, malware, worm, etc.
In a nutshell, it is quite possible, and necessary, to go over millions of lines of code in order to locate "bugs", etc. That is also why any "big" program or OS requires "teams"; usually it is more than just one person doing this work.

runfrodorun · #19 Post by **runfrodorun** » 2015-08-30 21:32

tomazzi wrote:That way of thinking is typical for people who actually don't write any code. Writing code is just some abstraction for them, and reading that huge number of lines of "strange" text appears to be an unimaginably hard task...

For a professional programmer, reading code is like looking at a picture - You need maybe a few seconds to realize what it shows. After writing hundreds of thousands of lines of code, reading code becomes easier than reading a book in Your native language.

You're oversimplifying this. Good code is treated like a black box, yes, but each of those lines is meaningful and just being able to read them and know what they mean is not enough. My experiences in graduate level math taught me that. If you have good modularization, then the number of lines of code is actually more staggering, not less, because you don't have the redundancy.

TLDR: knowing what it means does not mean you know why it's there. Knowing how to use it doesn't tell you how it works.

Perhaps we're disagreeing on our definitions of understand, or reading through.

You don't have to believe me but you should

20 years of experience. Every piece of software is different, I have an easier time reading some software than others, but looking at a picture shows the big picture. and even then, sometimes looking at a picture can give you jack. xorg would be an example of that. Understanding the big big picture is about all you're going to get from reading the code, and that's not very helpful.

Also depends how many quick hacks were put in to get things working. You won't understand details like this.

For programmers that think very differently, and for design patterns that you've never seen before, things can get murky pretty quickly. To cite another quick example to explain what I'm talking about, #define macros can make things very murky without running the preprocessor.

tomazzi wrote:
runfrodorun wrote:The GPL is like a black hole for linux kernel code.
Without GPL Linux kernel would have died as a project 25 years ago.

This was actually the point I was trying to make, I didn't say it was a bad thing. GPL fan here. (black holes accumulate mass rapidly, that was what I was hinting at!)

tomazzi wrote:
runfrodorun wrote:(...) Even microsoft has had some stakes in developing certain features in linux. You can bet that Torvalds might not even know some things that are in there. Scary? probably not... I like to think somebody's looking at it (...)
Microsoft added drivers for their Hyper-V machines to improve the performance of GNU/Linux systems running on top of Windows Server. They were simply forced to by their customers - that's the simple truth.
This is actually a good, non-intrusive piece of code. Some people were yelling that Microsoft had infected/taken over the Linux kernel in some way, while the situation is exactly the opposite.

Every patch is discussed and the code is reviewed before it goes to kernel - no need to worry.

I haven't forgotten. My point is you'd be surprised who contributes, not oh no scary microsoft. but thanks anyway.

GarryRicketson wrote:@by tomazzi
+100
Just a short comment (I couldn't think of a better way to explain or say what tomazzi said; very well put): compilers and other software make it possible to "scan" large amounts of code, with no need to "manually" read every single "bit".
In other words, the computer does most of the "work".
Essentially this is what many "virus scanners" do, but the same kind of programs can be modified to scan for other types of undesirable code as well.
Normal users would find it equally "impossible" to manually scan every single file in the system, looking for a piece of code, a "kiddie script", that is a virus, malware, worm, etc.
In a nutshell, it is quite possible, and necessary, to go over millions of lines of code in order to locate "bugs", etc. That is also why any "big" program or OS requires "teams"; usually it is more than just one person doing this work.

Perhaps we are on different wavelengths here -- let me take a second to explain where I'm coming from.

Say you're on a massive development project. Say you're new to the team (You're supposed to keep new hires tacked on to big projects to a minimum in a business world, here's why) You're charged with creating a certain feature in the code. You take ownership of the backlog item, feature development log, whatever your company or partnership calls it. Now, you don't have a clue what the overarching design of the software is, you don't know where to look, how many places to look, and then once you understand those basics (often it can take months of learning for the largest of projects if you have no support from your coworkers) now each little piece of code has its own patterns, its own rules, its own assumptions that it makes on its calls and callers. How are you to make an informed change that is consistent with the design? It is not so simple as looking at a 'big picture.' Understanding code is hard, reading it maybe not as much (but still d___ hard for some!)

Also- not all features are created equally. When I was working on a filesystem driver that I didn't design I ended up putting in little hacks here and there that were probably pretty safe things to do. Changing the defaults, bit patterns for file headers, etc. but that's not an understanding of the code. When I worked on linux (back in 2.4 and early 2.6) I did NOT have a clear understanding of how everything in the kernel worked. I learned what I needed to know to work on my drivers, and only what I needed to know. NOT a clear understanding of the whole thing. Nobody's got time for that.

TLDR: You can understand what you need to know to do your job, this is NOT a clear understanding of the source base. When you need to do a big refactor, you are dead in the water having just skimmed the code as you suggest.

Prove me wrong... find out what the developers were thinking when they created each function in the linux kernel system calls, and how when you change something it's not going to break their model.

I hope I'm making sense, but probably not as per usual.

-RJ

edit: clarification

tomazzi · #20 Post by **tomazzi** » 2016-01-10 02:46

runfrodorun wrote: Say you're on a massive development project. Say you're new to the team (You're supposed to keep new hires tacked on to big projects to a minimum in a business world, here's why) You're charged with creating a certain feature in the code. You take ownership of the backlog item, feature development log, whatever your company or partnership calls it. Now, you don't have a clue what the overarching design of the software is, you don't know where to look, how many places to look, and then once you understand those basics (often it can take months of learning for the largest of projects if you have no support from your coworkers) now each little piece of code has its own patterns, its own rules, its own assumptions that it makes on its calls and callers. How are you to make an informed change that is consistent with the design? It is not so simple as looking at a 'big picture.' Understanding code is hard, reading it maybe not as much (but still d___ hard for some!)

I'm surprised, but somehow I've missed Your reply...
I've marked bold fragments of Your post which are interesting for me.

1. "You're charged with creating a certain feature in the code" and You do what? - There's only one way to go: read and understand the source (it may be a hard task, but it's unavoidable).
2. If you can't "get a clue" *after* reading the sources, then there are 2 possible ways to go:
a) The sources are written so badly that the best way to go is to not waste Your time, at least not at the given salary... (raise it!)
b) Make an agreement in which You state that the sources are shitty, but that they can be turned into (or replaced with) another solution which will work in this particular case.

3. "...now each little piece of code has its own patterns, its own rules, its own assumptions" - if this is the case, leave this shitty company as soon as possible...

Regards.

Debian User Forums
