
OpenAFS and Kerberos for Windows Development

Public Journal
Notes from the primary developer of OpenAFS and Kerberos for Windows.
Wednesday, May 14, 2008
Subject: File System Internationalization sucks
Time: 12:21:43 PM EDT
Author:  secureendpoints



Internationalization in file systems really sucks.  There are two perspectives in the world.  First, there are the POSIX proponents who believe that names are simply NUL-terminated octet sequences that have no meaning except to the application that created them.  Second, there are those who believe that names should be portable between systems and therefore should all be encoded in a common character set.  Let's call this second group of folks the UNICODE camp.

I fall into the UNICODE camp.  This is most likely a side effect of having spent nearly fifteen years of my life working on Kermit, an application and file transfer protocol designed specifically to move files (by name) between computer systems using different architectures and locales.  I learned very early on that if you followed the POSIX approach the end result when a file is copied from an EBCDIC system to an ASCII system or a Latin-1 system to a CP437 system is gibberish.  Not only for human beings but for the applications as well.

A globally accessible file system such as AFS is in many regards similar to Kermit except that instead of copying files into a local file system from a remote system, the AFS client makes the entire remote file system accessible to the local machine.    The exact same character set conversion issues occur.  As long as all of the file names are in the same character set all is dandy and applications on one machine can access files created on another machine.

But what happens when the character sets are different?  In that circumstance, the names become gibberish to humans and applications.  In a worst case scenario, the file name as stored in the directory cannot even be represented on the local machine because the file name contains illegal code points according to the rules of the local environment.

This situation doesn't happen as frequently as it could because most of the world still stores only US-ASCII or ISO-Latin-1 in the file system.  However, even with those restrictions there are still problems.  For example, the following characters are illegal in file names on Windows systems:

  " / \ * ? < > | :

It doesn't matter what the underlying file system is.  If those characters are in the name, the name is illegal.  Any name with those characters will not be included in the directory listing.
This in turn means it is impossible to see the file, access the file, rename the file, delete the file, or delete the directory the file is located in.  File systems that include objects with such names must perform name translation in order for the Windows users or applications to be able to manipulate them.
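
A minimal sketch of what such a name translation could look like.  The %XX escape scheme and the function names are assumptions made up for illustration; they are not the mapping that OpenAFS or any other file system actually uses.

```python
# Hypothetical name translation for Windows clients.  The escape scheme is an
# illustrative assumption, not the mapping used by any real file system client.
WINDOWS_ILLEGAL = set('"/\\*?<>|:')

def to_windows_name(name):
    """Replace each character Windows rejects with a %XX escape.

    '%' itself is escaped as well so that the translation can be reversed.
    """
    out = []
    for ch in name:
        if ch in WINDOWS_ILLEGAL or ch == "%":
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

def from_windows_name(name):
    """Undo to_windows_name(); assumes the input was produced by it."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == "%" and i + 2 < len(name):
            out.append(chr(int(name[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(name[i])
            i += 1
    return "".join(out)
```

With a reversible mapping the client can show a legal name locally and still recover the name that is stored in the directory when it talks to the server.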

With the introduction of Unicode another set of complications is introduced.  Unicode provides for multiple semantically equivalent encodings of the same string based upon whether composed or decomposed sequences are used.  For historical reasons, Mac OS X stores its file names using a UTF-8 encoding of decomposed Unicode sequences, while Microsoft Windows and Linux store composed sequences, and the sequences produced for a given string can differ from one platform to the next.  That means that a user who types the same string on all three platforms can obtain a different octet sequence on each platform.  So much for interoperability.
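
The composed versus decomposed difference is easy to see with Python's standard unicodedata module.  The file name below is just an example; NFC and NFD are the Unicode names for the composed and decomposed normalization forms.

```python
import unicodedata

# "résumé.txt" written with the precomposed character U+00E9
name = "r\u00e9sum\u00e9.txt"

composed = unicodedata.normalize("NFC", name)    # é stays one code point
decomposed = unicodedata.normalize("NFD", name)  # é becomes 'e' + U+0301

# The same visible string yields two different UTF-8 octet sequences.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
```

A file system that compares names octet by octet will treat these as two different files even though they look identical to the user.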

The POSIX supporters make the claim that names must be treated as octet strings because the locale between two different processes on the same machine can be different.  All that tells me is that POSIX allows users to shoot themselves in the foot.  It doesn't mean it is right.  Of course, the POSIX folks do have a point.  If a UNIX system is incapable of communicating the character set that is being used to the file system, how is the file system supposed to do something sane with it to provide for interoperability between heterogeneous environments?

Microsoft Windows has an advantage here in that there is a standard character set for the entire operating system and all file systems: Unicode.  As a result, a file system client on Windows can at least ensure that Unicode names are normalized on output, that directory entry names are normalized for display and lookup, that all illegal characters are mapped to something legal, and that all strings communicated to the file server are the original directory entry names and not the normalized names used locally.  This is the approach that will be taken as Unicode support is added to the OpenAFS for Windows client.
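
A rough sketch of that idea: look names up by a normalized key, but always hand the original directory entry name back for use on the wire.  The class and method names are hypothetical, the choice of NFC is an assumption, and the sketch assumes the server stores UTF-8 and ignores the case where two entries normalize to the same key.

```python
import unicodedata

class DirectoryView:
    """Hypothetical client-side view of one directory.

    Keys are NFC-normalized names used for display and lookup; values are
    the original octet sequences exactly as stored on the file server.
    """

    def __init__(self, server_names):
        self._by_key = {}
        for original in server_names:
            # Assumption: the server-side names are valid UTF-8.
            key = unicodedata.normalize("NFC", original.decode("utf-8"))
            self._by_key[key] = original

    def listing(self):
        """Normalized names, as they would be shown to the user."""
        return sorted(self._by_key)

    def server_name(self, typed):
        """Original entry name to send to the file server for a typed name."""
        return self._by_key[unicodedata.normalize("NFC", typed)]
```

For example, a file created on a system that stores decomposed names can still be opened by a user who types the composed form:

```python
view = DirectoryView([b"cafe\xcc\x81.txt"])
wire_name = view.server_name("caf\u00e9.txt")  # the stored, decomposed octets
```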



Written by secureendpoints Permalink | Blog about this entry
This entry has 0 comments: Add your own

Wednesday, March 12, 2008
Subject: OpenAFS joins Google Summer of Code 2008
Time: 8:39:58 PM EDT
Author:  secureendpoints



Today OpenAFS submitted an application to take part in the 2008 Google Summer of Code.  OpenAFS project ideas are listed at http://www.openafs.org/gsoc.html.

Thanks to Asanka Herath, Matt Benjamin, Simon Wilkinson and Derrick Brashear for volunteering to be mentors to the next generation of OpenAFS developers.

Update: Monday 17 March 2008, OpenAFS was accepted.




Tuesday, March 4, 2008
Subject: OpenAFS vs Norton Internet Security 2008
Time: 10:59:00 PM EST
Author:  secureendpoints



OpenAFS requires several rules to be set in order to work with Norton Internet Security 2008.

1. Under "Personal Firewall->Program Control" add an "Allow" rule for "C:\Program Files\OpenAFS\Client\Program\afsd_service.exe"
2. Do the same for "fs.exe", "aklog.exe", and other command line utilities if so desired.
3. Under "Personal Firewall->Trust Control, Trusted tab", add a "Trusted" rule for "02-00-4C-4F-4F-50".
4. Under "Personal Firewall->Advanced Settings" press the "Configure" button.
5. Add a new rule:
    "Allow", "Inbound", "Any computer", "Protocol: UDP", "Port 7001", and describe it as "AFS Callback Port".  Make it the first rule in the list.
6. Add a new rule:
    "Allow", "Outbound", "Any computer", "Protocol: UDP", "Port range: 7001-7008" and describe it as "AFS Server Ports".  Make it the second rule in the list.

Finally, double check the configuration of the "Microsoft Loopback Adapter" labeled "AFS" in the Network Control Panel.  Make sure that "TCP/IP" is checked, that "Client for Microsoft Networking" is checked, and that "File and Printer Sharing" is not checked.

You should now be able to access "\\afs\all" in the Explorer Shell.







Sunday, March 2, 2008
Subject: I want my OpenAFS Windows client to be fast
Time: 10:46:21 AM EST
Author:  secureendpoints



There are a number of configuration knobs available to tune the OpenAFS for Windows client.  The most important ones for throughput fall into two categories:

How much data can I cache?
CacheSize
Stats

How Fast Can I Read and Write?
BlockSize
ChunkSize
EnableSMBAsyncStore
SMBAsyncStoreSize
RxMaxMTU
SecurityLevel
TraceOption

All of these options are described in Appendix A of the Release Notes.  Here are the values I use:

CacheSize = 60GB (64-bit)  1GB (32-bit)
Stats = 120,000 (64-bit)  30,000 (32-bit)

BlockSize = 4
ChunkSize = 21 (2MB)
EnableSMBAsyncStore = 1
SMBAsyncStoreSize = 262144 (but would use 1MB if I didn't use cellular networks as often)
RxMaxMTU = 9000
SecurityLevel = 1 (when I need speed I use "fs setcrypt" to adjust on the fly)
TraceOption = 0 (no logging)
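
The "(2MB)" note next to ChunkSize suggests that the value is a power-of-two exponent.  Assuming that reading is right, the arithmetic behind the numbers above works out as follows (the variable names are only for illustration; consult Appendix A of the Release Notes for the authoritative units).

```python
# Assumption: ChunkSize is log2 of the chunk size in bytes, as the
# "21 (2MB)" annotation suggests.
chunk_size_setting = 21
chunk_bytes = 1 << chunk_size_setting            # 2 MB

smb_async_store_size = 262144                    # bytes, i.e. 256 KB

# Number of asynchronous stores needed to fill one chunk at these settings.
stores_per_chunk = chunk_bytes // smb_async_store_size
```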





Saturday, March 1, 2008
Subject: Problems Discovered when Profiling the OpenAFS Windows client
Time: 5:09:37 PM EST
Author:  secureendpoints
Mood:  Ecstatic



I have spent the last month analyzing the performance of the OpenAFS for Windows cache manager using the profiling tools in Sysinternals Process Monitor.  The results were quite eye-opening.  What I had believed was a highly parallelized code set instead was filled with bottlenecks that seriously hampered the ability to process data at high rates.  What follows are some of the most significant issues that were uncovered.  Some of the issues are specific to AFS, others are likely to be problems found in many other applications.

Reference Counts
Each of the objects maintained by the cache manager (data buffers, vnodes, cells, SMB file handles, directory searches, users, etc.) is reference counted in order to determine when it should be garbage collected or can be recycled.  Reference counts must be incremented and decremented in a thread-safe manner.  Otherwise races between the threads when they update the reference count will result in the count becoming inconsistent.  Objects will either be freed prematurely (undercount) or never become available for recycling (overcount).  Reference counts were therefore protected by the same read/write locks that protect the hash tables used to find and enumerate the objects.  The problem is that although a read lock can be used to safely traverse a hash table's linked list, a write lock is required to safely update the reference count of the desired object once it is located.  As a result, only one thread can be searching for objects or releasing them at a time.

If it were possible to adjust the reference count values in an atomic operation, most of the hash table transactions that required write locks could use read locks instead.  As it turns out, Windows supports Interlocked increment and decrement operations for aligned 32-bit and 64-bit values.  By making use of the Interlocked operations, reference counts are safely adjusted and parallel access to hash table contents is permitted.

Network Reads
The AFS servers support Rx hot threads.  As soon as a message is received by the listener thread, another thread is woken to listen for the next incoming message while the current thread becomes a worker to process the message.  The AFS clients did not support Rx hot threads and therefore could only process a single incoming message at a time.  By activating Rx hot threads in the AFS client the latency between received messages was significantly reduced.

Lock Churn
Throughout many code paths the same lock or mutex would often be released and re-obtained.  Doing so increases the possibility that the current thread will be swapped out and an alternate thread activated.  These context switches between threads are expensive and increase the overall clock time required to respond to a request.  By refactoring the code it was possible to avoid many such transitions thereby improving overall performance.

Write-to-Read and Read-to-Write Lock Transitions
Similar to the previous case, there are many situations in which it is desirable to either downgrade or upgrade a read-write lock.  Write-to-Read transitions are always safe to perform and can be done without forcing a context switch between threads in all cases.  Read-to-Write transitions can be done without a context switch whenever the requesting thread is the only reader.  Regardless of how often that is the case, a read-to-write transition will be cheaper than dropping the read lock and requesting a write lock.

Equality comparisons must be cheap
The function used to determine if two File Identifiers are the same is one of the most frequently called functions.  It is used every time a vnode or buffer must be located.  As a result it must be fast.  Instead of comparing each of the elements of a FID, the structure was extended with a hash value that can eliminate the vast majority of false matches with a single comparison.  In addition, the function was inlined to avoid the function call overhead.
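
A sketch of the hash-first comparison.  The field names (cell, volume, vnode, unique) and the mixing function are assumptions chosen for illustration; the actual cache manager code is C and its hash is not shown here.

```python
class Fid:
    """Hypothetical File Identifier with a precomputed hash value."""
    __slots__ = ("cell", "volume", "vnode", "unique", "hash")

    def __init__(self, cell, volume, vnode, unique):
        self.cell = cell
        self.volume = volume
        self.vnode = vnode
        self.unique = unique
        # Computed once, when the FID is created; illustrative mixing only.
        self.hash = (cell ^ (volume * 31) ^ (vnode * 131) ^ unique) & 0xFFFFFFFF

def fid_equal(a, b):
    """Reject most mismatches with a single integer comparison."""
    if a.hash != b.hash:
        return False
    # Equal hashes do not prove equality, so confirm field by field.
    return (a.cell == b.cell and a.volume == b.volume and
            a.vnode == b.vnode and a.unique == b.unique)
```

The full comparison only runs for true matches and for the rare pair of different FIDs that happen to share a hash value.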

Do Not Perform Unnecessary Work
The AFS client has an extensive logging infrastructure which is disabled by default.  However, it turns out that although the actual logging was disabled a majority of the work that is required to construct the log messages continued to be performed.  This unnecessary work was a significant drain on resources and increased clock time for all operations.
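
The same trap exists in most logging frameworks.  A small illustration using Python's standard logging module (the logger name and helper functions are made up for the example):

```python
import logging

log = logging.getLogger("afs.sketch")   # hypothetical logger name
log.setLevel(logging.WARNING)           # DEBUG logging is disabled

build_count = [0]  # counts how often the expensive message is constructed

def describe_buffer(offset, length):
    """Stand-in for the costly work of formatting a log message."""
    build_count[0] += 1
    return "buffer offset=%d length=%d" % (offset, length)

def store_eager(offset, length):
    # The message is built first and only then discarded by the logger.
    log.debug(describe_buffer(offset, length))

def store_guarded(offset, length):
    # The message is built only if DEBUG output would actually be emitted.
    if log.isEnabledFor(logging.DEBUG):
        log.debug(describe_buffer(offset, length))
```

With logging disabled, store_eager still pays for describe_buffer on every call while store_guarded does not.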

Do Not Perform Unnecessary Work - Part II
When copying a file on top of an existing file, the first operation that is performed is to truncate the file.  This results in the invalidation of all the cached data buffers associated with the file.  The actual truncation is not sent to the file server until the first write completes which is not attempted until the first chunk size of data is ready to be sent.  As a result, when the initial data buffers are being written to in the cache the cache manager believed that it must read their contents from the file server.  If the pre-fetch criteria are met, additional data buffers would be queued as well.  Performing these reads is useless work given the fact that the client will overwrite them or discard them once the truncation is sent to the file server.  The answer of course was to check for the outstanding truncation when getting data buffers.

Do Not Perform Unnecessary Work - Part III
Acquiring mutexes and locks is expensive because it often results in the active thread giving up the rest of its allocated time slice and being forced to be rescheduled at a later time.  Therefore, if there are locks that are not required to perform the current operation, they should not be acquired.

Do Not Sleep if it is Not Required
If the file server responds EAGAIN to an RPC, the cache manager will under most circumstances put the current thread to sleep and try again in a few seconds provided that the SMB redirector timeout limit has not been reached.  There are several operations for which retries are not permitted which include background writes, lock acquisition, etc.  Due to an architectural design flaw, the cache manager was putting threads to sleep even if retries were not permitted.
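
The corrected control flow can be sketched as follows.  The function name, parameters and the two second delay are assumptions for illustration; the point is only that the "are retries permitted?" check happens before any sleep.

```python
import errno
import time

def call_with_retry(rpc, retry_allowed, deadline_seconds,
                    delay=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call rpc() until it stops returning EAGAIN.

    rpc is any callable returning an errno-style code (0 for success).
    If retries are not permitted, or the next sleep would pass the
    deadline, the EAGAIN result is returned without sleeping.
    """
    start = clock()
    while True:
        code = rpc()
        if code != errno.EAGAIN:
            return code
        if not retry_allowed:
            return code            # e.g. background writes: fail right away
        if clock() - start + delay > deadline_seconds:
            return code            # stay inside the caller's timeout
        sleep(delay)
```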

Setting Max MTU Size hurts
Back in 2003 it was discovered that the IPSec VPN products did a very poor job of interacting with AFS due to the reduction in the actual data payload in a UDP packet caused by the addition of the IPSec headers.  Due to an ever increasing number of complaints to Help Desks and to OpenAFS stating that AFS didn't work, it was decided that the OpenAFS installation packages on Windows would ship with the RxMaxMTU value set to 1260.  At the time the performance of the cache manager was so bad that it was not possible to notice the difference.  Unfortunately, now that the cache manager is better performing, setting RxMaxMTU to 1260 can result in a reduction in StoreData throughput of 50%.

Avoid Modifying Objects Whenever Possible
Every vnode and every data buffer object contains a version number.  Every time the vnode changes the file server increments the version number.  Doing so automatically invalidates the contents of all caches forcing the clients to re-read the data from the file server.  Reading from the file server is an expensive operation so we try to avoid it when we know that the current contents of the cache are already valid.  We know that to be true when the cache manager performed the most recent change to the vnode and the version delta is one.  Over the summer code was added that would bump the version number on all of the data buffers in this circumstance.  However, this had the side effect that writes became slower as the file got larger.  By maintaining a range of valid data versions instead of just the current data version, it is possible to maintain the benefits of the existing cached data at a cost that is independent of the file size.
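
A sketch of the version range idea.  The class and field names are hypothetical; what matters is that accepting our own change touches only the vnode, not every cached buffer.

```python
class Vnode:
    """Hypothetical vnode that tracks a range of valid data versions."""
    def __init__(self, data_version):
        self.data_version = data_version        # newest version known
        self.min_valid_version = data_version   # oldest version still usable

class Buffer:
    def __init__(self, data_version):
        self.data_version = data_version        # version when it was cached

def buffer_is_valid(vnode, buf):
    return vnode.min_valid_version <= buf.data_version <= vnode.data_version

def note_new_version(vnode, new_version, changed_by_us):
    """Record the data version reported by the file server."""
    if changed_by_us and new_version == vnode.data_version + 1:
        # Our own write and nobody else's: extend the valid range.
        # Cost is constant no matter how many buffers are cached.
        vnode.data_version = new_version
    else:
        # Someone else changed the file: all older buffers become stale.
        vnode.min_valid_version = new_version
        vnode.data_version = new_version
```

Buffers cached at the old version stay usable after our own write, and a single assignment is enough to invalidate them all when another client modifies the file.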

Hash Algorithms Matter
The lock package uses an array of critical section objects to protect the internals of the mutex and read/write locks.  Which critical section was used for which lock object was determined by hashing the memory address at which the lock was located.  Unfortunately, the distribution of the objects was poor and some critical sections were used much more frequently than others.  Worse was the fact that several highly used global locks shared the same critical sections. 
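
To see how an address hash can go wrong, here is a toy example.  The table size, the 64-byte alignment of the lock objects and both hash functions are assumptions picked to make the effect obvious; they are not the hash the lock package actually used.

```python
from collections import Counter

NUM_CRITICAL_SECTIONS = 32   # hypothetical size of the critical section array

def naive_slot(address):
    # Aligned addresses share their low bits, so this wastes most slots.
    return address % NUM_CRITICAL_SECTIONS

def shifted_slot(address):
    # Discard the alignment bits (64-byte alignment assumed) before reducing.
    return (address >> 6) % NUM_CRITICAL_SECTIONS

# 64 lock objects laid out 64 bytes apart, starting at an aligned base.
addresses = [0x10000 + 64 * i for i in range(64)]

naive_use = Counter(naive_slot(a) for a in addresses)
shifted_use = Counter(shifted_slot(a) for a in addresses)
```

In this toy layout every lock lands on critical section 0 with the naive hash, while the shifted hash spreads the same 64 locks evenly, two per critical section.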




Wednesday, September 26, 2007
Subject: Windows Error Reporting versus Open Source Development
Time: 9:44:54 AM EDT
Author:  secureendpoints
Mood:  Frustrated



Windows Error Reporting is one of the greatest services that Microsoft has ever provided to developers of applications and device drivers for Microsoft Windows operating systems.  It provides a registered and verified software developer with access to crash report data for that developer's applications.

How does it work?
When an application terminates unexpectedly or a user terminates an application  due to a lack of responsiveness, Windows will capture a mini-dump of the application, the version information of all loaded modules, and the version information for the Windows operating system on which it is being run.  The user is then presented a dialog requesting permission to deliver this information to Microsoft. 

Registered application developers provide Microsoft with a mapping file that describes each binary in a product release including version info, link times, and other traits that can be used to uniquely identify the module.  When crash reports are received by Microsoft, the WER servers compare each report against the mapped modules.  When a match occurs, a WER event is generated and the application developer is notified.

One of the really nice benefits of WER is that it can sort the events into buckets based upon the type of crash, hang, and process state at the time of the crash.  If the same type of crash occurs 50 times, all of the matching events will be placed into the same event bucket.  Application developers can easily compare the state of all of the crash reports to assist in tracking down the cause.

When a fix is available, the application developer can register a response which will be delivered to subsequent users that experience the same type of crash with the same version of the module or application.  These responses can indicate that the software is not supported on the OS version that it is installed on, or that a new version is available, or that a workaround can be found by reading a provided web page.

This mechanism benefits both the developers and the end users because as soon as a bug is found it can be fixed without requiring that the end users go through a long process of reporting a crash to the developers directly and being unable to provide enough technical detail for the developers to fix it.  Once the fix is available, end users are automatically notified.  Less frustration for end users and for developers.  Everyone wins.

Unless you are an open source developer or end user....

What is the problem with Open Source?

Secure Endpoints is an open source vendor.  We distribute pre-built installers for Kerberos for Windows and OpenAFS for Windows.  For each of these distributions we have binaries and matching symbol data.  When a crash report arrives from WER, the mini-dump is loaded into a debugger along with the matching binaries and symbol data.  Without the binaries or the symbols, the mini-dump information is much less useful because the stack addresses cannot be matched up with specific functions in the application modules.
As long as the version of the application that is installed is the one Secure Endpoints built, we can make use of the crash reports to identify problems, fix them and notify end users via the WER response mechanism. 

What happens when an organization decides to build the product from the published source code instead of using the pre-built binaries?  In that case, WER matches the module names and file version information and places an event into a crash bucket.  Secure Endpoints downloads the crash report, loads it into the debugger only to find that we have neither matching binaries nor matching symbols.  The end result is that the WER report is useless.  The best I can do is file a response to the end user recommending the use of the pre-built binaries.

I can certainly understand why organizations wish to build their own binaries.  In most cases it's because they want to be able to debug problems they experience in-house.  For that they need matching symbol files.  This is exactly the reason why both the Kerberos for Windows and OpenAFS for Windows distributions include the symbol files from the official build.  This way organizations have all the necessary pieces: binaries, symbol files and source code.  Organizations that identify problems internally should file bug reports to the open source maintainers so that fixes can be developed and incorporated into future releases.






Saturday, February 24, 2007
Subject: Squeaky wheels receive attention (both good and bad)
Time: 10:21:34 AM EST
Author:  secureendpoints



I spent the past few weeks traveling the country meeting with organizations that use OpenAFS and Kerberos for Windows.  I heard a number of really wonderful things:
  • "We haven't had a show stopper event in more than a year"
  • "The performance is so much better than it used to be.  We no longer receive complaints about how slow it is instead our users send us messages like this one, 'OH My gosh, afs is so fast now since i got my upgrade :)'"
At the same time the amount of funding spent on support and new development has been decreasing.  Budgets are always tight and management wants to spend its money on addressing the issues that cause on-going problems. 

Just a couple of years ago, the OpenAFS Windows client was so bad that not only were organizations sending money but individuals would send personal PayPal payments and bottles of tequila as a "thank you for improving my life".  These days expectations have changed.  The assumption is that the OpenAFS Windows client just works.

In the 1.5.15 release of OpenAFS for Windows, a serious data corruption bug was fixed.  As it turns out this bug had been reported to IBM within the last year by an organization that was still using the IBM AFS Windows client.  When the organization switched to OpenAFS it never occurred to them that OpenAFS would have the same problem given their common heritage.  OpenAFS is so much better in so many ways that they "just assumed it had already been fixed."

The truth is that all of the low hanging fruit has already been picked.  It's not that there is no more work to be done but that all of the remaining work is big.  So big in fact that it cannot be paid for out of support budgets.  Instead strategic planning funds must be used, and those are much harder to come by, especially when the scope of the projects is measured in developer years and hundreds of thousands of U.S. dollars.  It's no longer possible for someone to ask "how much would it cost to fix xyz?" and receive a response indicating that the work could be done in a few hours or a day or two.

Instead, much of the longer term strategic work that was done to support the Windows Vista platform was unfunded.  Secure Endpoints contributed hundreds of hours of developer time to ensure that there would be an OpenAFS client for the new operating system.  This was done on the assumption that the costs would be recouped in the future through interest in support contracts.  What a surprise it was to hear this week that existing support contract customers are questioning the need for the support.  The long hours spent improving the product have taken OpenAFS off the radar of senior management and as a result the funding is disappearing.

One large user described how there have been so few reported issues with the 1.4.2 client that he can't justify upgrading to 1.5.15 even though he is aware of all of the significant improvements in performance and stability.  Performance improvements just aren't a reason to upgrade when there are thousands of clients involved.  Stability doesn't matter if the end users are not being adversely affected.   Sure there are bugs and annoyances but the help desk knows how to address them and the users move on with life.   Management simply is not going to spend money on something that is faster or prettier.  If there isn't a critical show stopper issue, it won't be detected by their radar.

Our philosophy is that software is built to address the needs of its users with the goal of making their lives happier and more productive.  Good software doesn't attract unwanted attention.  In the case of a file system or other infrastructure, the end user should be able to take it for granted.  If it receives attention from the user, that is a bad thing.

A good support contract vendor is one that addresses issues promptly when they occur, but more importantly works to ensure that you do not have issues in the first place.  The question is, if support dollars are used to fund development that proactively addresses issues before they are noticed by the customer, how does the customer know that the support dollars were well spent?  This is especially true when management does not believe that incremental improvements in performance and stability are worth paying for.

I am now beginning to understand the behaviors of large corporations providing support to Federal agencies.  I find them extremely frustrating to deal with because the apparent goal is to deploy software with just the right number of bugs such that there are never issues that bring the entire system to a halt but there is a constant stream of small issues that will keep them on the phone with the agency's help desk.  Every week a report is sent to the customer detailing the number of issues categorized by severity and whether or not the user's problem could be addressed.  Large numbers of low severity issues are encouraged whereas even a single Priority One issue is to be avoided.

Fortunately for the clients of Secure Endpoints Inc, I believe that our role is to help prevent problems regardless of the severity.  Unfortunately, it is then harder to make the case for additional financial investment in products that are already deemed to be "good enough".



Monday, January 8, 2007
Subject: Happy New Year!
Time: 12:19:09 PM EST
Author:  secureendpoints



It has been many months since this blog has been updated and many wonderful things occurred during the final three months of 2006. 

On the Kerberos front:

On Nov 9th, MIT announced that they want to provide a full-time developer to support Windows development.  As a result, Secure Endpoints Inc. has become a development and support partner.  Secure Endpoints Inc. will continue to enhance Kerberos for Windows and Network Identity Manager as well as issue new releases in conjunction with MIT's Kerberos team.  The primary change is that MIT will no longer be funding Secure Endpoints' efforts.  As a result, Secure Endpoints is reaching out to the broader Kerberos for Windows user community to help support on-going development. 
http://www.secure-endpoints.com/kfw/New%20Direction%20for%20Kerberos%20for%20Windows.eml

On Nov 30th, MIT Kerberos for Windows 3.1 including Network Identity Manager 1.1.8 was finally released. http://www.secure-endpoints.com/kfw/Kerberos%20for%20Windows%20version%203.1%20is%20released.eml
Although Network Identity Manager has not changed much on the outside since the KFW 3.0 release, on the inside the changes were dramatic.  A large number of usability issues were addressed and the plug-in interface was improved to support a wider range of functionality.  KFW 3.1 can be downloaded from MIT: http://web.mit.edu/kerberos/dist/index.html#kfw-3.1

Development on KFW 3.2 and NIM 1.2 is underway.  Secure Endpoints has posted a development road map including 64-bit Windows support, Vista support, and a wide range of enhancements to the Network Identity Manager user interface.  Financial support from the community is required to sustain the on-going improvements that KFW has received over the last several years.
http://www.secure-endpoints.com/netidmgr/roadmap.html

For OpenAFS for Windows, 2006 was a banner year.  It started off with the 1.4.1 release candidates and ended with the release of 1.5.13.  Throughout those releases there were more than 150 improvements to the product.  The most important changes include:
* No more resource leaks within the SMB Server
* Locally managed byte range locks backed by full file locks on the file server
* Improved performance when disconnected from the network
* Improved performance for directory listing
* Improved performance when storing temporary files within AFS
* Improved power management event handling
* Support for file sizes greater than 2GB
* Over quota and disk full errors are now reported
* Significantly improved handling of dirty buffers results in decreased CPU utilization and faster writes
* A Network Identity Manager AFS credential plug-in is provided
* Support for 64-bit Windows
* Support for Windows Vista
A summary of the current state of OpenAFS for Windows can be found at http://www.secure-endpoints.com/openafs-windows.html as well as the most recent Status Report http://www.secure-endpoints.com/talks/OpenAFS-Windows-Dec-2006-Status-Report.pdf.

Secure Endpoints has published a development road map for OpenAFS for Windows which includes a number of performance improvements to the AFS Client Service as well as a complete set of re-writes of the Explorer Shell integration, the OpenAFS Control Panel, and the development of a Microsoft Management Console for configuring the AFS Client Service.  http://www.secure-endpoints.com/openafs-windows-roadmap.html

Finally, perhaps the best surprise for last.  Just before the end of the year the AFS Servers (file, protection, volume, volume database, bos) were made functional once again.  The install wizard has been removed because it made assumptions that no longer hold true, but by manually installing the servers as is done on UNIX, it is now possible to run a cell from a Windows Server.  See the road map for a summary of what still remains to be done.
http://www.secure-endpoints.com/openafs-windows-roadmap.html#afs%20servers

In 2007, there is much to look forward to.  During the first quarter Secure Endpoints will release a new Network Identity Manager plug-in for obtaining KX509/KCA certificates; and with community support there will be significant releases of both KFW and OpenAFS. 

Mark on your calendar that the next AFS & Kerberos Best Practice Workshop will be held at Stanford during the week of May 7 to 11.  As always full day tutorials will be provided on AFS and Kerberos installation, administration, and maintenance.  This year Secure Endpoints will be providing the Kerberos tutorial.  New this year will be discussion of Kerberos and GSS-API programming practices.

Here's a toast to the accomplishments of 2006 and those that are to come in 2007. 
Happy New Year!!!!




Wednesday, October 18, 2006
Subject: The need to avoid release labeling and choice for end users
Time: 9:09:35 AM EDT
Author:  secureendpoints
Mood:  Frustrated



Developers have a tendency to focus on source code management. We maintain source code repositories to help us manage the development process. Within the repository we construct release branches. Each branch allows a set of sources to be shaped for a specific purpose. Typical branching strategies include separate branches for the maintenance of a public release, for development of the next release, and experimental branches for risky development that might not work out or may have an adverse impact on other developers. Developers often give somewhat arbitrary names to these branches ("stable", "unstable", "maintenance", "development", "project foo", etc.) that only have meaning to the developers.

As is often the case, the names assigned to the branches have no relationship with the quality of the code on a particular branch. This is especially true for a software project which supports large numbers of operating system platforms. Given the rate of development, one branch is often a better choice than the others for a given platform, regardless of what it is called.

OpenAFS has traditionally labeled its branches as "stable" and "unstable".   The even numbered branches are "stable" and the odd numbered branches are "unstable".  This has resulted in significant amounts of confusion and frustration for end users.  At any given time end users have been presented with up to three current releases:
  • the last final release off of the "stable" branch
  • the most recent test release off of the "stable" branch
  • the most recent release off of the "unstable" or "development" branch
What's an end user to do?  More importantly, what's an administrator responsible for choosing the release to distribute throughout their organization to do?

When presented with the choice of selecting among "stable", "beta", or "unstable", which do you think the majority of individuals will choose?  End users don't want to install software that is going to cause them to lose data and they don't want to be guinea pigs, so more often than not they are going to choose the "stable" release, even if this release has a list of known bugs a mile long and is years old.

The distinction between the various source code branches is of meaning only to the developers.  End users do not think of software as source code.  They think of it as a product, and the labels associated with different versions of a product will significantly influence the end user's decisions, especially when faced with complex choices they are not qualified to distinguish between.  It is unrealistic to assume that an end user is going to understand the importance of file locking or the meaning of a 64-bit file size or the terminology surrounding deadlocks and reference count leaks.  When a typical end user is presented with a choice among two or three complex options without a strong recommendation specifying which should be used, simplistic labels such as "unstable", "stable", "final", "development", "test", "beta", "candidate", etc. are much more influential than they are intended to be.

The reputation of OpenAFS on the Microsoft Windows and MacOS X platforms is suffering in part because of the choices given to end users and the terminology used to describe them.  End users want something that works.  They want to visit a web site and see that version X.Y.Z is the best version available for their platform and this is what they should be using.  When they experience a problem and see that they are not currently running the recommended version, then they will upgrade.  If they experience a problem and are presented with choices that they can't make heads or tails of, they are going to take the path that appears to have the least risk.  End users will choose the "stable" or "final" release over something labeled "test", "beta", "unstable", or "development" 9 out of 10 times, even though the problem they are experiencing might very well be fixed in one of these apparently riskier releases.

For Windows users the availability of multiple releases has been a serious problem.  The 1.4 series does not contain significant functionality that is meant to protect end users from data loss.  This functionality is only available in the 1.5 series.  Unfortunately, because end users are presented with new releases from both the 1.4 and 1.5 branches as they are released, it is truly impossible for them to know which to use without a very clear recommendation from the gatekeepers and perhaps the broader user community.

One of the other significant problems facing OpenAFS versioning is the length of time it takes to get through a test cycle.  It is often the case that a small number of problems on specific operating system versions or hardware architectures can prevent a test cycle from being completed.  In the meantime, the release that should be considered the best choice on all of the other operating system versions and hardware architectures is stuck with a label of "test", "beta", or "candidate", which discourages organizations and end users from installing it.

As a result I am recommending that OpenAFS (and all other cross-platform open source projects) avoid the "one version is best for all platforms" mentality.  Instead of labeling releases as "stable-1-4-2", "stable-1-4-2-beta-1", "stable-1-4-2-rc3", or "unstable-1-5-9", just use numbers such as "1-4-41", "1-4-42", "1-4-43", "1-5-9".  This removes the negative connotations associated with the labels.  For each platform a recommended release number can be provided.
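As a rough illustration of the idea, the Python sketch below keeps a table that maps each platform to its recommended build number and tells a user whether to upgrade.  The platform keys, the recommended builds in the table, and the function names are all hypothetical; none of this reflects actual OpenAFS release tooling or actual recommendations.

```python
# Hypothetical sketch: plain build numbers plus a per-platform recommendation.
# The platform names and recommended builds are invented illustration values.

def parse_build(label):
    """Turn a label such as "1-4-43" into a comparable tuple (1, 4, 43)."""
    return tuple(int(part) for part in label.split("-"))

# Assumed table that a project web site might publish (made-up values).
RECOMMENDED = {
    "windows": "1-5-9",
    "macosx": "1-4-43",
    "linux": "1-4-42",
}

def advice(platform, installed):
    """Compare the installed build with the platform's recommended build."""
    recommended = RECOMMENDED[platform]
    if parse_build(installed) < parse_build(recommended):
        return f"upgrade to {recommended}"
    if parse_build(installed) == parse_build(recommended):
        return "running the recommended build"
    return f"newer than recommended build {recommended}"

if __name__ == "__main__":
    print(advice("windows", "1-4-41"))
    print(advice("linux", "1-4-42"))
```

Because the labels carry no words, the only signal the user sees is whether they are on the build recommended for their platform.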

This new approach provides a number of side benefits.  No longer do the developers need to guess at what version numbers to assign to test builds.  When preparing for a new release we want the final version number to be X.Y.Z.00.  Therefore, the developers typically try to assign numbers starting with X.Y.(Z-1).90 in order to ensure that version numbers always increase but to avoid the confusion that might arise if end users thought the test release was in fact the final release. 
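The guesswork in the existing numbering can be sketched in a few lines of Python.  This assumes four-component versions compared as integer tuples and at most ten test builds per cycle (90 through 99); the helper names and the ten-build limit are illustrative assumptions, not a description of the actual OpenAFS build scripts.

```python
# Hypothetical sketch of the X.Y.(Z-1).90 test-build numbering.
# Assumes versions are compared component by component as integers.

def final_version(x, y, z):
    """The final release X.Y.Z is numbered X.Y.Z.00."""
    return (x, y, z, 0)

def prerelease_version(x, y, z, n):
    """The n-th test build (n from 0) before X.Y.Z is X.Y.(Z-1).(90+n)."""
    if z < 1:
        raise ValueError("scheme needs a previous patch level (z >= 1)")
    if not 0 <= n <= 9:
        # Assumption: the last field has two digits, so 90..99 only.
        raise ValueError("only ten test builds fit between 90 and 99")
    return (x, y, z - 1, 90 + n)

def fmt(version):
    """Render a version tuple as X.Y.Z.NN."""
    x, y, z, build = version
    return f"{x}.{y}.{z}.{build:02d}"

if __name__ == "__main__":
    versions = [
        final_version(1, 4, 1),
        prerelease_version(1, 4, 2, 0),
        prerelease_version(1, 4, 2, 1),
        final_version(1, 4, 2),
    ]
    # Already in increasing order, so sorting leaves the list unchanged.
    print([fmt(v) for v in sorted(versions)])
```

The test builds sort after the previous final release and before the next one, but the developers must decide in advance how much room to leave, which a simple incrementing build number avoids.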

Another benefit is that it will be much easier for administrators to convince management to deploy fixes.  Management is always reluctant to deploy a "beta" or "candidate" release because such a release must have bugs.  The reality is that all software has bugs.  Even if there are no known bugs in a given release at the time the release is announced, it is guaranteed that over time bugs will be discovered and they will be fixed in later releases.  A "final" release is simply one that is believed to build and run on all supported platforms without known faults.

The requirement that a "final" release build and run on all supported platforms, including all new Linux kernels, often results in significant delays before important bug fixes can make it out to the user community.  For example, at the AFS & Kerberos Best Practice Workshop a demonstration was given of a bug fix to a problem in the 1.4.1 file server that adversely affects client mobility.  The bug fix was committed on June 3rd, and yet it took until October 17th for a 1.4.2 final release to be issued.  In the meantime, more than four months of end user frustration has accumulated and many sites have deployed 1.4.1 on their file servers instead of one of the "beta" or "candidate" releases that contained the fix.

In speaking with end users, I have been told that as long as the version label does not contain negative terminology they can push out any build that is recommended.  However, once doubt is raised in the minds of management regarding the quality of the release, all bets are off.

It is my hope that OpenAFS and other open source projects will abandon the traditional release labeling and replace it with incremental build numbers and platform-specific recommendations.



Written by secureendpoints

Friday, September 8, 2006
Subject: OpenAFS for Windows September 2006 Status Report is now available
Time: 12:42:58 PM EDT
Author:  secureendpoints



The OpenAFS for Windows September 2006 Status Report is now available:

http://www.secure-endpoints.com/talks/OpenAFS-Windows-Sep-2006-Status-Report.pdf

For the complete list of changes since the 1.2 release see: http://www.openafs.org/dl/openafs/1.5.8/winnt/afs-changes-since-1.2.txt

and of course be sure to read the Release Notes:
 http://www.openafs.org/dl/openafs/1.5.8/winnt/relnotes-frames.htm

As always I encourage all organizations and individuals who wish to support the development of OpenAFS for Windows to contact me. Financial contributions as well as in kind assistance are seriously appreciated. Tax deductible donations may be made via the OpenAFS account operated by Usenix (a 501(c)(3) not-for-profit corporation).

Written by secureendpoints