Google Summer of Code 2015 proposal: Create new nmatrix gem for advanced linear algebra features - wlevine/nmatrix GitHub Wiki

Organization: Ruby Science Foundation

Abstract: NMatrix installation is difficult because it requires the ATLAS library which provides advanced linear algebra features. I propose to remove the dependency on ATLAS by moving these features to a separate gem package and by allowing the use of alternate linear algebra libraries.

Contact information. Please provide your email address, GitHub username, and approximate physical location.

William Levine
[email protected]
https://github.com/wlevine
Pittsburgh

Why do you like Ruby, and why do you want to work on SciRuby?

Describe your educational background (school, degree plan, major, past degrees, research area, publications, etc.).

(I'm combining the answers to these two questions since they're related.)

I'm a PhD student in physics at Carnegie Mellon University. I plan to finish my degree by the end of 2015. I got my BS in physics from the University of Virginia in 2009.

I work in experimental particle physics. Data analysis and computation are at the heart of our field. The standard data analysis library that is used by everyone in the field (and no one outside of it) is ROOT, which is the bane of every particle physicist's existence.

ROOT is a C++ library which also has Python and Ruby bindings. The Ruby bindings are not widely used: I use them, and the three previous grad students of my advisor used them, but otherwise I haven't heard of anyone using them. I think PyROOT is a little more popular, but judging by the volume on the ROOT message boards, the vast majority of physicists are using ROOT from C++. I think this has a terrible effect on particle physics. One of the first tasks for every new student is learning C++, so that they have to learn the difference between a pointer and a reference before they can do any actual physics (this is perhaps hyperbole: a lot of physicists never truly learn what a pointer is and have just learned that sometimes adding or removing a * or a & will make your code work). Everyone wastes time debugging mysterious segfaults in programs that shouldn't have been written in C++ in the first place. ROOT, in various ways, encourages you to write bad C++ code, so it's not even like we're learning a useful skill. C++ has its place in physics, but it should not be the language that everyone uses every day. C++ is way too hard for non-experts to use (correctly), and not every physicist can or should become a C++ expert.

In our research group at Carnegie Mellon, we have a tradition of using Ruby and I picked it up when I joined. Ruby was such a joy to use, mostly because I felt like it made simple things simple. Things like string manipulation and iterating over containers that were always weirdly awkward in C++ were a breeze in Ruby. It was great to be able to quickly iterate and test code without compiling. Things that were possible to do correctly in C++ were easy to do correctly in Ruby. I could (mostly) forget about integer overflow, undefined behavior, segfaults, and memory leaks. I realize that these features I'm listing aren't unique to Ruby, but Ruby was the first language I worked with that had them, so it was the language I fell for.

So, I think this starts to make it clear why I'm interested in SciRuby. I think the current computing situation in particle physics is terrible, in large part because our standard analysis library is too closely coupled to C++. I think physicists would work smarter and be happier if their day-to-day programming language were something like Ruby or Python. Personally, I am a Ruby person. Therefore, I am excited by the SciRuby project and want to contribute. I hadn't heard of SciRuby before I looked at the GSoC organization list, but when I saw it there I was very happy to learn that it exists and I immediately started to play around with NMatrix. I don't think I'll be able to use SciRuby in my main research project (ROOT is too firmly entrenched and it's important that I stick to the standard so I can share data and code with collaborators), but I look forward to using it in a side project, or something outside physics.

What do you like about science and why? What area do you like best?

Answering the second question is easy. Physics, obviously.

Answering the first question is way harder. I guess the best way to describe it is just a joy in learning new things and understanding how things fit together.

Describe your experience with the following: Ruby, C, C++, other languages.

I've been working with Ruby and C++ daily for the past six years as part of my research.

As I mentioned before, I was introduced to Ruby at the beginning of grad school, as it was the popular language in our research group. A former grad student here wrote a couple of nifty Ruby tools (one of which you can check out here: http://www-meg.phys.cmu.edu/williams/wiki-ruby-pwa/index.php/Main_Page ) which I use frequently and hack on occasionally. When possible, I use the Ruby bindings for ROOT for doing analysis and plotting. I also use Ruby for everyday scripting tasks. I have a small amount of experience with Ruby C extensions.

C++ I started playing around with a long time ago, when I was in high school or maybe earlier. In college I took a class on scientific computing in C++ and used it for a small research project. I still use it pretty heavily. All of our reconstruction code (the stuff that needs to be done before the analysis stage) is pure C++ and I contributed a lot to that (specifically calorimetry reconstruction, in case you're interested). In addition, there are some things in ROOT that are a lot easier to do in C++ than in Ruby, so I use it there also. Despite all my whining about C++ above, I feel pretty comfortable writing it, and appreciate it for what it is (it's nice that the compiler catches mistakes for you; it would be nicer if it caught all of your mistakes).

Have you offered any pull requests for SciRuby or contributed in other ways? Please provide links, if possible. Past contributions are required, and must be in the form of code. Documentation contributions are also beneficial.

Pull requests:
https://github.com/SciRuby/nmatrix/pull/331
https://github.com/SciRuby/nmatrix/pull/332

My first contribution to nmatrix was fixing the tutorial so that the examples given actually worked: https://github.com/SciRuby/nmatrix/wiki/Tentative-NMatrix-Tutorial/_history

I also filed a couple of bugs:
https://github.com/SciRuby/nmatrix/issues/329
https://github.com/SciRuby/nmatrix/issues/330

What other commitments do you have this summer aside from GSoC? What obstacles do you foresee this summer as far as contributing the full forty hours per week during the GSoC period?

Are you planning any fun vacations this summer?

How many classes are you taking this summer?

Do you have any other employment this summer?

[Edited out the answers to these questions]

Please talk a bit about any past GSoC projects in which you have participated. If you've done GSoC before, how could we reach your mentor(s)?

Never participated before.

Please propose a project you would like to work on.

Motivation: NMatrix is too hard to install. The installation page is long, with too many steps. On my laptop (Ubuntu), I needed to do this weird dependency dance to build it. If I had tried to install it on my work machine (Red Hat), I would have had no guidance. Other projects are reluctant to use NMatrix because of the installation process (https://github.com/jekyll/classifier-reborn/issues/14). One big issue in the installation process is the installation of ATLAS. So I want to remove the dependency on ATLAS without removing any features.

Summary: I propose two different ideas. The first idea is to separate NMatrix into two gems, one of which contains basic math functionality and has no external dependencies on ATLAS or any other linear algebra package. The second gem (let's call it nmatrix-atlas) would contain more advanced linear algebra functions and would make use of an external linear algebra package. This would make installation easier for users who are not interested in advanced features. The second idea is to allow NMatrix to run with any implementation of liblapack and libblas, rather than being limited to ATLAS. This would simplify installation by allowing users to use whatever version of LAPACK is easily available from their OS or package manager. At the same time, it would allow users who are concerned about performance to use whatever tuned version of LAPACK works best for them.

Part 1 (nmatrix-atlas gem) details: The plain nmatrix gem should include basic math functions like matrix multiplication, inversion, and determinants. The nmatrix-atlas gem will provide additional functions provided by LAPACK.

I started to write some code to see what issues I would run into when I tried to separate the ATLAS-dependent code from the rest of nmatrix: see https://github.com/wlevine/nmatrix/tree/test_two_gems

I didn't get to the point of actually packaging two separate gems, but I was able to remove the ATLAS dependencies from nmatrix.so and make a separate nmatrix_atlas.so that extended NMatrix to implement one of the ATLAS functions (getri).

Here are the lessons I learned from this: nmatrix_atlas will need access to the nmatrix header files. As a quick solution I just symlinked all the headers from ext/nmatrix to ext/nmatrix_atlas. This is probably not a good idea in the long run. Other solutions are possible; probably what should happen is that building plain nmatrix installs these headers somewhere that nmatrix_atlas (or other potential extensions) can see them.
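One possible way for an extension to find those headers, sketched here under assumptions: if the installed nmatrix gem ships its headers under ext/nmatrix (as they are laid out in the repository), the extension's build script could ask RubyGems where the gem lives. The helper name gem_header_dir is hypothetical, and whether the headers should really stay under ext/nmatrix is one of the open questions above.

```ruby
require "rubygems"

# Hypothetical helper: returns the directory inside an installed gem
# where its C headers are assumed to live, or nil if the gem is not
# installed.
def gem_header_dir(gem_name, relative_dir)
  spec = Gem::Specification.find_by_name(gem_name)
  File.join(spec.gem_dir, relative_dir)
rescue Gem::LoadError
  # find_by_name raises a Gem::LoadError (or a subclass) when the gem
  # cannot be found.
  nil
end

# In ext/nmatrix_atlas/extconf.rb this might be used roughly as:
#   require "mkmf"
#   dir = gem_header_dir("nmatrix", "ext/nmatrix")
#   abort "nmatrix headers not found" unless dir && find_header("nmatrix.h", dir)
```

This avoids the symlinks, at the cost of requiring the plain nmatrix gem to be installed before nmatrix_atlas is built.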

The nmatrix header files contain some information that shouldn't be exposed to external libraries. For example, nmatrix.h defines a lot of important structs and macros which will be needed in nmatrix_atlas to read data, but it also defines Init_nmatrix() and some other functions which are irrelevant. I don't know the full extent of this issue, but I may need to break up the header files into parts that are relevant to external code and parts that should only be exposed internally.

All I needed to get nmatrix_atlas working was the header files from nmatrix, and none of the C files. This is a good thing.

I plan to develop both gems in the same repository so that they don't get out-of-sync. Code in ext/nmatrix makes the nmatrix.so library, while code in ext/nmatrix_atlas will make the nmatrix_atlas.so extension. The Ruby files for the two gems can live in two different subdirectories of lib/.

It should be possible to generate two gems from the same repository: see http://opensoul.org/2012/05/30/releasing-multiple-gems-from-one-repository/
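As a rough sketch of what the second gem's specification could look like when it sits next to the main one in the same repository. Everything concrete here is an assumption for illustration: the version number, the lib/nmatrix/atlas paths, the location of extconf.rb, and the exact-version dependency are hypothetical, not the project's actual layout.

```ruby
require "rubygems"

# Hypothetical shared version; pinning both gems to it is one way to
# keep them from getting out of sync.
NMATRIX_SKETCH_VERSION = "0.0.1"

# Would live in a file such as nmatrix-atlas.gemspec (assumed name),
# alongside the main nmatrix gemspec.
ATLAS_GEMSPEC = Gem::Specification.new do |s|
  s.name       = "nmatrix-atlas"
  s.version    = NMATRIX_SKETCH_VERSION
  s.summary    = "Advanced linear algebra functions for NMatrix, backed by an external LAPACK"
  s.authors    = ["(placeholder)"]
  # Only the files that belong to this gem (assumed layout).
  s.files      = Dir["lib/nmatrix/atlas.rb", "lib/nmatrix/atlas/**/*", "ext/nmatrix_atlas/**/*"]
  # The C extension that builds nmatrix_atlas.so.
  s.extensions = ["ext/nmatrix_atlas/extconf.rb"]
  # The extension gem requires the matching plain nmatrix gem.
  s.add_dependency "nmatrix", "= #{NMATRIX_SKETCH_VERSION}"
end
```

Each gem would then be built from its own specification file with gem build, so one checkout produces two packages.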

If only nmatrix is built, the LAPACK-specific specs should not be tested. If nmatrix-atlas is built, any function that is reimplemented by nmatrix-atlas should be tested twice, with and without nmatrix-atlas.
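A minimal sketch of how the spec suite might decide what to run, assuming the extension gem is loaded with require "nmatrix/atlas" (a hypothetical path; the real one is undecided). The helper names library_available? and backends_to_test are also made up for illustration.

```ruby
# Returns true if the named library can be loaded, false otherwise.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Hypothetical require path for the nmatrix-atlas gem.
ATLAS_AVAILABLE = library_available?("nmatrix/atlas")

# Which implementations the shared specs should be run against.
# :internal is the plain nmatrix code, :atlas the reimplementation
# provided by nmatrix-atlas. Each entry would correspond to one run
# of the shared specs.
def backends_to_test(atlas_available)
  atlas_available ? [:internal, :atlas] : [:internal]
end
```

LAPACK-only specs could then be guarded by ATLAS_AVAILABLE (for example through RSpec's conditional metadata), while specs for reimplemented functions loop over backends_to_test(ATLAS_AVAILABLE).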

I also will need to update all the build instructions and documentation to explain the situation.

I don't have access to an OS X machine, so I will need help testing anything that changes the build process.

Part 2 (allow alternatives to ATLAS) details: Currently if an NMatrix user wants to use an external library to provide BLAS/LAPACK functions, ATLAS is the only choice. It would be better if nmatrix interfaced directly with liblapack and libblas, so that the user could use any implementation they preferred (ATLAS, OpenBLAS, Intel, etc.). This is how numpy works (http://www.scipy.org/scipylib/building/linux.html).

The main obstacle here seems to be a C interface for LAPACK. NMatrix currently uses CLAPACK as provided by ATLAS. I propose replacing this with LAPACKE (http://www.netlib.org/lapack/lapacke.html).

Again, I've done a little work to investigate feasibility: see https://github.com/wlevine/nmatrix/tree/test_no_atlas

I got everything building without ATLAS and reimplemented one function (sgetri) using LAPACKE instead of CLAPACK. I tested that it works using three standard packages from the Ubuntu repositories: liblapack3 (= the reference implementation), libatlas3, and libopenblas.

The approach I'm currently taking is to copy a lot of code from LAPACKE into the nmatrix repository. Normally, I wouldn't think this is a good idea, but I think it's okay here for several reasons:

  1. Compatible license. LAPACKE is under 3-clause BSD license.
  2. The code is stable. Not a lot of worry about continually having to update the code.
  3. LAPACKE is just a thin layer that provides an interface from C to the underlying LAPACK functions that are provided elsewhere. So it shouldn't be hard to build and there are no heavy-duty functions that need to be highly optimized.
  4. Not copying the code would result in an additional dependency that wouldn't really have a benefit.

I think replacing CLAPACK with LAPACKE should be fairly straightforward. Again I will need to update the documentation and I will need help testing on other platforms.

(This section added April 17)

Travis-CI allows you to set up different build configurations and add them to the "build matrix" to be tested. Different configurations are specified by different environment variables (see http://docs.travis-ci.com/user/build-configuration/#The-Build-Matrix ). Also, Travis runs on Ubuntu, so it should be possible to use the update-alternatives command to painlessly switch between different versions of LAPACK. So it should be possible to test a single gem with multiple, different external libraries. It would work something like:

Replace env section of .travis.yml:

env:
  - USE_ATLAS=1
  - USE_OPENBLAS=1
  - NO_EXTERNAL_LIB=1

Install all additional external dependencies in before_install section.

Use script command to launch external script instead of launching tests directly:

script: ./travis.sh

Where travis.sh contains something like:

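#!/bin/bash
# Pick the LAPACK implementation named by the build-matrix variable,
# then compile and run the specs. The "|| exit 1" makes a failed
# build or spec run fail the Travis job instead of being masked by a
# later "if" that does not match.

if [ -n "$USE_ATLAS" ]
then
  sudo update-alternatives --set liblapack.so.3 /usr/lib/atlas-base/atlas/liblapack.so.3
  # [set up other stuff if necessary]
  bundle exec rake compile && bundle exec rake spec || exit 1
fi

if [ -n "$USE_OPENBLAS" ]
then
  sudo update-alternatives --set liblapack.so.3 /usr/lib/openblas-base/liblapack.so.3
  # [...]
  bundle exec rake compile && bundle exec rake spec || exit 1
fi

# [other cases]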
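The same dispatch could also be written as a table rather than repeated if blocks, which may be easier to extend as configurations are added. This is only a sketch in Ruby: the names LAPACK_ALTERNATIVES, travis_commands, and run_all are hypothetical, and the library paths are the ones from the shell sketch above.

```ruby
# Build-matrix variable => liblapack alternative to select.
# nil means "no external library, nothing to switch".
LAPACK_ALTERNATIVES = {
  "USE_ATLAS"       => "/usr/lib/atlas-base/atlas/liblapack.so.3",
  "USE_OPENBLAS"    => "/usr/lib/openblas-base/liblapack.so.3",
  "NO_EXTERNAL_LIB" => nil
}.freeze

# Returns the shell commands for the configuration selected in env
# (ENV or any hash-like object). Exactly one variable must be set.
def travis_commands(env)
  selected = LAPACK_ALTERNATIVES.keys.reject { |var| env[var].to_s.empty? }
  unless selected.size == 1
    raise ArgumentError, "expected exactly one configuration, got #{selected.inspect}"
  end

  commands = []
  path = LAPACK_ALTERNATIVES[selected.first]
  commands << "sudo update-alternatives --set liblapack.so.3 #{path}" if path
  commands << "bundle exec rake compile"
  commands << "bundle exec rake spec"
  commands
end

# Runs the commands in order and stops at the first failure.
# Returns true only if every command succeeded.
def run_all(commands)
  commands.all? { |cmd| system(cmd) }
end
```

A driver script would call run_all(travis_commands(ENV)) and exit with a non-zero status when it returns false, so that Travis marks the build as failed.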

Please provide a specific timeline for your project from application period until pencils-down. What benchmarks will you set for yourself? The greater the detail on this question and the previous, the better.

Before the coding period, I should talk with my mentors and the community to firm up some parts of my plan that are kind of vague, like exactly what functionality should be in the basic nmatrix gem or how to expose the header files from nmatrix. I should also read up a little bit on gem packaging.

By the end of four weeks of coding, I'd like to have a good, clean implementation of the two gems side-by-side. The two gems together should provide all the functionality that nmatrix provides currently. Obviously, they need to pass all the specs. The nmatrix gem must build and run without depending on ATLAS (or similar libraries) and provide basic math.

After eight weeks, I'd like to have entirely removed the dependency on ATLAS. nmatrix-atlas (maybe not a good name anymore) should compile, build and pass the tests with all three of the LAPACK implementations available to me (and of course on other systems as well).

The remainder of the time I will spend integrating my changes back into the main repository, writing documentation, and taking care of whatever tasks slipped through the cracks. Ideally I would like to test on OS X and other platforms throughout the summer, but if that's not possible I will do it in this period. By the end of the summer, installing nmatrix should be as easy as 'gem install nmatrix'. Installing the extra dependencies for nmatrix-atlas should be as easy as 'sudo apt-get install liblapack3' (or whatever).

What is one long-term vision for something you'd like scientific software to be able to do. Think big picture, not necessarily realistic in the short term.

My vision for scientific computing is that scientists should be able to focus on science and the computers should take care of the computing. I don't know exactly what this vision entails, but I know that at the moment we are nowhere near it, as I discussed above. But we're trying to make it better.

What are your hobbies, aside from coding? Tell us a little about yourself that isn't reflected in the rest of your application. What do you want to do with your life (if you have any idea)?

[Edited out this answer]

What else do you think we should have asked but didn't? Propose a question of your own and answer it here.

What's the importance of open-source software in science?

One thing I didn't mention above is how important I think it is for scientists to use open-source software. We have a responsibility to make sure our results are verifiable. We never throw away lab notebooks. I work with a national lab, and the government has all these initiatives for long-term data preservation. But the data is useless without the software needed to analyze it and there's no guarantee that your favorite closed-source program will work twenty years from now. I always have this image that using closed-source software is like writing your lab notebooks in invisible ink. In particle physics we are pretty good about this. I think the astrophysicists are even better, and some fields seem to be a lot worse.

Bonus question: One aim of the SciRuby foundation is to increase diversity in open source science software development. How do we get more women interested in open source software development and science? How do we get more people from underrepresented groups involved?

[Edited out this answer]