Accessing Files & Resources in CLU Code - clulab/processors GitHub Wiki

Introduction

CLU Lab projects tend to involve lots of information that isn't source code. Often "lots" refers to size: this is the "big" and the "data" of Big Data, which includes corpora, neural network models, and word embeddings. However, "lots" can also refer to count: many smaller pieces like grammar rules, stop word lists, ontologies, configuration files, and the like. Either way, and perhaps because of the wide range, managing this data is often not as straightforward as managing the code that uses it. There are many choices to be made, quirks to be aware of, gotchas to avoid, formats to use, standards to meet, code to share, etc. While advice abounds on the internet, it might not be obvious how it relates to local practices, so local practice is the emphasis of this work in progress. Here's what you might need to be aware of when working with data.

Access

In the end, almost all the data we work with will end up in a file/document on the local disk. Occasionally information is retrieved on the fly from a network, but that is not addressed here. In part because we very often use Java/Scala, there is a distinction made between accessing the data as a file and accessing it as a resource. It may be the very same data in each case, but the procedure for getting to it is different.

Files

Files are less dependent on programming language, closely related to the operating system, and generally more flexible. They can be erased, moved, copied, and read with random access. Interfaces are provided through classes including File in java.io and Path and Files in java.nio.file. There are lots of tutorials about them, such as Basic I/O. Something very important to remember is that finding a file often depends on the current (or present) working directory. If you try to access a file and get null returned or an exception thrown, especially if you use a relative path (such as either file.txt or ./file.txt), it may be that the development environment is manipulating the working directory in ways not obvious to you. It may be helpful to check where . is with code such as

println(new java.io.File(".").getAbsolutePath)

Something else to remember is that Java/Scala does not interpret ~ as your home directory. To get that try

System.getProperty("user.home")

and build a file name from there.

Some calls that access files, most often executables, make use of the operating system's $PATH environment variable, and it is sometimes helpful to double check its value:

println(System.getenv("PATH").split(java.io.File.pathSeparator).mkString("\n"))
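When an executable cannot be found, it can help to repeat the search by hand. The following is a minimal sketch of such a lookup, not the way the operating system itself does it. The names PathLookup, pathDirs, and findOnPath are hypothetical, and the sketch assumes the program name already includes any extension, which matters on Windows where executables usually end in .exe or similar.

```scala
import java.io.File

object PathLookup {
  // Splits a PATH-style string into directories, dropping empty entries.
  def pathDirs(path: String): Seq[File] =
    path.split(File.pathSeparator).toSeq.filter(_.nonEmpty).map(new File(_))

  // Returns the first regular, executable file with the given name in the listed directories.
  def findOnPath(name: String, path: String = sys.env.getOrElse("PATH", "")): Option[File] =
    pathDirs(path).map(dir => new File(dir, name)).find(file => file.isFile && file.canExecute)

  def main(args: Array[String]): Unit =
    println(findOnPath("java").map(_.getAbsolutePath).getOrElse("java was not found on the PATH"))
}
```

The path is a parameter with a default so that the same function can be pointed at a different list of directories when diagnosing a problem.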

Resources

Resources, on the other hand, are more specific to Java/Scala, look more like URLs, and are meant to be read only via a stream interface. They wouldn't generally be moved around, have names unknown at program build time, or even go missing. If you have the jar file that is running your code, the resources should be right there in the same jar file with it or in a different jar or zip file close by that is providing library-like support for the main program. Your development tools should see to it that the resource is packaged in the jar, possibly by copying it from a file somewhere in the build environment. The interface to resources is provided by methods in ClassLoader, which knows how the program was started up and where classes and resources can be found. There are several ways to get hold of a class loader and create a resource stream for reading the data:

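// Class.getResourceAsStream: a leading / means absolute; otherwise the name is relative to the class's package.
this.getClass.getResourceAsStream("/resource.txt")
// ClassLoader.getResourceAsStream: the name is always absolute and must not begin with a /.
this.getClass.getClassLoader.getResourceAsStream("resource.txt")
Thread.currentThread.getContextClassLoader.getResourceAsStream("resource.txt")

The semantics of these variations differ in how paths are treated. The method on Class resolves a name without a leading / relative to the package of the class, while the methods on ClassLoader always resolve from the root of the classpath and generally return null if the name begins with a /. There can also be differences between the class loaders returned by getClassLoader and getContextClassLoader, particularly when a framework or build tool installs its own class loaders. Sometimes problems can be diagnosed by checking where the class loader thinks it is operating, more or less its working directory: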

println(this.getClass.getResource("."))
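To make the difference in path handling concrete, the following sketch looks up the class file of java.lang.String through both a Class and a ClassLoader. That file was chosen only because it is sure to be available on any JVM; the object name ResourceLookup and its member names are hypothetical.

```scala
import java.net.URL

object ResourceLookup {
  private val anchor: Class[_] = classOf[String]
  // The system class loader is used because classOf[String].getClassLoader is null (bootstrap loader).
  private val loader: ClassLoader = ClassLoader.getSystemClassLoader

  // Class lookup without a leading slash: resolved relative to the package java.lang.
  val classRelative: Option[URL] = Option(anchor.getResource("String.class"))
  // Class lookup with a leading slash: resolved from the root of the classpath.
  val classAbsolute: Option[URL] = Option(anchor.getResource("/java/lang/String.class"))
  // ClassLoader lookup: always from the root, written without a leading slash.
  val loaderPlain: Option[URL] = Option(loader.getResource("java/lang/String.class"))
  // ClassLoader lookup with a leading slash: expected to find nothing.
  val loaderSlash: Option[URL] = Option(loader.getResource("/java/lang/String.class"))

  def main(args: Array[String]): Unit =
    Seq(classRelative, classAbsolute, loaderPlain, loaderSlash).foreach(println)
}
```

Wrapping each result in Option turns the null that signals a missing resource into None, which is harder to overlook.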

Similarly to how file access is influenced by $PATH, resource access is influenced by a "classpath": one specified in the $CLASSPATH environment variable, one added in the command line call to java with -cp or -classpath, one set by an IDE run configuration, etc. For class loaders that are instances of URLClassLoader, which was usually the case before Java 9, one can check where resources are expected to reside by printing the loader's URLs. For other class loaders, the java.class.path system property is a reasonable fallback:

this.getClass.getClassLoader match {
  case loader: java.net.URLClassLoader => loader.getURLs.foreach(println)
  case _ => println(System.getProperty("java.class.path")) // e.g., the default loader on Java 9+
}

Jar files are largely compatible with zip files, so one can usually change the file's extension to .zip and look inside the file for the resource in question.
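The same check can be made in code without renaming anything, because the classes in java.util.zip read jar files directly. This is a minimal sketch; the names JarInspector, entryNames, and contains are hypothetical, and the jar location is whatever path you pass in.

```scala
import java.util.zip.ZipFile
import scala.collection.mutable.ListBuffer

object JarInspector {
  // Lists the names of all entries in a jar or zip file.
  def entryNames(jarPath: String): List[String] = {
    val zip = new ZipFile(jarPath)
    try {
      val entries = zip.entries()
      val names = ListBuffer.empty[String]
      while (entries.hasMoreElements) names += entries.nextElement().getName
      names.toList
    } finally zip.close()
  }

  // Entry names in a jar have no leading slash, so one is removed from the resource path if present.
  def contains(jarPath: String, resourcePath: String): Boolean =
    entryNames(jarPath).contains(resourcePath.stripPrefix("/"))
}
```

A call such as JarInspector.contains(jarPath, "/org/clulab/project/resource.txt") then answers whether the resource was packaged at all.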

Path

The location of a file or resource is often described using a string called the path which has quite a bit of internal structure that needs to be parsed. Something one runs into right away is the distinction between absolute and relative paths. Using the right one in the wrong place can be the cause of significant frustration.

Absolute

Absolute paths generally begin with a /. Even that statement includes a hedge because sometimes a \ is called for. For that matter, in some circumstances a drive letter (think C:\) or another device name like prn: comes first, especially if the string is passed through to C++ code, and sometimes a protocol such as file: or jar: is added to the front of the string. Mostly the single forward slash gives it away, and Java/Scala will in most situations adjust it appropriately for operating systems that use something else.

The thing to remember is that absolute paths should be avoided when accessing files and preferred when accessing resources. One generally doesn't know how the hard drive of the person using your software is organized, so /home/me probably doesn't exist. Places that do exist are probably not available for writing and maybe not even for reading. For locations such as /tmp there are special methods to use:

val file = java.io.File.createTempFile("temp", ".tmp")
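Temporary files are easy to leave behind, so it can be convenient to wrap their creation and removal together. The following is a sketch of one way to do that; the names TempFiles and withTempFile are hypothetical. Note that createTempFile requires a prefix of at least three characters.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object TempFiles {
  // Creates a temporary file, hands it to the block, and deletes the file afterwards,
  // even if the block throws an exception.
  def withTempFile[T](prefix: String, suffix: String)(block: File => T): T = {
    val file = File.createTempFile(prefix, suffix)
    try block(file)
    finally file.delete()
  }

  def main(args: Array[String]): Unit = {
    val size = withTempFile("temp", ".tmp") { file =>
      Files.write(file.toPath, "hello".getBytes(StandardCharsets.UTF_8))
      file.length
    }
    println(s"Wrote $size bytes to a temporary file that has since been deleted.")
  }
}
```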

With resources, on the other hand, you should know how they are organized, since you are in charge of creating the jar file. It's often easiest to use an absolute path because then you don't have to worry about what it's relative to. During development, resources often reside in a file located relative to the project directory at ./src/main/resources/org/clulab/project/resource.txt. When the jar is eventually created, the resources should be at /org/clulab/project/resource.txt.
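Reading such a resource might look like the following sketch. The names ResourceReader and readResource are hypothetical, the text is assumed to be encoded as UTF-8, and the context class loader is only one reasonable default.

```scala
import scala.io.{Codec, Source}

object ResourceReader {
  // Reads a whole resource as UTF-8 text; None means the resource was not found.
  // The path is written in the absolute style, e.g., /org/clulab/project/resource.txt.
  def readResource(path: String,
      loader: ClassLoader = Thread.currentThread().getContextClassLoader): Option[String] =
    // Class loaders expect names without the leading slash, so it is removed here.
    Option(loader.getResourceAsStream(path.stripPrefix("/"))).map { stream =>
      val source = Source.fromInputStream(stream)(Codec.UTF8)
      try source.mkString
      finally source.close() // Closing the source also closes the underlying stream.
    }
}
```

Returning an Option forces the caller to decide what to do when the resource is missing rather than running into a NullPointerException later.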

Relative

Relative paths are the thing to use for files. Generally users have some sort of home on their drive and are sitting in a subdirectory denoted by . in a path. You can ask them to ensure that the location is writable so that you can create a file there or ask them to be in that place so that a file can be found and read there.

It starts getting tricky when you don't want to force the user to go to a particular location to find files. There's a tendency to try to use ~ as a kind of absolute path on a per user basis. Unfortunately, that doesn't work in Java/Scala. It's quite specific to the operating system and even a shell program. To find the user's home directory, use code such as this in which Java has figured out where the home directory is located:

System.getProperty("user.home")

Be aware that lots of programs try to write files to the home directory, often hiding them using attributes or names beginning with a period, so that the storage device fills up even though other devices are available. Consider other options such as allowing the user to specify file locations as command line parameters or, if necessary, environment variables, or including the path in a configuration file. In addition to not understanding ~, Java generally does not understand strings like $HOME and neither do some operating systems. Windows, for example, is more likely to use %HOMEDRIVE% and %HOMEPATH%. If you want to use the home directory, you'll probably need to build up a string in code such as

System.getProperty("user.home") + java.io.File.separator + "file.txt"

Here the java.io.File.separator is probably not necessary, depending on how the resulting string is used, and / could be used instead. It is included for completeness and reference.
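If users are likely to type ~ anyway, for example in a configuration file, the program can expand it itself. The following is a sketch of that idea using java.nio.file, which also takes care of the separator; the names HomePaths and expandHome are hypothetical, and only a leading ~ followed by / is handled.

```scala
import java.nio.file.{Path, Paths}

object HomePaths {
  private def home: Path = Paths.get(System.getProperty("user.home"))

  // Replaces a leading ~ with the user's home directory; other paths are returned unchanged.
  def expandHome(path: String): Path =
    if (path == "~") home
    else if (path.startsWith("~/")) home.resolve(path.substring(2))
    else Paths.get(path)

  def main(args: Array[String]): Unit =
    println(expandHome("~/file.txt"))
}
```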

Things get extra tricky when one wants to specify that either a file or a resource is to be used. Some of the lab's software is delivered with data stored as resources while also allowing the user to override the resource data with a file. It is usually infeasible to express both possibilities in a single string. The resource should use an absolute path and the file should have a relative path. In these situations, entries in a configuration file tend to take one of two forms, either a separate key for each possibility or a single key with alternative values:

resourcePath = /org/clulab/glove/glove.840B.300d.10f
filePath = ./glove/glove.840B.300d.10f

// path = /org/clulab/glove/glove.840B.300d.10f // use this for the resource
path = ./glove/glove.840B.300d.10f // use this for the file

In the second form, one option or the other is commented out.
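Code that honors the first form, with both keys present, might look like the following sketch. It is one possible approach under stated assumptions rather than how any particular lab library does it: the names DataOpener and open are hypothetical, the file wins whenever it exists, and the caller is responsible for closing the returned stream.

```scala
import java.io.{File, FileInputStream, InputStream}

object DataOpener {
  // Prefers the file when it exists and otherwise falls back to the packaged resource.
  // None means that neither the file nor the resource could be found.
  def open(filePath: Option[String], resourcePath: String,
      loader: ClassLoader = Thread.currentThread().getContextClassLoader): Option[InputStream] = {
    val fromFile: Option[InputStream] =
      filePath.map(new File(_)).filter(_.isFile).map(file => new FileInputStream(file))
    // Class loaders expect names without the leading slash, so it is removed here.
    fromFile.orElse(Option(loader.getResourceAsStream(resourcePath.stripPrefix("/"))))
  }
}
```

With the example configuration, the call would be DataOpener.open(Some("./glove/glove.840B.300d.10f"), "/org/clulab/glove/glove.840B.300d.10f").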

Data
Objects

ClassLoader

Thread
ClassProvider

Environment

IntelliJ
sbt

Programming Language

Scala
Java

Closing

Manual
Automatic

Encoding

utf-8
8859

Incorporation into Project

/src/main/resources
libraryDependencies

Slashes

Forward slash
Backslash

Libraries

processors
eidos
dynet

Size

Small
Medium
Large

Examples

Glove
Properties
Rules
Models
Config
logback.xml
yaml