Project Ideas - MSUCSIS/Inference GitHub Wiki

Course Project Ideas for Murray State Students

CSC-445

For the algorithms course, I think there are probably quite a few possibilities.

Test Dependencies

We would like to be able to create a graph representing test dependencies so that we can determine when it is possible to skip certain tests based on previous results. The part that is more related to the course is that we don't necessarily want a user to have to define this graph explicitly. What would also be interesting for this course, I think, is that once the graph is defined, we must generate an execution order for the tests that respects all dependencies (a topological sort), and we can simplify the graph by removing redundant edges (a transitive reduction).
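A minimal sketch of both ideas, assuming a graph keyed by test names (the names and dependencies below are hypothetical stand-ins, not the library's actual tests):

```java
import java.util.*;

// Sketch: a dependency graph over hypothetical test names, supporting a
// topological sort (a dependency-respecting execution order) and a naive
// transitive reduction (removal of redundant edges).
public class TestDependencyGraph {
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    public void addTest(String name) {
        edges.computeIfAbsent(name, k -> new LinkedHashSet<>());
    }

    // An edge prerequisite -> dependent means the dependent's result
    // may be implied or skipped based on the prerequisite's result.
    public void addDependency(String prerequisite, String dependent) {
        addTest(prerequisite);
        addTest(dependent);
        edges.get(prerequisite).add(dependent);
    }

    // Kahn's algorithm: an order that respects every dependency edge.
    public List<String> executionOrder() {
        Map<String, Integer> inDegree = new HashMap<>();
        edges.keySet().forEach(n -> inDegree.put(n, 0));
        edges.values().forEach(ts -> ts.forEach(t -> inDegree.merge(t, 1, Integer::sum)));
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((n, d) -> { if (d == 0) ready.add(n); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String t : edges.get(n))
                if (inDegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        if (order.size() != edges.size())
            throw new IllegalStateException("dependency cycle");
        return order;
    }

    // Remove edge u -> v if v is still reachable from u without it.
    public void reduce() {
        for (String u : edges.keySet())
            for (String v : new ArrayList<>(edges.get(u))) {
                edges.get(u).remove(v);
                if (!reachable(u, v)) edges.get(u).add(v);
            }
    }

    private boolean reachable(String from, String to) {
        Deque<String> stack = new ArrayDeque<>(edges.get(from));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (n.equals(to)) return true;
            if (seen.add(n)) stack.addAll(edges.get(n));
        }
        return false;
    }

    public int edgeCount() {
        return edges.values().stream().mapToInt(Set::size).sum();
    }

    public static void main(String[] args) {
        TestDependencyGraph g = new TestDependencyGraph();
        g.addDependency("String", "Long");    // every Long also parses as String
        g.addDependency("Long", "Integer");
        g.addDependency("String", "Integer"); // redundant: implied transitively
        g.reduce();
        System.out.println(g.edgeCount());      // prints 2: redundant edge removed
        System.out.println(g.executionOrder()); // prints [String, Long, Integer]
    }
}
```

The quadratic reduction here is only illustrative; a course project would presumably explore asymptotically better transitive-reduction algorithms.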

Dynamic Test Plan

It also seems that if we flip the graph upside down, we get another interesting execution plan: if a parent node's test passes, then all of its children's tests must pass as well. So, should we test more general types in hopes of pruning when there is no match, or should we test more specific types in hopes that a match confirms several other tests without executing them?

This might lead to another project idea: use the past sample results to determine whether we are getting more specific or more general results and dynamically alter the execution plan so that it gets more efficient as it runs.
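One simple way to sketch the adaptive idea, assuming each test is just a predicate on sample strings (the tests and the reordering heuristic below are illustrative assumptions, not the library's API): track each test's observed pass rate and keep the plan sorted so the test most likely to fail, and therefore most likely to prune, runs first.

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch: reorder a list of hypothetical type tests as results accumulate,
// running the test most likely to fail first so mismatches prune early.
public class AdaptivePlan {
    static final class TypeTest {
        final String name;
        final Predicate<String> check;
        long runs = 0, passes = 0;
        TypeTest(String name, Predicate<String> check) { this.name = name; this.check = check; }
        double passRate() { return runs == 0 ? 0.5 : (double) passes / runs; }
    }

    private final List<TypeTest> tests = new ArrayList<>();

    public void add(String name, Predicate<String> check) { tests.add(new TypeTest(name, check)); }

    // Run all tests on one sample, then re-sort: lowest pass rate first.
    public Map<String, Boolean> run(String value) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (TypeTest t : tests) {
            boolean ok = t.check.test(value);
            t.runs++;
            if (ok) t.passes++;
            result.put(t.name, ok);
        }
        tests.sort(Comparator.comparingDouble(TypeTest::passRate));
        return result;
    }

    public List<String> currentOrder() {
        List<String> names = new ArrayList<>();
        for (TypeTest t : tests) names.add(t.name);
        return names;
    }

    public static void main(String[] args) {
        AdaptivePlan plan = new AdaptivePlan();
        plan.add("String", s -> true);              // matches everything
        plan.add("Long", s -> s.matches("-?\\d+")); // matches only digit strings
        for (String sample : new String[] {"abc", "12", "x7", "hello"}) plan.run(sample);
        // "Long" failed most often, so it moves to the front of the plan.
        System.out.println(plan.currentOrder()); // prints [Long, String]
    }
}
```

A real version would weight the decision by test cost and by how many other tests a result implies, but the feedback loop is the same shape.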

Specificity of results

Another issue is that, when generating results and using them to provide a best guess for the type, we want to choose the most specific type that has the highest match percentage. For instance, String will be returned for every input, but if Long also matches every input, then we would want to guess that the actual type of the field is Long because that is a more specific constraint than String. I suspect this is all related to the dependency graphs above, but perhaps not. Either way, it probably would be good to incorporate this into the code somehow.
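A sketch of the tie-breaking rule, assuming a hypothetical child-to-parent type hierarchy and per-type match rates (both are made-up inputs for illustration): filter to the types tied for the best match rate, then pick the one deepest in the hierarchy.

```java
import java.util.*;

// Sketch: among the types tied for the highest match percentage, prefer the
// most specific one. The hierarchy here is a hypothetical parent map.
public class BestGuess {
    // child -> parent (more specific -> more general); illustrative only
    static final Map<String, String> PARENT = Map.of(
            "Integer", "Long",
            "Long", "String",
            "Double", "String");

    // Depth in the hierarchy: deeper means more specific.
    static int depth(String type) {
        int d = 0;
        for (String t = type; PARENT.containsKey(t); t = PARENT.get(t)) d++;
        return d;
    }

    public static String bestGuess(Map<String, Double> matchRates) {
        double best = Collections.max(matchRates.values());
        return matchRates.entrySet().stream()
                .filter(e -> e.getValue() == best)
                .max(Comparator.comparingInt((Map.Entry<String, Double> e) -> depth(e.getKey())))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Both String and Long matched every sample; Long is more specific.
        Map<String, Double> rates = Map.of("String", 1.0, "Long", 1.0, "Double", 0.4);
        System.out.println(bestGuess(rates)); // prints Long
    }
}
```

Note that the parent map is exactly the dependency graph from above read as a subtype relation, which is probably why the two ideas feel related.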

Parsers

I think it would be valuable to be able to define type matches by creating parsers. We already have a regular expression tester, but a parser would be more powerful. You would be able to give a grammar definition and generate a test from it.
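To make "a test defined by a parser" concrete, here is a hand-written recursive-descent sketch for one toy grammar (a `DDDD-DD-DD` date). The grammar is an arbitrary example; the eventual goal described above is to generate such a parser from a grammar definition rather than write it by hand.

```java
// Sketch: a type test backed by a tiny recursive-descent parser instead of a
// regular expression. The date grammar is only an illustrative example.
public class DateParserTest {
    private final String input;
    private int pos = 0;

    private DateParserTest(String input) { this.input = input; }

    // The test passes only if the whole input derives from the grammar.
    public static boolean matches(String input) {
        DateParserTest p = new DateParserTest(input);
        return p.digits(4) && p.literal('-') && p.digits(2)
                && p.literal('-') && p.digits(2) && p.pos == input.length();
    }

    private boolean digits(int n) {
        for (int i = 0; i < n; i++) {
            if (pos >= input.length() || !Character.isDigit(input.charAt(pos))) return false;
            pos++;
        }
        return true;
    }

    private boolean literal(char c) {
        if (pos < input.length() && input.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("2024-01-31")); // true
        System.out.println(matches("2024-1-31"));  // false: month needs 2 digits
    }
}
```

Unlike a regular expression, this style extends naturally to grammars with nesting or context that regexes cannot express.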

Hard Mode

An idea I have been playing around with is somehow defining tests in terms of other tests. Perhaps we could combine parsers to create new parsers and use the knowledge of how they were composed to define dependencies between tests automatically. I need to think about this idea a little more, though.

The idea is that we can create an n-digit parser, then define a phone number as a 3-digit parser, followed by a -, followed by a 3-digit parser, etc. Once we do this, can we create a test, or a subgraph of tests, that combines all of the related parsers and reports whichever sub-grammar was matched? That is, if the parser sees three digits, it returns a match for the three-digit parser; if it sees four digits, it fails; if it sees the full pattern needed for a phone number, we get a phone number; and so on.

In other words, the grammars themselves define the dependencies between the tests.
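A rough sketch of the composition idea, using minimal hand-rolled combinators (everything here, including the names and the `int`-position parsing convention, is an assumption for illustration): when a composite parser is built from parts, it records the parts' names, and that record is exactly the dependency information the tests need.

```java
import java.util.*;

// Sketch: tiny parser combinators where composing parsers also records which
// sub-parsers the composite depends on, so the grammar itself yields the
// test dependency graph.
public class ComposedParsers {
    // A parser returns the new position on success, or -1 on failure.
    interface P { int parse(String s, int pos); }

    static final class Named {
        final String name;
        final P parser;
        final List<String> parts; // names of the sub-parsers this one is built from
        Named(String name, P parser, List<String> parts) {
            this.name = name; this.parser = parser; this.parts = parts;
        }
        boolean matches(String s) { return parser.parse(s, 0) == s.length(); }
    }

    static Named digits(int n) {
        return new Named(n + "-digits", (s, pos) -> {
            for (int i = 0; i < n; i++, pos++)
                if (pos >= s.length() || !Character.isDigit(s.charAt(pos))) return -1;
            return pos;
        }, List.of());
    }

    static Named lit(char c) {
        return new Named("'" + c + "'",
                (s, pos) -> (pos < s.length() && s.charAt(pos) == c) ? pos + 1 : -1,
                List.of());
    }

    static Named seq(String name, Named... parts) {
        List<String> names = new ArrayList<>();
        for (Named p : parts) names.add(p.name);
        return new Named(name, (s, pos) -> {
            for (Named p : parts) {
                pos = p.parser.parse(s, pos);
                if (pos < 0) return -1;
            }
            return pos;
        }, names);
    }

    public static void main(String[] args) {
        Named phone = seq("phone",
                digits(3), lit('-'), digits(3), lit('-'), digits(4));
        System.out.println(phone.matches("270-555-1234")); // true
        System.out.println(phone.matches("2705551234"));   // false
        // The composition tells us which tests a "phone" match implies:
        System.out.println(phone.parts);
    }
}
```

A "phone" match here implies matches for each recorded part, which is precisely an edge set in the dependency graph discussed for CSC-445.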

Code Generation

Once we have guesses for our data types, we would like to generate code. We could target Java source, JVM bytecode, or other languages.
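The Java-source flavor could be as simple as string templating over the inferred field-to-type map. A minimal sketch, with a made-up class name and fields (bytecode generation would be a much bigger project, likely involving a library such as ASM):

```java
import java.util.*;

// Sketch: once each field has a guessed type, emit a plain Java class for the
// document. The class name and field map are hypothetical inputs.
public class CodeGen {
    public static String javaClass(String name, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("public class " + name + " {\n");
        fields.forEach((field, type) ->
                sb.append("    private ").append(type).append(' ').append(field).append(";\n"));
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> inferred = new LinkedHashMap<>();
        inferred.put("id", "Long");
        inferred.put("name", "String");
        System.out.print(javaClass("Record", inferred));
    }
}
```

A fuller version would also emit getters, setters, and nested classes for inferred structures.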

Opposite Direction

You could perhaps investigate a tool for generating test data, which might be reusable for other use cases, where we somehow derive randomly generated data from the tests to create random documents. We could use the random documents to test the performance of the CSC-410 projects, for instance.
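A sketch of running tests "in reverse": per-type generators producing values the corresponding tests should accept. The generators here are hand-written assumptions for two example types; a grammar-driven version could derive them automatically from the parser definitions above.

```java
import java.util.*;

// Sketch: generate random values that a given type test should accept, and
// assemble them into random flat "documents". Field names are illustrative.
public class DataGen {
    private final Random rng;
    public DataGen(long seed) { this.rng = new Random(seed); } // seeded for repeatability

    public String randomLong() {
        return Long.toString(rng.nextLong());
    }

    public String randomPhone() {
        return String.format("%03d-%03d-%04d",
                rng.nextInt(1000), rng.nextInt(1000), rng.nextInt(10000));
    }

    // A random flat document of generated field values.
    public Map<String, String> randomDocument() {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", randomLong());
        doc.put("phone", randomPhone());
        return doc;
    }

    public static void main(String[] args) {
        DataGen gen = new DataGen(42);
        Map<String, String> doc = gen.randomDocument();
        System.out.println(doc);
        // The generated values should pass the tests they were derived from.
        System.out.println(doc.get("phone").matches("\\d{3}-\\d{3}-\\d{4}")); // true
    }
}
```

The round trip (generate from a test, then confirm the test matches) also doubles as a self-check for the test definitions themselves.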

Infer data structures

Right now, we only work on individual fields. We don't attempt to infer that, e.g., if a document has sibling fields that are a string, a string, a 2-character string, and a 5-digit number, it could be an address. I think this could be paired with the ability to read different types of document schemas (like xsd, dtd, java classes?) and extract definitions (and, accordingly, tests) automatically. So, we know the primitives, and we automate the import of types that build on the primitives.

Infer Regular Expression

It might be interesting to have a stateful test that attempts to infer an underlying regular expression for the data in a field. We would not necessarily want to use this as a normal test, since it would, by definition, always be true, but perhaps it would be a good idea to run all fields through such a test (maybe we shouldn't call it a test) so that the user can see the inferred regular expression and perhaps gain some insight into what the type is.
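One crude sketch of such an inference, under a strong simplifying assumption: fold each sample into a coarse "shape" (runs of digits, runs of letters, escaped literals) and report a pattern only when every sample shares the same shape. Real inference of a minimal regular expression is a much harder problem; this just illustrates the stateful, always-accepting observer idea.

```java
import java.util.*;

// Sketch: a stateful observer that folds each sample into a coarse shape and
// reports a regular expression when all samples share that shape.
public class RegexInference {
    private String shape = null;
    private boolean consistent = true;

    public void observe(String sample) {
        String s = shapeOf(sample);
        if (shape == null) shape = s;
        else if (!shape.equals(s)) consistent = false;
    }

    public Optional<String> inferredPattern() {
        return consistent && shape != null ? Optional.of(shape) : Optional.empty();
    }

    private static String shapeOf(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                int n = 0;
                while (i < s.length() && Character.isDigit(s.charAt(i))) { i++; n++; }
                sb.append("\\d{").append(n).append('}');
            } else if (Character.isLetter(c)) {
                int n = 0;
                while (i < s.length() && Character.isLetter(s.charAt(i))) { i++; n++; }
                sb.append("[A-Za-z]{").append(n).append('}');
            } else {
                sb.append('\\').append(c); // escape other characters literally
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RegexInference inf = new RegexInference();
        inf.observe("270-555-1234");
        inf.observe("800-123-4567");
        System.out.println(inf.inferredPattern().orElse("(no consistent pattern)"));
    }
}
```

A better version would generalize run lengths into ranges (e.g. `\d{3,5}`) instead of giving up when shapes differ.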

CSC-410

Looking at the parallel aspect of this, I think there are a couple of options. One is to parallelize the testing of data by concurrently executing tests on a sample; the other is to parallelize the process by dividing the data and testing the subsets concurrently. Ideally, we would do both.

Concurrent Tests

On the concurrent testing side, we would want to build up a graph of dependencies for our tests, then determine which tests can be executed simultaneously based on that graph.
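One way to sketch this, assuming the dependency graph from the CSC-445 ideas: group the tests into levels, where each level depends only on earlier levels, and run each level's tests concurrently on a thread pool. The test names and the trivial test bodies are placeholders.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: execute tests level by level; within a level, no test depends on
// another, so the level can run concurrently. Dependencies are hypothetical.
public class ConcurrentRunner {
    // test -> the tests it depends on
    private final Map<String, Set<String>> deps = new LinkedHashMap<>();

    public void add(String test, String... prerequisites) {
        deps.put(test, new LinkedHashSet<>(Arrays.asList(prerequisites)));
    }

    // Partition into levels: a test joins the first level where all of its
    // prerequisites are already scheduled.
    public List<List<String>> levels() {
        List<List<String>> out = new ArrayList<>();
        Set<String> done = new HashSet<>();
        while (done.size() < deps.size()) {
            List<String> level = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : deps.entrySet())
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue()))
                    level.add(e.getKey());
            if (level.isEmpty()) throw new IllegalStateException("dependency cycle");
            done.addAll(level);
            out.add(level);
        }
        return out;
    }

    public void run(ExecutorService pool) throws Exception {
        for (List<String> level : levels()) {
            List<Future<?>> futures = new ArrayList<>();
            for (String test : level)
                futures.add(pool.submit(() -> System.out.println("running " + test)));
            for (Future<?> f : futures) f.get(); // finish a level before the next
        }
    }

    public static void main(String[] args) throws Exception {
        ConcurrentRunner r = new ConcurrentRunner();
        r.add("String");
        r.add("Long", "String");   // only worth running if String matched
        r.add("Double", "String");
        r.add("Integer", "Long");
        System.out.println(r.levels()); // [[String], [Long, Double], [Integer]]
        ExecutorService pool = Executors.newFixedThreadPool(2);
        r.run(pool);
        pool.shutdown();
    }
}
```

The level barrier is the simplest correct policy; a finer-grained scheduler could launch a test the moment its own prerequisites finish.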

Another idea that might be interesting is to simply divide the graph into "equal sized" subgraphs (for some definition of equal sized) that may potentially have some overlap, and execute the subgraphs independently. We would waste resources because some work would be repeated and we would need to make sure that executing the same tests multiple times doesn't affect the final result, but it still might be easier to do (though it might not be interesting enough for a class project).

Where does concurrency occur?

In either case, we have concurrent tests that can be executed, but executed where? I think we could simply do this locally on the processor by having a thread for each subgraph, or we could distribute to different machines.

If we use a framework like Akka, we could even make this choice configurable by simply distributing to an actor system that may or may not be distributed across machines.

Distributed Data

On the data side, we would simply be sending subsets of the data to different threads, processors, or machines, each of which executes all of the tests (or a subgraph, if combined with the concurrent tests).

This would be problematic, though, for stateful tests like the enumeration test.
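The usual fix is to require every test's result, including any internal state, to be mergeable across partitions. A sketch under that assumption (the counts, the digit-matching "Long" test, and the enumeration state are all illustrative): match counts merge by addition, and the enumeration test's state, the set of distinct values seen, merges by set union.

```java
import java.util.*;

// Sketch: each data partition produces a partial result that must be
// mergeable. Counts merge by addition; the stateful enumeration test's
// state (distinct values seen) merges by union.
public class PartialResults {
    long samples = 0;
    long longMatches = 0;
    final Set<String> distinctValues = new HashSet<>(); // enumeration-test state

    public void observe(String value) {
        samples++;
        if (value.matches("-?\\d+")) longMatches++;
        distinctValues.add(value);
    }

    public PartialResults merge(PartialResults other) {
        PartialResults out = new PartialResults();
        out.samples = samples + other.samples;
        out.longMatches = longMatches + other.longMatches;
        out.distinctValues.addAll(distinctValues);
        out.distinctValues.addAll(other.distinctValues);
        return out;
    }

    public static void main(String[] args) {
        // Two partitions processed independently, then merged.
        PartialResults a = new PartialResults();
        for (String v : new String[] {"1", "2", "x"}) a.observe(v);
        PartialResults b = new PartialResults();
        for (String v : new String[] {"2", "3"}) b.observe(v);
        PartialResults merged = a.merge(b);
        System.out.println(merged.samples);               // 5
        System.out.println(merged.longMatches);           // 4
        System.out.println(merged.distinctValues.size()); // 4 distinct values
    }
}
```

This associative, commutative merge is exactly the contract that frameworks like Spark and Hadoop expect from per-partition aggregations.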

How would this be done?

I think there are a few interesting options here. One thing we could do is use Akka. Another option would be to make the library easier to use in a distributed processing system like Spark or Hadoop.

In the context of the projects, though, we would have to make sure there is enough meat to this for it to be accepted. It may be that you don't need to do much other than drop the library in and write two or three lines of code, which may or may not be enough.

I think you would need to talk to Dr. Jointer and see where he draws the line between using distributed systems and developing distributed algorithms, etc.

CSC-530

Web App and REST API

Server

For this class, I think the low-hanging fruit would be putting the library on display by building a website and providing a programmatic API that people can use to process their documents.

Client

Additionally, it would be nice to have some client code that the users can work with so that they don't have to mess with the raw HTTP requests/responses. We could perhaps mimic the interfaces in the library so that the client code acts as a kind of proxy.

Another cool project, which might go well with the CSC-410 work, is to build a streaming data monitor that would allow the user to collect data and establish type estimates, then observe data in the future to a) make sure it matches the estimates and b) improve the estimates by continuing to test.

Errors would indicate either malformed documents or that the data has changed and a new model is needed (which we may be able to build from the new estimates we were collecting). It may also be interesting to look into having a continuously evolving model that we check against.
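A minimal sketch of the monitoring half, under the simplifying assumption that the estimate is just a baseline match rate for one test: compare the match rate over a sliding window of recent samples against the baseline and flag the field when they diverge.

```java
import java.util.*;

// Sketch: compare a test's match rate over a recent window against the rate
// established during estimation, flagging drift past a tolerance. The
// baseline, tolerance, and window size are hypothetical parameters.
public class DriftMonitor {
    private final double baselineRate;
    private final double tolerance;
    private final int windowSize;
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int passes = 0;

    public DriftMonitor(double baselineRate, double tolerance, int windowSize) {
        this.baselineRate = baselineRate;
        this.tolerance = tolerance;
        this.windowSize = windowSize;
    }

    public void observe(boolean matched) {
        window.addLast(matched);
        if (matched) passes++;
        if (window.size() > windowSize && window.removeFirst()) passes--;
    }

    public boolean drifted() {
        if (window.size() < windowSize) return false; // not enough evidence yet
        return Math.abs((double) passes / window.size() - baselineRate) > tolerance;
    }

    public static void main(String[] args) {
        // Baseline: the Long test matched 100% of samples during estimation.
        DriftMonitor m = new DriftMonitor(1.0, 0.1, 4);
        for (String v : new String[] {"1", "2", "3", "4"}) m.observe(v.matches("-?\\d+"));
        System.out.println(m.drifted()); // false: still matches the baseline
        for (String v : new String[] {"a", "b", "c", "d"}) m.observe(v.matches("-?\\d+"));
        System.out.println(m.drifted()); // true: the data has changed
    }
}
```

A drift signal here cannot by itself distinguish malformed documents from a genuine schema change; as noted above, keeping a continuously updated second estimate alongside the baseline is one way to tell them apart.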