Get started - aegisbigdata/documentation GitHub Wiki

A. Quick Introduction to the AEGIS tools

AEGIS offers various tools to address the diverse needs of its users:

  • AEGIS Integrated Services

    The AEGIS platform provides multi-tenant data management and processing services for Big Data. Users and services can securely and privately access and process their data, share it with other users, and grant access to specific services. To do this, AEGIS builds upon Hopsworks, which also lets AEGIS provide integrated support for services such as interactive notebooks with Jupyter, Kafka, and the ELK stack.

  • Query Builder

    Query Builder lets you interactively define and execute queries on data available inside AEGIS. It aims to simplify and accelerate retrieving data and creating views on it. These views can then be saved as new datasets or used as input for higher-level AEGIS tools, such as the Visualiser and the Algorithm Execution Container.

  • Visualiser

    The Visualiser provides the visualisation capabilities of the AEGIS platform. It visualises the querying and filtering results coming from the Query Builder as well as the analysis results produced by the Algorithm Execution Container.

  • Algorithm Execution Container

    This component features a UI built around an algorithm selection template, which gives users some basic information about each algorithm available in the big data analysis platform of AEGIS and lets them configure and execute the selected algorithm.

  • The Data Harvester and Annotator

    The Data Harvester and Annotator features several sub-components which together cover the process of harvesting, transforming, harmonising, annotating and providing the required data and metadata for the AEGIS platform.

The tools above are the ones that users interact with directly through the core AEGIS platform. There are also some tools working "under the hood" to enable certain functionalities:

  • AEGIS Metadata Service, which is responsible for storing the metadata associated with a particular dataset within the AEGIS platform.
  • Brokerage Engine, which includes elements of the Data Policy and the Business Brokerage framework.

Finally, AEGIS offers two optional offline tools, specifically:

  • An anonymisation tool, which is an extensible, schema-agnostic plugin that allows real-time efficient data anonymisation.
  • A cleansing tool that enables data validation, data cleansing and data completion processes to increase the reliability, accuracy and completeness of the data that will be imported into the AEGIS platform.

B. Getting started workflow

Welcome to the AEGIS platform! The following lines aim to help you familiarise yourself with the platform and begin using our various tools and services.

Part 1 - First visit step by step

  1. Create a new account in the platform and wait for your activation e-mail.
  2. Welcome! You are now in the AEGIS home page, where you can browse our assets, i.e. our projects and datasets. Search around a bit to see what we have!
  3. When you have looked around enough, it's time to move on to more interesting things, and for that you will need a project! You might have come to AEGIS through an invitation to join a project, so you may find yourself in an already populated workspace. But we'll assume that's not the case for this intro. So, go ahead and create a project by pressing the yellow button on the right.
  4. You are now in the project's home page. From here you can review the last actions performed by the project users (currently only you). The menu on the left is the entry point to the various offered tools. Please note the cluster utilisation percentage right under the menu: it shows the current workload on the cluster, and it will be affected by the things you do in the platform as well.
  5. Select the "Datasets" menu option. From here you can explore the available datasets and, of course, upload more data to your project. You will find some datasets already there; they are necessary parts of every project, so don't mind them for now. Go ahead and create another dataset by pressing the first "Add New Dataset" item. After that, provide some information about the folder to be created. Then click upload to add some files to your new dataset - make sure to upload some csv files as well to use in our next steps. When the upload is over, select one of the uploaded files and notice the hdfs path that appears in the white bar above the datasets container (e.g. hdfs:///Projects/.....csv).
  6. You can then process and visualise the file using one of the tools provided by the AEGIS platform. These tools are under the "AEGIS Jupyter Tools" icon in the left menu. The first time you click this menu item, the platform will ask you to activate Anaconda, where you can select which Python version you wish to use. Then you need to click "AEGIS Jupyter Tools" again. In the menu you will now find the three tools (Query Builder, Visualiser and Algorithm Execution Container). You can see the documentation and source of each tool by clicking its button.
  7. Click the "Visualiser" button. For more information on how this tool is used and what you can do with it, you may refer to this guide. For now, just try loading one of your previously uploaded csv files and then choose a simple visualisation.
  8. Click the "Query Builder" button. Let's assume that in the previous step, when using the Visualiser, you would have wanted to filter out some values prior to visualising the data. Follow the provided instructions here to perform this simple task. The guide provides information for more advanced tasks as well and you can always try things on your own to find out exactly what you can do.
  9. Select the "Algorithm Execution Container" button. Again, for information about what you can do with this tool and how, you may refer to this guide. Here you will need that hdfs path from step 5. Copy it and go ahead to try the tool!
  10. Go to the "Extended" left menu option. From here you can add metadata to your data. This metadata is used for enhanced search and may also, under the hood, facilitate some data management processes.
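
The filtering idea from step 8 can also be sketched in plain Python, for instance in a notebook cell. This is only a minimal illustration and not how the Query Builder is implemented: the column names, the sample rows and the `filter_rows` helper are all hypothetical, and the data is read from an in-memory string rather than from an hdfs path.

```python
import csv
import io

# Hypothetical sample data standing in for one of your uploaded csv files.
SAMPLE_CSV = """city,temperature
Athens,31.5
Berlin,18.2
Stockholm,12.0
"""


def filter_rows(csv_text, column, min_value):
    """Return the rows whose numeric `column` is at least `min_value`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[column]) >= min_value]


# Keep only the rows we would want to pass on to a visualisation.
warm = filter_rows(SAMPLE_CSV, "temperature", 15.0)
for row in warm:
    print(row["city"], row["temperature"])
```

The same pattern (read, filter, then hand the reduced rows to a plotting step) is what the Query Builder and Visualiser let you do interactively without writing code.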

Part 2 - Advanced options

Once you familiarise yourself with the processes described in the first tour, you will understand how the related tools can be leveraged to enable and/or facilitate various data management and processing tasks. There are four menu options, not included in the first part, that offer exactly these capabilities:

  1. Jupyter button in the "AEGIS Jupyter Tools" menu: This opens another window with Jupyter Notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more (see also the Jupyter website). You have already seen Jupyter in the first tour, since the Visualiser and Query Builder are built inside this notebook. We support the sparkmagic Jupyter kernel, which is used to run Spark applications.
  2. Jobs menu option: This will take you to the AEGIS UI that provides support for running Jobs. Jobs can be Spark applications or Spark workflows, and they can be scheduled for periodic execution or run on demand.
  3. Kafka: This will take you to the Kafka management page. AEGIS supports Kafka-as-a-Service. Kafka topics (used as channels for producing/consuming messages) are private to projects and users can create a topic with just a few clicks. Topics can also be shared with other projects enabling real-time communication between projects.
  4. Model Serving (TensorFlow): The AEGIS platform provides a model serving service for deploying pre-trained models using TensorFlow Serving. To create a new serving, click the “Create New Serving” button and then provide a model name, version, and path.
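
Once a serving from option 4 is running, a client would typically query it over the TensorFlow Serving REST API. The sketch below only builds such a request and does not send it. It assumes the serving is reachable through the standard TensorFlow Serving `:predict` endpoint, which this guide does not confirm for AEGIS; the host, port, model name and input values are hypothetical placeholders, and `build_predict_request` is a helper introduced here for illustration.

```python
import json
import urllib.request


def build_predict_request(host, port, model_name, version, instances):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}/versions/{version}:predict"
    # TensorFlow Serving's row format expects the inputs under "instances".
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are hypothetical; use the model name and version
# you entered when creating the serving.
request = build_predict_request("localhost", 8501, "my_model", 1, [[1.0, 2.0, 3.0]])
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually send it, you could call urllib.request.urlopen(request).
```

The model name and version in the URL correspond to the values you provide in the "Create New Serving" form, which is why it helps to choose them deliberately.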