Environment Setup - Sotera/track-communities GitHub Wiki

Warning

The .box files we have appear to be missing one or two components, and these are now nearly impossible to install because the images are so old that their package managers no longer work. For example, even after you fix apt-get so that you can install pip, you get pip v1.0, when pip is now at v21.1. PyPI stopped accepting non-SSL connections somewhere around pip v9, so you cannot pip install from this VM without a full upgrade.

We'll try to rebuild on a modern VM / Docker image.

Otherwise, here are notes on how it used to work, mixed with attempts to fix it.

Prerequisites

See the XDATA VM wiki for baseline software if installing or setting up on your own machine.

Load VM into Vagrant

This takes four steps:

  1. Go to the folder with the xdata-[version].box file. Let's assume you are using 0.2.1. (Change as needed.)

  2. Add the XDATA VM box definition to Vagrant:

% vagrant box add  xdata-vm-0.2.1  ./xdata-0.2.1.box

  3. Initialize a new VM based on the XDATA VM box configuration:

% vagrant init xdata-vm-0.2.1

  4. Start the VM:

% vagrant up
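
If you use a version other than 0.2.1, the box name and file name both change. A minimal sketch of a dry-run helper that prints the Vagrant commands for a given version, assuming the xdata-[version].box naming used above. The function name load_xdata_box and the variables XDATA_VERSION and RUN are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the Vagrant commands for one box version.
# load_xdata_box, XDATA_VERSION and RUN are illustrative names, not part of Vagrant.
load_xdata_box() (
  version="$1"
  box_name="xdata-vm-${version}"
  box_file="./xdata-${version}.box"
  for cmd in \
    "vagrant box add ${box_name} ${box_file}" \
    "vagrant init ${box_name}" \
    "vagrant up"
  do
    if [ "${RUN:-0}" = "1" ]; then
      # Word splitting is intended here; it assumes the version has no spaces.
      $cmd || exit 1
    else
      # Dry run: only show what would be executed.
      echo "$cmd"
    fi
  done
)

load_xdata_box "${XDATA_VERSION:-0.2.1}"
```

Run it as is to review the commands, then set RUN=1 in the environment once they look right for your version.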

Example

(base) ~/vm/xdata-vm-0.2.1 % vagrant box add xdata-vm-0.2.1 xdata-0.2.1.box
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'xdata-vm-0.2.1' (v0) for provider: 
    box: Unpacking necessary files from: file:///Users/.../vm/xdata-vm-0.2.1/xdata-0.2.1.box
==> box: Successfully added box 'xdata-vm-0.2.1' (v0) for 'virtualbox'!
(base) ~/vm/xdata-vm-0.2.1 % vagrant init xdata-vm-0.2.1
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
(base) ~/vm/xdata-vm-0.2.1 % vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'xdata-vm-0.2.1'...
Progress: 90%      ⬅︎⬅︎⬅︎ This step can take a minute or so ⬅︎⬅︎⬅︎
==> default: Matching MAC address for NAT networking...
==> default: Setting the name of the VM: xdata-vm-021_default_1626462962175_24688
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: 
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default: 
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
    default: The guest additions on this VM do not match the installed version of
    default: VirtualBox! In most cases this is fine, but in rare cases it can
    default: prevent things such as shared folders from working properly. If you see
    default: shared folder errors, please make sure the guest additions within the
    default: virtual machine match the version of VirtualBox you have installed on
    default: your host and reload your VM.
    default: 
    default: Guest Additions Version: 4.1.12
    default: VirtualBox Version: 6.1
==> default: Mounting shared folders...
    default: /vagrant => /Users/.../vm/xdata-vm-0.2.1

Sanity Check

Okay, you should now have a running VM!

  • The VirtualBox VM will be in a new folder like ../xdata-021_default_162649866.../
  • Vagrant will create a matching .vbox (not .box) file: xdata-021_default_162649866....vbox.
  • (The long string of digits is a timestamp plus a random number, which keeps the name unique.)
  1. SSH into the box (run this on the host, from the folder containing the Vagrantfile):
% vagrant ssh
Welcome to Ubuntu 12.04.4 LTS (GNU/Linux 3.8.0-39-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Thu Jul 22 12:58:17 UTC 2021

  System load:  0.71               Processes:              91
  Usage of /:   19.4% of 39.34GB   Users logged in:        0
  Memory usage: 16%                IP address for eth0:    10.0.2.15
  Swap usage:   0%                 IP address for docker0: 172.17.42.1

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

Last login: Thu Jun 19 23:15:21 2014 from 10.0.2.2
vagrant@xdata:~$ whoami
vagrant
  2. Check that this machine has network access (press Ctrl-C to stop ping):
vagrant@xdata:~$ ping google.com
PING google.com (172.217.13.238) 56(84) bytes of data.
64 bytes from iad23s61-in-f14.1e100.net (172.217.13.238): icmp_req=1 ttl=63 time=48.5 ms
64 bytes from iad23s61-in-f14.1e100.net (172.217.13.238): icmp_req=2 ttl=63 time=88.3 ms

If that doesn't work, stop: there is something wrong with your VirtualBox setup. On a Mac, check System Settings ➛ Security & Privacy. You may have to allow VirtualBox to control your machine, and on newer versions of macOS you will have to reboot (augh) because Apple no longer lets kernel changes like that take effect on a running system.

Install Project Components

TODO: Replace this by just sending a newer .box!

As of 0.2, the Sotera components are (mostly) included. (For older versions, see Old-Install-Components.) However:

  • It seems to lack the Python interface for Impala.
  • You can't pip install python-impala, because pip seems to be missing.
  • You can't apt-get install pip, because the apt config is years out of date.

So:

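  1. Fix the apt-get package manager. The config files are now out of date, so point the Ubuntu sources at old-releases.ubuntu.com, pin the Cloudera sources to specific releases, and disable the Docker source: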
$ sudo sed -i -e \
    's/archive.ubuntu.com\|security.ubuntu.com/old-releases.ubuntu.com/g' \
    /etc/apt/sources.list

$ ls /etc/apt/sources.list.d
cloudera-impala.list  cloudera.list  docker.list  java.list   r.list

$ sudo sed -i -e 's/impala1/impala1.4.0/g' /etc/apt/sources.list.d/cloudera-impala.list 
$ sudo sed -i -e 's/cdh4/cdh4.7.0/g' /etc/apt/sources.list.d/cloudera.list
$ sudo mv /etc/apt/sources.list.d/docker.list /etc/apt/sources.list.d/__docker.list__

$ sudo apt-get update
  2. Get pip and install the Python interface to Impala. Note that you need to force a pip upgrade because PyPI no longer accepts non-SSL connections, and that in turn requires a release upgrade. Sigh.
$ sudo do-release-upgrade
<restart>
$ sudo apt-get install python-pip --upgrade
$ pip install impyla==0.7
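
Because sed -i rewrites /etc/apt/sources.list in place, it can help to preview the Ubuntu mirror substitution from step 1 first. The sketch below applies the same replacement to stdin and prints the result without touching any file. The function name rewrite_sources is hypothetical, and the sample lines are illustrative, not copied from the VM.

```shell
#!/bin/sh
# Preview the mirror substitution on stdin; nothing is written to disk.
# Two -e expressions replace the GNU-only \| alternation, and the dots are
# escaped so they match literally.
rewrite_sources() {
  sed -e 's/archive\.ubuntu\.com/old-releases.ubuntu.com/g' \
      -e 's/security\.ubuntu\.com/old-releases.ubuntu.com/g'
}

# Illustrative input only; the real file may look different.
rewrite_sources <<'EOF'
deb http://archive.ubuntu.com/ubuntu precise main
deb http://security.ubuntu.com/ubuntu precise-security main
deb http://us.archive.ubuntu.com/ubuntu precise universe
EOF
```

On the VM, `rewrite_sources < /etc/apt/sources.list | diff /etc/apt/sources.list -` would show exactly which lines change. Note what happens to the third sample line: a country-prefixed mirror keeps its prefix (us.old-releases.ubuntu.com), so check the rewritten hostnames before running apt-get update.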

Additional Configurations

Start your virtual machine.

    $ vagrant up

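SSH into the VM as bigdata/bigdata, then edit the following configuration file and add the properties below inside the <configuration> element. These settings cap the heap of each task JVM at 1 GB and limit the TaskTracker to three concurrent map tasks and three concurrent reduce tasks, which should protect your single-VM setup from memory and processing issues that may crop up in later steps.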

    $ sudo vi /etc/hadoop/conf/mapred-site.xml

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>3</value>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>
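
After saving the file, a quick text search can confirm that the property names actually landed in it. This is a minimal sketch, not a Hadoop tool. The helper name check_props is hypothetical, and it only looks for a literal `<name>…</name>` line, so it will not notice malformed XML or a property that sits inside a comment.

```shell
#!/bin/sh
# Report whether each named property appears in a Hadoop-style XML file.
# check_props is an illustrative helper; usage: check_props FILE NAME...
check_props() (
  file="$1"; shift
  status=0
  for name in "$@"; do
    if grep -qF "<name>${name}</name>" "$file"; then
      echo "ok       ${name}"
    else
      echo "missing  ${name}"
      status=1
    fi
  done
  exit "$status"
)

# Demo against a throwaway file, so nothing under /etc is read or changed.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
EOF
check_props "$sample" mapred.child.java.opts mapred.tasktracker.map.tasks.maximum || true
rm -f "$sample"
```

On the VM, point it at /etc/hadoop/conf/mapred-site.xml and pass each property name exactly as you entered it. A non-zero exit status means at least one name was not found.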

Stop your virtual machine.

$ vagrant halt

Testing the System

Start your virtual machine.

$ vagrant up

SSH into the VM as bigdata/bigdata, then run the following commands to ensure the system is configured correctly:

    $ hadoop fs -ls /
    $ hive -e "show tables"
    $ python
        >>> import impala
        >>> client = impala.ImpalaBeeswaxClient('localhost:21000')
        >>> client.connect()
        >>> print client.execute("show tables")
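
If you run these checks often, a small wrapper can turn them into a pass/fail list. This is a sketch under the assumption that a zero exit status means the component works. The helper name run_check is hypothetical, and the commented usage lines simply reuse the commands above.

```shell
#!/bin/sh
# Run one command quietly and print PASS or FAIL with a label.
# run_check is an illustrative helper; usage: run_check LABEL COMMAND [ARGS...]
run_check() (
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  ${label}"
  else
    echo "FAIL  ${label}"
    exit 1
  fi
)

# Harmless demo that works anywhere:
run_check "shell sanity" true

# Inside the VM, the checks from this section would look like:
#   run_check "HDFS listing"  hadoop fs -ls /
#   run_check "Hive tables"   hive -e "show tables"
#   run_check "impala module" python -c "import impala"
```

The wrapper hides command output, so rerun a failing command by hand to see the actual error. It only tests that the impala module imports; the Impala query itself is still best tried in the interactive session.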

Stop your virtual machine.

$ vagrant halt

Additional Resources
