Superset Implementation - crocusllc/Ed-Fi-Educator-Workforce-Analytics-Tool GitHub Wiki
Apache Superset is an open source Business Intelligence tool that allows institutions capable of self hosting to implement the Educator Workforce Dashboards with no licensing costs. This page provides an overview of how to implement Educator Workforce for Superset.
This document details the architecture and deployment steps for instantiating an Apache Superset instance on an Amazon EC2 (Linux) host using Docker. The primary architectural requirement is to co-locate this instance within the same AWS VPC as an existing Windows VM running an MSSQL server (hosting an Ed-Fi ODS).
Connectivity will be established exclusively over the private network, using the MSSQL server's Private IP address. This design ensures low latency and a secure posture, as database traffic does not traverse the public internet.
A key technical requirement is the use of the mssql+pymssql:// SQLAlchemy connection string. This mandates customizing the standard Superset Docker deployment to include the pymssql Python library, which in turn relies on FreeTDS system libraries.
Before proceeding, ensure the following prerequisites are met and all information is available:
-
AWS Environment:
- An existing AWS VPC with at least one public and one private subnet (or two private subnets if a bastion host is used for access).
- An existing EC2 Key Pair for SSH access.
-
MSSQL Server (Windows VM):
- The instance is running and accessible within the VPC.
- The Private IP Address of the Windows VM (e.g.,
10.10.20.50). - The Database Name to be connected (e.g.,
EdFi_Ods_2024). - An SQL Server Authentication user (username and password) with appropriate
readpermissions on the database. Windows Authentication is not viable from the Linux-based Superset container.
-
MSSQL Configuration:
- TCP/IP Protocol: Enabled in SQL Server Configuration Manager.
- Listening Port: Configured to listen on a static port (default: TCP 1433).
-
Windows Firewall: An inbound rule must exist on the Windows VM allowing TCP traffic on port
1433from the Private IP or Subnet CIDR of the new Superset EC2 instance.
The core of this architecture relies on AWS Security Groups (SGs) to manage traffic flow within the VPC.
-
SG-MSSQL(Attached to Windows VM):- This existing security group must be modified.
-
Inbound Rule:
-
Type:
MSSQL(or Custom TCP) -
Protocol:
TCP -
Port Range:
1433 -
Source:
sg-superset-id(The ID of theSG-SUPERSETgroup created below). Using an SG-to-SG reference is more secure and dynamic than hard-coding IPs.
-
Type:
-
SG-SUPERSET(To be created and attached to Superset EC2):-
Inbound Rule (Management):
-
Type:
SSH -
Protocol:
TCP -
Port Range:
22 - Source: Your corporate IP range or a bastion host's IP/SG.
-
Type:
-
Inbound Rule (Application):
-
Type:
Custom TCP -
Protocol:
TCP -
Port Range:
8088(Default Superset web port) - Source: Your corporate IP range (for UI access).
-
Type:
-
Outbound Rules:
- Default
(0.0.0.0/0)is acceptable, but for a stricter policy, allow outboundTCP 1433to theSG-MSSQLsecurity group andTCP 80/443for pulling Docker images and packages.
- Default
-
Inbound Rule (Management):
- Navigate to the EC2 Console and launch a new instance.
- AMI: Select Amazon Linux 2 (or a recent Ubuntu Server).
-
Instance Type:
t3.mediumort3.largeis recommended. Superset (especially thesuperset-worker) is memory-intensive. At3.smallwill likely be insufficient. -
Network Settings:
- VPC: Select the correct VPC (the one containing the MSSQL VM).
- Subnet: Select a subnet. A public subnet is simpler for setup as it allows direct SSH and package downloads.
-
Auto-assign Public IP:
Enable.
-
Security Group: Attach the
SG-SUPERSETsecurity group created in the previous section. -
Key Pair: Select your pre-existing
.pemkey. - Launch the instance.
- SSH into the newly created EC2 instance using its Public IP.
Bash
ssh -i /path/to/your-key.pem ec2-user@<public-ip-address>- Install Docker and Docker Compose.
Bash
# Update packages and install Docker
sudo yum update -y
sudo amazon-linux-extras install docker -y
sudo service docker start
# Add ec2-user to the docker group to avoid using sudo
sudo usermod -a -G docker ec2-user
# Install Docker Compose V2 (latest)
sudo yum install -y git
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose-
Log out and log back in to apply the new group permissions.
Bash
exit
ssh -i /path/to/your-key.pem ec2-user@<public-ip-address>- Verify installation:
b
docker --version
docker compose version- Clone the Apache Superset repository.
git clone https://github.com/apache/superset.git
cd superset-
Crucial: Add
pymssqlDependency. The official Superset Docker build process automatically installs Python packages listed indocker/requirements-local.txt. The base Superset images (based on Debian) already include thefreetds-devsystem dependency, so we only need to specify the Python package.
Bash
echo "pymssql" > ./docker/requirements-local.txt-
Launch Superset (Non-Development Mode). We will use the
docker-compose-non-dev.ymlfile, which is optimized for a more stable, production-like deployment.
Bash
# Pull the images defined in the compose file
docker compose -f docker-compose-non-dev.yml pull
# Build and start all services in detached mode
docker compose -f docker-compose-non-dev.yml up -d-
Initialize the Superset Instance. These commands must be run after the containers are up.
Bash
# Create an admin user (follow the prompts)
docker compose -f docker-compose-non-dev.yml exec superset superset fab create-admin
# Run database migrations
docker compose -f docker-compose-non-dev.yml exec superset superset db upgrade
# Initialize Superset roles and permissions
docker compose -f docker-compose-non-dev.yml exec superset superset init- Access the Superset UI in your browser:
http://<ec2-public-ip>:8088 - Log in with the
admincredentials you created in the previous step. - Navigate to Data -> Databases.
- Click the + Database button in the top right.
- In the "Connect a database" window:
-
Database Name: A user-friendly name (e.g.,
Ed-Fi ODS Production). - SQLAlchemy URI: This is the most critical component. Use the following format, substituting your MSSQL server's Private IP and credentials.
-
Database Name: A user-friendly name (e.g.,
-
Format:
mssql+pymssql://<USERNAME>:<PASSWORD>@<WINDOWS_VM_PRIVATE_IP>:<PORT>/<DATABASE_NAME>
Example:mssql+pymssql://edfi_reader:StrongP@[email protected]:1433/EdFi_Ods_2024 - Navigate to the "Other" tab.
- In the "Engine Parameters" JSON block, it is highly recommended to specify the TDS version to avoid connection issues with modern MSSQL servers.
JSON
{
"tds_version": "7.4"
}- Click the "Test Connection" button. You should receive an "OK" confirmation.
- Click "Connect" to save the database.
If the "Test Connection" fails, follow this diagnostic procedure from the Superset EC2 instance's terminal:
-
Check Docker Container Logs:
docker compose -f docker-compose-non-dev.yml logs superset- Look for
pymssqlerrors, "Login failed," or "timeout" messages.
-
Verify
pymssqlInstallation:docker compose -f docker-compose-non-dev.yml exec superset pip list | grep pymssql- This should return the installed
pymssqlpackage and version.
-
Perform Network Connectivity Test (Telnet):
- This is the definitive test for network/firewall issues.
sudo yum install -y telnettelnet <WINDOWS_VM_PRIVATE_IP> 1433-
Success: You will see a
Connected to...message or a blank, blinking cursor. This means the network path is open. The problem is likely withpymssql, auth, or TDS version. -
Failure (
Connection timed out): This is a network-level block.- Check
SG-MSSQL: Ensure it allows TCP 1433 fromSG-SUPERSET. - Check Windows VM Firewall: This is the most common culprit. Ensure the internal Windows Defender Firewall has an inbound rule for TCP 1433.
- Check
-
Failure (
Connection refused): The network path is open, but the MSSQL service is not running or not listening on port1433. Check SQL Server Configuration Manager on the Windows VM.