A union of curiosity and data science

Knowledgebase and brain dump of a database engineer

Setup and Install Apache Airflow on an Ubuntu 18 GCP (Google Cloud) VM


 First we log into GCP. 

Next create a VM within "Compute Engine". 

I create a small VM named Airflow for this demo.  

I choose Ubuntu 18.04 LTS Minimal, then create the VM.

Connect to the VM using the browser SSH client.
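If you'd rather use a terminal than the browser client, the gcloud CLI opens the same session. The instance name and zone below are assumptions from this demo; substitute your own:

```shell
# SSH into the demo VM. "airflow" is the instance name used in this
# walkthrough and us-central1-a is an example zone -- adjust both.
gcloud compute ssh airflow --zone=us-central1-a
```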

sudo su
apt-get update
apt install python
apt-get install software-properties-common
apt-get install python-pip
export AIRFLOW_GPL_UNIDECODE=yes
pip install apache-airflow
pip uninstall marshmallow-sqlalchemy
pip install marshmallow-sqlalchemy==0.17.1
airflow initdb
airflow webserver -p 8080


The first thing I'll do when connected is elevate my user. 

Next I'll update the OS. 

Next Install Python. 

Next we'll install software-properties-common. This helps manage the repositories we install software from. 

Next, let's install pip.



We also want to export an environment variable for UNIDECODE to prevent errors. 

You can read more on this here : https://stackoverflow.com/questions/52203441/error-while-install-airflow-by-default-one-of-airflows-dependencies-installs-a
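The post doesn't name the variable, but per the linked thread this is the Airflow 1.10-era unidecode licensing issue. A minimal sketch, assuming you're fine with the GPL dependency (otherwise set SLUGIFY_USES_TEXT_UNIDECODE=yes instead):

```shell
# Opt in to the GPL unidecode dependency so "pip install apache-airflow"
# doesn't abort with the license-related install error.
export AIRFLOW_GPL_UNIDECODE=yes
```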

Now install Apache Airflow using pip.

As of October 2019, you'll get a marshmallow-sqlalchemy error if you attempt to initialize the default SQLite database.

To prevent this error, install an earlier version of marshmallow-sqlalchemy.

Initialize the database

Run the web server on port 8080

Open the GCP Firewall to allow traffic to the airflow server. 
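One way to open the port with the gcloud CLI. The rule name here is arbitrary, and 0.0.0.0/0 exposes the webserver to the whole internet, which is only reasonable for a short-lived demo:

```shell
# Allow inbound TCP 8080 (the Airflow webserver) on the default network.
# Restrict --source-ranges to your own IP for anything beyond a demo.
gcloud compute firewall-rules create allow-airflow-8080 \
    --network=default \
    --allow=tcp:8080 \
    --source-ranges=0.0.0.0/0
```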


At this point you may be wondering why there is a warning at the top of the page related to the scheduler. This is due to the "max_threads" setting in the Airflow config being greater than 1. With SQLite as the DB, this setting needs to be set to 1, and the scheduler needs to be started. 


Ok, I'm going to log back into the console and use the browser to SSH into my instance. 
Once I'm in, I'll switch users and open the Airflow config file. Once the config file is open, scroll down until you see "max_threads". If you're using SQLite, change this value to 1 and save the file.
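For reference, the relevant fragment of ~/airflow/airflow.cfg should end up looking like this (section and key names as in Airflow 1.10):

```ini
[scheduler]
# SQLite only supports a single connection, so the scheduler
# must run single-threaded.
max_threads = 1
```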

Now we can start the scheduler:

airflow scheduler





Airflow docs: https://airflow.apache.org/start.html







Note on GDPR - General Data Protection Regulation Summary

Personal Data = any combination of items that can uniquely identify an individual.

EU Directive - sets a goal for member states; each country chooses how to reach that goal in national law.

EU Regulation - a binding law that all EU states must follow as written. 

Data Controllers - Companies who are responsible for user data they capture (social networks, ecommerce sites, companies, etc...)

Controllers own the data, are responsible to users, and must implement technical measures and processes for managing the data. 

Data Processors - third-party companies who "process the data", like ESPs, marketing partners, and HR SaaS partners. They are responsible to the controller and must implement security measures for safeguarding the data they use. They're required to have written permission to pass any of the controller's data on to additional third parties. 

Lots of contracts. Controllers can inspect the premises of any processors. Contracts outline the respective roles and responsibilities: what the controller is responsible for and what the processor can do with that data. 


A new role - DPO, Data Protection Officer (permanent position). The DPO can't be the controller and must have no conflicts of interest.

It can be a side role of someone in the company. 

They basically inform data subjects of their rights and raise awareness, tell their company about GDPR, and keep a list of actions the company is taking to comply with the rule set. "Help the organisation be accountable to the governing body" = keep the supervisory authority informed of any information needed regarding the regulation. They also handle complaints and answer questions. 

GDPR has 99 articles - the DPO's role and tasks are set out in Articles 37-39.

Implement technical and organizational measures

Run a data mapping process: what data we have, why, and its nature, scope, and purpose. 

Implement a data protection policy.

Article 40 - Core responsibilities of a data controller. 

  1. Fair and transparent processing
  2. Legitimate interests
  3. Consider Rights


Article 28 - Security measures; sub-processors (can only be engaged with the controller's consent); ensure contracts with controllers; process only in-scope data.


Lawful bases for processing data under GDPR: 

  1. Consent
  2. Contractual Necessity
  3. Compliance with legal obligation
  4. Protect vital interests
  5. Legitimate interest
  6. Public Interest


 Documentation Activities: Data protection impact assessment (DPIA).

  1. Data collection life cycle - how is the data used? Could the data collected be used outside of its intended purpose in the future? 
  2. Map the flow of data and determine the appropriate safeguards. 
  3. Nature, category (digital, physical, database?), retention policy, location, and who's accountable. 


Technical Controls (Article 32)

  1. Anonymize and encrypt personal data (in transit and at rest). 
  2. CIA = Confidentiality, Integrity, Availability (plus resilience): see NIST, ISO/IEC 27002. 
  3. Critical security controls.
  4. Ability to restore critical data. 
  5. Regular testing (restores and penetration tests).


Breach Notification (Article 33)

  1. If data subjects' privacy is at risk, they need to be notified. 
  2. This could be data spilled, data theft, or a reasonable belief of breach. 
  3. The supervisory authority must be notified within 72 hours; affected data subjects must be notified without undue delay (Article 34). 
  4. Include the nature of the breach, the likely consequences, and the proposed mitigation.


EU data subjects' rights - Access: Articles 15-16

  1. Data subjects can access their data for free and within reason (not too many requests).
  2. They can request to fix inaccurate information via a request form. Erasure and correction need to be possible by the data controller, and the controller is responsible for pushing those requests to the processors. 
  3. Right to be forgotten. This is damn near impossible. 
  4. Article 8: children under 16 need parental consent for everything. 


Data Portability: 

  1. Transfer data from one controller to another with ease. 
  2. Storage on personal devices - the format given to the data subject (user) should fit on a personal device if requested. 
  3. Data transmission - subjects can request direct transmission of their data between data controllers (does not include inferred or derived data).


In general, processes need to be put in place to store, transmit, remove, restore, and transfer user data, and to notify users about specific data breaches.







Magento 2 API Product Get

Create a token in Magento Admin: System > Integrations > create an integration (or edit an existing integration) and get the token. 


// Authenticate to the Magento 2 REST API. Change the URL to match your site.
$headers = array("Authorization: Bearer <token value>");

$requestUrl = 'https://www.site.com/index.php/rest/V1/products/<your sku>';

$ch = curl_init($requestUrl);

curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$result = curl_exec($ch);

if (curl_errno($ch)) {
    print curl_error($ch);
} else {
    // Decode the JSON product payload into a PHP object.
    $product = json_decode($result);
    print_r($product);
}

curl_close($ch);