Machine Learning
Posted on March 7, 2021

Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. ML algorithms use historical data as input to predict new output values. Put another way, it is a modern software development technique that enables computers to solve problems by learning from examples of real-world data.
Major Steps in the Machine Learning Process
Step 1: Define the problem
Step 2: Build the dataset
Step 3: Train the model
Step 4: Evaluate the model
Step 5: Use the model
These steps are iterative: in practice you cycle back through earlier steps as you learn more about the problem (see the sketch below).
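As a rough illustration, here is a minimal sketch of steps 2-5 using scikit-learn and one of its toy datasets (the dataset and model choice are assumptions for illustration, not prescriptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Step 2: build the dataset, holding out a test split for later evaluation
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: train the model
model = LinearRegression().fit(X_train, y_train)

# Step 4: evaluate the model on data it never saw during training
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Step 5: use the model for inference on new inputs
print("Prediction:", model.predict(X_test[:1]))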
There are four types of machine learning algorithms:
supervised
semi-supervised
unsupervised
reinforcement
Nearly all tasks solved with machine learning involve three primary components:
A machine learning model: an extremely generic program (or block of code), made specific by the data used to train it.
A model training algorithm: an iterative process that fits a generic model to specific data.
A model inference algorithm: a process used to generate predictions from the trained model.
Hyperparameters are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.
A loss function is used to codify the model's distance from its goal; training tries to minimize this value.
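For instance, a mean squared error loss can be written in a few lines (a generic NumPy sketch, not tied to any particular library API):

import numpy as np

def mse_loss(y_true, y_pred):
    # average squared distance between predictions and actual values;
    # smaller values mean the model is closer to its goal
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse_loss([3.0, 5.0], [2.5, 5.5]))  # 0.25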
Training dataset: The data on which the model will be trained. Most of your data will be here.
Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.
Model parameters are settings or configurations the training algorithm can update to change how the model behaves.
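To make the distinction between hyperparameters and model parameters concrete, here is a small scikit-learn sketch (the data points are invented): n_clusters is a hyperparameter fixed before training, while the learned cluster centers are parameters the training algorithm updates.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])

# n_clusters is a hyperparameter: chosen before training, never updated by it
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# cluster_centers_ are model parameters: learned from the data during training
print(model.cluster_centers_)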
Log loss enables you to measure how strongly the model believes that its prediction is accurate; in other words, it quantifies how uncertain your model is about the predictions it is generating.
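As an assumed illustration with scikit-learn's log_loss, a confident wrong prediction is penalized far more heavily than a confident correct one:

from sklearn.metrics import log_loss

y_true = [1, 0, 1]

# confident and correct on every example: low log loss
print(log_loss(y_true, [0.9, 0.1, 0.95]))

# confident but wrong on the last example: much higher log loss
print(log_loss(y_true, [0.9, 0.1, 0.05]))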
Examples of Deep learning models
FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.
CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.
RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.
Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.
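As a hedged sketch of what such architectures look like in code, here is a minimal Keras CNN for small grayscale images (the input shape, layer sizes, and class count are arbitrary choices for illustration):

from tensorflow import keras
from tensorflow.keras import layers

# a minimal CNN: learned filters slide over grid-organized (image) data
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),          # input layer: 28x28 grayscale images
    layers.Conv2D(32, 3, activation="relu"),  # convolutional filters over the grid
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # hidden, fully connected (FFNN-style) layer
    layers.Dense(10, activation="softmax"),   # output layer: 10 class probabilities
])
model.summary()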
Tools that can be used to evaluate a linear regression model (a short example follows this list):
Mean absolute error (MAE): measured by taking the average of the absolute difference between the actual values and the predictions. Ideally, this difference is minimal.
Root mean square error (RMSE): similar to MAE, but takes a slightly modified approach so that values with large error receive a higher penalty. RMSE takes the square root of the average squared difference between the prediction and the actual value.
Coefficient of determination or R-squared (R²): measures how well observed outcomes are actually predicted by the model, based on the proportion of total variation of outcomes explained by the model.
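All three metrics are available in scikit-learn (an assumed tooling choice; the numbers below are invented):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.4, 6.9, 9.5]

print("MAE: ", mean_absolute_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)  # square root of the mean squared error
print("R^2: ", r2_score(y_true, y_pred))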
Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.
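For example, text can be vectorized with scikit-learn's CountVectorizer (an assumed choice; get_feature_names_out assumes scikit-learn >= 1.0). Note the optional stop-word removal, which relates to the stop words entry below:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# convert text into a numeric bag-of-words matrix; common English
# stop words ("the", "on", ...) are dropped while building the vocabulary
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())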
Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score approaching 1 indicates successful identification of discrete non-overlapping clusters.
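A sketch of computing this score for a clustering result with scikit-learn (the data points are invented so the two blobs are well separated):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# close to 1 for discrete, non-overlapping clusters
print(silhouette_score(X, labels))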
Stop words: A list of common words (such as "the" or "and") that natural language processing tools remove when building your dataset.
Action: For every state, an agent needs to take an action toward achieving its goal.
Agent: The piece of software you are training is called an agent. It makes decisions in an environment to reach a goal.
Discriminator: A neural network trained to differentiate between real and synthetic data.
Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.
Edit event: When a note is either added or removed from your input track during inference.
Environment: The environment is the surrounding area within which the agent interacts.
Exploration versus exploitation: An agent should exploit known information from previous experiences to achieve higher cumulative rewards, but it also needs to explore to gain new experiences that can be used in choosing the best actions in the future.
Generator: A neural network that learns to create new data resembling the source data on which it was trained.
Generator loss: Measures how far the output data deviates from the real data present in the training dataset.
Hidden layer: A layer that occurs between the output and input layers. Hidden layers are tailored to a specific task.
Input layer: The first layer in a neural network. This layer receives all data that passes through the neural network.
Output layer: The last layer in a neural network. This layer is where the predictions are generated based on the information captured in the hidden layers.
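To make the generator and discriminator losses concrete, here is a minimal NumPy sketch (the discriminator scores are invented for illustration) computing both as binary cross-entropy:

import numpy as np

def bce(target, score, eps=1e-7):
    # binary cross-entropy between a target label and a predicted probability
    score = np.clip(score, eps, 1 - eps)
    return -np.mean(target * np.log(score) + (1 - target) * np.log(1 - score))

d_real = np.array([0.9, 0.8])  # discriminator scores on real data
d_fake = np.array([0.3, 0.1])  # discriminator scores on generated (fake) data

# discriminator loss: reward scoring real data as 1 and fake data as 0
d_loss = bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)

# generator loss: the generator wants its fakes to be scored as real (1)
g_loss = bce(np.ones_like(d_fake), d_fake)

print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")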
Machine Learning Development process
- Label Data
- Collect and prepare data
- Store features
- Check for bias
- Visualize in notebooks
- Pick algorithm
- Train models
- Tune parameters
- Deploy in production
- Manage and monitor
- Continuous delivery
Data roles
- Data Scientist: does research in ML/AI and advanced analytics
- ML Engineer: operationalizes and optimizes ML
- Data Engineer: does advanced programming and deals with distributed systems
Cloud DBs
- Amazon Redshift, Azure Synapse, Snowflake, Google BigQuery
- Mainly used for data warehousing and analytical processing. Easy to scale up and down
Row-based traditional DBs
- SQL Server, MySQL, PostgreSQL.
- Mainly used for transactional data, a source for the Data Warehouse or backends to applications.
NoSQL DBs
- Document, key-value, or graph DBs.
- MongoDB, Elasticsearch, Cassandra, Cosmos DB, DynamoDB (AWS)
ELT components
- Extract and load
- Batch processing: Fivetran, Stitch, Airbyte, Azure Data Factory, AWS Glue
- Streaming: Apache Kafka, AWS Kinesis
Transforming
- Data Build Tool (dbt): a SQL-based tool that lets one write transformations on top of the staged raw data and turn it into custom data models.
Reverse ETL
- Uses the core data warehouse (the single source of truth) and lets one sync data back into business applications, e.g. from the warehouse to Salesforce. Tools: Census, Hightouch, RudderStack.
Task Orchestration and scheduling
- Apache Airflow, Luigi and Jenkins
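As a sketch of what orchestration looks like in code, here is a minimal Apache Airflow DAG (the task names and schedule are invented; assumes the Airflow 2.x Python API):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def transform():
    print("build data models on top of the raw data")

# a daily pipeline: extract runs first, then transform
with DAG(
    dag_id="daily_elt",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform  # set task ordering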
Infrastructure management
Triggering containers, setting up environments and services.
- Terraform, Ansible
BI and analytics
- Reporting tools: Power BI, Tableau, Looker Studio, Metabase (open source)
ML Packages
Data Manipulation & Wrangling packages
- arrow: date & time manipulation & formatting
- beautifulsoup4: Beautiful Soup, for parsing HTML, JSON & XML data
- engarde: defensive data analysis
- jsonify: converts .csv files to .json
- numexpr: fast numerical array expression evaluator
- numpy: scientific computing library
- pandas: data structures & data analysis tools
- pandas_profiling: generates profile reports from a pandas DataFrame
- pandasql: queries pandas dataframes using SQL syntax
- prettytable: easily display tabular data as an ASCII table
- shapely: manipulation & analysis of geometric objects
- tabulate: pretty-print tabular data
Machine Learning & Statistics libraries
- cvxopt: convex optimization library
- emcee: an MIT MCMC library
- gensim: unsupervised semantic modeling from plain text
- hdbscan: Hierarchical Density-Based Spatial Clustering of Applications with Noise
- interpret: fit interpretable ML models and explain blackbox ML
- keras: high-level neural networks API
- lifelines: survival analysis in Python
- lifetimes: package for analyzing user behavior
- prophet: procedure for forecasting time series data
- pymc3: probabilistic programming & Bayesian modeling
- scikit-image: image processing library
- scikit-learn: tools for data mining & analysis
- scikits-bootstrap: bootstrap confidence interval algorithms for scipy
- scipy: mathematics, science & engineering
- statsmodels: estimate statistical models & perform statistical tests
- sympy: symbolic mathematics
- tensorflow: numerical computation using data flow graphs
- tensorflow-decision-forests: train, run and interpret decision forest models in TensorFlow
- xgboost: optimized distributed gradient boosting library
Visualization libraries
- folium: build Leaflet.js maps in Python
- gviz_api: helper library for Google Visualization API
- igraph: network analysis tools
- mapbox: client for Mapbox web services
- matplotlib: 2D plotting library
- patsy: describe statistical models & build design matrices
- plotly: create interactive graphics
- pygal: create interactive SVG charts
- pygraphviz: interface for Graphviz graph layout & visualizations
- pyproj: cartographic transformations & geodetic computations
- seaborn: viz library to draw statistical graphics
- squarify: implementation of the squarify treemap layout algorithm
- wordcloud: wordcloud generator in Python
Everything Else libraries
- cufflinks: bind Plotly directly to pandas dataframes
- dask: flexible open-source Python library for parallel computing
- duckdb: in-process database management system focused on analytical query processing
- fiona: read & write geospatial data files
- geopandas: extends pandas to allow spatial operations on geometric types
- networkx: create, manipulate & study networks
- nltk: natural language toolkit
- pympler: measure, monitor and analyze the memory behavior of Python objects
- pysal: geospatial analysis library
- pyzipcode: query zip codes & location data
- requests: allows HTTP requests
- scrapy: scraping web pages
- six: Python 2 & 3 compatibility library
- spacy: advanced natural language processing
- textblob: simple API for common NLP tasks
- ua_parser: fast & reliable user agent parser
- urllib3: HTTP client for Python
Snippet to detect AI-generated text
import re
import sys

def detect_text_origin(text):
    # naive heuristic: flag words stereotypically overused in AI-generated
    # text; real detection is far harder than a two-word regex
    if re.search(r'\b(embark|delve)\b', text):
        return "AI-generated text was detected."
    else:
        return "Human-written."

if __name__ == "__main__":
    # usage: pass the text to check as command-line arguments
    text = " ".join(sys.argv[1:])
    print(detect_text_origin(text))