Machine Learning
Posted on March 7, 2021

Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. ML algorithms use historical data as input to predict new output values. Put another way, it is a modern software development technique that enables computers to solve problems by learning from examples of real-world data.
Major Steps in the Machine Learning Process
Step 1: Define the problem
Step 2: Build the dataset
Step 3: Train the model
Step 4: Evaluate the model
Step 5: Use the model
These steps are iterative: in practice you cycle back through earlier steps as you learn more about the problem (see the sketch below).
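As a rough illustration, here is a minimal sketch of steps 2-5 using scikit-learn and one of its toy datasets (the dataset and model choice are assumptions for illustration, not prescriptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Step 2: build the dataset, holding out a test split for later evaluation
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: train the model
model = LinearRegression().fit(X_train, y_train)

# Step 4: evaluate the model on data it never saw during training
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Step 5: use the model for inference on new inputs
print("Prediction:", model.predict(X_test[:1]))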
There are four types of machine learning algorithms:
supervised
semi-supervised
unsupervised
reinforcement
Nearly all tasks solved with machine learning involve three primary components:
A machine learning model: an extremely generic program (or block of code), made specific by the data used to train it.
A model training algorithm: an iterative process that fits a generic model to specific data.
A model inference algorithm: a process used to generate predictions from the trained model.
Hyperparameters are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.
A loss function is used to codify the model's distance from its goal; training tries to minimize this value.
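For instance, a mean squared error loss can be written in a few lines (a generic NumPy sketch, not tied to any particular library API):

import numpy as np

def mse_loss(y_true, y_pred):
    # average squared distance between predictions and actual values;
    # smaller values mean the model is closer to its goal
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse_loss([3.0, 5.0], [2.5, 5.5]))  # 0.25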
Training dataset: The data on which the model will be trained. Most of your data will be here.
Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.
Model parameters are settings or configurations the training algorithm can update to change how the model behaves.
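To make the distinction between hyperparameters and model parameters concrete, here is a small scikit-learn sketch (the data points are invented): n_clusters is a hyperparameter fixed before training, while the learned cluster centers are parameters the training algorithm updates.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])

# n_clusters is a hyperparameter: chosen before training, never updated by it
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# cluster_centers_ are model parameters: learned from the data during training
print(model.cluster_centers_)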
Log loss enables you to measure how strongly the model believes that its prediction is accurate; in other words, it quantifies how uncertain your model is about the predictions it is generating.
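As an assumed illustration with scikit-learn's log_loss, a confident wrong prediction is penalized far more heavily than a confident correct one:

from sklearn.metrics import log_loss

y_true = [1, 0, 1]

# confident and correct on every example: low log loss
print(log_loss(y_true, [0.9, 0.1, 0.95]))

# confident but wrong on the last example: much higher log loss
print(log_loss(y_true, [0.9, 0.1, 0.05]))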
Examples of Deep learning models
FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.
CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.
RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.
Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.
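As a hedged sketch of what such architectures look like in code, here is a minimal Keras CNN for small grayscale images (the input shape, layer sizes, and class count are arbitrary choices for illustration):

from tensorflow import keras
from tensorflow.keras import layers

# a minimal CNN: learned filters slide over grid-organized (image) data
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),          # input layer: 28x28 grayscale images
    layers.Conv2D(32, 3, activation="relu"),  # convolutional filters over the grid
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # hidden, fully connected (FFNN-style) layer
    layers.Dense(10, activation="softmax"),   # output layer: 10 class probabilities
])
model.summary()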
Tools that can be used to evaluate a linear regression model (a short example follows this list):
Mean absolute error (MAE): measured by taking the average of the absolute difference between the actual values and the predictions. Ideally, this difference is minimal.
Root mean square error (RMSE): similar to MAE, but takes a slightly modified approach so that values with large error receive a higher penalty. RMSE takes the square root of the average squared difference between the prediction and the actual value.
Coefficient of determination or R-squared (R²): measures how well observed outcomes are actually predicted by the model, based on the proportion of total variation of outcomes explained by the model.
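All three metrics are available in scikit-learn (an assumed tooling choice; the numbers below are invented):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.4, 6.9, 9.5]

print("MAE: ", mean_absolute_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)  # square root of the mean squared error
print("R^2: ", r2_score(y_true, y_pred))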
Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.
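For example, text can be vectorized with scikit-learn's CountVectorizer (an assumed choice; get_feature_names_out assumes scikit-learn >= 1.0). Note the optional stop-word removal, which relates to the stop words entry below:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# convert text into a numeric bag-of-words matrix; common English
# stop words ("the", "on", ...) are dropped while building the vocabulary
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())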
Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score approaching 1 indicates successful identification of discrete non-overlapping clusters.
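A sketch of computing this score for a clustering result with scikit-learn (the data points are invented so the two blobs are well separated):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# close to 1 for discrete, non-overlapping clusters
print(silhouette_score(X, labels))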
Stop words: A list of common words (such as "the" or "and") that natural language processing tools remove when building your dataset.
Action: For every state, an agent needs to take an action toward achieving its goal.
Agent: The piece of software you are training is called an agent. It makes decisions in an environment to reach a goal.
Discriminator: A neural network trained to differentiate between real and synthetic data.
Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.
Edit event: When a note is either added or removed from your input track during inference.
Environment: The environment is the surrounding area within which the agent interacts.
Exploration versus exploitation: An agent should exploit known information from previous experiences to achieve higher cumulative rewards, but it also needs to explore to gain new experiences that can be used in choosing the best actions in the future.
Generator: A neural network that learns to create new data resembling the source data on which it was trained.
Generator loss: Measures how far the output data deviates from the real data present in the training dataset.
Hidden layer: A layer that occurs between the output and input layers. Hidden layers are tailored to a specific task.
Input layer: The first layer in a neural network. This layer receives all data that passes through the neural network.
Output layer: The last layer in a neural network. This layer is where the predictions are generated based on the information captured in the hidden layers.
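To make the generator and discriminator losses concrete, here is a minimal NumPy sketch (the discriminator scores are invented for illustration) computing both as binary cross-entropy:

import numpy as np

def bce(target, score, eps=1e-7):
    # binary cross-entropy between a target label and a predicted probability
    score = np.clip(score, eps, 1 - eps)
    return -np.mean(target * np.log(score) + (1 - target) * np.log(1 - score))

d_real = np.array([0.9, 0.8])  # discriminator scores on real data
d_fake = np.array([0.3, 0.1])  # discriminator scores on generated (fake) data

# discriminator loss: reward scoring real data as 1 and fake data as 0
d_loss = bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)

# generator loss: the generator wants its fakes to be scored as real (1)
g_loss = bce(np.ones_like(d_fake), d_fake)

print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")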
Machine Learning Development process
- Label Data
- Collect and prepare data
- Store features
- Check for bias
- Visualize in notebooks
- Pick algorithm
- Train models
- Tune parameters
- Deploy in production
- Manage and monitor
- Continuous delivery
Data roles
- Data Scientist: does research in ML/AI and advanced analytics
- ML Engineer: operationalizes and optimizes ML
- Data Engineer: does advanced programming and deals with distributed systems
Cloud DBs
- Amazon Redshift, Azure Synapse, Snowflake, Google BigQuery
- Mainly used for data warehousing and analytical processing. Easy to scale up and down
Row-based traditional DBs
- SQL Server, MySQL, PostgreSQL.
- Mainly used for transactional data, a source for the Data Warehouse or backends to applications.
NoSQL DBs
- Document, key-value, or graph DBs.
- MongoDB, Elasticsearch, Cassandra, Cosmos DB, DynamoDB (AWS)
ELT components
- Extract and load
- Batch processing: Fivetran, Stitch, Airbyte, Azure Data Factory, AWS Glue
- Streaming: Apache Kafka, AWS Kinesis
Transforming
- Data Build Tool (dbt): a SQL-based tool that lets one write transformations on top of the staged raw data and turn it into custom data models.
Reverse ETL
- Uses the core data warehouse (the single source of truth) and lets one sync data back into business applications, e.g. from the warehouse to Salesforce. Tools: Census, Hightouch, RudderStack.
Task Orchestration and scheduling
- Apache Airflow, Luigi and Jenkins
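As a sketch of what orchestration looks like in code, here is a minimal Apache Airflow DAG (the task names and schedule are invented; assumes the Airflow 2.x Python API):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def transform():
    print("build data models on top of the raw data")

# a daily pipeline: extract runs first, then transform
with DAG(
    dag_id="daily_elt",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform  # set task ordering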
Infrastructure management
Triggering containers, setting up environments and services.
- Terraform, Ansible
BI and analytics
- Reporting tools: Power BI, Tableau, Looker Studio, Metabase (open source)
ML Packages
Data Manipulation & Wrangling packages
- arrow: date & time manipulation & formatting
- beautifulsoup4: Beautiful Soup, for parsing HTML, JSON & XML data
- engarde: defensive data analysis
- jsonify: converts .csv files to .json
- numexpr: fast numerical array expression evaluator
- numpy: scientific computing library
- pandas: data structures & data analysis tools
- pandas_profiling: generates profile reports from a pandas DataFrame
- pandasql: queries pandas dataframes using SQL syntax
- prettytable: easily display tabular data as an ASCII table
- shapely: manipulation & analysis of geometric objects
- tabulate: pretty-print tabular data
Machine Learning & Statistics libraries
- cvxopt: convex optimization library
- emcee: an MIT MCMC library
- gensim: unsupervised semantic modeling from plain text
- hdbscan: Hierarchical Density-Based Spatial Clustering of Applications with Noise
- interpret: fit interpretable ML models and explain blackbox ML
- keras: high-level neural networks API
- lifelines: survival analysis in Python
- lifetimes: package for analyzing user behavior
- prophet: procedure for forecasting time series data
- pymc3: probabilistic programming & Bayesian modeling
- scikit-image: image processing library
- scikit-learn: tools for data mining & analysis
- scikits-bootstrap: bootstrap confidence interval algorithms for scipy
- scipy: mathematics, science & engineering
- statsmodels: estimate statistical models & perform statistical tests
- sympy: symbolic mathematics
- tensorflow: numerical computation using data flow graphs
- tensorflow-decision-forests: train, run and interpret decision forest models in TensorFlow
- xgboost: optimized distributed gradient boosting library
Visualization libraries
- folium: build Leaflet.js maps in Python
- gviz_api: helper library for Google Visualization API
- igraph: network analysis tools
- mapbox: client for Mapbox web services
- matplotlib: 2D plotting library
- patsy: describe statistical models & build design matrices
- plotly: create interactive graphics
- pygal: create interactive SVG charts
- pygraphviz: interface for Graphviz graph layout & visualizations
- pyproj: cartographic transformations & geodetic computations
- seaborn: viz library to draw statistical graphics
- squarify: implementation of the squarify treemap layout algorithm
- wordcloud: wordcloud generator in Python
Everything Else libraries
- cufflinks: bind Plotly directly to pandas dataframes
- dask: flexible open-source Python library for parallel computing
- duckdb: in-process database management system focused on analytical query processing
- fiona: read & write geospatial data files
- geopandas: extends pandas to allow spatial operations on geometric types
- networkx: create, manipulate & study networks
- nltk: natural language toolkit
- pympler: measure, monitor and analyze the memory behavior of Python objects
- pysal: geospatial analysis library
- pyzipcode: query zip codes & location data
- requests: allows HTTP requests
- scrapy: scraping web pages
- six: Python 2 & 3 compatibility library
- spacy: advanced natural language processing
- textblob: simple API for common NLP tasks
- ua_parser: fast & reliable user agent parser
- urllib3: HTTP client for Python
Snippet to detect AI-generated text
import re
import sys

def detect_text_origin(text):
    # naive heuristic: flag words stereotypically overused in AI-generated
    # text; real detection is far harder than a two-word regex
    if re.search(r'\b(embark|delve)\b', text):
        return "AI-generated text was detected."
    else:
        return "Human-written."

if __name__ == "__main__":
    # usage: pass the text to check as command-line arguments
    text = " ".join(sys.argv[1:])
    print(detect_text_origin(text))