Note Space

Hints, cheat sheets and notes on code.

Machine Learning

Posted on March 7, 2021
machine-learning

Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.

Put another way, ML is a modern software development technique that enables computers to solve problems by using examples of real-world data.

Major Steps in the Machine Learning Process

  • Step 1: Define the problem

  • Step 2: Build the dataset

  • Step 3: Train the model

  • Step 4: Evaluate the model

  • Step 5: Use the model

These steps are iterative.

There are four types of machine learning algorithms:

  • supervised

  • semi-supervised

  • unsupervised

  • reinforcement

Nearly all tasks solved with machine learning involve three primary components:

  • A machine learning model: an extremely generic program (or block of code), made specific by the data used to train it.

  • A model training algorithm: an iterative process that fits a generic model to specific data.

  • A model inference algorithm: used to generate predictions using the trained model.

Hyperparameters are settings on the model that are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.
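
To make the three components concrete, here is a minimal sketch in plain Python. It assumes a one-variable linear model trained with gradient descent on made-up data. The function names `predict` and `train` are illustrative, not taken from any library, and `learning_rate` and `epochs` play the role of hyperparameters.

```python
# Model: a generic line y = w * x + b, made specific by the values of w and b.
def predict(x, w, b):
    """Inference: apply the trained parameters to a new input."""
    return w * x + b


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training: repeatedly nudge w and b to reduce the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(x, w, b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(x, w, b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)
print(w, b)
print(predict(5.0, w, b))
```

On this data the learned values should land close to w = 2 and b = 1. The learning rate and the number of epochs stay fixed while training runs, whereas `w` and `b` are updated on every step.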

A loss function is used to codify the model’s distance from its goal, for example the gap between its predictions and the actual values.

Training dataset: The data on which the model will be trained. Most of your data will be here.

Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.
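
A split into training and test data could be sketched like this. The helper name `train_test_split`, the 20% test fraction and the fixed seed are assumptions chosen for illustration, not a recommendation.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are withheld; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting keeps any ordering in the original data from leaking into one side of the split.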

Model parameters are settings or configurations the training algorithm can update to change how the model behaves.

Log loss enables you to measure how strongly the model believes that its prediction is accurate. It seeks to calculate how uncertain your model is about the predictions it is generating.
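
As a rough illustration, binary log loss can be computed by hand as below. The labels and probabilities are made up, and the `eps` clipping is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident = [0.9, 0.1, 0.9]  # correct and confident predictions
unsure = [0.6, 0.4, 0.6]     # correct but hesitant predictions

print(log_loss(labels, confident))
print(log_loss(labels, unsure))
```

Both sets of predictions fall on the correct side of 0.5, yet the hesitant set gets the higher loss, which is the uncertainty the metric is meant to capture.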

Examples of Deep learning models

  • FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.

  • CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.

  • RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.

  • Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.
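
The "for loop that collects state" description of an RNN can be sketched with a single recurrent unit. The weights `w_input` and `w_state` are hypothetical fixed values here; a real network would learn them during training.

```python
import math


def rnn_forward(sequence, w_input=0.5, w_state=0.8):
    """Run a one-unit recurrent cell over a sequence, returning every state."""
    state = 0.0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_input * x + w_state * state)
        states.append(state)
    return states


print(rnn_forward([1.0, 0.0, 0.0]))
print(rnn_forward([0.0, 0.0, 1.0]))
```

The two inputs contain the same values in a different order and end in different states, which is why this kind of model suits sequence data.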

Tools that can be used to evaluate a linear regression model:

  • Mean absolute error (MAE): This is measured by taking the average of the absolute difference between the actual values and the predictions. Ideally, this difference is minimal.

  • Root mean square error (RMSE): This is similar to MAE, but takes a slightly modified approach so values with large error receive a higher penalty. RMSE takes the square root of the average squared difference between the prediction and the actual value.

  • Coefficient of determination or R-squared (R^2): This measures how well observed outcomes are actually predicted by the model, based on the proportion of the total variation in outcomes that the model explains.
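
The three regression metrics above can be computed by hand as in this sketch. The `actual` and `predicted` values are made up, and the function names are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: squaring penalizes large errors more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """Share of the variation in the actual values explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]

print(mae(actual, predicted))        # 1.25
print(rmse(actual, predicted))       # about 1.658
print(r_squared(actual, predicted))  # about 0.45
```

The single error of 3 pulls RMSE above MAE, showing the heavier penalty on large errors.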

Other terms

  • Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.

  • Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score approaching 1 indicates successful identification of discrete non-overlapping clusters.

  • Stop words: A list of words removed by natural language processing tools when building your dataset

  • Action: For every state, an agent needs to take an action toward achieving its goal.

  • Agent: The piece of software you are training is called an agent. It makes decisions in an environment to reach a goal.

  • Discriminator: A neural network trained to differentiate between real and synthetic data.

  • Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.

  • Edit event: When a note is either added or removed from your input track during inference.

  • Environment: The environment is the surrounding area within which the agent interacts.

  • Exploration versus exploitation: An agent should exploit known information from previous experiences to achieve higher cumulative rewards, but it also needs to explore to gain new experiences that can be used in choosing the best actions in the future.
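
One common way to balance exploration and exploitation is an epsilon-greedy rule, sketched here on a made-up three-armed bandit. The reward probabilities, `epsilon`, step count and seed are all assumptions for illustration; the notes above do not prescribe a specific strategy.

```python
import random


def choose_action(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: pick any action
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Simulate an agent learning which action pays off most often."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)
    counts = [0] * len(reward_probs)
    for _ in range(steps):
        action = choose_action(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(estimates)
print(counts)
```

With `epsilon=0.0` the agent never explores and can get stuck on the first action that ever paid out. A small positive `epsilon` lets it keep sampling the alternatives.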

Generator: A neural network that learns to create new data resembling the source data on which it was trained.

Generator loss: Measures how far the output data deviates from the real data present in the training dataset.

Hidden layer: A layer that occurs between the output and input layers. Hidden layers are tailored to a specific task.

Input layer: The first layer in a neural network. This layer receives all data that passes through the neural network.

Output layer: The last layer in a neural network. This layer is where the predictions are generated based on the information captured in the hidden layers.
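
How data flows from the input layer through a hidden layer to the output layer can be sketched as follows. The weights are hypothetical hand-picked numbers (a trained network would learn them), and the ReLU and sigmoid activations are assumptions chosen for the example.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each neuron weights every input."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(features):
    """Input layer -> one hidden layer -> output layer."""
    # Hypothetical weights: two hidden neurons, one output neuron.
    hidden_w = [[0.5, -0.2], [0.3, 0.8]]
    hidden_b = [0.0, 0.1]
    output_w = [[1.0, -1.0]]
    output_b = [0.0]

    hidden = relu(dense(features, hidden_w, hidden_b))  # hidden layer
    output = dense(hidden, output_w, output_b)          # output layer
    return sigmoid(output[0])                           # prediction in (0, 1)


print(forward([1.0, 2.0]))
```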

Machine Learning Development process

  • Label Data
  • Collect and prepare data
  • Store features
  • Check for bias
  • Visualize in notebooks
  • Pick algorithm
  • Train models
  • Tune parameters
  • Deploy in production
  • Manage and monitor
  • Continuous delivery

Data roles

  • Data Scientist: researches ML/AI and advanced analytics
  • ML Engineer: operationalizes and optimizes ML
  • Data Engineer: does advanced programming and deals with distributed systems

Cloud DBs

  • Amazon Redshift, Azure Synapse, Snowflake, Google BigQuery
  • Mainly used for data warehousing and analytical processing. Easy to scale up and down

Row-based Traditional DBs

  • SQL Server, MySQL, PostgreSQL.
  • Mainly used for transactional data, a source for the Data Warehouse or backends to applications.

NoSQL DBs

  • Document, key-value or graph DBs.
  • MongoDB, Elasticsearch, Cassandra, Cosmos DB, DynamoDB (AWS)

ELT components

  • Extract and load
  1. Batch processing - Fivetran, Stitch, Airbyte, Azure Data Factory, AWS Glue
  2. Streaming - Apache Kafka, Amazon Kinesis
  • Transform: Data Build Tool (dbt) is a SQL-based tool that allows one to write transformations on top of the raw data landed in staging and turn it into custom data models.

  • Reverse ETL: uses the core data warehouse (the single source of truth) and allows one to sync its data to business applications, e.g. data from the warehouse to Salesforce. Tools: Census, Hightouch, RudderStack
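
The extract, load, then transform order can be sketched with Python's built-in `sqlite3` module. An in-memory SQLite database stands in for a real warehouse here, and the table names and order data are made up for the example. This is not how any of the tools above are actually invoked.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (made-up data).
raw_orders = [
    ("o1", "alice", "19.99"),
    ("o2", "bob", "5.00"),
    ("o3", "alice", "30.01"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Load: land the data unchanged in a staging table, amounts still as text.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: build a data model on top of the staged data using SQL.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(rows)
```

Because the raw table is loaded untouched, the transformation can be rewritten and rerun later without extracting the data again.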

Task Orchestration and scheduling

  • Apache Airflow, Luigi and Jenkins
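
At its core, an orchestrator runs tasks in an order that respects their dependencies. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9+). It is not the Airflow, Luigi or Jenkins API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
}


def run_pipeline(dependencies, actions):
    """Run every task after the tasks it depends on; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


log = []
# Placeholder actions that only record that the task ran.
actions = {name: (lambda name=name: log.append(name)) for name in dependencies}

order = run_pipeline(dependencies, actions)
print(order)
```

Real orchestrators add scheduling, retries and monitoring on top of this dependency ordering.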

Infrastructure management

Triggering containers, setting up environments and services.

  • Terraform, Ansible

BI and analytics

  • Reporting Tools: Power BI, Tableau, Looker Studio, Metabase (open source)

Note Space © 2022 — Published with Nextjs