Notes
A personal collection of notes.
(no longer maintained)
LLMs
Prompt Engineering Tips
Thread (below is directly from the thread)
Tell the AI the persona you want
Give it lots of State (the facts section)
Give it a Policy (the openai policy)
Give it the Actions you want it to take (the Instructions)
Research
Natural Language Processing
humans evaluated summaries on: coherence (well-structured and well-organized), consistency (factual alignment), fluency (quality of syntax and structure of sentences), and relevance (selection of important content from the source)
metrics most correlated with (in order of strongest correlation) [metrics table] [models table]
coherence: CHRF, (-) repeated unigram, (-) repeated bi-gram, ROUGE-4
consistency: ROUGE-3, BertScore-r, Stats-densityˆ, METEOR, SummaQAˆ
fluency: METEOR, ROUGE-4, ROUGE-1, ROUGE-3
relevance: CHRF, (-) repeated bi-gram, METEOR, ROUGE-we-1, BertScore-f
ETHICS = new dataset to evaluate language model's morality (ethics + human values) (dataset)
scenarios of: "justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions"
train: fine-tune SOTA language models on ETHICS
eval: larger models tend to do better, but all perform below 50% on hard test set
it is difficult to align models to ethics -- models may "reward hack" to maximize a given reward function, leading to unforeseen and unintentional behavior
disagreement about ethics exists across cultures. not everyone's sense of morality is the same
test over 57 tasks over different domains (math, cs, law, morality) to provide insight into a text model's performance (tests + code)
large-scale transformers perform poorly on law and moral-scenario tasks
large language models memorize worst-case train data --> leak sensitive information to an adversary
contrary to popular belief, memorization does not require overfitting. large models memorize more data
only need black-box query access to language model (LM) to extract train data
how to prevent?
differentially private SGD (DP-SGD), but may not be practical due to increased training time
manually de-duplicate data and remove sensitive content before training, but sanitization is imperfect
Deep Reinforcement Learning
data augmentations cause instability in off-policy RL algorithms.
why? data augmentation is non-deterministic, so it produces high-variance Q-targets, and estimating Q-values from augmented data leads to over-regularization
method. SVEA: apply data augmentation only to the current state (not the successor state) and introduce a novel Q-objective that uses both augmented & unaugmented data
DrQ (data regularized Q) = data augmentation technique for model free RL
experiment on DeepMind Control Suite using SAC & DQN
only use random shifts as an augmentation and only apply them to images sampled from the replay buffer
DrQ algorithm
apply image transformation
to reduce variance in Q estimates, average Q targets over K image transformations and average Q function over M image transformations (K and M are hyperparameters)
K=1, M=1 recovers RAD
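A minimal PyTorch sketch of the K/M averaging above, assuming a discrete-action setup; `q_net` / `q_target_net` and the shift padding are placeholder names, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Pad with replicated borders, then crop back to the original size at a
    random offset -- the single augmentation DrQ applies to sampled images."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def drq_target(q_target_net, next_obs, reward, done, gamma=0.99, K=2):
    """Average the bootstrapped target over K independent augmentations."""
    targets = []
    for _ in range(K):
        q_next = q_target_net(random_shift(next_obs)).max(dim=1).values
        targets.append(reward + gamma * (1.0 - done) * q_next)
    return torch.stack(targets).mean(dim=0)

def drq_estimate(q_net, obs, action, M=2):
    """Average the online Q estimate over M independent augmentations.
    `action` is a (N,) tensor of integer action indices."""
    qs = [q_net(random_shift(obs)).gather(1, action.unsqueeze(1)).squeeze(1)
          for _ in range(M)]
    return torch.stack(qs).mean(dim=0)
```

With K=1 and M=1 this reduces to applying a single augmentation per sample, matching the RAD-style setup noted above.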
Reinforcement Learning with Augmented Data (RAD), a simple plug-and-play module that can enhance most RL algorithms.
Improves test-time generalization, outperforms prior state-of-the-art baselines, requires fewer experience rollouts leading to greater data efficiency
use soft actor-critic (off-policy) and proximal policy optimization (on-policy) as their base RL agents
Augmentations used
existing: crop, translate, window, grayscale, cutout, cutout-color, flip, rotate, random convolution, color jitter
random crop by itself was the most effective in improving the performance of the base policy by a large margin
new: random amplitude scaling, Gaussian noise
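A rough numpy sketch of random crop plus the two new state-based augmentations; the scaling range, noise level, and crop size are illustrative values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_amplitude_scaling(states, low=0.6, high=1.4):
    """Multiply each state vector by a random uniform scale (range illustrative)."""
    return states * rng.uniform(low, high, size=(states.shape[0], 1))

def gaussian_noise(states, sigma=0.1):
    """Add zero-mean Gaussian noise to state-based observations."""
    return states + rng.normal(0.0, sigma, size=states.shape)

def random_crop(imgs, out_size=84):
    """Randomly crop a batch of (N, C, H, W) images -- reported as the single
    most effective augmentation for pixel-based inputs."""
    n, c, h, w = imgs.shape
    out = np.empty((n, c, out_size, out_size), dtype=imgs.dtype)
    for i in range(n):
        top = rng.integers(0, h - out_size + 1)
        left = rng.integers(0, w - out_size + 1)
        out[i] = imgs[i, :, top:top + out_size, left:left + out_size]
    return out

states = rng.normal(size=(8, 24))           # toy proprioceptive states
imgs = rng.normal(size=(8, 3, 100, 100))    # toy image observations
print(random_amplitude_scaling(states).shape, random_crop(imgs).shape)
```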
off-policy algorithm for continuous control
actor aims to maximize expected reward and entropy - "succeed at the task while acting as randomly as possible"
trains & optimizes two Q-functions independently and takes min(Q1, Q2) to reduce positive bias
soft policy iteration converges to optimal policy → grounds to formulate soft actor-critic
increased sample efficiency, improved robustness and stability
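A minimal sketch of the clipped double-Q target with the entropy bonus described above; `policy.sample`, the target networks, and the temperature `alpha` are placeholder names, not a specific library's API.

```python
import torch

def sac_q_target(policy, q1_target, q2_target, next_obs, reward, done,
                 gamma=0.99, alpha=0.2):
    """Bellman target: reward + discounted (min of the two target Qs minus the
    entropy term); alpha is the entropy temperature."""
    with torch.no_grad():
        next_action, next_logp = policy.sample(next_obs)  # a' ~ pi(.|s')
        q_min = torch.min(q1_target(next_obs, next_action),
                          q2_target(next_obs, next_action))  # reduces positive bias
        return reward + gamma * (1.0 - done) * (q_min - alpha * next_logp)
```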
on-policy algorithm for learning a continuous or discrete control policy
similar to policy gradient, but a new objective function allows for multiple epochs of minibatch updates
new "surrogate" objective
run N actors in parallel to collect T timesteps of data
construct surrogate loss on the NT datapoints
optimize using minibatch SGD
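A minimal sketch of the clipped surrogate loss, assuming log-probabilities and advantages have already been computed from the N*T collected timesteps.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate: take the pessimistic (min) of the unclipped and
    clipped probability-ratio terms, then negate so SGD can minimize it."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# the surrogate is rebuilt on the N*T collected timesteps and optimized for
# several epochs of minibatch SGD before fresh data is collected
```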
DQN & its extensions
Deep Q-Networks (DQN) - use CNN to approximate action values for a given state, but suffers from overestimation bias due to a max in the loss function
Double DQN (DDQN) - separates the selection of the action from its evaluation in the target, resulting in a new loss and improved performance (see the target sketch after this list)
Prioritized replay - samples more frequently from transitions from which there is more to learn in the replay buffer
Dueling networks - neural network for computing value and advantage with shared convolutional encoder and merged by a special aggregator
Multi-step learning - use truncated n-step return from a given state to minimize a new loss function, resulting in faster training for suitable n
Distributional RL - approximates the distribution of returns instead of the expected return by modeling this distribution and updating it such that the distribution is close to the true distribution of returns
Noisy nets - adds a noisy linear layer whose noise the network can learn to ignore, but which allows for increased exploration
Rainbow integrates all of the above into a single agent. This agent experimentally has superior performance and faster learning
Independently removing DDQN and dueling networks did not cause a significant drop in median performance
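A minimal sketch contrasting the DQN and Double DQN targets referenced in the list above; `q_online` / `q_target` are placeholder networks, not any library's API.

```python
import torch

def dqn_target(q_target, next_obs, reward, done, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the best
    next action via the max, which causes overestimation bias."""
    q_next = q_target(next_obs).max(dim=1).values
    return reward + gamma * (1.0 - done) * q_next

def double_dqn_target(q_online, q_target, next_obs, reward, done, gamma=0.99):
    """Double DQN: the online network selects the action, the target network
    evaluates it, decoupling selection from evaluation."""
    best_action = q_online(next_obs).argmax(dim=1, keepdim=True)
    q_next = q_target(next_obs).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * q_next
```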
off-policy deep RL ensemble method that reweights Q-value estimates based on Q uncertainty estimates and chooses actions based on high confidence bounds
encourages efficient exploration and superior performance over Soft Actor-Critic and RAINBOW
Deep Neural Networks
useful diagram of existing data augmentation methods (link)
geometric transformations: axis flipping, color space (i.e. isolating a single RGB channel), cropping, translations, rotation, randomly injecting a Gaussian matrix of noise, photometric/lighting transformations (casting RGB colors onto the image), kernel filters (makes an image blurrier or sharper), mixing images (amalgamating multiple images together into one, not very interpretable), random erasing
also lists neural-based data augmentation methods
"there is no consensus about the best strategy for combining data warping and oversampling techniques."
random initializations of the same neural network explore different areas of function space
thus, one can utilize these properties to create deep ensembles which consist of the same model trained with random initializations
Agglomerative Contextual Decomposition (ACD) - given a prediction from a trained DNN, ACD produces a hierarchical agglomerative clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive.
provide importance scores for certain feature groups
ACD is robust under distribution shift
SVCCA captures the semantics of different classes, with similar classes having similar sensitivities, and vice versa.
SVCCA subspace demonstrates similarities and differences between learned representations
How?
Input: two sets of neurons (or layers of a network)
Step 1: SVD on input sets to create a subspace for each set that has enough directions to explain 99% of the variance
Step 2: Linearly transform new subspaces to be as aligned as possible and calculate correlation coefficients
Output: pair of aligned directions for subspaces and how well they correlate
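A small numpy sketch of the two SVCCA steps above (variance-preserving SVD, then CCA via orthonormal bases); the 99% threshold follows the description, and the toy data is synthetic.

```python
import numpy as np

def svcca(acts1, acts2, var_kept=0.99):
    """acts*: (num_datapoints, num_neurons) activation matrices for two layers."""
    def svd_reduce(acts):
        acts = acts - acts.mean(axis=0)
        u, s, _ = np.linalg.svd(acts, full_matrices=False)
        # keep enough directions to explain `var_kept` of the variance
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept) + 1
        return u[:, :k] * s[:k]

    x, y = svd_reduce(acts1), svd_reduce(acts2)

    # CCA: canonical correlations are the singular values of Qx^T Qy,
    # where Qx, Qy are orthonormal bases for the two reduced subspaces
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

# toy usage: two random "layers" observed on the same 500 inputs
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 64))
b = a @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(500, 32))
print(svcca(a, b)[:5])  # high correlations: the second layer is a noisy map of the first
```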
Implications
model compression
inexpensive glimpse into learning dynamics for each layer of network
freeze training: networks converge from lower layers to higher layers, so freeze lower layers during training. this helps generalization accuracy.
CCA is a dimensionality reduction technique
CCA is a method of correlating linear relationships between two multidimensional variables
CCA finds basis vectors for two sets of variables that maximizes the correlation between the projections of the variables onto these basis vectors
CCA finds a pair of linear transformations, w and z, for each of our variables, x and y, such that corr(xw, yz) is maximized.
CCA answers the question: how many dimensions (canonical variables) are necessary to understand the association between two sets of variables?
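A toy illustration of maximizing corr(xw, yz) using scikit-learn's CCA; the synthetic two-view data with a single shared latent direction is made up for the example.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# two "views" of the same 200 samples sharing one latent direction
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])
Y = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 2))])

cca = CCA(n_components=1)
Xc, Yc = cca.fit_transform(X, Y)              # projections Xw and Yz
print(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])  # close to 1: the shared direction is found
```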
Saliency methods are an increasingly popular class of tools designed to highlight relevant features in an input, typically, an image.
Findings
methods that are most similar to an edge detector, i.e., Guided Backprop and its variants, show minimal sensitivity to randomization tests
Gradients & GradCAM pass the sanity checks
Guided BackProp & Guided GradCAM are invariant to higher layer parameters
Proposed randomization tests for saliency maps
model parameter randomization test - compares the output of a saliency method on a trained model with the output of the saliency method on a randomly initialized untrained network of the same architecture. This exposes a method's dependency on model parameters.
data randomization test - compares a given saliency method applied to a model trained on a labeled data set with the method applied to the same model architecture but trained on a copy of the data set in which we randomly permuted all labels. This exposes a method's dependency on the relationship between images and their labels.
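A rough sketch of the model parameter randomization test using plain-gradient saliency; the tiny CNN is a stand-in (in practice you would load genuinely trained weights), and rank correlation is just one reasonable similarity measure.

```python
import torch
import torch.nn as nn
from scipy.stats import spearmanr

def gradient_saliency(model, image):
    """Plain-gradient saliency: |d max-logit / d input|, summed over channels."""
    image = image.clone().requires_grad_(True)
    model(image.unsqueeze(0)).max().backward()
    return image.grad.abs().sum(dim=0)

def make_model():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

torch.manual_seed(0)
trained = make_model()      # stand-in: load real trained weights here
randomized = make_model()   # same architecture, freshly initialized weights

img = torch.rand(3, 32, 32)
s_trained = gradient_saliency(trained, img).flatten().detach().numpy()
s_random = gradient_saliency(randomized, img).flatten().detach().numpy()

# if the two maps stay highly correlated, the method is insensitive to model parameters
rho, _ = spearmanr(s_trained, s_random)
print(f"rank correlation between trained and randomized saliency: {rho:.3f}")
```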
Contributions as stated in the paper:
The best current models, including the non-convolutional MLP-Mixer and Vision Transformers, are well calibrated compared to past models and their performance is more robust to distribution shift.
In-distribution calibration slightly decreases as model size increases, but this is outweighed by a simultaneous improvement in accuracy.
Under distribution shift, calibration improves with model size, reversing the trend seen in-distribution.
Accuracy and calibration are correlated under distribution shift, such that optimizing for accuracy may also benefit calibration.
Model size, pretraining duration, and pretraining dataset size cannot fully explain differences in calibration properties between model families.
Ensembles have been shown to improve calibration and robustness
Methods: BatchEnsemble, MC-Dropout, Deep Ensembles
Data augmentation methods have been shown to improve calibration and robustness
Methods: Mixup, AugMix
However, naively combining ensembles + data augmentation does not further improve calibration and robustness, and can even hurt calibration
Propose CAMixup - dynamically apply Mixup only to the classes on which the model tends to be overconfident during training
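A minimal sketch of plain Mixup on a batch, the ingredient CAMixup applies selectively; the alpha value is illustrative, and the class-selection rule from the note is only summarized in the comment.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, seed=0):
    """Convexly combine each example (and its one-hot label) with a randomly
    chosen partner from the same batch."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

# CAMixup (per the note above) would apply this only to classes where the
# model's confidence currently exceeds its accuracy, re-checked during training.
```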
They propose a simple, general method to accurately estimate test error
Train two networks with the same hyperparameters and the same dataset, but different random seeds
Evaluate these networks on a fresh, new, unlabeled dataset
Find the difference in the predicted labels. This approximates test error
Why does this happen?
Ensembles of independently and stochastically trained models tend to be well-calibrated (i.e. deep ensembles)
They show that if a stochastic learning algorithm leads to an ensemble that is well-calibrated, then the ensemble satisfies the Generalization Disagreement Equality in expectation (basically, the ensemble satisfies the conditions to measure test error using their method)
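A tiny sketch of the disagreement-based estimate described above: train two runs with different seeds, predict on fresh unlabeled data, and take the disagreement rate as the test-error proxy (the predicted labels here are made up).

```python
import numpy as np

def estimated_test_error(preds_a, preds_b):
    """Disagreement rate between two runs of the same architecture and
    hyperparameters trained from different seeds, on fresh unlabeled data."""
    return np.mean(np.asarray(preds_a) != np.asarray(preds_b))

preds_seed_1 = np.array([0, 1, 1, 2, 0, 1])
preds_seed_2 = np.array([0, 1, 2, 2, 0, 0])
print(estimated_test_error(preds_seed_1, preds_seed_2))  # 0.333... ~ estimated test error
```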
Popular ML datasets contain an average of 3.4% label errors
They propose a method to correct test set labels
Estimate label noise matrix using the confident learning (CL) framework outlined in Northcutt et al., 2021
Correct label errors that CL identifies with human validation
Due to label errors, many existing ML benchmarks are unstable
Neural networks are NOT robust under distribution shift
Training with more diverse data tends to make models more robust (the models that performed the best had the most data)
Existing methods for synthetic distribution shift are NOT predictive of natural distribution shifts
Best performing models in order of accuracy: efficientnet-l2-noisystudent, FixResNeXt101_32x48d_v2, FixResNeXt101_32x48d, instagram-resnext101_32x48d, efficientnet-b8-advprop-autoaug
Unsupervised Learning
improvement over t-SNE that "arguably preserves more of the global structure with superior run time performance"
how (theory)? first, approximate a manifold on which the data is assumed to lie & then construct a fuzzy simplicial set representation of the approximated manifold
simplicial sets can be thought of as higher-dimensional directed graphs
optimize layout of data in low dimensional space to minimize error between two representations
how (computation)? use weighted k-neighbor graph to represent a manifold. iteratively update graph layout using attractive forces on edges and repulsive forces on vertices (referred to as a force directed graph layout).
guaranteed to converge in a fashion similar to simulated annealing where forces are slowly decreased
hyperparameters
number of neighbors n - trade off between fine/large scale manifold features. smaller ensures detailed structure is captured at the expense of losing a "big picture" view
min-dist - controls how closely points can be packed together in the low dim. representation. smaller values result in densely packed regions with a truer representation of the structure. (can be thought of as an aesthetic parameter)
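A minimal usage sketch with the umap-learn package (assumed installed); the data and hyperparameter values are placeholders.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for real high-dimensional data

# fewer neighbors -> finer local structure; larger min_dist -> looser packing
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                      random_state=42).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```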
t-SNE is a method for exploring high-dimensional data that creates 2D maps
performs a non-linear transformation on high-dimensional data to create a lower-dimensional representation
hyperparameters
perplexity - a "guess" about the number of close neighbors each point has (should usually be smaller than # points)
cluster sizes in t-SNE do not mean anything as the algorithm distorts distances (expands dense clusters, shrinks sparse ones)
distance between clusters in t-SNE does not necessarily mean anything (and may vary with different perplexity values)
"looking at multiple perplexity values gives the most complete picture", so look at more than one plot
anomalies are "loners" -- they are in the minority and have different feature values than inliers --> anomalies are closer to the root of the Isolation Tree (iTree)
"builds an ensemble of iTrees for a given data set, then anomalies are those instances which have short average path lengths on the iTrees"
hyperparameters.
t, the number of trees to build (ensemble size). default t = 100.
ψ, the sub-sampling size (training data size). default ψ = 256.
train. recursively partition training set using random partitions until points are isolated or tree height limit is reached
eval. path length h(x) = # edges from root → terminating node. anomaly score is derived from E[h(x)] across the t trees (shorter average path → more anomalous)
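A minimal scikit-learn sketch mapping the paper's t and ψ onto n_estimators and max_samples; the toy data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(500, 2))
outliers = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([inliers, outliers])

# n_estimators = t (default 100), max_samples = psi (default 256 in the paper)
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)
labels = forest.predict(X)        # -1 = anomaly, 1 = inlier
scores = forest.score_samples(X)  # lower (more negative) = more anomalous
print((labels == -1).sum(), scores.min())
```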
Applied Deep Learning
search initially had a manual scoring function, later replaced with gradient boosted decision tree
test new models online using an A/B testing framework for statistical significance with both on/offline metrics
"don't be a hero, in the beginning" - simple model to initially deploy to prod, more complex model after initial validation
feature engineering recommendations: normalize inputs to NN (using normal / power law dists) to prevent vanishing gradients, look at the smoothness of the distributions as a method to catch bugs
pipeline: server for retrieval + scoring + log production → process logs into train data using spark → train model → upload model to server for retrieval + scoring
hyperparameters: no dropout as it created too much randomness in the model, initialize weights with Xavier initialization (better than random, faster convergence), Adam for optimizer, arbitrary batch size
Fundamentals
Multiple objective optimization (MOO) - decouple objectives and have multiple models with each optimizing one objective
Training Data
Confident Learning: Estimating Uncertainty in Dataset Labels (Northcutt et al., 2019)
use heuristics to assist in programmatic data labeling (labelFunction)
working with class imbalance
use asymmetric metrics (precision, recall, F1) to compare class-wise
can change loss function
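A small sketch of class-wise (asymmetric) metrics on an imbalanced toy problem; the labels are made up to show how per-class recall exposes what overall accuracy hides. Changing the loss (e.g. class weights) is the other lever noted above.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([0] * 95 + [1] * 5)                       # heavily imbalanced toy labels
y_pred = np.array([0] * 93 + [1] * 2 + [0] * 3 + [1] * 2)   # 95% accurate, misses minority class

# per-class precision / recall / F1 expose the minority class that accuracy hides
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
for cls in (0, 1):
    print(f"class {cls}: precision={prec[cls]:.2f} recall={rec[cls]:.2f} "
          f"f1={f1[cls]:.2f} n={support[cls]}")
```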
Feature Engineering
data augmentations are the "future" of feature engineering
handling missing values: can either drop column or row. avoid filling in missing values
scale features (log scaling) to look like normal distribution (often results in improved performance)
feature crossing: create more features by combining features. blows up the feature space and may lead to overfitting.
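A tiny pandas sketch of log scaling and a categorical feature cross; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 45_000, 120_000, 900_000],  # heavy-tailed feature
    "city": ["sf", "nyc", "sf", "la"],
    "device": ["ios", "android", "ios", "ios"],
})

# log scaling pulls a heavy-tailed feature toward something closer to normal
df["log_income"] = np.log1p(df["income"])

# feature crossing: combine two categoricals into one higher-cardinality feature
df["city_x_device"] = df["city"] + "_" + df["device"]
print(df[["log_income", "city_x_device"]])
```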
Model Development
partitioning: preserve distribution of labels across splits, split data by time (old data is train, newest data split into test/valid), oversample after splitting
process data using statistics computed on train only (see the sketch after this list)
measuring feature importance: XGBoost.get_score, SHAP
ensemble to improve performance (bagging, boosting, stacking)
weaker models with well-tuned hyperparameters can outperform fancier models, so tune hyperparameters then look into NAS
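A small sketch of the time-based split with scaling statistics computed on train only, as referenced in the list above; the data and split sizes are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "ts": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": np.random.default_rng(0).lognormal(size=100),
    "label": np.random.default_rng(1).integers(0, 2, size=100),
}).sort_values("ts")

# split by time: oldest data trains, newest is held out for valid/test
train, rest = df.iloc[:70], df.iloc[70:]
valid, test = rest.iloc[:15], rest.iloc[15:]

# compute scaling statistics on train only, then apply them everywhere
scaler = StandardScaler().fit(train[["feature"]])
train_x = scaler.transform(train[["feature"]])
valid_x = scaler.transform(valid[["feature"]])
test_x = scaler.transform(test[["feature"]])
print(train_x.mean(), valid_x.mean())  # valid mean need not be ~0
```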
Model Evaluation
[bookmark]
Books
The Elements of Investing, Malkiel & Ellis
Save early and regularly
Take advantage of your employer's retirement security plan
Set aside six months of living expenses.
Store in high interest bank deposit or safe money market funds (VMMXX)
Make sure to get the highest interest rate available
Diversify
Avoid credit card debt
Ignore "Mr. Market"
Use low-cost index funds
Acquire common stocks, bonds, and real estate
Dollar cost average
Re-balance portfolio every year
One Up On Wall Street, Lynch
invest in companies, not the stock market
invest in things you know about
the perfect stock. hits on any or all of the following: it's ridiculous, dull, disagreeable, a spinoff, institutions own little of it, analysts don't follow it, surrounded by rumors, depressing, in a no-growth industry, niche, necessary to people, uses tech, bought by insiders, company is buying shares back.
stocks to avoid. hottest stocks in hottest industry, the "next something" (i.e. the next facebook), middlemen, deceptively exciting names.
P/E ratios. avoid stocks with high P/E ratios. try to predict future earnings, or find out how a company will increase its earnings. P/E is used to compare similar companies within the same industry.
balance sheet. cash + marketable securities > long-term debt = very favorable. long term debt > cash = very unfavorable. outstanding shares reduced in the last 10 years = favorable.
favorable metrics.
P/E ratio <= 1/2 * growth rate of earnings (e.g., a P/E of 10 is favorable for a company growing earnings at 20%+ per year)
lower % debt, greater % equity. limited bank and commercial paper debts. greater % funded debt.
do not pay dividends.
cash flow in > cash flow out
sales growth > inventory growth
pension fund assets > vested pension liabilities
highest pretax profit margin within a given industry
low % institutional interest
recheck metrics / story every few months
"If you are looking for tenbaggers the more stocks you own the more likely that one of them will become a tenbagger."
The Peter Lynch Approach to Investing in "Understandable" Stocks By Maria Crawford Scott
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers, Horowitz
peacetime (easy) vs wartime (extremely difficult) CEO
people > product > profit
good product managers are the CEO of the product. responsible for the right product at the right time, and all that entails (make little excuses)
firing. be direct, deliver bad news immediately, be straightforward, address the entire company.
promotions. define a formal process for all promotions. promotions should be leveled across teams / groups. manager submits an employee for review with comments. committee evaluates the review against the level's skill description and other employees at that level.
organizational design. answer the following questions: what knowledge needs to be communicated, and to who? what needs to be decided? prioritize important communication and decision paths. decide who's going to run each group. identify sub-optimal paths, and develop a plan to optimize them. priorities change later? reorganize.
"great CEOs constantly assess whether they are building the best team"
Zero to One: Notes on Startups, or How to Build the Future, Thiel
"the most contrarian thing of all is not to oppose the crowd, but to think for yourself"
successful companies earn monopolies by solving a unique problem and by being so good at what they do that no one can compete
Good to Great: Why Some Companies Make the Leap and Others Don't, Collins
first who, then what: first assemble the right team before tackling other challenges
put your best people on your biggest opportunities, not your biggest problems
establishing a truthful climate. lead with questions, debate, investigate without blame, build red flag mechanisms to speed up action
all great companies share "hardiness" - they use adversity as a defining event to make them stronger. they confront the brutal facts of their current reality and prioritize the highest impact items
Market Research On A Shoestring, Zafar
The Everything Store: Jeff Bezos and the Age of Amazon, Stone
Nonviolent Communication, Rosenberg
Entrepreneurship
IDEAL: This section is used to describe the desired or “to be” state of the process or product. It identifies the goals of the stakeholders and customers as well as assists in defining scope. At large, this section should illustrate what the expected environment would look like once the solution is implemented.
REALITY: This section is used to describe the current or “as is” state of the process or product. It explains the pain points expressed by the stakeholders and customers. It should also include the insights and expertise of the project team and subject matter experts provided during problem analysis.
CONSEQUENCES: This section is used to describe the impacts on the business if the problem is not fixed or improved upon. This includes costs associated with loss of money, time, productivity, competitive advantage, and so forth. The magnitude of these effects will also help determine the priority of the project.
PROPOSAL: This section is used to describe potential solutions. Once the ideal, reality, and consequences sections have been completed, understood, and approved, the project team can start offering options for solving the problem. It can also include suggestions by the stakeholders and customers, although further discussions and research will be needed before a specific course of action can be determined.
There are many problems where good data is difficult to obtain or just outright does not exist.
“If you cannot convince your client that you understand how the algorithm came to the decision it did, how likely are they to trust you and your expertise?”
Interview Questions, Naeem Zafar
Phase One
What is the most pressing issue in this space?
What are the two or three things you worry most about?
How have you tried to solve this problem?
What companies or products have you looked at to solve this problem?
Why haven’t you bought those products or services to fulfill that need?
What were your concerns about those products or services? (price, maturity, availability, etc.)
Where do you normally look when seeking products like these? (this will help us market)
What is a fair price range for such products?
What is your budget?
How has your budget been approved within your company?
Are you looking to spend the money this year, this quarter, this month?
If you were not able to spend money to buy a new solution, what other alternatives have you looked at?
If you did explore these alternatives, why haven’t they worked out?
Who will make this decision to purchase a new product?
What is the purchasing process (in your company or household)?
Who has to approve of the purchase of this amount?
Phase Two
What if a product or service like this existed?
How would you feel if it had some of the following features?
Would you consider implementing it? (questionable)
These are some of the aspects and features of this product. What other features would you like to see?
What features must it have before you consider making a purchase?
What features would be nice, making it more attractive for you?
What is a suitable price for this product or service?
What price would you consider too high, and why?
What price would you consider a bargain, and why?
Where would people buy this product?
How is such a product typically purchased?
When was the last time you looked for such a product?
“Clean data is better than big data”
Applying ML to anything is an iterative process that involves research, trial, and error
Craft questions in the “how might we” format to address any possible design challenges that may arise
“Most AI projects create AI value in one of three ways: reducing costs (automation creates opportunities for cost reduction in almost every industry), increasing revenue (recommendation and prediction systems increase sales and efficiency), or launching new lines of business (AI enables new projects that were not possible before).”
If you remove AI from the company and still have a valuable product, you are on the right track
Why do you believe you’re doing this?
What’s the “first useful thing” you’ll offer to those you wish to serve?
What does the horizon line look like?
How will you know if you’re on the right track?
"Your customers are not buying your products, they are hiring them to get a job done."
Adam Cheyer (founder of Siri)
Have a differentiated idea
Ambitious and for a big market
A magical demonstration
Why is it better than the competition?
Founding Team Skills
Visionary (big idea)
Marketeer (story)
Product (roadmap)
Builder (deliver it)
Hire action-takers who can get the job done
Different departments
Marketing and sales
Product design and development
Accounting and finance
Research and development
Human Resources
Consider investing in a recruiter
The bottom line is, you need to put your team in a position to succeed.
Train them continuously, perform on an individual level, listen
In order for firms to maintain longevity, they should establish smaller sub-organizations that act independently. These organizations are not to be pressured into making a short term profit, but should instead be given a unique identity and allowed to create their market.
Customers follow the "Buying Hierarchy" depending on the maturity of the market. The phases, in order, are: functionality, reliability, convenience, and price.
Launch something bad, quickly
Build something lean that can be done in months
Business Models, Mark Searle
Don't have a business model? Then it's a hobby, not a business
What is a business model?
The way an organization creates value, delivers value, and captures value
A system designed to deliver one or more value propositions to one or more customer segments, in a way that ensures the ongoing survival (and growth) of the delivering organization, allowing it to continue creating more and more value for others and therefore for itself
Methods for capturing business models
Business Model Canvas, Mark Searle
Business Model Canvas (BMC)
Right side of BMC drives the left
Living, breathing document
Test hypotheses
Be honest about what you know, believe, and are guessing
External Focus (right side)
Value Proposition - why will your customers love you?
Customer Segments - who is going to love you and your offering?
Channels - how does (or will) your offering reach your customers (users)?
Customer Relationships - how do you (or will you) get, keep, and grow your customers?
Keeping customers is cheaper than acquiring new ones, so retention is key
Revenue Streams - how and how much will your paying customers pay you?
Internal Focus (left side)
Key Activities - what must you be doing now to execute your (current) business model?
Key Resources - what do you need to perform your Key Activities?
Typically physical, financial, human, intellectual property / intangibles
Key Partners - whom do you need from outside to execute or accelerate your business?
You need to create and deliver value for your partners, not just receive value
Cost Structure - how much does it (or will it) cost to execute your Business Model, and on what are you spending?
start by testing specific hypotheses that you have about a market with Minimum Viable Tests (MVT)
MVT - specific test of an assumption that must be true for the business to succeed
You can’t have 20 insights and be successful — you must have just one
General Overview
List the riskiest assumptions that might lead your business to succeed or fail.
Test your assumptions through MVTs.
Build an initial product to bring all of your insights together and test them with your target customer.
Iterate on that product until you find product market fit
Scale!
MVT Process
Find your value proposition.
Determine the promise of your idea. Why would users want it? What are you promising them?
What are they already trying to do? How can you help them achieve their goals better than they know they can?
List your risky Assumptions.
Why might this not work? What breaks your system?
Do you know enough about your market to know how to sell it and who will buy it? Is there even a go-to-market strategy that can work or is that the most difficult part of this business?
What price are consumers willing to pay relative to the cost for you to deliver your solution?
Test the smallest unit.
Pick your risky assumption and test just one at a time.
Devise a test for that specific assumption.
People embrace structure. It allows teams to accomplish more in small groups.
Building an Org Chart
Leading Startup Teams, Mark Searle
entrepreneurship is a team sport (co-founders, employees, customers, investors, advisors)
leaders: learn, innovate / adapt, teach
what to teach?
what the org is trying to accomplish
why that goal is important
what each person's / team's role is
why that role is important
business functions
strategy (business planning / analysis)
design (UI / UX)
engineering (product development)
selling & marketing (customer development)
project management (organization)
assume that you will face conflict
openly discuss strengths and weaknesses
create regular check ins, set clear achievable goals, set clear roles and functions, establish a strong culture of communication
identify and resolve conflicts early
values. judgement, communication, impact, curiosity, innovation, courage, passion, honesty, selflessness
keeper test managers use: "if my people told me they were leaving in 2 mos for a similar job at a peer company, who would i fight hard to keep?". people who you would not fight hard for should be let go so that a star can be found to fill that role.
measure people by how much, how quickly, and how well they get work done (especially under deadline)
increase employee freedom as we grow to attract and nourish innovative people
increase the % of high performance employees faster than business complexity grows. increase talent density with top of market compensation, freedom to make an impact, demand high performance culture.
occasionally get rid of rules to decrease complexity and remove bad processes
focus on what people get done. no need for a 9-5 day policy. no need to have a vacation policy. let people do whatever they want as long as they perform at a high level.
policies for expensing entertainment, gifts, and travel: "act in Netflix's best interests"
managers should set the appropriate context for employees to make sound decisions. when employees fail, managers should ask "what context did i fail to set?"
highly aligned, loosely coupled. strategy and goals are clear and discussed by team. minimum meetings and high trust in other teams.
pay top of market. what could person get elsewhere? what would we pay for replacement? what would we pay to keep person (if they had a pending offer)?
annual comp review. "rehire" each employee each year. each year the manager aligns their people's pay to the existing market (which differs across areas)
encourage employees to talk about what their pay could be at other firms, and to bring it up with their managers.
no other bs like bonus, retirement match, etc. put that entirely into big salaries
necessary conditions for promotion. job has to be big enough & person has to be a superstar in current role
how to develop people? give them the opportunity to develop themselves, surround themselves with outstanding colleagues, give them big challenges to work on
"high performance people are generally self-improving through experience, observation, introspection, reading, and discussion"
"formalized development is rarely effective so don't even try to do it" (?)
50th percentile startup compensations
Technical roles (Software Engineering, Product Management, Data Science)
Non-technical roles (Chief of Staff, Business Operations, Strategy, Business Development, Marketing, Finance)
Unconventional Startup Lessons (NickFriend)
software companies have 7-10x revenue multiple for acquisition, so $3mm ARR = $21-30mm valuation
don't raise money. bootstrap if possible. grow organically, unless you are in a winner-take-all market and thus need funding.
you're probably not building the next unicorn. VCs are highly incentivized to fund the next unicorn. they need it to survive. they will push you to risk your business's livelihood to become the next unicorn, when maybe that's not the best for your business.
building smaller businesses puts you at an advantage. existing competition is market validation. competition is less fierce (and sometimes incompetent)
accept VC money? you no longer work for yourself. you are taking on an obligation and now work for them.
being a successful entrepreneur will most likely take you more than one attempt
founding team should cover product & sales. should have as low a burn rate as possible.
sell the product before you build it. don't even need to build software. get market validation that your product is useful and identify how much people will pay for it before you invest too heavily in building it.
Resources
Entrepreneurship
Design
Marketing
sales and marketing books (andrew gazdecki)