The learning rate α controls how large a step we take each time we adjust our theta values.

Local minimum versus global minimum: we try different initial values, and each one lands us at either a local minimum or the global minimum.
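
As a rough illustration of the two points above, here is a minimal Python sketch of gradient descent started from several initial values. The function f(θ) = θ⁴ − 3θ² + θ, the learning rate of 0.01, and the starting points are assumptions chosen for the example, not something from the discussion.

```python
def f(theta):
    """Illustrative non-convex function with two minima (an assumption)."""
    return theta**4 - 3 * theta**2 + theta


def grad_f(theta):
    """Derivative of f with respect to theta."""
    return 4 * theta**3 - 6 * theta + 1


def gradient_descent(theta0, alpha=0.01, steps=1000):
    """Repeatedly step against the gradient; alpha is the learning rate."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * grad_f(theta)
    return theta


if __name__ == "__main__":
    # Different starting points can settle in different minima.
    for start in (-2.0, 0.5, 2.0):
        theta = gradient_descent(start)
        print(f"start={start:+.1f} -> theta={theta:+.4f}, f(theta)={f(theta):+.4f}")
```

With these particular choices, starts on the left settle in the deeper minimum and starts on the right settle in the shallower one, which is why trying several initial values helps.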

Why is there no need to change the learning rate?

We shared our ideas and experiences in machine learning:

Regression algorithms and clustering are used a lot. We need to master the basics well.

What's the main advantage of a distributed system? It has a lot of memory (terabytes!). Pick several initial values, pick several dimensions, and run parallel jobs: map, sort, reduce. It's easy to do a quicksort with terabytes of memory across 250 machines.

Is it challenging to do distributed programming? No. You just change your way of thinking a bit: where to put the logic, sort it, read it, implement it in another map. Put data in a grid, change the data, and process the data. One map, one reducer; another map, another reducer; give customers the intermediate data. It's no longer a single thread. It's only limited by our imagination.
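
The map, sort, reduce pattern and the idea of chaining one map/reducer pair into another can be sketched in a single Python process. This is only a toy model of the thinking style, not a real distributed framework; the function names (`map_phase`, `sort_phase`, `reduce_phase`, `run_job`) and the word-count task are hypothetical choices for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def sort_phase(pairs):
    """Sort pairs by key so that equal keys end up next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs, reducer):
    """Group adjacent pairs by key and hand each group to the reducer."""
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


def run_job(records, mapper, reducer):
    """One map, one sort, one reduce."""
    return reduce_phase(sort_phase(map_phase(records, mapper)), reducer)


def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


def swap_mapper(pair):
    word, count = pair
    return [(count, word)]


def collect_reducer(key, values):
    return (key, sorted(values))


if __name__ == "__main__":
    lines = ["a b a", "b c"]
    counts = run_job(lines, word_mapper, sum_reducer)          # first job
    by_count = run_job(counts, swap_mapper, collect_reducer)   # second job
    print(counts)    # intermediate data that could be handed to a customer
    print(by_count)
```

The output of the first job is the intermediate data; feeding it into a second job is the "another map, another reducer" step.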

Reduce dimensionality. Tell customers the cost and ask them to decide: with 3 dimensions, each run takes 2 hours; with 4 dimensions, each run takes 18 hours.

Briefly talked about applying machine learning in these areas: detecting fraudulent insurance claims, information management, speech content summarization and coaching of speakers, real-time translation, neural networks, and topic modeling.

We had 18 people and formed two discussion groups - one had 8 people and the other had 10 people.

2

May 1

We discussed these questions:

How to reduce the number of variables: non-negative matrix factorization.

How was the normal equation $\theta = (X^T X)^{-1} X^T y$ derived?
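
One common way to derive it, sketched here under the assumptions that the cost is the usual squared-error cost over m training examples and that $X^T X$ is invertible, is to set the gradient of the cost to zero and solve for θ:

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m}\,(X\theta - y)^T (X\theta - y) \\
\nabla_\theta J(\theta) &= \frac{1}{m}\,X^T (X\theta - y) = 0 \\
X^T X\,\theta &= X^T y \\
\theta &= (X^T X)^{-1} X^T y
\end{aligned}
```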

The advantage of gradient descent over the normal equation: it can keep adjusting as the data changes over time, and it avoids excessive CPU processing and disk access.
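
A small Python sketch may help compare the two approaches on a one-feature linear regression. The data, the learning rate, and the step count are made-up assumptions; the normal equation is solved by hand for the 2x2 case so that the example needs no libraries.

```python
def normal_equation(xs, ys):
    """Solve (X^T X) theta = X^T y where each row of X is [1, x]."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx  # determinant of the 2x2 matrix X^T X
    if det == 0:
        raise ValueError("X^T X is singular")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


def gradient_descent(xs, ys, alpha=0.05, steps=5000, theta=(0.0, 0.0)):
    """Batch gradient descent on the squared-error cost, starting from theta."""
    m = len(xs)
    theta0, theta1 = theta
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    theta = gradient_descent(xs, ys)
    print("normal equation: ", normal_equation(xs, ys))
    print("gradient descent:", theta)

    # A new observation arrives: gradient descent can continue from the
    # previous theta, while the normal equation is recomputed from scratch.
    xs.append(5.0)
    ys.append(12.0)
    print("after new data:  ", gradient_descent(xs, ys, theta=theta))
```

Both routes should agree closely on this data. The `theta` argument is what lets gradient descent resume from an earlier solution when the data changes.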

We shared our ideas and experiences in machine learning:

Applications to email marketing (emailing customers based on their previous purchase behavior) and vehicle fuel efficiency (firing, fuel, air pressure), especially when a car is on cruise control.

Options trading algorithms. Latent Dirichlet allocation (LDA). Providing recommendations.

Clustering models.

Related classes: Probabilistic Graphical Models (Bayesian networks), Computing for Data Analysis (by Johns Hopkins), Data Analysis (by Johns Hopkins).

8 people joined the discussion.

3

May 8

Discussed the cost function and gradient for classification, and an intuitive understanding of regularization.
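
For reference, here is a minimal Python sketch of a regularized logistic regression cost and gradient. It assumes the common convention that the first column of X is the constant 1 and that the bias term theta[0] is not regularized; the name `cost_and_gradient` and the parameter `lam` (the regularization strength) are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradient(theta, X, y, lam):
    """Return the regularized logistic cost and its gradient.

    X is a list of rows (first entry of each row is 1 for the bias),
    y is a list of 0/1 labels, lam is the regularization strength.
    """
    m = len(y)
    preds = [sigmoid(sum(t * x for t, x in zip(theta, row))) for row in X]

    # Cross-entropy cost, plus a penalty on every parameter except the bias.
    cost = -sum(
        yi * math.log(p) + (1 - yi) * math.log(1 - p)
        for yi, p in zip(y, preds)
    ) / m
    cost += lam / (2 * m) * sum(t * t for t in theta[1:])

    grad = []
    for j in range(len(theta)):
        g = sum((p - yi) * row[j] for p, yi, row in zip(preds, y, X)) / m
        if j > 0:
            g += lam / m * theta[j]  # penalty pulls theta[j] toward zero
        grad.append(g)
    return cost, grad


if __name__ == "__main__":
    X = [[1.0, 1.0], [1.0, -1.0]]
    y = [1, 0]
    print(cost_and_gradient([0.0, 2.0], X, y, lam=0.0))
    print(cost_and_gradient([0.0, 2.0], X, y, lam=1.0))
```

Comparing the two printed results shows the intuition: the penalty raises the cost of large parameters and adds a term to the gradient that pushes them back toward zero.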

6 people joined the discussion.

4

May 15

Discussed why a neural network solves the challenge of too many features in linear regression.

Discussed using a neural network for regression instead of classification.

Discussed whether to add more units to a hidden layer or to add more hidden layers. We should start with one hidden layer and a minimal number of units.

4 people joined the discussion.

5

May 24

Discussed how to understand backpropagation and how to derive its formulas.

7 people joined the discussion.

6

May 31

Discussed precision and recall; brainstormed how to approach several applications.
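
As a quick reminder of the definitions, here is a minimal Python sketch that computes precision and recall from binary labels. The label lists are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0]
    print(precision_recall(y_true, y_pred))
```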

12 people joined the discussion.

7

June 5

Discussed support vector machines, especially kernels.
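
One kernel commonly used with SVMs is the Gaussian (RBF) kernel, which scores how similar two points are. The sketch below is a plain Python version for illustration; the parameter name `sigma` and the sample points are assumptions.

```python
import math


def gaussian_kernel(x1, x2, sigma=1.0):
    """Similarity in (0, 1]: 1 when the points coincide, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2 * sigma**2))


if __name__ == "__main__":
    a, b = [0.0, 0.0], [3.0, 4.0]
    for sigma in (1.0, 5.0, 25.0):
        # A larger sigma makes distant points look more similar.
        print(sigma, gaussian_kernel(a, b, sigma))
```

A smaller sigma makes the similarity fall off faster with distance, which tends to give a more flexible decision boundary.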

4 people joined the discussion.

8

June 18

Discussed how to understand SVM intuitively, clustering, and PCA.

12 people joined the discussion.

9

June 21

Discussed SVM.

Shared experiences using machine learning in the medical field and in airspace.

8 people joined the discussion.

10

June 27

Discussed recommender systems and which libraries are the best.

Watch the short Machine Learning videos and RSVP for the discussion meeting at http://www.meetup.com/Hosting/

If this is your first time, take a look at Logistics

Notes on Programming Exercises

https://www.udacity.com/course/viewer#!/c-cs373/l-48736212/m-48717253