This was a take-home challenge for a data engineering / data science position at a company that handles vast amounts of images. It's a barebones Flask application running in a Docker container hosted on AWS, displaying a simple D3 visualization generated by Vega-Lite via the Altair Python library: http://demo.altair.ryanglambert.com/

There were two parts:
1. Write sample ETL code to transform a pandas DataFrame into a nested, document-style structure suitable for a NoSQL store. The transformation is done here: https://github.com/Ryanglambert/altair_flask/blob/master/altair_flask/data/ETL_to_json.ipynb
2. Make this awesome visualization, which you can see live here: http://demo.altair.ryanglambert.com/

keywords: Flask, Docker, Altair, Vega-lite, AWS
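The ETL step in part 1 could be sketched roughly like this; note the column names and grouping key below are hypothetical stand-ins, not the ones from the notebook:

```python
import pandas as pd

# Hypothetical flat table: one row per (user, image) record.
df = pd.DataFrame({
    "user": ["alice", "alice", "bob"],
    "image_id": [1, 2, 3],
    "tag": ["cat", "dog", "cat"],
})

# Nest the rows under each user so each record can drop into a
# document store (e.g. MongoDB) as one document per user.
nested = [
    {"user": user, "images": grp[["image_id", "tag"]].to_dict("records")}
    for user, grp in df.groupby("user")
]
print(nested[0])
# {'user': 'alice', 'images': [{'image_id': 1, 'tag': 'cat'}, {'image_id': 2, 'tag': 'dog'}]}
```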
This was a fun project I did in March as part of my Machine Learning Nanodegree on Udacity. The results were competitive with the Princeton benchmark (http://3dshapenets.cs.princeton.edu/):

ModelNet10 Accuracy: 93%
ModelNet10 Mean Average Precision: 88%
ModelNet40 Accuracy: 82%
ModelNet40 Mean Average Precision: 70%

So what are Capsule Networks? Capsule Networks are a new network architecture designed to generalize better than CNNs. For an explanation of what's going on, see: https://www.youtube.com/watch?v=pPN8d0E3900&t=7s. This paper was an investigation of their performance for information retrieval of CAD models. Why that would be useful has to do with how capsule networks handle rotational variance; CNNs essentially don't handle it at all. The visualization below is what you get if you modulate one of the capsules by 1.5 standard deviations in either direction; in the paper, this is referred to as "dimensional perturbations". Although the primary goal of achieving rotation invariance on 3D CAD models wasn't achieved, it was fun to play with a new architecture! Give it a read and don't be bashful in the comments!
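As a rough illustration of what a dimensional perturbation does (this is a numpy sketch, not the paper's actual code; the 16-dimensional capsule vectors and the decoder step are stand-ins), sweeping one dimension of a capsule's output by ±1.5 standard deviations might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend capsule outputs: a batch of 16-dimensional pose vectors.
capsules = rng.normal(size=(100, 16))

def perturb(vector, dim, n_steps=11, scale=1.5):
    """Sweep one capsule dimension by +/- `scale` standard deviations."""
    std = capsules[:, dim].std()
    offsets = np.linspace(-scale * std, scale * std, n_steps)
    swept = np.tile(vector, (n_steps, 1))
    swept[:, dim] += offsets
    return swept  # each row would then be fed through the decoder

variants = perturb(capsules[0], dim=3)
print(variants.shape)  # (11, 16)
```

Decoding each row of `variants` produces the strip of reconstructions shown in the visualization, with only one pose dimension varying across the strip.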
I was recently given a take-home challenge from a company whose product is essentially a dashboard of response times for customers. We are given two weeks' worth of mean response times. There are two things we want to know from the data:
1. What and where are the anomalies, if there are any?
2. Is there a significant difference in mean response times between week one and week two?

This first plot already exposes some points that appear to be anomalous, but let's step through this more quantitatively.

Looking For Anomalies

To decide what is an anomaly, I must decide what is normal. To do this I'm going to fit a distribution to the data. For time series data there are better ways of doing this, but as a first quick check it seems reasonable. The biggest reason you might not want to depend on this approach in a more detailed analysis has to do with the assumptions made when fitting a distribution: we assume all of the samples are independent of one another, and since this is time series data, that isn't the case.

This is quite a long tail. Let's see if ignoring / filtering away the outliers can give a sense of how the data is usually distributed. In deciding which distribution to use I inspected both gamma and log-normal; log-normal most resembles this data. Fitting this distribution to the data, we can now use it to declare what is "normal" behavior. Creating a bounded region that contains 99% of the fitted distribution, we can get a visual sense of where the data is anomalous for both week one and week two.

I've demonstrated how to highlight anomalies in your data. Is there a significant difference in means between week one and week two? To answer this question I'm first going to use the distribution above to filter out the anomalies.
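The pipeline described above — fit a log-normal, flag points outside a 99% band, filter them away, then bootstrap the weekly means — can be sketched with scipy. Synthetic data stands in for the real series here, and the parameters are illustrative, not the ones from the notebook:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two weeks of mean response times (ms).
week1 = rng.lognormal(mean=4.00, sigma=0.30, size=1008)
week2 = rng.lognormal(mean=3.95, sigma=0.30, size=1008)
data = np.concatenate([week1, week2])

# Fit a log-normal; fixing loc=0 keeps the support on the positive axis.
shape, loc, scale = stats.lognorm.fit(data, floc=0)
dist = stats.lognorm(shape, loc=loc, scale=scale)

# Central 99% band of the fitted distribution: points outside are flagged.
lo, hi = dist.interval(0.99)
normal1 = week1[(week1 >= lo) & (week1 <= hi)]
normal2 = week2[(week2 >= lo) & (week2 <= hi)]

def bootstrap_means(x, n_boot=2000):
    """Resample with replacement; return the mean of each resample."""
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return x[idx].mean(axis=1)

# If the 95% interval of the difference excludes 0, the weeks likely differ.
diff = bootstrap_means(normal1) - bootstrap_means(normal2)
ci = np.percentile(diff, [2.5, 97.5])
print(f"99% band: [{lo:.1f}, {hi:.1f}] ms")
print(f"bootstrap mean diff 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}] ms")
```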
My reasoning is that the mean is affected by the anomalous behavior, and we are more interested in whether there was a difference in mean response time outside of these anomalies. After filtering away the anomalies, this is what we get if we bootstrap-sample the means of week one and week two. I used bootstrapping because this isn't a normal distribution. It looks like there's a difference. Let's quantify this with a paired t-test to give a p-value.

Conclusions

I've demonstrated a quantitative way to point out anomalies. There also appears to be a significant difference in mean response times from week one to week two (p < .05).

Closing thoughts

The Jupyter notebook that I used to make these plots can be found here:
https://github.com/Ryanglambert/for_candidate/blob/master/response_times_scenario1.ipynb

I welcome your constructive feedback on GitHub or in the comments below! There is much more that can be done when evaluating time series for anomalies. For some interesting and more advanced material on time series anomaly detection:
https://www.youtube.com/watch?v=0PqzukqMcdA
https://www.youtube.com/watch?v=CAvKQHHNmcY&t=293s

(note: PUBmatch.co is down indefinitely; it may come back in the future!)

PUBmatch.co is a tool I thought of while I was at Metis Data Science Bootcamp. You can see the slides from my presentation here. This post has a companion Python notebook if you'd like to follow along: Latent Semantic Indexing Notebook

Latent Semantic Indexing appealed to me because it involves the use of Singular Value Decomposition. Singular Value Decomposition is at the heart of signal processing, and signal processing is involved in so many systems we use that it's the unsung hero of the information age: telephones, radar, non-destructive testing, data compression, and movie special effects. Treating conversation like a signal simplifies the process of extracting meaning. Not buying it? Consider how many different ways you can express the same ideas and concepts. In communication we take our ideas and put together (compose) the appropriate words (frequencies). We say these words to others (raw signals) to communicate concepts. Other people hear these words (raw signals) and decompose them into different meanings (frequencies).

What does it look like? The pictures above illustrate the ability to "decompose" a signal into "principal components". In the example above we're "decomposing" a given signal into two parts. How does this relate to LSI (Latent Semantic Indexing)? Imagine that each one of those signals in "SVD Original" is a sentence. Because we've decomposed these two signals, we can throw one away: we'll throw away the noise and keep the smooth signal.
However, with LSI we're usually filtering out hundreds of thousands of dimensions. The same idea still applies, though: decompose the signal, throw away the parts you don't care about, and keep the ones that have meaning.

Let's do a small example of LSI pseudo-by-hand. I have seven sentences. I'd like to extract the signal, or meaning, from each one of these sentences so I can see how similar they are to one another:

The bacon egg and cheese was there.
Bacon goes well with egg and cheese.
Bacon is not cheese.
There are books about cheese.
I read books about cheese.
I read books.
You read books.

What sort of signal is in there? LSI starts by removing "stop words", then puts the words into a matrix counting the number of occurrences of each word in each document. Now each sentence is represented by the number of occurrences of its words. This is called a "Bag of Words" or "Term Document Matrix". Each document can be imagined as a vector pointing into a hyperspace whose axes are the words themselves. It's called a hyperspace because as soon as you have more than 3 words you can't visualize the space.

Now let's decompose this system into 2 principal components. I use the scipy library to do this: `from scipy.sparse.linalg import svds` then `docs, eigen_roots, terms_T = svds(<the matrix above>, k=2)`. This outputs three matrices: docs, eigen_roots, and terms_T. We're interested in comparing documents, so we will use the docs matrix. In the above matrix we're representing each doc in a two-dimensional space. Since two dimensions are convenient for humans to view, let's visualize these sentences in 2D. The X and Y axes are linear combinations of the preexisting words.
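The whole pseudo-by-hand example can be run end to end with scipy. The stop-word removal below is an aggressive manual pass for illustration, and the vocabulary is just the sorted set of remaining words:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# The seven sentences after an aggressive stop-word pass.
docs = [
    "bacon egg cheese",
    "bacon egg cheese",
    "bacon cheese",
    "books cheese",
    "read books cheese",
    "read books",
    "read books",
]

# Term-document matrix: one row per sentence, one column per word.
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Truncated SVD: keep only the 2 largest singular components.
U, s, Vt = svds(csr_matrix(counts), k=2)
doc_coords = U * s          # each document as a point in 2-D "concept" space
print(vocab)                # ['bacon', 'books', 'cheese', 'egg', 'read']
print(doc_coords.shape)     # (7, 2)
```

Each row of `doc_coords` is one sentence in the reduced space; plotting those rows gives the 2D scatter discussed next.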
You can see the X axis weighs the word "bacon" negatively whereas the Y axis weighs it positively, by roughly the same order of magnitude. This causes sentences that contain the word bacon and sentences that don't to sit further away from one another. Look at the sentences that had the word bacon: they're all pointing in a different direction than the sentences that did not.

What can I do with this though? Now you can check the similarity of sentences within this reduced space you've created. This new space contains the "semantic meaning" we've chosen to care about.

Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors; cosine distance is 1 minus that. You use cosine because it is a function only of the angle between the vectors and thus ignores their magnitudes. This is important. If someone is talking about a Bacon Egg and Cheese McMuffin and you're writing a book about the history of the Bacon Egg and Cheese McMuffin, you'll be able to match the two if you're using cosine similarity, but if you just computed, say, Euclidean distance, the two topics would appear vastly unrelated.

Ok, let's compute the cosine similarity between two of these sentences: "There are books about cheese" and "You read books". For this I'll use `from scipy.spatial.distance import pdist` and `1 - pdist((<There are books about cheese>, <You read books>), metric='cosine')`.

Cosine Similarity = 0.763

We can also look at the entire corpus and see how each sentence compares to all the others. For more reading, or if you're interested in jumping right in and doing this yourself, I suggest you check out: Gensim
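The pairwise check above can be reproduced with scipy; the 2-D coordinates below are illustrative stand-ins, not the actual SVD output from the post, so the similarity value will differ from 0.763:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical 2-D doc coordinates from a truncated SVD.
coords = np.array([
    [-1.2,  0.8],   # "There are books about cheese"
    [-1.0, -0.4],   # "You read books"
])

# pdist with metric="cosine" returns cosine *distance*;
# cosine similarity is 1 minus that.
sim = 1 - pdist(coords, metric="cosine")
print(round(float(sim[0]), 3))
```

For the whole-corpus comparison, `squareform(1 - pdist(all_coords, metric="cosine"))` gives the full sentence-by-sentence similarity matrix.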
Tools used: python, scipy, matplotlib, numpy, jupyter notebook

PUBmatch.co is my passion project, completed while at Metis Data Science Bootcamp. PUBmatch.co makes it easier to parse through the giant open-access database PubMed by letting you input anything from a news article clipping to an email thread. Watch the presentation here: PUBmatch uses a technique called Latent Semantic Indexing to parse through everything ever published on PubMed (48 GB!), finding the research articles most conceptually similar to a given input document. This project is open source. Check it out at www.github.com/ryanglambert/pubmatch
Overview: Stochastic Simulations

Found at edx.org (weeks 3 and 4). The goal is to model virus growth in a patient over time, via Monte Carlo simulation, to understand the behavior of virus growth and its interaction with time and a number of prescriptions. We use a Monte Carlo simulation in this case because we understand the microscopic behavior of each virus (reproduction probabilities, clearance probabilities) but want to extrapolate to a population of viruses.

There are two major classes: Patients and Viruses. Each "Patient" instance holds a list of "Virus" instances that have various probabilities assigned to things like mutation probability, reproduction probability, and clearance probability. Depending on the random numbers generated, the different outcomes (whether to clear, whether to reproduce, whether to mutate) are enacted.

Example: Each virus has a "Clearance Probability". This is the probability that a virus will die at a given iteration. For these tests the Clearance Probability was 5%. We generate a random number in python with:

In [1]: import random
In [2]: random.random() <= .05
Out[2]: False

In this example, since the outcome is False, the virus particle would not be cleared: it survives this iteration. For more reading related to Monte Carlo simulations: https://docs.python.org/2/library/random.html and https://en.wikipedia.org/wiki/Stochastic_simulation

Simulation Comparison: No Treatment, 1 Drug, and 2 Drugs at various times
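All of the simulations compared below share the same per-virus stochastic step described above. As a minimal sketch (the class name, parameters, and density-throttling rule here are illustrative, not the actual assignment code):

```python
import random

class SimpleVirus:
    """Minimal sketch of one virus particle."""
    def __init__(self, repro_prob=0.1, clear_prob=0.05):
        self.repro_prob = repro_prob
        self.clear_prob = clear_prob

    def does_clear(self):
        # True with probability clear_prob: the particle dies this step.
        return random.random() <= self.clear_prob

    def reproduce(self, pop_density):
        # Reproduction is throttled as the population nears its ceiling,
        # which produces the "steady state" plateau seen in the plots.
        if random.random() <= self.repro_prob * (1 - pop_density):
            return SimpleVirus(self.repro_prob, self.clear_prob)
        return None

random.seed(0)
viruses = [SimpleVirus() for _ in range(100)]
survivors = [v for v in viruses if not v.does_clear()]
print(len(survivors))  # roughly 95 of 100 survive a single step
```

A Patient object would loop this step over its virus list each time tick, appending offspring and dropping cleared particles.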
Simulation 1: No Drugs

Simulation 2: Drug @ T = 0, "Ideal Case"

Receiving treatment at T = 0 results in 0 mutations and therefore no resistances developed. This is obviously ideal and unrealistic. Since the default starting resistance to each drug is "false" at T = 0, and since viruses aren't allowed to reproduce in the presence of a drug they're not resistant to, you can see there is no reproduction and all viruses clear by roughly T = 70, with a final population of 0 in all patients.

Simulation 3: Drug @ T = 75
Simulation 4: Drug @ T = 150
Simulation 5: Drug @ T = 300

For prescriptions administered at T = 75, T = 150, and T = 300, there is a marginal difference in the number of patients cured (virus population < 50), although, since T = 300 hadn't reached a steady state, we can't include it in the comparison.

Simulation 6: 1st Drug @ T = 0, 2nd Drug @ T = 150

As expected, this looks exactly like the test with one drug at T = 0. If the patient is cured before T = 100, then a drug at T = 150 should show no difference on either the histogram or the time series charts.

Simulation 7: 1st Drug @ T = 75, 2nd Drug @ T = 150
Simulation 8: 1st Drug @ T = 150, 2nd Drug @ T = 150

For Sim 7 (drugs at T = 75 and T = 150) and Sim 8 (both drugs at T = 150), the results were surprising. I reran these a couple of times to verify that I hadn't mislabeled them. What I expected was for Sim 7 to have more cured patients, since any treatment at all was started earlier. However, Sim 8 had more cures even though both drugs were administered later. I think I have an explanation: when the patient receives Drug 1 at T = 75, the virus population hasn't quite hit the "steady state" ceiling that slows reproduction. Viruses that haven't reached their steady-state limit yet have more opportunities to reproduce, and therefore more opportunities for resistance mutation.
For Sim 8, the drugs are administered at the same time and at a steady-state virus population, which means two things: there are fewer opportunities for resistance mutation, and a virus has two drugs to mutate resistance to simultaneously, giving it an effective probability of 0.0025% (0.5% * 0.5%) of resisting both at the time of drug administration. More analysis would be necessary to characterize this difference in more detail. These two simulations also highlight a key difference in the "Time Window of Administration" that results in a cure (virus population < 50).

Simulation 9: 1st Drug @ T = 150, 2nd Drug @ T = 300

Sim 9 is kind of funny. The drugs are spaced far enough apart that the viruses can mutate against each one independently enough to return to a steady state close to what would have happened with no drugs at all.

Conclusion

For the realistic cases (no drug at T = 0) it appears that taking two drugs simultaneously, instead of one drug or two drugs at different times, allows for a larger window of administration (i.e., not requiring you to catch the virus at an early enough time) and a higher likelihood of cure. The resistant populations in Simulations 3 - 5 and 9 seem to show that it is actually worse to take one drug "not early enough" or to take one or two drugs too far apart; doing so results in mutated viruses that no longer respond to either of the drugs administered.

Thoughts

This simulation doesn't consider things like:
which helps me understand how this is a whole niche industry in itself (https://en.wikipedia.org/wiki/Bioinformatics). I can also see the use for very powerful computers in this kind of analysis, since each simulation run here took roughly 2 minutes on a 2.6 GHz MacBook with 8 GB of RAM. Learning to run these kinds of simulations in the cloud is on my horizon.

Tools used: matplotlib, numpy, Python
Source: https://github.com/Ryanglambert/virus_simulation