
Being a “Data Scientist” Is As Much About IT As It Is Analysis by Carla Gentry, aka @Data_nerd

IBM defines the data scientist as follows: A data scientist represents an evolution from the business or data analyst role.
The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.
Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization. The data scientist role has been described as “part analyst, part artist.”

Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”…

A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.

Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.
IBM hits the nail on the head with the above definition. Having worked with traditional data analysts as well as programmers, developers, architects, scrum masters, and data scientists — I can tell you they don’t all think alike. A data scientist could be a statistician but a statistician may not be completely ready to take on the role of data scientist, and the same goes for all the above titles as well.

Beth Schultz from All Analytics mentioned that we are like jacks of all trades but masters of none; I don’t completely agree with this comment, but I do agree that my ETL skills are not as honed as my analysis skills, for example. My definition of the data scientist includes: knowledge of large databases and clones, slave, master, nodes, schemas, agile, scrum, data cleansing, ETL, SQL and other programming languages, presentation skills, Business Intelligence and Business Optimization — plus the ability to glean actionable insight from data. I could go on and on about what the data scientist needs to be familiar with, but the analysis part has to be mastered knowledge and not just general knowledge. If you want to separate the pretenders from the experienced in this business, ask a few questions about how data science actually works!
When I start working with a new data set (it doesn’t matter how much or what kind), the first question I usually ask is, what kind of servers do you own?

Why would you need to know about the servers to work with data? I ask this question so I will know what kind of load it can handle – is it going to take me 9 hours to process or 15 minutes? How many servers do you have? I ask this because if I have 4 or 5 servers, I can toggle or load balance versus having only 1 that I have to babysit.

What kind of environment will I be working in? I ask this because I need to know if they have a test environment versus a live environment, so I can play without crashing every server in the house and ticking a lot of people off. If you are working with lots of data, lower peak times or low load times are better for live, as compared to test or staging environments where you can “play” without fear. This way, you won’t “bring down the house”.

It’s a good idea for you Chief Marketing Officers (CMOs) to let your data scientist work in the evening hours and/or on weekends, at their homes if applicable. This, of course, requires setting up a VPN connection, and it also depends on how secure the data connections are, as well as how much processing I can do before I crash them – um, I mean, what is the speed and capacity to process? If a dial-up connection is all that’s available, forget it.

As a side note, I’ve crashed many a server in my day – how do you think I learned all this stuff? Back in the Nineties, someone would crash the mainframe at RJKA and we would all head to Einstein’s Deli in Oak Park, IL but today, this might be frowned upon. But I digress, back to more IT related things.

Another handy thing to find out is how the databases are joined. By that I mean, what variables do they have in common (i.e., “primary keys”)? Are the relationships one-to-one, one-to-many, or many-to-many? Why would you ask this? Some programmers (not all, of course) don’t completely understand relational databases, especially when it comes to transactional data and data that needs to be refreshed often. You have to set up a database like you would play chess: think at least three moves ahead.

Additionally, some programmers/developers use too many JOIN statements in their scripts, which cause large numbers of iterations. Since these tend to increase run time and are not very efficient, you don’t want to be linking too many of these babies together and then running complex algorithms or scripts.
Sometimes, it’s better to start from scratch and build your own data source. When writing scripts to extract or refresh data, don’t forget a few key things: normalize, index, and pick your design based on what you know about the data and what is being requested of it.
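The “build your own data source” idea can be sketched in a few lines. This is just an illustration with invented table and column names: instead of chaining JOINs at query time, materialize the join once into a flat table and index the column your scripts will filter on.

```python
import sqlite3

# Hypothetical schema for illustration: orders and reps joined by rep_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, rep_id INTEGER, amount REAL);
    CREATE TABLE reps   (rep_id INTEGER PRIMARY KEY, region TEXT);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 250.0), (2, 11, 75.0), (3, 10, 125.0)])
conn.executemany("INSERT INTO reps VALUES (?, ?)", [(10, "East"), (11, "West")])

# Materialize the join once, rather than re-running it in every script...
conn.execute("""
    CREATE TABLE sales_flat AS
    SELECT o.order_id, o.amount, r.region
    FROM orders o JOIN reps r ON r.rep_id = o.rep_id
""")
# ...and index the column the heavy analysis will filter on.
conn.execute("CREATE INDEX idx_region ON sales_flat(region)")

total_east = conn.execute(
    "SELECT SUM(amount) FROM sales_flat WHERE region = 'East'").fetchone()[0]
print(total_east)  # 375.0
```

The complex algorithms then run against one indexed table instead of a pile of nested JOINs.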

Servers are important, and if dealing with large databases, load balance or toggle whenever possible. Also, star schema versus snowflake schema is important, so please put some serious thought into this. Ask yourself, do I need it fast or efficient? Believe me, I always pick efficient (I am a nerd, after all) but if the client needs it ASAP, then the client shall have it ASAP.

With knowledge of the client’s IT setup from a data management/quality perspective, you’ll be equipped to handle most situations you run into when dealing with data, even if the Architect and Programmer are out sick. Your professional knowledge is going to be a big help in getting the assignment or job complete.

Happy data mining and please play with data responsibly!

About the Author

My client list from the past 20+ years is private, but I have worked with Fortune 100 and 500 companies including, but not limited to, Discover Financial Services, J&J, Hershey, Kraft, Kellogg’s, SCJ, McNeil, Firestone, PBA, Disney, Deloitte, Talent Analytics, Samtec and more.

Acting as a liaison between the IT department and the Executive staff, I am able to take huge complicated databases, decipher business needs and come back with intelligence that quantifies spending, profit and trends. Being called a data nerd is a badge of courage for this curious Mathematician/Economist because knowledge is power and companies are now acknowledging its importance. To find out more about what I do, please visit my profile on LinkedIn

https://www.linkedin.com/in/datanerd13

Asking The Right Questions

Each time I talk to someone about analytics, I ask the same question: “What is your ultimate goal with this project?” Often it is to increase sales or reduce turnover. Of course, this isn’t usually what’s said; initially all I get is a panicked look that says “we can’t get what we want out of our database—it isn’t working right…how do we fix it??”

Typically, there is nothing wrong with the database: the master and all its clones are just as they were designed to be, the variables are entered correctly and the reporting functions are pulling exactly what they were coded and designed to pull.

So what’s the issue?


Every HRIS or ATS database contains different information (employee addresses, phone numbers, salaries, benefits, and the like), but how much of that information is connected? What I mean is: is there a unique identifier that connects each table or database together? If I want to select all the people who might retire in the next 5 years out of a database, complete with demographic, sales and personal information, in order to create an organizational plan, can I accomplish this with my current database? By design, relational databases are just that: “relational.” Therefore, everything should flow, if set up correctly in the very beginning.
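To make the idea concrete, here is a minimal sketch (with invented tables and columns, not any real HRIS schema): when two tables share a unique employee_id, one JOIN answers the retirement question.

```python
import sqlite3

# Hypothetical schema: two tables linked by a shared employee_id key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT,
        phone TEXT
    );
    CREATE TABLE demographics (
        employee_id INTEGER REFERENCES employees(employee_id),
        age INTEGER,
        years_of_service INTEGER
    );
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Ada", "555-0101"), (2, "Ben", "555-0102"), (3, "Cy", "555-0103")])
conn.executemany("INSERT INTO demographics VALUES (?, ?, ?)",
                 [(1, 62, 30), (2, 45, 10), (3, 61, 28)])

# Who is likely to retire (here, age 60+) in the next five years?
rows = conn.execute("""
    SELECT e.name, d.age
    FROM employees e
    JOIN demographics d ON d.employee_id = e.employee_id
    WHERE d.age >= 60
    ORDER BY e.name
""").fetchall()
print(rows)  # [('Ada', 62), ('Cy', 61)]
```

Without that shared key, the same question requires error-prone manual matching — which is exactly the “broken database” complaint.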

Which brings us back to asking the right questions—which might look a bit like these:

  • What is it really that I want to be able to answer with my data collection?
  • Structured vs. unstructured data: am I asking the right questions, and am I giving people choices or offering a space to add comments (be careful with this)? Example: a) agree b) disagree c) unsure. Can I work with unstructured data?
  • Are you offering incentives to employees or candidates to “complete additional info” so as to glean a more complete picture of your customer? (a 5 dollar iTunes or Starbucks card)
  • Do I want to link social data to what I collect from employees? (3rd party sign in via Facebook, Twitter, Google+, etc)
  • Are you including IT in your business meetings?

A gap analysis usually reports on what is missing, but it doesn’t have to be this way (reactive and not proactive). If you ask the right questions initially, your design and results will reflect this. Know what you are trying to accomplish.

If you say “my database is broken,” but what you really mean is “I need to be able to sustain sales throughout the next five years; I’m concerned with increasing what we have in the pipeline,” well, ask for help. Make sure you have the correct data to indicate your sales reps’ selling habits, including seasonality. What data does your sales department have that can help you answer these and many other questions?

Do you have store performance, or other line of business performance data to help form a more three-dimensional view of your candidates or employees? A data-centric view is so valuable, but unless you ask the right questions when collecting your data and setting up your database, you may end up trying to build a predictive model using only name, address and phone number! As Chief Engineer Montgomery “Scotty” Scott of the USS Enterprise might say, “I can’t do it, captain!”

Go well armed on your journey towards predictive analytics and remember to always ask the right questions!

Carla Gentry

Data Scientist at Analytical-Solution.com

Deep learning foundational algorithms

There is a huge difference between what I consider deep learning foundational algorithms (those that power just about every neural network model that has existed ever) and deep learning architectures.

I think this distinction is important because it will help you determine how best to learn both. I would argue the foundational algorithms are more important to start with, and they are a prerequisite for the architecture types.

What do I mean when I’m referring to “foundational algorithms”? These include, but are not limited to, the following:

  • Backpropagation. This algorithm is literally the engine that powers everything that a neural network is. Today there is no deep learning without backpropagation. It’s the elegant algorithm developed by Rumelhart, Hinton, and others back in the 1980s that determines how we train models. For one of the most intuitive explanations of backprop I’ve encountered, check out.
  • Gradient descent. This is a super important algorithm for determining how we update the weights of a neural network. Vanilla gradient descent forms the core of all the fancy other stuff you see in papers, including AdaGrad, RMSprop, Adam, etc., so spend the time to learn it well. As a side note, though gradient descent is extensively used in deep learning, there’s nothing about the algorithm that restricts it to neural networks. In fact, it can be used for many different machine learning models, including linear regression, logistic regression, etc.
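To see how small vanilla gradient descent really is, here is a minimal sketch: fitting y = w·x + b by minimizing mean squared error, using nothing but the core update rule (param -= lr * grad) that fancier optimizers build on. The data and learning rate are made up for the example.

```python
# Toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # The vanilla update rule -- everything else (momentum, Adam, etc.)
    # is a variation on this one line.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges to w ≈ 2.0, b ≈ 1.0
```

Swap the loss and the parameters and the same loop trains logistic regression or, via backpropagation to compute the gradients, a neural network.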

After you’ve got the foundational algorithms down, you can move on to the model architectures already mentioned.

Start with learning feedforward networks, and then you can learn the other two architectures in whatever order makes the most sense for what you are working on.

Finally, there are a few other algorithms used extensively in neural networks which aren’t foundational, but are important to know for practical deep learning application. Learn these after the stuff above:

  • Dropout. If you plan on using regularization for your neural network (and you inevitably will), this is the most important regularization technique. I have basically never built a model that didn’t use dropout.
  • Weight initialization schemes. It turns out when building neural networks, how you initialize your weights is crucial for determining whether or not the model trains successfully. Therefore a number of different heuristics have been developed for initialization that you should learn eventually.
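Dropout in particular is only a few lines. The sketch below shows the common “inverted” variant: at training time, zero each activation with probability p and scale the survivors by 1/(1-p) so the expected value of each unit is unchanged; at inference time the layer is a no-op. This is a toy list-based version, not a framework implementation.

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(activations)  # inference: pass through unchanged
    keep = 1.0 - p
    # Keep each unit with probability (1 - p), scaled by 1/(1 - p)
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
print(out)  # each entry is either 0.0 or the input doubled (scaled by 1/0.5)
print(dropout([1.0, 2.0], training=False))  # [1.0, 2.0] -- inference no-op
```

The 1/(1-p) scaling at training time is what lets you drop the mask entirely at inference.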

Hat tip and great work, Mihail Eric, Deep Learning Researcher | Stanford BS/MS in AI  https://www.quora.com/What-are-the-most-important-deep-learning-algorithms-In-which-order-should-I-learn-them

Should We Be Lowering The Social Media Marketing Bar?

However, after almost a decade of social networking, the gap between the “experts” and the average brand or marketer is widening, so I believe the current path isn’t resolving the complexities faced by marketers and is only serving to perpetuate the massive learning curve. Furthermore, I think that the majority will continue to be left behind after giving up, running out of time and resources, or will keep on trying without realizing the promised results.

BundlePost

Should we Lower the Social Media Bar? Yes, we should. Now let me explain…

In my recent post entitled Top 2015 Social Media Predictions – Disruptive Technologies, I covered one of the important disruption areas to watch this year: General Social Media Marketing. In fact, it was the number one item listed in my 2015 predictions. Specifically, I was referring to making social media easier to implement, get results from and be effective with. The actual prediction was as follows:

“As social media marketing becomes more and more complex, new technology is required to make it easier, regardless of user experience, knowledge or skill. This is a requirement for the industry whose time has come.”

The Problem:

The social media marketing industry is incredibly complex. Marketers, brands and individuals are attending events and classes, reading articles and buying books at a massive pace, trying to understand what to do. At the same time a handful of social media speakers, authors and…

View original post 598 more words

Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset

Hmmm, interesting -> Applying Differential Privacy
So, we’re at a point now where we can agree this data should not have been released in its current form. But this data has been collected, and there is a lot of value in it – ask any urban planner. It would be a shame if it was withheld entirely.

In my previous post, Differential Privacy: The Basics, I provided an introduction to differential privacy by exploring its definition and discussing its relevance in the broader context of public data release. In this post, I shall demonstrate how easily privacy can be breached and then counter this by showing how differential privacy can protect against this attack. I will also present a few other examples of differentially private queries.
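The core mechanism the post demonstrates can be sketched briefly. The example below is my own illustration, not the original post’s code: a counting query has sensitivity 1 (adding or removing one person changes the answer by at most 1), so adding Laplace noise with scale 1/ε gives ε-differential privacy. The trip records here are invented.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5          # uniform on (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=0.5):
    """Answer a counting query with epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    # Sensitivity of a count is 1, so the noise scale is 1/epsilon.
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query: how many trips picked up in a given zone?
trips = [{"zone": "midtown"}] * 120 + [{"zone": "soho"}] * 80
answer = private_count(trips, lambda t: t["zone"] == "midtown")
print(round(answer, 1))  # the true count (120) plus Laplace noise
```

Smaller ε means stronger privacy but noisier answers; aggregate statistics stay useful while any single rider’s presence is masked.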

The Data

There has been a lot of online comment recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details about every taxi ride (yellow cabs) in New York in 2013, including the pickup and drop off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi’s license and medallion numbers. It was obtained via a FOIL (Freedom of Information Law) request earlier this year and has been making waves in the…

View original post 2,314 more words

Hands on with Watson Analytics: Pretty useful when it’s working

If I have one big complaint about Watson Analytics, it’s that it’s still a bit buggy — the tool to download charts as images doesn’t seem to work, for example, and I had to reload multiple pages because of server errors. I’d be pretty upset if I were using the paid version, which allows for more storage and larger files, and experienced the same issues. Adding variables to a view without starting over could be easier, too.

Gigaom

Last month, IBM made available the beta version of its Watson Analytics data analysis service, an offering first announced in September. It’s one of IBM’s only recent forays into anything resembling consumer software, and it’s supposed to make it easy for anyone to analyze data, relying on natural language processing (thus the Watson branding) to drive the query experience.

When the servers running Watson Analytics are working, it actually delivers on that goal.

Analytic power to the people

Because I was impressed that IBM decided to launch a cloud service using the freemium business model — and carrying the Watson branding, no less — I wanted to see firsthand how well Watson Analytics works. So I uploaded a CSV file including data from Crunchbase on all companies categorized as “big data,” and I got to work.

Seems like a good starting point.

Choose one and get results. The little icon in…

View original post 433 more words

The Four Horsemen Of The Cyber Apocalypse

These “Four Horsemen” point us to the components we can expect to see used by hackers in 2015: exploits in unpatchable systems, recycled malware hidden imperceptibly, and human error. Studying these harbingers could very well save us from a potential cyber catastrophe.

How big data got its mojo back

Big data never really went anywhere, but as a business, it did get a little boring over the past couple years.

Gigaom


Big data technologies (and not just Hadoop) proved harder to deploy, harder to use and were a lot more limited in scope than all the hype suggested. Machine learning became the new black as startups infused it into everything, but most often marketing and sales software. So much ink and breath were wasted trying to define (or disprove) the idea of data science, probably because the tools of the trade were still so foreign to most people.

But while the early days of the big data movement hinted at greatness, it’s probably fair to say they didn’t deliver — even if the resulting tools were very useful and very necessary to set the stage for things to come. And, realistically, many companies still haven’t adopted these technologies or these techniques.


Things are changing…

View original post 400 more words

The beginning of the end for email

Perhaps the biggest sign yet of the change at hand comes from Germany, which has called for an “anti-stress regulation” that would, among other things, ban employers from contacting employees after hours. Chancellor Angela Merkel has criticized the law and stopped it from moving forward for now, but German leaders have long been concerned about the growing tendency for technology to allow work to encroach on employees’ private lives.