This article will discuss two separate topics. The first is data lineage: mapping a piece of data from its source to the final data product. The second is simple graphing with networkx. Each section is useful on its own, but I wanted to demonstrate how graphs can be applied in everyday work. You will see that a graph is a tool that can validate the completeness of data lineage. Once you understand the inputs, you can apply the same technique to other problems, such as impact analysis or root cause analysis. So let’s go.
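To make the idea concrete, here is a minimal sketch of a lineage check with networkx. The table names are made up for illustration; in practice you would load edges from your own lineage metadata. Each edge points from a source to its consumer, so a final data product is "complete" if it can be reached from at least one raw source.

```python
# Sketch of lineage validation with networkx (table names are hypothetical).
import networkx as nx

# A lineage map is a directed graph: edges point from source to consumer.
G = nx.DiGraph()
G.add_edges_from([
    ("raw.orders", "staging.orders"),
    ("staging.orders", "mart.sales_report"),
    ("raw.customers", "staging.customers"),
    # note: staging.customers feeds nothing downstream yet
])

sources = ["raw.orders", "raw.customers"]
products = ["mart.sales_report"]

# Check 1: every final product should trace back to at least one source.
for product in products:
    reachable = any(nx.has_path(G, s, product) for s in sources)
    print(product, "has lineage to a source:", reachable)

# Check 2: flag dead ends -- intermediate nodes that feed no data product.
dead_ends = [n for n in G.nodes
             if n not in sources + products
             and not any(nx.has_path(G, n, p) for p in products)]
print("nodes not contributing to any product:", dead_ends)
```

The same reachability queries, run in the opposite direction, give you impact analysis (what breaks downstream if this table changes?) and root cause analysis (which upstream sources feed this broken report?).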
Much is available out there…
Without a doubt, consumer behavior has shifted dramatically in 2020. We will be hard-pressed to find a model that predicted the global need for N95 facemasks and other PPE. PPE supply models will need to be completely redesigned and retrained. How are your models doing? Like many applied data science practitioners, you might find that some of the model results haven’t been ‘typical’ in recent months. Start 2021 by reviewing your models and making an action plan.
Obviously, consumer demand forecasting models fail when there are drastic shifts in actual consumer demand. Outliers are always problematic. 2020 has brought MANY…
While the AWS Data Wrangler open source code has been available for some time, AWS has announced that SageMaker Studio now has Data Wrangler built right into the interface. No coding required. That is very appealing when building machine learning models in SageMaker. Let’s take a look.
If you want to play along at home, the data is an open crash dataset from NY.
Once you have AWS and SageMaker set up, Data Wrangler has no additional configurations. You launch SageMaker Studio, and Data Wrangler is one of the options. …
One of the most important parts of working with technology and data is to keep learning. Not just courses, but also randomly challenging yourself to learn something completely new.
I decided to use a random number generator to pick from among the top 20 most-downloaded Python packages, according to Python Wheels. I will try to use each package I am ‘assigned.’ My goal is to learn new things, and the bonus is to discover a hidden gem.
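The selection step itself is a one-liner with the standard library. The package names below are placeholders, not the actual Python Wheels rankings; the point is only the draw of distinct ranks without repeats.

```python
# A minimal sketch of the selection step: draw ranks at random from a
# top-20 list, then look up the package at each rank. Names are placeholders.
import random

top_20 = [f"package_{i}" for i in range(1, 21)]  # stand-in for the real list

random.seed(42)                        # seed only so the draw is repeatable
picks = random.sample(range(20), k=3)  # three distinct ranks, no repeats
assigned = [top_20[i] for i in picks]
print(assigned)
```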
This package can evaluate a block of text and try to determine if there are any characters found. …
About ten years ago, a family friend had an interesting problem. He was a medical examiner and was responsible for cleaning up spreadsheets of data for reporting. He wanted to aggregate statistics by key columns such as primary and secondary causes of death. The issue he had was that these entries were hand-typed, and variations in format and spelling were sending him over the edge. He was spending hours each month cleaning up all of these datasets. Not one to waste his time, he wanted a solution.
At the time, I only knew of rules-based classification to assist. These days…
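One modern alternative to hand-written rules is fuzzy string matching, which collapses hand-typed variations onto a canonical label. Here is an illustration using the standard library's difflib; the cause-of-death entries and the 0.8 threshold are made up for the example, not taken from the original spreadsheets.

```python
# Illustration: collapsing hand-typed variations with fuzzy matching.
# difflib ships with Python; entries and threshold are made up for the demo.
from difflib import SequenceMatcher

canonical = ["myocardial infarction", "pneumonia", "sepsis"]

def best_match(entry, choices, threshold=0.8):
    """Return the closest canonical label, or None if nothing is close."""
    scored = [(SequenceMatcher(None, entry.lower(), c).ratio(), c)
              for c in choices]
    score, label = max(scored)
    return label if score >= threshold else None

print(best_match("Myocardial Infarctoin", canonical))   # typo still matches
print(best_match("pnuemonia", canonical))               # transposition matches
print(best_match("motor vehicle accident", canonical))  # no close match
```

Grouping by `best_match(entry, canonical)` instead of the raw text would have turned those hours of monthly cleanup into a single pass, with only the unmatched `None` entries needing human review.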
I enjoy seeking out new tools that help me create insights and new data products. I use a mix of code and no-code tools for ease of use and speed. When I’m working on a personal project or freelance contracts, I focus on tools that do not require expensive contracts. I recently provided a walk-through of the affordable AWS DataBrew. This past week, I was reviewing the 2020 Gartner Magic Quadrant for “Data Quality Solutions.” I discovered Talend provides a free, open-source, no-code data prep tool. Most companies focus on their enterprise offerings, which are often pricey and beyond…
Vendors everywhere, please don’t tell me your machine learning (or better yet, ‘AI’) model is 99% accurate. Don’t tell me that all we need to do is ‘plug in your data,’ and some shiny magic happens. I am not the only one. Recently, 31 scientists challenged a study by Google Health because not enough evidence was made public to judge the credibility of the proposed AI models. We all need to do our part to build trustworthy models that power our advancing applications. Being informed is the first step in mitigating model risk.
When examining internally created models or…
On Nov 11, 2020, AWS announced the release of Glue DataBrew. Focused on data prep, it provides over 250 functions to assist. I love all things data, so I checked it out. There are two main types of features, Data Profile and Transformation. I will walk through the steps for setting up and using both of these features.
Cost is always a consideration: $1.00 per interactive session and $0.48 per node-hour for transformation jobs. The best part is that it is entirely pay-as-you-go, which makes the tool accessible and affordable for everyone, even the casual enthusiast.
Word on the street is that Parler is the Twitter for the Conservative crowd in response to anger over Facebook and Twitter moderation. Free speech! No more left-wing censorship! One man’s conspiracy theory is another man’s truth. Always interested in human nature, I signed up for an account.
Out of a combination of sincere interest and snarky curiosity, I watched the number of new accounts explode in recent weeks. Under similar circumstances, any social media site gaining a million-plus new accounts would struggle with the site traffic, data volumes, and sudden diversity in how the site is used. …
When marketing effectiveness analytics are being developed, the content’s reading level is often overlooked. Should this be a part of the analysis or your model? If you decide it is, how would you easily tag your different documents or creatives with their current reading level? Today I will review how to use Python to do reading-level analysis and create a report. We will focus on English as the primary language. The same technique can be used to determine the readability of your other content, such as Medium blogs.
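As a taste of what such an analysis looks like, here is a from-scratch sketch of the Flesch-Kincaid grade level, one of the standard English readability formulas. The syllable counter is a rough vowel-group heuristic and the sample sentences are invented; in real work a library such as textstat handles the edge cases more carefully.

```python
# Minimal reading-level sketch: Flesch-Kincaid grade level from scratch.
# The syllable heuristic is approximate; dedicated libraries do this better.
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith("le"):
        count -= 1
    return max(count, 1)

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

simple = "The cat sat on the mat. The dog ran fast."
dense = ("Quantitative readability assessment facilitates systematic "
         "evaluation of marketing communication effectiveness.")
print(round(fk_grade(simple), 1))
print(round(fk_grade(dense), 1))
```

Running this over each document or creative gives you a numeric grade-level column you can feed straight into the marketing effectiveness analysis.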
Whether you want to educate your customers on the benefits of your new…
Data enthusiast, fallible human. A data scientist with a background in both psychology and IT; public speaker in the areas of data, career, and ethics.