How to create Data Lineage mappings and verify by visualizing using networkx

Image by Gerd Altmann from Pixabay

This article will discuss two separate topics. The first is data lineage — mapping a piece of data from its source to the final data product. The other topic is simple graphing with networkx. Each section is useful on its own, but I wanted to demonstrate how one can apply graphs in everyday work. You will see that a graph is a tool that can validate the completeness of data lineage. Once you understand the inputs, you can apply the technique to other concepts, such as impact analysis or root cause analysis. So let’s go.
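To make the idea concrete, here is a minimal sketch of lineage validation with networkx. The table names and mappings are made up for illustration; in practice they would come from your lineage documentation or metadata store.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Made-up source-to-target mappings: each tuple reads "upstream feeds downstream".
mappings = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "mart.customer_dim"),
    ("erp.orders", "staging.orders"),
    ("staging.orders", "mart.sales_fact"),
    ("mart.customer_dim", "report.monthly_revenue"),
    # note: mart.sales_fact never feeds a report, the kind of gap we want to catch
]

G = nx.DiGraph()
G.add_edges_from(mappings)

# Nodes with no outgoing edges are endpoints; anything here that is not an
# intended final data product is an incomplete mapping.
print("endpoints:", [n for n in G.nodes if G.out_degree(n) == 0])

# Nodes with no incoming edges should all be known source systems.
print("sources:  ", [n for n in G.nodes if G.in_degree(n) == 0])

# Visual sanity check of the lineage graph.
nx.draw_networkx(G, node_size=1500, font_size=8)
plt.show()
```

Listing the dead-end and source nodes is the quick completeness check; the drawing is the visual confirmation.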

We all need good Data Lineage.

Much is available out there…


Opinion

Identify the problems and make changes in 2021

Image by moritz320 from Pixabay

Without a doubt, consumer behavior has shifted dramatically in 2020. We will be hard-pressed to find a model that predicted the global need for N95 facemasks and other PPE. PPE supply models will need to be completely redesigned and retrained. How are your models doing? Like many applied data science practitioners, you might find that some of the model results haven't been 'typical' in recent months. Start 2021 by reviewing your models and making an action plan.

Why is there a concern?

Obviously, consumer demand forecasting models fail when there are drastic shifts in actual consumer demand. Outliers are always problematic. 2020 has brought MANY…


Categorical Encoding, Target Leakage Identification, and Feature Importance are readily available as no-code capabilities.

Image by Gerd Altmann from Pixabay

While the AWS Data Wrangler open source code has been available for some time, AWS has announced that SageMaker Studio now has Data Wrangler built right into the interface. No coding required. That is very appealing when building machine learning models in SageMaker. Let’s take a look.
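For anyone who prefers the open-source package mentioned above, here is a minimal sketch of pulling a file straight from S3 with `awswrangler`. The bucket and key are placeholders; the Studio feature itself needs no code at all.

```python
import awswrangler as wr

# Placeholder S3 path -- point this at your own object.
df = wr.s3.read_csv("s3://my-bucket/ny-crash-data/rows.csv")

# A couple of quick checks before moving into the Studio interface.
print(df.shape)
print(df.isna().mean().sort_values(ascending=False).head())
```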

The Setup

If you want to play along at home, the data is an open crash dataset from NY.

Once you have AWS and SageMaker set up, Data Wrangler has no additional configurations. You launch SageMaker Studio, and Data Wrangler is one of the options. …


Be sure your sentiment and text analytics are actually processing characters in your target language

Image by Free-Photos from Pixabay

One of the most important parts of working with technology and data is to keep learning. Not just courses, but also randomly challenging yourself to learn something completely new.

I decided to use a random number generator to pick from the top 20 Python packages by downloads, as ranked by Python Wheels. I will try to use each package I am 'assigned.' My goal is to learn new things, and the bonus is to discover a hidden gem.
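The "assignment" itself takes only a couple of lines. The list below is a stand-in; the real names and ordering come from the Python Wheels top-20 page.

```python
import random

# Stand-in list; the real ranking comes from the Python Wheels top-20 page.
top_packages = [
    "urllib3", "six", "botocore", "requests", "certifi", "idna",
    "python-dateutil", "s3transfer", "setuptools", "chardet",
    "pyyaml", "pip", "wheel", "rsa", "pyasn1", "jmespath",
    "docutils", "colorama", "numpy", "awscli",
]

position = random.randrange(len(top_packages))  # 0-based index
print(f"assigned package: #{position + 1} {top_packages[position]}")
```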

Random Pick: #10 chardet 3.0.4, the Universal Character Encoding Detector

This package can evaluate a block of bytes and try to determine which character encoding it uses. …
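Here is a minimal sketch of how that looks in practice. The sample bytes are made up, and the detected encoding shown in the comment is only an example of what the result looks like.

```python
import chardet

# Bytes of unknown encoding; here, a Spanish phrase encoded as Latin-1.
raw = "año de instalación".encode("latin-1")

result = chardet.detect(raw)
print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': ..., 'language': ...}

# Decode with the detected encoding, falling back to UTF-8 if detection fails.
text = raw.decode(result["encoding"] or "utf-8")
print(text)
```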


Be a workplace Hero! Automate the annoying task of cleaning up the spelling and formatting of categorical text columns with fuzzy matching.

Image by PublicDomainPictures from Pixabay

About ten years ago, a family friend had an interesting problem. He was a medical examiner and was responsible for cleaning up spreadsheets of data for reporting. He wanted to aggregate statistics by key columns such as primary and secondary causes of death. The issue he had was that these entries were hand-typed, and variations in format and spelling were sending him over the edge. He was spending hours each month cleaning up all of these datasets. Not one to waste his time, he wanted a solution.

At the time, I only knew of rules-based classification to assist. These days…
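As a minimal sketch of the idea, here is a standard-library approach using difflib alongside pandas. The column values and canonical spellings below are made up, and a dedicated fuzzy-matching package would work just as well.

```python
from difflib import get_close_matches

import pandas as pd

# Hand-typed variants of the same categories (made-up examples).
df = pd.DataFrame({"primary_cause": [
    "Myocardial Infarction", "myocardial infraction", "MYOCARDIAL INFARCTION ",
    "Pulmonary Embolism", "pulmonary embolsim",
]})

# Canonical spellings to map everything onto.
canonical = ["Myocardial Infarction", "Pulmonary Embolism"]

def clean(value: str) -> str:
    """Normalize case and whitespace, then snap to the closest canonical term."""
    candidate = value.strip().title()
    matches = get_close_matches(candidate, canonical, n=1, cutoff=0.8)
    return matches[0] if matches else candidate

df["primary_cause_clean"] = df["primary_cause"].map(clean)
print(df)
```

With the cleaned column in place, the aggregation that used to take hours becomes a one-line groupby.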


Cut your data prep time without spending extra money

gif by the author

I enjoy seeking out new tools that help me create insights and new data products. I use a mix of code and no-code tools for ease of use and speed. When I'm working on a personal project or a freelance contract, I focus on tools that do not require expensive licenses. I recently provided a walk-through of the affordable AWS DataBrew. This past week, I was reviewing the 2020 Gartner Magic Quadrant for "Data Quality Solutions" and discovered that Talend provides a free, open-source, no-code data prep tool. Most companies focus on their enterprise offerings, which are often pricey and beyond…


…unless you share the details of its goal, design, testing, and metrics

Image by Gerd Altmann from Pixabay

Vendors everywhere, please don't tell me your machine learning (or better yet, 'AI') model is 99% accurate. Don't tell me that all we need to do is 'plug in your data' and some shiny magic happens. I am not the only one who is skeptical. Recently, 31 scientists challenged a study by Google Health because there wasn't sufficient evidence made public to determine the credibility of the proposed AI models. We all need to do our part to build trustworthy models that power our advancing applications. Being informed is the first step in mitigating model risk.

Provide me context

When examining internally created models or…


Getting Started

A walk-through of Data Profiling and Transformation tools

Image by Negative-Space from Pixabay

On Nov 11, 2020, AWS announced the release of Glue DataBrew. Focused on data prep, it provides over 250 functions to assist. I love all things data, so I checked it out. There are two main types of features, Data Profile and Transformation. I will walk through the steps for setting up and using both of these features.

Pricing

Cost is always a consideration: $1.00 per interactive session and $0.48 per node-hour for transformation jobs. The best part is that it is entirely pay-as-you-go, which makes the tool accessible and affordable for everyone, even the casual enthusiast.
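A quick back-of-the-envelope estimate shows how modest that is. The session and job counts below are made up for illustration; only the two prices come from the article.

```python
# Prices from the article; usage numbers below are illustrative assumptions.
interactive_session_price = 1.00   # USD per interactive session
node_hour_price = 0.48             # USD per node-hour for transformation jobs

sessions_per_month = 10
job_runs_per_month = 20
nodes_per_job = 2
hours_per_run = 0.25               # a 15-minute job

interactive_cost = sessions_per_month * interactive_session_price
job_cost = job_runs_per_month * nodes_per_job * hours_per_run * node_hour_price

print(f"interactive: ${interactive_cost:.2f}, jobs: ${job_cost:.2f}, "
      f"total: ${interactive_cost + job_cost:.2f}")
# interactive: $10.00, jobs: $4.80, total: $14.80
```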

Loading your data

From the AWS…


My B.S. alarms went off before I read my first 'Parley'… this startup is in trouble

Image by Amber Avalona from Pixabay

Word on the street is that Parler is the Twitter for the conservative crowd, a response to anger over Facebook and Twitter moderation. Free speech! No more left-wing censorship! One man's conspiracy theory is another man's truth. Always interested in human nature, I signed up for an account.

Growing Pains

Between sincere interest and snarky curiosity, the number of new accounts has exploded in recent weeks. Under similar circumstances, any social media site with a million-plus new accounts would struggle under the site traffic, data volumes, and sudden diversity in how the site is used. …


How to identify reading level scores using Python

Image by Evgeni Tcherkasski from Pixabay

When marketing effectiveness analytics are being developed, the content reading level is often overlooked. Should this be part of the analysis or your model? If you decide it is, how would you easily tag your different documents or creatives with their current reading level? Today I will review how to use Python to do reading level analysis and create a report. We will focus on English as the primary language. The same technique can be used to determine the readability of your other content, such as Medium blogs.
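As one way to do this (the article may use a different library), the `textstat` package exposes the common English readability formulas in a line or two. The sample text below is made up.

```python
import textstat

text = (
    "Whether you want to educate your customers on the benefits of a new "
    "product or simply write a blog post, the reading level of your copy "
    "affects who can follow it."
)

# A few common English readability scores.
print("Flesch Reading Ease:   ", textstat.flesch_reading_ease(text))
print("Flesch-Kincaid Grade:  ", textstat.flesch_kincaid_grade(text))
print("Automated Readability: ", textstat.automated_readability_index(text))
```

Run over a folder of documents, the same three calls become the columns of the readability report.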

Why readability is important

Whether you want to educate your customers on the benefits of your new…

Dawn Moyer

Data Enthusiast, fallible human. A data scientist with a background in both psychology and IT who speaks publicly on data, careers, and ethics.
