Most companies building modern AI face what feels like an impossible, binary choice: either ship a privacy-first model that "kinda lowkey sucks," or ship a high-performing model that likely exposes sensitive personal data. Luckily, there's a third option, and that's what I share with you in this episode!
Check out Tonic Textual here:
https://www.tonic.ai/products/textual
Join 10k+ aspiring data analysts & get my tips in your inbox weekly: https://www.datacareerjumpstart.com/newsletter
Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training: https://www.datacareerjumpstart.com/training
Want to land a data job in less than 90 days? https://www.datacareerjumpstart.com/daa
Ace The Interview with Confidence: https://www.datacareerjumpstart.com/interviewsimulator
TIMESTAMPS
00:00 - Introduction: The Ethical Dilemma in AI Development
01:21 - The "Avery Smith Health Solutions" LEAKED Zoom debate!
02:45 - Sensitive Data Discovery and Synthesis
03:41 - Redacting and Synthesizing Data with Tonic Textual
04:30 - Applications and Benefits
CONNECT WITH AVERY
YouTube Channel: https://www.youtube.com/@averysmith
LinkedIn: https://www.linkedin.com/in/averyjsmith/
Instagram: https://instagram.com/datacareerjumpstart
TikTok: https://www.tiktok.com/@verydata
Website: https://www.datacareerjumpstart.com/
Mentioned in this episode:
The 2026 Cohort of The Data Analytics Accelerator
Ready to land your data job in 2026? We're starting the 2026 Cohort on January 12th. This bootcamp is everything you need to land your first data job. Check out our New Year's sale & bonuses: https://datacareerjumpstart.com/daa
[00:00:00] Avery Smith: In the new era of artificial intelligence, companies face an impossible choice: either build AI that kind of lowkey sucks but keeps your personal data safe, or build AI that's absolutely amazing but exploits your most private information. Ed tech companies could build personalized AI tutors, but they need access to students' data and learning struggles.
[00:00:22] Avery Smith: Healthcare companies could build AI doctors that actually work, but they need every doctor's note ever written about you. Banks, law firms, HR departments, they're all facing this brutal ethical trade-off. But what if I were to tell you that there is a third option that solves this ethical dilemma, but 99% of companies aren't really using it?
[00:00:44] Avery Smith: It's called Tonic Textual, and they're the sponsor of today's episode, but more on them in a bit. First, here's an insider's look at the conversations happening behind closed doors at many healthcare companies right now. You're not gonna hear this from anyone else. The following [00:01:00] internal Zoom recording
[00:01:01] Avery Smith: was leaked from a company called Avery Smith Health Solutions, and that company definitely exists, and it's not just a made-up name that I created like literally 30 seconds ago. Our sources for this leaked file wish to remain completely anonymous, but they go by an alias: Avery, um... Sanchezus. All right, thanks everyone for joining.
[00:01:24] Avery Smith: I'll get straight to it. The investors keep asking us why we don't have an AI doctor. As the CEO, I need to know: why don't we have an AI solution yet? Well, we probably could, but honestly, it probably wouldn't be very good without any real data. Perfect! We have millions of patient records and transcripts of doctor visits, and as a data scientist, I could build all sorts of things with that data.
[00:01:47] Avery Smith: I'll get started right away. Nuh-uh, that's not going to fly with legal. It's not HIPAA compliant. We could get fined millions of dollars. Okay, so what are our options here? I need solutions, [00:02:00] people, not problems. Well, I did hear about this new technology recently that unlocks off-limits data for these types of models.
[00:02:07] Avery Smith: It redacts and synthesizes any sensitive information in the data without losing quality or context. I think it was called Tonic Textual. Oh, so instead of showing their name, their date of birth, phone number, or any other personally identifiable information, it shows a realistic alternative? Well, that could work, right?
[00:02:28] Avery Smith: I'm not going to jail just so you can build a chatbot. Yeah, that should work perfectly. I know, pretty crazy that we were able to obtain that secret Zoom recording. This insider access is the exact reason you're subscribed to this channel, right? Make sure you're subscribed for more in data science. When you have sensitive data, the best solution is to first do what's called sensitive data discovery, and that's the automated process of identifying, locating, and cataloging sensitive or confidential information across an [00:03:00] organization's data sets, files, and all sorts of other places.
[00:03:03] Avery Smith: Then once it's identified, the next step is to perform data synthesis, which is the process of generating artificial but realistic data to replace the sensitive data.
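To make those two steps concrete, here's a toy sketch of the idea in Python. To be clear, this is not how Tonic Textual works under the hood: real tools use trained NER models rather than a couple of regexes, and the Faker-based synthesizers here are purely illustrative.

```python
import re
from faker import Faker

fake = Faker()

# Step 1: sensitive data discovery -- locate PII with simple patterns.
# (Real discovery tools use trained NER models, not regexes.)
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
}

# Step 2: data synthesis -- replace each finding with a realistic fake value.
SYNTHESIZERS = {
    "PHONE": lambda: fake.numerify("###-###-####"),
    "DATE": lambda: fake.date(pattern="%m/%d/%Y"),
    "EMAIL": fake.email,
}

def discover_and_synthesize(text: str) -> str:
    """Find each class of sensitive value and swap it for synthetic data."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(lambda match: SYNTHESIZERS[label](), text)
    return text

note = "Patient DOB 04/12/1987, phone 555-867-5309, email hdemoore@example.com."
print(discover_and_synthesize(note))
# Same note, but the date, phone, and email are now realistic fakes.
```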
[00:03:29] Avery Smith: And synthesize that data consistently so your data doesn't lose quality or context. This allows you and your company to build world-class AI and machine learning models without losing your ethical integrity. So let's go ahead and dive into example. 'cause you know I love examples. So here's an example of a doctor's note.
[00:03:47] Avery Smith: Pretty standard doctor's note with a lot of personal information, right? Name, date of birth, all that good stuff. But you can go ahead and plug that same doctor's note into Tonic Textual and bam, you can see that [00:04:00] all of the PII, otherwise known as personally identifiable information, was automatically found and redacted.
[00:04:06] Avery Smith: But if you wish, you can actually synthesize it, so that Harvey De Moore becomes a different name. Now, obviously this is just one record, but the cool thing about Tonic Textual is they actually have an API that allows you to do this type of data cleansing programmatically with tools like Python.
[00:04:23] Avery Smith: So here's an example of doing that with millions of doctor's notes. It's the exact same process, just done automatically and very quickly.
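Roughly, that programmatic workflow looks like the sketch below, using Tonic Textual's Python SDK (pip install tonic-textual). One caveat: the class and method names here (TextualNer, redact, redacted_text) are based on my reading of the SDK docs and may differ by version, so treat this as an illustration rather than copy-paste code.

```python
# A rough sketch of batch-redacting doctor's notes with the tonic-textual
# Python SDK. Names follow the SDK's documented style but may differ by
# version -- check the current docs before using.
from tonic_textual.redact_api import TextualNer

# Assumes your credentials are configured, e.g. via the
# TONIC_TEXTUAL_API_KEY environment variable.
textual = TextualNer()

notes = [
    "Patient Harvey De Moore, DOB 04/12/1987, reports mild chest pain...",
    # ...imagine millions more notes loaded from your data warehouse
]

clean_notes = []
for note in notes:
    response = textual.redact(note)  # detect, then redact/synthesize PII
    clean_notes.append(response.redacted_text)

# clean_notes now holds the same records with PII swapped for realistic
# stand-ins, safe to feed into an LLM workflow or model-training pipeline.
```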
[00:04:38] Avery Smith: And now you might be thinking, well, that's cool, Avery, but what can I actually do with this data? Yeah, I get that it's private data, I get that we shouldn't give it to LLMs, I get that we shouldn't expose it to the world, but what can we actually do with it? What are the cool things we can do? Well, first, we can build cool things like the AI doctor chatbot that we talked about earlier in the episode, and we can use all this data in that LLM workflow to actually build really cool things.
[00:04:55] Avery Smith: Beyond that, we can use the same data set for all sorts of things, like machine learning: to [00:05:00] train AI models for other initiatives like predictive diagnoses, to boost efficiency, and to support better patient outcomes. AI is only as powerful as the data we give it. So if we feed it personal data, it becomes extremely powerful,
[00:05:15] Avery Smith: but also super dangerous. And remember, it isn't just healthcare companies, the healthcare industry, that needs to worry about this. Every industry is facing the same trade-off. Doing sensitive data discovery, and then redaction and synthesis, allows us data practitioners, data scientists, data analysts, machine learning engineers, to harness the power of that personal data without losing our ethics and exposing our own customers'
[00:05:38] Avery Smith: super-private data. And tools like Tonic Textual make it extremely easy. So if you wanna try Tonic Textual, head to the link in the description to get started for absolutely free. Guys, let's build AI models, let's analyze data, but let's do it ethically. And this is a great start.

