Let’s say you have a codebase X and a codebase Y. X produces results, or implements some logic, that codebase Y needs. How should we get the results (for this exercise, let’s say it’s a JSON file) from X to Y?
- Shared …
Last year Alex Hutchinson, in his generally excellent Sweat Science series, discovered the four-level framework for data analytics:
This isn't news for people who work in data, and yet he …
Didn't hit the mark this year, and likely went out too hot for the first half. Bonus: I either caught COVID during the race, or dampened my immune system enough to let the infection I was already fighting go full force.
DBT forces you to use SQL, even when Python or R transformations are clearer.
DBT makes a lot of sense for 80% of data pipelines, but there are steps that it just isn’t worth forcing into SQL (even when that is possible!). If the code is simpler and easier …
With Pandas ≥ 1.0, the functional API is powerful and should be the new standard.
Someone just pointed me to the pyjanitor package, which I don't actually think is very useful with Pandas >= 1.0, because the functional API for Pandas is quite powerful these days. The examples that the …
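As a rough sketch of what "functional API" means here: the snippet below chains plain Pandas methods (`rename`, `dropna`, `pipe`, `query`) into a single cleaning pipeline. The data, the column names, and the `add_pace` helper are all made up for illustration and are not taken from the original post.

```python
import pandas as pd


def add_pace(df: pd.DataFrame) -> pd.DataFrame:
    """Add a minutes-per-mile column; a hypothetical helper for this sketch."""
    return df.assign(pace_min_per_mile=df["minutes"] / df["miles"])


# Made-up data with messy column names, purely for illustration.
raw = pd.DataFrame(
    {
        "Runner Name": ["a", "b", "c"],
        "Miles": [3.0, 5.0, None],
        "Minutes": [24.0, 45.0, 30.0],
    }
)

clean = (
    raw
    # Normalize column names to snake_case.
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Drop rows that are missing a distance.
    .dropna(subset=["miles"])
    # Plug a custom step into the chain.
    .pipe(add_pace)
    # Filter on the derived column.
    .query("pace_min_per_mile < 9")
    .reset_index(drop=True)
)

print(clean)
```

Each step returns a new DataFrame, so the whole transformation reads top to bottom without intermediate variables, and custom logic slots in through `pipe`.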
There are many possibilities for implementing custom logic, and this framework can help you sort through the options.
I have been really enjoying using DBT to structure data pipelines; the framework requires you to structure your pipelines in a standardized way. The tool meets perfectly in the …
In the fall of 2020, I was curious about Whoop and had been hearing good things from some of the people I follow (Ted King). In the spirit of amateur fitness enthusiasm, and of research adjacent to MassMutual's wearables program, I ordered a 3.0 band in October 2020.
Overall, I have …
Say you have a job that depends on three other jobs having completed. They all run at the same time and are not explicitly part of your pipeline (you can't put them in a single multijob step). You need your job to depend on at least one of …
This post is published on Medium and available as an Rmd notebook.
With dplyr version 1.0, there are new ways that you can write functions.
The "Programming with dplyr" vignette in the docs is the best reference.
If you're familiar with using sym and converting from standard to nonstandard …
Joining in with Nick Symmonds and Ryan Hall's challenge to deadlift 500 pounds and run a mile in under 5 minutes, I was signed up for a marathon and couldn't resist going for a max pull the morning of.
Cleared 225, 275, and 295 for my max lift at 5AM …