
Python: A-Z of data handling with PySpark

 


Data Wrangling with PySpark


I was listening to a seminar on PySpark where the speaker said that the documentation for PySpark on the web is not as well organized as it is for Pandas. My analyst alter ego shouted, "Challenge accepted!"

In one of our earlier blogs, we covered the basics of Python and Pandas in great detail. We have tried to keep the PySpark blog along the same lines, so that one can enjoy learning by analogy, which is how I prefer to learn.

Python: A-Z of data handling with Pandas


Data Wrangling with Pandas


Data wrangling is the practice of converting data from a "raw" form into a user-ready form for descriptive analytics, and of providing input for other kinds of analytics such as predictive analytics.

Pandas is a data-centric package (library) in the Python ecosystem for importing, manipulating, managing and analyzing data. This library was originally built on NumPy, the fundamental library for scientific computing in Python.
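
To make the idea concrete, here is a minimal sketch of one small wrangling pass in Pandas. The records, the column names (`city`, `sales`) and the cleaning steps are all invented for illustration and are not taken from the original posts; a real dataset would need its own rules.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records, invented purely for illustration.
raw = pd.DataFrame({
    "city": [" delhi", "Mumbai ", "delhi", None],
    "sales": ["100", "250", "175", "90"],
})

# Wrangle: drop rows with no city, tidy the text, convert sales to integers.
clean = raw.dropna(subset=["city"]).copy()
clean["city"] = clean["city"].str.strip().str.title()
clean["sales"] = clean["sales"].astype(int)

# Descriptive step: total sales per city.
summary = clean.groupby("city")["sales"].sum()

# The numeric column can be handed to NumPy as a plain ndarray.
sales_array = clean["sales"].to_numpy()

print(summary)
print(type(sales_array), np.mean(sales_array))
```

The last two lines hint at the relationship with NumPy: a numeric Pandas column converts directly to an `ndarray`, so NumPy functions can be applied to it.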