Mito (pronounced my-toe) was born out of our personal experience with spreadsheets, and a previous (failed) spreadsheet version control product.
Spreadsheets were the original killer app for computers, and are the most popular programming language in use worldwide today. That being said, spreadsheets have some growing to do! They don’t handle large datasets well, they don’t lead to repeatable or auditable processes, and they generally disrespect many of the hard-won software engineering principles that we engineers fight for.
More than that, as spreadsheet users run into these problems and turn to Python to solve them, they struggle to use pandas to accomplish what would have been two clicks in a spreadsheet. Pandas is great, but the syntax is not always so obvious (nor is learning to program in the first place!).
Mito is our first step in addressing these problems. Take any dataframe, edit it like a spreadsheet, and generate code that corresponds to those edits. You can then take this Python code and use it in other scripts, send it to your colleagues, or just rerun it.
We’ve been working on Mito for over a year now. Growth has really picked up in the past few months - and we’ve begun working with larger companies to help accelerate their transition to Python.
To any companies who are somewhere in that Python transition process - please do reach out - we would love to see if we can be helpful for all your spreadsheet users!
Feel free to browse my profile for other spreadsheet related thoughts, I’m a bit of a HN junkie. Of course, any and all feedback (positive or negative) is appreciated.
My cofounders and I will be hanging around in the comments. Say hey! :-)
+1 to everything @narush said.
It's important to us that the software we build is empowering to users and not restrictive. This plays out in two primary ways: 1) Since Mito is open source and generates Python code for every edit, Mito doesn't lock users into a 'Mito ecosystem'; instead it helps users interact with the powerful & robust Python ecosystem. 2) Because Mito is an extension to Jupyter Notebooks + JupyterLab, Mito improves your existing workflows instead of completely altering your data analytics stack.
Excited to interact with you all in the comments :)
Last time I checked the code was under a proprietary license.
Edit: I found in another comment below that mito is now available under GPL license here: https://github.com/mito-ds/monorepo/blob/dev/LICENSE
Edit2: Just saw your answer now - thanks for the clarification and links!
bamboolib is very similar to mito (hard to tell who was first).
The advantage is that it runs within Databricks which gives you the ability to scale to any amount of data easily and Databricks has many (and growing) security certifications e.g. HIPAA compliance.
bamboolib can be used in plain Jupyter. Also, the bamboolib private preview within Databricks is about to start in the next few days.
Full disclosure: I am a co-founder of bamboolib and employed by Databricks
Exploring large datasets requires a COMPLETELY different mindset. When your data starts growing, it's impossible to keep it all in a visual format (for 2 reasons[0]) and you have to start thinking analytically. You have to start looking at the statistical values of your data to understand its shape. That's why the `.describe()` and `.info()` methods in Pandas are so useful. After many years doing this, I can "see" the shape of my data just by looking at the statistical information about it (mean, median, std, min, max, etc).
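As a toy illustration of reading that statistical shape (synthetic data, hypothetical column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset
df = pd.DataFrame({
    "price": rng.normal(100, 15, 1_000),
    "quantity": rng.integers(1, 50, 1_000),
})

# Per-column count, mean, std, min, quartiles, max
print(df.describe())

# Dtypes, non-null counts, and memory usage
df.info()
```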
After some time you don't need to rely on visual tools; you can just run a few methods, look at some numbers, and understand all your data. It kinda feels like being the operator in The Matrix who watches the green numbers descend and knows what's going on behind the scenes.
[0] Your eyes are really inefficient at capturing information and there's only so much memory available: try loading a 15GB CSV in Excel.
I find it’s important to actually “touch” the raw data, even if only in a buffered, random-sampling sort of way, to get a feel for it. With big datasets, looking through rows of data can feel tedious and meaningless, but I’ve often picked up on things I wouldn’t have without actually looking at the raw data. Raw data is often flawed, but there’s often some signal in it that tells a story, so it’s important not to lose that by viewing everything through a lens of aggregate statistics.
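One sketch of that buffered, random-sampling approach: stream the file in chunks and keep a handful of raw rows from each (an in-memory buffer stands in for a real file here):

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(100_000))
buf = io.StringIO(csv_text)

# Stream in chunks; keep a small random sample of raw rows from each chunk
samples = []
for chunk in pd.read_csv(buf, chunksize=10_000):
    samples.append(chunk.sample(n=5, random_state=0))

sample_df = pd.concat(samples, ignore_index=True)
print(sample_df)
```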
The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.
- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)
- mito (GPL license)
- dtale (MIT license)
Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.
In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.
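Those first clean-up steps — correcting dtypes, deleting irrelevant columns, handling missing values — look roughly like this in plain pandas (hypothetical columns for illustration):

```python
import pandas as pd

# Hypothetical raw import: numbers parsed as strings, an irrelevant
# column, and missing values
df = pd.DataFrame({
    "amount": ["10", "20", None, "40"],
    "region": ["east", "west", "east", None],
    "internal_id": ["a1", "a2", "a3", "a4"],
})

# 1) Correct the column dtype
df["amount"] = pd.to_numeric(df["amount"])

# 2) Delete columns irrelevant to the analysis
df = df.drop(columns=["internal_id"])

# 3) Filter out / replace missing values
df = df.dropna(subset=["region"])
df["amount"] = df["amount"].fillna(0)

print(df)
```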
You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
Thanks for the feedback!
Also, even with huge datasets I tend to always look at a random sample, and the "most extreme" datapoints -- mainly because in my experience there is a good chance some parts of the data are malformed, and need to be recollected/fixed. Of course, if you trust your data collection you don't need this!
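A minimal sketch of that habit on synthetic data — a fixed-seed random sample plus the most extreme rows, where malformed values tend to surface:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"value": rng.normal(size=1_000_000)})  # stand-in for a big dataset

# A reproducible random slice of the raw rows
print(df.sample(n=10, random_state=42))

# The most extreme datapoints often expose collection errors
print(df.nlargest(5, "value"))
print(df.nsmallest(5, "value"))
```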
Or visualising it in R or pandas without meaningful subsampling.
It allows you to use Altair in Python for visualising data, but does the computation in the backend using Arrow DataFusion. Not for 15GB perhaps, but cool nonetheless.
The aggregate data is around 1.5 million experimental results. MiniTab is too unwieldy and requires too much manual reformatting of the data sheets.
Is this something I should be looking at in R or project Jupyter? Does one make better visualizations than the other?
From time to time you can do a `.head()/.tail()` or an `.iloc[X:Y]` to check some things visually. But just as a "refresher".
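Those spot checks are one-liners:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

print(df.head())       # first 5 rows
print(df.tail(3))      # last 3 rows
print(df.iloc[10:15])  # rows 10 through 14, by position
```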
- https://github.com/quantopian/qgrid
- https://github.com/man-group/dtale
I find that I'm actually a lot faster using basic Pandas methods to get the data I want in exactly the form I want it.
If I really want to show everything, I just use:
```
with pd.option_context('display.max_rows', None):
    print(df)
```

```
def showAllRows(dataframeToShow):
    # display() is available in Jupyter/IPython notebooks
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(dataframeToShow)

# calling it while limiting the number of returned rows
showAllRows(df.head(1000))
```
Be warned though! If you call this function without limiting the number of rows fetched, you are all but guaranteed to crash your machine. Always use head, sample, or slices.
If you do get a crash, then your only option is to open the .ipynb file with vi and manually delete the millions of lines this function created.
Another function that I like is:
```
def showColumns(df, substring):
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

# calling it
showColumns(df, "year")
```
This is useful in dataframes with many columns, when you want to find all the columns whose names contain a specific string. You can then pass the matching names back to the dataframe to print only those columns.
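A self-contained sketch of that pattern, with the function returning its matches so they can feed straight back into the dataframe:

```python
import pandas as pd

def showColumns(df, substring):
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

df = pd.DataFrame({
    "year_start": [2020, 2021],
    "year_end": [2021, 2022],
    "city": ["NYC", "SF"],
})

# Print only the columns whose names contain "year"
print(df[showColumns(df, "year")])
```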
[0]: https://github.com/pandas-profiling/pandas-profiling
[1]: https://bamboolib.com/
We do collect info about app usage, things like which buttons users click. This allows us to focus development time on improving the features that are used most often.
That being said, it's important to us that there is a way to be totally telemetry-free if users don't want any information to leave their computer. Compared to most other cloud-based SaaS data science tools, where you pretty much have no hope of total privacy, we're proud of the flexibility that we offer.
But of course, we're always open to feedback about how we can continue to improve our practices!
Edit: To remove telemetry, just call:
from mitoinstaller.user_install import go_pro; go_pro();
No licensing or payment required, and it doesn't violate the license.

What is your take on that regarding usage inside cloud providers' notebooks like AWS, GCP, Azure, Databricks?
Is it allowed or not allowed by the license? And who should/can control the usage since users can install any kind of Python library in those environments.
And, separately from the maybe ambiguous legal answer: What is your personal intention with the license?
Disclosure: I am employed by Databricks.
Our understanding of our license is evolving - we're first time open source devs, and as I'm sure you know it can be a tricky process. That being said: we totally support Mito users using Mito from notebooks hosted in the cloud!
Currently, we have quite a few users using Mito in notebooks hosted through AWS, GCP, etc. We’re aiming to be good stewards of open source software, and want to see Mito exist wherever it is solving users’ problems!
We’ve had lots of folks in lots of environments request Mito, and are actively working on prioritizing supporting those other environments. We added classic Notebook support last month (funnily, I thought it’d take weeks to support, and it took 2 days lol) - and are looking into VS Code, Streamlit, Dash, and more!
EDIT: due to comment below, I edited this comment for clarity that we 100% support users using Mito from notebooks in the cloud!
Nevertheless, from the user perspective I would love to hear a more clear answer - at least for e.g. the next 6-12 months.
Currently, it seems like you are tolerating usage inside the cloud providers without taking a clear stance. I think this creates fear, uncertainty, doubt and slows down mito adoption within the cloud.
I would appreciate a clear statement in the near future around your thinking on how mito should be made available in those environments. After all, the clouds are an environment to where more and more users are migrating to. Or at least use it in parallel to local setups.
I can understand if you don't want to answer on the spot in case you don't have a clear stance yet. In this case, please take your time and let us know when you made your decision.
Really love what you're doing and the innovation that you are pushing for! <3
As a potential user, this is pretty troubling. I can understand your intentions, but if the license doesn’t match your intentions (and if you don’t completely understand the license), how can we be sure our workflows will be supported/possible in the future?
Hope for the best though - pandas is pretty fantastic.
1. The core Mito product is open source. You can see our GitHub here [1]. We also have a pro version that has some additional, code-visible, but non-open-source features. The way that we think about which features belong in which version of the product is as follows: features that are needed to just get any average analysis done are open source features. On the other hand, features that are specifically useful in an organization -- connecting to company databases, formatting / styling data and graphs for a presentation, etc. -- are pro features. So if you are a team that is relying on our pro features, you're helping support the longevity & progress of Mito. If you are not one of those users and are using the open source version, then you will always have access to Mito (and can even help improve it!). Of course the line between what features are specifically helpful in an organization and what features are needed for an average analysis is a bit blurry, and is a moving target as we continue to expand Mito's offering.
2. Mito is designed specifically to not force users to make a big 'switch'. I've commented this elsewhere in this thread, but just to recap: because Mito is an extension to Jupyter and because we generate Python code for every edit you make, Mito is designed to improve your existing workflow instead of locking you into a new system. Many Mito users use Mito as a starting point! They do as much of their analysis as they can in the Mito spreadsheet and then continue writing more customized Python code to finish up their work.
Not requiring a big switch is nice for the user, and it's nice for Mito too! Lots of large companies have been able to get up and running with Mito in 30 minutes because it fits into their data stack.
Anyways, not that these are the only two reasons you might feel uneasy about adopting Mito, but at least wanted to share why the switch to Mito might be less scary than switching to other tools.
You might want to think about enabling companies to create the company specific extensions themselves e.g. via a plugin API. You might still request them to pay for this version of Mito but they are enabled to extend it with their engineering power instead of relying on you.
We had good experiences with this at bamboolib (I am one of the co-founders) and in addition to recurring license revenue it also increased demand for consulting from our end because the internal company devs started working on plugins and then wanted our direct guidance on how to get the more tricky things to work.
Another tool like Mito is bamboolib: https://bamboolib.8080labs.com/
For the time being, because Mito generates pandas code for every edit you make, you can always use Mito in Jupyter to generate code, and then copy it over to VSCode. Admittedly, it's not as nice a workflow, but it does work!
Here are videos of it in use [1] [2].
And some cool Medium posts too: Mitosheet: enabling collaboration [3], Mito: One of the Coolest Python Libraries You Have Ever Seen [4], and Preparing a dataset for analysis [5].
[1] https://www.youtube.com/watch?v=l2nBO_LkkcQ [2] https://www.youtube.com/watch?v=XAGmSPZsYLU [3] https://medium.com/trymito/mitosheet-empowering-collaboratio... [4] https://towardsdatascience.com/mito-one-of-the-coolest-pytho... [5] https://medium.com/@twelsh37/preparing-a-dataset-for-analysi...
You're right though, there are several tools that fit the general shape of: GUI on top of Jupyter on top of Python. There's a few general vectors to understand these tools by:
1. Excel-ness: Although most (if not all) of these tools incorporate some type of spreadsheet, the interface for interacting with the data in that spreadsheet differs greatly. Some tools, like bamboolib [1] and Datasette [2], resemble Excel only in the spreadsheet. Other tools, like Mito [3], stick to a lot of the other Excel design decisions -- things like having a toolbar with buttons and menu items to access functionality, the ability to write spreadsheet formulas inside of the cell & formula bar, etc. In many ways, this Excel-ness design vector is a proxy for how easy it is to get started with the tool. What we see is that users are able to download Mito and get something useful out of their first analysis, because the interface is one that they are used to!
2. Ownership of your analysis / lack of lock-in: We believe that the most powerful low-code spreadsheet tools allow spreadsheet users to easily transition to full programming languages, if they want to. Instead of locking users into a limited and proprietary product, it's better if users can transition to a full programming language (like Python) very naturally. This transition is super natural in Mito because we generate Python code for every edit that a user makes. So if Mito doesn't support the exact transformation that you want, you can use Mito as a starting point for your analysis and customize the script that Mito generates.
[1] https://bamboolib.8080labs.com/ [2] https://datasette.io/ [3] https://www.trymito.io/
However, please be aware that bamboolib might soon only be available within Databricks notebooks instead of local Jupyter notebooks like mito.
I didn't realize that the "too nice" landing page makes me anxious for open source software :-/
As we've begun to engage with larger teams, we often take features that we build out for their workflow and open source them as well - a few of the teams have been explicit proponents for the open source tool, which is awesome to see.
I'm sure our thinking on this will evolve over time, but we are highly focused on developing just a _great_ piece of open source software. And for folks that need more power, we want to give them the chance to get it - while also supporting Mito's development :)
P.S. Check out our Mito Pro roadmap here: https://www.trymito.io/plans#mito_pro_roadmap. Feedback appreciated!
We're super focused on the open source offering. The vast majority of our users are on the open source version and the vast majority of the features we release are open source! (You can check out our PRs if you're interested in verifying.)
The Mito Pro and Enterprise plans are designed for advanced users and teams. In those versions we provide features that make it easier to collaborate, create presentation-ready materials, and hook up to other company resources.
But we're an open source tool through and through!
Edit: It is GPL by now as seen here https://github.com/mito-ds/monorepo/blob/dev/LICENSE
I already added my disclosure to the following answer [0] in this thread but I was hesitant to add it to every answer.
Do you prefer if I explicitly add my affiliation in every comment that mentions bamboolib? If so, I will try to edit them (if the HN UI still allows me to - I observed that it stops allowing this after some time)
1. Would you want the component to generate code? Or would it just be the editing of a dataframe that is useful to you?
2. What other components would be used in this dashboard? Would love to hear a bit more about the workflow around Mito here.
The more detail you can provide - the more helpful in prioritizing this! I think Mito in streamlit would be ... awesome!
Believe it or not, the last version of this website was even heavier...