Mito (pronounced my-toe) was born out of our personal experience with spreadsheets, and a previous (failed) spreadsheet version control product.
Spreadsheets were the original killer app for computers, and are the most popular programming language in use worldwide today. That being said, spreadsheets have some growing to do! They don’t handle large datasets well, they don’t lead to repeatable or auditable processes, and they generally disrespect many of the hard-won software engineering principles that we engineers fight for.
More than that, as spreadsheet users run into these problems and turn to Python to solve them, they struggle to use pandas to accomplish what would have been two clicks in a spreadsheet. Pandas is great, but the syntax is not always so obvious (nor is learning to program in the first place!).
Mito is our first step in addressing these problems. Take any dataframe, edit it like a spreadsheet, and generate code that corresponds to those edits. You can then take this Python code and use it in other scripts, send it to your colleagues, or just rerun it.
We’ve been working on Mito for over a year now. Growth has really picked up in the past few months - and we’ve begun working with larger companies to help accelerate their transition to Python.
To any companies who are somewhere in that Python transition process - please do reach out - we would love to see if we can be helpful for all your spreadsheet users!
Feel free to browse my profile for other spreadsheet related thoughts, I’m a bit of a HN junkie. Of course, any and all feedback (positive or negative) is appreciated.
My cofounders and I will be hanging around in the comments. Say hey! :-)
+1 to everything @narush said.
It's important to us that the software we build is empowering to users and not restrictive. This plays out in two primary ways: 1) Since Mito is open source and generates Python code for every edit, Mito doesn't lock users into a 'Mito ecosystem'; instead it helps users interact with the powerful & robust Python ecosystem. 2) Because Mito is an extension to Jupyter Notebooks + JupyterLab, Mito improves your existing workflows instead of completely altering your data analytics stack.
Excited to interact with you all in the comments :)
Last time I checked the code was under a proprietary license.
Edit: I found in another comment below that mito is now available under GPL license here: https://github.com/mito-ds/monorepo/blob/dev/LICENSE
Edit2: Just saw your answer now - thanks for the clarification and links!
bamboolib is very similar to mito (hard to tell who was first).
The advantage is that it runs within Databricks which gives you the ability to scale to any amount of data easily and Databricks has many (and growing) security certifications e.g. HIPAA compliance.
bamboolib can be used in plain Jupyter. Also, the bamboolib private preview within Databricks is about to start in the next few days.
Full disclosure: I am a co-founder of bamboolib and employed by Databricks
Exploring large datasets requires a COMPLETELY different mindset. When your data starts growing, it's impossible to keep it all in a visual format (for 2 reasons[0]) and you have to start thinking analytically. You have to start looking at the statistical values of your data to understand its shape. That's why the `.describe()` and `.info()` methods in Pandas are so useful. After many years doing this, I can "see" the shape of my data just by looking at the statistical information about it (mean, median, std, min, max, etc).
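As a toy illustration of reading that statistical shape (synthetic data, hypothetical column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset
df = pd.DataFrame({
    "price": rng.normal(100, 15, 1_000),
    "quantity": rng.integers(1, 50, 1_000),
})

# Per-column count, mean, std, min, quartiles, max
print(df.describe())

# Dtypes, non-null counts, and memory usage
df.info()
```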
After some time you don't need to rely on visual tools; you can just run a few methods, look at some numbers, and understand all your data. It kinda feels like being the operator in The Matrix who watches the green numbers descend and knows what's going on behind the scenes.
[0] Your eyes are really inefficient at capturing information and there's only so much memory available: try loading a 15GB CSV in Excel.
I find it’s important to actually “touch” the raw data, even if only in a buffered, random-sampling sort of way, to get a feel for it. With big datasets, looking through rows of data can feel tedious and meaningless, but I’ve often picked up on things I wouldn’t have without actually looking at the raw data. Raw data is often flawed, but there’s often some signal in it that tells a story, so it’s important not to lose that by viewing everything through a lens of aggregate statistics.
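One sketch of that buffered, random-sampling approach: stream the file in chunks and keep a handful of raw rows from each (an in-memory buffer stands in for a real file here):

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(100_000))
buf = io.StringIO(csv_text)

# Stream in chunks; keep a small random sample of raw rows from each chunk
samples = []
for chunk in pd.read_csv(buf, chunksize=10_000):
    samples.append(chunk.sample(n=5, random_state=0))

sample_df = pd.concat(samples, ignore_index=True)
print(sample_df)
```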
The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.
- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)
- mito (GPL license)
- dtale (MIT license)
Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.
In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.
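Those first clean-up steps — correcting dtypes, deleting irrelevant columns, handling missing values — look roughly like this in plain pandas (hypothetical columns for illustration):

```python
import pandas as pd

# Hypothetical raw import: numbers parsed as strings, an irrelevant
# column, and missing values
df = pd.DataFrame({
    "amount": ["10", "20", None, "40"],
    "region": ["east", "west", "east", None],
    "internal_id": ["a1", "a2", "a3", "a4"],
})

# 1) Correct the column dtype
df["amount"] = pd.to_numeric(df["amount"])

# 2) Delete columns irrelevant to the analysis
df = df.drop(columns=["internal_id"])

# 3) Filter out / replace missing values
df = df.dropna(subset=["region"])
df["amount"] = df["amount"].fillna(0)

print(df)
```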
You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
Thanks for the feedback!
Also, even with huge datasets I tend to always look at a random sample, and the "most extreme" datapoints -- mainly because in my experience there is a good chance some parts of the data are malformed, and need to be recollected/fixed. Of course, if you trust your data collection you don't need this!
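A minimal sketch of that habit on synthetic data — a fixed-seed random sample plus the most extreme rows, where malformed values tend to surface:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"value": rng.normal(size=1_000_000)})  # stand-in for a big dataset

# A reproducible random slice of the raw rows
print(df.sample(n=10, random_state=42))

# The most extreme datapoints often expose collection errors
print(df.nlargest(5, "value"))
print(df.nsmallest(5, "value"))
```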
Or visualising it in R or pandas without meaningful subsampling.
It allows you to use Altair in Python for visualising data, but does the computation in the backend using Arrow DataFusion. Not for 15GB perhaps, but cool nonetheless.
The aggregate data is around 1.5 million experimental results. MiniTab is too unwieldy and requires too much manual reformatting of the data sheets.
Is this something I should be looking at in R or project Jupyter? Does one make better visualizations than the other?
From time to time you can do a `.head()/.tail()` or an `.iloc[X:Y]` to check some things visually. But just as a "refresher".
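Those spot checks are one-liners:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

print(df.head())       # first 5 rows
print(df.tail(3))      # last 3 rows
print(df.iloc[10:15])  # rows 10 through 14, by position
```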
- https://github.com/quantopian/qgrid
- https://github.com/man-group/dtale
I find that I'm actually a lot faster using basic Pandas methods to get the data I want in exactly the form I want it.
If I really want to show everything, I just use:
```
with pd.option_context('display.max_rows', None):
    print(df)
```

```
def showAllRows(dataframeToShow):
    # display() is available in Jupyter/IPython notebooks
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(dataframeToShow)

# calling it while limiting the number of returned rows
showAllRows(df.head(1000))
```
Be warned though! If you call this function without limiting the number of rows fetched, you are all but guaranteed to crash your machine. Always use head, sample, or slices.
If you do get a crash, then your only option is to open the .ipynb file with vi and manually delete the millions of lines this function created.
Another function that I like is:
```
def showColumns(df, substring):
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

# calling it
showColumns(df, "year")
```
This is useful in dataframes with many columns, when you want to find all the columns whose names contain a specific string. You can then pass the matching names back to the dataframe to print only those columns.
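A self-contained sketch of that pattern, with the function returning its matches so they can feed straight back into the dataframe:

```python
import pandas as pd

def showColumns(df, substring):
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

df = pd.DataFrame({
    "year_start": [2020, 2021],
    "year_end": [2021, 2022],
    "city": ["NYC", "SF"],
})

# Print only the columns whose names contain "year"
print(df[showColumns(df, "year")])
```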
[0]: https://github.com/pandas-profiling/pandas-profiling
[1]: https://bamboolib.com/
We do collect info about app usage, things like which buttons users click. This allows us to focus development time on improving the features that are used most often.
That being said, it's important to us that there is a way to be totally telemetry-free if users don't want any information to leave their computer. Compared to most other cloud-based SaaS data science tools, where you pretty much have no hope of total privacy, we're proud of the flexibility that we offer.
But of course, we're always open to feedback about how we can continue to improve our practices!
Edit: To remove telemetry, just call:
from mitoinstaller.user_install import go_pro; go_pro();
No licensing or payment required, and it doesn't violate the license.

What is your take on that regarding usage inside cloud providers' notebooks like AWS, GCP, Azure, Databricks?
Is it allowed or not allowed by the license? And who should/can control the usage since users can install any kind of Python library in those environments.
And, separately from the maybe ambiguous legal answer: What is your personal intention with the license?
Disclosure: I am employed by Databricks.
Our understanding of our license is evolving - we're first time open source devs, and as I'm sure you know it can be a tricky process. That being said: we totally support Mito users using Mito from notebooks hosted in the cloud!
Currently, we have quite a few users using Mito in notebooks hosted through AWS, GCP, etc. We’re aiming to be good stewards of open source software, and want to see Mito exist wherever it is solving users’ problems!
We’ve had lots of folks in lots of environments request Mito, and are actively working on prioritizing supporting those other environments. We added classic Notebook support last month (funnily, I thought it’d take weeks to support, and it took 2 days lol) - and are looking into VS Code, Streamlit, Dash, and more!
EDIT: due to comment below, I edited this comment for clarity that we 100% support users using Mito from notebooks in the cloud!
Nevertheless, from the user perspective I would love to hear a more clear answer - at least for e.g. the next 6-12 months.
Currently, it seems like you are tolerating usage inside the cloud providers without taking a clear stance. I think this creates fear, uncertainty, doubt and slows down mito adoption within the cloud.
I would appreciate a clear statement in the near future around your thinking on how mito should be made available in those environments. After all, the clouds are an environment to where more and more users are migrating to. Or at least use it in parallel to local setups.
I can understand if you don't want to answer on the spot in case you don't have a clear stance yet. In this case, please take your time and let us know when you made your decision.
Really love what you're doing and the innovation that you are pushing for! <3
As a potential user, this is pretty troubling. I can understand your intentions, but if the license doesn’t match your intentions (and if you don’t completely understand the license), how can we be sure our workflows will be supported/possible in the future?
Hope for the best though - pandas is pretty fantastic.
1. The core Mito product is open source. You can see our GitHub here [1]. We also have a pro version that has some additional, code-visible, but non-open-source features. The way that we think about which features belong in which version of the product is as follows: features that are needed to just get any average analysis done are open source features. On the other hand, features that are specifically useful in an organization -- connecting to company databases, formatting / styling data and graphs for a presentation, etc. -- are pro features. So if you are a team that is relying on our pro features, you're helping support the longevity & progress of Mito. If you are not one of those users and are using the open source version, then you will always have access to Mito (and can even help improve it!). Of course the line between what features are specifically helpful in an organization and what features are needed for an average analysis is a bit blurry, and is a moving target as we continue to expand Mito's offering.
2. Mito is designed specifically to not force users to make a big 'switch'. I've commented this elsewhere in this thread, but just to recap: because Mito is an extension to Jupyter and because we generate Python code for every edit you make, Mito is designed to improve your existing workflow instead of locking you into a new system. Many Mito users use Mito as a starting point! They do as much of their analysis as they can in the Mito spreadsheet and then continue writing more customized Python code to finish up their work.
Not requiring a big switch is nice for the user, and it's nice for Mito too! Lots of large companies have been able to get up and running with Mito in 30 minutes because it fits into their data stack.
Anyways, not that these are the only two reasons you might feel uneasy about adopting Mito, but at least wanted to share why the switch to Mito might be less scary than switching to other tools.
You might want to think about enabling companies to create the company specific extensions themselves e.g. via a plugin API. You might still request them to pay for this version of Mito but they are enabled to extend it with their engineering power instead of relying on you.
We had good experiences with this at bamboolib (I am one of the co-founders) and in addition to recurring license revenue it also increased demand for consulting from our end because the internal company devs started working on plugins and then wanted our direct guidance on how to get the more tricky things to work.
Another tool like Mito is bamboolib: https://bamboolib.8080labs.com/
For the time being, because Mito generates pandas code for every edit you make, you can always use Mito in Jupyter to generate code, and then copy it over to VSCode. Admittedly, it's not as nice a workflow, but it does work!
Here are videos of it in use [1] [2].
And some cool Medium posts too: Mitosheet: enabling collaboration [3], Mito: One of the Coolest Python Libraries You Have Ever Seen [4], and Preparing a dataset for analysis [5].
[1] https://www.youtube.com/watch?v=l2nBO_LkkcQ [2] https://www.youtube.com/watch?v=XAGmSPZsYLU [3] https://medium.com/trymito/mitosheet-empowering-collaboratio... [4] https://towardsdatascience.com/mito-one-of-the-coolest-pytho... [5] https://medium.com/@twelsh37/preparing-a-dataset-for-analysi...
You're right though, there are several tools that fit the general shape of: GUI on top of Jupyter on top of Python. There's a few general vectors to understand these tools by:
1. Excel-ness: Although most (if not all) of these tools incorporate some type of spreadsheet, the interface for interacting with the data in that spreadsheet differs greatly. Some tools, like bamboolib [1] and Datasette [2], resemble Excel only in the spreadsheet. Other tools, like Mito [3], stick to a lot of the other Excel design decisions -- things like having a toolbar with buttons and menu items to access functionality, the ability to write spreadsheet formulas inside of the cell & formula bar, etc. In many ways, this Excel-ness design vector is a proxy for how easy it is to get started with the tool. What we see is that users are able to download Mito and get something useful out of their first analysis, because the interface is one that they are used to!
2. Ownership of your analysis / lack of lock-in: We believe that the most powerful low-code spreadsheet tools allow spreadsheet users to easily transition to full programming languages, if they want to. Instead of locking users into a limited and proprietary product, it's better if users can transition to a full programming language (like Python) very naturally. This transition is super natural in Mito because we generate Python code for every edit that a user makes. So if Mito doesn't support the exact transformation that you want, you can use Mito as a starting point for your analysis and customize the script that Mito generates.
[1] https://bamboolib.8080labs.com/ [2] https://datasette.io/ [3] https://www.trymito.io/
However, please be aware that bamboolib might soon only be available within Databricks notebooks instead of local Jupyter notebooks like mito.
I didn't realize that the "too nice" landing page makes me anxious for open source software :-/
As we've begun to engage with larger teams, we often take features that we build out for their workflow and open source them as well - a few of the teams have been explicit proponents for the open source tool, which is awesome to see.
I'm sure our thinking on this will evolve over time, but we are highly focused on developing just a _great_ piece of open source software. And for folks that need more power, we want to give them the chance to get it - while also supporting Mito's development :)
P.S. Check out our Mito Pro roadmap here: https://www.trymito.io/plans#mito_pro_roadmap. Feedback appreciated!
We're super focused on the open source offering. The vast majority of our users are on the open source version and the vast majority of the features we release are open source! (You can check out our PRs if you're interested in verifying.)
The Mito Pro and Enterprise plans are designed for advanced users and teams. In those versions we provide features that make it easier to collaborate, create presentation-ready materials, and hook up to other company resources.
But we're an open source tool through and through!
Edit: It is GPL by now as seen here https://github.com/mito-ds/monorepo/blob/dev/LICENSE
I already added my disclosure to the following answer [0] in this thread but I was hesitant to add it to every answer.
Do you prefer if I explicitly add my affiliation in every comment that mentions bamboolib? If so, I will try to edit them (if the HN UI still allows me to - I observed that it stops allowing this after some time)
1. Would you want the component to generate code? Or would it just be the editing of a dataframe that is useful to you?
2. What other components would be used in this dashboard? Would love to hear a bit more about the workflow around Mito here.
The more detail you can provide - the more helpful in prioritizing this! I think Mito in streamlit would be ... awesome!
Believe it or not, the last version of this website was even heavier...