I specifically work on the Loom Viewer[1], a SPA that we're trying to design in such a way that it will be really easy and cheap for research groups to host and share these loom files themselves. This would make it easy for other groups to ask simple questions about each other's data, and in the worst case the raw loom file is always available for download.
We're already hosting some of our own published datasets with this viewer, you can check it out here[3].
To lower expectations a bit: the viewer is not trying to be comparable to the big atlases like [4] or [5] (I mean, it's being developed by one dude - me - so by comparison it's a no-budget OSS project). It's much simpler and more basic - the idea is that if you use the Loom file format in your pipeline in a sensible manner, the viewer will more or less know what to do with the data.
[0] https://github.com/linnarsson-lab/loompy
[2] https://github.com/linnarsson-lab/loom-viewer
[3] http://loom.linnarssonlab.org/
It's one of the things that I'm most worried about: I've been working in relative isolation for the last year and a half, I lack a background in biology or bioinformatics, and I didn't even have webdev experience when I took on this project (plenty of embarrassing proof of that in the code). Kudos to Sten Linnarsson, the PI of the group and my boss, for taking a gamble and hiring me anyway.
> I really liked the ease with which one's workflow can integrate into the data analysis.
Just to make this clear: the file format is a "dumb" data store, and the viewer a "dumb" plotter of that data. More in-depth analysis requires loading the file in Python, R, or anything else that might support the files in the future. The idea is to then store the results of that analysis as attributes in the file. For example, the tSNE plot here[0] is just pre-calculated x/y data stored as two attributes.
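In Python, the round-trip looks roughly like this. A plain dict stands in for the loom file here, since the exact loompy calls vary by version, and the `_tSNE1`/`_tSNE2` attribute names are just illustrative - this is a sketch of the idea, not the viewer's actual code:

```python
# Sketch of the intended workflow: run an analysis elsewhere, then store
# the per-cell results back into the file as column attributes.
# A dict stands in for the loom file; with loompy you would connect to
# the .loom file and assign real column attributes instead.

def store_embedding(loom, xs, ys):
    """Attach precomputed 2D coordinates (e.g. from tSNE) as column attributes."""
    assert len(xs) == len(ys) == loom["shape"][1]  # one value per cell
    loom["col_attrs"]["_tSNE1"] = list(xs)
    loom["col_attrs"]["_tSNE2"] = list(ys)

# A toy "loom file": a 3-gene x 4-cell matrix plus attribute dicts,
# mirroring the loom layout (matrix + row attributes + column attributes).
loom = {
    "shape": (3, 4),
    "matrix": [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]],
    "row_attrs": {"Gene": ["Actb", "Gad1", "Sox2"]},
    "col_attrs": {"CellID": ["c1", "c2", "c3", "c4"]},
}

# Pretend these x/y values came out of a tSNE run:
store_embedding(loom, xs=[0.1, -2.0, 3.5, 1.2], ys=[4.0, 0.3, -1.1, 2.2])
print(loom["col_attrs"]["_tSNE1"])  # → [0.1, -2.0, 3.5, 1.2]
```

Once the coordinates sit in the file as attributes, the viewer has nothing to compute - it just reads and plots them.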
Currently there is an issue with fully integrating the viewer into such a workflow: for performance reasons, it caches data extracted from the file, and this cache needs to be refreshed manually.
Sten recently added library support for keeping track of file modifications[1]. That enables me to make the viewer automatically refresh stale cache whenever a file is modified, making it even easier to integrate. I'm currently working on that.
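The basic check is simple enough - one way to detect a stale cache (not the viewer's actual code, just the idea) is to compare modification times:

```python
import os
import tempfile
import time

def is_stale(cache_path, source_path):
    """A cached extraction is stale if it's missing or older than its source file."""
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(cache_path) < os.path.getmtime(source_path)

# Demo with two temp files standing in for a loom file and its cache.
with tempfile.TemporaryDirectory() as d:
    source = os.path.join(d, "dataset.loom")
    cache = os.path.join(d, "dataset.cache.json")
    open(source, "w").close()
    open(cache, "w").close()
    past = time.time() - 100
    os.utime(source, (past, past))      # loom file older than its cache
    print(is_stale(cache, source))      # → False
    future = time.time() + 100
    os.utime(source, (future, future))  # loom file modified after caching
    print(is_stale(cache, source))      # → True
```

The library support mentioned above makes this more robust than raw mtimes, but the effect is the same: modified file, invalidated cache.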
There's still a ton of polishing and bug-fixing to do. Feedback, suggestions and help are always welcome!
[0] http://loom.linnarssonlab.org/dataset/cells/Dentate%20gyrus/...
The viewer is a specialised application: it has a server and client. The server extracts (meta)data requested by the client from a loom file, and serves it as JSON. The client then uses this metadata to generate plots. The off-line viewer is actually just running that server locally and opening it on localhost:8003.
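The server's job per request boils down to something like this toy illustration - to be clear, this is not the viewer's actual code or endpoint layout, just the shape of the operation: look up one gene's row and hand it over as JSON.

```python
import json

# Toy version of the server's per-request work: find a gene's expression
# row in the matrix and serialize it for the client. Gene names and
# values are made up; rows = genes, columns = cells.
GENES = ["Actb", "Gad1", "Sox2"]
MATRIX = [[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11]]

def gene_as_json(gene):
    """Return one gene's row as a JSON string, as a response body might look."""
    row = MATRIX[GENES.index(gene)]
    return json.dumps({"gene": gene, "data": row})

print(gene_as_json("Gad1"))  # → {"gene": "Gad1", "data": [4, 5, 6, 7]}
```

The point is that each request only moves one row, which is why the next paragraph's "a dozen genes out of 27k+" scenario works well.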
That makes it better suited for sharing raw data online: most of the time people do not need the full dataset of 27k+ genes, they're only interested in a dozen or so, and this setup makes it easy to fetch just those.
Hosting your own viewer is quite simple:
# this also installs the loom CLI
pip install loom-viewer
# start the server
loom --dataset-path [DATASET_PATH] --server --port [PORT_NUMBER]
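For anything longer-lived than a quick look, you'd keep that command running under a process supervisor. A hypothetical supervisord entry - paths and port are placeholders, adjust to your setup - might look like:

```ini
[program:loom-viewer]
; Hypothetical entry; the command mirrors the CLI invocation above.
command=loom --dataset-path /data/loom-datasets --server --port 8003
autostart=true
autorestart=true
```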
(Well, you probably want to use something like a supervisor script for that, which is what we do, but you get the idea.)

We don't use a database; instead, the server looks for loom files in a dataset folder like this:
[DATASET_PATH]\[PROJECT_FOLDER]\[LOOM FILE]
That means that sharing a loom file is as simple as copying it to the right folder.

This is probably not web-scale or really safe or anything, but we're talking about small labs sharing data with other labs - the risks are different. These viewers will be accessed by a few biologists. Using files in a folder structure keeps things simple enough to set up for the less tech-savvy.
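Concretely, "publishing" a dataset is a mkdir and a cp - the paths below are placeholders, and an empty file stands in for a real loom file:

```shell
# Stand-in demo of the dataset layout:
#   [DATASET_PATH]/[PROJECT_FOLDER]/[LOOM FILE]
DATASET_PATH=/tmp/loom-datasets
mkdir -p "$DATASET_PATH/MyProject"
touch /tmp/cortex.loom                          # stand-in for a real loom file
cp /tmp/cortex.loom "$DATASET_PATH/MyProject/"  # "publishing" is just a copy
ls "$DATASET_PATH/MyProject"                    # → cortex.loom
# The server would then pick it up:
#   loom --dataset-path "$DATASET_PATH" --server --port [PORT_NUMBER]
```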
In theory, a third workflow is also possible: having Jupyter open in one tab to manipulate the loom file, and the viewer in another.
There are three blocking issues for that, however:
- the stale cache problem I mentioned in the other comment,
- single writer/multiple reader support,
- the server needs to be an isolated sub-process due to gevent monkeypatching messing with Jupyter
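The isolation the last point asks for can be done with a plain subprocess, which keeps gevent's monkeypatching confined to the server's interpreter instead of the notebook's kernel. A sketch - a trivial print command stands in for the real `loom --dataset-path ... --server` invocation:

```python
import subprocess
import sys

# Launch the server in its own interpreter, so gevent's monkeypatching
# only affects that process, never the Jupyter kernel. A trivial command
# stands in for the real `loom ... --server` call here.
server = subprocess.Popen(
    [sys.executable, "-c", "print('serving on localhost:8003')"],
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = server.communicate(timeout=30)
print(out.strip())  # → serving on localhost:8003
```

In the real case you'd keep the process alive alongside the notebook and terminate it when done, rather than waiting on it like this.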
The main issue here is a dev team of one person, so... this might take some time.